Model Overview

Model Summary

Vision Transformer (ViT) adapts the Transformer architecture, originally designed for natural language processing, to computer vision. It splits an image into fixed-size patches and processes them as a sequence, much as a language Transformer processes a sentence as a sequence of words. It was introduced in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale".
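To make the "image as a sequence of patches" idea concrete, here is a minimal NumPy sketch of the patch-splitting step only. The helper name `image_to_patches` is hypothetical and is not part of KerasHub. In the actual model, each flattened patch is then passed through a learned linear projection, and a class token and position embeddings are added before the Transformer layers.

```python
import numpy as np


def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # (H, W, C) -> (rows, patch, cols, patch, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    # Group the two grid axes first: (rows, cols, patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, with its pixels flattened into a vector.
    return patches.reshape(rows * cols, patch_size * patch_size * channels)


image = np.random.uniform(0, 1, size=(384, 384, 3))
patches = image_to_patches(image, patch_size=16)
print(patches.shape)  # (576, 768)
```

A 384x384 image with 16x16 patches yields a 24x24 grid, so the sequence has 576 entries, each of length 16 * 16 * 3 = 768.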

Installation

Keras and KerasHub can be installed with:

pip install -U -q keras-hub
pip install -U -q keras

Presets

Model ID	Image size	Top-1 accuracy	Top-5 accuracy	Parameters
Base
vit_base_patch16_224_imagenet	224	-	-	85798656
vit_base_patch16_224_imagenet21k	224	-	-	85798656
vit_base_patch16_384_imagenet	384	-	-	86090496
vit_base_patch32_224_imagenet21k	224	-	-	87455232
vit_base_patch32_384_imagenet	384	-	-	87528192
Large
vit_large_patch16_224_imagenet	224	-	-	303301632
vit_large_patch16_224_imagenet21k	224	-	-	303301632
vit_large_patch16_384_imagenet	384	-	-	303690752
vit_large_patch32_224_imagenet21k	224	-	-	305510400
vit_large_patch32_384_imagenet	384	-	-	305607680
Huge
vit_huge_patch14_224_imagenet21k	224	-	-	630764800
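One way to read the parameter column: the 384-pixel presets are only slightly larger than their 224-pixel counterparts. The sketch below checks whether that gap matches the extra learned position embeddings needed for the longer patch sequence. It assumes the only resolution-dependent weights are the position embeddings, and it uses the hidden sizes from the ViT paper (768 for Base, 1024 for Large). The helper `num_tokens` is illustrative, not a KerasHub function.

```python
def num_tokens(image_size, patch_size):
    """Patch tokens plus one class token for a square image."""
    return (image_size // patch_size) ** 2 + 1


# Hidden sizes from the ViT paper: Base = 768, Large = 1024.
HIDDEN = {"base": 768, "large": 1024}

# Parameter counts copied from the presets table above.
PARAMS = {
    ("base", 16, 224): 85798656,
    ("base", 16, 384): 86090496,
    ("large", 16, 224): 303301632,
    ("large", 16, 384): 303690752,
}

for size in ("base", "large"):
    # Extra sequence positions when moving from 224 to 384 pixels.
    extra_tokens = num_tokens(384, 16) - num_tokens(224, 16)
    expected = extra_tokens * HIDDEN[size]
    actual = PARAMS[(size, 16, 384)] - PARAMS[(size, 16, 224)]
    print(size, extra_tokens, expected, actual)
```

For both sizes the expected and actual differences agree (291840 for Base, 389120 for Large), which is consistent with the assumption above.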

Example Usage

Pretrained ViT model

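import keras_hub
import numpy as np

# Load the pretrained classifier (ImageNet weights, 384x384 inputs).
image_classifier = keras_hub.models.ImageClassifier.from_preset(
    "vit_base_patch16_384_imagenet"
)

# Random batch of two images at the preset's 384x384 resolution.
input_data = np.random.uniform(0, 1, size=(2, 384, 384, 3))
image_classifier(input_data)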

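Load the backbone weights and fine-tune the model on a custom dataset.

import keras_hub

# Placeholder label set; replace with the classes of your own dataset.
CLASSES = ["class_a", "class_b"]

backbone = keras_hub.models.Backbone.from_preset(
    "vit_base_patch16_384_imagenet"
)
preprocessor = keras_hub.models.ViTImageClassifierPreprocessor.from_preset(
    "vit_base_patch16_384_imagenet"
)
# Attach a new classification head sized for the custom label set.
model = keras_hub.models.ViTImageClassifier(
    backbone=backbone,
    num_classes=len(CLASSES),
    preprocessor=preprocessor,
)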

Example Usage with Hugging Face URI

Pretrained ViT model

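import keras_hub
import numpy as np

# Same preset as above, loaded through its Hugging Face URI.
image_classifier = keras_hub.models.ImageClassifier.from_preset(
    "hf://keras/vit_base_patch16_384_imagenet"
)

# Random batch of two images at the preset's 384x384 resolution.
input_data = np.random.uniform(0, 1, size=(2, 384, 384, 3))
image_classifier(input_data)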

Load the backbone weights and fine-tune the model on a custom dataset.

import keras_hub

# Placeholder label set; replace with the classes of your own dataset.
CLASSES = ["class_a", "class_b"]

backbone = keras_hub.models.Backbone.from_preset(
    "hf://keras/vit_base_patch16_384_imagenet"
)
preprocessor = keras_hub.models.ViTImageClassifierPreprocessor.from_preset(
    "hf://keras/vit_base_patch16_384_imagenet"
)
# Attach a new classification head sized for the custom label set.
model = keras_hub.models.ViTImageClassifier(
    backbone=backbone,
    num_classes=len(CLASSES),
    preprocessor=preprocessor,
)

Paper

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (arXiv:2010.11929, published Oct 22, 2020)
