| model_vit | R Documentation |
Vision Transformer (ViT) models implement the architecture proposed in the paper An Image is Worth 16x16 Words. These models are designed for image classification tasks and operate by treating image patches as tokens in a Transformer model.
model_vit_b_16(pretrained = FALSE, progress = TRUE, ...)
model_vit_b_32(pretrained = FALSE, progress = TRUE, ...)
model_vit_l_16(pretrained = FALSE, progress = TRUE, ...)
model_vit_l_32(pretrained = FALSE, progress = TRUE, ...)
model_vit_h_14(pretrained = FALSE, progress = TRUE, ...)
pretrained | (bool): If TRUE, returns a model pre-trained on ImageNet.
progress | (bool): If TRUE, displays a progress bar of the download to stderr.
... | Other parameters passed to the model implementation.
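A minimal usage sketch, assuming the torch and torchvision R packages are installed, that the standard ImageNet preprocessing (224×224 input, ImageNet mean/std) applies, and that "path/to/image.jpg" is replaced with a real file:

    library(torch)
    library(torchvision)

    # Load the pretrained ViT-B/16 and switch to evaluation mode.
    model <- model_vit_b_16(pretrained = TRUE)
    model$eval()

    # Read and preprocess one image (placeholder path) with the usual
    # ImageNet statistics, using torchvision's transform helpers.
    img <- base_loader("path/to/image.jpg")
    img <- transform_to_tensor(img)
    img <- transform_resize(img, c(224, 224))
    img <- transform_normalize(
      img,
      mean = c(0.485, 0.456, 0.406),
      std  = c(0.229, 0.224, 0.225)
    )

    # Add a batch dimension and run a forward pass without tracking gradients.
    pred <- with_no_grad(model(img$unsqueeze(1)))

    # Indices of the five highest-scoring ImageNet classes.
    topk <- torch_topk(pred, k = 5)
    topk[[2]]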
| Model    | Top-1 Acc | Top-5 Acc | Params | GFLOPS | File Size | Weights Used           | Notes                |
|----------|-----------|-----------|--------|--------|-----------|-------------------------|----------------------|
| vit_b_16 | 81.1%     | 95.3%     | 86.6M  | 17.56  | 346 MB    | IMAGENET1K_V1           | Base, 16x16 patches  |
| vit_b_32 | 75.9%     | 92.5%     | 88.2M  | 4.41   | 353 MB    | IMAGENET1K_V1           | Base, 32x32 patches  |
| vit_l_16 | 79.7%     | 94.6%     | 304.3M | 61.55  | 1.22 GB   | IMAGENET1K_V1           | Large, 16x16 patches |
| vit_l_32 | 77.0%     | 93.1%     | 306.5M | 15.38  | 1.23 GB   | IMAGENET1K_V1           | Large, 32x32 patches |
| vit_h_14 | 88.6%     | 98.7%     | 633.5M | 1016.7 | 2.53 GB   | IMAGENET1K_SWAG_E2E_V1  | Huge, 14x14 patches  |
TorchVision Recipe: https://github.com/pytorch/vision/tree/main/references/classification
SWAG Recipe: https://github.com/facebookresearch/SWAG
Weights Selection:
All models except vit_h_14 use the default IMAGENET1K_V1 weights for consistency, stability, and official support from TorchVision.
These are supervised weights trained on ImageNet-1k.
For vit_h_14, the default weights are IMAGENET1K_SWAG_E2E_V1, pretrained on the weakly supervised SWAG dataset and fine-tuned end-to-end on ImageNet-1k.
model_vit_b_16(): ViT-B/16 model (Base, 16×16 patch size)
model_vit_b_32(): ViT-B/32 model (Base, 32×32 patch size)
model_vit_l_16(): ViT-L/16 model (Large, 16×16 patch size)
model_vit_l_32(): ViT-L/32 model (Base, 32×32 patch size)
model_vit_h_14(): ViT-H/14 model (Huge, 14×14 patch size)
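A hedged sketch of constructing an untrained variant for fine-tuning; passing num_classes through ... is an assumption about what the underlying implementation accepts and may differ:

    library(torch)
    library(torchvision)

    # Randomly initialised ViT-B/32 with a hypothetical 10-class head
    # (num_classes forwarded via `...`; assumed, not documented here).
    model <- model_vit_b_32(pretrained = FALSE, num_classes = 10)

    # Forward pass on a dummy batch of two 224x224 RGB images.
    x <- torch_randn(2, 3, 224, 224)
    out <- model(x)
    out$shape  # expected: [2, 10]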
Other classification_model:
model_alexnet(),
model_convnext,
model_efficientnet,
model_efficientnet_v2,
model_facenet,
model_inception_v3(),
model_maxvit(),
model_mobilenet_v2(),
model_mobilenet_v3,
model_resnet,
model_vgg