| model_vit | R Documentation |
Vision Transformer (ViT) models implement the architecture proposed in the paper An Image is Worth 16x16 Words. These models are designed for image classification tasks and operate by treating image patches as tokens in a Transformer model.
model_vit_b_16(pretrained = FALSE, progress = TRUE, ...)
model_vit_b_32(pretrained = FALSE, progress = TRUE, ...)
model_vit_l_16(pretrained = FALSE, progress = TRUE, ...)
model_vit_l_32(pretrained = FALSE, progress = TRUE, ...)
model_vit_h_14(pretrained = FALSE, progress = TRUE, ...)
pretrained | (bool): If TRUE, returns a model pre-trained on ImageNet.
progress | (bool): If TRUE, displays a progress bar of the download to stderr.
... | Other parameters passed to the model implementation.
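A minimal usage sketch, assuming the torch and torchvision R packages are installed, that the standard ImageNet preprocessing (224×224 input, ImageNet mean/std) applies, and that "path/to/image.jpg" is replaced with a real file:

    library(torch)
    library(torchvision)

    # Load the pretrained ViT-B/16 and switch to evaluation mode.
    model <- model_vit_b_16(pretrained = TRUE)
    model$eval()

    # Read and preprocess one image (placeholder path) with the usual
    # ImageNet statistics, using torchvision's transform helpers.
    img <- base_loader("path/to/image.jpg")
    img <- transform_to_tensor(img)
    img <- transform_resize(img, c(224, 224))
    img <- transform_normalize(
      img,
      mean = c(0.485, 0.456, 0.406),
      std  = c(0.229, 0.224, 0.225)
    )

    # Add a batch dimension and run a forward pass without tracking gradients.
    pred <- with_no_grad(model(img$unsqueeze(1)))

    # Indices of the five highest-scoring ImageNet classes.
    topk <- torch_topk(pred, k = 5)
    topk[[2]]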
| Model    | Top-1 Acc | Top-5 Acc | Params | GFLOPS | File Size | Weights Used           | Notes                |
|----------|-----------|-----------|--------|--------|-----------|-------------------------|----------------------|
| vit_b_16 | 81.1%     | 95.3%     | 86.6M  | 17.56  | 346 MB    | IMAGENET1K_V1           | Base, 16x16 patches  |
| vit_b_32 | 75.9%     | 92.5%     | 88.2M  | 4.41   | 353 MB    | IMAGENET1K_V1           | Base, 32x32 patches  |
| vit_l_16 | 79.7%     | 94.6%     | 304.3M | 61.55  | 1.22 GB   | IMAGENET1K_V1           | Large, 16x16 patches |
| vit_l_32 | 77.0%     | 93.1%     | 306.5M | 15.38  | 1.23 GB   | IMAGENET1K_V1           | Large, 32x32 patches |
| vit_h_14 | 88.6%     | 98.7%     | 633.5M | 1016.7 | 2.53 GB   | IMAGENET1K_SWAG_E2E_V1  | Huge, 14x14 patches  |
TorchVision Recipe: https://github.com/pytorch/vision/tree/main/references/classification
SWAG Recipe: https://github.com/facebookresearch/SWAG
Weights Selection:
All models except vit_h_14 use the default IMAGENET1K_V1 weights for consistency, stability, and official support from TorchVision.
These are supervised weights trained on ImageNet-1k.
For vit_h_14, the default weights are IMAGENET1K_SWAG_E2E_V1, pretrained on the weakly supervised SWAG dataset and fine-tuned end-to-end on ImageNet-1k.
model_vit_b_16(): ViT-B/16 model (Base, 16×16 patch size)
model_vit_b_32(): ViT-B/32 model (Base, 32×32 patch size)
model_vit_l_16(): ViT-L/16 model (Large, 16×16 patch size)
model_vit_l_32(): ViT-L/32 model (Base, 32×32 patch size)
model_vit_h_14(): ViT-H/14 model (Huge, 14×14 patch size)
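A hedged sketch of constructing an untrained variant for fine-tuning; passing num_classes through ... is an assumption about what the underlying implementation accepts and may differ:

    library(torch)
    library(torchvision)

    # Randomly initialised ViT-B/32 with a hypothetical 10-class head
    # (num_classes forwarded via `...`; assumed, not documented here).
    model <- model_vit_b_32(pretrained = FALSE, num_classes = 10)

    # Forward pass on a dummy batch of two 224x224 RGB images.
    x <- torch_randn(2, 3, 224, 224)
    out <- model(x)
    out$shape  # expected: [2, 10]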
Other classification_model:
model_alexnet(),
model_convnext,
model_efficientnet,
model_efficientnet_v2,
model_facenet,
model_inception_v3(),
model_maxvit(),
model_mobilenet_v2(),
model_mobilenet_v3,
model_resnet,
model_vgg