Microsoft Research Introduces a General-Purpose Multimodal Foundation Model “BEiT-3” That Achieves State-of-the-Art Transfer Performance on Vision and Vision-Language Tasks

The machine learning community has recently shifted its focus to the convergence of language, vision, and multimodal pretraining. The main goal is to build general-purpose foundation models that can handle multiple modalities and be easily adapted to various downstream tasks. A Microsoft research team recently presented BEiT-3 (Bidirectional Encoder representation from Image Transformers), a state-of-the-art general-purpose multimodal foundation model for vision and vision-language tasks, in the paper “Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks.” The model advances this convergence along three axes: backbone architecture, pretraining task, and model scaling, which together enable its state-of-the-art performance.

The team built their architecture on a shared Multiway Transformer backbone, pretrained on massive unimodal and multimodal data so that it can encode different modalities. Each Multiway Transformer block contains a pool of modality-specific feed-forward expert networks and a shared self-attention module that learns to align modalities and provides deep fusion for multimodal tasks. Within this shared framework, BEiT-3 unifies masked “language” modeling on images, texts, and image-text pairs (treated as “parallel sentences”). During pretraining, the team uses a single masked data modeling objective on both unimodal and multimodal data: text tokens or image patches are randomly masked, and the model is trained to predict the hidden tokens. The multimodal data consists of roughly 21 million image-text pairs and 15 million images collected from several public datasets, while the unimodal data comprises a 160 GB text corpus and 14 million images from ImageNet-21K.
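To make the backbone design concrete, below is a minimal PyTorch sketch of a Multiway-style block: a self-attention layer shared across modalities, followed by modality-specific feed-forward “experts.” The class name, expert names, and dimensions are illustrative assumptions for this summary rather than the authors’ released implementation, which, among other details, routes image and text tokens of a single fused sequence to different experts.

```python
# Minimal sketch of a Multiway Transformer block in the spirit of BEiT-3:
# shared self-attention across modalities plus per-modality feed-forward
# experts. Names and sizes are illustrative, not the official code.
import torch
import torch.nn as nn


class MultiwayBlock(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Shared self-attention: learns to align tokens from all modalities.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One feed-forward expert per modality (vision, language, vision-language).
        self.experts = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(dim, dim * mlp_ratio),
                nn.GELU(),
                nn.Linear(dim * mlp_ratio, dim),
            )
            for name in ("vision", "language", "vision_language")
        })

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # Pre-norm residual self-attention, shared by every modality.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Route the tokens through the expert matching their modality.
        x = x + self.experts[modality](self.norm2(x))
        return x


if __name__ == "__main__":
    block = MultiwayBlock()
    image_patches = torch.randn(2, 196, 768)  # e.g. 14x14 image patch tokens
    text_tokens = torch.randn(2, 32, 768)
    print(block(image_patches, "vision").shape)   # torch.Size([2, 196, 768])
    print(block(text_tokens, "language").shape)   # torch.Size([2, 32, 768])
```

During pretraining, the same block processes sequences in which a fraction of the text tokens or image patches has been masked, and the model is trained to predict the missing tokens, which is the single masked data modeling objective described above.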

As part of their empirical investigation, the researchers evaluated BEiT-3 on well-known public benchmarks covering visual question answering (VQA), visual reasoning, image captioning, and semantic segmentation. These experiments show that BEiT-3 achieves state-of-the-art performance across vision and vision-language tasks, including object detection, semantic segmentation, image classification, visual reasoning, visual question answering, image captioning, and cross-modal retrieval. The central idea behind BEiT-3 is that an image can be treated as a foreign language, allowing masked “language” modeling to be performed uniformly over images, texts, and image-text pairs. The work also casts Multiway Transformers in a new light by showing how effectively they can model diverse vision and vision-language tasks, making them an attractive choice for general-purpose modeling. The team believes BEiT-3 offers a simple and effective way to scale up multimodal foundation models. To facilitate cross-lingual and cross-modality transfer, the researchers are working on multilingual pretraining of BEiT-3 and on adding other modalities such as audio. Overall, Microsoft’s BEiT-3 presents a promising path toward efficiently scaling multimodal foundation models while advancing their development.

This article is written as a research summary by Marktechpost staff based on the research paper 'Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks'. All credit for this research goes to the researchers on this project. Check out the paper, GitHub link, and reference article.



Khushboo Gupta is an intern consultant at MarktechPost. She is currently pursuing her B.Tech from Indian Institute of Technology (IIT), Goa. She is passionate about the fields of machine learning, natural language processing and web development. She likes to learn more about the technical field by participating in several challenges.

