We are hiring at all levels (including FTE researchers and interns)! If you are interested in working with us on Foundation Models (aka large-scale pre-trained models) and General AI, including NLP, MT, Speech, Document AI, and Multimodal AI, please send your resume to fuwei@microsoft.com.
Foundation Architecture
TorchScale - A Library of Foundation Architectures (repo)
Fundamental research to develop new architectures for foundation models and AI, focusing on modeling generality and capability, as well as training stability and efficiency (a minimal usage sketch follows the list below).
Stability - DeepNet: scaling Transformers to 1,000 Layers and beyond
Generality - Foundation Transformers (Magneto): towards true general-purpose modeling across tasks and modalities (including language, vision, speech, and multimodal)
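As a quick reference, the snippet below is a minimal usage sketch of TorchScale (installable via pip). The vocab_size value and the deepnorm flag (enabling DeepNet-style normalization for training stability) are illustrative assumptions based on the library's documented configuration options and may differ across versions.

```python
# pip install torchscale
from torchscale.architecture.config import EncoderConfig
from torchscale.architecture.encoder import Encoder

# Build a Transformer encoder from a config; deepnorm=True is assumed to
# enable the DeepNet-style normalization for training stability.
config = EncoderConfig(vocab_size=64000, deepnorm=True)
model = Encoder(config)

print(model)
```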
Foundation Models
Kosmos-2: Grounding Multimodal Large Language Models to the World
Kosmos-1: A Multimodal Large Language Model (MLLM)
MetaLM: Language Models are General-Purpose Interfaces
The Big Convergence - Large-scale self-supervised pre-training across tasks (predictive and generative), languages (100+ languages), and modalities (language, image, audio, layout/format + language, vision + language, audio + language, etc.)
Language & Multilingual
UniLM: unified pre-training for language understanding and generation
InfoXLM/XLM-E: multilingual/cross-lingual pre-trained models for 100+ languages
DeltaLM/mT6: encoder-decoder pre-training for language generation and translation for 100+ languages
MiniLM: small and fast pre-trained models for language understanding and generation
AdaLM: domain, language, and task adaptation of pre-trained models
EdgeLM (NEW): small pre-trained models on edge/client devices
SimLM (NEW): large-scale pre-training for similarity matching
Multimodal
VL-BEiT (NEW): Generative Vision-Language Pre-training - an evolution of BEiT to the multimodal setting
BEiT-3 (NEW): a general-purpose multimodal foundation model, and a major milestone of The Big Convergence of Large-scale Pre-training Across Tasks, Languages, and Modalities.
News
September, 2023: Kosmos-2.5 - a multimodal literate model for machine reading of text-intensive images.
[Model Release] May, 2023: TextDiffuser models and code.
[Model Release] March, 2023: BEiT-3 pretrained models and code.
March, 2023: Kosmos-1 - a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot).
January, 2023: VALL-E - a language modeling approach for text-to-speech synthesis (TTS) that achieves state-of-the-art zero-shot TTS performance. See https://aka.ms/valle for demos of our work.
[Model Release] January, 2023: E5 - Text Embeddings by Weakly-Supervised Contrastive Pre-training (a usage sketch follows this news list).
[Model Release] November, 2022: XDoc BASE models for cross-format document understanding.
[Model Release] September, 2022: TrOCR BASE and LARGE models for Scene Text Recognition (STR).
[Model Release] September, 2022: BEiT v2 code and pretrained models.
August, 2022: BEiT-3 - a general-purpose multimodal foundation model, which achieves state-of-the-art transfer performance on both vision and vision-language tasks
July, 2022: SimLM - Large-scale self-supervised pre-training for similarity matching
June, 2022: DiT and LayoutLMv3 were accepted by ACM Multimedia 2022.
June, 2022: MetaLM - Language models are general-purpose interfaces to foundation models (language/multilingual, vision, speech, and multimodal)
June, 2022: VL-BEiT - bidirectional multimodal Transformer learned from scratch with one unified pretraining task, one shared backbone, and one-stage training, supporting both vision and vision-language tasks.
[Model Release] June, 2022: LayoutLMv3 Chinese - Chinese version of LayoutLMv3
[Code Release] May, 2022: Aggressive Decoding - Lossless Speedup for Seq2seq Generation
April, 2022: Transformers at Scale = DeepNet + X-MoE
[Model Release] April, 2022: LayoutLMv3 - Pre-training for Document AI with Unified Text and Image Masking
May, 2021: LayoutLMv2, InfoXLMv2, MiniLMv2, UniLMv3, and AdaLM were accepted by ACL 2021.
April, 2021: LayoutXLM extends LayoutLM with multilingual support! A multilingual form understanding benchmark, XFUND, is also introduced; it includes forms with human-labeled key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese).
September, 2019: UniLMv1 was accepted by NeurIPS 2019.
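Many of the checkpoints announced above are also distributed through the Hugging Face Hub and can be loaded with the transformers library acknowledged below. The snippet is a minimal, hedged sketch for the E5 text-embedding release; the intfloat/e5-base-v2 checkpoint name, the query:/passage: prefixes, and the mean-pooling step are assumptions based on the public E5 model card and may differ for other releases.

```python
# pip install torch transformers
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Checkpoint name assumed from the public E5 model card; adjust as needed.
model_name = "intfloat/e5-base-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# E5 expects "query: " / "passage: " prefixes on its input text.
texts = [
    "query: how are E5 text embeddings trained",
    "passage: E5 is pre-trained with weakly-supervised contrastive learning.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

# Mean-pool the token embeddings (ignoring padding), then L2-normalize.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = F.normalize(embeddings, p=2, dim=1)

# Cosine similarity between the query and the passage.
print(embeddings[0] @ embeddings[1])
```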
License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Portions of the source code are based on the transformers project.