COS-MOE: COntrastive Streamlined Multimodal Model with Mixture Of Experts

Alex Jinpeng Wang1     Linjie Li2     Kevin Qinghong Lin1     Jianfeng Wang2     Kevin Lin2     Zhengyuan Yang2     Lijuan Wang2     Mike Zheng Shou1    
1National University of Singapore     2Microsoft    

TL;DR

  • A clean multimodal model with a Mixture of Experts (MoE) as the LLM backbone.
  • Trained on billions of samples, including image-text pairs, video-text pairs, interleaved image-text pairs, and interleaved video-text pairs.
  • Supports interleaved inference (a prompt sketch is given after this list).
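
As a rough illustration of interleaved inference, the sketch below assembles a few-shot prompt that mixes images and text. The <image> placeholder token, the build_interleaved_prompt helper, and the cosmoe_generate call are illustrative assumptions, not the released API.

    # Illustrative sketch only: assembling an interleaved image-text prompt
    # for few-shot inference. The <image> token and `cosmoe_generate` are
    # hypothetical placeholders, not the released CosMoE API.
    from PIL import Image

    IMG_TOKEN = "<image>"  # assumed special token marking where image features are injected

    def build_interleaved_prompt(segments):
        """Interleave text strings and PIL images into (prompt, images).

        Text segments are concatenated; each image is replaced by IMG_TOKEN
        in the text stream and collected in order for the vision encoder.
        """
        prompt_parts, images = [], []
        for seg in segments:
            if isinstance(seg, Image.Image):
                prompt_parts.append(IMG_TOKEN)
                images.append(seg)
            else:
                prompt_parts.append(seg)
        return "".join(prompt_parts), images

    # One-shot VQA-style prompt: an image-text demonstration followed by a query image.
    prompt, images = build_interleaved_prompt([
        Image.open("demo.jpg"), "Question: What animal is this? Answer: a cat.\n",
        Image.open("query.jpg"), "Question: What is the person holding? Answer:",
    ])
    # output = cosmoe_generate(prompt, images)  # hypothetical generation call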

Overview

  • CosMoE has 12 billion active parameters (47 billion available parameters in total); a rough parameter sketch is given after this list.
  • Training CosMoE.

  • CosMoE-56B demonstrates superior convergence over CosMo-8B, which is particularly evident in the interleaved image-text loss.
  • Experiments

    Method              VQAv2   OK-VQA   VizWiz   Hateful Memes
    CosMoE-8B (PT)      47.2    32.7     22.5     57.1
    CosMoE-8x7B (PT)    49.5    42.2     21.6     63.5
    CosMoE-8x7B (SFT)   53.4    38.5     30.4     64.2
    IDEFICS-80B (SFT)   37.4    36.9     26.2     58.9
  • The 1B pre-training samples come from MMC4, OBELICS, and DataComp.
  • Our pre-training data do not include downstream datasets such as COCO, SAM, OK-VQA, or other COCO-style data.
  • For supervised fine-tuning, we follow IDEFICS for a fair comparison.
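
For intuition on the active-versus-total parameter gap, here is a back-of-the-envelope sketch assuming a Mixtral-style 8x7B backbone with top-2 expert routing; all layer sizes and the shared-parameter count below are illustrative assumptions, not the exact CosMoE configuration.

    # Back-of-the-envelope sketch (assumed sizes, not official numbers): why a
    # Mixtral-style 8x7B MoE uses far fewer *active* parameters than its total.

    def moe_param_counts(n_layers, d_model, d_ff, n_experts, top_k, shared_params):
        """Count total vs. active parameters for a decoder with MoE FFN blocks.

        shared_params covers everything outside the expert FFNs (attention,
        embeddings, norms), which every token always uses.
        """
        expert_ffn = 3 * d_model * d_ff  # gate/up/down projections of a SwiGLU-style FFN
        total = shared_params + n_layers * n_experts * expert_ffn
        active = shared_params + n_layers * top_k * expert_ffn  # only top-k experts fire per token
        return total, active

    # Illustrative Mixtral-8x7B-like settings (assumed):
    total, active = moe_param_counts(
        n_layers=32, d_model=4096, d_ff=14336,
        n_experts=8, top_k=2, shared_params=2.0e9,
    )
    print(f"total  ~ {total / 1e9:.1f} B params")   # roughly 47 B
    print(f"active ~ {active / 1e9:.1f} B params")  # roughly 13 B

The exact reported figures (12B active, 47B total) depend on the actual attention, embedding, and routing configuration, so this sketch only conveys the order of magnitude.
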
BibTeX

          
    @article{wang2024cosmo,
      title={COSMO: Contrastive Streamlined Multimodal Model with Interleaved Pre-Training},
      author={Wang, Alex Jinpeng and Li, Linjie and Lin, Kevin Qinghong and Wang, Jianfeng and Lin, Kevin and Yang, Zhengyuan and Wang, Lijuan and Shou, Mike Zheng},
      journal={arXiv preprint arXiv:2401.00849},
      year={2024}
    }
