COS-MOE: COntrastive Streamlined multimodal model with Mixture Of Experts

Wang, Alex Jinpeng¹, Li, Linjie², Lin, Kevin Qinghong¹, Wang, Jianfeng², Lin, Kevin², Wang, Lijuan², Shou, Mike Zheng¹

¹National University of Singapore ²Microsoft


TL;DR

A clean multimodal model that uses a Mixture of Experts (MoE) as its LLM.
Trained on billion-scale data, including image-text pairs, video-text pairs, interleaved image-text data, and interleaved video-text data.
Supports interleaved inference.

Overview

COSMOE has 12 billion active parameters (47 billion parameters in total).
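
The gap between active and total parameters comes from sparse routing: each token passes through the shared layers plus only a few of the experts. A minimal sketch of that arithmetic, assuming top-k routing. The function name and the numbers below are hypothetical and are not taken from the COSMOE configuration:

```python
def moe_param_counts(shared, per_expert, num_experts, top_k):
    """Return (total, active) parameter counts for a sparse MoE model.

    shared      -- parameters every token uses (attention, embeddings, ...)
    per_expert  -- parameters of one expert, summed over all layers
    num_experts -- number of experts available to the router
    top_k       -- number of experts the router selects per token
    """
    if not 1 <= top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    total = shared + num_experts * per_expert   # everything stored in memory
    active = shared + top_k * per_expert        # what one token actually uses
    return total, active


# Hypothetical sizes in billions, chosen only to illustrate the ratio.
total, active = moe_param_counts(shared=2.0, per_expert=5.0, num_experts=8, top_k=2)
print(f"total={total:.1f}B active={active:.1f}B")
```

The total count determines memory footprint, while the active count determines per-token compute.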

Training COSMOE.

COSMOE-56B converges better than CosMo-8B, most visibly in the interleaved image-text loss.

Experiments

Method	VQAv2	OK-VQA	VizWiz	HatefulMemes
COSMOE-8B PT	47.2	32.7	22.5	57.1
COSMOE-8x7b PT	49.5	42.2	21.6	63.5
COSMOE-8x7b SFT	53.4	38.5	30.4	64.2
IDEFICS-80B SFT	37.4	36.9	26.2	58.9
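
To make the comparison concrete, the rows can be compared benchmark by benchmark. A small sketch that copies the scores from the table and computes the difference between two methods. The dictionary layout and the helper name `score_delta` are illustrative choices:

```python
BENCHMARKS = ["VQAv2", "OK-VQA", "VizWiz", "HatefulMemes"]

# Scores copied from the table above, in benchmark order.
SCORES = {
    "COSMOE-8B PT": [47.2, 32.7, 22.5, 57.1],
    "COSMOE-8x7b PT": [49.5, 42.2, 21.6, 63.5],
    "COSMOE-8x7b SFT": [53.4, 38.5, 30.4, 64.2],
    "IDEFICS-80B SFT": [37.4, 36.9, 26.2, 58.9],
}


def score_delta(method_a, method_b):
    """Per-benchmark score of method_a minus method_b, rounded to 0.1."""
    return {
        name: round(a - b, 1)
        for name, a, b in zip(BENCHMARKS, SCORES[method_a], SCORES[method_b])
    }


delta = score_delta("COSMOE-8x7b SFT", "COSMOE-8x7b PT")
for name, value in delta.items():
    print(f"{name}: {value:+.1f}")
```

With these values, supervised fine-tuning raises the VQAv2, VizWiz, and HatefulMemes scores of COSMOE-8x7b and lowers its OK-VQA score.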

The 1B-scale pre-training data comes from MMC4, OBELICS, and DataComp.

Our pre-training data does not include downstream datasets such as COCO, SAM, or OK-VQA, nor any COCO-style data.

For supervised fine-tuning, we follow the IDEFICS setup for a fair comparison.

BibTeX

      
        @article{wang2024cosmo,
          title={COSMO: Contrastive Streamlined Multimodal Model with Interleaved Pre-Training},
          author={Wang, Alex Jinpeng and Li, Linjie and Lin, Kevin Qinghong and Wang, Jianfeng and Lin, Kevin and Yang, Zhengyuan and Wang, Lijuan and Shou, Mike Zheng},
          journal={arXiv preprint arXiv:2401.00849},
          year={2024}
        }
