| Method | Text Source | Text Tokens | T-Shots | OK-VQA | TextVQA | VizWiZ | VQAV2 | COCO | Flickr | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| Open-Flamingo9B Baseline† | Raw Text | 10 | 0 | 18.1 | 14.8 | 21.5 | 26.5 | 40.1 | 32.1 | 25.5 |
| | | 62 | 4 | 23.8 | 18.1 | 23.7 | 40.5 | 57.5 | 35.3 | 33.2 (7.7↑) |
| | | 426 | 32 | 25.2 | 16.4 | 25.5 | 34.6 | 66.1 | 38.5 | 34.4 (8.9↑) |
| + VisInContext | + Rendered Image | 10 | 0 | 16.2 | 16.8 | 15.4 | 30.6 | 42.3 | 33.5 | 25.8 |
| | | 10 | 4 | 17.2 | 21.8 | 19.7 | 35.2 | 52.4 | 35.2 | 30.3 (4.5↑) |
| | | 10 | 32 | 21.3 | 22.6 | 21.5 | 38.8 | 60.3 | 37.0 | 33.6 (7.8↑) |
VisInContext effectively incorporates in-context text through visual tokens, yielding significant performance improvements while keeping the number of text tokens fed to the LLM constant.
Here, T-Shots is the number of text-only in-context examples. Text Tokens is the length of the text input to the LLM. Text Source describes how the in-context examples are preprocessed. † denotes our implementation trained on 180M pretraining data.
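The "Rendered Image" text source means the in-context text is drawn onto images and enters the model as visual tokens rather than as LLM text tokens, which is why the text-token count stays at 10 even as the number of shots grows. The snippet below is a minimal sketch of this idea only; the PIL-based renderer, 224×224 canvas, default font, and wrapping width are assumptions for illustration, not the repository's actual rendering pipeline.

```python
# Sketch: render text-only in-context examples as images so they can be consumed
# by the vision encoder instead of consuming LLM text tokens.
# Assumptions (not from the paper): PIL rendering, 224x224 canvas, default font.
from PIL import Image, ImageDraw, ImageFont
import textwrap


def render_text_as_image(text: str, size: int = 224, chars_per_line: int = 28) -> Image.Image:
    """Draw one text in-context example onto a blank canvas."""
    canvas = Image.new("RGB", (size, size), color="white")
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()
    wrapped = textwrap.fill(text, width=chars_per_line)
    draw.multiline_text((4, 4), wrapped, fill="black", font=font)
    return canvas


# Example: a few text-only shots become rendered images, while the short query
# remains the only text given to the LLM, so its text-token budget stays flat.
shots = [
    "Question: What is in the image? Answer: a dog.",
    "Question: What color is the bus? Answer: red.",
]
rendered_shots = [render_text_as_image(s) for s in shots]
```

Under this scheme, adding more shots scales the visual input rather than the text input, which matches the constant Text Tokens column for the + Rendered Image rows above.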
| Method | Text | ICL Tokens ↑ | Shots | OK-VQA | TextVQA | VizWiZ | VQAV2 | COCO | Flickr | HM | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Open-Flamingo MOE† | Raw Text | 256 | 0 | 40.2 | 21.3 | 23.3 | 47.8 | 82.3 | 59.4 | 60.4 | 47.8 |
| | | | 4 | 42.5 | 22.2 | 32.2 | 49.8 | 90.5 | 63.5 | 63.8 | 52.1 |
| | | | 32 | 46.8 | 23.2 | 40.5 | 49.9 | 98.2 | 66.2 | 66.0 | 55.8 |
| + VisInContext | + Rendered Image | 2048 | 0 | 39.5 | 26.4 | 26.3 | 48.5 | 84.4 | 60.5 | 62.2 | 49.7 |
| | | | 4 | 44.3 | 28.9 | 32.0 | 50.3 | 94.2 | 65.3 | 65.5 | 54.4 |
| | | | 32 | 46.3 | 31.2 | 41.2 | 51.0 | 101.3 | 68.4 | 65.2 | 57.8 |
Increasing the in-context text length with VisInContext significantly improves performance on multi-modal downstream tasks.
The model is pre-trained with a 56B MoE backbone. ICL stands for in-context text length; HM is short for HatefulMemes. With VisInContext, we increase the ICL from 256 to 2048 tokens, leading to clear improvements over the baseline. † indicates our implementation.
@article{wang2024visincontext,
title={Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning},
author={Wang, Alex Jinpeng and Li, Linjie and Lin, Yiqi and Li, Min and Wang, Lijuan and Shou, Mike Zheng},
journal={arXiv preprint arXiv:2406.02547},
year={2024}
}