Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning

Wang, Alex Jinpeng1    Li, Linjie2    Lin, Yiqi1    Li, Min3    Wang, Lijuan2    Shou, Mike Zheng1
1National University of Singapore    2Microsoft    3Central South University

Summary

  • We introduce Visualized In-Context Text Processing (VisInContext), a novel method that increases in-context text length using visual tokens.
  • We demonstrate that VisInContext is effective at both the training and inference stages, at a much lower computational cost.
  • As a byproduct, our method also shows great potential for document understanding, as demonstrated on popular document QA tasks and our newly proposed sequential document retrieval task.

Overview

  • The VisInContext pipeline builds on the Flamingo model for in-context few-shot modeling (shown in gray). VisInContext processes interleaved image-text data by rendering part of the in-context text into images. This keeps the text token length fed to the LLM fixed while significantly extending the effective in-context text length. A minimal sketch of the rendering step follows.
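
    To make the rendering step concrete, here is a minimal sketch in Python using Pillow. This is our illustrative reconstruction, not the released implementation; the function name, canvas size, and font handling are assumptions.

    # Illustrative sketch: render a chunk of in-context text onto an image
    # so the vision encoder, rather than the LLM tokenizer, consumes it.
    # Hypothetical helper, not the paper's released code.
    import textwrap

    from PIL import Image, ImageDraw, ImageFont


    def render_text_to_image(text, width=448, height=448, line_height=16):
        """Draw a chunk of in-context text onto a white canvas."""
        img = Image.new("RGB", (width, height), "white")
        draw = ImageDraw.Draw(img)
        font = ImageFont.load_default()  # assumption: use a real TTF font in practice
        y = 4
        for line in textwrap.wrap(text, width=60):
            draw.text((4, y), line, fill="black", font=font)
            y += line_height
        return img


    # Usage: one long in-context example becomes a single "text image".
    demo = render_text_to_image("Q: What is shown in the photo? A: A red bicycle.")
    demo.save("in_context_text.png")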
OCR Validation

  • OCR ability is substantially improved with VisInContext.
Text Image Experiments

| Method | Text Source | Text Tokens | T-Shots | OK-VQA | TextVQA | VizWiZ | VQAV2 | COCO | Flickr | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| Open-Flamingo9B Baseline | Raw Text | 10 | 0 | 18.1 | 14.8 | 21.5 | 26.5 | 40.1 | 32.1 | 25.5 |
| | | 62 | 4 | 23.8 | 18.1 | 23.7 | 40.5 | 57.5 | 35.3 | 33.2 (7.7↑) |
| | | 426 | 32 | 25.2 | 16.4 | 25.5 | 34.6 | 66.1 | 38.5 | 34.4 (8.9↑) |
| + VisInContext | + Rendered Image | 10 | 0 | 16.2 | 16.8 | 15.4 | 30.6 | 42.3 | 33.5 | 25.8 |
| | | 10 | 4 | 17.2 | 21.8 | 19.7 | 35.2 | 52.4 | 35.2 | 30.3 (4.5↑) |
| | | 10 | 32 | 21.3 | 22.6 | 21.5 | 38.8 | 60.3 | 37.0 | 33.6 (7.8↑) |

    VisInContext effectively incorporates in-context text through visual tokens, yielding significant performance gains while keeping text token usage constant.
    Here, T-Shots refers to the number of text-only in-context examples. Text Tokens is the length of the text input to the LLM. Text Source describes how the in-context examples are preprocessed. Results are from our implementation pre-trained on 180M data.
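
    To illustrate why the text token count in the table above stays flat as shots are added, the following sketch (our own illustration; the tokenizer choice and the <image> placeholder convention are assumptions) compares the two preprocessing paths.

    # Illustrative token accounting (not the paper's code): compare feeding
    # in-context shots as raw text vs. as rendered "text images".
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumption: any tokenizer works for this comparison

    shots = ["Q: What is in the image? A: A dog playing fetch."] * 4
    query = "Q: What color is the bus? A:"

    # Baseline: every shot enters the LLM as raw text, so tokens grow per shot.
    baseline_prompt = " ".join(shots + [query])
    print(len(tokenizer(baseline_prompt).input_ids))

    # VisInContext-style: each shot is rendered to an image (see the sketch
    # above) and routed through the vision encoder. In real models the
    # <image> placeholder is a single special token, so the LLM's text input
    # is essentially just the query, regardless of the number of shots.
    rendered_prompt = "<image> " * len(shots) + query
    print(len(tokenizer(rendered_prompt).input_ids))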

Few-shot Modeling Ability Evaluation

| Method | Text Source | ICL Tokens | Shots | OK-VQA | TextVQA | VizWiZ | VQAV2 | COCO | Flickr | HM | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Open-Flamingo MoE | Raw Text | 256 | 0 | 40.2 | 21.3 | 23.3 | 47.8 | 82.3 | 59.4 | 60.4 | 47.8 |
| | | | 4 | 42.5 | 22.2 | 32.2 | 49.8 | 90.5 | 63.5 | 63.8 | 52.1 |
| | | | 32 | 46.8 | 23.2 | 40.5 | 49.9 | 98.2 | 66.2 | 66.0 | 55.8 |
| + VisInContext | + Rendered Image | 2048 | 0 | 39.5 | 26.4 | 26.3 | 48.5 | 84.4 | 60.5 | 62.2 | 49.7 |
| | | | 4 | 44.3 | 28.9 | 32.0 | 50.3 | 94.2 | 65.3 | 65.5 | 54.4 |
| | | | 32 | 46.3 | 31.2 | 41.2 | 51.0 | 101.3 | 68.4 | 65.2 | 57.8 |

    Increasing the in-context text length with VisInContext significantly improves performance on multi-modal downstream tasks.
    The model is pre-trained with a 56B MoE language model. ICL stands for in-context text length; HM is short for HatefulMemes. With VisInContext, the ICL grows from 256 to 2048 tokens, leading to clear improvements over the baseline. Results are from our implementation.

    BibTeX

    @article{wang2024visincontext,
      title={Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning},
      author={Wang, Alex Jinpeng and Li, Linjie and Lin, Yiqi and Li, Min and Wang, Lijuan and Shou, Mike Zheng},
      journal={arXiv preprint arXiv:2406.02547},
      year={2024}
    }
