Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models

Wang, Alex Jinpeng 1     Li, Linjie 2     Yang, Zhengyuan 2     Wang, Lijuan 2     Li, Min 1
1 Central South University     2 Microsoft

Summary

  • We present LongTextAR, the first model specifically designed for long-text image generation, addressing a significant gap in existing text-to-image methods, which typically handle only short sentences.
  • We pinpoint weak tokenization as a critical barrier to effective text rendering in existing multimodal autoregressive models, such as Chameleon.
  • LongTextAR offers customizable text rendering with control over font attributes, and it generalizes to natural image generation through co-training. Our experiments demonstrate its potential for applications such as document generation and PowerPoint editing.

Word Accuracy Comparison

Overview

  • Our trained text-focused tokenizer converts the long-text image into discrete token IDs. A corresponding long-text prompt is generated, and the model is then tasked with predicting the image token IDs based on this long text prompt.
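
The step above can be pictured as ordinary next-token prediction over one combined sequence: the prompt's token IDs come first, followed by the image's token IDs. The following is a minimal sketch under stated assumptions, not the authors' implementation. The vocabulary sizes, the BOI/EOI marker tokens, the ID offset for image tokens, and the choice to mask prompt positions out of the loss are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Every constant here is a hypothetical placeholder, not a value from the paper.
TEXT_VOCAB_SIZE = 32000        # assumed size of the text vocabulary
IMAGE_CODEBOOK_SIZE = 8192     # assumed size of the image tokenizer codebook
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # assumed begin-of-image marker
EOI = BOI + 1                                # assumed end-of-image marker
IGNORE = -100                  # label value to be skipped by the loss


def build_example(prompt_ids: List[int], image_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) for next-token prediction.

    Image token IDs are shifted past the text vocabulary so both kinds of
    token share one ID space. Labels for prompt positions are masked, so
    only the image part (including its markers) contributes to the loss.
    """
    for t in image_ids:
        if not 0 <= t < IMAGE_CODEBOOK_SIZE:
            raise ValueError(f"image token {t} is outside the codebook")

    shifted = [TEXT_VOCAB_SIZE + t for t in image_ids]
    sequence = list(prompt_ids) + [BOI] + shifted + [EOI]

    input_ids = sequence[:-1]   # the model sees tokens 0..n-2
    targets = sequence[1:]      # and predicts tokens 1..n-1

    # targets[i] equals sequence[i + 1], so the first len(prompt) - 1
    # targets are still prompt tokens and are masked out.
    n_masked = max(len(prompt_ids) - 1, 0)
    labels = [IGNORE] * n_masked + targets[n_masked:]
    return input_ids, labels


if __name__ == "__main__":
    inputs, labels = build_example(prompt_ids=[5, 6, 7], image_ids=[0, 3])
    print(inputs)
    print(labels)
```

At inference time, a model trained this way would be given only the prompt IDs followed by the begin-of-image marker, and the sampled image token IDs would then be passed to the tokenizer's decoder to reconstruct pixels.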

Control over font attributes

  • LongTextAR has good control over font attributes. (The middle row shows the ground truth.)
  • LongTextAR substantially improves long-text generation ability.

Applications

BibTeX

@article{wang2025beyond,
  title={Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models},
  author={Wang, Alex Jinpeng and Li, Linjie and Yang, Zhengyuan and Wang, Lijuan and Li, Min},
  journal={arXiv preprint arXiv:2503.20198},
  year={2025}
}
