Vision large language models (LLMs) have recently made remarkable progress, yet they still fall short of being multimodal generalists, suffering from coarse instance-level understanding, a lack of unified support for both images and videos, and insufficient coverage of vision tasks. In this paper, we present VITRON, a universal pixel-level vision LLM designed for comprehensively understanding, generating, segmenting, and editing both static images and dynamic videos. Built on top of an LLM backbone, Vitron incorporates encoders for images, videos, and pixel-level regional visuals in its frontend modules, and employs state-of-the-art visual specialists as its backend, through which it supports a spectrum of vision end tasks spanning from visual comprehension to visual generation, and from low level to high level. To ensure effective and precise message passing from the LLM to the backend modules for function invocation, we propose a novel hybrid method that simultaneously integrates discrete textual instructions and continuous signal embeddings. Further, we design pixel-level spatiotemporal vision-language alignment learning to endow Vitron with fine-grained visual capability. Finally, a cross-task synergy module is devised to maximize task-invariant fine-grained visual features, enhancing the synergy between different visual tasks. Demonstrated on 12 visual tasks and evaluated across 22 datasets, Vitron showcases its extensive capabilities in the four main vision task clusters. Overall, this work illuminates the great potential of developing a more unified multimodal generalist.
Figure 1: Vitron supports four vision task clusters, spanning from visual comprehension to generation, and from low level to high level.
Figure 2: Overview of the Vitron framework.
Specifically, we design a structured invocation template consisting of 1) the module name, 2) the invocation command, and 3) an optional region specifying the fine-grained visual region required by certain tasks. The signal feature embeddings comprise both task-specific features and task-invariant fine-grained features. This design achieves feature decoupling: the task-invariant fine-grained features are shared as widely as possible across all tasks to facilitate synergy between different tasks.
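To make this message format concrete, the sketch below shows how such a structured invocation could be packaged in code, combining discrete text fields with continuous embeddings. The field names, the module name "segmenter", and the build_invocation helper are illustrative assumptions, not Vitron's exact interface.

```python
# Illustrative sketch of a structured invocation message that pairs
# discrete textual instructions with continuous signal embeddings.
# Field names and the example module are hypothetical, not Vitron's exact format.
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class Invocation:
    module: str                                     # 1) backend module name
    command: str                                    # 2) textual invocation command from the LLM
    region: Optional[list] = None                   # 3) optional region, e.g. a normalized box [x1, y1, x2, y2]
    task_specific: Optional[torch.Tensor] = None    # continuous feature routed only to this module
    task_invariant: Optional[torch.Tensor] = None   # fine-grained feature shared across all tasks


def build_invocation(module: str, command: str, region=None,
                     task_specific=None, task_invariant=None) -> Invocation:
    """Package discrete instructions and continuous embeddings for one backend call."""
    return Invocation(module, command, region, task_specific, task_invariant)


# Example: asking a hypothetical segmentation specialist to mask a referred region.
msg = build_invocation(
    module="segmenter",
    command="segment the dog in the highlighted region",
    region=[0.12, 0.30, 0.55, 0.88],        # normalized box coordinates
    task_specific=torch.randn(1, 256),      # placeholder embeddings for illustration
    task_invariant=torch.randn(1, 256),
)
```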
Table 1: Summary of backend modules in Vitron.
We train Vitron in three distinct phases to endow it with robust vision understanding and task-execution capabilities.
The first phase equips the multimodal large language model (MLLM) with fundamental multimodal understanding capabilities. This includes aligning the model's decoder with the encoders so that backend responses remain consistent.
The second phase focuses on enhancing the model's ability to precisely understand and process both images and videos at the pixel level; the model is trained to accurately map the spatial details of visual inputs.
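As a rough illustration of this kind of fine-grained alignment objective, the sketch below regresses normalized box coordinates from LLM hidden states and adds a simple in-batch contrastive term between projected LLM states and region embeddings. The dimensions, loss form, and weighting are assumptions for illustration, not Vitron's actual training recipe.

```python
# Hypothetical sketch of a pixel/region-level alignment objective:
# regress normalized box coordinates from LLM hidden states and align
# them with embeddings produced by a region/pixel encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionAlignmentHead(nn.Module):
    def __init__(self, hidden_dim: int = 4096, embed_dim: int = 256):
        super().__init__()
        self.box_head = nn.Linear(hidden_dim, 4)      # predicts [x1, y1, x2, y2] in [0, 1]
        self.proj = nn.Linear(hidden_dim, embed_dim)  # projection for contrastive alignment

    def forward(self, hidden, region_embeds, gt_boxes):
        # hidden:        (B, hidden_dim)  LLM states at region-query positions
        # region_embeds: (B, embed_dim)   embeddings from the region/pixel encoder
        # gt_boxes:      (B, 4)           ground-truth normalized boxes
        pred_boxes = self.box_head(hidden).sigmoid()
        box_loss = F.l1_loss(pred_boxes, gt_boxes)

        # In-batch contrastive alignment (InfoNCE-style) between projected
        # LLM states and region embeddings.
        q = F.normalize(self.proj(hidden), dim=-1)
        k = F.normalize(region_embeds, dim=-1)
        logits = q @ k.t() / 0.07
        targets = torch.arange(q.size(0), device=q.device)
        align_loss = F.cross_entropy(logits, targets)

        return box_loss + 0.5 * align_loss


# Example usage with random placeholder tensors:
head = RegionAlignmentHead()
loss = head(torch.randn(8, 4096), torch.randn(8, 256), torch.rand(8, 4))
```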
To ensure that different specialist modules work synergistically within a generalist model, we separate the signal feature embeddings into task-specific and task-invariant categories. By sharing the task-invariant features as widely as possible, the modules can better support one another, enhancing overall system synergy. To differentiate the two kinds of features, we introduce a cross-task synergy learning module based on adversarial training: the specialists make predictions with these features, while a discriminator attempts to identify the task from the shared features alone. When the discriminator can no longer recognize the task, the shared features are considered optimally generalized for cross-task application.
Figure 3: Illustration of the synergy module.
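A minimal sketch of this adversarial idea is given below: a discriminator tries to recover the task identity from the shared (task-invariant) features, while a gradient-reversal layer pushes the features to make that classification fail. The layer sizes, the use of gradient reversal, and the loss form are illustrative choices, not the exact training code of the synergy module.

```python
# Minimal sketch of adversarial cross-task synergy learning: a discriminator
# predicts the task id from shared (task-invariant) features, and a
# gradient-reversal layer trains those features to fool it.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) gradients flowing back into the shared features.
        return -ctx.lambd * grad_output, None


class TaskDiscriminator(nn.Module):
    def __init__(self, feat_dim: int = 256, num_tasks: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, num_tasks)
        )

    def forward(self, shared_feat, lambd: float = 1.0):
        return self.net(GradReverse.apply(shared_feat, lambd))


def synergy_loss(shared_feat, task_ids, discriminator):
    """Adversarial loss on shared features: the discriminator tries to recover
    the task id, while reversed gradients push the features to be task-agnostic."""
    logits = discriminator(shared_feat)
    return F.cross_entropy(logits, task_ids)


# Example: shared features for a batch drawn from 4 hypothetical task clusters.
disc = TaskDiscriminator()
feats = torch.randn(16, 256, requires_grad=True)
tasks = torch.randint(0, 4, (16,))
loss = synergy_loss(feats, tasks, disc)
loss.backward()  # gradients reaching `feats` are reversed by GradReverse
```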
@inproceedings{fei2024vitron,
  title={VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing},
  author={Fei, Hao and Wu, Shengqiong and Zhang, Hanwang and Chua, Tat-Seng and Yan, Shuicheng},
  booktitle={Advances in Neural Information Processing Systems},
  year={2024},
}