VITRON:

A Unified Pixel-level Vision LLM for

Understanding, Generating, Segmenting, Editing

1Skywork AI, Singapore   2National University of Singapore   3Nanyang Technological University
(NeurIPS 2024)

Abstract

Vision large language models (LLMs) have recently made remarkable progress, yet they still fall short of being multimodal generalists: they typically offer only coarse-grained, instance-level understanding, lack unified support for both images and videos, and cover an insufficient range of vision tasks. In this paper, we present VITRON, a universal pixel-level vision LLM designed for comprehensive understanding, generating, segmenting, and editing of both static images and dynamic videos. Built on top of an LLM backbone, Vitron incorporates encoders for images, videos, and pixel-level regional visuals in its frontend modules, and employs state-of-the-art visual specialists as its backend, through which Vitron supports a spectrum of vision end tasks spanning visual comprehension to visual generation, from low level to high level. To ensure effective and precise message passing from the LLM to the backend modules for function invocation, we propose a novel hybrid method that simultaneously integrates discrete textual instructions and continuous signal embeddings. Further, we design a series of pixel-level spatiotemporal vision-language alignment learning strategies so that Vitron attains the strongest possible fine-grained visual capability. Finally, a cross-task synergy module is devised to learn to maximize the task-invariant fine-grained visual features, enhancing the synergy between different visual tasks. Demonstrated over 12 visual tasks and evaluated across 22 datasets, Vitron showcases its extensive capabilities across the four main vision task clusters. Overall, this work illuminates the great potential of developing a more unified multimodal generalist.


Figure 1: Vitron supports four vision task clusters, spanning visual comprehension to visual generation, from low level to high level.

Video Presentation

Technical Description


• Architecture of Vitron


Figure 2: Overview of the Vitron framework.


  • Frontend Block: Vision-Language Encoding. Vitron includes an image encoder, a video encoder, and a pixel-level regional sketch encoder.
  • Central Block: Core LLM. Vitron employs Vicuna (7B, v1.5) as its core LLM.
  • Backend Block: User Responding & Module Invocation. Vitron employs a text-centric invocation strategy to integrate diverse off-the-shelf SoTA modules, supporting a spectrum of vision end tasks that span visual understanding to visual generation, from low level to high level (a rough sketch of this three-block layout is given below).
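To make the three-block layout concrete, here is a minimal PyTorch sketch. The class and attribute names (VitronSketch, out_dim, the visual_prefix argument) are hypothetical conveniences rather than Vitron's actual interfaces; the sketch only illustrates how the frontend encoders, their projection layers, and the core Vicuna LLM could be wired together.

```python
import torch.nn as nn

class VitronSketch(nn.Module):
    def __init__(self, llm, image_enc, video_enc, sketch_enc, d_llm=4096):
        super().__init__()
        # Frontend block: image / video / pixel-level regional-sketch encoders.
        self.image_enc, self.video_enc, self.sketch_enc = image_enc, video_enc, sketch_enc
        # Linear projections that map each encoder's output into the LLM's
        # embedding space (d_llm = 4096 for Vicuna-7B).
        self.proj_img = nn.Linear(image_enc.out_dim, d_llm)
        self.proj_vid = nn.Linear(video_enc.out_dim, d_llm)
        self.proj_skt = nn.Linear(sketch_enc.out_dim, d_llm)
        # Central block: the core LLM (Vicuna-7B v1.5).
        self.llm = llm

    def forward(self, text_tokens, image=None, video=None, sketch=None):
        # Frontend: encode whichever visual inputs are present and project
        # them so the LLM can consume them alongside the text tokens.
        prefix = []
        if image is not None:
            prefix.append(self.proj_img(self.image_enc(image)))
        if video is not None:
            prefix.append(self.proj_vid(self.video_enc(video)))
        if sketch is not None:
            prefix.append(self.proj_skt(self.sketch_enc(sketch)))
        # Central block output: a textual response (which may contain the
        # structured invocation template) plus continuous signal embeddings
        # that are later routed to the chosen backend specialist.
        return self.llm(text_tokens, visual_prefix=prefix)
```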

Specifically, we design a structured invocation template consisting of 1) the module name, 2) the invocation command, and 3) an optional region that specifies the fine-grained visual feature needed for certain tasks. The feature embeddings passed along include both task-specific features and task-invariant fine-grained features. The purpose of this design is feature decoupling: we aim to share the task-invariant fine-grained features as widely as possible among all tasks to facilitate synergy between different tasks (an illustrative parsing sketch is given below).
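To illustrate the text-centric invocation, the following sketch shows how a structured response could be parsed and dispatched to a backend specialist. The tag names, the BACKEND registry, and the dispatch helper are assumptions made for illustration; the paper's exact template format is not reproduced here.

```python
import re

# Hypothetical registry mapping module names to backend specialists; in a
# real system each entry would wrap an actual model (segmenter, generator, ...).
BACKEND = {
    "segmenter": lambda cmd, region, emb: ...,       # placeholder wrapper
    "video_generator": lambda cmd, region, emb: ...,  # placeholder wrapper
}

def dispatch(llm_text: str, signal_embeddings):
    """Parse the structured invocation template out of the LLM's text output
    and call the named backend module.

    Assumed (illustrative) template:
        <module>NAME</module>
        <command>INSTRUCTION</command>
        <region>x1,y1,x2,y2</region>   (optional)
    """
    module = re.search(r"<module>(.*?)</module>", llm_text, re.S)
    command = re.search(r"<command>(.*?)</command>", llm_text, re.S)
    region = re.search(r"<region>(.*?)</region>", llm_text, re.S)
    if module is None or command is None:
        return None  # plain conversational reply, no backend call needed
    fn = BACKEND[module.group(1).strip()]
    # Both the discrete text command and the continuous signal embeddings
    # (task-specific + task-invariant parts) are handed to the specialist.
    return fn(command.group(1).strip(),
              region.group(1).strip() if region else None,
              signal_embeddings)
```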



Table 1: Summary of backend modules in Vitron.


• Pixel-aware Synergistic Vision-Language Understanding Tuning

We train Vitron in three distinct phases to endow it with robust vision understanding and task-execution capabilities.


1) Basic Multimodal Comprehension and Generation Skill Training

Initially, the training equips the multimodal large language model (MLLM) with fundamental multimodal understanding capabilities. This includes aligning the model's decoder with the backend modules' encoders so that invocations yield consistent backend responses.

  • Overall Vision-Language Alignment Learning. The objective is to harmonize visual and textual inputs to ensure accurate associations. This involves training the model with predefined pairs from reliable sources, like image-caption and video-caption pairs, enhancing its ability to generate descriptive texts that match given visuals accurately.
  • Text Invocation Instruction Tuning. This training step enhances the system's command execution by using over 55,000 instruction tuning samples to refine its ability to generate precise text instructions.
  • Embedding-oriented Decoder Alignment Tuning. Besides using explicit textual instructions to invoke downstream modules, the signal feature embeddings produced by the LLM should also be fed to those modules. We therefore align these feature embeddings with the input encoders of all visual modules via decoding-side projection layers (see the sketch after this list).
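As a rough sketch of this embedding-oriented alignment (the projector architecture and the loss below are illustrative assumptions, not necessarily the ones used by Vitron), a decoding-side projection can be trained to pull the LLM's signal embeddings toward the conditioning space a given backend module expects:

```python
import torch.nn as nn
import torch.nn.functional as F

class DecoderSideProjector(nn.Module):
    """Projects LLM signal embeddings into one backend module's input space;
    one projector per backend specialist (hypothetical setup)."""
    def __init__(self, d_llm=4096, d_module=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_llm, d_module), nn.GELU(), nn.Linear(d_module, d_module)
        )

    def forward(self, signal_emb):
        return self.proj(signal_emb)

def alignment_loss(projector, signal_emb, module_cond_emb):
    # Pull the projected LLM signal embedding towards the embedding the
    # backend module's own input encoder would produce for the same content
    # (e.g. a caption encoded by the module's text encoder). A cosine
    # objective is one simple choice; the paper's exact loss may differ.
    pred = projector(signal_emb)
    return 1.0 - F.cosine_similarity(pred, module_cond_emb, dim=-1).mean()
```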

2) Fine-grained Spatiotemporal Vision Grounding Instruction Tuning

The focus is to enhance the model's ability to precisely understand and process both images and videos at a pixel level. The model is trained to accurately map the spatial details of visuals.

  • Image Spatial Grounding. The training is tailored to enable the model to identify specific regions within an image and relate them to text, aiding in tasks like detailed image captioning and segmentation.
  • Video Spatial-Temporal Grounding. For videos, the model learns to recognize and describe both spatial and temporal aspects, which supports tasks such as contextually grounded video captioning and tracking of specific video elements.
  • Grounding-aware Vision QA. Beyond simple perception, the model is also trained on higher-level visual reasoning tasks. This involves question answering over images and videos, where the questions refer directly to the visual content grounded in the previous tasks, fostering deeper semantic understanding (an illustrative data-construction sketch follows this list).
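One common way to build such grounding instruction data is to serialize region coordinates into the text stream so the LLM can read and emit regions as plain text. The sketch below illustrates this idea with normalized, discretized box coordinates and hypothetical prompt wording; it is not necessarily Vitron's exact data format.

```python
def region_to_text(box, width, height, bins=100):
    """Serialize a bounding box as discrete normalized coordinates,
    e.g. "[5, 41, 32, 97]", so the LLM can read/emit regions as text."""
    x1, y1, x2, y2 = box
    q = lambda v, size: min(bins - 1, int(v / size * bins))
    return f"[{q(x1, width)}, {q(y1, height)}, {q(x2, width)}, {q(y2, height)}]"

def build_grounding_sample(caption, phrase, box, width, height, frame_idx=None):
    """Create one instruction-tuning sample that ties a phrase to a region
    (and, for video, to a frame index for temporal grounding)."""
    where = f" in frame {frame_idx}" if frame_idx is not None else ""
    return {
        "instruction": f"Where is '{phrase}'{where}? Answer with its region.",
        "context": caption,
        "response": f"The {phrase} is located at {region_to_text(box, width, height)}.",
    }

# Example: a 1280x720 image whose caption mentions "a grey pot".
sample = build_grounding_sample(
    caption="Two pots and a dog on a patio.",
    phrase="grey pot", box=(64, 300, 420, 700), width=1280, height=720,
)
```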

3) Cross-task Synergy Learning

To ensure that the different specialist modules work synergistically within a generalist model, we separate the signal feature embeddings into task-specific and task-invariant categories. By sharing the task-invariant features as widely as possible, the modules can better support each other, enhancing overall system synergy. We introduce a cross-task synergy learning module that uses adversarial training to differentiate these features: the specialists make predictions with both kinds of features, while a discriminator tries to identify the task from the shared features alone. When the discriminator can no longer recognize the task, the shared features are considered optimally generalized for cross-task use (a minimal sketch of this setup is given below).
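The sketch below illustrates this adversarial setup with a gradient-reversal layer, a standard way to train a feature extractor to fool a task discriminator. The module sizes, the reversal trick, and the loss weighting are assumptions for illustration; Vitron's concrete synergy module may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips gradients in the backward pass so
    the upstream features learn to FOOL the task discriminator."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

class TaskDiscriminator(nn.Module):
    def __init__(self, d=4096, num_tasks=12):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, num_tasks))

    def forward(self, shared_feat, lam=1.0):
        # Classify which task produced the shared (task-invariant) features.
        return self.net(GradReverse.apply(shared_feat, lam))

def synergy_loss(discriminator, shared_feat, task_id, task_loss, lam=0.1):
    # The specialist's own prediction loss (task_loss) uses both the shared
    # and task-specific features; the adversarial term pushes the shared
    # features to carry no task identity, i.e. to stay task-invariant.
    adv = F.cross_entropy(discriminator(shared_feat, lam), task_id)
    return task_loss + adv
```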


Figure 3: Illustration of the synergy module.


Demonstrations



• Example-1: Image Segmentation

User: Could you help me segment this item?

Vitron: Sure! Looks like you're aiming to cut out a grey pot.

User: I'm actually curious about the target in the middle of the two pots, could you assist?

Vitron: Absolutely! That would be a dog. Let's outline it.

User: You are smart! Then highlight the item I pointed out here, will you?

Vitron: Sure thing! Let's focus on the pot in front of the dog.


• Example-2: Video Tracking

User: Can you outline the grey hen for me in the video?

Vitron: Just checked out the video; it's a bunch of hens drinking water. Alright, I'll now proceed to track the pine forest scene for you.

User: Oh, and can you make sure to focus on the one in the middle at any timeframe in the video?

Vitron: Got it, there are actually three penguins. I'll stay on the middle one then.


• Example-3: Image Understanding

User: Could you tell me what's happening in this photo?

Vitron: The picture features a group of people riding in a small electric propulsion boat on a body of water. There are three people visible in the boat, with one person sitting in the front, another in the middle, and the third person in the back. The boat appears to be a motorboat with an electric motor, providing a more environmentally friendly mode of transportation. The people seem to be enjoying their time on the water, possibly engaging in recreational activities or simply taking a leisurely ride.

User: So what is the possible relationship between the two persons according to the picture?

Vitron: In the image, there is a man holding a tennis racket and a chair umpire on a tennis court. The relationship between the two persons is that the chair umpire is responsible for overseeing the match and making decisions related to the game, such as calling balls and strikes, while the man holding the tennis racket is a tennis player participating in the match.

User: Then, why is the person in black sitting on such a high chair?

Vitron: The person in black is sitting on a high chair, also known as a chair umpire's chair, to have a better vantage point to observe the tennis match and make accurate judgments. The chair's elevated position allows the umpire to have a clear view of the entire court, including the lines and the players' movements. This helps the umpire to make fair and impartial decisions during the match, such as calling balls and strikes, and ensuring that the game is played according to the rules.


• Example-4: Video Understanding

User: Can you tell me what's happening with the girl in the video?

Vitron: Looks like the girl is getting a mini golf lesson there. She's gripping a golf club, and there's a man bending down to teach her how to swing.

User: And how'd she manage to hit the ball on the floor?

Vitron: She hits the ball on the floor with a golf club, which is a piece of equipment used in the sport of golf.


• Example-5: Vision Generation

User: I would like to see a video where a dog runs past a car on the street in the snow.

Vitron: Absolutely, I've got just the video for you.

User: Hmm, it's cool but not quite what I had in mind. Let me paint a picture for you: how about starting by adding a white dog next to the car?

Vitron: Got it, I've placed a white dog right next to the car, on its right side, to be precise.

User: Awesome, now could you convert that image into a dynamic video where the dog is seen running past the car?

Vitron: Sure thing, here's your video showcasing the dog making a swift move past the car.


• Example-6: Image Editing

User: Could you modify the chair I marked to give it four legs?

Vitron: Sure thing! The chair you pointed out is sitting on a metal base. I've swapped that out for four legs instead.

User: And can you clear off all the stuff on the table?

Vitron: Done. The flower and the wine bottle? Gone. The table's all clear now.

User: Nice. How about then adding a white cat lounging on the floor?

Vitron: I went ahead and added two white cats lying on the floor. Check out the updated image.



• Example-7: Video Editing

User: Could you swap out the background for a snowy scene, kind of like a bear strolling through a winter forest?

Vitron: Great, no problem. I've made the adjustments to the background in the video, replacing it directly with a forest covered in white snow. Here's the video with the replacement.

User: That's a wonderful idea! Replacing the brown bear with a polar bear would indeed complement the snowy background better.

Vitron: That makes sense. Polar bears typically inhabit the frigid Arctic region, where the climate remains cold year-round. Consequently, they are often found in snowy landscapes or on icy surfaces. Placing a polar bear in a snowy setting aligns more with their natural habitat.


Related Links

You may refer to the related work that serves as the foundation for our framework and code repository, such as Vicuna, GLIGEN, SEEM, StableVideo, I2VGen-XL, and Zeroscope. We also partially draw inspiration from NExT-GPT.

BibTeX

@inproceedings{fei2024vitron,
  title={VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing},
  author={Fei, Hao and Wu, Shengqiong and Zhang, Hanwang and Chua, Tat-Seng and Yan, Shuicheng},
  booktitle={Proceedings of the Advances in Neural Information Processing Systems (NeurIPS)},
  year={2024},
}