VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

1Skywork AI, Singapore   2National University of Singapore   3Nanyang Technological University

Existing vision LLMs might still encounter challenges such as superficial instance-level understanding, a lack of unified support for both images and videos, and insufficient coverage across various vision tasks. To fill these gaps, we present Vitron, a universal pixel-level vision LLM designed for comprehensive understanding (perceiving and reasoning), generating, segmenting (grounding and tracking), and editing (inpainting) of both static image and dynamic video content.


Figure 1: Task support and key features of Vitron.


Recent developments in vision large language models (LLMs) have shown remarkable progress, yet these models still face challenges on the path toward multimodal generalists, such as coarse-grained instance-level understanding, a lack of unified support for both images and videos, and insufficient coverage across various vision tasks. To fill these gaps, we present Vitron, a universal pixel-level vision LLM designed for comprehensive understanding, generating, segmenting, and editing of both static image and dynamic video content. Building on an LLM backbone, Vitron incorporates specialized encoders for images, videos, and pixel-level regional visuals in its frontend architecture, while its backend employs a text-centric invocation strategy to integrate diverse state-of-the-art off-the-shelf modules tailored to an array of vision-related end tasks. In this way, Vitron supports a spectrum of vision end tasks, spanning from visual understanding to visual generation, from low level to high level. Through joint vision-language alignment and fine-grained region-aware instruction tuning, Vitron achieves precise pixel-level perception. We further enhance its capabilities with invocation-oriented instruction tuning, allowing flexible and precise module invocation for downstream vision tasks. Demonstrated on 12 visual tasks and evaluated across 22 datasets, Vitron showcases its extensive capabilities in the four main vision task clusters, i.e., segmentation, understanding, content generation, and editing. Various demonstrations also illustrate Vitron's strengths in visual manipulation and user interactivity. Overall, this work illuminates the great potential of developing a more unified and interactive visual multimodal generalist, setting new frontiers for future vision research.

Technical Description


• Architecture


Figure 2: Overview of the Vitron framework.


  • Frontend Block: Vision-Language Encoding. Vitron includes an image encoder, a video encoder, and a regional sketch encoder.
  • Central Block: Core LLM. Vitron employs Vicuna-7B (v1.5) as its core LLM.
  • Backend Block: User Responding & Module Invocation. Vitron employs a text-centric invocation strategy to integrate diverse off-the-shelf SoTA modules, supporting a spectrum of vision end tasks spanning from visual understanding to visual generation, from low level to high level (a minimal dispatch sketch is given after Table 1).



Table 1: Summary of backend modules in Vitron.
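To make the text-centric invocation concrete, below is a minimal dispatch sketch showing how a parsed LLM response could be routed to one of the off-the-shelf backends summarized in Table 1. The wrapper functions and registry keys are illustrative assumptions, not Vitron's actual code.

```python
from typing import Any, Callable, Dict, Optional

# Illustrative wrappers around off-the-shelf backends (see Table 1), each
# assumed to accept a free-text command and an optional region box.
def run_segmentation(command: str, region: Optional[list] = None) -> Any: ...
def run_image_generation(command: str, region: Optional[list] = None) -> Any: ...
def run_video_editing(command: str, region: Optional[list] = None) -> Any: ...

# Registry keyed by the "module name" field of the structured LLM response
# (described under Step-III below). Keys here are assumptions for the sketch.
MODULE_REGISTRY: Dict[str, Callable[..., Any]] = {
    "segmentation": run_segmentation,
    "image_generation": run_image_generation,
    "video_editing": run_video_editing,
}

def dispatch(module_name: str, invocation_command: str,
             region: Optional[list] = None) -> Any:
    """Route one parsed LLM response to the corresponding backend module."""
    if module_name not in MODULE_REGISTRY:
        raise ValueError(f"No backend registered for module '{module_name}'")
    return MODULE_REGISTRY[module_name](invocation_command, region=region)
```

The intent of the text-centric design, as described above, is that the LLM only emits text selecting and parameterizing a backend, so the concrete modules remain swappable.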


• Pixel-aware Vision-Language Understanding Learning

We train Vitron in three distinct phases to endow it with robust vision understanding and task execution capabilities.

  • Step-I: Overall Vision-Language Alignment Learning. We map the input vision and language features into a unified feature space, thereby enabling the LLM to comprehend the input multimodal signals effectively (a minimal projection sketch is given after this list).
  • Step-II: Fine-grained Spatiotemporal Vision Grounding Instruction Tuning. We propose fine-grained spatiotemporal vision grounding instruction tuning for Vitron. The core idea is to enable the LLM to ground the fine-grained spatiality of images and the detailed temporality of videos.
  • Step-III: Output-side Invocation-oriented Instruction Tuning. The second training phase endows both the LLM and the frontend encoders with the ability to understand vision at the pixel level. This final step, invocation-oriented instruction tuning, aims to equip the system with the capability to accurately execute commands, allowing the LLM to generate appropriate and correct invocation text. Because different terminal vision tasks might require distinct invocation commands, we standardize the LLM's response output into a structured text format (a minimal parsing sketch follows Figure 3), which includes:
    • 1) User response output, which directly replies to the user's input.
    • 2) Module name, indicating which function or task is to be executed.
    • 3) Invocation command, a meta-instruction for triggering the task module.
    • 4) Region (optional), specifying a fine-grained vision feature needed for certain tasks, such as video tracking or vision editing, where the backend modules require this information. For the region, the LLM outputs a bounding box described by coordinates, based on its pixel-level understanding.
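As a rough illustration of Step-I, the sketch below shows the common recipe of projecting frozen vision-encoder features into the LLM's embedding space through a learned projector. The class name, dimensions, and single-linear-layer design are assumptions for this sketch, not Vitron's exact configuration.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Map vision-encoder features into the LLM token-embedding space.

    Dimensions are placeholders: `vision_dim` is the output size of a frozen
    image/video/sketch encoder, `llm_dim` the LLM hidden size (4096 for a
    7B Vicuna-style model).
    """
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_tokens, vision_dim) from a frozen encoder
        # returns:      (batch, num_tokens, llm_dim), ready to be interleaved
        #               with text token embeddings as LLM input
        return self.proj(vision_feats)
```

A common recipe at this stage is to keep the vision encoders frozen and train only the projector (and optionally the LLM); whether Vitron follows exactly this freezing schedule is not specified on this page.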


Figure 3: A video tracking example for the structured LLM response for module invocation.
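For illustration, here is a minimal sketch of parsing such a structured response into the four fields listed above. The tag-style template (e.g., <module>…</module>) and field names are assumptions made for this sketch, not the exact output format Vitron's LLM is tuned to emit.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class InvocationResponse:
    user_response: str                   # 1) text shown directly to the user
    module: Optional[str] = None         # 2) which backend task to execute
    command: Optional[str] = None        # 3) meta-instruction for that module
    region: Optional[List[int]] = None   # 4) optional box [x1, y1, x2, y2]

def parse_response(text: str) -> InvocationResponse:
    """Split an LLM output into user reply, module name, command, and region."""
    def grab(tag: str) -> Optional[str]:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.S)
        return m.group(1).strip() if m else None

    region_str = grab("region")
    region = [int(v) for v in re.findall(r"-?\d+", region_str)] if region_str else None
    # Whatever is left after stripping the tagged fields is the user-facing reply.
    user_reply = re.sub(r"<(module|command|region)>.*?</\1>", "", text, flags=re.S).strip()
    return InvocationResponse(user_reply, grab("module"), grab("command"), region)
```

In the video-tracking case of Figure 3, for example, the parsed module would name the tracking backend, the command would carry the meta-instruction for it, and the region would provide the initial bounding box handed to the tracker.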


Demonstrations



• Example-1: Image Segmentation

User: Could you help me segment this item?

Vitron: Sure! Looks like you're aiming to cut out a grey pot.

User: I'm actually curious about the target in the middle of the two pots, could you assist?

Vitron: Absolutely! That would be a dog. Let's outline it.

User: You are smart! Then highlight the item I pointed out here, will you?

Vitron: Sure thing! Let's focus on the pot in front of the dog.


• Example-2: Video Tracking

User: Can you outline the grey hen for me in the video?

Vitron: Just checked out the video; it's a bunch of hens drinking water. Alright, I'll now proceed to track it through the pine forest scene for you.

User: Oh, and can you make sure to focus on the one in the middle at any timeframe in the video?

Vitron: Got it, there are actually three penguins. I'll stay on the middle one then.


• Example-3: Image Understanding

User: Could you tell me what's happening in this photo?

Vitron: The picture features a group of people riding in a small electric propulsion boat on a body of water. There are three people visible in the boat, with one person sitting in the front, another in the middle, and the third person in the back. The boat appears to be a motorboat with an electric motor, providing a more environmentally friendly mode of transportation. The people seem to be enjoying their time on the water, possibly engaging in recreational activities or simply taking a leisurely ride.

User: So what is the possible relationship between the two persons according to the picture?

Vitron: In the image, there is a man holding a tennis racket and a chair umpire on a tennis court. The relationship between the two persons is that the chair umpire is responsible for overseeing the match and making decisions related to the game, such as calling balls and strikes, while the man holding the tennis racket is a tennis player participating in the match.

User: Then, why is the person in black sitting on such a high chair?

Vitron: The person in black is sitting on a high chair, also known as a chair umpire's chair, to have a better vantage point to observe the tennis match and make accurate judgments. The chair's elevated position allows the umpire to have a clear view of the entire court, including the lines and the players' movements. This helps the umpire make fair and impartial decisions during the match, such as calling balls and strikes, and ensuring that the game is played according to the rules.


• Example-4: Video Understanding

User: Can you tell me what's happening with the girl in the video?

Vitron: Looks like the girl is getting a mini golf lesson there. She's gripping a golf club, and there's a man bending down to teach her how to swing.

User: And how'd she manage to hit the ball on the floor?

Vitron: She hits the ball on the floor with a golf club, which is a piece of equipment used in the sport of golf.


• Example-5: Vision Generation

User: I would like to see a video where a dog runs past a car on the street in the snow.

Vitron: Absolutely, I've got just the video for you.

User: Hmm, it's cool but not quite what I had in mind. Let me paint a picture for you: how about starting with adding a white dog next to the car?

Vitron: Got it, I've placed a white dog right next to the car, on its right side, to be precise.

User: Awesome, now could you convert that image into a dynamic video where the dog is seen running past the car?

Vitron: Sure thing, here's your video showcasing the dog making a swift move past the car.


• Example-6: Image Editing

User: Could you modify the chair I marked to give it four legs?

Vitron: Sure thing! The chair you pointed out is sitting on a metal base. I've swapped that out for four legs instead.

User: And can you clear off all the stuff on the table?

Vitron: Done. The flower and the wine bottle? Gone. The table's all clear now.

User: Nice. How about then adding a white cat lounging on the floor?

Vitron: I went ahead and added two white cats lying on the floor. Check out the updated image.



• Example-7: Video Editing

User: Could you swap out the background for a snowy scene, kind of like a bear strolling through a winter forest?

Vitron: Great, no problem. I've made the adjustments to the background in the video, replacing it directly with a forest covered in white snow. Here's the video with the replacement.

User: That's a wonderful idea! Replacing the brown bear with a polar bear would indeed complement the snowy background better.

Vitron: That makes sense. Polar bears typically inhabit the frigid Arctic region, where the climate remains cold year-round. Consequently, they are often found in snowy landscapes or on icy surfaces. Placing a polar bear in a snowy setting aligns more with their natural habitat.


Related Links

You may refer to related work that serves as the foundation for our framework and code repository, such as Vicuna, GLIGEN, SEEM, StableVideo, I2VGen-XL, and Zeroscope. We also partially draw inspiration from NExT-GPT.

BibTeX

@article{hao2024vitron,
  title={Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing},
  author={Fei, Hao and Wu, Shengqiong and Zhang, Hanwang and Chua, Tat-Seng and Yan, Shuicheng},
  journal={CoRR},
  year={2024}
}