Elon Musk’s xAI previews Grok-1.5V, its first multimodal model

Elon Musk’s xAI has officially introduced its first-generation multimodal model that can understand documents, translate code, and process real-world situations.

The tool, named Grok-1.5V, is said to have ‘strong text capabilities’ and will soon be available to early testers and existing Grok users.

The update comes just a week after the open release of Grok-1, which concluded its pre-training phase in October 2023.

“Grok-1.5 comes with improved reasoning capabilities and a context length of 128,000 tokens,” the company said in a blog post on the xAI website.

Long-context understanding is a new feature for Grok: the model can now hold up to 16 times as much context as before. This means it can draw on information from longer documents and handle more complex prompts.

The model will still work in an instruction-following capacity but will now be able to understand documents, science diagrams, charts, screenshots, and photographs. It can also translate diagrams into Python code.

👀https://t.co/etua7Jqih8

— xAI (@xai) April 13, 2024

Grok-1.5V can understand the real world

“In order to develop useful real-world AI assistants, it is crucial to advance a model’s understanding of the physical world. Towards this goal, we are introducing a new benchmark, RealWorldQA,” said the team behind Grok-1.5V.

The benchmark will be used to evaluate the real-world spatial understanding capabilities of multimodal models. The team has provided some examples, including asking Grok which way a car can turn and which object is the largest in a flat-lay photo.

The initial release of the benchmark includes more than 700 photos, each with a question and an easily verifiable answer.

Looking into the future, the team described the need to upgrade multimodal models: “Advancing both our multimodal understanding and generation capabilities are important steps in building beneficial AGI that can understand the universe.

“In the coming months, we anticipate to make significant improvements in both capabilities, across various modalities such as images, audio, and video.”
