- New multimodal feature lets users upload images for analysis
- Company says “visual primitives” approach improves reasoning efficiency and accuracy
DeepSeek, developer of China’s leading large language models, has rolled out an image recognition feature to users, marking a shift away from text-only interaction toward multimodal AI that can process and interpret visual inputs.
The update allows users to upload images directly for analysis, enabling the model to read text, identify objects and interpret visual context — including details in cultural relics or implied meaning in facial expressions.
DeepSeek’s image understanding feature began a phased rollout in late April 2026 and was made widely available to users on May 9, according to the company.
Alongside the release, DeepSeek published a technical report outlining its multimodal architecture, which it calls “Thinking with Visual Primitives.”
The company said the approach differs from conventional multimodal systems by integrating spatial elements such as points and bounding boxes directly into the reasoning process, rather than relying on high-level language descriptions.
DeepSeek argues this helps address what it describes as a “reference gap” in traditional models when handling dense visual scenes, where vague textual descriptions can lead to reasoning drift or loss of precision.
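To make the idea concrete, here is a minimal sketch of what a reasoning step grounded in spatial primitives might look like. The names (`Point`, `Box`, the `step` dictionary) are illustrative assumptions, not DeepSeek's API; the report describes only the general idea of attaching coordinates to reasoning rather than relying on prose descriptions.

```python
# Hypothetical sketch of "visual primitives" in a reasoning trace.
# All names here are illustrative; DeepSeek has not published an API.
from dataclasses import dataclass

@dataclass
class Point:
    """A single pixel coordinate the model can refer to."""
    x: int
    y: int

@dataclass
class Box:
    """A bounding box (top-left to bottom-right) anchoring a claim to a region."""
    x0: int
    y0: int
    x1: int
    y1: int

# A conventional multimodal model might reason in prose:
#   "the small vase on the left shelf" -- ambiguous in a dense scene.
# A primitive-grounded step pins the reference to exact coordinates,
# which is the "reference gap" the company says it is closing:
step = {
    "claim": "This vase matches a Ming-era glaze pattern",
    "evidence": Box(x0=112, y0=340, x1=178, y1=455),  # region the claim refers to
}
print(step["claim"], "->", step["evidence"])
```

The design point is that the coordinates travel with the reasoning chain, so a later step can re-reference the exact same region instead of re-describing it in words.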
On efficiency, the company said processing an 800 by 800-pixel image consumes around 90 tokens, compared with about 870 tokens for Claude Sonnet 4.6 and roughly 1,100 tokens for Gemini 3 Flash, according to its report.
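The reported figures imply very different effective "pixels per token" across the three models. The arithmetic below uses only the numbers from DeepSeek's report; the equivalent square-patch size is an illustrative assumption (none of the vendors discloses its actual image tokenizer).

```python
# Implied pixels-per-token for the token counts reported by DeepSeek.
# Patch-side math assumes a uniform square-patch tokenizer (an assumption,
# not a disclosed implementation detail of any of these models).
import math

PIXELS = 800 * 800  # the 800x800 test image cited in the report

reported_tokens = {
    "DeepSeek": 90,
    "Claude Sonnet 4.6": 870,
    "Gemini 3 Flash": 1100,
}

for model, tokens in reported_tokens.items():
    pixels_per_token = PIXELS / tokens
    patch_side = math.sqrt(pixels_per_token)  # side of an equivalent square patch
    print(f"{model}: {pixels_per_token:,.0f} px/token "
          f"(~{patch_side:.0f}x{patch_side:.0f} patch)")
```

By these figures, DeepSeek's encoding is roughly 10x denser than Claude Sonnet 4.6's and about 12x denser than Gemini 3 Flash's for this image size.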
In benchmark tests, DeepSeek reported a score of 89.2% on Pixmo-Count, a benchmark dataset used to evaluate AI models’ ability to accurately count objects in images, edging out Gemini 3 Flash at 88.2%. On maze navigation tasks it scored 66.9%, while other leading models did not exceed 51%.
However, the company also noted limitations in the current system. The model’s knowledge base is updated only through 2025, meaning it may misidentify newer products.
It also remains unstable in more complex tasks such as optical illusions and difficult counting problems, and does not yet support image generation.
The launch is seen as a step toward closing gaps in DeepSeek’s multimodal capabilities and is expected to intensify competition among Chinese AI developers in visual understanding systems.
