DeepSeek unveils AI model that uses visual perception to compress text input


New release continues Chinese start-up's efforts to raise AI models' efficiency, while driving down the costs of building and using them. — SCMP

DeepSeek on Monday released a new multimodal artificial intelligence model that can handle large and complex documents with significantly fewer tokens – the smallest unit of text that a model processes – by using visual perception as a compression medium for information.

The open-source DeepSeek-OCR (optical character recognition) model, available via online developer platforms Hugging Face and GitHub, was the result of an “investigation into the role of vision encoders” to compress text for large language models (LLMs), the Hangzhou-based AI start-up said in a blog post.

By using that approach, LLMs would be able to process a massive amount of text without incurring a proportional increase in computing cost.

“Through DeepSeek-OCR, we demonstrated that vision-text compression can achieve significant token reduction – seven to 20 times – for different historical context stages, offering a promising direction” to address long-context challenges in LLMs, the company said.

The release reflects DeepSeek’s continued push to raise the efficiency of AI models while driving down the costs of building and using them – a principle the company followed in developing its breakthrough open-source models V3 and R1, released in December and January, respectively.

According to the company’s blog post, DeepSeek-OCR consists of two main components: an encoder, DeepEncoder, and a decoder, DeepSeek3B-MoE-A570M.

The former acts as the model’s core engine. It maintains low activation under high-resolution inputs, while achieving strong compression ratios to reduce the number of tokens.

The decoder, a Mixture-of-Experts (MoE) model with 570 million parameters, reconstructs the original text. The MoE architecture divides the model into separate sub-networks, or “experts”, that specialise in a subset of the input data to jointly perform a task.
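The routing idea behind an MoE layer can be illustrated with a toy sketch: a gating network scores each expert for a given input, only the top-scoring experts run, and their outputs are combined with softmax weights. This is a minimal illustration of the general technique, not DeepSeek's actual code; all names, shapes, and the use of plain linear maps as "experts" are assumptions for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, experts, gate_w, top_k=2):
    """Route input x to the top-k experts chosen by a gating network,
    returning the softmax-weighted sum of their outputs."""
    logits = x @ gate_w                        # one gating score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the top-k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                   # softmax over selected experts only
    # Only the selected experts execute, so the number of *active*
    # parameters stays far below the model's total parameter count.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

d, n_experts = 8, 4
# Toy "experts": each is just a fixed linear map in this sketch.
expert_mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda v, M=M: v @ M for M in expert_mats]
gate_w = rng.normal(size=(d, n_experts))

x = rng.normal(size=d)
y = moe_forward(x, experts, gate_w, top_k=2)
print(y.shape)  # (8,)
```

In a real MoE decoder like DeepSeek3B-MoE-A570M, the experts are full feed-forward networks and routing happens per token, which is how a 3-billion-parameter model can activate only about 570 million parameters per inference step.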

Apart from handling standard vision tasks such as image captioning and object detection, DeepSeek-OCR can also be used to parse highly structured visual content – including tables, formulas and geometric diagrams – which can benefit its application in the fields of finance and science, according to the company.

Citing benchmark tests, the company said that when the number of text tokens was within ten times the number of vision tokens – a compression ratio below 10× – DeepSeek-OCR achieved 97 per cent decoding accuracy.

Even at a 20× ratio, the model recorded around 60 per cent accuracy, highlighting its ability to preserve information despite extreme compression.
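The reported trade-off can be made concrete with simple arithmetic. The ratios and accuracy figures below come from the article; the document size is a hypothetical example.

```python
def vision_tokens_needed(text_tokens: int, compression_ratio: int) -> int:
    """Approximate vision tokens required to represent a document
    of `text_tokens` at a given vision-text compression ratio."""
    return text_tokens // compression_ratio

doc_text_tokens = 100_000  # hypothetical long document

# At ~10x compression (reported ~97% decoding accuracy):
print(vision_tokens_needed(doc_text_tokens, 10))  # 10000

# At ~20x compression (reported ~60% decoding accuracy):
print(vision_tokens_needed(doc_text_tokens, 20))  # 5000
```

A document that would cost 100,000 text tokens to process directly could thus be fed to the LLM as roughly 10,000 vision tokens with little loss, or 5,000 vision tokens where lower fidelity is acceptable.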

DeepSeek-OCR can also be used to parse highly structured visual content – including tables, formulas and geometric diagrams. Photo: dpa

On OmniDocBench, a benchmark for diverse document understanding, DeepSeek-OCR outperformed major OCR models such as GOT-OCR 2.0 and MinerU 2.0, while using far fewer tokens.

The new model can also generate more than 200,000 pages of training data per day on a computing system powered by a single Nvidia A100-40G graphics processing unit, according to the company.

With DeepSeek-OCR, users would be able to handle scalable, ultra-long-context processing, in which recent content is preserved at high resolution while older context consumes fewer computing resources. That suggests DeepSeek-OCR could pave the way for theoretically unlimited context architectures that balance information retention with efficiency.

In late September, the company launched DeepSeek V3.2-Exp – an “experimental” version of its V3 model that improves training and inference efficiency, while sharply reducing the application programming interface costs. – South China Morning Post
