New release continues the Chinese start-up's efforts to improve the efficiency of AI models while driving down the costs of building and using them. — SCMP
DeepSeek on Monday released a new multimodal artificial intelligence model that can handle large and complex documents with significantly fewer tokens – the smallest units of text that a model processes – by using visual perception as a compression medium for information.
The open-source DeepSeek-OCR (optical character recognition) model, available via the online developer platforms Hugging Face and GitHub, was the result of an “investigation into the role of vision encoders” in compressing text for large language models (LLMs), the Hangzhou-based AI start-up said in a blog post.
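For readers who want to experiment with the release, the sketch below shows how such an open-source checkpoint is typically pulled from Hugging Face with the Transformers library. The repository id and the need for `trust_remote_code` are assumptions based on how DeepSeek has published earlier models, not details confirmed by the article; the exact inference call is defined by the model's own repository and model card.

```python
# Minimal sketch, assuming the model is published on Hugging Face under the
# DeepSeek organisation and ships custom modelling code (hence trust_remote_code).
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

# A document page would be supplied as an image: the vision encoder compresses it
# into far fewer tokens than the raw text would occupy before the language model
# decodes the content. See the model card for the repo-specific inference method.
```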
