Research

Multimodal Intelligence Research Group

PI: Yuan YAO

Research Direction: Multimodal large models and natural language processing.

Research Directions

Our research group focuses on building the core capabilities of multimodal large models, with a particular emphasis on deep image, text, and video understanding. Key areas include:

1. Innovation in Visual Foundation Architectures

(High-definition and efficient modality encoding and fusion architectures)

2. Omni-Modal Streaming Capabilities

(Real-time streaming capabilities integrating text, vision, and speech)

3. Efficient Scientific Training Methods

(Efficient multimodal knowledge learning and transfer)

4. Multimodal Reinforcement Learning

(Multimodal deep thinking and reasoning capabilities)

Key Achievements

Representative Achievement 1 — MiniCPM-V: A GPT-4V-Level MLLM on Your Phone

Multimodal large language models (MLLMs) typically have massive parameter counts and high computational costs, making them difficult to deploy on personal devices or in offline scenarios. Our team developed MiniCPM-V, an efficient multimodal large model that, with only 8B core parameters, surpasses leading multimodal models such as OpenAI’s GPT-4V and Google’s Gemini Pro in single-image, multi-image, and video understanding, achieving world-class performance. Built upon our research outcomes including RLAIF-V [CVPR'25 Highlight], RLHF-V [CVPR'24], LLaVA-UHD [ECCV'24], and VisCPM [ICLR'24 Spotlight], MiniCPM-V supports high-resolution images with arbitrary aspect ratios, offers state-of-the-art OCR capabilities, exhibits low hallucination rates, enables multilingual multimodal interaction in over 30 languages, and runs efficiently on edge devices such as smartphones. Our latest model, MiniCPM-o, further achieves GPT-4o-202405-level performance in vision, speech, and real-time multimodal streaming interaction. The MiniCPM series has ranked No. 1 on Hugging Face Trending, GitHub Trending, and Papers With Code Trending Research, with over 10 million cumulative downloads on open-source platforms and over 19.8k GitHub stars. The related paper was published in Nature Communications in 2025 and has received over 500 citations on Google Scholar.

- Open-source code: [https://github.com/OpenBMB/MiniCPM-V](https://github.com/OpenBMB/MiniCPM-V)

- Paper: [Efficient GPT-4V-level multimodal large language model for deployment on edge devices](https://www.nature.com/articles/s41467-025-61040-5)

- Demo: [https://minicpm-omni-webdemo.internetofagents.net](https://minicpm-omni-webdemo.internetofagents.net)
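
For readers who want to try the open-source model, the sketch below shows roughly how MiniCPM-V can be loaded through Hugging Face Transformers. It is a minimal sketch based on the usage documented in the repository; the checkpoint name (`openbmb/MiniCPM-V-2_6`) and the `chat(...)` call are assumptions that may differ across releases, so please consult the GitHub README for the exact interface.

```python
# Minimal sketch of running MiniCPM-V locally via Hugging Face Transformers.
# The model ID and the `chat` interface follow the usage documented in the
# MiniCPM-V repository for the 2.6 release and may differ for other versions.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-V-2_6"  # assumed checkpoint name; check the repo for current releases
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,          # the chat interface is provided by the model's remote code
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "Describe this image."]}]

# `chat` is defined by the model's remote code; arguments follow the repo's example usage.
answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```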


Representative Achievement 2 — RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness

Multimodal large models often suffer from severe hallucination, where model outputs contain content inconsistent with the image; even GPT-4V exhibits obvious hallucinations in 45.9% of image-based responses. Our team proposed RLAIF-V, an alignment framework that uses open-source AI feedback to improve trustworthiness, significantly reducing hallucinations at both the data and algorithm levels and enabling effective test-time scaling. The open-source model trained with this method surpasses GPT-4V on multiple hallucination benchmarks. The open-source dataset ranked No. 2 on Hugging Face Dataset Trending, and the related work was published as a Highlight paper at the top-tier AI conference CVPR 2025.

- Paper: [RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback](https://arxiv.org/abs/2312.00849)

- Open-source code: [https://rlhf-v.github.io](https://rlhf-v.github.io)
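
Alignment on preference pairs collected from AI feedback is commonly optimized with a direct-preference-optimization (DPO)-style objective. The sketch below illustrates that general technique on (chosen, rejected) response pairs; it is an illustrative example only, not the exact objective or data pipeline used in RLAIF-V.

```python
# Illustrative DPO-style preference loss on (chosen, rejected) response pairs.
# This shows the general alignment technique; it is not the exact RLAIF-V objective.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct preference optimization loss.

    Each argument is the summed log-probability of a response under the policy
    or the frozen reference model. Preference labels (which response is
    "chosen") would come from AI feedback in an RLAIF-style pipeline.
    """
    # Log-ratio of policy vs. reference for each response.
    chosen_rewards = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_rewards = beta * (policy_logp_rejected - ref_logp_rejected)
    # Encourage the chosen response to be preferred over the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities standing in for real model outputs.
lp_c, lp_r = torch.randn(4), torch.randn(4)
ref_c, ref_r = torch.randn(4), torch.randn(4)
print(dpo_loss(lp_c, lp_r, ref_c, ref_r))
```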


Representative Achievement 3 — A Deep-Learning System Bridging Molecule Structure and Biomedical Text with Comprehension Comparable to Human Professionals

Our team proposed KV-PLM, the first multimodal large model to bridge molecular structures and natural language. KV-PLM links the two modalities in both directions: given a drug molecule’s structure, it can generate natural-language descriptions of the molecule’s properties to support drug analysis; given a natural-language description of a drug’s properties, it can retrieve the corresponding molecular structure to support drug design and repurposing. The related work was published in *Nature Communications* and selected as an Editors’ Highlight.

- Paper: [A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals](https://www.nature.com/articles/s41467-022-28494-3)
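
The bidirectional retrieval described above can be pictured as nearest-neighbor search in a shared embedding space. The sketch below uses random vectors as stand-ins for the molecule and text embeddings that a jointly trained encoder such as KV-PLM would produce; it illustrates only the retrieval mechanism and is not the KV-PLM implementation.

```python
# Illustrative cross-modal retrieval between molecules (as SMILES strings) and
# text descriptions via cosine similarity in a shared embedding space.
# Random vectors stand in for embeddings from a jointly trained encoder such as KV-PLM.
import numpy as np

rng = np.random.default_rng(0)
molecules = ["CC(=O)OC1=CC=CC=C1C(=O)O", "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"]  # aspirin, caffeine
descriptions = ["analgesic and anti-inflammatory drug", "central nervous system stimulant"]

dim = 64
mol_emb = rng.normal(size=(len(molecules), dim))     # placeholder molecule embeddings
txt_emb = rng.normal(size=(len(descriptions), dim))  # placeholder text embeddings

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine-similarity matrix: rows = molecules, columns = descriptions.
sim = normalize(mol_emb) @ normalize(txt_emb).T

# Structure -> text: for each molecule, retrieve the closest description.
for i, smiles in enumerate(molecules):
    print(smiles, "->", descriptions[int(sim[i].argmax())])

# Text -> structure: for each description, retrieve the closest molecule.
for j, desc in enumerate(descriptions):
    print(desc, "->", molecules[int(sim[:, j].argmax())])
```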

Representative Publications

1. Efficient GPT-4V level multimodal large language model for deployment on edge devices. Nature Communications. 2025.

2. RLAIF-V: Open-source AI feedback leads to super GPT-4V trustworthiness. CVPR 2025 Highlight.

3. GUICourse: From general vision language models to versatile GUI agents. ACL 2025.

4. LLaVA-UHD: An LMM perceiving any aspect ratio and high-resolution images. ECCV 2024.

5. RLHF-V: Towards trustworthy MLLMs via behavior alignment from fine-grained correctional human feedback. CVPR 2024.

6. Large multilingual models pivot zero-shot multimodal learning across languages. ICLR 2024 Spotlight.

7. NExT-Chat: An LMM for chat, detection and segmentation. ICML 2024.

8. VPGTrans: Transfer visual prompt generator across LLMs. NeurIPS 2023.

9. A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals. Nature Communications. 2022. Editors' Highlights.

Group Members

  • Kechen FANG

  • Zefan WANG

  • Zihao WAN
