
Making LLaVAs Tiny with MoE Knowledge Distillation

LLaVA (Large Language and Vision Assistant) models are a game-changer for understanding and interacting with the world around us. They combine the power of large language models (LLMs) with vision encoders to perform tasks like image captioning and visual question answering. But LLaVAs are also resource-intensive, requiring massive training datasets and computational power, which makes them difficult to deploy on devices with limited resources.

In a new paper published on the preprint server arXiv, researchers at Alibaba, The Chinese University of Hong Kong, and Beihang University introduce LLaVA-MoD, a new framework for efficiently training smaller LLaVAs. The key idea is "knowledge distillation", a technique in which a smaller model learns from a larger, more powerful model. This enables the smaller model to learn complex tasks without requiring as much training data.

To make knowledge distillation even more efficient, the researchers use a Mixture-of-Experts (MoE) architecture. Instead of running every input through a single dense language model, the s-LLaVA (small-scale LLaVA) routes each input to a few "experts," each specializing in a different aspect of the task. Because only the selected experts are activated for a given input, the s-LLaVA makes better use of its limited resources.
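To make the idea concrete, here is a minimal sketch of a sparse MoE feed-forward block with top-k routing in PyTorch. The class name `MoEFeedForward`, the expert count, and the routing details are illustrative assumptions, not the exact architecture used in LLaVA-MoD.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Sparse Mixture-of-Experts feed-forward block with top-k routing.

    Illustrative only: expert count, hidden sizes, and routing scheme are
    assumptions, not the configuration used in LLaVA-MoD.
    """
    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)  # scores each token per expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        gate_logits = self.router(x)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):  # only the selected experts actually run
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```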

The researchers propose a two-stage training strategy. In the first stage, called Mimic Distillation, the s-LLaVA learns to mimic the output distribution of the l-LLaVA (large-scale LLaVA) teacher. This gives the s-LLaVA the general knowledge required to perform the task.
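As a rough sketch, this kind of mimic objective can be written as a KL divergence between the student's and teacher's next-token distributions. The function name, temperature, and scaling below are generic distillation conventions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mimic_distillation_loss(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor,
                            temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between student and teacher output distributions.

    A generic distillation objective; temperature and scaling are assumptions
    rather than LLaVA-MoD's exact settings.
    """
    # Soften both distributions, then measure how far the student's
    # predictions are from the teacher's.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return kl * (temperature ** 2)  # standard scaling to keep gradient magnitudes comparable
```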

In the second stage, called Preference Distillation, the s-LLaVA goes beyond just imitating the l-LLaVA. It learns to discriminate between good and bad examples, which helps reduce hallucination (generating incorrect or misleading information).
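One common way to implement this kind of preference learning is a DPO-style loss, where the student is pushed to rank a good response above a bad one by a larger margin than a frozen reference model. The sketch below follows that pattern; the function name, pairing scheme, and `beta` value are illustrative assumptions, not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def preference_distillation_loss(student_logp_good: torch.Tensor,
                                 student_logp_bad: torch.Tensor,
                                 teacher_logp_good: torch.Tensor,
                                 teacher_logp_bad: torch.Tensor,
                                 beta: float = 0.1) -> torch.Tensor:
    """DPO-style preference loss sketch.

    Inputs are summed log-probabilities of each full response under the
    student and the frozen teacher; beta and the pairing are assumptions.
    """
    student_margin = student_logp_good - student_logp_bad
    teacher_margin = teacher_logp_good - teacher_logp_bad
    # Reward the student when it separates good from bad responses by a
    # larger margin than the reference (teacher) model does.
    return -F.logsigmoid(beta * (student_margin - teacher_margin)).mean()
```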

To demonstrate the effectiveness of their approach, the researchers compared LLaVA-MoD with other state-of-the-art models on a range of benchmarks. They found that LLaVA-MoD matches or outperforms existing models while requiring significantly less training data and computational resources. For example, their 2-billion-parameter LLaVA-MoD model surpasses a 7-billion-parameter model by an average of 8.8% across benchmarks, while using only 0.3% of the training data.

LLaVA-MoD demonstrates that it is possible to train smaller, more efficient LLaVAs without sacrificing performance. This opens the door to a wider range of applications for these powerful models, enabling developers to bring the power of visual language understanding to more devices and scenarios.