Train a 65M Vision-Language Model from scratch.
2 hours. ¥2.6. One 3090.
See the World.
📸 Vision Multimodal
Full vision-language capabilities on personal GPUs.
📦 Complete VLM Stack
SigLIP2 encoder → Projector → LLM → Pretrain → SFT (sketched below)
💰 Ultra-Affordable
Single 3090 GPU, minimal compute resources needed.
📖 Pure PyTorch
Transparent implementation. Learn from reading the code.
🚀 Production Ready
OpenAI API compatible, ready for real-world deployment.
🔌 Extensible
Multi-image support, fine-tuning, and more capabilities.
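To make the stack concrete, here is a minimal PyTorch sketch of the data flow: a frozen vision encoder feeds a small projector, whose output tokens are consumed by the LLM. The class, attribute names, and dimensions are illustrative assumptions, not the repository's exact API.

```python
import torch
import torch.nn as nn

class TinyVLMSketch(nn.Module):
    """Illustrative data flow only: frozen vision encoder -> projector -> LLM.
    Names and dimensions are assumptions, not MiniMind-V's exact modules."""

    def __init__(self, vision_encoder, llm, vision_dim=768, llm_dim=512):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a SigLIP2 vision tower
        self.projector = nn.Sequential(        # maps vision features into the LLM's space
            nn.LayerNorm(vision_dim),
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                         # small decoder-only language model

    def forward(self, pixel_values, text_embeds):
        with torch.no_grad():                  # vision tower stays frozen during training
            vision_feats = self.vision_encoder(pixel_values)  # assumed (B, 64, vision_dim)
        image_embeds = self.projector(vision_feats)           # (B, 64, llm_dim)
        # Image tokens are placed ahead of the text tokens and decoded together.
        return self.llm(inputs_embeds=torch.cat([image_embeds, text_embeds], dim=1))
```

Keeping the vision tower frozen and training only the projector (plus, depending on the stage, parts of the LLM) is consistent with the freeze strategy noted in the update log below, and is what keeps the compute budget small enough for a single 3090.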
| Model | Parameters | Inference Memory | Release |
|---|---|---|---|
| minimind-3v-moe | 200M-A65M | 1.0 GB | 2026.04.20 |
| minimind-3v | 65M | 0.5 GB | 2026.04.20 |
| MiniMind2-V | 104M | 1.1 GB | 2025.02.20 |
| MiniMind2-Small-V | 26M | 0.6 GB | 2025.02.20 |
- New checkpoints: minimind-3v (65M) / minimind-3v-moe (200M-A65M)
- Projector: added LayerNorm, removed reshape token merging (P32 natively outputs 64 tokens)
- Vision Encoder switched to SiglipVisionModel (P32, fixed 256×256)
- Training data moved to ALLaVA-4V (Pretrain 1.27M / SFT 2.9M, merged into single-stage SFT)
- Freeze strategy: freeze_llm=1 unfreezes the first and last LLM layers (sketched after this list); Pretrain/SFT defaults now 2/1; max_seq_len 360 → 450
- Misc bugfixes and small tweaks
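A rough sketch of how such a freeze strategy can be applied, assuming the model exposes an `llm` with a `layers` list and a `projector` module (these attribute names are hypothetical, not the repository's actual ones):

```python
def apply_freeze_strategy(model, freeze_llm=True):
    """Hypothetical helper mirroring the freeze strategy described above:
    with freeze_llm enabled, only the projector plus the first and last
    LLM blocks stay trainable."""
    if not freeze_llm:
        return
    # Freeze every LLM parameter first ...
    for p in model.llm.parameters():
        p.requires_grad = False
    # ... then unfreeze only the first and last transformer blocks.
    for block in (model.llm.layers[0], model.llm.layers[-1]):
        for p in block.parameters():
            p.requires_grad = True
    # The projector is always trained.
    for p in model.projector.parameters():
        p.requires_grad = True
```

The intent, roughly, is to keep most LLM weights fixed while still letting the boundary blocks adapt to the incoming vision tokens and the output distribution.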
- Added minimind-3v (67M) and minimind-3v-moe (201M-A67M) models
- Unified 768+8 architecture, supporting both dense and moe modes
- Switched Visual Encoder from CLIP to SigLIP2 (siglip2-base-p16-256-ve)
- Replaced QFormer with MLP Projection + reshape compression
- Dataset format updated to Parquet with mixed data sources
- Tokenizer updated with the image placeholder token <|image_pad|> (spliced into the input sequence as sketched after this list)
- New WebUI with dynamic model directory scanning and a dropdown for model switching
- Model code refactored; LLM and VLM unified under the Transformers format
- Training scripts support DDP multi-GPU, bfloat16 mixed precision, torch.compile acceleration
- Bug fix: mismatched model weights
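The <|image_pad|> placeholders make it straightforward to splice projected vision features into the token sequence. The helper below is a sketch of that idea under assumed shapes; it is not the repository's actual implementation.

```python
import torch

def splice_image_features(input_ids, text_embeds, image_embeds, image_pad_id):
    """Illustrative only: replace the embeddings at <|image_pad|> positions with
    projected vision features.

    input_ids:    (B, T)     token ids containing image_pad_id placeholders
    text_embeds:  (B, T, D)  token embeddings from the LLM embedding table
    image_embeds: (B, N, D)  projected vision tokens (e.g. N = 64)
    """
    embeds = text_embeds.clone()
    for b in range(input_ids.size(0)):
        positions = (input_ids[b] == image_pad_id).nonzero(as_tuple=True)[0]
        # One placeholder slot per vision token; copy the features into those slots.
        embeds[b, positions] = image_embeds[b, : positions.numel()].to(embeds.dtype)
    return embeds
```

With the P32 encoder producing 64 tokens per image (as noted in the update above), one would expect the chat template to repeat <|image_pad|> 64 times per image, though the exact templating is best checked against the code.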
- Adapted to the latest minimind updates
- Refactored training and evaluation scripts
- Added complete checkpoint resume support
- Compatibility updates
- Adapted to new features in the minimind repository
- Standardized parts of the code
- MiniMind2-V updated alongside MiniMind2
- Significant reduction of redundant code
- Major simplification of model structure
- Updated dataset format with new SFT datasets
- Better performance than the previous VLM!
- MiniMind-V released on schedule
- First open-source release of the vision-language model
🎯 Try It Online
📦 Get the Code
💡 Why MiniMind-V?
- Train vision-language models from scratch on consumer GPUs
- Pure PyTorch implementation with no hidden complexity
- Learn multimodal AI by building and reading code
- Works offline on your personal hardware
- Complete pipeline from data to deployment
- OpenAI API compatible for easy integration (usage example below)
- Smaller than most open-source VLMs yet powerful
- Perfect for education and research
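Because the serving endpoint follows the OpenAI chat-completions format, the standard `openai` Python client can talk to it. The base URL, port, and model name below are placeholders for however you launch the local server, and the image-content shape is the OpenAI vision message format, which the local endpoint is assumed to accept.

```python
# Querying a locally served MiniMind-V through its OpenAI-compatible endpoint.
# base_url, api_key, model name, and the image URL are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="minimind-3v",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```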
💭 "Seeing the world through a smaller lens, yet just as sharp."