- This project aims to train MiniMind-V, a super-small multimodal vision-language model, from
  scratch for just 3 RMB in cost and 2 hours of work!
- The smallest version of MiniMind-V is only about 1/7000 the size of GPT-3, designed to
  enable fast inference and even training on a personal GPU.
- MiniMind-V extends the MiniMind pure language model with visual capabilities.
- The project provides the full code for the minimalist VLM architecture, dataset cleaning,
  pretraining, and supervised fine-tuning (SFT); a minimal structural sketch follows after this list.
- This is not only the smallest open-source VLM implementation but also a concise tutorial for
  beginners in vision-language models.
- The hope is that this project can provide a useful example to inspire others and share the joy of creation,
helping to drive progress in the wider AI community!
- To avoid misunderstanding, the "2 hours" is based on a test run (`1 epoch`) on a single
  NVIDIA 3090 GPU, and the "3 RMB" refers to the corresponding GPU server rental cost.