- This open-source project aims to train MiniMind-V, a small-parameter, vision-capable language
model, from scratch in as little as 3 hours.
- MiniMind-V is also extremely lightweight: the smallest version is approximately 1/7000 the size
of GPT-3 (on the order of 25M parameters versus GPT-3's 175B), small enough for quick inference and
even training on a personal GPU.
- MiniMind-V provides full-stage code covering the simplified model architecture, dataset cleaning
and preprocessing, supervised pretraining, and supervised instruction fine-tuning (SFT). It also
includes code for extending the model to a sparse Mixture-of-Experts (MoE) variant.
- This is not just an implementation of an open-source model; it is also a tutorial for beginners
entering the field of Vision-Language Models (VLMs).
- We hope this project can serve as a starting point for researchers: an introductory example that
helps everyone get up to speed quickly and inspires further exploration and innovation in the VLM
domain.
- To prevent misinterpretation, "from scratch" specifically means taking the pure language model
MiniMind (itself a GPT-like model trained entirely from scratch) and extending its visual
capabilities from 0 to 1; a minimal sketch of the general pattern behind such an extension follows
this list. For details on the base language model, please refer to the twin project MiniMind.
- To avoid misinterpretation, "as little as 3 hours" means you need a machine with a hardware
configuration superior to mine; detailed specifications are provided below.
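
As noted above, "from scratch" here means grafting visual capability onto an existing pure language
model. The snippet below is a minimal, hypothetical PyTorch sketch of the common pattern behind
this kind of extension (a vision encoder, a small projection layer into the LLM's embedding space,
and the original decoder). It illustrates the general idea only and is not MiniMind-V's actual
code; the class name, layer choices, and dimensions are made up for the example.

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Illustrative only: vision-encoder stand-in + projector + tiny decoder."""

    def __init__(self, vocab_size=6400, llm_dim=512, vision_dim=768):
        super().__init__()
        # Stand-in for a pretrained vision encoder (e.g. a CLIP ViT); in practice
        # it would be loaded with pretrained weights and usually kept frozen.
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)
        # The genuinely new trainable piece: a projector mapping vision features
        # into the language model's token-embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)
        # Stand-ins for the existing language model's embedding table and decoder
        # stack (causal masking is omitted to keep the sketch short).
        self.tok_emb = nn.Embedding(vocab_size, llm_dim)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, patch_feats, input_ids):
        # patch_feats: (batch, num_patches, vision_dim) image patch features
        # input_ids:   (batch, seq_len) text token ids
        img_tokens = self.projector(self.vision_encoder(patch_feats))
        txt_tokens = self.tok_emb(input_ids)
        # Image tokens are simply prepended to the text sequence, so the decoder
        # attends to them like ordinary tokens.
        h = self.decoder(torch.cat([img_tokens, txt_tokens], dim=1))
        return self.lm_head(h)

# Quick shape check with random inputs: 196 image "tokens" + 32 text tokens.
model = ToyVLM()
logits = model(torch.randn(2, 196, 768), torch.randint(0, 6400, (2, 32)))
print(logits.shape)  # torch.Size([2, 228, 6400])
```

The point of this pattern is that image features enter the model as a short sequence of extra
"tokens", so the language model's own architecture does not need to change; only the projection
layer (plus whichever parts of the LLM are left unfrozen) carries the new visual learning.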