Train a 26M-parameter Vision-Language Model from scratch.
1 hour. ¥1.3. One 3090.
See the World.
📸 Vision Multimodal
Full vision-language capabilities on personal GPUs.
📦 Complete VLM Stack
Vision encoder → Projector → LLM → Pretrain → SFT (sketched in code after this list).
💰 Ultra-Affordable
Single 3090 GPU, minimal compute resources needed.
📖 Pure PyTorch
Transparent implementation. Learn from reading the code.
🚀 Production Ready
OpenAI API compatible, ready for real-world deployment.
🔌 Extensible
Multi-image support, fine-tuning, and more capabilities.
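
The stack above is just a handful of modules wired in sequence: a frozen vision encoder, a trainable projector, and the language model. The PyTorch sketch below shows that data flow only; the class names and dimensions are illustrative assumptions, not the repository's actual modules.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Maps vision-encoder features into the LLM's embedding space."""

    def __init__(self, vision_dim: int = 768, llm_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim), e.g. from a CLIP encoder
        return self.proj(vision_feats)


class TinyVLM(nn.Module):
    """Sketch of the vision encoder -> projector -> LLM pipeline."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 768, llm_dim: int = 512):
        super().__init__()
        self.vision_encoder = vision_encoder      # typically kept frozen
        self.projector = VisionProjector(vision_dim, llm_dim)
        self.llm = llm

    def forward(self, pixel_values: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                     # do not update the vision tower
            vision_feats = self.vision_encoder(pixel_values)
        image_tokens = self.projector(vision_feats)
        # Prepend projected image tokens to the text embeddings, then run the LLM
        inputs = torch.cat([image_tokens, text_embeds], dim=1)
        return self.llm(inputs)
```

Pretraining trains the projector (and optionally the LLM) on image-caption pairs; SFT then fine-tunes on instruction-style conversations that reference images.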
| Model | Parameters | Inference Memory | Release |
|-------|------------|------------------|---------|
| MiniMind2-V | 104M | 0.6 GB | 2025.02.20 |
| MiniMind2-Small-V | 26M | 1.1 GB | 2025.02.20 |
- Bug fix: Model weight mismatch issue resolved
- Adapted to minimind-1024 updates
- Code refactoring: Training and evaluation script standardization
- Added complete checkpoint resumption support
- Compatibility updates
- Adapted to new features in the minimind repository
- Standardized parts of the code
- MiniMind2-V updated alongside MiniMind2
- Significant reduction of redundant code
- Major simplification of model structure
- Updated dataset format with new SFT datasets
- Better performance than the previous VLM!
- MiniMind-V released on schedule
- First open-source vision-language model release
🎯 Try It Online
📦 Get the Code
💡 Why MiniMind-V?
- Train vision-language models from scratch on consumer GPUs
- Pure PyTorch implementation with no hidden complexity
- Learn multimodal AI by building and reading code
- Works offline on your personal hardware
- Complete pipeline from data to deployment
- OpenAI API compatible for easy integration (see the example after this list)
- Smaller than most open-source VLMs yet powerful
- Perfect for education and research
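
Because the served model speaks the OpenAI chat-completions protocol, any standard OpenAI client can talk to it. Below is a minimal sketch using the official `openai` Python package; the `base_url`, `api_key`, and model name are placeholder assumptions for a locally launched server, not values taken from the repository.

```python
# Query an OpenAI-compatible endpoint with an image plus a text prompt.
# base_url, api_key, and the model name are illustrative assumptions:
# point them at whatever server you launch locally.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Encode a local image as a base64 data URL so the request is self-contained.
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="minimind2-v",  # hypothetical model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

A plain `http(s)` image URL works the same way in OpenAI-style vision messages; the base64 data URL is used here only to keep the example offline-friendly.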
💭 "Seeing the world through a smaller lens, yet just as sharp."