Train a 67M Vision-Language Model from scratch.
1 hour. ¥1.3. One 3090.
See the World.
📸 Vision Multimodal
Full vision-language capabilities on personal GPUs.
📦 Complete VLM Stack
SigLIP2 encoder → LLM → Projector → Pretrain → SFT
💰 Ultra-Affordable
Single 3090 GPU, minimal compute resources needed.
📖 Pure PyTorch
Transparent implementation. Learn from reading the code.
🚀 Production Ready
OpenAI API compatible, ready for real-world deployment.
🔌 Extensible
Multi-image support, fine-tuning, and more capabilities.
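The stack above boils down to one idea: the vision encoder turns an image into patch features, a projector maps them into the LLM's embedding space, and they are spliced into the token sequence at the `<|image_pad|>` placeholder positions. A minimal PyTorch sketch of that splice, with toy dimensions that are illustrative assumptions rather than the real MiniMind-V configuration:

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration only (not the actual MiniMind-V config).
VISION_DIM, LLM_DIM, SEQ_LEN = 64, 32, 16

class Projector(nn.Module):
    """Maps vision-encoder patch features into the LLM embedding space."""
    def __init__(self, vision_dim, llm_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):
        return self.mlp(x)

def splice_image_tokens(text_embeds, image_embeds, pad_mask):
    """Replace <|image_pad|> placeholder embeddings with projected patches."""
    out = text_embeds.clone()
    out[pad_mask] = image_embeds.reshape(-1, image_embeds.shape[-1])
    return out

# Toy tensors standing in for the SigLIP2 encoder and tokenizer outputs.
patches = torch.randn(1, 4, VISION_DIM)          # 4 image patches
text_embeds = torch.randn(1, SEQ_LEN, LLM_DIM)   # token embeddings
pad_mask = torch.zeros(1, SEQ_LEN, dtype=torch.bool)
pad_mask[0, 2:6] = True                          # 4 <|image_pad|> slots

proj = Projector(VISION_DIM, LLM_DIM)
fused = splice_image_tokens(text_embeds, proj(patches), pad_mask)
print(fused.shape)  # torch.Size([1, 16, 32])
```

The fused sequence then flows through the LLM unchanged; the projector is the only new trainable bridge between the two modalities.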
| Model | Parameters | Inference Memory | Release |
|-------|------------|------------------|---------|
| minimind-3v-moe | 202M-A67M | 1.0 GB | 2026.04.01 |
| minimind-3v | 67M | 0.5 GB | 2026.04.01 |
| MiniMind2-V | 104M | 1.1 GB | 2025.02.20 |
| MiniMind2-Small-V | 26M | 0.6 GB | 2025.02.20 |
- Added minimind-3v (67M) and minimind-3v-moe (201M-A67M) models
- Unified 768+8 architecture, supporting both dense and MoE modes
- Switched Visual Encoder from CLIP to SigLIP2 (siglip2-base-p16-ve)
- Replaced QFormer with MLP Projection + reshape compression
- Dataset format updated to Parquet with mixed data sources
- Tokenizer updated with the `<|image_pad|>` image placeholder
- New WebUI with dynamic model-directory scanning and a dropdown for model switching
- Model code refactored, LLM/VLM unified for Transformers format
- Training scripts support DDP multi-GPU, bfloat16 mixed precision, torch.compile acceleration
- Bug fix: mismatched model weights
- Adapted to the latest minimind updates
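The "MLP Projection + reshape compression" that replaced the QFormer can be sketched as follows: adjacent vision patches are grouped by a reshape, concatenated along the feature axis, and then mapped into the LLM space by a small MLP, shrinking the number of image tokens. The dimensions and the 4x grouping factor below are assumptions for illustration, not the exact MiniMind-V configuration:

```python
import torch
import torch.nn as nn

def compress_and_project(patches, mlp, group=4):
    """Group adjacent patches via reshape, then project into the LLM space."""
    b, n, d = patches.shape              # (batch, num_patches, vision_dim)
    assert n % group == 0, "patch count must divide by the group size"
    grouped = patches.reshape(b, n // group, group * d)
    return mlp(grouped)                  # (batch, n // group, llm_dim)

vision_dim, llm_dim = 64, 32             # toy sizes
mlp = nn.Sequential(
    nn.Linear(4 * vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)
tokens = compress_and_project(torch.randn(2, 16, vision_dim), mlp)
print(tokens.shape)  # torch.Size([2, 4, 32])
```

Compared with a QFormer, this drops the learned-query cross-attention entirely: compression comes from a parameter-free reshape, so the projector stays a plain MLP that is cheap to train from scratch.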
- Refactored training and evaluation scripts
- Added complete checkpoint resume support
- Compatibility updates
- Adapted to new features in the minimind repository
- Standardized parts of the code
- MiniMind2-V updated alongside MiniMind2
- Significant reduction of redundant code
- Major simplification of model structure
- Updated dataset format with new SFT datasets
- Better performance than the previous VLM!
- MiniMind-V released on schedule
- First open-source vision-language model release
🎯 Try It Online
📦 Get the Code
💡 Why MiniMind-V?
- Train vision-language models from scratch on consumer GPUs
- Pure PyTorch implementation with no hidden complexity
- Learn multimodal AI by building and reading code
- Works offline on your personal hardware
- Complete pipeline from data to deployment
- OpenAI API compatible for easy integration
- Smaller than most open-source VLMs yet powerful
- Perfect for education and research
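Because the server speaks the OpenAI chat-completions protocol, any standard client can query a locally served model. A stdlib-only sketch of such a request; the URL, port, and the commented-out call are placeholders to adapt to whatever your local server reports:

```python
import json
import urllib.request

# Request body in the OpenAI chat-completions format, using the
# multimodal "image_url" content part to attach an image.
payload = {
    "model": "minimind-3v",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/cat.jpg"}},
        ],
    }],
}

# Endpoint is a placeholder; point it at your local server.
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment with a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The same payload works with the official `openai` Python client by setting `base_url` to the local endpoint, so existing OpenAI-based tooling integrates without changes.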
💭 "Seeing the world through a smaller lens, yet just as sharp."