Train a 65M Vision-Language Model from scratch.
2 hours. ¥2.6. One 3090.
See the World.
📸 Vision Multimodal
Full vision-language capabilities on personal GPUs.
📦 Complete VLM Stack
SigLIP2 encoder → Projector → LLM → Pretrain → SFT (sketched below)
💰 Ultra-Affordable
Single 3090 GPU, minimal compute resources needed.
📖 Pure PyTorch
Transparent implementation. Learn from reading the code.
🚀 Production Ready
OpenAI API compatible, ready for real-world deployment.
🔌 Extensible
Multi-image support, fine-tuning, and more capabilities.
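To make the stack concrete, here is a minimal PyTorch sketch of the data flow: a frozen vision encoder feeds a small projector, whose output tokens are consumed by the LLM. The class, attribute names, and dimensions are illustrative assumptions, not the repository's exact API.

```python
import torch
import torch.nn as nn

class TinyVLMSketch(nn.Module):
    """Illustrative data flow only: frozen vision encoder -> projector -> LLM.
    Names and dimensions are assumptions, not MiniMind-V's exact modules."""

    def __init__(self, vision_encoder, llm, vision_dim=768, llm_dim=512):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a SigLIP2 vision tower
        self.projector = nn.Sequential(        # maps vision features into the LLM's space
            nn.LayerNorm(vision_dim),
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                         # small decoder-only language model

    def forward(self, pixel_values, text_embeds):
        with torch.no_grad():                  # vision tower stays frozen during training
            vision_feats = self.vision_encoder(pixel_values)  # assumed (B, 64, vision_dim)
        image_embeds = self.projector(vision_feats)           # (B, 64, llm_dim)
        # Image tokens are placed ahead of the text tokens and decoded together.
        return self.llm(inputs_embeds=torch.cat([image_embeds, text_embeds], dim=1))
```

Keeping the vision tower frozen and training only the projector (plus, depending on the stage, parts of the LLM) is consistent with the freeze strategy noted in the update log below, and is what keeps the compute budget small enough for a single 3090.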
| Model | Parameters | Inference Memory | Release |
|---|---|---|---|
| minimind-3v-moe | 200M-A65M | 1.0 GB | 2026.04.20 |
| minimind-3v | 65M | 0.5 GB | 2026.04.20 |
| MiniMind2-V | 104M | 1.1 GB | 2025.02.20 |
| MiniMind2-Small-V | 26M | 0.6 GB | 2025.02.20 |
- New checkpoints: minimind-3v (65M) / minimind-3v-moe (200M-A65M)
- Projector: added LayerNorm, removed reshape token merging (P32 natively outputs 64 tokens)
- Vision Encoder switched to SiglipVisionModel (P32, fixed 256×256)
- Training data moved to ALLaVA-4V (Pretrain 1.27M / SFT 2.9M, merged into single-stage SFT)
- Freeze strategy: freeze_llm=1 unfreezes the first and last LLM layers (sketched after this list); Pretrain/SFT defaults now 2/1; max_seq_len 360 → 450
- Misc bugfixes and small tweaks
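A rough sketch of how such a freeze strategy can be applied, assuming the model exposes an `llm` with a `layers` list and a `projector` module (these attribute names are hypothetical, not the repository's actual ones):

```python
def apply_freeze_strategy(model, freeze_llm=True):
    """Hypothetical helper mirroring the freeze strategy described above:
    with freeze_llm enabled, only the projector plus the first and last
    LLM blocks stay trainable."""
    if not freeze_llm:
        return
    # Freeze every LLM parameter first ...
    for p in model.llm.parameters():
        p.requires_grad = False
    # ... then unfreeze only the first and last transformer blocks.
    for block in (model.llm.layers[0], model.llm.layers[-1]):
        for p in block.parameters():
            p.requires_grad = True
    # The projector is always trained.
    for p in model.projector.parameters():
        p.requires_grad = True
```

The intent, roughly, is to keep most LLM weights fixed while still letting the boundary blocks adapt to the incoming vision tokens and the output distribution.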
- Added minimind-3v (67M) and minimind-3v-moe (201M-A67M) models
- Unified 768+8 architecture, supporting both dense and moe modes
- Switched Visual Encoder from CLIP to SigLIP2 (siglip2-base-p16-256-ve)
- Replaced QFormer with MLP Projection + reshape compression
- Dataset format updated to Parquet with mixed data sources
- Tokenizer updated with the image placeholder token <|image_pad|> (spliced into the input sequence as sketched after this list)
- New WebUI with dynamic model directory scanning and a dropdown for model switching
- Model code refactored; LLM and VLM unified under the Transformers format
- Training scripts support DDP multi-GPU, bfloat16 mixed precision, torch.compile acceleration
- Bug fix: mismatched model weights
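The <|image_pad|> placeholders make it straightforward to splice projected vision features into the token sequence. The helper below is a sketch of that idea under assumed shapes; it is not the repository's actual implementation.

```python
import torch

def splice_image_features(input_ids, text_embeds, image_embeds, image_pad_id):
    """Illustrative only: replace the embeddings at <|image_pad|> positions with
    projected vision features.

    input_ids:    (B, T)     token ids containing image_pad_id placeholders
    text_embeds:  (B, T, D)  token embeddings from the LLM embedding table
    image_embeds: (B, N, D)  projected vision tokens (e.g. N = 64)
    """
    embeds = text_embeds.clone()
    for b in range(input_ids.size(0)):
        positions = (input_ids[b] == image_pad_id).nonzero(as_tuple=True)[0]
        # One placeholder slot per vision token; copy the features into those slots.
        embeds[b, positions] = image_embeds[b, : positions.numel()].to(embeds.dtype)
    return embeds
```

With the P32 encoder producing 64 tokens per image (as noted in the update above), one would expect the chat template to repeat <|image_pad|> 64 times per image, though the exact templating is best checked against the code.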
- Adapted to the latest minimind updates
- Refactored training and evaluation scripts
- Added complete checkpoint resume support
- Compatibility updates
- Adapted to new features in the minimind repository
- Standardized parts of the code
- MiniMind2-V updated alongside MiniMind2
- Significant reduction of redundant code
- Major simplification of model structure
- Updated dataset format with new SFT datasets
- Better performance than the previous VLM!
- MiniMind-V released on schedule
- First open-source release of the vision-language model
🎯 Try It Online
📦 Get the Code
💡 Why MiniMind-V?
- Train vision-language models from scratch on consumer GPUs
- Pure PyTorch implementation with no hidden complexity
- Learn multimodal AI by building and reading code
- Works offline on your personal hardware
- Complete pipeline from data to deployment
- OpenAI API compatible for easy integration (usage example below)
- Smaller than most open-source VLMs yet powerful
- Perfect for education and research
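Because the serving endpoint follows the OpenAI chat-completions format, the standard `openai` Python client can talk to it. The base URL, port, and model name below are placeholders for however you launch the local server, and the image-content shape is the OpenAI vision message format, which the local endpoint is assumed to accept.

```python
# Querying a locally served MiniMind-V through its OpenAI-compatible endpoint.
# base_url, api_key, model name, and the image URL are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="minimind-3v",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```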
💭 "Seeing the world through a smaller lens, yet just as sharp."