
Train a 26M Vision-Language Model from scratch.
1 hour. ¥1.3. One RTX 3090.
See the World.

26M Parameters · 1h Training · ¥1.3 Cost · 1/7000 the size of GPT-3

✨ What You Get

📸 Vision Multimodality

Full vision-language capabilities on personal GPUs.

📦 Complete VLM Stack

Vision encoder → Projector → LLM, trained through the full Pretrain → SFT pipeline (see the sketch after this feature list).

💰 Ultra-Affordable

Trains on a single RTX 3090 with minimal compute.

📖 Pure PyTorch

Transparent implementation. Learn from reading the code.

🚀 Production Ready

OpenAI-API-compatible serving, ready for real-world deployment (see the example below the model table).

🔌 Extensible

Multi-image support, custom fine-tuning, and more.
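
The stack above follows the common LLaVA-style layout: a vision encoder turns the image into patch features, a small projector maps them into the LLM's embedding space, and the projected "image tokens" are fed to the language model alongside the text tokens. Below is a minimal PyTorch sketch of that wiring; the module sizes and the stand-in encoder/LLM are illustrative assumptions, not MiniMind-V's actual classes.

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Minimal LLaVA-style wiring: vision encoder -> projector -> LLM.
    Every module here is a simplified stand-in, not MiniMind-V's real implementation."""

    def __init__(self, vision_dim=768, llm_dim=512, vocab_size=6400):
        super().__init__()
        # Stand-in vision encoder: a ViT-style patchifier (16x16 patches of a 224x224 image).
        self.patch_embed = nn.Conv2d(3, vision_dim, kernel_size=16, stride=16)
        # Projector: maps vision features into the LLM's embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)
        # Stand-in language model: token embedding + one Transformer layer + output head.
        self.tok_emb = nn.Embedding(vocab_size, llm_dim)
        self.llm_block = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, images, input_ids):
        # 1) Encode the image into patch features; a real VLM typically keeps this encoder frozen.
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)   # (B, 196, vision_dim)
        # 2) Project patch features into the LLM token space.
        image_tokens = self.projector(patches)                          # (B, 196, llm_dim)
        # 3) Prepend image tokens to the text embeddings and run the language model.
        text_tokens = self.tok_emb(input_ids)                           # (B, T, llm_dim)
        hidden = self.llm_block(torch.cat([image_tokens, text_tokens], dim=1))
        return self.lm_head(hidden)                                     # next-token logits

# Smoke test with random inputs.
model = TinyVLM()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 6400, (2, 32)))
print(logits.shape)  # torch.Size([2, 228, 6400])
```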

📦 Models
| Model | Parameters | Inference Memory | Release |
|---|---|---|---|
| MiniMind2-V | 104M | 0.6 GB | 2025.02.20 |
| MiniMind2-Small-V | 26M | 1.1 GB | 2025.02.20 |
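
Because serving is advertised as OpenAI-API compatible, a deployed checkpoint can be queried with the standard openai Python client. The snippet below is only a sketch: the base URL, port, model id, and image-message format are assumptions about a local deployment, not documented MiniMind-V defaults.

```python
# Hypothetical client call against a locally served, OpenAI-API-compatible MiniMind-V
# endpoint. Adjust base_url and model to match how you actually launch the server.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")  # local server, dummy key

response = client.chat.completions.create(
    model="minimind2-v",  # placeholder model id (assumption)
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```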
📰 What's New
🎉 2025-10-24 (Latest)
  • Bug fix: resolved a model-weight mismatch issue
  • Adapted to the minimind-1024 updates
  • Code refactoring: standardized the training and evaluation scripts
  • Added full checkpoint-resumption support
🔥 2025-04-27
  • Compatibility updates
  • Adapted to new features in the minimind repository
  • Standardized parts of the code
🔥 2025-02-20
  • MiniMind2-V updated alongside MiniMind2
  • Significant reduction of redundant code
  • Major simplification of model structure
  • Updated dataset format with new SFT datasets
  • Better performance than the previous VLM!
🎬 2024-10-05 (First Release)
  • MiniMind-V released on schedule
  • First open-source release of the vision-language model
🎮 Inside MiniMind-V
[Figures: MiniMind-V demo; VLM structure; VLM structure (MoE)]

💡 Why MiniMind-V?

💭 "Seeing the world through a smaller lens, yet just as sharp."