
Train a 65M Vision-Language Model from scratch.
2 hours. ¥2.6. One 3090.
See the World.

| Parameters | Training Time | Cost | Size vs GPT-3 |
|:---:|:---:|:---:|:---:|
| 65M | 2h | ¥2.6 | 1/2600 |

✨ What You Get

📸 Multimodal Vision

Full vision-language capabilities on personal GPUs.

📦 Complete VLM Stack

SigLIP2 vision encoder → projector → LLM, trained via pretrain + SFT (see the sketch after this list).

💰 Ultra-Affordable

Single 3090 GPU, minimal compute resources needed.

📖 Pure PyTorch

Transparent implementation. Learn from reading the code.

🚀 Production Ready

OpenAI API compatible and ready for real-world deployment (client example under "Inside MiniMind-V" below).

🔌 Extensible

Multi-image support, fine-tuning, and more.
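
For a concrete picture of how the stack fits together, here is a minimal PyTorch sketch of the encoder → projector → LLM hand-off. Module names and hidden sizes are illustrative assumptions, not the repo's exact API; the real implementation lives in the model code.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Maps SigLIP patch features into the LLM embedding space (LayerNorm + MLP)."""
    def __init__(self, vision_dim: int = 768, llm_dim: int = 512):  # sizes illustrative
        super().__init__()
        self.norm = nn.LayerNorm(vision_dim)
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, 64, vision_dim) patch tokens from the vision encoder
        return self.mlp(self.norm(patch_feats))  # (batch, 64, llm_dim)

def splice_image_tokens(text_embeds: torch.Tensor,
                        image_embeds: torch.Tensor,
                        pad_mask: torch.Tensor) -> torch.Tensor:
    """Drop projected image tokens into the <|image_pad|> placeholder slots.

    Assumes each sample contains exactly image_embeds.size(1) placeholders.
    """
    out = text_embeds.clone()
    out[pad_mask] = image_embeds.reshape(-1, image_embeds.size(-1))
    return out  # feed to the LLM as input embeddings
```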

📦 Models
| Model | Parameters | Inference Memory | Release |
|-------|------------|------------------|---------|
| minimind-3v-moe | 200M-A65M | 1.0 GB | 2026.04.20 |
| minimind-3v | 65M | 0.5 GB | 2026.04.20 |
| MiniMind2-V | 104M | 1.1 GB | 2025.02.20 |
| MiniMind2-Small-V | 26M | 0.6 GB | 2025.02.20 |
📰 What's New
🎉 2026-04-20 (Latest)
  • New checkpoints: minimind-3v (65M) / minimind-3v-moe (200M-A65M)
  • Projector: added LayerNorm, removed reshape token merging (P32 natively outputs 64 tokens)
  • Vision Encoder switched to SiglipVisionModel (P32, fixed 256×256)
  • Training data moved to ALLaVA-4V (Pretrain 1.27M / SFT 2.9M, merged into single-stage SFT)
  • Freeze strategy: freeze_llm=1 now unfreezes only the first and last LLM layers (sketched below); Pretrain/SFT freeze_llm defaults are now 2 and 1; max_seq_len raised from 360 to 450
  • Misc bugfixes and small tweaks
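
A hedged sketch of what that freeze strategy looks like in code, assuming the VLM exposes its transformer blocks as `model.llm.layers` (an assumed attribute layout; the actual flag handling lives in the training scripts):

```python
import torch.nn as nn

def apply_freeze_llm(model: nn.Module, freeze_llm: int = 1) -> None:
    """freeze_llm=1: freeze all LLM blocks except the first and the last.

    The vision projector stays trainable. `model.llm.layers` is an assumed
    attribute name, not necessarily the repo's exact one.
    """
    if freeze_llm == 1:
        blocks = model.llm.layers  # assumed ModuleList of transformer blocks
        for i, block in enumerate(blocks):
            trainable = i in (0, len(blocks) - 1)  # first + last stay unfrozen
            for p in block.parameters():
                p.requires_grad = trainable
```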
🛠️ 2026-04-01
  • Added minimind-3v (67M) and minimind-3v-moe (201M-A67M) models
  • Unified 768+8 architecture supporting both dense and MoE modes
  • Switched Visual Encoder from CLIP to SigLIP2 (siglip2-base-p16-256-ve)
  • Replaced QFormer with MLP Projection + reshape compression
  • Dataset format updated to Parquet with mixed data sources
  • Tokenizer updated with image placeholder <|image_pad|>
  • New WebUI with dynamic model-directory scanning and dropdown model switching
  • Model code refactored, LLM/VLM unified for Transformers format
  • Training scripts support multi-GPU DDP, bfloat16 mixed precision, and torch.compile acceleration (sketched below)
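
A minimal sketch of that training step, assuming an HF-style model whose forward returns an output with a `.loss` field; function and variable names are illustrative, not the repo's script. Under DDP you would launch it with `torchrun`.

```python
import torch

def train_one_epoch(model, loader, optimizer):
    """bfloat16 mixed precision + torch.compile (illustrative, not the repo's script)."""
    model = torch.compile(model)  # optional graph compilation for faster steps
    for images, input_ids, labels in loader:
        # bfloat16 autocast needs no GradScaler, unlike float16
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = model(images.cuda(), input_ids.cuda(), labels=labels.cuda()).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```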
🛠️ 2025-10-24
  • Bug fix: mismatched model weights
  • Adapted to the latest minimind updates
  • Refactored training and evaluation scripts
  • Added complete checkpoint resume support
🔥 2025-04-27
  • Compatibility updates
  • Adapted to new features in the minimind repository
  • Standardized parts of the code
🔥 2025-02-20
  • MiniMind2-V updated alongside MiniMind2
  • Significant reduction of redundant code
  • Major simplification of model structure
  • Updated dataset format with new SFT datasets
  • Better performance than the previous VLM!
🎬 2024-10-05 (First Release)
  • MiniMind-V released on schedule
  • First open-source release of the vision-language model
🎮 Inside MiniMind-V
(Demo video, plus VLM architecture diagrams for the dense and MoE variants.)
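
Because the served API is OpenAI-compatible (see "Production Ready" above), any standard client can query a locally served checkpoint. A hedged example using the official `openai` Python package; the base URL, port, and image URL are illustrative:

```python
from openai import OpenAI

# Base URL/port are illustrative; point the client at wherever you serve the model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

resp = client.chat.completions.create(
    model="minimind-3v",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
)
print(resp.choices[0].message.content)
```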

💡 Why MiniMind-V?

💭 "Seeing the world through a smaller lens, yet just as sharp."