
Zhang199/TinyLLaVA-Video-Qwen2.5-3B-Group-16-512


Model Documentation

TinyLLaVA-Video

Here, we introduce TinyLLaVA-Video-Qwen2.5-3B-Group-16-512. We use Qwen2.5-3B as the LLM and siglip-so400m-patch14-384 as the vision tower. The model adopts the Video-Level Group Resampler, samples 16 frames from each video, and represents the whole video sequence with 512 tokens.
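To make the "16 frames per video" setting concrete, here is a minimal sketch of uniform frame sampling. It is an illustration only: the function name `uniform_frame_indices` and the midpoint-of-segment strategy are assumptions, not the sampling code used in the TinyLLaVA-Video repository.

```python
def uniform_frame_indices(total_frames: int, num_samples: int = 16) -> list[int]:
    """Pick `num_samples` frame indices spread evenly across a video.

    Hypothetical helper for illustration; the actual TinyLLaVA-Video
    sampling strategy may differ.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if total_frames <= num_samples:
        # Short clip: keep every frame rather than repeating any.
        return list(range(total_frames))
    # Split the video into equal-length segments and take each midpoint.
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


indices = uniform_frame_indices(160)
print(indices)  # 16 indices, from 5 to 155 in steps of 10
```

With a fixed budget of 512 video tokens and 16 sampled frames, the budget averages 512 / 16 = 32 tokens per frame. This is plain arithmetic on the numbers above and says nothing about how the resampler actually distributes tokens.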

Result

| Model (HF Path) | #Frame/Query | Video-MME | MVBench | LongVideoBench | MLVU |
| :----------------------------------------: | :------------: | :-------------: | :-------: | :--------------: | :----------: |
| Zhang199/TinyLLaVA-Video-Qwen2.5-3B-Group-1fps-512 | 1fps/512 | 47.7 | 47.0 | 42.0 | 52.6 |
| Zhang199/TinyLLaVA-Video-Qwen2.5-3B-Group-16-512 | 16/512 | 47.0 | 45.5 | 42.4 | 52.5 |
| Zhang199/TinyLLaVA-Video-Qwen2.5-3B-Naive-16-512 | 16/512 | 44.7 | 42.5 | 37.6 | 48.1 |
| Zhang199/TinyLLaVA-Video-Phi2-Naive-16-512 | 16/512 | 42.7 | 42.0 | 42.2 | 46.5 |
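As a reading aid for the table, the short sketch below computes the per-benchmark difference between the Group and Naive variants of the Qwen2.5-3B model at the same 16/512 setting. The scores are copied from the table; the dictionary layout and the `gains` helper are illustrative assumptions.

```python
# Scores copied from the results table (16 frames, 512 tokens).
scores = {
    "Group-16-512": {"Video-MME": 47.0, "MVBench": 45.5,
                     "LongVideoBench": 42.4, "MLVU": 52.5},
    "Naive-16-512": {"Video-MME": 44.7, "MVBench": 42.5,
                     "LongVideoBench": 37.6, "MLVU": 48.1},
}


def gains(model: dict[str, float], baseline: dict[str, float]) -> dict[str, float]:
    """Per-benchmark score difference (model minus baseline), rounded to 0.1."""
    return {name: round(model[name] - baseline[name], 1) for name in model}


group_vs_naive = gains(scores["Group-16-512"], scores["Naive-16-512"])
print(group_vs_naive)
# {'Video-MME': 2.3, 'MVBench': 3.0, 'LongVideoBench': 4.8, 'MLVU': 4.4}
```

On these four benchmarks the Group variant scores higher than the Naive variant in every column, with the largest gap on LongVideoBench.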

Files & Weights

| Filename | Size |
| :------: | :--: |
| model-00001-of-00002.safetensors | 4.62 GB |
| model-00002-of-00002.safetensors | 2.14 GB |
| training_args.bin | 0.00 GB |
