omni-research/Tarsier2-Recap-7b
- Base Model: Qwen2-VL-7B-Instruct
- Training Data: Tarsier2-Recap-585K
- Model date: Tarsier2-Recap-7b was trained in December 2024.
- Paper or r...
Tarsier Model Card
Introduction
Tarsier2-Recap-7b is built upon Qwen2-VL-7B-Instruct by distilling the video description capabilities of Tarsier2-7b. Specifically, we fine-tuned Qwen2-VL-7B-Instruct on Tarsier2-Recap-585K for 2 epochs with a learning rate of 2e-5. Tarsier2-Recap-7b has video captioning ability similar to that of Tarsier2-7b, reaching an overall F1 score of 40.7% on DREAM-1K, which trails only Tarsier2-7b (42.0%) and surpasses GPT-4o (39.2%). See the Tarsier2 technical report for more details.
_Note: Please use Tarsier2-7b if you need the full-strength Tarsier2 model._
Model details
License
Qwen/Qwen2-VL-7B-Instruct license.
Intended use
Primary intended uses: The primary use of Tarsier is research on large multimodal models, especially video description.
Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
Model Performance
Video Description
We evaluate Tarsier2-Recap-7b on DREAM-1K, a detailed video description benchmark featuring dynamic and diverse videos, assessing the model’s ability to describe fine-grained actions and events. Here is the evaluation result:
_Note: The results for Tarsier2-Recap-7b differ from those reported in Table 11 of the Tarsier2 technical report because this Tarsier2-Recap-7b is more fully trained (2 epochs vs. 1 epoch)._
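For readers less familiar with the metric, the sketch below shows the conventional definition of an F1 score as the harmonic mean of precision and recall, and ranks the three overall DREAM-1K scores quoted in this card. It is illustrative only: how DREAM-1K derives precision and recall from a generated description is covered in the technical report and is not reproduced here, and the `f1_score` helper and the precision/recall inputs in the example call are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall DREAM-1K F1 scores (in percent) as quoted in this model card.
reported_f1 = {
    "Tarsier2-7b": 42.0,
    "Tarsier2-Recap-7b": 40.7,
    "GPT-4o": 39.2,
}

# Model names sorted from highest to lowest reported F1.
ranking = sorted(reported_f1, key=reported_f1.get, reverse=True)

# Hypothetical precision/recall pair, chosen only to illustrate the formula.
example_f1 = f1_score(0.5, 0.25)

if __name__ == "__main__":
    for name in ranking:
        print(f"{name}: {reported_f1[name]:.1f}")
    print(f"example F1: {example_f1:.4f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot reach a high F1 by being strong on precision alone or on recall alone.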