TimeLens-8B



📑 Paper | 💻 Code | 🏠 Project Page | 🤗 Model & Data

✨ Model Description



TimeLens-8B is a multimodal large language model (MLLM) finetuned from Qwen3-VL-8B-Instruct that achieves state-of-the-art video temporal grounding (VTG) performance among open-source models. It is trained with the carefully crafted RLVR (reinforcement learning with verifiable rewards) recipe proposed in our paper, using our high-quality VTG training dataset, TimeLens-100K.

📊 Performance



TimeLens-8B achieves state-of-the-art video temporal grounding performance among open-source models:

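| Model | Charades-TimeLens R1@0.3 | Charades-TimeLens R1@0.5 | Charades-TimeLens R1@0.7 | Charades-TimeLens mIoU | ActivityNet-TimeLens R1@0.3 | ActivityNet-TimeLens R1@0.5 | ActivityNet-TimeLens R1@0.7 | ActivityNet-TimeLens mIoU | QVHighlights-TimeLens R1@0.3 | QVHighlights-TimeLens R1@0.5 | QVHighlights-TimeLens R1@0.7 | QVHighlights-TimeLens mIoU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B-Instruct | 59.7 | 37.8 | 16.6 | 39.3 | 44.1 | 31.0 | 16.1 | 31.4 | 41.5 | 27.8 | 15.2 | 31.6 |
| TimeLens-7B 🚀 | 70.5 | 55.6 | 28.4 | 48.8 | 62.8 | 51.0 | 32.6 | 46.2 | 74.1 | 62.7 | 43.1 | 56.0 |
| Qwen3-VL-8B-Instruct | 69.2 | 53.4 | 27.5 | 48.3 | 62.1 | 51.2 | 34.4 | 46.8 | 74.2 | 64.6 | 49.3 | 59.4 |
| TimeLens-8B 🚀 | 76.6 | 63.0 | 35.2 | 55.2 | 68.9 | 58.4 | 40.6 | 53.2 | 80.2 | 71.6 | 55.5 | 65.5 |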


> For a detailed comparison with other models, please refer to the 🏆 Leaderboard.

🚀 Usage
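
The R1@m columns follow the usual temporal grounding convention: the percentage of queries whose top-1 predicted span has a temporal IoU of at least m with the ground-truth span. mIoU is the mean IoU over all queries. The sketch below illustrates these standard definitions only. The function names and the toy spans are assumptions made for illustration, and the official evaluation code may differ in details.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) spans given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Return R1@m for each threshold m, plus mIoU, all as percentages."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    metrics = {
        f"R1@{t}": 100.0 * sum(iou >= t for iou in ious) / n
        for t in thresholds
    }
    metrics["mIoU"] = 100.0 * sum(ious) / n
    return metrics


# Toy data (hypothetical): one exact hit, one half overlap, one miss.
preds = [(2.0, 6.0), (0.0, 10.0), (5.0, 8.0)]
gts = [(2.0, 6.0), (5.0, 10.0), (20.0, 25.0)]
metrics = grounding_metrics(preds, gts)
print(metrics)
```

With these toy spans the per-query IoUs are 1.0, 0.5, and 0.0, so two of three queries count toward R1@0.3 and R1@0.5, and one toward R1@0.7.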



Install the following packages:

```bash
pip install transformers==4.57.1 accelerate==1.6.0 torch==2.6.0 torchvision==0.21.0
pip install "qwen-vl-utils[decord]==0.0.14"

# Use Flash-Attention 2 to speed up generation
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
```
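
Flash-Attention 2 is only a speed-up, so a script can fall back to another attention backend when the package is missing. The sketch below is one possible way to choose the value passed as `attn_implementation` when loading the model. The helper name `pick_attn_implementation` is hypothetical, and `"sdpa"` (PyTorch scaled dot-product attention, a backend name accepted by Transformers) is an assumed fallback rather than a setting taken from this model card.

```python
import importlib.util


def pick_attn_implementation():
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    # find_spec only checks that the package is installed; it does not
    # verify that the GPU or CUDA build is compatible with it.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string can then be passed as `attn_implementation=...` to `from_pretrained`.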


Using 🤗 Transformers for Inference:

```python
import os

import requests
import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoModelForImageTextToText, AutoProcessor


def download_video(url):
    """Download the video into the working directory unless it already exists."""
    save_path = os.path.basename(url)
    if not os.path.exists(save_path):
        print(f"Downloading video from {url}...")
        response = requests.get(url, stream=True)
        response.raise_for_status()
        with open(save_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
    return save_path


# Load model and processor
model = AutoModelForImageTextToText.from_pretrained(
    "TencentARC/TimeLens-8B",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(
    "TencentARC/TimeLens-8B",
    padding_side="left",
    do_resize=False,
)

# Prepare input
query = "A man drinks water with a glass"
video_path = download_video(
    "https://huggingface.co/datasets/JungleGym/TimeLens-Assets/resolve/main/2Y8XQ.mp4"
)

# '<start time>' and '<end time>' are literal placeholder text in the prompt.
GROUNDER_PROMPT = (
    "Please find the visual event described by the sentence '{}', "
    "determining its starting and ending times. The format should be: "
    "'The event happens in <start time> - <end time> seconds'."
)

messages = [{
    'role': 'user',
    'content': [
        {
            'type': 'video',
            'video': video_path,
            'min_pixels': 64 * 32 * 32,       # lower pixel bound per frame
            'total_pixels': 14336 * 32 * 32,  # pixel budget for the whole video
            'fps': 2,                         # frame sampling rate
        },
        {
            'type': 'text',
            'text': GROUNDER_PROMPT.format(query),
        },
    ],
}]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
images, videos, video_kwargs = process_vision_info(
    messages,
    image_patch_size=16,
    return_video_kwargs=True,
    return_video_metadata=True,
)

# Each entry in `videos` is a (frames, metadata) pair; split them apart.
videos, video_metadatas = zip(*videos)
videos, video_metadatas = list(videos), list(video_metadatas)

inputs = processor(
    text=[text],
    images=images,
    videos=videos,
    video_metadata=video_metadatas,
    padding=True,
    return_tensors='pt',
    **video_kwargs,
).to("cuda")

# Greedy decoding
output_ids = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=512,
)

# Strip the prompt tokens so only the newly generated answer is decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
answer = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(f"Answer: {answer}")
```
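
The grounding prompt asks the model to answer in the form "The event happens in ... seconds", so downstream code usually needs the two timestamps as numbers. The sketch below shows one way to extract them with a regular expression. The helper `parse_time_span` is hypothetical and not part of the TimeLens code, and it assumes the answer contains two numbers separated by "-" or "to". Real outputs may vary, so the `None` case should be handled.

```python
import re


def parse_time_span(answer):
    """Extract (start, end) in seconds from an answer string, or None."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(?:-|to)\s*(\d+(?:\.\d+)?)", answer)
    if match is None:
        return None
    start, end = float(match.group(1)), float(match.group(2))
    # Swap if the model returned the endpoints in reverse order.
    return (start, end) if start <= end else (end, start)


print(parse_time_span("The event happens in 12.5 - 18.0 seconds."))
# prints (12.5, 18.0)
```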


Citation

If you find our work helpful for your research and applications, please cite our paper:

```bibtex
@article{zhang2025timelens,
  title={TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs},
  author={Zhang, Jun and Wang, Teng and Ge, Yuying and Ge, Yixiao and Li, Xinhao and Shan, Ying and Wang, Limin},
  journal={arXiv preprint arXiv:2512.14698},
  year={2025}
}
```

Files & Weights

| Filename | Size |
| --- | --- |
| model-00001-of-00004.safetensors | 4.65 GB |
| model-00002-of-00004.safetensors | 4.58 GB |
| model-00003-of-00004.safetensors | 4.58 GB |
| model-00004-of-00004.safetensors | 2.52 GB |