TencentARC/TimeLens-8B

TimeLens-8B

📑 Paper | 💻 Code | 🏠 Project Page | 🤗 Model & Data

✨ Model Description

TimeLens-8B is a multimodal large language model (MLLM) finetuned from Qwen3-VL-8B-Instruct that achieves state-of-the-art video temporal grounding (VTG) performance among open-source models. It is trained with the carefully crafted RLVR (reinforcement learning with verifiable rewards) recipe proposed in our paper, using our high-quality VTG training dataset, TimeLens-100K.

📊 Performance

On all three benchmarks below, TimeLens-8B improves on its base model, Qwen3-VL-8B-Instruct, in every metric (all values in %):

	Charades-TimeLens				ActivityNet-TimeLens				QVHighlights-TimeLens
Model	R1@0.3	R1@0.5	R1@0.7	mIoU	R1@0.3	R1@0.5	R1@0.7	mIoU	R1@0.3	R1@0.5	R1@0.7	mIoU
Qwen2.5-VL-7B-Instruct	59.7	37.8	16.6	39.3	44.1	31.0	16.1	31.4	41.5	27.8	15.2	31.6
TimeLens-7B🚀	70.5	55.6	28.4	48.8	62.8	51.0	32.6	46.2	74.1	62.7	43.1	56.0
Qwen3-VL-8B-Instruct	69.2	53.4	27.5	48.3	62.1	51.2	34.4	46.8	74.2	64.6	49.3	59.4
TimeLens-8B🚀	76.6	63.0	35.2	55.2	68.9	58.4	40.6	53.2	80.2	71.6	55.5	65.5

> For a detailed comparison with other models, please refer to the 🏆 Leaderboard.
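
For readers unfamiliar with the columns above: R1@m is commonly defined as the percentage of queries whose top-1 predicted segment has a temporal IoU of at least m with the ground-truth segment, and mIoU is the mean temporal IoU over all queries. The sketch below illustrates these definitions, assuming segments are (start, end) pairs in seconds. The function names are hypothetical, and this is not the official TimeLens evaluation code.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) segments in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the segments are disjoint.
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds, gts, threshold):
    """Percentage of queries whose single prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


def mean_iou(preds, gts):
    """Mean temporal IoU over all queries, as a percentage."""
    total = sum(temporal_iou(p, g) for p, g in zip(preds, gts))
    return 100.0 * total / len(gts)


if __name__ == "__main__":
    # Toy data (made up for illustration): one exact match, one complete miss.
    preds = [(0.0, 10.0), (0.0, 10.0)]
    gts = [(0.0, 10.0), (20.0, 30.0)]
    print(recall_at_1(preds, gts, 0.5), mean_iou(preds, gts))
```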

🚀 Usage

Install the following packages:

bash
pip install transformers==4.57.1 accelerate==1.6.0 torch==2.6.0 torchvision==0.21.0
pip install "qwen-vl-utils[decord]==0.0.14"
# Use Flash-Attention 2 to speed up generation
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
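
Because the versions above are pinned, it can help to confirm that the active environment matches them before loading the model. The following is a minimal sketch using only the standard library. The `PINS` table simply restates the versions from the install commands, and `find_mismatches` is a hypothetical helper, not part of TimeLens.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the install commands above.
PINS = {
    "transformers": "4.57.1",
    "accelerate": "1.6.0",
    "torch": "2.6.0",
    "torchvision": "0.21.0",
    "qwen-vl-utils": "0.0.14",
    "flash-attn": "2.7.4.post1",
}


def find_mismatches(pins):
    """Return {package: installed_version_or_None} for every pin that is not met."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        # Ignore local build tags such as "+cu124" when comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = found
    return mismatches


if __name__ == "__main__":
    for name, found in find_mismatches(PINS).items():
        print(f"{name}: expected {PINS[name]}, found {found or 'not installed'}")
```

An empty result means every pinned package is installed at the expected version.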

Using 🤗Transformers for Inference:

python
import requests
import os
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info


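def download_video(url):
    """Download the video at `url` into the current directory (skipped if the file already exists)."""
    save_path = os.path.basename(url)
    if not os.path.exists(save_path):
        print(f"Downloading video from {url}...")
        # Stream the response so the whole file is never held in memory;
        # the timeout avoids hanging forever on a stalled connection.
        response = requests.get(url, stream=True, timeout=60)
        response.raise_for_status()
        with open(save_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
    return save_path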

# Load model and processor
model = AutoModelForImageTextToText.from_pretrained(
    "TencentARC/TimeLens-8B",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(
    "TencentARC/TimeLens-8B",
    padding_side="left",
    do_resize=False,
)

# Prepare input
query = "A man drinks water with a glass"
video_path = download_video("https://huggingface.co/datasets/JungleGym/TimeLens-Assets/resolve/main/2Y8XQ.mp4")

GROUNDER_PROMPT = "Please find the visual event described by the sentence '{}', determining its starting and ending times. The format should be: 'The event happens in <start time> - <end time> seconds'."


messages = [{
    'role': 'user',
    'content': [
        {
            'type': 'video',
            'video': video_path,
            'min_pixels': 64 * 32 * 32,
            'total_pixels': 14336 * 32 * 32,
            'fps': 2,
        },
        {
            'type': 'text',
            'text': GROUNDER_PROMPT.format(query)
        }
    ]
}]

# Render the chat messages into the model's prompt format
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Decode and sample the video frames referenced in the messages
images, videos, video_kwargs = process_vision_info(
    messages,
    image_patch_size=16,
    return_video_kwargs=True,
    return_video_metadata=True,
)

# Each entry is a (frames, metadata) pair; split them into two parallel lists
videos, video_metadatas = zip(*videos)
videos, video_metadatas = list(videos), list(video_metadatas)

inputs = processor(
    text=[text],
    images=images,
    videos=videos,
    video_metadata=video_metadatas,
    padding=True,
    return_tensors='pt',
    **video_kwargs,
).to("cuda")

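# Greedy decoding for a deterministic answer
output_ids = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=512,
)

# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]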
answer = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f"Answer: {answer}")
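
The model replies in free text, so downstream code usually needs the two timestamps as numbers. The sketch below assumes the reply contains a start time and an end time in seconds, in that order. `parse_span` and the sample string are hypothetical and not part of TimeLens.

```python
import re


def parse_span(answer):
    """Extract (start, end) in seconds from a reply, or None if two numbers are not found."""
    # Match unsigned integers or decimals, so the "-" separator is not read as a minus sign.
    numbers = re.findall(r"\d+(?:\.\d+)?", answer)
    if len(numbers) < 2:
        return None
    start, end = float(numbers[0]), float(numbers[1])
    # Guard against a reversed pair.
    if end < start:
        start, end = end, start
    return start, end


if __name__ == "__main__":
    # Made-up reply for illustration only.
    print(parse_span("The event happens in 12.5 - 18.0 seconds"))
```

Returning `None` instead of raising lets a batch evaluation count unparseable replies as misses.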

Citation

If you find our work helpful for your research and applications, please cite our paper:

bibtex
@article{zhang2025timelens,
  title={TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs},
  author={Zhang, Jun and Wang, Teng and Ge, Yuying and Ge, Yixiao and Li, Xinhao and Shan, Ying and Wang, Limin},
  journal={arXiv preprint arXiv:2512.14698},
  year={2025}
}

Files & Weights

Filename	Size
model-00001-of-00004.safetensors	4.65 GB
model-00002-of-00004.safetensors	4.58 GB
model-00003-of-00004.safetensors	4.58 GB
model-00004-of-00004.safetensors	2.52 GB
