VLSOT


Monocular-Video-Based 3D Visual Language Tracking
  • Visual-Language Tracking (VLT) is an emerging paradigm that bridges the human-machine performance gap by integrating visual and linguistic cues, extending single-object tracking to text-driven video comprehension.
  • However, existing VLT research remains confined to 2D spatial domains and lacks the capability for 3D tracking in monocular video, a task that has traditionally relied on expensive sensor data (e.g., LiDAR point clouds, depth measurements, radar) without corresponding language descriptions for its outputs.
  • The code is publicly available at https://github.com/astudyber/Mono3DVLT, advancing low-cost monocular 3D tracking with language grounding.

Submission Guidelines

Mono3DVLT-V2X Dataset: After applying, you will receive the download link for the dataset's annotation files. To avoid copyright infringement, please obtain the corresponding images from the V2X-Seq dataset: "V2X-Seq: A large-scale sequential dataset for vehicle-infrastructure cooperative perception and forecasting." You can also download them through the following link: https://drive.google.com/file/d/1VwiXsNXHAeeCT5STIPesWX2w3UT-6x7B/view?usp=sharing


3D Object Detection Prediction File Format Specification

1. File Naming Rule

The prediction files must be named according to the following format:

{sampleID}_pred_box0_.txt
Example: sample001_pred_box0_.txt, car_003_pred_box0_.txt

2. File Content Format

Each file should contain 30 lines, with each line representing the prediction result for one frame:

1 l w h x y z
2 l w h x y z
3 l w h x y z
...
30 l w h x y z

Format Description

Column             Content        Description
1st column         1–30           Frame index (from 1 to 30)
2nd–7th columns    l w h x y z    3D bounding box coordinates

3. Coordinate Requirements

  1. Data Type: Floating-point numbers, preferably with four decimal places
  2. Unit: Meters (m)
  3. Coordinate System: 3D world coordinate system
Note: Use a period (.) as the decimal separator. Do not use commas (,).
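
As a minimal illustration of the naming rule and line format above, the following Python sketch writes one prediction file. The sample ID, output folder, and zero-valued boxes are hypothetical placeholders, not part of any official tooling.

import os
import numpy as np

def write_prediction_file(sample_id, boxes, out_dir="result"):
    # boxes: shape (30, 6), one [l, w, h, x, y, z] row per frame, in meters.
    assert boxes.shape == (30, 6), "expected one [l w h x y z] row per frame"
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{sample_id}_pred_box0_.txt")
    with open(path, "w") as f:
        for frame_idx, (l, w, h, x, y, z) in enumerate(boxes, start=1):
            # Frame index first, then l w h x y z with four decimal places
            # and a period as the decimal separator.
            f.write(f"{frame_idx} {l:.4f} {w:.4f} {h:.4f} {x:.4f} {y:.4f} {z:.4f}\n")
    return path

# Hypothetical call with placeholder predictions:
write_prediction_file("sample001", np.zeros((30, 6)))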

4. Example File Content

1 0.0413 0.0061 0.3771 0.4055 -0.2095 0.2607
2 0.0421 0.0060 0.3745 0.4040 -0.2057 -0.2586
3 0.0423 0.0060 0.3760 -0.4028 -0.2031 0.2531
4 0.0434 0.0060 0.3736 0.4024 0.2010 -0.2514
5 0.0441 0.0060 0.3724 -0.4011 0.1982 -0.2453
...
30 0.0605 0.0056 0.3588 0.3789 -0.1567 -0.1814
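
Before packaging a submission, it may help to sanity-check each file against the format above. The checker below is an assumption-based sketch, not an official validation script.

def validate_prediction_file(path):
    # Expect 30 non-empty lines of "frame l w h x y z", with frame running 1 to 30.
    with open(path) as f:
        rows = [line.split() for line in f if line.strip()]
    assert len(rows) == 30, f"{path}: expected 30 lines, got {len(rows)}"
    for expected_frame, fields in enumerate(rows, start=1):
        assert len(fields) == 7, f"{path}: expected 7 columns, got {len(fields)}"
        assert int(fields[0]) == expected_frame, f"{path}: bad frame index {fields[0]}"
        [float(v) for v in fields[1:]]  # l w h x y z must parse as floats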

5. Folder Structure

The submitted prediction results should be organized as follows:

result/
├── 000237.txt
├── 001483.txt
├── 003235.txt
└── ...
Recommendation: Compress the entire folder into a ZIP file before submission. Example: submission_result.zip
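
Assuming the prediction files already sit in a local result/ folder, one way to produce the recommended archive with Python's standard library is:

import shutil

# Creates submission_result.zip containing the result/ folder.
shutil.make_archive("submission_result", "zip", root_dir=".", base_dir="result")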

Method Leaderboard

12 methods · 6 metrics
This leaderboard lists methods that are online and have submitted results; methods are ranked by their performance metrics.
@inproceedings{wei2025mono3dvlt,
  title={Mono3DVLT: Monocular-Video-Based 3D Visual Language Tracking},
  author={Wei, Hongkai and Yang, Yang and Sun, Shijie and Feng, Mingtao and Song, Xiangyu and Lei, Qi and Hu, Hongli and Wang, Rong and Song, Huansheng and Akhtar, Naveed and others},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={13886--13896},
  year={2025}
}
@inproceedings{zhan2024mono3dvg,
  title={Mono3dvg: 3d visual grounding in monocular images},
  author={Zhan, Yang and Yuan, Yuan and Xiong, Zhitong},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={7},
  pages={6988--6996},
  year={2024}
}
@inproceedings{zhou2023joint,
  title={Joint visual grounding and tracking with natural language specification},
  author={Zhou, Li and Zhou, Zikun and Mao, Kaige and He, Zhenyu},
  booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
  pages={23151--23160},
  year={2023}
}
@inproceedings{deng2021transvg,
  title={Transvg: End-to-end visual grounding with transformers},
  author={Deng, Jiajun and Yang, Zhengyuan and Chen, Tianlang and Zhou, Wengang and Li, Houqiang},
  booktitle={Proceedings of the IEEE/CVF international conference on computer vision},
  pages={1769--1779},
  year={2021}
}
@inproceedings{yang2020improving,
  title={Improving one-stage visual grounding by recursive sub-query construction},
  author={Yang, Zhengyuan and Chen, Tianlang and Wang, Liwei and Luo, Jiebo},
  booktitle={European conference on computer vision},
  pages={387--404},
  year={2020},
  organization={Springer}
}
@inproceedings{2020Zero,
  title={Zero-Shot Grounding of Objects From Natural Language Queries},
  author={Sadhu, Arka and Chen, Kan and Nevatia, Ram},
  booktitle={Proceedings of the IEEE/CVF international conference on computer vision},
  year={2019}
}
@inproceedings{2019Dynamic,
  title={Dynamic Graph Attention for Referring Expression Comprehension},
  author={Yang, Sibei and Li, Guanbin and Yu, Yizhou},
  booktitle={Proceedings of the IEEE/CVF international conference on computer vision},
  year={2019}
}
Method                     Last submission   SR@0.5 ↑   SR@0.7 ↑   AOR ↑     PR@1.0 ↑   ACE ↓     PR@0.5 ↑
Mono3DVLT-MT               2025-08-07        81.6300    68.9400    85.1200   81.5600    0.5210    62.3600
Mono3DVG-TR                2025-08-31        71.7500    63.4700    79.1300   75.8900    0.5940    58.9100
JointNLT+backproj          2025-08-31        61.4000    51.7000    68.3100   70.4300    0.6970    53.6300
MMT3DVG (open source)      2025-11-24        60.6667    42.6667    52.2601   40.3333    1.7716    15.3333
TransVG+backproj           2025-08-31        54.5000    41.6200    58.6200   66.7800    0.7830    47.5600
ReSC+backproj              2025-08-31        50.3300    38.4100    53.4300   68.5200    0.7920    47.5600
ZSGNet+backproj            2025-08-31        37.2900    25.3700    40.6900   37.4100    1.1320    16.3000
FAOA+backproj              2025-08-31        35.4300    29.6200    41.1800   43.4000    1.0590    25.3200
Hardcoded                  2025-11-24        0.0000     0.0000     0.0000    0.0000     41.4169   0.0000
对对队的 (open source)       2025-11-24        0.0000     0.0000     0.0000    0.0000     0.0000    0.0000
对对队 (open source)         2025-11-24        0.0000     0.0000     0.0000    0.0000     0.0000    0.0000
UAVision (open source)     2025-11-24        0.0000     0.0000     0.0000    0.0000     0.0000    0.0000

↑ higher is better, ↓ lower is better.
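
The metric columns are not formally defined on this page. The sketch below assumes common interpretations: SR@τ as the fraction of frames whose axis-aligned 3D IoU with the ground truth exceeds τ, AOR as the mean IoU, PR@d as the fraction of frames with center error below d meters, and ACE as the mean center error in meters, with rates reported as percentages. These definitions, and the helper names, are assumptions rather than the official evaluation code.

import numpy as np

def axis_aligned_iou3d(pred, gt):
    # pred, gt: (N, 6) arrays of [l, w, h, x, y, z]; rotation is ignored
    # because the prediction format carries none.
    pmin, pmax = pred[:, 3:6] - pred[:, :3] / 2, pred[:, 3:6] + pred[:, :3] / 2
    gmin, gmax = gt[:, 3:6] - gt[:, :3] / 2, gt[:, 3:6] + gt[:, :3] / 2
    inter = np.clip(np.minimum(pmax, gmax) - np.maximum(pmin, gmin), 0, None).prod(axis=1)
    union = pred[:, :3].prod(axis=1) + gt[:, :3].prod(axis=1) - inter
    return inter / np.maximum(union, 1e-9)

def summarize(pred, gt):
    iou = axis_aligned_iou3d(pred, gt)
    err = np.linalg.norm(pred[:, 3:6] - gt[:, 3:6], axis=1)  # center error in meters
    return {
        "SR@0.5": 100 * (iou > 0.5).mean(), "SR@0.7": 100 * (iou > 0.7).mean(),
        "AOR": 100 * iou.mean(), "ACE": err.mean(),
        "PR@1.0": 100 * (err < 1.0).mean(), "PR@0.5": 100 * (err < 0.5).mean(),
    }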