VLSOT


Monocular-Video-Based 3D Visual Language Tracking
  • Visual-Language Tracking (VLT) is an emerging paradigm that bridges the human-machine performance gap by integrating visual and linguistic cues, extending single-object tracking to text-driven video comprehension.
  • However, existing VLT research remains confined to 2D spatial domains, lacking the capability for 3D tracking in monocular video—a task traditionally reliant on expensive sensing modalities (e.g., LiDAR point clouds, depth measurements, radar) without corresponding language descriptions for their outputs.
  • The code is publicly available at https://github.com/hongkai-wei/Mono3DVLT, advancing low-cost monocular 3D tracking with language grounding.

Submission Guidelines

Mono3DVLT-V2X Dataset: By applying, you can obtain the download link for the dataset's annotation files. To avoid copyright infringement, please obtain the corresponding images from the V2X-Seq dataset: "V2X-Seq: A Large-Scale Sequential Dataset for Vehicle-Infrastructure Cooperative Perception and Forecasting."


3D Object Detection Prediction File Format Specification

1. File Naming Convention

Prediction files must be named in the following format:

{sample_id}_pred_box0_.txt
Examples: sample001_pred_box0_.txt, car_003_pred_box0_.txt
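
As a quick illustration, a compliant filename can be generated in Python; the helper name format_pred_filename is hypothetical, not part of any official toolkit:

def format_pred_filename(sample_id: str) -> str:
    # Builds "{sample_id}_pred_box0_.txt" per the naming rule above.
    return f"{sample_id}_pred_box0_.txt"

print(format_pred_filename("sample001"))  # -> sample001_pred_box0_.txt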

2. File Content Format

Each file contains 30 lines; each line holds the prediction for one frame:

1 x_min y_min z_min x_max y_max z_max
2 x_min y_min z_min x_max y_max z_max
3 x_min y_min z_min x_max y_max z_max
...
30 x_min y_min z_min x_max y_max z_max

Format Description

Column        Content                                Description
Column 1      1-30                                   Frame index (from 1 to 30)
Columns 2-7   x_min y_min z_min x_max y_max z_max    3D bounding box coordinates
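
A minimal sketch of producing one file in this layout; write_pred_file is an illustrative name, assuming boxes is a length-30 sequence of (x_min, y_min, z_min, x_max, y_max, z_max) tuples in meters:

def write_pred_file(path: str, boxes) -> None:
    # boxes: 30 tuples of (x_min, y_min, z_min, x_max, y_max, z_max), in meters.
    assert len(boxes) == 30, "expected one prediction per frame, 30 frames"
    with open(path, "w") as f:
        for frame_idx, box in enumerate(boxes, start=1):  # frame index runs 1..30
            coords = " ".join(f"{v:.4f}" for v in box)    # 4 decimals, '.' separator
            f.write(f"{frame_idx} {coords}\n")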

3. Coordinate Requirements

  1. Coordinate order: x_min ≤ x_max, y_min ≤ y_max, and z_min ≤ z_max must hold.
  2. Data type: floating point; keeping 4 decimal places is recommended.
  3. Units: meters.
  4. Coordinate system: 3D world coordinate system.
Note: coordinate values must use an English period (.) as the decimal separator, not a comma (,) — see the validation sketch below.
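
A minimal validation sketch for these requirements; validate_pred_line is a hypothetical helper, not part of the benchmark toolkit:

def validate_pred_line(line: str, expected_idx: int) -> bool:
    # Checks one line of the form "idx x_min y_min z_min x_max y_max z_max".
    parts = line.split()
    if len(parts) != 7:
        return False
    try:
        idx = int(parts[0])
        x_min, y_min, z_min, x_max, y_max, z_max = map(float, parts[1:])
    except ValueError:  # also triggered by ',' decimal separators
        return False
    return (idx == expected_idx
            and x_min <= x_max and y_min <= y_max and z_min <= z_max)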

4. Example File Content

1 0.0413 -0.0061 -0.3771 0.4055 -0.2095 0.2607
2 0.0421 -0.0060 -0.3745 0.4040 -0.2057 0.2586
3 0.0423 -0.0060 -0.3760 0.4028 -0.2031 0.2531
4 0.0434 -0.0060 -0.3736 0.4024 -0.2010 0.2514
5 0.0441 -0.0060 -0.3724 0.4011 -0.1982 0.2453
...
30 0.0605 -0.0056 -0.3588 0.3789 -0.1567 0.1814
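
For reference, a file in this layout parses directly with NumPy; this reader sketch is illustrative and assumes the file follows the specification above:

import numpy as np

preds = np.loadtxt("1_pred_box0_.txt")  # shape (30, 7)
frame_idx = preds[:, 0].astype(int)     # column 1: frame indices 1..30
boxes = preds[:, 1:]                    # columns 2-7: box coordinates in meters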

5. Folder Structure

Submitted prediction results should be organized as follows:

pred_folder/
├── 1_pred_box0_.txt
├── 2_pred_box0_.txt
├── 3_pred_box0_.txt
└── ...
Recommendation: compress the entire folder into a ZIP archive before submitting, with a file name such as submission_teamname.zip.
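
A packaging sketch using only the Python standard library; the folder and team names are placeholders:

import shutil

# Zips pred_folder/ into submission_teamname.zip, keeping the folder at the archive root.
shutil.make_archive("submission_teamname", "zip", root_dir=".", base_dir="pred_folder")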

Method Leaderboard

7 Methods · 6 Metrics
This leaderboard shows methods that are online and have submitted results; methods are ranked by their performance metrics. BibTeX entries for the listed methods follow:
@inproceedings{wei2025mono3dvlt,
  title={Mono3DVLT: Monocular-Video-Based 3D Visual Language Tracking},
  author={Wei, Hongkai and Yang, Yang and Sun, Shijie and Feng, Mingtao and Song, Xiangyu and Lei, Qi and Hu, Hongli and Wang, Rong and Song, Huansheng and Akhtar, Naveed and others},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={13886--13896},
  year={2025}
}
@inproceedings{zhan2024mono3dvg,
  title={Mono3DVG: 3D Visual Grounding in Monocular Images},
  author={Zhan, Yang and Yuan, Yuan and Xiong, Zhitong},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={7},
  pages={6988--6996},
  year={2024}
}
@inproceedings{zhou2023joint,
  title={Joint Visual Grounding and Tracking with Natural Language Specification},
  author={Zhou, Li and Zhou, Zikun and Mao, Kaige and He, Zhenyu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={23151--23160},
  year={2023}
}
@inproceedings{deng2021transvg,
  title={TransVG: End-to-End Visual Grounding with Transformers},
  author={Deng, Jiajun and Yang, Zhengyuan and Chen, Tianlang and Zhou, Wengang and Li, Houqiang},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={1769--1779},
  year={2021}
}
@inproceedings{yang2020improving,
  title={Improving One-Stage Visual Grounding by Recursive Sub-query Construction},
  author={Yang, Zhengyuan and Chen, Tianlang and Wang, Liwei and Luo, Jiebo},
  booktitle={European Conference on Computer Vision},
  pages={387--404},
  year={2020},
  organization={Springer}
}
@inproceedings{sadhu2019zero,
  title={Zero-Shot Grounding of Objects from Natural Language Queries},
  author={Sadhu, Arka and Chen, Kan and Nevatia, Ram},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2019}
}
@inproceedings{yang2019dynamic,
  title={Dynamic Graph Attention for Referring Expression Comprehension},
  author={Yang, Sibei and Li, Guanbin and Yu, Yizhou},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2019}
}
Method              Last submission   SR@0.5 ↑   SR@0.7 ↑   AOR ↑     PR@1.0 ↑   ACE ↓    PR@0.5 ↑
Mono3DVLT-MT        2025-08-07        81.6300    68.9400    85.1200   81.5600    0.5210   62.3600
Mono3DVG-TR         2025-08-31        71.7500    63.4700    79.1300   75.8900    0.5940   58.9100
JointNLT+backproj   2025-08-31        61.4000    51.7000    68.3100   70.4300    0.6970   53.6300
TransVG+backproj    2025-08-31        54.5000    41.6200    58.6200   66.7800    0.7830   47.5600
ReSC+backproj       2025-08-31        50.3300    38.4100    53.4300   68.5200    0.7920   47.5600
ZSGNet+backproj     2025-08-31        37.2900    25.3700    40.6900   37.4100    1.1320   16.3000
FAOA+backproj       2025-08-31        35.4300    29.6200    41.1800   43.4000    1.0590   25.3200

↑ higher is better; ↓ lower is better.