VLMOD


Understanding Multi-Object World from Monocular View
  • Language and binocular vision play a crucial role in how humans understand the world. Advances in artificial intelligence have likewise made it possible for machines to develop the 3D perception capabilities essential for high-level scene understanding.
  • In practice, however, often only a monocular camera is available due to cost and space constraints. Enabling machines to achieve accurate 3D understanding from a monocular view is therefore highly practical, but it presents significant challenges. We introduce MonoMulti-3DVG, a novel task aimed at multi-object 3D Visual Grounding (3DVG) from monocular RGB images, allowing machines to better understand and interact with the 3D world.
  • Our code is available at https://github.com/JasonHuang516/MonoMulti-3DVG

Submission Guidelines

MonoMulti3D-ROPE Dataset: Upon application, you will receive the download link for the dataset's annotation files. To avoid copyright infringement, please obtain the corresponding images directly from the Rope3D dataset: "Rope3D: The Roadside Perception Dataset for Autonomous Driving and Monocular 3D Object Detection Task"


For the test set, you will receive a series of JSON files named after the corresponding images.

  1. 'public_description' consists of three textual descriptions
  2. 'test_data' is a list of N annotation strings, one per 3D object in the current image

Your task is to generate, for each JSON test file, a prediction txt file with the same base name, following the ground-truth format shown below (see also the Python sketch after the examples).

  • Example input JSON:
    {
        "public_description": ["text A","text B","text C"],
    
        "test_data": ["car 0 1 1.801794718014159 432.329712 161.031845 489.911163 195.766006 1.330764 1.723862 4.595827 -23.599788856 -17.3461417017 107.76086501 1.58619703569 white",
            "car 0 0 1.8714389585917863 271.055511 260.568756 369.843994 326.276459 1.226748 1.293225 4.509353 -17.0402855068 -6.51409203148 60.3178077787 1.59610576362 black",
            "car 0 0 1.9176106259634207 152.466553 233.118301 249.770584 294.002746 1.368777 1.354342 4.576945 -23.1791681767 -8.31255693637 68.7484072947 1.5924206198 white",
            "car 0 0 4.548582196278255 1380.783569 357.924286 1497.573242 443.193329 1.000533 1.625677 4.351937 10.2291397219 -2.78002294124 45.1663406086 4.77130207357 red"]
    }
    
  • Example output txt:
    0 0 0
    0 0 0
    0 0 0
    0 1 1
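As a concrete illustration (not an official tool), the Python sketch below reads each test JSON and writes a prediction txt with one row per 'test_data' entry and one 0/1 column per description, which is how we read the example above. The helper names (parse_annotation, predict_match, write_predictions) are hypothetical, the KITTI/Rope3D-style field layout of the annotation strings is our assumption, and the matching logic is left as a placeholder for a real grounding model.

    import json
    from pathlib import Path

    def parse_annotation(line: str) -> dict:
        """Split one 'test_data' string into named fields.
        The layout (class, truncation, occlusion, alpha, 2D box, h/w/l,
        x/y/z, rotation_y, colour) is assumed from the KITTI/Rope3D label
        convention; verify it against the official annotation files."""
        t = line.split()
        return {
            "cls": t[0],
            "alpha": float(t[3]),
            "bbox2d": [float(v) for v in t[4:8]],
            "dims_hwl": [float(v) for v in t[8:11]],
            "loc_xyz": [float(v) for v in t[11:14]],
            "rot_y": float(t[14]),
            "color": t[15],
        }

    def predict_match(description: str, obj: dict) -> int:
        """Placeholder for a real grounding model: return 1 if `obj`
        is referred to by `description`, else 0."""
        return 0

    def write_predictions(json_dir: str, out_dir: str) -> None:
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        for json_path in sorted(Path(json_dir).glob("*.json")):
            sample = json.loads(json_path.read_text())
            descriptions = sample["public_description"]                   # three texts
            objects = [parse_annotation(s) for s in sample["test_data"]]  # N objects
            rows = [" ".join(str(predict_match(d, o)) for d in descriptions)
                    for o in objects]
            # one txt per JSON, same base name, N rows x 3 columns of 0/1
            (Path(out_dir) / f"{json_path.stem}.txt").write_text("\n".join(rows) + "\n")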

Finally, submit a zip file containing the predicted txt files for all test samples. Please follow the above format strictly; a minimal packaging sketch is given below. After submission, wait a few minutes for the system to score your results.
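The snippet below is one way to build the archive; it assumes the txt files sit flat at the archive root, since the guidelines above do not state whether a subfolder is required.

    import zipfile
    from pathlib import Path

    def pack_submission(pred_dir: str, zip_path: str = "submission.zip") -> None:
        # Collect every per-image prediction txt into a flat zip archive.
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for txt in sorted(Path(pred_dir).glob("*.txt")):
                zf.write(txt, arcname=txt.name)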

Method Leaderboard

7 methods, 6 metrics
This leaderboard lists methods that are online and have submitted results; methods are ranked by their performance metrics. BibTeX entries for the listed methods are given below.
@inproceedings{guo2025beyond,
  title={Beyond Human Perception: Understanding Multi-Object World from Monocular View},
  author={Guo, Keyu and Huang, Yongle and Sun, Shijie and Song, Xiangyu and Feng, Mingtao and Liu, Zedong and Song, Huansheng and Wang, Tiantian and Li, Jianxin and Akhtar, Naveed and others},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={3751--3760},
  year={2025}
}
@inproceedings{zhang2023multi3drefer,
  title={Multi3drefer: Grounding text description to multiple 3d objects},
  author={Zhang, Yiming and Gong, ZeMing and Chang, Angel X},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={15225--15236},
  year={2023}
}
@inproceedings{cai20223djcg,
  title={3djcg: A unified framework for joint dense captioning and visual grounding on 3d point clouds},
  author={Cai, Daigang and Zhao, Lichen and Zhang, Jing and Sheng, Lu and Xu, Dong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={16464--16473},
  year={2022}
}
@inproceedings{zhao20213dvg,
  title={3dvg-transformer: Relation modeling for visual grounding on point clouds},
  author={Zhao, Lichen and Cai, Daigang and Sheng, Lu and Xu, Dong},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={2928--2937},
  year={2021}
}
@inproceedings{yuan2021instancerefer,
  title={Instancerefer: Cooperative holistic understanding for visual grounding on point clouds through instance multi-level contextual referring},
  author={Yuan, Zhihao and Yan, Xu and Liao, Yinghong and Zhang, Ruimao and Wang, Sheng and Li, Zhen and Cui, Shuguang},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={1791--1800},
  year={2021}
}
@inproceedings{achlioptas2020referit3d,
  title={Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes},
  author={Achlioptas, Panos and Abdelreheem, Ahmed and Xia, Fei and Elhoseiny, Mohamed and Guibas, Leonidas},
  booktitle={European Conference on Computer Vision},
  pages={422--440},
  year={2020},
  organization={Springer}
}
@inproceedings{chen2020scanrefer,
  title={Scanrefer: 3d object localization in rgb-d scans using natural language},
  author={Chen, Dave Zhenyu and Chang, Angel X and Nie{\ss}ner, Matthias},
  booktitle={European Conference on Computer Vision},
  pages={202--221},
  year={2020},
  organization={Springer}
}
(F1, Precision, Recall, TP: higher is better; FP, FN: lower is better)

Method           Last submission   F1        Precision   Recall    TP       FP       FN
CyclopsNet       2025-08-06        69.0900   54.4600     94.4700   63139    52806    3698
M3DRef-CLIP      2025-08-31        63.7000   49.8200     88.3000   55954    56364    7416
3DJCG            2025-08-31        63.4100   49.7700     87.3600   55657    56172    8051
3DVG-Trans       2025-08-31        61.7100   48.7100     84.1700   56232    59221    10573
InstanceRefer    2025-08-31        58.6800   45.3000     83.2700   50397    60854    10122
ReferIt3D        2025-08-31        58.0100   44.7500     82.4100   49883    61579    10647
ScanRefer        2025-08-30        56.6700   44.1900     78.9800   48411    61140    12884
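The three ranking metrics follow the standard definitions derived from the raw counts: Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1 is their harmonic mean. A small sanity check against the CyclopsNet row (the function name is ours):

    def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
        """Standard precision / recall / F1 from raw counts."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    p, r, f1 = prf1(tp=63139, fp=52806, fn=3698)
    print(f"Precision {100*p:.2f}  Recall {100*r:.2f}  F1 {100*f1:.2f}")
    # -> Precision 54.46  Recall 94.47  F1 69.09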