VLMOD
Understanding Multi-Object World from Monocular View
- Language and binocular vision play a crucial role in how humans understand the world. Advances in artificial intelligence have likewise made it possible for machines to develop the 3D perception capabilities essential for high-level scene understanding.
- In practice, however, often only a monocular camera is available due to cost and space constraints. Enabling machines to achieve accurate 3D understanding from a monocular view is therefore practical but highly challenging. We introduce MonoMulti-3DVG, a novel task aimed at achieving multi-object 3D Visual Grounding (3DVG) from monocular RGB images, allowing machines to better understand and interact with the 3D world.
- Our code is available at https://github.com/JasonHuang516/MonoMulti-3DVG

Submission Guidelines
MonoMulti3D-ROPE Dataset: after applying, you will receive the download link for the dataset's annotation files. To avoid copyright infringement, please obtain the corresponding images from the Rope3D dataset: "Rope3D: The Roadside Perception Dataset for Autonomous Driving and Monocular 3D Object Detection Task"
For the test set, you will receive a series of JSON files named after the corresponding images.
- 'public_description' is a list of three textual descriptions
- 'test_data' is a list of 3D object annotations for the N objects in the current image
For each JSON test file, generate a prediction file in txt format with the same name.
- Example input JSON:
{
  "public_description": ["text A", "text B", "text C"],
  "test_data": [
    "car 0 1 1.801794718014159 432.329712 161.031845 489.911163 195.766006 1.330764 1.723862 4.595827 -23.599788856 -17.3461417017 107.76086501 1.58619703569 white",
    "car 0 0 1.8714389585917863 271.055511 260.568756 369.843994 326.276459 1.226748 1.293225 4.509353 -17.0402855068 -6.51409203148 60.3178077787 1.59610576362 black",
    "car 0 0 1.9176106259634207 152.466553 233.118301 249.770584 294.002746 1.368777 1.354342 4.576945 -23.1791681767 -8.31255693637 68.7484072947 1.5924206198 white",
    "car 0 0 4.548582196278255 1380.783569 357.924286 1497.573242 443.193329 1.000533 1.625677 4.351937 10.2291397219 -2.78002294124 45.1663406086 4.77130207357 red"
  ]
}
- Example output txt (12 values = 3 descriptions × 4 objects, one binary label per description–object pair):
0 0 0 0 0 0 0 0 0 0 1 1
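The prediction files can be generated programmatically. The sketch below is a minimal, hypothetical helper (`write_prediction` is our name, not part of the benchmark tooling); it assumes the output is one binary value per description–object pair, flattened description by description, which matches the example above (3 descriptions × 4 objects = 12 values).

```python
import json
from pathlib import Path

def write_prediction(json_path, out_dir, predictions):
    """Write the flattened 0/1 labels for one test sample.

    predictions: one inner list of 0/1 labels per description, each of
    length len(test_data). Hypothetical helper; the flattening order is
    an assumption inferred from the example output.
    """
    json_path = Path(json_path)
    sample = json.loads(json_path.read_text())
    n_desc = len(sample["public_description"])
    n_obj = len(sample["test_data"])
    assert len(predictions) == n_desc, "one row of labels per description"
    assert all(len(row) == n_obj for row in predictions), "one label per object"
    flat = [str(int(v)) for row in predictions for v in row]
    out_path = Path(out_dir) / (json_path.stem + ".txt")  # same name, .txt extension
    out_path.write_text(" ".join(flat) + "\n")
    return out_path
```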
Finally, submit a zip file containing the predicted txt files for all test samples. Please follow the above format strictly. After submission, allow a few minutes for the system to score your results.
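The archive can be assembled with the standard library. `pack_submission` below is a hypothetical helper name; it zips every .txt prediction file at the archive root, which we assume is the expected layout.

```python
import zipfile
from pathlib import Path

def pack_submission(pred_dir, zip_path):
    """Zip every .txt prediction file at the archive root (assumed layout)."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for txt in sorted(Path(pred_dir).glob("*.txt")):
            zf.write(txt, arcname=txt.name)  # drop the directory prefix
    return zip_path
```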
Method Leaderboard
7 Methods
6 Metrics
This leaderboard shows methods that are online and have submitted results. Methods are ranked based on their performance metrics.
Method | Last submission | F1 (higher is better) | Precision (higher is better) | Recall (higher is better) | TP (higher is better) | FP (lower is better) | FN (lower is better)
---|---|---|---|---|---|---|---
CyclopsNet | 2025-08-06 | 69.0900 | 54.4600 | 94.4700 | 63139 | 52806 | 3698
M3DRef-CLIP | 2025-08-31 | 63.7000 | 49.8200 | 88.3000 | 55954 | 56364 | 7416
3DJCG | 2025-08-31 | 63.4100 | 49.7700 | 87.3600 | 55657 | 56172 | 8051
3DVG-Trans | 2025-08-31 | 61.7100 | 48.7100 | 84.1700 | 56232 | 59221 | 10573
Instancerefer | 2025-08-31 | 58.6800 | 45.3000 | 83.2700 | 50397 | 60854 | 10122
ReferIt3D | 2025-08-31 | 58.0100 | 44.7500 | 82.4100 | 49883 | 61579 | 10647
ScanRefer | 2025-08-30 | 56.6700 | 44.1900 | 78.9800 | 48411 | 61140 | 12884
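The leaderboard metrics follow the standard definitions, so each row's percentages can be reproduced from its raw TP/FP/FN counts. `grounding_metrics` below is a hypothetical helper, checked against the CyclopsNet row.

```python
def grounding_metrics(tp, fp, fn):
    """Compute percentage precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": 100 * precision, "recall": 100 * recall, "f1": 100 * f1}

# CyclopsNet row: TP=63139, FP=52806, FN=3698 -> F1 ≈ 69.09
```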