VLMOD


Understanding Multi-Object World from Monocular View
  • Language and binocular vision play a crucial role in how humans understand the world. Advances in artificial intelligence have likewise made it possible for machines to develop the 3D perception capabilities essential for high-level scene understanding.
  • In practice, however, often only a monocular camera is available due to cost and space constraints. Enabling machines to achieve accurate 3D understanding from a monocular view is therefore highly practical, but it presents significant challenges. We introduce MonoMulti-3DVG, a novel task aimed at multi-object 3D Visual Grounding (3DVG) from monocular RGB images, allowing machines to better understand and interact with the 3D world.
  • Our code is available at https://github.com/JasonHuang516/MonoMulti-3DVG

Submission Guidelines

MonoMulti3D-ROPE Dataset: Upon application, you will receive the download link for the dataset's annotation files. To avoid copyright infringement, please obtain the corresponding images directly from the Rope3D dataset: "Rope3D: The Roadside Perception Dataset for Autonomous Driving and Monocular 3D Object Detection Task"


For the test set, you will receive a series of JSON files named after the corresponding images.

  1. 'public_description' consists of three textual descriptions
  2. 'test_data' is a list of N annotation strings, one per 3D object in the current image

Your task is to generate, for each JSON test file, a prediction txt file with the same base name, following the ground-truth format shown below (see also the Python sketch after the examples).

  • Example input JSON:
    {
        "public_description": ["text A","text B","text C"],
    
        "test_data": ["car 0 1 1.801794718014159 432.329712 161.031845 489.911163 195.766006 1.330764 1.723862 4.595827 -23.599788856 -17.3461417017 107.76086501 1.58619703569 white",
            "car 0 0 1.8714389585917863 271.055511 260.568756 369.843994 326.276459 1.226748 1.293225 4.509353 -17.0402855068 -6.51409203148 60.3178077787 1.59610576362 black",
            "car 0 0 1.9176106259634207 152.466553 233.118301 249.770584 294.002746 1.368777 1.354342 4.576945 -23.1791681767 -8.31255693637 68.7484072947 1.5924206198 white",
            "car 0 0 4.548582196278255 1380.783569 357.924286 1497.573242 443.193329 1.000533 1.625677 4.351937 10.2291397219 -2.78002294124 45.1663406086 4.77130207357 red"]
    }
    
  • Example output txt:
    0 0 0
    0 0 0
    0 0 0
    0 1 1
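As a concrete illustration (not an official tool), the Python sketch below reads each test JSON and writes a prediction txt with one row per 'test_data' entry and one 0/1 column per description, which is how we read the example above. The helper names (parse_annotation, predict_match, write_predictions) are hypothetical, the KITTI/Rope3D-style field layout of the annotation strings is our assumption, and the matching logic is left as a placeholder for a real grounding model.

    import json
    from pathlib import Path

    def parse_annotation(line: str) -> dict:
        """Split one 'test_data' string into named fields.
        The layout (class, truncation, occlusion, alpha, 2D box, h/w/l,
        x/y/z, rotation_y, colour) is assumed from the KITTI/Rope3D label
        convention; verify it against the official annotation files."""
        t = line.split()
        return {
            "cls": t[0],
            "alpha": float(t[3]),
            "bbox2d": [float(v) for v in t[4:8]],
            "dims_hwl": [float(v) for v in t[8:11]],
            "loc_xyz": [float(v) for v in t[11:14]],
            "rot_y": float(t[14]),
            "color": t[15],
        }

    def predict_match(description: str, obj: dict) -> int:
        """Placeholder for a real grounding model: return 1 if `obj`
        is referred to by `description`, else 0."""
        return 0

    def write_predictions(json_dir: str, out_dir: str) -> None:
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        for json_path in sorted(Path(json_dir).glob("*.json")):
            sample = json.loads(json_path.read_text())
            descriptions = sample["public_description"]                   # three texts
            objects = [parse_annotation(s) for s in sample["test_data"]]  # N objects
            rows = [" ".join(str(predict_match(d, o)) for d in descriptions)
                    for o in objects]
            # one txt per JSON, same base name, N rows x 3 columns of 0/1
            (Path(out_dir) / f"{json_path.stem}.txt").write_text("\n".join(rows) + "\n")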

Finally, submit a zip file containing the predicted txt files for all test samples. Please follow the above format strictly; a minimal packaging sketch is given below. After submission, wait a few minutes for the system to score your results.
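The snippet below is one way to build the archive; it assumes the txt files sit flat at the archive root, since the guidelines above do not state whether a subfolder is required.

    import zipfile
    from pathlib import Path

    def pack_submission(pred_dir: str, zip_path: str = "submission.zip") -> None:
        # Collect every per-image prediction txt into a flat zip archive.
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for txt in sorted(Path(pred_dir).glob("*.txt")):
                zf.write(txt, arcname=txt.name)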

Method Leaderboard

7 methods, 6 metrics
This leaderboard lists methods that are online and have submitted results; methods are ranked by their performance metrics. BibTeX entries for the listed methods are given below.
@inproceedings{guo2025beyond,
  title={Beyond Human Perception: Understanding Multi-Object World from Monocular View},
  author={Guo, Keyu and Huang, Yongle and Sun, Shijie and Song, Xiangyu and Feng, Mingtao and Liu, Zedong and Song, Huansheng and Wang, Tiantian and Li, Jianxin and Akhtar, Naveed and others},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={3751--3760},
  year={2025}
}
@inproceedings{zhang2023multi3drefer,
  title={Multi3drefer: Grounding text description to multiple 3d objects},
  author={Zhang, Yiming and Gong, ZeMing and Chang, Angel X},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={15225--15236},
  year={2023}
}
@inproceedings{cai20223djcg,
  title={3djcg: A unified framework for joint dense captioning and visual grounding on 3d point clouds},
  author={Cai, Daigang and Zhao, Lichen and Zhang, Jing and Sheng, Lu and Xu, Dong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={16464--16473},
  year={2022}
}
@inproceedings{zhao20213dvg,
  title={3dvg-transformer: Relation modeling for visual grounding on point clouds},
  author={Zhao, Lichen and Cai, Daigang and Sheng, Lu and Xu, Dong},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={2928--2937},
  year={2021}
}
@inproceedings{yuan2021instancerefer,
  title={Instancerefer: Cooperative holistic understanding for visual grounding on point clouds through instance multi-level contextual referring},
  author={Yuan, Zhihao and Yan, Xu and Liao, Yinghong and Zhang, Ruimao and Wang, Sheng and Li, Zhen and Cui, Shuguang},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={1791--1800},
  year={2021}
}
@inproceedings{achlioptas2020referit3d,
  title={Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes},
  author={Achlioptas, Panos and Abdelreheem, Ahmed and Xia, Fei and Elhoseiny, Mohamed and Guibas, Leonidas},
  booktitle={European Conference on Computer Vision},
  pages={422--440},
  year={2020},
  organization={Springer}
}
@inproceedings{chen2020scanrefer,
  title={Scanrefer: 3d object localization in rgb-d scans using natural language},
  author={Chen, Dave Zhenyu and Chang, Angel X and Nie{\ss}ner, Matthias},
  booktitle={European Conference on Computer Vision},
  pages={202--221},
  year={2020},
  organization={Springer}
}
(F1, Precision, Recall, TP: higher is better; FP, FN: lower is better)

Method           Last submission   F1        Precision   Recall    TP       FP       FN
CyclopsNet       2025-08-06        69.0900   54.4600     94.4700   63139    52806    3698
M3DRef-CLIP      2025-08-31        63.7000   49.8200     88.3000   55954    56364    7416
3DJCG            2025-08-31        63.4100   49.7700     87.3600   55657    56172    8051
3DVG-Trans       2025-08-31        61.7100   48.7100     84.1700   56232    59221    10573
InstanceRefer    2025-08-31        58.6800   45.3000     83.2700   50397    60854    10122
ReferIt3D        2025-08-31        58.0100   44.7500     82.4100   49883    61579    10647
ScanRefer        2025-08-30        56.6700   44.1900     78.9800   48411    61140    12884
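The three ranking metrics follow the standard definitions derived from the raw counts: Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1 is their harmonic mean. A small sanity check against the CyclopsNet row (the function name is ours):

    def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
        """Standard precision / recall / F1 from raw counts."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    p, r, f1 = prf1(tp=63139, fp=52806, fn=3698)
    print(f"Precision {100*p:.2f}  Recall {100*r:.2f}  F1 {100*f1:.2f}")
    # -> Precision 54.46  Recall 94.47  F1 69.09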