Anonymous Submission 2026

AnnotateAnything: Automatic Annotation of 3D Assets for Robot Manipulation

Anonymous Authors

Submitted to NeurIPS 2026

Teaser figure: AnnotateAnything converts passive 3D assets into manipulation-ready assets with language, visual, and executable action annotations.

Abstract

Simulation enables scalable robot data collection, but raw 3D assets usually provide only geometry and appearance. AnnotateAnything is an automatic annotation framework that turns object- and room-scale 3D assets into manipulation-ready assets with structured language, visual, and action annotations. A visual-language annotation pipeline infers semantics, parts, keypoints, occupancy maps, and interaction priors, while a physics-based action annotation pipeline converts these priors into executable labels such as grasp poses, dexterous contacts, articulation waypoints, insertion directions, hanging affordances, garment trajectories, and navigation targets. The resulting annotations can be consumed by reusable atomic skills for large-scale simulation data collection and downstream robot learning tasks.

Method Overview

AnnotateAnything is organized as a three-stage visual-language-action annotation pipeline.

1. Visual-language annotation: Generate multi-level object and room descriptions, semantic keypoints, part masks, occupancy maps, and floor-plan cues.

2. Physics annotation: Ground visual-language priors into executable action candidates through target generation, optimization, validation, and augmentation.

3. Skill-consumable data: Use reusable atomic skills to consume grasp, dexterous grasp, articulation, garment, hanging, insertion, and navigation labels.
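
The physics annotation stage (target generation, optimization, validation, augmentation) can be illustrated with a small toy sketch. This is not the authors' implementation: `GraspCandidate`, `generate_targets`, `is_feasible`, `augment`, and `annotate_grasps` are hypothetical names, the thresholds are arbitrary, and a simple geometric test stands in for a real physics simulation.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GraspCandidate:
    """Hypothetical action label: grasp point, unit approach direction, opening."""
    position: tuple  # (x, y, z) in meters
    approach: tuple  # unit vector the gripper travels along
    width: float     # required gripper opening in meters


def generate_targets(keypoint, part_width):
    """Target generation: sample approach directions around a semantic keypoint."""
    candidates = []
    for tilt_deg in (30, 60, 90, 120):
        for yaw_deg in (0, 90):
            tilt, yaw = math.radians(tilt_deg), math.radians(yaw_deg)
            approach = (
                math.sin(tilt) * math.cos(yaw),
                math.sin(tilt) * math.sin(yaw),
                -math.cos(tilt),
            )
            candidates.append(GraspCandidate(keypoint, approach, part_width))
    return candidates


def is_feasible(candidate, max_opening):
    """Validation stand-in: approach from above and fit inside the gripper.

    A real pipeline would run a physics check here; this is only a toy filter.
    """
    comes_from_above = candidate.approach[2] < -0.1
    return comes_from_above and candidate.width <= max_opening


def augment(candidates):
    """Augmentation: add a copy rotated 180 degrees about the vertical axis.

    Assumes (for illustration only) that the object has this symmetry.
    """
    mirrored = [
        replace(c, approach=(-c.approach[0], -c.approach[1], c.approach[2]))
        for c in candidates
    ]
    return candidates + mirrored


def annotate_grasps(keypoint, part_width, max_opening=0.08, clearance=0.01):
    candidates = generate_targets(keypoint, part_width)
    # Optimization placeholder: widen each grasp by a fixed clearance margin.
    refined = [replace(c, width=c.width + clearance) for c in candidates]
    valid = [c for c in refined if is_feasible(c, max_opening)]
    return augment(valid)


labels = annotate_grasps(keypoint=(0.0, 0.0, 0.5), part_width=0.04)
```

In this sketch, candidates that approach horizontally or from below are discarded before augmentation, mirroring the idea that visually plausible candidates still have to pass a feasibility check before they become labels.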

Figure: Examples of atomic skills and simulation rollouts generated from AnnotateAnything annotations.

Demo GIFs

Representative annotation-enabled rollouts across articulated objects, dexterous and parallel grippers, and bimanual skills.

Articulated door opening
Dexterous grasping
Bimanual manipulation
In-hand rotation

Annotation Types

Level  | Annotation   | Example outputs
Asset  | Language     | Sparse tags; object-, part-, and task-level descriptions
Asset  | Visual       | 3D keypoints; part masks; geometric regions
Room   | Language     | Scene descriptions; object relations; task contexts
Room   | Visual       | Occupancy maps; floor plans; top-view layouts
Action | Manipulation | Grasps; dexterous contacts; articulation waypoints; insertion and hanging poses
Action | Scene        | Navigation targets; approach poses; interaction-ready base poses
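
One way to picture these annotation types as data is a small record type keyed by level and annotation kind. The sketch below is an assumption for illustration, not the released format: the `Annotation` class, the `ALLOWED` mapping, the `to_json` helper, and all field names and payload values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass

# Level/annotation pairs taken from the table above.
ALLOWED = {
    "asset": {"language", "visual"},
    "room": {"language", "visual"},
    "action": {"manipulation", "scene"},
}


@dataclass
class Annotation:
    level: str       # "asset", "room", or "action"
    annotation: str  # "language", "visual", "manipulation", or "scene"
    name: str        # e.g. "keypoints" or "grasp" (hypothetical naming)
    payload: dict    # free-form content for this annotation

    def __post_init__(self):
        # Reject combinations that do not appear in the table.
        if self.annotation not in ALLOWED.get(self.level, set()):
            raise ValueError(f"unsupported pair: {self.level}/{self.annotation}")


def to_json(records):
    """Serialize a list of annotations to a JSON string."""
    return json.dumps([asdict(r) for r in records], indent=2)


records = [
    Annotation("asset", "visual", "keypoints", {"handle": [0.12, 0.0, 0.45]}),
    Annotation("action", "manipulation", "grasp",
               {"position": [0.12, 0.0, 0.45], "width": 0.05}),
]
serialized = to_json(records)
```

Validating the level/annotation pair at construction time keeps malformed records out of downstream skill code, at the cost of hard-coding the taxonomy.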

Results

Broad coverage: Rigid, articulated, deformable, garment, and room-scale assets.
Executable labels: Physics validation filters visually plausible but infeasible action candidates.
Reusable skills: Annotations are directly consumed by atomic skills and rollout pipelines.
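
As a rough illustration of how atomic skills might consume labels, the sketch below dispatches each label to a registered handler by its type. The registry `SKILLS`, the `skill` decorator, the label dictionary keys, and the string "commands" are all hypothetical; a real rollout pipeline would drive a simulator instead of returning text.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping a label type to the atomic skill that consumes it.
SKILLS: Dict[str, Callable[[dict], str]] = {}


def skill(label_type: str):
    """Decorator that registers a function as the handler for one label type."""
    def register(fn):
        SKILLS[label_type] = fn
        return fn
    return register


@skill("navigation")
def navigate(label: dict) -> str:
    return f"drive base to {label['target']}"


@skill("grasp")
def grasp(label: dict) -> str:
    return f"move to {label['position']} and close gripper to {label['width']} m"


def run_rollout(labels: List[dict]) -> List[str]:
    """Turn a sequence of labels into skill commands, skipping unknown types."""
    steps = []
    for label in labels:
        handler = SKILLS.get(label["type"])
        if handler is None:
            continue  # no registered skill consumes this label type
        steps.append(handler(label))
    return steps


steps = run_rollout([
    {"type": "navigation", "target": (1.0, 2.0)},
    {"type": "grasp", "position": (0.0, 0.0, 0.5), "width": 0.05},
    {"type": "unregistered"},
])
```

Keeping the mapping in a registry means a new label type only needs a new handler, while existing rollouts stay unchanged.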

Detailed quantitative tables will be added after the anonymous submission is finalized.

Citation

@inproceedings{anonymous2026annotateanything,
  title     = {AnnotateAnything: Automatic Annotation of 3D Assets for Robot Manipulation},
  author    = {Anonymous Authors},
  booktitle = {Submitted to NeurIPS},
  year      = {2026}
}