Anonymous Submission 2026

AnnotateAnything: Automatic Annotation of 3D Assets for Robot Manipulation

Anonymous Authors

Submitted to NeurIPS 2026

Code: coming soon.

Teaser figure: AnnotateAnything converts passive 3D assets into manipulation-ready assets with language, visual, and executable action annotations.

Abstract

Simulation enables scalable robot data collection, but raw 3D assets usually provide only geometry and appearance. AnnotateAnything is an automatic annotation framework that turns object- and room-scale 3D assets into manipulation-ready assets with structured language, visual, and action annotations. A visual-language annotation pipeline infers semantics, parts, keypoints, occupancy maps, and interaction priors, while a physics-based action annotation pipeline converts these priors into executable labels such as grasp poses, dexterous contacts, articulation waypoints, insertion directions, hanging affordances, garment trajectories, and navigation targets. The resulting annotations can be consumed by reusable atomic skills for large-scale simulation data collection and downstream robot learning tasks.

Method Overview

AnnotateAnything is organized around a visual-language-action annotation pipeline with three components: visual-language annotation, physics annotation, and skill-consumable data.

Visual-language annotation

Generate multi-level object and room descriptions, semantic keypoints, part masks, occupancy maps, and floor-plan cues.

Physics annotation

Ground visual-language priors into executable action candidates through target generation, optimization, validation, and augmentation.
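
The four stages named above (target generation, optimization, validation, augmentation) can be read as a simple generate-and-filter loop. The sketch below is only an illustration of that control flow, not the authors' implementation: `ActionCandidate`, `ground_priors`, and the toy stage functions are hypothetical names, and the "physics" check is a stand-in that merely rejects poses below an assumed table plane at z = 0.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class ActionCandidate:
    """Hypothetical container for one candidate action label."""
    kind: str        # e.g. "grasp"
    position: Vec3   # candidate end-effector position in metres (assumed units)


def ground_priors(
    priors: Iterable[Vec3],
    generate: Callable[[Vec3], List[ActionCandidate]],
    optimize: Callable[[ActionCandidate], ActionCandidate],
    validate: Callable[[ActionCandidate], bool],
    augment: Callable[[ActionCandidate], List[ActionCandidate]],
) -> List[ActionCandidate]:
    """Run generation -> optimization -> validation -> augmentation per prior."""
    labels: List[ActionCandidate] = []
    for prior in priors:
        for candidate in generate(prior):
            candidate = optimize(candidate)
            if not validate(candidate):
                continue  # plausible-looking but infeasible: drop it
            labels.append(candidate)
            # Augmented variants must pass the same validation check.
            labels.extend(c for c in augment(candidate) if validate(c))
    return labels


# Toy stages, all assumptions made for illustration only.
def generate(keypoint: Vec3) -> List[ActionCandidate]:
    x, y, z = keypoint
    return [
        ActionCandidate("grasp", (x, y, z + 0.05)),  # approach from above
        ActionCandidate("grasp", (x, y, z - 0.05)),  # approach from below
    ]


def optimize(c: ActionCandidate) -> ActionCandidate:
    # Stand-in for pose refinement: snap coordinates to a 1 cm grid.
    x, y, z = c.position
    return ActionCandidate(c.kind, (round(x, 2), round(y, 2), round(z, 2)))


def validate(c: ActionCandidate) -> bool:
    # Stand-in for a physics check: the pose must lie above the table plane.
    return c.position[2] > 0.0


def augment(c: ActionCandidate) -> List[ActionCandidate]:
    # Mirror the pose across the plane y = 0.
    x, y, z = c.position
    return [ActionCandidate(c.kind, (x, -y, z))]


labels = ground_priors([(0.10, 0.02, 0.03)], generate, optimize, validate, augment)
```

In this toy run the candidate below the table plane is discarded, while the surviving candidate and its mirrored variant are kept. A real validator would presumably roll the candidate out in a physics simulator instead of testing a single coordinate.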

Skill-consumable data

Use reusable atomic skills to consume grasp, dexterous grasp, articulation, garment, hanging, insertion, and navigation labels.
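
One plausible way for atomic skills to consume labels is a registry keyed by label type, so that each label is dispatched to the skill that knows how to execute it. The sketch below assumes this design for illustration; the registry, the `atomic_skill` decorator, the label fields, and the string "commands" are all hypothetical and do not describe the actual AnnotateAnything interface.

```python
from typing import Callable, Dict, List

Label = Dict[str, str]               # hypothetical label record
Skill = Callable[[Label], List[str]]  # a skill maps a label to primitive steps

SKILLS: Dict[str, Skill] = {}


def atomic_skill(kind: str) -> Callable[[Skill], Skill]:
    """Register a function as the atomic skill for one label kind."""
    def decorator(fn: Skill) -> Skill:
        SKILLS[kind] = fn
        return fn
    return decorator


@atomic_skill("grasp")
def grasp_skill(label: Label) -> List[str]:
    return [f"move_to {label['pose']}", "close_gripper", "lift"]


@atomic_skill("navigation")
def navigate_skill(label: Label) -> List[str]:
    return [f"drive_to {label['target']}"]


def plan_rollout(labels: List[Label]) -> List[str]:
    """Concatenate the steps of every label that has a registered skill."""
    plan: List[str] = []
    for label in labels:
        skill = SKILLS.get(label["kind"])
        if skill is None:
            continue  # no skill registered for this label kind: skip it
        plan.extend(skill(label))
    return plan


plan = plan_rollout([
    {"kind": "navigation", "target": "kitchen_counter"},
    {"kind": "grasp", "pose": "mug_handle"},
    {"kind": "hanging", "pose": "hook"},  # no skill registered in this sketch
])
```

Keeping the dispatch table separate from the labels means new label kinds (articulation, garment, hanging, insertion) could be supported by registering one more function, without touching the rollout loop.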

Figure: Examples of atomic skills and simulation rollouts generated from AnnotateAnything annotations.

Demo GIFs

Representative annotation-enabled rollouts across articulated objects, dexterous and parallel grippers, and bimanual skills.

Articulated door opening: a robot opens an articulated door.

Dexterous grasping: a dexterous hand grasps an object.

Bimanual manipulation: two arms manipulate an object together.

In-hand rotation: an object is rotated in hand.

Annotation Types

Level	Annotation	Example outputs
Asset	Language	Sparse tags; object-, part-, and task-level descriptions
Asset	Visual	3D keypoints; part masks; geometric regions
Room	Language	Scene descriptions; object relations; task contexts
Room	Visual	Occupancy maps; floor plans; top-view layouts
Action	Manipulation	Grasps; dexterous contacts; articulation waypoints; insertion and hanging poses
Action	Scene	Navigation targets; approach poses; interaction-ready base poses
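
To make the table concrete, the sketch below shows one way the three annotation families could be stored in a single serializable record. The class names, field names, and the 7-number pose encoding (position plus quaternion) are assumptions chosen for illustration, not the released data format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class LanguageAnnotation:
    tags: List[str] = field(default_factory=list)
    # Keyed by granularity, e.g. "object", "part", "task" (assumed keys).
    descriptions: Dict[str, str] = field(default_factory=dict)


@dataclass
class VisualAnnotation:
    # Keypoint name -> [x, y, z] in the asset frame (assumed convention).
    keypoints: Dict[str, List[float]] = field(default_factory=dict)
    # Part name -> vertex indices belonging to that part (assumed encoding).
    part_masks: Dict[str, List[int]] = field(default_factory=dict)


@dataclass
class ActionAnnotation:
    kind: str                 # e.g. "grasp", "articulation", "navigation"
    poses: List[List[float]]  # each pose as [x, y, z, qx, qy, qz, qw] (assumed)


@dataclass
class AnnotatedAsset:
    asset_id: str
    level: str  # "asset" or "room"
    language: LanguageAnnotation
    visual: VisualAnnotation
    actions: List[ActionAnnotation] = field(default_factory=list)


record = AnnotatedAsset(
    asset_id="mug_001",  # hypothetical identifier
    level="asset",
    language=LanguageAnnotation(
        tags=["mug", "ceramic"],
        descriptions={"object": "a mug with one handle"},
    ),
    visual=VisualAnnotation(
        keypoints={"handle_center": [0.06, 0.0, 0.05]},
        part_masks={"handle": [10, 11, 12]},
    ),
    actions=[ActionAnnotation("grasp", [[0.06, 0.0, 0.05, 0.0, 0.0, 0.0, 1.0]])],
)

# Nested dataclasses convert to plain dicts and lists, so the record is JSON-ready.
serialized = json.dumps(asdict(record), indent=2)
restored = json.loads(serialized)
```

Room-level records could reuse the same structure with `level="room"` and room-specific visual fields such as occupancy maps, which are omitted here for brevity.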

Results

Broad coverage: Rigid, articulated, deformable, garment, and room-scale assets.

Executable labels: Physics validation filters visually plausible but infeasible action candidates.

Reusable skills: Annotations are directly consumed by atomic skills and rollout pipelines.

Detailed quantitative tables will be added after the anonymous submission is finalized.

Citation

@inproceedings{anonymous2026annotateanything,
  title     = {AnnotateAnything: Automatic Annotation of 3D Assets for Robot Manipulation},
  author    = {Anonymous Authors},
  booktitle = {Submitted to NeurIPS},
  year      = {2026}
}