MolmoAct2 · BimanualYAM Training Data Analysis

Key Findings

Finding 01 — Data Concentration

tablebuss (101 datasets, 4,177 instr) + foldclo (92 datasets, 3,678 instr) + scan (71 datasets, 2,035 instr) together account for ~48% of all instructions. The distribution is heavily long-tailed with the top 3 families dominating.
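The ~48% figure can be checked from the counts quoted in this finding. A minimal sketch, assuming the 20,554 total instruction count reported in the word-cloud caption of this analysis; the variable names are illustrative only.

```python
# Instruction counts for the three largest families, as quoted in Finding 01.
top3 = {"tablebuss": 4177, "foldclo": 3678, "scan": 2035}

# Total instruction count across all families (taken from the word-cloud caption).
total_instructions = 20554

top3_total = sum(top3.values())
top3_share = top3_total / total_instructions

print(f"top-3 instructions: {top3_total}")  # 9890
print(f"top-3 share: {top3_share:.1%}")     # 48.1%
```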

Finding 02 — Instruction Diversity vs Volume

charging (dedup ratio 0.21) and plate-cleaning (0.23) have extremely low diversity despite high episode counts — the re-annotation pipeline generates many near-identical instructions. toy (0.92) and gro (0.87) have much richer per-episode instruction variety.

Finding 03 — Success Rate vs Coverage

The paper reports success rates (Table 19) for only 5 task families. cup has the highest average success (77%), with 13 datasets and a clean pick-and-place structure. tool/pegboard scores only 14% despite 20 datasets, suggesting that data volume alone does not guarantee performance on dexterous tasks.

⚠️ Pour task is absent from all 737 datasets. The training data has zero coverage of any pouring, liquid transfer, or tilting behavior. Zero-shot failure on pour-task instructions is therefore expected: the action expert has no robot demonstration pairing this instruction family to motor behaviors. This is distinct from the VLM backbone's semantic understanding of "pour."
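A coverage check of this kind can be approximated with a keyword scan over the instruction strings. The sketch below is an assumption about how such an audit might look, not the method used for this analysis: the keyword list, the function name find_matches, and the sample instructions are all hypothetical.

```python
import re

# Hypothetical keyword list for pouring-like behavior; extend as needed.
# \w* lets each stem match inflected forms such as "pouring" or "tilted".
POUR_PATTERN = re.compile(r"\b(?:pour|tilt|decant|liquid)\w*\b", re.IGNORECASE)


def find_matches(instructions, pattern=POUR_PATTERN):
    """Return the instructions that contain at least one keyword match."""
    return [text for text in instructions if pattern.search(text)]


# Toy instructions for illustration only (not taken from the dataset).
sample = [
    "place the blue cup on the white plate",
    "fold the green towel",
    "pick up the black box",
]

matches = find_matches(sample)
print(f"{len(matches)} of {len(sample)} instructions mention pouring-like behavior")
```

A keyword scan only shows that the vocabulary is missing. It would not catch a pouring demonstration that was annotated with different wording, so an empty result is evidence of absence rather than proof.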

Distribution Overview

Datasets per Task Family

Instructions per Task Family (Total vs Unique)

Success Rate vs Dataset Count (paper Table 19)

Only families with paper-reported success rates shown. Bubble size = unique instruction count.

Deduplication Ratio per Family

unique / total instructions. Low ratio = repetitive re-annotation; high = diverse.
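The ratio defined above can be sketched as follows. This assumes exact string matching by default; whether the original analysis normalized case or whitespace before deduplicating is not stated, so the normalize flag is a hypothetical option.

```python
def dedup_ratio(instructions, normalize=False):
    """Return unique / total instruction count (0.0 for an empty list).

    normalize=True lowercases and collapses whitespace before comparing.
    This is an assumed option, not a documented step of the pipeline.
    """
    if not instructions:
        return 0.0
    if normalize:
        instructions = [" ".join(text.lower().split()) for text in instructions]
    return len(set(instructions)) / len(instructions)


# Toy example: 2 unique strings out of 4 gives a ratio of 0.5.
toy = ["pick up the cup", "pick up the cup", "fold the towel", "pick up the cup"]
print(dedup_ratio(toy))  # 0.5

# Corpus-level ratio from the counts reported in this analysis.
overall = 11451 / 20554
print(f"overall dedup ratio: {overall:.2f}")  # 0.56
```

Against the corpus-level value of roughly 0.56, charging (0.21) and plate-cleaning (0.23) sit well below average, while toy (0.92) and gro (0.87) sit well above it.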

Instruction Vocabulary Analysis

Finding 04 — Vocabulary Structure

Top tokens across all instructions are color words (blue, white, black, green, yellow) and object nouns (box, bowl, cup, plate). Instruction diversity is driven almost entirely by object-attribute variation, not task-type variation. The action verbs are a small, repetitive set: place, pick up, fold, grasp, put. "Pour" and related liquid-transfer vocabulary are entirely absent — consistent with zero-shot failure on pour tasks.
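A token-frequency breakdown like the one described can be sketched with a simple counter. The word lists below are copied from this finding. The tokenizer, the function names, and the sample instructions are assumptions for illustration. "pick up" is counted through its first token, "pick".

```python
import re
from collections import Counter

# Word lists taken from Finding 04.
CATEGORIES = {
    "color": {"blue", "white", "black", "green", "yellow"},
    "object": {"box", "bowl", "cup", "plate"},
    "verb": {"place", "pick", "fold", "grasp", "put"},
}


def token_counts(instructions):
    """Count lowercase alphabetic tokens across all instructions."""
    counts = Counter()
    for text in instructions:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def category_totals(counts):
    """Sum token counts for each word list in CATEGORIES."""
    return {
        name: sum(counts[word] for word in words)
        for name, words in CATEGORIES.items()
    }


# Toy instructions for illustration only (not taken from the dataset).
sample = [
    "pick up the blue cup",
    "place the blue cup on the white plate",
    "fold the green towel",
]

counts = token_counts(sample)
totals = category_totals(counts)
print(totals)
print("pour occurrences:", counts["pour"])
```

Counter returns 0 for a token that never occurs, so the same lookup works as an absence check for "pour" and related vocabulary.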

All Instructions — Word Cloud

All 20,554 instructions. Color and object words dominate; task verbs are a narrow, repetitive set.

Unique Instructions Only — Word Cloud

Word cloud of the 11,451 unique instructions in BimanualYAM (deduplicated). The distribution tightens further: the relative weight of rare action verbs drops once re-annotation copies are removed.

Per-Family Detail Table

Family	Datasets	Total Instr	Unique Instr	Dedup Ratio	Paper Tasks (MolmoAct2 %)	Avg Success