Tony Wang
02/27/2025
X-Embodiment: “Pushing the Limits of Cross-Embodiment Learning for Manipulation and Navigation”
“SFV: Reinforcement Learning of Physical Skills from Videos”.
If robots are ever going to truly understand the world, they need to learn like we do—by watching, experimenting, and transferring skills across different situations. Two recent papers push the boundaries of how robots can generalize knowledge across embodiments and learn physical skills from video. One asks: Can a single model control vastly different robots—arms, drones, quadrupeds—by finding commonalities in their motion? The other asks: Can we skip expensive motion capture and teach robots directly from online videos?
One of the biggest roadblocks in robotics is that most models are hyper-specialized—a robotic arm trained to stack blocks won’t suddenly know how to navigate a hallway. But humans don’t work that way. We use shared sensorimotor principles across different tasks.
The X-Embodiment paper explores this by training a single goal-conditioned policy across 18 datasets spanning robotic manipulation, navigation, and even driving. The key idea? Motion, whether it's a robotic arm reaching for a cup or a wheeled robot moving toward a waypoint, follows similar geometric constraints. By aligning action coordinate frames across embodiments, they show that co-training improves both manipulation and navigation performance.
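The alignment idea can be sketched roughly as follows. This is a hypothetical illustration, not the paper's actual code: function names and the choice of a shared (dx, dy, dz, dyaw) action frame are my assumptions, but the spirit is that each embodiment's native action gets mapped into one common relative-motion space.

```python
import numpy as np

def to_shared_action(embodiment, native_action):
    """Map a native action into a shared (dx, dy, dz, dyaw) frame.

    Hypothetical sketch: the paper aligns action coordinate frames so
    that, e.g., an arm's end-effector delta and a wheeled robot's
    waypoint displacement live in the same space.
    """
    if embodiment == "arm":
        # Assume the end-effector Cartesian delta (dx, dy, dz) plus a
        # yaw delta has already been computed from the joint command.
        dx, dy, dz = native_action[:3]
        return np.array([dx, dy, dz, native_action[3]])
    if embodiment == "wheeled":
        # native_action: (forward velocity v, turn rate w, duration dt);
        # integrate over dt to get a planar displacement in the same frame.
        v, w, dt = native_action
        return np.array([v * dt, 0.0, 0.0, w * dt])
    raise ValueError(f"unknown embodiment: {embodiment}")

arm = to_shared_action("arm", np.array([0.02, 0.0, -0.01, 0.1]))
base = to_shared_action("wheeled", np.array([0.5, 0.2, 0.1]))
assert arm.shape == base.shape == (4,)
```

Once actions live in one frame, a single goal-conditioned policy can be co-trained on all datasets, which is what makes the cross-embodiment transfer possible.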
This reinforces the idea that robots can share structured knowledge across bodies, even when their action spaces look completely different.
But does this mean we can train a single model to control all robots? Probably not yet. The policy still struggles with embodiment-specific constraints—manipulators are limited by joint angles, wheeled robots can’t move vertically. While the paper proves that large-scale cross-embodiment learning is possible, the next challenge is ensuring fine-grained control without losing generalization.
Imagine if a humanoid robot could learn parkour just by watching viral clips of free runners. That’s the promise of SFV (Skills from Videos)—a system that combines pose estimation from videos with reinforcement learning to teach simulated robots to perform dynamic skills.
The pipeline works in three stages: (1) pose estimation extracts the actor's pose from each video frame; (2) motion reconstruction stitches the per-frame poses into a temporally consistent reference trajectory; (3) motion imitation trains an RL policy to reproduce that trajectory in a physics simulator.
What’s impressive is that this method enables robots to learn highly dynamic behaviors like flips, cartwheels, and martial arts—without any motion capture data. Since traditional motion capture is expensive and limited, this unlocks a massive dataset of real-world demonstrations from online videos.
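The motion-imitation stage can be sketched with a pose-tracking reward in the style of SFV/DeepMimic: the policy is rewarded for keeping the simulated character's pose close to the video-derived reference. The function name, `sigma`, and the toy pose vectors below are my assumptions; the paper's full reward also includes velocity and end-effector terms.

```python
import numpy as np

def imitation_reward(sim_pose, ref_pose, sigma=1.0):
    """Exponentiated negative squared distance between the simulated
    pose and the reference pose extracted from video (a sketch of the
    SFV/DeepMimic-style pose-tracking term)."""
    err = np.sum((sim_pose - ref_pose) ** 2)
    return float(np.exp(-sigma * err))

ref = np.array([0.0, 1.2, 0.3])   # hypothetical joint angles from video pose estimation
sim = np.array([0.1, 1.1, 0.3])   # current simulated joint angles
r = imitation_reward(sim, ref)
assert 0.0 < r <= 1.0
```

The reward peaks at 1.0 when the simulated pose matches the reference exactly, which gives RL a dense signal even though the original video provides no actions or torques.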
But here's the catch: pose estimation from monocular video is noisy, so the reference motions can be jittery or physically implausible, and the learned skills have so far been demonstrated on simulated characters rather than real robots.
Both of these papers hint at a future where robots learn more like humans—not just from carefully curated datasets, but by watching, reasoning, and generalizing. But there’s still a gap between data-driven learning and true embodied intelligence.
At the end of the day, the real question is: How do we balance generalization and specialization? Can we build a single robot policy that adapts to any task, or do we always need some level of fine-tuning? The answer might be somewhere in the middle—a generalist foundation model that learns core sensorimotor principles, with task-specific refinement for high-precision control.
Either way, the future of robot learning isn’t just about bigger datasets. It’s about smarter, more structured representations of the world.
Only Antonio today. No prelec
-> how to align with learning data?
-> how to realize generalist robot?
Antonio is optimistic about this direction.
https://hgaurav2k.github.io/hop/ — HOP uses human hand poses and learns from their trajectories
ASI’s challenge
using video to generate keypoints for RL is still very hot today
Question: model arch?
Question: any reproductions? Does it work?
the navigation dataset could benefit the manipulation task (how?)
co-training improves performance for ALOHA and TELLO (unseen in training data)
"You should not trust your Isaac Sim simulator too much." (Antonio)
RoboNet from Kostas group
https://www.analyticsinsight.net/latest-news/researchers-developed-robonet-easy-capture-diverse-datamore
more I forgot…
some fine-tuning may be necessary for cross-embodiment training
Debate on generalist policies: pi0 and Helix…
Roboticists should always look for failures and limitations.