LEGO Logo

Learning to Grasp Anything by Playing with Random Toys

Dantong Niu1,*, Yuvan Sharma1,*, Baifeng Shi1,*
Rachel Ding1, Matteo Gioia2,§, Haoru Xue1, Henry Tsai1, Konstantinos Kallidromitis3, Anirudh Pai1
Shankar Shastry1,†, Trevor Darrell1,†, Jitendra Malik1,†, Roei Herzig1,†

*Equal contribution    †Equal advising    §Work done while interning at ItalAI
1University of California, Berkeley    2Sapienza University of Rome    3Panasonic   

TLDR: Training robots on random toys enables zero-shot grasping of real-world objects.

LEGO Teaser

Zero-Shot Grasping Demos: Dexterous Hands + Humanoid

Trained on Randomized Toys, Tested on Real-World Objects

Grasping Demonstrations for Randomized Toys (Training Set)

Zero-Shot Grasping on Real-World Objects (Evaluation Set)

Zero-Shot Grasping Demos: Franka DROID

Trained on Randomized Toys, Tested on Real-World Objects

Grasping Demonstrations for Randomized Toys (Training Set)

Zero-Shot Grasping on YCB Real-World Objects (Evaluation Set)

Zero-Shot Grasping Demos: ManiSkill YCB Benchmark

Trained on Randomized Toys, Tested on Real-World Objects

Grasping Demonstrations for Randomized Toys (Training Set)

Zero-Shot Grasping on Real-World Objects (Evaluation Set)

Background

"Treat nature by means of the cylinder, the sphere, the cone, everything brought into proper perspective."

Paul Cézanne

  • Robotic manipulation policies often struggle to generalize to novel objects, limiting real-world utility.
  • Inspired by the ability of children to learn manipulation skills from simple toys, we investigate whether robots can develop generalizable dexterous grasping skills by learning from a small set of simple, randomized "toy" objects.
  • Training uses objects composed of just four shape primitives (spheres, cuboids, cylinders, and rings), and we find that our model generalizes zero-shot to 64 real-world objects.
  • We find that the key to this generalization is an object-centric visual representation induced by our detection pooling mechanism.
  • Evaluated in simulation and on physical robots, our model achieves a 67% real-world grasping success rate on the YCB dataset, outperforming state-of-the-art methods that use more in-domain data.
  • We also study how zero-shot generalization scales with the number/diversity of training toys and demonstrations per toy.
  • This approach provides a promising path toward scalable and generalizable robotic manipulation learning.

Detection Pooling

  • To learn a policy that generalizes to novel objects from randomized toys, we design the vision encoder to be object-centric using a mechanism called detection pooling.
  • Detection pooling ensures that the visual feature focuses on the object to be grasped:
    • First, we obtain the object segmentation mask for each frame using SAM 2 (or using ground truth masks in simulation).
    • We then use the object mask to set the attention mask in the vision encoder, preventing attention between object and non-object patch tokens.
    • This ensures object patch tokens only contain features from the object itself.
    • Positional embeddings still allow the encoder to understand the object's location in the scene.
  • The final object-centric visual feature is obtained by applying mean pooling on the object patch tokens, which is then fed to the policy model (a standard transformer architecture); a minimal code sketch of this mechanism is shown after the architecture figure below.
  • Empirical results show that detection pooling is crucial for strong zero-shot generalization, compared to other pooling methods (mean or attention pooling) that do not restrict attention within the ViT and only pool the final output tokens (see Results section).
LEGO Architecture
(a) LEGO uses a ViT with detection pooling to extract features of the target object and a transformer to predict future actions from the visual features and proprioception. (b) The ViT extracts features focused on the target object via detection pooling, which restricts attention to the object patches using an attention mask and mean-pools the output object patch tokens to obtain the final object-centric visual feature.
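To make detection pooling concrete, below is a minimal PyTorch sketch of the idea. It is illustrative only: the block-diagonal object/background attention mask, the single attention call standing in for one ViT block, and the concatenation of the pooled feature with proprioception are assumptions made for this example rather than the released implementation; in the actual system the object mask comes from SAM 2 (or ground truth in simulation) and the masking is applied inside every self-attention layer of a full pretrained ViT.

import torch
import torch.nn.functional as F

def build_attention_mask(patch_is_object: torch.Tensor) -> torch.Tensor:
    """Boolean (N, N) mask of allowed attention among the N patch tokens.

    patch_is_object: (N,) bool, True where a patch overlaps the target object
    (e.g. the SAM 2 mask downsampled to patch resolution). Object patches may
    only attend to object patches, so object tokens never mix in background
    features; background patches likewise attend only to background patches.
    """
    obj = patch_is_object
    return (obj[:, None] & obj[None, :]) | (~obj[:, None] & ~obj[None, :])

def masked_self_attention(tokens: torch.Tensor, allowed: torch.Tensor) -> torch.Tensor:
    """Single-head self-attention restricted by `allowed` (stand-in for one ViT block)."""
    return F.scaled_dot_product_attention(
        tokens.unsqueeze(0), tokens.unsqueeze(0), tokens.unsqueeze(0),
        attn_mask=allowed.unsqueeze(0),
    ).squeeze(0)

def detection_pooling(patch_tokens: torch.Tensor, patch_is_object: torch.Tensor) -> torch.Tensor:
    """Mean-pool the final-layer patch tokens that belong to the object."""
    return patch_tokens[patch_is_object].mean(dim=0)  # (D,)

# Toy end-to-end usage (shapes only; the real encoder and policy are learned models).
N, D = 196, 768                        # 14x14 patches, ViT-B width
tokens = torch.randn(N, D)             # patch tokens, positional embeddings already added
patch_is_object = torch.zeros(N, dtype=torch.bool)
patch_is_object[40:60] = True          # pretend these patches cover the target object

allowed = build_attention_mask(patch_is_object)
tokens = masked_self_attention(tokens, allowed)        # in the real ViT, every block is masked
object_feature = detection_pooling(tokens, patch_is_object)

proprioception = torch.randn(14)       # e.g. joint angles + gripper state (dimension is illustrative)
policy_input = torch.cat([object_feature, proprioception])  # how the policy combines both is an assumption

Because attention is restricted inside the encoder rather than only at the output, the pooled object tokens contain features computed from the object alone, while their positional embeddings still tell the policy where the object sits in the scene.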

Creating Randomized Robot Toys

To create randomized toys for learning generalizable grasping skills, we draw inspiration from Cézanne's classic idea that everyday objects can be decomposed into combinations of simple shape primitives. Specifically, we use four basic primitives—cuboids, spheres, cylinders, and rings—and generate randomized toys by selecting up to five primitives per toy and assigning them random positions and orientations, while enforcing intersections so that each toy forms a single solid structure. Each toy is then assigned one of four colors: red, green, blue, or yellow. For real-world experiments, we 3D print 250 of these toys. Examples of the randomized toys are shown in the figure below.

LEGO Toys
Randomized Robot Toys created from Primitive Shapes.
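As a rough illustration of the generation procedure, the sketch below samples a toy as up to five randomly posed primitives, attaching each new primitive only if it intersects the structure built so far, and then assigns one of the four colors. The size ranges, the bounding-sphere intersection test, and the dictionary representation are simplifications invented for this example; producing the actual simulation assets and 3D-printable meshes would additionally require a CAD or mesh library.

import random
import numpy as np

PRIMITIVES = ["cuboid", "sphere", "cylinder", "ring"]
COLORS = ["red", "green", "blue", "yellow"]

def sample_primitive():
    """One randomly sized, positioned, and oriented primitive."""
    return {
        "type": random.choice(PRIMITIVES),
        "size": float(np.random.uniform(0.02, 0.06)),           # characteristic radius in meters (illustrative range)
        "position": np.random.uniform(-0.05, 0.05, size=3),
        "rotation": np.random.uniform(0.0, 2 * np.pi, size=3),  # Euler angles
    }

def intersects(a, b):
    """Cheap overlap test: treat each primitive as its bounding sphere."""
    return np.linalg.norm(a["position"] - b["position"]) < a["size"] + b["size"]

def sample_toy(max_primitives=5, max_tries=100):
    """Compose up to `max_primitives` intersecting primitives into one solid toy."""
    parts = [sample_primitive()]
    n_parts = random.randint(1, max_primitives)
    while len(parts) < n_parts:
        for _ in range(max_tries):
            candidate = sample_primitive()
            # Require the new primitive to intersect an existing one so the toy
            # stays a single solid structure rather than disconnected pieces.
            if any(intersects(candidate, p) for p in parts):
                parts.append(candidate)
                break
        else:
            break  # could not attach another primitive; keep the smaller toy
    return {"color": random.choice(COLORS), "parts": parts}

toy = sample_toy()
print(len(toy["parts"]), "primitives,", toy["color"])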

Results

ManiSkill Simulation Grasping Results

Results of zero-shot grasping in the ManiSkill simulation environment on 65 objects from the YCB Object Benchmark. All models shown are trained on the same number of randomized toy grasping demonstrations. Our model outperforms finetuned baselines, achieving an 80% success rate, with detection pooling proving key to generalization.
Method                 # Demonstrations
                         250     500    1000    1500    2000    2500
OpenVLA-OFT            30.10   36.35   22.31   15.38   14.71   12.79
π₀-FAST                 8.85    7.60    7.69    8.56    4.23    4.13
Ours - Attn Pooling    34.71   40.10   44.23   48.27   49.81   51.63
Ours - CLS Pooling     24.71   20.29   36.92   41.44   42.40   49.81
Ours - Mean Pooling    32.98   30.38   36.15   39.90   40.29   40.58
Ours - Det. Pooling    56.63   68.17   71.15   74.62   76.83   80.00

Real-World Franka DROID Grasping Results

Results of zero-shot grasping in the real-world Franka DROID setting on 64 objects from the YCB Object Benchmark. Models tuned on toys are trained with a total of 1,500 grasping demonstrations across 250 toys. During evaluation, each YCB object is tested 16 times across a predefined 4×4 grid, and results are averaged to obtain the final success rate. LEGO achieves an overall success rate of 67%, outperforming large-scale robotic models such as zero-shot π₀-FAST and OpenVLA, as well as ShapeGrasp, which uses a pre-trained LLM to choose grasp points.
Compared methods (per-method success rates shown in the bar chart):
  • OpenVLA (Tuned on Toys + Large-Scale Pretraining)
  • ShapeGrasp (Zero Shot, uses LLM)
  • π₀-FAST (Zero Shot, Large-Scale Pretraining with DROID data)
  • Ours (Tuned on Toys)
  • π₀-FAST (Tuned on Toys + Large-Scale Pretraining with DROID data)

Real-World H1-2 + Dexterous Hands Grasping Results

Results of zero-shot grasping with the real-world H1-2 humanoid equipped with dexterous hands, on 13 real-world objects. Models were tuned on a total of 500 grasping demonstrations across 250 toys. During evaluation, each object was tested 5 times across a predefined grid, and results were averaged to obtain the final success rate. LEGO achieves an overall success rate of 51%, outperforming both large-scale VLA models tested, π₀-FAST and OpenVLA.
Compared methods (per-method success rates shown in the bar chart):
  • OpenVLA (Tuned on Toys + Large-Scale Pretraining)
  • π₀-FAST (Tuned on Toys + Large-Scale Pretraining)
  • Ours (Tuned on Toys)

Scaling Ablations

We perform an ablation study in simulation to examine how both the number of unique toys in the training set and the number of grasping demonstrations influence performance. Specifically, we construct six object sets containing 1, 25, 125, 250, 500, and 1000 unique toys, respectively. For each set, we collect 2,500 grasping demonstrations and train our model using varying numbers of demonstrations per set. The results, shown in the left panel of the figure below, indicate that increasing the number of unique objects improves performance, but with diminishing returns. In contrast, the number of demonstrations has a stronger impact on learning generalizable grasping. The right panel of the figure illustrates the results of an ablation study on the size of the model's transformer policy, where we find ViT-B (86M parameters) to achieve the best overall performance.
LEGO Scaling Ablations
Scaling Ablations. (Left) Grasping success rate as a function of the number of unique toys in the training set and the number of demonstrations per set. (Right) Grasping success rate as a function of the size of the transformer policy model.

Citation

@misc{niu2025learninggraspplayingrandom,
      title={Learning to Grasp Anything by Playing with Random Toys}, 
      author={Dantong Niu and Yuvan Sharma and Baifeng Shi and Rachel Ding and Matteo Gioia and Haoru Xue and Henry Tsai and Konstantinos Kallidromitis and Anirudh Pai and Shankar Shastry and Trevor Darrell and Jitendra Malik and Roei Herzig},
      year={2025},
      eprint={2510.12866},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2510.12866}, 
}