This figure compares the rewards of the Franka Gripper baseline, three Human Prompt experiments, and our proposed method across 12 tasks. For each method, the darker bars (matching the legend colors) show the average reward over five runs, while the paler bars above them show the best reward across those runs.
VLMgineer performs consistently well across tasks in terms of both average and best rewards. We now turn to individual method comparisons. As expected, the default Franka Panda two-finger gripper fails on the majority of these tasks. More noteworthy is that VLMgineer outperforms human prompting across all tasks on both metrics, delivering stronger and more reliable performance. While human prompts occasionally produced strong solutions, their results were less consistent and less efficient. In tasks such as CleanTable and ScoreGoal, both approaches reached similar peak rewards, but our method did so with significantly shorter paths.