Everything you need to know to score robot behavior.
We are training a humanoid robot to follow instructions like a good human worker. The dataset consists of head-mounted camera videos of the humanoid doing various tasks — folding clothes, picking up objects, cleaning the kitchen, and many others. Each video contains many instructions, with at most one instruction active at any time.
The robot will learn to maximize its score, so your scores directly shape its behavior.
Each episode is segmented into labels — each label has a start frame, end frame, and a language instruction (e.g. "pick up the red block").
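As a rough mental model, a label can be pictured as a small record, with at most one label covering any given frame. The sketch below is illustrative only: the names `Label`, `active_label`, and `episode`, the inclusive frame bounds, and the "fold the towel" instruction are assumptions, not the tool's actual data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Label:
    """Hypothetical label record: a frame range plus a language instruction."""
    start_frame: int   # first frame of the segment (assumed inclusive)
    end_frame: int     # last frame of the segment (assumed inclusive)
    instruction: str


def active_label(labels: List[Label], frame: int) -> Optional[Label]:
    """Return the label active at `frame`, or None if no instruction is active."""
    matches = [lb for lb in labels if lb.start_frame <= frame <= lb.end_frame]
    if len(matches) > 1:
        # The guide states that at most one instruction is active at any time.
        raise ValueError(f"more than one instruction active at frame {frame}")
    return matches[0] if matches else None


# Made-up example episode with a gap between the two labels.
episode = [
    Label(0, 120, "pick up the red block"),
    Label(150, 300, "fold the towel"),
]
```

Frames that fall between labels, such as frame 130 in this made-up episode, have no active instruction and are not scored.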
Pretend you gave the instruction to the robot to accomplish some goal. Your job is to score from 0 to 5 how much you like the robot's behavior while that instruction is active. Think of it like a star review — how satisfied are you with how the robot executed the task?
| Score | Meaning | Description |
|---|---|---|
| 0 | Catastrophic | The robot endangers or injures a human, destroys something very valuable (e.g. a TV), or creates a dangerous situation (e.g. starts a fire) |
| 1 | Very bad | Highly unsatisfied. The robot doesn't follow the instruction at all, breaks or destroys something, or hurts itself |
| 2 | Bad | The robot attempts the task but does it poorly — wrong object, clumsy execution, or mostly fails to accomplish the goal |
| 3 | Ok | The robot makes a reasonable attempt. The task is partially accomplished but with noticeable issues or inefficiency |
| 4 | Good | The robot accomplishes the task well. Minor imperfections but you'd be satisfied with the performance |
| 5 | Perfect | You can't find anything to criticize. The robot does exactly as you wish and behaves like an expert human or even superhuman |
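For anyone processing these scores downstream, the rubric maps naturally onto a small enumeration. This is a minimal sketch; the names `Score` and `parse_score` are hypothetical and not part of the labeling tool.

```python
from enum import IntEnum


class Score(IntEnum):
    """The six rubric levels, named after the 'Meaning' column of the table."""
    CATASTROPHIC = 0
    VERY_BAD = 1
    BAD = 2
    OK = 3
    GOOD = 4
    PERFECT = 5


def parse_score(key: str) -> Score:
    """Convert a pressed key ('0' to '5') into a Score; reject anything else."""
    if key not in {"0", "1", "2", "3", "4", "5"}:
        raise ValueError(f"score must be a single digit from 0 to 5, got {key!r}")
    return Score(int(key))
```

Because `Score` is an `IntEnum`, levels compare and sort like plain integers, so `Score.GOOD > Score.OK` holds.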
Before scoring, ask yourself: does this instruction make sense in this situation? Could you reasonably give this instruction to the robot given what's in the scene?
If the instruction makes sense, don't edit it — just score the behavior. A robot that fails to follow a perfectly good instruction should get a low score, not an edited instruction.
For example, the instruction is "pick up the hammer":
Wrong boundaries — Check that the start and end frames match the task. For example, the instruction is "pick up the hammer" and the robot picks it up but then also walks to the kitchen — the end frame should be right after the robot finishes picking up the hammer. Use i to set the start frame and o to set the end frame to the current video position.
Always fix the instruction and boundaries first, then score the corrected segment.
Once the instruction and boundaries are correct, score how well the robot executed the task (press 0–5). See the scoring rubric above.
| Key | Action |
|---|---|
| / | Edit the instruction text |
| i | Set start frame to current video position |
| o | Set end frame to current video position |
| u | Undo your last structural edit |
Edits create a new version of the label (the original is preserved).
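One way to picture this versioning is an append-only history in which each edit adds a new version and undo drops the most recent one. How the tool actually stores versions is not described here, so the sketch below is an assumption throughout, and `LabelVersion` and `LabelHistory` are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LabelVersion:
    """Hypothetical immutable snapshot of a label."""
    start_frame: int
    end_frame: int
    instruction: str


class LabelHistory:
    """Append-only list of versions; index 0 is always the original label."""

    def __init__(self, original: LabelVersion) -> None:
        self._versions = [original]

    @property
    def original(self) -> LabelVersion:
        return self._versions[0]

    @property
    def current(self) -> LabelVersion:
        return self._versions[-1]

    def edit(self, **changes) -> LabelVersion:
        """Create a new version from the current one; older versions are untouched."""
        new_version = replace(self.current, **changes)
        self._versions.append(new_version)
        return new_version

    def undo(self) -> LabelVersion:
        """Drop the latest edit, but never remove the original."""
        if len(self._versions) > 1:
            self._versions.pop()
        return self.current
```

Under this model, trimming the end frame of a "pick up the hammer" label would be `history.edit(end_frame=...)`, after which `history.original` still returns the unedited label.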
| Key | Action |
|---|---|
| Space / k | Play / pause |
| 0 - 5 | Score the selected label |
| n / Tab | Next unreviewed label |
| p | Previous unreviewed label |
| b | Seek to start of current label |
| a | Toggle auto-pause at label end |
| h / l | Step back / forward 1 frame |
| H / L | Step back / forward 1 second |
| i / o | Set start / end frame |
| / | Edit instruction text |
| u | Undo last edit |
| ? | Show shortcut overlay in viewer |
Open these example episodes to see reviewed labels with scoring notes explaining why each score was given. Click any label in the timeline to see its notes in the side panel.