Text-to-motion generation has made remarkable progress over the last few years. Today, you can type a simple prompt such as “A person walks forward, waves, and sits down,” and an AI model can generate a corresponding animation. But things become much more difficult when instructions get longer and more detailed. For example: “A person runs forward, turns left, throws a punch, follows it with a kick, raises their arms in victory, and then waves happily.” For a human, this seq