The next generation of AI models for complex human motion animation

Jun 4

Text-to-motion generation has made remarkable progress over the last few years. Today, you can type a simple prompt such as:

“A person walks forward, waves, and sits down.”

And an AI model can generate a corresponding animation.

But things become much more difficult when instructions get longer and more detailed.

For example:

“A person runs forward, turns left, throws a punch, follows it with a kick, raises their arms in victory, and then waves happily.”

For a human, this sequence is easy to understand. For most AI systems, however, it remains a challenging task.

This blog post is based on our recent research paper "Plan, Don't Pose: Long Composite Motion Generation with Text-Aligned BFM".

📄 Read the preprint: https://arxiv.org/abs/2605.29906

The problem with today's text-to-motion models

Most existing text-to-motion systems try to convert language directly into motion. In practice, this means a single model must simultaneously:

understand the meaning of the text;
determine the correct sequence of actions;
organize behavior over time;
generate realistic body motion frame by frame;
maintain consistency with the physical constraints of human movement.

It's a bit like asking an architect to also be the engineer, construction crew, and project manager at the same time. As prompts become longer and more complex, the likelihood of errors increases.

Models may skip actions, execute them in the wrong order, or produce unnatural transitions between movements. Even when the generated motion appears visually plausible, it may not fully respect the physical structure of the task. Characters can begin an action before completing the previous one, lose coherence over long sequences, or drift away from the intended behavior described in the text.
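To make these failure modes concrete, here is a minimal sketch of how skipped and reordered actions could be counted. It is an illustration only, not the evaluation protocol from the paper. It assumes that some hypothetical action recognizer has already labeled the generated motion, so that both the prompt and the output are available as lists of action names. The function names and labels below are made up for the example.

```python
from typing import List, Sequence


def longest_common_subsequence(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest subsequence shared by a and b, in order."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def sequence_report(expected: List[str], observed: List[str]) -> dict:
    """Summarize skipped actions and how much of the prompt's order survives."""
    skipped = [action for action in expected if action not in observed]
    in_order = longest_common_subsequence(expected, observed)
    return {
        "skipped": skipped,
        "in_order": in_order,
        "order_score": in_order / len(expected) if expected else 1.0,
    }


# Hypothetical labels: the prompt asks for six actions, the output
# drops the kick and swaps the punch with the left turn.
expected = ["run", "turn_left", "punch", "kick", "raise_arms", "wave"]
observed = ["run", "punch", "turn_left", "raise_arms", "wave"]
report = sequence_report(expected, observed)
print(report)
```

The longest common subsequence rewards actions that appear in the right relative order, so a single swap or omission lowers the score without zeroing it out.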

These challenges become particularly apparent for long compositional prompts that contain multiple actions and temporal dependencies. In such cases, generating motion directly from text requires the model to solve planning and execution simultaneously, which can make reliable generation difficult.

One possible solution is to separate these responsibilities. Instead of asking a single model to understand language, plan behavior, and generate physically plausible motion at the same time, we can decompose the problem into two stages: planning and execution.

This is where Behavioral Foundation Models (BFMs) become useful.

What is a Behavioral Foundation Model?

A Behavioral Foundation Model (BFM) is a generative model designed to produce motion from high-level behavioral representations.

Unlike text-to-motion models, a BFM does not operate directly on natural language. Instead, it receives a sequence of behavioral embeddings describing actions, transitions, and other motion-related concepts. Its role is to transform these behavioral instructions into realistic full-body animation.

In a sense, a BFM acts as a dedicated motion engine. Given a behavioral program, it knows how to execute it and generate the corresponding animation.
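The "motion engine" view can be sketched in a few lines. The class below is a toy stand-in, not a real BFM: the name `ToyBFM`, the fixed number of frames per embedding, and the policy that simply nudges the state toward the embedding are all assumptions made for illustration. What it does show is the interface: a sequence of behavioral embeddings goes in, and a sequence of frames comes out, with the state carried over between embeddings so that transitions stay continuous.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class ToyBFM:
    """Hypothetical stand-in for a pretrained BFM: a policy conditioned on an embedding."""

    frames_per_embedding: int = 4  # assumption: each embedding is held for a fixed duration
    step_size: float = 0.5

    def policy(self, state: Vector, embedding: Vector) -> Vector:
        # Toy policy: the action moves the state part of the way toward the embedding.
        return [self.step_size * (z - s) for s, z in zip(state, embedding)]

    def execute(self, plan: List[Vector], initial_state: Vector) -> List[Vector]:
        """Roll the policy out over the plan and return one state per frame."""
        state = list(initial_state)
        frames = []
        for embedding in plan:
            for _ in range(self.frames_per_embedding):
                action = self.policy(state, embedding)
                state = [s + a for s, a in zip(state, action)]  # toy environment step
                frames.append(list(state))
        return frames


bfm = ToyBFM()
plan = [[1.0, 0.0], [0.0, 1.0]]  # two made-up behavioral embeddings
motion = bfm.execute(plan, initial_state=[0.0, 0.0])
print(len(motion), motion[-1])
```

A real BFM would replace the toy policy with a learned controller and the state update with a physics simulation, but the calling pattern would look similar.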

Planning Before Motion

Instead of generating character poses directly from text, the model first creates a compact behavioral plan: a high-level program describing what the character intends to do.

Why plan?

When people perform complex sequences of actions, they do not consciously think about the position of every joint or muscle. Instead, they first form a high-level plan of what they want to accomplish.

Consider a simple everyday situation: you notice your bus arriving and decide to catch it. You run toward the bus, step inside, look around for an available seat, walk toward it, and finally sit down. At no point do you explicitly plan the angle of your knees, the rotation of your shoulders, or the position of your feet. We think in terms of goals and behaviors, while our brains automatically coordinate the countless low-level motions required to execute them.

The same idea is the basis of our research. Instead of generating motion directly, we first generate a behavioral plan and then let a specialized motion model handle the execution.

Introducing Text2BFM

Inspired by this observation, we developed Text2BFM, a text-to-motion framework that separates behavioral planning from motion generation.

The system consists of two main components. First, a text encoder converts the input instruction into a semantic representation that captures the meaning of the described actions. A denoising backbone then transforms this representation into a sequence of behavioral embeddings that serve as a compact plan for the motion.

These embeddings do not specify individual body poses or joint trajectories. Instead, they represent the intended behavior of the character in the latent policy space of a pretrained BFM. The predicted behavioral embeddings are then executed by the frozen BFM to generate the final motion sequence.

The overall process can be viewed as:

Text → Behavioral Plan → BFM → Motion

The key idea is that language understanding and motion synthesis are handled by different components. Text2BFM focuses on predicting a coherent behavioral plan, while the BFM focuses on executing that plan and producing realistic motion.
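The flow above can be sketched as a small pipeline. This is a structural illustration under heavy assumptions, not the Text2BFM implementation: `encode_text` is a hash-based placeholder for a real text encoder, `toy_denoiser` stands in for a learned network, and the refinement loop is a generic iterative scheme that may differ from the denoising procedure used in the paper. The executor is passed in as a plain callable to reflect that the BFM is frozen and separate from the planner.

```python
import hashlib
import random
from typing import Callable, List

Vector = List[float]
Plan = List[Vector]  # one behavioral embedding per plan step


def encode_text(prompt: str, dim: int = 4) -> Vector:
    """Toy text encoder: a deterministic pseudo-embedding derived from a hash."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def toy_denoiser(noisy_plan: Plan, text_vec: Vector) -> Plan:
    """Placeholder for a learned network that predicts a clean plan given the text."""
    # Assumption for the sketch: the clean plan just repeats the text vector.
    return [list(text_vec) for _ in noisy_plan]


def generate_plan(text_vec: Vector, plan_len: int, steps: int = 10, seed: int = 0) -> Plan:
    """Start from Gaussian noise and iteratively blend toward the denoiser's prediction."""
    rng = random.Random(seed)
    plan = [[rng.gauss(0.0, 1.0) for _ in text_vec] for _ in range(plan_len)]
    for step in range(steps):
        predicted = toy_denoiser(plan, text_vec)
        blend = 1.0 / (steps - step)  # grows each step and reaches 1.0 on the last one
        plan = [
            [(1.0 - blend) * p + blend * q for p, q in zip(row, pred_row)]
            for row, pred_row in zip(plan, predicted)
        ]
    return plan


def text_to_motion(
    prompt: str, execute: Callable[[Plan], List[Vector]], plan_len: int = 3
) -> List[Vector]:
    """Text -> behavioral plan -> frozen executor -> motion."""
    plan = generate_plan(encode_text(prompt), plan_len)
    return execute(plan)


def identity_executor(plan: Plan) -> List[Vector]:
    # Stand-in for the frozen BFM; it simply echoes the plan as "frames".
    return [list(step) for step in plan]


prompt = "A person walks forward, waves, and sits down."
motion = text_to_motion(prompt, identity_executor)
print(len(motion), len(motion[0]))
```

Because the planner only ever produces embeddings, the executor can be swapped or upgraded without retraining the language side, which is the practical payoff of the separation.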

This separation allows the system to better preserve long action sequences, maintain temporal consistency, and leverage the rich behavioral knowledge already encoded within the Behavioral Foundation Model. Because the BFM was trained through interaction with an environment, it has learned not only how behaviors are structured but also how they can be executed in a physically consistent manner. As a result, the generated motions naturally respect the dynamics and constraints of the environment, leading to more realistic and coherent motion.

Why this matters

The biggest advantage of this approach becomes apparent when dealing with long and compositional prompts.

Consider a prompt like:

“Perform a cartwheel, then a spin, followed by two twirls, and finish with a jump.”

For such instructions, a model must not only generate realistic motion but also preserve the structure of the described action sequence. This is often challenging for direct text-to-motion approaches, which must reason about planning and motion generation simultaneously.

In Text2BFM, the instruction is first transformed into a sequence of behavioral embeddings that capture the structure of the intended behavior. By operating in the behavioral space learned by the BFM, the model can focus on organizing actions and their temporal relationships before motion is generated.

As a result, Text2BFM is better able to preserve long sequences of actions and maintain consistency throughout the motion. Our experiments show significant improvements in text-motion alignment, particularly for long compositional instructions containing multiple consecutive actions.

Potential Applications

The ability to generate long, structured motion sequences from natural language could make character animation significantly more accessible and scalable.

Potential applications include:

game development;
virtual avatars and digital humans;
VR and immersive experiences;
educational and training simulations;
robotics research.

Instead of manually stitching together dozens of animation clips, creators could describe a complex behavior in natural language and generate an entire motion sequence automatically.

More broadly, approaches such as Text2BFM may help bridge the gap between high-level human intentions and physically realistic character motion, enabling more natural ways of controlling virtual agents and digital characters.

Looking ahead

While the results are promising, many challenges remain.

Some of the most difficult problems include:

rare and highly dynamic acrobatic motions;
interactions with objects and the surrounding environment;
coordinated behaviors involving multiple characters.

At the same time, our work highlights a promising direction for the field. Rather than treating motion generation as a direct language-to-pose mapping problem, it may be beneficial to introduce an intermediate behavioral layer that explicitly captures the structure of human actions.

Behavioral Foundation Models provide a natural way to realize this idea. By separating behavioral planning from motion synthesis, they allow language models to focus on understanding what should happen, while specialized motion models focus on how it should be executed.

As foundation models for behavior continue to improve, we believe they will become an increasingly important building block for the next generation of text-to-motion systems.