
MotionGPT: Human Motion as a Foreign Language

The paper proposes MotionGPT, a unified motion-language model that fuses language data with large-scale motion models to enhance motion-related tasks, achieving state-of-the-art performance on multiple motion tasks.

00:00 Section: 1 Introduction
04:41 Section: 2 Related Work
08:39 Section: 3 Method
12:01 Section: 3.2 Motion-aware Language Model
15:10 Section: 3.3 Training Strategy
18:26 Section: 4 Experiments
22:42 Section: 4.2 Comparisons on Motion-relevant Tasks
26:05 Section: 4.3 Ablation Studies

Paper: https://arxiv.org/abs/2306.14795
YouTube: https://www.youtube.com/@ArxivPapers
Apple Podcasts: https://podcasts.apple.com/us/podcast/arxiv-papers/id1692476016
Spotify: https://podcasters.spotify.com/pod/show/arxiv-papers

Section 1: Introduction

Over the past few years, we've seen incredible advancements in pre-trained large language models such as GPT, BERT, and T5, which have brought together the fields of language, image, mesh, and multimodal modeling. However, a universal pre-trained model for human motion and language has yet to emerge. Such a model, designed to handle a wide range of motion-related tasks, could significantly benefit sectors such as gaming, robotics, virtual assistants, and human behavior analysis.

Earlier studies on human motion have investigated several tasks, including motion generation, motion captioning, and motion prediction. Some recent text-to-motion studies have tried to use pre-trained language-relevant models. For instance, MDM learns a motion diffusion model conditioned on text tokens from CLIP, while MLD improves the efficiency of the motion diffusion process by operating in a motion latent space. MotionCLIP and TM2T, on the other hand, focus on modeling the intertwined relationship between motion and its text description. However, these approaches regard motion and language as separate entities and often require strictly paired motion and text data. Moreover, since these models are task-specific, they struggle to generalize to unseen tasks or data due to their limited understanding of the relationship between motion and language. Our focus is on developing a pre-trained model that understands the correlation between motion and language, which can then be applied to a variety of tasks using a larger and more diverse dataset.

There are two main challenges in pre-training a promising motion-language model: first, modeling the relationship between language and motion; second, creating a uniform multitask framework that generalizes to new tasks. Human motion, like human language, tends to carry semantic links and can be viewed as a kind of body language. Building on this insight, we draw from the vision-language pre-training model BEiT-3 to treat human motion as a unique foreign language. By bringing motion and language data together and encoding them within a single vocabulary, the relationship between motion and language becomes clearer. With the recent surge in large-scale language data and models, pre-training a motion-language model holds great potential to improve performance on motion tasks. Furthermore, pre-training on language allows for textual instructions, making the model more adaptable and user-friendly across a variety of motion tasks.

In this work, we introduce a uniform motion-language framework named MotionGPT. This framework leverages the robust language generation and zero-shot transfer abilities of pre-trained language models for human-motion-related tasks. To equip MotionGPT with the capability to understand and produce human-like motions, we first create a motion vocabulary, analogous to an English vocabulary, using a motion-specific vector quantized variational autoencoder (VQ-VAE), and then turn raw motion data into sequences of motion tokens. These tokens are processed by a pre-trained language model that learns the underlying structure and rules of the motion "language," along with its relationship to the corresponding textual descriptions. We devised a two-stage training scheme to effectively integrate language and motion into MotionGPT: first, the language model is pre-trained on the raw motion dataset to learn the basic structure and rules of the motion language; then, for prompt tuning, the language model is fine-tuned on an instruction dataset, which includes both textual descriptions and motion data, to learn the correlation between the two modalities. Our experiments show that MotionGPT performs exceptionally well on text-to-motion, motion-to-text, motion prediction, and motion in-between tasks.

To summarize our contributions: (1) we propose a uniform motion-language pre-trained model, MotionGPT, which treats human motion as a foreign language, brings natural language models into motion-related generation, and performs various motion tasks with a single model; (2) we introduce a motion-language training scheme with instruction tuning, which learns from task feedback and produces promising results through prompts; (3) we build a general motion benchmark for multitask evaluation, on which MotionGPT shows strong performance across various tasks, including text-to-motion, motion-to-text, motion prediction, and motion in-between. All the code and data related to these tasks are available.

Section summary: The paper discusses the need for a pre-trained motion-language model that can support various motion-relevant tasks and benefit fields like gaming, robotics, virtual assistants, and human behavior analysis. The authors propose a uniform motion-language framework called MotionGPT that treats human motion as a foreign language and integrates motion and language data within a single vocabulary. They also introduce a motion-language training scheme with instruction tuning and achieve state-of-the-art performance on diverse motion tasks.
Section 2: Related Work

Creating human-like motion from a range of inputs, such as text, action labels, and partial motion, is the main goal of human motion synthesis. One of the most essential tasks in this field is text-to-motion generation, which allows motion to be created from user-friendly and convenient language inputs. Models like MDM, MLD, and T2M-GPT have made substantial contributions to this area. MDM proposes a diffusion-based model trained on multiple motion tasks independently; MLD further refines the latent diffusion model to generate motions from different conditional inputs; T2M-GPT, on the other hand, examines a generative framework based on a vector quantized variational autoencoder (VQ-VAE) and a generative pre-trained transformer (GPT) for creating motion. There is also the motion completion task, which generates motion from partial motions, for example creating intermediate motion while the start and end portions remain fixed. Despite the promising outcomes these methods have shown on various human motion tasks, most of them struggle to handle multiple tasks with a single model. To address this limitation, we suggest a unique approach that views human motion as a foreign language and capitalizes on the strong language generation and transfer capabilities of pre-trained language models.

In the realm of human motion captioning, natural language is used to describe human motion by learning the mapping from motions to language, initially with statistical models; recurrent networks have also played a crucial role in this process. A recent innovation in this area, TM2T, introduced a new way to represent motion that compresses it into a brief sequence of discrete variables, then employs a neural translation network to build links between the two modalities. Although earlier research such as TM2T integrated captioning modules into the training pipeline for motion generation, these methods are limited to bi-directional translation between text and motion within one uniform framework.

Large-scale language models, powered by extensive datasets and sizable model architectures, have shown remarkable comprehension and generation capabilities, pushing natural language processing to the next level. BERT, for instance, pre-trains deep bi-directional language representations that aid various downstream tasks, while T5 offers a unified framework that transforms all text-based language problems into a text-to-text format. Further studies have shown that fine-tuning pre-trained models on pairs of instructions and corresponding answers can enhance their performance; FLAN demonstrated an instruction-tuning technique that outperforms non-tuned models on unseen tasks. Multimodal models that process text alongside other modalities, such as images, audio, and video, have recently gained momentum; for example, CLIP learns a semantic latent representation that pairs images with corresponding language descriptions. Despite these strides, multimodal language models capable of handling human motion are still in their infancy. Current text-to-motion generation methods, which can be described as caption-to-motion, have their limitations: while they can create motion from text descriptions, they often struggle to handle user instructions in a context-specific way, as InstructGPT does. In response, we suggest MotionGPT, a model that effectively merges natural language models with human motion tasks, providing a comprehensive answer to motion synthesis challenges.

Section summary: Several methods have been proposed for generating human-like motion, including text-to-motion, motion completion, and motion captioning. However, these methods are limited in their ability to handle multiple tasks and follow context-specific instructions. To address this, the authors propose MotionGPT, which integrates natural language models with human motion tasks to provide a unified solution for motion synthesis problems. This approach leverages the strong language generation and zero-shot transfer abilities of pre-trained language models and aligns the latent space with a motion autoencoder.

Section 3: Method

In our method section, we delve into the details of our proposed model, MotionGPT, which is designed to incorporate large language data and models into motion generation tasks.
This framework combines two primary components: a motion tokenizer and a motion-aware language model. First, the motion tokenizer translates raw motion data into discrete motion tokens. The motion-aware language model, initialized from large language pre-training models, then learns to comprehend these motion tokens, guided by the accompanying textual descriptions, for motion-related tasks. We implement a three-stage training process for MotionGPT, covering training of the motion tokenizer, motion-language pre-training, and instruction tuning.

Our motion tokenizer consists of two parts, a motion encoder and a motion decoder. It translates a sequence of M frames of motion into L motion tokens, and it can also convert these tokens back into the original motion; each token represents a segment of the motion, with the motion temporally downsampled by a factor of M/L. Given a sentence of length N that describes a motion-related question or demand, the goal of MotionGPT is to generate an appropriate answer as a sequence of tokens. These tokens can represent either human motion or text, resulting in either a motion sequence or a descriptive sentence related to the original motion.

To convey motion in discrete tokens, we pre-train a 3D human motion tokenizer based on the vector quantized variational autoencoder (VQ-VAE) architecture. Like the general tokenizer described above, our motion tokenizer comprises an encoder and a decoder: the encoder generates discrete, information-rich motion tokens, while the decoder transforms these tokens back into motion sequences. This approach allows us to present motion as if it were a language, making it easier to merge motion and language for various tasks.

In more detail, the motion encoder applies 1D convolutions to frame-wise motion features along the time dimension to obtain a set of latent vectors. We then transform these vectors into a set of codebook entries through discrete quantization. Our learnable codebook is a collection of K latent embedding vectors, each of dimension D. Quantization replaces each latent vector with its closest codebook entry, effectively mapping these high-dimensional vectors to a finite set of symbols, or tokens. The motion decoder then maps these tokens back into the raw motion space, recreating the motion with M frames.
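To make the quantization step concrete, here is a minimal PyTorch sketch of the nearest-neighbor codebook lookup with a straight-through gradient estimator. The class and variable names are illustrative rather than the authors' implementation; only the codebook shape (K = D = 512, reported later in the implementation details) comes from the paper.

```python
import torch
import torch.nn as nn

class MotionQuantizer(nn.Module):
    """Nearest-neighbor vector quantization over a learnable codebook (sketch)."""

    def __init__(self, K: int = 512, D: int = 512):
        super().__init__()
        self.codebook = nn.Embedding(K, D)  # the K latent embedding vectors of dim D

    def forward(self, z: torch.Tensor):
        # z: (L, D) latent vectors from the 1D-convolutional motion encoder
        dists = torch.cdist(z, self.codebook.weight)  # (L, K) distances to entries
        tokens = dists.argmin(dim=-1)                 # (L,) discrete motion tokens
        z_q = self.codebook(tokens)                   # quantized vectors for the decoder
        # straight-through estimator: copy gradients past the non-differentiable argmin
        z_q = z + (z_q - z).detach()
        return tokens, z_q
```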
To train this motion tokenizer, we utilize three loss functions: the reconstruction loss, the embedding loss, and the commitment loss. To enhance the quality of the generated motion, we also employ an L1 smooth loss and velocity regularization in the reconstruction loss, as well as exponential moving average (EMA) and codebook-reset techniques to optimize codebook utilization during training. More detailed information about the structure and training of our motion tokenizer can be found in the supplementary material.
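A sketch of the combined objective, assuming PyTorch tensors of shape (frames, features); the loss weights `beta` and `lam_vel` are illustrative choices, not values from the paper, and with EMA codebook updates the embedding term would be handled by the EMA update rather than a gradient loss.

```python
import torch.nn.functional as F

def tokenizer_loss(motion, recon, z, z_q, beta=0.25, lam_vel=0.5):
    """Reconstruction + embedding + commitment losses for the motion VQ-VAE (sketch)."""
    # smooth-L1 reconstruction plus a velocity-regularization term on frame differences
    rec = F.smooth_l1_loss(recon, motion)
    vel = F.smooth_l1_loss(recon[1:] - recon[:-1], motion[1:] - motion[:-1])
    # embedding loss moves codebook entries toward encoder outputs;
    # commitment loss keeps encoder outputs close to their assigned entries
    embed = F.mse_loss(z_q, z.detach())
    commit = F.mse_loss(z, z_q.detach())
    return rec + lam_vel * vel + embed + beta * commit
```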
Section summary: The MotionGPT framework integrates large language data and models into motion generation tasks. It consists of a motion tokenizer that converts raw motion data into discrete motion tokens and a motion-aware language model, built on language pre-training models, that understands those motion tokens. The three-stage training scheme of MotionGPT includes training the motion tokenizer, motion-language pre-training, and instruction tuning.

Section 3.2: Motion-aware Language Model

In this section we introduce a novel approach that integrates human motion and language in a unified model. This is made possible by our motion tokenizer, which translates a sequence of human movements into a series of motion tokens, allowing us to represent human motion and language in a similar fashion and leading to a more cohesive, integrated model. To elaborate, each motion token in the series is given a unique index number, creating a sequence of indices. This methodology is akin to how previous language models such as T5 handle text by encoding it into WordPiece tokens. However, where previous methods typically treated text and motion separately, we endeavor to unify them, modeling text and human motion together and consistently. To achieve this, we merge the original text vocabulary with the motion vocabulary, which aligns with our motion codebook; the combined vocabulary includes special tokens that denote the start and end of motion. Consequently, we have a single comprehensive vocabulary that can represent a wide range of motion-related tasks, with both the input and output "words" drawn from the same vocabulary. A word in this context can be natural language, human motion, or a blend of the two, depending on the task at hand. Our model, MotionGPT, is therefore highly flexible: it can represent and generate a variety of motion-related outputs using a single model.
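One plausible way to realize this shared vocabulary, sketched with the Hugging Face tokenizer API; the exact spelling of the motion tokens and boundary markers is an assumption, and in practice the model's embedding table must also be resized to match.

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
# append K motion tokens plus start/end-of-motion markers to the text vocabulary
motion_tokens = [f"<motion_id_{i}>" for i in range(512)]
tokenizer.add_tokens(["<som>", "<eom>"] + motion_tokens)
# afterwards, call model.resize_token_embeddings(len(tokenizer)) on the T5 model

# text and motion now share one token space, so mixed sequences are ordinary inputs
prompt = "Describe this motion: <som> <motion_id_17> <motion_id_3> <eom>"
ids = tokenizer(prompt).input_ids
```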
For generation tasks we use a Transformer-based model that maps input sequences to outputs. The input is a sequence of tokens, each drawn from our unified vocabulary, with the total number of tokens giving the input length; likewise, the output is a series of tokens from the same vocabulary, with its length indicating the output length. The input tokens are processed by the Transformer encoder, and the decoder then predicts the likely next token at each step. This prediction is performed in an autoregressive manner, i.e., each prediction depends on the previous ones. The ultimate aim during training is to maximize the log-likelihood of the data distribution; by doing so, MotionGPT learns to recognize the underlying patterns and relationships within the data, enabling accurate and meaningful generation of the target words. During inference, target tokens are sampled recursively from the predicted distribution until the end token is reached. This generates the target sequence step by step: each token is determined by the previously generated tokens and the given source input, allowing us to create outputs that are as close to the original sequence as possible.
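The inference loop below is a minimal sketch of that recursive sampling; `model.encode` and `model.decode` stand in for any encoder-decoder interface and are not the authors' API.

```python
import torch

@torch.no_grad()
def generate(model, src_ids, bos_id, eos_id, max_len=200):
    """Sample target tokens one at a time until the end token appears (sketch)."""
    memory = model.encode(src_ids)        # the encoder reads the source sequence once
    out = [bos_id]
    for _ in range(max_len):
        logits = model.decode(torch.tensor([out]), memory)[0, -1]  # next-token logits
        probs = torch.softmax(logits, dim=-1)
        nxt = int(torch.multinomial(probs, 1))  # sample from the predicted distribution
        out.append(nxt)
        if nxt == eos_id:                 # stop once the end token is generated
            break
    return out
```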
Section summary: The motion-aware language model maps human motion to motion tokens and combines them with text WordPiece tokens to learn motion and language jointly. The unified text-motion vocabulary allows flexible representation and generation of diverse motion-related outputs within a single model. The Transformer-based model predicts the probability distribution of the next token at each step in an autoregressive manner, generating the target sequence step by step.

Section 3.3: Training Strategy

Our approach to training revolves around transforming T5 models, which have traditionally been exposed only to language data. Our goal is to bridge the gap between motion and language, allowing these models to understand human motion concepts; we accomplish this by introducing a motion vocabulary into the existing language model. The training process unfolds in three phases.

First, we teach the model to understand and represent human motion as discrete tokens, a process we refer to as training the motion tokenizer. With it, any sequence of human motion can be translated into a series of motion tokens, making it compatible with the textual information the model is already familiar with. Once trained, the motion tokenizer is kept frozen throughout the remaining stages.

Second, we train the T5 models on natural language datasets, but with a twist: this time the data is a mix of language and motion, and we use both supervised and unsupervised objectives. In the unsupervised training, a certain percentage of tokens in the input sequence are randomly replaced with special sentinel markers; the target sequence is then formed from the missing spans of tokens, delimited by the same sentinel tokens used in the input. Additionally, we learn the relationship between motion and language using paired datasets of text and motion, training MotionGPT on motion-language translation where the input is either a human motion or a text description. Our aim in this stage is to equip the model with a good understanding of the relationship between text and motion.
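To illustrate the unsupervised objective, here is the standard T5 span-corruption format applied to a caption; the specific sentence is invented for illustration, and the same recipe applies when spans of motion tokens are masked.

```python
# T5-style span corruption: random spans become sentinel tokens in the input,
# and the target reassembles the dropped spans behind the same sentinels.
source = "a person <extra_id_0> forward and then <extra_id_1> down"
target = "<extra_id_0> walks <extra_id_1> sits <extra_id_2>"
# with motion in the vocabulary, spans of <motion_id_*> tokens can be masked the same way
```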
Third, in the final stage, we fine-tune the model's ability to interpret instructions related to motion. This involves constructing a text-motion dataset formulated as instructions, based on existing text-to-motion datasets like HumanML3D and KIT. We define 15 core motion tasks, such as generating a motion sequence from text, creating a caption for a motion, predicting motion, and others. For each task we provide numerous instruction templates, resulting in over a thousand unique tasks, each with its own prompt. For instance, for a motion generation task the instruction might be "Can you generate a motion sequence that depicts a person emulating the motions of a waltz dance?"; for a motion captioning task, the prompt could be "Provide an accurate caption describing the motion of <motion_tokens>", where <motion_tokens> is a sequence of motion tokens created by our motion tokenizer. Our results show that this kind of instruction tuning enhances the model's performance across a wide range of tasks and even improves performance on tasks it has not seen before. Additional examples of prompts can be found in the supplementary material.
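A small sketch of how such instruction templates might be organized and sampled; the template wording follows the examples above, but the data structure and helper function are hypothetical.

```python
import random

TEMPLATES = {
    "text_to_motion": [
        "Can you generate a motion sequence that depicts {caption}?",
        "Show me a motion that matches this description: {caption}",
    ],
    "motion_to_text": [
        "Provide an accurate caption describing the motion of {motion_tokens}.",
        "Describe the following motion: {motion_tokens}",
    ],
}

def build_instruction(task: str, **fields) -> str:
    """Pick a random prompt template for the task and fill in its fields."""
    return random.choice(TEMPLATES[task]).format(**fields)

print(build_instruction("motion_to_text", motion_tokens="<som> <motion_id_42> <eom>"))
```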
Section summary: The training strategy for our model includes three stages: (1) training the motion tokenizer to represent human motion as discrete tokens; (2) a motion-language pre-training stage to learn the relationship between motion and language using unsupervised and supervised objectives; and (3) an instruction-tuning stage to tune the model on prompt-based instructions for different motion-relevant tasks. We construct a multitask text-motion dataset by formulating it as instructions, resulting in more than one thousand different tasks, each with a unique instruction prompt, which leads to improvement across various tasks and enhances model performance on unseen tasks or prompts.

Section 4: Experiments

Our work, MotionGPT, was put to the test through multiple motion-related tasks and datasets. We discuss the dataset settings, how we measured performance, and the fine points of our implementation. We start by showing how MotionGPT compares with other state-of-the-art approaches across an array of tasks, then examine how it performed in specific comparisons on tasks like turning text into motion and vice versa, predicting motion, and creating motion in-between. For a more in-depth look at our results, as well as user studies and further implementation specifics, refer to our supplementary material.

Section 4.1: Experimental Setup

Datasets. Since we target general motion synthesis supporting a range of task settings, we use a variety of previous datasets and modified benchmarks to assess MotionGPT. Our study mainly uses two text-to-motion datasets, HumanML3D and KIT. The KIT dataset provides over 6,000 textual descriptions matched with nearly 4,000 motion sequences, while the newer HumanML3D dataset has around 15,000 motion sequences from AMASS and about 45,000 sequence-level text descriptions. To evaluate MotionGPT on tasks like motion prediction and motion completion, we use the motion sequences in the HumanML3D dataset, which is part of the larger AMASS collection. To ensure a fair comparison, we use the same motion representation as previous studies, combining joint velocities, positions, and rotations; this consistent representation helps future studies evaluate against MotionGPT.

Evaluation metrics. We evaluate our model using four types of metrics. (1) For motion quality, we use the Fréchet Inception Distance (FID) to measure the difference between the feature distributions of generated and real motions; for motion completion, we use Average Displacement Error (ADE) and Final Displacement Error (FDE) to evaluate the accuracy of the predicted motion. (2) For generation diversity, we use the Diversity (DIV) metric to assess variance in the features extracted from the motions, while MultiModality (MModality) measures how diverse the generated motions are for the same text description. (3) For text matching, we use motion-retrieval precision (R-Precision) to evaluate the accuracy of matching between texts and motions, and Multi-Modal Distance (MM Dist) to measure how close the motions and texts are. (4) For linguistic quality, we use natural language metrics such as BLEU, ROUGE, CIDEr, and BERTScore to evaluate the quality of the generated motion captions.
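For reference, FID between two sets of motion features reduces to the closed-form Fréchet distance between fitted Gaussians; the sketch below assumes NumPy arrays of extracted features and leaves the benchmark's pretrained feature extractor out of scope.

```python
import numpy as np
from scipy import linalg

def fid(feat_real: np.ndarray, feat_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to (N, d) feature arrays."""
    mu1, mu2 = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    s1 = np.cov(feat_real, rowvar=False)
    s2 = np.cov(feat_gen, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical noise
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2.0 * covmean))
```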
Implementation details. Our motion tokenizer uses a codebook K ∈ ℝ^{512×512} for most comparisons, and the motion encoder E uses a temporal downsampling rate of 4. Our language model is based on the T5 architecture; the baseline model consists of 12 layers in both the encoder and decoder, the feed-forward networks have an output size of d_ff = 3072, the attention mechanisms have an inner dimension of d_kv = 64, and the other sub-layers and embeddings have a size of d_model = 768. We use the AdamW optimizer to train our models. The motion tokenizers are trained with a learning rate of 10^-4 and a mini-batch size of 256, while the language models use a learning rate of 2×10^-4 for pre-training and 10^-4 for instruction tuning, with a mini-batch size of 16 for both stages. The motion tokenizer was trained for 150,000 iterations, and the language model underwent 300,000 iterations during pre-training and another 300,000 during instruction tuning. All our models were trained on eight Tesla V100 GPUs.
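The reported hyperparameters, restated as a single configuration sketch for quick reference (the dictionary layout is ours; the numbers are from the text above):

```python
config = {
    "tokenizer": {"codebook_K": 512, "codebook_D": 512, "downsample_rate": 4,
                  "lr": 1e-4, "batch_size": 256, "iterations": 150_000},
    "language_model": {"arch": "T5", "layers": 12, "d_model": 768,
                       "d_ff": 3072, "d_kv": 64},
    "optimizer": "AdamW",
    "pretraining": {"lr": 2e-4, "batch_size": 16, "iterations": 300_000},
    "instruction_tuning": {"lr": 1e-4, "batch_size": 16, "iterations": 300_000},
}
```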
Section summary: The performance of MotionGPT is evaluated across various motion-relevant tasks and datasets, including text-to-motion, motion-to-text, motion prediction, and motion in-between. The evaluation metrics cover motion quality, generation diversity, text matching, and linguistic quality. The implementation details include the motion tokenizer codebook, motion encoder, language model, and optimizer used for training.

Section 4.2: Comparisons on Motion-relevant Tasks

This section presents a thorough comparison of our model, MotionGPT, with other leading approaches in the field, broken down into comparisons on multiple tasks: text-to-motion, motion-to-text, and motion prediction and in-between.

In the first part, we discuss our work on various motion-related tasks, all viewed through a unified framework that treats human motion as a foreign language. We use the Flan-T5-base model, pre-trained with a substantial 220 million parameters, as our backbone; this model is then fine-tuned through a pre-training stage and an instruction-tuning phase for all subsequent comparisons. We stack MotionGPT against other cutting-edge methods on key tasks such as text-conditioned motion generation, motion captioning, motion prediction, and motion in-between. The results demonstrate that MotionGPT holds its own on all evaluated tasks, showcasing its potential to manage a broad spectrum of motion tasks with a single model.

The comparisons on text-to-motion center on creating human motion sequences from given text inputs. We take the pre-trained MotionGPT model and fine-tune it for the text-to-motion task, measuring its performance against other top models on the HumanML3D and KIT datasets. Our results, reported with a 95% confidence interval based on 20 repeated runs, show that MotionGPT is competitive on most of the measured metrics.

The next subsection, comparisons on motion-to-text, concerns generating a text description from a provided human motion sequence. We gauge the pre-trained MotionGPT's performance against the recent TM2T model on the HumanML3D dataset, and also report the average number of words used in the text descriptions for further comparison. Note that the standard evaluation uses pre-processed ground-truth text, which disregards grammatical tense and the plural forms of words; in our study we use the actual ground-truth text descriptions for a more accurate evaluation. The comparison reveals that MotionGPT performs exceptionally well in generating text descriptions of given motions.

The final subsection, comparisons on motion prediction and in-between, presents a general evaluation of motion completion. Here we use a portion of the AMASS dataset, a motion-only collection, to test the motion completion ability of MotionGPT. We give the first 20% of the motion sequence as the condition for the motion prediction task, and for in-between we mask about half the motion randomly for completion. We fine-tune MotionGPT specifically for this task and use FID, ADE, and FDE as evaluation metrics. We also test MDM on motion prediction using the provided model, which also supports motion in-between through masked motion in-painting. Our results indicate that MotionGPT achieves the highest quality and diversity in motion completion.
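A sketch of how the two completion conditions can be expressed as visibility masks over a token sequence (True = token given as condition); the function name and exact masking scheme are our illustration of the setup described above.

```python
import torch

def completion_condition(num_tokens: int, task: str) -> torch.Tensor:
    """Build the conditioning mask for motion prediction or in-between (sketch)."""
    mask = torch.zeros(num_tokens, dtype=torch.bool)
    if task == "prediction":
        mask[: int(0.2 * num_tokens)] = True   # the first 20% is given as condition
    elif task == "in_between":
        mask = torch.rand(num_tokens) > 0.5    # ~half masked at random for completion
    return mask
```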
Section summary: The authors introduce a framework that treats human motion as a foreign language, allowing exploration of diverse motion-relevant tasks. They evaluate MotionGPT against state-of-the-art methods on tasks such as text-conditioned motion generation, motion captioning, motion prediction, and motion in-between, achieving competitive performance across all evaluated tasks. Additionally, they compare their model with recent work on generating text descriptions of given motions and demonstrate that MotionGPT outperforms recent work.

Section 4.3: Ablation Studies

In Section 4.3 we discuss several studies that further examine the details of our MotionGPT model. As a language model that understands and generates motion, MotionGPT uses T5 as its underlying architecture; we train these models first with a general pre-training stage followed by specific instruction tuning. Both the size of the model and the training approach play key roles in how well MotionGPT performs, and we put this to the test using various common motion tasks, with more thorough investigations detailed in the supplementary material.

Looking first at model size, we tested different sizes across four types of motion tasks, comparing models of 60 million, 220 million, and 770 million parameters. The results revealed that the base 220M model performed impressively compared to the smaller 60M model. Interestingly, increasing the model size beyond the base did not lead to significant enhancements; in some cases the larger models even performed worse, especially on tasks involving intermediate motion. We suspect this might be due to the limited size of current motion datasets: for instance, the HumanML3D dataset consists of just 15,000 motion sequences, a number dwarfed by the billions of data points available in language and image datasets.

Next, we evaluated the effect of our instruction-tuning method on different model sizes. The results indicated that this technique improves the flexibility of MotionGPT, enabling it to handle more types of motion tasks, such as motion completion, and enhancing its performance on the text-to-motion task. However, we noticed a decrease in performance on tasks involving pure text generation, potentially because of the limited amount of text descriptions and their associated motions.

In Section 5 we discuss MotionGPT's strengths and weaknesses. To the best of our knowledge, this is the first attempt at using a language model to generate human motion. However, there are certain limitations to be aware of: MotionGPT is designed to work only with the motion of human bodies and does not consider other aspects like facial expressions, hand gestures, or animal movement. Furthermore, our method does not account for interactions between multiple humans, or between humans and objects or the environment. Despite these limitations, we believe MotionGPT holds significant potential for modeling human interactions within a motion-language framework and for generating controllable motions. We envision MotionGPT as a unified motion-language model capable of producing believable human motion and descriptive language from given prompts. In comparison to similar motion diffusion methods, MotionGPT competes favorably on tasks like motion generation, motion captioning, motion prediction, and intermediate motion, all with a single pre-trained generative model. As we continue to accumulate more language data and refine our models, MotionGPT will be able to handle tasks like natural question answering. Our extensive testing across various human-motion-related tasks has shown the effectiveness and scalability of MotionGPT.

Section summary: The performance of MotionGPT is influenced by both model size and training strategy. Model sizes were evaluated across four motion tasks, with the 220M base model achieving remarkable performance compared to the smaller 60M model. Instruction tuning enhances the versatility of MotionGPT, enabling more motion tasks like motion completion and improving performance on the text-to-motion task, but for pure text generation tasks the model's performance is degraded.
