Realistic multi-agent motion simulations are essential for the advancement of self-driving algorithms. However, the majority of existing works tend to overlook the kinematic realism of the simulated motions. In this paper, we present SceneDM, a novel consistent diffusion model designed to jointly generate consistent and realistic motions for all types of agents within a traffic scene. To exploit temporal dependencies and improve the kinematic realism of the generated motions, we introduce a constructive noise pattern together with smoothing regularization techniques, both integrated into the diffusion framework. Moreover, the inference procedure of the model is tailored to ensure local temporal consistency. Furthermore, a scene-level scoring function evaluates the safety and road adherence of the generated agents’ motions, helping to filter out unrealistic simulations. Through empirical validation on the Waymo Sim Agents task, we substantiate the effectiveness of SceneDM in improving the smoothness and realism of generated agent trajectories.
SceneDM consists of a scene encoder and a Transformer-based denoiser network. The denoiser network uses the agent embedding learned by the scene encoder to remove noise from the noisy trajectory. During training, SceneDM first augments the trajectory sequence by concatenating adjacent motion states and imposes the same noise on the overlapping parts. The model is then optimized to predict the added noise. Furthermore, we introduce a smoothness regularization to improve the smoothness of the generated trajectory. At generation time, we achieve temporal consistency of the latent variable through the proposed temporal-consistent guidance. Finally, a scene-level scoring module filters out unrealistic simulations, ensuring the practicality of the generated samples.
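The training-time ideas above (augmenting by concatenating adjacent states, sharing noise on the overlap, and penalizing non-smooth trajectories) can be illustrated with a minimal sketch. This is not the SceneDM implementation: the window size k=2, the helper names augment, shared_noise and smoothness_penalty, and the choice of a second-order-difference penalty are all assumptions made for illustration.

```python
import random


def augment(states, k=2):
    """Concatenate each state with its k-1 successors (overlapping windows).

    `states` is a list of per-timestep feature lists, e.g. [x, y].
    The window size k is an assumption; the source only says "adjacent".
    """
    return [
        [value for state in states[i:i + k] for value in state]
        for i in range(len(states) - k + 1)
    ]


def shared_noise(num_states, dim, k=2, rng=None):
    """Sample one Gaussian noise vector per original state, then window it.

    Because the per-state noise is windowed exactly like the states, a state
    that appears in two overlapping windows receives the same noise in both.
    """
    rng = rng or random.Random()
    per_state = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                 for _ in range(num_states)]
    return augment(per_state, k)


def smoothness_penalty(states):
    """Mean squared second-order difference (an assumed form of the regularizer)."""
    total, count = 0.0, 0
    for prev, cur, nxt in zip(states, states[1:], states[2:]):
        for a, b, c in zip(prev, cur, nxt):
            total += (a - 2.0 * b + c) ** 2
            count += 1
    return total / count if count else 0.0
```

With k=2, the second half of window i and the first half of window i+1 describe the same motion state, so they carry identical noise. A constant-velocity trajectory has zero penalty under this particular regularizer, while a zig-zag trajectory is penalized.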
With HD maps and historical trajectories as input, SceneDM first samples noise from a Gaussian distribution and then gradually removes the trajectory noise until convergence. By repeating the sampling and denoising process, SceneDM can generate multiple diverse scenes, fully exploiting the multi-modal characteristics of the traffic scene.
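The sample-then-denoise procedure described above can be sketched as a standard DDPM-style reverse loop. This is an assumption-laden illustration rather than the SceneDM sampler: the linear noise schedule, the step count, and the names make_schedule, sample_scene, sample_scenes and predict_noise are hypothetical, and the temporal-consistent guidance and scene conditioning are omitted. Here predict_noise stands in for the trained denoiser network.

```python
import math
import random


def make_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and cumulative alpha products (assumed, needs steps >= 2)."""
    betas = [beta_start + (beta_end - beta_start) * i / (steps - 1)
             for i in range(steps)]
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return betas, alpha_bars


def sample_scene(predict_noise, num_values, steps=50, rng=None):
    """Start from Gaussian noise and denoise step by step (DDPM ancestral sampling)."""
    rng = rng or random.Random()
    betas, alpha_bars = make_schedule(steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(num_values)]
    for t in reversed(range(steps)):
        eps = predict_noise(x, t)  # stand-in for the denoiser network
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(1.0 - betas[t])
                for xi, ei in zip(x, eps)]
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise on the last step
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x


def sample_scenes(predict_noise, num_values, num_scenes, steps=50, seed=0):
    """Repeat sampling and denoising to obtain several candidate scenes."""
    rng = random.Random(seed)
    return [sample_scene(predict_noise, num_values, steps, rng)
            for _ in range(num_scenes)]
```

Each call to sample_scene starts from fresh noise, which is what allows repeated sampling to produce different futures for the same scene context when the denoiser is a learned, multi-modal model.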
Multi-modal behavior: ↑ The blue agent turns left or keeps waiting, while the yellow agent waits to enter the main road or turns left.
Multi-modal behavior: ↑ The agents maintain a safe distance from each other, which demonstrates the effectiveness of SceneDM in generating trajectories with scene consistency for the agents in the scene. Furthermore, the generated trajectories exhibit multimodal behaviors such as lane changing and deceleration, effectively simulating real-world traffic situations.
SceneDM is capable of simultaneously generating scene-consistent trajectories for multiple types of agents in the scene.
Reactive agents: ↑ The vehicle decelerates in front of the pedestrian walkway, waiting for the bicycle to pass. Similar situations are further presented below.