SceneDM: Scene-level Multi-agent Trajectory Generation with Consistent Diffusion Models

Abstract

Fig. 1: Illustration of the scene generation process.

Realistic scene-level multi-agent motion simulations are crucial for developing and evaluating self-driving algorithms. However, most existing works focus on generating trajectories for a single agent type and typically ignore the consistency of the generated trajectories. In this paper, we propose a novel framework based on diffusion models, called SceneDM, to generate joint and consistent future motions of all the agents in a scene, including vehicles, bicycles, pedestrians, etc. To enhance the consistency of the generated trajectories, we use a new Transformer-based network that handles agent-agent interactions in the inverse process of motion diffusion. To account for the smoothness of agent trajectories, we further design a simple yet effective consistent diffusion approach that helps the model exploit short-term temporal dependencies. Furthermore, a scene-level scoring function evaluates the safety and road adherence of the generated agents' motions and helps filter out unrealistic simulations. Finally, SceneDM achieves state-of-the-art results on the Waymo Sim Agents Benchmark.



Overview

Fig. 2: Overview of the proposed method.

SceneDM consists of a scene encoder and a Transformer-based denoiser network. The denoiser network uses the agent embeddings learned by the scene encoder to remove noise from the noisy trajectories. During training, SceneDM first augments the trajectory sequence by concatenating adjacent motion states and applies the same noise to the overlapping parts. The model is then optimized to predict the added noise. Furthermore, we introduce a smoothness regularization to improve the smoothness of the generated trajectories. During generation, we achieve temporal consistency of the latent variable through the proposed temporal-consistent guidance. Finally, a scene-level scoring module filters out unrealistic simulations, ensuring the practicality of the generated samples.
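The training-time augmentation described above can be illustrated with a minimal sketch. It assumes that each augmented element is the concatenation of a state with its successor, and that "the same noise" means one noise vector is drawn per original time step and reused wherever that step appears. The function names, the toy trajectory, and the single noise level `alpha_bar` are hypothetical and are not taken from the SceneDM code.

```python
import math
import random

def augment_adjacent(seq):
    """Concatenate each state with its successor: T states of size D -> T-1 windows of size 2D."""
    return [seq[t] + seq[t + 1] for t in range(len(seq) - 1)]

def add_consistent_noise(traj, alpha_bar, rng):
    """Noise the augmented trajectory, drawing one noise vector per original time step.

    The noise is augmented exactly like the states, so the two copies of state t+1
    (end of window t, start of window t+1) receive identical noise.
    """
    eps = [[rng.gauss(0.0, 1.0) for _ in state] for state in traj]
    windows = augment_adjacent(traj)
    eps_windows = augment_adjacent(eps)
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    noisy = [[a * x + b * e for x, e in zip(w, ew)]
             for w, ew in zip(windows, eps_windows)]
    # eps_windows is the regression target for a noise-prediction model.
    return noisy, eps_windows

rng = random.Random(0)
traj = [[float(t), 0.5 * t] for t in range(6)]  # toy (x, y) states, assumed for illustration
noisy, eps_windows = add_consistent_noise(traj, alpha_bar=0.5, rng=rng)
```

With this construction, the second half of each noisy window matches the first half of the next one, so overlapping states stay consistent at every noise level.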

Results and Analysis


Synthesizing new traffic scenarios

With HD maps and historical trajectories as input, SceneDM first samples noise from a Gaussian distribution and then gradually removes the trajectory noise until convergence. By repeating the sampling and denoising process, SceneDM can generate multiple diverse scenes, fully exploiting the multi-modal characteristics of the traffic scene.
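The sample-then-denoise loop can be sketched with a standard DDPM ancestral sampler. This is only an illustration under stated assumptions: the linear noise schedule, the step count, and the function names are hypothetical, the sampler may differ from the one SceneDM uses, and the temporal-consistent guidance is omitted. The `predict_noise` function is a stand-in for the trained denoiser: it is the closed-form optimal noise predictor for a toy dataset containing a single trajectory, so every sample converges to that trajectory, whereas a learned model conditioned on the map and agent histories would produce diverse scenes.

```python
import math
import random

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and the cumulative products of alpha = 1 - beta."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return betas, alpha_bars

def predict_noise(x, alpha_bar, toy_target):
    """Stand-in for the trained denoiser (assumption): exact noise for single-trajectory data."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(xi - a * ti) / b for xi, ti in zip(x, toy_target)]

def sample_scene(toy_target, num_steps, rng):
    """Start from Gaussian noise and remove it step by step."""
    betas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in toy_target]
    for step in reversed(range(num_steps)):
        beta, alpha_bar = betas[step], alpha_bars[step]
        eps_hat = predict_noise(x, alpha_bar, toy_target)
        coef = beta / math.sqrt(1.0 - alpha_bar)
        mean = [(xi - coef * ei) / math.sqrt(1.0 - beta)
                for xi, ei in zip(x, eps_hat)]
        sigma = math.sqrt(beta) if step > 0 else 0.0  # no noise is added on the final step
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x

rng = random.Random(0)
toy_target = [0.0, 0.0, 1.0, 0.5, 2.0, 1.0]  # flattened toy (x, y) waypoints
scenes = [sample_scene(toy_target, num_steps=50, rng=rng) for _ in range(3)]
```

Each call to `sample_scene` draws fresh initial noise, which is the mechanism by which repeated sampling yields multiple candidate scenes.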


Multi-modal behavior

Multi-modal behavior: ↑ The blue agent turns left or keeps waiting, while the yellow agent waits to enter the main road or turns left.


Multi-modal behavior: ↑ The agents maintain a safe distance from each other, which demonstrates the effectiveness of SceneDM in generating scene-consistent trajectories. Furthermore, the generated trajectories exhibit multi-modal behaviors such as lane changing and deceleration, effectively simulating real-world traffic situations.

Multi-agent behavior

SceneDM is capable of simultaneously generating scene-consistent trajectories for multiple types of agents in the scene.


Reactive agents: ↑ The vehicle decelerates in front of the pedestrian walkway, waiting for the bicycle to pass. Similar situations are further presented below.

