Nowadays, peak computational performance in many problem domains is achieved through the use of specialized hardware such as graphics processing units (GPUs) and field-programmable gate arrays (FPGAs). The goal of this project is to allow CityMoS to make use of modern heterogeneous hardware to accelerate city-scale simulation-based evaluations.
We explore a range of hardware platforms to accelerate agent-based simulation. These platforms include multi-core CPUs, many-core CPUs, Graphics Processing Units (GPUs), Accelerated Processing Units (APUs), Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs) and Systems on Chip (SoCs).
The adaptation of a CPU-based sequential simulation for execution on a heterogeneous platform comes with five challenges:
Our goal is to reduce the manual work involved in developing a simulation that targets heterogeneous hardware comprising CPUs, GPUs, and FPGAs. A middleware detects which parts of the simulation are suited to run on which type of hardware. The middleware takes the simulation code as input and automatically decomposes the simulation according to the available hardware. It then generates the corresponding code and orchestrates the assignment of code segments to the hardware devices.
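To make the assignment step concrete, the following is a toy sketch of mapping code segments to devices. It is not the middleware itself: the `Segment` type, the `data_parallel` flag, the `PREFERENCE` table, and the `assign` function are all hypothetical names introduced for illustration, and the rule "data-parallel segments prefer a GPU, then an FPGA, then the CPU" is an assumed heuristic, not the project's actual placement policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A piece of simulation code (hypothetical representation)."""
    name: str
    data_parallel: bool  # True if the same work is done independently per agent


# Assumed device preference per segment kind; the CPU is always the fallback.
PREFERENCE = {
    True: ["gpu", "fpga", "cpu"],
    False: ["cpu"],
}


def assign(segments, available):
    """Map each segment name to the first preferred device that is available."""
    devices = set(available) | {"cpu"}  # assumption: a host CPU always exists
    placement = {}
    for seg in segments:
        for device in PREFERENCE[seg.data_parallel]:
            if device in devices:
                placement[seg.name] = device
                break
    return placement


if __name__ == "__main__":
    segments = [Segment("sense", False), Segment("think", True)]
    print(assign(segments, {"cpu", "gpu"}))
    print(assign(segments, {"cpu"}))
```

A real middleware would have to derive properties such as `data_parallel` from the simulation code and weigh data transfer costs; here they are simply given as inputs.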
To explore performance potentials, we developed an agent-based traffic simulation in OpenCL that can run on a CPU, a GPU, or an APU.
We also conducted a comprehensive performance study on execution schemes for this traffic simulation on a heterogeneous CPU/GPU platform. The execution schemes are illustrated in the figure below.
The simulation is decomposed into three stages: SENSE, THINK, and ACT. A moderate speedup over sequential execution on a CPU can be achieved when only the THINK stage is parallelised on the GPU. The performance can be improved further by using the fused GPU of an APU: due to the zero-copy technique, the data transfer overhead between the fused CPU and the fused GPU is eliminated.
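The three-stage decomposition can be sketched as follows. This is a minimal sequential illustration in Python, not the project's OpenCL implementation: the single-lane road, the `Car` type, and the braking rule with its parameters (`max_speed`, `safe_gap`, `accel`) are assumptions made up for the example. The point it shows is that THINK depends only on one agent's own state and its sensed input, which is what makes that stage a natural candidate for per-agent parallel execution.

```python
from dataclasses import dataclass


@dataclass
class Car:
    position: float  # metres along a single-lane road (assumed layout)
    speed: float     # metres per second


def sense(cars):
    """SENSE: for each car, the gap to the car ahead (None for the leader)."""
    order = sorted(range(len(cars)), key=lambda i: cars[i].position)
    gaps = [None] * len(cars)
    for rank, i in enumerate(order[:-1]):
        ahead = order[rank + 1]
        gaps[i] = cars[ahead].position - cars[i].position
    return gaps


def think(car, gap, max_speed=15.0, safe_gap=10.0, accel=1.0, dt=1.0):
    """THINK: choose a new speed from this car's state and sensed gap only."""
    if gap is not None and gap < safe_gap:
        return max(0.0, car.speed - 2.0 * accel * dt)  # brake
    return min(max_speed, car.speed + accel * dt)      # accelerate


def act(cars, new_speeds, dt=1.0):
    """ACT: apply all decisions to the shared simulation state."""
    for car, speed in zip(cars, new_speeds):
        car.speed = speed
        car.position += speed * dt


def step(cars, dt=1.0):
    gaps = sense(cars)
    # Each think() call is independent of the others, so this loop is the
    # part that could be mapped to one GPU work-item per agent.
    new_speeds = [think(car, gap, dt=dt) for car, gap in zip(cars, gaps)]
    act(cars, new_speeds, dt)


if __name__ == "__main__":
    cars = [Car(0.0, 10.0), Car(5.0, 10.0)]
    step(cars)
    print(cars)
```

Separating the decision (THINK) from the state update (ACT) means every agent decides on the same snapshot of the world, so the result does not depend on the order in which agents are processed.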
A substantial speedup can be achieved when all stages are parallelised. We achieved a speedup of up to 28.7x over the CPU-based execution.