In recent years, Field Programmable Gate Arrays (FPGAs) have become increasingly important for high-performance computing. One way to exploit them effectively and easily is to rely on a GPU-like paradigm. A number of GPU-like HDL projects already exist, providing programmability support that resembles familiar GPU devices, although they often rely on substantial simplifications. In the paper below, we present the design of a non-coherent scratchpad memory for hardware multithreaded vector processors (the shared memory, in NVIDIA parlance) that can be used to address the problem of bank conflicts, a major source of performance loss in many parallel kernels. The key insight is that the configurable, GPU-oriented scratchpad memory offers built-in support for application-specific bank remapping. The core is fully synthesizable on FPGA at a contained hardware cost. We validated the architecture with a cycle-accurate, event-driven emulator written in C++ as well as with an RTL simulator. Finally, we demonstrated the impact of bank remapping and of the other parameters exposed by the proposed configurable shared scratchpad memory by evaluating the performance of two real-world parallelized kernels.

Cilardo, Alessandro; Gagliardi, Mirko; Donnarumma, Ciro

A Configurable Shared Scratchpad Memory for GPU-like Processors (Inproceedings)

Advances on P2P, Parallel, Grid, Cloud and Internet Computing: Proceedings of the 11th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC-2016), pp. 3–14, Springer International Publishing, 2017.


Results for a matrix multiplication kernel and a 5×5 mean filter kernel, respectively: number of bank conflicts as a function of the number of lanes, the number of banks, and the mapping strategy.
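To make the three parameters in the caption concrete, the following minimal Python sketch counts bank conflicts for a single vector access. It is an illustration only and is not taken from the paper: the function names, the XOR-based remapping, and the conflict metric (requests to the busiest bank minus one, assuming every lane targets a distinct address) are all assumptions chosen for simplicity.

```python
from collections import Counter


def modulo_bank(addr, num_banks):
    """Baseline mapping: consecutive words land in consecutive banks."""
    return addr % num_banks


def xor_bank(addr, num_banks):
    """Illustrative remapping (an assumption, not the paper's scheme):
    XOR the low bank-index bits with the next-higher address bits.
    Intended for num_banks that is a power of two."""
    return (addr ^ (addr // num_banks)) % num_banks


def count_conflicts(addresses, num_banks, mapping):
    """Extra serialized accesses for one vector request, modeled as the
    number of requests hitting the busiest bank minus one."""
    if not addresses:
        return 0
    load = Counter(mapping(addr, num_banks) for addr in addresses)
    return max(load.values()) - 1


def column_access(column, num_lanes, row_stride):
    """Word addresses touched when each lane reads one element of the
    same column in a row-major matrix with row_stride words per row."""
    return [lane * row_stride + column for lane in range(num_lanes)]


if __name__ == "__main__":
    lanes, banks, stride = 16, 16, 16
    addrs = column_access(0, lanes, stride)
    print("modulo:", count_conflicts(addrs, banks, modulo_bank))
    print("xor:   ", count_conflicts(addrs, banks, xor_bank))
```

With 16 lanes, 16 banks, and a row stride of 16 words, a column access under the modulo mapping sends every lane to the same bank (15 conflicts), while the XOR remapping spreads the lanes over all banks (0 conflicts). Varying `lanes`, `banks`, and the mapping function mirrors the three axes explored in the figure.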