2022-07-27, Blue
The package VectorEngine.jl enables the use of the SX-Aurora Tsubasa Vector Engine (VE) as an accelerator in hybrid Julia programs. It builds on GPUCompiler.jl and leverages the VEDA API as well as the LLVM-VE compiler and the Region Vectorizer. Even though the VE is very different from GPUs, using only a few cores and arithmetic units with very long vector lengths, the package lets programmers use it in a very similar way, which simplifies the porting of Julia applications.
The talk introduces the VectorEngine.jl [1] package, the first port of the Julia programming language to the NEC SX-Aurora Tsubasa Vector Engine (VE) [2]. It describes the design choices made to enable the Julia programming language on the VE, architecture-specific details, similarities and differences between VEs and GPUs, and the currently supported features.
The current instances of the VE are equipped with 6 HBM2 modules that deliver 1.55 TB/s of memory bandwidth to 8 or 10 cores. Each core consists of a full-fledged scalar processing unit (SPU) and a vector processing unit (VPU) operating on very long vectors of up to 256 x 64-bit or 512 x 32-bit words. With C, C++ and Fortran, programs can run natively, entirely on the VE, parallelized with OpenMP and MPI, while Linux system calls are processed on the host machine. Native VE programs can offload function calls to the host CPU (reverse offloading). Alternatively, the VE can be used as an accelerator, with the main program running on the host CPU and performance-critical kernels offloaded to the VE with the help of libraries like AVEO or VEDA [3]. Prominent users of the SX-Aurora Vector Engines are found in weather and climate research (e.g. Deutscher Wetterdienst), earth sciences research (JAMSTEC, Earth Simulator) and fusion research (National Institute for Fusion Science, Japan).
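As a quick check of the figures above, the following Julia sketch computes the vector register width implied by both word sizes and a per-core share of the memory bandwidth. The even split of bandwidth across cores is an illustrative assumption only, not a measured or documented property of the hardware.

```julia
# Figures taken from the text above.
hbm_bandwidth_tbps = 1.55     # aggregate memory bandwidth in TB/s
core_counts = (8, 10)         # cores per VE

# Maximum vector length in elements for 64-bit and 32-bit words.
vl_64 = 256
vl_32 = 512

# Both configurations describe the same vector width in bits.
vector_bits_64 = vl_64 * 64
vector_bits_32 = vl_32 * 32

# Assumption: bandwidth divided evenly among the cores (illustration only).
per_core_gbps(cores) = hbm_bandwidth_tbps * 1000 / cores

println("vector width with 64-bit words: ", vector_bits_64, " bits")
println("vector width with 32-bit words: ", vector_bits_32, " bits")
for c in core_counts
    println(c, " cores: ", round(per_core_gbps(c); digits = 1), " GB/s per core")
end
```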
To enable the VE for Julia, we chose the usual offloading programming paradigm, which treats the VE as an accelerator running particular code kernels. The GPUCompiler.jl module was slightly extended and is used in VectorEngine.jl to support VEDA on the VE, similar to the GPU-specific implementations CUDA.jl, AMDGPU.jl and oneAPI.jl. Although VEs are very different from GPUs, choosing a usage pattern similar to that of GPUs is the most promising approach for reducing porting effort and keeping multi-accelerator Julia code maintainable. With VectorEngine.jl we can declare device-side arrays and structures, copy data between the host and the device, declare kernel functions, create cross-compiled objects that can be executed on the accelerator, or use a simple macro like @veda to run a function on the device and hide steps like compilation and argument transfer from the user.
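The offloading workflow described above (allocate device arrays, copy data in, launch a kernel, copy results back) can be illustrated with a host-only emulation in plain Julia. Every name in this sketch (MockDeviceArray, to_device, to_host, launch, axpy_kernel!) is hypothetical and chosen for illustration; none of them is the VectorEngine.jl API, and nothing here runs on a VE.

```julia
# Host-only emulation of an accelerator offload workflow.
# All names are hypothetical and are NOT the VectorEngine.jl API.

# Stand-in for a device-side array: it owns a buffer separate from the host data.
struct MockDeviceArray{T}
    data::Vector{T}
end

# Copy host data into the emulated device memory.
to_device(x::Vector{T}) where {T} = MockDeviceArray{T}(copy(x))

# Copy emulated device data back to the host.
to_host(d::MockDeviceArray) = copy(d.data)

# Kernel function: element-wise y = a*x + y on plain buffers.
function axpy_kernel!(y, a, x)
    for i in eachindex(x, y)
        y[i] = a * x[i] + y[i]
    end
    return nothing
end

# Stand-in for a launch macro: unwraps device arrays, then calls the kernel.
unwrap(d::MockDeviceArray) = d.data
unwrap(v) = v
launch(kernel, args...) = kernel(map(unwrap, args)...)

x = Float64[1, 2, 3, 4]
y = Float64[10, 20, 30, 40]

d_x = to_device(x)                    # host -> device copy
d_y = to_device(y)
launch(axpy_kernel!, d_y, 2.0, d_x)   # run the kernel on the "device"
result = to_host(d_y)                 # device -> host copy
println(result)
```

In this emulation the host array y is left unchanged, which mirrors the separation between host and device memory that an explicit copy-in and copy-out workflow implies.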
For cross-compiling VE device code we use the LLVM-VE compiler. It is a slightly extended version of the upstream LLVM compiler, which has supported the VE as an official architecture since late 2021. For vectorization inside the Julia device code we use the Region Vectorizer [4], an advanced outer-loop vectorizer capable of handling divergent control flow. The Region Vectorizer does not perform data-dependency analysis; therefore, loops that should be vectorized must be annotated by the programmer.
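The idea of programmer-annotated loops can be illustrated with a host-side analogy. The sketch below uses Julia's built-in @simd macro, with which the programmer asserts that loop iterations are independent and may be reordered. This is only an analogy: the annotation mechanism that VectorEngine.jl uses for the Region Vectorizer is not described here, and the function names are hypothetical.

```julia
# Host-side analogy using Julia's built-in @simd annotation.
# This is not the Region Vectorizer interface.

# Safe to annotate: iterations are independent apart from the reduction into s,
# which @simd explicitly permits the compiler to reorder.
function scaled_sum(a::Float64, x::Vector{Float64})
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        s += a * x[i]
    end
    return s
end

# Not safe to annotate: each iteration reads the result of the previous one
# (a loop-carried dependency), so the loop is left unannotated.
function prefix_sum!(x::Vector{Float64})
    for i in 2:length(x)
        x[i] += x[i - 1]
    end
    return x
end

println(scaled_sum(2.0, [1.0, 2.0, 3.0]))
println(prefix_sum!([1.0, 2.0, 3.0]))
```

Because no dependency analysis backs the annotation, marking the second loop as vectorizable would be the programmer's error, not something the compiler would catch.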
At the time of submission of the talk proposal, device-side Julia on the VE supports a very limited runtime, quite similar to that available on GPUs. It includes device arrays, transfer of structures, vectorization using the Region Vectorizer, and device-side ccalls to libc functions as well as other VE libraries. We discuss the goal of implementing most of the Julia runtime on the device side, a step that would enable a much wider range of codes on the accelerator.
[1] VectorEngine.jl GitHub repository: https://github.com/sx-aurora-dev/VectorEngine.jl
[2] K. Komatsu et al., Performance evaluation of a vector supercomputer SX-Aurora TSUBASA, https://dl.acm.org/doi/10.5555/3291656.3291728
[3] VEDA GitHub repository: https://github.com/sx-aurora/veda
[4] Simon Moll, Vectorization system for unstructured codes with a Data-parallel Compiler IR, 2021, dissertation https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/32453
Erich has worked on optimizing CFD and structural mechanics algorithms for parallel vector supercomputers, on Linux kernel development, and on research in distributed systems software and parallel file systems. He currently leads a research and development group at NEC HPC Europe. His work covers system software and compilers for heterogeneous computing with NEC's SX-Aurora Vector Engine, augmenting HPC simulations with AI, and integrating cloud technologies into HPC clusters.