Device Code Generation and Execution
The runtime provides the convenience functions required to execute code on different devices:
- allocate, copy, and release memory on a device
- code generation and execution for different devices
- intrinsics for different devices
Platforms
When the runtime is built, it looks for the following platforms and enables support for each platform it finds:
- Host CPU (default, always present)
- CUDA
- OpenCL
- HSA
Each platform has devices associated with it at runtime. Devices are enumerated starting at 0. For example, a properly configured system with an NVIDIA GPU A will result in the following configuration:
- Platform 0: Host
- Platform 1: CUDA
  - Device 0: GPU A
- Platform 2: OpenCL
  - Device 0: GPU A
- Platform 3: HSA (dummy platform, no device)
Calling runtime functions for a platform or device that is not present terminates the program.
Note: The AnyDSL meta project provides a config-rocm.sh.template to build all dependencies for the HSA platform using the ROCm software stack provided by AMD.
Memory Management
Memory management functions work on Buffers that track the device and platform (device) as well as the allocated memory (data):
struct Buffer {
    data : &[i8],   // pointer to the allocated memory
    size : i64,     // allocation size in bytes
    device : i32    // encodes the platform and the device within it
}
Convenience functions are provided to allocate, copy, and release memory. They operate on Buffers; the platform is implicitly derived from the Buffer when needed.
fn alloc_cpu(size: i64) -> Buffer;
fn alloc_cuda(dev: i32, size: i64) -> Buffer;
fn alloc_opencl(dev: i32, size: i64) -> Buffer;
fn alloc_hsa(dev: i32, size: i64) -> Buffer;
fn release(buf: Buffer) -> ();
fn copy(src: Buffer, dst: Buffer) -> ();
fn copy_offset(src: Buffer, off_src: i64, dst: Buffer, off_dst: i64, size: i64) -> ();
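For example, the following sketch (illustrative only; it assumes a buffer of 1024 f32 values) allocates a host buffer and a matching buffer on CUDA device 0, copies the data to the device and back, and releases both buffers:
let size = (1024 * 4) as i64;   // 1024 f32 values, 4 bytes each; sizes are in bytes
let host = alloc_cpu(size);
let gpu = alloc_cuda(0, size);  // device 0 of the CUDA platform
// ... initialize the host buffer via host.data ...
copy(host, gpu);                // host -> device
// ... launch kernels that work on gpu ...
copy(gpu, host);                // device -> host
release(gpu);
release(host);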
Code Generation and Execution
Code generation and execution for a platform is exposed via functions in Impala:
- Host CPU: by default, all code is generated for the host CPU
- CUDA: cuda and nvvm
- OpenCL: opencl
- HSA: amdgpu
The signature of the code generation backends is as follows: backend(device, grid, block, fun);
- device: the device of the corresponding platform
- grid & block: the partitioning of the problem into blocks of threads
- fun: the function for which code will be generated
Note that the grid configuration is specified as in OpenCL: it defines the total number of threads to be launched, which must be a multiple of the block size.
A typical example looks like this:
let grid = (1024, 1024, 1);
let block = (32, 1, 1);
let device = 0;
cuda(device, grid, block, || {
    let idx = cuda_threadIdx_x();
    out(idx) = in(idx);
});
synchronize_cuda(device);
Using the with syntax results in more readable code:
let grid = (1024, 1024, 1);
let block = (32, 1, 1);
let device = 0;
with cuda(device, grid, block) {
    let idx = cuda_threadIdx_x();
    out(idx) = in(idx);
}
synchronize_cuda(device);
The Accelerator struct is provided to abstract over different compute devices:
struct Accelerator {
    exec : fn((i32, i32, i32),            // grid
              (i32, i32, i32),            // block
              fn(WorkItem) -> ()) -> (),
    sync : fn() -> (),                    // wait for the device to finish
    alloc : fn(i64) -> Buffer,            // allocate memory on the device
    alloc_unified : fn(i64) -> Buffer,    // allocate unified memory, accessible from host and device
    barrier : fn() -> ()                  // synchronize all threads of a block inside a kernel
}
It uses the WorkItem struct to provide functions for thread and block index retrieval:
struct WorkItem {
    tidx : fn() -> i32,   // thread index within the block (x/y/z)
    tidy : fn() -> i32,
    tidz : fn() -> i32,
    bidx : fn() -> i32,   // block index (x/y/z)
    bidy : fn() -> i32,
    bidz : fn() -> i32,
    gidx : fn() -> i32,   // global thread index (x/y/z)
    gidy : fn() -> i32,
    gidz : fn() -> i32,
    bdimx : fn() -> i32,  // block dimensions: threads per block (x/y/z)
    bdimy : fn() -> i32,
    bdimz : fn() -> i32,
    gdimx : fn() -> i32,  // global dimensions: total number of threads (x/y/z)
    gdimy : fn() -> i32,
    gdimz : fn() -> i32,
    nblkx : fn() -> i32,  // number of blocks (x/y/z)
    nblky : fn() -> i32,
    nblkz : fn() -> i32
}
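On CUDA-style platforms, the global index typically equals the block index times the block dimension plus the thread index. Given a WorkItem work_item, it can be recomputed by hand:
let idx = work_item.bidx() * work_item.bdimx() + work_item.tidx(); // typically equals work_item.gidx()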
Using one of the pre-defined accelerators allows the same code to be used for different devices:
let device = 0;
let acc = cuda_accelerator(device);
let grid = (1024, 1, 1);
let block = (32, 1, 1);
for work_item in acc.exec(grid, block) {
    let idx = work_item.gidx();
    out(idx) = in(idx);
}
acc.sync();
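Since the kernel body refers only to the Accelerator and WorkItem interfaces, the device-specific parts can be factored out completely. A minimal sketch (copy_on, src, and dst are illustrative names, not part of the runtime; the [1] address space annotation is explained below):
fn copy_on(acc: Accelerator, n: i32, src: &[1][f32], dst: &mut[1][f32]) -> () {
    for work_item in acc.exec((n, 1, 1), (32, 1, 1)) {
        let idx = work_item.gidx();
        dst(idx) = src(idx);
    }
    acc.sync();
}
Passing an accelerator created for a different platform requires no change to copy_on itself.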
Device Intrinsics
The Intrinsics struct is provided to abstract over device-specific intrinsics, similar to the Accelerator struct:
struct Intrinsics {
    expf : fn(f32) -> f32,
    sinf : fn(f32) -> f32,
    cosf : fn(f32) -> f32,
    logf : fn(f32) -> f32,
    sqrtf : fn(f32) -> f32,
    powf : fn(f32, f32) -> f32,
    ...
}
Using one of the pre-defined intrinsics allows the same code to be used for different devices:
let math = cuda_intrinsics;
...
for work_item in acc.exec(grid, block) {
    let idx = work_item.gidx();
    out(idx) = math.sinf(in(idx));
}
...
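Combining both abstractions yields kernels that are generic over the device and its math functions. A sketch under the same assumptions as the copy_on example above (apply_sinf is an illustrative name):
fn apply_sinf(acc: Accelerator, math: Intrinsics, n: i32, src: &[1][f32], dst: &mut[1][f32]) -> () {
    for work_item in acc.exec((n, 1, 1), (32, 1, 1)) {
        let idx = work_item.gidx();
        dst(idx) = math.sinf(src(idx));
    }
    acc.sync();
}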
Address Spaces
Each GPU memory type has an associated address space, which needs to be annotated. In Impala, the following address spaces are supported:
- default -> host memory
- 1 -> global memory
- 3 -> shared memory
Correct code will only be emitted if the address space annotations are valid.
Read-only arrays in global GPU memory are of type &[1][T], writable arrays of type &mut[1][T].
let arr = alloc_cuda(dev, size);
let out = alloc_cuda(dev, size);
...
for work_item in acc.exec(grid, block) {
    let idx = work_item.gidx();
    let arr_ptr = bitcast[&[1][f32]](arr.data);
    let out_ptr = bitcast[&mut[1][f32]](out.data);
    out_ptr(idx) = arr_ptr(idx);
}
...
The address space annotation is currently done manually, but will be automated with the upcoming new type system.
Memory of a size known at compile time in shared (CUDA), local (OpenCL), or group (HSA) memory can be requested using reserve_shared.
...
for work_item in acc.exec(grid, block) {
    let tidx = work_item.tidx();
    let idx = work_item.gidx();
    let shared = reserve_shared[f32](32);
    shared(tidx) = arr_ptr(idx);
}
...
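Shared memory is typically written by each thread and then read by other threads of the same block after a barrier. The following sketch (a block-wide reversal, assuming a block size of (32, 1, 1); arr_ptr and out_ptr are as in the address space example above) uses the barrier function of the Accelerator struct:
for work_item in acc.exec(grid, block) {
    let tidx = work_item.tidx();
    let idx = work_item.gidx();
    let shared = reserve_shared[f32](32);
    shared(tidx) = arr_ptr(idx);
    acc.barrier();                    // wait until all threads of the block have written
    out_ptr(idx) = shared(31 - tidx); // read a value written by another thread
}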
Profiling
Profiling of kernels is disabled by default. To enable profiling, set the ANYDSL_PROFILE environment variable to FULL:
ANYDSL_PROFILE=FULL ./binary
CUDA libdevice
Some of the code generated by the nvvm and cuda backends makes use of libdevice.10.bc to provide intrinsics. To specify the file explicitly, use the ANYDSL_CUDA_LIBDEVICE_PATH environment variable:
ANYDSL_CUDA_LIBDEVICE_PATH="my/path/libdevice.10.bc"
If none is specified, the default found during the CMake configure step is used. This default might not be available if the application was deployed to a different machine than the one it was originally configured on.
NVVM Code Generation Optimization
When generating target code using the nvvm backend, we emit NVVM IR. Target PTX code is generated from the NVVM IR at runtime using the LLVM NVPTX code generator. NVPTX code generation optimizations can be specified via the ANYDSL_LLVM_ARGS environment variable:
ANYDSL_LLVM_ARGS="-nvptx-sched4reg -nvptx-fma-level=2 -nvptx-prec-divf32=0 -nvptx-prec-sqrtf32=0 -nvptx-f32ftz=1"
As an alternative to the NVPTX code generator, we offer target PTX code generation via NVIDIA's libnvvm interface. To use this backend, the NVVM file needs to be in the bitcode format required by the CUDA installation; for the current NVVM IR version 1.5, this means LLVM 5.0-based bitcode.
AMDGPU Code Generation Optimization
When generating target code using the amdgpu backend, we emit AMDGPU IR. Target GCN code is generated from the AMDGPU IR at runtime using the LLVM GCN code generator. GCN code generation optimizations can be specified via the ANYDSL_LLVM_ARGS environment variable:
ANYDSL_LLVM_ARGS="-amdgpu-sroa -amdgpu-load-store-vectorizer -amdgpu-scalarize-global-loads -amdgpu-internalize-symbols -amdgpu-early-inline-all -amdgpu-sdwa-peephole -amdgpu-dpp-combine -enable-amdgpu-aa -amdgpu-late-structurize=0 -amdgpu-function-calls -amdgpu-simplify-libcall -amdgpu-ir-lower-kernel-arguments -amdgpu-atomic-optimizations -amdgpu-mode-register"
Example
A simple example that shows how to generate code for different GPUs can be found in Stincilla.