Device Code Generation and Execution

This page describes how device code generation and execution work in Impala.


The runtime provides convenience functions that are required in order to execute code on different devices:


When the runtime is built, the build looks for the following platforms and enables each one that is found:

  1. Host CPU (default, always present)
  2. CUDA
  3. OpenCL
  4. HSA

Each platform has devices associated with it at runtime. Devices are enumerated starting with 0. For example, a properly configured system with an NVIDIA GPU A will result in the following configuration:

Calling runtime functions for a platform or device that is not present terminates the program.

Note: The HSA platform has been tested on a system using the ROCm 1.9.0 software stack provided by AMD.

Memory Management

Memory management functions work on Buffers that track the device & platform (device) and the allocated memory (data):

struct Buffer {
    data : &[i8],
    size : i64,
    device : i32
}

Convenience functions are provided to allocate, copy, and release memory. They work on Buffers; the platform is implicitly injected or derived when needed.

fn alloc_cpu(size: i32) -> Buffer;
fn alloc_cuda(dev: i32, size: i32) -> Buffer;
fn alloc_opencl(dev: i32, size: i32) -> Buffer;
fn alloc_hsa(dev: i32, size: i32) -> Buffer;

fn release(buf: Buffer) -> ();

fn copy(src: Buffer, dst: Buffer) -> ();
fn copy_offset(src: Buffer, off_src: i32, dst: Buffer, off_dst: i32, size: i32) -> ();
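
The following is a minimal host-side model in Python, not the runtime's implementation. It illustrates how a single device field can track both platform and device index, and what an offset copy between two Buffers does. The platform numbering, the 4-bit shift, the byte-based offsets, and the bounds check are assumptions made for this sketch; the helper names pack_device, unpack_device, and alloc are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumption: platform ids and the bit layout are illustrative only.
PLATFORM_HOST, PLATFORM_CUDA, PLATFORM_OPENCL, PLATFORM_HSA = 0, 1, 2, 3
PLATFORM_BITS = 4


def pack_device(platform: int, dev: int) -> int:
    """Combine a platform id and a device index into one integer."""
    return platform | (dev << PLATFORM_BITS)


def unpack_device(device: int) -> Tuple[int, int]:
    """Recover (platform id, device index) from a packed value."""
    return device & ((1 << PLATFORM_BITS) - 1), device >> PLATFORM_BITS


@dataclass
class Buffer:
    data: bytearray  # stands in for the allocated memory
    size: int        # size in bytes
    device: int      # packed platform and device index


def alloc(platform: int, dev: int, size: int) -> Buffer:
    """Hypothetical stand-in for alloc_cpu, alloc_cuda, and so on."""
    return Buffer(bytearray(size), size, pack_device(platform, dev))


def copy_offset(src: Buffer, off_src: int, dst: Buffer, off_dst: int, size: int) -> None:
    """Copy `size` bytes from src to dst; offsets are assumed to be in bytes."""
    if off_src < 0 or off_dst < 0 or size < 0:
        raise ValueError("offsets and size must be non-negative")
    if off_src + size > src.size or off_dst + size > dst.size:
        raise ValueError("copy exceeds buffer bounds")
    dst.data[off_dst:off_dst + size] = src.data[off_src:off_src + size]
```

Because each Buffer carries its own packed device value, a copy function in this model needs no separate platform argument: source and destination platforms can be derived from the Buffers themselves.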

Code Generation and Execution

Code generation and execution for a platform is exposed via functions in Impala:

  1. Host CPU: by default, all code is generated for the host CPU
  2. CUDA: cuda and nvvm
  3. OpenCL: opencl
  4. HSA: amdgpu

The signature of the code generation backends is as follows:

backend(device, grid, block, fun);

  • device: the device of the corresponding platform
  • grid & block: blocking of the problem into sub-problems
  • fun: function for which code will be generated

A typical example will look like this:

let grid   = (1024, 1024, 1);
let block  = (32, 1, 1);
let device = 0;
cuda(device, grid, block, || { ... out(idx) = in(idx); });

The with syntax makes the same call more readable:

let grid   = (1024, 1024, 1);
let block  = (32, 1, 1);
let device = 0;
with cuda(device, grid, block) {
    let idx = cuda_threadIdx_x();
    out(idx) = in(idx);
}
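
To make the roles of grid and block in the examples above concrete, here is a small Python sketch that derives the number of blocks per dimension. It assumes that grid gives the total number of threads per dimension and block the threads per block, so the block count is their quotient. That convention, the divisibility check, and the helper names blocks_per_grid and total_threads are assumptions of this sketch rather than statements about the backends.

```python
from typing import Tuple

Dim3 = Tuple[int, int, int]


def blocks_per_grid(grid: Dim3, block: Dim3) -> Dim3:
    """Number of blocks per dimension, assuming grid counts threads."""
    for g, b in zip(grid, block):
        if g <= 0 or b <= 0:
            raise ValueError("grid and block sizes must be positive")
        if g % b != 0:
            # This sketch rejects partial blocks for simplicity.
            raise ValueError("grid size must be a multiple of block size")
    return tuple(g // b for g, b in zip(grid, block))


def total_threads(grid: Dim3) -> int:
    """Total number of threads launched under the same assumption."""
    return grid[0] * grid[1] * grid[2]
```

Under these assumptions, grid = (1024, 1024, 1) with block = (32, 1, 1) yields 32 x 1024 x 1 blocks of 32 threads each.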

The Accelerator struct is provided to abstract over different compute devices:

struct Accelerator {
    exec          : fn((i32, i32, i32), // grid
                       (i32, i32, i32), // block
                       fn(WorkItem) -> ()) -> (),
    sync          : fn() -> (),
    alloc         : fn(i32) -> Buffer,
    alloc_unified : fn(i32) -> Buffer,
    barrier       : fn() -> ()
}

It uses the WorkItem struct to provide functions for thread index or block index retrieval:

struct WorkItem {
    tidx  : fn() -> i32,
    tidy  : fn() -> i32,
    tidz  : fn() -> i32,
    bidx  : fn() -> i32,
    bidy  : fn() -> i32,
    bidz  : fn() -> i32,
    gidx  : fn() -> i32,
    gidy  : fn() -> i32,
    gidz  : fn() -> i32,
    bdimx : fn() -> i32,
    bdimy : fn() -> i32,
    bdimz : fn() -> i32,
    gdimx : fn() -> i32,
    gdimy : fn() -> i32,
    gdimz : fn() -> i32,
    nblkx : fn() -> i32,
    nblky : fn() -> i32,
    nblkz : fn() -> i32
}

Using one of the pre-defined accelerators allows the same code to be used for different devices:

let device = 0;
let acc    = cuda_accelerator(device);
let grid   = (1024, 1, 1);
let block  = (32, 1, 1);

for work_item in acc.exec(grid, block) {
    let idx = work_item.gidx();
    out(idx) = in(idx);
}
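
The relationship between the WorkItem accessors can be illustrated with a sequential host-side emulation in Python. This is only a sketch of the indexing scheme, restricted to the x dimension: it assumes that gidx equals bidx * bdimx + tidx and that gdimx equals nblkx * bdimx. The function host_exec is a hypothetical stand-in for acc.exec and runs the work items one after another instead of in parallel.

```python
from typing import Callable, List


class WorkItem:
    """Index accessors for one emulated thread (x dimension only)."""

    def __init__(self, tid: int, bid: int, bdim: int, nblk: int) -> None:
        self._tid, self._bid, self._bdim, self._nblk = tid, bid, bdim, nblk

    def tidx(self) -> int:   # thread index within its block
        return self._tid

    def bidx(self) -> int:   # block index within the grid
        return self._bid

    def bdimx(self) -> int:  # threads per block
        return self._bdim

    def nblkx(self) -> int:  # number of blocks
        return self._nblk

    def gdimx(self) -> int:  # total number of threads (assumed meaning)
        return self._nblk * self._bdim

    def gidx(self) -> int:   # global thread index (assumed formula)
        return self._bid * self._bdim + self._tid


def host_exec(grid_x: int, block_x: int, body: Callable[[WorkItem], None]) -> None:
    """Call body once per work item, block by block, thread by thread."""
    if block_x <= 0 or grid_x % block_x != 0:
        raise ValueError("grid size must be a positive multiple of block size")
    nblk = grid_x // block_x
    for bid in range(nblk):
        for tid in range(block_x):
            body(WorkItem(tid, bid, block_x, nblk))


# Usage: the copy kernel from the example above, emulated on the host.
src: List[int] = list(range(64))
out: List[int] = [0] * len(src)


def kernel(item: WorkItem) -> None:
    idx = item.gidx()
    out[idx] = src[idx]


host_exec(64, 32, kernel)
```

The kernel body only talks to the WorkItem it receives, which is the property that lets the same body be handed to a different accelerator without changes.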

Device Intrinsics

The Intrinsics struct is provided to abstract over device-specific intrinsics, similar to the Accelerator struct:

struct Intrinsics {
    expf  : fn(f32) -> f32,
    sinf  : fn(f32) -> f32,
    cosf  : fn(f32) -> f32,
    logf  : fn(f32) -> f32,
    sqrtf : fn(f32) -> f32,
    powf  : fn(f32, f32) -> f32,
}

Using one of the pre-defined intrinsics allows the same code to be used for different devices:

let math = cuda_intrinsics;
for work_item in acc.exec(grid, block) {
    let idx = work_item.gidx();
    out(idx) = math.sinf(in(idx));
}

Address Spaces

Each GPU memory type has an associated address space, which needs to be annotated. In Impala, the following address spaces are supported:

Correct code is emitted only if the address space is valid. Read-only arrays in global GPU memory are of type &[1][T], and writable arrays are of type &mut[1][T].

let arr = alloc_cuda(dev, size);
let out = alloc_cuda(dev, size);
for work_item in acc.exec(grid, block) {
    let idx = work_item.gidx();
    // cast the Buffer data pointers to global memory (address space 1)
    let arr_ptr = bitcast[   &[1][f32]](arr.data);
    let out_ptr = bitcast[&mut[1][f32]](out.data);
    out_ptr(idx) = arr_ptr(idx);
}

The address space annotation is manual at the moment, but will be automated with the upcoming type system.

Memory of compile-time known size in shared (CUDA), local (OpenCL), or group (HSA) memory can be requested using reserve_shared.

for work_item in acc.exec(grid, block) {
    let tidx   = work_item.tidx();
    let idx    = work_item.gidx();
    let shared = reserve_shared[f32](32);
    shared(tidx) = arr_ptr(idx);
}


Profiling of kernels is disabled by default. To enable profiling, set the ANYDSL_PROFILE environment variable to FULL:


NVVM Code Generation Optimization

When generating target code using the nvvm backend, we emit NVVM IR. Target PTX code is generated from the NVVM IR at runtime using the LLVM NVPTX code generator. NVPTX code generation optimizations can be specified via the ANYDSL_LLVM_ARGS environment variable:

ANYDSL_LLVM_ARGS="-nvptx-sched4reg -nvptx-fma-level=2 -nvptx-prec-divf32=0 -nvptx-prec-sqrtf32=0 -nvptx-f32ftz=1"

As an alternative to the NVPTX code generator, we offer target PTX code generation via NVIDIA’s libnvvm interface. To make use of this backend, the nvvm file needs to be in the bitcode format required by the CUDA installation. For the current NVVM IR version 1.5, this means LLVM 5.0 based bitcode.


A simple example that shows how to generate code for different GPUs can be found in Stincilla.

Cross Compilation

For cross compilation, the target triple and target cpu can be set via the environment variables ANYDSL_TARGET_TRIPLE and ANYDSL_TARGET_CPU: