Introduction to SYCL Programming: A Beginner’s Guide

shinnosuke.takemoto | July 10, 2025 | HPC

* This blog post is an English translation of an article originally published in Japanese on April 3, 2025.

Introduction

This is Takemoto from Solutions Division II.

Previously on this blog, the articles “Trying to Write OpenCL in a Single Source with SYCL” (Japanese only, 2017) and “Comparing the Performance of SYCL Implementations” (Japanese only, 2023) were published. Those articles covered SYCL environment setup and SYCL source code, but said little about the stage before that: what SYCL is and how it is used. This article therefore explains SYCL from an overview down to the essential basic features and terminology, while building a simple program. The article is written against the SYCL 2020 specification.

Target Audience

  • Those interested in SYCL
  • Those who understand basic C++
  • Those who understand basic parallel programming
    • Knowledge of CUDA will make understanding easier, as there are comparisons with CUDA.

What is SYCL?

Officially:

SYCL is an open industry standard for programming a heterogeneous system. The design of SYCL allows standard C++ source code to be written such that it can run on either an heterogeneous device or on the host.

In other words, SYCL is a unified standard prepared to handle various devices.

Originally, if you wanted to make a CPU, NVIDIA GPU, AMD GPU, etc., perform the same processing, you would need to develop using different languages and frameworks for each. However, by writing code in SYCL, it can be run on all these devices.

Typical merits of using SYCL include:

  • Ability to describe processing for various devices with this one standard.
  • Can be written in a single source code.
    • Code intended for device execution can also be written in a single .cpp file.
  • C++17 can be used in SYCL 2020.

Simple Program

We will describe a very simple program and explain the basic syntax and features of SYCL based on this program.

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    constexpr int N = 10;
    int arr[N] = {0};

    {
        // A buffer is one way to transfer data between host and device
        sycl::buffer<int, 1> buf(arr, sycl::range<1>(N));

        // Get a default queue
        sycl::queue myQueue;

        // Submit a command group to the device
        myQueue.submit([&](sycl::handler &cgh) {
            // Get an accessor
            sycl::accessor acc(buf, cgh, sycl::write_only);

            // Describe the processing to be executed on the device
            cgh.parallel_for(sycl::range<1>(N), [=](sycl::item<1> id) {
                // Execute on the device
                acc[id.get_linear_id()] = id.get_linear_id();
            });
        }); // Submission is asynchronous; the kernel may still be running here
    } // buf goes out of scope here: its destructor waits for the kernel and copies the results back to arr

    for (int i = 0; i < N; i++) {
        std::cout << "stdout_result: " << arr[i] << "\n" << std::endl;
    }

    return 0;
}

Output Result

stdout_result: 0
stdout_result: 1
stdout_result: 2
stdout_result: 3
stdout_result: 4
stdout_result: 5
stdout_result: 6
stdout_result: 7
stdout_result: 8
stdout_result: 9

The above program, if written as code processed simply by the CPU, would look like this:

#include <iostream>

int main() {
    constexpr int N = 10;
    int arr[N] = {0};

    for (int i = 0; i < N; i++) {
        arr[i] = i;
    }

    for (int i = 0; i < N; i++) {
        std::cout << "stdout_result: " << arr[i] << "\n" << std::endl;
    }

    return 0;
}

Host and Device

The SYCL platform model consists of a host connected to one or more heterogeneous devices, called devices.

This is how it’s described in the official documentation. These terms were frequently used in the comments of the program above, and host/device refer to the following:

  • Host: Control processor
  • Device: Computation processor

Mainly, the host is a CPU, and devices can be CPUs, GPUs, DSPs, etc. For example, in CUDA, the host is the CPU, and the device is the GPU.

As will be described later, the anonymous function specified as the second argument to cgh.parallel_for is executed on the device.

The official documentation also mentions the OpenCL platform; OpenCL is an API for controlling various devices. For details, please refer to the official OpenCL website.
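
All platforms and devices visible to the SYCL runtime can be enumerated with a small program. The following is a minimal sketch using only standard SYCL 2020 APIs:

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    // List every platform the runtime can see, and the devices belonging to each
    for (const auto &platform : sycl::platform::get_platforms()) {
        std::cout << "Platform: "
                  << platform.get_info<sycl::info::platform::name>() << std::endl;
        for (const auto &device : platform.get_devices()) {
            std::cout << "  Device: "
                      << device.get_info<sycl::info::device::name>() << std::endl;
        }
    }
    return 0;
}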

SYCL Kernel Function

A function object that can be executed on a device is called a SYCL kernel function.

As stated officially, a function written to be executed on a device is called a SYCL kernel function. In CUDA terms, this corresponds to a function with the __global__ attribute.

The anonymous function specified as the second argument to cgh.parallel_for in the program above is a SYCL kernel function.

Command Group

In SYCL, the operations required to process data on a device are represented using a command group function object. Each command group function object is given a unique command group handler object to perform all the necessary work required to correctly process data on a device using a kernel. In this way, the group of commands for transferring and processing data is enqueued as a command group on a device for execution. A command group is submitted atomically to a SYCL queue.

Briefly, a command group refers to “an operation that processes data on the device.”

The anonymous function specified as the argument to submit in the program above is a command group. It can be said to be a broader function object that includes accessors, etc., in addition to the SYCL kernel function.

sycl::queue

A SYCL command queue is an object that holds command groups to be executed on a SYCL device.

It is a data structure (object) for holding command groups. sycl::queue is often declared with information about “on which device the command groups it holds will be executed,” as shown below:

// 1. Holds command groups to be executed on the default device (GPU, CPU or ...)
sycl::queue q(sycl::default_selector_v);
// 2. Same meaning as 1
sycl::queue q;
// 3. Holds command groups to be executed on the default CPU
sycl::queue q(sycl::cpu_selector_v);
// 4. Holds command groups to be executed on the default GPU
sycl::queue q(sycl::gpu_selector_v);

submit

A queue can be used to submit command groups to be executed by the SYCL runtime using the queue::submit member function.

As such, it is used to store (push) command group function objects into a sycl::queue.
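
For example, a minimal command group can be submitted as in the following sketch (the queue and buffer here are illustrative); single_task is a sibling of parallel_for that runs the kernel with a single work-item:

sycl::queue q;
sycl::buffer<int, 1> buf(sycl::range<1>(1));

// Submit one command group: it obtains an accessor and runs a single_task kernel
q.submit([&](sycl::handler &cgh) {
    sycl::accessor acc(buf, cgh, sycl::write_only);
    cgh.single_task([=]() { acc[0] = 42; });
});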

Checking the Kernel Execution Device

By writing code like the following, you can check at program runtime on which device and platform it is executing.

sycl::queue q;
std::cout << q.get_device().get_info<sycl::info::device::name>() << std::endl; // Device name
std::cout << q.get_device().get_info<sycl::info::device::vendor>() << std::endl; // Device vendor
std::cout << q.get_device().get_platform().get_info<sycl::info::platform::name>() << std::endl; // Platform name
std::cout << q.get_device().get_platform().get_info<sycl::info::platform::vendor>() << std::endl; // Platform vendor

sycl::buffer

The buffer class defines a shared array of one, two or three dimensions that can be used by the SYCL kernel and has to be accessed using accessor classes.

Arrays of one, two, or three dimensions that you want to use within a SYCL kernel function can be defined as a sycl::buffer. However, simply defining a sycl::buffer is not enough; its contents can only be used via an accessor, which is explained below.

In the sample program above, buf is defined as a sycl::buffer for the host array arr.
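
As a small sketch, a two-dimensional buffer over host data (here a std::vector, with illustrative sizes) can be declared like this; the dimensionality is part of the buffer's type:

// A 4x8 buffer backed by a host std::vector
std::vector<float> host_data(4 * 8, 0.0f);
sycl::buffer<float, 2> buf2d(host_data.data(), sycl::range<2>(4, 8));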

Accessor

The mechanism for accessing various kinds of memory from a device is called an accessor. "Various kinds of memory" means that, besides buffers (sycl::buffer), there is memory accessible only by the host or only by the device; "local memory" is an example of device-only memory, and accessors are the mechanism for accessing such memory as well.

sycl::accessor

sycl::accessor is a type of accessor, a class for accessing sycl::buffer.

In the program above, acc is defined as a sycl::accessor for accessing the sycl::buffer buf in write-only mode. While “write-only” was written, you can actually specify read/write permission by specifying the third argument of the sycl::accessor constructor as follows:

sycl::read_only  // Read-only
sycl::write_only // Write-only
sycl::read_write // Read-write

Changes made to the contents of a sycl::buffer on the device are not immediately reflected in the host-side array (in the program above, arr). In CUDA, data is explicitly copied from device to host using cudaMemcpy. In the SYCL program above, however, the copy back happens when the sycl::buffer is destroyed, i.e., when its lifetime ends at the closing brace of the block: the buffer's destructor waits for the work that uses it to finish and automatically copies the values back to the host-side array.
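
You do not have to wait for the buffer to be destroyed, however: a sycl::host_accessor can also be used to read the results on the host. A sketch, reusing buf and N from the program above (it must be placed inside the block where buf is alive):

{
    // The host_accessor constructor waits for kernels writing to buf to finish,
    // then gives the host read access to the buffer's contents
    sycl::host_accessor h_acc(buf, sycl::read_only);
    for (int i = 0; i < N; i++) {
        std::cout << "host_accessor: " << h_acc[i] << std::endl;
    }
}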

parallel_for (Data Parallel Kernel)

Up to this point, we have seen how to access arrays (data) on the device. One of the functions for actually processing that data on the device is parallel_for. parallel_for calls a SYCL kernel function and performs parallel processing according to the parameters you specify.

For parallel processing, you need to specify the processing units and, for each processing unit, which data to process. In SYCL, this is specified with sycl::range and sycl::id. The "processing unit" mentioned above is called a work-item in SYCL. In short, sycl::range specifies the number of work-items, sycl::id identifies each work-item, and parallel_for describes the processing that each work-item actually executes.

Here, if the number of work-items is smaller than the maximum number of threads on the device, processing completes in one batch. However, if the number of work-items is larger than the maximum number of threads on the device, processing is performed in multiple batches.

This batch is called a work-group.
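
If you want to control the work-group size yourself, parallel_for can also take a sycl::nd_range, which specifies both the total number of work-items and the size of each work-group. A sketch (the sizes are illustrative; it goes inside a command group as before, and acc is assumed to be a write accessor over at least 64 int elements):

cgh.parallel_for(sycl::nd_range<1>(sycl::range<1>(64),   // 64 work-items in total
                                   sycl::range<1>(16)),  // work-groups of 16 work-items
                 [=](sycl::nd_item<1> item) {
    size_t gid = item.get_global_id(0); // ID among all work-items
    size_t lid = item.get_local_id(0);  // ID within this work-group
    acc[gid] = static_cast<int>(lid);
});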

sycl::range and sycl::id

The first argument to parallel_for is a sycl::range. sycl::range indicates the number of work-items; in the program above, it generates N, i.e., 10 work-items.

In the case of a 1D array like this, there is no particular benefit to using sycl::range, but for 2D or 3D arrays, sycl::range explicitly expresses the dimensionality of the array. The second argument to parallel_for is the SYCL kernel function, and a sycl::id (or, as in the program above, a sycl::item) is passed as an argument to this function object. sycl::id carries information about each work-item, allowing its ID to be checked.

In the program above, get_linear_id() is used to get the ID of each work-item and access the corresponding index of the array data. To rephrase this in CUDA terms, sycl::range corresponds to CUDA’s dim3, and sycl::id corresponds to CUDA’s threadIdx or blockIdx.
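
As noted, the kernel parameter in the first program is a sycl::item, which carries the range in addition to the index. A sketch reusing acc and N from the first program:

cgh.parallel_for(sycl::range<1>(N), [=](sycl::item<1> it) {
    size_t i = it.get_linear_id();   // linear index of this work-item
    size_t total = it.get_range(0);  // total number of work-items (here, N)
    acc[i] = static_cast<int>(total - 1 - i); // e.g. fill the array in reverse order
});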

A Slightly More Complex Program

Let’s make the SYCL program a bit more complex. The following program implements matrix multiplication in SYCL.

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    constexpr int M = 2;
    constexpr int N = 3;
    constexpr int P = 4;

    // Create a queue
    sycl::queue q;

    // Create unified shared memory
    auto A = sycl::malloc_shared<float>(M * N, q);
    auto B = sycl::malloc_shared<float>(N * P, q);
    auto C = sycl::malloc_shared<float>(M * P, q);

    // Initialize the contents of unified shared memory on the host
    for (int i = 0; i < M * N; i++) {
        A[i] = i;
    }
    for (int i = 0; i < N * P; i++) {
        B[i] = i;
    }

    q.submit([&](sycl::handler &cgh) {
        // parallel_for for a 2D array
        cgh.parallel_for(sycl::range<2>(M, P), [=](sycl::id<2> idx) {
            int i = idx[0]; // index in the first dimension (row)
            int j = idx[1]; // index in the second dimension (column)
            float sum = 0.0f;
            for (int k = 0; k < N; k++) {
                // Operate on unified shared memory on the device
                sum += A[i * N + k] * B[k * P + j];
            }
            C[i * P + j] = sum;
        });
    });

    q.wait();

    for (int i = 0; i < M * P; i++) {
        std::cout << "C[" << i << "] = " << C[i] << std::endl;
    }

    // Release the USM allocations
    sycl::free(A, q);
    sycl::free(B, q);
    sycl::free(C, q);

    return 0;
}

The output result is as follows:

C[0] = 20
C[1] = 23
C[2] = 26
C[3] = 29
C[4] = 56
C[5] = 68
C[6] = 80
C[7] = 92

Compared to the simple program presented first, the following differences can be seen:

  • Using sycl::malloc_shared to allocate arrays in unified shared memory (USM).
  • Using a 2D sycl::range to parallelize matrix multiplication.
  • Using a 2D sycl::id to access matrix elements.
  • Using q.wait() to wait for parallel execution to complete.

These contents will be explained below.

sycl::malloc_*: Unified Shared Memory (USM)

Unified Shared Memory (USM) provides a pointer-based alternative to the buffer programming model. USM enables:
• Easier integration into existing code bases by representing allocations as pointers rather than buffers, with full support for pointer arithmetic into allocations.
• Fine-grain control over ownership and accessibility of allocations, to optimally choose between performance and programmer convenience.
• A simpler programming model, by automatically migrating some allocations between SYCL devices and the host.

This is a feature available from SYCL 2020 that serves as an alternative to sycl::buffer and sycl::accessor, and it corresponds to cudaMallocManaged in CUDA. While the contents of a sycl::buffer must be accessed from the device through a sycl::accessor, USM allows access from both host and device through plain pointers.

USM includes sycl::malloc_host, sycl::malloc_device, and sycl::malloc_shared. Using sycl::malloc_shared allows the use of a single unified address accessible from both host and device.
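
For comparison, a pointer from sycl::malloc_device can only be dereferenced on the device, so data is moved explicitly with queue::memcpy. A sketch reusing the queue q from the program above (the size and values are illustrative):

constexpr int count = 16;
std::vector<int> host(count, 1);

int *dev = sycl::malloc_device<int>(count, q);             // device-only allocation
q.memcpy(dev, host.data(), count * sizeof(int)).wait();    // host -> device
q.parallel_for(sycl::range<1>(count), [=](sycl::id<1> i) { // queue shortcut, no handler needed
    dev[i] *= 2;
}).wait();
q.memcpy(host.data(), dev, count * sizeof(int)).wait();    // device -> host
sycl::free(dev, q);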

sycl::range: Parallel Execution Range

As already shown, sycl::range is used to specify the range of parallel execution.

In this example, a 2D sycl::range<2> is used for accessing a 2D array. Up to 3 dimensions can be specified, in which case it would be sycl::range<3>. Similar to CUDA’s dim3, the first argument specifies the size in the x-direction, the second in the y-direction, and the third in the z-direction.

sycl::id: Parallel Execution ID

As already shown, sycl::id is used to obtain the ID of parallel execution.

Similar to sycl::range, up to 3 dimensions can be specified, in which case it would be sycl::id<3>. It’s similar to CUDA’s threadIdx, but there’s a difference in how IDs are obtained: for sycl::id<3> idx, the x-direction ID is idx[0], the y-direction ID is idx[1], and the z-direction ID is idx[2].

Synchronization with wait()

sycl::queue has a function called wait() which can be used for synchronization. It can be thought of as a similar function to cudaDeviceSynchronize() in CUDA.

In the program above, after allocating memory with sycl::malloc_shared, the processing is described with submit, and wait() is used to wait for that processing to complete. Once wait() returns, synchronization is guaranteed, so the contents of the shared allocations reflect the processing submitted with submit; before wait(), no such guarantee exists.
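
In addition, submit returns a sycl::event, so you can also wait for an individual command group instead of the whole queue. A sketch reusing q, C, M, and P from the program above:

sycl::event e = q.submit([&](sycl::handler &cgh) {
    cgh.parallel_for(sycl::range<1>(M * P), [=](sycl::id<1> i) {
        C[i] = 0.0f; // e.g. clear the result matrix
    });
});
e.wait(); // blocks until just this command group has finished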

Building and Running the Program

We will slightly modify the “Slightly More Complex Program,” increase the array sizes, add timing code, and actually run it on multiple architectures. We prepared the following two execution environments for this:

OS           CPU        GPU                              Memory
ubuntu22.04  i9-14900K  NVIDIA RTX 3060 Ti (CUDA 12.6)   DDR5 64GB
ubuntu22.04  i9-12900K  Intel Arc A770 (OpenCL 3.0)      DDR5 32GB

Since the target GPU can be selected at compile time, we change only that part and build and run the program using the GPU in each PC.

The code to be executed is as follows:

#include <sycl/sycl.hpp>
#include <chrono>
#include <cmath>
#include <iostream>
#include <vector>

int main() {
    const auto repeat_num = 100;
    std::vector<float> time(repeat_num);

    constexpr int M = 1 << 10;
    constexpr int N = 1 << 10;
    constexpr int P = 1 << 10;

    sycl::queue q(sycl::gpu_selector_v);

    auto A = sycl::malloc_shared<float>(M * N, q);
    auto B = sycl::malloc_shared<float>(N * P, q);
    auto C = sycl::malloc_shared<float>(M * P, q);

    for (int i = 0; i < M * N; i++) {
        A[i] = i;
    }
    for (int i = 0; i < N * P; i++) {
        B[i] = i;
    }

    for (int n = 0; n < repeat_num; n++) {
        auto start = std::chrono::system_clock::now();

        q.submit([&](sycl::handler &cgh) {
            cgh.parallel_for(sycl::range<2>(M, P), [=](sycl::id<2> idx) {
                int i = idx[0];
                int j = idx[1];
                float sum = 0.0f;
                for (int k = 0; k < N; k++) {
                    sum += A[i * N + k] * B[k * P + j];
                }
                C[i * P + j] = sum;
            });
        });

        q.wait();

        auto end = std::chrono::system_clock::now();
        time[n] = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    }

    float sum = 0.0f;
    float mins = time[0];
    float maxs = time[0];
    for (int i = 0; i < repeat_num; i++) {
        sum += time[i];
        if (time[i] < mins) {
            mins = time[i];
        }
        if (time[i] > maxs) {
            maxs = time[i];
        }
    }
    float avg = sum / repeat_num;
    float diff_sum = 0.0f;
    for (int i = 0; i < repeat_num; i++) {
        diff_sum += std::pow(time[i] - avg, 2);
    }
    float sd = std::sqrt(diff_sum / repeat_num);
    
    std::cout << "Device: " << q.get_device().get_info<sycl::info::device::name>() << std::endl;
    std::cout << "Average time: " << avg << " usec" << std::endl;
    std::cout << "Min time: " << mins << " usec" << std::endl;
    std::cout << "Max time: " << maxs << " usec" << std::endl;
    std::cout << "Standard Deviation: " << sd << " usec" << std::endl;
    
    // Release the USM allocations
    sycl::free(A, q);
    sycl::free(B, q);
    sycl::free(C, q);

    return 0;
}

SYCL Environment Setup

This time, we will use Intel oneAPI DPC++. For environment setup methods, please refer to the content introduced previously on this blog: “Comparing the Performance of SYCL Implementations” and “Build DPC++ toolchain with support for NVIDIA CUDA.”

Execution Results

The results of running by changing the GPU used are as follows.

In this environment, we changed the GPU used with the -fsycl-targets option at compile time. When using the RTX 3060 Ti, we added -fsycl-targets=nvptx64-nvidia-cuda, and when using the Arc A770, we compiled without adding this option.

The output when running on RTX 3060 Ti is as follows:

Device: NVIDIA GeForce RTX 3060 Ti
Average time: 2880.14 usec
Min time: 2824 usec
Max time: 5833 usec
Standard Deviation: 298.558 usec

The output when running on Arc A770 is as follows:

Device: Intel(R) Arc(TM) A770 Graphics
Average time: 8173.93 usec
Min time: 8095 usec
Max time: 12936 usec
Standard Deviation: 480.527 usec

Presented in a table, it is as follows:

GPU                  Average Time (μs)  Min Time (μs)  Max Time (μs)  Standard Deviation (μs)
GeForce RTX 3060 Ti  2880               2824           5833           299
Intel Arc A770       8174               8095           12936          481

Output was obtained on both the RTX 3060 Ti and the Arc A770, confirming that the same code can indeed be executed on each device.

Looking at the processing times, it’s also confirmed that they reflect the performance of each GPU.

Summary

In this article, we did the following:

  • Explanation of simple SYCL syntax and terminology
  • Explanation of a program using SYCL
  • Building a program using SYCL
    • Execution on multiple devices, processing time measurement
    • Explanation that the same program can be executed on multiple devices

Author

shinnosuke.takemoto