* This blog post is an English translation of an article originally published in Japanese on April 3, 2025.
Introduction
This is Takemoto from Solutions Division II.
Previously on this blog, the articles “Trying to Write OpenCL in a Single Source with SYCL” (Japanese only, 2017) and “Comparing the Performance of SYCL Implementations” (Japanese only, 2023) were published. Those articles covered SYCL environment setup and SYCL source code, but said little about the stages that come before that: an overview of SYCL and how to use it. This article therefore explains SYCL from an overview down to the essential basic features and terminology, while building a simple program. The article is written against the SYCL 2020 specification.
Target Audience
- Those interested in SYCL
- Those who understand basic C++
- Those who understand basic parallel programming
- Knowledge of CUDA will make understanding easier, as there are comparisons with CUDA.
What is SYCL?
Officially:
SYCL is an open industry standard for programming a heterogeneous system. The design of SYCL allows standard C++ source code to be written such that it can run on either an heterogeneous device or on the host.
In other words, SYCL is a unified standard prepared to handle various devices.
Traditionally, if you wanted a CPU, an NVIDIA GPU, an AMD GPU, and so on to perform the same processing, you had to develop with a different language or framework for each. By writing the code in SYCL, however, it can run on all of these devices.
Typical merits of using SYCL include:
- Processing for various devices can be described with this one standard.
- Code can be written as a single source: code intended for device execution can be written in the same .cpp file as the host code.
- C++17 can be used in SYCL 2020.
Simple Program
We will describe a very simple program and explain the basic syntax and features of SYCL based on this program.
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  constexpr int N = 10;
  int arr[N] = {0};
  {
    // A buffer is one way to transfer data between host and device
    sycl::buffer<int, 1> buf(arr, sycl::range<1>(N));
    // Get a default queue
    sycl::queue myQueue;
    // Submit a command group to the device
    myQueue.submit([&](sycl::handler &cgh) {
      // Get an accessor
      sycl::accessor acc(buf, cgh, sycl::write_only);
      // Describe the processing to be executed on the device
      cgh.parallel_for(sycl::range<1>(N), [=](sycl::item<1> id) {
        // Execute on the device
        acc[id.get_linear_id()] = id.get_linear_id();
      });
    });
  } // When buf goes out of scope here, its contents are copied back to arr
  for (int i = 0; i < N; i++) {
    std::cout << "stdout_result: " << arr[i] << std::endl;
  }
  return 0;
}
Output Result
stdout_result: 0
stdout_result: 1
stdout_result: 2
stdout_result: 3
stdout_result: 4
stdout_result: 5
stdout_result: 6
stdout_result: 7
stdout_result: 8
stdout_result: 9
The above program, if written as code processed simply by the CPU, would look like this:
#include <iostream>

int main() {
  constexpr int N = 10;
  int arr[N] = {0};
  for (int i = 0; i < N; i++) {
    arr[i] = i;
  }
  for (int i = 0; i < N; i++) {
    std::cout << "stdout_result: " << arr[i] << std::endl;
  }
  return 0;
}
Host and Device
The SYCL platform model consists of a host connected to one or more heterogeneous devices, called devices.
This is how it’s described in the official documentation. These terms were frequently used in the comments of the program above, and host/device refer to the following:
- Host: Control processor
- Device: Computation processor
Mainly, the host is a CPU, and devices can be CPUs, GPUs, DSPs, etc. For example, in CUDA, the host is the CPU, and the device is the GPU.
As will be described later, the anonymous function specified as the second argument to cgh.parallel_for is executed on the device.
The official documentation also mentions the OpenCL platform; OpenCL is an API for controlling various devices. For details, please refer to the official OpenCL website.
SYCL Kernel Function
A function object that can be executed on a device is called a SYCL kernel function.
As stated officially, a function written to be executed on a device is called a SYCL kernel function. In CUDA terms, this corresponds to a function with the __global__ qualifier.
The anonymous function specified as the second argument to cgh.parallel_for in the program above is a SYCL kernel function.
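For instance, the simplest SYCL kernel function is one submitted with single_task, which executes a single work-item on the device. The following is a minimal standalone sketch (it is not part of the program above; the name resultBuf is introduced only for this example):
#include <sycl/sycl.hpp>

int main() {
  int result = 0;
  {
    sycl::buffer<int, 1> resultBuf(&result, sycl::range<1>(1));
    sycl::queue q;
    q.submit([&](sycl::handler &cgh) {
      sycl::accessor acc(resultBuf, cgh, sycl::write_only);
      // This lambda is a SYCL kernel function, executed once on the device
      cgh.single_task([=]() { acc[0] = 42; });
    });
  } // resultBuf is destroyed here and the value is written back to result
  return result == 42 ? 0 : 1;
}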
Command Group
In SYCL, the operations required to process data on a device are represented using a command group function object. Each command group function object is given a unique command group handler object to perform all the necessary work required to correctly process data on a device using a kernel. In this way, the group of commands for transferring and processing data is enqueued as a command group on a device for execution. A command group is submitted atomically to a SYCL queue.
Briefly, a command group refers to “an operation that processes data on the device.”
The anonymous function specified as the argument to submit in the program above is a command group. It can be said to be a broader function object that, in addition to the SYCL kernel function, also includes the accessors and so on.
sycl::queue
A SYCL command queue is an object that holds command groups to be executed on a SYCL device.
It is a data structure (object) for holding command groups. A sycl::queue is often declared with information about “on which device the command groups it holds will be executed,” as shown below:
// 1. Holds command groups to be executed on the default device (GPU, CPU, etc.)
sycl::queue q(sycl::default_selector_v);
// 2. Same meaning as 1
sycl::queue q;
// 3. Holds command groups to be executed on the default CPU
sycl::queue q(sycl::cpu_selector_v);
// 4. Holds command groups to be executed on the default GPU
sycl::queue q(sycl::gpu_selector_v);
submit
A queue can be used to submit command groups to be executed by the SYCL runtime using the queue::submit member function.
As such, it is used to store (push) command group function objects into a sycl::queue.
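As a side note, SYCL 2020 also defines shortcut member functions directly on sycl::queue, such as queue::parallel_for, which enqueue a kernel without writing an explicit command group. The following is a minimal sketch (not part of the program above) that uses unified shared memory, explained later in this article:
sycl::queue q;
int *data = sycl::malloc_shared<int>(10, q);
// The shortcut form needs no explicit command group handler;
// it returns a sycl::event, which we wait on immediately
q.parallel_for(sycl::range<1>(10), [=](sycl::id<1> i) {
  data[i[0]] = static_cast<int>(i[0]);
}).wait();
sycl::free(data, q);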
Checking the Kernel Execution Device
By writing code like the following, you can check at program runtime on which device and platform it is executing.
sycl::queue q;
std::cout << q.get_device().get_info<sycl::info::device::name>() << std::endl; // Device name
std::cout << q.get_device().get_info<sycl::info::device::vendor>() << std::endl; // Device vendor
std::cout << q.get_device().get_platform().get_info<sycl::info::platform::name>() << std::endl; // Platform name
std::cout << q.get_device().get_platform().get_info<sycl::info::platform::vendor>() << std::endl; // Platform vendor
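If you want to list every device visible to the SYCL runtime, rather than only the device bound to a particular queue, the static function sycl::device::get_devices() can also be used, as in this short sketch:
// Enumerate all devices visible to the SYCL runtime
for (const auto &dev : sycl::device::get_devices()) {
  std::cout << dev.get_info<sycl::info::device::name>() << std::endl;
}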
sycl::buffer
The buffer class defines a shared array of one, two or three dimensions that can be used by the SYCL kernel and has to be accessed using accessor classes.
Arrays of one, two, or three dimensions that you want to use within a SYCL kernel function can be defined as a sycl::buffer. However, simply defining a sycl::buffer is not enough; it can only be used through an accessor, which is explained below.
In the sample program above, buf is defined as a sycl::buffer for the host array arr.
Accessor
The functionality for accessing various memories from a device is called an accessor. Besides buffers (sycl::buffer), “various memories” includes memories accessible only by the host or only by the device; an example of memory accessible only by the device is local memory. Accessors are the mechanism for accessing these memories.
sycl::accessor
sycl::accessor is one type of accessor: a class for accessing a sycl::buffer.
In the program above, acc is defined as a sycl::accessor for accessing the sycl::buffer buf in write-only mode. Although write-only was used here, the access mode can be chosen by passing one of the following as the third argument of the sycl::accessor constructor:
sycl::read_only // Read-only
sycl::write_only // Write-only
sycl::read_write // Read-write
Changes made to the contents of a sycl::buffer via an accessor are not immediately reflected in the host-side array (arr in the program above). In CUDA, data is explicitly copied from device to host using cudaMemcpy. In the SYCL program above, however, the write-back happens when buf goes out of scope: the destructor of sycl::buffer waits for the kernel to finish and then automatically copies the values back to the host-side array.
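As an illustration of the access modes, the sketch below copies one buffer into another using a read_only accessor for the input and a write_only accessor for the output. The names in, out, inBuf, and outBuf are introduced only for this example; in and out are assumed to be host int arrays of length N, and q is a sycl::queue as in the program above:
sycl::buffer<int, 1> inBuf(in, sycl::range<1>(N));
sycl::buffer<int, 1> outBuf(out, sycl::range<1>(N));
q.submit([&](sycl::handler &cgh) {
  sycl::accessor src(inBuf, cgh, sycl::read_only);   // the device only reads
  sycl::accessor dst(outBuf, cgh, sycl::write_only); // the device only writes
  cgh.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
    dst[i] = src[i];
  });
});
// A sycl::host_accessor can be used to read the result on the host before
// outBuf goes out of scope; its construction waits for the kernel to finish
sycl::host_accessor result(outBuf, sycl::read_only);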
parallel_for (Data Parallel Kernel)
Up to this point, we have seen how to access arrays (data) on the device. One of the functions for actually processing that data on the device is parallel_for. parallel_for calls a SYCL kernel function and performs parallel processing according to the parameters it is given.
For parallel processing, it is necessary to specify the processing units and, for each processing unit, which data to process. In SYCL, this is specified with sycl::range and sycl::id. The “processing unit” is called a work-item in SYCL: sycl::range specifies the number of work-items, sycl::id identifies each work-item, and parallel_for describes the processing to be executed by each work-item.
If the number of work-items is smaller than the maximum number of threads on the device, processing completes in a single batch. If the number of work-items is larger than the maximum number of threads on the device, processing is performed in multiple batches. Such a batch is called a work-group.
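In the program above, the work-group size is left to the runtime. If you want to control work-groups explicitly, parallel_for can instead be given a sycl::nd_range, which specifies both the total number of work-items and the work-group size. The following is a minimal sketch, not used in the original program, assuming a handler cgh and an accessor acc to a buffer of at least N ints:
constexpr size_t N = 1024;
constexpr size_t groupSize = 64; // work-group size; must divide N
cgh.parallel_for(
    sycl::nd_range<1>(sycl::range<1>(N), sycl::range<1>(groupSize)),
    [=](sycl::nd_item<1> item) {
      size_t global = item.get_global_id(0); // ID among all work-items
      size_t local = item.get_local_id(0);   // ID within the work-group
      acc[global] = static_cast<int>(local);
    });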
sycl::range and sycl::id
The first argument to parallel_for is a sycl::range. sycl::range indicates the number of work-items; in the program above, it generates N, that is, 10 work-items.
For a 1D array like this one, there is nothing special about using sycl::range, but for 2D or 3D arrays, sycl::range can be used to express the dimensionality of the array explicitly. The second argument to parallel_for is the SYCL kernel function, and a sycl::id (in the program above, a sycl::item<1>, which contains the sycl::id) is passed as the argument to this function object. sycl::id carries information about each work-item, so its ID can be obtained from it.
In the program above, get_linear_id() is used to get the ID of each work-item and to access the corresponding index of the array through the accessor acc. In CUDA terms, sycl::range corresponds to CUDA’s dim3, and sycl::id corresponds to CUDA’s threadIdx or blockIdx.
A Slightly More Complex Program
Let’s make the SYCL program a bit more complex. The following program implements matrix multiplication in SYCL.
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  constexpr int M = 2;
  constexpr int N = 3;
  constexpr int P = 4;
  // Create a queue
  sycl::queue q;
  // Create unified shared memory
  auto A = sycl::malloc_shared<float>(M * N, q);
  auto B = sycl::malloc_shared<float>(N * P, q);
  auto C = sycl::malloc_shared<float>(M * P, q);
  // Initialize the contents of unified shared memory on the host
  for (int i = 0; i < M * N; i++) {
    A[i] = i;
  }
  for (int i = 0; i < N * P; i++) {
    B[i] = i;
  }
  q.submit([&](sycl::handler &cgh) {
    // parallel_for over a 2D range
    cgh.parallel_for(sycl::range<2>(M, P), [=](sycl::id<2> idx) {
      int i = idx[0]; // Row index
      int j = idx[1]; // Column index
      float sum = 0.0f;
      for (int k = 0; k < N; k++) {
        // Operate on unified shared memory on the device
        sum += A[i * N + k] * B[k * P + j];
      }
      C[i * P + j] = sum;
    });
  });
  q.wait();
  for (int i = 0; i < M * P; i++) {
    std::cout << "C[" << i << "] = " << C[i] << std::endl;
  }
  // Free the unified shared memory
  sycl::free(A, q);
  sycl::free(B, q);
  sycl::free(C, q);
  return 0;
}
The output result is as follows:
C[0] = 20
C[1] = 23
C[2] = 26
C[3] = 29
C[4] = 56
C[5] = 68
C[6] = 80
C[7] = 92
Compared to the simple program presented first, the following differences can be seen:
- sycl::malloc_shared is used to allocate the arrays in unified shared memory.
- A 2D sycl::range is used to parallelize the matrix multiplication.
- A 2D sycl::id is used to access the matrix elements.
- q.wait() is used to wait for the parallel execution to complete.
These contents will be explained below.
sycl::malloc_*: Unified Shared Memory (USM)
Unified Shared Memory (USM) provides a pointer-based alternative to the buffer programming model. USM enables:
• Easier integration into existing code bases by representing allocations as pointers rather than buffers, with full support for pointer arithmetic into allocations.
• Fine-grain control over ownership and accessibility of allocations, to optimally choose between performance and programmer convenience.
• A simpler programming model, by automatically migrating some allocations between SYCL devices and the host.
This is a feature available from SYCL 2020, serving as an alternative to sycl::buffer and sycl::accessor; sycl::malloc_shared in particular corresponds to cudaMallocManaged in CUDA. While the contents of a sycl::buffer have to be accessed from the device through a sycl::accessor, USM allows access from both host and device through plain pointers.
USM provides sycl::malloc_host, sycl::malloc_device, and sycl::malloc_shared. Using sycl::malloc_shared gives a single unified address that is accessible from both host and device.
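For comparison, sycl::malloc_device allocates memory that only the device can access, so data has to be copied explicitly with queue::memcpy, much like cudaMemcpy in CUDA. A minimal sketch, assuming a sycl::queue q:
constexpr int N = 10;
int host[N];
for (int i = 0; i < N; i++) {
  host[i] = i;
}
int *dev = sycl::malloc_device<int>(N, q);
// Explicit copy host -> device (queue::memcpy returns a sycl::event)
q.memcpy(dev, host, N * sizeof(int)).wait();
q.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
  dev[i[0]] *= 2; // device-only memory is usable inside the kernel
}).wait();
// Explicit copy device -> host
q.memcpy(host, dev, N * sizeof(int)).wait();
sycl::free(dev, q);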
sycl::range: Parallel Execution Range
As already shown, sycl::range is used to specify the range of parallel execution.
In this example, a 2D sycl::range<2> is used to access a 2D array. Up to three dimensions can be specified, in which case it becomes sycl::range<3>. Similar to CUDA’s dim3, the first argument specifies the size in the x-direction, the second in the y-direction, and the third in the z-direction.
sycl::id: Parallel Execution ID
As already shown, sycl::id is used to obtain the ID of parallel execution.
Like sycl::range, up to three dimensions can be specified, in which case it becomes sycl::id<3>. It is similar to CUDA’s threadIdx, but the IDs are obtained differently: for a sycl::id<3> idx, the x-direction ID is idx[0], the y-direction ID is idx[1], and the z-direction ID is idx[2].
Synchronization with wait()
sycl::queue has a member function called wait() which can be used for synchronization. It can be thought of as a function similar to cudaDeviceSynchronize() in CUDA.
In the program above, after allocating memory with sycl::malloc_shared, the processing is described with submit, and wait() is used to wait for that processing to complete. After wait() returns, synchronization is ensured, so the memory allocated with sycl::malloc_shared reflects the processing submitted with submit. Conversely, before wait(), synchronization is not guaranteed, so it is not assured that the contents of the sycl::malloc_shared memory reflect the processing submitted with submit.
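In addition to waiting on the whole queue, submit() returns a sycl::event, so it is also possible to wait for a single command group. A short sketch based on the program above:
sycl::event e = q.submit([&](sycl::handler &cgh) {
  cgh.parallel_for(sycl::range<2>(M, P), [=](sycl::id<2> idx) {
    // ... same matrix multiplication kernel as above ...
  });
});
e.wait(); // waits only for this command group, not for the whole queue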
Building and Running the Program
We will slightly modify the “Slightly More Complex Program,” increase the array sizes, add timing code, and actually run it on multiple architectures. We prepared the following two execution environments for this:
| OS | CPU | GPU | Memory |
|---|---|---|---|
| Ubuntu 22.04 | i9-14900K | NVIDIA RTX 3060 Ti (CUDA 12.6) | DDR5 64GB |
| Ubuntu 22.04 | i9-12900K | Intel Arc A770 (OpenCL 3.0) | DDR5 32GB |
Since the target device can be specified at compile time, we change the compile options so that the program is compiled for and run on the GPU of each PC.
The code to be executed is as follows:
#include <sycl/sycl.hpp>
#include <chrono>
#include <cmath>
#include <iostream>
#include <vector>

int main() {
  const auto repeat_num = 100;
  std::vector<float> time(repeat_num);
  constexpr int M = 1 << 10;
  constexpr int N = 1 << 10;
  constexpr int P = 1 << 10;
  sycl::queue q(sycl::gpu_selector_v);
  auto A = sycl::malloc_shared<float>(M * N, q);
  auto B = sycl::malloc_shared<float>(N * P, q);
  auto C = sycl::malloc_shared<float>(M * P, q);
  for (int i = 0; i < M * N; i++) {
    A[i] = i;
  }
  for (int i = 0; i < N * P; i++) {
    B[i] = i;
  }
  for (int n = 0; n < repeat_num; n++) {
    auto start = std::chrono::system_clock::now();
    q.submit([&](sycl::handler &cgh) {
      cgh.parallel_for(sycl::range<2>(M, P), [=](sycl::id<2> idx) {
        int i = idx[0];
        int j = idx[1];
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
          sum += A[i * N + k] * B[k * P + j];
        }
        C[i * P + j] = sum;
      });
    });
    q.wait();
    auto end = std::chrono::system_clock::now();
    time[n] = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
  }
  float sum = 0.0f;
  float mins = time[0];
  float maxs = time[0];
  for (int i = 0; i < repeat_num; i++) {
    sum += time[i];
    if (time[i] < mins) {
      mins = time[i];
    }
    if (time[i] > maxs) {
      maxs = time[i];
    }
  }
  float avg = sum / repeat_num;
  float diff_sum = 0.0f;
  for (int i = 0; i < repeat_num; i++) {
    diff_sum += std::pow(time[i] - avg, 2);
  }
  float sd = std::sqrt(diff_sum / repeat_num);
  std::cout << "Device: " << q.get_device().get_info<sycl::info::device::name>() << std::endl;
  std::cout << "Average time: " << avg << " usec" << std::endl;
  std::cout << "Min time: " << mins << " usec" << std::endl;
  std::cout << "Max time: " << maxs << " usec" << std::endl;
  std::cout << "Standard Deviation: " << sd << " usec" << std::endl;
  sycl::free(A, q);
  sycl::free(B, q);
  sycl::free(C, q);
  return 0;
}
SYCL Environment Setup
This time, we will use Intel oneAPI DPC++. For environment setup methods, please refer to the content introduced previously on this blog: “Comparing the Performance of SYCL Implementations” and “Build DPC++ toolchain with support for NVIDIA CUDA.”
Execution Results
The results of running by changing the GPU used are as follows.
In this environment, we selected the GPU to use with the -fsycl-targets option at compile time. When using the RTX 3060 Ti, we added -fsycl-targets=nvptx64-nvidia-cuda; when using the Arc A770, we compiled without this option.
The output when running on RTX 3060 Ti is as follows:
Device: NVIDIA GeForce RTX 3060 Ti
Average time: 2880.14 usec
Min time: 2824 usec
Max time: 5833 usec
Standard Deviation: 298.558 usec
The output when running on Arc A770 is as follows:
Device: Intel(R) Arc(TM) A770 Graphics
Average time: 8173.93 usec
Min time: 8095 usec
Max time: 12936 usec
Standard Deviation: 480.527 usec
Presented in a table, it is as follows:
| GPU | Average Time (μs) | Min Time (μs) | Max Time (μs) | Standard Deviation (μs) |
|---|---|---|---|---|
| GeForce RTX 3060 Ti | 2880 | 2824 | 5833 | 299 |
| Intel Arc A770 | 8174 | 8095 | 12936 | 481 |
We obtained output for both the RTX 3060 Ti and the Arc A770, confirming that the same code can indeed be executed on each device. Looking at the processing times, we can also see that they reflect the performance of each GPU.
Summary
In this article, we did the following:
- Explanation of simple SYCL syntax and terminology
- Explanation of a program using SYCL
- Building a program using SYCL
- Execution on multiple devices, processing time measurement
- Explanation that the same program can be executed on multiple devices
References
- Official documentation repository: https://github.com/KhronosGroup/SYCL-Docs
- Official documentation (English): https://registry.khronos.org/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf
- This might be an old document, so please check if it’s the latest version before using.
- Official documentation (Japanese): https://www.isus.jp/wp-content/uploads/pdf/sycl-2020_JA.pdf
- This might be an old document, so please check if it’s the latest version before using.
- C++ with SYCL* Programming (Intel): https://hpc-event.jp/hpsc2023/material/hpsc2023_day3_workshop_part1.pdf
- This is an easy-to-understand resource for beginners.
- SYCL* Program Structure (Intel): https://hpc-event.jp/hpsc2023/doc/training/sycl_programming_workshop_part2.pdf
- This explains an older version of SYCL, but many aspects are similar to the SYCL2020 specification, making it an easy-to-understand resource for beginners.
- Heterogeneous programming with SYCL (ENCCS): https://enccs.github.io/sycl-workshop/
- OpenCL Official Site: https://www.khronos.org/opencl/
- Sample code for kernel execution device selection:
https://www.isus.jp/wp-content/uploads/pdf/TheParallelUniverse_Issue_52_07.pdf - SYCL-compatible online development environment: https://godbolt.org/
- Compilation and execution are possible by setting the compiler to
x86-64 icx (latest)
and adding-fsycl
to the compile options. - See here for detailed usage instructions: https://www.khronos.org/blog/compiler-explorer-developer-tool-adds-sycl-2020-support
- Compilation and execution are possible by setting the compiler to