
Image by Author | Ideogram
GPUs excel at tasks in which the same operation is performed on many pieces of data. This is known as the Single Instruction, Multiple Data (SIMD) approach. Unlike CPUs, which have only a few powerful cores, GPUs have thousands of smaller ones that can run these repetitive operations simultaneously. You will see this pattern often in machine learning, for example when adding or multiplying huge vectors, because each calculation is independent of the others. That makes it an ideal scenario for using a GPU to speed up parallel tasks.
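To make that independence concrete, here is a tiny pure-Python illustration (not GPU code): each output element depends only on the matching index of the inputs, so nothing stops all of the additions from happening at the same time.

```python
# Element-wise addition: c[i] depends only on a[i] and b[i],
# so every index could in principle be computed simultaneously (SIMD).
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
c = [a[i] + b[i] for i in range(len(a))]
print(c)  # [11.0, 22.0, 33.0, 44.0]
```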
Nvidia created CUDA as a way for programmers to write programs that run on the GPU instead of the CPU. It is based on C and lets you write special kernel functions that perform many operations simultaneously. The problem is that writing CUDA in C or C++ is not exactly beginner-friendly. You need to deal with things like manual memory allocation, thread coordination, and understanding how the GPU works at a low level. It can be overwhelming, especially if you are used to writing code in Python.
This is where Numba can help. It lets you write CUDA kernels in Python, using the LLVM (Low Level Virtual Machine) compiler infrastructure to compile Python code directly into CUDA-compatible kernels. With just-in-time (JIT) compilation, you simply annotate your functions with a decorator, and Numba handles everything else.
In this article, we will work through a common example, vector addition, and convert a plain CPU implementation into a CUDA kernel with Numba. Vector addition is a perfect example of parallelism, because the addition at one index is independent of all other indices. It is an ideal SIMD scenario, so all indices can be added at the same time to complete the vector addition in a single parallel operation.
Note that you will need a CUDA-enabled GPU to follow along with this article. You can use the free T4 GPU on Colab, or a local NVIDIA GPU with the CUDA Toolkit and nvcc installed.
# Environmental configuration and NUMBA installation
Numba is available as a Python package and can be installed with pip. In addition, we will use NumPy for the vector operations. Set up the Python environment using the following commands:
python3 -m venv venv
source venv/bin/activate
pip install numba-cuda numpy
# Adding vectors on the CPU
Let’s start with a simple example of vector addition. Given two vectors, we add the corresponding values at each index to get the final value. We will use NumPy to generate random float32 vectors and compute the final output using a loop.
import numpy as np
N = 10_000_000 # 10 million elements
a = np.random.rand(N).astype(np.float32)
b = np.random.rand(N).astype(np.float32)
c = np.zeros_like(a) # Output array
def vector_add_cpu(a, b, c):
    """Add two vectors on CPU"""
    for i in range(len(a)):
        c[i] = a[i] + b[i]
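The timing comparison later in the article calls a `time_cpu()` helper that is never shown explicitly; a minimal sketch, mirroring the `time_gpu()` wrapper from the GPU section, might look like this (shown here with a small N so it runs standalone; in the article it would reuse the 10-million-element arrays defined above):

```python
import numpy as np

# Small arrays so this sketch runs standalone; the article uses N = 10_000_000
N = 1_000
a = np.random.rand(N).astype(np.float32)
b = np.random.rand(N).astype(np.float32)

def vector_add_cpu(a, b, c):
    """Add two vectors on CPU"""
    for i in range(len(a)):
        c[i] = a[i] + b[i]

def time_cpu():
    """Allocate an output array, run the CPU addition, and return the result."""
    c_cpu = np.zeros_like(a)
    vector_add_cpu(a, b, c_cpu)
    return c_cpu
```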
Here is a breakdown of the code:
- We initialize two vectors, a and b, with 10 million random float32 numbers.
- We also create an empty vector c to store the result.
- The vector_add_cpu function simply loops over each index and adds the elements from a and b, storing the result in c.
This is a serial operation; each addition happens one after another. Although it works fine, it is not the most efficient approach, especially for large datasets. Because each addition is independent of the others, this is an ideal candidate for parallel execution on a GPU.
In the next section, you will see how to convert the same operation to run on the GPU with Numba. By distributing the element-wise additions across thousands of GPU threads, we can complete the task much faster.
# Adding vectors on the GPU with Numba
You will now use Numba to define a Python function that is compiled for CUDA and can be launched from Python. We perform the same vector addition, but now it runs in parallel across every index of the NumPy array, leading to much faster execution.
Here is the kernel code:
from numba import config
# Required for newer CUDA versions to enable linking tools.
# Prevents CUDA toolkit and NVCC version mismatches.
config.CUDA_ENABLE_PYNVJITLINK = 1
from numba import cuda, float32
@cuda.jit
def vector_add_gpu(a, b, c):
    """Add two vectors using CUDA kernel"""
    # Thread ID in the current block
    tx = cuda.threadIdx.x
    # Block ID in the grid
    bx = cuda.blockIdx.x
    # Block width (number of threads per block)
    bw = cuda.blockDim.x
    # Calculate the unique thread position
    position = tx + bx * bw
    # Make sure we don't go out of bounds
    if position < len(a):
        c[position] = a[position] + b[position]

def gpu_add(a, b, c):
    # Define the grid and block dimensions
    threads_per_block = 256
    blocks_per_grid = (N + threads_per_block - 1) // threads_per_block

    # Copy data to the device
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)
    d_c = cuda.to_device(c)

    # Launch the kernel
    vector_add_gpu[blocks_per_grid, threads_per_block](d_a, d_b, d_c)

    # Copy the result back to the host
    d_c.copy_to_host(c)

def time_gpu():
    c_gpu = np.zeros_like(a)
    gpu_add(a, b, c_gpu)
    return c_gpu
Let's break down what happens above.
// Understanding the GPU function
The @cuda.jit decorator tells Numba to treat the function that follows as a CUDA kernel: a special function that will run in parallel across many GPU threads. At call time, Numba compiles this function into CUDA-compatible code, handling the translation to CUDA for you.
@cuda.jit
def vector_add_gpu(a, b, c):
...
This function will run on thousands of threads at the same time. However, each thread needs a way to figure out which part of the data it should operate on. That is what the next few lines do:
- tx is the thread's index within its block
- bx is the block's index within the grid
- bw is the number of threads per block (the block width)
We combine these to calculate position, the unique global index that tells each thread which element of the arrays it should add. Note that the grid may contain more threads than there are elements, because threads are launched in fixed-size blocks (typically a power of 2). This can produce indices past the end of the vector when the vector length is not a multiple of the block size. That is why we add a guard condition to validate the index before performing the addition; it prevents out-of-bounds memory access errors.
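To see this index arithmetic in action without a GPU, here is a plain-Python sketch of the same scheme, using a hypothetical vector of length 10 with 4 threads per block. The last block has two surplus threads, and the same bounds guard as in the kernel skips them, so every valid index is covered exactly once.

```python
# Plain-Python walk-through of CUDA-style indexing:
# each (block, thread) pair maps to one unique global position.
N = 10                   # vector length (deliberately not a multiple of 4)
threads_per_block = 4
blocks_per_grid = (N + threads_per_block - 1) // threads_per_block  # 3 blocks

positions = []
for bx in range(blocks_per_grid):        # simulate each block
    for tx in range(threads_per_block):  # simulate each thread in the block
        position = tx + bx * threads_per_block
        if position < N:                 # the same bounds guard as the kernel
            positions.append(position)

print(positions)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```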
Once each thread knows its unique position, the addition itself looks just like the CPU implementation. The line below mirrors the CPU version:
c[position] = a[position] + b[position]
// Launching the kernel
The gpu_add function sets everything up:
- It determines how many threads and blocks to use. You can experiment with different block and thread values and print them inside the kernel; this can help you understand how GPU indexing works.
- It copies the input arrays (a, b, and c) from CPU memory to GPU memory, so the vectors are available on the device.
- It launches the GPU kernel with vector_add_gpu[blocks_per_grid, threads_per_block].
- Finally, it copies the result back from the GPU into the c array so that we can access the values on the CPU.
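As a quick sanity check of the grid-size formula: the ceiling division guarantees there are at least as many threads as elements, while one block fewer would leave some elements uncovered. With the article's values:

```python
# Ceiling division: enough 256-thread blocks to cover all N elements.
N = 10_000_000
threads_per_block = 256
blocks_per_grid = (N + threads_per_block - 1) // threads_per_block

print(blocks_per_grid)                                # 39063
print(blocks_per_grid * threads_per_block >= N)       # True: every element gets a thread
print((blocks_per_grid - 1) * threads_per_block < N)  # True: no block is wasted
```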
# Comparing the implementations and the speedup
Now that we have both the CPU and GPU implementations of vector addition, it's time to see how they compare. This lets us verify that the results match and measure the performance boost we gain from CUDA's parallelism.
import timeit
c_cpu = time_cpu()
c_gpu = time_gpu()
print("Results match:", np.allclose(c_cpu, c_gpu))
cpu_time = timeit.timeit("time_cpu()", globals=globals(), number=3) / 3
print(f"CPU implementation: {cpu_time:.6f} seconds")
gpu_time = timeit.timeit("time_gpu()", globals=globals(), number=3) / 3
print(f"GPU implementation: {gpu_time:.6f} seconds")
speedup = cpu_time / gpu_time
print(f"GPU speedup: {speedup:.2f}x")
First, we run both implementations and check whether their results match. This is essential to make sure our GPU kernel works correctly and produces the same output as the CPU version.
Then we use Python's built-in timeit module to measure how long each version takes. We run each function three times and take the average for a more reliable measurement. Because time_gpu() was already called once during the verification step, Numba's JIT compilation has already happened, so the timed runs do not include compile time. Finally, we calculate how many times faster the GPU version is than the CPU version. You should see a substantial difference, because the GPU performs many additions at the same time while the CPU handles them one by one in a loop.
Here is the expected output on an NVIDIA T4 GPU on Colab. Note that the exact speedup may vary depending on your CUDA version and underlying hardware.
Results match: True
CPU implementation: 4.033822 seconds
GPU implementation: 0.047736 seconds
GPU speedup: 84.50x
This simple test demonstrates the power of GPU acceleration and why it is so useful for tasks that involve large amounts of data and parallel work.
# Wrapping Up
And that's it. You have now written your first CUDA kernel with Numba, without writing any C or CUDA code. Numba provides a simple interface to the GPU from Python and makes it much easier for Python developers to get started with CUDA programming.
You can now use the same template to write more advanced CUDA algorithms, which are common in machine learning and deep learning. Whenever a problem follows the SIMD paradigm, it is a good candidate for GPU acceleration.
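One practical tip for experimenting further: Numba ships a pure-Python CUDA simulator that can run kernels like the one above on machines without an NVIDIA GPU. It is far too slow for real workloads, but useful for debugging kernel logic. It is enabled via an environment variable, which must be set before numba.cuda is imported:

```shell
# Enable Numba's CUDA simulator (debugging aid only; much slower than real CUDA).
# Must be set in the environment before your Python script imports numba.cuda.
export NUMBA_ENABLE_CUDASIM=1
```

With the variable set, the same vector_add_gpu kernel and launch syntax should work unchanged, just interpreted on the CPU instead of compiled for the GPU.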
The complete code is available in a Colab notebook, which you can access here. Feel free to experiment and make small changes to better understand how CUDA indexing and kernel execution work internally.
Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
