With Triton, OpenAI has introduced a new programming language designed specifically for machine learning (ML) applications. The open-source language, now released in version 1.0, is based on Python, which is widespread in the ML community. Under the hood, it optimizes code for the architecture of the GPU – currently Nvidia GPUs exclusively.
According to the OpenAI blog, the motivation for developing the language, first presented in 2019, is twofold: on the one hand, ML frameworks are not powerful enough for large artificial neural networks (ANNs); on the other hand, hardware-level GPU programming, for example with Nvidia's CUDA, has a high entry barrier and demands extensive manual optimization.
Tailored to the GPU
In order to run software as efficiently as possible on GPUs, the code must be tailored to the architecture. Memory coalescing ensures that memory transfers from DRAM are combined into large transactions. It is also important to optimize the shared memory management for the data stored in the SRAM. Finally, the calculations must be partitioned and scheduled both within individual streaming multiprocessors (SMs) and across SMs.
While CUDA developers have to perform these optimizations manually, the Triton compiler apparently takes care of memory coalescing, shared memory management and scheduling within the SMs automatically. Manual adjustments are only required for the scheduling across SMs.
Python syntax modeled on Numba
In terms of syntax, Triton relies on Python. In the way it implements GPU functions, it resembles the Numba software package, which translates numerical functions into machine code with a JIT (just-in-time) compiler. Like Numba, Triton uses decorated Python functions. These can be used to define kernels that run in parallel in a grid of instances, each identified by a different program_id.
Unlike Numba, Triton supports pointer arithmetic. The following code example shows the simplest implementation of a kernel that performs all block operations on a single thread and uses pointer arithmetic:
BLOCK = 512

# This is a GPU kernel in Triton.
# Different instances of this
# function may run in parallel.
@jit
def add(X, Y, Z, N):
    # In Triton, each kernel instance
    # executes block operations on a
    # single thread: there is no construct
    # analogous to threadIdx
    pid = program_id(0)
    # block of indices
    idx = pid * BLOCK + arange(BLOCK)
    mask = idx < N
    # Triton uses pointer arithmetics
    # rather than indexing operators
    x = load(X + idx, mask=mask)
    y = load(Y + idx, mask=mask)
    store(Z + idx, x + y, mask=mask)

...
grid = (ceil_div(N, BLOCK),)
# no thread-block
add[grid](x, y, z, x.shape)
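The blocked, masked decomposition that the kernel performs can be mimicked on the CPU with plain NumPy. The following sketch is purely illustrative – it is not how Triton executes on the GPU, and the helper names are invented for this example – but it shows what each kernel instance does with its block of indices and its mask:

```python
import numpy as np

BLOCK = 512

def add_instance(X, Y, Z, N, pid):
    # Mimics one kernel instance: one contiguous block of indices,
    # with a mask guarding against reading past the end of the arrays.
    idx = pid * BLOCK + np.arange(BLOCK)
    valid = idx[idx < N]          # plays the role of mask=mask
    Z[valid] = X[valid] + Y[valid]  # stands in for load/load/store

def ceil_div(a, b):
    return -(-a // b)

N = 1000  # deliberately not a multiple of BLOCK
X = np.arange(N, dtype=np.float32)
Y = np.ones(N, dtype=np.float32)
Z = np.empty_like(X)

# The "grid": one instance per block; Triton would launch
# these in parallel, here they simply run in a loop.
for pid in range(ceil_div(N, BLOCK)):
    add_instance(X, Y, Z, N, pid)

assert np.allclose(Z, X + Y)
```

The mask is the detail worth noting: because N is generally not a multiple of BLOCK, the last instance would otherwise read and write out of bounds.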
For performance reasons, Triton relies on a modular system architecture centered on an intermediate representation (IR) called Triton-IR, in which multi-dimensional blocks of values play a central role. The Triton compiler backend simplifies and optimizes the intermediate code in order to prepare it for the LLVM compiler infrastructure and ultimately lower it to PTX (Parallel Thread Execution). In version 1.0, all optimizations target Nvidia GPUs; compiling for CPUs or AMD GPUs is currently not planned.
More details can be found in a post on the OpenAI blog written by Philippe Tillet, who played a key role in creating the language and introduced it in 2019, before he joined OpenAI. OpenAI started as a research project in 2015, with Elon Musk among its founders. In the beginning, many of its projects were in the area of reinforcement learning. The company achieved great fame with the Generative Pre-trained Transformer 3 (GPT-3) language model. In 2019, OpenAI moved away from its purely non-profit structure and now invests in machine-learning start-ups itself – most recently with 100 million US dollars.
The Triton repository can be found on GitHub, and Tillet explicitly invites developers to fork it. He also asks for help with implementations for platforms beyond Nvidia GPUs.