
How to Achieve Inter-CTA Sync on NVIDIA GPUs

Background

In many deep learning and scientific applications, inter-CTA synchronization is a necessary pattern. But even with CUDA 9's new cooperative-launch feature, inter-CTA sync is never trivial. This blog will dive into the topic, using a reduction as our working example.
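
As a point of reference, here is a minimal sketch of what that cooperative-launch path looks like (the kernel name grid_sync_demo and the host call are my own illustration, assuming a device that supports cooperative launch and a grid small enough to be co-resident); the rest of this post sticks to the classic fence-plus-atomic pattern instead.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Grid-wide barrier via cooperative groups (requires cooperative launch).
__global__ void grid_sync_demo(int* data)
{
    cg::grid_group grid = cg::this_grid();

    // phase 1: every block writes its part of data ...

    grid.sync();    // no CTA continues until all CTAs have reached this point

    // phase 2: any block may now safely read what the others wrote ...
}

// Host side (error checking omitted):
// void* args[] = { &d_data };
// cudaLaunchCooperativeKernel((void*)grid_sync_demo, grid_dim, block_dim, args);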

Say we need to perform a reduction over an entire grid of threads. A natural idea is to reduce within each block first and across blocks as a second step, the so-called two-step protocol. It is the best version in terms of both performance and comprehensibility. The second step is what we need to pay attention to, because other blocks have to consume the data my block produced in a safe way (a producer-consumer model). Of course, using two kernel launches would kill the problem outright, but that is not in our scope here.

Sample code

// Number of blocks that have finished writing their partial sum.
__device__ int count = 0;

// Grid-wide sum in a single kernel launch. Assumes a launch of the form
// sum<<<BLOCK_NUM, THREAD_NUM, THREAD_NUM * sizeof(int)>>>(...), with
// THREAD_NUM a power of two.
__global__ static void sum(int* data_gpu, int* block_gpu, int* sum_gpu, int length)
{
    extern __shared__ int blocksum[];   // one running sum per thread
    __shared__ int islast;              // written by thread 0, read by the whole block

    const int tid = threadIdx.x;
    const int bid = blockIdx.x;

    // Step 1: each block accumulates its grid-strided slice of the input.
    blocksum[tid] = 0;
    for (int i = bid * THREAD_NUM + tid; i < length; i += BLOCK_NUM * THREAD_NUM)
        blocksum[tid] += data_gpu[i];
    __syncthreads();

    // Tree reduction in shared memory down to blocksum[0].
    for (int offset = THREAD_NUM / 2; offset > 0; offset >>= 1)
    {
        if (tid < offset) blocksum[tid] += blocksum[tid + offset];
        __syncthreads();
    }

    if (tid == 0)
    {
        // Publish this block's partial sum to global memory.
        block_gpu[bid] = blocksum[0];

        // Make the write above visible to the whole device before the
        // atomic below can be observed by any other block.
        __threadfence();

        // Election: the block that increments count last does the final pass.
        int value = atomicAdd(&count, 1);
        islast = (value == gridDim.x - 1);
    }
    __syncthreads();   // broadcast islast to the rest of the block

    // Step 2: only the last block reads every partial sum and writes the total.
    if (islast)
    {
        if (tid == 0)
        {
            int s = 0;
            for (int i = 0; i < BLOCK_NUM; i++)
                s += ((volatile int*)block_gpu)[i];   // volatile: force fresh loads past the local L1
            *sum_gpu = s;
            count = 0;   // reset so the kernel can be launched again
        }
    }
}
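
For completeness, a minimal host-side launch sketch under the same assumptions (THREAD_NUM and BLOCK_NUM are compile-time constants I chose for illustration; error checking omitted):

#define THREAD_NUM 256
#define BLOCK_NUM  64

// d_data, d_block, d_sum are int buffers allocated with cudaMalloc;
// the third launch parameter sizes the dynamic shared array blocksum[].
sum<<<BLOCK_NUM, THREAD_NUM, THREAD_NUM * sizeof(int)>>>(d_data, d_block, d_sum, length);
cudaDeviceSynchronize();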

The single-launch reduction presented in this section is mainly an example of how one or more threads can consume data generated by one or more other threads during the execution of a single kernel. Let's start with threadfence. It has two main roles and three levels of control: one role is to control the visibility of memory accesses (writes only); the other is to keep the compiler from over-optimizing the accesses (reads and writes) above and below the fence (from a line-of-code perspective) when it generates instructions. This is really a combination of hardware and software, the former role being mainly hardware, the latter mainly software (control over compiler behaviour).

Let's start with threadfence's control over memory writes. There is one detail that is not spelled out here: the function can suspend the calling thread (warp) for a certain period of time, until the writes that thread (warp) has issued have completed to a certain level, and only then resume execution. "A certain level" means visible to the partner threads in the same block, to the other threads on the GPU, or globally (e.g. to the CPU, or to other cards taking part in P2P access), as described in this section. This leads to the 3 levels of threadfence functions with different suffixes:

__threadfence_block();
__threadfence();
__threadfence_system();

Each of these three levels is stronger than the last. When used, it causes the calling thread (warp) to be suspended for a longer and longer period of time (suspension does not necessarily hurt performance, because the hardware can switch to another thread or warp during that time, so the cost can be hidden).

Looking at it purely from the hardware side: once a thread has issued the store instructions the compiler generated for its writes, it continues executing without waiting for those writes to complete; without the 3 functions (and the instructions they generate) there is no waiting at all. With them, the thread pauses until the writes have progressed far enough to be "seen" by other threads at the corresponding level, and only then continues. This is why I emphasized "writes only" in the brackets above: for reads, the hardware automatically stalls when the loaded operand is actually used, but writes have no such mechanism. Once the data (from a register) and the store instruction have been handed to the LSU in the SM, i.e. once the store has issued successfully, the thread simply moves on. The three functions therefore introduce three different degrees of delay, so that the hardware pipelines continue only after the write has completed to a given level of visibility. This is useful in many scenarios (e.g. the example at the end of this chapter, where a thread in one block reads data generated by a thread in another block). There is no information on how long this pauses a thread/warp. One interesting detail: when a card of a certain compute capability has a timing erratum at one of these levels, the compiler applies a patch when it generates the fence instruction of that level for that particular card (two instructions are emitted in a row to cover the worst-case timing and ensure the write really lands).
And that's not all. As we all know, hardware cannot live without software, otherwise it is a soulless corpse. Besides the hardware effects of the threadfence family (the corresponding instructions), the compiler, when it encounters the three functions, not only emits those instructions but also changes its optimization behaviour, as mentioned earlier and as described in this section. In layman's terms, it separates the memory accesses before and after the threadfence call and will not reorder access statements across the fence boundary, so it can still optimize for performance while preserving safety and logical correctness. This is why this section contains so many repetitive long sentences. The other part of those long sentences, the one that reads "a thread's ... write operation ... when and where it can be observed by whom (3 levels) ...", is precisely this change in compiler behaviour plus the pause controlled by the 3 different instructions the hardware executes. These are the two main points of threadfence. The reader should look at the three levels for himself; as mentioned before, the main practical difference is the pause time (not from official sources, but from actual observation).
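
To make the two points concrete, here is a minimal producer/consumer sketch of my own (the names ready and payload are not from any official sample): the fence guarantees the payload store is visible device-wide before the flag store can be observed, and the volatile accesses keep the compiler from reusing stale values. Note it assumes the waiting block is resident at the same time as block 0, which is exactly why the reduction example below elects the last block with an atomic counter instead of spinning.

__device__ int payload;              // data produced by one block
__device__ volatile int ready = 0;   // flag observed by the other blocks

__global__ void producer_consumer(int* out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        payload = 42;        // ordinary global store
        __threadfence();     // wait until that store is visible device-wide
        ready = 1;           // only then publish the flag
    }
    else if (threadIdx.x == 0)
    {
        while (ready == 0)   // volatile: a fresh load on every iteration
            ;
        out[blockIdx.x] = *(volatile int*)&payload;  // read the now-visible data
    }
}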
But these two points are not all: this chapter also talks about the volatile keyword. It too happens to have two roles, one on the software side and one on the hardware side, and in the example at the end of this chapter the keyword is indispensable. The software role is similar to that of the threadfence functions, controlling the compiler and changing its behaviour, and is actually familiar to everyone: the compiler will not reuse an old value it already has, such as a p[offset] that has already been read; instead it generates a new access instruction to read it again.

The other point is hardware. As we all know, current cards (e.g. Pascal, a GTX 1080 with compute capability 6.1) have one unified (logical) L2 cache and a separate L1 cache per SM (also called the unified cache at this compute capability; the name may differ at other compute capabilities). That is, the card has 1 L2 cache (for the whole GPU) and 20 independent L1 caches (one per SM). Once SM 10 writes a value, changing the content of its own L1 and/or the global L2 (to the value just written), SM 9, if it keeps issuing ordinary loads, may still get the old content directly from SM 9's L1, simply because it was already there. With the volatile keyword, the compiler not only avoids the value reuse described in point (1) and issues a new access, it also sets the cache-control policy of that access so it can bypass the stale value in the current SM's L1 and obtain the correct result. In other words, on some cards with multiple L1s and a unified L2, coherence is not maintained under certain circumstances. That is volatile.
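
A small illustration of the compiler-side effect (the kernel poll_flag and the pointer p are my own, hypothetical names):

__global__ void poll_flag(int* flag)
{
    // Without volatile the compiler may load *p once into a register and
    // spin on that register forever, even after another SM updates the flag.
    volatile int* p = flag;
    while (*p == 0)
        ;   // volatile forces a fresh load that is not served from a stale cached value
}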
Then let's talk about the example. It is a simplified version with some details removed, but it is still sufficient for our purposes here. If users want the full version, there are many online, from NV and from other authors, that you can refer to when the time comes. Let me start with the important point: the mix of ordinary accesses and atomic operations. First of all, as we said before, this is a single-launch reduction-sum kernel: it takes only one kernel launch to find, in parallel, the final sum of a large amount of data in a buffer (a reduction is an operation that takes a large range of inputs and produces a small range of outputs; you can also think of it as a projection from a large set to a small set). Of course, this example is just a simple summation. Since the kernel wants to do everything in one launch, the code is split into two parts: the first part is simply each block computing its own partial sum and saving it to device memory; the second part is the elected last block doing the second step, reading the partial sums of the previous blocks and computing the final total. That completes the global sum within a single kernel. Because this involves reading results produced by other threads, the previously mentioned volatile + threadfence are used. But here threadfence also has another important role: ordering the ordinary access (the write block_gpu[bid] = ...) against the atomic access (atomicAdd). There is no guaranteed completion order between an ordinary store instruction and an atomic instruction. By inserting a threadfence between the two, the code makes each block pause long enough for its written partial sum to become globally visible before it executes the atomic operation. The last block can then read the partial sums the moment the result of the atomic operation arrives (a successful election), and a later atomic operation can never complete while the earlier ordinary store (which holds the partial sum) is still on its way. That is the main caveat here.
It is also important to note that there are two other versions of this common reduction sum. Version 2 requires two kernel launches, but the code is much simpler. Version 2's kernel only computes each block's own partial sum internally and saves it (not written out here; there are plenty of forums for that), which means version 2's kernel only needs the first half of version 1's kernel; the second half is not needed. Then, by a simple trick (launching it twice in a row), the second launch only needs a single block. This version of the code is simple (only half the size) and sidesteps the threadfence operation, the atomicAdd operation, the if that restricts the atomic to a single thread of the block (so it is performed once collectively), and the broadcast of the result back through shared memory. It simplifies things considerably and often gives better performance as well. So code 2 is highly recommended (the only price is launching the kernel twice, but the kernel itself is greatly simplified).
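
A sketch of what version 2 could look like (the kernel name partial_sums and the launch lines are my own; this is not the author's full code):

// Pass 1: each block writes only its own partial sum - the first half of the
// single-launch kernel above, with nothing after the tree reduction.
// Assumes blockDim.x is a power of two, as before.
__global__ void partial_sums(const int* data_gpu, int* block_gpu, int length)
{
    extern __shared__ int blocksum[];
    const int tid = threadIdx.x;
    const int bid = blockIdx.x;

    int s = 0;
    for (int i = bid * blockDim.x + tid; i < length; i += gridDim.x * blockDim.x)
        s += data_gpu[i];
    blocksum[tid] = s;
    __syncthreads();

    for (int offset = blockDim.x / 2; offset > 0; offset >>= 1)
    {
        if (tid < offset) blocksum[tid] += blocksum[tid + offset];
        __syncthreads();
    }
    if (tid == 0) block_gpu[bid] = blocksum[0];
}

// Launch it twice; the implicit ordering between the two launches replaces
// __threadfence, the atomic election and the volatile reads:
//   partial_sums<<<BLOCK_NUM, THREAD_NUM, THREAD_NUM * sizeof(int)>>>(data_gpu, block_gpu, length);
//   partial_sums<<<1, THREAD_NUM, THREAD_NUM * sizeof(int)>>>(block_gpu, sum_gpu, BLOCK_NUM);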