ops1_LayerNorm算子的CUDA实现与优化

ops(1)：LayerNorm 算子的 CUDA 实现与优化

原文链接: https://zhuanlan.zhihu.com/p/694974164
发布时间: 2024-04-29
作者: ifromeast

概述

本文系统介绍了 LayerNorm 算子的 CUDA 实现与优化全过程，从朴素实现到高度优化版本，性能提升达数十倍。文章涵盖：

LayerNorm 前向和反向传播的完整数学推导
从 CPU 到 GPU 的逐步优化过程
Shared Memory、协作组、向量化等多种优化技术
详细的性能测试数据和分析

测试环境: GPU V100/A100，输入 shape 为 [8, 1024, 768]

一、LayerNorm 前向过程的实现与优化

1.1 LayerNorm 算法原理

目的: 减少深度神经网络中层与层之间的 Covariate Shift，提高网络收敛速度。

数学公式:

对于 m 维向量 x，LayerNorm 的输出为：

y = w ⊙ (x - μ) / √(σ² + ε) + b

其中：

均值: μ = (1/m) Σ xᵢ
标准差: σ = √[(1/m) Σ (xᵢ - μ)²]
可学习参数: w (权重) 和 b (偏置)
ε: 防止除零的小常数 (1e-5)

输入输出参数:

int B, T, C;              // shape: [B, T, C] = [8, 1024, 768]
const float* inp;         // 输入 x
float* mean;              // 均值 μ (用于反向传播)
float* rstd;              // 标准差倒数 1/σ (用于反向传播)
const float* weight;      // 可学习权重 w
const float* bias;        // 可学习偏置 b
float* out;               // 输出 y

1.2 CPU 实现

CPU 实现非常直观，分为四个步骤：

void layernorm_forward_cpu(float* out, float* mean, float* rstd,
                           const float* inp, const float* weight, const float* bias,
                           int B, int T, int C) {
    float eps = 1e-5f;
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float* x = inp + b * T * C + t * C;
            
            // 1. 计算均值
            float m = 0.0f;
            for (int i = 0; i < C; i++) {
                m += x[i];
            }
            m = m / C;
            
            // 2. 计算方差
            float v = 0.0f;
            for (int i = 0; i < C; i++) {
                float xshift = x[i] - m;
                v += xshift * xshift;
            }
            v = v / C;
            
            // 3. 计算标准差倒数
            float s = 1.0f / sqrtf(v + eps);
            
            // 4. 归一化并应用权重和偏置
            float* out_bt = out + b * T * C + t * C;
            for (int i = 0; i < C; i++) {
                float n = s * (x[i] - m);
                float o = n * weight[i] + bias[i];
                out_bt[i] = o;
            }
            
            // 缓存 mean 和 rstd 供反向传播使用
            mean[b * T + t] = m;
            rstd[b * T + t] = s;
        }
    }
}

1.3 简单 CUDA 实现 (V1)

优化思路: LayerNorm 在 C 维度上计算，因此可以在 [B, T] 维度上并行。

实现方式:

const int N = B * T;
const int grid_size = ceil_div(N, block_size);
layernorm_forward_kernel1<<<grid_size, block_size>>>(out, mean, rstd, inp, weight, bias, N, C);

每个线程处理一个 [B, T] 位置，kernel 实现与 CPU 内层循环相同。

性能数据 (V100, 重复 2000 次平均):

block_size	时间 (ms)	带宽 (GB/s)
32	0.5360	93.90
64	0.5220	96.41
128	0.5247	95.93
256	0.7831	64.27
512	1.4427	34.89
1024	2.9885	16.84

分析:

最优 block_size 为 64-128（与 V100 的 SM 最大线程数/最大 block 数 = 2048/32 = 64 一致）
LayerNorm 是 IO bound 操作，带宽利用率反映真实性能
block_size 一般设置为 128 或 256

1.4 使用 Shared Memory (V2)

优化思路: 均值和方差计算是典型的规约操作，使用 shared memory 加速。

关键代码 (计算均值):

__global__ void mean_kernel(float* mean, const float* inp, int N, int C, int block_size) {
    extern __shared__ float shared[];
    int idx = blockIdx.x;
    int tid = threadIdx.x;
    const float* x = inp + idx * C;
    
    // Thread coarsening
    float sum = 0.0f;
    for (int i = tid; i < C; i += block_size) {
        sum += x[i];
    }
    shared[tid] = sum;
    __syncthreads();
    
    // Reduction
    for (int stride = block_size / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (tid < stride) {
            shared[tid] += shared[tid + stride];
        }
    }
    
    if (tid == 0) {
        mean[idx] = shared[0] / C;
    }
}

性能数据:

block_size	时间 (ms)	带宽 (GB/s)	提升
32	0.1485	338.84	3.6x
64	0.1460	344.68	3.6x
128	0.1471	342.06	3.6x
256	0.1715	293.51	3.7x
512	0.2381	211.37	6.1x
1024	0.4147	121.35	7.2x

提升: 相比 V1 性能提升约 4 倍。

1.5 使用协作组 (V3)

优化思路: 使用 CUDA Cooperative Groups 的硬件加速 reduce 操作。

关键代码:

__global__ void layernorm_forward_kernel3(float* __restrict__ out, float* __restrict__ mean, 
                                          float* __restrict__ rstd, const float* __restrict__ inp, 
                                          const float* __restrict__ weight, const float* __restrict__ bias, 
                                          int N, int C) {
    namespace cg = cooperative_groups;
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);
    
    int idx = blockIdx.x * warp.meta_group_size() + warp.meta_group_rank();
    if(idx >= N) return;
    
    const float* x = inp + idx * C;
    
    // 计算均值
    float sum = 0.0f;
    for (int i = warp.thread_rank(); i < C; i += warp.size()) {
        sum += x[i];
    }
    sum = cg::reduce(warp, sum, cg::plus<float>{});  // 硬件加速
    float m = sum / C;
    
    // 计算标准差
    sum = 0.0f;
    for (int i = warp.thread_rank(); i < C; i += warp.size()) {
        float diff = x[i] - m;
        sum += diff * diff;
    }
    sum = cg::reduce(warp, sum, cg::plus<float>{});
    float s = rsqrtf(sum / C + 1e-5f);
    
    // 归一化
    float* o = out + idx * C;
    for (int c = warp.thread_rank(); c < C; c += warp.size()) {
        float n = s * (__ldcs(x+c) - m);  // streaming load
        __stcs(o+c, n * weight[c] + bias[c]);  // streaming store
    }
}

性能数据:

block_size	时间 (ms)	带宽 (GB/s)
32	0.0945	532.55
64	0.1129	445.63
128	0.1139	442.04
256	0.1161	433.67
512	0.1178	427.14
1024	0.1200	419.28

提升: 相比 V2 性能略有提升（约 1.5x）。

1.6 简化方差计算 (V4)

优化思路: 利用数学恒等式简化方差计算。

数学推导:

σ² = E[(X - μ)²]
   = E[X² - 2μX + μ²]
   = E[X²] - 2μE[X] + μ²
   = E[X²] - 2μ² + μ²
   = E[X²] - μ²

即: var(x) = mean(x²) - mean(x)²

关键代码:

// 同时计算 sum(x) 和 sum(x²)
float sum = 0.0f;
float sum2 = 0.0f;
for (int i = warp.thread_rank(); i < C; i += warp.size()) {
    float xi = x[i];
    sum += xi;
    sum2 += xi * xi;
}
sum = cg::reduce(warp, sum, cg::plus<float>{});
sum2 = cg::reduce(warp, sum2, cg::plus<float>{});
sum /= C;   // mean(x)
sum2 /= C;  // mean(x²)

float m = sum;
float var = sum2 - sum * sum;  // 方差
float s = rsqrtf(var + 1e-5f);

性能数据:

block_size	时间 (ms)	带宽 (GB/s)
32	0.0904	556.87
64	0.0956	526.21
128	0.0957	525.68
256	0.0959	524.76
512	0.0962	523.33
1024	0.0968	519.87

提升: 进一步提升性能（约 1.05x）。

1.7 Block 级别 Reduce (V5)

优化思路: 使用整个 block 而非单个 warp 处理一行，通过两级 reduce 提升并行度。

实现方式:

每个线程计算部分和
Warp 级 reduce
将 warp 结果写入 shared memory
再次 warp reduce 得到最终结果

关键代码:

__global__ void layernorm_forward_kernel5(float* __restrict__ out, float* __restrict__ mean, 
                                          float* __restrict__ rstd, const float* __restrict__ inp, 
                                          const float* __restrict__ weight, const float* __restrict__ bias, 
                                          int N, int C) {
    namespace cg = cooperative_groups;
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);
    
    __shared__ float shared_sum[32];   // 最多 32 个 warp
    __shared__ float shared_sum2[32];
    
    int num_warps = blockDim.x / 32;
    int warp_id = threadIdx.x / 32;
    int lane_id = threadIdx.x % 32;
    int idx = blockIdx.x;
    
    const float* x = inp + idx * C;
    
    // 1. 线程级累加
    float thread_sum = 0.0f;
    float thread_sum2 = 0.0f;
    for (int i = threadIdx.x; i < C; i += blockDim.x) {
        float xi = x[i];
        thread_sum += xi;
        thread_sum2 += xi * xi;
    }
    
    // 2. Warp 级 reduce
    float warp_sum = cg::reduce(warp, thread_sum, cg::plus<float>{});
    float warp_sum2 = cg::reduce(warp, thread_sum2, cg::plus<float>{});
    
    // 3. 写入 shared memory
    shared_sum[warp_id] = warp_sum;
    shared_sum2[warp_id] = warp_sum2;
    __syncthreads();
    
    // 4. 第二次 warp reduce
    warp_sum = (lane_id < num_warps) ? shared_sum[lane_id] : 0.0f;
    warp_sum2 = (lane_id < num_warps) ? shared_sum2[lane_id] : 0.0f;
    float block_sum = cg::reduce(warp, warp_sum, cg::plus<float>{});
    float block_sum2 = cg::reduce(warp, warp_sum2, cg::plus<float>{});
    
    // 5. 计算均值、方差、标准差
    block_sum /= C;
    block_sum2 /= C;
    float m = block_sum;
    float var = block_sum2 - m * m;
    float s = rsqrtf(var + 1e-5f);
    
    // 6. 归一化
    float* o = out + idx * C;
    for (int i = threadIdx.x; i < C; i += blockDim.x) {
        float n = s * (__ldcs(x+i) - m);
        __stcs(o+i, n * weight[i] + bias[i]);
    }
}

性能数据:

block_size	时间 (ms)	带宽 (GB/s)
32	0.1138	442.20
64	0.0846	595.24
128	0.0754	667.89
256	0.0752	669.55
512	0.0872	577.30
1024	0.1238	406.53

最优性能: block_size=256 时达到 669.55 GB/s。

二、LayerNorm 反向过程的实现与优化

2.1 反向梯度推导

目标: 已知 ∂L/∂y，求 ∂L/∂w、∂L/∂b、∂L/∂x。

简化记号: 令 x̂ = (x - μ) / σ，则 y = w ⊙ x̂ + b

参数梯度 (简单):

∂L/∂wᵢ = ∂L/∂yᵢ · x̂ᵢ
∂L/∂bᵢ = ∂L/∂yᵢ

输入梯度 (复杂):

∂L/∂xᵢ = (1/σ) · [∂L/∂yᵢ · wᵢ - (1/m) · (Σⱼ ∂L/∂yⱼ · wⱼ + x̂ᵢ · Σⱼ ∂L/∂yⱼ · wⱼ · x̂ⱼ)]

详细推导:

均值梯度:

∂μ/∂xᵢ = 1/m

标准差梯度:

∂σ/∂xᵢ = (1/m) · σ⁻¹ · (xᵢ - μ)

归一化值梯度:

∂x̂ⱼ/∂xᵢ = σ⁻¹ · δᵢⱼ - (1/m) · σ⁻¹ - (1/m) · σ⁻¹ · x̂ᵢ · x̂ⱼ

最终梯度:

∂L/∂xᵢ = Σⱼ (∂L/∂yⱼ · wⱼ · ∂x̂ⱼ/∂xᵢ)

2.2 CPU 实现

void layernorm_backward_cpu(float* dinp, float* dweight, float* dbias,
                            const float* dout, const float* inp, const float* weight, 
                            const float* mean, const float* rstd,
                            int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float* dout_bt = dout + b * T * C + t * C;
            const float* inp_bt = inp + b * T * C + t * C;
            float* dinp_bt = dinp + b * T * C + t * C;
            const float mean_bt = mean[b * T + t];
            const float rstd_bt = rstd[b * T + t];
            
            // 1. 计算两个规约值
            float dnorm_mean = 0.0f;
            float dnorm_norm_mean = 0.0f;
            for (int i = 0; i < C; i++) {
                float norm_bti = (inp_bt[i] - mean_bt) * rstd_bt;
                float dnorm_i = weight[i] * dout_bt[i];
                dnorm_mean += dnorm_i;
                dnorm_norm_mean += dnorm_i * norm_bti;
            }
            dnorm_mean = dnorm_mean / C;
            dnorm_norm_mean = dnorm_norm_mean / C;
            
            // 2. 累加所有梯度
            for (int i = 0; i < C; i++) {
                float norm_bti = (inp_bt[i] - mean_bt) * rstd_bt;
                float dnorm_i = weight[i] * dout_bt[i];
                
                dbias[i] += dout_bt[i];
                dweight[i] += norm_bti * dout_bt[i];
                
                float dval = 0.0f;
                dval += dnorm_i;                      // term 1
                dval -= dnorm_mean;                   // term 2
                dval -= norm_bti * dnorm_norm_mean;   // term 3
                dval *= rstd_bt;                      // final scale
                dinp_bt[i] += dval;
            }
        }
    }
}

2.3 简单 CUDA 实现 (V1)

实现方式: 直接将 CPU 代码并行化，每个线程处理一个 [B, T] 位置。

关键点: 使用 atomicAdd 累加 dweight 和 dbias（多个线程写同一位置）。

__global__ void layernorm_backward_kernel1(...) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= B*T) return;
    
    // ... 计算 dnorm_mean 和 dnorm_norm_mean ...
    
    for (int i = 0; i < C; i++) {
        // ...
        atomicAdd(&dbias[i], dout_bt[i]);
        atomicAdd(&dweight[i], norm_bti * dout_bt[i]);
        // ...
    }
}

性能数据:

block_size	时间 (ms)
32	3.6287
64	2.8921
128	3.1105
256	4.4975
512	6.4611
1024	7.3204

2.4 使用 Shared Memory 和协作组 (V2)

优化思路:

使用 shared memory 缓存 dweight 和 dbias，减少 global memory 的 atomic 操作
使用协作组加速 reduce 操作

关键代码:

__global__ void layernorm_backward_kernel2(...) {
    extern __shared__ float shared[];  // 大小 = 2 * C
    
    namespace cg = cooperative_groups;
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);
    
    int idx = blockIdx.x * warp.meta_group_size() + warp.meta_group_rank();
    if(idx >= N) return;
    
    // Shared memory 分配
    float* dbias_shared = shared;
    float* dweight_shared = shared + C;
    
    // 初始化 shared memory
    for(int i = threadIdx.x; i < C; i += blockDim.x){
        dbias_shared[i] = 0.0f;
        dweight_shared[i] = 0.0f;
    }
    __syncthreads();
    
    // 计算两个规约值
    float dnorm_mean = 0.0f;
    float dnorm_norm_mean = 0.0f;
    for (int i = warp.thread_rank(); i < C; i += warp.size()) {
        float norm_bti = (inp_bt[i] - mean_bt) * rstd_bt;
        float dnorm_i = weight[i] * dout_bt[i];
        dnorm_mean += dnorm_i;
        dnorm_norm_mean += dnorm_i * norm_bti;
    }
    dnorm_mean = cg::reduce(warp, dnorm_mean, cg::plus<float>{});
    dnorm_norm_mean = cg::reduce(warp, dnorm_norm_mean, cg::plus<float>{});
    dnorm_mean /= C;
    dnorm_norm_mean /= C;
    
    // 累加梯度到 shared memory
    for (int i = warp.thread_rank(); i < C; i += warp.size()) {
        float norm_bti = (inp_bt[i] - mean_bt) * rstd_bt;
        float dnorm_i = weight[i] * dout_bt[i];
        
        atomicAdd(&dbias_shared[i], dout_bt[i]);
        atomicAdd(&dweight_shared[i], norm_bti * dout_bt[i]);
        
        float dval = dnorm_i - dnorm_mean - norm_bti * dnorm_norm_mean;
        dval *= rstd_bt;
        dinp_bt[i] += dval;
    }
    __syncthreads();
    
    // 写回 global memory
    for(int i = threadIdx.x; i < C; i += blockDim.x){
        atomicAdd(&dbias[i], dbias_shared[i]);
        atomicAdd(&dweight[i], dweight_shared[i]);
    }
}

性能数据:

block_size	时间 (ms)	提升
32	0.4004	9.1x
64	0.3138	9.2x
128	0.2472	12.6x
256	0.2388	18.8x
512	0.2395	27.0x
1024	0.2405	30.4x

提升: 相比 V1 性能提升 10-30 倍！

三、性能对比与总结

3.1 前向传播性能对比

版本	优化技术	最优带宽 (GB/s)	相比V1提升
V1	朴素并行	96.41	1.0x
V2	Shared Memory	344.68	3.6x
V3	协作组	532.55	5.5x
V4	简化方差	556.87	5.8x
V5	Block Reduce	669.55	6.9x

3.2 反向传播性能对比

版本	优化技术	最优时间 (ms)	相比V1提升
V1	朴素并行	2.8921	1.0x
V2	Shared Memory + 协作组	0.2388	12.1x

3.3 关键优化技术总结

Shared Memory
- 减少 global memory 访问
- 加速规约操作
- 前向提升 3.6x，反向提升 10x+
协作组 (Cooperative Groups)
- 硬件加速的 reduce 操作
- 灵活的同步粒度控制
- 提升约 1.5x
数学优化
- 方差计算简化: var(x) = E(x²) - E(x)²
- 减少一次遍历
- 提升约 1.05x
Block 级并行
- 两级 reduce: warp → block
- 提高并行度
- 最终达到 6.9x 提升
内存访问优化
- __ldcs / __stcs: streaming load/store
- 提示编译器数据不会重用
- 提高 cache 命中率

3.4 Block Size 选择原则

必须是 32 的倍数（warp 大小）
不超过 1024（硬件限制）
考虑 SM 最大线程数和最大 block 数的比值
一般选择 128 或 256
需要实测确定最优值

3.5 学习要点

本文通过 LayerNorm 算子的完整优化过程，展示了：

规约操作的优化: 从朴素实现到 shared memory 再到协作组
内存层次的利用: global → shared → register
并行粒度的权衡: thread → warp → block
数学优化的价值: 算法层面的改进同样重要
性能分析方法: 带宽利用率是 IO bound 操作的关键指标

3.6 参考资源

源代码: ifromeast/cuda_learning
参考实现: karpathy/llm.c
相关文章:
- Layer Normalization 反向传播推导
- CUDA 编程入门之 Cooperative Groups
- OneFlow: 如何设置 CUDA Kernel 中的 grid_size 和 block_size

总结: 通过系统的优化过程，LayerNorm 前向性能提升了 6.9 倍，反向性能提升了 12 倍，充分展示了 CUDA 优化的威力和方法论。这些技术可以推广到其他算子的优化中。

ops(1)：LayerNorm 算子的 CUDA 实现与优化

目录

概述

一、LayerNorm 前向过程的实现与优化

1.1 LayerNorm 算法原理

1.2 CPU 实现

1.3 简单 CUDA 实现 (V1)

1.4 使用 Shared Memory (V2)

1.5 使用协作组 (V3)

1.6 简化方差计算 (V4)

1.7 Block 级别 Reduce (V5)

二、LayerNorm 反向过程的实现与优化

2.1 反向梯度推导

2.2 CPU 实现

2.3 简单 CUDA 实现 (V1)

2.4 使用 Shared Memory 和协作组 (V2)

三、性能对比与总结

3.1 前向传播性能对比

3.2 反向传播性能对比

3.3 关键优化技术总结

3.4 Block Size 选择原则

3.5 学习要点

3.6 参考资源