arm neon指令集

 

arm neon 是什么

NEON是ARM架构中的一个SIMD引擎,全称为NEON Advanced SIMD,主要用于嵌入式设备中进行向量化计算,以提高计算性能。

NEON的主要特征和作用包括:

  1. 提供了128位的向量寄存器,支持并行计算。

  2. 支持多种整数和浮点数据类型,如8位/16位/32位/64位整数,以及32位和64位浮点数。

  3. 提供了丰富的指令集,可以进行向量加减乘除、逻辑、加载/存储等运算。

  4. 通过单指令多数据(SIMD)技术,可以同时处理多个数据,大大提升嵌入式设备的多媒体和信号处理能力。

  5. 可以与ARM的普通寄存器和指令无缝配合使用。

  6. 应用在图像处理、语音识别、机器学习等领域,可以获得数十倍的性能提升。

  7. 被广泛应用在高端嵌入式设备和移动SoC中,如高端手机、媒体播放器等。

官网样例

Example - RGB deinterleaving

Consider a 24-bit RGB image where the image is an array of pixels, each with a red, blue, and green element.

This is because the RGB data is interleaved, accessing and manipulating the three separate color channels presents a problem to the programmer. In simple circumstances we could write our own single color channel operations by applying the modulo 3 to the interleaved RGB values. However, for more complex operations, such as Fourier transforms, it would make more sense to extract and split the channels.

We have an array of RGB values in memory and we want to deinterleave them and place the values in separate color arrays. A C procedure to do this might look like this:

void rgb_deinterleave_c(uint8_t *r, uint8_t *g, uint8_t *b, uint8_t *rgb, int len_color) {
    /*
     * Take the elements of "rgb" and store the individual colors "r", "g", and "b".
     */
    for (int i=0; i < len_color; i++) {
        r[i] = rgb[3*i];
        g[i] = rgb[3*i+1];
        b[i] = rgb[3*i+2];
    }
}

But there is an issue. Compiling with Arm Compiler 6 at optimization level -O3 (very high optimization) and examining the disassembly shows no Neon instructions or registers are being used. Each individual 8-bit value is stored in a separate 64-bit general registers. Considering the full width Neon registers are 128 bits wide, which could each hold 16 of our 8-bit values in the example, rewriting the solution to use Neon intrinsics should give us good results.

void rgb_deinterleave_neon(uint8_t *r, uint8_t *g, uint8_t *b, uint8_t *rgb, int len_color) {
    /*
     * Take the elements of "rgb" and store the individual colors "r", "g", and "b"
     */
    int num8x16 = len_color / 16;
    uint8x16x3_t intlv_rgb;
    for (int i=0; i < num8x16; i++) {
        intlv_rgb = vld3q_u8(rgb+3*16*i);
        vst1q_u8(r+16*i, intlv_rgb.val[0]);
        vst1q_u8(g+16*i, intlv_rgb.val[1]);
        vst1q_u8(b+16*i, intlv_rgb.val[2]);
    }
}

Example - matrix multiplication

https://developer.arm.com/documentation/102467/0200/Example—matrix-multiplication

void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    for (int i_idx=0; i_idx < n; i_idx++) {
        for (int j_idx=0; j_idx < m; j_idx++) {
            C[n*j_idx + i_idx] = 0;
            for (int k_idx=0; k_idx < k; k_idx++) {
                C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
            }
        }
    }
}

Example - collision detection

https://developer.arm.com/documentation/102467/0200/Example—collision-detection

参考

https://developer.arm.com/documentation/102467/0100/

https://zhuanlan.zhihu.com/p/358603760