

Neon intrinsics

Published: 2023/12/8

1. Introduction

The previous article introduced ARM NEON; this one covers NEON intrinsics, i.e. the C-level interface you use before dropping down to assembly. NEON is the SIMD instruction set introduced with the Armv7 architecture, which provides 16 128-bit registers. On the newer Arm64 architecture the register count grows to 32, but the maximum width is still 128 bits, so the programming model is essentially unchanged. Because each of these registers can hold and process several data elements at once, they are called vector registers. Intrinsics let you drive the NEON registers from C; compared with writing pure assembly, they are more readable and faster to develop with. To call NEON intrinsics in your code you must include the header "arm_neon.h". For the full set of functions, see the official ARM NEON intrinsics reference. Below is a summary of the NEON intrinsics collected from around the web:

1.1 Instruction classes

  • Normal instructions: produce a result vector of the same size, and usually the same type, as the operand vectors
  • Long instructions: operate on doubleword vector operands and produce a quadword result; the result elements are generally twice the width of the operand elements
  • Wide instructions: operate on one doubleword and one quadword vector operand and produce a quadword result; the result elements and the first operand's elements are twice the width of the second operand's elements
  • Narrow instructions: operate on quadword vector operands and produce a doubleword result; the result elements are generally half the width of the operand elements
  • Saturating instructions: results that exceed the range of the data type are clamped to that range

Example 1:

  • int16x8_t vqaddq_s16 (int16x8_t, int16x8_t)
  • int16x4_t vqadd_s16 (int16x4_t, int16x4_t)
  • The first letter 'v' marks a vector instruction, i.e. a NEON instruction;
  • The second letter 'q' marks a saturating instruction: the result of the addition saturates automatically;
  • The third field 'add' names the operation, addition;
  • The fourth field 'q' gives the operand register width: with 'q' the instruction operates on 128-bit Q registers; without it, on 64-bit D registers;
  • The fifth field 's16' gives the element type, a signed 16-bit integer, with range -32768 ~ 32767;
  • Parameter and return types follow the usual C conventions.

  Other mnemonics you may encounter include:

    • l: long instruction, elements are widened
    • w: wide instruction, the narrower operand is widened
    • n: narrow instruction, elements are narrowed

    For a categorised list of all the functions, see this blog post: https://blog.csdn.net/hemmingway/article/details/44828303

    1.2 Data types

    The main built-in integer data types of the NEON intrinsics are the following:

    • (u)int8x8_t;
    • (u)int8x16_t;
    • (u)int16x4_t;
    • (u)int16x8_t;
    • (u)int32x2_t;
    • (u)int32x4_t;
    • (u)int64x1_t;
    • (u)int64x2_t;

    The first number is the element width in bits (8/16/32/64) and the second is the number of elements of that type held in one register. For example, int16x8_t holds 8 signed 16-bit integers.

    2. Syntax

    2.1 Arithmetic

    • add: vaddq_f32 or vaddq_f64 (sum = v1 + v2)
    float32x4_t v1 = { 1.0, 2.0, 3.0, 4.0 }, v2 = { 1.0, 1.0, 1.0, 1.0 }; float32x4_t sum = vaddq_f32(v1, v2); // => sum = { 2.0, 3.0, 4.0, 5.0 }
    • multiply: vmulq_f32 or vmulq_f64 (prod = v1 * v2)
    float32x4_t v1 = { 1.0, 2.0, 3.0, 4.0 }, v2 = { 1.0, 1.0, 1.0, 1.0 }; float32x4_t prod = vmulq_f32(v1, v2); // => prod = { 1.0, 2.0, 3.0, 4.0 }
    • multiply and accumulate: vmlaq_f32 (acc = v3 + v1 * v2)
    float32x4_t v1 = { 1.0, 2.0, 3.0, 4.0 }, v2 = { 2.0, 2.0, 2.0, 2.0 }, v3 = { 3.0, 3.0, 3.0, 3.0 }; float32x4_t acc = vmlaq_f32(v3, v1, v2); // acc = v3 + v1 * v2 // => acc = { 5.0, 7.0, 9.0, 11.0 }
    • multiply by a scalar: vmulq_n_f32 or vmulq_n_f64 (prod = v * s)
    float32x4_t v = { 1.0, 2.0, 3.0, 4.0 }; float32_t s = 3.0; float32x4_t prod = vmulq_n_f32(v, s); // => prod = { 3.0, 6.0, 9.0, 12.0 }
    • multiply by a scalar and accumulate: vmlaq_n_f32 or vmlaq_n_f64 (acc = v1 + v2 * s)
    float32x4_t v1 = { 1.0, 2.0, 3.0, 4.0 }, v2 = { 1.0, 1.0, 1.0, 1.0 }; float32_t s = 3.0; float32x4_t acc = vmlaq_n_f32(v1, v2, s); // => acc = { 4.0, 5.0, 6.0, 7.0 }
    • invert (needed for division): vrecpeq_f32 or vrecpeq_f64 (reciprocal ≈ 1 / v, a coarse estimate)
    float32x4_t v = { 1.0, 2.0, 3.0, 4.0 }; float32x4_t reciprocal = vrecpeq_f32(v); // => reciprocal = { 0.998046875, 0.499023438, 0.333007813, 0.249511719 }
    • invert (more accurately): use a Newton-Raphson iteration to refine the estimate (reciprocal = 1 / v)
    float32x4_t v = { 1.0, 2.0, 3.0, 4.0 }; float32x4_t reciprocal = vrecpeq_f32(v); float32x4_t inverse = vmulq_f32(vrecpsq_f32(v, reciprocal), reciprocal); // => inverse = { 0.999996185, 0.499998093, 0.333333015, 0.249999046 }

    2.2 Load

    • load vector: vld1q_f32 or vld1q_f64
    float values[5] = { 1.0, 2.0, 3.0, 4.0, 5.0 }; float32x4_t v = vld1q_f32(values); // => v = { 1.0, 2.0, 3.0, 4.0 }
    • load same value for all lanes: vld1q_dup_f32 or vld1q_dup_f64
    float val = 3.0; float32x4_t v = vld1q_dup_f32(&val); // => v = { 3.0, 3.0, 3.0, 3.0 }
    • set all lanes to a hardcoded value: vmovq_n_f16 or vmovq_n_f32 or vmovq_n_f64
    float32x4_t v = vmovq_n_f32(1.5); // => v = { 1.5, 1.5, 1.5, 1.5 }

    2.3 Store

    • store vector: vst1q_f32 or vst1q_f64
    float32x4_t v = { 1.0, 2.0, 3.0, 4.0 }; float values[5]; vst1q_f32(values, v); // => values = { 1.0, 2.0, 3.0, 4.0, <uninitialized> }
    • store lane of array of vectors: vst4q_lane_f16 or vst4q_lane_f32 or vst4q_lane_f64 (change to vst1... / vst2... / vst3... for other array lengths)
    float32x4_t v0 = { 1.0, 2.0, 3.0, 4.0 }, v1 = { 5.0, 6.0, 7.0, 8.0 }, v2 = { 9.0, 10.0, 11.0, 12.0 }, v3 = { 13.0, 14.0, 15.0, 16.0 }; float32x4x4_t u = { v0, v1, v2, v3 }; float buff[4]; vst4q_lane_f32(buff, u, 0); // => buff = { 1.0, 5.0, 9.0, 13.0 }

    2.4 Arrays

    • access to values: val[n]
    float32x4_t v0 = { 1.0, 2.0, 3.0, 4.0 }, v1 = { 5.0, 6.0, 7.0, 8.0 }, v2 = { 9.0, 10.0, 11.0, 12.0 }, v3 = { 13.0, 14.0, 15.0, 16.0 }; float32x4x4_t ary = { v0, v1, v2, v3 }; float32x4_t v = ary.val[2]; // => v = { 9.0, 10.0, 11.0, 12.0 }

    2.5 Max and min

    • max of two vectors, element by element:
    float32x4_t v0 = { 5.0, 2.0, 3.0, 4.0 }, v1 = { 1.0, 6.0, 7.0, 8.0 }; float32x4_t v2 = vmaxq_f32(v0, v1); // => v2 = { 5.0, 6.0, 7.0, 8.0 }
    • max of vector elements, using folding maximum:
    float32x4_t v0 = { 1.0, 2.0, 3.0, 4.0 }; float32x2_t maxOfHalfs = vpmax_f32(vget_low_f32(v0), vget_high_f32(v0)); float32x2_t maxOfMaxOfHalfs = vpmax_f32(maxOfHalfs, maxOfHalfs); float maxValue = vget_lane_f32(maxOfMaxOfHalfs, 0); // => maxValue = 4.0
    • min of two vectors, element by element:
    float32x4_t v0 = { 5.0, 2.0, 3.0, 4.0 }, v1 = { 1.0, 6.0, 7.0, 8.0 }; float32x4_t v2 = vminq_f32(v0, v1); // => v2 = { 1.0, 2.0, 3.0, 4.0 }
    • min of vector elements, using folding minimum:
    float32x4_t v0 = { 1.0, 2.0, 3.0, 4.0 }; float32x2_t minOfHalfs = vpmin_f32(vget_low_f32(v0), vget_high_f32(v0)); float32x2_t minOfMinOfHalfs = vpmin_f32(minOfHalfs, minOfHalfs); float minValue = vget_lane_f32(minOfMinOfHalfs, 0); // => minValue = 1.0

    2.6 Conditionals

    • ternary operator: use a vector comparison (for example vcltq_f32 for a less-than comparison)
    float32x4_t v1 = { 1.0, 0.0, 1.0, 0.0 }, v2 = { 0.0, 1.0, 1.0, 0.0 };
    float32x4_t mask = vcltq_f32(v1, v2); // v1 < v2
    float32x4_t ones = vmovq_n_f32(1.0), twos = vmovq_n_f32(2.0);
    float32x4_t v3 = vbslq_f32(mask, ones, twos); // selects from the first operand where the mask bits are 1, from the second where they are 0
    // => v3 = { 2.0, 1.0, 2.0, 2.0 }

    3. Example: OpenCV vs. C pointers vs. NEON

    Setup: a 640 x 480 image. Operation: each row of the output is the per-column sum of the current row and the two rows above it, i.e. x(i,j) = x(i,j) + x(i-1,j) + x(i-2,j).

    • The OpenCV approach, where cv::Mat gray, src and src comes from each frame (640x480, 8-bit depth). Note that the main loop keeps a running sum: gray(row) = gray(row-1) + src(row) - src(row-3) is a sliding-window form of the three-row sum.
    GETTIME(&lTimeStart);
    for (int col = 0; col < gray.cols; col++) {
        gray.at<uchar>(0, col) = src.at<uchar>(0, col);
    }
    for (int col = 0; col < gray.cols; col++) {
        gray.at<uchar>(1, col) = gray.at<uchar>(0, col) + src.at<uchar>(1, col);
    }
    for (int col = 0; col < gray.cols; col++) {
        gray.at<uchar>(2, col) = gray.at<uchar>(1, col) + src.at<uchar>(2, col);
    }
    for (int row = 3; row < gray.rows; row++) {
        for (int col = 0; col < gray.cols; col++) {
            gray.at<uchar>(row, col) = gray.at<uchar>(row - 1, col) + src.at<uchar>(row, col) - src.at<uchar>(row - 3, col);
        }
    }
    GETTIME(&lTimeEnd);
    printf("time %ldus\n", lTimeEnd - lTimeStart);
    On an Arm Cortex-A57 platform, the OpenCV version averages: time = 19175us
    • The C-pointer approach:
    GETTIME(&lTimeStart);
    unsigned char *ptr = src.ptr(0);
    unsigned char *grayPtr = gray.ptr(0);
    for (int col = 0; col < gray.cols; col++) {
        grayPtr[col] = ptr[col];
    }
    unsigned char *ptr1 = src.ptr(1);
    unsigned char *grayPtr1 = gray.ptr(1);
    for (int col = 0; col < gray.cols; col++) {
        grayPtr1[col] = ptr[col] + ptr1[col]; // 34us
    }
    unsigned char *ptr2 = NULL;
    unsigned char *grayPtr2 = NULL;
    for (int row = 2; row < gray.rows; row++) {
        ptr = src.ptr(row - 2);
        ptr1 = src.ptr(row - 1);
        ptr2 = src.ptr(row);
        grayPtr2 = gray.ptr(row);
        for (int col = 0; col < gray.cols; col++) { // one pixel per iteration
            grayPtr2[col] = ptr[col] + ptr1[col] + ptr2[col];
        }
    }
    GETTIME(&lTimeEnd);
    printf("time %ldus\n", lTimeEnd - lTimeStart);
    On an Arm Cortex-A57 platform, the C-pointer version averages: time = 11252us
    • The NEON approach:
    GETTIME(&lTimeStart);
    unsigned char *ptr = src.ptr(0);
    unsigned char *grayPtr = gray.ptr(0);
    for (int col = 0; col < gray.cols; col++) {
        grayPtr[col] = ptr[col];
    }
    unsigned char *ptr1 = src.ptr(1);
    unsigned char *grayPtr1 = gray.ptr(1);
    for (int col = 0; col < gray.cols; col++) {
        grayPtr1[col] = ptr[col] + ptr1[col];
    }
    unsigned char *ptr2 = NULL;
    unsigned char *grayPtr2 = NULL;
    for (int row = 2; row < gray.rows; row++) {
        ptr = src.ptr(row - 2);
        ptr1 = src.ptr(row - 1);
        ptr2 = src.ptr(row);
        grayPtr2 = gray.ptr(row);
        for (int col = 0; col < gray.cols; col += 16) { // 16 pixels per iteration
            uint8x16_t in1, in2, in3, out;
            in1 = vld1q_u8(ptr + col);
            in2 = vld1q_u8(ptr1 + col);
            in3 = vld1q_u8(ptr2 + col);
            out = vaddq_u8(in1, in2);
            out = vaddq_u8(in3, out);
            vst1q_u8(grayPtr2 + col, out);
        }
    }
    GETTIME(&lTimeEnd);
    printf("time %ldus\n", lTimeEnd - lTimeStart);
    On an Arm Cortex-A57 platform, the NEON intrinsics version averages: time = 1907us

    In summary, the NEON version is roughly 10x faster than the OpenCV version. (Note that these additions can overflow; this particular algorithm tolerates that, so no overflow handling was added.)

