Android NDK development: an introduction to using NEON
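To use NEON intrinsics in C source code, the first thing needed is the header arm_neon.h:
#include <arm_neon.h>
After adding this line, running ndk-build failed with an error saying that an additional flag was required. Adding the flag after LOCAL_CXX_FLAGS did not help; the build still failed. (ndk-build reads LOCAL_CFLAGS and LOCAL_CPPFLAGS, which may explain why.)
Searching for "NDK + NEON" eventually turned up some leads, which are summarized below. The Android.mk file can be modeled on this one: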
http://download.csdn.net/download/carlonelong/4153631
The modified file is as follows:
LOCAL_PATH := $(call my-dir)
include $(CLEAR_VARS)
# Source files to compile; only some of them are listed here
LOCAL_SRC_FILES := NcHevcDecoder.cpp JNI_OnLoad.cpp TAppDecTop.cpp
# Default header search paths
LOCAL_C_INCLUDES := \
$(LOCAL_PATH) \
$(LOCAL_PATH)/..
# The flags that follow -g are what make arm_neon.h usable
# -mfloat-abi=softfp -mfpu=neon are required for arm_neon.h
LOCAL_CFLAGS := -D__cpusplus -g -mfloat-abi=softfp -mfpu=neon -march=armv7-a -mtune=cortex-a8
LOCAL_LDLIBS := -lz -llog
TARGET_ARCH_ABI := armeabi-v7a
LOCAL_ARM_MODE := arm
ifeq ($(TARGET_ARCH_ABI),armeabi-v7a)
# Enable NEON optimization
LOCAL_ARM_NEON := true
endif
LOCAL_MODULE := NcHevcDecoder
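# Build the library. Note: BUILD_STATIC_LIBRARY produces a static library;
# a shared library (.so) would use BUILD_SHARED_LIBRARY instead.
include $(BUILD_STATIC_LIBRARY)
Application.mk also needs to be modified; its contents are as follows.
Reference: http://blog.csdn.net/gg137608987/article/details/7565843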
APP_PROJECT_PATH := $(call my-dir)/..
APP_PLATFORM := android-10
APP_STL := stlport_static
APP_ABI := armeabi-v7a
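APP_CPPFLAGS += -fexceptions
The APP_ABI line specifies the target ABI to build for, so the build can be optimized per platform. Once it is set this way, the target device must of course support NEON instructions.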
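Because the device must actually support NEON, a runtime check can guard the NEON code path. The following is a minimal sketch, assuming the NDK's cpufeatures module is linked in (LOCAL_STATIC_LIBRARIES := cpufeatures plus $(call import-module,android/cpufeatures) in Android.mk). The names device_has_neon, convert_fn and pick_convert are hypothetical helpers introduced here for illustration; they are not part of the original project.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ANDROID__)
#include <cpu-features.h>   /* provided by the NDK's android/cpufeatures module */
#endif

/* Returns 1 if the running device reports NEON support, 0 otherwise.
 * Only the 32-bit ARM family is checked, matching APP_ABI := armeabi-v7a. */
int device_has_neon(void)
{
#if defined(__ANDROID__)
    return android_getCpuFamily() == ANDROID_CPU_FAMILY_ARM &&
           (android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_NEON) != 0;
#else
    return 0;   /* not an Android build: report "no NEON" so callers use plain C */
#endif
}

/* Hypothetical signature shared by a NEON routine and its plain C fallback. */
typedef void (*convert_fn)(uint8_t *dest, const uint8_t *src, int n);

/* Picks the NEON routine only when one is supplied and the device supports it. */
convert_fn pick_convert(convert_fn neon_impl, convert_fn c_impl)
{
    assert(c_impl != NULL);
    return (neon_impl != NULL && device_has_neon()) ? neon_impl : c_impl;
}
```

For such a check to be useful, the fallback and the dispatch code themselves would have to be compiled without NEON; with LOCAL_ARM_NEON := true applied to the whole module, as in the Android.mk above, the compiler may emit NEON instructions anywhere.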
An example available online uses NEON to optimize an RGB-to-grayscale conversion; see:
http://hilbert-space.de/?p=22
Its optimization process is excerpted here:
1. Original code
void reference_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
int i;
for (i=0; i<n; i++)
{
int r = *src++; // load red
int g = *src++; // load green
int b = *src++; // load blue
// build weighted average:
int y = (r*77)+(g*151)+(b*28);
// undo the scale by 256 and write to memory:
*dest++ = (y>>8);
}
}
2. Optimizing the code with NEON intrinsics
Since NEON works in 64 or 128 bit registers it’s best to process eight pixels in parallel.
That way we can exploit the parallel nature of the SIMD-unit. Here is what I came up with:
This leads to the following code:
void neon_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
int i;
uint8x8_t rfac = vdup_n_u8 (77); // conversion weight for R
uint8x8_t gfac = vdup_n_u8 (151); // conversion weight for G
uint8x8_t bfac = vdup_n_u8 (28); // conversion weight for B
n/=8;
for (i=0; i<n; i++)
{
uint16x8_t temp;
uint8x8x3_t rgb = vld3_u8 (src);
uint8x8_t result;
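temp = vmull_u8 (rgb.val[0], rfac); // vmull_u8 multiplies corresponding bytes (8 bit); each result is 2 bytes (16 bit)
temp = vmlal_u8 (temp,rgb.val[1], gfac); // multiply corresponding bytes and accumulate into temp
temp = vmlal_u8 (temp,rgb.val[2], bfac);
result = vshrn_n_u16 (temp, 8); // shift every lane right by 8 bits and narrow to 8 bit
vst1_u8 (dest, result); // store the result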
src += 8*3;
dest += 8;
}
}
vmull.u8 multiplies each byte of the first argument with each corresponding byte of the second argument. Each result becomes a 16 bit unsigned integer, so no overflow can happen. The entire result is returned as a 128 bit NEON register pair. vmlal.u8 does the same thing as vmull.u8 but also adds the content of another register to the result.
So we end up with just three instructions for weighted average of eight pixels. Nice.
Now it’s time to undo the scaling of the weight factors. To do so I shift each 16 bit result to the right by 8 bits. This equals to a division by 256. ARM NEON has lots of instructions to do the shift, but also a “narrow” variant exists. This one does two things at once: It does the shift and afterwards converts the 16 bit integers back to 8 bit by removing all the high-bytes from the result. We get back from the 128 bit register pair to a single 64 bit register.
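The lane-wise behaviour described above can be checked without ARM hardware. The following is a portable sketch that emulates the three intrinsics in plain C; the emu_* names are hypothetical helpers written for this illustration and are not NEON APIs. It also shows why the 16 bit accumulator cannot overflow: the weights sum to 77 + 151 + 28 = 256, so the largest possible sum is 255 * 256 = 65280, which is below 65535.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8

/* Emulates vmull_u8: widening multiply, 8 x (u8 * u8) -> 8 x u16. */
void emu_vmull_u8(uint16_t out[LANES], const uint8_t a[LANES], const uint8_t b[LANES])
{
    for (int i = 0; i < LANES; i++)
        out[i] = (uint16_t)(a[i] * b[i]);
}

/* Emulates vmlal_u8: widening multiply, then add to the 16 bit accumulator. */
void emu_vmlal_u8(uint16_t acc[LANES], const uint8_t a[LANES], const uint8_t b[LANES])
{
    for (int i = 0; i < LANES; i++)
        acc[i] = (uint16_t)(acc[i] + a[i] * b[i]);
}

/* Emulates vshrn_n_u16: shift each lane right, keep only the low 8 bits. */
void emu_vshrn_n_u16(uint8_t out[LANES], const uint16_t in[LANES], int shift)
{
    assert(shift >= 1 && shift <= 8);   /* range accepted by the real intrinsic */
    for (int i = 0; i < LANES; i++)
        out[i] = (uint8_t)(in[i] >> shift);
}

/* Grayscale of 8 pixels whose R, G and B bytes are already de-interleaved,
 * as vld3_u8 would deliver them in rgb.val[0..2]. */
void emu_gray8(uint8_t y[LANES], const uint8_t r[LANES],
               const uint8_t g[LANES], const uint8_t b[LANES])
{
    uint8_t rfac[LANES], gfac[LANES], bfac[LANES];
    uint16_t temp[LANES];

    for (int i = 0; i < LANES; i++) {   /* same role as vdup_n_u8 */
        rfac[i] = 77;
        gfac[i] = 151;
        bfac[i] = 28;
    }
    emu_vmull_u8(temp, r, rfac);
    emu_vmlal_u8(temp, g, gfac);
    emu_vmlal_u8(temp, b, bfac);
    emu_vshrn_n_u16(y, temp, 8);
}
```

Feeding eight white pixels (all bytes 255) through emu_gray8 gives 255 in every lane, which is the worst case for the accumulator.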
3. Comparing the results
(1) Assembly generated from the C NEON intrinsics version
/*
Results before hand-written assembly optimization:
C-version: 15.1 cycles per pixel.
NEON-version: 9.9 cycles per pixel.
That’s only a speed-up of factor 1.5. I expected much more from the NEON implementation. It processes 8 pixels with just 6 instructions after all.
What’s going on here? A look at the assembler output explained it all. Here is the inner-loop part of the convert function:
*/
160: f46a040f vld3.8 {d16-d18}, [sl]
164: e1a0c005 mov ip, r5
168: ecc80b06 vstmia r8, {d16-d18}
16c: e1a04007 mov r4, r7
170: e2866001 add r6, r6, #1 ; 0x1
174: e28aa018 add sl, sl, #24 ; 0x18
178: e8bc000f ldm ip!, {r0, r1, r2, r3}
17c: e15b0006 cmp fp, r6
180: e1a08005 mov r8, r5
184: e8a4000f stmia r4!, {r0, r1, r2, r3}
188: eddd0b06 vldr d16, [sp, #24]
18c: e89c0003 ldm ip, {r0, r1}
190: eddd2b08 vldr d18, [sp, #32]
194: f3c00ca6 vmull.u8 q8, d16, d22
198: f3c208a5 vmlal.u8 q8, d18, d21
19c: e8840003 stm r4, {r0, r1}
1a0: eddd3b0a vldr d19, [sp, #40]
1a4: f3c308a4 vmlal.u8 q8, d19, d20
1a8: f2c80830 vshrn.i16 d16, q8, #8
1ac: f449070f vst1.8 {d16}, [r9]
1b0: e2899008 add r9, r9, #8 ; 0x8
1b4: caffffe9 bgt 160
(2) Hand-written NEON assembly
Since the compiler can’t generate good code I wrote the same loop in assembler.
In a nutshell I just took the intrinsic based loop and converted the instructions one by one. The loop-control is a bit different, but that’s all.
// The compiler-generated assembly was optimized further by hand; the optimized code is as follows:
convert_asm_neon:
# r0: Ptr to destination data
# r1: Ptr to source data
# r2: Iteration count:
push {r4-r5,lr}
lsr r2, r2, #3
# build the three constants:
mov r3, #77
mov r4, #151
mov r5, #28
vdup.8 d3, r3
vdup.8 d4, r4
vdup.8 d5, r5
.loop:
# load 8 pixels:
vld3.8 {d0-d2}, [r1]!
# do the weight average:
vmull.u8 q3, d0, d3
vmlal.u8 q3, d1, d4
vmlal.u8 q3, d2, d5
# shift and store:
vshrn.u16 d6, q3, #8
vst1.8 {d6}, [r0]!
subs r2, r2, #1
bne .loop
pop { r4-r5, pc }
As the results show, the NEON optimization yields a speed-up of more than 7x (8 pixels are processed at a time); in theory it should be 8x.
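Both NEON versions above divide the pixel count by 8 (n/=8 in the intrinsics version, lsr r2, r2, #3 in the assembly), so any pixels left over when n is not a multiple of 8 are never converted. The following is a minimal sketch of one way to handle that, assuming a scalar loop is acceptable for the remainder. The function name convert_any_n and the HAVE_NEON_INTRINSICS macro are hypothetical; on a non-ARM build the whole input simply goes through the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON_INTRINSICS 1
#else
#define HAVE_NEON_INTRINSICS 0
#endif

/* Converts n interleaved RGB pixels to grayscale for any n >= 0. */
void convert_any_n(uint8_t *dest, const uint8_t *src, int n)
{
    int i = 0;

    assert(dest != NULL && src != NULL && n >= 0);

#if HAVE_NEON_INTRINSICS
    {
        const uint8x8_t rfac = vdup_n_u8(77);
        const uint8x8_t gfac = vdup_n_u8(151);
        const uint8x8_t bfac = vdup_n_u8(28);

        /* Bulk part: full groups of 8 pixels, same arithmetic as neon_convert. */
        for (; i + 8 <= n; i += 8) {
            uint8x8x3_t rgb = vld3_u8(src + 3 * i);
            uint16x8_t temp = vmull_u8(rgb.val[0], rfac);
            temp = vmlal_u8(temp, rgb.val[1], gfac);
            temp = vmlal_u8(temp, rgb.val[2], bfac);
            vst1_u8(dest + i, vshrn_n_u16(temp, 8));
        }
    }
#endif

    /* Tail: the remaining 0..7 pixels (or all pixels when NEON is unavailable),
     * using the same formula as reference_convert. */
    for (; i < n; i++) {
        int r = src[3 * i];
        int g = src[3 * i + 1];
        int b = src[3 * i + 2];
        dest[i] = (uint8_t)((r * 77 + g * 151 + b * 28) >> 8);
    }
}
```

Since the bulk loop and the tail compute the same integer expression, the output should match reference_convert byte for byte regardless of where the split falls.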