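I am developing a native Android application that is meant to run on devices with an ARMv7 processor.
For various reasons, I need to do some heavy computation on vectors (of short and/or float).
I implemented some assembly functions using NEON instructions to speed up the computation. I got a speedup factor of 1.5, which is not bad. I am wondering whether these functions can be made faster still.
So the question is: what changes can I make to improve these functions?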
// Add two float vectors.
// The result could be written to src1 instead of dst.
// count must be a positive multiple of 4.
void add_float_vector_with_neon3(float* dst, float* src1, float* src2, int count)
{
    asm volatile (
        "1:                                 \n"
        "vld1.32     {q0}, [%[src1]]!       \n"
        "vld1.32     {q1}, [%[src2]]!       \n"
        "vadd.f32    q0, q0, q1             \n"
        "subs        %[count], %[count], #4 \n"
        "vst1.32     {q0}, [%[dst]]!        \n"
        "bgt         1b                     \n"
        // The loop updates the pointers and the counter, so they are
        // read-write operands; subs changes the flags, hence "cc".
        : [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
        :
        : "memory", "cc", "q0", "q1"
    );
}

// Multiply a float vector by a scalar.
// The result could be written to src1 instead of dst.
// count must be a positive multiple of 4.
void mul_float_vector_by_scalar_with_neon3(float* dst, float* src1, float scalar, int count)
{
    asm volatile (
        "vdup.32     q1, %[scalar]          \n"
        "2:                                 \n"
        "vld1.32     {q0}, [%[src1]]!       \n"
        "vmul.f32    q0, q0, q1             \n"
        "subs        %[count], %[count], #4 \n"
        "vst1.32     {q0}, [%[dst]]!        \n"
        "bgt         2b                     \n"
        : [dst] "+r" (dst), [src1] "+r" (src1), [count] "+r" (count)
        : [scalar] "r" (scalar)
        : "memory", "cc", "q0", "q1"
    );
}

// Add two short vectors; value-range limits are not a problem here.
// The result should be written to a dst different from src1 and src2.
// count must be a positive multiple of 8.
void add_short_vector_with_neon3(short* dst, short* src1, short* src2, int count)
{
    asm volatile (
        "3:                                 \n"
        "vld1.16     {q0}, [%[src1]]!       \n"
        "vld1.16     {q1}, [%[src2]]!       \n"
        "vadd.i16    q0, q0, q1             \n"
        "subs        %[count], %[count], #8 \n"
        "vst1.16     {q0}, [%[dst]]!        \n"
        "bgt         3b                     \n"
        : [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
        :
        : "memory", "cc", "q0", "q1"
    );
}

// Multiply a short vector by a float vector and put the result back into a short vector.
// The result should be written to a dst different from src1.
// count must be a positive multiple of 4.
void mul_short_vector_by_float_vector_with_neon3(short* dst, short* src1, float* src2, int count)
{
    asm volatile (
        "4:                                 \n"
        "vld1.16     {d0}, [%[src1]]!       \n"
        "vld1.32     {q1}, [%[src2]]!       \n"
        "vmovl.s16   q0, d0                 \n"   // widen 4 x s16 to 4 x s32
        "vcvt.f32.s32 q0, q0                \n"   // s32 -> f32
        "vmul.f32    q0, q0, q1             \n"
        "vcvt.s32.f32 q0, q0                \n"   // f32 -> s32, rounds toward zero
        "vmovn.s32   d0, q0                 \n"   // narrow to s16 without saturation
        "subs        %[count], %[count], #4 \n"
        "vst1.16     {d0}, [%[dst]]!        \n"
        "bgt         4b                     \n"
        : [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
        :
        : "memory", "cc", "q0", "q1"
    );
}
Thanks in advance!
Solution:
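You can try unrolling the loop so that each iteration processes more elements.
Your add_float_vector_with_neon3 code takes 10 cycles per 4 elements (because of stalls), whereas unrolling to 16 elements takes 21 cycles. That is roughly 1.3 cycles per element instead of 2.5.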
http://pulsar.webshaker.net/ccc/sample-34e5f701
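This comes with some overhead, because you need to handle the remainder (or you can pad the data to a multiple of 16), but if you have a lot of data, the overhead should be fairly low compared with the actual sum.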
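As a rough illustration of the unrolling and remainder handling described above, here is a minimal sketch using NEON intrinsics rather than inline assembly. The function name add_float_vector_unrolled is hypothetical and not from the original post, and this is not the exact code behind the linked cycle count. On builds without NEON the vector loop is compiled out and the scalar loop does all the work, so the sketch can be checked on any machine.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the original post):
 * dst[i] = src1[i] + src2[i] for i in [0, count).
 * The main loop handles 16 floats per iteration; a scalar loop
 * handles the remaining count % 16 elements. */
void add_float_vector_unrolled(float *dst, const float *src1,
                               const float *src2, size_t count)
{
    assert(count == 0 || (dst != NULL && src1 != NULL && src2 != NULL));
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 16 <= count; i += 16) {
        /* Issue all eight loads before the adds so that no add has to
         * wait on the load immediately preceding it. */
        float32x4_t a0 = vld1q_f32(src1 + i);
        float32x4_t a1 = vld1q_f32(src1 + i + 4);
        float32x4_t a2 = vld1q_f32(src1 + i + 8);
        float32x4_t a3 = vld1q_f32(src1 + i + 12);
        float32x4_t b0 = vld1q_f32(src2 + i);
        float32x4_t b1 = vld1q_f32(src2 + i + 4);
        float32x4_t b2 = vld1q_f32(src2 + i + 8);
        float32x4_t b3 = vld1q_f32(src2 + i + 12);

        vst1q_f32(dst + i,      vaddq_f32(a0, b0));
        vst1q_f32(dst + i + 4,  vaddq_f32(a1, b1));
        vst1q_f32(dst + i + 8,  vaddq_f32(a2, b2));
        vst1q_f32(dst + i + 12, vaddq_f32(a3, b3));
    }
#endif

    /* Scalar tail: the leftover elements on NEON builds,
     * or every element when NEON is not available. */
    for (; i < count; ++i) {
        dst[i] = src1[i] + src2[i];
    }
}
```

With this shape the caller does not need to pad the data: a call such as add_float_vector_unrolled(out, a, b, 19) processes 16 elements in the vector loop on a NEON build and the last 3 in the scalar tail. Whether the compiler schedules the intrinsics as well as hand-written assembly is something to measure on the target device.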