android – Optimizing NEON assembly functions


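I am developing a native Android application that should run on devices with ARMv7 processors.
For various reasons, I need to do some heavy computation on vectors (of short and/or float).
I implemented some assembly functions using NEON instructions to speed up the computation, and I got a speedup of about 1.5x, which is not bad. I would like to know whether these functions can be made faster still.

So the question is: what changes can I make to improve these functions?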

    // Add two float vectors.
    // The result could be stored in src1 instead of dst.
    // The pointers and count are modified inside the asm, so they are
    // declared as read-write ("+r") operands, and "cc" is clobbered by subs.
    void add_float_vector_with_neon3(float* dst, float* src1, float* src2, int count)
    {
        asm volatile (
            "1:                                       \n"
            "vld1.32         {q0}, [%[src1]]!         \n"
            "vld1.32         {q1}, [%[src2]]!         \n"
            "vadd.f32        q0, q0, q1               \n"
            "subs            %[count], %[count], #4   \n"
            "vst1.32         {q0}, [%[dst]]!          \n"
            "bgt             1b                       \n"
            : [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
            :
            : "memory", "cc", "q0", "q1"
        );
    }

    // Multiply a float vector by a scalar.
    // The result could be stored in src1 instead of dst.
    void mul_float_vector_by_scalar_with_neon3(float* dst, float* src1, float scalar, int count)
    {
        asm volatile (
            "vdup.32         q1, %[scalar]            \n"
            "2:                                       \n"
            "vld1.32         {q0}, [%[src1]]!         \n"
            "vmul.f32        q0, q0, q1               \n"
            "subs            %[count], %[count], #4   \n"
            "vst1.32         {q0}, [%[dst]]!          \n"
            "bgt             2b                       \n"
            : [dst] "+r" (dst), [src1] "+r" (src1), [count] "+r" (count)
            : [scalar] "r" (scalar)
            : "memory", "cc", "q0", "q1"
        );
    }

    // Add two short vectors -> no problem of coding limits.
    // The result should be stored in a destination different from src1 and src2.
    void add_short_vector_with_neon3(short* dst, short* src1, short* src2, int count)
    {
        asm volatile (
            "3:                                       \n"
            "vld1.16         {q0}, [%[src1]]!         \n"
            "vld1.16         {q1}, [%[src2]]!         \n"
            "vadd.i16        q0, q0, q1               \n"
            "subs            %[count], %[count], #8   \n"
            "vst1.16         {q0}, [%[dst]]!          \n"
            "bgt             3b                       \n"
            : [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
            :
            : "memory", "cc", "q0", "q1"
        );
    }

    // Multiply a short vector by a float vector and put the result back into a short vector.
    // The result should be stored in a destination different from src1.
    void mul_short_vector_by_float_vector_with_neon3(short* dst, short* src1, float* src2, int count)
    {
        asm volatile (
            "4:                                       \n"
            "vld1.16         {d0}, [%[src1]]!         \n"
            "vld1.32         {q1}, [%[src2]]!         \n"
            "vmovl.s16       q0, d0                   \n"
            "vcvt.f32.s32    q0, q0                   \n"
            "vmul.f32        q0, q0, q1               \n"
            "vcvt.s32.f32    q0, q0                   \n"
            "vmovn.s32       d0, q0                   \n"
            "subs            %[count], %[count], #4   \n"
            "vst1.16         {d0}, [%[dst]]!          \n"
            "bgt             4b                       \n"
            : [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
            :
            : "memory", "cc", "d0", "q0", "q1"
        );
    }

Thanks in advance!

Solution:

You can try unrolling the loop so that each iteration processes more elements.

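Your add_float_vector_with_neon3 code takes 10 cycles per 4 elements (because of stalls), whereas unrolling it to 16 elements per iteration takes 21 cycles, i.e. roughly 2.5 versus 1.3 cycles per element.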
http://pulsar.webshaker.net/ccc/sample-34e5f701
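As a rough illustration of what unrolling to 16 elements per iteration could look like, here is a minimal sketch. It uses NEON intrinsics from `arm_neon.h` rather than the inline assembly of the question, and falls back to plain C when NEON is not available. The function name `add_float_vector_unrolled16` is made up for this example, and it assumes `count` is a multiple of 16. The cycle counts quoted above apply to the assembly version, not necessarily to what a compiler generates from this code.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical unrolled variant: adds 16 floats per loop iteration.
 * Assumes count is a multiple of 16; any tail is left to the caller. */
void add_float_vector_unrolled16(float *dst, const float *src1,
                                 const float *src2, size_t count)
{
    assert(count % 16 == 0);
    for (size_t i = 0; i < count; i += 16) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        /* Written loads-first so that the adds need not wait on the load
         * directly before them; the compiler may still reschedule this. */
        float32x4_t a0 = vld1q_f32(src1 + i);
        float32x4_t a1 = vld1q_f32(src1 + i + 4);
        float32x4_t a2 = vld1q_f32(src1 + i + 8);
        float32x4_t a3 = vld1q_f32(src1 + i + 12);
        float32x4_t b0 = vld1q_f32(src2 + i);
        float32x4_t b1 = vld1q_f32(src2 + i + 4);
        float32x4_t b2 = vld1q_f32(src2 + i + 8);
        float32x4_t b3 = vld1q_f32(src2 + i + 12);
        vst1q_f32(dst + i,      vaddq_f32(a0, b0));
        vst1q_f32(dst + i + 4,  vaddq_f32(a1, b1));
        vst1q_f32(dst + i + 8,  vaddq_f32(a2, b2));
        vst1q_f32(dst + i + 12, vaddq_f32(a3, b3));
#else
        /* Portable fallback with the same 16-element block structure. */
        for (size_t j = 0; j < 16; ++j)
            dst[i + j] = src1[i + j] + src2[i + j];
#endif
    }
}
```

The same idea carries over to the inline assembly: load four q registers from each source, do four `vadd.f32`, store four results, and subtract 16 from the counter.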

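There is some overhead, because you need to handle the remainder (or you can pad the data to a multiple of 16), but if you have a lot of data, that overhead should be fairly low compared with the actual summation.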
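The two ways of dealing with the remainder mentioned above might be sketched as follows. All names here (`add_blocks_of_16`, `add_float_vector_any_count`, `padded_count_16`) are hypothetical, and `add_blocks_of_16` is only a plain C stand-in for a kernel that accepts multiples of 16, such as an unrolled NEON loop.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a kernel that only accepts multiples of 16 (hypothetical);
 * in practice this would be the unrolled NEON loop. */
static void add_blocks_of_16(float *dst, const float *src1,
                             const float *src2, size_t count)
{
    assert(count % 16 == 0);
    for (size_t i = 0; i < count; ++i)
        dst[i] = src1[i] + src2[i];
}

/* Option 1: run the block kernel on the largest multiple of 16,
 * then finish the remaining 0..15 elements with a scalar loop. */
void add_float_vector_any_count(float *dst, const float *src1,
                                const float *src2, size_t count)
{
    size_t main_count = count - (count % 16);
    add_blocks_of_16(dst, src1, src2, main_count);
    for (size_t i = main_count; i < count; ++i)
        dst[i] = src1[i] + src2[i];
}

/* Option 2: round the element count up to a multiple of 16, so buffers
 * can be allocated with padding and passed to the block kernel directly. */
size_t padded_count_16(size_t count)
{
    return (count + 15) / 16 * 16;
}
```

With option 1 the scalar tail costs at most 15 element additions per call, which is negligible for large vectors. Option 2 avoids the tail entirely, at the price of allocating up to 15 extra elements per buffer.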


Original URL: https://www.outofmemory.cn/web/1117235.html
