Hi,
just some news.
So, I rewrote some functions from Neon intrinsics to Neon assembler in the hope of a performance improvement. Just after finishing the translation from intrinsics to assembler, I was very enthusiastic, because the assembler version was running 4 times faster than the intrinsics version, but in fact this result was obtained with the debug builds. With the release builds, the results were almost identical; if anything, the intrinsics version runs a little faster than the assembler one, but, honestly, there is no significant gap. The only noticeable difference is the code size, which is 3 times larger with the intrinsics version.
So my conclusion, in my case, is that the compiler generates very fast code from intrinsics.
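For anyone who wants to try the same comparison, here is a minimal sketch of the kind of intrinsics code I mean. It is not one of my actual functions: the name add_saturate_u8 and the operation (saturating add of two 8-bit image buffers) are just assumptions for illustration. Without optimization (e.g. gcc -O0) the compiler typically stores every intermediate vector to the stack and reloads it, which would explain why a debug build of intrinsics code looks so slow next to hand-written assembler; with -O2 the loop body should reduce to a few loads, one saturating add and a store.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical example kernel (not from my project):
 * dst[i] = min(255, a[i] + b[i]) for n 8-bit pixels. */
void add_saturate_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    /* Neon path: process 16 pixels per iteration. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);      /* load 16 bytes from a */
        uint8x16_t vb = vld1q_u8(b + i);      /* load 16 bytes from b */
        vst1q_u8(dst + i, vqaddq_u8(va, vb)); /* saturating add, store */
    }
#endif
    /* Scalar tail; also the complete fallback when Neon is not available. */
    for (; i < n; ++i) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

Building the same file once with -O0 and once with -O2 and timing both is enough to see the debug/release gap. On 32-bit ARM, gcc also needs -mfpu=neon for the Neon path to be compiled in.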
Another remark: this algorithm (image processing), initially written on a PC/Windows, runs in less than 1 msec on 1 core of an i7-4790 processor at 3.6GHz, and it takes ~4 msec on 1 core of my Rock64 (the ARM/Neon version is a little more optimized than the Intel version, which also uses intrinsics). At the beginning, I thought the difference would be larger, but, thanks to Neon, the final result is a little better than expected, and a Rock64 is much cheaper than an Intel solution and consumes much less power.
I think that something which could increase the performance of the Rock64 would be faster memory; the memory on the Rock64 works in 32 bits, although the processor supports 64-bit accesses.
regards