Hi, from what I understand reading these posts, something could be hardware accelerated on the rpi and not on the UDOO, maybe related to kernel support for the Intel HD Graphics chipset, or to one of the options your FreeBSD kernel was compiled with? Could your program be run from an Ubuntu live USB system for testing?
Without being able to run profiling tests myself, I think something went wrong at the compile step. In some post you mentioned you are using a very recent GCC version. Which one is it? Did you build on your rpi3 natively, or cross-compile the ARMv7 binary on another machine? Same goes for the Udoo binary. Was -O3 the only optimization flag, without any other "cool ultra optimization loop-unrolling" feature? Also, -march=core2 or even -march=native can change the result. Sometimes -O3 actually hurts performance, so try reducing to -O2.
What about single-thread performance? Is that result comparable, or are you running into lock contention that hurts performance heavily? If you have the same issue in a single-threaded run, you can use valgrind/callgrind to do a profiling run. For this, compile a release build with debug flags enabled (-ggdb3) and check the runtime again. It should be comparable to the build without debug information; only the binary size will have increased. Then give valgrind a shot:
Code: valgrind --tool=callgrind program [program_options]
(NOTE: Valgrind serializes threads, so a multi-threaded run is not profiled realistically; run a single-threaded test.) This will take approximately 10 times the normal runtime, possibly more. In the end there will be a profiling file which can be examined with kcachegrind. More information on that here: https://baptiste-wicht.com/posts/2011/09/profile-c-application-with-callgrind-kcachegrind.html
Despite the fact that you are not a developer, you can at least try to compare the profiling results of both machines. For comparison, just sort by instruction count in kcachegrind. If there are major differences, you know that something is wrong with your binary. You can also send me some information about the result and I can tell you whether this is the reason for your bottleneck. BTW: Is this code publicly available somewhere? Then I could also take a brief look.
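To make the steps above concrete, here is a minimal sketch of how the two builds and the callgrind runs could be scripted. The names prog.c, prog_O2, prog_O3 and the machine tag "udoo" are placeholders I made up, not anything from the actual project. By default the script only prints the commands; set DRY_RUN=0 to execute them, and append your program options after the binary name.

```shell
#!/bin/sh
# Hypothetical names: prog.c (source), prog_O2 / prog_O3 (binaries), "udoo" (machine tag).
SRC=${SRC:-prog.c}
HOST_TAG=${HOST_TAG:-udoo}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = execute them

# Echo a command, and run it unless we are in dry-run mode.
run() {
    echo "+ $*"
    [ "$DRY_RUN" = 1 ] || "$@"
}

profile_steps() {
    for opt in -O2 -O3; do
        bin="./prog_${opt#-}"
        # Release build with debug symbols so callgrind can name functions
        run gcc "$opt" -ggdb3 -o "$bin" "$SRC"
        # One profile file per build and machine, for side-by-side comparison
        run valgrind --tool=callgrind \
            --callgrind-out-file="callgrind.out.$HOST_TAG.${opt#-}" "$bin"
    done
    # Text summary; kcachegrind opens the same files graphically
    run callgrind_annotate "callgrind.out.$HOST_TAG.O3"
}
profile_steps
```

Run it once on each board with a different HOST_TAG (for example HOST_TAG=rpi3), force the program into single-threaded mode, and then compare the output files from both machines.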
On the other hand, you should know that neither the Udoo x86 nor the rpi has a CPU built for number crunching. In any circumstances, your runtime performance will be better by orders of magnitude with a Core i7 or a recent AMD Ryzen CPU. EDIT: One more obvious issue comes to mind. How about power and CPU frequency management? Is this correctly configured on your Udoo x86/BSD combination? If the clock is clamped to 500 MHz, nobody has to wonder about that effect....
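A quick way to check the clock question is to ask the OS what frequency the first core is running at. The sketch below assumes the standard cpufreq sysctls on FreeBSD and the cpufreq sysfs files on Linux; either may be missing if the driver is not loaded, in which case it just says so.

```shell
#!/bin/sh
# Print the current CPU clock and the available levels, if the OS exposes them.
cpu_freq_report() {
    case "$(uname -s)" in
        FreeBSD)
            # dev.cpu.0.freq is in MHz; freq_levels lists MHz/mW pairs
            sysctl dev.cpu.0.freq dev.cpu.0.freq_levels 2>/dev/null \
                || echo "cpufreq sysctls not available"
            ;;
        Linux)
            f=/sys/devices/system/cpu/cpu0/cpufreq
            if [ -r "$f/scaling_cur_freq" ]; then
                # sysfs values are in kHz
                echo "current:  $(cat "$f/scaling_cur_freq") kHz"
                echo "max:      $(cat "$f/scaling_max_freq") kHz"
                echo "governor: $(cat "$f/scaling_governor")"
            else
                echo "cpufreq sysfs interface not available"
            fi
            ;;
        *)
            echo "unsupported OS: $(uname -s)"
            ;;
    esac
}
cpu_freq_report
```

Run it while the program is under full load. If the reported value stays at the lowest level, frequency scaling is the first thing to look at.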
Sorry, I'll just stop you there to let you know I am the developer. I wrote the program and have optimized it. I have tried several different compilation options on both SBCs, and -O3 provides the best performance (same as on my desktop PC). Running with fewer cores gives the UDOO a slight speed advantage over the Rpi3, but since that is much slower than running on all 4 cores, it's not an option. I cannot share the code yet because it is not yet published.
Okay, then try it the other way around. We know that the clock of the Udoo x86 Basic is lower than that of the ARM CPU, but the x86 should be able to perform more operations per CPU cycle. That only works out when you enable the CPU's features in the compiler. Try optimizing with -march=core2 (afair) for the Udoo x86.
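If you are unsure which -march value fits the board, GCC can report what -march=native resolves to on the machine it runs on. This is only a sketch and assumes the compiler is invoked as gcc (on FreeBSD the port may install it under a versioned name instead).

```shell
#!/bin/sh
# Show which -march/-mtune values GCC picks for the machine it runs on.
show_native_arch() {
    if command -v gcc >/dev/null 2>&1; then
        # -Q --help=target lists target options with their current values
        gcc -march=native -Q --help=target 2>/dev/null \
            | grep -E -e '-march=|-mtune=' \
            || echo "gcc did not report -march/-mtune for this target"
    else
        echo "gcc not found in PATH"
    fi
}
show_native_arch
```

Run it on the Udoo itself, then rebuild with -O3 -march=native there and compare the runtime. Keep in mind that a binary built with -march=native may not run on a different CPU.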