Disappointed with unexpectedly low clock speed

waltervl · Jun 17, 2017

Sysbench can also run POSIX tests, so perhaps that could give you an indication of the performance.

tuxun · Jun 18, 2017

Hi,

From what I understand from reading these posts, something could be hardware accelerated on the rpi and not on the UDOO. Maybe it is related to the kernel support for the Intel HD Graphics chipset, or to one of the kernel options your FreeBSD kernel was compiled with?

Could your program be run from an Ubuntu USB live system for testing?

srohmen · Jun 18, 2017

Ethin E. said: ↑

Here's my rpi3 results from a standard run of one of my programs:

real 2m40.261s
user 10m2.740s
sys 0m0.330s

And here's the results from the same program, with the same arguments, compiled the same way on the UDOO:

real 3m11.864s
user 11m58.485s
sys 0m0.328s

(On both platforms, I compiled my program with -O3 -pthread on GCC)

Without being able to run performance profiling tests myself, I think something went wrong at the compile step. In some post you mentioned you are using a very recent GCC version. Which one is it? Did you build on your rpi3 natively, or did you cross-compile the ARMv7 binary on another machine? The same goes for the Udoo binary. Was -O3 the only optimization flag, without any other "cool ultra optimization loop-unrolling" feature? Also, -march=core2 or even -march=native can have different effects. Sometimes -O3 also hurts performance. Try reducing it to -O2.
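
The quoted timings already give a rough hint about threading. Dividing user time by real time estimates how many cores were kept busy on average. The sketch below does that arithmetic on the numbers from the quote; the helper names to_seconds and parallelism are made up for illustration.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, the timings are from the quoted post.

# Convert a time(1)-style value such as 2m40.261s to seconds.
to_seconds() {
    echo "$1" | awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

# user / real: roughly how many cores were busy on average.
parallelism() {
    awk -v u="$(to_seconds "$1")" -v r="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", u / r }'
}

parallelism 10m2.740s 2m40.261s     # rpi3, prints 3.76
parallelism 11m58.485s 3m11.864s    # UDOO, prints 3.74
```

Both boards come out at about 3.7 of 4 cores, so on these numbers alone the threads seem to scale about equally well on both. That is only an estimate and does not replace a real profiling run.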

What about single-thread performance? Is the result comparable, or are you running into lock contention that hurts performance heavily? If you have the same issue in a single-threaded run, you can use valgrind/callgrind to do a profiling run. For this, compile a release build with debug information enabled (-ggdb3) and check the runtime again. It should be comparable to the build without debug information; only the binary size will have increased. Then give valgrind a shot:
Code:
valgrind --tool=callgrind program [program_options]
(NOTE: Valgrind can only handle single-threaded programs properly, so run a single-threaded test.) This will take approximately 10 times as long as a run without Valgrind attached. In the end there will be a profiling file which can be examined with kcachegrind. More information on that here:
https://baptiste-wicht.com/posts/2011/09/profile-c-application-with-callgrind-kcachegrind.html
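
The whole profiling round trip could look roughly like the sketch below. The names myprog and myprog.c are placeholders, since the real program is not public, and the fixed output file name is just a choice to make the later steps easier.

```shell
#!/bin/sh
# Sketch only: "myprog" and "myprog.c" are placeholder names, not the real program.

# Build an optimized binary that still carries debug symbols.
build_profiled() {
    gcc -O3 -ggdb3 -pthread -o myprog myprog.c
}

# Run under callgrind and write the profile to a fixed file name.
# Pass the usual program options as arguments (single-threaded run).
profile_run() {
    valgrind --tool=callgrind --callgrind-out-file=callgrind.out.myprog ./myprog "$@"
}

# Print a flat per-function summary in the terminal, sorted by instruction count.
summarize_profile() {
    callgrind_annotate callgrind.out.myprog
}

# Usage:
#   build_profiled && profile_run [program_options] && summarize_profile
#   kcachegrind callgrind.out.myprog    # graphical view of the same file
```

Doing the same on both boards gives two summaries that can be compared side by side, even without a desktop for kcachegrind.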

Despite the fact that you are not a developer, you can at least try to compare the profiling results of both machines. For comparison, just sort by instruction count in kcachegrind. If there are major differences, you know that there is something wrong with your binary. You can also send me some information about the result and I can tell you whether this is the reason for your bottleneck.
BTW: Is this code available somewhere for public access? Then I could also take a brief look.

On the other hand, you should know that neither the Udoo x86 nor the rpi has a CPU that is a number-crunching chip. In any circumstances, your runtime performance will be better by orders of magnitude with a Core i7 or a recent AMD Ryzen CPU.

EDIT: One more obvious issue comes to mind. How about power and CPU frequency management? Is this configured correctly on your Udoo x86/BSD combination? If the clock is clamped to 500 MHz, nobody should be surprised by that effect.
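
A quick way to check the clock on FreeBSD might look like the sketch below. It assumes the cpufreq driver is attached to cpu0, so that the dev.cpu.0.freq and dev.cpu.0.freq_levels sysctls exist. The looks_clamped helper and its 50% threshold are made up for illustration.

```shell
#!/bin/sh
# Sketch only, assuming FreeBSD with the cpufreq driver attached to cpu0.

# Current clock in MHz.
current_mhz() { sysctl -n dev.cpu.0.freq; }

# Highest available level in MHz. freq_levels is a list of "MHz/mW" pairs,
# highest first, so the first field before "/" is the maximum.
max_mhz() { sysctl -n dev.cpu.0.freq_levels | cut -d/ -f1; }

# Hypothetical helper: true if the current clock is below half of the maximum.
# The 50% threshold is an arbitrary choice for illustration.
looks_clamped() { [ "$1" -lt $(( $2 / 2 )) ]; }

# Usage on the target machine, ideally while the benchmark is running:
#   looks_clamped "$(current_mhz)" "$(max_mhz)" && echo "clock looks clamped"
#   service powerd status    # is the power daemon running at all?
```

On a Linux live system the comparable value is in /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq, reported in kHz.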

Ethin E. · Jun 18, 2017

srohmen said: ↑

Despite the fact that you are not a developer, you can at least try to compare the profiling results of both machines. For comparison, just sort by instruction count in kcachegrind. If there are major differences, you know that there is something wrong with your binary. You can also send me some information about the result and I can tell you whether this is the reason for your bottleneck.
BTW: Is this code available somewhere for public access? Then I could also take a brief look.

Sorry, I'll just stop you there to let you know I am the developer. I wrote the program and have optimized it. I have tried several different compilation options on both SBCs and the -O3 provides the best performance (same as on my desktop PC). Running with fewer cores provides a slight speed advantage on the UDOO over the Rpi3, but since it is much slower than running all 4 cores, that's not an option. I cannot share the code yet because it is not yet published.

srohmen · Jun 18, 2017

Ethin E. said: ↑

Sorry, I'll just stop you there to let you know I am the developer. I wrote the program and have optimized it. I have tried several different compilation options on both SBCs and the -O3 provides the best performance (same as on my desktop PC). Running with fewer cores provides a slight speed advantage on the UDOO over the Rpi3, but since it is much slower than running all 4 cores, that's not an option. I cannot share the code yet because it is not yet published.

Okay, then try it the other way around. We know that the clock of the Udoo x86 Basic is lower than that of the ARM CPU. But the x86 should be able to perform more operations per CPU cycle. This will only work out when you enable the CPU features in the compiler. Try optimizing with -march=core2 (AFAIR) for the Udoo x86.
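
To compare the effect of the -march settings, one option is to build one binary per flag set and time each of them with the same arguments. This is only a sketch: myprog and myprog.c are placeholder names, variant_name is a made-up helper, and the list of flag sets is just a starting point.

```shell
#!/bin/sh
# Sketch only: "myprog" and "myprog.c" are placeholder names.

# Show which -march value "native" resolves to on this machine.
show_native_march() {
    gcc -march=native -Q --help=target | grep -- '-march='
}

# Turn a flag string into a file name,
# e.g. "-O3 -march=native" -> myprog_O3__march_native
variant_name() {
    printf 'myprog%s\n' "$(printf '%s' "$1" | tr -c 'A-Za-z0-9' '_')"
}

# Build one binary per flag set so the runs can be timed against each other.
build_variants() {
    for flags in "-O3" "-O3 -march=core2" "-O3 -march=native" "-O2 -march=native"; do
        # $flags is left unquoted on purpose so it splits into separate options.
        gcc $flags -pthread -o "$(variant_name "$flags")" myprog.c
    done
}

# Usage:
#   show_native_march
#   build_variants
#   time ./myprog_O3__march_native [program_options]
```

Keep in mind that a binary built with -march=native is only meant to run on the machine it was built on.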
