Pine 64 benchmarks
#31
Hi guys! If any of you are interested in benchmarking the real-time kernel, that would really help. The performance test can be found here: https://rt.wiki.kernel.org/index.php/Cyclictest.
Samples for comparison here: http://docs.emlid.com/navio/Downloads/Re...inux-RPi2/.
More information here: http://forum.pine64.org/showthread.php?tid=394
#32
Let's discuss cooling state table settings on GitHub instead ;)

The most important things to understand when it comes to benchmarking are:

1) hardware settings (if you're interested in performance you have to take into account that you need to improve heat dissipation, otherwise throttling will occur -- ignoring these basics just produces numbers without meaning, as is done on Phoronix/openbenchmarking.org all the time)

2) software settings (if you're interested in performance you want to use optimised software that makes use of the hardware)

The Phoronix Test Suite ignores both and you end up with results like this: http://openbenchmarking.org/result/16030...603082GA36

If you compare the last 5 results you get an idea of what's wrong with benchmarking the Phoronix style. For example the 'Smallpt v1.0' benchmark: the Phoronix founder lists the Pine64+ with a result of 1500 (using no heatsink, longsleep's WiP dvfs and cooling state table settings from a few days ago and especially NO optimised software). When I ran the last test, I used '-O3' (optimises the code to use NEON and so on) and a heatsink/fan. My result is 215 (lower is better here), and that's ~7 times faster than what Phoronix will publish as the Pine64's speed in this area.

That's important to understand. Michael Larabel's published result for the Pine64 is 7 times worse in a specific benchmark since he refrains from doing it right. The most important difference is -O2 vs. -O3. The whole http://openbenchmarking.org site is just a huge collection of meaningless numbers, since the differences listed there are mostly influenced by compiler versions/settings and thermal conditions, but he still encourages his users to misinterpret them as 'hardware benchmarks'.
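For anyone who wants to redo the arithmetic behind the "~7 times" figure, here is a minimal Python sketch. It assumes (as the comparison above implies) that a lower Smallpt result is better; the function name is made up for illustration.

```python
def speedup_lower_is_better(baseline: float, candidate: float) -> float:
    """Return how many times faster `candidate` is when a lower result is better."""
    if baseline <= 0 or candidate <= 0:
        raise ValueError("results must be positive")
    return baseline / candidate


# Numbers quoted in the post above.
phoronix_result = 1500.0  # no heatsink, old throttling settings, unoptimised build
tuned_result = 215.0      # -O3, heatsink and fan

factor = speedup_lower_is_better(phoronix_result, tuned_result)
print(f"tuned run is {factor:.1f}x faster")
```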
#33
(03-09-2016, 03:53 AM)Andrew2 Wrote: The whole http://openbenchmarking.org site is just a huge collection of meaningless numbers since the differences listed there are mostly influenced by compiler versions/settings and thermal conditions but he still encourages his users to misinterpret them as 'hardware benchmarks'.
Well, all of these flaws have been known and reported ages ago: http://www.phoronix.com/forums/forum/sof...post320735 :)
But as long as the Phoronix readers are happy with these tests, nothing is going to improve. There is simply no incentive to do a better job.
#34
(03-09-2016, 11:30 AM)ssvb Wrote:
(03-09-2016, 03:53 AM)Andrew2 Wrote: The whole http://openbenchmarking.org site is just a huge collection of meaningless numbers since the differences listed there are mostly influenced by compiler versions/settings and thermal conditions but he still encourages his users to misinterpret them as 'hardware benchmarks'.
Well, all of these flaws have been known and reported ages ago: http://www.phoronix.com/forums/forum/sof...post320735 :)
But as long as the Phoronix readers are happy with these tests, nothing is going to improve. There is simply no incentive to do a better job.

I tried to create a little puzzle game with A83T: http://openbenchmarking.org/result/16030...M3+Armbian

Why does my BPi M3 get lower 'John the Ripper' scores than his BPi M3 while being way faster in all other tests? Because I disabled code optimisations for JTR and enabled reasonable settings for the other tests. The results published there are just a huge collection of garbage. Unfortunately you're right and nothing will change, since people love to compare graphs made from numbers without meaning :)

I'm already curious how expensive the Pine64+ will get when it's time to 'toss the dice for the price' for his Performance / Cost nonsense...

BTW: It's not only that the whole test methodology and most of the tests are broken. Not taking thermal settings/behaviour into account in 2016 is simply doing it wrong when the meaningless numbers that Michael and the Phoronix users collect are used to compare different boards.
#35
I thought I'd check what the situation with common enclosures will be like. Usually enclosure makers don't think for a second about thermal issues, and then you end up with serious performance degradation. While the Phoronix Test Suite can't show the 'raw performance' of different systems due to many design flaws (most importantly, it ignores thermal/throttling issues when misused with modern SBCs), at least the test suite can be used to demonstrate thermal behaviour.

Please have a look at the last two entries here: http://openbenchmarking.org/result/16031...603083GA70

This is exactly the same hardware and the same OS image, using the same kernel, U-Boot and so on.

The differences are compiler and throttling settings (software), and the last run was done with the Pine64+ without heatsink, jailed in a small cardboard box to emulate an enclosure. Compared with the run before (Pine64+ with less aggressive throttling settings, wearing a heatsink and a small 5V fan) it's obvious what's happening:

[Image: RPi-Monitor.png]

[Image: Pine64_in_Enclosure.png]

What can't be seen in the graphs is the count of active CPU cores. I disabled that graph since otherwise the graphs would've looked too weird. But when running the multithreaded tests, the BSP kernel's throttling strategy quite often killed cpu3 and cpu2, and if longsleep hadn't already included a 'core keeper' service in his Ubuntu image I would've ended up with a dual core system pretty soon after starting the tests. Longsleep's service checks the cooler state and brings back killed cores when it's below a certain threshold:

Code:
root@pine64plus:/etc/rpimonitor# /usr/local/bin/armbianmonitor -m
Stop monitoring using [ctrl]-[c]
Time        CPU    load %cpu %sys %usr %nice %io %irq   CPU
09:54:30: 1008MHz  4.76  91%   1%  88%   0%   1%   0%   90°C 4 cores active
09:54:35:  816MHz  4.70  91%   1%  88%   0%   1%   0%   90°C 4 cores active
09:54:40: 1008MHz  4.73  91%   1%  88%   0%   1%   0%   95°C 2 cores active
09:54:46: 1008MHz  4.75  91%   1%  88%   0%   1%   0%   93°C 4 cores active
09:54:51: 1008MHz  4.69  91%   1%  88%   0%   1%   0%   88°C 4 cores active
09:54:56:  600MHz  4.63  83%   1%  80%   0%   1%   0%   95°C 2 cores active
09:55:01: 1008MHz  4.82  91%   1%  88%   0%   1%   0%   92°C 4 cores active
09:55:07:  816MHz  4.76  91%   1%  88%   0%   1%   0%   88°C 4 cores active
09:55:12:  600MHz  4.78  91%   1%  88%   0%   1%   0%   93°C 4 cores active
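If you want numbers instead of eyeballing the monitor output, a minimal Python sketch like the following can summarise it. The regular expression is only an assumption based on the line format shown above (time, clockspeed, temperature, active cores); other versions of the monitoring script may print something different.

```python
import re
from collections import Counter

# Sample copied from the monitoring output above.
LOG = """\
09:54:30: 1008MHz  4.76  91%   1%  88%   0%   1%   0%   90°C 4 cores active
09:54:35:  816MHz  4.70  91%   1%  88%   0%   1%   0%   90°C 4 cores active
09:54:40: 1008MHz  4.73  91%   1%  88%   0%   1%   0%   95°C 2 cores active
09:54:46: 1008MHz  4.75  91%   1%  88%   0%   1%   0%   93°C 4 cores active
09:54:51: 1008MHz  4.69  91%   1%  88%   0%   1%   0%   88°C 4 cores active
09:54:56:  600MHz  4.63  83%   1%  80%   0%   1%   0%   95°C 2 cores active
09:55:01: 1008MHz  4.82  91%   1%  88%   0%   1%   0%   92°C 4 cores active
09:55:07:  816MHz  4.76  91%   1%  88%   0%   1%   0%   88°C 4 cores active
09:55:12:  600MHz  4.78  91%   1%  88%   0%   1%   0%   93°C 4 cores active
"""

# Assumed line layout: "HH:MM:SS: <n>MHz ... <n>°C <n> cores active"
LINE_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):\s+(\d+)MHz\s+.*?(\d+)°C\s+(\d+) cores active$"
)


def parse_samples(log: str):
    """Return a list of (time, mhz, temp_c, active_cores) tuples."""
    samples = []
    for line in log.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stamp, mhz, temp, cores = match.groups()
            samples.append((stamp, int(mhz), int(temp), int(cores)))
    return samples


samples = parse_samples(LOG)
freq_counts = Counter(mhz for _, mhz, _, _ in samples)
reduced_core_samples = sum(1 for s in samples if s[3] < 4)
max_temp = max(temp for _, _, temp, _ in samples)

print("samples per clockspeed:", dict(freq_counts))
print("samples with killed cores:", reduced_core_samples, "of", len(samples))
print("peak temperature:", max_temp, "°C")
```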
#36
Next round of tests, this time with a cheap heatsink (low price, low profile, low performance -- do a web search for 'cubie 20x20 heatsink'): http://openbenchmarking.org/result/16031...603107GA53

Comparing the last 2 results shows that the heatsink alone is responsible for ~20% or even more performance with multithreaded workloads in the same situation (no airflow possible, Pine64+ in a small enclosure). More importantly, no CPU cores were killed and the CPU clockspeed was throttled to 816MHz instead of 600MHz as before:

[Image: Pine64_in_Enclosure_with_Heatsink.png]

The next step is to activate missing clockspeeds. Currently longsleep's settings contain only the following:

Code:
1152000
1104000
1008000
816000
648000

And now I'll try it with a few more:
Code:
1152000
1104000
1056000
1008000
960000
912000
816000
648000
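Why more steps should help can be shown with a deliberately simplified Python sketch. It assumes the thermal situation allows some sustainable clockspeed and that throttling settles on the highest available step not exceeding it. The real throttle driver works with cooling states and trip points, so this is only an illustration, and the 1000000 kHz budget is a made-up value.

```python
# cpufreq steps in kHz, copied from the two lists above.
OLD_STEPS_KHZ = [1152000, 1104000, 1008000, 816000, 648000]
NEW_STEPS_KHZ = [1152000, 1104000, 1056000, 1008000,
                 960000, 912000, 816000, 648000]


def highest_step_within(steps_khz, sustainable_khz):
    """Pick the highest step that does not exceed the sustainable clockspeed.

    Falls back to the lowest step when even that one is above the budget.
    """
    candidates = [freq for freq in steps_khz if freq <= sustainable_khz]
    return max(candidates) if candidates else min(steps_khz)


# Hypothetical thermal budget: the SoC could sustain about 1000 MHz.
sustainable_khz = 1000000

old_choice = highest_step_within(OLD_STEPS_KHZ, sustainable_khz)
new_choice = highest_step_within(NEW_STEPS_KHZ, sustainable_khz)

print(f"old table settles at {old_choice // 1000} MHz")
print(f"new table settles at {new_choice // 1000} MHz")
```

With the old table the gap between 1008 and 816 MHz is large, so a small thermal shortfall costs a big step down. The finer table lets the clockspeed stay closer to what the cooling can actually sustain.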
#37
Being able to use 912, 960 and 1056 MHz does help with benchmarks (and 'real world' performance too), but how much depends on the workload. Multithreaded workloads benefit more than single-threaded ones: http://openbenchmarking.org/result/16031...603105GA62

[Image: Pine64_in_Enclosure_with_Heatsink_more_c..._steps.png]

There is still a problem with the defined trip points: when running a single threaded workload, throttling starts way too early (reducing clockspeed to 1104 or even 1056 MHz when temperatures are still ok). The next try will be to increase the first thermal threshold from 65°C to 75°C (and to switch back to optimised compiler settings, since the most moronic PTS test ever -- Smallpt -- takes ages) and see what happens.
#38
We tweaked the settings a bit yesterday: we defined a few more dvfs operating points (see [1] below) and cpufreq steps, and also increased the thermal trip points a bit. The 1st was set to 65°C before and is now at 80°C, which helps a lot with single threaded workloads.

New results here: http://openbenchmarking.org/result/16031...603109GA38 (these cannot be compared with the previous results, except for the "Pine64+ ARMv8 -O3" entry, which shares the same compiler switches).

I'm still not satisfied with these settings, since slight throttling occurs as soon as single threaded workloads push the SoC temperature above 80°C. But since I'm already trying to simulate worst case conditions (Pine64+ inside a small cardboard box, wearing only a cheap heatsink) and this happens only sometimes, the settings are more or less ok:

[Image: Pine64_in_Enclosure_with_Heatsink_higher...points.png]

Two more tests will follow with identical settings. They will only differ with multithreaded workloads, since that is where throttling jumps in. I removed the cardboard box and will test with the Pine64+ lying flat on the table at 23°C ambient temperature: one test with only a heatsink, and the other with another Pine64+ without heatsink but with a small fan trying to blow some air over the SoC's surface (to answer the most important question regarding benchmarks for every more recent SoC: is a heatsink or a fan more efficient?)

[Image: Pine64_Plus_Heatsink_vs_Fan.jpg]

[1] Dynamic voltage frequency scaling settings used now:

Code:
       max_freq = <1152000000>;
       min_freq = <480000000>;
       lv_count = <8>;
       lv1_freq = <1152000000>;
       lv1_volt = <1300>;
       lv2_freq = <1104000000>;
       lv2_volt = <1260>;
       lv3_freq = <1056000000>;
       lv3_volt = <1240>;
       lv4_freq = <1008000000>;
       lv4_volt = <1200>;
       lv5_freq = <960000000>;
       lv5_volt = <1160>;
       lv6_freq = <912000000>;
       lv6_volt = <1120>;
       lv7_freq = <816000000>;
       lv7_volt = <1080>;
       lv8_freq = <648000000>;
       lv8_volt = <1040>;
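To read the table above more easily, here is a minimal Python sketch that parses the fragment into frequency/voltage pairs and checks that both fall from level to level. It assumes the frequencies are in Hz and the voltages in millivolts, which the values suggest but the fragment itself doesn't state.

```python
import re

# Fragment copied from the settings above.
DVFS_FRAGMENT = """
max_freq = <1152000000>;
min_freq = <480000000>;
lv_count = <8>;
lv1_freq = <1152000000>;
lv1_volt = <1300>;
lv2_freq = <1104000000>;
lv2_volt = <1260>;
lv3_freq = <1056000000>;
lv3_volt = <1240>;
lv4_freq = <1008000000>;
lv4_volt = <1200>;
lv5_freq = <960000000>;
lv5_volt = <1160>;
lv6_freq = <912000000>;
lv6_volt = <1120>;
lv7_freq = <816000000>;
lv7_volt = <1080>;
lv8_freq = <648000000>;
lv8_volt = <1040>;
"""


def parse_dvfs(fragment: str):
    """Return [(freq_mhz, volt_mv), ...] ordered from lv1 to lv<lv_count>."""
    props = {name: int(value)
             for name, value in re.findall(r"(\w+)\s*=\s*<(\d+)>;", fragment)}
    levels = props["lv_count"]
    return [(props[f"lv{i}_freq"] // 1_000_000, props[f"lv{i}_volt"])
            for i in range(1, levels + 1)]


table = parse_dvfs(DVFS_FRAGMENT)

# Each lower level should have both a lower clockspeed and a lower voltage.
is_monotonic = all(hi[0] > lo[0] and hi[1] > lo[1]
                   for hi, lo in zip(table, table[1:]))

for mhz, mv in table:
    print(f"{mhz:5d} MHz  {mv} mV")
print("strictly decreasing:", is_monotonic)
```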
#39
I guess the heatsink gives the better cooling in open surroundings. Inside a case, a fan would presumably be needed in addition to bring in cool air.
#40
Last two test runs: http://openbenchmarking.org/result/16031...603116GA70

Please be aware that the results cannot all be compared directly. "Pine64+ take 2" and "Pine64+ take4" are from Michael Larabel and completely irrelevant, since he used old thermal/throttling settings and we know nothing about throttling behaviour in his setup.

You can use the last two results as a relative comparison of how well heatsink and fan do at limiting throttling, and compare them with the "Pine64+ ARMv8 -O3" results (same code optimisation level, but the "Pine64+ ARMv8 -O3" run used 1344MHz as scaling_max_freq together with a small heatsink and fan, showing that you can mostly prevent throttling even at higher clockspeeds).

The results labeled "Pine64+ in enclosure", "Pine64+ enclosure+heatsink" and "Pine64+ encl/heatsink/cpufreq" can also be compared directly (same/no code optimisations). They show clearly that mounting a heatsink helps with performance when the Pine64+ is jailed in a small enclosure, and that the little software tweak to allow a few more cpufreq steps improved performance a lot too, just by establishing better throttling behaviour that helps the A64 stay at higher clockspeeds more often.

In the meantime, when trying out the last test in the results above, my Pine64+ always powered off for no apparent reason. I thought jumping between different cpufreq operating points might overburden PSU/DC-IN and therefore chose to power the board through the Euler connector.

But to no avail. Looking at the graphs, I noticed that the board always died after heavy switching between frequencies/voltages, so I thought adding a heatsink to the AXP803 PMIC might help (I know from the A20's companion AXP209 that it can get quite hot and implements overtemperature protection the hard way -- maybe it's the same with the AXP803). At least after adding the heatsink I could run the last test without problems (and the very same board had already survived the tests running at 1344 MHz, but with less throttling).

Pine64+ just with a fan (directly on top):

[Image: Pine64_Plus_only_Fan.jpg]

[Image: Pine64_Plus_only_Fan.png]

And now only a heatsink (the heatsinks on DRAM and PMIC aren't performance relevant, the latter maybe for stability -- to be confirmed):

[Image: Pine64_Plus_only_Heatsink.jpg]

[Image: Pine64_Plus_only_Heatsink.png]

What can we learn from that?

1) Thermal/throttling settings are decisive for performance (true for every modern SoC -- ignored by many/most benchmarks, especially the more popular ones). Single threaded workloads most often won't show throttling effects, which has to be considered.

2) Adding a few more dvfs (dynamic voltage frequency scaling) operating points allows the throttle driver to adjust clockspeeds in finer steps, which helps improve performance a lot

3) To push the envelope you would have to improve heat dissipation and take thermal conditions into account (benchmarking in the morning when ambient temperature is a few degrees lower might result in 10% better scores -- keep that in mind)

4) When it comes to choosing between fan and heatsink, the choice is obvious: heatsink wins. A fan only helps when combined with a heatsink. And when the fan just blows somewhere nearby it's only annoying and doesn't help at all (if only enclosure makers would notice!)

5) If you plan to run heavy stuff on your board be prepared to switch from the Micro USB connector to the Euler connector for DC-IN. You can feed 5V through Euler pins 2 and 4 and can connect GND to Euler pins 6, 9 and 14.

6) In case you experience sudden power-offs (green led also immediately off) think about adding a heatsink to the PMIC as well (unconfirmed at the moment and more of a guess than a recommendation)

7) We should keep in mind that benchmark scores that differ by less than 20% should be interpreted as being identical when it comes to normal use cases (if you want to do number crunching then it's a different story, but then you chose the wrong device anyway)

8) We should also keep in mind that benchmarking irrelevant stuff is just that: irrelevant. When you want to use your device to watch videos, then it's more important whether HW acceleration for the video codecs you're interested in is available than how slow/fast the CPU might be able to calculate prime numbers (even worse with unoptimised code, as happens all the time)

9) Take benchmark results that do not take care of throttling with a grain of salt since they are misleading

10) Take benchmark results that do not make use of optimised code with a grain of salt since they are even more misleading (you got that ARMv8 thingie since you wanted to benefit from faster software, right? A benchmark that disables code optimisations, with PTS' Smallpt as a prominent example, is rather useless since it shows irrelevant performance scores)

11) Take every benchmark result for the Pine64+ that will be published in the next few weeks with a grain of salt since settings aren't ready yet.

What does "settings" mean? We might improve the throttling strategies for single threaded workloads in the near future. Then 'real world' performance and also single threaded benchmark scores will automagically improve by 10%-30%.

Regarding this dvfs stuff: the higher the so-called VDD_CPUX voltage is set (that's the core voltage the CPU cores are fed with), the hotter the SoC gets, and when CPU/GPU cores are busy throttling will jump in earlier. So you always define dvfs operating points in a way that reduces the voltages to a reasonable minimum. This process has not even started yet. We currently rely on Allwinner's defaults and no one has looked into how low you can go (again: if we're able to reduce all the voltages a bit while keeping some safety headroom, then the SoC will remain cooler, throttling will happen later and performance increases automagically -- but this whole process is time consuming and needs a lot of boards/users to join in)
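To get a feeling for why the voltage matters so much, here is a rough Python sketch using the textbook approximation that dynamic CMOS power scales with frequency times voltage squared. That model is an assumption on my side (it ignores leakage and everything else on the SoC), and the 50 mV undervolt is a made-up example, not a tested setting.

```python
# (MHz, mV) pairs taken from the dvfs table [1] above.
OPERATING_POINTS = [
    (1152, 1300), (1104, 1260), (1056, 1240), (1008, 1200),
    (960, 1160), (912, 1120), (816, 1080), (648, 1040),
]


def relative_dynamic_power(mhz, mv, ref_mhz, ref_mv):
    """Dynamic power relative to a reference point, assuming P ~ f * V^2."""
    return (mhz / ref_mhz) * (mv / ref_mv) ** 2


def undervolt_saving(mv, delta_mv):
    """Fraction of dynamic power saved at unchanged clockspeed, assuming P ~ V^2."""
    return 1.0 - ((mv - delta_mv) / mv) ** 2


ref_mhz, ref_mv = OPERATING_POINTS[0]
for mhz, mv in OPERATING_POINTS:
    share = relative_dynamic_power(mhz, mv, ref_mhz, ref_mv)
    print(f"{mhz:5d} MHz @ {mv} mV -> {share:.0%} of top-step dynamic power")

# Hypothetical example: shaving 50 mV off the 1152 MHz operating point.
saving = undervolt_saving(1300, 50)
print(f"50 mV less at 1300 mV saves about {saving:.1%} dynamic power")
```

Under these assumptions even a small voltage reduction per step buys noticeable thermal headroom, which is the reason for wanting to find the lowest stable voltages.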

And regarding settings, I have to add that the most important change we already made with regard to stupid 'fire and forget' benchmarks the Phoronix style concerns the default Allwinner behaviour. They ship their BSP with rather strange settings where CPU cores get killed instead of letting throttling do the job. We changed that to sane values recently. With the settings from last week you might therefore end up with a Pine64+ running on only 1 or 2 cores, and it should be obvious how this influences benchmarks (that's the main reason the Orange Pi PC/Plus are listed as that slow on Phoronix: the same thing happened back then and the tester didn't notice)

BTW: It should also be obvious that you can't do proper benchmarking without monitoring the system your tests run on (see the graphs above). To ease that I wrote a simple script that installs RPi-Monitor with A64 adjustments on Debian-based distros like longsleep's Ubuntu OS image. It also contains a function to apply our latest adjustments to cpufreq/dvfs settings, so if you answer yes to the question "Do you want to adjust throttling settings (requires overwriting u-boot/dtb)?" your Pine64+ will perform better, at least in multithreaded benchmarks (since these settings improve throttling a lot):

http://kaiser-edv.de/tmp/4U4tkD/install-...for-a64.sh
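If you just want a quick look without installing anything, a minimal Python sketch like this one can log clockspeed, temperature and online cores while a benchmark runs. It is not the script linked above. The sysfs paths are the usual Linux ones, but which thermal zone belongs to the SoC and whether the temperature is reported in degrees or millidegrees depends on the kernel, so treat both as assumptions to check on your image.

```python
import time
from pathlib import Path

# Usual Linux sysfs locations; the thermal zone number is an assumption.
CPUFREQ_PATH = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")
TEMP_PATH = Path("/sys/class/thermal/thermal_zone0/temp")
ONLINE_PATH = Path("/sys/devices/system/cpu/online")


def normalise_temp(raw: int) -> float:
    """Return degrees Celsius; values above 1000 are treated as millidegrees."""
    return raw / 1000 if raw > 1000 else float(raw)


def count_online(spec: str) -> int:
    """Count CPUs in a kernel CPU list such as '0-3' or '0-1,3'."""
    total = 0
    for part in spec.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            total += int(last) - int(first) + 1
        else:
            total += 1
    return total


def sample() -> str:
    """Read one line of monitoring data from sysfs."""
    mhz = int(CPUFREQ_PATH.read_text().strip()) // 1000  # file holds kHz
    temp = normalise_temp(int(TEMP_PATH.read_text().strip()))
    cores = count_online(ONLINE_PATH.read_text())
    return f"{time.strftime('%H:%M:%S')} {mhz:5d}MHz {temp:5.1f}°C {cores} cores active"


def monitor(count: int, interval_s: float = 5.0) -> None:
    """Print `count` samples, one every `interval_s` seconds."""
    for _ in range(count):
        print(sample())
        time.sleep(interval_s)
```

Calling `monitor(120)` in a second terminal before starting a test run gives about ten minutes of data to compare against the benchmark result.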

