Pine64 Cluster economy
#1
Hi,

I was interested in how the Pine64, with its very low price, would stand up against other approaches. I am mostly interested in mixed CPU-intensive workloads: some floating point, but also some unstructured code.

I have chosen povray as an easily available benchmark, and as something that might actually be close to what I would like to test and run on a Pine64 cluster.

Even if the results are negative, I might consider Pine64 just for distributed software development and testing. But if the economy is on the bad side, it will not be a very practical replacement for other computing platforms.

I explicitly excluded GPUs. In their specific domain they will deliver the best performance, both in absolute terms and probably in perf/W and perf/$. But development for them isn't that easy, and they are not suited for generic code.

I had access to one PINE64+ and one PC with a relatively modern Intel CPU (Sandy Bridge architecture, i7-3930K 3.2GHz running at 4.2GHz).

Running Debian testing / unstable.

I compiled povray manually from Debian sources (povray 3.7.0), with gcc 6.1 on amd64 and gcc 7.0 (a custom-built git version from a few days ago) on aarch64. This way I got about a 15% performance improvement on both the Sandy Bridge CPU and the A64 SoC. The generic Debian packages in testing/unstable are already compiled with -O3, but without specific -march or -mtune options to tune instruction scheduling, use cache information, and so on.

(Note that Sandy Bridge supports AVX but lacks AVX2, though I doubt AVX2 would help much in this benchmark. Haswell/Broadwell might bring about 20% in generic IPC improvements, and that is the estimate I used below.)

I used FDO and LTO, with the final compilation using profiles collected from the povray benchmark and the flags -O3 -march=native -ffast-math -fomit-frame-pointer. I tested other switches, but they either bring no additional benefit or degrade performance. Most of the performance benefit comes from -march=native (especially on Sandy Bridge) and just a little from -ffast-math; FDO/LTO adds a few more percent (and also makes the binary smaller, which helps cache utilization).

Results.

I used:
echo | time povray -benchmark

This runs the standard povray benchmark with a 512x512 target output and adaptive subsampling, for a total of 294912 pixels, ~776k samples, ~2.63 samples/pixel. The benchmark has a very varied structure: simple and complex objects, simple and complex regions, areas with and without reflection, refraction and anisotropy, complex and simple shading, high and low spatial density of objects, and both mathematically complex and simple objects. For textured objects it uses exclusively procedural textures (including Perlin noise and other fractal methods). It puts little pressure on memory (just a few megabytes used at most) and little pressure on memory bandwidth (almost everything fits in the cache, and there is considerably more arithmetic and CPU code than memory access, partly due to the lack of pregenerated textures / images).

The time below is the real time spent in the main Trace pass in povray. (Data parsing, data structure creation and photon time are excluded, as they account for only about 1% of the total time and are not all multithreaded.)
 
i7-3930K 3.2GHz @ 4.2GHz, 32nm process, 32GB RAM:  (using 12 threads)

time: 90.227 s  (107.7s on debian generic build)
pixels/s: 3267
power at load: 270W (full system estimate)
power at idle: 100W (full system estimate)
minimal full system price: 450$ (estimate of minimal full system price. 8GB ram, no case, no switch. realistic)
pixels/s/$: 7.26
pixels/s/W: 12.1

Pine64 (sun50iw1p1), 2GB RAM  (with small heatsink): (using 4 threads)

time: 1359 s  (1949 s on debian generic build)
pixels/s: 217  (151.3 on generic build)
power at load: 6W (full system estimate including potential ethernet switch port amortized power)
power at idle: 1.5W (without ethernet)
minimal full system price: 20$ (includes PSU, cabling, heat sink, amortized price of ethernet switch port, but no cases or mounting hardware. optimistic)
pixels/s/$: 10.85
pixels/s/W: 36.17


Speculative estimate for more modern CPU:

i7-5820K 3.3GHz @ 4.2GHz, 6 core, 2x4GB RAM, 22nm process

time: 80 s
pixels/s: 3700
power at load: 250W
power at idle: 60W
minimum full system price: 550$
pixels/s/$: 6.7
pixels/s/W: 14.7
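
The derived metrics above are plain ratios of throughput to price and to load power. As a rough sketch, they can be recomputed from the figures given in this post (the table layout and the `metrics` helper are my own illustrative names, and small differences from the listed values are just rounding):

```python
# Recompute pixels/s, pixels/s/$ and pixels/s/W from the figures in this post.
PIXELS = 294912  # total pixels reported for the benchmark run

# name: (trace time in s, load power in W, estimated system price in $)
systems = {
    "i7-3930K": (90.227, 270, 450),
    "Pine64+": (1359, 6, 20),
    "i7-5820K (estimate)": (80, 250, 550),
}

def metrics(time_s, watts, price):
    """Return (pixels/s, pixels/s/$, pixels/s/W) for one system."""
    pps = PIXELS / time_s
    return pps, pps / price, pps / watts

for name, (t, w, p) in systems.items():
    pps, per_dollar, per_watt = metrics(t, w, p)
    print(f"{name:22s} {pps:7.0f} px/s  {per_dollar:5.2f} px/s/$  {per_watt:5.2f} px/s/W")
```

Adding another CPU to the comparison only needs one more entry in `systems`.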


Note that I am only comparing against very high-end Intel CPUs with 6 cores. A comparison with 4-core i5 and i7 CPUs, or with AMD CPUs, would probably be considerably more in favor of x86, both for perf/W and perf/$. I have access to a few more CPUs, including very low-power ones, and would like to update this list shortly.


Summary:

Realistically speaking, a Pine64+ cluster might be a viable alternative to an x86 high-performance computer in some compute-intensive workloads. However, it will require about 20 Pine64 boards to match the performance of the x86, with less memory (also distributed across more devices), memory bandwidth limits, and management and custom software overheads. To fully exploit the Pine64's low price and low power usage and compete with other platforms, one needs to bring costs as low as possible (every 1$ counts) by using a shared PSU, cheap cables and cooling solutions, and DIY mounting options. Letting it cost more than 30$ per board will make it economically impractical. 25$ might be OK (19$ for the PINE and 6$ for other parts); 20$ would be best.

If matched on initial cost, the PINE64 might be a very attractive option in the long run, due to its very low idle power usage and very high performance/W.
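
To make the price threshold concrete, here is a small sketch of the board-count arithmetic using the measured throughput figures from this post. The `efficiency` parameter is purely my assumption for modelling cluster overhead (it is not something measured here), and the function name is illustrative:

```python
import math

X86_PPS = 3267   # measured i7-3930K throughput (pixels/s)
PINE_PPS = 217   # measured Pine64+ throughput (pixels/s)
X86_PRICE = 450  # estimated minimal x86 system price ($)

def boards_to_match(x86_pps=X86_PPS, pine_pps=PINE_PPS, efficiency=1.0):
    """Boards needed to match the x86 throughput.

    efficiency < 1.0 is an assumed scaling loss for a cluster, not a measured value.
    """
    return math.ceil(x86_pps / (pine_pps * efficiency))

for efficiency in (1.0, 0.8):
    n = boards_to_match(efficiency=efficiency)
    for per_board in (20, 25, 30):
        total = n * per_board
        print(f"efficiency {efficiency:.1f}: {n} boards at {per_board}$ = {total}$ (x86: {X86_PRICE}$)")
```

By raw throughput alone the ratio is about 15, so a figure near 20 boards would correspond to some allowance for overhead. Under these assumptions, 30$ per board already exceeds the 450$ x86 estimate even with perfect scaling.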

Future work:

More tests. Calculate 2-year ownership cost at 20% utilization (20% of the time system fully loaded, 80% idle).
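
A first rough cut of that ownership calculation could look like the sketch below. The power and price figures are the ones from this post; the electricity tariff of 0.15 $/kWh and the choice of 16 boards (raw throughput ratio 3267 / 217, rounded up) are my assumptions, and the Pine64 idle figure excludes ethernet as noted above:

```python
HOURS_2Y = 2 * 365 * 24  # hours in two years

def ownership_cost(price, load_w, idle_w, utilization=0.20,
                   usd_per_kwh=0.15, hours=HOURS_2Y):
    """Purchase price plus electricity over the period.

    usd_per_kwh is an assumed tariff, not a number from this thread.
    """
    avg_w = utilization * load_w + (1 - utilization) * idle_w
    energy_kwh = avg_w * hours / 1000
    return price + energy_kwh * usd_per_kwh

x86 = ownership_cost(price=450, load_w=270, idle_w=100)
pine = ownership_cost(price=20, load_w=6, idle_w=1.5)

print(f"i7-3930K system: {x86:.0f}$")
print(f"one Pine64+:     {pine:.2f}$")
print(f"16 Pine64+:      {16 * pine:.0f}$")
```

Changing `utilization` or `usd_per_kwh` shows how sensitive the comparison is to idle power, which is where the Pine64 differs most from the x86 box.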

What do you think? Any other comparisons to make? Maybe things like web serving, memory-based caching, or storage of some sort?

I would love to see somebody build a ~50-node cluster with price, performance and power details.
#2
(05-07-2016, 06:05 PM)baryluk Wrote: Pine64 (sun50iw1p1), 2GB RAM  (with small heatsink): (using 4 threads)

Without reporting throttling behaviour, all performance numbers are unfortunately somewhat irrelevant. Keep in mind that if you want performance, a 'small heatsink' isn't enough. Using appropriate heatsinks, intelligent assembly and only two 120mm fans with controlled airflow, a Pine64+ might easily run 30%-40% faster.

Unless you run longsleep's pine64_health.sh script in parallel or install RPi-Monitor (easy on any Debian-based image), you just produce numbers without meaning. Another fatal issue is using the wrong OS images (e.g. the featured Xubuntu rip-off the Pine64 folks provide), since without at least passive monitoring you can't see how background tasks might negatively influence the benchmark you're currently running.

This is Xubuntu's screensaver starting after an hour on an absolutely idle Pine64:

[Image: Bildschirmfoto%202016-05-03%20um%2017.44.12.png]

The screensaver alone is so CPU intensive that the SoC temperature reaches 90°C, cooling state 4 is entered and throttling down to 960MHz happens. This screensaver also starts when you run benchmarks unattended through SSH, since it only checks for keyboard/mouse interaction. Benchmarking without monitoring always produces only numbers without meaning.

TL;DR: The numbers provided are questionable; with monitoring and better heat dissipation the efficiency can most probably be improved by 30% or even more.
#3
(05-09-2016, 04:56 AM)tkaiser Wrote:
TL;DR: The numbers provided are questionable; with monitoring and better heat dissipation the efficiency can most probably be improved by 30% or even more.

There was no throttling during the benchmark. I know how to do benchmarks. I have 15 years of experience in that matter.

It was ~65 C all the time, and never below 1.15GHz. With better cooling you might get a few percent improvement. Nothing that changes the whole picture.

Your comments are without meaning, and full of assumptions. If you are that smart, why don't you run some comparison instead?

Also, you totally missed the point about the economy part. Adding two 120mm fans will easily make it a failed project by crossing the cost threshold I stated in the post.
#4
(05-17-2016, 03:26 PM)baryluk Wrote: It was ~65 C all the time, and never below 1.15GHz.

Please don't get offended since I'm not talking to you, just writing some stuff as a reference for others that might come across this thread interested in cluster/performance experiences.

65°C with a heatsink is simply not possible with these types of workloads (maybe single-threaded, but not when running on 4 fully utilized cores). I just gave it a try on my Pine64 without a heatsink. I'm currently refraining from putting a heatsink on it so that I can help the guys doing A64 mainline kernel work. For both the reason and some thermal/throttling results of really demanding workloads using NEON optimizations, see starting from here: http://irclog.whitequark.org/linux-sunxi...#16481377

[Image: Bildschirmfoto%202016-05-18%20um%2014.50.39.png]

The calculated average cpufreq (see below) was 950 MHz. Comparing my duration of 2245 seconds with your 1949 seconds means you ran at an average of about 1095 MHz, given you used the same THS settings (keeping SoC temp at around 90°C). This also means that povray with these optimizations can't be considered a heavy workload (please compare with cpuburn-a53 in the link above: with this kind of code you end up at a 600 MHz throttling frequency without a heatsink). BTW: Maybe I did something wrong:

apt-get install povray time --> http://pastebin.com/Ynebb1sF
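
The frequency estimate above is a simple proportional scaling. As a sketch, assuming runtime is inversely proportional to the average clock (which ignores memory effects and any differences between the two setups), it works out like this; the function name is just illustrative:

```python
def estimated_avg_mhz(ref_mhz, ref_seconds, other_seconds):
    """Scale a known average clock by the runtime ratio (assumes runtime ~ 1 / frequency)."""
    return ref_mhz * ref_seconds / other_seconds

# 950 MHz average over 2245 s on one board, 1949 s reported for the other run
print(f"{estimated_avg_mhz(950, 2245, 1949):.0f} MHz")
```

This lands within about 1 MHz of the figure quoted above, so the two numbers agree up to rounding.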

Further readings regarding performance optimizations:

https://github.com/longsleep/build-pine64-image/pull/3
http://forum.pine64.org/showthread.php?t...10#pid3410

And in addition to the improvements we made back in March, see also this (adding even more cpufreq steps helps with performance on both H3 and A64 as soon as throttling occurs):

https://github.com/igorpecovnik/lib/issu...-220017171

BTW: An easy way to calculate the average cpufreq when running a benchmark or a real workload while throttling occurred is to switch to the performance governor as root

Code:
echo performance >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

and then do
Code:
cat /sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state >/tmp/before
$benchmark
cat /sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state >/tmp/after

Then do the math.
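
One possible way to do that math is a time-weighted average over the difference of the two snapshots. The sketch below assumes the usual time_in_state layout of one "frequency_in_kHz time" pair per line; the sample snapshots are made-up numbers for illustration, not measurements from this thread:

```python
def parse_time_in_state(text):
    """Parse 'freq_khz ticks' lines into a {freq_khz: ticks} dict."""
    stats = {}
    for line in text.splitlines():
        if line.strip():
            freq, ticks = line.split()
            stats[int(freq)] = int(ticks)
    return stats

def average_mhz(before_text, after_text):
    """Time-weighted average frequency between two time_in_state snapshots."""
    before = parse_time_in_state(before_text)
    after = parse_time_in_state(after_text)
    delta = {f: after[f] - before.get(f, 0) for f in after}
    total = sum(delta.values())
    if total == 0:
        raise ValueError("no time elapsed between snapshots")
    return sum(f * t for f, t in delta.items()) / total / 1000  # kHz -> MHz

# Illustrative snapshots (made-up values, NOT real measurements)
before = "480000 100\n960000 200\n1152000 300\n"
after = "480000 100\n960000 1200\n1152000 1300\n"
print(f"average cpufreq: {average_mhz(before, after):.0f} MHz")
```

With real data, pass in the contents of /tmp/before and /tmp/after, e.g. `average_mhz(open("/tmp/before").read(), open("/tmp/after").read())`. The time unit cancels out, so only the ratios between states matter.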
#5
(05-17-2016, 03:26 PM)baryluk Wrote: Also you totally missed the point about the economy part. Adding 2 120 fans and will easily make it failed project due to crossing the cost threshold I stated in the post.

Forgot to address this concern. I was talking about two 120mm fans per cluster. I already described that in another cluster thread here: http://forum.pine64.org/showthread.php?t...31#pid7431

I tested this with a few H3 boards and 90mm fans and it works quite well. With Pine64 it should work even better, since with vertical mounting you can build a tower with inner dimensions of 90x90mm holding 6, 12, 18 or even 24 Pine64 (calculate with 18-20cm height per 6 x Pine64). Using 20mm cardboard strips you control the airflow between two boards, so that the SoC with its mounted heatsink becomes a narrow point and as much air as possible flows over the heatsink's surface.

The 120mm fans on the bottom and top do not need to rotate very fast, so this approach is also almost silent.
#6
Also a silent approach (have a look at the pictures). I know not everyone has something lying around at home, but fans are not free either, and they consume energy.

- So let's say you pay 1€ for 2 small heatsinks (and they get a lot cheaper when you buy a larger number of them)

- Then you need a big one for the GPU and CPU. The heatsink size is limited to 4cm length by 4cm width, otherwise a short circuit would be possible (but I have not tried it yet). I found suitable heatsinks for around 2-3€ on eBay.

- cable ties, almost free

- MX-2 8g for around 5€; for 3 devices I needed something like 1g


I have not run a benchmark, but I installed MATE and then ran a loop in Racket (Lisp) for 15 min, and the Pine was never above room temperature (19°C).


Attached Files Thumbnail(s)
           