Pine64 Cluster economy
#1
Hi,

I was interested in how the Pine64, with its very low price, would stand up against other approaches. I am mostly interested in mixed CPU-intensive workloads: some floating point, but also some unstructured code.

I have chosen povray as an easily available benchmark, and as something that might actually be close to what I would like to test and run on a Pine64 cluster.

Even if the results are negative, I might consider the Pine just for distributed software development and testing. But if the economics are on the bad side, it will not be a very practical replacement for other computing platforms.

I explicitly excluded GPUs: in their specific domain they will deliver the best performance, both in absolute terms and probably in perf/W and perf/$. But development for them isn't that easy, and they are not suited for generic code.

I had access to one PINE64+ and one PC with a relatively modern Intel CPU (Sandy Bridge architecture, i7-3930K 3.2GHz running at 4.2GHz).

Both were running Debian testing/unstable.

I compiled povray manually from the Debian sources (povray 3.7.0), with gcc 6.1 on amd64 and gcc 7.0 (a custom-built git version from a few days ago) on aarch64. This way I got about 15% performance improvement on both the Sandy Bridge CPU and the A64 SoC. The generic Debian packages in testing/unstable are already compiled with -O3, but without specific -march or -mtune options to tune instruction scheduling, use cache information, and so on.

(Note that Sandy Bridge supports AVX but lacks AVX2 and FMA; I doubt these would help much in this benchmark. Haswell/Broadwell might bring about 20% of generic IPC improvement, and that is the estimate I used below.)

I used FDO and LTO, with the final compilation using profiles collected from the povray benchmark and the flags -O3 -march=native -ffast-math -fomit-frame-pointer. I tested other switches, but they either bring no additional benefit or degrade performance. Most of the gain comes from -march=native (especially on Sandy Bridge), just a little from -ffast-math, and FDO/LTO adds a few more percent (and also makes the binary smaller, which helps cache utilization).

Results.

I used:
echo | time povray -benchmark

This runs the standard povray benchmark with a 512x512 target output and adaptive subsampling, for a total of 294912 pixels, ~776k samples, ~2.63 samples/pixel. The benchmark has a very varied structure: simple and complex objects, simple and complex regions, areas with and without reflection, refraction and anisotropy, complex and simple shading, high and low spatial density of objects, and both mathematically complex and simple objects. For textured objects it uses exclusively procedural textures (including Perlin noise and other fractal methods). It puts little pressure on memory (just a few megabytes used at most) and little pressure on memory bandwidth (almost everything fits in the cache, and there is considerably more arithmetic and CPU code than memory access, also because there are no pregenerated textures or images).

The time below is the real time spent in the main Trace pass of povray. Data parsing, data structure creation and photon time are excluded, as they are only about 1% of the total time and not all of them are multithreaded.
 
i7-3930K 3.2GHz @ 4.2GHz, 32nm process, 32GB RAM:  (using 12 threads)

time: 90.227 s  (107.7s on debian generic build)
pixels/s: 3267
power at load: 270W (full system estimate)
power at idle: 100W (full system estimate)
minimal full system price: 450$ (estimate of minimal full system price. 8GB ram, no case, no switch. realistic)
pixels/s/$: 7.26
pixels/s/W: 12.1

Pine64 (sun50iw1p1), 2GB RAM  (with small heatsink): (using 4 threads)

time: 1359 s  (1949 s on debian generic build)
pixels/s: 217  (151.3 on generic build)
power at load: 6W (full system estimate including potential ethernet switch port amortized power)
power at idle: 1.5W (without ethernet)
minimal full system price: 20$ (includes PSU, cabling, heat sink, amortized price of ethernet switch port, but no cases or mounting hardware. optimistic)
pixels/s/$: 10.85
pixels/s/W: 36.17
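
As a cross-check, the derived figures for the two measured systems can be recomputed from the trace time, load power and price quoted above. This is only a small sketch; the dictionary layout and the function name `metrics` are mine, and the results differ from the posted numbers by rounding only.

```python
# Recompute the derived metrics from the measured values quoted above.
PIXELS = 294912  # total pixels reported for the benchmark run

# (trace time in s, load power in W, system price in $), all taken from the post
systems = {
    "i7-3930K": (90.227, 270.0, 450.0),
    "Pine64": (1359.0, 6.0, 20.0),
}

def metrics(trace_s, load_w, price_usd):
    """Return (pixels/s, pixels/s/$, pixels/s/W) for one system."""
    pps = PIXELS / trace_s
    return pps, pps / price_usd, pps / load_w

results = {name: metrics(*vals) for name, vals in systems.items()}

for name, (pps, per_usd, per_w) in results.items():
    print(f"{name}: {pps:.0f} px/s, {per_usd:.2f} px/s/$, {per_w:.2f} px/s/W")
```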


Speculative estimate for more modern CPU:

i7-5820K 3.3GHz @ 4.2GHz, 6 core, 2x4GB RAM, 22nm process

time: 80 s
pixels/s: 3700
power at load: 250W
power at idle: 60W
minimum full system price: 550$
pixels/s/$: 6.7
pixels/s/W: 14.7


Note that I am only comparing against very high-end Intel CPUs with 6 cores. A comparison against 4-core i5 and i7 CPUs, or AMD CPUs, would probably shift things considerably in favor of x86, both for perf/W and perf/$. I have access to a few more CPUs, including very low power ones, and would like to update this list shortly.


Summary:

Realistically speaking, a Pine64+ cluster might be a viable alternative to a high-performance x86 computer for some compute-intensive workloads. However, it will require about 20 Pine64 boards to match the performance of the x86 system, with less memory (also distributed across more devices), memory bandwidth limits, and management and custom software overheads. To fully exploit the Pine64's low price and low power usage and compete with other platforms, one needs to bring the costs as low as possible (every 1$ counts) by using a shared PSU, cheap cables and cooling solutions, and DIY mounting options. A per-node cost of more than 30$ will make it economically impractical. 25$ might be OK (19$ for the PINE and 6$ for the other parts); 20$ would be best.

If the initial cost can be matched, the PINE64 might be a very attractive option in the long run, due to its very low idle power usage and very high performance/W.
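
The price thresholds in the summary can be sanity-checked with a short calculation using only the measured pixels/s and the x86 system price from the results. This is a sketch that ignores networking and software overheads entirely; on raw throughput the ratio comes out at 15 to 16 boards, so the figure of about 20 presumably leaves headroom for those overheads (that is my reading, not something stated in the post). The break-even board price lands just under 30$, which agrees with the 30$ limit given above.

```python
import math

X86_PPS, X86_PRICE = 3267.0, 450.0   # pixels/s and system price from the results
PINE_PPS = 217.0                     # pixels/s per Pine64 board

# Boards needed to match the x86 box on raw throughput, ignoring all overheads.
boards = math.ceil(X86_PPS / PINE_PPS)

# Per-board price at which a Pine64 has the same pixels/s/$ as the x86 system.
break_even_price = PINE_PPS / (X86_PPS / X86_PRICE)

for price in (20, 25, 30):
    cluster_cost = boards * price
    print(f"{price}$/board: {boards} boards cost {cluster_cost}$ "
          f"({PINE_PPS / price:.2f} px/s/$ vs {X86_PPS / X86_PRICE:.2f} on x86)")
print(f"break-even board price: {break_even_price:.2f}$")
```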

Future work:

More tests. Calculate the 2-year ownership cost at 20% utilization (system fully loaded 20% of the time, idle 80% of the time).

What do you think? Any other comparisons to make? Maybe things like web serving, memory-based caching, or storage of some sort?

I would love to see somebody build a ~50 node cluster and publish price, performance and power details.
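
A rough sketch of how that ownership cost could be calculated from the power and price figures above. The electricity price `PRICE_PER_KWH` is a made-up placeholder, not a number from this post, and the board count of 16 is just the raw-throughput match (3267 / 217, rounded up), so treat the output as illustrative only. Note also that the Pine64 idle figure of 1.5W was measured without ethernet.

```python
HOURS = 2 * 365 * 24          # two years of wall-clock time
UTILIZATION = 0.20            # fraction of time at full load, as proposed above
PRICE_PER_KWH = 0.15          # ASSUMPTION: hypothetical electricity price in $/kWh

def ownership_cost(hw_price, load_w, idle_w, units=1):
    """Hardware price plus two years of electricity for `units` identical systems."""
    avg_w = UTILIZATION * load_w + (1 - UTILIZATION) * idle_w
    energy_kwh = avg_w * HOURS / 1000.0
    return units * (hw_price + energy_kwh * PRICE_PER_KWH)

# Price, load power and idle power are the full-system estimates from the results.
x86_cost = ownership_cost(450, 270, 100)
# ASSUMPTION: 16 boards, enough to match the x86 box on raw pixels/s only.
pine_cost = ownership_cost(20, 6, 1.5, units=16)

print(f"x86 system, 2 years: {x86_cost:.2f}$")
print(f"16 Pine64 boards, 2 years: {pine_cost:.2f}$")
```

With a different electricity price or board count the gap changes, but the idle power term dominates the x86 side at this utilization.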


Messages In This Thread
Pine64 Cluster economy - by baryluk - 05-07-2016, 06:05 PM
RE: Pine64 Cluster economy - by tkaiser - 05-09-2016, 04:56 AM
RE: Pine64 Cluster economy - by baryluk - 05-17-2016, 03:26 PM
RE: Pine64 Cluster economy - by tkaiser - 05-18-2016, 08:53 AM
RE: Pine64 Cluster economy - by tkaiser - 05-19-2016, 03:58 AM
RE: Pine64 Cluster economy - by JasperBrown - 06-19-2016, 09:13 AM
