09-26-2020, 12:29 PM
(This post was last modified: 09-26-2020, 07:27 PM by simonsouth. Edit Reason: Brevity)
I've experienced this as well on my ROCK64 (v2, 4 GB): Often either gcc or Linux itself will crash with an "Illegal instruction" or undefined-instruction error during lengthy builds. Here's a typical kernel dump, to help others searching for a solution:
Code:
[ 2437.611193] kernel BUG at arch/arm64/kernel/traps.c:405!
[ 2437.611663] Internal error: Oops - BUG: 0 [#1] SMP
[ 2437.612084] Modules linked in: ath9k_htc ath9k_common ath9k_hw ath mac80211 cfg80211 libarc4 crct10dif_ce
[ 2437.612933] CPU: 0 PID: 1044 Comm: kworker/0:0H Not tainted 5.4.39-gnu #1
[ 2437.613527] Hardware name: Pine64 Rock64 (DT)
[ 2437.613921] Workqueue: 0x0 (kblockd)
[ 2437.614250] pstate: 00000085 (nzcv daIf -PAN -UAO)
[ 2437.614679] pc : do_undefinstr+0x2a0/0x348
[ 2437.615041] lr : do_undefinstr+0x13c/0x348
[ 2437.615401] sp : ffff80001155bba0
[ 2437.615694] x29: ffff80001155bba0 x28: ffff0000f6879880
[ 2437.616161] x27: ffff0000d602d900 x26: ffff0000f6879880
[ 2437.616627] x25: ffff800010783050 x24: 0000000000000000
[ 2437.617092] x23: 0000000000000085 x22: ffff8000100dcdfc
[ 2437.617559] x21: ffff80001155bd40 x20: ffff80001155bc00
[ 2437.618025] x19: ffff800010af8000 x18: 0000000000000000
[ 2437.618491] x17: 0000000000000000 x16: ffffffffffcfffff
[ 2437.618956] x15: ffffffffffffffff x14: ffff0000fa1fe380
[ 2437.619423] x13: ffff0000ce340000 x12: 0000000000000002
[ 2437.619887] x11: 0000000000000001 x10: ffff0000fa1fe340
[ 2437.620352] x9 : 0000000000000000 x8 : ffff0000fc9a7ab0
[ 2437.620817] x7 : ffff0000f7e00800 x6 : ffff80001155bbf8
[ 2437.621282] x5 : ffff800010b68100 x4 : 0000000000000000
[ 2437.621748] x3 : 00000000d5300000 x2 : ffff800010b01608
[ 2437.622214] x1 : ffff800010b68100 x0 : 0000000000000085
[ 2437.622680] Call trace:
[ 2437.622900] do_undefinstr+0x2a0/0x348
[ 2437.623235] el1_undef+0x10/0x84
[ 2437.623528] deactivate_task+0x5c/0xa8
[ 2437.623865] __schedule+0x2e8/0x4d0
[ 2437.624176] schedule+0x30/0xa8
[ 2437.624458] worker_thread+0xe0/0x4e0
[ 2437.624786] kthread+0x124/0x128
[ 2437.625076] ret_from_fork+0x10/0x18
[ 2437.625398] Code: f94013b5 17ffffef a9025bb5 f9001bb7 (d4210000)
[ 2437.625935] ---[ end trace 91edd19288ede6cb ]---
I have a hunch this is due simply to the memory chip getting too hot, and not because it can't run reliably at higher speeds. In addition to the heat it generates itself, the memory chip's position right next to the SoC means it is likely absorbing heat radiated by the CPU cores. The official aluminum case may even aggravate this situation, as I suspect the extra-wide heat pipe intended to wick heat away from both chips may actually conduct it from one to the other at times.
The situation is even worse on boards like mine with the SpecTek memory chip, as its datasheet shows it rated for reliable use up to only 70 degrees Celsius, unlike the other parts that are rated for use up to 85. Meanwhile, the RK3328 can reach temperatures of 90 degrees or more under continuous heavy load.
A bit of experimentation supports my theory: While monitoring the SoC temperature (using "watch cat /sys/class/thermal/thermal_zone0/temp", which reports millidegrees Celsius, so 70000 means 70 degrees) during lengthy builds, I have yet to see a crash when the temperature stays below 70 degrees. Once it rises above that threshold, though, a crash often happens within minutes.
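For anyone who wants a timestamped log to compare against crash times, here is a minimal Python sketch of the same monitoring. It assumes the sysfs path from the command above (other boards or kernels may expose the SoC sensor under a different thermal zone) and uses the 70-degree SpecTek rating as the limit to flag. The function names are made up for illustration.

```python
import time
from pathlib import Path

# Assumed path, taken from the "watch cat" command above; the SoC sensor
# may live under a different thermal zone on other boards or kernels.
ZONE_TEMP = Path("/sys/class/thermal/thermal_zone0/temp")

# 70 degrees Celsius (the SpecTek rating mentioned above), in sysfs units.
LIMIT_MILLIDEG = 70000


def read_millideg(path=ZONE_TEMP):
    """Return the sensor reading in millidegrees Celsius."""
    return int(Path(path).read_text().strip())


def format_reading(millideg, limit=LIMIT_MILLIDEG):
    """Format one reading in degrees, flagging values above the limit."""
    flag = "  <-- above limit" if millideg > limit else ""
    return f"{millideg / 1000:.1f} C{flag}"


def monitor(path=ZONE_TEMP, interval=2.0):
    """Print a timestamped reading every `interval` seconds until interrupted."""
    while True:
        stamp = time.strftime("%H:%M:%S")
        print(stamp, format_reading(read_millideg(path)), flush=True)
        time.sleep(interval)
```

Calling monitor() in a second terminal (or over SSH) while a build runs gives a record of how hot the SoC was just before any crash. Note that this reads the SoC sensor only; the memory chip's own temperature is not measured.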
So what can be done? One solution would be to switch to active cooling by installing a fan. (In some cases, installing a separate heatsink on the memory chip may be enough.) Another is to limit the clock rate of the memory chip and/or the CPUs, as suggested above.
For systems running Linux, a third approach would be to adjust the trip points of the thermal-management driver so it manages the CPU clock rate more aggressively, with the goal of keeping the SoC temperature below 70 degrees for as long as possible (without impacting the machine's performance under normal use). Here's the existing configuration, from the RK3328 device tree:
Code:
trips {
	threshold: trip-point0 {
		temperature = <70000>;
		hysteresis = <2000>;
		type = "passive";
	};
	target: trip-point1 {
		temperature = <85000>;
		hysteresis = <2000>;
		type = "passive";
	};
	soc_crit: soc-crit {
		temperature = <95000>;
		hysteresis = <2000>;
		type = "critical";
	};
};
It appears that by default the driver takes no action at all until the temperature reaches 70 degrees, by which point the system may already be heading for a crash. My guess is that reducing the first two temperature values to (say) 60000 and 70000 will help greatly in keeping ROCK64s with SpecTek memory stable under load. (To my knowledge, the thermal driver does not allow these trip points to be adjusted at runtime.) I'll be experimenting with this to see what sort of difference it makes.
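For anyone wanting to try the same change, here is a minimal, untested sketch of how it might look as an override in a board-level device-tree source. It assumes that file includes the RK3328 definitions quoted above, so that the "threshold" and "target" labels can be referenced. The values 60000 and 70000 are only the guesses mentioned above, not tested settings.

```dts
/* Assumption: placed in a board .dts that includes the RK3328 device tree,
 * where the "threshold" and "target" labels are defined as quoted above. */

&threshold {
	/* Start passive throttling at 60 degrees instead of 70 (a guess). */
	temperature = <60000>;
};

&target {
	/* Aim to hold the SoC at or below 70 degrees instead of 85 (a guess). */
	temperature = <70000>;
};
```

The hysteresis values and the critical trip point at 95000 are left as they are. The device tree has to be recompiled and installed for the change to take effect.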