NVMe-related crashes and instability, plus a solution - simonsouth - 09-30-2020

After installing an NVMe SSD in my Pinebook Pro I began to see Linux crashing periodically with output like the following:

Code:
[    7.153982] SError Interrupt on CPU2, code 0xbf000002 -- SError
[    7.153986] CPU: 2 PID: 169 Comm: udevd Not tainted 5.8.1-gnu #1
[    7.153988] Hardware name: PINE64 Pinebook Pro (DT)
[    7.153989] pstate: 20000005 (nzCv daif -PAN -UAO BTYPE=--)
[    7.153991] pc : nvme_submit_cmd+0x11c/0x130
[    7.153992] lr : nvme_queue_rq+0x43c/0x6b8
[    7.153993] sp : ffff80001409b6f0
[    7.153995] x29: ffff80001409b6f0 x28: ffff0000f4716000
[    7.153998] x27: 0000000000000000 x26: 0000000000001000
[    7.154002] x25: 0000000000000001 x24: 0000000000001000
[    7.154004] x23: ffff0000eff62000 x22: 0000000000000000
[    7.154007] x21: 0000000000000001 x20: ffff0000f4536a40
[    7.154010] x19: ffff800010d1a000 x18: 0000000000000000
[    7.154014] x17: 0000000000000000 x16: 0000000000000000
[    7.154016] x15: 0000000000000000 x14: 0000000000000000
[    7.154019] x13: 0000000000000000 x12: ffff800010226c88
[    7.154022] x11: 0000000000000000 x10: 0000000000000000
[    7.154025] x9 : 0000000000000000 x8 : ffffffffffffffff
[    7.154028] x7 : 00000000e929d000 x6 : 00000000e929d000
[    7.154031] x5 : 0000000007ef7ac9 x4 : 0000000000000006
[    7.154034] x3 : 0000000000000000 x2 : 0000000780000007
[    7.154037] x1 : ffff0000f4536a48 x0 : 0000000000000000
[    7.154040] Kernel panic - not syncing: Asynchronous SError Interrupt
[    7.154042] CPU: 2 PID: 169 Comm: udevd Not tainted 5.8.1-gnu #1
[    7.154044] Hardware name: PINE64 Pinebook Pro (DT)
[    7.154044] Call trace:
[    7.154046]  dump_backtrace+0x0/0x1d8
[    7.154047]  show_stack+0x14/0x20
[    7.154048]  dump_stack+0xbc/0xf8
[    7.154049]  panic+0x150/0x348
[    7.154050]  add_taint+0x0/0xa8
[    7.154051]  arm64_serror_panic+0x74/0x80
[    7.154053]  do_serror+0x6c/0x168
[    7.154054]  el1_error+0x84/0x100
[    7.154055]  nvme_submit_cmd+0x11c/0x130
[    7.154056]  nvme_queue_rq+0x43c/0x6b8
[    7.154058]  __blk_mq_try_issue_directly+0x104/0x230
[    7.154059]  blk_mq_request_issue_directly+0x50/0x100
[    7.154061]  blk_mq_try_issue_list_directly+0x58/0xe8
[    7.154062]  blk_mq_sched_insert_requests+0xe0/0x150
[    7.154064]  blk_mq_flush_plug_list+0x11c/0x188
[    7.154065]  blk_flush_plug_list+0xd8/0x108
[    7.154066]  blk_finish_plug+0x30/0xa0
[    7.154067]  read_pages+0x154/0x290
[    7.154069]  page_cache_readahead_unbounded+0x160/0x220
[    7.154070]  __do_page_cache_readahead+0x34/0x48
[    7.154072]  force_page_cache_readahead+0xb4/0x108
[    7.154073]  page_cache_sync_readahead+0xe4/0xf0
[    7.154074]  generic_file_buffered_read+0x5d8/0xa28
[    7.154076]  generic_file_read_iter+0xd0/0x180
[    7.154077]  blkdev_read_iter+0x38/0x48
[    7.154079]  new_sync_read+0xec/0x188
[    7.154080]  vfs_read+0x1bc/0x1d0
[    7.154081]  ksys_read+0x68/0xf8
[    7.154082]  __arm64_sys_read+0x14/0x20
[    7.154083]  do_el0_svc+0x68/0xd0
[    7.154084]  el0_sync_handler+0x16c/0x2a0
[    7.154086]  el0_sync+0x140/0x180
[    7.154112] SMP: stopping secondary CPUs
[    7.154113] Kernel Offset: disabled
[    7.154114] CPU features: 0x200022,01006008
[    7.154116] Memory Limit: none

The crashes became more and more frequent until eventually the system would fail to boot most times. The exact backtrace varied, but it always referenced the NVMe driver and indicated an "asynchronous system error", pointing to an issue with the hardware itself.
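For anyone wanting to check their own logs for the same signature, here is a minimal sketch in Python. It assumes the log has the same layout as the trace above (an "SError Interrupt on CPU" line followed by a "pc :" line); the function names are made up for illustration and are not part of any existing tool.

```python
import re

# Matches the "pc : symbol+0xoff/0xlen" line of an arm64 register dump,
# as it appears in the trace quoted above.
PC_LINE = re.compile(r"\bpc : (?P<symbol>[A-Za-z0-9_.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def serror_symbols(log_text):
    """Return the pc symbol reported after each 'SError Interrupt on CPU' line."""
    symbols = []
    awaiting_pc = False
    for line in log_text.splitlines():
        if "SError Interrupt on CPU" in line:
            awaiting_pc = True  # the next "pc :" line belongs to this SError
            continue
        match = PC_LINE.search(line)
        if awaiting_pc and match:
            symbols.append(match.group("symbol"))
            awaiting_pc = False
    return symbols


def looks_nvme_related(log_text):
    """True if any SError was raised while executing an nvme_* function."""
    return any(symbol.startswith("nvme_") for symbol in serror_symbols(log_text))
```

Feeding it the saved output of dmesg or a serial console capture gives a quick yes/no. A "no" does not rule out the NVMe link, since the exact backtrace can vary.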

After some research, I've found the solution is to remove this line from the Pinebook Pro device tree:

Code:
max-link-speed = <2>;

Since building a new kernel with this change I've yet to see a single crash from the NVMe driver and the system appears completely stable.

What this change does is stop the Linux PCIe driver from trying to operate the PCIe link at rates above 2.5 GT/s ("gen 1"), which is the default for RK3399-based devices and the maximum rate Rockchip themselves claim the SoC will support. It seems the RK3399 was originally designed to operate its PCIe bus at the higher "gen 2" speed of 5 GT/s, but since the SoC's release the company has downgraded its specifications as (I assume) variances in manufacturing resulted in many parts proving unstable at that speed, as my Pinebook Pro demonstrates.

I suspect this may be the cause of many of the NVMe-related issues other forum members are experiencing, particularly when failures are intermittent or the drive is known to work in other machines.

In fact, between this and the 2.0 GHz CPU frequency (also unsupported by Rockchip) that is enabled in the kernels most people are using, I find it remarkable that most Pinebook Pros have been running out of spec by default. I have to think this has something to do with the uneven experiences people are reporting with the machine, as well as the general lack of reliability you sense when skimming the posts in this forum.

In any case, if your Pinebook Pro seems to be having trouble using an NVMe drive, try bringing it back within the manufacturer's specifications by removing the line above from the device tree (and reverting the 2.0 GHz patch, if you've been using it) and building a new kernel. You may find the problems you've been experiencing disappear completely.


RE: NVMe-related crashes and instability, plus a solution - wdt - 09-30-2020

>Since building a new kernel with this change
Speaking as someone who has (poorly) edited dtb's: compiling a kernel
is not necessary, just get dtc and learn how to use it.
Granted, it is harder to modify an existing dtb than to start from scratch...
Thanks for finding this


RE: NVMe-related crashes and instability, plus a solution - as400 - 10-01-2020

@simonsouth - great detective work. Thanks.

I have no such problems whatsoever but anyway it's good to know.

As to the overclock: in my experience the problem is more the voltage than the CPU frequency. As I wrote in another thread, mine is running super stable at 1.7/2.18 GHz, but it is undervolted compared to the original overclock: 1.15/1.25 V (originally 1.3 V).

It looks like Pinebook Pros are more prone to the "silicon lottery" than other products. A great example is display-enabled U-Boot. I had almost no problems with it, yet most people have severe problems booting newer kernels. Why? The only explanation that comes to my mind is differences in hardware quality.

BTW - nobody wants to test your PWM kernel patch :) So I will probably do this on the weekend. Although, as I said, I had only minor problems booting newer kernels.
I suspect your patch might also fix the screen power management problems when using display-enabled U-Boot: when the screen goes off you just can't bring it back anymore. That's on the Plasma DE.
And there's also the kexec problem: screen distortions after kexec. This also might have something to do with PWM.


RE: NVMe-related crashes and instability, plus a solution - Jojonintendo - 10-12-2020

Hi guys. Many thanks for this discovery, the issue described is exactly the one I've been struggling with. Sometimes, just sitting in the GNOME DE without doing anything, the system hard crashes, but of course it is much easier to trigger by opening heavy apps. I'll try to recompile the kernel and see if it works this way.


RE: NVMe-related crashes and instability, plus a solution - wdt - 10-12-2020

>I'll try to recompile the kernel
You don't need to do this.
This is in the dtb, so get dtc (the device tree compiler) and learn how to use it; it's not hard.
This is a simple edit; what is a LOT harder is memory timing, GPIO mapping, regulator settings.
When you decompile, you get a dts; search/replace max-link-speed = <2>; with max-link-speed = <1>;
and recompile.
This may or may not solve your problem. I assume you have checked power consumption?
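The search/replace step can be sketched in Python as below. The dtc invocations in the comments use that tool's standard -I/-O/-o options, but the file names are placeholders, and where the dtb lives depends on your distribution. One thing to watch for: decompiled output typically prints cell values in hex (for example <0x02>), so the pattern accepts both spellings.

```python
import re

# Typical round trip (file names are placeholders):
#   dtc -I dtb -O dts -o board.dts board.dtb      # decompile
#   ...edit board.dts, e.g. with limit_link_speed() below...
#   dtc -I dts -O dtb -o board-new.dtb board.dts  # recompile

# Matches max-link-speed set to 2, written either as <2> or in hex as <0x2>/<0x02>.
LINK_SPEED = re.compile(r"max-link-speed\s*=\s*<\s*(?:2|0x0*2)\s*>\s*;")


def limit_link_speed(dts_text):
    """Return (new_text, count) with every gen 2 max-link-speed rewritten to <1>."""
    return LINK_SPEED.subn("max-link-speed = <1>;", dts_text)


if __name__ == "__main__":
    import sys

    path = sys.argv[1]  # path to the decompiled .dts file
    with open(path) as f:
        new_text, count = limit_link_speed(f.read())
    with open(path, "w") as f:
        f.write(new_text)
    print(f"rewrote {count} max-link-speed propert{'y' if count == 1 else 'ies'}")
```

Keep a copy of the original dtb so you can boot it again if the edited one misbehaves.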


RE: NVMe-related crashes and instability, plus a solution - kuleszdl - 10-12-2020

You might be also interested in the related thread for the rockpro64 (same SoC) discussing a similar issue and comparing performance between the modes:

https://forum.pine64.org/showthread.php?tid=8374


RE: NVMe-related crashes and instability, plus a solution - Jojonintendo - 10-13-2020

Thank you both for your replies. It seems my NVMe drive can draw as much as 6W in PS 0. I understand that's still within spec for the PBP, but this is definitely something I need to change anyway for better battery life, etc. The changes to the dtb didn't stop the kernel panics for me.

I have tried to set the drive to PS 2, but it doesn't get applied; the drive is consistently in PS 4 until some load makes it jump briefly to other states. While this confirms that APST is operational, I would like to be able to limit the PS it can use anyway. I have tried to disable it with the boot parameter "nvme_core.default_ps_max_latency_us=0", but this doesn't seem to change anything. Is there another way to disable APST? Or am I going in the wrong direction with this?

Edit: it works in the end. Setting the kernel parameter does indeed disable APST, which by default leaves the drive in its most powerful state. However, after that I can correctly force PS 2 and have a working system that no longer crashes no matter the workload (I've been trying to crash it for 2-3 hours now with heavy IO). Even setting it to PS 1 still works, but I don't notice any improvement in responsiveness or peak performance, so I prefer to stay at ~3W (PS 2) vs 4.2W (PS 1). If the PBP is plugged into the wall, then I can use PS 0 as much as I want and it doesn't crash, so it definitely was a power issue.
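Since it was not obvious at first whether the boot parameter had been applied, here is a small Python sketch that checks a kernel command line for it. The function name is made up for illustration; /proc/cmdline is the standard place to read the command line of the running kernel.

```python
PARAM = "nvme_core.default_ps_max_latency_us"


def apst_latency_limit(cmdline):
    """Return the parameter's integer value from a kernel command line, or None if absent."""
    value = None
    for token in cmdline.split():
        key, sep, raw = token.partition("=")
        if sep and key == PARAM:
            try:
                value = int(raw)  # if the parameter is repeated, the last one is kept
            except ValueError:
                pass  # ignore a value that is not a plain integer
    return value


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        limit = apst_latency_limit(f.read())
    if limit is None:
        print("parameter not on the command line; driver default applies")
    elif limit == 0:
        print("APST disabled via the command line")
    else:
        print(f"APST latency limit set to {limit} us")
```

If the script reports the parameter as missing after a reboot, the bootloader configuration was most likely not updated.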


RE: NVMe-related crashes and instability, plus a solution - xmixahlx - 10-13-2020

i've had no issue with nvme with my wdc sn550 512gb. it has very low power usage (<3W peak), although i run it at ps1 generally, anyway. i think a better solution to the nvme problem is using a drive with low power needs.


RE: NVMe-related crashes and instability, plus a solution - nostro - 12-09-2020

Hey guys,
I recently put an NVMe SSD in my PBP (Intel 660p M.2 1TB) and am experiencing failures after copying larger amounts of data. The device just disappears from the device lists, and it takes some reboots before it resurfaces.
I've tried copying at lower speeds, but that doesn't seem to do the trick either.
I stumbled on this post and thought I could give it a try, but I'm not really sure how to go about it, not being that proficient in rebuilding kernels et al. Can someone maybe give some pointers on how to go about this?
I'm currently running Manjaro ARM 20.10 with kernel 5.9.9-2.
Does this solution mean you have to tinker every time the kernel updates?

Thanks a lot


RE: NVMe-related crashes and instability, plus a solution - HitsuMaruku - 01-24-2021

(12-09-2020, 01:41 AM)nostro Wrote: I recently put an NVMe-ssd in my PBP (Intel 660p M.2 1TB) and experiencing failures after copying larger amount of data. The device just disappears from the device lists, and it takes some reboots before it resurfaces.

Hey Nostro. As mentioned by JojoNintendo above, take a look into the max latency setting. I did some testing myself over on this thread for my Intel 660p 2TB, where I found setting the NVMe max latency resolved my issues with the NVMe disappearing, with and without the PCIe max-link-speed changes. Yours isn't the exact same model, so it may or may not resolve your issues, but saw your post and figured I'd ping it for reference. Good luck!