NVMe-related crashes and instability, plus a solution
#1
After installing an NVMe SSD in my Pinebook Pro I began to see Linux crashing periodically with output like the following:

Code:
[    7.153982] SError Interrupt on CPU2, code 0xbf000002 -- SError
[    7.153986] CPU: 2 PID: 169 Comm: udevd Not tainted 5.8.1-gnu #1
[    7.153988] Hardware name: PINE64 Pinebook Pro (DT)
[    7.153989] pstate: 20000005 (nzCv daif -PAN -UAO BTYPE=--)
[    7.153991] pc : nvme_submit_cmd+0x11c/0x130
[    7.153992] lr : nvme_queue_rq+0x43c/0x6b8
[    7.153993] sp : ffff80001409b6f0
[    7.153995] x29: ffff80001409b6f0 x28: ffff0000f4716000
[    7.153998] x27: 0000000000000000 x26: 0000000000001000
[    7.154002] x25: 0000000000000001 x24: 0000000000001000
[    7.154004] x23: ffff0000eff62000 x22: 0000000000000000
[    7.154007] x21: 0000000000000001 x20: ffff0000f4536a40
[    7.154010] x19: ffff800010d1a000 x18: 0000000000000000
[    7.154014] x17: 0000000000000000 x16: 0000000000000000
[    7.154016] x15: 0000000000000000 x14: 0000000000000000
[    7.154019] x13: 0000000000000000 x12: ffff800010226c88
[    7.154022] x11: 0000000000000000 x10: 0000000000000000
[    7.154025] x9 : 0000000000000000 x8 : ffffffffffffffff
[    7.154028] x7 : 00000000e929d000 x6 : 00000000e929d000
[    7.154031] x5 : 0000000007ef7ac9 x4 : 0000000000000006
[    7.154034] x3 : 0000000000000000 x2 : 0000000780000007
[    7.154037] x1 : ffff0000f4536a48 x0 : 0000000000000000
[    7.154040] Kernel panic - not syncing: Asynchronous SError Interrupt
[    7.154042] CPU: 2 PID: 169 Comm: udevd Not tainted 5.8.1-gnu #1
[    7.154044] Hardware name: PINE64 Pinebook Pro (DT)
[    7.154044] Call trace:
[    7.154046]  dump_backtrace+0x0/0x1d8
[    7.154047]  show_stack+0x14/0x20
[    7.154048]  dump_stack+0xbc/0xf8
[    7.154049]  panic+0x150/0x348
[    7.154050]  add_taint+0x0/0xa8
[    7.154051]  arm64_serror_panic+0x74/0x80
[    7.154053]  do_serror+0x6c/0x168
[    7.154054]  el1_error+0x84/0x100
[    7.154055]  nvme_submit_cmd+0x11c/0x130
[    7.154056]  nvme_queue_rq+0x43c/0x6b8
[    7.154058]  __blk_mq_try_issue_directly+0x104/0x230
[    7.154059]  blk_mq_request_issue_directly+0x50/0x100
[    7.154061]  blk_mq_try_issue_list_directly+0x58/0xe8
[    7.154062]  blk_mq_sched_insert_requests+0xe0/0x150
[    7.154064]  blk_mq_flush_plug_list+0x11c/0x188
[    7.154065]  blk_flush_plug_list+0xd8/0x108
[    7.154066]  blk_finish_plug+0x30/0xa0
[    7.154067]  read_pages+0x154/0x290
[    7.154069]  page_cache_readahead_unbounded+0x160/0x220
[    7.154070]  __do_page_cache_readahead+0x34/0x48
[    7.154072]  force_page_cache_readahead+0xb4/0x108
[    7.154073]  page_cache_sync_readahead+0xe4/0xf0
[    7.154074]  generic_file_buffered_read+0x5d8/0xa28
[    7.154076]  generic_file_read_iter+0xd0/0x180
[    7.154077]  blkdev_read_iter+0x38/0x48
[    7.154079]  new_sync_read+0xec/0x188
[    7.154080]  vfs_read+0x1bc/0x1d0
[    7.154081]  ksys_read+0x68/0xf8
[    7.154082]  __arm64_sys_read+0x14/0x20
[    7.154083]  do_el0_svc+0x68/0xd0
[    7.154084]  el0_sync_handler+0x16c/0x2a0
[    7.154086]  el0_sync+0x140/0x180
[    7.154112] SMP: stopping secondary CPUs
[    7.154113] Kernel Offset: disabled
[    7.154114] CPU features: 0x200022,01006008
[    7.154116] Memory Limit: none

The crashes became more and more frequent until eventually the system would fail to boot most times. The exact backtrace varied, but it always referenced the NVMe driver and indicated an "asynchronous system error", pointing to an issue with the hardware itself.

After some research, I've found the solution is to remove this line from the Pinebook Pro device tree:

Code:
max-link-speed = <2>;

Since building a new kernel with this change I've yet to see a single crash from the NVMe driver and the system appears completely stable.

What this change does is stop the Linux PCIe driver from trying to operate the PCIe link above 2.5 GT/s, the default rate for RK3399-based devices and the maximum rate Rockchip themselves claim the SoC will support. It seems the RK3399 was originally designed to operate its PCIe bus at the higher "gen 2" speed (5 GT/s), but since the SoC's release the company has downgraded its specifications, as (I assume) variances in manufacturing resulted in many parts proving unstable at that speed, as my Pinebook Pro demonstrates.
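
To check whether the change took effect, the negotiated link rate can be read back from sysfs. The following is only a rough sketch: it assumes a kernel that exposes the current_link_speed attribute for PCI devices, and the function names are made up for illustration.

```python
from pathlib import Path

GEN1_GTS = 2.5  # PCIe gen 1 signalling rate in GT/s


def parse_link_speed(text: str) -> float:
    """Parse a sysfs link-speed string such as '2.5 GT/s PCIe' into GT/s."""
    return float(text.split()[0])


def report_link_speeds(root: str = "/sys/bus/pci/devices") -> None:
    """Print the negotiated link speed of every PCI device that reports one."""
    for dev in sorted(Path(root).glob("*")):
        attr = dev / "current_link_speed"
        if not attr.exists():
            continue
        try:
            speed = parse_link_speed(attr.read_text())
        except (OSError, ValueError, IndexError):
            continue  # attribute unreadable, empty, or not a number
        status = "within gen 1" if speed <= GEN1_GTS else "above gen 1"
        print(f"{dev.name}: {speed} GT/s ({status})")
```

Calling report_link_speeds() after booting with the modified device tree should list the NVMe drive at 2.5 GT/s if the limit is being applied.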

I suspect this may be the cause of many of the NVMe-related issues other forum members are experiencing, particularly when failures are intermittent or the drive is known to work in other machines.

In fact, between this and the 2.0 GHz CPU frequency (also unsupported by Rockchip) that is enabled in the kernels most people are using, most Pinebook Pros have been running out-of-spec by default, which I find remarkable. I have to think this has something to do with the uneven experiences people are reporting with the machine, as well as the general lack of reliability you sense skimming the posts in this forum.

In any case, if your Pinebook Pro seems to be having trouble using an NVMe drive, try bringing it back within the manufacturer's specifications by removing the line above from the device tree (and reverting the 2.0 GHz patch, if you've been using it) and building a new kernel. You may find the problems you've been experiencing disappear completely.
#2
>Since building a new kernel with this change
Speaking as someone who has (poorly) edited dtbs: compiling a kernel
is not necessary. Just get dtc and learn how to use it.
Granted, it is harder to modify an existing dtb than to start from scratch...
Thanks for finding this.
#3
@simonsouth - great detective work. Thanks.

I have no such problems whatsoever but anyway it's good to know.

As for the overclock: in my experience the problem is more the voltage than the CPU frequency. As I wrote in another thread, mine is running super stable at 1.7/2.18 GHz, but it is undervolted compared to the original overclock: 1.15/1.25 V (originally 1.3 V).

It looks like Pinebook Pros are more prone to the "silicon lottery" than other products. A great example is display-enabled U-Boot. Whereas I had almost no problems with it, most people have severe problems booting newer kernels. Why? The only explanation that comes to my mind is differences in hardware quality.

BTW, nobody wants to test your PWM kernel patch :) so I will probably do it on the weekend. Although, as I said, I had only minor problems booting newer kernels.
I suspect your patch might also fix the screen power management problems when using display-enabled U-Boot: when the screen goes off, you just can't bring it back anymore. That's on the Plasma DE.
And there's also the kexec problem: screen distortions after kexec. This might also have something to do with PWM.
#4
Hi guys. Many thanks for this discovery; the issue described is exactly the one I've been struggling with. Sometimes, even when just sitting on the GNOME DE without doing anything, the system hard crashes, but of course it is much easier to trigger by opening heavy apps. I'll try to recompile the kernel and see if that fixes it.
#5
>I'll try to recompile the kernel
You don't need to do this.
The setting is in the dtb, so get dtc (the device tree compiler) and learn how to use it; it's not hard.
This is a simple edit. What is a LOT harder is memory timing, GPIO mapping, and regulator settings.
When you decompile, you get a dts. Search for max-link-speed = <2>; and replace it with max-link-speed = <1>;
then recompile.
This may or may not solve your problem. I assume you have checked power consumption?
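
For anyone who wants to script the decompile, edit, and recompile steps, here is a minimal sketch. It assumes dtc is installed and on the PATH; the function names and file paths are placeholders, not anything shipped by a distribution. Note that dtc usually prints cell values in hex when decompiling, so the pattern accepts both <2> and forms like <0x02>.

```python
import re
import subprocess
from pathlib import Path

# Accept decimal or hex spellings of the value 2, with flexible whitespace.
LINK_SPEED_RE = re.compile(r"max-link-speed\s*=\s*<\s*(?:0x0*2|2)\s*>\s*;")


def limit_link_speed(dts_text: str) -> str:
    """Return the DTS text with max-link-speed = <2> rewritten to <1>."""
    return LINK_SPEED_RE.sub("max-link-speed = <1>;", dts_text)


def patch_dtb(src_dtb: str, work_dts: str, dst_dtb: str) -> None:
    """Decompile src_dtb, patch the link speed, and recompile into dst_dtb."""
    # Decompile the binary device tree into editable source.
    subprocess.run(
        ["dtc", "-I", "dtb", "-O", "dts", "-o", work_dts, src_dtb],
        check=True,
    )
    # Apply the one-line edit to the decompiled source.
    dts_path = Path(work_dts)
    dts_path.write_text(limit_link_speed(dts_path.read_text()))
    # Compile the edited source back into a binary device tree.
    subprocess.run(
        ["dtc", "-I", "dts", "-O", "dtb", "-o", dst_dtb, work_dts],
        check=True,
    )
```

Keep a copy of the original dtb before replacing it, so the machine can still be booted if the edited file turns out to be wrong.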
#6
You might also be interested in the related thread for the RockPro64 (same SoC), which discusses a similar issue and compares performance between the modes:

https://forum.pine64.org/showthread.php?tid=8374
#7
Thank you both for your replies. It seems my NVMe drive can draw as much as 6 W in power state 0. I understand that's still within spec for the PBP, but it is definitely something I need to change anyway for better battery life, etc. The changes to the dtb didn't stop the kernel panics for me.

I have tried to set the drive to PS 2, but the setting doesn't get applied: the drive sits consistently in PS 4 until some load makes it jump briefly to other states. While this confirms that APST is operational, I would still like to be able to limit the power states it can use. I have tried to disable it with the boot parameter "nvme_core.default_ps_max_latency_us=0", but this doesn't seem to change anything. Is there another way to disable APST? Or am I going in the wrong direction with this?

Edit: it works in the end. Setting the kernel parameter does indeed disable APST, which leaves the drive in its most powerful state by default. After that, however, I can force PS 2 and have a working system that no longer crashes no matter the workload (I've been trying to crash it for 2-3 hours now with heavy I/O). Even setting it to PS 1 still works, but I don't notice any improvement in responsiveness or peak performance, so I prefer to stay at ~3 W (PS 2) rather than 4.2 W (PS 1). If the PBP is plugged into the wall, I can use PS 0 as much as I want and it doesn't crash, so it definitely was a power issue.
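
Since it wasn't obvious at first whether the boot parameter had been picked up, here is a small sketch for checking it. It only inspects the kernel command line in /proc/cmdline; the helper name is made up for illustration, and it says nothing about which power state the drive is actually in.

```python
from pathlib import Path
from typing import Optional

PARAM = "nvme_core.default_ps_max_latency_us"


def apst_latency_from_cmdline(cmdline: str) -> Optional[int]:
    """Return the parameter's value from a kernel command line, or None if absent."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == PARAM and sep:
            value = int(val)  # keep the last occurrence found
    return value


if __name__ == "__main__":
    cmdline_path = Path("/proc/cmdline")
    if cmdline_path.exists():
        latency = apst_latency_from_cmdline(cmdline_path.read_text())
        if latency is None:
            print("parameter not set; the driver default applies")
        elif latency == 0:
            print("APST disabled by boot parameter")
        else:
            print(f"APST limited to a maximum latency of {latency} us")
```

If the script reports that the parameter is not set, the bootloader configuration is the first place to look.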
#8
I've had no issues with NVMe on my WDC SN550 512 GB. It has very low power usage (<3 W peak), although I generally run it at PS 1 anyway. I think a better solution to the NVMe problem is using a drive with low power needs.
#9
(09-30-2020, 02:18 PM)simonsouth Wrote: After installing an NVMe SSD in my Pinebook Pro I began to see Linux crashing periodically [...]

Hey guys,
I recently put an NVMe SSD in my PBP (Intel 660p M.2 1TB) and am experiencing failures after copying larger amounts of data. The device just disappears from the device lists, and it takes some reboots before it resurfaces.
I've tried copying at lower speeds, but that doesn't seem to do the trick either.
I stumbled on this post and thought I could give it a try, but I'm not really sure how to go about it, not being that proficient in rebuilding kernels et al. Can someone maybe give some pointers on how to go about this?
I'm currently running Manjaro ARM 20.10 with kernel 5.9.9-2.
Does this solution mean you have to tinker every time the kernel updates?

Thanks a lot
#10
(12-09-2020, 01:41 AM)nostro Wrote: I recently put an NVMe-ssd in my PBP (Intel 660p M.2 1TB) and experiencing failures after copying larger amount of data. The device just disappears from the device lists, and it takes some reboots before it resurfaces.

Hey Nostro. As mentioned by JojoNintendo above, take a look at the max latency setting. I did some testing myself over on this thread for my Intel 660p 2TB, where I found that setting the NVMe max latency resolved my issues with the NVMe drive disappearing, both with and without the PCIe max-link-speed changes. Yours isn't the exact same model, so it may or may not resolve your issues, but I saw your post and figured I'd ping it for reference. Good luck!