NVMe drive disappears after about an hour of uptime
#11
(10-01-2020, 09:52 AM)simonsouth Wrote:
(09-19-2020, 03:02 PM)codebreaker Wrote: This is the one that hangs all the time. Not sure what to do next.

This is the sort of intermittent failure I suspect may be caused by the PCIe driver trying to operate the link at a speed higher than the RK3399 can reliably support.

I've written about this in another thread. If you remove the "max-link-speed" override from the device tree your system is using, does it improve the situation any?

Any news on this one? I'd like to try it, but I don't have a clue how to go about it. Any tips?
  Reply
#12
dtc (the device tree compiler) is not that hard to use:
copy the PBP dtb to a temporary working directory,
decompile, edit, recompile, rename the original to *.old, and copy the changed dtb into its place.
If you are clever, just run dtc with no parameters; otherwise see man dtc
(to understand how to build the command line),
or search for device tree documentation.
I would make it explicit rather than just commenting it out, i.e. max-link-speed = <1>;
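
Roughly, assuming a Manjaro-style dtb path of /boot/dtbs/rockchip/rk3399-pinebook-pro.dtb (yours may live elsewhere), the workflow looks like this:
Code:
# keep a backup of the original blob
cp /boot/dtbs/rockchip/rk3399-pinebook-pro.dtb /tmp/pbp.dtb.old
# decompile to source
dtc -I dtb -O dts -o /tmp/pbp.dts /tmp/pbp.dtb.old
# edit /tmp/pbp.dts: in the pcie node, change max-link-speed = <2>; to max-link-speed = <1>;
# recompile to a blob
dtc -I dts -O dtb -o /tmp/pbp.dtb /tmp/pbp.dts
# install the modified blob in place of the original, then reboot
cp /tmp/pbp.dtb /boot/dtbs/rockchip/rk3399-pinebook-pro.dtb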
  Reply
#13
I happened upon this thread, because I also bought and installed the Intel 660p 2TB. I've updated the wiki's "Pinebook Pro Hardware Accessory Compatibility" for its power states, and added some additional info on the discussion page.

I've also experienced the aforementioned disappearing act, and it appears to happen under sustained, heavy I/O. It happens even after forcing power state 2 (2.6 W max), changing the DTB to revert max-link-speed from 2 to 1, and ensuring the barrel power cord was charging. Under some circumstances (I'm not sure which), lspci -vv also returns no entries, but this does not always happen.

The workload was a restic restore from a USB HDD backup to the NVMe storage. I restored a limited subset of files as a test and the copy succeeded; however, when I attempted to restore the full repo (about 650 GiB, expanding to about 750 GiB across roughly 150k files), the device disappeared after a few hours. During that time, the battery level fluctuated a bit around 80% charge and then slowly drained to about 50%, all while connected to mains power.

Rebooting and recharging do not seem to fix the problem. I've only been able to get it to show again by leaving it overnight, lending more credibility to the temperature theory.

I'm open to additional suggestions. I wanted to post this to confirm that I tried the max-link-speed suggestion, and also to see if anyone else has further ideas.
  Reply
#14
Quite frankly, if the NVMe drive doesn't become accessible after a reboot, the only root cause that seems reasonable is the NVMe drive overheating.  However, once powered off it should cool down rather quickly, which contradicts the need for leaving the laptop turned off overnight.

Furthermore, the NVMe drive should throttle itself down when its temperature reaches a certain threshold, instead of overheating and becoming inaccessible.  As a result, the drive shouldn't shut down and become inaccessible due to overheating, but that's actually the issue at hand. :/

The only thing that comes to mind is to monitor the temperature of the drive while reproducing the issue.
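
If nvme-cli is installed, something along these lines could log the composite temperature every few seconds while the load is being reproduced (assuming the drive shows up as /dev/nvme0):
Code:
# poll the drive's SMART temperature readings every 10 seconds
while true; do
    date
    sudo nvme smart-log /dev/nvme0 | grep -i temperature
    sleep 10
done | tee nvme-temp.log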
  Reply
#15
I wasn't able to confirm the root cause, but I did get it to show back up by removing and reinserting both the drive itself and the ribbon cable into both slots. Nothing seemed out of the ordinary, but I think there was a post mentioning that their ribbon cable was pinched, which may have affected it.

The ribbon cable itself doesn't actually fit the angle and distance it's designed for; there's always a hump/bulge in the middle of it, because the length doesn't allow it to lie flat. I tried laying it in a way that would prevent it from being pinched by anything, but we'll see.

I'll give it a couple more tries this week and next, and if it still disappears, I might just return it and try another one.
  Reply
#16
Since my last failure, I booted a few times with the NVMe not reappearing. I opened up my PBP and re-seated the NVMe and the ribbon for the adapter. On booting up the system, the NVMe was there again. I attempted another overnight restic restore from my USB external drive to the NVMe. As expected, it again drew too much power and had powered off by the morning. I looked at the journalctl logs, and while the NVMe errors didn't coincide with the shutdown itself, I found this error:
Code:
nvme0: failed to set APST feature (-19)

Which led me to these pages:
18.04 and 18.10 fail to boot nvme0: failed to set APST feature (-19)
EXT4-fs error after Ubuntu 17.04 upgrade

These suggest that APST causes errors with NVMe drives in Linux due to a known bug. I didn't give the bug more than a cursory glance, but the workaround is to set a maximum exit latency, which effectively disables any idle power states whose exit latency exceeds that threshold:
[Solved] Can't start array after adding 2 NVME drives to the config
Fixing NVME SSD Problems On Linux

These show what to change in GRUB, but the PBP uses U-Boot. After looking around a little, it looks like you can edit the /boot/extlinux/extlinux.conf file instead. Add the following to the end of the APPEND line:
Code:
nvme_core.default_ps_max_latency_us=5500
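
For illustration, the resulting entry might look something like this (the label, kernel image, initrd and root= values are just placeholders and will differ on your install; the only change is the extra parameter at the end of APPEND):
Code:
LABEL Manjaro ARM
    KERNEL /Image
    INITRD /initramfs-linux.img
    APPEND root=/dev/mmcblk2p2 rw rootwait nvme_core.default_ps_max_latency_us=5500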

According to the Solid state drive/NVMe article, the max latency setting disables any power states with an exlat above that value. So for the Intel 660p 2TB, we have the following power states, according to "nvme id-ctrl":
Code:
ps    0 : mp:5.50W operational enlat:0 exlat:0 rrt:0 rrl:0
ps    1 : mp:3.60W operational enlat:0 exlat:0 rrt:1 rrl:1
ps    2 : mp:2.60W operational enlat:0 exlat:0 rrt:2 rrl:2
ps    3 : mp:0.0300W non-operational enlat:5000 exlat:5000 rrt:3 rrl:3
ps    4 : mp:0.0040W non-operational enlat:5000 exlat:9000 rrt:4 rrl:4

This means that setting a value of 5500 would disable PS 4 but leave PS 3 enabled. Setting a value of 0 effectively disables APST entirely (and checking NVMe feature 0x0c confirms this). I set mine to 0 for now.
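
To double-check what the kernel actually programmed, the APST feature can be read back with nvme-cli (assuming the controller is /dev/nvme0):
Code:
# feature 0x0c is Autonomous Power State Transition; -H asks nvme-cli to decode it
sudo nvme get-feature /dev/nvme0 -f 0x0c -H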

Overall, after researching this and configuring it, I tried restarting. I was able to confirm the above, but the NVMe controller still did not load (as was my previous issue). I still had the barrel power plugged in, so I tried disconnecting it and rebooting again; voila, the NVMe was there again. So it may be something about having power plugged in on boot, or insufficient power on boot (it was just running from a USB port on a US wall outlet), or it may be completely coincidental.

I have not yet tried another overnight run. I suspect this will stop the NVMe from disappearing during use, but that the system will still draw more power than the battery can replace quickly enough to be sustained. This also doesn't definitively rule out a thermal issue. My return deadline is coming up, so I may just return it, get a portable USB drive, and revisit this in the future.

Hope this helps someone!
  Reply
#17
It is really strange that reinserting the ribbon cable makes the NVMe drive appear again, but it may very well be caused by the excessive length of the ribbon cable.  Which revision of the NVMe adapter are you using, the first revision (with the wider PCB), or the second revision (which has a narrow PCB)?

In a pinch, you could try applying some hot glue to the ends of the ribbon cable, where they meet the PCB connectors, to stop the cable from accidentally becoming loose.  I know, it may look messy, but I've seen hot glue even in some high-end products, so it should be good enough for the PineBook Pro. :)

I've continued the investigation you started, and it seems that APST (Autonomous Power State Transition) issues with NVMe drives, unfortunately, are pretty much here to stay, depending on the actual NVMe drive and the system it is used in.

Could you, please, run the tests with all power states disabled, so we can establish that as a stable configuration?  After that, I'd suggest changing the "nvme_core.default_ps_max_latency_us" setting so that only the lowest power state is disabled, and re-running the tests.  If that ends up in instability, change the latency setting so that the two lowest power states are disabled, re-run the tests, and so on; see the sketch below.
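
Based on the exit latencies quoted above, the staged values might look roughly like this (just a sketch derived from the exlat figures in your nvme id-ctrl output):
Code:
# baseline: all APST transitions disabled
nvme_core.default_ps_max_latency_us=0
# only PS 4 disabled (its exlat of 9000 exceeds 5500; PS 3 at 5000 stays enabled)
nvme_core.default_ps_max_latency_us=5500
# PS 3 and PS 4 both disabled (both exlat values exceed 4999)
nvme_core.default_ps_max_latency_us=4999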

I know it's a lot of work, but it should be helpful for other users.  I've also added a link to this thread to the PineBook Pro wiki page(s).

Edit: According to the Linux kernel NVMe driver source, only a single Toshiba NVMe SSD has confirmed APST-related issues.  However, we clearly see that more drives are affected on certain systems.  By the way, please make sure that the PCIe link in your PineBook Pro runs at Gen1 speed, as described here.
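
One quick way to verify the negotiated speed is to look at the LnkSta line in lspci's verbose output, where 2.5 GT/s corresponds to Gen1 and 5 GT/s to Gen2 (the PCI address below is only an example; check the first command's output for the actual one):
Code:
# find the NVMe controller's PCI address
lspci | grep -i 'non-volatile memory'
# show the link capability and the currently negotiated link status
sudo lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'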

Edit #2: You may also try using pure alcohol to clean the ends of the ribbon cable.  Before applying hot glue, of course. :)
  Reply
#18
Thanks for all the feedback! I likely won't try all of your suggestions, simply because I don't have as much free time as I'd like. But I did try the overnight restic restore again, with two changes: 1) set the max latency to 0 (disabling APST), and 2) used my 5 V/3 A USB-C charger instead of the barrel charger (I think the barrel was only getting 5 V/2.4 A max from that outlet). The result: no missing NVMe drive overnight, evidence of continued operation, and still a nearly full battery (whereas before it would have shut off around 4 am). So the culprit appears to be either the lowest power state (as you surmise) or APST itself.

(01-17-2021, 01:04 AM)dsimic Wrote: Edit: According to the Linux kernel NVMe driver source, only a single Toshiba NVMe SSD has confirmed APST-related issues.  However, we clearly see that more drives are affected on certain systems.  By the way, please make sure that the PCIe link in your PineBook Pro runs at Gen1 speed, as described here.

I did actually! I had already decompiled the device tree, changed the max link speed, and recompiled it, with no change in the issue. I might try reverting this to Gen2 speeds to rule it out as a possible red herring (i.e. maybe APST is the only problem and Gen2 speeds are fine). For what it's worth, here's my SMART report. Notice that the temperature has come nowhere near critical conditions: 37 Celsius is the highest I've seen, and it reports no critical warnings (I presume that's what that field indicates), so temperature doesn't appear to be an issue here.

Code:
smartctl 7.1 2019-12-30 r5022 [aarch64-linux-5.7.19-1-MANJARO-ARM] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       INTEL SSDPEKNW020T8
Serial Number:                      <redacted>
Firmware Version:                   004C
PCI Vendor/Subsystem ID:            0x8086
IEEE OUI Identifier:                <redacted>
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,048,408,248,320 [2.04 TB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Sun Jan 17 12:42:27 2021 EST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     77 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
0 +     5.50W       -        -    0  0  0  0        0       0
1 +     3.60W       -        -    1  1  1  1        0       0
2 +     2.60W       -        -    2  2  2  2        0       0
3 -   0.0300W       -        -    3  3  3  3     5000    5000
4 -   0.0040W       -        -    4  4  4  4     5000    9000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        36 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    6,929 [3.54 GB]
Data Units Written:                 3,842,148 [1.96 TB]
Host Read Commands:                 245,426
Host Write Commands:                15,698,447
Controller Busy Time:               1,535
Power Cycles:                       22
Power On Hours:                     79
Unsafe Shutdowns:                   4
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Thermal Temp. 1 Transition Count:   4
Thermal Temp. 1 Total Time:         79

Error Information (NVMe Log 0x01, max 256 entries)
No Errors Logged
  Reply
#19
I've noticed some interesting data in the SMART report you've provided.  Here's an excerpt from the SMART data:

Code:
Thermal Temp. 1 Transition Count:   4
Thermal Temp. 1 Total Time:         79

According to the description of NVMe thermal throttling management, this SMART data says that your drive has spent some time in a light throttling state since the last power-on event (those numbers shouldn't be lifetime counts).  That doesn't make much sense, because the drive worked well in your last overnight test.

Could the drive be somehow defective?  Just guessing.

Edit: After checking the NVMe specification (pages 123-124) and the source code of smartmontools, I can confirm the above-stated meaning for those two SMART values.  By the way, the value for total time is in seconds.
  Reply
#20
(01-17-2021, 01:39 PM)dsimic Wrote: It is really strange that reinserting the ribbon cable makes the NVMe drive appear again, but it may very well be caused by the excessive length of the ribbon cable.  Which revision of the NVMe adapter are you using, the first revision (with the wider PCB), or the second revision (which has a narrow PCB)?

I meant to answer this before and had forgotten. Mine is the second revision, I believe: whichever one replaced the original that needed an adjustment to fit (i.e. mine worked out of the box).


(01-17-2021, 01:39 PM)dsimic Wrote: According to the description of NVMe thermal throttling management, this SMART data says that your drive has spent some time in a light throttling state since the last power-on event (those numbers shouldn't be lifetime counts).  That doesn't make much sense, because the drive worked well in your last overnight test.

Could the drive be somehow defective?  Just guessing.

Edit: After checking the NVMe specification (pages 123-124) and the source code of smartmontools, I can confirm the above-stated meaning for those two SMART values.  By the way, the value for total time is in seconds.

Nice finds! I took a glance at the documentation. What is it that makes you say they shouldn't be lifetime counts? I see nothing claiming one way or the other, but I do see: "A value of 0h, indicates that this transition has never occurred or this field is not implemented." In particular, "never occurred", as opposed to something like "has not occurred since power-on", would suggest that it's more likely a lifetime count (I'm just basing this on the wording; I'm mostly unfamiliar with this technology).

If these are lifetime counts, they could refer to thermal throttling from when I first installed the drive, before I had set up the power state configuration. Further, this SSD model does not support saving the power state configuration on the drive itself, so I have to use a systemd task to set PS 2 on every boot (sketched below); I would guess this potentially leaves room for the "vendor specific thermal management actions" mentioned in the documentation.
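
For context, a boot-time task along these lines would do it (a sketch only; the unit name, device path and exact nvme-cli invocation are illustrative):
Code:
# /etc/systemd/system/nvme-ps2.service
[Unit]
Description=Set NVMe power state 2 at boot

[Service]
Type=oneshot
# feature 0x02 is Power Management; a value of 2 selects power state 2
ExecStart=/usr/bin/nvme set-feature /dev/nvme0 -f 0x02 -v 2

[Install]
WantedBy=multi-user.target

Enabled once with "systemctl enable nvme-ps2.service".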

It's also unclear exactly what those thermal thresholds are. I don't see anything in my SMART report indicating which temperatures are involved, so while PS 2 is the only power state currently available to it, I'm guessing the threshold may be low enough that the drive is bumping those counters without actually switching power states.
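
If the drive supports host-controlled thermal management, the thresholds might be readable with nvme-cli (just a guess on my part; feature 0x10 is Host Controlled Thermal Management, and the hctma/mntmt/mxtmt fields in id-ctrl indicate whether and over what range it is supported):
Code:
# check whether the controller advertises HCTM and its allowed temperature range
sudo nvme id-ctrl /dev/nvme0 | grep -E 'hctma|mntmt|mxtmt'
# if supported, read the current thermal management thresholds (TMT1/TMT2)
sudo nvme get-feature /dev/nvme0 -f 0x10 -H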
  Reply

