PINE64
eMMC failed - Printable Version


eMMC failed - Dendrocalamus64 - 10-17-2021

The OEM 64GB Sandisk eMMC in my first Pinebook Pro just failed from one moment to the next. I'd been using that machine heavily, for everything, including daily web browsing with the cache on the emmc, for about two years.

I'd never used anything with an emmc, sd card or ssd before, so my expectations for reliability were set by traditional spinning platter disks, where you could leave a good one running for a decade with no problems. Looking up emmcs & sd cards now, I had no idea they were so incredibly unreliable and prone to catastrophic failure with no warning at all.

The immediate takeaway is to back up your user data daily. The OS can be reinstalled, and including it adds a lot of bulk to the backup; if the backup is slow, you're less likely to do it regularly. Ideally don't just trust sd cards for the backups; you should have network-attached storage with real disks. And turn off web browser disk cache, or put it on a ram disk. I also did a fair amount of compiling, which generates a massive amount of disk writes on large projects, and I didn't switch to zram instead of disk swap until recently.

I was running Manjaro Xfce from the sd card, with the emmc mounted for storage. I tried to save a file in Mousepad, and it said the file was read-only. Checked perms; it should be writable by me. I tried to create a new file in Thunar, knowing I shouldn't have to do that, and it said the file system was read-only. That was the "oh shit" moment because I know linux remounts file systems read-only when an I/O error occurs.

I checked the dmesg, and it was full of CQE recovery failed messages for the mmc. The filesystem was still navigable due to caching, but trying to read a file, even with less, would result in the command locking up. On the web, it looked like rebooting solved similar errors temporarily, so I rebooted. After reboot, lsblk showed ~30MB capacity for the emmc instead of the expected 58.9G, and the sd card was flagged read-only in short order; the system would not allow remounting the sd card rw.
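For anyone wanting to repeat those checks, here is a minimal sketch. The helper name list_ro_mounts is made up for illustration; it only parses a mounts table in the /proc/mounts layout (device, mount point, fs type, options) and prints the entries mounted read-only. The commented commands are the standard dmesg and lsblk calls; the exact kernel message text will vary.

```shell
# Print "device mountpoint" for every entry whose option list contains "ro".
# $1: path to a mounts table in /proc/mounts format (defaults to /proc/mounts).
# list_ro_mounts is a hypothetical helper name, not part of any package.
list_ro_mounts() {
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $1, $2; break }
    }' "${1:-/proc/mounts}"
}

# Related checks on a live system (dmesg may need root):
#   dmesg | grep -i 'cqe recovery'     # controller errors like the ones above
#   lsblk -b -o NAME,SIZE,RO           # reported size in bytes and read-only flag
```

Running list_ro_mounts with no argument reads /proc/mounts directly; a filesystem that the kernel has remounted read-only after an I/O error should show up in its output.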

I attempted to reboot again, and the system wouldn't boot. I now know that the boot priority on the PBP always starts with the emmc, and the bootloader on the emmc is supposed to check for bootable media on the sd card, so if the emmc bootloader is out of sorts, you have to flip the emmc disable switch or pull the emmc in order to boot.

I put the pulled emmc on the pine store-supplied emmc-to-usb adapter, and tried it in a second PBP. The USB mass storage device lists as 0b capacity, and reports "Medium not present". testdisk isn't able to read it.

Looking at recovery options, there are at least four levels you can access these devices on. The highest level is a usb adapter, where the simple circuitry on the adapter presents it to the system as a generic block device. Next is putting it into an emmc socket, where it reports as /dev/mmcblkX, and you can manipulate it as an mmc. I'm going to see what I can do with that next. The ideal testbed for that is a single-board computer in an open enclosure, so you don't have to open your running PBP and flip the switch back on.

Then there's JTAG. I still need to read about that.

And finally, there's reading the NAND directly. This is the best single thread I've found about it so far, including the linked pages and PDFs:
Which NAND flash reader ?
https://web.archive.org/web/20211017144203/https://forum.hddguru.com/viewtopic.php?f=10&t=33785

It looks like tons of people lose their data on these every year, and data recovery is a solved problem. But, it's all proprietary. Third-party companies have developed the tools & software to read the raw NAND, and sell them at high prices. Sending your chip to commercial data recovery would be privacy suicide; there's no way to guarantee the company doesn't keep a copy, and some just make nand dumps and ftp them to other countries for outsourced processing without telling the customer. The data is usually all there, including deleted files, but most people don't get it recovered.

There should be an open source solution for emmcs widely used by the open source community, like these sandisk emmcs are now.

Steps:
1. Acquire a bunch of emmcs for practice, document the process of exposing & connecting to the nand interface, and develop a training curriculum, like the commercial vendors have, that affected users can follow at home to develop the hardware skills.
2. Start with a commercial nand dumper. Eventually it can be replaced with an open source device at lower cost.
3. Develop the software & procedure for the specific combination of the Sandisk hardware & likely linux filesystems. That is less work than having to support all mmcs from all manufacturers, and all operating systems.

Next to do for me: Try the mmcblk interface, then start looking at how much of step (3) has already been done.

All Pine devices should have socketed emmcs, not hard-soldered. A common failure mode is that a phone dies for some other reason, and the emmc is still good, but the user never gets it desoldered, so the data is lost anyway. Socketed emmcs are a major step forward compared to the usual way of doing it.

Most of the time, my system wasn't hitting the swap, but I was experimenting with different thread counts during builds to balance compile speed against the system running out of memory in bottlenecks. Just two threads could result in the system swapping massively when the make process was attempting to build two large source files concurrently. That may account for a lot of the reduced life expectancy. Nonetheless, properly designed solid state storage should remain readable in a read-only mode when it runs out of writes, and it appears that common emmcs do not.


RE: eMMC failed - KC9UDX - 10-17-2021

To me, it just doesn't matter what the media is. I've had too many shocking sudden hard drive failures over the years. I've been very lucky with SD cards and eMMCs. I was very skeptical of them but I really abuse them and have only had one or two failures in years. Backups are just necessary. I regularly back up everything I have to two separate hard drives (sequential duplicates, each one is everything) which are stored in a fire safe in a cool dry place. Well not everything. I also have a bunch of QICs and Video 8s, and floppy disks, which are all somewhat worrisome.


RE: eMMC failed - Dendrocalamus64 - 10-20-2021

I compiled mmc-utils from,
https://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc-utils-old.git/

and tried reading the eMMC Life Time Estimation for the 64GB eMMC in my second PBP. As far as I remember, I've barely used that one. Yet it says,
Code:
eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x09
eMMC Life Time Estimation B [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B]: 0x01

According to the wear estimate, the type A memory is already 80-90% of the way through its lifespan (0x01 means 0-10% used, so 0x09 means 80-90%). At best, the wear estimate may not be accurate or useful.
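The raw values can be turned into ranges with a small helper. This is a sketch based on my reading of the DEVICE_LIFE_TIME_EST_TYP_A/B table in JESD84-B51 (0x01 to 0x0A are 10% steps, 0x0B means the estimate has been exceeded); the function name decode_life_time is made up for illustration.

```shell
# Map an EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A/B value to a human-readable range.
# $1: the value as printed by mmc-utils, e.g. 0x09
# decode_life_time is a hypothetical helper name, not part of mmc-utils.
decode_life_time() {
    val=$(( $1 ))    # arithmetic expansion accepts hex such as 0x09
    if [ "$val" -eq 0 ]; then
        echo "not defined"
    elif [ "$val" -ge 1 ] && [ "$val" -le 10 ]; then
        echo "$(( (val - 1) * 10 ))-$(( val * 10 ))% of estimated life used"
    elif [ "$val" -eq 11 ]; then
        echo "exceeded maximum estimated life"
    else
        echo "reserved"
    fi
}
```

For example, decode_life_time 0x09 reports the 80-90% range and decode_life_time 0x01 reports 0-10%.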

Quote:The second factor we explored is storage heterogeneity. Some flash-based storage devices combine different types of flash memories. The faster, more expensive memory has a higher lifetime, and is used sparingly for storing hot data and caching purposes [37]. For such hardware architectures, eMMC supports two different wear-out indicators, one for each memory type (labeled “Type A” and “Type B” ). Note that these different memory types are managed by FTL, and presented to the OS as a single device. The distinction is only visible at the level of these differentiated wear indicators. [...] We also note that “Type A” memory wears out much faster under high utilization setups, while “Type B” memories wear-out much slower.

The quote is from the paper "Flash Drive Lifespan *is* a Problem", which found that it's easy for malware to intentionally brick a mobile device by wearing out the emmc. It doesn't need any special permissions, since the volume of writes to the disk isn't rationed.

Results for the 128GB emmc in my Rockpro64:
Code:
eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x01
eMMC Life Time Estimation B [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B]: 0x01
Both in 0-10% range.


RE: eMMC failed - TRS-80 - 10-29-2021

Armbian have been using zram as default for years already, exactly because of reasons you learned (sounds like the hard way) here.


RE: eMMC failed - Dendrocalamus64 - 11-03-2021

The eMMC in my PBP #2 went from reporting
Code:
eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x09
to
Code:
eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x0b
in a few days of little use. Since 0x01 is 0-10%, it went from 80-90% to 100-110% estimated lifespan used.

That was after I switched to zram & turned off web browser disk cache. This made me wonder if there was something left in the background writing heavily to it.

We can use dumpe2fs to check the total data ever written to the filesystem.
Rockpro64 128GB emmc root fs: Lifetime writes: 687 GB
PBP2 64GB emmc root fs: Lifetime writes: 58 GB
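
A minimal sketch of pulling that figure out of the dumpe2fs header. The helper name lifetime_writes and the partition name /dev/mmcblk2p1 are assumptions for illustration; substitute whatever partition holds your ext4 filesystem.

```shell
# Print the "Lifetime writes" value from dumpe2fs header text read on stdin.
# lifetime_writes is a hypothetical helper name, not part of e2fsprogs.
lifetime_writes() {
    sed -n 's/^Lifetime writes:[[:space:]]*//p'
}

# Usage on a live system (reading the device typically needs root; the
# partition name is an assumption):
#   sudo dumpe2fs -h /dev/mmcblk2p1 2>/dev/null | lifetime_writes
```

Note that this counter belongs to the filesystem, so it only covers writes since that filesystem was created and says nothing about other partitions on the same emmc.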

IIRC PBP2 has the same filesystem it came with. People typically assume an emmc can endure around 3000 write cycles, which would be 192 TB written to a 64 GB card.
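
The arithmetic behind that, as a rough sketch. The 3000-cycle figure is the common assumption mentioned above, not a datasheet value for this part, and the function names are made up for illustration.

```shell
# Total endurance in TB (decimal): capacity in GB times assumed write cycles.
# $1 = capacity in GB, $2 = assumed program/erase cycles (an assumption)
endurance_tb() {
    echo $(( $1 * $2 / 1000 ))
}

# Share of that endurance already consumed, in percent.
# $1 = GB written so far, $2 = capacity in GB, $3 = assumed cycles
percent_of_endurance() {
    awk -v w="$1" -v c="$2" -v n="$3" \
        'BEGIN { printf "%.3f\n", 100 * w / (c * n) }'
}

endurance_tb 64 3000              # prints 192
percent_of_endurance 58 64 3000   # prints 0.030
```

So 58 GB of filesystem writes is about 0.03% of the assumed 192 TB. The filesystem counter doesn't include write amplification inside the emmc or writes outside that filesystem, so the real figure will be higher, but the gap compared with a life estimate in the 80-110% range is very large.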

I don't think it's just swap; some of them fail faster than others, and people should keep an eye on the life estimates in addition to making backups.


RE: eMMC failed - Dendrocalamus64 - 11-09-2021

Failure may not be related to nand wear at all. This was a recurring problem on the Samsung Galaxy S3 and Note 2 eight years ago.

eMMC sudden death research
Quote:Hi, my S3 bricked and even a JTAG could not save it. Yes, the eMMC was bricked at the very low level.
Samsung replaced my board and i checked it is now running 0xf7 revision, the sammy engineer also told me this is a safe fw immune to that superbrick. After further questioning and hardcore probing - the engineer revealed that the eMMC fw of 0xf1 has a bug in its wear leveling algorithm, which causes the sector containing the BIOS to be damaged, and this fw will fix that.

There can be a bug in the firmware code running on the emmc controller which corrupts its internal data. Sudden failures of SD cards may be caused by a similar problem. I've just read about plenty of those on photography boards. They are now designing pro cameras with two slots so they can mirror the data across two cards to reduce the risk of data loss from sudden failure.

The most recent version of the JEDEC eMMC spec I've so far found for free download:
https://tuxdoc.com/downloadFile/jesd84-b51_pdf

Code:
JESD84-B51A     Jan 2019
JESD84-B51      Feb 2015  <-- This one



RE: eMMC failed - lot378 - 11-09-2021

Which file system was used on the failed eMMC and what options were used in fstab?


RE: eMMC failed - Dendrocalamus64 - 11-11-2021

Manjaro defaults. ext4 (rw,relatime)


RE: eMMC failed - lot378 - 11-12-2021

Well, the file system choice is a smaller part of the overall picture and your goal here is data recovery rather than figuring out how it went wrong. For certain, data loss is no fun. For ext4 I might have preferred to see noatime but relatime is acceptable -- writes are writes and the eMMC was doomed either way. And I think the Pine64 community will see this happen more and more as time passes, especially with PinePhone, where it is still a bit "wild west" and awareness is minimal, with no concern yet about wear rates or file system choices.

Tesla have had issues with worn out NAND eMMCs in their MCU on cars older than 2018. It seemed to be due to excessive log writes.

See: https://teslaowners.org.uk/kb/emmc-chip-failure-what-is-it-should-i-be-worried-what-can-i-do


RE: eMMC failed - Dendrocalamus64 - 11-12-2021

(11-12-2021, 08:48 AM)lot378 Wrote: and your goal here is data recovery rather than figuring out how it went wrong.

It's both. I want to know what happened. And I am still very skeptical that it was due to nand wear. There were none of the gradual slowdown & steadily worsening problems that Tesla & Raspberry Pi users have reported. And I have seen many reports of SD cards failing in the same way after little or no use. I think people just assume all failures are due to nand wear and don't realize that firmware bugs can axe all their data without warning.

Currently waiting on testbed hardware to arrive so I can start probing it.

BTW, emmc spec section 6.6.33 - Dynamic Capacity Management: the host and card are supposed to cooperate to gradually reduce the reported size of the medium as blocks wear out & need to be retired. I don't get any web search results about whether linux supports this yet, so maybe it doesn't. It should. If not, the DYNCAP_NEEDED field, which shows the number of groups that need to be released (and is easily readable with mmc-utils), will just keep incrementing, which may or may not turn out to be another sign of flash wear.

Code:
sudo mmc extcsd read /dev/mmcblk2 | grep DYNCAP

Also, here is some interesting reading on JTAG.
https://blog.senr.io/blog/jtag-explained

The emmc itself should be open source. It is too important a component to be left secret & proprietary. We should be running open mmc firmware code with everything fully documented.