(11-25-2020, 04:48 PM)DrYak Wrote: Sorry, my bad, I should have written it more explicitly:
4KiB doesn't matter that much nowadays with regards to write amplification (read-modify-write cycles).
(i.e.: R-M-Ws will always happen when overwriting in place a sub-part of an erase block).
Of course, it would matter to at least align the 4KiB blocks of an ext partition with the 4KiB sectors, for read performance (it is faster to read a single sector than to read 2 sectors and merge them).
(and given that the erase blocks are a super-set of this sector layer, aligning to erase blocks to alleviate R-M-Ws will also get you aligned to sector boundaries anyway).
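For anyone who wants to verify this on their own device, a minimal sketch (assuming the eMMC shows up as /dev/mmcblk2 as in the benchmark posts further down, and using a 4 MiB boundary as a safe superset of typical erase-block sizes, which vary per card):
# start of the first partition, in 512-byte sectors
cat /sys/block/mmcblk2/mmcblk2p1/start
# a value divisible by 8192 means the partition starts on a 4 MiB boundary,
# which covers both the 4 KiB sectors and common erase-block sizes
# parted can also check alignment against the optimal I/O size reported by the device:
parted /dev/mmcblk2 align-check optimal 1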
But wouldn't a 4kB sector that is unaligned require writing two new blocks in an erase block, while an aligned sector would only require one?
It would be nice to be able to measure block erase count and total writes. Then a benchmark could be run with aligned and unaligned fs blocks and the WAF measured. It looks like the JEDEC standard eMMC health reporting has little detail, so one would need a massive benchmark that used a significant portion of the total eMMC lifetime. There is a Micron extension that could work. Looks like the PBP ships with a SanDisk eMMC.
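For reference, the standard JEDEC health fields can be read with mmc-utils; this is only the coarse reporting mentioned above (life-time estimates in 10% steps plus a pre-EOL flag), not a per-block erase counter. A sketch, assuming the eMMC is /dev/mmcblk2, an eMMC 5.0+ part, and the field names of a recent mmc-utils:
# dump the extended CSD registers and pick out the health-related fields
mmc extcsd read /dev/mmcblk2 | grep -i -e 'LIFE_TIME' -e 'EOL'
# DEVICE_LIFE_TIME_EST_TYP_A/B report wear in steps of 10%, PRE_EOL_INFO reports reserved-block status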
Here is the command I used to measure random IO before and after aligning the partition with ext4. It would be interesting to know how f2fs compares.
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --filename=test.fio --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 --name=aligned-ext4
this belief:
Quote:That's why log-structured filesystems (like F2FS) or copy-on-write filesystems (BTRFS, ZFS, BCacheFS) are better for flash:
They *never* modify/overwrite in place (which will invariably trigger a read-modify-write). They *always exclusively* append writes by definition (which can be done write-only, by allocating fresh free erased blocks from the wear-levelling pool).
contradicts your own previous paragraph:
Quote:Trying to overwrite and change in-place any content smaller than that size will always trigger a read-modify-write cycle. (with the small detail that the erase block that gets "read" and "erased", and the erase block that actually gets "written with the modified version", might be two different blocks in order to rotate which blocks get erased and spread the wear).
and it's not a "small detail", it's essential, and it's exactly what makes all these F2FS and the like unnecessary for anything except bare NAND - because the internal controller manages the internal flash storage and only it can do wear leveling for real. at best, that "flash friendly" FS does nothing bad to the card. the FS has no clue about the internal "geometry" of the storage and thus cannot do anything helpful with respect to wear leveling in particular; it writes into logical LBA space and all its "super duper log based/cow" features apply to that, and that has nothing in common with the internals of the flash.
(11-25-2020, 06:04 PM)z4v4l Wrote: contradicts to your own previous paragraph:
No, it's two different things.
- A. The append-only nature of log-structured (F2FS, UDF, etc.) and copy-on-write filesystems (BTRFS, ZFS, etc.) means that there is less in-place writing (compared to appending), and therefore less erasing overall. You literally avoid performing erases (which wear the flash) and avoid read-modify-write cycles (which are bad for performance). You get fewer erases in total for the same amount of operations. (At the cost of filesystem complexity and/or RAM usage.)
- B. That the flash management will put an erase block back into the pool after an erase, and will perform the writing into a different erase block pulled from the pool, will spread the wear. You get the same number of erases (not fewer), you just avoid the erases constantly happening on the exact same erase block and thus wearing a single point prematurely. You still need to read-modify-and-write: no matter if you write elsewhere, you get the performance penalty.
A. is about reduction of erases and reduction of RMW cycles.
B. is about avoiding the erases always happening in the same point.
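A back-of-envelope illustration of point A, assuming a 4 MiB erase block and 4 KiB filesystem blocks (the real erase-block size varies per card):
# worst case for an in-place overwrite: the controller erases and rewrites a whole 4 MiB block for 4 KiB of payload
echo $(( (4 * 1024 * 1024) / (4 * 1024) ))   # -> 1024x write amplification
# append-only: those same 1024 writes of 4 KiB fill exactly one freshly erased block, so one erase in total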
(11-25-2020, 06:04 PM)z4v4l Wrote: and it's not a "small detail", it's essential, and it's exactly what makes all these F2FS and the like unnecessary for anything except bare NAND - because the internal controller manages the internal flash storage and only it can do wear leveling for real.
You're confusing modern flash filesystems (F2FS, UDF) with the "alikes" (JFFS, JFFS2, YAFFS, UBIFS, etc.).
- The point of modern filesystems is to reduce the number of erases and in-place overwrites (and thus RMW cycles) (point A above). They can work on flash that has an advanced controller, and in the case of F2FS was designed with such a controller in mind to begin with. (F2FS only reduces the writes. It doesn't care about the actual wear levelling, that is left to the controller. It just makes the controller's job easier by making it less often necessary to erase blocks or perform R-M-W.) UDF specifically can reduce the in-place overwrites to absolute zero and work in "append-only" mode on media that are write-once (like CD-R, DVD±R) or that can only be erased as a whole (like CD-RW), which is why it became a popular successor to ISO9660 for optical media.
These modern filesystems can even work on things which aren't flash to begin with, but which don't like in-place overwrites much (the prime example being spinning-rust HDDs that use a "shingled" type of magnetic track arrangement).
- The old-school filesystems are designed with bare flash in mind and handle the wear-levelling themselves. Not only do they try reducing erases and in-place modifications, they also take care to make sure that the erases and writes happen at different places in flash and are evenly spread across the medium (and in the case of UBIFS, there's an entirely separate layer - UBI - dedicated to the task; see the sketch just below). If a R-M-W is necessary, the filesystem itself makes sure that the write part is performed at a different position in the raw flash. It basically performs point B above, but entirely in software. This is a concern that is entirely absent in F2FS.
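To make the contrast concrete, this is roughly what the old-school stack looks like on raw NAND exposed as an MTD device; the device numbers, volume name and size are made up for illustration, and none of it applies to an eMMC or SD card whose controller already hides the raw flash:
# attach UBI (the wear-levelling layer) to the raw NAND behind /dev/mtd0
ubiattach /dev/ubi_ctrl -m 0
# create a volume inside the UBI device, then mount UBIFS on top of it
ubimkvol /dev/ubi0 -N rootfs -s 128MiB
mount -t ubifs ubi0:rootfs /mnt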
As an example, if you only manipulate a couple of files, a few KiB each:
- the old-school flash filesystems will make sure that they have cycled through the whole raw flash before reusing a block.
- BTRFS will basically allocate only 2 data chunks max and happily append to each in turn while releasing the other (I am oversimplifying, and skipping the metadata chunks, etc.). It will be the role of the flash controller to make sure that each time a chunk is released and then reallocated, a different group of erase blocks is picked from the pool (you can watch that chunk allocation with the commands sketched below).
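If anyone wants to watch that chunk behaviour, btrfs can report how its chunks are allocated; a sketch, assuming a BTRFS filesystem mounted at /mnt:
# how much space is allocated to data/metadata/system chunks vs. actually used in them
btrfs filesystem df /mnt
# more detailed view, including the unallocated space that stays available for new chunks
btrfs filesystem usage /mnt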
Quote:at best, that "flash friendly" FS does nothing bad to the card.
At *worst*.
At *best* a "flash friendly" FS makes sure that you run a lot fewer erase and/or R-M-W cycles than when performing the same operations on FAT32/exFAT.
Quote:the FS has no clue about the internal "geometry" of the storage
True for F2FS, UDF, BTRFS, ZFS, BCacheFS, etc.
Nope for the older ones: JFFS, JFFS2 and YAFFS2 assume a "raw" geometry (UBIFS is a bit different. It assumes UBI, and it's UBI's job to handle the raw flash. There's a separation of concerns).
Quote:and thus cannot do anything helpful with respect to wear leveling in particular; it writes into logical LBA space and all its "super duper log based/cow" features apply to that, and that has nothing in common with the internals of the flash.
For the absolute specifics of wear leveling: indeed, it doesn't influence it directly. But that's not its job. F2FS doesn't care directly about the low-level details of wear leveling. The job of F2FS is to organise data in such a way that you seldom need to erase a block and seldom need to modify data in place.
That has the indirect effect that the internal controller will need to perform read-modify-write cycles less often.
In short:
- old-school flash filesystems are all about making sure that writes explicitly go to different addresses in LBA space: make sure every single LBA has been used at least once before recycling any of them.
- modern filesystems are all about making sure to erase less, mostly by avoiding repeated writes to an LBA sitting in the middle of two other already-written LBAs, and preferably writing to an LBA in a region that was issued a TRIM beforehand (see the fstrim example below). If all of this only happens within the same 4GiB chunk of a 128GiB eMMC: they don't care.
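Which is also why keeping TRIM running matters on eMMC: it is what tells the controller which LBAs it may put back into the erased pool. A sketch; on most distros the periodic timer is preferred over the "discard" mount option:
# one-off: report to the device which blocks the mounted filesystems no longer use
fstrim -v /
fstrim -v /home
# periodic TRIM on systemd-based distros (runs fstrim on a schedule, typically weekly)
systemctl enable --now fstrim.timer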
(11-29-2020, 11:33 AM)DrYak Wrote: - The append-only nature of log-structured (F2FS, UDF, etc.) and copy-on-write filesystems (BTRFS, ZFS, etc.) means that there is less in-place writing (compared to appending), and therefore less erasing overall. You literally avoid performing erases (which wear the flash) and avoid read-modify-write cycles (which are bad for performance).
How about that performance? Since you have flash reformatted to F2FS, try the benchmark I posted earlier. I think it would be interesting to see how F2FS compares to ext4 with aligned blocks vs ext4 with unaligned blocks.
(11-30-2020, 04:31 PM)xyzzy Wrote: How about that performance? Since you have flash reformatted to F2FS, try the benchmark I posted earlier. I think it would be interesting to see how F2FS compares to ext4 with aligned blocks vs ext4 with unaligned blocks.
I have an Arch Linux install from scratch with ext4 partitions. I used https://github.com/cosmos72/fstransform to convert my data filesystem to f2fs. The root filesystem stayed on ext4; one day I'll migrate it too. So:
/dev/mmcblk2p2 on /home type f2fs (rw,relatime,lazytime,background_gc=on,discard,no_heap,user_xattr,inline_xattr,acl,inline_data,inline_dentry,flush_merge,extent_cache,mode=adaptive,active_logs=6,alloc_mode=default,fsync_mode=posix)
┌─[remy@ecaz][~/mes_docs]
└»»[$]fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --filename=test.fio --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 --name=f2fs
f2fs: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.23
Starting 1 process
f2fs: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=15.9MiB/s,w=5478KiB/s][r=4080,w=1369 IOPS][eta 00m:00s]
f2fs: (groupid=0, jobs=1): err= 0: pid=937275: Tue Dec 1 06:15:06 2020
read: IOPS=3521, BW=13.8MiB/s (14.4MB/s)(3070MiB/223187msec)
bw ( KiB/s): min= 9416, max=18704, per=100.00%, avg=14095.10, stdev=1180.16, samples=445
iops : min= 2354, max= 4676, avg=3523.64, stdev=295.07, samples=445
write: IOPS=1176, BW=4707KiB/s (4820kB/s)(1026MiB/223187msec); 0 zone resets
bw ( KiB/s): min= 3128, max= 6432, per=100.00%, avg=4710.35, stdev=438.33, samples=445
iops : min= 782, max= 1608, avg=1177.40, stdev=109.62, samples=445
cpu : usr=8.41%, sys=40.23%, ctx=1414441, majf=0, minf=17
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=13.8MiB/s (14.4MB/s), 13.8MiB/s-13.8MiB/s (14.4MB/s-14.4MB/s), io=3070MiB (3219MB), run=223187-223187msec
WRITE: bw=4707KiB/s (4820kB/s), 4707KiB/s-4707KiB/s (4820kB/s-4820kB/s), io=1026MiB (1076MB), run=223187-223187msec
Disk stats (read/write):
mmcblk2: ios=784214/262403, merge=1415/198, ticks=10430260/3649931, in_queue=14080224, util=100.00%
Looks like the filesystem doesn't matter that much. Here's what I got for ext4 after aligning the partition correctly.
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.23
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=14.7MiB/s,w=5232KiB/s][r=3766,w=1308 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=2029: Mon Nov 23 18:21:28 2020
read: IOPS=3549, BW=13.9MiB/s (14.5MB/s)(3070MiB/221433msec)
bw ( KiB/s): min= 4656, max=17064, per=100.00%, avg=14267.87, stdev=523.80, samples=440
iops : min= 1164, max= 4266, avg=3566.76, stdev=130.95, samples=440
write: IOPS=1186, BW=4745KiB/s (4859kB/s)(1026MiB/221433msec); 0 zone resets
bw ( KiB/s): min= 1712, max= 5712, per=100.00%, avg=4768.40, stdev=260.98, samples=440
iops : min= 428, max= 1428, avg=1191.96, stdev=65.26, samples=440
cpu : usr=6.37%, sys=30.10%, ctx=1602898, majf=0, minf=16
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=13.9MiB/s (14.5MB/s), 13.9MiB/s-13.9MiB/s (14.5MB/s-14.5MB/s), io=3070MiB (3219MB), run=221433-221433msec
WRITE: bw=4745KiB/s (4859kB/s), 4745KiB/s-4745KiB/s (4859kB/s-4859kB/s), io=1026MiB (1076MB), run=221433-221433msec
Disk stats (read/write):
mmcblk2: ios=784912/262623, merge=1152/193, ticks=9997099/3905482, in_queue=13902784, util=99.58%