ZFS 2.2.1: Block Cloning disabled due to data corruption (github.com/openzfs)
135 points by turrini on Nov 22, 2023 | 64 comments


Reporter of #15526 here. I've still only ever seen this corruption in files being built, which seems to be a situation in which the race can be won repeatably by some packages. I reviewed all my user data and found no corruption in it, but my workloads don't involve block clones.


Don't forget how `zfs send | zfs receive` could cause both pools to be corrupted, reported in

- https://github.com/openzfs/zfs/issues/15140 : supposed to be from flushing largepages

- https://github.com/openzfs/zfs/issues/15275 fixed by https://github.com/openzfs/zfs/issues/15464 : linked to block cloning

Block cloning is a prime suspect.


Second reporter in the bug here. I actually have some block-cloned data (e.g. Steam uses it for all Proton installs), but I've seen no issues other than the Go build one.


Coming back home and trying out 2.2.1, ZFS instantly started spewing write and checksum errors due to #15533. Both this and #15526 seem to have underlying issues from before 2.2 that are just more easily triggered now. The first one has also been confirmed on FreeBSD now.

Holding off on 2.2 seems recommended, and if you're keeping critical data on OpenZFS it might be a good idea to give the issues a glance. The second one might have the same root cause as an issue that has given me system freezes when closing in on ~90% pool usage on 2.1 on top of LUKS.


I was glad for this feature to land, but being conservative with my main storage array, I decided not to immediately jump on this. Prudence pays off I guess.

That said, for my main workstation, I plan to migrate to bcachefs very quickly once it's mainlined. I haven't done enough introspection to be able to tell you why I can hold both opinions in my head so well.


Kent does plenty of things right during the development process, but having a decade of experience contributing to OpenZFS/ZFSOnLinux, I can say that I will be very surprised if everything goes as well as people seem to think. There are many bugs in a complex code base, and a simple 10x increase in userbase will undoubtedly result in more reports of people hitting them.

That said, I think he is laying a good foundation (from what I have heard/read about his development process), but there are many times when the best of us have been confident in code that turned out to have problems and I doubt he is an exception to this.


> he is laying a good foundation (from what I have heard/read about his development process)

Could you please share some blog or resource to read a bit about his dev process myself? That stuff usually interests me, to improve my own process.


It is mostly from external observation. He seems to be the reason that bcachefs isn't being shipped in production today, as he is trying to work through all of his backlog rather than shipping it in a WIP state. This is very different from how developers of a certain filesystem in Linux's kernel source tree that I shall not name do things. If I recall correctly, a developer of that filesystem told me several years ago that if they did not ship the code for users to use, they would not find bugs. That filesystem is not much less buggy today than it was back then. :/


btrfs


So bcachefs is built on the basis of bcache, which existed on its own as a block layer and ran in production for many years. This is layering that doesn't really exist in other filesystems. Not sure if that's what the parent meant.


That is not quite what I meant. Not shipping premature code that you know needs more work is something to be admired. It is something I wish more developers would do.

Anyway, it has been a while since I read anything about bcachefs, but what I have read struck me as being consistent with doing things well. For example, he is working on having an automated test suite in place before he ships it, which is a great thing to see:

https://lwn.net/Articles/934692/


> That said, I think he is laying a good foundation (from what I have heard/read about his development process), but there are many times when the best of us have been confident in code that turned out to have problems and I doubt he is an exception to this.

For mixed storages, bcachefs will be interesting.

For RAID1, another option is any filesystem over dm-integrity over mdadm: dm-integrity can protect against silent file corruption if used below the mdadm level. Any filesystem reading inconsistent data + checksum at the dm-integrity level causes dm-integrity to return an EILSEQ to mdadm, which should then recover the data from the mirrors.

It's done with a cryptsetup step, and explained on https://gist.github.com/MawKKe/caa2bbf7edcc072129d73b61ae781...

Main advantages:

1. it's a "bring your own filesystem" (ex: XFS, EXT4): you add protection against bitrot to any filesystem

2. dm-integrity may be newer, but mdadm and XFS have a large user base, making them well tested.

3. you can test this approach by simulating bitflips on the underlying data device, reading from the /dev/mapper entry, reading files themselves, doing a scrub etc.

4. you can select other algorithms besides crc32: in the rare case the error couldn't be seen by crc32 (which is likely already applied at the hardware level), you gain an extra layer of safety
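For anyone curious, the setup itself is short. A rough sketch, assuming two empty partitions and the default crc32c checksum (device names are examples; the gist above covers caveats like the initial formatting time):

    # Example devices: /dev/sda1 and /dev/sdb1.
    # integritysetup ships with the cryptsetup package.
    integritysetup format /dev/sda1 --integrity crc32c
    integritysetup format /dev/sdb1 --integrity crc32c
    integritysetup open /dev/sda1 int-a --integrity crc32c
    integritysetup open /dev/sdb1 int-b --integrity crc32c

    # Mirror the integrity devices with mdadm...
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/mapper/int-a /dev/mapper/int-b

    # ...and bring your own filesystem on top.
    mkfs.xfs /dev/md0

A checksum mismatch then surfaces to md as a read error on that leg, which is the "recover data from the mirrors" part above.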


The level of simplicity of ZFS has no equivalent in Linux; the whole LVM stack is a fuckup from a usability perspective: half a dozen distinct commands working on different layers that, more often than not, are purely conceptual. ZFS has been around for almost 20 years now. It was production ready more than 10 years ago, when it was open-sourced. Having your systems 1 or 2 versions behind the mainline is good practice in every mission-critical piece of software.


> Having your systems 1 or 2 versions behind the mainline is good practice in every mission-critical piece of software.

I do that, and I've even personally experienced a few rare ZFS bugs that seem due to interactions between ZFS and Western Digital firmware.

Still, I was caught unprepared: I have several backups not stored on ZFS, but all of them were made FROM a ZFS source, meaning they are all now suspect, since silent corruption has been possible since version 2.1.4, and maybe even earlier.

ZFS is practical to use, but for now I think I'll keep a history of file checksums, the way it was done before filesystem-level bitrot protection.
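Even a plain recursive checksum pass covers that. A minimal sketch, assuming the data lives under /data (paths are examples):

    # Record checksums for everything under /data (example path).
    find /data -type f -print0 | xargs -0 sha256sum > /root/checksums.txt

    # Later, re-verify; only mismatches and missing files are reported.
    sha256sum --check --quiet /root/checksums.txt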


Still, that could happen with any other FS. Silent corruption is actually far more common than winning the lottery, so in that regard ZFS is a good step toward reducing those odds, even if they aren't zero (as advertised).

I'm also assuming those backups aren't actually ZFS streams (from zfs send|receive) which is a special case of "bugs biting you twice" :P


I just benefited from ZFS boot environments/snapshots during the update of FreeBSD from 13.2 to 14.0-RELEASE. It's a great tool to have at your disposal.


As someone who just installed OpenZFS on Linux, I agree with you totally.

ZFS just is, no need to worry about layering device mappers and LVM. The automount is also very nice.
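For comparison, a mirrored pool plus a compressed dataset is two commands, with no partitioning, mdadm, LVM, or fstab steps in between (pool and dataset names are made up):

    # Creates and mounts the pool at /tank automatically.
    zpool create tank mirror /dev/sda /dev/sdb

    # Creates the dataset; it appears at /tank/home immediately.
    zfs create -o compression=lz4 tank/home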


I'd probably give it a release or three (or 5). Still plenty of development actively happening despite it being mostly just Kent.

But yes I plan to use it soon-ish as well.


I had the exact same reaction. Saw the massive list of new features and decided to wait for .1 or .2 releases to shake out the bugs. Seems like patience is paying off as always.


Same boat. While they keep shipping impressive features, I've been bitten far too many times by btrfs and ZFS.


I've heard of and experienced plenty of problems with btrfs, but it's pretty rare that I hear of problems with ZFS (aside from the current article, obviously); what problems have you hit? (As a very heavy user of ZFS, I would very much like to know about possible problems before they bite me.)


I have two identical Debian 11 on ZFS servers with ZFS encryption enabled. On one pool on one machine it will start reporting errors after about 1 week of uptime. The errors are caused by failure to decrypt within the Kernel Cryptography Framework. No read/write/cksum errors are reported for any of the zpool devices. Running a scrub finds no errors and clears the errors. The presence of the errors causes my snapshot replication to fail so I need to reboot the server weekly and run a scrub. One person has reported that updating to 2.1.13 fixed the issue, IIRC. That version hasn't been released to Debian 11 yet and that version also included a commit which removed the KCF kstat code so I wouldn't be able to monitor the KCF errors anymore.

Another issue I ran into is when SQLite is run in synchronous mode (the default) with WAL (not the default, but recommended) and the default locking mode (so it creates an -shm file), and the -shm file is stored on a ZFS filesystem. SQLite frequently executes ftruncate on the -shm file when different processes access the same SQLite database, and for some reason ZFS can cause the ftruncate call to block until the txg timeout, which is usually 5 seconds [0]. If you were running a program which records every shell command you run to a SQLite database, for example, that would cause a 5 second hang before any shell command would execute [1]. The workaround is to disable synchronous writes in SQLite or sync on the ZFS dataset, which is probably safe because of ZFS's atomicity guarantees.
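To make the workaround concrete, it is one of these two (dataset name is an example; this only restates the options above, it isn't a recommendation):

    # Option 1: disable synchronous writes on the dataset holding the database.
    zfs set sync=disabled tank/sqlite

    # Option 2: have the application relax durability on each SQLite
    # connection instead, i.e. run this pragma after opening the database:
    #   PRAGMA synchronous = OFF;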

Those are two examples I have run into recently. I'm sure people run into issues with e.g. ext4 as well but I think they are a bit more frequent with ZFS on Linux, especially if you use more fringe features like encryption.

[0]: https://github.com/openzfs/zfs/issues/14290

[1]: https://github.com/atuinsh/atuin/issues/952


Unfortunately, encryption never was quite as robust as the rest of ZFS. That said, things have become much better in the past year, as there have been targeted efforts focused on improving it, which found and fixed multiple bugs.

That said, most of the developers are likely not using it on their own systems, which probably allowed bugs to stick around abnormally long compared to the rest of the code base (although some really ancient bugs were found last year in other areas).


Since you asked me: it was many years ago now (5+), but it hurt. I lost a 16TB cluster to btrfs; the RAID just failed for no apparent reason (this wasn't uncommon back then, according to forums). And on ZFS it was a kernel bug where, on boot, it would mount ZFS after waiting for root, so root never appeared. (This was due to ZFS running as a user-space driver on Linux back when it wasn't yet in the kernel.)


You can find many corruption issues on the OpenZFS bug tracker. The btrfs corruptions are a bit of a meme now, and everyone has their own story of a failure long ago (old system, unsupported feature), so it's quite hard to really compare how stable they are in realistic scenarios.


Here's an example of a nasty data corruption bug in ZFS from not that long ago: https://github.com/openzfs/zfs/issues/7401

Corresponding HN discussion at the time: https://news.ycombinator.com/item?id=16797644


That was 5 years ago. What would you consider to be a long time ago?


This wasn't the only block cloning bug. They also completely broke cp from an unencrypted dataset to an encrypted one by cloning unencrypted blocks into the encrypted filesystem.


I reported maybe the first instance of that (#15275), and lost my home pool (but was able to get my files back). It was not a good weekend.


NOTE: This situation is worse than it seemed when this thread landed on HN; see also https://news.ycombinator.com/item?id=38405731


This is why you should always have proper backups, and a great reminder that snapshots are not backups.


Except that in this case your backups might have already been corrupted... because the first version of the file was already messed up.


I was excited to try ZFS on a separate disk with Ubuntu, until I screwed up my Windows boot... guess it will be a while before I get to try it while I fix my mess.


Am I the only one waiting for a real write cache on SSDs (not just RAM)?


ZFS is undeniably impressive – I first started using it back in 2009 with OpenSolaris. This year, motivated mainly by the power consumption of my previous OmniOS setup, I made the move to the "dark side" with Synology. After more than a decade of relying on ZFS, I must admit, the transition to something less rigid but still robust has been quite refreshing.

It could be a subjective feeling, but the recent years of OpenZFS development have reminded me a bit of the OpenStack experience. It seemed like almost anyone could contribute, sometimes resulting in features that were questionable in terms of stability, development, and overall thoughtfulness. Perhaps this is why iXsystems has taken a more cautious (albeit slower) approach in enabling new features in TrueNAS.


I'm in nearly the same boat as you.

I've been running my current 32TB ZFS storage array on an HP DL380 in my basement. The power consumption and noise are both incredible.

I'm closely watching Black Friday/Cyber Monday deals, hoping for something on the Synology DS1522+.


I quite like my DS1819+. Not exactly a powerhouse but it's rock solid and has zero drama.


What I liked about Solaris, and illumos-based OSes, is the Time Slider built into the file manager. Would love to see something like that in FreeBSD and/or Linux.


How the heck did this ever make it through? I don't think this ever would have happened pre-OpenZFS, and undermines the stable reputation of ZFS. OpenZFS needs to do better and look to FreeBSD developers, not Linux developers, as role models.


>How the heck did this ever make it through?

Because it was introduced by a FreeBSD developer, and FreeBSD's fs and cp are a generation behind Linux's so they weren't even able to hit this.

>I don't think this ever would have happened pre-OpenZFS, and undermines the stable reputation of ZFS.

Of course it happened pre-OpenZFS: https://blog.lastinfirstout.net/2010/04/bit-by-bug-data-loss...

From the people that brought you Slowlaris.

>OpenZFS needs to do better and look to FreeBSD developers, not Linux developers, as role models.

This was introduced by a FreeBSD developer and was caught by a Linux one. You need to rethink your role models.


>Because it was introduced by a FreeBSD developer, and FreeBSD's fs and cp are a generation behind Linux's so they weren't even able to hit this.

Honestly, this feature could have used more testing on both platforms.

Also, FreeBSD's VFS is theoretically able to support reflinks across datasets, while Linux's VFS disallows that. It is not really behind.


>FreeBSD's VFS in theoretically able to support reflinks across datasets while Linux's VFS disallows that.

I don't believe so; you're still not able to hardlink across filesystems in BSD.

>It is not really behind.

It is behind: https://github.com/openzfs/zfs/issues/405


>Because it was introduced by a FreeBSD developer, and FreeBSD's fs and cp are a generation behind Linux's

This is often regarded as a good thing in the BSD world. In all fairness, this bug just reinforces that notion.

And Solaris (at least on amd64) wasn't slow. It was demanding, and not really suited for the desktop. And it was picky about hardware.

Totally agree on the rest of your comment, though.


>This is often regarded as a good thing in the BSD world. In all fairness, this bug just reinforces that notion.

Same in Linux, which is why you'd choose a distro that's less bleeding edge.

>And Solaris (at least on amd64) wasn't slow.

No, it was slow. There's a reason why everyone in HPC/HFT/etc. moved off Solaris to Linux in the 2000s. Linux was regularly beating Slowlaris in practically every category at the end.


A FreeBSD developer wrote this code.


and then there were deep concerns about its stability, so vfs.zfs.bclone_enabled=0 was left in place

https://github.com/freebsd/freebsd-src/commit/068913e4ba3dd9...
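For anyone on 14.0 wanting to double-check, it's an ordinary sysctl (assuming the knob name from that commit):

    # 0 means block cloning is disabled; pin it in /etc/sysctl.conf to persist.
    sysctl vfs.zfs.bclone_enabled
    sysctl vfs.zfs.bclone_enabled=0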


So far nobody has proven that this bug exists on FreeBSD, though. :)


The bugs are in cross-platform code beneath the VFS. They exist on FreeBSD too.


Because FreeBSD is behind Linux in copying with reflinks. They couldn't even reproduce the issue if they wanted to.


This code was implemented for FreeBSD before it was implemented for Linux. I do not think it has made it to a FreeBSD release yet though.


No it wasn't: https://github.com/openzfs/zfs/issues/405

This hit Linux first, which is why Linux experienced the issue. FreeBSD does not currently have reflink cp.


It’s in FreeBSD 14. But they disabled block cloning by default.



Looks like a deeper bug has been exposed due to the way bclone works? They're finding issues now on ZFS versions predating bclone.

Edit: yep, it's an older bug. Set zfs_dmu_offset_next_sync=0 to save your data.

https://github.com/openzfs/zfs/issues/15526#issuecomment-182...
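On Linux that's a ZFS module parameter, so something like this should do it (standard module-parameter paths, but double-check on your system):

    # Flip it at runtime...
    echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync

    # ...and persist it across reboots.
    echo 'options zfs zfs_dmu_offset_next_sync=0' >> /etc/modprobe.d/zfs.conf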


Not sure what your point is. The comment I linked to just confirmed that the bug (whichever one it might be) has been reproduced on FreeBSD.


Just following up for the readers of the thread


Nope, somebody managed to reproduce this issue even with zfs_dmu_offset_next_sync=0

https://github.com/openzfs/zfs/issues/15526#issuecomment-182...


Apply this patch; it appears to be the real fix.

https://github.com/openzfs/zfs/pull/15571/files?diff=split&w...


I'm reading through the bug report and the investigation in the comments. Is there something here that makes this bug extremely obvious to you but that nobody else is seeing?


Even if you update your build of ZFS, you don't have to enable all these fancy new features in your zpools if you don't want to.

I have multiple zpools and none of them have been upgraded to enable any of these new features in the past couple years.
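You can also check where a pool stands without changing anything (pool name is an example):

    # "disabled" means the pool was never upgraded to use block cloning;
    # "enabled"/"active" means it was.
    zpool get feature@block_cloning tank

    # Or list the state of every feature flag on the pool.
    zpool get all tank | grep feature@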


That alone probably protected a number of users from hitting this.


It seems like a meme that GPL software developers are more ego-driven than BSD/MIT software developers, and so bike-shedding and new features take precedence over correctness/simplicity/beauty. I was a decade-long Gentoo user who was frustrated by crashy software and switched back to pirated Windows (I would never be caught dead paying for Windows), only to find out the same trash developer habits carried over in my absence and now Windows sucks just as much, plus IT SPIES ON YOU!!! I'm now on Qubes OS, hoping Xen is well enough built to assuage my anxieties.


The Linux developers involved use the CDDL. While some of them have patches in Linus's tree, most of them are described as Linux developers because they develop and use the code on Linux. They have little to do with mainline Linux, partly because certain mainline Linux developers will actively go out of their way to antagonize them if they even try to contribute to mainline. Certain others will simply ignore their emails. I will not name names, but I have had that happen to me in the past.

That said, I suspect you did not read the replies, since the code in question was written by a FreeBSD developer. Bugs in new code have been introduced by developers on both platforms. Unfortunately, the bugs in this feature were not caught before they reached a stable release. :/


> bike shedding and new features take precedence over correctness/simplicity/beauty.

I just learned the recent FreeBSD 14 release doesn't have 802.11n WiFi support. Is that considered bike shedding?



