Introduction

Somehow, it has come to be that I upgrade my computer every four years. It is purely coincidence that I happened to start that trend on a leap year, and so for the foreseeable future I build a new computer roughly once per leap year.

Though if I’m being honest, hardware is getting to be so ridiculously fast that I have very little need for another upgrade. Of course newer PCIe generations and DDR5 are compelling updates, but I can’t say that it feels like I’m running into I/O or memory bottlenecks these days.

My computers, in order, are:

  • 2008: COMMIE64: Athlon 64 X2 (65nm)
  • 2012: TOPH: FX-8150 (32nm)
  • 2016: NAGATO: i7-6700K (14nm)
  • 2020: NAGATO “2”: THREADRIPPER 3960X (7nm, 12nm)

The last two computers share a name mostly because I built them in the vein of the Ship of Theseus: they shared RAM, storage, GPU, etc. They have since diverged entirely, but I have been too lazy to rename the computer. (Plus, at this point, it would mess up my backup scripts.)

It is also interesting to note that NAGATO2 no longer shares an OS install lineage with NAGATO1. NAGATO2 is now a continuation of an Arch Linux install that started life as a virtual machine (“DARKBLUE”) created on TOPH.

So, below this line references to “NAGATO” mean my 2020 computer. (I will disambiguate when referring to the old Skylake computer.)


Hardware

NAGATO contains the following major components:

Motherboard	ASRock TRX40 Creator	(sTRX4)
CPU			AMD THREADRIPPER 3960X	(3.8GHz)	(24c 48t)
Memory		4x DDR4 G.Skill 32GB	(3200MHz)	(128GB)

GPU1	Radeon RX 550
GPU2	NVIDIA RTX 4090 (as of 2023/04, formerly: NVIDIA RTX 2080 Super)
PCI3	USB 3.1 Controller

Storage	(B) Samsung 980 PRO		(NVMe)	(1TB)
		(B) Samsung 980 PRO		(NVMe)	(1TB) 
		(V) Samsung 990 PRO		(NVMe)	(2TB) (as of 2023/04, was: 980 PRO)
		(1) SanDisk Ultra 3D	(SATA)	(1TB)
		(1) SanDisk Ultra 3D	(SATA)	(1TB)
		(2) WDC 20TB			(SATA)	(20TB)
		(2) WDC 20TB			(SATA)	(20TB)

Boot Process

The drives marked B contain a small EFI system partition along with a ZFS root pool. Both have valid boot loaders (rEFInd) and an initramfs containing ZFSBootMenu. So the firmware will load rEFInd from either one of those drives; rEFInd passes control to ZFSBootMenu, which in turn loads a boot environment from the root pool and boots it.
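
The glue here is pretty minimal. A rough sketch of the rEFInd entry and the default boot environment selection (paths, pool, and dataset names below are illustrative, not copied from my actual config):

    # refind.conf entry pointing at the ZFSBootMenu kernel + initramfs on the ESP
    menuentry "ZFSBootMenu" {
        loader  /EFI/zbm/vmlinuz-bootmenu
        initrd  /EFI/zbm/initramfs-bootmenu.img
        options "zbm.prefer=nagato"
    }

    # ZFSBootMenu picks its default boot environment from the pool's bootfs property
    zpool set bootfs=nagato/ROOT/default nagato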

Once the system is started, the Linux console is accessible on my Radeon GPU. Typically the system either runs headless, or I use sway as my window manager of choice.

Typically the next thing I do is sudo virsh start nugget-real-boy, which runs a QEMU (KVM) virtual machine by passing it three PCIe devices: the NVIDIA GPU (GPU2), an NVMe boot device (V), and a dedicated USB controller (PCI3).
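
On the libvirt side each passed-through device is just a <hostdev> entry in the domain XML; the PCI address below is a placeholder, not this machine’s actual topology:

    <!-- one of these per device: the GPU (GPU2), the NVMe drive (V), and the USB controller (PCI3) -->
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x21' slot='0x00' function='0x0'/>
      </source>
    </hostdev>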

Virtual Machine

I still do quite a lot of gaming, and my day job is at a Microsoft (.NET) shop, so rather unfortunately I still find myself shackled to Windows. I have gotten around this by creating a virtual machine with PCIe passthrough.

It has not been all sunshine and roses; the issues I’ve had to date include:

  1. Rearranging the PCIe topology (i.e. adding devices) often causes the VM to not boot until it has been reconfigured.
  2. I used to have a SATA HDD attached to the VM with virtio-scsi, which had some strange performance pitfalls. (Performance was very bursty.) I never did figure out an ideal configuration, and eventually just replaced it with network shares from the host.
  3. The emulated TPM has given me some grief with BitLocker in the past. I had to update a group policy in the guest to prevent BitLocker from caring about some unsupported features of the emulated TPM.
  4. This is no longer a problem, but running NVIDIA GPUs inside a VM used to require some hacks to “hide” the fact that it was a virtual machine. (See the sketch after this list.)
  5. Similarly, some games have over-zealous anticheat which doesn’t like running inside a VM. Unfortunately, hiding virtualization typically means disabling some Hyper-V “enlightenments” which allow for maximum performance of the Windows guest.
  6. See “kernel hacks.”
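
For what it’s worth, the VM-hiding hacks from (4) and (5) boil down to a couple of lines in the libvirt <features> block. This is the commonly circulated recipe rather than my exact config, and the vendor_id value is arbitrary:

    <features>
      <hyperv>
        <!-- keep the useful enlightenments, but report a non-KVM vendor id to the guest -->
        <vendor_id state='on' value='randomid1234'/>
      </hyperv>
      <kvm>
        <!-- stop KVM from advertising itself to the guest -->
        <hidden state='on'/>
      </kvm>
    </features>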

So I’d like to give a big “fuck you” to any dev in 2022 who thinks a virtual machine is a valid reason to stop their software from running.

Kernel Hacks

There were two issues I had with the Linux kernel that made running this guest a pain in the ass:

  1. Motherboards group PCIe devices into “IOMMU” groups that allow the kernel to isolate the address space associated with each device. On Intel client SKUs (like my old Skylake machine) this required a “PCIe ACS Override” patch to get the kernel to create sane IOMMU groups; see the sketch after this list. As the name implies this is not a particularly safe thing to do. (It is overriding what the CPU/motherboard vendors intended for the platform.)

  2. I had a Samsung 950 drive which required a patch in order for the drive to not hang after the vfio-pci passthrough driver resets the device with an FLR. (tl;dr: without the patch the drive is unusable after a guest reboot until a full host reboot.) The patch was never accepted upstream despite the fact that it already exists for a sibling drive (the 960 Pro). So for years I had to build a patched kernel, until that drive was eventually replaced with the 2TB 980 Pro. (Which does not exhibit the issue!)
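
For the curious, on the old Skylake box the kernel command line additions looked roughly like this (the device IDs are examples, and pcie_acs_override only exists on a kernel carrying the ACS override patch):

    # isolate the passthrough devices and bind them to vfio-pci at boot
    pcie_acs_override=downstream,multifunction vfio-pci.ids=10de:1b81,10de:10f0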

Thankfully (1) was addressed because the ASRock TRX40 puts virtually everything into its own IOMMU group, with the exception of the USB controllers, hence my dedicated USB PCIe card. A new storage drive solved (2), obviating the need for me to maintain that patch.

Usage

To switch between my Linux and Windows machines: the two GPUs are attached to a 2x2 4K KVM switch which supports DisplayPort 1.4. The computers are individually controlled by passing the input devices to either (a) nagato’s onboard USB controller, or (b) the guest’s dedicated PCIe USB controller.

Shared storage between the host and guest is accomplished with SMB shares created from the ZFS pools (1 and 2), which provide SSD or HDD storage depending on performance requirements. The guest has a virtualized 10Gbps NIC, so transfer speeds are adequate to saturate the SATA SSDs. (Even mirrored reads, pulling from both sides of the SATA mirror at 6Gbps each, would theoretically top out at 12Gbps, which is “close enough” to 10Gbps for the types of workloads I’d use a network share for.)
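
The shares themselves are nothing fancy; whether via Samba directly or ZFS’s sharesmb property it amounts to something like this (share names and dataset paths are made up for illustration):

    # /etc/samba/smb.conf -- one share per dataset the guest should see
    [scratch]
        path = /nagatosan/scratch
        read only = no

    [bulk]
        path = /nico/bulk
        read only = no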

The KVM switch has USB 3 support, which allows me to attach fairly fast peripherals and seamlessly share them with either the host or guest. The one caveat is that you need to be careful switching when a USB storage device is attached, or any other peripheral that would be sensitive to being hot-plugged.

Audio

Another quirk is audio. I haven’t tried using virtualized audio in a while, but the latency with pulseaudio was pretty much unbearable for games. It would also often introduce “pops” or “crackles” into the audio stream. (I use pipewire now, so this may be a non-issue, but I can’t be arsed to redo my setup.)

I use a project called Scream which creates a fake audio device in the guest that writes to a shared memory buffer on the host. The host then plays that buffer through pipewire. (Supposedly Scream’s network transport is better, but I am not sure I want to saturate the 10Gbps host-guest vNIC with that traffic. Some day I should probably try it out, though.)
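
For the shared-memory transport, the host side is just an IVSHMEM device in the libvirt domain XML, something like the below (treat the name and size as placeholders; use whatever Scream’s documentation calls for):

    <shmem name='scream-ivshmem'>
      <model type='ivshmem-plain'/>
      <size unit='M'>2</size>
    </shmem>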

My microphone is physically attached to the host and exposed to the guest through the virtualized sound card. I used to hotplug it w/ the KVM switch, but got sick of programs nagging me about new audio devices all the time. (Also occasionally Linux would get quite confused about what my default device should be.) This adds a bit of additional latency, but it seems to be acceptable. (Presumably since it’s a mono stream at a much lower sample rate/bit-depth than my audio output.)

Storage

The VM has its own NVMe drive formatted with NTFS which serves as its primary storage. I try to make it a policy to store only ephemeral data on that drive. (i.e. games which could be re-downloaded from Steam, etc.)

Data which needs to be backed up is sent to the 20TB storage pool via the SMB share. Data which needs to be “hot” but live outside the VM is stored on a share from the 1TB SATA SSD pool.

ZFS

For the past several years I have been using ZFS on Linux for most of my storage needs:

  • 2016ish: experiments with ZoL in VMs
  • 2017: built a storage server (kaisei)
  • 2018: switched my server from btrfs to zfs
  • 2019: switched to Linux on my main workstation with ZoL on /.

Just FYI, this article is going to be mostly my subjective experience with ZFS. This is not a deep dive on the internals of ZFS, or a thorough analysis of its performance. To the contrary: one of the things I like about ZFS is that it gets out of the way. It works well enough that for the most part I don’t have to think about it.

Currently I have the following pools:

  1. seraphina: single mirror of 500GB Samsung SATA SSDs; server root pool.
  2. nico: single mirror of 20TB WD (enterprise-grade) SATA HDDs; desktop storage.
  3. nagatosan: single mirror of 1TB SanDisk SATA SSDs; desktop storage.
  4. nagato: single mirror of 1TB Samsung NVMe SSDs; desktop root pool.
  5. kaisei: 3x 4TB mirrors, mix of Seagate/Hitachi consumer-grade SATA HDDs; home storage appliance.
  6. tsubaki: 1x 10TB mirror of HGST enterprise-grade HDDs; home storage appliance.
  7. natsuki: singleton vdev of a 256GB Samsung SATA SSD; laptop root pool.

The pools, broadly speaking, are configured with ZFS defaults and ashift=12. Compression is enabled on everything except a few datasets that store media which will not compress well.
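
In practice that means pool creation and dataset tuning look something like this (device paths, pool, and dataset names are illustrative, and lz4 is a stand-in for whichever compression algorithm a given pool actually uses):

    # mirrored pool with 4K-native alignment
    zpool create -o ashift=12 nagatosan mirror \
        /dev/disk/by-id/ata-SanDisk_Ultra_3D_AAAA \
        /dev/disk/by-id/ata-SanDisk_Ultra_3D_BBBB

    # compression on by default, off for datasets full of already-compressed media
    zfs set compression=lz4 nagatosan
    zfs set compression=off nagatosan/media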

ZFS provides for virtually all of my storage needs these days. The only real exception to this is that I’m using NTFS for the Windows devices and VMs in my life, along with APFS on whatever iDevices are kicking around. (As I understand it everything has been APFS since iOS 10.3, but I haven’t had a chance to jailbreak a device that new to poke around for myself.)

Though it’s ultimately NTFS on top: there are a few Windows VMs which I am lying to by exposing ZVOLs over iSCSI. I am aware that ZVOLs are slow1, but the results from that article didn’t seem to match my own experience, and so far I haven’t had the time to dig into why that is. (I am sure it is a misconfiguration on my part, but it was no small difference in my case: using a ZVOL took a system with borderline unusable latency and made it sufficient for light gaming.)
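
Mechanically that is just a ZVOL handed to an iSCSI target. A sketch using LIO/targetcli, with placeholder names and sizes (I am not claiming this is the exact stack or tuning in use here):

    # carve a volume out of a pool
    zfs create -V 64G pool/win-scratch

    # expose it as an iSCSI LUN via LIO (portal/ACL setup omitted)
    targetcli /backstores/block create name=win-scratch dev=/dev/zvol/pool/win-scratch
    targetcli /iscsi create iqn.2022-01.net.example:win-scratch
    targetcli /iscsi/iqn.2022-01.net.example:win-scratch/tpg1/luns create /backstores/block/win-scratch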

Experience

Basically my flow is something like this:

  • all my hosts back up from their respective pools to kaisei/hosts/<..host root..>

    • this is done with zfs send | receive (sketched just after this list)
    • I’m using syncoid/sanoid for automation at this point
  • kaisei backs up to tsubaki periodically; tsubaki is a pair of drives that are mostly designed to be attached-recv'd-scrubbed-detached and stored off-site.

  • kaisei and tsubaki grow as the demands placed on them grow. Considering my desktop now has 20TB (wtf) of storage they will probably need to expand soon.
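
Stripped of the automation, the first hop looks roughly like this (snapshot names and dataset paths are made up; syncoid builds the equivalent commands for me):

    # full incremental replication of the desktop pool to the storage server
    zfs snapshot -r nagato@2022-10-01
    zfs send -R -I nagato@2022-09-01 nagato@2022-10-01 | \
        ssh kaisei zfs receive -F kaisei/hosts/nagato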

ZFS send and receive is one of the greatest things ever. It works great, but building a correct command line, managing incremental snapshot chains, etc. can be something of a chore. sanoid takes most of the suck & guess-work out of it, however. My only complaint about sanoid, honestly, is that it requires a perl module from the AUR which I regularly have to rebuild. I’m not a perl fan (sacrilege, I know) and I’m not an AUR fan: so this drives me to drink every other month or so. Some day I will probably automate building this package, or install one of those cursed “AUR helpers”, but today is not that day.
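
For context, the sanoid side is a small policy file plus a timered syncoid invocation. A generic example in the style of the upstream sample config, not my literal setup:

    # /etc/sanoid/sanoid.conf
    [nagato]
        use_template = production
        recursive = yes

    [template_production]
        hourly = 36
        daily = 30
        monthly = 3
        autosnap = yes
        autoprune = yes

    # replication, run from a timer:
    syncoid -r nagato root@kaisei:kaisei/hosts/nagato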

Believe me when I say that zfs send and zfs receive are tools I no longer want to live without. In 2022 I think a filesystem that cannot serialize to an (incremental!) replication stream need not exist. SSDs obviate most of the downsides of CoW filesystems, so there’s really no excuse. I’m not sure what MSFT is doing, but they really need to finish ReFS. I’m not sure if APFS has serialization capabilities, as I haven’t used macOS for almost a decade, but Apple should seriously consider adding it if they haven’t already. It’s a game changer. (Combined with their Time Machine product, if that’s still a thing, this would give Apple a best-in-class desktop backup solution in my opinion.)

I have some rudimentary monitoring built around zfs-list, zed, SMART, and friends that alerts me to pools that are filling up or failing. I have yet to have a pool fail on me.
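
Nothing fancy; the core of it is along these lines (simplified for illustration):

    # cheap health check: zpool status -x only prints detail when something is wrong
    zpool status -x
    zpool list -H -o name,capacity,health

plus ZED’s built-in notifications (ZED_EMAIL_ADDR et al. in /etc/zfs/zed.d/zed.rc).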

I scrub the laptop every once in a while. You may think that a scrub on a singleton vdev seems pointless, but it has detected bitrot a handful of times. So far it has mostly been inconsequential data in old snapshots, and not references to live blocks. The beauty of ZFS is that it will tell you exactly where the corrupted data is: if it ever does hit a live block, I will know to restore that from a backup. (The backup which lives on three redundant pools, which are regularly scrubbed and capable of self-healing.) I’ve looked through old photos from NTFS volumes and occasionally I stumble across corrupted images; I have to wonder whether that is bitrot from “a time before ZFS”, or just some operational error long since forgotten. With ZFS you don’t have to wonder. (The other advantage of the laptop being on ZFS, despite having no redundancy or ECC RAM, is that it makes backups a breeze.)

Rant: stop recommending overkill hardware.

The ZFS community has a huge problem where they recommend that you should only run ZFS on enterprise-grade drives, with ECC memory, and that it requires globs of RAM. My experience has proven this to be patently false. I use ZFS happily on a laptop with an old-ass CPU, DDR3 non-ECC RAM, a slow SATA drive at 3Gbps, and a singleton vdev. I wouldn’t enable deduplication on it, or use that laptop as the headend to a zpool on a huge attached JBOD, but it works fine as a small Linux terminal.

On a singleton vdev, ZFS itself will literally tell you to “restore damaged files from backup.” This, to me, heavily implies that ZFS was designed to be used on an end-user device (i.e. without an array) that was relying on traditional backups. Those backups, of course, can easily live online on a redundant pool which can be trivially updated with zfs send/recv.

As for using enterprise hardware: my desktop does not have ECC memory (nor enterprise grade storage for the root pool) and has never detected a single error in the pool in several years of heavy use. (Hundreds of TBW.)

You are far better off building multiple cheap pools, and maintaining healthy backups, than you are putting all your eggs in one “enterprise grade” basket. If someone is looking to build their first pool, the advice should not be to blow their budget on one turbo-expensive pool, but to instead spread the budget evenly across a primary pool and a replica. (Whether the replica lives on ZFS or otherwise.)

ZFS was developed in a time when drives, and their firmware, were far more immature than they are today. It was also designed for multi-user environments where clients (Solaris workstations) were presumably doing most of their actual work on gigantic Sun rack-scale computers. So I have to imagine that ZFS’s first-party support for volumes (iSCSI) and shares (NFS/SMB), along with snapshots and zfs send/recv, were primarily motivated by the operator concerns which come with computing in that sort of environment.

(This seems to be supported by the original ZFS manpages: numerous examples on administering snapshots assume hypothetical users storing home directories et al. on ZFS.)

There are legitimate concerns about using certain drives in RAID, particularly drives which like to sleep (WD Green), drives designed for sequential access (WD Purple), or drives with zoned storage (e.g. shingled magnetic recording). However the concerns with these drives are (a) not limited to ZFS, and (b) not even limited to CoW filesystems. People should be avoiding these specialty products unless their use-case demands them, and those use-cases rarely involve ZFS operating with redundancy.

There’s gotta be something bad about it?

The worst thing ZFS has ever done to me is destroy data I told it to destroy. I had a dataset that was essentially a copy of a block device I wanted to nuke and re-use. I don’t even recall the exact chain of events (though I’m sure it involved the -r flag), but somehow I ended up nuking that image. I noticed it the next day, long after it was possible to easily rewind the pool to that txg. (This was also before the zpool-checkpoint feature was a thing.)
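
For posterity, the guard rails that exist these days look like this:

    # dry-run a recursive destroy and print what would be deleted
    zfs destroy -rnv pool/some/dataset

    # take a pool-wide checkpoint before risky surgery; discard it when you're done
    zpool checkpoint pool
    zpool checkpoint -d pool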

The other thing that sucks is I have to exercise caution upgrading kernels on my machines, especially ones with boot pools. It requires some hyper-vigilance on my part to make sure that no upgrade ever fails to create an initramfs. Should I ever meet a genie: I am spending at least one wish on getting ZFS in Linus’ tree. (Also know that I refuse to consider the ramifications of this wish. BLIND OPTIMISM WILL SURELY SEE US THROUGH.)
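
One cheap sanity check before rebooting (assuming mkinitcpio; paths and image names will differ on other setups) is to make sure the freshly generated initramfs actually contains the ZFS module:

    lsinitcpio /boot/initramfs-linux.img | grep -i zfs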

So I use an inferior operating system (Linux, not BSD) which causes me grief, and once I typed a command literally called destroy which destroyed my data. Over 6 years or so I would say that’s pretty good. To be fair, NTFS, ext4, and xfs haven’t (knowingly!) destroyed any of my data in the same timeframe, either. However my three stints with btrfs all encountered data loss within that timeframe. (The first one lasted literally minutes, the second lasted ~3 months, and the third lasted a few hours, but I was intentionally trying to trigger a corner case w/ rebalance to see if it was fixed; it wasn’t.)

Another gotcha I run into quite often is that features & properties which require on-disk format changes are not transparent. For instance, on Windows you can turn on BitLocker and it will kick off a background process to encrypt the volume in place. On ZFS you would have to create an encrypted dataset, move everything into it, update all your mountpoints, and kill the old dataset. Similarly, changing compression or recordsize only applies to newly written blocks; applying either retroactively means rewriting the whole dataset.
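
The encrypted-dataset shuffle, for example, ends up looking something like this (names are placeholders; the same send/receive dance applies if you want a new recordsize or compression setting applied retroactively):

    # create the encrypted parent, then replicate the old data into it
    zfs create -o encryption=on -o keyformat=passphrase pool/secure
    zfs snapshot -r pool/stuff@migrate
    zfs send -R pool/stuff@migrate | zfs receive pool/secure/stuff
    # then: fix mountpoints, verify, and destroy pool/stuff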

This isn’t too bad, and actually I’ve learned to appreciate the simplicity, in that ZFS doesn’t try to hide what would be a very expensive operation behind an innocently named switch or tunable. However it is something I will occasionally stumble over. This goes double for pools, which have some properties that cannot be changed, like ashift=. That being said I’m sure this criticism applies to basically any complex filesystem or volume manager, and ZFS combines both into one.

CLI done right

The ZFS command line (zpool for pool operations, zfs for dataset operations, and zdb when you fuck up) is perhaps one of the best interfaces I’ve ever seen. The documentation is superb, the options tend to be both obvious and consistent, and the commands are pretty good about stopping you from doing stupid things. (zpool in particular is pretty good about stopping you from doing things that would compromise a pool’s availability or resiliency.) Furthermore all the commands support both human-readable and machine-readable output, making automation of common tasks fairly painless.
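
The machine-readable bits in particular are only two flags away (-H drops headers and tab-separates, -p prints exact values):

    zfs list -Hp -o name,used,available,mountpoint
    zpool list -Hp -o name,size,allocated,free,health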

One other thing, which I don’t hear many people talk about, is ZFS delegation. The ability to hand a subset of operations out to unprivileged users is fantastic. Figuring out what permissions you need for any particular operation can be a bit challenging, but it’s really nice to be able to run backups w/ an otherwise unprivileged account. (However, make no mistake, an account that can create and destroy datasets & snapshots, even if they’re “just backups”, is still very privileged.)
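
As a flavor of what that looks like (usernames and datasets are hypothetical), delegating just enough for push backups is a one-liner on each side:

    # on the sending host: let the backup user snapshot and send
    zfs allow backup send,snapshot,hold nagato

    # on the receiving host: let it create, receive, and prune under the backup target
    zfs allow backup receive,create,mount,destroy kaisei/hosts/nagato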