ZFS Root on Archlinux

ZFS, Native Encryption, roots and you

About

I see repeated questions and general confusion surrounding zpools and datasets for a zfs-on-root configuration, so I figured that would make a good first post.

ZFS is a fully featured storage solution providing disk array management, its own filesystem and support for virtual block devices (zvols) backed by the pool. It also has a ton of transparent features for tuning performance and efficiency in professional deployments.

Data integrity is verified by checksumming every written record. With this, even a zpool on a single disk can detect bitrot caused by a failing drive or a transient problem on the host, allowing an early reaction to failure. Redundant arrays can go further and repair the corrupted data.

It also uses a Copy-on-Write design, avoiding the write-hole problem found in traditional hardware RAID5 arrays, among other potential write truncation events.

Native Encryption, key notes and comparisons

ZFS's Native Encryption works 'at rest', meaning it encrypts data as it is written to the disks. This happens transparently alongside other features such as compression, which brings both advantages and some considerations covered below.

ZFS Native Encryption uses modern encryption standards such as AES-128/256 in either Cipher Block Chaining (CBC) mode or Galois/Counter Mode (GCM). GCM parallelizes well and is the recommended choice (aes-256-gcm). It also happens to be the default when encryption is simply set to on.

As encryption is enabled per dataset (or volume), it doesn't encrypt metadata about the zpool itself. This means running zfs list will still show an encrypted dataset's name and properties, though its contents cannot be read without unlocking it with a passphrase or keyfile. Some consider this unacceptable, though this model poses no security risk for the data inside.
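
For instance, a passphrase-encrypted dataset can be created and later unlocked along these lines (a minimal sketch; the pool and dataset names here are placeholders):

# Create an encrypted dataset; you'll be prompted for a passphrase:
zfs create -o encryption=on -o keyformat=passphrase -o keylocation=prompt pool/secrets

# After a reboot or export, load the key and mount it again:
zfs load-key pool/secrets
zfs mount pool/secrets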

This design choice comes with powerful benefits, such as being able to verify the integrity of the data in a scrub operation without unlocking the dataset. It also means you can send a raw snapshot of an encrypted dataset to another local or remote zpool without decrypting it. The destination can safely store a copy and even check it for errors during a scrub without ever having access to the content inside.
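
As a sketch (dataset names and the backup host are placeholders), a raw replication of an encrypted dataset looks like:

# Snapshot, then send the raw (still-encrypted) stream to another pool:
zfs snapshot pool/secrets@backup1
zfs send -w pool/secrets@backup1 | ssh backuphost zfs receive backuppool/secrets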

OpenZFS implemented this feature from the ground up. It's capable of great speeds on modern CPUs with the relevant hardware acceleration (such as AES-NI), though it can cut into IO performance on older systems. I've personally found that older laptops achieve significantly slower throughput with native encryption than today's many-core desktop CPUs.

LUKS (Linux Unified Key Setup)

LUKS is the tried and tested go-to for Linux disk encryption, featuring great performance and compatibility with any filesystem, as it presents the underlying storage as a virtual block device ready to be formatted with anything. It can be used directly on a disk or partition before putting a filesystem on top, and it is often combined with software array management tools such as mdadm or LVM to encrypt their virtual block devices (for example LVM's Logical Volumes). Some even run LUKS on top of a ZFS zvol.

In contrast to Native Encryption on ZFS, where dataset metadata remains visible on the zpool, LUKS encrypts every write to the virtual block device it exposes. Instead of unlocking a ZFS dataset at boot time, one could unlock a LUKS-encrypted whole disk and mount the filesystem underneath, or build the zpool on top of disks encrypted with LUKS if desired.

A combination of ZFS and LUKS (disk>LUKS>zpool>dataset, or zfs>zvol>LUKS>filesystem) can eliminate the metadata concern, since LUKS encrypts everything written beneath it. That said, the metadata of ZFS datasets and volumes poses no real threat given a sound encryption setup with a securely stored passphrase or keyfile for unlocking them.
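
A rough sketch of the disk>LUKS>zpool layering (the device path, mapping name and pool name are placeholders, not part of this guide's setup):

# Format a partition with LUKS and open it as a mapped device:
cryptsetup luksFormat /dev/disk/by-id/ata-exampleDisk-part2
cryptsetup open /dev/disk/by-id/ata-exampleDisk-part2 cryptzfs

# Build the zpool on top of the unlocked mapping:
zpool create -o ashift=12 -O mountpoint=none tank /dev/mapper/cryptzfs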

Using ZFS for a root filesystem

There are good reasons to use ZFS as your rootfs. Not only does a rootfs compress well, but ZFS is transparent to the operating system, letting us take advantage of many features relevant to a workstation setup such as native encryption, transparent compression and snapshotting for easy rollbacks or emergency retrieval of something which has been deleted.

I intend to cover some popular ZFS-root EFI configurations in this article using Archlinux, and hope to clear up any confusion regarding commonly included flags and trends from the web.

Some mainstream distributions already bundle support for ZFS-root configurations, however I find they often clutter the dataset topology and naming conventions when it really doesn't have to be complicated at all. This guide will focus on a tidy configuration example where one could easily send all zpool datasets recursively elsewhere as part of a backup strategy.

Getting started

If you're unsure about configuring a ZFS root or whether it's for you, you could also follow this guide at zero risk using a virtual machine with a throwaway virtual disk.
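
For example, a throwaway UEFI test VM with QEMU might look roughly like this. Treat it as a sketch: the OVMF firmware path varies between distributions, and the disk and ISO filenames are placeholders.

# Create a disposable virtual disk and boot the ISO with UEFI firmware.
# The OVMF path below is distribution-dependent; adjust it for your host.
qemu-img create -f qcow2 zfs-test.qcow2 20G
qemu-system-x86_64 -enable-kvm -m 4G \
    -bios /usr/share/ovmf/x64/OVMF.fd \
    -drive file=zfs-test.qcow2,if=virtio \
    -cdrom archlinux-x86_64.iso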

With Archlinux there are many ways to bootstrap a new system or VM. You can:

  1. Start a VM with both the intended installation disk and ISO attached to install before booting on real hardware
  2. Create a /boot and zfs partition+zpool right on the host, pacstrap directly into them, chroot and generate initramfs images for the new installation
  3. Boot the ISO on real hardware and do the installation traditionally.

We will be following the most accessible example here using #3.

First off, grab the latest archiso from Archlinux.org. You can either block-copy it to a USB drive, write a CD, or put it on a PXE server. While this is traditionally done with a block-level copying tool such as dd, or a graphical Windows utility such as Rufus, it can also be done by piping the ISO from pv or cat into the USB's block device path. Even calling cp directly from the ISO file to the USB stick's block device works, given these tools all write sequentially without modifying or truncating the output.

To make sure you don't write the ISO to the wrong disk, it's highly recommended to use the /dev/disk/by-id paths, which identify your disks by model, brand and serial number. The same paths are also the recommended way to reference disks when setting up a zpool, for the same reasons.
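
For example with dd (the USB device path here is purely a placeholder - verify yours first):

# Write the ISO sequentially to the whole USB device (not a partition):
dd if=archlinux-x86_64.iso of=/dev/disk/by-id/usb-Example_Flash_Drive-0:0 bs=4M status=progress oflag=sync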

Preparing the live environment

To create a zpool you'll need the ZFS kernel module and userspace utilities in the live environment. You will have to expand the live ISO's in-memory cowspace mountpoint for more room to work with, then source ZFS by either building it from source, installing a package built against your kernel version, or the easiest solution: installing a zfs-dkms package.
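
If you ever need to do this step by hand, the live overlay can usually be grown in place; a minimal sketch, assuming the standard archiso cowspace mountpoint:

# Give the live environment more writable space for building/installing packages:
mount -o remount,size=4G /run/archiso/cowspace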

Luckily this is a common need and projects already exist to automate it. eoli3n's archiso-zfs handles exactly this and will be used in this example to skip all the manual work. In the live environment run:

  1. git clone https://github.com/eoli3n/archiso-zfs
  2. bash archiso-zfs/init

This script prepares the ZFS module and its userspace utilities in the live environment automatically. If your kernel version doesn't match the latest archzfs repo builds (common with the Arch ISOs) it will build the module itself using dkms.

Once it says "ZFS is ready" you can proceed to configure your disk.
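 
As a quick sanity check before touching any disks (these commands only load the module and report status):

# Confirm the module loads and the userspace tools respond:
modprobe zfs
zpool status
zfs version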

Disk configuration

Partitioning

Run "gdisk" against the new disk you wish to install on and configure an EFI partition plus a ZFS partition of the remaining space:

$ gdisk /dev/disk/by-id/ata-diskPathGoesHere-verifyPathFirst

n       # New partition (1)
Enter   # Default, first partition number available
Enter   # Default, Start at the beginning of the disk
+500M   # A large enough EFI partition for most cases.
EF00    # Partition type code for 'EFI system partition'

n       # New partition (2)
Enter   # Default, first partition number available
Enter   # Default, Start at the beginning of the next available space
Enter   # Default, Use all remaining contiguous space
BF01    # Partition type code for 'Solaris /usr & Mac ZFS'
w       # Write our changes
Y       # Confirm
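
Before formatting, it's worth confirming the kernel sees both new partitions (the disk path is the same placeholder used above):

# List the new partitions, their sizes and type names:
lsblk -o NAME,SIZE,TYPE,PARTTYPENAME /dev/disk/by-id/ata-diskPathGoesHere-verifyPathFirst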

Formatting and creating a zpool

Create a vfat filesystem on the first partition:

mkfs.vfat /dev/disk/by-id/ata-diskPathGoesHere-verifyPathFirst-part1

And create a zpool on the second/final partition using the system hostname as the zpool name for simplicity:

zpool create -o ashift=12 -O normalization=formD -O compression=lz4 -O xattr=sa -O mountpoint=none my-pc /dev/disk/by-id/ata-diskPathGoesHere-verifyPathFirst-part2

This example sets some common helpful flags for a new zpool:

ashift=12 creates the zpool assuming the drive has a 4096-byte sector size. Modern SSDs often report 512b sectors even when their physical sectors are 4096b, and because ashift is fixed at vdev creation, setting it too low can amplify IO and hurt performance - for example if a 512b-sector disk is later replaced by a 4096b one. Setting it too high costs little, so ashift=12 (2^12 = 4096b) is a safe assumption for most drives you'll encounter.
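
If you want to see what the drive actually reports before choosing (the disk path is the same placeholder as above):

# Compare the logical vs physical sector sizes reported by the drive:
lsblk -o NAME,LOG-SEC,PHY-SEC /dev/disk/by-id/ata-diskPathGoesHere-verifyPathFirst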

xattr=sa stores extended attributes as system attributes in the file's dnode, which improves performance by avoiding the extra IO needed to read them separately from the file. This is not critical in most situations, but it's easier to set now than to discover you wanted it later; it primarily helps workloads which reference extended attributes extensively.

mountpoint=none tells the zpool's root dataset not to mount itself at /zpoolName after creation, and new child datasets inherit this unless given their own mountpoint.

normalization=formD avoids potential filename confusion where two differently-encoded names would appear identical to a user. It does this by normalizing filenames to Unicode form D, so visually identical names compare as the same name.

compression=lz4 uses LZ4 compression, a common general recommendation given its good compression ratios at very low CPU cost.

LZ4 on ZFS also gives up early on records it detects as incompressible, such as media files (which are already encoded to be as small as possible and will not compress further). Those records are stored uncompressed, which avoids wasting cycles later decompressing data for no IO benefit.
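
Once the system has some data on it you can check how much this is actually saving, for example:

# Show the achieved compression ratio for the pool and its datasets:
zfs get -r compressratio my-pc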

If you want to use multiple disks for your host's zpool, you can do so at creation time with any number of redundant or striped vdev configurations, as sketched below.
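
For example, a two-disk mirror (the disk paths are placeholders) keeps the same property flags and only changes the vdev arguments:

# Same properties as before, but a mirrored vdev of two disks:
zpool create -o ashift=12 -O normalization=formD -O compression=lz4 \
    -O xattr=sa -O mountpoint=none my-pc \
    mirror /dev/disk/by-id/ata-firstDiskHere-part2 /dev/disk/by-id/ata-secondDiskHere-part2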

Root dataset

Creation

We can create a dataset intended to be used as the root dataset with:

zfs create -o mountpoint=legacy my-pc/root

Keep in mind this dataset inherits important properties from the zpool's root dataset, such as the compression, normalization and xattr storage preferences set earlier, which are all highly recommended for a rootfs dataset.
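
You can confirm the inheritance explicitly if you like:

# Properties inherited from the pool show 'inherited from my-pc' in the SOURCE column:
zfs get compression,xattr,normalization,mountpoint my-pc/root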

In this example I've set the mountpoint to legacy, which leaves the dataset for fstab (or a manual mount) to handle. In our case the zfs initramfs hook will find the root dataset at boot and mount it regardless. However, had we set the mountpoint to "/", it could have mounted itself over our live environment and upset some things.

Mounting

Mount the dataset with: mount -t zfs my-pc/root /mnt.

You can verify it has been mounted with df -h /mnt, which should show the zpool and dataset name in the Filesystem column on the left.

At this point we should also make a boot directory and mount our EFI partition there:

mkdir /mnt/boot

mount /dev/disk/by-id/ata-diskPathGoesHere-verifyPathFirst-part1 /mnt/boot

Pacstrapping

We can now begin installing into this environment. I'm going to recommend linux-lts here, as ZFS often cannot run on the very latest kernel versions until its supported builds have caught up:

pacstrap -P /mnt base vim networkmanager linux-lts linux-lts-headers

Depending on your network and IO speeds this step can take a few minutes.

Prepping the rootfs

We can save hand-editing /etc/fstab by running genfstab /mnt > /mnt/etc/fstab to write an fstab covering the current rootfs and boot partition mounts. The mountpoint=legacy approach needs this fstab entry: with a legacy mountpoint, ZFS deliberately leaves mounting to the operating system, so the zfs initramfs hook expects to find the root dataset in fstab rather than mounting it on its own.
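
The resulting /mnt/etc/fstab should end up with entries roughly along these lines (illustrative only; the UUID and mount options will differ on your system):

# Illustrative /mnt/etc/fstab contents - values are placeholders
my-pc/root      /       zfs     rw,relatime,xattr,noacl                 0 0
UUID=XXXX-XXXX  /boot   vfat    rw,relatime,fmask=0022,dmask=0022       0 2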

At this point, chroot into the new rootfs with arch-chroot /mnt to configure a few additional things.

System general

First, we can enable the NetworkManager service in this example so it has networking once it boots.

systemctl enable NetworkManager

We can also give it a hostname:

echo my-pc > /etc/hostname

Also don't forget to set a root login password with passwd so you can log in after it boots!

Bootloader

Install systemd-boot (formerly gummiboot) as the bootloader:

bootctl install

And create a boot option making sure to replace this example zpool name with your own:

cat >/boot/loader/entries/my-pc.conf << EOF
title   my-pc ZFS Root
linux   vmlinuz-linux-lts
initrd  initramfs-linux-lts.img
options zfs=my-pc/root rw
EOF 

If your machine is accessed over a serial console, it would be a good idea to append console=ttyS0 after the rw argument.
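
The options line from the example entry above would then read:

options zfs=my-pc/root rw console=ttyS0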

Initramfs and ZFS

We need to install ZFS itself. The pacstrap -P command we used earlier copied the live environment's pacman configuration, including the ZFS repository, into the new system, so we can run pacman -Sy zfs-dkms inside the chroot to grab it.

We also need to regenerate our initramfs with the packaged zfs hook (or an alternative):

Edit /etc/mkinitcpio.conf and add zfs to the HOOKS= array, before the filesystems hook.
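
As a rough guide, the array might end up looking like the following; the exact set of other hooks depends on your configuration, the key part is that zfs comes before filesystems:

# /etc/mkinitcpio.conf
HOOKS=(base udev autodetect modconf block keyboard zfs filesystems)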

Then run mkinitcpio -P to generate new images. Keep an eye out for the zfs hook in the output, making sure there are no errors while it adds its modules, otherwise the system will fail to boot.

Booting

At this point the machine is ready to boot. Restart and boot into the disk and enjoy setting up the remainder of your zfs-on-root Arch experience!