Direct Kernel booting with QEMU and installing a new rootfs to boot with.

About

Direct kernel booting is something I've found myself doing over the years. At first it was to see if I could pull it off, and over time it became a fast way to test initramfs hooks without having to reboot a physical machine or maintain and rebuild initramfs images inside a virtual environment another layer down.

Booting a kernel image directly with QEMU lets me work on initramfs hooks at record pace while rebuilding images right on the host before booting them seconds later.

Creating a virtual disk for the kernel to land in

Direct kernel booting is great and can be pointed to an NFS server or iSCSI target to proceed. For this demonstration I'll be using a virtual disk for the guest to boot into without its own EFI partition.

In my case I have ZFS available on just about every machine I manage, so creating a new virtual block device for a guest is as simple as sudo zfs create -s -V 50G zpool/ext4_directboot_test.

Otherwise a flatfile image can be created either by writing zeros into a .img file or by using qemu-img create -f raw /tmp/ext4_directboot_test.img 50G for the same result.

While it's possible to use other formats with -f such as vmdk or qcow2 - for simplicity's sake I would like to partition and mkfs on the image file directly.
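As a minimal sketch of the flatfile route, the function below creates a sparse raw image with truncate from GNU coreutils, which is one way of getting a file full of zeros without writing them all out. The name make_raw_image is made up for illustration, and the demo writes a small throwaway image under a temporary directory rather than the 50G one used in this post.

```shell
#!/bin/sh
# Create a raw disk image of the requested size.
# make_raw_image is a hypothetical helper name, not part of any tool.
make_raw_image() {
    img=$1
    size=$2
    # truncate extends the file without writing data, so on filesystems
    # that support sparse files it takes up almost no space at first.
    truncate -s "$size" "$img"
}

# Demo on a throwaway path; for the image in this post it would be:
#   make_raw_image /tmp/ext4_directboot_test.img 50G
demo_dir=$(mktemp -d)
make_raw_image "$demo_dir/demo.img" 16M
ls -ls "$demo_dir/demo.img"
```

The first column of ls -ls shows the blocks actually allocated, which is a quick way to see whether the image really is sparse.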

Preparing the virtual disk

Working with raw disk dumps (Or in this case a raw virtual disk)

I stuck with raw disks for this exercise to avoid the complication of other virtual disk formats, which don't store data 1:1 in their flatfiles. Even with a raw flatfile image, though, we still have to expose its partitions to the host.

In the case of a raw disk image we can use losetup to map the image to a loop device. This is useful for many reasons, especially ISOs, but in our case it also exposes the partition table of the flat image file to the host as device files for ease of access.

This can be done with losetup -Pf /tmp/ext4_directboot_test.img

-f finds the first available loop device number index and uses that (Typically /dev/loop0) and -P does a partition scan so existing or new partitions we make will be picked up immediately.

You can then run losetup to see what loop device it was assigned to. If you aren't using any loops it will typically be /dev/loop0.
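Rather than looking the device up afterwards, losetup can also print the one it picked via --show. A sketch, assuming util-linux losetup and root privileges for the real thing; the attach_image and detach_image names and the DRY_RUN switch are made up here so the commands can be previewed without touching any devices.

```shell
#!/bin/sh
# Print a command instead of running it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Attach an image with a partition scan. --show prints the chosen device
# (for example /dev/loop0) so it can be captured in a variable.
attach_image() {
    run losetup --show -Pf "$1"
}

# Detach the loop device again once finished with the image.
detach_image() {
    run losetup -d "$1"
}

# Preview only. Set DRY_RUN=0 and run as root to do it for real.
DRY_RUN=1
attach_image /tmp/ext4_directboot_test.img
detach_image /dev/loop0
```

With DRY_RUN=0 the device can be kept for later steps with something like dev=$(attach_image /tmp/ext4_directboot_test.img), after which the partition is "${dev}p1".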

Partitioning

At this point we can run gdisk against the /dev/loop0 path or that of a virtual block device in the case of ZFS/LVM. gdisk will open up with a new empty GUID partition table to start with, given it's a blank 'disk'.

Press n to create a new partition and accept the defaults at each prompt to create a first partition consuming the entire virtual disk with the default type code 8300 (Linux filesystem).

Then w to write it out and exit (q would quit without saving).
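The same single-partition layout can be scripted with sgdisk, the non-interactive sibling of gdisk from the same gptfdisk package. A sketch, assuming sgdisk is installed; partition_disk is a hypothetical helper, and the demo only touches a throwaway image file.

```shell
#!/bin/sh
# Create one partition spanning the whole disk, typed 8300 (Linux filesystem).
# partition_disk is a hypothetical helper name.
partition_disk() {
    # -n 1:0:0  = partition 1, default start, default end (all free space)
    # -t 1:8300 = set the type code of partition 1
    sgdisk -n 1:0:0 -t 1:8300 "$1"
}

# Demo against a small throwaway image if sgdisk is present. Pass /dev/loop0
# or a zvol path instead to partition the real virtual disk.
if command -v sgdisk >/dev/null 2>&1; then
    demo_img=$(mktemp)
    truncate -s 64M "$demo_img"
    partition_disk "$demo_img"
    sgdisk -p "$demo_img"
else
    echo "sgdisk not found, skipping demo"
fi
```

sgdisk writes the table immediately, so there is no separate write step to forget.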

Filesystem creation

At this point we can run mkfs.ext4 on the new partition (/dev/loop0p1 for the loop device) to create the filesystem.

Mounting and bootstrapping an installation into it

The various distros out there each have their own way of bootstrapping a system. In my case I'm using Arch Linux and will bootstrap it with pacstrap.

At this point we can mount the new filesystem we created. For a zvol or similar such as LVM logical volumes we can mount the new partition via its partition path under /dev/zvol or its name in /dev/mapper:

mount /dev/zvol/zpool/ext4_directboot_test-part1 /mnt

For loops it will be /dev/loop0p1

We can use pacstrap to install the bare minimum for the smallest installation possible with some key packages:

pacstrap /mnt base vim dhclient

Inside provisioning

Now we must chroot into it and at the very least set a password:

arch-chroot /mnt

passwd # Set one here

Making it boot

We're at a point where we can just point QEMU at our /boot/vmlinuz-linux image (Or similar) and give it arguments such as root=/dev/sda1 and, if there's no virtual display attached, console=ttyS0 (Serial), but there are sometimes additional gotchas in this approach.

Gotchas

In my case the /boot/vmlinuz-linux put together by the linux pacman package doesn't seem to include ext4 in its list of bdev filesystems. That is to say, it's incapable of booting into an ext4 disk all on its own without an initramfs to load the module.

The ahci driver and virtio driver seem to be built-in so accessing the disk should be fine.

In my case it's just easier to compile a new kernel and check the ext4 box with make xconfig.
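One way to confirm this kind of gotcha is to look the option up in a kernel config. A sketch, assuming the config is available either as a plain file (a build tree's .config) or gzip-compressed (/proc/config.gz, which only exists when the running kernel was built with CONFIG_IKCONFIG_PROC); config_state is a hypothetical helper and the demo config is a made-up fragment.

```shell
#!/bin/sh
# Print y (built-in), m (module) or n (unset or absent) for a config symbol.
# config_state is a hypothetical helper name.
config_state() {
    file=$1
    sym=$2
    case $file in
        *.gz) reader="gzip -dc" ;;
        *)    reader="cat" ;;
    esac
    # Match lines such as CONFIG_EXT4_FS=y and keep only the y or m.
    state=$($reader "$file" | sed -n "s/^$sym=\([ym]\)\$/\1/p")
    echo "${state:-n}"
}

# Demo on a made-up config fragment.
demo_cfg=$(mktemp)
printf '%s\n' 'CONFIG_EXT4_FS=m' 'CONFIG_VIRTIO_BLK=y' \
    '# CONFIG_BTRFS_FS is not set' > "$demo_cfg"
config_state "$demo_cfg" CONFIG_EXT4_FS     # m: needs an initramfs to load it
config_state "$demo_cfg" CONFIG_VIRTIO_BLK  # y: usable straight from the kernel
config_state "$demo_cfg" CONFIG_BTRFS_FS    # n: not available at all
```

Anything the kernel needs in order to reach and mount the root filesystem has to come back as y for an initramfs-free boot.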

Compiling the kernel

Visit https://www.kernel.org/ and grab the latest release (Typically a big gold button)

tar xvf it and cd into the directory.

Run make xconfig for a nice configuration UI or use make menuconfig to use the CLI version.

In these flag-setting screens you can browse to File Systems and ensure that The Extended 4 (ext4) filesystem is checked. This builds ext4 into the kernel image itself rather than as a loadable module.

The kernel is nice and modular and with that comes many modules. It's ideal to build a kernel with only what it needs to function or boot, to minimize its size in a boot partition or, even more strictly, in flash memory for say, your typical consumer router/switch/ap combos. A ton of typical home networking devices run Linux, though often with a severely cut-down kernel image compiled to fit into their limited boot flash and available memory. Plenty of enterprise gear too.

Once ext4 has been enabled exit while saving your changes and run make -j$(nproc).

nproc is a nice utility from the GNU coreutils suite which returns the number of processing units available to the current process (normally every CPU thread on the system), so we can build all the bits and pieces of the kernel in parallel rather than one at a time before finally putting them all together. While it's possible to read, list and parse paths such as /proc/cpuinfo and say, /sys/devices/system/cpu, nproc just takes the over-engineering away.

Once the kernel is compiled with ext4 support built-in (No initramfs, yay!) it should spit out the image to the relative subdirectory arch/x86/boot/bzImage which we can reference in the booting step.
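For repeat builds the ext4 checkbox can also be set without the UI. A sketch, assuming it is run from the top of an extracted kernel source tree which already has a .config (for example from make defconfig); scripts/config and the olddefconfig target ship with the kernel source, while build_kernel is a hypothetical wrapper.

```shell
#!/bin/sh
# Enable ext4 as a built-in and build with one job per available CPU.
# build_kernel is a hypothetical helper name.
build_kernel() {
    # Set CONFIG_EXT4_FS=y in .config.
    ./scripts/config --enable EXT4_FS &&
    # Fill in any new or dependent options with their defaults.
    make olddefconfig &&
    make -j"$(nproc)"
}

# Only attempt the build when run from a configured kernel tree.
if [ -x ./scripts/config ] && [ -f ./.config ]; then
    build_kernel && ls -l arch/x86/boot/bzImage
else
    echo "not in a configured kernel tree, nothing to do"
fi
```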

Booting

Leave the chroot and umount the path which the new ext4 partition was mounted to. If losetup was used run 'losetup -d /dev/loopX' where X is the index it was given.

At this point we can refer to QEMU for getting things running.

QEMU

I personally use My VFIO script for all things QEMU but it also spits out the arguments for qemu-system-x86_64 for use elsewhere.

If a UEFI guest is created in virt-manager you can go to the Boot Options tab and go to the Direct kernel boot section to paste in or browse to the bzImage you compiled and add some kernel args like root=/dev/sda1 and if there's no virtual display, console=ttyS0 so it can show you the boot process and any errors you could run into.

With my script I invoke this with my paths: DISPLAY= ~/vfio/main -run -kernel /tmp/linux-6.9/arch/x86/boot/bzImage -cmdline 'root=/dev/vda1 rw console=ttyS0' -image /dev/zvol/zpool/ext4_directboot_test to get it fired up. This boots me to a login screen for the guest in under a second!

For using QEMU directly arguments like the below should do the trick:

qemu-system-x86_64 \
  -machine q35,accel=kvm,kernel_irqchip=on \
  -enable-kvm \
  -m 2048 \
  -cpu host,kvm=on,topoext=on \
  -smp sockets=1,cores=2,threads=2 \
  -drive if=pflash,format=raw,unit=0,readonly=on,file=/usr/share/ovmf/x64/OVMF_CODE.fd \
  -serial mon:stdio \
  -nodefaults \
  -drive file=/dev/zvol/zpool/ext4_directboot_test,format=raw,if=none,discard=on,id=drive1 \
  -device virtio-blk-pci,drive=drive1,id=disk1 \
  -nographic \
  -vga none \
  -netdev user,id=vmnet \
  -device virtio-net,netdev=vmnet,mac=52:54:00:11:22:33 \
  -kernel /zfstmp/linux-6.9/arch/x86/boot/bzImage \
  -append root=/dev/vda1\ rw\ console=ttyS0

This example starts up a somewhat basic UEFI qemu machine (Your OVMF_CODE.fd path may vary) which can also quickly boot into the new virtual disk rootfs in under a second without a bootloader or boot partition.
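For comparison, here is a stripped-down variant without the UEFI firmware, networking and CPU topology options. This is a sketch rather than a tested replacement for the command above: it assumes a virtio-blk disk showing up as /dev/vda in the guest and a kernel with ext4 and virtio built in. The hypothetical qemu_cmd helper only prints the command so the paths can be checked before running it, and it assumes those paths contain no spaces.

```shell
#!/bin/sh
# Print a qemu-system-x86_64 command line for a direct kernel boot.
# qemu_cmd is a hypothetical helper: $1 = kernel image, $2 = raw disk.
qemu_cmd() {
    # Every line but the last ends in a backslash continuation.
    printf '%s \\\n' \
        "qemu-system-x86_64" \
        "  -machine q35,accel=kvm" \
        "  -m 2048" \
        "  -nodefaults -nographic" \
        "  -serial mon:stdio" \
        "  -drive file=$2,format=raw,if=none,id=drive1" \
        "  -device virtio-blk-pci,drive=drive1" \
        "  -kernel $1"
    printf '%s\n' "  -append 'root=/dev/vda1 rw console=ttyS0'"
}

qemu_cmd /tmp/linux-6.9/arch/x86/boot/bzImage /tmp/ext4_directboot_test.img
```

The printed text can be pasted into a terminal or piped into sh once it looks right.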

Motivations, wrapping up

There aren't many valid use cases to do this. I can imagine some enterprise cases are starting up guests using QEMU which are expected to boot a common generalized kernel and some iSCSI or NFS target. At that point maintaining installations each with their own boot partition wouldn't scale well and this makes sense.

In my case direct kernel booting lets me test specialized boot hooks without having to either reboot a real machine or maintain and rebuild the initramfs in a cumbersome virtualized guest environment every time. Direct kernel booting with QEMU makes this a four-second turnaround per test rather than a few minutes per test. When you're building hundreds of new initramfs images to try new hook revisions this becomes critical.