Intro¶
The storage driver domain in this howto is called zfshost.
I have two small 40GB Intel 320 SSDs in an MD RAID1 array inside an LVM VG (Volume Group) called vg_raid1. This VG contains four LVs (root+swap for dom0 and for zfshost).
The LVM LVs (Logical Volumes) zfshost-disk and zfshost-swap are used as the boot and swap devices of the Debian-based ZFS storage driver domain.
The actual ZFS storage disks are handled entirely by the storage driver domain; they sit on a SATA controller that is exported to the zfshost domU with PCI passthrough.
The SATA disks on the exported controller form a pool called tank.
Why not FreeBSD?
- NFSv4 and Kerberos do not work well enough
- No pure PV mode (only HVM), which causes issues with PCI passthrough
- Resizing a zvol requires a domU restart
Preparations¶
Switch from pygrub to pvgrub¶
Pygrub cannot be used with disks served from a storage driver domain, because pygrub runs in dom0 itself. Instead, all domUs that use pygrub must be switched to pvgrub.
In domU (Debian Jessie only)¶
Change from grub-legacy to pvgrub (based on grub2)
# apt-get install grub-xen
# mv /boot/grub/menu.lst /root/
# update-grub
In dom0 (Debian Jessie only)¶
Make sure the package grub-xen-host is installed first, then apply the following diff to the domU config:
--- a/xen/<domU-name>.cfg
+++ b/xen/<domU-name>.cfg
@@ -8,7 +8,7 @@
 #
-bootloader = '/usr/lib/xen-4.4/bin/pygrub'
+kernel = '/usr/lib/grub-xen/grub-x86_64-xen.bin'
 
 vcpus = '1'
 memory = '1024'
@@ -17,7 +17,6 @@ memory = '1024'
 #
 # Disk device(s).
 #
-root = '/dev/xvda2 ro'
 disk = [
     'phy:/dev/vg_raid1/<domU-name>-disk,xvda2,w',
     'phy:/dev/vg_raid1/<domU-name>-swap,xvda1,w',
For 32-bit domUs use /usr/lib/grub-xen/grub-i386-xen.bin.
Shut down the domU and restart it:
# xl shutdown <domU-name>
  (wait until it is down)
# xl create /etc/xen/<domU-name>.cfg -c
Setup zfshost¶
Installation¶
Install Debian Jessie as a Xen PV guest on LVM LVs from the dom0 (e.g. /dev/vg_raid1/zfshost-root and /dev/vg_raid1/zfshost-swap).
The disks that will be managed by ZFS are connected to a SATA controller exported to the domU with PCI export.
PCI export¶
Find the PCI ID of the SATA/SAS controller to export; in my case (on an HP MicroServer Gen8):
# lspci | fgrep AHCI
00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family SATA AHCI Controller (rev 05)
Add the following to the end of /etc/xen/zfshost.cfg to export the PCI device:
pci = [ '00:1f.2' ]
Hand the device over to the dom0 xen-pciback module
# echo xen-pciback >> /etc/modules
# modprobe xen-pciback
# xl pci-assignable-add 00:1f.2
For automatic handling of xl pci-assignable-add at reboot, see setup-pci-passthrough; a minimal alternative is sketched below.
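A minimal sketch (not taken from the linked page): re-run the command from dom0's /etc/rc.local, so the device becomes assignable again on every boot once the xen-pciback module from /etc/modules has loaded.

# in dom0's /etc/rc.local, before the final exit 0
xl pci-assignable-add 00:1f.2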
Setup xenstore¶
root@zfshost:~ # RUNLEVEL=1 apt-get install --no-install-recommends xen-utils-4.4
The Xen tools are normally used in a dom0; to use them in a storage driver domain we should disable the services that are only needed in a dom0:
root@zfshost:~ # systemctl disable xen.service
root@zfshost:~ # systemctl disable xendomains.service
Mount /proc/xen
root@zfshost:~ # mount -t xenfs xenfs /proc/xen
Also add the /proc/xen mount to /etc/rc.local, plus a xenstore-write call announcing that the storage domain is online (the dom0 will wait for this).
mount -t xenfs xenfs /proc/xen
xenstore-write /local/domain/`xenstore-read domid`/data/storage-online 1

exit 0
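To verify from the dom0 that the flag becomes visible once zfshost is running, something like this should work (a sketch using the standard xenstore-exists and xl domid tools):

root@dom0:~ # xenstore-exists /local/domain/$(xl domid zfshost)/data/storage-online && echo storage is online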
Install ZFS on Linux¶
Follow this guide: ZoL Debian
Creating the tank pool¶
Set up the disks with a GPT label (without adding any partitions); you can use gdisk for this.
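As a non-interactive alternative, sgdisk (from the same gdisk package) can wipe any old partition tables before the disks are handed to zpool create. A sketch, assuming sda and sdb are the disks that will go into the pool:

root@zfshost:~ # sgdisk --zap-all /dev/sda
root@zfshost:~ # sgdisk --zap-all /dev/sdb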
Create the pool with ashift set for Advanced Format disks (4k sector size); this will automatically partition the disks as well:
root@zfshost:~ # zpool create -o ashift=12 tank mirror sda sdb
As an alternative to sd[a-z] naming you can use "disk by-id" names (see /dev/disk/by-id/), as sketched below.
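A sketch of the same zpool create with by-id names; the device names here are placeholders, not my actual disks:

root@zfshost:~ # zpool create -o ashift=12 tank mirror \
    /dev/disk/by-id/ata-WDC_WD20EFRX-<serial-1> \
    /dev/disk/by-id/ata-WDC_WD20EFRX-<serial-2>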
After this the pool should be up and running (Note that I use "disk by-id" names)
root@zfshost:~ # zpool status
  pool: tank
 state: ONLINE
  scan: resilvered 240K in 0h0m with 0 errors on Sat Jul 11 23:58:12 2015
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            ata-WDC_WD20EFRX-.....  ONLINE       0     0     0
            ata-WDC_WD20EFRX-.....  ONLINE       0     0     0

errors: No known data errors
Set the pool to autoexpand in case you later add larger disks to the mirror:
root@zfshost:~ # zpool set autoexpand=on tank
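To check that the property took effect (plain zpool get, nothing specific to this setup):

root@zfshost:~ # zpool get autoexpand tank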
Switch to sysvinit¶
At reboot the pool will fail to import, with this error:
zpool[231]: cannot import 'tank': no such pool or dataset
The reason is that the ZFS services and tasks start in the wrong order.
I could not find a reliable solution for this on a systemd system (ZoL version 0.6.4-1.2-1), so I fell back to sysvinit instead:
root@zfshost:~ # apt-get install --purge -y sysvinit-core
Fix the getty startup on hvc0 in /etc/inittab:
--- a/inittab
+++ b/inittab
-1:2345:respawn:/sbin/getty 38400 tty1
-2:23:respawn:/sbin/getty 38400 tty2
-3:23:respawn:/sbin/getty 38400 tty3
-4:23:respawn:/sbin/getty 38400 tty4
-5:23:respawn:/sbin/getty 38400 tty5
-6:23:respawn:/sbin/getty 38400 tty6
+1:2345:respawn:/sbin/getty 38400 hvc0
+#2:23:respawn:/sbin/getty 38400 tty2
+#3:23:respawn:/sbin/getty 38400 tty3
+#4:23:respawn:/sbin/getty 38400 tty4
+#5:23:respawn:/sbin/getty 38400 tty5
+#6:23:respawn:/sbin/getty 38400 tty6
Patch xendomains init scripts¶
The following patch adds storage domain support to the xendomains start script
--- a/default/xendomains
+++ b/default/xendomains
@@ -58,3 +58,7 @@ XENDOMAINS_AUTO=/etc/xen/auto
 #
 XENDOMAINS_STOP_MAXWAIT=300
 
+# If using a storage domain its name should be supplied. The storage
+# domain will be started first and no other domains will start before it
+# is fully online.
+XENDOMAINS_STORAGE_DOM_NAME="zfshost"
diff --git a/init.d/xendomains b/init.d/xendomains
index 5fd5a5d..1ac35db 100755
--- a/init.d/xendomains
+++ b/init.d/xendomains
@@ -150,10 +150,38 @@ do_start_auto()
     done
 }
 
+start_storage()
+{
+    log_action_begin_msg "Starting Storage domain $XENDOMAINS_STORAGE_DOM_NAME"
+
+    out=$(xen create --quiet --defconfig "/etc/xen/${XENDOMAINS_STORAGE_DOM_NAME}.cfg" 2>&1 1>/dev/null)
+    case "$?" in
+        0)
+            log_action_end_msg 0
+            ;;
+        *)
+            log_action_end_msg 1
+            echo "$out"
+            ;;
+    esac
+
+    sleep 5
+    stor_dom=$(xen domid $XENDOMAINS_STORAGE_DOM_NAME)
+
+    log_action_begin_msg "Waiting for storage to come online (forever)."
+    until $(xenstore-exists /local/domain/${stor_dom}/data/storage-online)
+    do
+        sleep 2
+    done
+    log_action_end_msg 0
+}
+
 do_start()
 {
     declare -A domains
 
+    [ -n "$XENDOMAINS_STORAGE_DOM_NAME" ] && start_storage
+
     do_start_restore
     do_start_auto
 }
@@ -183,7 +211,7 @@ do_stop_shutdown()
 {
     while read id name rest; do
         log_action_begin_msg "Shutting down Xen domain $name ($id)"
-        xen shutdown $id 2>&1 1>/dev/null
+        xen shutdown --wait $id 2>&1 1>/dev/null
         log_action_end_msg $?
     done < <(/usr/lib/xen-common/bin/xen-init-list)
     while read id name rest; do
Moving existing domU data¶
Install netcat on dom0 and zfshost¶
# apt-get install netcat-openbsd
zvol topdir for XEN domUs¶
Create a top-level dataset for Xen domU storage with lz4 compression:
zfs create -o compression=lz4 tank/xen
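Datasets and zvols created under tank/xen will inherit the compression property; this can be checked with a recursive zfs get (standard usage):

root@zfshost:~ # zfs get -r compression tank/xen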
Creating zvol for domU swap¶
The block size should match the VM's system page size (for 64-bit Linux it is 4k).
Example:
root@zfshost:~ # zfs create -b 4k \
    -V <size>G \
    -o com.sun:auto-snapshot=false \
    tank/xen/<domU-name>-swap
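To confirm the block size afterwards (standard zfs get usage; the dataset name matches the example above):

root@zfshost:~ # zfs get volblocksize tank/xen/<domU-name>-swap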
LVM lv to zvol¶
WARNING: Transferring data the way it is done in this chapter is very fast, but it puts high stress on ZFS. When testing this on a storage domU with only 5GB RAM it resulted in a kernel panic caused by the system running out of memory. Tuning /proc/sys/vm/min_free_kbytes up to 128MB solved these problems for me.
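To apply the change at runtime without a reboot (the value matches the sysctl.conf entry below):

root@zfshost:~ # sysctl -w vm.min_free_kbytes=128000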
To make this persist across reboots, add the following to /etc/sysctl.conf:
# Make sure ZFS does not take all memory when stressed
vm.min_free_kbytes = 128000
Create a zvol for the non-swap disk:
root@zfshost:~ # zfs create -V <existing-lv-size>G tank/xen/<domU-name>-disk
Start netcat on zfshost
root@zfshost:~ # nc -l 2222 > /dev/zvol/tank/xen/<domU-name>-disk
Stop domU
root@dom0:~ # xl shutdown <domU-name>
Send data from dom0
root@dom0:~ # nc zfshost 2222 < /dev/vg_raid1/<domU-name>-disk
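If you want a progress indicator during the transfer, piping through pv is an option (a sketch, assuming the pv package is installed on the dom0):

root@dom0:~ # pv /dev/vg_raid1/<domU-name>-disk | nc zfshost 2222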
Patch the domU's .cfg file:
--- a/xen/<domU-name>.cfg
+++ b/xen/<domU-name>.cfg
@@ -18,8 +18,8 @@ memory = '512'
 # Disk device(s).
 #
 disk = [
-    'phy:/dev/vg_raid1/<domU-name>-disk,xvda2,w',
-    'phy:/dev/vg_raid1/<domU-name>-swap,xvda1,w',
+    'phy:/dev/zvol/tank/xen/<domU-name>-disk,xvda2,w,backend=zfshost',
+    'phy:/dev/zvol/tank/xen/<domU-name>-swap,xvda1,w,backend=zfshost',
 ]
Start the domU and attach the console. In pvgrub, add fsck.mode=force as a kernel parameter.
root@dom0:~ # xl create /etc/xen/<domU-name>.cfg -c
In the domU
root@domU:~ # mkswap /dev/xvda1
root@domU:~ # swapon -a
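To confirm that the swap device is active (standard tools, nothing specific to this setup):

root@domU:~ # swapon -s
root@domU:~ # free -m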
Appendix¶
Attaching volumes to domains (and dom0)¶
Example: attach a zvol to dom0 as /dev/xvdc1
root@dom0:~ # xl block-attach Domain-0 'format=raw,backendtype=phy,backend=zfshost,vdev=xvdc1,target=/dev/zvol/tank/xen/dom0'
Detaching volumes from domains (and dom0)¶
xl block-list does not work with disks from a storage driver domain; instead you need to look up the <DevId> in xenstore with xenstore-ls. After finding the right <DevId>, volumes can be detached as usual with xl block-detach <Domain> <DevId>.
Example for dom0
root@dom0:~ # xenstore-ls | fgrep -C2 /dev/zvol/tank/xen/dom0
  51745 = ""
   frontend = "/local/domain/0/device/vbd/51745"
   params = "/dev/zvol/tank/xen/dom0"
   script = "/etc/xen/scripts/block"
   frontend-id = "0"
root@dom0:~ # xl block-detach Domain-0 51745