ZFS

This guide is based on the FreeBSD ZFS Tuning Guide by Ivan Voras et al.

About ZFS

ZFS (originally the Zettabyte File System) was developed by Sun Microsystems for Solaris.

Tuning Guide

To use ZFS, at least 1GB of memory is recommended (for all architectures), but more is helpful because ZFS needs a lot of memory. Depending on your workload, it may be possible to use ZFS on systems with less memory, but this requires careful tuning to avoid panics from memory exhaustion in the kernel. An amd64 system is preferred because of its larger address space and better performance on 64-bit variables, which ZFS uses heavily.

MidnightBSD 0.3-CURRENT includes ZFS version 6.

By default, the kmem address space (the one used by the in-kernel malloc(9)) on i386 is limited to a maximum of about 300MB, which is far too low for ZFS. If you don't configure it manually, you will eventually see a kernel panic resembling the following in /var/log/messages:

Apr 7 21:09:07 nas savecore: reboot after panic: kmem_malloc(114688): kmem_map too small: 324825088 total allocated

For every architecture you should increase it to at least 512MB. You can do it by adding:

vm.kmem_size="512M"
vm.kmem_size_max="512M"

to your /boot/loader.conf file. It has been reported that even this configuration can panic a system in less than a minute (for example, when copying files from an NFS server connected via a gigabit crossover link):

Apr 8 06:46:08 nas savecore: reboot after panic: kmem_malloc(131072): kmem_map too small: 528273408 total allocated

It is therefore recommended to either increase the limit further or reduce the size of ZFS's ARC (adaptive replacement cache):

vfs.zfs.arc_max="100M"
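The panic messages report bytes while loader.conf uses suffixed sizes, so a small helper can make the two comparable. The following is a minimal sketch, not part of the original guide: the function names to_bytes and panic_bytes are made up for illustration, and only the K, M and G suffixes are handled.

```shell
#!/bin/sh
# Convert a loader.conf style size such as "512M" to bytes.
# Hypothetical helper: handles only K, M and G suffixes.
to_bytes() {
    num=${1%[KkMmGg]}
    case "$1" in
        *[Kk]) echo $((num * 1024)) ;;
        *[Mm]) echo $((num * 1024 * 1024)) ;;
        *[Gg]) echo $((num * 1024 * 1024 * 1024)) ;;
        *)     echo "$num" ;;
    esac
}

# Extract the "total allocated" byte count from a kmem_map panic line on stdin.
panic_bytes() {
    sed -n 's/.*kmem_map too small: \([0-9][0-9]*\) total allocated.*/\1/p'
}

# Example using the second panic message quoted above.
line='Apr 8 06:46:08 nas savecore: reboot after panic: kmem_malloc(131072): kmem_map too small: 528273408 total allocated'
used=$(printf '%s\n' "$line" | panic_bytes)
limit=$(to_bytes 512M)
echo "allocated $((used / 1048576)) MB of a $((limit / 1048576)) MB kmem limit"
```

For the quoted message this reports 503 MB allocated against the 512 MB limit, which shows how close the map was to its configured size when the panic occurred.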

Every 32-bit i386 machine running ZFS needs tuning to improve stability against out-of-memory kernel panics. Unfortunately, ZFS is very memory hungry by design and it looks like the default kernel memory configuration is too conservative for ZFS's liking.

Heavy IO activity between ZFS and another file system (like rsyncing between ZFS and UFS or between ZFS and NFS) may result in a deadlock.

Symptoms: processes that try to do IO on a ZFS file system get stuck forever in the "zfs" state (WCHAN), while other file systems (e.g. UFS) keep working.

References: http://lists.freebsd.org/pipermail/freebsd-fs/2008-February/004391.html , http://lists.freebsd.org/pipermail/freebsd-stable/2008-January/040047.html
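To check for the deadlock symptom described above, the wait channel column of ps can be filtered. This is a rough sketch under the assumption that ps is run with the pid, wchan and comm output keywords; the function name zfs_stuck and the sample data are made up for illustration. A process that shows "zfs" only briefly is not necessarily deadlocked, so repeat the check a few times.

```shell
#!/bin/sh
# Keep the header line and every process whose wait channel (column 2) is "zfs".
# Expects input in the format of: ps -ax -o pid,wchan,comm
zfs_stuck() {
    awk 'NR == 1 || $2 == "zfs"'
}

# On the affected machine: ps -ax -o pid,wchan,comm | zfs_stuck
# Demonstration with made-up sample output:
printf '%s\n' \
    '  PID WCHAN  COMMAND' \
    '  812 zfs    rsync' \
    '  815 select sshd' \
    '  901 zfs    cp' | zfs_stuck
```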

Heavy IO activity in multithreaded applications (like file system benchmarks) can provoke a panic.

Symptoms: the kernel panics in txg_*.

References: http://lists.freebsd.org/pipermail/freebsd-stable/2008-March/040943.html

ZFS file systems don't start on boot in single-user mode.

Symptoms: ZFS file systems are not started / mounted when the machine is booted in single-user mode.

References: http://lists.freebsd.org/pipermail/freebsd-current/2007-April/071462.html

Workaround: see the referenced post.

Swapping on ZFS doesn't work. Reference: http://lists.freebsd.org/pipermail/freebsd-current/2007-September/076831.html

Workarounds

It has been suggested that, in addition to memory tuning, adding

    vfs.zfs.zil_disable=1
    vfs.zfs.prefetch_disable=1

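to loader.conf significantly improves stability. This might work, but disabling the ZIL can have serious consequences depending on your workload: synchronous writes that an application (for example a database or an NFS client) believes are committed can be lost after a crash or power failure.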

The same system that panicked with a kmem_size of 512 MB required the following settings (and a kernel recompile, see the i386 notes below) with the default ARC configuration to achieve stable operation:

vm.kmem_size="1536M"
vm.kmem_size_max="1536M"

The issue of kernel memory exhaustion is a complex one, involving the interaction between disk speeds, application loads and the special caching ZFS does. Faster drives will write the cached data faster but will also fill the caches up faster. Generally, larger and faster drives will need more memory for ZFS.

i386

On i386 systems you will need to recompile your kernel with an increased KVA_PAGES option to enlarge the kernel address space before vm.kmem_size can be increased beyond 512M. Add the following line to your kernel configuration file to increase the space available for vm.kmem_size to at least 1 GB:

options KVA_PAGES=512

By default the kernel receives 1GB of the 4GB of address space available on the i386 architecture, and this is used for all of the kernel address space needs, not just the kmem map. By increasing KVA_PAGES you can allocate a larger proportion of the 4GB address space to the kernel (2 GB in the above example), allowing more room to increase vm.kmem_size. The trade-off is that user applications have less address space available, and some programs (e.g. those that rely on mapping data at a fixed address that is now in the kernel address space, or which require close to the full 3GB of address space themselves) may no longer run.
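The address space split described above can be worked out with simple arithmetic. The sketch below assumes a non-PAE i386 kernel, where each KVA_PAGES unit covers 4 MB of kernel address space (which matches the 1 GB default and the 2 GB example in the text); the function name kva_split is made up for illustration.

```shell
#!/bin/sh
# Show how a KVA_PAGES value divides the 4 GB (4096 MB) i386 address space.
# Assumption: non-PAE kernel, 4 MB of kernel address space per KVA_PAGES unit.
kva_split() {
    kernel_mb=$(( $1 * 4 ))
    user_mb=$(( 4096 - kernel_mb ))
    echo "KVA_PAGES=$1: kernel ${kernel_mb} MB, user ${user_mb} MB"
}

kva_split 256   # the default: 1 GB for the kernel, 3 GB for user programs
kva_split 512   # the example above: 2 GB for the kernel, 2 GB for user programs
```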

For *really* memory-constrained systems it is also recommended to strip as many unused drivers and options from the kernel as possible (which will free a couple of MB of memory). A stable configuration with vm.kmem_size="1536M" has been reported using an unmodified kernel, relatively sparse drivers as required for the hardware and options KVA_PAGES=512.

Some workloads need a greatly reduced ARC size and VDEV cache size. ZFS manages the ARC with multiple threads. If ZFS needs more memory for the ARC, it allocates it, and in doing so it can, and usually does, exceed arc_max (vfs.zfs.arc_max); another thread within ZFS periodically frees memory allocated to the ARC once arc_max has been exceeded. Therefore, even with a small arc_max, the ARC can exceed kmem_size_max and panic the system. On memory-constrained systems it is safer to use an arbitrarily low arc_max. For example, it is possible to set vm.kmem_size and vm.kmem_size_max to 512M and vfs.zfs.arc_max to 160M, keeping vfs.zfs.vdev.cache.size at half its default size of 10MB (setting it to 5MB anecdotally achieves even better stability).

There is one example (CySchubert) of ZFS running nicely on a laptop with 768MB of physical RAM with the following settings:

vm.kmem_size="330M"
vm.kmem_size_max="330M"
vfs.zfs.arc_max="40M"
vfs.zfs.vdev.cache.size="5M"
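To compare configurations such as the two above, it can help to express arc_max as a share of kmem_size. This is only an illustrative calculation; the guide does not state a target ratio, and the function name arc_share is made up.

```shell
#!/bin/sh
# Print vfs.zfs.arc_max as a whole-number percentage of vm.kmem_size.
# Both arguments are in MB: kmem_size first, then arc_max.
arc_share() {
    echo "arc_max $2 MB is $(( $2 * 100 / $1 ))% of kmem_size $1 MB"
}

arc_share 512 160   # the 512M example above
arc_share 330 40    # the laptop example above
```

The percentages are truncated to whole numbers, so treat them as a rough guide when deciding how much headroom to leave for the ARC overshoot described earlier.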

Kernel memory should be monitored while tuning to ensure a comfortable amount of free kernel address space. The following script will summarize kernel memory utilization and assist in tuning arc_max and VDEV cache size.

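#!/bin/sh
# TEXT: sum of the hexadecimal module sizes listed by kldstat, in bytes.
TEXT=`kldstat | awk 'BEGIN {print "16i 0";} NR>1 {print toupper($4) "+"} END {print "p"}' | dc`
# DATA: sum of the per-type memory use (in KB) listed by vmstat -m, in bytes.
DATA=`vmstat -m | sed -Ee '1s/.*/0/;s/.* ([0-9]+)K.*/\1+/;$s/$/1024*p/' | dc`
TOTAL=$((DATA + TEXT))

# Print each figure in bytes and in MB.
echo TEXT=$TEXT, `echo $TEXT | awk '{print $1/1048576 " MB"}'`
echo DATA=$DATA, `echo $DATA | awk '{print $1/1048576 " MB"}'`
echo TOTAL=$TOTAL, `echo $TOTAL | awk '{print $1/1048576 " MB"}'`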
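To judge how comfortable the remaining headroom is, the total can be compared against the configured vm.kmem_size. The following is a rough sketch, not part of the original script: the function name kmem_percent is made up, and the percentage should be read as an approximate indicator only.

```shell
#!/bin/sh
# Print kernel memory use as a whole-number percentage of vm.kmem_size.
# Arguments: bytes in use, then the kmem size in bytes.
kmem_percent() {
    echo $(( $1 * 100 / $2 ))
}

# On a running system (assumes TOTAL was set by the script above):
#   kmem_percent "$TOTAL" "$(sysctl -n vm.kmem_size)"
# Demonstration with fixed numbers: 256 MB in use out of 512 MB.
kmem_percent 268435456 536870912
```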

Note: There may be a more precise way to calculate or measure how large a vm.kmem_size setting can be used with a particular kernel, but the authors of this wiki do not know it. Experimentation does work. However, if you set vm.kmem_size too high in loader.conf, the kernel will panic on boot. You can fix this by dropping to the boot loader prompt and typing set vm.kmem_size="512M" (or another smaller value known to work).

The vm.kmem_size_max setting is not used directly during system operation (i.e. it is not a limit which kmem can "grow" into) but for the initial autoconfiguration of various system settings, the most important of which for this discussion is the ARC size. If kmem_size and arc_max are tuned manually, kmem_size_max will be ignored.

amd64

Kernel memory usage (vm.kmem_size) should be increased to around 1 GB and ARC size reduced:

vm.kmem_size_max="1024M"
vm.kmem_size="1024M"
vfs.zfs.arc_max="100M"

This might help if the machine is also loaded with other tasks, such as network activity (a file server), etc. Tuning KVA_PAGES is not required on amd64.

To increase performance, you may increase kern.maxvnodes (/etc/sysctl.conf) way up if you have the RAM for it (e.g. 400000 for a 2GB system). Keep an eye on vfs.numvnodes during production to see where it stabilizes. AMD64 uses direct mapping for vnodes, so you don't have to worry about address space for vnodes on this architecture (as opposed to i386).
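Watching where vfs.numvnodes settles relative to kern.maxvnodes can be reduced to a one-line report. This is a minimal sketch; the function name vnode_usage and the sample figures are made up for illustration.

```shell
#!/bin/sh
# Report vnode usage given the current count (vfs.numvnodes)
# and the configured limit (kern.maxvnodes).
vnode_usage() {
    echo "vnodes: $1 of $2 ($(( $1 * 100 / $2 ))%)"
}

# On a running system:
#   vnode_usage "$(sysctl -n vfs.numvnodes)" "$(sysctl -n kern.maxvnodes)"
# Demonstration with made-up figures and the 400000 limit mentioned above:
vnode_usage 310000 400000
```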

Increase the address space on AMD64 (patch from Artem Belevich for FreeBSD)

diff -r bbee47b28f7f amd64/include/vmparam.h
--- a/amd64/include/vmparam.h   Sat Jan 31 21:03:53 2009 -0800
+++ b/amd64/include/vmparam.h   Sat May 02 16:25:42 2009 -0700
@@ -149,7 +149,7 @@
  */
 
 #define        VM_MAX_KERNEL_ADDRESS   KVADDR(KPML4I, NPDPEPG-1, NPDEPG-1, NPTEPG-1)
-#define        VM_MIN_KERNEL_ADDRESS   KVADDR(KPML4I, NPDPEPG-6, 0, 0)
+#define        VM_MIN_KERNEL_ADDRESS   KVADDR(KPML4I, NPDPEPG-16, 0, 0)
 
 #define        DMAP_MIN_ADDRESS        KVADDR(DMPML4I, 0, 0, 0)
 #define        DMAP_MAX_ADDRESS        KVADDR(DMPML4I+1, 0, 0, 0)
diff -r bbee47b28f7f kern/kern_malloc.c
--- a/kern/kern_malloc.c        Sat Jan 31 21:03:53 2009 -0800
+++ b/kern/kern_malloc.c        Sat May 02 16:25:42 2009 -0700
@@ -181,16 +181,16 @@
  */
 static uma_zone_t mt_zone;
 
-u_int vm_kmem_size;
-SYSCTL_UINT(_vm, OID_AUTO, kmem_size, CTLFLAG_RD, &vm_kmem_size, 0,
+u_long vm_kmem_size;
+SYSCTL_ULONG(_vm, OID_AUTO, kmem_size, CTLFLAG_RD, &vm_kmem_size, 0,
     "Size of kernel memory");
 
-u_int vm_kmem_size_min;
-SYSCTL_UINT(_vm, OID_AUTO, kmem_size_min, CTLFLAG_RD, &vm_kmem_size_min, 0,
+u_long vm_kmem_size_min;
+SYSCTL_ULONG(_vm, OID_AUTO, kmem_size_min, CTLFLAG_RD, &vm_kmem_size_min, 0,
     "Minimum size of kernel memory");
 
-u_int vm_kmem_size_max;
-SYSCTL_UINT(_vm, OID_AUTO, kmem_size_max, CTLFLAG_RD, &vm_kmem_size_max, 0,
+u_long vm_kmem_size_max;
+SYSCTL_ULONG(_vm, OID_AUTO, kmem_size_max, CTLFLAG_RD, &vm_kmem_size_max, 0,
     "Maximum size of kernel memory");
 
 u_int vm_kmem_size_scale;
@@ -589,7 +589,7 @@
 #if defined(VM_KMEM_SIZE_MIN)
        vm_kmem_size_min = VM_KMEM_SIZE_MIN;
 #endif
-       TUNABLE_INT_FETCH("vm.kmem_size_min", &vm_kmem_size_min);
+       TUNABLE_ULONG_FETCH("vm.kmem_size_min", &vm_kmem_size_min);
        if (vm_kmem_size_min > 0 && vm_kmem_size < vm_kmem_size_min) {
                vm_kmem_size = vm_kmem_size_min;
        }
@@ -597,17 +597,19 @@
 #if defined(VM_KMEM_SIZE_MAX)
        vm_kmem_size_max = VM_KMEM_SIZE_MAX;
 #endif
-       TUNABLE_INT_FETCH("vm.kmem_size_max", &vm_kmem_size_max);
+       TUNABLE_ULONG_FETCH("vm.kmem_size_max", &vm_kmem_size_max);
        if (vm_kmem_size_max > 0 && vm_kmem_size >= vm_kmem_size_max)
                vm_kmem_size = vm_kmem_size_max;
 
        /* Allow final override from the kernel environment */
 #ifndef BURN_BRIDGES
-       if (TUNABLE_INT_FETCH("kern.vm.kmem.size", &vm_kmem_size) != 0)
+       if (TUNABLE_ULONG_FETCH("kern.vm.kmem.size", &vm_kmem_size) != 0)
                printf("kern.vm.kmem.size is now called vm.kmem_size!\n");
 #endif
-       TUNABLE_INT_FETCH("vm.kmem_size", &vm_kmem_size);
+       TUNABLE_ULONG_FETCH("vm.kmem_size", &vm_kmem_size);
 
+#if 0  /* don't enforce kmem size limit */
+       
        /*
         * Limit kmem virtual size to twice the physical memory.
         * This allows for kmem map sparseness, but limits the size
@@ -616,7 +618,8 @@
         */
        if (((vm_kmem_size / 2) / PAGE_SIZE) > cnt.v_page_count)
                vm_kmem_size = 2 * cnt.v_page_count * PAGE_SIZE;
-
+#endif
+       
        /*
         * Tune settings based on the kmem map's size at this time.
         */
diff -r bbee47b28f7f vm/vm_kern.h
--- a/vm/vm_kern.h      Sat Jan 31 21:03:53 2009 -0800
+++ b/vm/vm_kern.h      Sat May 02 16:25:42 2009 -0700
@@ -69,6 +69,6 @@
 extern vm_map_t kmem_map;
 extern vm_map_t exec_map;
 extern vm_map_t pipe_map;
-extern u_int vm_kmem_size;
+extern u_long vm_kmem_size;
 
 #endif                         /* _VM_VM_KERN_H_ */