Blog
December 29, 2019
Zettabyte File System Explained
In this article, I will strive to answer many of the questions that have been asked about ZFS: what is it, why should I use it, and what can I do with it? Let's begin:
What are some of the attributes of ZFS?
- ZFS is a fully-featured filesystem
- Does data integrity checking
- Uses snapshots
- Created by Sun Microsystems, forked by Oracle
- Oracle version is less full featured
- OpenZFS - open source version of ZFS
- Feeds into FreeBSD, illumos, ZFSonLinux, Canonical
What makes ZFS special?
- Brings ideas from well-understood standard userland tools into the filesystem itself
- Checksums everything
- Metadata abounds
- Uses Compression
- zfs diff compares snapshots, much like diff(1)
- Copy-on-Write (COW)
What is Copy-on-Write?
- ZFS never changes a written disk sector
- A sector changes? Allocate a new sector. Write data to it
- Data on disk is always coherent
- Power loss half-way through a write? Old data is still there untouched. Version control at the disk level
- Interesting side-effect: you get effectively free snapshots
ZFS Assumptions?
- ZFS is not your typical EXT/UFS filesystem
- Traditional assumptions about filesystems will come back to haunt you
- Non-ZFS tools like dump will appear to work, but not really
ZFS Hardware?
- RAID Controllers -- Absolutely NOT!
- ZFS expects raw disk access
- RAID controller in JBOD or single-disk RAID0?
- RAM -- ECC?
- Disk redundancy
ZFS Terminology
- VDEV or Virtual Device - a group of storage providers
- Pool - a group of identical VDEVs
- Dataset - a named chunk of data on a pool
- You can arrange data in a pool any way that you desire
- -f switch is very important (be careful how you use it)
Virtual Devices (VDEVs) and Pools
- Basic unit of storage in ZFS
- All ZFS redundancy occurs at the virtual device level
- Can be built out of any storage provider
- Most common providers: disk or GPT partition
- Could be FreeBSD crypto device
- Linux LVM (Logical Volume Manager) RAID volume
- A Pool contains only one type of VDEV
- "X VDEV" and "X Pool" get used interchangeably
- VDEVs are added to Pools
- You grow a Pool by adding VDEVs, not by adding providers to an existing VDEV
Stripe VDEV/Pool
- Each disk is its own VDEV
- Data is striped across all VDEVs in the Pool
- Can add striped VDEVs to grow Pools
- No redundancy. Absolutely none. Nada!
- No self-healing
- Set copies=2 to get some self-healing; it must be set before data is written (see the example below)
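A minimal sketch of a striped pool with self-healing copies (the pool name and provider labels are illustrative):
- # zpool create -O copies=2 scratch gpt/zfs0 gpt/zfs1
- The -O flag sets the copies property on the pool's root dataset at creation time, before any data is written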
Mirror VDEV/Pool
- Each VDEV contains multiple disks that replicate the data of all other disks in the VDEV
- A Pool with multiple mirror VDEVs is analogous to RAID-10 (a stripe over mirrors)
- Can add more mirror VDEVs to grow Pool
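A two-VDEV mirror pool, the ZFS equivalent of RAID-10, might look like this (names are illustrative):
- # zpool create db mirror gpt/zfs0 gpt/zfs1 mirror gpt/zfs2 gpt/zfs3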
RAIDZ VDEV/Pool
- Each VDEV contains multiple disks
- Data integrity maintained via parity (similar to RAID-5)
- Lose a disk - No data loss
- Can self-heal via redundant checksums
- RAIDZ Pool can have multiple identical VDEVs
- Cannot expand the size of a RAIDZ VDEV by adding more disks
RAIDZ Types
- RAID-Z1
- 3+ Disks
- Can lose 1 disk/VDEV
- RAID-Z2
- 4+ Disks
- Can lose 2 disks/VDEV
- RAID-Z3
- 5+ Disks
- Can lose 3 disks/VDEV
- Disk capacity has grown far faster than disk access speed, so rebuilds take longer and extra parity buys a safety margin
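For example, a six-disk RAID-Z2 VDEV that can lose any two disks (names are illustrative):
- # zpool create trinity raidz2 gpt/zfs0 gpt/zfs1 gpt/zfs2 gpt/zfs3 gpt/zfs4 gpt/zfs5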
Number of Disks and Pools?
- No more than 9 - 12 Disks per VDEV
- Pool size is your choice
- Avoid putting everything in one massive Pool
- Best practice is to put OS in one mirrored Pool, and data in a separate Pool
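A minimal sketch of that layout, leaving boot-related setup aside (pool names and labels are illustrative):
- # zpool create zroot mirror gpt/os0 gpt/os1
- # zpool create data raidz2 gpt/data0 gpt/data1 gpt/data2 gpt/data3 gpt/data4 gpt/data5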
RAIDZ vs. Traditional RAID
- ZFS combines filesystem and Volume Manager - faster recovery
- Traditional parity RAID suffers from the "write hole" when power fails mid-write
- Copy-on-Write avoids it: ZFS never modifies a block in place, only writes new blocks
Create Striped Pools
- Each VDEV is a single disk
- No special label for VDEV of striped disk
- # zpool create trinity gpt/zfs0 gpt/zfs1 gpt/zfs2 gpt/zfs3 gpt/zfs4
Viewing Stripe/Mirror/RAIDZ Pool Results
- Use # zpool status
Multi-VDEV RAIDZ
- Stripes are inherently multi-VDEV
- There's no traditional RAID equivalent
- Use type keyword multiple times
- # zpool create trinity raidz1 gpt/zfs0 gpt/zfs1 gpt/zfs2 raidz1 gpt/zfs3 gpt/zfs4 gpt/zfs5
Malformed Pool Example
- # zpool create trinity raidz1 gpt/zfs0 gpt/zfs1 gpt/zfs2 mirror gpt/zfs3 gpt/zfs4 gpt/zfs5
- receives an "invalid vdev specification" message
- Don't use -f here; ZFS will let you, but you shouldn't
- Mixing a mirror VDEV into a RAIDZ Pool is a no-go!
Reusing Providers
- # zpool create db gpt/zfs1 gpt/zfs2 gpt/zfs3 gpt/zfs4
- ZFS refuses: /dev/gpt/zfs3 is part of exported pool 'db'
- If that old pool really is dead, the use of -f here is appropriate and essential
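Once you are certain the old pool is gone, forcing the reuse looks like this:
- # zpool create -f db gpt/zfs1 gpt/zfs2 gpt/zfs3 gpt/zfs4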
Pool Integrity
- ZFS is self-healing at the Pool and VDEV level
- Parity allows data to be rebuilt
- Every block is hashed; hash is stored in the parent
- Data integrity is checked as the data is accessed on the disk
- A Scrub checks every block of the live filesystem without taking it offline
- If you don't have VDEV redundancy, use dataset copies property
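For example, to check a live pool, and to add block-level redundancy to one important dataset when you lack VDEV redundancy (the dataset name is illustrative):
- # zpool scrub trinity
- # zfs set copies=2 trinity/important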
Scrub vs fsck
- ZFS has no offline integrity checker
- ZFS scrub does everything that fsck does, and more
- You can offline your Pool to scrub, but why would you?
- Scrub isn't perfect, but it's better than fsck
Pool Properties
- Properties are tunables
- Both Pools and Datasets have properties
- Commands: zpool set, and zpool get
- Some are read-only
- # zpool get all | less
Changing Pool Properties
- # zpool set comment="Main OS Files" zroot
- # zfs set copies=2 zroot (copies is a dataset property, so it is set with zfs rather than zpool)
Pool History
- # zpool history zroot
ZPool Feature Flags
- ZFS originally used simple version numbers
- Then Oracle assimilated Sun and closed development
- OpenZFS switched to feature flags; pools report version 5000
- Feature flag support varies by operating system
- # zpool get all trinity | grep feature
Datasets
- A named chunk of data
- Filesystems
- Volume
- Snapshot
- Clone
- Bookmark
- Properties and features work on a per-dataset basis
- # zfs list -r zroot/ROOT
Creating Datasets
- # zfs create zroot/var/mysql
- # zfs create -V 4G zroot/vmware (-V creates a volume, a.k.a. zvol)
Destroying Datasets
- # zfs destroy zroot/var/old-mysql
- -v -- verbose mode
- -n -- no-op flag
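Combining those flags gives a dry run before the real thing (dataset name from the example above):
- # zfs destroy -vn zroot/var/old-mysql (preview what would be destroyed)
- # zfs destroy -v zroot/var/old-mysql (actually destroy it)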
Parent-Child Relationships
- Datasets inherit their parent's properties
- If you change a property locally but want to fall back to the parent's inherited value, use zfs inherit
- Renaming a Dataset changes its inheritance
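For example, to drop a local compression override and fall back to the parent's value (the dataset name is illustrative):
- # zfs inherit compression zroot/var/mysql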
Pool Repair & Maintenance
- Resilvering
- Rebuild from parity
- Uses VDEV redundancy data
- No redundancy? No resilvering
- Throttled by Disk I/O
- Happens automatically when disk is replaced
- Can add VDEVs to Pools, not disks to VDEV
- Be cautious of slightly smaller disks (check the exact sector count; it can vary between disks of nominally equal capacity)
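A typical disk replacement, which starts the resilver automatically (pool and provider names are illustrative):
- # zpool replace trinity gpt/zfs3 gpt/zfs9
- # zpool status trinity (watch the resilver progress)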
Add VDEV to Pool
- New VDEVs must be identical to existing VDEVs in the Pool
- # zpool add scratch gpt/zfs99
- # zpool add db mirror gpt/zfs6 gpt/zfs7
- # zpool add trinity raidz1 gpt/zfs3 gpt/zfs4 gpt/zfs5
Hardware States in ZFS
- ONLINE -- operating normally
- DEGRADED -- at least one storage provider has failed
- FAULTED -- generated too many errors
- UNAVAIL -- cannot open storage provider
- OFFLINE -- storage provider has been shut down
- REMOVED -- hardware detection of unplugged device
- Errors percolate up through the ZFS stack
- Hardware RAID hides errors - ZFS does not!
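A quick health check that reports only pools with problems:
- # zpool status -x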
Log and Cache Devices
- Read Cache -- L2ARC (Level 2 Adaptive Replacement Cache)
- Synchronous Write Log -- ZIL, SLOG (ZFS Intent Log, Separate Log Device)
- Where is the bottleneck?
- Log/Cache Hardware
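Adding a separate log device and a cache device to an existing pool looks roughly like this (provider labels are illustrative):
- # zpool add trinity log gpt/slog0
- # zpool add trinity cache gpt/cache0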
Filesystem Compression
- Compression exchanges CPU time for disk I/O
- Disk I/O is very limited
- CPU time is plentiful
- LZ4 by default
- Enable compression before writing any data
- # zfs set compression=lz4 zroot
- gzip-9 compresses harder than lz4, but at a much higher CPU cost; lz4 is the sensible default
- No more userland log compression
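To see how well compression is actually doing on a dataset:
- # zfs get compressratio zroot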
Memory Cache Compression
- The ARC (Adaptive Replacement Cache) is ZFS's buffer cache
- ARC compression exchanges CPU time for memory
- Memory can be somewhat limited
- CPU time is plentiful
- ZFS ARC auto compresses what can be compressed
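On FreeBSD you can get a rough idea of how well the ARC is compressing, assuming your release exposes these arcstats sysctls:
- # sysctl kstat.zfs.misc.arcstats.compressed_size kstat.zfs.misc.arcstats.uncompressed_size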
Deduplication (Dedup)
- ZFS deduplication isn't as good as you would imagine it is
- Only duplicates identical filesystem blocks
- Most data is not ZFS deduplicable
- 1 TB of dedup'd data needs roughly 5 GB of RAM for the dedup table
- System RAM should be about 4x the dedup table size (see the worked example below)
- Effectiveness: run zdb -S zroot, check dedup column
- Cost-effective ZFS dedup just doesn't exist
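Working through that rule of thumb: 10 TB of dedup'd data needs roughly 50 GB of RAM for the dedup table, which means a system with around 200 GB of RAM before dedup is even worth considering.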
Snapshots
- # zfs snapshot trinity/home@<today's date>
- # zfs list -t snapshot
- # zfs snapshot -r zroot@<today's date> (recursively snapshots every dataset in the pool)
- Access snapshots in hidden .zfs directory (especially when using ZFSonLinux)
- # zfs destroy trinity/home@<today's date>
- Use -vn in destroy operations
Snapshot Disk Use
- Delete file from live filesystem
- Blocks in a snapshot remain in use
- Blocks are freed only when no snapshot uses them
Roll Back
- Can roll back a filesystem to the most recent snapshot
- # zfs rollback zroot/ROOT@<before upgrade>
- Newer data is destroyed
Clones
- A read-write copy of a snapshot
- # zfs clone zroot/var/mysql@<today> zroot/var/mysql-test
- Run a test, then discard afterward
Boot Environments
- Built on clones and snapshots
- Snapshot root filesystem dataset before an upgrade
- If the upgrade goes awry, roll back!
- FreeBSD: sysutils/beadm
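With sysutils/beadm installed, the workflow is roughly as follows (the boot environment name is made up):
- # beadm create pre-upgrade
- Run the upgrade; if it goes wrong, activate the saved environment and reboot:
- # beadm activate pre-upgrade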
ZFS Send/Receive
- Move whole filesystems to another host
- Blows rsync out of the water
- Resumable
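A minimal sketch, with the host and dataset names made up for illustration:
- # zfs snapshot zroot/home@migrate
- # zfs send zroot/home@migrate | ssh backuphost zfs receive tank/home
- For resumable transfers, receive with -s and resume an interrupted stream with zfs send -t <token>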