Blog
December 29, 2019
Zettabyte File System Explained
In this article, I will strive to answer many of the questions that have been asked about ZFS: what is it, why should I use it, and what can I do with it? Let's begin:
What are some of the attributes of ZFS?
- ZFS is a fully-featured filesystem
- Does data integrity checking
- Uses snapshots
- Created by Sun Microsystems, forked by Oracle
- Oracle version is less full featured
- OpenZFS - open source version of ZFS
- Feeds into FreeBSD, illumos, ZFSonLinux, Canonical
What makes ZFS special?
- Brings ideas from well-understood standard userland tools into the filesystem itself
- Checksums everything
- Metadata abounds
- Uses Compression
- zfs diff compares snapshots, much like diff(1)
- Copy-on-Write (COW)
What is Copy-on-Write?
- ZFS never changes a written disk sector
- A sector changes? Allocate a new sector. Write data to it
- Data on disk is always coherent
- Power loss half-way through a write? Old data is still there untouched. Version control at the disk level
- Interesting side-effect: you get effectively free snapshots
ZFS Assumptions?
- ZFS is not your typical EXT/UFS filesystem
- Traditional assumptions about filesystems will come back to haunt you
- Non-ZFS tools like dump will appear to work, but not really
ZFS Hardware?
- RAID Controllers -- Absolutely NOT!
- ZFS expects raw disk access
- RAID controller in JBOD or single-disk RAID0?
- RAM -- ECC?
- Disk redundancy
ZFS Terminology
- VDEV or Virtual Device - a group of storage providers
- Pool - a group of identical VDEVs
- Dataset - a named chunk of data on a pool
- You can arrange data in a pool any way that you desire
- -f switch is very important (be careful how you use it)
Virtual Devices (VDEVs) and Pools
- Basic unit of storage in ZFS
- All ZFS redundancy occurs at the virtual device level
- Can be built out of any storage provider
- Most common providers: disk or GPT partition
- Could be FreeBSD crypto device
- Linux LVM (Logical Volume Manager) RAID volume
- A Pool contains only one type of VDEV
- "X VDEV" and "X Pool" get used interchangeably
- VDEVs are added to Pools
- You grow a Pool by adding VDEVs, not by adding providers to an existing VDEV
Stripe VDEV/Pool
- Each disk is its own VDEV
- Data is striped across all VDEVs in the Pool
- Can add striped VDEVs to grow Pools
- No redundancy. Absolutely none. Nada!
- No self-healing
- Set copies=2 to get some self-healing; it must be set before data is written (see the example below)
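A minimal sketch of a striped pool with self-healing copies (the pool name and provider labels are illustrative):
- # zpool create -O copies=2 scratch gpt/zfs0 gpt/zfs1
- The -O flag sets the copies property on the pool's root dataset at creation time, before any data is written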
Mirror VDEV/Pool
- Each VDEV contains multiple disks that replicate the data of all other disks in the VDEV
- A Pool with multiple mirror VDEVs is analogous to RAID-10 (a stripe over mirrors)
- Can add more mirror VDEVs to grow Pool
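A two-VDEV mirror pool, the ZFS equivalent of RAID-10, might look like this (names are illustrative):
- # zpool create db mirror gpt/zfs0 gpt/zfs1 mirror gpt/zfs2 gpt/zfs3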
RAIDZ VDEV/Pool
- Each VDEV contains multiple disks
- Data integrity maintained via parity (similar to RAID-5)
- Lose a disk - No data loss
- Can self-heal via redundant checksums
- RAIDZ Pool can have multiple identical VDEVs
- Cannot expand the size of a RAIDZ VDEV by adding more disks
RAIDZ Types
- RAID-Z1
- 3+ Disks
- Can lose 1 disk/VDEV
- RAID-Z2
- 4+ Disks
- Can lose 2 disks/VDEV
- RAID-Z3
- 5+ Disks
- Can lose 3 disks/VDEV
- Disk capacity has grown far faster than disk access speed, so rebuilds take longer and extra parity buys a safety margin
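For example, a six-disk RAID-Z2 VDEV that can lose any two disks (names are illustrative):
- # zpool create trinity raidz2 gpt/zfs0 gpt/zfs1 gpt/zfs2 gpt/zfs3 gpt/zfs4 gpt/zfs5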
Number of Disks and Pools?
- No more than 9 - 12 Disks per VDEV
- Pool size is your choice
- Avoid putting everything in one massive Pool
- Best practice is to put OS in one mirrored Pool, and data in a separate Pool
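A minimal sketch of that layout, leaving boot-related setup aside (pool names and labels are illustrative):
- # zpool create zroot mirror gpt/os0 gpt/os1
- # zpool create data raidz2 gpt/data0 gpt/data1 gpt/data2 gpt/data3 gpt/data4 gpt/data5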
RAIDZ vs. Traditional RAID
- ZFS combines filesystem and Volume Manager - faster recovery
- Traditional parity RAID suffers from the "write hole" when power fails mid-write
- Copy-on-Write avoids it: ZFS never modifies a block in place, only writes new blocks
Create Striped Pools
- Each VDEV is a single disk
- No special label for VDEV of striped disk
- # zpool create trinity gpt/zfs0 gpt/zfs1 gpt/zfs2 gpt/zfs3 gpt/zfs4
Viewing Stripe/Mirror/RAIDZ Pool Results
- Use # zpool status
Multi-VDEV RAIDZ
- Stripes are inherently multi-VDEV
- There's no traditional RAID equivalent
- Use type keyword multiple times
- # zpool create trinity raidz1 gpt/zfs0 gpt/zfs1 gpt/zfs2 raidz1 gpt/zfs3 gpt/zfs4 gpt/zfs5
Malformed Pool Example
- # zpool create trinity raidz1 gpt/zfs0 gpt/zfs1 gpt/zfs2 mirror gpt/zfs3 gpt/zfs4 gpt/zfs5
- receives an "invalid vdev specification" message
- Don't use -f here; ZFS will let you, but you shouldn't
- Mixing a mirror VDEV into a RAIDZ Pool is a no-go!
Reusing Providers
- # zpool create db gpt/zfs1 gpt/zfs2 gpt/zfs3 gpt/zfs4
- ZFS refuses: /dev/gpt/zfs3 is part of exported pool 'db'
- If that old pool really is dead, the use of -f here is appropriate and essential
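Once you are certain the old pool is gone, forcing the reuse looks like this:
- # zpool create -f db gpt/zfs1 gpt/zfs2 gpt/zfs3 gpt/zfs4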
Pool Integrity
- ZFS is self-healing at the Pool and VDEV level
- Parity allows data to be rebuilt
- Every block is hashed; hash is stored in the parent
- Data integrity is checked as the data is accessed on the disk
- A Scrub checks every block of the live filesystem without taking it offline
- If you don't have VDEV redundancy, use dataset copies property
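For example, to check a live pool, and to add block-level redundancy to one important dataset when you lack VDEV redundancy (the dataset name is illustrative):
- # zpool scrub trinity
- # zfs set copies=2 trinity/important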
Scrub vs fsck
- ZFS has no offline integrity checker
- ZFS scrub does everything that fsck does, and more
- You can offline your Pool to scrub, but why would you?
- Scrub isn't perfect, but it's better than fsck
Pool Properties
- Properties are tunables
- Both Pools and Datasets have properties
- Commands: zpool set, and zpool get
- Some are read-only
- # zpool get all | less
Changing Pool Properties
- # zpool set comment="Main OS Files" zroot
- # zfs set copies=2 zroot (copies is a dataset property, so it is set with zfs rather than zpool)
Pool History
- # zpool history zroot
ZPool Feature Flags
- ZFS originally used simple version numbers
- Then Oracle assimilated Sun and closed development
- OpenZFS switched to feature flags; pools report version 5000
- Feature flag support varies by operating system
- # zpool get all trinity | grep feature
Datasets
- A named chunk of data
- Filesystems
- Volume
- Snapshot
- Clone
- Bookmark
- Properties and features work on a per-dataset basis
- # zfs list -r zroot/ROOT
Creating Datasets
- # zfs create zroot/var/mysql
- # zfs create -V 4G zroot/vmware (-V creates a volume, a.k.a. zvol)
Destroying Datasets
- # zfs destroy zroot/var/old-mysql
- -v -- verbose mode
- -n -- no-op flag
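Combining those flags gives a dry run before the real thing (dataset name from the example above):
- # zfs destroy -vn zroot/var/old-mysql (preview what would be destroyed)
- # zfs destroy -v zroot/var/old-mysql (actually destroy it)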
Parent-Child Relationships
- Datasets inherit their parent's properties
- If you change a property locally but want to fall back to the parent's inherited value, use zfs inherit
- Renaming a Dataset changes its inheritance
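For example, to drop a local compression override and fall back to the parent's value (the dataset name is illustrative):
- # zfs inherit compression zroot/var/mysql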
Pool Repair & Maintenance
- Resilvering
- Rebuild from parity
- Uses VDEV redundancy data
- No redundancy? No resilvering
- Throttled by Disk I/O
- Happens automatically when disk is replaced
- Can add VDEVs to Pools, not disks to VDEV
- Be cautious of slightly smaller disks (check the exact sector count; it can vary between disks of nominally equal capacity)
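A typical disk replacement, which starts the resilver automatically (pool and provider names are illustrative):
- # zpool replace trinity gpt/zfs3 gpt/zfs9
- # zpool status trinity (watch the resilver progress)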
Add VDEV to Pool
- New VDEVs must be identical to existing VDEVs in the Pool
- # zpool add scratch gpt/zfs99
- # zpool add db mirror gpt/zfs6 gpt/zfs7
- # zpool add trinity raidz1 gpt/zfs3 gpt/zfs4 gpt/zfs5
Hardware States in ZFS
- ONLINE -- operating normally
- DEGRADED -- at least one storage provider has failed
- FAULTED -- generated too many errors
- UNAVAIL -- cannot open storage provider
- OFFLINE -- storage provider has been shut down
- REMOVED -- hardware detection of unplugged device
- Errors percolate up through the ZFS stack
- Hardware RAID hides errors - ZFS does not!
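A quick health check that reports only pools with problems:
- # zpool status -x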
Log and Cache Devices
- Read Cache -- L2ARC (Level 2 Adaptive Replacement Cache)
- Synchronous Write Log -- ZIL, SLOG (ZFS Intent Log, Separate Log Device)
- Where is the bottleneck?
- Log/Cache Hardware
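Adding a separate log device and a cache device to an existing pool looks roughly like this (provider labels are illustrative):
- # zpool add trinity log gpt/slog0
- # zpool add trinity cache gpt/cache0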
Filesystem Compression
- Compression exchanges CPU time for disk I/O
- Disk I/O is very limited
- CPU time is plentiful
- LZ4 by default
- Enable compression before writing any data
- # zfs set compression=lz4 zroot
- gzip-9 compresses harder than lz4, but at a much higher CPU cost; lz4 is the sensible default
- No more userland log compression
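To see how well compression is actually doing on a dataset:
- # zfs get compressratio zroot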
Memory Cache Compression
- The ARC (Adaptive Replacement Cache) is ZFS's buffer cache
- ARC compression exchanges CPU time for memory
- Memory can be somewhat limited
- CPU time is plentiful
- ZFS ARC auto compresses what can be compressed
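On FreeBSD you can get a rough idea of how well the ARC is compressing, assuming your release exposes these arcstats sysctls:
- # sysctl kstat.zfs.misc.arcstats.compressed_size kstat.zfs.misc.arcstats.uncompressed_size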
Deduplication (Dedup)
- ZFS deduplication isn't as good as you would imagine it is
- Only duplicates identical filesystem blocks
- Most data is not ZFS deduplicable
- 1 TB of dedup'd data needs roughly 5 GB of RAM for the dedup table
- System RAM should be about 4x the dedup table size (see the worked example below)
- Effectiveness: run zdb -S zroot, check dedup column
- Cost-effective ZFS dedup just doesn't exist
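Working through that rule of thumb: 10 TB of dedup'd data needs roughly 50 GB of RAM for the dedup table, which means a system with around 200 GB of RAM before dedup is even worth considering.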
Snapshots
- # zfs snapshot trinity/home@<today's date>
- # zfs list -t snapshot
- # zfs snapshot -r zroot@<today's date> (recursively snapshots every dataset in the pool)
- Access snapshots in hidden .zfs directory (especially when using ZFSonLinux)
- # zfs destroy trinity/home@<today's date>
- Use -vn in destroy operations
Snapshot Disk Use
- Delete file from live filesystem
- Blocks in a snapshot remain in use
- Blocks are freed only when no snapshot uses them
Roll Back
- Can roll back a filesystem to the most recent snapshot
- # zfs rollback zroot/ROOT@<before upgrade>
- Newer data is destroyed
Clones
- A read-write copy of a snapshot
- # zfs clone zroot/var/mysql@<today> zroot/var/mysql-test
- Run a test, then discard afterward
Boot Environments
- Built on clones and snapshots
- Snapshot root filesystem dataset before an upgrade
- If the upgrade goes awry, roll back!
- FreeBSD: sysutils/beadm
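With sysutils/beadm installed, the workflow is roughly as follows (the boot environment name is made up):
- # beadm create pre-upgrade
- Run the upgrade; if it goes wrong, activate the saved environment and reboot:
- # beadm activate pre-upgrade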
ZFS Send/Receive
- Move whole filesystems to another host
- Blows rsync out of the water
- Resumable
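A minimal sketch, with the host and dataset names made up for illustration:
- # zfs snapshot zroot/home@migrate
- # zfs send zroot/home@migrate | ssh backuphost zfs receive tank/home
- For resumable transfers, receive with -s and resume an interrupted stream with zfs send -t <token>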