<?xml version="1.0" encoding="utf-8" ?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:tt="http://teletype.in/" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/"><title>In Linux Find Truth</title><subtitle>Blogging about Linux &amp; Other Things</subtitle><author><name>In Linux Find Truth</name></author><id>https://teletype.in/atom/datapioneer</id><link rel="self" type="application/atom+xml" href="https://teletype.in/atom/datapioneer?offset=0"></link><link rel="alternate" type="text/html" href="https://teletype.in/@datapioneer?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=datapioneer"></link><link rel="next" type="application/rss+xml" href="https://teletype.in/atom/datapioneer?offset=10"></link><link rel="search" type="application/opensearchdescription+xml" title="Teletype" href="https://teletype.in/opensearch.xml"></link><updated>2026-04-11T14:44:07.545Z</updated><entry><id>datapioneer:BJFUzIIkU</id><link rel="alternate" type="text/html" href="https://teletype.in/@datapioneer/BJFUzIIkU?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=datapioneer"></link><title>Zettabyte File System Explained</title><published>2019-12-29T19:54:49.370Z</published><updated>2020-01-03T17:18:16.191Z</updated><category term="blog" label="Blog"></category><summary type="html">In this article, I will strive to answer many questions that have been asked about ZFS, such as what is it, why should I use it, what can I do with it, and the like? Let's begin:</summary><content type="html">
  &lt;p&gt;In this article, I will strive to answer many questions that have been asked about ZFS, such as what is it, why should I use it, what can I do with it, and the like? Let&amp;#x27;s begin:&lt;/p&gt;
  &lt;h3&gt;What are some of the attributes of ZFS?&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;ZFS is a fully-featured filesystem&lt;/li&gt;
    &lt;li&gt;Does data integrity checking&lt;/li&gt;
    &lt;li&gt;Uses snapshots&lt;/li&gt;
    &lt;li&gt;Created by Sun Microsystems, forked by Oracle&lt;/li&gt;
    &lt;li&gt;Oracle version is less full featured&lt;/li&gt;
    &lt;li&gt;OpenZFS - open source version of ZFS&lt;/li&gt;
    &lt;li&gt;Feeds into FreeBSD, illumos, ZFSonLinux, Canonical&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;What makes ZFS special?&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;Builds the functionality of well-understood, standard userland tools into the filesystem&lt;/li&gt;
    &lt;ul&gt;
      &lt;li&gt;Checksums everything&lt;/li&gt;
      &lt;li&gt;Metadata abounds&lt;/li&gt;
      &lt;li&gt;Uses Compression&lt;/li&gt;
      &lt;li&gt;diff(1)&lt;/li&gt;
      &lt;li&gt;Copy-on-Write (COW)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/ul&gt;
  &lt;h3&gt;What is Copy-on-Write?&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;ZFS never changes a written disk sector&lt;/li&gt;
    &lt;li&gt;A sector changes? Allocate a new sector. Write data to it&lt;/li&gt;
    &lt;li&gt;Data on disk is always coherent&lt;/li&gt;
    &lt;li&gt;Power loss half-way through a write? Old data is still there untouched. Version control at the disk level&lt;/li&gt;
    &lt;li&gt;Interesting side-effect: snapshots are effectively free&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;ZFS Assumptions?&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;ZFS is not your typical EXT/UFS filesystem&lt;/li&gt;
    &lt;li&gt;Traditional assumptions about filesystems will come back to haunt you&lt;/li&gt;
    &lt;li&gt;Non-ZFS tools like &lt;strong&gt;&lt;em&gt;dump&lt;/em&gt;&lt;/strong&gt; will &lt;em&gt;appear&lt;/em&gt; to work, but not really&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;ZFS Hardware?&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;RAID Controllers -- Absolutely NOT!&lt;/li&gt;
    &lt;ul&gt;
      &lt;li&gt;ZFS expects raw disk access&lt;/li&gt;
      &lt;li&gt;RAID controller in JBOD or single-disk RAID0?&lt;/li&gt;
      &lt;li&gt;RAM -- ECC?&lt;/li&gt;
      &lt;li&gt;Disk redundancy&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/ul&gt;
  &lt;h3&gt;ZFS Terminology&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;VDEV or Virtual Device - a group of storage providers&lt;/li&gt;
    &lt;li&gt;Pool - a group of identical VDEVs&lt;/li&gt;
    &lt;li&gt;Dataset - a named chunk of data on a pool&lt;/li&gt;
    &lt;li&gt;You can arrange data in a pool any way that you desire&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;-f&lt;/strong&gt; switch is very important (be careful how you use it)&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Virtual Devices (VDEVs) and Pools&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;Basic unit of storage in ZFS&lt;/li&gt;
    &lt;li&gt;All ZFS redundancy occurs at the virtual device level&lt;/li&gt;
    &lt;li&gt;Can be built out of any storage provider&lt;/li&gt;
    &lt;li&gt;Most common providers: disk or GPT partition&lt;/li&gt;
    &lt;ul&gt;
      &lt;li&gt;Could be FreeBSD crypto device&lt;/li&gt;
      &lt;li&gt;Linux LVM (Logical Volume Manager) RAID&lt;/li&gt;
    &lt;/ul&gt;
    &lt;li&gt;A Pool contains only one type of VDEV&lt;/li&gt;
    &lt;li&gt;&amp;quot;X VDEV&amp;quot; and &amp;quot;X Pool&amp;quot; get used interchangeably&lt;/li&gt;
    &lt;li&gt;VDEVs are added to Pools&lt;/li&gt;
    &lt;li&gt;Typically providers are not added to VDEVs but to Pools&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Stripe VDEV/Pool&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;Each disk is its own VDEV&lt;/li&gt;
    &lt;li&gt;Data is striped across all VDEVs in the Pool&lt;/li&gt;
    &lt;li&gt;Can add striped VDEVs to grow Pools&lt;/li&gt;
    &lt;li&gt;No redundancy. Absolutely none. Nada!&lt;/li&gt;
    &lt;li&gt;No self-healing&lt;/li&gt;
    &lt;li&gt;Set &lt;em&gt;copies=2&lt;/em&gt; to get self-healing; it must be set before data is written, as it only applies to new writes&lt;/li&gt;
  &lt;/ul&gt;
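  &lt;p&gt;As a minimal sketch (pool and provider names are hypothetical), a striped pool with per-dataset copies might look like this:&lt;/p&gt;

```shell
# Each disk becomes its own single-disk VDEV; data is striped across them.
zpool create scratch gpt/zfs0 gpt/zfs1 gpt/zfs2

# No VDEV redundancy here, so ask ZFS to store two copies of every block;
# this only applies to data written after the property is set.
zfs set copies=2 scratch
```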
  &lt;h3&gt;Mirror VDEV/Pool&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;Each VDEV contains multiple disks that replicate the data of all other disks in the VDEV&lt;/li&gt;
    &lt;li&gt;A Pool with multiple VDEVs is analogous to RAID-10 (Stripe over Mirrors)&lt;/li&gt;
    &lt;li&gt;Can add more mirror VDEVs to grow Pool&lt;/li&gt;
  &lt;/ul&gt;
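  &lt;p&gt;A sketch with hypothetical providers: a mirror pool, then a second mirror VDEV added to make a stripe over mirrors:&lt;/p&gt;

```shell
# Two-disk mirror VDEV (RAID-1 style redundancy).
zpool create db mirror gpt/zfs0 gpt/zfs1

# Adding another mirror VDEV grows the pool and stripes data across
# both mirrors -- the RAID-10 analogue mentioned above.
zpool add db mirror gpt/zfs2 gpt/zfs3
```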
  &lt;h3&gt;RAIDZ VDEV/Pool&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;Each VDEV contains multiple disks&lt;/li&gt;
    &lt;li&gt;Data integrity maintained via parity (such as RAID-5, etc.)&lt;/li&gt;
    &lt;li&gt;Lose a disk - No data loss&lt;/li&gt;
    &lt;li&gt;Can self-heal via redundant checksums&lt;/li&gt;
    &lt;li&gt;RAIDZ Pool can have multiple identical VDEVs&lt;/li&gt;
    &lt;li&gt;Cannot expand the size of a RAIDZ VDEV by adding more disks&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;RAIDZ Types&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;RAID-Z1&lt;/li&gt;
    &lt;ul&gt;
      &lt;li&gt;3+ Disks&lt;/li&gt;
      &lt;li&gt;Can lose 1 disk/VDEV&lt;/li&gt;
    &lt;/ul&gt;
    &lt;li&gt;RAID-Z2&lt;/li&gt;
    &lt;ul&gt;
      &lt;li&gt;4+ Disks&lt;/li&gt;
      &lt;li&gt;Can lose 2 disks/VDEV&lt;/li&gt;
    &lt;/ul&gt;
    &lt;li&gt;RAID-Z3&lt;/li&gt;
    &lt;ul&gt;
      &lt;li&gt;5+ Disks&lt;/li&gt;
      &lt;li&gt;Can lose 3 disks/VDEV&lt;/li&gt;
    &lt;/ul&gt;
    &lt;li&gt;Disk capacity growth far outpaces disk access speed, so rebuilds of large disks take a long time&lt;/li&gt;
  &lt;/ul&gt;
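  &lt;p&gt;A sketch of creating each RAIDZ level at its minimum disk count (pool and provider names are illustrative):&lt;/p&gt;

```shell
# RAID-Z1: 3+ disks, survives the loss of 1 disk per VDEV.
zpool create t1 raidz1 gpt/zfs0 gpt/zfs1 gpt/zfs2

# RAID-Z2: 4+ disks, survives 2 failures per VDEV.
zpool create t2 raidz2 gpt/zfs3 gpt/zfs4 gpt/zfs5 gpt/zfs6

# RAID-Z3: 5+ disks, survives 3 failures per VDEV.
zpool create t3 raidz3 gpt/zfs7 gpt/zfs8 gpt/zfs9 gpt/zfs10 gpt/zfs11
```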
  &lt;h3&gt;Number of Disks and Pools?&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;No more than 9 - 12 Disks per VDEV&lt;/li&gt;
    &lt;li&gt;Pool size is your choice&lt;/li&gt;
    &lt;li&gt;Avoid putting everything in one massive Pool&lt;/li&gt;
    &lt;li&gt;Best practice is to put OS in one mirrored Pool, and data in a separate Pool&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;RAIDZ vs. Traditional RAID&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;ZFS combines filesystem and Volume Manager - faster recovery&lt;/li&gt;
    &lt;li&gt;No RAID-5 &amp;quot;write hole&amp;quot; -- parity and data can never be left inconsistent by a power loss&lt;/li&gt;
    &lt;li&gt;Copy-on-Write -- never modify a block, only write new blocks&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Create Striped Pools&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;Each VDEV is a single disk&lt;/li&gt;
    &lt;li&gt;No special label for VDEV of striped disk&lt;/li&gt;
    &lt;li&gt;# zpool create trinity gpt/zfs0 gpt/zfs1 gpt/zfs2 gpt/zfs3 gpt/zfs4&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Viewing Stripe/Mirror/RAIDZ Pool Results&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;Use # zpool status&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Multi-VDEV RAIDZ&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;Stripes are inherently multi-VDEV&lt;/li&gt;
    &lt;li&gt;There&amp;#x27;s no traditional RAID equivalent&lt;/li&gt;
    &lt;li&gt;Use &lt;em&gt;type&lt;/em&gt; keyword multiple times&lt;/li&gt;
    &lt;ul&gt;
      &lt;li&gt;# zpool create trinity &lt;strong&gt;raidz1&lt;/strong&gt; gpt/zfs0 gpt/zfs1 gpt/zfs2 &lt;strong&gt;raidz1&lt;/strong&gt; gpt/zfs3 gpt/zfs4 gpt/zfs5&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Malformed Pool Example&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;# zpool create trinity &lt;strong&gt;raidz1&lt;/strong&gt; gpt/zfs0 gpt/zfs1 gpt/zfs2 &lt;strong&gt;mirror&lt;/strong&gt; gpt/zfs3 gpt/zfs4 gpt/zfs5&lt;/li&gt;
    &lt;ul&gt;
      &lt;li&gt;receives an &amp;quot;invalid vdev specification&amp;quot; message&lt;/li&gt;
    &lt;/ul&gt;
    &lt;li&gt;Don&amp;#x27;t use &lt;strong&gt;-f&lt;/strong&gt; here: ZFS will let you do it even though you shouldn&amp;#x27;t&lt;/li&gt;
    &lt;li&gt;Attempting to add a Striped Mirror to a RAIDz -- No go!&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Reusing Providers&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;# zpool create db gpt/zfs1 gpt/zfs2 gpt/zfs3 gpt/zfs4&lt;/li&gt;
    &lt;li&gt;/dev/gpt/zfs3 is part of an exported pool &amp;#x27;db&amp;#x27;, so&lt;/li&gt;
    &lt;li&gt;the use of &lt;strong&gt;-f&lt;/strong&gt; here is appropriate and essential&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Pool Integrity&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;ZFS is self-healing at the Pool and VDEV level&lt;/li&gt;
    &lt;li&gt;Parity allows data to be rebuilt&lt;/li&gt;
    &lt;li&gt;Every block is hashed; hash is stored in the parent&lt;/li&gt;
    &lt;li&gt;Data integrity is checked as the data is accessed on the disk&lt;/li&gt;
    &lt;li&gt;A &lt;strong&gt;Scrub&lt;/strong&gt; essentially checks the live filesystem without taking it offline&lt;/li&gt;
    &lt;li&gt;If you don&amp;#x27;t have VDEV redundancy, use dataset &lt;em&gt;copies&lt;/em&gt; property&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Scrub vs fsck&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;ZFS has no offline integrity checker&lt;/li&gt;
    &lt;li&gt;ZFS scrub does everything that fsck does, and more&lt;/li&gt;
    &lt;li&gt;You can offline your Pool to scrub, but why would you?&lt;/li&gt;
    &lt;li&gt;Scrub isn&amp;#x27;t perfect, but it&amp;#x27;s better than fsck&lt;/li&gt;
  &lt;/ul&gt;
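  &lt;p&gt;A scrub runs against the live pool; a minimal sketch:&lt;/p&gt;

```shell
# Walk every block in the pool, verify checksums, and repair from
# redundancy where possible -- all while the pool stays online.
zpool scrub trinity

# Progress and any repaired errors show up in the status output.
zpool status trinity
```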
  &lt;h3&gt;Pool Properties&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;Properties are tunables&lt;/li&gt;
    &lt;li&gt;Both Pools and Datasets have properties&lt;/li&gt;
    &lt;li&gt;Commands: &lt;em&gt;zpool set&lt;/em&gt; and &lt;em&gt;zpool get&lt;/em&gt;&lt;/li&gt;
    &lt;li&gt;Some are read-only&lt;/li&gt;
    &lt;ul&gt;
      &lt;li&gt;# zpool get all | less&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Changing Pool Properties&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;# zpool set comment=&amp;quot;Main OS Files&amp;quot; zroot&lt;/li&gt;
    &lt;li&gt;# zfs set copies=2 zroot&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Pool History&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;# zpool history zroot&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;ZPool Feature Flags&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;ZFS had version numbers&lt;/li&gt;
    &lt;li&gt;Then, Oracle assimilated Sun&lt;/li&gt;
    &lt;li&gt;Pools with feature flags report version 5000&lt;/li&gt;
    &lt;li&gt;Feature flags versus OS&lt;/li&gt;
    &lt;li&gt;# zpool get all trinity | grep feature&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Datasets&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;A named chunk of data&lt;/li&gt;
    &lt;li&gt;Filesystems&lt;/li&gt;
    &lt;li&gt;Volume&lt;/li&gt;
    &lt;li&gt;Snapshot&lt;/li&gt;
    &lt;li&gt;Clone&lt;/li&gt;
    &lt;li&gt;Bookmark&lt;/li&gt;
    &lt;li&gt;Properties and features work on a per-dataset basis&lt;/li&gt;
    &lt;li&gt;# zfs list -r zroot/ROOT&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Creating Datasets&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;# zfs create zroot/var/mysql&lt;/li&gt;
    &lt;li&gt;# zfs create -V 4G zroot/vmware&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Destroying Datasets&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;# zfs destroy zroot/var/old-mysql&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;-v&lt;/strong&gt; -- verbose mode&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;-n&lt;/strong&gt; -- no-op flag&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Parent-Child Relationships&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;Datasets inherit their parent&amp;#x27;s properties&lt;/li&gt;
    &lt;li&gt;If you change a property locally but want it to use the parent&amp;#x27;s inherited value again, use &lt;em&gt;zfs inherit&lt;/em&gt;&lt;/li&gt;
    &lt;li&gt;Renaming a Dataset changes its inheritance&lt;/li&gt;
  &lt;/ul&gt;
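  &lt;p&gt;A short sketch of the inheritance round-trip (dataset names are hypothetical):&lt;/p&gt;

```shell
# Override the inherited value locally on one dataset.
zfs set compression=gzip zroot/var/log

# Drop the local value and fall back to the parent's setting.
zfs inherit compression zroot/var/log

# With -r, descendants lose their local overrides too.
zfs inherit -r compression zroot/var
```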
  &lt;h3&gt;Pool Repair &amp;amp; Maintenance&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;Resilvering&lt;/li&gt;
    &lt;li&gt;Rebuild from parity&lt;/li&gt;
    &lt;li&gt;Uses VDEV redundancy data&lt;/li&gt;
    &lt;li&gt;No redundancy? No resilvering&lt;/li&gt;
    &lt;li&gt;Throttled by Disk I/O&lt;/li&gt;
    &lt;li&gt;Happens automatically when disk is replaced&lt;/li&gt;
    &lt;li&gt;Can add VDEVs to Pools, not disks to VDEV&lt;/li&gt;
    &lt;li&gt;Be cautious of slightly smaller disks (check sector size as they can vary from disk to disk of equal capacity)&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Add VDEV to Pool&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;New VDEVs must be identical to existing VDEVs in the Pool&lt;/li&gt;
    &lt;ul&gt;
      &lt;li&gt;# zpool add scratch gpt/zfs99&lt;/li&gt;
      &lt;li&gt;# zpool add db mirror gpt/zfs6 gpt/zfs7&lt;/li&gt;
      &lt;li&gt;# zpool add trinity raidz1 gpt/zfs3 gpt/zfs4 gpt/zfs5&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Hardware States in ZFS&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;ONLINE -- operating normally&lt;/li&gt;
    &lt;li&gt;DEGRADED -- at least one storage provider has failed&lt;/li&gt;
    &lt;li&gt;FAULTED -- generated too many errors&lt;/li&gt;
    &lt;li&gt;UNAVAIL -- cannot open storage provider&lt;/li&gt;
    &lt;li&gt;OFFLINE -- storage provider has been shut down&lt;/li&gt;
    &lt;li&gt;REMOVED -- hardware detection of unplugged device&lt;/li&gt;
    &lt;li&gt;Errors percolate up through the ZFS stack&lt;/li&gt;
    &lt;li&gt;Hardware RAID hides errors - ZFS does not!&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Log and Cache Devices&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;Read Cache -- L2ARC (Level 2 Adaptive Replacement Cache)&lt;/li&gt;
    &lt;li&gt;Synchronous Write Log -- ZIL, SLOG (ZFS Intent Log, Separate Log Device)&lt;/li&gt;
    &lt;li&gt;Where is the bottleneck?&lt;/li&gt;
    &lt;li&gt;Log/Cache Hardware&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Filesystem Compression&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;Compression exchanges CPU time for disk I/O&lt;/li&gt;
    &lt;li&gt;Disk I/O is very limited&lt;/li&gt;
    &lt;li&gt;CPU time is plentiful&lt;/li&gt;
    &lt;li&gt;LZ4 by default&lt;/li&gt;
    &lt;li&gt;Enable compression before writing any data&lt;/li&gt;
    &lt;li&gt;# zfs set compression=lz4 zroot&lt;/li&gt;
    &lt;li&gt;gzip-9 compresses more tightly than LZ4, but at a much higher CPU cost&lt;/li&gt;
    &lt;li&gt;No more need for separate userland log compression&lt;/li&gt;
  &lt;/ul&gt;
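  &lt;p&gt;A minimal sketch of enabling and checking compression:&lt;/p&gt;

```shell
# Enable LZ4 on the top-level dataset; children inherit it.
# Only blocks written afterwards are compressed, so do this early.
zfs set compression=lz4 zroot

# See how much space compression is actually saving.
zfs get compressratio zroot
```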
  &lt;h3&gt;Memory Cache Compression&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;The Adaptive Replacement Cache (ARC) is ZFS&amp;#x27; buffer cache&lt;/li&gt;
    &lt;li&gt;ARC compression exchanges CPU time for memory&lt;/li&gt;
    &lt;li&gt;Memory can be somewhat limited&lt;/li&gt;
    &lt;li&gt;CPU time is plentiful&lt;/li&gt;
    &lt;li&gt;ZFS ARC auto compresses what can be compressed&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Deduplication (Dedup)&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;ZFS deduplication isn&amp;#x27;t as good as you would imagine it is&lt;/li&gt;
    &lt;li&gt;Only deduplicates identical filesystem blocks&lt;/li&gt;
    &lt;li&gt;Most data is not ZFS deduplicable&lt;/li&gt;
    &lt;li&gt;1TB of dedup&amp;#x27;d data = 5G RAM for the dedup process&lt;/li&gt;
    &lt;li&gt;Plan for total system RAM of roughly 4 X the dedup RAM&lt;/li&gt;
    &lt;li&gt;Effectiveness: run zdb -S zroot, check dedup column&lt;/li&gt;
    &lt;li&gt;Cost-effective ZFS dedup just doesn&amp;#x27;t exist&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Snapshots&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;# zfs snapshot trinity/home@&amp;lt;today&amp;#x27;s date&amp;gt;&lt;/li&gt;
    &lt;li&gt;# zfs list -t snapshot&lt;/li&gt;
    &lt;li&gt;# zfs snapshot -r zroot@&amp;lt;today&amp;#x27;s date&amp;gt;&lt;/li&gt;
    &lt;li&gt;Access snapshots in hidden .zfs directory (especially when using ZFSonLinux)&lt;/li&gt;
    &lt;li&gt;# zfs destroy trinity/home@&amp;lt;today&amp;#x27;s date&amp;gt;&lt;/li&gt;
    &lt;li&gt;Use -vn in destroy operations&lt;/li&gt;
  &lt;/ul&gt;
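  &lt;p&gt;A sketch of the snapshot workflow above, using an illustrative date in place of the placeholders:&lt;/p&gt;

```shell
# Snapshot a single dataset, then a whole tree recursively.
zfs snapshot trinity/home@2019-12-29
zfs snapshot -r zroot@2019-12-29

# List snapshots, dry-run the destroy (-n) with verbose output (-v),
# then destroy for real.
zfs list -t snapshot
zfs destroy -vn trinity/home@2019-12-29
zfs destroy trinity/home@2019-12-29
```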
  &lt;h3&gt;Snapshot Disk Use&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;Delete file from live filesystem&lt;/li&gt;
    &lt;li&gt;Blocks in a snapshot remain in use&lt;/li&gt;
    &lt;li&gt;Blocks are freed only when no snapshot uses them&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Roll Back&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;Can rollback filesystem to the most recent snapshot&lt;/li&gt;
    &lt;li&gt;# zfs rollback zroot/ROOT@&amp;lt;before upgrade&amp;gt;&lt;/li&gt;
    &lt;li&gt;Newer data is destroyed&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Clones&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;A read-write copy of a snapshot&lt;/li&gt;
    &lt;li&gt;# zfs clone zroot/var/mysql@&amp;lt;today&amp;gt; zroot/var/mysql-test&lt;/li&gt;
    &lt;li&gt;Run a test, then discard afterward&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;Boot Environments&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;Built on clones and snapshots&lt;/li&gt;
    &lt;li&gt;Snapshot root filesystem dataset before an upgrade&lt;/li&gt;
    &lt;li&gt;If the upgrade goes awry, roll back!&lt;/li&gt;
    &lt;li&gt;FreeBSD: sysutils/beadm&lt;/li&gt;
  &lt;/ul&gt;
  &lt;h3&gt;ZFS Send/Receive&lt;/h3&gt;
  &lt;ul&gt;
    &lt;li&gt;Move whole filesystems to another host&lt;/li&gt;
    &lt;li&gt;Blows rsync out of the water&lt;/li&gt;
    &lt;li&gt;Resumable&lt;/li&gt;
  &lt;/ul&gt;
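  &lt;p&gt;A hedged sketch of replication over SSH (host, pool, and snapshot names are illustrative):&lt;/p&gt;

```shell
# Stream a snapshot to another machine; -s on the receiver makes an
# interrupted transfer resumable.
zfs send trinity/home@2019-12-29 | ssh backuphost zfs receive -s tank/home

# After an interruption, ask the receiver for its resume token and
# restart the stream from where it stopped.
token=$(ssh backuphost zfs get -H -o value receive_resume_token tank/home)
zfs send -t "$token" | ssh backuphost zfs receive -s tank/home
```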
  &lt;p&gt;&lt;a href=&quot;https://inlinuxveritas.com/Sk68PBb1U?cda=&quot; target=&quot;_blank&quot;&gt;Let&amp;#x27;s install ZFS in Linux and start using it&lt;/a&gt;&lt;/p&gt;

</content></entry><entry><id>datapioneer:HkV-xNH18</id><link rel="alternate" type="text/html" href="https://teletype.in/@datapioneer/HkV-xNH18?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=datapioneer"></link><title>Designating Hot Spares in Your ZFS Storage Pool</title><published>2019-12-28T21:24:52.001Z</published><updated>2019-12-28T23:19:45.794Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://teletype.in/files/ab/b9/abb9c8d6-1227-4346-bb15-48a4c097651b.png"></media:thumbnail><category term="blog" label="Blog"></category><summary type="html">&lt;img src=&quot;https://teletype.in/files/ed/1b/ed1bcab0-dc11-4c42-8920-a8f05e14ee7d.png&quot;&gt;There is a feature built into ZFS called the &quot;hotspares&quot; feature which allows a sysadmin to identify those drives available as spares which can be swapped out in the event of a drive failure in a storage pool. If an appropriate flag is set in the feature, the &quot;hot spare&quot; drive can even be swapped automatically to replace the failed drive. Or, alternatively, a spare drive can be swapped manually if the sysadmin detects a failing drive that is reported as irreparable.  </summary><content type="html">
  &lt;p&gt;There is a feature built into ZFS called the &amp;quot;hotspares&amp;quot; feature, which allows a sysadmin to identify drives available as spares that can be swapped in when a drive in a storage pool fails. If the appropriate property is set, a &amp;quot;hot spare&amp;quot; can even be swapped in automatically to replace the failed drive. Alternatively, a spare can be swapped in manually if the sysadmin detects a failing drive that is reported as irreparable.  &lt;/p&gt;
  &lt;p&gt;Hot spares can be designated in the ZFS storage pool in two separate ways:&lt;/p&gt;
  &lt;blockquote&gt;When the pool is created using the &lt;strong&gt;zpool create&lt;/strong&gt; command, and&lt;/blockquote&gt;
  &lt;blockquote&gt;After the pool is created using the &lt;strong&gt;zpool add &lt;/strong&gt;command&lt;/blockquote&gt;
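  &lt;p&gt;Both approaches in command form, using the drive names from this article:&lt;/p&gt;

```shell
# Designate spares at pool-creation time...
zpool create trinity mirror sdb sdc spare sdd sde

# ...or add a spare to an existing pool later.
zpool add trinity spare sdf
```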
  &lt;p&gt;Before beginning to create the ZFS storage pool and identifying the spares, I want to list the available drives in my Debian 10 Linux VM in VirtualBox 6.0. I can accomplish this in the Terminal using the command:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/ed/1b/ed1bcab0-dc11-4c42-8920-a8f05e14ee7d.png&quot; width=&quot;827&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;Here, we see that I have 7 total drives apart from the primary drive, /dev/sda, used by Debian 10 Linux. These drives are listed above and range from /dev/sdb ... to /dev/sdh.&lt;/p&gt;
  &lt;p&gt;Therefore, I plan to create a ZFS storage pool called &lt;strong&gt;trinity&lt;/strong&gt; with a mirror of two drives: &lt;strong&gt;sdb&lt;/strong&gt; and &lt;strong&gt;sdc&lt;/strong&gt;; and hot spares which I will identify as &lt;strong&gt;sdd&lt;/strong&gt; and &lt;strong&gt;sde&lt;/strong&gt;. &lt;/p&gt;
  &lt;p&gt;The diagram below illustrates the new ZFS mirrored storage pool with the two spares identified. In addition, I have taken the liberty of recreating the trinity/data, .../apps, and .../datapioneer datasets:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/ab/b9/abb9c8d6-1227-4346-bb15-48a4c097651b.png&quot; width=&quot;1044&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt; As reported above using the &lt;strong&gt;zpool status&lt;/strong&gt; or &lt;strong&gt;zpool status trinity&lt;/strong&gt; command, the two mirror drives sdb &amp;amp; sdc are ONLINE and I have identified two AVAIL spares sdd &amp;amp; sde. If I want to remove one of the hotspares and replace it with another drive, I can do this using the &lt;strong&gt;zpool remove&lt;/strong&gt; command:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/1a/92/1a924edb-dc78-4297-bd2f-6b048519f43d.png&quot; width=&quot;699&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;Since the spare drive was not INUSE but was AVAIL instead, we were able to remove it. Had it been INUSE, this wouldn&amp;#x27;t be allowed. Instead, the drive would have to be manually taken OFFLINE first, then removed.  &lt;em&gt;One important thing to keep in mind here is that drives identified as spares in the system must have a drive capacity &amp;gt;= largest drive in the storage pool. &lt;/em&gt;Let&amp;#x27;s add another spare to the storage pool by replacing the drive sdd that we previously removed. This is accomplished by running the &lt;strong&gt;zpool add trinity spare&lt;/strong&gt; command and inserting the &lt;em&gt;sdd&lt;/em&gt; drive designated as the replacement:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/cf/95/cf95afb2-bf8e-4682-a1e9-a01cfa704051.png&quot; width=&quot;718&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;As an aside, if at any time you wish to determine the health of a ZFS storage pool, you can run either the &lt;strong&gt;zpool status trinity&lt;/strong&gt; command or the &lt;strong&gt;zpool status -x trinity&lt;/strong&gt; command. The latter states the health directly, so you don&amp;#x27;t have to look at the STATE column of the status screen:&lt;/p&gt;
  &lt;p&gt;root@debian-10-desktop-vm:/trinity# &lt;strong&gt;zpool status trinity -x&lt;/strong&gt;&lt;br /&gt;&lt;em&gt;pool &amp;#x27;trinity&amp;#x27; is healthy&lt;/em&gt;&lt;/p&gt;
  &lt;p&gt;Now, if an ONLINE drive fails, the zpool will move into a DEGRADED state.  The failed drive can then be replaced manually by the sysadmin or, if the sysadmin has set the &lt;strong&gt;autoreplace=on&lt;/strong&gt; in the property of the zpool, the failed drive will be automatically replaced by ZFS. The command to set this property in this example is:&lt;/p&gt;
  &lt;p&gt;root@debian-10-desktop-vm:/trinity# &lt;strong&gt;zpool set autoreplace=on trinity&lt;/strong&gt;&lt;/p&gt;
  &lt;p&gt;We can simulate the failure of one of our drives by manually taking it OFFLINE. So, let&amp;#x27;s pretend that drive sdb fails by manually OFFLINE&amp;#x27;ing it:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/03/80/0380deee-e63f-4983-ad57-a1b51c6b033a.png&quot; width=&quot;845&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;Note here that sdb now shows DEGRADED in the status, and the overall Zpool trinity is in a DEGRADED state as well. Had this been a real-world scenario where a drive actually failed, instead of a failure simulated by OFFLINE&amp;#x27;ing the drive, one of the AVAIL spares (sdd or sde) would have replaced the failed drive sdb, reducing the AVAIL spares by one and bringing the Zpool back to ONLINE status. I can either ONLINE the device sdb, since I know it to be a good drive, or I can replace it with one of the spares through a manual rather than automatic process. I will demonstrate the latter using the &lt;strong&gt;zpool replace&lt;/strong&gt; command:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/83/80/83804da7-9537-43f6-8d49-92da34368329.png&quot; width=&quot;944&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;Now, the current status after running the zpool replace trinity sdb sde command to replace the failed sdb drive with the hotspare sde indicates that spare sde is currently INUSE rather than AVAIL and sde is ONLINE. &lt;/p&gt;
  &lt;p&gt;And, finally, if we detach the failed (OFFLINE) drive using the &lt;strong&gt;zpool detach trinity sdb&lt;/strong&gt; command, the status of the Zpool storage pool should be returned to ONLINE in healthy status:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/a4/cf/a4cf9966-ebba-4278-881b-312c03535cb8.png&quot; width=&quot;946&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;Note that the AVAIL spares list has now been reduced to sdd only, and sde has replaced the failed sdb drive which was formerly OFFLINE. The drive was resilvered (507K copied, 0 errors returned), completing on Sat Dec 28 2019 at 16:04:34 local time. &lt;/p&gt;
  &lt;p&gt;The last thing that we can do here before wrapping up this article is to add another drive to the &amp;quot;hot spares&amp;quot; list. We know from the previous command that we ran that we have drive sdf available for this purpose. So, let&amp;#x27;s add this drive as a spare. Since the drive, sdf, is being added after the ZFS storage pool was created, we can use the &lt;strong&gt;zpool add trinity spare sdf&lt;/strong&gt; command to add this drive as an AVAIL spare:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/ac/e6/ace6e5d0-fedb-405d-bb8f-77d82e53fde0.png&quot; width=&quot;942&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;Now, we have two spares AVAIL instead of the single drive and our ZFS storage pool is healthy once again.&lt;/p&gt;
  &lt;p&gt;&lt;a href=&quot;https://inlinuxveritas.com/Sk68PBb1U?cda=&quot; target=&quot;_blank&quot;&gt;Return to beginning of the article - Part 1&lt;/a&gt;&lt;/p&gt;

</content></entry><entry><id>datapioneer:HkX83eHkL</id><link rel="alternate" type="text/html" href="https://teletype.in/@datapioneer/HkX83eHkL?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=datapioneer"></link><title>Setting Up Quotas &amp; Reservations in OpenZFS in Linux</title><published>2019-12-28T17:02:41.964Z</published><updated>2019-12-28T23:17:58.793Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://teletype.in/files/60/6e/606ea314-7309-45d1-a54f-e671c882d28b.png"></media:thumbnail><category term="blog" label="Blog"></category><summary type="html">&lt;img src=&quot;https://teletype.in/files/60/6e/606ea314-7309-45d1-a54f-e671c882d28b.png&quot;&gt;ZFS supports quotas and reservations at the filesystem level. Quotas in ZFS set limits on the amount of space that a ZFS filesystem can use. Reservations in ZFS are used to guarantee a certain amount of space is available to the filesystem for use for apps and other objects in ZFS. Both quotas and reservations apply to the dataset the limits are set on and any descendants of that dataset.</summary><content type="html">
  &lt;p&gt;ZFS supports quotas and reservations at the filesystem level. Quotas in ZFS set limits on the amount of space that a ZFS filesystem can use. Reservations in ZFS are used to guarantee a certain amount of space is available to the filesystem for use for apps and other objects in ZFS. Both quotas and reservations apply to the dataset the limits are set on and any descendants of that dataset.&lt;/p&gt;
  &lt;p&gt;Primarily, quotas are set to limit the amount of space that a particular user of the system can consume so that no one user maximizes the Zpool in the dataset. In this article, I have setup two users in the Linux system: &lt;strong&gt;datapioneer&lt;/strong&gt; (myself as the primary user) and &lt;strong&gt;dante&lt;/strong&gt; (a secondary user of the Linux system) upon which OpenZFS on Linux has been installed.&lt;/p&gt;
  &lt;p&gt;Let&amp;#x27;s look at the zpool that I currently have setup in Linux for this example. Running both &lt;strong&gt;zpool status&lt;/strong&gt; and &lt;strong&gt;zfs list&lt;/strong&gt; commands in Debian 10 Linux produce the following profile of the ZFS filesystem:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/60/6e/606ea314-7309-45d1-a54f-e671c882d28b.png&quot; width=&quot;849&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt; Here, I have a RAIDZ of 3 SCSI drives with a mirrored two-drive SSD ZIL (SLOG). All drives are ONLINE and the zfspool is also ONLINE and healthy. As shown, the zfspool/data dataset is using 12M of data out of a total space of 19.0G. This means that both datapioneer and dante have access to this total 19.0G of zfspool drive space across the drives. I would like to limit the user dante to 5G of drive space so he doesn&amp;#x27;t inadvertently fill up the Zpool with data. In addition, I would like to reserve 5G of the Zpool under the dataset zfspool/apps for applications. Here&amp;#x27;s how we can accomplish both.&lt;/p&gt;
  &lt;p&gt;To limit 5G of drive space as a quota for dante under zfspool/data, we can run the following command:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/d9/d6/d9d65fea-a87d-40fb-8a78-ef84e8c1d82c.png&quot; width=&quot;900&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;Setting the quota for dante under the zfspool/data/dante dataset to 5G is performed using: &lt;strong&gt;zfs set quota=5G zfspool/data/dante&lt;/strong&gt;. In the figure above, I ran a &lt;strong&gt;zfs list&lt;/strong&gt; to look at the current listing of drive utilization and you can see that dante has the entire 19.0G available to him. After running the &lt;em&gt;zfs set quota&lt;/em&gt; command, we can confirm that we have limited dante to 5G by running the &lt;strong&gt;zfs get quota zfspool/data/dante&lt;/strong&gt; command showing a value of 5G for the property: &lt;em&gt;Quota&lt;/em&gt;. Rerunning zfs list also shows that only 4.99G, ~ 5G, is currently available to dante. However, note that the user datapioneer still has full use of the zpool wherein the amount of 19.0G of space is showing as available.  The user dante will only be able to fill up 5G of space under the zfspool/data/dante dataset and any attempts to add more data to the drives will be denied. Now, if we add a child dataset called &lt;strong&gt;ws&lt;/strong&gt; under zfspool/data/dante, the quota for &lt;strong&gt;/zfspool/data/dante/ws&lt;/strong&gt; inherits the 5G limitation:&lt;/p&gt;
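  &lt;p&gt;The quota steps condensed into a short sketch:&lt;/p&gt;

```shell
# Cap the dataset (and its descendants) at 5G.
zfs set quota=5G zfspool/data/dante

# Verify the property and watch AVAIL shrink for this subtree only.
zfs get quota zfspool/data/dante
zfs list -r zfspool/data
```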
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/29/0b/290b5aa7-64c3-444e-913a-b2e50ad3a3bd.png&quot; width=&quot;856&quot; /&gt;
  &lt;/figure&gt;
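  &lt;p&gt;Typed out, the quota workflow described above looks like this (the creation command for the &lt;strong&gt;ws&lt;/strong&gt; child dataset is my assumption of how it was made):&lt;/p&gt;

```shell
# Cap the dante dataset at 5G
zfs set quota=5G zfspool/data/dante

# Confirm the property took effect
zfs get quota zfspool/data/dante

# A child dataset (creation command assumed) is constrained by its parent's quota
zfs create zfspool/data/dante/ws
zfs list zfspool/data/dante/ws   # AVAIL stays within the 5G limit
```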
  &lt;p&gt;Now, let&amp;#x27;s set a 5G reservation for apps under /zfspool/apps. This means that the ZFS filesystem will set aside 5G of drive space for use by apps only, and nothing else will be able to use it. A reservation differs from a quota: a quota caps how much drive space a dataset may consume, whereas a reservation tells the ZFS filesystem not to allow any other dataset in the pool to use the reserved amount. To set a 5G reservation for the apps dataset in ZFS, we can run the command:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/a0/56/a0566f3d-13d5-4264-bca3-24a65c82fcac.png&quot; width=&quot;906&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;After setting the 5G reservation in ZFS for apps, we see that the amount USED for &lt;em&gt;zfspool&lt;/em&gt; has increased from 24.4M to 5.02G (which includes the 5G reservation for apps), with 14.0G AVAIL instead of the 19.0G AVAIL prior to setting up the reservation. Also note that the amount available for apps under &lt;em&gt;/zfspool/apps&lt;/em&gt; remains at 19.0G: a reservation is not a quota, so the space available to apps itself is unchanged. The amount of space AVAIL to dante remains at 5G (per the quota we set up earlier), but the total AVAIL for datapioneer has been reduced from 19.0G to 14.0G. The 5G difference reflects the fact that 5G of space has been reserved for apps and is no longer available to any other dataset in &lt;em&gt;zfspool&lt;/em&gt;.&lt;/p&gt;
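  &lt;p&gt;For reference, the reservation shown in the figure above would be set and verified with commands along these lines (a sketch based on the standard &lt;em&gt;reservation&lt;/em&gt; property):&lt;/p&gt;

```shell
# Reserve 5G of pool space exclusively for the apps dataset
zfs set reservation=5G zfspool/apps

# Verify: zfspool USED grows by ~5G and AVAIL shrinks accordingly
zfs get reservation zfspool/apps
zfs list
```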
  &lt;p&gt;&lt;a href=&quot;https://inlinuxveritas.com/HkV-xNH18?cda=&quot; target=&quot;_blank&quot;&gt;Designating &amp;quot;Hot Spares&amp;quot; in the Storage Pool in OpenZFS - Part 6&lt;/a&gt;&lt;/p&gt;

</content></entry><entry><id>datapioneer:rJOfqWQy8</id><link rel="alternate" type="text/html" href="https://teletype.in/@datapioneer/rJOfqWQy8?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=datapioneer"></link><title>Logs Mirror, Cache, &amp; Snapshots in OpenZFS Filesystem on Linux</title><published>2019-12-27T23:15:25.279Z</published><updated>2019-12-31T03:01:15.235Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://teletype.in/files/b8/eb/b8eb1f57-027e-4d8e-b1cb-1df617b4d363.png"></media:thumbnail><category term="blog" label="Blog"></category><summary type="html">&lt;img src=&quot;https://teletype.in/files/b8/eb/b8eb1f57-027e-4d8e-b1cb-1df617b4d363.png&quot;&gt;Another big advantage to installing ZoL or OpenZFS filesystem on Linux is that as the sysadmin you can create a mirror of two SCSI drives in the filesystem containing your system logs and create a cache consisting of two SSD drives in the filesystem containing system cache information. The logs mirror helps to balance the load of the ZFS pool in the system and also helps to ensure that your log files are preserved in the event of RAIDZ failure. The cache is a part of the ARC (Adaptive Replacement Cache) system in OpenZFS and assists in rebuilding drives to restore your system if drives begin to fail. Read cache is referred to as L2ARC (Level 2 Adaptive Replacement Cache), synchronous write cache is ZIL (ZFS Intent Log), SLOG (Separate...</summary><content type="html">
  &lt;p&gt;Another big advantage of installing ZoL (OpenZFS on Linux) is that, as the sysadmin, you can create a mirror of two SCSI drives in the filesystem to hold your pool&amp;#x27;s log device and create a cache consisting of two SSD drives. The logs mirror helps to balance the write load of the ZFS pool and helps to ensure that in-flight synchronous writes are preserved in the event of RAIDZ failure. The cache is part of the ARC (Adaptive Replacement Cache) system in OpenZFS and speeds up reads of frequently accessed data. The on-disk read cache is referred to as the L2ARC (Level 2 Adaptive Replacement Cache); the synchronous write log is the ZIL (ZFS Intent Log), which can be placed on a SLOG (Separate Log Device).&lt;/p&gt;
  &lt;p&gt;To prepare for the creation of the logs mirror, I added two additional 10G SCSI drives to the ZFS pool. These are designated as /dev/sde and /dev/sdf, respectively. Next, I ran the following command in the Terminal to add the logs mirror:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/b8/eb/b8eb1f57-027e-4d8e-b1cb-1df617b4d363.png&quot; width=&quot;1035&quot; /&gt;
  &lt;/figure&gt;
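  &lt;p&gt;The command in the figure above should be along these lines (pool name and device paths are taken from the text; the &lt;em&gt;log mirror&lt;/em&gt; syntax is the standard OpenZFS form):&lt;/p&gt;

```shell
# Add a mirrored log (SLOG) vdev built from the two new SCSI drives
zpool add zfspool log mirror /dev/sde /dev/sdf

# Confirm the logs mirror appears in the pool layout
zpool status zfspool
```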
  &lt;p&gt;Creating the logs mirror pretty much guarantees preservation of log information in OpenZFS in the event of drive failure. It is also the first step in creating what is referred to as the ZIL on a SLOG, a fast persistent write log for synchronous ZFS writes to disk. Now that we have created the logs mirror in ZFS, the second step is to create what is known as the ZFS cache (part of the L2ARC, the Level 2 Adaptive Replacement Cache). To create the cache in the system, I added two high-speed SSD drives, /dev/sdg and /dev/sdh, and then ran the following command:&lt;/p&gt;
  &lt;p&gt;# &lt;strong&gt;zpool add zfspool cache /dev/sdg /dev/sdh&lt;/strong&gt;&lt;/p&gt;
  &lt;p&gt;To check the status of the zpool at this point, we can rerun the &lt;strong&gt;zpool status&lt;/strong&gt; command:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/ae/81/ae81ecd9-d52b-455a-ba79-77a92e27c1bb.png&quot; width=&quot;559&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;The size of the disks that hold the logs and cache needs to be determined based on the performance of your system; monitoring it will give you a better indication of how large these drives need to be. Using high-speed SSDs or NVMe M.2 drives for the cache rather than traditional SCSI drives is a good idea, especially if you don&amp;#x27;t want to run into performance bottlenecks when cache information is being written to these drives during OpenZFS resilvering of replacement drives.&lt;/p&gt;
  &lt;p&gt;By combining the ZIL with a Separate Log Device (SLOG) in the configuration that you see above, you are essentially balancing the load on your system using OpenZFS.&lt;/p&gt;
  &lt;p&gt;ZFS snapshots are point-in-time images of the ZFS filesystem, taken across the entire pool or of individual datasets within the pool that you wish to capture. Snapshots in ZFS are read-only and immutable, which makes them great for backups since you can back up the snapshot rather than the live files; keep in mind, though, that a snapshot kept on the same host is not by itself a backup. Snapshots can be sent from one host to another or replicated within the same host. When creating snapshots, it is important to note that they are represented in Linux as hidden directories under the dataset in which they are created. Files that are deleted after a snapshot was taken may be restored using a rollback procedure, which I will demonstrate below. Let&amp;#x27;s create a snapshot of the zfspool/apps dataset:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/a7/97/a7975e3f-7ce0-4b9b-8f61-a4ad2f68c938.png&quot; width=&quot;1021&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;The command # &lt;strong&gt;zfs snapshot zfspool/apps@271220191653&lt;/strong&gt; creates a snapshot of the apps dataset representing an instance in time, 16:53 on 27 December 2019, of the zfspool/apps dataset.&lt;/p&gt;
  &lt;p&gt;After creating this snapshot, I used vim to create a file called &lt;strong&gt;file1&lt;/strong&gt; which I placed in the &lt;strong&gt;/zfspool/apps&lt;/strong&gt; directory where the snapshot was created:&lt;/p&gt;
  &lt;p&gt;root@debian-10-desktop-vm:/zfspool/apps# ls -lh&lt;br /&gt;total 1.0K&lt;br /&gt;-rw-r--r-- 1 root root 15 Dec 27 16:55 &lt;strong&gt;file1&lt;/strong&gt;&lt;/p&gt;
  &lt;p&gt;Now, if I rollback the snapshot using the ZFS rollback command, this file should no longer exist since it was created after the snapshot &lt;strong&gt;zfspool/apps@271220191653 &lt;/strong&gt;was created:&lt;/p&gt;
  &lt;p&gt;root@debian-10-desktop-vm:/zfspool/apps# &lt;strong&gt;zfs rollback zfspool/apps@271220191653&lt;/strong&gt;&lt;br /&gt;root@debian-10-desktop-vm:/zfspool/apps# ls -lh&lt;br /&gt;&lt;strong&gt;total 0&lt;/strong&gt;&lt;br /&gt;root@debian-10-desktop-vm:/zfspool/apps# &lt;br /&gt;&lt;/p&gt;
  &lt;p&gt;and, indeed, the rollback has eliminated the file which was created earlier. Snapshots allow you to create a mark in time for your system. If you wish to list out the snapshots that have been created in the ZFS filesystem, you can accomplish this using the command:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/39/42/39428054-ea07-43bb-9ec0-1d6499d5dd2c.png&quot; width=&quot;704&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;These snapshots do take up space on your system, but, as you can see here, this snapshot is only using 17.3K in the system.&lt;/p&gt;
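  &lt;p&gt;The listing above is produced by the standard snapshot-listing command:&lt;/p&gt;

```shell
# List all snapshots in the pool, with the space each one consumes
zfs list -t snapshot
```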
  &lt;p&gt;And, finally, if you want to move a zpool to another machine, you first need to change out of any directory inside the pool you wish to move, then use the following command:&lt;/p&gt;
  &lt;p&gt;root@debian-10-desktop-vm:/zfspool/apps# pwd&lt;br /&gt;/zfspool/apps&lt;br /&gt;root@debian-10-desktop-vm:/zfspool/apps# cd ../../&lt;br /&gt;root@debian-10-desktop-vm:/# pwd&lt;br /&gt;/&lt;br /&gt;root@debian-10-desktop-vm:/# &lt;strong&gt;zpool export zfspool&lt;/strong&gt;&lt;br /&gt;root@debian-10-desktop-vm:/# zpool status&lt;br /&gt;&lt;strong&gt;no pools available&lt;/strong&gt;&lt;br /&gt;root@debian-10-desktop-vm:/#&lt;/p&gt;
  &lt;p&gt;As you can see above, I moved up the directory tree two levels to the / directory, then ran the highlighted command. After running the &lt;strong&gt;zpool status&lt;/strong&gt; command, you see that no pools are available. This is because the zpool has been exported: its drives have been released and are ready to be attached to another system. On the importing side, the -d switch of &lt;strong&gt;zpool import&lt;/strong&gt; can be used to point at the directory containing the pool&amp;#x27;s devices; I did not need it here because I am going to turn right around and import the zpool back on the same machine. I do this using the ZFS import command like this:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/df/1b/df1b62c4-ba19-4dfc-a874-84576c1114c7.png&quot; width=&quot;564&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;The zpool zfspool has been returned and running a status on the pool returns good status as before. &lt;/p&gt;
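  &lt;p&gt;In short, the move boils down to an export/import pair:&lt;/p&gt;

```shell
# On the source machine (run from outside the pool's mountpoints)
zpool export zfspool

# On the destination (or, as here, the same machine);
# add '-d DIR' if the pool's devices live in a non-default directory
zpool import zfspool
```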
  &lt;p&gt;&lt;a href=&quot;https://inlinuxveritas.com/HkX83eHkL?cda=&quot; target=&quot;_blank&quot;&gt;Setting up Quotas &amp;amp; Reservations in OpenZFS - Part 5&lt;/a&gt;&lt;/p&gt;

</content></entry><entry><id>datapioneer:Sk88mPzyU</id><link rel="alternate" type="text/html" href="https://teletype.in/@datapioneer/Sk88mPzyU?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=datapioneer"></link><title>Investigating RAIDZ in Debian 10 &quot;Buster&quot; Linux</title><published>2019-12-26T20:17:26.342Z</published><updated>2019-12-29T03:56:05.786Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://teletype.in/files/43/ee/43eeebe9-6aca-4efd-8276-efcd4574e1a5.png"></media:thumbnail><category term="blog" label="Blog"></category><summary type="html">&lt;img src=&quot;https://teletype.in/files/43/ee/43eeebe9-6aca-4efd-8276-efcd4574e1a5.png&quot;&gt;Now that we have looked at implementing OpenZFS on Linux in Debian 10 Linux and created zfs pool mirrors using OpenZFS as well as created and accessed ZFS datasets on the system, let's turn our attention to implementing RAID in OpenZFS. How does implementing RAID in OpenZFS compare to traditional RAID solutions? Is there a one-to-one correlation between RAIDZ and traditional RAID? Are there advantages to running RAIDZ rather than the traditional RAID solutions? </summary><content type="html">
  &lt;p&gt;Now that we have looked at &lt;a href=&quot;https://inlinuxveritas.com/Sk68PBb1U?cda=&quot; target=&quot;_blank&quot;&gt;implementing OpenZFS on Linux in Debian 10 Linux&lt;/a&gt; and created zfs pool mirrors using OpenZFS as well as created and accessed ZFS datasets on the system, let&amp;#x27;s turn our attention to implementing RAID in OpenZFS. How does implementing RAID in OpenZFS compare to traditional RAID solutions? Is there a one-to-one correlation between RAIDZ and traditional RAID? Are there advantages to running RAIDZ rather than the traditional RAID solutions? &lt;/p&gt;
  &lt;p&gt;RAIDZ in OpenZFS is roughly equivalent to RAID-5 in the traditional RAID world: it requires at least three drives, with one drive&amp;#x27;s worth of capacity devoted to parity (the parity is distributed across the drives rather than kept on a dedicated parity drive). RAIDZ (RAID-5) can survive the loss of one drive and still function. RAIDZ2 is roughly equivalent to RAID-6: it requires at least four drives, devotes two drives&amp;#x27; worth of capacity to parity, and can survive the loss of two drives while still functioning within the array. RAIDZ3, with triple parity, has no equivalent in traditional RAID.&lt;/p&gt;
  &lt;p&gt;The first thing that we need to do is to destroy the current zpool arrangement that we established when I showed you how to implement OpenZFS on Debian 10 &amp;quot;Buster&amp;quot; Linux and set up ZFS pools or mirrors in the original article shown in paragraph 1 above. The process of destroying the ZFS pools is irreversible, so if there is any data that is important to you or you want to go back to this OpenZFS configuration, then you should perform a backup by taking a snapshot of the system. We are using VirtualBox 6.0 Manager to create this OpenZFS environment in Debian 10 Linux, so it is very simple to create a snapshot. If you don&amp;#x27;t know how to create a snapshot of your VM (similar to creating a Restore Point in Windows), then &lt;a href=&quot;https://www.techrepublic.com/article/how-to-use-snapshots-in-virtualbox/&quot; target=&quot;_blank&quot;&gt;consult this resource&lt;/a&gt; on how to accomplish this prior to moving forward.&lt;/p&gt;
  &lt;p&gt;Before destroying the current ZFS pool configuration, I suggest shutting down the VM and adding some additional SCSI drives. In the example below, I have added three more virtual SCSI drives to the VM. These are /dev/sdf, /dev/sdg, and /dev/sdh for a total of seven SCSI drives in the system. &lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/43/ee/43eeebe9-6aca-4efd-8276-efcd4574e1a5.png&quot; width=&quot;719&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;Now, restart the VM and return to the Terminal in Debian 10. To destroy the current OpenZFS configuration we established in Linux, obtain elevated privileges in the system by becoming root, then run the following command as shown in the diagram below:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/ac/2a/ac2a1af9-4820-4a43-a518-332b207098ab.png&quot; width=&quot;587&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;The &lt;strong&gt;zpool destroy zfspool &lt;/strong&gt;command is akin to running &lt;em&gt;rm -Rf &amp;lt;directory name&amp;gt;&lt;/em&gt; in Linux, as it recursively removes the named pool and every dataset that resides in it. Rerunning the &lt;strong&gt;zpool list&lt;/strong&gt; command afterward confirms that we no longer have any OpenZFS pools or datasets present in the Linux system. It is worth mentioning here that this command removes not only the pool and its datasets but also any snapshots that may have been taken in ZFS and the ZFS mountpoints that were created. Therefore, backups are critical at this point, as reversing this process is impossible.&lt;/p&gt;
  &lt;p&gt;So, what is RAIDZ? Essentially, RAIDZ is the implementation of RAID in OpenZFS, roughly equivalent to RAID-5 using traditional software/hardware RAID in Linux. One primary distinction between RAIDZ and RAID-5 worth mentioning at this point is that, unlike traditional RAID-5, RAIDZ does not rebuild a drive block-by-block when a drive is lost from the RAID or is manually replaced prior to failure. Rather, RAIDZ looks only at the live data that belongs on the drive(s) being rebuilt and writes back to the replacement drive(s) just that data, not a block-by-block restoration of the entire device. When OpenZFS restores a RAID drive, it reports that the drive has been &amp;quot;resilvered&amp;quot;. We&amp;#x27;ll see this when I show you how to implement RAIDZ going forward in this article.&lt;/p&gt;
  &lt;p&gt;To create a RAIDZ or RAIDZ1 in Linux, I need to run the following command as root, then I can check the status of the newly-created RAIDZ running the subsequent command we&amp;#x27;ve run previously. This process will take a few seconds longer to run since the RAIDZ takes a little more work. Here we are creating an equivalent to a traditional RAID-5 using this one OpenZFS command and attaching three SCSI drives in the process. This RAIDZ will allow us to survive the loss of one drive without losing the RAID:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/7f/69/7f698274-518a-4ed3-a018-a2b91a5d45a9.png&quot; width=&quot;938&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;Now, if I want to add an additional RAIDZ pool under zfspool I can rerun the previous command replacing the &amp;quot;create&amp;quot; option with &amp;quot;add&amp;quot; and changing /dev/sdb, /dev/sdc, and /dev/sdd with /dev/sde, /dev/sdf, and /dev/sdg, respectively. See the diagram below:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/d5/95/d59550f9-542a-43de-89f3-c62b6b01ed65.png&quot; width=&quot;890&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;Running a status of the zpool configuration now shows that we have created two RAIDZ vdevs, raidz1-0 and raidz1-1, in the system. The first RAIDZ comprises drives sdb through sdd and the second comprises drives sde through sdg. Both vdevs are ONLINE and no errors have been detected on the drives.&lt;/p&gt;
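  &lt;p&gt;As described above, the two RAIDZ vdevs come from a &lt;em&gt;create&lt;/em&gt; followed by an &lt;em&gt;add&lt;/em&gt;, along these lines:&lt;/p&gt;

```shell
# First RAIDZ vdev: three drives, single parity
zpool create zfspool raidz /dev/sdb /dev/sdc /dev/sdd

# Second RAIDZ vdev added to the same pool
zpool add zfspool raidz /dev/sde /dev/sdf /dev/sdg

# Verify both raidz1-0 and raidz1-1 are ONLINE
zpool status zfspool
```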
  &lt;p&gt;If you recall from the &lt;a href=&quot;https://inlinuxveritas.com/Sk68PBb1U?cda=&quot; target=&quot;_blank&quot;&gt;previous article on implementing OpenZFS in Linux&lt;/a&gt;, we set all the SCSI drives to a 10G capacity. Let&amp;#x27;s investigate how much drive capacity we have usable among the six 10G-capacity SCSI drives that we have implemented in this scheme:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/12/21/12215764-ca15-4afa-bffd-e68e891eb638.png&quot; width=&quot;554&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;From the diagram above, we see that OpenZFS is reporting a total of 38.1G of usable drive space out of a total of 60G. This is roughly two-thirds of the raw capacity of the six SCSI drives; we are losing one-third of the capacity, or roughly two drives&amp;#x27; worth, to parity. In this configuration, we can survive the loss of a single drive in each RAIDZ vdev without system-wide data loss. We now need to recreate the datasets /zfspool/apps, /zfspool/data, and /zfspool/data/datapioneer that we had previously. This is accomplished in the Terminal like so:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/52/ac/52acae62-9e17-4460-96a6-ec485ff58087.png&quot; width=&quot;827&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;We also need to make the datapioneer dataset we just created accessible again, which requires changing the ownership and group ownership of /zfspool/data/datapioneer to datapioneer, as well as setting permissions of 755 on this dataset (the mounted directory for the zfspool in Linux). See the diagram below:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/65/76/65767cb1-b2db-454d-9b3d-a4b64437c126.png&quot; width=&quot;887&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;Running ls -ld against the dataset confirms this has been accomplished. I have written a Bash script which writes data to the dataset we created and made accessible, namely /zfspool/data/datapioneer. The bash script looks like the following:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/2d/c9/2dc99a0a-cafa-4e40-8660-3f1d38f5a3cf.png&quot; width=&quot;737&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;I called this bash script data.sh, set the executable bit on the file, then ran the script, which added 3M x 4 (across 4 test files) of data to the zfspool dataset. To confirm this, we can run the df -kh command in the Terminal. This reveals:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/0b/8a/0b8aa8f6-9c22-4ccd-a272-80b876f2e43c.png&quot; width=&quot;845&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;which demonstrates successful writing of ~12M (13M shown) of data to the dataset. Similarly, we can look at this in the GUI by bringing up the File Manager and taking a Properties sheet on the combined files test through test4 that were created in running the bash script:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/93/a2/93a2a3fd-b1cf-489a-9360-3eb72356ac0c.png&quot; width=&quot;772&quot; /&gt;
  &lt;/figure&gt;
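  &lt;p&gt;The data.sh script described above amounts to a loop of &lt;em&gt;dd&lt;/em&gt; writes. Here is a sketch of it; the target path defaults to a scratch directory so it can be tried anywhere, and the exact filenames are my assumption:&lt;/p&gt;

```shell
#!/bin/bash
# Sketch of data.sh: writes four 3M test files (~12M total) into the dataset.
# On the ZFS system the target would be /zfspool/data/datapioneer.
TARGET="${1:-$(mktemp -d)}"
for f in test test2 test3 test4; do
    # 3M of random data per file
    dd if=/dev/urandom of="$TARGET/$f" bs=1M count=3 status=none
done
ls -lh "$TARGET"
```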
  &lt;p&gt;So, now, let&amp;#x27;s look at some real world examples of how RAIDZ works. If we return to take a look at the status of the current RAIDZ, we see the following output:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/8a/30/8a303663-6be8-4d22-9490-98a29eb92648.png&quot; width=&quot;745&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;Let&amp;#x27;s assume that we, as sysadmin, see a bunch of checksum (CKSUM) errors starting to show up on drive &lt;strong&gt;/dev/sdc&lt;/strong&gt;, indicating a potential failure of that drive. We can take one of our spare drives, &lt;strong&gt;/dev/sdh,&lt;/strong&gt; and replace /dev/sdc with this spare. To accomplish this on-the-fly using OpenZFS, we run the following command in the Terminal:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/52/20/5220c69a-cbc2-4bea-b038-30b9ee833eec.png&quot; width=&quot;1050&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;In the command that we ran in the Terminal, we referenced &lt;strong&gt;/dev/sdc&lt;/strong&gt; as the drive to be replaced and &lt;strong&gt;/dev/sdh&lt;/strong&gt; as the drive replacing it. Note that under zfspool / raidz1-0, drive &lt;strong&gt;sdc&lt;/strong&gt; has been replaced by the spare drive, &lt;strong&gt;sdh&lt;/strong&gt;, which is showing up as ONLINE, and the comment following &amp;quot;scan:&amp;quot; indicates the zfspool has been &amp;quot;resilvered&amp;quot;, which, as I mentioned earlier, is the equivalent of a RAID rebuild of the replaced drive. Note also that this process took only 1 second, since unlike traditional RAID-5, RAIDZ rewrites only the 12M of data it needs onto the spare drive rather than rebuilding the entire 10G drive block-by-block.&lt;/p&gt;
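  &lt;p&gt;The on-the-fly replacement described above uses &lt;em&gt;zpool replace&lt;/em&gt;, naming the old drive and its replacement:&lt;/p&gt;

```shell
# Swap the suspect drive for the spare; ZFS resilvers only the live data
zpool replace zfspool /dev/sdc /dev/sdh

# Watch the resilver progress and result under 'scan:'
zpool status zfspool
```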
  &lt;p&gt;So, now let&amp;#x27;s look at another real world scenario in which a SCSI drive in the RAIDZ zfspool actually fails. I have replaced /dev/sdh with /dev/sdc once again and then simulated a drive failure in drive /dev/sdc using the following command in the Terminal and outputting the status following the running of the command:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/6c/e0/6ce0d61b-404e-498e-a9d9-8683a3a67494.png&quot; width=&quot;976&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;Taking /dev/sdc OFFLINE results in a degraded zfspool which is shown in the &amp;quot;state&amp;quot; of the pool as well as an OFFLINE status for sdc under raidz1-0 with a STATE of &amp;quot;degraded&amp;quot;. The &amp;quot;status&amp;quot; reported shows one or more devices has been taken offline by the administrator (simulating the failure) and that &amp;quot;sufficient replicas exist for the pool to continue functioning in a degraded state.&amp;quot; The recommended &amp;quot;action&amp;quot; is to &amp;quot;Online the device using &amp;#x27;zpool online&amp;#x27; or replace the device with &amp;#x27;zpool replace&amp;#x27;.&amp;quot;&lt;/p&gt;
  &lt;p&gt;In this example, we&amp;#x27;re going to ONLINE the drive /dev/sdc, by running the appropriate command shown below, then rerunning a status to show the current state of the zpool in a non-DEGRADED state:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/62/50/6250c593-cf0d-4b5a-8ee6-ae7f943776f6.png&quot; width=&quot;974&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;The drive, /dev/sdc, has been resilvered and the entire zfspool has been returned to an ONLINE status from DEGRADED status reported earlier.&lt;/p&gt;
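  &lt;p&gt;The failure simulation and recovery above boil down to an &lt;em&gt;offline&lt;/em&gt;/&lt;em&gt;online&lt;/em&gt; pair:&lt;/p&gt;

```shell
# Simulate a failure: the pool drops to a DEGRADED state
zpool offline zfspool /dev/sdc

# Bring the drive back; ZFS resilvers it and the pool returns to ONLINE
zpool online zfspool /dev/sdc
zpool status zfspool
```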
  &lt;p&gt;Now, notice that when we run &lt;strong&gt;zfs list&lt;/strong&gt; again in the Terminal, I, as datapioneer, have access to 38.1G of usable drive space across the six SCSI drives and, could theoretically fill that entire space on my own. &lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/f3/e3/f3e36245-969a-4b5f-8d2c-a877badbbbc9.png&quot; width=&quot;845&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;But, in OpenZFS, there is a way to limit the amount of this usable space that a single user can consume. This is referred to as establishing a QUOTA on the pool. If I want to limit my quota of usable drive space to 100M rather than the entire 38.1G, I can run the following command in the Terminal:&lt;/p&gt;
  &lt;p&gt;# &lt;strong&gt;zfs set quota=100m zfspool/data/datapioneer&lt;/strong&gt;&lt;/p&gt;
  &lt;p&gt;Then, we can confirm this by running the following command and looking at the output in the Terminal:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/75/cf/75cfeafd-9405-4c0e-8460-172c4dfec688.png&quot; width=&quot;1027&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;The quota assigned to datapioneer is listed as 100M, not the full pool of 38.1G. So, if I start creating files that begin to fill up the pool beyond 100M, future actions to create files will be disallowed until action is taken to reduce the amount of used space across the six drives below the quota assigned to me. Granted, 100M of allowed space in today&amp;#x27;s world seems totally preposterous, but you get the point. &lt;/p&gt;
  &lt;p&gt;Another good example here would be to run a zfs command that would allow the administrator to reserve 100M of usable space for apps on the pool. Unlike quotas, this is considered a &lt;strong&gt;reservation&lt;/strong&gt; of space that is to be used by apps only so that no one else can use it. To perform this action in OpenZFS against the pool, we can run the following command in the Terminal:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/8d/a1/8da1930d-cb27-446d-b425-f7122e57345c.png&quot; width=&quot;1036&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;So, by looking at the before-and-after listing of zfs we see that initially there was 38.1G available space in zfspool. After running the reservation using the command shown to reserve 100M of space for apps, the amount of available space in zfspool has been effectively reduced to 38.0G so that others cannot use the difference of 0.1G or 100M. &lt;/p&gt;
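  &lt;p&gt;The reservation behind the figure above takes this form (a sketch using the standard &lt;em&gt;reservation&lt;/em&gt; property):&lt;/p&gt;

```shell
# Set aside 100M of the pool exclusively for the apps dataset
zfs set reservation=100m zfspool/apps

# AVAIL for the other datasets drops by the reserved amount
zfs list
```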
  &lt;p&gt;And, finally, the last real-world scenario I would like to show simulates a maintenance process that is much more difficult to perform with traditional Linux filesystems. If you start noticing that a drive might be failing, or that data on it might be becoming corrupt, in Linux you would have to take the drive offline by &lt;strong&gt;unmounting&lt;/strong&gt; it (and this could be in a Linux RAID scenario), run &lt;strong&gt;fsck&lt;/strong&gt; against the drive, check the results, then return the drive online by &lt;strong&gt;mounting&lt;/strong&gt; it again once you find there are no errors or the errors have been corrected. One of the beauties of OpenZFS is that a sysadmin can check a drive, or the pool of drives in RAIDZ, without having to take a particular drive offline. This process is referred to as a &lt;strong&gt;SCRUB&lt;/strong&gt; in OpenZFS, and it can be performed at any time while the ZFS pool is functioning, or on a schedule using cron. To perform this action manually, you can run the following command in the Terminal:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/f2/30/f230e592-67d8-46e2-bc0f-0ca26054dd16.png&quot; width=&quot;996&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;Running the command as shown above performed a scrub of the zfspool, and it reports that 0B of data was repaired in 1 second. If a lot of data had been repaired instead, that would be an indication of corruption occurring in the pool, alerting the sysadmin to a potentially failing drive that needs to be replaced.&lt;/p&gt;
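  &lt;p&gt;A manual scrub is a two-command affair:&lt;/p&gt;

```shell
# Kick off an online integrity check of the whole pool
zpool scrub zfspool

# Check the result; the 'scan:' line reports how much data was repaired
zpool status zfspool
```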
  &lt;p&gt;In a future article, I will look at the implementation of RAIDZ2 and RAIDZ3 and its implications.&lt;/p&gt;
  &lt;p&gt;&lt;a href=&quot;https://inlinuxveritas.com/rJOfqWQy8?cda=&quot; target=&quot;_blank&quot;&gt;Logs, Mirrors, Cache, &amp;amp; Snapshots in OpenZFS in Linux - Part 4&lt;/a&gt;&lt;/p&gt;

</content></entry><entry><id>datapioneer:Sk68PBb1U</id><link rel="alternate" type="text/html" href="https://teletype.in/@datapioneer/Sk68PBb1U?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=datapioneer"></link><title>Installing and Using OpenZFS on Debian 10 &quot;Buster&quot; Linux</title><published>2019-12-26T05:05:36.047Z</published><updated>2019-12-29T03:26:03.912Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://teletype.in/files/a6/76/a6762812-6f3c-4bb5-aaac-4eb690c161a0.png"></media:thumbnail><category term="blog" label="Blog"></category><summary type="html">&lt;img src=&quot;https://teletype.in/files/a6/76/a6762812-6f3c-4bb5-aaac-4eb690c161a0.png&quot;&gt;I am running Debian 10 &quot;Buster&quot; Linux in Virtual Box 6.0 Manager on my Win10 Pro Main PC using the debian10-1.0-amd64-netinst.iso file which I downloaded from the Debian Linux download page. This distro was originally installed as a VM using the ext4 filesystem for the primary partition represented as /dev/sda1 in the system. I wanted to experiment with using ZFS (ZetaByte File System) which was originally developed by Sun Microsystems and published under the CDDL license in 2005 as part of the OpenSolaris operating system. I further wanted to investigate this filesystem over others that are traditionally used in Linux, such as ext3/ext4/btrfs because ZFS is known for two specific reasons: (1) It stores large files in compressed format...</summary><content type="html">
  &lt;p&gt;I am running Debian 10 &amp;quot;Buster&amp;quot; Linux in VirtualBox 6.0 Manager on my Win10 Pro Main PC using the debian10-1.0-amd64-netinst.iso file which I downloaded from the &lt;a href=&quot;https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/&quot; target=&quot;_blank&quot;&gt;Debian Linux download page&lt;/a&gt;. This distro was originally installed as a VM using the ext4 filesystem for the primary partition represented as /dev/sda1 in the system. I wanted to experiment with using ZFS (Zettabyte File System), which was originally developed by Sun Microsystems and published under the CDDL license in 2005 as part of the OpenSolaris operating system. I further wanted to investigate this filesystem over others that are traditionally used in Linux, such as ext3/ext4/btrfs, because ZFS is known for two specific reasons: (1) it stores large files in compressed format, and (2) it decouples the filesystem from the hardware or the platform on which it is running. In my specific case, I&amp;#x27;m running Debian 10 in a Virtual Machine rather than on bare metal and I&amp;#x27;m running Linux, not Windows, MacOS, or BSD.&lt;/p&gt;
  &lt;p&gt;The implementation I am undertaking, however, is not Oracle&amp;#x27;s ZFS but OpenZFS. I chose OpenZFS because, unlike ZFS, whose development became proprietary under Oracle, OpenZFS is open source and community-supported under the CDDL. Its ZoL (ZFS on Linux) port brings the filesystem to Linux distros like Debian, Arch, Fedora, Gentoo, openSUSE, RHEL and CentOS, and Ubuntu.&lt;/p&gt;
  &lt;p&gt;My OpenZFS on Debian 10 &amp;quot;Buster&amp;quot; Linux project was performed by following the steps below:&lt;/p&gt;
  &lt;p&gt;Step 1: Set up Debian 10 &amp;quot;Buster&amp;quot; Linux in Oracle VirtualBox 6.0 Manager running in Windows 10 Pro and install the operating system using the default ext4 filesystem. Apply all system updates prior to moving to Step 2, then shut down the system.&lt;/p&gt;
  &lt;p&gt;Step 2: Add two VHD SCSI disks in the system designated as /dev/sdb and /dev/sdc, then restart the VM. See the diagram below which shows how I added the SCSI Controller in the Storage module, then added the two VHD SCSI Virtual Hard Drives on the SCSI Controller in the system:&lt;/p&gt;
  &lt;figure class=&quot;m_column&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/a6/76/a6762812-6f3c-4bb5-aaac-4eb690c161a0.png&quot; width=&quot;718&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;Step 3: For Debian Buster Linux, the ZFS packages are included in the contrib repository. In my case, I used the backports repository, which typically contains more up-to-date releases of these ZFS packages. I added the backports repository in Debian 10 Buster using the following commands in the Linux Terminal:&lt;/p&gt;
  &lt;p&gt;# &lt;strong&gt;vi /etc/apt/sources.list.d/buster-backports.list&lt;/strong&gt;&lt;br /&gt;deb http://deb.debian.org/debian buster-backports main contrib&lt;br /&gt;deb-src http://deb.debian.org/debian buster-backports main contrib&lt;br /&gt;&lt;br /&gt;# &lt;strong&gt;vi /etc/apt/preferences.d/90_zfs&lt;/strong&gt;&lt;br /&gt;Package: libnvpair1linux libuutil1linux libzfs2linux libzpool2linux spl-dkms zfs-dkms zfs-test zfsutils-linux zfsutils-linux-dev zfs-zed&lt;br /&gt;Pin: release n=buster-backports&lt;br /&gt;Pin-Priority: 990&lt;/p&gt;
  &lt;p&gt;Step 4: Run a system update to refresh the repositories using&lt;/p&gt;
  &lt;p&gt;# apt update&lt;/p&gt;
  &lt;p&gt;and then install the kernel headers and associated dependencies using&lt;/p&gt;
  &lt;p&gt;# apt install --yes dpkg-dev linux-headers-$(uname -r) linux-image-amd64&lt;br /&gt;&lt;/p&gt;
  &lt;p&gt;And, finally, install the ZFS packages by running&lt;/p&gt;
  &lt;p&gt;# apt-get install zfs-dkms zfsutils-linux&lt;/p&gt;
  &lt;p&gt;Step 5: My Debian 10 Buster Linux system was unable to find the &lt;strong&gt;zpool&lt;/strong&gt; command when I started executing the commands in the Linux Terminal to set up the two SCSI drives as a ZFS mirror pool. So, I had to add the location of the ZFS commands to my PATH so Bash would recognize them. I ran the command:&lt;/p&gt;
  &lt;p&gt;# whereis zpool&lt;/p&gt;
  &lt;p&gt;which returned the location of the command as&lt;/p&gt;
  &lt;p&gt;/usr/sbin/zpool&lt;/p&gt;
  &lt;p&gt;Therefore, to allow my Linux system to find this command, as well as the other ZFS commands, by default, I added the following line at the bottom of the ~/.bashrc file, saved it, and restarted the Terminal:&lt;/p&gt;
  &lt;p&gt;&lt;strong&gt;# adding the path to the zpool command in Linux&lt;br /&gt;export PATH=$PATH:/usr/sbin/&lt;/strong&gt;&lt;/p&gt;
  &lt;p&gt;This makes the $PATH change persistent across shell sessions. To prove this, I ran the command:&lt;/p&gt;
  &lt;p&gt;echo $PATH &lt;/p&gt;
  &lt;p&gt;in the Terminal and the following was returned:&lt;/p&gt;
  &lt;p&gt;/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:&lt;strong&gt;/usr/sbin/&lt;/strong&gt;&lt;/p&gt;
  &lt;p&gt;indicating that the $PATH has been extended to include the directory holding the ZFS commands, making these commands recognizable by Bash by default.&lt;/p&gt;
  &lt;p&gt;Next, I created a ZFS pool from the two SCSI drives /dev/sdb and /dev/sdc I created earlier, combining them into a ZFS mirror (mirror-0), using the command:&lt;/p&gt;
  &lt;p&gt;# zpool create zfspool mirror /dev/sdb /dev/sdc&lt;/p&gt;
  &lt;p&gt;and checked the status of the ZFS pool I just created using the &lt;strong&gt;zpool status&lt;/strong&gt; command with the &lt;strong&gt;-v&lt;/strong&gt; switch, then ran the &lt;strong&gt;zfs list&lt;/strong&gt; command to show the amount of pool space currently available and where the pool is mounted. I followed this up with the &lt;strong&gt;df -kh&lt;/strong&gt; command to have Linux show me the filesystem breakdown in human-readable block sizes. See these commands listed out below:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/e4/89/e489c8b6-fef9-4344-a998-46e4e9b334f2.png&quot; width=&quot;742&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;Note above that the zfspool is in the ONLINE state, and both sdb and sdc, along with the associated mirror-0, are ONLINE as well. No errors were detected on these drives. The &lt;strong&gt;zfs list &lt;/strong&gt;command shows that 88.5K was written to the mirror for administrative tracking purposes, leaving about 9.3G available. Because the two 10G drives are mirrored, the pool provides roughly 10G of usable space rather than 20G: every block is written to both drives. &lt;em&gt;Also, note that the mount of the zfspool at /zfspool shown in the diagram above is persistent. By implementing the OpenZFS filesystem using ZoL, I am no longer required to update the /etc/fstab file in Linux each time I create a ZFS pool. The zfspool is not mounted through the usual Linux machinery at all; instead, ZFS manages the pool of two drives and automatically mounts it, listing the drives under mirror-0. &lt;/em&gt;&lt;/p&gt;
  &lt;p&gt;Step 6: Now, I want to expand my current zfspool of drives from its current 10G size to 20G by adding a new mirror of two drives designated as /dev/sdd and /dev/sde. To accomplish this, I need to stop the VirtualBox 6.0 Manager VM of Debian 10 Buster, then add two SCSI drives as VHD virtual drives which show up as /dev/sdd and /dev/sde in the Linux system. The process to add these two additional drives in VirtualBox 6.0 Manager is shown below:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/0f/5d/0f5dae0b-25f7-44c7-a974-db0f0cc9d1e7.png&quot; width=&quot;713&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;Note, we now have two additional drives under the SCSI Controller: /dev/sdd and /dev/sde. Each of these drives was set to a size of 10G.&lt;/p&gt;
  &lt;p&gt;Next, I use the command:&lt;/p&gt;
  &lt;p&gt;# zpool add zfspool mirror /dev/sdd /dev/sde&lt;/p&gt;
  &lt;p&gt;which adds the two SCSI drives to the existing zfspool as a second mirror. Listing out the zpool, running the Linux command that displays the existing mount of zfspool and its size, and obtaining a current status of the pool shows that I have effectively expanded the original zfspool to a size of 20G (doubling the usable space). &lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/04/74/0474a8c4-9358-49c3-9f4b-4c3cd42d96a8.png&quot; width=&quot;671&quot; /&gt;
  &lt;/figure&gt;
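  &lt;p&gt;The commands behind the screenshot above follow this general pattern (a sketch of likely invocations; the exact commands are in the image, and these require root and the existing zfspool):&lt;/p&gt;
  &lt;p&gt;# &lt;strong&gt;zpool list&lt;/strong&gt; (overall pool size, now roughly 20G)&lt;br /&gt;# &lt;strong&gt;df -kh /zfspool&lt;/strong&gt; (the mount and usable size as Linux reports it)&lt;br /&gt;# &lt;strong&gt;zpool status -v zfspool&lt;/strong&gt; (both mirror-0 and mirror-1 should show ONLINE)&lt;/p&gt;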
  &lt;p&gt;Now that I&amp;#x27;ve created mirror-0 by combining /dev/sdb and /dev/sdc, as well as mirror-1 by combining /dev/sdd and /dev/sde, attaching more drives to either mirror would only increase redundancy, not capacity. To grow the pool, I add drives by creating additional mirrors.&lt;/p&gt;
  &lt;p&gt;In ZFS, Pools are roughly the equivalent of Disk Volumes in RAID and other disk-combining systems, and Datasets are roughly the equivalent of the data shares in those systems. Thus, in the next step, I create my first dataset in ZFS underneath the zfspool mounted at /zfspool. &lt;/p&gt;
  &lt;p&gt;Step 7: To create a ZFS dataset called data under zfspool, I run the following command, then list out the status of the pool:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/a7/98/a7984236-40f4-4622-a72a-80d8faac66ee.png&quot; width=&quot;771&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;This creates zfspool/data mounted at /zfspool/data automatically and 24k of administrative data was written to the pool to keep track of it. Now, if I wanted to add another ZFS Dataset called apps and add another ZFS Dataset called datapioneer underneath the data ZFS Dataset, I can do this in the Terminal as follows:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/fa/7f/fa7fbe99-4682-4737-9f9d-235b03b21cfa.png&quot; width=&quot;901&quot; /&gt;
  &lt;/figure&gt;
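  &lt;p&gt;For reference, the dataset commands shown in the screenshots above follow this pattern (a sketch of the likely invocations; run as root):&lt;/p&gt;
  &lt;p&gt;# &lt;strong&gt;zfs create zfspool/data&lt;/strong&gt;&lt;br /&gt;# &lt;strong&gt;zfs create zfspool/apps&lt;/strong&gt;&lt;br /&gt;# &lt;strong&gt;zfs create zfspool/data/datapioneer&lt;/strong&gt;&lt;br /&gt;# &lt;strong&gt;zfs list&lt;/strong&gt; (each dataset is auto-mounted under /zfspool)&lt;/p&gt;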
  &lt;p&gt;Step 8: To gain access to the /zfspool/data/datapioneer dataset and permissions to write to it, I can perform the following commands in the Linux Terminal:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/b5/4e/b54e158b-58e4-4f86-97a4-ad6d11c68141.png&quot; width=&quot;1028&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;Listing the storage, directories only, you can see that I now own the /zfspool/data/datapioneer directory, that the group owner of this directory is also datapioneer, and that the permissions are rwx, r-x, and r-x. This works from the Linux Terminal because ZFS includes a POSIX layer in its ZoL implementation. &lt;/p&gt;
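  &lt;p&gt;The ownership and permission step from the screenshot can be sketched as follows, demonstrated here on a stand-in directory under /tmp (the real target is /zfspool/data/datapioneer, and the ownership change must be run as root):&lt;/p&gt;

```shell
# The real ownership change (needs root and the datapioneer user) would be:
#   chown datapioneer:datapioneer /zfspool/data/datapioneer
# Stand-in directory so the permission step can be shown anywhere:
mkdir -p /tmp/zfsdemo/datapioneer

# 755 = rwx for the owner, r-x for group and others
chmod 755 /tmp/zfsdemo/datapioneer

# Show the octal mode and name of the directory
stat -c '%a %n' /tmp/zfsdemo/datapioneer
```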
  &lt;p&gt;&lt;a href=&quot;https://inlinuxveritas.com/SyLKc3ZkL?cda=&quot; target=&quot;_blank&quot;&gt;Accessing The Mirror - Part 2&lt;/a&gt;&lt;/p&gt;

</content></entry><entry><id>datapioneer:SyLKc3ZkL</id><link rel="alternate" type="text/html" href="https://teletype.in/@datapioneer/SyLKc3ZkL?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=datapioneer"></link><title>Accessing an OpenZFS Mirror Created in Debian 10 Linux</title><published>2019-12-26T05:03:29.944Z</published><updated>2019-12-28T22:54:38.269Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://teletype.in/files/36/72/3672ba4d-2fb3-4cb5-a9f4-36b9778271d6.png"></media:thumbnail><category term="blog" label="Blog"></category><summary type="html">&lt;img src=&quot;https://teletype.in/files/36/72/3672ba4d-2fb3-4cb5-a9f4-36b9778271d6.png&quot;&gt;Now that we have created the ZFS Mirrors in the Linux system which point to four other SCSI drives of 10G capacity each, our two existing mirrors of two drives each are theoretically capable of accessing data of at least 20G in size. However, due to the overhead for keeping track of this data in the filesystem, our Linux system show a total data access space of around 19G. </summary><content type="html">
  &lt;p&gt;Now that we have created the ZFS mirrors in the Linux system spanning four SCSI drives of 10G capacity each, our two mirrors of two drives each theoretically provide about 20G of usable space. However, due to the filesystem overhead for keeping track of this data, our Linux system shows a total of around 19G. &lt;/p&gt;
  &lt;p&gt;If I open the Linux Terminal, run a listing of the storage with the ls command, change directory into the datapioneer subdirectory under /zfspool/data, and then run the df -kh command, the output shows the 19G of available space I mentioned above. &lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/36/72/3672ba4d-2fb3-4cb5-a9f4-36b9778271d6.png&quot; width=&quot;1278&quot; /&gt;
  &lt;/figure&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/38/0b/380b09e9-e579-4307-93f6-1f9759f108af.png&quot; width=&quot;855&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;The total available space in the /zfspool/data/datapioneer dataset directory is 19G. This can be seen both in the Terminal and in File Manager, as shown above. Since I changed ownership of this directory to datapioneer and set its permissions to 755, as datapioneer in the Linux system I should be able to create a file in this directory, add some content to it, and save it under /zfspool/data/datapioneer, with a total of 19G of available space in which to do so. This is demonstrated below in the Terminal:&lt;/p&gt;
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/bb/fb/bbfb78b9-423b-40b0-a3b3-daaf4fab8d2a.png&quot; width=&quot;1321&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;Here, a new file called newFile.txt was created containing the text shown in double quotes following the echo command, redirected into the file itself. Opening this file in File Manager gives a clearer picture in the GUI:&lt;/p&gt;
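  &lt;p&gt;The file-creation step can be sketched as follows; the sample text here is illustrative (the actual text is in the screenshot), and a stand-in directory under /tmp takes the place of the real /zfspool/data/datapioneer mountpoint:&lt;/p&gt;

```shell
# Stand-in working directory for the dataset mountpoint
mkdir -p /tmp/zfsdemo/datapioneer
cd /tmp/zfsdemo/datapioneer

# echo the quoted text and redirect it into a new file
echo "Sample text written to the new file." > newFile.txt

# Verify the contents
cat newFile.txt
```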
  &lt;figure class=&quot;m_original&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/77/6b/776b4201-6628-4cf7-ae5a-adc02dd71bf2.png&quot; width=&quot;889&quot; /&gt;
  &lt;/figure&gt;
  &lt;p&gt;Thus, we have created a ZFS pool of four SCSI drives of 10G capacity each, giving us roughly 20G (about 19G usable) of mirrored space spread across all four drives in the Linux system for storing our files, which can also be shared on the network.&lt;/p&gt;
  &lt;p&gt;If, in the future, additional storage space is needed, all that is required is to install the additional SCSI drives, create a new mirror from them, and add it to the ZFS pool, expanding the total space across all drives in the Linux system. This is easily performed under OpenZFS. In this example I used small-capacity SCSI drives, but I could just as easily add 1, 2, 3, or 4TB drives in mirrored pairs to greatly expand the available storage, and doing so, as you can see, is accomplished quite easily with the ZoL implementation in Linux.&lt;/p&gt;
  &lt;p&gt;&lt;a href=&quot;https://inlinuxveritas.com/Sk88mPzyU?cda=&quot; target=&quot;_blank&quot;&gt;Investigating RAIDZ in Debian 10 &amp;quot;Buster&amp;quot; Linux - Part 3&lt;/a&gt;&lt;/p&gt;

</content></entry><entry><id>datapioneer:HkgGuWDsr</id><link rel="alternate" type="text/html" href="https://teletype.in/@datapioneer/HkgGuWDsr?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=datapioneer"></link><title>Personal Writing Blog</title><published>2019-11-11T16:10:47.796Z</published><updated>2019-11-11T16:12:16.807Z</updated><category term="links" label="Links"></category><summary type="html">@ Scripto (Veritas)</summary><content type="html">
  &lt;p&gt;&lt;a href=&quot;https://teletype.in/@scripto&quot; target=&quot;_blank&quot;&gt;@ Scripto (Veritas)&lt;/a&gt;&lt;/p&gt;

</content></entry><entry><id>datapioneer:Hk4eAxCKr</id><link rel="alternate" type="text/html" href="https://teletype.in/@datapioneer/Hk4eAxCKr?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=datapioneer"></link><title>Linux Unix Tech Channel</title><published>2019-10-23T16:21:00.245Z</published><updated>2019-10-23T16:25:31.095Z</updated><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://teletype.in/files/4e/4e3d8f66-8d5f-4063-8822-45e2b0f03c97.png"></media:thumbnail><category term="links" label="Links"></category><summary type="html">&lt;img src=&quot;https://teletype.in/files/4e/4e3d8f66-8d5f-4063-8822-45e2b0f03c97.png&quot;&gt;The link to my YouTube Channel is: https://www.youtube.com/user/dlcalloway/</summary><content type="html">
  &lt;p&gt;The link to my YouTube Channel is: &lt;a href=&quot;https://www.youtube.com/user/dlcalloway/videos?view_as=subscriber&quot; target=&quot;_blank&quot;&gt;https://www.youtube.com/user/dlcalloway/&lt;/a&gt;&lt;/p&gt;
  &lt;figure class=&quot;m_retina&quot;&gt;
    &lt;img src=&quot;https://teletype.in/files/4e/4e3d8f66-8d5f-4063-8822-45e2b0f03c97.png&quot; width=&quot;960&quot; /&gt;
  &lt;/figure&gt;

</content></entry><entry><id>datapioneer:BJoISLXFS</id><link rel="alternate" type="text/html" href="https://teletype.in/@datapioneer/BJoISLXFS?utm_source=teletype&amp;utm_medium=feed_atom&amp;utm_campaign=datapioneer"></link><title>Setting Up a NAS Solution Running in Debian Linux</title><published>2019-10-15T14:26:04.834Z</published><updated>2019-10-15T14:26:04.834Z</updated><category term="blog" label="Blog"></category><summary type="html">Recently, I setup a secondary network-attached storage (NAS) solution at home using a virtual machine rather than a bare metal PC/server as a test platform. The process for setting this up is rather easy and anyone can do it. As I stated here, this is a secondary NAS solution since I already have a 5TB WDMyCloud Personal Cloud which I have had in place in my home running on my LAN now for several years.</summary><content type="html">
  &lt;p&gt;Recently, I set up a secondary network-attached storage (NAS) solution at home using a virtual machine rather than a bare-metal PC or server as a test platform. The process for setting this up is rather easy, and anyone can do it. As I stated, this is a secondary NAS solution, since I already have a 5TB WDMyCloud Personal Cloud that has been running on my home LAN for several years.&lt;/p&gt;
  &lt;p&gt;I have VirtualBox 6.0 Manager running on my Windows 10 Pro main PC, in which I created a virtual environment for an application called OpenMediaVault. This application runs on Debian Linux and can be downloaded from &lt;a href=&quot;https://www.openmediavault.org/?page_id=77&quot; target=&quot;_blank&quot;&gt;OpenMediaVault&amp;#x27;s download page&lt;/a&gt;. There is a link on this page to the ISO images you can select from to get started. In my test case in VirtualBox 6.0 Manager, I selected the 5.0.5 image. If, however, you are installing OpenMediaVault on a Raspberry Pi 3 or 4 device, you will instead want to download one of the Raspberry Pi images, write it to an SD card, and insert the card in the Pi to install and use your Raspberry Pi for this purpose. Since this is a test project for me, so I can see how OpenMediaVault works and decide whether I want to use it as a secondary NAS solution, I chose the former rather than the latter image.&lt;/p&gt;
  &lt;p&gt;Rather than describe all the steps I used to download, install, and configure OpenMediaVault as a VM in VirtualBox 6.0 Manager on my Windows 10 Pro platform, I will instead point you to the video that walks you through the entire process. If you have questions after watching it, please leave a comment below this article and I will attempt to answer them as I get to them.&lt;/p&gt;
  &lt;figure class=&quot;m_column&quot;&gt;
    &lt;iframe src=&quot;https://www.youtube.com/embed/VmMyZj_wpeM?autoplay=0&amp;loop=0&amp;mute=0&quot;&gt;&lt;/iframe&gt;
  &lt;/figure&gt;

</content></entry></feed>