Posts Tagged ‘zfs’

Our open storage partner, Nexenta Systems Inc., hit a milestone this month by releasing NexentaStor 4.0.1 for general availability. This release is significant mainly because it is the first commercial release of NexentaStor based on the open source Illumos kernel rather than Oracle's OpenSolaris (now closed source). With this move, Nexenta is making good on the company's promise of "open source technology" that enables hardware independence and targeted flexibility.

Some highlights in 4.0.1:

Faster Install times

Better HA Cluster failover times and “easier” cluster manageability

Support for large memory host configurations – up to 512GB of DRAM per head/controller

Note, the 18TB Community Edition EULA is still hampered by the "non-commercial" language, restricting its use to home, education and academic (i.e. training, testing, lab, etc.) targets. However, the "total amount of Storage Space" license for Community is a deviation from the Enterprise licensing (typically a "raw" storage entitlement):

2.2 If You have acquired a Community Edition license, the total amount of Storage Space is limited as specified on the Site and is subject to change without notice. The Community Edition may ONLY be used for educational, academic and other non-commercial purposes expressly excluding any commercial usage. The Trial Edition licenses may ONLY be used for the sole purposes of evaluating the suitability of the Product for licensing of the Enterprise Edition for a fee. If You have obtained the Product under discounted educational pricing, You are only permitted to use the Product for educational and academic purposes only and such license expressly excludes any commercial purposes.

– NexentaStor EULA, Version 4.0; Last updated: March 18, 2014

For those who operate under the Community license, this means your total physical storage is UNLIMITED, provided your space "IN USE" stays below 18TB (18,432 GB) at all times. This matters when constructing useful arrays from currently available disks (SATA, SAS, etc.). Say you needed 16TB of AVAILABLE space using "modern" 3TB disks. Because those disks are individually larger than 600GB, rebuild times are long enough that a second failure (and data loss) could occur before a rebuild completes, so a mirror or raidz2/raidz3 layout would be your best bet for array configuration.

SOLORI Note: Richard Elling made this concept exceedingly clear back in 2010, and his “ZFS data protection comparison” of 2, 3 and 4-way mirrors to raidz, raidz2 and raidz3 is still a great reference on the topic.

By “raw” licensing standards, the 3-way mirror would require a 76TB license while the raidz2 volume would require a 51TB license – a difference of 25TB in licensing (around $5,300 retail). However, under the Community License, the “cost” is exactly the same, allowing for a considerable amount of flexibility in array loadout and configuration.

Why do I need 54TiB in disk to make 16TB of “AVAILABLE” storage in Raidz2?

The RAID grouping we've chosen is 6-disk raidz2 – akin to 4 data and 2 parity disks in RAID6 (without the fixed stripe requirement or the "write hole penalty"). This means, on average, one third of the space consumed on-disk will be parity information, so right off the top we lose 33% of the disk capacity. Likewise, disk manufacturers sell decimal (TB) rather than binary (TiB) capacities, so roughly another 7% of "capacity" is lost in the TB-to-TiB conversion. Additionally, we like to keep a healthy amount of space reserved for new block allocation and recommend 30% unused space as a target. All combined, a 6-disk raidz2 array is, at best, 43% efficient in terms of capacity (by contrast, a 3-way mirror is only about 22% space efficient). For an array based on 3TB disks, we therefore get only about 1.3TB of usable storage – per disk – with 6-disk raidz2 (by contrast, a wider 8-disk raidz2 nets only about 160GB of additional "usable" space per disk).
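If you want to sanity-check the per-disk figure, the factors above (4/6 data fraction, 0.93 for the TB-to-TiB conversion, 0.70 for the free-space reserve) multiply out from any shell; the numbers simply restate the assumptions in the previous paragraph:

# echo "3 * (4/6) * 0.93 * 0.70" | bc -l

The result works out to roughly 1.3TB of usable space per 3TB disk in a 6-disk raidz2 group.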

SOLORI's Take: If you're running 3.x in production, 4.0.1 is not suitable for in-place upgrades (yet) so testing and waiting for the "non-disruptive" maintenance release is your best option. For new installations – especially inside a VM or hypervisor environment as a Virtual Storage Appliance (VSA) – version 4.0.1 is a better option than its 3.x siblings. If you're familiar with 3.x, there's not much new on the NMV side outside of better tunables and snappier response.

Following up on the last installment of managing CIFS shares, there have been a considerable number of questions about how to establish domain user rights on the share. From these questions it is apparent that my explanation of root-level share permissions could have been clearer. To that end, I want to look at default shares from a Windows SBS Server 2008 R2 environment and translate those settings to a working NexentaStor CIFS share deployment.

Evaluating Default Shares

In SBS Server 2008, a number of default shares are promulgated from the SBS Server. Excluding the “hidden” shares, these include:

Address

ExchangeOAB

NETLOGON

Public

RedirectedFolders

SYSVOL

UserShares

Printers

Therefore, it follows that a useful exercise in rights deployment might be to recreate a couple of these shares on a NexentaStor system and detail the methodology. I have chosen the NETLOGON and SYSVOL shares as these two represent default shares common to all Windows server environments. Here are their respective permissions:

NETLOGON

From the Windows file browser, the NETLOGON share has default permissions that look like this:

Looking at this same permission set from the command line (ICACLS.EXE), the permissions look like this:
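If you want to pull the same listing yourself, a command along these lines should work from any domain workstation (the server name here is a placeholder):

C:\> icacls \\SBS-SERVER\NETLOGON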

The key thing to observe here is the use of Windows built-in users and NT Authority accounts. It is also noteworthy that some administrative privileges differ depending on inheritance. For instance, the Administrators group's rights are less than "Full" on the share itself; however, they are "Full" when inherited by sub-directories and files, whereas SYSTEM's permissions are "Full" in both contexts.

SYSVOL

From the Windows file browser, the SYSVOL share has default permissions that look like this:

Looking at this same permission set from the command line (ICACLS.EXE), the permissions look like this:

Note that the Administrators group's privileges are truncated (not "Full") with respect to the rights inherited by sub-directories and files when compared to the NETLOGON share ACL.

Create CIFS Shares in NexentaStor

On a ZFS pool, create a new folder using the Web GUI (NMV) that will represent the SYSVOL share. This will look something like the following:
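For reference, the (unsupported) root-shell equivalent of creating such a folder is a plain dataset create with CIFS-friendly properties; the pool and folder names here are only examples, and the property choices are assumptions rather than NMV's exact defaults:

# zfs create -o casesensitivity=mixed -o nbmand=on pool0/sysvol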

In this In-the-Lab segment we’re going to look at how to recover from a failed ZFS version update in case you’ve become ambitious with your NexentaStor installation after the last Short-Take on ZFS/ZPOOL versions. If you used the “root shell” to make those changes, chances are your grub is failing after reboot. If so, this blog can help, but before you read on, observe this necessary disclaimer:

NexentaStor is an appliance operating system, not a general purpose one. The accepted way to manage the system volume is through the NMC shell and NMV web interface. Using a “root shell” to configure the file system(s) is unsupported and may void your support agreement(s) and/or license(s).

That said, let’s assume that you updated the syspool filesystem and zpool to the latest versions using the “root shell” instead of the NMC (i.e. following a system update where zfs and zpool warnings declare that your pool and filesystems are too old, etc.) In such a case, the resulting syspool will not be bootable until you update grub (this happens automagically when you use the NMC commands.) When this happens, you’re greeted with the following boot prompt:

grub>

Grub is now telling you that it has no idea how to boot your NexentaStor OS. Chances are there are two things that will need to happen before your system boots again:

The boot archive needs to be rebuilt against your intended boot checkpoint

Grub needs to be re-installed to the disk(s) in the syspool

We’ll update both in the same recovery session to save time (this assumes you know or have a rough idea about your intended boot checkpoint – it is usually the highest numbered rootfs-nmu-NNN checkpoint, where NNN is a three digit number.) The first step is to load the recovery console. This could have been done from the “Safe Mode” boot menu option if grub was still active. However, since grub is blown-away, we’ll boot from the latest NexentaStor CD and select the recovery option from the menu.

Import the syspool

Then, we log in as "root" (empty password.) From this "root shell" we can import the existing (disks connected to active controllers) syspool with the following command:

# zpool import -f syspool

Note the use of the "-f" flag to force the import of the pool. Chances are, the pool will not have been "destroyed" or "exported," so zpool will "think" the pool belongs to another system (your boot system, not the rescue system). As a precaution, zpool assumes that the pool is still "in use" by the "other system" and rejects the import to avoid "importing an imported pool," which would be completely catastrophic.

With the syspool imported, we need to mount the correct (latest) checkpointed filesystem as our boot reference for grub, destroy the local zpool.cache file (in case the pool disks have been moved, but are still all there), update the boot archive to correspond to the mounted checkpoint, and install grub to the disk(s) in the pool (i.e. each mirror member).

List the Checkpoints

# zfs list -r syspool

From the resulting list, we’ll pick our highest-numbered checkpoint; for the sake of this article let’s say it’s “rootfs-nmu-013” and mount it.
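Putting those intermediate steps together from the rescue shell, a hedged sketch looks like this (the /tmp/syspool mountpoint matches the unmount step below, the checkpoint name follows the example above, and whether your checkpoint mounts with a plain "mount -F zfs" or needs "zfs mount" is something to verify on your own system):

# mkdir -p /tmp/syspool
# mount -F zfs syspool/rootfs-nmu-013 /tmp/syspool
# rm -f /tmp/syspool/etc/zfs/zpool.cache
# bootadm update-archive -R /tmp/syspool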

Install Grub to Each Mirror Disk
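With the checkpoint mounted and the boot archive refreshed, grub can be re-installed using the stage files from the mounted checkpoint. A minimal sketch, assuming a two-way syspool mirror on c1t0d0 and c1t1d0 (your device names will almost certainly differ):

# installgrub /tmp/syspool/boot/grub/stage1 /tmp/syspool/boot/grub/stage2 /dev/rdsk/c1t0d0s0
# installgrub /tmp/syspool/boot/grub/stage1 /tmp/syspool/boot/grub/stage2 /dev/rdsk/c1t1d0s0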

Unmount and Reboot

# umount /tmp/syspool
# sync
# reboot

Now, the system should be restored to a bootable configuration based on the selected system checkpoint. A similar procedure can be found on Nexenta’s site when using the “Safe Mode” boot option. If you follow that process, you’ll quickly encounter an error – likely intentional and meant to elicit a call to support for help. See if you can spot the step…

Jeff Bonwick’s last day at Oracle may be September 30, 2010 after two decades with Sun, but his contributions to ZFS and Solaris will live on through Oracle and open source storage for decades to come. In 2007, Bill Moore, Jeff Bonwick (co-founders of ZFS) and Pawel Jakub Dawidek (ported ZFS to FreeBSD) were interviewed by David Brown for the Association for Computing Machinery and discussed the future of file systems. The discussion gave good insights into the visionary thinking behind ZFS and how the designers set out to solve problems that would plague future storage systems.

One thing that has changed, as Bill already mentioned, is that the error rates have remained constant, yet the amount of data and the I/O bandwidths have gone up tremendously. Back when we added large file support to Solaris 2.6, creating a one-terabyte file was a big deal. It took a week and an awful lot of disks to create this file.

Now for comparison, take a look at, say, Greenplum’s database software, which is based on Solaris and ZFS. Greenplum has created a data-warehousing appliance consisting of a rack of 10 Thumpers (SunFire x4500s). They can scan data at a rate of one terabyte per minute. That’s a whole different deal. Now if you’re getting an uncorrectable error occurring once every 10 to 20 terabytes, that’s once every 10 to 20 minutes—which is pretty bad, actually.

But it’s quotes like this from Jeff’s blog in 2007 that really resonate with my experience:

Custom interconnects can’t keep up with Ethernet. In the time that Fibre Channel went from 1Gb to 4Gb — a factor of 4 — Ethernet went from 10Mb to 10Gb — a factor of 1000. That SAN is just slowing you down.

Today’s world of array products running custom firmware on custom RAID controllers on a Fibre Channel SAN is in for massive disruption. It will be replaced by intelligent storage servers, built from commodity hardware, running an open operating system, speaking over the real network.

My old business partner, Craig White, philosopher and network architect at BT, let me in on that secret back in the late '90s. At the time I was spreading Ethernet across a small city while Craig was off to Level3 – spreading gigabit Ethernet across entire continents. He made it clear to me that Ethernet – in its simplicity and utility – was like the loyal mutt that never let you down and always rose to meet a fight. Betting against Ethernet's domination as an interconnect was like betting against the house: ultimately a losing proposition. While there will always be room for exotic interconnects, the remaining 95% of the market will look to Ethernet. Look up "ubiquity" in the dictionary – it's right there next to Ethernet, and it's come a long way since it first appeared on Bob Metcalfe's napkin in '73.

Looking back at Jeff's Sun blog, it's pretty clear that Sun's "near-death experience" had the same profound effect on his thinking; and perhaps that change made him ultimately incompatible with the Oracle culture. I doubt a culture that embraces the voracious acquisition and marketing posture of former HP CEO Mark Hurd would likewise embrace the unknown risk and intangible reward framework of openness.

In each case, asking the question with a truly open mind changed the answer. We killed our more-of-the-same SPARC roadmap and went multi-core, multi-thread, and low-power instead. We started building AMD and Intel systems. We launched a wave of innovation in Solaris (DTrace, ZFS, zones, FMA, SMF, FireEngine, CrossBow) and open-sourced all of it. We started supporting Linux and Windows. And most recently, we open-sourced Java. In short, we changed just about everything. Including, over time, the culture.

Still, there was no guarantee that open-sourcing Solaris would change anything. It’s that same nagging fear you have the first time you throw a party: what if nobody comes? But in fact, it changed everything: the level of interest, the rate of adoption, the pace of communication. Most significantly, it changed the way we do development. It’s not just the code that’s open, but the entire development process. And that, in turn, is attracting developers and ISVs whom we couldn’t even have spoken to a few years ago. The openness permits us to have the conversation; the technology makes the conversation interesting.

This lesson, I fear, cannot be unlearned, and perhaps that's a good thing. There's a side to an engineer's creation that goes way beyond profit and loss, schedules and deadlines, or success and failure. This side probably fits better in the subjective realm of the arts than the objective realm of engineering and capitalism. It's where inspiration and disruptive ideas abide. Reading Bonwick's "farewell" posting, it's clear that the inspirational road ahead has more allure than recidivism at Oracle. I'll leave it in his words:

The early summer storms have taken their toll on Alabama, and UPS failures (and shortfalls) have been popping up all over. Add consolidated, shared storage to the equation and you have a recipe for potential data loss – at least this is what we've been seeing recently. Add JBODs with separate power rails, limited UPS runtime and/or no generator backup, and the risk only grows.

Even with ZFS pools, data integrity in a power event cannot be guaranteed – especially when employing "desktop" drives and RAID controllers with RAM cache and no BBU (or perhaps a "bad storage admin" who has managed to disable the ZIL). When this happens, NexentaStor (and other ZFS storage devices) may even show all members of the ZFS pool as "ONLINE" as if they are awaiting proper import. However, when an import is attempted (either automatically on reboot or manually), the pool fails to import.

From the command line, the suspect pool’s status might look like this:
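The command behind that status check is a bare import scan from the root shell; "pool0" is the pool name used in the rest of this example:

# zpool import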

Not good. This probably indicates that something is not right with the array. Let’s try to force the import and see what happens:
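The forced attempt is the same import with the "-f" flag added:

# zpool import -f pool0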

Nope. Now this is the point where most people start to get nervous, their neck tightens up a bit and they begin to flip through a mental calendar of backup schedules and catalog of backup repositories – I know I do. However, it's the next one that makes most administrators really nervous when trying to "force" the import:

In this case, something must have happened to corrupt metadata – perhaps the non-BBU cache on the RAID device when power failed. Expensive lesson learned? Not yet. The ZFS file system still presents you with options, namely “acceptable data loss” for the period of time accounted for in the RAID controller’s cache. Since ZFS writes data in transaction groups and transaction groups normally commit in 20-30 second intervals, that RAID controller’s lack of BBU puts some or all of that pending group at risk. Here’s how to tell by testing the forced import as if data loss was allowed:

root@NexentaStor:~# zpool import -nfF pool0
Would be able to return data to its state as of Fri May 7 10:14:32 2010.
Would discard approximately 30 seconds of transactions.

If the first output is acceptable, then proceeding without the "n" option will produce the desired effect, "rewinding" (read: ignoring) the last couple of transaction groups and importing the "truncated" pool. The import will report the approximate number of seconds' worth of data that cannot be restored. Depending on the bandwidth and utilization of your system, this could be very little data or several MB worth of transactions.
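In practice, that means re-running the recovery import without the dry-run switch (again, "pool0" is this example's pool name):

# zpool import -fF pool0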

What to do about the second option? From the man pages on “zpool import” Sun/Oracle says the following:

Imports all pools found in the search directories. Identical to the previous command, except that all pools with a sufficient number of devices available are imported. Destroyed pools, pools that were previously destroyed with the "zpool destroy" command, will not be imported unless the -D option is specified.

-o mntopts

Comma-separated list of mount options to use when mounting datasets within the pool. See zfs(1M) for a description of dataset properties and mount options.

-o property=value

Sets the specified property on the imported pool. See the “Properties” section for more information on the available pool properties.

-c cachefile

Reads configuration from the given cachefile that was created with the “cachefile” pool property. This cachefile is used instead of searching for devices.

-d dir

Searches for devices or files in dir. The -d option can be specified multiple times. This option is incompatible with the -c option.

-D

Imports destroyed pools only. The -f option is also required.

-f

Forces import, even if the pool appears to be potentially active.

-F

Recovery mode for a non-importable pool. Attempt to return the pool to an importable state by discarding the last few transactions. Not all damaged pools can be recovered by using this option. If successful, the data from the discarded transactions is irretrievably lost. This option is ignored if the pool is importable or already imported.

-a

Searches for and imports all pools found.

-R root

Sets the “cachefile” property to “none” and the “altroot” property to “root”.

-n

Used with the -F recovery option. Determines whether a non-importable pool can be made importable again, but does not actually perform the pool recovery. For more details about pool recovery mode, see the -F option, above.

No real help here. What the documentation omits is the "-X" option. This option is only valid with the "-F" recovery mode setting, however it is NOT well documented; suffice it to say it is the last resort before acquiescing to real problem solving… Assuming the standard recovery mode "depth" of transaction replay is not quite enough to get you over the hump, the "-X" option gives you an "extended replay," seemingly providing a scrub-like search back through the transaction groups (read: potentially time consuming) until it arrives at the last reliable transaction group in the dataset.
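Combined with the flags already used above, the last-resort invocation looks like this; be prepared for it to take a very long time on a large pool:

# zpool import -fFX pool0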

Lessons to be learned from this excursion into pool recovery are as follows:

The data integrity functions in ZFS are solid when used appropriately.

When architecting your HOME/SOHO/SMB NAS appliance, pay attention to the hidden risks of "promised performance" that may walk you down the plank towards a tape backup (or resume writing) event. Better to leave the 5-15% performance benefit on the table, or purchase adequate BBU/UPS/generator resources to support your system in worst-case events.

In complex environments, a pending power loss can be properly mitigated through management supervisors and clever scripts: turning down resources in advance of total failure.

How valuable is your data?

Sharing folders in NexentaStor is pretty easy in Workgroup mode, but Active Directory integration takes a few extra steps. Unfortunately, it's not (yet) as easy as point-and-click, but it doesn't have to be too difficult either. (The following assumes/requires that the NexentaStor appliance has been correctly configured in and joined to Active Directory.)

Typical user and group permissions for a local hard disk in Windows.

Let’s examine the case where a domain admin group will have “Full Control” of the share, and “Everyone” will have read/execute permissions. This is a typical use case where a single share contains multiple user directories under administrative control. It’s the same configuration as local disks in a Windows environment. For our example, we’re going to mimic this setup using a CIFS share from a NexentaStor CE appliance and create the basic ACL to allow for Windows AD control.

For this process to work, we need to join the NexentaStor appliance to the Active Directory Domain. The best practice is to create the machine account in AD first, assign control user/group rights (if possible) and then attempt to join. It is IMPORTANT that the host name and DNS configuration of the NexentaStor appliance match domain norms, or things will come crashing to a halt pretty quickly.

That said, assuming that your DC is 1.1.1.1 and your BDC is 1.1.1.2, with a "short" domain of "SOLORI" and an FQDN of "SOLORI.MSFT", your NexentaStor's name server configuration (Settings->Network->Name Servers) would look something like this:

This is important because the AD queries will pull service records from the configured domain name servers. If these point to an “Internet” DNS server, the AD entries may not be reflected in that server’s database and AD authentication (as well as join) will fail.

The other way the NexentaStor appliance knows what AD Domain to look into is by its own host name. For AD authentication to work properly, the NexentaStor host name must reflect the AD domain. For example, if the FQDN of your AD domain is “SOLORI.MSFT” then your domain name on the appliance would be configured like this (Appliance->Basic Settings->Domainname):

The next step is to create the machine account in AD using “Active Directory Users and Computers” administrator’s configuration tool. Find your domain folder and right-click “Computers” – select New->Computer from the menu and enter the computer name (no domain). The default user group assigned to administrative control should be Domain Admins. Since this works for our example, no changes are necessary so click “OK” to complete.

Now it's time to join the AD domain from NexentaStor. Any user with permissions to join a machine to the domain will do. Armed with that information, drill down to Data Management->Shares->CIFS Server->Join AD/DNS Server and enter the AD/DNS server, AD user and user password into the configuration box:

If your permissions and credentials are good, your NexentaStor appliance is now a member of your domain. As such, it can now identify AD users and groups by unique gid and uid data created from AD. This gid and uid information will be used to create our ACLs for the CIFS share.

To uncover the gid for the “Domain Admins” and “Domain Users” groups, we issue the following from the NexentaStor NMC (CLI):

nmc@san01:/$ idmap dump -n | grep "Domain Admins"

wingroup:Domain Admins@solori.msft == gid:3036392745

nmc@san01:/$ idmap dump -n | grep "Domain Users"

wingroup:Domain Users@solori.msft == gid:1238392562

Now we can construct a CIFS share (with anonymous read/write disabled) and apply the Domain Admin gid to an ACL – just click on the share, and then click “(+) Add Permissions for Group”:

Applying administrative permissions with the AD group ID for Domain Admins.

We do similarly with the Domain User gid:

Applying the Domain User gid to CIFS share ACL.

Note that the “Domain Users” group gets only “execute” and “read” permissions while the “Domain Admins” group gets full control – just like the local disk! Now, with CIFS sharing enabled and the ACL suited to our AD authentication, we can access the share from any domain machine provided our user is in the Domain Users or Admins group.

Administrators can now create "personal" folders and assign detailed user rights just as they would do with any shared storage device. The only trick is in creating the initial ACL for the CIFS share – as above – and you've successfully integrated your NexentaStor appliance into your AD domain.

NOTE: If you’re running Windows Server 2008 (or SBS 2008) as your AD controller, you will need to update the share mode prior to joining the domain using the following command (from root CLI):

# sharectl set -p lmauth_level=2 smb
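To double-check that the setting stuck, sharectl can read the property back:

# sharectl get -p lmauth_level smb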

NOTE: I've also noticed that, upon reboot of the appliance (i.e. after a major update of the kernel/modules), ephemeral ID mapping takes some time to populate, during which time authentication to CIFS shares can fail. This appears to have something to do with the state of ephemeral-to-SID mapping after reboot.

Yesterday Jeff Bonwick (Sun) announced that deduplication is now officially part of ZFS – Sun’s Zettabyte File System that is at the heart of Sun’s Unified Storage platform and NexentaStor. In his post, Jeff touched on the major issues surrounding deduplication in ZFS:

Deduplication in ZFS is Block-level

ZFS provides block-level deduplication because this is the finest granularity that makes sense for a general-purpose storage system. Block-level dedup also maps naturally to ZFS’s 256-bit block checksums, which provide unique block signatures for all blocks in a storage pool as long as the checksum function is cryptographically strong (e.g. SHA256).

Deduplication in ZFS is Synchronous

ZFS assumes a highly multithreaded operating system (Solaris) and a hardware environment in which CPU cycles (GHz times cores times sockets) are proliferating much faster than I/O. This has been the general trend for the last twenty years, and the underlying physics suggests that it will continue.

Deduplication in ZFS is Per-Dataset

Like all zfs properties, the ‘dedup’ property follows the usual rules for ZFS dataset property inheritance. Thus, even though deduplication has pool-wide scope, you can opt in or opt out on a per-dataset basis. Most storage environments contain a mix of data that is mostly unique and data that is mostly replicated. ZFS deduplication is per-dataset, which means you can selectively enable dedup only where it is likely to help.

Deduplication in ZFS is based on a SHA256 Hash

Chunks of data — files, blocks, or byte ranges — are checksummed using some hash function that uniquely identifies data with very high probability. When using a secure hash like SHA256, the probability of a hash collision is about 2^-256 = 10^-77. For reference, this is 50 orders of magnitude less likely than an undetected, uncorrected ECC memory error on the most reliable hardware you can buy.

Deduplication in ZFS can be Verified

[If you are paranoid about potential "hash collisions"] ZFS provides a 'verify' option that performs a full comparison of every incoming block with any alleged duplicate to ensure that they really are the same, and ZFS resolves the conflict if not.

Deduplication in ZFS is Scalable

ZFS places no restrictions on your ability to dedup. You can dedup a petabyte if you're so inclined. The performance of ZFS dedup will follow the obvious trajectory: it will be fastest when the DDTs (dedup tables) fit in memory, a little slower when they spill over into the L2ARC, and much slower when they have to be read from disk — but the point I want to emphasize here is that there are no limits in ZFS dedup. ZFS dedup scales to any capacity on any platform, even a laptop; it just goes faster as you give it more hardware.

What does this mean for ZFS users? That depends on the application, but highly duplicated environments like virtualization stand to gain significant storage-related value from this small addition to ZFS. Considering the various ways virtualization administrators deal with virtual machine cloning, even the basic VMware template approach (not using linked-clones) will now result in significant storage savings. This restores parity between storage and compute in the virtualization stack.
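Since the 'dedup' property is per-dataset, opting a VM datastore in (or adding on-write verification for the extra-paranoid) should amount to a one-line property change once the bits ship; the pool and dataset names here are hypothetical:

# zfs set dedup=on pool0/vmstore
# zfs set dedup=sha256,verify pool0/vmstore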

What does it mean for ZFS-based storage vendors? More main memory and processor threads will be necessary to limit the impact on performance. With 6-core and 8-thread CPUs available in the mainstream, this problem is very easily resolved. Just like the L2ARC tables consume main memory, the DDTs will require an increase in main memory for larger datasets. Testing and configuration convergence will likely take 2-3 months once dedupe is mainstream.

When can we expect to see dedupe added to ZFS (i.e. OpenSolaris)? According to Jeff, “in roughly a month.”
