WHAT IS OpenGFS?
The Open Global File System (OpenGFS, or OGFS) is a journaled filesystem that
supports simultaneous sharing of a common storage device by multiple computer
nodes. This is one way to implement a "clustered file system".
As an example, consider a cluster of 3 computers that share a single array
of Fibre Channel (FC) drives. Each of the 3 computers can directly access the
drives (perhaps via an FC hub or switch), and can treat the whole array much
as if it had sole access. OpenGFS coordinates the storage accesses so that
the different computers don't clobber (overwrite) each other's data, while
providing simultaneous read access for sharing data among the computers.
                       __________
                  ____| computer |\
     lock        |    |__________| \       _________
  coordination ->|                  \     |         |
     via LAN     |     __________    \    |         |
                 |____| computer |--------| Storage |
                 |    |__________|   /    |  array  |
                 |                  /     |         |
                 |     __________  /      |_________|
                 |____| computer |/
                      |__________|
How is this different from the Network File System (NFS)? With NFS, all
storage access goes through the NFS server; the client computers have no
direct access to the storage device. This adds network communication
overhead to every store and retrieve operation.
OpenGFS provides all of the components necessary for clustered operation. The
original code was written before many of these components were commonly
available, either within the Linux kernel, or from other open source projects.
Therefore, the original authors provided *everything* needed for clustering.
Since the original authoring, many of these components have become available
from other sources, and a lot of the current work on OpenGFS involves stripping
away redundant functionality, and taking advantage of the newly available
components.
As an example, we recently (Spring '03) "liberated" OpenGFS from relying on the
OpenGFS "pool" kernel module and user-space utilities. This component provided
volume management and device mapping for OpenGFS. Now, OpenGFS can use
virtually any volume manager (preferably cluster-aware, such as the
Enterprise Volume Management System, EVMS) and/or device mapper (e.g. the
Linux kernel's device-mapper, DM).
We have also recently (Fall '03) added the OpenDLM lock module. This locking
protocol is more efficient than OpenGFS' memexp protocol, and avoids the
single point of failure (SPOF) exhibited by memexp's lock storage server.
Here's a high level block diagram of OpenGFS and its locking support. Note
that the lock module (a.k.a. locking backend) attaches to OpenGFS via a
plug-in interface. Refer to this diagram when reading the next section.
 Inter-node
  (other      support            ogfs.o        Linux           (other
  nodes)    _____________    ______________    _________        nodes)
    |      |    Lock     |  |    |         |  |         |     ___________
    |______|    Module   |--> >--|         |__|   VFS   |    |           |
    |      |             |  | g  | file-   |  |_________|    |           |
    |      |_____________|  | -  | system  |   _________   __|  Shared   |
 LAN|       _____________   | l  | and     |  |  BlkIO  | |  |  Storage  |
    |      |   Cluster   |  | o  | journal |__|         | |  |           |
    |______|   Manager   |  | c  |         |  |  Driver |-|  |___________|
    |      |_____________|  | k  |         |  |_________|
  (other                    |____|_________|            (other
   nodes)                                                nodes)
WHAT ARE THE ELEMENTS OF CLUSTERING?
Sharing and journaling storage among a cluster of computers relies on:
1). A consistent view of storage from all computers. /dev names and sizes,
and file names and attributes must appear identical to each computer.
As a "nice-to-have" feature, it may be desirable to map several hardware
storage devices to appear as a single large filesystem device. These
features have been provided by the OpenGFS "pool" volume manager, and may
now be provided by other volume managers.
2). Locking services, so that two computers don't try to write to the same
storage location at the same time, and one computer doesn't try to read
a file while another is modifying it. OpenGFS has an embedded locking
system, and a locking harness that allows the use of different lock
protocols (e.g. OpenGFS' "memexp" and "nolock" protocols).
The memexp protocol uses the local area network (LAN) to communicate among
the cluster's computers. The recently added OpenDLM protocol also uses the
LAN, and is more efficient than memexp. It also avoids memexp's single
point of failure (the lock storage server).
3). Individual journals for each computer. Journals allow recovery to a
consistent state when a computer node crashes in the middle of a
write operation. Independent journals for each computer provide
isolation so that one computer does not corrupt the journal of another,
no locking conflicts occur for journal writes, and it's easier to play
back a single computer's journal if it dies.
Journals may be "internal" (within the filesystem device) or, with recent
changes, "external" (stored on a shared device other than the filesystem
device).
The journaling provided in OpenGFS protects only the filesystem metadata
(not file data itself), so data within a file may get corrupted by a
crash, but the overall filesystem layout will remain intact.
4). Cluster membership services, to assign a journal to a given computer
as it joins the cluster, to let the locking service know about other
computers in the cluster, and to trigger journal and locking recovery
operations when a computer leaves the cluster unexpectedly (dies).
This service is only loosely coupled with the OGFS filesystem code, which
responds to it by way of the locking service interface. The membership
service may be implemented in different ways for different locking
protocols.
5). Fencing, or "Shoot The Other Machine In The Head (STOMITH)", to protect
the filesystem from corruption by a computer node that dies or goes
crazy. Fencing isolates the computer from the storage device, while
STOMITH powers-down or reboots the computer. OpenGFS provides several
methods for doing this using a variety of hardware devices (e.g. power
switches, fibre channel isolation switches), or human intervention
("meatware"). The memexp locking protocol provides hooks for triggering
these:
APC Masterswitch power switch (STOMITH)
WTI NPS network power switch (STOMITH)
VACM Nexxus (serial port IPMI?) (STOMITH)
Brocade Fibre Channel Switch, port and zone based (Fencing)
Vixel Fibre Channel Hub (Fencing)
This service is quite loosely coupled with the OGFS filesystem code,
which really has no idea that the STOMITH features exist. The cluster
management/membership service is responsible for triggering STOMITH.
There is no requirement to use memexp as the cluster manager.
6). Resizing of the filesystem. This is not a strict requirement, but is
a "nice-to-have" that's often expected in a clustered environment.
OpenGFS user-space utilities provide for expanding the filesystem data
storage space (ogfs_expand, by writing new resource group headers to disk),
and for adding new journals (ogfs_jadd, by writing new journal headers).
OpenGFS does not currently provide for shrinking storage or removing
journals. You will need to use a volume manager (e.g. pool or, preferably,
EVMS) to expand the filesystem *device* size before adding new OpenGFS
data storage space or internal journals.
WHAT IS IN THE CVS TREE?
The CVS tree and distribution tarballs contain the code and documentation for
kernel space modules and user-space utilities for all OpenGFS features.
The build process uses autoconf/automake to configure the build to your
computer and your preferences. See opengfs/docs/HOWTO-generic or HOWTO-nopool
for more information. All code, both kernel-space and user-space, gets built
with a single "make" command.
The code currently maintained and in use exists in the opengfs directory within
CVS. One other directory, gnbd, contains deprecated code that implements
"an OGFS-friendly network block device".
Within the CVS opengfs directory, several directories are empty; they exist
to support autoconf/automake during the build. The other directories
(in alphabetical order) contain the docs and source:
1). docs: Documentation on usage and design. Many of these are
mirrored on the OpenGFS web site's Docs/Info page, as well.
2). kernel_patches: Currently, OpenGFS requires a relatively small patch
to the 2.4 series of Linux kernels (2.5 kernels are not yet supported).
3). man: man pages for user-space utilities, and for mount options.
You may view these pages using the "man" command, even without
building/installing OpenGFS.
4). scripts: A variety of scripts for a variety of purposes,
including applying patches, creating .h files for supporting debug
modes in the filesystem code, setting up the EVMS volume manager,
installing OGFS in Debian and Redhat environments, etc. etc. etc.
Some of these are old and unmaintained. Your mileage may vary.
5). src: Source for all user-space and kernel space code. See below
for a tour of the subdirectories.
opengfs/src subdirectories, in alphabetical order:
1). divdi: Architecture-specific division support (boring)
2). fs: Filesystem code. The heart of OpenGFS. This is the kernel module
ogfs.o. Includes basic filesystem operation, journaling, locking
(the part within the filesystem). Architecture-specific subdirectories
include support for 2.2 and 2.4 kernels, and user space (used by certain
user space utilities, e.g. ogfs_jadd).
3). gnbd: Deprecated "GFS-friendly network block device" code.
4). include: Some .h files with global relevance (user-space and kernel).
5). locking: Kernel modules (other than ogfs.o) that support locking.
Also, user-space memexpd lock server daemon. See below for breakout
of subdirectories.
6). pool: Kernel module, pool.o, for "pool" device mapper. User-space
utilities for pool are in the opengfs/src/tools/ptools directory. Pool
has a few rough edges, and with recent changes to the CVS tree, we are
moving away from it, trying other volume managers / device mappers
instead (e.g. EVMS/DM).
7). stomith: Support for fencing, both kernel and user-space. "agents"
subdirectory contains support for various isolation methods. "daemon"
subdirectory contains user-space stomithd daemon that invokes methods when
needed. "module" subdirectory contains kernel module stomith.o, which
provides support for kernel components to communicate with the daemon.
In the future, we would like to relegate this stomith functionality to a
separate, non-OGFS, cluster manager.
8). tools: User-space utilities for a variety of purposes. See below for
breakout of subdirectories.
opengfs/src/locking subdirectories, in alphabetical order:
1). harness: Kernel module lock_harness.o. Registers available lock protocol
implementation modules, and connects one of them to the filesystem at
mount time.
2). modules: Lock protocol implementation modules. Includes subdirectories
for memexp (clustered), opendlm (clustered), nolock (non-clustered), and
stats (stacks on top of another) protocols.
3). servers: The memexpd user-space daemon, the central lock storage server
for the memexp locking protocol. Uses memory or disk-based storage.
opengfs/src/tools subdirectories contain user-space utilities. Many of these
have man pages in the opengfs/man directory. In alphabetical order:
1). dmep_tools: Utilities dmep_conf and do_dmep for Direct Memory Export
Protocol (DMEP) hardware-based lock storage support for memexp. DMEP
support exists within memexp and pool, but is not being maintained, as we
don't know anyone who is using it.
2). hexedit: Editor for binary files. Seems a little buggy; the hexedit in
the RedHat distro works better.
3). initds: Initialize Disk Store utility. Prepares disk-based storage area
for memexpd lock storage server.
4). mangle_fest: A whole bunch of test and stress utilities for OpenGFS.
5). mkfs: Makes the filesystem on disk by writing the superblock, data
resource group headers, journal headers, and any other needed metadata
to disk.
6). ogfsck: Filesystem check utility.
7). ogfsconf: Writes the cluster configuration onto a cluster information
device (cidev). Locking and journaling use this information to determine
which machines (IP addresses) are potential members of the cluster.
8). ogfs_expand: Adds data resource group headers to unused device space,
to increase data storage capacity of existing filesystem.
9). ogfs_jadd: Adds new journals to existing filesystem.
10). ogfs_tool: Debugging and statistics tool for communicating with
filesystem during operation.
11). ptools: Utilities for creating, enlarging, and reading pool configuration.
12). test_dmep: Test utility for DMEP (hardware-based locking) device (?)
13). test_mmap:
14). ucmemexp: User-space client for memexp locking protocol.
PROJECT HISTORY:
OpenGFS has its roots in the GFS project, which was originally sponsored by
the University of Minnesota from 1995-2000 (according to copyright notices
in the code). Around 2000, U. of M. professor Matthew O'Keefe founded Sistina
Software based on the U. of M. research work. Sistina kept the project open
source through GFS 4.x, but decided to take GFS proprietary around 2001,
and GFS is now a commercial product (and we wish Sistina well; they have
contributed significantly to the Linux community, and continue to do so).
OpenGFS was started shortly thereafter, based on the 4.x source. Notes in the
CVS base indicate that it was imported from Compaq's SSIC-Linux 0.5.1 in
August of 2001. This work was done by Linux heavy hitters such as Christoph
Hellwig and Alan Cox. According to Alan, they had time to "clean it up and
make it basically sane, but not to then tackle the big jobs" such as "sorting
out all the pool mess".
Since then, the project has been maintained mostly by Brian Jackson and
Dominik Vogt. 2003 has seen a lot of effort towards understanding and
documenting the code (Dominik, Stefan Domthera, and Ben Cahill), removing
the dependence on pool (Ben Cahill), supporting OpenGFS with EVMS (IBM
EVMS team) and developing a new lock module for OpenDLM (Stanley Wang).
Many other folks are monitoring the list and providing helpful suggestions
and discussions.
The OpenGFS team is NOT monitoring the commercial GFS development process, and
is NOT intentionally implementing similar upgrades or features.
REFERENCES:
OpenGFS web site:
http://opengfs.sourceforge.net (see especially "Docs/Info" page, via menu)
OpenGFS CVS (browsable/downloadable via OpenGFS website):
opengfs/docs/*
Many docs are also mirrored to http://opengfs.sourceforge.net/docs.php
opengfs/man/*
man pages are viewable within CVS tree, without build/install, e.g.:
cd /path/to/opengfs/man
man ./ogfs_mount.8
Mail lists:
Archives (Sept. 2001 - present) and lists on our project page:
http://sourceforge.net/mail/?group_id=34688
Older GFS mail list archives (May 2000 - present), no search facility:
http://www.spinics.net/lists/gfs
More recent, searchable archives, (Aug. 2003 - present):
http://marc.theaimsgroup.com/?l=opengfs-users&r=1&w=2
http://marc.theaimsgroup.com/?l=opengfs-devel&r=1&w=2
http://marc.theaimsgroup.com/?l=opengfs-bugs&r=1&w=2
http://marc.theaimsgroup.com/?l=opengfs-announce&r=1&w=2
Kernel docs:
Documentation/filesystems/ext2.txt
Documentation/filesystems/ext3.txt
These don't discuss OpenGFS, but are interesting reading anyway,
due to many similarities.
Documentation/DocBook/journal-api.*
Kernel interface between filesystem and Linux Journal Block Device (JBD),
the journaling service used for ext3. Again, this doesn't discuss OpenGFS,
but is interesting reading anyway, due to many similarities.
To create viewable files in the kernel source tree, do:
cd /path/to/linux
make pdfdocs    (or: make htmldocs)
Interesting internet resources and other projects:
http://opendlm.sourceforge.net
Open Distributed Lock Manager (OpenDLM) project
http://oss.software.ibm.com/dlm
Distributed Lock Manager documentation
http://evms.sourceforge.net
Enterprise Volume Management System (EVMS) project
ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/journal-design.ps.gz
Great paper on journaling for ext3. Very similar to journaling in OpenGFS.
http://www.linuxsymposium.org/2003/audio.php
Audio presentations from the 2000 and 1999 symposia. A bit dated, but still helpful.
Recommended:
2000, GFS for Linux (by one of the original authors)
2000, EXT3, Journalling FS (by Steven Tweedie, ext3 author) *
2000, XFS for Linux *
2000, Intelligent I/O for Linux
1999, Intermezzo: Distributed Filesystem
* transcript available at http://olstrans.sourceforge.net
http://www.linuxsymposium.org/2003/proceedings.php
Chock full of good stuff: filesystems, clustering, you name it.
http://www.namesys.com/v4/v4.html
Interesting paper (with distinctive graphics) on Reiser 4 filesystem.
Note that OpenGFS does not work like Reiser 4, just mind-expanding reading.
Books:
"Understanding the Linux Kernel", 2nd edition, Bovet and Cesati, O'Reilly
Good for information on Linux kernel components that support OpenGFS,
notably the Virtual File System (chapter 12), block handling
(in chapter 13), and the page and buffer caches (in chapter 14). It also
discusses the ext2/ext3 filesystem (chapter 17), which does many things in
ways similar to (but different from) OpenGFS.
"Linux Device Drivers", 2nd edition, Alessandro Rubini, O'Reilly
Good complement to "Understanding the Linux Kernel" (neither book tells
the whole story). Good for background on Linux drivers in general, and
specifically, chapter 12 "Loading Block Drivers" talks about, well,
block drivers such as the OpenGFS pool kernel module.
"Linux File System", Moshe Bar, (publisher?)
recommendation courtesy of Andrea Glorioso
"Managing RAID on Linux", Derek Vadala, O'Reilly
In addition to a discussion on RAID, this book has a good (quick) overview
of filesystem operation, including a little about journals, and a quick look
at ext2, ext3, ReiserFS, IBM JFS, and SGI XFS.