Thanks, we have similar locking improvements in mind but cannot
promise a date when these will be available. Some of the challenges
that we'll need to think about are how to map any such locking scheme
to standard locking behaviour for posix/nfsv3/v4/cifs.

Hi, thanks for replying

Whilst I can see that there is some optimisation to be had by combining
brick level locking with filesystem level locking, I just want to
clarify that my proposal was really about inter-brick locks (locks
between bricks), and not about the toplevel filesystem level locking.

- The goal of any fileserver is to take async requests from lots of
clients and arrange to serialise that access
- Up until recently such fileservers have been on a single machine, but
involving multiple async clients connecting
- Even in a single server solution the bottleneck becomes that each
client cannot cache *any* data since it's not known if the server copy
has changed since we accessed it (even a microsecond earlier)
- The solution which has become popular (see CIFS, NFSV4 (?), GFS2, etc)
was to offer clients an "optimistic lock", ie the client can acquire a
token which while it's held means that it can cache data locked by that
token and even offer writeback optimisations on that data (obviously
subject to whatever the application tolerates for unsync'ed data)
- This "optimistic lock" means that we effectively push the file locking
to the client. Once a lock is acquired, further access by the client is
no longer bounded by network latency, and under many circumstances this
leads to massive speedups
- Clearly when a second client comes along and demands access to the
same data then we need a process to break the lock and inform the first
client that they need to reacquire the lock (or revert to a kind of
"write-through" access system while waiting)

So this process clearly benefits situations where there is serialised
access by single clients at a time. Excluding databases however, this
access pattern seems quite common for lots of applications
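
To make the token idea above concrete, here is a minimal in-memory
sketch. It is only an illustration under my own assumptions: the names
Server, Client, acquire, on_break and round_trips are hypothetical and
not taken from CIFS, NFSv4 or GFS2, and "network" traffic is just a
counter.

```python
class Server:
    """Holds the authoritative copy and remembers who holds the token."""

    def __init__(self, data=""):
        self.data = data
        self.holder = None      # client currently holding the token, if any
        self.round_trips = 0    # simulated client<->server network requests

    def acquire(self, client):
        self.round_trips += 1
        if self.holder is not None and self.holder is not client:
            # Break the lock: the old holder flushes and drops its cache.
            self.holder.on_break()
        self.holder = client
        return self.data


class Client:
    def __init__(self, server):
        self.server = server
        self.has_token = False
        self.cache = None       # only meaningful while has_token is True
        self.dirty = False      # True if the cache holds unflushed writes

    def _ensure_token(self):
        if not self.has_token:
            self.cache = self.server.acquire(self)
            self.has_token = True

    def read(self):
        self._ensure_token()
        return self.cache       # served locally, no network access

    def write(self, data):
        self._ensure_token()
        self.cache = data
        self.dirty = True       # write-back: flushed only on lock break

    def on_break(self):
        if self.dirty:
            self.server.data = self.cache
            self.server.round_trips += 1
        self.has_token = False
        self.cache = None
        self.dirty = False


server = Server("v0")
a, b = Client(server), Client(server)
a.write("v1")
for _ in range(1000):
    a.read()                    # all served from a's cache
print(server.round_trips)       # one request so far (the initial acquire)
print(b.read())                 # breaks a's token, a flushes "v1" first
```

The thousand reads by the first client cost a single request, and the
price is only paid when the second client turns up and the token has to
be broken.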

So with regards to Gluster I would see that we need this same type of
locking implemented at the brick level. If you re-read the description
above, each *gluster server* now plays the role of the client (think of
the lower level being bricks talking to each other, and the upper level
being clients talking to bricks). ie yes, posix locking needs to
serialise access for every end client that connects to every brick, but
we can also benefit from locking to serialise access between bricks. If
3,000 clients hammer one brick for a single file, then we care that our
single brick is allowed to read/write that file freely because it
informed the other bricks that it now holds a lock; it's a separate
problem to serialise all the clients talking to the one brick.

So compared with traditional fileservers we actually need two levels of
locking to serialise access. At one level we need to serialise client
access to the filesystem, and lower down we need to serialise access
between bricks

I think an alternative way of looking at (and perhaps implementing) the
situation could be something like:

- Consider two bricks with files replicated between them
- Client 1 accesses Brick 1 and requests File A

- Brick 1 contacts the other replicas and requests to become the "master
replica" of that file. All future accesses to that file must now go
through only Brick 1 while it remains in that "role"
- If Client 2 accesses Brick 1 and tries to do something with File A,
then the normal filesystem locking must arrange for serialisation
between Client 1 and Client 2, however, Brick 1 need not contact any
other brick and there is no network latency penalty serving that file to
Client 2 (obviously at some point one client will write data and we need
to sync that, but read access incurs no network access)

- OK, now the trick is what happens when Client 3 accesses Brick 2 and
requests File A... Somehow we need to wrest control back from Brick 1
and inform it that it's no longer the "master". A really simple
solution to this (at least conceptually) is to proxy all access requests
from Brick 2 back to Brick 1. This satisfies our requirement that
accesses are serialised across bricks and effectively there is still a
"master" brick remaining in control.
- We can see that this setup is conceptually similar to having a
traditional lock server arbitrating brick access to a given file, but in
the example above we have implemented a distributed lock server, the
lock server effectively becoming the same server as what we hope is the
"hot server", so that we aren't incurring network latency to contact the
lock server all the time.
- A further improvement would clearly be to have some kind of process
where the "master brick" can move about, ie in the case above if Client
3 starts to bash away at Brick 2 for File A, then the "master" role is
migrated to Brick 2, which now holds the lock. Any access through Brick
1 must then either proxy requests to Brick 2 or re-acquire its lock
(ie become the master again)
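
The walk-through above could be sketched roughly as below. This is a
toy model only, with several assumptions of mine: reads only (replicas
are taken to be already in sync), brick-to-brick traffic is a counter
called hops, and the migration rule (take over after MIGRATE_AFTER
proxied requests) is an invented policy, not anything Gluster does.

```python
MIGRATE_AFTER = 3  # hypothetical threshold before the master role moves


class Brick:
    def __init__(self, name, files):
        self.name = name
        self.peers = []         # the other replicas
        self.files = dict(files)
        self.master = {}        # path -> brick currently acting as master
        self.proxied = {}       # path -> requests we proxied to the master
        self.hops = 0           # brick-to-brick requests sent by this brick

    def _become_master(self, path):
        # One request per peer to announce (or take over) the master role.
        for peer in self.peers:
            self.hops += 1
            peer.master[path] = self
            peer.proxied[path] = 0
        self.master[path] = self
        self.proxied[path] = 0

    def read(self, path):
        owner = self.master.get(path)
        if owner is None:
            self._become_master(path)
            owner = self
        if owner is self:
            return self.files[path]         # no network access at all
        self.proxied[path] = self.proxied.get(path, 0) + 1
        if self.proxied[path] > MIGRATE_AFTER:
            self._become_master(path)       # we are the hot brick now
            return self.files[path]
        self.hops += 1                      # proxy to the current master
        return owner.read(path)


b1 = Brick("brick1", {"A": "hello"})
b2 = Brick("brick2", {"A": "hello"})
b1.peers, b2.peers = [b2], [b1]

for _ in range(10):
    b1.read("A")        # Clients 1 and 2 via Brick 1: one hop in total
for _ in range(5):
    b2.read("A")        # Client 3 via Brick 2: proxied, then migrated
print(b1.hops, b2.hops, b2.master["A"].name)
```

Writes, and what happens if the master dies while holding the role, are
deliberately left out; those are exactly the hard parts.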

OK, so the above is a very simple example of optimistic locking and
could be trivially implemented using an external lock server which
tracks which brick currently holds the lock for a given file (ie can
read/write freely without first checking if other bricks have modified
the file). A given brick which doesn't hold a lock on a file must first
do roughly what it does already and contact the lock server to see if
another brick holds the lock. If not, it can acquire the lock itself.
If the lock is held elsewhere, we need to either break the lock or proxy
access requests to the brick holding the lock.
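
As a rough sketch of that external lock server (again a toy, with
hypothetical names LockServer, acquire, access and break_existing, and
with a request counter standing in for the network):

```python
class LockServer:
    """Tracks which brick currently holds the lock for each file."""

    def __init__(self):
        self.holder = {}        # path -> Brick
        self.requests = 0       # simulated requests to the lock server

    def acquire(self, path, brick, break_existing=False):
        """Return the brick that holds the lock after this call."""
        self.requests += 1
        current = self.holder.get(path)
        if current is None or current is brick:
            self.holder[path] = brick
        elif break_existing:
            current.held.discard(path)  # old holder must stop serving locally
            self.holder[path] = brick
        return self.holder[path]


class Brick:
    def __init__(self, name, lock_server):
        self.name = name
        self.locks = lock_server
        self.held = set()       # paths this brick may access freely

    def access(self, path, break_existing=False):
        """Return the name of the brick that ends up serving the request."""
        if path in self.held:
            return self.name    # sticky lock: no lock server round trip
        holder = self.locks.acquire(path, self, break_existing)
        if holder is self:
            self.held.add(path)
            return self.name
        return holder.access(path)      # proxy to the lock holder


locks = LockServer()
b1, b2 = Brick("brick1", locks), Brick("brick2", locks)
for _ in range(100):
    b1.access("A")                      # one lock server request in total
print(b2.access("A"))                   # proxied to the holder
print(b2.access("A", break_existing=True))  # lock broken, brick2 takes over
print(locks.requests)
```

Note the lock server is contacted only when a brick does not already
hold the lock, which is where the saving comes from.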

Really this is not so different to what is there today, but it's simply
an efficiency improvement because we don't need to touch *every* brick
for *every* file access, instead we make some network requests on first
access to a file and then can continue to touch that file for a period
afterwards without needing further network access with other bricks

However, whilst some kind of implementation of the above could offer a
huge performance speedup for many of the situations which come up on the
mailing list, the issue is that the lock server becomes a) a bottleneck
and b) a single point of failure. So the chain of thought almost
certainly goes something like:

- Make the gluster bricks become the lock servers, ie they negotiate
amongst themselves. Really this is roughly what happens right now, only
it's on every access, rather than access being "sticky" once acquired
- Now analyse all the corner cases where bricks go down holding locks,
or get segmented while holding/acquiring locks, and discover some tricky
issues...

Paxos seems like a clever way of dealing with the locking going
distributed, yet not necessarily having a 100% consistent view of who
owns which lock. By introducing a voting method it can show robustness
in the face of failed machines and new machines can be added without
needing to store reliable state information (or at least this is true
with the improvements described in the articles)
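
For anyone not familiar with the voting idea, below is a very reduced
sketch of basic single-decree Paxos, used here to agree on which brick
owns the lock for one file. Big caveats: it decides the owner once and
never releases it, everything is in one process with no real messages,
and the names Acceptor, propose and the "up" flag are mine. It is not
the improved variant from the articles, just the core majority vote.

```python
class Acceptor:
    """One voter; in this picture each brick would run one."""

    def __init__(self):
        self.promised = 0       # highest ballot number promised so far
        self.accepted = None    # (ballot, value) last accepted, or None
        self.up = True          # False simulates a failed machine

    def prepare(self, ballot):
        if not self.up or ballot <= self.promised:
            return None         # no answer, or ballot too old
        self.promised = ballot
        return ("promise", self.accepted)

    def accept(self, ballot, value):
        if not self.up or ballot < self.promised:
            return False
        self.promised = ballot
        self.accepted = (ballot, value)
        return True


def propose(acceptors, ballot, value):
    """Try to get `value` chosen. Returns the chosen value, or None if
    no majority of acceptors could be reached."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(ballot) for a in acceptors]
    promises = [r for r in replies if r is not None]
    if len(promises) < majority:
        return None
    # If any acceptor already accepted a value, we must propose the one
    # with the highest ballot instead of our own.
    prior = [acc for _, acc in promises if acc is not None]
    if prior:
        value = max(prior, key=lambda bv: bv[0])[1]
    acks = sum(1 for a in acceptors if a.accept(ballot, value))
    return value if acks >= majority else None


voters = [Acceptor() for _ in range(3)]
print(propose(voters, 1, "brick1"))     # brick1 wins the lock
voters[0].up = False                    # one machine fails
print(propose(voters, 2, "brick2"))     # brick2 learns brick1 holds it
voters[1].up = False                    # majority lost
print(propose(voters, 3, "brick2"))     # no decision possible
```

The point to notice is that the second proposer, even with one voter
down, ends up learning the existing owner rather than grabbing the lock
for itself, and that with no majority nothing is decided at all.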

Does that make sense? Apologies if the above is long winded, but the
point is really that the performance improvements come from pushing
locks between bricks, and this is probably distinct from client level
locking such as nfs/cifs/posix locking.

For advanced cluster filesystems such as GFS2, the general "optimistic
locking" technique appears to show massive speed improvements (for many
access patterns) and it's also likely to do so in Gluster. Really my
original email jumped two steps and suggested an improved form of
distributed locking, which itself could be used as the actual
implementation, but other forms of distributed locking between bricks
would be highly desirable also.