Client access to the Plasma Filesystem
This client library provides full access to the Plasma filesystem.
The interface should be largely self-explanatory; if questions
arise, please consult the page Plasmafs_protocol, which explains
the background concepts of the PlasmaFS protocol.

Many of the following functions return so-called engines; these
functions have the suffix _e. Each of them also has a "normal", i.e.
synchronous, variant that returns the result directly instead of an
engine computing it. Engines make it possible to send queries
asynchronously. For more information about engines, see the module
Uq_engines of Ocamlnet. It is generally not possible to use the
client synchronously while an engine is still running.

Closes the descriptors to remote services as far as possible, but
does not permanently shut down the client functionality. The
descriptors are automatically opened again when needed. The
effect is not only that resources are given back temporarily,
but also that pending transactions are aborted.

Configures that the data nodes with the given identities are
preferred for the allocation of new blocks. This configuration
remains active until changed again. Useful for configuring local
identities (see local_identities below), i.e. for enforcing that
blocks are allocated on the same machine as far as possible.

Returns the identities of the data nodes running on this machine
(for use with configure_pref_nodes).

Authentication and impersonation

There is a distinction between authentication on the RPC level, and
authentication on the filesystem level. For RPC, the client has only
the choice between two user IDs, namely "proot" and "pnobody".
The first has all rights, whereas the latter can only connect
(unless it requests additional rights). Non-privileged clients use
"pnobody" and provide an additional authentication ticket to obtain
additional permissions.

On the filesystem level, the client can take over any user ID independently
of what ID was used on the RPC level. If "proot" is the RPC user,
one can just become any filesystem user without credentials. If
"pnobody" is the RPC user, one needs an authentication ticket to
become a certain user on the filesystem level.

configure_auth c nn_user dn_user get_password: Configures that accesses
to the namenode are authenticated on the RPC level as nn_user and
accesses to datanodes
are authenticated as dn_user. The function get_password is called
to obtain the password for a user.

nn_user can be set to "proot" or "pnobody".

dn_user is normally set to "pnobody".

This type of authentication does not imply any impersonation on the
filesystem level. To become a user on the filesystem level, call
Plasma_client.impersonate.

configure_default_user_group c user group: Normally, new files are
created as the user and group corresponding to the current
impersonation. If privileges permit it, this can be changed here so that
files are created as user and group. Each string can be empty,
in which case the value is taken from the impersonation.

This is especially useful if one authenticates as "proot" and does not
do any impersonation, i.e. the superuser privileges are still in
effect. Another use is to create files with a group that is different
from the main group of the current impersonation.

This affects not only files, but also new directories and symlinks.

Transactions

All functions requiring a plasma_trans value as argument must be
run inside a transaction. This means one has to first call start
to open the transaction, then call the functions covered by the
transaction, and finally call either commit or abort.
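The start / commit / abort discipline can be captured in a small wrapper. The following is only a sketch: with_trans is a hypothetical helper, not a function of this library, and the three operations are passed in as ordinary functions so that the example does not assume the exact signatures of start, commit and abort.

```ocaml
(* Hypothetical helper: run [f] inside a transaction. [start], [commit]
   and [abort] are parameters, so nothing here depends on the real
   signatures of the client library. *)
let with_trans ~start ~commit ~abort f =
  let trans = start () in
  match f trans with
  | result -> commit trans; result       (* success: make changes permanent *)
  | exception e -> abort trans; raise e  (* failure: roll back and re-raise *)
```

With such a wrapper every code path ends in exactly one of commit or abort, which is the invariant the text above asks for.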

It is allowed to open several transactions simultaneously.

If you use the engine-based interface, it is important to
ensure that the next function in a transaction is only called
after the current function has delivered its result.
This restriction applies only within the same transaction;
other transactions are entirely independent in this respect.

Gets information about inode. This function returns the inodeinfo record
from the cache. The cache can only contain committed versions of the
inodeinfo record, and the client tries to keep only recent versions in
the cache. If the cache does not contain the data, or if the data is
out of date, a new transaction is started to get the newest committed
version.

The bool argument can be set to true to enforce that the
newest version is retrieved. However, there is no guarantee that
the returned version is still the newest one when this function
returns.

Note that get_inodeinfo also implicitly refreshes the cache when
the transaction is (still) only used for read accesses.

The returned inodeinfo does not include modifications caused by
block writes that were not yet flushed to disk.

Fast sequential data access

The function copy_in writes a local file to the cluster. copy_out
reads a file from the cluster and copies it into a local file.

Note in particular that copy_in works only in units of whole blocks.
The function never reads a block from the filesystem, modifies it,
and writes it back. Instead, it writes the block with the data it
has, and if there is still space to fill, it pads the block with
zero bytes. If you need to update only parts of a block, use the
buffered access functions below instead.
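The zero padding of the last block can be illustrated with plain in-memory buffers. This is a minimal sketch under stated assumptions: padded_length and pad_to_blocks are hypothetical names, the block size is an ordinary int, and no library function is called.

```ocaml
(* Length of [len] bytes rounded up to a whole number of blocks. *)
let padded_length ~blocksize len =
  let r = len mod blocksize in
  if r = 0 then len else len + (blocksize - r)

(* Copy [data] into a buffer made of whole blocks; the unused tail of
   the last block keeps its initial zero bytes. *)
let pad_to_blocks ~blocksize data =
  let b = Bytes.make (padded_length ~blocksize (String.length data)) '\000' in
  Bytes.blit_string data 0 b 0 (String.length data);
  b
```

For example, five bytes of data with a (hypothetical) block size of 4 occupy two blocks, the second of which ends in three zero bytes.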

copy_in_e c inode pos fd len: Copies the data from the file descriptor
fd to the file given by inode. The data is taken from the current
position of the descriptor. Up to len bytes are copied. The data
is written to position pos of the file referenced by the inode. If
data is written past the EOF position of the destination file, the EOF
position is advanced. The function returns the number of bytes
copied.

For seekable descriptors, len specifies the exact number of bytes
to copy. If the input file is shorter, null bytes are appended to
the file until len is reached.

For non-seekable descriptors, an additional buffer needs to be
allocated. Also, len is ignored for non-seekable descriptors -
data is always copied until EOF is seen. (However, in the
future this might be changed. It is better to pass
Int64.max_int as len if unlimited copying is required.)

topology says how to transfer data from the client to the data nodes.
`Star means the client organizes the writes to the data nodes as
independent streams. `Chain means that the data is first written to
one of the data nodes, and the replicas are transferred from there to
the next data node.
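The difference between the two topologies can be pictured as a list of (source, destination) transfers. The sketch below is a toy model only: transfer_plan, the string "client" and the node names are assumptions made for illustration, and the real client does not expose such a function.

```ocaml
(* Toy model: which party sends the block to which data node.
   [nodes] is the list of replica holders, in order. *)
let transfer_plan topology nodes =
  match topology, nodes with
  | _, [] -> []
  | `Star, _ ->
      (* the client feeds every data node with its own stream *)
      List.map (fun n -> ("client", n)) nodes
  | `Chain, first :: rest ->
      (* the client feeds the first node; each node feeds the next *)
      let rec hops src = function
        | [] -> []
        | n :: tl -> (src, n) :: hops n tl
      in
      ("client", first) :: hops first rest
```

In the `Star case the client's uplink carries one copy per replica, whereas in the `Chain case it carries a single copy and the data nodes forward the rest.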

flags:

`No_datasync: Data blocks are not synchronized to disk.

`Late_datasync: Only the last block is synchronized to disk.
This also includes all preceding blocks. If an error occurs, though,
nothing is guaranteed.

The default is to write synchronously: at the end of each transaction
that copy_in commits, all blocks are guaranteed to be on disk.

Limitation: pos must be a multiple of the blocksize. The file
is written in units of the blocksize (i.e. blocks are never partially
updated).
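A caller can check the alignment requirement before invoking the copy functions. This is a small sketch, assuming that positions and the block size are non-negative int64 values; is_block_aligned and align_down are hypothetical helper names, and the block size used in the usage note is made up.

```ocaml
(* True if [pos] is a multiple of [blocksize]. *)
let is_block_aligned ~blocksize pos =
  Int64.rem pos blocksize = 0L

(* Largest multiple of [blocksize] that is <= [pos] (for pos >= 0). *)
let align_down ~blocksize pos =
  Int64.sub pos (Int64.rem pos blocksize)
```

With a block size of 65536, position 131072 is aligned, while position 70000 is not and would have to be rounded down to 65536.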

copy_in_from_buf c inode pos buf len: Copies the data from
buf to the file denoted by inode. The data is taken from the
beginning of buf, and the length is given by len. The data is
written to position pos of inode.

copy_in_from_buf works much like copy_in, except that the data is
taken from a buffer and not from a file descriptor.

copy_out_e c inode pos fd len: Copies the data from the file referenced
by inode to file descriptor fd. The data is taken from position
pos to pos+len-1 of the file, and it is written to the current
position of fd. The number of copied bytes is returned.

Seekable output files may only be extended, but are never truncated.

For non-seekable descriptors, an additional buffer needs to be allocated.

If there are holes in the input file, the corresponding byte
region is filled with zero bytes in the output.
Reading past EOF is not prevented; the region past EOF is handled
as if it were a file hole.
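The zero-fill rule for the region past EOF can be modelled on an in-memory string. This is only an illustration: read_range is a hypothetical function, the "file" is an OCaml string, positions are plain ints, and holes inside the file are not modelled.

```ocaml
(* Return exactly [len] bytes starting at [pos]; whatever lies past the
   end of [file] is delivered as zero bytes, like a file hole. *)
let read_range file pos len =
  let out = Bytes.make len '\000' in
  let avail = max 0 (min len (String.length file - pos)) in
  if avail > 0 then Bytes.blit_string file pos out 0 avail;
  Bytes.to_string out
```

The result always has length len, so a caller cannot tell from the data alone where EOF was.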

Limitation: pos must be a multiple of the blocksize.

copy_out always performs its operations in separate transactions.

Flags:

`No_truncate: The descriptor fd is not truncated to the real
file size

copy_out_to_buf_e c inode pos buf len: Copies the data from the
file denoted by inode to the buffer buf. The data is taken from
position
pos to pos+len-1 of the file, and it is written to the beginning
of buf.

read_e c inode pos s spos len: Reads data from inode, and returns
(n,eof,ii) where n is the number of read bytes, and eof the indicator
that EOF was reached. This number n may be less than len only
if EOF is reached. ii is the current inodeinfo.
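Because n is less than len only at EOF, a caller can read a whole file with a simple loop. The sketch below is hedged accordingly: read_all and string_reader are hypothetical, the reader is passed in as a function taking (pos, s, spos, len) and returning (n, eof), positions are plain ints, and the inodeinfo component is left out.

```ocaml
(* Read everything from position 0 in chunks of [chunk] bytes.
   [read pos s spos len] must return [(n, eof)]. *)
let read_all ~read ~chunk =
  let buf = Buffer.create chunk in
  let s = Bytes.create chunk in
  let rec loop pos =
    let (n, eof) = read pos s 0 chunk in
    Buffer.add_subbytes buf s 0 n;
    if eof then Buffer.contents buf else loop (pos + n)
  in
  loop 0

(* Stand-in reader over an in-memory string, used for demonstration. *)
let string_reader file pos s spos len =
  let n = max 0 (min len (String.length file - pos)) in
  if n > 0 then Bytes.blit_string file pos s spos n;
  (n, pos + n >= String.length file)
```

The loop terminates because a short read (including n = 0) only happens together with eof = true.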

Before a read is served from a clean buffer, the client checks whether
the buffer is still up to date.

By default, read updates the metadata from the namenode before starting
any transaction. By setting lazy_validation, one can demand a different
mode, where these updates can be delayed by a short period of time
(useful when several reads are done in sequence).

multi_read_e c inode stream: This version of read allows reading
multiple times from the same file. All reads are done in
the same transaction.

The function gets the next task from stream when the previous
task is done (if any). A task is always an engine which results
either in None (ending the stream), or in Some(req,pass_resp).
The request req = (pos, s, spos, len) says from where to take
the data and where to store it (like in read_e). The
response resp = (n,eof,ii) is the argument of pass_resp.

write_e c inode pos s spos len: Writes data to inode and returns
the number of written bytes. This number n may be less than len for
arbitrary reasons (unlike read - to be fixed).
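Since a write may accept fewer bytes than requested, callers typically loop until everything has been handed over. The following is a sketch only: write_all is a hypothetical helper, the writer is passed in as a function taking (pos, s, spos, len) and returning the number of bytes accepted, and positions are plain ints.

```ocaml
(* Write all of [s] to file position [pos], retrying after short writes.
   [write fpos s spos len] returns how many bytes it accepted. *)
let write_all ~write ~pos s =
  let len = String.length s in
  let rec loop written =
    if written < len then begin
      let n = write (pos + written) s written (len - written) in
      (* guard against a writer that makes no progress at all *)
      if n <= 0 then failwith "write_all: no progress";
      loop (written + n)
    end
  in
  loop 0
```

Each iteration advances both the file position and the offset into s by the number of bytes the previous call accepted.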

A write that is not aligned to a block implies that the old version
of the block is read first (if not available in a buffer). This is
a big performance penalty, and best avoided.

It is not ensured that the write is completed when the return value
becomes available. The write is actually done in the background,
and can be explicitly triggered with the flush_e operation. Also,
note that the write happens in a separate transaction. (With
"background" we do not mean a separate kernel thread, but an
execution thread modeled with engines.)

Writing also ensures that the EOF position is set to at least the
position after the last written position. However, this is only
done when the blocks are flushed in the background. (Use get_write_eof
to get this value immediately, before flushing.)

As writing happens in the background, special attention has to be
paid to the way errors are reported. At the first error the write thread
stops, and an error code is set. This code is reported at the next
write or flush. After being reported, the code is cleared again.
Writing is not automatically resumed; only further write and
flush invocations will restart the writing thread. Also, the
data buffers are kept intact after errors, so the client will try
to write everything again (which may run into the same error).
The function drop_inode can be invoked to drop all dirty buffers
of the inode in the near future.
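The "set once, report once, then clear" behaviour of the error code can be modelled with a mutable slot. This is a toy model with hypothetical names (error_slot, record_error, take_error) and string error codes; it does not show how the client stores errors internally.

```ocaml
(* One pending error code per writer; [None] means "no error to report". *)
type error_slot = { mutable pending : string option }

(* Called by the background writer: only the first error is kept. *)
let record_error slot code =
  if slot.pending = None then slot.pending <- Some code

(* Called by the next write or flush: report the code and clear it. *)
let take_error slot =
  let e = slot.pending in
  slot.pending <- None;
  e
```

A caller that sees an error from write or flush therefore learns about a failure of an earlier background write, not necessarily of the current call.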

snapshot trans inode: Takes a snapshot of the file, affecting
buffered reads and writes, and a few other functions. Reads and
writes now use the transaction trans instead of creating
transactions automatically. Also, the block list is completely
buffered up. The main effects:

Reads view the contents of the file in the version when the snapshot
was made, even if other transactions change the file. Only
the changes made via trans are visible. Taking a snapshot
is an atomic operation.

Writes can change the file. However, there is no automatic
commit anymore. Only when trans is committed (by calling commit)
are the changes made permanent (atomically).

Thus, snapshots can be used to read and write files with high
consistency guarantees.

There are a few other effects of the snapshot mode:

At the moment the snapshot is made, all buffers for this inode
are dropped. This affects both clean and dirty buffers.

When trans is aborted, the dirty buffers are also dropped.

When trans is committed, the buffers for inode remain intact,
of course, because they now reflect the latest state of the file.
Note that it is strongly recommended to flush the buffers
before committing.

The flush operation can now fail with ECONFLICT if there are
other transactions writing to the same file.

The snapshot ends when trans is either committed or aborted.

The following functions also see/modify the snapshot if trans is used:

get_inodeinfo

set_inodeinfo

truncate

get_write_eof

get_write_mtime

flush

flush_all

It is a bad idea to access the file via this client while the
snapshot is being made.

The append flag enables an optimization if new data is only
appended to the file. In this case, it is sufficient to take
only a snapshot of the last block of the file, because the
previous blocks can be considered as immutable.

namelock trans dir name: Acquires an existence lock on the member
name of directory dir. name must not contain slashes.

A namelock prevents the entry name of the directory dir from being
moved or deleted. This protection lasts until the end of
the transaction. If a concurrent transaction tries to move or
delete the file, it will get an `econflict error.

It is not allowed to lock an entry that does not yet exist.

The directory dir itself can still be moved, and thus it
is possible that the absolute path of the protected file changes.

rename trans old_path new_path: Renames/moves the file or directory
identified by old_path to the location identified by new_path.
There must not be a file at new_path (i.e. you cannot move into
a directory).

rename_at trans old_dir_inode old_name new_dir_inode new_name:
Moves the file old_name in old_dir_inode to the new location
which is given by new_name in new_dir_inode.
Neither old_name nor new_name may contain slashes.
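A caller can verify the "no slashes" requirement on entry names before passing them on. The check below is a trivial sketch; is_plain_name is a hypothetical helper and tests only for slashes, since no other restriction on names is stated here.

```ocaml
(* True if [name] contains no slash and can thus denote a single
   directory entry rather than a path. *)
let is_plain_name name =
  not (String.contains name '/')
```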