Group management: combine a group of users and projects into a bigger entity:

a university course -- see the feature list below

a research group: a collection of projects with a theme that connects them, where everybody has access to everybody else's projects

A feed that shows activity on all projects that a user cares about, with some ranking, plus better notifications about chat messages and activity

Commercial Products

We plan four distinct products built on SMC: increased quotas, enhanced university course support, a license and support contract for running a private SMC cloud, and a supported open source BSD-licensed single-user version of SMC (with an offline mode).

PRODUCT: Increase the quota for a specific project (launch by Aug 2014)

CPU cores

RAM

timeout

disk space

number of share mounts

Remarks:

The UI will include an option to change each of the above parameters, visible to some project collaborators (maybe only owners initially).

Within moments of making a change, it goes live and billing starts.

When the change is turned off, billing stops. When a project is not running, it is not billed. (Obviously, we need to add a stop button for projects.)

There is a maximum amount that the user can pre-spend (e.g., $500 initially) -- see the sketch after these remarks.

At the end of the month, the user is given a link to a University of Washington website and asked to pay a certain amount and to register there under the same email address as they use with SMC.

When they pay, SMC receives an email and credits their account for the amount they pay.

There will also soon be a limit on the number of projects that can be associated with an account (e.g., 10); users can pay a small monthly fee to raise it.
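The metering logic could work roughly like the following sketch; the class, rates, and units here are invented for illustration, and actual prices are undecided:

    import time

    PRESPEND_LIMIT = 500.00   # maximum the user can pre-spend before paying

    # hypothetical hourly rates per unit of each upgradable quota
    HOURLY_RATES = {'cores': 0.05, 'ram_gb': 0.02, 'disk_gb': 0.001}

    class QuotaMeter:
        def __init__(self):
            self.active = {}  # quota -> (units, start time)
            self.owed = 0.0   # accrued charges not yet paid

        def start(self, quota, units):
            # upgrade goes live: billing starts within moments
            self.active[quota] = (units, time.time())

        def stop(self, quota):
            # upgrade turned off (or project stopped): billing stops
            units, started = self.active.pop(quota)
            hours = (time.time() - started) / 3600.0
            self.owed += units * HOURLY_RATES[quota] * hours

        def may_spend_more(self):
            return self.owed < PRESPEND_LIMIT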

PRODUCT: University course support (launch by Aug 2014 in time for Fall 2014 semester)

Free for the instructor and TAs

Each student pays $20 in exchange for:

one standard project (they can upgrade its quotas as above), which the TA and instructor are automatically collaborators on

the student is added as a collaborator on a big shared project

in the student's private project they receive homework assignments (assigned, collected) -- see the sketch below
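A rough sketch of the setup for one course; the data model below is made up for illustration and is not SMC's actual API:

    def create_project(owner):
        return {'owner': owner, 'collaborators': set()}

    def add_collaborator(project, user):
        project['collaborators'].add(user)

    def setup_course(instructor, tas, students):
        shared = create_project(instructor)        # big shared project
        projects = {}
        for student in students:                   # each student pays $20
            p = create_project(student)            # one standard project
            for person in [instructor] + tas:      # staff are free and are
                add_collaborator(p, person)        # automatically collaborators
            add_collaborator(shared, student)
            projects[student] = p                  # homework assigned/collected here
        return shared, projects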

Tuesday, April 15, 2014

SageMathCloud (SMC) is
a browser-based hosted cloud computing environment for easily collaborating on
Python programs, IPython notebooks, Sage worksheets and LaTeX documents.
I spent the last four months wishing very much that fewer people would use SMC.
Today that has changed, and this post explains some of the reasons why.

Consistency Versus Availability

Consistency and availability are competing requirements. It is trivial
to keep the files in a SageMathCloud project consistent if we store them
in exactly one place; however, when the machine that project is on goes
down for any reason, the project stops working, and the users of the project
are very unhappy. By making many copies of the files in a project, it's
fairly easy to ensure that the project is always available, even if network
switches in multiple data centers completely fail, etc. Unfortunately, if
there are too many users and the synchronization itself puts too heavy a load
on the overall system, then machines will fail more frequently, and though
projects are available, files do not stay consistent and
data is lost to the user (though still "out there" somewhere for me to find).

Horizontal scalability of file storage and availability of files are also competing requirements.
If there are a few compute machines in one place, then they can all mount user
files from one central file server. Unfortunately, this approach leads to horrible performance
when the network is slow and has high latency; it also doesn't scale up to potentially
millions of users. A benchmark I care about is
downloading a Sage binary (630MB) and extracting it (creating over 70,000 files);
I want this to take at most 3 minutes total, which is hard using a networked filesystem served over
the general Internet between data centers. Instead, in SMC, we store the files for user projects on
the compute machines themselves, which provides optimal speed. Moreover, we use a compressed filesystem,
so in many cases read and write speeds are nearly twice as fast as they might be otherwise.
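The benchmark is roughly the following (a minimal sketch; the URL is a placeholder, and the goal on SMC's local compressed disks is under 3 minutes total):

    import tarfile, time, urllib.request

    URL = 'https://example.com/sage-x86_64-Linux.tar.gz'  # placeholder location

    t0 = time.time()
    urllib.request.urlretrieve(URL, 'sage.tar.gz')        # ~630MB download
    t1 = time.time()
    with tarfile.open('sage.tar.gz') as tar:
        tar.extractall('sage')                            # over 70,000 files
    t2 = time.time()
    print('download %.0fs, extract %.0fs, total %.0fs' % (t1-t0, t2-t1, t2-t0))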

Users can also create files they don't care too much about in /scratch, which is a compressed and deduplicated ZFS filesystem.
It is not backed up in any way, and is local to that compute machine.

The /projects directory is one single big ZFS filesystem, which
is both lz4 compressed and deduplicated. ZFS compression is just plain awesome. ZFS deduplication
is much more subtle, as deduplication is tricky to do right. Since data can
be deleted at any time, one can't just use a bloom filter (which does not support deletion)
to very efficiently tell whether data is already known to the filesystem; instead ZFS uses a much
less memory-efficient exact data structure. Nonetheless, deduplication works well in our situation, since the compute
machines all have sufficient RAM (around 30-60GB), and the total data stored in /projects is
well under 1TB. In fact, right now most compute machines have about 100GB stored in /projects.
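To see why deletion rules out a bloom filter, here is a toy one: adding and querying are cheap and compact, but clearing bits to delete one block could erase evidence of other blocks that hash to the same positions, so there is no safe remove operation.

    import hashlib

    class BloomFilter:
        def __init__(self, m=8192, k=4):
            self.m, self.k = m, k
            self.bits = bytearray(m // 8)

        def _positions(self, data):
            for i in range(self.k):
                h = hashlib.sha1(bytes([i]) + data).digest()
                yield int.from_bytes(h[:4], 'big') % self.m

        def add(self, data):
            for p in self._positions(data):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, data):
            return all(self.bits[p // 8] >> (p % 8) & 1
                       for p in self._positions(data))

        # no remove(): unsetting these bits might break membership
        # tests for *other* data, so deletions are unsupported.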
The /bup/bups directory is also one single big ZFS filesystem; however, it is neither
compressed nor deduplicated. It contains bup
repositories, where bup is an awesome git-based
backup tool written in Python that is designed for storing snapshots of
potentially large
collections of arbitrary files in a
compressed and highly deduplicated way. Since the git pack format is already compressed and deduplicated,
and bup itself is highly efficient at deduplication, we would gain almost nothing by using
compression or deduplication directly on this ZFS filesystem. When bup deduplicates data, it does so using
a sliding window through the file, unlike ZFS, which simply breaks the file up into blocks, so bup
does a much better job at deduplication. Right now, most compute machines have about 50GB stored in /bup/bups.
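The sliding-window idea looks roughly like this (a simplified sketch; bup's actual rolling checksum and chunk sizes differ):

    def chunks(data, window=64, mask=0x1fff):
        # Cut wherever a hash of the last `window` bytes matches a
        # boundary pattern, so identical content produces identical
        # chunks even after an insertion shifts all later offsets.
        start, s = 0, 0
        for i, b in enumerate(data):
            s += b                       # rolling sum over the window
            if i >= window:
                s -= data[i - window]    # slide the window forward
            h = (s * 2654435761) & 0xffffffff
            if i - start + 1 >= window and (h & mask) == mask:
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

With fixed blocks, inserting one byte near the start of a file changes every block after it; with content-defined boundaries, only the chunk containing the change differs.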

When somebody actively uses a project, the "important" working files are snapshotted about once every two minutes.
These snapshots are done using bup and stored in /bup/bups/project_id, as mentioned above.
After a snapshot is successfully created, the files in the working directory and in the bup repository
are copied via rsync to each replica node. The users of the project do not have direct access to
/bup/bups/project_id, since it is of vital importance that these snapshots cannot be corrupted
or deleted, e.g., if you are sharing a project with a fat-fingered colleague, you want peace of mind that
even if they mess up all your files, you can easily get them back. However, all snapshots are mounted
at /projects/project_id/.snapshots and browseable by the user; this uses bup's FUSE filesystem
support, enhanced with some patches I wrote
to support file permissions, sizes, change times, etc.
Incidentally, the bup snapshots have no impact on the user's disk quota.
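The cycle is roughly the following (a minimal sketch using standard bup and rsync invocations; the replica host names are illustrative):

    import os, subprocess

    def snapshot_and_replicate(project_id, replicas):
        work = '/projects/' + project_id         # user's working files
        repo = '/bup/bups/' + project_id         # snapshot repository
        env = dict(os.environ, BUP_DIR=repo)
        subprocess.check_call(['bup', 'init'], env=env)
        subprocess.check_call(['bup', 'index', work], env=env)       # stage files
        subprocess.check_call(['bup', 'save', '-n', 'master', work], env=env)
        for host in replicas:                    # copy working files and repo out
            for path in (work, repo):
                subprocess.check_call(['rsync', '-axH', '--delete',
                                       path + '/', host + ':' + path + '/'])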

We also back up all of the bup archives (and the database nodes) to a single large bup archive,
which we regularly back up offsite on encrypted USB drives.
Right now, with nearly 50,000 projects, the total size of this large
bup archive is under 250GB (!), and we can use it to efficiently recover any particular
version of any file in any project. The size is relatively small due to the
excellent deduplication and compression that bup provides.

In addition to the bup snapshots, we also create periodic snapshots of the two
ZFS filesystems mentioned above... just in case. Old snapshots are regularly deleted.
These are accessible to users if they search around enough with the command line, but
are not consistent between different hosts
of the project, so using them is not encouraged. These extra snapshots ensure that even if the whole
replication/bup system were to somehow
mess up a project, I can still recover everything exactly as it was
before the problem happened; so far there have been no reports of problems.
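Creating and expiring these snapshots looks roughly like this (a sketch using standard zfs commands; the dataset name and retention count are illustrative):

    import subprocess, time

    def rotate_snapshots(dataset='storage/projects', keep=24):
        name = '%s@auto-%d' % (dataset, int(time.time()))
        subprocess.check_call(['zfs', 'snapshot', name])
        out = subprocess.check_output(
            ['zfs', 'list', '-H', '-t', 'snapshot',
             '-o', 'name', '-s', 'creation', '-r', dataset])
        snaps = [s for s in out.decode().splitlines() if '@auto-' in s]
        for old in snaps[:-keep]:        # keep only the newest `keep`
            subprocess.check_call(['zfs', 'destroy', old])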

Capacity

Right now there are about 6000 unique weekly users of SageMathCloud and often about 300-400 simultaneous users, and there
are nearly 50,000 distinct projects. Our machines are at about 20% disk space capacity, and most of them can easily be
expanded by a factor of 10 (from 1TB to 12TB). Disk space for our Google Compute Engine nodes costs
$0.04 per GB per month.
So space-wise we could scale up by a factor of 100 without too much trouble.
The CPU load is at about 10% as I write this, during a busy afternoon with 363 clients connected
very actively modifying 89 projects.
The architecture that we have built could scale up to a million users, if only they would come our way...

About Me

I am a professor of mathematics at the University of
Washington. In my mathematics research, I use the Birch and
Swinnerton-Dyer conjecture as motivation to explore the
constellation of conjectures and questions about arithmetic invariants of elliptic curves. I do many explicit computations, and started the Sage Mathematical Software project. Currently, I'm working very hard on https://cloud.sagemath.com.