Monday, December 16, 2013

I love the Holiday break. I get to work on https://cloud.sagemath.com (SMC) all day again! Right now I'm working on a multi-data-center extension of http://www.gluster.org for storing a large pool of sparse, compressed, deduplicated ZFS image files that are efficiently replicated between data centers. Soon all SMC projects will be hosted this way, which will mean that they can be moved between computers very quickly, remain available even if all but one data center goes down, and get ZFS snapshots instead of the current snapshot system. ZFS snapshots are much better for this application, since you can force them to happen at a specific point in time, tag them, and delete them if you want. A little later I'll even make it possible to do a full download (to your computer) of an SMC project (and all its snapshots!) by just downloading the ZFS image file and mounting it yourself.

I'm also continuing to work on adding a Google Compute Engine data center; the web server parts are hosted there right now at https://108.59.84.126/, but the really interesting part will be making compute nodes available, since the GCE compute nodes are very fast. I'll be making 30GB RAM 8-core instances available, so one can start a project there and just get access to that -- for free to SMC users, despite the official price being $0.829/hour. I hope this happens soon.

Tuesday, December 10, 2013

The Sagemath Cloud combines open source technology
that has come out of cloud computing and mathematical software
(e.g., web-based Sage and IPython worksheets) to make online
mathematical computation easily accessible.
People can collaboratively use
mathematical software, author documents, use a full
command line terminal, and edit complicated computer programs,
all using a standard web browser with no special plugins.
The core design goals of the site are collaboration and
very high reliability, with
data mirrored between multiple data centers. The current dedicated infrastructure
should handle over a thousand simultaneous active users,
and the plan is to scale up to tens of thousands of users as demand grows
(about 100 users sign up each day right now).
Most open source mathematical software is pre-installed,
and users can also install their own copies of proprietary software,
if necessary.
There are currently around 1000 users on the site each day from
all over the world.
The Sagemath Cloud is under very active development,
and there is an ongoing commercialization effort through the University of Washington,
motivated by many users who have requested more compute power, more disk space, or the option to host their own installation of the site.
Though the main focus is on mathematics, the website has also been
useful to people in technical areas outside mathematics that
involve computation.

Saturday, October 19, 2013

William Stein, the lead developer of Sage, has been developing a new online interface to Sage, the Sage Cloud at
https://cloud.sagemath.com. Currently in beta status, it is already a powerful computation and collaboration tool.
Work is organized into projects which can be shared with others. Inside a project, you can create any number of files,
folders, Sage worksheets, LaTeX documents, code libraries, and other resources. Real-time collaborative editing allows
multiple people to edit and chat about the same document simultaneously over the web.

The LaTeX editor features near real-time preview, forward and reverse search, and real-time collaboration. It is also easy to have Sage do computations
or draw figures and have those automatically embedded into a LaTeX document using the SageTeX package (for example,
after including the sagetex package, typing \sageplot{plot(sin(x))} in a TeX document inserts the plot of sin(x)).
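
For concreteness, here is a minimal LaTeX document along those lines (a sketch of standard SageTeX usage, not a cloud.sagemath-specific template):

    \documentclass{article}
    \usepackage{sagetex}  % SageTeX ships with Sage
    \begin{document}
    % Inline computation: sagetex replaces this with the result.
    The factorization of $2^{64}+1$ is $\sage{factor(2^64 + 1)}$.

    % Sage draws the plot and embeds it as a graphic:
    \sageplot{plot(sin(x), (x, -pi, pi))}
    \end{document}
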
A complete Linux terminal is also available from the browser to work within the project directory. Snapshots are
automatically saved and backed up every minute to ensure work is never lost. William is rapidly adding new
features, often within days of a user requesting them.

Saturday, October 12, 2013

Today's post is from guest blogger Jason Grout, lead developer of the Sage Cell Server.

The other day some students and I met to do some development on the Sage cell server. We each opened up our shared project on cloud.sagemath.com on our own laptops, and started going through the code. We had a specific objective. The session went something like this:

Jason: Okay, here's the function that we need to modify. We need to change this line to do X, and we need to change this other line to do Y. We also need to write this extra function and put it here, and change this other line to do Z. James: can you do X? David: can you look up somewhere on the net how to do Y and write that extra function? I'll do Z.

Then in a matter of minutes, cursors scattering out to the different parts of the code, we had the necessary changes written. I restarted the development sage cell server running inside the cloud account and we were each able to test the changes. We realized a few more things needed to be changed, we divided up the work, and in a few more minutes each had made the necessary changes.

It was amazing: watching all of the cursors scatter out into the code, each person playing a part to make the vision come true, and then quickly coming back together to regroup, reassess, and test the final complete whole. Forgive me for waxing poetic, but it was like a symphony of cursors, each playing their own tune in their lines of the code file, weaving together a beautiful harmony. This fluid syncing William wrote takes distributed development to a new level.

Thursday, October 3, 2013

The terms of usage of the Sagemath Cloud say "This free service is not guaranteed to have any uptime or backups." That said, I do actually care a huge amount about backing up the data stored there, and ensuring that you don't lose your work.

Bup

I spent a lot of time building a snapshot system for user projects on
top of bup.
Bup is a highly efficient de-duplicating compressed
backup system built on top of git; unlike other approaches,
you can store arbitrary data, huge files, etc.

I looked at many open source options for making efficient de-duplicated distributed snapshots, and I think bup is overall the
best, especially because the source code is readable. Right now https://cloud.sagemath.com makes several thousand bup snapshots every day,
and these snapshots have already saved people many, many hours of potentially lost work (from accidentally deleting or corrupting files).

You can access these snapshots by clicking on the camera icon on the right side of the file listing page.

Some lessons learned when implementing the snapshot system

Avoid creating a large number of branches/commits -- creating an almost-empty repo, but with say 500 branches,
even with very little in them, makes things painfully slow, e.g., due to an enormous number of separate
calls to git. When users interactively get directory listings, it should take at most about 1 second to
get a listing, or they will be annoyed. I made some possibly-hackish optimizations -- mainly caching --
to offset this issue; they are here in case anyone is interested: https://github.com/williamstein/bup
(I think they are too hackish to be included in bup, but anybody is welcome to them.)

Run a regular test of how long it takes to access the file listing in the latest commit, and if it gets above a threshold, create a new bup repo. So in fact the bup backup daemons really manage a sequence of bup repos. There are a bunch of these daemons running on different computers, and it was critical to implement locking, since in my experience bad things happen if you try to back up an account using two different bups at the same time. Right now, typically a bup repo will have about 2000 commits before I switch to another one.
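
Here is a minimal Python sketch of that timing test and rotation; the directory layout, threshold, and lock scheme are illustrative stand-ins, not the actual SMC daemon code:

    import os, subprocess, time, fcntl

    BASE = "/backup"    # hypothetical directory holding repo-0, repo-1, ...
    THRESHOLD = 1.0     # seconds; rotate to a new repo when listing gets slower

    def current_repo():
        n = max(int(d.split("-")[1]) for d in os.listdir(BASE)
                if d.startswith("repo-"))
        return os.path.join(BASE, "repo-%d" % n), n

    def listing_time(repo, branch):
        # Time how long bup takes to list the files in the latest commit.
        t = time.time()
        try:
            subprocess.check_call(["bup", "-d", repo, "ls", branch + "/latest"],
                                  stdout=subprocess.DEVNULL)
        except subprocess.CalledProcessError:
            return 0.0  # e.g., a brand new repo with no commits yet
        return time.time() - t

    def snapshot(branch, path):
        repo, n = current_repo()
        if listing_time(repo, branch) > THRESHOLD:
            repo = os.path.join(BASE, "repo-%d" % (n + 1))
            subprocess.check_call(["bup", "-d", repo, "init"])
        # Locking: two bups must never back up the same account at once.
        with open(repo + ".lock", "w") as lock:
            fcntl.flock(lock, fcntl.LOCK_EX)
            subprocess.check_call(["bup", "-d", repo, "index", path])
            subprocess.check_call(["bup", "-d", repo, "save", "-n", branch, path])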

When starting a commit, I wrote code to save information about the current state, so that everything can be rolled back in case an error occurs -- due to files moving, network issues, the snapshot being massive because of a nefarious user, power loss, etc. This was critical to keep the bup repo from getting corrupted, and hence broken.

In the end, I stopped using branches, due to complexity and inefficiency, and just make all the commits in the same branch; I keep track of what is what in a separate database. Also, when making a snapshot, I record the changed files (as output by bup when making the snapshot) in the database with the commit, since this information can be really useful and is impossible to get back out of the backups themselves, due to the single branch, the bup archives being spread over multiple computers, and there being multiple bup archives on each computer. NOTE: I've been recording this information for cloud.sagemath for months; it is not yet exposed in the user interface, but will be soon.

Availability

The snapshots are distributed around the Sagemath Cloud cluster,
so failure of single machines doesn't mean that backups become unavailable.
I also have scripts that automatically rsync all of the snapshot repositories to machines in other locations, and keep
offsite copies as well. It is thus unlikely that any file you create in cloud.sagemath could just get lost. For better
or worse, it is also impossible to permanently delete anything. Given the target audience of mathematicians and math students,
and the terms of usage, I hope this is reasonable.

Friday, September 13, 2013

I spent the last two weeks implementing hosted
IPython notebooks with sync for
https://cloud.sagemath.com.
Initially I had just planned to simplify the port forwarding setup, since
using multiple forward and reverse port
forwards seemed complicated. But then I became concerned about
multiple users (or users with multiple browsers) overwriting each
other's notebooks; this is a real possibility, since projects are frequently
shared between multiple people, and everything else does realtime
sync. I had planned just to add some very minimal merge-on-save
functionality to avoid major issues, but somehow got sucked into implementing full realtime sync (even with the other person's cursor showing).

Here's how to try it out

Click +New, then click "IPython"; alternatively, paste in a link to an IPython
notebook (e.g., anything here http://nbviewer.ipython.org/ -- you might need to get the actual link to the ipynb file itself!), or upload a file.

An IPython notebook server will start, the given .ipynb file should
load in a same-domain iframe, and then some of the IPython notebook
code and iframe contents are monkey patched, in order to support
sync and better integration with https://cloud.sagemath.com.

Open the ipynb file in multiple browsers, and see that changes in
one appear in the other, including moving cells around, creating new
cells, editing markdown (the rendered version appears elsewhere), etc.

Since this is all very new and the first (I guess) realtime sync
implementation on top of IPython, there are probably a lot of issues.
Note that if you click the "i" info button to the right, you'll get a
link to the standard IPython notebook server dashboard.

IPython development

Regarding the monkey patching mentioned above, the right thing to do would
be to explain exactly what hooks/changes in the IPython html client I
need in order to do sync, etc., make sure these make sense to the
IPython devs, and send a pull request. As an example, in order to do sync
efficiently, I have to be able to set a given cell from JSON -- it's
critical to do this in place when possible, since the overhead of
creating a new cell is huge (due probably to the overhead of creating
CodeMirror editors); however, the fromJSON method in IPython assumes
that the cell is brand new -- it would be nice to add an option to
make a cell fromJSON without assuming it is empty.
The ultimate outcome of this could be a clean well-defined way of
doing sync for IPython notebooks using any third-party sync
implementation. IPython might provide their own sync service and
there are starting to be others available these days -- e.g., Google
has one,
and maybe Guido van Rossum helped write one for Dropbox recently?

How it works

Earlier this year, I implemented
Neil Fraser's differential synchronization
algorithm, since I needed it for file and Sage worksheet editing
in https://cloud.sagemath.com.
There are many approaches to realtime synchronization, and Fraser makes a good argument
for his. For example, Google Wave involved a different approach (Operational Transforms),
whereas Google Drive/Docs uses Fraser's approach (and code -- he works at Google), and you can see
which succeeded. The main idea of his approach is an eventually stable iterative process
that involves heuristically making and applying patches on a "best effort" basis; it allows for
all live versions of the document to be modified simultaneously -- the only locking is during the moment
when a patch is applied to the live document.
He also explains how to handle packet loss gracefully.
I did a complete implementation from scratch (except for using
the beautiful Google diff/match/patch library).
There might be a Python implementation of the algorithm as part of
mobwrite.
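
To give a feel for the core primitive, here is a tiny sketch using the Python port of that library (just the make-patch/apply-patch cycle at the heart of differential sync, not cloud.sagemath's actual code):

    from diff_match_patch import diff_match_patch  # pip install diff-match-patch

    dmp = diff_match_patch()

    shadow = "def f(x):\n    return x + 1\n"  # last version both sides agreed on
    mine   = "def f(x):\n    return x + 2\n"  # my live document

    # Diff my live document against the shadow; the patches go over the wire.
    patches = dmp.patch_make(shadow, mine)

    # The other side applies them, best effort, to ITS (possibly edited) live doc.
    theirs = "# fast\ndef f(x):\n    return x + 1\n"
    merged, applied = dmp.patch_apply(patches, theirs)

    print(merged)   # both edits survive: "# fast\ndef f(x):\n    return x + 2\n"
    print(applied)  # [True]; a hunk that cannot be placed is simply rejected

The fuzzy, best-effort patching is what lets simultaneous edits merge instead of clobbering each other.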

The hardest part of this project was using Fraser's algorithm,
which is designed for unstructured text documents,
to deal with IPython's notebook format, which is a structured JSON document.
I ended up defining another less structured format for IPython notebooks, which gets used purely
for synchronization and nothing else. It's a plain text file whose first line is a JSON
object giving metainformation; all other lines correspond, in order, to the JSON for
individual cells. When patching, it is in theory possible in edge cases involving conflicts
to destroy the JSON structure -- if this happens, the destruction is isolated to a single cell, and that part
of the patch just gets rejected.
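
Concretely, the conversion looks roughly like this Python sketch (not the actual client code; a flat "cells" list is assumed for simplicity, whereas IPython's on-disk format at the time nested cells inside worksheets):

    import json

    def notebook_to_syncdoc(nb):
        # First line: JSON object carrying the notebook-level metainformation.
        lines = [json.dumps({"metadata": nb.get("metadata", {})})]
        # Then one line per cell, in order: that cell's JSON.
        lines += [json.dumps(cell) for cell in nb["cells"]]
        return "\n".join(lines)

    def syncdoc_to_notebook(doc):
        lines = doc.split("\n")
        nb = {"metadata": json.loads(lines[0])["metadata"], "cells": []}
        for line in lines[1:]:
            try:
                nb["cells"].append(json.loads(line))
            except ValueError:
                pass  # a conflicting patch mangled this cell; reject just it
        return nb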

The IPython notebook is embedded as an iframe in the main https://cloud.sagemath.com page, but with exactly
the same domain, so the main page has full access to the DOM and Javascript of the
iframe. Here's what happens when a user makes changes to a synchronized IPython notebook (and at
least 1 second has elapsed):

The outer page notices that the notebook's dirty flag is set for some reason, which could
be anything from typing a character, to deleting a bunch of cells, to output appearing.

Computes the JSON representation of the notebook, and from that the document representation (with 1 line per
cell) described above. This takes a couple of milliseconds, even for large documents, due to caching.

The document representation of the notebook gets synchronized with the version stored on the
server that the client connected with. (This server
is one of many node.js programs that handles many clients at once, and in turn synchronizes
with another server that is running in the VM where the IPython notebook server is running. The sync architecture itself is complicated and distributed, and I haven't described it publicly yet.)

In the previous step, we in fact get a patch that we apply -- in a single atomic operation (so the user is blocked for a few milliseconds) -- to our document representation of the notebook in the iframe. If there are any changes,
the outer page modifies the iframe's notebook in place to match
the document. My first implementation of this update used IPython's notebook.fromJSON, which could
easily take 5 seconds (!!) or more on some of the online IPython notebook samples.
I spent about two days just optimizing this step.
The main ideas are:

Map each line of the current document and of the new document to a single unicode character,

Use diff-match-patch to find an efficient sequence of deletions, insertions, and swaps that transforms one
document into the other (i.e., swapping cells, moving cells, etc.) -- this is critical to do, and

Change cells in place when possible.

With these tricks (and more can be done), modifying the notebook in place takes only a few milliseconds in
most cases, so you don't notice this as you're typing. (A sketch of the line-to-character trick appears after this list.)

Send a broadcast message about the position of your cursor, so the other clients can draw it. (Symmetrically, render the cursor on receiving a broadcast message.)
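
Here is the line-to-character trick from the list above, in the Python port of diff/match/patch (illustrative only; the real code runs in the browser on the cell lines of the sync document):

    from diff_match_patch import diff_match_patch

    dmp = diff_match_patch()

    old = "cell A\ncell B\ncell C\n"  # one line per cell, as described earlier
    new = "cell B\ncell A\ncell C\n"  # the user dragged cell A below cell B

    # Encode each distinct line as a single unicode character, diff the tiny
    # strings, then translate the result back to whole lines.
    chars1, chars2, line_array = dmp.diff_linesToChars(old, new)
    diffs = dmp.diff_main(chars1, chars2, False)
    dmp.diff_charsToLines(diffs, line_array)

    for op, text in diffs:  # op: -1 = delete, 0 = equal, 1 = insert
        print(op, repr(text))

Reading the delete/insert pairs at the line (i.e., cell) level tells us which cells moved or changed, so the corresponding cells can be relocated or updated in place instead of rebuilding every CodeMirror editor.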

Monday, September 2, 2013

I'm still working on the IPython notebook integration into https://cloud.sagemath.com right now.
This will be a valuable new feature for users, since there's a large amount of good
content out there being developed as IPython notebooks, and the IPython notebook
itself is fast and rock solid.

I spent the last few days (it took longer than expected) creating a generic way to *securely* proxy arbitrary HTTP services from cloud projects, which is now done. I haven't updated the page yet, but I implemented code so that

https://cloud.sagemath.com/[project-id]/port/[port number]/...

gets all HTTP requests automatically proxied to the given port in the indicated project. Only logged-in users with write access to that project can access this URL -- with a lot of work, I think I've set things up so that one can safely create password-less non-SSL web services for a group of collaborators, with all the authentication just piggybacking on cloud.sagemath accounts and projects: it's SSL-backed (with a valid cert) security almost for free, which solves what I know to be a big problem users have.
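
The real proxy is part of SMC's node.js hub, but the routing idea is simple; here is a hypothetical Python sketch, where the access check is a stub standing in for the database lookup, and SSL termination is assumed to happen in front of this:

    import re
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.request import urlopen

    # Matches /[project-id]/port/[port number]/rest-of-path
    ROUTE = re.compile(r"^/(?P<project>[0-9a-f-]{36})/port/(?P<port>\d+)(?P<rest>/.*)?$")

    def has_write_access(user, project_id):
        # Stub: SMC checks the logged-in user's project access in its database.
        return True

    class PortProxy(BaseHTTPRequestHandler):
        def do_GET(self):
            m = ROUTE.match(self.path)
            if not m or not has_write_access("current-user", m.group("project")):
                self.send_error(403, "project write access required")
                return
            # Forward to whatever service listens on that port in the project.
            url = "http://127.0.0.1:%s%s" % (m.group("port"), m.group("rest") or "/")
            with urlopen(url) as resp:
                body = resp.read()
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    # HTTPServer(("127.0.0.1", 8080), PortProxy).serve_forever()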

The above approach is also nice, since I can embed IPython notebooks via an iframe in cloud.sagemath pages, and the URL is exactly the same as cloud.sagemath's, which avoids subtle issues with firewalls, same-origin policies, etc. For comparison, wakari.io serves the iframe containing a single ipynb worksheet from a different host on a nonstandard port.

With the wakari.io approach, some users will find that notebooks just don't work, e.g., students at the University of Arizona, at least if their wifi still doesn't allow connecting to nonstandard ports, as it didn't when I tried to set up a Sage notebook server there once for a big conference. By having exactly the same page origin and no nonstandard ports, the way I set things up, the parent page can also directly call javascript functions in the iframe (and vice versa), which is potentially very useful.

IPython notebook servers will be the first to use this framework; then I'll use something similar to serve static files directly out of projects. I'll likely also add the Sage Cell Server and the classic Sage Notebook at some point, and maybe wikis, etc.

Having read and learned a lot about the IPython notebook, my main concern now is their approach to multiple browsers opening the same document. If you open a single worksheet with multiple browsers, there is absolutely no synchronization at all, since there is no server-side state: either browser can and will silently overwrite the work of the other when you (auto-)save. It's worse than the Sage Notebook, where at least there is a sequence number and the browser that is behind gets a forced refresh (and a visible warning message about there being another viewer).

For running your own IPython notebook on your own computer, this probably isn't a problem (just like a desktop app). But for a long-running web service, where a single user may use a bunch of different computers (home laptop, tablet, office computer, another laptop, etc.), or where multiple people are involved, I'm uncomfortable that it is so easy for all your work to just get overwritten, so I feel I must find some way to address this problem before releasing IPython support. With cloud.sagemath, a lot of people will likely quickly start running IPython notebook servers for groups of users, since it takes about 1 minute to set up a project with a few collaborators -- then they all get secure access to a collection of IPython notebooks (and other files). So I'm trying to figure out what to do about this. I'll probably just implement a mechanism so that the last client to open an IPython notebook gets that notebook, and all older clients get closed or locked. Maybe in a year IPython will implement proper sync, and I can remove the lock. (On the other hand, maybe they won't -- having no sync has its advantages regarding simplicity and *speed*.)

Wednesday, August 28, 2013

Motivated by work on a book and by the Stacks Project,
I just wrote and released a new web-based LaTeX editor.
You can try it now by making a free account at https://cloud.sagemath.com, then
creating a project, uploading or creating a .tex file, and opening it.

Features

Side-by-side LaTeX editing, with re-build on save (you can set the autosave interval if you want).

Forward and inverse search.

Parsing of the log file, with buttons to jump to corresponding place in tex file and pdf file.

Preview uses high-resolution color png's, so it will work in browsers that don't have any support for pdf.

The command to LaTeX your document is customizable.

The build process should run LaTeX, bibtex, and sagetex automatically if the log file says they need to be run; otherwise you can click a button to force bibtex or sagetex to run.

Scales up to large documents -- my test document is a book! For me, sitting at home working on my 134-page book, the time from making a change and clicking "save" to when it appears in the preview pane in high resolution is less than 7 seconds.

Forward and inverse search: jump from point in .tex file to corresponding point in pdf and conversely (it seems the competition doesn't have this, but I bet they will implement it soon after they read this post)

If you need a full xterm for some reason, you have it: you can run arbitrary programs from that command line. This means you can download data (files, websites, databases, experimental result files, using git or anything else), process it in the most general sense of computing, and generate files or parts of files for your LaTeX document.

It scales up to large documents more efficiently (in my limited tests), since I was pretty careful about using hashing tricks, parallel computation to generate png's, etc.

A different synchronization implementation for multiple people editing the same file at once; the others lock the editor when the network drops, or reset the document when the connection comes back -- and in real life, network connections drop often...

I put some effort into trying to make this latex editor work on iPad/Android, though you'll want to use a bluetooth keyboard since there are major issues with CodeMirror and touch still.

Drawbacks

The error messages are not displayed embedded in the tex document (not sure I want this, though).

You must have a cloud.sagemath account (free) -- you can't just start editing without signing up.

Single file download is limited to 12MB right now, so if your PDF is huge, you won't be able to just download it -- you can scp it anywhere though using the terminal.

Behind the Scenes

As a professional mathematician, I've spent 20 years using LaTeX, often enhanced with little Python scripts I write to automate the build process somewhat.
Also, I've spent way too much time over the years just
configuring and re-configuring forward and inverse search under Linux, OS X, and Windows with various editors
and previewers.

All the new code I wrote to implement the LaTeX editor is client-side CoffeeScript,
HTML, and CSS, which builds on the infrastructure I've developed over the last year (so, e.g.,
it can run bash scripts on remote linux machines, etc.).
Here are some specific problems I confronted; none of the solutions
are what I expected two weeks ago or first tried!

Problem: how should we display a PDF in the browser?

I investigated three approaches to displaying PDF files in the web browser: (1) show a bunch of images (png or jpg), (2) use a native
pdf viewer plugin, and (3) use a javascript pdf renderer (namely pdf.js).
Regarding (2), Chrome and Safari have a native plugin that efficiently
shows a high-quality display of a complete PDF embedded in a web page, but Chromium has nothing by default.
Regarding (3), the Firefox devs wrote pdf.js, which they include with Firefox by default; it
looks good on Firefox, but looks like total crap in Chrome.
In any case, after playing around with (2)-(3) for too long (and even adding a salvus.pdf command
to Sage worksheets in cloud.sagemath), I realized something: the only possible solution is (1),
for the following reasons:

Inverse and forward search: It is impossible to read mouse clicks, page location, or
control the location of the pdf viewer plugin in some
browsers, e.g., in Chrome. Thus only using a PDF plugin would make inverse and forward
search completely impossible. Game over.

It might be possible to modify pdf.js to support what is needed for inverse and forward
search, but this might be really, really hard (for me). Plus the rendering quality of pdf.js
on Chrome is terrible. Game over.

My test document is this book's PDF, which is about 8MB in size. With PDF viewer plugins,
every time the PDF file changes, the entire 8MB pdf file has to be transferred to the browser,
which just doesn't scale -- especially if you replace 8MB by 60MB (say). I want people
to be able to write their books and Ph.D. theses using this editor.
When editing a LaTeX document, the PDF file often changes only a little -- usually only a few
pages change and everything else remains identical; only the changes should get sent to the
browser, so that even a 1000-page document could be efficiently edited. This sort of thing
doesn't matter when working locally, but when working over the web it is critical.

So we are stuck with (1) for the main PDF preview for a file we are actively editing using LaTeX.
There is a long list of apparent drawbacks:

One substantial drawback to (1) for general PDF display is that there is no way to do full text search
or copy text out of the PDF document. Neither of these drawbacks matters for the LaTeX editor application
though, since you have the source file right there. Also, there's nothing stopping me from also providing
the embedded PDF viewer, which has search and copy, and that's what I've done for cloud.sagemath.

Another potential drawback of (1) is that it takes a long time to generate jpg or png images
for a large pdf file -- 5 pages is fine, but what about 150 pages? 1000 pages? I tried
using ImageMagick and Ghostscript. ImageMagick is way too slow to be useful for this.
Ghostscript is incredibly powerful for this, and has a wide range of parameters, with numerous
different rendering devices. The solution I chose here is to (1) generate a high-quality
PNG image just for the currently visible pages (and +/-1), then (2) generate medium quality
pages in some neighborhood of the visible pages, then (3) generate low quality PNG's for
all the other pages. All this is done in parallel, since the host VM's have many cores.
Also, we compute the sha1 hashes of the previews, and if the browser already has them,
don't bother to update those images. Finally, it turns out to be important to replace
high quality images by lower quality ones as the user scrolls through the document,
since otherwise the browser can end up using too much memory. A useful trick for the high quality
pages is
using ghostscript's downsampling feature, so the PDF is rendered at 600dpi (say) in memory,
but output at 200dpi to the PNG.
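
A Python sketch of the Ghostscript invocation for one quality tier (the exact flags SMC uses may differ; -dDownScaleFactor is the downsampling feature just mentioned):

    import subprocess

    def render_pages(pdf, first, last, dpi=600, downscale=3):
        # Render internally at `dpi`, but write the PNGs at dpi/downscale.
        subprocess.check_call([
            "gs", "-dBATCH", "-dNOPAUSE", "-dSAFER",
            "-sDEVICE=png16m",
            "-r%d" % dpi,
            "-dDownScaleFactor=%d" % downscale,
            "-dFirstPage=%d" % first,
            "-dLastPage=%d" % last,
            "-sOutputFile=preview-%03d.png",  # gs numbers the files sequentially
            pdf,
        ])

    # High quality for the visible pages (and +/-1) at an effective 200dpi;
    # the medium and low tiers just use smaller dpi, each in its own process:
    # render_pages("book.pdf", 10, 12, dpi=600, downscale=3)

Combined with the sha1 comparison described above, the browser only downloads pages whose rendering actually changed.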

So the Preview tab in the LaTeX editor shows a png-based preview whose quality automatically
enhances as you scroll through the document. This png preview will work in any browser
(in which cloud.sagemath works), regardless of PDF plugins.
Summary: it is critical to realize exactly what problem we're trying to solve, which is
viewing a PDF that is frequently changing locally. This is completely different from the general
problem of viewing a static PDF, editing a PDF, or even annotating one.

Problem: how to implement forward and inverse search in the browser

Forward and inverse search let you easily jump back and forth between a point in the source tex file and the
corresponding point in the rendered PDF preview. You need this because editing LaTeX documents
is not WYSIWYG (unless you are using something like LyX or TeXmacs), and without this feature you might find
yourself constantly getting lost, doing full-text searches through the source or the pdf file, and generally
wasting a lot of effort on something that should be automatic. The first time I used inverse and forward search
was around 2004 with
the WinEdt and TeXShop
editors, which I think (at the time) used various heuristics to implement them, since they often
didn't quite work right. I 100% assumed that I would have to use heuristics for cloud.sagemath, and started
working on a heuristic approach based on pdftotext, page percentages, etc.

Then one morning I searched and learned about synctex,
which was "recently" added to the core of pdflatex. The first thing I did was run it, look at the output file
and try to parse with my eyes -- that didn't work. I then searched everywhere and could not find any documentation
about the format of the synctex files; however, I found a paper by the author of synctex and read straight through
it. In that paper, they mention that they provide a C library and C program to parse the synctex files,
and explicitly don't document the format since they don't want anybody to write programs to parse it, since
they reserve the right to significantly change it. No problem -- so I just call out to the shell and
run the synctex program itself with appropriate options. With a little research into scaling factors,
etc., I'm able to map mouse clicks on the png to the data synctex needs to get the corresponding location in the
source file. This is all actually pretty easy and provides forward and inverse search with
absolutely no hacks or heuristics. Also, forward search works well since using PNG's to display the preview means
one can precisely set the preview location.
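
Calling out to the synctex tool looks roughly like the following Python sketch; the view/edit subcommands are synctex's documented interface, but the output parsing here is simplified, and the real code also converts between png pixel coordinates and PDF points:

    import re, subprocess

    def forward_search(tex, line, pdf):
        # tex source position -> page and coordinates in the PDF
        out = subprocess.check_output(
            ["synctex", "view", "-i", "%d:0:%s" % (line, tex), "-o", pdf],
            text=True)
        page = int(re.search(r"^Page:(\d+)", out, re.M).group(1))
        x = float(re.search(r"^x:([-\d.]+)", out, re.M).group(1))
        y = float(re.search(r"^y:([-\d.]+)", out, re.M).group(1))
        return page, x, y

    def inverse_search(pdf, page, x, y):
        # click position on the rendered page -> tex file and line
        out = subprocess.check_output(
            ["synctex", "edit", "-o", "%d:%f:%f:%s" % (page, x, y, pdf)],
            text=True)
        tex = re.search(r"^Input:(.+)$", out, re.M).group(1)
        line = int(re.search(r"^Line:(\d+)", out, re.M).group(1))
        return tex, line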

Problem: making sense of the LaTeX log file

When you build a LaTeX document, tex spits out a log file full of filenames, parentheses, warnings, errors, etc., sometimes stopping to ask you questions, sometimes refusing to exit. This file is NOT (by default, at least) easy for a human to read -- at least not for me! You can see an error message that refers to a specific location in a file, but which file that is may be listed hundreds of lines earlier, and you must manually balance parentheses to figure it out.
I read some documents about its format, and fortunately
found this Javascript library, which parses LaTeX logs.
Cloud runs pdflatex using the option -interaction=nonstopmode, so that the whole file gets processed,
then parses the log file, and displays first errors, then typesetting issues (overfull hboxes, etc.), and finally warnings.
Each message has two buttons -- one to jump to the corresponding location in tex file, and one to jump to
the location in the pdf preview.
This is all easy to use, and I've found myself for the first time ever actually going through tex files
and cleaning up the overfull hboxes.
The log file also says when to run sagetex and bibtex, and whether or not to run pdflatex again to update cross references,
and cloud parses that and automatically runs those tools.
For some reason, sagetex doesn't say "run me again" when you update existing blocks, even though it should,
in which case you have to run it manually by clicking a button. The resulting build loop is sketched below.
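
A simplified Python sketch of that build loop (substring tests stand in for the real log parsing, which cloud does client-side via the Javascript library mentioned above):

    import subprocess

    def run_latex(tex):
        subprocess.call(["pdflatex", "-interaction=nonstopmode",
                         "-synctex=1", tex])
        with open(tex[:-4] + ".log", errors="replace") as f:
            return f.read()

    def build(tex):  # e.g., build("book.tex")
        base = tex[:-4]
        log = run_latex(tex)
        if "sagetex" in log:  # the document uses the sagetex package
            subprocess.call(["sage", base + ".sagetex.sage"])
            log = run_latex(tex)
        if "Citation" in log and "undefined" in log:
            subprocess.call(["bibtex", base])
            log = run_latex(tex)
        for _ in range(3):  # cross-references can require a few reruns
            if "Rerun to get cross-references right" not in log:
                break
            log = run_latex(tex)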

Summary: I hope this LaTeX editor in the Sagemath Cloud is useful to people
who just want to edit tex documents, play around with sagetex, and not have to worry about configuring anything. Implementing it was fun
and interesting. If you have any questions about the technical details, please ask! Enjoy.

About Me

I am a professor of mathematics at the University of
Washington. In my mathematics research, I use the Birch and
Swinnerton-Dyer conjecture as motivation to explore the
constellation of conjectures and questions about arithmetic invariants of elliptic curves. I do many explicit computations, and started the Sage Mathematical Software project. Currently, I'm working very hard on https://cloud.sagemath.com.