Archives

Category: Programming

As I build out CityGraph, I’ve run into the question of which mapping libraries and services to use and why. My purposes are focused on overlaying various types and representations of datasets on (mostly) city-level maps, and modifying those visuals according to user interaction. Here’s what I’ve learned:

Why not Google Maps?

From the start, I narrowed my decision down to Mapbox and Mapzen because they have more robust data visualization APIs and are based on OpenStreetMap. To their credit, I believe Google Maps has better and more reliable data than OpenStreetMap, but I feel it is important to run an open data based service on open mapping data and open source libraries. Additionaly, for my purposes, which are heavily focused on data visualization and interactivity, Google Maps’s lackluster datavis APIs would leave me to rely on something like Leaflet, which doesn’t take advantage of the excellent WebGL features that Mapbox and Mapzen’s libraries have.

Mapbox and Mapzen

Between Mapbox and Mapzen’s rendering libraries and data services/APIs, the choice comes down to what your use cases are. Mapbox has the superior rendering libraries — Mapbox GL libraries work across the web, iOS, and Android. Mapzen has a WebGL renderer, but their mobile library is still in its early stages of development Mapbox seems like the smart choice here.

With respect to data access and API usage, the situation becomes more complicated. If you’re building a commercial application with Mapbox, you have to start out with Mapbox’s Premium plan, which runs at $499/month. If you’re a business with any revenue at all, this is almost certainly worth it, and you can negotiate a higher-tier plan if you exceed the Premium plan’s rates. However, if you aren’t ready to start with the Mapbox Premium plan, Mapzen may be the better choice, because they allow commercial apps to use their free tier. If you don’t care about commercial mapping licensing or supporting thousands of users, then either service’s free tier APIs will almost certainly suit your needs.Mapzen’s rate limits for their free tier are incredibly generous, more so than Mapbox’s, and you can grow your application to support many users before even having to worry about upgrading. It seems their pricing plans are still under development, but I can’t imagine their prices settling any higher than those of Mapbox.

An Ideal Compromise

Ultimately, I decided to go with Mapbox’s libraries for their better cross-platform support and feature-completeness; however, for mapping data and APIs, I chose Mapzen’s services. Every aspect of Mapzen’s stack, from routing to geocoding to tile generation and serving, is open source. So in theory, if you wanted to host your own rate-unlimited Mapzen instance, you could (though it would likely be far more expensive than simply paying Mapbox or Mapzen for their services). And if either service were ever shut down, you could still run your own instances of Mapzen’s open source software and get the same usability. Luckily, Mapbox’s libraries make it easy to use Mapzen’s services. If you have the revenue to do so and aren’t paranoid of a shutdown, paying for Mapbox’s APIs may be the simpler decision. However, Mapzen’s open source approach is inviting and reassuring, and its compatibility with Mapbox’s web and mobile rendering libraries gives me the best of both worlds.

After completing Jeremy Howard’s Deep Learning course, I wanted to put my skills to the test on something fun and interesting, so I set out to train a neural network that classified planets. I’m happy with the end result (and its cheeky name): plaNet.

I wanted to classify major solar system planets based on salient features. The issue with this approach is that there isn’t very much data to train a neural network on. I scraped AstroBin for amateur photos of planets, but I found that most of them simply looked like smudges, and the outer planets were either unrecognizable or missing entirely.

Some of the unaugmented training data used for Jupiter, mostly from NASA.

To get around these issues, I based my approach on two methods: data augmentation on my small dataset, and fine-tuning an existing neural network. Data augmentation is simple in Keras, so I dramatically increased my dataset size simply by applying transformations to my initial images. I fine-tuned my network on VGG’s ImageNet convolutional layers (a classic approach to transfer learning). I dropped out the last fully-connected layer, which was trained to classify everyday objects, and kept the convolutional layers. These layers are great for identifying features — edges, shapes, and patterns — that could still be found in my images of planets. At this point, I pre-calculated the output of the convolutional layer on the initial and augmented datasets in order to easily combine them into one feature set, then I was able to train with a relatively solid test accuracy (~90%). I used a high dropout rate in order to avoid overfitting to my small training dataset, and it seems to have worked.

I want to highlight the simplicity of this approach. Because we’re simply fine-tuning a pre-trained neural network, we can access what is essentially the state of the art in deep learning with just a few lines of code and a small amount of computing time and power (compared to training an entire network from scratch). My work was mostly in preparing the datasets and fine-tuning different parameters until I was happy with the results. If you haven’t already, I encourage you to take a look at the course online. Many thanks to Jeremy Howard for giving me a practical approach to something I’ve only had theoretical backing for so far.

Prepare to fall down a rabbit hole of Linux compiler errors — here’s a guide on how to set up a proper Python and TensorFlow development environment on Columbia’s Yeti HPC cluster. This should also work for other RHEL 6.7 and certain CentOS HPC systems where GLIBC and other dependencies are out of date and you don’t have root access to dig deep into the system. A living, breathing guide is on my GitHub here, and I will keep this post updated in case future versions of TensorFlow are easier to install.

Python Setup

Create an alias for the directory where we’ll do our installation and computing.

You can add this Python to your path, but I am just going to work entirely out of virtual environments and will leave the default path as-is. If you’re particular about folder structure, you can install specific Python versions in (for example) $WORK/applications/python/Python-2.7.12 to keep separate versions well-organized and easily available.

Now, create an alias in your ~/.profile to allow easy access to the virtualenv.

alias pythonenv="source $WORK/applications/pythonenv/bin/activate"

There you have it! Your own local python installation in a virtualenv just a pythonenv command away. You can also install multiple Python versions and pick which one you want for a particular virtualenv. Nice and self-contained.

Bazel Setup

The TensorFlow binary requires GLIBC 2.14, but Yeti runs RHEL 6.7, which ships with GLIBC 2.12. Installing a new GLIBC from source will lead you down a rabbit hole of system dependencies and compilation errors, but we have another option. Installing Bazel will let us compile TensorFlow from source. Bazel requires OpenJDK 8:

Note that /usr/local/cuda-7.5/lib64 is automatically added to $LD_LIBRARY_PATH when you run module load cuda, so we only need to add the other directories. Also note that /usr/local/cuda is symlinked to /usr/local/cuda-7.5, so you don’t need to include the versions in the path directories, but I’m doing it to be explicit.

# This gives you a 1-hour interactive session with GPU support.
# It may take a while to start the interactive session, depending on current wait times.
qsub -I -W group_list=<yetigroup> -l walltime=01:00:00,nodes=1:gpus=1:exclusive_process
# Use latest available gcc for compatibility.
# CUDA loads 7.5 by default.
# Load the proxy to allow TF to download and install protobuf and other dependencies.
module load gcc/4.9.1 cuda proxy
pythonenv
cd $WORK/applications/tensorflow
./configure
# I used all the default settings except for CUDA compute capabilities, which I set to 3.5 for our k20 and k40 GPUs.

Once that is done, make the following change to third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl to add the -fno-use-linker-plugin compiler flag:

The build should fail with an error that goes something likeundefined reference to symbol 'ceil@@GLIBC_2.2.5' or undefined reference to symbol 'clock_gettime@@GLIBC_2.2.5'. To fix this, modify LINK_OPTS in bazel-tensorflow/external/protobuf/BUILD by adding the -lm and -lrt flags to //conditions:default:

Deep learning impresses and disappoints

Multiple talks discussed results from deep learning techniques, especially convolutional neural networks, and the effectiveness of the methods varied wildly. Some experiments yielded only 50% classification accuracy, which doesn’t ultimately seem helpful or effective at all. I’m unsure whether other techniques were attempted or considered, but it’s clear that deep learning isn’t the most effective approach for every single problem. It’s a shiny new hammer that makes every problem look like a nail. Libraries like TensorFlow make it more accessible, but there is still a visible gap between those who can implement it and those who can implement it effectively.

Re-inventing the wheel

A few groups demonstrated tools that were developed in-house that already have excellent open source alternatives. I’m not sure whether they were unaware of the existing libraries or just wanted something more finely-tuned for their own purposes, but it seems that a lot of scientific time is spent coming up with solutions for problems that are already solved. Regardless, there were plenty of examples of people who did use open source libraries effectively, so the progress there is something to be proud of.

James Webb Space Telescope and Astronomy

JWST goes well into the infrared
Launch Autumn/winter 2018 — lots of things that can go wrong, but these engineers are awesome.
Science proposals start November 2017.
Routine science observations start six months after launch.
Compared to next-gen observatories, JWST is an old school telescope. We can bring it into the 21st century with better tools for research.
Coordination of development tools with Astropy developers.
Watch the clean room live on the WebbCam(ha!).

Open Source Hardware in Astronomy

hardware.astronomy
Bringing the open hardware movement to astronomy
1) Develop low(er) cost astronomical instruments
2) Invest undergrads in the development (helps keep costs low).
3) Make hardware available to broader community
4) develop an open standard for hardware in astronomy

Citizen Science with the Zooniverse: turning data into discovery (Oxford)

Crowdsourcing has been proven effective at dealing with large, messy data in many cases across different fields.
Amateur consensus agrees with experts 97% of the time (experts agree with each other 98% of the time), and remaining 3% are deemed “impossible” even by experts.
Create your own zooniverse!

Gaffa tape and string: Professional hardware hacking (in astronomy)

Spectra with fiber optic cables on a focal plane.
Move the cables to new locations.
Use a ring-magnet and piezoelectric movement to move “Starbugs” around — messy, inefficient.
Prototyped a vacuum solution that worked fine! This is now the final design.
Hacking/lean prototypes/live demos are effective in showing and proving results to people. Kinks can be ironed out later, but faith is won in showing something can work.

Open Science with K2

Science is woefully underfunded.
Qatar World Cup ($220 billion) vs. Kepler mission ($0.6 billion)
Open science disseminates research and data to all levels of society.
We need more than a bunch of papers on the ArXiv.
Zooniverse promotes active participation.
K2 mission shows the impact of extreme openness.
Kepler contributed immensely to science, but it was closed.
Large missions are too valuable to give exclusively to the PI team — don’t build a wall.
Proprietary data slows down science, misses opportunities for limited-lifetime missions, blocks early-career researchers, and reduces diversity by favoring rich universities.
People are afraid of getting scooped, but we can have more than one paper.
Putting work on GitHub is publishing, and getting “scooped” is actually plagiarism.
K2 is basically a huge hack — using solar photon pressure to balance an axis after K1 broke.
Open approach: no proprietary data, funding other groups to do the same science, requires large programs to keep data open.
K2 vs K1: The broken spacecraft with a 5x smaller budget has more authors and most publications, and more are early-career researchers because all the data is open. 2x increase, and a more fair representation of the astro community.
Call to action: question restrictive policies and proprietary periods. Question the idea of one paper for the same dataset or discovery. Don’t fear each other as competition — fear losing public support.
The next mission will have open data from Day 0 thanks to K2.

I finally finished off some nice plots of my daily text message history for the past ~40 months. The most difficult part was dealing with Google Voice’s terrible exported HTML format. I will post the Python scripts and more detailed plots and interpretations of the data soon, once things are more polished, so more people can plot out and interpret the mundane details of their lives!

After much revision and cleaning-up, the Sage Android application is now at a point where most basic features are functional, and bug reporting, feature requests, and general feedback are needed as work on the application progresses. If you’d like to try the latest APK, you can download it here. Features are always being added (an updated APK with History and Favorites will be available soon!), and you can track the latest updates at the GitHub repository.

The first and most immediately important task of updating the Sage Android application as part of Google’s Summer of Code is to update the way the app interacts with the Sage Cell Server, which performs the calculations in the cloud so that the Android device doesn’t have to locally. Currently, the application communicates with the server through a series of HTTP requests — the client (our app) initializes the connection, then sends and receives query data back and forth from the server. This all sounds fine and efficient, but by relying solely on HTTP requests, the client ends up having to constantly poll the server to check the status of the calculation (to see if there are any updates), which is inefficient and not ideal for a light application on a light device. It’s a decent way of doing things, but the year is 2013, and the Sage Cell Server now supports WebSocket, so my first task is to update the app’s client-serverinteractions so that users can once again send calculations and receive results.

First, the client must make initial contact with the server in a sort of “handshake” between the two. The app will send an HTTP request, and the server will reply with connection details, at which point the WebSocket connection may be established. This seems easy to implement, but as someone who hasn’t spent much time with networking in Java, I had some difficulty getting hands to shake. The requests I was sending were seemingly correct, but the response I received was similar to that of visiting the site directly from a web browser — 405: Method Not Allowed. How could I let the server know that the app was special, and that it wanted to have meaningful interactions together? With some help from Volker, I was able to see that the issue was in the headers of my request. First, it seemed as if I should use GET instead of POST, which turned out not to be the case.

The POST I was looking for (via the Python client), as revealed by WireShark.

I was advised to make use of WireShark to inspect each request I was making (using the Python sample client as my example of proper HTTP ettiquette). A POST request was being made, but it was being made to the wrong URI — I ended up using Java’s built-in URI functionality to generate one properly. At this point, all that was holding my request back from serverside acceptance was its headers. By adding Accept-Econding:identity, I finally had intialized a proper POST request, as indicated by the server’s response.

Success!

At this point, the server responds with a simple JSON object that contains a “kernel_id” and the “ws_url”, indicating that it is ready to begin our session with the information provided. Using the WebSocket URL and the kernel ID, it is simple at this point to establish a connection through one of the many fine WebSocket libraries that exist for Android. Two connections are made: an IOPub socket and a Shell socket. Once the connection is established, the rest is simply a matter of sending and receiving calculations and results in JSON form on both channels. The IOPub channel details the status of the calculation, while the Shell channel deals with the calculation and results themselves. Data is only sent and received when something has actually happened (at which point either the client or server can react accordingly), and once everything is done, the connection is closed. Especially when considering more advanced features such as Interacts, the improvement in networking and overall efficiency and simplicity from switching over to these sockets is clear.

Handling results and sending/receiving calculation data in general should be simpler now that the client/sever networking process is simplified. Now, it is a matter of getting everything to work together in order for us to run calculations from the app itself.

Although at this moment it has no functional networking capabilities (and therefore cannot perform calculations), you can download the older version of the Sage Math Android app from the Play Store and view (and even contribute to!) the source on GitHub. You can also read more about the Sage Cell Server and its interactions on the GitHub Wiki.