Monday, 23 March 2020

A field guide to mapping the Milky Way

How do you go about mapping the galaxy you happen to live inside of ? There's a hell of a lot of information on GalaxyMap.org, but it's not quite what I'm after. So in this post I'll do a step-by-step guide as to how to construct a map using 21 cm neutral atomic hydrogen data. If you actually want to try this for yourself, I'm going to be assuming some familiarity with Python (especially numpy and astropy/pyfits). Otherwise this post should describe the theoretical aspect well enough to be of interest by itself.

Working in Galactic coordinates

HI data has two main advantages : first, it's not at all subject to extinction by intervening stars and dust, and second, it gives us an easy way to measure velocities. Using trigonometry and a few reasonable assumptions, we can convert velocity into distance, with some limitations.

But before that we need to define a coordinate system. The convention for all-sky HI data is to use Galactic coordinates. In this system, the centre of the Galaxy is defined to be at longitude and latitude of 0. Galactic latitude l and longitude b are defined as follows :

Optical image of the Milky Way overlaid with all-sky data from LAB, with the Magellanic Stream highlighted in orange.

There are two main all-sky Galactic HI surveys : the Leiden-Argentine-Bonn survey and the HI4PI survey*. Both have been gridded in a nice friendly way, such that the pixel size is fixed in latitude, longitude, and velocity. Thus once you know the pixel size in each dimension and the world coordinates of any given pixel, you can very easily calculate the exact coordinates of every other pixel.

* It's 4π steradians, but I always read it to mean "HI For Principal Investigator".

For reference, both surveys have the origin at the bottom left (l = +180, b = -90). For the LAB survey, the maximum x-pixel (l) range is 720 and the y-pixel (b) range is 360. The pixel size is 0.5 degrees for both axes. For HI4PI, the x and y ranges are 4320 and 2160 respectively, while the pixel size is 5 arcminutes.

* This value may be slightly off. At the time of writing, I can't access the files I need to check.

Note that these values refer explicitly to the gridded pixel values. These are slightly different from the true resolution values, which are more often quoted in the papers. For the researcher, the real resolution is what matters, but for the data visualiser, it's all about the pixels.

Converting to distance

Once we've found the world coordinates of a pixel, and its flux value, we can then convert this to true 3D position, as follows. First we'll need some assumptions. The Sun is reckoned to be rotating around the centre of the Galaxy with speed V⊙= 220.0 km/s, at a distance R⊙= 8.5 km/s. The rotation curve of the Milky Way we can approximate to be totally flat, so that the velocity at any point Vpnt is also always 220.0 km/s. Given the velocity (vel) of any point, we can then calculate is distance R from the Galactic centre :

Note that this further assumes that this is independent of galactic latitude.This is reasonable because the disc is quite thin, but causes problems for structures which are outside the disc completely.

Next we can calculate the distance of the point from the observer :

Where d1,2 refers to the fact that the equation has two solutions. However, it turns out that this is only really true within side the solar circle, so we don't need to do the calculations twice. Rather we should accept that this region of distance ambiguity is inaccessible to us, so we should only do this calculation if R > R⊙ (we'll see what happens if we disregard this sage advice later on).

If that's so, we can proceed to calculate Cartesian coordinates of our pixel in PPP (position-position-position) space :

These will be in kpc since those are the units we've been working with. We can now iterate over every pixel in our data set and create a full PPP map from our original PPV (position-position-velocity) cube. This is relatively easy to do, and you can find the Python code to do so for LAB data here (note that some simple extra transform is applied to these final equations, just to ensure the data appears in a sensible position in our PPP cube). We just have to specify the size of the cube we want to make and hardly have to worry about anything else at all. Sounds great ! We'll be using the full information from the original data, so we should get a nice, clean, super detailed map at the end, without even having to specify the pixel resolution or anything even slightly complicated, right ?

Wrong. The problem is that PPV maps to PPP in a very strange way, which is not at all intuitive from the equations (unless you're some sort of trigonometric super freak, I guess). There's no guarantee that every pixel in our PPP cube even corresponds to one in our PPV cube. And not all our PPV pixels will contribute anything, since many of them will lie well outside the Galactic disc where our equations are invalid.

Here's what we get from the LAB data if we do this :

Slice through a PPP cube created from LAB data.

In some regions things are relatively good and we can see some nice astrophysical structures, but other parts are hugely undersampled while others have downright weird artifacts. We can do quite a bit better with HI4PI, which has higher resolution and so more fully samples PPP space, but it's still far from perfect.

Why mapping from PPV to PPP is a bad idea

Let's start with the artifacts. Our observations give us velocity along our line of sight, that is, how fast the gas is moving towards or away from us. In reality, the gas is also moving across the sky, but we can't measure that. We can get these "proper motions" for stars with considerable effort, but we just can't get it for gas.

The first problem is not so much that we'd like to know the proper motion (although that'd be nice), but that the equations assume our line of sight velocity measurements are accurate. But because we're inside the disc of the Galaxy, this is not always true. When we look towards the Galactic centre, or in the opposite direction, the only motion of the gas is across the sky - except for a little bit of random motion (~10 km/s). This means that in those regions of low measured velocity, our data is just too inaccurate for our equations to properly convert line of sight to true velocity. Better instrumentation won't help, it's a fundamental limitation of the structure of the Galaxy and our method.

Velocity vectors relative to the centre in green. Blue and red show the components
towards and away from us, respectively.

The second problem is that the equations have that annoying distance ambiguity within the solar radius, where the equation gives two solutions. Although we might be able to break this degeneracy using other data (e.g. by associating the gas with stars of known distances), by itself there's nothing much we can do to save the HI data. So this region, like the low velocity regions, has to be thrown away.

It might help to visualise how velocity maps to distance. One way of doing this is to plot isovelocity lines : lines of constant velocity.

Being inside the disc has weird consequences for what we can detect and where. Since everything's so darn close, sensitivity is extraordinarily high : we can detect essentially all Galactic gas. Looking through our original data cube, we see tonnes of stuff at very low velocities across the entire sky, because gas at high latitudes is only found when it's close to us and, therefore, moving slowly relative to us, due to the thin nature of the disc. But at the same velocities we can also be detecting material on the far side of the Galaxy !

The bottom line for visualisation is that if we start with the PPV map, we don't necessarily fully sample the PPP cube. This explains the other even more serious problem of the image - all those ugly black lines. How can we fix this ? One answer would be to interpolate extra velocity channels and/or spatial pixels in the PPV cube, so that we'd have more points that map to the PPP data. This does help, but it's inefficient and far from perfect. Even using the enormous HI4PI data set, which has vastly better spatial resolution (though similar velocity resolution) gives only a modest improvement in the sampling.

Alternatively, go directly from PPP to PPV

A much better approach is to work backwards. Beginning with a blank PPP cube, we can calculate the corresponding pixel in the PPV cube and use that to fill in the flux values. This essentially knocks all the problems on the head. By iterating every pixel in the PPP cube, we guarantee that we'll sample the whole thing. Although our calculated pixel positions in the PPV cube won't be integer values, all we have to do is simple rounding and we effectively interpolate the missing data (there are more sophisticated ways to do this, but they can wait for another time).

How exactly do we go about this ? We define the coordinate system of the PPP cube arbitrarily. Then, knowing our Galactic coordinate system, we can use some basic trig to calculate the longitude and latitude of any given pixel. We need to get R first, but this is easy because we know the position relative to the galactic centre gc :

I work in degrees, hence the +90 for convention. The pixel positions xyz must be in physical units (kpc). The tan2 function is a wonderful programmatic convention that simplifies things enormously. Using the usual arctan returns values ±90 degrees, since there's a degeneracy in the tan function. Atan2 gets around this by providing two values, returning values ±180 degrees, which is exactly in accordance with Galactic data gridding conventions (we could convery this easily enough to the range 0-360 if we wanted to, but there's absolutely no need).

All we need now is the line of sight velocity. We can get that by rearranging the earlier equation to calculate R :

Since the original PPV data is gridded in a nice friendly way, the hard part's over. Now that we know the longitude, latitude, and velocity of a pixel in the PPP cube, it's trivial to convert this to the pixel in the PPV cube - remember, the original data has pixel size of constant latitude/longitude/velocity.

Voila. We can now extract the corresponding flux value and create a fully sampled PPP cube.

What to do if your data set is feckin' enormous

But wait ! There's one extra complication. If we want to use the LAB data, we can go right ahead and use the final code. Of course, it's always going to be better to use the HI4PI data, but this is difficult to work with because of its gargantuan 35 GB size. The astropy "pyfits" module is not at all good at dealing with large data sets, so we'll need to convert it into a format it can handle. We can do this using the much older miriad software, which was written from an era when a 100 MB file was considered hefty. Consequently it's massively superior in terms of memory management and can process 35 GB files without breaking a sweat, even on a system with 16 GB RAM.

This bit is trivial. First, we convert the FITS file to miriad's own format using the FITS task. Next, we use the same task to extract individual channels, by setting the region parameter to give single-channel slices (unfortunately, this only works when converting from miriad->FITS, not the other way around, which is why we had to convert the file). Annoyingly, miriad insists on producing FITS files of not one but two channels. We can either accept this and have the script only work with the first velocity channel of each cube (it's super simple to slice the data for this), or first process the files and save ourselves from an extra 35 GB of data we don't actually need. Both steps are easy anyway.

Right. We're pretty much there. We've converted our 35 GB single data file into 944 smaller, more manageable files. All we need to do now is modify our PPP code to work with multiple files - and here it is. As a result of all this, we get the following :

Ta-da ! Lovely. Doesn't quite eliminate all of the artifacts, but we can get rid of those in the visualisation stage.

(For the enthusiast : we could in principle go through every pixel in the PPP cube and open the necessary FITS file every time, but this is hugely inefficient - it means opening the files hundreds of millions of times. I estimated it would take the code 4 months to complete, which made me sad. So instead the code precalculates which pixels it needs to extract for each file. It then orders the list, opens each file, extracts all the pixels it needs from that file, and moves on to the next one. This means it only needs a maximum number of 945 file-opening operations and runs in an hour or so. An extra complication is that a list of hundreds of millions of entries starts to cause memory issues, so the code can work in chunks. The user then has to specify the pixel range of the PPP cube they want to search.)

Making things look pretty

For visuals, we have two options. We can either make a volume render using FRELLED, or we can try and pick out the major features using isosurfaces. Volume renders look prettier and use all of the data, but isosurfaces can make it easier to reveal the important structures and are small enough to display as interactive on-line sections of a web page.

Volume rendering via FRELLED need not be explained in any detail here. Let's skip straight to the render :

Sun position in yellow and the galactic centre in green.

What about isosurfaces ? These are something I've struggled with for a while. Eventually I stumbled on a couple of different Python modules that specialise in this. The one I'm using is scikit-image, which can produce meshes in a format that Blender can recognise. It's also fast and deals with large, complicated meshes pretty darn well. It's not 100% foolproof - sometimes meshes have their normal vectors pointing the wrong way, but generally this is easy to fix manually.

(I tried other solutions - extensively - like using metaballs and point cloud skinning scripts, and I'd strongly advise everyone else not to try this. They just don't work very well, at best giving ugly results and at worst being useless and inaccurate. Use a dedicated module and save your sanity !)

The code to generate isosurfaces is a bit of a hack at this stage - incorporating it into the next generation of FRELLED is definitely happening, but that's still a ways off. But for now, it works. You'll need this .blend file (containing an internal script and some pre-set materials) and this external Python script, plus this readme file. Eventually I'll make something less hacky, but not today.

Last but not least - exporting to the web. I wanted to use Sketchfab, which I've been very impressed with, but then I discovered it has a stupid 50 MB file size limit. So I spent a weekend investigating Blend4Web, which is free and totally awesome. It's one of those nice things that just works. So here's an interactive 3D model of the HI content of the Milky Way. It'll work on mobiles (it even has a VR mode option !) but it's better on PC - the labels tend to get cut off on a phone. Click on the buttons in the top left panel to toggle different components, and on the annotations for a bit more information.

It's a 28 MB file, may take a few minutes to load, and Blogger won't let me
embed it correctly, so click here for the interactive version.

It's far from perfect yet - the bloom effect is a bit strong and the anti-aliasing is lousy. But this is my first attempt, so I'm pretty pleased with it. Expect updates on this and lots more interactive content.

So that's it : you now know absolutely everything about how to map the hydrogen content of the Milky Way disc. In a future post I'll look at how to do something similar for the surrounding clouds, which are a lot more fun in 3D because they're found across the entire sky.