Density: Trajectories are often represented as a collection of observations ordered by time. They are dense because, where traditional data sets would have a single observation, trajectories have several.

Non-stationarity: Trajectories often change over time without clearly delineated shifts. This makes it difficult to characterize and compare data via traditional statistics.

Time Warping: Trajectories often describe identical behavior occurring at different speeds. This can cause bias because slower trajectories often have more observations.

Below, we share a recent use case where we analyzed flight trajectories from commercial airline flights in order to automatically find significant heading changes. We used a combination of signal processing and machine learning to deal with the above three challenges. There is nothing that we found that a dedicated analyst couldn’t find. The real novelty of this work is the scale and speed at which we can find these patterns.

Problem

We used training data that covered most of the world over a one-week period. This contained 1.5 million trajectories, with a total of 235 million individual observations.

Approach

With any machine learning project there are, in general, two ways to proceed: supervised or unsupervised. Given the characteristics of our trajectory data, supervised learning seemed like a poor approach. The volume and density of the data would make it difficult to create a representative labeled data set, and the non-stationary, time-warped nature of the data would make it difficult to learn representative input features.

At the same time, a fully unsupervised approach also didn’t seem like a good idea. Such an approach would likely cluster and fit the data well, but it would give little direction on how to evaluate the results or iteratively improve the underlying model.

In the end we used a semi-supervised approach. To fit the data—and all the oddities it might contain—we used an unsupervised method: mean shift clustering. Then, to evaluate the model, we created a small sample of obvious heading change locations. From these we could compute approximate estimates of precision and recall for the learned mean shift clusters.

Mean Shift Clustering

To get a feel for mean shift clustering, and why we chose it over other unsupervised methods, we can look at the scikit-learn example image to see the results of various clustering algorithms on test data (algorithm names are listed across the top of the image).

First, we wanted a method where the number of clusters wasn’t a parameter of the model. Given one week of world-wide commercial aircraft trajectories, predicting the correct number of turning clusters seemed undoable. This very quickly narrowed the field to DBSCAN and mean shift.

To understand the implications of this choice we can look at the last row in the cluster samples. The methods that use cluster count as a parameter find that number of clusters in the square data—despite the square, for our purposes, being a single cluster (e.g., Mini-batch k-means in the first column finds three clusters in the square). The all-blue squares at the bottom of those columns show that only mean shift and DBSCAN identify the square as a single cluster.

Next, we wanted a method which learned a centroid. The idea here was that, for any given turning location, aircraft would be observed at different points of their turn rather than exactly on the location. By finding the centroid of these observations, no single observation had to be the perfect representation of the turn. Instead we could use many observations from separate trajectories to get a much better estimate.

With these two requirements, mean shift was the only choice. Unfortunately, according to scikit-learn, mean shift doesn’t scale as well as other clustering methods. To address this we made a few modifications to the algorithm (detailed below) to get acceptable performance.
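To make this concrete, here is a minimal scikit-learn sketch (the two-blob data is made up to stand in for turn observations) showing that mean shift takes a bandwidth rather than a cluster count and returns learned centroids:

```python
import numpy as np
from sklearn.cluster import MeanShift

# Toy stand-in for turn observations: two dense blobs of lat/lon-like points.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=[3.0, 3.0], scale=0.1, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# Bandwidth is the only spatial parameter; no cluster count is supplied.
model = MeanShift(bandwidth=0.5).fit(X)

print(model.cluster_centers_)       # learned centroids, one per discovered cluster
print(len(model.cluster_centers_))  # 2
```

Note that the centroids come for free, which is exactly the property we wanted for estimating turn locations from scattered observations.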

Data Pre-processing

To prep such a large amount of data we used Spark to parallelize computation. We were interested in two things:

Removing potential sensor errors from the data

Selecting only the data that was relevant to the task

To remove errors from the data we looked to signal processing. The field of signal processing has a large collection of error correction tools called filters. These were originally designed for real-time electronic signals such as phone calls, but have been adapted to many other contexts since then. Most of the basic filters are easily implemented in a Spark query. The three we tested were a Gaussian filter, a median filter and an alpha-trim filter. Below we provide a figure that shows how these three filters would change the input of a theoretical data feed.
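As a sketch of how simple these filters are, here are pure-NumPy versions of the median and alpha-trim filters (the window size and trim depth are illustrative choices, not the values we used; a Gaussian filter would instead weight the window by distance):

```python
import numpy as np

def median_filter(x, k=5):
    """Slide a window of width k and replace each point with the window median."""
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    return np.array([np.median(xp[i:i + k]) for i in range(len(x))])

def alpha_trim_filter(x, k=5, trim=1):
    """Drop the `trim` smallest and largest values in each window, then average."""
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    out = []
    for i in range(len(x)):
        w = np.sort(xp[i:i + k])
        out.append(w[trim:k - trim].mean())
    return np.array(out)

# A smooth signal with one sensor spike: both filters remove it outright.
signal = np.ones(20)
signal[10] = 100.0                 # simulated sensor error
print(median_filter(signal)[10])   # back to 1.0
```

Both translate naturally into Spark window queries, which is what made them attractive at our scale.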

After applying the filter, we then wanted to narrow the data down to only those observations that indicate a turn is occurring. In fact, by clustering over all individual observations that are involved in a turn, rather than first trying to summarize a turning sequence in a trajectory as a single “turn”, we can work around the time warping challenge. That is, we are now working with single observations, rather than tracks.

To narrow the data, we used Spark DataFrame window functions (also known as analytic functions in SQL). These functions allow data to be sorted and partitioned in a single pass so that an observation can be compared to its immediate predecessor. This let us quickly find all the locations in a trajectory where an aircraft’s heading changed.
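We can’t reproduce the Spark job here, but the same lag comparison can be sketched in pandas (the column names and the 20-degree threshold are made up for illustration; in a Spark DataFrame this would be `lag("heading")` over `Window.partitionBy("track_id").orderBy("ts")`):

```python
import pandas as pd

# Hypothetical observations: one row per (trajectory, timestamp, heading in degrees).
df = pd.DataFrame({
    "track_id": ["a", "a", "a", "b", "b"],
    "ts":       [1, 2, 3, 1, 2],
    "heading":  [90.0, 90.0, 135.0, 10.0, 12.0],
})

df = df.sort_values(["track_id", "ts"])
# Compare each observation to its immediate predecessor within its trajectory.
df["prev_heading"] = df.groupby("track_id")["heading"].shift(1)
# A real version would wrap this difference at 360 degrees.
df["delta"] = (df["heading"] - df["prev_heading"]).abs()

# Keep only observations where the heading changed by more than a threshold.
turns = df[df["delta"] > 20.0]
print(turns[["track_id", "ts", "delta"]])
```

Because the sort and partition happen in a single pass, this scales well even over hundreds of millions of observations.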

Mean Shift Speed Modifications

Because mean shift clustering typically doesn’t scale well, we sought to bound N² (where N is the number of data points) in order to control our run time. To do this we reduced the full observation set down to a smaller set of evenly spaced bins. We then used scikit-learn’s Kernel Density Estimators to assign a weight to each bin. The following shows how KDE binning can reduce 1000 observations to 100.
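A minimal NumPy sketch of the binning idea (the bandwidth and bin count here are illustrative choices): each evenly spaced bin center gets a weight equal to the average Gaussian kernel mass the raw observations contribute to it, and mean shift then runs on the 100 weighted centers instead of the 1000 raw points:

```python
import numpy as np

rng = np.random.default_rng(1)
obs = rng.normal(loc=5.0, scale=1.0, size=1000)  # 1000 raw observations

# 100 evenly spaced bin centers spanning the data.
centers = np.linspace(obs.min(), obs.max(), 100)

# Gaussian KDE weight for each center (bandwidth h is an assumed value).
h = 0.3
weights = np.exp(-0.5 * ((centers[:, None] - obs[None, :]) / h) ** 2).mean(axis=1)

# Mean shift can now run on 100 weighted points instead of 1000 raw ones.
print(centers.shape, weights.shape)
```

The heaviest bins sit where the raw data is densest, so little clustering signal is lost in the reduction.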

Mean Shift Accuracy Modifications

The mean shift implementation in scikit-learn uses a flat average of neighbors (sometimes referred to as a flat kernel). This approach works well if training data is dense at all clusters. Unfortunately, given that our data was for the whole globe and included high and low traffic turning points, this assumption didn’t seem right.

To improve our model’s accuracy, we replaced scikit-learn’s flat kernel with a Gaussian kernel that is less sensitive to outliers. The graph below shows raw input data containing an outlier and how mean shift’s centroid changes for a flat kernel and a Gaussian kernel.
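A toy one-step illustration of the difference (the points and bandwidth are made up): with an outlier inside the window, the flat-kernel update is dragged toward it, while the Gaussian-weighted update barely moves:

```python
import numpy as np

# Points near a true turn at 0.0, plus one outlier still inside the bandwidth.
pts = np.array([-0.2, -0.1, 0.0, 0.1, 0.2, 4.0])
center, bandwidth = 0.0, 5.0

in_window = pts[np.abs(pts - center) <= bandwidth]

# Flat kernel: every in-window neighbor counts equally.
flat_update = in_window.mean()

# Gaussian kernel: weight falls off with distance, so the outlier barely counts.
w = np.exp(-0.5 * ((in_window - center) / (bandwidth / 5.0)) ** 2)
gaussian_update = (w * in_window).sum() / w.sum()

print(flat_update, gaussian_update)  # the Gaussian update stays much nearer 0.0
```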

Results

The end result was an algorithm that used a median filter up front to clean the data, followed by KDE binning to reduce the amount of data and finally a Gaussian weighted mean to give more robust statistics. For the data set we describe above the entire end-to-end process takes 40 minutes to complete, and we feel the results speak for themselves. Our final model found turning locations 42 times more frequently than random guessing alone, while missing only half as many.

If you’d like to do work like this, we’re hiring! Check out our openings at www.ccri.com/jobs.

Odds and Ends
http://www.ccri.com/2019/01/24/odds-and-ends/ (Thu, 24 Jan 2019)

It’s been a busy winter here at CCRi with expansion of all types going on. We are growing with new and expanded projects, our operations practices are getting fine-tuned, and we continue to grow physically here at our Sachem Village headquarters, constructing additional space in our newest building.

Happy Birthday, Navy!

Last October, the 13th to be exact, proud Navy veteran Mike K. brought in a large sheet cake and a sword(!) to celebrate the US Navy’s 243rd birthday. Veterans from all services (as well as civilians) got to enjoy some cake.

Mike draws First Blood.

Jeff G (also a veteran) takes a stab at it.

Mike K has learned to relax his personal dress code considerably since his Navy days, sometimes living the barefoot/standing desk lifestyle.

NeurIPS

The love of continual learning and research is always in play here at CCRi. Tim E and Alex P took a trip to wintry Montreal last month to attend the Conference on Neural Information Processing Systems (NeurIPS), sharing the love with fellow machine learning and computational neuroscience enthusiasts. While there, Tim E chatted up Soumith Chintala (of Facebook), the lead developer of PyTorch, and Alex P was lucky enough to meet Geoff Hinton, the “grandfather” of neural networks.

Alex P with “The Grandfather” Geoff Hinton

Tim E chats with Soumith Chintala

General CCRi silliness…

Sue and James

Jereme mesmerized by floating blue dogs

Bread Lines: Food Truck lunch line

Ben R threatens to Take Over the World!

Masked Bandit Frank rides off into the sunset

Bub-bye, Big HP! Sorry to see you go. (Not)

Cat Feeder!

The ever-inventive brain of the data scientist/software engineer never sleeps. Our own Eric N has cleverly devised a rustic yet effective dry cat food feeder that runs on a timer and delivers a precise amount of cat food cleanly and efficiently. We didn’t have cat food so a handful of Cocoa Puffs were substituted. The timer was set. We sat, had a beer and waited, along with Jordan’s dog, Max. It worked! Prepare the assembly line.

Waiting for the cat feeder to release the kibble.

Bespoke cat feeder. Order yours today!

Success! Cat Feeder Celebration.

CCRi Holiday Party

It wouldn’t be winter without our annual swanky CCRi Holiday Party at UVA’s Alumni Hall. We put on our sparkles (more or less) and enjoyed sophisticated dining and social interaction. Here are a few moments captured from that event.

Using Attribute Values to Tune Vector-Based Entity Representations
http://www.ccri.com/2018/11/27/using-attribute-values-tune-vector-based-entity-representations/ (Tue, 27 Nov 2018)

Representing entities by their attributes, like just about any database does, makes it easy to manipulate information about these entities using their attribute values—for example, to list all the employees who have “Shipping” as their “Department” value. The popular machine learning approach of representing entities as vectors created by neural networks lets us identify similarities and relationships (embedded in a “vector space”) that might not be apparent with a more traditional approach, although the black box nature of some of these systems frustrates many users who want to understand more about the nature of the relationships. Our experiments demonstrate methods for combining these techniques that let us use attribute values to tune the use of vector embeddings and get the best of both worlds.

In order to implement this, your data must be represented as tensors, which are a generalization of scalars, vectors, and matrices to potentially higher dimensions. For many applications, this is already the case. If you are looking at the history of a stock’s price, that data can be represented as a 1-tensor (vector) of floating point values. If you are trying to analyze a photograph, it can be represented as a 3-tensor containing the red, green, and blue values of each pixel. For other applications, such as dealing with text or entities (persons, places, organizations), this gets a little trickier.

The simplest way to represent such data as a tensor is using a one-hot (dummy) encoding.

This has the advantage of uniquely encoding each entity in your data, but it has the disadvantage of being memory-intensive for large datasets. (The memory footprint can be reduced if your tensor library supports sparse tensors, but this typically limits the variety of operations that can be performed.) This approach also suggests that all entities are orthogonal—in other words, it assumes that the words puppy and dog are just as dissimilar as potato and dog.
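A tiny sketch makes the orthogonality problem concrete (the three-word vocabulary is invented for illustration): every pair of distinct one-hot vectors has dot product zero, regardless of how related the words are.

```python
import numpy as np

vocab = ["dog", "puppy", "potato"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """One slot per vocabulary entry; exactly one slot is set."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

# Every pair of distinct words is orthogonal: dot product 0.
print(one_hot("dog") @ one_hot("puppy"))    # 0.0 — as dissimilar as potato vs dog
print(one_hot("dog") @ one_hot("potato"))   # 0.0
```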

Entities can also be represented in terms of their attributes or properties, if such data is available. Such encodings have the advantage of being entirely transparent, which is useful when using them as features for other models. This lets us manually scale and weight different features to emphasize or de-emphasize their impact on the model.

You can also use an unsupervised embedding approach to represent the data. Techniques such as GloVe, word2vec, and fastText represent each entity as a point in a continuous vector space such that similar words lie near one another. In the following, columns represent coordinates in this vector space:
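As a toy illustration of such a space (the 2-d coordinates are made up, not real GloVe or fastText output), nearness is usually measured with cosine similarity, and related words now score higher than unrelated ones:

```python
import numpy as np

# Invented 2-d embeddings standing in for trained word vectors.
emb = {
    "dog":    np.array([0.9, 0.1]),
    "puppy":  np.array([0.8, 0.2]),
    "potato": np.array([0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Unlike one-hot vectors, similar words now land near each other.
print(cosine(emb["dog"], emb["puppy"]) > cosine(emb["dog"], emb["potato"]))  # True
```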

What if we want to combine the use of attributes with the use of vector space embeddings? The obvious solution is to simply concatenate the vectors, but this has some complications due to information being duplicated. Information that is explicitly encoded in the attribute vectors is also implicitly encoded in the embeddings, which takes away our ability to vary the emphasis on certain features if we want to tune the model’s usage of different attributes. Essentially, what we want is an attribute vector that explicitly represents the known properties of an entity and an embedding vector containing any latent information that is “left over”.

To accomplish this, we combine the principles from two different types of neural network architectures: generative adversarial networks (GANs) and autoencoders. As Wikipedia describes it, an autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. An autoencoder aims to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction. It learns to encode and transform the inputs so that the maximum amount of information is retained by ensuring that the encodings can be used to reconstruct the original input. In our experiments, we also use a simple linear mapping that attempts to reconstruct the attribute vector from the latent encodings. The autoencoder tries to maximize this attribute-reconstruction loss, driving attribute information out of the encodings, which makes these two networks “adversarial”.

Specifically, the autoencoder consists of an encoder network ϕ and a decoder network ψ, with one key difference: the attribute vector is concatenated to the encodings before being passed through the decoder network. Since the objective is to remove attribute information from the encodings, we cannot expect the decoder to reconstruct the original input from the encodings alone. By including the raw attribute values in the input of the decoder, we give it access to all of the information necessary to reconstruct the original input.

(L is the Euclidean reconstruction loss.) The other component of our architecture is a discriminator, which attempts to learn a mapping between the encodings and attribute labels.

Here H is the cross-entropy loss.

During training, the autoencoder tries to simultaneously minimize its reconstruction loss while maximizing the discriminator’s cross-entropy loss.

In short, we are trying to learn a new representation of the embeddings that cannot be mapped back to the attributes, but when combined with the attributes can be mapped back to the original embeddings.
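The plumbing can be sketched as a single forward pass (random untrained weights and assumed dimensions; a real implementation would train these networks with the two losses described above):

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_lat, d_attr = 300, 50, 10   # assumed dimensions

# Random weights stand in for the trained encoder phi, decoder psi, discriminator.
W_enc = rng.normal(size=(d_emb, d_lat))
W_dec = rng.normal(size=(d_lat + d_attr, d_emb))  # decoder sees [encoding; attributes]
W_dis = rng.normal(size=(d_lat, d_attr))

x = rng.normal(size=d_emb)          # original fastText-style embedding
a = rng.integers(0, 2, d_attr)      # attribute vector (e.g. one slot for "canine")

z = np.tanh(x @ W_enc)                    # phi: latent encoding, attributes squeezed out
x_hat = np.concatenate([z, a]) @ W_dec    # psi: reconstruction needs the attributes back
a_hat = z @ W_dis                         # discriminator tries to recover attributes from z

# Training (not shown) minimizes ||x - x_hat|| for the autoencoder while
# maximizing the discriminator's loss on a_hat: the adversarial part.
print(z.shape, x_hat.shape, a_hat.shape)
```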

Example / Results

To test how well this network works, we use fastText word vectors trained on the Wikipedia corpus as our entity embeddings and WordNet properties as our attributes. Specifically, we consider all of the words in WordNet with the property “animal”. As a very simple test, we try to isolate information about the label “canine” from the fastText embeddings. The embedding space diagram below shows them as yellow circles outlined in black.

As we would expect, initially words with the label “canine” end up mostly concentrated in small clusters in the embedding space:

However, after training the network and passing the embeddings through it to be encoded, words with the label “canine” end up fairly uniformly distributed in the space:

We can further interpret the effect of this transformation by looking at nearest neighbors in the embedding space. The most similar words to “dog” in the original fastText embeddings are all types of dog:

Puppy

Dachshund

Poodle

Coonhound

Doberman

Hound

Terrier

Retriever

In the transformed embeddings, with canine information removed, the nearest neighbors are a mix of types of housepets and dogs:

Cat

Hamster

Doberman

Kitten

Puppy

Rabbit

Mouse

Dachshund

We also want to ensure that the inclusion and exclusion of canine-ness has little to no effect on the nearest neighbors of non-canine entities. In the original space, the nearest neighbors to “crab” are as follows:

Lobster

Shrimp

Crayfish

Clam

Crustacean

Prawn

Mackerel

Octopus

And in the transformed space:

Lobster

Clam

Shrimp

Octopus

Prawn

Geoduck

Oyster

Squid

Ideally, these lists would be identical, so the result isn’t perfect, but they are quite similar.

Alternatively, instead of ignoring canine-ness we can emphasize that aspect of our representations. In the original fastText embeddings, the nearest neighbors to “wolf” are the following:

Grizzly

Coyote

Raccoon

Jackal

Dog

Bear

Bison

Boar

This result is understandable. Aside from the generic “dog” these are mostly wild animals that inhabit similar biomes. What if we want to emphasize the fact that wolves are canines? We can increase the weight of that dimension of our hybrid embedding, which results in the following:

Coyote

Jackal

Redbone

Dog

Fox

Raccoon

Bluetick

Doberman

Now, wild dogs such as coyote, jackal, and fox are all considered more similar to wolf, while raccoon falls from 3rd to 6th, and grizzly, bear, bison, and boar fall off completely. The tuning of that dimension has improved the nearest-neighbors calculations.
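A toy sketch of this re-weighting trick (the vectors, neighbor words, and attribute dimension are invented for illustration): scaling the explicit attribute dimension before the distance computation changes which neighbor wins.

```python
import numpy as np

def nearest(query, others, attr_weight=1.0, attr_dim=0):
    """Rank neighbors by distance after scaling one attribute dimension."""
    scale = np.ones(len(query))
    scale[attr_dim] = attr_weight
    dists = {w: np.linalg.norm((query - v) * scale) for w, v in others.items()}
    return sorted(dists, key=dists.get)

# Hybrid vectors: dimension 0 is the explicit "canine" attribute (toy values).
wolf    = np.array([1.0, 0.2, 0.9])
vectors = {
    "grizzly": np.array([0.0, 0.3, 0.9]),   # not canine, but similar latent features
    "jackal":  np.array([1.0, 0.9, 0.1]),   # canine, different latent features
}

print(nearest(wolf, vectors, attr_weight=1.0))  # latent similarity dominates
print(nearest(wolf, vectors, attr_weight=5.0))  # canine-ness dominates
```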

Next, we repeated the training described earlier, but with more attributes being isolated: canine, feline, and mammal. We can now have some fun with finding the nearest neighbors to words while altering their attributes. For example, what are the nearest neighbors to “dog” if, instead of being canine, it were feline?

Feline

Tabby

Angora

Tigress

Kitty

Bobcat

Lioness

Cougar

The five most similar entities to a dog, if it were feline instead of canine, are types of housecat! Similarly, we can look at the most similar entities to “dog” if it were not a mammal:

Pet

Animal

Chicken

Goose

Mouthbreeder

Tortoise

Newt

Goldfish

This results in non-mammalian animals that are either common farm animals or pets.

Conclusion

We introduce a method for combining unsupervised representations of data with attribute labels into a unified hybrid representation. A novel neural network architecture containing an autoencoder and GAN effectively isolates these attribute labels from the rest of the latent embedding. Using fastText word vectors as embeddings and WordNet hyponym relations as attributes, we demonstrated the viability of this approach. Finally, we saw the power of the resulting hybrid embeddings for semantic search, where certain aspects of a query can be (de)emphasized and hypothetical “what if” queries can be made where properties are toggled on and off.

Over the past few years, the use of embeddings to identify entity similarity has been a great benefit to all kinds of search applications, because the use of semantics can do much more than simple string searches or the use of labor-intensive taxonomies. The incorporation of attribute labels into this system—especially when this includes the ability to manually tune their effect—brings even greater power to this approach.

No Flo, We Won’t Go
http://www.ccri.com/2018/10/01/no-flo-wont-go/ (Mon, 01 Oct 2018)

Summer finally wound down and crept very slowly into fall. And, boy, so far it’s been a wet one! The majority of days in September experienced everything from mild mist to heavy downfalls as one storm system after another moved through our area. Hurricane/TS Florence initially caused us some concern as we feared for potential water damage, but we dodged a big bullet. No matter. The good folks of CCRi will not be deterred from enjoying life and having fun. Here’s what we’ve been up to during the late summer/early fall of 2018.

Ice Cream Truck!

We got a pleasant surprise in mid-August as an ice cream truck rolled into our back parking lot and opened up for business. Nothing beats the doldrums of August like some sweet, creamy ice cream, especially when it’s waiting for you right outside your office window (while the inane music plays its continuous loop).

Boys & Girls Club Symposium

CCRi cyclists take the annual Boys & Girls Club Cycling Challenge pretty seriously and that extends to fund-raising for the organization. In addition to some sponsored pancake breakfasts and ice cream treats, Andrew H, Jordan and Mary decided to throw a money-making symposium to rake in even more cash via voluntary contributions in exchange for delicious brats, chips and Jordan’s own handmade chocolate chip cookie ice cream sandwiches. A fun time for a good cause.

All Hands

It’s September, and that means it’s time for another BIG company meeting where we all get together to review our past accomplishments and carve out where we want to go in the future. This year our Core Values were fine tuned even more, promising lots of solid growth and continued work doing cool things that make our customers happy.

Fall Picnic

Immediately after our All Hands meeting we headed straight to Pen Park. An advance crew arrived early to get the grills fired up and sizzling with brats, chicken, fish and Philly Cheese-steaks. A big shout out to Andrew H, Jordan and Eric N for their involvement in the cooking operations, and to Louis for his very generous contribution of meats and seafood. (Not to mention his world-famous multi-layer bean dip!) This year Joseph was the sole contributor to the brewing endeavor, thus securing himself as the first place champion straight across the board. His brews included a delicious “clean, malty” Oktoberfest lager with a 6.4% ABV. He also produced a sour beer (4.5% ABV) and even tried his hand at sake (15% ABV), which he claimed “tastes like rice”. Thanks, as always, for sharing your brews with us!

Boys & Girls Clubs Cycling Challenge

This year bad weather (the aforementioned Hurricane/TS Florence) pushed the annual bike ride out a week, and shortened the ride offerings to the 50 mile, the 25 mile and the 11 mile family ride (the 100 and 75 mile rides were omitted). Turnout was lighter than in previous years, but that didn’t stop the CCRi faithful from showing up and getting on their bikes, despite the drizzle. We raised $1270 for the kids. And Cory E’s daughter Lyla was the big winner in the personal fund-raising department, bringing in $230 for the cause!

Trivia

And the trivia team is getting its groove back! After a long summer slump and spotty attendance, the wind is back in our sails and we are racking up the victories. We added some newbies like Cory, Zach, Will and Cole to our fold, and look forward to a long series of fall and winter winnings.

Steering Ships Around Hurricane Florence
http://www.ccri.com/2018/09/18/steering-ships-around-hurricane-florence/ (Tue, 18 Sep 2018)

The image below is a looping animation of Optix.Earth’s rendition of ship traffic around the southeastern United States from Tuesday to Thursday of last week. It shows that all the ships are avoiding an area about 800 miles wide that was moving towards the North and South Carolina coast: Hurricane Florence.

exactEarth is a leading provider of global maritime vessel data for ship tracking and maritime situational awareness solutions. They pioneered a powerful new method of maritime surveillance called satellite AIS (S-AIS) to deliver data on maritime behaviours across all regions of the world’s oceans. S-AIS collects AIS signals from a constellation of satellites, allowing the data to be unrestricted by terrestrial limitations often imposed by distance, weather, mountains, and other factors. Optix.Earth ingests this data and provides real time visualization of the ships’ locations. In this case, ships are color-coded by their country of origin; dark purple ships, which you can see more of traveling into and out of the Mississippi River, are registered in the U.S.

Clicking any dot on the Optix.Earth interface displays a panel with more information about that ship, as with this Hong Kong-based oil and chemical tanker bound for Corpus Christi:

Available data goes much further back than Tuesday. Below, you can see that on September 1st ships started to avoid the area above the words “Atlantic Ocean” and then avoided a bigger and bigger section of the ocean as that area moved toward the U.S. coast. As it got closer, a new avoided area opened up in the same place: Hurricane Helene, which was headed north.

The speed of the slider that drives the animation can be easily controlled, and the animation doesn’t have to be automatic; you can drag the slider back and forth with a mouse for interactive exploration.

These visualizations are based on AIS data, but Optix.Earth can take advantage of and fuse data from a range of both real-time and historical sources. CCRi’s production global-scale analytics can correlate and track entities across multiple sources of data and multiple modalities of data such as geospatial data, text, and imagery.

Want to learn more about how Optix.Earth can help you gain new insights from your spatio-temporal data? Fill out the form on https://optix.earth and we’ll be in touch.

You Don’t Look a Day Over 29…
http://www.ccri.com/2018/08/01/you-dont-look-a-day-over-29/ (Wed, 01 Aug 2018)

CCRi Birthday

This past June CCRi celebrated a big birthday: 29 years old! We’ve done a lot of growing up since the company started as a little three-person start-up out of UVA, arriving at our current location in Sachem Village by the year 2000. We began with one building and are now up to four, not counting satellite offices in NOVA and Colorado. And we’re looking forward to new challenges!

Happy 29th Birthday, CCRi!

Dave Guards the Cake from the Hungry Crowd

Marauders Austin & Jereme: We Want Cake!

Book Club

Our small, but steady monthly book club continues on, reading and evaluating a wide variety of books. Recently we read “Childhood’s End” by Arthur C. Clarke and “Tuck Everlasting” by Natalie Babbitt. Our current discussion will focus on Mark Twain’s “The Mysterious Stranger.”

Unexpected Whimsy: Michael Plays his Harmonica

Literary Dueling Banjos: Chris & Bryce

Llama to Knowledge

A couple of CCRi’s project teams decided to check out downtown’s “Wine & Design” for a little after hours bonding and art making. Decide which Llama is YOUR favorite!

Alex P

Tim

Sue

Casey

Alex L

Monica

Vicki (Casey’s wife)

Super Symposium

We love throwing parties and eating food around here so it was time for another Super Symposium, this time featuring delicious barbecue from Red Hub. A grand time was had by all.

Data fusion for sociocultural place understanding using deep learning
http://www.ccri.com/2018/07/10/data-fusion-sociocultural-place-understanding-using-deep-learning/ (Tue, 10 Jul 2018)

The International Society for Optical Engineering, or SPIE (formerly known as the Society of Photographic Instrumentation Engineers), has published a range of peer-reviewed scientific journals for over 50 years. A recent issue of their Proceedings of SPIE, which publishes research presented at their recent conference in Orlando, included an article by CCRi data scientists Jake Popham, Mike Forkin, Nick Hamblet, and Bryce Inouye titled “Data fusion for sociocultural place understanding using deep learning.”

Their article describes an ongoing project at CCRi that uses the Accumulo key-value store, the Rya RDF triplestore, the GeoMesa spatio-temporal database, and the Solr free-text indexing system to combine data from OpenStreetMap, the GDELT Global Knowledge Graph, Twitter, DBpedia, and overhead imagery into a knowledge graph that enables the identification of connections, patterns, and relationships between pieces of data from these disparate sources. It does this by using deep learning techniques to create feature vectors that encapsulate the contributions of the various data sources into a single fixed-dimension vector per entity, where entities are people, places, groups, Tweets, concepts, and more. Feature vectors are indexed directly to enable a nearest neighbor lookup, but they are primarily intended as compact knowledge representations for use in downstream modeling that will produce socio-cultural outputs, spatial and otherwise.

The paper describes how the embedding space generated by this knowledge graph can then be used by models such as TransE and TransH for tasks such as land use prediction:

This combination of socio-cultural data from sources such as Twitter with the geospatial storage and analytics that GeoMesa provides is typical of much of the deep learning data fusion work that we’re doing at CCRi, and we’re happy that SPIE gave us the opportunity to spread the word!

This material is based upon work supported by the Engineering Research and Development Center (ERDC) – Construction Engineering Research Laboratory. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the ERDC-CERL.

How to use TransE effectively
http://www.ccri.com/2018/06/27/use-transe-effectively/ (Wed, 27 Jun 2018)

TransE, or Translating Embeddings for Modeling Multi-relational Data, lets us embed the contents of a knowledge graph by assigning vectors to nodes and edge types (a.k.a. predicates) and, for each subject-predicate-object triple, minimizing the distance between the object vector and the translation of the subject vector along the predicate vector. It’s an appealing algorithm for embedding multi-relational data due to its simplicity, efficiency, and quality. However, it can sometimes behave in mysterious and counter-intuitive ways.

To try to isolate and unravel some of these issues, I ran TransE in two dimensions for several small graphs. The advantage of this approach is that we can visualize and animate the whole graph during training to gain intuition for how TransE works. I also plotted the predicate vectors separately to give us a good view of how they change over time. In the examples below, the entity embeddings along with the graph edges appear on the left plot and the predicate vectors appear on the right plot.

Path graph

Paths are probably the simplest possible family of graphs, and they are obviously well-suited to translational embeddings. No surprises here; TransE lays out the graph perfectly, with each edge aligned with the predicate vector:

Path graph (two predicates)

Next, I tried a path graph with two distinct predicates alternating along the path. Again, TransE lays out the path vertices in order, with each edge aligned perfectly with its corresponding predicate vector:

Cycle graph

Next I tried a cycle graph. As expected, the predicate vector goes to zero because it is getting translational updates from all different directions. Still, the layout works pretty well most of the time:

However, sometimes it has to work harder…

and still gets stuck:

This figure-eight configuration seems to be a rather persistent local minimum for the loss function. The predicate vector is small but has a consistent direction, and the two gradient updates from the long edges keep canceling out. So I thought, what if we also include all the reversed edges, to give it a chance to build up some updates in the same direction?

Cycle graph (symmetric edges)

With the reversed edges, the predicate vector is no longer stuck in one direction, and the graph looks rounder and manages to wiggle out of the potential well more often:

This result indicates that it might be worthwhile to include the reverse triples during training, especially if the graph is sparse.
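A minimal sketch of that augmentation (the `_inv` predicate naming is my own convention, not from the post):

```python
def add_reverse_triples(triples):
    """For every (subject, predicate, object) triple, add the reversed
    triple under a distinct inverse predicate, so gradient updates can
    accumulate along both directions of each edge."""
    return triples + [(o, p + "_inv", s) for (s, p, o) in triples]

# A directed 3-cycle, like the cycle graph above.
cycle = [("a", "next", "b"), ("b", "next", "c"), ("c", "next", "a")]
augmented = add_reverse_triples(cycle)
print(augmented)  # 6 triples: the original edges plus their reversals
```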

Tree graph

TransE does an admirable job of drawing this tree, even though there is only one predicate. Note that the corruptions seem to provide repulsive forces that push the different branches apart, and these forces accumulate as you go toward the root.

Square grid

Next I tried a square grid with two predicates. The results look pretty good:

There seems to be a clump of five stragglers. What if we freeze the other vertices, turn off the corruptions, and just train those five?

Nice—they find their way with only positive gradient updates. This looks encouraging for the insertion of new entities into the embedding space without retraining the whole thing.
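A minimal sketch of that insertion idea (hypothetical NumPy helper; the actual experiments used PyTorch): apply gradient updates only to the rows of the embedding matrix that belong to the new entities, leaving everything else frozen.

```python
import numpy as np

def sgd_step_subset(embeddings, grads, trainable_rows, lr=0.1):
    """Apply a gradient step only to the listed rows; every other
    embedding stays frozen, as when inserting new entities into an
    already-trained space."""
    updated = embeddings.copy()
    for i in trainable_rows:
        updated[i] -= lr * grads[i]
    return updated

emb = np.zeros((4, 2))   # 4 entities, 2-D embeddings
grads = np.ones((4, 2))  # pretend gradients from positive triples only
new = sgd_step_subset(emb, grads, trainable_rows=[1, 3])
print(new)  # rows 1 and 3 move to [-0.1, -0.1]; rows 0 and 2 stay at 0
```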

In a future posting, we’ll look at what happens when a single predicate links a subject to many distinct objects, at boolean attributes, at three-dimensional embeddings, and at the effect of various PyTorch optimizers on 3D embeddings.

Ridin’ High

May 10, 2018 (http://www.ccri.com/2018/05/10/ridin-high/)

High Bridge Ride

What’s the first thing that comes to your mind when you think of the month of May? Flowers? Parades? Well, if you’re an avid cyclist (and have Frank D in your employ) then the correct answer is The National Bike Challenge! CCRi cyclists participate in this event each year with great enthusiasm and good-natured competitiveness. This year the NBC changed up the format a little: the League of American Bicyclists teamed up with Love to Ride to run the challenge just for the month of May (instead of May through September), take the summer off, and then return for another challenge in September. This certainly makes it easier for people who don’t want to put in daily miles during the blistering heat of summer!

A few intrepid cyclists (namely Frank, Don R and myself) decided to get a jump on the cycling season by participating in the Highbridge Trail Ride to raise money and awareness for the Alzheimer’s Association’s “The Longest Day.” This was the first annual ride, and it was well organized by a lot of friendly people in the equally friendly and inviting town of Farmville, VA. The High Bridge Trail is a wide, mostly flat cycling, walking, and horseback-riding trail that sits on a former rail bed and is currently 31 miles long and growing. It consists mostly of finely crushed gravel, except for the portion that travels directly over the Appomattox River. This is the famous wooden High Bridge portion of the trail that allows for great scenery for many miles. It also served as a great midway point to stop and appreciate the spectacular spring weather we were having that day.

High Bridge Trail Boardwalk

Don Rude Stops for the View

Your Author on the High Bridge Trail

National Bike Challenge

CCRi pulled together three teams for this year’s challenge: Alpha, Beta and Gamma. Of all the dozens upon dozens of team pools in the country, Team Alpha (Frank, Jeff G, me, James C, Don R, Carsten M Matz, and Leon) is currently sitting at number one! Frank D deserves a lot of credit for keeping the wheels of bicycle awareness and activity going.

Look! Up in the sky! It’s a bird. It’s a plane. It’s Cycle Man! (Frank D.)

Interlude

When he’s not making sure that there’s enough beer in the kegs, helping with conference planning and making sure a lot of back-end functions actually work, Andrew H can occasionally be seen with a teddy bear, or two.

Andrew Greets Every CCRi Visitor with a Teddy Bear!

Spring Brewfest

Falling just one day shy of Cinco de Mayo, our annual Spring Picnic and Brewfest felt well entitled to adopt the CdM theme with tacos, guacamole, chips and lots of beer. Salads, desserts, and other potluck dishes rounded out the menu. As usual, Chef Master Kyle outdid himself on his brand spankin’ new grill, serving up grilled flank steak, peppers and hot dogs. On the brewing front this time, due to personal schedules, only two brewers treated us to their fermented handiwork, and we were happy they did. Joseph Featherston produced the most popular brew of the day. His British Golden Ale named “Gold Bullion” (4.8% ABV) was described as “an IPA’s more restrained sibling, using solely Bullion hops to lend a hint of spice and black currant”. It was delicious! In second place was Will Makabenta’s Red Rye Ale (4.5% ABV), another tasty and popular brew, which he named “Chat Clown Rye.” Finally, Joseph also brought in a sampling of his mulberry mead to add a distinct berry flavor to the samplings. And miraculously, unlike in the past, the weather actually cooperated for the entirety of the event, with lots of room to enjoy Frisbee on the large field, and for the children to enjoy the sand pit. Leigh brought her yellow lab puppy “Banks”, who was the recipient of much adoration. A good time was had by all.

Z Earth, it is round?!

May 4, 2018 (http://www.ccri.com/2018/05/04/z-earth-round/)

In which visualization helps explain how good indexing goes weird. The flat-Earthers may be on to something.

GeoMesa, like many other data stores that index geographic data, uses space-filling curves to impose an order on two-dimensional geometries. It’s easy to know that “Virginia” as a string data type follows “Illinois” when sorted alphabetically, but it’s much less obvious whether the multi-polygon that is Virginia’s border should come before or after Illinois’ state boundary in an index. The standard solution is to divide the Earth into discrete cells and then use a fixed pattern to define an ordering among those cells. A space-filling curve is a function that, for a given grid-cell resolution, can define a fixed ordering among the cells; imagine a line that visits every cell in the grid, and on which the relative positions of Illinois and Virginia help to build an index that lets a geospatial database manager find locations more quickly.

The advantage of ordering cells is that the columnar data stores GeoMesa supports—Accumulo, HBase, Cassandra, the S3 file system, and Kafka—provide range queries that are easily built from the linear ordering that the curves impose on grid cells. Looking for Charlottesville, Virginia? No worries; that’s in cell #17[1], right after Venezuela in cell #16 and right before West Africa in cell #18. See the first figure, below, for an illustration of a discretization of the map into grid cells as well as one space-filling curve ordering of those cells.
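As a rough sketch of the idea (a simplified toy version; GeoMesa’s real implementation lives in the SFCurve library and handles many more details), one can snap a longitude/latitude point to a grid cell and interleave the cell coordinates’ bits to get a Z-order index:

```python
def z2_index(lon, lat, bits_per_dim=3):
    """Map a lon/lat point to its cell on a 2^bits x 2^bits grid, then
    interleave the x and y bits (Morton order) into a single index."""
    n = 1 << bits_per_dim
    x = min(int((lon + 180.0) / 360.0 * n), n - 1)
    y = min(int((lat + 90.0) / 180.0 * n), n - 1)
    index = 0
    for i in range(bits_per_dim):
        index |= ((x >> i) & 1) << (2 * i)      # x bits in even positions
        index |= ((y >> i) & 1) << (2 * i + 1)  # y bits in odd positions
    return index

# Nearby points land in the same or nearby cells, so their indices fall
# in narrow ranges -- which is what makes sorted range scans effective.
print(z2_index(-78.5, 38.0))  # a point near Charlottesville, VA
print(z2_index(-78.6, 38.1))  # a few km away: same coarse cell here
```

The cell numbers in the text above come from a different bit count and interleaving order (see the footnotes), so this sketch won’t reproduce them exactly; the interleaving convention chosen here is just one common option.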

A commonly used curve in GeoMesa is the Z-order (or Morton) curve. The SFCurve library—also organized within The Eclipse Foundation’s LocationTech working group, and whose principal contributors, like me, work on GeoMesa and GeoTrellis—contains the code for the Z2, or two-dimensional, geo-only version of the Z-order curve. When we visualize the Z2 progression, we typically draw it on a flat, rectangular longitude-latitude map like this:

The curve’s progression from the South-West corner to the North-East corner is clean, easy to follow, and orderly.

It’s also a lie.

Being a bear of very little brain, it took me longer than the other SFCurve contributors to realize that it matters that the Earth may not be flat. The peril is not that we might fall off the edge; rather, the risk is that we misrepresent the relationship between the polar regions and the equatorial regions. To get a sense for how different the Z2 curve looks in practice, we created the following brief animation[2]. It uses a 9-bit Z2 curve that creates 32 horizontal cells and 16 vertical cells[3] that are 11.25° high by 11.25° wide “squares”. To make the math simpler, we are using a perfect sphere.

Even though the vertices of the curve remain on the surface of the sphere, the edges do not. In fact, the longer the edge, the deeper (closer to the center) it is embedded in the sphere. The smallest Zs almost look like they’re on the surface, but the largest Z, which defines the halves of the map, almost looks like the polar axis. This is a curiosity.

The curve over-represents the polar regions significantly. The top-most and bottom-most horizontal rings cover much less area than do the equatorial rings, but still contain the same number of cells. This is a problem.

How much smaller are polar cells than equatorial cells?

Creating a still, rotated rendering helps to illustrate the problem:

Visually, it is clear that 11.25° squares are no longer meaningful once we’ve used a spherical mapping instead of a flat mapping. This is more than a curiosity, though, because the area of the polar cells is significantly less than the area of the equatorial cells, meaning that a single Z2 index does not carry uniform information: The further the Z2’s cell is from the equator, the more precisely it constrains the member points’ location.

How much? To answer that question, it is helpful to sketch the geometry[4]:

The sketch uses spherical coordinates. The most important labels are these:

r: the radius of the spherical Earth, assumed to be 1.0—the unit sphere—because it washes out when we compute the ratio of the area of a polar cell to the area of an equatorial cell; consequently, none of the equations include the radius[5].

theta (ϴ): the longitude, defined on [0, 2π] just as it is on the flat map (when using all non-negative degrees), since the translation from radians to degrees maps π radians to 180 degrees.

phi (Φ): the latitude, defined on [0, π] just as it is on the flat map.

There are a few additional symbols for the quantities that most interest us:

Ap: the area of a cell in the ring nearest the pole

Ae: the area of a cell in the ring nearest the equator

The equations for these two variables, solved with Wolfram Alpha[6], follow from integrating the spherical area element dA = sin(Φ) dΦ dϴ over one cell:

Equation 1. Ap: area of an 11.25-degree-square (π/16-radian-square) cell adjacent to the North pole

Ap = (π/16) × (1 − cos(π/16)) ≈ 0.0038

Equation 2. Ae: area of an 11.25-degree-square cell adjacent to the equator

Ae = (π/16) × (cos(π/2 − π/16) − cos(π/2)) = (π/16) × sin(π/16) ≈ 0.0383
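A quick numeric check of those two cell areas (a sketch on the unit sphere, integrating the area element dA = sin(Φ) dΦ dϴ over one cell):

```python
from math import cos, pi, sin, tan

dtheta = pi / 16  # cell width in longitude: 2*pi over 32 cells
dphi = pi / 16    # cell height in polar angle: pi over 16 rings

# Integrating sin(phi) over a cell gives cos(phi_top) - cos(phi_bottom),
# so each cell's area is dtheta times that difference.
A_p = dtheta * (cos(0) - cos(dphi))            # cell touching the pole
A_e = dtheta * (cos(pi/2 - dphi) - cos(pi/2))  # cell touching the equator

print(A_p)        # ~0.0038
print(A_e)        # ~0.0383
print(A_p / A_e)  # ~0.0985, i.e. about one tenth
```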

For geospatial indexing, cell size matters. If I lose my wallet in the house, I have some hope of finding it without being late for work; if I lose my wallet in the mall, I might as well start canceling my credit cards, because I have no time to conduct an exhaustive search. This is analogous to the problem of using Z2 grid cells on a round earth: near-pole cells are only about one tenth as large as the near-equator cells, which means that they are more precise. If you know that a tweet originated in a near-pole cell, it tells you a lot more about where the Twitter user was, because there’s only a tenth as much wiggle room in that cell compared to the near-equator cell.

Moreover, as we use more bits in the Z2 curve, dΦ gets smaller and smaller: the rings nearest the pole hug the pole ever more tightly, the equatorial rings hug the equator, and the ratio Ap/Ae approaches 0. This is a problem only inasmuch as most of the population (and the attendant data we actually care about) is located far from the poles.
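That limiting behavior is easy to check numerically. For n latitude rings, the pole-to-equator area ratio reduces to tan(π/(2n)) (a closed form that follows from the cell-area integrals; it is not stated in the post):

```python
from math import pi, tan

# Ratio of pole-cell area to equator-cell area for n latitude rings:
# (1 - cos(pi/n)) / sin(pi/n) simplifies to tan(pi / (2 * n)).
ratios = {n: tan(pi / (2 * n)) for n in [16, 64, 256, 1024]}
for n, r in ratios.items():
    print(n, r)
# The ratio shrinks toward 0 as the grid gets finer: polar cells become
# vanishingly small relative to equatorial cells.
```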

This problem is not specific to the Z2 curve, but applies to all space-filling curves that work on an evenly-gridded version of a flat latitude-longitude surface. Compact Hilbert and Peano curves, for example, share this same pathology. The most effective and obvious workaround is to stop using a flat latitude-longitude surface for drawing uniform grid cells. For example, we could use a space-filling curve that operates on equilateral triangles, and treat the globe like a giant d20[7] die from Dungeons & Dragons[8]. A more mundane approach would be to simply change the function that maps latitude values to cells so that the equatorial cells become shorter and the polar cells become taller, reducing the discrepancy in their areas.
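One standard construction for that latitude remapping (my sketch, not necessarily the approach SFCurve will adopt) places ring boundaries at equal steps in sin(latitude), which makes every ring cover exactly the same area:

```python
from math import asin, degrees

def equal_area_ring_boundaries(n):
    """Latitude boundaries (in degrees) for n rings of equal area:
    take equal steps in sin(latitude) rather than in latitude itself,
    since the area between two latitudes is proportional to the
    difference of their sines."""
    return [degrees(asin(-1 + 2 * k / n)) for k in range(n + 1)]

bounds = equal_area_ring_boundaries(16)
print([round(b, 1) for b in bounds])
# Polar rings come out tall and equatorial rings short, compensating
# for the shrinking circumference near the poles.
```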

For a geospatial index, the core concern isn’t the shape of the grid cells, but what that shape implies about how data can or cannot be stored and queried effectively. The fact that the most precise indexing, and consequently the most efficient retrieval, is allocated to data that are located furthest from the equator is very much a problem for GeoMesa just like it is for all of the other geo-indexing systems that use rectangular lat/long grids.

GeoMesa itself defines neither the space-filling curves nor the map from curve indices to grid cells. Those responsibilities belong to the SFCurve library, so that is where the active research to address these challenges is being done, and that’s the GitHub repository to watch.

Z2 accommodations for a round Earth are just over the horizon. Meanwhile, the next time you see a GeoMesa T-shirt, you’ll have a much better understanding of the math that went into it!

[1] Assuming a 6-bit Z-order curve (drawn more like an N-order curve, with an “XYXYXY” bit interleaving) as depicted in the first figure.

[2] Thanks to NASA’s Visible Earth for the sphere texture, and their Deep Star Maps for the sky-box background. POV-Ray, although it may be old, still does a capable job of rendering the scene.

[3] The 9 bits are interleaved as follows: XYXYXYXYX, implying that the large “polar spike” in the image and animation is actually a horizontal jump between the West and East hemispheres.