Evan A. Sultanik, Ph.D.

Service Discovery on Dynamic Peer-to-Peer Networks Using Mobile Agents

or, How Bumper Cars Relate to Computer Science

The absurd realization that it has been almost exactly a
decade since I defended
my master’s thesis took
me by surprise. It seems like yesterday. That work has
had some decent exposure over the years, but, like
most creative works produced toward the beginning of one’s career, it
is not something I look back on as a paragon.

He was referring to my organization of one particular chapter: First I
described the overall technique, then I proceeded to iteratively break
the problem into successively smaller chunks. A bit like taking apart
a matryoshka doll. Or maybe a more apt analogy would be: peeling away
the layers of a rotten onion only to find a minuscule, barely
salvageable core, calling into question the cost/benefit of the whole
excavation. Moshe was right. With that said—and unlike the
eponymous biographee of the aforementioned book—I think I can at
least claim that my ideas are fully “born” in the first volume.

Despite somehow managing to get that chapter of my
thesis published in a book, I feel like it always
suffered from my cumbersome presentation. Therefore, perhaps inspired
by Stetson hats and hip flasks, I have since devised
an analogy that I hope at least makes the problem (if not my solution)
more accessible. That’s what I am going to describe in the remainder
of this post.

Imagine that you are driving a bumper car in a crowded arena. You
have some control over your own movement, but there are so many
others driving around in the arena that there are constant,
unavoidable collisions that throw you off course. From afar,
everyone’s movement seems random.

The challenge is that you need to know the time, but you do not have a
watch. In fact, very few of the other drivers have a watch. How
do you find the time?

The naïve solution is to simply yell out, “Who has the time‽” The problems with this solution are:

If everyone needs to know the time at once, there will be very many yelling people.

What if no one with a watch can hear you? Bumper car arenas tend to be loud.

A slightly more intelligent approach might be to:

Take a piece of paper and write a request on it for the current time.

Pass the piece of paper to the next person who bumps into you.

If one who is watch-less receives such a piece of paper, he or she passes it on to the next person that bumps into him or her.

When the note eventually reaches someone with a watch, he or she will write the current time on the piece of paper and send it back to you.

The obvious shortcomings to this approach, however, are that:

When the note eventually gets back to you, the time written on it will already be out of date!

What if, in fact, no one has a watch? How long do you have to wait without receiving a response before you can be sure that no one has a watch?

What if not everyone speaks and/or reads English?

My realization was this: if one roughly knows the topology of the
network (i.e., the locations of all of the cars), and if the
messages are truly passed randomly through the network, one can
use ergodic theory
and random graph theory to accurately predict the
frequency of message arrivals. Assuming knowledge of the network
topology is reasonable, since many ad hoc networking algorithms
already provide it. Even if it is not available, an approximation of
the topological properties is often sufficient. So, what this
provides is a model for predicting how long it will take for one’s
message to eventually reach someone with a watch and, subsequently,
how long it will take for the response to be returned. What I
discovered was that the variance is often small enough that this
estimate can be used to correctly adjust for the delay in returning
the time. Furthermore, this model provides a probability distribution
for how likely it is that one would have received at least one
response as a function of waiting time. Therefore, if n
seconds have passed and the model says that with probability 0.99
one should have received a response to the time query, yet no
response has been received, one can conclude with 99% certainty that
there is no one in the network with a watch!
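
To make this concrete, here is a toy C++11 simulation of the bumper-car version of the problem. All of the parameters are made up, and the arena is treated as perfectly mixed (each collision hands the note to a uniformly random car) rather than as a true network topology, so this only illustrates the idea, not my actual model:

#include <iostream>
#include <random>
#include <vector>

int main() {
    const int cars = 200;            // number of bumper cars (arbitrary)
    const double watch_prob = 0.05;  // fraction of drivers with a watch (arbitrary)
    const int trials = 10000;        // number of simulated time queries

    std::mt19937 rng(1234);
    std::bernoulli_distribution has_watch(watch_prob);
    std::uniform_int_distribution<int> random_car(0, cars - 1);

    // randomly decide which drivers have watches
    std::vector<bool> watch(cars);
    for (int i = 0; i < cars; ++i) watch[i] = has_watch(rng);

    // pass the note to a random car at each collision and record how many
    // hops elapse before it reaches someone with a watch
    std::vector<int> hops_needed;
    for (int t = 0; t < trials; ++t) {
        int hops = 0;
        int holder = random_car(rng);
        while (!watch[holder] && hops < 100000) {
            holder = random_car(rng);
            ++hops;
        }
        hops_needed.push_back(hops);
    }

    // empirical probability that the note has reached a watch within k hops
    for (int k : {10, 25, 50, 100}) {
        int within = 0;
        for (int h : hops_needed) {
            if (h <= k) ++within;
        }
        std::cout << "P(reached a watch within " << k << " hops) ~ "
                  << static_cast<double>(within) / trials << "\n";
    }
    return 0;
}

The analytical model does the same thing without simulation: it predicts the arrival-time distribution directly from the (approximate) topology, which is what makes the “no one has a watch” conclusion possible after a fixed waiting time.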

The last challenge question, “What if not everyone speaks and/or
reads English?” will have to wait for another post…

Positive Train Control

or, Jet Fuel Can’t Melt Train Tracks

It’s been a week since the
tragic Amtrak derailment in my hometown of Philadelphia.
As an avid train passenger—commuting to and from DC several
times per week—and having taken the ill-fated Northeast Regional
Train No. 188 on multiple occasions, I feel this one has struck close to
home. I am posting this blog entry from the café car of Train
No. 111, the first Southbound train to commence full Amtrak
service since the disaster.

I realize that it is very early, and the National Transportation
Safety Board investigation is still ongoing.
Speculation—especially by someone like me who is not a
transportation expert—would be unproductive at best, and
offensive to the victims at worst. Perhaps it’s my job as a security
professional—in which I am paid to find vulnerabilities in
systems … or perhaps it’s the recent spate of news that
both cars’ and
even commercial airplanes’ heavily computerized control
systems can be commandeered, wirelessly, by a remote
attacker … or perhaps it is the fact that the crash
occurred immediately after national rail safety week and on the eve of
a legislative debate on cutting Amtrak funding … or
perhaps I’ve just been
reading too much Pynchon… but ever since I heard that the
train was speeding and that there is no direct evidence
incriminating the train operators of negligence (other than the
speed), the first thing that popped into my mind was: Software. I
haven’t heard anyone (other than well-known security
expert Simson Garfinkel) discuss it, so that’s what I’m
going to do in the remainder of this post.

One topic that the media has latched onto
is Positive Train Control (PTC): a
technology that, had it been implemented, is purported to have
been capable of averting the crash. What the media doesn’t
say is that
the ACS-64 locomotive that was pulling the fateful
train was already equipped with PTC. You see,
the Advanced Civil Speed Enforcement
System (ACSES)—Amtrak’s version of PTC for the Northeast
Corridor—requires components both on-board the
locomotive and wayside (on the tracks). In other words, PTC
will only be fully functional if both the locomotive and tracks are
upgraded. Portions of track south of Philadelphia and north of New
York already have support. In the case of the Philadelphia crash, the
locomotive was a new model that had support, but the tracks on which
it derailed did not.

I contend that a software bug in the ACSES system should not
be ruled out as a potential cause of or contributing factor to the
derailment. Let me be clear: I am not claiming that
software was a likely cause of the crash. I am neither a
transportation expert nor do I have any detailed knowledge of the
ACSES implementation. The purpose of this article is to provide
enough evidence that a software error is a plausible
explanation that the possibility should at least become a part of the
public discussion.

There is reasonable precedent for software bugs
causing physical catastrophes. For
example, a software bug in Toyota’s electronic throttle control
system recently caused the massively publicized “unintended
acceleration” problem in many of their vehicles, killing at least 89
people as a result. In 2007, a group of six F-22 Raptor fighter
jets experienced multiple computer crashes coincident with crossing
the international dateline caused by a software bug that did not
anticipate that corner case. The planes lost all navigation and
communication, and were only able to make it back to land by following
their tankers. Vehicles and transportation systems in general are so
complex, automated, and computerized these days that a single software
bug can wreak havoc.

But how could a system that is intended to provide a safeguard against
crashing actually cause a crash?

Since PTC’s enforcement mechanism is to apply the brakes, presumably the system has no control over acceleration, just deceleration.

I am perhaps about to delve too far into the sea of speculation,
but as James Burke so eloquently demonstrated, a
failure in one system can cascade to cause failures in seemingly
independent and unconnected others.

A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.
Leslie Lamport

There is a display in the cab of
the locomotive with a speedometer, looking something like this:

ACSES Speedometer

When the track is PTC-enabled, there is a second speed readout on the
bottom, showing the maximum speed allowed on the track. When the
track is not PTC-enabled, the readout looks as pictured here. If the
conductor or engineer is relying on that display to gauge the train’s
current speed, he or she is thereby relying on the output of ACSES’s
algorithms, programming, hardware, and sensors. A failure in any of
those pieces could result in an incorrect speed readout, e.g.,
causing one to believe that the train is moving more slowly than it
actually is. This is similar to
how Air France Flight 447 was doomed by an engineering
design flaw in its airspeed sensors, which caused a failure in the
autopilot software, which reported inaccurate instrumentation to the
pilots, who relied upon the incorrect information, making manual
piloting errors that caused the plane to crash.

I do not wish the tragedy of Amtrak Train No. 188 on anyone; it
could have very easily been me sitting in that café car a week
ago. While history has proven that human error is the most frequent
cause of these types of accidents, we increasingly need to also look
at the software for possible fault.

Tracking Trains

Reverse Engineering and Hacking Text Messages for Great Good

I’ve been regularly commuting between Philly and DC on Amtrak for over
six months now. Being in the Northeast corridor—the only profitable
region in Amtrak’s network—and specifically being in the
NYC↔Philly↔Washington trifecta that accounts for about a
third of Amtrak’s overall nationwide ridership, the service is
generally excellent. Not Shinkansen excellent, but good enough to
take me where I need to go in relative comfort, in a third less time
than would be necessary to drive. And in English, too, so I can die
with a smile on my face, without feelin’ like the good Lord gypped me.

For a pseudo-public entity, Amtrak does a surprisingly good job
keeping up with the technological times. It’s had free WiFi on its
trains for a number of years, the conductors scan tickets using
ruggedized iPhones, and its iOS app lets me store and organize my
myriad tickets in Passbook. One can even present tickets via an Apple
Watch.

The one gripe I have about the system is that, despite Amtrak’s
excellent online tools for tracking the exact location of trains,
there is no good way to get useful train status alerts. I
often work at a location ~45 minutes outside of DC, so I want to know
whether my train is delayed before I depart for the station.

The Southbound trains I take from Philadelphia originate in either New
York or Boston. Therefore, I want to get an alert texted to my phone
when the train has departed New York. If it departed New York on
time, then I can proceed as normal. If it was delayed, then I can hit
the snooze button. If it didn’t even leave yet, then I know that it
will be at least an hour before it arrives in Philly.

The first thing I had to do was figure out a way to send text messages
for free. Every major cellular provider has some sort of gateway
service where an E-mail sent to the proper address will be forwarded
to the associated phone number. Therefore, I created a simple Python
library (in pure Python; no dependencies) for sending free (to the
sender) text messages to phones from every major international
carrier:
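
The library itself is not reproduced here, but the gist of the gateway trick is tiny. As a rough C++11 sketch (with gateway domains that are commonly documented examples, not an authoritative or current list), composing the destination address looks like this:

#include <map>
#include <string>

// Compose an email-to-SMS destination address for a given carrier.  The
// gateway domains below are illustrative examples only and may be out of date.
std::string smsGatewayAddress(const std::string& number,
                              const std::string& carrier) {
    static const std::map<std::string, std::string> gateways = {
        {"att",     "txt.att.net"},
        {"verizon", "vtext.com"},
        {"tmobile", "tmomail.net"},
        {"sprint",  "messaging.sprintpcs.com"},
    };
    const auto it = gateways.find(carrier);
    if (it == gateways.end()) {
        return "";  // unknown carrier
    }
    // e.g., smsGatewayAddress("2155551234", "verizon") yields "2155551234@vtext.com"
    return number + "@" + it->second;
}

Sending the message is then just a matter of emailing that address from any SMTP-capable mailer.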

Next, I had to reverse engineer (er, devise) a way to get
accurate train status updates. This was relatively straightforward,
but I hesitate to go into details because it might implicate me in
several terms-of-use violations.

I pieced this all together into a new service that I have been using and allowing several fellow commuters to beta-test for the past couple months. I’m excited to release it to the public now:

This website has no affiliation with any railways or train
operators. If you are a railway or train operator, please keep in mind
that this is a toy website created by a single guy with altruistic
motives. Please don’t sue me. Instead, let’s talk about how I can give
you what I’ve created here so that you might host it and provide it as
a service for your customers.

No profit is made off of the services on this website. In fact, I lose
money by running it. Therefore, there are certain concessions that
have been made due to lack of funding. Namely, there is the possibility
that a malicious party could illegally intercept the messages sent
between a user’s web browser and this server, potentially altering his
or her account and alerts. I have implemented as many security
features as possible to protect against this, but in order to be
almost foolproof I would need to host the site on SSL, which requires
money. If you desire additional security and features, please consider
donating.


Musical Uncanny Valley

In which I compare annoying music to artisanal ketchup.

While I appreciate Richter’s work and think it’s good, it’s
very hard for me to enjoy it. The problem is that I find
much of it—and particularly Memoryhouse—too reminiscent of
Philip Glass’s. The latter’s minimalism and
distinctive brand of ostinato, harmonic chord progressions, variations
in half-time, &c., are clearly being referenced in
Memoryhouse, albeit with perhaps “post-minimalist” orchestration. For
example, when I first listened to November, the first track
of Memoryhouse (q.v. above), it
immediately reminded me of Glass’s String Quartet No. 3, which
predates Memoryhouse by about 20 years:

It’s a bit like when a TV show wants to parody Jeopardy but doesn’t
have the budget to license rights to the Jeopardy Think! music, so it creates a
slightly different version which ends up sounding wrong, despite the
fact that if the original Jeopardy theme had never existed this new
version would be just as popular. This is what is called a musical pastiche.

That just sounds wrong to me, to the extent that my subconscious is
offended by it. It’s like going to a Michelin 3-starred restaurant
and being served artisanal, house-made ketchup, from locally sourced
organic tomatoes and garlic harvested from the chef’s own private
garden during the last full moon in spring: No matter how good that
ketchup tastes, it’s not going to taste as right as Heinz,
because that’s what you grew up eating. And heaven forbid you’re
served Hunt’s. Did we lose a war?

In my mind, a pastiche is distinct from something like an homage or
inspiration since its similarity to the source material is much more
noticeable. For example, Glass’s predilection for pairing low-pitch
ostinato with higher-pitch simple melodies is technically very similar
to Mozart’s modus operandi, yet we rarely ever hear Glass being
directly compared to Mozart.

A number of years ago I had a subscription to the Philadelphia
Orchestra and attended a concert debuting a new symphony by a
relatively unknown composer (whose name escapes me). The theme was
“The United States.” I’m guessing the concert was held around the
time of Independence Day, but I neither remember nor care to look it
up. The whole thing was very frustrating for me to sit through,
because it was clear to me that every single movement was simply a
pastiche of the work of a much more famous American composer. I would
have much rather heard a performance
of “the real thing.”

Music from completely different genres
can also either purposefully or accidentally trick our brains into
finding similarity. From almost exactly five years ago today:

The field of æsthetics has produced a hypothesis of what
is known as
the uncanny valley: When an object moves and/or
looks very similar to (but not exactly like) a human being, it causes
revulsion in the
observer. Here is a video with some examples, if you want to
be creeped out a bit. As objects become increasingly similar to the
likeness of a human, human observers become increasingly empathetic
toward the object. However, once the object passes a certain
threshold of human likeness, the human starts to be repulsed, until
another likeness threshold at which point the object is almost
indistinguishable from a real human.

I posit that there is a similar phenomenon in music, and that is what
I am experiencing when I listen to Memoryhouse. One of the theories
explaining the uncanny valley is that conflicting perceptual cues
cause a sort
of cognitive dissonance. It is well-known
that the brain behaves almost identically
when imagining a familiar piece of music as it does
when listening to it. This suggests that the brain is
internally “playing” the music in synchrony with what it is hearing,
anticipating each note. In a sense, one’s brain is subconsciously
humming along to the tune. My theory is that when a song (or,
particularly, a pastiche) is similar enough to another song that is
much more familiar, this evokes the same synchronous imagining.
However, once the pastiche deviates from the anticipated pattern, this
ruins the synchrony and causes cognitive dissonance.

Excuse the pun, but on that note I’ll leave you with something that
will hopefully not be in your musical uncanny valley: This
wonderful, recent recomposition of Glass’s String Quartet No. 3
for guitar:

At the beginning of the era of strategic nuclear war capability, the U.S. deployed thousands of
air defense fighter aircraft and ground-based missiles to defend the population and the industrial
base, not just to protect military facilities. Every major city was ringed with Nike missile bases to
shoot down Soviet bombers. At the beginning of the age of cyber war, the U.S. government is
telling the population and industry to defend themselves. As one friend of mine asked, “Can you
imagine if in 1958 the Pentagon told U.S. Steel and General Motors to go buy their own Nike
missiles to protect themselves? That’s in effect what the Obama Administration is saying to
industry today.”

That passage has always struck me as proffering a false equivalence.
The US Government has near absolute control over the country’s
airspace, and is thereby responsible to defend against any foreign
aggression therein. Cyber attacks, on the other hand, are much less
overt—with the possible exception of relatively unsophisticated
and ephemeral denial of service (DOS) attacks. A typical
attacker targeting a private company is most likely interested in
stealing intellectual property. That’s certainly the trend we’ve seen
since the publication of Clarke’s book back in 2012, viz.,
Sony, Target, Home Depot, &c. The way those types of
attacks are perpetrated is more similar to a single, covert operative
sabotaging or exfiltrating data from the companies rather than an
overt aerial bombardment.

The actual crime happens within the private company’s
infrastructure, not on the “public” Internet. The Internet
is just used as a means of communication between the endpoints. More
often than not the intermediate communication is encrypted, so even if
a third party snooping on the “conversation” were somehow able to
detect that a crime were being committed, the conversation would only
be intelligible if the eavesdropper were “listening” at the endpoints.

A man has a completely clean criminal record. He drives a legally
registered vehicle, obeying all traffic regulations, on a federal
interstate highway. A police officer observing the vehicle would have
no probable cause to stop him. The man drives to a bank which he robs
using an illegal, unregistered gun. Had a police officer stopped the
bank robber while in the car, the gun would have been found and the
bank robber arrested. Does that give the police the right to stop
cars without probable cause? Should the bank have expected the
government to protect it from this bank robber?

The only way I can see for the government to provide such protections
to the bank, without eroding civil liberties (i.e.,
the Fourth Amendment), would be to provide additional
security within the bank. Now, that may be well and good in
the bank analogy, but jumping back to the case of cyber warfare, would
a huge, multi-national company like Sony be willing to allow the US
Government to install security hardware/software/analysts into its
private network? I think not.

Social Signals Part 2

The (gratuitous) math behind the magic.

Introduction

In the last post, I presented a new
phenomenon called Social Signals. If you have not yet read
that post, please do so before
reading this continuation. Herein I shall describe the nitty-gritty
details of the approach. If you don’t care for all the formalism, the
salient point is that we present an unsupervised approach
to efficiently detect social signals. It is the approach that
achieved the results I presented in the previous post.

The Formal Model (i.e., Gratuitous Math)

We call a measurement of this overall Twitter traffic a baseline. We posit that the signal associated with a bag of words is anomalous for a given time window if the frequency of the words’ usage in that window is uncorrelated with the baseline frequency of Twitter traffic. The aberrance of an anomalous signal is a function of both its frequency and the extent to which it is uncorrelated from the baseline. The significance of an anomalous signal is a function of both its frequency and also the significance of the other abnormal signals to which it is temporally correlated.

Let $\Sigma^*$ be the set of all possible bags of words in a Twitter
stream. Let $T \subset \mathbb{N}$ be a consecutive set of points in
time that will be used as the index set for the baseline time series,
$B = \{B_t : t \in T\}$. Given a bag of words $W \in \Sigma^*$, the
associated time series for the signal over the window $T$ is given by
$S_W = \{S_{W,t} : t \in T\}$.

The correlation, $\rho : \mathbb{N} \times \mathbb{N} \rightarrow
[-1,1] \subset \mathbb{R}$, between two time series, $B$ and $S_W$, is
calculated using Pearson’s product-moment correlation coefficient:
\begin{multline*}
\rho(B, S_W) \mapsto \frac{\mathrm{Cov}\left(B, S_W\right)}{\sigma_B \sigma_{S_W}}\\= \frac{1}{|T|-1}\sum_{i=1}^{|T|} \left(\frac{B_i - \mu_{B}}{\sigma_B}\right)\left(\frac{S_{W,i} - \mu_{S_W}}{\sigma_{S_W}}\right),
\end{multline*}
where $\mathrm{Cov}$ is covariance, $\mu$ is the mean, and $\sigma$ is
the standard deviation of the time series. Values of $\rho$ closer to
zero indicate a lack of correlation, while values closer to positive
and negative one respectively indicate positive and negative
correlation. We define the aberrance of a signal with
respect to the baseline, $\delta : \mathbb{N} \times \mathbb{N} \rightarrow \mathbb{R}^+$, as
the ratio of the signal’s frequency to the magnitude of its correlation
to the baseline: $$\delta\left(B, S_W\right) \mapsto \frac{\sum_{t \in T} S_{W,t}}{|\rho(B, S_W)|}.$$
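
As a concrete sketch of the two definitions above (in C++11; this is illustrative code of mine, not the implementation used for the paper):

#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Sample mean of a time series.
double mean(const std::vector<double>& x) {
    return std::accumulate(x.begin(), x.end(), 0.0) / x.size();
}

// Sample standard deviation (normalized by |T| - 1).
double stddev(const std::vector<double>& x, double mu) {
    double ss = 0.0;
    for (double v : x) ss += (v - mu) * (v - mu);
    return std::sqrt(ss / (x.size() - 1));
}

// Pearson product-moment correlation rho(B, S_W); assumes both series have
// the same length and that neither is constant (nonzero standard deviation).
double rho(const std::vector<double>& b, const std::vector<double>& s) {
    const double mu_b = mean(b), mu_s = mean(s);
    const double sd_b = stddev(b, mu_b), sd_s = stddev(s, mu_s);
    double cov = 0.0;
    for (std::size_t i = 0; i < b.size(); ++i) {
        cov += (b[i] - mu_b) * (s[i] - mu_s);
    }
    cov /= (b.size() - 1);
    return cov / (sd_b * sd_s);
}

// Aberrance delta(B, S_W): the signal's total frequency divided by the
// magnitude of its correlation to the baseline.  A signal completely
// uncorrelated with the baseline (rho == 0) is maximally (infinitely) aberrant.
double aberrance(const std::vector<double>& b, const std::vector<double>& s) {
    const double freq = std::accumulate(s.begin(), s.end(), 0.0);
    return freq / std::fabs(rho(b, s));
}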

The significance, $\alpha
: \mathbb{N} \times \mathbb{N} \rightarrow \mathbb{R}$, of an abnormal
signal $S_W \in \mathcal{S}$ with respect to the baseline is defined recursively: the significance of an abnormal signal is equal to its
frequency times the weighted significance of the other abnormal
signals to which it is highly temporally correlated (we denote this set of
temporally correlated neighbors of $S_W$ by $\phi(S_W)$). Formally,
\begin{multline}
\alpha(B, S_W) \mapsto \left(\sum_{t \in T} S_{W,t}\right)\times\\\left(\sum_{S_{W^\prime} \in \phi(S_W)}
2\left(|\rho(S_W, S_{W^\prime})| - \frac{1}{2}\right)\alpha(B, S_{W^\prime})
\right).
\label{equ:alpha}
\end{multline}
Therefore, given the maximally aberrant signals $\mathcal{S}$, we can
use the significance function $\alpha$ to provide a total ordering
over the signals.

Finally, some signals might be so highly temporally correlated that
their bags of words can be merged to create a new, aggregate signal.
This can be accomplished by performing clustering on a graph in which
each vertex is a signal and edges are weighted according to $\phi$.
We have found that this graph is often extremely sparse; simply
creating one cluster per connected component provides good results.

The model described in the previous section provides a framework in which to find the maximally aberrant signals (i.e., what I was hoping would turn out to be social signals). Creating an algorithm to solve that optimization problem is non-trivial, however. For instance, the recursive nature of the definition
of the significance function $\alpha$ requires solving a system of $n$ equations. Fortunately, I was able to identify some relaxations and approximations that make an algorithmic solution feasible. In the remainder of this section, I will describe how I did it.

Given a stream of tweets, a desired time window $T$, and a desired
time resolution, the baseline, $B$, and set of signals, $\Sigma^*$,
are collected by constructing histograms of term frequency in the
tweets for the given time window with bins sized to the desired time
resolution. This process is given
in Algorithm 1. Since the maximum
number of tokens in a tweet is bounded (given Twitter’s 140 character
limit), this algorithm will run in $O(|E|)$ time.

Algorithm 1 Time series construction from a set of tweets.

1: procedure Collect-Signals($E$, $T$, $\gamma$)
Require: $E = \{e_1, e_2, \ldots\}$ is a set of tweets, each being a tuple $\langle \tau_i, \kappa_i\rangle$ containing the timestamp of the tweet, $\tau_i$, and the content of the tweet, $\kappa_i$. $T$ is the desired time window and $\gamma$ is the desired time resolution.
Ensure: $B$ is the baseline measurement and $\Sigma^*$ is the set of possible signals. For all $W \in \Sigma^*$, $S_W$ is the time series for signal $W$.
2: $k \gets \frac{|T|}{\gamma}$ /* this is the total number of buckets */
/* … remainder of the listing omitted … */
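
For concreteness, the following C++11 sketch mirrors the spirit of Algorithm 1 (the data structures and tokenization here are assumptions of mine, not the original implementation): tweets are binned into $|T|/\gamma$ buckets, the per-bucket tweet totals form the baseline $B$, and the per-bucket counts of tweets containing each token $W$ form the signals $S_W$.

#include <map>
#include <set>
#include <string>
#include <vector>

struct Tweet {
    long timestamp;                   // tau_i, e.g., seconds since the epoch
    std::vector<std::string> tokens;  // kappa_i, already tokenized
};

struct CollectedSignals {
    std::vector<double> baseline;                       // B
    std::map<std::string, std::vector<double>> series;  // S_W for each W
};

CollectedSignals collectSignals(const std::vector<Tweet>& tweets,
                                long windowStart, long windowLength,
                                long resolution) {
    const std::size_t k =
        static_cast<std::size_t>(windowLength / resolution);  // number of buckets
    CollectedSignals out;
    out.baseline.assign(k, 0.0);
    for (const Tweet& e : tweets) {
        if (e.timestamp < windowStart ||
            e.timestamp >= windowStart + windowLength) {
            continue;  // outside the window T
        }
        const std::size_t bin =
            static_cast<std::size_t>((e.timestamp - windowStart) / resolution);
        out.baseline[bin] += 1.0;
        // each distinct token is counted at most once per tweet here; the
        // original may instead count raw term frequency
        const std::set<std::string> seen(e.tokens.begin(), e.tokens.end());
        for (const std::string& w : seen) {
            std::vector<double>& s = out.series[w];
            if (s.empty()) s.assign(k, 0.0);
            s[bin] += 1.0;
        }
    }
    return out;
}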

Next, we need to find $\mathcal{S}$: the set of $n$ signals of maximum
aberrance. This can be efficiently calculated in $O\left(|\Sigma^*|\frac{|T|}{\gamma}\right)$ time
(for $n$ independent of $|\Sigma^*|$ or $n \ll
|\Sigma^*|$), e.g., using the median of medians linear
selection algorithm or by using data structures
like Fibonacci heaps or skip lists.

The next step is to sort the $n$ maximally aberrant signals in
$\mathcal{S}$ by significance. The recursive nature of the definition
of the significance function $\alpha$ requires solving a system of $n$
equations. It turns out, however, that the mapping of (\ref{equ:alpha}) is
equivalent to the stationary distribution of the distance matrix defined
by $\phi$. To see this, first construct a $1 \times n$ vector ${\bf
F}$ consisting of the frequencies of the signals: $${\bf F}_i = \sum_{t \in T} S_{W_i, t}.$$
Next, construct an $n \times n$ matrix ${\bf R}$ such that
$${\bf R}_{ij} =
\begin{cases}
2\left(|\rho(S_{W_i},S_{W_j})| - \frac{1}{2}\right) & \text{if}\ S_{W_j} \in \phi(S_{W_i}),\\
0 & \text{otherwise.}
\end{cases}$$
Let $\hat{\bf R}$ be a normalization of matrix
${\bf R}$ such that the entries of each row sum to one.

The limiting (or stationary) distribution of the Markov transition matrix $\hat{\bf R}$, notated $\boldsymbol{\pi}$, is
equivalent to the principal eigenvector (in this case the
eigenvector associated with the eigenvalue 1) of $\hat{\bf R}$: $$\boldsymbol{\pi} = \hat{\bf R}\boldsymbol{\pi}.$$
It is a folklore result that $\boldsymbol{\pi}$ can be numerically calculated as follows:
\begin{equation}
\label{equ:principal}
\boldsymbol{\pi} = [\underbrace{1, 0, 0, 0, 0, \ldots}_{n}]\left(\left(\hat{\bf R} - {\bf I}_n\right)_1\right)^{-1},
\end{equation}
where ${\bf I}$ is the identity matrix and the notation $({\bf X})_1$
represents the matrix resulting from the first column of ${\bf X}$
being replaced by all ones. Let $C_i$ be the set of states in the
closed communicating class (i.e., the set of signals
in the connected component of $S_{W_i}$ as defined by $\phi$) of
state $i$ in $\hat{\bf R}$. Then, by definition of $\alpha$
in (\ref{equ:alpha}) and by construction of ${\bf F}$ and $\hat{\bf
R}$, $$\alpha(B, S_{W_i}) = \pi_i\sum_{j \in C_i}{\bf F}_j.$$

$\boldsymbol{\pi}$ can also be calculated by iterating the Markov process, much like
Google’s PageRank algorithm: $$\lim_{k\rightarrow\infty} \hat{\bf R}^k = {\bf 1}\boldsymbol{\pi}.$$
For large values of $n$, this approach will often converge in fewer
iterations than what would have been necessary to
compute (\ref{equ:principal}). This approach is given in
Algorithm 2. If convergence is in a constant
number of iterations, the algorithm will run in $O\left({n \choose 2}\right)$ time.

Algorithm 2 Calculation of the significance function $\alpha$.

1: procedure Calculate-Significance(${\bf F}, \hat{\bf R}$)
Require: $d \in (0,1)$ is a constant damping factor (set to a value of $0.85$).
/* … remainder of the listing omitted … */
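
A minimal C++11 sketch of that iteration (my own simplification: the damping factor and the other bookkeeping of Algorithm 2 are omitted, so this only illustrates the power-iteration idea):

#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<double>> Matrix;

// Iterate the Markov process defined by the row-normalized matrix rHat,
// starting from the uniform distribution, until the distribution stops
// changing (or a maximum number of iterations is reached).
std::vector<double> stationaryDistribution(const Matrix& rHat,
                                           double tol = 1e-9,
                                           int maxIter = 1000) {
    const std::size_t n = rHat.size();
    std::vector<double> pi(n, 1.0 / n), next(n, 0.0);
    for (int iter = 0; iter < maxIter; ++iter) {
        for (std::size_t j = 0; j < n; ++j) next[j] = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t j = 0; j < n; ++j) {
                next[j] += pi[i] * rHat[i][j];  // apply one step of the chain
            }
        }
        double delta = 0.0;
        for (std::size_t j = 0; j < n; ++j) delta += std::fabs(next[j] - pi[j]);
        pi.swap(next);
        if (delta < tol) break;  // converged
    }
    // pi_i is then scaled by its communicating class's total frequency to get alpha
    return pi;
}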

Finally, we want to merge signals that are highly temporally
correlated. This is accomplished by clustering the states of the
transition matrix $\hat{\bf R}$. We have found that this graph is
often extremely sparse; simply creating one cluster per connected
component provides good results. An $O(n)$ method for accomplishing
this is given in Algorithm 3.
Therefore, the entire process of automatically discovering social
signals will run in $O\left(|\Sigma^*|\frac{|T|}{\gamma} + n^2\right)$
time.

Algorithm 3 Clustering of the significant signals by connected component.

1: procedure Merge-Signals($\mathcal{S}, \hat{\bf R}$)
Ensure: $\mathcal{S}^\prime$ is the set of merged signals.
2: $\mathcal{S}^\prime \gets \emptyset$ /* the set of merged signals */
3: $s \gets 0$ /* the current aggregate signal index */
4: $H \gets \emptyset$ /* a history set to prevent loops when performing a traversal of the graph */
/* … intermediate steps of the listing omitted; they traverse the graph, adding all of $S_{W_j}$’s temporally correlated neighbors to the traversal: … */
17: $H \gets H \cup \{k\}$
18: Stack-Push$(X, k)$
/* … remainder of the listing omitted … */
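
A C++11 sketch of the same clustering (my own simplification, using a dense matrix; with a sparse adjacency list the traversal cost is proportional to the number of edges, which is what makes the linear-time claim reasonable for these very sparse graphs):

#include <cstddef>
#include <stack>
#include <vector>

typedef std::vector<std::vector<double>> Matrix;

// Assign each signal to a connected component of the graph whose edges are
// the nonzero entries of rHat; each component becomes one merged signal.
std::vector<int> mergeSignals(const Matrix& rHat) {
    const std::size_t n = rHat.size();
    std::vector<int> component(n, -1);
    int current = 0;
    for (std::size_t start = 0; start < n; ++start) {
        if (component[start] != -1) continue;  // already assigned
        std::stack<std::size_t> toVisit;       // the traversal stack
        toVisit.push(start);
        component[start] = current;
        while (!toVisit.empty()) {
            const std::size_t i = toVisit.top();
            toVisit.pop();
            for (std::size_t j = 0; j < n; ++j) {
                // an edge exists if the two signals are temporally correlated
                if (rHat[i][j] != 0.0 && component[j] == -1) {
                    component[j] = current;
                    toVisit.push(j);
                }
            }
        }
        ++current;
    }
    return component;
}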

So What?

Social networking sites such as Twitter and Facebook are becoming
an important channel of communication during natural disasters. People affected
by an event are using these tools to share information about conditions on the ground
and relief efforts, and as an outlet for expressions of grief and sympathy. Such responses
to an event in these public channels can be leveraged to develop methods of detecting
events as they are happening, possibly giving emergency management personnel timely
notifications that a crisis event is under way.

Much of the recent research on addressing this need for event detection has focused on
Twitter. Twitter has certain advantages that make it a good model for
such research. Among these is the ability to query tweets from a
particular location, making it possible to know the geographic location of an
emerging event.

Ideally, analysts and first responders would have some system by which
they could mine Twitter for anomalous signals, filtering out
irrelevant traffic. Twitter does have a proprietary algorithm for
detecting “Trending Topics”; however, Twitter’s “Local Trends” feature
for identifying trends by geographic region is currently limited to
countries and major metropolitan areas (exclusive of cities like
Tuscaloosa, Alabama). Furthermore, there is little or no transparency
on the specifics of Twitter’s
algorithm. It is known that the algorithm relies heavily on rapid
spikes in the frequency of usage of a term, therefore biasing
toward signals that have broad appeal. An analyst, on the other hand,
would be interested in any signal that is abnormal; Twitter’s Trending
Topics heuristic might overlook such anomalous signals that have a
more gradual increase in frequency.

This work deviates from existing approaches in a number of ways.
First and foremost, our approach is based upon a model of a phenomenon,
discovered by human analysts, that has proven successful for
identifying signals specifically related to general events of
interest, as opposed to ephemeral topics and memes that
happen to have momentarily captured the attention of the collective
consciousness of Twitter. Secondly, the proposed approach does not
rely on detecting spikes in the frequency or popularity of a signal.
Instead, emphasis is placed on detecting signals that deviate from the
baseline pattern of Twitter traffic, regardless of the signal’s
burstiness. Finally, the proposed approach does not rely on term- or
document-frequency weighting of signals
(e.g., tf-idf), which can be troublesome for short
documents (e.g., tweets) because signals’ weights will be
uniformly low and therefore have very high entropy, i.e., the
probability of co-occurrence of two terms in the same tweet is
extremely low, especially if the number of documents is low. Since we
are not characterizing signal vectors for individual tweets, we can
examine the frequencies of individual signals independently. In a
sense, we treat the entire stream of tweets as a single document.

Once again, many thanks to my
collaborators, Clay Fink
and Jaime Montemayor, without whom this work would not have been possible.

Social Signals

or, the basis for an article that was nominated for best paper and subsequently rejected.

Discovery of Social Signals

As will become evident from the following chronology, I was not party to this line of research until after many of the initial discoveries were made, so my account of the earlier details may be inaccurate. To my knowledge, however, this is the first time any of these discoveries have been made publicly available in writing.

In 2011, my colleague Clay Fink had been mining Twitter for tweets for quite some time. He had gigabytes of them, spanning months and covering various worldwide events like the London riots, the Mumbai bombings, the Tuscaloosa—Birmingham tornado, the Christchurch earthquake, the Nigerian bombing, &c. These provided a very interesting and, as it turned out, fruitful source of data to research. They shared the common theme of being crises, yet they were each of a very different nature, e.g., some were man-made and others natural disasters.

Clay had created some algorithms to process the mass of textual data, but there was no good, usable way for a human to sift through it to do manual analysis. Enter another colleague, Jaime Montemayor, who created a data exploration tool called TweetViewer.

TweetViewer

TweetViewer is very simple in concept: It plots the relative term frequencies used in a text corpus over time. Yet this simple concept—coupled with the algorithms and HCI workflow accelerators (as Jaime calls them) behind the scenes that allow real-time interaction with and instantaneous feedback from the data—enabled some interesting discoveries. Take the large spike in Twitter traffic around May 2nd as an example: it corresponds to the killing of Osama bin Laden—the event associated with the highest sustained rate of Twitter traffic up to that point. Selecting the term “bin Laden” instantly generates its time series in the lower graph, which correlates perfectly with the spike in overall traffic.

This example perfectly illustrates one of the challenges of social media analytics, for there were several other major events being concurrently discussed on Twitter that were masked by the popularity of the bin Laden killing: the British royal wedding, a $100k fine levied on Kobe Bryant, the NBA finals, and a tornado that devastated several cities in the Southern United States. Monitoring “trending topics” based upon spikes in the frequency or popularity of terms would not detect these less popular topics of discussion.

By exploring Clay’s corpora in TweetViewer, he and Jaime quickly noticed a pattern: there were certain terms that always seemed to spike in frequency during a crisis event. These included “help,” “please,” and “blood,” as well as various profanities. In the Nigerian bombing corpora, for example, there was a large spike in Twitter traffic temporally correlated with the bomb explosion, and a subsequent spike minutes later correlated with a bomb scare (but no actual blast). The frequency time series for the term “blood” perfectly correlates with the actual bomb blast, but does not spike again during the bomb scare. Clay and Jaime dubbed these terms Social Signals.

Automating the Process

When I was first briefed on this phenomenon, I was immediately intrigued, primarily by the potential of publishing an academic paper peppered with legitimate use of profanity.

The challenge was that these social signals were discovered by humans. Were there others we had missed? Would this phenomenon extend to languages other than English? Could an algorithm be designed to classify and track social signals?

I played around with the data in TweetViewer for a while, and I eventually developed a theory as to what made a term a social signal. My underlying hypothesis was that the human-identified social signals are a subset of a larger class of abnormal terms whose frequency of usage does not correlate with the overall frequency of Twitter traffic. For each geographic region, Twitter traffic ebbs and flows in a pattern much like a circadian rhythm. In the Twitter posts captured from the Tuscaloosa tornado event, for example, traffic peaks around 23:00 every night and then rapidly falls—as people go to sleep—and reaches a nadir at about 09:00—when people are arriving at work. It is actually very regular, except for when people are discussing an unexpected event.

As an example, consider the terms “please” and “blood,” which we have observed to temporally co-occur with certain types of disaster events. Under normal circumstances these terms are not temporally correlated to each other; however, during a disaster scenario they are. Furthermore, the term “blood” does not have a very high frequency, so methods based strictly on the frequency domain or burstiness of the terms might ignore that signal. Therefore, we first identify the set of maximally abnormal term signals, viz., those that are temporally uncorrelated with the overall Twitter traffic. Next, these abnormal signals are sorted according to a significance metric which is based on how temporally correlated they are to each other. Finally, an unsupervised graph clustering algorithm is used to merge similar signals.

For the sake of brevity, I am going to save the technical details for a later post. I will likely also be publishing a related manuscript on the arXiv.

Results

We evaluated our approach on a number of Twitter datasets captured
during various disaster scenarios. The first corpus we examined
includes tweets from the April 2011 tornados in Tuscaloosa, Alabama.
This is the interesting corpus mentioned above in which the tornados co-occurred with
the British Royal wedding, a much discussed $100,000 fine of NBA
player Kobe Bryant, and the killing of Osama bin Laden—the event
associated with the highest sustained rate of tweets per second to
date. The second corpus includes tweets from the February 2011
earthquake in Christchurch, New Zealand. The final corpus includes
tweets leading up to and including the July 2011 bombings in Mumbai,
India. This corpus is interesting because it includes tweets from the
prior thirteen days leading up to the bombings.

The time bucket size was set to 20 minutes. The number of
maximally aberrant signals discovered was 300. For each corpus, the
resulting automatically detected signals included the human-identified
social signals “help” and “please.” The Mumbai corpus also
included the “blood” social signal. The clustered signals of
highest significance are given in the following table, in which green signals are ones that are clearly related to an important event:

The Aftermath

Excited that we had developed an automated, completely unsupervised, language-agnostic technique for detecting social signals—which also happened to detect the same signals discovered by human analysis—we quickly wrote up a paper and submitted it to a conference. A few months later the reviews came back:

Reviewer 1

I read this in detail and it is amazing! Accept. Nominate for Best Paper.

Shortly after, I left JHU and didn’t continue this line of research, so it’s just been bit-rotting for the past few years.

In summary, our paper introduced the discovery of social signals:
elements of text in social media that convey information about
people’s response to significant events, independent of specific
terms describing or naming the event. The second
contribution of the paper was a model of the mechanisms by
which such signals can be identified, based on the patterns of
human-identified signals. Finally, a novel unsupervised method for
the detection of social signals in streams of tweets was presented and
demonstrated. For a set of corpora containing both natural and
human-caused disasters, in each instance our model and algorithms were
able to identify a class of signals containing the manually classified
social signals.

Future work includes optimizing the algorithms for efficiency in
processing streaming Twitter traffic. Incorporating estimates
of provenance and trustworthiness from the Twitter social network, as
well as accounting for the propagation of signals through re-tweets,
might improve the relevance of the detected signals. Finally, we hoped to
perform a sensitivity analysis of the algorithms by detecting signals
across domains, e.g., using a baseline from one
disaster-related event to detect signals during a similar event at a
different time.

Success in OS X

10 easy steps (and lots of unnecessary prose) on how to set up a new PowerBook in 48 hours or more.

Hashing Pointers

or: How I Learned to Stop Worrying and Love C++11

For as long as I’ve understood object-oriented programming, I’ve
had an ambivalent relationship with C++. On the one hand, it promised
the low-level control and performance of C, but, on the other, it also
had many pitfalls that did not exist in other high-level managed
languages like Java. For example, I would often find myself in a
position where I wouldn’t be quite sure exactly what the compiler
would be doing under-the-hood, particularly when passing around
objects by value. Would the compiler be “smart” enough to know that
it can rip out the guts of an object instead of making an expensive
copy? Granted, much of my discomfort could have been remedied
by a better understanding of the language specification. But why
should I have to read an ~800 page specification just to understand what
the compiler is allowed to do? The template engine and the STL are
incredibly powerful, but they can make code just as verbose as Java,
and verbosity is one of the primary criticisms of Java. Therefore, I found myself gravitating
toward more “purely” object-oriented languages, like Java, when a
project fit with that type of abstraction, and falling back to C when
I needed absolute control and speed.

A couple years ago, around the time when compilers started having
full support for C++11, I started a project that was tightly coupled
to the LLVM codebase,
which was written in C++. Therefore, I slowly started to learn about
C++11’s features. I now completely agree with Stroustrup: It’s best to think of C++11 like a completely new
language. Features like move
semantics give the programmer complete control over when the
compiler is able to move the guts of objects. The new auto type deduction keyword gets rid of a
significant amount of verbosity, and makes work with complex templates
much easier. Coupled with the new decltype keyword, refactoring object member
variable types becomes a breeze. STL threads now make porting concurrent code much
easier. That’s not to mention syntactic sugar like range-based for statements, constructor inheritance,
and casting keywords. And C++ finally has lambdas! C++11 seems to be a bigger leap forward
from C++03 than even Java 1.5 (with its addition of generics) was
to its predecessor.

As an example, I recently needed an unordered hash map where the
keys were all pointers. For example,

std::unordered_map<char*,bool> foo;

I wanted the keys to
be hashed based upon the memory addresses of the character
pointers, not the actual strings. This is similar to Java’s
concept of an IdentityHashMap. Unfortunately, the
STL does not have a built-in hash function for pointers. So I created one thusly:
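
A minimal sketch of such a functor (the name and exact body here are illustrative, but the idea is the same):

#include <cstddef>
#include <cstdint>
#include <functional>
#include <type_traits>

// Hash a pointer key by its address rather than by whatever it points to.
// std::remove_pointer recovers the pointee type from the pointer type T.
template <typename T>
struct PointerHash {
    typedef typename std::remove_pointer<T>::type pointee_type;
    std::size_t operator()(const pointee_type* key) const {
        return std::hash<std::uintptr_t>()(
            reinterpret_cast<std::uintptr_t>(key));
    }
};

// Usage: hash char* keys by address, not by string contents:
//     std::unordered_map<char*, bool, PointerHash<char*>> foo;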

Note the use of std::remove_pointer<_>, a great new
STL template that gets the base type of a pointer.

In another instance, I wanted to have a hash map where the keys were
pointers, but the hash was based off of the dereferenced version of
the keys. This can be useful, e.g., if you need to hash a
bunch of objects that are stored on the heap, or whose memory is
managed outside of the current scope. This, too, was easy to
implement:
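
Again only a sketch (the names are mine): the hash looks through the pointer, and a matching key-equality functor is needed so that two distinct pointers to equal values are treated as the same key:

#include <cstddef>
#include <functional>
#include <type_traits>
#include <unordered_map>

// Hash a pointer key by the value it points to.
template <typename T>
struct DereferenceHash {
    std::size_t operator()(const T ptr) const {
        typedef typename std::remove_pointer<T>::type pointee_type;
        return std::hash<pointee_type>()(*ptr);
    }
};

// Key equality must also look through the pointer; otherwise equal values
// stored at different addresses would still be distinct keys.
template <typename T>
struct DereferenceEqual {
    bool operator()(const T a, const T b) const { return *a == *b; }
};

// Usage: the keys are int*, but hashing and equality use the pointed-to ints:
//     std::unordered_map<int*, bool,
//                        DereferenceHash<int*>,
//                        DereferenceEqual<int*>> bar;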

I realize that this post is a collection of rather mundane code
snippets that are nowhere near a comprehensive representation of the
new language features. Nevertheless, I hope that they will give you
as much hope and excitement as they have given me, and perhaps inspire
you to (re)visit this “new” language called C++11.