Exploring the Star Wars expanded universe (part 1)

In this post, I will try to give some insights into the Star Wars expanded universe. All the data comes from Wookieepedia, the Star Wars wiki.

I actually started working on this project two years ago. With the release of the new movie, The Force Awakens, it is the perfect time to share some fun facts about Star Wars that I didn’t know before (and believe me when I say I’m a big fan).

The universe is huge and can be overwhelming to newcomers, especially since the new movie altered the previously canon timeline. What happened in the books after Return of the Jedi is now considered “Legends”. If you want to have a look at the complete timeline, you can find it here, but beware: it’s insane…

So you claim to have exclusive info, prove it!

All right, all right! I can tell you the number of characters in the universe (as of the 25th of December 2015). It is hard to give an exact number because the wiki is ever-changing and some character pages are pure trash. However, I extracted 21,647 characters (yes, more than 20k). Once we remove characters whose names start with “Unidentified”, we are left with 19,612. Did I tell you that the expanded universe was huge?

What are the most dominant species in the universe?

Let’s continue our investigation. It is known that the universe is vast and made up of many different species, but let’s be more precise and plot the distribution of the top 10 species. To get more info on each species you can search Wookieepedia; for instance, here is a Twi’lek.

Distribution of Star Wars species

Humans represent 78% of the total population!

By the way, where did you get this info?

This part is technical; uninterested readers may skip it.

To extract this info, I built a web scraper (a robot) designed to extract data about every character on Wookieepedia. All the data is stored on the fly in a graph database.

However, there is no common index to reach every character page on the site. To be sure that I was actually scraping all the characters, I first had to fetch all the categories indexing the characters, such as Category:Individuals.

From this root category, I created a network (a forest to be precise) of all subcategories under this root. Then for each category, a node in the category network, the robot fetches all the associated unique characters.
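As a rough illustration of this category walk, here is a minimal sketch in Python. It is not the actual scraper: the two dictionaries are made-up toy data standing in for the wiki pages, and `crawl_categories`, `get_subcategories` and `get_members` are hypothetical names chosen for the example.

```python
from collections import deque

# Hypothetical toy data standing in for Wookieepedia category pages.
SUBCATEGORIES = {
    "Category:Individuals": ["Category:Jedi", "Category:Sith"],
    "Category:Jedi": ["Category:Jedi Masters"],
    "Category:Sith": [],
    "Category:Jedi Masters": [],
}
MEMBERS = {
    "Category:Individuals": [],
    "Category:Jedi": ["Luke Skywalker", "Revan"],
    "Category:Sith": ["Darth Vader", "Revan"],
    "Category:Jedi Masters": ["Yoda", "Luke Skywalker"],
}


def crawl_categories(root, get_subcategories, get_members):
    """Breadth-first walk of the category network, collecting unique characters."""
    seen_categories = {root}
    characters = set()  # a set removes characters listed in several categories
    queue = deque([root])
    while queue:
        category = queue.popleft()
        characters.update(get_members(category))
        for child in get_subcategories(category):
            if child not in seen_categories:  # never visit a category twice
                seen_categories.add(child)
                queue.append(child)
    return seen_categories, characters


categories, characters = crawl_categories(
    "Category:Individuals",
    lambda c: SUBCATEGORIES.get(c, []),
    lambda c: MEMBERS.get(c, []),
)
print(len(categories), sorted(characters))
```

In a real scraper, the two lambdas would be replaced by functions that download and parse the category pages.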

The second part of the scraping consists of creating the graph of the characters. Remember that we already fetched our characters so we just need to create links between them. To do that, our robot reads each page and looks for occurrences of other characters.

To summarize, our graph database contains all the categories, all the characters, and the relations between them (categories and characters). A character A is linked to another character B if B’s name appears on A’s wiki page.
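The linking rule can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the page texts are invented, `build_links` is a hypothetical helper, and a plain substring test stands in for whatever matching the real robot performs.

```python
# Hypothetical page texts; the real ones come from the scraped wiki pages.
PAGES = {
    "Luke Skywalker": "Son of Darth Vader, trained by Yoda.",
    "Darth Vader": "Father of Luke Skywalker.",
    "Yoda": "Jedi Master who trained Luke Skywalker.",
    "Revan": "A Jedi and Sith of the Old Republic.",
}


def build_links(pages):
    """Return directed edges (A, B) where B's name appears in A's page text."""
    edges = set()
    for source, text in pages.items():
        for target in pages:
            # Naive substring match: crude, but enough to show the idea.
            if target != source and target in text:
                edges.add((source, target))
    return edges


edges = build_links(PAGES)
for source, target in sorted(edges):
    print(source, "->", target)
```

Note that this naive version compares every page against every name, which gets slow with about 20k characters; a real run would need something smarter, such as only looking at the links found on each page.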

This part was the longest to code and run: scraping all the data takes 2 days to complete.

What are the most connected characters?

Thanks to our new graph, we can find the most connected characters of the universe. Here we simply count the total number of edges for each node of our graph. Unsurprisingly, the most connected characters are well known: 14 of the 15 are in the movies.
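Counting edges per node is easy to sketch with a `Counter`. The edge list below is toy data and `degree_ranking` is a hypothetical helper; I assume here that both incoming and outgoing links count toward the total.

```python
from collections import Counter

# Toy directed edges (source page, mentioned character); not real data.
EDGES = [
    ("Luke Skywalker", "Darth Vader"),
    ("Darth Vader", "Luke Skywalker"),
    ("Luke Skywalker", "Yoda"),
    ("Yoda", "Luke Skywalker"),
    ("Darth Vader", "Palpatine"),
]


def degree_ranking(edges, top=3):
    """Rank nodes by total number of edges (incoming plus outgoing)."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree.most_common(top)


for name, degree in degree_ranking(EDGES):
    print(name, degree)
```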

If you don’t know Revan, check him out. He is a very famous character from the Old Republic Era. Quoting Wookieepedia:

Revan … was a Human male who played pivotal roles as both Jedi and Sith in the Mandalorian Wars and the Jedi Civil War.

Star Wars most connected characters

Character distribution timeline

Not everyone is familiar with the Star Wars expanded universe eras. Did you know that the timeline spans over 36,000 years?

To see things a bit more clearly, let’s create a timeline chart showing the distribution of characters across these eras. Some of the characters have been discarded (no era info). The extra colors highlight characters who live in several eras. For instance, Darth Vader lives in both the Rise of the Empire era and the Rebellion era (here in green).

Filling missing eras for some characters

Thanks to our graph of characters, we can call some graph theory algorithms to the rescue. More specifically, we can use a label propagation algorithm. The idea is simple: imagine that I am Darth Malgus. If most of my neighbors on the graph belong to the Old Republic era, there is a good chance that I belong to the same era too.

The algorithm will iteratively propagate “labels”, here the era value on each node, to neighbors on the graph. To set a label on an empty node, we count the number of occurrences of each label from its neighbors and take the label with the max count (crudely).
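Here is a minimal sketch of this majority-vote propagation in Python. It is a simplified version written for this post, not a library implementation: the tiny graph and its era labels are toy data, `build_neighbors` and `propagate_labels` are hypothetical names, known labels are never overwritten, and ties are broken arbitrarily.

```python
from collections import Counter


def build_neighbors(edges):
    """Turn an undirected edge list into a node -> set of neighbors mapping."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors


def propagate_labels(neighbors, labels, max_rounds=10):
    """Give each unlabeled node the most frequent label among its neighbors."""
    labels = dict(labels)  # work on a copy, keep the input untouched
    for _ in range(max_rounds):
        updates = {}
        for node, adjacent in neighbors.items():
            if labels.get(node) is not None:
                continue  # known labels are never overwritten
            counts = Counter(
                labels[n] for n in adjacent if labels.get(n) is not None
            )
            if counts:
                updates[node] = counts.most_common(1)[0][0]
        if not updates:
            break  # nothing left to fill in
        labels.update(updates)
    return labels


# Toy graph, for illustration only.
edges = [
    ("Darth Malgus", "Revan"),
    ("Darth Malgus", "Satele Shan"),
    ("Darth Malgus", "Darth Vader"),
    ("Unnamed apprentice", "Darth Malgus"),
]
known = {
    "Revan": "Old Republic",
    "Satele Shan": "Old Republic",
    "Darth Vader": "Rebellion",
}
result = propagate_labels(build_neighbors(edges), known)
print(result["Darth Malgus"], "/", result["Unnamed apprentice"])
```

The apprentice has no labeled neighbor in the first round, so it only gets a label in the second round, once Darth Malgus has received his. That is the "iterative" part of the algorithm.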

A picture is worth a thousand words, so let’s visualize our label propagation algorithm. In the following figures, nodes painted in black will be repainted with their most probable color, based on their neighbors.

Part of the Star Wars character graph colored by era. Black nodes represent missing values. Red nodes belong to the Rise of the Empire Era, blue nodes to the Rebellion era, green ones to both eras.

How can we be sure that new labels are actually correct?

Well, we cannot be sure if we really don’t know the labels. But to test our algorithm and tune its parameters, we can extract a subset of the network whose labels are known and remove some of them to simulate missing values. Since we know the ground truth, we can compute an accuracy score and assess the performance of the algorithm. In this particular example, with 40% of the data missing, the accuracy is 0.8. This means that when we know only 60% of the values, 8 out of 10 of the labels we fill in are correct. If that doesn’t look like much, let me tell you that it’s actually very good (for such a simple algorithm).
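This hide-and-check procedure can be sketched as follows. Everything here is an assumption made for the example: the graph is a toy made of two separate cliques (so the task is far easier than on the real data), `masked_accuracy` is a hypothetical helper, and the propagation function is the same simplified majority vote as before, repeated so that the snippet runs on its own.

```python
import random
from collections import Counter
from itertools import combinations


def propagate_labels(neighbors, labels, max_rounds=10):
    """Simplified majority-vote label propagation (known labels are kept)."""
    labels = dict(labels)
    for _ in range(max_rounds):
        updates = {}
        for node, adjacent in neighbors.items():
            if labels.get(node) is not None:
                continue
            counts = Counter(
                labels[n] for n in adjacent if labels.get(n) is not None
            )
            if counts:
                updates[node] = counts.most_common(1)[0][0]
        if not updates:
            break
        labels.update(updates)
    return labels


def masked_accuracy(neighbors, truth, missing_ratio=0.4, seed=0):
    """Hide a share of the known labels, propagate, and score the guesses."""
    rng = random.Random(seed)
    hidden = rng.sample(sorted(truth), round(len(truth) * missing_ratio))
    visible = {n: label for n, label in truth.items() if n not in hidden}
    predicted = propagate_labels(neighbors, visible)
    correct = sum(predicted.get(n) == truth[n] for n in hidden)
    return correct / len(hidden)


# Toy graph: two separate groups of five fully connected characters.
group_a = [f"a{i}" for i in range(5)]
group_b = [f"b{i}" for i in range(5)]
neighbors = {n: set() for n in group_a + group_b}
for group in (group_a, group_b):
    for x, y in combinations(group, 2):
        neighbors[x].add(y)
        neighbors[y].add(x)

truth = {n: "Old Republic" for n in group_a}
truth.update({n: "Rebellion" for n in group_b})

print(masked_accuracy(neighbors, truth, missing_ratio=0.4))
```

Changing `seed` and averaging the score over several runs gives a more stable estimate than a single split.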

The result of the label propagation algorithm. Black nodes have been replaced by the best compromise using their neighbors.

What’s next? Star Wars faction graph

A lot can be done with this dataset, especially when we harness the power of graph theory. For instance, let’s visualize the graph of the principal factions in the Star Wars universe. The node size is proportional to the number of characters in the faction. The color of each node follows the standard Star Wars color scheme: reds for the dark side, blues for the light side. Criminal organizations are in yellow. The links between factions summarize interactions between characters (edges of the character graph).
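One possible way to build such a faction graph from the character graph is sketched below. The aggregation rule is my assumption for the example (a link’s weight is the number of character edges crossing two factions), the data is a toy, and `faction_graph` is a hypothetical helper.

```python
from collections import Counter

# Toy data, for illustration only.
FACTION_OF = {
    "Luke Skywalker": "Rebel Alliance",
    "Leia Organa": "Rebel Alliance",
    "Darth Vader": "Galactic Empire",
    "Jabba": "Hutt Cartel",
}
CHARACTER_EDGES = [
    ("Luke Skywalker", "Darth Vader"),
    ("Leia Organa", "Darth Vader"),
    ("Luke Skywalker", "Leia Organa"),
    ("Luke Skywalker", "Jabba"),
]


def faction_graph(character_edges, faction_of):
    """Collapse character edges into weighted links between factions."""
    sizes = Counter(faction_of.values())  # node size: characters per faction
    weights = Counter()
    for a, b in character_edges:
        fa, fb = faction_of.get(a), faction_of.get(b)
        if fa is None or fb is None or fa == fb:
            continue  # skip unknown factions and links inside one faction
        weights[tuple(sorted((fa, fb)))] += 1
    return sizes, weights


sizes, weights = faction_graph(CHARACTER_EDGES, FACTION_OF)
print(dict(sizes))
print(dict(weights))
```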

Your eyes can deceive you. Don’t trust them (Obi-Wan Kenobi, ep. 4)

At some point, I have to conclude this post. But this is not over, trust me. If you don’t, well, I find your lack of faith disturbing… Want to have a peek at what comes next?

Let me tell you that the universe has 464 Jedi and Sith masters so far!