musings on the intersections of the world

Author: Sarah

The world is, to varying degrees, sheltering-in-place during this global coronavirus pandemic. In March, the pandemic started to affect me personally:

I started working from home on March 6th.

Governor Gavin Newsom announced on March 11 that any gatherings over 250 people were strongly discouraged, effectively cancelling all concerts for the month of March.

On March 16th, the mayor of San Francisco, along with several other counties in the area, announced a shelter-in-place order.

Ever since then, I’ve been at home. Given all these changes in my life, I was curious what new patterns I might see in my music listening habits.

With large gatherings prohibited, I went to my last concert on March 7th. With gatherings increasingly cancelled nationwide, and touring musicians postponing and cancelling events, Beatport hosted the first livestream festival, “ReConnect. A Global Music Series”, on March 27th. Many more followed.

Because I’m me, and I have so much data about my music listening patterns, I wanted to explore what trends might be emerging in my personal habits. I analyzed the months March, April, and May during 2020, and in some cases compared that period against the same period in 2019, 2018, and 2017. The screenshots of data visualizations in this blog post represent data points from May 15th, so it is an incomplete analysis and comparison, given that May in 2020 is not yet complete.

Looking at my listening habits during this time period, with key dates highlighted, it’s clear that the very beginning of the crisis didn’t have much of an effect on my listening behavior. However, after the shelter-in-place order, the amount of time I spent listening to music increased. After that increase it’s remained fairly steady.

Key dates such as the first case in the United States, the first case in California, and the first case in the Bay Area are highlighted along with other pandemic-relevant dates.

Listening behavior during March, April, and May over time

When I started my analysis, I looked at my basic listening count from traditional music listening sources. I use Last.fm to scrobble my listening behavior in iTunes, Spotify, and the web from sites like YouTube, SoundCloud, Bandcamp, Hype Machine, and more.

If you just look at 2018 to 2020, it seems like my listening habits are trending upward, maybe with a culmination in 2020. But comparing against 2017, it isn’t much of a difference. I listened to 25% fewer tracks in 2018 compared with 2017, 19% more tracks in 2019 compared with 2018, and 25% more tracks in 2020 compared with 2019.
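To see why adding 2017 flattens the trend, the three year-over-year changes can be chained together. This is a minimal sketch in Python that uses only the rounded percentages quoted above, so the result is approximate; the function name `compound_change` is made up for illustration.

```python
def compound_change(changes):
    """Chain relative changes (for example -0.25 for 25% fewer) into one overall change."""
    factor = 1.0
    for change in changes:
        factor *= 1.0 + change
    return factor - 1.0


# Rounded year-over-year changes from the paragraph above:
# 2017 to 2018, 2018 to 2019, 2019 to 2020.
year_over_year = [-0.25, 0.19, 0.25]

overall = compound_change(year_over_year)
print(f"2020 compared with 2017: {overall:+.1%}")
```

With these rounded inputs, 2020 comes out only about 12% above 2017, which is why the apparent upward trend from 2018 onward looks much less dramatic once 2017 is included.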

If I break that down by when I was listening by comparing my weekend and weekday listening habits from the previous 3 years to now, there’s still perhaps a bit of an increase, but nothing much.

With just the data points from Last.fm, there aren’t really any notable patterns. But number of tracks listened to on Spotify, SoundCloud, YouTube, or iTunes provides an incomplete perspective of my listening habits. If I expand the data I’m analyzing to include other types of listening—concerts attended and livestreams watched—and change the data point that I’m analyzing to the amount of time that I spend listening, instead of the number of tracks that I’ve listened to, it gets a bit more interesting.

While the number of tracks I listened to from 2019 to 2020 increased only 25%, the amount of time I spent listening to music increased by 74%, a full 150 hours more than the previous year during this time period. And May isn’t even over yet!

It’s worth briefly noting that I’m estimating, rather than directly calculating, the amount of time spent listening to music tracks and attending live music events. To make this calculation, I’m using an estimate of 3 hours for each concert attended, 4 hours for each DJ set attended, 8 hours for each festival attended, and an estimate of 4 minutes for each track listened to, based on the average of all the tracks I’ve purchased over the past two years. Livestreamed sets are easier to track, but some of those are estimates as well because I didn’t start keeping track until the end of April.
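The estimate described above can be sketched in a few lines of Python. The per-event hours and the 4 minutes per track come from the paragraph above; the function name, the dictionary keys, and the counts in the example are hypothetical and are not my real data.

```python
# Per-event estimates quoted above, in hours.
HOURS_PER_EVENT = {"concert": 3.0, "dj_set": 4.0, "festival": 8.0}
# Average track length estimate quoted above, in minutes.
MINUTES_PER_TRACK = 4.0


def estimate_listening_hours(tracks, events, livestream_hours=0.0):
    """Estimate total listening time in hours.

    tracks: number of tracks scrobbled
    events: mapping of event type to number attended
    livestream_hours: tracked or estimated livestream time, already in hours
    """
    track_hours = tracks * MINUTES_PER_TRACK / 60.0
    event_hours = sum(HOURS_PER_EVENT[kind] * count for kind, count in events.items())
    return track_hours + event_hours + livestream_hours


# Made-up counts, purely to exercise the function.
example = estimate_listening_hours(
    tracks=1500,
    events={"concert": 2, "dj_set": 1, "festival": 0},
    livestream_hours=12.5,
)
print(example)
```

Keeping the per-event assumptions in one dictionary makes it easy to see how sensitive the totals are to those guesses: changing the concert estimate from 3 hours to 2.5, for example, only requires editing one value.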

I spent an extra 150 hours listening to music this year during this time—but when was I spending this time listening? If I break down the amount of time I spent listening by weekend compared with weekdays, it’s obvious:

Before shelter-in-place, I’d spend most of my weekends outside, hanging out with friends, or attending concerts, DJ sets, and the occasional day party. Now that I’m spending my weekends largely inside and at home, coupled with the number of livestreaming festivals, I’m spending much more of that time listening to music.

I was curious if perhaps working from home might reveal new weekday listening habits too, but the pattern remains fairly consistent. I also haven’t worked from home for an extended period before, so I don’t have a baseline to compare it with.

It’s clear that weekends are when I’m doing most of my new listening, and that this new listening likely isn’t coming from my traditional listening habits. If I split the amount of time that I spend listening to music by the type of listening that I’m doing, the source of the added time spent listening is clear.

Hello, livestreams. If you look closely you can also spy the sliver of a concert that I attended on March 7th.

Livestreams dominate, and so does Shazam

All of the livestreams I’ve been watching have primarily been DJ sets. Ordinarily, when I’m at a DJ set, I spend a good amount of time Shazamming the tracks I’m hearing. I want to identify the tracks that I’m enjoying so much on the dancefloor so I can track them down, buy them, and dig into the back catalog of those artists.

So I requested my Shazam data to see what’s happening now that I’m home, with unlimited, shameless, and convenient access to Shazam.

For the time period that I have Shazam data for, the correlation of Shazam activity to number of livestreams watched is fairly consistent at roughly 10 successful Shazams per livestream.

Given the correlation of Shazam data, as well as the continued focus on watching DJ sets, I wanted to explore my artist discovery statistics as well. Especially when it seemed like my listening activity hadn’t shifted much, I was betting that my artist discovery statistics had been increasing during this time. If I look at just the past few years, there seems to be a direct increase during this time period.

However, after I add 2017 into the list as well, the pattern doesn’t seem like much of a pattern at all. Perhaps by the end of May, there will be a correlation or an outsized increase. But at least for now, the added number of livestreams I’ve been watching doesn’t seem to be producing an equivalently high number of artist discoveries, even though they’re elevated compared with the last two years.

It could also be that the artists I’m discovering in the livestreams haven’t yet had a substantial effect on my non-livestream listening patterns, even if there are 91 hours of music (and counting) in my quarandjed playlist where I store the tracks that catch my ear in a quarantine DJ set. Adding music to a playlist, of course, is not the same thing as listening to it.

Livestreaming as concert replacement?

Shelter-in-place brought with it a slew of event cancellations and postponements. My live events calendar was severely affected. As of now, 15 concerts were affected in the following ways:

The amount of time that I spend at concerts compared with watching livestreams is also starkly different.

I’ve spent 151 hours (and counting) watching livestreams, the rough equivalent of 50 concerts—my entire concert attendance of last year. This is almost certainly because I’m often listening to livestreams, rather than watching them happen.

Concerts require dedication—a period of time where you can’t really do anything else, a monetary investment, and travel to and from the show. Livestreams don’t have any of that, save a voluntary donation. That makes it easier to turn on a stream while I’m doing other things. While listening to a livestream, I often avoid engaging with the streaming experience. Unless the chat is a cozy few hundred folks at most, it’s a tire fire of trolls and not a pleasant experience. That, coupled with the fact that sitting on my couch watching a screen is inherently less engaging than standing in a club with music and people surrounding me, means that I’m often multitasking while livestreams are happening.

The attraction for me is that these streams are live, and they’re an event to tune into, and if you don’t, you might miss it. Because it’s live, you have the opportunity to create a shared collective experience. The chatrooms that accompany live video streams on YouTube, Twitch, and especially with Facebook’s Watch Party feature for Facebook Live videos, are what foster this shared experience. For me, it’s about that experience, so much so that I started a chat thread for Jamie xx’s 2020 Essential Mix so that my friends and I could experience and react to the set live. This personal experience is contrary to the conclusion drawn in this article on Hypebot called Our Music Consumption Habits Are Changing, But Will They Remain That Way? by Bobby Owsinski: “Given the choice, people would rather watch something than just listen.” Given the choice, I’d rather have a shared collective experience with music than just sit alone on my couch and listen to it.

Of course, with shelter-in-place, I haven’t been given a choice between attending concerts and watching livestreamed shows. It’s clear that without a choice, I’ll take whatever approximation of live music I can find.

I’ve been thinking for some time about the derived metadata that Spotify and other digital streaming services construct from the music on their platforms. Spotify’s current business revolves around providing online streaming access to music and podcasts, as well as related content like playlists, to users.

Like any good SaaS business, their primary goal is to acquire and keep customers. As a digital streaming service business, the intertwined goal is to provide quality content to those customers. The best way to do both of those is to derive and collect metadata about customer usage patterns, but also about the content being delivered to the customers. The more you know about the content being delivered, the more you can create new distribution mechanisms for the content and make informed deals to acquire new content.

Creating metadatasets from the intellectual property of artists

Today, when labels and distributors provide music to digital streaming services (artists can’t provide it directly), they grant those services permission to make the music tracks available to users of the digital streaming services. Based on my review of the Spotify Terms and Conditions of Use, Spotify for Artists Terms and Conditions, and the distribution agreement for a commonly-used distribution service, DistroKid, artists don’t grant explicit permission for what services do next—create metadata about those tracks. An exception relevant with the DistroKid distribution agreement is if artists sign up for an additional service, DistroLock, they then are bound by an additional addendum granting the service permission to create an audio fingerprint to uniquely represent the track so that it can be used for copyright enforcement and possibly to pay out royalties.

In his book Metadata, Jeffrey Pomerantz defines metadata as “a means by which the complexity of an object is represented in a simpler form.” In this case, streaming services like Spotify create different types of metadata to represent the complexity of music with various audio features, audio analysis statistics, and audio fingerprints. The services also gather “use metadata” about how customers use their services—at what point in a song a person hits skip, what devices they use to listen, their location when listening, and other data points.

Creating metadatasets is crucial to delivering content

Pandora has patents for the types of music metadata that they create, the metadata behind the “Music Genome Project”. Spotify also has patents (and a crucial one from their acquisition of the Echo Nest) to do the same, as well as many that cover the various applications of those metadata.

Spotify currently provides a subset of the insights they derive from the combination of use metadata with music track metadata to artists with the Spotify for Artists service. The end user license agreement for the service makes it clear that it’s a free service and Spotify cannot be held responsible for the relative accuracy of the data available. Emphasis mine:

Spotify for Artists is a free service that we are providing to you for use at our discretion. Spotify for Artists may provide you with the ability to view demographic data on your fans and usage data of your music. While we work hard to ensure the accuracy of the data, we do not guarantee that the Spotify for Artists Service or the data that we collect from the Service will be error-free or that mistakes, including mistakes in the data insights that we provide to you, will not happen from time to time.

These insights, provided to artists, labels, and distributors, guide marketing campaigns, tour planning, artist-specific investments, and even music production styles. Thing is, it’s tough to decipher exactly how these companies create the metadatasets that all these valuable insights rely on, and how the accuracy of that metadata is (if at all) validated.

How the metadatasets get made

In an episode of Vox Earworm, the journalist Matt Daniels of The Pudding and Estelle Caswell of Vox briefly discuss how the metadatasets of Spotify and Pandora were created, pointing out that Spotify has 35 million songs, but the metadataset is algorithmically generated. Meanwhile, Pandora has only 2 million songs, but its 450 total attributes were defined and applied to the songs by a combination of trained musicologists and algorithms. Their discussion starts at 1:45 in this episode and continues for about 90 seconds.

The features in the metadatasets have been defined by algorithms written by trained musicologists, amateur musicians, or even ordinary data scientists without musical training or expertise. The specific features collected by Spotify are publicly available in their audio features API and audio analysis API endpoints, and both include metadata that objectively describe each track, such as duration, as well as more subjective features such as acousticness, liveness, valence, and instrumentalness.
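To make the split between objective and subjective features concrete, here is a minimal sketch in Python. The field names match the ones Spotify documents for its audio features response, but the grouping into two sets only covers the examples named above and is an assumption of this sketch, as are the function name and the sample values, which are invented and do not describe a real track.

```python
# Features named above as subjective; field names as documented by Spotify.
SUBJECTIVE_FEATURES = {"acousticness", "liveness", "valence", "instrumentalness"}
# Feature named above as objective; Spotify reports duration in milliseconds.
OBJECTIVE_FEATURES = {"duration_ms"}


def split_features(track_features):
    """Separate a track's feature dictionary into objective and subjective parts."""
    objective = {k: v for k, v in track_features.items() if k in OBJECTIVE_FEATURES}
    subjective = {k: v for k, v in track_features.items() if k in SUBJECTIVE_FEATURES}
    return objective, subjective


# Invented values for illustration only.
sample = {
    "duration_ms": 215000,
    "acousticness": 0.12,
    "liveness": 0.08,
    "valence": 0.61,
    "instrumentalness": 0.90,
}

objective, subjective = split_features(sample)
print("objective:", objective)
print("subjective:", subjective)
```

The subjective values are scores on a 0 to 1 scale produced by a model rather than measured directly, which is why their accuracy is harder to check than something like duration.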

The more detailed audio analysis API splits each track into various sections and segments, and computes features and confidence levels for each of the sections and segments.

The Echo Nest patent describes three types of metadata:

Acoustic metadata, which is the “numerical or mathematical representation of the sound of a track”,

Cultural metadata, which “refers to text-based information describing listener’s reactions to a track or song”, and

Explicit metadata, which “refers to factual or explicit information relating to music”.

The explicit metadata is information such as “track name” or “artist name” or “composer”, while the acoustic metadata can be an acoustic fingerprint to represent the song, or can include features like “tempo, rhythm, beats, tatums, or structure, and spectral information such as melody, pitch, harmony, or timbre.” The cultural metadata is where the more subjective features come from, and it can come from a variety of different subjective sources: “expert opinion such as music reviews”, “listeners through Web sites, chat rooms, blogs, surveys, and the like”, as well as information “generated by a community of listeners and automatically retrieved from Internet sites, chat rooms, blogs, and the like.” The patent gives other examples such as “sales data, shared collections, lists of favorite songs, and any text information that may be used to describe, rank, or interpret music.” It can also build off of existing databases made available by companies like Gracenote, AllMusic (referenced as AMG, now RhythmOne, in the patent), and others.

We know a little about how Spotify and Pandora create their metadatasets. We know less about how representative those metadatasets are, both in terms of feature coverage and music coverage.

We barely know which features Pandora has available, and even with a decent idea of what Spotify has available, it’s possible that the features that exist in the metadatasets are incomplete. The features in the metadatasets could be limited to those that were the easiest to compute at the time, those that are deemed interesting by the creators, or even those that are highly-correlated with profitable user behavior. It’s expensive to create, store, and apply new metadata features, so businesses must have a clear value proposition before developing new models or tasking more musicologists with the creation of a unique audio feature.

Based on the locations of Spotify, Pandora, and the companies informing their metadatasets, it’s likely that the datasets that these metadatasets and their features are built on aren’t representative of music worldwide but instead include bias toward music that is easily available in their geographic locations.

The size of the datasets that underpin the metadata creation varies (Pandora has 2 million tracks, Spotify has 35 million), but the representativeness of the data sample is more important than the size. And that is a variable that we have almost no information about.

I haven’t done (and can’t do) the data analysis to determine the distribution of tracks in those giant datasets. Without that I can only speculate:

It’s possible that both of them have a disproportionate concentration of artists that create and record music in the United States and Western Europe.

It’s almost certain that both of those datasets contain only music recorded in the digital or digital-adjacent eras. Music recorded in analog tape eras that hasn’t been digitized can’t be represented in the datasets.

It’s unlikely that the datasets include music by artists lacking the internet connection necessary to digitally distribute their music, even if it is digitized.

We could learn more about the representativeness of the datasets used to create the metadatasets if we knew more about how the metadatasets themselves are validated. But again, that’s another area that lacks clarity.

How the metadatasets get validated… or not

The uniqueness of their businesses is built on these metadatasets, but it doesn’t seem like there are processes in place to validate the features developed and in use by Pandora and Spotify across the industry. There’s no central database of tracks that I know of, a “Tom’s Diner” of audio feature validation, that can be used to tune the accuracy of audio features that exist in multiple industry metadatasets. Instead, much like the lossy compression of an MP3, there is just the “close enough for our purposes” approximation for validation.

Spotify uses a prediction model to predict the subjective (and harder-to-compute) features such as liveness, valence, danceability, and presence of spoken word lyrics. In the patent filing, they disclose the validation methods used for the features predicted by that model:

Comparing the results of the model to a “ground truth dataset” created from already-labeled data sourced in part from “crowdsourced online music datasets such as SOUNDCLOUD, LAST.FM, and the like” [sic].

Evaluating the percentage of true positives, false negatives, and true negatives returned by the model predictions for features with a binary value (true or false).
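The second method, scoring a binary feature against already-labeled data, can be sketched as follows. This is a generic illustration in Python and not Spotify’s implementation; the function name and the labels are invented, and the sketch also reports false positives for completeness even though the list above doesn’t mention them.

```python
def confusion_rates(predicted, actual):
    """Return each confusion-matrix cell as a fraction of all labeled tracks."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must be the same length")
    if not actual:
        raise ValueError("at least one labeled example is required")

    counts = {
        "true_positive": 0,
        "false_positive": 0,
        "true_negative": 0,
        "false_negative": 0,
    }
    for guess, truth in zip(predicted, actual):
        if guess and truth:
            counts["true_positive"] += 1
        elif guess and not truth:
            counts["false_positive"] += 1
        elif not guess and truth:
            counts["false_negative"] += 1
        else:
            counts["true_negative"] += 1

    total = len(actual)
    return {name: count / total for name, count in counts.items()}


# Invented labels: does the model think each of five tracks is a live recording?
model_says_live = [True, True, False, False, True]
ground_truth_live = [True, False, False, True, True]

rates = confusion_rates(model_says_live, ground_truth_live)
print(rates)
```

The result is only as trustworthy as the ground truth labels themselves: if the labeled data is wrong or unrepresentative, these percentages can look healthy while the feature is still inaccurate.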

The patent then describes taking appropriate steps to bolster training data and improve coverage of the datasets to produce more accurate results in response to the validation results. However, since this is a patent filing rather than a blog post describing their data science practices, we don’t know how often the prediction models and training datasets are updated, or what other methods are used to compile and validate the training datasets themselves.

Lacking an objectively true value for many of these audio features, it’s difficult for services to reliably validate their metadatasets. In fact, rather than comparatively validating their metadatasets, many of the metadatasets are built on top of each other. The Spotify patent for the prediction model makes it clear that the “ground truth dataset” used for validation is partially sourced from other metadatasets. This Echo Nest patent that I discussed earlier makes it clear that different types of metadata can come from pre-existing metadatasets.

Without large-scale understanding of metadata validity across these existing metadatasets, it’s likely that errors and biases in the metadata can proliferate as new ones are created. Eventually, that lack of quality metadata can have a disproportionate effect on the artists creating the music that this metadata is derived from.

Why metadata quality matters

Spotify and Pandora both rely extensively on these metadatasets to deliver valuable streaming services to customers and to create engaging content like playlists and stations for their listeners. Spotify has positioned itself as a valuable distribution and marketing mechanism for artists, to the point that they’ve devised a new scheme where artists and labels can pay for privileges like prominent playlist placement or spotlights in Spotify.

Metadata underpins the business model of these companies, shaping our experience of music by directly affecting how music is distributed and consumed. But we don’t know how valid the metadata is, we don’t know if it’s biased, and we don’t know how much of a feedback loop is involved in its interpretation to create new distribution and consumption mechanisms.

If these companies don’t do more to improve the quality of metadata, artists can lose revenue and miss out on distribution opportunities. Listeners can get bored by the sameness of playlists, or the inaccurate interpretations of their radio station requests, and stop using Spotify and Pandora to discover new music. Without representative and valid metadata, music loses.

What went into writing this

I read a lot over the past few months that informed my thinking in this essay, or some of the points that I made, without being something I quoted or linked directly in the text. I’m also grateful for the conversations I had with my former colleague Jessica about this topic, and the feedback that my former colleague Neal gave me on an earlier version of this post.

Spotify background

I read the Spotify API documentation for the audio features and audio analysis endpoints.

Ticket buying in the modern era is pretty brutal. You find out your favorite artist is coming to town, and with any luck, you discover this before the tickets go on sale. Then you start planning to get tickets. Set up a calendar reminder with a link to the site, then you get ready. If there are presales, you ask friends or you check emails — if you’re a dedicated concertgoer, you probably get emails from the promoters, venues, and maybe even your favorite artists’ fan clubs — tracking down the codes.

Then you get ready, mouse pointer cued up at 9:59, waiting until tickets go on sale. The time flips, it’s 10:00 AM and you click! Prepared to quickly select 2, best available (or GA floor, because who wants a balcony seat), and add to cart. But wait! You see the dreaded message. You’re in a queue. Now all you can do is desperately stare at the webpage, hoping nothing changes. What if a browser extension interferes? What if your browser freezes up? Finally, you’re out of the queue. You go to select your tickets, but wait. GA is all gone. All that’s left is the seated Loge. For a band that you dance to. Or worse, it’s already sold out. All that time, all that anxiety, all that preparation, only to get shut down.

And that’s just the presale. You’ll do the whole thing over again at the next presale, or during the general onsale, hoping that the artist and the venue were strategic enough to set some tickets aside for each sale. If it comes down to it, you might have to show up to the venue an hour early (or more) before the show starts to get one of the limited tickets available at the door.

This week was a brutal one for ticket sales for me and my friends. A show at a 2000+ capacity venue sold out within a few minutes during the presale, and a second show added later also sold out within minutes. The Format announced their first live dates in years, playing 2 shows each in NYC and Chicago, and 1 show in Phoenix. The presale tickets for all the shows sold out within a minute or, in the case of Phoenix, were plagued by ticket website issues but still managed to sell out by the end of the day. By the time the general ticket sales happened, they’d announced an additional show in each city. The general ticket sales also sold out within minutes, and Phoenix ended up with a third show before the day was up.

How does it happen? And why do we put ourselves through this?!

It’s important to note that buying concert tickets at all is a privilege. Some people (like me) make it a lifestyle to go to concerts and DJ sets. Others save their money and spend big to get great seats to see favorite artists in arena shows. But it takes money, time, and a bit of luck (or planning) to get tickets and get to a show.

Whether or not you manage to get tickets to a show depends on several factors:

Did you hear about the show before the tickets went on sale?

Did you have enough money at the time tickets went on sale (and in general) to afford the tickets?

Is your work schedule stable enough to know that you can go to the show if you buy tickets immediately when they go on sale?

If any one of these factors doesn’t work out, then you don’t have tickets to the show. Whether or not you get the opportunity to see an artist perform in concert at all is up to a whole other set of factors, subject to the careful strategies of the music industry combined with the artistic whims of the performers.

If an artist doesn’t have a big enough fanbase in your city, and if it isn’t geographically convenient with available music venues, the artist probably won’t stop in your city. Even if they stop, the venue size can play a crucial role in whether or not you’ll get tickets to the show—will they be available, and will you even want them?

Artists, especially after they’ve “gotten big”, can crave smaller, more intimate shows. But those are the shows that tend to sell out in a minute, especially if the fanbase in a certain city is larger than anticipated or if the artist is only playing a limited number of shows and ends up drawing people from outside the ordinary reach of a venue.

Other times, artists can analyze the size of their fanbase in a city and then choose a venue without considering whether the venue size is appropriate for their type of music. Bon Iver toured 20,000+ seat arenas on their last tour, even though they’re famous for their intimate music and have videos on YouTube with hundreds of thousands of views of Justin Vernon playing to just 1 fan. Even if an artist’s fanbase is large enough to fill an arena, the fans still might not want to buy tickets to see them in an arena.

Beyond those considerations, artists can’t always play the venues they want to play due to promoter restrictions or other industry partnerships, sometimes leading to uncharacteristic bookings at oddly-sized or oddly-shaped venues: DJs playing a concert hall, rock bands in a semi-seated venue, or possibly even skipping a city entirely.

The venue an artist chooses (or is forced to choose) can be a key factor when you’re deciding if you want to get tickets. But the artist (and their tour manager, and others) have still more to do before this concert happens.

The ticket prices have to be set. Surely venues and promoters have set costs and prices that end up as effective ticket minimums for many shows, but artists certainly have a level of influence as well. Especially high-profile artists like Taylor Swift have chosen on past tours to make affordable tickets available to their fans.

And therein lies the rub: artists can price competitively, or highly, knowing they can charge a certain price and still sell out their show (or nearly sell it out). But they can also price affordably, hoping that legitimate fans will be able to snap up tickets when they go on sale, rather than delaying their purchase and being forced to buy from scalpers.

OK so we’re still trying to buy these concert tickets. You’ve heard about the show, the artist has booked the venue and priced their tickets, you’ve got the money, you’ve got the time, you are ready at 10am on a Friday (or a Wednesday or a Thursday for those sweet sweet presale tickets). Where are you buying your tickets?

If it isn’t a site you’ve used before, you might want to consider if it requires an account to buy tickets. If it does, you have to make one and make sure you’re signed in before you try to buy the tickets. You also want to consider if the show is big enough that you’ll end up in a queue to buy the tickets, and if the site is reliable enough to handle the load of a lot of people trying to buy tickets without crashing or throwing an error.

Beyond site reliability, you have to consider your personal threshold for every ticket-buyer’s worst nightmare: fees. Almost every ticket purchase includes fees. How high do the fees need to be before you abandon your ticket purchase entirely?

Of course, the irony of ticket fees is that most fans (myself included) dislike paying them because for so long the fees were hidden: last-minute additions to your total, spiking the cost of $35 tickets to $60 at times. But it can be argued that transparently-disclosed fees are acceptable, and even necessary to provide a resilient, secure, reliable ticketing site, as well as to pay the promoters working hard to make sure your favorite band actually stops in your city.

Artists, promoters, venues, and ticketing sites do a lot to try to prevent ticket scalpers from bombing the market and selling out a show in minutes only to relist the tickets minutes later at unbelievable prices. Their tactics include innovations in ticket technology, new marketplaces, and just plain making it harder to get tickets:

Ticketmaster added the ability for artists or venues to require a rotating barcode for mobile tickets, preventing screenshots of tickets from being sold. Unfortunately, this was enforced with disastrous effect at a Black Keys show in LA. Anti-reselling measures only work if they punish the resellers instead of the buyers, and in this case it was the buyers who were punished.

Artists also commonly say that they’re working with venues and ticket sites to distinguish legitimate ticket purchasers from scalpers.

What makes a ticket purchaser legitimate? Probably some degree of purchasing tickets in a specific geographic region and in clusters of genres, likely combined with some fraud analysis. Then I wonder how suspicious my own ticket purchasing habits must look to the algorithms at times. As long as we’re attempting to define what a legitimate ticket purchaser looks like, we can consider who deserves the presale codes for shows.

There’s a notion that only “real fans” deserve first access to presale codes and tickets. But how do you verify and validate true fans? You could use specific digital consumption patterns, such as those that are probably used to give out Spotify presale codes, but those are limited to only those listening habits that are directly observable in digital data. Artists want people to buy tickets to their shows—that’s why often, presale codes are straightforward to track down.

Most often, getting tickets to a show is a matter of knowing the right people at the right time that might have information you don’t have. Songkick is there to fill in the gaps, alongside emails and texts from promoters and venues. But ultimately, nothing beats having a community of fans. And that was the thing that fascinated me about the article in The Atlantic about the modern ticket scalpers. Me and my friends, we use many of the same tactics to buy tickets. It’s a privilege and a challenge to get the tickets we want, but we love going to concerts. And often, it feels like it’s the only way these days we can help artists make money.

Google has created a dataset search for researchers or the average person looking for datasets. On the one hand, this is a cool idea. Datasets are hard to find in some cases, and this ostensibly makes the datasets and accompanying research easier to find.

In my opinion this dataset search is problematic for two main reasons.

1. Positioning Google as a one-stop-shop for research is risky.

There’s consistent evidence that many people (especially college students who don’t work with their library) start and end their research with Google, rather than using scholarly databases, limiting the potential quality of their research. (There’s also something to be said here about the limiting of access to quality research behind exploitative and exclusionary paywalls, but that’s for another discussion).

Google’s business goal of being the first and last stop for information hunts makes sense for them as a company. But such a goal doesn’t necessarily improve academic research, or the knowledge that people derive based on information returned from search results.

2. Datasets without datasheets easily lead to bias.

The dataset search is clearly focused on indexing and making more available as many datasets as possible. The cost of that is continuing sloppy data analysis and research due to the lack of standardized Datasheets for Datasets (for example) that fully expose the contents and limitations of datasets.

The existing information about these datasets is constructed based on the schema defined by the dataset author, or perhaps more specifically, the site hosting the dataset. It’s encouraging that datasets have dates associated with them, but I’m curious where the descriptions for the datasets are coming from.

Only the description and the name fields for the dataset are required before a dataset appears in the search. As such, the dataset search has limitations. Is the description for a given dataset any higher quality than the Knowledge Panels that show up in some Google search results? How can we as users independently validate the accuracy of the dataset schema information?

The quality of, and details provided in, the description field vary widely across datasets (I did a cursory scan of datasets resulting from a keyword search for “cheese”), indicating that a required plain-text field doesn’t do much to assure quality and valuable information.

When datasets are easier to find, that can lead to better data insights for data analysts. However, it can just as easily lead to off-base analyses if someone misuses data that they found based on a keyword search, either intentionally or, more likely, because they don’t fully understand the limitations of a dataset.

Some vital limitations to understand when selecting a dataset for use in data analysis are things like:

What does the data cover?

Who collected the data?

For what purpose was the data collected?

What features exist in the data?

Which fields were collected and which were derived?

If fields were derived, how were they derived?

What assumptions were made when collecting the data?
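One way to picture what “making limitations as visible as the dataset” could look like is a small record that holds an answer to each of the questions above and reports which ones are still blank. This is only a sketch: the `DatasetContext` class, its field names, and the cheese example are made up for illustration and are not part of Google’s dataset search or of any datasheet standard.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DatasetContext:
    """Hypothetical record holding one answer per context question."""
    coverage: Optional[str] = None           # What does the data cover?
    collected_by: Optional[str] = None       # Who collected the data?
    purpose: Optional[str] = None            # For what purpose was it collected?
    features: Optional[str] = None           # What features exist in the data?
    derived_fields: Optional[str] = None     # Which fields were collected vs. derived?
    derivation_method: Optional[str] = None  # How were derived fields derived?
    assumptions: Optional[str] = None        # What assumptions were made?


def unanswered(context: DatasetContext) -> list[str]:
    """Return the names of the context questions that have no answer yet."""
    return [f.name for f in fields(context) if not getattr(context, f.name)]


# Made-up example: a listing with only a name and a description
# answers, at best, the coverage question.
cheese = DatasetContext(coverage="Cheese production by region")
missing = unanswered(cheese)
print(missing)
```

Running a check like this before using a dataset makes the gaps explicit, rather than leaving them to be discovered halfway through an analysis.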

Without these valuable limitations being made as visible as the datasets themselves, I struggle to feel overly encouraged by this dataset search in its current form.

Ultimately, making information more easily accessible while removing or obscuring indicators that can help researchers assess the quality of the information is risky and creates new burdens for researchers.

Spotify’s 2019 Wrapped aims to give you an overview of your past year’s listening habits. It proclaims: these were your top 5 tracks and artists! You spent this much time listening to your favorite artist!

This year (the last year of the decade) they also expanded to all of the 2010s, sharing the top artists and tracks for each year in the decade that you used Spotify.

Because I have my own data that combines Last.fm listening data, my iTunes music library, and concert-relevant activities, this is my comparison of Spotify’s data with my own listening habits (more exhaustively tracked).

I have Last.fm set up to monitor Spotify, but also tracks that I listen to in Google Chrome, using the Music app on my iPhone, and local iTunes listening on my personal laptop. Spotify, of course, just sees Spotify.

According to Spotify, my top 5 artists were:

Tourist

Manatee Commune

Lane 8

Amtrac

SebastiAn

According to my own data, my top 5 artists were:

Tourist

Lane 8

Benoit & Sergio

The Vaccines

Litany

Manatee Commune was just 4 listens behind Litany, with 76 total listens for the year so far. It’s an impressive showing from them, considering that I never ended up purchasing any tracks by them. I own full albums or several tracks by all the other artists in both of my top 5 lists, making it easier for me to rack up listens—I listen only to music that I own or untrackable DJ sets in SoundCloud while I’m mobile.

The track stats are where my data really starts to differ from Spotify’s…

Pretty stark difference in those lists and those numbers. Manatee Commune is nowhere in sight. A large reason for that is that my listening pattern with Manatee Commune almost perfectly lines up with seeing them live (indicated by the orange triangle and dotted line):

But enough about Manatee Commune. Let’s talk about the real star of 2019: Tourist! All of the data agrees that he was my top artist of 2019. I saw him twice in concert (and I’ll see him again in a couple weeks).

I’ve listened to him sporadically since 2012, first listening to a track of his in December 2012, discovering a few tracks every few years following, until I saw him live in March this year. You can see what happened after that in this graph:

Interestingly enough, I went to that show in March 2019 to see Gilligan Moss, who made my top 10 artists last year and were my most popular newly-discovered artist of 2018. If I hadn’t discovered them last year, I probably wouldn’t have gone to that show at all, and this year would have been completely different.

Spotify claims that I spent 8 hours listening to Tourist this year. My own data? Rough calculations estimate that I’ve spent 15 hours and 15 minutes listening to Tourist. To put that in perspective, I spent at least 238 hours listening to music this year. At least 6% of my total listening time was spent on this one artist. Nice.

[I calculated this by counting the listens for specific tracks in my Last.fm data, then looking up the lengths of those tracks in my iTunes data and multiplying the number of listens by the track lengths. Of course this means that I’m not even considering the tracks that aren’t in my library, since I’m missing that metadata.]
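The calculation described above can be sketched in a few lines of Python. The track names, listen counts, and lengths below are placeholder values rather than my real Last.fm or iTunes data, and `listening_seconds` is a hypothetical helper name. The share at the end uses the figures quoted in this post (15 hours 15 minutes out of at least 238 hours).

```python
# Placeholder data: in practice these would come from Last.fm and iTunes exports.
listen_counts = {"Track A": 24, "Track B": 10, "Track C": 5}  # track -> number of listens
track_lengths_sec = {"Track A": 300, "Track B": 240}          # track -> length in seconds


def listening_seconds(counts, lengths):
    """Multiply listens by track length, skipping tracks with no length metadata."""
    total = 0
    skipped = []
    for track, listens in counts.items():
        if track in lengths:
            total += listens * lengths[track]
        else:
            skipped.append(track)  # not in the library, so its time is not counted
    return total, skipped


total_sec, skipped = listening_seconds(listen_counts, track_lengths_sec)
print(total_sec, skipped)

# Share of total listening time, using the figures from this post.
artist_minutes = 15 * 60 + 15  # 15 hours 15 minutes
total_minutes = 238 * 60       # at least 238 hours
share = artist_minutes / total_minutes
print(f"{share:.1%}")
```

Returning the skipped tracks alongside the total keeps the undercount visible: every track in that list is listening time the estimate leaves out.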

According to Spotify, my favorite Tourist song was “Too Late – Continuous Mix”. If I consolidate the two similar tracks in my data (data consistency is hard), that’s also true overall—that track has 24 listens so far (which would actually make it #4 on my top tracks of the year).
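Consolidating near-duplicate track names is mostly a matter of normalizing titles before counting. The sketch below assumes the two variants differ only in dash style, spacing, or capitalization, which may not be how they actually differ in my data. The 14/10 split of listens is invented so that it adds up to the 24 mentioned above, and `normalize_title` is a hypothetical helper.

```python
import re
from collections import Counter


def normalize_title(title: str) -> str:
    """Collapse dash and whitespace variants so near-identical titles match."""
    # Treat a hyphen, en dash, or em dash surrounded by spaces as one separator.
    title = re.sub(r"\s+[-\u2013\u2014]\s+", " - ", title)
    # Collapse repeated whitespace and ignore capitalization.
    return re.sub(r"\s+", " ", title).strip().casefold()


# Invented split of listens across two spellings of the same track.
raw_listens = [
    ("Too Late \u2013 Continuous Mix", 14),
    ("Too Late - Continuous Mix", 10),
]

combined = Counter()
for title, listens in raw_listens:
    combined[normalize_title(title)] += listens

print(combined)
```

Only separators with whitespace on both sides are rewritten, so hyphenated names inside a title are left alone.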

To move beyond Tourist, Spotify also told me that I discovered 1503 new artists, and that Plastic Plates were my favorite of those. Meanwhile in my data, I see that I discovered 2857 artists this year (probably at least 100 of those are random Youtube videos that got mis-filed), with my top 5 discoveries being:

Benoit & Sergio with 86 listens

Kölsch with 55 listens

warner case with 39 listens

Parra for Cuva with 38 listens

Lindstrøm with 28 listens

According to my data, I only listened to Plastic Plates 3 times this year after discovering them on January 2, 2019.

You can see more details about my new discoveries, along with a sparkline of my listening patterns for those artists throughout the year, in this table:

I spent 35,496 minutes listening to music this year, according to Spotify. Spotify’s data is much better than mine in this respect (100% coverage of metadata!) because my data tells me I only spent 14,296 minutes listening to music. In reality, it’s probably closer to the sum of those numbers.

What else happened in 2019 that Spotify doesn’t know about?

My top 10 albums of the year:

This is where you can see data struggles once again. Those two Tourist albums are essentially the same album, just differently-named in Spotify vs iTunes, so that is actually my most-listened-to album of the year. The Hood Internet metadata was incomplete when they shared their 1979-1983 mashup tracks on SoundCloud, so the free downloads that I added to my iTunes library show up without an album. Which is actually technically correct.

In addition to those top 10 albums, here is an area graph showing the total listens of my top 10 artists over 2019:

That looks somewhat exciting until you see those numbers stacked against all the other artists I listened to this year:

I added 344 new tracks (so far) to my iTunes library, and listened to 5,902 different tracks a total of 9,823 times (so far). I went to 49 concerts (with my 50th of the year lined up for tonight!) seeing a total of 136 artists (so far). Numbers!

My most-frequented venues of the year were 1015 Folsom and Audio, followed closely by the Fox Theater, Great American Music Hall, and The Fillmore. The artist I saw most frequently was Teh Raptor (DJ sets), soon to be tied by Tourist when I see him for the third time (in general and in 2019) in a couple weeks.

Spotify was also able to tell me some things that I can’t yet identify, namely that I listened to artists from 73 different countries. I’m hopeful that next year I’ll have additional metadata from the MusicBrainz database set up and correlating with my Splunk indexes.

The 2010s: Best of the Decade

Because 2019 is the last year of the decade, Spotify also added some stats for the entire decade to their #wrapped feature.

My top 5 artists of the 2010s according to Spotify are:

Daughter

Hey Rosetta!

CHVRCHES

Cold War Kids

The Format

According to my data, these artists are my top 5 of the 2010s:

Hey Rosetta! with 1162 listens

Alkaline Trio with 803 listens

Cold War Kids with 743 listens

Manchester Orchestra with 721 listens

CHVRCHES with 674 listens

Motion City Soundtrack just barely missed out on the top 5, with 673 total listens. The Format, meanwhile, are in 10th with 493 total listens. Daughter didn’t make my top 10, but are instead 24th with 314 total listens for the decade.

It’s probably then no surprise to learn that my top album of the decade is Hey Rosetta!’s Second Sight, which features the first 2 songs of my top 5 of the decade, and came out in 2014. My other top albums of the decade:

According to Spotify I’ve been using their service since 2011, but I think it’s actually more like 2013—and this is borne out in their data. They list my top tracks and artists for the decade only starting in 2013. Let’s compare!

I added extra lines for the years when “Trish’s Song” took the top spot because that song is a lullaby and I listen to it accordingly—so perhaps I should consider the second place song as the “true” top song for that year.

Fun fact, my fifth-most-listened-to track of 2014 is The Riff-Off from Pitch Perfect. I watched it so many times on YouTube that y’all now have some idea how obsessed I was (am).

In 2016, Run Away With Me by Carly Rae Jepsen took the top track slot by 1 listen. Ariana Grande’s Into You trailed with 77. Those 2 tracks were on a playlist of only 4 tracks that I listened to a LOT that year. The other 2 tracks from the playlist were my 4th- and 5th-most-listened tracks of the year: Ingrid Michaelson’s Hell No with 52 listens and Adele’s Send My Love (To Your New Lover), also with 52 listens.

My top artists for each year of the past decade are as follows, comparing Spotify’s data with my data. I gotta say, I wasn’t expecting to see Taylor Swift take the top spot for 2014.

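| Year | Spotify | Total Listens | My Data | Total Listens |
|------|---------|---------------|---------|---------------|
| 2010 | – | – | Alkaline Trio | 352 |
| 2011 | – | – | Tegan and Sara | 236 |
| 2012 | – | – | Smoking Popes | 241 |
| 2013 | The Format | 9 | Cold War Kids | 57 |
| 2014 | The Format | 69 | Taylor Swift | 108 |
| 2015 | Hey Rosetta! | – | Hey Rosetta! | 591 |
| 2016 | Jason Derulo | 179 | Hey Rosetta! | 332 |
| 2017 | Cold War Kids | 156 | The xx | 166 |
| 2018 | Poolside | – | Poolside | 162 |
| 2019 | Tourist | – | Tourist | 251 |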

My music listening habits (and possibly also my data fidelity) dropped dramatically in the early/mid-2010s, which is why those numbers are so different compared with other years.

It’s fun to see my overall trend in top artists for the decade. It’s almost like the 2013/2014 dropoff in music listening also coincided with a pivot in terms of what artists I was listening to.

I was also in college until 2012, but Manchester Orchestra and Alkaline Trio and Someone Still Loves You Boris Yeltsin almost totally drop out of the listening patterns after 2013, taken over by Hey Rosetta!, CHVRCHES, The Format, and The xx resurging in 2017.

In total I listened to 27,068 unique tracks 100,240 times since 2010, spending 223,350 minutes (at least) listening to music. I purchased a total of 772 songs from the iTunes store in the last decade, with nearly half of those purchases happening this year.

Considering those total minutes spent listening, Spotify also shared minutes spent listening data all the way back to 2015! I’ve already talked about why my data is so different from Spotify’s (~ incomplete metadata ~) but here’s how the numbers compare:

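| Year | Spotify (minutes) | My Data (minutes) |
|------|-------------------|-------------------|
| 2015 | 11,834 | 20,462 |
| 2016 | 28,659 | 19,959 |
| 2017 | 26,137 | 13,919 |
| 2018 | 35,655 | 16,737 |
| 2019 | 35,496 | 14,296 |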

It’s fun to review the artists that I’ve discovered in the past decade, with Hey Rosetta!, CHVRCHES, Mumford & Sons, Two Door Cinema Club, and Daughter taking the top 5 spots.

I’ve attended 163 concerts so far in the last decade, seeing a total of 404 artists. The distribution of those concerts and artists over time is interesting to look at as well: spikes while I was in college, but not really taking off until I moved to San Francisco and joined a concert community group in the area.

I saw several artists multiple times throughout the decade, some as supporting acts (Future Feats, who I saw as an opening act 3 times, despite not enjoying their sets) and others as a combination of supporting and main acts (such as Smoking Popes).

My most-visited venue of the decade was The Independent, which I’ve been to 14 times. I don’t think I made it to a single show there in 2019, but hopefully I’ll be back for a 15th visit soon.

This has been a lot of data. I shared a similar roundup last year around this time, My 2018 Year in Music: Data Analysis and Insights. It’s fascinating to look back at the entire decade and reflect on how my life has changed, how my music taste and listening habits have shifted (or not) over time, and see the influence of live music attendance in my listening patterns and popular artists. Whether I’m using Spotify, iTunes, the Music app on my phone, SoundCloud, YouTube, or seeing live music, I’m glad I have music in my life.

Splunk software provides powerful data collection, analysis, and reporting functionality. The new slogan, “data is for doing”, alongside taglines like “the data-to-everything platform” and “turn data into answers”, aims to bring the company to the forefront of data powerhouses, where it rightly belongs (I’m biased, I work for Splunk).

There is nuance in those phrases that can’t be adequately expressed in marketing materials, but that is crucial for doing ethical and unbiased data analysis, helping you ultimately find better answers with your data and do even better things with it.

Start with the question

If you start attempting to analyze data without an understanding of a question you’re trying to answer, you’re going to have a bad time. This is something I really appreciate about moving away from the slogan “listen to your data” (even though I love a good music pun). Listening to your data implies that you should start with the data, when in fact you should start with what you want to know and why you want to know it. You start with a question.

Data analysis starts with a question, and because I’m me, I want to answer a fairly complex question: what kind of music do I like to listen to? This overall question, also called an objective function in data science, can direct my data analysis. But first, I want to evaluate my question. If I’m going to turn my data into doing, I want to consider the ethics and the bias of my question.

Consider what you want to know, and why you want to know it so that you can consider the ethics of the question.

Is this question ethical to ask?

Is it ethical to use data to answer it?

Could you ask a different question that would be more ethical and still help you find useful, actionable answers?

Does my question contain inherent bias?

How might the biases in my question affect the results of my data analysis?

Questions like “How can we identify fans of this artist so that we can charge them more money for tickets?” or “What’s the highest fee that we can add to tickets where people will still buy the tickets?” could be good for business, or help increase profits, but they’re unethical. You’d be using data to take actions that are unfair, unequal, and unethical. Just because Splunk software can help you bring data to everything doesn’t mean that you should.

Break down the question into answerable pieces

If my question is something that I’ve considered ethical to use data to help answer, then it’s time to consider how I’ll perform my data analysis. I want to be sure I consider the following about my question, before I try to answer it:

Is this question small enough to answer with data?

What data do I need to help me answer this question?

How much data do I need to help me answer this question?

I can turn data into answers, but I have to be careful about the answers that I look for. If I don’t consider the small questions that make up the big question, I might end up with biased answers. (For more on this, see my .conf17 talk with Celeste Tretto).

So if I consider “What kind of music do I like to listen to?”, I might recognize right away that the question is too broad. There are many things that could change the answer to that question. I’ll want to consider how my subjective preferences (what I like listening to) might change depending on what I’m doing at the time: commuting, working out, writing technical documentation, or hanging out on the couch. I need to break the question down further.

A list of questions that might help me answer my overall question could be:

What music do I listen to while I’m working? When am I usually working?

What music do I listen to while I’m commuting? When am I usually commuting?

What music do I listen to when I’m relaxing? When am I usually relaxing?

What are some characteristics of the music that I listen to?

What music do I listen to more frequently than other music?

What music have I purchased or added to a library?

What information about my music taste isn’t captured in data?

Do I like all the music that I listen to?

As I’m breaking down the larger question of “What kind of music do I like to listen to?”, the most important question I can ask is “What kind of music do I think I like to listen to?”. This question matters because data analysis isn’t as simple as turning data into answers. That can make for catchy marketing, but the nuance here lies in using the data you have to reduce uncertainty about what you think the answer might be. The book How to Measure Anything by Douglas Hubbard covers this concept of data analysis as uncertainty reduction in great detail, but essentially the crux is that for a sufficiently valuable and complex question, there is no single objective answer (or else we would’ve found it already!).

So I must consider, right at the start, what I think the answer (or answers) to my overall question might be. Since I want to know what kind of music I like, I therefore want to ask myself what kind of music I think I might like. Because “liking” and “kind of music” are subjective characteristics, there can be no single true answer that is objective truth. Very few, if any, complex questions have objectively true answers, especially those that can be found in data.

So I can’t turn data into answers for my overall question, “What kind of music do I like?” but I can turn it into answers for more simple questions that are rooted in fact. The questions I listed earlier are much easier to answer with data, with relative certainty, because I broke up the complex, somewhat subjective question into many objective questions.

Consider the data you have

After you have your questions, look for the answers! Consider the data that you have, and whether or not it is sufficient and appropriate to answer the questions.

The flexibility of Splunk software means that you don’t have to consider the questions you’ll ask of the data before you ingest it. Structured or unstructured, you can ask questions of your data, but you might have to work harder to fully understand the context of the data to accurately interpret it.

Before you analyze and interpret the data, you’ll want to gather context about the data, like:

Is the dataset complete? If not, what data is missing?

Is the data correct? If not, in what ways could it be biased or inaccurate?

Is the data similar to other datasets you’re using? If not, how is it different?

This additional metadata (data about your datasets) can provide crucial context necessary to accurately analyze and interpret data in an unbiased way. For example, if I know there is data missing in my analysis, I need to consider how to account for that missing data. I can add additional (relevant and useful) data, or I can acknowledge how the missing data might or might not affect the answers I get.
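For time-based data like listening history, one simple completeness check is to look for days in the range that have no events at all. The sketch below is written in plain Python rather than a Splunk search, and `missing_days` and the sample dates are made up for illustration. A gap it reports could be a real day without music or a day when tracking was broken, so the output is a prompt to investigate and not proof that data is missing.

```python
from datetime import date, timedelta


def missing_days(event_dates, start, end):
    """Return every day from start to end (inclusive) that has no events."""
    seen = set(event_dates)
    total_days = (end - start).days + 1
    all_days = (start + timedelta(days=i) for i in range(total_days))
    return [day for day in all_days if day not in seen]


# Made-up listening dates with two gaps in a four-day window.
listen_dates = [date(2019, 1, 1), date(2019, 1, 3), date(2019, 1, 3)]
gaps = missing_days(listen_dates, date(2019, 1, 1), date(2019, 1, 4))
print(gaps)
```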

After gathering context about your datasets, you’ll also want to consider if the data is appropriate to answer the question(s) that you want to answer.

In my case, I’ll want to assess the following aspects of the datasets:

Is using the audio features API data from Spotify the best way to identify characteristics in music I listen to?

Could another dataset be better?

Should I make my own dataset?

Does the data available to me align with what matters for my data analysis?

You can see a small way that the journalist Matt Daniels of The Pudding considered the data relevant to answer the question “How popular is male falsetto?” for the Vox YouTube series Earworm starting at 1:45 in this clip. For about 90 seconds, Matt and the host of the show, Estelle Caswell, discuss the process of selecting the right data to answer their question, including discussing the size of the dataset (eventually choosing a smaller, but more relevant, dataset) to answer their question.

Is more data always better?

Data is valuable when it’s in context and applied with consideration for the problem that I’m trying to solve. Collecting data about my schedule may seem overly intrusive or irrelevant, but if it’s applied to a broader question of “what kind of music do I like to listen to?” it can add valuable insights and possibly shift the overall answer, because I’ve applied that additional data with consideration for the question that I’m trying to answer.

“how complete, how smart, are these decisions if you’re ignoring vast swaths of your data?”

On the one hand, having more data available can be valuable. I am able to get a more valuable answer to “what kind of music do I like” because I’m able to consider additional, seemingly irrelevant data about how I spend my time while I’m listening to music. However, there are many times when you want to ignore vast swaths of your data.

The most important aspect to consider when adding data to your analysis is not quantity, but quality. Rather than focusing on how much data you might be ignoring, I’d suggest instead focusing on which data you might be ignoring, for which questions, and affecting which answers. You might have a lot of ignored data, but put your focus on the small amount of data that can make a big difference in the answers you find in the data.

“More data lead to better conclusions only when we know how to take advantage of their information. In other words, size does matter, but only if it is used appropriately.”

The most important aspect of adding data to an analysis is exactly as the academics point out: it’s only more helpful if you know what to do with it. If you aren’t sure how to use additional data you have access to, it can distract you from what you’re trying to answer, or even make it harder to find useful answers because of the scale of the data you’re attempting to analyze.

Douglas Hubbard in the book How to Measure Anything makes the case that doing data analysis is not about gathering the most data possible to produce the best answer possible. Instead, it’s about measuring to reduce uncertainty in the possible answers and measuring only what you need to know to make a better decision (based on the results of your data analysis). As a result, such a focused analysis often doesn’t require large amounts of data — rough calculations and small samples of data are often enough. More data might lead to greater precision in your answer, but it’s a tradeoff between time, effort, cost, and precision. (I also blogged about the high-level concepts in the book).

If I want to answer my question “What kind of music do I like to listen to?” I don’t need the listening data of every user on the Last.fm service, nor do I need metadata for songs I’ve never heard to help me identify song characteristics I might like. Because I want to answer a specific question, it’s important that I identify the specific data that I need to answer it—restricted by affected user, existence in another dataset, time range, type, or whatever else.

Keep context alongside the data

Indeed, the white paper talks about bringing people to a world where they can take action without worrying about where their data is, or where it comes from. But it’s important to still consider where the data comes from, even if you aren’t having to worry about it because you use Splunk software. It’s relevant to data analysis to keep context about the data alongside the data.

For example, it’s important for me to keep track of the fact that the song characteristics I might use to identify the type of music I like come from a dataset crafted by Spotify, or that my listening behavior is tracked by the service Last.fm. Last.fm can only track certain types of listening behavior on certain devices, and Spotify has their own biases in creating a set of audio characteristics.

If I lose track of this seemingly-mundane context when analyzing my data, I can potentially incorrectly interpret my data and/or draw inaccurate conclusions about what kind of music I like to listen to, based purely on the limitations of the data available to me. If I don’t know where my data is coming from, or what it represents, then it’s easy to find biased answers to questions, even though I’m using data to answer them.

If you have more data than you need, this also makes keeping context close to your data more difficult. The more data, the more room for error when trying to track contextual meaning. Splunk software includes metadata fields for data that can help you keep some context with the data, such as where it came from, but other types of context you’d need to track yourself.

More data can not only complicate your analysis, but it can also create security and privacy concerns if you keep a lot of data around and for longer than you need it. If I want to know what kind of music I like to listen to, I might be comfortable doing data analysis to answer that question, identifying the characteristics of music that I like, and then removing all of the raw data that led me to that conclusion out of privacy or security concerns. Or I could drop the metadata for all songs that I’ve ever listened to, and keep only the metadata for some songs. I’d want to consider, again, how much data I really need to keep around.

Turn data into answers—mostly

So I’ve broken down my overall question into smaller, more answerable questions, I’ve considered the data I have, and I’ve kept the context alongside the data I have. Now I can finally turn it into answers, just like I was promised!

It turns out I can take a corpus of my personal listening data and combine it with a dataset of my personal music libraries to weight the songs in the listening dataset. I can also assess the frequency of listens to further weight the songs in my analysis and formulate a ranking of songs in order of how much I like them. I’d probably also want to split that ranking by what I was doing while I was listening to the music, to eliminate outliers from the dataset that might bias the results. All the small questions that feed into the overall question are coming to life.
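As a rough sketch of that weighting idea: count how often each track was played, then give extra weight to tracks that are also in my library. The function name `rank_tracks`, the weight of 2.0, and the sample data are all assumptions chosen for illustration, not the approach I actually used. A fuller version would run the same ranking separately for each activity.

```python
from collections import Counter


def rank_tracks(listens, library, owned_weight=2.0):
    """Rank tracks by listen count, boosting tracks that are in the library.

    listens: iterable of track names, one entry per listen.
    library: set of track names that I own or have added to a library.
    owned_weight: arbitrary multiplier for owned tracks (an assumption).
    """
    counts = Counter(listens)
    scores = {
        track: n * (owned_weight if track in library else 1.0)
        for track, n in counts.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up data: "B" has the most listens, but "A" is owned and ranks higher.
sample_listens = ["A"] * 3 + ["B"] * 4 + ["C"] * 2
sample_library = {"A", "C"}
ranking = rank_tracks(sample_listens, sample_library)
print(ranking)
```

The choice of weight is itself a place where bias can creep in: doubling the score of owned tracks assumes that buying a track signals liking it about as much as listening to it twice as often.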

After I have that ranking, I could use additional metadata from another source, such as the Spotify audio features API, to identify the characteristics of the top-ranked songs, and ostensibly then be able to answer my overall question: what kind of music do I like to listen to?

By following all these steps, I turned my data into answers! And now I can turn my data into doing, by taking action on those characteristics. I can of course seek out new music based on those characteristics, but I can also book the ideal DJs for my birthday party, create or join a community of music lovers with similar taste in music, or even delete any music from my library that doesn’t match those characteristics. Maybe the only action I would take is self-reflection, and see if what the data has “told” me is in line with what I think is true about myself.

It is possible to turn data into answers, and turn data into doing, with caution and attention to all the ways that bias can be introduced into the data analysis process. But there’s still one more way that data analysis could result in biased outcomes: communicating results.

Carefully communicate data findings

After I find the answers in my data, I need to carefully communicate them to avoid bias. If I want to tell all my friends that I figured out what kind of music I like to listen to, I want to make sure that I’m telling them that carefully so that they can take the appropriate and ethical action in response to what I tell them.

I’ll want to present the answers in context. I need to describe the findings with the relevant qualifiers: I like music with these specific characteristics, and when I say I like this music I mean this is the kind of music that I listen to while doing things I enjoy, like working out, writing, or sitting on my couch.

I also need to make clear what kind of action might be appropriate or ethical to take in reaction to this information. Maybe I want to find more music that has these characteristics, or I’d like to expand my taste, or I want to see some live shows and DJ sets that would feature music that has these characteristics. Actions that support those ends would be appropriate, but can also risk being unethical. What if someone learns of these characteristics, and chooses to then charge me more money than other people (whose taste in music is unknown) to see specific DJ sets or concerts featuring music with those characteristics?

Data, per the white paper, “must be brought not only to every action and decision, but to every department.” Because of that, it’s important to consider how that happens. Share relevant parts of the process that led to the answers you found from the data. Communicate the results in a way that can be easily understood by your audience. This Medium post by Cecelia Shao, a product manager at Comet.ml, covers important points about how to communicate the results of data analysis.

Use data for good

I wanted to talk through the data analysis process in the context of the rebranded slogans and marketing content so that I could unpack additional nuance that marketing content can’t convey. I know how easy it is to introduce bias into data analysis, and how easily data analysis can be applied to unethical questions, or used to take unethical actions.

As the white paper aptly points out, the value of data is not merely in having it, but in how you use it to create positive outcomes. You need to be sure you’re using data safely and intelligently, because with great access to data comes great responsibility.

Go forth and use the data-to-everything platform to turn data into doing…the right thing.

Disclosure: I work for Splunk. Thanks to my colleagues Chris Gales, Erica Chen, and Richard Brewer-Hay for the feedback on drafts of this post. While colleagues reviewed this post and provided feedback, the content is my own and represents my own views rather than those of Splunk the company.

Several years ago I wrote about fragmented music libraries and music discovery. In light of the overwhelming popularity of Spotify and the dominance of streaming music (Spotify, Apple Music, Amazon Music, Tidal, and others), I’m curious if music libraries even exist anymore. Or, if they exist today, will they continue to exist?

My guess is that the only people still maintaining music libraries are DJs, fervent music fans (like myself), or people that aren’t using streaming music at all (due to age, lack of interest, or lack of availability due to markets or internet speeds).

I was chatting with a friend of mine that has a collection of vinyl records, but she only ever listens to vinyl if she’s relaxing on the weekend. Oftentimes she’s just asking Alexa to play some music, without much attention to where that music is coming from. With Amazon Music bundled into Amazon Prime for many members, people can be totally unaware that they’re using a streaming service at all. I’d hazard that this interaction pattern is true for most people, especially those that never enjoyed maintaining a music library but instead collected CDs and records because that was the only way to be able to listen to music at all.

Even my own habits are changing, perhaps equally due to time constraints as due to current music technology services. I used to carefully curate playlists for sharing with others, listening in the car, mix CDs, and for radio shows. These days I make playlists for many of those same purposes on Spotify, but the songs in my “actual” music library (iTunes) aren’t categorized into playlists at all anymore, and I give the playlists I make on my iPhone random names like “Aaa yay” to make the playlists easier to find, rather than to describe the contents.

I’m limited by storage size in terms of what I can add to my iPhone, just like I was with my iPod, but that shapes my experience of the music. Since I’m limited to a smaller catalogue, I’m able to sit with the music more and create more distinct memories. There are still songs that remind me of being in Berlin in 2011, limited to the songs that I added to my iPod before I left the United States because the internet I had access to in Germany was too slow to download new music and add it to my iPod.

Nowadays, I am less motivated to carefully manage my iTunes library because it’s only on one device, whereas I can access my Spotify library across multiple devices. That’s the one I find myself carefully creating folders of playlists for, organizing and sorting tracks and playlists. A primary reason for the success of Spotify for my listening habits is its social and collaborative nature. It’s easy to share tracks with others, make and share a playlist for a DJ set that I went to, contribute to a weekly collaborative playlist with a community of fellow music-lovers, or follow playlists created by artists and DJs I love. My local library can give me a lot, but it can’t give me that community interaction.

Indeed, in 2015 that’s something I identified as lacking. I felt that it was harder to feel part of a music culture, writing:

“It’s harder than it used to be to feel connected with music. It’s not a stream or a subculture one is tapped into anymore, because it’s so distributed on the web. There’s so much music, and it lives in so many different services, that the music culture has imploded a bit.”

I feel completely differently these days, thanks to a vibrant live music community in San Francisco. I loathe Facebook, but the groups that I’m a part of on that site enable me to feel connected to a greater music scene and community that supplement my connection to music and music discovery. Ironically, Facebook groups have also helped my music culture experience become more local. The music blogs that I used to be able to tap into are now largely defunct, or have multiple functions (The Burning Ear also running Vinyl Me, Please, or All Things Go also providing news and an annual festival in DC). Instead, yet another way I discover new music is by paying attention to the artists and DJs that people in these Facebook groups are talking about and posting tracks and albums from.

Despite the challenges of a local music library, I keep buying digital music partially because I made a promise to myself when I was younger that I’d do so when I could afford to, partially to support musicians and producers, and partially because I distrust that streaming services will stick around with all the music I might want to listen to. I’d rather “own” it, at least as best as I can when it’s a digital file that risks deletion and decomposition over time.

Music discovery in the past was equal parts discovery and collection, with a hefty dose of listening after I collected new music.

I’d do the following when discovering new music:

Writing down song lyrics while listening to the radio or while working my retail job, then later looking up the tracks to check out albums from the library to rip to my family computer.

Following music blogs like The Burning Ear, All Things Go, Earmilk, Stereogum, Line of Best Fit, then downloading what I liked best from their site from MediaFire or MegaUpload to save to my own library.

Trolling through illicit LiveJournal communities or invite-only torrent sites to download discographies for artists I already liked, or might like.

Over time, those music blogs shifted to using SoundCloud, the online communities and torrent sites shuttered, and I started listening to more music on streaming sites instead. The loop stopped going from discovery to collection, and instead went from discovery to like to discovery again.

Find a new track, listen, click the heart or the plus sign, and move on. Rarely do you remember to go back and listen to your fully-compiled list of saved tracks (or even if you do, trying to listen to the whole thing on shuffle will be limited by the web app, thanks SoundCloud).

This type of cycle is faster than the old cycle, more focused on engagement with the service than with the music, and more about consuming than collecting. In some ways, downloading music was like this too. When I accidentally deleted my entire music library in 2012, the tatters of my library that I was able to recover from my iPod were a scant representation of my full collection, but included in that library were discographies that I would likely never listen to. Now that it’s been years, there have been a few occasions where I go back and discover that an artist I listen to now is in that graveyard of deleted songs, but even knowing that, I’m not sure I would’ve gotten to it any sooner. I was always collecting more than I was listening to.

Streaming music lets me collect in the same way, but without the personal risk. It just makes me dependent on a third-party entity that permits me to access the tracks that they store for me. I end up with lists of liked tracks across multiple different services, none of which I fully control. These days my music discovery is largely driven by 3 services: Spotify, Shazam, and SoundCloud. Spotify pushes algorithmic recommendations to me, Shazam enables me to discover what track the DJ is currently playing when I’m out at a DJ set, and SoundCloud lets me listen to recorded DJ sets as well as having excellent autoplay recommendations. In all of them I have lists of tracks that I may never revisit after saving them. Some of them I’ll never be able to revisit, because they’ve been deleted or the service has lost the rights to the track.

In 2015 I lamented the fragmentation of music discovery, but looking back, my music discovery was always shared across services, devices, and methods—the central iTunes library was what tied the radio songs, the library CDs, the discography downloads, and the music blog tracks together. The real issue is that the primary music discovery modes of today are service-dependent, and each of those services provides their own constructs of a music library. I mentioned in 2015 that:

“my library is all over the place. iTunes is still the main home of my music—I can afford to buy new music when I want —but I frequent Spotify and SoundCloud to check out new music. I sync my iTunes library to Google Play Music too, so I can listen to it at work.”

While this is still mostly true, I largely consume Spotify when I’m at work, listen to SoundCloud sets or tracks from iTunes when I’m on-the-go with my phone, and listen to Spotify or iTunes when I’m on my personal laptop. That’s essentially 2.5 places that I keep a music library, and while I maintain a purchase pipeline of tracks from Spotify and SoundCloud into my iTunes library, it’s a fraction of my discoveries that make it into my collection for the long term. The days of a true central collection of my library are long since past.

It seems a feat, with all these digital cloud music services streaming music into our ears, to have a local music library. Indeed, what’s the point of holding onto your local files when it becomes so difficult to access them? iTunes is becoming the Apple Music app, with the Apple Music streaming service front and center. Spotify is, well, Spotify. And SoundCloud continues to flounder yet provides an essential service of underground music and DJ sets. Google Play Music exists, but only has a web-based player (no client) to make it easier to access and listen to your local library after you’ve mirrored it to the cloud. Streaming is convenient. But streaming music lets others own your content for you, granting you subscription access to it at best, ruining the quality of your music listening experience at worst.

A recent essay by Dave Holmes in Esquire talks about “The Deleted Years”, the years when we stored music on iPods that we have largely moved on from since the arrival of Spotify and other streaming services. As he puts it,

“From 2003 to 2012, music was disposable and nothing survived.”

Perhaps it’s more true that from 2012 onward, music is omnipresent and yet more disposable. It can disappear into the void of a streaming service, and we’ll never even know we saved it. At least an abandoned iPod gives us a tangible record of our past habits.

“I’m worried that, for internet music culture, what’s coming is the loss of a place that offered innumerable avenues for creativity, for enjoyment, for discovery of music that couldn’t and wouldn’t be created anywhere else. And, like everyone who has ever invested enough emotion in an online space long enough to make it their own, I’m wondering what’s next.”

As the music industry moves away from downloads and toward building streaming platforms, international sovereignty becomes more of a barrier to people listening to music and discussing it with others, because they don’t have access to the same music on the same platforms. As Sean Michaels points out in The Morning News several years ago:

“one of the undocumented glitches in the current internet is all its asymmetrical licensing rules. I can’t use Spotify in Canada (yet). Whenever I’m able to, there’s no guarantee that Spotify Canada’s music library will match Spotify America’s. Just as Netflix Canada is different than Netflix US, and YouTube won’t let me see Jon Stewart. As we move away from downloads and toward streaming, international sovereignty is going to become more and more of a barrier to common discussions of music.”

Location has always been a challenge to music access, but it’s important to keep in mind that the internet and music streaming has not been an equitable boon to music access—it is still controlled.

If you’re thinking about changing careers, or want guidance in determining whether your career is right for you, I hope this post can help you! It’s all about how I defined my career values and reframed how I thought about my career and my future.

Why I needed to define my career values

A couple years ago I was comfortable in my position at work. After four years in my career, surrounded by talk of the importance of having a growth mindset, I thought maybe I was too comfortable. As a technical writer, I was contributing to product management conversations, and thinking intensely about customer needs, and realized I wanted to be even more involved in what we were choosing to build. I took a training course, and found a job on the product management team in my company that appealed to my interests. 11 months later, I went back to documentation, after realizing that that pathway better suited my career values. Throughout those 11 months and in the time since, I’ve worked to determine what I really want to get out of my career, and make sure that what I am doing fits those values.

Ask myself some questions

I started by asking myself some questions, common ones that people recommend when you’re thinking about making a job change. I found that I was better able to answer these questions after I’d already made a job change, likely because I didn’t have that much work experience before making the career change. I asked myself the following questions:

What makes me excited to go into work?

What makes me dread going into work?

What helps me feel validated or appreciated at work?

Working with others? Contributing in meetings? Reporting project status on a regular basis?

After changing roles, I realized that many parts of my technical writing position, and the way that my team and my duties were structured, were very well suited to my working styles. However, since I hadn’t had much experience with other types of work, I hadn’t identified them as vital to my work. Switching positions forced me to reexamine what parts of a role were vital to my happiness at work, and in what way.

Find strategies that have worked for others

I found several strategies that worked for others by listening to some You 2.0 episodes from the Hidden Brain podcast.

By listening to the You 2.0: Dream Jobs and You 2.0: How to Build a Better Job episodes of Hidden Brain, I realized there were options to transform a job I was already in by finding more enjoyable aspects within it. The Dream Jobs episode helped me consider whether I was looking for too much meaning and validation within my job, and if I needed to separate those pursuits more. The How to Build a Better Job episode helped me consider what I could shift within my day-to-day job in terms of focus or duties so that I could enjoy it as well.

I also looked beyond work-specific strategies. The episode You 2.0: How Silicon Valley Can Help You Get Unstuck taught me perhaps my favorite tidbit, which is to apply iterative methodologies to your life. Yes, it’s kitschy, but one example mentioned in the episode resonated deeply with me: the notion of creating multiple five-year plans. Whenever I’d previously considered how my future might look, it was easy to get stressed about the fact that I have one future available to me and ~ people ~ expect me to have a plan for it. But this philosophy helped me realize that I can have multiple plans for it, and test them out.

I put together several five-year plans to speculate about where I would spend my time and how my life might look if I stayed pursuing product management roles, or how it might look if I was doing tech writing for those years, as well as what it might look like if I took a different role entirely, moved cities, or even moved countries. This helped me consider what types of futures excited me, and position my work priorities alongside my overall life priorities. They aren’t separate, and I wanted to be sure that I didn’t consider them separately. I also realized that this exercise wasn’t about making these plans and then choosing one of them, but rather choosing the elements of each of the plans that made sense to me and got me excited about the future. I plan to revisit this exercise and continue to evaluate the spectrum of futures available to me.

The episode You 2.0: Decide Already! interviewed Dan Gilbert, the author of the excellent book Stumbling on Happiness. Both the episode and the book helped me consider the ways that being anxious about the future, planning for it, and attempting to reduce uncertainty about it weren’t necessarily making me feel better about it, and might actually be making me feel worse. So despite creating some five-year plans, I also find it crucial to allow room and flexibility in those plans, and to welcome uncertainty in my work and life.

Take a values-centered approach

In my personal life I was working on developing and defining my personal values, using a card-sorting exercise similar to this one from the Urban Indian Health Institute. Defining my personal values, and understanding them as a way to assess whether or not my goals and day-to-day tasks were fulfilling or not, turned out to be vital. I attempted to apply a similar framework to my work goals and fulfillment as well, and identify one or more overarching themes that I could associate with my career.

Putting all the strategies together to define career values

After assessing the structure of work in which I thrive and find validation, I was better able to understand what I found fulfilling about a career, and what I could look for in future roles to find fulfillment and the right kind of comfort. A work environment with clear expectations and measurable, tangible results was vital. A team that I could collaborate with and draw support from, while also working semi-independently, was also important to me.

After creating multiple five-year plans, I realized that a career path more similar to the one I had as a technical writer was more valuable to me than one that was closer to product management, where I’d be busier and spending more time and stress on work than on my personal life. In addition, by engaging with the technical writer community, I realized that the futures available to me with a technical writing career were more broad, varied, and flexible than I’d previously realized. I didn’t need the power and recognition within a company that a product management position might offer me, because that power and recognition would also come with added responsibilities, time commitments, and stressful challenges.

I attempted to reverse-engineer my career values based on these experiences and my personal values exercise. I ultimately centered on a core career value of “Information Conveyance”. What this means to me is that if I spend my time at work learning and sharing information with others, I will likely feel fulfilled and be excited to go to work. Defining this as a career value allowed me to move past specific roles and titles, because multiple career paths can help me support this value. Right now I love technical writing, but other functions like communication strategy, developer advocacy, community management, instructional design, and others align with this value and are available to me as other potential career paths.

Data analysis is a valuable way to learn more about what documentation tasks to prioritize above others. My post (and talk), Just Add Data, presented at Write the Docs Portland in 2019, talks about this broadly. In this post I want to cover in detail a number of different data types that can lead to valuable insights for prioritization.

This list of data types is long, but I promise each one contains value for a technical writer. These types of data might come from your own collection, a user research organization, the business development department, marketing organization, or product management organization:

User research reports

Support cases

Forum threads and questions

Product usage metrics

Search strings

Tags on bugs or issues

Education/training course content and questions

Customer satisfaction surveys

More documentation-specific data types:

Documentation feedback

Site metrics

Text analysis metrics

Download/last accessed numbers

Topic type metrics

Topic metadata

Contribution data

Social media analytics

Many of these data types are best used in combination with others.

User research reports

User research reports can contain a lot of valuable data that you can use for documentation.

Types of customers being interviewed

Customer use cases and problems

Types of studies being performed

This can give you insight into what the company finds valuable to study (and therefore into internal priorities), as well as direct customer feedback about things that are confusing or the ways that they use the product. The types of customers that are interviewed can provide valuable audience or persona-targeting information, allowing you to better calibrate the information in your documentation. See How to use data in user research when you have no web analytics on the Gov.UK site for more details about what you can do with user research data.

Support cases

Support cases can help you better understand customer problems. Specific metrics include:

Number of cases

Frequency of cases

Categories of questions

Customer environments and licenses

With these you can compile metrics about specific customer problems, the frequency of problems, and the types of customers and customer environments that are encountering specific problems, allowing you to better understand target customers, or customers that might be using your documentation more than others. Support cases are also rich data for common customer problems, providing a good way to gather new use cases and subjects for topics.
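As a rough sketch of compiling these metrics, assuming you can export support cases as records, you could count cases per category and per license type in Python. The `category` and `license` field names and the sample records are hypothetical; your support system will have its own.

```python
from collections import Counter

# Hypothetical export of support cases; the field names are assumptions.
cases = [
    {"id": 1, "category": "installation", "license": "enterprise"},
    {"id": 2, "category": "search", "license": "free"},
    {"id": 3, "category": "installation", "license": "free"},
    {"id": 4, "category": "installation", "license": "enterprise"},
]


def count_by(records, field):
    """Count how many records share each value of the given field."""
    return Counter(record[field] for record in records)


category_counts = count_by(cases, "category")
license_counts = count_by(cases, "license")

# most_common() lists values from most to least frequent.
for category, count in category_counts.most_common():
    print(f"{category}: {count}")
```

The categories with the most cases are a reasonable first place to look for missing or unclear documentation.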

Forum threads and questions

These can be internal forums (like Splunk Answers for Splunk) or external ones, like Reddit or StackOverflow.

Common questions

Common categories

Frequently unanswered questions

Post titles

If you’re trying to understand what people are struggling with, or get a better sense of how people are using specific functionality, forum threads can help. The types of questions that people ask and how they phrase them can also help make it clear what kinds of configuration combinations might make specific functions harder for customers. Based on the question types and frequencies that you see, you might be able to fine-tune existing documentation to make it more user-centric and easily findable, or supplement content with additional specific examples.

Product usage metrics

Some examples of product usage metrics are as follows:

Time in product

Intra-product clicks

Types of data ingested

Types of content created

Amount of content created

Even if you don’t have specific usage data introspecting the product, you can gather metrics about how people are interacting with the purchase and activation process, and extrapolate accordingly.

Number of downloads and installs

License activations and types

Daily and monthly active users

You can use this type of data to better understand how people are spending their time in your product, and what features or functionality they’re using. Even if a customer has purchased or installed the product, it’s even more valuable to find out if they’re actually using it, and if so, how.
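For example, daily and monthly active users can be derived from a simple log of usage events. This is a minimal sketch that assumes each event is a pair of a user identifier and the date that user was active; the event data here is made up for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical usage events: (user_id, day the user was active).
events = [
    ("ana", date(2020, 5, 1)),
    ("ana", date(2020, 5, 1)),  # repeat activity on the same day
    ("ben", date(2020, 5, 1)),
    ("ana", date(2020, 5, 2)),
    ("cy", date(2020, 5, 20)),
]


def daily_active_users(events):
    """Return the number of distinct users seen on each day."""
    users_by_day = defaultdict(set)
    for user_id, day in events:
        users_by_day[day].add(user_id)
    return {day: len(users) for day, users in users_by_day.items()}


def monthly_active_users(events):
    """Return the number of distinct users seen in each (year, month)."""
    users_by_month = defaultdict(set)
    for user_id, day in events:
        users_by_month[(day.year, day.month)].add(user_id)
    return {month: len(users) for month, users in users_by_month.items()}


dau = daily_active_users(events)
mau = monthly_active_users(events)
print(dau[date(2020, 5, 1)], mau[(2020, 5)])
```

Using sets means a user who is active several times in one day or month is only counted once for that period.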

If your product is only in beta, and you want more data to help you prioritize an overall documentation backlog, such as topics that are tied to a specific release, you can use some product usage data to understand where people are spending more of their time, and draw conclusions about what to prioritize based on that.

Maybe the under-utilized features could use more documentation, or more targeted documentation. Maybe the features themselves need work. Be careful not to draw overly-simplistic conclusions about the data that you see from product usage metrics. Keep context in mind at all times.

Search strings

You can gather search strings from HTTP referer data from web searches performed on external search sites such as Google or DuckDuckGo, or from internal search services. It’s pretty unlikely that you’ll be able to gather search strings from external sites given the widespread implementation of HTTPS, but internal search services can be vital and valuable data sources for this.

Look at specific search strings to find out what people are looking for, and what people are searching that’s landing them on specific documentation pages. Maybe they’re searching for something and landing on the wrong page, and you can update your topic titles to help.
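If your internal search service records the URLs of searches, a short script can pull out the search strings and rank them. This sketch assumes the search string lives in a query parameter named `q` and uses a made-up `docs.example.com` site; check how your own search service builds its URLs.

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse

# Hypothetical internal search URLs; the "q" parameter name is an assumption.
search_urls = [
    "https://docs.example.com/search?q=install+forwarder",
    "https://docs.example.com/search?q=Install%20forwarder",
    "https://docs.example.com/search?q=license+types",
    "https://docs.example.com/search?page=2",  # no search string present
]


def extract_query(url, param="q"):
    """Return the normalized search string from a URL, or None if absent."""
    values = parse_qs(urlparse(url).query).get(param)
    if not values:
        return None
    # Lowercase and trim so that trivially different searches group together.
    return values[0].strip().lower()


queries = [q for q in map(extract_query, search_urls) if q]
top_searches = Counter(queries).most_common()
print(top_searches)
```

Normalizing case and whitespace is a deliberate simplification. You might go further and group searches by keyword, depending on how varied the search strings are.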

JIRA or issue data

You can use metrics from your issue tracking services to better understand product quality, as well as customer confusion.

Number of issues/bugs

Categories/tags/components of issues/bugs

Frequency of different types of issues being created/closed

Issue tags or bug components can help you identify categories of the product where there are lots of problems or perhaps customer confusion. This is especially useful data if you’re an open source product and want to get a good understanding of where there are issues that might need more decision support or guidance in the documentation.

Training courses

If you have an education department, or produce training courses about your product, these are quite useful to gather data from. Some examples of data you might find useful:

Questions asked by customers

Questions asked by course developers

Use cases covered by content in courses

Enrollment in courses

Categories of courses offered

It’s also useful to correlate this with other data to help identify verticals of customers interested in different topics. Because education and training courses cover more hands-on material, they can be an excellent source of use case examples, as well as occasions where decision support and guidance is needed.

Customer surveys

Customer surveys here mostly means surveys like satisfaction surveys and sentiment analysis surveys. By reviewing the qualitative statements and types of questions asked in the surveys, you can gain valuable insights and information like:

What do people think about the product?

What do people want more help with?

How do people think about the product?

How do people feel about the product?

What does the company want to know from customers?

What are the company priorities?

This can also help you think about how the documentation you write has a real effect on people’s interactions with the product, and can shift sentiment in one way or another.

Documentation feedback

Direct feedback on your documentation is a vital source of data if you can get it.

Qualitative comments about the documentation

Usefulness votes (yes/no)

Ratings

Even if you don’t have a direct feedback mechanism on your website, you can collect documentation feedback from internal and external customers by paying attention in conversations with people and even asking them directly if they have any documentation feedback. Qualitative comments and direct feedback can be vital for making improvements to specific areas.

Site metrics

If your documentation is on a website, you can use web access logs to gather important site metrics, such as the following:

Page views

Session data like time on page

Referer data

Link clicks

Button clicks

Bounce rate

Client IP

Site metrics like page views, session data, referer data, and link clicks can help you understand where people are coming to your docs from, how long they are staying on the page, how many readers there are, and where they’re going after they get to a topic. You can also use this data to understand better how people interact with your documentation. Are readers using a version switcher on your page? Are they expanding or collapsing information sections on the page to learn more? Maybe readers are using a table of contents to skip to specific parts of specific topics.

You can split this data by IP address to understand groups of topics that specific users are clustering around, to better understand how people use the documentation.
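To make this concrete, here is a minimal sketch that parses web access log lines, counts page views per topic, and groups the topics viewed by each client IP. It assumes your server writes logs in the common "combined" log format; the sample lines and paths are invented, and real logs may need a different pattern.

```python
import re
from collections import Counter, defaultdict

# Pattern for the "combined" access log format (an assumption about your logs).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical log lines for illustration.
log_lines = [
    '203.0.113.5 - - [15/May/2020:10:00:00 +0000] "GET /docs/install HTTP/1.1" 200 5120 "https://www.google.com/" "Mozilla/5.0"',
    '203.0.113.5 - - [15/May/2020:10:02:10 +0000] "GET /docs/configure HTTP/1.1" 200 7300 "https://docs.example.com/docs/install" "Mozilla/5.0"',
    '198.51.100.7 - - [15/May/2020:10:05:42 +0000] "GET /docs/install HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    'not a valid log line',
]


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


entries = [entry for entry in map(parse_line, log_lines) if entry]

# Page views per topic path.
page_views = Counter(entry["path"] for entry in entries)

# Topics that each client IP viewed, to look for clusters of related topics.
topics_by_ip = defaultdict(set)
for entry in entries:
    topics_by_ip[entry["ip"]].add(entry["path"])

print(page_views.most_common())
```

Lines that don’t match the pattern are skipped rather than raising an error, which keeps the sketch tolerant of malformed entries at the cost of silently dropping them.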

Text analysis metrics

Data about the actual text on your documentation site is also useful to help understand the complexity of the documentation on your site.

Flesch-Kincaid readability score

Inclusivity level

Length of sentences and headers

Style linter

You can assess the readability or usability of the documentation, or even the grade level score for the content to understand how consistent your documentation is. Identify the length of sentences and headers to see if they match best practices in the industry for writing on the web. You can even scan content against a style linter to identify inconsistencies of documentation topics against a style guide.
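As an illustration, the Flesch-Kincaid grade level is computed as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below applies that formula with a deliberately rough syllable counter (it counts groups of consecutive vowels), so treat its scores as approximate. The function names are my own, not from a particular library.

```python
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text):
    """Approximate Flesch-Kincaid grade level for a block of English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(word) for word in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


simple = "The cat sat. The dog ran."
dense = "Configuration of distributed deployments requires understanding replication considerations."
print(round(flesch_kincaid_grade(simple), 2))
print(round(flesch_kincaid_grade(dense), 2))
```

Comparing scores across topics is more useful than reading any single score in isolation, especially with a heuristic syllable counter like this one.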

Download metrics

If you don’t have site metrics for your documentation site, because the documentation is published only via PDF or another medium, you can still use metrics from that.

Download numbers

Download dates and times

Download categories and types

You can use these metrics to gauge interest in what people want to be reading offline, or how frequently people are accessing your documentation. You can also correlate this data with product usage data and release cycles to determine how frequently people access the documentation compared with release dates, and the number of people accessing the documentation compared with the number of people using a product or service.

Topic type metrics

If you use strict topic typing at your documentation organization, you can use topic type metrics as an additional metadata layer for documentation data analysis. Even if you don’t, you can manually categorize your documentation by type to gather this data.

What are the topic types?

How many topic types are there?

How many topics are there of each type?

Understanding topic types can help you understand how reader interaction patterns can vary for your documentation by type, or whether your developer documentation has predominantly different types of documentation compared with your user documentation, and better understand what types of documentation are written for which audiences.

Topic metadata

Metadata about documentation topics is also incredibly valuable as a correlation data source. You can correlate topic metadata like the following information:

What are the titles?

Average length of a topic?

Last updated and creation dates

Versions that different topics apply to

You can correlate it with site metrics, to see if longer topics are viewed less-frequently than shorter topics, or identify outliers in those data points. You can also manually analyze the topic titles to identify if there are patterns (good or bad) that exist.
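One way to check the relationship between topic length and page views is a Pearson correlation coefficient. This sketch assumes you’ve already joined word counts and page views per topic; the topic titles and numbers are invented for illustration.

```python
from math import sqrt

# Hypothetical topics with word counts and page views already joined.
topics = [
    {"title": "Install the app", "words": 300, "views": 900},
    {"title": "Configure inputs", "words": 800, "views": 650},
    {"title": "Troubleshoot search", "words": 1500, "views": 400},
    {"title": "Release notes", "words": 2500, "views": 150},
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)


r = pearson(
    [topic["words"] for topic in topics],
    [topic["views"] for topic in topics],
)
print(f"correlation between length and views: {r:.2f}")
```

A value near -1 would suggest that longer topics tend to get fewer views in this sample, and a value near 0 would suggest no linear relationship. A correlation like this doesn’t tell you why, so it’s worth looking at the outliers individually.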

Contribution data

If you have information about who is writing documentation, and when, you can use these types of data:

Last updated dates

Authors/contributors

Amount of information added or removed

Contribution data can tell you how frequently specific topics were updated, by whom, and how much information was added or removed. You can identify frequency patterns, clusters over time, and consistent contributors.

It’s useful to split this data by other features, or correlate it with other metrics, especially site metrics. You can then identify things like:

Last updated dates by topic

Last updated dates by product

Last updated dates over time

With these breakdowns, you can see if there are correlations between updates and page views. Perhaps more frequently updated content is viewed more often.
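The breakdowns listed above (by topic, by product, and over time) can be sketched with a couple of small grouping helpers. The record fields and sample values here are hypothetical; in practice this data might come from your authoring tool or version control history.

```python
from collections import defaultdict
from datetime import date

def latest_update_by(records, key):
    """Most recent update date for each value of the given field."""
    latest = {}
    for record in records:
        group = record[key]
        if group not in latest or record["updated"] > latest[group]:
            latest[group] = record["updated"]
    return latest

def updates_per_month(records):
    """Number of updates in each (year, month), to show activity over time."""
    counts = defaultdict(int)
    for record in records:
        counts[(record["updated"].year, record["updated"].month)] += 1
    return dict(counts)

# Hypothetical contribution records
updates = [
    {"topic": "Install guide", "product": "Agent", "author": "sam", "updated": date(2020, 1, 14)},
    {"topic": "Install guide", "product": "Agent", "author": "lee", "updated": date(2020, 3, 2)},
    {"topic": "API reference", "product": "Cloud", "author": "sam", "updated": date(2020, 3, 9)},
]
print(latest_update_by(updates, "topic"))
print(latest_update_by(updates, "product"))
print(updates_per_month(updates))
```

Joining the per-topic results with page views for the same topics is then a matter of matching on the topic name or URL.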

Social media analytics

Social media referrers

Link clicks from social media sites

If you publicize your documentation using social media, you can track the interest in the documentation from those sites. If you’re curious about social media referrers, you can see whether or not people are getting to your documentation in that way. Maybe your support team is responding to people on Twitter with links to your documentation, and you want to better understand how frequently that happens and how frequently people click through those links to the documentation.
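If your web analytics or server logs expose the referrer URL for each page view, a short script can tally how many visits arrive from social media sites. This is a sketch under that assumption; the `SOCIAL_DOMAINS` set is an illustrative, incomplete list that you would adjust for the sites that matter to you.

```python
from collections import Counter
from urllib.parse import urlparse

# Illustrative list only (an assumption), not a complete set of social sites
SOCIAL_DOMAINS = {"twitter.com", "t.co", "facebook.com", "linkedin.com", "reddit.com"}

def social_source(referrer_url):
    """Return the matching social domain for a referrer URL, or None."""
    host = (urlparse(referrer_url).hostname or "").lower()
    for domain in SOCIAL_DOMAINS:
        # Match the domain itself or any subdomain, such as www.reddit.com
        if host == domain or host.endswith("." + domain):
            return domain
    return None

def count_social_referrers(referrer_urls):
    """Tally page views by the social site that referred them."""
    sources = (social_source(url) for url in referrer_urls)
    return Counter(source for source in sources if source)

# Hypothetical referrer values from a log export
referrers = [
    "https://t.co/abc123",
    "https://www.reddit.com/r/technicalwriting/",
    "https://www.google.com/search?q=docs",
    "",
]
print(count_social_referrers(referrers))
```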

You can also identify whether or not, and how, people are sharing your documentation on social media by using data crawled or retrieved from those sites’ APIs, and looking for instances of links to your documentation. This can help you get a better sense of how people are using your documentation, how they’re talking about it, how they feel about it, and whether or not you have an organic community out there on the web sharing your documentation.

Beyond documentation data

I hope that this detail has given you a better understanding of the different types of data, beyond documentation data, that are available to you as a technical writer to draw valuable conclusions from. By analyzing these types of data, you are not only prepared to prioritize your documentation task list, but also better able to understand the customers of your product and documentation. Even if only some of these are available to you, I hope they are useful. Be sure to read Just Add Data: Using data to prioritize your documentation for the full explanation of how to use data in this way.