hrbrmstr’s Year In Review

# Throughout this document there will be commentary in, well, comments.
# The code is not nearly as concise as I would have liked it to be, but such is the way of things.
# Once it's up on GitHub, do not hesitate to ask questions in issues or even submit PRs for views you create.
#
# IMPORTANT: set "include_wordpress" above to "no" if you aren't a WordPress user (or remove those code chunks)

Quantifying the [Social] Year

Most folks take some time to reflect on the past year as the new annum approaches and I am no exception to that generalization. While I take stock across a number of areas, this year I wanted to take a more data-driven approach to how I spend my digital social time. To that end, I put together this R Markdown document to catalogue various bits of information from four major time-capturing areas: StackOverflow, GitHub, blogging and Twitter.

Rather than hard code years and ids, I’ve made this a generic, parameterized document so others could strip away the blathering and grab their own stats for their activities. Most of the sections have in-year data as well as comparisons between the past 2 years and will (hopefully) work without much — if any — code tweaking.

Expand the folded code sections to see how the sausage was made. I’ve added some commentary along the way in the code blocks in case you want to see how various bits were composed. More than half the reason to make this example was to expose the StackOverflow, GitHub and WordPress APIs to folks.

I tried to stick with light slate gray and spring green for previous- and current-year aesthetics. I’ve had to expand into using slate blue when the current year’s visualization involved text or the visualizations spread across years without emphasizing a particular year. Upon review, I believe this was consistent, but please open a GH issue noting any areas you think could be improved.

A huge, preamble #ty to @mrshrbrmstr and @dataandme for suffering through a review of this in draft form.

StackOverflow

Helping out there also lets me explore new “worlds” in R (I — scarily — know way more about genes than I really should thanks to this).

I kinda-sorta knew where & how I spend my time there, but finally decided to help quantify it a bit.

# You'll see this pattern quite a bit in this document, so I'll explain it here.
#
# API requests eat up your own API quota bits and also consume bandwidth and CPU
# time for the (free) services we all use. It's not cool to repeatedly hit the
# servers for data that doesn't change.
#
# Now, I'm not giving you my data, but this pattern will make it easier for you
# to cache your own results.
#
# A data file (RDS) is defined and checked for.
# If it does not exist, API calls are made and then cached into it
# If it does exist, the data is read from the cache.
#
# If you ever need to refresh the data, just [re]move the cached RDS files.
#
# The .gitignore _should_ keep your data off github, but you're responsible for
# that in the long run.
so_data_file <- file.path(rt, "data", "my_so.rds") # where we're going to store cached SO data

if (!file.exists(so_data_file)) {

  # grab my answers for the past 2 years
  my_answers <- stack_users(
    params$stack_user, "answers",
    fromdate = as.integer(as.POSIXct(as.Date(sprintf("%s-01-01", params$curr_year-1)))),
    todate = as.integer(as.POSIXct(as.Date(sprintf("%s-12-31", params$curr_year)))),
    pagesize = 100,
    num_pages = 50
  )

  # now get the question data for those answers (this was much easier on the
  # SO data site with SQL, btw); the API takes at most 100 ids per call
  starts <- seq(1, length(my_answers$question_id), 100)
  ends <- c(starts[-1]-1, length(my_answers$question_id))

  map2_df(starts, ends, ~{
    stack_questions(my_answers$question_id[.x:.y], pagesize = 100)
  }) -> my_answers_qs

  # grab my comments (I didn't rly do anything with them for the review)
  my_comments <- stack_users(
    params$stack_user, "comments",
    fromdate = as.integer(as.POSIXct(as.Date(sprintf("%s-01-01", params$curr_year-1)))),
    todate = as.integer(as.POSIXct(as.Date(sprintf("%s-12-31", params$curr_year)))),
    pagesize = 100,
    num_pages = 50
  )

  # grab badge data for the previous year (it doesn't come with the date,
  # so we have to constrain the query window and tag the year ourselves)
  stack_users(
    params$stack_user, "badges",
    fromdate = as.integer(as.POSIXct(as.Date(sprintf("%s-01-01", params$curr_year-1)))),
    todate = as.integer(as.POSIXct(as.Date(sprintf("%s-12-31", params$curr_year-1)))),
    pagesize = 100,
    num_pages = 30
  ) %>%
    mutate(year = params$curr_year-1) -> my_badges_prev_year

  # and this year
  stack_users(
    params$stack_user, "badges",
    fromdate = as.integer(as.POSIXct(as.Date(sprintf("%s-01-01", params$curr_year)))),
    todate = as.integer(as.POSIXct(as.Date(sprintf("%s-12-31", params$curr_year)))),
    pagesize = 100,
    num_pages = 30
  ) %>%
    mutate(year = params$curr_year) -> my_badges_curr_year

  # finally, get my reputation history
  my_rep <- stack_users(
    params$stack_user, "reputation-history",
    pagesize = 100,
    num_pages = 100
  )

  # bundle it up in a list (I regret not doing that for the other sections)
  list(
    my_answers = my_answers,
    my_answers_qs = my_answers_qs,
    my_comments = my_comments,
    my_badges = bind_rows(my_badges_prev_year, my_badges_curr_year),
    my_rep = my_rep
  ) -> my_so

  write_rds(my_so, so_data_file)

} else {
  my_so <- read_rds(so_data_file)
}

# clean up the answers and get them out of the list
tbl_df(my_so$my_answers) %>%
  mutate(
    month = as.Date(format(creation_date, "%Y-%m-01")),
    year = factor(lubridate::year(creation_date))
  ) -> answers
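The 100-id chunking used for the question lookups is a handy pattern on its own; here is a self-contained sketch of the same index arithmetic (the toy `ids` vector stands in for the real question ids):

```r
# The Stack Exchange API accepts at most 100 ids per request, so split a
# vector of question ids into runs of 100 and fetch each run separately.
ids <- seq_len(250) # toy stand-in for my_answers$question_id

starts <- seq(1, length(ids), 100)
ends <- c(starts[-1] - 1, length(ids))

# each element of `chunks` is one API call's worth of ids
chunks <- Map(function(s, e) ids[s:e], starts, ends)
lengths(chunks) # 100 100 50
```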

I’m IRL “busier” these days and the waffle chart does reflect that I’ve not helped as many folks on SO this year (with new answers) as I did the previous one. However, the reduced answer count is also indicative of a deliberate shift into answering certain tags in preparation for writing another tome (which is reflected in the treemap to be presented in a bit).
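For the curious, a waffle like that can be produced with the waffle package; a minimal sketch with made-up counts (the real ones come from the cached `answers` data):

```r
library(waffle)

# one square per answer: previous year in light slate gray,
# current year in spring green (made-up counts)
answers_per_year <- c(`2016` = 130, `2017` = 85)

waffle(
  answers_per_year, rows = 10,
  colors = c("lightslategray", "springgreen")
)
```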

Despite being a bit more picky about which questions I answer, my answer acceptance rate hasn’t really improved. But, one reason for this in 2017 is that I made some efforts to go back in time on SO and add “modern R” answers to older questions, and the folks on SO who post questions have little incentive to go back to old questions for any reason (if David or Julia are reading this, it might be worth poking at a new badge or rep increase endorphin stimulus for question owners for this particular condition).

As noted, I’ve had quite a bit going on this year, which impacted the cadence from the previous year. One reason for more fall/winter activity in-general is that it helps combat seasonal affective disorder.

I was also curious about the questions associated with my top answers.

First a view of answer scores vs question views (this surprised me during EDA work, which is why I’m including it). One “SO FAIL” of mine is that I tend to gravitate towards pretty niche topics since they tend to be harder/more interesting (to me):
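A sketch of that score-vs-views scatter, with toy data standing in for the joined answer/question tables:

```r
library(ggplot2)

# toy stand-in for the real answer score / question view pairs
df <- data.frame(
  score = c(1, 3, 5, 12, 2),
  views = c(150, 900, 4000, 90000, 300)
)

ggplot(df, aes(views, score)) +
  geom_point(color = "springgreen") +
  scale_x_log10() + # view counts are heavily right-skewed
  labs(x = "Question views (log10)", y = "Answer score")
```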

The subtitle for the next graph says it all: “What tags do I seem to gravitate towards answering and how has that changed in the last 24 months?”. As noted, I’m not surprised by the focus on “web things” since it was deliberate. I can also correlate the reduced number of answers to no longer helping with gambling, sports or financial web scraping questions (for many reasons).

The “reputation trend” chart is a pretty boring visualization and I almost didn’t include it, but wanted to see if there were any surprises in the reputation cadence. An artefact of being on SO for a long time is that one’s older answers help prop up reputation even if one’s current activity has waned a bit.

Despite initial appearances, this next vis isn’t just a humblebrag. I wanted to validate some (most) of my previous assertions about reputation gaining and I think this view does that — especially for “Necromancer” and “Nice Answer” — but, also the fact that badges can accumulate faster if older answers get upticks from new folks in search of answers.

Reflection & Speculation

I continue to firmly believe you can genuinely and significantly help others (I’ve had numerous folks tell me that an answer helped them through gnarly areas of thesis writing or work problems) and hone+expand your own skills by answering questions on SO. While reputation counters and badges are nice, I’m not really “in it” for the bling, but more for said impact. A major pre-regret of 2018 is that I won’t have as much time in the first six months of the year due to book writing and teaching, but I also hope those activities result in helping as many, if not more, folks with R-centric data science questions.

GitHub

I’ll be up-front about not knowing what I wanted to track from GitHub. I’m not “graded” at work or at home on GH commits (well, I don’t think @mrshrbrmstr judges me by GH commits). As noted, I do what I do to help folks, not for the bling. I’m also a bit reticent to have stars, watches or forks indicate utility of something since many folks use such things for “bookmarks” (I know I do). I push stuff to GH primarily to supplement blogs or stage packages, and I’m not sure I’d change behaviour based on any data views, here, but shall endeavour to look at the data in ways not easily done via GH itself.

For lack of knowledge of any other metric, I’m defining “top” by GitHub stars. To that end, the following shows the weekly commit activity (or, as I have dubbed it “commit pulse”) for my “top 20” repos in 2017:
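The retrieval code isn’t shown inline here, but the weekly “commit pulse” for a single repo is one call to GitHub’s commit-activity stats endpoint; a sketch using the gh package (the repo name is illustrative):

```r
library(gh)
library(purrr)

# 52 weeks of commit totals for one repo; GitHub may return an empty
# result the first time while it computes the stats, so retry if needed.
pulse <- gh(
  "GET /repos/:owner/:repo/stats/commit_activity",
  owner = "hrbrmstr", repo = "hrbrthemes"
)

# pull the weekly totals out of the list of week records
weekly_commits <- map_int(pulse, "total")
```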

This tracks well with active memory. I deliberately spent quite a bit of time on hrbrthemes, sergeant, splashr and decapitated since they help me with daily work-work, but I also felt they’d be useful to the broader community.

The late-in-the-year bumps for some of the packages (most of the repos I have are R packages) were also a deliberate attempt to get some updates on CRAN before 2017 came to a close. Alas, most will have to wait for 2018 since I don’t like to bug CRAN with too many maintenance updates in a year and had ideas for improved functionality in most of them as I tidied them up.

I did have more activity in-general on GitHub this year when compared to the previous one:

Reflection & Speculation

The increase in the number of packages is due, in-part, to my tendency towards SODD. However, I also make a package when I need some functionality but refuse to use source() since I’m adamant that such practices are — in the long run — unsustainable. And, if I need something it’s highly likely at least one other human will need it, and that’s a good enough reason to formalize the functionality into a package.

Having said that, package maintenance has become an emerging issue for me as users of slackr will be glad to point out. Not all of these GitHub-homed packages are on CRAN, but more than a few are. I started “cleaning house” in Q4 and have archived a few repos and consolidated functionality of some into other packages. I’m going to be very picky about what I put on CRAN moving forward, but will also be releasing a slew of new ones to CRAN in support of the forthcoming web scraping tome. There will be a drat repo setup for it, but they need to be on CRAN too so “devtools” is not a barrier to entry for readers.

I also need to start using PR and Issue templates and figure out a way to make it easier for folks to contribute.

Blogging

I ended up writing a new package, pressur, for this section as I wanted clean/quick access to blog stats and I use WordPress for blogging. As noted more than once, I’m in cybersecurity and have had to help many folks with WordPress security issues over the years, and the only way to continue to be able to do that is to run it myself. I also have a teensy bit of honeypot code in the one I run so I can see attacker behaviour. WordPress counts things well and it saves me from relying on Google or having to keep my access logs around for too long.

Beyond some in-week stats reviews, I haven’t looked at WP stats in any real way before and I had some questions I wanted answered:

Has my blathering changed in the past two years (i.e. post frequency and post attributes)?

What were my top posts in 2017?

I have been eerily consistent across the past two years in terms of how much I blather per-post. I remember going dark during the election and it further shows here.

While I wanted to see my top posts each month, I decided to have some fun with this visualization. It packs in the views for every Jetpack-tracked post written since January 1, 2016. I highlighted posts written in each month in 2017 in spring green. Astute readers (if you’ve made it this far) will note the use of Helvetica vs Roboto Condensed for the title & views annotations. Helvetica was the only tolerable (free) choice I had that included superscript 6. Check the code block for how I managed to get the in-month green dots on top.

I owe @henrikbengtsson an adult beverage or three for his assistance in ensuring this chart wasn’t too confusing. His annotation help was/is most appreciated.

Reflection & Speculation

I blogged more but fell victim to my long-form post style (again). I really want to do more short posts in 2018 and will likely not blog much in the first half of the year, but hope to augment book-ish things with some useful blogs. I will very likely dual-purpose the existing blog for the book vs start a book-centric blog and may need to augment this Rmd next year to account for the fact that book-ish things will be intertwined with GH and WP (and, even Twitter) activity.

Twitter

I could have gone crazy here, but virtually any of us on Twitter can go to Twitter’s analytics page and see Tweet data. And, while I would have liked to use rtweet, the “download my complete Twitter archive” option was just too easy to pass up. You can do that as well and just plug in the data file to replicate this (fingers crossed).
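A sketch of how the archive might be ingested (the 2017-era archive shipped a `tweets.csv`; the column names here are from that era’s export and are assumptions, not shown in this document’s code):

```r
library(readr)
library(dplyr)
library(lubridate)

# read the archive export and derive the fields used in the charts below
tweets <- read_csv("data/tweets.csv") %>%
  mutate(
    timestamp = ymd_hms(timestamp), # e.g. "2017-12-01 12:34:56 +0000"
    n_chars = nchar(text)           # tweet length in characters
  )
```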

This one was somewhat interesting since it clearly showed the impact Murica arguing about and electing a buffoon had on me:

I’ve been obsessing a bit over tweet length distributions since Twitter increased the limit. I knew I pushed the 280 limit a bit this quarter and wanted to see by just how much. However, my first reaction to this chart was: “Wow. I really went dark.”
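A tweet-length histogram with the 140/280 thresholds marked is quick to sketch; toy lengths stand in for the real `nchar()` values from the archive:

```r
library(ggplot2)

# toy tweet lengths; the real ones come from nchar() on the archive text
tweet_lengths <- data.frame(
  n_chars = c(40, 90, 120, 139, 140, 200, 275, 280)
)

ggplot(tweet_lengths, aes(n_chars)) +
  geom_histogram(binwidth = 20, fill = "springgreen") +
  geom_vline(xintercept = c(140, 280), linetype = "dashed") +
  labs(x = "Tweet length (characters)", y = "Count")
```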

I fear that I’ll over-blather to 280 more than I should when Twitter decides to grace us macOS folks with a desktop app that can support 280 characters.

I seem to have re-tweeted or twitter-linked way more this year than last year (yes, I brought this somewhat ineffective chart-type back for a brief reprise). GitHub-ish domains are up there for both years and I can squint and see the New York Times trying to get above the fray, too. But, there are quite a number of domains in the “word soup” at the bottom and just focusing on the domain.tld doesn’t really help much (I tried).
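Pulling the domain out of each shared link is one call with the urltools package; a small sketch with illustrative URLs:

```r
library(urltools)

# illustrative URLs; the real ones come from the tweet archive's links
urls <- c(
  "https://github.com/hrbrmstr/hrbrthemes",
  "https://www.nytimes.com/2017/12/01/some-article.html"
)

# extract just the host portion of each URL
domain(urls)
# "github.com" "www.nytimes.com"
```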

Reflection & Speculation

I’m not optimistic that Twitter will get better in 2018. One thing I can do to help it is to not tweet about politics. It clearly doesn’t do any good and a thoughtful long-form post would be a better use of time if the topic warrants exposition. The Library of Congress will also stop archiving all tweets next year, so history will be better served getting content into the global caches of The Internet Archive and Google.

I try to promote the work of others now and will redouble those efforts in 2018. But, there’s a down-side to that which this light analysis showed me: I visit far too many sites. I try to make a conscious decision to only hit up a few sites since the internet is truly fraught with peril. I’ve got a regularly updated hosts file that would make advertising executives across the world want to commit hara-kiri if everyone adopted it. I run uBlock Origin and add custom rules to it regularly. I use a security-centric DNS configuration and I have more than a few endpoint protections active. Even with those precautions, there’s no way I was fully protected when I hit those ~1,100 domains, and I shared links with folks who do not have the same set of precautions in place.

I’m not sure what I’m going to do about this, but it’s made me realize I’m not as picky as I thought I was and I definitely need a better workflow to help ensure the safety of folks I redirect.

FIN

That wraps up this data-driven year-in-review. It’s far from perfect, but I dug into some things I wanted to know and hopefully provided enough code and exposition to help others do the same (or better!).

Let’s hope and pray 2018 has far better things in store for us all than the past two years have sent our collective way.