I ate Fish and Chips at the Rose and Crown in honor of our upcoming trip to London.

Waiting for the week to start. Looking forward to Friday on Sunday.

Listening To: Audri Nix - Inevitable

Pathing Users in Google Analytics (11 Jun 2019)

User-level pathing is a useful tool in the Analytics toolkit when it comes to understanding the complete journey of your customers from the moment they enter to the moment they leave. Unfortunately, Google Analytics allows only a very limited level of user-level pathing out of the box. With a little bit of tweaking, however, we can get some pretty solid user journey pathing reports.

This is a two-part series where I'll cover the following topics:

Setting up Google Analytics to get every page path that a user touches along with the timestamp

Using Python to ingest the GA page pathing data to aggregate user journeys to understand which ones are the most familiar. Bonus: understanding which pages contribute to conversions in the journey!

End Result

Let's start with what we want out of this and work backwards. Ideally we would have a report that lets us see, at the user level, each page they visited and whether they converted or not.

user_path                            occurrences
homepage > exit                      102
homepage > category_a > exit         85
homepage > category_a > product_b    25

To derive the above table we need a few things: the timestamp of every pageview for every user, and a way to sequence those pageviews by timestamp for each user. To do this we need to create two custom dimensions: Cookie ID and Timestamp.

Step 1

Go to Admin > Property > Custom Dimensions and create two new custom dimensions: one for Timestamp (MS) and one for Cookie ID. Remember the dimension IDs.

Step 2

Go to GTM and create three new variables. The trigger for each of these should be a pageview so they fire on each and every page the user visits.

Now you have the base setup finished. After publishing your GTM container you should start to see Cookie ID and timestamp data populate for your users. You will have to wait up to 24 hours before you see anything, so don't hit refresh over and over again.

Step 3

In your GA pageview tag in GTM, map each of the new variables to the corresponding custom dimension index from Step 1 so the values are sent with every hit.

Step 4

The easiest way to see this data is to create a custom report. This will give you the raw list of each user's pageviews along with a millisecond timestamp associated with each.
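As a preview of part 2, here is a minimal sketch of how that export might be aggregated in Python. The file name and the cookie_id, timestamp_ms, and page_path column names are assumptions; yours will depend on how the export is labeled.

import pandas as pd

# Assumed CSV export of the custom report; column names are hypothetical
df = pd.read_csv("ga_custom_report.csv")

# Order every pageview by user and millisecond timestamp
df = df.sort_values(["cookie_id", "timestamp_ms"])

# Collapse each user's ordered pageviews into a single path string,
# then count how often each path occurs
paths = (
    df.groupby("cookie_id")["page_path"]
      .apply(" > ".join)
      .value_counts()
      .rename_axis("user_path")
      .reset_index(name="occurrences")
)
print(paths.head())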

You now have everything you need to start manipulating and pathing user data across your websites. Stay tuned for part 2, where we'll go over how to import this data into Python to derive the table at the top of the article and start to analyze the most popular aggregate page paths.


Sequential Rank Ordering with Pandas (29 May 2019)

A frequent need in data analysis is to understand the order in which a given event occurred for a given user or account. An example question that lends itself to this is "What is the average first-order value of people who have purchased 2 items from us?"

When I first started in analytics, I typically did this with SQL; however, I thought I'd share how to do this in Pandas because I've been really enjoying how brief and concise the code is.
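Here's a minimal sketch of the approach, assuming a hypothetical orders dataframe with user_id, date, and order_value columns:

import pandas as pd

# Hypothetical orders data for illustration
orders = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 3],
    "date": pd.to_datetime([
        "2019-01-05", "2019-02-10", "2019-01-03",
        "2019-01-20", "2019-03-01", "2019-02-14",
    ]),
    "order_value": [50.0, 30.0, 20.0, 45.0, 60.0, 15.0],
})

# Sort by date, group by user, and rank each user's orders sequentially
orders["order_rank"] = (
    orders.sort_values("date")
          .groupby("user_id")["date"]
          .transform("rank")
          .astype(int)
)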

Above we create a new column, order_rank, in our dataframe. We sort our dataframe of orders by date, group it by user_id, and apply the transform method to the date series to come up with a sequential rank from low to high.

We take it from this (the raw orders table) to this (the same table with order_rank added)!

The example above shows this in action. Now you can filter the dataframe to each user's second order, create a new dataframe of user_ids that have more than one order, and do countless other things.
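As a follow-up sketch (continuing the hypothetical orders dataframe from above), here's how you might answer the average-first-order-value question posed earlier:

# Users with at least two orders
repeat_users = orders.loc[orders["order_rank"] >= 2, "user_id"].unique()

# Average first-order value among those repeat purchasers
first_orders = orders[(orders["order_rank"] == 1)
                      & (orders["user_id"].isin(repeat_users))]
print(first_orders["order_value"].mean())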

Replacing Gmail: Protonmail, Tutanota, Mailbox.org (19 May 2019)

I've been more and more interested lately in replacing Gmail. It's actually pretty difficult to do, since the entire suite of Google products does quite a bit, but one thing I'm interested in lately is subscribing to services that do not mine my data or use my engagement with their platform to better target ads. Don't worry, the irony of a marketing analyst not wanting ads run against them is not lost on me.

I started researching different email providers and I had a few criteria.

Security

Two-Factor Authentication

Usability

These were really all it came down to. I tried several services:

Protonmail

Tutanota

Mailbox.org

Protonmail

Protonmail is one of the most popular suggestions when you start researching alternatives to Gmail. It was featured on Mr. Robot and gained a following among cryptofans and people of varying security needs.

A couple of selling points featured heavily are that they are A) located in a bunker in Switzerland and B) end-to-end encrypted.

Protonmail is great if your primary concern is the privacy of your communications between other members of a community that is concerned with the privacy of their communications.

My use case and threat model are slightly different from Protonmail's. 99% of the people I need to communicate with are on free email services, and nobody is interested in clicking a link to decrypt a message I sent. True story: in testing these, my wife responded to an encrypted email I sent her with "Who do you think you are, James Bond?"

Protonmail is actually great, though; I wasn't too bothered by the lack of IMAP support since the iOS app is smooth. The downside for me, as I eventually found, was the price: it is about 5x the price of the next two I tried.

Tutanota

Tutanota is a secure email provider located in Germany with a heavy emphasis on privacy and security. Browsing the site you get the feeling that they are a lean company heavily engaged with their audience, with consistent posts on social media and an active update calendar. These are great signs.

Another awesome thing is their 2FA capability. Tutanota supports U2F, which means you can elevate your security game with a hardware security token.

At about $1 a month, I thought I had found my service. Except there were a couple of things that sort of made it difficult for me to jump in headfirst.

The iOS app is slow. As in it lags and loads and generally takes a long time. As I went back and forth between the Tutanota app and the Protonmail app, I longed for a Tutanota app as quick as Protonmail's. If they supported IMAP I could get past this and use a different mail client, but I just can't live with slow-loading apps in 2019.

If this were the only thing, I could probably live happily ever after with Tutanota, but I realized there were a lot of extra features I could also have at a similar price point.

Mailbox.org

I am not sure where I first heard about Mailbox.org, but my wife was getting very tired of me changing my email address every 3 days, so it was coming down to the wire.

Mailbox.org is an email provider based in Germany that emphasizes privacy but also provides a suite of other products.

Cost

For about $1 a month you get 2GB of email storage and 100MB of cloud storage that can be used across their Google Drive competitor (think spreadsheets, documents, etc.). They also offer a CalDAV calendar and a contact book.

At this point I'm starting to think that this is something that can completely carry me off the Google ecosystem.

2FA

But let's talk 2FA for a moment. Mailbox.org, you have some odd 2FA practices. Their 2FA is sort of like combining both of the F's into one: if you want to use TOTP or a hardware security key, you have to type a PIN plus the TOTP code into your password field upon logging in, so both factors are completed in one step...which is fine, but not the best. I would love to see some U2F support out of Mailbox.org to elevate it. Also, the password you create when you first open the account is then used as the general-purpose app password for your iDevices and other things that aren't as 2FA friendly. This is less than desirable because you are reusing the same app password, granting the same scope, with pretty much every app. This was almost enough to make me move on.

Encrypted Mailbox

Then we get to the encrypted mailbox. Mailbox.org allows you to have an encrypted mailbox so that even if an attacker is able to access your account by, let's say, entering your app password into a mail client, your entire inbox is encrypted with PGP. Since my threat model is more about random hackers and less about targeted state attacks, I can sleep soundly knowing that even if someone manages to breach my account, they'll be greeted with a series of messages encrypted to my PGP key, which only my private key can decrypt. The only things they could see would be the sender and the subject, which for me is perfectly fine.

You might wonder: if your inbox is encrypted, how does that work with iOS? As you might expect, simply IMAPping your account on your phone will result in you receiving a bunch of encrypted emails, meaning you cannot use the default apps to read your mail unless you like copying and pasting attachment contents into a PGP decrypter with your key.

This is where Canary comes in. Canary is an app for iOS and macOS that allows you to store your PGP key locally so that it can decrypt your incoming messages on the fly. You can now read the contents of your encrypted inbox on your phone. They also have a desktop client, but Thunderbird accomplishes the same thing and is free, so I tend to just use that.

Disposable Emails

Another awesome feature is disposable emails. While they do offer plus-aliasing and catch-alls on custom domains that you can use for sorting and cataloging your inbound email, sometimes you may be required to provide an email address to a less-than-trustworthy site. In these instances, Mailbox.org lets you create a disposable email address that is good for 30 days and funnels emails back to you.

Productivity

I mentioned the calendar and spreadsheet apps briefly, but they truly are a selling point in my opinion. When you log in to the web client you are seated at a hub containing snapshots of your inbox, calendar, tasks, appointments, storage quota, and a few other items you can customize. The ability to quickly see everything at a glance is very nice and saves me from opening multiple windows to parse through my day.

Conclusion

In the end I opted for Mailbox.org. The wide range of products, the confidence instilled by the combination of an encrypted inbox and multi-factor authentication, and the great pricing made it an easy choice. I still have love for Tutanota because I am 100% behind the spirit of what they are doing, so I'll keep my paid account with them for a while, but for getting things done and general productivity I am optimistic I have found my workhorse.

A/B Testing (19 Apr 2019)

A Bayesian approach to A/B testing is often touted as superior to frequentist hypothesis testing because of its supposed ability to handle smaller sample sizes, as well as the ability to use varying degrees of prior knowledge to inform the analysis. My intention here isn't to provide the mathematical background behind running these sorts of tests, but a computational method for arriving at a point where you can make a decision.

We'll be using PyMC3 for this, so first we import it.

import pymc3 as pm

Here we can imagine a scenario in which we are running a test that is displaying two different versions of an ad to customers arriving on our website.

In this example we ran our test for a week or so and our data shows us that 3,551 individuals were exposed to our control while 5,693 were exposed to our variation. 63 success events were recorded from the control group while 125 were recorded from the variation group.

n1 = 3551  # visitors exposed to the control
x1 = 63    # conversions from the control group
n2 = 5693  # visitors exposed to the variation
x2 = 125   # conversions from the variation group

Next we have to set the parameters for our prior probability distribution. An A/B test is not entirely unlike the ubiquitous coin-toss example. I went ahead and borrowed the PyMC3 code structure for a coin toss from here and modified it slightly to suit our needs.

An uninformed prior for this would be a Beta(1,1), which means we say that before we know anything, all values of the potential conversion rates are equally likely.

alpha = 1  # Beta(1, 1) is the uniform prior
beta = 1

Now we construct the model. PyMC3 uses a specific syntax built around a with statement. Notice we are constructing two different probability distributions and then creating a third based on the difference between the two. That is to say, we are interested in the difference between the probable conversion rates of A and B.
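Here is a minimal sketch of such a model; the variable names p_control and p_variation are my own choices (difference matches the trace key used below), but the structure is the standard PyMC3 pattern:

with pm.Model() as model:
    # Beta(1, 1) priors over the two unknown conversion rates
    p_control = pm.Beta("p_control", alpha=alpha, beta=beta)
    p_variation = pm.Beta("p_variation", alpha=alpha, beta=beta)

    # Binomial likelihoods tying the rates to our observed data
    obs_control = pm.Binomial("obs_control", n=n1, p=p_control, observed=x1)
    obs_variation = pm.Binomial("obs_variation", n=n2, p=p_variation, observed=x2)

    # The quantity of interest: the difference between the two rates
    difference = pm.Deterministic("difference", p_variation - p_control)

    # Sample with Metropolis-Hastings
    trace = pm.sample(20000, step=pm.Metropolis())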

We are using the Metropolis-Hastings algorithm to solve the model computationally. While we could solve this analytically, it's a bit more fun.
Finally, we are interested in looking at what the probability is that the Variation is greater than our Control.

_ = pm.plot_posterior(trace['difference'], ref_val=0)

This plots the difference distribution with a reference line at 0. The share of the posterior above zero represents the probability that B's conversion rate is greater than A's.

We can see from the above chart that the difference distribution's 95% highest density interval is between -0.001 and 0.01. Also, the probability that B is better than A is 93.3%, which leaves a 6.7% probability that B is worse. We are safe to go with B.

It's important to frame the decision in terms of the risk, too. Seeing an 80% probability of B being greater than A might sound reasonable, but if the stakes are high enough you might not be willing to accept the 20% chance that implementing B is worse than A!


A short while ago I published a rather technical post on the development of a Python-based attribution model that leverages a probabilistic graphical modeling concept known as a Markov chain.

I realize what might serve as better content is the motivation behind doing such a thing, as well as a clearer understanding of what is going on behind the scenes. So to that end, in this post I'll describe the basics of the Markov process and why we would want to use it for attribution modeling in practice.

What is a Markov Chain?

A Markov chain is a type of probabilistic model; that is, a system of distinct states that are connected to each other by transition probabilities.

The state, in the example of our attribution model, is the channel or tactic that a given user is exposed to (e.g. a nonbrand SEM ad or a Display ad). The question then becomes, given your current state, what is your next most likely state?

Well, one way to estimate this would be to get a list of all possible states branching from the state in question and create a conditional probability distribution representing the likelihood of moving from the initial state to each of the other possible states.

So in practice, this could look like the following:

Let our current state be SEM in a system containing the possible states of SEM, SEO, Display, Affiliate, Conversion, and No Conversion.

After we look at every user path in our dataset we get conditional probabilities that resemble this.

Notice how the probabilities extending from the SEM state sum to one. This is an important property of a Markov process and one that will arise organically if you have engineered your dataset properly.
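To make that concrete, here is a rough sketch of how those conditional probabilities might be estimated from a set of user paths (the paths below are made up for illustration):

from collections import Counter, defaultdict

# Hypothetical user paths; each ends in Conversion or No Conversion
paths = [
    ["SEM", "Display", "SEM", "Conversion"],
    ["SEO", "SEM", "No Conversion"],
    ["Display", "SEM", "Conversion"],
    ["SEM", "No Conversion"],
]

# Count transitions between consecutive states
transition_counts = Counter()
for path in paths:
    for current, nxt in zip(path, path[1:]):
        transition_counts[(current, nxt)] += 1

# Normalize so each state's outbound probabilities sum to one
outbound_totals = defaultdict(int)
for (current, _), count in transition_counts.items():
    outbound_totals[current] += count

probs = {
    (current, nxt): count / outbound_totals[current]
    for (current, nxt), count in transition_counts.items()
}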

Connect all the nodes

Above we identified the conditional probabilities only for the scenario in which our current state was SEM. We now need to go through the same process for every other possible scenario to build a networked model that can be followed indefinitely.

Intuition

Up to this point I've written a lot about the process of defining and constructing a Markov chain, but I think it is helpful now to explain why I like these models over standard heuristic-based attribution models.

Look again at the fully constructed network we have created, but pay special attention to the outbound Display vectors that I've highlighted in blue below.

According to the data, we have a high likelihood of not converting at about 75%, and only a 5% chance of converting the user. However, that user has a 20% probability of proceeding to SEM as the next step. And SEM has a 50% chance of converting!

This means that when it comes time to do the "attribution" portion of this model, Display is very likely to increase its share of conversions.

Attributing the Conversions

Now that we have constructed the system that represents our user behavior, it's time to use it to re-allocate the total number of conversions that occurred over a period of time.

What I like to do is take the entire system's probability matrix and simulate thousands of runs through the system, each ending when our simulated user arrives at either Conversion or No Conversion. This allows us to generalize from a rather small sample because we can simulate the random walk through the different stages of our system using our prior understanding of the probability of moving from one stage to the next. Since we pass a probability distribution into the mix, we allow for a bit more variation in our simulation outcomes.
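A rough sketch of that simulation, reusing the probs dictionary from the earlier snippet (the start states and run count are arbitrary choices):

import random

def next_state(probs, current):
    # Draw the next state from the current state's outbound distribution
    options = [(nxt, p) for (cur, nxt), p in probs.items() if cur == current]
    states, weights = zip(*options)
    return random.choices(states, weights=weights)[0]

def simulate_cvr(probs, start_states, n_runs=10000):
    # Walk the chain until a terminal state is reached, n_runs times
    conversions = 0
    for _ in range(n_runs):
        state = random.choice(start_states)
        while state not in ("Conversion", "No Conversion"):
            state = next_state(probs, state)
        conversions += (state == "Conversion")
    return conversions / n_runs

base_cvr = simulate_cvr(probs, start_states=["SEM", "SEO", "Display"])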

After getting the conversion rates of the system we can simulate what occurs when we remove channels from the system one by one to understand their overall contribution to the whole.

We do this by calculating the removal effect[1] which is defined as the probability of reaching a conversion when a given channel or tactic is removed from the system.

In other words, if we create one new model for each channel where that channel is set to 100% no conversion, we will have a new model that highlights the effect that removing that channel entirely had on the overall system.

Mathematically speaking, we take the percent difference between the conversion rate of the overall system with a given channel set to NULL and the conversion rate of the intact system, and we do this for each channel. Then we divide each channel's removal effect by the sum of the removal effects across all channels to get a weighting for each, and finally multiply that weighting by the total number of conversions to arrive at the fractionally attributed number of conversions.
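In code, the weighting step might look like the following; the CVR numbers are placeholders, and in practice each removal CVR would come from re-running the simulation with that channel's outbound edges redirected to No Conversion:

# Placeholder conversion rates for illustration
base_cvr = 0.05          # CVR of the intact system
removal_cvr = {          # CVR with each channel removed
    "SEM": 0.020,
    "SEO": 0.035,
    "Display": 0.040,
    "Affiliate": 0.045,
}

# Removal effect: percent drop in system CVR when the channel is removed
removal_effect = {ch: (base_cvr - cvr) / base_cvr for ch, cvr in removal_cvr.items()}

# Weight each channel by its share of the total removal effect
total_effect = sum(removal_effect.values())
weights = {ch: effect / total_effect for ch, effect in removal_effect.items()}

# Fractionally attribute a period's conversions
total_conversions = 1000
attributed = {ch: round(w * total_conversions) for ch, w in weights.items()}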

If the above paragraph confuses you, head over to here and scroll about a third of the way down for a clear removal effect example. I went and made my example system too complicated for me to want to manually write out the removal effect CVRs.

That's it

Well, by now you have a working attribution model that leverages a Markov process to allocate fractions of a conversion to multiple touchpoints! I have also built a proof-of-concept in Python that employs the above methodology to perform Markov-model-based attribution given a set of touchpoints.[2]