
In Part 1 of my KDnuggets article, I explained what hiring managers mean when they look for critical thinking in the arena of data science and analytics. These requirements relate to the nature of data problems found in industry and business settings. The datasets are generally observational, self-selected, non-random, with hidden biases, and increasingly OCCAM; the business leaders have high-level objectives ("we want to increase customer loyalty"). The data scientist/analyst is the person in the "middle," trying to figure out how to make the problem precise and solvable through a systematic analysis of the available data.

In Part 2, I offer some practice case interview questions based on three recent news events:

the college admissions scandal

IPOs of ride-sharing companies like Lyft and Uber

the Blue Apron post-IPO doldrums.

Long a staple of the management consulting hiring process, the case interview is a free-flowing dialogue between the interviewer and the interviewee. The interviewer holds back some data to simulate what is known at the beginning of a data analysis. The interviewee must be willing to probe, digging out more data and shaping the structure of the analysis. The end product is an analytical framework. No one knows whether the framework will be successful until it is implemented.

Those who do well in case interviews are good at (a) thinking on their feet, (b) embracing uncertainty, e.g. by making appropriate assumptions, (c) listening to the interviewer's hints, and (d) persuading.

***

As I mentioned in Part 2, the best way to practice is to form a group of 3-5 people, and interview each other. If there is enough interest, we can start a group in the comments below.

Most business journalism reads like PR put out by companies. Still, some pieces are more blatant than others. The NY Times just printed an example of such gibberish, this opinion piece about WeWork.

WeWork is one of those "unicorn" startups that investors hope to sell to the public for gazillions of dollars before the impending stock market collapse. Its main business is renting office space, mostly to small businesses, entrepreneurs, and small divisions of large companies. The primary "innovation" is the short-term lease, as short as month-by-month. (Disclosure: my startup was a sub-tenant of WeWork through the Harvard Business School Startup Studio. I am a satisfied customer.)

As with many successful startups, WeWork offers a product that a segment of customers finds irresistible. For a startup business like mine, it's much too risky to commit to a 10- or 20-year commercial lease. Most startups fail within a few short years, and the entrepreneurs are already taking on a lot of other risks. So WeWork has an attractive product, but does it have a viable business model to deliver that product profitably?

With few exceptions, WeWork does not own the buildings it leases out. It signs long-term leases with landlords and then rents to startups and smaller businesses on short-term leases; thus the risk of the long-term lease shifts from the entrepreneurs to WeWork.

The Times article presents an argument by analogy to refute an obvious weakness of WeWork's business model, the same weakness that has felled many companies in the past. When you have long-term obligations and short-term revenue sources, you're highly vulnerable to a "credit crunch." This scenario played out in the Great Recession, when many companies suddenly found that banks were unwilling to roll over their debt, which immediately put them out of existence. In the case of WeWork, critics also worry that even the revenue sources will disappear, since many startups will fold due to lack of customers and/or funding during a recession.

The columnist (Andrew Ross Sorkin) then builds a very flimsy argument based on "too big to fail." He cites some absolute numbers, such as the square feet of space WeWork has signed leases for, the number of workers using the space, the number of employees, etc. He then arrives at the conclusion that "when the next economic downturn comes — and it will — WeWork’s landlords will actually be less likely to evict the company if it doesn’t pay its rent."

For an argument based on big numbers, this one is particularly unmoored. None of those numbers is provided in context. What proportion of occupied commercial real estate has WeWork leased out? What proportion of revenues does WeWork contribute to the biggest landlords? The 1,300 employees may make WeWork "one of the biggest architecture firms in the world," but the number does not qualify it as "too big to fail." (Fact check: the largest architecture firm is Gensler, according to Architectural Record, which employs over 5,000, according to Wikipedia. So Gensler is about four times the size of WeWork.)

In addition, the argument misleads readers about what happened during the Great Recession. The banks that were the creditors did not renegotiate contracts with debtors who could no longer pay. Far from it: they got bailed out by the U.S. government, using the argument that the failure of those loans would "bring down the entire global economy as we know it."

This brings me to the final, and most salient, issue. What would those generous landlords be saving in this case? Not the global economy, but a business yet to find a self-sustaining business model, as Sorkin himself noted earlier in the article. WeWork is currently running losses of over a billion dollars a year. According to Bloomberg, WeWork took in roughly $800 million in revenue in 2017 and spent $1.8 billion. We also know WeWork has sold junk bonds, but I'm not sure we know the full amount of debt it is carrying on its books.

WeWork has taken some small steps toward reducing its risk. It has indeed attracted some larger companies to rent space on longer leases. But those companies are doing so because these "longer" leases are still much shorter than what they could get from traditional landlords. Usually, the space is used by short-term, even experimental, business units.

***

Now on to something of interest only to number geeks. I have questions about this statement: "the company says its average customer has a lease of seven to eight months and new customers are signing leases that average 20 months."

I wonder how this "average" is computed. Note that most of the little guys are on month-by-month leases, which renew until cancelled. These are "right-censored" data: we don't know the end dates yet, so a lot of assumptions and projections go into that average. It has to be an average of projections rather than an average of observed data.

Also, I expect the distribution of customer sizes to be heavily "right-skewed": the vast majority of customers are freelancers and small offices, who sign very short leases, with a few larger tenants signing longer leases. It's challenging to interpret an "average" unless we also know the "median."
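To make the skewness point concrete, here is a minimal simulation sketch in Python, using made-up numbers rather than WeWork's actual tenant mix. It shows how a population dominated by short month-to-month tenants plus a handful of long-lease tenants yields a mean well above the median.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tenant mix (illustrative numbers only): 95% small tenants on
# short month-to-month stays, 5% larger tenants on longer committed terms.
n = 10_000
is_large = rng.random(n) < 0.05
small_months = rng.exponential(scale=6, size=n)   # short stays; right-censored in real data
large_months = rng.uniform(12, 36, size=n)        # longer committed leases
lease_months = np.where(is_large, large_months, small_months)

print(f"mean lease length:   {lease_months.mean():.1f} months")
print(f"median lease length: {np.median(lease_months):.1f} months")
# In a right-skewed mix like this, the mean sits well above the median, so a
# quoted "average" overstates the typical tenant's commitment.
```

On top of that, the month-to-month stays in real data are censored, so any reported average also depends on how the company projects those unfinished tenancies.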

We are delighted to feature two great speakers at our Data Science Networking Night on Tuesday, June 26, 2018. The talks cover two aspects of analytics and data science that combine art and science. This is a founding principle of Principal Analytics Prep, the training program I founded to produce well-rounded individuals who bring critical thinking and technical prowess to data analysis.

Professor Andrew Gelman's talk will be on story-telling. He poses a paradox: in statistics, we aim for representative samples and averages but stories emphasize the extreme and unusual. Is it possible to use stories to narrate statistics?

Shira Mitchell, from Mathematica Policy Research, discusses her work with the Millennium Villages Project, aimed at improving well-being in Africa. Many real-world programs like the MVP generate observational data which present major challenges to analysts. How do we know if these projects achieve their targets?

After the talks, there will be an opportunity to meet other data scientists and analysts.

Food and beverages will be served.

The event is held in downtown Manhattan.

Due to overwhelming demand, we just opened some more spots. Register here before they run out!

At the end of last week, Facebook scrambled to get in front of some unsavory press coverage by "proactively" suspending Cambridge Analytica – the data analytics outfit credited with the unlikely successes of the Brexit campaign in the U.K. and the Trump campaign in the U.S. – from its social media platform. It knew that The Guardian and The New York Times were poised to publish critical articles about how Cambridge Analytica exploited the Facebook platform in building its invasive database on 50 million Americans, data that form the foundation of the psychological scoring algorithms used to target and sway voters during the 2016 Presidential election.

As explained in my previous posts (here and here), data sleaze is the practice of taking and trading consumer data serendipitously. The third parties frequently use such data in ways that undermine the consumer's self-interest.

Facebook’s response has led to an avalanche of negative publicity, and there is perhaps some hope that it, as well as other tech firms including Google, Twitter, etc., may finally take action to stop the data sleaze.

I've decided to split this post into two parts. Part 1 is about the inner workings of the data-sleaze operation. Part 2, to be published later this week, is a call for industry leaders to take proactive action to curb the excesses of data sleaze. Part 1 explains the underlying technologies because one can never fully trust information coming mostly from conflicted actors.

In a Nutshell, How Our Personal Data Got Taken

Cambridge Analytica has boasted frequently to the media that it has amassed an extensive database on millions of Americans, which it claims to use to predict the psychological states of voters on behalf of candidates who want to target and sway them. A key part of this database consists of data gleaned from Facebook accounts. Cambridge Analytica purchased Facebook data from an outfit called Global Science Research (GSR), which is run by Aleksandr Kogan, an assistant professor of psychology at the University of Cambridge. GSR is a for-profit entity that he manages, separate from his academic appointment.

Dr. Kogan amassed the data by means of an online survey – a psychometric test similar to the Big 5 Personality Test – advertised on Amazon's Mechanical Turk service. Mechanical Turks are people willing to do "small tasks" for a few cents per task. In this case, the small task is completing the psychometric test. With a twist: Dr. Kogan also required that the Turks download an app and connect it to Facebook, and in so doing, permit him access to their Facebook data as well as their network of Facebook friends.

Facebook data are valuable because they include real names and email addresses (only a minority of users block such snooping). These can then be used as match keys to connect with other sources of data, such as electoral rolls. See here for the kinds of data that Facebook allows partners to obtain.
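As a toy illustration of what a "match key" join looks like, here is a short Python sketch with fabricated records. It assumes, purely for illustration, that both files carry an email column; real matching might instead rely on names and addresses.

```python
import pandas as pd

# Fabricated records for illustration only.
facebook_profile = pd.DataFrame({
    "email": ["ann@example.com", "bob@example.com"],
    "name": ["Ann Smith", "Bob Jones"],
    "likes": ["cats; hiking", "cars"],
})
voter_file = pd.DataFrame({
    "email": ["ann@example.com", "bob@example.com"],
    "party": ["D", "R"],
    "precinct": ["001", "014"],
})

# The shared email column serves as the match key linking the two sources.
linked = facebook_profile.merge(voter_file, on="email", how="inner")
print(linked)
```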

Back in 2015 and 2016, the media were already on to this story. Several articles were published by The Intercept and The Guardian. Facebook at the time was slow to react. It wasn't until 2016 that Facebook lawyers informed Kogan and his associates that they had violated Facebook's platform policies and requested that these entities delete the data. Reporters tracked down some of the Facebook data, and a whistle-blower has come forward with documents, so it appears that Facebook, GSR, and Dr. Kogan have all been less than honest about the data sleaze. Indeed, at a hearing in the U.K., Facebook and Cambridge Analytica representatives denied that the political consulting firm had ever obtained or used Facebook data in its work.

The Enablers

The process of data sleaze outlined above is not unique to GSR or Cambridge Analytica. Many other companies in the social media ecosystem rely on variants. Let’s run down the list of enablers in this process.

Facebook – The popular social-media platform is at the center of this controversy precisely because it has built such a powerful database. This database is hugely valuable to marketers who want to know what we like and whom we know. The social-media company has mastered the art of getting people to share their personal data by providing free, useful services or convenience via the platform. The same machinery that powers Facebook's billions in revenues is driving data sleaze.

Terms and conditions and privacy policies – In every case, including this one, tech firms expressly use terms and conditions as cover for invasion of privacy. They hide behind the façade of "if you don't agree with our terms, then don't use our service." Later, many of these firms devolve to even slyer tactics, such as "if you continue to use our service, we assume you agree with our terms." It's a form of blackmail. Very few users read these terms and policies, but the businesses claim with a straight face that they have obtained permission from users to collect their data. Facebook and Cambridge Analytica argued that user permission was properly obtained. Dr. Kogan apparently disclosed to survey respondents that their data could be used for any purpose. Because of his affiliation with the University of Cambridge, some of the Turks were misled into thinking they were taking part in an academic study.

Bait and switch – the psychological test is a front for collecting each respondent's Facebook social graph. The Facebook data contain information about who knows whom. Similarly, every weather app is a front for a detailed database of user locations at all times. In my view, the most important dataset Dr. Kogan wanted was not the results of the psychological testing, as widely reported, but the names and emails of the networks of friends and acquaintances of all those who signed up. A few hundred thousand self-selected responses to the survey are not sufficient to create an accurate model of every American's psychological state.

Data sharing technologies – a typical app delivers a service to users by pulling in various sources of data and integrating them. To support real-time sharing of data between app developers and data collectors like Facebook, the data collectors set up automated interfaces through which apps can pull down the data. There are usually costs associated with these interfaces, especially when a sizable amount of data is delivered, which is a source of revenues for the data collectors. It's hard to control access, given that these systems allow automated bots to interface with them. Dr. Kogan, for example, could create both a good bot collecting data for academic research and a bad bot siphoning data to Cambridge Analytica.

Data governance black hole – once the data show up in one database, they are bound to show up in many databases, internally as well as at third parties. Once the data reach a third party, Facebook cannot know how many copies are made, or where those copies are. Even within Facebook, with so many employees having access to the data, it is almost impossible to monitor who has copied the data where. Facebook and other social-media outlets have community rules. For example, Facebook has "platform policies" that restrict friends' data to noncommercial uses such as improving user experience. Talk about unenforceable! Facebook can only know what data have been sent to a third party; it has no way of knowing how the third party is using the data.

Data deletion myth – just as it takes special skills to truly eradicate a file from one's PC, it is basically impossible to remove all traces of a dataset from existence. We have trouble even counting and locating all copies of a given dataset. Thus, Facebook didn't even bother to check whether Cambridge Analytica, and selected third parties, had truly destroyed the data they were supposed to destroy. Facebook didn't suspend the controversial company until the media unearthed evidence that the data had not been destroyed.

Mechanical Turks – these bit players were exploited as pawns to sell out their “friends” for mere cents.

Weak regulation and enforcement – Europe might finally be ready to enact laws to regulate the data collection industry, but the U.S. government sees no evil.

Anonymity – many businesses trot out this buzzword to justify their data collection operations. To put it bluntly, we are being lied to. Anonymity is declared every time an analyst replaces identifiers (such as emails) with encrypted or scrambled versions of those identifiers. However, in most cases, lookup tables are available to unmask these users.
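Here is a toy sketch, not any company's actual pipeline, of why hashed or scrambled identifiers offer weak protection: anyone holding a list of candidate emails can rebuild the lookup table and unmask the records.

```python
import hashlib

def mask(email: str) -> str:
    # "Anonymize" an email address by hashing it.
    return hashlib.sha256(email.lower().encode()).hexdigest()

# The "anonymized" dataset that gets shared.
released = [mask(e) for e in ["ann@example.com", "bob@example.com"]]

# An attacker with a list of plausible emails builds a reverse lookup table...
candidates = ["ann@example.com", "bob@example.com", "cat@example.com"]
lookup = {mask(e): e for e in candidates}

# ...and unmasks the supposedly anonymous records.
print([lookup.get(h, "unknown") for h in released])
# ['ann@example.com', 'bob@example.com']
```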

I have been contributing to Andrew's thread on how to get into the data science field. A recent college grad with a degree in environmental science and a minor in statistics wants a job. Andrew suggests getting a job in industry, which I think is an excellent suggestion.

Here is my advice:

Figure out what he enjoys doing – is it coding or is it problem solving? Those are two different jobs: one is software engineering, the other is more statistics and analysis. If he is in NYC, he can come to one of my public lectures at NYPL, in which I explain how to pick a career path within this wide and exciting field. [The next one is on the May schedule.]

Once he has picked an area, and hopefully also an industry, he needs to reach out and talk to as many people in that industry as possible. Go to networking events and meetups.

Then apply to jobs. The job search is a job in itself; keep applying until someone gives you a chance. You will encounter lots of rejection but keep trying.

If nothing is working, consider going to a bootcamp. They are set up to give you practical skills that appeal to hiring managers. Talk to the bootcamp organizers to get a sense of what their vision is, and see if it’d help you make your case.

One reason I have organized a bootcamp is that, for some, it will be very difficult to break into the field without extra help - both in filling knowledge gaps and in making industry connections. I give the above advice to my students as well. They need to find a job that matches their temperament, and then work hard at convincing hiring managers to take a chance.

***

Next Tuesday, we are hosting an Open House.

If you're interested in learning about our vision, drop by and say hello.

Here is a problem staring many digital/Web/social media analysts in the face today: what if you are told that the majority of the data you have been dutifully reporting, analyzing and (gasp!) modeling are fake data?

By fake data, I mean useless numbers that have no bearing on reality: visits to websites that never happened, clicks on ads by hired hands, clicks on ads by bots, clicks on ads buried layers deep where no human can see them, video "views" that result from automatically playing clips, video "views" that last one second, ad reach (i.e. the number of people who have seen the ad) that exceeds Census counts, reviews planted by hired hands, and so on.

None of the above is fictional; each is the reality of the uncontrolled, unaudited, increasingly machine-driven, complex, and secretive world of digital advertising. All major players - Google, Facebook, Microsoft, ad networks like AppNexus and MediaMath - are implicated.

I raised the alarm two years ago in an article at Harvard Business Review, featuring the work of leading ad fraud researcher Dr. Augustine Fou. Recently, there has been a tidal wave of news reports about all kinds of ad fraud and fake data.

I have invited Dr. Fou to comment on this fast-developing situation in the Principal Analytics Prep Webinar on Wednesday night. Learn more about the Webinar and register for free here.

***

Most news items are written from the perspective of brand advertisers, who are belatedly waking up to the huge amount of money wasted. But a big story is being missed: such waste was enabled by massive amounts of data that we now know are fake.

What about the zillions of reports, analyses and models created over the last 20 years by countless data "scientists" and analysts, in which the data from Google, Facebook, and myriad digital marketing vendors are taken at face value as accurate?

In fact, the digital advertising industry was built on the promise that it is more measurable, more accountable and more cost-effective. What Dr. Fou shows is that only basic statistics is needed to uncover such fraud.
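As one example of the kind of basic sanity check involved (my own illustration, not necessarily Dr. Fou's method), reported ad reach can simply be compared against census population: any market where the "people reached" exceeds the number of residents is, by definition, reporting fake data. The numbers below are made up.

```python
# Hypothetical reported reach vs. census population, by market (made-up numbers).
reported_reach = {"Market A": 1_200_000, "Market B": 450_000, "Market C": 3_900_000}
census_population = {"Market A": 950_000, "Market B": 600_000, "Market C": 4_100_000}

for market, reach in reported_reach.items():
    population = census_population[market]
    if reach > population:
        print(f"{market}: reach {reach:,} exceeds population {population:,} -- impossible")
    else:
        print(f"{market}: reach is {reach / population:.0%} of population")
```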

Data cleaning is a huge time sink even without fake data - now, we have to wrestle with mountains of it. But that is the reality, and we have to rise to it.

My numbersense class for Principal Analytics Prep has passed the midway point, and we have covered probabilistic thinking. Yesterday, the idea came to me that the surge protector is a good example to use in teaching probability. I just need to find some statistics - which turn out to be hard to come by... so if you know some sources, please let me know!

Appliances in the U.S. are rated 110 (or is it 120?) volts. Anyone who has lived elsewhere (say, Europe or parts of Asia) may know that in some countries, appliances use 240 volts. If we take an American appliance to Europe, the 240 volts of electricity will immediately kill the device. The 110V or 240V standard is a nominal value (the root-mean-square of the alternating sine wave, not its peak), and we should expect fluctuations around it. There is natural variability in the voltage.

Then there are surges. Statistically, we can define surges as rare events - maybe voltage that is at least three standard deviations above the normal value. (There may be an official definition but I wasn't able to find it on a quick Google search.) Surges are apparently caused by lightning or switching. (Here is a somewhat useful NIST document, which may have been partly or wholly written by suppliers of surge protection equipment.)

I would like to find some data on the statistical distribution of voltage delivered. Then, students can figure out if the data resemble a normal distribution or some other probability model. We can estimate the frequency of surges. This can lead to a quantitative assessment of expected loss due to power surges.
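For instance, here is a minimal sketch assuming, purely for illustration, that delivered voltage is roughly normal with a mean of 120 V and a standard deviation of 3 V. Real measurements might well show fatter tails and serial correlation, which is exactly what students could check.

```python
import numpy as np
from scipy import stats

# Assumed (not measured) parameters for delivered voltage.
mean_v, sd_v = 120.0, 3.0

# Probability that a single reading exceeds the 3-standard-deviation threshold.
threshold = mean_v + 3 * sd_v
p_exceed = 1 - stats.norm.cdf(threshold, loc=mean_v, scale=sd_v)
print(f"P(reading > {threshold:.0f} V) = {p_exceed:.5f}")   # about 0.00135

# Under a (strong) independence assumption, expected exceedances per year
# if voltage were sampled once per minute.
readings_per_year = 365 * 24 * 60
print(f"expected exceedances per year: {p_exceed * readings_per_year:.0f}")
```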

Further, we must account for an asymmetrical cost structure. Having too low a voltage is also a problem. However, it is a much lesser problem than having too high a voltage. There also does not appear to be such a thing as a "negative power surge."

The above lays the groundwork for making a decision about whether one should buy a surge protector. There are different types of surge protectors providing different levels of protection at different prices. How can we decide whether to invest in the next better surge protector?

The NIST document referenced above makes this non-quantified assertion: "A large stack of dollar bills and some change to replace your unprotected computer, if and when a lightning or some other surge destroyed it ..... or use a small number of bills to purchase a 'surge protector' for peace of mind and effective protection." How can we quantify such a statement?
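One way to quantify it is a back-of-the-envelope expected-loss comparison. The sketch below uses entirely made-up inputs (surge probability, device cost, protector cost, effectiveness), which is precisely the data I am hoping to find.

```python
# All inputs below are assumptions for illustration, not published statistics.
p_damaging_surge_per_year = 0.01   # chance per year that a surge destroys the device
device_cost = 1500.0               # "a large stack of dollar bills": replacing the computer
protector_cost = 25.0              # "a small number of bills": the surge protector
protector_effectiveness = 0.95     # fraction of damaging surges the protector absorbs
years = 5                          # assumed service life of the device

expected_loss_unprotected = p_damaging_surge_per_year * device_cost * years
expected_loss_protected = (protector_cost
                           + p_damaging_surge_per_year * (1 - protector_effectiveness)
                           * device_cost * years)

print(f"expected loss without protector: ${expected_loss_unprotected:.0f}")
print(f"expected loss with protector:    ${expected_loss_protected:.0f}")
```

The same framework extends to comparing cheaper and pricier protectors: swap in each protector's cost and effectiveness, and pick the option with the lower expected loss.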

Last week, I got served a dose of predictive analytics. I got an email solicitation from LinkedIn, presenting a list of jobs that they think I might be interested in. This email is algorithmically generated, and LinkedIn tells me that I received it because I clicked on a job posting for a senior data analyst position at Ogilvy & Mather, a top advertising agency based in Manhattan.

Yes, I did click on that job posting while preparing for the information session for Principal Analytics Prep a few days before the email arrived. A podcast of the event is available here.

Here is what the email looks like.

There are several things one can learn from this email:

The field of analytics is absolutely exploding. There are six pages of jobs related to the one job I clicked on. Most of these jobs are junior positions (“senior analyst”), because the ad I clicked on is at that level.

Many top companies are hiring. The competitors of Ogilvy – AKQA, MRM/McCann, OMD USA, RAPP/Omnicom, J. Walter Thompson – are also competing for talent. Not just advertising: other related industries are also hiring senior analysts. I recognize Shazam (the famous mobile song-recognition app), frog (a top design agency), A+E Networks, Etsy (a noteworthy startup retailer), Luxottica (the high-end maker of glasses), AIG, Mercer (a top management consultancy), Mastercard, S&P, and Burtch Works (a top executive recruiter in the data space – also our guest at the info session!). So the jobs are at top companies, startups, and small businesses.

The jobs are spread out over all industries. Just in that small, nonrandom sample, we have representation of advertising, retail, technology, media and entertainment, e-commerce, credit cards, finance and insurance, human resources, management consultancy, and graphic design.

Analytics is needed in all job functions. The job seeker should also consider unconventional career paths, e.g. the analyst at Burtch Works is a recruiter but with a specialty in data science and analytics. Another unusual path is to become an account manager at a digital advertising agency – you may not be running analyses all day long, but if you have superior analytical skills, you'd be much better at explaining results and data-driven recommendations to your clients. Similarly, a salesperson for an analytics product company definitely benefits from foundational analytics knowledge.

One of the key reasons I started Principal Analytics Prep is to open doors in the job market for people of diverse backgrounds and diverse career paths. Analytics and data jobs are not limited to technical people who are coders and engineers. There are plenty of exciting job opportunities across all industries and job functions for data wizards with unconventional backgrounds. Please contact us if you want to learn more about how we can help guide you to your next career in data.

Last week, I gave an information session on the next-gen data analytics bootcamp called Principal Analytics Prep that we just launched. A recap of the session is available here as a podcast.

To prepare for the session, I did a job search on LinkedIn and found over 80,000 open positions in the U.S. matching the word "analytics". Of these, about 3,500 are junior or entry-level positions in the greater NYC area. Over half of those positions were posted in the last 30 days. There are lots of analytics jobs in the business world, and hiring managers are having trouble filling them.

One reason for the difficulty is that everyone wants to hire from a small pool of experienced people. This is rational behavior on the part of hiring managers.

First, there is a huge gap between book knowledge and practice in this field. This is most apparent, for example, in design of experiments (the theory behind A/B testing). Academic courses focus on design and analysis but in real life, the most crucial matters are getting buy-in from various parties affected by testing, and making sure that the design is operationalized properly.

Second, a good data analyst needs to have a wide-lens perspective, which must be developed over time.

Third, corporate America has cut training budgets to nothing. Managers are not rewarded for training workers; in fact, they may be chastised for hiring untrained talent. It typically takes 12 months to get a new employee up to speed, for the employee to turn from a cost driver into a productive worker.

Finally, training frequently backfires on the supervisor. It is common to be offered a big wage hike (say 20%) when switching employers, but it is almost unheard-of to be offered more than 5% even when an internal promotion is awarded. Thus, after 12 months of training, the hiring manager might learn that the employee has been poached.

Our instructors are all practitioners who typically hold senior level positions in industry. They help bridge the theory-practice gap, and also provide a deep network of professional contacts within the industry.