jueves, 12 de febrero de 2015

Yesterday I was in a drugstore and saw the UP MOVE monitors on sale. I am blessed to live in Germany, a highly developed country, so maybe it's totally normal to encounter physical activity trackers sold in a suburban (no offence, Dachau) shop. Maybe this is exactly where they belong: among essential goods, next to band-aids and aspirin pills. And ayurvedic body oils, of course.

I am a strong believer in the quantified self, and I agree with everyone that knowledge is power: the better you know yourself, the wiser you act. I am also a believer in actimetric devices, not because I am engaged with them professionally, but the other way around. Actigraphs are simple in their underlying idea, yet the inference they provide is remarkably useful. They are absolutely great when you want to find out something related to movement and sleep. Just, you know, because when you move you move, and when you sleep you don't move. This is roughly the idea. And as these devices are also very precise, and nowadays come wrapped in all kinds of handy and beautiful interfaces, I like them a lot. So, when I see portable actigraphs on sale in a drugstore, I conclude that people rely on them more and more every day, and it makes me glad.

I wish Jawbone would let me dig my nose into their data, because their database must be huge, and I am sure I could infer many pretty things from it.

Nowadays, activity trackers come in all shapes and forms: from clips to smart watches. What they all have in common is an accelerometer that captures the tiniest motions of the body. These gadgets not only help track physical activity but also support inference about sleep.

Actimetric devices have been widely adopted across many areas of medical research. Being portable and relatively cheap, they are handy for measuring anything that has to do with motion, which makes them incredibly convenient for research into physical activity and sleep-related matters.

It is intuitively clear and backed by scientific evidence that poor sleep, short sleep, stress (and constipation) affect how we feel and, therefore, how we perform. We might think that we know ourselves well enough to estimate our own sleep, but unfortunately our perceptions do not always match reality.

This is exactly one of those cases when "machine knows better", and we are lucky that, in our modern world, we have reliable, portable and attainable tools such as actimetric monitors to quantify ourselves.

Actigraphs have been validated against polysomnography (PSG), which in turn is recognised as the gold standard for sleep analysis. In young adult populations, actimetric devices predict sleep bouts correctly in a whopping 96% of cases, predict wake bouts correctly in only 36% of cases, and are accurate overall in 86% of cases.

While accelerometers have been known to the world for over a century - take a look at the Google Ngram chart below (this interactive graph shows the first occurrence and prevalence of the listed terms in the body of the world's books tracked by Google) - the use of actigraphic devices to measure sleep has a more recent history: the oldest entry I have discovered in PubMed dates back to 1988 (I was 3 then).

The rationale behind the use of actigraphy for sleep analysis is plain and simple. Below is a brief explanation of how sleep can actually be assessed by these devices.

An accelerometer is a tool that captures accelerations of an object (or a subject, for that matter) across three orthogonal motion axes: vertical, horizontal and depth. In a tracking device, accelerations are captured at every epoch (i.e., time unit), which is commonly set to one second.
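As a rough sketch of how raw three-axis readings collapse into per-epoch activity counts (the function, sampling rate and gravity-subtraction step below are my own simplifications; real trackers use proprietary filtering):

```python
import numpy as np

def activity_counts(samples, rate_hz=50, epoch_s=1):
    """Collapse raw 3-axis accelerations (in g) into one activity count
    per epoch: combined magnitude minus gravity, rectified and summed."""
    samples = np.asarray(samples, dtype=float)      # shape (n, 3)
    magnitude = np.linalg.norm(samples, axis=1)     # combine the 3 axes
    dynamic = np.abs(magnitude - 1.0)               # strip the static 1 g
    per_epoch = rate_hz * epoch_s
    n_epochs = len(dynamic) // per_epoch
    trimmed = dynamic[: n_epochs * per_epoch]
    return trimmed.reshape(n_epochs, per_epoch).sum(axis=1)

# A perfectly still subject produces zero counts (2 s at 50 Hz, 1 g on z):
still = np.tile([0.0, 0.0, 1.0], (100, 1))
print(activity_counts(still))  # → [0. 0.]
```

The resulting one-dimensional count series is what the sleep-scoring algorithms below actually consume.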

There are specific algorithms that analyse activity counts at every epoch during a certain period (think one night) and allot each epoch to one of two buckets: sleep state or wake state. Two of the most common methods are the Sadeh algorithm, mostly used for adolescent populations, and the Cole-Kripke algorithm, typically employed for adults. Both algorithms are validated and provide similar results, with the Sadeh method having somewhat higher sensitivity and specificity.
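To illustrate the shape of such an algorithm - a weighted moving window over activity counts, thresholded into sleep/wake - here is a minimal sketch. The weights and threshold are placeholders of my own, not the published Sadeh or Cole-Kripke coefficients:

```python
import numpy as np

# Placeholder weights over a 7-epoch window (4 past epochs, the current
# one, 2 future ones). NOT the published Sadeh/Cole-Kripke coefficients;
# they only show the shape of the method.
WEIGHTS = np.array([0.04, 0.04, 0.10, 0.20, 0.40, 0.10, 0.04])
THRESHOLD = 1.0  # weighted score below this => epoch scored as sleep

def score_sleep(counts):
    """Return True (sleep) / False (wake) for every epoch."""
    padded = np.pad(np.asarray(counts, dtype=float), (4, 2))
    return [float(padded[i:i + 7] @ WEIGHTS) < THRESHOLD
            for i in range(len(counts))]

quiet_night = [0] * 10
restless = [0, 0, 0, 0, 100, 0, 0, 0, 0, 0]
print(all(score_sleep(quiet_night)))  # → True
print(score_sleep(restless)[4])       # → False
```

The window matters: a single large movement also pulls the neighbouring epochs towards "wake", which is exactly the smoothing behaviour these algorithms rely on.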

And then the accounting happens. All added up, the sleep bouts form the sleep duration. Also, depending on its interface, the program associated with the actimetric device can report many other things, like sleep latency (the time it takes you to fall asleep), wake after sleep onset (the total time you have been awake after the commencement of sleep) and sleep efficiency.

The last one is somewhat ambiguous because it can either denote the ratio between your total sleep time and the time you spent in bed, or, alternatively, it can compare your total sleep time to some benchmark. For example, if you have a commercial sleep tracker and you set a goal of sleeping 8 hours every night, the numbers on your screen will reflect the percentage of this goal that your sleep time reaches.
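The two readings of sleep efficiency can be written down in a couple of lines (the function names are mine):

```python
def efficiency_vs_time_in_bed(total_sleep_min, time_in_bed_min):
    """First reading: sleep time as a share of time spent in bed."""
    return 100.0 * total_sleep_min / time_in_bed_min

def efficiency_vs_goal(total_sleep_min, goal_min=8 * 60):
    """Second reading: sleep time as a share of a self-set goal."""
    return 100.0 * total_sleep_min / goal_min

# 7 hours asleep out of 7.5 hours in bed, against an 8-hour goal:
print(round(efficiency_vs_time_in_bed(420, 450), 1))  # → 93.3
print(efficiency_vs_goal(420))                        # → 87.5
```

The same night thus reads as a respectable 93.3% by the first definition but only 87.5% against the goal.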

My advice: don't do that. If you struggle to sleep these 8 hours for any reason - a sleep disorder, a demanding office job, a baby - your constant "underperformance in sleep" will only stress you out. As a result, you will be all worked up and sleep worse. True story.

How much sleep you really need is a complicated matter which could be discussed in detail, but it is nicely wrapped up in this chart.

lunes, 2 de febrero de 2015

Repeated measurement studies are conducted in all areas of research. In epidemiology, these designs are especially common. Sometimes they are referred to as longitudinal studies, implying that subjects are followed over time and that the aim is to track their exposure to risk factors, their outcomes, or both.

In economics, longitudinal data are called panel data. These can describe how a variety of economic factors - for instance, macroeconomic parameters like inflation, poverty or GDP - change over time across economic subjects. Panel data come in handy for competitiveness studies. Econometricians know a vast variety of recipes to cook such data sets: from plain univariate ANOVA to Bayesian MCMC models.

Many tutorials on practical handling of repeated measurement data employ the dietox data set: 831 observations containing weights of slaughter pigs measured weekly. So, indeed, these designs can be encountered in any area of research.

In practice, two main types of models are used to deal with panel data: generalised linear mixed models (GLMM) and generalised estimating equations (GEE). A lot of smart scientific text has been written on both of them, so I won't elaborate on how they work. I'll only briefly mention that they provide somewhat similar inferences. GLMM come up with "preciser" estimates thanks to random effects, and they have a goodness-of-fit measure, the maximum likelihood. GEE models do not compute a likelihood; they use a so-called quasi-likelihood, which is computationally simpler to obtain.

The GEE approach has found many followers who praise its advantages over the GLMM method. The main strengths include relative ease of calculation and weaker dependence on a "correct" multivariate distribution specification: GEE models can yield consistent results even when the correlation structure of the data is misspecified. The main criticisms levelled at GEE models in comparison to GLMM are that they are population-averaged (read "less precise") and that the quasi-likelihood they use is only apt for estimation: it provides no means for testing model fit and, therefore, for choosing a correct model specification. However, Pan (2001) has proposed a quasi-likelihood information criterion (QIC), and it has been successfully implemented in practice.

When it comes to the practical handling of data, GEE and GLMM methods are provided in Matlab, SAS and R alike, but unfortunately not in Python. However, you can run R code from Python, and the library rpy2 is the most popular way to do this.

A recent paper (Nooraee et al. 2014) has compared GEE models fitted by SAS, SPSS and several R packages and has found differences in the output.

I mostly use R for reasons of convenience and, wait for it, crossplatformness (this word does exist!), but I have all due respect for, and can even use, SAS and Matlab. So I have decided to fit a GEE model in different programs and see what comes out.

And R is freeware, which means that you use it at your own risk. It also means that if you want to do something, anything, there are at least two libraries written independently by different people that allow you to do it. In particular, if you want to fit a GEE model, your first choices would be the packages geepack, gee and repolr (see Nooraee et al. 2014 for a comprehensive comparison of their performance).

Inequality

So, what about the inequality?

In this entry, inequality does not imply comparison between analytical models or their practical realisation. What is meant is real-world economic inequality, on a macro level. I have downloaded a small data set from the website of the World Bank. It shows economic discrepancy measured by the so-called Gini coefficient. 20 EU countries are used for the analysis, and the data have been observed over 11 years (with missing values): from 2004 to 2014.

The following indicators have been used: the GINI index, mean consumption per capita, and the shares of income distributed among the quintiles of the population. The latter means: how much of the total income is held by the poorest 20%, and so on up to how much is held by the richest 20%. Yes, this is exactly the "how much of the world's money is possessed by the top 1%" question, but on a fractioned, deconstructed level. Then, I have computed an interquintile range as the difference in income possession between the richest and the poorest. These factors have been opposed to one another in a correlation test (Pearson's) to check for basic linear connections:

              gini  consum.  1st q  2nd q  3rd q  4th q  5th q   IQiR
gini          1.00    -0.56  -0.96  -0.97  -0.93  -0.01   0.99   1.00
consumption  -0.56     1.00   0.49   0.66   0.51  -0.12  -0.55  -0.54
1st quintile -0.96     0.49   1.00   0.90   0.81  -0.16  -0.93  -0.96
2nd quintile -0.97     0.66   0.90   1.00   0.92  -0.06  -0.96  -0.96
3rd quintile -0.93     0.51   0.81   0.92   1.00   0.25  -0.96  -0.93
4th quintile -0.01    -0.12  -0.16  -0.06   0.25   1.00  -0.13  -0.04
5th quintile  0.99    -0.55  -0.93  -0.96  -0.96  -0.13   1.00   0.99
IQiR          1.00    -0.54  -0.96  -0.96  -0.93  -0.04   0.99   1.00

So it can be seen that the interquintile range correlates with the GINI coefficient almost perfectly (r = 0.999), and so does the 5th quintile, which stands for the share of income held by the richest 20%. Basically, the wealth of the wealthiest alone reflects inequality almost perfectly.
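For what it's worth, the whole matrix above is one call to a Pearson correlation routine. The snippet below fakes a comparable data set - Dirichlet-drawn quintile shares and a crude, noisy inequality proxy standing in for the World Bank figures, with consumption omitted - just to show the mechanics:

```python
import numpy as np

# Stand-in for the World Bank extract: 40 country-year rows with
# Dirichlet-drawn quintile income shares; "gini" here is a crude proxy
# (richest share minus poorest share, plus noise), not the real index.
rng = np.random.default_rng(0)
shares = rng.dirichlet(np.ones(5), size=40)      # q1 ... q5, rows sum to 1
iqir = shares[:, 4] - shares[:, 0]               # interquintile range
gini = iqir + rng.normal(0.0, 0.01, size=40)     # noisy inequality proxy

data = np.column_stack([gini, shares, iqir])     # 7 columns
corr = np.corrcoef(data, rowvar=False)           # pairwise Pearson r
print(corr.shape)                                # → (7, 7)
print(corr[0, -1] > 0.95)                        # gini vs IQiR: near-perfect
```

By construction the proxy reproduces the headline finding: the range between the richest and poorest quintiles moves in lockstep with the inequality measure.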

A scatterplot:

I have decided to proceed with this predictor because mean consumption has only been provided for 16 observations. My other idea was to add some GDP measure as a predictor, but my aim was to compare software performance and not to make a macroeconomic finding, so I left this urge unattended.

A very basic GEE model uses 76 observations from 18 countries over 9 years. In both the R and SAS realisations, these models operate on complete observations with no missing values. GEE helps fit population-averaged models, and it relies on the concept of a working correlation structure over time. Here, the correlation structure is assumed to be autoregressive of order 1, because the inequality of a given year will have the strongest relationship with the inequality of the preceding and/or the subsequent year.
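The AR(1) assumption fully determines the working correlation matrix from a single parameter alpha: the correlation between years i and j is alpha^|i-j|. A minimal sketch:

```python
import numpy as np

def ar1_working_correlation(alpha, dim):
    """AR(1) working correlation: corr(year_i, year_j) = alpha ** |i - j|."""
    idx = np.arange(dim)
    return alpha ** np.abs(idx[:, None] - idx[None, :])

# With alpha = 0.9541 the 9x9 matrix decays geometrically off the
# diagonal, reproducing the working correlation matrix SAS prints
# for this model:
R = ar1_working_correlation(0.9541, 9)
print(round(R[0, 1], 4), round(R[0, 2], 4))  # → 0.9541 0.9103
```

Each extra year of separation just multiplies the correlation by alpha once more, which is why the matrix below has constant diagonals.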

The same analysis can be done in SAS:
The code:

GEE Model Information

Correlation Structure           AR(1)
Subject Effect                  country_name (18 levels)
Number of Clusters              18
Correlation Matrix Dimension    9
Maximum Cluster Size            9
Minimum Cluster Size            1

Algorithm converged.

Working Correlation Matrix

        Col1    Col2    Col3    Col4    Col5    Col6    Col7    Col8    Col9
Row1  1.0000  0.9541  0.9103  0.8685  0.8286  0.7906  0.7543  0.7197  0.6867
Row2  0.9541  1.0000  0.9541  0.9103  0.8685  0.8286  0.7906  0.7543  0.7197
Row3  0.9103  0.9541  1.0000  0.9541  0.9103  0.8685  0.8286  0.7906  0.7543
Row4  0.8685  0.9103  0.9541  1.0000  0.9541  0.9103  0.8685  0.8286  0.7906
Row5  0.8286  0.8685  0.9103  0.9541  1.0000  0.9541  0.9103  0.8685  0.8286
Row6  0.7906  0.8286  0.8685  0.9103  0.9541  1.0000  0.9541  0.9103  0.8685
Row7  0.7543  0.7906  0.8286  0.8685  0.9103  0.9541  1.0000  0.9541  0.9103
Row8  0.7197  0.7543  0.7906  0.8286  0.8685  0.9103  0.9541  1.0000  0.9541
Row9  0.6867  0.7197  0.7543  0.7906  0.8286  0.8685  0.9103  0.9541  1.0000

GEE Fit Criteria

QIC     79.3518
QICu    86.0000

Analysis Of GEE Parameter Estimates
Model-Based Standard Error Estimates

Parameter         Estimate  Std. Error  95% Confidence Limits        Z  P-value
Intercept         -16.3367      0.9553  -18.2090  -14.4645     -17.10   <.0001
year 2004          -0.2521      0.2592   -0.7602    0.2560      -0.97   0.3309
year 2005          -0.2286      0.2650   -0.7480    0.2907      -0.86   0.3882
year 2006          -0.2318      0.2598   -0.7410    0.2774      -0.89   0.3722
year 2007          -0.1116      0.2535   -0.6085    0.3852      -0.44   0.6597
year 2008          -0.3641      0.2533   -0.8605    0.1324      -1.44   0.1506
year 2009          -0.2944      0.2585   -0.8011    0.2124      -1.14   0.2549
year 2010           0.2325      0.2425   -0.2428    0.7078       0.96   0.3378
year 2011           0.0223      0.2294   -0.4273    0.4719       0.10   0.9226
year 2012           0.0000      0.0000    0.0000    0.0000          .        .
fifth_quintile      1.2089      0.0233    1.1633    1.2545      51.97   <.0001
Scale               0.7654           .         .         .          .        .

The zero estimates for year 2012 are explained by the fact that there is only one observation for this data point, Romania (Gini of 27.3). It is not good practice to include such observations in a model, but they are handy for showing differences in software output.

SAS and R (geepack) have yielded somewhat different estimates. The scale parameter reported by R is 0.3038, while SAS reports 0.765. In the correlation structure, the alpha reported by R is 0.698 versus 0.954 in SAS. The significance, direction and magnitude of the coefficient estimates hold.

Bonus track: Pigs

So, as I mentioned in the beginning, the dietox dataset is often used for training purposes in R. Not having found any comparison of analyses of this data set between R and SAS, I ran a very plain model in both: a GEE model with a Poisson distribution function and a logarithmic link (the default), where a pig's weight is modelled as a function of time and copper intake. Coefficient estimates and standard deviations differed at the second decimal point, which is fine and may be due to different convergence algorithms and floating-point handling. However, the scale and correlation parameters showed differences at the first decimal point.

Résumé

SAS and R come up with slightly different estimates, yet for the sake of practical analysis the difference is negligible. The results are generally consistent with one another.

lunes, 1 de diciembre de 2014

Some time ago, when I was looking into the German credit scoring dataset, different models showed that the "moral" of a person, namely their empirically observed tendency to return borrowed money, is one of the main predictors of their credit score.

Which is kind of obvious.

This measure of "moral" can be deduced when data on one's credit history are available. In countries with long-established credit traditions, like Germany, it is normally not a problem to get this empirical evidence. However, in some countries this is not the case. Moreover, you can always stumble upon people who have never applied for any kind of loan, anywhere.

How would you go about this problem?

For example, you can rely on a report from an independent credit agency (think SCHUFA in Germany). They track your income and expenditures and come up with a verdict. You can also reach out to these agencies when you need to be backed up for, say, renting a flat. Such ratings are not available everywhere, and their estimate is not always the ultimate truth: SCHUFA has allegedly committed some errors in its analyses.

The second option is to try to come up with a tailored, subjective estimate of someone's solvency based on their behaviour and mindset. These two options are worth considering separately.

In 2012, SCHUFA announced that they were going to use personal data from people's Facebook profiles in their scoring, which, of course, provoked a lot of negative feedback. Nevertheless, Kreditech, a Hamburg-based fintech startup providing microcredits and founded in 2012, does act on this SCHUFA initiative. According to this article in the Economist, Kreditech asks a potential borrower for access to their Facebook data and, based on their profile and contacts, infers whether this person is likely to return a loan. The Economist quotes a Kreditech representative as saying that an applicant with a Facebook friend who has defaulted on a firm's loan would most probably be rejected.

This kind of verdict goes well with the proverb "Tell me who your friend is and I'll tell you who you are". However, an average person aged between 25 and 43 has 360 friends on Facebook. This is in the US, but you get my point. So, out of these 360, one could have a friend (or two) who is a taxidermist, a tuk-tuk driver, or who has otherwise failed grad school, but that does not help infer anything about the applicant. I am still waiting for a reply from Kreditech; hopefully, I'll receive one sometime.

Facebook should indeed make a very good living from people's mundane life data. My assumption is supported by the fact that someone very knowledgeable I am friends with left his job as a quantitative analyst in finance to take a role as a software developer at Facebook. Also, they have just announced their new hire, Vladimir Vapnik, which means that they can afford it. The question is, do the users benefit from Facebook to the same extent that Facebook benefits from them?

Back to the task of credit scoring: there is another, indirect way to estimate someone's credibility - psychometric scores based on questionnaires. The major name among companies doing such analyses is the Entrepreneurial Finance Lab (EFL), a business that originates from Harvard University. It was founded in 2006, and its aim has been to take credit scoring to the next level, making it scalable and independent of data on past credit history, business field or financial documentation.

I have found some evidence of the implementation of psychometric analyses for credit scoring on the internet, but no example of a questionnaire itself. While I can imagine that analysts may employ some kind of deep learning to establish links between someone's personality traits and their "morale", on micro and macro levels, with so little knowledge of how this is actually done, it is still a black box to me. So I just have to take someone's word that it works. And if it does, this is indeed amazing, because it can help bridge so many gaps in crediting left uncovered before.

In the meantime, it would be, perhaps, easier to make the potential borrowers go through a polygraphic analysis.

martes, 11 de noviembre de 2014

Sleep has been a field of ongoing research for many years now. It is quantified, qualified, widely documented, discussed, argued about, etc. Every sensible person in the world knows that they should sleep around 8 hours per day, and there are many guidelines out there, e.g. the ones from the U.S. National Heart, Lung, and Blood Institute.

For the last year and a half, I have been very concerned about how much I sleep. Since I got an activity tracker, I have realised that on average I sleep less than the benchmark value of 8 hours.

The absolute majority of portable sleep-tracking devices aimed at both clinical and quotidian use contain an accelerometer.

Seriously, check out the last link if you are in search of a sleep device for yourself: it's a great read!

So, an accelerometer registers your acceleration along the three axes, and then the relevant software translates the data into a one-dimensional continuous variable representing activity counts. Based on specific algorithms (e.g. this one), the ability of an accelerometer to track even the tiniest motions, and the supposition that when you sleep you don't move, the software classifies your time in bed into periods when you are asleep and when you are awake. The latter is called wake after sleep onset (WASO). The time during which you fall asleep is referred to as sleep latency. When your portable monitoring device says that you have slept 7 hours, it means the time you were actually asleep, i.e. the time in bed minus WASO minus sleep latency.

You normally also get to know your sleep efficiency, i.e. the percentage that the time you have slept makes up of your total time in bed, or of your goal if you are aiming for a specific sleep time.

This efficiency measure is one way to look at the quality of sleep.

There is a huge body of scientific research that considers sleep quality: 6246 records on PubMed as of today. Quality of sleep is something that is conceptually simple and complex at the same time, and, to my knowledge, there exist many different ways, and no ultimate way, to measure and report on it.

Three main approaches to the quality of sleep can be outlined. The first one is simply asking. You can address a respondent with a question about how well they have slept. You can take it one step further and suggest that they rate the quality of their sleep on some scale. To boost the validity of your judgement, you can ask them to do so for several days in a row (for example, seven) - and thus you will come up with what is called a sleep diary.

The second way is to employ the renowned Pittsburgh Sleep Quality Index (PSQI). This is a questionnaire-based measure developed by researchers at the University of Pittsburgh, and it has been extensively used in research (1680 records on PubMed as of today). The index's values range from "poor" to "good", and various sleep parameters across several domains are used along the way, including subjective sleep quality, sleep latency, use of sleep medication and others. In the questionnaire, a subject answers all questions based on what they think applies to the majority of days during the month prior to questioning. It is a highly subjective questionnaire and index, and this is what it is sometimes criticised for.

The third manner to track sleep quality is to use a quantitative objective measure of sleep efficiency mentioned above.

Regardless of how you measure the quality of sleep, it is obvious that getting quality sleep every night, or at least most nights, is very important for the well-being of practically everyone.

It is naturally up to you how far you should go in controlling your performance in sleep. I have worn both commercial and professional devices, and I can say that wearing a tracker makes you think more about how much you sleep and move. This is not necessarily a good thing and, depending on your mindset, it can make you upset or anxious.

Anyway, self-awareness is a good thing, and if you don't want to engage in wearing a monitor, you can keep a sleep diary for a while or even compute the Pittsburgh Sleep Quality Index for yourself.

jueves, 30 de octubre de 2014

Almost anyone who speaks English and has given a thought to founding or joining a startup is surely aware of the Lean Startup philosophy.

The concept was invented by Eric Ries, a young and successful U.S. entrepreneur who wrote a great book explaining how to successfully build and run startup businesses. This book has had a major impact, and the idea has grown into an international movement with many practitioners across the globe.

Munich too has a top-notch group of lean startuppers, and it's been up and running since 2012. The group gathers once a month, and the meetings are organised through the Meetup platform. Like everything in Bavaria, these meetings are of superb quality: the venues are great, the speakers are awesome, and the public is friendly and inspiring. Also, like everything in Bavaria, these meetups involve beer. Good beer. And pizzas. Which are courtesy of numerous group sponsors.

Yesterday, there was a meetup. I have decided to blog about it because it was just great.

Here's an outline of talks:

Remote employees for an IT company: yay or nay? A view of a CTO

The first presentation was given by Dimitar Siljanovski, the CTO of Cuponation, a successful startup that offers retail discount coupons to customers and operates all across Europe. A tech person himself, but also an executive, Dimitar talked about hiring and managing remote teams for IT businesses.

In his talk, he explained the main motivation for remote hires: costs and quality. I found it particularly funny that Dimitar mentioned they prefer to hire Ukrainians rather than Russians, because the latter are too expensive. He told the audience where and how they find the people, and he was able to show that the concept actually works: the churn rate among in-house staff is twice as high as that among remotes! To say nothing of the fact that very few employees have left the company at all.

Dimitar also elaborated quite a bit on how they at Cuponation keep the remotes motivated, e.g. promotions, relocation, etc., and on how remotely located developers help expand the teams locally. He mentioned that one of the obstacles for foreign developers was a slight insecurity in their English language skills (for example, when they need to participate in meetings), and how easily this could be overcome. He said that understanding different cultures is of the highest importance. Dimitar also admitted that one of the issues he once bumped into was a conflict between two dazzling personalities. He confessed that he always intends to hire the best, those he himself can learn from, but understandably, when two stars collide, it might cause problems for the whole team. Such issues are, however, also manageable, as far as he explained.

All in all, remote working in IT is working. The complete presentation can be found here.

Hogwarts for entrepreneurs: The Founder Institute

The second talk was presented by Jan Kennedy from the Founder Institute. This is an educational organisation backed by Microsoft that trains people to be entrepreneurs. They start by testing whether you have the so-called Entrepreneur DNA, which determines how apt you are to found and run a business. More information on that can be found here, but as far as I gathered from Jan's brief presentation, one has to be a determined and well-balanced person to succeed in building a company from scratch. Finally, once the test is passed, you can apply for the program. Then, over 4 months, you get mentored on how to be an entrepreneur, and you are promised a personal approach based on your own strengths and weaknesses as revealed by the DNA test.

The enrollment deadline for Munich is approaching, but if you are elsewhere and are curious about the FI activities, you can attend one of their events listed here.

When your train is late, there is a business opportunity there

Finally, the last talk was presented by lean startupper Thomas Hartmann, who has taken up a common problem and found a market in it.

Not many people are aware of it, but, according to EU legislation, if your train is 1 hour late, you are entitled to a refund of 25% of your travel fare. If the delay reaches or exceeds 2 hours, your compensation rises to a whopping 50%. This is very well explained here (in German).
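In code, the rule is a simple two-threshold lookup (a toy sketch; real claims involve minimum-payout rules and exceptions I ignore here):

```python
def refund_eur(fare_eur, delay_minutes):
    """Two-threshold refund rule as described above: 25% of the fare
    from 60 minutes of delay, 50% from 120 minutes."""
    if delay_minutes >= 120:
        return 0.50 * fare_eur
    if delay_minutes >= 60:
        return 0.25 * fare_eur
    return 0.0

print(refund_eur(100.0, 75))   # → 25.0
print(refund_eur(100.0, 130))  # → 50.0
```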

However, train companies seem reluctant to give your money back. For example, with DB, the main German train company, you can only request your refund at the station, in person, directly addressing a company representative, or you can write them a letter with a request. No e-mails or phone calls are accepted, and there is no app for that either.

Or rather, there was no such app, until Thomas came up with the idea of a service that would help people get reimbursed without making them fill in and file all these forms or queue up, losing time among other frustrated customers. He created a business called Bahn-Erstattung.de, which aims to simplify the refund process for end users. All you have to do is send a picture of your ticket and a repayment request via a smartphone app. The company takes it from there, and you have your money back in about a month.

This idea, simple and smart, aroused great interest in the audience. When people asked Thomas what he would have done differently if he were to start all over again, he said that he would not invest that much in the technical part. Similarly to Cuponation, he also hired a remote developer, in this particular case from India, to create the software. And this has worked well for him.

Is remote work a one solution for all?

To close the loop, I would like to add a couple of thoughts of my own on remote work. The successes achieved by hiring remote programmers have got me thinking whether long-distance collaboration is a good idea for any kind of labour. I found a very interesting article in the MIT Sloan Management Review explaining how to "set up remote workers to thrive". The author lists four challenges standing in the way of distant working and then takes a closer look at each of them, suggesting solutions. In my opinion, this article is not only an absorbing read but also the result of thorough research backed by 13 publications. What surprised me quite a bit is that no particular kind of job is mentioned anywhere in the text. This leads to the conclusion that remote work challenges are generic, regardless of what you actually do for the company. Since these issues are (allegedly) manageable, one may be able to successfully work from "a galaxy far, far away" doing many different things, not necessarily writing code.

This kind of work relationship might not suit every personality and situation, but it is worth considering.

miércoles, 22 de octubre de 2014

I really love reading Data Science Central and, in particular, Analytic Bridge. Several days ago, I received their newsletter, in which they reported on a lot of stuff, including the ever-popular Big Data. In one of the articles, there is an overview of a selection of big data case studies.

One of them speaks (briefly, and provides a wrong link) about a German, Hamburg-based fintech company, Kreditech, that applies sophisticated data analysis to credit scoring. Exciting, right? Especially considering that Kreditech takes the behavioural angle into account. As reported, the company has found, and makes use of, interesting connections between a person's social media behaviour and their financial credibility. These guys have reached massive success, and they have expanded the shop to many countries, including Spain and Russia. On their webpage, they say that they rely on big data and complex machine-learning algorithms to make faster and better scoring decisions.

Some time ago, I read a great article that I found on LinkedIn. It is a blog post by Simon Gibbons, someone who has worked in the credit business for more than 20 years, and he admits that he is glad there are things about this job that have not changed since then.

Which has brought me to thinking that machine learning is one of them. Indeed, it is no secret that logistic regression is The Algorithm used everywhere to perform credit scoring. And if you want to take probably the most amazing online course on machine learning in existence nowadays, offered by Stanford University on the Coursera platform and elsewhere - (do act on this urge!) - you will very soon find out that logistic regression is one of the most basic machine learning algorithms. Statisticians may laugh now. Economists working in scoring may now say they are advanced in machine learning.

Nevertheless, in my humble opinion, logistic regression does a great job, because its output is well interpretable and flexible. First and foremost, the regression coefficients can be translated into the odds of a person being creditworthy conditional on a given factor. Second, the method's output is the probability of a person being creditworthy or not. So, when you want to classify people accordingly, you can play a little bit with the probability threshold value. Thus, you can mitigate the risks of misclassification, namely avoid getting too many false positives, or the opposite. Third, because the underlying algorithm is standard, it can be upgraded (by adding regularisation terms, for example), and this can improve a model's predictive power on new data.
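As a minimal sketch of this workflow in R - using simulated data in place of the credit set, with made-up variable names - fitting the model, exponentiating the coefficients into odds ratios and applying a threshold looks like this:

```r
# Simulated stand-in for the credit data; column names are illustrative only.
set.seed(1)
n <- 1000
credit <- data.frame(
  balance  = factor(sample(1:4, n, replace = TRUE)),  # categorised account balance
  duration = rpois(n, 20),                            # credit period in months
  amount   = rlnorm(n, 8)                             # credit amount
)
credit$good <- rbinom(n, 1, plogis(-1 + 0.5 * as.numeric(credit$balance)))

fit <- glm(good ~ ., data = credit, family = binomial)

exp(coef(fit))                        # conditional odds ratios
p <- predict(fit, type = "response")  # fitted probabilities
table(observed = credit$good, predicted = as.integer(p > 0.5))  # 0.5 threshold
```

Replacing 0.5 with a lower or higher cut-off is all it takes to trade false positives against false negatives.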

A hidden gem: German credit scoring datasets

It is really hard to get personal financial data - it may be even harder than getting clinical data. Luckily, there exists a wonderful German credit scoring dataset. Provided by Munich's LMU university, it has been used extensively by well-known German statistics professors - G. Tutz, L. Fahrmeir and A. Hamerle - for educational purposes. The dataset can be downloaded from the LMU webpage or as a part of the R package Fahrmeir, but I suggest the first option, because in the package, at least in my download, the dataset appears to be somewhat trimmed.

In his book "Regression for Categorical Data", particularly in exercise 15.13, Dr. G. Tutz suggests that the reader uses this dataset to fit a bunch of classifiers, including linear, logistic, nearest neighbours, tree and random forest methods. It is further proposed to split the set into training and test sets in the proportion 80/20 and compute test errors.

Playground

I, too, could not resist the temptation to get my hands on this dataset, and I've tried several methods. Here below I report on three of them that I am particularly fond of: their application and possible results.

At first, the whole dataset has been fitted by these models in order to see how they approximate it and which features they rely upon most. The next step was to split the data several times into train and test sets, as recommended in the book, and test how these models, fitted on the train data, predict on the test data.

Fitting the dataset

Logistic regression

So, to start with, I have run a logistic regression classifier. This has yielded the following set of significant predictors:

The following factors are among them:

balance of current account (categorised),

credit period,

moral, namely, the empirically observed tendency to return lent money,

purpose of credit,

balance of savings account (categorised),

being a woman,

installment in percent of available income,

being a guarantor,

time of residence at the current home address,

type of one's housing,

not being foreign workforce (Gastarbeiter in German).

Here, the values are rounded to 3 digits. Exponentiate the coefficients to obtain the conditional odds ratios.

Then, the goodness of fit can be deduced from a bunch of statistical tests for models, or by crosstabbing the actual values and the fit. Here, the threshold issue should be brought up again. The logistic model outputs fitted probabilities. Therefore, one can classify the respective observations by setting a decision threshold. The most common and intuitive way to do so, also suggested by the shape of the logistic function curve, is to set it to 0.5. This would yield the following fit:

or 78.7% of correctly classified values. In statistics and machine learning, a few metrics are used quite extensively to elicit the goodness of fit of a given classifier, namely precision, recall and the F1-score:

Precision is the share of true positives out of all predicted positives.

Recall is the share of true positives out of all observed positives.

The F1-score is the harmonic mean of precision and recall and provides an estimate of how good the classifier is: 0 is the worst value, 1 is the best value.
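To make these definitions concrete, here is a tiny sketch in R; the confusion counts are hypothetical, not taken from the fit above:

```r
# Hypothetical confusion-matrix counts.
tp <- 620; fp <- 130; fn <- 80; tn <- 170

precision <- tp / (tp + fp)  # share of predicted positives that are truly positive
recall    <- tp / (tp + fn)  # share of observed positives that were recovered
f1        <- 2 * precision * recall / (precision + recall)  # harmonic mean

round(c(precision = precision, recall = recall, f1 = f1), 3)
```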

Depending on what the primary aim of a classifier is, the threshold value can deviate from 0.5. If one wants to predict credit worthiness very confidently, then the threshold can be set higher. If one wants to avoid missing too many worthy people, namely, to avoid false negatives, the threshold can be set lower.

Here below is the table showing what different threshold values yield:

So, as expected, setting the threshold to 0.7 yields the best precision, the value of 0.3 provides the best recall and the highest F1-score, and the value of 0.5 leads to the highest overall percentage of correctly classified subjects.

Tree

The importance of the CART algorithm, introduced in 1984 by Leo Breiman and colleagues, cannot be overestimated. I love this algorithm and I use it a lot. What I particularly like about it is that it can report on variable importance both in terms of Gini importance (or impurity) and information gain. Also, pruning, i.e. reducing the size of a tree, is a very important concept which helps create usable models.

In the beginning, a classification tree has been fitted with no pruning involved.

Variable importance (Gini):

balance of current account: 31

duration: 15

purpose of credit: 11

credit amount: 11

value of savings: 10

most valuable assets: 9

living at current address: 7

previous credits: 2

type of housing: 1

working for current employer: 1

job type: 1

These Gini impurity values have been rescaled to add up to 100, so one can quickly see the relative importance of the factors.

The first tree, with a complexity parameter of 0.01, has turned out to have 81.

In R, a CART tree can be fitted using the rpart package. When it comes to fitted values, rpart can return them both as hard class labels and as probabilities of belonging to a class. If the latter option is chosen, one can employ the same moving-threshold paradigm.
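A sketch of both options, using rpart's built-in kyphosis data in place of the credit set:

```r
library(rpart)

# Classification tree with the complexity parameter used in the text.
tree <- rpart(Kyphosis ~ ., data = kyphosis, method = "class", cp = 0.01)

pred_class <- predict(tree, type = "class")  # hard class labels
pred_prob  <- predict(tree, type = "prob")   # per-class probabilities

# With probabilities, the moving-threshold paradigm applies again:
table(observed  = kyphosis$Kyphosis,
      predicted = ifelse(pred_prob[, "present"] > 0.3, "present", "absent"))

# Refitting with splits based on information gain instead of Gini:
tree_info <- rpart(Kyphosis ~ ., data = kyphosis, method = "class",
                   parms = list(split = "information"))
tree_info$variable.importance
```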

Opting for the default fitted values has yielded the following classification:

The fit is better than for logistic regression: in each category individually and overall - 79.7%.

Refitting the model with splits based on information gain has elicited the following variable importances:

balance of current account: 35

duration: 14

purpose of credit: 11

moral: 11

value of savings: 10

credit amount: 7

most valuable assets: 6

living at current address: 2

previous credits: 1

type of housing: 1

age: 1

This is a slightly different set. The fit is a little bit worse - under the same tree complexity (0.01) - 79.3%, but still a bit better than for logistic regression.

Pruned to a complexity of 0.05, both trees yield the same overall fit accuracy: 74.7%.

Support Vector Machine

I am a massive fan of SVM. Mostly because of all this implicit feature-space mapping and the "kernel trick". Support vectors offer a whole different approach to classification, and the underlying models are very flexible - thanks to kernels and regularisation.

The implementation of SVM in R is amazing and is available via the e1071 package. The default kernel is Gaussian, which is referred to as a radial basis function, and the selection of other implemented kernels includes: linear (dot product), polynomial and sigmoid. I think this is an exhaustive set for basic research needs, but I am kind of interested in implementing other kernels and using them with this classifier. The only minor thing that I'd change is that I'd call the radial basis kernel Gaussian - which it actually is. RBF is a broader term: a Laplacian kernel is also a radial basis function. But I'm being picky, perhaps.

Anyway, as my aim was to fit the dataset, I considered it OK to massively overfit it and set the regularisation term to whatever works.

The table below reports accuracy of classification for different kernels and cost parameters (in %):

For Gaussian and polynomial (of degree 3, which is the default) kernels the fit improves drastically with the growth of the cost parameter. Here below is a similar table but reflecting the number of support vectors each model relies upon:

The size of the dataset is 1000 observations.

The gamma parameter (or the kernel scale parameter), which in the function defaults to 1/(number of features) - including dummies - has remained untouched.
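The e1071 calls behind a table like this can be sketched as follows - here on the built-in iris data, reduced to two classes, rather than the credit set:

```r
library(e1071)

# Two-class subset of iris as a stand-in binary problem.
d <- droplevels(subset(iris, Species != "setosa"))

m_rbf  <- svm(Species ~ ., data = d, kernel = "radial",     cost = 5)
m_poly <- svm(Species ~ ., data = d, kernel = "polynomial", cost = 5)  # degree 3 by default

mean(predict(m_rbf, d) == d$Species)  # training accuracy
m_rbf$tot.nSV                         # number of support vectors relied upon
```

Varying `cost` in such calls is what produces the accuracy and support-vector tables above: higher cost penalises margin violations more and overfits harder.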

Prediction

The second part of the exercise from Prof. Tutz's book suggests splitting the dataset several times into train and test parts, then fitting the model on the first part and testing its predictive performance on the second.

Using random sampling, I have split the dataset 10 times, assigning 20% of it to the training set and 80% to the test set. The tree model has used the default complexity parameter of 0.01. The SVM model has been implemented with the Gaussian kernel and a cost parameter of 5.
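The resampling loop can be sketched like this for the logistic model, with simulated data standing in for the credit set; the tree and SVM models would slot into the same loop:

```r
set.seed(42)
n <- 1000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(d$x1))  # synthetic binary outcome

accuracy <- replicate(10, {
  idx   <- sample(n, size = 0.2 * n)  # 20% to train, 80% to test, as above
  train <- d[idx, ]
  test  <- d[-idx, ]
  fit <- glm(y ~ ., data = train, family = binomial)
  p   <- predict(fit, newdata = test, type = "response")
  mean(as.integer(p > 0.5) == test$y)
})

round(accuracy * 100, 1)  # per-trial test accuracy in %
sd(accuracy * 100)        # variability across trials
```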

Below, there are the validation results for each trial reported as the percentage of cases classified in the test set correctly:

And the summary of the results:

As seen from the boxplot, SVM outperforms the other methods in prediction accuracy, followed by logistic regression. CART, however, had a higher average performance than logistic regression and the smallest variability in results of the three (SD = 1.81%).

I then ran the same analysis, resampling the data 100 times, and came up with the following results:

The respective standard deviations are:

Logistic regression: 2.44 %

CART: 2.85 %

SVM: 2.75 %

Finally, I have run 1000 iterations of the same analysis, just to see if the results hold. And they hold:

As for variability of results, the respective standard deviations were:

Logistic regression: 2.84 %

CART: 2.86 %

SVM: 2.66 %

This comparison could be taken several steps further. Namely, the data could be split into train, cross-validation and test sets, where the first serves to fit the model, the second to adjust the parameters, and the third to test the performance of the resulting classifier. However, there is always room for improvement, and these results can already provide one with an idea of the methods.

If you are still bearing with me, please let me draw your attention to the existence of such an important predictor in the dataset as the presence/absence of a telephone in a person's possession. I believe back in the day they meant landline phones, not even the oversized Motorola mobiles. What would it be now, an iPhone 6?

lunes, 13 de octubre de 2014

During the last six months, I have been working mostly in R. R is great for research purposes, and I am not participating in these endless discussions about what is cooler: R, Python, Matlab, SAS or you name it. Being privileged to speak all of the above-mentioned languages with greater or lesser fluency, I can compare, and therefore I think that it all comes down to what you want to do in the end.

One of the things that I have adopted from my working-exclusively-in-Python experience is the test-driven development (TDD) paradigm. Now, even when writing my research code in R, I can't help creating these tests.

There is actually not much new to say about unit testing, because the topic is extensively covered elsewhere. In my humble opinion, this blog post offers the most awesome coverage of unit testing that I have ever seen.

TDD in general and unit tests in particular are often neglected by R users - unless they are writing a package.

I think the added value of unit tests for research code cannot be overestimated since, despite popular beliefs of people unfamiliar with R, the language is much more than - how one of my classmates liked to put it - "a sophisticated statistical calculator".

Of course, many research findings have been successfully made employing script-based code, but when you have to do similar things multiple times, and when you can wrap your code up and make it unfold beautifully with every call, testing comes in very handy.

R has a certain characteristic: there exists at least one implementation (i.e., package) for almost anything. For some things, there are multiple ways to do them. I don't really know why people reinvent the wheel, but my guess is that when the current state of things is not working for them, they prefer to start from scratch rather than dig into someone else's code.

So, if you are eager to unit test your thoroughly developed work, you can opt for - at least - these three packages:

The last one does not seem to be used very often. The second is famous because it has been developed by the very Hadley Wickham, and is allegedly used by him in his own packages. To those unfamiliar with the name, let me just say that he is the reference R guy, a visualisation guru and the ggplot2 creator. He has a $60 book published by Springer Verlag and a stackoverflowing reputation on Stack Overflow.

I am using the first package from the list, RUnit, and not because it has been created by fellow Germans working in the field of epidemiology. I do so merely because RUnit is so similar to the unit testing framework of Python that I am already familiar with. It is reportedly alike to the unit testing approach implemented in Java. I don't know Java, so I can't tell. What I can tell is that RUnit is great to use. It is clear, comprehensive and unambiguous. Moreover, it comes with a terrific reference manual that is a great read - apart from being informative. It provides a simple yet exhaustive explanation of what unit tests are, why they are helpful and how they differ from integration tests. Also, it provides guidelines on how to write unit tests. It is quite unlikely to encounter a line like:

Unfortunately, to my knowledge, there exists no implementation of test suites in any of the R IDEs. But this is not a major problem, especially for those R users who, like me, started their journey with R using the console only.

So, if you want to define a test suite in R, all you need to do is load the library, then call defineTestSuite(), runTestSuite() and, if you wish, printTextProtocol() for your tests.
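A minimal, self-contained sketch of that workflow; the test file and function names here are made up for illustration, and by default RUnit picks up files matching "runit*.R" containing functions whose names start with "test":

```r
library(RUnit)

# Write a throwaway test file so the example runs on its own.
test_dir <- tempfile("runit_demo")
dir.create(test_dir)
writeLines(c(
  "test.arithmetic <- function() {",
  "  checkEquals(1 + 1, 2)",
  "  checkTrue(2 > 1)",
  "}"
), file.path(test_dir, "runit.arithmetic.R"))

suite  <- defineTestSuite("demo", dirs = test_dir)
result <- runTestSuite(suite)
printTextProtocol(result)   # human-readable test report
```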