Tag Archives: Principal component analysis

One of the charges against analytics is that it hasn’t really demonstrated its utility, particularly in relation to recruitment. This is an argument I have some sympathy with. Having followed football analytics for over three years, I’m well-versed in the metrics that could aid decision making in football but I can appreciate that the body of work isn’t readily accessible without investing a lot of time.

Furthermore, clubs are understandably reticent about sharing the methods and processes that they follow, so successes and failures attributable to analytics are difficult to unpick from the outside.

Rather than add to the pile of analytics in football think-pieces that have sprung up recently, I thought I would try and work through how analysing and interpreting data might work in practice from the point of view of recruitment. Show, rather than tell.

While I haven’t directly worked with football clubs, I have spoken with several people who do use numbers to aid recruitment decisions within them, so I have some idea of how the process works. Data analysis is a huge part of my job as a research scientist, so I have a pretty good understanding of the utility and limits of data (my office doesn’t have air-conditioning though and I rarely use spreadsheets).

As a broad rule of thumb, public analytics (and possibly work done in private also) is generally ‘better’ at assessing attacking players, with central defenders and goalkeepers being a particular blind-spot currently. With that in mind, I’m going to focus on two attacking midfielders that Liverpool signed over the past two summers, Adam Lallana and Roberto Firmino.

The following is how I might employ some analytical tools to aid recruitment.

Initial analysis

To start with I’m going to take a broad look at their skill sets and playing style using the tools that I developed for my OptaPro Forum presentation, which can be watched here. The method uses a variety of metrics to identify different player types, which can give a quick overview of playing style and skill set. The midfielder groups isolated by the analysis are shown below.

Midfield sub-groups identified using the playing style tool. Each coloured circle corresponds to an individual player. Data via Opta.

I think this is a useful starting point for data analysis as it can give a quick snapshot of a player and can also be used for filtering transfer requirements. The utility of such a tool is likely dependent on how well scouted a particular league is by an individual club.

A manager, sporting director or scout could feed into the use of such a tool by providing their requirements for a new signing, which an analyst could then use to provide a short-list of different players. I know that this is one way numbers are used within clubs as the number of leagues and matches that they take an interest in outstrips the number of ‘traditional’ scouts that they employ.

As far as our examples are concerned, Lallana profiles as an attacking midfielder (no great shock) and Firmino belongs in the ‘direct’ attackers class as a result of his dribbling and shooting style (again no great shock). Broadly speaking, both players would be seen as attacking midfielders but the analysis is picking up their differing styles which are evident from watching them play.

Comparing statistical profiles

Going one step further, fairer comparisons between players can be made based upon their identified style; for example, marking down a creative midfielder for taking fewer shots than a direct attacker would be unfair, given their respective roles and playing styles.

Below I’ve compared their statistical output during the 2013/14 season, which is the season before Lallana signed for Liverpool. I’m going to make the possibly incorrect assumption that Firmino was someone that Liverpool were also interested in that summer. Some of the numbers (shots, chances created, throughballs, dribbles, tackles and interceptions) were included in the initial player style analysis above, while others (pass completion percentage and assists) are included as some additional context and information.

The aim here is to give an idea of the strengths, weaknesses and playing style of each player by ranking them against their peers. Whether ranking low or high on a particular metric is a ‘good’ thing depends on the statistic e.g. taking shots from outside the box isn’t necessarily a bad thing to do but you might not want to be top of the list (Andros Townsend in case you hadn’t guessed). Many will also depend on the tactical system of their team and their role within it.

Lallana profiles as a player who is good/average at several things, with chances created seemingly being his stand-out skill here (note this is from open-play only). Firmino on the other hand is strong and even elite at several of these measures. Importantly, these are metrics that have been identified as important for attacking midfielders and they can also be linked to winning football matches.

Based on these initial findings, Firmino looks like an excellent addition, while Lallana is quite underwhelming. Clearly this analysis doesn’t capture many things that are better suited to video and live scouting e.g. their defensive work off the ball, how they strike a ball, their first touch etc.

At this stage of the analysis, we’ve got a reasonable idea of their playing style and how they compare to their peers. However, we’re currently lacking further context for some of these measures, so it would be prudent to examine them further using some other techniques.

Diving deeper

So far, I’ve only considered one analytical method to evaluate these players. An important thing to remember is that all methods will have their flaws and biases, so it would be wise to consider some alternatives.

For example, I’m not massively keen on ‘chances created’ as a statistic, as I can imagine multiple ways that it could be misleading. Maybe it would be a good idea then to look at some numbers that provide more context and depth to ‘creativity’, especially as this should be a primary skill of an attacking midfielder for Liverpool.

Without wishing to go into too much detail, Lallana is pretty average for an attacking midfielder on these metrics, while Firmino was one of the top players in the Bundesliga.

I’m wary of writing Lallana off here as these measures focus on ‘direct’ contributions and maybe his game is about facilitating his teammates. Perhaps he is the player who makes the pass before the assist. I can examine this with data too, by looking at the attacks he is involved in. Lallana doesn’t rise up the standings here either; again, the quality and level of his contribution is basically average. Unfortunately, I’ve not worked up these figures for the Bundesliga, so I can’t comment on how Firmino shapes up (I suspect he would rate highly here also).

Recommendation

Based on the methods outlined above, I would have been strongly in favour of signing Firmino as he mixes high quality creative skills with a goal threat. Obviously it is early days for Firmino at Liverpool (a grand total of 239 minutes in the league so far), so assessing whether the signing has been successful or not would be premature.

Lallana’s statistical profile is rather average, so factoring in his age and price tag, it would have seemed a stretch to consider him a worthwhile signing based on his 2013/14 season. Intriguingly, when comparing Lallana’s metrics from Southampton and those at Liverpool, there is relatively little difference between them; Liverpool seemingly got the player they purchased when examining his statistical output based on these measures.

These are my honest recommendations regarding these players based on these analytical methods that I’ve developed. Ideally I would have published something along these lines in the summer of 2014 but you’ll just have to take my word that I wasn’t keen on Lallana based on a prototype version of the comparison tool that I outlined above and nothing that I have worked on since has changed that view. Similarly, Firmino stood out as an exciting player who Liverpool could reasonably obtain.

There are many ways I would like to improve and validate these techniques and they might bear little relation to the tools used by clubs. Methods can always be developed, improved and even scrapped!

Hopefully the above has given some insight into how analytics could be a part of the recruitment process.

Coda

If analytics is to play an increasing role in football, then it will need to build up sufficient cachet to justify its implementation. That is a perfectly normal sequence for new methods as they have to ‘prove’ themselves before seeing more widespread use. Analytics shouldn’t be framed as a magic bullet that will dramatically improve recruitment but if it is used well, then it could potentially help to minimise mistakes.

Nothing that I’ve outlined above is designed to supplant or reduce the role of traditional scouting methods. The idea is just to provide an additional and complementary perspective to aid decision making. I suspect that more often than not, analytical methods will come to similar conclusions regarding the relative merits of a player, which is fine as that can provide greater confidence in your decision making. If methods disagree, then they can be examined accordingly as a part of the process.

Evaluating players is not easy, whatever the method, so being able to weigh several assessments that all have their own strengths, flaws, biases and weaknesses seems prudent to me. The goal of analytics isn’t to create some perfect and objective representation of football; it is just another piece of the puzzle.

truth … is much too complicated to allow anything but approximations – John von Neumann

*I’ve done this by calculating percentile figures to give an indication of how a player compares with their peers. Values closer to 100 indicate that a player ranks highly in a particular statistic, while values closer to zero indicate they attempt or complete few of these actions compared to their peers. In these examples, Lallana and Firmino are compared with other players in the attacking midfielder, direct attacker and through-ball merchant groups. The white curved lines are spaced every ten percentiles to give a visual indication of how the player compares, with the solid shading in each segment corresponding to their percentile rank.
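As a rough sketch of the percentile idea, the snippet below ranks one player against a peer group. It uses one common definition (the share of peers at or below the player's value); the exact formula behind the charts isn't stated here, so treat `percentile_rank` and the figures in `peers` as hypothetical, made-up illustrations rather than the real method or real Opta data.

```python
def percentile_rank(value, peer_values):
    """Percentage (0-100) of peers whose value is at or below `value`."""
    if not peer_values:
        raise ValueError("peer_values must not be empty")
    at_or_below = sum(1 for v in peer_values if v <= value)
    return 100.0 * at_or_below / len(peer_values)


# Made-up per-90 figures for a peer group (the player is included in the group).
peers = [0.8, 1.1, 1.4, 1.9, 2.3, 2.6, 3.0, 3.4]
player = 2.3

rank = percentile_rank(player, peers)
print(f"Percentile rank: {rank:.1f}")
```

Restricting `peers` to players in the same style groups is what makes the comparison fair in the sense described above.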

At the recent OptaPro Forum, I was delighted to be selected to present to an audience of analysts and representatives from the football industry. I presented a technique to identify different player types using their underlying statistical performance. My idea was that this would aid player scouting by helping to find the “right fit” and avoid the “square peg for a round hole” cliché.

In the presentation, I outlined the technique that I used, along with how Dani Alves made things difficult. My vision for this technique is that the output from the analysis can serve as an additional tool for identifying potential transfer signings. Signings can be categorised according to their team role and their performance can then be compared against their peers in that style category based on the important traits of those player types.

The video of my presentation is below, so rather than repeating myself, go ahead and watch it! The slides are available here.

Each of the player types is summarised below in the figures. My plan is to build on this initial analysis by including a greater number of leagues and use more in-depth data. This is something I will be pursuing over the coming months, so watch this space.

I’ve previously looked at whether different playing styles can be assessed using seasonal data for the 2011/12 season. The piece concentrated on whether it was possible to separate different playing styles using a method called Principal Component Analysis (PCA). At a broad level, it was possible to separate teams between those that were proactive and reactive with the ball (Principal Component 1) and those that attempted to regain the ball more quickly when out of possession (Principal Component 2). What I didn’t touch upon was whether such features were potentially more successful than others…

Below is the relationship between points won during the 2011/12 season and the proactive/reactive principal component. The relationship between these variables suggests that more proactive teams, which tend to control the game in terms of possession and shots, are more successful. However, the converse could also be true to an extent, in that successful teams might have more of the ball and thus have more shots and concede fewer. Either way, the relationship here is relatively strong, with an R² value of 0.61.
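For anyone wanting to reproduce this sort of figure, here is a minimal sketch of a straight-line fit and its R² value in plain Python. The function name `linear_fit_r2` and the numbers in `pc1` and `points` are made up for illustration; they are not the WhoScored data and won't give the value quoted above.

```python
def linear_fit_r2(x, y):
    """Least-squares line y = intercept + slope * x, plus its R² value."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("x and y must each contain some variation")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R² = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return slope, intercept, 1.0 - ss_res / ss_tot


# Made-up principal component scores and season points for seven teams.
pc1 = [-3.0, -1.5, -0.5, 0.0, 1.0, 2.0, 3.5]
points = [30, 41, 45, 52, 50, 66, 80]

slope, intercept, r2 = linear_fit_r2(pc1, points)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R²={r2:.2f}")
```

R² only measures how well a line describes the scatter; it says nothing about which variable is driving the other.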

Relationship between number of points won in the 2011/12 season with principal component 1, which relates to the proactive or reactive nature of a team. More proactive teams are to the right of the horizontal axis, while more reactive teams are to the left of the horizontal axis. The data is based on the teams in the top division in Germany, England, Spain, France and Italy from WhoScored. The black line is the linear trend between the two variables.

Looking at the second principal component, there is basically no relationship at all with points won last season, with an R² value of a whopping 0.0012. The trend line on the graph is about as flat as a pint of lager in a chain sports bar. There is a hint of a trend when looking at the English and French leagues individually but the sample sizes are small here, so I wouldn’t get too excited yet.

Playing style is important then?

It’s always tempting when looking at scatter plots with nice trend lines and reasonable R² values to reach very steadfast conclusions without considering the data in more detail. This is likely an issue here as one of the major drivers of the ‘proactive/reactive’ principal component is the number of shots attempted and conceded by a team, which is often summarised as a differential or ratio. James Grayson has shown many times how Total Shots Ratio (TSR, defined as total shots for / (total shots for + total shots against)) is related to the skill of a football team and its ability to turn that control of a game into success over a season. That certainly appears to play a role here, as this graph demonstrates: the relationship between points and TSR yields an R² value of 0.59. For comparison, the relationship between points and short passes per game yields an R² value of 0.52. As one would expect based on the PCA results and this previous analysis, TSR and short passes per game are correlated also (R² = 0.58).
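The TSR definition above is simple enough to write out directly. This is just a sketch; the function name `total_shots_ratio` and the season totals in the example are made up for illustration.

```python
def total_shots_ratio(shots_for, shots_against):
    """TSR = shots for / (shots for + shots against), a value between 0 and 1."""
    total = shots_for + shots_against
    if total == 0:
        raise ValueError("no shots recorded, so TSR is undefined")
    return shots_for / total


# Made-up season totals: a side taking 600 shots and conceding 400.
tsr = total_shots_ratio(600, 400)
print(f"TSR = {tsr:.2f}")
```

A TSR of 0.5 means a team takes exactly as many shots as it concedes; values above that indicate a side that out-shoots its opponents.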

Circular argument

As ever, it is difficult to pin down cause and effect when assessing data. This is particularly true in football when using season-averaged statistics, as score effects likely play a significant role in determining the final totals and relationships. Furthermore, the input data for the PCA is quite limited and would be improved with more context. However, the analysis does hint at more proactive styles of play being more successful; it is a challenge to ascribe how much of this is cause and how much is effect.

Danny Blanchflower summed up his footballing philosophy with this quote:

The great fallacy is that the game is first and last about winning. It is nothing of the kind. The game is about glory, it is about doing things in style and with a flourish, about going out and beating the other lot, not waiting for them to die of boredom.

The question is, is the glory defined by the style or does the style define the glory?

The perceived playing style of a football team is a much debated topic with conversations often revolving around whether a particular style is “good/bad” or “entertaining/boring”. Such perceptions are usually based upon subjective criteria and personal opinions. The question is whether the playing style of a team can be assessed using data to categorise and compare different teams.

WhoScored report several variables (e.g. data on passing, shooting, tackling) for the teams in the top league in England, Spain, Italy, Germany and France. I’ve collated these variables for last season (2011/12) in order to examine whether they can be used to assess the playing style of these sides. In total there are 15 variables, which are somewhat limited in scope but should serve as a starting point for such an analysis. Goals scored or conceded are not included as the interest here is how teams actually play, rather than how it necessarily translates into goals. The first step is to combine the data in some form in order to simplify their interpretation.

Principal Component Analysis

One method for exploring datasets with multiple variables is Principal Component Analysis (PCA), which is a mathematical technique that attempts to find the dominant patterns of variation within a dataset. Such patterns are known as ‘principal components’: each is a weighted combination of the input variables that describes a certain amount of the variability in the overall dataset. These principal components are numbered in descending order of the amount of variance in the dataset that they account for. Generally this means that only the first few principal components are examined as they account for the greatest percentage variance in the dataset. Furthermore, the object is to simplify the dataset so examining a large number of principal components would somewhat negate the point of the analysis.
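To make that a little more concrete, here is a minimal sketch of PCA using NumPy's singular value decomposition. A few assumptions to flag: the function name `pca` is hypothetical, the input is random placeholder data (98 rows by 15 columns, simply to mimic a teams-by-variables table), and the variables are standardised first because they sit on different scales. Whether the analysis in this post standardised its inputs isn't stated, so that step is my assumption.

```python
import numpy as np


def pca(data):
    """Return (scores, components, explained_ratio) for a samples x variables array."""
    X = np.asarray(data, dtype=float)
    # Standardise each variable (assumed step; breaks if a column is constant).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the principal components, ordered by singular value.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * S  # position of each sample along each component
    explained_ratio = S**2 / np.sum(S**2)  # share of total variance per component
    return scores, Vt, explained_ratio


# Random placeholder data, NOT the WhoScored figures.
rng = np.random.default_rng(0)
demo = rng.normal(size=(98, 15))

scores, components, explained = pca(demo)
print("Variance explained by the first two components:", explained[:2])
```

Plotting `scores[:, 0]` against `scores[:, 1]` gives the kind of two-component scatter discussed in this post.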

The video below gives a good explanation of how PCA might be applied to an everyday object.

Below is a graph showing the first and second principal components plotted against each other. Each data point represents a single team from each of the top leagues in England, Spain, Italy, Germany and France. The question though is what does each of these principal components represent and what can they tell us about the football teams included in the analysis?

Principal component analysis of all teams in the top division in England, Spain, Italy, Germany and France. Input variables are taken from WhoScored.com for the 2011/12 season.

The first principal component accounts for 37% of the variance in the dataset, which means that just over a third of the spread in the data is described by this component. This component is represented predominantly by data relating to shooting and passing, which can be seen in the graph below. Passing accuracy and the average number of short passes attempted per game are both strongly negatively-correlated (r=-0.93 for both) with this principal component, which suggests that teams positioned closer to the bottom of the graph retain possession more and attempt more short passes; unsurprisingly Barcelona are at the extreme end here. Total shots per game and total shots on target per game are also strongly negatively-correlated (r=-0.88 for both) with the first principal component. Attempted through-balls per game are also negatively correlated (r=-0.62). In contrast, total shots conceded per game and total aerial duels won per game are positively-correlated (r=0.65 & 0.59 respectively). So in summary, teams towards the top of the graph typically concede more shots and win more aerial duels, while as you move down the graph, teams attempt more short passes with greater accuracy and have more attempts at goal.
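The correlations quoted above are between each input variable and a team's score on the component. A sketch of how such figures could be calculated is below; `variable_component_correlations` is a hypothetical helper and the input is synthetic data built so that two columns share a common pattern, not the real dataset. Note that the sign of a principal component is arbitrary, so a whole column of correlations can flip sign without changing the interpretation.

```python
import numpy as np


def variable_component_correlations(data, n_components=2):
    """Pearson r between every variable and each of the leading component scores."""
    X = np.asarray(data, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # assumed standardisation
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    return np.array([
        [np.corrcoef(Z[:, j], scores[:, k])[0, 1] for k in range(n_components)]
        for j in range(Z.shape[1])
    ])


# Synthetic data: columns 0 and 1 share a latent pattern, column 2 is pure noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=200)
demo = np.column_stack([
    latent + 0.1 * rng.normal(size=200),
    latent + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

corr = variable_component_correlations(demo)
print(np.round(corr, 2))  # rows = variables, columns = components
```

Variables with a large absolute correlation are the ones that give a component its character, which is how labels like "proactive" get attached to it.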

The first principal component is reminiscent of a relationship that I’ve written about previously, where the ratio of shots attempted:conceded was well correlated with the number of short passes per game. This could be interpreted as a measure of how “proactive” a team is with the ball in terms of passing and how this transfers to a large number of shots on goal, while also conceding fewer shots. Such teams tend to have a greater passing accuracy also. These teams tend to control the game in terms of possession and shots.

The second principal component accounts for a further 18% of the variance in the dataset [by convention the principal components are numbered according to the amount of variance described]. This component is positively correlated with tackles (0.77), interceptions (0.52), fouls won (0.68), fouls conceded (0.74), attempted dribbles (0.59) and offsides won (0.63). In essence, teams further to the right of the graph attempt more tackles, interceptions and dribbles which unsurprisingly leads to more fouls taking place during their matches.

The second principal component appears to relate to changes in possession or possession duels, although the data only relates to attempted tackles, so there isn’t any information on how successful these are and whether possession is retained. Without more detail, it’s difficult to sum up what this component represents but we can describe the characteristics of teams and leagues in relation to this component.

The first and second components together account for 55% of the variance in the dataset. Adding more and more components to the solution would drive this figure upwards but in ever diminishing amounts e.g. the third component accounts for 8% and the fourth accounts for 7%. For simplicity and due to the further components adding little further interpretative value, the analysis is limited to just the first two components.
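The diminishing returns are easy to see by accumulating the percentages quoted above. This short sketch uses only the four figures given in the text; the variable names are my own.

```python
from itertools import accumulate

# Percentage of variance explained by components 1 to 4, as quoted in the text.
explained_pct = [37, 18, 8, 7]

# Running total after adding each successive component.
cumulative_pct = list(accumulate(explained_pct))

for number, (pct, total) in enumerate(zip(explained_pct, cumulative_pct), start=1):
    print(f"Component {number}: {pct}% (cumulative {total}%)")
```

Going from two components to four lifts the total from 55% to 70%, at the cost of two more axes to interpret.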

Assessing team playing styles

So what do these principal components mean and how can we use them to interpret team styles of play? Putting all of the above together, we can see that there are significant differences between teams within single leagues and when comparing all five as a whole.

Within the English league, there is a distinct separation between more proactive sides (Liverpool, Spurs, Chelsea, Manchester United, Arsenal and Manchester City) and the rest of the league. Swansea are somewhat atypical, falling between the more reactive English teams and the proactive 6 mentioned previously. Stoke could be classed as the most “reactive” side in the league based on this measure.

There isn’t a particularly large range in the second principal component for the English sides, probably due to the multiple correlations embedded within this component. One interesting aspect is how all of the English teams are clustered to the left of the second principal component, which suggests that English teams attempt fewer tackles, make fewer interceptions and win/concede fewer fouls compared with the rest of Europe. Inspection of the raw data supports this. This contrasts with the clichéd blood and thunder approach associated with football in England, whereby crunching tackles fly in and new foreign players struggle to adapt to the intense tackling approach. No doubt there is more subtlety inherent in this area and the current analysis doesn’t include anything about the types of tackles/interceptions/fouls, where on the pitch they occur or who perpetrates them, but this is an interesting feature pointed out by the analysis and is worthy of further exploration in the future.

The substantial gulf in quality between the top two sides in La Liga and the rest is well documented but this analysis shows how much they differed in style from the rest of the league last season. Real Madrid and Barcelona have more of the ball, take more shots and concede far fewer shots compared with their Spanish peers. However, in terms of style, La Liga is split into three groups: Barcelona, Real Madrid and the rest. PCA is very good at evaluating differences in a dataset and with this in mind we could describe Barcelona as the most “different” football team in these five leagues. Based on the first principal component, Barcelona are the most proactive team in terms of possession and this translates to their ratio of shots attempted:conceded; no team conceded fewer shots than Barcelona last season. This is combined with their pressing style without the ball, as they attempt more tackles and interceptions relative to many of their peers across Europe.

Teams from the Bundesliga are predominantly grouped to the right-hand side of the second principal component, which suggests that teams in Germany are keener to regain possession than those in the other leagues analysed. The Spanish, Italian and French teams tend to fall between the two extremes of the German and English teams in terms of this component.

All models are wrong, but some are useful

The interpretation of the dataset is the major challenge here; Principal Component Analysis is purely a mathematical construct that doesn’t know anything about football! While the initial results presented here show potential, the analysis could be significantly improved with more granular data. For example, the second principal component could be improved by including information on where the tackles and interceptions are being attempted. Do teams in England sit back more compared with German teams? Does this explain the lower number of tackles/interceptions in England relative to other leagues? Furthermore, the passing and shooting variables could be improved with more context; where are the passes and shots being attempted?

The results are encouraging here in a broad sense – Barcelona do play a different style compared with Stoke and they are not at all like Swansea! There are many interesting features within the analysis, which are worthy of further investigation. This analysis has concentrated on the contrasts between different teams, rather than whether one style is more successful or “better” than another (the subject of a future post?). With that in mind, I’ll finish with this quote from Andrés Iniesta from his interview with Sid Lowe for the Guardian from the weekend.

…the football that Spain and Barcelona play is not the only kind of football there is. Counter-attacking football, for example, has just as much merit. The way Barcelona play and the way Spain play isn’t the only way. Different styles make this such a wonderful sport.