Category: Data Science

This blog provides a technical explanation of the analysis underlying the medical paper about male cyclists described previously. Part of the skill of a data scientist is to choose from the arsenal of machine learning techniques the tools that are appropriate for the problem at hand. In the study of male cyclists, I was asked to identify significant features of a medical data set. This article describes how the problem was tackled.

Data

Fifty road racing cyclists, riding at the equivalent of British Cycling 2nd category or above, were asked to complete a questionnaire, provide a blood sample and undergo a DXA scan – a low intensity X-ray used to measure bone density and body composition. I used Python to load and clean up the data, so that all the information could be represented in Pandas DataFrames. As expected this time-consuming, but essential step required careful attention and cross-checking, combined with the perseverance that is always necessary to be sure of working with a clean data set.

The questionnaire included numerical data and text relating to cycling performance, training, nutrition and medical history. As a result of interviewing each cyclist, a specialist sports endocrinologist identified a number of individuals who were at risk of low energy availability (EA), due to a mismatch between nutrition and training load.

Bone density was measured throughout the body, but the key site of interest was the lumbar spine (L1-L4). Since bone density varies with age and between males and females, it was logical to use the male, age-adjusted Z-score, expressing values in standard deviations above or below the comparable population mean.

The measured blood markers were provided in the relevant units, alongside the normal range. Since the normal range is defined to cover 95% of the population, I assumed that the population could be modelled by a gaussian distribution in order to convert each blood result into a Z-score. This aligned the scale of the blood results with the bone density measures.

Analysis

I decided to use the Orange machine learning and data visualisation toolkit for this project. It was straightforward to load the data set of 46 features for each of the 50 cyclists. The two target variables were lumbar spine Z-score (bone health) and 60 minute FTP watts per kilo (performance). The statistics confirmed the researchers’ suspicion that the lumbar spine bone density of the cyclists would be below average, partly due to the non-weight-bearing nature of the sport. Some of the readings were extremely low (verging on osteoporosis) and the question was why.

Given the relatively small size of the data set (a sample of 50), the most straightforward approach for identifying the key explanatory variables was to search for an optimal Decision Tree. Interestingly, low EA turned out to be the most important variable in explaining lumbar spine bone density, followed by prior participation in a weight-bearing sport and levels of vitamin D (which was, in most cases, below the ideal level of athletes). Since I had used all the data to generate the tree, I made use of Orange’s data sampler to confirm that these results were highly robust. This had some similarities with the Random Forest approach. Although Orange produces some simple graphical tools like the following, I use Python to generate my own versions for the final publication.

Decision Tree from Orange

Published Decision Tree

Finding a robust decision tree is one thing, but it was essential to verify whether the decision variables were statistically significant. For this, Orange provides box plots for discrete variables. For my own peace of mind, I recalculated all of the Student’s T-statistics to confirm that they were correct and significant. The charts below show an example of an Orange box plot and the final graphic used in the publication.

Orange Box Plot

Final Box Plots

The Orange toolkit includes other nice data visualisation tools. I particularly liked the flexibility available to make scatter plots. This inspired the third figure in the publication, which showed the most important variable explaining performance. This chart highlights a cluster of three cyclists with low EA, whose FTP watts/kg were lower than expected, based on their high training load. I independently checked the T-statistics of the regression coefficients to identify relationships that were significant, like training load, or insignificant, like percentage body fat.

Orange Scatter Plot

Published Scatter Plot

Conclusions

The Orange toolkit turned out to be extremely helpful in identifying relationships that fed directly into the conclusions of an important medical paper highlighting potential health risks and performance drivers for high level cyclists. Restricting nutrition through diet or fasted rides can lead to low energy availability, that can cause endocrine responses in the body that reduce lumbar spine bone density, resulting in vulnerability to fracture and slow recovery. This is know as Relative Energy Deficiency in Sport (RED-S). Despite the obsession of many cyclists to reduce body fat, the key variable explaining functional threshold power watts/kg was weekly training load.

Some commentators were skeptical of Team Sky’s explanation for Chris Froome’s 80km tour-winning attack on stage 19 of the Giro. His success was put down to the detailed planning of nutrition throughout the ride, with staff positioned at strategic refuelling points along the entire route. If you consider how skeletal the riders look after two and a half weeks of relentless competition, along with the limits on what can be physically absorbed between stages, the nutrition story makes a lot of sense. Did Yates, Pinot and Aru dramatically fall by the wayside simply because they ran out of energy?

The best performing cyclists have excellent balancing skills. This includes the ability to match energy intake with energy demand. The pros benefit from teams of support staff monitoring every aspect of their nutrition and performance. However, many serious club-level cyclists pick up fads and snippets of information from social media or the cycling press that lead them to try out all kinds ideas, in an unscientific manner, in the hope of achieving an improvement in performance. Some of these activities have potentially harmful effects on the body.

Competitive riders can become obsessed with losing weight and sticking to extremely tough training schedules, leading to both short-term and long-term energy deficits that are detrimental to both health and performance. One of the physiological consequences can be a reduction in bone density, which is particularly significant for cyclists, who do not benefit from gravitational stress on bones, due to the non-weight-bearing nature of the sport. In a recent paper, colleagues at Durham University and I describe an approach for identifying male cyclists at risk of Relative Energy Deficit in Sport (RED-S).

You need a certain amount of energy simply to maintain normal life processes, but an athlete can force the body into a deficit in two ways: by intentionally or unintentionally restricting energy intake below the level required to meet demand or by increasing training load without a corresponding increase in fuelling.

Our bodies have a range of ways to deal with an energy deficit. For the average, slightly overweight casual cyclist, burning some fat is not a bad thing. However, most competitive cyclists are already very lean, making the physiological consequences of an energy deficit more serious. Changes arise in the endocrine system that controls the body’s hormones. Certain processes can shut down, such as female menstruation, and males can experience a reduction in testosterone. Sex steroids are important for maintaining healthy bones. In our study of 50 male competitive cyclists, the average bone density in the lumbar spine, measured by DXA scan, was significantly below normal. Some relatively young cyclists had the bones of a 70 year old man!

The key variable associated with poor bone health was low energy availability, i.e. male cyclists exhibiting RED-S. These riders were identified using a questionnaire followed by an interview with a Sports Endocrinologist. The purpose of the interview was to go through the responses in more detail, as most people have a tendency to put a positive spin on their answers. There were two important warning signs.

Among riders with low energy availability, bone density was not so bad for those who had previously engaged in a weight-bearing sport, such as running. For cyclists with adequate energy availability, those with vey low levels of vitamin D had weaker bones. Across the 50 cyclists, most had vitamin D levels below the level of 90 nmol/L recommended for athletes, including some who were taking vitamin D supplements, but clearly not enough. Studies have shown that the advantages of athletes taking vitamin D supplements include better bone health, improved immunity and stronger muscles, so why wouldn’t you?

In terms of performance, British Cycling race category was positively related with a rider’s power to weight ratio, evaluated by 60 minute FTP per kg (FTP60/kg). Out of all the measured variables, including questionnaire responses, blood tests, bone density and body composition, the strongest association with FTP60/kg was the number of weekly training hours. There was no significant relationship between percentage body fat and FTP60/kg. So if you want to improve performance, rather than starving yourself in the hope of losing body fat, you are better off getting on your bike and training with adequate fuelling.

Cyclists using power meters have the advantage of knowing exactly how many calories they have used on every ride. In addition to taking on fuel during the ride, especially when racing, the greatest benefits accrue from having a recovery drink and some food immediately after completing rides of more than one hour.

For those wishing to know more about RED-S, the British Association of Sports and Exercise Medicine has provided a web resource.

A related blog will explore the machine learning and statistical techniques used to analyse the data for this study.

As you upload your data, you accumulate a growing history of rides. It is helpful to find ways of classifying different types of activities. Races and training sessions often include laps that are repeated during the ride. Many GPS units can automatically record laps as you pass the point where you began your ride or last pressed the lap button. However, if the laps were not recorded on the device, it is tricky to recover them. This article investigates how to detect laps automatically.

First consider the simple example of a 24 lap race around the Hillingdon cycle circuit. Plotting the GPS longitude and latitude against time displays repeating patterns. It is even possible to see the “omega curve” in the longitude trace. So it should be possible to design an algorithm that uses this periodicity to calculate the number of laps.

This is a common problem in signal processing, where the Fourier Transform offers a neat solution. This effectively compares the signal against all possible frequencies and returns values with the best fit in the form of a power spectrum. In this case, the frequencies correspond to the number of laps completed during the race. In the bar chart below, the power spectrum for latitude shows a peak around 24. The high value at 25 probably shows up because I stopped my Garmin slightly after the finish line. A “harmonic” also shows up at 49 “half laps”. Focussing on the peak value, it is possible to reconstruct the signal using a frequency of 24, with all others filtered out.

So we’re done – we can use a Fourier Transform to count the laps! Well not quite. The problem is that races and training sessions do not necessarily start and end at exactly the starting point of a lap. As a second example, consider my regular Saturday morning club run, where I ride from home to the meeting point at the centre of Richmond Park, then complete four laps before returning home. As show in the chart below, a simple Fourier Transform approach suggests that ride covered 5 laps, because, by chance, the combined time for me to ride south to the park and north back home almost exactly matches the time to complete a lap of the park. Visually it is clear that the repeating pattern only holds for four laps.

Although it seems obvious where the repeating pattern begins and ends, the challenge is to improve the algorithm to find this automatically. A brute force method would compare every GPS location with every other location on the ride, which would involve about 17 million comparisons for this ride, then you would need to exclude the points closely before or after each recording, depending on the speed of the rider. Furthermore, the distance between two GPS points involves a complex formula called the haversine rule that accounts for the curvature of the Earth.

Fortunately, two tricks can make the calculation more tractable. Firstly, the peak in the power spectrum indicates roughly how far ahead of the current time point to look for a location potentially close to the current position. Given a generous margin of, say, 15% variation in lap times, this reduces the number of comparisons by a whole order of magnitude. Secondly, since we are looking for points that are very close together, we only need to multiply the longitudes by the cosine of the latitude (because lines of longitude meet at the poles) and then a simple Euclidian sum the squares of the differences locates points within a desired proximity of, say, 10 metres. This provides a quicker way to determine the points where the rider was “lapping”. These are shaded in yellow in the upper chart and shown in red on a long/latitude plot below. The orange line on the upper chart shows, on the right hand scale, the rolling lap time, i.e. the number of seconds to return to each point on the lap, from which the average speed can be derived.

Two further refinements were required to make the algorithm more robust. One might ask whether it makes a difference using latitude or longitude. If the lap involved riding back and forth along a road that runs due East-West, the laps would show up on longitude but not latitude. This can be solved by using a 2-dimensional Fourier Transform and checking both dimensions. This, in turn, leads to the second refinement, exemplified by the final example of doing 12 ascents of the Nightingale Lane climb. The longitude plot includes the ride out to the West, 12 reps and the Easterly ride back home.

The problem here was that the variation in longitude/latitude on the climb was tiny compared with the overall ride. Once again, the repeating section is obvious to the human eye, but more difficult to unpick from its relatively low peak in the power spectrum. A final trick was required: to consider the amplitude of each frequency in decreasing order of power and look out for any higher frequency peaks that appear early on the list. This successfully identified the relevant part of the ride, while avoiding spurious observations for rides that did not include laps.

The ability for an algorithm to tag rides if they include laps is helpful for classifying different types of sessions. Automatically marking the laps would allow riders and coaches to compare laps against each other over a training session or a race. A potential AI-powered robo-coach could say “Ah, I see you did 12 repeats in your session today… and apart from laps 9 and 10, you were getting progressively slower….”

In a recent blog, I described an experiment to train a deep neural network to distinguish between photographs of Vincenzo Nibali and Alejandro Valverde, using a very small data set of images. In the conclusion, I suggested that the network was probably basing its decisions more on the colours of the riders’ kit rather than on facial recognition. This article investigates what the network was actually “looking at”, in order to understand better how it was making decisions.

The issues of accountability and bias were among the topics discussed at the last NIPS conference. As machine learning algorithms are adopted across industry, it is important for companies to be able to explain how conclusions are reached. In many instances, it is not acceptable simply to rely on an impenetrable black box. AI researchers and developers need to be able to explain what is going on inside their models, in order to justify decisions taken. In doing so, some worrying instances of bias have been revealed in the selection of data used to train the algorithms.

I went back to my rider recognition model and used an approach called “Class Activation Maps” to identify which parts of the images accounted for the network’s choice of rider. Making use of the code provided in lesson 7 of the course offered by fast.ai, I took advantage of my existing small set of training, validation and test images of the two famous cyclists. Starting with a pre-trained version of ResNet34, the idea was to replace the last two layers with four new ones, the crucial one being a convolutional layer with two outputs, matching the number of cyclists in the classification task. The two outputs of this layer were 7×7 matrix representations of the relevant image.

The final predictions of the model came from a softmax of a flattened average pooling of these 7×7 representations. The softmax output gave the probabilities of Nibali and Valverde respectively. Since there was no learning beyond the final convolution, the activations of the two 7×7 matrices represented the “Nibali-ness” and “Valverde-ness” of the image. This could be displayed as a heat map on top of the image.

Examples are shown below for the validation set of 10 images of Nibali followed by 10 of Valverde. The yellow patch of the heat map highlights the part of the image that led to the prediction displayed above each image. Nine out of ten were correct for Nibali and six for Valverde.

Class Activation Maps applied to the validation set

The heat maps were very helpful in understanding the model’s decision making process. It seemed that for Nibali, his face and helmet were important, with some attention paid to the upper part of his blue Astana kit. In contrast, the network did a very good job at identifying the M on Valverde’s Moviestar kit. It was interesting to note that the network succeeded in spotting that Nibali was wearing a Specialized helmet whereas Valverde had a Catlike design. Three errors arose in the photos of his face, which was mistaken for Nibali’s. In fact, any picture of a face led to a prediction of Nibali, as demonstrated by the cropped image below that was used for training.

Why should that be? Looking back at the training set, it turned out that, by chance, there were far more mugshots of Nibali, while there were more photos of Valverde riding his bike, with his face obscured by sunglasses. This was an example of unintentional bias in the training data, providing a very useful lesson.

The final set of pictures shows the predictions made on the out-of-sample test set. All the predictions are correct, except the first one, where the model failed to spot the green M on Valverde’s chest and mistook the blurred background for Nibali. Otherwise the results confirmed that the network looked at Nibali’s face, the rider’s helmet or Valverde’s kit. It also remembered seeing an image of Nibali holding the Giro trophy in the training set.

Class Activation Maps applied to the test set

In conclusion, Class Activation Maps provide a useful way of visualising the activations of hidden laters in a deep neural network. This can go some way to accounting for the decisions that appear in the output. The approach can also help identify unintentional bias in the training set.

My last blog explored the effectiveness of deep learning in spotting the difference between Vincenzo Nibali and Alejandro Valverde. Since the faces of the riders were obscured in many of the photos, it is likely that the neural network was basing its evaluations largely on the colours of their team kit. A natural next challenge is to identify a rider’s team from a photograph. This task parallels the approach to the kaggle dog breed competition used in lesson 2 of the fast.ai course on deep learning.

Eighteen World Tour teams are competing this year. So the first step was to trawl the Internet for images, ideally of riders in this year’s kit. As before, I used an automated downloader, but this posed a number of problems. For example, searching for “Astana” brings up photographs of the capital of Kazakhstan. So I narrowed things down by searching for “Astana 2018 cycling team”. After eliminating very small images, I ended up with a total of about 9,700 images, but these still included a certain amount of junk that I did have the time to weed out, such as photos of footballers or motorcycles in the “Sky Racing Team”,.

The following small sample of training images is generally OK, though it includes images of Scott bikes rather than Mitchelton-Scott riders and a picture of Sunweb’s Wilco Kelderman labelled as FDJ. However, with around 500-700 images of each team, I pressed on, noting that, for some reason, there were only 166 of Moviestar and these included the old style kit.

Small sample of training images

For training on this multiple classification problem, I adopted a slightly more sophisticated approach than before. Taking a pre-trained Resnet50 model, I performed some initial fine-tuning, on images rescaled to 224×224. I settled on an optimal learning rate of 1e-3 for the final layer, while allowing some training of lower layers at much lower rates. With a view to improving generalisation, I opted to augment the training set with random changes, such as small shifts in four directions, zooming in up to 10%, adjusting lighting and left-right flips. After initial training, accuracy was 52.6% on the validation set. This was encouraging, given that random guesses would have achieved a rate of 1 in 18 or 5.6%.

Taking a pro tip from fast.ai, training proceeded with the images at a higher resolution of 299×299. The idea is to prevent overfitting during the early stages, but to improve the model later on by providing more data for each image. This raised the accuracy to 58.3% on the validation set. This figure was obtained using a trick called “test time augmentation”, where each final prediction is based on the average prediction of five different “augmented” versions of the image in question.

Given the noisy nature of some of the images used for training, I was pleased with this result, but the acid test was to evaluate performance on unseen images. So I created a test set of two images of a lead rider from each squad and asked the model to identify the team. These are the results.

75% accuracy on the test set

The trained Resnet50 correctly identified the teams of 27 out of 36 images. Interestingly, there were no predictions of MovieStar or Sky. This could be partly due to the underrepresentation of MovieStar in the training set. Froome was mistaken for AG2R and Astana, in column 7, rows 2 and 3. In the first image, his 2018 Sky kit was quite similar to Bardet’s to the left and in the second image the sky did appear to be Astana blue! It is not entirely obvious why Nibali was mistaken for Sunweb and Astana, in the top and bottom rows. However, the huge majority of predictions were correct. An overall success rate of 75% based on an afternoon’s work was pretty amazing.

The results could certainly be improved by cleaning up the training data, but this raises an intriguing question about the efficacy of artificial intelligence. Taking a step back, I used Bing’s algorithms to find images of cycling teams in order to train an algorithm to identify cycling teams. In effect, I was training my network to reverse-engineer Bing’s search algorithm, rather than my actual objective of identifying cycling teams. If an Internet search for FDJ pulls up an image of Wilco Kelderman, my network would be inclined to suggest that he rides for the French team.

In conclusion, for this particular approach to reach or exceed human performance, expert human input is required to provide a reliable training set. This is why this experiment achieved 75%, whereas the top submissions on the dog breeds leaderboard show near perfect performance.

Chris Froome has been logging data on Strava since the beginning of the year. He had already completed over 4,500km, around Johannesburg, in the first four weeks of January. The weather has been hot and he has been based at an altitude of around 1350m. Some have speculated that he has been replicating the conditions of a grand tour, so that measurements can be made that may assist in his defence against the adverse analytical finding made at last year’s Vuelta.

Whatever the reasons, Froome chose to “Empty the tank” with epic ride on 28 January, completing 271km in just over six hours at an average of 44.8kph. The activity was flagged on Strava, presumably because he completed it suspiciously fast. For example, he rode the 20km Back Straight segment at 50.9kph, finishing in 24:24, nearly four minutes faster than holder of the the KOM: a certain Chris Froome. Since there was no significant wind blowing, one can only assume he was being motor-paced.

One interesting thing about rides displayed publicly on Strava is that anyone can download a GPX file of the route, which shows the latitude, longitude and altitude of the rider, typically at one second intervals. Although Froome is one of the professional riders who prefer to keep their power data private, this blog explores the possibility of estimating power from the GPX file. The plan is similar to the way Strava estimates power.

Knowledge is power

An interesting case study is Froome’s TT Bike Squeeeeze from 6 January, which included a sustained 2 hour TT effort. Deriving speed and gradient from the GPX file is straightforward, though it is helpful to include smoothing (say, a five second average) to iron out noise in the recording. It is simple to check the average speed and charts against those displayed on Strava.

Several factors affect air density. Firstly, we can obtain the local weather conditions from sources, such as Weather Underground. Froome set off at 6:36am, when it was still relatively cool, but he Garmin shows that it warmed up from 18 degrees to 40 degrees during the ride. Taking the average of 29 for the whole ride simplifies matters. Air pressure remained constant at around 1018hPa, but this is always quoted for sea level, so the figure needs to be adjusted for altitude. Froome’s GPS recorded an altitude range from 1242m to 1581m. However we can see that his starting altitude was recorded as 1305m, when the actual altitude of this location was 1380m. We conclude that his average altitude for the ride, recorded at 1436m, needs to be corrected by 75m to 1511m and opt to use this as an elevation adjustment for the whole ride. This is important because the air is sufficiently less dense at this altitude to have a noticeable impact on aerodynamic drag.

An estimate of power requires some additional assumptions. Froome uses his road bike, TT bike and mountain bike for training, sometimes all in the same ride, and we suspect some rides are motor-paced. However, he indicates that the 6 January ride was on the TT bike. So a CdA of 0.22 for drag and a Crr of 0.005 for rolling resistance seem reasonable. Froome weighs about 70kg and fair assumptions were taken for the spec of his bike. Finally, the wind was very light, so it was ignored in the calculations.

Under these assumptions, Froome’s estimated average power was 205W. The red shaded area marks a 2 hour effort completed at 43.7kph, with a higher average power of 271W. His maximal average power sustained over one hour was 321W or 4.58W/kg. There is nothing adverse about these figures; they seem to be eminently within the expected capabilities of the multiple grand tour winner.

Of course, quite a few assumptions went into these calculations, so it is worth identifying the most important ones. The variation of temperature had a small effect: the whole ride at 18 degrees would have required an average of 209W or, at 40 degrees, 201W. Taking account of altitude was important: the same ride at sea level would have required 230W, but the variations in altitude during the ride were not significant. At the speeds Froome was riding, aerodynamics were important: a CdA of 0.25 would have needed 221W, whereas a super-aero CdA of 0.20 rider could have done 195W. This sensitivity analysis suggests that the approach is robust.

Running the same analysis over the “Empty the tank” ride gives an average power requirement of 373W for six hours, which is obviously suspect. However, if he was benefiting from a 50% reduction in drag by following a motor vehicle, his estimated average power for the ride would have been 244W – still pretty high, but believable.

Posting rides on Strava provides an independently verifiable adjunct to a biological passport.

In the previous blog, I explored the structure of a data set of summary statistics from over 800 rides recorded on my Garmin device. The K-means algorithm was an example of unsupervised learning that identified clusters of similar observations without using any identifying labels. The Orange software, used previously, makes it extremely easy to compare a number of simple models that map a ride’s statistics to its type: race, turbo trainer or just a training ride. Here we consider Decision Trees, Random Forests and Support Vector Machines.

Decision Trees

Perhaps the most basic approach is to build a Decision Tree. The algorithm finds an efficient way to make a series of binary splits of the data set, in order to arrive at a set of criteria that separates the classes, as illustrated below.

Decision Tree

The first split separates the majority of training rides from races and turbo trainer sessions, based on an average speed of 35.8km/h. Then Average Power Variance helped identify races, as observed in the previous blog. After this, turbo trainer sessions seemed to have a high level of TISS Aerobicity, which relates to the percentage of effort done aerobically. Pedalling balance, fastest 500m and duration separated the remaining rides. An attractive way to display these decisions is to create a Pythagorean Tree, where the sides of each triangle relate to the number of observations split by each decision.

Pythagorean Tree

Random Forests

Many alternative sets of decisions could separate the data, where any particular tree can be quite sensitive to specific observations. A Random Forest addresses this issue by creating a collection of different decision trees and choosing the class by majority vote. This is the Pythagorean Forest representation of 16 trees, each with six branches.

Pythagorean Forest

Support Vector Machines

A Support Vector Machine (SVM) is a widely used model for solving this kind of categorisation problem. The training algorithm finds an efficient way to slice the data, that largely separates the categories, while allowing for some overlap. The points that are closest to the slices are called support vectors. It is tricky to display the results in such a high dimensional space, but the following scatter plot displays Average Power Variance versus Average Speed, where the support vectors are shown as filled circles.

Support Vectors shown as filled circles

Comparison of results

A Confusion Matrix provides a convenient way to compare the accuracy of the models. This correlates the predictions versus the actual category labels. Out of the 809 rides, only 684 were labelled. The Decision Tree incorrectly labelled 20 races and 7 turbos as training rides. The Random Forest did the best job, with only six misclassifications, while the SVM made 11 errors.

Decision Tree

Random Forest

SVM

Looking at the classification errors can be very informative. It turns out that the two training rides classified as races by the SVM had been accidentally mislabelled – they were in fact races! Furthermore, looking at the five races the that SVM classified as training rides, I punctured in one, I crashed in another and in a third race, I was dropped from the lead group, but eventually rolled in a long way behind with a grupetto. The Random Forest also found an alpine race where my Garmin battery failed and classified it as a training ride. So the misclassifications were largely understandable.

After correcting the data set for mislabelled rides, the Random Forest improved to just two errors and the SVM dropped to just eight errors. The Decision Tree deteriorated to 37 errors, though it did recognise that the climbing rate tends to be zero on a turbo training session.

Prediction

Having trained three models, we can take a look at the sample of 125 unlabelled rides. The following chart shows the predictions of the Random Forest model. It correctly identified one race and suggested several turbo trainer sessions. The SVM also found another race.

Random Forest predictions of unlabelled rides

Conclusions

Several lessons can be learned from these experiments. Firstly, it is very helpful to start with a clean data set. But if this is not the case, looking at the misclassified results of a decent model can be useful in catching mislabelled data. The SVM seemed to be good for this task, as it had more flexibility to fit the data than the Decision Tree, but it was less prone to overfit the data than the Random Forest.

The Decision Tree was helpful in quickly identifying average speed and power variance (chart below) as the two key variables. The SVM and Random Forest were both pretty good, but less transparent. One might improve on the results by combining these two models.

Distribution of APV (large peak at zero is where no power was recorded for ride)