Ever since I took part in the 2014 edition of MalmöMilen (a 10 km race in Malmö, Sweden), I have wanted to step back and have a look at the results. They are accessible from this page, which is easy to scrape and analyze.

The results are split across 12 web pages for men and 9 for women, so the site needs to be scraped several times.
Let's define a function for that.

In [2]:

raceNumber = {"M": 8364, "W": 8541}

def scrap_page(gender, runners, p):
    page = requests.get("http://www.racetimer.se/en/race/resultlist/1997?checkpoint=9999&layout=marathon&page=" + str(p) + "&rc_id=" + str(raceNumber[gender]))
    soup = BeautifulSoup(page.text)
    if (soup == None):
        return
    table = soup.body.table.find(id='top3-list')
    lines = table.find_all('tr')
    if len(lines) == 0:
        return
    for line in lines:
        c = line.find_all('td')
        if len(c) == 0:
            continue
        r = {}
        r["rank"] = int(c[0].text)
        r["fullname"] = c[1].text.replace("»", "").strip()
        split = r["fullname"].rfind(" ")
        r["firstname"] = r["fullname"][:split].strip()
        r["lastname"] = r["fullname"][split+1:].strip()
        r["city"] = c[3].text.strip()
        # get a proper value for the year of birth
        try:
            r["yob"] = int(c[2].text.strip()) if c[2].text.strip() != "" else -1
        except ValueError:
            r["yob"] = -1
        r["nb"] = int(c[4].text.strip())
        r["time"] = c[5].text.strip()
        # convert the time to seconds
        parts = r["time"].split(':')
        hours, minutes, seconds = 0, 0, 0
        if len(parts) == 2:
            minutes, seconds = parts
        else:
            hours, minutes, seconds = parts
        r["timeInSeconds"] = int(hours)*3600 + int(minutes)*60 + int(seconds)
        runners[gender].append(r)
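Since the results span 12 pages for men and 9 for women, the function has to be called once per page. The loop below is only a sketch of how that could look: `scrape_all`, `PAGE_COUNTS`, `first_page` and `fake_scrape` are hypothetical names that do not appear in the notebook, and starting at page 1 is an assumption about how the site numbers its pages. The stand-in scraper keeps the example offline.

```python
# Page counts taken from the text above: 12 result pages for men, 9 for women.
PAGE_COUNTS = {"M": 12, "W": 9}


def scrape_all(scrape_fn, page_counts=PAGE_COUNTS, first_page=1):
    """Call scrape_fn(gender, runners, page) for every page of every gender.

    Hypothetical driver; first_page=1 assumes pages are numbered from 1.
    """
    runners = {gender: [] for gender in page_counts}
    for gender, n_pages in page_counts.items():
        for p in range(first_page, first_page + n_pages):
            scrape_fn(gender, runners, p)
    return runners


def fake_scrape(gender, runners, p):
    """Offline stand-in with the same signature as scrap_page."""
    runners[gender].append({"page": p})


runners = scrape_all(fake_scrape)
print(len(runners["M"]), len(runners["W"]))
```

With the real scraper, the call would presumably be `scrape_all(scrap_page)`, which issues one HTTP request per page.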

One interesting comparison is how the runners expected to perform versus how they actually performed.

Let's redraw the previous start group distribution, which corresponds to the runners' own predictions. The following graph displays the time the participants thought it would take them to complete the race.

In comparison, we display the same graph, but this time with the final values. Note that there might be a small gap between the number of registered runners and the number of people who actually finished the race.

Even so, the graph gives a pretty good idea of how the race went, and some differences are quite noticeable.

For example, the "elite runners" seem to have been quite modest in their predictions, while a few hundred more runners than expected took more than 60 minutes.

In [8]:

all_runners = runners["M"] + runners["W"]
intervals = [(0, 40), (41, 42), (43, 45), (46, 50), (51, 55), (56, 60), (61, 200)]
# lookup table: finish time in whole minutes -> index of the time group
ids = [0] * 1000
for i, inter in enumerate(intervals):
    ids[inter[0]:inter[1]] = [i] * (inter[1] - inter[0] + 1)
c = Counter([ids[int(math.ceil(r["timeInSeconds"] / 60))] for r in all_runners])
# read the counts in interval order instead of relying on the Counter's iteration order
results = [c[i] for i in range(len(intervals))]
# remove barnloppet for this analysis
final_groups = sorted_groups[:-1]
final_y_pos = np.arange(len(final_groups))[::-1]
rects = plt.barh(final_y_pos, results, align='center', alpha=0.6)
plt.yticks(final_y_pos, final_groups)
plt.xlabel('Number of participants')
plt.title('Final time distribution')
for i, rect in enumerate(rects):
    plt.text(0.95 * rect.get_width(), rect.get_y() + rect.get_height() / 2.0, results[i], ha='right', va='center')
plt.show()
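The cell above maps each finish time, rounded up to whole minutes, to one of seven time groups through a lookup table. As an alternative sketch with the same group boundaries, the standard `bisect` module can find the group directly from each group's upper bound. `UPPER_BOUNDS`, `group_index` and the sample times are made up for illustration and are not part of the notebook.

```python
from bisect import bisect_left
from collections import Counter

# Upper bound (in minutes, inclusive) of every group except the last,
# which collects everything slower than 60 minutes.
UPPER_BOUNDS = [40, 42, 45, 50, 55, 60]


def group_index(time_in_seconds):
    """Return the index of the time group for a finish time in seconds."""
    minutes = (time_in_seconds + 59) // 60  # round up to whole minutes
    return bisect_left(UPPER_BOUNDS, minutes)


# Made-up finish times, in seconds.
sample_times = [2399, 2400, 2401, 2537, 3600, 3601]
counts = Counter(group_index(t) for t in sample_times)

# One count per group, in group order, including empty groups.
results = [counts[i] for i in range(len(UPPER_BOUNDS) + 1)]
print(results)
```

Reading the counts by index keeps empty groups as zeros, so the list always lines up with the group labels.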

The data is not perfectly clean: some years of birth are missing or implausible. Let's remove those runners.

In [9]:

# remove runners without a proper year of birth
weird_yob = [r for r in all_runners if r["yob"] == -1 or r["yob"] >= 2014 or r["yob"] < 1914]
all_runners_yob = [r for r in all_runners if r not in weird_yob]
print(str(len(weird_yob)) + " people have a weird year of birth ... Examples: " + str([r["yob"] for r in weird_yob[:15]]))
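The same filter can be written as a single pass with a small predicate, which avoids comparing every runner against the list of rejected ones. This is only a sketch: `has_valid_yob`, `split_by_yob` and the sample records are hypothetical, and the bounds simply restate the ones used above (1914 to 2013 inclusive).

```python
def has_valid_yob(runner, first_year=1914, last_year=2013):
    """True when the year of birth is plausible for a 2014 race."""
    return first_year <= runner["yob"] <= last_year


def split_by_yob(runners):
    """Return (valid, weird) lists in one pass over the runners."""
    valid, weird = [], []
    for r in runners:
        (valid if has_valid_yob(r) else weird).append(r)
    return valid, weird


# Made-up records; -1 is the marker used above for a missing year.
sample = [{"yob": 1980}, {"yob": -1}, {"yob": 2014}, {"yob": 1914}, {"yob": 1900}]
valid, weird = split_by_yob(sample)
print(len(valid), len(weird))
```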

group_by_yob = itertools.groupby(sorted(all_runners_yob, key=lambda x: x["yob"]), lambda x: x["yob"])
avg_by_yob = [(k, stats.mean([r["timeInSeconds"] for r in g])) for k, g in group_by_yob]

def secondsToHumanTime(x, pos):
    return str(datetime.timedelta(seconds=x))

avg_by_yob = sorted(avg_by_yob, key=lambda x: x[0])
yob_x = [x[0] for x in avg_by_yob]
yob_y = [x[1] for x in avg_by_yob]
plt.xlabel('Year of birth')
plt.ylabel('Average time per year of birth')
plt.title('Variation of the time given the year of birth')
plt.plot(yob_x, yob_y, 'bo')
plt.gca().yaxis.set_major_formatter(FuncFormatter(secondsToHumanTime))
plt.show()
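To make the grouping step easier to follow, here is the same idea on three made-up runners. `itertools.groupby` only groups consecutive items, which is why the list is sorted by year of birth first. The sketch uses the standard `statistics` module for the mean, which is an assumption about what `stats` refers to in the notebook.

```python
import datetime
import itertools
import statistics

# Made-up runners, deliberately out of order.
sample = [
    {"yob": 1980, "timeInSeconds": 3000},
    {"yob": 1975, "timeInSeconds": 3300},
    {"yob": 1980, "timeInSeconds": 3200},
]


def by_yob(runner):
    return runner["yob"]


# Sort first: groupby would otherwise yield 1980 twice.
avg_by_yob = [
    (yob, statistics.mean([r["timeInSeconds"] for r in group]))
    for yob, group in itertools.groupby(sorted(sample, key=by_yob), key=by_yob)
]

for yob, avg in avg_by_yob:
    # Same formatting as the y axis of the plot: seconds -> H:MM:SS.
    print(yob, str(datetime.timedelta(seconds=avg)))
```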

Each point plotted above is the average finish time of the runners born in a given year. The age distribution itself may be uneven, so let's have a look at it.

In [11]:

group_by_yob = itertools.groupby(sorted(all_runners_yob, key=lambda x: x["yob"]), lambda x: x["yob"])
sum_by_yob = [(k, len(list(g))) for k, g in group_by_yob]
sum_x = [x[0] for x in sum_by_yob]
sum_y = [x[1] for x in sum_by_yob]
plt.xlabel('Year of birth')
plt.ylabel('Number of runners')
plt.title('Number of runners per year of birth')
plt.plot(sum_x, sum_y, 'bo', linewidth=2)
plt.show()

total_number = len(all_runners)
total_sson = sum([1 if r["lastname"][-4:] == "sson" else 0 for r in all_runners])
print(str(total_sson) + " runners have a lastname finishing with \"sson\". Welcome to Sweden.")
labels = 'Lastnames finishing in "sson"', 'Others'
sizes = [total_sson, total_number - total_sson]
colors = ['gold', 'lightskyblue']
explode = (0.1, 0)
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')
plt.show()

1438 runners have a lastname finishing with "sson". Welcome to Sweden.
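The suffix count can also be expressed with `str.endswith`, which reads a little more directly than slicing the last four characters. A minimal sketch on made-up names; `share_with_suffix` is a hypothetical helper, not part of the notebook.

```python
def share_with_suffix(lastnames, suffix="sson"):
    """Return (count, percentage) of last names ending with the suffix."""
    matches = sum(1 for name in lastnames if name.endswith(suffix))
    return matches, 100.0 * matches / len(lastnames)


# Made-up last names for illustration.
sample_names = ["Andersson", "Nilsson", "Lindgren", "Berg"]
count, percent = share_with_suffix(sample_names)
print(count, percent)
```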
