Cap 10k Race Data
The Capitol 10,000 is a total friggin’ blast. Anyone with a remote interest in running should participate. Rarely do you get to run up the middle of the Congress bridge straight to the capitol through an unstoppable sea of humanity. On your way to the finish line you’ll pass spectators offering beer, bacon, and donuts.
Before you can catch your breath, they make the results available online. I’ve been looking for an opportunity to jump back into Python, so naturally I scraped the HTML and ran it through a script so people can play with the data.
Here’s how it went down.
The HTML returned by their API is truly hideous. A mere 20k rows cost 16MB… because they’re chock full of <b>, <font>, and “OnMouseOver”. After pulling out the relevant data, the file size is about 800KB.
Here’s the Python script used to generate tab-delimited .csv files from the HTML input. BeautifulSoup can take a few minutes to parse the larger files.
from bs4 import BeautifulSoup

# Produce tab-delimited CSV files from large HTML files.
for year in range(2008, 2013):
    print('processing ' + str(year))
    infile = str(year) + '.html'
    f = open(infile, 'r')
    html = f.read()
    # Name the parser explicitly; html.parser ships with Python.
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    # The first row holds the column titles.
    rows = table.find_all('tr')
    i = 0
    # Write the data to a CSV file.
    outfile = str(year) + '.csv'
    out = open(outfile, 'w')
    this_row = ''
    for row in rows:
        cols = row.find_all('td')
        for col in cols:
            # The cell text sits inside a <b> tag.
            b = col.find('b')
            text = str(b.string)
            this_row += text
            this_row += '\t'
        # Flush on every even-numbered row, so the title row goes out
        # alone and each later pair of <tr> rows becomes one line.
        if i % 2 == 0:
            this_row += '\n'
            out.write(this_row)
            this_row = ''
        i += 1
    out.close()
    f.close()
Playing with the data
Building a simple scatterplot was way harder than it should have been. Google Docs’ spreadsheet crashed when I tried to build a chart. The Python library matplotlib choked on the rows where Age is “None”. Wolfram Alpha rejected the file, probably for the same reason. LibreOffice’s spreadsheet just made my laptop really hot. Octave is inscrutable. But Plot for OS X did the job.
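One way around the “None” problem is to drop those rows before handing the data to a plotting tool. Here’s a minimal sketch using only the standard library. The function name load_plottable_rows, the column names, and the sample values are made up for illustration; the real files may use different headers and time formats.

```python
import csv
import io

MISSING = ("", "None")

def load_plottable_rows(lines, x_col, y_col):
    """Return (x, y) string pairs, skipping rows where either value is missing."""
    reader = csv.DictReader(lines, delimiter="\t")
    points = []
    for row in reader:
        # row.get() returns None for short rows; treat that as missing too.
        x = (row.get(x_col) or "").strip()
        y = (row.get(y_col) or "").strip()
        if x in MISSING or y in MISSING:
            continue
        points.append((x, y))
    return points

# Hypothetical sample in the same tab-delimited shape as the generated files.
sample = io.StringIO(
    "Age\tChip Time\n"
    "34\t48:12\n"
    "None\t52:40\n"
    "27\t41:05\n"
)
points = load_plottable_rows(sample, "Age", "Chip Time")
print(points)
```

To use it on a real file, pass an open file handle instead of the sample, for example open('2012.csv'). The values come back as strings, so they still need converting to numbers before plotting.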
I’d like to see someone with MATLAB skills come up with some more advanced plots or divine some insight from the data. The columns available for 2012 are:
- Division Place
- Gun Time
- Chip Time
- Overall Place
- Gen Place
- Total Pace
- Total Div
- Total Gend
- Tot AG
Here’s a zip archive of all the tab-delimited CSV files for 2008-2012.