Cap 10k Race Data

What's the rush?

The Capitol 10,000 is a total friggin’ blast. Anyone with a remote interest in running should participate. Rarely do you get to run up the middle of the Congress bridge straight to the capitol through an unstoppable sea of humanity. On your way to the finish line you’ll pass spectators offering beer, bacon, and donuts.

Before you can catch your breath, they make the results available online. I’ve been looking for an opportunity to jump back into Python, so naturally I scraped the HTML and ran it through a script so people can play with the data.

Here’s how it went down.

Results are posted to mychiptime.com. Some client-side JavaScript on their page queries the backend with your specified parameters. With a little help from Chrome’s developer console, I found their URL scheme, so you can use wget to download the data directly. The response is HTML that gets inserted straight into the page with a $("#blah").html(response).

wget "http://www.mychiptime.com/searchResultGen.php?eID=3526&show=all"
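If you’d rather not shell out to wget, the same request can be sketched in Python with nothing but the standard library. This is a rough sketch, not a tested scraper: the only event ID known here is the 3526 from the URL above, and results_url and download_results are hypothetical helper names made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the wget command above.
BASE_URL = "http://www.mychiptime.com/searchResultGen.php"


def results_url(event_id, show="all"):
    """Build the query URL for one event (hypothetical helper)."""
    return BASE_URL + "?" + urlencode({"eID": event_id, "show": show})


def download_results(event_id, path):
    """Fetch the raw HTML for one event and save it to `path`."""
    with urlopen(results_url(event_id)) as response, open(path, "wb") as out:
        out.write(response.read())
```

Calling download_results(3526, "results.html") should leave you with the same HTML wget would. The conversion script expects files named by year (2012.html and so on), so rename accordingly.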

The HTML returned by their API is truly hideous. A mere 20k rows cost 16MB… because they’re chock full of <b>, <font>, and “OnMouseOver”. After pulling out the relevant data, the file size is about 800KB.

Here’s the Python script I used to generate tab-delimited .csv files from the HTML input. BeautifulSoup can take a few minutes to parse the larger files.

from bs4 import BeautifulSoup

# Produce tab-delimited CSV files from large HTML files.
for year in range(2008, 2013):
    print('processing ' + str(year))

    with open(str(year) + '.html', 'r') as f:
        html = f.read()

    # 'html.parser' ships with Python; swap in 'lxml' if you have it installed.
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')

    # The first row holds the column titles.
    rows = table.find_all('tr')

    # Write the data to a CSV file.
    with open(str(year) + '.csv', 'w') as out:
        this_row = ''
        for i, row in enumerate(rows):
            for col in row.find_all('td'):
                b = col.find('b')
                # Cells without a plain-text <b> come out as the string "None".
                text = str(b.string) if b is not None else 'None'
                this_row += text + '\t'

            # Flush on even-numbered rows: the header (row 0) gets its own
            # line, and each later pair of <tr> rows becomes one line.
            if i % 2 == 0:
                out.write(this_row + '\n')
                this_row = ''

Playing with the data

Building a simple scatterplot was way harder than it should have been. Google Docs’ spreadsheet crashed when I tried to build a chart. The Python library matplotlib choked on the rows where Age is “None” (that’s what the script writes when a cell has no text). Wolfram Alpha rejected the file, probably for the same reason. LibreOffice’s spreadsheet just made my laptop really hot. Octave is inscrutable. But Plot for OS X did the job.
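The “None” problem is easy enough to sidestep by dropping those rows before plotting. Here’s a minimal sketch using only the standard library. It assumes the file’s first line is the header row and that the column is literally named Age; load_plottable_rows is a made-up name, and the sample rows are placeholders, not real results.

```python
import csv
import io


def load_plottable_rows(tsv_file, numeric_columns=("Age",)):
    """Yield rows whose numeric_columns all parse as integers."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        try:
            for name in numeric_columns:
                row[name] = int(row[name])
        except (KeyError, TypeError, ValueError):
            # Missing column, short row, or a value like "None": skip it.
            continue
        yield row


# Placeholder data shaped like the script's output (note the trailing tab).
sample = io.StringIO(
    "Name\tAge\tChip Time\t\n"
    "Runner A\t34\t45:12\t\n"
    "Runner B\tNone\t52:03\t\n"
)
clean = list(load_plottable_rows(sample))
```

With a real file you’d pass open("2012.csv", newline="") instead of the sample, then hand the surviving ages to whatever plotting tool you like.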

2012 Cap 10k

I’d like to see someone with MATLAB skills come up with some more advanced plots or divine some insight from the data. The columns available for 2012 are:

  • Name
  • Division Place
  • Gun Time
  • Chip Time
  • Overall Place
  • Age
  • Zip
  • Gen Place
  • Total Pace
  • Total Div
  • Total Gend
  • Tot AG

Here’s a zip archive of all the tab-delimited CSV files for 2008-2012.
