Tuesday, March 3, 2015

Using Kimono to extract Cricket Data

Data makes visualization interesting; this goes without saying. But as we progress and improve our skills in data visualization and analytics, we come across many different data sets that look interesting to us. We develop ideas about how we could use a particular data set or what analysis we could run on it. So we start the project enthusiastically, and many times we dream or fantasize about it.

Suddenly you are awakened from this dream when you discover that the data you thought was so amazing is not so easily downloadable. I run into this situation many times. A lot of the time the data is available in csv, xml, html, json or txt formats.

I was awakened from a similar dream when I came across two very interesting infographics related to baseball. They were produced by the New York Times and can be found here and here. So I wondered whether similar infographics had been made for cricket. I was really surprised to find that, even though in cricket we talk about strike rates and batting averages, not much had been written to analyze the game or its players. I started digging for data and discovered this.

I was like WOW, there is so much data, going back as far as 1971. Another WOW moment came right after the first one: a website as big as ESPN has no API for extracting this data from the web. If I wanted all the data, I would basically have to set my query to display 200 entries per page, copy the data to a csv, go to the next page, and repeat that 58 times. I ain't got time for that.

So I started searching Google for an alternative and found one that was slightly better: right-click and choose "View page source". The data sits inside <tr> and <td> HTML tags. I am weak when it comes to HTML, but I thought I could save the HTML of the 58 pages as text files and use R to extract the data, or something like that.
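For the curious, here is a rough idea of what pulling rows out of <tr> and <td> tags can look like. This is only a minimal sketch in Python (not the R route I had in mind), using the standard library's html.parser. The class name TableRows and the sample table are made up for illustration; the real page source is much messier.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample standing in for one saved page of results.
sample = (
    "<table>"
    "<tr><td>Player A</td><td>45</td></tr>"
    "<tr><td>Player B</td><td>12</td></tr>"
    "</table>"
)

parser = TableRows()
parser.feed(sample)
parser.close()
rows = parser.rows
```

Each entry in rows is one table row as a list of strings, which is easy to write out with the csv module. You would still have to repeat this for all 58 saved pages.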

Finally God heard my prayers and I came across this amazing post. I followed the post to the original blog and discovered KIMONO, an amazing tool that does the trick. You basically create a link for the first page. When you go to the next page you will see that the same link has a new element, page=2. We will exploit this URL parameter to our advantage and build the links in Excel, using the CONCATENATE function together with a column of page numbers. At the end you should have 58 links in which the only thing changing is the page number. Feed these links to Kimono and start crawling. You can download the entire data set in just 10 to 15 minutes.
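If you would rather not do the link building in Excel, the same list of 58 links can be generated with a few lines of Python. This is just a sketch of the idea: the example.com address and the size parameter are placeholders I made up, not the real query link, and page_urls is a hypothetical helper name. The only thing it relies on is that the page number lives in a page=N query parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(first_page_url, last_page, param="page"):
    """Return one URL per page, changing only the page-number parameter."""
    parts = urlsplit(first_page_url)
    # Keep every query parameter except the page number itself.
    fixed = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    urls = []
    for number in range(1, last_page + 1):
        query = urlencode(fixed + [(param, str(number))])
        urls.append(urlunsplit(parts._replace(query=query)))
    return urls


# Placeholder link; replace it with the link of your own first results page.
links = page_urls("http://example.com/stats?size=200&page=1", 58)

for link in links[:3]:
    print(link)
```

The links list can then be pasted into Kimono the same way as the Excel-built links.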

In the next post I will try to explain how Kimono works. If you are too curious to wait, there are a few good videos on YouTube.

