Saturday, March 21, 2015

Analyzing Cricket Data using Shiny

I am not sure how many people are aware that the Cricket World Cup is currently going on in Australia and New Zealand. The New York Times website has an article on each country's probability of winning in its pool: Cricket World Cup 2015.

For readers interested in learning about the game, here is the link: Cricket

I recently created a Shiny app in R to visualize batting statistics. In my previous posts I discussed how one can download data from the web using Kimono. Scraping the data from the ESPN website is a painful process, but it is worth the effort.

My main aim in downloading the data was to visualize it using Shiny. This is my first Shiny app, and I have hosted it using the Shiny cloud service. I am in the process of documenting the steps and code needed to replicate the application, and I will also make the data available for download. Here is the link to the app: cricket

I love the New York Times visualizations and have tried to replicate some of their baseball visualizations using batting statistics from cricket.
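For readers who want a feel for the code, here is a minimal sketch of a Shiny app along these lines. It is not the hosted app itself: the file batting_stats.csv and its columns Player and Runs are assumptions standing in for the scraped data.

# A minimal sketch, not the hosted app: it assumes the scraped batting
# statistics live in a local file "batting_stats.csv" with hypothetical
# columns Player and Runs, one row per player.
library(shiny)

batting <- read.csv("batting_stats.csv", stringsAsFactors = FALSE)

ui <- fluidPage(
  titlePanel("Cricket Batting Statistics"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("n", "Number of top run scorers:",
                  min = 5, max = 50, value = 10)
    ),
    mainPanel(plotOutput("runsPlot"))
  )
)

server <- function(input, output) {
  output$runsPlot <- renderPlot({
    # Order players by career runs and keep the top n
    top <- head(batting[order(-batting$Runs), ], input$n)
    # Horizontal bar chart, highest scorer at the top
    barplot(rev(top$Runs), names.arg = rev(top$Player),
            horiz = TRUE, las = 1, xlab = "Career runs")
  })
}

shinyApp(ui = ui, server = server)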

Wednesday, March 4, 2015

Using Kimono to Extract the Data

As per yesterday's discussion, we would like to extract some data from the web using Kimono. I use this approach in cases where a structured API is not available. A lot of websites allow users to generate and scrape data through API services.

A lot of these API services are free, or at least allow a certain number of free API calls. Many news agencies (New York Times), social networking sites (Twitter and Facebook), and web apps (Tinder and Uber) have their own API services. But, as pointed out in the previous post, I was unable to find one for http://stats.espncricinfo.com/ci/engine/stats/index.html.

Hence, in this post we will learn to use Kimono Labs to extract the data. As the first step, we need to sign up and log in to the website.

Kimono Labs: https://www.kimonolabs.com/

The website will ask you to add the Kimono icon to your web browser. This is necessary to generate the API.

The next step is to open the webpage you would like to extract the data from; in our case it is this.

Now we fire up Kimono by clicking on the icon. You should see a Kimono menu appear right underneath your browser window, similar to the image shown below:

[Screenshot: the Kimono menu at the bottom of the browser window]

Now give a name to your first selection (as shown in the image below). We will extract only the names of the players and the runs scored by them. Note that the first page of the website displays a list of 50 players. Your cursor should now be a hand rather than an arrow; use it to click on the names in the HTML table, and the selection will turn yellow. To confirm your selection, use the check mark. You might also see some unnecessary items being selected, but we can remove them from the selection by crossing them out. We should see exactly 50 entries on the Kimono menu, which is exactly what we want.

[Screenshot: naming the selection and clicking the player names in the table]

Repeat the same procedure to select the Runs column from the webpage.

[Screenshot: selecting the Runs column]

We are almost done. In the top right corner there is a "DONE" button. When you click it, the webpage will change and Kimono will ask you to fill in the details for your API, as shown in the image below. Once you have named your API, click on "Create API".

[Screenshot: the form for naming and creating the API]

Kimono will start building the API for you. Once it finishes, it will generate a link that is used to access the data. When we click on the link, we are redirected to the Kimono website, which will look like the image below:

[Screenshot: the Kimono results page for the new API]

We can download the data in either CSV or JSON format; I personally prefer CSV. I would highly recommend that readers explore the different tabs and learn some of the other tricks. The Kimono blog also provides ideas on how its services can be used to build some cool apps.
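If you would rather pull the data straight into R than download the file by hand, the CSV endpoint can be read directly. The URL below is a placeholder, not a working endpoint: Kimono generates the real link, which embeds your API id and your personal API key.

# Sketch only: read the Kimono CSV endpoint into R.
# The URL is hypothetical; substitute the link Kimono generated
# for your own API.
url <- "https://www.kimonolabs.com/api/csv/YOUR_API_ID?apikey=YOUR_API_KEY"
batting <- read.csv(url, stringsAsFactors = FALSE)
head(batting)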

Note that we have only scraped page 1 of the ESPN website. What about the rest of the pages? Things are very simple from here. Click on "Crawl Setup" as shown in the image above. Now use the dropdown titled "Crawl Strategy" and change it to "Manual URL List".

You will see a link on the right side, along with space to paste other links. I would simply take this link, paste it into Excel, and use Excel's CONCATENATE function to generate the additional links. Say we would like to extract data from 40 pages: we simply change the page number in the links from page=2 to page=3, and so on up to page=40. Copy and paste the links from Excel into Kimono's manual crawl setup and click "Start Crawl". Kimono will now crawl every page and download the names and runs columns from the ESPN website.
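If you prefer R over Excel, the same list of links can be built in a couple of lines. The base URL here is only illustrative; copy the real link from Kimono and replace the page number with the %d placeholder.

# Sketch: build the paginated URL list in R instead of Excel.
# The query string is illustrative and may differ from your link.
base <- "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;page=%d;template=results;type=batting"
urls <- sprintf(base, 1:40)   # one URL per page, page=1 through page=40
# Write one URL per line, ready to paste into the manual crawl setup box.
writeLines(urls, "crawl_urls.txt")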

Tuesday, March 3, 2015

Using Kimono to extract Cricket Data

Data makes visualization interesting; this goes without saying. But as we progress and improve our skills in data visualization and analytics, we come across many different datasets that look interesting to us. We develop ideas about how we could use a particular dataset or what analysis we could conduct on it. So we start the project enthusiastically, and many times we dream or fantasize about it.

Then you are suddenly awakened from this dream when you discover that the data you thought was so amazing is not so easily downloadable. I have come across this situation many times. A lot of the time the data is available in CSV, XML, HTML, JSON, or TXT formats.

I was awakened from a similar dream when I came across two very interesting infographics related to baseball. These were produced by the New York Times and can be found here and here. So I wondered whether similar infographics had been created for cricket. I was really surprised to find that even though in cricket we talk about strike rates and batting averages, not much had been written analyzing the game or its players. I started digging for data and discovered this.

I was like, WOW, there is so much data, going as far back as 1971. I was soon to discover another WOW moment right after the first one: a website as big as ESPN has no API for extracting this data from the web. If I wanted to extract all the data, I would basically have to set my query to display 200 entries, copy the data to a CSV, go to the next page, and do the same 58 times. I ain't got time for that.

So I started searching Google for an alternative that was slightly better: right-click and choose "View Page Source". The data is available in <tr> and <td> HTML tags. I am weak when it comes to HTML, but I thought I could use R to extract the HTML data saved in 58 different text files, or something like that.
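For anyone curious about that R route, here is a rough sketch using the XML package's readHTMLTable(); the URL's query parameters are assumptions, and you would still need to work out which of the returned tables holds the statistics.

# Rough sketch of the pure-R alternative using the XML package.
# readHTMLTable() parses every <table> on the page into a data frame;
# the query parameters below are assumptions and may need adjusting.
library(XML)

url <- "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;template=results;type=batting"
tables <- readHTMLTable(url, stringsAsFactors = FALSE)
str(tables, max.level = 1)  # inspect to find the batting statistics table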

Finally, God heard my prayers and I came across this amazing post. I followed the post to the original blog and discovered KIMONO. This is an amazing tool that does the trick. You basically create a link for the first page; when you go to the next page, you will see that the same link has a new element, page=2. We will exploit this URL parameter to our advantage and edit it in Excel using a combination of the CONCATENATE function and a referenced series of page numbers. At the end you should have 58 links, with the page number being the only thing that changes. Feed these links to Kimono and start crawling. You can download the entire dataset in just 10 to 15 minutes.

In the next post I will try to explain how Kimono works. If you are too curious to wait, there are a few good videos on YouTube.