Wednesday, March 4, 2015

Using Kimono to Extract the Data

As discussed yesterday, we would like to extract some data from the web using Kimono. I would use this approach in cases where a structured API is not available. Many websites let users generate and retrieve data through API services.

Many of these API services are free, or at least allow users a certain number of free API calls. News agencies (New York Times), social networking sites (Twitter and Facebook), and web apps (Tinder and Uber) have their own API services. But, as pointed out in the previous post, I was unable to find one for http://stats.espncricinfo.com/ci/engine/stats/index.html.

Hence, in this post we will learn to use Kimono Labs to extract the data. As the first step, we need to sign up and log in to the website.

Kimono Labs: https://www.kimonolabs.com/

The website will ask you to add the kimono icon to your web browser; this is necessary to generate the API.

The next step is to open the webpage that you would like to extract the data from; in our case, it is the ESPN Cricinfo stats page linked above.

Now we fire up Kimono by clicking on the icon. You should see a kimono menu appear right underneath your browser window; it should look similar to the image shown below:

Now give a name to your first selection (as shown in the image below). We will extract only the names of the players and the runs scored by them. Note that the first page of the website displays a list of 50 players. Your cursor should now be a hand rather than an arrow; use it to click on the names in the HTML table. The selection will turn yellow. To confirm your selection, use the check mark. You might also see some unnecessary items being selected, but we can remove them from the selection by crossing them out. The kimono menu shows exactly 50 entries, which is exactly what we want.

Repeat the same procedure and select the Runs column from the webpage.

We are almost done. In the top right corner we see a "DONE" button. When you click it, the webpage will change and kimono will ask you to fill in the details for your API, as shown in the image below. Once you have named your API, click on Create API.

Kimono will start building an API for you. Once it finishes, it will generate a link that is used to access the data. When we click on the link, we will be redirected to the kimono website, which will look like the image below:

We can download the data in either CSV or JSON format; I personally prefer CSV. I would highly recommend that readers explore the different tabs and learn some of the other tricks. The kimono blog also provides ideas on how its services can be used to build some cool apps.

Note that we have only scraped page 1 of the ESPN website. What about the rest of the pages? Things are very simple from here. Click on the crawl setup, as shown in the image above. Now use the dropdown titled "crawl strategy" and change it to Manual URL List.

You will observe a link on the right side, along with space to paste additional links. I would simply copy this link into Excel and use the CONCATENATE function to generate the additional links. Say we would like to extract data from 40 pages: we simply change the page number in the links from page=2 to page=3, and so on up to page=40. Copy and paste the links from Excel into kimono's manual crawl setup and click on Start Crawl. Kimono will now crawl every page and download the names and runs columns from the ESPN website.
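If you prefer, the same URL list can be generated with a few lines of Python instead of Excel's CONCATENATE. A sketch, assuming the page number appears as a page= parameter at the end of the link kimono shows you (the exact query string here is illustrative, so substitute your own link):

```python
# Generate the manual crawl URL list by varying the page parameter,
# replacing the Excel CONCATENATE step described above.
base = ("http://stats.espncricinfo.com/ci/engine/stats/"
        "index.html?page={n}")

urls = [base.format(n=n) for n in range(1, 41)]  # pages 1 through 40
print("\n".join(urls))  # paste these into kimono's Manual URL List box
```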

