Tuesday, October 27, 2015

Reverse Geocode using Google API and XML package in R - PART 1

World Map

We almost always come across data that is not in the format we need. Sometimes it holds the information we are after, and other times we need additional information that can be retrieved using the data at hand. This tutorial explores the second case: the data is present, but we need some more information.

Idea:

The motivation behind this post is very simple. I was able to find data on all the subway stops in New York City (NYC), so I knew the latitude and longitude of each stop. What I was missing was the actual physical address. The task was to use the Google Geocode API to extract the physical address of each subway stop.

The obvious question is why I need this sort of data. I am trying to create a visualization that requires it. For now, let us concentrate on the task at hand: how do we extract the data using the Google API service?

Set up:

If you are new to the idea of web scraping, you may not have heard of an API. API is an acronym for Application Programming Interface. Many websites provide API services, and the idea is simply to give access to their data. It is surprising how many websites provide API services nowadays. Some of the well-known ones are the New York Times, Twitter, Facebook and Uber. API services help developers build apps based on the information the API provides. You can read more on this topic by simply typing "API" into Google.

For the current project we will need a Google account, which you might already have. To extract data from the API service you will also need to create an API key. This key is used as an input when constructing the link.

Go to the Google developer console and click on Credential under APIs & Auth. Now click on Add credential -> API Key -> browser key -> "give it some name" -> click create. This will create an API key. Note that every API service will need a separate key and the number of calls made to each API service is limited. The Google developer console will help you track the number of free calls made to the service. To understand what information you can extract from any API service, please refer to the respective API documentation on Google.

You are now all set for accessing the google API service !!! YAY !!!

LINK, XML, JSON and More... :

To extract data using the API service we need to create a simple URL in which some parts are standard API information and some are custom, based on the user's requirements. The following is a breakdown of the Geocode API link.

http://maps.googleapis.com/maps/api/geocode/output?parameters

If we paste this in our browser we do not get anything useful back; it will simply error out. The reason is that output and parameters are only placeholders, and we are also missing an API key. The following is the link with all the parameters filled in:

https://maps.googleapis.com/maps/api/geocode/xml?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&key=YOUR_API_KEY

If you have your API key, you can use the link above with your key in place of YOUR_API_KEY. Paste the entire link into your favorite browser and you should get XML output. Now try the same procedure but change xml to json and you will see the JSON format too. For our task we will use the XML format to get the data and filter it. The XML output in your browser has a lot of information, and our first task is to find out whether the information we are looking for is present.
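As a rough sketch of how the link above is put together, the pieces can be assembled in R with paste0(). The helper name build_geocode_url is made up for this illustration, and the escaping is deliberately simplistic: it only turns spaces into "+", which is enough to reproduce the example link but not a general URL encoder.

```r
# Hypothetical helper (not part of any package): assemble a Geocode API link.
build_geocode_url <- function(address, key, output = "xml") {
  base <- "https://maps.googleapis.com/maps/api/geocode/"
  # Simplification: only spaces are escaped (as "+"), as in the example link
  query <- gsub(" ", "+", address, fixed = TRUE)
  paste0(base, output, "?address=", query, "&key=", key)
}

address  <- "1600 Amphitheatre Parkway, Mountain View, CA"
xml_url  <- build_geocode_url(address, "YOUR_API_KEY")
json_url <- build_geocode_url(address, "YOUR_API_KEY", output = "json")

print(xml_url)
print(json_url)
```

Switching output from "xml" to "json" is the same change you would otherwise make by hand in the browser.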

We need reverse geocoding, since we will supply the latitude and longitude of the subway stops as input and get back the address of that location as XML output. The following link is an example of reverse geocoding with the latlng parameter.

https://maps.googleapis.com/maps/api/geocode/xml?latlng=40.714224,-73.961452&key=YOUR_API_KEY

Try changing the 40.714224 and -73.961452 to some other lat and lng and you will see that your xml output will get updated.
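Since we have many stops, changing the numbers by hand will not scale. Here is a minimal sketch of building one reverse geocode link per coordinate pair. The function name build_reverse_geocode_url is hypothetical, and the second coordinate pair is made up purely for illustration; it is not taken from the subway data.

```r
# Hypothetical helper: build reverse geocode links for vectors of coordinates.
build_reverse_geocode_url <- function(lat, lng, key, output = "xml") {
  base <- "https://maps.googleapis.com/maps/api/geocode/"
  # Keep six decimal places, the precision used in the example link
  latlng <- sprintf("%.6f,%.6f", lat, lng)
  paste0(base, output, "?latlng=", latlng, "&key=", key)
}

# First pair is the example from the link above; the second is made up
lat <- c(40.714224, 40.750000)
lng <- c(-73.961452, -73.990000)

urls <- build_reverse_geocode_url(lat, lng, "YOUR_API_KEY")
print(urls)
```

Both sprintf() and paste0() are vectorised, so one call returns a link for every row of coordinates.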

The XML output you will see is a long list of street names, zip codes, latitudes and longitudes, neighborhoods, etc. You may or may not need all of this information. We will learn about parsing the XML tree and filtering the data so that it can be used efficiently.
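To give a feel for what filtering the XML tree might look like, here is a small sketch using the XML package on a hand-written, heavily trimmed stand-in for a response. The element names (GeocodeResponse, status, result, formatted_address) and the sample values are assumptions for illustration only; check them against the output you actually see in your browser.

```r
library(XML)

# Hand-written stand-in for a response; the real output is much longer
sample_xml <- '<GeocodeResponse>
  <status>OK</status>
  <result>
    <type>street_address</type>
    <formatted_address>277 Bedford Ave, Brooklyn, NY 11211, USA</formatted_address>
  </result>
  <result>
    <type>postal_code</type>
    <formatted_address>Brooklyn, NY 11211, USA</formatted_address>
  </result>
</GeocodeResponse>'

# Parse the text into an XML tree
doc <- xmlParse(sample_xml, asText = TRUE)

# Pull out the status and every formatted address with XPath queries
status    <- xpathSApply(doc, "/GeocodeResponse/status", xmlValue)
addresses <- xpathSApply(doc, "//result/formatted_address", xmlValue)

# Keep only the first (most specific) address when the request succeeded
first_address <- if (status == "OK" && length(addresses) > 0) addresses[1] else NA_character_
print(first_address)
```

Everything else in the tree is simply ignored, which is the kind of filtering we are after.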

In Part 2 we will learn to create the link using the paste() function in R and then use the readLines() function in R to extract the information.







Tuesday, October 13, 2015

Pi an experiment with circlize package

Pi to 1000 places connected

Idea:

In the past I have often seen circular plots (I can't find a technical name for them) used:

1) in the New York Times [1] to display human genome data
2) to show migration patterns among humans [2]
3) to display import and export data (Bloomberg terminal)
4) in mathematical art [3]

My understanding of these circular plots is very new, but considering their popularity I believe it would be unjust not to discuss them on my blog. This post is motivated by an Instagram image of a similar plot created using the Circos software [4]. The image is created by taking pi up to 1000 decimal places and connecting consecutive digits. We know the value of pi is 3.141..., and the visualization is generated by connecting 3 to 1, 1 to 4, 4 to 1 and so on. The digits of pi look random and follow no particular order, yet we see a sort of image emerge here. A structure in random values.

Data:   


The data file was created by copying and pasting the digits of pi from here. Copying and pasting values from the web into Excel is a problem, but I prefer the csv format and hence had no other choice. I further tidied the data using the =LEFT() and =RIGHT() functions to display the values in one column, then copied them to the next column. In case you would like to access the file, it is here.
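If you would rather skip Excel, the same kind of table could probably be built in R. The sketch below pairs each digit with the one that follows it, using only the first ten digits of pi. The column names from and to are my own choice for illustration, and I am assuming this two-column layout is what the csv file holds.

```r
# First ten digits of pi, kept as text so no precision is lost
digits_text <- "3141592653"
digits <- as.integer(strsplit(digits_text, "")[[1]])

# Pair each digit with its successor: 3 -> 1, 1 -> 4, 4 -> 1, ...
links <- data.frame(
  from = head(digits, -1),  # every digit except the last
  to   = tail(digits, -1)   # every digit except the first
)

print(links)
```

With the full 1000 digits in digits_text the same steps apply, and n digits always give n - 1 links.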

Code:


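# install.packages("circlize")  # run once if the package is not installed
library(circlize)

# Leave a 3-degree gap between neighbouring sectors
circos.par(gap.degree = 3)

# Digit pairs prepared as described under Data (named pi_digits so the built-in constant pi is not masked)
pi_digits <- read.csv("pi.csv")

# One colour per sector
sector_colors <- c("#542D15", "#C07F4D", "#4B2078", "#C153C9", "#0A6A6C", "#1DCEC6", "#094AA5", "#0181FD", "white", "#92B966")
chordDiagram(pi_digits, grid.col = sector_colors, grid.border = "white", transparency = 0.5)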

Load up RStudio:

Circlize is a beautiful package, especially if you are trying to visualize genome data or to show the flow of information from one sector to the next. The author has done an amazing job of explaining the package [5].

In our case we just need 3 lines of code. The circos.par() function simply creates gaps between the sectors so that the image looks prettier.

I sort of cheated by creating the image in R and then importing it into Inkscape to add the black background. Also note that we have 10 sectors for the values 0 to 9. However, they are not placed in ascending or descending order. I am still learning the package and can probably show you how to order the sectors in my next post.

References:
[1] NY Times: Genome data
[2] Migration Flows: Migration Flows
[3] Mathematical Art: Pi
[4] Circos Software: circos website
[5] An Introduction to circlize package: circlize