Wednesday, November 18, 2015

Reverse Geocode using Google API and XML package in R - PART 3

So this is it! In this final tutorial I will show you the entire code I used to grab the data from the web, and we will learn how to parse the XML output. This post is a bit longer than the others, but since you have already come this far, let's get it done.

The Final Code: 

The following lines of code will generate a few objects in R, with the final data ending up in an object called final. I am not a great coder, so please comment if you have an easier way; this version does do the trick, though. Many of the functions in the code below are common, and if you have studied R in the past you will be familiar with them. Some functions were discussed in Part 2 of this series.

What is new:

1) The XML library and some useful functions for parsing XML data.
2) The matrix() function.

library("XML")
setwd("C:\\Users\\agohil\\Documents\\Book\\blogposts\\NYC")
sub= read.csv("subway.csv")
sub=na.omit
link=matrix(NA, nrow=494, ncol=1)
duh=list()
semi=list()
add1=matrix(NA, ncol=1, nrow=494)
add2=matrix(NA, ncol=1, nrow=494)
add3=matrix(NA, ncol=1, nrow=494)
for(i in 1:494){
  link[i,1]= paste("https://maps.googleapis.com/maps/api/geocode/xml?",
                 "latlng=",sub[i,3],"&key=your API Key here",sep="")
  duh[i]=list(readLines(link[i,1]))
  duh[[i]] = xmlTreeParse(duh[[i]], useInternalNodes=TRUE)
  duh[[i]]=getNodeSet(duh[[i]], "//formatted_address")
  semi[[i]]= xmlToDataFrame(duh[[i]],stringsAsFactors=FALSE)
  add1[i] = semi[[i]][1,1]
  add2[i] = semi[[i]][2,1]
  add3[i]=semi[[i]][3,1]
  final = cbind(add1,add2,add3)
}

In Part 2 we already saw how to create a loop and build a list containing 10 separate links.

readLines() Function:

The readLines() function will simply grab the entire XML output and store it. To learn more, type ?readLines in the R console window.

  duh[[i]] = readLines(link[i, 1])

The line of code above is placed inside the loop so that every time we feed in a new lat and lng pair, the XML response is stored as a new list element. We created the empty list duh outside the loop. If you run into errors, I would start with only 10 points and grab a small amount of data first. If everything works well (which it rarely does on the first try) you should end up with a list of 494 items. The following snippet generates the links; typing test displays all of them, while test[[1]] displays just the first one.

library(XML)
sub = read.csv("subway.csv")
test = matrix(NA, nrow = 2, ncol = 1)
duh = list()
for(i in 1:2){
  test[i] = paste("https://maps.googleapis.com/maps/api/geocode/xml?",
                  "latlng=", sub[i, 3], "&key=YOUR_API_KEY", sep = "")
  duh[[i]] = readLines(test[i, 1])
}
test
test[[1]]
test[[2]]


If you entered your actual API key, you should also be able to inspect the list called duh: typing duh in the R console window will print the stored XML lines.
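If you just want a quick peek rather than the full dump, base R makes that easy (a minimal example):

length(duh)         # how many responses are stored so far
head(duh[[1]], 10)  # the first ten lines of the first XML response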

Parsing the XML data:

Once you have all this wonderful data, we need to extract what we need and create a table or data frame object in R. Let me tell you, this is really easy. We use the XML package to parse the data we just downloaded. Parsing converts the raw text into a structured document object that the package can traverse; without that step, many of the XML package functions will not work.

duh[[i]] = xmlTreeParse(duh[[i]], useInternalNodes = TRUE)
duh[[i]] = getNodeSet(duh[[i]], "//formatted_address")

The function xmlTreeParse() takes basically two arguments: the object containing the XML data, and useInternalNodes = TRUE. Note that we need the internal-nodes setting so the document can be processed with XPath. Don't worry, it sounds scary but it's not that bad.

Now, in case you don't remember: we have the lat and lng data and we want to reverse geocode it to get the actual address. If we look at the XML output on the web, we see that the XML elements form a simple tree-like structure, with parent nodes and child nodes.

We are interested in the node called formatted_address, so we use the function getNodeSet() from the XML package to retrieve those elements. Note that the first argument is the parsed document and the second argument is an XPath expression. To learn XPath in detail, please go to the link mentioned in the reference section of this post.

A fair question to ask is: why do we use duh[[i]]? We know duh is a list, and in R [[ ]] extracts a single element from a list, whereas [ ] returns a sub-list. In our test snippet duh holds 2 XML outputs since i runs from 1 to 2; when you run the entire code, i runs from 1 to 494.
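As a quick illustration of the difference between the two bracket forms (a toy example, unrelated to our data):

x = list("a", "b", "c")
x[1]    # a list of length 1 that contains "a"
x[[1]]  # the element itself: "a"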

In the image below we see part of the XML output. Note that this is the output after parsing the XML and before running the getNodeSet() function.


Finally we create a data frame from the list.

semi[[i]] = xmlToDataFrame(duh[[i]], stringsAsFactors = FALSE)

Note that the tree contains multiple formatted_address nodes, so if you look at semi[[1]] you will see several rows (nine, for our first stop). But again, we know it's a list, and to extract only the first row we use semi[[1]][1,1], and we get our desired address.

semi[[1]]
                                 text
1 5959 Broadway, Bronx, NY 10463, USA
2         Kingsbridge, Bronx, NY, USA
3          West Bronx, Bronx, NY, USA
4                      Bronx, NY, USA
5                   New York, NY, USA
6                Bronx, NY 10471, USA
7               Bronx County, NY, USA
8                       New York, USA
9                       United States
> semi[[1]][1,1]
[1] "5959 Broadway, Bronx, NY 10463, USA"



In the main code referenced at the beginning of this post, you will see that we created three columns and bound them together using the basic R function cbind().

add1[i] = semi[[i]][1, 1]        # extract the first row
add2[i] = semi[[i]][2, 1]        # extract the second row
add3[i] = semi[[i]][3, 1]        # extract the third row
final = cbind(add1, add2, add3)  # bind them all together
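If you also want the result saved to disk, one more line does it (the file name final.csv is my choice, not part of the original code):

write.csv(final, "final.csv", row.names = FALSE)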

In case you encounter errors or have issues understanding the code please leave a comment.

Reference:

1) http://gastonsanchez.com/teaching/ has excellent slides on XML and HTML tree parsing.

Tuesday, November 3, 2015

Reverse Geocode using Google API and XML package in R - PART 2

As promised in my previous post, this post dives deeper into how to create the links in R and then execute them to generate a list of XML output. In the third and final post we will use this list to filter the data and extract the information we need.

Setup:

If you have RStudio, this is the right time to power it up. In case you are new to R, RStudio is an IDE for R. Once RStudio is open, install the XML package and load the library; we will need it for the third part of this series.

Setting up the Link:

In our previous post we discussed the API link that we send to Google, along with the latitude and longitude data, to get back the actual address/cross street of that location. This is referred to as reverse geocoding.

The data file containing all the stops information, along with latitude and longitude data, is available on the MTA website [1] as well as here.

I filtered the file downloaded from the MTA website, as it contained more information than we need. I used the column labelled stop_code to keep only the stops with a value of 1, since these are the exact locations of each stop. You can find the filtered data that we will use in this exercise here.

Before we use the data, let's look at the paste() function. The simplest way to learn about it is to type ?paste in your R console window. The paste() function takes the values to be connected together. Try typing paste("Hello","World"). If you would like to separate the two words with a special character, add the sep argument: paste("Hello","World", sep=","). To create an API link, we need to paste together the base URL, the lat and lng data, and the API key.

paste("https://maps.googleapis.com/maps/api/geocode/xml?","latlng=40.88925,-73.89858","&key=YOUR API KEY",sep="")

The above call generates the following:

"https://maps.googleapis.com/maps/api/geocode/xml?latlng=40.88925,-73.89858&key=YOUR API KEY"

Now, to extract the XML data from google api service and bring it in RStudio we will use the readLines() function as follows:

link=paste("https://maps.googleapis.com/maps/api/geocode/xml?","latlng=40.88925,-73.89858","&key=YOUR API KEY",sep="")
data = readLines(link)
head(data)

Looping:

One question that arises: what if we have a csv file with many different lat and lng pairs; how do we grab data for all of them? The answer is simple: write a loop in R to generate the links and then use the readLines() function to pull the XML into R. It is hard for me to get into the details of writing a loop here, but a simple Google search should help you.

Following is an example of a simple loop which will use the data file and generate a list consisting of 10 values (links in our case).

setwd("C:\\Users\\agohil\\Documents\\Book\\blogposts\\NYC")
sub= read.csv("subway.csv")
sub=na.omit
test=list()
for(i in 1:10){
  test[i]= paste("https://maps.googleapis.com/maps/api/geocode/xml?",
                   "latlng=",sub[i,3],"YOUR API KEY",sep="")
 }

In the code above we set the working directory using the setwd() function. The data is read into R with the read.csv() function. I usually run na.omit() to remove any NA values in the data file; the current file has none, but I like to be sure.

I created a third column in the data file to make my life a bit easier: it holds the values from the first two columns separated by a comma, and it is what goes into the link. Google requires the latlng parameter in exactly that form, as in
latlng=40.88925,-73.89858
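If you would rather build that third column in R than in the data file, a small sketch (assuming column 1 holds the latitude and column 2 the longitude):

sub$latlng = paste(sub[, 1], sub[, 2], sep = ",")  # becomes column 3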

Finally the Loop:

In the loop below we start with an empty list, test = list(), before we run the loop. This is needed because R must have somewhere to store the values generated on each iteration; in our case that is test.

for(i in 1:10){
  test[i] = paste("https://maps.googleapis.com/maps/api/geocode/xml?",
                  "latlng=", sub[i, 3], "&key=YOUR_API_KEY", sep = "")
}

In the loop above, R substitutes the values 1 through 10 every time it comes across an i. The sub[i,3] simply instructs R to go to row i, column 3 (as in sub[1,3], sub[2,3], ..., sub[10,3]) and substitute that value into the paste() call.

The whole process extends easily to all the rows in any file:
for(i in 1:494){
  test[i] = paste("https://maps.googleapis.com/maps/api/geocode/xml?",
                  "latlng=", sub[i, 3], "&key=YOUR_API_KEY", sep = "")
}

Conclusion:

The main objective of this post was to introduce new users to looping in R and to help them understand and use the paste() and readLines() functions. In the next post we will use the XML package to parse the extracted output, filter the list, and pull out the data we need.

Tuesday, October 27, 2015

Reverse Geocode using Google API and XML package in R - PART 1

World Map

We almost always come across data that is not in the format we need. Sometimes it has some of the information we want; other times we need additional information that can be derived from the data at hand. This tutorial explores the latter case: the data is present, but we need more information.

Idea:

The motivation behind this post is very simple. I was able to find data for all the subway stops in New York City (NYC), so I knew the latitude and longitude of each stop. What I was missing was the actual physical address. The task was to use the Google Geocode API to retrieve the physical address of each subway stop.

The obvious question is why I need this sort of data: I am building a visualization that requires it. For now, let us concentrate on the task at hand: how do we extract the data using the Google API service?

Set up:

If you are new to the idea of web scraping, you may not have heard of APIs. API is an acronym for Application Program Interface. Many websites provide API services, and the idea is simply to give programmatic access to their data. It is surprising how many websites provide API services nowadays; some of the well-known ones are the New York Times, Twitter, Facebook, Uber, etc. API services help developers build apps on top of the information the API provides. You can read more on this topic by simply typing "API" into Google.

For the current project we need a Google ID, which you might already have. To extract data from the API service you will need to create an API key; this key is used as an input while constructing the link.

Go to the Google developer console and click on Credentials under APIs & Auth. Then click Add credential -> API key -> Browser key -> give it a name -> Create. This generates an API key. Note that every API service needs a separate key, and the calls made to each service are limited; the developer console helps you track the number of free calls made. To see what information you can extract from any API service, please refer to the respective API documentation.

You are now all set for accessing the google API service !!! YAY !!!

LINK, XML, JSON and More... :

To extract data using the API service we construct a simple URL in which some parts are standard API information and some are custom, based on the user's requirement. The following is a breakdown of the Geocode API link.

http://maps.googleapis.com/maps/api/geocode/output?parameters

If we paste this into a browser we get nothing back; it simply errors out, because we are missing the API key and the parameters. The following is the link with all the parameters filled in:

https://maps.googleapis.com/maps/api/geocode/xml?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&key=YOUR_API_KEY

If you have your API key, use the link above with your key. Paste the entire link into your favorite browser and you should get XML output. Now try the same procedure with xml changed to json, and you will see the JSON format too. For our task we will use the XML format to get and filter the data. The XML output in your browser contains a lot of information, and our first task is to check whether the information we are looking for is present.

We need reverse geocoding, since we will supply the lat and lng of each subway stop as input and get back the address of that location as XML. The following link is an example of reverse geocoding with the latlng parameter.

https://maps.googleapis.com/maps/api/geocode/xml?latlng=40.714224,-73.961452&key=YOUR_API_KEY

Try changing 40.714224 and -73.961452 to some other lat and lng, and you will see the XML output update.

The XML output is a long list of street names, zip codes, lat and lng, neighborhoods, etc. You may or may not need all this information. We will learn to parse the XML tree and filter the data so it can be used efficiently.

In Part 2 we will learn to create the link using the paste() function in R, and then use the readLines() function to extract the information.







Tuesday, October 13, 2015

Pi an experiment with circlize package

Pi to 1000 places connected

Idea:

In the past I have often observed circular plots (more formally, chord diagrams) in

1) The New York Times [1], to display human genome data
2) Migration patterns among humans [2]
3) Import and export data (Bloomberg terminal)
4) Mathematical art [3]

My understanding of these plots is very new, but considering their popularity I believe it would be unjust not to discuss them on my blog. This post is motivated by an Instagram image of a similar plot created with the Circos software [4]. The image is built from the digits of pi up to 1,000 decimal places, connecting each digit to the next. We know the value of pi is 3.141..., so the visualization connects 3 to 1, 1 to 4, 4 to 1, and so on. The digits of pi appear random and follow no particular order, yet a sort of image emerges here: structure in seemingly random values.

Data:   


The data file was created by copying and pasting the digits of pi from here. Copying and pasting values from the web into Excel is a pain, but I prefer the csv format, so I had no other choice. I then tidied the data using Excel's =LEFT() and =RIGHT() functions to get the digits into one column, and copied them, shifted by one, into a second column. If you would like to access the file, it is here.
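If you would rather skip the spreadsheet work entirely, here is a rough R sketch of the same idea (the digit string below is truncated for illustration; you would paste in all 1,000 places):

digits = as.integer(strsplit("31415926535897932384626433832795", "")[[1]])
pi_pairs = data.frame(from = head(digits, -1), to = tail(digits, -1))
head(pi_pairs)  # each row links one digit of pi to the next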

Code:


install.packages("circlize")
library("circlize")
circos.par(gap.degree = 3)
pi= read.csv("pi.csv")
colors= c("#542D15","#C07F4D", "#4B2078","#C153C9","#0A6A6C","#1DCEC6","#094AA5","#0181FD","white","#92B966" )
chordDiagram(pi, grid.col= colors, grid.border=c("white"),transparency=0.5)

Load up RStudio:

circlize is a beautiful package, especially if you are trying to visualize genome data or to exhibit the flow of information from one sector to the next. The author has done an amazing job of explaining the package [5].

In our case we need just a few lines of code. The circos.par() function simply creates gaps between the sectors so that the image looks prettier.

I sort of cheated by creating the image in R and then importing it into Inkscape to add the black background. Also note that we have 10 sectors for the values 0 through 9; however, they are not placed in ascending or descending order. I am still learning the package and will probably show you how to order the sectors in my next post.
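For the impatient, chordDiagram() accepts an order argument that sets the sector order; a hedged sketch (assuming the sectors are named "0" through "9"):

chordDiagram(pi, order = as.character(0:9), grid.col = colors,
             grid.border = "white", transparency = 0.5)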

References:
[1] NY Times: Genome data
[2] Migration Flows: Migration Flows
[3] Mathematical Art: Pi
[4] Circos Software: circos website
[5] An Introduction to the circlize package: circlize

Tuesday, August 25, 2015

Taylor Rule - using Shiny in R

I am writing this blog post after a really long time. I had the R code and the Shiny app ready, but somehow I did not find time to post them. This post is inspired by the Taylor Rule screen on the Bloomberg terminal. If you work in finance, the Bloomberg terminal is like your lifeline. I came across the Taylor Rule screen one day while researching something unrelated. Users with access to the terminal can open the screen by simply typing "taylor" into it.

Anyone who has studied macroeconomics has definitely heard of the Taylor rule; the best way to learn about it is to google it ;). My aim in this post is to develop a Shiny app that closely resembles the Bloomberg screen and calculates the Taylor rule.

The app can be accessed using this link: Taylor

The app also contains instructions on how to use it. It relies on the API service provided by the Federal Reserve Bank of St. Louis: the data is fetched as soon as the user selects a date in the calendar, and the Taylor rule is then calculated. Every time the user changes the inputs, the Taylor estimate is recalculated.
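For readers curious about the calculation itself, here is a minimal R sketch of the standard Taylor (1993) formula with the usual 0.5 weights (the app's exact coefficients and data inputs may differ):

taylor_rate = function(inflation, output_gap,
                       target_inflation = 2, real_rate = 2) {
  # i = r* + pi + 0.5*(pi - pi*) + 0.5*(output gap)
  real_rate + inflation +
    0.5 * (inflation - target_inflation) + 0.5 * output_gap
}
taylor_rate(inflation = 1.5, output_gap = -1)  # example call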

In the past I have promised the code for my Shiny apps, but somehow it has not happened. I will try my best to post it as soon as possible.

Tuesday, May 12, 2015

Interracial marriage - step plot in R

Interracial Marriages in USA

I have tried to replicate a visualization I saw on Bloomberg about social change in the USA; however, I was unable to animate the chart exactly. The visualization artists at Bloomberg generated the graphic using Scalable Vector Graphics (SVG). Note the USA map in grid format in the top-left corner of their visualization; to learn how to create a similar grid-like cartogram in R, please refer to my previous post.

If you have used R at some point in the past, you will be familiar with the setwd() and read.csv() functions.

suppressMessages(suppressWarnings(library(dplyr)))
suppressMessages(suppressWarnings(library(googleVis)))
setwd("C:/Users/agohil/Book/blogposts/economics")
race= read.csv("races.csv")

I end up spending more time cleaning my data than creating the visual representation. We now clean the data using the dplyr library and add two columns: the frequency, and the cumulative sum of the frequency column.

Why do we do this? The answer lies in the story you want to tell your readers. We want to show the rate of social change in the USA, and we will study it by looking at how quickly each state changed its law to legalize interracial marriage.

The visualization is called a step plot or stepped area chart. The concept is not new, but adding animation gives it a new perspective. The first section discusses the data manipulation using dplyr; the second uses R's base plot to generate a step plot.

Section 1: Data Manipulation

In our visualization we would like to plot the years on the X axis and the cumulative frequency on the Y axis. To calculate the frequency we first generate a new data frame called data containing the distinct years. We can do this easily with dplyr's distinct() function: the first argument is the data frame and the second is the column to use. You should have 20 observations in data.

Next we use group_by() to create the groups and group_size() to count the observations in each group; group_size() gives us the frequency we need. We then use cumsum() to generate the cumulative frequency, and finally combine the two data frames with cbind(). The data is almost complete.

The cumsum() takes one argument which is the numeric vector. To learn more about any of the functions readers can type ?cumsum in the R command window or refer to the dplyr manual.

# reconstructing the intermediate objects (assuming the csv has a Year column)
data = distinct(race, Year)      # one row per distinct year
data_new = group_by(race, Year)  # group the raw data by year
freq = group_size(data_new)      # states changing the law in each year
data = cbind(data, freq)
cum = cumsum(data$freq)          # running total of states
data = cbind(data, cum)

Section 2: Data Visualization

To generate a step plot in base R, use the plot() function with the argument type = "s". In the code below most of the arguments are simple to understand; a new user of R should either look at my older posts or simply type ?plot or ?par in the R console window to understand them in detail.

par(bty = "n")
plot(data$Year, data$cum, type = "s", col = "red",lwd = 3, xlab = "Years", ylab = "Number of States",main = "Speed of social change in USA", xaxp = c(1780,1967,11))

If you omit the xaxp argument, you will notice that the plot does not display the last year, 1967. We often want our audience to see the first and last observations, as they may be key to delivering the right message.

In our visualization, the period from 1950 to 1967 is especially important: many states accepted the change in quick succession, yet it took about 17 years for all the states to change the law.

par(bty = "n")
plot(data$Year, data$cum, type = "s", col = "red",lwd = 3, xlab = "Years", ylab = "Number of States",main = "Speed of social change in USA", xaxp = c(1780,1967,11))

In R you have two options: customize the ticks with the xaxp argument, or use the axis() function. Specifying xaxp is very easy and serves our purpose: the first value is the first tick to display, the second is the last, and the third is the number of intervals between ticks.

You can learn more about xaxp by typing ?par in the R console window. Note that even though R documents xaxp as an argument of par(), we can pass it to plot() as shown in the code above.
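For completeness, the axis() route looks like this (a sketch: suppress the default x axis with xaxt = "n", then draw the ticks yourself):

plot(data$Year, data$cum, type = "s", col = "red", lwd = 3,
     xlab = "Years", ylab = "Number of States",
     main = "Speed of social change in USA", xaxt = "n")
axis(1, at = seq(1780, 1967, by = 17))  # 17-year steps, matching xaxp = c(1780, 1967, 11)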