Tuesday, November 3, 2015

Reverse Geocode using Google API and XML package in R - PART 2

As promised in my previous post, this post will dive deeper into how to create links in R and then execute them to generate a list of XML output. In the third and final post we will use this list to filter the data and extract the information we need.

Setup:

If you have RStudio, this is the right time to power it up. In case you are new to R, RStudio is an IDE for R. Once RStudio is open, we can install the package called XML and load it as a library in R. We need the package for the third part of this series.

Setting up the Link:

In our previous post we discussed the API link that we need to send to Google, along with the latitude and longitude data, to get back the actual address/cross street of that location. This is referred to as Reverse Geocoding.

The data file, which contains all the stop information along with the latitude and longitude data, is available on the MTA website [1] as well as here.

I have filtered the file downloaded from the MTA website, as that file contained more information than we need. I used the column labelled stop_code to keep only the stops that had a value of 1, since these were the exact locations of each stop. You can find the filtered data that we will use in this exercise here.

Before we use the data, let's look at the paste() function. The simplest way to learn about it is to type ?paste in your R Console window. In its simplest form, paste() takes the values to be joined together and, by default, separates them with a space. Try typing paste("Hello","World"). If you would like to separate these two words with a special character, add the sep argument to the paste function, as in paste("Hello","World", sep=","). In order to create an API link we need to add the lat and lon data as well as the API key.
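Here is a quick sketch of these paste() calls, in case you want to run them side by side. The third call is my own addition and shows that paste() also works on whole vectors; the stop labels in it are made up for illustration.

```r
# Default separator is a single space
greeting <- paste("Hello", "World")
print(greeting)   # "Hello World"

# A custom separator via the sep argument
with_comma <- paste("Hello", "World", sep = ",")
print(with_comma) # "Hello,World"

# paste() is vectorised: one string per element (labels are hypothetical)
labels <- paste("stop", 1:3, sep = "_")
print(labels)     # "stop_1" "stop_2" "stop_3"
```

With sep set to an empty string, the same function glues the pieces of the API link together: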

paste("https://maps.googleapis.com/maps/api/geocode/xml?","latlng=40.88925,-73.89858","&key=YOUR API KEY",sep="")

The above call will generate the following link:

"https://maps.googleapis.com/maps/api/geocode/xml?latlng=40.88925,-73.89858&key=YOUR API KEY"

Now, to extract the XML data from the Google API service and bring it into RStudio, we will use the readLines() function as follows:

link=paste("https://maps.googleapis.com/maps/api/geocode/xml?","latlng=40.88925,-73.89858","&key=YOUR API KEY",sep="")
data = readLines(link)
head(data)
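readLines() stops with an error if the link cannot be opened, which becomes annoying once we start requesting many links in a row. Below is a minimal sketch of one way to guard against that. The helper name fetch_xml is hypothetical (it is not part of base R or the XML package), and the demo reads a small local file with made-up contents that stands in for the API response, so it runs without an API key.

```r
# fetch_xml() is a hypothetical helper, not part of base R or the XML package.
# It returns the lines read from `link`, or NULL if the link cannot be opened.
fetch_xml <- function(link) {
  tryCatch(
    suppressWarnings(readLines(link, warn = FALSE)),
    error = function(e) NULL  # return NULL instead of stopping the script
  )
}

# Stand-in for an API response: a small local file (contents are made up)
tmp <- tempfile(fileext = ".xml")
writeLines(c("<GeocodeResponse>", " <status>OK</status>", "</GeocodeResponse>"), tmp)

good <- fetch_xml(tmp)                                     # readable source
bad  <- fetch_xml(file.path(tempdir(), "no_such_file.xml")) # unreadable source

head(good)
is.null(bad)
```

In real use you would pass the link built with paste() instead of tmp, and check for NULL before working with the result.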

Looping:

One question that arises is: what if we have a csv file with many different lat and lon values? How could we grab the data for all of them? The answer is simple: write a loop in R to generate the links and then use the readLines() function to get the XML data into R. It's hard for me to get into the details of writing a loop here, but a simple Google search should help you.

Following is an example of a simple loop which will use the data file and generate a list consisting of 10 values (links in our case).

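setwd("C:\\Users\\agohil\\Documents\\Book\\blogposts\\NYC")
sub= read.csv("subway.csv")
# drop any rows that contain NA
sub=na.omit(sub)
# empty list to hold the links
test=list()
for(i in 1:10){
  # column 3 holds the "lat,lon" string for row i
  test[i]= paste("https://maps.googleapis.com/maps/api/geocode/xml?",
                   "latlng=",sub[i,3],"&key=YOUR API KEY",sep="")
 }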

In the code above we set the working directory using the setwd() function in R. The data is read into R using the read.csv() function. I usually use the na.omit() function to remove any rows with NA from the data file. The current data file does not have any NA, but I like to be sure.
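To see what na.omit() does, here is a small sketch on a made-up data frame (the column names and the second and third rows are hypothetical, not taken from the MTA file). Note that na.omit() has to be called with the data frame as its argument.

```r
# A tiny stand-in for the stops data; row 2 has a missing latitude
demo <- data.frame(lat = c(40.88925, NA, 40.87456),
                   lon = c(-73.89858, -73.90087, -73.90983))

anyNA(demo)            # TRUE: at least one NA is present

clean <- na.omit(demo) # drops every row that contains an NA
nrow(clean)            # 2 rows remain
```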

I have created a third column in the data file to make my life a bit easier. The third column combines the values from the first two columns, separated by a comma, and it will be used to generate the link. Google requires the two values in the latlng argument to be separated by a comma, as in
latlng=40.88925,-73.89858
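If you would rather build that third column in R than in the data file, something like the sketch below should do it. The column names stop_lat and stop_lon and the second row are assumptions for illustration; use whatever your file actually contains.

```r
# Hypothetical two-column data frame: latitude first, longitude second
stops <- data.frame(stop_lat = c(40.88925, 40.87456),
                    stop_lon = c(-73.89858, -73.90983))

# paste() works on whole columns, so no loop is needed here
stops$latlng <- paste(stops$stop_lat, stops$stop_lon, sep = ",")

stops$latlng[1]
```

The new column ends up as column 3, which matches the sub[i,3] lookup used in the loop.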

Finally the Loop:

In the loop below we start with an empty list, test=list(), before we execute the loop. This is needed because when we run the loop R needs somewhere to store the values. In our case it is test.

for(i in 1:10){
  test[i]= paste("https://maps.googleapis.com/maps/api/geocode/xml?",
                   "latlng=",sub[i,3],"&key=YOUR API KEY",sep="")
 }

In the loop above, R substitutes the values 1 through 10 in turn every time it comes across an i. The sub[i,3] simply instructs R to go to row i and column 3 (as in sub[1,3], sub[2,3], ..., sub[10,3]) and substitute that value into the paste function.

The whole process can easily be extended to all the rows in the file (494 in our case) by simply using the following:
for(i in 1:494){
  test[i]= paste("https://maps.googleapis.com/maps/api/geocode/xml?",
                   "latlng=",sub[i,3],"&key=YOUR API KEY",sep="")
 }
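Hard-coding the number of rows works, but it breaks as soon as the file changes. Below is a sketch of two alternatives: looping over however many rows there are with nrow(), and skipping the loop altogether because paste() is vectorised. The data frame sub_demo and the variable names base_url, api_key, links and links_vec are hypothetical stand-ins, and the rows after the first are made up.

```r
# Hypothetical stand-in for the stops file; column 3 holds "lat,lon"
sub_demo <- data.frame(
  lat    = c(40.88925, 40.87456, 40.86978),
  lon    = c(-73.89858, -73.90983, -73.91528),
  latlng = c("40.88925,-73.89858", "40.87456,-73.90983", "40.86978,-73.91528"),
  stringsAsFactors = FALSE
)

base_url <- "https://maps.googleapis.com/maps/api/geocode/xml?"
api_key  <- "YOUR API KEY"  # placeholder, replace with your own key

# Option 1: a loop that adapts to the number of rows in the data
links <- vector("list", nrow(sub_demo))
for (i in seq_len(nrow(sub_demo))) {
  links[[i]] <- paste(base_url, "latlng=", sub_demo[i, 3],
                      "&key=", api_key, sep = "")
}

# Option 2: no loop, since paste() handles the whole column at once
links_vec <- paste(base_url, "latlng=", sub_demo[[3]],
                   "&key=", api_key, sep = "")

links_vec[1]
```

Both options produce the same links; the second returns a character vector rather than a list.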

Conclusion:

The main objective of this post was to introduce new users of R to looping in R and to help them understand and use the paste() and readLines() functions. In the next post we will use the XML package in R to filter the extracted list and pull out the data we need.
