Wednesday, November 18, 2015

Reverse Geocode using Google API and XML package in R - PART 3

So this is it! In this tutorial I will show you the entire code that I used to grab the data from the web, and we will learn to parse the XML data. This tutorial is a bit longer, but since you have already come this far, let's just get it done.

The Final Code: 

The following lines of code will create several objects in R, but the final data will be stored in an object called final. I am not great at coding, so please comment if you know an easier way. However, it does do the trick. Many of the functions in the code below are common, and if you have studied R in the past you will be familiar with them. Some functions were discussed in Part 2 of this series.

What is new :

1) XML library and some useful functions to parse the xml data.
2) matrix() function.

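library("XML")
setwd("C:\\Users\\agohil\\Documents\\Book\\blogposts\\NYC")
sub= read.csv("subway.csv")
sub=na.omit(sub)   # drop any rows with missing values
# empty containers that the loop will fill
link=matrix(NA, nrow=494, ncol=1)
duh=list()
semi=list()
add1=matrix(NA, ncol=1, nrow=494)
add2=matrix(NA, ncol=1, nrow=494)
add3=matrix(NA, ncol=1, nrow=494)
for(i in 1:494){
  # build the request link from the "lat,lng" value in column 3
  link[i,1]= paste("https://maps.googleapis.com/maps/api/geocode/xml?",
                 "latlng=",sub[i,3],"&key=your API Key here",sep="")
  # download the raw XML text, one string per line
  duh[i]=list(readLines(link[i,1]))
  # parse the text into an XML tree that XPath can query
  duh[[i]] = xmlTreeParse(duh[[i]], useInternalNodes=TRUE)
  # keep only the formatted_address nodes
  duh[[i]]=getNodeSet(duh[[i]], "//formatted_address")
  # turn the nodes into a data frame, one address per row
  semi[[i]]= xmlToDataFrame(duh[[i]],stringsAsFactors=FALSE)
  add1[i] = semi[[i]][1,1]
  add2[i] = semi[[i]][2,1]
  add3[i]=semi[[i]][3,1]
}
# bind the three address columns once the loop has finished
final = cbind(add1,add2,add3)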

In Part 2 we already saw how to write a loop that creates a list containing 10 separate links.

readLines() Function:

The readLines() function simply grabs the entire XML output and stores it as text, one string per line. To learn more, type ?readLines in the R console window.
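To see what readLines() hands back without calling the API, here is a small sketch that reads from a temporary file instead of a link. The file and its three lines are made up for this example; the real response from Google is much longer.

```r
# Write a tiny, made-up XML document to a temporary file.
tmp <- tempfile(fileext = ".xml")
writeLines(c("<GeocodeResponse>",
             "  <status>OK</status>",
             "</GeocodeResponse>"), tmp)

# readLines() returns a character vector with one element per line.
raw <- readLines(tmp)
length(raw)
raw[2]
```

The same call works with a link in place of the file name, which is how it is used in this post.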

  duh[i]=list(readLines(link[i,1]))

This line of code is placed inside the loop so that every time the loop moves to a new lat and lng value, the XML it downloads is stored as a new list element. We created the empty list called duh outside the loop. In case you encounter errors, I would start with only 10 points to keep the amount of data small. If everything works well (which it hardly ever does), you should have a list of 494 items. The following small code snippet generates two links; we can display them in R by typing test to show all the links, or test[[1]] to show just the first link.

library(XML)
sub= read.csv("subway.csv")
test=matrix(NA, nrow=2, ncol=1)
duh=list()
for(i in 1:2){
test[i]= paste("https://maps.googleapis.com/maps/api/geocode/xml?", "latlng=",sub[i,3],"&key=YOUR API Key HERE",sep="")
duh[i]=list(readLines(test[i,1]))
}
test
test[[1]]
test[[2]]


If you entered your actual API key, you should also be able to inspect the object called duh. When you type duh in the R console window, the console will fill up with the XML elements.
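If a single request fails, readLines() stops with an error and the whole loop ends with it. Here is a minimal sketch of one way around that, assuming you are happy to get NULL back for a failed request. The helper name safe_read is made up for this example and is not part of base R or the XML package; a file that does not exist stands in for a bad link so the snippet runs without an API key.

```r
# safe_read() is a hypothetical helper, not a function from base R or XML.
# It returns the lines on success and NULL if the connection cannot be opened.
safe_read <- function(path_or_url) {
  tryCatch(suppressWarnings(readLines(path_or_url)),
           error = function(e) NULL)
}

# A file that does not exist stands in for a link that fails.
bad <- safe_read(file.path(tempdir(), "no_such_file.xml"))
is.null(bad)

# A file that does exist is read as usual.
ok_file <- tempfile(fileext = ".xml")
writeLines("<status>OK</status>", ok_file)
good <- safe_read(ok_file)
good
```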

Parsing the XML data:

Once we have all this wonderful data, we need to extract what we need and create a table or a data frame object in R. Let me tell you, this is really easy. We will use the XML package in R to parse the data that we just downloaded. Parsing is needed because readLines() gives us plain text only; the parser turns that text into a tree of nodes, and most of the XML package functions work on that tree rather than on raw text.

duh[[i]] = xmlTreeParse(duh[[i]], useInternalNodes=TRUE)
duh[[i]]=getNodeSet(duh[[i]], "//formatted_address")

The function xmlTreeParse() basically takes 2 arguments here: the XML content to parse (a file name or, as in our case, the text we downloaded) and useInternalNodes=TRUE. Note that we need to set useInternalNodes to TRUE in order to process the document using XPath. Don't worry, it sounds scary but it's not that bad.
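Here is a small sketch of the parsing step on its own, assuming the XML package is installed. The XML text is a made-up, shortened stand-in for the real response, and I pass asText=TRUE explicitly so the parser knows it is getting text rather than a file name.

```r
library(XML)

# A made-up, shortened response; the real one from Google has many more nodes.
xml_lines <- c("<GeocodeResponse>",
               " <status>OK</status>",
               " <result>",
               "  <formatted_address>5959 Broadway, Bronx, NY 10463, USA</formatted_address>",
               " </result>",
               "</GeocodeResponse>")

# Collapse the lines into one string and parse it as text.
doc <- xmlTreeParse(paste(xml_lines, collapse = "\n"),
                    asText = TRUE, useInternalNodes = TRUE)

class(doc)              # the parsed document, no longer plain text
xmlName(xmlRoot(doc))   # name of the top-level (parent) node
```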

Now, I am not sure if you remember, but we had the lat and lon data and we wanted to use reverse geocoding to get the actual address. If we look at the XML output on the web, we will see that the XML elements have a simple tree-like structure. There are parent nodes and child nodes.

We are interested in the node called formatted_address. Hence we use the function getNodeSet() from the XML package to retrieve those XML elements. Note that the first argument of the function is the parsed document and the second argument is the XPath expression. To learn this language in detail, please go to the link mentioned in the reference section of this post.
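The following sketch shows the XPath step on a made-up document with two results, assuming the XML package is installed. The "//" at the start of the expression means "find this node anywhere in the tree", and xmlValue() pulls the text out of each node that was found.

```r
library(XML)

# Made-up XML with two results, standing in for the real response.
xml_text <- paste0(
  "<GeocodeResponse>",
  "<result><formatted_address>5959 Broadway, Bronx, NY 10463, USA</formatted_address></result>",
  "<result><formatted_address>Kingsbridge, Bronx, NY, USA</formatted_address></result>",
  "</GeocodeResponse>")

doc <- xmlTreeParse(xml_text, asText = TRUE, useInternalNodes = TRUE)

# Find every formatted_address node, wherever it sits in the tree.
nodes <- getNodeSet(doc, "//formatted_address")
length(nodes)

# Extract the text held by each node.
addresses <- sapply(nodes, xmlValue)
addresses
```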

A simple question to ask is: why do we use duh[[i]]? We know that duh is a list, and if you observe how R creates a list you will understand the use of [[ ]]. We use [[i]] to extract a single element from the list. In the short snippet above duh holds 2 XML outputs, since i goes from 1 to 2. But when you run the entire code, i goes from 1 to 494.
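A quick way to see the difference between [ ] and [[ ]] is to try both on a small list. The list below is made up for this example and holds plain strings instead of XML.

```r
# A made-up list with two elements.
duh_demo <- list("first xml output", "second xml output")

# Single brackets return a shorter list.
class(duh_demo[1])

# Double brackets return the element itself.
class(duh_demo[[1]])
duh_demo[[2]]
```

That is why the parsing functions are given duh[[i]]: they need the element itself, not a list wrapped around it.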

In the image below we see a part of the XML output. Note that this is the output after parsing the XML and before running the getNodeSet() function.


Finally we create a data frame from the list.

 semi[[i]]= xmlToDataFrame(duh[[i]],stringsAsFactors=FALSE)

Note that the system encounters multiple formatted_address nodes in the XML tree; hence, if you look at semi[[1]], you will see that it has 9 rows. semi is a list, and each of its elements is a data frame, so to extract only the first row we use semi[[1]][1,1], and we get our desired address.

> semi[[1]]
                                 text
1 5959 Broadway, Bronx, NY 10463, USA
2         Kingsbridge, Bronx, NY, USA
3          West Bronx, Bronx, NY, USA
4                      Bronx, NY, USA
5                   New York, NY, USA
6                Bronx, NY 10471, USA
7               Bronx County, NY, USA
8                       New York, USA
9                       United States
> semi[[1]][1,1]
[1] "5959 Broadway, Bronx, NY 10463, USA"
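Here is the data frame step as a small sketch on made-up XML with only two addresses, assuming the XML package is installed. It also shows what happens when you ask for a row that is not there, which matters when a location returns fewer than 3 addresses.

```r
library(XML)

# Made-up XML with two addresses, standing in for the real response.
xml_text <- paste0(
  "<GeocodeResponse>",
  "<result><formatted_address>5959 Broadway, Bronx, NY 10463, USA</formatted_address></result>",
  "<result><formatted_address>Kingsbridge, Bronx, NY, USA</formatted_address></result>",
  "</GeocodeResponse>")

doc   <- xmlTreeParse(xml_text, asText = TRUE, useInternalNodes = TRUE)
nodes <- getNodeSet(doc, "//formatted_address")

# One row per formatted_address node.
addr <- xmlToDataFrame(nodes, stringsAsFactors = FALSE)
nrow(addr)
addr[1, 1]

# Asking for a third row that does not exist gives NA rather than an error.
addr[3, 1]
```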



In the main code at the beginning of this post you will see that we created 3 columns and bound them together using the base R function cbind().

add1[i] = semi[[i]][1,1] # to extract the first row
add2[i] = semi[[i]][2,1]# to extract the second row
add3[i]=semi[[i]][3,1]# to extract the third row
final = cbind(add1,add2,add3) # we bind all together
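To see what cbind() does with these one-column matrices, here is a tiny sketch with two rows instead of 494. The address values are made up for this example.

```r
# Three one-column matrices with two rows each (made-up values).
add1 <- matrix(c("5959 Broadway, Bronx, NY 10463, USA", "Address B1"), ncol = 1)
add2 <- matrix(c("Kingsbridge, Bronx, NY, USA", "Address B2"), ncol = 1)
add3 <- matrix(c("West Bronx, Bronx, NY, USA", "Address B3"), ncol = 1)

# cbind() places the columns side by side: one row per location.
final <- cbind(add1, add2, add3)
dim(final)
final[1, ]
```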

In case you encounter errors or have issues understanding the code please leave a comment.

Reference:

1) http://gastonsanchez.com/teaching/ has an excellent PowerPoint on XML and HTML tree parsing.

Tuesday, November 3, 2015

Reverse Geocode using Google API and XML package in R - PART 2

As promised in my previous post, this post will dive deeper into how to create links in R and then execute them to generate a list of XML outputs. In the third and final post we will use this list to filter the data and extract the information we need.

Setup:

If you have RStudio, then this is the right time to power it up. In case you are new to R, RStudio is an IDE for R. Once we have RStudio open, we can install the XML package and load it as a library. We need the package for the third part of this series.

Setting up the Link:

In our previous post we discussed the API link that we need to send to Google, along with the latitude and longitude data, to get back the actual address or cross street of that location. This is referred to as reverse geocoding.

The data file, which contains all the stop information along with latitude and longitude data, is available on the MTA website [1] as well as here.

I filtered the file downloaded from the MTA website because it contained more information than we need. I used the column labelled stop_code to keep only the stops that had a value of 1, since these were the exact locations of each stop. You can find the filtered data that we will use in this exercise here.

Before we use the data, let's look into the paste() function. The simplest way to learn about it is to type ?paste in your R console window. At its simplest, the paste() function takes the values to be joined together. Try typing paste("Hello","World"). By default the values are separated by a space; if you would like to separate the two words with a different character, add the sep argument, as in paste("Hello","World", sep=","). In order to create an API link we need to add the lat and lon data as well as the API key.

paste("https://maps.googleapis.com/maps/api/geocode/xml?","latlng=40.88925,-73.89858","&key=YOUR API KEY",sep="")

The above code will generate the following link:

"https://maps.googleapis.com/maps/api/geocode/xml?latlng=40.88925,-73.89858&key=YOUR API KEY"

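As a quick check of how sep changes the result, here is a short sketch. It also uses paste0(), which is base R shorthand for paste() with sep=""; I use it here only as an alternative spelling, and the original code works just as well.

```r
# The default separator is a single space.
paste("Hello", "World")

# sep replaces the space with whatever you choose.
paste("Hello", "World", sep = ",")

# Building the link with sep = "" and with paste0() gives the same string.
base_url <- "https://maps.googleapis.com/maps/api/geocode/xml?"
link_a <- paste(base_url, "latlng=40.88925,-73.89858", "&key=YOUR API KEY", sep = "")
link_b <- paste0(base_url, "latlng=40.88925,-73.89858", "&key=YOUR API KEY")
identical(link_a, link_b)
```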
Now, to extract the XML data from the Google API service and bring it into RStudio, we will use the readLines() function as follows:

link=paste("https://maps.googleapis.com/maps/api/geocode/xml?","latlng=40.88925,-73.89858","&key=YOUR API KEY",sep="")
data = readLines(link)
head(data)

Looping:

One question that arises: what if we have a csv file with many different lat and lon values? How could we grab data for all of them? The answer is simple: write a loop in R to generate the links and then use the readLines() function to get the XML data into R. It's hard for me to get into the details of writing a loop here, but a simple Google search should help you.

Following is an example of a simple loop which will use the data file and generate a list consisting of 10 values (links in our case).

setwd("C:\\Users\\agohil\\Documents\\Book\\blogposts\\NYC")
sub= read.csv("subway.csv")
sub=na.omit(sub)
test=list()
for(i in 1:10){
  test[i]= paste("https://maps.googleapis.com/maps/api/geocode/xml?",
                   "latlng=",sub[i,3],"&key=YOUR API KEY",sep="")
 }

In the code above we set the working directory using the setwd() function. The data is read into R using the read.csv() function. I usually use the na.omit() function to remove any rows containing NA from the data. The current data file does not have any NA, but I like to be sure.

I have created a third column in the data file to make my life a bit easier. The third column uses the values from the first 2 columns separated by a comma. This third column will be used to generate the link. Google requires the latlng argument to be separated by a comma as in
latlng=40.88925,-73.89858
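I built the third column in the data file itself, but it can also be done in R. Here is a minimal sketch, assuming the first two columns are called stop_lat and stop_lon; those column names and the second row of values are made up for this example.

```r
# A made-up data frame standing in for the first two columns of the file.
stops <- data.frame(stop_lat = c(40.88925, 40.88467),
                    stop_lon = c(-73.89858, -73.90087))

# Join latitude and longitude with a comma, as Google expects.
stops$latlng <- paste(stops$stop_lat, stops$stop_lon, sep = ",")
stops[1, 3]
```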

Finally the Loop:

In the loop below we start with an empty list, test=list(), before we execute the loop. This is needed because when we run the loop R needs to store the values somewhere. In our case that is test.

for(i in 1:10){
  test[i]= paste("https://maps.googleapis.com/maps/api/geocode/xml?",
                   "latlng=",sub[i,3],"&key=YOUR API KEY",sep="")
 }

In the loop above, R substitutes the values 1 through 10 in turn every time it comes across an i. The sub[i,3] simply instructs R to go to row i and column 3 (as in sub[1,3], sub[2,3], ..., sub[10,3]) and substitute that value into the paste function.

The whole process can easily be extended to all the rows in the file by simply using the following:
for(i in 1:494){
  test[i]= paste("https://maps.googleapis.com/maps/api/geocode/xml?",
                   "latlng=",sub[i,3],"&key=YOUR API KEY",sep="")
 }
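If you do not want to type the number of rows by hand, here is a sketch of two alternatives on a made-up two-row data frame. The first loops over however many rows the data has; the second relies on paste0() working on a whole column at once, so no loop is needed. The data frame and its values are invented for this example.

```r
# A made-up stand-in for the subway data: one "lat,lng" column.
stops <- data.frame(latlng = c("40.88925,-73.89858", "40.88467,-73.90087"),
                    stringsAsFactors = FALSE)
base_url <- "https://maps.googleapis.com/maps/api/geocode/xml?"

# Loop over every row, however many there are.
links <- list()
for (i in seq_len(nrow(stops))) {
  links[i] <- paste(base_url, "latlng=", stops[i, 1],
                    "&key=YOUR API KEY", sep = "")
}

# The same links without a loop: paste0() handles the whole column at once.
links_vec <- paste0(base_url, "latlng=", stops$latlng, "&key=YOUR API KEY")

identical(unlist(links), links_vec)
```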

Conclusion:

The main objective of this post was to introduce new users of R to looping in R and to help them understand and use the paste() and readLines() functions. In the next post we will use the XML package in R to filter the extracted list and pull out the data we need.