The Final Code:
The following code creates several intermediate objects in R, but the final data ends up in an object called final. I am not great at coding, so please comment if you know an easier way; still, it does the trick. Many of the functions below are common, and if you have studied R before you will be familiar with them. Some were discussed in Part 2 of the post with the same name.
What is new:
1) XML library and some useful functions to parse the xml data.
2) matrix() function.
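library("XML")
setwd("C:\\Users\\agohil\\Documents\\Book\\blogposts\\NYC")
sub = read.csv("subway.csv")
sub = na.omit(sub)  # drop rows with missing values
n = nrow(sub)  # 494 rows in this data set
link = matrix(NA, nrow=n, ncol=1)  # one request URL per row
duh = list()   # raw XML first, then the parsed nodes, for each row
semi = list()  # one data frame of addresses per row
add1 = matrix(NA, ncol=1, nrow=n)
add2 = matrix(NA, ncol=1, nrow=n)
add3 = matrix(NA, ncol=1, nrow=n)
for(i in 1:n){
  # build the request URL; column 3 of sub supplies the latlng value
  link[i,1] = paste("https://maps.googleapis.com/maps/api/geocode/xml?",
    "latlng=",sub[i,3],"&key=your API Key here",sep="")
  duh[i] = list(readLines(link[i,1]))  # download the XML as text
  duh[[i]] = xmlTreeParse(duh[[i]], useInternalNodes=TRUE)  # parse into a tree
  duh[[i]] = getNodeSet(duh[[i]], "//formatted_address")  # keep the address nodes
  semi[[i]] = xmlToDataFrame(duh[[i]], stringsAsFactors=FALSE)
  add1[i] = semi[[i]][1,1]
  add2[i] = semi[[i]][2,1]
  add3[i] = semi[[i]][3,1]
}
final = cbind(add1,add2,add3)  # bind the columns once, after the loop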
In Part 2 we already saw how to write a loop and create a list containing 10 separate links.
readLines() Function:
The readLines() function simply grabs the entire XML output and stores it as text. To learn more, type ?readLines in the R console.
duh[i]=list(readLines(link[i,1]))
The line above is placed inside the loop so that each time the loop moves to a new lat and lng, the returned XML is stored as a new list element. We created the empty list duh outside the loop. If you encounter errors, I would start with only 10 points and fetch a small amount of data. If everything works well (which it hardly ever does), you should have a list of 494 elements. The following small snippet generates two links; typing test in R displays all the links, and test[[1]] displays just the first.
library(XML)
sub= read.csv("subway.csv")
test=matrix(NA, nrow=2, ncol=1)
duh=list()
for(i in 1:2){
test[i,1]= paste("https://maps.googleapis.com/maps/api/geocode/xml?", "latlng=",sub[i,3],"&key=YOUR API Key HERE",sep="")
duh[i]=list(readLines(test[i,1]))
}
test
test[[1]]
test[[2]]
If you entered your actual API key, you should also be able to access the list called duh. Typing duh in the R console prints the XML elements it holds.
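Since a single failed request would otherwise stop the whole loop, it can help to wrap readLines() in tryCatch(). The following is only a minimal sketch and not part of the original code: the helper name safe_read is made up for illustration, and it is demonstrated on a temporary local file rather than a real API URL so that it runs without a key.

```r
# Hypothetical helper (not in the original code): return the lines read from
# a file or URL, or NULL if the read fails for any reason.
safe_read = function(path) {
  tryCatch(
    suppressWarnings(readLines(path)),
    error = function(e) NULL
  )
}

# Demonstration on a local temporary file instead of a live API URL.
tmp = tempfile(fileext = ".xml")
writeLines(c("<GeocodeResponse>", "<status>OK</status>", "</GeocodeResponse>"), tmp)

ok = safe_read(tmp)  # character vector, one element per line
bad = safe_read(file.path(tempdir(), "no_such_file.xml"))  # read fails, so NULL

length(ok)
is.null(bad)
```

Inside the loop you could then test the result with is.null() and skip the parsing step for that row.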
Parsing the XML data:
Once you have all this wonderful data, we need to extract what we need and create a table or data frame object in R. Let me tell you, this is really easy. We use the XML package to parse the data we just downloaded. Parsing is needed because readLines() returns plain text; xmlTreeParse() turns that text into a tree of nodes, and without it many of the XML package functions will not work.
duh[[i]] = xmlTreeParse(duh[[i]], useInternalNodes=TRUE)
duh[[i]]=getNodeSet(duh[[i]], "//formatted_address")
The function xmlTreeParse() basically takes two arguments: the XML data (a file name or, as here, the downloaded text) and useInternalNodes=TRUE. We need to set useInternalNodes to TRUE in order to query the document with XPath. Don't worry, it sounds scary but it's not that bad.
Now, I am not sure if you remember, but we had the lat and lng data and wanted to use reverse geocoding to get the actual address. If we look at the XML output on the web, we see that the XML elements have a simple tree-like structure: there are parent nodes and child nodes.
We are interested in the node called formatted_address, so we use the getNodeSet() function from the XML package to retrieve those XML elements. The first argument is the parsed document and the second is the XPath expression. To learn XPath in detail, please go to the link mentioned in the reference section of this post.
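To see parsing and getNodeSet() at work without calling the API, here is a small sketch on a hand-written XML string. The XML is a heavily trimmed stand-in for a real response (an assumption, not actual API output), and asText=TRUE is used to say explicitly that the input is XML content rather than a file name.

```r
library(XML)

# A hand-written, heavily trimmed stand-in for one geocoding response.
xml_text = c(
  "<GeocodeResponse>",
  "  <status>OK</status>",
  "  <result><formatted_address>5959 Broadway, Bronx, NY 10463, USA</formatted_address></result>",
  "  <result><formatted_address>Kingsbridge, Bronx, NY, USA</formatted_address></result>",
  "</GeocodeResponse>"
)

# asText = TRUE says the input is XML content rather than a file name.
doc = xmlTreeParse(paste(xml_text, collapse = "\n"),
                   asText = TRUE, useInternalNodes = TRUE)

# "//formatted_address" matches every formatted_address node at any depth.
nodes = getNodeSet(doc, "//formatted_address")

length(nodes)         # how many address nodes were found
xmlValue(nodes[[1]])  # the text inside the first one
```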
A simple question to ask is: why do we use duh[[i]]? duh is a list, and [[ ]] is how R extracts a single element from a list. In our small example duh holds 2 XML outputs, since i goes from 1 to 2; when you run the entire code, i goes from 1 to 494.
In the image below we observe part of the XML output. Note that this is the output after parsing the XML and before running the getNodeSet() function.
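The difference between single and double brackets is easy to check on a toy list. This is just an illustrative sketch; the list demo and its contents are made up.

```r
# A small stand-in list; in the post, each element would be one XML response.
demo = list(c("<a>", "</a>"), c("<b>", "</b>"))

one = demo[1]    # single brackets: still a list, of length 1
two = demo[[1]]  # double brackets: the element itself, a character vector

class(one)
class(two)
```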
Finally we create a data frame from the list.
semi[[i]]= xmlToDataFrame(duh[[i]],stringsAsFactors=FALSE)
Note that the system encounters multiple formatted_address nodes in the XML tree, so if you look at semi[[1]] you will see several rows (nine in the output below). Since semi is a list and each element is a data frame, we use semi[[1]][1,1] to extract only the first row, which gives us our desired address.
> semi[[1]]
text
1 5959 Broadway, Bronx, NY 10463, USA
2 Kingsbridge, Bronx, NY, USA
3 West Bronx, Bronx, NY, USA
4 Bronx, NY, USA
5 New York, NY, USA
6 Bronx, NY 10471, USA
7 Bronx County, NY, USA
8 New York, USA
9 United States
> semi[[1]][1,1]
[1] "5959 Broadway, Bronx, NY 10463, USA"
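As a possible alternative to going through a data frame, the addresses can be pulled straight into a character vector with xpathSApply() and xmlValue() from the XML package. This is a sketch of a different route, not the approach used in the code above, and the XML string is again a trimmed, hand-written stand-in.

```r
library(XML)

xml_text = "<GeocodeResponse>
  <result><formatted_address>5959 Broadway, Bronx, NY 10463, USA</formatted_address></result>
  <result><formatted_address>Kingsbridge, Bronx, NY, USA</formatted_address></result>
</GeocodeResponse>"

doc = xmlParse(xml_text, asText = TRUE)

# Apply xmlValue to every matching node; the result is a character vector.
addresses = xpathSApply(doc, "//formatted_address", xmlValue)

# Indexing past the end yields NA, so a response with fewer than three
# addresses still fills three slots.
first_three = addresses[1:3]
first_three
```

Here the stand-in has only two addresses, so the third slot comes back as NA instead of causing an error.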
add1[i] = semi[[i]][1,1]  # to extract the first row
add2[i] = semi[[i]][2,1]  # to extract the second row
add3[i] = semi[[i]][3,1]  # to extract the third row
final = cbind(add1,add2,add3)  # we bind all together
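To make the last step concrete, here is a small sketch of what cbind() produces, using two-row stand-ins for the three columns. The column names and the output file name are assumptions added for illustration; the original code neither names the columns nor writes anything to disk.

```r
# Two-row stand-ins for the add1, add2 and add3 columns built in the loop.
add1 = matrix(c("5959 Broadway, Bronx, NY 10463, USA", NA), ncol = 1)
add2 = matrix(c("Kingsbridge, Bronx, NY, USA", NA), ncol = 1)
add3 = matrix(c("West Bronx, Bronx, NY, USA", NA), ncol = 1)

final = cbind(add1, add2, add3)  # character matrix, one row per input row
colnames(final) = c("add1", "add2", "add3")

dim(final)

# Optional: write the result to disk (the file name is an assumption).
out_file = file.path(tempdir(), "final.csv")
write.csv(final, out_file, row.names = FALSE)
```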