Tuesday, May 12, 2015

Interracial marriage - step plot in R

Interracial Marriages in USA

I tried to replicate a visualization I saw on Bloomberg about social change in the USA. However, I was unable to animate the chart exactly. The visualization artists at Bloomberg generated the graphic using Scalable Vector Graphics (SVG). The original shows a USA map in grid format in its top left corner. To learn how to create a similar grid-like cartogram in R, please refer to my previous post.

If you have used R before, you will be familiar with the setwd() and read.csv() functions:

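# Load the packages quietly (suppress startup messages and warnings)
suppressMessages(suppressWarnings(library(dplyr)))
suppressMessages(suppressWarnings(library(googleVis)))
# Point R at the folder that holds the data, then read it in
setwd("C:/Users/agohil/Book/blogposts/economics")
race = read.csv("races.csv")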

I end up spending more time cleaning my data than creating a visual representation. We will now clean the data using the dplyr library and add two columns. The first is the frequency (the number of states that changed the law in a given year) and the second is the cumulative sum of the frequency column.

Why do we do this? The answer lies in the story we would like to tell our readers. We want to show the rate of social change in the USA. We will study this by looking at how quickly each state changed its law to legalize interracial marriage.

The visualization is called a step plot or stepped area chart. The concept is not new, but adding animation gives it a new perspective. The first section discusses data manipulation using dplyr; the second section uses base R's plot() function to generate a step plot.

Section 1: Data Manipulation

In our visualization we would like to plot the years on the X axis and the cumulative frequency on the Y axis. To calculate the frequency we first generate a new data frame called data with distinct years. We can do this easily using dplyr’s distinct() function. The first argument to distinct() is the data frame and the second is the column to be used. You should have 20 observations in the data frame called data.

Now we use the group_by() function, which simply creates the groups, and group_size(), which counts the observations in each group. The output of group_size() is the frequency we require. We can then use the cumsum() function to generate the cumulative frequency. Finally, we combine the pieces using the cbind() function. The data is almost complete.

cumsum() takes one argument, a numeric vector. To learn more about any of these functions, type ?cumsum in the R console or refer to the dplyr manual.
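As a quick illustration of what cumsum() returns, here is a minimal sketch on a made-up vector. The numbers are invented for the example and are not taken from races.csv.

```r
# Made-up yearly counts, not values from races.csv
freq <- c(2, 1, 4)

# Each element is the sum of all counts up to and including that position
cum <- cumsum(freq)
print(cum)   # 2 3 7
```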

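# Assumed from the description above (these two steps are not shown):
# one row per year, sorted to match the group order used by group_size()
data = arrange(distinct(race, Year), Year)
data_new = group_by(race, Year)
freq = group_size(data_new)   # number of states per year
data = cbind(data, freq)
cum = cumsum(data$freq)       # running total of states
data = cbind(data, cum)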
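For readers who prefer a single pipeline, the same two columns can probably be built with dplyr's count() and mutate(). This is only a sketch of an alternative, not the code used for this post. The toy data frame and its State column are hypothetical stand-ins for races.csv; only the Year column name comes from the plotting code in this post.

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical stand-in for races.csv: one row per state, with the year
# its law changed. Names and values are made up for illustration.
toy <- data.frame(
  State = c("A", "B", "C", "D", "E"),
  Year  = c(1967, 1780, 1948, 1780, 1967)
)

steps <- toy %>%
  count(Year, name = "freq") %>%   # rows per year, sorted by Year
  mutate(cum = cumsum(freq))       # running total of states

print(steps)
```

Because count() returns the years in sorted order, the frequency and cumulative columns line up without a separate cbind() step.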

Section 2: Data Visualization

To generate a step plot in base R, use the plot() function with the argument type = "s". Most of the arguments in the code below are simple to understand. New users of R can either look at my older posts or type ?plot or ?par in the R console to understand these functions in detail.

par(bty = "n")
plot(data$Year, data$cum, type = "s", col = "red",lwd = 3, xlab = "Years", ylab = "Number of States",main = "Speed of social change in USA", xaxp = c(1780,1967,11))

If you omit the xaxp argument, the plot will not display the last year, 1967. We often want our audience to see the first and last observations, as they may be key to delivering the right message.
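One way to see why 1967 goes missing is to ask R which tick positions it considers "pretty" for this range. The sketch below uses pretty(), which gives a good approximation of the default tick choice; the ticks drawn on an actual plot may differ slightly, so treat this as an illustration rather than an exact reproduction.

```r
# Span of years covered by the plot
yrs <- c(1780, 1967)

# Tick positions R considers "pretty" for this range
default_ticks <- pretty(yrs)
print(default_ticks)

# 1967 is not a round number, so it is not among them
print(1967 %in% default_ticks)
```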

In our visualization the period from 1950 to 1967 is very important: many states started accepting the change, but it took about 17 years for all the states to change the law.

In R, users have two options: they can either customize the axis using the xaxp argument or use the axis() function. Specifying xaxp is very easy and serves our purpose. The first value in xaxp is the first tick value that we would like R to display, and the second is the last. The third value is the number of intervals between the ticks.
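For comparison, here is a rough sketch of the second option, the axis() function. It assumes a small made-up data frame called toy in place of the real data, suppresses the default x axis with xaxt = "n", and places the ticks by hand. With 11 intervals between 1780 and 1967, the ticks fall every 17 years.

```r
# Toy data standing in for the post's data frame (values are made up)
toy <- data.frame(Year = c(1780, 1843, 1887, 1948, 1967),
                  cum  = c(1, 3, 8, 15, 20))

# xaxp = c(first, last, n) asks for n intervals, i.e. n + 1 ticks
ticks <- seq(1780, 1967, length.out = 11 + 1)

# When run as a script, draw to a null device so no display is needed
if (!interactive()) pdf(NULL)

par(bty = "n")
plot(toy$Year, toy$cum, type = "s", col = "red", lwd = 3,
     xlab = "Years", ylab = "Number of States",
     xaxt = "n")                 # suppress the default x axis
axis(side = 1, at = ticks)       # draw our own ticks instead

if (!interactive()) invisible(dev.off())
```

The axis() route takes a few more lines, but it allows unevenly spaced ticks, which xaxp cannot express.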

Readers can learn more about xaxp by typing ?par in the R console. Note that even though R documents xaxp as an argument of the par() function, we can also specify it in the plot() function, as shown in the code above.
