Thursday, October 6, 2016

Generating a pyramid plot in R

pyramid plot


In this post I will describe one of the easier ways to generate a pyramid plot. The first time I saw a pyramid plot was on the New York Times website. The NY Times visualization used data from the American Cancer Society to show new cases of cancer in 2007. The visualization can be viewed here.

I have also seen the Census Bureau use pyramid plots to display the distribution of population by age. The FlowingData website used an animated pyramid plot to show the prevalence of obesity in the USA here.


We will need to install two packages. The plotrix package lets us generate the pyramid plot. The tidyverse package is used for data manipulation and for creating additional columns. Data rarely arrives in the form we need.

# install.packages(c("plotrix", "tidyverse"))  # needs to be run only once
library(plotrix); library(tidyverse)           # load both packages in every session


The data for the pyramid plot was downloaded from the American Cancer Society. Both files used in the code can be found:

Data Cleaning:

Once the two files are downloaded, we read the data using the read.delim( ) function. Note that I have used only a sample of the incidence data, since it is easier to display 9 cancer types than 55 different types of cancer in a single visualization.

I have retained all 55 categories of cancer types in the death.txt file to show how dplyr can be used to clean the data. The tidyverse package loads dplyr along with its functions, so we do not need to load dplyr separately.

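# Read both tab-delimited files; "n/a" entries become NA
incidence = read.delim("incidence.txt", stringsAsFactors = FALSE, na.strings = "n/a")
death = read.delim("death.txt", stringsAsFactors = FALSE, na.strings = "n/a")
# Drop the second column so that death has the same three columns as incidence
death = death[, -2]
colnames(incidence) = c("type", "female", "male")
colnames(death) = c("type", "female", "male")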
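To see what the na.strings argument does, here is a minimal sketch that reads a small tab-separated sample from a string instead of a file. The sample text and its numbers are made up for illustration and are not the American Cancer Society figures.

```r
# Hypothetical two-row sample (made-up numbers), tab-separated like the real files
sample_txt <- "Cancer\tFemale\tMale\nLung\t100\t110\nProstate\tn/a\t200\n"

# text= lets read.delim() parse a string; "n/a" is converted to NA
demo <- read.delim(text = sample_txt, stringsAsFactors = FALSE, na.strings = "n/a")
colnames(demo) <- c("type", "female", "male")

str(demo)            # female is read as a numeric column, not as text
is.na(demo$female)   # the "n/a" entry is now a missing value
```

Without na.strings = "n/a", the female column would be read as character strings and could not be plotted or divided by 1000 later on.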
To clean the data we need to go a step further. From the death data we only need the rows for the cancer types that also appear in the incidence data. We get these by using the inner_join( ) function, which keeps only the rows whose type is present in both data frames. To learn more about this function, type ?inner_join in the R console.

Once the data is clean, we create additional fields using the mutate( ) function from the dplyr package. Dividing the counts by 1000 expresses them in thousands, which keeps the bars and axis labels within the plotting area.

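# Keep only the cancer types present in both data frames
data = inner_join(incidence, death, by = c("type"))
colnames(data) = c("type", "in.female", "in.male", "de.female", "de.male")
# Express all counts in thousands (d.f was missing from the original listing)
data = mutate(data, in.f = in.female/1000,
                    in.m = in.male/1000,
                    d.f = de.female/1000,
                    d.m = de.male/1000)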
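The join and scaling steps can be tried on a toy example first. This is only a sketch: the data frames inc and dth and all their numbers are made up for illustration, and dplyr is loaded directly rather than through tidyverse.

```r
library(dplyr)

# Made-up counts, for illustration only
inc <- data.frame(type = c("Lung", "Liver"),
                  female = c(100000, 6000), male = c(110000, 13000))
dth <- data.frame(type = c("Lung", "Liver", "Brain"),
                  female = c(70000, 5000, 5500), male = c(89000, 11000, 7000))

# inner_join() keeps only the types found in both inputs, so "Brain" is dropped.
# Columns that share a name get the suffixes .x (first input) and .y (second input).
joined <- inner_join(inc, dth, by = "type")
colnames(joined) <- c("type", "in.female", "in.male", "de.female", "de.male")

# mutate() adds new columns and leaves the existing ones untouched
scaled <- mutate(joined, in.f = in.female / 1000, d.f = de.female / 1000)
scaled
```

Renaming the columns right after the join, as the post does, replaces the automatic .x and .y suffixes with names that say which file each column came from.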


Finally, we generate the plot using the pyramid.plot( ) function of the plotrix package. Note that there are other packages that can also generate a pyramid plot.

The types variable is created to label the plot. The first two arguments of the pyramid.plot( ) function are the data to be plotted on the left side and the right side of the plot. The laxlab and raxlab arguments set the axis tick labels on the left and right side. The gap argument controls the gap between the left and right halves, which is where the category labels are drawn. We can adjust this argument until the labels fit well between the two halves.

We need to plot two sets of data for males and females. This is achieved by calling pyramid.plot( ) a second time with add=TRUE, which overlays the number of deaths on the incidence data. In the second call we also pass a space argument, which makes the overlaid bars narrower so the plot looks similar to the one in the New York Times article.

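types = c("Breast","Esophagus","Kidney","Leukemia","Liver","Lung","Lymphoma","Ovary","Pancreas","Prostate")
# The opening line of each call was lost from the listing; the data arguments and
# raxlab below are reconstructed from the description above (females left, males right).
pyramid.plot(data$in.f, data$in.m,
             laxlab = c(0,50,100,150,200,250), raxlab = c(0,50,100,150,200,250),
             top.labels = c("Female","Types of Cancer","Male"), labels = types,
             gap = 25, labelcex = .8, unit = "Cases in 000's", lxcol = "#edf8e9", rxcol = "#f2f0f7")
## [1] 5.1 4.1 4.1 2.1
# Overlay the deaths on the incidence bars
pyramid.plot(data$d.f, data$d.m,
             laxlab = c(0,50,100,150,200,250), raxlab = c(0,50,100,150,200,250),
             top.labels = c("Female","","Male"), labels = types, space = 0.4,
             gap = 25, labelcex = 1, unit = "", lxcol = "#74c476", rxcol = "#9e9ac8", add = TRUE)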
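The overlay technique can be tried without the cancer files. The following is a minimal sketch, assuming plotrix is installed; the labels and all values are made up for illustration.

```r
library(plotrix)

# Made-up values (already in thousands), for illustration only
labs      <- c("A", "B", "C")
inc_left  <- c(120, 80, 40)   # left-hand bars, base layer
inc_right <- c(100, 90, 60)   # right-hand bars, base layer
dth_left  <- c(60, 30, 10)    # left-hand bars, overlay
dth_right <- c(50, 45, 20)    # right-hand bars, overlay
ticks     <- c(0, 50, 100, 150)

# Base layer: pale bars with the default spacing
pyramid.plot(inc_left, inc_right, labels = labs,
             top.labels = c("Female", "Type", "Male"),
             laxlab = ticks, raxlab = ticks, gap = 25, unit = "in 000's",
             lxcol = "#edf8e9", rxcol = "#f2f0f7")

# Overlay: add = TRUE draws on the existing plot; a larger space gives narrower bars
pyramid.plot(dth_left, dth_right, labels = labs,
             top.labels = c("Female", "", "Male"),
             laxlab = ticks, raxlab = ticks, gap = 25, unit = "", space = 0.4,
             lxcol = "#74c476", rxcol = "#9e9ac8", add = TRUE)
```

Both calls use the same gap so that the overlaid bars start at the same position as the base bars. The labels vector must have one entry per bar, otherwise pyramid.plot( ) stops with an error.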