Thursday, October 6, 2016

Generating a pyramid plot in R

pyramid plot

Introduction:

In this post I will describe one of the easier ways to generate a pyramid plot. The first time I saw a pyramid plot was on the New York Times website. The NY Times visualization used data from the American Cancer Society to show new cases of cancer in 2007. The visualization can be viewed here.

I have also seen the Census Bureau use pyramid plots to display the distribution of population by age. The FlowingData website used an animated pyramid plot to show the prevalence of obesity in the USA here.

Packages:

We need two packages. The plotrix package provides the function that draws the pyramid plot. The tidyverse package is used for data manipulation, including creating additional columns. Data is rarely in the form we need, so some cleaning is almost always required.

# install.packages(c("plotrix", "tidyverse"))  # run only once
library(plotrix)
library(tidyverse)

Data:

The data for the pyramid plot was downloaded from the American Cancer Society. Both files used in the code can be found:

Data Cleaning:

Once the two files are downloaded, we read the data using the read.delim() function. Note that I have used only a sample of the incidence data, since it is easier to display 9 cancer types than 55 different types in a single visualization.

I have retained all 55 categories of cancer types in the death.txt file to show how dplyr can be used to clean the data. The tidyverse package loads dplyr, so we no longer need to load the dplyr package separately.

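# Read the tab-delimited files; "n/a" entries become NA
incidence = read.delim("incidence.txt", stringsAsFactors = FALSE, na.strings = "n/a")
death = read.delim("death.txt", stringsAsFactors = FALSE, na.strings = "n/a")
# Drop the second column, which is not used in the plot
incidence = incidence[, -2]
death = death[, -2]
# Replace missing values with 0
incidence[is.na(incidence)] = 0
death[is.na(death)] = 0
# Give both data frames the same column names
colnames(incidence) = c("type", "female", "male")
colnames(death) = c("type", "female", "male")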
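
As a small illustration of what the reading and cleaning steps do, here is a sketch that runs on a made-up two-row table instead of the real files. The text, the numbers, and the `note` column are assumptions for illustration only; the layout of the actual incidence.txt may differ.

```r
# Toy tab-delimited text standing in for incidence.txt (all values are made up)
toy_text <- "type\tnote\tfemale\tmale\nLung\tx\t100\t120\nOvary\tx\t20\tn/a\n"

# na.strings = "n/a" makes read.delim() treat the literal text "n/a" as NA
toy <- read.delim(text = toy_text, stringsAsFactors = FALSE, na.strings = "n/a")

toy <- toy[, -2]        # drop the second (hypothetical) column
toy[is.na(toy)] <- 0    # replace the NA created from "n/a" with 0
colnames(toy) <- c("type", "female", "male")
print(toy)
```

Without na.strings, the male column would be read as text because of the "n/a" entry, and it could not be plotted as a number.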

To clean the data we need to go a step further. From the death data we only need the rows for the 9 cancer types kept in the incidence sample. The inner_join() function does this by keeping only the rows whose type appears in both data frames. To learn more about this function, type ?inner_join in the R console.

Once the data is clean, we create additional columns using the mutate() function from the dplyr package. Dividing the counts by 1,000 keeps the values within the axis range used in the plot.

data= inner_join(incidence, death, by=c("type"))
colnames(data)= c("type", "in.female","in.male","de.female","de.male")
data= mutate(data, in.f= in.female/1000,
                    in.m = in.male/1000,
                    d.f=de.female/1000,
                    d.m= de.male/1000)
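
To make the join and the scaling step easier to follow, here is a minimal sketch on two made-up data frames. The numbers and the extra "Skin" row are assumptions for illustration; only inner_join() and mutate() are taken from the post.

```r
library(dplyr)

# Made-up numbers: the incidence sample has fewer cancer types than the death data
toy_incidence <- data.frame(type = c("Lung", "Ovary"),
                            female = c(100000, 20000),
                            male = c(120000, 0),
                            stringsAsFactors = FALSE)
toy_death <- data.frame(type = c("Lung", "Ovary", "Skin"),
                        female = c(70000, 15000, 3000),
                        male = c(90000, 0, 5000),
                        stringsAsFactors = FALSE)

# Only types present in BOTH data frames survive the inner join, so "Skin" is dropped
toy_joined <- inner_join(toy_incidence, toy_death, by = "type")

# Columns with the same name get .x (left table) and .y (right table) suffixes
print(colnames(toy_joined))
# → "type" "female.x" "male.x" "female.y" "male.y"

colnames(toy_joined) <- c("type", "in.female", "in.male", "de.female", "de.male")

# Scale the counts to thousands so they match the axis labels used in the plot
toy_joined <- mutate(toy_joined,
                     in.f = in.female / 1000,
                     d.f = de.female / 1000)
print(toy_joined)
```

The suffixes explain why the post renames the columns right after the join: the first female/male pair comes from the incidence data and the second pair from the death data.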

Plot

Finally, we generate the plot using the pyramid.plot() function of the plotrix package. Note that other packages can also generate pyramid plots.

The types variable holds the labels for the plot. The first two arguments of the pyramid.plot() function are the data to be plotted on the left and right sides of the plot. The laxlab and raxlab arguments set the axis tick labels on the left and right sides. The gap argument sets the width of the gap between the left and right bars. We can adjust this argument so that the category labels fit well between the two sides.

We need to plot two sets of data for males and females: incidence and deaths. This is achieved by calling pyramid.plot() a second time with add=TRUE, which overlays the number of deaths on the incidence data. We also pass a space argument in the second call to make the plot look similar to the one in the New York Times article.

types= c("Breast","Esophagus","Kidney","Leukemia","Liver","Lung","Lymphoma","Ovary","Pancreas","Prostate")
pyramid.plot(data$in.f,data$in.m,
             laxlab= c(0,50,100,150,200,250),
             raxlab=c(0,50,100,150,200),
             top.labels=c("Female","Types of Cancer","Male"),labels=types,
             gap  =25, labelcex = .8, unit="Count in 000's",lxcol="#edf8e9", rxcol="#f2f0f7")
## [1] 5.1 4.1 4.1 2.1
pyramid.plot(data$d.f,data$d.m,
             laxlab= c(0,50,100,150,200,250),
             raxlab=c(0,50,100,150,200),
             top.labels=c("Female","","Male"),labels=types,space= 0.4,
             gap  =25, labelcex = 1, unit="",lxcol="#74c476", rxcol="#9e9ac8", add=TRUE)

## [1] 4 2 4 2

I have explained only the most essential arguments of the pyramid.plot() function. To learn more about the function, type ?pyramid.plot in the R console. The plot is also missing a legend, which is essential. One can be added using the legend() function.
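
As a rough sketch of how a legend could be added, the example below draws a small pyramid plot from made-up values and then calls legend(). The category names, the numbers, the legend position, and the legend text are all assumptions for illustration, not part of the original plot.

```r
library(plotrix)

# Made-up values (in thousands) for three hypothetical categories
toy_labels <- c("A", "B", "C")
new_f  <- c(90, 60, 30)
new_m  <- c(100, 70, 40)
dead_f <- c(50, 20, 10)
dead_m <- c(60, 30, 15)

# First call draws the light bars; it returns the margins in effect before the call
old_mar <- pyramid.plot(new_f, new_m, labels = toy_labels,
                        top.labels = c("Female", "Type", "Male"),
                        gap = 15, unit = "Count in 000's",
                        lxcol = "#edf8e9", rxcol = "#f2f0f7")

# Second call overlays the darker, narrower bars on the same plot
pyramid.plot(dead_f, dead_m, labels = toy_labels,
             gap = 15, unit = "", space = 0.4,
             lxcol = "#74c476", rxcol = "#9e9ac8", add = TRUE)

# One legend entry per colour used above
legend("topright",
       legend = c("New cases (female)", "Deaths (female)",
                  "New cases (male)", "Deaths (male)"),
       fill = c("#edf8e9", "#74c476", "#f2f0f7", "#9e9ac8"),
       bty = "n", cex = 0.7)

par(mar = old_mar)  # restore the margins that pyramid.plot() changed
```

The "topright" position works here because the shortest bars are at the top of the toy data; with real data the position may need adjusting so the legend does not cover any bars.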

To make your plot look like the one in the NY Times, export the plot as a JPEG image and use your favorite editor to add text or labels. It is much easier to do this outside R.
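
One way to export the plot from code is to open a JPEG graphics device before plotting and close it afterwards. The sketch below uses made-up values, and the file name, image size, and quality setting are assumptions you would adjust for your own plot.

```r
library(plotrix)

# Hypothetical output location; replace with a path of your choice
out_file <- file.path(tempdir(), "pyramid.jpg")

# Open the JPEG device: everything plotted from here on goes into the file
jpeg(out_file, width = 800, height = 600, quality = 95)

# Made-up values (in thousands) for three hypothetical categories
pyramid.plot(c(90, 60, 30), c(100, 70, 40),
             labels = c("A", "B", "C"),
             top.labels = c("Female", "Type", "Male"),
             gap = 15, unit = "Count in 000's",
             lxcol = "#edf8e9", rxcol = "#f2f0f7")

# Close the device so the file is actually written to disk
dev.off()
```

Any overlay calls with add=TRUE need to come before dev.off(), otherwise they are drawn on a different device and do not appear in the saved image.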

The following is the entire code used to generate the plot.

# install.packages(c("plotrix", "tidyverse"))  # run only once
library(plotrix)
library(tidyverse)
incidence= read.delim("incidence.txt" , stringsAsFactors = FALSE, na.strings = "n/a")
death=read.delim("death.txt",stringsAsFactors = FALSE, na.strings = "n/a")
incidence=incidence[,-2]
death= death[,-2]
incidence[is.na(incidence)]=0
death[is.na(death)]=0
colnames(incidence)= c("type","female","male")
colnames(death)= c("type","female","male")
data= inner_join(incidence, death, by=c("type"))
colnames(data)= c("type", "in.female","in.male","de.female","de.male")
data= mutate(data, in.f=in.female/1000,
                    in.m = in.male/1000,
                    d.f=de.female/1000,
                    d.m= de.male/1000)

types= c("Breast","Esophagus","Kidney","Leukemia","Liver","Lung","Lymphoma","Ovary","Pancreas","Prostate")

pyramid.plot(data$in.f,data$in.m,
             laxlab= c(0,50,100,150,200,250),
             raxlab=c(0,50,100,150,200),
             top.labels=c("Female","Types of Cancer","Male"),labels=types,
             gap  =25, labelcex = .8, unit="Count in 000's",lxcol="#edf8e9", rxcol="#f2f0f7")

pyramid.plot(data$d.f,data$d.m,
             laxlab= c(0,50,100,150,200,250),
             raxlab=c(0,50,100,150,200),
             top.labels=c("Female","","Male"),labels=types,space= 0.4,
             gap  =25, labelcex = 1, unit="",lxcol="#74c476", rxcol="#9e9ac8", add=TRUE)

3 comments:

  1. Can you also make a chart showing the value of a marker by its size (like year 1988, 1998, 2008, yield 5, 15, 20)? I want the second marker to be 3 times the size of the first, and the third 4 times bigger than the first. You can use a png/jpg image for the markers. This will help me talk to farmers about benefits. I tried but couldn't succeed.

  2. Hi Duleep,

    I am a bit unclear as to how you want to generate the plot. When you say "value of a marker by its size (like, year 1988, 1998, 2008, yield 5,15,20)", do you mean that the actual values should be displayed in the plot, with each value sized according to the data?

    I will try to get it working for you.

  3. Can you also try making the bars separate from each other instead of overlapping?
    Thank you so much.
