Saturday, April 13, 2013

Analyzing data using Boxplots

Box Plots

In my opinion, a lot of economic data can be understood better with box plots than with bar charts. Box plots are especially useful when research is in its initial stages: after collecting the data, a researcher can plot it as a boxplot to get a quick overview, and an analyst can draw several kinds of boxplots and infer a great deal from them. Boxplots are also a quick and easy way to compare multiple data series. Economists usually use scatter plots, Q-Q plots and histograms to check the skewness or normality of their data and to detect outliers. A boxplot gives a similar overview, but instead of using the mean and standard deviation to show the center of the distribution and its spread, it uses the median and the interquartile range.
My main purpose in this post is to introduce R users to boxplots and then go a step further and analyse data using various boxplot commands.
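To make the contrast between mean/standard deviation and median/interquartile range concrete, here is a minimal sketch on made-up numbers (the vector `x` is a toy example and is not taken from the wage data). It only illustrates why a boxplot's summary statistics react less to a single extreme value.

```r
# Toy data (made up for illustration): mostly small values plus one large value
x <- c(2, 3, 3, 4, 5, 5, 6, 40)

# Mean and standard deviation are pulled upward by the single large value
mean_x <- mean(x)
sd_x <- sd(x)

# Median and interquartile range barely react to it
median_x <- median(x)
iqr_x <- IQR(x)

print(c(mean = mean_x, sd = sd_x, median = median_x, IQR = iqr_x))
```

Dropping the value 40 from `x` and rerunning the code changes the mean and standard deviation far more than the median and the interquartile range.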


The data for the boxplots is borrowed from Jeffrey Wooldridge's book. It comprises 526 observations and 24 variables.


The most insightful way to understand a boxplot is to plot one. The following commands read the data and then plot the graph.
wg = read.csv("dataw.csv")  ## read in the raw data file
summary(wg$wage)  ## summary statistics of the wage column
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.53    3.33    4.65    5.90    6.88   25.00
boxplot(wg$wage)  ## plot the wage data
[Figure: boxplot of wage]
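As discussed above, the median, not the mean, marks the center of the distribution in a boxplot; hence the thick black line in the box shows the median (4.65). The box itself spans approximately the interquartile range: its edges are the hinges, which are close to the first and third quartiles. By default the whiskers extend to the most extreme observations within 1.5 times the box length from the box, and all points beyond the whiskers are drawn individually as outliers. Here the outliers all lie above the upper whisker, which shows us that the data is skewed to the right and not symmetric.
Multiple boxplots can be plotted using the following R command.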
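The numbers behind the box, the whiskers and the outlier points can be inspected with `boxplot.stats()` from base R. The sketch below uses a made-up vector `x` (not the wage data) and recomputes the fences by hand, assuming the default whisker coefficient of 1.5.

```r
# Toy data (made up): not the wage data from the post
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# boxplot.stats() returns what boxplot() draws:
#   $stats: lower whisker, lower hinge, median, upper hinge, upper whisker
#   $out:   points beyond the whiskers
s <- boxplot.stats(x)
print(s$stats)
print(s$out)

# Fences by hand, using the hinges and the default coef = 1.5
hinge_spread <- s$stats[4] - s$stats[2]
lower_fence <- s$stats[2] - 1.5 * hinge_spread
upper_fence <- s$stats[4] + 1.5 * hinge_spread

# Points outside the fences are the ones plotted individually
flagged <- x[x < lower_fence | x > upper_fence]
print(flagged)
```

Running the same call on `wg$wage` would list the wage observations that appear as individual points in the plot.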
boxplot(wg$wage ~ as.factor(wg$female))  ## Multiple boxplots
[Figure: boxplots of wage by female (0 and 1)]
Suppose we want to study how the distribution of wage differs between males and females in the data set. The variable female is a dummy variable that takes the value 1 if the person is female and 0 if male. One can infer from the plot that, on average, males are paid a higher wage than females. We can beautify the graph by adding colors and labels to the plot.
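A grouped boxplot can be cross-checked against per-group summary numbers. The sketch below is only an illustration: `toy` is a hypothetical, randomly generated stand-in for the wage data (the real dataw.csv is not reproduced here), so its numbers say nothing about actual wages.

```r
# Hypothetical stand-in for the wage data, generated at random
set.seed(1)
toy <- data.frame(
  female = rep(c(0, 1), each = 50),  # dummy: 0 = male, 1 = female
  wage   = c(rlnorm(50, meanlog = 1.8, sdlog = 0.5),
             rlnorm(50, meanlog = 1.5, sdlog = 0.5))
)

# Median and mean wage per group, named by the dummy value
group_median <- tapply(toy$wage, toy$female, median)
group_mean   <- tapply(toy$wage, toy$female, mean)
print(rbind(median = group_median, mean = group_mean))

# The formula interface draws one box per level of the grouping variable
boxplot(wage ~ as.factor(female), data = toy)
```

With the real data, the same `tapply()` calls on `wg$wage` and `wg$female` would give the group medians that appear as the thick lines in the two boxes.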
boxplot(wg$wage ~ as.factor(wg$female), col = ("blue"), xlab = "sex", ylab = "wage (average hourly earnings in $)")
[Figure: boxplots of wage by sex, both boxes in blue, with axis labels]
Now we can add more information to the graph by differentiating between males and females by color.
boxplot(wg$wage ~ as.factor(wg$female), col = c("blue", "red"), names = c("Male", 
    "Female"), xlab = "sex", ylab = "wage (average hourly earnings in $)")
[Figure: boxplots of wage by sex, Male in blue and Female in red]
In the above image you can observe that I have defined “col” as a vector of two colors and have also passed the “names” argument as a vector. In my next post I will try to add more information to this boxplot and also show users how to draw multiple plots and combine plots using base R graphics.
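An alternative to passing “names” by hand is to label the dummy variable itself, so the box labels follow from the factor levels. This is a sketch on a small made-up data frame (`toy` and its `sex` column are hypothetical names, not part of the original data set).

```r
# Hypothetical toy data, standing in for the wage data used above
toy <- data.frame(
  female = c(0, 0, 0, 1, 1, 1),
  wage   = c(5.2, 7.1, 6.3, 4.0, 5.5, 4.8)
)

# Attach readable labels to the dummy: level 0 -> "Male", level 1 -> "Female"
toy$sex <- factor(toy$female, levels = c(0, 1), labels = c("Male", "Female"))

# Box labels now come from the factor levels; colors are applied in level order
boxplot(wage ~ sex, data = toy, col = c("blue", "red"),
        xlab = "sex", ylab = "wage (average hourly earnings in $)")
```

Because the labels live in the data, they cannot drift out of step with the order of the boxes, which is a risk when “names” is typed separately.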


We have studied the data using boxplots and come to the conclusion that, on average, females are paid lower wages. We can also observe the outliers in the data set.
