Saturday, November 17, 2012

Data Visualization through QQ plots and Normal Distribution:

Abstract


This article is an introduction to techniques and usefulness of QQ plots in Excel, Matlab and R. The main objective is to highlight the importance of using qq plots to convey meaningful analysis.

Introduction:

 QQ plots in statistics refer to quantile plots. They have been widely used in data analysis for comparing probability distributions. Statisticians use qq plots for testing if the error terms in Ordinary least square estimation are normally distributed. Equity analysts have used qq plots for determining if the distribution of returns is normal and further to detect fat tails. Application of QQ plots is not just limited to the fields of finance, economics and statistics but it has also been widely applied in the fields of science and engineering. 

The following read is divided into 4 sections. The first section introduces the users to plotting a normal curve in excel as well as the qq plots. The second section introduces the users to code qq plot in R. The third section applies the data and performs the plotting function using Matlab. The primary objective is to learn on various methods to visualize data.

QQ plots in Excel: 

Step 1: Open a new excel sheet and enter 6 different headings – Rank, Probability, Z score(theory), Data, standardized , zscore exp. I will discuss these headings as we start filling up each column with either data or formula.

Step 2: Under the data column (D2) enter the raw data that you would like to test for normality. If you do not have a standard data you can always use the =rand() command in excel to generate a data set for testing purpose. Please note that once you generate random data set copy and paste special values in the data column, or else every time you hit ENTER the data will change.

 Step 3: under column standardized(cell E2) we standardize the raw data. In Excel use the following formula : =STANDARDIZE(D2,AVERAGE($D$2:$D$43),STDEV($D$2:$D$43))

Step 4: once you standardize the data , copy the data -> column titled zscore exp -> paste special the data with values. Then sort the data in ascending order.

Step 5: Go to the Rank column and use the following formula =RANK(F2,$F$2:$F$43,1) This would rank your data.

Step 6: Go to the column probability and enter the following formula:

 =(A2-0.5)/COUNT($F$2:$F$43)

Since we have standardized the data , area under the curve is 1.

Step 7: Finally , go to column titled zscore and enter the following: =NORMSINV (B2)

Now plot a scatter diagram using the zscores under step 6 as X axis values and zscore under step 4 on y axis. You should see a plot that looks like a S curve. This is nothing but what statisticians call a QQ plot. Q meaning Quantile plot. Individuals can also plot a regression line through the plot and this would visually tell you how good is the fit. An r^2 of 1 would mean data is normally distributed. I have plotted different data sets here to show the users varying degrees of convergence and divergence from normality.

In Excel the easiest way to plot a regression line is by clicking on the image and then going to the Design ribbon on the top and clicking on 9th chart design. Your plot should look like the image bellow:



I have used two data sets alongwith excel function to show how it is done in excel.Please use the link below:
https://docs.google.com/open?id=0Bw3VZ9z5dDlPQlJlT0FwNHRZdVk

In the sheet titles QQplot i have plotted a random data set using the rand function, since excel generates this data set with mean 0.5, the qqplot will have a high rsquare value and regression line has an amazingly great fit. This confirms that the data set has a normal distribution. Also, i have plotted another qq plot for a dataset generated by me, and this dataset is plotted in red. The qqplot of this data deviates from normal and has a low rsquare value.

users can also use the NORMDIST function in excel to generate a probability mass function of a normal distribution. I have made use of this function under the sheet titled Normal. If you observe the scatter plot, the random data has a bell shape (red) but the data  generated by me has a positive skew and deviates from normal(blue)


21 comments: