Saturday, March 23, 2013

Writing a simple function in R


Introduction: This post introduces new R users to writing a function in R and running it to get sensible output. We will write a very small function that calculates a t statistic for testing the equality of two means. Section 1 introduces the concept, Section 2 implements the R code, and Section 3 executes the .R function file.

Section 1: Equality of means: Anyone who has taken basic statistics has almost certainly been taught hypothesis testing for equality of means. Still, to understand how the function is coded in R, it is essential to know the concept.

We are interested in testing a simple hypothesis of equality of means, i.e. that the mean of population X equals the mean of population Y. Economists are often interested in comparing the outcome for a group exposed to one treatment with the outcome for a group exposed to a different one. Some examples are given below:

1) Average GMAT score of students who graduated with a math degree against those who graduated with a degree in literature.
2) Average yield of a crop on farms that used a certain fertilizer against the yield on farms that used a different fertilizer.

For each pair of populations we would like to test whether the mean of population 1 (Mu1) equals the mean of population 2 (Mu2). Since the true population means and standard deviations are unknown, we use the sample means and standard deviations to estimate them.

Step 1: Null hypothesis: H0: Mu1 = Mu2
Alternative hypothesis: H1: Mu1 ≠ Mu2

Step 2: Calculate the T statistic using the following set of formulae:
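Read off the R code in Section 2, the calculation is the pooled-variance two-sample statistic. The notation here is an assumption made for illustration: m and n are the sample sizes, s_x and s_y the sample standard deviations, and x-bar and y-bar the sample means.

```latex
s_p = \sqrt{\frac{(m-1)\,s_x^2 + (n-1)\,s_y^2}{m+n-2}}, \qquad
T = \frac{\bar{x} - \bar{y}}{s_p \sqrt{\frac{1}{m} + \frac{1}{n}}}
```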

Step 3: Compare the calculated T with the critical t value from the table at m + n - 2 degrees of freedom, and reject the null hypothesis if the absolute value of T exceeds that critical value.

Section 2: To create a function, open a new R script from the File drop-down on the menu bar. The function is named teqmu1. It takes vector x and vector y as inputs and returns the T statistic.

The first line of the function defines the name of the function followed by the input variables. The R function length is used to get the lengths of the x and y vectors, which are then used in the formula for the T statistic. We also use the R functions sd, which calculates the standard deviation, and mean, which calculates the mean, of the vectors x and y. Note that R is case-sensitive, so these must be typed in lower case.

teqmu1 <- function(x, y) {
    m <- length(x)  # size of sample x
    n <- length(y)  # size of sample y
    # pooled standard deviation of the two samples
    sp <- sqrt(((m - 1) * sd(x)^2 + (n - 1) * sd(y)^2)/(m + n - 2))
    t <- (mean(x) - mean(y))/(sp * sqrt(1/m + 1/n))
    return(t)
}

Section 3:

To execute the function we first save the script as teqmu1.r. R does not require the file name to match the function name, but using the same name (here teqmu1.r) makes the file easy to find.

In order to run this function, users have to source it first. To source the function, open the Code drop-down on the menu bar, choose the option to source a file, and select the function file.
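Sourcing can also be done from the console with base R's source() function. The sketch below is only an illustration: it writes the function to a temporary file so that it runs on its own, whereas in practice you would pass the path of your own saved teqmu1.r.

```r
# Write the function to a temporary file so this sketch is self-contained.
# In practice, skip this step and source your own saved teqmu1.r.
path <- file.path(tempdir(), "teqmu1.r")
writeLines(c(
    "teqmu1 <- function(x, y) {",
    "    m <- length(x)",
    "    n <- length(y)",
    "    sp <- sqrt(((m - 1) * sd(x)^2 + (n - 1) * sd(y)^2)/(m + n - 2))",
    "    t <- (mean(x) - mean(y))/(sp * sqrt(1/m + 1/n))",
    "    return(t)",
    "}"
), path)

# source() evaluates the file, which defines teqmu1 in the workspace
source(path)

# Confirm the function is now available
is.function(teqmu1)
```

If the file sits in the current working directory, source("teqmu1.r") is enough.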

Now, to execute the function, create two small example vectors in R:

data1 = c(1, 4, 3, 6, 5)  # vector for X
data2 = c(5, 4, 7, 6, 10)  # vector for Y
teqmu1(data1, data2)  # executing the function
## [1] -1.938

The last step is to call the function. Once you have the T statistic, look up the critical t value from the table as described in Step 3, and decide whether or not to reject the null hypothesis based on the Step 3 rule.
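The table lookup can also be done in R. The following is a minimal sketch, assuming a two-sided test at a 5% significance level (the value of alpha and the variable names are choices made here, not part of the original function). It uses qt() for the critical value and cross-checks the statistic against base R's t.test() with equal variances.

```r
# Pooled two-sample t statistic, as defined in Section 2
teqmu1 <- function(x, y) {
    m <- length(x)
    n <- length(y)
    sp <- sqrt(((m - 1) * sd(x)^2 + (n - 1) * sd(y)^2)/(m + n - 2))
    (mean(x) - mean(y))/(sp * sqrt(1/m + 1/n))
}

data1 <- c(1, 4, 3, 6, 5)   # vector for X
data2 <- c(5, 4, 7, 6, 10)  # vector for Y

alpha  <- 0.05                                # assumed significance level
t_stat <- teqmu1(data1, data2)
dof    <- length(data1) + length(data2) - 2   # degrees of freedom, m + n - 2
t_crit <- qt(1 - alpha/2, dof)                # two-sided critical value (replaces the table)
reject <- abs(t_stat) > t_crit                # TRUE means reject H0

# Cross-check: t.test with var.equal = TRUE uses the same pooled formula
builtin <- t.test(data1, data2, var.equal = TRUE)

round(t_stat, 3)  # -1.938
round(t_crit, 3)  # 2.306
reject            # FALSE
```

With these vectors the absolute value of T is below the critical value, so the null hypothesis of equal means is not rejected at the 5% level.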

Saturday, March 16, 2013

Scatterplot in R

Scatter plots in R <!-- Styles for R syntax highlighter

Abstract: The main purpose of this page is to learn to plot in R. It explains how to draw a scatter plot in R and how the plot can be used to visualize and interpret data.
Introduction: Scatter plots are widely used in economics and finance to get a basic idea of the underlying dataset.
Data:
The dataset used is the house price dataset that accompanies the undergraduate textbook “Introductory Econometrics” by Jeffrey M. Wooldridge. The data consist of 506 observations on the following variables:
  1. price - median housing price, $
  2. crime - crimes committed per capita
  3. nox - nitrous oxide, parts per 100 mill.
  4. rooms - avg number of rooms per house
  5. dist - weighted dist. to 5 employ centers
  6. radial - accessibility index to radial highways
  7. proptax - property tax per $1000
  8. stratio - average student-teacher ratio
  9. lowstat - % of people 'lower status'
  10. lprice - log(price)
  11. lnox - log(nox)
  12. lproptax - log(proptax)
For the purpose of this analysis we use only price and crime.
R code: Since the data was available in raw text format, I copied and pasted it into Excel and saved it as a CSV file in my R working directory. I then use the read.csv command to read in the CSV file and store the dataset in hprice. I also use the colnames command to list the column names and the summary command to briefly look at the center and spread of the X and Y variables.
hprice <- read.csv("hprice.csv", header = TRUE, sep = ",")
colnames(hprice)
##  [1] "price"    "crime"    "nox"      "rooms"    "dist"     "radial"  
##  [7] "proptax"  "stratio"  "lowstat"  "lprice"   "lnox"     "lproptax"
summary(hprice$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5000   16800   21200   22500   25000   50000
summary(hprice$crime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.01    0.08    0.26    3.61    3.68   89.00
The following command draws a scatter plot in R. By default the points are drawn as open circles. To fill the circles, pass the “pch” argument to plot (pch = 19 gives solid circles).
plot(hprice$price, hprice$crime, col = "BLUE")
[Figure: scatter plot of crime against price, open blue circles]
plot - basic plot command in R
hprice$price - X variable in the plot
hprice$crime - Y variable in the plot
col - color of the points
plot(hprice$price, hprice$crime, pch = 19, col = "BLUE")
[Figure: the same scatter plot with filled circles (pch = 19)]
You will observe that the points are too close to each other, so an additional argument, “cex”, can be added. cex scales the plotting symbols; cex = 0.5 draws the circles at half the default size.
plot(hprice$price, hprice$crime, pch = 19, cex = 0.5, col = "BLUE")
[Figure: the same scatter plot with smaller filled circles (cex = 0.5)]
The image above provides much more information about the relationship between the X and Y variables. We can observe that as home prices rise, the crime rate tends to drop. One plausible explanation is that people do not like to live in areas where crime rates are high, so the supply of houses there exceeds the demand, resulting in lower prices. Conversely, demand for houses is likely to be much higher where crime rates are low, as people are willing to pay extra for their security and safety. A scatter plot alone shows association, not the direction of cause.
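As a small follow-up, the sketch below adds axis labels and a title to the same kind of plot and computes the sign of the linear association with cor(). The data frame toy is made up purely so the example runs without hprice.csv; it is not the Wooldridge data, and the label text is our own wording.

```r
# Made-up stand-in for hprice (NOT the Wooldridge data), only to make the sketch runnable
toy <- data.frame(
    price = c(12000, 18000, 21000, 25000, 40000),
    crime = c(9.5, 3.2, 0.8, 0.3, 0.05)
)

# Same plot call as above, plus axis labels and a title
plot(toy$price, toy$crime,
     pch = 19, cex = 0.5, col = "blue",
     xlab = "Median housing price ($)",
     ylab = "Crimes committed per capita",
     main = "Crime against price (toy data)")

# Pearson correlation; a negative value means crime falls as price rises
r <- cor(toy$price, toy$crime)
r < 0
```

To apply this to the real data, replace toy with hprice once the CSV file has been read in.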
-->