Wednesday, April 17, 2013

Linear Algebra in R (Part 1)


1 Generating Vectors and Matrices in R:
In this section we discuss the most basic operations: creating a one-dimensional array (a vector) and a two-dimensional array (a matrix). We then look at how to manipulate the elements of a matrix and carry out matrix operations.
How to define a vector in R?
To generate the vector
x = (1 2 3 4 5 6 7 8 9)
in R, use the following command.

x = c(1,2,3,4,5,6,7,8,9) 

In the above command c( ) combines its arguments into a vector, and each element is separated by a comma. Strictly speaking, a vector created with c( ) has no row or column orientation (in matrix operations R treats it as a column). Taking the transpose converts it into a matrix with a single row:
y = t(x)
 
The t() command returns a 1x9 matrix; applying it a second time, t(t(x)), gives a 9x1 column. Whenever users run the c( ) command above, R creates a numeric vector, and all of its elements can be displayed by simply typing the following in the R console.
 x
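
A quick way to see what c() and t() actually return is to inspect the result with dim(). The following is only a small sketch using the same x as above:

```r
x = c(1,2,3,4,5,6,7,8,9)
is.null(dim(x))   # TRUE: a plain vector carries no dimensions
y = t(x)
dim(y)            # 1 9: t() returns a matrix with one row
dim(t(y))         # 9 1: transposing again gives a single column
```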

Word of caution: careful with the comma!
Users who are not very familiar with R might run into an error when a comma is placed after the last element of the vector. This results in the following error message:
> x = c(1,2,3,4,5,6,7,8,9,)  
Error in c(1, 2, 3, 4, 5, 6, 7, 8, 9, ) : argument 10 is empty
 
To define a vector with some special elements, such as a square root or pi, you can use the following command:
> f = c(1,2,4*pi,sqrt(2), pi)
> f
[1] 1.000000  2.000000 12.566371  1.414214  3.141593

To use the constant pi or take the square root of an element, users can simply write pi and sqrt().


There is more than one way to define a matrix in R. A matrix is a two-dimensional array. One way to generate a matrix is to first create three vectors using the commands described above and then combine them into a 3x3 matrix with the rbind command.
 l = c(1,4,7)
 m= c(2,5,8)
 n = c(3,6,9)
 rbind(l,m,n)
 
The above mentioned R commands will generate the following matrix.
  [,1] [,2] [,3]
l    1    4    7
m    2    5    8
n    3    6    9

Alternatively, users can use the cbind command to build a matrix from vectors treated as columns. Binding vectors one at a time is not a very practical way to define a larger matrix, however. A simpler way is to use the matrix command.
d = matrix((1:10),2,5)

This command will generate a matrix of 2 rows and 5 columns (2x5). The following matrix will appear in R.
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10

Note that the elements are filled column by column. To change this, the user can add one more argument to the matrix command.
 d = matrix((1:10),2,5, byrow = TRUE)

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
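
To tie the three approaches together, the sketch below checks that rbind() of the vectors l, m and n matches a column-filled matrix(), while cbind() matches the byrow = TRUE version. The names by_col and by_row are only illustrative:

```r
l = c(1,4,7)
m = c(2,5,8)
n = c(3,6,9)
by_col = matrix(1:9, 3, 3)                 # default: filled column by column
by_row = matrix(1:9, 3, 3, byrow = TRUE)   # filled row by row
all(unname(rbind(l, m, n)) == by_col)      # TRUE
all(unname(cbind(l, m, n)) == by_row)      # TRUE
```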



When users specify byrow = TRUE, R fills in the elements row by row. R users can also use mathematical functions (trigonometric and otherwise) such as the following:
ab = abs(3)   ## absolute value
s = sin(3)    ## sine
p = cos(6)    ## cosine
d = tan(1)    ## tangent
r = exp(1)    ## exponential
kk = log(2)   ## natural logarithm
round(3.564)  ## rounds to the nearest integer, in this case 4
round(2.2)    ## results in just 2
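
These functions also work element by element on whole vectors, which is what makes them handy for matrix work later on. A small sketch (the vector v is just an example):

```r
v = c(1, 4, 9)
sqrt(v)            # 1 2 3: applied to each element
abs(c(-3, 3))      # 3 3
round(3.564, 2)    # 3.56: the second argument sets the number of decimals
exp(log(2))        # 2, since exp() and log() are inverses
```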


To generate a matrix of all zeros:
> s = matrix(rep(0,6),2,3)
> s
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0

The rep() command is very useful in R. In the above R command, rep(0,6) repeats the value 0 six times.
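
As a rough illustration of what else rep() can do, the sketch below uses its times and each arguments, and also shows that matrix() will recycle a single value on its own. The name s2 is only illustrative:

```r
rep(0, 6)                 # 0 0 0 0 0 0
rep(c(1, 2), times = 3)   # 1 2 1 2 1 2
rep(c(1, 2), each = 3)    # 1 1 1 2 2 2
s2 = matrix(0, 2, 3)      # a single value is recycled to fill the matrix
```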
The length() command returns the number of elements in a vector.
f = (1:3)   ## generate a vector
length(f)   ## returns the length of the vector f
[1] 3

The dim() command returns the dimensions of a matrix.
> gt = matrix((1:8),4,2); ## generate a matrix
> dim(gt) ## calculate the dimension of the above matrix
[1] 4 2
> length(gt) ## Length of the matrix
[1] 8

Both commands can be used on a matrix. The output of dim() is 4 2 because the matrix has 4 rows and 2 columns, while length() returns the total number of elements, 4*2 = 8. Further, these commands can be used to manipulate matrices or build additional matrices, as shown below:
> t=matrix(rep(0,length(gt)),dim(gt))
> t
     [,1] [,2]
[1,]    0    0
[2,]    0    0
[3,]    0    0
[4,]    0    0

R lets users combine commands in a single statement. Here we generate a matrix of zeros with the same dimensions as gt. Note that rep() expects a count rather than a pair of dimensions, so length(gt) is used to produce the right number of zeros and dim(gt) to shape them into a matrix.
cbind(gt,t)
     [,1] [,2] [,3] [,4]
[1,]    1    5    0    0
[2,]    2    6    0    0
[3,]    3    7    0    0
[4,]    4    8    0    0

Here, we are combining the two matrices to create a new matrix with 4 rows and 4 columns.
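
An alternative that some may find easier to read uses nrow() and ncol() to build the zero matrix, and rbind() to stack matrices instead of placing them side by side. This is only a sketch; the name z is chosen here so that the built-in t() function is not hidden:

```r
gt = matrix(1:8, 4, 2)
z = matrix(0, nrow(gt), ncol(gt))   # zeros with the same shape as gt
dim(cbind(gt, z))   # 4 4: matrices placed side by side
dim(rbind(gt, z))   # 8 2: matrices stacked on top of each other
```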
1.2 Creating Matrices using random numbers from Distributions:
To generate a matrix of normally distributed random numbers, the following command can be used in R.
g = matrix(rnorm(2, 1,2), 2,2)
g
         [,1]     [,2]
[1,] 1.645968 1.645968
[2,] 3.992582 3.992582

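The rnorm(number of observations, mean, standard deviation) function is used along with the matrix function. Note that rnorm(2, 1,2) draws only two values, which matrix() recycles to fill the 2x2 matrix; this is why both columns above are identical. To fill the matrix with four independent draws, request four values with rnorm(4, 1,2). Similarly, the following commands can be used to draw from other known distributions.
runif() for the uniform distribution
rpois() for the Poisson distribution
rlnorm() for the log-normal distribution
rbinom() for the binomial distribution
Users can use these random matrices to test their models or simply to experiment with some data. Note that when the first argument of rnorm() is a vector, such as seq(begin, end, length = n), R uses only its length as the number of values to draw; the values of the sequence themselves are ignored. The following command is therefore equivalent to using rnorm(4, 1,2).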
g = matrix(rnorm(seq(-4,4, length = 4), 1,2), 2,2)
g
         [,1]     [,2]
[1,] 3.750171 1.216113
[2,] 1.300675 2.796268

When generating random numbers, call set.seed(id) first so that the random matrix can be reproduced; otherwise every run of the command will return a different matrix.
set.seed(123)
g = matrix(rnorm(seq(-4,4, length = 4), 1,2), 2,2)
g
           [,1]     [,2]
[1,] -0.1209513 4.117417
[2,]  0.5396450 1.141017
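
To see the effect of set.seed() more directly, the sketch below draws the same matrix twice and also fills matrices from some of the other distributions listed earlier. The parameter values (lambda = 3, size = 10, prob = 0.5) are arbitrary choices for illustration:

```r
set.seed(123)
g1 = matrix(rnorm(4, 1, 2), 2, 2)
set.seed(123)
g2 = matrix(rnorm(4, 1, 2), 2, 2)
identical(g1, g2)                        # TRUE: same seed, same draws

u = matrix(runif(6, 0, 1), 2, 3)         # uniform on [0, 1]
p = matrix(rpois(6, lambda = 3), 2, 3)   # Poisson counts with mean 3
b = matrix(rbinom(6, size = 10, prob = 0.5), 2, 3)  # successes out of 10 trials
```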

1.3 Diagonals, identities and matrix manipulations:
In linear algebra the identity matrix plays the same role as 1 in ordinary arithmetic. Any matrix multiplied by the identity matrix gives back the original matrix. To generate an identity matrix:
 diag(5)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    1    0    0    0
[3,]    0    0    1    0    0
[4,]    0    0    0    1    0
[5,]    0    0    0    0    1
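
The role of the identity matrix can be checked numerically. The sketch below uses R's matrix multiplication operator %*% on an example matrix M (the names M and I3 are only illustrative):

```r
M = matrix(c(2,4,5,6,8,9,7,3,2), 3, 3)
I3 = diag(3)              # 3x3 identity matrix
all(M %*% I3 == M)        # TRUE
all(I3 %*% M == M)        # TRUE
diag(c(1, 2, 3))          # a vector argument gives a diagonal matrix
```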

The diag() command mentioned above can also be used to extract all the diagonal elements of a matrix. This is very useful in statistics for extracting the diagonal of a variance-covariance matrix, where the diagonal elements are variances and the off-diagonal elements are covariances.

A = matrix(c(2,4,5,6,8,9,7,3,2), 3,3)
A
     [,1] [,2] [,3]
[1,]    2    6    7
[2,]    4    8    3
[3,]    5    9    2
dg = diag(A)
dg
[1] 2 8 2

The upper.tri() and lower.tri() commands can be used to create an upper and a lower triangular matrix respectively. However, what they return is a logical matrix of TRUE and FALSE, as shown below, marking the positions above (or below) the diagonal. We need one more command to set the elements to zero wherever TRUE appears; zeroing the upper triangle leaves a lower triangular matrix.
up = upper.tri(A)
up
      [,1]  [,2]  [,3]
[1,] FALSE  TRUE  TRUE
[2,] FALSE FALSE  TRUE
[3,] FALSE FALSE FALSE

A[upper.tri(A)]= 0
A
     [,1] [,2] [,3]
[1,]    2    0    0
[2,]    4    8    0
[3,]    5    9    2
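
The mirror-image case works the same way. As a sketch, zeroing the positions flagged by lower.tri() leaves an upper triangular matrix, and the diag = TRUE argument includes the diagonal in the selection (U is just an illustrative copy of A):

```r
A = matrix(c(2,4,5,6,8,9,7,3,2), 3, 3)
U = A
U[lower.tri(U)] = 0            # keep the diagonal and everything above it
U
A[upper.tri(A, diag = TRUE)]   # 2 6 8 7 3 2, read column by column
```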

To make data extraction from a matrix easier to follow, we first define a matrix x with 5 rows and 6 columns.
x = matrix(rep(51:80), 5,6)
x



     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   51   56   61   66   71   76
[2,]   52   57   62   67   72   77
[3,]   53   58   63   68   73   78
[4,]   54   59   64   69   74   79
[5,]   55   60   65   70   75   80

Note that a similar matrix can be generated by nesting seq() inside matrix(). However, seq(50,80, length = 30) spaces 30 values evenly between 50 and 80, so the step is 30/29 and the values have decimals.
x = matrix(seq(50,80, length = 30), 5,6)
x
         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
[1,] 50.00000 55.17241 60.34483 65.51724 70.68966 75.86207
[2,] 51.03448 56.20690 61.37931 66.55172 71.72414 76.89655
[3,] 52.06897 57.24138 62.41379 67.58621 72.75862 77.93103
[4,] 53.10345 58.27586 63.44828 68.62069 73.79310 78.96552
[5,] 54.13793 59.31034 64.48276 69.65517 74.82759 80.00000

For learning purposes we will only use the integer matrix x generated with the rep() command, so re-run x = matrix(rep(51:80), 5,6) before continuing. To extract just the first column of the matrix x:
u = x[,1]
u
[1] 51 52 53 54 55

In the above command the index is specified as [row number, column number]. Leaving the position before the comma blank is interpreted by R as all the rows, so only the first column is returned. Similarly, to extract the first row from the matrix x:
v = x[1,]
v
[1] 51 56 61 66 71 76
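
One detail worth knowing is that extracting a single row or column returns a plain vector rather than a matrix. The sketch below shows this and uses the drop = FALSE argument of the square brackets to keep the matrix shape (u_mat is an illustrative name):

```r
x = matrix(51:80, 5, 6)
u = x[, 1]
is.null(dim(u))                    # TRUE: the result is a plain vector
u_mat = x[, 1, drop = FALSE]
dim(u_mat)                         # 5 1: still a matrix with one column
```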

The key to understanding matrix manipulation is knowing when to use square brackets and when to use parentheses: parentheses call functions, while square brackets index elements. Most matrix manipulation is performed using square brackets.
Suppose the user wants selected elements from a matrix. Data extraction can be performed using the following R commands.
g = x[1:2,5:6]
g
     [,1] [,2]
[1,]   71   76
[2,]   72   77

Now suppose the user wants to multiply all the elements of a submatrix by a scalar. The following commands can be used.
g = 2*x[1:2,5:6]
g
     [,1] [,2]
[1,]  142  152
[2,]  144  154
The following command can be used to make the first two rows of a matrix the same as the last two rows. Note that the right side of the equal sign supplies the new values and the left side names the elements to be overwritten. Since we want the first two rows to match the last two, we refer to the last two rows on the right side.
x[1:2,] = x[4:5,]
x
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   54   59   64   69   74   79
[2,]   55   60   65   70   75   80
[3,]   53   58   63   68   73   78
[4,]   54   59   64   69   74   79
[5,]   55   60   65   70   75   80

If the user wants to exchange the first row with the second row, this can be done with the same kind of command, now reversing the order of the rows on the right-hand side of the equal sign. The output below starts again from the original matrix x.
x[1:2,]=x[2:1,]
> x
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   52   57   62   67   72   77
[2,]   51   56   61   66   71   76
[3,]   53   58   63   68   73   78
[4,]   54   59   64   69   74   79
[5,]   55   60   65   70   75   80
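
The same idea extends to any reordering of rows: an index vector on the row position returns the rows in that order. A small sketch, leaving x itself unchanged (swapped and reversed are illustrative names):

```r
x = matrix(51:80, 5, 6)
swapped = x[c(2, 1, 3, 4, 5), ]   # rows 1 and 2 exchanged, x left unchanged
swapped[1:2, 1]                   # 52 51
reversed = x[5:1, ]               # all rows in reverse order
reversed[, 1]                     # 55 54 53 52 51
```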

To extract all the diagonal elements of the original matrix x:
> dg = diag(x)
> dg
[1] 51 57 63 69 75

To remove a column or row from a matrix in R, users need to add a negative sign to the index, as shown below.
x[,-1]
     [,1] [,2] [,3] [,4] [,5]
[1,]   56   61   66   71   76
[2,]   57   62   67   72   77
[3,]   58   63   68   73   78
[4,]   59   64   69   74   79
[5,]   60   65   70   75   80

R keeps all the rows of the matrix but omits the first column. The same can be applied to rows.
x = matrix(51:80,5,6)
x[-5,]
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   51   56   61   66   71   76
[2,]   52   57   62   67   72   77
[3,]   53   58   63   68   73   78
[4,]   54   59   64   69   74   79
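
Negative indices can also be combined. As a sketch, a negated vector drops several rows or columns at once, and rows and columns can be removed in the same step:

```r
x = matrix(51:80, 5, 6)
dim(x[-c(1, 2), ])      # 3 6: rows 1 and 2 dropped
dim(x[, -c(1, 6)])      # 5 4: first and last columns dropped
x[-5, -1][1, ]          # 56 61 66 71 76: row 5 and column 1 dropped together
```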

Saturday, April 13, 2013

Analyzing data using Boxplots

Box Plots <!-- Styles for R syntax highlighter

Box Plots

A lot of economic data, in my opinion, can be better understood using boxplots rather than bar charts. Boxplots have a number of uses when research is in its initial stages. After collecting the data, a researcher can plot it using boxplots to get an overview. An analyst can draw several different types of boxplots and infer a great deal of information. Boxplots are a quick and easy way to compare multiple data series. Usually, economists use scatter plots, Q-Q plots and histograms to assess the skewness or normality of data and to detect outliers. Instead of using the mean and standard deviation to show the center of the distribution and its spread, a boxplot uses the median and the interquartile range.
My main purpose in this post is to introduce users of R to boxplots and then go a step further and analyse data using various boxplot commands.

Data

The data for the boxplot is borrowed from Jeffrey Wooldridge's book. The data comprises 526 observations and 24 variables.

Interpretation:

The most insightful way to understand a boxplot is to plot one. The following commands read the data and then plot the graph.
wg = read.csv("dataw.csv")  ## read in the raw data file
summary(wg$wage)  ## summary statistics of the wage column
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.53    3.33    4.65    5.90    6.88   25.00
boxplot(wg$wage)  ## plot the wage data
[Figure: boxplot of wg$wage]
As discussed above, the median and not the mean is the center of the distribution for boxplots, hence the thick black line in the box shows the median (4.650). Further, the box itself spans approximately the interquartile range, from the first quartile (3.33) to the third quartile (6.88). Lastly, all the points beyond the whiskers are outliers. With these data they lie above the upper whisker, which shows us that the data is skewed and not symmetric.
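
The numbers behind a boxplot can be inspected with boxplot.stats(). The sketch below uses a small made-up vector, not the Wooldridge data, purely to show how one large value ends up flagged as an outlier while the median stays put:

```r
wages = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)   # made-up values, not the Wooldridge data
median(wages)                # 5.5
mean(wages)                  # 7.5, pulled upward by the large value
bs = boxplot.stats(wages)
bs$stats                     # 1.0 3.0 5.5 8.0 9.0: whisker, hinge, median, hinge, whisker
bs$out                       # 30: the point drawn beyond the upper whisker
```

By default a point counts as an outlier when it lies more than 1.5 times the box length beyond the box, which is why 30 is flagged here and 9 is not.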
Multiple boxplots can be plotted using the following R command.
boxplot(wg$wage ~ as.factor(wg$female))  ## Multiple boxplots
[Figure: boxplots of wage by female]
Suppose we want to study the distribution of wage for males versus females in the data set. The variable female is a dummy variable which takes the value 1 if female and 0 if male. The plot suggests that, on average, males are paid a higher wage than females. We can beautify the graph by adding colors and labels to the plot.
boxplot(wg$wage ~ as.factor(wg$female), col = ("blue"), xlab = "sex", ylab = "wage (average hourly earnings in $)")
[Figure: boxplots of wage by sex, with color and axis labels]
Now we can add more information to the graph by differentiating between males and females by color.
boxplot(wg$wage ~ as.factor(wg$female), col = c("blue", "red"), names = c("Male", 
    "Female"), xlab = "sex", ylab = "wage (average hourly earnings in $)")
[Figure: boxplots of wage for Male (blue) and Female (red)]
In the above image you can see that I have defined “col” as a vector of two colors and have also supplied the “names” argument as a vector. In my next post I will try to add more information to this boxplot and also show how to draw multiple plots and combine plots using base R.
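
A visual comparison like this can be backed up with the group medians. The sketch below uses a small hypothetical data frame named toy in place of wg (the values are invented), computes the median per group with tapply(), and calls boxplot() with plot = FALSE to get the statistics without drawing anything:

```r
# Hypothetical toy data standing in for wg; the values are invented
toy = data.frame(wage = c(3, 5, 8, 12, 2, 4, 5, 7),
                 female = c(0, 0, 0, 0, 1, 1, 1, 1))
tapply(toy$wage, toy$female, median)   # 6.5 for group 0, 4.5 for group 1
b = boxplot(wage ~ as.factor(female), data = toy, plot = FALSE)
b$names                                # "0" "1": one box per level of female
b$stats[3, ]                           # 6.5 4.5: the medians drawn in each box
```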

Conclusion

We have studied the data using boxplots and come to the conclusion that, on average, females are paid lower wages. We can also observe the outliers in the data set.
-->