A Catalogue of Models in R
1 Simple Linear Model
df <- mtcars
my.lmodel <- lm(mpg ~ wt, data=df)
summary(my.lmodel)
Call:
lm(formula = mpg ~ wt, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,	Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
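As a quick check of the fitted model, we can plot the data with the regression line and predict mpg for a new weight. This is a minimal sketch; the value wt = 3 (i.e. 3000 lbs) is just an illustrative input:

plot(mpg ~ wt, data=df, pch=19, col="darkblue")
abline(my.lmodel, col="orange")
## Predict mpg for a 3000 lbs car, with a 95% prediction interval
predict(my.lmodel, newdata=data.frame(wt=3), interval="prediction")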
2 Multiple Linear Model
df <- mtcars
my.lmodel <- lm(mpg ~ wt + hp, data=df)
summary(my.lmodel)
Call:
lm(formula = mpg ~ wt + hp, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-3.941 -1.600 -0.182  1.050  5.854 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
hp          -0.03177    0.00903  -3.519  0.00145 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared:  0.8268,	Adjusted R-squared:  0.8148 
F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12
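Predictions now require values for both predictors. A minimal sketch, with wt = 3 and hp = 110 chosen purely for illustration:

## Predict mpg for a hypothetical car, with a 95% confidence interval
predict(my.lmodel, newdata=data.frame(wt=3, hp=110), interval="confidence")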
3 Polynomial Model
df <- data.frame(
  "XX" = c(9.692748, 8.694011, 2.711132, 2.363472, 5.529044, 7.009672,
           6.399579, 3.828119, 7.728721, 9.178202, 8.685242, 8.782446,
           1.168813, 8.417367),
  "YY" = c(-24.5773396, -9.9371045, 66.9430825, 62.8107794, 36.1920856,
           19.0341337, 27.7069809, 45.1998159, 11.2022725, -15.5522627,
           -6.5048885, -6.7400534, 68.0594333, -3.1477693)
)
my.qmodel <- lm(YY ~ XX + I(XX^2), data=df)
summary(my.qmodel)
Call:
lm(formula = YY ~ XX + I(XX^2), data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6322 -0.5881 -0.2455  1.3227  6.0064 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  71.8863     4.2283  17.001 3.03e-09 ***
XX           -1.7604     1.7851  -0.986 0.345255    
I(XX^2)      -0.8404     0.1577  -5.329 0.000241 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.305 on 11 degrees of freedom
Multiple R-squared:  0.9911,	Adjusted R-squared:  0.9895 
F-statistic: 611.4 on 2 and 11 DF,  p-value: 5.318e-12
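An equivalent way to specify the same quadratic model is with poly() using raw polynomials; the sketch below also overlays the fitted curve on the data (my.qmodel2 and xs are illustrative names):

## Equivalent formulation of the same quadratic model
my.qmodel2 <- lm(YY ~ poly(XX, 2, raw=TRUE), data=df)
coef(my.qmodel2)   # same coefficients as my.qmodel
## Overlay the fitted curve on the data
xs <- seq(min(df$XX), max(df$XX), length.out=100)
plot(YY ~ XX, data=df, pch=19, col="darkblue")
lines(xs, predict(my.qmodel, newdata=data.frame(XX=xs)), col="orange")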
4 Exponential Model
In order to fit the model
\begin{equation}
Y = \beta_0 e^{\beta_1 X},
\end{equation}

we take the logarithm and fit the linear model
\begin{equation}
\ln(Y) = \ln(\beta_0) + \beta_1 X.
\end{equation}

df <- data.frame(
  "XX" = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
           16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30),
  "YY" = c(125.0, 100.7, 70.5, 83.4, 100, 65.9, 66.5, 53.5, 61.3, 43.9,
           42.2, 45.7, 32.5, 36.6, 40.1, 23.0, 39.8, 20.8, 35.0, 17.9,
           21.1, 27.9, 21.8, 14.2, 13.6, 11.8, 25.1, 8.19, 17.1, 24.2, 17.7)
)
my.expmodel <- lm(log(YY) ~ XX, data=df)
# plot(df$XX, df$YY, pch=19, col="darkblue")
# lines(df$XX, exp(my.expmodel$fitted.values), col="orange")
summary(my.expmodel)
Call:
lm(formula = log(YY) ~ XX, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.58725 -0.20024  0.02913  0.22138  0.63404 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.551145   0.101212   44.97  < 2e-16 ***
XX          -0.068925   0.005795  -11.89 1.13e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2886 on 29 degrees of freedom
Multiple R-squared:  0.8299,	Adjusted R-squared:  0.824 
F-statistic: 141.4 on 1 and 29 DF,  p-value: 1.126e-12
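Since the fit is on the log scale, the parameters of the original model \(Y = \beta_0 e^{\beta_1 X}\) are recovered by exponentiating the intercept. A short sketch:

## Back-transform to the original scale:
## beta_0 = exp(4.551145) ~ 94.7 and beta_1 ~ -0.0689
beta0 <- exp(coef(my.expmodel)[1])
beta1 <- coef(my.expmodel)[2]
c(beta0, beta1)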
5 Logistic Regression Model
We would like a model that describes the relationship between \(Pr(Y=1|X)\) and \(X\).
The logistic function \(p:\mathbb{R}\to [0,1]\) is given by
\begin{equation}
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}.
\end{equation}

The odds are given by
\begin{equation}
\frac{p(X)}{1-p(X)},
\end{equation}

and its logarithm is called the log-odds or simply the logit:
\begin{equation}
\ln\left(\frac{p(X)}{1-p(X)}\right).
\end{equation}

Then, the logistic regression model has a logit that is linear in the variable \(X\).
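Explicitly, substituting the logistic function \(p(X)\) above into the logit gives

\begin{equation}
\ln\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X.
\end{equation}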
library(ISLR)
# Default data set:
# A simulated data set containing information on ten thousand
# customers. The aim here is to predict which customers will default
# on their credit card debt.
df <- Default
df$Y <- ifelse(df$default %in% "Yes", 1, 0)  ## Encode Yes as 1 and No as 0.
my.logisticmodel <- glm(Y ~ balance, data=df, family="binomial")
summary(my.logisticmodel)
Call:
glm(formula = Y ~ balance, family = "binomial", data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2697  -0.1465  -0.0589  -0.0221   3.7589  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.065e+01  3.612e-01  -29.49   <2e-16 ***
balance      5.499e-03  2.204e-04   24.95   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1596.5  on 9998  degrees of freedom
AIC: 1600.5

Number of Fisher Scoring iterations: 8
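Since the model is linear on the log-odds scale, exponentiating the coefficients gives odds ratios; a short sketch:

## exp(0.0055) ~ 1.0055: each extra unit of balance multiplies
## the odds of default by roughly 1.0055
exp(coef(my.logisticmodel))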
We now add the estimated probabilities \(p(X)\) to the data frame and predict a class using the threshold \(p(X) > 0.5\).
p1 <- predict(my.logisticmodel, df, type="response")
df <- cbind(df, p1)
df$class <- ifelse(df$p1 > 0.5, "Yes", "No")
The confusion matrix
CC <- table(df$default, df$class)  # Confusion matrix
CC
|     | No   | Yes |
|-----|------|-----|
| No  | 9625 | 42  |
| Yes | 233  | 100 |
and the accuracy of the classification
# accuracy
sum(diag(CC)) / sum(CC)
[1] 0.9725
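For context, the confusion matrix shows that 9667 of the 10000 customers did not default, so always predicting "No" would already achieve an accuracy of 0.9667; a one-line check:

## Baseline accuracy of always predicting the majority class "No"
mean(df$default == "No")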
6 K-Nearest Neighbours (KNN)
The normalisation function. Different variables have different units, such as weight in kg and height in cm, so we normalise each variable; every variable is then represented by values between \(0\) and \(1\).
\begin{equation}
N(X_i) = \frac{X_i - \min(\{X_1, \ldots, X_n\})}{\max(\{X_1, \ldots, X_n\}) - \min(\{X_1, \ldots, X_n\})}.
\end{equation}

## We will be working with the iris data set.
df <- iris
nor <- function(x) { (x - min(x)) / (max(x) - min(x)) }
## Run normalisation on the first 4 columns, which
## are the predictors.
## Note the use of the lapply function.
df_norm <- as.data.frame(lapply(df[, c(1:4)], nor))
## Extract a training set consisting of a random
## subset of 90% of the rows; the rest is the test set.
ran <- sample(1:nrow(df), 0.9 * nrow(df))
df_train <- df_norm[ran, ]
df_train_category <- df[ran, 5]
df_test <- df_norm[-ran, ]
df_test_category <- df[-ran, 5]
library(class)
## Run the knn function.
pr <- knn(df_train, df_test, cl=df_train_category, k=5)
## Create the confusion matrix.
CC <- table(pr, df_test_category)
## Accuracy is the ratio of correct predictions
## to the total number of predictions.
sum(diag(CC)) / sum(CC) * 100
[1] 93.33333
Remark. Note that the result may differ each time we run the code, because the training/test split is selected at random.
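To make a run reproducible, one can fix the random seed before sampling; the seed value 1234 below is arbitrary:

set.seed(1234)  # any fixed value makes the sample() call reproducible
ran <- sample(1:nrow(df), 0.9 * nrow(df))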
7 Bootstrap
In this particular example we use the bootstrap method to estimate the standard error of the following statistic, which depends on the two variables \(X\) and \(Y\):
\begin{equation}
\label{eq:01}
\alpha = \frac{\sigma^2_{Y} - \sigma_{XY}}{\sigma_{X}^2 + \sigma_{Y}^2 - 2\sigma_{XY}}.
\end{equation}

library(boot)
library(ISLR)
df <- Portfolio
my.function <- function(data, index){
  X <- data[index, "X"]
  Y <- data[index, "Y"]
  return((var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y)))
}
boot.result <- boot(data=df, my.function, R=1000)
boot.result
ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = df, statistic = my.function, R = 1000)

Bootstrap Statistics :
     original      bias    std. error
t1* 0.5758321 0.004756099  0.09076746
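The boot package can also turn the bootstrap replicates into confidence intervals, for example normal-approximation and percentile intervals for \(\alpha\):

## 95% confidence intervals for alpha from the bootstrap replicates
boot.ci(boot.result, type=c("norm", "perc"))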
Of course, the statistic can be any other function of the data; the implementation is similar, as the sketch below illustrates.
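For instance, the same pattern bootstraps the correlation between \(X\) and \(Y\); cor.fn is a hypothetical name used only in this sketch:

## Bootstrap the correlation of X and Y with the same recipe
cor.fn <- function(data, index) cor(data[index, "X"], data[index, "Y"])
boot(data=df, statistic=cor.fn, R=1000)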