A Catalogue of Models in R
1 Simple Linear Model
df <- mtcars
my.lmodel <- lm(mpg ~ wt, data=df)
summary(my.lmodel)
Call:
lm(formula = mpg ~ wt, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,	Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
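As a quick check of the fitted model, we can plot the data with the regression line and predict mpg for a new weight. This is a minimal sketch; the value wt = 3 (i.e. 3000 lbs) is just an illustrative input:

plot(mpg ~ wt, data=df, pch=19, col="darkblue")
abline(my.lmodel, col="orange")
## Predict mpg for a 3000 lbs car, with a 95% prediction interval
predict(my.lmodel, newdata=data.frame(wt=3), interval="prediction")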
2 Multiple Linear Model
df <- mtcars
my.lmodel <- lm(mpg ~ wt + hp, data=df)
summary(my.lmodel)
Call:
lm(formula = mpg ~ wt + hp, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-3.941 -1.600 -0.182  1.050  5.854 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
hp          -0.03177    0.00903  -3.519  0.00145 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared:  0.8268,	Adjusted R-squared:  0.8148 
F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12
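Predictions now require values for both predictors. A minimal sketch, with wt = 3 and hp = 110 chosen purely for illustration:

## Predict mpg for a hypothetical car, with a 95% confidence interval
predict(my.lmodel, newdata=data.frame(wt=3, hp=110), interval="confidence")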
3 Polynomial Model
df <- data.frame(
  "XX" = c(9.692748, 8.694011, 2.711132, 2.363472, 5.529044, 7.009672,
           6.399579, 3.828119, 7.728721, 9.178202, 8.685242, 8.782446,
           1.168813, 8.417367),
  "YY" = c(-24.5773396, -9.9371045, 66.9430825, 62.8107794, 36.1920856,
           19.0341337, 27.7069809, 45.1998159, 11.2022725, -15.5522627,
           -6.5048885, -6.7400534, 68.0594333, -3.1477693)
)
my.qmodel <- lm(YY ~ XX + I(XX^2), data=df)
summary(my.qmodel)
Call:
lm(formula = YY ~ XX + I(XX^2), data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6322 -0.5881 -0.2455  1.3227  6.0064 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  71.8863     4.2283  17.001 3.03e-09 ***
XX           -1.7604     1.7851  -0.986 0.345255    
I(XX^2)      -0.8404     0.1577  -5.329 0.000241 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.305 on 11 degrees of freedom
Multiple R-squared:  0.9911,	Adjusted R-squared:  0.9895 
F-statistic: 611.4 on 2 and 11 DF,  p-value: 5.318e-12
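An equivalent way to specify the same quadratic model is with poly() using raw polynomials; the sketch below also overlays the fitted curve on the data (my.qmodel2 and xs are illustrative names):

## Equivalent formulation of the same quadratic model
my.qmodel2 <- lm(YY ~ poly(XX, 2, raw=TRUE), data=df)
coef(my.qmodel2)   # same coefficients as my.qmodel
## Overlay the fitted curve on the data
xs <- seq(min(df$XX), max(df$XX), length.out=100)
plot(YY ~ XX, data=df, pch=19, col="darkblue")
lines(xs, predict(my.qmodel, newdata=data.frame(XX=xs)), col="orange")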
4 Exponential Model
In order to fit the model
\begin{equation}
Y = \beta_0 e^{\beta_1 X},
\end{equation}

we take the logarithm and fit the linear model
\begin{equation}
\ln(Y) = \ln(\beta_0) + \beta_1 X.
\end{equation}

df <- data.frame(
  "XX" = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
           16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30),
  "YY" = c(125.0, 100.7, 70.5, 83.4, 100, 65.9, 66.5, 53.5, 61.3, 43.9,
           42.2, 45.7, 32.5, 36.6, 40.1, 23.0, 39.8, 20.8, 35.0, 17.9,
           21.1, 27.9, 21.8, 14.2, 13.6, 11.8, 25.1, 8.19, 17.1, 24.2, 17.7)
)
my.expmodel <- lm(log(YY) ~ XX, data=df)
# plot(df$XX, df$YY, pch=19, col="darkblue")
# lines(df$XX, exp(my.expmodel$fitted.values), col="orange")
summary(my.expmodel)
Call:
lm(formula = log(YY) ~ XX, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.58725 -0.20024  0.02913  0.22138  0.63404 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.551145   0.101212   44.97  < 2e-16 ***
XX          -0.068925   0.005795  -11.89 1.13e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2886 on 29 degrees of freedom
Multiple R-squared:  0.8299,	Adjusted R-squared:  0.824 
F-statistic: 141.4 on 1 and 29 DF,  p-value: 1.126e-12
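Since the fit is on the log scale, the parameters of the original model \(Y = \beta_0 e^{\beta_1 X}\) are recovered by exponentiating the intercept. A short sketch:

## Back-transform to the original scale:
## beta_0 = exp(4.551145) ~ 94.7 and beta_1 ~ -0.0689
beta0 <- exp(coef(my.expmodel)[1])
beta1 <- coef(my.expmodel)[2]
c(beta0, beta1)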
5 Logistic Regression Model
We would like a model that describes the relationship between \(Pr(Y=1|X)\) and \(X\).
The logistic function \(p:\mathbb{R}\to [0,1]\) is given by
\begin{equation}
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}.
\end{equation}

The odds are given by
\begin{equation}
\frac{p(X)}{1-p(X)},
\end{equation}

and its logarithm is called the log-odds or simply the logit:
\begin{equation}
\ln\left(\frac{p(X)}{1-p(X)}\right).
\end{equation}

Then, the logistic regression model has a logit that is linear in the variable \(X\).
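Explicitly, substituting the logistic function \(p(X)\) above into the logit gives

\begin{equation}
\ln\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X.
\end{equation}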
library(ISLR)
# Default data set:
# A simulated data set containing information on ten thousand
# customers. The aim here is to predict which customers will default
# on their credit card debt.
df <- Default
df$Y <- ifelse(df$default %in% "Yes", 1, 0)  ## Encode Yes as 1 and No as 0.
my.logisticmodel <- glm(Y ~ balance, data=df, family="binomial")
summary(my.logisticmodel)
Call:
glm(formula = Y ~ balance, family = "binomial", data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2697  -0.1465  -0.0589  -0.0221   3.7589  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.065e+01  3.612e-01  -29.49   <2e-16 ***
balance      5.499e-03  2.204e-04   24.95   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1596.5  on 9998  degrees of freedom
AIC: 1600.5

Number of Fisher Scoring iterations: 8
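Since the model is linear on the log-odds scale, exponentiating the coefficients gives odds ratios; a short sketch:

## exp(0.0055) ~ 1.0055: each extra unit of balance multiplies
## the odds of default by roughly 1.0055
exp(coef(my.logisticmodel))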
We now add the estimated probabilities \(p(X)\) to the data frame and predict a class using the threshold \(p(X) > 0.5\).
p1 <- predict(my.logisticmodel, df, type="response")
df <- cbind(df, p1)
df$class <- ifelse(df$p1 > 0.5, "Yes", "No")
The confusion matrix
CC <- table(df$default, df$class)  # Confusion matrix
CC
|     | No   | Yes |
|-----|------|-----|
| No  | 9625 | 42  |
| Yes | 233  | 100 |
and the accuracy of the classification
# accuracy
sum(diag(CC)) / sum(CC)
[1] 0.9725
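For context, the confusion matrix shows that 9667 of the 10000 customers did not default, so always predicting "No" would already achieve an accuracy of 0.9667; a one-line check:

## Baseline accuracy of always predicting the majority class "No"
mean(df$default == "No")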
6 K-Nearest Neighbours (KNN)
The normalisation function. Different variables have different units, such as weight in kg and height in cm, so we normalise each variable; every variable is then represented by values between \(0\) and \(1\).
\begin{equation}
N(X_i) = \frac{X_i - \min(\{X_1, \ldots, X_n\})}{\max(\{X_1, \ldots, X_n\}) - \min(\{X_1, \ldots, X_n\})}.
\end{equation}

## We will be working with the iris data set.
df <- iris
nor <- function(x) { (x - min(x)) / (max(x) - min(x)) }
## Run normalisation on the first 4 columns, which
## are the predictors.
## Note the use of the lapply function.
df_norm <- as.data.frame(lapply(df[, c(1:4)], nor))
## Extract a training set consisting of a random
## subset of 90% of the rows; the rest is the test set.
ran <- sample(1:nrow(df), 0.9 * nrow(df))
df_train <- df_norm[ran, ]
df_train_category <- df[ran, 5]
df_test <- df_norm[-ran, ]
df_test_category <- df[-ran, 5]
library(class)
## Run the knn function.
pr <- knn(df_train, df_test, cl=df_train_category, k=5)
## Create the confusion matrix.
CC <- table(pr, df_test_category)
## Accuracy is the ratio of correct predictions
## to the total number of predictions.
sum(diag(CC)) / sum(CC) * 100
[1] 93.33333
Remark. Note that the result may differ each time we run the code, because the training/test split is selected at random.
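To make a run reproducible, one can fix the random seed before sampling; the seed value 1234 below is arbitrary:

set.seed(1234)  # any fixed value makes the sample() call reproducible
ran <- sample(1:nrow(df), 0.9 * nrow(df))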
7 Bootstrap
In this particular example we use the bootstrap method to estimate the standard error of the following statistic, which depends on the two variables \(X\) and \(Y\):
\begin{equation}
\label{eq:01}
\alpha = \frac{\sigma^2_{Y} - \sigma_{XY}}{\sigma_{X}^2 + \sigma_{Y}^2 - 2\sigma_{XY}}.
\end{equation}

library(boot)
library(ISLR)
df <- Portfolio
my.function <- function(data, index){
  X <- data[index, "X"]
  Y <- data[index, "Y"]
  return((var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y)))
}
boot.result <- boot(data=df, my.function, R=1000)
boot.result
ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = df, statistic = my.function, R = 1000)

Bootstrap Statistics :
     original      bias    std. error
t1* 0.5758321 0.004756099  0.09076746
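The boot package can also turn the bootstrap replicates into confidence intervals, for example normal-approximation and percentile intervals for \(\alpha\):

## 95% confidence intervals for alpha from the bootstrap replicates
boot.ci(boot.result, type=c("norm", "perc"))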
Of course, the statistic can be any other function of the data; the implementation is similar, as the sketch below illustrates.
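For instance, the same pattern bootstraps the correlation between \(X\) and \(Y\); cor.fn is a hypothetical name used only in this sketch:

## Bootstrap the correlation of X and Y with the same recipe
cor.fn <- function(data, index) cor(data[index, "X"], data[index, "Y"])
boot(data=df, statistic=cor.fn, R=1000)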