
Introduction to simple linear regression

What is regression?

A way of predicting the value of one variable from another.

    • it is a hypothetical model of the relationship between two variables.
    • the model used is linear.
    • we describe the relationship using the equation of a straight line.

Simple linear regression (a quick review)


Regression equation

\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \]

    • \(\beta_1\)
      • regression coefficient for the predictor
      • gradient or slope of the regression line
      • direction/strength of the relationship
    • \(\beta_0\)
      • intercept
      • point at which the regression line crosses the Y-axis
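For example, with hypothetical values \(\beta_0 = 2\) and \(\beta_1 = 0.5\), the model predicts \(\hat{Y} = 2 + 0.5 \times 10 = 7\) when \(X = 10\): each 1-unit increase in \(X\) raises the predicted \(Y\) by 0.5.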


Example in R

A distributor of frozen dessert pies wants to evaluate the effect of price on the demand for pies. Data are collected for 15 weeks.

head(df)
# A tibble: 6 × 4
   Week Pie_Sales Price Advertising
  <int>     <dbl> <dbl>       <dbl>
1     1       350   5.5         3.3
2     2       460   7.5         3.3
3     3       350   8           3  
4     4       430   8           4.5
5     5       350   6.8         3  
6     6       380   7.5         4  


Step 1: check data using a scatter plot

## scatter plot of Pie_Sales against Price with a fitted line
df |> ggplot(aes(Price, Pie_Sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(limits = c(4, 10)) +
  scale_y_continuous(limits = c(200, 500)) +
  theme_minimal(base_size = 14)


Step 2: Estimate the model

## estimating linear regression
model <- lm(Pie_Sales ~ Price, data = df)

## printing model summary
summary(model)

Call:
lm(formula = Pie_Sales ~ Price, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-90.040 -45.040   1.977  55.926  81.977 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   558.28      90.44   6.173 3.36e-05 ***
Price         -24.03      13.48  -1.783   0.0979 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 59.09 on 13 degrees of freedom
Multiple R-squared:  0.1965,    Adjusted R-squared:  0.1347 
F-statistic: 3.179 on 1 and 13 DF,  p-value: 0.09794

Step 3: Interpret results

  • y-intercept (\(\beta_0\)): 558.28

    • estimated average value of \(y\) when \(x = 0\)
  • slope (\(\beta_1\)): -24.03

    • the average value of \(y\) changes by \(\beta_1\) units for each 1-unit increase in \(x\).

    • example: a 1-unit increase in price decreases pie sales by 24.03 pies per week, on average.
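As a quick check on these numbers, the fitted equation \(\hat{y} = 558.28 - 24.03x\) predicts roughly \(558.28 - 24.03 \times 7 \approx 390\) pies per week at a price of 7. A minimal sketch of the same prediction in R, using the model fitted above (the price of 7 is just an illustrative value):

## predicted weekly sales at an illustrative price of 7
predict(model, newdata = tibble(Price = 7))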

Multiple linear regression

    • a linear regression with two or more independent variables (explanatory variables) and one dependent variable (response variable).

    • multiple regression model

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon \]


Example in R

A distributor of frozen dessert pies wants to evaluate the factors affecting the demand for pies. Data are collected for 15 weeks.


Step 1: Create the dataset

## create synthetic data
df <- tibble(
  Week = 1:15,
  Pie_Sales = c(350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300),
  Price = c(5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00),
  Advertising = c(3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7)
)

## print dataset
head(df)
# A tibble: 6 × 4
   Week Pie_Sales Price Advertising
  <int>     <dbl> <dbl>       <dbl>
1     1       350   5.5         3.3
2     2       460   7.5         3.3
3     3       350   8           3  
4     4       430   8           4.5
5     5       350   6.8         3  
6     6       380   7.5         4  

Step 2: Create a scatter plot
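The plot from the original slide is not reproduced here. One way to sketch it, assuming ggplot2 and tidyr are available, is to plot sales against each predictor side by side:

## scatter plots of sales against each predictor, with fitted lines
df |>
  tidyr::pivot_longer(c(Price, Advertising),
                      names_to = "Predictor", values_to = "Value") |>
  ggplot(aes(Value, Pie_Sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ Predictor, scales = "free_x") +
  theme_minimal(base_size = 14)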


Step 3: Estimate the model

## estimating linear regression
model <- lm(Pie_Sales ~ Price + Advertising, data = df)

## printing model summary
summary(model)

Call:
lm(formula = Pie_Sales ~ Price + Advertising, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-63.795 -33.796  -9.088  17.175  96.155 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   306.53     114.25   2.683   0.0199 *
Price         -24.98      10.83  -2.306   0.0398 *
Advertising    74.13      25.97   2.855   0.0145 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 47.46 on 12 degrees of freedom
Multiple R-squared:  0.5215,    Adjusted R-squared:  0.4417 
F-statistic: 6.539 on 2 and 12 DF,  p-value: 0.01201

Step 4: Interpret results

  • \(\beta_1\) (Price): -24.98

    • pie sales will decrease, on average, by 24.98 pies per week for each $1 increase in selling price, net of the effect of advertising.
  • \(\beta_2\) (Advertising): 74.13

    • sales will increase, on average, by 74.13 pies per week for each $1 increase in advertising, net of the effect of price.
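Plugging a week's values into the fitted equation \(\hat{y} = 306.53 - 24.98x_1 + 74.13x_2\) makes these effects concrete: at a price of 5.50 and advertising of 3.5, the model predicts about \(306.53 - 24.98 \times 5.5 + 74.13 \times 3.5 \approx 429\) pies per week. The same sketch in R (the input values are illustrative):

## predicted weekly sales at Price = 5.5 and Advertising = 3.5
predict(model, newdata = tibble(Price = 5.5, Advertising = 3.5))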



# table summary
library(jtools)
summ(model)
Observations: 15
Dependent variable: Pie_Sales
Type: OLS linear regression

F(2,12) = 6.54
R² = 0.52
Adj. R² = 0.44

                Est.   S.E.  t val.    p
(Intercept)   306.53 114.25    2.68 0.02
Price         -24.98  10.83   -2.31 0.04
Advertising    74.13  25.97    2.85 0.01

Standard errors: OLS



# plot
library(jtools)
effect_plot(model, 
            pred = Price, 
            plot.points = TRUE, 
            jitter = 0.5, 
            interval = TRUE)



# plot
library(jtools)
effect_plot(model, 
            pred = Advertising, 
            plot.points = TRUE, 
            jitter = 0.5, 
            interval = TRUE)

Check-up quiz

What is the primary goal of linear regression analysis?

  • To determine the strength of the relationship between two categorical variables.

  • To predict the value of a dependent variable based on one or more independent variables

  • To compare the means of two or more groups.

  • To analyze the frequency of occurrences within categories.

In simple linear regression, what does the slope of the regression line represent?

  • The average value of the dependent variable.

  • The change in the dependent variable for a one-unit increase in the independent variable

  • The correlation between the two variables.

  • The predicted value of the dependent variable when the independent variable is zero.

How does multiple linear regression differ from simple linear regression?

  • Multiple linear regression uses only one independent variable.

  • Multiple linear regression uses two or more independent variables

  • Multiple linear regression analyzes categorical variables.

  • Multiple linear regression does not involve a dependent variable.

What is the coefficient of determination (R-squared) in the context of regression analysis?

  • A measure of the strength of the linear relationship between the variables

  • The probability of making a correct prediction.

  • The difference between the observed and predicted values.

  • The slope of the regression line.

How good is the model?


Take note!

    • The regression line is only a model based on the data.
    • This model might not reflect reality.
      • We need some way of testing how well the model fits the observed data.
      • how?

Sum of Squares

Total sum of squares

    • \(SS_T = \Sigma(y_i - \bar{y})^2\)
      • uses the differences between the observed data and the mean value of Y


Residual sum of squares

    • \(SS_R = \Sigma(y_i - \hat{y_i})^2\)
      • uses the differences between the observed data and the regression line

Multiple coefficient of determination (R-squared)

R-squared (\(R^2\))

    • reports the proportion of the total variation in \(y\) explained by all \(x\) variables taken together

\[ R^2 = 1 - \frac{SS_R}{SS_T} \]

    • equivalently, the ratio of the explained (regression) sum of squares to the total sum of squares

  • Adjusted R-squared (\(R^2_{adj}\))

    • \(R^2\) never decreases when a new \(x\) variable is added to the model.
    • This can be a disadvantage when comparing models.


What is the effect of adding a new variable?

    • We lose a degree of freedom when a new \(x\) variable is added.
    • Did the new \(x\) variable add enough explanatory power to offset the loss of one degree of freedom?
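A quick way to see this trade-off in R, sketched under the assumption that Week carries little real information about sales, is to add Week as a third predictor and compare the two fit measures:

## add a (presumably uninformative) predictor and compare fits
model2 <- lm(Pie_Sales ~ Price + Advertising + Week, data = df)

summary(model)$r.squared       # R-squared, two-predictor model
summary(model2)$r.squared      # never decreases when a variable is added
summary(model)$adj.r.squared   # adjusted R-squared, two-predictor model
summary(model2)$adj.r.squared  # can decrease if Week adds little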

Adjusted R-squared (\(R^2_{adj}\))

\[ R^2_{adj} = 1 - (1 - R^2)\left(\frac{n-1}{n-k-1}\right) \]

  • where
    • \(n\) = sample size
    • \(k\) = number of independent variables

  • shows the proportion of variation in \(y\) explained by all \(x\) variables, adjusted for the number of \(x\) variables used

  • penalizes the excessive use of unimportant independent variables

  • smaller than \(R^2\)

  • useful for comparing models

Example in R

Calculate the \(R^2\) and adjusted \(R^2\)

\[ R^2 = 1 - \frac{SS_R}{SS_T} = 1 - \frac{\Sigma(y_i - \hat{y_i})^2}{\Sigma(y_i - \bar{y})^2} \]

## Estimating linear regression
model <- lm(Pie_Sales ~ Price + Advertising, data = df)

## Extract model predictions
y_pred <- predict(model)
y_actual <- df$Pie_Sales

## Compute SS_tot (Total Sum of Squares)
y_mean <- mean(y_actual)
SS_tot <- sum((y_actual - y_mean)^2)

## Compute SS_res (Residual Sum of Squares)
SS_res <- sum((y_actual - y_pred)^2)

## Compute R-squared manually
r_squared_manual <- 1 - (SS_res / SS_tot)
cat("Manual R-squared:", r_squared_manual, "\n")
Manual R-squared: 0.5214779 

\[ R^2_{adj} = 1 - (1 - R^2)(\frac{n-1}{n-k-1}) \]

## Compute Adjusted R-squared manually
n <- nrow(df)  # number of observations
k <- length(coef(model)) - 1  # number of predictors
adj_r_squared_manual <- 1 - ((1 - r_squared_manual) * (n - 1) / (n - k - 1))
cat("Manual Adjusted R-squared:", adj_r_squared_manual, "\n")
Manual Adjusted R-squared: 0.4417243 

Diagnostic tests

Linearity assumption

    • the linear regression model relates the outcome to the predictors in a linear fashion.

    • departures from linearity can be checked using a scatter plot with a fitted line overlaid on the points (see the sketch below)

    • if linearity is not satisfied, a transformation may be needed

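A minimal sketch of this check in R, using the model fitted above: in a residuals-vs-fitted plot, a roughly flat, patternless band of points is consistent with linearity, while a clear curve suggests a departure.

## residuals vs. fitted values; a curved pattern suggests non-linearity
plot(model, which = 1)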

Normality of errors

    • confidence intervals for the regression coefficients and related hypothesis tests are based on the assumption that the coefficient estimates have a normal distribution.
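One common screen, sketched with the model fitted above: a normal Q-Q plot of the residuals, where points lying close to the reference line are consistent with normally distributed errors.

## normal Q-Q plot of the residuals
plot(model, which = 2)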

Homoscedasticity of error variance

Independence of errors

Non-multicollinearity
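These last three assumptions are listed without detail in these notes, but they can be screened with standard tools. A brief sketch, assuming the lmtest and car packages are installed (neither is used elsewhere in these notes):

library(lmtest)  # bptest(), dwtest()
library(car)     # vif()

## homoscedasticity: Breusch-Pagan test of constant error variance
bptest(model)

## independence of errors: Durbin-Watson test for autocorrelated residuals
dwtest(model)

## non-multicollinearity: variance inflation factors
## (values well above 5-10 are a common warning sign)
vif(model)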