Linear Regression Assumptions and Diagnostics in R: Essentials
Linear regression (Chapter @ref(linear-regression)) makes several assumptions about the data at hand. This chapter describes regression assumptions and provides built-in plots for regression diagnostics in the R programming language.
After performing a regression analysis, you should always check whether the model works well for the data at hand.
A first step of this regression diagnostic is to inspect the significance of the regression beta coefficients, as well as the R2, which tells us how well the linear regression model fits the data. This has been described in Chapters @ref(linear-regression) and @ref(cross-validation).
In the current chapter, you will learn additional steps to evaluate how well the model fits the data.
For example, the linear regression model makes the assumption that the relationship between the predictors (x) and the outcome variable is linear. This might not be true. The relationship could be polynomial or logarithmic.
Additionally, the data might contain some influential observations, such as outliers (or extreme values), that can affect the result of the regression.
Therefore, you should closely diagnose the regression model that you built in order to detect potential problems and to check whether the assumptions made by the linear regression model are met or not.
To do so, we generally examine the distribution of residual errors, which can tell you more about your data.
In this chapter,
- we start by explaining residual errors and fitted values.
- next, we present linear regression assumptions, as well as potential problems you can face when performing regression analysis.
- finally, we describe some built-in diagnostic plots in R for testing the assumptions underlying the linear regression model.
Contents:
- Loading Required R packages
- Example of data
- Building a regression model
- Fitted values and residuals
- Regression assumptions
- Regression diagnostics {reg-diag}
- Diagnostic plots
- Linearity of the data
- Homogeneity of variance
- Normality of residuals
- Outliers and high leverage points
- Influential values
- Discussion
- References
Loading Required R packages
- tidyverse for easy data manipulation and visualization
- broom: creates a tidy data frame from statistical test results
library(tidyverse)
library(broom)
theme_set(theme_classic())
Example of data
We'll use the marketing data set [datarium package], introduced in Chapter @ref(regression-analysis).
# Load the data
data("marketing", package = "datarium")
# Inspect the data
sample_n(marketing, 3)
##     youtube facebook newspaper sales
## 58    163.4     23.0      19.9  15.8
## 157   112.7     52.2      60.6  18.4
## 81     91.7     32.0      26.8  14.2
Building a regression model
We build a model to predict sales on the basis of the advertising budget spent on YouTube media.
model <- lm(sales ~ youtube, data = marketing)
model
##
## Call:
## lm(formula = sales ~ youtube, data = marketing)
##
## Coefficients:
## (Intercept)      youtube
##      8.4391       0.0475
Our regression equation is: y = 8.44 + 0.047*x, that is, sales = 8.44 + 0.047*youtube.
Before describing regression assumptions and regression diagnostics, we start by explaining two key concepts in regression analysis: fitted values and residual errors. These are important for understanding the diagnostic plots presented hereafter.
Fitted values and residuals
The fitted (or predicted) values are the y-values that you would expect for the given x-values according to the built regression model (or, visually, the best-fitting straight regression line).
In our example, for a given youtube advertising budget, the fitted (predicted) sales value would be sales = 8.44 + 0.047*youtube.
From the scatter plot below, it can be seen that not all the data points fall exactly on the estimated regression line. This means that, for a given youtube advertising budget, the observed (or measured) sale values can be different from the predicted sale values. The differences are called the residual errors, represented by vertical red lines.
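If you prefer base R over broom, a minimal sketch of these two concepts, assuming the model object built above, is:
# Fitted values and residual errors with base R accessors
f <- fitted(model)     # predicted sales for each observed youtube budget
r <- residuals(model)  # observed sales minus fitted sales
head(cbind(observed = marketing$sales, fitted = f, residual = r))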
In R, you can easily augment your data to add fitted values and residuals by using the function augment() [broom package]. Let's call the output model.diag.metrics because it contains several metrics useful for regression diagnostics. We'll describe them later.
model.diag.metrics <- augment(model)
head(model.diag.metrics)
##   sales youtube .fitted .se.fit .resid    .hat .sigma  .cooksd .std.resid
## 1 26.52   276.1   21.56   0.385  4.955 0.00970   3.90 7.94e-03     1.2733
## 2 12.48    53.4   10.98   0.431  1.502 0.01217   3.92 9.20e-04     0.3866
## 3 11.16    20.6    9.42   0.502  1.740 0.01649   3.92 1.69e-03     0.4486
## 4 22.20   181.8   17.08   0.277  5.119 0.00501   3.90 4.34e-03     1.3123
## 5 15.48   217.0   18.75   0.297 -3.273 0.00578   3.91 2.05e-03    -0.8393
## 6  8.64    10.4    8.94   0.525 -0.295 0.01805   3.92 5.34e-05    -0.0762
Among the table columns, there are:
- youtube: the invested youtube advertising budget
- sales: the observed sale values
- .fitted: the fitted sale values
- .resid: the residual errors
- …
The following R code plots the residual errors (in red color) between the observed values and the fitted regression line. Each vertical red segment represents the residual error between an observed sale value and the corresponding predicted (i.e. fitted) value.
ggplot(model.diag.metrics, aes(youtube, sales)) +
  geom_point() +
  stat_smooth(method = lm, se = FALSE) +
  geom_segment(aes(xend = youtube, yend = .fitted), color = "red", size = 0.3)
In order to check the regression assumptions, we'll examine the distribution of the residuals.
Regression assumptions
Linear regression makes several assumptions about the data, such as:
- Linearity of the data. The relationship between the predictor (x) and the outcome (y) is assumed to be linear.
- Normality of residuals. The residual errors are assumed to be normally distributed.
- Homogeneity of residuals variance. The residuals are assumed to have a constant variance (homoscedasticity).
- Independence of residual error terms (one way to test this is sketched just after this list).
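The chapter focuses on visual checks of the first three assumptions. As a hedged aside (this test is not part of the original chapter), independence of the error terms can be checked with a Durbin-Watson test from the lmtest package:
# Durbin-Watson test for autocorrelation of the residual errors
# (a small p-value suggests the independence assumption is violated)
library(lmtest)
dwtest(model)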
You should check whether or not these assumptions hold true. Potential problems include:
- Non-linearity of the outcome - predictor relationships
- Heteroscedasticity: non-constant variance of error terms.
- Presence of influential values in the data, which can be:
  - Outliers: extreme values in the outcome (y) variable
  - High-leverage points: extreme values in the predictor (x) variables
All these assumptions and potential problems can be checked by producing some diagnostic plots visualizing the residual errors.
Regression diagnostics {reg-diag}
Diagnostic plots
Regression diagnostic plots can be created using the R base function plot() or the autoplot() function [ggfortify package], which creates ggplot2-based graphics.
- Create the diagnostic plots with the R base function:
par(mfrow = c(2, 2))
plot(model)
- Create the diagnostic plots using ggfortify:
library(ggfortify) autoplot(model)
The diagnostic plots show residuals in four different ways:
- Residuals vs Fitted. Used to check the linear relationship assumption. A horizontal line, without distinct patterns, is an indication of a linear relationship, which is good.
- Normal Q-Q. Used to examine whether the residuals are normally distributed. It's good if the residual points follow the straight dashed line.
- Scale-Location (or Spread-Location). Used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity. This is not the case in our example, where we have a heteroscedasticity problem.
- Residuals vs Leverage. Used to identify influential cases, that is, extreme values that might influence the regression results when included in or excluded from the analysis. This plot will be described further in the next sections.
The four plots show the top 3 most extreme data points, labeled with the row numbers of the data in the data set. They might be potentially problematic. You might want to take a close look at them individually to check whether there is anything special about the subject or whether they could simply be data entry errors. We'll discuss this in the following sections.
The metrics used to create the above plots are available in the model.diag.metrics data, described in the previous section.
# Add observation indices and
# drop some columns (.se.fit, .sigma) for simplification
model.diag.metrics <- model.diag.metrics %>%
  mutate(index = 1:nrow(model.diag.metrics)) %>%
  select(index, everything(), -.se.fit, -.sigma)
# Inspect the data
head(model.diag.metrics, 4)
##   index sales youtube .fitted .resid    .hat .cooksd .std.resid
## 1     1  26.5   276.1   21.56   4.96 0.00970 0.00794      1.273
## 2     2  12.5    53.4   10.98   1.50 0.01217 0.00092      0.387
## 3     3  11.2    20.6    9.42   1.74 0.01649 0.00169      0.449
## 4     4  22.2   181.8   17.08   5.12 0.00501 0.00434      1.312
We'll use mainly the following columns:
- .fitted: fitted values
- .resid: residual errors
- .hat: hat values, used to detect high-leverage points (or extreme values in the predictor x variables)
- .std.resid: standardized residuals, which are the residuals divided by their standard errors. Used to detect outliers (or extreme values in the outcome y variable)
- .cooksd: Cook's distance, used to detect influential values, which can be an outlier or a high leverage point
In the following sections, we'll describe in detail how to use these graphs and metrics to check the regression assumptions and to diagnose potential problems in the model.
Linearity of the data
The linearity assumption can be checked by inspecting the Residuals vs Fitted plot (1st plot):
plot(model, 1)
Ideally, the residual plot will show no fitted pattern. That is, the red line should be approximately horizontal at zero. The presence of a pattern may indicate a problem with some aspect of the linear model.
In our example, there is no pattern in the residual plot. This suggests that we can assume a linear relationship between the predictors and the outcome variables.
Note that, if the residual plot indicates a non-linear relationship in the data, then a simple approach is to use non-linear transformations of the predictors, such as log(x), sqrt(x) and x^2, in the regression model; a sketch follows.
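For instance, a minimal sketch of such a transformation (the quadratic model below is illustrative and not part of the original analysis):
# Fit a quadratic alternative and re-inspect the Residuals vs Fitted plot
model.poly <- lm(sales ~ poly(youtube, 2, raw = TRUE), data = marketing)
plot(model.poly, 1)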
Homogeneity of variance
This assumption can be checked by examining the scale-location plot, also known as the spread-location plot.
plot(model, 3)
This plot shows whether the residuals are spread equally along the ranges of the predictors. It's good if you see a horizontal line with equally spread points. In our example, this is not the case.
It can be seen that the variability (variance) of the residual points increases with the value of the fitted outcome variable, suggesting non-constant variance in the residual errors (or heteroscedasticity).
A possible solution to reduce the heteroscedasticity problem is to use a log or square root transformation of the outcome variable (y).
model2 <- lm(log(sales) ~ youtube, data = marketing)
plot(model2, 3)
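If you want a formal check in addition to the plot (an aside on our part; the original chapter relies on plots only), one option is the Breusch-Pagan test from the lmtest package:
# Breusch-Pagan test applied to the original model
# (a small p-value suggests heteroscedasticity)
library(lmtest)
bptest(model)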
Normality of residuals
The QQ plot of residuals can be used to visually check the normality assumption. The normal probability plot of residuals should approximately follow a straight line.
In our example, all the points fall approximately along this reference line, so we can assume normality.
plot(model, 2)
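A numerical complement to the QQ plot, offered here as a sketch rather than as part of the original chapter, is the Shapiro-Wilk test on the residuals:
# Shapiro-Wilk normality test on the residual errors
# (a large p-value is consistent with normally distributed residuals)
shapiro.test(residuals(model))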
Outliers and high leverage points
Outliers:
An outlier is a point that has an extreme outcome variable value. The presence of outliers may affect the estimation of the model, because it increases the RSE (residual standard error).
Outliers can be identified by examining the standardized residual (or studentized residual), which is the residual divided by its estimated standard error. Standardized residuals can be interpreted as the number of standard errors away from the regression line.
Observations whose standardized residuals are greater than 3 in absolute value are possible outliers (James et al. 2014).
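As a minimal sketch (base R's rstandard() computes the same standardized residuals as the .std.resid column above):
# Flag observations whose standardized residuals exceed 3 in absolute value
std.res <- rstandard(model)
which(abs(std.res) > 3)  # empty for our model: no such outliers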
High leverage points:
A data point has high leverage if it has extreme predictor x values. This can be detected by examining the leverage statistic or the hat-value. A value of this statistic above 2(p + 1)/n indicates an observation with high leverage (P. Bruce and Bruce 2017), where p is the number of predictors and n is the number of observations.
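A hedged sketch of this rule with base R's hatvalues(), assuming the model and data from above:
# Flag high-leverage points: hat value above 2(p + 1)/n
p <- length(coef(model)) - 1  # number of predictors (here 1)
n <- nrow(marketing)          # number of observations (here 200)
which(hatvalues(model) > 2 * (p + 1) / n)  # empty here: all hat values are below 0.02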
Outliers and high leverage points can be identified by inspecting the Residuals vs Leverage plot:
plot(model, 5)
The plot above highlights the top 3 most extreme points (#26, #36 and #179), with standardized residuals below -2. However, there are no outliers that exceed 3 standard deviations, which is good.
Additionally, there are no high leverage points in the data. That is, all data points have a leverage statistic below 2(p + 1)/n = 4/200 = 0.02.
Influential values
An influential value is a value whose inclusion or exclusion can alter the results of the regression analysis. Such a value is associated with a large residual.
Not all outliers (or extreme data points) are influential in linear regression analysis.
Statisticians have developed a metric called Cook's distance to determine the influence of a value. This metric defines influence as a combination of leverage and residual size.
A rule of thumb is that an observation has high influence if Cook's distance exceeds 4/(n - p - 1) (P. Bruce and Bruce 2017), where n is the number of observations and p the number of predictor variables.
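In code, the rule of thumb reads as follows (a sketch; note that this numeric cutoff is stricter than the Cook's distance lines at 0.5 and 1 that plot() draws by default):
# Observations whose Cook's distance exceeds 4/(n - p - 1)
cd <- cooks.distance(model)
n <- nrow(marketing)
p <- length(coef(model)) - 1
which(cd > 4 / (n - p - 1))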
The Residuals vs Leverage plot can help us to find influential observations, if any. On this plot, outlying values are generally located at the upper right corner or at the lower right corner. Those spots are the places where data points can be influential against a regression line.
The following plots illustrate the Cook's distance and the leverage of our model:
# Cook's distance
plot(model, 4)
# Residuals vs Leverage
plot(model, 5)
By default, the top 3 most extreme values are labelled on the Cook's distance plot. If you want to label the top 5 extreme values, specify the option id.n as follows:
plot(model, 4, id.n = 5)
If you want to look at these top 3 observations with the highest Cook's distance in case you want to assess them further, type this R code:
model.diag.metrics %>% top_n(3, wt = .cooksd)
##   index sales youtube .fitted .resid   .hat .cooksd .std.resid
## 1    26  14.4     315    23.4  -9.04 0.0142  0.0389      -2.33
## 2    36  15.4     349    25.0  -9.66 0.0191  0.0605      -2.49
## 3   179  14.2     332    24.2 -10.06 0.0165  0.0563      -2.59
When data points have high Cook's distance scores and are to the upper or lower right of the leverage plot, they have leverage, meaning they are influential to the regression results. The regression results will be altered if we exclude those cases.
In our example, the data don't present any influential points. Cook's distance lines (red dashed lines) are not shown on the Residuals vs Leverage plot because all points are well inside the Cook's distance lines.
Let's now show another example, where the data contain two extreme values with potential influence on the regression results:
df2 <- data.frame(
  x = c(marketing$youtube, 500, 600),
  y = c(marketing$sales, 80, 100)
)
model2 <- lm(y ~ x, df2)
Create the Cook's distance and Residuals vs Leverage plots of this new model:
# Cook's distance
plot(model2, 4)
# Residuals vs Leverage
plot(model2, 5)
On the Residuals vs Leverage plot, look for data points outside of the dashed Cook's distance lines. When points are outside of the Cook's distance lines, this means that they have high Cook's distance scores. In this case, the values are influential to the regression results. The regression results will be altered if we exclude those cases.
In the above example 2, two data points are far beyond the Cook's distance lines. The other residuals appear clustered on the left. The plot identifies the influential observations as #201 and #202. If you exclude these points from the analysis, the slope coefficient changes from 0.06 to 0.04 and R2 from 0.5 to 0.6. Pretty big impact!
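A minimal sketch of how this comparison could be reproduced (the exact values you get may differ slightly from the rounded figures quoted above):
# Refit the second model without the two influential observations (#201, #202)
model2.clean <- lm(y ~ x, data = df2[-c(201, 202), ])
coef(model2)                     # slope with the influential points included
coef(model2.clean)               # slope without them
summary(model2)$r.squared        # R2 with the influential points included
summary(model2.clean)$r.squared  # R2 without them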
Discussion
This chapter describes linear regression assumptions and shows how to diagnose potential problems in the model.
The diagnostic is essentially performed by visualizing the residuals. Having patterns in the residuals is not a stopping point. Your current regression model might not be the best way to understand your data.
Potential problems might be:
- A non-linear relationship between the outcome and the predictor variables. When facing this problem, one solution is to include a quadratic term, such as polynomial terms, or a log transformation. See Chapter @ref(polynomial-and-spline-regression).
- Existence of important variables that you left out of your model. Other variables you didn't include (e.g., age or gender) may play an important role in your model and data. See Chapter @ref(confounding-variables).
- Presence of outliers. If you believe that an outlier has occurred due to an error in data collection and entry, then one solution is to simply remove the concerned observation.
References
Bruce, Peter, and Andrew Bruce. 2017. Practical Statistics for Data Scientists. O'Reilly Media.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company, Incorporated.
Source: http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/