This article was first published by IBM Developer at, but authored by Casper Hansen. Here is the Direct link.

Linear Regression is famously known for being a simple algorithm and a good baseline to compare more complex models to. In this article, we explore the algorithm and turn the math into code, and then we run the code on a dataset, to get predictions on new data.

Table Of Contents (Click To Scroll)

  1. What Is Linear Regression?
  2. Multiple Linear Regression
  3. Special Case 1: Simple Linear Regression
  4. Special Case 2: Polynomial Regression

What Is Linear Regression?

The Linear Regression model consists of one equation of linearly increasing variables (also called parameters or features), along with a coefficient estimation algorithm called least squares, which attempts to figure out the best possible coefficient given a variable.

Linear regression models are known to be simple and easy to implement, because there is no advanced mathematical knowledge needed, except for a bit of linear algebra. For this reason, many people choose to use a linear regression model as a baseline model, to compare if another model can outperform such a simple model.

Multiple Linear Regression

Multiple linear regression is a model that can capture the a linear relationship between multiple variables/features – assuming that there is one. The general formula for multiple linear regression looks like the following:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_i x_i + \varepsilon $$
  • $\beta_0$ is known as the intercept
  • $\beta_1$ to $\beta_i$ are known as coefficients
  • $x_1$ to $x_i$ are the features of our dataset
  • $\varepsilon$ are the residual terms

We can also represent the formula for linear regression in vector notation. When representing the formula in vector notation, we have the advantage of using some operations from linear algebra, which in turn makes it easier to code.

$$ \mathbf {y} = {\begin{bmatrix} y_{1}\\y_{2}\\\vdots \\y_{n} \end{bmatrix}},\quad {\displaystyle X={ \begin{bmatrix} \mathbf {x} _{0}^{\mathsf {T}}\\ \mathbf {x} _{1}^{\mathsf {T}}\\ \vdots \\ \mathbf {x} _{n}^{\mathsf {T}} \end{bmatrix}} ,\quad } {\displaystyle {\boldsymbol {\beta }}={ \begin{bmatrix} \beta _{0}\\ \beta _{1}\\ \beta _{2}\\ \vdots \\ \beta _{n} \end{bmatrix}},\quad {\boldsymbol {\varepsilon }}={ \begin{bmatrix} \varepsilon _{1}\\ \varepsilon _{2}\\ \vdots \\ \varepsilon _{n} \end{bmatrix}}} $$

Note that the $x_0^T$ vector contains just a series of 1's: [1, 1, ..., 1]. This turns our equation into something much more compact, where all our terms are now represented as matrices. The following equation shows that we can compute the output value for all y, given that we have an estimation of the coefficients $\boldsymbol{\beta }$.

$$ \mathbf{y} = \mathbf{X} \boldsymbol{\beta } + \boldsymbol{\varepsilon} $$

But how exactly can we estimate the coefficient values?

Ordinary Least Squares

Linear least squares (LLS) is the main algorithm for estimating coefficients of the one formula just presented. In Machine Learning language, this is known as fitting your model to the dataset. We will focus on the most popular variant called Ordinary Least Squares (OLS).

The OLS algorithm minimizes the sum of squares of residuals.

One residual is the distance from one data point to the line in 2D or plane in 3D, so when we minimize the total sum of the squared residuals, we are minimizing the average distance from a data point to the line we are fitting to.

The formula for the algorithm can be quite intimidating, if one is not familiar with linear algebra: permuted matrices, dimensionality, dot product and the inverse of a matrix. This is not a topic for this article, but read Inverting A Matrix and the full derivation of the formula by Arthur S. Goldberger, also called the normal equation, if you want to be more knowledgeable on those topics.

The following formula ensures that the resulting coefficients defines a minimum for the normal equation, which means that the result is the minimized total sum of squared residuals:

$$ {\displaystyle {\hat {\boldsymbol {\beta }}}=(\mathbf {X} ^{\rm {T}}\mathbf {X} )^{-1}\mathbf {X} ^{\rm {T}}\mathbf {y}} $$

The formula can be coded in one line of code, because it's just a few operations. We will see that later on in the coding section.

The outcome of the algorithm, beta hat $\boldsymbol{\hat{\beta}}$, is a vector containing all the coefficients, that can be used to make predictions using the formula presented in the beginning for multiple linear regression. To give an example in 3D, we could have this set of coefficients [2.1, 5.3, 9.2], which can be plugged into the equation for multiple linear regression:

$$ y = 2.1 + 5.3 \times x_1 + 9.2 \times x_2 $$

This means the intercept is at 2.1, while there are two variables that have their estimated coefficients as 5.3 and 9.2. To estimate an output variable y, one would need to input two variables x1 and x2 into the equation, and then we have made a prediction.

We will see examples for this in the coding section, though there is one important thing missing; assuming we have a number of predictions for some observed data, how can we measure how well the model predicted the ground truth?

R Squared: Accuracy Of The Multiple Linear Regression Model

We can measure the accuracy of how well the multiple linear regression model performs by a metric called r squared.

$$ R^2 = 1 - \frac { \sum_{i=0}^{n} (y_i - \hat{y})^2 } { \sum_{i=0}^{n} (y_i - \bar{y})^2 } $$
  • $y_i$ is one value of y at index i.
  • $\hat{y}$ is pronounced as y hat, and it is the predicted values of y
  • $\bar{y}$ is pronounced as y bar, and it is the average of y

The metric measures the relationship between the residual sum of squares (RSS) and the total sum of squares (TSS). The RSS is computed as the ground truth minus the predicted ground truth, while the TSS is computed as the ground truth minus the average of the ground truth.

Code For Multiple Linear Regression

We have made a single class called MultipleLinearRegression in a file, that represents all of what we have talked about until this point.

import numpy as np
import copy

class MultipleLinearRegression():
    def __init__(self):
        self.coefficients = None
        self.intercept = None

    def fit(self, x, y):
        # prepare x and y values for coefficient estimates
        x = self._transform_x(x)
        y = self._transform_y(y)

        betas = self._estimate_coefficients(x, y)
        # intercept becomes a vector of ones
        self.intercept = betas[0]

        # coefficients becomes the rest of the betas
        self.coefficients = betas[1:]

    def predict(self, x):
            y = b_0 + b_1*x + ... + b_i*x_i
        predictions = []
        for index, row in x.iterrows():
            values = row.values

            pred = np.multiply(values, self.coefficients)
            pred = sum(pred)
            pred += self.intercept


        return predictions

    def r2_score(self, y_true, y_pred):
            r2 = 1 - (rss/tss)
            rss = sum_{i=0}^{n} (y_i - y_hat)^2
            tss = sum_{i=0}^{n} (y_i - y_bar)^2
        y_values = y_true.values
        y_average = np.average(y_values)

        residual_sum_of_squares = 0
        total_sum_of_squares = 0

        for i in range(len(y_values)):
            residual_sum_of_squares += (y_values[i] - y_pred[i])**2
            total_sum_of_squares += (y_values[i] - y_average)**2

        return 1 - (residual_sum_of_squares/total_sum_of_squares)

    def _transform_x(self, x):
        x = copy.deepcopy(x)
        x.insert(0, 'ones', np.ones( (x.shape[0], 1) ))
        return x.values

    def _transform_y(self, y):
        y = copy.deepcopy(y)
        return y.values

    def _estimate_coefficients(self, x, y):
            β = (X^T X)^-1 X^T y
            Estimates both the intercept and all coefficients.
        xT = x.transpose()

        inversed = np.linalg.inv( )
        coefficients = xT ).dot(y)

        return coefficients

We needed a dataset to put our new multiple linear regression algorithm to use, so we loaded in a dataset called Boston Housing Prices from Scikit-Learn. We split the dataset into inputs x and outputs y, and we further split x and y into data we train on and data we test on. This is in a file.

from sklearn.datasets import load_boston
import pandas as pd
from sklearn.model_selection import train_test_split

def sklearn_to_df(data_loader):
    X_data =
    X_columns = data_loader.feature_names
    X = pd.DataFrame(X_data, columns=X_columns)

    y_data =
    y = pd.Series(y_data, name='target')

    return x, y

x, y = sklearn_to_df(load_boston())

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

The following is the start of the file for running our multiple linear regression.

  1. Load all dependencies and data
  2. Initialize the class MultipleLinearRegression() as an object
  3. Fit the object to our data by, y_train)
  4. Make predictions on a test set that is unseen to the algorithm by mlr.predict(x_test).
  5. Compute the r squared score by mlr.r2_score(y_test, pred).
from load_dataset import x_train, x_test, y_train, y_test
from multiple_linear_regression import MultipleLinearRegression
from sklearn.linear_model import LinearRegression

# # # # # # # # # # # # # # # # #
# Our Multiple Linear Regression #
# # # # # # # # # # # # # # # # #
mlr = MultipleLinearRegression()

# fit our LR to our data, y_train)

# make predictions and score
pred = mlr.predict(x_test)

# calculate r2_score
score = mlr.r2_score(y_test, pred)
print(f'Our Final R^2 score: {score}')

The output score turns out to be:

Our Final R^2 score: 0.6687594935355938

Although we did just implement multiple linear regression, let's compare how well our implementation works in comparison to Scikit-Learn's linear regression:

# # # # # # # # # # # # # # # # # #
# Scikit-Learn's Linear Regression #
# # # # # # # # # # # # # # # # # #
sk_mlr = LinearRegression()

# fit scikit-learn's LR to our data, y_train)

# predicts and scores
sk_score = sk_mlr.score(x_test, y_test)
print(f'Scikit-Learn\'s Final R^2 score: {sk_score}')

The output from their model to ours is almost identical, except for after the 13th decimal:

Scikit-Learn's Final R^2 score: 0.6687594935356329

Special Case 1: Simple Linear Regression

Simple Linear Regression can be expressed in one simple equation

$$ y = \text{intercept} + \text{coefficient} \times x_{value} $$

The intercept is often known as beta zero $\beta_0$ and the coefficient as beta 1 $\beta_1$. The equation is equal to the equation for a straight line

$$ y = \beta_0 + \beta_1 x_1 = mx + b $$

That is all there is to a simple linear regression equation, though, how do we determine the intercept and coefficient? This is where we introduce the least squares algorithm.

Ordinary Least Squares

Ordinary Least Squares is known to minimize the sum of squared residuals (SSR). There are two central parts to ordinary least squares in this special case: estimating the coefficients and estimating the intercept.

Estimating The Coefficient

We can estimate the coefficient (the slope) by finding the covariance between x and y, and dividing it by the variance of x. It should be noted that this is a special case of linear regression, derived using calculus from the formula as seen in multiple linear regression earlier on in this article.

$$ \text{coefficient} = \beta_1 = \frac {cov(x,y)} {var(x)} $$

This turns into something a little bit more complex. Any time we have a summation symbol $\sum$, the equivalent in programming is summing the elements of an array. We start by adding the first element to the sum where index i=0, and we go all the way to the length of the array n. The covariance and variance can be expanded to the following formula:

$$ \text{coefficient} = \beta_1 = \frac {\sum_{i=0}^{n} (x_i - \bar{x})(y_i - \bar{y})} {\sum_{i=0}^{n}(x_i - \bar{x})^2 } $$
  • $x_i$ is one value of x at index i.
  • $y_i$ is one value of y at index i.
  • $\bar{x}$ is pronounced as x bar, and it is the average of x
  • $\bar{y}$ is pronounced as y bar, and it is the average of y

Estimating The Intercept

The estimate of the intercept $\beta_0$ should be easier to understand than the estimate of the coefficient $\beta_1$.

$$ \beta_0 = \bar{y} - \beta_1 \times \bar{x} $$
  • $\bar{x}$ is pronounced as x bar, and it is the average of x
  • $\bar{y}$ is pronounced as y bar, and it is the average of y
  • $\beta_1$ is the coefficient we estimated from earlier

Code For Simple Linear Regression

We have made a single class called SimpleLinearRegression in a file, that implements the special case of simple linear regression.

import numpy as np

class SimpleLinearRegression():
    def __init__(self):
        self.coefficient = None
        self.intercept = None

    def fit(self, x, y):
            Given a dataset with 1 input feature x and output feature y,
            estimates the coefficient and compute the intercept.
        self.coefficient = self._coefficient_estimate(x, y)
        self.intercept = self._compute_intercept(x, y)

    def predict(self, x):
            y = b_0 + b_1*x
        x_times_coeff = np.multiply(x, self.coefficient)
        return np.add(x_times_coeff, self.intercept)

    def r2_score(self, y_true, y_pred):
            r2 = 1 - (rss/tss)
            rss = sum_{i=0}^{n} (y_i - y_hat)^2
            tss = sum_{i=0}^{n} (y_i - y_bar)^2
        y_average = np.average(y_true)

        residual_sum_of_squares = 0
        total_sum_of_squares = 0

        for i in range(len(y_true)):
            residual_sum_of_squares += (y_true[i] - y_pred[i])**2
            total_sum_of_squares += (y_true[i] - y_average)**2

        return 1 - (residual_sum_of_squares/total_sum_of_squares)

    def _compute_intercept(self, x, y):
            intercept = y_bar - coefficient*x_bar
            WHERE:  y_bar = average(y),
                    x_bar = average(x),
                    coefficient = coefficient already estimated
        # find the average of the array x
        x_average = np.average(x)

        # multiply the coefficient and the average of the x values
        mul = self.coefficient*x_average

        return np.average(y) - mul

    def _coefficient_estimate(self, x, y):
            coefficient_{x,y} = ∑_{i=0}^{n} (x_i - x_bar) * (y_i - y_bar)
                                ∑_{i=0}^{n} (x_i - x_bar)^2
        numerator = 0
        denominator = 0

        for i in range(len(x)):
            x_value = x[i]
            y_value = y[i]
            x_average = np.average(x)
            y_average = np.average(y)

            numerator += (x_value - x_average) * (y_value - y_average)
            denominator += (x_value - x_average)**2

        return numerator / denominator

As shown in the Code For Multiple Linear Regression section, we load in a dataset from Scikit-Learn.

The following is our file that loads the dataset, picks one feature and runs the simple linear regression model. After experimentation, we found that the feature called LSTAT performed the best in terms of the r2 score.

from load_dataset import x_train, x_test, y_train, y_test
from simple_linear_regression import SimpleLinearRegression

# pick a single feature to estimate y
x_train = x_train['LSTAT'].values
x_test = x_test['LSTAT'].values
y_train = y_train.values
y_test = y_test.values

# fit to data
slr = SimpleLinearRegression(), y_train)

# make predictions and score
pred = slr.predict(x_test)
score = slr.r2_score(y_test, pred)
print(f'Final R^2 score: {score}')

We did conveniently pick the feature that gave us the highest r squared score, but comparatively to the multiple linear regression, our model is not that far off:

Final R^2 score: 0.5429180422970383

Special Case 2: Polynomial Regression

Another case of multiple linear regression is polynomial regression, which might look like the following formula:

$$ {\displaystyle y_{i}\,=\,\beta _{0}+\beta _{1}x_{i}+\beta _{2}x_{i}^{2}+\cdots +\beta _{m}x_{i}^{m}+\varepsilon _{i}\ (i=1,2,\dots ,n)} $$

The formula is flexible in the exponents, so that it can be changed to model specific problems better. The code for polynomial regression is the same as for multiple linear regression, except for the predict function. We increase the degree of the exponent by one for each feature and use those as our values instead.

def predict_polynomial(self, x):
        y = β0 + β1*x + β2*x^2 + ... + βm*x_i^m
    predictions = []
    for index, row in x.iterrows():
        # treating each feature as a variable that needs to be raised to the power of m
        polynomial_values = [feature**i+1 for i, feature in enumerate(row.values)]

        pred = np.multiply(polynomial_values, self.coefficients)
        pred = sum(pred)
        pred += self.intercept


    return predictions

Though, one of the drawbacks of polynomial regression is that you have to find out which formula might work for you, and that gets quite hard as the number of variables grow, since we don't really have much intuition for what a 4-, 7- or 100-dimensional space looks like.

Perhaps a naive solution to the problem, is finding the best polynomial by brute force; trying all different permutations that you can think of, e.g. with a varying degree from 1 to 5.


  • One very in-depth explanation of the mathematics by Martina Bremer from Cornell University – recommended if you are somewhat fluent in Linear Algebra.
  • Thumbnail/GIF credits to assessingpsyche.