Linear regression is where most people start with machine learning models. This post introduces both linear and logistic regression.

Table of Contents

1. Linear Regression
2. Tips for Linear Regression
3. Logistic Regression
4. Maximum Likelihood for Logistic Regression
5. Code for Linear Regression
6. Code for Logistic Regression

Linear Regression (Least Squares Regression)

First of all, what is regression? When doing regression, we map an input to an output. What we are actually doing is estimating the relationship between the variables in our dataset. Maybe you have seen a plot like this before, where we plot some data points.

When doing linear regression, we simply find the line that fits best through our data points. What does it mean that it fits best? It means that, of all the lines we could draw, it is the one with the smallest total squared vertical distance from the data points to the line, so we find the line that sits right in the middle of our data. It looks like this:

Now that we have covered the intuition, let's explain it in math. The line above is described by an equation. I include two versions that are equal, just written with different symbols:

$$ y=mx+b \Leftrightarrow y=w_0+w_1 x $$

I'm going to continue using the right one, since the rest of this post is written in terms of $w$, but just know that they are the same.

$w_0$ (or $b$) is the intercept term, meaning where the line through the data points intercepts the y-axis. $w_1$ (or $m$) is the coefficient or slope, which is multiplied by our input $x$. I'm not diving into how you actually calculate each term, but I will refer you to Khan Academy if you want to learn that.
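For a single input, the least squares slope and intercept have a simple closed form: the slope is how much $x$ and $y$ vary together divided by how much $x$ varies, and the intercept follows from the two means. Here is a minimal sketch with NumPy. The data points are made up for illustration and are not from this post.

```python
import numpy as np

# Illustrative data: y is exactly 1 + 2x, so we know what to expect
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

x_mean, y_mean = x.mean(), y.mean()

# Slope: how x and y vary together, divided by how much x varies
w1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# Intercept: the fitted line passes through (mean of x, mean of y)
w0 = y_mean - w1 * x_mean

print(f"intercept w0 = {w0:.2f}, slope w1 = {w1:.2f}")
# prints: intercept w0 = 1.00, slope w1 = 2.00
```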

Let's step into deeper waters, where we generalize this formula to $M$ input features, each with its own coefficient $w_1, \dots, w_M$. We could first imagine the 2-dimensional input where $M=2$, which looks like $y=w_0+w_1 x_1 +w_2 x_2$, but let's also generalize it:

$$ y=w_0+w_1 x_1 +w_2 x_2+...+w_M x_M $$
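In code, the generalized formula is just an intercept plus a dot product between the feature values and the weights. The sketch below uses hypothetical weights and inputs for $M=3$, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical weights for M = 3 features; w0 is the intercept
w0 = 0.5
w = np.array([2.0, -1.0, 0.25])   # w_1, w_2, w_3

# Two observations (rows), each with M = 3 feature values (columns)
X = np.array([
    [1.0, 2.0, 4.0],
    [3.0, 1.0, 0.0],
])

# y = w0 + w_1*x_1 + ... + w_M*x_M, computed for every row at once
y = w0 + X @ w
print(y)  # [1.5 5.5]
```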

As mentioned in the intuition part, we want the distance from each actual data point to the predicted value on the line to be as small as possible across all points. The distance between an actual value and its predicted value is called the residual, and calculating it is referred to as calculating the residual error.

For each blue data point, there is a distance indicated by the red line. The sum of the squares of these distances over all points is what we are trying to minimize, which is why this is referred to as least squares regression. From the fit we also get an $R^2$ value, which is best when it approaches the value 1. For more on the calculations, watch this Khan Academy video.
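To make the residuals and the $R^2$ value concrete, here is a small sketch that computes both for one candidate line. The data points and the line's parameters are illustrative values, not taken from the plot.

```python
import numpy as np

# Illustrative data points and a candidate line y = w0 + w1*x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 3.9, 6.1, 8.0, 9.8])
w0, w1 = 0.2, 1.95

y_pred = w0 + w1 * x
residuals = y - y_pred                  # one vertical distance per data point

ss_res = np.sum(residuals ** 2)         # sum of squared residuals (what least squares minimizes)
ss_tot = np.sum((y - y.mean()) ** 2)    # squared spread of y around its own mean
r2 = 1.0 - ss_res / ss_tot              # close to 1 when the line explains most of the spread

print(f"sum of squared residuals = {ss_res:.4f}, R^2 = {r2:.4f}")
```

An $R^2$ of exactly 1 would mean every residual is zero, so every point lies on the line.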

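One way to picture least squares regression is that we make an initial guess at a line that might fit well, calculate the $R^2$ value, then try another line with a better fit, and keep going until we reach the best fit. In practice the best line does not have to be searched for: least squares has a closed-form solution, so it can be computed directly from the data.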

As a final remark, there are no random parameters in this algorithm, which means that least squares linear regression produces the same $R^2$ value every time you run it. For the same data points, you will always get the same line, because it is the best fitting one.
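This is easy to check. The sketch below fits the same data twice with scikit-learn and compares the results, then solves the same least squares problem directly with `np.linalg.lstsq` to show that no random search is involved. The dataset is generated here only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)            # the data is random, the fitting is not
X = rng.uniform(0, 10, size=(50, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(0, 0.5, size=50)

# Fit the same data twice
first = LinearRegression().fit(X, y)
second = LinearRegression().fit(X, y)
same_line = bool(np.allclose(first.coef_, second.coef_)
                 and np.isclose(first.intercept_, second.intercept_))
same_r2 = bool(np.isclose(first.score(X, y), second.score(X, y)))

# Solve the least squares problem directly; the column of ones stands for w0
A = np.column_stack([np.ones(len(X)), X[:, 0]])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
matches_direct = bool(np.allclose(w, [first.intercept_, first.coef_[0]]))

print(same_line, same_r2, matches_direct)  # True True True
```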

Final tips for linear regression

As a final note on linear regression, I want to include some information about what to be careful with when using the algorithm on data.

Could you imagine if you had a dataset with a few outliers? Least squares regression is susceptible to outliers: because the distances are squared, a few extreme points have a large impact on where the line ends up. If your dataset isn't that big either, it is even more susceptible to outliers, making the linear regression algorithm perform poorly.
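A quick way to see the effect is to fit the same small dataset with and without one extreme point. This is a sketch on made-up data, using `np.polyfit` with degree 1 as the least squares fit.

```python
import numpy as np

def fit_line(x, y):
    """Least squares intercept and slope for one input variable."""
    w1, w0 = np.polyfit(x, y, deg=1)   # polyfit returns the highest degree first
    return w0, w1

x = np.arange(10, dtype=float)
y = 1.0 + 2.0 * x                      # ten points lying exactly on a line

w0_clean, w1_clean = fit_line(x, y)

y_outlier = y.copy()
y_outlier[-1] += 50.0                  # push a single point far away from the line
w0_out, w1_out = fit_line(x, y_outlier)

print(f"without outlier: slope = {w1_clean:.2f}")
print(f"with one outlier: slope = {w1_out:.2f}")
```

With only ten points, that one outlier more than doubles the slope.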

When performing linear regression, we have to be wary of overfitting. It is easy to make the model too complicated by adding more and more terms to the formula $y=w_0+w_1 x_1 +w_2 x_2+...+w_M x_M$. If a model with fewer input variables performs better on new data, that is a clear indication that the larger model was overfitting.

Finally, always remember to evaluate your model using new data that you did not train your model on. If you evaluate on the same data as you trained your model on, you get an overly optimistic picture, and you could pick the wrong model for new data.
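The last two tips can be tried out together. The sketch below compares a model with one meaningful input against a model that also gets 20 columns of pure noise, and scores both on the training rows and on held-out rows. The dataset, the seed and the number of noise columns are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)   # fixed seed, only for reproducibility
n = 60

# One input that really drives y, plus 20 columns that carry no information
x_real = rng.uniform(0, 10, size=(n, 1))
x_noise = rng.normal(size=(n, 20))
y = 3.0 + 1.5 * x_real[:, 0] + rng.normal(0, 2.0, size=n)

X_small = x_real
X_big = np.hstack([x_real, x_noise])

# Hold out half of the rows; the models never see them during fitting
train_idx, test_idx = train_test_split(np.arange(n), test_size=0.5, random_state=0)

results = {}
for name, X in [("1 input", X_small), ("21 inputs", X_big)]:
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    results[name] = {
        "train_r2": model.score(X[train_idx], y[train_idx]),
        "test_r2": model.score(X[test_idx], y[test_idx]),
    }
    print(name, results[name])
```

The training $R^2$ can only go up when inputs are added, so it cannot tell you when to stop adding them. The held-out score is the one to compare, and it typically gets worse for the model with the noise columns.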

Logistic Regression

The way logistic regression works will remind you of linear regression, except that we don't use least squares to find the best fit. Instead we use maximum likelihood: we pick the curve under which the observed data is most likely.

In other words, if you plot any data point in a logistic plot, it will have some measure along the x-axis, and on the y-axis the probability of the outcome being true for that measure, where 1 means true and 0 means false.

Logistic regression is almost always used for classification, and that is the typical use-case. However, it can also be used to do regression, though that is not common at all, and there are probably other regression methods that will perform better than this one.

Example of logistic regression

In linear regression, we fit a straight line through the data, but in logistic regression, we fit a curve that looks sort of like an s. It will probably remind you of the sigmoid function, if you have ever heard of that. So we have this s-curve that goes from 0 to 1 or from 1 to 0, depending on the variable on the x-axis.
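For reference, the sigmoid function itself is a one-liner. This sketch evaluates it on a few illustrative inputs to show the s-shape: large negative inputs end up near 0, large positive inputs near 1, and 0 maps to exactly 0.5.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
values = sigmoid(z)
print(np.round(values, 3))
```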

As you can read from this graph, the probability of passing an exam increases with the number of hours spent studying. Usually though, you have a new point that you want to classify. A common basic rule is to 'trail' the new point from the x-axis up to the curve, meaning you draw a vertical line from the point to the curve, and read off the probability. If it is above 50%, passing is more likely than failing, so we say that you are going to pass. If it is below 50%, we say that you are not going to pass.

What I just told you is more of a deterministic look at the curve than a probabilistic approach. Instead of making such a basic rule, we could simply say that a student who studied for 4 hours has the probability 0.875 (87.5%) of passing, simply by eyeballing where such a new point would be trailed to the curve.
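Both views are easy to express in code once the curve is known. In the sketch below the intercept `b` and slope `m` are hypothetical values for an already fitted curve. They are not read from the graph above, so the probabilities will not match the 0.875 exactly.

```python
import numpy as np

# Hypothetical parameters of an already fitted s-curve (assumed for illustration)
b, m = -4.0, 1.5

def pass_probability(hours):
    """Probability of passing, given the number of hours studied."""
    return 1.0 / (1.0 + np.exp(-(b + m * hours)))

hours = np.array([1.0, 2.0, 3.0, 4.0])

proba = pass_probability(hours)   # probabilistic view: report the probability itself
will_pass = proba >= 0.5          # basic rule: 50% or more means "pass"

for h, p, w in zip(hours, proba, will_pass):
    print(f"{h:.0f} hours: p = {p:.3f}, predicted pass = {w}")
```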

Logistic Regression — Maximum Likelihood revisited

You might say, well how did the curve get there in the first place? This is what I was talking about at the beginning: a concept called maximum likelihood. You pick a value $\theta$ for the parameter of your model, then pick something you want to predict, e.g. the likelihood of passing an exam. You would then calculate the likelihood of each point under that $\theta$ and multiply them together. Finally, you could plot the result for many values of $\theta$ to show which one resulted in the biggest likelihood for all points.
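As a minimal sketch of that idea, take the simplest possible model, where $\theta$ is just the probability of passing and does not depend on anything else. The outcomes below are made up for illustration. We try a grid of $\theta$ values, multiply the per-point likelihoods together, and keep the $\theta$ with the largest product.

```python
import numpy as np

# Illustrative outcomes: 1 = passed, 0 = failed (7 of 10 passed)
outcomes = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])

# Candidate values 0.01, 0.02, ..., 0.99
thetas = np.linspace(0.01, 0.99, 99)

# The likelihood of one point is theta if it passed and (1 - theta) if it failed;
# the likelihood of the whole dataset is the product over all points
likelihoods = np.array([
    np.prod(np.where(outcomes == 1, theta, 1.0 - theta)) for theta in thetas
])

best_theta = thetas[np.argmax(likelihoods)]
print(f"theta with the biggest likelihood: {best_theta:.2f}")
# prints: theta with the biggest likelihood: 0.70
```

The winner is the observed pass rate, which is what you would expect for this simple model.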

Similarly in logistic regression, we also calculate the maximum likelihood, but in a different way.

  1. Transform the coordinate system so that the y-axis is the log of the odds, $\log\left(\frac{p}{1-p}\right)$, instead of the probability $p$. A probability of 0.5 then sits at zero on the new y-axis (i.e. where the x-axis crosses the y-axis).
  2. Move all data points over to the new coordinate system, for each point: $\log\left(\frac{p}{1-p}\right) = \log(odds)$. Note that all points are at either negative or positive infinity in the new coordinate system, because their probabilities are exactly 0 or 1.
  3. Plot a random line, like in linear regression. Remember the formula $y=b+mx$.
  4. Project all data points onto the line. Since the points are at infinity, we don't have a specific y-value for each point. But we could imagine a step for each point, where the y-value is set to 0, then projected onto the line. The value on the line is the candidate $\log(odds)$ for that point.
  5. Transform the coordinate system back to the old one with probabilities on the y-axis, forming the s-shaped curve. For each point $p=\frac{e^{\log(odds)}}{1+e^{\log(odds)}}$.
  6. Each point is now on the s-curve, and the probabilities can be found by looking at where each point is traced to on the y-axis.
  7. Trace each point to its probability on the y-axis, and add up the logs of these probabilities over all points (using $1-p$ for the points whose true value is 0). This sum is the log of the likelihood of all points together. This is the likelihood estimate.

Steps 3 to 7 are repeated, and each time the algorithm tries to plot a better line than the last one in step 3, by checking whether the likelihood estimate increases.

When this is done, the s-curve with the highest likelihood estimate from step 7 is the one that fits the data best, and that is why we call it the maximum likelihood estimate.
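The steps above can be sketched in a few lines. Each candidate line lives in the log-odds coordinate system, gets transformed into an s-curve, and is scored by adding up log-probabilities. The data and the four candidate lines are hypothetical. A real implementation such as scikit-learn's uses a numerical optimizer to propose better lines instead of a fixed list.

```python
import numpy as np

def log_odds_to_probability(log_odds):
    # Step 5: p = e^z / (1 + e^z), written in an equivalent form
    return 1.0 / (1.0 + np.exp(-log_odds))

def log_likelihood(b, m, x, y):
    # Steps 3 to 6: the line b + m*x gives the log-odds, which become probabilities
    p = log_odds_to_probability(b + m * x)
    # Step 7: add log(p) for points with y = 1 and log(1 - p) for points with y = 0
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Illustrative data: hours studied and whether the exam was passed (1) or not (0)
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

# Hypothetical candidate lines (intercept b, slope m); the first one is flat
candidates = [(0.0, 0.0), (-1.0, 0.5), (-4.0, 1.5), (-8.0, 3.0)]
scores = [log_likelihood(b, m, x, y) for b, m in candidates]

best_b, best_m = candidates[int(np.argmax(scores))]
for (b, m), s in zip(candidates, scores):
    print(f"b = {b:5.1f}, m = {m:3.1f}: log-likelihood = {s:.3f}")
print(f"best candidate: b = {best_b}, m = {best_m}")
```

The flat line predicts 0.5 for every point, so it scores $10 \cdot \log(0.5)$. Any candidate that beats it has learned something from the hours studied.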

GIF'ed from this video.

The GIF gives a clear image of logistic regression. You could simply imagine that we try a new candidate line (right), then transform it (arrow) into the probability plot. We would then get a likelihood estimate for each of the lines, and we could say that the one with the maximum likelihood estimate is the best fitting line.

Code – Linear Regression

First we import everything we need.

# Import all necessary packages
from matplotlib.pyplot import figure, plot, xlabel, ylabel, legend, show
from sklearn.linear_model import LinearRegression
import numpy as np

Then we create a random dataset. You would want to import your own dataset here and define the training and testing data for your linear regression. Here is an example from Scikit-Learn.

# Random Dataset of points trending upwards
N = 100
X = np.array(range(N)).reshape(-1,1)
mean, std = -0.5, 0.1
eps = np.array(std*np.random.randn(N) + mean).reshape(-1,1)
w0 = -0.5
w1 = 0.02
y = w0 + w1*X + eps

Next we define the model, which is going to be Linear Regression. This example just shows you how to use the function from the sklearn import. Note that we train and test upon the same data here, to give an example, which is not something you want to do, when solving problems. Always train and test on different data.

# Define model
model = LinearRegression(fit_intercept=True)

# Fit model to data
model = model.fit(X, y)

# Make predictions
# Note, don't make predictions on the data you trained on (X)
y_est = model.predict(X)
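To check what the model learned, you can compare the fitted slope and intercept with the values used to generate the data. The sketch below repeats the setup from above so it runs on its own. The fixed seed is an addition, there only to make the numbers reproducible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)  # assumption: fixed seed, not part of the original snippet

# Same dataset recipe as above
N = 100
X = np.arange(N).reshape(-1, 1)
mean, std = -0.5, 0.1
eps = (std * np.random.randn(N) + mean).reshape(-1, 1)
w0, w1 = -0.5, 0.02
y = w0 + w1 * X + eps

model = LinearRegression(fit_intercept=True).fit(X, y)

slope = model.coef_[0, 0]        # y is 2-D here, so coef_ has shape (1, 1)
intercept = model.intercept_[0]  # and intercept_ has shape (1,)

print(f"slope = {slope:.4f} (generated with w1 = {w1})")
print(f"intercept = {intercept:.4f} (w0 + noise mean = {w0 + mean})")
```

Note that the fitted intercept lands near -1.0 rather than -0.5: the noise has a mean of -0.5, and that constant shift is absorbed into the intercept.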

Lastly, we visualize the data along with the model we just trained.

# Plot original data and the model output
f = figure()

# Plot datapoints
plot(X, y, '.')

# Plot linear regression
plot(X, y_est, '-')

# Define labels
xlabel('X'); ylabel('y')

# Define 'legend', respectively what the dots and line means
legend(['Training data', 'Linear Regression'])

# Show the plot
show()
Code – Logistic Regression

This example is a copy-paste from sklearn's example. It's a great example on one of the most popular datasets, when learning machine learning, the iris dataset.

As with many algorithms in machine learning, the groundwork has been done for you by scikit-learn. They implement the algorithm, and you import their package, which includes the algorithm, with the line of code from sklearn.linear_model import LogisticRegression. The output is the following:

Here is the code. You would need to adapt it to your dataset:

# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target

logreg = LogisticRegression(C=1e5, solver='lbfgs', multi_class='multinomial')

# Create an instance of Logistic Regression Classifier and fit the data.
logreg.fit(X, Y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
h = .02  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())