We want to be able to say something about our data using statistical measures, because they carry important information. There are many ways to use statistics in machine learning, and the mean, variance and standard deviation are just some of them. I briefly explained some of this in my last post, but I wanted to put it more formally here, both to gain a better understanding myself and to share what I learned.

Also, don't miss out on the code to fix missing values in your data further down in this post.

What is the mean, variance and standard deviation?

Let's start with the formulas and a short explanation of each.

The mean of some numbers is something that probably everyone knows about: you simply add all the numbers together and divide by how many there are. It really just gives the average of some data, and it can be expressed by this simple formula. The mean is usually denoted by $\mu$ (for a population) or $\bar{x}$ (for a sample):

$$ \mu = \frac{1}{n}\sum\limits^{n}_{i=1}x_i $$

This is just a fancy way of saying that we take every observation $x_i$, sum them all, and divide by $n$, the number of observations.
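As a quick sanity check, here is a minimal sketch in Python that computes the mean by hand on a made-up toy array and compares it to NumPy's built-in:

# Mean: sum all observations and divide by their count
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # toy data, just for illustration
mean_by_hand = x.sum() / len(x)
print(mean_by_hand)                 # 5.0
print(np.mean(x))                   # same result with NumPy's built-in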


The variance of data is just a way to measure how spread out the data points are from the average (read: the mean of all data points). You sum over all data points, where you subtract the mean from each data point and square the difference, and then you divide by the number of observations minus one.

The next way of writing the variance is not very intuitive, so feel free to ignore it; it is the standard definition, but you want the intuitive part, right? Still, this is how you might have been shown the formula for variance:

$$ \mathrm{Var}(X) = E\left[(X-E[X])^2\right] $$

Let's translate this into something simpler, perhaps something you might recognize from the last post. Here is the formula that you need to understand:

$$ \mathrm{Var}(x) = \frac{1}{n-1}\sum\limits^{n}_{i=1}(x_i-\mu)^2 $$

As I explained before, we really just sum the squared differences between each data point and the mean, then divide by $n-1$. Don't get too caught up in why we subtract one: dividing by $n-1$ instead of $n$ (known as Bessel's correction) is what makes the result unbiased when estimating from a sample, whereas dividing by $n$ would make it biased.
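Here is a minimal sketch of the same calculation on the toy array from before. Note that NumPy's np.var divides by $n$ by default, so we pass ddof=1 to get the $n-1$ denominator:

# Variance: sum of squared distances from the mean, divided by n - 1
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])                       # toy data
var_by_hand = ((x - x.mean()) ** 2).sum() / (len(x) - 1)
print(var_by_hand)               # 6.666...
print(np.var(x, ddof=1))         # ddof=1 gives the n-1 denominator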


The standard deviation is literally the square root of the variance, nothing more. We don't really need a formula for that, but let me give it anyway. Notice that we denote the standard deviation by sigma ($\sigma$):

$$ \sigma = \sqrt{\mathrm{Var}(x)} $$
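Continuing the sketch above, the standard deviation is just one np.sqrt away (again with ddof=1 for the sample version):

# Standard deviation: the square root of the variance
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
print(np.sqrt(np.var(x, ddof=1)))   # square root of the sample variance
print(np.std(x, ddof=1))            # same result directly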

Why should we care about variance and standard deviation? Well, any dataset you work with in machine learning will inevitably have some spread in it. We want to quantify that spread, and the standard deviation is the standard way of doing so, not least because it is expressed in the same units as the data itself.

What is the median and percentile?

Percentiles have a relation to something called the median, which refers to the middlemost observation of a sorted dataset. If the number of observations $n$ is odd, the median sits at position $(n+1)/2$ in the sorted data. If $n$ is even, we do two calculations, taking the observations at positions $n/2$ and $n/2+1$, and then average those two values to get the median.
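A minimal sketch with two toy arrays, using NumPy's built-in np.median:

# Median: the middle observation of a sorted dataset
import numpy as np

odd = np.array([3, 1, 5])        # sorted: 1 3 5 -> the middle value is 3
even = np.array([4, 1, 3, 2])    # sorted: 1 2 3 4 -> average of 2 and 3
print(np.median(odd))            # 3.0
print(np.median(even))           # 2.5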

The percentile can be explained in percent with a sample $x$ from $n$ observations, where there is some percent $p$. The percentile then gives a value: e.g. $x_{p=60\%}$ means that 60% of the observations have a lower value than the calculated percentile. If you were selling a house, the price of that house could be categorized under some percentile. Say you have $n=1000$ houses sold for some sale price. If we calculate the 99th percentile of those prices and get something like $x_{p=99\%}=1{,}000{,}000$, that means 99 percent of all houses sold for under a million dollars.
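Here is a minimal sketch with made-up sale prices (the distribution and its parameters are arbitrary, just for illustration):

# Percentile: the value below which p percent of the observations fall
import numpy as np

# 1000 made-up sale prices
prices = np.random.default_rng(0).lognormal(mean=12.5, sigma=0.5, size=1000)
print(np.percentile(prices, 60))   # 60% of prices fall below this value
print(np.percentile(prices, 99))   # 99% of prices fall below this value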

What is the mode?

This is a short answer, but an important one nevertheless. The mode is the most frequently occurring observation.
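For example, with a toy pandas Series:

# Mode: the most frequently occurring value
import pandas as pd

colors = pd.Series(["red", "blue", "red", "green"])
print(colors.mode())   # "red" occurs most often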

This can be useful along with the mean to fill out missing data. You can use the mean to fill out observations for features that have an integer or float data type, while you can use the mode for categorical variables, which are really just something like strings where you have some number of categories.

Here is some code for doing it. Note that you have to feed the function a DataFrame containing your dataset. Mine is called "train", so the last line calls the function on it; inside the function, the fillna call replaces each missing value with the one found from the mean or mode.

# Data Manipulation
import numpy as np
import pandas as pd

# df is used as a stand-in name for train
def fill_missing_or_nan_values(df):
    for column in df:
        
        # Check if the column is a float64 or int64
        is_it_float = (df[column].dtype == np.float64)
        is_it_int = (df[column].dtype == np.int64)
        
        # If it is neither (i.e. an object column),
        # find the most common element and use it as the fill value.
        # mode() can return several values if there is a tie,
        # so take the first one
        if not is_it_float and not is_it_int:
            fill_with = df[column].mode().iloc[0]
        
        # If it is either a float64 or int64,
        # calculate the mean while ignoring NaN values
        else:
            fill_with = np.nanmean(df[column])
        
        # Fill the missing and NaN values in our dataset
        df[column] = df[column].fillna(fill_with)

fill_missing_or_nan_values(train)
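If you don't have a "train" DataFrame handy, here is a hedged toy example (the column names and values are made up) showing what the function does:

# Toy stand-in for the "train" dataset
train = pd.DataFrame({
    "price": [1.0, np.nan, 3.0],       # float column -> filled with the mean, 2.0
    "color": ["red", "red", np.nan],   # object column -> filled with the mode, "red"
})
fill_missing_or_nan_values(train)
print(train)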

What is covariance and correlation?

What is covariance, and how is it different from correlation? Well, covariance measures how much a variable x is expected to change along with another variable y, and vice versa: as x changes, y changes with it. These two variables are what we call attributes or features in the machine learning world.

Let's denote the mean as you saw earlier, $\mu$, writing it as $\mu_x$ for the variable x and $\mu_y$ for y. Now we have $n$ observations of the features x and y. The subscripted i in $x_i$ and $y_i$ still just means the i'th observation, counting from 1 to $n$. The formula for covariance then looks like this:

$$ \mathrm{cov}(x,y) = \frac{1}{n-1}\sum\limits^{n}_{i=1}(x_i-\mu_x)(y_i-\mu_y) $$

In short and simple terms, we could expand the product inside the sum to get

$$ \mathrm{cov}(x,y) = \frac{1}{n-1}\sum\limits^{n}_{i=1}\left(x_iy_i - x_i\mu_y - \mu_xy_i + \mu_x\mu_y\right) $$
Again, we sum over something from $i=1$ to $n$ observations. In this case we sum the i'th observation $x_i$ multiplied by the i'th observation $y_i$, minus $x_i$ multiplied by the mean of y, minus the mean of x multiplied by $y_i$, plus the product of the means of x and y.

Clearly a lot to take in, but try to make sense of the above paragraph and the equation. Try to keep separate the mean of the whole feature x and the individual observation $x_i$ (and the same for y). Afterwards, you will see that we just take the sum of the above and divide it by $n-1$.

The covariance really just measures how two features vary together. In fact, if you measure the covariance of a feature with itself, $\mathrm{cov}(x,x)$, it is the same thing as finding the variance of the feature x, which implies $\mathrm{cov}(x,x)=\mathrm{Var}(x)$.
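Here is a minimal sketch on two toy arrays, computing the covariance by hand as in the formula above and checking it against NumPy:

# Covariance: how two features vary together
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])

# By hand, matching the formula above
cov_by_hand = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
print(cov_by_hand)   # 3.666...

# np.cov returns the full covariance matrix: the off-diagonal entries
# are cov(x, y), and the diagonal entries are Var(x) and Var(y)
print(np.cov(x, y))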


Correlation is closely related to covariance. The value of covariance can range from negative infinity to positive infinity, whereas correlation goes from -1 to 1. Correlation tells us how linearly related attributes are, that is, two features x and y as before with covariance.

If the correlation is 0, then x tells us nothing about y. Above 0 and towards 1, it means that when x is large, y is likely to be large as well. Below 0 and towards -1 tells us that x and y are opposite in some way, that is, if x is large, y is likely to be small. Let's get to the formula, as it does require some amount of explanation, more than the other statistics covered here. Here it is:

$$ cor(x,y)=\frac{cov(x,y)}{\sigma_x\sigma_y} $$

So for both features x and y, we use

  1. The calculation of covariance for those features, using what I showed you above, and
  2. The standard deviation $\sigma$ for both features, which uses the square root of the variance.

Clearly a lot goes into calculating the correlation, but the nice part of being programmers is that it was all implemented long ago, as a function that you can just call on your data.
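As a closing sketch, here is the correlation computed by hand on the same toy arrays as in the covariance example, checked against NumPy's built-in np.corrcoef:

# Correlation: covariance normalized by both standard deviations
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])

cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
cor_by_hand = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(cor_by_hand)   # about 0.96

# np.corrcoef returns the correlation matrix;
# the off-diagonal entries are cor(x, y)
print(np.corrcoef(x, y))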