Anytime we do machine learning, measures of distance can be important. An example comes to mind, when a machine has to distinguish between two object, say a pig or cow. Well then it is useful to know some definition for measuring distance, as that will help us distinguish between whether it is a pig or cow.

One such way to describe a measure of distance is as a function with two observations and let's call those two observations $n_1$ and $n_2$. Then we could define a function that takes those two observations as inputs, to calculate a distance d. Now in particular, we have a set of rules that we want some observations to follow, making them more life-like in the sense that they are probabilistically determined to be true in real-life.

Ok, so you might think, what does this mean? All it means is that we want some rules that are true, e.g. the absolute value of a distance d must be greater than 0. Let me give you the mumbo-jumbo language right here, and define what it means afterwards:

  1. Non-negativity: $||x||> 0$ if $x \ne 0$
  2. Scaling: $||ax|| = |a|||x||$
  3. Triangle inequality: $||x+y|| \leq ||x|| + ||y||$

Let's clarify what each of these rules means, i.e. what does $||x||$, absolute value, mean. First of all, x contains the two observations $n_1$ and $n_2$, and x is a vector — x could be described by $x=n_1-n_2$, since that is measuring the distance.

The absolute value can be seen in the scope of the distance. When measuring the distance, you go from observation $n_1$ to observation $n_2$, and in similar fashion, we can describe what $||x||$ really means. Firstly, it reads out as the absolute value of the absolute value of x. You might think, well what does that mean? Let's tread on the paths of what an absolute value is. An absolute value is the distance measured from where you started to where you ended, in other words, if $x=-12$, the distance you would have travelled ought to be 12, and the same goes for if $x=5$, then the distance would be 5. What is the absolute value of the absolute value? Exactly the same answer, it means that when you take the absolute value of $x=5$, then the distance is 5. If you take the absolute value of $x=5$ again, then the distance would still be the same.

Now to describe what the rules means. The first rule, non-negativity, describes that the distance of x must be greater than zero. The second rule, scaling, describes that the absolute value of the absolute value of a multiplied by x, must be the same as the absolute value of a and the absolute value of the absolute value of x. In this instance, a is the scalar, that is, a real number (also seen in my Linear Algebra series).The third rule, triangle inequality, is along the lines of the same logic as in the second rule. The absolute value of the absolute value must be less than or equal to the absolute value of the absolute value of $x+y$.

The previous part for measures of distances are the very theoretical part, just as to give you an understanding what makes up these formulas that we are going to talk about now.

The Minkowski distance, which is described as a p-norm distance where $p \geq 1$, is a way to measure distance by points. These points can be plotted on a coordinate if they are not too high dimensional. That is, our points $n_1$ and $n_2$ from before has m dimensions. And now we call those two points $X$ and $Y$. The way to calculate the Minkowski distance is then

$$ d_p(X,Y)=\left( \sum\limits^{m}_{j=1}|x_j-y_j|^p \right)^{\frac{1}{p}} $$

Then $x_j$ and $y_j$ will be how many dimensions of x and y exists. So here we sum over our two points X and Y in m dimensions, and raise each point to the power of p, and then once summed, to the power of 1 over p.

When $p=1$, we call it the Manhattan distance (or one-norm), which stems from the fact that you can either walk vertical or horizontal (along the x or y axis, but not both at the same time) in the middle of New York, since there are these huge building blocks. Then there is $p=2$, also called the Euclidean distance (or two-norm), which means we can travel to our destination in a straight line, now have to walk either vertical or horizontal.

Let me show you by a picture:

We want to get from point X to Y. The Euclidean distance can walk along the x and y axis, while the Manhattan distance can only walk along either the x and y axis. Particularly, here there are no obstacles, like huge building blocks, that prevent us from using the Euclidean distance. But if there were, you couldn't use it and would be forced to go the Manhattan distance if you wanted to get from X to Y.

Filling in the formulas for the Manhattan and Euclidean distance, we get:

$$ d_1(X,Y)=\sum\limits^{m}_{j=1}|x_j-y_j|\\ d_2(X,Y)=\sqrt{\sum\limits^{m}_{j=1}(x_j-y_j)^2} $$

Measures of similarity

What if I told you, that I have just been describing dissimilarity in the last part that you just read? That d stands for dissimilarity and not distance.

One way to use similarity/dissimilarity is in classification, where you classify a document X as having the same topic Y as the document $X{_something}$ it is the most similar/dissimilar to. Another way is Outlier detection, where the observation most dissimilar to all other observations is an outlier.

How we can define similarity is by dissimilarity: $s(X,Y)=-d(X,Y)$, where s is for similarity and d for dissimilarity (or distance as we saw before). Let's consider when X and Y are both binary, i.e. when they are both 0 or 1. Then we can define 4 situations denoted $f_{xy}$:

$$ f_{11}=\,\,where\,\, x_i=1\,\, and\,\, y_i=1\\ f_{10}=\,\, where\,\, x_i=1\,\, and\,\, y_i=0\\ f_{01}=\,\, where\,\, x_i=0\,\, and\,\, y_i=1\\ f_{00}=\,\, where\,\, x_i=0\,\, and\,\, y_i=0 $$

Then each of f will be the number of entries of i, where each entry meet the conditions of one of the 4 functions.

We can define how many attributes that amounts to by M, and in this case we have $M=4$, that is $M=f_{11}+f_{10}+f_{01}+f_{00}$. From now on I'm going to introduce you to three different measures:

  1. Simple Matching Coefficient
  2. Jaccard Similarity
  3. Cosine Similarity
  4. Extended Jaccard Similarity (where we consider general vectors)

Let me give you a formula for each, then explain it more algorithmically, since that is what you really need to understand and not the formula.

For each of these, let's remember we are considering a binary case, with 4 features called M.

[1] Simple Matching Coefficient

The formula is

$$ SMC(X,Y)=\frac{f_{11}+f_{00}}{M} $$

We are taking the input of two points, called X and Y (also referred to as vectors), where we find how many entries there is for $X_i and Y_i = 1$ and the opposite case where $X_i and Y_i = 0$. If we are to consider a three-dimensional dataset, then each point has an x, y and z coordinate, where i of those vectors refers to the coordinates. In this case, all of the coordinates are either 0 or 1, which means we end up with the number of coordinates that has the value 1 in $f_11$, and the one's with value 0 in $f_00$.

That is, for a M-dimensional point X or Y (referred to as $X_i$ or $Y_i$), we have some dimension. If $X_1$ (the x-coordinate of X) is 1, then we add one entry (e.g. add 1 to the current number of entries) to $f_{11}$, and similarly if $X_2$ (the y-coordinate of X) is 1, then we add one entry to $f_{11}$. The same thing goes for Y, that is, if any of the coordinates of Y is equal to 1, we add an entry to $f_{11}$. Then we do the opposite for $f_{00}$, that is, where any coordinate of X or Y is 0.

[2] Jaccard Similarity

The formula is

$$ J(X,Y)=\frac{f_{11}}{f_{11}+f_{10}+f_{01}} $$

This formula almost needs no explaining, if you read the last formula. We really just look at $f_{11}\,,\,f_{10}\,\, and\,\, f_{01}$, that is, how many coordinates of X and Y that respectively is 1 and 1, 1 and 0, and 0 and 1.

[3] Cosine Similarity

The formula is

$$ cos(X,Y)=\frac{f_{11}}{f_{11}+f_{10}+f_{01}}=\frac{X \boldsymbol{\cdot} Y}{\sqrt{X \boldsymbol{\cdot} X}\sqrt{Y \boldsymbol{\cdot} Y}} $$

This formula extends to the use of the dot product (more here), and so you really just end up using the dot product to figure out this formula. If $X=(1,1,1)$ and $Y=(0,1,0)$, then the cosine similarity would be

$$ X \boldsymbol{\cdot} Y= \begin{bmatrix} 1*0\\ 1*1\\ 1*0 \end{bmatrix} = \begin{bmatrix} 0\\ 1\\ 0 \end{bmatrix} =0+1+0 \\ \sqrt{X \boldsymbol{\cdot} X}\sqrt{Y \boldsymbol{\cdot} Y} = \sqrt{\begin{bmatrix} 1*1\\ 1*1\\ 1*1 \end{bmatrix}} \sqrt{\begin{bmatrix} 0*0\\ 1*1\\ 0*0 \end{bmatrix}} =\sqrt{3}\sqrt{1} $$

After these calculations, we see that

$$ cos(X,Y)=\frac{0+1+0}{\sqrt{3}\sqrt{1}}=\frac{1}{1.73205}=0.57735 $$

You could also just do it with the $f_{xy}$, but I prefer it this way, since it is easier and less convoluted to calculate.

To convert this to code, it would look like this:

import numpy as np

# Pass function two vectors (points)
def cosine_similarity(X, Y):
    return (, Y)) / (np.sqrt(, X)) * np.sqrt(, Y)))
[4] Extended Jaccard Similarity

The formula is

$$ EJ(X,Y)=\frac{X^TY}{||X||^2+||Y||^2-X^TY}=\frac{X \boldsymbol{\cdot} Y}{\sqrt{X \boldsymbol{\cdot} X^2}+\sqrt{Y \boldsymbol{\cdot} Y^2}-X \boldsymbol{\cdot} Y} $$

This does look quite a lot like cosine similarity, so I think you should re-read the above if you didn't understand it. Again we just use the dot product from Linear Algebra.

To get a more intuitive understanding of what similarity and dissimilarity is, as well as use cases for the formulas above, I suggest you watch the video below or explore further: