This post is where you need to slow down and really learn the fundamentals. All modern approaches to Machine Learning use probability theory. AlphaStar is an example, where DeepMind built several AIs using neural network models for the popular game StarCraft 2. These AIs used probability to estimate, for instance, whether they would win the next fight or where the enemy's next attack would come from, and then used that information as part of their decision making.

You need no other advanced prior knowledge before reading this. Everything is introduced from the bottom up.

Table of Contents

1. Introduction to logic
2. Overview of Formulas
3. Sum Rule
4. Product Rule
5. Bayes' Theorem
6. Example 1
7. Example 2
8. Marginalization and Bayes' Theorem
9. Example 3

Introduction to logic

Where can we trace probability back to, you might ask. Probability traces back to logical reasoning. Statements like 'my bike has been stolen' and 'my bike is not where I left it' come to mind, since each can either be true or false. In logical reasoning, you also use the opposite of a statement, e.g. 'my bike has NOT been stolen'. Let us call 'my bike has been stolen' statement A and 'my bike is not where I left it' statement B.

One way to logically reason about these statements is by stating which is true and which is false.

If my bike has been stolen, then surely it is not where I left it.
Or, if A is true then B is true.

What would happen if we changed the order of the statements A and B?

If my bike is not where I left it, then it has almost certainly been stolen?
Or, given B is true, A is almost certainly true?

The next thing we can reason about is the opposite of a statement, e.g. the opposite of 'my bike is not where I left it' would be 'my bike is where I left it'.

This operation is called 'negation', meaning NOT, or the opposite as I explained it. We write the negation of a statement with a bar over it, e.g. $\overline{A}$.

If my bike is where I left it, then my bike has not been stolen.
Or, if $\overline{B}$ is true then $\overline{A}$ is true.

In other words, we get a degree of belief. That is, we begin to reason about how probable it is that some event is true: what might the probability be that my bike has been stolen, given that my bike is not where I left it?

In the language of degrees of belief, we have probabilities, which answer the aforementioned question. We write it like this:

What is the probability that my bike has been stolen, if it is not where I left it?
Or, if B is true, what is the probability of A being true?
Or, $P(A|B)$

That last statement reads, 'the probability of A being true, given that B is true'.

We might also say that the probability of A being true given that B is true is greater than the probability of A being true given that B is false. In plain English: the probability that my bike has been stolen given that it is not where I left it is greater than the probability that my bike has been stolen given that it is where I left it.

$$ P(A|B) > P(A|\overline{B}) $$

In these simple terms, we can reason about uncertainties in the world, as I have just shown you.

Now we need to transfer these simple terms to probability theory, where the sum rule, the product rule and Bayes' theorem are all you need.

A, B and C can be any three propositions. We could select C as the logical constant true, which means $C=1$. Notice that truth is measured in binary terms, true or false, which translates to 1 or 0. When the probability of something approaches 1, it is very likely; when it approaches 0, it is very unlikely.

A great way to visualize negation is as the area of a rectangle of total area 1, split into our degree of belief in S and its opposite. Say $S=0.8$; then the opposite, needed to reach a sum of 1, would be $\overline{S}=0.2$. If this does not immediately make sense, remember that the two must always sum to 1; then it becomes clear why the 'opposite' of 0.8 is 0.2.

Here is a table briefly explaining what each term means. Take a look at it; this is some of the basic language of probability theory.

| Term | Notation | Explained |
| --- | --- | --- |
| Negation | $\overline{A}$ | not: true if A is false |
| Conjunction | $AB$ | and: true if A and B are both true |
| Disjunction | $A+B$ | or: true if A or B (or both) is true |

I have only covered negation so far, but conjunction and disjunction should make sense too. For 'A and B are both true', we write $AB$. For 'A or B is true', we write $A+B$.
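These three operations map directly onto Python's boolean operators, so as a quick sanity check of the table (the truth values picked here are arbitrary):

```python
# The three basic logical operations from the table, on Python booleans.
A, B = True, False

negation = not A       # true if A is false
conjunction = A and B  # true if A and B are both true
disjunction = A or B   # true if A or B (or both) is true

print(negation, conjunction, disjunction)  # False False True
```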


Overview of Formulas

Here are all the formulas that you need to know.

$$ Sum\,rule:\\ P(A|C)+P(\overline{A}|C)=1 $$
$$ Product\,rule:\\ P(AB|C)=P(B|AC)P(A|C) $$
$$ Bayes'\,theorem:\\ P(A|BC)=\frac{P(B|AC)P(A|C)}{P(B|C)} $$

Let me explain each one.

Sum rule

The sum rule simply tells us that the probability of an event A plus the probability of the negated (opposite) event $\overline{A}$ equals one. You see, C can be a constant that we set equal to 1. That means C disappears from the equation, since it is true (1 is the same as being true). Remember that we are asking 'what is the probability of A given that C is true?', so if C is always true, we can leave it out.

$$ P(A|C)+P(\overline{A}|C)=1 \Leftrightarrow P(A)+P(\overline{A})=1 $$

The probability of A can be any number between 0 and 1. In the binary extremes, if A is false (0) then $\overline{A}$ is true (1), because that is how negation works: it is the opposite of a binary statement.
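A minimal sketch of the sum rule in code; the 0.3 degree of belief is a made-up number purely for illustration:

```python
# Sum rule: P(A) + P(not A) = 1, so the negation is simply 1 - P(A).
def negate(p_a):
    return 1.0 - p_a

p_stolen = 0.3                    # hypothetical belief that the bike was stolen
p_not_stolen = negate(p_stolen)   # 0.7 by the sum rule

print(round(p_stolen + p_not_stolen, 10))  # 1.0
```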

Product rule

What applied to the sum rule also applies to the product rule: C disappears again, since it is selected to be true, so we can drop it from the notation.

$$ P(AB|C)=P(B|AC)P(A|C) \Leftrightarrow P(AB)=P(B|A)P(A) $$
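To see the product rule in action, here is a sketch on a small made-up joint distribution over two binary propositions; the four numbers are hypothetical and only need to sum to 1:

```python
# Product rule check: P(AB) = P(B|A) * P(A) on a tiny joint distribution.
joint = {
    (True, True): 0.12,    # P(A and B)
    (True, False): 0.18,
    (False, True): 0.28,
    (False, False): 0.42,
}

p_a = joint[(True, True)] + joint[(True, False)]   # marginal P(A)
p_b_given_a = joint[(True, True)] / p_a            # conditional P(B|A)

# The product of the two recovers the joint probability P(AB).
assert abs(joint[(True, True)] - p_b_given_a * p_a) < 1e-12
print(round(p_a, 2), round(p_b_given_a, 2))  # 0.3 0.4
```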

Bayes' Theorem

I will just give you the equation first, then explain.

$$ P(A|BC)=\frac{P(B|AC)P(A|C)}{P(B|C)} \\ = \frac{P(B|AC)P(A|C)}{P(B|AC)P(A|C)+P(B|\overline{A}C)P(\overline{A}|C)} \\ \Leftrightarrow P(A|B)=\frac{P(B|A)P(A)}{P(B)} $$

Remember that C can be set equal to 1, that is, we remove C from the equation. This theorem needs several examples to sink in, so stay with me while I explain how the formula can be used. First a simple example, then a more advanced one. It can be quite hard to follow, so hang tight.

Example 1:

What is the probability of there being a fire when you see smoke? You know these facts:

  • Fires happen 1% of the time
  • Smoke is likely to occur 10% of the time
  • Smoke is likely to occur 90% of the times when there is a fire

The first thing we can do is write the probability down in our language, the language of probability theory. We use F for fire and S for smoke, so that we get the proposition $P(F|S)$: given there is smoke, how likely is it that there is a fire?

This is where Bayes' theorem comes in handy.

$$ P(A|B)=\frac{P(B|A)P(A)}{P(B)} \Leftrightarrow P(F|S)=\frac{P(S|F)P(F)}{P(S)} $$

Since we know all the facts, we get that the probability of smoke given fire is $P(S|F)=0.90$, the probability of there being a fire is $P(F)=0.01$, and the probability of there being smoke is $P(S)=0.10$. Translating this into the formula:

$$ P(F|S)=\frac{P(S|F)P(F)}{P(S)}=\frac{0.90 \times 0.01}{0.10}=0.09=9\% $$

This means that when we see smoke, there is a fire 9% of the time.
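The whole calculation is three numbers and one line of arithmetic, so here is the same computation as a sketch in code:

```python
# Example 1: P(F|S) = P(S|F) * P(F) / P(S)
p_s_given_f = 0.90   # smoke occurs 90% of the time when there is a fire
p_f = 0.01           # fires happen 1% of the time
p_s = 0.10           # smoke occurs 10% of the time

p_f_given_s = p_s_given_f * p_f / p_s
print(round(p_f_given_s, 2))   # 0.09, i.e. 9%
```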

Example 2:

What is the probability of having the disease given a positive test? You know these facts:

  • A test for the disease correctly identifies it 99% of the time (true positive)
  • A test for the disease incorrectly identifies it 2% of the time (false positive)
  • 1% of the population suffers from the disease

We first have to figure out what is what here. Let's go through the bullet points. Let T stand for a positive test result and D for having the disease. The probability of a positive test given that you have the disease (a true positive) is $P(T|D)=0.99$. The probability of a positive test given that you do not have the disease (a false positive) is $P(T|\overline{D})=0.02$. The probability of having the disease is $P(D)=0.01$, so the probability of not having the disease is $P(\overline{D})=1-P(D)=0.99$.

Note how the negation does the work here: $P(T|\overline{D})$ reads 'the probability of a positive test given NOT having the disease', which is exactly what a false positive is. This can feel like convoluted logic, so let it fully sink in; try to reason it out on a piece of paper.

To plug this into Bayes' theorem, we need nothing more than what I just laid out for you. Let me summarize what we figured out:

  • $P(T|D)=0.99$
  • $P(T|\overline{D})=0.02$
  • $P(D)=0.01$
  • $P(\overline{D})=0.99$

Now that we have everything in place, we can plug it into Bayes' theorem. Let us remind ourselves what was asked in the question.
What is the probability of having the disease given a positive test? Or in Bayes' theorem terms: $P(D|T)$. Let's plug in the formula and play:

$$ P(A|B)=\frac{P(B|A)P(A)}{P(B)} \Leftrightarrow P(D|T)=\frac{P(T|D)P(D)}{P(T)} = \frac{0.99 \times 0.01}{P(T)} $$

Everything looks good, right? Wrong! You see, we need to expand the term $P(T)$, since we don't actually know its value and cannot assign it a number like we could with the other terms. Let us remind ourselves of both the sum rule and the product rule, because we need both to solve this problem.

Actually the answer lies in the definition of Bayes' theorem, which I didn't fully give to you. The derivation of Bayes' theorem uses the product rule and the sum rule to expand the denominator, which is why you might have felt something was missing if you have read about the theorem elsewhere.

$$ P(A|BC)=\frac{P(B|AC)P(A|C)}{P(B|C)} \Leftrightarrow P(A|B)= \frac{P(B|A)P(A)}{P(B)} \\ = \frac{P(B|A)P(A)}{P(B|A)P(A)+P(B|\overline{A})P(\overline{A})} $$

To put this into context, what we need in the denominator of our fraction is exactly this expanded form. Let's apply it to our case!

$$ P(D|T)=\frac{0.99 \times 0.01}{P(T|D)P(D)+P(T|\overline{D})P(\overline{D})} \\ = \frac{0.99 \times 0.01}{0.99 \times 0.01+0.02 \times 0.99} = 0.33 = 33\% $$
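The same computation as a sketch in code, with the denominator expanded exactly as in the derivation above:

```python
# Example 2: Bayes' theorem with the denominator expanded via
# the sum and product rules.
p_t_given_d = 0.99       # true positive rate, P(T|D)
p_t_given_not_d = 0.02   # false positive rate, P(T|not D)
p_d = 0.01               # prevalence, P(D)
p_not_d = 1 - p_d        # sum rule, P(not D)

p_t = p_t_given_d * p_d + p_t_given_not_d * p_not_d   # expanded P(T)
p_d_given_t = p_t_given_d * p_d / p_t

print(round(p_d_given_t, 2))   # 0.33, i.e. 33%
```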

The bottom line of what I have been teaching you here is that you can use Bayes' theorem in many different ways.

Marginalization and Bayes' theorem

Next we have to account for the fact that there are also non-binary events (or propositions), and we need some way to calculate probabilities for those. An example is rolling a die, which has six possible outcomes rather than two.

We need an extension of Bayes' theorem for when A is not binary, meaning that the cases are not just A and $\overline{A}$ but a whole set of possibilities.

When we want to know about A, but A has 3 or more cases, we have to use marginalization. The full derivation of the formula is fairly technical, but the result can be explained simply. So listen up, this one is important! Here is marginalization combined with Bayes' theorem:

$$ P(A_i|B)=\frac{P(B|A_i)P(A_i)}{P(B)}=\frac{P(B|A_i)P(A_i)}{\sum_{j=1}^{n}P(B|A_j)P(A_j)} $$

Let me type out what this means.
So we now have a situation where A has cases $A_1,...,A_n$ (read: $A_1$ all the way to $A_n$). But it only makes sense to use this when there are 3 or more cases, so remember that.

We choose the case $A_i$, among the n cases available, whose probability we want to find, and fill it into the formula. Then in the denominator we meet the big Greek letter sigma. That symbol $\sum$ means 'sum', so we sum from $j=1$ to $n$.

Translated to English: we go from case 1 ($j=1$) to case n, filling in the value of $P(B|A_j)P(A_j)$ for each case j. So we start with case 1 and fill in the values, then case 2, and so on all the way to case n. Afterwards we add all of them up to get the denominator.
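This sum is easy to express as a small helper function. A sketch, where the three-case numbers at the bottom (three loaded dice with different chances of rolling a six) are made up purely to exercise it:

```python
# Marginalized Bayes' theorem: P(A_i|B) from the priors P(A_j) and the
# likelihoods P(B|A_j) of every case.
def bayes_marginal(priors, likelihoods, i):
    # Denominator: sum over all cases j of P(B|A_j) * P(A_j).
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    return likelihoods[i] * priors[i] / evidence

# Hypothetical example with n = 3 equally likely cases.
priors = [1/3, 1/3, 1/3]
likelihoods = [1/6, 1/4, 1/2]   # P(B | A_j) for each case j

print(round(bayes_marginal(priors, likelihoods, 2), 4))  # 0.5455
```

Note that the posteriors over all n cases always sum to 1, which is a handy way to check the arithmetic.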

Example 3:

What is the probability that a customer comes from Copenhagen, given that they spend above the median of the other customers on some item? For the sake of the example, let's say the item is meat. This is all computed from a dataset of all customers. We know these facts:

  • 19.5% of customers are from Copenhagen, 7.8% are from Hong Kong and 72.7% are from the rest of the world
  • 48.4% of people from Copenhagen spent above the median
  • 35.2% of people from Hong Kong spent above the median
  • 56.7% of people from the rest of the world spent above the median

So how would we go about this problem? Let's start off by defining the different propositions:

  • The probability of a customer from Copenhagen spending above the median is $P(M|C)=0.484$, where M is 'above the median' and C is Copenhagen
  • The probability of a customer from Hong Kong spending above the median is $P(M|H)=0.352$, where H is Hong Kong
  • The probability of a customer from the rest of the world spending above the median is $P(M|R)=0.567$, where R is rest of the world
  • $P(C)=0.195$
  • $P(H)=0.078$
  • $P(R)=0.727$

In this problem we have 3 cases: Copenhagen, Hong Kong and the rest of the world. That means $n=3$. The probability we want to find is the probability of a customer being from Copenhagen, given that the customer spends above the median: $P(C|M)$.

To expand on that, we take the formula from before and fill it in.

$$ P(C|M)=\frac{P(M|C)P(C)}{\sum_{j=1}^{3}P(M|A_j)P(A_j)} \\ = \frac{P(M|C)P(C)}{P(M|C)P(C)+P(M|H)P(H)+P(M|R)P(R)} $$

And as you might have guessed by now, all we need to do is plug the value of each probability into the formula:

$$ P(C|M)=\frac{0.484 \times 0.195}{0.484 \times 0.195+0.352 \times 0.078+0.567 \times 0.727} \approx 0.177 = 17.7\% $$
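As a sketch in code, with the three regions as the cases to marginalize over:

```python
# Example 3: P(C|M) by marginalizing over the three regions.
priors = {"C": 0.195, "H": 0.078, "R": 0.727}        # P(region)
likelihoods = {"C": 0.484, "H": 0.352, "R": 0.567}   # P(M | region)

p_m = sum(likelihoods[r] * priors[r] for r in priors)   # P(M), the denominator
p_c_given_m = likelihoods["C"] * priors["C"] / p_m

print(round(p_c_given_m, 3))  # 0.177
```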

There we have it: given that a customer spends above the median, the probability that they are from Copenhagen is about 17.7%.