Probability distributions are all about how we represent the distribution of probabilities over data.

There is fundamentally a formula for each distribution, but I like to visualize the distribution of probabilities that we might get from that formula. The formula is just a way to summarize the visualization in a one-liner, in the form of a math equation.

As a note to the reader, this post only scratches the surface of Maximum Likelihood Estimation (MLE). You might find that you want more thorough explanations rather than a shallow approach; more material is included at the bottom of this page.

Bernoulli Distribution

If we were to toss a coin once, the result could either be a success or a failure (translated to the last post: true or false, or 1 and 0). So we have that $P(Success)=\theta$ and $P(Failure)=1-\theta$, where $\theta$ (read: theta) is a probability between 0 and 1.

So then we make the assumption that some variable $b$ is either 1 or 0, for success and failure (or true and false). The probability of $b$ depends on a parameter called $\theta$. We say that such a variable $b$ has a Bernoulli distribution:

$$ P(b|\theta)=\theta^b(1-\theta)^{1-b} $$

What we ask is, what values can $b$ take? Remember that it can only be 0 or 1, for failure or success, when flipping this coin. For $b=0,1$, what might the probability be for each of those cases? We fill in the formula with 1 and 0:

$$ P(1|\theta)=\theta^1(1-\theta)^{1-1}=\theta $$
$$ P(0|\theta)=\theta^0(1-\theta)^{1-0}=1-\theta $$

The formula is just a way of expressing that these two things can happen, but in one line. What is usually also introduced with the Bernoulli distribution is how we can find the mean and variance. Remember we denote the mean as $\mu$ (read: mu) and the variance as $Var(b)$. So the mean and variance of a Bernoulli distribution are

$$ \mu = \theta $$
$$ Var(b)=\theta(1-\theta) $$

The key thing to remember is that we always assume that $b$ is either 1 or 0.
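To make this concrete, here is a minimal Python sketch of the formula above (the function name bernoulli_pmf and the use of NumPy for the simulation are my own choices for illustration):

```python
import numpy as np

def bernoulli_pmf(b, theta):
    """Evaluate P(b | theta) = theta^b * (1 - theta)^(1 - b) for b in {0, 1}."""
    return theta**b * (1 - theta)**(1 - b)

theta = 0.7
print(bernoulli_pmf(1, theta))  # theta, i.e. 0.7
print(bernoulli_pmf(0, theta))  # 1 - theta, i.e. 0.3

# Simulate many flips and compare against the closed-form mean and variance.
rng = np.random.default_rng(0)
samples = rng.binomial(n=1, p=theta, size=100_000)  # Bernoulli = Binomial with n=1
print(samples.mean())  # close to theta
print(samples.var())   # close to theta * (1 - theta) = 0.21
```

The simulated mean and variance land close to $\theta$ and $\theta(1-\theta)$, matching the formulas above.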


What can we use the Bernoulli distribution for? We use it to model binary data: the answer to a yes-no question. We can represent binary data in such a Bernoulli distribution, where the probability is along the y-axis and the outcomes (1 or 0) are along the x-axis.

Outcomes along the x-axis, 1 or 0. Probabilities on the y-axis, ranging from 0 to 1 and summing to 1.
Categorical Distribution (Multinoulli Distribution)

The formula for this distribution is often written with a capital pi $\prod$, which is similar to $\sum$, but instead of summing, we multiply iteratively from $k=1$ to $K$. For example:

$$ \prod_{k=1}^{K}f(k)=f(1)\times f(2) \times \dots \times f(K) $$
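If that notation feels abstract, here is a tiny Python sketch of the same idea (the function f below is just an arbitrary example I made up):

```python
import math

def f(k):
    return k + 1  # an arbitrary example function

K = 4
product = math.prod(f(k) for k in range(1, K + 1))
print(product)  # f(1) * f(2) * f(3) * f(4) = 2 * 3 * 4 * 5 = 120
```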

Here is the formula for the categorical distribution:

$$ Categorical(y|\theta)=\prod_{k=1}^{K} \theta_k^{z_k} $$

$y$ is the outcome, which can take one of $K$ possible values. $\theta$ is a list of probabilities, and this list has length $K$: one probability per possible outcome.

What we do here is multiply $\theta_k^{z_k}$ from $k=1$ all the way to the $K$ possible outcomes. The outcome $y$ is one-hot encoded into the list $z$ of length $K$: exactly one $z_k$ is equal to 1 (true), namely the one matching the observed outcome, and the rest must be 0, since it is one-hot encoded. The product therefore picks out exactly the probability $\theta_k$ of the outcome that was observed.

The analogy here is that $z$ is a list of features or attributes from $k=1$ to $K$, of which only one can be 1. Similarly, $\theta$ is the list of probabilities from $k=1$ to $K$, one for each of those features.
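As a small sketch of how this product works in practice (the function name categorical_pmf is my own, chosen for illustration), note how the one-hot $z$ makes the product collapse to a single factor:

```python
import numpy as np

def categorical_pmf(z, theta):
    """Evaluate prod_k theta_k^{z_k}, where z is the one-hot encoding of the outcome."""
    z = np.asarray(z)
    theta = np.asarray(theta)
    return float(np.prod(theta**z))

theta = [0.2, 0.5, 0.3]  # K = 3 probabilities, summing to 1
z = [0, 1, 0]            # one-hot encoding of the second outcome
print(categorical_pmf(z, theta))  # 0.5, i.e. the probability of that outcome
```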


Why do we need to consider categorical distributions? In the Bernoulli distribution we had $b$, which could be 1 or 0, that is, binary data. But we need to consider the cases where our data is not binary, where we might have 1 to $K$ outcomes, denoted as $y$. This is why we might also refer to it as the Multinoulli distribution. It is literally representing a list of probabilities along the y-axis, one for each of the outcomes on the x-axis, expanding upon the Bernoulli distribution.

For rolls of a very unfair die that heavily favors the outcome 6, we can expect quite different probabilities for the different outcomes. Note that if the die were fair, we would call it a uniform distribution.

A very unfair die. Outcomes on the x-axis, from 1 to $K$, and in this case $K=6$. The probability is along the y-axis, ranging from 0 to 1 and summing to 1.
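To see what such an unfair die could look like in code, here is a small sketch (the probabilities below are made up purely for illustration):

```python
import numpy as np

# Made-up probabilities for a die that heavily favors the outcome 6.
theta = [0.05, 0.05, 0.05, 0.05, 0.10, 0.70]
outcomes = [1, 2, 3, 4, 5, 6]

rng = np.random.default_rng(0)
rolls = rng.choice(outcomes, size=10, p=theta)
print(rolls)  # mostly sixes

# A fair die would instead use theta = [1/6] * 6, i.e. a uniform distribution.
```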
Likelihood — Maximum Likelihood

To understand the next probability distributions, we need an introduction to likelihood, so this will be a segue that helps us explain the next distributions.

What is likelihood and how do we use it to our advantage? If we think about the word likelihood, it means much the same as probability in everyday English. But if we were to describe the two in probability terms, then:

  • The percentage of how often a success occurs for some variable, given other variables, is referred to as a probability. For example, the probability of a coin flip landing heads is 0.5.
  • The percentage of how often some overall outcome occurs across a new set of events, using the previously known probability of a single event, is referred to as a likelihood. For example, the likelihood of 8 out of 20 coin flips landing heads, using the fact that we know the probability of a single coin flip landing heads is 0.5 (a worked example follows below).
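To put a number on that second bullet, here is a small sketch that computes it with the binomial coefficient (math.comb is Python's built-in way to count the ways of choosing $m$ heads out of $N$ flips):

```python
import math

N, m, theta = 20, 8, 0.5
likelihood = math.comb(N, m) * theta**m * (1 - theta)**(N - m)
print(likelihood)  # roughly 0.12: the likelihood of exactly 8 heads in 20 flips
```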

The formula for the likelihood function looks like this, derived from the Bernoulli distribution:

$$ p(\theta|b)=\frac{(N+1)!}{m!(N-m)!}\theta^m(1-\theta)^{N-m} $$
  • $N$ is the total number of observations
  • $m$ is the total number of observations that give us the answer we want, e.g. in the question to 1000 people ($N=1000$), are you vegetarian? Then the number of people who answer yes is $m$.
  • $b$ is the list of observed outcomes of some events, e.g. coin tosses where each outcome is 1 or 0.
  • $\theta$ is the parameter whose optimal value we want to find, the one maximizing the rest of the equation. Explanation follows below.

Note that this is a probability density function, but for the purpose of a light introduction, it is just shown as is.

Perhaps we could vary $\theta$ over many different values, which we could plot as a distribution over $\theta$. Then we could eyeball the $\theta$ value where the likelihood is largest. Or we could use the formula for it, without having to plot anything. The formula is

$$ \theta^* = \frac{m}{N} $$

Where $\theta^*$ denotes the maximum likelihood estimate, i.e. the value of $\theta$ at the peak of the likelihood curve, as the sketch below illustrates.
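Here is a minimal sketch of both approaches, assuming the same 8-out-of-20 coin-flip example from before: evaluate the likelihood on a grid of $\theta$ values, take the peak, and compare it to $m/N$:

```python
import numpy as np

N, m = 20, 8  # 8 successes out of 20 observations

# Evaluate the likelihood (up to a constant factor in theta) on a grid of theta values.
thetas = np.linspace(0.001, 0.999, 999)
likelihood = thetas**m * (1 - thetas)**(N - m)

theta_star = thetas[np.argmax(likelihood)]
print(theta_star)  # close to 0.4, the peak of the curve
print(m / N)       # the closed-form answer: 0.4
```

Dropping the constant factor in front does not move the peak, which is why the grid search and the closed-form answer agree.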


As a final remark, I think this likelihood function deserves a much longer and more thorough post, seeing as there is much more to it. It was only briefly explained here, but I will link to some resources.

Further reading/watching