Classification and Regression Trees (CART) is one of the most widely used algorithms in Machine Learning, since it is the workhorse of Gradient Boosting. This means that the most popular packages, like XGBoost and LightGBM, use CART to build their trees.

Decision Tree is a generic term, and decision trees can be implemented in many ways – don't get the terms mixed up: we mean the same thing when we say classification trees as when we say decision trees. But a decision tree is not necessarily a classification tree; it could also be a regression tree.

We will be exploring Gini Impurity, which helps us measure **the quality of a split**. Gini impurity is a classification metric that tells us where we should create internal nodes and leaf nodes. This metric is different from, but similar to, Gini Index and Information Gain/Impurity Improvement.

- Terminology Of Trees
- Gini Impurity: Leaf Nodes
- Gini Impurity: Internal Nodes
- Example: Building A Classification Tree
- Regularization For CARTs
- Further Reading

Let's get the definitions rolling right away. We have root nodes, internal nodes and leaf nodes – each illustrated in the picture below:

- Root node: The very first node in a tree.
- Internal node: Nodes which are connected to deeper nodes.
- Leaf nodes: Nodes that are not connected to deeper nodes, but have an internal node connected to them.

Throughout this article, we will color code the root and internal nodes as blue, while the leaf nodes are mint green.

Some general terminology for trees is the concept of parents and children. The nodes above a certain node are called parent nodes, while the nodes below are called child nodes.

When calculating the gini impurity for a single leaf node, we can have multiple classes $C$ – e.g. the classes "YES" and "NO". The following are the steps:

- Calculate the probabilities of all classes, given the samples in a Leaf $k$
- Square the calculated probabilities, i.e. raised to the power of 2
- Sum all the squared probabilities into a single number
- Subtract that sum from $1$

The probabilities are easy to find. If the class is "YES", then we simply count the number of observations labelled "YES" in $Leaf_k$ and divide it by the total number of observations. We will see exactly how this is done later on.

$$

Gini(Leaf_k) = 1 - \sum_{class=1}^{C \, classes} (probability(class|Leaf_k))^2

$$

$$

\Updownarrow

$$

$$

Gini(L_k) = 1 - \sum_{c=1}^{C} \left( p(c|L_k) \right)^2

$$

We have expanded the terms a bit in the top equation, for better readability, while the bottom equation is the one we will continue to use.

The $p(c|L_k)$ reads:

What is the probability that a class $c$ is true, given the observations in the k'th leaf (called $Leaf_k$)?

If you are not comfortable with the thought of probabilities yet, I have written a free post explaining some of the basic probability theory. Visit the post and come back afterwards, if you feel like you need a refresher on probability theory.
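The leaf calculation above can be sketched in a few lines of Python. This is a minimal illustration; the function name `gini_leaf` is just for demonstration, and the example counts are the ones from the fever split later in the article.

```python
def gini_leaf(labels):
    """Gini impurity of one leaf: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# A leaf with 138 "YES" and 16 "NO" observations:
leaf = ["YES"] * 138 + ["NO"] * 16
print(round(gini_leaf(leaf), 3))  # → 0.186
```

A perfectly pure leaf (only one class) gives a gini impurity of 0.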

The next thing we need to explain is how gini impurity is calculated for an internal node (and the root node, more on this later). The following are the steps:

- Loop over each leaf node, from $k=1$ all the way to $K$ leaf nodes.
- Find the gini impurity of the $k$'th leaf (with the first equation that was just explained).
- Count the number of observations in the $k$'th leaf.
- Divide by the total number of observations in all the leaf nodes.
- After looping over all the leaves, the results are summed for a final gini impurity.

This equation is not hard to handle, but there are a few things to account for, e.g. which leaf holds which observations?

$$

Gini(Internal_j) = \sum_{k'th \, leaf=1}^{K \, leaf \, nodes}

\left(\frac{count(Leaf_k)}{count(Leaf_1,...,Leaf_K)}\right) Gini(Leaf_k)

$$

$$

\Updownarrow

$$

$$

Gini(I_j) = \sum_{k=1}^{K} \left(\frac{count(L_k)}{count(L_1,...,L_K)}\right) Gini(L_k)

$$

We have again expanded the definitions in the upper equation and simplified the terms in the lower equation. The definitions are mathematically the same, but the count in the denominator might be confusing. All you have to think about when calculating it is: how many observations can we currently consider in the internal node?
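This weighted sum can be sketched in Python as follows. The function name and the `(count, gini)` pair representation are illustrative choices; the numbers are from the fever split in the example below.

```python
def gini_internal(leaves):
    """Weighted average of leaf gini impurities, where each leaf is
    weighted by its share of the observations under the internal node.
    `leaves` is a list of (observation_count, gini_impurity) pairs."""
    total = sum(count for count, _ in leaves)
    return sum((count / total) * gini for count, gini in leaves)

# Two leaves holding 154 and 114 observations:
print(round(gini_internal([(154, 0.186), (114, 0.266)]), 2))  # → 0.22
```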

First and foremost, we need a dataset to even start thinking about building a classification tree. Let's define this dataset as below with 5 examples.

We want to predict whether a person has The Flu. We are given a dataset, where we have four features: does the person have a fever, cough, headache and/or the flu? We use $1 = \text{YES}$ and $0 = \text{NO}$, with each observation indexed in the left-most column.

| | FEVER | COUGHING | HEADACHE | FLU |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 1 | 1 |
| 2 | 0 | 1 | 1 | 0 |
| 3 | 0 | 1 | 0 | 0 |
| 4 | 1 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... |
| 267 | 0 | 1 | 0 | 1 |

As you can see from the table, we have $N=268$ observations and $M=4$ features. Note that $N$ and $M$ are the standard notation: $N$ means the number of observations, while $M$ means the number of features. We also assume that this dataset has no missing values.

To build a decision tree, we have to start off by finding the best feature for the root node – the feature which best separates the observations, measured by impurity.

We iteratively go feature by feature, to find the impurity for each feature. We quite literally test which feature separates the data the best.

We start with fever – and we ask, how well does the feature fever separate our 268 observations into NO flu and YES flu? After having separated by YES/NO fever, we look at how many observations in each leaf have YES/NO for flu.

This split looks good, but let's actually measure just how good it is, by using our equations from earlier. We start off with the first equation.

**Step 1:** Calculating the gini impurities for each leaf node.

We can treat the root node just like an internal node when calculating the impurity. To expand the equation, we know our two classes, YES (1) and NO (0). The big Greek sigma $\Sigma$ works like a foreach loop, where we just loop over each class, from $c=1$ until $C$ classes.

$$

Gini(L_1) = 1 - \sum_{c=1}^{C}

( p(c|L_1) )^2

= 1 - p(1|L_1)^2 - p(0|L_1)^2

$$

But wait, how do we measure the probabilities? We could ask: given the observations in Leaf 1, how likely is a person to have the flu? The way we actually measure this is by putting the count of the class in the numerator and the total number of observations in the leaf in the denominator.

$$

p(1|L_1) = \frac{138}{154},p(0|L_1) = \frac{16}{154}

$$

In these instances, we can think about $p(c|L_k)$ as counting the number of observations with class c, divided by the count of observations in the leaf k $\frac{count(c)}{count(L_k)}$.

$$

Gini(L_1)

= 1 - \left(\frac{138}{154} \right)^2 - \left(\frac{16}{154} \right)^2

= 0.186

$$

The same approach is used for Leaf 2:

$$

Gini(L_2)

= 1 - \left(\frac{18}{114} \right)^2 - \left(\frac{96}{114} \right)^2

= 0.266

$$

Now that we have the gini impurity of each leaf, we can calculate the gini impurity for the fever feature by the second equation.

**Step 2:** Calculating the gini impurities for the fever feature.

We know the gini impurity of $L_1$ and $L_2$; we just calculated them above in step 1. The second step is expanding the summation. We realize that we only have two leaf nodes, so we just have to expand it to the two leaves.

Then we count the number of observations in each leaf node and replace $count(L_k)$ by the actual count in the leaf. And then we replace $Gini(L_k)$ by the gini impurity that we just calculated above.

$$
\begin{align}
Gini(I_1)
&=
\sum_{k=1}^{K}
\left(\frac
{
count(L_k)
}
{
count(L_1,...,L_K)
}
\right)Gini(L_k)
\\
& =
\left(\frac
{
count(L_1)
}
{
count(L_1 + L_2)
}
\right)Gini(L_1)\,+
\left(\frac
{
count(L_2)
}
{
count(L_1 + L_2)
}
\right)Gini(L_2)
\\
& =
\left(\frac{154}{268} \right)0.186
+ \left(\frac{114}{268} \right)0.266
\\
& =
0.22
\end{align}
$$

We end up with a gini impurity of $0.22$. Next, we will calculate the gini impurity for the other features – at the end, we choose the root to be the feature with the lowest gini impurity, since we could consider that feature the most pure.

Now that I have already shown you how to calculate the gini impurity for the first feature, we will quickly go through the gini impurity for the coughing and headache features, since the type of data is the same.

Splitting by the coughing feature, the leaf nodes end up like in the following image.

Given the information from this decision tree, we can calculate the gini impurity of the two leaf nodes.

$$

Gini(L_1)

= 1 - \left(\frac{60}{132} \right)^2 - \left(\frac{72}{132} \right)^2

= 0.496

$$

$$

Gini(L_2)

= 1 - \left(\frac{35}{136} \right)^2 - \left(\frac{101}{136} \right)^2

= 0.382

$$

It turns out that these leaf gini impurity scores are much higher than those for the fever feature, which means the gini impurity for the coughing feature will end up higher as well.

$$
\begin{align}
Gini(I_1)
& =
\sum_{k=1}^{K}
\left(\frac
{
count(L_k)
}
{
count(L_1,...,L_K)
}
\right)Gini(L_k)
\\
& =
\left(\frac
{
count(L_1)
}
{
count(L_1 + L_2)
}
\right)Gini(L_1)\,+
\left(\frac
{
count(L_2)
}
{
count(L_1 + L_2)
}
\right)Gini(L_2)
\\
& =
\left(\frac{132}{268} \right)0.496
+ \left(\frac{136}{268} \right)0.382
\\
& =
0.438
\end{align}
$$

Using these gini impurities from the leaf nodes to calculate the gini impurity for the coughing feature, we get $0.438$, which is comparatively much worse than the one for the fever feature.

Given the information from this decision tree, we can calculate the gini impurity of the two leaf nodes.

$$

Gini(L_1)

= 1 - \left(\frac{84}{148} \right)^2 - \left(\frac{64}{148} \right)^2

= 0.491

$$

$$

Gini(L_2)

= 1 - \left(\frac{26}{120} \right)^2 - \left(\frac{94}{120} \right)^2

= 0.339

$$

And yet again, these leaf gini impurity scores are much higher than those for the fever feature, which means the gini impurity for the headache feature will end up higher as well.

$$
\begin{align}
Gini(I_1)
& =
\sum_{k=1}^{K}
\left(\frac
{
count(L_k)
}
{
count(L_1,...,L_K)
}
\right)Gini(L_k)
\\
& =
\left(\frac
{
count(L_1)
}
{
count(L_1 + L_2)
}
\right)Gini(L_1)\,+
\left(\frac
{
count(L_2)
}
{
count(L_1 + L_2)
}
\right)Gini(L_2)
\\
& =
\left(\frac{148}{268} \right)0.491
+ \left(\frac{120}{268} \right)0.339
\\
& =
0.423
\end{align}
$$

The final gini impurity is $0.423$ for the headache feature. Since the gini impurity for the fever feature is the lowest, the fever feature now becomes the root.
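Picking the root is then just a minimum over the per-feature impurities we computed, which can be sketched as:

```python
# Gini impurities calculated above for each candidate root feature.
feature_impurity = {"fever": 0.22, "coughing": 0.438, "headache": 0.423}

# The feature with the lowest gini impurity becomes the root.
root = min(feature_impurity, key=feature_impurity.get)
print(root)  # → fever
```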

What if our data instead had numeric values, like the age of a person – 10, 15, etc.? How would we calculate the gini impurity? We can't directly find the class probabilities, because we are not working with classes now.

| | FEVER | COUGHING | HEADACHE | AGE | FLU |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 10 | 1 |
| 1 | 0 | 1 | 1 | 15 | 1 |
| 2 | 0 | 1 | 1 | 55 | 0 |
| 3 | 0 | 1 | 0 | 31 | 0 |
| 4 | 1 | 1 | 0 | 22 | 1 |
| ... | ... | ... | ... | ... | ... |
| 267 | 0 | 1 | 0 | 21 | 1 |

Well actually, we can find the probability and calculate the gini impurity – we just have to make some thresholds for a split. Let me show you how.

The following is the way to calculate the gini impurity for a feature:

- Sort the feature values in ascending or descending order.
- For each pair of adjacent rows, find the average of the two values. We call each of these averages a threshold.
- Use each average as a threshold and calculate the gini impurity.
- Use the threshold with the lowest gini impurity to compare to the other features.

If we have the first 5 rows of the age feature as an array `[10, 15, 55, 31, 22]`, we can sort it such that the array becomes `[10, 15, 22, 31, 55]`. Next, we can find the averages between the adjacent data points.

So the technique is to iterate over each row, then find the average of the value of the current row and the value of the next row. This is easily achieved by storing an index of the current row and looking up the next value in the array at index+1.
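This sort-and-average procedure can be sketched in Python as:

```python
ages = [10, 15, 55, 31, 22]
ages.sort()  # becomes [10, 15, 22, 31, 55]

# Average each pair of adjacent values to get the candidate thresholds.
thresholds = [(a + b) / 2 for a, b in zip(ages, ages[1:])]
print(thresholds)  # → [12.5, 18.5, 26.5, 43.0]
```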

- First threshold: $(10+15)/2 = 12.5$
- Second threshold: $(15+22)/2 = 18.5$
- Third threshold: $(22+31)/2 = 26.5$
- Fourth threshold: $(31+55)/2 = 43$

Now we have four thresholds `[12.5, 18.5, 26.5, 43]`. We find a gini impurity by asking: what is the probability that a given age is LESS than a given threshold?

$$

p(\text{age} < \text{threshold} \mid \text{threshold})

$$

And the reverse: what is the probability that a given age is GREATER than a given threshold?

$$

p(\text{age} > \text{threshold} \mid \text{threshold})

$$

Great, this fits perfectly into our gini impurity calculations, since we just split the observations into two leaf nodes for each threshold. We construct 4 potential splits in a decision tree, and then we evaluate how well the split went. Note that we went from using 5 rows of data to make thresholds, to using the whole dataset to calculate the gini impurity. In practice, you would like to make thresholds for all the adjacent rows possible, but for this example, we will stick with a low number of thresholds for clarity.

Given these four different splits, all we have to do is calculate the gini impurity, just like we did with the other features. For the sake of clarity, I'm leaving the final gini impurities for each of these thresholds below, since we have already done this exercise for the three other features.

Gini Impurity of Threshold at 12.5:

$$

Gini(L_1)=1-\left(\frac{118}{223}\right)^2 -\left(\frac{105}{223}\right)^2=0.4983

$$

$$

Gini(L_2)=1-\left(\frac{21}{45}\right)^2 -\left(\frac{24}{45}\right)^2=0.49777

$$

$$

Gini(I_1) = \left(\frac{223}{268}\right)0.4983 + \left(\frac{45}{268}\right)0.49777 = 0.4982

$$

Gini Impurity of Threshold at 18.5:

$$

Gini(L_1)=1-\left(\frac{91}{172}\right)^2 -\left(\frac{81}{172}\right)^2=0.4983

$$

$$

Gini(L_2)=1-\left(\frac{49}{96}\right)^2 -\left(\frac{47}{96}\right)^2=0.4997

$$

$$

Gini(I_1) = \left(\frac{172}{268}\right)0.4983 + \left(\frac{96}{268}\right)0.4997 = 0.4988

$$

Gini Impurity of Threshold at 26.5:

$$

Gini(L_1)=1-\left(\frac{51}{145}\right)^2 -\left(\frac{94}{145}\right)^2=0.456

$$

$$

Gini(L_2)=1-\left(\frac{59}{123}\right)^2 -\left(\frac{64}{123}\right)^2=0.4992

$$

$$

Gini(I_1) = \left(\frac{145}{268}\right)0.456 + \left(\frac{123}{268}\right)0.4992 = 0.4758

$$

Gini Impurity of Threshold at 43:

$$

Gini(L_1)=1-\left(\frac{32}{83}\right)^2 -\left(\frac{51}{83}\right)^2=0.4738

$$

$$

Gini(L_2)=1-\left(\frac{105}{185}\right)^2 -\left(\frac{80}{185}\right)^2=0.4909

$$

$$

Gini(I_1) = \left(\frac{83}{268}\right)0.4738 + \left(\frac{185}{268}\right)0.4909 = 0.4856

$$

Since the gini impurity for the threshold at 26.5 is the lowest, we split at this threshold – it is the optimal split compared to the rest.
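The per-threshold calculation can be wrapped into a small helper. This is a sketch; `gini_numeric_split` and the toy data below are illustrative, not the article's full dataset.

```python
def gini_numeric_split(values, labels, threshold):
    """Weighted gini impurity of splitting a numeric feature at `threshold`:
    observations below the threshold go to one leaf, the rest to the other."""
    def gini(ls):
        n = len(ls)
        return 1 - sum((ls.count(c) / n) ** 2 for c in set(ls))

    left = [l for v, l in zip(values, labels) if v < threshold]
    right = [l for v, l in zip(values, labels) if v >= threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy example: splitting at 26.5 separates the two classes perfectly.
ages = [10, 15, 22, 31, 55]
flu = [1, 1, 1, 0, 0]
print(gini_numeric_split(ages, flu, 26.5))  # → 0.0
```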

Suppose we add another feature to the table called eye color. This feature contains a string that describes the color of a person's eyes (obviously). The important part is that it is a string, since we cannot exactly calculate the gini impurity based on a string value.

| | FEVER | COUGHING | HEADACHE | EYECOLOR | FLU |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | Blue | 1 |
| 1 | 0 | 1 | 1 | Blue | 1 |
| 2 | 0 | 1 | 1 | Brown | 0 |
| 3 | 0 | 1 | 0 | Brown | 0 |
| 4 | 1 | 1 | 0 | Green | 1 |
| ... | ... | ... | ... | ... | ... |
| 267 | 0 | 1 | 0 | Brown | 1 |

What is the solution? The way modern packages like scikit-learn or XGBoost build their decision trees is by only allowing 3 data types: *booleans, floats and integers*. This means they also reject datetime and anything else that is not one of those three data types. You can encode datetime observations too, but that is a topic for another day.

This is exactly the approach we should follow, and we should reject any data that has another data type than booleans, floats and integers. This means we need to transform our string into one of the three data types, such that the algorithm will accept it – a popular choice is one-hot encoding.

Let's say we find that there exist three categorical values for the eye color feature: *blue, green and brown*. The main idea is that we turn each categorical value into its own feature with a boolean value.

| | FEVER | COUGHING | HEADACHE | EYECOLOR_Blue | EYECOLOR_Green | EYECOLOR_Brown | FLU |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 2 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 267 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |

You see how we get three new features: EYECOLOR_Blue, EYECOLOR_Green and EYECOLOR_Brown. We give them a boolean value (1 or 0), and at this point, we can treat them like all the other features and calculate the gini impurity, just like I have shown you earlier.
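With pandas, this transformation is one call to `get_dummies` (shown on a small stand-in column, assuming pandas is available):

```python
import pandas as pd

# A stand-in eye color column with the three categorical values.
df = pd.DataFrame({"EYECOLOR": ["Blue", "Blue", "Brown", "Brown", "Green"]})

# One-hot encode: each category becomes its own boolean feature.
encoded = pd.get_dummies(df, columns=["EYECOLOR"])
print(list(encoded.columns))
# → ['EYECOLOR_Blue', 'EYECOLOR_Brown', 'EYECOLOR_Green']
```

Note that `get_dummies` orders the new columns alphabetically, so the column order differs slightly from the table above.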

Now that we have inspected how we calculate the root node, we will go through calculating the child nodes in the tree until we are done. To recap, we found that fever was the best feature at the root node, and it splits the observations such that 154 observations go to the left, and 114 observations go to the right in the decision tree.

We choose to expand the left side first, then the right side second – the next step is calculating the gini impurity for the remaining features: coughing and headache. We always explore the left side of things first, then go back and explore the right side. It's really the same process as before, as you will see in a moment.

We try to split the 154 observations by the coughing feature first. The procedure will look exactly the same as before – the main point you need to get is that we keep trying to split by a new feature the deeper we get into the tree. This is what it looks like, when we explore the best features to split by on the left side of the tree. Notice how the left green leaf, in the image above, becomes a new feature in blue.

Note that the fever feature cannot be used to split the observations anymore, as we consider that feature used. The feature can only be reused when the parent nodes have not used the feature yet, but since fever is at the root of the tree, there are no parent nodes.

It looks like the coughing feature ends up splitting the observations quite nicely, when looking at the ratio between the persons who had the flu and persons who did not have the flu. This looks to be a promising feature.

As you have seen a couple times now, we calculate the gini impurity for each leaf, and then the gini impurity for the internal node coughing.

$$

Gini(L_1)=1-\left(\frac{102}{124}\right)^2 -\left(\frac{22}{124}\right)^2=0.2919

$$

$$

Gini(L_2)=1-\left(\frac{8}{30}\right)^2 -\left(\frac{22}{30}\right)^2=0.3911

$$

$$

Gini(I_1) = \left(\frac{124}{154}\right)0.2919 + \left(\frac{30}{154}\right)0.3911 = 0.3112

$$

There we have it, the gini impurity for the coughing feature is $0.3112$.

The next feature is headache. The split looks like the following, when splitting the observations based on if a person had a headache.

At first glance, the headache feature does not split the observations as well. If you look at the ratio between $FLU=1$ and $FLU=0$, we see that the ratio from coughing is better.

It looks like headache won't make it, but we won't know until we have calculated the gini impurity. Same method as before:

$$

Gini(L_1)=1-\left(\frac{55}{80}\right)^2 -\left(\frac{25}{80}\right)^2=0.4297

$$

$$

Gini(L_2)=1-\left(\frac{32}{74}\right)^2 -\left(\frac{42}{74}\right)^2=0.4909

$$

$$

Gini(I_1) = \left(\frac{80}{154}\right)0.4297 + \left(\frac{74}{154}\right)0.4909 = 0.4591

$$

And it turns out the gini impurity for the headache feature is $0.4591$, which is much higher than the gini impurity for the coughing feature, which was $0.3112$. So we split by the coughing feature.

Now we just need to see if headache splits the two leaf nodes better than what's already there. We start by the left side again.

Now we are matching a gini impurity of a leaf against a potential new internal node. Recall the gini impurity for the left leaf, where $coughing = 1$, was $0.2919$, and the gini impurity for the right leaf, where $coughing = 0$, was $0.3911$. These are the scores we need to beat, if we want to split by our last feature headache.

Let's calculate the gini impurity for headache on the left side, given that the above image is how headache turns out to split the observations.

$$

Gini(L_1)=1-\left(\frac{97}{101}\right)^2 -\left(\frac{4}{101}\right)^2=0.076

$$

$$

Gini(L_2)=1-\left(\frac{5}{23}\right)^2 -\left(\frac{18}{23}\right)^2=0.3402

$$

$$

Gini(I_1) = \left(\frac{101}{124}\right)0.076 + \left(\frac{23}{124}\right)0.3402 = 0.125

$$

We split by the headache feature, since the gini impurity $0.125$ for the headache feature is lower than the previous gini impurity $0.2919$ for the leaf where $coughing = 1$.

The last thing we need to do is go back and check the second leaf node. We need to know if the gini impurity is lower than $0.3911$.

Now that we have seen the split, it still looks promising. Though, let's see what the gini impurity turns out to be.

$$

Gini(L_1)=1-\left(\frac{4}{16}\right)^2 -\left(\frac{12}{16}\right)^2=0.375

$$

$$

Gini(L_2)=1-\left(\frac{4}{14}\right)^2 -\left(\frac{10}{14}\right)^2=0.4082

$$

$$

Gini(I_1) = \left(\frac{16}{30}\right)0.375 + \left(\frac{14}{30}\right)0.4082 = 0.3905

$$

Since $0.3905$ is less than $0.3911$, we will have to split here again. This concludes building the left side of the tree.

Let's put tree construction into steps, for the sake of reference and making it as easy as possible to understand.

- Calculate the gini impurity for each feature, as a candidate to become the next node in the classification tree.
  - a. For each feature in unused_features
  - b. For each class in the feature
  - c. Calculate the gini impurity for the class
  - d. Calculate the gini impurity for the feature
- Compare the gini impurities for all unused features, and split the observations based on the feature with the lowest gini impurity.
  - a. Compare gini impurities for all unused features
  - b. If one of the gini impurities is less than the gini impurity for the leaf node, split the observations with the feature that has the lowest gini impurity.

Notice how the strategy for the root node is the same as the internal nodes when finding the best split, except for the fact that there are no previous leaf nodes, which means we skip part of step 2b.
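The two steps above can be condensed into a small split-search sketch for boolean features. The helper names and the four toy rows are illustrative, not the article's implementation.

```python
def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, unused_features, target):
    """Return the (feature, weighted_gini) pair with the lowest impurity."""
    scores = {}
    for feature in unused_features:
        left = [r[target] for r in rows if r[feature] == 1]
        right = [r[target] for r in rows if r[feature] == 0]
        if not left or not right:  # feature does not split the observations
            continue
        n = len(rows)
        scores[feature] = (len(left) / n) * gini(left) \
                        + (len(right) / n) * gini(right)
    return min(scores.items(), key=lambda kv: kv[1])

rows = [
    {"fever": 1, "coughing": 1, "flu": 1},
    {"fever": 1, "coughing": 0, "flu": 1},
    {"fever": 0, "coughing": 1, "flu": 0},
    {"fever": 0, "coughing": 0, "flu": 0},
]
print(best_split(rows, ["fever", "coughing"], "flu"))  # → ('fever', 0.0)
```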

Regularization is when we try to control the complexity of a machine learning model, such that we are not overfitting/underfitting to our dataset. The way regularization usually works is by adding a penalty to a score based on the parameters of a model – the penalty is large when the parameters have large values, and the penalty is small when the parameters have small values.

What are examples of regularization for classification trees?

- Post-training pruning. This goes back through the classification tree and removes internal nodes and leaf nodes, based on calculations of a tree score. The algorithm is called *minimal cost complexity pruning*.
- Max tree depth. This limits the number of splits we can use in a tree.
- Minimum samples required for a split. This requires a leaf node to hold at least *n* observations before that leaf node can be split by a feature.

There are other parameters that we can adjust, but these are the main ones.
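In scikit-learn, these knobs map to constructor parameters of `DecisionTreeClassifier` (parameter names are from the sklearn API; the values below are arbitrary examples, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=3,           # max tree depth
    min_samples_split=10,  # minimum samples required to split a node
    ccp_alpha=0.01,        # minimal cost complexity pruning strength
)

X, y = load_iris(return_X_y=True)
clf.fit(X, y)
print(clf.get_depth())  # at most 3, by construction
```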

Did you like this article? Then you will enjoy following the implementation of classification trees from scratch in Python, which is coming up next.

I recommend watching StatQuest for a visual explanation of Classification Trees. He really goes through the whole concept and explains things nicely.

https://www.youtube.com/watch?v=7VeUPuFGJHk

- Classification and Regression Trees by Leo Breiman et al. 1984. This is the original book on CARTs, so if you have a relatively good mathematical background and want a stronger intuition, this is a recommended read.
- Introduction To Statistical Learning by James et al. 2013. This book is a general machine learning and statistics book. It has some great sections on decision trees and ensemble methods in chapter 8, pages 303-332.

*Disclaimer*: This post is not sponsored or supported in any way by Digital Ocean.

Run your Machine Learning model without paying a dime. Do this by receiving *\$100 for free* by using our referral link. If you choose to spend \$25 besides the free credits, the referral program gives ML From Scratch an extra \$25, which helps the website run smoothly.

In four simple steps, you will see exactly how you can cheaply and simply deploy your model to the cloud.

**The Setup:**

a. Installing Docker

b. Creating a \$5 droplet (also known as a virtual machine (VM))

c. Accessing your droplet

**Training Our Model**

**Preparing Our Model For Deployment:**

a. Making A Flask Application

b. Running The Flask Application In A Container

**Deploying Our Model:**

a. Accessing The Online Application

**Final Things To Consider**

The setup is crucial to make this tutorial work. We need to install Docker, set up your droplet and make sure you can access it through SSH.

Note that you cannot use Docker with Windows Home – consider using a local Ubuntu VM by using Hyper-V on Windows instead.

**Windows** Pro/Enterprise/Education: Sign up and download Docker Desktop on Windows. Once logged in, you can download Docker. Make sure the program is running and that you are logged in locally.

**MacOS**: Sign up and download Docker Desktop on Mac. Once logged in, you can download Docker. Make sure the program is running and that you are logged in locally.

**Linux**: Use the three following commands to download and start Docker: 1) `apt install docker.io`, 2) `systemctl start docker`, and 3) `systemctl enable docker`. You can log in with `docker login` if you have a registry you want to log in to.

The next step is creating your Digital Ocean account and setting billing up. Remember you get $100 for free if you use our referral link.

After logging in, you want to look in the top right corner, click *Create* and then *Droplets*.

For this guide, you only have to modify the parts shown in the screenshots – meaning you will be using Ubuntu 18.04.3, since that is the standard. Start by expanding the plans by clicking *Show all plans*.

Then choose the $5 plan.

Next, choose a region, preferably the one closest to you or your customers.

We turned on monitoring, so you can see how much RAM, CPU, disk, network etc. is being used over time. We also chose to use a password for accessing the server, because it was the easiest way, but we recommend spending the time to set up SSH keys, since they are much more secure.

As a last option, you can enable snapshot backups that you can revert to if something ever goes wrong.

After creating your droplet, click on your project name in the *Projects* tab in the left side menu, then click on your droplet. Here, you can see the ip-address (ipv4) in the upper left corner. Note that the ip-address you see in this article is not online anymore.

Using your ip-address, open Terminal on Mac/Linux and use the following command. You can also use similar commands on Windows, look here for more information. Another good option for Windows is setting up PuTTy.

After having logged in, you will be prompted to enter your current password and then a new password for the root user.

When you are done, you want to run the two following commands to update all the necessary packages from *apt* and install *git* for later.

`apt update`

`apt install git`

To deploy a model, we first need to train a model. In this example, we will use the Iris Dataset with a Random Forest model. We start by importing all the packages we need for training.

```
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib
```

Then we load the iris dataset from scikit-learn into a Pandas dataframe and Pandas series.

```
data_loader = load_iris()
X_data = data_loader.data
X_columns = data_loader.feature_names
x = pd.DataFrame(X_data, columns=X_columns)
y_data = data_loader.target
y = pd.Series(y_data, name='target')
```

We always split our dataset into training and testing, so here we use a function from scikit-learn that splits our input *x* and output *y* into training and testing.

```
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)
```

Next up is training our model. We used the default parameters for the random forest classifier, and then we input the training dataset into the fit function.

```
rf = RandomForestClassifier()
rf.fit(x_train, y_train)
```

After having trained the model, we can make predictions and check how well we perform on the test set. After that, we save the random forest model as a pickle file.

```
pred = rf.predict(x_test)
score = accuracy_score(y_test, pred)
print(score)
joblib.dump(rf, 'trained_models/iris.pkl')
```

Note that this is a scikit-learn model, but you could use XGBoost or LightGBM at almost the same speed. Usually, one of those models will perform much better on a more complex and larger dataset.

The only limitation here is that we are only using a CPU, which means the setup is not suited for running deep learning, since those applications require GPUs and are much more expensive.

We need to make a Flask application that serves a webpage which can make requests to our machine learning model on the server, so we can return predictions based on the input from the user on the webpage.

We create a Flask application that will run inside a container. There are two routes specified by the `@app.route()` decorator: one for serving a web page at the route `/`, and another at `/predict` that acts as an API, which both the webpage or some third party can interact with on a request-reply basis.

The following code snippet is all there is to our `main.py`, which we will show you how to containerize in a moment. The code makes a service on localhost using port 5000, and it can be run locally with `python main.py`.

```
from flask import Flask, request
import pandas as pd
import joblib

app = Flask(__name__, static_url_path='/static')
model = joblib.load('trained_models/iris.pkl')

@app.route('/')
def root():
    return app.send_static_file('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json(force=True)
    df = pd.DataFrame(data, index=[0])
    prediction = model.predict(df)
    return str(prediction[0])

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=True)
```

We won't go into the HTML, CSS or JavaScript files, but we did some basic work using Bootstrap and jQuery to build the webpage and send requests using Ajax.

Check the static folder on GitHub to find out more.
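To exercise the `/predict` route without a running server or the pickled model, you can use Flask's built-in test client. The snippet below is a self-contained sketch where a hypothetical `StubModel` stands in for the trained forest.

```python
from flask import Flask, request
import pandas as pd

app = Flask(__name__)

class StubModel:
    """Stands in for the pickled random forest."""
    def predict(self, df):
        return [0] * len(df)

model = StubModel()

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json(force=True)
    df = pd.DataFrame(data, index=[0])
    return str(model.predict(df)[0])

# Exercise the route in-process – no running server required.
client = app.test_client()
resp = client.post('/predict', json={
    "sepal length (cm)": 5.1, "sepal width (cm)": 3.5,
    "petal length (cm)": 1.4, "petal width (cm)": 0.2,
})
print(resp.get_data(as_text=True))  # → 0
```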

In order for us to run our service in a container, we must make a Dockerfile to make an image. In a Dockerfile, you specify the base image (Python 3.7), copy or add all the necessary files, and then run the container.

```
FROM python:3.7
COPY ./requirements.txt /app/req.txt
WORKDIR /app
RUN pip install -r req.txt
ADD ./trained_models ./trained_models
ADD ./static ./static
ADD ./main.py main.py
EXPOSE 5000
CMD [ "gunicorn", "--bind", "0.0.0.0:5000", "main:app" ]
```

We use Gunicorn as the WSGI server, since Flask's built-in server is only meant for development and we need a stable server – hence the use of Gunicorn.

To build this image using Docker, you would run the following command:

```
docker build -t iris:1.0 .
```

Here `-t` is for tagging the image with an image name `iris` and a version `1.0`, and we want to build the image in the current working directory, specified by the dot at the end.

The part where it says *Successfully built 735e1e847ba6* is where your attention should be, because that is the ID of the image, which you can also get by running the Docker command `docker images`.

For running the container locally, you can now use the following command. The flag `-d` means we are running it in detached mode, while the flag `-p` maps port numbers in the format *host:container*.

```
docker run -d -p 5000:5000 735e1e847ba6
```

Go to http://localhost:5000 and check out the application.

Note that we should always check that we can access the web application when running it in a Docker container locally, before we move on to deployment. This matters because it lets us rule out the container as the source of any error that occurs later.

You installed git earlier – now, we need to use it on the VM. Start by cloning my repository using the following git command.

```
git clone https://github.com/casperbh96/iris-deployment.git
```

After this, you need to change to the directory of the repository you just cloned.

```
cd iris-deployment
```

The next thing is modifying the `setup.sh` file. Please note that you don't need to modify all the variables, only the ones where you deviated from this tutorial.

- Modify the *LISTEN* variable on line 6 by inserting the IP address of your own server: `LISTEN=<ip_address>:80`
- Modify the *LOCATIONS* variable on line 10. This variable determines how your web application is accessed through the URL, so you could use something like `/webapp` instead of the default `/`.
- Modify the *PROXIES* variable on line 11 if your container runs on a port other than 5000.
- Modify the *ports* variable if your container runs on a port other than 5000. The mapping is in the format *host:container*, which means the first port number is mapped on the host, while the second port number is the one used inside the container.
- Modify the *image_name* variable to be the name of your image. This is referred to as the *repository* if you run the command `docker images`.
- Modify the *image_version* variable if needed.

```
#!/bin/bash
reinstall_docker_nginx=true
configure_nginx=true
kill_docker_images=false
LISTEN=64.227.72.137:80
SERVER_NAME=localhost
# Arrays
declare -a LOCATIONS=("/")
declare -a PROXIES=("http://localhost:5000/")
declare -a image_name=("iris")
declare -a image_version=("1.0")
declare -a ports=("5000:5000")
#-------------------------------------------------
####
#### Install all necessary packages
####
if [ "$reinstall_docker_nginx" = true ]
then
# Remove previous docker
apt update
apt-get -y purge docker docker-engine docker.io
# Install and enable docker
apt-get -y install docker.io
systemctl start docker
systemctl enable docker
# Remove previous nginx
apt-get -y purge nginx nginx-common nginx-full
# Install and enable nginx
apt-get -y install nginx
systemctl enable nginx
systemctl start nginx
ufw enable
fi
####
####Setup nginx config
####
# get length of an array
n_locations=${#LOCATIONS[@]}
# insert at
LINE_NUMBER=63
if [ "$configure_nginx" = true ]
then
sed -i ''"${LINE_NUMBER}"'i\
server {\
listen '"${LISTEN}"'; \
server_name '"${SERVER_NAME}"'; \
\
location '"${LOCATIONS[0]}"' { \
proxy_pass '"${PROXIES[0]}"'; \
} \
}' /etc/nginx/nginx.conf
# insert at
LINE_NUMBER=70
# use for loop to read all values and indexes
for (( i=1; i<${n_locations}; i++ ));
do
sed -i ''"${LINE_NUMBER}"'i\
location '"${LOCATIONS[i]}"' { \
proxy_pass '"${PROXIES[i]}"'; \
}' /etc/nginx/nginx.conf
LINE_NUMBER=$(($LINE_NUMBER + 4))
done
fi
####
#### Run all images
####
if [ "$kill_docker_images" = true ]
then
docker kill $(docker ps -q)
fi
# get length of an array
n_images=${#image_name[@]}
# use for loop to read all values and indexes
for (( i=0; i<${n_images}; i++ ));
do
docker build -t "${image_name[$i]}:${image_version[$i]}" .
ID="$(docker images | grep ${image_name[$i]} | head -n 1 | awk '{print $3}')"
docker run -d -p ${ports[$i]} ${ID}
done
service nginx restart
```

After you have gone through and modified the variables correctly, you are ready to run the script. This is a one-and-done script that sets up the whole server quickly, and you can run it using the command `bash setup.sh`.

Executing the script will do the following in order:

- Uninstall any existing Docker installation (if any), and install Docker. Also makes sure to always start Docker when the system is restarted.
- Uninstall any existing nginx installation (if any), and install nginx. Also makes sure to always start nginx when the system is restarted.
- Correctly modifies the `nginx.conf` file that you can find at the path `/etc/nginx/nginx.conf`. This step is crucial, as we insert some lines that point the external IP address of your server to the Docker container.
- Builds the Docker image from the Dockerfile and runs a container with the image on the ports specified earlier.
- Restarts nginx for all changes to apply.

I encourage you to go and check out the code in the `nginx.conf` file, because it is important that you understand the nuances of how to get it to work. Small things like a missing `/` at the end of a location can stop nginx from working.

Now you should be ready to access your application through the IP address of the server. If you modified the *LOCATIONS* variable, remember to append the path to the IP address, e.g. `/webapp` or whatever you used.

Because our project is part web application and part API, we can make a POST request to the public IP address using Postman.

Note that the output is encoded as a 0 for Iris Setosa, 1 for Iris Versicolour, and 2 for Iris Virginica.

We can also do the same thing with a Python script.

```
import json
import requests

data = {
    "sepal length (cm)": 6.1,
    "sepal width (cm)": 3.2,
    "petal length (cm)": 3.95,
    "petal width (cm)": 1.3
}

ip_address = 'http://64.227.72.137/predict'
r = requests.post(ip_address, json=data)

print(r.text)
print(r.status_code)
```

Note that `r.text` returns the output, which is a 0 for Iris Setosa, 1 for Iris Versicolour, and 2 for Iris Virginica.

- VM security: Creating non-root users and using SSH-keys to login to the server. Also think about security of the Flask app, e.g. XSS, X-XSS, CSRF, clickjacking (iframe) attacks.
- Version control: Create a container registry for version control of your container.
- Traffic: You can scale nginx and add more droplets to the upstream, so that the requests are distributed if you need to handle more traffic.
- Repository size: Don't put big files in your GitHub repository, because it will take a long time to download from the VM.

*This article was first published by IBM Developer at developer.ibm.com, but authored by Casper Hansen.*

Linear Regression is famously known for being a simple algorithm and a good baseline to compare more complex models to. In this article, we explore the algorithm and turn the math into code, and then we run the code on a dataset, to get predictions on new data.

- What Is Linear Regression?
- Multiple Linear Regression
- Special Case 1: Simple Linear Regression
- Special Case 2: Polynomial Regression

The Linear Regression model consists of a single equation that is linear in its *variables* (also called *parameters* or *features*), along with a coefficient estimation algorithm called least squares, which attempts to figure out the best possible coefficients given the data.

Linear regression models are known to be simple and easy to implement, because there is no advanced mathematical knowledge needed, except for a bit of linear algebra. For this reason, many people choose to use a linear regression model as a baseline model, to compare if another model can outperform such a simple model.

Multiple linear regression is a model that can capture a linear relationship between multiple variables/features – assuming that there is one. The general formula for multiple linear regression looks like the following:

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_i x_i + \varepsilon
$$

- $\beta_0$ is known as the intercept
- $\beta_1$ to $\beta_i$ are known as coefficients
- $x_1$ to $x_i$ are the features of our dataset
- $\varepsilon$ are the residual terms

We can also represent the formula for linear regression in vector notation. When representing the formula in vector notation, we have the advantage of using some operations from linear algebra, which in turn makes it easier to code.

$$
\mathbf {y} =
{\begin{bmatrix}
y_{1}\\y_{2}\\\vdots \\y_{n}
\end{bmatrix}},\quad
{\displaystyle X={
\begin{bmatrix}
\mathbf {x} _{0}^{\mathsf {T}}\\
\mathbf {x} _{1}^{\mathsf {T}}\\
\vdots \\
\mathbf {x} _{n}^{\mathsf {T}}
\end{bmatrix}}
,\quad }
{\displaystyle {\boldsymbol {\beta }}={
\begin{bmatrix}
\beta _{0}\\
\beta _{1}\\
\beta _{2}\\
\vdots \\
\beta _{n}
\end{bmatrix}},\quad
{\boldsymbol {\varepsilon }}={
\begin{bmatrix}
\varepsilon _{1}\\
\varepsilon _{2}\\
\vdots \\
\varepsilon _{n}
\end{bmatrix}}}
$$

Note that the $\mathbf{x}_0^{\mathsf{T}}$ vector contains just a series of 1's: `[1, 1, ..., 1]` – this is what produces the intercept term $\beta_0$. This turns our equation into something much more compact, where all our terms are now represented as matrices. The following equation shows that we can compute the output value for all *y*, given that we have an estimate of the coefficients $\boldsymbol{\beta}$.

$$
\mathbf{y} = \mathbf{X} \boldsymbol{\beta } + \boldsymbol{\varepsilon}
$$

But how exactly can we estimate the coefficient values?

Linear least squares (LLS) is the main algorithm for estimating the coefficients of the formula just presented. In Machine Learning language, this is known as fitting your model to the dataset. We will focus on the most popular variant, called Ordinary Least Squares (OLS).

The OLS algorithm minimizes *the sum of squares of residuals*.

One residual is the distance from one data point to the line in 2D or the plane in 3D, so when we minimize the total sum of the squared residuals, we are minimizing the average squared distance from a data point to the line we are fitting.

The formula for the algorithm can be quite intimidating if one is not familiar with linear algebra: transposed matrices, dimensionality, dot products and the inverse of a matrix. These are not topics for this article, but read Inverting A Matrix and the full derivation of the formula by Arthur S. Goldberger, also called the normal equation, if you want to be more knowledgeable on those topics.

The following formula, known as the normal equation, ensures that the resulting coefficients define a minimum, which means that the result is the minimized total sum of squared residuals:

$$
{\displaystyle {\hat {\boldsymbol {\beta }}}=(\mathbf {X} ^{\rm {T}}\mathbf {X} )^{-1}\mathbf {X} ^{\rm {T}}\mathbf {y}}
$$

The formula can be coded in one line of code, because it's just a few operations. We will see that later on in the coding section.
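As a minimal sketch of that one-liner, here is how the normal equation could look in NumPy. The design matrix and outputs are made up for illustration; note the leading column of ones for the intercept, matching the vector notation above:

```python
import numpy as np

# a tiny made-up design matrix with a leading column of ones (for the intercept)
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 1.0],
              [1.0, 0.0, 2.0],
              [1.0, 3.0, 5.0]])
# outputs generated exactly from beta = [1, 2, 3], so the fit is exact
y = X @ np.array([1.0, 2.0, 3.0])

# the normal equation in one line: beta_hat = (X^T X)^-1 X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
```

Since `y` was generated without noise, `beta_hat` recovers `[1, 2, 3]` exactly (up to floating point error).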

The outcome of the algorithm, *beta hat* $\boldsymbol{\hat{\beta}}$, is a vector containing all the coefficients, which can be used to make predictions using the formula presented in the beginning for multiple linear regression. To give an example in 3D, we could have this set of coefficients `[2.1, 5.3, 9.2]`, which can be plugged into the equation for multiple linear regression:

$$
y = 2.1 + 5.3 \times x_1 + 9.2 \times x_2
$$

This means the intercept is at 2.1, while there are two variables that have their estimated coefficients as 5.3 and 9.2. To estimate an output variable *y*, one would need to input two variables *x1* and *x2* into the equation, and then we have made a prediction.
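To make that concrete, here is a small sketch of plugging made-up values for $x_1$ and $x_2$ into the equation:

```python
import numpy as np

intercept = 2.1
coefficients = np.array([5.3, 9.2])

# made-up input variables x1 and x2
x = np.array([1.0, 2.0])

# y = 2.1 + 5.3*x1 + 9.2*x2
y = intercept + np.dot(coefficients, x)
```

Here `y` works out to $2.1 + 5.3 \times 1.0 + 9.2 \times 2.0 = 25.8$.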

We will see examples for this in the coding section, though there is one important thing missing; assuming we have a number of predictions for some observed data, how can we measure how well the model predicted the ground truth?

We can measure the accuracy of how well the multiple linear regression model performs by a metric called *r squared*.

$$
R^2 = 1 - \frac
{ \sum_{i=0}^{n} (y_i - \hat{y}_i)^2 }
{ \sum_{i=0}^{n} (y_i - \bar{y})^2 }
$$

- $y_i$ is one value of *y* at index *i*
- $\hat{y}_i$ is pronounced *y hat*, and it is the predicted value of $y_i$
- $\bar{y}$ is pronounced *y bar*, and it is the average of *y*

The metric measures the relationship between the residual sum of squares (RSS) and the total sum of squares (TSS). The RSS sums the squared differences between the ground truth and the predictions, while the TSS sums the squared differences between the ground truth and its average.
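As a quick sketch with made-up numbers, $R^2$ can be computed like this:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 6.9, 9.2])

# rss: squared differences between ground truth and predictions
rss = np.sum((y_true - y_pred) ** 2)
# tss: squared differences between ground truth and its average
tss = np.sum((y_true - np.mean(y_true)) ** 2)

r2 = 1 - rss / tss
```

For these numbers, `rss` is 0.10 and `tss` is 20, giving an $R^2$ of 0.995.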

We have made a single class called `MultipleLinearRegression` in a `multiple_linear_regression.py` file that represents all of what we have talked about until this point.

```
import numpy as np
import copy

class MultipleLinearRegression():
    def __init__(self):
        self.coefficients = None
        self.intercept = None

    def fit(self, x, y):
        # prepare x and y values for coefficient estimates
        x = self._transform_x(x)
        y = self._transform_y(y)
        betas = self._estimate_coefficients(x, y)
        # the first beta is the intercept
        self.intercept = betas[0]
        # the rest of the betas are the coefficients
        self.coefficients = betas[1:]

    def predict(self, x):
        '''
        y = b_0 + b_1*x_1 + ... + b_i*x_i
        '''
        predictions = []
        for index, row in x.iterrows():
            values = row.values
            pred = np.multiply(values, self.coefficients)
            pred = sum(pred)
            pred += self.intercept
            predictions.append(pred)
        return predictions

    def r2_score(self, y_true, y_pred):
        '''
        r2 = 1 - (rss/tss)
        rss = sum_{i=0}^{n} (y_i - y_hat_i)^2
        tss = sum_{i=0}^{n} (y_i - y_bar)^2
        '''
        y_values = y_true.values
        y_average = np.average(y_values)
        residual_sum_of_squares = 0
        total_sum_of_squares = 0
        for i in range(len(y_values)):
            residual_sum_of_squares += (y_values[i] - y_pred[i])**2
            total_sum_of_squares += (y_values[i] - y_average)**2
        return 1 - (residual_sum_of_squares/total_sum_of_squares)

    def _transform_x(self, x):
        # prepend a column of ones for the intercept
        x = copy.deepcopy(x)
        x.insert(0, 'ones', np.ones((x.shape[0], 1)))
        return x.values

    def _transform_y(self, y):
        y = copy.deepcopy(y)
        return y.values

    def _estimate_coefficients(self, x, y):
        '''
        beta_hat = (X^T X)^-1 X^T y
        Estimates both the intercept and all coefficients.
        '''
        xT = x.transpose()
        inversed = np.linalg.inv(xT.dot(x))
        coefficients = inversed.dot(xT).dot(y)
        return coefficients
```

We needed a dataset to put our new multiple linear regression algorithm to use, so we loaded in a dataset called Boston Housing Prices from Scikit-Learn. We split the dataset into inputs *x* and outputs *y*, and we further split *x* and *y* into data we train on and data we test on. This is in a `load_dataset.py` file.

```
from sklearn.datasets import load_boston
import pandas as pd
from sklearn.model_selection import train_test_split

def sklearn_to_df(data_loader):
    x_data = data_loader.data
    x_columns = data_loader.feature_names
    x = pd.DataFrame(x_data, columns=x_columns)
    y_data = data_loader.target
    y = pd.Series(y_data, name='target')
    return x, y

x, y = sklearn_to_df(load_boston())
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)
```

The following is the start of the `main.py` file for running our multiple linear regression.

- Load all dependencies and data
- Initialize the class `MultipleLinearRegression()` as an object
- Fit the object to our data by `mlr.fit(x_train, y_train)`
- Make predictions on a test set that is unseen to the algorithm by `mlr.predict(x_test)`
- Compute the r squared score by `mlr.r2_score(y_test, pred)`

```
from load_dataset import x_train, x_test, y_train, y_test
from multiple_linear_regression import MultipleLinearRegression
from sklearn.linear_model import LinearRegression
# # # # # # # # # # # # # # # # #
# Our Multiple Linear Regression #
# # # # # # # # # # # # # # # # #
mlr = MultipleLinearRegression()
# fit our LR to our data
mlr.fit(x_train, y_train)
# make predictions and score
pred = mlr.predict(x_test)
# calculate r2_score
score = mlr.r2_score(y_test, pred)
print(f'Our Final R^2 score: {score}')
```

The output score turns out to be:

```
Our Final R^2 score: 0.6687594935355938
```

Although we did just implement multiple linear regression, let's compare how well our implementation works in comparison to Scikit-Learn's linear regression:

```
# # # # # # # # # # # # # # # # # #
# Scikit-Learn's Linear Regression #
# # # # # # # # # # # # # # # # # #
sk_mlr = LinearRegression()
# fit scikit-learn's LR to our data
sk_mlr.fit(x_train, y_train)
# predicts and scores
sk_score = sk_mlr.score(x_test, y_test)
print(f'Scikit-Learn\'s Final R^2 score: {sk_score}')
```

The output from their model is almost identical to ours, except after the 13th decimal:

```
Scikit-Learn's Final R^2 score: 0.6687594935356329
```

Simple Linear Regression can be expressed in one simple equation:

$$
y = \text{intercept} + \text{coefficient} \times x_{value}
$$

The intercept is often known as beta zero $\beta_0$ and the coefficient as beta one $\beta_1$. The equation is equal to the equation for a straight line:

$$
y = \beta_0 + \beta_1 x_1 = mx + b
$$

That is all there is to a simple linear regression equation. But how do we determine the intercept and the coefficient? This is where we introduce the least squares algorithm.

Ordinary Least Squares is known to minimize the *sum of squared residuals (SSR)*. There are two central parts to ordinary least squares in this special case: estimating the coefficients and estimating the intercept.

We can estimate the coefficient (the slope) by finding the covariance between x and y, and dividing it by the variance of x. It should be noted that this is a special case of linear regression, derived using calculus from the formula as seen in multiple linear regression earlier on in this article.

$$
\text{coefficient} = \beta_1 =
\frac
{cov(x,y)}
{var(x)}
$$

This turns into something a little bit more complex. Any time we have a summation symbol $\sum$, the equivalent in programming is summing the elements of an array. We start by adding the first element to the sum where index *i=0*, and we go all the way to the length of the array *n*. The covariance and variance can be expanded to the following formula:

$$
\text{coefficient} = \beta_1 =
\frac
{\sum_{i=0}^{n} (x_i - \bar{x})(y_i - \bar{y})}
{\sum_{i=0}^{n}(x_i - \bar{x})^2 }
$$

- $x_i$ is one value of *x* at index *i*
- $y_i$ is one value of *y* at index *i*
- $\bar{x}$ is pronounced *x bar*, and it is the average of *x*
- $\bar{y}$ is pronounced *y bar*, and it is the average of *y*

The estimate of the intercept $\beta_0$ should be easier to understand than the estimate of the coefficient $\beta_1$.

$$
\beta_0 = \bar{y} - \beta_1 \times \bar{x}
$$

- $\bar{x}$ is pronounced *x bar*, and it is the average of *x*
- $\bar{y}$ is pronounced *y bar*, and it is the average of *y*
- $\beta_1$ is the coefficient we estimated earlier
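The two estimates above can be sketched in a few lines of NumPy. The data is made up for illustration and follows $y = 1 + 2x$ exactly:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])  # exactly y = 1 + 2x

x_bar, y_bar = np.mean(x), np.mean(y)

# coefficient = cov(x, y) / var(x), written out as the two sums
beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# intercept = y_bar - beta_1 * x_bar
beta_0 = y_bar - beta_1 * x_bar
```

Because the data is noise-free, the estimates recover the slope 2 and the intercept 1 exactly.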

We have made a single class called `SimpleLinearRegression` in a `simple_linear_regression.py` file that implements the special case of simple linear regression.

```
import numpy as np

class SimpleLinearRegression():
    def __init__(self):
        self.coefficient = None
        self.intercept = None

    def fit(self, x, y):
        '''
        Given a dataset with 1 input feature x and output feature y,
        estimates the coefficient and computes the intercept.
        '''
        self.coefficient = self._coefficient_estimate(x, y)
        self.intercept = self._compute_intercept(x, y)

    def predict(self, x):
        '''
        y = b_0 + b_1*x
        '''
        x_times_coeff = np.multiply(x, self.coefficient)
        return np.add(x_times_coeff, self.intercept)

    def r2_score(self, y_true, y_pred):
        '''
        r2 = 1 - (rss/tss)
        rss = sum_{i=0}^{n} (y_i - y_hat_i)^2
        tss = sum_{i=0}^{n} (y_i - y_bar)^2
        '''
        y_average = np.average(y_true)
        residual_sum_of_squares = 0
        total_sum_of_squares = 0
        for i in range(len(y_true)):
            residual_sum_of_squares += (y_true[i] - y_pred[i])**2
            total_sum_of_squares += (y_true[i] - y_average)**2
        return 1 - (residual_sum_of_squares/total_sum_of_squares)

    def _compute_intercept(self, x, y):
        '''
        intercept = y_bar - coefficient*x_bar
        WHERE: y_bar = average(y),
               x_bar = average(x),
               coefficient = the coefficient already estimated
        '''
        # multiply the coefficient and the average of the x values
        mul = self.coefficient * np.average(x)
        return np.average(y) - mul

    def _coefficient_estimate(self, x, y):
        '''
        coefficient = sum_{i=0}^{n} (x_i - x_bar) * (y_i - y_bar)
                      -------------------------------------------
                      sum_{i=0}^{n} (x_i - x_bar)^2
        '''
        # compute the averages once, outside the loop
        x_average = np.average(x)
        y_average = np.average(y)
        numerator = 0
        denominator = 0
        for i in range(len(x)):
            numerator += (x[i] - x_average) * (y[i] - y_average)
            denominator += (x[i] - x_average)**2
        return numerator / denominator
```

As shown in the Code For Multiple Linear Regression section, we load in a dataset from Scikit-Learn.

The following is our `main.py` file that loads the dataset, picks one feature and runs the simple linear regression model. After experimentation, we found that the feature called *LSTAT* performed the best in terms of the r2 score.

```
from load_dataset import x_train, x_test, y_train, y_test
from simple_linear_regression import SimpleLinearRegression
# pick a single feature to estimate y
x_train = x_train['LSTAT'].values
x_test = x_test['LSTAT'].values
y_train = y_train.values
y_test = y_test.values
# fit to data
slr = SimpleLinearRegression()
slr.fit(x_train, y_train)
# make predictions and score
pred = slr.predict(x_test)
score = slr.r2_score(y_test, pred)
print(f'Final R^2 score: {score}')
```

We did conveniently pick the feature that gave us the highest r squared score, but compared to the multiple linear regression, our model is not that far off:

```
Final R^2 score: 0.5429180422970383
```

Another case of multiple linear regression is polynomial regression, which might look like the following formula:

$$
{\displaystyle y_{i}\,=\,\beta _{0}+\beta _{1}x_{i}+\beta _{2}x_{i}^{2}+\cdots +\beta _{m}x_{i}^{m}+\varepsilon _{i}\ (i=1,2,\dots ,n)}
$$

The formula is flexible in the exponents, so that it can be changed to model specific problems better. The code for polynomial regression is the same as for multiple linear regression, except for the predict function. We increase the degree of the exponent by one for each feature and use those as our values instead.

```
def predict_polynomial(self, x):
    '''
    y = b_0 + b_1*x + b_2*x^2 + ... + b_m*x^m
    '''
    predictions = []
    for index, row in x.iterrows():
        # raise each feature to the power of its degree m = i + 1
        polynomial_values = [feature**(i + 1) for i, feature in enumerate(row.values)]
        pred = np.multiply(polynomial_values, self.coefficients)
        pred = sum(pred)
        pred += self.intercept
        predictions.append(pred)
    return predictions
```

One of the drawbacks of polynomial regression, though, is that you have to find out which formula might work for you, and that gets quite hard as the number of variables grows, since we don't really have much intuition for what a 4-, 7- or 100-dimensional space looks like.

Perhaps a naive solution to the problem is finding the best polynomial by brute force: trying all the different combinations you can think of, e.g. with a degree varying from 1 to 5.
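A minimal sketch of that brute-force idea, using NumPy's `polyfit` as a shortcut instead of our own class and made-up cubic data, could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
# made-up data with a cubic relationship plus noise
y = 1.0 + 2.0 * x - 0.5 * x**3 + rng.normal(scale=0.5, size=x.shape)

def r2(y_true, y_pred):
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - rss / tss

# brute force: fit one polynomial per degree and score each
scores = {}
for degree in range(1, 6):
    coeffs = np.polyfit(x, y, degree)
    scores[degree] = r2(y, np.polyval(coeffs, x))
```

Note that on the training data alone the score can only increase with the degree, so in practice you would compare the degrees on a held-out test set to avoid simply picking the most flexible polynomial.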

- One very in-depth explanation of the mathematics by Martina Bremer from Cornell University – recommended if you are somewhat fluent in Linear Algebra.
- Thumbnail/GIF credits to assessingpsyche.

Computing the inverse of a matrix is not exactly an easy task if you have not yet been introduced to Gaussian Elimination. The inverse of a matrix is used in a large number of algorithms, one of the simplest being Linear Regression.

There are two steps to inverting a matrix:

- Checking if the matrix is invertible by finding the Determinant
- Inverting the matrix by a Gaussian Elimination variant Gauss-Jordan

The concept of rank in linear algebra relates to dimensionality. All matrices are of some size $m \times n$, where $m$ is the number of rows and $n$ the number of columns.

- Rank $1$: When the columns of a matrix span a line, it has rank $1$.
- Rank $2$: When the columns of a matrix span a plane, it has rank $2$.
- A matrix is said to have **Full Rank** if and only if no column is a linear combination of the other columns – that is, the rank equals the number of columns $n$.

Any given matrix can only have as high a rank as the number of columns $n$ for that matrix. But a matrix with $n=3$ might only be of rank $2$, because some transformation might squish the column vectors onto a plane – and the rank could even be decreased to rank $1$, if a transformation squishes the vectors onto a straight line.

Let's examine what this means.

We define three vectors $\vec{v}$, $\vec{u}$, and $\vec{w}$, and we combine them as column vectors in a matrix $A$.

$$
\vec{v} =
\begin{bmatrix}
2\\
0\\
1
\end{bmatrix}
,\quad
\vec{u} =
\begin{bmatrix}
1\\
2\\
1
\end{bmatrix}
,\quad
\vec{w} =
\begin{bmatrix}
1\\
-2\\
0
\end{bmatrix}
,\quad
A =
\begin{bmatrix}
2 &1 &1 \\
0 &2 &-2 \\
1 &1 &0
\end{bmatrix}
$$

If we perform vector addition on $\vec{u}$ and $\vec{w}$, then the result is the vector $\vec{v}$. This means that $\vec{v}$ is a linear combination of $\vec{u}$ and $\vec{w}$, i.e. the matrix $A$ does not have full rank.

$$
\vec{u} + \vec{w} =
\begin{bmatrix}
1+1\\
2+(-2)\\
1+0
\end{bmatrix}
=
\begin{bmatrix}
2\\
0\\
1
\end{bmatrix}
$$

Similarly, if you subtract $\vec{u}$ from $\vec{v}$, you get $\vec{w}$ – another way of seeing that the columns are linearly dependent.

The geometrical interpretation of linear combinations is that we *lose information* if we were to use that specific transformation associated with the matrix $A$, because the resulting matrix of the transformation no longer spans the full three dimensions of space.

The above matrix $A$ is of rank $2$, and as we will see later on, we can compute the rank of a matrix by trying to reduce the matrix to *reduced row echelon form*. But for now, we can determine if a matrix can be inverted by figuring out the determinant.

When the determinant of a matrix is zero, the matrix does not have full rank, meaning that we cannot invert it. There are 23 other equivalent properties that you can use to check whether a matrix is invertible.

For a 3x3 matrix, the following is the formula:

$$
\begin{align}
det(A)
&=
det\left(\begin{bmatrix}
a&b&c\\
d&e&f\\
g&h&i
\end{bmatrix}\right)\\
&=aei + bfg + cdh - ceg - bdi - afh
\end{align}
$$

So let's plug in the numbers and find the solution.

$$
\begin{align}
det(A)
&=
det\left(\begin{bmatrix}
2 &1 &1 \\
0 &2 &-2 \\
1 &1 &0
\end{bmatrix}\right)\\
&=2*2*0 + 1*(-2)*1 + 1*0*1 - 1*2*1 - 1*0*0 - 2*(-2)*1\\
&=0
\end{align}
$$

There we have it: the matrix $A$ is non-invertible, since its determinant is zero. In other words, we cannot find the inverse of $A$, but if we instead had another matrix $B$ where $det(B) \ne 0$, we would be able to find the inverse.

Note that computing the determinant is only possible for square matrices, i.e. the matrices where the number of rows $m$ is equal to the number of columns $n$.
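We can sanity-check both the determinant and the rank of $A$ with NumPy:

```python
import numpy as np

A = np.array([[2, 1, 1],
              [0, 2, -2],
              [1, 1, 0]])

det = np.linalg.det(A)           # 0 up to floating point error
rank = np.linalg.matrix_rank(A)  # 2, so A does not have full rank
```

This agrees with the hand calculation: the determinant is zero, and since $\vec{v}$ is a linear combination of $\vec{u}$ and $\vec{w}$, the rank is $2$.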

Once we have checked if a matrix is invertible, and it turns out that it is invertible, then we can go ahead and use Gaussian Elimination to invert the matrix. Specifically, the method is called Gauss-Jordan elimination, which is a variant of the Gaussian Elimination algorithm.

The following matrix $B$ is invertible since the determinant of the matrix is non-zero. We will use the matrix as an example.

$$
B =
\begin{bmatrix}
1 &5 &-1 \\
2 &2 &-2 \\
-1 &4 &3
\end{bmatrix}
$$

The way we initialize the algorithm is by writing the matrix $B$ on the left side, separated by a vertical line from the *identity matrix* of the same size, denoted $I$. The identity matrix has 1's on the main diagonal, from the top left to the bottom right, and 0's in all other entries.

$$
[B|I] =\left[
\begin{array}{ccc|ccc}
1 &5 &-1 &1 &0 &0\\
2 &2 &-2 &0 &1 &0\\
-1 &4 &3 &0 &0 &1
\end{array}
\right]
$$

When performing operations, we perform them on both the matrix $B$ and $I$. After many operations, matrix $B$ will be reduced to the identity matrix, while the right-hand side, which started as the identity matrix $I$, will have become the inverse of matrix $B$.
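Before doing the elimination by hand, we can peek at the answer with NumPy, purely as a sanity check:

```python
import numpy as np

B = np.array([[1.0, 5.0, -1.0],
              [2.0, 2.0, -2.0],
              [-1.0, 4.0, 3.0]])

B_inv = np.linalg.inv(B)

# multiplying a matrix by its inverse gives the identity matrix
identity = B @ B_inv
```

If the hand calculation is done correctly, the right-hand side of $[B|I]$ at the end will match `B_inv`.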

The goal of Gauss-Jordan elimination is to convert a matrix to *reduced row echelon form*.

Let's explore what this means for a minute.

For a matrix to be in reduced row echelon form, it must satisfy the following conditions:

- All entries in a row must be $0$'s up until the first occurrence of the number $1$.
- The first $1$ in a row is always to the right of the first $1$ in the row above.
- All rows that only consist of $0$'s are placed below rows that do not.
- The first $1$ in a row only contains $0$'s in the same entry in all the rows above the $1$. In other words, the first $1$ in a row only contains $0$'s in that same column for all the rows above the $1$.

These conditions imply that below and to the left of the first $1$ in any given row, all entries are $0$'s.

Here are examples of matrices that ARE in reduced row echelon form:

$$
A =
\begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
,\quad
B =
\begin{bmatrix}
1 & -2 & 0 & 0\\
0 & 0 & 1 & 0\\
0 & 0 & 0 & 1
\end{bmatrix}
,\quad
C =
\begin{bmatrix}
1 & 0 & -2 & 0 & 3 \\
0 & 1 & 3 & 0 & 5\\
0 & 0 & 0 & 1 & 6
\end{bmatrix}
$$

Here are examples of matrices that ARE NOT in reduced row echelon form:

$$
A =
\begin{bmatrix}
2 & 1 & 0 \\
0 & 0 & 1 \\
0 & 0 & 0
\end{bmatrix}
,\quad
B =
\begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 1 & 0
\end{bmatrix}
,\quad
C =
\begin{bmatrix}
1 & 0 & 0 \\
0 & 0 & 0 \\
0 & 1 & 0
\end{bmatrix}
,\quad
D =
\begin{bmatrix}
1 & 7 & 0 \\
0 & 1 & -2 \\
0 & 0 & 1
\end{bmatrix}
$$

Note that a matrix in reduced row echelon form is also in row echelon form.
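If you want to check your hand calculations, SymPy can compute the reduced row echelon form directly. As a sketch, here is matrix $D$ from the non-examples above being reduced (it is invertible, so its RREF is the identity matrix):

```python
from sympy import Matrix

D = Matrix([[1, 7, 0],
            [0, 1, -2],
            [0, 0, 1]])

# rref() returns the reduced matrix and the pivot column indices
D_rref, pivots = D.rref()
```

Since $D$ is upper triangular with non-zero diagonal entries, every column is a pivot column and the reduction ends at the identity.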

The following are the operations that we can perform during this elimination process, to convert the matrix to reduced row echelon form.

Generally, the accepted notation for performing row operations looks like the following:

$$R_{\text{some row number x}} \rightarrow R_{\text{some other row number y}} $$

Which means that $R_x$ becomes equal to the row $R_y$. But it's not quite that simple, because we can use three basic row operations that can quickly become confusing.

- Row Addition: We can add one row to another row, e.g. $R_3 \rightarrow R_2 + R_3$. Note that this also means we can do subtraction, e.g. $R_3 \rightarrow -R_2 + R_3$.
- Row Multiplication: We can multiply a row by a non-zero constant, possibly combined with row addition, e.g. $R_3 \rightarrow 2R_2 - \frac{1}{4}R_1 + R_3$. Note that this also means we can perform division, since dividing by a number is multiplying by its reciprocal.
- Row Interchanging: We can swap rows in a matrix, so they take each other's place, e.g. $R_3 \rightarrow R_2$ and $R_2 \rightarrow R_3$.

You can even think of very simple row multiplications or divisions, where you divide a row by $2$ or multiply a row by $3$.

When performing these matrix operations by hand, it is important to keep it clean and very organized, else you will quickly lose sight of what you were doing. All the arithmetics can be hard to keep track of.

We finally have it all together, we understand what reduced row echelon form is, and we are familiar with the operations we can perform to convert a matrix to reduced row echelon form.

Now, it is time to look at an example and apply what we just learned. That is, apply the Gauss-Jordan algorithm to invert a matrix. Remember that the goal is to convert the matrix to reduced row echelon form, which means the matrix must satisfy certain conditions listed in the Reduced Row Echelon Form section.

$$
[B|I] =\left[
\begin{array}{ccc|ccc}
1 &5 &-1 &1 &0 &0\\
2 &2 &-2 &0 &1 &0\\
-1 &4 &3 &0 &0 &1
\end{array}
\right]
$$

The goal of the first few operations is to convert the first column so that there are only $0$'s below the first $1$.

- Firstly, we add $-2R_1$ to $R_2$
- Secondly, we add $R_1$ to $R_3$

$$
\left[
\begin{array}{ccc|ccc}
1 &5 &-1 &1 &0 &0\\
2 &2 &-2 &0 &1 &0\\
-1 &4 &3 &0 &0 &1
\end{array}
\right]
R_2 \rightarrow -2R_1 + R_2
\left[
\begin{array}{ccc|ccc}
1 &5 &-1 &1 &0 &0\\
0 &-8 &0 &-2 &1 &0\\
-1 &4 &3 &0 &0 &1
\end{array}
\right]
$$

$$
\left[
\begin{array}{ccc|ccc}
1 &5 &-1 &1 &0 &0\\
0 &-8 &0 &-2 &1 &0\\
-1 &4 &3 &0 &0 &1
\end{array}
\right]
R_3 \rightarrow R_1 + R_3
\left[
\begin{array}{ccc|ccc}
1 &5 &-1 &1 &0 &0\\
0 &-8 &0 &-2 &1 &0\\
0 &9 &2 &1 &0 &1
\end{array}
\right]
$$

For the second row, we can see that there is no leading $1$, so we have to change that. Secondly, we also need there to be only $0$'s below the leading $1$ of the second row.

- Firstly, we divide $R_2$ by $-8$
- Secondly, we add $-9R_2$ to $R_3$

$$
\left[
\begin{array}{ccc|ccc}
1 &5 &-1 &1 &0 &0\\
0 &-8 &0 &-2 &1 &0\\
0 &9 &2 &1 &0 &1
\end{array}
\right]
R_2 \rightarrow \frac{1}{-8}R_2
\left[
\begin{array}{ccc|ccc}
1 &5 &-1 &1 &0 &0\\
0 &1 &0 &0.25 &-0.125 &0\\
0 &9 &2 &1 &0 &1
\end{array}
\right]
$$

$$
\left[
\begin{array}{ccc|ccc}
1 &5 &-1 &1 &0 &0\\
0 &1 &0 &0.25 &-0.125 &0\\
0 &9 &2 &1 &0 &1
\end{array}
\right]
R_3 \rightarrow -9R_2 + R_3
\left[
\begin{array}{ccc|ccc}
1 &5 &-1 &1 &0 &0\\
0 &1 &0 &0.25 &-0.125 &0\\
0 &0 &2 &-1.25 &1.125 &1
\end{array}
\right]
$$

Next, we take care of the third row, which also has no leading $1$ — its pivot entry is $2$. We fix this by dividing $R_3$ by $2$.

$$
\left[
\begin{array}{ccc|ccc}
1 &5 &-1 &1 &0 &0\\
0 &1 &0 &0.25 &-0.125 &0\\
0 &0 &2 &-1.25 &1.125 &1
\end{array}
\right]
R_3 \rightarrow \frac{1}{2}R_3
\left[
\begin{array}{ccc|ccc}
1 &5 &-1 &1 &0 &0\\
0 &1 &0 &0.25 &-0.125 &0\\
0 &0 &1 &-0.625 & 0.5625 &0.5
\end{array}
\right]
$$

Let me remind you of rule 4 from the Reduced Row Echelon Form section: the first $1$ in a row must have only $0$'s above it. In other words, the first $1$ in a row only has $0$'s in that same column for all the rows above it.

This means we need to get rid of the $5$ and $-1$ in the first row. In the following steps, we also convert some of the entries to fractions instead of decimals to better fit them on the screen.

- Firstly, we add $R_3$ to $R_1$
- Secondly, we add $-5R_2$ to $R_1$

$$
\left[
\begin{array}{ccc|ccc}
1 &5 &-1 &1 &0 &0\\
0 &1 &0 &0.25 &-0.125 &0\\
0 &0 &1 &-0.625 & 0.5625 &0.5
\end{array}
\right]
R_1 \rightarrow R_3 + R_1
\left[
\begin{array}{ccc|ccc}
1 &5 &0 &\frac{3}{8} &\frac{9}{16} &\frac{1}{2}\\
0 &1 &0 &\frac{1}{4} &-\frac{1}{8} &0\\
0 &0 &1 &-\frac{5}{8} & \frac{9}{16} &\frac{1}{2}
\end{array}
\right]
$$

$$
\left[
\begin{array}{ccc|ccc}
1 &5 &0 &\frac{3}{8} &\frac{9}{16} &\frac{1}{2}\\
0 &1 &0 &\frac{1}{4} &-\frac{1}{8} &0\\
0 &0 &1 &-\frac{5}{8} & \frac{9}{16} &\frac{1}{2}
\end{array}
\right]
R_1 \rightarrow -5R_2 + R_1
\left[
\begin{array}{ccc|ccc}
1 &0 &0 &-\frac{7}{8} &1\frac{3}{16} &\frac{1}{2}\\
0 &1 &0 &\frac{1}{4} &-\frac{1}{8} &0\\
0 &0 &1 &-\frac{5}{8} & \frac{9}{16} &\frac{1}{2}
\end{array}
\right]
$$

And that's it, we have converted matrix $B$ to reduced row echelon form, which means the identity matrix $I$ has been converted to the inverse of $B$.

$$
B^{-1} =
\begin{bmatrix}
-\frac{7}{8} &1\frac{3}{16} &\frac{1}{2}\\
\frac{1}{4} &-\frac{1}{8} &0\\
-\frac{5}{8} & \frac{9}{16} &\frac{1}{2}
\end{bmatrix}
=
\begin{bmatrix}
-0.875 & 1.1875 &0.5\\
0.25 &-0.125 &0\\
-0.625 & 0.5625 &0.5
\end{bmatrix}
$$

With very basic programming skills, you can use Python with the package NumPy to perform the inverse operation quite easily.

```
import numpy as np

B = np.array([[1, 5, -1],
              [2, 2, -2],
              [-1, 4, 3]])

# invert B using NumPy's linear algebra module
inv = np.linalg.inv(B)
print(inv)
```

This yields the same result we got from our row operations, and it can be a good tool to check your results when doing elimination by hand.

```
[[-0.875   1.1875  0.5   ]
 [ 0.25   -0.125   0.    ]
 [-0.625   0.5625  0.5   ]]
```
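Another quick sanity check: multiplying $B$ by the computed inverse should give back the identity matrix, since $B B^{-1} = I$ by definition.

```python
import numpy as np

B = np.array([[1., 5., -1.],
              [2., 2., -2.],
              [-1., 4., 3.]])

B_inv = np.linalg.inv(B)

# a matrix times its inverse should give the 3x3 identity matrix
identity = B @ B_inv
```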

A good video to understand the general concept of ranks, determinants, and inverse matrices:

Another example of how to find the inverse of a matrix by Khan Academy:

*This article was first published by IBM Developer at developer.ibm.com, but authored by Casper Hansen. Here is the Direct link.*

**Disclaimer:** I am not a medical, radiology or epidemiology professional. This article was an experiment, from an engineering and data scientist perspective, and should be regarded as such.

To apply Deep Learning for COVID-19, we need a good dataset – one with lots of samples, edge cases, metadata, and different looking images. We want our model to generalize to the data, such that it can make accurate predictions on new, unseen data. All the work for this article is available on GitHub.

Unfortunately, not much data is available, but there are already posts on LinkedIn/Medium claiming >90% or in some cases 100% accuracy on detecting COVID-19 cases. However, these posts usually contain mistakes that go unacknowledged.

We have actively tried to mitigate some of the usual mistakes, by thinking about and coming up with solutions to the following problems:

- You might not believe the following is an issue, but it is a typical fallacy for people new to machine learning and data science. The first rule of machine learning: never, ever test your model's performance with the same data you used to train it. Not using a testing dataset – instead measuring the accuracy of the model on the training dataset – does not give an accurate representation of how well the model generalizes to new, unseen data.
- Not using computer vision techniques to achieve better generalization – augmentation being an absolute necessity, especially in our case where there are very few samples for our model to learn from.
- Not thinking about what the model learns, i.e. will our model actually learn the pattern of what COVID-19 looks like on an X-Ray image, or is it likely that there is some other noisy pattern in our dataset, that it will learn instead?

- Not using the correct metrics. If you use the accuracy metric on a heavily imbalanced dataset, then your model might look like it's performing well in the general case, even though it's performing poorly on COVID-19 cases. 95% accuracy is not that impressive if your accuracy on COVID-19 cases is 30%.

One amusing story from early machine learning is called Detecting Tanks: *"photos of camouflaged tanks had been taken on cloudy days, while photos of plain forest had been taken on sunny days."*
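The first rule above – holding out a test set – is a one-liner with scikit-learn. A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 fake samples with 2 features each, and binary labels
X = np.arange(100).reshape(50, 2)
y = np.array([0] * 25 + [1] * 25)

# hold out 20% of the data that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

Accuracy is then only ever measured on `X_test`, never on `X_train`.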

Here is a list of different data sources that I have accumulated over the period of the coronavirus spreading. Currently, this is the best we are going to get from the internet, because the data being collected in many countries and hospitals are classified information. Even if it was not classified information, you still need consent from each patient.

- GitHub: COVID-19 Chest X-Ray and CT images. This dataset is continually updated, as more cases appear.
- Kaggle 1: Chest X-Ray Images, positive and negative cases of Pneumonia.
- Kaggle 2: Chest X-Ray Images, positive and negative cases of Pneumonia.

All the data available on the internet has not been subjected to the same preprocessing, and the images, as explained below, clearly differ in the amount of black bars. In most of our positive COVID-19 data, the X-ray takes up nearly the whole frame, with little to no black bars on the sides (except for a few). The dataset with negative COVID-19 cases, however, mostly has black bars on the sides of each image.

This becomes a problem, because our model later on might learn that it just needs to look at the black bars on the side, to know whether or not a new sample is a positive or negative case of COVID-19.

After manual inspection of the dataset, it has become apparent that almost all the negative cases have these black bars, while approximately only 10-20% of the positive cases have black bars. Wouldn't it be quite a coincidence, if our accuracy of predicting positive/negative later on turns out to be around the 80-90% mark?

The data problem from earlier is clearly one we *need* to solve. Can we somehow detect when there are black top, black bottom, or black side bars? And can we somehow automatically correct this?

Here is the key to solving this:

- A simple metric: are there a number *x* or more black pixels in a given row/column?
- Yes? Remove that row/column. Repeat over all rows and columns.
- Done checking for black pixels? Then we crop the image.
- Next image and repeat.
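The steps above can be sketched on a toy grayscale "image" (hypothetical values; the real implementation on RGB X-rays appears later in this article):

```python
import numpy as np

# toy 6x6 grayscale image: black (0) bars on the outer rows/columns,
# and a bright region in the middle
img = np.zeros((6, 6), dtype=int)
img[1:5, 1:5] = 200

threshold = 40  # pixel values below this count as black

# mark a row/column for removal if more than half of its pixels are black
rows_remove = [i for i in range(img.shape[0])
               if np.sum(img[i] < threshold) > img.shape[1] / 2]
cols_remove = [j for j in range(img.shape[1])
               if np.sum(img[:, j] < threshold) > img.shape[0] / 2]

# crop by deleting the marked rows and columns
cropped = np.delete(np.delete(img, rows_remove, axis=0), cols_remove, axis=1)
```

Only the outer black rows and columns are removed, leaving the bright 4x4 center.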

OK, not so fast. There is a new problem: we need to keep the same aspect ratio, and we can't upscale or downscale if the aspect ratio is not the same. This is because the existing architectures need an input size of 224x224, which is an aspect ratio of 1x1.

In other words – can we check if we are removing as many columns as we are removing rows? Can we overcompensate, by removing specific rows, if there are too many columns with black pixels, but little to no rows of black pixels?

The image below demonstrates the input and output of removing the black bars, with one example where it was particularly successful. Quite powerful, but we are still not adhering to the aspect ratio of 1x1.

The most effective and easy solution I found to fixing the aspect ratio, was calculating the difference between the number of rows removed versus the number of columns removed. Divide the difference by two and remove rows from top and bottom, or remove columns from left and right.
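As a hypothetical numeric example of that arithmetic, say we removed 31 columns but only 10 rows:

```python
rows_removed = 10
cols_removed = 31

# more columns removed than rows -> the cropped image is now too tall
diff = cols_removed - rows_removed   # 21 extra rows need to go
slice_size = diff // 2               # remove 10 rows from the top and 10 from the bottom
leftover = diff - 2 * slice_size     # 1 leftover row when the difference is odd
```

The leftover row (or column, in the opposite case) is handled by removing one extra row from the bottom or one extra column from the right.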

The last image is saved as a 224x224 image, because we can scale it down now that the aspect ratio is 1x1. This saves space and some time when loading the image again in another script.

Here is the thing – it takes about 14 seconds per image, to find the rows and columns for removal. We use a naive approach, which could probably be done more efficiently, but nevertheless, it works for this experiment. We had 8851 images in total in the Kaggle 1 dataset, which turned out to be 39 hours of preprocessing. All data is released and available in the GitHub repository for this article.

We ran the code on both the Kaggle 1 and GitHub datasets, and the result is that we have all these images saved in a 224x224 image format. The below code is only for running it on the Kaggle 1 dataset, but you can substitute the first part of the code to load any dataset of images.

```
import pandas as pd
import matplotlib.pyplot as plt
import cv2, time, os
import numpy as np
df = pd.read_csv('kaggle-pneumonia-jpg/stage_2_detailed_class_info.csv')
image_folder = 'kaggle-pneumonia-jpg/stage_2_train_images_jpg/'
df = df.loc[df['class'] == 'Normal']
```

Here is how I loaded the COVID-19 dataset. Note that the rest of the code is the same, this is just loading a different dataset.

```
columns = ['finding', 'modality', 'filename', 'view']
# only use COVID-19 samples done with X-ray and PA view
df = pd.read_csv('metadata.csv', usecols=columns)
df = df.loc[df['finding'] == 'COVID-19']
df = df.loc[df['modality'] == 'X-ray']
df = df.loc[df['view'] == 'PA']
```

After loading the dataframe, we load the images, labels and filenames into arrays.

```
images = []
labels = []
filenames = []

for index, row in df.iterrows():
    image = cv2.imread(image_folder + row.patientId + '.jpg')
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    images.append(image)
    labels.append(row['class'])
    filenames.append(row.patientId)

images = np.array(images)
```

Next, we have a function `get_black_pixel_indices()` that loops through rows and columns of pixels – you can imagine an image as a table with rows and columns of pixels, where each cell holds a pixel value.

We count the number of black pixels, defined as pixels whose average value is less than 40. A pixel value consists of one value for each color channel, e.g. Red, Green, Blue (RGB). For an array `[39, 67, 45]`, the average pixel value would be the average of those numbers, which is 50.33.
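That threshold check for a single pixel looks like this (a tiny illustration using the example values):

```python
import numpy as np

pixel = np.array([39, 67, 45])   # one pixel's R, G, B channel values
avg = np.average(pixel)          # (39 + 67 + 45) / 3 = 50.33...

# the pixel only counts as black if the channel average is below 40
is_black = avg < 40
```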

When we are done counting the number of black pixels for a given row and column, we check if there are more black pixels than non black pixels, and if that is the case, we choose to remove that row or column.

```
def get_black_pixel_indices(img):
    rows_remove = []
    cols_remove = []
    for j, counter in enumerate(range(img.shape[0])):
        row = img[counter]
        # shift column dim to first dim, row dim to second dim
        # acts like a 90 degree counterclock-wise rotation of image
        col = img.transpose(1, 0, 2)[counter]

        row_black_pixel_count = 0
        col_black_pixel_count = 0
        for selector in range(img.shape[1]):
            row_pixel = row[selector]
            col_pixel = col[selector]
            if np.average(row_pixel) < 40:
                row_black_pixel_count += 1
            if np.average(col_pixel) < 40:
                col_black_pixel_count += 1

        if row_black_pixel_count > len(row)/2:
            rows_remove.append(j)
        if col_black_pixel_count > len(col)/2:
            cols_remove.append(j)

    return rows_remove, cols_remove
```

Before we remove the rows and columns with a higher black pixel count than non-black pixel count, we have to check for rows and columns that, if removed, would obstruct the image. The simplest approach we found was that no rows or columns with indices between 200 and 800 should be removed. After filtering out these obstructing indices, we remove the remaining rows and columns.

```
def remove_obstructing_indices(rows_remove, cols_remove):
    # keep only indices outside the central 200-800 band, so we never
    # cut into the middle of the image (filtering into a new list also
    # avoids the pitfall of deleting from a list while iterating over it)
    rows_remove = [v for v in rows_remove if not 200 <= v <= 800]
    cols_remove = [v for v in cols_remove if not 200 <= v <= 800]
    return rows_remove, cols_remove

def remove_black_pixels(img, rows_remove, cols_remove):
    img = np.delete(img, rows_remove, axis=0)
    img = np.delete(img, cols_remove, axis=1)
    return img
```

The last thing for preprocessing is adjusting the aspect ratio. Quite simply, we know that, in most cases, a human torso is taller than it is wide, so we choose to arbitrarily cut the image from the top and bottom. There are also cases where the torso is wider, but those were rare when inspecting the dataset.

We start by calculating the column to row difference and row to column difference, and based on that, we find how many rows or columns we need to remove, in order to fix the aspect ratio. We found that we can divide the difference by two, then remove half of the difference from either the top and bottom, or left and right. After that, we might have one leftover row or column if the difference is an odd number, so we compensate by either removing a row from the bottom, or a column from the right.

```
def adjust_aspect_ratio(img, rows_remove, cols_remove):
    col_row_diff = len(cols_remove) - len(rows_remove)
    row_col_diff = len(rows_remove) - len(cols_remove)

    if col_row_diff > 0:
        # more columns removed than rows: trim rows from top and bottom
        slice_size = int(col_row_diff/2)
        img = img[:-slice_size]
        img = img[slice_size:]
        if img.shape[0] != img.shape[1]:
            img = img[:-1]
    elif row_col_diff > 0:
        # more rows removed than columns: trim columns from left and right
        slice_size = int(row_col_diff/2)
        img = img[:, :-slice_size, :]
        img = img[:, slice_size:, :]
        if img.shape[0] != img.shape[1]:
            img = img[:, :-1, :]

    if img.shape[0] == img.shape[1]:
        return img, True
    else:
        return img, False
```

Finally, we run the loop that calls the four functions previously described. This results in images with the black bars removed from the left, right, top and bottom. We downscale each image to 224x224, then save it, along with the label and filename, in a .csv file that accumulates the paths and filenames of all our preprocessed images.

```
save_folder = 'normal_dataset/'
start = time.time()

for i, img in enumerate(images):
    rows_remove, cols_remove = get_black_pixel_indices(img)
    rows_remove, cols_remove = remove_obstructing_indices(rows_remove, cols_remove)
    new_img = remove_black_pixels(img, rows_remove, cols_remove)
    adj_img, shape_match = adjust_aspect_ratio(new_img, rows_remove, cols_remove)

    if shape_match:
        resized_image = cv2.resize(adj_img, (224, 224))
        label = labels[i]
        name = filenames[i] + '.jpg'
        cv2.imwrite(save_folder + name, resized_image)

        new_df = pd.DataFrame({'filename': [name], 'finding': [label]})
        if i == 0:
            new_df.to_csv('normal_xray_dataset.csv')
        else:
            new_df.to_csv('normal_xray_dataset.csv', mode='a', header=False)

    print('image number {0}, time spent {1:.2f}s'.format(i+1, time.time() - start))
```

For a final overview of another 40 images of negative and positive cases, we can clearly see that our preprocessing approach helped.

To make the most of what we have (99 images), we are going to have to use augmentation of our training data. Augmentation is a manipulation of an image that will look new to the CNN – common examples of augmentation are rotations, shears and flips of images.

Keep in mind that we use augmentation in order to generalize our model better and prevent overfitting. Essentially, each time we use one form of augmentation, we add another similar, but slightly different, image to the training dataset. Note that this is also a demonstration of how you could augment the images, and not the augmentation we will use later.

We use TensorFlow's `image.adjust_contrast()` to adjust the contrast of all our images.

```
X_train_contrast = []
for x in X_train:
    contrast = tf.image.adjust_contrast(x, 2)
    X_train_contrast.append(contrast.numpy())

plot_images(X_train_contrast, 'Adjusted Contrast')
```

Which results in the following updated images:

We use TensorFlow's `image.adjust_saturation()` to adjust the saturation of all our images.

```
X_train_saturation = []
for x in X_train:
    saturation = tf.image.adjust_saturation(x, 3)
    X_train_saturation.append(saturation.numpy())

plot_images(X_train_saturation, 'Adjusted Saturation')
```

Which results in the following updated images:

We use TensorFlow's `image.flip_left_right()` to flip all of our images from left to right.

```
X_train_flipped = []
for x in X_train:
    flipped = tf.image.flip_left_right(x)
    X_train_flipped.append(flipped.numpy())

plot_images(X_train_flipped, 'Flipped Left Right')
```

Which results in the following updated images:

We use TensorFlow's `image.flip_up_down()` to flip all of our images from top to bottom.

```
X_train_flipped_up_down = []
for x in X_train:
    flipped = tf.image.flip_up_down(x)
    X_train_flipped_up_down.append(flipped.numpy())

plot_images(X_train_flipped_up_down, 'Flipped Up Down')
```

Which results in the following updated images:

We use TensorFlow's `image.flip_left_right()` to flip all of our images that are already flipped from top to bottom, which results in images flipped first from top to bottom, then from left to right.

```
X_train_flipped_up_down_left_right = []
for x in X_train_flipped_up_down:
    flipped = tf.image.flip_left_right(x)
    X_train_flipped_up_down_left_right.append(flipped.numpy())

plot_images(X_train_flipped_up_down_left_right, 'Flipped Up Down Left Right')
```

Which results in the following updated images:

TensorFlow has an extension called TensorFlow Addons, which can be installed with `pip install tensorflow-addons`. The package has a function `image.rotate()` that allows us to rotate any image by an arbitrary number of degrees. The function takes radians as input, so we use the `radians()` function from Python's `math` module.

```
import tensorflow_addons as tfa
from math import radians

X_train_rot_45_deg = []
X_train_rot_135_deg = []
X_train_rot_225_deg = []
X_train_rot_315_deg = []

for x in X_train:
    deg_45 = tfa.image.rotate(x, radians(45))
    deg_135 = tfa.image.rotate(x, radians(135))
    deg_225 = tfa.image.rotate(x, radians(225))
    deg_315 = tfa.image.rotate(x, radians(315))
    X_train_rot_45_deg.append(deg_45.numpy())
    X_train_rot_135_deg.append(deg_135.numpy())
    X_train_rot_225_deg.append(deg_225.numpy())
    X_train_rot_315_deg.append(deg_315.numpy())

plot_images(X_train_rot_45_deg, 'Rotated 45 Degrees')
plot_images(X_train_rot_135_deg, 'Rotated 135 Degrees')
plot_images(X_train_rot_225_deg, 'Rotated 225 Degrees')
plot_images(X_train_rot_315_deg, 'Rotated 315 Degrees')
```

Which results in the following updated images:

VGG and ResNet models are particularly popular at the moment, since they are performing the best, according to some of the papers released on COVID-19:

- Hemdan, E., Shouman M., & Karar M. (2020). COVIDX-Net: A Framework of Deep Learning Classifiers to Diagnose COVID-19 in X-Ray Images. arXiv:2003.11055.
- Wang, L., & Wong, A. (2020). COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest Radiography Images. arXiv:2003.09871.

We have made an overview of two architectures, VGG19 and ResNet50, using an adapted style guide from this article. The yellow R stands for ReLU, while the yellow S stands for Softmax. Brackets mean repeated blocks of layers. Please note that we abstracted batch normalization away.

We are going to use transfer learning with the VGG19 model, initializing the weights to those from ImageNet, although it is a very different dataset from ours. ImageNet consists of images of random objects, while our dataset consists only of X-ray images. The idea is to transfer the hidden layers that have learned to find patterns – colors and geometrical forms – in previously seen objects, and reuse them on another dataset.

I ran a short experiment where I tried not using transfer learning, and it resulted in an accuracy and F1 around 0.5-0.6. I chose to go ahead with the approach that performed better, as we will see in a moment. If you want to experiment, you need to define a VGG19 model with `weights=None`.

For all the imports for loading data and modeling, we use various packages; pandas, pyplot, numpy, cv2, tensorflow and scikit-learn.

```
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import cv2, time
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import VGG19
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Dense
```

We start by loading both datasets (note that `dataset/` in the path variables is the name of the folder in the GitHub repository). We put a constraint on the number of normal cases we use for modeling, since using all of the normal images – or even 20% of them – leads to overfitting to those samples.

```
covid_path = 'dataset/covid_dataset.csv'
covid_image_path = 'dataset/covid_adjusted/'
normal_path = 'dataset/normal_xray_dataset.csv'
normal_image_path = 'dataset/normal_dataset/'
covid_df = pd.read_csv(covid_path, usecols=['filename', 'finding'])
normal_df = pd.read_csv(normal_path, usecols=['filename', 'finding'])
normal_df = normal_df.head(99)
covid_df.head()
```

The following code is used to load the images into Python arrays, and lastly, we load the images into NumPy arrays and normalize the images, so that all the pixel values are between zero and one.

```
covid_images = []
covid_labels = []

for index, row in covid_df.iterrows():
    filename = row['filename']
    label = row['finding']
    path = covid_image_path + filename
    image = cv2.imread(path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    covid_images.append(image)
    covid_labels.append(label)

normal_images = []
normal_labels = []

for index, row in normal_df.iterrows():
    filename = row['filename']
    label = row['finding']
    path = normal_image_path + filename
    image = cv2.imread(path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    normal_images.append(image)
    normal_labels.append(label)

# normalize to interval of [0,1]
covid_images = np.array(covid_images) / 255
normal_images = np.array(normal_images) / 255
```

We want to make sure that both the normal and infected cases are split correctly, and since we loaded them separately, we will also split them separately. If we later choose to load more normal cases than infected cases, thus loading an imbalanced dataset, we will still get a good split in terms of the number of infected cases in the training and testing datasets.

After splitting the dataset, we concatenate them into input *X* and output *y*, along with a label for training and testing data. We also convert the labels *COVID-19* and *Normal* into numbers, using `LabelBinarizer()` and `to_categorical()`.

```
# split into training and testing
covid_x_train, covid_x_test, covid_y_train, covid_y_test = train_test_split(
    covid_images, covid_labels, test_size=0.2)
normal_x_train, normal_x_test, normal_y_train, normal_y_test = train_test_split(
    normal_images, normal_labels, test_size=0.2)

X_train = np.concatenate((normal_x_train, covid_x_train), axis=0)
X_test = np.concatenate((normal_x_test, covid_x_test), axis=0)
y_train = np.concatenate((normal_y_train, covid_y_train), axis=0)
y_test = np.concatenate((normal_y_test, covid_y_test), axis=0)

# make labels into categories - either 0 or 1
y_train = LabelBinarizer().fit_transform(y_train)
y_train = to_categorical(y_train)
y_test = LabelBinarizer().fit_transform(y_test)
y_test = to_categorical(y_test)
```
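To see concretely what `LabelBinarizer()` does with our two class names, here is a standalone sketch with a handful of hypothetical labels:

```python
from sklearn.preprocessing import LabelBinarizer

labels = ['COVID-19', 'Normal', 'COVID-19', 'Normal']
binarized = LabelBinarizer().fit_transform(labels)

# classes are sorted alphabetically, so 'COVID-19' maps to 0 and 'Normal' to 1
flat = binarized.ravel().tolist()
```

`to_categorical()` then turns each 0/1 into a one-hot vector of length two, which is what the softmax output layer expects.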

We went with a pre-trained VGG19 model. We have set `include_top=False`, which means there are no fully connected layers at the end of the VGG19 architecture. This enables us to take the outputs from the convolutions, flatten them and pass them into our own output layer.

We also added regularization by using a Dropout layer, because we were strongly overfitting to our training data. We are still overfitting to the training data, but not as much – we went from roughly a 0.97 accuracy to a 0.88 accuracy.

```
vggModel = VGG19(weights="imagenet", include_top=False,
                 input_tensor=Input(shape=(224, 224, 3)))

outputs = vggModel.output
outputs = Flatten(name="flatten")(outputs)
outputs = Dropout(0.5)(outputs)
outputs = Dense(2, activation="softmax")(outputs)

model = Model(inputs=vggModel.input, outputs=outputs)

# freeze the pre-trained convolutional layers
for layer in vggModel.layers:
    layer.trainable = False

model.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)
```

The way we chose to augment our data, was by generating batches of augmentations on each sample. For each sample, we get a new batch of samples from the ImageDataGenerator we defined below, and all these new augmented samples will be used for training, instead of the original sample.

```
train_aug = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True
)
```

We chose to set `rotation_range=20`, which is the degree range for random rotations of the original image. This means Keras will randomly choose an angle of up to 20 degrees and rotate the image by that amount, either clockwise or counterclockwise.
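Conceptually, that random rotation can be thought of as a uniform draw from the degree range (a simplified sketch, not the actual Keras internals):

```python
import random

rotation_range = 20

# roughly how the generator picks an angle for each image:
# a uniform draw from [-rotation_range, rotation_range] degrees
angle = random.uniform(-rotation_range, rotation_range)
```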

Meanwhile, we also have both `width_shift_range=0.2` and `height_shift_range=0.2`, which randomly shift the image by a fraction of its width (i.e. to the left and right) and height (i.e. up and down), as specified.

At last, we also have `horizontal_flip=True`, which randomly flips the image horizontally, i.e. from left to right, as we also saw earlier.

A good visualization of the ImageDataGenerator can be found here.

We have defined our model, augmentation and data. Now all we need to do is start training our VGG19 model. We can train our model with the augmented data by feeding it the batches produced by the `train_aug.flow()` function.

```
history = model.fit(train_aug.flow(X_train, y_train, batch_size=32),
                    steps_per_epoch=len(X_train) / 32,
                    epochs=500)
```

Since we want to see how well our model performs on the different classes that exist, we use the correct metric for measuring performance; namely F1, and not accuracy. Here is why we could use both AUC and F1 (the same explanation applies), explained by David Ernst:

> The ROC AUC [or F1] is sensitive to class imbalance in the sense that when there is a minority class, you typically define this as the positive class and it will have a strong impact on the AUC [or F1] value. This is very much desirable behaviour. Accuracy is for example not sensitive in that way. It can be very high even if the minority class is not well predicted at all.
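This point is easy to demonstrate with scikit-learn on a made-up imbalanced test set:

```python
from sklearn.metrics import accuracy_score, f1_score

# hypothetical imbalanced test set: 2 COVID-19 cases (0), 18 normal cases (1)
y_true = [0] * 2 + [1] * 18

# a useless model that always predicts "normal"
y_pred = [1] * 20

acc = accuracy_score(y_true, y_pred)               # 0.9 -- looks great
f1_covid = f1_score(y_true, y_pred, pos_label=0)   # 0.0 -- it never finds COVID-19
```

Accuracy rewards the majority class, while the per-class F1 exposes that the minority class is never predicted correctly.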

After training our model, we can use it to make predictions on new and unseen data. We use argmax for finding the highest predicted probability, and then we can generate a classification report using scikit-learn.

```
from sklearn.metrics import classification_report
y_pred = model.predict(X_test, batch_size=32)
print(classification_report(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1)))
```

Our LabelBinarizer from earlier made it such that '0' is the COVID-19 class and '1' is the normal class. We can clearly see that the accuracy is 82%, while the f1-score for COVID-19 is 0.84. This is pretty good considering the limited data. Perhaps it even suggests that there is more noise in the images that we did not filter out or correct for.

```
              precision    recall  f1-score   support

           0       0.75      0.95      0.84        19
           1       0.93      0.70      0.80        20

    accuracy                           0.82        39
   macro avg       0.84      0.82      0.82        39
weighted avg       0.84      0.82      0.82        39
```

At last, we can plot the training accuracy and the training loss using pyplot.

```
plt.figure(figsize=(10,10))
plt.style.use('dark_background')
plt.plot(history.history['accuracy'])
plt.plot(history.history['loss'])
plt.title('Model Accuracy & Loss')
plt.ylabel('Value')
plt.xlabel('Epoch')
plt.legend(['Accuracy', 'Loss'])
plt.show()
```

This generates a simple plot that shows how the model's accuracy increased over time on the training dataset, as well as the loss on the training dataset.

Although our approach attempted to mitigate some of the mistakes made in other approaches, there is still a lot of room for improvement. As the f1-score suggests, we most likely still have issues with our data that we can't immediately see ourselves.

Here are some approaches we could use to get a better model and more reliable results.

- Using x-ray images of SARS, MERS, ARDS and other types of infections, instead of just the normal category. This would enable the model to distinguish between different infections.
- More quality data of COVID-19 cases from the posteroanterior (PA) view.
- Data in different stages of COVID-19: infected, no symptoms; infected, mild symptoms; infected, heavy symptoms.
- More advanced types of imaging data, e.g. CT images, which include more information.
- Cross-validation or nested cross-validation for better estimation of the true generalization error of the model.

The results achieved in this article are *NOT* to be interpreted as usable in real world scenarios. We should only consider a Deep Learning + COVID-19 approach, after rigorous testing by engineers, doctors and other professionals, followed by a clinical study.

*This article was first published by IBM Developer at developer.ibm.com, but authored by Casper Hansen. Here is the Direct link.*

Creating complex neural networks with different architectures in Python should be a standard practice for any Machine Learning Engineer and Data Scientist. But a genuine understanding of how a neural network works is equally valuable. This is what we aim to expand on in this article: the very fundamentals of how we can build neural networks, without the help of the frameworks that make it easy for us.

Please open the notebook from GitHub and run the code alongside reading the explanations in this article.

In this specific article, we explore how to make a basic deep neural network, by implementing the forward and backward pass (backpropagation). This requires some specific knowledge on the functionality of neural networks – which I went over in this complete introduction to neural networks.

It's also important to know the fundamentals of linear algebra, to be able to understand why we do certain operations in this article. I have a series of articles here, where you can learn some of the fundamentals. Though, my best recommendation would be watching 3Blue1Brown's brilliant series *Essence of linear algebra*.

We are building a *basic* deep neural network with 4 layers in total: 1 input layer, 2 hidden layers and 1 output layer. All layers will be fully connected.

We are making this neural network because we are trying to classify digits from 0 to 9, using a dataset called MNIST, which consists of 70,000 images that are 28 by 28 pixels. The dataset contains one label for each image, specifying the digit we see in each image. We say that there are 10 classes, since we have 10 labels.

For training the neural network, we will use stochastic gradient descent, which means we put one image through the neural network at a time.

Let's try to define the layers in an exact way. To be able to classify digits, we must end up with the probabilities of an image belonging to a certain class, after running the neural network, because then we can quantify how well our neural network performed.

- **Input layer:** In this layer, we input our dataset, consisting of 28x28 images. We *flatten* these images into one array with $28 \times 28 = 784$ elements. This means our input layer will have **784 nodes**.
- **Hidden layer 1:** In this layer, we have decided to reduce the number of nodes from 784 in the input layer to **128 nodes**. This brings a challenge when we are going forward in the neural network (explained later).
- **Hidden layer 2:** In this layer, we have decided to go with **64 nodes**, from the 128 nodes in the first hidden layer. This is no new challenge, since we already reduced the number in the first layer.
- **Output layer:** In this layer, we are reducing the 64 nodes to a total of **10 nodes**, so that we can evaluate the nodes against the label. This label is received in the form of an array with 10 elements, where one of the elements is 1, while the rest are 0.
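As a quick illustration of the flattening step (a minimal sketch with a dummy array, not part of the network code), a 28x28 image becomes a 784-element array:

```python
import numpy as np

# a dummy 28x28 "image" standing in for an MNIST sample
image = np.zeros((28, 28))

# flatten it into the 784-element array the input layer expects
flattened = image.reshape(-1)

print(flattened.shape)  # (784,)
```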

You might notice that the number of nodes in each layer decreases from 784 nodes, to 128 nodes, to 64 nodes and then to 10 nodes. Gradually shrinking the layers like this is based on empirical observations that it tends to yield better results, striking a balance between overfitting and underfitting. That said, the specific numbers of nodes for this article were chosen fairly arbitrarily, apart from being decreasing to avoid overfitting. In most real-life scenarios, you would want to optimize these parameters by brute force or good guesses – usually by Grid Search or Random Search – but this is outside the scope of this article.

For the whole NumPy part, I specifically wanted to share the imports used. Note that we use other libraries than NumPy to more easily load the dataset, but they are not used for any of the actual neural network.

```
from sklearn.datasets import fetch_openml
from keras.utils.np_utils import to_categorical
import numpy as np
from sklearn.model_selection import train_test_split
import time
```

Now we have to load the dataset and preprocess it, so that we can use it in NumPy. We normalize by dividing all images by 255, so that every pixel value lies between 0 and 1; this removes some of the numerical stability issues with the activation functions later on. We choose one-hot encoded labels, since we can more easily subtract these labels from the output of the neural network. We also load our inputs as flattened arrays of 28 * 28 = 784 elements, since that is what the input layer requires.

```
x, y = fetch_openml('mnist_784', version=1, return_X_y=True)
x = (x/255).astype('float32')
y = to_categorical(y)
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.15, random_state=42)
```

The initialization of weights in the neural network is kind of hard to think about. To really understand how and why the following approach works, you need a grasp of linear algebra, specifically dimensionality when using the dot product operation.

The specific problem that arises, when trying to implement the feedforward neural network, is that we are trying to transform from 784 nodes all the way down to 10 nodes. When instantiating the DeepNeuralNetwork class, we pass in an array of sizes that defines the number of activations for each layer.

```
dnn = DeepNeuralNetwork(sizes=[784, 128, 64, 10])
```

This initializes the DeepNeuralNetwork class by the init function.

```
def __init__(self, sizes, epochs=10, l_rate=0.001):
    self.sizes = sizes
    self.epochs = epochs
    self.l_rate = l_rate

    # we save all parameters in the neural network in this dictionary
    self.params = self.initialization()
```

Let's look at how the sizes affect the parameters of the neural network when calling the `initialization()` function. We are preparing *m x n* matrices that are *"dot-able"*, so that we can do a forward pass while shrinking the number of activations as the layers increase. We can only use the dot product operation on two matrices M1 and M2 when the number of columns *n* in M1 equals the number of rows *m* in M2.

With this explanation, you can see that we initialize the first set of weights W1 with $m=128$ and $n=784$, while the next weights W2 are $m=64$ and $n=128$. The number of activations in the input layer A0 is equal to 784, as explained earlier, and when we dot W1 by the activations A0, the operation is successful.

```
def initialization(self):
    # number of nodes in each layer
    input_layer = self.sizes[0]
    hidden_1 = self.sizes[1]
    hidden_2 = self.sizes[2]
    output_layer = self.sizes[3]

    params = {
        'W1': np.random.randn(hidden_1, input_layer) * np.sqrt(1. / hidden_1),
        'W2': np.random.randn(hidden_2, hidden_1) * np.sqrt(1. / hidden_2),
        'W3': np.random.randn(output_layer, hidden_2) * np.sqrt(1. / output_layer)
    }

    return params
```
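To sanity-check the dimensions, we can verify with dummy random arrays that each weight matrix is "dot-able" with the activations of the previous layer (an illustrative sketch, not part of the class):

```python
import numpy as np

# dummy weight matrices with the same shapes as W1, W2, W3
W1 = np.random.randn(128, 784)
W2 = np.random.randn(64, 128)
W3 = np.random.randn(10, 64)
A0 = np.random.randn(784)  # a dummy flattened input image

A1 = np.dot(W1, A0)  # (128, 784) . (784,) -> (128,)
A2 = np.dot(W2, A1)  # (64, 128) . (128,)  -> (64,)
A3 = np.dot(W3, A2)  # (10, 64) . (64,)    -> (10,)

print(A1.shape, A2.shape, A3.shape)  # (128,) (64,) (10,)
```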

The forward pass consists of the dot operation in NumPy, which turns out to be just matrix multiplication. As described in the introduction to neural networks article, we have to multiply the weights by the activations of the previous layer. Then we have to apply the activation function to the outcome.

To get through each layer, we sequentially apply the dot operation, followed by the sigmoid activation function. In the last layer we use the softmax activation function, since we wish to have probabilities of each class, so that we can measure how well our current forward pass performs.

Note: A numerically stable version of the softmax function was chosen; you can read more in the Stanford course CS231n.

```
def forward_pass(self, x_train):
    params = self.params

    # input layer activations becomes sample
    params['A0'] = x_train

    # input layer to hidden layer 1
    params['Z1'] = np.dot(params["W1"], params['A0'])
    params['A1'] = self.sigmoid(params['Z1'])

    # hidden layer 1 to hidden layer 2
    params['Z2'] = np.dot(params["W2"], params['A1'])
    params['A2'] = self.sigmoid(params['Z2'])

    # hidden layer 2 to output layer
    params['Z3'] = np.dot(params["W3"], params['A2'])
    params['A3'] = self.softmax(params['Z3'])

    return params['A3']
```

The following are the activation functions used for this article. As can be observed, we provide a derivative version of the sigmoid, since we will need that later on when backpropagating through the neural network.

```
def sigmoid(self, x, derivative=False):
    if derivative:
        return (np.exp(-x))/((np.exp(-x)+1)**2)
    return 1/(1 + np.exp(-x))

def softmax(self, x, derivative=False):
    # Numerically stable with large exponentials
    exps = np.exp(x - x.max())
    if derivative:
        return exps / np.sum(exps, axis=0) * (1 - exps / np.sum(exps, axis=0))
    return exps / np.sum(exps, axis=0)
```
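A quick way to convince yourself that the softmax output can be read as class probabilities is to check that its entries are positive and sum to 1 (a standalone sketch using the same numerically stable form, with made-up scores):

```python
import numpy as np

def softmax(x):
    # same numerically stable form as above
    exps = np.exp(x - x.max())
    return exps / np.sum(exps, axis=0)

scores = np.array([2.0, 1.0, 0.1])  # hypothetical raw outputs
probs = softmax(scores)

print(np.all(probs > 0), round(probs.sum(), 6))  # True 1.0
```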

The backward pass is hard to get right, because there are so many sizes and operations that have to align, for all the operations to be successful. Here is the full function for the backward pass; we will go through each weight update below.

```
def backward_pass(self, y_train, output):
    '''
    This is the backpropagation algorithm, for calculating the updates
    of the neural network's parameters.

    Note: There is a stability issue that causes warnings. This is
    caused by the dot and multiply operations on the huge arrays.

    RuntimeWarning: invalid value encountered in true_divide
    RuntimeWarning: overflow encountered in exp
    RuntimeWarning: overflow encountered in square
    '''
    params = self.params
    change_w = {}

    # Calculate W3 update
    error = 2 * (output - y_train) / output.shape[0] * self.softmax(params['Z3'], derivative=True)
    change_w['W3'] = np.outer(error, params['A2'])

    # Calculate W2 update
    error = np.dot(params['W3'].T, error) * self.sigmoid(params['Z2'], derivative=True)
    change_w['W2'] = np.outer(error, params['A1'])

    # Calculate W1 update
    error = np.dot(params['W2'].T, error) * self.sigmoid(params['Z1'], derivative=True)
    change_w['W1'] = np.outer(error, params['A0'])

    return change_w
```

The update for `W3` can be calculated by subtracting the ground truth array with labels, called `y_train`, from the output of the forward pass, called `output`. This operation is successful because `len(y_train)` is 10 and `len(output)` is also 10.

An example of `y_train` might be the following, where the 1 corresponds to the label of the `output`:

```
y_train = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
```

While an example of `output` might be the following, where the numbers are probabilities corresponding to the classes of `y_train`:

```
output = np.array([0.2, 0.2, 0.5, 0.3, 0.6, 0.4, 0.2, 0.1, 0.3, 0.7])
```

If we subtract them, we get the following:

```
>>> output - y_train
array([ 0.2, 0.2, -0.5, 0.3, 0.6, 0.4, 0.2, 0.1, 0.3, 0.7])
```

We use that operation when calculating the initial error, along with the length of our output vector, and the softmax derivative.

```
error = 2 * (output - y_train) / output.shape[0] * self.softmax(params['Z3'], derivative=True)
change_w['W3'] = np.outer(error, params['A2'])
```

The next step is updating the weights `W2`. More operations are involved for success. Firstly, there is a slight mismatch in shapes, because `W3` has the shape `(10, 64)`, while `error` is a vector of shape `(10,)`. Thus, we use a transpose operation on the `W3` parameter by the `.T`, such that the array has its dimensions permuted and the shapes line up for the dot operation.

`W3.T` has shape `(64, 10)` and `error` has shape `(10,)`, which are compatible with the dot operation; the result is a vector of shape `(64,)`. That result is multiplied element-wise (also called the Hadamard product) with the outcome of the derivative of the sigmoid function of `Z2`. At last, we use the outer product of two vectors to multiply the error with the activations `A1`.

```
error = np.dot(params['W3'].T, error) * self.sigmoid(params['Z2'], derivative=True)
change_w['W2'] = np.outer(error, params['A1'])
```
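The shapes can be checked with dummy vectors: the transposed dot brings the error down to 64 elements, and the outer product with `A1` produces a matrix with the same shape as `W2` (an illustrative sketch, not the article's code):

```python
import numpy as np

W3 = np.zeros((10, 64))  # dummy stand-in for the W3 weights
error = np.zeros(10)     # error at the output layer
A1 = np.zeros(128)       # activations of hidden layer 1

hidden_error = np.dot(W3.T, error)   # (64, 10) . (10,) -> (64,)
update = np.outer(hidden_error, A1)  # (64,) x (128,)   -> (64, 128)

print(hidden_error.shape, update.shape)  # (64,) (64, 128)
```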

Likewise, the code for updating `W1` uses the parameters of the neural network one step earlier. Apart from the parameters involved, the code is equivalent to the W2 update.

```
error = np.dot(params['W2'].T, error) * self.sigmoid(params['Z1'], derivative=True)
change_w['W1'] = np.outer(error, params['A0'])
```

We have defined a forward and backward pass, but how can we start using them? We have to make a training loop and choose to use Stochastic Gradient Descent (SGD) as the optimizer to update the parameters of the neural network.

There are two main loops in the training function. One loop for the number of epochs, which is the number of times we run through the whole dataset, and a second loop for running through each observation one by one.

For each observation, we do a forward pass with `x`, which is one image in an array with the length 784, as explained earlier. The `output` of the forward pass is used along with `y`, which are the one-hot encoded labels (the ground truth), in the backward pass. This gives us a dictionary of updates to the weights in the neural network.

```
def train(self, x_train, y_train, x_val, y_val):
    start_time = time.time()

    for iteration in range(self.epochs):
        for x, y in zip(x_train, y_train):
            output = self.forward_pass(x)
            changes_to_w = self.backward_pass(y, output)
            self.update_network_parameters(changes_to_w)

        accuracy = self.compute_accuracy(x_val, y_val)
        print('Epoch: {0}, Time Spent: {1:.2f}s, Accuracy: {2}'.format(
            iteration+1, time.time() - start_time, accuracy
        ))
```

The `update_network_parameters()` function has the code for the SGD update rule, which just needs the gradients for the weights as input. To be clear: backpropagation (the backward pass) calculates the gradients, while SGD applies them to update the parameters. The two algorithms are different and should be thought of separately.

```
def update_network_parameters(self, changes_to_w):
    '''
    Update network parameters according to update rule from
    Stochastic Gradient Descent.

    θ = θ - η * ∇J(x, y),
        theta θ: a network parameter (e.g. a weight w)
        eta η: the learning rate
        gradient ∇J(x, y): the gradient of the objective function,
            i.e. the change for a specific theta θ
    '''
    # subtract the scaled gradient from each weight matrix in one go
    for key, value in changes_to_w.items():
        self.params[key] -= self.l_rate * value
```
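As a small numeric sketch of the update rule θ = θ - η * ∇J (the values below are made up purely for illustration):

```python
import numpy as np

l_rate = 0.001                          # eta, the learning rate
W = np.array([0.5, -0.2, 0.1])          # theta, a tiny made-up weight array
gradient = np.array([2.0, -1.0, 0.0])   # a made-up gradient for those weights

W -= l_rate * gradient  # theta = theta - eta * gradient
# W is now approximately [0.498, -0.199, 0.1]
```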

After having updated the parameters of the neural network, we can measure the accuracy on a validation set that we conveniently prepared earlier, to validate how well our network performs after each iteration over the whole dataset.

This code uses some of the same pieces as the training function; to begin with, it does a forward pass, then it finds the prediction of the network and checks for equality with the label. We return the average of the accuracy.

```
def compute_accuracy(self, x_val, y_val):
    '''
    This function does a forward pass of x, then checks if the indices
    of the maximum value in the output equals the indices in the label
    y. Then it sums over each prediction and calculates the accuracy.
    '''
    predictions = []

    for x, y in zip(x_val, y_val):
        output = self.forward_pass(x)
        pred = np.argmax(output)
        predictions.append(pred == np.argmax(y))

    return np.mean(predictions)
```
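The equality check works because `np.argmax` returns the position of the largest value, so a prediction counts as correct exactly when the most probable class lines up with the 1 in the one-hot label (a tiny sketch with made-up values):

```python
import numpy as np

output = np.array([0.05, 0.1, 0.6, 0.25])  # made-up network probabilities
y = np.array([0, 0, 1, 0])                 # one-hot label for class 2

print(np.argmax(output) == np.argmax(y))  # True
```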

Finally, we can call the training function, after knowing what will happen. We use the training and validation data as input to the training function, and then we wait.

```
dnn.train(x_train, y_train, x_val, y_val)
```

Note that the results may vary a lot, depending on how the weights are initialized.
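If you want reproducible runs, one option (an addition of mine, not something the original code does) is to seed NumPy's random number generator before the network is constructed, so the weight initialization is the same every time:

```python
import numpy as np

np.random.seed(42)  # fix the random state before creating the network
a = np.random.randn(3)

np.random.seed(42)  # re-seeding reproduces the exact same draws
b = np.random.randn(3)

print(np.array_equal(a, b))  # True
```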

Here is the full code, for an easy copy-paste and overview of what's happening.

```
from sklearn.datasets import fetch_openml
from keras.utils.np_utils import to_categorical
import numpy as np
from sklearn.model_selection import train_test_split
import time

x, y = fetch_openml('mnist_784', version=1, return_X_y=True)
x = (x/255).astype('float32')
y = to_categorical(y)

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.15, random_state=42)

class DeepNeuralNetwork():
    def __init__(self, sizes, epochs=10, l_rate=0.001):
        self.sizes = sizes
        self.epochs = epochs
        self.l_rate = l_rate

        # we save all parameters in the neural network in this dictionary
        self.params = self.initialization()

    def sigmoid(self, x, derivative=False):
        if derivative:
            return (np.exp(-x))/((np.exp(-x)+1)**2)
        return 1/(1 + np.exp(-x))

    def softmax(self, x, derivative=False):
        # Numerically stable with large exponentials
        exps = np.exp(x - x.max())
        if derivative:
            return exps / np.sum(exps, axis=0) * (1 - exps / np.sum(exps, axis=0))
        return exps / np.sum(exps, axis=0)

    def initialization(self):
        # number of nodes in each layer
        input_layer = self.sizes[0]
        hidden_1 = self.sizes[1]
        hidden_2 = self.sizes[2]
        output_layer = self.sizes[3]

        params = {
            'W1': np.random.randn(hidden_1, input_layer) * np.sqrt(1. / hidden_1),
            'W2': np.random.randn(hidden_2, hidden_1) * np.sqrt(1. / hidden_2),
            'W3': np.random.randn(output_layer, hidden_2) * np.sqrt(1. / output_layer)
        }

        return params

    def forward_pass(self, x_train):
        params = self.params

        # input layer activations becomes sample
        params['A0'] = x_train

        # input layer to hidden layer 1
        params['Z1'] = np.dot(params["W1"], params['A0'])
        params['A1'] = self.sigmoid(params['Z1'])

        # hidden layer 1 to hidden layer 2
        params['Z2'] = np.dot(params["W2"], params['A1'])
        params['A2'] = self.sigmoid(params['Z2'])

        # hidden layer 2 to output layer
        params['Z3'] = np.dot(params["W3"], params['A2'])
        params['A3'] = self.softmax(params['Z3'])

        return params['A3']

    def backward_pass(self, y_train, output):
        '''
        This is the backpropagation algorithm, for calculating the updates
        of the neural network's parameters.

        Note: There is a stability issue that causes warnings. This is
        caused by the dot and multiply operations on the huge arrays.

        RuntimeWarning: invalid value encountered in true_divide
        RuntimeWarning: overflow encountered in exp
        RuntimeWarning: overflow encountered in square
        '''
        params = self.params
        change_w = {}

        # Calculate W3 update
        error = 2 * (output - y_train) / output.shape[0] * self.softmax(params['Z3'], derivative=True)
        change_w['W3'] = np.outer(error, params['A2'])

        # Calculate W2 update
        error = np.dot(params['W3'].T, error) * self.sigmoid(params['Z2'], derivative=True)
        change_w['W2'] = np.outer(error, params['A1'])

        # Calculate W1 update
        error = np.dot(params['W2'].T, error) * self.sigmoid(params['Z1'], derivative=True)
        change_w['W1'] = np.outer(error, params['A0'])

        return change_w

    def update_network_parameters(self, changes_to_w):
        '''
        Update network parameters according to update rule from
        Stochastic Gradient Descent.

        θ = θ - η * ∇J(x, y),
            theta θ: a network parameter (e.g. a weight w)
            eta η: the learning rate
            gradient ∇J(x, y): the gradient of the objective function,
                i.e. the change for a specific theta θ
        '''
        for key, value in changes_to_w.items():
            self.params[key] -= self.l_rate * value

    def compute_accuracy(self, x_val, y_val):
        '''
        This function does a forward pass of x, then checks if the indices
        of the maximum value in the output equals the indices in the label
        y. Then it sums over each prediction and calculates the accuracy.
        '''
        predictions = []

        for x, y in zip(x_val, y_val):
            output = self.forward_pass(x)
            pred = np.argmax(output)
            predictions.append(pred == np.argmax(y))

        return np.mean(predictions)

    def train(self, x_train, y_train, x_val, y_val):
        start_time = time.time()

        for iteration in range(self.epochs):
            for x, y in zip(x_train, y_train):
                output = self.forward_pass(x)
                changes_to_w = self.backward_pass(y, output)
                self.update_network_parameters(changes_to_w)

            accuracy = self.compute_accuracy(x_val, y_val)
            print('Epoch: {0}, Time Spent: {1:.2f}s, Accuracy: {2:.2f}%'.format(
                iteration+1, time.time() - start_time, accuracy * 100
            ))

dnn = DeepNeuralNetwork(sizes=[784, 128, 64, 10])
dnn.train(x_train, y_train, x_val, y_val)
```

You might have noticed that the code is very readable, but takes up a lot of space and could be optimized to run in loops. Here is a chance to optimize and improve the code. For newcomers, the difficulty of the following exercises ranges from easy to hard, where the last exercise is the hardest.

- **Easy:** Implement the ReLU activation function, or any other activation function from this overview of activation functions. Check how the sigmoid functions are implemented for reference, and remember to implement the derivative as well. Use the ReLU activation function in place of the sigmoid function.
- **Easy:** Initialize biases and add them to Z before the activation function in the forward pass, and update them in the backward pass. Be careful of the dimensions of the arrays when you try to add biases.
- **Medium:** Optimize the forward and backward pass, such that they run in a for loop in each function. This makes the code easier to modify and possibly easier to maintain.

  a. Optimize the initialization function that makes weights for the neural network, such that you can modify the `sizes=[]` argument without the neural network failing.

- **Medium:** Implement mini-batch gradient descent, replacing stochastic gradient descent. Instead of making an update to a parameter for each sample, make an update based on the average of the gradients accumulated from each sample in the mini-batch. The size of the mini-batch is usually below 64.
- **Hard:** Implement the Adam optimizer, described in this overview of optimizers. This should be implemented in the training function.

a. Implement Momentum by adding the extra term.

b. Implement an adaptive learning rate, based on the AdaGrad optimizer.

c. Combine step a and b to finally implement Adam.

My belief is that if you complete these exercises, you will have learnt a lot. The next step would be implementing convolutions, filters and more, but that is left for a future article. *As a disclaimer, there are no solutions to these exercises, but feel free to share GitHub/Colab links to your solution in the comment section.*

Now that we have shown how to implement these calculations for the feedforward neural network with backpropagation, let's show just how easy and how much time PyTorch saves us, in comparison to NumPy.

One of the things that seems more complicated, or harder to understand than it should be, is loading datasets with PyTorch.

You start by defining the transformation of the data, specifying that it should be a tensor and that it should be normalized. Then you use the DataLoader in combination with the datasets import to load a dataset. This is all we need, and we will see how to unpack the values from these loaders later.

```
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('data', train=True, download=True, transform=transform))
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('data', train=False, transform=transform))
```

I have defined a class called Net, that is similar to the DeepNeuralNetwork class written in NumPy earlier. This class has some of the same methods, but you can clearly see that we don't need to think about initializing the network parameters nor the backward pass in PyTorch, since those functions are gone along with the function for computing accuracy.

```
import time
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, epochs=10):
        super(Net, self).__init__()
        self.linear1 = nn.Linear(784, 128)
        self.linear2 = nn.Linear(128, 64)
        self.linear3 = nn.Linear(64, 10)
        self.epochs = epochs

    def forward_pass(self, x):
        x = self.linear1(x)
        x = torch.sigmoid(x)
        x = self.linear2(x)
        x = torch.sigmoid(x)
        x = self.linear3(x)
        x = torch.softmax(x, dim=0)
        return x

    def one_hot_encode(self, y):
        encoded = torch.zeros([10], dtype=torch.float64)
        encoded[y[0]] = 1.
        return encoded

    def train(self, train_loader, optimizer, criterion):
        start_time = time.time()
        loss = None

        for iteration in range(self.epochs):
            for x, y in train_loader:
                y = self.one_hot_encode(y)
                optimizer.zero_grad()
                output = self.forward_pass(torch.flatten(x))
                loss = criterion(output, y)
                loss.backward()
                optimizer.step()

            print('Epoch: {0}, Time Spent: {1:.2f}s, Loss: {2}'.format(
                iteration+1, time.time() - start_time, loss
            ))
```

When reading this class, we observe that PyTorch has implemented all the relevant activation functions for us, along with different types of layers. We don't even have to think about it; we can just define some layers like `nn.Linear()` for a fully connected layer.

We have imported optimizers earlier, and here we specify which optimizer we want to use, along with the criterion for the loss. We pass both the optimizer and criterion into the training function, and PyTorch starts running through our examples, just like in NumPy. We could even include a metric for measuring accuracy, but that is left out in favor of measuring the loss instead.

```
model = Net()
optimizer = optim.SGD(model.parameters(), lr=0.001)
criterion = nn.BCEWithLogitsLoss()
model.train(train_loader, optimizer, criterion)
```

For the TensorFlow/Keras version of our neural network, I chose to use a simple approach, minimizing the number of lines of code. That means we are not defining any class, but instead using the high level API of Keras to make a neural network with just a few lines of code. If you are just getting into learning neural networks, you will find that the bar to entry is the lowest when using Keras, therefore I recommend it.

We start off by importing all the functions we need for later.

```
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.losses import BinaryCrossentropy
```

We can load the dataset and preprocess it with just these few lines of code. Note that we only preprocess the training data, because we are not planning on using the validation data for this approach. I will explain how we can use the validation data later on.

```
(x_train, y_train), (x_val, y_val) = mnist.load_data()
x_train = x_train.astype('float32') / 255
y_train = to_categorical(y_train)
```
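`to_categorical` simply turns integer labels into one-hot rows; a minimal NumPy equivalent (a sketch of the idea, not the Keras implementation) looks like this:

```python
import numpy as np

labels = np.array([2, 0, 1])  # made-up integer class labels
num_classes = 3

# index the identity matrix by the labels to get one-hot rows
one_hot = np.eye(num_classes, dtype='float32')[labels]
print(one_hot)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
```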

The next step is defining our model. In Keras, this is extremely simple once you know which layers you want to apply to your data. In this case, we are going for the fully connected layers, as in our NumPy example; in Keras, this is done by the `Dense()` function.

Once we have defined the layers of our model, we compile the model and define the optimizer, loss function and metric. At last, we can tell Keras to fit to our training data for 10 epochs, just like in our other examples.

```
model = tf.keras.Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128, activation='sigmoid'),
    Dense(64, activation='sigmoid'),
    Dense(10)
])

model.compile(optimizer='SGD',
              loss=BinaryCrossentropy(),
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=10)
```

If you want to use the validation data, you could pass it in using the `validation_data` parameter of the fit function:

```
model.fit(x_train, y_train, epochs=10, validation_data=(x_val, y_val))
```

- Optimizing Gradient Descent by Sebastian Ruder. A very good read on the majority of optimizers.
- Neural Networks by 3Blue1Brown. Probably one of the best introductions to neural networks in general. Extremely visual and clear intuitive explanations are provided.
- Neural Networks and Deep Learning by Michael Nielsen. A formal book that intuitively explains the mathematical perspective behind neural networks.

Deployment is perhaps one of the most overlooked topics in the Machine Learning world. But it most certainly is important, if you want to get into the industry as a Machine Learning Engineer (MLE). In this article, we will take a sober look at how painless this process can be, if you just know the small ins and outs of the technologies involved in deployment.

All the files for this project are available on GitHub, and you can perhaps use this project as a Hello World application, such that you have something running and later on replace it with something more complex.

- Setup & Installation
- What Is Docker And Kubernetes?
- A Sample Project
- The Deployment With Docker And Kubernetes

These are all the steps to set up your environment. We are going to be using Google Cloud Platform (GCP), as they are largely the leader in Kubernetes – they are also the ones who developed and open-sourced the originally internal project. The main benefits of using GCP are that it has a great UI/UX and is easy to set up. The same cannot be said for their competitors.

- You probably already have a Google account. Sign up for Google Cloud Platform and get *$300 free credits*. You have to enter your credit card, but it won't be charged unless you give them permission. Remember to enable billing.
- Download the Google Cloud SDK.
- Export the SDK to PATH by finding the path to the SDK bin folder.
- Export the SDK to PATH by finding the path to the SDK bin folder.

a. **Windows:** Add an environment variable. 1) Open search and search for "Edit the environment variables", 2) Click on the "Environment Variables" button at the bottom, 3) For your user, double click Path in "User variables for <user>" OR click New if it does not exist (Variable Name is "Path"). Click New, enter the path to the SDK bin folder and save.

b. **MacOS**: In Terminal, type `nano ~/.bash_profile` and add the path to your bin folder as a new line in the file: `export PATH="/Applications/google-cloud-sdk/bin"`. Save by pressing CTRL+O, then Enter/Return, then CTRL+X.

c. **Linux:** Depending on the distribution and what you have installed, there could be different profiles for Terminal. Check `nano ~/.bash_profile`, `nano ~/.bash_login`, `nano ~/.profile` or `nano ~/.bashrc`. Add the path to the bin folder of the SDK: `export PATH="/Applications/google-cloud-sdk/bin"`. Save by pressing CTRL+O, then Enter/Return, then CTRL+X.

- Download Docker. Note that you cannot use Docker with Windows Home – consider using a local Ubuntu VM by using Hyper-V on Windows instead.

a. **Windows** Pro/Enterprise/Education: Sign up and download Docker Desktop on Windows. Once logged in, you can download Docker. Make sure the program is running and that you are logged in locally.

b. **MacOS**: Sign up and download Docker Desktop on Mac. Once logged in, you can download Docker. Make sure the program is running and that you are logged in locally.

c. **Linux**: Use the following three commands to download and start Docker: 1) `apt install docker.io`, 2) `systemctl start docker`, and 3) `systemctl enable docker`. You can log in with `docker login` if you have a registry you want to log in to.

- After this is done, you should be able to type `gcloud init` and configure the SDK for the setup.

a. Type Y, press enter and log into your account when gcloud displays: *"You must log in to continue. Would you like to log in (Y/n)?"*

b. Type the number of your project, mine was 1, and press enter when gcloud displays: *"Please enter numeric choice or text value (must exactly match list item)"*

c. Type n and press enter when gcloud displays: *"Do you want to configure a default Compute Region and Zone? (Y/n)?"*

- You need to run `gcloud auth configure-docker` in the Terminal to be able to push your containers later on. If this does not run, try restarting Terminal.

After you have prepared the tools, we need to create a cluster, such that we can go through using these tools to deploy a Machine Learning application.

Creating your cluster is very individual and down to what your needs are. For this tutorial, I simply need 1 node, since we are just deploying a toy dataset that just needs to work remotely. If you want to replicate my cluster, adjust the number of nodes to 1, make the machine type g1-small and create your cluster – these are the only steps I took for this article.

We will go through the options, but before please go ahead and create your cluster in the Kubernetes Engine.

Consider the resources your application needs. Are you doing Deep Learning? Then you will need GPUs, so you need to use the *"GPU Accelerated Computing"* cluster template in the left side.

If you want high availability, you should make use of the cluster template *"Highly Available"* in the left sidebar. If your application has a need for CPU power or lots of RAM, then choose those cluster templates.

Remember that you can actually combine these templates, so you can have highly available GPU nodes by using the location type of regional.

While creating the cluster, you should pay attention to the menu under *"Machine Configuration"* > *"More Options"*.

What if your application hits the front page of Hacker News or some big blog or magazine? Are you sure that your application can handle that amount of traffic? You should enable autoscaling, such that Kubernetes automatically scales your application up and down when needed. Just enter the maximum number of nodes you are prepared to pay for if your application experiences a huge load.

Another great option is auto-upgrading and auto-repairing. The nodes can actually be automatically upgraded with no downtime, just by enabling auto-upgrade – this ensures there are fewer security flaws and that you get the latest features of the stable version.

If a node is somehow failing or not ready to be used, Kubernetes can take care of automatically restoring and making the node work again. Auto-repairing does this, by monitoring all nodes with health checks, multiple times an hour.

After you have created your cluster, you want to connect to your cluster in your local Terminal. You can connect to the cluster by clicking on *"Connect"*, and copy the line of code that pops up under *"Command-line access"*.

Normally, I don't recommend YouTube videos. But the following YouTube video is a very comprehensive and great explanation of Docker and Kubernetes. It even answers *why* we want to use Docker and Kubernetes, and *why* we shifted away from VMs to containers.

The basic idea of Kubernetes consists of Ingresses, Services, Deployments and Pods. We can think of an Ingress as a Load Balancer between multiple services, and we can think of a Service as a Load Balancer between the Pods of a specific Deployment. For this article, we will not be using an Ingress, since we have just a single container, but you should look into it, since you will likely need it for larger applications or when doing a microservice architecture.

We are deploying a machine learning model on the Auto MPG dataset, which is a toy dataset. Let's walk through the steps of training a Random Forest model.

To train a model, we don't have to do much work with the chosen dataset. We start by importing the packages we need, as well as the classes from the core folder.

```
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import joblib
from core.clean_data import CleanData
from core.predict_data import PredictData
from core.load_data import LoadData
```

Then we instantiate the classes, such that we can call all the functions from the classes later on.

```
loader = LoadData()
cleaner = CleanData()
predicter = PredictData()
```

The next step is loading the dataset into a dataframe, which we provided a function for in the LoadData class. When we call the function, we use Pandas to read a data table, and then we specify the column names from the class. Remember to check out the GitHub repository to see the full code.

```
df = loader.load_dataset_as_df()
```

Now that we have the dataset loaded, we want to think about what preprocessing steps we need to take, to be able to make a model. I found that there are some question marks in the *horsepower* feature of the dataset, so we need to get rid of those rows.

The CleanData class has a function for this, where we specify the dataframe to only contain the rows where horsepower does not contain a question mark – in other words, it removes the rows where horsepower has a question mark instead of a number. In a similar fashion to the loader, we call the cleaner class with the following line of code.

```
df = cleaner.clear_question_marks(df)
```
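To make the idea concrete, here is a minimal sketch of what such a cleaning function might look like. The real implementation lives in the article's GitHub repository; this version is an assumption built only from the description above.

```python
import pandas as pd

def clear_question_marks(df):
    # keep only rows where horsepower is not a question mark,
    # then cast the remaining values to numbers
    df = df[df['horsepower'] != '?'].copy()
    df['horsepower'] = df['horsepower'].astype(float)
    return df

# tiny illustrative frame with one bad row
df = pd.DataFrame({'horsepower': ['130.0', '?', '95.0'],
                   'mpg': [18.0, 25.0, 24.0]})
df = clear_question_marks(df)
```

After running this, the frame contains only the two rows with numeric horsepower values.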

The next and last step before training our model is specifying which feature we want to predict. Naturally, for this dataset at least, we want to predict the *mpg* feature, which is miles per gallon for a car.

What happens here is that we split the dataset into training and testing, where *y* is the feature we are trying to predict, and *X* is the features we are using to predict *y*.

```
y = df['mpg']
X = cleaner.drop_unused_columns(df)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)
```

Now we are finally ready to fit a random forest model to the dataset, since it has been cleaned and prepared for the algorithm.

We start off by instantiating the Random Forest with default parameters, and then we tell scikit-learn to train a random forest model with the training data. After that is done, we can predict on the testing data, and we can also score how well the predictions went.

```
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
pred = predicter.predict(X_test, rf)
score = predicter.score_r2(y_test)
```

The absolute last step is "dumping" the model, which means exporting it to a file that we can load into Python at another point in time. For this, we use joblib.

```
joblib.dump(rf, "models/rf_model.pkl")
```
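Loading the model back is the mirror operation. Here is a quick roundtrip sketch, using a throwaway linear model so it is self-contained; the article's Flask app restores the pickled random forest in exactly the same way.

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# a throwaway model stands in for the random forest trained above
model = LinearRegression().fit(np.array([[0.0], [1.0]]), np.array([0.0, 2.0]))
joblib.dump(model, "rf_model.pkl")

# in another process (e.g. the Flask app), restore the exact same model
restored = joblib.load("rf_model.pkl")
prediction = restored.predict(np.array([[2.0]]))
```

The restored model behaves identically to the one we dumped, which is what makes this pattern safe for serving.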

Now that we have a model, we need a way to expose it as a service. Flask is the most common way to do this, and it will scale easily. The first step is to create the Flask app with `app = Flask(__name__)`, and then define a function with `def`, decorated with an annotation `@app.route` that takes a route `/` and a methods parameter `methods=[]`.

Upon running the script, the following code imports the random forest model we exported earlier. Then once someone asks us at the specified route, we make predictions and return them.

```
from core.clean_data import CleanData
from core.predict_data import PredictData
from core.load_data import LoadData
from flask import Flask, jsonify, request

app = Flask(__name__)

loader = LoadData()
cleaner = CleanData()
model = loader.load_model_from_path('./models/rf_model.pkl')
predicter = PredictData(model)

@app.route("/", methods=['POST'])
def do_prediction():
    json = request.get_json()
    df = loader.json_to_df(json)
    df = cleaner.clear_question_marks(df)
    X = cleaner.drop_car_name(df)
    prediction = predicter.predict(X)
    return jsonify(prediction[0])

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000)
```

When we run this code, we have told Flask to run the method `do_prediction()` whenever someone queries the machine's IP on port 5000 (e.g. `localhost:5000` locally). This does not expose it to the world yet, though, since we need a cloud provider for that. This is simply the entrypoint to our application, which we will package and ship off to a cloud provider.

This section combines Docker, Kubernetes and Machine Learning to expose your application to the world. The following information was not easy to learn, although it seems easy now.

The very first thing we always do is: *creating a Dockerfile*. By making a Dockerfile, we take only the necessary files from our project and package them into an image. We want the image to be as small as possible, so we don't want to clutter it with tons of unnecessary files.

The syntax is not hard to learn. Let me give you a brief overview of the commands used in this Dockerfile.

Command | What it does |
---|---|
FROM | Specifies the base image, which is usually a Python image for Machine Learning. You can browse base images on Docker Hub. |
WORKDIR | Changes (and creates) the working directory within the image to the specified path. |
RUN | Runs a command in a shell inside the image. It could be anything, but don't clutter it up. |
ADD | Copies files from your directory to a directory in the image (also creates the directory if it does not exist). |
EXPOSE | Opens up a specific port number, like port 5000. |
CMD | Takes the argument for running the application. |

Now, the following is the Dockerfile, which I created for this project. We run the installation of packages directly by specifying the names of the packages, instead of through a requirements.txt file, like one would usually do.

After having installed all the packages and added the necessary files, we tell Docker to run the command `gunicorn --bind 0.0.0.0:5000 main:app`, which is the syntax for using Gunicorn. We *always* want to use Gunicorn on top of Flask, because it is suited for production, whereas Flask's built-in server is suited for development purposes. The application will work exactly the same in development and production.

The *main* is the filename without the extension (main.py), and the *app* is what we called the Flask app in the file. So if you change the name, you also have to change it here.

```
FROM python:3.7
WORKDIR /app
RUN pip install pandas scikit-learn flask gunicorn
ADD ./core ./core
ADD ./models ./models
ADD main.py main.py
EXPOSE 5000
CMD [ "gunicorn", "--bind", "0.0.0.0:5000", "main:app" ]
```

The image does not create itself, though. The Dockerfile just specifies what the image should look like – so we need a way to build the image before we can use it.

To be able to push the image to Google Cloud, we have to build it, tag it and push it. Luckily for you, I have mostly automated this process for Google Cloud, to the point where you just have to enter your details once and change the version number when you want to update the images.

First, you need to find your project id. You can find your project in the top left and you can find your ID by clicking on the arrow on the right.

All you have to enter is the address, project ID, repository and version, and then this script will build and push the image.

```
#!/bin/bash
ADDRESS=gcr.io
PROJECT_ID=macro-authority-266522
REPOSITORY=auto
VERSION=0.17
docker build -t ${PROJECT_ID}:${VERSION} .
ID="$(docker images | grep ${REPOSITORY} | head -n 1 | awk '{print $3}')"
docker tag ${ID} $ADDRESS/${PROJECT_ID}/${REPOSITORY}:${VERSION}
docker push $ADDRESS/${PROJECT_ID}/${REPOSITORY}:${VERSION}
```

You can run the script with Git Bash or a terminal: `sh build_push_image.sh`. You should run this script instead of manually running these commands each time.

Alternatively, you can experiment with running `docker build`, `docker tag` and `docker push` yourself.

The image is now built and pushed to the cloud. To be able to run it, we need to create a deployment and service file, which is defined in the syntax of YAML.

I'm going to show you how to deploy the product now. The general procedure consists of the following steps.

- Create/Update the actual containers.
- Update the details in the bash script and run it. If you want to update the container, you just have to update the version in the script.
- Update the details of the image in the deployment YAML file. If you want to update the container, you just have to update the version in the image.
- Run the `kubectl apply -f <filename>.yml` command on a deployment or service.

Below, you can see the `deployment.yaml` file that I used – it works by naming your project (rename *mpg*) and specifying the URL for the image. When we apply this deployment, we have told Kubernetes to create 3 replicas, which are called Pods – a pod will run one instance of the container.

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mpg
  labels:
    app: mpg
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mpg
  template:
    metadata:
      labels:
        app: mpg
    spec:
      containers:
      - name: auto
        image: gcr.io/macro-authority-266522/auto:0.17
        imagePullPolicy: Always
        ports:
        - containerPort: 5000
```

The containerPort is set to 5000, because that is the port we set in our Flask application. The image is set to `gcr.io/<project_id>/<repository>:<version>`, which matches the variable names in the bash script from earlier.

When you have such a deployment file, you very simply run the below line of code.

```
kubectl apply -f deployment.yml
```

It should say `deployment.apps/mpg created` when you create it for the first time, and it will give you another message if you update the image.

Remember we can always go back and reapply this deployment file, if we have made changes to the container. It really is as simple as running the apply command once again, after having changed the version number in your bash script to push to the cloud and in the deployment file.

Though, we are not quite done yet. The application you made is running in the cloud, but it's not exposed yet. Let me introduce services.

A service is also defined in a YAML file, just like the deployment. You use the same command to make and update a service. This is why some DevOps engineers are called YAML engineers: most of the configuration for deployments is done using YAML files just like the ones presented here.

Below is the `service.yaml` file used for this tutorial. You want to specify the type to be a LoadBalancer, since that will distribute traffic equally amongst the available pods from your deployments. This enables immense scaling capacity, since you can just create more nodes in your cluster and create more pods.

We called our deployment *mpg*, so we specify the app in the selector to be the same app name, such that this service is linked to our deployment. The *port* is set to 80, because that defaults to just the external IP address, while the *targetPort* is set to 5000, since that is what we specified in the deployment file and Flask application.

```
apiVersion: v1
kind: Service
metadata:
  name: mlfromscratch
spec:
  type: LoadBalancer
  selector:
    app: mpg
  ports:
  - protocol: TCP
    port: 80
    targetPort: 5000
```

Quite simply as before, we apply the service in the same way as the deployment.

```
kubectl apply -f service.yml
```

And we get back the response `service/mlfromscratch created`.

Ok, now we know that everything is created. Let me give you a rundown of how you can access your deployment.

First things first, let's take a look at what we just created, shall we? The very first command to run is the following (note that "service" can be shortened to "svc").

```
kubectl get service
```

We get back a response. Your `EXTERNAL-IP` might still say `<pending>`, but the actual public IP will come through rather quickly. Mine took 2 minutes to come online and another 2 minutes to give me the response I was expecting, instead of endlessly loading. Note that we don't care about the service named *kubernetes*, but about the name specified in your service, which in this case was *mlfromscratch*.

NAME | TYPE | CLUSTER-IP | EXTERNAL-IP | PORT(S) | AGE |
---|---|---|---|---|---|
kubernetes | ClusterIP | 10.12.0.1 | `<none>` | 443/TCP | 20m |
mlfromscratch | LoadBalancer | 10.12.15.238 | 35.225.0.74 | 80:30408/TCP | 5m14s |

All we have to do to access the application is access the external-ip from the service, and that's it! We just made a Machine Learning model ready to serve predictions at an endpoint. Ideally, we would query this IP-address from another service/product, such that we continually could use these predictions. You could even make your own SaaS by turning this into a microservice and deploying it using this Flask template I integrated with Stripe.

Sometimes you want to check what the console prints, or just the status of the application, to see if it's running and if there are any errors.

Just like we used the `kubectl get service` command to get information about our services earlier, we can use `kubectl get` to inspect other pieces of our Kubernetes cluster – or all of it at once through the command below. Optionally, we can add `-o wide` at the end for even more information.

```
kubectl get svc,deployment,pods -o wide
```

The above command will give us a look at the pods. They should say Running under STATUS – otherwise, your application is not working as expected and you need to go back and make sure that it runs locally.

After getting a specific pod name, we can view the logs for that specific pod and see what happened. You can also pass `--since=1h` to see the last hour of logs instead of just the last 20 lines.

```
kubectl logs -f pod/mpg-768578c99c-lt6j2 --tail=20
```

Similar to the get command, there is a whole bunch of other commands, which you can list by typing `kubectl` in your Terminal. Most interestingly, you can delete your deployments and services, e.g. `kubectl delete -f service.yaml`, if you just want to start over.

```
Basic Commands (Beginner):
create Create a resource from a file or from stdin.
expose Take a replication controller, service, deployment or pod and expose it as a new Kubernetes Service
run Run a particular image on the cluster
set Set specific features on objects
Basic Commands (Intermediate):
explain Documentation of resources
get Display one or many resources
edit Edit a resource on the server
delete Delete resources by filenames, stdin, resources and names, or by resources and label selector
Deploy Commands:
rollout Manage the rollout of a resource
scale Set a new size for a Deployment, ReplicaSet, Replication Controller, or Job
autoscale Auto-scale a Deployment, ReplicaSet, or ReplicationController
Cluster Management Commands:
certificate Modify certificate resources.
cluster-info Display cluster info
top Display Resource (CPU/Memory/Storage) usage.
cordon Mark node as unschedulable
uncordon Mark node as schedulable
drain Drain node in preparation for maintenance
taint Update the taints on one or more nodes
Troubleshooting and Debugging Commands:
describe Show details of a specific resource or group of resources
logs Print the logs for a container in a pod
attach Attach to a running container
exec Execute a command in a container
port-forward Forward one or more local ports to a pod
proxy Run a proxy to the Kubernetes API server
cp Copy files and directories to and from containers.
auth Inspect authorization
Advanced Commands:
diff Diff live version against would-be applied version
apply Apply a configuration to a resource by filename or stdin
patch Update field(s) of a resource using strategic merge patch
replace Replace a resource by filename or stdin
wait Experimental: Wait for a specific condition on one or many resources.
convert Convert config files between different API versions
kustomize Build a kustomization target from a directory or a remote url.
Settings Commands:
label Update the labels on a resource
annotate Update the annotations on a resource
completion Output shell completion code for the specified shell (bash or zsh)
Other Commands:
api-resources Print the supported API resources on the server
api-versions Print the supported API versions on the server, in the form of "group/version"
config Modify kubeconfig files
plugin Provides utilities for interacting with plugins.
version Print the client and server version information
```

We have the external IP from earlier, which we are going to reuse here. The following JSON request can now be sent with an HTTP POST method, and we will receive the expected response.

```
{
    "cylinders": 8,
    "displacement": 307.0,
    "horsepower": 130.0,
    "weight": 3504,
    "acceleration": 12.0,
    "model_year": 70,
    "origin": 1,
    "car_name": "chevrolet chevelle malibu"
}
```

In just 0.29 seconds, we received the prediction for this car: our machine learning model predicted `16.383`.

Or from a Python script.

```
import json
import requests

data = {
    "cylinders": 8,
    "displacement": 307.0,
    "horsepower": 130.0,
    "weight": 3504,
    "acceleration": 12.0,
    "model_year": 70,
    "origin": 1,
    "car_name": "chevrolet chevelle malibu"
}

r = requests.post('http://35.225.0.74', json=data)
```

Printing `r.text` gives us the prediction $16.383$, and printing `r.status_code` gives us an HTTP status code of $200$.

Sometimes you discover small tips and tricks that improve your code and make life easier for yourself – better maintainability, efficiency, etc. This is one of those improvements to your machine learning workflow, except it's essential and takes an extra thought to implement.

The goal is to introduce you, the developer, to stacking in machine learning. Using your own models, you will learn how to apply stacking to your own datasets. Follow this article and get better results — it’s that simple.

You should at least be familiar with overfitting, underfitting and cross-validation to read this article.

Stacking is the process of using different Machine Learning models one after another, where you add the predictions from each model to make a new feature.

There are generally two different variants for stacking, variant A and B. For this article, we are focusing on variant A, since this *seems* to get the better results of the two variants, because models more easily overfit to training data in variant B. This is likely also the reason why practitioners use variant A, although it does not eliminate overfitting.

Please note that there is no *correct* way of implementing model stacking (David H. Wolpert), because model stacking only describes the process of combining many models with a final generalized model. There exist ways to implement model stacking, some of which have been proven to work well in practice. This is why we explore variant A.

Model stacking should always be accompanied by cross validation, to reduce overfitting models to training data. This is common practice – read more here.

Model stacking will seem like a simple technique for improving your results once you understand what happens inside the algorithm. However, there are many components interacting, and keeping track of all of them can be quite difficult, especially when first learning this concept. To help you fully understand the algorithm, I created a step-by-step image and description, so that it is easier to understand.

For starters, when doing model stacking with cross-validation, we require three parameters: a `Training` dataset, a `Holdout` dataset (validation dataset) and a list of models called `models_to_train`.

The most essential part here is that each model's predictions become a slice of a new feature, such that each model gets to predict a slice of the training data for this new feature.

Now, let's put *figure 1* into text, to actually explain what goes on! Later on, you will encounter a real example in Python.

- Gather models with optimized hyperparameters into a `models_to_train` array.
- Split the original dataset into a `Training` and a `Holdout` dataset.
- Let `Training` go onwards into the upcoming loop, and save `Holdout` until the last part of the loop.
- Make a for loop with KFold cross-validation, where k=5.
    - In each iteration, split the `Training` dataset into another training and testing dataset. Call them `X_train`, `y_train`, `X_test` and `y_test`. The white parts in *figure 1* represent `X_test` and `y_test`, while the green parts represent `X_train` and `y_train`.
    - Set the current model equal to `models_to_train[k-1]`.
    - Train the current model on `X_train` and `y_train`.
    - Make predictions on the test dataset `X_test` and call them `y_test_pred`. Extend an array `full_y_pred` with the predictions `y_test_pred`.
    - Make predictions on the `Holdout` dataset and call them `holdout_pred`.
- Average the `holdout_pred` arrays into a `full_holdout_pred` array.
- Add `full_y_pred` as a new feature in `Training`, and add `full_holdout_pred` as a new feature in `Holdout`.
- Return the datasets `Training` and `Holdout` with the new features, for use in the next layer.

Notice that we use one model to predict each part of the total training dataset; we predict one fold at a time, with a different model each time.
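The steps above can be sketched in a few lines of Python. This helper is a hypothetical illustration following the article's naming (`models_to_train`, a `Training` set and a `Holdout` set), not code from the article's repository:

```python
import numpy as np
from sklearn.model_selection import KFold

def stack_layer(models_to_train, X_training, y_training, X_holdout, k=5):
    """One stacking layer (variant A) with k-fold cross-validation."""
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    full_y_pred = np.zeros(len(X_training))
    holdout_preds = []

    for i, (train_idx, test_idx) in enumerate(kf.split(X_training)):
        model = models_to_train[i]                   # one model per fold
        model.fit(X_training[train_idx], y_training[train_idx])
        # each model predicts only its own fold of the training data
        full_y_pred[test_idx] = model.predict(X_training[test_idx])
        # every model also predicts the whole holdout set
        holdout_preds.append(model.predict(X_holdout))

    # average the k holdout predictions into a single new feature
    full_holdout_pred = np.mean(holdout_preds, axis=0)

    # append the predictions as a new column on each dataset
    return (np.column_stack([X_training, full_y_pred]),
            np.column_stack([X_holdout, full_holdout_pred]))
```

Both returned arrays have exactly one extra column: the stacked prediction feature for the next layer.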

Great, we have some stacked machine learning models, but how can we use them for anything? The next step is taking the new training and holdout dataset and use them as input in the next layer.

Stacking machine learning models is done in layers, and there can be arbitrarily many layers, depending on exactly how many models you have trained, along with the best combination of those models. For instance, the first layer could boil down to learning some feature that has great predictive power, while the next layer might be doing something else, like reducing noise.

We put model stacks in layers, usually each with a different purpose. At the end, we get a final dataset, which we feed into a last model. The last model is called a meta-learner (e.g. meta-regressor or meta-classifier), and its purpose is to generalize all the features from each layer into the final predictions.
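To make the meta-learner step concrete, here is a small self-contained sketch with synthetic data (the variable names and data are purely illustrative, not from the article's repository):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)

# stand-ins for the Training/Holdout sets after a stacking layer
# has appended prediction columns
training_with_new_features = rng.rand(100, 5)
weights = np.array([1.0, 2.0, 0.5, 0.0, 3.0])
y_training = training_with_new_features @ weights

holdout_with_new_features = rng.rand(20, 5)

# the meta-learner generalizes over the original + stacked features
meta_learner = LinearRegression().fit(training_with_new_features, y_training)
final_predictions = meta_learner.predict(holdout_with_new_features)
```

In a real pipeline, the meta-learner would be the tuned model chosen later in the article, fitted on the outputs of the last stacking layer.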

Below is an example, where each level represents a new layer in our total pipeline when applying model stacking (the text switches between *layer* and *level*, but they mean the same thing). The example given has only two levels, where level 2 is the final model that gives the final predictions, which is the output.

If you wanted to add more layers, we could just add another level after level 1 and make the meta-learner level 3. If we were to increase the number of layers, level 2 would be another round of *model stacking of M models*, just as in level 1. The meta-learner is always present in the last layer, hence why we insert a new level before the meta-learner.

To summarize, we produce new features with different model stacks, which are combined into a new dataset used to make final predictions at the end. People sometimes combine other methods in these model stacks, such as feature selection or normalization, but we will not explore those other methods here.

Quite literally, you define a stack of models for each layer, and you run the first model from the first layer, all the way down to the last model in the last layer, while making predictions between levels. These stacks of models could be any type of model, e.g. we could have a stack with XGBoost, Neural Networks and Linear Regression.

The image below helps illustrate how models are stacked in layers, even though it is not as specific as the one I provided earlier.

Throughout this article, the promise of *better results* has been mentioned a couple of times. Let's justify *how* and *why* we get better results when using model stacking.

1) Model stacking is used amongst competition winners and practitioners – and the reason for why is simple. Your meta-learner generalizes better than a single model, i.e. it makes better predictions on unseen data, than just a single model. It's that simple.

2) A good, older paper on this topic is Stacked Generalization by David H. Wolpert (1992). Wolpert argues that model stacking *deduces* the bias of a model on a particular dataset, such that we can later correct for that bias in a meta-learner.

3) Before model stacking became popular and widely used, a technique called Bayesian Model Averaging (BMA) existed. In such a model, the weights were the posterior probabilities of models. However, empirical evidence by Bertrand Clarke has shown that model stacking with cross-validation outperforms BMA, while being more robust, because model stacking is less sensitive to changes in a dataset.

I think the argument you should pay the most attention to is the first point – everyone uses model stacking to get better results. But if you are looking for an explanation of why it works, I would read the paper by Wolpert.

Now for what most developers would consider *the fun part*. The part where we apply what we just learned from reading about what model stacking is, and how exactly it improves the predictive power. GitHub repository with full code.

The first thing to do, in a Machine Learning project, is finding a dataset. For this illustrative example, we are going to be using a *toy dataset* (one that requires no data cleaning or transforming).

Before we can import a dataset, we need to import some of the basic packages for handling datasets.

```
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
import warnings
```

The next step is to load the dataset into a Pandas dataframe, such that we can index and inspect the dataset easily. We simply call the `load_boston` function from sklearn and pass its result into the function seen below. This will return a dataframe containing all the features and samples in the Boston Housing dataset.

```
def dataset_to_df(load):
    # Load the input data into the dataframe
    df = pd.DataFrame(load.data, columns=load.feature_names)

    # Add the output data into the dataframe
    df['label'] = pd.Series(load.target)

    # Return the dataframe
    return df

df = dataset_to_df(load_boston())
```

After loading the dataset, we typically want to inspect it and learn what the features mean. This task is omitted here, but I will leave 5 samples of data, along with the meaning of each feature in the dataset. You can also find a description of all these features here.

Inspecting the first five rows in our dataset, we can see some values. We don't know what they mean yet.

| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |

Features 1-13 are our input, and feature 14 (the label) is the output, which we are trying to predict as precisely as possible on an unseen test dataset.

- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town.
- CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per \$10,000
- PTRATIO: pupil-teacher ratio by town
- B: $1000(Bk - 0.63)^2$ where $Bk$ is the proportion of blacks by town
- LSTAT: % lower status of the population
- MEDV (label): Median value of owner-occupied homes in $1000's

Before we start doing model stacking, we want to optimize the hyperparameters of each algorithm for a baseline comparison. This is commonly done with a search algorithm; for example, scikit-learn offers a grid search and a random search algorithm.

If we cannot exceed the predictions from the baseline models, then we should simply not bother trying model stacking. Sometimes, though, model stacking cannot improve results by much, either because of a low number of samples or because the data is not complex enough.
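As a quick illustration of the random-search alternative mentioned above, here is a toy example with synthetic data (separate from the grid-search pipeline this article actually uses):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(0)
X, y = rng.rand(80, 4), rng.rand(80)

# sample 4 random hyperparameter combinations instead of the full grid
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={'n_estimators': [10, 50], 'max_depth': [2, 4, None]},
    n_iter=4,
    cv=3,
    random_state=0,
)
search.fit(X, y)
best_model = search.best_estimator_
```

Random search trades exhaustiveness for speed, which matters when the grid would otherwise be huge.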

Before we start stacking, we split the dataset `df` into four variables: `X_train`, `X_test`, `y_train` and `y_test`. Any variable named *X* is typically our input data, while *y* is what we want to predict (the output data).

```
from sklearn.model_selection import train_test_split

# Getting the output variable
y = df['label']

# Getting the input variables
X = df.drop(['label'], axis=1)

# Dividing our input and output into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.33,
    random_state=42
)
```

I made a function for doing hyperparameter tuning. This function gives us exactly what we want; the best model, the predictions and the score of the best model on the test dataset. It uses imports from the scikit-learn package (sklearn), which makes it easier to run the grid search algorithm.

As parameters to the function, we can just pass in our data, a model and different hyperparameters for tuning.

```
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score

def algorithm_pipeline(X_train_data, X_test_data, y_train_data, y_test_data,
                       model, param_grid, cv=10, scoring_fit='neg_mean_squared_error',
                       scoring_test=r2_score, do_probabilities=False):
    gs = GridSearchCV(
        estimator=model,
        param_grid=param_grid,
        cv=cv,
        n_jobs=-1,
        scoring=scoring_fit,
        verbose=2
    )
    fitted_model = gs.fit(X_train_data, y_train_data)
    best_model = fitted_model.best_estimator_

    if do_probabilities:
        pred = fitted_model.predict_proba(X_test_data)
    else:
        pred = fitted_model.predict(X_test_data)

    score = scoring_test(y_test_data, pred)
    return [best_model, pred, score]
```

Next, we define the hyperparameters for tuning, along with the models. We use Random Forest, LightGBM and XGBoost in the following code, since they usually perform the best. First, we import and instantiate the classes for the models, then we define some parameters to input into the grid search function.

```
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

# Defining our estimators, the algorithms to optimize
models_to_train = [XGBRegressor(), LGBMRegressor(), RandomForestRegressor()]

# Defining the hyperparameters to optimize
grid_parameters = [
    { # XGBoost
        'n_estimators': [400, 700, 1000],
        'colsample_bytree': [0.7, 0.8],
        'max_depth': [15, 20, 25],
        'reg_alpha': [1.1, 1.2, 1.3],
        'reg_lambda': [1.1, 1.2, 1.3],
        'subsample': [0.7, 0.8, 0.9]
    },
    { # LightGBM
        'n_estimators': [400, 700, 1000],
        'learning_rate': [0.12],
        'colsample_bytree': [0.7, 0.8],
        'max_depth': [4],
        'num_leaves': [10, 20],
        'reg_alpha': [1.1, 1.2],
        'reg_lambda': [1.1, 1.2],
        'min_split_gain': [0.3, 0.4],
        'subsample': [0.8, 0.9],
        'subsample_freq': [10, 20]
    },
    { # Random Forest
        'max_depth': [3, 5, 10, 13],
        'n_estimators': [100, 200, 400, 600, 900],
        'max_features': [2, 4, 6, 8, 10]
    }
]
```

Now that we have defined all the models and hyperparameters for tuning, we can simply input our data into the function defined earlier.

```
models_preds_scores = []

for i, model in enumerate(models_to_train):
    params = grid_parameters[i]

    result = algorithm_pipeline(X_train, X_test, y_train, y_test,
                                model, params, cv=5)
    models_preds_scores.append(result)
```

Inspecting the results, we finally get to see which models performed the best, by running the following lines of code.

```
for result in models_preds_scores:
    print('Model: {0}, Score: {1}'.format(type(result[0]).__name__, result[2]))
```

As we can see from the results, XGBoost clearly outperformed the two other algorithms, therefore we will use XGBoost as the meta-learner.

```
Model: XGBRegressor, Score: 0.8954
Model: LGBMRegressor, Score: 0.8679
Model: RandomForestRegressor, Score: 0.8697
```

We have a score to beat: the XGBRegressor score of $0.8954$. The point of stacking is that we can improve results. Let me show you how to make it happen.
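Before reaching for a package, the core idea behind stacking can be sketched in a few lines of plain Python. This is a toy illustration only – the two base "models" and the fixed-average meta-learner below are invented for clarity, not part of mlxtend:

```python
# Toy illustration of two-level stacking: base models make predictions,
# and a meta-learner combines those predictions into the final one.

def base_model_a(x):
    return 2 * x      # stand-in prediction from a first base model

def base_model_b(x):
    return x + 1      # stand-in prediction from a second base model

def meta_learner(pred_a, pred_b):
    # A real meta-learner is trained on out-of-fold base predictions;
    # here it is simply a fixed average of the two.
    return 0.5 * pred_a + 0.5 * pred_b

stacked_prediction = meta_learner(base_model_a(10), base_model_b(10))
```

In the real pipeline, the meta-learner is itself a fitted model (here, XGBoost) trained on the base models' cross-validated predictions.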

After doing some research on existing packages, I found pystacknet and mlxtend. For this article, I went with mlxtend, because it integrates nicely with scikit-learn, which defines a great API that XGBoost, LightGBM and other packages follow.

We start by importing the classes we need from the packages. We assume no hyperparameters for the stacking – this means that we use the default hyperparameters for each model.

```
from mlxtend.regressor import StackingCVRegressor
from sklearn.linear_model import Ridge, Lasso
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
xgb = XGBRegressor()
lgbm = LGBMRegressor()
rf = RandomForestRegressor()
ridge = Ridge()
lasso = Lasso()
svr = SVR(kernel='linear')
```

Now that we have defined all our models, we can begin improving our results by stacking some models! As we can see here, we defined two levels, where the first level has 6 models, and the second level has the meta-learner.

**Note**: XGBoost/mlxtend has a bug when trying to mix the packages. That is why we edit the names of the columns of `X_test` to `f0`, `f1`, etc. – XGBoost internally renames the features of `X_train`, and if we don't provide the exact same names, it will not run. We also use `shuffle=False` and `random_state=42` for the sake of reproducing the same results displayed in this article.

```
stack = StackingCVRegressor(regressors=(ridge, lasso, svr, rf, lgbm, xgb),
                            meta_regressor=xgb, cv=12,
                            use_features_in_secondary=True,
                            store_train_meta_features=True,
                            shuffle=False,
                            random_state=42)

stack.fit(X_train, y_train)

X_test.columns = ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12']
pred = stack.predict(X_test)
score = r2_score(y_test, pred)
```

Finally, after solving some small tasks throughout the code, we get to a result:

```
0.9071
```

About $0.01$ better in the $r^2$ score – which makes sense. The dataset used here is highly linear, which means it's *really* hard to capture any patterns that the baseline algorithms couldn't capture already. But it still demonstrates the effect of stacking, which is better results – even on highly linear datasets.

The possibilities for optimizing results on more complex datasets are much greater, and if you want to compete on Kaggle, I suggest picking up a version of model stacking. This is a rather simple case; on Kaggle, you will see stacking of 100+ models in a single layer. Gilberto and Stanislav on Kaggle optimized 33 models from scratch and combined them in a model stack to win the first-place award of \$5,000. That was some years back; today there are prizes of up to \$1,000,000 for detecting Deepfakes.

I thought, how can we angle "Web Scraping for Machine Learning", and I realized that Web Scraping should be essential to Data Scientists, Data Engineers and Machine Learning Engineers.

The Full Stack AI/ML Engineer toolkit needs to include web scraping, because it can improve predictions with new quality data. Machine Learning inherently requires data, and we would be most comfortable, if we have as much high quality data as possible. But what about when the data you need is not available as a dataset? What then? Do you just go and ask organizations and hope that they kindly will deliver it to you for free?

The answer is: *you collect, label and store it yourself.*

I made a GitHub repository for scraping the data. I encourage you to try it out and scrape some data yourself, and even try to make an NLP or other Machine Learning project out of the scraped data.

In this article, we are going to web scrape Reddit – specifically, the /r/DataScience (and a little of /r/MachineLearning) subreddit. There will be no usage of the Reddit API, since we usually web scrape when an API is not available. Furthermore, you are going to learn to combine the knowledge of HTML, Python, Databases, SQL and datasets for Machine Learning. We are doing a small NLP sample project at the end, but this is only to showcase that you can pick up the dataset and create a model providing predictions.

- Web Scraping in Python - Beautiful Soup and Selenium
- Labelling Scraped Data
- Storing Scraped Data
- Small Machine Learning Project on Exported Dataset
- Further Reading

The first thing we need to do is install BeautifulSoup and Selenium for scraping, but for accessing the whole project (i.e. also the Machine Learning part), we need more packages.

Selenium is basically good for content that changes due to Javascript, while BeautifulSoup is great at capturing static HTML from pages. Both the packages can be downloaded to your environment by a simple pip or conda command.

Install all the packages from the GitHub repository (linked at the start):

```
pip install -r requirements.txt
```

Alternatively, if you are using Google Colab, you can run the following to install the packages needed:

```
!pip install -r https://github.com/casperbh96/Web-Scraping-Reddit/raw/master/requirements.txt
```

Next, you need to download the *chromedriver* and place it in the *core* folder of the downloaded repository from GitHub.

Scrape responsibly, please! Reddit might update their website and invalidate the current approach of scraping the data from the website. If this is used in production, you would really want to set up an email/SMS service, such that you get immediate notice when your web scraper fails.

This is for educational purposes only, please don't misuse or do anything illegal with the code. It's provided as-is, by the MIT license of the GitHub repository.

Scraping takes time. Remember that you have to open each page, letting it load, then scraping the needed data. It can really be a tedious process – even figuring out where to start gathering the data can be hard, or even figuring out exactly what data you want.

There are 5 main steps for scraping reddit data:

- Collecting links from a subreddit
- Finding the `<script>` tag from each link
- Turning collected data from links into Python dictionaries
- Getting the data from Python dictionaries
- Scraping comment data

Especially steps 2 and 5 will take you a long time, because they are the hardest to optimize.

An approximate benchmark:

- Step 2: about 1 second per post
- Step 5: $n/3$ seconds per post, where $n$ is the number of comments. Scraping 300 comments took about 100 seconds for me, but it varies with internet speed, CPU speed, etc.
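The rough benchmark above can be captured in a throwaway estimator. The figures come from my own runs, so treat them as loose assumptions rather than guarantees:

```python
def estimated_scrape_seconds(num_posts, num_comments):
    """Rough time estimate from the benchmarks above:
    ~1 second per post (step 2) plus ~n/3 seconds for n comments (step 5)."""
    return num_posts * 1 + num_comments / 3

# Example: 1 post with 300 comments -> roughly 101 seconds
estimate = estimated_scrape_seconds(1, 300)
```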

We are going to use 4 files, each containing one class, which I placed in the *core* folder.

Whenever you see SQL, SelScraper or BSS being called, it means we are calling a method from another class, e.g. `BSS.get_title()`. We are going to jump forward and over some lines of code, because about 1000 lines have been written (which is too much to explain here).

```
from core.selenium_scraper import SeleniumScraper
from core.soup_scraper import SoupScraper
from core.progress_bar import ProgressBar
from core.sql_access import SqlAccess

SQL = SqlAccess()
SelScraper = SeleniumScraper()
BSS = SoupScraper(reddit_home,
                  slash,
                  subreddit)
```

We start off with scraping the actual URLs from a given subreddit, defined at the start of `scraper.py`. What happens is that we open a browser in headless mode (without opening a window, basically running in the background), and we use some Javascript to scroll the page while collecting links to posts.

The next snippet runs a while loop until the variable `scroll_n_times` becomes zero:

```
while scroll_n_times:
    # Scrolls browser to the bottom of the page
    self.driver.execute_script(
        "window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(sleep_time)
    scroll_n_times -= 1

elements = self.driver.find_elements_by_xpath(xpath)

# Get the link from the href attribute
self.links = [tag.get_attribute('href') for tag in elements]
```

The following is the Javascript to scroll a page to the bottom:

```
window.scrollTo(0, document.body.scrollHeight);
```

After that, we use XPath to find all `<a>` tags in the HTML body, which really just returns all the links to us:

```
xpath = "//a[@data-click-id='body']"
elements = self.driver.find_elements_by_xpath(xpath)
self.links = [tag.get_attribute('href') for tag in elements]
```

At any point in time, if an exception happens, or we are done scrolling, we garbage collect the browser process, such that the program won't end up with 14 different Chrome browsers running:

```
try:
    # while loop code
    ...
finally:
    self.driver.quit()
```

Great, now we have a collection of links that we can start scraping, but how? Well, we start by...

Upon opening one of the collected links, Reddit provides us with a Javascript `<script>` element that contains all the data for each post. You won't be able to see it when visiting the page, since the data is loaded in, then removed.

But our great software package BeautifulSoup will! And for our convenience, Reddit marked the `<script>` tag with an id: `id=data`. This attribute makes it easy for us to find the element and capture all the data of the page.

First we specify the headers, which tell the website we are visiting which agent we are using (so as to avoid being detected as a bot). Next, we make a request and let BeautifulSoup get all the text from the page, i.e. all the HTML.

```
progress = ProgressBar(len(urls))

for url in urls:
    progress.update()

    headers = {'User-Agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    pure_html_data.append(r.text)
    pure_script_data.append(soup.find(id='data').text)
```

You see that last line: after we have told BeautifulSoup to load the page, we find the text of the script with `id='data'`. Since it's an id attribute, I knew that there would only be one element on the whole page with this attribute – hence this is pretty much a bulletproof approach of getting the data. Unless, of course, Reddit changes their site.

You might think that we could parallelize these operations, running through multiple URLs in the for loop at once, but it's bad practice and might get you banned – bigger sites especially will protect their servers from overload, which you can effectively cause by hammering URLs.
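A simple way to stay polite is to throttle sequential requests. The `throttle` helper below is hypothetical – it is not part of the repository – but it sketches the idea:

```python
import time

def throttle(last_request_time, min_interval=1.0):
    """Sleep just long enough so consecutive requests are at least
    min_interval seconds apart, then return the current timestamp."""
    elapsed = time.time() - last_request_time
    if elapsed < min_interval:
        time.sleep(min_interval - elapsed)
    return time.time()

# Usage sketch: call last = throttle(last) before each requests.get(url)
```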

From the last code piece, we get a string, which is in a valid JSON format. We want to convert this string into a Python dict, such that we can easily lookup the data from each link.

The basic approach is that we find where the JSON starts, via the index of the first left curly bracket. We find the end index with the rfind() method on the last right curly bracket – plus one, since Python slicing excludes the end index.

```
pure_dicts = []

print('Making Python dicts out of script data')

progress = ProgressBar(len(script_data))

for data in script_data:
    progress.update()

    first_index = data.index('{')
    last_index = data.rfind('}') + 1

    json_str = data[first_index:last_index]
    pure_dicts.append(json.loads(json_str))

return pure_dicts
```

Effectively, this gives us a Python dict of the whole Reddit data which is loaded into every post. This will be great for scraping all the data and storing it.

I defined one for loop, which iteratively scrapes different data. This makes it easy to maintain and find your mistakes.

As you can see, we get quite a lot of data – basically the whole post and comments, except for comments text, which we get later on.

```
progress = ProgressBar(len(links))

for i, current_data in enumerate(BSS.data):
    progress.update()

    BSS.get_url_id_and_url_title(BSS.urls[i],
                                 current_data, i)
    BSS.get_title()
    BSS.get_upvote_ratio()
    BSS.get_score()
    BSS.get_posted_time()
    BSS.get_author()
    BSS.get_flairs()
    BSS.get_num_gold()
    BSS.get_category()
    BSS.get_total_num_comments()
    BSS.get_links_from_post()
    BSS.get_main_link()
    BSS.get_text()
    BSS.get_comment_ids()
```

After we have collected all this data, we need to begin thinking about storing it. But we couldn't manage to scrape the text of the comments in the big for loop, so we will have to do that before we set up the inserts into the database tables.

This next code piece is quite long, but it's all you need. Here we collect **comment text, score, author, upvote points and depth**. Each comment can have subcomments, which is what depth specifies – e.g. depth is zero at each root comment.

```
for i, comment_url in enumerate(comment_id_links_array):
    author = None
    text = None
    score = None
    comment_id = array_of_comment_ids[i]

    # Open the url, find the div with classes Comment.t1_comment_id
    r = requests.get(comment_url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    div = soup.select('div.Comment.t1_{0}'.format(comment_id))

    # If it found the div, let's extract the text from the comment
    if div is not None and len(div) > 0:
        author = div[0].find_all('a')[0].get_text()
        spans = div[0].find_all("span")
        score = [span.get_text() for span in spans if 'point' in span.get_text()]

        html_and_text = div[0].find('div', attrs={'data-test-id': 'comment'})
        if html_and_text is not None:
            text = html_and_text.get_text()

        if len(score) == 0:
            score = None
        else:
            score = score[0]

    # Make useable array for insertion
    array_of_comment_data.append([None,
                                  None,
                                  str(comment_id),
                                  str(score),
                                  array_of_depth[i],
                                  str(array_of_next[i]),
                                  str(array_of_prev[i]),
                                  str(author),
                                  str(text)])

return array_of_comment_data
```

What I found working was opening each of the collected comment URLs from the earlier for loop, and basically scraping once again. This ensures that we get all the data scraped.

**Note:** this approach for scraping the comments can be really slow! This is probably the number one thing to improve – it's completely dependent on the number of comments on a post. Scraping >100 comments takes a pretty long time.

For the labelling part, we are mostly going to focus on tasks we can immediately finish with Python code, instead of the tasks that we cannot. For instance, labelling images found on Reddit is probably not feasible by a script, but actually has to be done manually.

For the SQL in this article, we use Snake Case for naming the features. An example of this naming convention is *my_feature_x*, i.e. we split words with underscores, and only lower case.

...But please:

If you are working in a company, look at the naming conventions they use and follow them. The last thing you want is a mix of styles, as it will just be confusing at the end of the day. Common naming conventions include camelCase and PascalCase.

For storing the collected and labelled data, I have specifically chosen to proceed with an *SQLite database* – it's way easier for smaller projects like web scraping. We don't have to install any drivers as for MySQL or MS SQL servers, and we don't even need to install any packages to use it, because it comes natively with Python.

Some consideration of data types has been given to the columns in the SQLite database, but there is room for improvement in the current design. I used some cheap varchars in the comment table to get around some storing problems. It currently does not give me any problems, but for the future, it should probably be updated.

The 1st normal form was mostly considered in the database design, i.e. we have separate tables for links and comments, to avoid duplicate rows in the post table. Further improvements include making a table for categories and flairs, which are currently stored as strings built from arrays.

Without further ado, let me present the database diagram for this project. It's quite simple.

In short: we have 3 tables. For each row with a post id in the post table, we can have multiple rows with the same post id in the comment and link table. This is how we link the post to all links and comments. The link exists because the post_id in the link and comment table has a foreign key on the post_id in the post table, which is the primary key.

I used an excellent tool for generating this database diagram, which I want to highlight – it's currently free and it's called https://dbdiagram.io/.

Play around with it, if you wish. This is not a paid endorsement of any sorts, just a shoutout to a great, free tool. Here is my code for the above diagram:

```
Table post as p {
  id int [pk]
  url varchar [not null]
  url_id varchar [not null]
  url_title varchar [not null]
  author varchar
  upvote_ratio uint8
  score int
  time_created datetime
  num_gold int
  num_comments int
  category varchar
  text varchar
  main_link varchar
  flairs int
}

Table link as l {
  id int [pk]
  post_id int
  link varchar [not null]
}

Table comment as c {
  id int [pk]
  post_id int
  comment_id varchar [not null]
  score varchar [not null]
  depth int [not null]
  next varchar
  previous varchar
  comment_author varchar
  text varchar
}

Ref: p.id < l.post_id
Ref: p.id < c.post_id
```

The databases and tables will be automatically generated by some code which I setup to run automatically, so we will not cover that part here, but rather, the part where we insert the actual data.
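Since the generation code is skipped, here is a condensed sketch of how the post-to-comment foreign key from the diagram could map to SQLite DDL. This is an assumption for illustration – the repository's actual schema has many more columns:

```python
import sqlite3

# Condensed sketch (not the repository's exact DDL) of the post/comment link:
# comment.post_id references post.id, so one post can own many comments.
conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE post (
        id INTEGER PRIMARY KEY,
        url VARCHAR NOT NULL
    );
    CREATE TABLE comment (
        id INTEGER PRIMARY KEY,
        post_id INTEGER,
        text VARCHAR,
        FOREIGN KEY (post_id) REFERENCES post (id)
    );
''')

# One post, two comments pointing back at it via post_id
conn.execute("INSERT INTO post (id, url) VALUES (1, 'https://example.com')")
conn.execute("INSERT INTO comment (post_id, text) VALUES (1, 'first'), (1, 'second')")
rows = conn.execute('SELECT COUNT(*) FROM comment WHERE post_id = 1').fetchone()
```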

The first step is creating and/or connecting to the database (which will either automatically generate the database and tables, and/or just connect to the existing database).

After that, we begin inserting data.

```
try:
    SQL.create_or_connect_db(erase_first=erase_db_first)

    # [0] = post, [1] = comment, [2] = link
    for i in range(len(BSS.post_data)):
        SQL.insert('post', data=BSS.post_data[i])
        SQL.insert('link', data=BSS.link_data[i])
        if scrape_comments:
            SQL.insert('comment', data=BSS.comment_data[i])
except Exception as ex:
    print(ex)
finally:
    SQL.save_changes()
```

Inserting the data happens in a for loop, as can be seen from the code snippet above. We specify the column and the data which we want to input.

For the next step, we need to get the number of columns in the table we are inserting into. From the number of columns, we create a string of comma-separated question marks – one placeholder for each column. This is how parameterized data is inserted in the SQL syntax.

The data is passed to a function I called `insert()`, and the data variable is an array in the form of a row. Basically, we already concatenated all the data into an array, and now we are ready to insert.
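The `_question_mark_creator` helper is not shown in the article; a minimal version (an assumption on my part – the repository's helper may differ, e.g. by handling the auto-incremented id column) could look like this:

```python
def question_mark_creator(num_cols):
    """Build the comma-separated '?' placeholder list for an SQL VALUES clause,
    one placeholder per column."""
    return ', '.join(['?'] * num_cols)

# e.g. question_mark_creator(3) produces the string used in VALUES (?, ?, ?)
```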

```
cols = c.execute(('''
    PRAGMA table_info({0})
''').format(table))

# Get the number of columns
num_cols = sum([1 for i in cols]) - 1

# Generate question marks for VALUES insertion
question_marks = self._question_mark_creator(num_cols)

if table == 'post':
    c.execute(('''INSERT INTO {0}
                  VALUES ({1})'''
               ).format(table, question_marks), data)
    self.last_post_id = c.lastrowid
elif (table == 'comment' or table == 'link') \
        and data is not None and data != []:
    # setting post_id to the last post id, inserted in the post insert
    for table_data in data:
        table_data[1] = self.last_post_id
        c.execute(('''INSERT INTO {0}
                      VALUES ({1})'''
                   ).format(table, question_marks), table_data)
```

This wraps up scraping and inserting into a database, but how about...

For this project, I made three datasets, one of which I used for a Machine Learning project in the next section of this article.

- A dataset with posts, comments and links data
- A dataset with the post only
- A dataset with comments only

For these three datasets, I made a Python file called `make_dataset.py` for creating the datasets and saving them using Pandas and some SQL queries.

For the first dataset, we used a left join from the SQL syntax (which I won't go into detail about), and it provides the dataset that we wish for. You do have to filter a lot of nulls if you want to use this dataset for anything, i.e. a lot of data cleaning, but once that's done, you can use the whole dataset.

```
all_data = pd.read_sql_query("""
    SELECT *
    FROM post p
    LEFT JOIN comment c
        ON p.id = c.post_id
    LEFT JOIN link l
        ON p.id = l.post_id;
""", c)

all_data.to_csv('data/post_comment_link_data.csv', columns=all_data.columns, index=False)
```

For the second and third datasets, a simple select all from table SQL query was made to make the dataset. This needs no further explaining.

```
post = pd.read_sql_query("""
    SELECT *
    FROM post;
""", c)

comment = pd.read_sql_query("""
    SELECT *
    FROM comment;
""", c)

post.to_csv('data/post_data.csv', columns=post.columns, index=False)
comment.to_csv('data/comment_data.csv', columns=comment.columns, index=False)
```

From the three generated datasets, I wanted to show you how to do a basic machine learning project.

The results are not amazing, but we are trying to classify each comment into four categories – **exceptional**, **good**, **average** and **bad** – all based on the upvotes on a comment.

Let's start! Firstly, we import the functions and packages we need, along with the dataset, which is the comments table. But we only import the score (upvotes) and comment text from that dataset.

```
import copy
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
pd.options.mode.chained_assignment = None
df = pd.read_csv('data/comment_data.csv', usecols=['score', 'text'])
```

The next thing we have to do is clean the dataset. Firstly, we start off with only getting the text by using a regular expression (regex). This removes any weird characters like \ | / & % etc.

The next thing we do concerns the score feature. The score feature is formatted as a string, and we just need the number from that string – while keeping the minus in front, if a comment has been downvoted a lot. Another regex handles this.

The next ugly thing from our Python script is a `None` as a string. We replace this string with `np.nan`, such that we can run `df.dropna()`.

The last thing we need to do is convert the score to a float, since that is required later.

```
def prepare_data(df):
    # Remove everything except alphanumeric characters
    df.text = df.text.str.replace(r'[^a-zA-Z\s]', '')

    # Get only numbers, but allow minus in front
    df.score = df.score.str.extract(r'(^-?[0-9]*\S+)')

    # Remove rows with None as string
    df.score = df.score.replace('None', np.nan)

    # Remove all None
    df = df.dropna()

    # Convert score feature from string to float
    score = df.score.astype(float)
    df.score = copy.deepcopy(score)

    return df

df = prepare_data(df)
```

The next part is defining how our classification is going to work, so we have to adapt the data. An easy way is using percentiles (perhaps not an ideal way).

For this, we find the fiftieth, seventy-fifth and ninety-fifth quantiles of the data and bucket each score by the highest quantile it exceeds. We replace the score feature with this new categorical feature.

```
def score_to_percentile(df):
    second = df.score.quantile(0.50)  # Average
    third = df.score.quantile(0.75)   # Good
    fourth = df.score.quantile(0.95)  # Exceptional

    new_score = []
    for row in df.score:
        if row > fourth:
            new_score.append('exceptional')
        elif row > third:
            new_score.append('good')
        elif row > second:
            new_score.append('average')
        else:
            new_score.append('bad')

    df.score = new_score

    return df

df = score_to_percentile(df)
```

We need to split the data and tokenize the text, which we proceed to do in the following code snippet. There is really not much magic happening here, so let's move on.

```
def df_split(df):
    y = df[['score']]
    X = df.drop(['score'], axis=1)

    content = [' ' + comment for comment in X.text.values]
    X = CountVectorizer().fit_transform(content).toarray()

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)

    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = df_split(df)
```

The last part of this machine learning project tutorial is making predictions and scoring the algorithm we choose to go with.

Logistic Regression was used, and perhaps we did not get the best score, but this is merely a boilerplate for future improvement and use.

All we do here is fit the logistic regression model to the training data, make a prediction and then score how well the model predicted.

```
lr = LogisticRegression(C=0.05, solver='lbfgs', multi_class='multinomial')
lr.fit(X_train, y_train)

pred = lr.predict(X_test)
score = accuracy_score(y_test, pred)

print("Accuracy: {0}".format(score))
```

In our case, the predictions were not that great, as the accuracy turned out to be $0.59$.

Future work includes:

- Fine-tuning different algorithms
- Trying different metrics
- Better data cleaning

For this last section, I want to link to some important documentation for web scraping and SQL in Python.

TensorFlow is inevitably the package to use for Deep Learning, if you are doing any sort of business. Keras is the standard API in TensorFlow and the easiest way to implement neural networks. Deployment is much easier, compared to PyTorch – so unless you are doing research, TensorFlow is most likely the way to go.

And even then, you should go with TensorFlow because your models will be easier for the industry to adopt in production.

**The most important parts of this article is at the end**, so stick around! I will show you how to use TensorFlow functions and also how to make a custom training and testing class.

TensorFlow also seems to be much more popular than PyTorch:

- TensorFlow: ~68k repositories on GitHub
- PyTorch: ~29k repositories on GitHub

It's possible to find all the documentation for TensorFlow on this link.

This article can be followed solely based on the Colab notebook provided here. Please comment on this post if there is any confusion. With that said, let's jump right into TensorFlow version 2.0.

- New Features in TensorFlow 2.0
- Verify Eager Execution and Find GPU Devices
- Common Use Operations
- Linear Algebra Operations
- Calculating Gradients with Gradient Tape
- Functions in TensorFlow with tf.function
- Custom Train and Test Functions for Neural Network

TensorFlow 2.0 is mostly a marketing move and some cleanup in the TensorFlow API. Nevertheless, whenever you consider doing deep learning and want to deploy a model, you will find yourself using TensorFlow.

Let's start off with a simple way to install / upgrade both the CPU and GPU version of TensorFlow in one line of code. This is not default in the popular Google Colab app yet, but it's rumored to arrive soon.

```
!pip install --upgrade tensorflow-gpu
```

All of the upcoming code in this article presumes that you have imported the tensorflow package in your Python program.

```
import tensorflow as tf
```

You should verify that you are running the correct version, TensorFlow 2.0, with the first line of code. All it does is call `__version__` from TensorFlow.

```
print(('Your TensorFlow version: {0}').format(tf.__version__))
> Your TensorFlow version: 2.0.0
```

Eager execution means that the interpreter executes operations line by line, making debugging much easier and faster. There is also some cleanup in how graphs are made, which makes it fairly simple – in previous TensorFlow versions, you needed to manually make a graph.

This is actually huge, because you reduce the training code from this.

```
with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    model.fit(X_train, Y_train,
              validation_data=(X_val, Y_val),
              epochs=50, batch_size=32)
```

To that.

```
model.fit(X_train, Y_train,
          validation_data=(X_val, Y_val),
          epochs=50, batch_size=32)
```

There is no need for sessions or any of those TensorFlow variables, this is just regular Python code executing. It's nice.

Here is the official word on the new version of TensorFlow with regards to Eager Execution:

> TensorFlow 1.X requires users to manually stitch together an abstract syntax tree (the graph) by making `tf.*` API calls. It then requires users to manually compile the abstract syntax tree by passing a set of output tensors and input tensors to a `session.run()` call. TensorFlow 2.0 executes eagerly (like Python normally does) and in 2.0, graphs and sessions should feel like implementation details.
>
> One notable byproduct of eager execution is that `tf.control_dependencies()` is no longer required, as all lines of code execute in order (within a `tf.function`, code with side effects execute in the order written).

The new eager execution feature is actually a great move for TensorFlow, as it gets confusing when you can't immediately evaluate your code, just like in all your other Python code.

Eager execution is the big new feature that allows for many things, as explained earlier – but let's just make sure that we are actually running in eager execution mode.

And while we are at it, we should check for which devices we want to run our code on – after all, GPUs are *way* faster than CPUs when it comes to Deep Learning tasks.

To verify whether you are running eager execution or not, I have made a small if-else statement that will tell you if you are.

- It tells you whether you are running eager execution, and how you can turn it off, if you wish to.
- If you are not running eager execution, it shows how to enable it manually – or you could just try upgrading your TensorFlow version.

```
if tf.executing_eagerly():
    print('Eager execution is enabled (running operations immediately)\n')
    print(('Turn eager execution off by running: \n{0}\n{1}').format(
        'from tensorflow.python.framework.ops import disable_eager_execution',
        'disable_eager_execution()'))
else:
    print('You are not running eager execution. TensorFlow version >= 2.0.0 '
          'has eager execution enabled by default.')
    print(('Turn on eager execution by running: \n\n{0}\n\nOr upgrade '
           'your tensorflow version by running:\n\n{1}').format(
        'tf.compat.v1.enable_eager_execution()',
        '!pip install --upgrade tensorflow\n'
        '!pip install --upgrade tensorflow-gpu'))
```

This should print the following, if you are running eager execution and followed this article along. If you have TensorFlow 2.0, then you are running eager execution by default.

```
Eager execution is enabled (running operations immediately)
Turn eager execution off by running:
from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()
```

Let's say we are interested in knowing if we have a GPU device available – or if we know there is a GPU in our machine, we can test if TensorFlow recognizes that it exists. If not, then perhaps you should try and reinstall CUDA and cuDNN.

```
print(('Is your GPU available for use?\n{0}').format(
    'Yes, your GPU is available: True' if tf.test.is_gpu_available() else 'No, your GPU is NOT available: False'
))

print(('\nYour devices that are available:\n{0}').format(
    [device.name for device in tf.config.experimental.list_physical_devices()]
))

# A second method for getting devices:
# from tensorflow.python.client import device_lib
# print([device.name for device in device_lib.list_local_devices() if device.name != None])
```

My expected output would be that there should at least be a CPU available, and a GPU if you are running it in Google Colab – if no GPU shows up in Google Colab then you need to go to Edit > Notebook Settings > Hardware Accelerator and pick GPU.

As expected, we indeed have a CPU and GPU available in Google Colab:

```
Is your GPU available for use?
Yes, your GPU is available: True
Your devices that are available:
['/physical_device:CPU:0', '/physical_device:XLA_CPU:0', '/physical_device:XLA_GPU:0', '/physical_device:GPU:0']
```

Great, we know we have a GPU available called `GPU:0`. But how do we explicitly use it? First, you should know that TensorFlow by default uses your GPU where it can (not every operation can run on the GPU).

But if you want to be absolutely certain that your code is executed on the GPU, here is a code piece comparing the time spent using the CPU versus the GPU. The simple operations here are creating a constant with `tf.constant` and an identity matrix with `tf.eye`, which we will discuss later in the Linear Algebra section.

```
import time

cpu_slot = 0
gpu_slot = 0

# Using CPU at slot 0
with tf.device('/CPU:' + str(cpu_slot)):
    # Starting a timer
    start = time.time()

    # Doing operations on CPU
    A = tf.constant([[3, 2], [5, 2]])
    print(tf.eye(2, 2))

    # Printing how long it took with CPU
    end = time.time() - start
    print(end)

# Using the GPU at slot 0
with tf.device('/GPU:' + str(gpu_slot)):
    # Starting a timer
    start = time.time()

    # Doing operations on GPU
    A = tf.constant([[3, 2], [5, 2]])
    print(tf.eye(2, 2))

    # Printing how long it took with GPU
    end = time.time() - start
    print(end)
```

For a small operation like this, we get that the CPU version ran for $0.00235$ seconds, while the GPU version ran for $0.0018$ seconds.

```
tf.Tensor(
[[1. 0.]
[0. 1.]], shape=(2, 2), dtype=float32)
0.0023527145385742188
tf.Tensor(
[[1. 0.]
[0. 1.]], shape=(2, 2), dtype=float32)
0.0018095970153808594
```

Note that how long it takes will vary each time, but the GPU should always outperform in these types of tasks. We could easily imagine how much this would help us with larger computations. In particular, when there are millions or billions of operations executed on a GPU, we do see a significant speed up of neural networks – always use a GPU, if available.

Let me introduce the bread and butter of TensorFlow, the most commonly used operations. We are going to take a look at the following:

- Making tensors with `tf.constant` and `tf.Variable`
- Concatenating two tensors with `tf.concat`
- Making tensors with `tf.zeros` or `tf.ones`
- Reshaping data with `tf.reshape`
- Casting tensors to other data types with `tf.cast`

Perhaps one of the simplest operations in TensorFlow is making a constant or variable: you simply call `tf.constant` or `tf.Variable` and specify an array of arrays.

```
# Making a constant tensor A, that does not change
A = tf.constant([[3, 2],
                 [5, 2]])

# Making a Variable tensor VA, which can change. Notice it's .Variable
VA = tf.Variable([[3, 2],
                  [5, 2]])

# Making another tensor B
B = tf.constant([[9, 5],
                 [1, 3]])
```

This code piece gives us three tensors; the constant A, the variable VA and the constant B.

Let's say that we have two tensors, perhaps two observations, and we want to concatenate the two tensors A and B into a single variable in Python – how do we do it? We simply use `tf.concat`, and specify the values and axis.

```
# Making a constant tensor A, that does not change
A = tf.constant([[3, 2],
                 [5, 2]])

# Making a Variable tensor VA, which can change. Notice it's .Variable
VA = tf.Variable([[3, 2],
                  [5, 2]])

# Making another tensor B
B = tf.constant([[9, 5],
                 [1, 3]])

# Concatenate columns
AB_concatenated = tf.concat(values=[A, B], axis=1)
print(('Adding B\'s columns to A:\n{0}').format(
    AB_concatenated.numpy()
))

# Concatenate rows
AB_concatenated = tf.concat(values=[A, B], axis=0)
print(('\nAdding B\'s rows to A:\n{0}').format(
    AB_concatenated.numpy()
))
```

The first output concatenates column-wise with `axis=1` and the second concatenates row-wise with `axis=0` – meaning we add the data either rightwards (columns) or downwards (rows).

```
Adding B's columns to A:
[[3 2 9 5]
[5 2 1 3]]
Adding B's rows to A:
[[3 2]
[5 2]
[9 5]
[1 3]]
```

Creating tensors with just `tf.constant` and `tf.Variable` can be tedious if you want to create big tensors. When you need a large tensor of a fixed shape – say, a placeholder to fill in later – `tf.zeros` or `tf.ones` do the job (for actual random noise, you would reach for `tf.random.normal` instead).

All we need to specify is the shape in the format `shape=[rows, columns]` and a dtype, if it matters at all. The number of rows and columns is arbitrary, and you could in principle create 4K images this way.

```
# Making a tensor filled with zeros. shape=[rows, columns]
tensor = tf.zeros(shape=[3, 4], dtype=tf.int32)
print(('Tensor full of zeros as int32, 3 rows and 4 columns:\n{0}').format(
    tensor.numpy()
))

# Making a tensor filled with ones with data type of float32
tensor = tf.ones(shape=[5, 3], dtype=tf.float32)
print(('\nTensor full of ones as float32, 5 rows and 3 columns:\n{0}').format(
    tensor.numpy()
))
```

The output of this code piece will be the following.

```
Tensor full of zeros as int32, 3 rows and 4 columns:
[[0 0 0 0]
[0 0 0 0]
[0 0 0 0]]
Tensor full of ones as float32, 5 rows and 3 columns:
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]
```

We might have generated some random noise or have a dataset of images in different sizes, which needs to be one-dimensional in order to fit into some filter or convolution.

We could use tf.reshape to *reshape* the images in whichever way we want. All we do here is define a tensor, and then reshape it into 8 columns with 1 row, instead of 2 columns with 4 rows.

```
# Making a tensor for reshaping
tensor = tf.constant([[3, 2],
                      [5, 2],
                      [9, 5],
                      [1, 3]])

# Reshaping the tensor into a shape of: shape = [rows, columns]
reshaped_tensor = tf.reshape(tensor = tensor,
                             shape = [1, 8])

print(('Tensor BEFORE reshape:\n{0}').format(
    tensor.numpy()
))
print(('\nTensor AFTER reshape:\n{0}').format(
    reshaped_tensor.numpy()
))
```

This produces the following result.

```
Tensor BEFORE reshape:
[[3 2]
[5 2]
[9 5]
[1 3]]
Tensor AFTER reshape:
[[3 2 5 2 9 5 1 3]]
```
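
As a side note, you don't have to spell out every dimension. To the best of my knowledge, `tf.reshape` (like NumPy's `reshape`, used here so the sketch runs without TensorFlow) accepts `-1` for one dimension and infers it from the total number of elements:

```python
import numpy as np

# The same 4x2 tensor as above, as a NumPy array
tensor = np.array([[3, 2],
                   [5, 2],
                   [9, 5],
                   [1, 3]])

# -1 tells reshape to infer this dimension from the element count (here: 8)
reshaped = tensor.reshape(1, -1)
print(reshaped)  # [[3 2 5 2 9 5 1 3]]
```

This is handy when the total size is known but one dimension depends on the batch.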

Some functions in TensorFlow and Keras require specific data types as inputs, and we can convert with `tf.cast`. If you mostly have integers, you will probably find yourself casting from integer values to float values.

We can simply make a tensor with the datatype float32, and then cast it to int32 – this drops the decimals entirely, truncating rather than rounding up or down.

```
# Making a tensor
tensor = tf.constant([[3.1, 2.8],
                      [5.2, 2.3],
                      [9.7, 5.5],
                      [1.1, 3.4]],
                      dtype=tf.float32)

tensor_as_int = tf.cast(tensor, tf.int32)

print(('Tensor with floats:\n{0}').format(
    tensor.numpy()
))
print(('\nTensor cast from float to int (just remove the decimal, no rounding):\n{0}').format(
    tensor_as_int.numpy()
))
```

The output of this code piece simply strips the decimals from the original tensor, yielding a new integer tensor – a successful conversion from float to int.

```
Tensor with floats:
[[3.1 2.8]
[5.2 2.3]
[9.7 5.5]
[1.1 3.4]]
Tensor cast from float to int (just remove the decimal, no rounding):
[[3 2]
[5 2]
[9 5]
[1 3]]
```
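
One caveat worth knowing: the truncation is toward zero, so negative values lose their decimals upward rather than downward. A quick sketch with NumPy, whose `astype` behaves the same way as `tf.cast` here as far as I know:

```python
import numpy as np

t = np.array([1.9, -1.9], dtype=np.float32)

# Both values are truncated toward zero: 1.9 -> 1, -1.9 -> -1
print(t.astype(np.int32))  # [ 1 -1]
```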

Many algorithms and research projects need these operations to implement ideas and try new things, e.g. making small changes in activation functions or optimizers. You will encounter some of these operations in my linear algebra series.

- Transposing a tensor with `tf.transpose`
- Matrix multiplication with `tf.matmul`
- Element-wise multiplication with `tf.multiply`
- Identity matrix with `tf.eye`
- Determinant with `tf.linalg.det`
- Dot product with `tf.tensordot`

Suppose we want to do linear algebra operations; then the `tf.transpose` function comes in handy.

```
# Some Matrix A
A = tf.constant([[3, 7],
                 [1, 9]])

A = tf.transpose(A)

print(('The transposed matrix A:\n{0}').format(
    A
))
```

This produces $A^T$, i.e. it produces the transposed matrix of A.

```
The transposed matrix A:
[[3 1]
[7 9]]
```

Many algorithms require matrix multiplication, and this is easy in TensorFlow with the `tf.matmul` function.

All we do here is define two matrices (one is a vector) and use `tf.matmul` to do matrix multiplication.

```
# Some Matrix A
A = tf.constant([[3, 7],
                 [1, 9]])

# Some vector v
v = tf.constant([[5],
                 [2]])

# Matrix multiplication Av
Av = tf.matmul(A, v)

print(('Matrix Multiplication of A and v results in a new Tensor:\n{0}').format(
    Av
))
```

Using `tf.matmul` on A and v, we get the following.

```
Matrix Multiplication of A and v results in a new Tensor:
[[29]
[23]]
```
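
We can verify this by hand; each entry is a row of $A$ dotted with $v$:

$$
Av = \begin{bmatrix} 3 \cdot 5 + 7 \cdot 2 \\ 1 \cdot 5 + 9 \cdot 2 \end{bmatrix} = \begin{bmatrix} 29 \\ 23 \end{bmatrix}
$$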

Element-wise multiplication comes up in many instances, especially in optimizers. Reusing the tf.constants from before, such that we can compare the two, we simply use tf.multiply instead.

```
# Element-wise multiplication
Av = tf.multiply(A, v)

print(('Element-wise multiplication of A and v results in a new Tensor:\n{0}').format(
    Av
))
```

And the outcome will be the following.

```
Element-wise multiplication of A and v results in a new Tensor:
[[15 35]
[ 2 18]]
```
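
Note that $A$ is $2 \times 2$ while $v$ is $2 \times 1$, so this result relies on broadcasting: $v$ is stretched across the columns of $A$ before multiplying. NumPy follows the same broadcasting rules, so we can sketch the check without TensorFlow:

```python
import numpy as np

A = np.array([[3, 7],
              [1, 9]])
v = np.array([[5],
              [2]])

# v is broadcast along the columns: row 0 is scaled by 5, row 1 by 2
print(A * v)  # [[15 35]
              #  [ 2 18]]
```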

In linear algebra, the identity matrix is a square matrix with ones along the diagonal and zeros everywhere else – and if you multiply the identity matrix with a matrix A, the result is the matrix A itself.

We simply define a tensor A, get the rows and columns and make an identity matrix.

```
# Some Matrix A
A = tf.constant([[3, 7],
                 [1, 9],
                 [2, 5]])

# Get number of dimensions
rows, columns = A.shape
print(('Get rows and columns in tensor A:\n{0} rows\n{1} columns').format(
    rows, columns
))

# Making identity matrix
A_identity = tf.eye(num_rows = rows,
                    num_columns = columns,
                    dtype = tf.int32)
print(('\nThe identity matrix of A:\n{0}').format(
    A_identity.numpy()
))
```

The output of the above code is the following.

```
Get rows and columns in tensor A:
3 rows
2 columns
The identity matrix of A:
[[1 0]
[0 1]
[0 0]]
```
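
The property mentioned above – multiplying by the identity leaves a matrix unchanged – holds for a square identity matrix. A quick check, with NumPy so the snippet runs standalone:

```python
import numpy as np

A = np.array([[3, 7],
              [1, 9]])
I = np.eye(2, dtype=int)  # 2x2 identity matrix

# I @ A leaves A unchanged
print(np.array_equal(I @ A, A))  # True
```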

The determinant can be used to solve systems of linear equations, and it captures how a matrix scales areas (or volumes) under transformation.

We make a matrix A, then cast it to float32, because `tf.linalg.det` does not take integers as input. Then we find the determinant of A.

```
# Reusing Matrix A
A = tf.constant([[3, 7],
                 [1, 9]])

# Determinant must be: half, float32, float64, complex64, complex128
# Thus, we cast A to the data type float32
A = tf.dtypes.cast(A, tf.float32)

# Finding the determinant of A
det_A = tf.linalg.det(A)

print(('The determinant of A:\n{0}').format(
    det_A
))
```

It turns out the output is around 20:

```
The determinant of A:
20.000001907348633
```
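
For a $2 \times 2$ matrix this is easy to check by hand, since $\det A = ad - bc$:

$$
\det A = 3 \cdot 9 - 7 \cdot 1 = 27 - 7 = 20
$$

The extra decimals in the output are just float32 rounding error.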

Dotting one tensor onto another is perhaps one of the most common linear algebra operations. Hence, we should at least know how to find the dot product of two tensors in TensorFlow.

We just need to instantiate two constants, and then we can dot them together – note that in this instance, tf.tensordot is the same as tf.matmul, but there are differences outside the scope of this article.

```
# Defining a 3x3 matrix
A = tf.constant([[32, 83, 5],
                 [17, 23, 10],
                 [75, 39, 52]])

# Defining another 3x3 matrix
B = tf.constant([[28, 57, 20],
                 [91, 10, 95],
                 [37, 13, 45]])

# Finding the dot product
dot_AB = tf.tensordot(a=A, b=B, axes=1).numpy()
print(('Dot product of A and B results in a new Tensor:\n{0}').format(
    dot_AB
))

# Which is the same as matrix multiplication in this instance (axes=1)
# Matrix multiplication of A and B
AB = tf.matmul(A, B)
print(('\nMatrix Multiplication of A and B results in a new Tensor:\n{0}').format(
    AB
))
```

The result is as follows – quite large numbers, as expected.

```
Dot product of A and B results in a new Tensor:
[[8634 2719 8750]
[2939 1329 2975]
[7573 5341 7545]]
Matrix Multiplication of A and B results in a new Tensor:
[[8634 2719 8750]
[2939 1329 2975]
[7573 5341 7545]]
```
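
For a taste of what other `axes` values do – sketched with NumPy, whose `tensordot` mirrors `tf.tensordot` as far as I know – `axes=0` gives an outer product, `axes=1` is ordinary matrix multiplication, and `axes=2` contracts both dimensions down to a single number:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# axes=1 matches matrix multiplication
print(np.array_equal(np.tensordot(A, B, axes=1), A @ B))  # True

# axes=0 is the outer product: every element of A times every element of B
print(np.tensordot(A, B, axes=0).shape)  # (2, 2, 2, 2)

# axes=2 contracts everything: the sum of all element-wise products
print(np.tensordot(A, B, axes=2))  # 70
```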

Let's make an example of the newer **GELU** activation function, used in OpenAI's GPT-2 and Google's BERT.

The GELU function:

$$
\text{GELU}(x) = 0.5x\left(1+\text{tanh}\left(\sqrt{2/\pi}(x+0.044715x^3)\right)\right)
$$

GELU differentiated:

$$
\text{GELU}'(x) = 0.5\text{tanh}(0.0356774x^3 + 0.797885 x) + (0.0535161 x^3 + 0.398942 x) \text{sech}^2(0.0356774x^3+0.797885x)+0.5
$$

If we input $x=0.5$ into the differentiated GELU function, we get the following result:

$$
\text{GELU}'(0.5) = 0.5\tanh(0.0356774 \cdot 0.5^3 + 0.797885 \cdot 0.5) + (0.0535161 \cdot 0.5^3 + 0.398942 \cdot 0.5)\,\text{sech}^2(0.0356774 \cdot 0.5^3 + 0.797885 \cdot 0.5) + 0.5 = 0.867370
$$

When we plot the differentiated GELU function, it looks like this:

Let's just code this into an example in TensorFlow.

First, we define the activation function; we chose the GELU activation function `gelu()`. Then we define a `get_gradient()` function which uses the Gradient Tape from TensorFlow.

The Gradient Tape is the important part, since it automatically differentiates and records the gradient of any operation indented under `tf.GradientTape() as gt`. After execution, we use the tape's gradient function `gt.gradient()` to retrieve the recorded gradient for the target *y* from the source *x*.

```
import math

def gelu(x):
    return 0.5*x*(1+tf.tanh(tf.sqrt(2/math.pi)*(x+0.044715*tf.pow(x, 3))))

def get_gradient(x, activation_function):
    with tf.GradientTape() as gt:
        y = activation_function(x)

    gradient = gt.gradient(y, x).numpy()

    return gradient

x = tf.Variable(0.5)
gradient = get_gradient(x, gelu)

print('{0} is the gradient of GELU with x={1}'.format(
    gradient, x.numpy()
))
```

The output will be the following – notice that the output is the same as what we calculated at the start, just with more decimals.

```
0.8673698902130127 is the gradient of GELU with x=0.5
```

TensorFlow Functions with the `@tf.function` annotation offer a significant speedup, because TensorFlow uses AutoGraph to convert functions to graphs, which in turn run faster.

The annotation takes the normal Python syntax and converts it into a graph – and it has minimal side effects, which means we should always use it, especially when training and testing neural network models.

All that is done here is making an image and running it through `conv_layer` and `conv_fn`, then finding the difference.

```
import timeit

conv_layer = tf.keras.layers.Conv2D(100, 3)

@tf.function
def conv_fn(image):
    return conv_layer(image)

image = tf.zeros([1, 200, 200, 100])

# Warm up
conv_layer(image); conv_fn(image)

no_tf_fn = timeit.timeit(lambda: conv_layer(image), number=10)
with_tf_fn = timeit.timeit(lambda: conv_fn(image), number=10)
difference = no_tf_fn - with_tf_fn

print("Without tf.function: ", no_tf_fn)
print("With tf.function: ", with_tf_fn)
print("The difference: ", difference)

print("\nJust imagine when we have to do millions/billions of these calculations," \
      " then the difference will be HUGE!")
print("Difference times a billion: ", difference*1000000000)
```

As we can see, the difference is there. Maybe not for such few operations, but one could imagine how it scales – hint: it scales quite well.

```
Without tf.function: 0.005995910000024196
With tf.function: 0.005338444000017262
The difference: 0.0006574660000069343
Just imagine when we have to do millions/billions of these calculations, then the difference will be HUGE!
Difference times a billion: 657466.0000069344
```

For this part, we are going to be following a heavily modified version of the tutorial from TensorFlow's documentation.

Remember that all of the code for this article is also available on GitHub, with a Colab link for you to run it immediately.

For the first part, we just have some imports that we need for later. We also specify that the backend should run float64 in layers by default.

```
from __future__ import absolute_import, division, print_function, unicode_literals
from tensorflow.keras.layers import Dense, Flatten, Conv2D
from tensorflow.keras import Model
tf.keras.backend.set_floatx('float64')
mnist = tf.keras.datasets.mnist
```

In this next snippet, all we do is load and preprocess the data.

```
# Load data & normalize pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Add a channels dimension
x_train = x_train[..., tf.newaxis]
x_test = x_test[..., tf.newaxis]

train_ds = tf.data.Dataset.from_tensor_slices(
    (x_train, y_train)).shuffle(10000).batch(32)
test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(32)
```

Now we make a class, which starts here, and each function will be described in its own little code piece.

If you don't know what an `__init__()` function does: it's called a constructor, and a constructor runs the code in its body every time you instantiate (explained later) a new object of that class. The first step here is using the `super()` function to run the constructor of the superclass. All other code is a standard approach; we just define some variables and layers, like convolutions and dense layers. When we use `self.`, we assign a variable to the instance of the class, such that we can use `self.conv1` in other methods and access it on an instance (e.g. `model.conv1`) outside the class.

```
class MyModel(Model):
    def __init__(self,
                 loss_object,
                 optimizer,
                 train_loss,
                 train_metric,
                 test_loss,
                 test_metric):
        '''
        Setting all the variables for our model.
        '''
        super(MyModel, self).__init__()
        self.conv1 = Conv2D(32, 3, activation='relu')
        self.flatten = Flatten()
        self.d1 = Dense(128, activation='relu')
        self.d2 = Dense(10, activation='softmax')

        self.loss_object = loss_object
        self.optimizer = optimizer
        self.train_loss = train_loss
        self.train_metric = train_metric
        self.test_loss = test_loss
        self.test_metric = test_metric
```

The next function defines the architecture for our neural network, hence the name `nn_model()`. We just run the input *x* through the model's layers when it's called. One smaller exercise, if you are just getting started with Python/TensorFlow, would be to remove the function nn_model and provide it as an input when instantiating the class. Remember to replace references with the new name you give it.

```
    def nn_model(self, x):
        '''
        Defining the architecture of our model. This is where we run
        through our whole dataset and return it, when training and
        testing.
        '''
        x = self.conv1(x)
        x = self.flatten(x)
        x = self.d1(x)
        return self.d2(x)
```

Watch closely – a lot happens in the next function. First of all, we annotated the function with `@tf.function` for as much of a speedup as possible.

As explained earlier, `tf.GradientTape()` records gradients onto a variable `tape`, which we can access afterwards. The training goes like this:

- Make predictions and call the object holding the loss function with our data and predictions. While this is happening, gradients were automatically recorded.
- Get the gradients from the gradient tape and apply them using the update rule from the optimizer picked (we will look at inputting these functions and variables later).

```
    @tf.function
    def train_step(self, images, labels):
        '''
        This is a TensorFlow function, run once for each epoch for the
        whole input. We move forward first, then calculate gradients
        with Gradient Tape to move backwards.
        '''
        with tf.GradientTape() as tape:
            predictions = self.nn_model(images)
            loss = self.loss_object(labels, predictions)
        gradients = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(
            gradients, self.trainable_variables))

        self.train_loss(loss)
        self.train_metric(labels, predictions)
```

This next function is just a test step, used to evaluate the last training step. It is almost identical to the `train_step()` function, except that there are no gradients and updates.

```
    @tf.function
    def test_step(self, images, labels):
        '''
        This is a TensorFlow function, run once for each epoch for the
        whole input.
        '''
        predictions = self.nn_model(images)
        t_loss = self.loss_object(labels, predictions)

        self.test_loss(t_loss)
        self.test_metric(labels, predictions)
```

The next function ties the whole class together into one function, with three for loops. Later on, we define how many epochs (iterations) we want the neural networks to train and test for – and then for each iteration, we run through each observation.

Afterwards, we can see how well we optimized our loss function and metric. We just keep running this from $0$ to $n$ epochs. This concludes the class MyModel. Have a close look at the three for loops, as that is where all the action is happening.

```
    def fit(self, train, test, epochs):
        '''
        This fit function runs training and testing.
        '''
        for epoch in range(epochs):
            for images, labels in train:
                self.train_step(images, labels)

            for test_images, test_labels in test:
                self.test_step(test_images, test_labels)

            template = 'Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, Test Accuracy: {}'
            print(template.format(epoch+1,
                                  self.train_loss.result(),
                                  self.train_metric.result()*100,
                                  self.test_loss.result(),
                                  self.test_metric.result()*100))

            # Reset the metrics for the next epoch
            self.train_loss.reset_states()
            self.train_metric.reset_states()
            self.test_loss.reset_states()
            self.test_metric.reset_states()
```

For the next snippet of code, we simply define all the variables and functions we need for a neural network to run – a loss function, optimizer and metric.

```
# Make a loss object
loss_object = tf.keras.losses.SparseCategoricalCrossentropy()
# Select the optimizer
optimizer = tf.keras.optimizers.Adam()
# Specify metrics for training
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_metric = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
# Specify metrics for testing
test_loss = tf.keras.metrics.Mean(name='test_loss')
test_metric = tf.keras.metrics.SparseCategoricalAccuracy(name='test_accuracy')
```

We take the loss function, optimizer and metrics, and we input them into MyModel by instantiating the class with these variables. So when we call `MyModel()` with all these parameters, we actually run the `__init__` function in the MyModel class.

As mentioned earlier, we can call functions and variables from the instance of a class, so here we quite simply call the fit function with our training and testing dataset.

```
# Create an instance of the model
model = MyModel(loss_object = loss_object,
                optimizer = optimizer,
                train_loss = train_loss,
                train_metric = train_metric,
                test_loss = test_loss,
                test_metric = test_metric)

EPOCHS = 5

model.fit(train = train_ds,
          test = test_ds,
          epochs = EPOCHS)
```

This produces the following output in the console (which will change each time you run the training).

```
Epoch 1, Loss: 0.13490843153949827, Accuracy: 95.94166666666666, Test Loss: 0.06402905891434076, Test Accuracy: 97.86
Epoch 2, Loss: 0.043823116325043765, Accuracy: 98.64666666666668, Test Loss: 0.06146741438847755, Test Accuracy: 98.05
Epoch 3, Loss: 0.022285125361487735, Accuracy: 99.29666666666667, Test Loss: 0.056894636656289105, Test Accuracy: 98.3
Epoch 4, Loss: 0.013788788002398602, Accuracy: 99.52666666666666, Test Loss: 0.0621878185059347, Test Accuracy: 98.3
Epoch 5, Loss: 0.010066991776032834, Accuracy: 99.66000000000001, Test Loss: 0.06649907561390188, Test Accuracy: 98.33
```

Picking the right optimizer with the right parameters can help you squeeze the last bit of accuracy out of your neural network model. In this article, optimizers are explained from the classical to the newer approaches.

This post could be seen as a part three of how neural networks learn; in the previous posts, we have proposed the *update rule* as the one in gradient descent. Now we are exploring better and newer optimizers. If you want to know how we do a forward and backwards pass in a neural network, you would have to read the first part – especially how we calculate the gradient is covered in great detail.

If you are new to neural networks, you probably won't understand this post, if you don't read the first part.

I want to add, before explaining the different optimizers, that you really should read Sebastian Ruder's paper An overview of gradient descent optimization algorithms. It's a great resource that briefly describes many of the optimizers available today.

This is the basic algorithm responsible for having neural networks converge, i.e. we shift towards the optimum of the cost function. Multiple gradient descent algorithms exist, and I have mixed them together in previous posts. Here, I am not talking about batch (vanilla) gradient descent or mini-batch gradient descent.

The basic difference between batch gradient descent (BGD) and stochastic gradient descent (SGD) is that in SGD we only calculate the cost of one example for each step, while in BGD we have to calculate the cost over all training examples in the dataset. Naturally, this speeds up neural network training greatly, and exactly this is the motivation behind SGD.

The equation for SGD is used to update parameters in a neural network – we use the equation to update parameters in a backwards pass, using backpropagation to calculate the gradient $\nabla$:

$$
\theta = \theta - \eta \cdot
\overbrace{\nabla_\theta J(\theta; \, x, \, y)}^{\text{Backpropagation}}
$$

This is how the equation is presented formally, and here is what each symbol means:

- $\theta$ is a parameter (theta), e.g. your weights, biases and activations. Notice that we only update a single parameter for the neural network here, i.e. we could update a single weight.
- $\eta$ is the learning rate (eta), but also sometimes alpha $\alpha$ or gamma $\gamma$ is used.
- $\nabla$ is the gradient (nabla), which is taken of $J$. Gradient calculations is already explained extremely well in my other post.
- $J$ is formally known as objective function, but most often it's called cost function or loss function.

We take each parameter theta $\theta$ and update it by taking the original parameter $\theta$ and subtracting the learning rate $\eta$ times the rate of change $\nabla J(\theta)$.

$J(\theta;\, x, \, y)$ just means that we input our parameter theta $\theta$ along with a training example and label (e.g. a class). The semicolon is used to indicate that the parameter theta $\theta$ is different from the training example and label, which are separated by a comma.

Note that moving forward, the subscript $\theta$ in $\nabla_\theta$ will be left out for simplicity.

We can visualize what happens to a single weight $w$ in a cost function $C(w)$ (same as $J$). Naturally, what happens is that we find the derivative of the parameter $\theta$, which is $w$ in this case, and we update the parameter accordingly to the equation above.

Okay, we got some value theta $\theta$ and eta $\eta$ to work with. But what is that last thing in the equation, what does it mean? Let's expand into the equation from the prior post (which you should have read).

$$
\theta = \theta - \eta \cdot \nabla J(\theta; \, x, \, y)
\Leftrightarrow
\theta = \theta - \eta \cdot \frac{\partial C}{\partial \theta}
$$

Well this is now just a partial derivative, i.e. we find the cost function $C$, and inside that function, we find the derivative of theta $\theta$, but keep the rest of the function constant (we don't touch the rest). The assumption here is that our training example with a label is provided, which is why it was removed on the right side.

We could even replace some of the terms to make it more readable. Say we wanted to update a weight $w$, with the learning rate $0.3$ and a cost function C:

$$
w = w - 0.3 \cdot \frac{\partial C}{\partial w}
$$

Well, we assume that we know $w$, so the only thing stopping us from calculating the equation is the last term. But I won't go into that, since that was part of my last post.
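
To make the arithmetic concrete, here is the update with made-up numbers – the current weight and the gradient below are entirely hypothetical, only the learning rate $0.3$ comes from the example above:

```python
w = 2.0       # hypothetical current weight
grad = 1.5    # hypothetical gradient dC/dw from backpropagation
l_rate = 0.3  # the learning rate from the example above

# One SGD update step: w = w - eta * dC/dw
w = w - l_rate * grad
print(round(w, 4))  # 1.55
```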

Moving forward, note and remember that

$$
\nabla J(\theta)
=
\frac{\partial C}{\partial \theta}
$$

If you don't know what this means, perhaps you should visit my neural networks post, which explains in detail what backpropagation, gradients and partial derivatives mean.

For each parameter theta $\theta$, from $1$ to $j$, we update according to this equation.

$$
\theta_j = \theta_j - \eta \cdot
\overbrace{\frac{\partial C}{\partial \theta_j}}^{\text{Backprop}}
$$

Usually, this equation is wrapped in a *repeat until convergence* loop, i.e. we update each parameter, for each training example, until we reach a local minimum.

Running through the dataset multiple times is common; each pass is called an *epoch*, and for each epoch, we should randomly shuffle the data – this is the stochasticity of the algorithm.

Say we want to translate this into some pseudo code. This is relatively easy, except that we will leave out the function for calculating gradients.

```
for i in range(n_epochs):
    # Shuffle the data in place for each epoch (np.random.shuffle returns None)
    np.random.shuffle(data)
    for x, y in data:
        # Using backpropagation to calculate gradients (change)
        change = compute_gradient(cost_func, x, y, param)
        param = param - l_rate * change
```

Ending the walkthrough of SGD, it is only right to propose some pros and cons of the optimizer. Clearly, it is one of the older algorithms for optimization in neural networks, but nevertheless, it is also comparatively the easiest to learn. The cons are mostly with regards to newer and better optimizers, and is perhaps hard to explain at this point. The reason for the cons will become clear, once I present the next optimizers.

Pros

- Relatively fast compared to the older gradient descent approaches
- SGD is comparatively easy to learn for beginners, since it is not as math heavy as the newer approaches to optimizers

Cons

- Converges slower than newer algorithms
- Has more problems with being stuck in a local minimum than newer approaches
- Newer approaches outperform SGD in terms of optimizing the cost function

Simply put, the momentum algorithm helps us progress faster through the terrain of the cost function – think of a ball rolling downhill, picking up speed. This helps us get to a local minimum faster.

For each time we roll the ball down the hill (for each epoch), the ball rolls faster towards the local minimum in the next iteration. This makes us more likely to reach a better local minimum (or perhaps the global minimum) than we could have with SGD.

The slope of the cost function is not actually such a smooth curve, but it's easier to plot to show the concept of the ball rolling down the hill. The function will often be much more complex, hence we might actually get stuck in a local minimum or significantly slowed down. Obviously, this is not desirable. The terrain is not smooth, it has obstacles and weird shapes in very high-dimensional space – for instance, the concept would look like this in 2D:

In the above case, we are stuck at a local minimum, and the motivation is clear – we need a method to handle these situations, perhaps to never get stuck in the first place.

Now that we know why we should use momentum, let's introduce more specifically what it means, by explaining the mathematics behind it.

Momentum is where we add a temporal element into our equation for updating the parameters of a neural network – that is, an element of time.

This time element increases the momentum of the ball by some amount. This amount is called gamma $\gamma$, which is usually initialized to $0.9$. We multiply it by the **previous update** $v_t$.

What I want you to realize is that our function for momentum is basically the same as SGD, with an extra term:

$$
\theta = \theta - \eta\nabla J(\theta) + \gamma v_{t}
$$

Let's just make this $100\%$ clear:

- Theta $\theta$ is a parameter, e.g. your weights, biases or activations
- Eta $\eta$ is your learning rate, also sometimes written as alpha $\alpha$ or epsilon $\epsilon$.
- Objective function $J$, i.e. the function which we are trying to optimize. Also called cost function or loss function (although they have different meanings).
- Gamma $\gamma$, a constant term. Also called the momentum, and rho $\rho$ is also used instead of $\gamma$ sometimes.
- Last change (last update) to $\theta$ is called $v_t$.

Although it's very similar to SGD, I have left out some elements for simplicity, because we can easily get confused by the indexing and notational burden that comes with adding temporal elements.

Let's add those elements now. **First** the temporal element, **then** the explanation of $v_t$.

If you want to play with momentum and learning rate, I recommend visiting distill's page for Why Momentum Really Works.

Adding the notion of time; say we want to update the current parameter $\theta$, how would we go about that? Well, we would first have to define which parameter $\theta$ we want to update at a given time. And how do we do that?

One way to track where we are in time, is to assign a variable of time $t$ to $\theta$. The variable $t$ would work like a counter; we increase $t$ by one for each update of a certain parameter.

How might this look in a mathematical sense? Well, we just subscript every variable that is subject to change over time. That is, the values for our parameter $\theta$ will definitely change over time, but the variable for the learning rate $\eta$ remains fixed.

$$
\theta_t = \theta_{t} - \eta\nabla J(\theta_{t}) + \gamma v_{t}
$$

This reads:

Theta $\theta$ at time step $t$ equals $\theta_t$ minus the learning rate times the gradient of the objective function $J$ with respect to the parameter $\theta_t$, plus a momentum term gamma $\gamma$ times the previous update $v_t$.

There it is, we added the temporal element. But we are not done, what does $v_t$ mean? I explained it as the previous update, but what does that entail?

I told you about the ball rolling faster and faster down the hill, as it accumulates more speed for each epoch, and this term helps us do exactly that.

What helps us accumulate more speed for each epoch is the momentum term, which consists of a constant $\gamma$ multiplied by the previous update $v_{t}$ to $\theta$. But the previous update to $\theta$ also includes the second-to-last update to $\theta$, and so on.

$$
v_{t} = \eta\nabla J(\theta_{t-1}) + v_{t-1}
$$

Essentially, we store the calculations of the gradients (the updates) for use in all the next updates to a parameter $\theta$. This exact property causes the ball to roll faster down the hill, i.e. we converge faster because now we move forward faster.

Instead of writing $v_{t-1}$, which includes $v_{t-2}$ in its equation and so on, we could use summation, which might be clearer. We sum from tau $\tau$ equal to $1$ all the way up to the current time step $t$.

$$
\theta_t = \theta_{t} - \eta\nabla J(\theta_{t}) + \gamma \sum_{\tau=1}^{t}
\eta\nabla J(\theta_{\tau})
$$
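As a sketch, the common implementation folds the learning rate and the momentum constant into one velocity recursion, $v_t = \gamma v_{t-1} + \eta\nabla J(\theta_{t-1})$, which is then subtracted from the parameter. The toy objective $J(\theta) = \theta^2$ (gradient $2\theta$) is my own choice for illustration:

```python
# Sketch of gradient descent with momentum on a toy objective
# J(theta) = theta^2, whose gradient is 2*theta (illustrative choice).
def grad_J(theta):
    return 2.0 * theta

eta = 0.1      # learning rate
gamma = 0.9    # momentum term
theta = 5.0    # starting parameter value
v = 0.0        # the accumulated update v_t

for t in range(200):
    # v_t folds the current gradient step into the previous updates
    v = gamma * v + eta * grad_J(theta)
    theta = theta - v  # apply the update

print(theta)  # the parameter ends up close to the minimum at 0
```

Because $v$ keeps part of every earlier step, the parameter moves faster along directions where the gradient keeps pointing the same way.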

The intuition of why momentum works (besides the theory) can effectively be shown with a contour plot – which is a long and narrow valley in this case.

We can think of optimizing a cost function with SGD as oscillating up and down along the y-axis, and the bigger the oscillation up and down the y-axis, the slower we progress along the x-axis. Intuitively, it then makes sense to add *something* (momentum) to help us oscillate less, thus moving faster along the x-axis towards the local minimum.

$$
\updownarrow \text{slower convergence}\\
\leftrightarrow \text{faster convergence}
$$

The next notation for the notion of change might be easier to understand. You may skip the next header, but I think it's a good alternative way of thinking about momentum. You will also learn a different notation, which can help you understand other papers using similar conventions.

In the paper I linked at the start, momentum is described in a similar way but with different notation, so let's just cover that as well. It is defined for a weight $w$ instead of a parameter $\theta$, and $E$ is used for the error function, which plays the same role as $J$ for the objective or cost function. The Delta symbol $\Delta$ indicates change:

$$
\Delta w_t = \epsilon \nabla E(w) + \rho \Delta w_{t-1}
$$

This is pretty straightforward, so let's replace the parameters of the equation with the parameters of what I just explained.

- $w_t$ becomes $\theta_t$
- Epsilon $\epsilon$ becomes $\eta$
- $E(w)$ becomes $J(\theta_t)$
- Rho $\rho$ becomes $\gamma$

Rewriting the parameters, we get almost exactly the same equation as presented in the last notation, except we now have a Delta $\Delta$ term at the start and end of the equation. The delta symbol has always meant *change* in physics – and it has the same meaning here; it's just the rate of change of a parameter over a function $J$.

$$
\Delta \theta_t = \eta \nabla J(\theta_t) + \gamma \Delta \theta_{t-1}
$$

All the delta symbol means is a function $\text{change}(\theta_t)$, which specifies how much a parameter $\theta$ should change by. So when I tell you to add $\Delta \theta_{t-1}$ at the end of the equation, it just means to take the last change to $\theta$, i.e. at the last time step $t-1$.

Now $\Delta \theta_t$ becomes our update, and we update our parameter accordingly:

$$
\theta_t = \theta_t - \Delta \theta_t
$$

It's really that simple.

There is not much to say about the pros and cons of the algorithm – there is simply not a lot of theory on the strengths and weaknesses of momentum.

Pros

- Faster convergence than traditional SGD

Cons

- As the ball accelerates down the hill, how do we know that we don't overshoot the local minimum? If the momentum is too large, we will most likely roll past the local minimum, then roll backwards and miss it again – swinging back and forth around it.

Adaptive Moment Estimation (Adam) is the next optimizer, and probably also the optimizer that performs the best on average. Taking the big step from the SGD algorithm to Adam requires explaining some clever techniques adopted from other algorithms, as well as the unique approaches Adam itself brings.

Adam uses Momentum and Adaptive Learning Rates to converge faster. We have already explored what Momentum means, now we are going to explore what adaptive learning rates mean.

An adaptive learning rate can be observed in AdaGrad, AdaDelta, RMSprop and Adam, but I will only go into AdaGrad and RMSprop, as they seem to be the most relevant ones (although AdaDelta has the same update rule as RMSprop). The adaptive learning rate property is also known as Learning Rate Schedules, which I found an insightful Medium post for.

So, what is it? I found that the best way is explaining a property from AdaGrad first, and then adding a property from RMSprop. This will be sufficient to show you what adaptive learning rate means and provides.

Part of the intuition for adaptive learning rates is that we start off with big steps and finish with small steps – almost like mini-golf. We are then allowed to move faster initially. As the learning rate decays, we take smaller and smaller steps, allowing us to converge more precisely, since we no longer overstep the local minimum with large steps.

Adaptive Gradients (AdaGrad) provides us with a simple approach for changing the learning rate over time. This is important for adapting to the differences in datasets, since we can get small or large updates, according to how the learning rate is defined.

Let's go for a top to bottom approach; here is the equation:

$$
\theta_{t+1,i} = \theta_{t,i}
-\frac{\eta}
{
\sqrt
{
\epsilon +
\sum_{\tau=1}^{t}
\left( \nabla J(\theta_{\tau,i}) \right) ^2
}
} \nabla J(\theta_{t,i})
$$

All we added here is a divisor for the learning rate eta $\eta$. Although I told you that $\epsilon$ is sometimes the learning rate, in this algorithm it is not – it's just a small value that ensures we don't divide by zero.

What needs explaining here is the term $\sqrt{\sum_{\tau=1}^{t}\left( \nabla J(\theta_{\tau,i}) \right) ^2}$, i.e. the square root of the summation $\sum$ over all gradients squared. We sum over all the gradients, from time step $\tau=1$ all the way to the current time step $t$.

If $t=3$, then we would sum over the gradient at $t=1$, $t=2$ and $t=3$, and this just scales as $t$ becomes larger. Eventually, though, the accumulated sum grows so large that the effective learning rate goes *stale*, i.e. the parameter is updated with very small values.

Let me just make an example here, denoting the gradient by $g$ under the square root, i.e. $g(\theta_{3,i})^2 = (\nabla J(\theta_{3,i}))^2$

$$
\theta_{4,i} = \theta_{3,i}
\, -\frac{\eta}
{
\sqrt
{
\epsilon +
g(\theta_{1,i})^2 +
g(\theta_{2,i})^2 +
g(\theta_{3,i})^2
}
} \,\nabla J(\theta_{3,i})
$$

What effect does this have on the learning rate $\eta$? Well, division by bigger and bigger numbers means that the learning rate is decreasing over time – hence the term *adaptive* learning rate.

We could in simple terms say that the sum $\sum$ increases as we add more gradients over time:
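The AdaGrad update can be sketched in a few lines of Python; the toy objective $J(\theta)=\theta^2$ (gradient $2\theta$) is my own choice, and the shrinking effective learning rate is visible in the running sum:

```python
# Sketch of the AdaGrad update for a single parameter, using the toy
# objective J(theta) = theta^2 with gradient 2*theta (my own choice).
def grad_J(theta):
    return 2.0 * theta

eta = 0.5        # learning rate
eps = 1e-8       # small value that prevents division by zero
theta = 5.0
grad_sum = 0.0   # running sum of all squared gradients

for t in range(100):
    g = grad_J(theta)
    grad_sum += g ** 2  # the sum only ever grows
    # effective learning rate eta / sqrt(eps + sum) decreases over time
    theta -= eta / (eps + grad_sum) ** 0.5 * g

print(theta)
```

Because `grad_sum` never shrinks, each step is smaller than the last – exactly the decay described above.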

Root Mean Squared Propagation (RMSprop) is very close to AdaGrad, except that it does not use the sum of the gradients, but instead an *exponentially decaying average*. This decaying average is realized by combining the Momentum algorithm and the AdaGrad algorithm, with a new term.

An important property of RMSprop is that we are no longer tied to the sum of all past gradients, but instead restricted mostly to the gradients of recent time steps. This means that RMSprop changes the learning rate slower than AdaGrad, but still reaps the benefits of converging relatively fast – as has been shown (and we won't go into those details here).

Doing the top to bottom approach again, let's start out with the equation. By now, the only unfamiliar term should be the expectation of the squared gradient $E[g^2]$:

$$
\theta_{t+1,i} = \theta_{t,i}
-\frac{\eta}
{
\sqrt
{
\epsilon + E[g^2]_t
}
} \nabla J(\theta_{t,i})
$$

This exact term is what causes the decaying average (also called running average or moving average). Let's examine it, with relation to the momentum algorithm presented earlier.

We still have our momentum term $\gamma=0.9$. We can immediately see that the new term $E$ is similar to $v_t$ from Momentum; the difference is that $E$ has no learning rate in its equation, while it has a new term $(1-\gamma)$ in front of the squared gradient:

$$
E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma) g_t^2
$$

Note that summation $\sum$ is not used here; because of the decaying weights on older gradients, expressing it with a summation sign would make the equation more complex than it's worth.

With the AdaGrad algorithm, the learning rate $\eta$ was monotonically decreasing, while in RMSprop, $\eta$ can adapt up and down in value, as we step further down the hill for each epoch. This concludes adaptive learning rate, where we explored two ways of making the learning rate adapt over time. This property of adaptive learning rate is also in the Adam optimizer, and you will probably find that Adam is easy to understand now, given the prior explanations of other algorithms in this post.
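As a sketch, here is the RMSprop update for a single parameter, where I compute the decaying average as $E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma) g_t^2$ (the standard formulation); the toy objective $J(\theta)=\theta^2$ is my own choice:

```python
# Sketch of the RMSprop update for a single parameter; the toy
# objective J(theta) = theta^2 (gradient 2*theta) is my own choice.
def grad_J(theta):
    return 2.0 * theta

eta = 0.1
gamma = 0.9      # decay rate: recent gradients weigh the most
eps = 1e-8       # prevents division by zero
theta = 5.0
avg_sq = 0.0     # E[g^2], the exponentially decaying average

for t in range(200):
    g = grad_J(theta)
    # old gradients decay by gamma, the newest squared gradient gets (1 - gamma)
    avg_sq = gamma * avg_sq + (1 - gamma) * g ** 2
    theta -= eta / (eps + avg_sq) ** 0.5 * g

print(theta)
```

Unlike AdaGrad's ever-growing sum, `avg_sq` can shrink again when recent gradients are small, so the effective learning rate can adapt both up and down.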

Andrew Ng compares Momentum to RMSprop in a brilliant video on YouTube

Now we have learned all these other algorithms, and for what? Well, to be able to explain Adam, such that it's easier to understand. By now, you should know what Momentum and Adaptive Learning Rate mean.

There are a lot of terms to watch out for in the original paper, and it might seem confusing at first.

But let's just paint it in a simplistic way; here is the update rule for Adam

$$
\theta_{t+1} = \theta_t -
\frac{\eta \cdot \hat{m}_t}
{
\sqrt{\hat{v}_t} + \epsilon
}
\\ \text{where} \\
\hat{m}_t = \frac{m_t}{1-\beta_1^t}
\\
\hat{v}_t = \frac{v_t}{1-\beta_2^t}
\\ \text{and where} \\
m_t = (1-\beta_1)g_t + \beta_1 m_{t-1}
\\
v_t = (1-\beta_2)g_t^2 + \beta_2 v_{t-1}
$$

Immediately, we can see that there are a bunch of numbers and things to keep track of. Most of these have already been explained, but for the sake of clarity, let's state each term here:

- Epsilon $\epsilon$, which is just a small term preventing division by zero. This term is usually $10^{-8}$.
- Learning rate $\eta$ (written as $\alpha$ in the paper). They explain that a good default setting is $\eta=0.001$, which is also the default learning rate in Keras.
- The gradient $g$, which is still the same thing as before: $g = \nabla J(\theta_{t,i})$

We also have two decay terms, also called the exponential decay rates in the paper. The terms are close to $\gamma$ in RMSprop and Momentum, but instead of 1 term, we have two terms called beta 1 and beta 2:

- First momentum term $\beta_1=0.9$
- Second momentum term $\beta_2=0.999$

Although these terms are written without the time step $t$, we just take the value of $t$ and put it in the exponent, i.e. if $t=5$, then $\beta_1^{t=5}=0.9^5=0.59049$.
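Here is a minimal sketch of these update rules for a single parameter; the toy objective $J(\theta)=\theta^2$ is my own choice, and I use a larger learning rate than the paper's default so the toy example converges in few steps:

```python
# Sketch of the Adam update for a single parameter on the toy
# objective J(theta) = theta^2 with gradient 2*theta (my own choice).
def grad_J(theta):
    return 2.0 * theta

eta = 0.01                 # learning rate (0.001 in the paper; larger here)
beta1, beta2 = 0.9, 0.999  # the two exponential decay rates
eps = 1e-8                 # prevents division by zero
theta = 5.0
m, v = 0.0, 0.0            # first and second moment estimates

for t in range(1, 1001):
    g = grad_J(theta)
    m = (1 - beta1) * g + beta1 * m        # decaying average of gradients
    v = (1 - beta2) * g ** 2 + beta2 * v   # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias-corrected estimates,
    v_hat = v / (1 - beta2 ** t)           # important for the first steps
    theta -= eta * m_hat / (v_hat ** 0.5 + eps)

print(theta)
```

The $m$ term plays the role of momentum, while $v$ plays the role of the decaying average from RMSprop – Adam is essentially the two combined, plus the bias correction.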

The likes of RAdam and Lookahead were considered, along with a combination of the two, called Ranger, but ultimately left out. They are acclaimed as SOTA optimizers by a bunch of Medium posts, though they remain unproven. A future post could include these "SOTA" optimizers, to explain the difference from Adam, and why that might be useful.

Anyone getting into deep learning will probably get the best and most consistent results using Adam, as it has been around for a while and has been shown to perform well on average.

If you want to visualize optimizers, I found this notebook to be a great resource, using optimizers from TensorFlow.

Here are the further readings for this post. Currently, it's only the papers being referenced here.

Grid Search with Cross-Validation (GridSearchCV) is a brute force approach to finding the best hyperparameters for a specific dataset and model. Why not automate it to the extent we can? Stay around until the end for a RandomizedSearchCV implementation in addition to the GridSearchCV one.

This is perhaps a trivial task to some, but a very important one – hence it is worth showing how you can run a search over hyperparameters for all the popular packages.

In this post, I'm going to go over a code piece for both classification and regression, varying between Keras, XGBoost, LightGBM and Scikit-Learn.

There is a **GitHub** available with a *colab button*, where you instantly can run the same code, which I used in this post.

In one line: cross-validation is the process of splitting the dataset into K partitions, and for each split, searching the whole grid of hyperparameters for an algorithm, trying every combination in a brute-force manner.

Note that I'm referring to K-Fold cross-validation (CV), even though there are other methods of doing CV.

In an iterative manner, we switch up the testing and training dataset in different subsets from the full dataset. We usually split the full dataset so that each testing fold has 10% ($K=10$) or 20% ($K=5$) of the full dataset.
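The splitting described above can be sketched with scikit-learn's `KFold`; the tiny dataset here is made up for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # a made-up dataset with 10 samples

kf = KFold(n_splits=5)  # K = 5, so each test fold holds 20% of the data
for train_idx, test_idx in kf.split(X):
    # 8 samples for training, 2 for testing in every split
    print(len(train_idx), len(test_idx))
```

Every sample ends up in a test fold exactly once across the 5 splits.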

**Grid Search:** From this image of cross-validation, what we do for the grid search is the following: for each iteration, test all the possible combinations of hyperparameters, by fitting and scoring each combination separately.

We need a prepared dataset to be able to run a grid search over all the different parameters we want to try. I'm assuming you have already prepared the dataset, else I will show a short version of preparing it and then get right to running grid search.

In this post, I'm going to be running models on three different datasets; MNIST, Boston House Prices and Breast Cancer. The sole purpose is to jump right past preparing the dataset and right into running it with GridSearchCV. But we will have to do just a little preparation, which we will keep to a minimum.

For the MNIST dataset, we normalize the pictures by dividing by the max RGB value, and one-hot encode our output classes.

```
from keras.datasets import mnist
from keras.utils import to_categorical

# LOAD DATA
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# PREPROCESSING
def preprocess_mnist(x_train, y_train, x_test, y_test):
    # Reshaping all images into 28x28 pixels with a single channel
    x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
    x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
    input_shape = (28, 28, 1)
    # Float values for division
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    # Normalizing the RGB codes by dividing by the max RGB value
    x_train /= 255
    x_test /= 255
    # Categorical y values
    y_train = to_categorical(y_train)
    y_test = to_categorical(y_test)
    return x_train, y_train, x_test, y_test, input_shape

X_train, y_train, X_test, y_test, input_shape = preprocess_mnist(x_train, y_train, x_test, y_test)
```

For the house prices dataset, we do even less preprocessing – we really just load the Boston dataset from sklearn and split it into training and testing sets.

```
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

boston = load_boston()
X = boston.data
y = boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

For the last dataset, breast cancer, we don't do any preprocessing except for splitting it into training and testing sets.

```
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load breast cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

The next step is to actually run grid search with cross-validation. How does it work? Well, I made this function that is pretty easy to pick up and use. You can input your different training and testing splits `X_train_data`, `X_test_data`, `y_train_data`, `y_test_data`. You can also input your `model`, whichever library it may be from; it could be Keras, sklearn, XGBoost or LightGBM. With `param_grid`, you specify which parameters you want to 'bruteforce' your way through, to find the best hyperparameters. An important thing is also to specify which scoring you would like to use for fitting the model, with `scoring_fit`. At last, you can set other options, like how many K-partitions you want and which scoring from sklearn.metrics you want to use. We use `n_jobs=-1` as a standard, since that means we use all available CPU cores to train our model.

```
from sklearn.model_selection import GridSearchCV

def algorithm_pipeline(X_train_data, X_test_data, y_train_data, y_test_data,
                       model, param_grid, cv=10, scoring_fit='neg_mean_squared_error',
                       do_probabilities=False):
    gs = GridSearchCV(
        estimator=model,
        param_grid=param_grid,
        cv=cv,
        n_jobs=-1,
        scoring=scoring_fit,
        verbose=2
    )
    fitted_model = gs.fit(X_train_data, y_train_data)
    if do_probabilities:
        pred = fitted_model.predict_proba(X_test_data)
    else:
        pred = fitted_model.predict(X_test_data)
    return fitted_model, pred
```

Note that we could switch out `GridSearchCV` for `RandomizedSearchCV`, if you want to use that instead.

First, we define the neural network architecture, and since it's for the MNIST dataset that consists of pictures, we define it as a convolutional neural network (CNN).

```
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

# Readying neural network model
def build_cnn(activation='relu',
              dropout_rate=0.2,
              optimizer='Adam'):
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3),
                     activation=activation,
                     input_shape=input_shape))
    model.add(Conv2D(64, (3, 3), activation=activation))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(dropout_rate))
    model.add(Flatten())
    model.add(Dense(128, activation=activation))
    model.add(Dropout(dropout_rate))
    model.add(Dense(10, activation='softmax'))
    model.compile(
        loss='categorical_crossentropy',
        optimizer=optimizer,
        metrics=['accuracy']
    )
    return model
```

Next, we just define the parameters and model to input into the `algorithm_pipeline`; we run classification on this dataset, since we are trying to predict which class a given image can be categorized into. Note that I commented out some of the parameters, because it would take a long time to train, but you can always fiddle around with which parameters you want. You could even add `pool_size` or `kernel_size`.

```
from keras.wrappers.scikit_learn import KerasClassifier

param_grid = {
    'epochs': [1, 2, 3],
    'batch_size': [128]
    # 'epochs': [100, 150, 200],
    # 'batch_size': [32, 128],
    # 'optimizer': ['Adam', 'Nadam'],
    # 'dropout_rate': [0.2, 0.3],
    # 'activation': ['relu', 'elu']
}

model = KerasClassifier(build_fn=build_cnn, verbose=0)

model, pred = algorithm_pipeline(X_train, X_test, y_train, y_test, model,
                                 param_grid, cv=5, scoring_fit='neg_log_loss')

print(model.best_score_)
print(model.best_params_)
```

From this GridSearchCV, we get the best score and best parameters to be:

```
-0.04399333562212302
{'batch_size': 128, 'epochs': 3}
```

I came across this issue when coding a solution trying to use accuracy for a Keras model in GridSearchCV – you might wonder why `'neg_log_loss'` was used as the scoring method?

The solution to using something else than negative log loss is to remove some of the preprocessing of the MNIST dataset; that is, *REMOVE* the part where we make the output variables categorical

```
# Categorical y values
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
```

Surely we would be able to run with other scoring methods, right? Yes, that was actually the case (see the notebook). This was the best score and best parameters:

```
0.9858
{'batch_size': 128, 'epochs': 3}
```

Next, we define the parameters for the Boston house prices dataset. Here the task is regression, for which I chose to use XGBoost. I also chose to evaluate with Root Mean Squared Error (RMSE).

```
import numpy as np
import xgboost as xgb

model = xgb.XGBRegressor()
param_grid = {
    'n_estimators': [400, 700, 1000],
    'colsample_bytree': [0.7, 0.8],
    'max_depth': [15, 20, 25],
    'reg_alpha': [1.1, 1.2, 1.3],
    'reg_lambda': [1.1, 1.2, 1.3],
    'subsample': [0.7, 0.8, 0.9]
}

model, pred = algorithm_pipeline(X_train, X_test, y_train, y_test, model,
                                 param_grid, cv=5)

# Root Mean Squared Error
print(np.sqrt(-model.best_score_))
print(model.best_params_)
```

The best score and parameters for the house prices dataset found from the GridSearchCV was

```
3.4849014783892733
{'colsample_bytree': 0.8, 'max_depth': 20, 'n_estimators': 400, 'reg_alpha': 1.2, 'reg_lambda': 1.3, 'subsample': 0.8}
```

The next task was LightGBM for classifying breast cancer. The metric chosen was accuracy.

```
import lightgbm as lgb

model = lgb.LGBMClassifier()
param_grid = {
    'n_estimators': [400, 700, 1000],
    'colsample_bytree': [0.7, 0.8],
    'max_depth': [15, 20, 25],
    'num_leaves': [50, 100, 200],
    'reg_alpha': [1.1, 1.2, 1.3],
    'reg_lambda': [1.1, 1.2, 1.3],
    'min_split_gain': [0.3, 0.4],
    'subsample': [0.7, 0.8, 0.9],
    'subsample_freq': [20]
}

model, pred = algorithm_pipeline(X_train, X_test, y_train, y_test, model,
                                 param_grid, cv=5, scoring_fit='accuracy')

print(model.best_score_)
print(model.best_params_)
```

The best parameters and best score from the GridSearchCV on the breast cancer dataset with LightGBM was

```
0.9736263736263736
{'colsample_bytree': 0.7, 'max_depth': 15, 'min_split_gain': 0.3, 'n_estimators': 400, 'num_leaves': 50, 'reg_alpha': 1.3, 'reg_lambda': 1.1, 'subsample': 0.7, 'subsample_freq': 20}
```

Just to show that you indeed can run GridSearchCV with one of sklearn's own estimators, I tried the RandomForestClassifier on the same dataset as LightGBM.

```
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
param_grid = {
    'n_estimators': [400, 700, 1000],
    'max_depth': [15, 20, 25],
    'max_leaf_nodes': [50, 100, 200]
}

model, pred = algorithm_pipeline(X_train, X_test, y_train, y_test, model,
                                 param_grid, cv=5, scoring_fit='accuracy')

print(model.best_score_)
print(model.best_params_)
```

And indeed the score was worse than from LightGBM, as expected:

```
0.9648351648351648
{'max_depth': 25, 'max_leaf_nodes': 50, 'n_estimators': 1000}
```

Interested in running a GridSearchCV that is unbiased? I welcome you to Nested Cross-Validation, where you get the optimal bias-variance trade-off and, in theory, as unbiased a score as possible.

I would encourage you to check out this repository over at GitHub.

I made a post about it:

I embedded the examples below, and you can install the package by a pip command: **pip install nested-cv**

We don't have to restrict ourselves to GridSearchCV – why not implement RandomizedSearchCV too, if that is preferable to you. This is implemented at the bottom of the notebook available here.

We can specify another parameter for the pipeline, `search_mode`, which lets us specify which search algorithm we want to use in our pipeline. We also introduce another parameter called `n_iterations`, since the RandomizedSearchCV class needs such a parameter – but GridSearchCV does not.

We can set defaults for both those parameters, and indeed that is what I have done. `search_mode = 'GridSearchCV'` and `n_iterations = 0` are the defaults, hence we default to GridSearchCV, where the number of iterations is not used.

Here the code is, and notice that we just made a simple if-statement for which search class to use:

```
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

def search_pipeline(X_train_data, X_test_data, y_train_data, y_test_data,
                    model, param_grid, cv=10, scoring_fit='neg_mean_squared_error',
                    do_probabilities=False, search_mode='GridSearchCV', n_iterations=0):
    fitted_model = None
    if search_mode == 'GridSearchCV':
        gs = GridSearchCV(
            estimator=model,
            param_grid=param_grid,
            cv=cv,
            n_jobs=-1,
            scoring=scoring_fit,
            verbose=2
        )
        fitted_model = gs.fit(X_train_data, y_train_data)
    elif search_mode == 'RandomizedSearchCV':
        rs = RandomizedSearchCV(
            estimator=model,
            param_distributions=param_grid,
            cv=cv,
            n_iter=n_iterations,
            n_jobs=-1,
            scoring=scoring_fit,
            verbose=2
        )
        fitted_model = rs.fit(X_train_data, y_train_data)
    if fitted_model is not None:
        if do_probabilities:
            pred = fitted_model.predict_proba(X_test_data)
        else:
            pred = fitted_model.predict(X_test_data)
        return fitted_model, pred
```

Running this for the breast cancer dataset, it produces the below results, which is almost the same as the GridSearchCV result (which got a score of 0.9648)

```
0.9626373626373627
{'n_estimators': 1000, 'max_leaf_nodes': 100, 'max_depth': 25}
```

Most recommended books (referral to Amazon) are the following, in order. The first one is particularly good for practicing ML in Python, as it covers much of scikit-learn and TensorFlow.

- Hands-On Machine Learning, best practical book!
- Condensed book with all the material needed to get started; a great reference!
- Academic and theory-oriented book for deep learning
- Learning and looking at Machine Learning with probability theory. Recommended if you have a mathematics background

I recommend reading the documentation for each model you are going to use with this GridSearchCV pipeline – it will save you complications when migrating to other algorithms. In particular, here is the documentation for the algorithms I used in this post:

During the calculations of the values for activations in each layer, we use an activation function right before deciding what exactly the activation value should be. From the previous activations, weights and biases in each layer, we calculate a value for every activation in the next layer. But before sending that value to the activations of the next layer, we use an activation function to scale the output. Here, we will explore different activation functions.

**The prerequisite** for this post is my last post about feedforward and backpropagation in neural networks, where I briefly talked about activation functions but never actually expanded on what they do for us. Much of what I talk about here will only be relevant if you have that prior knowledge, or have read my previous post.

`Code > Theory?`

→ Jump straight to the code.

Activation functions can be a make-or-break part of a neural network. In this extensive article (>6k words), I'm going to go over 6 different activation functions, each with pros and cons. I will give you the equation, the differentiated equation and plots for each of them. The goal is to explain the equations and graphs in simple input-output terms.

I show you the vanishing and exploding gradient problems; for the latter, I follow Nielsen's great example of why gradients might explode.

At last, I provide some code that you can run for yourself, in a Jupyter Notebook.

From the small code experiment on the MNIST dataset, we obtain a loss and accuracy graph for each activation function

The sigmoid function is a logistic function, which means that, whatever you input, you get an output ranging between 0 and 1. That is, every neuron, node or activation that you input, will be scaled to a value between 0 and 1.

$$
\text{sigmoid}(x) = \sigma = \frac{1}{1+e^{-x}}
$$

Such a function as the sigmoid is often called a *nonlinearity*, simply because we cannot describe it in linear terms. Many activation functions are nonlinear, or a combination of linear and nonlinear – and it is possible for some of them to be linear, although that is unusual.

This is not entirely problematic, except when the output saturates at almost exactly 0 or 1, which it will for large inputs. *So why is this problematic?*

This very question relates to backpropagation, covered here (read before continuing). In backpropagation, we calculate gradients for each weight, that is, small updates to each weight. We do this to optimize the output of the activation values throughout the whole network, so that it gives us a better output in the output layer, which in turn will optimize the cost function.

During backpropagation, we have to calculate the ratio of how much each weight impacts the cost function, by finding the partial derivatives of the cost function with respect to each weight. Let's say that we refer to all the weights $w$ in the last layer $L$ by $w^L$, instead of defining individual weights; then the derivative of the cost with respect to these is

$$
\frac{\partial C}{\partial w^{L}}
=
\frac{\partial C}{\partial a^{L}}
\frac{\partial a^{L}}{\partial z^{L}}
\frac{\partial z^{L}}{\partial w^{L}}
$$

Note that when taking the partial derivative, we find the equation for $\partial a^{L}$ and then only differentiate $\partial z^{L}$, while the rest is treated as constant. We denote the derivative of any function by an apostrophe $'$. When calculating the partial derivative for the middle term $\partial a^{L}/\partial z^{L}$, we get this

$$
\frac{\partial a^{L}}{\partial z^{L}}=
\sigma' \left(
w \times a^{L-1}+b
\right)
$$

And the derivative of the sigmoid function turns out to be

$$
\sigma'(x)=\sigma(x) (1-\sigma(x))
= \frac{e^{-x}}{(e^{-x}+1)^2}
$$

When we input a large (positive or negative) $x$-value into the sigmoid's derivative, we get a value in return that is almost 0 — that is, when we input $w \times a+b$, the derivative might return a value close to zero.

When $x$ is a large value (positive or negative), we essentially multiply a value that is almost zero with the rest of the partial derivatives.

$$
\frac{\partial C}{\partial w^{L}}
=
\frac{\partial C}{\partial a^{L}}
\times \text{almost zero} \times
\frac{\partial z^{L}}{\partial w^{L}}
$$

If enough of the inputs $w \times a + b$ are large values, the derivatives are almost zero, and we essentially get a network that does not adjust its weights at all, which is a big problem. If we don't adjust the weights, we are left with the tiniest updates, which means that the algorithm does not improve the network much over time. For each of the calculations of partial derivatives with respect to a weight, we throw them into a gradient vector, which we use to update the neural network. You might imagine that if all the values of that gradient vector are close to zero, we are not really updating anything at all.

$$
\sigma'(10)=\sigma(10)\times (1-\sigma(10))
= 0.00004539...
$$
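We can verify this number with a few lines of Python; the helper names are my own:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_prime(0))   # 0.25, the largest value the derivative takes
print(sigmoid_prime(10))  # roughly 0.0000454 – almost zero
```

Even at its peak the derivative is only $0.25$, so every layer of sigmoids shrinks the gradient by at least a factor of four.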

What has been described here is known as the **vanishing gradient problem**. Very briefly, this is why the sigmoid function should never be used over other activation functions, introduced later.

Per my last post, if we want to update a specific weight, the update rule is

$$
w^{(L)} = w^{(L)} - \text{learning rate} \times \frac{\partial C}{\partial w^{(L)}}
$$

What if that partial derivative $\partial C/\partial w^{(L)}$ is *vanishingly* small? That is when we encounter the vanishing gradient problem, where a number of weights or biases essentially receive very small updates.

You see, if we have a weight with a value 0.2, it will barely move from that value if we have vanishing gradients. Since this weight is connecting the first neuron in the first layer and the first neuron in the second layer, we will call it $w^{(1)}_{1,1}$ in the notation of $w^{layer}_{to,from}$

$$
w^{(1)}_{1,1}=w^{(1)}_{1,1}-\text{learning rate} \times 0.0000000438
$$

Supposing that the weight has the value 0.2 and some given learning rate (the exact value doesn't matter; we will use 0.5), the new weight would be

$$
w^{(1)}_{1,1}=0.2-0.5 \times 0.0000000438
= 0.199999978
$$

The weight had the value $0.2$, but was updated to $0.199999978$. Evidently, this is a problem — gradients are *vanishingly* small and the weights in the neural network will barely be updated. This causes nodes in the network to be far from their optimal values, which holds the neural network back from learning.
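The update above is easy to reproduce as a toy sketch (the weight, learning rate and gradient are the example values from the text):

```
# Vanishing update: a near-zero gradient barely moves the weight
weight = 0.2
learning_rate = 0.5
gradient = 0.0000000438   # the tiny partial derivative from the example

new_weight = weight - learning_rate * gradient
print(new_weight)   # about 0.1999999781 -- practically unchanged
```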

It has been observed that this becomes an even bigger problem when different layers learn at different speeds — and they will, since the earlier layers receive the smallest gradients and therefore always learn the slowest.

In this example, hidden layer 4 learns the fastest, because the cost function only depends on the changes of the weights connected to hidden layer 4. Let's consider hidden layer 1; here, the cost function depends on the changes of the weights connected to hidden layer 1 *and* on the changes in hidden layers 2, 3 and 4. If you followed my first post on backpropagation, maybe you picked up on the fact that earlier layers in the network reuse calculations from the later layers

$$
\frac{\partial C}{\partial w^{(1)}}
=
\underbrace{
\frac{\partial C}{\partial a^{(4)}}
\frac{\partial a^{(4)}}{\partial z^{(4)}}
}_\text{From $w^{(4)}$}
\,
\underbrace{
\frac{\partial z^{(4)}}{\partial a^{(3)}}
\frac{\partial a^{(3)}}{\partial z^{(3)}}
}_\text{From $w^{(3)}$}
\,
\underbrace{
\frac{\partial z^{(3)}}{\partial a^{(2)}}
\frac{\partial a^{(2)}}{\partial z^{(2)}}
}_\text{From $w^{(2)}$}
\,
\frac{\partial z^{(2)}}{\partial a^{(1)}}
\frac{\partial a^{(1)}}{\partial z^{(1)}}
\frac{\partial z^{(1)}}{\partial w^{(1)}}
$$

While, as shown earlier in this post, the last layer only depends on one set of changes when calculating partial derivatives

$$
\frac{\partial C}{\partial w^{(4)}}
=
\frac{\partial C}{\partial a^{(4)}}
\frac{\partial a^{(4)}}{\partial z^{(4)}}
\frac{\partial z^{(4)}}{\partial w^{(4)}}
$$

Ultimately, this is a big problem, because now the layers of weights are learning at a different rate. This means the later layers in the network will almost certainly be more optimized than the early layers in the network.

This is a problem, because the backpropagation algorithm doesn't know in which direction it should transform the weights, to optimize the cost function.

This problem is essentially the opposite of the vanishing gradient problem. It can be shown, as done by Nielsen, that we can get the opposite problem, where weights are essentially 'exploding', i.e. their values are rapidly increasing. We are going to be following his example here. Note that the same setup can also be used to show the vanishing gradient problem, but I chose to explain that problem more conceptually above, for an easier explanation.

Essentially, we risk running into a vanishing gradient problem when $0 < w < 1$ and an exploding gradient problem when $w > 1$. Though, for a layer to experience this problem, many of its weights must satisfy the condition for either vanishing or exploding gradients.

We start off with a simple network. One which has few weights, biases and activations, and one which only has one node per layer.

Simple; weights are indicated by $w_j$, biases $b_j$, with a cost function $C$. Nodes, neurons or activations (whatever you prefer) are the circles.

Nielsen uses a common notation from physics, describing a change in some value by the triangle symbol delta $\Delta$ (which is different from the gradient symbol nabla $\nabla$). For example, $\Delta b_j$ is used to describe a change in the value of the *j*'th bias.

$$
\Delta b_j = \text{how much $b_j$ changes}
$$

The essence of my last post is that we measure the ratio of change in a weight or bias in relation to the cost function. Instead of considering a layer, let's focus on a specific bias, that is, the first bias $b_1$. Then we would measure the ratio by the equation

$$
\frac{\partial C}{\partial b_1}
=
\frac{\partial C}{\partial a}
\frac{\partial a}{\partial z}
\frac{\partial z}{\partial b_1}
$$

The argument for the next equation is the same as for the partial derivative above. That is, how do we measure a ratio of change in the cost function by the ratio of change in the bias?

As just described, Nielsen uses delta $\Delta$ to describe change so we could say that the partial derivatives could, *roughly speaking*, be replaced by delta

$$
\frac{\partial C}{\partial b_1}
\approx
\frac{\Delta C}{\Delta b_1}
$$

The changes to weights and biases can be visualized as follows:

Let's start at the beginning of the network with calculating how a change in the first bias $b_1$ affects the network. Since we know, by my last post, that the first bias $b_1$ feeds into the first activation $a_1$, that is where we would start. Let's just recap to remind ourselves of the equation

$$
a_1 = \sigma(z_1) =
\sigma\left(w_1 a_0 + b_1 \right)
$$

If $b_1$ changes, we call it $\Delta b_1$. Therefore, we notice that when $b_1$ changes, the activation $a_1$ also changes – which we normally express as $\partial a_1/\partial b_1$.

So we have the partial derivative expression on the left, which is the change in $a_1$ with relation to $b_1$. On the right side, we begin replacing terms, starting with replacing $a_1$ with the sigmoid of $z_1$

$$
\frac{\partial a_1}{\partial b_1}
=
\frac{\partial \sigma(z_1)
}{\partial b_1}
$$

The above equation says that there is some change in the activation value $a_1$, when $b_1$ changes. Recall that we describe the change of $a_1$ as $\Delta a_1$.

We consider the change $\Delta a_1$ as being approximately the partial derivative of $a_1$ with respect to $b_1$ *times* the change $\Delta b_1$

$$
\begin{eqnarray}
\Delta a_1 & \approx &
\frac{\partial \sigma(w_1 a_0+b_1)}{\partial b_1} \Delta b_1\\
& = & \sigma'(z_1) \Delta b_1
\end{eqnarray}
$$

We jumped over a step here, but essentially, we just calculated the partial derivative and replaced the fraction by the result of the partial derivative.
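We can sanity-check this approximation numerically. In the sketch below, the values for $w_1$, $a_0$, $b_1$ and the step $\Delta b_1$ are arbitrary placeholders:

```
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def d_sigmoid(x):
    return sigmoid(x) * (1 - sigmoid(x))

w1, a0, b1 = 0.5, 0.8, 0.1   # arbitrary example values
delta_b1 = 1e-4              # a small change in the bias

z1 = w1 * a0 + b1
# Actual change in a1 when b1 changes by delta_b1
actual = sigmoid(w1 * a0 + (b1 + delta_b1)) - sigmoid(z1)
# The approximation: sigma'(z1) * delta_b1
approx = d_sigmoid(z1) * delta_b1

print(actual, approx)   # the two values agree to many decimal places
```

The smaller we make `delta_b1`, the closer the two values get — which is exactly what the $\approx$ in the equation claims.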

The described change $\Delta a_1$ now causes a change in the input $z_2$ to the next layer. If this seems weird or you are not convinced, I encourage you to read through Neural Networks Explained.

$$
z_2 = w_2 a_1 + b_2
$$

Same notation as before; we call the next change $\Delta z_2$. The exact process we went through is used again, this time to get to the change in $z_2$

$$
\begin{eqnarray}
\Delta z_2 & \approx &
\frac{\partial z_2}{\partial a_1} \Delta a_1\\
& = & w_2 \Delta a_1
\end{eqnarray}
$$

We could replace $\Delta a_1$ with the equation

$$
\sigma'(z_1) \Delta b_1
$$

We just calculated this equation. Hopefully it's clear how we got here – it's the same process we used to calculate $\Delta a_1$.

This process would just keep repeating until we have calculated the whole network. By replacing the $\Delta a_j$ values, we get a final function that calculates the change in the cost function with relation to the whole network – that is, all the weights, biases and activations

$$
\begin{eqnarray}
\Delta C & \approx & \sigma'(z_1) w_2 \sigma'(z_2) \ldots \sigma'(z_4)
\frac{\partial C}{\partial a_4} \Delta b_1
\end{eqnarray}
$$

From this, we simply plug into the $\partial C/\partial b_1$ equation and get the final equation that we need

$$
\begin{eqnarray}
\frac{\partial C}{\partial b_1} = \sigma'(z_1) \, w_2 \sigma'(z_2) \,
w_3 \sigma'(z_3) \, w_4 \sigma'(z_4) \, \frac{\partial C}{\partial a_4}
\end{eqnarray}
$$

From here, what happens is that if the weights $w_j$ are large all together, i.e. if many of the weights have a value greater than 1, we start multiplying big values. For an example, say all weights have some extremely high value, like 100, and we get some random outputs of the derivative of the sigmoid function between 0 and 0.25:

$$
\frac{\partial C}{\partial b_1} = 0.13 \times 100 \times 0.21 \times 100 \times 0.09 \times 100 \times 0.14 \times \frac{\partial C}{\partial a_4}
$$

The last partial derivative turns out to be $2 \left(a^{(L)} - y \right) \sigma' \left(z^{(L)}\right) w^{(L)}$, which plausibly could be much greater than 1, but for the sake of the example, we will set it equal to 1.

$$
\begin{eqnarray}
\frac{\partial C}{\partial b_1} & = &
0.13 \times 100 \times 0.21 \times 100 \times
0.09 \times 100 \times 0.14 \times 1\\
& = & 343.98
\end{eqnarray}
$$
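The multiplication above is trivial to reproduce (the weight value 100 and the sampled sigmoid derivatives are the made-up numbers from the example):

```
# Exploding gradient: large weights amplify the product of partial derivatives
dC_da4 = 1   # the last partial derivative, set to 1 as in the example

gradient = 0.13 * 100 * 0.21 * 100 * 0.09 * 100 * 0.14 * dC_da4
print(gradient)   # about 343.98
```

Even though each sigmoid derivative is at most $0.25$, the weights of 100 dominate and the product explodes.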

Using the update rule, and supposing that $b_1$ previously was equal to $1.56$ with a learning rate of $0.5$

$$
b_1 = 1.56 - 0.5 \times 343.98 = -170.43
$$

Although this is an extreme example, you get the point — the values of weights and biases can explode, and with them the updates throughout the whole network.

For a second, just imagine the rest of the network's weights and biases, and in turn activations, *explosively* updating their values. That is exactly what we call the *exploding gradient problem*. Obviously, the network won't learn much here, so this will completely ruin whatever task you are trying to solve.

This idea is basically setting up a rule for avoiding exploding gradients. I won't go into the math on this one, but I will give you the steps on how it's done.

- Pick a threshold value – if a gradient exceeds this value, clipping is applied.
- Decide whether to clip by value or by norm. With clipping by value, any gradient component exceeding the threshold (e.g. $0.5$ or $-0.5$) is clipped back to the threshold; with clipping by norm, the whole gradient vector is scaled down so its norm matches the threshold.

Note, however, that neither of these gradient methods avoids the vanishing gradient problem; we are going to explore more options for that further on. Typically, you need these kinds of methods if you are using a Recurrent Neural Network architecture (like LSTM or GRU), since that is where you usually experience exploding gradients.
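As a sketch of the two variants in plain Python (the threshold of 0.5 matches the example above; real frameworks provide this built in, e.g. via an optimizer's clipping options):

```
import math

def clip_by_value(grads, threshold=0.5):
    # Clip each component back into [-threshold, threshold]
    return [max(-threshold, min(threshold, g)) for g in grads]

def clip_by_norm(grads, threshold=0.5):
    # Scale the whole gradient vector so its L2 norm is at most the threshold
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= threshold:
        return list(grads)
    return [g * threshold / norm for g in grads]

grads = [0.9, -0.3, 1.2]
print(clip_by_value(grads))   # [0.5, -0.3, 0.5]
print(clip_by_norm(grads))    # same direction, but norm scaled down to 0.5
```

Clipping by norm preserves the direction of the gradient vector, while clipping by value distorts it slightly — which is why clipping by norm is often preferred.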

Rectified Linear Unit. This is how we fix the vanishing gradient problem, but does it cause other problems? Follow along.

This is the equation for ReLU:

$$
\text{ReLU}(x)=\max(0,x)
$$

The ReLU equation tells us this:

- If the input $x$ is less than 0, the output is 0
- If the input is greater than 0, the output equals the input

This is how you could explain the ReLU function in a graph: everything with an x-value less than zero maps to a y-value of zero, while everything greater than zero maps to its own y-value. That is, if we input $x=1$, we get $y=1$ back.

That's great, but how does that relate to the vanishing gradient problem? Firstly, we have to obtain the differentiated equation:

$$
\text{ReLU}'(x) =
\begin{cases}
\mbox{$1$} & \mbox{if } x > 0\\
\mbox{$0$} & \mbox{if } x \leq 0
\end{cases}
$$

This tells us that

- If the input $x$ is greater than 0, the output is 1
- If the input is less than or equal to 0 (the $\leq$ symbol), the output is 0

This is the graph for it

Now we have the answer: we don't get extremely small gradient values when using a ReLU activation function — like $0.0000000438$ from the sigmoid function. Instead the derivative is either 0, causing some of the gradients to contribute nothing, or 1.
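Both the ReLU and its derivative are one-liners; a quick sketch:

```
def relu(x):
    # max(0, x)
    return max(0.0, x)

def d_relu(x):
    # 1 for positive inputs, 0 otherwise
    return 1.0 if x > 0 else 0.0

print(relu(1.0), relu(-3.0))      # 1.0 0.0
print(d_relu(10.0), d_relu(-10.0))  # 1.0 0.0 -- never a tiny in-between value
```

Compare `d_relu(10.0)` with the sigmoid's `0.0000454` for the same input — this is exactly how the ReLU sidesteps the vanishing gradient problem.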

This spawns another problem, though. The '*dead' *ReLU problem.

What happens if too many inputs are below 0 when calculating the gradients? We get a bunch of weights and biases that are not updated, since the update is equal to zero. To see this in action, let's reuse the exploding gradient example.

Let's denote ReLU as $R$ in this equation, where we just replaced every sigmoid $\sigma$ with $R$

$$
\begin{eqnarray}
\frac{\partial C}{\partial b_1} = R'(z_1) \, w_2
R'(z_2) \, w_3 R'(z_3) \, w_4 R'(z_4)
\, \frac{\partial C}{\partial a_4}
\end{eqnarray}
$$

Now let's say a random $z$ input to the differentiated ReLU is less than 0 — the function will cause this bias to *'die'*. Let's say it's $R'(z_3)=0$

$$
\begin{eqnarray}
\frac{\partial C}{\partial b_1} = R'(z_1) \, w_2
R'(z_2) \, w_3 \times 0 \times w_4 R'(z_4)
\, \frac{\partial C}{\partial a_4}
= 0
\end{eqnarray}
$$

In turn, when we get that $R'(z_3)=0$, the whole product becomes a whopping zero, which causes this bias to die. We know that the new value of the bias is the old bias minus the learning rate *times* the gradient, which means we get an update of zero.

When we introduce the ReLU function to a neural network, we also introduce great sparsity. Now what does this term *sparsity *actually mean?

Sparse: *small in numbers or amount, often spread over a large area*. In neural networks this means that the matrices of activations have many $0$s. When some percentage (e.g. 50%) of the activations are zero, we would call the neural network sparse. What do we obtain by this sparsity? An increase in efficiency with regards to time and space complexity – constant values often require less space and are less computationally expensive.

It has been observed by Yoshua Bengio et al., that this component of ReLUs actually make a neural network perform better, with the aforementioned efficiencies of time and space complexity.

Pros:

- Less time and space complexity, because of sparsity; and compared to the sigmoid, it does not involve the exponential operation, which is more costly.
- Avoids the vanishing gradient problem.

Cons:

- Introduces the *dead relu* problem, where components of the network are most likely never updated to a new value. This can sometimes also be a pro.
- ReLU does not avoid the exploding gradient problem.

Exponential Linear Unit. This activation function fixes some of the problems with ReLUs and keeps some of the positive things. For this activation function, an alpha $\alpha$ value is picked; a common value is between $0.1$ and $0.3$.

The equation is a little more scary to look at, if you are not as much into math:

$$
\text{ELU}(x) =
\begin{cases}
\mbox{$x$} & \mbox{if } x > 0\\
\mbox{$\alpha (e^x-1)$} & \mbox{if } x \leq 0
\end{cases}
$$

Let me explain. If you input an x-value that is greater than zero, then it's the same as the ReLU – the result will be a y-value equal to the x-value. But this time, if the input value $x$ is less than 0, we get a value slightly below zero.

The y-value you get depends both on your x-value input, but also on a parameter alpha $\alpha$, which you can adjust as needed. Furthermore, we introduce an exponential operation $e^x$, which means the ELU is more computationally expensive than the ReLU.

The ELU function is plotted below with an $\alpha$ value of 0.2.

It's pretty straight forward, and we should still be good on the vanishing gradient problem, seeing as the input values don't map to extremely small output values.

But what about the derivative of the ELU? This is at least as important to show.

$$
\text{ELU}'(x) =
\begin{cases}
\mbox{$1$} & \mbox{if } x > 0\\
\mbox{ELU$(x) + \alpha$} & \mbox{if } x \leq 0
\end{cases}
$$

Seems simple enough. The output is $1$ if $x$ is greater than 0. If the input $x$ is less than or equal to zero, the output is the ELU function (not differentiated) plus the alpha value.

The plot for it looks like this:

As you might have noticed, we avoid the *dead relu* problem here, though we give up some of the computational speed gained by the ReLU activation function, since the ELU involves the exponential operation.
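A sketch of the ELU and its derivative, using $\alpha = 0.2$ (the value assumed in the plots above):

```
import math

ALPHA = 0.2   # the alpha value assumed in the plots above

def elu(x, alpha=ALPHA):
    return x if x > 0 else alpha * (math.exp(x) - 1)

def d_elu(x, alpha=ALPHA):
    # For x <= 0 this equals ELU(x) + alpha, i.e. alpha * e^x
    return 1.0 if x > 0 else elu(x, alpha) + alpha

print(elu(2.0))     # 2.0
print(elu(-1.0))    # about -0.1264
print(d_elu(-1.0))  # about 0.0736, i.e. 0.2 * e^-1 -- small, but never exactly zero
```

Notice that `d_elu` never returns exactly zero for finite inputs, which is why the ELU sidesteps the *dead relu* problem.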

Pros

- Avoids the *dead relu* problem.
- Produces negative outputs, which helps the network nudge weights and biases in the right directions.
- Produces activations instead of letting them be zero, when calculating the gradient.

Cons

- Introduces longer computation time, because of the exponential operation included
- Does not avoid the exploding gradient problem
- The neural network does not learn the alpha value

Leaky Rectified Linear Unit. This activation function also has an alpha $\alpha$ value, commonly between $0.1$ and $0.3$. The Leaky ReLU is commonly used; it has some drawbacks compared to the ELU, but also some positives compared to the ReLU.

The Leaky ReLU takes this mathematical form

$$
\text{LReLU}(x) =
\begin{cases}
\mbox{$x$} & \mbox{if } x > 0\\
\mbox{$\alpha x$} & \mbox{if } x \leq 0
\end{cases}
$$

So, if the input $x$ is greater than $0$, then the output is $x$. If the input is less than $0$, the output will be alpha $\alpha$ times the input.

This means that we solve the *dead relu* problem, because the values of the gradients can no longer be stuck at zero – and this function also avoids the vanishing gradient problem. One issue remains: we still have to deal with exploding gradients; more on how to prevent this later, in the code section.

The Leaky ReLU is plotted here, with an assumption of alpha $\alpha$ being $0.2$:

As explained by the equation, any x-value maps to the same y-value if the x-value is greater than $0$. But if the x-value is less than $0$, we have a coefficient of alpha, which is $0.2$. That means, if the input value $x$ is $-5$, the output value it maps to is $-1$.

Now, the derivative of the Leaky ReLU function is a bit simpler, as it entails two linear cases.

$$
\text{LReLU}'(x) =
\begin{cases}
\mbox{$1$} & \mbox{if } x > 0\\
\mbox{$\alpha$} & \mbox{if } x \leq 0
\end{cases}
$$

The first linear case is when the input $x$ is greater than $0$; then the output is $1$. But if the input is less than or equal to $0$, the output is the value of alpha, which we chose to be $0.2$ in this instance.

Notice how the derivative is constant on each side – either $x$ is greater than zero and we get an output value of $1$, or $x$ is less than or equal to zero and we get an output value of alpha $\alpha$, which we chose to be $0.2$
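A sketch with $\alpha = 0.2$, matching the plots:

```
ALPHA = 0.2

def leaky_relu(x, alpha=ALPHA):
    return x if x > 0 else alpha * x

def d_leaky_relu(x, alpha=ALPHA):
    return 1.0 if x > 0 else alpha

print(leaky_relu(5.0))     # 5.0
print(leaky_relu(-5.0))    # -1.0 -- a small negative, never stuck at zero
print(d_leaky_relu(-5.0))  # 0.2
```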

Pros

- Like the ELU, we avoid the *dead relu* problem, since we allow a small gradient when computing the derivative.
- Faster to compute than the ELU, because no exponential operation is included.

Cons

- Does not avoid the exploding gradient problem
- The neural network does not learn the alpha value
- Becomes a linear function when it is differentiated, whereas the ELU's derivative is partly linear and partly nonlinear.

Scaled Exponential Linear Unit. This activation function is one of the newer ones, and its paper comes with a particularly long appendix (90 pages) of theorems, proofs etc. When using this activation function in practice, one must use *lecun_normal* for weight initialization, and if dropout is to be applied, one should use *AlphaDropout*. More on this later in the code section.

The authors have calculated two values; an alpha $\alpha$ and lambda $\lambda$ value for the equation, which I'm going to show first

$$
\alpha \approx 1.6732632423543772848170429916717
$$

$$
\lambda \approx 1.0507009873554804934193349852946
$$

They have that many numbers after the decimal point for absolute precision, and they are predetermined, which means we do not have to worry about picking the right alpha value for this activation function.

To be honest, the equation just looks like the other equations, which it more or less is. All the newer activation functions look like combinations of the existing activation functions.

The equation for it looks like this:

$$
\text{SELU}(x) = \lambda
\begin{cases}
\mbox{$x$} & \mbox{if } x > 0\\
\mbox{$\alpha e^x-\alpha$} & \mbox{if } x \leq 0
\end{cases}
$$

That is, if the input value $x$ is greater than zero, the output value becomes $x$ multiplied by lambda $\lambda$. If the input value $x$ is less than or equal to zero, we multiply alpha by the exponential of the x-value, subtract alpha, and then multiply by the lambda value – a curve that approaches $-\lambda\alpha$ for very negative $x$ and rises up to $0$ as $x$ approaches zero.
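A sketch of the SELU using the two predetermined constants:

```
import math

# The predetermined constants from the SELU paper
ALPHA = 1.6732632423543772848170429916717
LAMBDA = 1.0507009873554804934193349852946

def selu(x):
    if x > 0:
        return LAMBDA * x
    return LAMBDA * (ALPHA * math.exp(x) - ALPHA)

print(selu(1.0))    # about 1.0507 -- the positive slope is lambda > 1
print(selu(0.0))    # 0.0
print(selu(-10.0))  # close to -lambda * alpha, about -1.758
```

Note the slope $\lambda > 1$ on the positive side — this is exactly property (3) from the quote below, the slope larger than one that lets the function increase variance when needed.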

The SELU activation is self-normalizing the neural network; and what does that mean?

Well, let's start with: what is normalization? Simply put, we first subtract the mean, then divide by the standard deviation. So the components of the network (weights, biases and activations) will have a mean of zero and a standard deviation of one after normalization. This will be the output value of the SELU activation function.

What do we achieve with a mean of zero and a standard deviation of one? Under the assumption that the initialization function *lecun_normal* initializes the parameters of the network from a normal (Gaussian) distribution, the case for SELU is that the network will be normalized entirely, within the bounds described in the paper. Essentially, when multiplying or adding components of such a network, the result is still considered Gaussian. This is what we call normalization. In turn, this means that the whole network, including its output in the last layer, is also normalized.

How a normal distribution looks with a mean $\mu$ of zero and a standard deviation $\sigma$ of one.

The output of a SELU is normalized, which could be called *internal normalization*, since all the outputs have a mean of zero and standard deviation of one, as just explained. This is different from *external normalization*, where batch normalization and other methods are used.

Okay, great, the components are normalized. But how does it actually happen?

The simple explanation is that variance decreases when the input is less than zero, and variance increases when the input is greater than zero – and the standard deviation is the square root of variance, so that is how we get to a standard deviation of one.

We get to a mean of zero by the gradients, where we need some positive and some negative, to be able to shift the mean to zero. As you might recall (per my last post), the gradients adjust the weights and biases in a neural network, so we need some negative and positive from the output of those gradients to be able to control the mean.

The main point of the mean *mu* $\mu$ and variance *nu* $\nu$ is that we have some domain omega $\Omega$, where we always map the mean and variance within predefined intervals. These intervals are defined as follows:

$$
\text{mean} = \mu \in \left[ -0.1,\, 0.1 \right]
$$

$$
\text{variance} = \nu \in \left[ 0.8,\, 1.5 \right]
$$

The $\in$ symbol just means that the mean or variance is within those predefined intervals. In turn, this causes the network to avoid vanishing and exploding gradients.

I want to quote from the paper, one which I find important and how they got to this activation function:

SELUs allow to construct a mapping g with properties that lead to SNNs [self-normalizing neural networks]. SNNs cannot be derived with (scaled) rectified linear units (ReLUs), sigmoid units, tanh units, and leaky ReLUs. The activation function is required to have (1) negative and positive values for controlling the mean, (2) saturation regions (derivatives approaching zero) to dampen the variance if it is too large in the lower layer, (3) a slope larger than one to increase the variance if it is too small in the lower layer, (4) a continuous curve. The latter ensures a fixed point, where variance damping is equalized by variance increasing. We met these properties of the activation function by multiplying the exponential linear unit (ELU) [7] with $\lambda > 1$ to ensure a slope larger than one for positive net inputs.

Let's move onto how the differentiated function look like. Here it is:

$$
\text{SELU}'(x) = \lambda
\begin{cases}
\mbox{$1$} & \mbox{if } x > 0\\
\mbox{$\alpha e^x$} & \mbox{if } x \leq 0
\end{cases}
$$

Well.. nothing too fancy that we can't explain with simple words. If $x$ is greater than zero, the output is simply lambda $\lambda$ (the case value $1$ times $\lambda$). If $x$ is less than or equal to zero, the output is lambda times alpha times the exponential of $x$.

The plot is seen as below, and it looks very unique.

Note that the SELU function also requires a weight initialization method called LeCun Normal, and if you want to use dropout, you have to use a special version called Alpha Dropout.

Pros

- Internal normalization is faster than external normalization, which means the network converges faster.
- Vanishing and exploding gradient problems are *impossible*, as shown by theorems 2 & 3 in the appendix.

Cons

- Relatively new activation function – more papers are needed on architectures such as CNNs and RNNs, where it is still comparatively unexplored. One paper covers CNNs with SELU here.

Gaussian Error Linear Unit. An activation function used in the most recent Transformers – Google's BERT and OpenAI's GPT-2. The paper is from 2016, but it has only recently been catching attention.

This activation function takes the form of this equation:

$$
\text{GELU}(x) = 0.5x\left(1+\text{tanh}\left(\sqrt{2/\pi}(x+0.044715x^3)\right)\right)
$$

So it's just a combination of some functions (e.g. hyperbolic tangent *tanh*) and approximated numbers – there is not much to say about it. What is more interesting, is looking at the graph for the gaussian error linear unit:

The curve dips slightly below zero for small negative x-values, then shifts to follow the identity. So when $x$ is greater than zero, the output is close to $x$, except roughly between $x=0$ and $x=1$, where it leans slightly toward a smaller y-value.
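The tanh approximation above translates directly to code:

```
import math

def gelu(x):
    # The tanh approximation of the GELU
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

print(gelu(0.0))    # 0.0
print(gelu(1.0))    # about 0.841 -- slightly below the identity for 0 < x < 1
print(gelu(-0.5))   # about -0.154 -- the small dip below zero
```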

I was unable to find the derivative anywhere, so naturally I turned to WolframAlpha and got it to differentiate the function. The result was this equation:

$$
\text{GELU}'(x) = 0.5\text{tanh}(0.0356774x^3 + 0.797885 x) + (0.0535161 x^3 + 0.398942 x) \text{sech}^2(0.0356774x^3+0.797885x)+0.5
$$

This is just another combination of hyperbolic functions as before. The plot of this function does look interesting though.

Pros

- Seems to be state-of-the-art in NLP, specifically Transformer models – i.e. it performs best
- Avoids vanishing gradients problem

Cons

- Fairly new in practical use, although introduced in 2016.

Let's say you want to try out all these activation functions, to find out which one is the best. How would you go about that? Usually we perform hyperparameter optimization – and this can be done using scikit-learn's GridSearchCV function. But we want to compare, so the idea is that we pick some hyperparameters and keep them constant, while changing the activation function.

Let me sketch for you, what I'm trying to do here:

- Train the same neural network model with each of the activation functions mentioned in this post
- Using the training history for each activation function, make a plot of loss and accuracy over epochs.

This code was also published on GitHub with a colab button, so you can instantly run it for yourself; here is the link.

I prefer using the high-level API of Keras, so this is going to be done in Keras.

We start by importing everything we need. Note that four libraries are used here: tensorflow, numpy, matplotlib and keras.

```
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
from keras.utils.np_utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D, Activation, LeakyReLU
from keras.layers.noise import AlphaDropout
from keras.utils.generic_utils import get_custom_objects
from keras import backend as K
from keras.optimizers import Adam
```

From here, we would want to load a dataset to run this experiment on; let's choose the MNIST dataset. We can import it directly from Keras.

```
(x_train, y_train), (x_test, y_test) = mnist.load_data()
```

That's great, but we want to preprocess the data so as to normalize it. Mainly, we `.reshape` the images, divide by 255 (the max RGB value), and finally one-hot encode the labels with `to_categorical()`.

```
def preprocess_mnist(x_train, y_train, x_test, y_test):
    # Reshaping all images to 28x28 pixels with a single channel
    x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
    x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
    input_shape = (28, 28, 1)
    # Float values for division
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    # Normalizing the RGB codes by dividing by the max RGB value
    x_train /= 255
    x_test /= 255
    # One-hot encoding the y values
    y_train = to_categorical(y_train)
    y_test = to_categorical(y_test)
    return x_train, y_train, x_test, y_test, input_shape

x_train, y_train, x_test, y_test, input_shape = preprocess_mnist(x_train, y_train, x_test, y_test)
```

Now that we have preprocessed the data, we are ready to build models and define some things for Keras to run. Let's start with the convolutional neural network model itself. The SELU activation function is a special case, where we need to use the kernel initializer `'lecun_normal'` and a special form of dropout, `AlphaDropout()`; everything else is normal here.

```
def build_cnn(activation,
              dropout_rate,
              optimizer):
    model = Sequential()
    if activation == 'selu':
        model.add(Conv2D(32, kernel_size=(3, 3),
                         activation=activation,
                         input_shape=input_shape,
                         kernel_initializer='lecun_normal'))
        model.add(Conv2D(64, (3, 3), activation=activation,
                         kernel_initializer='lecun_normal'))
        model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(AlphaDropout(0.25))
        model.add(Flatten())
        model.add(Dense(128, activation=activation,
                        kernel_initializer='lecun_normal'))
        model.add(AlphaDropout(0.5))
        model.add(Dense(10, activation='softmax'))
    else:
        model.add(Conv2D(32, kernel_size=(3, 3),
                         activation=activation,
                         input_shape=input_shape))
        model.add(Conv2D(64, (3, 3), activation=activation))
        model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(Dropout(0.25))
        model.add(Flatten())
        model.add(Dense(128, activation=activation))
        model.add(Dropout(0.5))
        model.add(Dense(10, activation='softmax'))

    # Categorical crossentropy matches the one-hot encoded 10-class labels
    model.compile(
        loss='categorical_crossentropy',
        optimizer=optimizer,
        metrics=['accuracy']
    )
    return model
```

We do have a small problem with the GELU function; it does not yet exist in Keras. Luckily for us, it's quite easy to add a new activation function to Keras.

```
# Add the GELU function to Keras
def gelu(x):
    return 0.5 * x * (1 + tf.tanh(tf.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3))))

get_custom_objects().update({'gelu': Activation(gelu)})
# Add leaky-relu so we can use it as a string
get_custom_objects().update({'leaky-relu': Activation(LeakyReLU(alpha=0.2))})

act_func = ['sigmoid', 'relu', 'elu', 'leaky-relu', 'selu', 'gelu']
```

Now we are ready to train the model using the different activation functions defined in the `act_func` array. We run a simple for-loop over each activation function, and add its results to an array.

```
result = []
for activation in act_func:
    print('\nTraining with -->{0}<-- activation function\n'.format(activation))
    model = build_cnn(activation=activation,
                      dropout_rate=0.2,
                      optimizer=Adam(clipvalue=0.5))
    # validation_data takes precedence over validation_split,
    # so we simply validate on the test set
    history = model.fit(x_train, y_train,
                        batch_size=128,  # 128 is faster, but less accurate. 16/32 recommended
                        epochs=100,
                        verbose=1,
                        validation_data=(x_test, y_test))
    result.append(history)
    K.clear_session()
    del model

print(result)
```

From this, we can plot the history that we got from `model.fit()` for each activation function, and have a look at how everything went.

Now we are ready to plot our data, and I made some short code using matplotlib

```
new_act_arr = act_func[1:]
new_results = result[1:]

def plot_act_func_results(results, activation_functions=[]):
    plt.figure(figsize=(10, 10))
    plt.style.use('dark_background')

    # Plot validation accuracy values
    for act_func in results:
        plt.plot(act_func.history['val_acc'])
    plt.title('Model accuracy')
    plt.ylabel('Test Accuracy')
    plt.xlabel('Epoch')
    plt.legend(activation_functions)
    plt.show()

    # Plot validation loss values
    plt.figure(figsize=(10, 10))
    for act_func in results:
        plt.plot(act_func.history['val_loss'])
    plt.title('Model loss')
    plt.ylabel('Test Loss')
    plt.xlabel('Epoch')
    plt.legend(activation_functions)
    plt.show()

plot_act_func_results(new_results, new_act_arr)
```

This produces the graphs below

This is a GO-TO book that is practical and the BEST in the industry:

Here are four books that are well-written:

Here are the papers that were discussed in this article:

Towards *really* understanding neural networks — One of the most recognized concepts in Deep Learning (subfield of Machine Learning) is neural networks.

Something fairly important is that **all types of neural networks are different combinations of the same basic principles**. When you know the basics of how neural networks work, new architectures are just small additions to everything you already know about neural networks.

Moving forward, the above will be the primary motivation for every other deep learning post on this website.

The big picture in neural networks is how we go from having some data, throwing it into some algorithm and hoping for the best. But what happens inside that algorithm? This question is important to answer, for many reasons; one being that you might otherwise regard the inner workings of a neural network as a black box.

Neural networks consist of neurons, connections between these neurons called weights, and biases connected to each neuron. We distinguish between input, hidden and output layers, where we hope each layer helps us towards solving our problem.

To move forward through the network, called a forward pass, we iteratively use a formula to calculate each neuron in the next layer. Don't worry about the notation just yet; we call the neurons activations $a$, the weights $w$ and the biases $b$ — each collected into vectors.

$a^{(l)}=
\sigma\left(
\boldsymbol{W}\boldsymbol{a}^{(l-1)}+\boldsymbol{b}
\right)$

This takes us forward, until we get an output. We measure how good this output $\hat{y}$ is by a cost function $C$ and the result we wanted in the output layer $y$, and we do this for every example. This one is commonly called mean squared error (MSE):

$$
C = \frac{1}{n} \sum_{i=1}^n (y_i-\hat{y}_i)^2
$$

Given the first result, we go back and adjust the weights and biases, so that we optimize the cost function — called a backwards pass. We essentially try to adjust the whole neural network, so that the output value is optimized. In a sense, this is how we tell the algorithm that it performed poorly or well. We keep trying to optimize the cost function by running through new observations from our dataset.

To update the network, we calculate so called *gradients*, which are small nudges (updates) to individual weights in each layer.

$$
\frac{\partial C}{\partial w^{(L)}}
=
\frac{\partial C}{\partial a^{(L)}}
\frac{\partial a^{(L)}}{\partial z^{(L)}}
\frac{\partial z^{(L)}}{\partial w^{(L)}}
$$

We simply go through each weight, e.g. in the output layer, and subtract the learning rate times the partial derivative of the cost with respect to that particular weight from the original value that weight had.

$$
w^{(L)} = w^{(L)} - \text{learning rate} \times \frac{\partial C}{\partial w^{(L)}}
$$

Add something called mini-batches, where we average the gradient over some defined number of observations per mini-batch, and then you have the basic neural network setup.

I'm going to explain each part in great detail if you continue reading further. Refer to the table of contents if you want to read about something specific.

We start off with feedforward neural networks, then into the notation for a bit, then a deep explanation of backpropagation, and at last an overview of how optimizers help us use the backpropagation algorithm, specifically stochastic gradient descent.

There is so much terminology to cover. Let me just take it step by step, and then you will need to sit tight.

A neural network is an algorithm inspired by the neurons in our brain. It is designed to recognize patterns in complex data, and often performs best when recognizing patterns in audio, images or video.

A neural network simply consists of neurons (also called nodes). These nodes are connected in some way. Then each neuron holds a number, and **each connection holds a weight.**

These neurons are split between the **input, hidden and output layer**. In practice, there are many layers and there are no general best number of layers.

The idea is that we input data into the input layer, which sends the numbers from our data ping-ponging forward, through the different connections, from one neuron to another in the network. Once we reach the output layer, we hopefully have the number we wished for.

The input data is just your dataset, where each observation is run through sequentially from $x=1,...,x=i$. Each neuron has some **activation** — a value between 0 and 1, where 1 is the maximum activation and 0 is the minimum activation a neuron can have. That is, if we use the activation function called sigmoid, explained below. Thus, it is recommended to scale your data to values between 0 and 1 (e.g. by using MinMaxScaler from Scikit-Learn).
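As a side note, scaling to the 0-1 range is a one-liner with Scikit-Learn; here is a minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix with values well outside [0, 1]
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 600.0]])

scaler = MinMaxScaler()  # defaults to feature_range=(0, 1)
X_scaled = scaler.fit_transform(X)

print(X_scaled.min(), X_scaled.max())  # each column now spans 0 to 1
```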

The input layer is given to us by the dataset that we input, but what about the layers after it? What happens is just a lot of ping-ponging of numbers — nothing more than basic math operations. We look at all the neurons in the input layer, which are connected to a new neuron in the next layer (which is a hidden layer).

**Remember this:** each neuron has an activation *a*, and each connection to a new neuron has a weight *w*. Activations are typically a number within the range of 0 to 1, and a weight is any real number, e.g. 2.2, -1.2, 0.4 etc.

**Then**, one could multiply activations by weights and sum them to get a single neuron in the next layer, from the first weights and activations $w_1a_1$ all the way to $w_na_n$ (see Stochastic Gradient Descent below for how the weights are found):

$$
w_1a_1+w_2a_2+...+w_na_n = \text{new neuron}$$

That is, multiply and sum *n* pairs of weights and activations to get the value of a new neuron. For example:

$$
1.1 \times 0.3+2.6 \times 1.0 = 2.93$$
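A quick sanity-check of the arithmetic above:

```python
# Reproducing the worked example: two weights, two activations
weights = [1.1, 2.6]
activations = [0.3, 1.0]

# w1*a1 + w2*a2 = 1.1*0.3 + 2.6*1.0
new_neuron = sum(w * a for w, a in zip(weights, activations))
print(new_neuron)  # 2.93
```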

The procedure is the same moving forward in the network of neurons, hence the name **feedforward neural network.**

But.. things are not that simple. We also have an **activation function**, most commonly a sigmoid function, which just scales the output to be between 0 and 1 again — so it is a logistic function. In future posts, a comparison or walkthrough of many activation functions will be posted.

$$
\text{sigmoid} = \sigma = \frac{1}{1+e^{-x}}= \text{number between 0 and 1}
$$

We wrap the equation for new neurons in the activation function, i.e. we apply the sigmoid to the sum of the products of weights and activations:

$$
\sigma(w_1a_1+w_2a_2+...+w_na_n) = \text{new neuron}
$$

Now we just need to explain adding a **bias** to the equation, and then you have the basic setup of calculating a new neuron's value.

**Bias** is trying to approximate where the value of the new neuron starts to be meaningful. So you would try to add or subtract a bias from the multiplication of activations and weights.

$$
\sigma(w_1a_1+w_2a_2+...+w_na_n + b) = \text{new neuron}
$$
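As a minimal sketch in plain Python (the numbers are made up), calculating a new neuron's value looks like this:

```python
import math

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def neuron(weights, activations, bias):
    # sigma(w1*a1 + ... + wn*an + b)
    z = sum(w * a for w, a in zip(weights, activations)) + bias
    return sigmoid(z)

# Hypothetical weights, activations and bias
out = neuron([1.1, 2.6], [0.3, 1.0], bias=-1.0)
print(out)  # a value between 0 and 1
```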

There are many types of activation functions; here is an overview:

This is all there is to a very basic neural network, the feedforward neural network. But we need to introduce other algorithms into the mix, to introduce you to how such a network actually learns.

Before moving into the heart of what makes neural networks learn, we have to talk about the notation. At least for me, I got confused about the notation at first, because not many people take the time to explain it.

Before moving into the more advanced algorithms, I would like to provide some of the notation and general math knowledge for neural networks — or at least resources for it, if you don't know linear algebra or calculus.

When learning neural network theory, one will often find that most of the neurons and layers are formatted in linear algebra. Note that I did a short series of articles, where you can learn linear algebra from the bottom up. I would recommend reading most of them and trying to understand them. Leave a comment if you don't understand something, and I will do my best to answer in time.

The notation is quite neat, but can also be cumbersome. Let me start from the bottom of the final equation and then explain my way down to the previous equation:

$$
\sigma(w_1a_1+w_2a_2+...+w_na_n\pm b) = \text{new neuron}
$$

So what we start off with is organising activations and weights into a corresponding matrix.

We denote each activation by $a_{neuron}^{(layer)}$, e.g. where $a_{2}^{(1)}$ would correspond to the third neuron in the second layer (we count from 0). So the number below (subscript) corresponds to which neuron we are talking about, and the number above (superscript) corresponds to which layer we are talking about, counting from zero.

We denote each weight by $w_{to,from}$ where *to* is denoted as $j$ and *from* denoted as $k$, e.g. $w_{2,3}^{2}$ means *to* the third neuron in the third layer, *from* the fourth neuron in the previous layer (the second layer), since we count from zero. It also makes sense when checking up on the matrix for $w$, but I won't go into the details here.

To calculate each activation in the next layer, we need all the activations from the previous layer:

$$
\begin{bmatrix}
a_0^{0}\\
a_1^{0}\\
\vdots \\
a_n^{0}\\
\end{bmatrix}
$$

And all the weights connected to each neuron in the next layer:

$$
\begin{bmatrix}
w_{0,0} & w_{0,1} & \cdots & w_{0,k}\\
w_{1,0} & w_{1,1} & \cdots & w_{1,k}\\
\vdots & \vdots & \ddots & \vdots \\
w_{j,0} & w_{j,1} & \cdots & w_{j,k}\\
\end{bmatrix}
$$

Combining these two, we do matrix multiplication (read my post on it), add a bias matrix and wrap the whole equation in the sigmoid function, to get:

$
\sigma \left(
\begin{bmatrix}
w_{0,0} & w_{0,1} & \cdots & w_{0,k}\\
w_{1,0} & w_{1,1} & \cdots & w_{1,k}\\
\vdots & \vdots & \ddots & \vdots \\
w_{j,0} & w_{j,1} & \cdots & w_{j,k}\\
\end{bmatrix}
\,
\begin{bmatrix}
a_0^{0}\\
a_1^{0}\\
\vdots \\
a_n^{0}\\
\end{bmatrix}
+
\begin{bmatrix}
b_0\\
b_1\\
\vdots \\
b_n\\
\end{bmatrix}
\right)
$

THIS is the final expression — the neat and perhaps cumbersome one, if you did not follow along:

$a^{(1)}=
\sigma\left(
\boldsymbol{W}\boldsymbol{a}^{(0)}+\boldsymbol{b}
\right)$

Sometimes we reduce the notation even further and replace the weights, activations and biases inside the sigmoid function with a single $z$:

$a^{(1)}=
\sigma\left(
\boldsymbol{z}
\right)$

To read it:

We take all the activations from the first layer $\boldsymbol{a^{0}}$, do matrix multiplication with all the weights connecting each neuron from the first to the second layer $\boldsymbol{W}$, add a bias matrix, and at last use the sigmoid function $\sigma$ on the result. From this, we get a matrix of all the activations in the second layer.
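The matrix form is a one-liner in NumPy. This is a minimal sketch with hypothetical sizes and values, not the article's exact network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 3 input neurons, 2 neurons in the next layer
a0 = np.array([0.5, 0.1, 0.9])        # activations of the first layer
W = np.array([[0.2, -0.4, 0.6],       # W[j, k]: weight *to* neuron j *from* neuron k
              [0.8,  0.3, -0.1]])
b = np.array([0.1, -0.2])             # one bias per neuron in the next layer

a1 = sigmoid(W @ a0 + b)              # the whole next layer in one line
print(a1.shape)  # (2,) — one activation per neuron in the next layer
```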

You need to know how to find the slope of a tangent line — that is, finding the derivative of a function. In practice, you don't actually need to know how to compute every derivative, but you should at least have a feel for what a derivative means.

There are different rules for differentiation; one of the most important and most used is the chain rule. Here is a list of rules for differentiation that are good to know if you want to calculate the gradients in the upcoming algorithms. The partial derivative, where we find the derivative of one variable and hold the rest constant, is also valuable to know about.

My own opinion is that you don't need to be able to do all the math by hand; you just have to understand the process behind these algorithms. I will pick apart each algorithm into a more down-to-earth understanding of the math behind these prominent algorithms.

**To summarize**, you should understand what these terms mean, or be able to do the calculations for:

- Matrices; matrix multiplication and addition, the notation of matrices.
- Derivatives; measuring the steepness of a graph at a particular point.
- Partial derivatives; the derivative of one variable, while the rest are held constant.
- The chain rule; differentiating a composite of two or more functions.
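If the chain rule feels abstract, you can verify it numerically. Below is a small sketch comparing the chain-rule derivative of the composite $\sin^2(x)$ with a finite-difference approximation:

```python
import math

# Composite: f(g(x)) with f(u) = u**2 and g(x) = sin(x)
# Chain rule: d/dx f(g(x)) = f'(g(x)) * g'(x) = 2*sin(x)*cos(x)
def composite(x):
    return math.sin(x) ** 2

def chain_rule_derivative(x):
    return 2 * math.sin(x) * math.cos(x)

x = 0.7                               # arbitrary evaluation point
h = 1e-6
numeric = (composite(x + h) - composite(x - h)) / (2 * h)  # central difference
print(numeric, chain_rule_derivative(x))  # the two values should agree closely
```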

Now that you understand the notation, we should move into the heart of what makes neural networks work. This algorithm is part of every neural network. When I break it down, there is some math, but don't be frightened. What the math does is actually fairly simple, if you get the big picture of backpropagation.

Backpropagation is the heart of every neural network. Firstly, we need to make a distinction between backpropagation and optimizers (which is covered later).

Backpropagation is for calculating the gradients efficiently, while optimizers are for training the neural network, using the gradients computed with backpropagation. In short, all backpropagation does for us is compute the gradients. Nothing more.

SO.. Err, how do we go backwards?

We always start from the output layer and propagate backwards, updating weights and biases for each layer.

**The idea is simple:** adjust the weights and biases throughout the network, so that we get the desired output in the output layer. Say we wanted the output neuron to be 1.0, then we would need to nudge the weights and biases so that we get an output closer to 1.0.

We can only change the weights and biases, but activations are direct calculations of those weights and biases, which means we can indirectly adjust every part of the neural network, to get the desired output — except for the input layer, since that is the dataset that you input.

Now, before the equations, let's define what each variable means. We have already defined some of them, but it's good to summarize. Some of this should be familiar to you, if you read the post.

PLEASE! Pay attention to the notation used between L, L-1 and *l*. I intentionally mix it up, so that you can get an understanding of how both of them work.

Firstly, let's start by defining the relevant equations. Note that any indexing explained earlier is left out here, and we abstract to each layer instead of each weight, bias or activation:

$$
z^{(L)}=w^{(L)} \times a^{(L-1)} +b
$$

$$
a^{(L)}=
\sigma\left(
\boldsymbol{z}^{(L)}
\right)
$$

$$
C=(a^{(L)}-y)^2
$$

More on the cost function later in the cost function section.

The way we might discover how to calculate gradients in the backpropagation algorithm is by thinking of this question:

How might we measure the change in the cost function in relation to a specific weight, bias or activation?

Mathematically, this is why we need to understand partial derivatives, since they allow us to compute the relationship between components of the neural network and the cost function. And as should be obvious, we want to minimize the cost function. When we know what affects it, we can effectively change the relevant weights and biases to minimize the cost function.

$$
\frac{\partial C}{\partial w^{(L)}}
=
\frac{\partial C}{\partial a^{(L)}}
\frac{\partial a^{(L)}}{\partial z^{(L)}}
\frac{\partial z^{(L)}}{\partial w^{(L)}}
=
2 \left(a^{(L)} - y \right) \sigma' \left(z^{(L)}\right) a^{(L-1)}
$$

If you are not a math student or have not studied calculus, this is not at all clear. So let me try to make it more clear.

The squished 'd' is the partial derivative sign. $\partial C/\partial w^{L}$ means that we look into the cost function $C$ and within it, we only take the derivative of $w^{L}$, i.e. the rest of the variables are left as is. I'm not showing how to differentiate in this article, as there are many great resources for that.

Although $w^{L}$ is not directly found in the cost function, we start by considering the change of $w$ in the $z$ equation, since that $z$ equation holds a $w$. Next we consider the change of $z^{L}$ in $a^{L}$, and then the change of $a^{L}$ in the function $C$. Effectively, this measures the change of a particular weight in relation to the cost function.

We measure a ratio between the weights (and biases) and the cost function. The ones with the largest ratio will have the greatest impact on the cost function and will give us 'the most bang for our buck'.
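To make this concrete, here is a small sketch (all numbers are hypothetical) that computes $\partial C/\partial w^{(L)}$ for a single weight using the chain-rule expression above, and checks it against a finite difference:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One weight, one previous activation, one bias, one target — all made up
a_prev, w, b, y = 0.8, 0.5, 0.1, 1.0

z = w * a_prev + b
a = sigmoid(z)

# Chain rule from the text: dC/dw = 2(a - y) * sigma'(z) * a_prev
sigma_prime = sigmoid(z) * (1 - sigmoid(z))
grad_w = 2 * (a - y) * sigma_prime * a_prev

# Sanity check: nudge w a tiny amount and watch the cost change
h = 1e-6
C_plus = (sigmoid((w + h) * a_prev + b) - y) ** 2
C_minus = (sigmoid((w - h) * a_prev + b) - y) ** 2
numeric = (C_plus - C_minus) / (2 * h)
print(grad_w, numeric)  # should agree to several decimals
```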

We need to move backwards in the network and update the weights and biases. Let's introduce how to do that with math. One equation for weights, one for biases and one for activations:

$$
\frac{\partial C}{\partial w^{(L)}}
=
\frac{\partial C}{\partial a^{(L)}}
\frac{\partial a^{(L)}}{\partial z^{(L)}}
\frac{\partial z^{(L)}}{\partial w^{(L)}}
$$

$$
\frac{\partial C}{\partial b^{(L)}}
=
\frac{\partial C}{\partial a^{(L)}}
\frac{\partial a^{(L)}}{\partial z^{(L)}}
\frac{\partial z^{(L)}}{\partial b^{(L)}}
$$

$$
\frac{\partial C}{\partial a^{(L-1)}}
=
\frac{\partial C}{\partial a^{(L)}}
\frac{\partial a^{(L)}}{\partial z^{(L)}}
\frac{\partial z^{(L)}}{\partial a^{(L-1)}}
$$

Remember that these equations just measure the ratio of how a particular weight affects the cost function, which we want to optimize. We optimize by stepping in the opposite direction of the gradient given by these equations. It really is (almost) that simple.

Each partial derivative from the weights and biases is saved in a *gradient vector*, which has as many dimensions as you have weights and biases. The gradient is the triangle symbol $\nabla$, with *n* being the number of weights and biases:

$$
-\nabla C(w_1, b_1,..., w_n, b_n)
=
\begin{bmatrix}
\frac{\partial C}{\partial w_1} \\
\frac{\partial C}{\partial b_1} \\
\vdots \\
\frac{\partial C}{\partial w_n} \\
\frac{\partial C}{\partial b_n}
\end{bmatrix}
$$

It is also a good idea to keep track of the activations, to see how the network reacts to changes, but we don't save them in the gradient vector. Importantly, they also help us measure which weights matter the most, since weights are multiplied by activations. From an efficiency standpoint, this is important to us.

You compute the gradient according to a mini-batch (often 16 or 32 is best) of your data, i.e. you subsample your observations into batches. For each observation in your mini-batch, you compute the gradient, and then average these gradients; this average becomes the gradient you step with, creating a step in the best average direction over the mini-batch.

Then you would update the weights and biases after each mini-batch. Each weight and bias is 'nudged' a certain amount for each layer *l*:

$$
w^{(l)} = w^{(l)} - \text{learning rate} \times \frac{\partial C}{\partial w^{(l)}}
$$

$$
b^{(l)} = b^{(l)} - \text{learning rate} \times \frac{\partial C}{\partial b^{(l)}}
$$

The learning rate is usually written as an alpha $\alpha$ or eta $\eta$.
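As a tiny sketch, the update rule above is a one-liner in Python (the gradient value here is a hypothetical output of backpropagation):

```python
# Gradient descent update for a single weight, as in the equation above
learning_rate = 0.1
w = 0.5            # current weight value (made up)
grad = 0.3         # hypothetical value of dC/dw from backpropagation

w = w - learning_rate * grad
print(w)  # 0.5 - 0.1*0.3 ≈ 0.47
```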

But this is not all there is to it. The three equations I showed are just for the output layer; if we were to move one layer back through the network, there would be more partial derivatives to compute for each weight, bias and activation. We have to move all the way back through the network and adjust each weight and bias.

Taking the rest of the layers into consideration, we have to chain more partial derivatives to find the weight in the first layer, but we do not have to compute anything else.

If we look at the hidden layer in the previous example, we would have to use the previous partial derivatives as well as three newly calculated partial derivatives. To help you see why, you should look at the dependency graph below, since it helps explain each layer's dependencies on the previous weights and biases.

Updating the weights and biases in layer 2 (or $L$) depends only on the cost function, and the weights and biases connected to layer 2. Similarly, for updating layer 1 (or $L-1$), the dependencies are on the calculations in layer 2 and the weights and biases in layer 1. This adds up: if we had more layers, there would be more dependencies. As you might find, this is why we call it 'backpropagation'.

As the graph above shows, to calculate the weights connected to the hidden layer, we will have to reuse the previous calculations for the output layer (L or layer 2). Let me just restate them:

$$
\frac{\partial C}{\partial w^{(2)}}
=
\frac{\partial C}{\partial a^{(2)}}
\frac{\partial a^{(2)}}{\partial z^{(2)}}
\frac{\partial z^{(2)}}{\partial w^{(2)}}
$$

$$
\frac{\partial C}{\partial b^{(2)}}
=
\frac{\partial C}{\partial a^{(2)}}
\frac{\partial a^{(2)}}{\partial z^{(2)}}
\frac{\partial z^{(2)}}{\partial b^{(2)}}
$$

If we wanted to calculate the updates for the weights and biases connected to the hidden layer (L-1 or layer 1), we would have to reuse some of the previous calculations.

$$
\frac{\partial C}{\partial w^{(1)}}
=
\underbrace{
\frac{\partial C}{\partial a^{(2)}}
\frac{\partial a^{(2)}}{\partial z^{(2)}}
}_\text{Reused from $\frac{\partial C}{\partial w^{(2)}}$}
\,
\frac{\partial z^{(2)}}{\partial a^{(1)}}
\frac{\partial a^{(1)}}{\partial z^{(1)}}
\frac{\partial z^{(1)}}{\partial w^{(1)}}
$$

$$
\frac{\partial C}{\partial b^{(1)}}
=
\underbrace{
\frac{\partial C}{\partial a^{(2)}}
\frac{\partial a^{(2)}}{\partial z^{(2)}}
}_\text{Reused from $\frac{\partial C}{\partial b^{(2)}}$}
\,
\frac{\partial z^{(2)}}{\partial a^{(1)}}
\frac{\partial a^{(1)}}{\partial z^{(1)}}
\frac{\partial z^{(1)}}{\partial b^{(1)}}
$$

We use all previous calculations, except the partial derivatives with respect to either the weights or biases of a layer, e.g. we don't reuse $\partial z^{(1)}/ \partial w^{(1)}$ (we obviously use some of $\partial C/ \partial w^{(1)}$).

If you look at the dependency graph above, you can connect these last two equations to the big curly bracket that says *"Layer 1 Dependencies" *on the left. Try to make sense of the notation used by linking up which layer L-1 is in the graph. This should make things more clear, and if you are in doubt, just leave a comment.

A small detail left out here is that if you calculate the weights first, then you can reuse the first four partial derivatives when calculating the updates for the bias, since they are the same. And of course the reverse.

Suppose we had another hidden layer, that is, if we have input-hidden-hidden-output — a total of four layers. Then we would just reuse the previous calculations for updating the previous layer. We essentially do this for every weight and bias for each layer, reusing calculations.

So.. if we suppose we had an extra hidden layer, the equation would look like this:

$$
\frac{\partial C}{\partial w^{(1)}}
=
\underbrace{
\frac{\partial C}{\partial a^{(3)}}
\frac{\partial a^{(3)}}{\partial z^{(3)}}
}_\text{From $w^{(3)}$}
\,
\underbrace{
\frac{\partial z^{(3)}}{\partial a^{(2)}}
\frac{\partial a^{(2)}}{\partial z^{(2)}}
}_\text{From $w^{(2)}$}
\,
\frac{\partial z^{(2)}}{\partial a^{(1)}}
\frac{\partial a^{(1)}}{\partial z^{(1)}}
\frac{\partial z^{(1)}}{\partial w^{(1)}}
$$

If you are looking for a concrete example with explicit numbers, I can recommend watching Lex Fridman from 7:55 to 20:33 or Andrej Karpathy's lecture on Backpropagation.

- Do a forward pass with the help of this equation

$a^{(l)}=
\sigma\left(
\boldsymbol{W}\boldsymbol{a}^{(l-1)}+\boldsymbol{b}
\right)$

- For each layer's weights and biases connecting to a new layer, backpropagate using the backpropagation algorithm with these equations (replace $w$ by $b$ when calculating biases)

$$
\frac{\partial C}{\partial w^{(3)}}
=
\frac{\partial C}{\partial a^{(3)}}
\frac{\partial a^{(3)}}{\partial z^{(3)}}
\frac{\partial z^{(3)}}{\partial w^{(3)}}
$$

$$
\frac{\partial C}{\partial w^{(2)}}
=
\underbrace{
\frac{\partial C}{\partial a^{(3)}}
\frac{\partial a^{(3)}}{\partial z^{(3)}}
}_\text{From $w^{(3)}$}
\,
\frac{\partial z^{(3)}}{\partial a^{(2)}}
\frac{\partial a^{(2)}}{\partial z^{(2)}}
\frac{\partial z^{(2)}}{\partial w^{(2)}}
$$

$$
\frac{\partial C}{\partial w^{(1)}}
=
\underbrace{
\frac{\partial C}{\partial a^{(3)}}
\frac{\partial a^{(3)}}{\partial z^{(3)}}
}_\text{From $w^{(3)}$}
\,
\underbrace{
\frac{\partial z^{(3)}}{\partial a^{(2)}}
\frac{\partial a^{(2)}}{\partial z^{(2)}}
}_\text{From $w^{(2)}$}
\,
\frac{\partial z^{(2)}}{\partial a^{(1)}}
\frac{\partial a^{(1)}}{\partial z^{(1)}}
\frac{\partial z^{(1)}}{\partial w^{(1)}}
$$

Continue adding more partial derivatives for each extra layer in the same manner as done here.

- Repeat for each observation/sample (or mini-batches with size less than 32)

Optimizers are how neural networks learn, using backpropagation to calculate the gradients.

Many factors contribute to how well a model performs. The way we measure performance, as may be obvious to some, is by a cost function.

The cost function gives us a value, which we want to optimize. There are too many cost functions to mention them all, but one of the simpler and often used cost functions is the mean of the squared differences, i.e. mean squared error.

$$
C = \frac{1}{n} \sum_{i=1}^n (y_i-\hat{y}_i)^2
$$

Where $y$ is what we want the output to be and $\hat{y}$ is the actual predicted output from the neural network. For all $n$ samples, we sum, starting from the first example $i=1$, the squares of the differences between the output we want $y$ and the predicted output $\hat{y}$ for each observation, and then divide by $n$.
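A direct translation of the formula into Python (the targets and predictions in the last line are made-up numbers):

```python
def mse(y_true, y_pred):
    # Mean of the squared differences between targets and predictions
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n

# Hypothetical targets vs. predictions
print(mse([1.0, 0.0, 1.0], [0.9, 0.2, 0.8]))  # (0.01 + 0.04 + 0.04) / 3 = 0.03
```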

There are obviously many factors contributing to how well a particular neural network performs. Complexity of model, hyperparameters (learning rate, activation functions etc.), size of dataset and more.

In Stochastic Gradient Descent, we take a mini-batch of random samples and perform an update to the weights and biases based on the average gradient over the mini-batch. Before training, the weights are randomly initialized to small values, and the biases can be initialized in many different ways — the easiest being initialized to 0.

- Define a cost function, with a vector as input (the weight or bias vector)
- Start at a random point along the x-axis and ask: which way should we step to decrease the cost function most quickly?
- Calculate the gradient using backpropagation, as explained earlier
- Step in the opposite direction of the gradient — the gradient points in the direction of steepest ascent, so we put a minus in front of it (or move the opposite way) to make it gradient descent

Getting a good grasp of what stochastic gradient descent looks like is pretty easy from the GIF below. Each step you see on the graph is a gradient descent step, meaning we calculated the gradient with backpropagation for some number of samples, to move in a direction.

We say that we want to reach the global minimum, the lowest point of the function. Though, this is not always possible; we are very likely to hit a local minimum, a point where the function slopes upwards on both the left and the right side. If we find a minimum, we say that our neural network has converged. If we don't, or we see a weird drop in performance, we say that the neural network has diverged.

If we calculate a positive derivative, we move to the left on the slope; if negative, we move to the right — until we are at a local minimum.
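The whole idea can be sketched in one dimension. Here we minimize the hypothetical cost $C(w) = (w-3)^2$, whose minimum is at $w = 3$:

```python
# 1-D gradient descent on C(w) = (w - 3)**2
def grad(w):
    return 2 * (w - 3)          # derivative of (w - 3)**2

w = 0.0                         # arbitrary starting point
learning_rate = 0.1
for _ in range(100):
    w = w - learning_rate * grad(w)  # step opposite the gradient

print(w)  # converges very close to 3
```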

Here, I will briefly break down what neural networks are doing into smaller steps.

Repeat for each mini-batch:

- Initialize the weights to small random numbers and let all biases be 0
- For each sample in the mini-batch, do a forward pass with the equation for calculating activations: $a^{(l)}=\sigma\left(\boldsymbol{W}\boldsymbol{a}^{(l-1)}+\boldsymbol{b}\right)$
- Calculate the gradients and update the gradient vector (the average of the updates over the mini-batch) by iteratively propagating backwards through the neural network. An example calculation of the partial derivative with respect to $w^{(1)}$ in an input-hidden-hidden-output neural network (4 layers): $\frac{\partial C}{\partial w^{(1)}} = \underbrace{ \frac{\partial C}{\partial a^{(3)}} \frac{\partial a^{(3)}}{\partial z^{(3)}} }_\text{From $w^{(3)}$} \, \underbrace{ \frac{\partial z^{(3)}}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} }_\text{From $w^{(2)}$} \, \frac{\partial z^{(2)}}{\partial a^{(1)}} \frac{\partial a^{(1)}}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial w^{(1)}}$
- Put a minus in front of the gradient vector, and update the weights and biases based on the gradient vector calculated from averaging over the nudges of the mini-batch
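The steps above can be sketched end-to-end in NumPy. This is a toy two-layer network on a made-up XOR dataset (so one "mini-batch" is the whole set), not production code — it just shows a forward pass, a chain-rule backward pass, and the minus-sign update reducing the loss:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dataset: the 4 samples of XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Step 1: small random weights, biases at 0
W1 = rng.normal(0.0, 0.5, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0.0, 0.5, (4, 1)); b2 = np.zeros(1)

def forward(X):
    a1 = sigmoid(X @ W1 + b1)       # hidden layer
    a2 = sigmoid(a1 @ W2 + b2)      # output layer
    return a1, a2

initial_loss = float(np.mean((forward(X)[1] - y) ** 2))

lr = 1.0
for _ in range(5000):
    a1, a2 = forward(X)
    # Backward pass: chain rule, matching the equations above
    d2 = 2 * (a2 - y) * a2 * (1 - a2)      # dC/dz2 (sigma'(z) = a(1 - a))
    d1 = (d2 @ W2.T) * a1 * (1 - a1)       # dC/dz1, reusing d2
    # Minus sign in front: step opposite the averaged gradients
    W2 -= lr * (a1.T @ d2) / len(X); b2 -= lr * d2.mean(axis=0)
    W1 -= lr * (X.T @ d1) / len(X);  b1 -= lr * d1.mean(axis=0)

final_loss = float(np.mean((forward(X)[1] - y) ** 2))
print(initial_loss, final_loss)  # the loss should have dropped
```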

Finished reading this article? Go read optimizers explained!

The most recommended book is the first bullet point. It will drag you through the latest and greatest, explaining concepts in great detail while keeping it practical. Probably the best book to start learning from, if you are a beginner or semi-beginner.

- THE *BEST* PRACTICAL BOOK (RECOMMENDED): 1st Edt.: *Hands-On Machine Learning* by Aurélien Géron (Kindle compatible). Preorder 2nd Edt.: *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems* by Aurélien Géron.
- GREAT book with precise explanations of math and code: *The Hundred-Page Machine Learning Book* by Andriy Burkov.
- Regarded as one of the best books: *Deep Learning* by Ian Goodfellow, Yoshua Bengio and Aaron Courville. This one is more academic and a little more cumbersome than the previous ones. Nevertheless, it is a good book.
- A great book that I have read: *Neural Networks and Deep Learning* by Michael Nielsen. He explains things clearly and picks those pesky math equations apart, so that you can understand them.

Have any questions? Leave a comment below. I'm here to answer or clarify anything.

Want an unbiased estimation of the true error of an algorithm? This is where you are going to find it. I will explain the what, why, when and how for nested cross-validation. Specifically, the concept will be explained with K-Fold cross-validation.

**Update 1:** The images in this article were updated to the new theme on the site. Multiprocessing was added to the GitHub package, along with other fixes. If you have any issues, please report them on GitHub and I will try to take action!

1. What Is Cross-Validation?
2. What Is Nested Cross-Validation?
3. When To Use Nested Cross-Validation?
4. Other findings for Nested Cross-Validation
5. What To Do After Nested Cross-Validation
6. Code for Nested Cross-Validation

Firstly, a short explanation of cross-validation. K-Fold cross-validation is when you split up your dataset into $K$ partitions — 5 or 10 partitions being recommended. You split the dataset by making *K* random and different sets of indexes of observations, then interchangeably using them. The fraction of the full dataset that becomes the testing dataset is $1/K$, while the training dataset will be $(K-1)/K$. For each partition, a model is fitted to the current split of training and testing dataset.

Below is an example of K-Fold cross-validation with $K=5$. The full dataset will interchangeably be split up into a testing and training dataset, which a model will be trained upon.
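A minimal sketch of the splits with Scikit-Learn's `KFold` (the 20-observation dataset is made up):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)   # 20 hypothetical observations

kf = KFold(n_splits=5, shuffle=True, random_state=0)
sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kf.split(X)]
print(sizes)  # each fold tests on 1/K = 20% of the data
```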

The idea is that you use cross-validation with a search algorithm, where you input a hyperparameter grid — parameters that are selected before training a model. In combination with Random Search or Grid Search, you then fit a model for each pair of different hyperparameter sets in each cross-validation fold (example with random forest model).

```
hyperparameter_grid = {
    'max_depth': [3, None],
    'n_estimators': [10, 30, 50, 100, 200, 400, 600, 800, 1000],
    'max_features': [2, 4, 6]
}
```
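As a side note, you can count how many hyperparameter combinations a grid like this produces with Scikit-Learn's `ParameterGrid` — every combination is tried in every fold:

```python
from sklearn.model_selection import ParameterGrid

hyperparameter_grid = {
    'max_depth': [3, None],
    'n_estimators': [10, 30, 50, 100, 200, 400, 600, 800, 1000],
    'max_features': [2, 4, 6],
}

combos = list(ParameterGrid(hyperparameter_grid))
print(len(combos))  # 2 * 9 * 3 = 54 candidate settings per fold
```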

First, the *why should you care?* Nested cross-validation is an extension of the above, but it fixes one of the problems that we have with normal cross-validation. In normal cross-validation, you only have a training and a testing set, for which you find the best hyperparameters.

- This may cause information leakage
- You estimate the error of a model on the same data for which you found the best hyperparameters. This may cause significant bias.

You would not want to estimate the error of your model on the same training and testing data that you used to find the best hyperparameters. Thus, we say that the model is biased, and it has been shown that this bias can be significantly large [1].

**Why?** Along with the fact that bias and variance are linked with model selection, I would suggest that this is possibly one of the best approaches to estimating the true error: almost unbiased and with low variance.

As the image below suggests, we have two loops. The inner loop is basically normal cross-validation with a search function, e.g. random search or grid search. The outer loop only supplies the inner loop with the training dataset, while the test dataset in the outer loop is held back.

For this reason, we definitely stop information leakage due to cross-validation. We also get a bias that is low to absent, as the papers suggest (papers explained further below).

The above image is an informal view of it, but I compiled a more abstract view of nested cross-validation, describing the process.

**Require** means that *something* needs to be provided, and the rest of the parameters in this algorithm should be fairly obvious. Keep in mind that *RandomSample($P_{sets}$)* is a function that takes a random set from the hyperparameter grid.

Please leave a comment at the bottom of the post if you need more explaining. If it's hard to grasp, try to distinguish between *i* and *j* from the for-loops – they are very important to keep track of when reading this.
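In scikit-learn terms, the two loops can be sketched roughly like this (the synthetic data and the concrete estimator are my assumptions; *i* is the outer loop, while the inner *j* loop lives inside `GridSearchCV`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

X, y = make_classification(n_samples=150, n_features=8, random_state=1)
param_grid = {'max_depth': [3, None]}

outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_scores = []
for train_idx, test_idx in outer_cv.split(X):  # outer loop (i)
    # Inner loop (j): ordinary cross-validated grid search,
    # run on the outer training folds only.
    inner_search = GridSearchCV(
        RandomForestClassifier(n_estimators=20, random_state=0),
        param_grid, cv=3)
    inner_search.fit(X[train_idx], y[train_idx])

    # The outer test fold was held back from tuning entirely.
    outer_scores.append(inner_search.score(X[test_idx], y[test_idx]))

# Mean of the outer scores: an almost unbiased estimate of the true error.
print(np.mean(outer_scores))
```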

It is hard to deny that nested cross-validation is computationally expensive for larger datasets; with anything more than a few thousand observations, you might find it too costly.

Nested cross-validation has its purpose. Especially in fields where data is limited, e.g. in biology, there is usually not a lot of data to go with machine learning projects.

To summarize, when to use nested cross-validation:

- Where the dataset is small (a few thousand observations or less)
- When it is not too computationally expensive. Ask yourself if you find it feasible, given what type of computing power you have access to.

If you fit those two criteria, you should use nested cross-validation to get an almost unbiased estimate of the true error, and thereby compare the performance of different algorithms.

Note: if your dataset is large, normal cross-validation already produces an almost unbiased estimate of the true error for your algorithm, hence the use case for nested cross-validation is when your dataset is relatively small.

Although the case is clear for nested cross-validation, a paper has been released [3] which details when you should not use it.

In particular, let me quote the two claims:

> Nested CV procedures are probably not needed when selecting from among random forest, SVM with Gaussian kernel, and gradient boosting machines...

The words that caught my eye are *probably not needed*. As it stands, if your dataset is relatively small and you find it computationally feasible, you should still go for testing it with nested cross-validation.

> Nested CV procedures are probably not needed when selecting from among any set of classifier algorithms (provided they have only a limited number of hyper-parameters that must be tuned...).

Another *probably not needed* caught my eye. But this is a more understandable claim, and one could see, even without the evidence, why this would be the case. If you have a limited number (not specified) of hyperparameters, you don't have that much to optimize for; nested cross-validation could then just be a waste of computational time for your project.

That said, when thinking about algorithms with a limited number of hyperparameters, I don't find it easy to name one that has relatively few to tune. Maybe for something like K-Nearest Neighbors, nested cross-validation would be overzealous?

This one is a big question, and surely has caused many misconceptions. So let me just make it clear:

*If your results are stable (i.e. the same hyperparameter sets give roughly the same estimate of the error), then proceed with normal cross-validation.*

To really pinpoint the procedure, let's put it into steps:

1. Input all the algorithms whose error you intend to estimate into a nested cross-validation procedure.

   a. Continue if results are stable. Redo/rethink each unstable algorithm.

2. Select the model with the lowest generalization error, i.e. take the mean of the outer scores for each algorithm.
3. Run that algorithm in a normal cross-validation with grid search or random search, without any of the optimized hyperparameters.
4. Compare the score from nested cross-validation to the one from normal cross-validation.

At this point, you have tested whether bias was introduced into the procedure of estimating the error of your model. If the results are very different, take it as an indication that normal cross-validation introduces bias into the estimate.
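The comparison in the last step might be sketched like this, continuing with scikit-learn on synthetic data (what counts as "very different" is a judgment call, not something the procedure specifies):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

X, y = make_classification(n_samples=150, n_features=8, random_state=2)
param_grid = {'max_depth': [3, None]}

# Normal cross-validation: tune and score on the same folds.
flat_search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid, cv=5)
flat_search.fit(X, y)
flat_score = flat_search.best_score_

# Nested cross-validation: each outer fold scores a freshly tuned model.
outer_scores = []
for tr, te in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
    inner = GridSearchCV(
        RandomForestClassifier(n_estimators=20, random_state=0),
        param_grid, cv=3)
    inner.fit(X[tr], y[tr])
    outer_scores.append(inner.score(X[te], y[te]))
nested_score = np.mean(outer_scores)

# A large gap suggests the flat estimate is optimistically biased.
print(flat_score - nested_score)
```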

You've reached the fun part! Let me show you just how to use the GitHub package that I released specifically for this.

Firstly, you can install it with the pip command `pip install nested-cv`. Let's break down the documentation. This is going to be a regression example. For classification, you need to modify the `cv_options` found here, e.g. set `metric` to a classification metric and `metric_score_indicator_lower` to False.

In this next part, we simply make an array with different models to run in the nested cross-validation algorithm. An example used here is Random Forest, XGBoost and LightGBM:

```
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
import lightgbm as lgb

models_to_run = [RandomForestRegressor(), xgb.XGBRegressor(), lgb.LGBMRegressor()]
```

Next, we need to include the hyperparameter grid for each of the algorithms, as an array that contains different dictionaries. Note that the index of each algorithm in `models_to_run` should match its index in `models_param_grid`:

```
import numpy as np

models_param_grid = [
    {  # 1st param grid, corresponding to RandomForestRegressor
        'max_depth': [3, None],
        'n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
        'max_features': [50, 100, 150, 200]
    },
    {  # 2nd param grid, corresponding to XGBRegressor
        'learning_rate': [0.05],
        'colsample_bytree': np.linspace(0.3, 0.5),
        'n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
        'reg_alpha': (1, 1.2),
        'reg_lambda': (1, 1.2, 1.4)
    },
    {  # 3rd param grid, corresponding to LGBMRegressor
        'learning_rate': [0.05],
        'n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
        'reg_alpha': (1, 1.2),
        'reg_lambda': (1, 1.2, 1.4)
    }
]
```

Now we have two arrays: the algorithms and the corresponding hyperparameter grids. We loop over them and input each model into nested cross-validation with its hyperparameter grid, along with some other configuration options:

```
import numpy as np
from nested_cv import NestedCV

for i, model in enumerate(models_to_run):
    nested_CV_search = NestedCV(model=model, params_grid=models_param_grid[i],
                                outer_kfolds=5, inner_kfolds=5,
                                cv_options={'sqrt_of_score': True,
                                            'randomized_search_iter': 30})
    nested_CV_search.fit(X=X, y=y)
    model_param_grid = nested_CV_search.best_params
    print(np.mean(nested_CV_search.outer_scores))
    print(nested_CV_search.best_inner_params_list)
```

Now we would get the three generalization errors printed for us, and at this point, we could look at the scores and select the best model. Then, if the results are stable, we continue with the steps shown earlier in What To Do After Nested Cross-Validation.

- [1] Cawley and Talbot: On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation — Link here. Probably the most recognized article on bias in cross-validation predictions.
- [2] Sebastian Raschka: Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning — Link here. General reading on cross-validation, and nested cross-validation on page 45.
- [3] Wainer and Cawley: Nested cross-validation when selecting classifiers is overzealous for most practical applications — Link here. Indications of when you *probably* cannot make use of nested cross-validation.