This article was first published by IBM Developer at developer.ibm.com, but authored by Casper Hansen. Here is the Direct link.

Disclaimer: I am not a medical, radiology or epidemiology professional. This article was an experiment, from an engineering and data scientist perspective, and should be regarded as such.

To apply Deep Learning for COVID-19, we need a good dataset – one with lots of samples, edge cases, metadata, and different looking images. We want our model to generalize to the data, such that it can make accurate predictions on new, unseen data. All the work for this article is available on GitHub.

Unfortunately, not much data is available, but there are already posts on LinkedIn/Medium claiming >90% or in some cases 100% accuracy on detecting COVID-19 cases. Though, these posts usually contains mistakes, which are not acknowledged.

We have actively tried to mitigate some of the usual mistakes, by thinking about and coming up with solutions to the following problems:

  1. You might not believe the following is an issue, but it is a typical fallacy for new people in machine learning and data science. That is, the first rule of machine learning: never, ever test your model's performance with the same data you used to train it with. Not using a testing dataset, and instead testing and measuring the accuracy of their model on the training dataset, does not give an accurate representation of how well the model generalizes to new, unseen data.
  2. Not using computer vision techniques to achieve better generalization – augmentation being an absolute necessity, especially in our case where there are very few samples for our model to learn from.
  3. Not thinking about what the model learns, i.e. will our model actually learn the pattern of what COVID-19 looks like on an X-Ray image, or is it likely that there is some other noisy pattern in our dataset, that it will learn instead?
    One amusing story of early machine learning is called Detecting Tanks: "photos of camouflaged tanks had been taken on cloudy days, while photos of plain forest had been taken on sunny days."
  4. Not using the correct metrics. If you use the accuracy metric on a heavily imbalanced dataset, then your model might look like it's performing well in the general case, even though it's performing poorly on COVID-19 cases. 95% accuracy is not that impressive, if your accuracy on COVID-19 cases are 30%.

Acquiring Data

Here is a list of different data sources that I have accumulated over the period of the coronavirus spreading. Currently, this is the best we are going to get from the internet, because the data being collected in many countries and hospitals are classified information. Even if it was not classified information, you still need consent from each patient.

  1. GitHub: COVID-19 Chest X-Ray and CT images. This dataset is continually updated, as more cases appear.
  2. Kaggle 1: Chest X-Ray Images, positive and negative cases of Pneumonia.
  3. Kaggle 2: Chest X-Ray Images, positive and negative cases of Pneumonia.

Problem With Mixed Data Sources

All the data available on the internet have not been subject to the same preprocessing, and all the images are, as explained below, clearly different in the amount of black bars. Most of our positive COVID-19 data has the whole x-ray take up most of the screen, with little to no black bars on the sides (except for a few). Though, the dataset with negative COVID-19 cases have mostly black bars on the side of each image.

This becomes a problem, because our model later on might learn that it just needs to look at the black bars on the side, to know whether or not a new sample is a positive or negative case of COVID-19.

After manual inspection of the dataset, it has become apparent that almost all the negative cases have these black bars, while approximately only 10-20% of the positive cases have black bars. Wouldn't it be quite a coincidence, if our accuracy of predicting positive/negative later on turns out to be around the 80-90% mark?

Preprocessing — Improving Our Data

The data problem from earlier is clearly one we need to solve. Can we somehow detect when there are black top, black bottom, or black side bars? And can we somehow automatically correct this?

Here is the key to solving this:

  1. A simple metric: are there a number x or more black pixels in a given row/column?
  2. Yes? Remove that row/column. Repeat over all rows and columns.
  3. Done checking for black pixels? Then we crop the image.
  4. Next image and repeat.

Ok, not that fast. New problem: we need to keep the same aspect ratio. And we can't upscale or downscale, if the aspect ratio is not the same. This is due to the existing architectures needing an input size of 224x224, which indeed is an aspect ratio of 1x1.

In other words – can we check if we are removing as many columns as we are removing rows? Can we overcompensate, by removing specific rows, if there are too many columns with black pixels, but little to no rows of black pixels?

Removing The Black Bars And Fixing Aspect Ratio

The image below demonstrates the input and output of removing the black bars, with one example where it was particularly successful. Quite powerful, but we are still not adhering to the aspect ratio of 1x1.

The most effective and easy solution I found to fixing the aspect ratio, was calculating the difference between the number of rows removed versus the number of columns removed. Divide the difference by two and remove rows from top and bottom, or remove columns from left and right.

The last image is saved as a 224x224 image, because we can scale it down now that the aspect ratio is 1x1. This saves space and some time when loading the image again in another script.

Running The Script

Here is the thing – it takes about 14 seconds per image, to find the rows and columns for removal. We use a naive approach, which could probably be done more efficiently, but nevertheless, it works for this experiment. We had 8851 images in total in the Kaggle 1 dataset, which turned out to be 39 hours of preprocessing. All data is released and available in the GitHub repository for this article.

We ran the code on both the Kaggle 1 and GitHub datasets, and the result is that we have all these images saved in a 224x224 image format. The below code is only for running it on the Kaggle 1 dataset, but you can substitute the first part of the code to load any dataset of images.

import pandas as pd
import matplotlib.pyplot as plt
import cv2, time, os
import numpy as np

df = pd.read_csv('kaggle-pneumonia-jpg/stage_2_detailed_class_info.csv')
image_folder = 'kaggle-pneumonia-jpg/stage_2_train_images_jpg/'

df = df.loc[df['class'] == 'Normal']

Here is how I loaded the COVID-19 dataset. Note that the rest of the code is the same, this is just loading a different dataset.

columns = ['finding', 'modality', 'filename', 'view']

# only use COVID-19 samples done with X-ray and PA view
df = pd.read_csv('metadata.csv', usecols=columns)
df = df.loc[df['finding'] == 'COVID-19']
df = df.loc[df['modality'] == 'X-ray']
df = df.loc[df['view'] == 'PA']

After loading the dataframe, we load the images, labels and filenames into arrays.

images = []
labels = []
filenames = []

for index, row in df.iterrows():
    image = cv2.imread(image_folder + row.patientId + '.jpg')
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    images.append(image)

    label = row['class']
    labels.append(label)
    
    filename = row.patientId
    filenames.append(filename)

images = np.array(images)

Next, we have the a function get_black_pixel_indices() that loops through rows and columns of pixels – you could imagine an image as a table, that has rows and columns of pixels, that has pixel values.

We count the number of black pixels, defined by the number of pixel values that are less than 40. A pixel value consists of one value for each color channel, e.g. Red, Green, Blue (RGB). We could have an array [39, 67, 45], and the average pixel value would be the average of those numbers, which is 50.33.

When we are done counting the number of black pixels for a given row and column, we check if there are more black pixels than non black pixels, and if that is the case, we choose to remove that row or column.

def get_black_pixel_indices(img):
    rows_remove = []
    cols_remove = []

    for j, counter in enumerate(range(img.shape[0])):
        row = img[counter]
        # shift column dim to first dim, row dim to second dim
        # acts like a 90 degree counterclock-wise rotation of image
        col = img.transpose(1, 0, 2)[counter]

        row_black_pixel_count = 0
        col_black_pixel_count = 0
        for selector in range(img.shape[1]):
            row_pixel = row[selector]
            col_pixel = col[selector]
            
            if np.average(row_pixel) < 40:
                row_black_pixel_count += 1
            if np.average(col_pixel) < 40:
                col_black_pixel_count += 1
        
        if row_black_pixel_count > len(row)/2:
            rows_remove.append(j)
        if col_black_pixel_count > len(col)/2:
            cols_remove.append(j)

    return rows_remove, cols_remove

Before we remove the rows and columns, with a higher black pixel count than non black pixel count, we have to check for rows and columns that, if removed, would obstruct the image. The simplest way we found was that no rows and columns should be removed between column and row 200 and 800. After filtering these obstructing indices, we remove the rows and columns.

def remove_obstructing_indices(rows_remove, cols_remove):
    for i, value in enumerate(rows_remove):
        if 200 <= value <= 800:
            del rows_remove[i]
    for i, value in enumerate(cols_remove):
        if 200 <= value <= 800:
            del cols_remove[i]

    return rows_remove, cols_remove

def remove_black_pixels(img, rows_remove, cols_remove):
    img = np.delete(img, rows_remove, axis=0)
    img = np.delete(img, cols_remove, axis=1)
    
    return img

The last thing for preprocessing, is adjusting the aspect ratio. Quite simply, we know that, in most cases, a human torso is more tall than wide, so we choose to arbitrarily cut the image from the top and bottom. There are also cases where the torso is wider, but those were usually rare when inspecting the dataset.

We start by calculating the column to row difference and row to column difference, and based on that, we find how many rows or columns we need to remove, in order to fix the aspect ratio. We found that we can divide the difference by two, then remove half of the difference from either the top and bottom, or left and right. After that, we might have one leftover row or column if the difference is an odd number, so we compensate by either removing a row from the bottom, or a column from the right.

def adjust_aspect_ratio(img, rows_remove, cols_remove):
    col_row_diff = len(cols_remove) - len(rows_remove)
    row_col_diff = len(rows_remove) - len(cols_remove)
    
    if col_row_diff > 0:
        slice_size = int(col_row_diff/2)
        
        img = img[:-slice_size]
        img = img[slice_size:]
        
        if img.shape[0] != img.shape[1]:
            img = img[:-1]
            
    elif row_col_diff > 0:
        slice_size = int(row_col_diff/2)
        
        img = img[:,:-slice_size,:]
        img = img[:,slice_size:,:]
        
        if img.shape[0] != img.shape[1]:
            img = img[:,:-1,:]
    
    if img.shape[0] == img.shape[1]:
        return img, True
    else:
        return img, False

Finally, we run the loop that calls the four functions previously described. This results in images that have removed black bars from the left, right, top and bottom. We downscale that image to a 224x224 image, then we save the image, along with the label and filename in a .csv file that cumulates all the paths and filenames of all our preprocessed images.

save_folder = 'normal_dataset/'

start = time.time()
for i, img in enumerate(images):
    rows_remove, cols_remove = get_black_pixel_indices(img)
    rows_remove, cols_remove = remove_obstructing_indices(rows_remove, cols_remove)
    
    new_img = remove_black_pixels(img, rows_remove, cols_remove)
    adj_img, shape_match = adjust_aspect_ratio(new_img, rows_remove, cols_remove)

    if shape_match:
        resized_image = cv2.resize(adj_img, (224, 224))
    
        label = labels[i]
        name = filenames[i] + '.jpg'

        cv2.imwrite(save_folder + name, resized_image)
        new_df = pd.DataFrame({'filename': [name], 'finding': [label]})

        if i == 0:
            new_df.to_csv('normal_xray_dataset.csv')
        else:
            new_df.to_csv('normal_xray_dataset.csv', mode='a', header=False)

    print('image number {0}, time spent {1:2f}s'.format(i+1, time.time() - start))

For a final overview of another 40 images of negative and positive cases, we can clearly see that our preprocessing approach helped.

Augmentation Of COVID-19 Samples

To make the most of what we have (99 images), we are going to have to use augmentation of our training data. Augmentation is a manipulation of the image, that will look new to CNN – common examples of augmentation are rotations, shears and flips of images.

Keep in mind that we use augmentation in order to generalize our model better and prevent overfitting. Essentially, each time we use one form of augmentation, we add another similar, but slightly different, image to the training dataset. Note that this is also a demonstration of how you could augment the images, and not the augmentation we will use later.

Adjusted Contrast

We use TensorFlow's image.adjust_contrast() to adjust the contrast of all our images.

X_train_contrast = []

for x in X_train:
    contrast = tf.image.adjust_contrast( x, 2 )
    X_train_contrast.append(contrast.numpy())

plot_images(X_train_contrast, 'Adjusted Contrast')

Which results in the following updated images:

Adjusted Saturation

We use TensorFlow's image.adjust_saturation() to adjust the saturation of all our images.

X_train_saturation = []

for x in X_train:
    saturation = tf.image.adjust_saturation( x, 3 )
    X_train_saturation.append(saturation.numpy())

plot_images(X_train_saturation, 'Adjusted Saturation')

Which results in the following updated images:

Flipping Left Right

We use TensorFlow's image.flip_left_right() to flip all of our images from left to right.

X_train_flipped = []

for x in X_train:
    flipped = tf.image.flip_left_right(x)
    X_train_flipped.append(flipped.numpy())

plot_images(X_train_flipped, 'Flipped Left Right')

Which results in the following updated images:

Flipping Up Down

We use TensorFlow's image.flip_up_down to flip all of our images from top to the bottom.

X_train_flipped_up_down = []

for x in X_train:
    flipped = tf.image.flip_up_down(x)
    X_train_flipped_up_down.append(flipped.numpy())

plot_images(X_train_flipped_up_down, 'Flipped Up Down')

Which results in the following updated images:

Flipping Up Down Left Right

We use TensorFlow's image.flip_left_right() to flip all of our images that are already flipped from top to bottom, which results in the images being flipped from top to bottom, then from left to right.

X_train_flipped_up_down_left_right = []

for x in X_train_flipped_up_down:
    flipped = tf.image.flip_left_right(x)
    X_train_flipped_up_down_left_right.append(flipped.numpy())

plot_images(X_train_flipped_up_down_left_right, 'Flipped Up Down Left Right')

Which results in the following updated images:

Rotations

TensorFlow has an extension called TensorFlow Addons, which can be installed by the pip command pip install tensorflow-addons. The package has a function image.rotate() that allows us to rotate any image by an arbitrary number of degrees. Though, the function takes radians as input, so we use the radians() function from Python's math import.

import tensorflow_addons as tfa
from math import radians

X_train_rot_45_deg = []
X_train_rot_135_deg = []
X_train_rot_225_deg = []
X_train_rot_315_deg = []

for x in X_train:
    deg_45 = tfa.image.rotate(image, radians(45))
    deg_135 = tfa.image.rotate(image, radians(135))
    deg_225 = tfa.image.rotate(image, radians(225))
    deg_315 = tfa.image.rotate(image, radians(315))

    X_train_rot_45_deg.append(deg_45)
    X_train_rot_135_deg.append(deg_135)
    X_train_rot_225_deg.append(deg_225)
    X_train_rot_315_deg.append(deg_315)

plot_images(X_train_rot_45_deg, 'Rotated 45 Degrees')
plot_images(X_train_rot_135_deg, 'Rotated 135 Degrees')
plot_images(X_train_rot_225_deg, 'Rotated 225 Degrees')
plot_images(X_train_rot_315_deg, 'Rotated 315 Degrees')

Which results in the following updated images:

Architecture Overview

VGG and ResNet models are particularly popular at the moment, since they are performing the best, according to some of the papers released on COVID-19:

  1. Hemdan, E., Shouman M., & Karar M. (2020). COVIDX-Net: A Framework of Deep Learning Classifiers to Diagnose COVID-19 in X-Ray Images. arXiv:2003.11055.
  2. Wang, L., & Wong, A. (2020). COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest Radiography Images. arXiv:2003.09871.

We have made an overview of two architectures, VGG19 and ResNet50, using an adapted style guide from this article. The yellow R stands for ReLU, while the yellow S stands for Softmax. Brackets means repeated blocks of layers. Please note that we abstracted batch normalization away.

The VGG19 architecture

The ResNet50 architecture

Running A Deep Learning Model

We are going to be using transfer learning using the VGG19 model, where we initialize the weights to be of those from ImageNet, although it is a very different dataset to ours. ImageNet consists of images of random objects, while our dataset is just X-Ray images. We are transferring the hidden layers that find colorful patterns from those previously seen objects, or their different geometrical forms in another dataset.

I ran a short experiment where I tried not using transfer learning, and it resulted in an accuracy and F1 around 0.5-0.6. I chose to go ahead with the approach that performed better, as we will see in a moment. If you want to experiment, you need to define a VGG19 model with weights=None.

Imports, Loading Data And Splitting Data

For all the imports for loading data and modeling, we use various packages; pandas, pyplot, numpy, cv2, tensorflow and scikit-learn.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import cv2, time
import tensorflow as tf

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from tensorflow.keras.utils import to_categorical

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import VGG19
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Dense

We start by loading both the datasets (note that dataset/ in the path variables is the name of the folder of the GitHub repository). We put a constraint on the number of normal cases we use for modeling, since using all the images, or even 20% of the normal images lead to overfitting to those samples.

covid_path = 'dataset/covid_dataset.csv'
covid_image_path = 'dataset/covid_adjusted/'

normal_path = 'dataset/normal_xray_dataset.csv'
normal_image_path = 'dataset/normal_dataset/'

covid_df = pd.read_csv(covid_path, usecols=['filename', 'finding'])
normal_df = pd.read_csv(normal_path, usecols=['filename', 'finding'])

normal_df = normal_df.head(99)

covid_df.head()

The following code is used to load the images into Python arrays, and lastly, we load the images into NumPy arrays and normalize the images, so that all the pixel values are between zero and one.

covid_images = []
covid_labels = []

for index, row in covid_df.iterrows():
    filename = row['filename']
    label = row['finding']
    path = covid_image_path + filename

    image = cv2.imread(path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    
    covid_images.append(image)
    covid_labels.append(label)

normal_images = []
normal_labels = []

for index, row in normal_df.iterrows():
    filename = row['filename']
    label = row['finding']
    path = normal_image_path + filename

    image = cv2.imread(path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    
    normal_images.append(image)
    normal_labels.append(label)

# normalize to interval of [0,1]
covid_images = np.array(covid_images) / 255

# normalize to interval of [0,1]
normal_images = np.array(normal_images) / 255

We want to make sure that both the normal and infected cases are split correctly, and since we loaded them separately, we will also split them separately. If we later choose to load more normal cases than infected cases, thus loading an imbalanced dataset, we will still get a good split in terms of the number of infected cases in the training and testing datasets.

After splitting the dataset, we concatenate them into input X and output y, along with a label for training and testing data. We also convert the labels COVID-19 and Normal into numbers, by using LabelBinarizer() and to_categorical().

# split into training and testing
covid_x_train, covid_x_test, covid_y_train, covid_y_test = train_test_split(
    covid_images, covid_labels, test_size=0.2)

normal_x_train, normal_x_test, normal_y_train, normal_y_test = train_test_split(
    normal_images, normal_labels, test_size=0.2)

X_train = np.concatenate((normal_x_train, covid_x_train), axis=0)
X_test = np.concatenate((normal_x_test, covid_x_test), axis=0)
y_train = np.concatenate((normal_y_train, covid_y_train), axis=0)
y_test = np.concatenate((normal_y_test, covid_y_test), axis=0)

# make labels into categories - either 0 or 1
y_train = LabelBinarizer().fit_transform(y_train)
y_train = to_categorical(y_train)

y_test = LabelBinarizer().fit_transform(y_test)
y_test = to_categorical(y_test)

Defining The Model

We went with a pre-trained VGG19 model. We have set include_top=False, which means there are no fully connected layers at the end of the VGG19 architecture. This enables us to take the outputs from the convolutions, flatten them and parse them into our own output layer.

We also added regularization by using a Dropout layer, because we were strongly overfitting to our training data. We are still overfitting to the training data, but not as much – we went from roughly a 0.97 accuracy to a 0.88 accuracy.

vggModel = VGG19(weights="imagenet", include_top=False,
    input_tensor=Input(shape=(224, 224, 3)))

outputs = vggModel.output
outputs = Flatten(name="flatten")(outputs)
outputs = Dropout(0.5)(outputs)
outputs = Dense(2, activation="softmax")(outputs)

model = Model(inputs=vggModel.input, outputs=outputs)

for layer in vggModel.layers:
    layer.trainable = False

model.compile(
        loss='categorical_crossentropy', 
        optimizer='adam', 
        metrics=['accuracy']
)

Defining The Augmentation Of Our Training Data

The way we chose to augment our data, was by generating batches of augmentations on each sample. For each sample, we get a new batch of samples from the ImageDataGenerator we defined below, and all these new augmented samples will be used for training, instead of the original sample.

train_aug = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True
)

We chose to set the rotation_range=20, which is the degree range for random rotations of the original image. This means Keras will randomly choose a degree from 1-20 and rotate the image by that degree either clockwise or counterclockwise.

Meanwhile, we also have both the width_shift_range=0.2 and height_shift_range=0.2, which randomly shifts the image by the width (i.e. to the left and right) and height (i.e. up and down) as specified.

At last, we also have horizontal_flip=True, which at random flips the image horizontally, i.e. flipping the image from left to right, as we also saw earlier.

A good visualization of the ImageDataGenerator can be found here.

Fitting, Predicting & Scoring

We have defined our model, augmentation and data. Now all we need to do, is to start training our VGG19 model. We can train our model with the augmented data, by feeding the batches from the augmentation by the train_aug.flow() function.

history = model.fit(train_aug.flow(X_train, y_train, batch_size=32),
                    steps_per_epoch=len(X_train) / 32,
                    epochs=500)

Since we want to see how well our model performs on the different classes that exists, we use the correct metrics to measure how well our model performs; namely, we F1, and not accuracy. Here is why we could use both AUC and F1 (same explanation), explained by David Ernst:

The ROC AUC [or F1] is sensitive to class imbalance in the sense that when there is a minority class, you typically define this as the positive class and it will have a strong impact on the AUC [or F1] value. This is very much desirable behaviour. Accuracy is for example not sensitive in that way. It can be very high even if the minority class is not well predicted at all.

After training our model, we can use it to make predictions on new and unseen data. We use argmax for finding the highest predicted probability, and then we can generate a classification report using scikit-learn.

from sklearn.metrics import classification_report

y_pred = model.predict(X_test, batch_size=32)
print(classification_report(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1)))

Our LabelBinarizer from earlier made it such that '0' is the COVID-19 class and '1' is the normal class. We can clearly see that the accuracy is 82%, while the f1-score for COVID-19 is 0.84. This is pretty good considering the limited data. Perhaps this even suggests that there are more noise in the images, that we did not filter out or correct for.

              precision    recall  f1-score   support

           0       0.75      0.95      0.84        19
           1       0.93      0.70      0.80        20

    accuracy                           0.82        39
   macro avg       0.84      0.82      0.82        39
weighted avg       0.84      0.82      0.82        39

At last, we can plot the training accuracy and the training loss using pyplot.

plt.figure(figsize=(10,10))
plt.style.use('dark_background')

plt.plot(history.history['accuracy'])
plt.plot(history.history['loss'])

plt.title('Model Accuracy & Loss')
plt.ylabel('Value')
plt.xlabel('Epoch')

plt.legend(['Accuracy', 'Loss'])

plt.show()

This generates a simple plot that shows how the model's accuracy increased over time on the training dataset, as well as the loss on the training dataset.

Conclusion: Next Steps For Better Models

Although our approach attempted to mitigate some of the mistakes made in other approaches, we should still believe that we can improve by a lot. As we saw from the f1-score, we most likely still have some issues with our data, that we can't immediately see ourselves.

Here are some of the approaches we could use, to get a better model, and more reliable results.

  1. Using x-ray images of SARS, MERS, ARDS and other types of infections, instead of just the normal category. This would make us able to distinguish between different infections.
  2. More quality data of COVID-19 cases from the posteroanterior (PA) view.
  3. Data in different stages of COVID-19: infected, no symptoms; infected, mild symptoms; infected, heavy symptoms.
  4. More advanced types of imaging data, e.g. CT images, which includes more information.
  5. Cross-validation or nested cross-validation for better estimation of the true generalization error of the model.

The results achieved in this article are NOT to be interpreted as usable in real world scenarios. We should only consider a Deep Learning + COVID-19 approach, after rigorous testing by engineers, doctors and other professionals, followed by a clinical study.