Ship recognition in satellite images

Executive summary

Problem setup

Satellite imagery provides unique insights into various markets, including agriculture, defense and intelligence, energy, and finance. New commercial imagery providers, such as Planet, are using constellations of small satellites to capture images of the entire Earth every day.

This flood of new imagery is outgrowing organizations' ability to manually look at each image that gets captured, and there is a need for machine learning and computer vision algorithms to help automate the analysis.

Solution

This notebook describes using a CNN (convolutional neural network) to determine whether a given satellite image contains a ship. I use transfer learning on two publicly available pretrained CNNs (VGG16 and Inception), build classifiers on top of them, and compare the results.

  1. The dataset Ships in Satellite Imagery has 1000 positive (ship) and 3000 negative examples, with each image being 80x80 px RGB
  2. I train a total of 6 models: 2 underlying embeddings (VGG16 and Inception v3) x 3 options for each (no imbalanced class correction / class weights / data augmentation)
  3. The best of 6 models yields accuracy 97.25%, precision 92.63% and recall 95.65% on unseen data

Acknowledgements

Many thanks to Gotam Dahiya for his excellent notebook Ship Detection using Faster R-CNN: Part 1 and to the TensorFlow team for the Classification on imbalanced data tutorial.

Project structure

This is an overview of the entire project structure, or pipeline:

  1. Data wrangling
    • Acquire the data from a public dataset on Kaggle
    • Perform basic EDA (exploratory data analysis)
    • Split the data into train/dev/test sets
  2. Transfer learning: use pretrained CNNs (VGG16 and Inception) to calculate feature vectors for each image
  3. Build and train neural network classifiers on top of the pretrained CNNs:
    • baseline (no correction for imbalanced classes)
    • class weights
    • data augmentation for the minority class
  4. Evaluate and compare all six models based on performance on unseen test data

Setup

Data wrangling

Load data

The data was downloaded manually and separated into ship and no-ship folders.

Load the images with OpenCV (cv2), normalize the pixel values, and read the labels.
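
As a rough sketch of what this loading step might look like (the folder paths and helper name here are hypothetical; note that cv2 reads images in BGR order, so they are converted to RGB):

```python
import os

import cv2
import numpy as np

def load_folder(folder, label):
    """Read every image in a folder as an RGB array scaled to [0, 1]."""
    images, labels = [], []
    for fname in os.listdir(folder):
        img = cv2.imread(os.path.join(folder, fname))          # BGR, uint8
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) / 255.0     # RGB, float in [0, 1]
        images.append(img)
        labels.append(label)
    return images, labels

ship_imgs, ship_labels = load_folder("data/ship", 1)        # hypothetical paths
noship_imgs, noship_labels = load_folder("data/no-ship", 0)

images = np.array(ship_imgs + noship_imgs, dtype=np.float32)
labels = np.array(ship_labels + noship_labels)
```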

EDA

Let's see the distribution of data between the classes.

For each ship image, there are 3 no-ship ones. This skew will have to be corrected for; otherwise the algorithm can achieve 75% accuracy simply by always predicting "no-ship", which would introduce unnecessary bias.

Let's now look at a few images. The first row is no-ship, the second row is ships.

It is worth noting that only images showing a vessel in full are labelled as ship; a no-ship image can sometimes contain part of a ship.

Train/dev/test split

I will split the entire dataset into train/dev/test sets using a 70/20/10 ratio, which is appropriate for the relatively small number of examples I have.

  1. The train set will be used to train the classifiers
  2. The dev set, a.k.a. validation set, will be used to compare the different approaches and choose the best one. Each approach comprises:
    • the underlying pretrained CNN
    • the classifier with its hyperparameters
    • a way of correcting for skewed classes: none, class weights, or data augmentation
  3. The test set will be used only once, to estimate the performance of the best model on unseen data

The train_test_split function from scikit-learn only splits data into two sets (train and test), so I will use it twice: first splitting all data into train and the rest, then splitting the rest further into dev and test proper.
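
A minimal sketch of the two-step split, assuming the images and labels arrays from the loading step (the stratify argument, which keeps the ship/no-ship ratio roughly the same in each subset, is my addition):

```python
from sklearn.model_selection import train_test_split

# Step 1: carve off the 70% training set.
images_train, images_rest, labels_train, labels_rest = train_test_split(
    images, labels, test_size=0.3, random_state=42, stratify=labels)

# Step 2: split the remaining 30% into dev (20% of total) and test (10% of total).
images_dev, images_test, labels_dev, labels_test = train_test_split(
    images_rest, labels_rest, test_size=1/3, random_state=42, stratify=labels_rest)
```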

Transfer learning

The amount of original data (only 4000 examples) is not enough to train a CNN from scratch. It makes more sense to reuse publicly available models that have been trained over several weeks on GPUs using millions of examples. This way I can transfer the existing knowledge about low-level features (edges, angles, etc.) and only learn the actual classification between ships and no-ships.

This will be a two step process:

  1. Use pretrained CNN (without the final fully connected and classifier layers) to calculate feature vectors for each image (deterministic step, as no training will be done here)
  2. Build and train a separate classifier, using feature vectors from step 1 as an input

Generally speaking, the classifier at step 2 does not have to be a neural network at all. One could use an SVM, logistic regression, or anything else, but I will stick with a neural network, as that is the topic of this notebook.

VGG16

VGG16 is a deep CNN with 16 trainable layers that made history by achieving 92.7% top-5 test accuracy in the ILSVRC-2014 (ImageNet) competition. You can read about it in this article, but here is the architecture for general understanding.

VGG16 architecture

Load model

There are two parts to a CNN model: the network architecture (displayed above) and the weights learned during training. For transfer learning, we'll need both, and both are conveniently available via the keras package.

However, of VGG16's 16 trainable layers (not counting pooling and softmax), only the first 13 are convolutional; the last 3 are fully connected layers that perform the classification. I will load only the convolutional layers and build my own classifier on top of them. A nice perk is that this allows input images of any size (producing feature tensors of different sizes as output), whereas the full VGG16 with its fully connected layers only accepts 224x224 px images.
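
Loading the convolutional base in Keras might look roughly like this (the variable name and the explicit 80x80 input shape are my assumptions):

```python
from tensorflow.keras.applications import VGG16

# include_top=False drops the fully connected layers, which also lifts
# the fixed input-size requirement of the full network.
vgg_base = VGG16(weights="imagenet", include_top=False, input_shape=(80, 80, 3))
vgg_base.trainable = False  # used purely as a frozen feature extractor
```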

Calculate features

First, pass all the images through VGG16 to get feature tensors. With a 224x224x3 image the result would be 7x7x512, but with the 80x80x3 images I am using it will be smaller: n x n x 512 with n < 7. This step takes a few minutes.

Okay, the resulting tensors are 2x2x512. They need to be flattened into feature vectors, because that is what fully connected layers expect.

Perfect, now each image is represented by a 2048-element feature vector.
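
A sketch of the extraction and flattening steps, assuming the vgg_base model and the images_train array from the earlier sketches (dev and test images are processed the same way):

```python
# Run every image through the frozen convolutional base (no training happens here).
features_train = vgg_base.predict(images_train, batch_size=64)   # shape (n, 2, 2, 512)

# Fully connected layers expect flat vectors, so reshape 2x2x512 -> 2048.
features_train = features_train.reshape(features_train.shape[0], -1)
```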

Inception

While VGG is conceptually a stack of convolutional layers, Inception has a more complicated structure. Fully describing it is out of scope for this notebook (you can read a good overview here). I will only mention that it creates a "sparsely connected architecture" by stacking a new kind of building block called the "Inception module":

Inception module

Among other things, this makes the network less expensive to train and less prone to overfitting.

For my particular task, I will just load the architecture and weights (without the top fully connected layers) and use them to calculate feature vectors as with VGG.

Load model
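
Loading it is analogous to VGG16; a sketch (again with an assumed variable name and input shape; note that InceptionV3 requires inputs of at least 75x75 px, so 80x80 is fine):

```python
from tensorflow.keras.applications import InceptionV3

# Convolutional part only; the classification head is dropped.
inception_base = InceptionV3(weights="imagenet", include_top=False,
                             input_shape=(80, 80, 3))
inception_base.trainable = False
```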

Calculate features

By analogy with VGG, first get features as tensors:

It's interesting that both VGG16 and Inception end up with the same size features, although shaped differently.

Now flatten:

Own model (classifier)

Due to the lucky coincidence that VGG16 and Inception output feature vectors of the same size, I can use the same classifier architecture, feed it either output, and compare.

Define architecture and metrics

Generally speaking, my classifier does not even have to be a neural network: I could feed the CNN-generated features into an SVM or even logistic regression. However, here I do use a neural network with the following architecture (sketched in code after the list):

  1. Input is 2048-length feature vector
  2. Fully connected layer with 1024 neurons and ReLU activation
  3. Dropout with 0.5
  4. Single-unit output layer with sigmoid activation: standard choice for binary classification
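
A minimal sketch of this classifier, with the metrics discussed in the next subsection; the helper name, the Adam optimizer and the exact metric set are my assumptions rather than the notebook's exact code:

```python
from tensorflow import keras
from tensorflow.keras import layers

METRICS = [
    keras.metrics.BinaryAccuracy(name="accuracy"),
    keras.metrics.Precision(name="precision"),
    keras.metrics.Recall(name="recall"),
    keras.metrics.AUC(name="auc"),
]

def make_classifier(metrics=METRICS):
    """Small dense head on top of the 2048-dim CNN feature vectors."""
    model = keras.Sequential([
        layers.Input(shape=(2048,)),
        layers.Dense(1024, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=metrics)
    return model
```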

Understanding useful metrics

Notice that a few metrics are defined above for the model to compute; these will be helpful when evaluating performance.

Note: Accuracy may not be the best metric for this task; you can achieve 75% accuracy by always predicting no-ship. However, it cannot be discarded completely.

Build a model

Create a classifier using the previously defined function. This baseline version will probably not work very well and will be improved in later sections.

Save initial weights

Saving the initial weights to a temp file so that I can load them later and re-train each model from the same starting point.
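
Something along these lines, assuming model is the classifier built above (newer Keras versions may require a .weights.h5 suffix for the weights file):

```python
import os
import tempfile

# Save the freshly initialised weights so every model variant starts from the same point.
initial_weights = os.path.join(tempfile.mkdtemp(), "initial_weights")
model.save_weights(initial_weights)

# Later, right before training each new variant:
model.load_weights(initial_weights)
```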

Baseline model

In this section, I will train two classifiers, using VGG16 and Inception embeddings as input, without correcting for the imbalanced classes.

VGG16 baseline

Train model

Plot training history

Evaluate metrics

One way to evaluate the resulting model is to use a confusion matrix to summarize the actual vs. predicted labels where the X axis is the predicted label and the Y axis is the actual label.

A function to plot the confusion matrix:
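
One possible implementation, using scikit-learn and seaborn (not necessarily the exact function from the notebook):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_cm(labels, predictions, p=0.5):
    """Confusion matrix at decision threshold p (predicted on X, actual on Y)."""
    cm = confusion_matrix(labels, predictions > p)
    sns.heatmap(cm, annot=True, fmt="d")
    plt.title(f"Confusion matrix @ threshold {p:.2f}")
    plt.xlabel("Predicted label")
    plt.ylabel("Actual label")
    plt.show()
```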

Plot the ROC

The ROC curve is useful because it shows how the classifier can be tuned by adjusting the prediction threshold. I will only plot it for test data: with a total of 6 models, the chart will get quite cluttered as it is.
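
A plotting helper in the spirit of the TensorFlow imbalanced-data tutorial (the function name is mine):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

def plot_roc(name, labels, predictions, **kwargs):
    """Add one model's ROC curve (on test data) to the current plot."""
    fp, tp, _ = roc_curve(labels, predictions)
    plt.plot(100 * fp, 100 * tp, label=name, linewidth=2, **kwargs)
    plt.xlabel("False positives [%]")
    plt.ylabel("True positives [%]")
    plt.legend(loc="lower right")
```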

Inception baseline

Train model

Plot training history

We can see a severe case of overfitting: training loss goes down, but dev loss goes up. Normally I would attempt to solve this with the usual remedies, such as stronger regularization, early stopping, or more training data.

However, in this case I have another network that is learning well enough using VGG features as input, so I will keep that as the main candidate, while keeping an eye on Inception out of curiosity.

Evaluate metrics

As could be expected from the training plots, the performance is not as good as the VGG option's, but it is still pretty decent. Maybe this classification task is just relatively easy in itself.

Plot the ROC

Class weights model

One option to improve a model's performance on imbalanced data is to assign a higher weight to errors on the minority class during training, making the model "pay more attention" to that class.

Calculate class weights
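
Following the weighting scheme from the TensorFlow imbalanced-data tutorial (assuming the labels_train array from the split above):

```python
import numpy as np

neg, pos = np.bincount(labels_train)   # no-ship vs. ship counts in the training set
total = neg + pos

# Scale the weights so the overall loss magnitude stays roughly unchanged.
weight_for_0 = (1 / neg) * (total / 2.0)
weight_for_1 = (1 / pos) * (total / 2.0)
class_weight = {0: weight_for_0, 1: weight_for_1}
```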

VGG 16 with class weights

Train a model with class weights

Keras has a dedicated argument for passing the class weights when training a model.
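
For example (the epochs, batch size and feature variable names are assumptions carried over from the earlier sketches, not the notebook's exact values):

```python
model.load_weights(initial_weights)          # restart from the common initial point
history = model.fit(
    features_train, labels_train,
    validation_data=(features_dev, labels_dev),
    epochs=20, batch_size=64,                # assumed hyperparameters
    class_weight=class_weight)               # errors on the ship class are weighted higher
```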

Plot training history

The loss plot looks similar to the baseline VGG model's. Precision shows a lot of noise.

Evaluate metrics

Compared to baseline VGG model, this one has

Plot the ROC

Inception with class weights

Let's see what happens with the previously overfitting Inception model when the same class weights are applied.

Train a model with class weights

Plot training history

A similar overfitting picture. The Inception-based model is unlikely to make it to production.

Evaluate metrics

Plot the ROC

Data augmentation model

Another way to deal with imbalanced classes is to use data augmentation to create more examples of the minority class.

Several kinds of augmentation will be applied randomly. Some other kinds, like cropping or rotating the image, would not work well in this case: the ship usually takes up almost the entire image, so we risk either cutting away part of it or creating significant black corners when rotating the square.

Augmentation will only be applied to the training set. This way the dev and test sets stay the same across all models, allowing an apples-to-apples comparison.
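
The notebook's exact augmentations are not shown here, but a plausible sketch using tf.image (random flips plus mild brightness/contrast jitter, applied only to the ship images in the training set) could look like this:

```python
import numpy as np
import tensorflow as tf

def augment(image):
    """Random flips and mild photometric jitter; no crops or rotations."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, 0.9, 1.1)
    return tf.clip_by_value(image, 0.0, 1.0)

# Create two augmented copies of every ship (minority-class) training image.
ship_train = images_train[labels_train == 1]
extra = np.stack([augment(img).numpy() for img in ship_train for _ in range(2)])

images_train_aug = np.concatenate([images_train, extra])
labels_train_aug = np.concatenate([labels_train, np.ones(len(extra), dtype=labels_train.dtype)])
```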

The result is an (almost) balanced dataset (the original images_train was not at an exact 3:1 ratio due to the random nature of the train/dev/test splitting).

VGG16 with augmentation

Calculate VGG16 features

I now have some more training examples, for which the features have to be calculated and flattened again.

Train a model with augmented data

Plot training history

Evaluate metrics

Compared with the defending champion (the baseline VGG model), this one shows a tiny improvement: one fewer false positive (therefore higher precision) and a slightly better AUC. Such is the world of modern computer vision: fighting for improvements somewhere in the fourth digit after the decimal point.

Plot the ROC

Inception with augmentation

A last chance for the Inception-based model. Maybe with more training data there will be less overfitting?

Calculate Inception features

Train a model with augmented data

Plot training history

Wow, the more data it gets, the worse it performs. Too bad for you, Inception!

Evaluate metrics

Plot the ROC

Final model

In the previous section, I trained a total of 6 classifiers = 2 underlying embeddings (VGG16 and Inception) x 3 options for each embedding (baseline, class weights and data augmentation). Time to compare all of them together.

Compare six models

Without a specific business problem to solve, it is hard to choose between six models that all show decent performance (even the worst one gives 96.75% accuracy). It is always good to use a single real-valued metric to compare all the models, but which one should it be? Three options come to mind:

  1. Raw accuracy
  2. F1 score
  3. AUC-ROC

The VGG-based model with augmented data scores best both in terms of accuracy (only 6 mislabelled examples out of 400, i.e. 98.5%) and F1 score. However, the VGG-based model with class weights scores better on the AUC-ROC metric.

Adjust VGG with class weights

With the current decision threshold of 0.5, this model finds 12 false positive ships where there are none (and misses one true ship), but here is what we can get out of it by simply raising that threshold. I played with p manually until I found the value that yields the fewest false positives without adding any new false negatives. This could be considered data leakage, since I am adjusting the threshold according to the test data, so I would not do this in production; here my goal is just to show what can theoretically be squeezed out of this model.

If the goal were not to miss an actual ship at any cost, this would probably be my best choice. Let's manually calculate precision, recall and F1 score for this model:
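
A small helper for this (the function name and signature are mine; p is the adjusted decision threshold):

```python
import numpy as np

def precision_recall_f1(labels, predictions, p):
    """Precision, recall and F1 at decision threshold p, from raw counts."""
    pred_pos = predictions.ravel() > p
    tp = np.sum((labels == 1) & pred_pos)
    fp = np.sum((labels == 0) & pred_pos)
    fn = np.sum((labels == 1) & ~pred_pos)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```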

So even with this cheating adjustment of the threshold on test data, the F1 score does not beat the data augmentation model's.

Display wrongly labelled test pics

Using the VGG model with data augmentation, let's display the mislabelled examples from the training, dev and test sets to see if they have anything in common.

Training set

False positives:

False negatives:

Dev set

False positives:

False negatives:

Test set

False positives:

False negatives:

The findings are not entirely clear, but it looks like the false positives often include either

By providing more images of this kind in the training data, it may be possible to improve the model's results.

Future work

The next step would be going from image classification to object detection: exploring bigger satellite images, finding regions of interest (ROI), and predicting bounding boxes for all ships present.

Object detection