Skin cancer is the most common cancer in the world. In the US alone there are 5.4 million new cases of skin cancer every year. There are several types of skin cancer, including carcinomas and melanomas (black cancer). The survival rate of patients whose cancer has reached stage IV is roughly 20%, so early detection is essential for preventing deaths from skin cancer.
Classifying melanomas from clinical images of skin conditions is a very hard problem. For example, looking at the following images, it is very difficult to determine whether a lesion is benign (above) or malignant (below).
In this article I walk you through fine-tuning two different CNN architectures (VGG16 and ResNet50) to distinguish melanoma from two types of benign lesions (nevi and seborrheic keratoses). This work is my capstone project for the Machine Learning Engineer Nanodegree program at Udacity.
1. Dataset Exploration
The training, validation and test data come from the 2017 ISIC Challenge on Skin Lesion Analysis Towards Melanoma Detection and can be downloaded from the links below:
- Training data: https://s3-us-west-1.amazonaws.com/udacity-dlnfd/datasets/skin-cancer/train.zip
- Validation data: https://s3-us-west-1.amazonaws.com/udacity-dlnfd/datasets/skin-cancer/valid.zip
- Test data: https://s3-us-west-1.amazonaws.com/udacity-dlnfd/datasets/skin-cancer/test.zip
Each dataset contains three sub-folders, one for each of the three image classes: melanoma, nevus and seborrheic keratosis. There are 2000, 150 and 600 images in the training, validation and test sets respectively.
In general, it seems that we do not really suffer from the class imbalance problem that is very often seen in medical image classification.
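A quick way to check the class balance is to count the image files in each class sub-folder. The sketch below assumes the archives are unpacked into a directory such as `skin-cancer-data/train` with one sub-folder per class; the helper name `count_images_per_class` and the list of file extensions are illustrative assumptions, not part of the original project.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # assumed image file types


def count_images_per_class(split_dir):
    """Return {class_name: number_of_image_files} for one dataset split."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
            )
    return counts


if __name__ == "__main__":
    split = Path("skin-cancer-data/train")  # assumed location of the unpacked data
    if split.is_dir():
        for name, n in count_images_per_class(split).items():
            print(f"{name}: {n}")
```

Running the same function on the validation and test folders gives a per-split picture of how the three classes are distributed.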
All the training images are normalized and transformed (see the Methodology section) and loaded into a DataLoader object. We can then look at an individual image using the loader's iterator.
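Looking at one image from a loader can be sketched as follows. To keep the example self-contained it uses a small random `TensorDataset` as a stand-in for the real training loader; the helper `denormalize` is a hypothetical name, and it simply reverses the normalization step so that pixel values return to the [0, 1] range for plotting.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# per-channel mean and std used in the normalization step
MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)


def denormalize(image):
    """Undo Normalize(MEAN, STD) on a (3, H, W) tensor and clamp to [0, 1]."""
    return (image * STD + MEAN).clamp(0.0, 1.0)


# stand-in data: 8 random "images" with labels in {0, 1, 2}
fake_images = torch.randn(8, 3, 224, 224)
fake_labels = torch.randint(0, 3, (8,))
loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=4)

# take the first batch from the iterator and pick one image
images, labels = next(iter(loader))
image = denormalize(images[0])

# (H, W, 3) is the layout that plotting functions such as matplotlib's imshow expect
image_hwc = image.permute(1, 2, 0)
print(images.shape, image_hwc.shape, int(labels[0]))
```

With the real data, `loader` would be the training DataLoader, and `image_hwc` could be passed to a plotting function.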
2. Benchmark Models and Evaluation Metrics
Convolutional neural networks (CNNs) and transfer learning have proved to be very effective for medical image classification in general and for skin cancer in particular. The early layers of a pretrained CNN learn low-level features such as edges, colors and blobs, which carry over to new image domains. To detect features specific to skin cancer, we then need to fine-tune the last layers of the CNN model on our skin cancer dataset.
In this article, two pretrained CNN models in PyTorch (ResNet50 and VGG16) are fine-tuned to classify the three classes of skin lesions. The models are evaluated using the accuracy of the predictions for each class. We can then compare our models with the top scores from the ISIC competition.
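One reading of per-class accuracy is the fraction of images of a given class that the model labels correctly (the same quantity as per-class recall). A minimal sketch of that computation, using made-up labels purely for illustration; the class names and the function name are assumptions:

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred, class_names):
    """Fraction of correctly classified samples for each true class."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {
        name: (correct[i] / total[i] if total[i] else float("nan"))
        for i, name in enumerate(class_names)
    }


# illustrative labels only, not results from the article
class_names = ["melanoma", "nevus", "seborrheic_keratosis"]
y_true = [0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 1, 0, 2, 1]

print(per_class_accuracy(y_true, y_pred, class_names))
# {'melanoma': 0.5, 'nevus': 0.75, 'seborrheic_keratosis': 0.5}
```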
3. Methodology
3.1. Data Normalization and Augmentation
All images are loaded using the PyTorch DataLoader. Validation and test images are resized so that their shorter side is 255 pixels and then center-cropped to 224 by 224, so that we can easily feed them to a CNN model. We have roughly 2000 images for training a CNN model, so it is probably a good idea to use some data augmentation to reduce over-fitting. Possible augmentations include flipping, rotating, zooming, and changing the contrast or lighting. For example, we can perform these steps as follows:
import torch
from torchvision import datasets, transforms

# training transforms: random rotation, crop and flip, then normalize
train_ts = transforms.Compose([
    transforms.RandomRotation(30),
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])])

# validation/test transforms: resize the shorter side to 255 and then center crop
test_ts = transforms.Compose([
    transforms.Resize(255),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])])

# load train/valid/test set using ImageFolder from torchvision
data_dir = 'skin-cancer-data'
train_dir = f'{data_dir}/train'
valid_dir = f'{data_dir}/valid'
test_dir = f'{data_dir}/test'

train_set = datasets.ImageFolder(train_dir, transform=train_ts)
valid_set = datasets.ImageFolder(valid_dir, transform=test_ts)
test_set = datasets.ImageFolder(test_dir, transform=test_ts)

# using PyTorch DataLoader to load the images
batch_size = 100
train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid_set, batch_size=batch_size)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size)
3.2. Define and train CNNs
As described earlier, CNN models will be trained to classify skin cancer images. However, instead of training the models from scratch, we make use of pretrained CNN models in PyTorch (ResNet50 and VGG16), whose early layers have already learned low-level features. We then fine-tune each model to classify the three classes of skin lesions.
3.3. Evaluating the networks
All models are evaluated using accuracy. The best model is then used on the test set, and its final performance is compared with the top scores from the ISIC competition.
4. Results
4.1. VGG16
To fine-tune the VGG16 model, the last linear layer is replaced by a new linear layer whose number of outputs equals the number of lesion classes (3) that we want to classify. Then a learning rate and a loss function are picked. The convolutional "features" layers are frozen, and the remaining classifier layers, including the new last layer, are then trained for 30 epochs.
from torch import nn, optim
from torchvision import models

# assumed definitions (not shown in the original listing)
classes = train_set.classes
train_on_gpu = torch.cuda.is_available()

# Load the pretrained model from PyTorch
vgg16 = models.vgg16(pretrained=True)

# Freeze training for all "features" layers
for param in vgg16.features.parameters():
    param.requires_grad = False

# add last linear layer (n_inputs -> 3 skin classes)
# new layers automatically have requires_grad = True
n_inputs = vgg16.classifier[6].in_features
last_layer = nn.Linear(n_inputs, len(classes))
vgg16.classifier[6] = last_layer

# if GPU is available, move the model to GPU
if train_on_gpu:
    vgg16 = vgg16.cuda()

# specify loss function (categorical cross-entropy)
criterion = nn.CrossEntropyLoss()

# specify optimizer (Adam) and learning rate
optimizer = optim.Adam(vgg16.parameters(), lr=1e-5)
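Training with a loss function and an optimizer like these, while keeping the weights that give the lowest validation loss, can be sketched as below. This is a minimal illustration rather than the exact loop used for the project: the function name `train_keep_best` is hypothetical, a tiny linear model on random data stands in for VGG16 and the image loaders so that the sketch runs quickly, and everything stays on the CPU.

```python
import copy

import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset


def train_keep_best(model, train_loader, valid_loader, criterion, optimizer, n_epochs):
    """Train for n_epochs and return (best_state_dict, best_valid_loss)."""
    best_loss, best_state = float("inf"), None
    for epoch in range(n_epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

        # average validation loss over all validation samples
        model.eval()
        total, n = 0.0, 0
        with torch.no_grad():
            for x, y in valid_loader:
                total += criterion(model(x), y).item() * x.size(0)
                n += x.size(0)
        valid_loss = total / n

        # keep a copy of the weights whenever the validation loss improves
        if valid_loss < best_loss:
            best_loss = valid_loss
            best_state = copy.deepcopy(model.state_dict())
    return best_state, best_loss


# tiny stand-in problem so the sketch runs without the image data
torch.manual_seed(0)
x = torch.randn(60, 10)
y = torch.randint(0, 3, (60,))
train_loader = DataLoader(TensorDataset(x[:40], y[:40]), batch_size=10, shuffle=True)
valid_loader = DataLoader(TensorDataset(x[40:], y[40:]), batch_size=10)

model = nn.Linear(10, 3)
best_state, best_loss = train_keep_best(
    model, train_loader, valid_loader,
    nn.CrossEntropyLoss(), optim.Adam(model.parameters(), lr=1e-2), n_epochs=5)
model.load_state_dict(best_state)  # restore the best weights before testing
print(f"best validation loss: {best_loss:.4f}")
```

For the real models, the inputs and labels would also need to be moved to the GPU inside both loops when one is available.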
If we look at the above figure, the training and validation losses decrease steadily but slowly over the 30 epochs. During training, the model is saved whenever it achieves a better validation loss, so the final model is the one with the smallest validation loss. This model is then used on the test set and its accuracy is recorded. The figure below illustrates the confusion matrix of the final VGG16 model on the test set. As we can see, the accuracy of the model for the nevus class is much better than for the others. A possible explanation is that the training set contains more images of the nevus type than of the other two.
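A confusion matrix like the one described above can be computed from the true and predicted class indices. The sketch below uses made-up labels, not the article's results; rows are true classes and columns are predicted classes, and the class names are assumptions.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Return an n_classes x n_classes matrix: rows are true, columns predicted."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# illustrative labels only
class_names = ["melanoma", "nevus", "seborrheic_keratosis"]
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 1, 1, 1, 0, 2, 1]

cm = confusion_matrix(y_true, y_pred, len(class_names))
for name, row in zip(class_names, cm):
    print(f"{name:>22}: {row}")
```

The diagonal entry of each row divided by the row sum gives the accuracy for that class, which makes it easy to see which classes are confused with each other.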
4.2. ResNet50
Similarly, we define and train a ResNet50 model as below. However, instead of a single linear layer, this time we replace the last ResNet50 layer with two linear layers. For this reason, a different learning rate is used than for the previous VGG16 model (1e-3 instead of 1e-5). The last two layers are also trained for 30 epochs. If we look at figure 5.4, the training and validation losses drop significantly during the first 5 epochs but then fluctuate a lot. This could be explained by the selected learning rate; it would probably be better to try different learning rates and then pick the best one. The best model found during training is again saved and then evaluated on the test set. As we can see, the accuracy is slightly worse than that of our previous VGG16 model.
# Load the pretrained model from PyTorch
resnet50 = models.resnet50(pretrained=True)

# Freeze training for all pretrained layers
for param in resnet50.parameters():
    param.requires_grad = False

# replace the last layer with two linear layers (n_inputs -> 128 -> 3 cancer classes)
# new layers automatically have requires_grad = True
n_inputs = resnet50.fc.in_features
resnet50.fc = nn.Sequential(
    nn.Linear(n_inputs, 128),
    nn.ReLU(inplace=True),
    nn.Linear(128, 3))

# if GPU is available, move the model to GPU
if train_on_gpu:
    resnet50 = resnet50.cuda()

# specify loss function (categorical cross-entropy)
criterion = nn.CrossEntropyLoss()

# specify optimizer (Adam) and learning rate; only the new layers are optimized
optimizer = optim.Adam(resnet50.fc.parameters(), lr=1e-3)
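Trying several learning rates and keeping the best one can be sketched as a simple sweep. The function `pick_learning_rate` and the callback `train_and_validate` are hypothetical names: in practice the callback would build a fresh model, train it for a few epochs with the given learning rate and return its best validation loss. Here a made-up placeholder curve stands in for that step so the sketch runs on its own; its values are not real results.

```python
import math


def pick_learning_rate(candidates, train_and_validate):
    """Return (best_lr, results), where results maps each lr to its validation loss."""
    results = {lr: train_and_validate(lr) for lr in candidates}
    best_lr = min(results, key=results.get)
    return best_lr, results


def fake_train_and_validate(lr):
    # placeholder for a real training run: an artificial curve with its
    # minimum at lr = 1e-4, used only to make the example executable
    return (math.log10(lr) + 4) ** 2 + 0.5


best_lr, results = pick_learning_rate([1e-2, 1e-3, 1e-4, 1e-5], fake_train_and_validate)
for lr, loss in results.items():
    print(f"lr={lr:g}: validation loss {loss:.2f}")
print(f"selected learning rate: {best_lr:g}")
```

Spacing the candidates by factors of ten is a common starting point; a finer search around the best value can follow.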
5. Conclusions
In this article, two different CNN models (VGG16 and ResNet50) are fine-tuned to distinguish melanoma from two types of benign lesions (nevi and seborrheic keratoses). The two models are more or less similar in terms of accuracy on the test set. In particular, their performance is better for the nevus class than for the other two. Compared with the top scores from the ISIC competition, the two models are in the middle, so there is clearly room for improvement.