# Fed-Camelyon16
Camelyon16, like Camelyon17, is open access (CC0); the original dataset is accessible here. We will first fetch the slides from the public Google Drive and then tile the tissue matter using a feature extractor, producing a bag of features for each slide.
## Dataset description

Please refer to the dataset website for an exhaustive data sheet (https://academic.oup.com/gigascience/article/7/6/giy065/5026175#117856577). The table below provides a high-level description of the dataset.
| Dataset description | |
| --- | --- |
| Description | Dataset from Camelyon16 |
| Dataset size | 900 GB (and 50 GB after feature extraction) |
| Centers | 2 centers - RUMC and UMCU |
| Records per center | RUMC: 169 (Train) + 74 (Test), UMCU: 101 (Train) + 55 (Test) |
| Inputs shape | Tensor of shape (10000, 2048) (after feature extraction) |
| Total nb of points | 399 slides |
| Task | Weakly Supervised (Binary) Classification |
## License and terms of use

This dataset is licensed under the open access Creative Commons 1.0 Universal (CC0 1.0) license by its authors. Anyone using this dataset should abide by its license and give proper attribution to the original authors.
## Ethical approval

As indicated by the dataset authors (https://academic.oup.com/gigascience/article/7/6/giy065/5026175#117856619),

> The collection of the data was approved by the local ethics committee
> (Commissie Mensgebonden Onderzoek regio Arnhem - Nijmegen) under
> 2016-2761, and the need for informed consent was waived.
## Download instructions

### Introduction

The dataset is hosted on several mirrors (GigaScience, Google Drive, Baidu Pan). We provide below some scripts to automatically download the dataset based on the Google Drive API, which requires a Google account. If you do not have a Google account, you can alternatively download the dataset manually through one of the mirrors. You will find detailed instructions for each method below. In both cases, make sure you have enough space to store the raw dataset (~900 GB).
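If you are unsure how much space is left on the target drive, a quick way to check from Python is `shutil.disk_usage` (the path below is only a placeholder for your own storage location):

```python
import shutil

# Check free space on the drive that will hold the raw slides (~900 GB needed).
# Replace "/data" with the mount point you actually plan to use.
total, used, free = shutil.disk_usage("/data")
print(f"Free space: {free / 1e9:.0f} GB")
```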
### Method A: Automatic download with the Google Drive API

In order to use the Google Drive API you need to have a Google account and to access the Google developers console in order to get a JSON file containing an OAuth 2.0 secret.
All the steps necessary to obtain this JSON file are described in numerous places on the internet, such as pydrive’s quickstart, or in the first 5 minutes of this very nice tutorial on YouTube. It should not take more than 5 minutes. The important steps are listed below.
#### Step 1: Setting up Google App and associated secret

1. Create a project in the Google console. For instance, you can call it `flamby`.
2. Go to the OAuth2 consent screen (on the left of the webpage), choose a name for your app and publish it for external use.
3. Go to Credentials, create an ID, then a client OAuth ID.
4. Choose Web app, go through the steps and allow URI redirects towards `http://localhost:6006` and `http://localhost:6006/` (notice the trailing slash).
5. Retrieve the secrets in JSON by clicking on the Download icon at the end of the process.
6. Enable the Google Drive API for this project by clicking on "APIs and services" on the left panel.
Then copy-paste your secrets to the directory you want:

```bash
cp ~/Downloads/code_secret_client_bignumber.apps.googleusercontent.com.json client_secrets.json
```
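If you want to check that the secret works before launching the full download, here is a minimal sketch using PyDrive, assuming the download script relies on a PyDrive-style OAuth flow; the port and file name simply mirror the instructions above:

```python
from pydrive.auth import GoogleAuth

# Tell PyDrive where the OAuth client secret lives (the file copied above).
GoogleAuth.DEFAULT_SETTINGS["client_config_file"] = "client_secrets.json"

gauth = GoogleAuth()
# Opens a browser on http://localhost:6006 to complete the OAuth consent flow.
gauth.LocalWebserverAuth(port_numbers=[6006])
print("Authentication succeeded.")
```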
#### Step 2: Downloading the dataset

Remark 1: If you are downloading on a remote server, make sure you forward port 6006 of the server onto port 6006 of your laptop via SSH (e.g. `ssh -L 6006:localhost:6006 user@server`).
Remark 2: Make sure you have enough space to hold the dataset (900 GB).
First cd into the `dataset_creation_scripts` folder:

```bash
cd flamby/datasets/fed_camelyon16/dataset_creation_scripts
```
Then run:

```bash
python download.py --output-folder ./camelyon16_dataset --path-to-secret /path/towards/client_secrets.json --port 6006
```
The first time this script is launched, the user will be asked to explicitly allow the app to operate by logging into their Google account (hence the need for the port 6006 forwarding in the case of a remote machine without a browser).
This script will download all of Camelyon16's slides into the output folder. As the slides are numerous and quite big, this script can take a few hours to complete. It can be stopped and resumed at any time; however, if you are connected to a server over SSH, it is better to run it in detached mode (screen/tmux/etc.).
IMPORTANT: If you choose to relocate the dataset after downloading it, it is imperative that you run the following script, otherwise all subsequent scripts will not find it:
```bash
python update_config.py --new-path /new/path/towards/dataset  # add --debug if you are in debug mode
```
### Method B: Manual download from the official mirrors

We are interested in the Camelyon16 portion of the Camelyon dataset. In the following, we will detail the steps to manually download the dataset from the Google Drive repository. You can easily adapt the steps to the other mirrors.
Camelyon16 is stored on a public Google Drive. The dataset is pre-split into training and testing slides. The training slides are further divided into 2 folders: normal and tumor. Download all the .tif files in the normal, tumor and testing images folders, and put all the resulting files into a single folder. You should end up with 399 .tif files in a given folder PATH-TO-FOLDER.
The last step consists in creating a metadata file that will be used by the preprocessing step. Create a file named dataset_location.yaml under flamby/datasets/fed_camelyon16/dataset_creation_scripts/ with the following content:

```yaml
dataset_path: PATH-TO-FOLDER
download_complete: true
```
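Before marking the download as complete, you may want to check that the folder indeed contains the expected 399 slides; a minimal sketch (the path below is a placeholder for your own PATH-TO-FOLDER):

```python
from pathlib import Path

# Count the .tif slides in the download folder (replace with your PATH-TO-FOLDER).
slides = sorted(Path("/path/to/PATH-TO-FOLDER").glob("*.tif"))
print(f"Found {len(slides)} slides")
assert len(slides) == 399, "Some slides are missing, re-check the download."
```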
The download is now complete.

## Dataset preprocessing (tile extraction)
The next step is to tile the tissue matter on each slide with a feature extractor pretrained on ImageNet.
We will use the histolab package to segment the matter on each slide and torchvision to download a pretrained ResNet50 that will be applied on each tile, converting each slide into a bag of numpy features. The histolab package requires the installation of OpenSlide. The associated webpage contains instructions to install it on every major distribution. On Linux, simply run:
```bash
sudo apt-get install openslide-tools
```
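To verify that OpenSlide is correctly installed and visible from Python (the `openslide` Python binding is pulled in as a histolab dependency), a quick check could be:

```python
import openslide

# Prints the version of the underlying OpenSlide C library, confirming that the
# apt-installed library is found by the Python binding.
print(openslide.__library_version__)
```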
One can choose whether or not to remove the original slides, which take up quite some space, in order to keep only the features (thereby using only approximately 50 GB instead of 800 GB).
As extracting the matter on all the slides is a lengthy process, this script might take a few hours (and a few days if the tiling is done from scratch). It can be stopped and resumed at any time and should preferably be run in detached mode. This process should be run on an environment with a GPU, otherwise it might be prohibitively slow.
```bash
python tiling_slides.py --batch-size 64
```

or

```bash
python tiling_slides.py --batch-size 64 --remove-big-tiff
```
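The full pipeline lives in `tiling_slides.py`; as a rough illustration of the idea, here is a minimal sketch of the feature-extraction step only, assuming tiles have already been segmented and loaded as a batch of 224x224 RGB tensors (the tile segmentation itself, done with histolab in the actual script, is omitted):

```python
import torch
import torchvision

# ImageNet-pretrained ResNet50 with its classification head removed, so each
# 224x224 tile is mapped to a 2048-dimensional feature vector.
resnet = torchvision.models.resnet50(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

# Dummy batch standing in for tiles segmented from one slide.
tiles = torch.randn(64, 3, 224, 224)

with torch.no_grad():
    features = feature_extractor(tiles).flatten(start_dim=1)

# Stacking the features of all tiles of a slide yields its bag of features,
# i.e. a (num_tiles, 2048) array as described in the table above.
print(features.shape)  # torch.Size([64, 2048])
```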
## Using the dataset

Now that the dataset is ready for use, you can load it using the low- or high-level API by running the following in a Python shell:
```python
from flamby.datasets.fed_camelyon16 import FedCamelyon16, Camelyon16Raw

# To load the first center as a pytorch dataset
center0 = FedCamelyon16(center=0, train=True)
# To load the second center as a pytorch dataset
center1 = FedCamelyon16(center=1, train=True)

# To sample batches from each of the local datasets use the traditional pytorch API
from torch.utils.data import DataLoader as dl

# For this specific dataset samples do not have the same size, therefore
# batching requires the padding implemented in collate_fn
from flamby.datasets.fed_camelyon16 import collate_fn

X, y = next(iter(dl(center0, batch_size=16, shuffle=True, num_workers=0, collate_fn=collate_fn)))
```
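As a quick sanity check you can inspect the padded batch; assuming `collate_fn` pads every bag in the batch to the length of the longest one, the shapes should look roughly like this (a sketch, not output guaranteed by the library):

```python
# X: padded bags of tile features, y: slide-level binary labels.
print(X.shape)  # e.g. torch.Size([16, max_num_tiles_in_batch, 2048])
print(y.shape)  # e.g. torch.Size([16, 1])
```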
More information on how to train models and handle FLamby datasets in general is available in the Quickstart.
## Benchmarking the baseline in a pooled setting

In order to benchmark the baseline on the pooled dataset, one needs to download and preprocess the dataset and then launch the following script:
```bash
python benchmark.py --log --num-workers-torch 10
```
This will launch 5 single-centric runs and store the training logs in ./runs/seed42-47 and the testing logs in ./runs/tests-seed42-47. The command:
```bash
tensorboard --logdir=./runs
```

can then be used to visualize the results (use port forwarding if necessary).