Fed-Camelyon16
==============

Camelyon16, like Camelyon17, is open access (CC0); the original dataset is
accessible `here `__. We will first fetch the slides from the public Google
Drive, and will then tile the tissue matter and embed each tile with a
feature extractor, producing a bag of features for each slide.

Dataset description
-------------------

Please refer to the `dataset website `__ for an exhaustive data sheet
(``https://academic.oup.com/gigascience/article/7/6/giy065/5026175#117856577``).
The table below provides a high-level description of the dataset.

+--------------+-------------------------------------------------------------+
|              | Dataset description                                         |
+==============+=============================================================+
| Description  | Dataset from Camelyon16                                     |
+--------------+-------------------------------------------------------------+
| Dataset size | 900 GB (and 50 GB after feature extraction).                |
+--------------+-------------------------------------------------------------+
| Centers      | 2 centers - RUMC and UMCU.                                  |
+--------------+-------------------------------------------------------------+
| Records per  | RUMC: 169 (Train) + 74 (Test), UMCU: 101 (Train) + 55       |
| center       | (Test)                                                      |
+--------------+-------------------------------------------------------------+
| Inputs shape | Tensor of shape (10000, 2048) (after feature extraction).   |
+--------------+-------------------------------------------------------------+
| Total nb of  | 399 slides.                                                 |
| points       |                                                             |
+--------------+-------------------------------------------------------------+
| Task         | Weakly Supervised (Binary) Classification.                  |
+--------------+-------------------------------------------------------------+

License and terms of use
~~~~~~~~~~~~~~~~~~~~~~~~

This dataset is licensed under an open access Creative Commons 1.0 Universal
(**CC0 1.0**) license by its authors. *Anyone using this dataset should abide
by its license and give proper attribution to the original authors.*

Ethical approval
~~~~~~~~~~~~~~~~

As indicated by the dataset authors
(``https://academic.oup.com/gigascience/article/7/6/giy065/5026175#117856619``):

    The collection of the data was approved by the local ethics committee
    (Commissie Mensgebonden Onderzoek regio Arnhem - Nijmegen) under 2016-2761,
    and the need for informed consent was waived.

Download instructions
---------------------

Introduction
~~~~~~~~~~~~

The dataset is hosted on `several mirrors `__ (GigaScience, Google Drive,
Baidu Pan). We provide below some scripts to automatically download the
dataset based on the Google Drive API, which requires a Google account. If
you do not have a Google account, you can alternatively download the dataset
manually through one of the mirrors. You will find detailed instructions for
each method below. In both cases, make sure you have enough space to store
the raw dataset (~900 GB).

Method A: Automatic download with the Google Drive API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to use the Google Drive API you need a Google account and access to
the `Google developers console `__, from which you can obtain a JSON file
containing an OAuth 2.0 client secret. All the steps necessary to obtain this
JSON file are described in numerous places on the internet, such as pydrive's
`quickstart `__ or the first 5 minutes of this `very nice tutorial `__ on
YouTube. It should not take more than 5 minutes. The important steps are
listed below.

Step 1: Setting up the Google App and associated secret
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. Create a project in the `Google console `__. For instance, you can call
   it ``flamby``.
2. Go to the OAuth consent screen (on the left of the webpage), choose a name
   for your app and publish it for external use.
3. Go to Credentials and create an OAuth client ID.
4. Choose Web app, go through the steps and **allow URI redirection** towards
   ``http://localhost:6006`` and ``http://localhost:6006/`` (note the
   trailing slash).
5. Retrieve the secrets in JSON format by clicking on the Download icon at
   the end of the process.
6. Enable the Google Drive API for this project by clicking on "APIs and
   services" on the left panel.

Then copy the secrets file to the directory you want:

.. code:: bash

   cp ~/Downloads/code_secret_client_bignumber.apps.googleusercontent.com.json client_secrets.json
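To check that the secret and the redirect URI are correctly set up, you can
optionally run a minimal OAuth flow yourself. The sketch below assumes the
``pydrive2`` package (``pip install pydrive2``) and is only an illustration
of the consent flow; it is not necessarily what ``download.py`` does
internally.

.. code:: python

   # Minimal sketch of the OAuth 2.0 consent flow, assuming pydrive2 is installed.
   from pydrive2.auth import GoogleAuth
   from pydrive2.drive import GoogleDrive

   # GoogleAuth looks for client_secrets.json in the current directory by default
   gauth = GoogleAuth()
   # Opens a browser and listens on port 6006, matching the redirect URI above
   gauth.LocalWebserverAuth(port_numbers=[6006])
   drive = GoogleDrive(gauth)

   # Sanity check: list a few files visible to the authenticated account
   for f in drive.ListFile({"maxResults": 5}).GetList():
       print(f["title"])

If a browser window opens, asks for consent, and the listing prints without
errors, the credentials are ready for the download script below.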
Step 2: Downloading the dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- **Remark 1: If you are downloading on a remote server**, make sure you
  forward port 6006 of the server onto port 6006 of your laptop via ssh.
- **Remark 2:** Make sure you have enough space to hold the dataset (900 GB).

First cd into the ``dataset_creation_scripts`` folder:

.. code:: bash

   cd flamby/datasets/fed_camelyon16/dataset_creation_scripts

Then run:

.. code:: bash

   python download.py --output-folder ./camelyon16_dataset --path-to-secret /path/towards/client_secrets.json --port 6006

The first time this script is launched, the user will be asked to explicitly
allow the app to operate by logging into his/her Google account (hence the
need for forwarding port 6006 in the case of a remote machine without a
browser). This script will download all of Camelyon16's slides into the
output folder. As there are multiple slides that are quite big, this script
can take a few hours to complete. It can be stopped and resumed anytime;
however, if you are ssh'd into a server, it is better to use detached mode
(screen/tmux/etc.).

**IMPORTANT:** If you choose to relocate the dataset after downloading it, it
is imperative that you run the following script, otherwise all subsequent
scripts will not find it:

::

   python update_config.py --new-path /new/path/towards/dataset # add --debug if you are in debug mode

Method B: Manual download from the official mirrors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We are interested in the Camelyon16 portion of the `Camelyon dataset `__. In
the following, we detail the steps to manually download the dataset from the
Google Drive repository; you can easily adapt them to the other mirrors.

Camelyon16 is stored on a public `Google Drive `__. The dataset is pre-split
into training and testing slides. The training slides are further divided
into 2 folders: normal and tumor. Download all the ``.tif`` files in the
`normal `__, `tumor `__ and `testing images `__ folders, and put all the
resulting files into a single folder. You should end up with 399 ``.tif``
files in a given folder ``PATH-TO-FOLDER``.

The last step consists in creating a metadata file that will be used by the
preprocessing step. Create a file named ``dataset_location.yaml`` under
``flamby/datasets/fed_camelyon16/dataset_creation_scripts/`` with the
following content:

.. code:: yaml

   dataset_path: PATH-TO-FOLDER
   download_complete: true

The download is now complete.

Dataset preprocessing (tile extraction)
---------------------------------------

The next step is to extract tiles from the tissue matter of each slide and to
embed them with a feature extractor pretrained on ImageNet. We will use the
`histolab package `__ to segment the matter on each slide and torchvision to
download a pretrained ResNet50 that will be applied on each tile to convert
each slide into a bag of numpy features. This package requires the
installation of `OpenSlide `__. The associated webpage contains instructions
to install it on every major distribution. On Linux simply run:

.. code:: bash

   sudo apt-get install openslide-tools

One can choose whether or not to remove the original slides, which take up
quite some space, in order to keep only the features (therefore using only
approximately 50 GB instead of 800 GB). As extracting the matter on all the
slides is a lengthy process, this script might take a few hours (and a few
days if the tiling is done from scratch). It can also be stopped and resumed
anytime, and should preferably be run in detached mode. This process should
be run on an environment with a GPU, otherwise it might be prohibitively
slow.

.. code:: bash

   python tiling_slides.py --batch-size 64

or

.. code:: bash

   python tiling_slides.py --batch-size 64 --remove-big-tiff
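To give an idea of what this step produces, here is a minimal sketch of the
embedding of a single batch of tiles (assuming torchvision >= 0.13; the
matter segmentation done by histolab is omitted, and the actual
``tiling_slides.py`` may differ in its details):

.. code:: python

   # Minimal sketch of the per-tile feature extraction, assuming torchvision >= 0.13.
   import torch
   import torchvision

   # Pretrained ResNet50 with its classification head removed: the remaining
   # network maps a 224x224 RGB tile to a 2048-dimensional feature vector.
   resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
   extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

   # A dummy batch of 64 random tiles standing in for real tiles of one slide
   tiles = torch.randn(64, 3, 224, 224)
   with torch.no_grad():
       features = extractor(tiles).flatten(start_dim=1)
   print(features.shape)  # torch.Size([64, 2048])

Repeating this over all matter-bearing tiles of a slide yields the bag of
features of shape (10000, 2048) given in the table above.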
Using the dataset
-----------------

Now that the dataset is ready for use, you can load it with the low- or
high-level API by running the following in a python shell:

.. code:: python

   from flamby.datasets.fed_camelyon16 import FedCamelyon16, Camelyon16Raw

   # To load the first center as a pytorch dataset
   center0 = FedCamelyon16(center=0, train=True)
   # To load the second center as a pytorch dataset
   center1 = FedCamelyon16(center=1, train=True)

   # To sample batches from each of the local datasets, use the traditional pytorch API
   from torch.utils.data import DataLoader as dl

   # For this specific dataset, samples do not all have the same size, so
   # batching requires the padding implemented in collate_fn
   from flamby.datasets.fed_camelyon16 import collate_fn

   X, y = next(iter(dl(center0, batch_size=16, shuffle=True, num_workers=0, collate_fn=collate_fn)))

More information on how to train models and handle FLamby datasets in general
is available in the :any:`quickstart`.

Benchmarking the baseline in a pooled setting
---------------------------------------------

In order to benchmark the baseline on the pooled dataset, one needs to
download and preprocess the dataset, and then launch the following script:

.. code:: bash

   python benchmark.py --log --num-workers-torch 10

This will launch 5 single-centric runs and store the training logs in
``./runs/seed42-47`` and the test logs in ``./runs/tests-seed42-47``. The
command:

.. code:: bash

   tensorboard --logdir=./runs

can then be used to visualize the results (use `port forwarding if necessary `__).
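For instance, if the runs live on a remote server, a common way to view them
locally is to forward TensorBoard's default port over ssh
(``user@remote-server`` below is a placeholder for your own machine):

.. code:: bash

   # Forward the remote TensorBoard port (6006 by default) to your laptop
   ssh -L 6006:localhost:6006 user@remote-server

TensorBoard is then reachable at ``http://localhost:6006`` in your local
browser.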