Fed-ISIC 2019
=============

The dataset used in this repo comes from the `ISIC2019 challenge `__ and the `HAM10000 database `__. We do not own the copyright of the data; everyone using these datasets should abide by their licences (see below) and give proper attribution to the original authors.

Dataset description
-------------------

The following table provides a data sheet:

+-------------+--------------------------------------------------------------+
|             | Dataset description                                          |
+=============+==============================================================+
| Description | Dataset from the ISIC 2019 challenge; we keep images for     |
|             | which the datacenter can be extracted.                       |
+-------------+--------------------------------------------------------------+
| Dataset     | 23,247 images of skin lesions ((9930/2483), (3163/791),      |
|             | (2691/672), (1807/452), (655/164), (351/88))                 |
+-------------+--------------------------------------------------------------+
| Centers     | 6 centers (BCN, HAM\_vidir\_molemax, HAM\_vidir\_modern,     |
|             | HAM\_rosendahl, MSK, HAM\_vienna\_dias)                      |
+-------------+--------------------------------------------------------------+
| Task        | Multiclass image classification                              |
+-------------+--------------------------------------------------------------+

License
~~~~~~~

The `full licence `__ for ISIC2019 is CC-BY-NC 4.0.

In order to extract the origins of the images in the HAM10000 Dataset (cited above), we store in this repository a copy of `the original HAM10000 metadata file `__. See the `full licence and dataset terms `__ for the HAM10000 Dataset.

Please first accept the licences on the HAM10000 and ISIC2019 dataset pages before going through the following steps.

Ethics
~~~~~~

As per the `Terms of Use `__ of the `website `__ hosting the dataset, one of the requirements for this dataset to be hosted is that it is properly de-identified in accordance with the applicable requirements and legislation.
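As a quick sanity check of the data sheet above, the per-center counts can be summed to recover the total number of images. The sketch below is not part of the repository. It assumes that each pair in the table is a (train/test) count and that the pairs are listed in the same order as the centers; the names ``CENTER_COUNTS`` and ``summarize`` are purely illustrative.

.. code:: python

    # Per-center (train, test) image counts copied from the data sheet.
    # Assumption: pairs are listed in the same order as the centers.
    CENTER_COUNTS = {
        "BCN": (9930, 2483),
        "HAM_vidir_molemax": (3163, 791),
        "HAM_vidir_modern": (2691, 672),
        "HAM_rosendahl": (1807, 452),
        "MSK": (655, 164),
        "HAM_vienna_dias": (351, 88),
    }


    def summarize(counts):
        """Return (total_train, total_test, total_images) over all centers."""
        total_train = sum(train for train, _ in counts.values())
        total_test = sum(test for _, test in counts.values())
        return total_train, total_test, total_train + total_test


    total_train, total_test, total_images = summarize(CENTER_COUNTS)
    print(total_train, total_test, total_images)  # prints: 18597 4650 23247

Under these assumptions the total matches the 23,247 images reported in the table, and each center follows a split of roughly 80% train and 20% test.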
Data
----

To download the ISIC 2019 training data and extract the original datacenter information for each image, first cd into the ``dataset_creation_scripts`` folder:

.. code:: bash

    cd flamby/datasets/fed_isic2019/dataset_creation_scripts

then run:

::

    python download_isic.py --output-folder /path/to/user/folder

The file train\_test\_split contains the train/test split of the images (stratified by center).

Image preprocessing
-------------------

To preprocess and resize images, run:

::

    python resize_images.py

This script resizes all images so that the shorter edge of the resized image is 224px and the aspect ratio of the input image is maintained. `Color constancy `__ is added in the preprocessing.

**Be careful: in order to allow for augmentations, image aspect ratios are preserved during preprocessing, so the resized images are rectangular, with one edge fixed at 224px and the other varying from image to image. As a result, they cannot be batched without first cropping them to a square. An example of such a cropping strategy can be found in the benchmark described below.**

Using the dataset
-----------------

Now that the dataset is ready for use, you can load it using the low- or high-level API by running the following in a Python shell:

.. code:: python

    from flamby.datasets.fed_isic2019 import FedIsic2019

    # To load the first center as a pytorch dataset
    center0 = FedIsic2019(center=0, train=True)
    # To load the second center as a pytorch dataset
    center1 = FedIsic2019(center=1, train=True)
    # To load the 3rd center ...
    # To sample batches from each of the local datasets use the traditional pytorch API
    from torch.utils.data import DataLoader as dl

    # Draw one batch of images X and labels y from the first center
    X, y = next(iter(dl(center0, batch_size=16, shuffle=True, num_workers=0)))

More information on how to train models and handle flamby datasets in general is available in the :any:`quickstart`.

Baseline training and evaluation in a pooled setting
----------------------------------------------------

To train and evaluate a classification model for the pooled dataset, run:

::

    python benchmark.py --GPU 0 --workers 4

References
----------

The "ISIC 2019: Training" is the aggregate of the following datasets:

BCN\_20000 Dataset: (c) Department of Dermatology, Hospital Clínic de Barcelona

HAM10000 Dataset: (c) by ViDIR Group, Department of Dermatology, Medical University of Vienna; `HAM10000 dataset `__

MSK Dataset: (c) Anonymous; `challenge 2017 `__; `challenge 2018 `__

See below the full citations:

[1] Tschandl P., Rosendahl C. & Kittler H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 5, 180161 doi:10.1038/sdata.2018.161 (2018).

[2] Noel C. F. Codella, David Gutman, M. Emre Celebi, Brian Helba, Michael A. Marchetti, Stephen W. Dusza, Aadi Kalloo, Konstantinos Liopyris, Nabin Mishra, Harald Kittler, Allan Halpern: “Skin Lesion Analysis Toward Melanoma Detection: A Challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), Hosted by the International Skin Imaging Collaboration (ISIC)”, 2017; arXiv:1710.05006.

[3] Marc Combalia, Noel C. F. Codella, Veronica Rotemberg, Brian Helba, Veronica Vilaplana, Ofer Reiter, Allan C. Halpern, Susana Puig, Josep Malvehy: “BCN20000: Dermoscopic Lesions in the Wild”, 2019; arXiv:1908.02288.

Acknowledgement
---------------

We thank `Aman Arora `__ for his `implementation `__ and `blog `__ that we used as a base for our own code.