plismbench.engine.extract.core module#

Perform feature extraction on the PLISM dataset.

plismbench.engine.extract.core.run_extract(feature_extractor_name: str, batch_size: int, device: int, export_dir: Path, download_dir: Path | None = None, streaming: bool = False, overwrite: bool = False, workers: int = 8) → None[source]#

Run feature extraction.

If streaming==False, data are downloaded from https://huggingface.co/datasets/owkin/plism-dataset and stored to disk. This dataset contains 91 .h5 files, each holding 16,278 images converted into numpy arrays. In this scenario, 300 GB of storage are required.
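
As a minimal sketch, a non-streaming run could look like the following; the extractor name and directory paths are placeholders rather than values prescribed by the library:

    from pathlib import Path

    from plismbench.engine.extract.core import run_extract

    run_extract(
        feature_extractor_name="my_extractor",  # placeholder: any registered extractor name
        batch_size=32,
        device=0,                               # CUDA device index
        export_dir=Path("./features"),          # where extracted features are written
        download_dir=Path("./plism_data"),      # where the 91 .h5 files are stored (~300 GB)
        streaming=False,
        workers=8,
    )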

If streaming==True, data are downloaded on the fly from https://huggingface.co/datasets/owkin/plism-dataset-tiles and are not stored to disk. This dataset contains 91 × 16,278 images stored as .png files. Streaming is enabled through the datasets library via datasets.load_dataset(…, streaming=True). Note that this relies on an IterableDataset, which means no easy resume can be performed if feature extraction fails.
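
The streaming access is handled internally by run_extract; as a minimal sketch of the pattern described above (the split name is an assumption and may differ):

    from datasets import load_dataset

    # streaming=True yields an IterableDataset: examples are produced lazily and
    # cannot be randomly accessed, hence the lack of an easy resume mechanism.
    tiles = load_dataset(
        "owkin/plism-dataset-tiles",
        split="train",      # assumption: the actual split name may differ
        streaming=True,
    )
    for example in tiles.take(2):
        print(example.keys())   # inspect the available fields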