plismbench.engine.extract.core module
Perform feature extraction from the PLISM dataset.
- plismbench.engine.extract.core.run_extract(feature_extractor_name: str, batch_size: int, device: int, export_dir: Path, download_dir: Path | None = None, streaming: bool = False, overwrite: bool = False, workers: int = 8) → None
Run feature extraction.
If streaming==False, data will be downloaded from https://huggingface.co/datasets/owkin/plism-dataset and stored to disk. This dataset contains 91 .h5 files, each containing 16,278 images converted into numpy arrays. In this scenario, 300 GB of storage is required.
If streaming==True, data will be downloaded on the fly from https://huggingface.co/datasets/owkin/plism-dataset-tiles but not stored to disk. This dataset contains 91 × 16,278 images stored as .png files. Streaming is enabled using the datasets library and datasets.load_dataset(…, streaming=True). Note that this requires using an IterableDataset, meaning that a failed feature extraction cannot easily be resumed.
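As an illustration only, the sketch below assembles keyword arguments matching the signature documented above and computes the dataset size from the figures quoted in the description. The helper build_extract_kwargs, the extractor name "my_extractor", and the directory names are hypothetical placeholders, not part of plismbench. Treating device as a GPU index and passing download_dir=None when streaming are assumptions, not confirmed behaviour.

```python
import math
from pathlib import Path

# Figures quoted in the description above.
N_FILES = 91
IMAGES_PER_FILE = 16_278
TOTAL_IMAGES = N_FILES * IMAGES_PER_FILE


def build_extract_kwargs(streaming: bool, batch_size: int = 32) -> dict:
    """Hypothetical helper: build keyword arguments for run_extract."""
    return {
        "feature_extractor_name": "my_extractor",  # placeholder, not a real extractor name
        "batch_size": batch_size,
        "device": 0,  # assumption: index of the GPU to use
        "export_dir": Path("features"),  # placeholder output directory
        # Assumption: no local copy is needed when streaming.
        "download_dir": None if streaming else Path("plism-dataset"),
        "streaming": streaming,
        "overwrite": False,
        "workers": 8,
    }


def n_batches(batch_size: int) -> int:
    """Number of batches needed to cover every image once."""
    return math.ceil(TOTAL_IMAGES / batch_size)


def main() -> None:
    kwargs = build_extract_kwargs(streaming=True)
    print(f"{TOTAL_IMAGES} images, {n_batches(kwargs['batch_size'])} batches")
    try:
        from plismbench.engine.extract.core import run_extract
    except ImportError:
        # plismbench is not installed: only show what would be passed.
        print("run_extract would be called with:", kwargs)
        return
    run_extract(**kwargs)


if __name__ == "__main__":
    main()
```

Choosing streaming=True avoids the 300 GB local download, at the cost of the resume limitation described above. Choosing streaming=False trades disk space for the ability to re-read files already on disk.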