Optimizing the Training Step in Deep Learning-Automated High-Content Analysis at Sustained Result Quality
SLAS Europe 2021
May 25, 2021
In recent years, great strides have been made in applying artificial intelligence (AI), and in particular deep learning (DL) methods, to biopharma research, for example in automating image analysis in high-content screening (HCS). However, broader adoption of such methods remains limited and their true promise unmet because of practical hurdles such as the difficulty of procuring high-quality training data sets.
Labeled training data is key to generating accurate DL models and, consequently, high-quality results. In HCS, assay experts usually curate such data manually, a tedious process that requires both precious expert time and raw image data, which is often scarce and/or imperfect. In previous work developing the Genedata Imagence software system, we optimized the training step to drastically reduce manual work and streamline data curation. Here, we systematically analyzed the impact of sample size, sample imbalance across training classes, and label noise on the quality of pharmacological end results derived from a high-content screen. Applying Genedata Imagence to two industry assays, we found remarkable robustness with respect to sample size and imbalance, yet sensitivity to mislabeling. This shows that training in Imagence can be done with relatively few samples, but it emphasizes the importance of accurate labeling in this critical step.
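The three factors examined here (sample size, class imbalance, and label noise) can each be simulated by perturbing a labeled training set before training. The following Python sketch is illustrative only and is not the procedure used in Genedata Imagence. The function names `subsample` and `flip_labels`, the class names "active" and "inactive", and all counts and fractions are assumptions chosen for the example.

```python
import random

def subsample(labels, n_per_class, rng):
    """Return sorted indices keeping at most n_per_class[c] samples of class c.

    Classes missing from n_per_class are kept in full.
    """
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    kept = []
    for c, idx in by_class.items():
        k = min(n_per_class.get(c, len(idx)), len(idx))
        kept.extend(rng.sample(idx, k))
    return sorted(kept)

def flip_labels(labels, fraction, classes, rng):
    """Return a copy of labels with `fraction` of entries moved to another class."""
    noisy = list(labels)
    n_flip = round(fraction * len(noisy))
    for i in rng.sample(range(len(noisy)), n_flip):
        noisy[i] = rng.choice([c for c in classes if c != noisy[i]])
    return noisy

# Hypothetical two-class training set (class names are placeholders).
rng = random.Random(0)
labels = ["active"] * 200 + ["inactive"] * 200

# Sample size: shrink both classes equally.
small = subsample(labels, {"active": 50, "inactive": 50}, rng)

# Imbalance: shrink only one class.
skewed = subsample(labels, {"active": 20, "inactive": 200}, rng)

# Label noise: reassign 10% of the labels to the other class.
noisy = flip_labels(labels, 0.10, ["active", "inactive"], rng)
```

Training one model per perturbed set and comparing the resulting pharmacological readouts against those from the unperturbed set would then show how sensitive the end results are to each factor.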