AI for automated sleep staging is considered mature and has found its way into commercial sleep evaluation systems. Systems underpinned by deep learning require large datasets providing a broad sample of the sleep stages the machine seeks to ‘learn’. Of concern, most databases used for developing AI sleep stagers include only recordings from adults, creating a significant inherent sample bias when such systems are applied in pediatric or geriatric sleep settings. Here we evaluated the scoring errors created by age-biased AI sleep stagers.
We trained state-of-the-art deep learning sleep stagers separately with pediatric, adult, and older adult sleep EEG as well as with a combination of all groups and evaluated the sleep stage classification performance of the systems when subjected to pediatric and older adult EEG using a set of commonly used sleep metrics.
The experimental data comprised 7777 overnight PSG recordings with at least 5 h of valid EEG data, pooled from pediatric, adult, and older adult cohorts.
When pediatric sleep EEG was staged by the system exclusively trained on pediatric data, the overall accuracy was 88.9%. Accuracy dropped significantly, however, when sleep staging was performed by a system trained solely on adult data (78.9%). With the system trained on mixed cohort data, the overall accuracy was comparable to that of the pediatric sleep stager (87.5%). Overall performance on older adults was comparable across systems. Notably, significant errors in individuals’ sleep metrics occurred despite good overall performance.
Age bias in the training sample of deep learning sleep stagers, particularly the lack of pediatric data, can reduce overall classification performance. The problem can be effectively rectified by adding sleep EEG from children to the training data. Aside from statistically evaluating the accuracy of sleep stage prediction, AI systems should be assessed against clinically used metrics derived from the hypnogram.
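To illustrate the kind of hypnogram-derived clinical metrics referred to here, a minimal sketch follows. Stage codes, function names, and variable names are our own illustrative assumptions (AASM-style 30-second epochs, 0 = Wake), not code from the paper:

```python
# Minimal sketch: clinical sleep metrics from a 30-s-epoch hypnogram.
# Assumed stage codes: 0 = Wake, 1 = N1, 2 = N2, 3 = N3, 4 = REM.
EPOCH_SEC = 30

def sleep_metrics(hypnogram):
    """Compute common hypnogram-derived metrics (minutes / percent)."""
    n = len(hypnogram)
    tst_min = sum(1 for s in hypnogram if s != 0) * EPOCH_SEC / 60  # total sleep time
    trt_min = n * EPOCH_SEC / 60                                    # total recording time
    efficiency = 100.0 * tst_min / trt_min if n else 0.0

    # Sleep onset = first non-wake epoch; WASO = wake after that point.
    onset = next((i for i, s in enumerate(hypnogram) if s != 0), None)
    waso_min = (sum(1 for s in hypnogram[onset:] if s == 0) * EPOCH_SEC / 60
                if onset is not None else 0.0)

    return {"TST_min": tst_min, "efficiency_pct": efficiency, "WASO_min": waso_min}

# Example: 10 min wake, 7 h N2, 20 min wake, 30 min REM
hyp = [0] * 20 + [2] * 840 + [0] * 40 + [4] * 60
print(sleep_metrics(hyp))  # TST 450 min, efficiency 93.75%, WASO 20 min
```

Two systems with near-identical epoch-level accuracy can still diverge on metrics like these for a given patient, which is why the authors argue for evaluating against hypnogram-derived markers rather than accuracy alone.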
Background: Various deep-learning systems have been proposed for automated sleep staging. Still, the significance of age-specific underrepresentation in training data and the resulting errors in clinically used sleep metrics are unknown.
Methods: We adopted XSleepNet2, a deep neural network for automated sleep staging, to train and test models using polysomnograms (PSG) of 1232 children (7.0 ± 1.4 years), 3757 adults (56.9 ± 19.4 years), and 2788 older adults (80.7 ± 4.2 years). We developed four separate sleep stage classifiers, trained exclusively on pediatric (P), adult (A), or older adult (O) PSG, or on PSG from all three cohorts combined (PAO). Results were compared against an alternative sleep stager (DeepSleepNet) for validation purposes.
Results: When pediatric PSG was classified by XSleepNet2 trained exclusively on pediatric PSG, the overall accuracy was 88.9%, dropping to 78.9% when staged by a system trained exclusively on adult PSG. Errors made by the systems when staging PSG of older adults were comparatively smaller. However, all systems produced significant errors in clinical markers when individual PSG were considered. Results obtained with DeepSleepNet showed similar patterns.
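The point about individual recordings can be made concrete: overall accuracy is a pooled statistic, so agreement should also be checked per PSG. A hedged sketch of per-recording epoch accuracy and Cohen's kappa between a reference and a predicted hypnogram (illustrative names, not the paper's evaluation code):

```python
# Sketch: per-recording agreement between reference and predicted hypnograms.
from collections import Counter

def epoch_accuracy(ref, pred):
    """Fraction of epochs scored identically."""
    assert len(ref) == len(pred)
    return sum(r == p for r, p in zip(ref, pred)) / len(ref)

def cohens_kappa(ref, pred):
    """Chance-corrected agreement over sleep-stage labels."""
    n = len(ref)
    po = epoch_accuracy(ref, pred)                        # observed agreement
    cr, cp = Counter(ref), Counter(pred)
    pe = sum(cr[k] * cp.get(k, 0) for k in cr) / (n * n)  # chance agreement
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

# Toy 10-epoch example (stage codes 0..4)
ref  = [0, 1, 2, 2, 3, 3, 4, 4, 2, 0]
pred = [0, 2, 2, 2, 3, 2, 4, 4, 2, 0]
print(epoch_accuracy(ref, pred))  # 0.8
print(cohens_kappa(ref, pred))    # ~0.733
```

Computing such statistics per recording, rather than pooled over all epochs, exposes individual PSG on which a stager with high overall accuracy still fails badly.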
Conclusion: Underrepresentation of age groups, in particular children, can significantly lower the performance of automated deep-learning sleep stagers. In general, automated sleep stagers may behave unexpectedly, limiting their clinical use. Future evaluation of automated systems must pay attention to PSG-level performance as well as overall accuracy.
Guest Blogger: Mathias Baumert
Paper Authors: Mathias Baumert (a), Simon Hartmann (b), Huy Phan (c, 1)

(a) Discipline of Biomedical Engineering, School of Electrical and Mechanical Engineering, The University of Adelaide, Adelaide, Australia
(b) Adelaide Medical School, The University of Adelaide, Adelaide, Australia
(c) Amazon Alexa, Cambridge, MA, 02142, United States
(1) The work was done while H. Phan was at the School of Electronic Engineering and Computer Science, Queen Mary University of London, UK, and the Alan Turing Institute, UK, prior to joining Amazon.