Automatic sleep staging for the young and the old – Evaluating age bias in deep learning

Overview

AI for automated sleep staging is considered mature and has found its way into commercial sleep evaluation systems. Systems underpinned by deep learning require large datasets providing a broad sample of the sleep stages the machine seeks to ‘learn’. Of concern, most databases used for developing AI sleep stagers include only recordings from adults, creating a significant inherent sample bias when the systems are applied in the pediatric or geriatric sleep setting. Here we evaluated scoring errors created by age-biased AI sleep stagers.

What was the approach to solving the problem?

We trained state-of-the-art deep learning sleep stagers separately on pediatric, adult, and older adult sleep EEG, as well as on a combination of all groups. We then evaluated each system's sleep stage classification performance on pediatric and older adult EEG using a set of commonly used sleep metrics.
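The design described above amounts to a small train/test grid. The following is only a rough sketch of that grid: `train_fn` and `score_fn` are hypothetical placeholders for whatever training and scoring code is used, and the cohort labels (P, A, O, PAO) follow the abbreviations used in the paper summary. It does not reproduce XSleepNet2 or the study's actual pipeline.

```python
from typing import Callable, Dict, Tuple

# Training sets, labelled as in the paper summary:
# P = pediatric, A = adult, O = older adult, PAO = all three combined.
TRAIN_SETS: Dict[str, Tuple[str, ...]] = {
    "P": ("P",),
    "A": ("A",),
    "O": ("O",),
    "PAO": ("P", "A", "O"),
}

# Each trained system is evaluated on pediatric and older adult recordings.
TEST_COHORTS: Tuple[str, ...] = ("P", "O")


def evaluation_grid(
    train_fn: Callable[[Tuple[str, ...]], object],
    score_fn: Callable[[object, str], float],
) -> Dict[Tuple[str, str], float]:
    """Train one model per training set and score it on every test cohort.

    train_fn and score_fn are hypothetical stand-ins supplied by the caller.
    """
    results: Dict[Tuple[str, str], float] = {}
    for train_name, cohorts in TRAIN_SETS.items():
        model = train_fn(cohorts)
        for test_cohort in TEST_COHORTS:
            results[(train_name, test_cohort)] = score_fn(model, test_cohort)
    return results
```

The result is keyed by (training set, test cohort), so an entry such as `("A", "P")` would hold the score of the adult-trained system on pediatric recordings.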

What NSRR data were used?

The experimental data comprised 7777 overnight PSG recordings with at least 5 h of valid EEG data pooled from:

Childhood Adenotonsillectomy Trial (CHAT)
Cleveland Family Study (CFS)
Multi-Ethnic Study of Atherosclerosis (MESA)
Osteoporotic Fractures in Men Study (MrOS Sleep)
Cleveland Children's Sleep and Health Study (CCSHS)
Study of Osteoporotic Fractures

What were the results?

When pediatric sleep EEG was staged by the system exclusively trained on pediatric data, the overall accuracy was 88.9%. But accuracy significantly dropped when sleep staging was performed by a system solely trained on adult data (78.9%). Using the system trained on mixed cohort data, the overall accuracy was comparable to the pediatric sleep stager (87.5%). Overall performance for older adults was comparable between systems. Notably, significant errors in individuals’ sleep metrics occurred despite good overall performance.

What were the conclusions and implications of this work?

Age bias in the training sample of deep learning sleep stagers, particularly the lack of pediatric data, can reduce the overall classification performance. The problem can be effectively rectified by adding sleep EEG from children. Aside from statistically evaluating the accuracy of sleep stage prediction, AI systems should be assessed against clinically used metrics derived from the hypnogram.
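To make the last point concrete, here is a minimal sketch of how a few hypnogram-derived metrics could be computed and compared between a reference and a predicted hypnogram. It rests on assumptions that are not stated in this post: 30-second epochs, the stage labels W, N1, N2, N3, and REM, and time in bed taken as the full length of the hypnogram. The function names are illustrative only, and the study's actual metric set may differ.

```python
from typing import Dict, Sequence

EPOCH_MIN = 0.5  # assumed 30-second scoring epochs
SLEEP_STAGES = ("N1", "N2", "N3", "REM")  # assumed stage labels; "W" is wake


def sleep_metrics(hypnogram: Sequence[str]) -> Dict[str, float]:
    """Derive simple summary metrics from a per-epoch list of stage labels."""
    if len(hypnogram) == 0:
        raise ValueError("hypnogram must contain at least one epoch")
    minutes = {
        stage: sum(1 for label in hypnogram if label == stage) * EPOCH_MIN
        for stage in SLEEP_STAGES
    }
    total_sleep_time = sum(minutes.values())
    time_in_bed = len(hypnogram) * EPOCH_MIN  # assumption: whole recording
    metrics = {
        "tst_min": total_sleep_time,
        "sleep_efficiency_pct": 100.0 * total_sleep_time / time_in_bed,
    }
    for stage in SLEEP_STAGES:
        metrics[f"{stage.lower()}_min"] = minutes[stage]
    return metrics


def metric_errors(
    reference: Sequence[str], predicted: Sequence[str]
) -> Dict[str, float]:
    """Return predicted minus reference for every metric."""
    ref = sleep_metrics(reference)
    pred = sleep_metrics(predicted)
    return {name: pred[name] - ref[name] for name in ref}


# Toy example: 90 of 100 epochs agree, yet REM time is halved.
reference = ["W"] * 20 + ["N2"] * 60 + ["REM"] * 20
predicted = ["W"] * 20 + ["N2"] * 70 + ["REM"] * 10
errors = metric_errors(reference, predicted)
```

In this made-up example, total sleep time and sleep efficiency are unchanged while `errors["rem_min"]` is -5.0 against a reference of 10 minutes, which illustrates how a high epoch-level agreement can coexist with a large error in a single clinical marker.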

Are there any tools available?

We used two deep-learning-based sleep stagers, XSleepNet2 and DeepSleepNet, in our study.

Resources/multimedia

https://doi.org/10.1016/j.sleep.2023.04.002
https://doi.org/10.1109/TBME.2022.3174680

Twitter: @MathiasBaumert @pquochuy

LinkedIn: Mathias Baumert, Huy Phan

Paper Summary

Background: Various deep-learning systems have been proposed for automated sleep staging. Still, the significance of age-specific underrepresentation in training data and the resulting errors in clinically used sleep metrics are unknown.

Methods: We adopted XSleepNet2, a deep neural network for automated sleep staging, to train and test models using polysomnograms of 1232 children (7.0 ± 1.4 years), 3757 adults (56.9 ± 19.4 years), and 2788 older adults (80.7 ± 4.2 years). We developed four separate sleep stage classifiers using exclusively pediatric (P), adult (A), or older adult (O) PSG, as well as PSG from the mixed cohorts: pediatric, adult, and older adult (PAO). Results were compared against an alternative sleep stager (DeepSleepNet) for validation purposes.

Results: When pediatric PSG was classified by XSleepNet2 exclusively trained on pediatric PSG, the overall accuracy was 88.9%, dropping to 78.9% when subjected to a system trained exclusively on adult PSG. Errors made by the systems when staging PSG of older people were comparatively lower. However, all systems produced significant errors in clinical markers when considering individual PSG. Results obtained with DeepSleepNet showed similar patterns.

Conclusion: Underrepresentation of age groups, in particular children, can significantly lower the performance of automatic deep-learning sleep stagers. In general, automated sleep stagers may behave unexpectedly, limiting clinical use. Future evaluation of automated systems must pay attention to PSG-level performance as well as overall accuracy.
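The difference between overall accuracy and PSG-level performance can be sketched as follows. This is an illustrative example under simple assumptions, not the evaluation code used in the paper: each recording is represented as a pair of equal-length stage-label sequences (reference, predicted), and the function names are made up for this sketch.

```python
from typing import List, Sequence, Tuple

# One recording = (reference stage labels, predicted stage labels).
Recording = Tuple[Sequence[str], Sequence[str]]


def count_correct(reference: Sequence[str], predicted: Sequence[str]) -> int:
    """Number of epochs on which the two hypnograms agree."""
    if len(reference) != len(predicted) or len(reference) == 0:
        raise ValueError("hypnograms must be non-empty and of equal length")
    return sum(1 for r, p in zip(reference, predicted) if r == p)


def pooled_and_per_recording_accuracy(
    recordings: Sequence[Recording],
) -> Tuple[float, List[float]]:
    """Return accuracy pooled over all epochs and accuracy per recording."""
    correct = [count_correct(ref, pred) for ref, pred in recordings]
    lengths = [len(ref) for ref, _ in recordings]
    pooled = sum(correct) / sum(lengths)
    per_recording = [c / n for c, n in zip(correct, lengths)]
    return pooled, per_recording


# Toy data: one long, perfectly scored recording and one short, poorly scored one.
toy = [
    (["W"] * 90, ["W"] * 90),
    (["N2"] * 10, ["N1"] * 5 + ["N2"] * 5),
]
pooled, per_recording = pooled_and_per_recording_accuracy(toy)
```

With this toy data the pooled accuracy is 0.95 while the second recording reaches only 0.5, so a summary figure alone would hide the poorly staged individual.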

Guest Blogger: Mathias Baumert^a

Paper Authors: Mathias Baumert^a, Simon Hartmann^b, Huy Phan^c,1

^a Discipline of Biomedical Engineering, School of Electrical and Mechanical Engineering, The University of Adelaide, Adelaide, Australia
^b Adelaide Medical School, The University of Adelaide, Adelaide, Australia
^c Amazon Alexa, Cambridge, MA, 02142, United States
¹ The work was done when H. Phan was at the School of Electronic Engineering and Computer Science, Queen Mary University of London, UK and the Alan Turing Institute, UK and prior to joining Amazon.

By szhivotovsky on May 16, 2023 in Guest Blogger
