The American Academy of Sleep Medicine (AASM) guidelines are the result of decades of effort aimed at standardizing sleep scoring procedures, with the final goal of sharing a common methodology worldwide. The guidelines cover several aspects, from technical and digital specifications (e.g., recommended EEG derivations) to detailed, age-specific sleep scoring rules.
Automated sleep scoring systems, e.g., high-performing deep learning (DL) algorithms, have always relied heavily on these standards as fundamental guidelines.
In this study, we wanted to show that a DL-based sleep scoring algorithm may not need to fully exploit clinical knowledge or strictly adhere to the AASM guidelines.
We ran several experiments to evaluate the resilience of an existing DL-based algorithm against the AASM guidelines. In particular, we focused on the following questions:
i. Can a sleep scoring algorithm successfully encode sleep patterns from clinically non-recommended or non-conventional electrode derivations?
ii. Can a large dataset from a single sleep center contain enough heterogeneity (i.e., different demographic groups, different sleep disorders) to allow the algorithm to generalize to multiple data centers?
iii. When we train an algorithm on a dataset of subjects spanning a large age range, should we exploit the information about their age by conditioning the training of the model on it?
We ran all of our experiments on U-Sleep, a state-of-the-art sleep scoring architecture recently proposed by Perslev et al.
Important Note: In the original implementation of U-Sleep, we found a very interesting bug: the data sampling procedure was not extracting the channel derivations recommended in the AASM guidelines, contrary to what is stated in Perslev et al. Instead, atypical or non-conventional channel derivations were randomly extracted. This insight triggered the first of the questions above.
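To make the difference concrete, here is a minimal sketch of the two sampling behaviours: drawing only from the AASM-recommended derivations versus drawing from every possible electrode pair. This is not the U-Sleep code. The electrode set and the function names (all_derivations, sample_derivation) are assumptions made up for illustration.

```python
import random

# Illustrative electrode set (an assumption); real montages differ across datasets.
EEG_ELECTRODES = ["F3", "F4", "C3", "C4", "O1", "O2"]
MASTOIDS = ["M1", "M2"]

# AASM-recommended EEG derivations and their backups:
# each electrode is referenced to the contralateral mastoid.
AASM_DERIVATIONS = [
    ("F4", "M1"), ("C4", "M1"), ("O2", "M1"),
    ("F3", "M2"), ("C3", "M2"), ("O1", "M2"),
]


def all_derivations():
    """Every EEG electrode referenced to any other channel, conventional or not."""
    channels = EEG_ELECTRODES + MASTOIDS
    return [(active, ref) for active in EEG_ELECTRODES for ref in channels if ref != active]


def sample_derivation(rng, recommended_only):
    """Draw one derivation, either from the AASM list or from the unrestricted pool."""
    pool = AASM_DERIVATIONS if recommended_only else all_derivations()
    return rng.choice(pool)


rng = random.Random(0)
print("AASM-only draw:   ", sample_derivation(rng, recommended_only=True))
print("Unrestricted draw:", sample_derivation(rng, recommended_only=False))
```

In this toy pool only 6 of the 42 possible derivations are on the AASM list, so unrestricted sampling mostly yields non-conventional pairs.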
We trained and evaluated U-Sleep on 19578 recordings from 15322 subjects across 12 publicly available datasets, hosted on the National Sleep Research Resource (NSRR) and on other external sources such as PhysioNet. Specifically, among the NSRR data, we used the following: Apnea, Bariatric surgery, and CPAP database (ABC); Cleveland Children’s Sleep and Health Study (CCSHS); Cleveland Family Study (CFS); Childhood Adenotonsillectomy Trial database (CHAT); Home Positive Airway Pressure database (HPAP); Multi-Ethnic Study of Atherosclerosis (MESA); MrOS Sleep Study (MROS); Sleep Heart Health Study (SHHS); Study of Osteoporotic Fractures (SOF).
We also used a private dataset, the Bern Sleep Data Base (BSDB) registry, which is the sleep disorder patient cohort of the Inselspital, University Hospital Bern. The recordings were collected from 2000 to 2021 at the Department of Neurology of the University Hospital Bern. Secondary usage was approved by the cantonal ethics committee (KEK-Nr. 2020-01094). The dataset consists of 8950 recordings from patients and healthy subjects aged 0–91 years. In our experiments we used 8884 of these recordings and excluded the remainder because of low signal quality.
Across all our experiments we used a total of 28528 polysomnography studies from 13 different clinical studies. To our knowledge, this is the largest study on the automatic sleep scoring task in terms of both the number of polysomnography recordings and the diversity of patient clinical pathology and age spectrum. An overview of the BSDB and the open access (OA) datasets, along with demographic statistics, is reported in Table 1.
(i) We demonstrated that a DL sleep scoring algorithm can still solve the scoring task with high performance, even when trained with clinically non-conventional channel derivations.
(ii) We showed that a DL sleep scoring model, even if trained on a large and heterogeneous dataset from a single sleep center (i.e., our private BSDB dataset), fails to generalize to new recordings from different data centers.
(iii) We showed that conditional training based on the chronological age of the subjects does not improve the performance of a DL sleep scoring architecture.
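As an illustration of what conditioning on age can mean in finding (iii), the sketch below appends a one-hot age-group code to a feature vector. This is a generic approach and not necessarily the mechanism used in the study. The age boundaries and the function names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical age-group upper bounds in years; the study's grouping may differ.
AGE_UPPER_BOUNDS = [13, 18, 65]


def age_group_index(age_years):
    """Map an age to a group index: 0 for the youngest group, 3 for the oldest."""
    return bisect_right(AGE_UPPER_BOUNDS, age_years)


def one_hot(index, size):
    """Return a list of `size` floats with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def condition_on_age(features, age_years):
    """Append the one-hot age-group code to a feature vector."""
    n_groups = len(AGE_UPPER_BOUNDS) + 1
    return list(features) + one_hot(age_group_index(age_years), n_groups)


# Toy epoch features (made up) for a 7-year-old and a 70-year-old subject.
print(condition_on_age([0.2, 0.5], age_years=7))
print(condition_on_age([0.2, 0.5], age_years=70))
```

An unconditioned model would receive only the two toy features, while the conditioned one also receives the four-element age code.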
Considering previous findings and our present results, the strong resilience and generalization capability of a DL-based architecture are undeniable. DL algorithms are reaching better performance than feature-based approaches. DL is able to extract feature representations that are extremely useful for generalizing across datasets from different sleep data centers. These hidden feature representations seem to better decode the unconscious analytical evaluation process of the human scorer.
To conclude, with the AASM guidelines so widely criticized, sleep labels so noisy (e.g., high inter- and intra-scorer variability), and sleep so complex, could an unsupervised DL-based sleep scoring algorithm, one that does not need to learn from labels, be the solution?
The code we used in our study is based on the code previously developed by Perslev et al., which is publicly available on GitHub. All our experiments were carried out using the branch U-Time/tree/usleep-paper-version. Following feedback from the whole community, and in particular our report on the use of atypical and clinically non-recommended derivations, the authors now provide the bug-fixed code in U-Time/tree/usleep-paper-version-branch-bugfixes.
Publication in npj Digital Medicine (Nature).
Related Perslev et al. publication in npj Digital Medicine (Nature).
In this paper, we demonstrated the resilience of a DL network when trained on a large and heterogeneous dataset. We focused on the three most significant influencing factors: channel derivation selection, the need for multi-center heterogeneity, and age-conditioned fine-tuning.
Channel derivations carry complementary information, and a DL-based model proved resilient enough to extract sleep patterns even from atypical and clinically non-recommended derivations. We show that the variability among different sleep data centers (e.g., hardware, subjective interpretation of the scoring rules) needs to be taken into account more than the variability inside a single sleep center.
A large database such as the BSDB (sleep disorder patient cohort of the Inselspital, with patients covering the full spectrum of sleep disorders) does not have enough heterogeneity to strengthen the performance of the DL-based model on unseen data centers.
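The single-center setup behind this finding can be sketched as a simple data split: train on recordings from one center and keep every other center as a separate held-out test set. The snippet is only an illustration of that protocol. The recording identifiers and the function name split_by_center are hypothetical.

```python
from collections import defaultdict


def split_by_center(recordings, train_center):
    """Split (recording_id, center) pairs into a training list and per-center test lists."""
    train_ids = []
    test_ids = defaultdict(list)
    for recording_id, center in recordings:
        if center == train_center:
            train_ids.append(recording_id)
        else:
            test_ids[center].append(recording_id)
    return train_ids, dict(test_ids)


# Hypothetical recording identifiers, for illustration only.
recordings = [
    ("bsdb_0001", "BSDB"), ("bsdb_0002", "BSDB"),
    ("shhs_0001", "SHHS"), ("mesa_0001", "MESA"), ("mesa_0002", "MESA"),
]

train_ids, test_ids = split_by_center(recordings, train_center="BSDB")
print("train:", train_ids)
for center, ids in test_ids.items():
    print("held-out center", center, "->", ids)
```

Reporting performance per held-out center, rather than pooled, makes it visible when a model trained on one center degrades on others.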
Lastly, we show that a state-of-the-art DL network is able to deal with different age groups simultaneously, reducing the need to add chronological age-related information during training. In summary, what seems to be essential for visual scoring (e.g., specific channel derivations, or specific scoring rules that also consider the age of the individuals) is not necessary for the DL-based automatic procedure, which follows other analysis principles.
Fiorillo L, Monachino G, van der Meer J, Pesce M, Warncke JD, Schmidt MH, Bassetti CL, Tzovara A, Favaro P, Faraci FD. U-Sleep’s resilience to AASM guidelines. NPJ digital medicine. 2023 Mar 6;6(1):33. https://doi.org/10.1038/s41746-023-00784-0
Guest Blogger: Dr. Luigi Fiorillo, Institute of Digital Technologies for Personalized Healthcare ∣ MeDiTech, Department of Innovative Technologies, University of Applied Sciences and Arts of Southern Switzerland, Lugano, Switzerland