Dear all,
I want your opinion regarding the comparison of sleep parameters calculated by LUNA, as I don't have much experience in multi-center research.
Now, we can compare sleep data from multiple facilities thanks to the efforts of the NSRR team. What I want to know is whether there are problems in directly comparing data acquired with different PSG systems and under different conditions.
Putting aside inter-rater variability in sleep scoring, should I avoid directly comparing the results of PSG analysis (e.g., frequency analysis, or sleep microarchitecture parameters such as slow oscillations (SO) and spindles) using data from different facilities? Are there any guidelines or recommended methods that would enable a direct comparison?
Many thanks,
A good but also generally applicable question, and there isn't really a single answer. IMO, as far as any physiological measurement goes, there probably aren't grounds for expecting PSG data to be exceptionally difficult in this regard - e.g. perhaps versus more experimental/task-based paradigms, although of course the devil is in the details. If pushed, I'd say that most times -- let's say 80% -- metrics will be broadly comparable across cohorts, such that cohorts can be (statistically) combined in analysis. Still, that leaves a non-trivial chance of issues arising (depending on the particular datasets and analyses) that could bite you...
A few general off-the-top-of-the-head thoughts and approximations to best practice (here thinking primarily about the sleep EEG, which I admit might be more directly comparable than some other channels, e.g. position sensors).
In favour of combining:
if 'cohort' can be included as a covariate in subsequent analyses, one is probably less worried about exact scale dependencies / systematic biases driven by cohort-specific factors
note that there can often be equally pervasive differences within individual cohorts, which are often multi-site studies themselves, i.e. probably all analyses should be approached with a similarly skeptical mindset, whether cross-cohort or not
FWIW, in our own work with NSRR cohorts, we've been able to perform multi-cohort analyses that have shown broadly comparable results across cohorts, e.g. https://pubmed.ncbi.nlm.nih.gov/33199858/ https://pubmed.ncbi.nlm.nih.gov/28649997/ https://www.eneuro.org/content/9/5/ENEURO.0094-22.2022
should the prospect of potential bias necessarily preclude any analysis? Probably not, if one can also find other triangulating approaches (e.g. replication in different datasets, using different assumptions, etc.) and appropriately report caveats, etc.
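To make the 'cohort as covariate' point above concrete, here is a purely illustrative sketch (all variable names and effect sizes are made up, and the outcome is just a simulated spindle-density-like metric): if a cohort difference is a fixed technical offset, including a cohort indicator in a pooled regression absorbs that offset and leaves the estimate for the predictor of interest essentially unbiased.

```python
# Illustrative sketch only: 'cohort' as a covariate in a pooled analysis.
# Simulated data; not Luna output. Plain numpy least squares.
import numpy as np

rng = np.random.default_rng(42)
n = 200
cohort = np.repeat([0, 1], n)              # indicator for two cohorts
age = rng.uniform(40, 80, size=2 * n)      # predictor of interest
# Outcome: true age effect of -0.05, plus a purely technical +1.0
# offset in cohort 1 (e.g. hardware/pre-processing differences).
y = 5.0 - 0.05 * age + 1.0 * cohort + rng.normal(0, 0.5, size=2 * n)

# Pooled model WITH the cohort indicator: intercept, age, cohort.
X = np.column_stack([np.ones(2 * n), age, cohort])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# The cohort term soaks up the technical offset, so the age
# coefficient stays close to its true value of -0.05.
print(beta)
```

Of course, this only helps with additive, cohort-constant biases; if a technical factor interacts with the predictor of interest, a single indicator won't fix it, which is part of why replication and sensitivity analyses still matter.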
Could go either way:
some cohorts will be more similar than others - e.g. MrOS and SOF had similar protocols and hardware (and investigators), I believe, and so will presumably be intrinsically better matched
some metrics may be more susceptible to cross-cohort effects than others, although it is hard to make general rules about this, as it typically entails an "all other things being equal" assumption
the issue is certainly not specific to Luna-derived parameters (nor to NSRR datasets, for that matter). For example, sleep duration estimates based on manual staging can often show differences between cohorts that aren't obviously driven by demographic or clinical factors... the same principles as above apply, i.e. sensitivity analyses, replication, orthogonal methodological approaches, etc.
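As one simple example of the kind of sensitivity analysis mentioned above (a sketch, not a recommendation for any particular dataset): re-running a pooled analysis after standardizing each metric within cohort discards absolute-scale differences between cohorts, so agreement between raw and standardized results is reassuring, while disagreement flags scale-dependent cohort effects. The helper function and data below are hypothetical.

```python
# Illustrative sketch: within-cohort standardization as a sensitivity
# analysis for cross-cohort scale/offset differences. Simulated data.
import numpy as np

def zscore_within_cohort(values, cohort):
    """Standardize a metric separately within each cohort (mean 0, SD 1)."""
    values = np.asarray(values, dtype=float)
    cohort = np.asarray(cohort)
    out = np.empty_like(values)
    for c in np.unique(cohort):
        m = cohort == c
        out[m] = (values[m] - values[m].mean()) / values[m].std()
    return out

rng = np.random.default_rng(1)
# The 'same' metric recorded with different offsets/scales per cohort:
raw = np.concatenate([rng.normal(10, 2, 100), rng.normal(14, 4, 100)])
coh = np.repeat(["A", "B"], 100)
z = zscore_within_cohort(raw, coh)
# After standardization, each cohort has mean ~0 and SD ~1, so pooled
# analyses no longer depend on cohort-specific absolute scales.
```

The obvious trade-off is that this also removes any real between-cohort mean differences, so it is a sensitivity check rather than a replacement for the raw-scale analysis.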
Against:
direct comparisons between cohorts are expected to be biased. Even in the MrOS/SOF context, comparing those two cohorts directly as a proxy for sex differences (i.e. MrOS is a male cohort, SOF is a female cohort) is likely to be biased, since any technical artifact is completely confounded with the exposure of interest
cohorts are often likely to differ in (subtle or not-so-subtle) substantive ways due to ascertainment criteria, etc., as well as in technical factors due to PSG hardware / pre-processing, etc. This makes it difficult to determine, even in principle, whether two cohorts are comparable, if one doesn't expect equivalent values for a given metric in the first place (conditional on some set of baseline, e.g. demographic, covariates)
Cheers, --Shaun