links:
- [SHED - bookends link](bookends://sonnysoftware.com/ref/DL/235107)
- [online](https://emj.bmj.com/content/early/2024/09/12/emermed-2024-214068)
- [Spotify podcast](https://open.spotify.com/episode/3xmtOUS5qRBILRWydtDVtw?si=BJK2Qf5LS9CWQqwag7qthA&t=51)

***Related trials (by subject):***
```dataview
LIST FROM "Trials"
SORT file.name ASC
WHERE any(contains(subject, this.subject)) AND file.name != this.file.name
```

tags: #incomplete

> [!info] Overview
> - 

# Subarachnoid haemorrhage in the emergency department (SHED): a prospective, observational, multicentre cohort study

**Journal** - [[EMJ]]
**Authors** - Trainee Emergency Research Network (TERN)
**Year** - 2024

# Clinical Question
## PICO
- **population / problem**
- **intervention / treatment**
- **Comparison**
- **Outcome**

# Background

# What this paper adds to the body of knowledge

# What they did
## design
- eg retrospective case controlled
- randomisation
- follow up
- power analysis
- type of analysis (eg intention to treat)
- setting
- blinding
- definitions
- inclusion
- exclusion

# Results
![[Pasted image 20250718133827.png]]

# Discussion
*Strengths:*

*Limitations:*

# Conclusions
## author conclusion

## my conclusion

### Statistics explanation
When this paper was being discussed in our groupchat, someone asked the following question: "My biostats is very much undergraduate level, but doesn't post test probability depend a lot on prevalence? Makes applying the rule a bit difficult, no?"

***My reply:***

Well I'm post nights and I find this subject interesting, so excuse the long reply, but without wading too much into a journal club critique of the study, here are my thoughts on the statistics question you raised. I was going to reply in the group but then decided I didn't really want to blow up that group, so I just DM'd you instead.

Yes, the use of pre- and post-test probability has been gradually gaining ground, but in spite of a lot of early enthusiasm for the concept (via [[Statistics and research methods overview#Likelihood ratio nomogram (Fagan)|Fagan nomograms]] for things like d-dimers), it doesn't seem to be fully in vogue yet. These researchers actually weigh in on this explicitly, but I'll loop back to that after first discussing the rest of the undergrad statistics of the graph.

A classic undergrad statistics multiple choice question is "which statistical test is affected by *prevalence*," and the answer is usually either the positive or negative [[Statistics and research methods overview#sensitivity/specificity, PPV, likelihood ratio|predictive value]]. In the case of CTB for SAH exclusion, the NPV is the probability that the patient doesn't have SAH, given a negative CTB. SAH is clearly a low-prevalence disease among people presenting with headache: somewhere between 5 and 10% (6.5% overall) in this study population. So **even a fairly bad test is going to have a fairly good negative predictive value for a low-prevalence disease**, since a large percentage of patients don't have the disease and will therefore have "true negative" results. Eg a test with only 50% sensitivity — a coin toss — for a disease that occurs in 1/1000 people will still have a high negative predictive value, because 999 of the 1000 people don't have the disease, so almost every negative result is a true negative. This is an intrinsic problem with using NPV (and PPV) to evaluate the performance of a test: it says more about the disease prevalence in the population than it does about the test's performance.
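To make that prevalence-dependence concrete, here is a minimal sketch (not from the paper: the `npv` helper, the 50%/50% "coin toss" test characteristics and the prevalence values are my own illustrative assumptions) of how NPV tracks prevalence for a fixed, mediocre test:

```python
# Illustrative sketch only (not from the paper): NPV as a function of
# prevalence for a deliberately useless test. The 50%/50% "coin toss"
# characteristics and the prevalence values are assumptions for illustration.

def npv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Negative predictive value: P(no disease | negative test)."""
    true_negatives = specificity * (1 - prevalence)
    false_negatives = (1 - sensitivity) * prevalence
    return true_negatives / (true_negatives + false_negatives)

for prevalence in (0.001, 0.065, 0.131):
    print(f"prevalence {prevalence:.1%} -> NPV {npv(0.5, 0.5, prevalence):.1%}")

# prevalence 0.1% -> NPV 99.9%
# prevalence 6.5% -> NPV 93.5%
# prevalence 13.1% -> NPV 86.9%
```

For a coin-toss test the NPV works out to exactly 1 minus the prevalence, which is precisely the sense in which NPV describes the population more than it describes the test.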
Another example: if you use weight gain of 1 kg over 1 month as a test for pregnancy, the PPV at the Royal Women's Hospital would be a lot higher than at the RMH (a hospital that mostly does not admit pregnant people). That is an absurd screening test for pregnancy, but it demonstrates the hazards of conflating predictive value with test performance, and the importance of considering the population in question.

Sensitivity, on the other hand, is a measure of the performance of the test itself: the probability of a positive CTB, given that the patient has SAH (by whatever the gold standard is). In the above chart, we see the confidence intervals for sensitivity widening at longer times since headache onset. This doesn't mean the test is worse there per se; it is because there were fewer patients in those timeframe categories: 205 patients for 18-24 hours vs 772 for 0-6 hours. By the authors' own admission (whether this was a prospective decision or not I'm not sure), the primary outcome was performance at 0-6 hours. So it's a bit confusing, but essentially a 95% sensitivity means that for everyone who actually has an SAH, 95/100 of the scans will pick it up. It does not follow that 5% of all CTs will miss SAH, since CTs are performed on everyone, not just those with SAH. This is why the NPV is so much higher than the sensitivity. It is also why many are becoming wary of using sensitivity to evaluate test performance: it isn't very helpful for telling you how to interpret the meaning of a result for your patient.

So we see some interpretive issues with both predictive values and sensitivity/specificity on their own merits. Enter likelihood ratios. These are supposed to help with interpreting test results by comparing the likelihood that a patient with a given result has the disorder against the likelihood that the same result would be seen without the disorder. For instance, you can calculate the negative likelihood ratio as the false negative rate (1 minus sensitivity) divided by the specificity, which essentially gives you a ratio of the false negative rate to the true negative rate. In a vacuum, that value is very abstract, but it can be used in combination with a pre-test probability to quantify a post-test probability. This is what the authors of this study did.

The pre-test probability is the controversial part. In this case it is derived from the prevalence of the disease, converted to odds, multiplied by the likelihood ratio, and then converted back into a (post-test) probability. In this study, the pre-test probability is the prevalence of SAH in each presentation grouping, eg 13.1% in those presenting at 0-6 hours. And the sensitivity is fairly high, though not perfect: 97%. So when you push that prevalence through the likelihood ratio, you find that the post-test probability of actually having an SAH after a negative CT is very low, like 1/500 or so, which is much lower than what one might assume just from seeing that the sensitivity is 97% and concluding there is a 3% chance of missing an SAH.

The problem is: is the derived prevalence actually a useful scalar for the pre-test probability of a particular patient? It is one thing if you are designing a screening test for a general population, eg faecal occult blood, where the prevalence of bowel cancer really IS the pre-test probability. But a diagnostic test is different: not ALL headache patients have an equal pre-test probability of having an SAH.
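To make the arithmetic explicit, here is a minimal sketch (not the authors' code: the `post_test_probability` helper, the near-perfect specificity and the 1%/40% pre-test probabilities are my own assumptions) of the probability-to-odds, odds-times-likelihood-ratio conversion, and of how strongly the answer depends on the pre-test probability you feed in:

```python
# Minimal sketch (not the authors' code) of the pre-test odds x likelihood
# ratio -> post-test probability arithmetic. The sensitivity is the rounded
# point estimate quoted above; the specificity is my assumption (CT treated
# as near-perfectly specific), so the outputs are ballpark figures only.

def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Convert probability to odds, scale by the LR, convert back to probability."""
    pre_test_odds = pre_test_prob / (1 - pre_test_prob)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)

sensitivity = 0.97
specificity = 0.999                               # assumed, not taken from the paper
negative_lr = (1 - sensitivity) / specificity     # roughly 0.03

# 13.1% is the 0-6 hour prevalence quoted above; 1% and 40% are hypothetical
# low- and high-risk pre-test probabilities to show how much the answer moves.
for pre_test in (0.01, 0.131, 0.40):
    post = post_test_probability(pre_test, negative_lr)
    print(f"pre-test {pre_test:.1%} -> post-test after negative CT {post:.2%}")

# pre-test 1.0% -> post-test after negative CT 0.03%
# pre-test 13.1% -> post-test after negative CT 0.45%
# pre-test 40.0% -> post-test after negative CT 1.96%
```

With these rounded inputs the 13.1% row lands in the same ballpark as, but not exactly at, the 1/500 figure above; the exact value depends on the precise sensitivity and specificity point estimates. The Wells/d-dimer example in the next paragraph is the same machinery in everyday practice, just with a different test and explicitly stratified pre-test probabilities.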
As an example of a paradigm we use daily, consider the PE Wells categories. There is a certain overall prevalence of PE, and a corresponding probability that any given patient with chest pain or SOB has a PE, but we use Wells to quantise patients into three discrete risk groups with differing pre-test probabilities. So, for example, a low-risk Wells patient (pre-test probability 1.3%) has a post-test probability of about 1/1000 after a negative d-dimer, whereas a high-risk Wells patient (37%) has a post-test probability of about 3% with a negative d-dimer. Still pretty good IMHO, but that 1/33 risk is still deemed too high a miss rate.

So if you have an obtunded, meningitic 45 year old with a thunderclap headache that came on during exertion and a negative CTB at 12 hours, I'd argue that their pre-test probability is phenomenally higher than that of a 23 year old with a rapid-onset headache who looks well at 12 hours. But that Bayesian reality is not taken into account by a unidimensional "post-test" probability that uses only prevalence and the likelihood ratio to generate the pre-test probability. (It may well be that most of those sick-appearing headache patients will actually have positive CT scans, but these data don't really empower us to draw that conclusion.)

To the researchers' credit, they do consider this issue around pre-test probability, and in the podcast linked above they discuss alternative pre-test probability metrics, eg using Ottawa. But Ottawa isn't really supposed to be a two-tiered probability test, nor is it validated for this use, nor did they really investigate this.

Finally, it is worth noting that these results echo several other findings about CT sensitivity past 6 hours (using modern [[Multisclice CT for SAH|Multislice CT]]), including a smaller single-centre NZ study.

I won't go into a full journal club critique of the paper, but I do want to note that I get the impression the researchers are trying to contribute some reasonable data to the discussion, rather than to propose a new decision "rule." They invoke the phrase "shared decision making," which was introduced several years ago when the debate about the utility of LP for SAH exclusion first surfaced.

On the desire for useful clinical decision tools around SAH, I often reflect that many of us have grown up with MDCalc and HEART scores and all sorts of clinical decision aids to cite that drive a lot of our practice, and without being able to cite "Ottawa negative" or "SHED negative," many clinicians will not feel confident making clinical decisions that are not codified practice standards.

I feel that although the "shared decision" model for SAH exclusion is well-intentioned, it is hard to imagine it catching on in a defensive, citation-based culture, especially when it is quite rare that patients can understand these statistics deeply enough to actually join in an informed, "shared" decision. The authors encourage upfront recognition that these decisions are grey areas grounded in probabilities, and to that end I certainly agree and endeavour to educate patients on this reality of medicine ("uncertainty about the future is a great pleasure and burden of the human experience" is a very hit-and-mostly-miss line). My personal experience with the SAH and LP "shared decision" discussions is that patients generally don't want to share a decision; they just want an expert to take on the mantle of care and use our expertise to help them.
To that end, although developing a set of individual practice standards is one of the great pleasures of training in medicine, I am sceptical that the LP-or-not decision will become dramatically easier until there is a decision rule with a pronounceable acronym published on MDCalc. Although I could be wrong: look how quickly everyone became comfortable with two-hour troponins… It would be interesting to investigate the SAH decision using gestalt, similar to how PERC was validated (gestalt PE risk <15%, which is essentially equivalent to Wells <2).

# should this article change practice?