First, this study simulated a real-world EMU environment as closely as possible.
The data were not pre-selected, clipped, or enriched for seizure density. The consecutive nature of the seizure-containing records meant they were not selected for overtly demonstrative EEG seizure patterns, and the large number of records from different patients resulted in a broader representation of seizure types, as would be typical in the EMU.
In fact, this study likely reflects the high end of the interreader agreement that would be found in an EMU environment. The readers were not facing significant time constraints, and they were aware that their markings would be compared with those of other experts. These conditions favor more thorough and considered marking than the standard clinical review setting, where reading time is more constrained, distractions are more prevalent, and fatigue is common.
Second, after establishing the average interreader agreement, the study assessed the pairwise performance of two seizure detection algorithms (P13 and P14) against the human experts (i.e., P13 and P14 were each compared with readers A, B, and C).
By treating the algorithms exactly the same as the human readers, the study could directly assess whether the algorithms performed within the range of the human experts.
The results showed that the P14 seizure detector is statistically non-inferior to this study's expert readers. The older P13 seizure detection algorithm, by contrast, did not approach the experts' performance because of a much higher false-positive rate.
More specifically, P14 had a pairwise sensitivity of 78.2% and a false-positive rate of 0.9 per day, values that fall within the range of the human experts in this study.
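As a crude illustration of what "performing within the range of the human experts" means here, the sketch below compares averaged pairwise metrics in Python. The study itself used a formal statistical non-inferiority test, which this simple range check does not reproduce; the reader values are placeholders, and only P14's figures come from the text above.

```python
# Crude illustration of the "within the human range" idea. The study used a
# formal statistical non-inferiority test; this sketch only compares averaged
# pairwise metrics to show the intent of the comparison.

# Average pairwise (sensitivity, false positives/day) per reader or algorithm.
# Reader values are hypothetical placeholders, NOT the study's results;
# P14's 0.782 sensitivity and 0.9 FP/day are the figures reported above.
performance = {
    "A": (0.80, 1.2),     # hypothetical
    "B": (0.74, 0.7),     # hypothetical
    "C": (0.77, 1.0),     # hypothetical
    "P14": (0.782, 0.9),  # reported in the text above
}

human = [performance[r] for r in ("A", "B", "C")]
sens_range = (min(s for s, _ in human), max(s for s, _ in human))
fp_range = (min(f for _, f in human), max(f for _, f in human))

sens, fp = performance["P14"]
within = sens_range[0] <= sens <= sens_range[1] and fp_range[0] <= fp <= fp_range[1]
print(f"P14 within human expert range: {within}")
```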
The algorithm should be quite useful as an adjunct to existing visual methods of seizure identification, at times enabling earlier seizure recognition, improving review efficiency, and enhancing overall seizure detection.
While this is an impressive achievement (notably, both the spike and seizure detection algorithms in Persyst 14 were found to be statistically non-inferior to expert human readers), the work is certainly not done.
Our aim at Persyst is to continue pushing the boundaries of what is possible by developing additional metrics that keep improving the detection algorithm's performance.
Other interesting findings: the performance of the older P13 seizure detection algorithm did not approach that of the human experts, despite a pairwise sensitivity of 82.5%, because of a much higher false-positive rate (11.3 per day). When set to a sensitivity equivalent to P13's, the P14 detector produces only 2.3 false positives per day.
The standard pairwise comparison measures the performance of the algorithm against one human expert at a time. If instead we consider the subset of seizures where two readers agreed on the presence of a seizure, a third reader was more likely to also identify those events as seizures: the readers averaged a sensitivity of 89% for these consensus-of-two events. The P14 seizure detector demonstrated a sensitivity of 89.7% for the same consensus-of-two seizures. When evaluated against the smaller subset of seizures marked by all three readers (consensus-of-three), the P14 algorithm had a sensitivity of 90% with a false-positive rate of about one per day. Finally, if, during clinical review, a P14 sensitivity setting was chosen that resulted in an average of four false positives per day, the sensitivity for consensus-of-three seizure events would be 98.5% in this dataset. In other words, using the new adjustable sensitivity features of P14, virtually all seizures in this dataset were marked with what many would consider a reasonable number of false positives during review.
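To make the consensus idea concrete, here is a minimal, self-contained Python sketch. The interval representation of marks, the overlap rule for deciding that two readers marked the same event, and all of the data are assumptions for illustration; they are not the study's actual definitions or results.

```python
# Illustrative sketch of the consensus-of-k evaluation described above.
# Assumptions (not the study's definitions): seizure marks are
# (start_sec, end_sec) intervals, and marks from different readers that
# overlap in time are treated as the same underlying event.

def overlaps(a, b):
    """True if intervals a and b share any time."""
    return a[0] < b[1] and b[0] < a[1]

def consensus_events(reader_marks, k):
    """Events marked by at least k readers, judged by time overlap."""
    events, seen = [], []
    for marks in reader_marks.values():
        for m in marks:
            if any(overlaps(m, s) for s in seen):
                continue  # already counted via an overlapping mark
            seen.append(m)
            n_agree = sum(
                any(overlaps(m, other) for other in other_marks)
                for other_marks in reader_marks.values()
            )
            if n_agree >= k:
                events.append(m)
    return events

def sensitivity(reference_events, detector_marks):
    """Fraction of reference events overlapped by at least one detection."""
    hits = sum(any(overlaps(e, d) for d in detector_marks) for e in reference_events)
    return hits / len(reference_events) if reference_events else float("nan")

readers = {
    "A": [(10, 40), (300, 360)],             # toy data for illustration only
    "B": [(12, 45), (305, 350), (900, 930)],
    "C": [(11, 42), (905, 935)],
}
p14_marks = [(9, 38), (306, 340)]            # hypothetical detector output

for k in (2, 3):
    ref = consensus_events(readers, k)
    print(f"consensus-of-{k}: {len(ref)} events, "
          f"detector sensitivity={sensitivity(ref, p14_marks):.2f}")
```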
*Pairwise sensitivity means that every possible pairing of readers is assessed: one reader's marks are treated as the gold standard, the sensitivity and false-positive rate are calculated for the other, and then the roles are reversed (see tables below). Rather than establishing "ground truth", this statistical test calculates the likelihood that any two expert readers will agree with one another.
This method also allows a detection algorithm to be assessed in exactly the same way the expert readers were assessed, making it possible to say how likely the algorithm and a human reader are to "agree" with one another. In the chart below, A, B, and C identify each reader.
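As a rough illustration of this pairwise calculation, here is a minimal Python sketch. The interval representation, the overlap-based matching rule, and the toy marks are assumptions for illustration; the study's exact matching criteria are not reproduced here.

```python
# Illustrative sketch of the pairwise agreement calculation described above.
# Assumptions (not from the study): marks are (start_sec, end_sec) intervals,
# and a reference event counts as "detected" if any test mark overlaps it.

def overlaps(a, b):
    """True if intervals a and b share any time."""
    return a[0] < b[1] and b[0] < a[1]

def pairwise_metrics(reference, test, record_days):
    """Treat `reference` as the gold standard and score `test` against it.

    Returns (sensitivity, false_positives_per_day).
    """
    detected = sum(any(overlaps(r, t) for t in test) for r in reference)
    sens = detected / len(reference) if reference else float("nan")
    false_pos = sum(not any(overlaps(t, r) for r in reference) for t in test)
    return sens, false_pos / record_days

# Every ordered pair is scored, e.g. (A as gold standard, B as test) and
# (B as gold standard, A as test); an algorithm's marks are scored the same way.
marks = {
    "A": [(10, 40), (300, 360)],             # toy data for illustration only
    "B": [(12, 45), (305, 350), (900, 930)],
    "C": [(11, 42)],
}
for gold in marks:
    for test in marks:
        if gold != test:
            sens, fp_rate = pairwise_metrics(marks[gold], marks[test], record_days=1.0)
            print(f"{test} vs {gold}: sensitivity={sens:.2f}, FP/day={fp_rate:.1f}")
```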