To compare the spike detection performance of three skilled humans and three computer algorithms.
Forty prolonged EEGs, 35 of which contained reported spikes, were evaluated. Spikes and sharp waves were marked by the humans and the algorithms. Pairwise sensitivity and false positive rates were calculated for each human-human and algorithm-human pair. Differences in human pairwise performance were calculated and compared to the range of algorithm versus human performance differences as a type of statistical Turing test.
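The pairwise scoring described above can be illustrated with a minimal sketch. It assumes each reader's marks are stored as a list of event times in seconds and that two marks count as the same event when they fall within a fixed tolerance of each other. The function name `pairwise_metrics`, the 0.2 s tolerance, and the greedy nearest-mark matching are illustrative assumptions, not the matching rule used in the study.

```python
def pairwise_metrics(reference, test, duration_min, tolerance_s=0.2):
    """Score `test` marks against `reference` marks (times in seconds).

    Returns (sensitivity, false positives per minute), treating the
    reference reader as the standard for this one pair.
    The tolerance and greedy matching are assumptions for illustration.
    """
    unmatched = sorted(test)
    hits = 0
    for t in sorted(reference):
        # Candidate test marks that lie within the tolerance window.
        candidates = [u for u in unmatched if abs(u - t) <= tolerance_s]
        if candidates:
            # Pair the reference mark with its nearest unused test mark.
            nearest = min(candidates, key=lambda u: abs(u - t))
            unmatched.remove(nearest)
            hits += 1
    sensitivity = hits / len(reference) if reference else float("nan")
    # Test marks left unpaired are false positives relative to this reference.
    fp_per_min = len(unmatched) / duration_min
    return sensitivity, fp_per_min


# Hypothetical marks from two readers on a 2-minute segment.
reader_a = [1.0, 5.0, 9.0, 20.0]
reader_b = [1.1, 9.05, 30.0]

sens_ab, fp_ab = pairwise_metrics(reader_a, reader_b, duration_min=2.0)
sens_ba, fp_ba = pairwise_metrics(reader_b, reader_a, duration_min=2.0)
```

Each pair is scored in both directions because the result depends on which reader serves as the reference. In the hypothetical example, reader B finds 2 of reader A's 4 marks and has 1 unmatched mark, while reader A finds 2 of reader B's 3 marks and has 2 unmatched marks.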
A total of 5474 individual spike events were marked by the humans. Mean pairwise human sensitivities were 40.0%, 42.1%, and 51.5%, with false positive rates of 0.80, 0.97, and 1.99/min, respectively. Only the Persyst 13 (P13) algorithm was comparable to the humans, with a sensitivity of 43.9% and a false positive rate of 1.65/min. Evaluation of pairwise differences in sensitivity and false positive rate demonstrated that P13 met statistical noninferiority criteria compared to the humans.
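As a rough illustration of the comparison logic, the sketch below checks whether the algorithm's distance from each human falls within the spread already seen among the humans themselves. It uses only the mean values reported above. The helper `within_human_range` is hypothetical, and this simple range check is a stand-in for intuition only; it is not the statistical noninferiority test applied in the study, which worked on pairwise differences.

```python
from itertools import combinations


def within_human_range(human_values, algo_value):
    """Return True if the largest algorithm-human difference does not
    exceed the largest human-human difference.

    Simplified range check for illustration; not the study's
    noninferiority test.
    """
    human_spread = max(abs(a - b) for a, b in combinations(human_values, 2))
    algo_spread = max(abs(algo_value - h) for h in human_values)
    return algo_spread <= human_spread


# Mean pairwise values reported in the results.
human_sensitivity = [40.0, 42.1, 51.5]   # percent
human_fp_rate = [0.80, 0.97, 1.99]       # false positives per minute
p13_sensitivity = 43.9
p13_fp_rate = 1.65

sens_ok = within_human_range(human_sensitivity, p13_sensitivity)
fp_ok = within_human_range(human_fp_rate, p13_fp_rate)
```

With these inputs both checks pass: the P13 values sit inside the range spanned by the three humans on each measure.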
Humans had only a fair level of agreement in spike marking. The P13 algorithm was statistically noninferior to the humans.
This was the first time that a spike detection algorithm and humans performed similarly. The performance comparison methodology used here is generally applicable to problems in which skilled human performance is the desired standard and no external gold standard exists.