On the reliability of acoustic annotations and automatic detections of Antarctic blue whale calls under different acoustic conditions
Evaluation of the performance of computer-based algorithms that automatically detect mammalian vocalizations often relies on comparisons between detector outputs and a reference data set, generally obtained by manual annotation of acoustic recordings. To explore the reproducibility of such manual annotations, we investigated inter- and intra-analyst variability in Antarctic blue whale (ABW) Z-calls manually annotated by two analysts, using acoustic data from two ocean basins that differ in call abundance and background noise. The manual annotations exhibited strong inter- and intra-analyst variability, with less than 50% agreement between analysts. This variability was mainly caused by the difficulty of reliably and reproducibly distinguishing single calls within an ABW chorus, which consists of overlapping distant calls. Furthermore, the performance of two automated detectors, one based on spectrogram correlation and the other on a subspace-detection strategy, was evaluated by comparing detector outputs to a “conservative” manually annotated reference data set comprising only annotations that matched between analysts. This study highlights the need for a standardized approach to human annotation and automatic detection, including a quantitative description of their performance, to improve the comparability of acoustic data, which is particularly relevant in the context of collaborative efforts to collect and analyze large passive acoustic data sets.
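The abstract's central quantities, the inter-analyst agreement and the "conservative" reference data set built from matched annotations, could be computed along the following lines. This is a minimal sketch under assumed conventions: annotations are represented as (start, end) times in seconds, and two annotations are taken to match if they overlap in time at all; this matching criterion and the function names (overlaps, conservative_reference, agreement) are illustrative assumptions, not the paper's actual procedure.

```python
# Hypothetical sketch: match two analysts' annotations by temporal overlap,
# keep only matched annotations as a "conservative" reference set, and
# report the fraction of annotations that agree between analysts.

from typing import List, Tuple

Annotation = Tuple[float, float]  # (start_time_s, end_time_s) of one annotated Z-call


def overlaps(a: Annotation, b: Annotation) -> bool:
    """True if the two annotated time intervals overlap at all (assumed criterion)."""
    return a[0] < b[1] and b[0] < a[1]


def conservative_reference(analyst_1: List[Annotation],
                           analyst_2: List[Annotation]) -> List[Annotation]:
    """Keep only analyst-1 annotations that overlap at least one analyst-2 annotation."""
    return [a for a in analyst_1 if any(overlaps(a, b) for b in analyst_2)]


def agreement(analyst_1: List[Annotation], analyst_2: List[Annotation]) -> float:
    """Fraction of all annotations (pooled over both analysts) that have a match in the other set."""
    matched_1 = sum(any(overlaps(a, b) for b in analyst_2) for a in analyst_1)
    matched_2 = sum(any(overlaps(b, a) for a in analyst_1) for b in analyst_2)
    total = len(analyst_1) + len(analyst_2)
    return (matched_1 + matched_2) / total if total else 1.0


if __name__ == "__main__":
    # Toy example: three annotations from analyst 1, two from analyst 2.
    a1 = [(10.0, 18.0), (40.0, 48.0), (90.0, 98.0)]
    a2 = [(11.0, 19.0), (60.0, 68.0)]
    print(conservative_reference(a1, a2))          # [(10.0, 18.0)]
    print(f"agreement = {agreement(a1, a2):.2f}")  # 0.40, i.e. below 50% agreement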