Automated Class Discovery and One-Shot Interactions for Acoustic Activity Recognition

Jason Wu, Chris Harrison, Jeffrey P. Bigham, Gierad Laput · 2020 · Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI '20) · doi:10.1145/3313831.3376875

Summary

This paper presents Listen Learner, an end-to-end system for acoustic activity recognition that automatically discovers and learns to classify environmental sounds with minimal user effort. Traditional approaches to sound recognition face a tradeoff: custom models trained in a specific environment achieve high accuracy but require extensive user labelling, while pre-trained general models work out of the box but perform poorly in specific environments. Listen Learner resolves this with self-supervised learning. A deployed microphone device (a Raspberry Pi with a 4-microphone array) listens continuously; the system automatically segments acoustic events, clusters similar ones using CNN-based audio embeddings and hierarchical agglomerative clustering, and then asks the user for a label only once per discovered sound class through a one-shot voice interaction (e.g., "what was that sound?" / "that's my faucet"). The embeddings come from a VGG-ish deep CNN pre-trained on the YouTube-8M dataset, which takes 96x64 log-mel spectrogram patches as input. The embeddings are clustered using Ward's method, and an ensemble of one-class SVMs is built for classification. The system also uses audio directionality from the microphone array and supports three user interaction strategies: open-ended queries, confirmatory questions ("was that a microwave?"), and refinement questions ("was that a faucet or a microwave?"). The paper explicitly highlights home accessibility as an application: a smart speaker could learn to recognize doorbells, alarms, and other household sounds, then send push notifications to deaf or hard of hearing users.
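The clustering and one-shot labelling steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses synthetic 2-D points as stand-ins for CNN audio embeddings, a small pure-Python version of Ward's merge criterion, and a simulated user. The names `ward_cluster`, `max_merge_cost`, and `ask_user` are hypothetical, and the one-class SVM ensemble is left out.

```python
import random
from itertools import combinations


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
    return a["n"] * b["n"] / (a["n"] + b["n"]) * sq_dist


def ward_cluster(points, max_merge_cost):
    """Agglomerative clustering with Ward's criterion.

    Repeatedly merges the cheapest pair of clusters and stops once the
    cheapest merge would cost more than max_merge_cost (a hypothetical knob).
    """
    clusters = [{"n": 1, "centroid": list(p), "members": [i]}
                for i, p in enumerate(points)]
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        a, b = clusters[i], clusters[j]
        if ward_cost(a, b) > max_merge_cost:
            break
        n = a["n"] + b["n"]
        merged = {
            "n": n,
            # Size-weighted mean of the two centroids.
            "centroid": [(a["n"] * x + b["n"] * y) / n
                         for x, y in zip(a["centroid"], b["centroid"])],
            "members": a["members"] + b["members"],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Synthetic stand-in data (assumption): three well-separated sound classes.
random.seed(0)
true_sources = {"faucet": (0.0, 0.0), "microwave": (5.0, 5.0), "doorbell": (0.0, 5.0)}
points, truth = [], []
for name, (cx, cy) in true_sources.items():
    for _ in range(10):
        points.append((random.gauss(cx, 0.1), random.gauss(cy, 0.1)))
        truth.append(name)

clusters = ward_cluster(points, max_merge_cost=5.0)

# One-shot labelling: ask the (simulated) user once per discovered cluster,
# then propagate that single answer to every event in the cluster.
asked = []


def ask_user(event_index):
    asked.append(event_index)
    return truth[event_index]  # simulated reply, e.g. "that's my faucet"


labels = [None] * len(points)
for cluster in clusters:
    answer = ask_user(cluster["members"][0])
    for idx in cluster["members"]:
        labels[idx] = answer

print(len(clusters), "clusters discovered with", len(asked), "user queries")
```

In this toy setup, 30 events end up labelled after one question per cluster. How tight the merge threshold is determines how many clusters, and therefore how many questions, the user sees.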

Key findings

The system was evaluated on standard datasets (ESC-10, UrbanSound8K) and on real-world data collected over one week from six environments, under three settings: Conservative, Balanced, and Relaxed. In the Balanced setting, it achieved 97% precision and 87% recall on the real-world apartment kitchen dataset. The Conservative setting achieved perfect F1 scores (1.0) on both ESC-10 and UrbanSound8K. In the in-the-wild deployment across seven rooms in five buildings, the system discovered an average of 9.9 classes in the Balanced setting and 41.9 in the Relaxed setting. The audio event detection system also achieved 98.9% accuracy for in-game sound classification when applied to wheelchair basketball (as reported in the companion SpokeSense paper). A user interaction study with 12 participants found that 9 of 12 preferred confirmatory questions ("was that a faucet?") over open-ended or refinement queries because they were easier to answer. Around 90% of sound classes were correctly labelled after one interaction per class. Users preferred to be queried infrequently (at most 1-2 times per minute of activity), and 11 of 12 wanted the system to ask as infrequently as possible.
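To put the precision and recall figures on the same scale as the F1 scores quoted for the standard datasets, the two can be combined with the usual harmonic-mean definition of F1. The short sketch below does that arithmetic. The resulting value is derived here from the reported 97% and 87%, not quoted from the paper, and the helper name `f1_score` is just illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Balanced setting on the apartment kitchen dataset: 97% precision, 87% recall.
balanced_f1 = f1_score(0.97, 0.87)
print(round(balanced_f1, 3))  # 0.917

# An F1 of 1.0 requires both precision and recall to be 1.0.
print(f1_score(1.0, 1.0))  # 1.0
```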

Relevance

Listen Learner has direct implications for accessibility, particularly for deaf and hard of hearing users who need awareness of environmental sounds. By automatically learning the specific sounds in a user's environment — rather than relying on generic pre-trained models that may not recognize individual doorbells, appliances, or alarms — the system can provide more accurate and personalized sound notifications. The one-shot labelling approach is especially valuable for accessibility because it minimizes the burden on users during setup, a significant consideration for assistive technology adoption. The system's ability to work across environments without pre-configuration also makes it practical for real-world deployment. For accessibility practitioners, this research demonstrates how machine learning can be designed to adapt to individual users' contexts rather than forcing users to adapt to the technology — a principle that extends well beyond sound recognition to many areas of assistive technology.

Tags: acoustic activity recognition · smart home · machine learning · Internet of Things · context awareness · self-supervised learning · sound recognition · deaf and hard of hearing