LLMs Boost Activity Recognition Without Training

November 21, 2025 · 2 min read

In a recent study, large language models achieved 12-class zero- and one-shot classification F1-scores significantly above chance for activity recognition, with no task-specific training required.
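To make "above chance" concrete: with 12 equally likely classes, uniform random guessing yields an expected accuracy (equal to micro-F1) of 1/12, about 8.3%. A minimal simulation, purely illustrative and not from the paper:

```python
import random

# Illustrative only: estimate chance-level accuracy for a 12-class task
# by comparing uniformly random guesses against uniformly random labels.
random.seed(0)
n = 100_000
labels = [random.randrange(12) for _ in range(n)]
guesses = [random.randrange(12) for _ in range(n)]
accuracy = sum(l == g for l, g in zip(labels, guesses)) / n
print(f"chance-level accuracy: {accuracy:.3f}")  # close to 1/12 ~= 0.083
```

Any F1 meaningfully above this baseline indicates the models are extracting real signal from the sensor streams.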

This finding highlights how LLMs can effectively fuse multimodal sensor data, such as audio and motion time series, to classify diverse activities like household tasks and sports.

The researchers curated a subset from the Ego4D dataset, focusing on varied contexts to test the models' ability to integrate complementary information without aligned training data.

They employed a late fusion approach, where modality-specific models processed audio and motion streams separately, and an LLM combined these outputs for final classification.
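The late-fusion step above can be sketched as follows. This is a hypothetical illustration of the general pattern, not the paper's actual prompt or label set: each modality-specific model emits its top-k labels with confidence scores, and those are formatted into a prompt for the LLM to produce a final label.

```python
# Hypothetical sketch of late fusion via an LLM prompt.
# The label set, function names, and prompt wording are illustrative
# assumptions, not taken from the study.
ACTIVITY_LABELS = ["cooking", "cleaning", "gardening", "playing_sports"]

def build_fusion_prompt(audio_top_k, motion_top_k, labels=ACTIVITY_LABELS):
    """Format per-modality predictions into a zero-shot classification prompt.

    audio_top_k / motion_top_k: lists of (label, score) pairs from the
    modality-specific models.
    """
    audio_str = ", ".join(f"{lbl} ({p:.2f})" for lbl, p in audio_top_k)
    motion_str = ", ".join(f"{lbl} ({p:.2f})" for lbl, p in motion_top_k)
    return (
        "You are classifying a person's activity from sensor evidence.\n"
        f"Audio model's top guesses: {audio_str}\n"
        f"Motion model's top guesses: {motion_str}\n"
        f"Choose exactly one label from: {', '.join(labels)}.\n"
        "Answer with the label only."
    )

prompt = build_fusion_prompt(
    audio_top_k=[("cooking", 0.62), ("cleaning", 0.21)],
    motion_top_k=[("cooking", 0.48), ("gardening", 0.30)],
)
print(prompt)
```

The prompt string would then be sent to any off-the-shelf LLM; because fusion happens in text, no aligned multimodal training data is needed.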

The study showed that this approach not only surpassed random guessing but also offered a practical solution for applications with limited labeled data, reducing the need for extensive model retraining.

This approach can enable deployment in scenarios where computational resources are constrained, as it avoids the memory and processing demands of custom multimodal models.

However, the authors note limitations, including potential challenges in scaling to more complex activity sets and dependencies on the quality of the modality-specific inputs.

Overall, this work demonstrates a step toward more adaptable AI systems that leverage pre-trained models for real-world sensor fusion tasks.