CSI-Bench: A Large-Scale In-the-Wild Dataset for Multi-task WiFi Sensing

CSI-Bench overview. The benchmark features multiple commercial routers and IoT devices deployed in real homes to collect CSI data. It supports a wide range of human-centric sensing tasks, enabling robust model development across diverse hardware setups and real-world scenarios.

Introduction

We introduce CSI-Bench, a large-scale, in-the-wild benchmark dataset collected from commercial WiFi edge devices deployed in real homes with real users. It includes single-task datasets for fall detection, breathing monitoring, localization, and motion source recognition, as well as a multitask dataset with joint labels for user identity, activity, and proximity. Some samples include multiple labels, enabling efficient multitask learning for resource-constrained edge health applications. CSI-Bench captures realistic signal variability and supports research on generalizable, data-efficient models for high-dimensional time-series data. We provide standardized splits and benchmark results across supervised, multitask, and few-shot learning settings. CSI-Bench provides a foundation for developing scalable, privacy-preserving wireless AI systems for health monitoring and broader human-centric applications.
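Because only some samples carry labels for several tasks, a joint training objective has to cope with missing labels. The following is a minimal sketch of one common way to do that, a per-sample cross-entropy averaged over the tasks that actually have a label. The task names, the `None` convention for a missing label, and the optional per-task weights are assumptions for illustration, not the training recipe used for the benchmark.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample computed from raw logits (stable log-sum-exp)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def multitask_loss(logits_by_task, labels_by_task, weights=None):
    """Weighted mean of per-task losses; tasks whose label is None are skipped.

    logits_by_task: dict mapping task name -> list of class logits
    labels_by_task: dict mapping task name -> class index, or None if unlabeled
    weights:        optional dict mapping task name -> loss weight (default 1.0)
    """
    weights = weights or {}
    total, weight_sum = 0.0, 0.0
    for task, logits in logits_by_task.items():
        label = labels_by_task.get(task)
        if label is None:  # this sample has no annotation for this task
            continue
        w = weights.get(task, 1.0)
        total += w * cross_entropy(logits, label)
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


if __name__ == "__main__":
    # Hypothetical sample: labeled for activity and user identity, not proximity.
    logits = {
        "activity": [2.0, 0.5, -1.0],
        "user_identity": [0.1, 0.3],
        "proximity": [0.0, 0.0, 0.0, 0.0],
    }
    labels = {"activity": 0, "user_identity": 1, "proximity": None}
    print(multitask_loss(logits, labels))
```

Averaging over the labeled tasks only keeps the loss scale comparable between samples with one label and samples with three.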

Data collection

To support robust and generalizable WiFi sensing research, we build a diverse collection of datasets captured in real-world environments using commercial WiFi devices. CSI-Bench spans over 460 hours of CSI recordings across 35 unique users, 26 environments, and 16 device types, covering both infrastructure and edge devices operating under varied network conditions. Data were collected in homes, offices, and public indoor areas with minimal control over ambient interference or user behavior. Each dataset is designed to support one or more sensing tasks, including fall detection (Fall), breathing monitoring (Breath), localization (Loc.), human activity recognition (HAR), user identification (UID), and proximity estimation (Prox.).

The data collection system follows a hierarchical sensing architecture composed of edge devices, a router-based local network, a cloud-based sensing server, and a cloud-based front server for user interaction. Commercial WiFi-enabled edge devices (e.g., smart speakers) deployed in real-world environments act as signal sources or reflectors: they transmit WiFi signals or interact with those already present in their surroundings. A central router captures CSI from these interactions and relays the data to a local Sensing Server via the MQTT protocol. This server handles real-time CSI decoding, pre-processing, and task-specific feature extraction.
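To make the decoding step more concrete, here is a minimal sketch of what a sensing server might do with one relayed message. The message format is not specified here, so the JSON layout, the field names `device_id`, `timestamp`, and `csi`, and the choice of amplitude as the extracted quantity are all assumptions for illustration. The MQTT transport itself is left out; the function only takes the raw payload bytes that a subscriber callback would receive.

```python
import json
import math


def decode_csi_message(payload: bytes) -> dict:
    """Decode a hypothetical JSON CSI message into per-subcarrier amplitudes.

    Assumed payload layout (not taken from the CSI-Bench system):
      {"device_id": str, "timestamp": float, "csi": [[real, imag], ...]}
    with one [real, imag] pair per subcarrier.
    """
    msg = json.loads(payload.decode("utf-8"))
    # Amplitude of each complex channel coefficient: sqrt(real^2 + imag^2).
    amplitude = [math.hypot(re, im) for re, im in msg["csi"]]
    return {
        "device_id": msg["device_id"],
        "timestamp": msg["timestamp"],
        "amplitude": amplitude,
    }


if __name__ == "__main__":
    # Made-up example payload with two subcarriers.
    raw = json.dumps(
        {"device_id": "speaker-01", "timestamp": 0.0, "csi": [[3, 4], [0, 1]]}
    ).encode("utf-8")
    print(decode_csi_message(raw))
```

Stacking the amplitude vectors of consecutive messages over time gives a time-by-subcarrier matrix, the layout used for the representative CSI samples in the figure below.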

Figure: Overview of data collection system architecture.

Figure: Representative CSI samples for different activities and motion sources, where the x-axis is time and the y-axis is subcarrier index. These samples show the CSI signal patterns from human actions and non-human motions.

Results

Performance comparison of supervised models across four WiFi sensing tasks. Accuracy (Acc) and F1-score are reported as mean ± std (%) over three runs.

Model	Fall Detection Acc	Fall Detection F1	Breathing Detection Acc	Breathing Detection F1	Room-Level Localization Acc	Room-Level Localization F1	Motion Source Recognition Acc	Motion Source Recognition F1
MLP	92.16 ±0.91	92.17 ±0.92	97.59 ±0.08	97.59 ±0.08	87.14 ±0.80	86.90 ±0.83	98.86 ±0.07	98.86 ±0.07
ResNet-18	94.88 ±0.26	94.89 ±0.26	98.58 ±0.17	98.58 ±0.17	100.00 ±0.00	100.00 ±0.00	99.56 ±0.07	99.56 ±0.07
LSTM	94.93 ±0.51	94.92 ±0.50	98.62 ±0.17	98.62 ±0.17	99.12 ±0.27	99.12 ±0.26	98.42 ±0.19	98.42 ±0.19
Transformer	94.28 ±0.72	94.26 ±0.72	98.64 ±0.19	98.64 ±0.19	99.27 ±0.22	99.27 ±0.22	98.61 ±0.27	98.61 ±0.27
ViT	93.58 ±0.71	93.59 ±0.70	98.63 ±0.17	98.63 ±0.17	99.94 ±0.11	99.94 ±0.11	98.74 ±0.10	98.74 ±0.10
PatchTST	94.03 ±0.74	94.03 ±0.73	98.84 ±0.13	98.84 ±0.13	99.91 ±0.10	99.91 ±0.10	98.86 ±0.19	98.86 ±0.19
TimeSformer-1D	93.86 ±1.16	93.87 ±1.13	98.68 ±0.21	98.68 ±0.21	100.00 ±0.00	100.00 ±0.00	98.38 ±0.17	98.39 ±0.17
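As an illustration of how a "mean ± std over three runs" entry can be produced, the sketch below aggregates per-run scores into the same string format as the table. The run values are invented for the example and do not come from the benchmark. Whether the reported figures use the sample or the population standard deviation is not stated, so the sample version (`statistics.stdev`) is an assumption.

```python
from statistics import mean, stdev


def summarize(runs):
    """Format per-run scores (in %) as 'mean ±std' with two decimals.

    Uses the sample standard deviation (n - 1 in the denominator); this is
    an assumption, since the table does not say which variant it reports.
    """
    return f"{mean(runs):.2f} ±{stdev(runs):.2f}"


if __name__ == "__main__":
    # Hypothetical accuracies from three runs with different random seeds.
    print(summarize([94.0, 95.0, 96.0]))
```

With only three runs the two variants differ noticeably (by a factor of about 1.22), so it is worth stating which one is used when comparing against the numbers above.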

Task-Specific vs Multi-Task Training

Comparison of task-specific and multi-task training for the Transformer model across shared-data tasks. Accuracy (Acc) and F1-score (F1) are reported as mean ± std (%) over three runs; improvements (Δ) are the differences between the multi-task and task-specific means, in percentage points.

Task	Task-Specific Acc	Task-Specific F1	Multi-Task Acc	Multi-Task F1	ΔAcc	ΔF1
Human Activity Recognition	75.40 ±0.93	75.49 ±0.73	87.79 ±0.00	86.47 ±0.00	+12.39	+10.98
User Identification	99.51 ±0.32	99.51 ±0.32	99.83 ±0.00	100.00 ±0.00	+0.32	+0.49
Proximity Recognition	77.52 ±3.13	77.35 ±3.24	87.85 ±0.00	88.67 ±0.00	+10.33	+11.32
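The improvement columns can be reproduced directly from the mean values in the table. The short sketch below does this; the numbers are copied from the rows above, while the dictionary layout and function name are just illustrative choices.

```python
# Mean (Acc, F1) in % for the Transformer, copied from the table above.
RESULTS = {
    "Human Activity Recognition": {"single": (75.40, 75.49), "multi": (87.79, 86.47)},
    "User Identification": {"single": (99.51, 99.51), "multi": (99.83, 100.00)},
    "Proximity Recognition": {"single": (77.52, 77.35), "multi": (87.85, 88.67)},
}


def improvement(single, multi):
    """Return (delta_acc, delta_f1) in percentage points, rounded to 2 decimals."""
    return tuple(round(m - s, 2) for s, m in zip(single, multi))


if __name__ == "__main__":
    for task, scores in RESULTS.items():
        d_acc, d_f1 = improvement(scores["single"], scores["multi"])
        print(f"{task}: ΔAcc {d_acc:+.2f}, ΔF1 {d_f1:+.2f}")
```

The gains are concentrated in the two harder tasks (activity and proximity), while user identification is already near its ceiling under task-specific training.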

BibTeX

@article{csi-bench,
Author = {Guozhen Zhu and Yuqian Hu and Weihang Gao and Wei-Hsiang Wang and Beibei Wang and K. J. Ray Liu},
Title = {CSI-Bench: A Large-Scale In-the-Wild Dataset for Multi-task WiFi Sensing},
Year = {2025},
Eprint = {arXiv:2505.21866}
}