Real-Time Detection of Violent Actions on Urban Buses (original title: Rilevamento in tempo reale delle azioni violente negli autobus urbani)

GALLO, MARCO
2026

Abstract

Ensuring passenger safety is a critical requirement for sustainable mobility. Yet, despite the social relevance of this objective, the automatic detection of violent behavior in public transportation has received limited attention compared with general-purpose surveillance. Most existing approaches are designed around benchmark settings and cloud-scale resources, while deployment on board vehicles (constrained, latency-sensitive, and privacy-critical environments) remains underexplored. This dissertation addresses this gap by presenting, validating, and discussing a complete AI-based system for real-time violence detection specifically tailored to buses.

From a system-design perspective, public transport imposes a unique combination of constraints. Sensing is affected by overhead viewpoints, frequent occlusions, abrupt motion, changes in illumination across stops and routes, and the highly dynamic arrangement of passengers. Computing must be performed at the edge to avoid dependency on unreliable or costly connectivity and to meet stringent latency and privacy requirements. Finally, alarms must be generated with a balanced trade-off between sensitivity (to detect true incidents) and specificity (to limit nuisance alarms that could desensitize operators).

The proposed solution is engineered around these constraints. It integrates six ceiling-mounted IP cameras with an embedded GPU server placed on the vehicle to enable on-board, edge-only inference. The architecture supports multi-camera ingestion, synchronized analysis, and a lightweight decision layer that fuses the evidence produced by the video models into actionable alerts. The end-to-end pipeline is designed to remain operational without cloud services: all computation is restricted to the vehicle, and only high-level alerts are exposed to external systems when connectivity is available.

A key scientific challenge is domain shift. Models trained on standard violence datasets often fail to generalize to the particular visual and behavioral patterns observed on buses. To mitigate this, we constructed a composite dataset specifically oriented to public-transport scenarios. It combines established public resources (RWF-2000, UCF-Crime, SCVD, and Bus Violence) with a proprietary dataset recorded in a full-scale bus simulator reproducing real layouts, camera geometry, and crowding conditions. The simulator enables controlled acquisition of edge cases (e.g., rapid crowd movements, partial occlusions, seated interactions) that are underrepresented in generic datasets. This composite corpus is used both to pre-adapt models and to quantify the benefit of injecting bus-specific samples into training.

On the algorithmic side, we investigate three state-of-the-art video architectures with complementary inductive biases: X3D, R(2+1)D, and SlowFast-50. All networks are initialized from Kinetics-400 to leverage large-scale motion representations and then adapted with a progressive unfreezing protocol. In this strategy, learning starts by training the classification head while early spatiotemporal blocks are kept frozen, and progressively deeper blocks are unfrozen as training stabilizes. This schedule supports stable transfer to the bus domain, limits catastrophic forgetting, and reduces overfitting when the amount of domain-specific data is limited. We complement the fine-tuning with data treatments that reflect in-vehicle conditions, such as temporal subsampling, moderate motion blur, and compression artifacts consistent with IP camera streams, aiming to narrow the sim-to-real gap without adding computational burden at inference.

System performance is assessed through a multi-stage evaluation. First, ablation studies analyze the impact of each design choice: (i) inclusion of proprietary bus data in training, (ii) choice of backbone architecture, and (iii) the progressive unfreezing schedule. Second, we validate the system in the bus simulator to stress-test detection under controlled yet realistic variations (lighting, occupancy, and camera viewpoints) while measuring end-to-end latency. Third, we conduct supervised field trials on an actual 13-meter bus to evaluate the full pipeline under operational conditions (vibration, network jitter, passenger flow). This layered methodology allows us to disentangle model-centric effects from system-level factors and to quantify the reliability of the deployed solution.

Results consistently indicate that supplementing public benchmarks with bus-specific proprietary data markedly improves generalization to in-vehicle scenes. Among the tested architectures, X3D delivers the best trade-off between accuracy and efficiency, sustaining real-time analysis with sub-second end-to-end latency on the embedded GPU server while maintaining competitive detection quality. R(2+1)D and SlowFast-50 remain valuable references, particularly in scenarios with abundant compute or when higher temporal fidelity is desirable, but X3D proves more suitable for continuous, on-board operation. Importantly, these findings hold across ablations and are confirmed in the simulator and during on-vehicle trials, suggesting that the proposed training protocol and system integration are robust to deployment variations.

Beyond raw metrics, the study highlights several operational insights. First, multi-camera coverage from ceiling viewpoints is critical to mitigate occlusions and to capture interactions between standing and seated passengers; simple late fusion at the decision layer can already provide meaningful resilience without the cost of multi-view feature fusion. Second, prioritizing determinism in the video pipeline (fixed frame rates, bounded buffering, watchdogs) reduces tail latencies that could delay alarms during critical events. Third, edge-only processing not only satisfies privacy constraints by avoiding video streaming off the vehicle but also enhances availability: the system remains functional in areas with poor connectivity and is immune to cloud outages. Finally, supervised validation in real service uncovers failure modes rarely observable in benchmarks (e.g., aggressive gestures partially hidden by grab poles, or non-violent yet energetic events such as joyful celebrations) that guide subsequent data curation.

The main contributions of this dissertation are as follows: (1) a complete design and deployment of an edge-based, multi-camera violence detection system dedicated to buses, covering sensing, embedded inference, and decision layers; (2) a composite dataset strategy that blends established benchmarks with simulator-acquired, bus-specific samples to address domain shift; (3) a principled fine-tuning pipeline based on progressive unfreezing for efficient transfer from Kinetics-400 to the bus domain; and (4) a comprehensive evaluation protocol spanning ablations, simulator validation, and real-world field trials on a 13-meter bus. To the best of our knowledge, this is among the first works to go beyond simulation and isolated benchmark testing by demonstrating a fully deployed, real-time, edge-only system for violence detection in public transport.

While the achieved performance and latency meet the operational targets of the platform, the study also surfaces limitations that motivate future work. The rarity and diversity of violent incidents constrain data scale and label granularity; additional targeted collection, in-the-loop continual learning, and stronger out-of-distribution detection could further reduce false positives. Multi-camera reasoning is handled at the decision level; exploring mid-level or feature-space fusion could capture inter-view dynamics more effectively when compute permits. Finally, broader ethical and legal considerations remain central to large-scale adoption: transparent governance, privacy-by-design data handling, and human-in-the-loop review should be embedded into any real-world deployment pipeline.

In summary, this dissertation demonstrates that accurate, low-latency violence detection for public buses is feasible on embedded hardware through the joint design of a domain-aware dataset, an efficient fine-tuning strategy, and an edge-first system architecture. The results provide empirical evidence that tailoring both learning and engineering choices to the constraints and phenomenology of in-vehicle environments is essential to bridge the gap between promising laboratory performance and dependable operation in the field.
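
The progressive unfreezing protocol described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation. The block names, the fixed number of epochs per stage, and the choice to keep the earliest block permanently frozen are all assumptions made for illustration. The abstract says blocks are unfrozen "as training stabilizes", which a fixed epoch schedule only approximates. The sketch also assumes that unfreezing proceeds from the classification head back toward the input.

```python
from typing import List, Set

# Hypothetical block names, ordered from input (earliest) to output (head).
BLOCKS = ["stem", "res2", "res3", "res4", "res5", "head"]


def trainable_blocks(block_names: List[str], epoch: int,
                     epochs_per_stage: int = 2,
                     always_frozen: int = 1) -> Set[str]:
    """Return the names of the blocks that are trainable at `epoch`.

    Stage 0 trains only the last block (the head). Each later stage
    unfreezes one more block, moving from the head back toward the
    input. The first `always_frozen` blocks are never unfrozen.
    All default values are illustrative assumptions.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    if epochs_per_stage < 1:
        raise ValueError("epochs_per_stage must be at least 1")

    stage = epoch // epochs_per_stage
    # Upper bound on how many blocks may ever be unfrozen.
    max_unfrozen = max(0, len(block_names) - always_frozen)
    n_unfrozen = min(1 + stage, max_unfrozen)
    # Slice from the end: the last n_unfrozen blocks are trainable.
    return set(block_names[len(block_names) - n_unfrozen:])


if __name__ == "__main__":
    for epoch in (0, 2, 4, 10):
        print(epoch, sorted(trainable_blocks(BLOCKS, epoch)))
```

In a PyTorch training loop, the returned set would typically be used to set `requires_grad` on the parameters of each block before the epoch starts. Gating each stage on a validation-loss plateau rather than on the epoch count would be closer to the "as training stabilizes" wording.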
English
Naso, David
Massenio, Paolo Roberto
Giaquinto, Nicola
Politecnico di Bari
Files in this item:

File: Tesi_PhD_Gallo.pdf
Access: open access
License: All rights reserved
Size: 3.74 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/372027
The NBN code of this thesis is URN:NBN:IT:POLIBA-372027