Advanced Computer Vision for Smart Retail: From Anomaly Detection to Fine-grained Product Classification
Tur, Anil Osman
2025
Abstract
The rapid evolution of smart retail environments has created a pressing need for advanced computer vision solutions capable of addressing critical operational challenges, particularly in automated anomaly detection, fine-grained product classification, and interactive product localization. Existing systems often struggle with the complexity and variability inherent in retail scenarios, including dynamic customer behavior, subtle product differences, occlusions, and inconsistent lighting. This thesis addresses these challenges through novel methodologies in three complementary research directions.

First, we introduce an unsupervised anomaly detection framework based on diffusion models and compact motion representations. The diffusion-based approach captures complex spatiotemporal patterns without relying on labeled training data, while the compact motion representations significantly improve computational efficiency without sacrificing detection accuracy. Extensive experimental evaluations show that our methods outperform current state-of-the-art techniques on benchmark datasets, highlighting their potential for real-world surveillance applications in retail environments.

Second, we propose a robust zero-shot fine-grained product classification pipeline that leverages advanced vision models, specifically CLIP and DINOv2. Using visual embeddings and prototype-based classification strategies, the approach distinguishes visually similar products without requiring explicit training data for each class. To support rigorous evaluation, we introduce the MIMEX dataset, a challenging benchmark tailored to fine-grained retail product classification. Our experiments confirm the superior performance and generalization capabilities of the proposed methods compared to existing approaches.

Third, we present a zero-shot referring segmentation framework that enables precise localization and segmentation of products from natural language descriptions. By combining open-vocabulary object detection with large language model reasoning, the approach handles the dense product arrangements and fine-grained distinctions of complex retail scenes without domain-specific training. The framework supports intuitive human-AI interaction, allowing customers and store associates to identify and locate products using natural language queries.

Overall, this thesis advances the state of the art in computer vision for smart retail by introducing efficient unsupervised anomaly detection techniques, effective zero-shot fine-grained classification frameworks, and a novel approach to zero-shot referring segmentation. The proposed methodologies not only address key limitations of existing systems but also provide practical solutions readily applicable to real-world retail settings.
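To make the prototype-based classification strategy concrete, the sketch below shows one way such a pipeline can look: each product class gets a prototype formed as the mean DINOv2 embedding of a few reference images, and a query crop is assigned to the nearest prototype by cosine similarity, with no per-class training. This is an illustrative sketch only, not the thesis implementation; the backbone (DINOv2 ViT-S/14 via torch hub) matches the models named above, but the product names and image paths are hypothetical placeholders.

```python
# Minimal sketch of prototype-based zero-shot product classification with
# DINOv2 embeddings. Illustrative only; not the thesis implementation.
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
# Official DINOv2 ViT-S/14 backbone from torch hub (384-dim CLS embeddings).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),  # 224 is a multiple of the 14-px patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(paths):
    """Return L2-normalized DINOv2 embeddings for a list of image paths."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    return F.normalize(model(batch.to(device)), dim=-1)

# One prototype per product class: the mean embedding of a few reference
# images. Class names and file paths are hypothetical placeholders.
reference_images = {
    "espresso_beans_250g": ["refs/espresso_1.jpg", "refs/espresso_2.jpg"],
    "decaf_beans_250g":    ["refs/decaf_1.jpg", "refs/decaf_2.jpg"],
}
prototypes = torch.stack([
    F.normalize(embed(paths).mean(dim=0), dim=-1)
    for paths in reference_images.values()
])
class_names = list(reference_images.keys())

# Classify a query crop by cosine similarity to the class prototypes.
query = embed(["query/shelf_crop.jpg"])   # shape (1, 384)
scores = query @ prototypes.T             # cosine similarities, shape (1, C)
print(class_names[scores.argmax().item()])
```

Because classification reduces to nearest-prototype search in the embedding space, new products can be added by embedding a handful of reference images, with no retraining of the backbone.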
File: phd_unitn_Tur_AnilOsman.pdf
Access: open access
Size: 19.8 MB
Format: Adobe PDF
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/214294
URN:NBN:IT:UNITN-214294