Beyond Static Models: Enabling Continual Parameter Efficient Finetuning for Large Foundation Models
COLEMAN, ERIC NUERTEY
2026
Abstract
Large-scale pre-trained models have become the backbone of modern computer vision, yet their deployment in dynamic environments remains constrained by computational limitations and catastrophic forgetting. While these models demonstrate remarkable capabilities on individual tasks, adapting them to continuously evolving visual domains requires prohibitively expensive retraining or leads to severe performance degradation on previously learned tasks. Traditional fine-tuning approaches consume extensive resources, making continual adaptation impractical for most practitioners and environmentally unsustainable. This thesis addresses the fundamental challenge of enabling efficient continual learning for large vision models through novel parameter-efficient approaches. The work begins with a comprehensive survey that establishes the theoretical foundations and identifies the critical need for parameter-efficient continual fine-tuning (PECFT) as a distinct research area. Early investigations into interference patterns in large language models reveal the phenomenon of in-context interference, providing crucial insights that inform subsequent methodological developments. Building on these foundations, the thesis introduces three complementary methodologies that maintain performance while dramatically reducing computational requirements. First, an adaptive LoRA merging technique is developed that dynamically computes optimal combination weights for different visual domains, eliminating the need for manual hyperparameter tuning while achieving superior adaptation performance. Second, Hierarchical Adapters Merging (HAM) is presented, a framework that organizes learned adaptations into similarity-based groups, enabling efficient scaling to long sequences of visual tasks while maintaining a fixed parameter budget. Third, GRAD-BEN (Gradient Aligned Distillation and Beta Ensembling) extends these principles to challenging multimodal scenarios, specifically Few-Shot Domain-Incremental Learning. GRAD-BEN leverages vision-language models and integrates multimodal prompting, gradient-aligned distillation, and Beta-based temporal ensembling to enable robust adaptation under severe domain shifts with minimal supervision, without relying on memory buffers or task identifiers. Through comprehensive evaluation on standard computer vision benchmarks including CIFAR-100, CUB-200, Tiny-ImageNet, DomainNet, CORe50, and CDDB-Hard, this work demonstrates state-of-the-art performance in long-sequence continual learning scenarios and resource-constrained multimodal settings. The proposed methods achieve comparable or superior accuracy to full fine-tuning while requiring only a fraction of the computational resources. These contributions advance the field toward practical continual learning systems capable of adapting to the dynamic nature of real-world computer vision applications while maintaining computational sustainability.
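The adaptive LoRA merging mentioned in the abstract combines several task-specific low-rank updates into a single set of weights. As a rough illustration only, not the thesis's actual algorithm, the sketch below shows the generic weighted-merge formulation W' = W + Σ_t α_t B_t A_t in PyTorch; the helper `merge_lora_adapters`, the placeholder coefficients `alphas`, and the toy shapes are assumptions introduced here, since the thesis computes the combination weights adaptively per visual domain rather than by hand.

```python
# Minimal sketch of weighted LoRA-adapter merging under the generic
# formulation W' = W + sum_t alpha_t * (B_t @ A_t). The coefficients
# below are illustrative placeholders, not the adaptive weights
# proposed in the thesis.
import torch

def merge_lora_adapters(base_weight, adapters, alphas):
    """Merge several LoRA adapters into one dense weight matrix.

    base_weight: (d_out, d_in) frozen pre-trained weight.
    adapters:    list of (B, A) pairs with B: (d_out, r), A: (r, d_in).
    alphas:      per-adapter merging coefficients (hypothetical here).
    """
    merged = base_weight.clone()
    for (B, A), alpha in zip(adapters, alphas):
        merged += alpha * (B @ A)  # low-rank update scaled by its merge weight
    return merged

# Toy usage: two rank-4 adapters for a 16x32 linear layer.
d_out, d_in, r = 16, 32, 4
W = torch.randn(d_out, d_in)
adapters = [(torch.randn(d_out, r), torch.randn(r, d_in)) for _ in range(2)]
alphas = torch.softmax(torch.tensor([0.3, 0.7]), dim=0)  # placeholder weights
W_merged = merge_lora_adapters(W, adapters, alphas)
print(W_merged.shape)  # torch.Size([16, 32])
```

Because the adapters are low-rank, the merged matrix stays the same size as the frozen backbone weight, which is what keeps the parameter budget fixed regardless of how many tasks have been seen.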
| File | Size | Format | Access |
|---|---|---|---|
| TESI_ERIC_N_COLEMAN__.pdf | 6.63 MB | Adobe PDF | Open access (Creative Commons license) |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/355830
URN:NBN:IT:UNIPI-355830