Optimizing Transformer Models: From Isotropic Architectural Design to Parameter-Efficient Fine-Tuning for Large-Scale NLP

Estupiñán Ojeda, Cristian David

Please use this identifier to cite or link to this item: https://accedacris.ulpgc.es/jspui/handle/10553/169867

Title:	Optimizing Transformer Models: From Isotropic Architectural Design to Parameter-Efficient Fine-Tuning for Large-Scale NLP
Authors:	Estupiñán Ojeda, Cristian David
Director:	Guerra Artal, Cayetano
UNESCO Classification:	120302 Algorithmic languages
Keywords:	Transformer Models; NLP; Parameter-Efficient Fine-Tuning
Issue Date:	2026
Abstract:	Transformer-based architectures now represent the leading approach in contemporary Natural Language Processing (NLP), enabling strong performance across generation, understanding, and information extraction tasks. At the same time, the practical reality of most deployments is that systems are built and maintained under hard constraints: limited compute and memory budgets, restricted access to labelled data, and operational requirements that make repeated full-model retraining difficult to justify. These constraints are especially pronounced in institutional environments such as healthcare and public administration, where hardware, privacy, and maintenance costs are fixed and where the cost of failure is not only reduced accuracy but also reduced trust and usability. This thesis is organized around a single research question that follows directly from this reality: how can Transformer-based NLP systems be improved when constraints on compute, memory, data availability, and deployment are treated as primary design requirements rather than secondary implementation details? A core claim of this thesis is that, under constrained settings, performance is determined less by the raw expressive capacity of Transformer architectures and more by how well their internal representations retain and reveal information that can be effectively used. In other words, efficiency is not only a question of saving parameters or reducing FLOPs, but also a question of avoiding representational pathologies that waste capacity by collapsing variation. This motivates a representational viewpoint in which the geometry of contextual embedding spaces is treated as a first-class object of study. Empirically, widely used Transformer models often exhibit strong global anisotropy and representation degeneration, meaning that contextual vectors become increasingly aligned and concentrate in a narrow region of the space as depth grows.
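As a rough illustration of the anisotropy described above, one common proxy is the mean pairwise cosine similarity of a set of contextual vectors. The sketch below is not taken from the thesis: the function name `mean_pairwise_cosine`, the synthetic data, and the choice of this particular proxy are all assumptions made for illustration.

```python
import numpy as np

def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of row vectors."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sims = unit @ unit.T
    n = unit.shape[0]
    # Drop the diagonal (self-similarity) before averaging.
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))

rng = np.random.default_rng(0)
# Synthetic stand-in for a roughly isotropic embedding cloud.
isotropic = rng.standard_normal((200, 64))
# Hypothetical "collapsed" cloud: the same noise plus a large shared offset,
# so every vector points in nearly the same direction.
collapsed = isotropic + 10.0

iso_score = mean_pairwise_cosine(isotropic)
col_score = mean_pairwise_cosine(collapsed)
print(f"isotropic: {iso_score:.3f}  collapsed: {col_score:.3f}")
```

Under this proxy, a score near 0 suggests directions spread across the space, while a score close to 1 corresponds to the narrow-cone regime in which vectors are almost fully aligned.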
When this happens, token-level separability is reduced, downstream discrimination becomes more difficult, and optimization can become less stable, particularly in settings that demand fine-grained distinctions or robust ranking among many candidates. Under tight resource budgets, these geometric effects are not merely diagnostic curiosities, because they can act as concrete bottlenecks that reduce the effective information content available to later modules and thereby limit task performance. The thesis develops two complementary lines of work that share this geometric motivation while addressing different phases of the model lifecycle. The first line studies architectural optimization through the lens of isotropy, treating the mitigation of severe anisotropy and degeneration as an actionable design objective rather than a post-hoc measurement. The core hypothesis is that certain standard architectural choices, especially when compounded across many layers, can unintentionally promote representational collapse, and that focused architectural adjustments can maintain greater angular diversity without the need for scaling. This part of the thesis proposes isotropy-oriented architectural interventions within the Transformer block, emphasizing post-attention transformations that remain expressive but avoid purely position-wise re-encoding patterns that can accumulate alignment effects. The designs are evaluated across heterogeneous objectives, including generation and understanding, with the aim of demonstrating that healthier representation geometry can translate into measurable downstream gains while remaining compatible with efficiency goals. The second line of work moves from architecture to adaptation, focusing on how to specialize strong pre-trained models to demanding domains when labelled data is scarce and deployment constraints prohibit full fine-tuning.
Here, the thesis maintains a uniform representational position: adaptation should be understood as a controlled intervention on top of a model whose geometry may already be imperfect, rather than as an unconstrained re-optimization of all parameters. This motivates parameter-efficient fine-tuning (PEFT) as a principled strategy for constrained specialization. Instead of modifying the entire parameter space, PEFT introduces a small number of trainable parameters that can encode domain- and task-specific adjustments, preserving most of the pre-trained structure while allocating capacity where it is most useful. The thesis evaluates this approach in two realistic long-tail scenarios with large label spaces and high operational relevance: strict Joint Entity Recognition and Linking (JERL) of diagnoses in noisy bilingual clinical notes, and EuroVoc-based extreme multi-label indexing of parliamentary proceedings, where system utility depends on ranking quality and the ability to surface high-quality shortlists. These two lines are connected by a shared emphasis on discriminative structure under constraints. Architectural degeneration reduces separability and wastes representational degrees of freedom, which in turn makes downstream adaptation more demanding, especially in long-tailed regimes where rare labels require fine-grained disambiguation. Conversely, even strong adaptation mechanisms can inherit limitations from collapsed models if the base space does not preserve sufficient directional diversity for downstream objectives. The thesis therefore treats representation geometry and efficient adaptation as coupled concerns: improving the health of contextual spaces can shift the effectiveness-efficiency frontier at the backbone level, while PEFT provides a practical means of specialization that respects constraints and can be deployed in settings where full fine-tuning is infeasible.
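The PEFT idea described above, a small set of trainable parameters added on top of a frozen pre-trained model, can be sketched with a low-rank adapter on a single linear layer. The abstract does not say which PEFT method the thesis uses, so the LoRA-style update below is an assumption chosen only for illustration, and the class name `LowRankAdapter` and its parameters are hypothetical.

```python
import numpy as np

class LowRankAdapter:
    """Frozen weight W plus a trainable low-rank update scale * (B @ A)."""

    def __init__(self, weight: np.ndarray, rank: int, alpha: float = 1.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                               # frozen, pre-trained
        self.A = rng.standard_normal((rank, d_in)) * 0.01  # trainable
        self.B = np.zeros((d_out, rank))                   # trainable; zero init makes the adapter a no-op at start
        self.scale = alpha / rank

    def forward(self, x: np.ndarray) -> np.ndarray:
        # Base projection plus the low-rank correction.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_fraction(self) -> float:
        trainable = self.A.size + self.B.size
        return trainable / (trainable + self.weight.size)

# Hypothetical layer size, used only to show the parameter budget.
frozen = np.random.default_rng(1).standard_normal((768, 768))
adapter = LowRankAdapter(frozen, rank=8)
print(f"trainable share of this layer: {adapter.trainable_fraction():.2%}")
```

In this sketch only `A` and `B` would receive gradient updates, which is what keeps the count of trainable parameters small relative to the frozen weight.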
Across the empirical chapters, the long tail emerges as a consistent stress test for both themes, because rare labels and heavy-tailed distributions amplify the need for separability, calibration, and stable ranking behaviour. The work is also motivated by methodological goals that follow from the realities of constrained NLP research. First, efficiency claims must be supported by controlled comparisons that separate architectural or adaptation effects from incidental differences in training recipes. Second, evaluation must reflect operational usage, including strict matching criteria in token-level clinical extraction and ranking-sensitive metrics in institutional multi-label indexing. Third, analysis must identify where residual errors concentrate, particularly across different label-frequency groups, because a small aggregate gap can mask severe failures in the tail. The thesis therefore focuses on experimental designs and reporting practices that clearly spell out how resources are balanced, including counts of trainable parameters, memory usage, and runtime indicators relevant to efficiency, in addition to task performance. The remainder of the thesis is structured to progressively build this argument. Chapter 2 reviews the foundations required for the contributions that follow, covering Transformer architectures, isotropy and anisotropy in neural representations, parameter-efficient fine-tuning methods, prompt-based and chain-of-prompt techniques, and the specific challenges of extreme multi-label learning in NLP. Chapter 3 develops the architectural thread, introducing isotropy-oriented redesigns and empirical validation across representative tasks, with the objective of demonstrating that mitigating degeneration can be both general and efficiency-compatible.
Chapter 4 presents the clinical JERL study, formulating diagnosis recognition and ICD linking under strict token-level criteria in bilingual, noisy notes and evaluating PEFT as a constrained specialization strategy with careful attention to long-tail behaviour. Chapter 5 addresses parliamentary EuroVoc indexing under extreme label spaces, proposing a resource-aware pipeline that combines summarization and hierarchical prompt chaining, and evaluating PEFT-adapted small decoder-only models with an emphasis on ranking fidelity and shortlist quality. Finally, Chapter 6 consolidates findings across all chapters, articulates the main contributions, and outlines future directions that extend the thesis perspective to contemporary small and large language model families, to geometry-aware PEFT, and to agentic workflows for institutional indexing. Taken together, the thesis advances a unified view of optimisation under constraints: improvements do not require scaling alone, but can be achieved through targeted architectural choices that preserve discriminative structure and through parameter-efficient adaptation mechanisms that specialize models responsibly in domains where compute, memory, and data are limited. This framing is scientific, linking geometric properties of representation spaces to observable model behaviour, and practical, offering concrete designs and workflows that can be used in real deployments without the full cost of modern large-scale training pipelines.
Description:	Doctoral Programme in Telecommunication Technologies and Computational Engineering, Universidad de Las Palmas de Gran Canaria
URI:	https://accedacris.ulpgc.es/jspui/handle/10553/169867
Appears in Collections:	Doctoral thesis

Adobe PDF (10,8 MB)
