Pruning the Paradox: How CLIP's Most Informative Heads Enhance Performance While Amplifying Bias

Intel Labs · Oracle · Thoughtworks

Abstract

CLIP is one of the most popular foundation models and is heavily used for many vision-language tasks, yet little is known about its inner workings. As CLIP is increasingly deployed in real-world applications, it is becoming even more critical to understand its limitations and embedded social biases in order to mitigate potentially harmful downstream consequences.

However, the question of what internal mechanisms drive both the impressive capabilities and the problematic shortcomings of CLIP has largely remained unanswered. To bridge this gap, we study the conceptual consistency of text descriptions for attention heads in CLIP-like models. Specifically, we propose the Concept Consistency Score (CCS), a novel interpretability metric that measures how consistently individual attention heads in CLIP models align with specific concepts.

Our soft-pruning experiments reveal that high CCS heads are critical for preserving model performance, as pruning them leads to a significantly larger performance drop than pruning random or low CCS heads. Notably, we find that high CCS heads capture essential concepts and play a key role in out-of-domain detection, concept-specific reasoning, and video-language understanding. Moreover, we show that high CCS heads learn spurious correlations that amplify social biases. These results position CCS as a powerful interpretability metric for exposing the paradox of performance and social biases in CLIP models.


Concept Consistency Score (CCS)

We introduce the Concept Consistency Score (CCS) as a systematic metric for analyzing the concepts (properties) learned by transformer layers and attention heads in CLIP-like models. This score quantifies the alignment between the textual representations produced by a given head and an assigned concept label.

Figure: Illustration of our approach to computing Concept Consistency Score for each attention head.

From each layer and attention head of the CLIP model, we obtain a set of five textual outputs, denoted {T1, T2, T3, T4, T5} and referred to as TEXTSPANs. These outputs serve as a textual approximation of the concepts encoded by the head. Using in-context learning with ChatGPT, we analyze the five TEXTSPAN outputs and infer a concept label Ch that best represents the dominant concept captured by attention head h. This ensures that the label is data-driven and reflects the most salient pattern learned by the head. To assess the consistency of a head with respect to its assigned concept label, we employ three state-of-the-art foundation models, GPT-4o, Gemini 1.5 Pro, and Claude Sonnet, as external evaluators. For each TEXTSPAN Ti associated with head h, the LLM judge determines whether it aligns with the assigned concept Ch. The Concept Consistency Score (CCS) for head h is then computed as:

$$\mathrm{CCS}(h) = \sum_{i=1}^{5} \mathbb{1}\left[\, T_i \text{ aligns with } C_h \,\right]$$

where $\mathbb{1}[\cdot]$ is the indicator function, equal to 1 when the LLM judge deems TEXTSPAN $T_i$ consistent with the concept label $C_h$, so CCS ranges from 0 to 5.
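As a rough illustration, the sketch below computes CCS for one head from its five TEXTSPANs and inferred concept label. The prompt wording and the OpenAI client call are our own assumptions for illustration; the paper uses GPT-4o, Gemini 1.5 Pro, and Claude Sonnet as judges.

```python
# Minimal sketch of computing CCS for one attention head with an LLM judge.
# The prompt wording and use of the OpenAI client are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_alignment(textspan: str, concept: str) -> bool:
    """Ask the LLM judge whether a TEXTSPAN aligns with the concept label."""
    prompt = (
        f"Concept: {concept}\n"
        f"Description: {textspan}\n"
        "Does the description align with the concept? Answer 'yes' or 'no'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def concept_consistency_score(textspans: list[str], concept: str) -> int:
    """CCS(h): how many of the head's five TEXTSPANs are judged consistent with C_h."""
    return sum(judge_alignment(t, concept) for t in textspans)
```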

We define CCS@K as the fraction of attention heads in a CLIP model that have a Concept Consistency Score (CCS) of K. This metric provides a global measure of how many heads strongly encode interpretable concepts. A higher CCS@K value indicates that a greater proportion of heads exhibit strong alignment with a single semantic property. Mathematically, CCS@K is defined as:

$$\mathrm{CCS@}K = \frac{\left|\{\, h : \mathrm{CCS}(h) = K \,\}\right|}{H}$$

where $H$ is the total number of attention heads in the model.
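For completeness, here is a minimal sketch of CCS@K given per-head CCS values (the dictionary name is illustrative):

```python
# Sketch: CCS@K as the fraction of heads whose CCS equals K.
# `ccs_per_head` maps (layer, head) -> CCS value in {0, ..., 5}.
def ccs_at_k(ccs_per_head: dict, k: int) -> float:
    return sum(score == k for score in ccs_per_head.values()) / len(ccs_per_head)

# e.g. fraction of high-CCS heads: ccs_at_k(ccs_per_head, k=5)
```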
Figure: Examples of high, moderate and low CCS heads.
Figure: Count of high, medium and low CCS heads in CLIP models.

Evaluating LLM Judgment Alignment with Human Annotations

In the previous section, we introduced the Concept Consistency Score (CCS), computed using three LLM judges as external evaluators. This raises an important question: are LLM evaluations reliable and aligned with human assessments? To investigate this, we conducted a human evaluation study comparing LLM-generated judgments with human annotations. We selected 100 TEXTSPAN descriptions from three different models, along with their assigned concept labels, and asked one of the authors to manually assess the semantic alignment between each span and its corresponding label. The table below reports the agreement metrics between human and LLM evaluations, including Cohen’s Kappa, Spearman’s ρ, and Kendall’s τ.

Table: Agreement between human judgment and LLM judgment on CCS labelling. SC denotes Spearman’s correlation.
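For reference, these agreement metrics can be computed with standard scipy and scikit-learn routines; a minimal sketch, assuming `human` and `llm` are parallel lists of binary align/not-align judgments per TEXTSPAN:

```python
# Sketch: agreement between human and LLM alignment judgments.
from scipy.stats import kendalltau, spearmanr
from sklearn.metrics import cohen_kappa_score

def agreement(human: list[int], llm: list[int]) -> dict:
    rho, _ = spearmanr(human, llm)
    tau, _ = kendalltau(human, llm)
    return {
        "cohen_kappa": cohen_kappa_score(human, llm),
        "spearman_rho": rho,
        "kendall_tau": tau,
    }
```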

Interpretable CLIP Models: The Role of CCS.

In this section, we examine the role of the Concept Consistency Score (CCS) in revealing CLIP’s decision-making process, focusing on the question: how does CCS provide deeper insights into the functional role of individual attention heads in influencing downstream tasks? To explore this, we perform a soft-pruning analysis by zeroing out the attention weights of heads with extreme CCS values: high CCS (CCS = 5) and low CCS (CCS ≤ 1). This approach disables selected heads without modifying the model architecture. As shown in the table below, pruning high-CCS heads consistently causes significant drops in zero-shot classification performance across CIFAR-10, CIFAR-100, and FOOD-101, while pruning low-CCS heads has a minimal effect. This performance gap demonstrates that CCS effectively identifies heads encoding critical, concept-aligned information, making it a reliable tool for interpreting CLIP’s internal decision-making mechanisms.

Table: Accuracy comparison of various CLIP models on CIFAR-10, CIFAR-100 and FOOD-101 datasets.
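To make the soft-pruning setup concrete, below is a minimal sketch of one way to ablate a single head in CLIP’s vision encoder under the HuggingFace `transformers` CLIPModel layout. It zeroes the slice of the attention output projection that carries that head’s contribution; this is an illustrative implementation rather than the authors’ exact procedure, and the (layer, head) indices are placeholders.

```python
# Sketch: soft-prune one attention head in CLIP's vision encoder by zeroing the
# slice of the output projection that carries that head's contribution, so the
# head is disabled without changing the architecture.
import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def soft_prune_vision_head(model: CLIPModel, layer: int, head: int) -> None:
    attn = model.vision_model.encoder.layers[layer].self_attn
    d = attn.head_dim
    # Columns [head*d : (head+1)*d] of out_proj receive this head's output.
    attn.out_proj.weight[:, head * d : (head + 1) * d] = 0.0

for layer, head in [(10, 3), (11, 7)]:  # placeholder (layer, head) indices
    soft_prune_vision_head(model, layer, head)
```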

Pruning an equal number of high CCS, low CCS, and random heads.

In the previous section, we showed that attention heads with high Concept Consistency Scores (CCS) are crucial to CLIP’s performance. To validate whether these heads are truly more important than others, we perform a controlled comparison against random pruning. Specifically, we randomly prune the same number of attention heads (excluding high-CCS heads) and repeat this across five seeds, averaging the results. As illustrated in the figure below, pruning high-CCS heads consistently causes a significantly larger drop in zero-shot accuracy than random pruning across datasets and model variants. In contrast, random pruning results in only minor performance degradation, highlighting the functional importance of high-CCS heads. Interestingly, we also find that larger CLIP models show a smaller performance gap between high-CCS and random pruning, suggesting that larger architectures may be more robust due to greater redundancy or more distributed representations.

Figure: Zero-shot performance comparison for CIFAR-10, CIFAR-100, and Food-101 datasets under different pruning strategies. For random pruning, results are averaged across five runs.
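A minimal sketch of the random-pruning control, assuming `all_heads` and `high_ccs_heads` are lists of (layer, head) pairs and that `prune` and `evaluate_zero_shot` are hypothetical helpers:

```python
# Sample the same number of heads as the high-CCS set, excluding high-CCS heads,
# and average zero-shot accuracy over five seeds.
import random

def sample_random_heads(all_heads, high_ccs_heads, n, seed):
    excluded = set(high_ccs_heads)
    rng = random.Random(seed)
    return rng.sample([h for h in all_heads if h not in excluded], n)

# accs = []
# for seed in range(5):
#     pruned = sample_random_heads(all_heads, high_ccs_heads, len(high_ccs_heads), seed)
#     accs.append(evaluate_zero_shot(prune(model, pruned), dataset))  # hypothetical helpers
# mean_acc = sum(accs) / len(accs)
```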

We also conducted experiments in which we pruned an equal number of high-CCS and low-CCS attention heads across multiple datasets (CIFAR-100, FOOD-101, Country-211). Results are shown in the table below. We observe that pruning high-CCS heads leads to a substantially larger performance drop, even when the number of pruned heads is held constant. This rules out the explanation that the observed degradation is merely due to pruning more heads in the high-CCS condition. Taken together, these findings support CCS as a reliable and interpretable metric for identifying concept-relevant heads and offer deeper insights into how CLIP organizes conceptual information.

Table: Accuracy comparison of various CLIP models on CIFAR-100, FOOD-101 and Country-211 datasets.

High CCS heads are crucial for out-of-domain (OOD) detection

While our earlier experiments primarily focused on in-domain datasets such as CIFAR-10 and CIFAR-100 to validate the Concept Consistency Score (CCS), understanding model behavior under out-of-domain (OOD) conditions is a critical step toward evaluating model robustness. The table below reports results on the ImageNet-A and ImageNet-R datasets. We observe that pruning heads with high CCS scores leads to a substantial degradation in model performance, underscoring the critical role these heads play in the model’s decision-making process.

Table: Accuracy comparison of various CLIP models on Country-211, Oxford-pets, ImageNet-A and ImageNet-R datasets.

High CCS heads are crucial for concept-specific tasks.

To investigate the functional role of high Concept Consistency Score (CCS) heads, we conduct concept-specific pruning experiments. In these experiments, we prune heads with high CCS scores corresponding to a target concept (e.g., locations) and evaluate the model’s performance on tasks aligned with that concept, such as location classification. In contrast, we also prune heads associated with unrelated concepts (e.g., animals) and assess the resulting impact on task performance. Our results indicate that pruning concept-aligned high-CCS heads leads to a significant drop in task performance, validating that these heads encode essential concept-relevant information. In more general classification tasks, object-related heads consistently exhibit a greater impact on performance than location or color heads.

Figure: Zero-shot results on Country-211 (location) dataset.
Figure: Zero-shot results on CIFAR-10 (Objects) dataset.

Impact of CCS pruning on zero-shot video retrieval

To further assess the importance of high-CCS heads for downstream tasks, we conducted zero-shot video retrieval experiments on three popular datasets (MSRVTT, MSVD, and DiDeMo) under different pruning strategies. The results in the figure below show that pruning high-CCS heads consistently leads to a substantial drop in performance across all datasets, demonstrating their critical role in preserving CLIP’s retrieval capabilities. For instance, on MSRVTT and MSVD, high-CCS pruning significantly underperforms low-CCS and random head pruning, which show much milder performance degradation. Interestingly, low-CCS and random head pruning maintain performance much closer to the original unpruned model, indicating that not all attention heads contribute equally to model competence. This consistent trend across datasets highlights that heads with high CCS scores are essential for encoding the concept-aligned information necessary for accurate zero-shot video retrieval.

Figure: Zero-shot performance comparison of the unpruned (original) model with pruning of high CCS, low CCS, and random heads on the video retrieval task.

CLIP’s high-CCS heads encode features that drive social biases.

Previously, we established that high-CCS heads in CLIP models are crucial for image and video tasks and that pruning them leads to a significant drop in performance. We now investigate whether these high-CCS heads learn spurious features that lead to social biases. To do so, we perform soft-pruning experiments on the FairFace and SocialCounterfactuals datasets: given neutral text prompts for 104 occupations, we measure MaxSkew across race and gender in each dataset.
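MaxSkew is commonly computed per prompt over the top-k retrieved images, as the largest log-ratio between a demographic group’s observed share in the top-k and its expected share in the dataset (Geyik et al., 2019); larger values indicate stronger over-representation of some group. Below is a minimal sketch under that definition, with illustrative variable names:

```python
# Sketch of MaxSkew@k for a single neutral occupation prompt.
# retrieved_groups: demographic labels of the top-ranked images for this prompt.
# dataset_groups:   demographic labels of all images, defining the expected share.
import math
from collections import Counter

def max_skew_at_k(retrieved_groups: list[str], dataset_groups: list[str], k: int) -> float:
    observed = Counter(retrieved_groups[:k])
    expected = Counter(dataset_groups)
    skews = []
    for group, count in expected.items():
        p_expected = count / len(dataset_groups)
        p_observed = observed.get(group, 0) / k
        if p_observed > 0:  # groups absent from the top-k cannot be the maximum
            skews.append(math.log(p_observed / p_expected))
    return max(skews)
```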


On the FairFace dataset, pruning high-CCS heads consistently reduces the MaxSkew values for both race and gender across all models. These drops, although modest in some cases, indicate a consistent trend: high-CCS heads are contributing disproportionately to skewed model predictions. The effect is even more evident on the SocialCounterfactuals dataset, where MaxSkew values drop substantially upon pruning high-CCS heads.

Table: Comparison of original and high-CCS pruning on FairFace dataset for race and gender.
Table: Comparison of original and high-CCS soft-pruning on the SocialCounterfactuals dataset for race and gender.

These results reveal a fundamental paradox at the heart of CLIP models: high-CCS heads, while critical for strong performance in tasks such as classification, retrieval, and concept alignment, are also the primary contributors to social bias. Pruning these heads leads to a notable reduction in model bias, as shown in our experiments, but also comes at the cost of reduced performance, a clear tradeoff between fairness and utility.

BibTeX

@article{madasu2025pruning,
      title={Pruning the Paradox: How CLIP's Most Informative Heads Enhance Performance While Amplifying Bias},
      author={Madasu, Avinash and Lal, Vasudev and Howard, Phillip},
      journal={EMNLP},
      year={2025},
    }