Learning without Forgetting for Vision-Language Models

https://arxiv.org/pdf/2305.19270.pdf

    Abstract

Class-Incremental Learning (CIL) or continual learning is a desired capability in

the real world, which requires a learning system to adapt to new tasks without

forgetting former ones. While traditional CIL methods focus on visual information

to grasp core features, recent advances in Vision-Language Models (VLM) have

shown promising capabilities in learning generalizable representations with the aid

of textual information. However, when continually trained with new classes, VLMs

often suffer from catastrophic forgetting of former knowledge. Applying VLMs to

CIL poses two major challenges: 1) how to adapt the model without forgetting; and

2) how to make full use of the multi-modal information. To this end, we propose

PROjectiOn Fusion (PROOF) that enables VLMs to learn without forgetting. To

handle the first challenge, we propose training task-specific projections based on the

frozen image/text encoders. When facing new tasks, new projections are expanded

and former projections are fixed, alleviating the forgetting of old concepts. For the

second challenge, we propose the fusion module to better utilize the cross-modality

information. By jointly adjusting visual and textual features, the model can capture

semantic information with a stronger representation ability. Extensive experiments

on nine benchmark datasets validate that PROOF achieves state-of-the-art performance.


1 Introduction

In our ever-changing world, training data often comes in a stream format with new classes, requiring

a learning system to absorb them continually [19, 18]. To address the challenge of learning emerging

new classes, Class-Incremental Learning (CIL) has been proposed [47]. However, in CIL, the absence

of former classes triggers catastrophic forgetting [16], where learning new concepts overwrites the

knowledge of old ones and results in decline in performance [33]. Numerous efforts have been

made [37, 15, 79, 53, 62, 77] to combat catastrophic forgetting in the machine learning field.

With the rapid development of pre-training techniques [20], recent years have witnessed the transition

of CIL research from training from scratch [67, 21, 78] to utilizing pre-trained models (PTM) [63, 64,

49]. With the help of PTM, e.g., Vision Transformers [13], incremental models are born with strong

transferability to grasp the visual features. Facing the domain gap introduced by the incremental

classes, they only need to learn a limited number of additional parameters [26, 11, 34] as the patches

to bridge the gap, which significantly simplifies the challenge of incremental learning.

While pre-trained ViT-based CIL methods focus on learning the visual features to recognize new

concepts, recent advances in Vision-Language Models (VLM) have demonstrated the potential of

textual information in building generalized feature representations. A typical work, i.e., contrastive


language-image pre-training [46] (CLIP), maps the visual and textual information in the shared

embedding space, enabling robust learning and recognition of concepts from diverse sources. This

integration of visual and textual modalities presents a promising avenue for developing continual

learning models that can effectively adapt to real-world scenarios.

Extending VLMs to CIL faces two significant challenges. First, sequentially tuning the VLM

overwrites the innate generalizability and former concepts, leading to forgetting and poor performance

on future tasks. Second, relying solely on textual information for classification neglects the valuable

cross-modal features present in the multi-modal inputs. To fully utilize this information, it is necessary

to explore methods for cross-modal fusion beyond textual features.

Correspondingly, we aim to turn a VLM into a continual learner that is both retentive and comprehensive. Retentive refers to the model’s ability to maintain its pre-trained capabilities, thereby preserving

generalizability and enabling it to perform well on future tasks without forgetting. Comprehensive

refers to the model’s capacity to integrate and adjust information from multiple modalities. By

leveraging these characteristics, we can mitigate catastrophic forgetting and use cross-modal features

to build more robust classifiers as data evolves.

In this paper, we propose PROjectiOn Fusion (PROOF) to address catastrophic forgetting in VLM.

To make the model retentive, we freeze the pre-trained image/text backbones and append linear

projections on top of them. The task-specific information is encoded in the corresponding projection

layer by mapping the projected features. When facing new tasks, new projections are extended while

old ones are frozen, preserving former knowledge. Besides, we aim to fuse the information from

different modalities via cross-modal fusion, which allows for the query embedding to be adjusted

with context information. Consequently, PROOF efficiently incorporates new classes and meanwhile

resists forgetting old ones, achieving state-of-the-art performance on nine benchmark datasets. We

also investigate the zero-shot performance of VLM with new evaluation protocols and metrics, and

find that PROOF maintains its zero-shot performance with a simple modification.

2 Related Work

Vision-Language Model (VLM) Tuning: Recent years have witnessed the prosperity of research

in VLMs, e.g., CLIP [46], ALIGN [25], CoCa [70], Florence [73], BLIP [31], CLIPPO [54], and

Flamingo [1]. These models are pre-trained on vast amounts of images and texts, achieving a

unified embedding space across modalities. With great generalizability, they can be applied for

downstream tasks in a zero-shot manner. However, a domain gap still exists between the pre-trained

and downstream datasets, requiring further tuning for better performance. CoOp and CoCoOp [85, 84]

apply prompt learning [32] into VLM tuning with learnable prompt tokens. Subsequent works explore

VLM tuning via adapter tuning [17], prompt distribution learning [39], task residual learning [72],

similarity learning [76], descriptor learning [42], and optimal transport mapping [10]. However, they

only focus on adapting VLM to downstream tasks while overlooking the forgetting of former ones.

Class-Incremental Learning (CIL): aims to learn from evolutive data and absorb new knowledge

without forgetting [81]. Replay-based methods [40, 4, 8, 38, 9] save and replay former instances to

recover old knowledge when learning new ones. Knowledge distillation-based methods [47, 33, 14]

build the mapping between models as regularization. Parameter regularization-based methods [27,

2, 74, 3] weigh the importance of different parameters as regularization. Model rectification-based

methods [50, 78, 67, 71] rectify the inductive bias for unbiased predictions. Dynamic networks [69,

58, 82, 59] show strong performance by expanding the network structure as data evolves.

CIL with VLM: Aforementioned CIL methods aim to train an incremental model from scratch,

while it would be easier to start with a pre-trained model [30]. The integration of pre-trained Vision

Transformer [13] into CIL has attracted the attention of the community, and most methods [63,

64, 49] employ parameter-efficient tuning techniques to learn without forgetting. S-Prompt [61]

explores CLIP in domain-incremental learning, but the application of VLM in CIL remains relatively

unexplored. WiSE-FT [66] utilizes weight ensemble for robust finetuning, while it cannot be extended

to multiple tasks. This paper aims to address this research gap by presenting a comprehensive solution

for tuning vision-language models without suffering from forgetting.



3 From Old Classes to New Classes

In this section, we introduce the background information about class-incremental learning and vision

language models. We also discuss the naïve solutions for tuning VLM in CIL.

3.1 Class-Incremental Learning

Given a data stream with emerging new classes, class-incremental learning aims to continually

incorporate the knowledge and build a unified classifier [81]. We denote the sequence of B training

sets without overlapping classes as $\mathcal{D}^{1}, \mathcal{D}^{2}, \cdots, \mathcal{D}^{B}$, where $\mathcal{D}^{b}=\left\{\left(\mathbf{x}_{i}, y_{i}\right)\right\}_{i=1}^{n_{b}}$ is the $b$-th training

set with nb instances. A training instance xi ∈ RD belongs to class yi ∈ Yb. Yb is the label space of

task b, and Yb ∩ Yb′ = ∅ for b ̸= b′. Following the typical CIL setting [47, 22, 67], a fixed number of

exemplars from the former classes are selected as the exemplar set E. During the b-th incremental

stage, we can only access data from Db and E for model training. The target is to build a unified

classifier for all seen classes Yb = Y1 ∪ · · · Yb continually. In other words, we hope to find a model

f(x) : X → Yb that minimizes the expected risk:

$$f^{*}=\underset{f \in \mathcal{H}}{\operatorname{argmin}} \; \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}_{t}^{1} \cup \cdots \cup \mathcal{D}_{t}^{b}}\, \mathbb{I}\left(y \neq f(\mathbf{x})\right), \qquad (1)$$

where $\mathcal{H}$ denotes the hypothesis space and $\mathbb{I}(\cdot)$ is the indicator function. $\mathcal{D}_{t}^{b}$ denotes the data

distribution of task b. Following [63, 64, 61], we assume that a pre-trained vision-language model is

available as the initialization for f(x), which will be introduced in Section 3.2.
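For concreteness, the protocol above can be sketched as a simple loop over tasks; `train_task`, `select_exemplars`, and `evaluate` are placeholder hooks for the learner under study, not part of any specific method.

```python
# Minimal sketch of the CIL protocol: B tasks with disjoint label spaces,
# a bounded exemplar set E, and evaluation over all classes seen so far.
def class_incremental_loop(tasks, model, train_task, select_exemplars, evaluate):
    exemplar_set = []                       # E: stored instances of former classes
    seen_classes = []                       # Y_1 ∪ ... ∪ Y_b
    for b, task_data in enumerate(tasks):   # D^1, D^2, ..., D^B
        seen_classes += sorted({y for _, y in task_data})
        train_task(model, task_data + exemplar_set)               # only D^b and E are accessible
        exemplar_set = select_exemplars(task_data, exemplar_set)  # e.g., herding
        acc = evaluate(model, seen_classes)                       # unified classifier over seen classes
        print(f"stage {b + 1}: accuracy on {len(seen_classes)} classes = {acc:.2f}")
```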

3.2 Vision-Language Model

This paper focuses on contrastive language-image pre-training (CLIP) [46] as the VLM. During pre-training, CLIP jointly learns an image encoder $g_i(\cdot): \mathbb{R}^{D} \rightarrow \mathbb{R}^{d}$ and a text encoder $g_t(\cdot): \mathbb{R}^{D_t} \rightarrow \mathbb{R}^{d}$

in a contrastive manner, where D/Dt are input dimensions of image/text, and d is the embedding

dimension. CLIP projects a batch of image-text pairs into a shared embedding space. It maximizes

the cosine similarity of paired inputs and minimizes it for unmatched ones. Benefiting from the

massive training data, CLIP can synthesize a zero-shot classifier that generalizes to unseen classes.

The output of CLIP is formulated as:

$$p(y_i \mid \mathbf{x})=\frac{\exp \left(\cos \left(\mathbf{z}, \mathbf{w}_{i}\right) / \tau\right)}{\sum_{j=1}^{\left|Y_{b}\right|} \exp \left(\cos \left(\mathbf{z}, \mathbf{w}_{j}\right) / \tau\right)}, \qquad (2)$$

where cos(·, ·) denotes cosine similarity, τ is a learnable temperature parameter, and z = gi(x) is the image

embedding. Correspondingly, wi is the text embedding of class yi obtained by feeding templated

texts, e.g., “a photo of a [CLASS]” into the text encoder. We denote the templated text of class i as ti.

Eq. 2 aims to find the most similar text ti that maximizes the cosine similarity to the query image.
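As a reference point, the zero-shot rule of Eq. 2 can be sketched with the open-source CLIP package as follows; the class names and the prompt template are illustrative only.

```python
import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

class_names = ["panda", "cat", "dog"]  # illustrative label space Y_b
texts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

@torch.no_grad()
def zero_shot_probs(image):                  # image: preprocessed tensor [3, H, W]
    z = model.encode_image(image.unsqueeze(0).to(device))   # image embedding z
    w = model.encode_text(texts)                             # text embeddings w_i
    z = z / z.norm(dim=-1, keepdim=True)                     # cosine similarity via
    w = w / w.norm(dim=-1, keepdim=True)                     # normalized dot product
    logits = model.logit_scale.exp() * z @ w.t()             # scaled by 1/τ
    return logits.softmax(dim=-1)                            # p(y_i | x), Eq. 2
```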

3.3 Overcome Forgetting in Class-Incremental Learning

CIL, as a long-standing problem, has garnered significant attention from the research community. In

this section, we introduce two typical solutions for adapting pre-trained models with new classes.

Vision-Based Learning: Traditional CIL methods primarily rely on the image encoder to capture

the patterns of new classes. One such method, L2P [64], leverages visual prompt tuning [26] to

enable incremental updates of a pre-trained Vision Transformer [13]. By keeping the image encoder

frozen, L2P trains a learnable prompt pool Pool and combines it with patch embeddings to obtain

instance-specific embeddings. The optimization target can be formulated as:

$$\mathcal{L}=\ell\left(h\left(\bar{g}_{i}\left(\mathbf{x}_{i}, \text{Pool}\right)\right), y_{i}\right)+\mathcal{L}_{reg}, \qquad (3)$$

where $h(\cdot)$ is the classification head, $\bar{g}_{i}$ is the frozen image encoder, and $\mathcal{L}_{reg}$ is the regularization loss

for prompt selection. By freezing the encoder, Eq. 3 grasps the new pattern with humble forgetting.

CLIP Tuning: The issue of tuning VLM without forgetting in CIL remains unaddressed, as previous

works have solely focused on transferring CLIP to downstream tasks without considering the performance of former tasks. For instance, CoOp [85] converts text inputs into a learnable prompt, i.e.,


$t_i = [V]_1[V]_2 \cdots [V]_M[\mathrm{CLASS}]_i$. The posterior probability in Eq. 2 is transformed into:

$$p(y_i \mid \mathbf{x})=\frac{\exp \left(\cos \left(\mathbf{z}, g_{t}(t_{i})\right) / \tau\right)}{\sum_{j=1}^{\left|Y_{b}\right|} \exp \left(\cos \left(\mathbf{z}, g_{t}(t_{j})\right) / \tau\right)}. \qquad (4)$$

With the help of the learned prompt, Eq. 4 enables the model to be transferred to the downstream

task. However, since the prompt template is shared for all tasks, sequentially tuning CoOp will suffer

catastrophic forgetting of former concepts.

Discussions: Current methods focus on different aspects of CIL. Vision-based methods (e.g., Eq. 3)

address the issue of forgetting but neglect the valuable semantic information conveyed in texts.

Conversely, CLIP’s pre-trained text encoder captures class-wise relationships that can enhance model

learning. Meanwhile, transfer learning methods (e.g., Eq. 4) effectively leverage the cross-modal

information, while sequentially tuning them suffers the catastrophic forgetting of former concepts. Is

it possible to combine the cross-modal information and meanwhile resist catastrophic forgetting?

4 PROOF: Projection Fusion for VLM

Observing the limitations of typical vision-based methods in utilizing textual information and

forgetting in CLIP tuning, we aim to leverage cross-modality knowledge in CLIP while effectively

mitigating forgetting. To this end, we must make the model retentive and comprehensive. Retentive

represents the ability to adapt to downstream tasks without forgetting, and we propose projections

to map the pre-trained features in the projected feature space. Our unique training strategy ensures

the preservation of former knowledge by freezing old projections and expanding new ones for new

tasks. The comprehensive aspect involves co-adapting and utilizing cross-modal information to

enhance unified predictions. The query instance’s embedding is influenced by both visual and textual

information, allowing for instance-specific adaptation and enabling comprehensive predictions.

In the following sections, we introduce the learning paradigm and the co-adaptation process. Lastly,

we provide detailed guidelines for training and inference.

4.1 Expandable Feature Projection

CLIP is known for its strong zero-shot performance [46], i.e., Eq. 2 obtains competitive results even

without explicit training on the specific tasks. However, given the domain gap between pre-trained

and downstream tasks, an adaptation process is still necessary to capture the characteristics of the

latter. Specifically, we introduce a linear layer (denoted as “projection”) which is appended after the

frozen image and text embeddings to facilitate the matching of pair-wise projected features. Denoting

the projection of image/text as Pi(·) : Rd → Rd and Pt(·) : Rd → Rd, Eq. 2 is transformed into:

$$p(y_i \mid \mathbf{x})=\underbrace{\frac{\exp \left(\cos \left(P_{i}(\mathbf{z}), P_{t}(\mathbf{w}_{i})\right) / \tau\right)}{\sum_{j=1}^{\left|Y_{b}\right|} \exp \left(\cos \left(P_{i}(\mathbf{z}), P_{t}(\mathbf{w}_{j})\right) / \tau\right)}}_{\text{Projected Matching}}. \qquad (5)$$

We denote the classification based on Eq. 5 as $f_{\mathrm{PM}}(\mathbf{x})$. By freezing the image and text encoders, it

aligns the downstream features in the projected space, allowing the model to encode the relevant

downstream information into projection layers. Since the pre-trained model outputs generalizable

features, the projection layer learns to recombine features in a data-driven manner. For instance, in a

task involving ‘birds,’ the projection would assign a higher weight to features like ‘beaks’ and ‘wings.’

This adaptation enables the projected features to better discern and recognize downstream tasks.
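A minimal sketch of the projected matching head in Eq. 5 is given below, assuming the frozen encoders already provide d-dimensional embeddings; the module and variable names are ours, not the official implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class ProjectedMatching(nn.Module):
    """Linear projections on frozen image/text embeddings + cosine logits (Eq. 5)."""
    def __init__(self, dim=512, tau=0.01):   # tau: illustrative temperature
        super().__init__()
        self.proj_img = nn.Linear(dim, dim, bias=False)   # P_i
        self.proj_txt = nn.Linear(dim, dim, bias=False)   # P_t
        self.tau = tau

    def forward(self, z, w):
        # z: [B, d] frozen image embeddings; w: [C, d] frozen text embeddings.
        pz = F.normalize(self.proj_img(z), dim=-1)
        pw = F.normalize(self.proj_txt(w), dim=-1)
        return pz @ pw.t() / self.tau                     # [B, C] logits f_PM(x)
```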

Expandable Projections: However, sequentially training a single projection layer still leads to

forgetting of former tasks, resulting in confusion when combining old and new concepts. To

this end, we expand task-specific projections for each new task. Specifically, we append a newly

initialized projection layer $P_{i}^{b}, P_{t}^{b}$ when a new task $\mathcal{D}^{b}$ arrives. This results in a set of projections $\{P_{i}^{1}, P_{i}^{2}, \cdots, P_{i}^{b}\}$, $\{P_{t}^{1}, P_{t}^{2}, \cdots, P_{t}^{b}\}$, and we adopt the aggregation as the output, i.e.,

$$P_{i}(\mathbf{z})=\sum_{m=1}^{b} P_{i}^{m}(\mathbf{z}), \qquad P_{t}(\mathbf{w})=\sum_{n=1}^{b} P_{t}^{n}(\mathbf{w}). \qquad (6)$$

In Eq. 6, projected features from different stages are mapped and aggregated to capture the different

emphases of former and latter tasks. For example, former tasks might emphasize ‘beak’ features

for bird recognition, while later tasks may focus on ‘beards’ features to differentiate cats. The aggregation of different projections produces a comprehensive representation of the query instance. By substituting Eq. 6 into Eq. 5, the model aligns the unified features in the joint space.

Figure 1: Illustration of PROOF. The model learns expandable projections and aggregates them to get the aggregated features. The query instance, prototype features, textual features, and context prompts are fed into the cross-modal fusion. The fusion process utilizes self-attention to co-adapt the input set, which outputs the adapted features. The adapted query embedding is separately matched among the visual prototypes and textual features to get the final prediction. Red parts are trainable while gray ones are frozen.

How to resist forgetting of former projections? To overcome forgetting old concepts, we freeze the

projections of former tasks when learning new ones, i.e., $\{\bar{P}_{i}^{1}, \bar{P}_{i}^{2}, \cdots, P_{i}^{b}\}$ (the same for $P_t$). It allows

the newly initialized projection to learn the residual information of new tasks, incorporating new

concepts while preserving the knowledge of former ones. During the learning process of task b, we

optimize the cross-entropy loss to encode the task-specific information into the current projections.

Effect of projections: The illustration of projections is shown in Figure 1 (left). PROOF learns projections based on the pre-trained encoders, which fit new patterns while maintaining the generalizability of the pre-trained model. The parameter number of each projection layer is d × d, which is negligible compared to the pre-trained model. Furthermore, the model learns new projections for new tasks,

and task-specific projections fit new concepts easily. Since we only optimize the current projections

and freeze old ones, the former knowledge is preserved, and forgetting is alleviated.
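The expandable projections of Eq. 6, together with the freezing strategy, can be sketched as follows; this is an illustrative module under our naming, not the released code.

```python
import torch.nn as nn

class ExpandableProjection(nn.Module):
    """One linear projection per task, summed at the output (Eq. 6)."""
    def __init__(self, dim=512):
        super().__init__()
        self.layers = nn.ModuleList()    # {P^1, ..., P^b}
        self.dim = dim

    def expand(self):
        # Call at the start of every task (including the first):
        # freeze all existing projections, then append a trainable one.
        for layer in self.layers:
            for p in layer.parameters():
                p.requires_grad_(False)
        self.layers.append(nn.Linear(self.dim, self.dim, bias=False))

    def forward(self, x):
        # Aggregate by summation over all task-specific projections.
        return sum(layer(x) for layer in self.layers)
```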

4.2 Contextualizing Projections with Projection Fusion

In Eq. 5, the projected visual and textual features are directly matched in the joint space. However, it would be beneficial to further refine these features to capture the contextual relationship between

images and texts. For instance, when the query instance is a ‘panda,’ it is desirable to adjust the

visual and textual features in a coherent manner to highlight discriminative attributes such as black

eyes and ears. Similarly, when the query instance is a ‘cat,’ features like beards and tails should be

emphasized. This adjustment process involves jointly adapting the query embedding and the context

(e.g., textual information) to obtain a contextualized embedding. Correspondingly, we propose a

set-to-set function that contextualizes and fuses the query embeddings and contextual information.

Specifically, we denote the adaptation function as T (·). It receives the query instance and context

information as bags, i.e., [Pi(z), Context], and outputs the set of adjusted embeddings while being

permutation-invariant: $\mathcal{T}\left(\left[P_{i}(\mathbf{z}), \text{Context}\right]\right)=\left[\tilde{P}_{i}(\mathbf{z}), \widetilde{\text{Context}}\right]$. $\mathcal{T}(\cdot)$ encodes the set information

and performs adaptation on each component. In the following, we describe the construction of the

context information Context and provide details on the implementation of the set-to-set function.

How to define the context? In Eq. 5, the mapping is established between the query instance and

the textual information (i.e., classifiers). The classifiers represent the typical textual description

of the corresponding class, i.e., the common feature. Hence, a naïve idea is to utilize textual

features as the context, i.e., Context = W, W = [Pt(w1), Pt(w2), · · · , Pt(w|Yb|)] ∈ R|Yb|×d

is the concatenation of all textual classifiers. However, recent works find an inherent domain

gap [35] between the visual and textual embeddings in VLM. The gap leads to visual and textual

embeddings residing in two separate clusters in the embedding space, which hinders effective

pair-wise mapping. Correspondingly, we leverage visual prototype features [51] as a useful tool

for capturing the common characteristics of each class. Define the visual prototype of class k as:


$\mathbf{p}_{k}=\frac{1}{N} \sum_{j=1}^{\left|\mathcal{D}^{b}\right|} \mathbb{I}\left(y_{j}=k\right) g_{i}\left(\mathbf{x}_{j}\right)$, where $N=\sum_{j=1}^{\left|\mathcal{D}^{b}\right|} \mathbb{I}\left(y_{j}=k\right)$. They are calculated via a forward pass

at the beginning of each incremental stage and stay fixed in subsequent tasks. Visual prototypes

are representative features of the corresponding class, which can serve as the visual context to

adjust the embeddings. Hence, we augment the context with projected visual information, i.e.,

Context = [P, W], where P = [Pi(p1), Pi(p2), · · · , Pi(p|Yb|)] ∈ R|Yb|×d is the concatenation of

all visual prototypes. Combining prototypes from multiple modalities help the model adapt and fuse

information in a cross-modal manner, which goes beyond simple visual-textual matching.
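A possible implementation of the prototype extraction defined above is sketched below; the data-loader interface and dimensions are assumptions for illustration.

```python
import torch

@torch.no_grad()
def class_prototypes(image_encoder, loader, num_classes, dim=512, device="cpu"):
    """Class-mean of frozen image embeddings, computed once per stage and then fixed."""
    proto_sum = torch.zeros(num_classes, dim, device=device)
    counts = torch.zeros(num_classes, device=device)
    for images, labels in loader:                       # instances of the new task D^b
        labels = labels.to(device)
        z = image_encoder(images.to(device)).float()    # frozen g_i(x)
        proto_sum.index_add_(0, labels, z)
        counts.index_add_(0, labels, torch.ones_like(labels, dtype=proto_sum.dtype))
    return proto_sum / counts.clamp(min=1).unsqueeze(1)  # p_k for each class k
```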

Implementing T with Self-Attention: In our implementation, we use the self-attention (SA)

mechanism [55, 36] as the cross-modal fusion function T . Being permutation invariant, SA is

good at outputting adapted embeddings even with long dependencies, which naturally suits the

characteristics of the adaptation function. Specifically, SA keeps the triplets (query $\mathcal{Q}$, key $\mathcal{K}$, and value $\mathcal{V}$). The inputs are projected into the same space, i.e., $K=W_{K}^{\top}\left[\mathbf{k}_{k} ; \forall \mathbf{k}_{k} \in \mathcal{K}\right] \in \mathbb{R}^{d \times|\mathcal{K}|}$. Similar projections are made for $\mathcal{Q}$ and $\mathcal{V}$. The query $\mathbf{x}^{q} \in \mathcal{Q}$ is matched against a list of keys $K$ where each key has a value $V$. The output is the sum of all the values weighted by the proximity of the key to the query point:

$$\tilde{P}_{i}(\mathbf{z})=P_{i}(\mathbf{z})+\sum_{k} \alpha_{q k} V_{:, k}, \qquad (7)$$

where $\alpha_{q k} \propto \exp \left(\frac{P_{i}(\mathbf{z})^{\top} W_{Q} \cdot K_{:, k}}{\sqrt{d}}\right)$ and $V_{:, k}$ is the $k$-th column of $V$. The adaptation process is the same for other components in Context. Specifically, we have $\mathcal{Q}=\mathcal{K}=\mathcal{V}=\left[P_{i}(\mathbf{z}), \text{Context}\right]$.
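The fusion function T can be sketched with a standard single-head attention layer as follows; the residual connection and layer normalization follow the "Add & LN" block in Figure 1 (right), while the exact layer configuration here is an assumption.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Self-attention over [query, visual prototypes, textual classifiers, context prompts]."""
    def __init__(self, dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query, prototypes, text_feats, context_prompts):
        # query: [B, d]; prototypes/text_feats: [C, d]; context_prompts: [L, d]
        B = query.size(0)
        ctx = torch.cat([prototypes, text_feats, context_prompts], dim=0)  # Context
        tokens = torch.cat([query.unsqueeze(1),
                            ctx.unsqueeze(0).expand(B, -1, -1)], dim=1)    # [B, 1 + |Context|, d]
        attended, _ = self.attn(tokens, tokens, tokens)                    # Q = K = V (Eq. 7)
        tokens = self.norm(tokens + attended)                              # add & LN
        return tokens[:, 0], tokens[:, 1:]   # adapted query, adapted context set
```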

Effect of Cross-Modal Fusion: The illustration of the projection fusion is shown in Figure 1 (right).

We utilize the visual and textual information of seen classes as context information to help adjust the

instance-specific embeddings. The fusion model is trained incrementally to adjust embeddings to

reflect the context information as data evolves. With the contextualized embeddings, we can conduct

the visual mapping and textual matching:

$$p(y_i \mid \mathbf{x})=\underbrace{\frac{\exp \left(\cos \left(\tilde{P}_{i}(\mathbf{z}), \tilde{P}_{i}(\mathbf{p}_{i})\right) / \tau\right)}{\sum_{j=1}^{\left|Y_{b}\right|} \exp \left(\cos \left(\tilde{P}_{i}(\mathbf{z}), \tilde{P}_{i}(\mathbf{p}_{j})\right) / \tau\right)}}_{\text{Visual Matching}}+\underbrace{\frac{\exp \left(\cos \left(\tilde{P}_{i}(\mathbf{z}), \tilde{P}_{t}(\mathbf{w}_{i})\right) / \tau\right)}{\sum_{j=1}^{\left|Y_{b}\right|} \exp \left(\cos \left(\tilde{P}_{i}(\mathbf{z}), \tilde{P}_{t}(\mathbf{w}_{j})\right) / \tau\right)}}_{\text{Textual Matching}}. \qquad (8)$$

In Eq. 8, the model assigns logits to the query instance by the similarity to the adapted visual and

textual prototypes. The incorporation of cross-modal matching enhances the prediction performance.

Learning Context Prompts: In addition to visual prototypes and textual classifiers, we also introduce

a set of learnable context prompts {c1, · · · , cb}, ci ∈ Rc×d to be optimized as data evolves. c denotes

the length of each prompt. Similar to projection layers, we make the context prompts expandable to

catch the new characteristics of new tasks. We initialize a new context prompt while learning a new

task and freeze the others, i.e., $\{\bar{\mathbf{c}}_{1}, \bar{\mathbf{c}}_{2}, \cdots, \mathbf{c}_{b}\}$. The context prompts serve as adaptable context information,

enhancing the co-adaption. Hence, the context information is formulated as Context = [P, W, C],

where C is the aggregation of all context prompts. Note that C only encodes the task-specific

information into the self-attention process, which does not serve as the matching target in Eq. 8.

4.3 Summary of PROOF

In PROOF, we first enable learning new concepts via projected mapping. Then, to accommodate

new concepts without interference from previous ones, we initialize new projections for each new

task and freeze the former ones. Besides, we utilize self-attention to adjust the embeddings of the

query instance and the context information to promote cross-modal fusion. Figure 1 illustrates three

predictions, i.e., projected matching (Eq. 5), visual/textual matching (Eq. 8). We denote these models

as fPM(x), fVM(x), fTM(x), respectively. During training, we optimize the cross-entropy loss:

$$\min _{\left\{P_{i}^{b}, P_{t}^{b}, \mathcal{T}, \mathbf{c}_{b}\right\}} \; \ell\left(f_{\mathrm{PM}}(\mathbf{x}), y\right)+\ell\left(f_{\mathrm{VM}}(\mathbf{x}), y\right)+\ell\left(f_{\mathrm{TM}}(\mathbf{x}), y\right). \qquad (9)$$

In Eq. 9, all pre-trained weights are frozen, and we only optimize these additional parameters. For

inference, we aggregate the three logits, i.e., f(x) = fPM(x) + fVM(x) + fTM(x). We give the

pseudo-code of PROOF in the supplementary.
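For clarity, the joint objective of Eq. 9 and the inference-time aggregation can be sketched as below, where `f_pm`, `f_vm`, and `f_tm` stand for the three logit heads defined above.

```python
import torch.nn.functional as F

def proof_loss(f_pm, f_vm, f_tm, x, y):
    # Eq. 9: cross-entropy on each of the three matching heads.
    return (F.cross_entropy(f_pm(x), y)
            + F.cross_entropy(f_vm(x), y)
            + F.cross_entropy(f_tm(x), y))

def proof_predict(f_pm, f_vm, f_tm, x):
    # Inference: f(x) = f_PM(x) + f_VM(x) + f_TM(x), argmax over seen classes.
    return (f_pm(x) + f_vm(x) + f_tm(x)).argmax(dim=-1)
```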


Table 1: Average and last performance of different methods. The first and second columns represent the methods with and without exemplars. The performance of L2P and DualPrompt are reproduced with the source code with exemplars. The best performance is shown in bold. Full results are reported in supplementary.

Method             | ImageNet-R B0 Inc20 | ImageNet-R B100 Inc20 | CUB B0 Inc20   | CUB B100 Inc20 | UCF B0 Inc10   | UCF B50 Inc10
                   |   Ā      A_B        |   Ā      A_B          |   Ā     A_B    |   Ā     A_B    |   Ā     A_B    |   Ā     A_B
Finetune           |  1.37    0.43       |  1.01    0.88         |  2.06   0.64   |  0.56   0.47   |  4.51   1.59   |  1.21   0.80
Finetune LiT [75]  | 64.88   30.42       | 57.75   29.77         | 58.15  35.28   | 51.95  35.96   | 79.25  64.84   | 81.79  65.4
Finetune CoOp [85] | 60.73   37.52       | 54.20   39.77         | 27.61   8.57   | 24.03  10.14   | 47.85  33.46   | 42.02  24.74
SimpleCIL [83]     | 81.06   74.48       | 76.84   74.48         | 83.81  77.52   | 79.75  77.52   | 90.44  85.68   | 88.12  85.68
ZS-CLIP [46]       | 83.37   77.17       | 79.57   77.17         | 74.38  63.06   | 67.96  63.06   | 75.50  67.64   | 71.44  67.64
CoOp [85]          | 82.40   76.20       | 79.76   77.13         | 77.34  68.70   | 74.09  67.47   | 90.13  86.24   | 88.36  85.71
iCaRL [47]         | 72.22   54.38       | 68.67   60.15         | 82.04  74.74   | 78.57  75.07   | 89.47  84.34   | 88.51  84.11
MEMO [82]          | 80.00   74.07       | 76.72   73.95         | 77.32  65.69   | 72.88  66.41   | 84.02  74.08   | 82.58  75.48
L2P [64]           | 75.73   67.22       | 74.15   71.20         | 79.23  68.54   | 75.85  71.12   | 88.71  83.93   | 86.51  83.22
DualPrompt [63]    | 78.47   70.82       | 72.98   69.18         | 83.21  74.94   | 78.06  74.27   | 89.48  85.41   | 86.96  84.65
PROOF              | 85.34   80.10       | 82.32   80.30         | 84.93  79.43   | 81.67  79.18   | 92.34  89.92   | 91.70  89.16

Figure 2: Incremental performance of different methods on (a) Aircraft Base0 Inc10, (b) CIFAR100 Base0 Inc10, (c) Cars Base0 Inc10, (d) SUN Base150 Inc30, (e) Food Base50 Inc10, and (f) ObjectNet Base100 Inc20. We report the performance gap after the last incremental stage of PROOF and the runner-up method at the end of the line. Finetune-based methods in Table 1 are not plotted due to their inferior performance.
5 Experiment

In this section, we compare PROOF with state-of-the-art methods on benchmark datasets

to investigate the capability of overcoming forgetting. We also conduct ablations to analyze the effect

of each component in the model. Furthermore, we address a fundamental issue in VLM training

known as zero-shot degradation. Finally, we extend PROOF to other VLMs to verify the universality

of the proposed method. Further details and experimental results can be found in the supplementary.

5.1 Experimental Setup
Dataset: Following the benchmark CIL settings [47, 64, 63, 71, 83], we evaluate the performance

on CIFAR100 [29], CUB200 [57], ObjectNet [6], and ImageNet-R [12]. We also follow the

setting in VLM tuning [85], and formulate FGVCAircraft [41], StanfordCars [28], Food101 [7],

SUN397 [68] and UCF101 [52] into CIL setting. Specifically, we sample (a subset of) 100 classes

from CIFAR100, Aircraft, Cars, Food, UCF, 200 classes from CUB200, ObjectNet, ImageNet-R, and

300 classes from SUN to ease the data split. Following [47], the class order of training classes is

shuffled with random seed 1993. The dataset splits are denoted as Base-x, Inc-y, where x represents

the number of classes in the first stage, and y represents the number of new classes in each subsequent

task. x = 0 means each task contains y classes. More details are reported in the supplementary.
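A sketch of the Base-x, Inc-y split described above is given below; the random seed follows the protocol of [47], while the function itself is only illustrative.

```python
import numpy as np

def split_classes(num_classes, base, inc, seed=1993):
    """Shuffle the class order with a fixed seed and carve it into a base task plus increments."""
    order = np.random.RandomState(seed).permutation(num_classes).tolist()
    first = base if base > 0 else inc          # "B0" means every task holds `inc` classes
    tasks = [order[:first]]
    for start in range(first, num_classes, inc):
        tasks.append(order[start:start + inc])
    return tasks                               # list of per-task class-label lists

# e.g., split_classes(100, base=0, inc=10) -> 10 tasks of 10 classes each
```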


Figure 3: Ablation study. Left: experiments on 9 benchmarks with OpenAI weights. Middle: ablation study on compositional components in PROOF. Every part improves the performance of CIL. Right: $A_B$ and $\bar{A}$ with the change of context prompts. The performance is robust to the change of context prompt length.
Comparison methods: We first compare to SOTA CIL methods iCaRL [47], MEMO [82], SimpleCIL [83], L2P [64], and DualPrompt [63]. Denote the baseline of sequential finetuning as Finetune;

we combine it with different tuning techniques, e.g., LiT [75] and CoOp [85]. We also report the

zero-shot performance of CLIP as ZS-CLIP by matching the query instance to the template (Eq. 2).

Implementation details: We deploy all methods with PyTorch [44] and PyCIL [80] on Tesla V100.

We use the same network backbone ViT-B/16 for all compared methods for fair comparison. We

experiment with two commonly used pre-trained CLIP weights, i.e., OpenAI [46] and OpenCLIP

LAION-400M [24]. The model is trained with a batch size of 64 for 5 epochs, and we use SGD with

momentum for optimization. The learning rate starts from 0.001 and decays with cosine annealing.

Following [47], we use the herding [65] algorithm to select 20 exemplars per class for rehearsal.

The context prompt length is set to 3, and the head of self-attention is set to 1. The template for

classification is the same as [43]. The source code will be made publicly available upon acceptance.

Performance Measure: Denote the Top-1 accuracy after the $b$-th stage as $A_b$; we follow [47] to use $A_B$ (last stage performance) and $\bar{A}=\frac{1}{B} \sum_{b=1}^{B} A_{b}$ (average performance) for evaluation.
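In code, the two measures reduce to the following sketch, where `stage_accuracies` collects the Top-1 accuracy after each incremental stage.

```python
def last_and_average(stage_accuracies):
    # stage_accuracies: [A_1, A_2, ..., A_B], Top-1 accuracy over seen classes per stage.
    a_last = stage_accuracies[-1]                            # A_B
    a_avg = sum(stage_accuracies) / len(stage_accuracies)    # (1/B) * sum_b A_b
    return a_last, a_avg
```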

5.2 Benchmark Comparison

We report the results on nine benchmark datasets using ViT-B/16 (OpenCLIP LAION-400M) in

Table 1 and Figure 2. These splits include the scenarios with large and small base classes. Notably,

PROOF consistently achieves the best performance among all the methods compared. Sequential

finetuning of the model with contrastive loss leads to significant forgetting, irrespective of the tuning

techniques employed (e.g., LiT and CoOp). Since SimpleCIL and ZS-CLIP do not finetune the model

parameters, they achieve competitive results by transferring the knowledge from the pre-training

stage into the downstream tasks. However, most methods achieve better performance than ZS-CLIP,

indicating the importance of incremental learning on downstream tasks.

Specifically, we can draw three key conclusions from these results. 1) The first stage performance of

PROOF surpasses that of the typical prompt learning method, CoOp, thus validating the effectiveness

of learning projections for downstream tasks. 2) The performance curve of PROOF consistently

ranks at the top across all methods, demonstrating its capability to resist forgetting. 3) Compared to

vision-only methods (i.e., L2P and DualPrompt), PROOF exhibits substantial improvement, indicating

textual and visual information can be co-adapted to facilitate incremental learning.

5.3 Ablation Study
Different backbone weights: The comparison in Section 5.2 is based on LAION-400M pre-trained

CLIP. As another popular pre-trained weight, we also explore the performance of the weights provided

by OpenAI. We report the last accuracy AB of four competitive methods on nine benchmarks in

Figure 3(a). We report the full results of the incremental performance in the supplementary. As

depicted in the figure, PROOF still performs the best on all datasets among all compared methods.

Compositional components: We experiment on CIFAR100 B0 Inc10 to investigate the importance

of each part in PROOF. Specifically, we compare the performance of PROOF and its sub-modules,

i.e., projections and cross-modal fusion. The results, shown in Figure 3(b), indicate that training

expandable projections or the fusion module individually can both enhance the performance of vanilla

CLIP. This suggests that the expandable task representation and cross-modal information can help

the learning process. Furthermore, when combining them together, we find ‘Projection & Fusion’ further shows better performance than any of them, verifying that they can work together by fusing the expandable representations. Lastly, when incorporating the context prompts, the model shows the best performance among all variations, verifying the effectiveness of expandable task-specific prompts in incremental learning. Ablations verify the importance of each component in PROOF.

Figure 4: Experiment on zero-shot performance. Left: accuracy on unseen classes during incremental learning. Middle: LAION score during incremental learning. Right: accuracy of seen, unseen, and harmonic mean (HM) at the last incremental stage. PROOF† strikes a balance between adaptivity and the ZS performance.

Number of context prompts: Figure 3(b) verifies the strong performance of context prompts, and

we explore the appropriate length c of the context prompt on CIFAR100 B0 Inc10. By varying the

number of c among {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30, 50, 100}, we report the average performance and

last performance of PROOF in Figure 3(c). As shown in the figure, the performance of PROOF is

robust with the change of the prompt length, and we set c = 3 as the default length.

5.4 Exploring Zero-Shot Performance

CLIP is known to have the zero-shot (ZS) ability, i.e., even if the model has not been trained for

recognizing the image, it can still predict the possibility of an image x belonging to the class y by

matching the cosine similarity via Eq. 2. The strong generalizability of CLIP makes it a popular

model in computer vision. However, in CIL, the model is continuously updated with the downstream

task, which weakens the generalizability and harms the ZS performance [66] on subsequent tasks. In

this section, we explore the ZS performance degradation of CLIP and propose a variation of PROOF to maintain the ZS performance.

Evaluation protocol for ZS performance: Current CIL methods focus on evaluating ‘seen’ classes,

i.e., evaluating Yb = Y1 ∪ · · · Yb after learning task b. However, since CLIP exhibits ZS performance,

we can also assess the performance on ‘unseen’ classes Yu = Yb+1 ∪ · · · YB to investigate the

ZS performance. Correspondingly, we can obtain the performance metrics $A_S$ (seen classes), $A_U$ (unseen classes), and $A_{HM}$ (harmonic mean of $A_S$ and $A_U$) after each task. Additionally, based

on the LAION400M [48] pre-trained CLIP, we also utilize a subset of 10,000 image-text pairs

from LAION400M, and calculate the matching score of them, i.e., cosine similarity of image-text

embeddings. We denote the average matching score as LAION score, which indicates the matching

degree of the adapted model on the upstream tasks. Given the relationship between generalizability

and the upstream task, the LAION score serves as an effective measure of ZS performance.
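The measures above can be sketched as follows; the embeddings are assumed to come from the (adapted) CLIP encoders, and the sampled LAION image-text pairs are supplied by the caller.

```python
import torch

def harmonic_mean(acc_seen, acc_unseen):
    # A_HM of the seen-class accuracy A_S and unseen-class accuracy A_U.
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen + 1e-8)

@torch.no_grad()
def laion_score(image_embeds, text_embeds):
    # image_embeds, text_embeds: [N, d] embeddings of paired image-text samples.
    img = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
    txt = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).mean().item()   # average pair-wise cosine similarity
```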

Results: We compare the aforementioned measures on CIFAR100 B0 Inc10. Apart from compared

methods in Section 5.2, we also report a variation of PROOF, namely PROOF†. The only difference lies

in the design of the projection, where PROOF† uses a residual format $P_{i}(\mathbf{z})=\sum_{m=1}^{b}\left(P_{i}^{m}(\mathbf{z})+\mathbf{z}\right)$

as the output (same for Pt). To investigate the ZS performance as model updates, we show the

accuracy on unseen classes AU along incremental stages in Figure 4(a), where ZS-CLIP shows the

best performance. Due to the incorporation of pre-trained information into the projected features,

PROOF† maintains competitive ZS performance. Conversely, other methods experience a decline in

ZS performance as their focus shifts to downstream tasks. We observe a similar trend in Figure 4(b),

where PROOF† achieves a LAION score similar to that of ZS-CLIP. Lastly, we report $A_S$, $A_U$, and $A_{HM}$ in the last incremental stage in Figure 4(c). We can infer a trade-off between the adaptivity on downstream tasks and the generalizability of ZS performance. Compared to PROOF, PROOF† sacrifices the adaptivity to maintain ZS performance, striking a balance between seen and unseen classes. Therefore, when ZS performance is essential, using PROOF† is the preferred choice.


Task 1: Walk
- A basket vendor walking down a busy city street
- An old man in a suit is smoking a cigar and walking forward
- Young Asian individuals walking in a busy city street

Task 2: Stand
- Three women in black outfits hold black umbrellas and signs while a man stands by
- Four people in casual clothing are standing outside holding garbage bags
- A Muslim girl is standing on a street corner listening to music in a crowded city

Task 3: Run
- A woman in the blue sweater is running through a brown field
- Two black and white dogs running towards each other in the grass
- A rugby player running the ball between two downed opponents

Task 4: Ride
- Two people riding dirt bikes on a bike trail
- A young woman riding a bike down a street past a crowd of people
- Two men, both wearing green cycling clothes and helmets, are riding bicycles

Task 5: Play
- A man plays a purple guitar while sitting next to a man playing the accordion
- A man is on a golf course playing golf
- People playing hockey on ice
Figure 5: The training protocol of five incremental stages in Flickr30K. We split training instances into five
tasks, i.e., walk, stand, run, ride, and play. The training/testing sets do not include images that do not fall into
these tasks. We use the pre-trained BEiT-3 as the initialization and sequentially learn cross-modal retrieval tasks.
At the end of each task, the model is evaluated on all previously learned concepts.
Table 2: Average and last performance of different methods. The best is in bold. The first row stands for the text retrieval task, and the second is the image retrieval task.

Image → Text
Method      | R_B@1  R̄@1   | R_B@5  R̄@5   | R_B@10  R̄@10
Finetune    | 48.79  62.89  | 76.38  85.04  | 85.68   91.84
DER [69]    | 78.37  84.48  | 96.34  98.23  | 99.06   99.59
MEMO [82]   | 83.18  87.79  | 96.57  98.27  | 99.16   99.66
PROOF       | 85.68  89.43  | 97.07  98.68  | 99.79   99.86

Text → Image
Method      | R_B@1  R̄@1   | R_B@5  R̄@5   | R_B@10  R̄@10
Finetune    | 37.35  51.33  | 67.38  77.77  | 77.95   85.55
DER [69]    | 66.71  74.18  | 89.63  93.00  | 94.84   96.69
MEMO [82]   | 69.53  76.35  | 91.89  94.44  | 96.09   97.32
PROOF       | 72.10  78.01  | 93.10  95.27  | 96.92   97.90
5.5 Extension to Other Vision Language Models
In the main paper, we use CLIP as an exemplar VLM due to its popularity and representativeness.

However, the field of vision-language models is rapidly advancing, and various models are available.

Therefore, in this section, we extend our PROOF framework to another widely used vision-language

model, namely BEiT-3 [60], focusing on the cross-modal retrieval task. BEiT-3 is a popular VLM that

demonstrates promising performance across multiple vision-language tasks. When fine-tuning BEiT-3

for cross-modal retrieval, it functions as a dual encoder, similar to CLIP, featuring a dual-branch

structure. As the retrieval task differs from classification, we adopt a degradation of PROOF by solely

employing the projection expansion strategy without implementing cross-modal fusion. We refer the

readers to the BEiT-3 paper [60] for more details about the backbone model.

For evaluation, we employ the Flickr30K dataset [45] to assess the performance of incremental

cross-modal retrieval. Flickr30K comprises 31,783 images collected from the Flickr image-sharing

platform, encompassing diverse themes such as daily life, travel, people, food, and scenes. Each

image in the dataset is accompanied by five manually annotated textual descriptions, which provide

descriptive information capturing the main content and context of the images. To formulate an

incremental data stream, we utilize keyword matching to identify images containing different actions

(e.g., walk, stand, run, ride, play). Then, we split the training instances into five subsets based on

these specific actions. Figure 5 illustrates the formulation of the stream, while images not associated

with these actions are excluded from training. To create a balanced testing set, we maintain a 5:1

training-to-testing ratio for splitting the training and testing pairs. Following the instructions provided

by BEiT-3², we use ‘beit3_base_itc_patch16_224’³ as the VLM’s initialization.

²https://github.com/microsoft/unilm/blob/master/beit3/README.md
³https://conversationhub.blob.core.windows.net/beit-share-public/beit3/pretraining/beit3_base_itc_patch16_224.pth


Figure 6: Incremental performance of each method. Panels: (a) IR@1, (b) IR@5, (c) IR@10, (d) TR@1, (e) TR@5, (f) TR@10. IR means the recall of image retrieval, and TR denotes the recall of text retrieval. PROOF consistently outperforms other compared methods with a substantial margin on the incremental cross-modal retrieval task.
For evaluation, we employ standard cross-modal retrieval measures, namely R@1, R@5, and R@10.

The retrieval is conducted in two directions: image → text and text → image. Similarly to the CIL

evaluation, we also report the last recall $R_B$@1 and the average recall $\bar{R}$@1 across incremental stages.
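A sketch of the Recall@K computation is given below; for simplicity it assumes a single paired gallery item per query, whereas Flickr30K provides five captions per image.

```python
import torch

@torch.no_grad()
def recall_at_k(query_embeds, gallery_embeds, k=1):
    # query_embeds, gallery_embeds: [N, d]; row i on each side forms a ground-truth pair.
    q = query_embeds / query_embeds.norm(dim=-1, keepdim=True)
    g = gallery_embeds / gallery_embeds.norm(dim=-1, keepdim=True)
    sims = q @ g.t()                                    # [N, N] cosine similarities
    topk = sims.topk(k, dim=-1).indices                 # indices of the top-K retrievals
    targets = torch.arange(q.size(0)).unsqueeze(1)      # ground-truth pairing
    return (topk == targets).any(dim=-1).float().mean().item()
```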

To provide a comparative analysis, we compare PROOF against typical fine-tuning as the baseline

and modify MEMO [82] and DER [69] for comparison. These methods represent state-of-the-art

CIL approaches that can be adapted with minor modifications to the current task. However, methods

such as L2P and DualPrompt are unsuitable for cross-modal retrieval tasks as they do not focus on

cross-modal matching.

The experimental results are presented in Table 2, and the incremental performance of each measure is

depicted in Figure 6. As evident from these figures, fine-tuning the model with new concepts leads to

catastrophic forgetting in cross-modal retrieval tasks. However, equipping the model with incremental

learning abilities alleviates forgetting. Among all the compared methods, PROOF consistently achieves

the best performance across different retrieval tasks and metrics, thereby verifying its effectiveness

in mitigating forgetting in VLMs. Experiments conducted on different VLMs and tasks establish

PROOF as a unified and general framework. Future work involves extending PROOF to other VLMs

and applications, such as image captioning [56] and VQA [5].

6 Conclusion

Real-world learning systems necessitate the ability to continually acquire new knowledge. In

this paper, we aim to equip the popular VLM with the CIL ability. Specifically, we learn the

expandable projections so that visual and textual information can be aligned incrementally. This

expansion technique allows for the integration of new concepts without compromising previous

ones. Additionally, we enforce cross-modality fusion with self-attention mechanism, where visual

and textual information are jointly adapted to produce instance-specific embeddings. Extensive

experiments validate the effectiveness of our proposed PROOF. Furthermore, we demonstrate that a

simple variation of PROOF preserves the model’s zero-shot capability during updating.

Limitations: Possible limitations include the usage of exemplars, where storage constraints and

privacy issues may happen. Future works include extending the model to exemplar-free scenarios.


Supplementary Material

In the main paper, we present a method to prevent forgetting in vision-language models through

projection expansion and fusion. The supplementary material provides additional details on the experimental results mentioned in the main paper, along with extra empirical evaluations and discussions.

The organization of the supplementary material is as follows:

• Section A presents the pseudo code of PROOF, explaining the training and testing pipeline.

• Section B reports comprehensive experimental results from the main paper, including the

full results of nine benchmark datasets with two data splits, as well as the results obtained

using OpenAI weights. Furthermore, this section includes additional ablations such as

variations of projection types, results from multiple runs, and an analysis of the number of

parameters.

• Sections C and D provide detailed information on the experiments, including dataset and

exemplar selection details, an introduction to the compared methods, and a discussion of the

broader impacts.

A Pseudo Code

In this section, we provide a detailed explanation of PROOF by presenting the pseudo-code in Alg 1. In

each incremental stage, we are provided with the training dataset Db and the exemplar set E, with the

objective of updating the current model f(·). Prior to training, we initially extract visual prototypes

for the new classes (Line 1). These prototypes are calculated using the frozen visual embedding gi(·),

ensuring their stability throughout model updates. Subsequently, we freeze the former projections

and context prompts, while initializing new projections and context prompts specifically for the new

incremental task (Line 2 to Line 4). These steps represent the model expansion process, which is

followed by the subsequent learning process.

During the learning process, we concatenate the training instances from the current dataset and the

exemplar set, initiating a for-loop. For each instance-label pair, we calculate the projected visual

and textual embeddings (Line 6 to Line 9). Subsequently, we compute the projected matching

loss (Line 10) to encode task-specific information into the current projection layers. Based on

the projected features, we derive context information and perform cross-modal fusion (Line 11 to

Line 13). Consequently, we obtain three logits for model updating and utilize the cross-entropy loss

to update these modules (Line 14). The updated model is then returned as the output of the training

process.

Discussions: Besides the simple addition operation, there exist alternative methods for aggregating

information from multiple projections. However, due to the requirement of fixed input dimensionality

for cross-modal fusion, we refrain from using concatenation as the aggregation function. Furthermore,

it is worth noting that MEMO [82] can be viewed as a specific case where concatenation is employed

for aggregation. Nonetheless, its inferior performance (as shown in Table 3) suggests that summation

is a more favorable choice.

B Additional Experimental Results

This section presents further experimental results of PROOF, including comparisons with multiple

runs, analysis of parameter numbers, and ablations on projection types. Additionally, we report the

results of using OpenAI pre-trained CLIP and provide the full results mentioned in the main paper.

B.1 Multiple Runs

Following [47], we conduct typical CIL comparisons by randomly splitting the classes with a fixed

seed of 1993, and these results are reported in the main paper. In this supplementary section, we

perform multiple runs by varying the random seed among {1993, 1994, 1995, 1996, 1997}. We repeat

the comparison on CIFAR100 Base50 Inc10 and ImageNet-R Base100 Inc20 five times and present

the results in Figure 7. The solid line represents the mean performance, while the shaded area

indicates the standard deviation. From these figures, it is evident that PROOF consistently outperforms


Algorithm 1 Training PROOF for CIL

Input: Training dataset: Db; Exemplar set: E; Current model: f(·);
Output: Updated model;
 1: Extract prototypes p for each new class in Db;
 2: Freeze current projections and context prompts;                       ▷ Expand projections
 3: Initialize new projections for the visual and textual branches, Pib, Ptb;
 4: Initialize new context prompt cb;
 5: for (x, y) ∈ Db ∪ E do                                                ▷ Incremental learning
 6:     Calculate the visual embedding z = gi(x);
 7:     Calculate the projected visual feature Pi(z);
 8:     Calculate the textual embedding w of all seen classes;
 9:     Calculate the projected textual embeddings of all seen classes Pt(w);
10:     Calculate the logits for projected matching fPM(x) via Eq. 5;     ▷ Projected matching
11:     Calculate the projected visual features for all visual prototypes p;
12:     Conduct cross-modal fusion via Eq. 7;                             ▷ Cross-modal fusion
13:     Calculate the logits for visual and textual matching via Eq. 8;   ▷ Visual & textual matching
14:     Calculate the loss via Eq. 9; update the model;
return the updated model;

Figure 7: Results of multiple runs for (a) CIFAR100 B50 Inc10 and (b) ImageNet-R B100 Inc20. The solid line represents the mean performance, while the shaded area indicates the standard deviation. PROOF consistently and robustly outperforms other methods by a substantial margin.
other methods by a significant margin across different dataset splits. These results validate the

robustness of PROOF.

B.2 Parameter Analysis

As mentioned in the main paper, the additional parameters in PROOF come from two sources: the

projections and the fusion module. The projection layers are implemented with a single linear layer,

each containing d × d parameters, where d = 512 is the embedding dimension. Similarly, the

cross-modal fusion is implemented with a single-head self-attention mechanism, and the number

of parameters is determined by the weight matrices WQ, WK, and WV , each containing d × d

parameters. These extra parameters are negligible compared to the large backbone of the pre-trained

CLIP model, which has approximately 150 million parameters.
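The counts can be verified with simple arithmetic; the backbone size is the commonly reported scale of CLIP ViT-B/16.

```python
# Back-of-the-envelope count of the add-on parameters (d = 512).
d = 512
per_task_projections = 2 * d * d       # one image and one text projection per task
fusion = 3 * d * d                     # W_Q, W_K, W_V of the single-head self-attention
print(per_task_projections, fusion)    # 524288 786432 -> well under 1% of ~150M
```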

To provide a clear comparison of the parameter numbers for each method, we present the details in

Figure 8 using CIFAR100 B0 Inc10 as an example. The figure illustrates that PROOF has a similar

parameter scale to other finetune-based methods, while achieving significantly stronger performance.

SimpleCIL, which only utilizes the vision branch, requires fewer parameters for the textual branch but


Figure 8: Number of parameters in different methods. The shaded area represents the parameters used during training but dropped during inference. PROOF achieves state-of-the-art performance with a comparable number of parameters to other methods.
Figure 9: Variations of projection layers (SSF, Adapter, and Linear). The choice of using a single linear layer as the projection layer achieves the best performance.
lacks the zero-shot capability. L2P and DualPrompt also only require the vision branch but need an

additional encoder to identify the appropriate prompt, resulting in a higher parameter count compared

to PROOF.

B.3 Variation of Projection Types

Apart from simple linear layers, there are other methods to implement the projection layers, such

as layer-wise rescale (SSF) [34] and Adapter [23]. SSF learns a d-dimensional rescale parameter to

project the features, while Adapter learns both the down-projection and up-projection for feature

mapping. In this section, we explore the performance of these projection methods on CIFAR100

B0 Inc10 and present the results in Figure 9. The figure clearly demonstrates that using a single

linear layer as the projection layer achieves the best performance among all methods, indicating its

superiority. Furthermore, this result suggests that a simple linear mapping can effectively bridge the

gap between visual and textual domains.
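The three variants can be sketched as follows; the adapter bottleneck width is an illustrative choice rather than the paper's exact setting.

```python
import torch
import torch.nn as nn

class ScaleShift(nn.Module):
    """SSF-style projection: a per-dimension rescale (and shift) of the features."""
    def __init__(self, dim):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))
        self.shift = nn.Parameter(torch.zeros(dim))
    def forward(self, x):
        return x * self.scale + self.shift

def make_projection(kind, dim=512, bottleneck=64):
    if kind == "linear":     # a single linear mapping (the default in PROOF)
        return nn.Linear(dim, dim, bias=False)
    if kind == "ssf":        # layer-wise rescale
        return ScaleShift(dim)
    if kind == "adapter":    # down-projection -> nonlinearity -> up-projection
        return nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU(), nn.Linear(bottleneck, dim))
    raise ValueError(kind)
```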


Figure 10: Variations of context information (Context = [P], [W], [P, W], and [P, W, C]). The choice of using visual prototypes, textual prototypes, and context prompts as the context information achieves the best performance.
B.4 Variation of Context Information
In the main paper, we discuss the composition of the context information Context, which should

include information from visual prototypes, textual classifiers, and context prompts. In this section,

we conduct ablations to demonstrate the effectiveness of constructing Context with [P, W, C].

Specifically, we perform experiments on CIFAR100 B0 Inc10 and change the context construction

to Context = P (visual prototypes only), Context = W (textual prototypes only), Context =

[P, W] (visual and textual prototypes), and Context = [P, W, C] (current choice). We keep

the same classification rule for these ablations, i.e., classification via Eq. 9. When visual/textual

prototypes are not included in the context, we use the projected features without adaptation as the

matching target in Eq. 8. The results are presented in Figure 10.

From the results, we observe that using visual prototypes or textual prototypes alone yields similar

performance, and the impact of adjustment is marginal. However, when both visual and textual

prototypes are jointly utilized as context information, the model can learn from cross-modality and

achieve better performance. Lastly, the introduction of context prompts into the context further

enhances the performance of PROOF, resulting in the best performance among all variations.

B.5 Different Pre-trained Weights

In the main paper, we discussed two popular weights for pre-trained CLIP: OpenAI [46]4 and

OpenCLIP [24]5. We primarily presented the results of the OpenCLIP pre-trained model in the main

paper, while providing the results of the OpenAI weights using a radar chart. In this section, we

present the full results of the OpenAI pre-trained CLIP on nine benchmark datasets in Figure 11.

The results demonstrate that PROOF consistently achieves the best performance among all methods,

regardless of the pre-trained weights used. This highlights the robustness of PROOF in the learning

process.

B.6 Full Results

We provide the complete results of the benchmark comparison in the main paper, which are presented

in Table 3 and Figures 12 and 13. These results are obtained using OpenCLIP pre-trained weights on

LAION-400M [24]. Table 3 displays the average and last accuracy for the nine benchmark datasets.

Figures 12 and 13 illustrate the incremental performance with varying numbers of base classes.

4https://github.com/openai/CLIP

5https://github.com/mlfoundations/open_clip


Figure 11: Incremental performance of different methods when using OpenAI weights, on (a) Aircraft Base0 Inc10, (b) CIFAR100 Base0 Inc10, (c) Cars Base0 Inc10, (d) ImageNet-R Base0 Inc20, (e) CUB Base0 Inc20, (f) UCF Base0 Inc10, (g) SUN Base0 Inc30, (h) Food Base0 Inc10, and (i) ObjectNet Base0 Inc20. We report the performance gap after the last incremental stage of PROOF and the runner-up method at the end of the line. PROOF consistently achieves the best performance regardless of the pre-trained weights used.
Across all these evaluations, PROOF consistently outperforms the compared methods, demonstrating

its superior performance.

C Experimental Details

This section provides detailed information about the experiments conducted, including the introduction of datasets, exemplar selection, and the methods compared in the paper.

C.1 Dataset Introduction

In our evaluation, we utilize nine datasets, which are summarized in Table 4. Some of these datasets contain more classes than we use; we select a subset of classes for ease of data splitting and evaluation.

Exemplar Selection: As mentioned in the main paper, we follow the exemplar selection approach in [47, 67, 22] and use the herding algorithm [65]. In addition, there are two typical protocols [81] for storing these exemplars in memory, listed below.


Table 3: Average accuracy (Ā) and last accuracy (A_B) of different methods, with and without exemplars, on Aircraft, CIFAR100, Cars, ImageNet-R, CUB, UCF, SUN, Food, and ObjectNet, under both the B0 and the large-base-class (B50/B100/B150) settings. The performance of L2P and DualPrompt is reproduced with the source code with exemplars, and the best performance is shown in bold. Compared methods: Finetune, Finetune LiT [75], Finetune CoOp [85], SimpleCIL [83], ZS-CLIP [46], CoOp [85], iCaRL [47], MEMO [82], L2P [64], DualPrompt [63], and PROOF; PROOF consistently achieves the best performance among the compared methods.
1. Fixed Memory Budget: In this protocol, a fixed memory budget of K instances is allocated. Given the number of seen classes |Y_b|, the model keeps K/|Y_b| exemplars per class after each incremental stage.

2. Expandable Exemplar Set: In this protocol, the exemplar set grows as the data evolves. With k exemplars stored per class, the model keeps |Y_b| × k exemplars in total after each incremental stage.

We evaluate both protocols using these benchmark datasets in our experiments. Specifically, we

employ the first policy for CIFAR100 and Food, keeping a total of 2,000 exemplars. Since these

datasets consist of 100 classes, the average number of exemplars per class after the last incremental

stage is 20. We adopt the second policy for the other datasets and store 20 exemplars per class.
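For concreteness, the bookkeeping implied by these two protocols can be sketched with a couple of helper functions (written for illustration; only the budget values above come from our setup, and the exemplars themselves are chosen with the herding strategy, which greedily picks samples whose mean best approximates the class mean):

```python
def exemplars_per_class_fixed(total_budget: int, num_seen_classes: int) -> int:
    """Fixed memory budget: K instances are shared equally among all seen classes."""
    return total_budget // num_seen_classes


def total_exemplars_expandable(per_class: int, num_seen_classes: int) -> int:
    """Expandable exemplar set: k instances are kept for every seen class."""
    return per_class * num_seen_classes


# CIFAR100 / Food: fixed budget of 2,000 -> 20 exemplars per class once all 100 classes are seen.
print(exemplars_per_class_fixed(2000, 100))   # 20
# Other datasets: expandable set with 20 per class, e.g., 4,000 exemplars after 200 CUB classes.
print(total_exemplars_expandable(20, 200))    # 4000
```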

C.2 Compared Methods Introduction

This section provides an overview of the compared methods discussed in the main paper. These methods, listed in the order presented in Table 3, include:

Finetune: This baseline method involves finetuning the pre-trained CLIP model using

contrastive loss. No regularization terms are set, and no part of the model is frozen, allowing

us to observe the forgetting phenomenon in sequential learning.
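A minimal sketch of the contrastive objective used for this finetuning, i.e., the symmetric image-text InfoNCE loss (the feature dimension is assumed and the training loop is omitted):

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          logit_scale: float = 100.0) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image-text pairs."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()
    targets = torch.arange(image_features.size(0), device=image_features.device)
    return 0.5 * (F.cross_entropy(logits_per_image, targets) +
                  F.cross_entropy(logits_per_text, targets))


# Example with random stand-in features (512-d, as in ViT-B/16 CLIP).
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```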


Figure 12: Incremental performance of different methods (accuracy (%) versus number of classes). We report the performance gap between PROOF and the runner-up method after the last incremental stage at the end of each curve: (a) Aircraft Base0 Inc10: 5.5, (b) CIFAR100 Base0 Inc10: 2.42, (c) Cars Base0 Inc10: 2.99, (d) ImageNet-R Base0 Inc20: 2.93, (e) CUB Base0 Inc20: 1.91, (f) UCF Base0 Inc10: 3.68, (g) SUN Base0 Inc30: 1.7, (h) Food Base0 Inc10: 1.88, (i) ObjectNet Base0 Inc20: 3.77. Compared methods: iCaRL, L2P, DualPrompt, SimpleCIL, MEMO, CoOp, ZS-CLIP, and PROOF.

Finetune LiT [75]: Following LiT, which freezes the image encoder and finetunes only the text encoder, we adapt this strategy to the CIL setting. As with Finetune, we sequentially tune the model with the contrastive loss, but the image encoder remains frozen during optimization.
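In code, this amounts to freezing the visual branch before optimization; a sketch assuming the OpenAI CLIP implementation, where image-encoder parameters are prefixed with `visual.`:

```python
import torch
import clip

model, _ = clip.load("ViT-B/16", device="cpu")

# Freeze the image encoder; only text-side parameters (and the logit scale) receive gradients.
for name, param in model.named_parameters():
    if name.startswith("visual."):
        param.requires_grad_(False)

optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-3)
```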

Finetune CoOp [85]: Following the CoOp method, this approach freezes both the image

encoder and text encoder. It optimizes a learnable prompt tensor t (as in Eq.4) using

contrastive loss without utilizing any historical data for rehearsal.
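The learnable prompt can be sketched as a small tensor of context vectors prepended to each class-name embedding before the frozen text encoder (an illustrative simplification of CoOp with assumed shapes):

```python
import torch
import torch.nn as nn

n_ctx, embed_dim, num_classes, name_len = 16, 512, 10, 4

# Learnable context vectors shared by all classes (CoOp's unified context).
ctx = nn.Parameter(torch.empty(n_ctx, embed_dim))
nn.init.normal_(ctx, std=0.02)

# Stand-in for the token embeddings of each class name (normally produced by CLIP's
# token-embedding layer from the class-name tokens).
class_name_embeddings = torch.randn(num_classes, name_len, embed_dim)

# Prepend the shared context to every class: [num_classes, n_ctx + name_len, embed_dim].
prompts = torch.cat(
    [ctx.unsqueeze(0).expand(num_classes, -1, -1), class_name_embeddings], dim=1
)
print(prompts.shape)   # torch.Size([10, 20, 512])
```

These prompt embeddings are then passed through the frozen text encoder to produce the class weights, and only `ctx` is updated during training.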

SimpleCIL [83]: This method relies on the pre-trained image encoder and does not involve

the text encoder. The frozen image encoder extracts class centers (prototypes) for each new

class, and a cosine classifier is utilized for classification. Since the model is not updated

via backpropagation, it showcases the generalizability of the pre-trained vision encoder on

downstream tasks.
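A hedged sketch of this prototype-based classification (the helper functions and feature dimension are illustrative; in practice the features come from the frozen CLIP image encoder):

```python
import torch
import torch.nn.functional as F


def class_prototypes(features: torch.Tensor, labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Average the frozen image features of each class to obtain its prototype."""
    protos = torch.zeros(num_classes, features.size(1))
    for c in range(num_classes):
        protos[c] = features[labels == c].mean(dim=0)
    return protos


def cosine_logits(query: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between query features and all class prototypes."""
    return F.normalize(query, dim=-1) @ F.normalize(prototypes, dim=-1).t()


# Example with random stand-in features for 5 classes.
feats, labels = torch.randn(100, 512), torch.randint(0, 5, (100,))
protos = class_prototypes(feats, labels, num_classes=5)
predictions = cosine_logits(torch.randn(8, 512), protos).argmax(dim=-1)
```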

ZS-CLIP [46]: This baseline freezes the pre-trained CLIP model and predicts the logits

of each incoming class using cosine similarity (Eq. 2). It serves as a reference for the

performance of pre-trained CLIP on downstream tasks.
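A sketch of this zero-shot prediction with the OpenAI CLIP package (the class names are placeholders, the prompt template is the standard one, and a random tensor stands in for a preprocessed image):

```python
import torch
import clip

device = "cpu"  # on GPU, inputs should be cast to model.dtype since CLIP loads in fp16 there
model, preprocess = clip.load("ViT-B/16", device=device)

class_names = ["airliner", "biplane", "glider"]  # placeholder class names
text_tokens = clip.tokenize([f"a photo of a {c}." for c in class_names]).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    image = torch.randn(1, 3, 224, 224, device=device)   # stand-in for preprocess(img).unsqueeze(0)
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    logits = model.logit_scale.exp() * image_features @ text_features.t()
    prediction = logits.argmax(dim=-1)
```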

CoOp (with exemplars): This method combines the CoOp approach with exemplar rehearsal. During the learning of new classes, the model utilizes a combination of the current

dataset and exemplar set to optimize the learnable prompt.


Figure 13: Incremental performance of different methods with large base classes (accuracy (%) versus number of classes). We report the performance gap between PROOF and the runner-up method after the last incremental stage at the end of each curve: (a) Aircraft Base50 Inc10: 10.81, (b) CIFAR100 Base50 Inc10: 2.24, (c) Cars Base50 Inc10: 2.69, (d) ImageNet-R Base100 Inc20: 3.13, (e) CUB Base100 Inc20: 1.66, (f) UCF Base50 Inc10: 3.45, (g) SUN Base150 Inc30: 1.91, (h) Food Base50 Inc10: 1.66, (i) ObjectNet Base100 Inc20: 3.28. Compared methods: iCaRL, L2P, DualPrompt, SimpleCIL, MEMO, CoOp, ZS-CLIP, and PROOF.

iCaRL [47]: iCaRL is a typical class-incremental learning algorithm that employs knowledge distillation and exemplar replay to mitigate forgetting. It combines contrastive loss

with distillation loss to learn new classes while retaining knowledge of old classes.
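The distillation term can be sketched as a soft-target loss between the logits of the current model and a frozen copy of the previous model on old classes (a simplified, KL-based illustration; iCaRL's original formulation uses sigmoid outputs with binary cross-entropy):

```python
import torch
import torch.nn.functional as F


def distillation_loss(new_logits: torch.Tensor,
                      old_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Knowledge-distillation loss on the logits of previously learned classes."""
    log_p_new = F.log_softmax(new_logits / temperature, dim=-1)
    p_old = F.softmax(old_logits / temperature, dim=-1)
    return F.kl_div(log_p_new, p_old, reduction="batchmean") * temperature ** 2


# new_logits come from the current model, old_logits from the frozen previous model.
loss = distillation_loss(torch.randn(8, 50), torch.randn(8, 50))
```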

MEMO [82]: As a state-of-the-art class-incremental learning algorithm based on network

expansion, MEMO is modified to be compatible with the CLIP structure. The image and text

encoders are expanded for new tasks, and the concatenated features are used for prediction

based on cosine similarity.

L2P [64]: L2P is a state-of-the-art class-incremental learning algorithm utilizing pre-trained

vision transformers. In this case, the text encoder of CLIP is dropped, and a prompt pool (as

in Eq. 3) is learned to adapt to evolving data. Another pre-trained image encoder is required

to select the appropriate prompt during inference.
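Prompt selection from the pool can be sketched as a cosine-similarity lookup between a query feature (produced by the extra frozen encoder) and learnable prompt keys, after which the top-k prompts are prepended to the patch tokens; the sizes below are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

pool_size, prompt_len, embed_dim, top_k = 10, 5, 768, 3

prompt_keys = nn.Parameter(torch.randn(pool_size, embed_dim))             # one key per prompt
prompt_pool = nn.Parameter(torch.randn(pool_size, prompt_len, embed_dim))

# Query: the class-token feature from the frozen, prompt-free pre-trained encoder.
query = torch.randn(1, embed_dim)

# Select the top-k prompts whose keys are closest to the query (cosine similarity).
scores = F.normalize(query, dim=-1) @ F.normalize(prompt_keys, dim=-1).t()   # [1, pool_size]
top_idx = scores.topk(top_k, dim=-1).indices.squeeze(0)                      # [top_k]
selected = prompt_pool[top_idx].reshape(1, top_k * prompt_len, embed_dim)

# The selected prompts are prepended to the patch embeddings before the frozen ViT.
patch_tokens = torch.randn(1, 196, embed_dim)
vit_input = torch.cat([selected, patch_tokens], dim=1)    # [1, top_k * prompt_len + 196, embed_dim]
```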

DualPrompt [63]: DualPrompt is an extension of L2P that incorporates two types of

prompts: general prompts and expert prompts. It also relies on another pre-trained image

encoder for prompt retrieval.

It is important to note that these methods are compared fairly: they are all initialized with the same pre-trained weights for incremental learning.


Table 4: Overview of the benchmark datasets.

Dataset # training instances # testing instances # Classes Link

CIFAR100 50,000 10,000 100 Link

CUB200 9,430 2,358 200 Link

ImageNet-R 24,000 6,000 200 Link

ObjectNet 26,509 6,628 200 Link

Aircraft 6,667 3,333 100 Link

Cars 4,135 4,083 100 Link

UCF 10,053 2,639 100 Link

SUN 72,870 18,179 300 Link

Food 79,998 20,012 100 Link

Since some of the compared methods are not designed for the CLIP encoder, we replace their backbones with pre-trained CLIP for a fair comparison. We also use the same number of exemplars across all exemplar-based methods.

D Broader Impacts

In this work, we address the class-incremental learning problem with vision-language models, which

is a fundamental challenge in machine learning. Our focus is on tackling the forgetting problem

that arises when sequentially finetuning a vision-language model. We propose solutions to project

and integrate features from multiple modalities for unified classification. Our research provides

valuable insights for applications that struggle with managing the forgetting issue in large pre-trained

vision-language models. However, there are still ample opportunities for further exploration in this

field. Therefore, we aspire to stimulate discussions on class-incremental learning in real-world

scenarios and encourage more research to develop practical models for this purpose.

We also acknowledge the ethical considerations associated with this technology. It is crucial to

recognize that individuals expect learning systems to refrain from storing any personal information

for future rehearsal. While there are risks involved in AI research of this nature, we believe that

developing and demonstrating such techniques are vital for comprehending both the beneficial and

potentially concerning applications of this technology. Our aim is to foster discussions regarding best

practices and controls surrounding these methods, promoting responsible and ethical utilization of

technology.
