Conversation

@zucchini-nlp (Member)

What does this PR do?

While working on #38635, I found that some models have past_key_values in their signature even though they cannot generate. The reason is that these models were all copying from Bert.

This PR cleans that up, changes the copy statements to point to the Align model instead, and adds support for the new attention API in all of those models.
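For context, here is a minimal sketch of the shape such a cleanup takes; the class and function names are illustrative placeholders, not the actual transformers modeling code: the forward signature of an encoder-only attention block loses its cache arguments, and the attention math is routed through a swappable attention function.

# Illustrative sketch only: placeholder names, not the real Align/Bert modules.
import math

import torch
import torch.nn as nn


def eager_attention(query, key, value, attention_mask=None):
    # Plain scaled-dot-product attention; sdpa/flash variants could be swapped in here.
    scores = torch.matmul(query, key.transpose(-1, -2)) / math.sqrt(query.size(-1))
    if attention_mask is not None:
        scores = scores + attention_mask  # additive mask, broadcast over heads
    probs = scores.softmax(dim=-1)
    return torch.matmul(probs, value), probs


class EncoderSelfAttention(nn.Module):
    """Encoder-only attention: no past_key_value, use_cache, or cross-attention inputs."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states, attention_mask=None):
        # Note: no `past_key_value`, `encoder_hidden_states`, or `use_cache` arguments.
        bsz, seq_len, _ = hidden_states.shape
        query = self.q_proj(hidden_states).view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        key = self.k_proj(hidden_states).view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        value = self.v_proj(hidden_states).view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        attn_output, attn_weights = eager_attention(query, key, value, attention_mask)
        attn_output = attn_output.transpose(1, 2).reshape(bsz, seq_len, -1)
        return attn_output, attn_weights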

@zucchini-nlp changed the title from "don't use cache in non-generative models" to "Don't use cache in non-generative models" on Jun 11, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Rocketknight1 (Member)

Yes, great cleanup! Ping me whenever it's ready and you want a review

@zucchini-nlp (Member, Author)

@Rocketknight1 ready for review! One thing to note: I didn't deprecate past_key_values in the kwargs and simply deleted it, since it wasn't used anyway. Do you think we need a deprecation cycle, or should we raise an error when cache-related kwargs are passed?

@Rocketknight1 (Member)

@zucchini-nlp I think it's okay! I really hope people weren't passing past_key_values to non-generative models anyway 😬

@Rocketknight1 (Member) left a comment


Looks good! It's a really nice cleanup. I made one comment, but we should also definitely run slow tests for some of these models before merging 😅

@zucchini-nlp requested a review from Cyrilvallez on June 13, 2025
@Cyrilvallez (Member) left a comment


Hey @zucchini-nlp! Could you provide a bit more detail about why we remove the cross-attention and positional embeddings completely everywhere, please? 🤗 It is not obvious to me, because at first glance they look like they were used at least sometimes, no?

@zucchini-nlp (Member, Author)

run-slow: align,wav2vec2,layoutlm,clap

1 similar comment

@github-actions (Contributor)

This comment contains run-slow, running the specified jobs:

models: ['models/align', 'models/clap', 'models/layoutlm', 'models/wav2vec2']
quantizations: [] ...

1 similar comment

@zucchini-nlp (Member, Author)

Hey @Cyrilvallez, do you have any comments for me to address?

@Cyrilvallez (Member) left a comment


Hey! I do like it, but the PR has the potential to be quite breaking in different ways:

  • The position embedding type could be set in some configs on the Hub.
  • Some models present in the main init are changed directly (cross-attention/head mask), and even if they are building blocks of bigger models, they are still public classes. E.g. for Align there are no public classes using encoder_hidden_states, so it is straightforward to remove it everywhere as you did; but in AltCLIP it flows correctly through AltCLIPTextModel, which is public, and only the main AltCLIPModel does not propagate it (so removing it is not directly breaking when using AltCLIPModel, but it is when using the public submodel AltCLIPTextModel 🥲).
  • Even method signatures in fully internal classes can sometimes be breaking (though in this instance I wouldn't worry about it).

In general, I really like the changes because they clean up a lot of nonsense in those modelings, but we need to be a bit wary of the potential implications here.
cc @ArthurZucker for an opinion on whether we want to be aggressive in favor of simplification here, or whether we want to do it through a deprecation cycle for the public classes (but once again, even if public, they are building blocks of the real, bigger models).
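To gauge which public entry points are actually affected, a small probe like the following can help; this is an illustrative sketch (it assumes transformers is installed, and the two classes checked are just the ones discussed above), not part of the PR:

# Illustrative probe (not part of the PR): list which of the removed kwargs a
# public class still accepts in its forward signature.
import inspect

from transformers import AlignTextModel, AltCLIPTextModel

for cls in (AlignTextModel, AltCLIPTextModel):
    params = inspect.signature(cls.forward).parameters
    removed = [name for name in ("encoder_hidden_states", "past_key_values", "head_mask") if name in params]
    print(f"{cls.__name__} still exposes: {removed or 'none of the checked kwargs'}")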

Comment on lines -874 to -882
if self.is_decoder and encoder_hidden_states is not None:
    if not hasattr(self, "crossattention"):
        raise ValueError(
            f"If `encoder_hidden_states` are passed, {self} has to be instantiated with cross-attention layers"
            " by setting `config.add_cross_attention=True`"
        )

    # cross_attn cached key/values tuple is at positions 3,4 of past_key_value tuple
    cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None
    cross_attention_outputs = self.crossattention(
        attention_output,
        attention_mask,
        head_mask,
        encoder_hidden_states,
        encoder_attention_mask,
        cross_attn_past_key_value,
        output_attentions,
    )
    attention_output = cross_attention_outputs[0]
    outputs = outputs + cross_attention_outputs[1:-1]  # add cross attentions if we output attention weights

    # add cross-attn cache to positions 3,4 of present_key_value tuple
    cross_attn_present_key_value = cross_attention_outputs[-1]
    present_key_value = present_key_value + cross_attn_present_key_value
Member

Crazy that we had this block even though the model does not propagate encoder_hidden_states 🥵

Comment on lines 857 to 795
past_key_values=encoder_outputs.past_key_values,
past_key_values=None,
Member

Here we should remove it directly everywhere as well, since we don't use it anyway.

@zucchini-nlp (Member, Author)

Yeah, I have the same question. On one side, the code path should never have been used and isn't propagated from the Base/Task models. On the other side, we never know whether users found a way to exploit it by loading specific layers and reusing them.

I can add a proper deprecation if we think this is too aggressive, and remove everything in the next 2-3 releases. A bunch of unused code paths is just 😖

@ArthurZucker (Collaborator)

Very nice! In the era of unbloating, let's remove as much as we can, and redirect users to code on the Hub for the models that still need this?
We can keep this for one release though!
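For reference, "code on the Hub" here means shipping custom modeling code inside a model repository and having users opt into it explicitly; a minimal usage sketch (with a hypothetical repo id, not a real checkpoint) would look like:

# Hypothetical repo id; a real checkpoint needing the removed behaviour would ship
# its own modeling file in the repository.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "username/model-with-custom-code",  # placeholder, not a real checkpoint
    trust_remote_code=True,             # required to execute modeling code stored in the repo
)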

@zucchini-nlp force-pushed the clean-models-no-generation branch from 9ac4b2c to 0bda5f1 on June 20, 2025
@zucchini-nlp changed the title from "Don't use cache in non-generative models" to "🚨 Don't use cache in non-generative models" on Jun 20, 2025
@zucchini-nlp (Member, Author) commented Jun 20, 2025

@ArthurZucker @Cyrilvallez I added deprecate_kwarg (until the v4.54 release) in all forward calls. The failing test is not related.
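For readers following along, the decorator usage presumably looks roughly like the sketch below; the class is a placeholder, the kwarg name and version are inferred from this comment rather than copied from the diff, and the helper is assumed to be importable from transformers.utils.deprecation:

# Placeholder module, not a real transformers class; shows the general pattern of
# deprecating a no-longer-used kwarg instead of deleting it outright.
from transformers.utils.deprecation import deprecate_kwarg


class SomeNonGenerativeLayer:
    @deprecate_kwarg("past_key_value", version="4.54")  # warn until v4.54, then drop
    def forward(self, hidden_states, attention_mask=None, **kwargs):
        # A caller passing past_key_value now gets a deprecation warning; the value
        # is swallowed by **kwargs and never used.
        return hidden_states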

@ArthurZucker (Collaborator) left a comment

😮‍💨 finally getting rid of this old code!

@zucchini-nlp enabled auto-merge (squash) on July 1, 2025
@zucchini-nlp merged commit e435574 into huggingface:main on Jul 1, 2025 (20 checks passed)
zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
* deprecate for 1 version

* style

* fix some tests

* fix esm

* skip for now, GC requires positional args but we have keyword args

* remove transpose for scores in modified models only

* skip fx trace tests