🚨 Don't use cache in non-generative models #38751
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Yes, great cleanup! Ping me whenever it's ready and you want a review.
@Rocketknight1 ready for review! One thing to note: I didn't deprecate
@zucchini-nlp I think it's okay! I really hope people weren't passing
Rocketknight1 left a comment
Looks good! It's a really nice cleanup. I made one comment, but we should also definitely run slow tests for some of these models before merging 😅
Cyrilvallez left a comment
Hey @zucchini-nlp! Could you provide a bit more detail about why we remove the cross attention and positional embeddings completely everywhere, please? 🤗 It is not obvious to me, because at first glance they look like they were used at least sometimes, no?
run-slow: align,wav2vec2,layoutlm,clap
This comment contains run-slow, running the specified jobs: models: ['models/align', 'models/clap', 'models/layoutlm', 'models/wav2vec2']
Hey @Cyrilvallez, do you have any comments to address?
Cyrilvallez left a comment
Hey! I do like it, but the PR has the potential to be quite breaking in different ways:
- the position embedding type could be set in some configs on the Hub
- some models present in the main init see themselves change directly (cross attention/head mask), and even if they are part of bigger models, they are still public classes. E.g. for `align`, there are no public classes using `encoder_hidden_states`, so it would be straightforward to remove them everywhere as you did; but in `altclip`, they flow correctly from `AltCLIPTextModel`, which is public, and only the main model `AltCLIPModel` does not propagate them (so removing them is not directly breaking when using `AltCLIPModel`, but it is if using the public submodel `AltCLIPTextModel` 🥲)
- even method signatures in fully internal classes can be breaking sometimes (though in this instance I wouldn't worry about it)

In general, I really like the changes because they clean up a lot of nonsense in those modelings, but we need to be a bit wary of the potential implications here.

cc @ArthurZucker here for an opinion on whether we want to be aggressive in favor of simplification here, or if we want to do it through a deprecation cycle for public classes (but once again, even if public, they are building blocks of the real, bigger models)
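As a minimal sketch of the concern in the second bullet above (not code from the PR): the public `AltCLIPTextModel` accepted `encoder_hidden_states`, so user code like the following would start failing with a `TypeError` if the argument is removed outright rather than deprecated. The config values are arbitrary small ones for illustration, and the snippet assumes a transformers version from before this PR.

```python
import torch
from transformers import AltCLIPTextConfig, AltCLIPTextModel

# Tiny config so the model instantiates quickly; values are illustrative only.
config = AltCLIPTextConfig(
    hidden_size=32, num_hidden_layers=2, num_attention_heads=2,
    intermediate_size=64, project_dim=16,
)
model = AltCLIPTextModel(config)

input_ids = torch.randint(0, config.vocab_size, (1, 8))
encoder_states = torch.randn(1, 4, config.hidden_size)  # hypothetical external features

# Before this PR the kwarg existed in the public signature (even if a non-decoder
# model ultimately ignored it); after an outright removal this call raises
# `TypeError: forward() got an unexpected keyword argument 'encoder_hidden_states'`.
outputs = model(input_ids=input_ids, encoder_hidden_states=encoder_states)
```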
```python
if self.is_decoder and encoder_hidden_states is not None:
    if not hasattr(self, "crossattention"):
        raise ValueError(
            f"If `encoder_hidden_states` are passed, {self} has to be instantiated with cross-attention layers"
            " by setting `config.add_cross_attention=True`"
        )

    # cross_attn cached key/values tuple is at positions 3,4 of past_key_value tuple
    cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None
    cross_attention_outputs = self.crossattention(
        attention_output,
        attention_mask,
        head_mask,
        encoder_hidden_states,
        encoder_attention_mask,
        cross_attn_past_key_value,
        output_attentions,
    )
    attention_output = cross_attention_outputs[0]
    outputs = outputs + cross_attention_outputs[1:-1]  # add cross attentions if we output attention weights

    # add cross-attn cache to positions 3,4 of present_key_value tuple
    cross_attn_present_key_value = cross_attention_outputs[-1]
    present_key_value = present_key_value + cross_attn_present_key_value
```
Crazy that we had this block while the model does not propagate `encoder_hidden_states` 🥵
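For context on the slicing in the removed block above, here is a small sketch of the legacy tuple cache layout it assumed (shapes are illustrative, not taken from the PR): each layer's `past_key_value` was a 4-tuple with the self-attention key/values first and the cross-attention key/values at positions 3,4, which is why the code slices `past_key_value[-2:]`.

```python
import torch

batch, heads, past_len, src_len, head_dim = 1, 2, 5, 7, 8  # arbitrary example shapes

# Legacy per-layer cache: (self-attn K, self-attn V, cross-attn K, cross-attn V)
past_key_value = (
    torch.zeros(batch, heads, past_len, head_dim),  # self-attention keys
    torch.zeros(batch, heads, past_len, head_dim),  # self-attention values
    torch.zeros(batch, heads, src_len, head_dim),   # cross-attention keys (encoder length)
    torch.zeros(batch, heads, src_len, head_dim),   # cross-attention values (encoder length)
)

# The slice taken by the removed code: the last two entries are the cross-attention cache.
cross_attn_past_key_value = past_key_value[-2:]
```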
```diff
- past_key_values=encoder_outputs.past_key_values,
+ past_key_values=None,
```
Here we should remove it directly everywhere as well if we don't use it anyway
Yeah, I have the same question. On one hand, the code path should never have been used and isn't propagated from the Base/Task models. On the other hand, we never know if users found a way to exploit it by loading specific layers and re-using them. I can add a proper deprecation if we think this is too aggressive, and remove everything in the next 2-3 releases. A bunch of unused code paths is just 😖
Very nice! In the era of unbloating, let's remove as much as we can, and redirect users to code on the Hub for the ones that still need this?
Force-pushed from 9ac4b2c to 0bda5f1
@ArthurZucker @Cyrilvallez added
ArthurZucker left a comment
😮💨 finally getting rid of this old code!
* deprecate for 1 version
* style
* fix some tests
* fix esm
* skip for now, GC requires positional args but we have keyword args
* remove transpose for scores in modified models only
* skip fx trace tests
What does this PR do?
When working on #38635, I found that there are some models which have `past_key_values` in their signature, even though they cannot generate. The reason is that these models were all copying from Bert. This PR cleans that up, changes the copy statement to point to the Align model, and adds support for the new attention API in all those models.
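As a quick, version-dependent check of the situation described above (not part of the PR itself), one can inspect whether a non-generative model still exposes cache- or cross-attention-related arguments in its forward signature. `LayoutLMModel` is used here only because it appears in the slow-test runs above; the expected output is an assumption based on the PR description.

```python
import inspect
from transformers import LayoutLMModel

params = inspect.signature(LayoutLMModel.forward).parameters
print([name for name in params if "past_key_value" in name or "encoder_hidden" in name])
# Expected to be non-empty on releases before this PR (the Bert-copied signature)
# and empty, or deprecated, afterwards.
```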