
Conversation

@mmathew23
Collaborator

transformers==4.52.x introduced a GradientCheckpointingLayer and refactored the checkpointing logic. This PR adjusts FastLanguageModel to bypass our custom checkpoint logic when the decoder layers are instances of GradientCheckpointingLayer.
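For context, a minimal sketch of the kind of guard this implies (the helper name `apply_custom_checkpointing` and the `model.model.layers` access path are assumptions for illustration, not the actual Unsloth code; the import path is the one used in transformers>=4.52 and may change between versions):

```python
try:
    from transformers.modeling_layers import GradientCheckpointingLayer
except ImportError:
    # Older transformers (<4.52) do not define GradientCheckpointingLayer.
    GradientCheckpointingLayer = None


def maybe_patch_checkpointing(model, apply_custom_checkpointing):
    # Typical layout for Llama-style models; the real code may resolve this differently.
    decoder_layers = model.model.layers

    if GradientCheckpointingLayer is not None and isinstance(
        decoder_layers[0], GradientCheckpointingLayer
    ):
        # transformers>=4.52 already routes activation checkpointing through
        # GradientCheckpointingLayer, so skip the custom patch and let the
        # library handle it.
        return model

    # Otherwise fall back to the custom checkpointing logic.
    return apply_custom_checkpointing(model)
```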

I have 3 notebooks to compare.

transformers==4.51.3
https://colab.research.google.com/drive/19tEe55Z-b3oz61S6R5diBoHAxHZVnOlp?usp=sharing
transformers==4.52.4
https://colab.research.google.com/drive/1IQZGdYYoF73NqG3WtO_ar7rfunllWGSr?usp=sharing
transformers==4.52.4 + checkpointing fix
https://colab.research.google.com/drive/1nfA9Sc20lBAbTzmo2S2VgAW-P6ZkOkMq?usp=sharing

As you can see, in the second notebook the loss remains the same but training takes longer. In the final notebook, both loss and speed match the original transformers.

@danielhanchen danielhanchen merged commit 1a1b51c into unslothai:main Jun 3, 2025
