Nightly #3664
Conversation
* Enable FP8 + RL training for bf16 models

**Summary:** Enable FP8 + RL training using TorchAO for 1.33x faster training and 42% less model memory usage:
- We quantize the frozen base weights into fp8 and keep the LoRA adapters in bf16
- We leverage TorchAO's `Float8Tensor`, which calls into fbgemm's fp8 x fp8 rowwise matmul kernel
- For now, we need to do an offline quantization first, because vLLM doesn't support on-the-fly quantization for torchao yet (this is in progress: vllm-project/vllm#26327)

**Example usage:**

```
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-8B-Base",
    max_seq_length = 2048,
    load_in_4bit = False,
    fast_inference = True,
    max_lora_rank = 32,
    load_in_fp8 = True, # set this to True
)
# the rest is the same as before
model = FastLanguageModel.get_peft_model(...)
```

**Initial results:**

```
# fp8
{'train_runtime': 1725.4337, 'train_samples_per_second': 0.232, 'train_steps_per_second': 0.058, 'train_loss': 0.00015715716748673002, 'epoch': 0.01}

# bf16
{'train_runtime': 2297.8145, 'train_samples_per_second': 0.174, 'train_steps_per_second': 0.044, 'train_loss': 0.00016081033063528594, 'epoch': 0.01}
```

[Screenshot: fp8 vs bf16 training runs, https://github.com/user-attachments/assets/b6304afd-89e9-42b1-8064-775807e17b23]

Test script: https://gist.github.com/andrewor14/5b85119fae46845d07b608d420907423

**Requires:**
- pytorch/ao#3158 (torchao nightly or 0.15.0+)
- unslothai/unsloth-zoo#351

* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Update utils.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* _get_inference_mode_context_manager
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Update utils.py
* Update utils.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
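As context for the offline quantization step mentioned above, a minimal sketch of what that step could look like with torchao's `quantize_` API follows; the helper name is hypothetical and the exact config Unsloth applies internally may differ:

```
# Sketch only: offline per-row FP8 quantization with torchao (0.15.0+ / nightly).
# `quantize_base_weights_fp8` is a hypothetical helper, not Unsloth's code path.
import torch
from torchao.quantization import quantize_, Float8DynamicActivationFloat8WeightConfig
from torchao.quantization.granularity import PerRow

def quantize_base_weights_fp8(model: torch.nn.Module) -> torch.nn.Module:
    # Replace nn.Linear weights with torchao Float8Tensor using per-row scales;
    # matmuls then dispatch to fbgemm's fp8 x fp8 rowwise kernel on supported GPUs.
    quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))
    return model
```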
* make loading gpt-oss-BF16 faster (linked to unsloth-zoo PR #314)
* fix model loading and clean merged model directory
* revert default quant
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* revert mapper.py

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add 128x128 PerBlock FP8 + RL

**Summary:** Following #3440, this PR extends torchao FP8 + RL support to also handle 128x128 PerBlock granularity (in addition to PerRow).

**Example usage:**

```
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-8B-Base",
    max_seq_length = 2048,
    load_in_4bit = False,
    fast_inference = True,
    max_lora_rank = 32,
    load_in_fp8 = "block", # or "row" or True
)
```

**Initial results:** TBD

**Note:**
- Requires pytorch/ao#3370

* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
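For intuition about the new granularity, here is a plain-PyTorch sketch of 128x128 per-block scaling: one FP8 scale per 128x128 tile of the weight matrix rather than one per row. The function is illustrative only; the real implementation lives behind torchao's `Float8Tensor`:

```
import torch

def quantize_fp8_per_block(w: torch.Tensor, block: int = 128):
    """Illustrative 128x128 PerBlock FP8 quantization (not the torchao code)."""
    out_f, in_f = w.shape
    assert out_f % block == 0 and in_f % block == 0
    # View the weight as a grid of (block x block) tiles.
    tiles = w.reshape(out_f // block, block, in_f // block, block)
    # One scale per tile: map each tile's absmax onto FP8 e4m3's max value (448).
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12) / fp8_max
    w_fp8 = (tiles / scale).to(torch.float8_e4m3fn).reshape(w.shape)
    return w_fp8, scale.reshape(out_f // block, in_f // block)
```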
[pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* vllm sampling params fix
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* do not patch base_trainer
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* separate vllm fixes
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Apply suggestion from @danielhanchen
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks" (this reverts commit fbb98c5)
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks" (this reverts commit c64d5b4)
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks" (this reverts commit c156545)
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Revert: this reverts commit 6c47dc5.
* vllm sampling params fix
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* do not patch base_trainer
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* separate vllm fixes
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Fixup deletion
* Fix indentation
* revert to old style

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
**Summary of Changes** (gemini-code-assist): This pull request encompasses a series of maintenance and compatibility updates across the codebase. It ensures that the project remains aligned with external library changes.
**Summary of Changes** (gemini-code-assist): This pull request delivers a set of maintenance and compatibility enhancements across the codebase. Key updates include bumping a core dependency version and implementing a fix for vLLM integration.
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
💡 Codex Review
Here are some automated review suggestions for this pull request.
unsloth/models/qwen3_moe.py (Outdated)

```
          _flag_for_generation = self._flag_for_generation,
      )
-     hidden_states = residual + hidden_states
+     hidden_states += hidden_states
```
Preserve residual connection in Qwen3Moe attention path
When use_cache is enabled, Qwen3MoeDecoderLayer_fast_forward now does hidden_states += hidden_states instead of adding the saved residual. This doubles the self-attention output and removes the skip connection, so cached generation will produce incorrect activations compared to training/eager mode. It should add the stored residual back in this spot.
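For reference, the pattern both reviewers are pointing at is the standard pre-norm residual block; a self-contained sketch (not the actual Unsloth code) is:

```
import torch
import torch.nn as nn

class DecoderBlockSketch(nn.Module):
    """Minimal pre-norm block showing the intended skip connection."""
    def __init__(self, dim: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mixer = nn.Linear(dim, dim)  # stand-in for self-attention or the MoE MLP

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        residual = hidden_states                 # save the block input
        hidden_states = self.mixer(self.norm(hidden_states))
        return residual + hidden_states          # add the *saved* input back

# The flagged line, `hidden_states += hidden_states`, instead computes
# 2 * hidden_states and silently drops the saved residual.
out = DecoderBlockSketch()(torch.randn(2, 4, 8))
```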
unsloth/models/qwen3_moe.py (Outdated)

```
          self.mlp, hidden_states
      )
-     hidden_states = residual + hidden_states
+     hidden_states += hidden_states
```
Restore MoE residual addition in generation branch
In the same use_cache branch, the MoE block now applies hidden_states += hidden_states instead of combining with the pre-MLP residual. This removes the skip connection and doubles the MoE activations during cached generation, yielding different outputs from the non-cached path. It should add the stored residual instead of self-adding.
Code Review
This pull request includes several updates, likely part of a nightly build. It bumps the unsloth_zoo dependency version, fixes some typos in comments, and adds a patch for vLLM compatibility. However, I've found two critical bugs in unsloth/models/qwen3_moe.py where residual connections are incorrectly implemented, causing the hidden states to be doubled instead of having the residual added. These need to be fixed to ensure correct model behavior.
unsloth/models/qwen3_moe.py (Outdated)

```
          _flag_for_generation = self._flag_for_generation,
      )
-     hidden_states = residual + hidden_states
+     hidden_states += hidden_states
```
This line seems to have a typo. It should be hidden_states += residual to add the residual connection back. Currently, it's hidden_states += hidden_states, which doubles the hidden_states tensor. This is likely not the intended behavior for a residual connection.
Suggested change:

```
- hidden_states += hidden_states
+ hidden_states += residual
```
unsloth/models/qwen3_moe.py (Outdated)

```
          self.mlp, hidden_states
      )
-     hidden_states = residual + hidden_states
+     hidden_states += hidden_states
```
**Summary of Changes** (gemini-code-assist): This pull request delivers a set of maintenance and compatibility updates across the Unsloth library. It ensures that dependencies are current and addresses a specific compatibility challenge.
Code Review
The pull request primarily focuses on version updates, minor bug fixes, and logging improvements across various model implementations. A new vLLM compatibility patch has been added to import_fixes.py to handle GuidedDecodingParams renaming. Several model files (falcon_h1.py, gemma.py, gemma2.py, llama.py, mistral.py, qwen2.py, qwen3.py, qwen3_moe.py) include fixes for a typo in comments. The pyproject.toml file has been updated to reflect a new unsloth_zoo version. Logging in unsloth/models/rl.py has been enhanced to provide more detailed messages during trainer patching. A critical bug was identified in unsloth/models/qwen3_moe.py where residual connections are incorrectly applied.
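Regarding the GuidedDecodingParams rename, a compatibility shim of the general shape described might look like the sketch below; the fallback name `StructuredOutputsParams` is an assumption about the rename, and the actual code in `import_fixes.py` may differ:

```
# Hypothetical shim for vLLM's GuidedDecodingParams rename; illustrative only.
try:
    from vllm.sampling_params import GuidedDecodingParams
except ImportError:
    # Assumption: newer vLLM exposes this under a structured-outputs name.
    from vllm.sampling_params import StructuredOutputsParams as GuidedDecodingParams
```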
unsloth/models/qwen3_moe.py (Outdated)

```
          _flag_for_generation = self._flag_for_generation,
      )
-     hidden_states = residual + hidden_states
+     hidden_states += hidden_states
```
This line appears to be a logical error. hidden_states += hidden_states will double the hidden_states value, effectively computing 2 * hidden_states. Based on the previous line residual = hidden_states, it seems the intention was to add the residual connection, i.e., hidden_states = residual + hidden_states or hidden_states += residual.
Suggested change:

```
- hidden_states += hidden_states
+ hidden_states += residual
```
unsloth/models/qwen3_moe.py (Outdated)

```
          self.mlp, hidden_states
      )
-     hidden_states = residual + hidden_states
+     hidden_states += hidden_states
```
This line appears to be a logical error. hidden_states += hidden_states will double the hidden_states value, effectively computing 2 * hidden_states. Based on the previous line residual = hidden_states, it seems the intention was to add the residual connection, i.e., hidden_states = residual + hidden_states or hidden_states += residual.
Suggested change:

```
- hidden_states += hidden_states
+ hidden_states += residual
```
Code Review
This pull request includes nightly updates, primarily version bumps and a new fix for vLLM guided decoding. While most changes are improvements, I've identified two critical bugs in unsloth/models/qwen3_moe.py related to incorrect residual connections. These have been commented on with suggested fixes. Other changes include better debugging logs and minor code refinements.
unsloth/models/qwen3_moe.py (Outdated)

```
          _flag_for_generation = self._flag_for_generation,
      )
-     hidden_states = residual + hidden_states
+     hidden_states += hidden_states
```
This appears to be a bug in the residual connection. The code was changed from hidden_states = residual + hidden_states to hidden_states += hidden_states, which is equivalent to hidden_states = 2 * hidden_states. This doubles the attention output instead of adding the residual. It should be hidden_states += residual.
Suggested change:

```
- hidden_states += hidden_states
+ hidden_states += residual
```
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
[pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
Revert: this reverts commit 9bf82bb.
[pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
Revert: this reverts commit 1146769.
[pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
Revert: this reverts commit c427be9.