Conversation

@ydshieh ydshieh commented Aug 27, 2025

What does this PR do?

The CI job running the tests in Qwen2MoeIntegrationTest is killed from time to time because it hits the CPU memory limit (60 GB); it is unclear to me why the memory usage differs between runs.

This PR simply reuses the same model, set up once in the first test, across all tests except test_model_a2_7b_long_prompt_flash_attn, which loads the model with flash_attn.
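The load-once, reuse-everywhere idea can be sketched with a class-level cache on the test class. This is a minimal illustration, not the PR's actual code: the load_model helper and the test method bodies below are hypothetical stand-ins for the expensive from_pretrained call and the real integration tests.

```python
import unittest

LOAD_CALLS = 0  # counts how many times the expensive load actually runs

def load_model(name):
    # Hypothetical stand-in for AutoModelForCausalLM.from_pretrained(name):
    # expensive in CPU memory, so it should run exactly once per test class.
    global LOAD_CALLS
    LOAD_CALLS += 1
    return {"name": name}

class Qwen2MoeIntegrationTest(unittest.TestCase):
    model = None  # shared across all tests in the class

    @classmethod
    def get_model(cls):
        # Lazily load on first use; every later test reuses the same
        # object, keeping peak CPU memory at a single model copy.
        if cls.model is None:
            cls.model = load_model("Qwen/Qwen1.5-MoE-A2.7B")
        return cls.model

    def test_model_logits(self):
        self.assertIsNotNone(self.get_model())

    def test_model_generation(self):
        self.assertIs(self.get_model(), type(self).model)

# Both lookups return the same cached instance; the loader ran once.
shared = Qwen2MoeIntegrationTest.get_model()
again = Qwen2MoeIntegrationTest.get_model()
```

A test that needs a differently-configured model (here, the flash_attn one) simply bypasses the cache and loads its own copy.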

@ydshieh ydshieh requested a review from zucchini-nlp August 27, 2025 13:45
out = model(input_ids).logits.float().cpu()
# Expected mean on dim = -1
-EXPECTED_MEAN = torch.tensor([[-4.2125, -3.6416, -4.9136, -4.3005, -4.9938, -3.4393, -3.5195, -4.1621]])
+EXPECTED_MEAN = torch.tensor([[-4.2106, -3.6411, -4.9111, -4.2840, -4.9950, -3.4438, -3.5262, -4.1624]])
ydshieh (Collaborator, Author) commented:

Changed because we now use fp16 (previously fp32).
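As a rough illustration (not the PR's code) of why the golden values shift: round-tripping a float through IEEE 754 half precision perturbs it around the third decimal place, so expected means recorded under fp32 no longer match exactly once the model runs in fp16.

```python
import struct

def to_fp16(x: float) -> float:
    # Round-trip a Python float through IEEE 754 half precision
    # (struct format 'e' is binary16).
    return struct.unpack('e', struct.pack('e', x))[0]

# Two of the fp32 golden values from the old EXPECTED_MEAN
vals_fp32 = [-4.2125, -3.6416]
vals_fp16 = [to_fp16(v) for v in vals_fp32]

# fp16 keeps roughly 3 significant decimal digits, so the round trip
# drifts slightly -- enough to break exact golden-value comparisons.
drift = [abs(a - b) for a, b in zip(vals_fp32, vals_fp16)]
```

In the real model the drift compounds through every fp16 matmul, which is why the whole EXPECTED_MEAN tensor had to be re-recorded rather than adjusted elementwise.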

@github-actions (Contributor) commented:

[For maintainers] Suggested jobs to run (before merge)

run-slow: qwen2_moe

def test_speculative_generation(self):
    EXPECTED_TEXT_COMPLETION = (
-        "To be or not to be, that is the question.\nThe answer is to be, of course. But what does it"
+        "To be or not to be, that is the question. Whether 'tis nobler in the mind to suffer the sl"
ydshieh (Collaborator, Author) commented:

The previous expected value never passed.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

zucchini-nlp (Member) left a comment:

Thanks

assistant_model = model
assistant_model.generation_config.num_assistant_tokens = 2
assistant_model.generation_config.num_assistant_tokens_schedule = "constant"
generated_ids = model.generate(input_ids, max_new_tokens=4, temperature=0)
A reviewer (Member) commented:
another test where assistant_model is not actually used 😄
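The reviewer's point: the snippet configures attributes on assistant_model but never hands it to generate, so no assisted (speculative) decoding actually happens; in transformers, the draft model must be passed explicitly via the assistant_model argument of generate. A minimal sketch with a hypothetical FakeModel (not the real transformers API) showing that the keyword argument, not the attribute setup, is what activates the assistant:

```python
class FakeModel:
    # Hypothetical stand-in for a causal LM with a generate() method;
    # it only records whether an assistant model was actually supplied.
    def __init__(self):
        self.generation_config = type("Cfg", (), {})()

    def generate(self, input_ids, max_new_tokens=4, assistant_model=None, **kw):
        # Assisted decoding is only triggered when an assistant model
        # is passed in explicitly -- configuring one elsewhere is inert.
        return {"used_assistant": assistant_model is not None}

model = FakeModel()
assistant_model = model
assistant_model.generation_config.num_assistant_tokens = 2  # configured...

out_without = model.generate([1, 2, 3])  # ...but never passed: not used
out_with = model.generate([1, 2, 3], assistant_model=assistant_model)
```

Note also that here assistant_model is the model itself, so even if it were passed, the "draft" model would be as slow as the target; a genuine speculative setup uses a smaller draft model.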

@ydshieh ydshieh merged commit 6350636 into main Aug 27, 2025
19 checks passed
@ydshieh ydshieh deleted the fix_qwen2_moe branch August 27, 2025 14:22

4 participants