Feat/support multimodal embedding #29115
Conversation
# Conflicts:
#   api/models/dataset.py

# Conflicts:
#   api/tests/test_containers_integration_tests/tasks/test_add_document_to_index_task.py
crazywoola left a comment:
See comments
# Conflicts:
#   api/controllers/console/datasets/datasets.py
#   api/controllers/console/datasets/datasets_segments.py
#   api/controllers/console/datasets/hit_testing_base.py
api/core/rag/index_processor/processor/paragraph_index_processor.py
…g' into feat/support-multimodal-embedding
db.session.add(binding)
db.session.commit()

# save vector index
This segment should not be included in the if block.
Chenyl-Sai left a comment:
Please see this bug.
Summary
Why Build a Multimodal Knowledge Base
As enterprises increasingly rely on internal knowledge systems, the need to retrieve information from large volumes of heterogeneous files continues to grow. These materials span many formats, including text, images, documents, video, and audio: for example, product photos, illustrated manuals, or reports that mix text and graphics.
Traditional embedding models can vectorize certain types of data, but they are usually limited to a single modality. This limitation forces organizations into one of two suboptimal workarounds:
• Building complex cross-modality pipelines, embedding each modality separately and then manually fusing results;
• Restricting applications to a single modality, leaving most of the data’s value untapped.
Furthermore, for content that naturally combines modalities, such as documents containing both text and images, traditional models struggle to capture the deep relationships between the modalities, resulting in incomplete understanding.
For these reasons, multimodal embeddings have become essential for enterprises seeking to enhance data comprehension, unify data processing workflows, and overcome the constraints of single-modality systems.
⸻
Supported Multimodal Embedding & Rerank Models (First Release)
AWS Bedrock
1. nova-2-multimodal-embeddings-v1:0
Google Vertex AI
1. multimodalembedding@001
Jina
1. jina-embeddings-v4
2. jina-clip-v1
3. jina-clip-v2
4. jina-reranker-m0 (rerank model)
Tongyi (Alibaba Cloud)
1. multimodal-embedding-v1
Screenshots
[screenshots omitted]
PR for plugin: langgenius/dify-plugin-daemon#503
PR for SDK: langgenius/dify-plugin-sdks#237
Checklist
I've run dev/reformat (backend) and cd web && npx lint-staged (frontend) to appease the lint gods.