# Configuration File Reference
This comprehensive reference covers all configuration options available in QDrant Loader's YAML configuration files. QDrant Loader uses a multi-project configuration structure that allows you to organize multiple data sources into logical projects within a single workspace.
## 🎯 Overview
QDrant Loader supports YAML configuration files for managing settings in a structured, version-controllable format. The configuration follows a multi-project architecture where:
- **Global settings** apply to all projects (embedding, chunking, state management)
- **Project-specific settings** define data sources and project metadata
- **All projects** share the same QDrant collection for unified search
### Configuration Structure

```yaml
# Multi-project configuration structure
global:
  # Global settings for all projects
  qdrant: {}
  llm: {}
  chunking: {}
  state_management: {}
  file_conversion: {}

projects:
  # Project definitions
  project-1:
    project_id: "project-1"
    display_name: "Project One"
    description: "Project description"
    sources: {}

  project-2:
    project_id: "project-2"
    display_name: "Project Two"
    description: "Another project"
    sources: {}
```
### Configuration Priority

1. **Command-line arguments** (highest priority)
2. **Environment variables**
3. **Configuration file** ← This guide
4. **Default values** (lowest priority)
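For example, the same setting can be supplied at several levels at once, and the highest-priority source wins. A minimal illustration, assuming the environment variable names used throughout this guide:

```bash
# config.yaml contains: url: "http://localhost:6333" (file-level value)
# An environment variable overrides the file value for this run:
export QDRANT_URL="https://qdrant.internal.example.com:6333"

# Any command-line arguments override both the environment and the file
qdrant-loader init --workspace .
```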
## 📁 Basic Configuration File
### Minimal Configuration

```yaml
# config.yaml - Minimal multi-project configuration
global:
  qdrant:
    url: "http://localhost:6333"
    collection_name: "documents"

  # New unified LLM configuration (provider-agnostic)
  llm:
    provider: ${LLM_PROVIDER}   # openai | ollama | openai_compat | custom
    base_url: ${LLM_BASE_URL}   # e.g. https://api.openai.com/v1, http://localhost:11434/v1
    api_key: ${LLM_API_KEY}     # optional for local providers like Ollama
    models:
      embeddings: ${LLM_EMBEDDING_MODEL}  # e.g. text-embedding-3-small | nomic-embed-text | bge-small-en-v1.5
      chat: ${LLM_CHAT_MODEL}             # e.g. gpt-4o-mini | llama3.1:8b-instruct
    tokenizer: cl100k_base      # cl100k_base | none
    embeddings:
      vector_size: 1536         # set according to chosen embedding model

projects:
  my-project:
    project_id: "my-project"
    display_name: "My Project"
    description: "Basic project setup"
    sources:
      git:
        my-repo:
          base_url: "https://github.com/user/repo.git"
          branch: "main"
          token: "${REPO_TOKEN}"
          file_types:
            - "*.md"
            - "*.txt"
```
### Complete Configuration Template

```yaml
# config.yaml - Complete multi-project configuration template (abridged)
global:
  qdrant:
    url: "http://localhost:6333"
    api_key: "${QDRANT_API_KEY}"
    collection_name: "documents"

  llm:
    provider: ${LLM_PROVIDER}
    base_url: ${LLM_BASE_URL}
    api_key: ${LLM_API_KEY}
    api_version: ${LLM_API_VERSION}  # Required for azure_openai
    headers: {}
    models:
      embeddings: ${LLM_EMBEDDING_MODEL}
      chat: ${LLM_CHAT_MODEL}
    tokenizer: cl100k_base
    request:
      timeout_s: 30
      max_retries: 3
      backoff_s_min: 1
      backoff_s_max: 30
    rate_limits:
      rpm: 1800
      tpm: 2000000
      concurrency: 5
    embeddings:
      vector_size: 1536
    provider_options:
      azure_endpoint: ${AZURE_OPENAI_ENDPOINT}
      native_endpoint: auto
```
> Azure OpenAI notes:
> - Use the resource root for `base_url`, e.g. `https://<resource>.openai.azure.com` (do not include `/openai/deployments/...`).
> - Set `api_version` (e.g., `2024-05-01-preview`).
> - `models.embeddings` and `models.chat` should be Azure deployment names.
> Ollama notes:
> - If your server exposes OpenAI-compatible `/v1`, set `base_url` to include `/v1` and we will call `/v1/embeddings` and parse `data[].embedding`.
> - In native mode (no `/v1`), we try batch `POST /api/embed` first and parse `{ "embeddings": [[...], ...] }`. If unsupported (404/405/501), we fall back to per-item `POST /api/embeddings` and parse `{ "embedding": [...] }` or `{ "data": { "embedding": [...] } }`.
> - You can override detection with `global.llm.provider_options.native_endpoint: embed | embeddings | auto` (default: auto).
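To see the two native response shapes side by side, you can query an Ollama server directly. A sketch, assuming a local server on the default port with the `nomic-embed-text` model pulled:

```bash
# Batch endpoint (tried first in native mode); returns {"embeddings": [[...], ...]}
curl -s http://localhost:11434/api/embed \
  -d '{"model": "nomic-embed-text", "input": ["first text", "second text"]}'

# Per-item fallback; returns {"embedding": [...]}
curl -s http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "first text"}'
```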
### Troubleshooting

- **Azure 404 or 400 with deployment path** (see the sketch after this list)
  - Symptom: 404 Not Found or confusing errors when calling embeddings/chat
  - Cause: `base_url` incorrectly includes `/openai/deployments/...`
  - Fix: Use the resource root `https://<resource>.openai.azure.com`, set `models.*` to deployment names, and include `api_version` (e.g., `2024-05-01-preview`).
- **Ollama empty or invalid embeddings**
  - Symptom: Empty vectors or key errors parsing the response
  - Cause: Server endpoint mismatch (`/api/embed` vs `/api/embeddings`)
  - Fix: Set `provider_options.native_endpoint` to `embed` or `embeddings`, or rely on auto-detection.
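A minimal sketch of a working Azure OpenAI block, using hypothetical deployment names (`my-embedding-deployment`, `my-chat-deployment`) in place of real ones:

```yaml
global:
  llm:
    provider: "azure_openai"
    base_url: "https://<resource>.openai.azure.com"  # resource root only, no /openai/deployments/...
    api_key: "${LLM_API_KEY}"
    api_version: "2024-05-01-preview"
    models:
      embeddings: "my-embedding-deployment"  # Azure deployment name, not a model name
      chat: "my-chat-deployment"
```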
The template continues with chunking, state management, file conversion, and the project definitions:

```yaml
# config.yaml - Complete template, continued
global:
  chunking:
    chunk_size: 1500
    chunk_overlap: 200
    max_chunks_per_document: 500
    strategies:
      default:
        min_chunk_size: 100
        enable_semantic_analysis: true
        enable_entity_extraction: true
      html:
        simple_parsing_threshold: 100000
        max_html_size_for_parsing: 500000
        max_sections_to_process: 200
        max_chunk_size_for_nlp: 20000
        preserve_semantic_structure: true
      code:
        max_file_size_for_ast: 75000
        max_elements_to_process: 800
        max_recursion_depth: 8
        max_element_size: 20000
        enable_ast_parsing: true
        enable_dependency_analysis: true
      json:
        max_json_size_for_parsing: 1000000
        max_objects_to_process: 200
        max_chunk_size_for_nlp: 20000
        max_recursion_depth: 5
        max_array_items_per_chunk: 50
        max_object_keys_to_process: 100
        enable_schema_inference: true
      markdown:
        min_content_length_for_nlp: 100
        min_word_count_for_nlp: 20
        min_line_count_for_nlp: 3
        min_section_size: 500
        max_chunks_per_section: 1000
        max_overlap_percentage: 0.25
        max_workers: 4
        estimation_buffer: 0.2
        words_per_minute_reading: 200
        header_analysis_threshold_h1: 3
        header_analysis_threshold_h3: 8
        enable_hierarchical_metadata: true

  semantic_analysis:
    num_topics: 3
    lda_passes: 10
    spacy_model: "en_core_web_md"

  state_management:
    database_path: "${STATE_DB_PATH}"
    table_prefix: "qdrant_loader_"
    connection_pool:
      size: 5
      timeout: 30

  file_conversion:
    max_file_size: 52428800
    conversion_timeout: 300
    markitdown:
      enable_llm_descriptions: false
      llm_model: "gpt-4o"
      llm_endpoint: "https://api.openai.com/v1"
      llm_api_key: "${OPENAI_API_KEY}"

projects:
  docs-project:
    project_id: "docs-project"
    display_name: "Documentation Project"
    description: "Company documentation and guides"
    sources:
      git:
        docs-repo:
          base_url: "https://github.com/company/docs.git"
          branch: "main"
          include_paths:
            - "docs/**"
            - "README.md"
          exclude_paths:
            - "docs/archive/**"
          file_types:
            - "*.md"
            - "*.rst"
            - "*.txt"
          max_file_size: 1048576
          depth: 1
          token: "${DOCS_REPO_TOKEN}"
          enable_file_conversion: true
      confluence:
        company-wiki:
          base_url: "https://company.atlassian.net/wiki"
          deployment_type: "cloud"
          space_key: "DOCS"
          content_types:
            - "page"
            - "blogpost"
          include_labels: []
          exclude_labels: []
          token: "${CONFLUENCE_TOKEN}"
          email: "${CONFLUENCE_EMAIL}"
          enable_file_conversion: true
          download_attachments: true

  support-project:
    project_id: "support-project"
    display_name: "Support Knowledge Base"
    description: "Customer support documentation and tickets"
    sources:
      jira:
        support-tickets:
          base_url: "https://company.atlassian.net"
          deployment_type: "cloud"
          project_key: "SUPPORT"
          requests_per_minute: 60
          page_size: 100
          issue_types:
            - "Bug"
            - "Story"
            - "Task"
          include_statuses:
            - "Open"
            - "In Progress"
            - "Done"
          token: "${JIRA_TOKEN}"
          email: "${JIRA_EMAIL}"
          enable_file_conversion: true
          download_attachments: true
      localfile:
        support-docs:
          base_url: "file:///path/to/support/docs"
          include_paths:
            - "**/*.md"
            - "**/*.pdf"
          exclude_paths:
            - "tmp/**"
            - "archive/**"
          file_types:
            - "*.md"
            - "*.pdf"
            - "*.txt"
          max_file_size: 52428800
          enable_file_conversion: true
```
## 🔧 Detailed Configuration Sections
### Global Configuration (`global`)
#### QDrant Database Configuration

```yaml
global:
  qdrant:
    # Required: URL of your QDrant database instance
    url: "http://localhost:6333"

    # Optional: API key for QDrant Cloud or secured instances
    api_key: "${QDRANT_API_KEY}"

    # Required: Name of the QDrant collection (shared by all projects)
    collection_name: "documents"
```
**Required Fields:**

- `url` - QDrant instance URL
- `collection_name` - Collection name for all projects

**Optional Fields:**

- `api_key` - API key for QDrant Cloud (use an environment variable)
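For QDrant Cloud, the same block typically points at your cluster URL and supplies the API key from the environment; a sketch with a hypothetical cluster URL:

```yaml
global:
  qdrant:
    url: "https://your-cluster.cloud.qdrant.io:6333"  # hypothetical cluster URL
    api_key: "${QDRANT_API_KEY}"
    collection_name: "documents"
```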
#### LLM Configuration (Unified)

```yaml
global:
  llm:
    # Required: Provider and endpoint
    provider: ${LLM_PROVIDER}
    base_url: ${LLM_BASE_URL}
    api_key: ${LLM_API_KEY}

    # Required: Models
    models:
      embeddings: ${LLM_EMBEDDING_MODEL}
      chat: ${LLM_CHAT_MODEL}

    # Optional: Tokenizer and policies
    tokenizer: cl100k_base
    request:
      timeout_s: 30
      max_retries: 3
    embeddings:
      vector_size: 1536
```
> Deprecation: `global.embedding.*` is still honored but will be removed in a future release. Migrate to `global.llm.*`.
**Required Fields:**

- `provider` - LLM provider (`openai`, `azure_openai`, `ollama`, `openai_compat`)
- `base_url` - API endpoint URL
- `models.embeddings` - Embedding model name
- `models.chat` - Chat model name (optional, for LLM-enabled features)

**Optional Fields:**

- `api_key` - API key (use an environment variable; required for most providers)
- `api_version` - API version (required for `azure_openai`)
- `headers` - Custom HTTP headers
- `tokenizer` - Tokenizer for token counting (`cl100k_base`, `none`)
- `request` - Request policy settings (timeout, retries, backoff)
- `rate_limits` - Rate limiting configuration (rpm, tpm, concurrency)
- `embeddings.vector_size` - Vector dimension size
- `provider_options` - Provider-specific options
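As a concrete example, a fully local setup against Ollama's OpenAI-compatible endpoint might look like this (a sketch; the model choice and its 768-dimension vector size are assumptions to adapt to your model):

```yaml
global:
  llm:
    provider: "ollama"
    base_url: "http://localhost:11434/v1"  # OpenAI-compatible /v1 endpoint
    # api_key omitted: optional for local providers like Ollama
    models:
      embeddings: "nomic-embed-text"
    tokenizer: "none"
    embeddings:
      vector_size: 768  # nomic-embed-text produces 768-dimensional vectors
```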
#### Chunking Configuration

```yaml
global:
  chunking:
    # Optional: Maximum size of text chunks in characters (default: 1500)
    chunk_size: 1500

    # Optional: Overlap between chunks in characters (default: 200)
    chunk_overlap: 200

    # Optional: Maximum chunks per document - safety limit (default: 500)
    max_chunks_per_document: 500

    # Optional: Strategy-specific configurations for different content types
    strategies:
      default:
        min_chunk_size: 100
        enable_semantic_analysis: true
        enable_entity_extraction: true
      html:
        simple_parsing_threshold: 100000
        max_html_size_for_parsing: 500000
        max_sections_to_process: 200
        max_chunk_size_for_nlp: 20000
        preserve_semantic_structure: true
      code:
        max_file_size_for_ast: 75000
        max_elements_to_process: 800
        max_recursion_depth: 8
        max_element_size: 20000
        enable_ast_parsing: true
        enable_dependency_analysis: true
      json:
        max_json_size_for_parsing: 1000000
        max_objects_to_process: 200
        max_chunk_size_for_nlp: 20000
        max_recursion_depth: 5
        max_array_items_per_chunk: 50
        max_object_keys_to_process: 100
        enable_schema_inference: true
      markdown:
        min_content_length_for_nlp: 100
        min_word_count_for_nlp: 20
        min_line_count_for_nlp: 3
        min_section_size: 500
        max_chunks_per_section: 1000
        max_overlap_percentage: 0.25
        max_workers: 4
        estimation_buffer: 0.2
        words_per_minute_reading: 200
        header_analysis_threshold_h1: 3
        header_analysis_threshold_h3: 8
        enable_hierarchical_metadata: true
```
**Strategy-Specific Configuration Details:**

The `strategies` section allows you to fine-tune how different content types are processed.

**Default Strategy (Text Files):**

- `min_chunk_size`: Prevents creation of very small chunks that may lack context
- `enable_semantic_analysis`: Controls topic extraction and semantic analysis
- `enable_entity_extraction`: Controls named entity recognition processing
**HTML Strategy:**

- `simple_parsing_threshold`: Files below this size use simple text extraction
- `max_html_size_for_parsing`: Maximum size for complex HTML parsing before fallback
- `max_sections_to_process`: Limits processing to prevent infinite loops on malformed HTML
- `max_chunk_size_for_nlp`: Maximum chunk size for semantic analysis
- `preserve_semantic_structure`: Maintains HTML structure relationships in chunks
**Code Strategy:**

- `max_file_size_for_ast`: Maximum file size for Abstract Syntax Tree parsing
- `max_elements_to_process`: Limits code elements to prevent excessive processing time
- `max_recursion_depth`: Controls AST traversal depth for nested code structures
- `max_element_size`: Maximum size for individual code elements (functions, classes)
- `enable_ast_parsing`: Toggles AST-based code analysis vs. simple text splitting
- `enable_dependency_analysis`: Extracts import/dependency relationships
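For instance, a repository dominated by very large generated files might disable AST parsing entirely and fall back to plain text splitting; a sketch with illustrative values rather than recommendations:

```yaml
global:
  chunking:
    strategies:
      code:
        enable_ast_parsing: false        # fall back to simple text splitting
        enable_dependency_analysis: false
        max_file_size_for_ast: 30000     # stricter ceiling if AST parsing is re-enabled
```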
**JSON Strategy:**

- `max_json_size_for_parsing`: Maximum JSON file size for structured parsing
- `max_objects_to_process`: Limits the number of JSON objects to prevent memory issues
- `max_chunk_size_for_nlp`: Maximum chunk size for semantic processing
- `max_recursion_depth`: Controls nested structure traversal depth
- `max_array_items_per_chunk`: Groups array items for better context
- `max_object_keys_to_process`: Limits object key processing for large objects
- `enable_schema_inference`: Automatically detects and extracts JSON schema information
**Markdown Strategy:**

- `min_content_length_for_nlp`: Minimum content length required for NLP processing
- `min_word_count_for_nlp`: Minimum word count required for semantic analysis
- `min_line_count_for_nlp`: Minimum line count required for complex text processing
- `min_section_size`: Minimum section size before merging with adjacent sections
- `max_chunks_per_section`: Safety limit to prevent runaway chunking on large sections
- `max_overlap_percentage`: Maximum overlap between adjacent chunks (0.25 = 25%)
- `max_workers`: Number of parallel threads for processing large documents
- `estimation_buffer`: Buffer factor for chunk count estimation accuracy
- `words_per_minute_reading`: Reading speed for estimated reading time calculation
- `header_analysis_threshold_h1`: H1 header count threshold for split level decisions
- `header_analysis_threshold_h3`: H3 header count threshold for split level decisions
- `enable_hierarchical_metadata`: Extracts section relationships and breadcrumb navigation
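Putting a few of these knobs together: with the defaults of `chunk_size: 1500` and `chunk_overlap: 200`, adjacent chunks overlap by roughly 13%, comfortably under the 25% markdown cap. A hypothetical tuning for shorter documents:

```yaml
global:
  chunking:
    chunk_size: 800      # smaller chunks for short articles
    chunk_overlap: 160   # 20% overlap, still under max_overlap_percentage: 0.25
    strategies:
      markdown:
        min_section_size: 300
        max_workers: 8   # more threads for large batches
```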
#### Semantic Analysis Configuration

```yaml
global:
  semantic_analysis:
    # Optional: Number of topics to extract using LDA (default: 3)
    num_topics: 3

    # Optional: Number of passes for LDA training (default: 10)
    lda_passes: 10

    # Optional: spaCy model for text processing (default: "en_core_web_md")
    spacy_model: "en_core_web_md"
```
#### State Management Configuration

```yaml
global:
  state_management:
    # Required: Path to SQLite database file
    # Supports environment variable expansion (e.g., $HOME, ${STATE_DB_PATH})
    # Special values: ":memory:" for in-memory database
    database_path: "${STATE_DB_PATH}"

    # Optional: Prefix for database tables (default: "qdrant_loader_")
    table_prefix: "qdrant_loader_"

    # Optional: Connection pool settings
    connection_pool:
      size: 5      # Maximum connections (default: 5)
      timeout: 30  # Connection timeout in seconds (default: 30)
```
**Path Validation Notes:**

- Supports environment variable expansion, including `$HOME`
- Automatically creates parent directories if they don't exist
- Validates directory permissions before use
- Special handling for in-memory databases (`:memory:`)
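Two common variations, shown as a sketch (the `$HOME` path is illustrative):

```yaml
global:
  state_management:
    # Expanded at load time; parent directories are created automatically
    database_path: "$HOME/.qdrant-loader/state.db"

    # Or, for throwaway runs that keep no state on disk:
    # database_path: ":memory:"
```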
#### File Conversion Configuration

```yaml
global:
  file_conversion:
    # Optional: Maximum file size for conversion in bytes (default: 50MB)
    # Range: 0 < max_file_size ≤ 100MB (104857600)
    max_file_size: 52428800

    # Optional: Timeout for conversion operations in seconds (default: 300)
    # Range: 0 < conversion_timeout ≤ 3600 seconds
    conversion_timeout: 300

    # Optional: MarkItDown specific settings
    markitdown:
      enable_llm_descriptions: false
      llm_model: "gpt-4o"
      llm_endpoint: "https://api.openai.com/v1"
      llm_api_key: "${OPENAI_API_KEY}"
```
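As an illustration, raising the size ceiling to its 100 MB maximum and turning on LLM-generated descriptions during conversion (values are examples, not recommendations):

```yaml
global:
  file_conversion:
    max_file_size: 104857600  # 100 MB, the upper bound of the allowed range
    markitdown:
      enable_llm_descriptions: true
      llm_model: "gpt-4o"
      llm_api_key: "${OPENAI_API_KEY}"
```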
### Project Configuration (`projects`)
#### Project Structure

```yaml
projects:
  project-id:
    # Unique project identifier
    project_id: "project-id"  # Must match the key above
    display_name: "Project Name"
    description: "Description"
    sources:
      # Project-specific data sources
      # Source configurations go here
```
#### Data Source Types
##### Git Repository Sources

```yaml
sources:
  git:
    source-name:
      base_url: "https://github.com/user/repo.git"
      token: "${REPO_TOKEN}"
      file_types:
        - "*.md"
        - "*.rst"
        - "*.txt"
      branch: "main"
      include_paths:
        - "docs/**"
        - "README.md"
      exclude_paths:
        - "node_modules/**"
        - ".git/**"
      max_file_size: 1048576
      depth: 1
      enable_file_conversion: true
```
##### Confluence Sources

```yaml
sources:
  confluence:
    source-name:
      base_url: "https://company.atlassian.net/wiki"
      deployment_type: "cloud"
      space_key: "DOCS"
      token: "${CONFLUENCE_TOKEN}"
      email: "${CONFLUENCE_EMAIL}"
      content_types:
        - "page"
        - "blogpost"
      include_labels: []
      exclude_labels: []
      enable_file_conversion: true
      download_attachments: true
```
##### JIRA Sources

```yaml
sources:
  jira:
    source-name:
      base_url: "https://company.atlassian.net"
      deployment_type: "cloud"
      project_key: "PROJ"
      token: "${JIRA_TOKEN}"
      email: "${JIRA_EMAIL}"
      requests_per_minute: 60
      page_size: 100
      issue_types:
        - "Bug"
        - "Story"
        - "Task"
      include_statuses:
        - "Open"
        - "In Progress"
        - "Done"
      enable_file_conversion: true
      download_attachments: true
```
##### Local File Sources

```yaml
sources:
  localfile:
    source-name:
      base_url: "file:///path/to/files"
      include_paths:
        - "docs/**"
        - "*.md"
      exclude_paths:
        - "tmp/**"
        - ".*"  # Hidden files
      file_types:
        - "*.md"
        - "*.pdf"
        - "*.txt"
      max_file_size: 1048576
      enable_file_conversion: true
```
##### Public Documentation Sources

```yaml
sources:
  publicdocs:
    source-name:
      base_url: "https://docs.example.com"
      version: "1.0"
      content_type: "html"  # Options: html, markdown, rst
      path_pattern: "/docs/{version}/**"
      exclude_paths:
        - "/api/**"
        - "/internal/**"
      selectors:
        content: "article.main-content"
        remove:
          - "nav"
          - "header"
          - "footer"
          - ".sidebar"
        code_blocks: "pre code"
      enable_file_conversion: true
      download_attachments: true
      attachment_selectors:
        - "a[href$='.pdf']"
        - "a[href$='.doc']"
        - "a[href$='.docx']"
        - "a[href$='.xls']"
        - "a[href$='.xlsx']"
        - "a[href$='.ppt']"
        - "a[href$='.pptx']"
```
## 🔧 Configuration Management
### Using Configuration Files
#### Workspace Mode (Recommended)

```bash
# Create workspace directory
mkdir my-workspace && cd my-workspace

# Download configuration templates
curl -o config.yaml https://raw.githubusercontent.com/martin-papy/qdrant-loader/main/packages/qdrant-loader/conf/config.template.yaml
curl -o .env https://raw.githubusercontent.com/martin-papy/qdrant-loader/main/packages/qdrant-loader/conf/.env.template

# Edit the configuration files, then use workspace mode
qdrant-loader init --workspace .
qdrant-loader ingest --workspace .
```
#### Traditional Mode

```bash
# Use specific configuration files
qdrant-loader --config /path/to/config.yaml --env /path/to/.env init
qdrant-loader --config /path/to/config.yaml --env /path/to/.env ingest
```
### Configuration Validation
#### Validate Configuration

```bash
# Display current configuration (shows all projects and their status)
qdrant-loader config --workspace .

# Note: Dedicated project management commands (list, status, validate)
# are not currently implemented. All project information is displayed
# through the config command.
```
#### Project Management

```bash
# View all configured projects and their information
qdrant-loader config --workspace .

# Note: Project-specific filtering (--project-id) is not currently
# supported in the CLI. All project information is shown together.
```
## 📋 Environment Variables
Configuration files support environment variable substitution using `${VARIABLE_NAME}` syntax:
### Required Environment Variables

```bash
# QDrant Configuration
QDRANT_URL=http://localhost:6333
QDRANT_COLLECTION_NAME=documents
QDRANT_API_KEY=your_api_key  # Optional: for QDrant Cloud

# LLM Configuration (new unified approach)
LLM_PROVIDER=openai
LLM_BASE_URL=https://api.openai.com/v1
LLM_API_KEY=your_openai_key
LLM_EMBEDDING_MODEL=text-embedding-3-small
LLM_CHAT_MODEL=gpt-4o-mini

# Legacy (still supported but deprecated)
OPENAI_API_KEY=your_openai_key

# State Management
STATE_DB_PATH=./data/qdrant-loader.db
```
### Source-Specific Environment Variables

```bash
# Git Repositories
REPO_TOKEN=your_github_token

# Confluence (Cloud)
CONFLUENCE_TOKEN=your_confluence_token
CONFLUENCE_EMAIL=your_email

# JIRA (Cloud)
JIRA_TOKEN=your_jira_token
JIRA_EMAIL=your_email
```
## 🎯 Configuration Examples
### Single Project Setup

```yaml
global:
  qdrant:
    url: "${QDRANT_URL}"
    collection_name: "${QDRANT_COLLECTION_NAME}"

  llm:
    provider: "openai"
    base_url: "https://api.openai.com/v1"
    api_key: "${LLM_API_KEY}"
    models:
      embeddings: "text-embedding-3-small"
      chat: "gpt-4o-mini"
    embeddings:
      vector_size: 1536

  state_management:
    database_path: "${STATE_DB_PATH}"

projects:
  main:
    project_id: "main"
    display_name: "Main Project"
    description: "Primary documentation project"
    sources:
      git:
        docs:
          base_url: "https://github.com/company/docs.git"
          branch: "main"
          token: "${REPO_TOKEN}"
          file_types:
            - "*.md"
            - "*.rst"
          enable_file_conversion: true
```
### Multi-Project Setup

```yaml
global:
  qdrant:
    url: "${QDRANT_URL}"
    collection_name: "${QDRANT_COLLECTION_NAME}"

  llm:
    provider: "openai"
    base_url: "https://api.openai.com/v1"
    api_key: "${LLM_API_KEY}"
    models:
      embeddings: "text-embedding-3-small"
      chat: "gpt-4o-mini"
    embeddings:
      vector_size: 1536

  chunking:
    chunk_size: 1500
    chunk_overlap: 200
    max_chunks_per_document: 500
    strategies:
      default:
        min_chunk_size: 100
        enable_semantic_analysis: true
        enable_entity_extraction: true
      html:
        simple_parsing_threshold: 100000
        max_html_size_for_parsing: 500000
        max_sections_to_process: 200
        max_chunk_size_for_nlp: 20000
        preserve_semantic_structure: true
      code:
        max_file_size_for_ast: 75000
        max_elements_to_process: 800
        max_recursion_depth: 8
        max_element_size: 20000
        enable_ast_parsing: true
        enable_dependency_analysis: true
      json:
        max_json_size_for_parsing: 1000000
        max_objects_to_process: 200
        max_chunk_size_for_nlp: 20000
        max_recursion_depth: 5
        max_array_items_per_chunk: 50
        max_object_keys_to_process: 100
        enable_schema_inference: true
      markdown:
        min_content_length_for_nlp: 100
        min_word_count_for_nlp: 20
        min_line_count_for_nlp: 3
        min_section_size: 500
        max_chunks_per_section: 1000
        max_overlap_percentage: 0.25
        max_workers: 4
        estimation_buffer: 0.2
        words_per_minute_reading: 200
        header_analysis_threshold_h1: 3
        header_analysis_threshold_h3: 8
        enable_hierarchical_metadata: true

  state_management:
    database_path: "${STATE_DB_PATH}"

  file_conversion:
    max_file_size: 52428800
    markitdown:
      enable_llm_descriptions: false

projects:
  documentation:
    project_id: "documentation"
    display_name: "Documentation"
    description: "Technical documentation and guides"
    sources:
      git:
        docs-repo:
          base_url: "https://github.com/company/docs.git"
          branch: "main"
          include_paths:
            - "docs/**"
            - "README.md"
          token: "${DOCS_REPO_TOKEN}"
          file_types:
            - "*.md"
            - "*.rst"
          enable_file_conversion: true
      confluence:
        tech-wiki:
          base_url: "https://company.atlassian.net/wiki"
          deployment_type: "cloud"
          space_key: "TECH"
          token: "${CONFLUENCE_TOKEN}"
          email: "${CONFLUENCE_EMAIL}"
          enable_file_conversion: true

  support:
    project_id: "support"
    display_name: "Customer Support"
    description: "Support documentation and tickets"
    sources:
      jira:
        support-tickets:
          base_url: "https://company.atlassian.net"
          deployment_type: "cloud"
          project_key: "SUPPORT"
          token: "${JIRA_TOKEN}"
          email: "${JIRA_EMAIL}"
          enable_file_conversion: true
      localfile:
        support-docs:
          base_url: "file:///path/to/support/docs"
          include_paths:
            - "**/*.md"
            - "**/*.pdf"
          file_types:
            - "*.md"
            - "*.pdf"
          enable_file_conversion: true
```
## 🔗 Related Documentation
- **Environment Variables Reference** - Environment variable configuration
- **Basic Configuration** - Getting started with configuration
- **Data Sources** - Source-specific configuration guides
- **Workspace Mode** - Workspace configuration details
## 📋 Configuration Checklist
- [ ] Global configuration completed (`qdrant`, `llm`, `state_management`)
- [ ] Environment variables configured in `.env` file
- [ ] Project definitions created with unique project IDs
- [ ] Data source credentials configured for your sources
- [ ] Required fields provided (Git token, Confluence email/token, etc.)
- [ ] File conversion settings configured if processing non-text files
- [ ] Configuration displayed with `qdrant-loader config`
- [ ] Projects reviewed in config output
- [ ] File permissions set appropriately (`chmod 600` for sensitive configs)
- [ ] Version control configured (exclude `.env` files)
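The last two items can be handled with a couple of shell commands (a sketch, assuming the file names used in this guide):

```bash
# Restrict permissions on files containing credentials
chmod 600 .env config.yaml

# Keep secrets out of version control
echo ".env" >> .gitignore
```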
Multi-project configuration complete! 🎉
Your QDrant Loader is now configured using the multi-project structure. This provides organized, scalable configuration management while maintaining unified search across all your projects through a shared QDrant collection.