Core Concepts

Understanding these core concepts will help you make the most of QDrant Loader and troubleshoot issues effectively.

🎯 Overview

QDrant Loader transforms your documents into searchable, AI-accessible knowledge. This guide explains the key concepts behind how it works.

🧠 Vector Databases and Embeddings

What are Vector Databases?

A vector database stores information as high-dimensional numerical vectors that represent the semantic meaning of text, images, or other data.

Traditional Database:
"QDrant Loader is powerful" β†’ Stored as text

Vector Database:
"QDrant Loader is powerful" β†’ [0.1, -0.3, 0.8, 0.2, ...] (1536 dimensions)

Why Vector Databases?

Traditional Search                          Vector Search
Exact keyword matching                      Semantic meaning matching
"QDrant Loader" finds only exact matches    "document ingestion tool" finds QDrant Loader
Limited context understanding               Understands relationships and context
Boolean results (match/no match)            Similarity scores (0.0 to 1.0)

What are Embeddings?

Embeddings are numerical representations of text that capture semantic meaning. Similar concepts have similar embeddings.

# Example embeddings (simplified to 3 dimensions)
"dog" β†’ [0.8, 0.2, 0.1]
"puppy" β†’ [0.7, 0.3, 0.1]  # Similar to "dog"
"car" β†’ [0.1, 0.1, 0.9]    # Different from "dog"

How QDrant Loader Uses Embeddings

  1. Text Chunking: Documents are split into manageable chunks
  2. Embedding Generation: Each chunk is converted to a vector using OpenAI's models
  3. Storage: Vectors are stored in QDrant with metadata
  4. Search: Query text is converted to a vector and matched against stored vectors

πŸ“„ Document Processing Pipeline

Step 1: Data Source Connection

QDrant Loader connects to various data sources:

Data Sources β†’ QDrant Loader β†’ QDrant Database
     ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Git Repos   β”‚ ──┐
β”‚ Confluence  β”‚   β”‚
β”‚ JIRA        β”‚   β”œβ”€β†’ QDrant Loader ──→ Vector Database
β”‚ Local Files β”‚   β”‚
β”‚ Public Docs β”‚ β”€β”€β”˜
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Step 2: File Conversion

Documents are converted to plain text:

Input Formats β†’ Conversion β†’ Plain Text
     ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ PDF         β”‚    β”‚              β”‚    β”‚             β”‚
β”‚ DOCX        β”‚ ──→│ MarkItDown   │──→ β”‚ Plain Text  β”‚
β”‚ PPTX        β”‚    β”‚ Conversion   β”‚    β”‚ Content     β”‚
β”‚ Images      β”‚    β”‚              β”‚    β”‚             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Step 3: Text Chunking

Large documents are split into smaller, manageable chunks:

Large Document β†’ Intelligent Chunking β†’ Smaller Chunks
     ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 50,000 words    β”‚    β”‚ Respect      β”‚    β”‚ Chunk 1     β”‚
β”‚ Technical Doc   β”‚ ──→│ Boundaries   │──→ β”‚ Chunk 2     β”‚
β”‚                 β”‚    β”‚ Preserve     β”‚    β”‚ Chunk 3     β”‚
β”‚                 β”‚    β”‚ Context      β”‚    β”‚ ...         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Chunking Strategy:

  • Respect boundaries: Don't split sentences or code blocks
  • Maintain context: Include overlapping content between chunks
  • Optimal size: Balance between context and processing efficiency
  • Preserve structure: Keep headings and formatting context
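
The overlap idea behind this strategy can be sketched as a sliding window. This is a simplified illustration, not QDrant Loader's actual implementation; sizes here are in characters for brevity, while real chunking works on tokens and respects sentence and code-block boundaries:

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into overlapping windows so context spans chunk boundaries."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("a" * 2500, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

Each chunk repeats the last 200 characters of the previous one, so a sentence cut near a boundary still appears intact in at least one chunk.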

Step 4: Embedding Generation

Each chunk is converted to a vector embedding:

Text Chunk β†’ OpenAI API β†’ Vector Embedding
     ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ "QDrant Loader  β”‚    β”‚ OpenAI       β”‚    β”‚ [0.1, -0.3, β”‚
β”‚ is a powerful   β”‚ ──→│ text-embed-  │──→ β”‚  0.8, 0.2,  β”‚
β”‚ tool for..."    β”‚    β”‚ ding-3-small β”‚    β”‚  ...]       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
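
Chunks are typically sent to the embeddings API in batches rather than one at a time. A minimal batching sketch; the commented-out OpenAI call shows the general shape of the request and requires an API key, so it is illustration only:

```python
def batch(items, batch_size=100):
    """Group chunks into batches, one embeddings API request per batch."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

chunks = [f"chunk {i}" for i in range(250)]
batches = list(batch(chunks, batch_size=100))
print([len(b) for b in batches])  # [100, 100, 50]

# Each batch would then be sent to the embeddings endpoint, e.g.:
# from openai import OpenAI
# client = OpenAI()  # reads OPENAI_API_KEY from the environment
# resp = client.embeddings.create(model="text-embedding-3-small", input=batches[0])
# vectors = [d.embedding for d in resp.data]
```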

Step 5: Storage in QDrant

Vectors and metadata are stored in QDrant:

Vector + Metadata β†’ QDrant Collection
     ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Vector:         β”‚    β”‚ QDrant Collection            β”‚
β”‚ [0.1, -0.3,...] β”‚    β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚
β”‚                 β”‚ ──→│ β”‚ Vector  β”‚ Metadata        β”‚β”‚
β”‚ Metadata:       β”‚    β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€β”‚
β”‚ - Source file   β”‚    β”‚ β”‚ [0.1..] β”‚ file: doc.md    β”‚β”‚
β”‚ - Chunk index   β”‚    β”‚ β”‚ [0.3..] β”‚ chunk: 1        β”‚β”‚
β”‚ - Content       β”‚    β”‚ β”‚ [0.8..] β”‚ source: git     β”‚β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚
                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ” Search and Retrieval

How Search Works

  1. Query Processing: Your search query is converted to a vector
  2. Similarity Search: QDrant finds vectors most similar to your query
  3. Ranking: Results are ranked by similarity score
  4. Metadata Filtering: Results can be filtered by source, date, etc.

Search Query β†’ Vector β†’ QDrant Search β†’ Ranked Results
     ↓
"API documentation" β†’ [0.2, 0.7, ...] β†’ QDrant β†’ [
  {score: 0.95, content: "API endpoints...", source: "api.md"},
  {score: 0.87, content: "REST API guide...", source: "guide.md"},
  {score: 0.82, content: "Authentication...", source: "auth.md"}
]
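
Conceptually, similarity search scores the query vector against every stored vector and returns the top matches. QDrant uses optimized indexes (HNSW) rather than this brute-force scan, but the ranking idea is the same. A toy sketch with 2-dimensional vectors and made-up content:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy "collection": vectors plus metadata, as stored in Step 5
collection = [
    {"vector": [0.9, 0.1], "content": "API endpoints...", "source": "api.md"},
    {"vector": [0.2, 0.9], "content": "Release notes...", "source": "notes.md"},
    {"vector": [0.8, 0.3], "content": "REST API guide...", "source": "guide.md"},
]

def search(query_vector, limit=10):
    """Score every stored vector against the query and rank by similarity."""
    scored = [
        {**point, "score": cosine(query_vector, point["vector"])}
        for point in collection
    ]
    return sorted(scored, key=lambda p: p["score"], reverse=True)[:limit]

results = search([1.0, 0.2], limit=2)
print([(r["source"], round(r["score"], 2)) for r in results])
# [('api.md', 1.0), ('guide.md', 0.99)]
```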

Search Types

1. Semantic Search (via MCP Server)

Finds content based on meaning, not just keywords. Search is performed through the MCP server tools:

# Search is performed via MCP server tools in AI applications
# Basic semantic search
{
  "name": "search",
  "arguments": {
    "query": "authentication methods",
    "limit": 10
  }
}

2. Hierarchy Search (via MCP Server)

Combines semantic search with document structure awareness:

# Hierarchy-aware search via MCP server
{
  "name": "hierarchy_search", 
  "arguments": {
    "query": "API authentication",
    "organize_by_hierarchy": true
  }
}

3. Attachment Search (via MCP Server)

Search within file attachments and their parent documents:

# Attachment search via MCP server
{
  "name": "attachment_search",
  "arguments": {
    "query": "deployment scripts",
    "attachment_filter": {
      "file_type": "sh"
    }
  }
}

πŸ€– MCP Server and AI Integration

What is MCP?

Model Context Protocol (MCP) is a standard for connecting AI tools to external data sources.

AI Tool (Cursor) ←→ MCP Server ←→ QDrant Database
     ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ User asks:  β”‚    β”‚ MCP Server   β”‚    β”‚ QDrant      β”‚
β”‚ "How do I   β”‚ ──→│ - Receives   │──→ β”‚ - Searches  β”‚
β”‚ configure   β”‚    β”‚   query      β”‚    β”‚   vectors   β”‚
β”‚ QDrant?"    β”‚    β”‚ - Calls      β”‚    β”‚ - Returns   β”‚
β”‚             β”‚ ←──│   search     │←── β”‚   results   β”‚
β”‚ Gets answer β”‚    β”‚ - Formats    β”‚    β”‚             β”‚
β”‚ with contextβ”‚    β”‚   response   β”‚    β”‚             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

MCP Tools Available

QDrant Loader provides several MCP tools:

1. Search Tool

Basic semantic search across all documents:

{
  "name": "search",
  "description": "Search across all ingested documents",
  "parameters": {
    "query": "search terms",
    "limit": 10,
    "source_types": ["git", "confluence"]
  }
}

2. Hierarchy Search Tool

Search with document structure awareness:

{
  "name": "hierarchy_search", 
  "description": "Search with document hierarchy context",
  "parameters": {
    "query": "search terms",
    "organize_by_hierarchy": true,
    "hierarchy_filter": {
      "depth": 3,
      "has_children": true
    }
  }
}

3. Attachment Search Tool

Search file attachments and their parent documents:

{
  "name": "attachment_search",
  "description": "Search file attachments",
  "parameters": {
    "query": "search terms",
    "attachment_filter": {
      "file_type": "pdf",
      "include_parent_context": true
    }
  }
}
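
Under the hood, MCP clients invoke these tools with JSON-RPC 2.0 `tools/call` requests; the AI application handles the transport (stdio or HTTP). A minimal sketch of building such a request envelope for the search tool:

```python
import json

def build_tool_call(request_id, tool_name, arguments):
    """Wrap an MCP tool invocation in a JSON-RPC 2.0 request envelope."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

request = build_tool_call(1, "search", {"query": "authentication methods", "limit": 10})
print(json.dumps(request, indent=2))
```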

πŸ“Š Data Sources and Connectors

Supported Data Sources

Source                 Description                      Use Cases
Git Repositories       Clone and index Git repos        Code documentation, README files
Confluence             Connect to Confluence spaces     Team wikis, knowledge bases
JIRA                   Index JIRA issues and comments   Project documentation, requirements
Local Files            Process local directories        Personal documents, project files
Public Documentation   Scrape public websites           External API docs, tutorials

How Connectors Work

Each data source has a specialized connector:

# Simplified connector interface
class DataSourceConnector:
    def connect(self, config):
        """Establish connection to data source"""

    def discover(self):
        """Find all available documents"""

    def fetch(self, document_id):
        """Retrieve document content"""

    def get_metadata(self, document_id):
        """Extract document metadata"""

Incremental Updates

QDrant Loader tracks changes and only processes new or modified content:

Initial Sync: All documents processed
     ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Document A      β”‚ ──→│ Processed    β”‚
β”‚ Document B      β”‚ ──→│ Processed    β”‚
β”‚ Document C      β”‚ ──→│ Processed    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Incremental Sync: Only changes processed
     ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Document A      β”‚ ──→│ Skipped      β”‚ (unchanged)
β”‚ Document B      β”‚ ──→│ Updated      β”‚ (modified)
β”‚ Document D      β”‚ ──→│ Added        β”‚ (new)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
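
A common way to implement this kind of change detection is to keep a content hash per document and compare hashes on the next run. A minimal sketch of the idea (actual connectors may also use source-specific signals such as timestamps or commit history):

```python
import hashlib

def sync_plan(documents, previous_hashes):
    """Classify each document as added, updated, or skipped by content hash."""
    plan, new_hashes = {}, {}
    for doc_id, content in documents.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        new_hashes[doc_id] = digest
        if doc_id not in previous_hashes:
            plan[doc_id] = "added"
        elif previous_hashes[doc_id] != digest:
            plan[doc_id] = "updated"
        else:
            plan[doc_id] = "skipped"
    return plan, new_hashes  # new_hashes is persisted for the next run

# State saved after the initial sync
previous = {"A": hashlib.sha256(b"v1").hexdigest(),
            "B": hashlib.sha256(b"v1").hexdigest()}

plan, _ = sync_plan({"A": "v1", "B": "v2", "D": "v1"}, previous)
print(plan)  # {'A': 'skipped', 'B': 'updated', 'D': 'added'}
```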

βš™οΈ Configuration and Customization

Configuration Hierarchy

QDrant Loader uses a layered configuration system:

Environment Variables (highest priority)
     ↓
Configuration File (~/.qdrant-loader/config.yaml)
     ↓
Command Line Arguments
     ↓
Default Values (lowest priority)
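
The layering can be pictured as a series of dictionary merges in which later (higher-priority) layers override earlier ones. A simplified sketch with hypothetical settings, ordered lowest to highest priority per the hierarchy above:

```python
def resolve_config(*layers):
    """Merge config layers; later layers take priority over earlier ones."""
    merged = {}
    for layer in layers:  # lowest priority first
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged

defaults    = {"chunk_size": 1000, "batch_size": 100, "url": "http://localhost:6333"}
cli_args    = {"chunk_size": 800, "batch_size": None}  # unset options don't override
config_file = {"url": "http://qdrant.internal:6333"}
env_vars    = {"batch_size": 50}

config = resolve_config(defaults, cli_args, config_file, env_vars)
print(config)
# {'chunk_size': 800, 'batch_size': 50, 'url': 'http://qdrant.internal:6333'}
```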

Project-Based Configuration

QDrant Loader uses a project-based configuration structure:

# Multi-project configuration
projects:
  # Development project
  dev-project:
    project_id: "dev-project"
    display_name: "Development Documentation"
    description: "Development team documentation and code"

    sources:
      git:
        backend-repo:
          base_url: "https://github.com/company/backend"
          include_paths: ["docs/**", "README.md"]

      confluence:
        dev-space:
          base_url: "https://company.atlassian.net/wiki"
          space_key: "DEV"
          token: "${CONFLUENCE_TOKEN}"

  # Production project  
  prod-project:
    project_id: "prod-project"
    display_name: "Production Documentation"
    description: "Production systems and operations"

    sources:
      confluence:
        ops-space:
          base_url: "https://company.atlassian.net/wiki"
          space_key: "OPS"
          token: "${CONFLUENCE_TOKEN}"

Key Configuration Areas

1. Vector Database Settings

qdrant:
  url: "http://localhost:6333"
  collection_name: "documents"
  vector_size: 1536
  distance_metric: "cosine"

2. Embedding Configuration

embedding:
  model: "text-embedding-3-small"
  api_key: "${OPENAI_API_KEY}"
  batch_size: 100
  endpoint: "https://api.openai.com/v1"
  vector_size: 1536

3. Processing Settings

chunking:
  chunk_size: 1000
  chunk_overlap: 200
  max_file_size: "10MB"
  supported_formats: ["md", "txt", "pdf", "docx"]

4. Data Source Configuration

sources:
  git:
    my-repo:
      base_url: "https://github.com/company/docs"
      branch: "main"
      include_paths: ["**/*.md", "**/*.rst"]
      exclude_paths: ["node_modules/", ".git/"]

  confluence:
    company-wiki:
      base_url: "https://company.atlassian.net/wiki"
      space_key: "DOCS"
      token: "${CONFLUENCE_TOKEN}"
      email: "${CONFLUENCE_EMAIL}"

πŸ”§ Performance and Optimization

Understanding Performance Factors

1. Document Size and Chunking

  • Larger chunks: Better context, slower processing
  • Smaller chunks: Faster processing, less context
  • Optimal size: 500-1500 tokens per chunk

2. Embedding Model Choice

Model                    Speed    Quality   Cost
text-embedding-3-small   Fast     Good      Low
text-embedding-3-large   Slower   Better    Higher
text-embedding-ada-002   Medium   Good      Medium

3. QDrant Configuration

# Performance optimization
qdrant:
  # Use memory-mapped storage for large datasets
  storage_type: "mmap"

  # Optimize for search speed
  hnsw_config:
    m: 16
    ef_construct: 200

  # Batch operations
  batch_size: 100

Monitoring Performance

# Check ingestion status
qdrant-loader project --workspace . status

# Monitor QDrant performance (direct API call)
curl http://localhost:6333/metrics

# Check collection statistics
qdrant-loader project --workspace . status --detailed

πŸ”— Integration Patterns

Common Integration Scenarios

1. Development Workflow

Code Changes β†’ Git Push β†’ Webhook β†’ QDrant Loader β†’ Updated Index
     ↓
Developer asks AI about code β†’ MCP Server β†’ Search β†’ Contextual answers

2. Documentation Workflow

Wiki Updates β†’ Confluence β†’ Scheduled Sync β†’ QDrant Loader β†’ Updated Index
     ↓
Support team searches β†’ AI Tool β†’ MCP Server β†’ Accurate answers

3. Knowledge Management

Multiple Sources β†’ QDrant Loader β†’ Unified Index β†’ AI Tools
     ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Git Repos   β”‚    β”‚              β”‚    β”‚ Cursor IDE  β”‚
β”‚ Confluence  β”‚ ──→│ QDrant Loader│──→ β”‚ Windsurf    β”‚
β”‚ JIRA        β”‚    β”‚              β”‚    β”‚ Claude      β”‚
β”‚ Local Docs  β”‚    β”‚              β”‚    β”‚ Custom Apps β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

🎯 Best Practices

1. Data Organization

  • Use consistent naming: Clear, descriptive file and folder names
  • Maintain structure: Organize documents logically
  • Regular cleanup: Remove outdated or duplicate content

2. Configuration Management

  • Environment-specific configs: Different settings for dev/prod
  • Secure credentials: Use environment variables for API keys
  • Version control: Track configuration changes

3. Performance Optimization

  • Batch processing: Process multiple documents together
  • Incremental updates: Only sync changed content
  • Monitor resources: Watch memory and API usage

4. Quality Assurance

  • Test search results: Verify search quality regularly
  • Monitor accuracy: Check AI responses for correctness
  • Update regularly: Keep embeddings fresh with new content

πŸ” Troubleshooting Concepts

Common Issues and Concepts

1. Poor Search Results

Cause: Embedding model mismatch or poor chunking
Solution: Adjust chunk size or try a different embedding model

2. Slow Performance

Cause: Large chunks, inefficient QDrant config, or API rate limits
Solution: Optimize chunking, tune QDrant, implement rate limiting

3. Memory Issues

Cause: Processing too many large documents simultaneously
Solution: Reduce batch size, process in smaller chunks

4. Inconsistent Results

Cause: Outdated embeddings or mixed content types
Solution: Re-index content, separate different content types

πŸ“š Next Steps

Now that you understand the core concepts:

  1. Basic Configuration - Set up your specific use case
  2. User Guides - Explore detailed features
  3. Data Source Guides - Configure specific connectors
  4. MCP Server Guide - Advanced AI integration

Understanding these concepts will help you:

  • Configure QDrant Loader effectively for your use case
  • Troubleshoot issues when they arise
  • Optimize performance for your specific needs
  • Make the most of AI tool integration
Generated from core-concepts.md