File Conversion

QDrant Loader uses Microsoft's MarkItDown library to convert files in many formats into text that can be chunked and indexed. This guide covers supported formats, configuration, and best practices.

🎯 Supported File Formats

The tables below list the formats handled through MarkItDown, grouped by type:

📄 Document Formats

Format	Extension	Description	Features
PDF	`.pdf`	Portable Document Format	Text extraction, OCR for images, metadata
Word	`.docx`, `.doc`	Microsoft Word documents	Text, tables, images, metadata
PowerPoint	`.pptx`, `.ppt`	Microsoft PowerPoint presentations	Slide text, speaker notes, metadata
Excel	`.xlsx`, `.xls`	Microsoft Excel spreadsheets	Cell data, formulas, sheet names
OpenDocument	`.odt`, `.ods`, `.odp`	LibreOffice/OpenOffice documents	Text, tables, metadata

📝 Text Formats

Format	Extension	Description	Features
Markdown	`.md`, `.markdown`	Markdown markup	Formatted text, code blocks, tables
reStructuredText	`.rst`	reStructuredText markup	Formatted text, directives
Plain Text	`.txt`	Plain text files	Raw text content
Rich Text	`.rtf`	Rich Text Format	Formatted text, basic styling
LaTeX	`.tex`	LaTeX documents	Mathematical content, structured text

🖼️ Image Formats (with OCR)

Format	Extension	Description	Features
JPEG	`.jpg`, `.jpeg`	JPEG images	OCR text extraction, metadata
PNG	`.png`	PNG images	OCR text extraction, transparency
GIF	`.gif`	GIF images	OCR text extraction, animation frames
TIFF	`.tiff`, `.tif`	TIFF images	OCR text extraction, high quality
BMP	`.bmp`	Bitmap images	OCR text extraction

🎵 Audio Formats (with Transcription)

Format	Extension	Description	Features
MP3	`.mp3`	MP3 audio	Speech-to-text transcription
WAV	`.wav`	WAV audio	Speech-to-text transcription
M4A	`.m4a`	M4A audio	Speech-to-text transcription
FLAC	`.flac`	FLAC audio	Speech-to-text transcription

📊 Data Formats

Format	Extension	Description	Features
JSON	`.json`	JSON data	Structured data extraction
CSV	`.csv`	Comma-separated values	Tabular data, headers
XML	`.xml`	XML documents	Structured data, attributes
YAML	`.yaml`, `.yml`	YAML configuration	Configuration data
TOML	`.toml`	TOML configuration	Configuration data

📦 Archive Formats

Format	Extension	Description	Features
ZIP	`.zip`	ZIP archives	Extract and process contents
TAR	`.tar`, `.tar.gz`, `.tgz`	TAR archives	Extract and process contents
7-Zip	`.7z`	7-Zip archives	Extract and process contents

⚙️ Configuration

Global File Conversion Configuration

File conversion is configured globally and applies to all projects and sources that enable it:

global:
  file_conversion:
    # Maximum file size for conversion (in bytes)
    max_file_size: 52428800  # 50MB

    # Timeout for conversion operations (in seconds)
    conversion_timeout: 300  # 5 minutes

    # MarkItDown specific settings
    markitdown:
      # Enable LLM integration for image descriptions
      enable_llm_descriptions: false

      # LLM model for image descriptions (when enabled)
      llm_model: "gpt-4o"

      # LLM endpoint (when enabled)
      llm_endpoint: "https://api.openai.com/v1"

      # API key for LLM service (required when enable_llm_descriptions is true)
      llm_api_key: "${LLM_API_KEY}"

projects:
  my-project:
    display_name: "My Project"
    description: "Project with file conversion enabled"
    sources:
      localfile:
        documents:
          base_url: "file:///path/to/documents"
          file_types:
            - "*.pdf"
            - "*.docx"
            - "*.pptx"
            - "*.xlsx"
          max_file_size: 52428800

          # Enable file conversion for this source
          enable_file_conversion: true
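
The `${LLM_API_KEY}` value in the configuration above is a placeholder that is filled from an environment variable. How QDrant Loader resolves it internally is not shown in this guide; the sketch below only illustrates the general idea of `${NAME}` substitution. The function name `expand_placeholders` and the decision to raise on a missing variable are assumptions made for illustration.

```python
import os
import re

# Matches ${NAME} where NAME is a typical environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(value, env=None):
    """Replace each ${NAME} in value with env[NAME] (hypothetical helper)."""
    env = os.environ if env is None else env

    def _substitute(match):
        name = match.group(1)
        if name not in env:
            # Assumption: failing loudly is preferable to passing an empty key.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(_substitute, value)
```

Calling `expand_placeholders("${LLM_API_KEY}")` with the variable exported returns its value; with the variable missing it raises, which surfaces a forgotten `export` early.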

Configuration Options

Global File Conversion Settings

Option	Type	Description	Default
`max_file_size`	int	Maximum file size in bytes	`52428800` (50MB)
`conversion_timeout`	int	Timeout for conversion operations in seconds	`300` (5 minutes)
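
Because `max_file_size` is a plain byte count, the values in this guide are easier to read once the arithmetic is explicit. The helper below is a small sketch, assuming 1 MB means 1024 × 1024 bytes (which is what the documented default of `52428800` for 50MB implies); `mb_to_bytes` is a hypothetical name, not part of QDrant Loader.

```python
def mb_to_bytes(megabytes):
    """Convert a size in MB (1 MB = 1024 * 1024 bytes) to a byte count."""
    return megabytes * 1024 * 1024


# The three limits that appear in this guide:
print(mb_to_bytes(50))   # 52428800
print(mb_to_bytes(100))  # 104857600
print(mb_to_bytes(20))   # 20971520
```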

MarkItDown Settings

Option	Type	Description	Default
`enable_llm_descriptions`	bool	Enable LLM integration for image descriptions	`false`
`llm_model`	string	LLM model for image descriptions	`gpt-4o`
`llm_endpoint`	string	LLM endpoint URL	`https://api.openai.com/v1`
`llm_api_key`	string	API key for LLM service	`null`

Source-Level Settings

Each data source can enable or disable file conversion:

Option	Type	Description	Default
`enable_file_conversion`	bool	Enable file conversion for this source	`false`

🔧 How File Conversion Works

Conversion Process

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│     File    │───▶│   Format    │───▶│ MarkItDown  │───▶│  Markdown   │
│  Detection  │    │  Detection  │    │ Conversion  │    │   Content   │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
       │                   │                   │                   │
       ▼                   ▼                   ▼                   ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  MIME Type  │    │  Extension  │    │ Text + OCR  │    │ Structured  │
│  Detection  │    │   Mapping   │    │  + Audio    │    │Text Output  │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘

Processing Pipeline

1. File Detection
   - MIME type detection
   - Extension analysis
   - File size validation
2. Format-Specific Processing
   - PDF: Text extraction + OCR for images
   - Office Documents: Document structure + embedded content
   - Images: OCR text extraction
   - Audio: Speech-to-text transcription
   - Archives: Extraction + recursive processing
3. Content Extraction
   - Main text content
   - Metadata (author, creation date, etc.)
   - Structured data (tables, lists)
   - Embedded objects (images, charts)
4. Output Generation
   - Markdown-formatted text
   - Preserved formatting where possible
   - Ready for chunking and vector storage
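
The pipeline above can be approximated in a few lines. This is a minimal sketch rather than QDrant Loader's actual implementation: `Detection`, `detect`, and `convert` are hypothetical names, the detection step uses only the Python standard library, and the conversion step assumes the `markitdown` package is installed and uses its documented `MarkItDown().convert(...)` call.

```python
import mimetypes
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

MAX_FILE_SIZE = 52428800  # bytes; mirrors the documented 50MB default


@dataclass
class Detection:
    mime_type: Optional[str]  # None when the type cannot be guessed
    extension: str            # lower-cased suffix, e.g. ".pdf"
    size_bytes: int
    within_limit: bool        # True when size_bytes <= max_file_size


def detect(path, max_file_size=MAX_FILE_SIZE):
    """File detection: guess the MIME type, read the extension, check the size."""
    path = Path(path)
    mime_type, _encoding = mimetypes.guess_type(path.name)
    size = path.stat().st_size
    return Detection(mime_type, path.suffix.lower(), size, size <= max_file_size)


def convert(path):
    """Conversion and output: hand the file to MarkItDown, return Markdown text."""
    # Imported lazily so the detection step works without the dependency.
    from markitdown import MarkItDown

    return MarkItDown().convert(str(path)).text_content
```

A caller would typically run `detect` first and only call `convert` when `within_limit` is true, so oversized files are skipped before any expensive work starts.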

🚀 Usage Examples

Basic Document Processing

global:
  file_conversion:
    max_file_size: 52428800  # 50MB
    conversion_timeout: 300  # 5 minutes
    markitdown:
      enable_llm_descriptions: false

projects:
  documents:
    display_name: "Document Processing"
    description: "Process various document formats"
    sources:
      localfile:
        office-docs:
          base_url: "file:///documents/office"
          file_types:
            - "*.pdf"
            - "*.docx"
            - "*.pptx"
            - "*.xlsx"
          enable_file_conversion: true
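
The `file_types` entries are glob-style patterns. The sketch below shows how such patterns select files, assuming case-insensitive matching on the file name only; the loader's real matching rules may differ, and `matches_file_types` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def matches_file_types(filename, file_types):
    """Return True if filename matches any glob pattern in file_types."""
    name = filename.lower()
    # fnmatchcase is used on lower-cased input so the result is the same
    # on every operating system.
    return any(fnmatchcase(name, pattern.lower()) for pattern in file_types)


patterns = ["*.pdf", "*.docx", "*.pptx", "*.xlsx"]
print(matches_file_types("Quarterly-Report.PDF", patterns))  # True
print(matches_file_types("notes.txt", patterns))             # False
```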

Research Papers with LLM Enhancement

global:
  file_conversion:
    max_file_size: 104857600  # 100MB for large papers
    conversion_timeout: 600   # 10 minutes
    markitdown:
      enable_llm_descriptions: true
      llm_model: "gpt-4o"
      llm_endpoint: "https://api.openai.com/v1"
      llm_api_key: "${LLM_API_KEY}"

projects:
  research:
    display_name: "Research Papers"
    description: "Academic papers and research documents"
    sources:
      localfile:
        papers:
          base_url: "file:///research/papers"
          file_types:
            - "*.pdf"
            - "*.tex"
          enable_file_conversion: true

Multimedia Content Processing

global:
  file_conversion:
    max_file_size: 52428800
    conversion_timeout: 900  # 15 minutes for audio transcription
    markitdown:
      enable_llm_descriptions: true
      llm_model: "gpt-4o"
      llm_api_key: "${LLM_API_KEY}"

projects:
  multimedia:
    display_name: "Multimedia Content"
    description: "Audio, images, and presentations"
    sources:
      localfile:
        media:
          base_url: "file:///media/content"
          file_types:
            - "*.mp3"
            - "*.wav"
            - "*.png"
            - "*.jpg"
            - "*.pptx"
          enable_file_conversion: true

Confluence with Attachment Processing

global:
  file_conversion:
    max_file_size: 52428800
    conversion_timeout: 300
    markitdown:
      enable_llm_descriptions: false

projects:
  confluence-docs:
    display_name: "Confluence Documentation"
    description: "Confluence pages and attachments"
    sources:
      confluence:
        company-wiki:
          base_url: "${CONFLUENCE_URL}"
          deployment_type: "cloud"
          space_key: "DOCS"
          email: "${CONFLUENCE_EMAIL}"
          token: "${CONFLUENCE_TOKEN}"
          download_attachments: true
          enable_file_conversion: true

🧪 Testing and Validation

Test File Conversion

# Initialize the project
qdrant-loader init --workspace .

# Test ingestion with file conversion enabled
qdrant-loader ingest --workspace . --project my-project

# Check configuration and project status
qdrant-loader config --workspace .

# Enable debug logging to see conversion details
qdrant-loader ingest --workspace . --log-level DEBUG --project my-project

Validate Configuration

# Validate configuration (includes all projects)
qdrant-loader config --workspace .

# Display configuration with debug logging
qdrant-loader config --workspace . --log-level DEBUG

🔧 Troubleshooting

Common Issues

File Size Exceeded

Problem: Files are too large to process

Solutions:

global:
  file_conversion:
    # Increase size limit
    max_file_size: 104857600  # 100MB

# Or set a lower limit for a single source
projects:
  my-project:
    sources:
      localfile:
        documents:
          max_file_size: 20971520  # 20MB limit for this source
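
Before raising `max_file_size`, it can help to see which files actually exceed the current limit. The following sketch walks a directory and reports them, largest first; `find_oversized` is a hypothetical helper and is not part of the `qdrant-loader` CLI.

```python
from pathlib import Path


def find_oversized(root, max_file_size):
    """Return (path, size) pairs for files under root larger than max_file_size bytes."""
    oversized = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            size = path.stat().st_size
            if size > max_file_size:
                oversized.append((path, size))
    # Largest files first, so the worst offenders are easy to spot.
    return sorted(oversized, key=lambda item: item[1], reverse=True)


# Example call (the path is a placeholder):
# for path, size in find_oversized("/path/to/documents", 52428800):
#     print(f"{size:>12}  {path}")
```

If only a handful of files show up, excluding them or setting a source-level limit may be simpler than raising the global limit.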

Conversion Timeout

Problem: Large files timing out during conversion

Solutions:

global:
  file_conversion:
    # Increase timeout
    conversion_timeout: 900  # 15 minutes
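
To make the effect of `conversion_timeout` concrete, here is a generic sketch of running a conversion function under a time limit. It is an illustration only: how QDrant Loader enforces its timeout is not described in this guide, and `run_with_timeout` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_timeout(func, timeout_s, *args):
    """Run func(*args) in a worker thread; raise TimeoutError after timeout_s seconds."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        raise TimeoutError(f"conversion exceeded {timeout_s} seconds") from None
    finally:
        # Do not block on the worker. Note that a thread cannot be killed,
        # so a timed-out call keeps running in the background until it ends.
        executor.shutdown(wait=False)
```

With this pattern a slow file produces a clear error after the limit instead of stalling the whole run, which is why a larger timeout trades throughput for completeness.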

LLM Integration Issues

Problem: Image descriptions not working

Solutions:

Check API key:

echo $LLM_API_KEY
# Or check legacy variable
echo $OPENAI_API_KEY

Verify configuration:

global:
  file_conversion:
    markitdown:
      enable_llm_descriptions: true
      llm_api_key: "${LLM_API_KEY}"

Test API access:

curl -H "Authorization: Bearer $LLM_API_KEY" \
  https://api.openai.com/v1/models
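
The key check in the first step prints the full secret to the terminal. As an alternative, the sketch below reports whether a variable is set without revealing more than its last four characters; `describe_key` is a hypothetical helper written for this guide.

```python
import os


def describe_key(name, env=None):
    """Report whether an API key variable is set, without printing the whole value."""
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        return f"{name}: not set"
    return f"{name}: set ({len(value)} chars, ends with ...{value[-4:]})"


# Check both variables mentioned in this section.
for variable in ("LLM_API_KEY", "OPENAI_API_KEY"):
    print(describe_key(variable))
```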

Memory Issues

Problem: Large files causing memory problems

Solutions:

global:
  file_conversion:
    # Reduce file size limits
    max_file_size: 20971520  # 20MB

    # Reduce timeout to fail faster
    conversion_timeout: 180  # 3 minutes

Unsupported File Types

Problem: Some files not being processed