Attachment Search Guide
This guide covers the attachment search capabilities of the QDrant Loader MCP Server, enabling you to find and work with file attachments across your knowledge base with AI assistance.
🎯 Overview
The attachment search tool specializes in finding file attachments and their associated documents. Currently, this feature is designed specifically for Confluence sources and covers the following attachment types:
- PDF documents with extracted text content
- Office documents (Word, Excel, PowerPoint)
- Images with text extraction via MarkItDown
- Code files and configuration files
- Data files (CSV, JSON, YAML)
Key Benefits
- Content Extraction: Searches inside file contents using MarkItDown conversion
- Parent Context: Understands the relationship between attachments and their parent Confluence pages
- File Type Intelligence: Optimized search for different file formats supported by MarkItDown
- Metadata Awareness: Searches file properties, authors, and creation dates from Confluence
⚠️ Important Limitations
- Confluence Only: Currently limited to Confluence attachments and documents
- MarkItDown Dependency: File conversion capabilities depend on MarkItDown library support
- No OCR: Text extraction from images relies on MarkItDown, not dedicated OCR processing
📎 How Attachment Search Works
File Processing Pipeline
Confluence Attachment
↓
1. File Detection (MIME type and extension analysis)
↓
2. MarkItDown Conversion (text extraction from various formats)
↓
3. Content Processing (markdown structure analysis)
↓
4. Vector Embedding (semantic search via OpenAI)
↓
5. Confluence Context Integration (parent page relationship)
↓
6. Searchable Attachment Index
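As a rough illustration of step 1, the sketch below guesses a MIME type and extension from an attachment's file name using Python's standard `mimetypes` module. This guide does not say how the server performs detection, so the `detect_file` helper and the `application/octet-stream` fallback are assumptions for illustration only.

```python
import mimetypes
from pathlib import Path


def detect_file(filename: str) -> dict:
    """Guess the extension and MIME type of an attachment from its name.

    Hypothetical helper: it stands in for the "File Detection" step and is
    not the server's actual implementation.
    """
    mime_type, _encoding = mimetypes.guess_type(filename)
    return {
        "filename": filename,
        "extension": Path(filename).suffix.lower().lstrip("."),
        # guess_type returns None when the extension is unknown or missing
        "mime_type": mime_type or "application/octet-stream",
    }


info = detect_file("system-architecture-v2.pdf")
print(info["extension"], info["mime_type"])
```

A name-based guess is cheap but can be wrong for mislabeled files, which is one reason a real pipeline might also inspect the content type reported by Confluence.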
Search Process
Query: "architecture diagrams"
↓
1. Semantic Search (find relevant Confluence attachments)
↓
2. Confluence Filter (only Confluence sources processed)
↓
3. File Type Filtering (based on MIME type and filename)
↓
4. Content Analysis (MarkItDown extracted text)
↓
5. Parent Context (associated Confluence pages)
↓
6. Ranked Results (by relevance and attachment metadata)
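The search steps above can be sketched with a toy in-memory index. Everything here is an assumption made for illustration: the record fields (`source_type`, `file_type`, `vector`), the two-dimensional vectors, and the use of plain cosine similarity. The real server queries a vector database and uses embedding vectors with far more dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, records, file_type=None, limit=10):
    # Steps 2 and 3: keep Confluence records, then apply the optional type filter.
    # Filtering before scoring gives the same ranking as scoring first.
    candidates = [
        r for r in records
        if r["source_type"] == "confluence"
        and (file_type is None or r["file_type"] == file_type)
    ]
    # Steps 1 and 6: score each candidate against the query and rank by score.
    scored = [(cosine(query_vec, r["vector"]), r) for r in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r["filename"] for _, r in scored[:limit]]


records = [
    {"filename": "system-architecture-v2.pdf", "source_type": "confluence",
     "file_type": "pdf", "vector": [0.9, 0.1]},
    {"filename": "database-schema.pdf", "source_type": "confluence",
     "file_type": "pdf", "vector": [0.6, 0.8]},
    {"filename": "README.md", "source_type": "git",
     "file_type": "md", "vector": [1.0, 0.0]},
]

results = search([1.0, 0.0], records, file_type="pdf")
print(results)
```

Note that the Git record is dropped even though its vector matches the query exactly, which mirrors the Confluence-only behavior described in this guide.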
🔧 Attachment Search Parameters
Available Parameters
{
"name": "attachment_search",
"description": "Search for file attachments and their parent documents across Confluence sources",
"parameters": {
"query": "string", // Required: Search query in natural language
"limit": 10, // Optional: Number of results (default: 10)
"include_parent_context": true, // Optional: Include parent document info (default: true)
"attachment_filter": { // Optional: Attachment-specific filters
"attachments_only": true, // Show only file attachments
"parent_document_title": "API Documentation", // Filter by parent document title
"file_type": "pdf", // Filter by file type (e.g., 'pdf', 'xlsx', 'png')
"file_size_min": 1024, // Minimum file size in bytes
"file_size_max": 10485760, // Maximum file size in bytes
"author": "data-team" // Filter by attachment author
}
}
}
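MCP clients normally build tool calls for you, but it can help to see what one looks like on the wire. The sketch below assembles a JSON-RPC 2.0 `tools/call` request for `attachment_search`, assuming the server is reached through the standard MCP tool-calling method. The helper name `build_attachment_search_request` is hypothetical, and the transport (stdio or HTTP) is not shown.

```python
import json


def build_attachment_search_request(query, request_id=1, limit=10,
                                    include_parent_context=True,
                                    attachment_filter=None):
    """Build a JSON-RPC 2.0 request that calls the attachment_search tool."""
    arguments = {
        "query": query,  # required
        "limit": limit,  # optional, defaults to 10
        "include_parent_context": include_parent_context,  # optional, defaults to true
    }
    if attachment_filter:
        # Only send the filter object when at least one filter is set
        arguments["attachment_filter"] = attachment_filter
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "attachment_search", "arguments": arguments},
    }


request = build_attachment_search_request(
    "API rate limits and throttling policies",
    attachment_filter={"file_type": "pdf"},
)
payload = json.dumps(request, indent=2)
print(payload)
```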
Parameter Details
Required Parameters
- query (string): The search query in natural language
Optional Parameters
- limit (integer): Maximum number of results to return (default: 10)
- include_parent_context (boolean): Include parent document information (default: true)
Attachment Filter Options
- attachments_only (boolean): Show only file attachments
- parent_document_title (string): Filter by parent document title
- file_type (string): Filter by file type (e.g., 'pdf', 'xlsx', 'png')
- file_size_min (integer): Minimum file size in bytes
- file_size_max (integer): Maximum file size in bytes
- author (string): Filter by attachment author
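To make the filter options concrete, here is a minimal sketch of how they might combine when applied to one attachment record. The matching rules are assumptions: this guide does not state whether size bounds are inclusive or whether author and title matching is exact, so the sketch uses inclusive bounds and exact string comparison. The record fields (`is_attachment`, `size`, `parent_document`) are hypothetical.

```python
def matches_filter(attachment: dict, flt: dict) -> bool:
    """Return True when the attachment satisfies every filter that is set."""
    if flt.get("attachments_only") and not attachment.get("is_attachment", False):
        return False
    if "file_type" in flt:
        if attachment.get("file_type", "").lower() != flt["file_type"].lower():
            return False
    size = attachment.get("size", 0)
    # Assumed inclusive bounds, in bytes
    if "file_size_min" in flt and size < flt["file_size_min"]:
        return False
    if "file_size_max" in flt and size > flt["file_size_max"]:
        return False
    # Assumed exact matches for author and parent title
    if "author" in flt and attachment.get("author") != flt["author"]:
        return False
    if "parent_document_title" in flt:
        if attachment.get("parent_document") != flt["parent_document_title"]:
            return False
    return True


attachment = {
    "filename": "api-performance-analysis.xlsx",
    "is_attachment": True,
    "file_type": "xlsx",
    "size": 2457600,
    "author": "performance-team",
    "parent_document": "Performance Testing Results",
}

print(matches_filter(attachment, {"file_type": "xlsx", "file_size_min": 1048576}))
print(matches_filter(attachment, {"author": "devops-team"}))
```

Filters that are left out impose no constraint, so an empty filter object matches every attachment.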
📁 Supported File Types
The attachment search supports file types that can be processed by MarkItDown for text extraction:
Document Files
PDF Documents (.pdf)
- Content: Text extraction via MarkItDown
- Metadata: Basic Confluence attachment metadata (author, size, date)
- Features: Text content search within PDF documents
- Use Cases: Reports, manuals, specifications
Microsoft Office Documents
- Word Documents (.docx, .doc): Text content extraction
- Excel Spreadsheets (.xlsx, .xls): Cell content and sheet data extraction
- PowerPoint Presentations (.pptx, .ppt): Slide text and notes extraction
- Metadata: Author, file size, last modified date from Confluence
- Use Cases: Documentation, presentations, data analysis
Image Files
Common Image Formats (.png, .jpg, .jpeg, .gif, .bmp, .tiff, .webp)
- Content: Text extraction via MarkItDown (limited capability)
- Metadata: File size, dimensions (basic), upload date from Confluence
- Features: Basic text recognition where supported by MarkItDown
- Use Cases: Screenshots, diagrams, charts (text extraction may be limited)
Data and Text Files
CSV Files (.csv)
- Content: Column headers and data structure extraction
- Features: Data content made searchable
- Use Cases: Data exports, configuration data
Archive Files (.zip, .epub)
- Content: Archive content extraction where supported by MarkItDown
- Features: Basic content indexing
Plain Text Files (.txt)
- Content: Full text extraction
- Features: Complete content searchability
Audio Files (Limited Support)
Audio Formats (.mp3, .wav)
- Content: Limited to metadata extraction only
- Note: Audio transcription is not supported
Important Notes
- Conversion Dependency: All file processing depends on MarkItDown library capabilities
- Text-Based Search: Search operates on text content extracted by MarkItDown
- Confluence Metadata: File metadata comes from Confluence attachment properties
- No Custom OCR: No dedicated OCR processing beyond MarkItDown's built-in capabilities
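The groupings in this section can be captured in a small lookup table, for example to pick a sensible `file_type` filter before searching. The sketch mirrors only the extensions listed above; the `categorize` helper is hypothetical, and an "unlisted" result does not necessarily mean the file cannot be processed.

```python
from pathlib import Path

# Extensions exactly as grouped in this section of the guide
FILE_CATEGORIES = {
    "document": {".pdf", ".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt"},
    "image": {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".webp"},
    "data_and_text": {".csv", ".zip", ".epub", ".txt"},
    "audio": {".mp3", ".wav"},  # metadata only, no transcription
}


def categorize(filename: str) -> str:
    """Map a file name to one of the categories above, or 'unlisted'."""
    ext = Path(filename).suffix.lower()
    for category, extensions in FILE_CATEGORIES.items():
        if ext in extensions:
            return category
    return "unlisted"


for name in ["q4-performance-report.xlsx", "diagram.PNG", "notes.xyz"]:
    print(name, "->", categorize(name))
```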
🔍 Search Examples and Use Cases
1. Finding Specific File Types in Confluence
Architecture Diagrams
Query: "system architecture diagrams"
Parameters: {
"attachment_filter": {
"file_type": "pdf"
}
}
Results:
1. 📄 system-architecture-v2.pdf (2.3 MB)
Parent: Architecture Documentation (Confluence)
Content: "Microservices architecture with API gateway..."
Author: architecture-team
2. 📄 database-schema.pdf (1.1 MB)
Parent: Database Design (Confluence)
Content: "User Table, Product Table, Order Table..."
Author: database-team
Performance Reports
Query: "performance benchmarks and metrics"
Parameters: {
"attachment_filter": {
"file_type": "xlsx"
}
}
Results:
1. 📊 q4-performance-report.xlsx (1.2 MB)
Parent: Quarterly Reports (Confluence)
Content: Extracted spreadsheet data and metrics
Author: performance-team
2. 📊 daily-metrics.xlsx (456 KB)
Parent: Monitoring Dashboard (Confluence)
Content: Response times, throughput, error rates data
Author: devops-team
2. Content-Based Search in Confluence Attachments
Finding Specific Information
Query: "API rate limits and throttling policies"
Parameters: {
"limit": 10
}
Results:
1. 📄 api-rate-limiting-policy.pdf (1.8 MB)
Parent: API Documentation (Confluence)
Content: "Rate limiting implementation using token bucket..."
2. 📄 throttling-implementation.docx (890 KB)
Parent: Development Guidelines (Confluence)
Content: "Implementation guide for rate limiting middleware..."
3. Author and Size Filtering
Files by Team
Query: "deployment procedures"
Parameters: {
"attachment_filter": {
"author": "devops-team"
}
}
Results:
1. 📄 deployment-runbook-v3.pdf (2.1 MB)
Author: devops-team
Parent: Operations Documentation (Confluence)
Content: "Updated deployment procedures for Kubernetes..."
2. 📄 rollback-procedures.docx (678 KB)
Author: devops-team
Parent: Emergency Procedures (Confluence)
Content: "Step-by-step rollback process for production..."
Large Documents
Query: "comprehensive documentation"
Parameters: {
"attachment_filter": {
"file_size_min": 1048576 // Files larger than 1MB
}
}
Results:
1. 📄 complete-api-specification.pdf (5.2 MB)
Parent: API Documentation (Confluence)
Content: "Complete REST API specification with examples..."
2. 📄 system-architecture-guide.pdf (3.8 MB)
Parent: Architecture Documentation (Confluence)
Content: "Comprehensive system architecture documentation..."
🔧 Advanced Attachment Features
1. MarkItDown-Based Content Extraction
The attachment search uses MarkItDown for converting file content to searchable text:
PDF Content Extraction:
"System Architecture Overview
This document describes our microservices architecture..."
Excel Data Extraction:
"Performance Metrics
Response Time: 250ms
Database Queries: 45ms average"
PowerPoint Content:
"Deployment Strategy
Slide 1: Overview
Slide 2: Implementation Steps..."
Word Document:
"API Documentation
Authentication endpoints require JWT tokens..."
2. Confluence Integration
Results include context from the parent Confluence page:
Attachment: database-migration-script.sql
Parent Page: "Database Schema Updates v2.1"
Confluence Space: Development Documentation
Author: database-team
Upload Date: 2024-01-15
Parent Context:
This migration script updates the user table schema to support
new authentication features. Please run during maintenance window.
Related Attachments on Same Page:
- rollback-script.sql
- migration-test-results.md
3. Basic File Metadata
The search indexes available file metadata from Confluence:
{
"filename": "api-performance-analysis.xlsx",
"file_type": "xlsx",
"size": 2457600,
"author": "performance-team",
"parent_document": "Performance Testing Results",
"confluence_space": "Engineering Docs",
"upload_date": "2024-01-15T10:30:00Z"
}
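If you consume metadata like the record above in your own tooling, a typed wrapper can make it easier to work with. The sketch below parses that exact JSON into a dataclass; the `AttachmentMetadata` class and its `size_mb` property are hypothetical conveniences, not part of the server's API.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class AttachmentMetadata:
    filename: str
    file_type: str
    size: int  # bytes
    author: str
    parent_document: str
    confluence_space: str
    upload_date: datetime

    @classmethod
    def from_json(cls, raw: str) -> "AttachmentMetadata":
        data = json.loads(raw)
        # Swap the trailing "Z" for an explicit offset so that
        # datetime.fromisoformat also works on Python versions before 3.11
        data["upload_date"] = datetime.fromisoformat(
            data["upload_date"].replace("Z", "+00:00")
        )
        return cls(**data)

    @property
    def size_mb(self) -> float:
        """Size in binary megabytes (1 MB = 1,048,576 bytes)."""
        return round(self.size / (1024 * 1024), 2)


raw = """{
  "filename": "api-performance-analysis.xlsx",
  "file_type": "xlsx",
  "size": 2457600,
  "author": "performance-team",
  "parent_document": "Performance Testing Results",
  "confluence_space": "Engineering Docs",
  "upload_date": "2024-01-15T10:30:00Z"
}"""

meta = AttachmentMetadata.from_json(raw)
print(meta.filename, meta.size_mb, meta.upload_date.isoformat())
```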
4. Search Integration
Attachment search integrates with the broader search system:
- Semantic Search: Uses the same vector embeddings as regular document search
- Relevance Scoring: Combines content similarity with file metadata relevance
- Parent Context: Considers both attachment content and parent page context
- Filter Support: Allows filtering by file type, size, author, and parent page
🎯 Optimization Strategies
1. Query Optimization
File-Type Specific Queries
✅ "Find Excel files with performance metrics"
✅ "Show me PDF documents about deployment"
✅ "Search for architecture diagrams in PNG format"
❌ "performance"
❌ "deployment"
❌ "architecture"
Content-Specific Queries
✅ "Find documents containing database schema definitions"
✅ "Show me files with API endpoint documentation"
✅ "Search for configuration files with rate limiting settings"
2. Filter Optimization
File Size Filtering
{
"attachment_filter": {
"file_size_min": 1024, // Exclude tiny files
"file_size_max": 52428800 // Exclude files larger than 50MB
}
}
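The byte values used in this guide are binary megabytes (1 MB = 1,048,576 bytes). A small helper, hypothetical and purely for convenience, keeps size filters readable:

```python
def mb_to_bytes(megabytes: float) -> int:
    """Convert binary megabytes to bytes (1 MB = 1024 * 1024 bytes)."""
    return int(megabytes * 1024 * 1024)


size_filter = {
    "attachment_filter": {
        "file_size_min": 1024,             # 1 KB
        "file_size_max": mb_to_bytes(50),  # 50 MB
    }
}
print(size_filter)
```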
Author Filtering
{
"attachment_filter": {
"author": "architecture-team" // Specific team
}
}
3. Performance Optimization
Limit File Types
{
"attachment_filter": {
"file_type": "pdf", // Only search specific types
"attachments_only": true // Skip parent document content
}
}
Control Result Size
{
"limit": 5 // Fewer results for faster response
}
🎨 Result Interpretation
Understanding Attachment Results
File Information
📄 api-documentation.pdf (2.3 MB)
├── 📊 Metadata
│ ├── Author: technical-writing-team
│ ├── File Type: application/pdf
│ └── Size: 2.3 MB
├── 🔍 Content Preview
│ └── "This document provides comprehensive API documentation..."
├── 📁 Parent Context
│ ├── Document: API Reference Guide
│ └── Section: Complete API Documentation
└── 🔗 Related Files
├── api-examples.json
├── postman-collection.json
└── api-changelog.md
Similarity Scoring
Attachment search uses specialized similarity scoring:
Content Similarity: 0.89 (text content match)
Metadata Similarity: 0.76 (file properties match)
Context Similarity: 0.82 (parent document relevance)
Overall Score: 0.85 (weighted combination)
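The guide does not document how the three components are weighted. As an illustration only, the sketch below computes a weighted average with assumed weights of 0.6, 0.2, and 0.2, which happen to reproduce the 0.85 overall score in the example above. The actual weights used by the server may differ.

```python
# Hypothetical weights; the real values are not documented in this guide
WEIGHTS = {"content": 0.6, "metadata": 0.2, "context": 0.2}


def overall_score(content: float, metadata: float, context: float,
                  weights: dict = WEIGHTS) -> float:
    """Weighted average of the three similarity components."""
    total_weight = sum(weights.values())
    weighted_sum = (
        weights["content"] * content
        + weights["metadata"] * metadata
        + weights["context"] * context
    )
    # Dividing by the total keeps the result in range even if the
    # weights do not sum to exactly 1
    return weighted_sum / total_weight


score = overall_score(content=0.89, metadata=0.76, context=0.82)
print(round(score, 2))
```

Weighting content most heavily reflects the idea that a text match matters more than matching file properties, but that choice is part of the assumption.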
Quality Indicators
High-Quality Results
- High content similarity (>0.8)
- Rich metadata (author, creation date, etc.)
- Clear parent context (well-documented source)
- Appropriate file size (not too small or large)
Lower-Quality Results
- Low content similarity (<0.6)
- Missing metadata (unknown author, no dates)
- Orphaned files (no clear parent context)
- Unusual file sizes (very small or very large)
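These indicators can be turned into a quick triage rule when post-processing results. The sketch below uses the 0.8 and 0.6 thresholds listed above. The way the indicators are combined, the "medium" tier, and the result field names are assumptions made for illustration.

```python
def quality_tier(result: dict) -> str:
    """Classify a search result as 'high', 'medium', or 'low' quality."""
    similarity = result.get("content_similarity", 0.0)
    has_metadata = bool(result.get("author")) and bool(result.get("upload_date"))
    has_parent = bool(result.get("parent_document"))

    # High quality: strong content match plus rich metadata and clear parent
    if similarity > 0.8 and has_metadata and has_parent:
        return "high"
    # Lower quality: weak content match or an orphaned file
    if similarity < 0.6 or not has_parent:
        return "low"
    # Anything in between (this tier is not defined in the guide)
    return "medium"


strong = {
    "content_similarity": 0.89,
    "author": "technical-writing-team",
    "upload_date": "2024-01-15T10:30:00Z",
    "parent_document": "API Reference Guide",
}
orphan = {"content_similarity": 0.85, "author": "unknown"}

print(quality_tier(strong), quality_tier(orphan))
```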
🔗 Integration with Other Search Tools
Combining Search Strategies for Confluence
1. Start with Semantic Search
Query: "deployment procedures"
→ Find general documentation about deployment in all sources
2. Use Attachment Search for Confluence Files
Query: "deployment scripts and configurations"
Parameters: {
"attachment_filter": {
"file_type": "yml"
}
}
→ Find specific implementation files in Confluence attachments
3. Use Hierarchy Search for Confluence Structure
Query: "deployment documentation structure"
→ Understand how deployment docs are organized in Confluence
Multi-Tool Workflow Example
1. Semantic Search: "API authentication methods"
→ Understand authentication concepts across all sources
2. Attachment Search: "authentication configuration files"
→ Find Confluence attachments with implementation details
3. Hierarchy Search: "authentication documentation structure"
→ See how auth docs are organized in Confluence
4. Attachment Search: "authentication examples and certificates"
→ Find practical examples and certificates in Confluence attachments
When to Use Attachment Search
✅ Use Attachment Search When:
- Looking for files stored in Confluence
- Need to find specific file types (PDFs, Excel, Word docs)
- Want to search within file content, not just titles
- Need parent page context for attachments
- Working primarily with Confluence-based knowledge
❌ Don't Use Attachment Search When:
- Looking for Git repository files (use semantic search)
- Searching JIRA tickets (use semantic search)
- Need to search across all source types
- Looking for regular page content (use semantic or hierarchy search)
🔗 Related Documentation
- MCP Server Overview - Complete MCP server guide
- Search Capabilities - All search tools overview
- Hierarchy Search - Document structure navigation
- Setup and Integration - MCP server setup
📋 Attachment Search Checklist
- [ ] Understand which Confluence attachments exist in your knowledge base
- [ ] Use file-type specific queries for targeted search within Confluence
- [ ] Apply appropriate filters (size, author, file type, parent page)
- [ ] Include parent context for complete Confluence page understanding
- [ ] Check file metadata for quality and relevance
- [ ] Combine with other search tools for comprehensive results across all sources
- [ ] Optimize performance with appropriate limits
- [ ] Understand MarkItDown limitations for file content extraction
Unlock the knowledge in your Confluence files! 📎
Attachment search reveals the wealth of information stored in your Confluence attachments - from detailed specifications in PDFs to data insights in spreadsheets to presentation content in slides. By understanding how to search and interpret Confluence file attachments, you can access important content that might otherwise be buried in file repositories.
Note: This feature currently focuses on Confluence sources and uses MarkItDown for file content extraction. For files in other sources (Git repositories, JIRA, etc.), use the standard semantic search tool.