Public Documentation
Connect QDrant Loader to public documentation websites, API references, and external knowledge sources. This guide covers setup for web scraping and processing publicly available content.
What Gets Processed
When you configure public documentation processing, QDrant Loader can handle:
- API Documentation - REST API docs, OpenAPI specs, SDK documentation
- Technical Documentation - Framework docs, library references, tutorials
- Knowledge Bases - Public wikis, help centers, support documentation
- Blog Posts - Technical blogs, release notes, announcements
- Static Sites - Documentation sites built with Jekyll, Hugo, GitBook
- Versioned Documentation - Specific versions of documentation
Setup and Configuration
QDrant Loader uses a project-based configuration structure. Each project can have multiple public documentation sources.
Basic Configuration
projects:
  my-project:
    display_name: "My Documentation Project"
    description: "External documentation and knowledge sources"
    collection_name: "my-docs"
    sources:
      publicdocs:
        example-docs:
          base_url: "https://docs.example.com"
          version: "1.0"
          content_type: "html"
          selectors:
            content: "article, main, .content"
            remove: ["nav", "header", "footer", ".sidebar"]
            code_blocks: "pre code"
          download_attachments: false
          enable_file_conversion: false
Advanced Configuration
projects:
  my-project:
    display_name: "My Documentation Project"
    description: "External documentation and knowledge sources"
    collection_name: "my-docs"
    sources:
      publicdocs:
        example-docs:
          base_url: "https://docs.example.com"
          version: "1.0"
          content_type: "html"
          # Path filtering
          path_pattern: "/docs/**"
          exclude_paths:
            - "/docs/archive/**"
            - "/docs/deprecated/**"
            - "/api/internal/**"
          # Content extraction selectors
          selectors:
            content: "article, main, .content"
            remove: ["nav", "header", "footer", ".sidebar", ".advertisement"]
            code_blocks: "pre code, .highlight code"
          # Attachment handling
          download_attachments: true
          attachment_selectors:
            - "a[href$='.pdf']"
            - "a[href$='.doc']"
            - "a[href$='.docx']"
            - "a[href$='.xlsx']"
            - "a[href$='.pptx']"
          # File conversion
          enable_file_conversion: true
Multiple Documentation Sites
projects:
  multi-docs:
    display_name: "Multi-Documentation Project"
    description: "Documentation from multiple external sources"
    collection_name: "multi-docs"
    sources:
      publicdocs:
        # Main API documentation
        api-docs:
          base_url: "https://api.example.com/docs"
          version: "v2"
          content_type: "html"
          path_pattern: "/docs/**"
          selectors:
            content: ".api-content"
            remove: [".sidebar", ".navigation"]
          download_attachments: false
          enable_file_conversion: false
        # Framework documentation
        framework-docs:
          base_url: "https://framework.example.com"
          version: "latest"
          content_type: "html"
          path_pattern: "/guide/**"
          selectors:
            content: ".documentation"
            remove: [".menu", ".footer"]
          download_attachments: false
          enable_file_conversion: false
        # Community wiki
        community-wiki:
          base_url: "https://wiki.example.com"
          version: "current"
          content_type: "html"
          exclude_paths:
            - "/wiki/user:**"
            - "/wiki/talk:**"
          selectors:
            content: ".wiki-content"
            remove: [".sidebar", ".edit-section"]
          download_attachments: false
          enable_file_conversion: false
Configuration Options
Required Settings
Option | Type | Description | Example |
---|---|---|---|
base_url | string | Base URL to start crawling | https://docs.example.com |
version | string | Version identifier for the documentation | 1.0 |
Content Settings
Option | Type | Description | Default |
---|---|---|---|
content_type | string | Content type: html, markdown, or rst | html |
Path Filtering
Option | Type | Description | Default |
---|---|---|---|
path_pattern | string | Specific path pattern to match | null (all paths) |
exclude_paths | list | List of paths to exclude from processing | [] |
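To make the interaction between path_pattern and exclude_paths concrete, here is a minimal sketch of one plausible matching rule. This guide does not state how QDrant Loader evaluates these patterns, so the function name `is_path_allowed` and the use of Python's `fnmatch` (where `*` also matches `/`) are assumptions made for illustration only.

```python
from fnmatch import fnmatchcase


def is_path_allowed(path, path_pattern=None, exclude_paths=()):
    """Return True if `path` passes the include pattern and no exclude pattern.

    Hypothetical helper: the real loader may use different glob semantics.
    """
    # A missing path_pattern means "all paths", mirroring the documented default.
    if path_pattern is not None and not fnmatchcase(path, path_pattern):
        return False
    # Any matching exclude pattern rejects the path.
    return not any(fnmatchcase(path, pattern) for pattern in exclude_paths)


if __name__ == "__main__":
    excludes = ["/docs/archive/**", "/docs/deprecated/**"]
    for candidate in ["/docs/intro", "/docs/archive/2019/page", "/blog/post"]:
        print(candidate, is_path_allowed(candidate, "/docs/**", excludes))
```

Under these assumptions, a path must match the include pattern first and is then dropped if any exclude pattern matches, so excludes always win over the include pattern.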
Content Extraction
Option | Type | Description | Default |
---|---|---|---|
selectors.content | string | Main content CSS selector | article, main, .content |
selectors.remove | list | Elements to remove from content | ["nav", "header", "footer", ".sidebar"] |
selectors.code_blocks | string | Code blocks CSS selector | pre code |
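The idea behind selectors.content and selectors.remove can be illustrated with a small standard-library sketch: keep text inside the content elements and drop anything nested inside the removed elements. This is not the loader's implementation. It handles only plain tag names (article, main, nav, header, footer), not class selectors such as .content or .sidebar, and the names `ContentExtractor` and `extract_text` are hypothetical.

```python
from html.parser import HTMLParser

CONTENT_TAGS = {"article", "main"}        # stands in for selectors.content
REMOVE_TAGS = {"nav", "header", "footer"}  # stands in for selectors.remove


class ContentExtractor(HTMLParser):
    """Collects text that is inside a content tag and outside any removed tag."""

    def __init__(self):
        super().__init__()
        self.content_depth = 0
        self.remove_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in CONTENT_TAGS:
            self.content_depth += 1
        elif tag in REMOVE_TAGS:
            self.remove_depth += 1

    def handle_endtag(self, tag):
        if tag in CONTENT_TAGS and self.content_depth:
            self.content_depth -= 1
        elif tag in REMOVE_TAGS and self.remove_depth:
            self.remove_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.content_depth and not self.remove_depth:
            self.chunks.append(text)


def extract_text(html):
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = "<nav>Menu</nav><main><h1>Title</h1><p>Body text</p></main><footer>Legal</footer>"
    print(extract_text(page))
```

If a page comes back empty under a rule like this, the usual cause is that none of the content selectors match the page, which is why it helps to list several fallbacks.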
Attachment Processing
Option | Type | Description | Default |
---|---|---|---|
download_attachments | bool | Download and process linked files | false |
attachment_selectors | list | CSS selectors for finding attachments | PDF, DOC, XLS, PPT selectors |
enable_file_conversion | bool | Enable file conversion for attachments | false |
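Selectors such as a[href$='.pdf'] use the CSS "ends with" attribute operator, so they select links whose href ends in the given suffix. The sketch below mimics that behaviour with the standard library to show which links a selector list would pick up. It understands only the simple suffix form and ignores every other selector; `find_attachments` is a hypothetical helper, not part of QDrant Loader.

```python
import re
from html.parser import HTMLParser

# Matches only the simple suffix form, e.g. a[href$='.pdf']
SUFFIX_SELECTOR = re.compile(r"^a\[href\$=['\"]([^'\"]+)['\"]\]$")


class _LinkCollector(HTMLParser):
    """Collects href values of <a> tags that end with one of the suffixes."""

    def __init__(self, suffixes):
        super().__init__()
        self.suffixes = tuple(suffixes)
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if self.suffixes and href.endswith(self.suffixes):
            self.matches.append(href)


def find_attachments(html, selectors):
    # Selectors that are not in the suffix form are skipped in this sketch.
    suffixes = [m.group(1) for m in map(SUFFIX_SELECTOR.match, selectors) if m]
    collector = _LinkCollector(suffixes)
    collector.feed(html)
    collector.close()
    return collector.matches


if __name__ == "__main__":
    html = '<a href="/files/guide.pdf">Guide</a><a href="/page.html">Page</a>'
    print(find_attachments(html, ["a[href$='.pdf']", "a[href$='.docx']"]))
```

Suffix matching is case-sensitive here, so a link ending in .PDF would need its own selector under this rule.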
Usage Examples
API Documentation
projects:
  api-documentation:
    display_name: "API Documentation"
    description: "External API documentation sources"
    collection_name: "api-docs"
    sources:
      publicdocs:
        # REST API Documentation
        stripe-api:
          base_url: "https://stripe.com/docs/api"
          version: "2023-10-16"
          content_type: "html"
          path_pattern: "/docs/api/**"
          selectors:
            content: ".api-content"
            remove: [".sidebar", ".navigation", ".footer"]
            code_blocks: "pre code, .highlight code"
          download_attachments: false
          enable_file_conversion: false
        # OpenAPI/Swagger Documentation
        petstore-api:
          base_url: "https://petstore.swagger.io"
          version: "v3"
          content_type: "html"
          path_pattern: "/v3/**"
          selectors:
            content: ".swagger-ui"
            remove: [".topbar", ".information-container"]
          download_attachments: false
          enable_file_conversion: false
Framework Documentation
projects:
  frameworks:
    display_name: "Framework Documentation"
    description: "Documentation for development frameworks"
    collection_name: "framework-docs"
    sources:
      publicdocs:
        # React Documentation
        react-docs:
          base_url: "https://react.dev"
          version: "18"
          content_type: "html"
          path_pattern: "/learn/**"
          exclude_paths:
            - "/blog/**"
            - "/community/**"
          selectors:
            content: ".content"
            remove: [".sidebar", ".navigation"]
          download_attachments: false
          enable_file_conversion: false
        # Django Documentation
        django-docs:
          base_url: "https://docs.djangoproject.com"
          version: "stable"
          content_type: "html"
          path_pattern: "/en/stable/**"
          selectors:
            content: ".document"
            remove: [".sphinxsidebar", ".related"]
          download_attachments: false
          enable_file_conversion: false
Knowledge Bases and Wikis
projects:
  knowledge:
    display_name: "Knowledge Base"
    description: "External knowledge bases and wikis"
    collection_name: "knowledge-base"
    sources:
      publicdocs:
        # GitHub Wiki
        vscode-wiki:
          base_url: "https://github.com/microsoft/vscode/wiki"
          version: "current"
          content_type: "html"
          selectors:
            content: ".markdown-body"
            remove: [".gh-header", ".pagehead"]
          download_attachments: false
          enable_file_conversion: false
        # GitBook Documentation
        gitbook-docs:
          base_url: "https://docs.gitbook.com"
          version: "latest"
          content_type: "html"
          path_pattern: "/product-tour/**"
          selectors:
            content: ".page-content"
            remove: [".sidebar", ".header"]
          download_attachments: false
          enable_file_conversion: false
Technical Blogs and Release Notes
projects:
  technical-content:
    display_name: "Technical Content"
    description: "Technical blogs and release notes"
    collection_name: "tech-content"
    sources:
      publicdocs:
        # Engineering Blog
        engineering-blog:
          base_url: "https://engineering.example.com"
          version: "current"
          content_type: "html"
          path_pattern: "/posts/**"
          exclude_paths:
            - "/author/**"
            - "/tag/**"
          selectors:
            content: ".post-content"
            remove: [".sidebar", ".author-bio", ".related-posts"]
          download_attachments: false
          enable_file_conversion: false
        # Release Notes
        release-notes:
          base_url: "https://releases.example.com"
          version: "latest"
          content_type: "html"
          path_pattern: "/notes/**"
          selectors:
            content: ".release-content"
            remove: [".navigation", ".footer"]
          download_attachments: false
          enable_file_conversion: false
Testing and Validation
Initialize and Test Configuration
# Initialize the project (creates collection if needed)
qdrant-loader --workspace . init
# Test ingestion with your public docs configuration
qdrant-loader --workspace . ingest --project my-project
# Check project status
qdrant-loader --workspace . project status --project-id my-project
# List all configured projects
qdrant-loader --workspace . project list
# Validate project configuration
qdrant-loader --workspace . project validate --project-id my-project
Debug Public Documentation Processing
# Enable debug logging
qdrant-loader --workspace . --log-level DEBUG ingest --project my-project
# Process specific project only
qdrant-loader --workspace . ingest --project my-project
# Process specific source within a project
qdrant-loader --workspace . ingest --project my-project --source-type publicdocs --source example-docs
Troubleshooting
Common Issues
Access Denied or Blocked
Problem: 403 Forbidden, 429 Too Many Requests, or blocked by anti-bot measures
Solutions:
- Check robots.txt: Ensure the site allows crawling
- Verify URL accessibility: Test the base URL manually
- Check path patterns: Ensure path_pattern matches actual site structure
# Test website accessibility
curl -I "https://docs.example.com"
# Check robots.txt
curl "https://docs.example.com/robots.txt"
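Once you have fetched robots.txt, you can check whether a given URL is allowed without reading the rules by hand. The sketch below uses Python's standard `urllib.robotparser` on rules supplied as text, so it needs no network access. The sample rules and the helper name `can_crawl` are illustrative assumptions, and this guide does not say whether QDrant Loader performs such a check itself.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; paste the real robots.txt content here.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def can_crawl(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://docs.example.com/docs/intro",
        "https://docs.example.com/private/report",
    ):
        print(url, can_crawl(SAMPLE_ROBOTS, url))
```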
Content Not Found
Problem: CSS selectors don't match content or pages appear empty
Solutions:
# Test CSS selectors manually
curl -s "https://example.com/page" | grep -A 10 -B 10 "class=\"content\""
# Use browser developer tools to find correct selectors
projects:
  my-project:
    sources:
      publicdocs:
        example-docs:
          base_url: "https://example.com"
          version: "1.0"
          # Try multiple selectors
          selectors:
            content: "article, main, .content, .documentation, .md-content"
            remove: ["nav", "header", "footer", ".sidebar", ".menu"]
Path Pattern Issues
Problem: No pages being processed due to incorrect path patterns
Solutions:
projects:
  my-project:
    sources:
      publicdocs:
        example-docs:
          base_url: "https://docs.example.com"
          version: "1.0"
          # Use broader path pattern or remove it entirely
          path_pattern: "/**" # Allow all paths
          # Or be more specific
          # path_pattern: "/docs/**"
          # Check exclude patterns
          exclude_paths:
            - "/docs/archive/**"
            - "/api/internal/**"
Configuration Issues
Problem: Configuration validation errors
Solutions:
- Verify project structure:
  projects:
    your-project: # Project ID
      sources:
        publicdocs:
          source-name: # Source name
            base_url: "..."
            # ... other settings
- Check required fields:
  - base_url: Must be a valid URL
  - version: Must be a non-empty string
  - content_type: Must be html, markdown, or rst
- Validate selectors:
  selectors:
    content: "article, main, .content" # Valid CSS selector
    remove: ["nav", "header", "footer"] # List of CSS selectors
    code_blocks: "pre code" # Valid CSS selector
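The field checks listed above can be expressed as a short pre-flight function that you run on a source definition before calling the CLI. This is a sketch of those three rules only, not the validator QDrant Loader ships; the function name `validate_source` and the error messages are hypothetical.

```python
from urllib.parse import urlparse

ALLOWED_CONTENT_TYPES = {"html", "markdown", "rst"}


def validate_source(source: dict) -> list[str]:
    """Return a list of problems found in one publicdocs source definition."""
    errors = []

    # base_url: must parse as an http(s) URL with a host.
    parsed = urlparse(str(source.get("base_url", "")))
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        errors.append("base_url must be a valid http(s) URL")

    # version: must be a non-empty string.
    version = source.get("version")
    if not isinstance(version, str) or not version.strip():
        errors.append("version must be a non-empty string")

    # content_type: optional, defaults to html.
    if source.get("content_type", "html") not in ALLOWED_CONTENT_TYPES:
        errors.append("content_type must be html, markdown, or rst")

    return errors


if __name__ == "__main__":
    print(validate_source({"base_url": "https://docs.example.com", "version": "1.0"}))
    print(validate_source({"base_url": "docs.example.com", "version": 1.0}))
```

One detail worth noting: an unquoted `version: 1.0` is parsed by YAML as a float rather than a string, which is why the examples in this guide quote the value.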
Attachment Processing Issues
Problem: Attachments not being downloaded or processed
Solutions:
projects:
  my-project:
    sources:
      publicdocs:
        example-docs:
          base_url: "https://docs.example.com"
          version: "1.0"
          # Enable attachment processing
          download_attachments: true
          enable_file_conversion: true
          # Customize attachment selectors
          attachment_selectors:
            - "a[href$='.pdf']"
            - "a[href$='.doc']"
            - "a[href$='.docx']"
            - "a[href*='download']"
Debugging Commands
# Check website structure
curl -s "https://example.com" | grep -E '<title>|<h1>|class="content"'
# Test specific page
curl -s "https://example.com/docs/page" | head -50
# Check for JavaScript requirements
curl -s "https://example.com" | grep -i javascript
Monitoring and Processing
Check Processing Status
# View project status
qdrant-loader --workspace . project status
# Check specific project
qdrant-loader --workspace . project status --project-id my-project
# List all projects
qdrant-loader --workspace . project list
Configuration Management
# View current configuration
qdrant-loader --workspace . config
# Validate all projects
qdrant-loader --workspace . project validate
Best Practices
Site Selection
- Choose stable documentation sites - Avoid frequently changing sites
- Verify site accessibility - Ensure the site allows automated access
- Check content structure - Verify consistent HTML structure
- Test CSS selectors - Ensure selectors work across different pages
Performance Optimization
- Use specific path patterns - Limit crawling to relevant sections
- Optimize CSS selectors - Use efficient selectors for content extraction
- Filter aggressively - Exclude unnecessary paths and content
- Enable file conversion selectively - Only when needed for attachments
Content Quality
- Verify content extraction - Check that selectors capture the right content
- Remove navigation elements - Exclude menus, headers, and footers
- Handle code blocks properly - Use appropriate selectors for code
- Test with sample pages - Verify configuration with representative pages
Security Considerations
- Respect robots.txt - Follow site crawling guidelines
- Avoid overloading servers - Use reasonable crawling rates
- Check terms of service - Ensure compliance with site policies
- Monitor access patterns - Track which content is being accessed
Maintenance
- Monitor site changes - Documentation sites may change structure
- Update selectors regularly - Adjust selectors when sites are updated
- Version documentation appropriately - Use meaningful version identifiers
- Validate regularly - Periodically check that processing still works
Related Documentation
- Configuration Reference - Complete configuration options
- File Conversion - Processing downloaded attachments
- Troubleshooting - Common issues and solutions
- MCP Server - Using processed documentation with AI tools
- Project Management - Managing multiple projects
Ready to connect to public documentation? Start with the basic configuration above and customize based on the specific documentation site structure and your needs.