Git Repositories
Connect QDrant Loader to Git repositories to index source code, documentation, and project files. This guide covers setup for GitHub, GitLab, Bitbucket, and self-hosted Git servers.
What Gets Processed
When you connect a Git repository, QDrant Loader can process:
- Source code files - Python, JavaScript, Java, C++, and more
- Documentation - Markdown, reStructuredText, plain text files
- Configuration files - YAML, JSON, TOML, XML
- README files - Project documentation and guides
- Any text-based files - Based on your file type configuration
Authentication Setup
GitHub
Personal Access Token (Recommended)
- Create a Personal Access Token:
  - Go to GitHub Settings → Developer settings → Personal access tokens
  - Click "Generate new token (classic)"
  - Select scopes: `repo` (for private repos) or `public_repo` (for public repos only)
- Copy the token (starts with `ghp_`)
- Set environment variable:
  export REPO_TOKEN=ghp_your_github_token_here
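To catch a missing or mistyped token before starting an ingestion, a small shell check can help. The sketch below is illustrative only: `check_github_token` is a hypothetical helper, not part of QDrant Loader, and it only inspects the `ghp_` prefix mentioned above rather than contacting GitHub.

```bash
# Hypothetical helper (not part of QDrant Loader): classify a token string.
# Uses the first argument if one is given, otherwise falls back to $REPO_TOKEN.
check_github_token() {
  token="${1-${REPO_TOKEN-}}"
  case "$token" in
    "")    echo "missing" ;;            # variable unset or empty
    ghp_*) echo "ok" ;;                 # classic personal access token prefix
    *)     echo "unexpected-prefix" ;;  # e.g. a token from another provider
  esac
}
```

Calling `check_github_token` with no arguments inspects the exported `REPO_TOKEN`.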
GitLab
Personal Access Token
- Create a Personal Access Token:
  - Go to GitLab Settings → Access Tokens
  - Create token with `read_repository` scope
- Copy the token
- Set environment variable:
  export REPO_TOKEN=glpat-your_gitlab_token_here
Other Git Providers
For other Git providers (Bitbucket, self-hosted), use their respective token systems:
export REPO_TOKEN=your_access_token_here
Configuration
QDrant Loader uses a project-based configuration structure. Each project can have multiple Git repository sources.
Basic Configuration
projects:
  my-project:
    display_name: "My Code Project"
    description: "Source code and documentation"
    collection_name: "my-code"
    sources:
      git:
        main-repo:
          base_url: "https://github.com/your-org/your-repo.git"
          branch: "main"
          token: "${REPO_TOKEN}"
          include_paths:
            - "docs/**"
            - "src/**"
            - "README.md"
          exclude_paths:
            - "node_modules/**"
            - "build/**"
          file_types:
            - "*.md"
            - "*.py"
            - "*.js"
          max_file_size: 1048576  # 1MB
          depth: 1
          enable_file_conversion: true
Advanced Configuration
projects:
  development:
    display_name: "Development Project"
    description: "Multiple repositories for development"
    collection_name: "dev-docs"
    sources:
      git:
        # Frontend repository
        frontend-repo:
          base_url: "https://github.com/your-org/frontend.git"
          branch: "main"
          token: "${REPO_TOKEN}"
          include_paths:
            - "src/**"
            - "docs/**"
            - "README.md"
          exclude_paths:
            - "node_modules/**"
            - "dist/**"
            - "build/**"
          file_types:
            - "*.js"
            - "*.jsx"
            - "*.ts"
            - "*.tsx"
            - "*.md"
          max_file_size: 1048576
          depth: 1
          enable_file_conversion: true
        # Backend repository
        backend-repo:
          base_url: "https://github.com/your-org/backend.git"
          branch: "main"
          token: "${REPO_TOKEN}"
          include_paths:
            - "src/**"
            - "docs/**"
            - "README.md"
          exclude_paths:
            - "__pycache__/**"
            - "venv/**"
            - ".pytest_cache/**"
          file_types:
            - "*.py"
            - "*.md"
            - "*.yaml"
            - "*.json"
          max_file_size: 1048576
          depth: 1
          enable_file_conversion: true
Multiple Repositories
projects:
  multi-repo:
    display_name: "Multi-Repository Project"
    description: "Documentation from multiple repositories"
    collection_name: "multi-repo-docs"
    sources:
      git:
        # Documentation repository
        docs-repo:
          base_url: "https://github.com/your-org/docs.git"
          branch: "main"
          token: "${REPO_TOKEN}"
          include_paths:
            - "docs/**"
            - "README.md"
          exclude_paths: []
          file_types:
            - "*.md"
            - "*.rst"
            - "*.txt"
          max_file_size: 1048576
          depth: 1
          enable_file_conversion: true
        # API documentation
        api-docs:
          base_url: "https://github.com/your-org/api-docs.git"
          branch: "main"
          token: "${REPO_TOKEN}"
          include_paths:
            - "**/*.md"
            - "**/*.yaml"
          exclude_paths:
            - "archive/**"
          file_types:
            - "*.md"
            - "*.yaml"
            - "*.json"
          max_file_size: 1048576
          depth: 1
          enable_file_conversion: true
Configuration Options
Required Settings
Option | Type | Description | Example |
---|---|---|---|
`base_url` | string | Repository URL (HTTPS or SSH) | https://github.com/org/repo.git |
`branch` | string | Branch to process | main |
`token` | string | Authentication token | ${REPO_TOKEN} |
`file_types` | list | File extensions to process | ["*.md", "*.py"] |
Path Filtering
Option | Type | Description | Default |
---|---|---|---|
`include_paths` | list | Glob patterns for paths to include | [] (all) |
`exclude_paths` | list | Glob patterns for paths to exclude | [] |
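How include and exclude patterns combine can be previewed in the shell before running an ingestion. The sketch below assumes a path is kept when it matches at least one include pattern (or the include list is empty) and matches no exclude pattern. `matches_any`, `is_selected`, `INCLUDE`, and `EXCLUDE` are hypothetical names, and shell `case` globbing (where `*` also matches `/`) only approximates whatever matcher QDrant Loader uses internally.

```bash
# matches_any PATH PATTERN...: succeed if PATH matches at least one glob.
matches_any() {
  path="$1"; shift
  for pattern in "$@"; do
    # An unquoted $pattern is treated as a glob by `case`.
    case "$path" in
      $pattern) return 0 ;;
    esac
  done
  return 1
}

# is_selected PATH: apply the include list first, then the exclude list.
# INCLUDE and EXCLUDE are space-separated pattern lists (assumed names).
is_selected() {
  set -f   # keep patterns from expanding against local files
  selected=0
  if [ -n "${INCLUDE-}" ] && ! matches_any "$1" $INCLUDE; then selected=1; fi
  if [ -n "${EXCLUDE-}" ] && matches_any "$1" $EXCLUDE; then selected=1; fi
  set +f
  return "$selected"
}
```

For example, with `INCLUDE="docs/** src/** README.md"` and `EXCLUDE="node_modules/** build/**"`, `is_selected docs/guide/intro.md` succeeds while `is_selected build/app.js` fails.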
Processing Options
Option | Type | Description | Default |
---|---|---|---|
`max_file_size` | int | Maximum file size in bytes | 1048576 (1MB) |
`depth` | int | Repository clone depth | 1 |
`enable_file_conversion` | bool | Enable file conversion for attachments | true |
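Because `max_file_size` is given in bytes, converting from a human-readable size avoids arithmetic slips. This is a minimal sketch, assuming binary units (1 KB = 1024 bytes), which is what the 1048576 (1MB) default implies; `to_bytes` is a hypothetical helper, not a QDrant Loader command.

```bash
# Hypothetical helper: convert sizes such as 512KB or 2MB to bytes.
to_bytes() {
  case "$1" in
    *KB) echo $(( ${1%KB} * 1024 )) ;;
    *MB) echo $(( ${1%MB} * 1024 * 1024 )) ;;
    *)   echo "$1" ;;   # assume the value is already in bytes
  esac
}

to_bytes 512KB   # prints 524288
to_bytes 2MB     # prints 2097152
```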
Usage Examples
Software Development Team
projects:
  dev-team:
    display_name: "Development Team"
    description: "Source code and technical documentation"
    collection_name: "dev-code"
    sources:
      git:
        # Main application repository
        main-app:
          base_url: "https://github.com/company/main-app.git"
          branch: "main"
          token: "${REPO_TOKEN}"
          include_paths:
            - "src/**"
            - "docs/**"
            - "README.md"
            - "CHANGELOG.md"
          exclude_paths:
            - "tests/**"
            - "node_modules/**"
            - "__pycache__/**"
          file_types:
            - "*.py"
            - "*.js"
            - "*.ts"
            - "*.md"
            - "*.yaml"
          max_file_size: 1048576
          depth: 1
          enable_file_conversion: true
        # Shared libraries
        shared-libs:
          base_url: "https://github.com/company/shared-libs.git"
          branch: "main"
          token: "${REPO_TOKEN}"
          include_paths:
            - "lib/**"
            - "docs/**"
          exclude_paths:
            - "tests/**"
          file_types:
            - "*.py"
            - "*.js"
            - "*.md"
          max_file_size: 524288  # 512KB
          depth: 1
          enable_file_conversion: true
Documentation Team
projects:
  docs-team:
    display_name: "Documentation Team"
    description: "All documentation repositories"
    collection_name: "documentation"
    sources:
      git:
        # Main documentation
        main-docs:
          base_url: "https://github.com/company/documentation.git"
          branch: "main"
          token: "${REPO_TOKEN}"
          include_paths:
            - "docs/**"
            - "guides/**"
            - "README.md"
          exclude_paths:
            - "archive/**"
          file_types:
            - "*.md"
            - "*.rst"
            - "*.txt"
          max_file_size: 1048576
          depth: 1
          enable_file_conversion: true
        # API documentation
        api-docs:
          base_url: "https://github.com/company/api-docs.git"
          branch: "main"
          token: "${REPO_TOKEN}"
          include_paths:
            - "**/*.md"
            - "**/*.yaml"
          exclude_paths: []
          file_types:
            - "*.md"
            - "*.yaml"
            - "*.json"
          max_file_size: 1048576
          depth: 1
          enable_file_conversion: true
Research Team
projects:
  research-team:
    display_name: "Research Team"
    description: "Research code and documentation"
    collection_name: "research"
    sources:
      git:
        # Analysis tools
        analysis-tools:
          base_url: "https://github.com/research-org/analysis-tools.git"
          branch: "main"
          token: "${REPO_TOKEN}"
          include_paths:
            - "src/**"
            - "notebooks/**"
            - "docs/**"
            - "README.md"
          exclude_paths:
            - "data/**"
            - "output/**"
          file_types:
            - "*.py"
            - "*.ipynb"
            - "*.md"
            - "*.txt"
          max_file_size: 2097152  # 2MB for notebooks
          depth: 1
          enable_file_conversion: true
Testing and Validation
Initialize and Test Configuration
# Initialize the project (creates collection if needed)
qdrant-loader --workspace . init
# Test ingestion with your Git configuration
qdrant-loader --workspace . ingest --project my-project
# Check project status
qdrant-loader --workspace . project status --project-id my-project
# List all configured projects
qdrant-loader --workspace . project list
# Validate project configuration
qdrant-loader --workspace . project validate --project-id my-project
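The commands above can be chained so that a failure stops the run early. This is a sketch rather than an official workflow: `ingest_project` and the `QDRANT_LOADER_BIN` override are hypothetical names introduced here, and only subcommands already shown in this guide are used.

```bash
# Hypothetical wrapper: initialize, ingest, then report status for one project.
# Each step runs only if the previous one succeeded.
ingest_project() {
  loader="${QDRANT_LOADER_BIN:-qdrant-loader}"   # override for dry runs
  project="$1"
  "$loader" --workspace . init &&
    "$loader" --workspace . ingest --project "$project" &&
    "$loader" --workspace . project status --project-id "$project"
}
```

Setting `QDRANT_LOADER_BIN=echo` before calling `ingest_project my-project` prints the arguments of each step without running the loader, which is a cheap way to review what would be executed.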
Debug Git Processing
# Enable debug logging
qdrant-loader --workspace . --log-level DEBUG ingest --project my-project
# Process specific project only
qdrant-loader --workspace . ingest --project my-project
# Process specific source within a project
qdrant-loader --workspace . ingest --project my-project --source-type git --source main-repo
Troubleshooting
Common Issues
Authentication Failures
Problem: `Authentication failed` or `Permission denied`
Solutions:
# Check token validity for GitHub
curl -H "Authorization: token $REPO_TOKEN" https://api.github.com/user
# Check token validity for GitLab
curl -H "Authorization: Bearer $REPO_TOKEN" https://gitlab.com/api/v4/user
# Test repository access manually
git clone https://github.com/org/repo.git /tmp/test-repo
Check your configuration:
- Ensure the `token` is set correctly via environment variable
- For private repositories, ensure the token has appropriate permissions
- Verify the repository URL is correct and accessible
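The `curl` checks above print the full API response; for scripting it is often enough to look at the HTTP status code. The following sketch is a convenience built on assumptions: `github_token_status` and `explain_status` are hypothetical helpers, and the status meanings follow general GitHub REST API behaviour rather than anything QDrant Loader itself reports.

```bash
# Hypothetical helper: print only the HTTP status code of the GitHub token check.
github_token_status() {
  curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: token $REPO_TOKEN" https://api.github.com/user
}

# Hypothetical helper: map a status code to a likely cause.
explain_status() {
  case "$1" in
    200) echo "token accepted" ;;
    401) echo "token missing, invalid, or expired" ;;
    403) echo "token lacks permission or the rate limit was hit" ;;
    404) echo "resource not found or not visible to this token" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}
```

A typical call would be `explain_status "$(github_token_status)"`.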
Repository Access Issues
Problem: `Repository not found` or `No permission to access repository`
Solutions:
- Verify repository URL:
  - Ensure the URL is correct and includes the `.git` extension
  - Check if the repository is public or private
  - Verify you have access to the repository
- Check authentication:
  # Test manual clone
  git clone https://github.com/org/repo.git /tmp/test-clone
- Verify token permissions:
  - For GitHub: Ensure token has `repo` scope for private repos
  - For GitLab: Ensure token has `read_repository` scope
Configuration Issues
Problem: Configuration validation errors
Solutions:
- Verify project structure:
  projects:
    your-project:  # Project ID
      sources:
        git:
          source-name:  # Source name
            base_url: "..."
            # ... other settings
- Check required fields:
  - `base_url`: Must be a valid Git repository URL
  - `branch`: Must be a valid branch name
  - `token`: Must be set via environment variable
  - `file_types`: Must be a non-empty list
- Validate file patterns:
  file_types:
    - "*.md"  # Correct
    - "*.py"  # Correct
  include_paths:
    - "docs/**"  # Correct glob pattern
    - "src/**"  # Correct glob pattern
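A rough pre-check for the required fields listed above can be done with `grep` before running the loader's own validation. This is only a sketch under stated assumptions: `missing_keys` is a hypothetical helper, it looks for each key anywhere in the file rather than per source, and it does not parse YAML, so `project validate` remains the authoritative check.

```bash
# Hypothetical helper: print each required key that never appears in the file.
# Note: a key present under one source hides its absence under another.
missing_keys() {
  for key in base_url branch token file_types; do
    grep -Eq "^[[:space:]]*${key}:" "$1" || echo "$key"
  done
}
```

Running `missing_keys config.yaml` prints nothing when all four keys appear at least once.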
Large Repository Performance
Problem: Processing takes too long or uses too much memory
Solutions:
- Filter paths aggressively:
  git:
    large-repo:
      include_paths:
        - "docs/**"
        - "README.md"
      exclude_paths:
        - "node_modules/**"
        - "build/**"
        - "dist/**"
        - ".git/**"
- Limit file types:
  git:
    focused-repo:
      file_types:
        - "*.md"
        - "*.py"
        # Skip binary files, images, etc.
- Set file size limits:
  git:
    size-limited:
      max_file_size: 524288  # 512KB
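Before picking a size limit, it can help to see which files in a local clone would exceed it. A minimal sketch, assuming a clone already exists on disk; `oversized_files` is a hypothetical helper that relies on the standard `find -size` test with the `c` (bytes) suffix.

```bash
# Hypothetical helper: list files under DIR larger than LIMIT bytes,
# ignoring Git's own metadata directory.
# Usage: oversized_files DIR LIMIT
oversized_files() {
  find "$1" -type f ! -path '*/.git/*' -size +"$2"c
}
```

For instance, `oversized_files /tmp/test-repo 524288` lists everything a 512KB `max_file_size` would skip.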
File Processing Errors
Problem: Some files fail to process
Solutions:
- Check file size limits:
  git:
    repo-with-limits:
      max_file_size: 1048576  # 1MB
- Verify file types:
  git:
    text-only:
      file_types:
        - "*.md"
        - "*.txt"
        - "*.py"
        - "*.js"
        # Avoid binary files
- Check file paths:
  - Ensure include/exclude patterns are correct
  - Verify files exist in the specified paths
Debugging Commands
# Check Git configuration
git config --list
# Test repository access manually
git clone https://github.com/org/repo.git /tmp/test-repo
# Check file patterns locally
find /tmp/test-repo -name "*.py" | head -10
# Verify authentication
curl -H "Authorization: token $REPO_TOKEN" \
https://api.github.com/repos/org/repo
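To decide which `file_types` are worth listing, a count of files per extension in a local clone gives a quick overview. The sketch below is an illustration, not a QDrant Loader feature: `count_by_extension` is a hypothetical helper, and files without an extension are simply left out of the tally.

```bash
# Hypothetical helper: count files per extension under DIR, most common first.
count_by_extension() {
  find "$1" -type f ! -path '*/.git/*' |
    sed -n 's|.*/[^/]*\.\([^./]*\)$|\1|p' |   # keep only the final extension
    sort | uniq -c | sort -rn |
    awk '{print $1, $2}'                      # normalize spacing: "COUNT EXT"
}
```

Running `count_by_extension /tmp/test-repo` prints one line per extension, such as a count followed by `py` or `md`.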
Monitoring and Processing
Check Processing Status
# View project status
qdrant-loader --workspace . project status
# Check specific project
qdrant-loader --workspace . project status --project-id my-project
# List all projects
qdrant-loader --workspace . project list
Configuration Management
# View current configuration
qdrant-loader --workspace . config
# Validate all projects
qdrant-loader --workspace . project validate
Best Practices
Repository Organization
- Use specific branches - Process stable branches like `main` or `release`
- Filter aggressively - Only include files you need to search
- Set size limits - Avoid processing very large files
- Exclude build artifacts - Skip generated files and dependencies
Path Filtering
- Include specific paths:
  include_paths:
    - "docs/**"
    - "src/**"
    - "README.md"
- Exclude unnecessary paths:
  exclude_paths:
    - "node_modules/**"
    - "build/**"
    - "dist/**"
    - "__pycache__/**"
    - ".git/**"
File Type Selection
- Focus on text files:
  file_types:
    - "*.md"
    - "*.py"
    - "*.js"
    - "*.yaml"
    - "*.json"
- Avoid binary files - They don't provide searchable content
Performance Optimization
- Use shallow clones - Set `depth: 1` for faster cloning
- Limit file sizes - Set reasonable `max_file_size` limits
- Process incrementally - Run regular updates rather than full reprocessing
- Monitor resources - Watch memory and disk usage during processing
Security Considerations
- Use minimal permissions - Grant only necessary repository access
- Rotate tokens regularly - Update access tokens periodically
- Secure token storage - Store tokens in environment variables
- Audit access - Monitor which repositories are being accessed
- Use environment variables - Never commit tokens to version control
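One way to back up the "never commit tokens" rule is to scan a working tree for strings that look like GitHub tokens before pushing. This is a sketch under assumptions: `find_token_leaks` is a hypothetical helper, and the pattern assumes classic tokens are `ghp_` followed by 36 alphanumeric characters; patterns for other providers would need to be added separately.

```bash
# Hypothetical helper: list files under DIR containing a string shaped like
# a classic GitHub token (assumed format: ghp_ + 36 alphanumeric characters).
find_token_leaks() {
  grep -rlE 'ghp_[A-Za-z0-9]{36}' "$1"
}
```

A configuration that references `${REPO_TOKEN}` is not flagged, since the literal token never appears in the file.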
Related Documentation
- Configuration Reference - Complete configuration options
- File Conversion - Processing different file types found in repositories
- Troubleshooting - Common issues and solutions
- MCP Server - Using processed Git content with AI tools
- Project Management - Managing multiple projects
Ready to connect your Git repositories? Start with the basic configuration above and customize based on your specific needs and repository structure.