Start indexing a documentation website or any web content. Supports advanced crawling options such as URL patterns, content filtering, and llms.txt discovery.
API key must be provided in the Authorization header
URL to index (documentation or website)
"https://docs.example.com"
URL patterns to include in crawling (supports wildcards)
[
"https://docs.example.com/api/*",
"https://docs.example.com/guides/*"
]
URL patterns to exclude from crawling
["/blog/*", "/changelog/*"]
Optional project ID to associate with
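The include/exclude wildcard patterns above can be illustrated with simple shell-style glob matching. This is a sketch of one plausible interpretation (excludes matched against the URL path, includes against the full URL), not the service's actual matching rules:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

# Example patterns from this doc; matching semantics are an assumption.
INCLUDE = ["https://docs.example.com/api/*", "https://docs.example.com/guides/*"]
EXCLUDE = ["/blog/*", "/changelog/*"]

def should_crawl(url: str) -> bool:
    # Exclude patterns look path-shaped, so compare against the URL path.
    path = urlparse(url).path
    if any(fnmatch(path, pat) for pat in EXCLUDE):
        return False
    # Include patterns are full URLs with wildcards.
    return any(fnmatch(url, pat) for pat in INCLUDE)

print(should_crawl("https://docs.example.com/api/auth"))     # True
print(should_crawl("https://docs.example.com/blog/launch"))  # False
```

Under this reading, exclusions take precedence: a page matching an exclude pattern is skipped even if it also matches an include pattern.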
Maximum age of cached content in seconds (for fast scraping)
Content formats to return
["markdown", "html"]
Extract only main content (removes nav, ads, etc.)
Maximum number of pages to crawl
Maximum crawl depth
Whether to crawl the entire domain
Time to wait for page to load in milliseconds
Include full page screenshot
Check for llms.txt file for curated documentation URLs
How to use llms.txt if found:
prefer, only, ignore
Data source indexing started successfully
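Putting the request parameters above together, here is a sketch of one possible request body. The field names and the `llms_txt_mode` key are assumptions for illustration; this doc does not specify the exact JSON schema or endpoint path:

```python
import json

# Hypothetical request body combining the documented parameters.
# Key names are assumptions, not confirmed by this doc.
payload = {
    "url": "https://docs.example.com",
    "include_patterns": [
        "https://docs.example.com/api/*",
        "https://docs.example.com/guides/*",
    ],
    "exclude_patterns": ["/blog/*", "/changelog/*"],
    "formats": ["markdown", "html"],
    "only_main_content": True,   # strip nav, ads, etc.
    "max_pages": 200,            # maximum number of pages to crawl
    "max_depth": 3,              # maximum crawl depth
    "crawl_entire_domain": False,
    "wait_for": 2000,            # page-load wait, in milliseconds
    "screenshot": False,
    "llms_txt_mode": "prefer",   # one of: prefer, only, ignore
}
body = json.dumps(payload)
```

The API key would be sent in the Authorization header alongside this JSON body, as noted above.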
Unique identifier for the data source
The indexed URL
File name for text sources
Current indexing status
pending, processing, completed, failed, error
Number of pages indexed
Number of chunks/embeddings created
Associated project ID if any
web, documentation, research_paper, text
Custom display name for the data source
arXiv identifier when source_type is research_paper
Research paper source provider (e.g. arxiv)
Optional lightweight metadata. For research_paper sources this may include title, authors, and similar fields; for other source types it may be omitted.
Error message if status is 'error' or 'failed'
Error code for programmatic error handling
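A client consuming this response would branch on the status values documented above, falling back to the error fields when indexing fails. A sketch, with response key names assumed from the field descriptions rather than a confirmed schema:

```python
# Hypothetical handler for the documented status values:
# pending, processing, completed, failed, error.
def summarize(resp: dict) -> str:
    status = resp.get("status")
    if status in ("error", "failed"):
        # error_code supports programmatic handling; error_message is human-readable.
        return f"indexing failed ({resp.get('error_code')}): {resp.get('error_message')}"
    if status == "completed":
        return (f"indexed {resp.get('pages_indexed', 0)} pages, "
                f"{resp.get('chunks_created', 0)} chunks")
    return f"indexing {status}"  # pending or processing

print(summarize({"status": "completed", "pages_indexed": 42, "chunks_created": 310}))
# indexed 42 pages, 310 chunks
```

Since indexing is started asynchronously, a caller would typically poll until the status leaves pending/processing, then read the page and chunk counts or the error fields.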