POST /data-sources

Index a new data source
Example request:

curl --request POST \
  --url https://apigcp.trynia.ai/v2/data-sources \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "url": "https://docs.example.com",
  "url_patterns": [
    "https://docs.example.com/api/*",
    "https://docs.example.com/guides/*"
  ],
  "exclude_patterns": [
    "/blog/*",
    "/changelog/*"
  ],
  "project_id": "<string>",
  "max_age": 123,
  "formats": [
    "markdown",
    "html"
  ],
  "only_main_content": true,
  "limit": 10000,
  "max_depth": 20,
  "crawl_entire_domain": true,
  "wait_for": 2000,
  "include_screenshot": true,
  "check_llms_txt": true,
  "llms_txt_strategy": "prefer"
}'

Example response:

{
  "id": "<string>",
  "url": "<string>",
  "file_name": "<string>",
  "status": "pending",
  "created_at": "2023-11-07T05:31:56Z",
  "updated_at": "2023-11-07T05:31:56Z",
  "page_count": 0,
  "chunk_count": 0,
  "project_id": "<string>",
  "source_type": "web",
  "is_active": true,
  "display_name": "<string>",
  "error": "<string>",
  "error_code": "<string>"
}

Authorizations

Authorization
string
header
required

API key must be provided in the Authorization header as a Bearer token
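
For example, an authenticated request in Python (a minimal sketch using the requests library; the URL and header format are taken from the curl example above, and the API key value is a placeholder):

import requests

API_KEY = "your-api-key"  # placeholder -- use your actual API key

response = requests.post(
    "https://apigcp.trynia.ai/v2/data-sources",
    headers={
        # The API key goes in the Authorization header as a Bearer token.
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={"url": "https://docs.example.com"},
)
response.raise_for_status()
print(response.json()["id"])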

Body

application/json
url
string<uri>
required

URL to index (documentation or website)

Example:

"https://docs.example.com"

url_patterns
string[]

URL patterns to include in crawling (supports wildcards)

Example:
[
"https://docs.example.com/api/*",
"https://docs.example.com/guides/*"
]
exclude_patterns
string[]

URL patterns to exclude from crawling

Example:
["/blog/*", "/changelog/*"]
project_id
string

Optional project ID to associate the data source with

max_age
integer

Maximum age of cached content in seconds (for fast scraping)

formats
string[]

Content formats to return

Example:
["markdown", "html"]
only_main_content
boolean
default:true

Extract only main content (removes nav, ads, etc.)

limit
integer
default:10000

Maximum number of pages to crawl

max_depth
integer
default:20

Maximum crawl depth

crawl_entire_domain
boolean
default:true

Whether to crawl the entire domain

wait_for
integer
default:2000

Time to wait for each page to load, in milliseconds

include_screenshot
boolean
default:true

Include a full-page screenshot

check_llms_txt
boolean
default:true

Check for an llms.txt file that lists curated documentation URLs

llms_txt_strategy
enum<string>
default:prefer

How to use llms.txt if found:

  • prefer: Start with llms.txt URLs, then crawl additional pages if under limit
  • only: Only index URLs listed in llms.txt
  • ignore: Skip llms.txt check (traditional behavior)
Available options: prefer, only, ignore
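
For instance, to index only the pages curated by a site's llms.txt, a request could look like this (a sketch reusing the endpoint and auth shown above; the API key is a placeholder):

import requests

payload = {
    "url": "https://docs.example.com",
    "check_llms_txt": True,
    # "only" restricts indexing to URLs listed in llms.txt;
    # "prefer" (the default) starts there and keeps crawling while under the limit.
    "llms_txt_strategy": "only",
}
resp = requests.post(
    "https://apigcp.trynia.ai/v2/data-sources",
    headers={"Authorization": "Bearer your-api-key"},
    json=payload,
)
print(resp.json()["status"])  # "pending" while indexing starts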

Response

Data source indexing started successfully

id
string

Unique identifier for the data source

url
string

The indexed URL

file_name
string

File name for text sources

status
enum<string>

Current indexing status

Available options: pending, processing, completed, failed, error
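
Since indexing is asynchronous (the response comes back with status "pending"), a client typically polls until a terminal status is reached. A sketch, assuming a retrieval endpoint such as GET /v2/data-sources/{id} exists (that endpoint name is hypothetical and not documented in this section):

import time
import requests

BASE_URL = "https://apigcp.trynia.ai/v2"
HEADERS = {"Authorization": "Bearer your-api-key"}  # placeholder key

def wait_until_indexed(source_id: str, interval_seconds: float = 5.0) -> dict:
    # Poll the data source until it leaves the pending/processing states.
    while True:
        # Hypothetical retrieval endpoint -- not documented in this section.
        resp = requests.get(f"{BASE_URL}/data-sources/{source_id}", headers=HEADERS)
        resp.raise_for_status()
        source = resp.json()
        if source["status"] in ("completed", "failed", "error"):
            return source
        time.sleep(interval_seconds)
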
created_at
string<date-time>
updated_at
string<date-time>
page_count
integer
default:0

Number of pages indexed

chunk_count
integer
default:0

Number of chunks/embeddings created

project_id
string

Associated project ID, if any

source_type
enum<string>
default:web
Available options: web, text
is_active
boolean
default:true
display_name
string

Custom display name for the data source

error
string

Error message if status is 'error' or 'failed'

error_code
string

Error code for programmatic error handling
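
Putting the error fields together, a caller might branch on error_code and fall back to the human-readable error message (a sketch building on the polling example above; the specific error code value is hypothetical, for illustration only):

result = wait_until_indexed(source_id)  # from the polling sketch above
if result["status"] in ("failed", "error"):
    # error_code is intended for programmatic handling; error is human-readable.
    code = result.get("error_code")
    if code == "CRAWL_LIMIT_EXCEEDED":  # hypothetical value, for illustration only
        print("Retry with a higher limit or narrower url_patterns")
    else:
        print(f"Indexing failed ({code}): {result.get('error')}")
else:
    print(f"Indexed {result['page_count']} pages into {result['chunk_count']} chunks")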