Index a HuggingFace dataset for semantic search. The dataset is fetched from the HuggingFace Hub, text columns are extracted and chunked, and embeddings are created for semantic search.
Supports multiple input formats:
Large datasets are automatically sampled to manage storage and indexing time. Datasets are deduplicated globally: if another user has already indexed a dataset, you get instant access to the existing index.
An API key must be provided in the Authorization header.
HuggingFace dataset URL or identifier. Supports multiple formats:
"dair-ai/emotion"
Dataset configuration name (for multi-config datasets)
Add to global shared pool (default true). Set false for private indexing.
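The request parameters above could be assembled as in the following minimal sketch. The endpoint URL, the JSON field names (`url`, `config`, `add_as_global_source`), and the `Bearer` scheme are all assumptions made for illustration; only the meaning of each parameter comes from this reference. The function builds the request without sending it.

```python
import json
import urllib.request

# Hypothetical endpoint: this reference does not state the URL or path.
API_URL = "https://api.example.com/v1/huggingface-datasets"


def build_index_request(api_key, dataset, config=None, add_to_global_pool=True):
    """Build (but do not send) a POST request to index a HuggingFace dataset."""
    # Field names are assumed; check the real schema before use.
    body = {
        "url": dataset,  # dataset URL or identifier, e.g. "dair-ai/emotion"
        "add_as_global_source": add_to_global_pool,  # False for private indexing
    }
    if config is not None:
        body["config"] = config  # only needed for multi-config datasets
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # Bearer scheme is assumed
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_index_request("YOUR_API_KEY", "dair-ai/emotion")
```

Passing `req` to `urllib.request.urlopen` would send it once the real endpoint and field names are substituted.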
HuggingFace dataset indexing started or completed successfully
Unique identifier for the data source
Dataset identifier (e.g., "squad", "emotion")
Canonical HuggingFace dataset URL
Current indexing status
pending, processing, completed, failed, error
Dataset owner/organization
Dataset description
Available dataset splits (e.g., ["train", "test", "validation"])
Dataset columns with names and data types
Total number of rows in the dataset
Number of rows actually indexed (may differ due to sampling)
Number of text chunks created
Sampling strategy used (full or sampled)
Dataset license
Error message if status is 'failed'
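A client might interpret the response fields above roughly as follows. This is a sketch under stated assumptions: the key names (`status`, `total_rows`, `indexed_rows`, `error`) are hypothetical, since this reference gives only field descriptions, and treating `completed`, `failed`, and `error` as terminal states is an inference from the status list.

```python
# Assumed terminal states; "pending" and "processing" mean indexing is still running.
TERMINAL_STATUSES = {"completed", "failed", "error"}


def summarize_index_response(resp):
    """Reduce a response dict to the fields a polling client usually needs."""
    status = resp.get("status")
    summary = {"status": status, "done": status in TERMINAL_STATUSES}

    # Indexed rows may be fewer than total rows when the dataset was sampled.
    total = resp.get("total_rows")
    indexed = resp.get("indexed_rows")
    if total and indexed is not None:
        summary["indexed_fraction"] = indexed / total

    # The reference documents an error message only for the 'failed' status.
    if status == "failed":
        summary["error"] = resp.get("error")
    return summary


example = summarize_index_response(
    {"status": "completed", "total_rows": 1000, "indexed_rows": 250}
)
```

Here `example["done"]` is `True` and `example["indexed_fraction"]` is `0.25`, which would indicate a sampled index.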