Overview
Nia supports indexing HuggingFace datasets for semantic and agentic search. This enables your AI agents to query dataset contents, understand schema structures, and retrieve relevant rows using natural language.
Quick Start
Supported URL Formats
The `index` tool auto-detects HuggingFace dataset URLs.
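As a rough illustration of what URL auto-detection involves, the sketch below recognizes links of the form `https://huggingface.co/datasets/<owner>/<name>`, the shape used elsewhere on this page. The helper name `parse_hf_dataset_url` is hypothetical, and the rules it applies are an assumption for illustration only. Nia's actual detection logic may accept other formats.

```python
from urllib.parse import urlparse


def parse_hf_dataset_url(url):
    """Return the 'owner/name' dataset ID for a HuggingFace dataset URL, else None.

    Hypothetical helper: it illustrates the idea of URL detection and is
    not Nia's real implementation.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return None
    if parsed.netloc.lower() != "huggingface.co":
        return None
    # Assumed path shape: /datasets/<owner>/<name>[/anything-else]
    parts = [segment for segment in parsed.path.split("/") if segment]
    if len(parts) < 3 or parts[0] != "datasets":
        return None
    return f"{parts[1]}/{parts[2]}"


print(parse_hf_dataset_url("https://huggingface.co/datasets/openai/gsm8k"))
```

Anything that is not a dataset link on `huggingface.co`, such as a model page or a GitHub URL, returns `None` in this sketch.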
Indexing Strategy
Nia uses intelligent sampling based on dataset size to balance coverage and performance:

| Dataset Size | Strategy | Rows Indexed |
|---|---|---|
| < 200K rows | Full | All rows |
| 200K - 2M rows | Sampled | Up to 100K rows |
| > 2M rows | Sampled | Up to 25K rows |
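The tiers in the table can be expressed as a small lookup. This is a minimal sketch, not Nia's implementation: the function name `plan_indexing` is hypothetical, and the handling of the exact boundaries (200K falls in the middle tier, 2M is still in the middle tier) is one reading of the table.

```python
FULL_INDEX_LIMIT = 200_000   # below this, every row is indexed
MID_TIER_LIMIT = 2_000_000   # up to this, at most 100K rows are sampled


def plan_indexing(total_rows):
    """Return (strategy, max_rows_indexed) for a dataset of the given size.

    Hypothetical helper mirroring the table; boundary handling is assumed.
    """
    if total_rows < FULL_INDEX_LIMIT:
        return ("full", total_rows)
    if total_rows <= MID_TIER_LIMIT:
        return ("sampled", 100_000)
    return ("sampled", 25_000)


for size in (8_000, 500_000, 10_000_000):
    print(size, plan_indexing(size))
```

Note that the largest tier indexes fewer rows than the middle tier, so the sample becomes a smaller fraction of the dataset as it grows.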
Binary columns (images, audio, arrays) are automatically excluded. Only text-compatible columns (strings, numbers, booleans) are indexed for semantic search.
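To make the column rule concrete, the sketch below keeps only string, number, and boolean values from a row and joins them into a single text block. The function names and the `column: value` output format are assumptions made for illustration. The page does not specify how Nia formats rows internally.

```python
def is_text_compatible(value):
    """True for strings, numbers, and booleans; False for bytes, lists, dicts, None."""
    # bool is a subclass of int in Python; it is listed explicitly for clarity.
    return isinstance(value, (str, int, float, bool))


def row_to_text(row):
    """Render the text-compatible columns of a row as 'column: value' lines.

    Hypothetical formatting; binary and array columns are skipped.
    """
    lines = []
    for column, value in row.items():
        if is_text_compatible(value):
            lines.append(f"{column}: {value}")
    return "\n".join(lines)


example_row = {
    "question": "What is 2 + 2?",
    "answer": 4,
    "verified": True,
    "image": b"\x89PNG",        # binary column: skipped
    "embedding": [0.1, 0.2],    # array column: skipped
}
print(row_to_text(example_row))
```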
What Gets Indexed
For each dataset, Nia extracts and indexes:

- Row content: Text from all compatible columns, formatted and chunked for semantic retrieval
- Dataset metadata: Schema, splits, column types, row counts, license info
- Configuration info: Available configs and the selected configuration
MCP Tools
Indexing
Use the unified `index` tool.
Searching
Use `search` with the dataset as a data source.
Reading
Use `nia_read` to read dataset content.
Grep (Regex Search)
Use `nia_grep` to search with regex patterns.
Exploring
Use `nia_explore` to see dataset structure.
Managing Datasets
List Indexed Datasets
Check Status
Delete a Dataset
Global Source Deduplication
HuggingFace datasets participate in Nia’s global source pool:

- If someone has already indexed `openai/gsm8k`, you can subscribe instantly without re-indexing
- Use `manage_resource(action="subscribe", identifier="https://huggingface.co/datasets/openai/gsm8k")` to subscribe to an existing index
- Set `add_as_global_source=False` when indexing to keep datasets private
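For readers wiring these calls up by hand, the sketch below shows roughly what the two requests look like as MCP `tools/call` messages (JSON-RPC 2.0). An MCP client normally builds these for you. The `manage_resource` arguments come from the list above; the `url` argument name for the `index` tool is an assumption, since this page does not show that tool's parameters.

```python
import json


def tool_call_request(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


dataset_url = "https://huggingface.co/datasets/openai/gsm8k"

# Subscribe to an index that already exists in the global pool.
subscribe = tool_call_request(
    1,
    "manage_resource",
    {"action": "subscribe", "identifier": dataset_url},
)

# Index privately. NOTE: the "url" argument name is hypothetical.
private_index = tool_call_request(
    2,
    "index",
    {"url": dataset_url, "add_as_global_source": False},
)

print(json.dumps(subscribe, indent=2))
```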
API Endpoints
Index a Dataset
Response
Use Cases
Fine-tuning Data Discovery
Search through datasets to find relevant training examples for your specific use case.
Benchmark Analysis
Query benchmark datasets to understand evaluation metrics and test case distribution.
Data Augmentation
Find similar examples across multiple datasets to augment your training data.
Dataset Documentation
Let your agents understand dataset schemas and find specific examples on demand.
Limitations
- Binary data: Images, audio, and array columns are excluded from indexing
- Large datasets: Datasets with 200K or more rows are sampled rather than fully indexed (up to 100K rows, or up to 25K rows above 2M) to maintain performance
- Streaming: Datasets must support HuggingFace’s streaming mode
- Private datasets: Requires the `HF_TOKEN` environment variable for private dataset access
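Before indexing, it can help to confirm locally that a dataset opens in streaming mode with your token. The sketch below only assembles the arguments for the `load_dataset` function from the HuggingFace `datasets` library; the helper name `streaming_load_kwargs` is hypothetical, and the `token` parameter assumes a reasonably recent `datasets` release. This is a local pre-check, not a description of how Nia loads data.

```python
import os


def streaming_load_kwargs(dataset_id, config=None, split="train", env=None):
    """Build keyword arguments for datasets.load_dataset in streaming mode.

    Hypothetical helper. Reads HF_TOKEN from the environment (or from `env`
    when supplied) and passes it along only if it is set.
    """
    env = os.environ if env is None else env
    kwargs = {"path": dataset_id, "split": split, "streaming": True}
    if config is not None:
        kwargs["name"] = config
    token = env.get("HF_TOKEN")
    if token:
        kwargs["token"] = token
    return kwargs


print(streaming_load_kwargs("openai/gsm8k", config="main", env={}))

# Usage (requires `pip install datasets` and network access):
#   from datasets import load_dataset
#   ds = load_dataset(**streaming_load_kwargs("openai/gsm8k", config="main"))
#   first_row = next(iter(ds))  # raises if streaming is unsupported or access is denied
```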

