HuggingFace Datasets

Index a HuggingFace dataset

Indexes a HuggingFace dataset for semantic search. The dataset is fetched from the HuggingFace Hub, its text columns are extracted and chunked, and embeddings are created from them.

The dataset can be specified in any of three input formats:

Full URL: https://huggingface.co/datasets/squad
Owner/dataset format: dair-ai/emotion
Dataset name only: squad

Large datasets are automatically sampled to limit storage and indexing time. Datasets are deduplicated globally: if another user has already indexed a dataset, you get instant access to the existing index.
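The three input formats listed above all refer to a dataset on the HuggingFace Hub. The API accepts them directly, so no client-side conversion is required; the sketch below only illustrates how they relate to the canonical URL returned in the response. The helper name `to_canonical_url` is hypothetical, and the server's actual normalization logic is not documented here.

```python
from urllib.parse import urlparse

HF_DATASETS_PREFIX = "https://huggingface.co/datasets/"


def to_canonical_url(identifier: str) -> str:
    """Map a full URL, owner/dataset pair, or bare dataset name to a Hub URL.

    Hypothetical client-side helper; the API itself accepts all three formats.
    """
    value = identifier.strip().rstrip("/")
    if value.startswith(("http://", "https://")):
        parsed = urlparse(value)
        # Only URLs under huggingface.co/datasets/ are treated as valid here.
        if parsed.netloc != "huggingface.co" or not parsed.path.startswith("/datasets/"):
            raise ValueError(f"not a HuggingFace dataset URL: {identifier!r}")
        path = parsed.path[len("/datasets/"):]
    else:
        path = value
    parts = path.split("/")
    # Expect either "name" or "owner/name", with no empty segments.
    if not 1 <= len(parts) <= 2 or not all(parts):
        raise ValueError(f"unrecognized dataset identifier: {identifier!r}")
    return HF_DATASETS_PREFIX + "/".join(parts)
```

For example, `to_canonical_url("dair-ai/emotion")` yields the same URL that appears in the `url` field of the sample response.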

POST /v2/huggingface-datasets

curl --request POST \
  --url https://apigcp.trynia.ai/v2/huggingface-datasets \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "url": "dair-ai/emotion"
}
'

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "dataset_id": "emotion",
  "url": "https://huggingface.co/datasets/dair-ai/emotion",
  "status": "processing",
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:00Z",
  "owner": "dair-ai",
  "description": "Emotion is a dataset of English Twitter messages with six basic emotions.",
  "splits": [
    "train",
    "test",
    "validation"
  ],
  "columns": [
    {
      "name": "text",
      "dtype": "string"
    },
    {
      "name": "label",
      "dtype": "int64"
    }
  ],
  "row_count": 20000,
  "indexed_row_count": 0,
  "chunk_count": 0
}

Authorizations

Authorization

string

header

required

API key, sent as a Bearer token in the Authorization header (Authorization: Bearer <token>).

Body

application/json

url

string

required

HuggingFace dataset URL or identifier. Supports multiple formats:

Full URL: https://huggingface.co/datasets/squad
Owner/dataset: dair-ai/emotion
Dataset name: squad

Example:

"dair-ai/emotion"

config

string | null

Dataset configuration name (for multi-config datasets)

add_as_global_source

boolean

default:true

Whether to add the dataset to the global shared pool. Set to false for private indexing.
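The body parameters above can be assembled into a request as in the following minimal sketch, which uses only the Python standard library. The function name `build_index_request` is hypothetical. The sketch omits `config` when it is not set, on the assumption that the field is optional because it is typed as nullable and not marked required.

```python
import json
import urllib.request
from typing import Optional

API_URL = "https://apigcp.trynia.ai/v2/huggingface-datasets"


def build_index_request(
    token: str,
    url: str,
    config: Optional[str] = None,
    add_as_global_source: bool = True,
) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts indexing."""
    payload = {"url": url, "add_as_global_source": add_as_global_source}
    if config is not None:
        # Only needed for multi-config datasets.
        payload["config"] = config
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen` and decode the JSON body of the response. Passing `add_as_global_source=False` requests private indexing, as described above.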

Response

HuggingFace dataset indexing started or completed successfully

id

string

Unique identifier for the data source

dataset_id

string

Dataset identifier (e.g., "squad", "emotion")

url

string

Canonical HuggingFace dataset URL

status

enum<string>

Current indexing status

Available options: pending, processing, completed, failed, error

created_at

string<date-time>

updated_at

string<date-time>

owner

string | null

Dataset owner/organization

description

string | null

Dataset description

splits

string[]

Available dataset splits (e.g., ["train", "test", "validation"])

columns

object[]

Dataset columns with names and data types. Each object has a name and a dtype, e.g., {"name": "text", "dtype": "string"}.

row_count

integer

default:0

Total number of rows in the dataset

indexed_row_count

integer

default:0

Number of rows actually indexed (may differ due to sampling)

chunk_count

integer

default:0

Number of text chunks created

sampling_strategy

string | null

Sampling strategy used (full or sampled)

license

string | null

Dataset license

error

string | null

Error message if status is 'failed'
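The response fields above can be condensed into a small status summary, sketched below. The names `IndexSummary` and `summarize` are hypothetical, and treating `completed`, `failed`, and `error` as terminal states is an assumption based on the status names, not something this page states.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: no further indexing happens once one of these statuses is reached.
TERMINAL_STATUSES = {"completed", "failed", "error"}


@dataclass
class IndexSummary:
    status: str
    done: bool
    indexed_fraction: Optional[float]  # None when row_count is 0
    error: Optional[str]


def summarize(response: dict) -> IndexSummary:
    """Summarize an indexing response using the documented fields."""
    status = response["status"]
    row_count = response.get("row_count", 0)
    indexed = response.get("indexed_row_count", 0)
    # Avoid dividing by zero when the row count is not yet known.
    fraction = indexed / row_count if row_count else None
    return IndexSummary(
        status=status,
        done=status in TERMINAL_STATUSES,
        indexed_fraction=fraction,
        error=response.get("error"),
    )
```

Because large datasets are sampled, `indexed_fraction` may stay below 1.0 even after indexing finishes, so `status` rather than the fraction should be used to decide when the index is ready.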
