# Data Extraction

> Extract structured data from PDFs using JSON schemas, detect visual elements, and use engineering-specific extraction for technical documents

Extract structured records from any PDF — financial filings, invoices, spec sheets, engineering drawings — using custom JSON schemas, visual element detection, or purpose-built engineering extraction.

***

## Three Extraction Modes

<CardGroup cols={3}>
  <Card title="Table Extraction" icon="table">
    Define a JSON schema describing the fields you need, and Nia extracts structured records from any PDF. Ideal for financial data, line items, tabular content, and repeating structures.
  </Card>

  <Card title="Detect Extraction" icon="eye">
    Detect and locate visual elements — tables, figures, charts, diagrams — in PDF pages. Returns bounding boxes, classifications, and annotated page images.
  </Card>

  <Card title="Engineering Extraction" icon="compass-drafting">
    Purpose-built for technical documents — engineering drawings, P\&IDs, schematics, and spec sheets. Extracts structured metadata with optional follow-up queries for deeper analysis.
  </Card>
</CardGroup>

***

## How It Works

<Steps>
  <Step title="Submit a Document">
    Provide a PDF URL or an existing Nia source ID along with an optional page range. For table extraction, include a JSON schema defining the fields to extract.
  </Step>

  <Step title="Processing">
    Nia parses the document, identifies relevant structures, and produces output according to the mode: your schema (table mode), visual element detection (detect mode), or built-in engineering models (engineering mode).
  </Step>

  <Step title="Retrieve Results">
    Poll the extraction job until its status is `completed` or `failed`. Table extraction returns an array of structured records; detect extraction returns per-page elements with bounding boxes; engineering extraction returns a result object you can query further.
  </Step>
</Steps>
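
The steps above amount to a submit-then-poll loop. Here is a minimal Python sketch of the polling half, assuming the status values listed under Extraction Statuses. `fetch_job` is a hypothetical callable that performs the GET request and returns the parsed JSON; the interval and timeout defaults are arbitrary, not documented values.

```python
import time
from typing import Callable

# Statuses after which the job will not change again (see Extraction Statuses).
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(
    fetch_job: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_job() until the job is terminal or the timeout elapses."""
    waited = 0.0
    while True:
        job = fetch_job()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if waited >= timeout_s:
            raise TimeoutError(f"job still {job.get('status')!r} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

The caller still has to check whether the returned job is `completed` or `failed` before reading its results.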

***

## Table Extraction

Define a JSON schema and Nia returns structured records matching your specification. This is ideal for pulling repeating data out of dense documents like SEC filings, invoices, or product catalogs.

### Start an Extraction Job

```bash theme={null}
curl -X POST https://apigcp.trynia.ai/v2/extract \
  -H "Authorization: Bearer $NIA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.sec.gov/Archives/edgar/data/1326801/000132680124000006/meta-20231231.htm",
    "page_range": "60-80",
    "json_schema": {
      "type": "object",
      "properties": {
        "line_item": {
          "type": "string",
          "description": "Name of the financial line item"
        },
        "fiscal_year_2023": {
          "type": "number",
          "description": "Value for fiscal year 2023 in millions USD"
        },
        "fiscal_year_2022": {
          "type": "number",
          "description": "Value for fiscal year 2022 in millions USD"
        },
        "yoy_change_pct": {
          "type": "number",
          "description": "Year-over-year change as a percentage"
        }
      },
      "required": ["line_item", "fiscal_year_2023"]
    }
  }'
```

Response:

```json theme={null}
{
  "id": "ext_abc123",
  "status": "queued"
}
```
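
If you are calling the API from Python rather than curl, the same request could be assembled with the standard library. This is a sketch that mirrors the curl example above; `build_extract_request` is a hypothetical helper, not part of an official SDK, and it only constructs the request so you can inspect it before sending.

```python
import json
import urllib.request
from typing import Optional

API_BASE = "https://apigcp.trynia.ai/v2"


def build_extract_request(
    api_key: str,
    url: str,
    json_schema: dict,
    page_range: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) the POST /extract request from the curl example."""
    payload = {"url": url, "json_schema": json_schema}
    if page_range is not None:
        payload["page_range"] = page_range  # optional, e.g. "60-80"
    return urllib.request.Request(
        f"{API_BASE}/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it; the response body is the JSON shown above.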

### Check Extraction Status

```bash theme={null}
curl https://apigcp.trynia.ai/v2/extract/ext_abc123 \
  -H "Authorization: Bearer $NIA_API_KEY"
```

Response when completed:

```json theme={null}
{
  "id": "ext_abc123",
  "status": "completed",
  "progress": 100,
  "record_count": 24,
  "page_count": 21,
  "records": [
    {
      "line_item": "Total revenue",
      "fiscal_year_2023": 134902,
      "fiscal_year_2022": 116609,
      "yoy_change_pct": 15.69
    },
    {
      "line_item": "Cost of revenue",
      "fiscal_year_2023": 38019,
      "fiscal_year_2022": 25249,
      "yoy_change_pct": 50.57
    }
  ]
}
```
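
Because `records` is a flat array of objects, it maps directly onto tabular formats. The sketch below writes the records to CSV, assuming each record is a flat dictionary like the ones in the response above; `records_to_csv` is an illustrative name.

```python
import csv
import io


def records_to_csv(records: list) -> str:
    """Serialize extraction records to CSV, using the union of keys as columns."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen column order
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # fields missing from a record become empty cells
    return buffer.getvalue()
```

Optional schema fields that Nia could not fill simply show up as empty cells, so the CSV stays rectangular.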

### JSON Schema Tips

<Tip>
  **Use descriptions** — Add a `description` to each field in your schema. Nia uses these to understand what data to look for, especially when column headers in the PDF are ambiguous.
</Tip>

<Tip>
  **Narrow the page range** — If you know which pages contain the data, specify `page_range` to speed up extraction and improve accuracy.
</Tip>
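
One way to apply the first tip consistently is to generate the schema from a field list and reject any field that lacks a description. `build_schema` below is a hypothetical helper written for illustration; the output has the same shape as the `json_schema` in the request example.

```python
def build_schema(fields: list, required: list) -> dict:
    """Build an object schema from (name, json_type, description) tuples."""
    properties = {}
    for name, json_type, description in fields:
        if not description.strip():
            raise ValueError(f"field {name!r} needs a description")
        properties[name] = {"type": json_type, "description": description}
    unknown = [name for name in required if name not in properties]
    if unknown:
        raise ValueError(f"required fields not defined: {unknown}")
    return {"type": "object", "properties": properties, "required": list(required)}


schema = build_schema(
    [
        ("line_item", "string", "Name of the financial line item"),
        ("fiscal_year_2023", "number", "Value for fiscal year 2023 in millions USD"),
    ],
    required=["line_item", "fiscal_year_2023"],
)
```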

***

## Detect Extraction

Detect and locate visual elements within PDF pages — tables, figures, charts, and diagrams. Detect mode returns bounding boxes and classifications for each element found, and can render annotated page images with the detections overlaid.

### Start a Detect Extraction Job

```bash theme={null}
curl -X POST https://apigcp.trynia.ai/v2/extract/detect \
  -H "Authorization: Bearer $NIA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/annual-report-2024.pdf",
    "page_range": "1-10",
    "include_symbols": false
  }'
```

| Parameter         | Description                                                             |
| ----------------- | ----------------------------------------------------------------------- |
| `url`             | URL of the PDF to process (provide either `url` or `source_id`)         |
| `source_id`       | Source ID of an already-indexed document                                |
| `page_range`      | Pages to process (e.g. `"1-10"`, `"5,8,12"`)                            |
| `include_symbols` | Enable symbol-level detection for technical documents (default `false`) |
| `filter_pattern`  | Regex to filter detected element types                                  |

Response:

```json theme={null}
{
  "id": "det_abc123",
  "status": "queued",
  "type": "detect"
}
```
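
The parameter table above can be turned into a small client-side check before submitting. This sketch assumes that "either `url` or `source_id`" means exactly one of the two, and it validates `filter_pattern` with Python's `re` module, which may not match the regex dialect the server uses. `build_detect_payload` is a hypothetical helper.

```python
import re
from typing import Optional


def build_detect_payload(
    url: Optional[str] = None,
    source_id: Optional[str] = None,
    page_range: Optional[str] = None,
    include_symbols: bool = False,
    filter_pattern: Optional[str] = None,
) -> dict:
    """Assemble the JSON body for a detect job, omitting unset optional fields."""
    if (url is None) == (source_id is None):
        # Assumption: the API expects one document reference, not both.
        raise ValueError("provide exactly one of url or source_id")
    payload = {"url": url} if url is not None else {"source_id": source_id}
    payload["include_symbols"] = include_symbols
    if page_range is not None:
        payload["page_range"] = page_range
    if filter_pattern is not None:
        re.compile(filter_pattern)  # raises re.error on a malformed pattern
        payload["filter_pattern"] = filter_pattern
    return payload
```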

### Check Detect Extraction Status

```bash theme={null}
curl https://apigcp.trynia.ai/v2/extract/detect/det_abc123 \
  -H "Authorization: Bearer $NIA_API_KEY"
```

Response when completed:

```json theme={null}
{
  "id": "det_abc123",
  "status": "completed",
  "progress": 100,
  "type": "detect",
  "page_count": 10,
  "result": {
    "pages": [
      {
        "page_number": 1,
        "elements": [
          {
            "type": "table",
            "bbox": [72, 200, 540, 450],
            "confidence": 0.97
          },
          {
            "type": "figure",
            "bbox": [72, 500, 400, 700],
            "confidence": 0.93
          }
        ]
      }
    ]
  }
}
```
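
Once a detect job completes, the `result.pages[].elements[]` structure can be flattened and filtered. Here is a minimal sketch, assuming the response shape shown above; `find_elements` and the 0.9 default threshold are illustrative choices, not documented values.

```python
def find_elements(result: dict, element_type: str, min_confidence: float = 0.9) -> list:
    """Return matching elements, each annotated with the page it was found on."""
    matches = []
    for page in result.get("pages", []):
        for element in page.get("elements", []):
            if element["type"] == element_type and element["confidence"] >= min_confidence:
                matches.append({"page_number": page["page_number"], **element})
    return matches
```

Called with the `result` object from the response above and `"table"`, this returns the single table on page 1 together with its `bbox` and `confidence`.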

### Get Annotated Page Image

Retrieve a page image with bounding boxes drawn over detected elements:

```bash theme={null}
curl https://apigcp.trynia.ai/v2/extract/detect/det_abc123/page/1/image \
  -H "Authorization: Bearer $NIA_API_KEY" \
  --output page-1-annotated.png
```

This returns a PNG image with detection bounding boxes overlaid on the original page.
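
When downloading the image from a script, it can help to confirm that the body really is a PNG before writing it to disk, since an error response would otherwise be saved under a `.png` name. The sketch below builds the URL from the curl example and checks the standard eight-byte PNG signature; both helper names are hypothetical.

```python
# Every PNG file starts with this fixed eight-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def annotated_image_url(job_id: str, page_number: int) -> str:
    """URL of the annotated image for one page of a detect job."""
    return f"https://apigcp.trynia.ai/v2/extract/detect/{job_id}/page/{page_number}/image"


def looks_like_png(data: bytes) -> bool:
    """True if the downloaded bytes begin with the PNG signature."""
    return data.startswith(PNG_SIGNATURE)
```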

<Tip>
  **Use detect before table extraction** — Run detect first to identify which pages contain tables, then target those specific pages with table extraction for faster, more accurate results.
</Tip>
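
The detect-then-extract workflow in the tip above can be sketched as a function that turns a detect result into a `page_range` string for the table extraction request. It assumes the detect response shape shown earlier and emits the comma-separated form (`"5,8,12"`) from the parameter table; `table_page_range` and its 0.5 threshold are illustrative.

```python
def table_page_range(result: dict, min_confidence: float = 0.5) -> str:
    """Return a comma-separated page_range covering pages with a detected table."""
    pages = sorted({
        page["page_number"]
        for page in result.get("pages", [])
        if any(
            element["type"] == "table" and element["confidence"] >= min_confidence
            for element in page.get("elements", [])
        )
    })
    return ",".join(str(number) for number in pages)
```

An empty string means no tables were detected, in which case there is nothing to submit for table extraction.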

***

## Engineering Extraction

Extract structured information from technical documents — engineering drawings, P\&IDs, schematics, datasheets, and construction specifications. Engineering mode uses specialized models tuned for technical content.

### Start an Engineering Extraction

```bash theme={null}
curl -X POST https://apigcp.trynia.ai/v2/extract/engineering \
  -H "Authorization: Bearer $NIA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/piping-diagram-rev3.pdf",
    "page_range": "1-5",
    "accuracy_mode": "precise"
  }'
```

The `accuracy_mode` parameter controls the speed/accuracy tradeoff:

| Mode      | Description                                                               |
| --------- | ------------------------------------------------------------------------- |
| `fast`    | Optimized for speed. Good for initial scans and high-volume processing.   |
| `precise` | Maximum accuracy. Best for critical documents where every detail matters. |

Response:

```json theme={null}
{
  "id": "eng_xyz789",
  "status": "queued"
}
```
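
Since `accuracy_mode` accepts only the two values in the table above, a typo is easy to catch before the request is sent. In this sketch `build_engineering_payload` is a hypothetical helper; it leaves `accuracy_mode` out of the body when unset, because the server-side default is not documented here.

```python
from typing import Optional

VALID_ACCURACY_MODES = ("fast", "precise")


def build_engineering_payload(
    url: str,
    accuracy_mode: Optional[str] = None,
    page_range: Optional[str] = None,
) -> dict:
    """Assemble the JSON body for an engineering extraction job."""
    payload = {"url": url}
    if accuracy_mode is not None:
        if accuracy_mode not in VALID_ACCURACY_MODES:
            raise ValueError(f"accuracy_mode must be one of {VALID_ACCURACY_MODES}")
        payload["accuracy_mode"] = accuracy_mode
    if page_range is not None:
        payload["page_range"] = page_range
    return payload
```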

### Check Engineering Extraction Status

```bash theme={null}
curl https://apigcp.trynia.ai/v2/extract/engineering/eng_xyz789 \
  -H "Authorization: Bearer $NIA_API_KEY"
```

Response when completed:

```json theme={null}
{
  "id": "eng_xyz789",
  "status": "completed",
  "result": {
    "document_type": "P&ID",
    "title": "Process Flow - Unit 400 Cooling System",
    "revision": "Rev 3",
    "components": [
      {
        "tag": "P-401A",
        "type": "Centrifugal Pump",
        "specifications": "250 GPM, 150 PSI"
      },
      {
        "tag": "HX-402",
        "type": "Shell and Tube Heat Exchanger",
        "specifications": "500 sq ft, 150 PSI design"
      }
    ]
  }
}
```
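
The `components` array is convenient to regroup for review. As a sketch, assuming the `result` shape shown above, the following collects component tags by equipment type; `tags_by_type` is an illustrative name.

```python
from collections import defaultdict


def tags_by_type(result: dict) -> dict:
    """Map each component type to the list of tags that have that type."""
    grouped = defaultdict(list)
    for component in result.get("components", []):
        grouped[component["type"]].append(component["tag"])
    return dict(grouped)
```

For the response above this yields one entry per type: `P-401A` under "Centrifugal Pump" and `HX-402` under "Shell and Tube Heat Exchanger".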

### Follow-Up Queries

After an engineering extraction completes, you can ask follow-up questions about the results without re-processing the document:

```bash theme={null}
curl -X POST https://apigcp.trynia.ai/v2/extract/engineering/eng_xyz789/query \
  -H "Authorization: Bearer $NIA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What is the design pressure rating for all heat exchangers in this diagram?"
  }'
```

Response:

```json theme={null}
{
  "id": "eng_xyz789",
  "chat_messages": [
    {
      "role": "user",
      "content": "What is the design pressure rating for all heat exchangers in this diagram?"
    },
    {
      "role": "assistant",
      "content": "Based on the extraction results, there is one heat exchanger in this diagram:\n\n- **HX-402** (Shell and Tube Heat Exchanger): Design pressure is **150 PSI**."
    }
  ]
}
```

<Info>
  Follow-up queries use the already-extracted context, so they are fast and do not consume additional extraction credits.
</Info>
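
The query response carries the conversation in `chat_messages`, so the newest answer is the last message with the `assistant` role. A small sketch, assuming the response shape shown above (`latest_answer` is a hypothetical helper):

```python
from typing import Optional


def latest_answer(response: dict) -> Optional[str]:
    """Return the content of the most recent assistant message, if any."""
    for message in reversed(response.get("chat_messages", [])):
        if message.get("role") == "assistant":
            return message.get("content")
    return None
```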

***

## List All Extractions

Retrieve all your extraction jobs, optionally filtered by type:

```bash theme={null}
# List all extractions
curl https://apigcp.trynia.ai/v2/extractions \
  -H "Authorization: Bearer $NIA_API_KEY"

# Filter by type
curl "https://apigcp.trynia.ai/v2/extractions?type=table" \
  -H "Authorization: Bearer $NIA_API_KEY"

curl "https://apigcp.trynia.ai/v2/extractions?type=engineering" \
  -H "Authorization: Bearer $NIA_API_KEY"
```
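
In a script, the optional `type` filter is best added through a URL encoder rather than string concatenation. A sketch using the endpoint from the curl examples above; `extractions_url` is an illustrative name.

```python
from typing import Optional
from urllib.parse import urlencode

EXTRACTIONS_ENDPOINT = "https://apigcp.trynia.ai/v2/extractions"


def extractions_url(extraction_type: Optional[str] = None) -> str:
    """Build the list URL, adding ?type=... only when a filter is given."""
    if extraction_type is None:
        return EXTRACTIONS_ENDPOINT
    return f"{EXTRACTIONS_ENDPOINT}?{urlencode({'type': extraction_type})}"
```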

***

## Extraction Statuses

Table, detect, and engineering extractions follow the same status lifecycle:

| Status       | Description                                              |
| ------------ | -------------------------------------------------------- |
| `queued`     | Job received and waiting to be processed                 |
| `processing` | Extraction is actively running                           |
| `completed`  | Extraction finished successfully — results are available |
| `failed`     | Extraction encountered an error                          |

***

## Use Cases

<CardGroup cols={2}>
  <Card title="Financial Analysis" icon="chart-line">
    Extract line items, revenue figures, and balance sheet data from SEC filings (10-K, 10-Q) into structured records for analysis and comparison.
  </Card>

  <Card title="Engineering Review" icon="gear">
    Parse P\&IDs, wiring diagrams, and spec sheets to catalog components, materials, and specifications. Ask follow-up questions about extracted details.
  </Card>

  <Card title="Invoice Processing" icon="file-invoice-dollar">
    Pull vendor names, line items, quantities, and totals from invoices using a custom JSON schema tailored to your format.
  </Card>

  <Card title="Technical Due Diligence" icon="magnifying-glass-chart">
    Extract equipment lists, compliance data, and specifications from engineering documents during M\&A or audits.
  </Card>
</CardGroup>
