Skills document-extraction-api

Extract structured data from documents using AI-powered field extraction.

install
source · Clone the upstream repo
git clone https://github.com/iterationlayer/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/iterationlayer/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/document-extraction-api" ~/.claude/skills/iterationlayer-skills-document-extraction-api && rm -rf "$T"
manifest: skills/document-extraction-api/SKILL.md

Document Extraction API

Cost: 1 credit per page

Prerequisites

You need an Iteration Layer API key. Get one at platform.iterationlayer.com — free trial credits included, no credit card required.

For full integration guidance (SDKs, auth, MCP, error handling), see the Iteration Layer Integration Guide.

API Reference

Extract structured data from any document with a single API call. Send one or more files and a schema defining the fields you need, and receive typed, validated results with confidence scores and source citations.

Key Features

  • Multi-Format Support — Extract from 40+ file formats: PDF, Office documents (DOCX, PPTX, ODT, ODS, XLSX), EPUB, LaTeX, email (EML), Jupyter notebooks, images, and text/markup formats.
  • Rich Field Types — Primitive types (text, number, date, boolean, email, enum) plus purpose-built validated types:
    IBAN (validated against the standard format), ADDRESS (returns a structured object with street, city, region, postal code, and country), CURRENCY_CODE (normalized to ISO 4217), CURRENCY_AMOUNT (numeric monetary value), and COUNTRY (normalized to ISO 3166-1 alpha-2). These validated types give you clean, usable data rather than raw strings you have to parse yourself.
  • Structured Arrays — Extract repeating data like invoice line items with nested schemas.
  • Calculated Fields — Define arithmetic operations (sum, subtract, multiply, divide) computed from other extracted fields.
  • Confidence Scores — Every extracted value includes a confidence score between 0 and 1.
  • Source Citations — Verbatim quotes from the document that support each extracted value.
  • Schema Validation — Field schemas are validated before extraction, catching errors like circular dependencies or type mismatches early.

Overview

The Document Extraction API analyzes documents and extracts structured data based on a schema you define. You send one or more files (base64 or URL) and a schema with field definitions, and receive a JSON response with typed values, confidence scores, and citations.

Endpoint:

POST /document-extraction/v1/extract

Limits:

  • Max files per request: 20
  • Max file size: 50 MB per file
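
The limits above can be checked on the client before anything is uploaded. A minimal sketch, assuming "50 MB" means 50 × 1024 × 1024 bytes and applies to the raw (not base64-encoded) file size; the source does not specify either point, and `check_limits` is a hypothetical helper, not part of any SDK.

```python
MAX_FILES = 20                      # documented: max files per request
MAX_FILE_BYTES = 50 * 1024 * 1024   # assumption: "50 MB" read as 50 MiB


def check_limits(file_sizes: dict[str, int]) -> list[str]:
    """Return human-readable problems; an empty list means the batch fits."""
    problems = []
    if len(file_sizes) > MAX_FILES:
        problems.append(f"{len(file_sizes)} files exceeds the limit of {MAX_FILES}")
    for name, size in file_sizes.items():
        if size > MAX_FILE_BYTES:
            problems.append(f"{name} is {size} bytes, over the per-file limit")
    return problems


# Sizes in bytes, keyed by filename (illustrative values).
print(check_limits({"invoice.pdf": 120_000, "scan.tiff": 60 * 1024 * 1024}))
```

Failing early like this avoids spending an upload on a request the API would reject with a 400.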

Supported File Formats

  • Documents: PDF, DOCX, PPTX, ODT, EPUB, RTF
  • Spreadsheets: XLSX, XLS, ODS, CSV, TSV
  • Email: EML, MSG (headers, body, and attachment extraction — attachments are ingested and included in extraction context)
  • Notebooks: Jupyter (.ipynb)
  • Academic & Publishing: LaTeX (.tex, .latex), BibTeX (.bib), Typst (.typst, .typ)
  • Markup & Text: HTML, Markdown, JSON, XML, YAML, TOML, RST, Org, Djot, MDX, TXT
  • Images: PNG, JPEG, GIF, WebP, AVIF, HEIF, BMP, TIFF, JP2, PNM/PBM/PGM/PPM, SVG

How It Works

Every extraction runs the same pipeline:

  1. Validate — the schema is checked before any files are touched. Three things are validated: CALCULATED source field references (each must exist in the schema and be a numeric type), circular dependencies between CALCULATED fields, and default values matching their field type. The first failure stops validation with a descriptive error. No LLM call is made if the schema is invalid.
  2. Ingest — files are converted to a normalized text representation.
  3. Extract — each field is extracted according to its type and configuration. All non-CALCULATED fields are handled together, with all source files available during extraction. Every extracted value is tagged with the file it came from.
  4. Calculate — CALCULATED fields are computed from the extraction results. Operations are pure arithmetic:
    sum, subtract, multiply, divide. The confidence of a CALCULATED field is the minimum confidence of its source fields. Division by zero returns 0. The result is deterministic: if the source values are correct, the computed value is exact.
  5. Consolidate — default values are applied to fields that were not extracted, and required field constraints are checked.
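
The Validate step can be mirrored locally to catch schema mistakes before spending a request. The sketch below covers two of the three documented checks (CALCULATED source references and circular dependencies) and stops at the first failure; default-value type checks are left out. `validate_calculated_fields` is a hypothetical helper, and the server's actual error messages are not known from the source.

```python
NUMERIC_TYPES = {"INTEGER", "DECIMAL", "CURRENCY_AMOUNT", "CALCULATED"}


def validate_calculated_fields(fields: list[dict]) -> None:
    """Raise ValueError on the first problem found (fail-fast, as documented)."""
    by_name = {f["name"]: f for f in fields}

    # 1. Every CALCULATED source reference must exist and be a numeric type.
    for field in fields:
        if field["type"] != "CALCULATED":
            continue
        for source in field["source_field_names"]:
            if source not in by_name:
                raise ValueError(f"{field['name']}: unknown source field {source!r}")
            if by_name[source]["type"] not in NUMERIC_TYPES:
                raise ValueError(f"{field['name']}: source {source!r} is not numeric")

    # 2. Depth-first search along CALCULATED -> source edges to find cycles.
    state: dict[str, str] = {}  # field name -> "visiting" or "done"

    def visit(name: str) -> None:
        if state.get(name) == "done":
            return
        if state.get(name) == "visiting":
            raise ValueError(f"circular dependency involving {name!r}")
        state[name] = "visiting"
        field = by_name[name]
        if field["type"] == "CALCULATED":
            for source in field["source_field_names"]:
                visit(source)
        state[name] = "done"

    for field in fields:
        visit(field["name"])


schema_fields = [
    {"name": "total_amount", "type": "CURRENCY_AMOUNT"},
    {"name": "net_amount", "type": "CURRENCY_AMOUNT"},
    {"name": "tax_amount", "type": "CALCULATED", "operation": "subtract",
     "source_field_names": ["total_amount", "net_amount"]},
]
validate_calculated_fields(schema_fields)  # passes silently
```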

Intelligent extraction — the API automatically selects the best extraction approach based on the complexity of your schema and the nature of the documents. You don't configure this.

Request Format

<!-- tabs -->
cURL:

curl -X POST \
  https://api.iterationlayer.com/document-extraction/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "files": [
      {
        "type": "base64",
        "name": "invoice.pdf",
        "base64": "<base64-encoded-file>"
      }
    ],
    "schema": {
      "fields": [
        {
          "name": "invoice_number",
          "type": "TEXT",
          "description": "The invoice number"
        },
        {
          "name": "total_amount",
          "type": "CURRENCY_AMOUNT",
          "description": "Total invoice amount"
        }
      ]
    }
  }'

TypeScript:

import { IterationLayer } from "iterationlayer";
const client = new IterationLayer({
  apiKey: "YOUR_API_KEY",
});

const result = await client.extract({
  files: [
    {
      type: "base64",
      name: "invoice.pdf",
      base64: "<base64-encoded-file>",
    },
  ],
  schema: {
    fields: [
      {
        name: "invoice_number",
        type: "TEXT",
        description: "The invoice number",
      },
      {
        name: "total_amount",
        type: "CURRENCY_AMOUNT",
        description: "Total invoice amount",
      },
    ],
  },
});

Python:

from iterationlayer import IterationLayer
client = IterationLayer(api_key="YOUR_API_KEY")

result = client.extract(
    files=[
        {
            "type": "base64",
            "name": "invoice.pdf",
            "base64": "<base64-encoded-file>",
        }
    ],
    schema={
        "fields": [
            {
                "name": "invoice_number",
                "type": "TEXT",
                "description": "The invoice number",
            },
            {
                "name": "total_amount",
                "type": "CURRENCY_AMOUNT",
                "description": "Total invoice amount",
            },
        ]
    },
)

Go:

import il "github.com/iterationlayer/sdk-go"
client := il.NewClient("YOUR_API_KEY")

result, err := client.Extract(il.ExtractRequest{
	Files: []il.FileInput{
		il.NewFileFromBase64(
			"invoice.pdf",
			"<base64-encoded-file>",
		),
	},
	Schema: il.ExtractionSchema{
		"invoice_number": il.NewTextFieldConfig(
			"invoice_number",
			"The invoice number",
		),
		"total_amount": il.NewCurrencyAmountFieldConfig(
			"total_amount",
			"Total invoice amount",
		),
	},
})
<!-- response -->
{
  "success": true,
  "data": {
    "invoice_number": {
      "type": "TEXT",
      "value": "INV-2024-001",
      "confidence": 0.98,
      "citations": ["Invoice No: INV-2024-001"],
      "source": "invoice.pdf"
    },
    "total_amount": {
      "type": "CURRENCY_AMOUNT",
      "value": 1250.00,
      "confidence": 0.95,
      "citations": ["Total: €1,250.00"],
      "source": "invoice.pdf"
    }
  }
}
<!-- /tabs -->
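
Where no SDK is available, the cURL call can be reproduced with the Python standard library alone. A minimal sketch: the URL, headers, and body mirror the cURL example, `build_extract_request` is a hypothetical helper name, and the request is only constructed here, not sent.

```python
import json
import urllib.request

API_URL = "https://api.iterationlayer.com/document-extraction/v1/extract"


def build_extract_request(
    api_key: str, files: list[dict], fields: list[dict]
) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the cURL example."""
    body = {"files": files, "schema": {"fields": fields}}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_extract_request(
    "YOUR_API_KEY",
    files=[{"type": "base64", "name": "invoice.pdf", "base64": "<base64-encoded-file>"}],
    fields=[{"name": "invoice_number", "type": "TEXT", "description": "The invoice number"}],
)
# To send it: json.load(urllib.request.urlopen(request))
```

Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so the error body has to be read from the exception.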

Top-Level Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| files | array | Yes | List of file inputs to extract from (see File Input below) |
| schema | object | Yes | Extraction schema defining the fields to extract |
| webhook_url | string | No | HTTPS URL to receive results asynchronously. If provided, returns 201 immediately. See Webhooks. |

Async Mode

Add a webhook_url parameter to process the request in the background. The API returns 201 Accepted immediately and delivers the result to your webhook URL when processing completes. See Webhooks for payload format and retry behavior.
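
Switching a request to async mode only changes the body. The sketch below adds webhook_url to an existing request body and rejects non-HTTPS URLs up front, since the field is documented as an HTTPS URL. `with_webhook` and the example hook URL are hypothetical.

```python
import json
from urllib.parse import urlparse


def with_webhook(body: dict, webhook_url: str) -> dict:
    """Return a copy of the request body with webhook_url set (HTTPS only)."""
    if urlparse(webhook_url).scheme != "https":
        raise ValueError("webhook_url must be an HTTPS URL")
    return {**body, "webhook_url": webhook_url}


body = {
    "files": [
        {"type": "url", "name": "invoice.pdf", "url": "https://example.com/invoice.pdf"}
    ],
    "schema": {
        "fields": [
            {"name": "invoice_number", "type": "TEXT", "description": "The invoice number"}
        ]
    },
}
# Hypothetical receiver URL; replace with your own endpoint.
async_body = with_webhook(body, "https://example.com/hooks/extraction")
print(json.dumps(async_body, indent=2))
```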

File Input

Provide each file as either base64 or a URL:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | "base64" or "url" |
| name | string | Yes | Filename with extension (e.g., invoice.pdf) |
| base64 | string | If type is base64 | Base64-encoded file content |
| url | string | If type is url | Publicly accessible URL to fetch the file from |

URL Example

{
  "files": [
    {
      "type": "url",
      "name": "invoice.pdf",
      "url": "https://example.com/invoice.pdf"
    }
  ],
  "schema": {
    "fields": [
      {
        "name": "invoice_number",
        "type": "TEXT",
        "description": "The invoice number"
      }
    ]
  }
}
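
Both file input shapes can be produced by small helpers. A sketch using only the standard library; `file_from_bytes` and `file_from_url` are hypothetical names, and the dictionaries follow the File Input table.

```python
import base64


def file_from_bytes(name: str, content: bytes) -> dict:
    """Build a base64 file input from raw bytes."""
    return {
        "type": "base64",
        "name": name,
        "base64": base64.b64encode(content).decode("ascii"),
    }


def file_from_url(name: str, url: str) -> dict:
    """Build a URL file input; the URL must be publicly reachable."""
    return {"type": "url", "name": name, "url": url}


files = [
    file_from_bytes("note.txt", b"hi"),
    file_from_url("invoice.pdf", "https://example.com/invoice.pdf"),
]
```

In practice the bytes would come from reading a file in binary mode, for example `open(path, "rb").read()`.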

Schema Definition

The schema object contains a fields array. Each field has these base properties:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Unique field identifier (used as key in the response) |
| type | string | Yes | One of the supported field types (see Field Types below) |
| description | string | Yes | Natural language description of what to extract |
| is_required | boolean | No | If true, returns an error when the field cannot be extracted and has no default. Default: false |

Each field type supports additional type-specific properties described below.

Field Types

TEXT

Short single-line text value.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| max_length | integer | No | Maximum character length (> 0) |
| default_value | string | No | Default value if not found |

{
  "name": "company_name",
  "type": "TEXT",
  "description": "Name of the company"
}

TEXTAREA

Multi-line text value.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| max_length | integer | No | Maximum character length (> 0) |
| default_value | string | No | Default value if not found |

{
  "name": "notes",
  "type": "TEXTAREA",
  "description": "Additional notes or comments"
}

INTEGER

Whole number value.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| min | integer | No | Minimum value (inclusive) |
| max | integer | No | Maximum value (inclusive) |
| unit | string | No | Unit label (e.g., "kg", "items") |
| default_value | integer | No | Default value if not found |

{
  "name": "quantity",
  "type": "INTEGER",
  "description": "Number of items ordered",
  "min": 1
}

DECIMAL

Floating-point number value.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| min | float | No | Minimum value (inclusive) |
| max | float | No | Maximum value (inclusive) |
| decimal_points | integer | No | Number of decimal places to round to (>= 0) |
| unit | string | No | Unit label |
| default_value | float | No | Default value if not found |

{
  "name": "weight",
  "type": "DECIMAL",
  "description": "Package weight",
  "unit": "kg",
  "decimal_points": 2
}

DATE

Calendar date, extracted as an ISO 8601 string (YYYY-MM-DD).

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| allow_future_dates | boolean | No | Whether to allow dates in the future |
| allow_past_dates | boolean | No | Whether to allow dates in the past |

{
  "name": "invoice_date",
  "type": "DATE",
  "description": "Date the invoice was issued"
}

DATETIME

Date and time, extracted as an ISO 8601 datetime string.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| allow_future_dates | boolean | No | Whether to allow dates in the future |
| allow_past_dates | boolean | No | Whether to allow dates in the past |

{
  "name": "timestamp",
  "type": "DATETIME",
  "description": "Transaction timestamp"
}

TIME

Time value (e.g., "14:30:00"). No additional parameters.

{
  "name": "delivery_time",
  "type": "TIME",
  "description": "Scheduled delivery time"
}

ENUM

One or more values from a predefined list. Extracted as a string array.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| values | string[] | Yes | Allowed options |
| min_selected | integer | No | Minimum number of selected values (>= 0) |
| max_selected | integer | No | Maximum number of selected values (> 0) |
| default_value | string[] | No | Default selected values |

{
  "name": "payment_method",
  "type": "ENUM",
  "description": "How the invoice was paid",
  "values": ["bank_transfer", "credit_card", "cash", "paypal"],
  "max_selected": 1
}
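
Because an ENUM value comes back as a string array, a caller may want to re-check it against the field definition before using it. A sketch of such a client-side check; `check_enum_selection` is a hypothetical helper, and how the API itself enforces these constraints is not described in the source.

```python
def check_enum_selection(field: dict, selected: list[str]) -> list[str]:
    """Return problems with an ENUM value relative to its field definition."""
    problems = []
    unknown = [value for value in selected if value not in field["values"]]
    if unknown:
        problems.append(f"values not in the allowed list: {unknown}")
    if len(selected) < field.get("min_selected", 0):
        problems.append("too few values selected")
    max_selected = field.get("max_selected")
    if max_selected is not None and len(selected) > max_selected:
        problems.append("too many values selected")
    return problems


payment_method = {
    "name": "payment_method",
    "type": "ENUM",
    "description": "How the invoice was paid",
    "values": ["bank_transfer", "credit_card", "cash", "paypal"],
    "max_selected": 1,
}
print(check_enum_selection(payment_method, ["credit_card"]))
```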

BOOLEAN

True or false value.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| default_value | boolean | No | Default value if not found |

{
  "name": "is_paid",
  "type": "BOOLEAN",
  "description": "Whether the invoice has been paid"
}

EMAIL

Email address string.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| default_value | string | No | Default value if not found |

{
  "name": "contact_email",
  "type": "EMAIL",
  "description": "Contact email address"
}

IBAN

International Bank Account Number. Validated against the pattern ^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| default_value | string | No | Default value if not found |

{
  "name": "bank_account",
  "type": "IBAN",
  "description": "Recipient IBAN"
}
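
The documented pattern can be applied locally, for example to check a default_value before sending a schema. A sketch using Python's `re` module with the pattern copied from above; it tests format only and does not verify the IBAN check digits, and `matches_iban_format` is a hypothetical helper name.

```python
import re

# Pattern copied from the documentation; re.ASCII keeps \d to the digits 0-9.
IBAN_PATTERN = re.compile(r"^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$", re.ASCII)


def matches_iban_format(value: str) -> bool:
    """True if value has the documented IBAN shape (no checksum validation)."""
    return IBAN_PATTERN.fullmatch(value) is not None


print(matches_iban_format("DE89370400440532013000"))
print(matches_iban_format("DE89 3704 0044 0532 0130 00"))
```

The pattern allows no spaces or lowercase letters, so a printed IBAN grouped in blocks of four fails the format check until it is normalized.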

COUNTRY

ISO 3166-1 alpha-2 country code (e.g., "DE", "US").

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| default_value | string | No | Must be a valid ISO 3166-1 alpha-2 code |

{
  "name": "origin_country",
  "type": "COUNTRY",
  "description": "Country of origin"
}

CURRENCY_CODE

ISO 4217 currency code (e.g., "EUR", "USD").

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| default_value | string | No | Must be a valid ISO 4217 code |

{
  "name": "currency",
  "type": "CURRENCY_CODE",
  "description": "Invoice currency"
}

CURRENCY_AMOUNT

Numeric monetary amount.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| min | float | No | Minimum value (inclusive) |
| max | float | No | Maximum value (inclusive) |
| decimal_points | integer | No | Number of decimal places to round to (>= 0) |
| default_value | float | No | Default value if not found |

{
  "name": "total_amount",
  "type": "CURRENCY_AMOUNT",
  "description": "Total invoice amount",
  "decimal_points": 2
}

ADDRESS

Structured address object. Extracted as an object with street, city, region, postal_code, and country fields.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| allowed_country_codes | string[] | No | Restrict to specific ISO 3166-1 alpha-2 country codes |

Response value shape:

{
  "street": "123 Main St",
  "city": "Berlin",
  "region": "Berlin",
  "postal_code": "10115",
  "country": "DE"
}

Example field definition:

{
  "name": "billing_address",
  "type": "ADDRESS",
  "description": "Billing address",
  "allowed_country_codes": ["DE", "AT", "CH"]
}
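
A returned ADDRESS value can be sanity-checked against the documented shape. The sketch below assumes all five keys are always present in the object (the source lists them but does not say whether any may be missing) and re-applies allowed_country_codes on the client; `check_address` is a hypothetical helper.

```python
from typing import Optional

ADDRESS_KEYS = {"street", "city", "region", "postal_code", "country"}


def check_address(value: dict, allowed_country_codes: Optional[list[str]] = None) -> list[str]:
    """Return problems with an ADDRESS value; an empty list means it looks fine."""
    problems = []
    missing = sorted(ADDRESS_KEYS - value.keys())
    if missing:
        problems.append(f"missing keys: {missing}")
    country = value.get("country")
    if allowed_country_codes is not None and country not in allowed_country_codes:
        problems.append(f"country {country!r} is not in the allowed list")
    return problems


address = {
    "street": "123 Main St",
    "city": "Berlin",
    "region": "Berlin",
    "postal_code": "10115",
    "country": "DE",
}
print(check_address(address, ["DE", "AT", "CH"]))
```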

ARRAY

A list of structured objects, each conforming to a nested schema. Use this for repeating data like line items.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| fields | array | Yes | Array of field configurations for each item |

The fields array uses the same field configuration format as top-level fields.

{
  "name": "line_items",
  "type": "ARRAY",
  "description": "Invoice line items",
  "fields": [
    {
      "name": "description",
      "type": "TEXT",
      "description": "Item description"
    },
    {
      "name": "quantity",
      "type": "INTEGER",
      "description": "Quantity ordered",
      "min": 1
    },
    {
      "name": "unit_price",
      "type": "CURRENCY_AMOUNT",
      "description": "Price per unit",
      "decimal_points": 2
    },
    {
      "name": "total",
      "type": "CURRENCY_AMOUNT",
      "description": "Line item total",
      "decimal_points": 2
    }
  ]
}

CALCULATED

A derived numeric value computed from other extracted fields. It is not extracted from the document; it is computed with plain arithmetic after all source fields are resolved.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| operation | string | Yes | One of: "sum", "subtract", "multiply", "divide" |
| source_field_names | string[] | Yes | Names of fields to apply the operation to, in order |
| unit | string | No | Unit label |

Source fields must be numeric types: INTEGER, DECIMAL, CURRENCY_AMOUNT, or another CALCULATED. Circular dependencies are detected and rejected at validation time.

The confidence score of a CALCULATED field is the minimum confidence of its source fields. Division by zero returns 0.

{
  "name": "tax_amount",
  "type": "CALCULATED",
  "description": "Tax amount (total minus net)",
  "operation": "subtract",
  "source_field_names": ["total_amount", "net_amount"]
}
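
The documented rules (arithmetic over the sources in order, minimum confidence, division by zero yielding 0) can be reproduced locally, for example to cross-check a response. A sketch, assuming subtract and divide fold left to right over source_field_names; that reading of "in order" is an assumption, and `compute_calculated` is a hypothetical helper.

```python
import operator
from functools import reduce

OPERATIONS = {
    "sum": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
}


def compute_calculated(operation: str, sources: list[dict]) -> dict:
    """Combine source results ({"value", "confidence"}) into a CALCULATED result."""
    values = [source["value"] for source in sources]
    # Documented: confidence is the minimum confidence of the source fields.
    confidence = min(source["confidence"] for source in sources)
    if operation == "divide":
        value = values[0]
        for divisor in values[1:]:
            if divisor == 0:
                value = 0  # documented: division by zero returns 0
                break
            value = value / divisor
    else:
        value = reduce(OPERATIONS[operation], values)
    return {"type": "CALCULATED", "value": value, "confidence": confidence}


total = {"value": 1250.0, "confidence": 0.95}
net = {"value": 1000.0, "confidence": 0.9}
tax = compute_calculated("subtract", [total, net])
print(tax)
```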

Response Format

Success Response

{
  "success": true,
  "data": {
    "invoice_number": {
      "type": "TEXT",
      "value": "INV-2024-001",
      "confidence": 0.98,
      "citations": ["Invoice No: INV-2024-001"],
      "source": "invoice.pdf"
    },
    "total_amount": {
      "type": "CURRENCY_AMOUNT",
      "value": 1250.00,
      "confidence": 0.95,
      "citations": ["Total: €1,250.00"],
      "source": "invoice.pdf"
    }
  }
}

Each field in data contains:

| Field | Type | Description |
| --- | --- | --- |
| type | string | The field type from the schema |
| value | varies | Extracted value (type depends on field type, see below) |
| confidence | float | Confidence score between 0.0 and 1.0 |
| citations | string[] | Verbatim quotes from the source document |
| source | string | Filename the value was extracted from |

Value types by field type:

| Field Type | Value Type |
| --- | --- |
| TEXT, TEXTAREA, EMAIL, IBAN, COUNTRY, CURRENCY_CODE, DATE, DATETIME, TIME | string |
| INTEGER, DECIMAL, CURRENCY_AMOUNT, CALCULATED | number |
| BOOLEAN | boolean |
| ENUM | string[] |
| ADDRESS | object |
| ARRAY | object[] |

Fields that could not be extracted and have no default_value are omitted from data (unless is_required is true, which causes an error). Fields resolved via default_value have a confidence of 1.0.
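
Since optional fields can be absent and every value carries a confidence score, consuming code usually needs a lookup that tolerates missing keys and a threshold for manual review. A sketch over the sample response; the 0.97 threshold and the due_date field are illustrative only, and `split_by_confidence` is a hypothetical helper.

```python
def split_by_confidence(data: dict, threshold: float) -> tuple[dict, dict]:
    """Split extracted fields into (accepted, needs_review) by confidence."""
    accepted, needs_review = {}, {}
    for name, field in data.items():
        bucket = accepted if field["confidence"] >= threshold else needs_review
        bucket[name] = field["value"]
    return accepted, needs_review


response = {
    "success": True,
    "data": {
        "invoice_number": {
            "type": "TEXT",
            "value": "INV-2024-001",
            "confidence": 0.98,
            "citations": ["Invoice No: INV-2024-001"],
            "source": "invoice.pdf",
        },
        "total_amount": {
            "type": "CURRENCY_AMOUNT",
            "value": 1250.00,
            "confidence": 0.95,
            "citations": ["Total: €1,250.00"],
            "source": "invoice.pdf",
        },
    },
}

accepted, needs_review = split_by_confidence(response["data"], threshold=0.97)

# A key missing from data means the field was not extracted and had no default.
due_date = response["data"].get("due_date")
```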

Recipes

For complete, runnable examples see the Recipes page.

  • Extract Invoice Data -- Extract line items, totals, and vendor details from an invoice into structured JSON.
  • Extract Resume Data -- Extract contact info, work history, and skills from a resume into structured data.
  • Extract Medical Record -- Extract patient details, diagnoses, and medications from a medical record into structured JSON.
  • Extract Receipt Data -- Extract merchant, amount, date, and line items from a receipt image or PDF.
  • Extract Multi-Invoice Data -- Extract structured data from multiple invoice files in a single API call using an array schema.

Error Responses

All errors return a JSON body of the form { "success": false, "error": "<message>" }.

| Status | Description |
| --- | --- |
| 400 | Invalid request (missing files/schema, invalid base64, URL fetch failure, file size exceeded, invalid field config) |
| 401 | Missing or invalid API key |
| 402 | Insufficient credits or budget cap exceeded |
| 422 | Processing error (circular dependency in CALCULATED fields, required field not extractable, LLM parsing failure) |
| 429 | Rate limit exceeded |
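
Client code can map these statuses to a coarse next step. A sketch: the status codes and error body shape come from the table above, while the action labels are illustrative choices and `classify_error` is a hypothetical helper. Treating only 429 as retryable is an assumption; the source does not say which errors are safe to retry.

```python
ACTIONS = {
    400: "fix_request",
    401: "check_api_key",
    402: "add_credits",
    422: "inspect_schema_or_document",
    429: "retry_with_backoff",  # assumption: rate limits are transient
}


def classify_error(status: int, body: dict) -> tuple[str, str]:
    """Map an error response to an (action, message) pair."""
    message = body.get("error", "")
    return ACTIONS.get(status, "unexpected"), message


action, message = classify_error(429, {"success": False, "error": "Rate limit exceeded"})
print(action, "-", message)
```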
