The Data Extractor component extracts structured data from unstructured or semi-structured text. It identifies specific fields and values in documents according to configurable extraction rules.
Use a Data Extractor in a flow
To use a Data Extractor in your workflow, follow these steps:
- Drag the Data Extractor component from the Other section onto your canvas
- Connect document outputs from upstream nodes to the Documents input
- Configure the field definitions and extraction settings
- Connect the output ports to downstream processing nodes
Inputs
| Input | Type | Description |
|---|---|---|
| Documents | Documents | Document objects from which to extract data |
| Text | Text | Raw text content from which to extract data |
Outputs
| Output | Type | Description |
|---|---|---|
| Data | Data | Extracted structured data |
| Documents | Documents | Original documents with extraction metadata added |
| Tables | Tables | Extracted tabular data |
Configuration
Field Definitions
| Setting | Description | Default | Notes |
|---|---|---|---|
| Fields | Array of field definitions | [] | Defines the fields to extract |
| Column | Field name/column | Required for each field | |
| Type | Data type of the field | “string” | Options: string, number, date, boolean |
| Default Value | Default value if extraction fails | | Optional fallback value |
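The interaction between a field's type and its default value can be sketched in Python. This is only an illustration of the idea, not the component's implementation: `coerce_field`, `COERCERS`, the accepted boolean spellings, and the ISO date format are all assumptions, and the field dictionaries reuse the `column`, `type`, and `defval` keys from the examples on this page.

```python
from datetime import date


def _to_bool(raw: str) -> bool:
    """Hypothetical boolean parser; the accepted spellings are an assumption."""
    text = raw.strip().lower()
    if text in {"true", "yes", "1"}:
        return True
    if text in {"false", "no", "0"}:
        return False
    raise ValueError(f"not a boolean: {raw!r}")


# One converter per documented field type (string, number, date, boolean).
COERCERS = {
    "string": lambda raw: raw.strip(),
    "number": lambda raw: float(raw.replace(",", "")),
    "date": lambda raw: date.fromisoformat(raw.strip()),  # assumes ISO dates
    "boolean": _to_bool,
}


def coerce_field(field: dict, raw):
    """Convert a raw extracted string to the field's type.

    Falls back to the field's default value when nothing was extracted,
    when the type is unknown, or when the conversion fails.
    """
    if raw is None:
        return field.get("defval")
    try:
        return COERCERS[field.get("type", "string")](raw)
    except (KeyError, ValueError):
        return field.get("defval")
```

In this sketch a failed conversion never raises: a value such as `"next week"` for a `date` field simply yields the field's default.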
Extraction Settings
| Setting | Description | Default | Notes |
|---|---|---|---|
| Extraction Method | Method for data extraction | “llm” | Options: llm, regex, rules |
| LLM Provider | LLM provider for extraction | Required if method is “llm” | |
| Context Window | Text context size for extraction | 1000 | Characters around potential matches |
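As a rough illustration of what "characters around potential matches" could mean, the Python sketch below cuts a slice of text around a match. It assumes the window applies on each side of the match; the component may measure the window differently, and `context_slice` is a hypothetical helper rather than part of the component.

```python
def context_slice(text: str, start: int, end: int, window: int = 1000) -> str:
    """Return the match at text[start:end] plus surrounding context.

    Assumption: up to `window` characters are kept on EACH side of the
    match, clipped to the bounds of the text.
    """
    lo = max(0, start - window)
    hi = min(len(text), end + window)
    return text[lo:hi]
```

A smaller window means less text to process per match, at the cost of possibly cutting off information that sits further away from it.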
Advanced Settings
| Setting | Description | Default | Notes |
|---|---|---|---|
| Confidence Threshold | Minimum confidence score | 0.7 | For LLM extraction |
| Output Format | Format of extracted data | “json” | Options: json, csv, table |
| Validation | Enable data validation | true | Validates against field types |
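The effect of the confidence threshold can be pictured with a short Python sketch. It assumes each candidate value carries a `confidence` score between 0 and 1 and that the best candidate at or above the threshold wins; the candidate structure and the `select_value` function are hypothetical.

```python
def select_value(candidates, threshold=0.7, default=None):
    """Pick the highest-confidence candidate that meets the threshold.

    `candidates` is a list of dicts with "value" and "confidence" keys
    (an assumed shape). Returns `default` when no candidate qualifies.
    """
    accepted = [c for c in candidates if c["confidence"] >= threshold]
    if not accepted:
        return default
    return max(accepted, key=lambda c: c["confidence"])["value"]
```

Under these assumptions, raising the threshold trades recall for precision: more fields fall back to their default values, but the values that remain are the ones with higher scores.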
Example Usage
Basic Field Extraction
This example shows how to configure the Data Extractor for basic field extraction:
{
  "fields": [
    {
      "column": "invoice_number",
      "type": "string",
      "defval": ""
    },
    {
      "column": "date",
      "type": "date",
      "defval": ""
    },
    {
      "column": "total_amount",
      "type": "number",
      "defval": "0.0"
    }
  ],
  "extractionMethod": "llm",
  "contextWindow": 1000,
  "confidenceThreshold": 0.7,
  "outputFormat": "json",
  "validation": true
}
LLM-Powered Invoice Extraction
For extracting invoice data using an LLM with specific field definitions:
{
  "fields": [
    {
      "column": "invoice_number",
      "type": "string",
      "defval": ""
    },
    {
      "column": "date",
      "type": "date",
      "defval": ""
    },
    {
      "column": "due_date",
      "type": "date",
      "defval": ""
    },
    {
      "column": "vendor_name",
      "type": "string",
      "defval": ""
    },
    {
      "column": "vendor_address",
      "type": "string",
      "defval": ""
    },
    {
      "column": "total_amount",
      "type": "number",
      "defval": "0.0"
    },
    {
      "column": "tax_amount",
      "type": "number",
      "defval": "0.0"
    }
  ],
  "extractionMethod": "llm",
  "llmProvider": {
    "provider": "openai",
    "model": "gpt-4",
    "temperature": 0.2
  },
  "contextWindow": 2000,
  "confidenceThreshold": 0.8,
  "outputFormat": "table",
  "validation": true
}
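Before loading a configuration like the ones above, it can help to check it against the options listed in the settings tables. The Python sketch below is one possible pre-flight check, not something the component provides: `check_config` is a hypothetical helper, and the assumption that the confidence threshold lies between 0 and 1 is inferred from the documented defaults.

```python
ALLOWED_TYPES = {"string", "number", "date", "boolean"}
ALLOWED_METHODS = {"llm", "regex", "rules"}
ALLOWED_FORMATS = {"json", "csv", "table"}


def check_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []

    # Every field needs a column name and one of the documented types.
    for index, field in enumerate(config.get("fields", [])):
        if not field.get("column"):
            problems.append(f"fields[{index}]: missing column")
        field_type = field.get("type", "string")
        if field_type not in ALLOWED_TYPES:
            problems.append(f"fields[{index}]: unknown type {field_type!r}")

    # The settings table lists an LLM provider as required for the llm method.
    method = config.get("extractionMethod", "llm")
    if method not in ALLOWED_METHODS:
        problems.append(f"unknown extractionMethod {method!r}")
    elif method == "llm" and not config.get("llmProvider"):
        problems.append("llmProvider is required when extractionMethod is 'llm'")

    output_format = config.get("outputFormat", "json")
    if output_format not in ALLOWED_FORMATS:
        problems.append(f"unknown outputFormat {output_format!r}")

    # Assumption: confidence scores are expressed on a 0 to 1 scale.
    threshold = config.get("confidenceThreshold", 0.7)
    if not 0.0 <= threshold <= 1.0:
        problems.append(f"confidenceThreshold {threshold} is outside 0 to 1")

    return problems
```

A configuration read with `json.load` can be passed to `check_config` directly, and any returned messages reviewed before running the flow.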
Best Practices
Field Definition
- Define clear, specific fields with appropriate data types
- Provide default values for optional fields
- Use consistent naming conventions for fields
- Consider the structure of your source documents when defining fields
Extraction Method Selection
- Use LLM for complex, varied documents requiring understanding
- Use regex for well-defined patterns and simple extractions
- Use rules for structured documents with consistent layouts
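To make the regex option more concrete, here is a minimal Python sketch of pattern-based extraction for well-defined fields. The patterns, the `extract_with_regex` function, and the sample invoice layout are illustrative assumptions; how the component expects regex patterns to be supplied is not covered on this page.

```python
import re

# Hypothetical patterns keyed by column name; adapt them to your documents.
PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    "total_amount": re.compile(
        r"\bTotal\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE
    ),
}


def extract_with_regex(text: str, patterns=PATTERNS, defaults=None) -> dict:
    """Return the first captured group per column, or that column's default."""
    defaults = defaults or {}
    result = {}
    for column, pattern in patterns.items():
        match = pattern.search(text)
        result[column] = match.group(1) if match else defaults.get(column, "")
    return result
```

This approach works well only while the source layout stays stable. Once labels or formats vary between documents, the patterns multiply quickly, which is where the LLM method tends to be the better fit.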
Performance Optimization
- Adjust context window size based on document structure
- Set appropriate confidence threshold based on extraction quality needs
- Enable validation to ensure data quality
- Process documents in batches for large collections
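Batching can be done upstream of the component with a small helper. The Python sketch below is generic and makes no assumption about the component's API; `batched` is a hypothetical helper name, and a suitable batch size depends on document length and available memory.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Because the helper consumes its input lazily, only one batch of documents needs to be held in memory at a time when the source is itself a generator.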
Troubleshooting
Common Issues
Extraction Problems
- Missing data: Adjust field definitions or try a different extraction method
- Incorrect formats: Check field type definitions or adjust validation rules
- Low confidence scores: Refine LLM prompts or adjust confidence threshold
Performance Issues
- Slow processing: Reduce context window size or batch process documents
- Memory errors: Process documents in smaller batches
- High error rates: Validate and refine field definitions
