The General Text preprocessor node splits unstructured or semi-structured text into smaller document chunks suitable for indexing, embedding, or LLM input. It accepts plain text, document objects, and tabular data, and applies configurable splitting logic to segment the content.

Key capabilities
- Text normalization and cleaning
- Configurable document segmentation
- Support for specialized formats (Markdown, LaTeX)
- Language-aware splitting options
- Custom splitting logic
Configuration
When setting up the General Text node, you’ll need to configure several parameters:
- Splitter Type: Choose the appropriate splitting algorithm based on your text format and needs
- Chunk Size: Set the desired size of text segments
- Chunk Overlap: Configure how much text should overlap between segments
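
To make the interaction between Chunk Size and Chunk Overlap concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not the node's implementation: the function name `chunk_text` is hypothetical, and it assumes both settings are measured in characters, which the node may or may not do.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows of at most `chunk_size`,
    where consecutive windows share `chunk_overlap` characters.

    Hypothetical helper for illustration; units are assumed to be characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap repeats more context across neighboring chunks at the cost of producing more chunks overall, since each window advances by only `chunk_size - chunk_overlap`.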

Inputs and Outputs
Input Channels
- Text: Unstructured or semi-structured free-form text for processing
- Documents: Document objects containing text to be preprocessed
- Table: Structured data in table format (e.g., CSV or tabular JSON)
Output Channels
- Text: Preprocessed text content
- Documents: List of segmented text blocks as structured documents
Splitter Types
| Splitter Type | Description | Best For |
|---|---|---|
| Default Text Splitter | General-purpose splitter that provides the best balance of structure and size | Most use cases |
| Recursive Character Text Splitter | Splits text recursively by different separators | Complex documents with hierarchical structure |
| Character Text Splitter | Splits text based on character count | Simple text segmentation |
| Markdown Text Splitter | Specialized for processing Markdown-formatted text | Documentation, README files |
| Latex Text Splitter | Designed for handling LaTeX document structures | Academic papers, scientific documents |
| NLTK Text Splitter | Uses the Natural Language Toolkit (NLTK) for linguistically aware text splitting | Natural language processing tasks |
| Spacy Text Splitter | Leverages the spaCy NLP library for advanced linguistic processing | Advanced NLP applications |
| Custom Splitter | Allows for user-defined splitting logic | Specialized formats and requirements |
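
The idea behind recursive splitting by different separators can be sketched as follows. This is a simplified illustration of the general technique, not the node's actual algorithm: the function `recursive_split`, its default separator order, and the character-based size limit are all assumptions, and chunk overlap is left out for brevity.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first, then recurse with finer
    separators for any piece that is still larger than `chunk_size`.

    Hypothetical helper for illustration; sizes are counted in characters.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        # Try to merge this piece into the chunk being built.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: split it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "Intro paragraph.\n\nSecond paragraph is longer than the limit."
print(recursive_split(sample, chunk_size=20))
# ['Intro paragraph.', 'Second paragraph is', 'longer than the', 'limit.']
```

Paragraph breaks are tried first, so a short paragraph stays intact, while an oversized paragraph is broken at line breaks and then at spaces before any word is cut mid-way.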
Common Use Cases
Text Preparation
- Prepare documents for embedding or vector storage
- Break down large texts into manageable chunks for LLMs
- Segment documents while preserving semantic relationships
- Process diverse text sources for unified downstream handling
Performance Considerations
- Lemmatization is more resource-intensive than stemming
- Processing very large documents may require batch processing
- Consider memory usage when processing large volumes of text
- Balance chunk size with context requirements for downstream tasks
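
As a rough illustration of the batch-processing advice above, the sketch below groups an iterable of texts into fixed-size batches so that only one batch is held in memory at a time. The helper `batched_texts` and the batch size are hypothetical; how batches are actually fed to the node depends on your pipeline.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched_texts(texts: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `batch_size` texts.

    Hypothetical helper for illustration; works lazily on any iterable,
    so the full collection never has to be loaded at once.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # input exhausted
        yield batch


documents = ["doc one", "doc two", "doc three", "doc four", "doc five"]
for batch in batched_texts(documents, batch_size=2):
    # Replace this print with the actual preprocessing call for each batch.
    print(len(batch), batch)
```

Because the helper consumes its input lazily, it can be combined with a generator that reads files one at a time, which keeps peak memory proportional to the batch size instead of the corpus size.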
Frequently Asked Questions
- Why is my text processing so slow?
Disable resource-intensive options like lemmatization when processing large volumes of text.
- How do I prevent losing important information during processing?
Adjust settings to preserve critical content, such as keeping numbers for financial text analysis.
- What should I do if I’m experiencing memory errors?
Process your text in smaller batches to reduce memory usage.
Additional Resources
- spaCy Tokenization Documentation
- Pinecone: Text Chunking Strategies for Vector Search
