General Text

The General Text preprocessor node splits unstructured or semi-structured text into smaller document chunks suitable for indexing, embedding, or LLM input. It accepts both plain text and tabular formats and applies configurable splitting logic to control how the text is segmented.

Key capabilities

  • Text normalization and cleaning
  • Configurable document segmentation
  • Support for specialized formats (Markdown, LaTeX)
  • Language-aware splitting options
  • Custom splitting logic

Configuration

When setting up the General Text node, you’ll need to configure several parameters:

  • Splitter Type: Choose the splitting algorithm that matches your text format and downstream needs
  • Chunk Size: Set the maximum size of each text segment
  • Chunk Overlap: Configure how much text adjacent segments share, which helps preserve context across chunk boundaries
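
The interaction between Chunk Size and Chunk Overlap can be illustrated with a minimal character-based sketch. The function name and defaults below are illustrative only, not the node's actual API:

```python
def split_text(text, chunk_size=200, chunk_overlap=50):
    """Split text into fixed-size chunks, with each chunk sharing
    chunk_overlap characters with its predecessor."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Note that a larger overlap produces more chunks for the same input, so storage and embedding costs grow as overlap increases.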


Inputs and Outputs

Input Channels

  • Text: Unstructured or semi-structured free-form text for processing
  • Documents: Document objects containing text to be preprocessed
  • Table: Structured data in table format (e.g., CSV or tabular JSON)

Output Channels

  • Text: Preprocessed text content
  • Documents: List of segmented text blocks as structured documents
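
The exact document schema depends on the platform, but the Documents output can be pictured as chunks wrapped with metadata. The dict layout and field names below are assumptions for illustration:

```python
def chunks_to_documents(chunks, source):
    """Wrap raw text chunks as document objects carrying positional
    metadata, so each chunk stays traceable to its source."""
    return [
        {"text": chunk, "metadata": {"source": source, "chunk_index": i}}
        for i, chunk in enumerate(chunks)
    ]
```

Keeping the source and chunk index in metadata makes it possible to cite or reassemble the original document after retrieval.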

Splitter Types

  • Default Text Splitter: general-purpose splitter that balances structure and chunk size. Best for most use cases.
  • Recursive Character Text Splitter: splits text recursively on a hierarchy of separators (paragraphs, then lines, then words). Best for complex documents with hierarchical structure.
  • Character Text Splitter: splits text by character count. Best for simple text segmentation.
  • Markdown Text Splitter: specialized for Markdown-formatted text. Best for documentation and README files.
  • LaTeX Text Splitter: designed for LaTeX document structures. Best for academic papers and scientific documents.
  • NLTK Text Splitter: uses the Natural Language Toolkit for linguistically aware splitting. Best for natural language processing tasks.
  • spaCy Text Splitter: leverages the spaCy NLP library for advanced linguistic processing. Best for advanced NLP applications.
  • Custom Splitter: allows user-defined splitting logic. Best for specialized formats and requirements.
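
The recursive strategy is worth a sketch: it tries coarse separators first and only falls back to finer ones when a piece is still too large, so paragraph and line boundaries are preserved where possible. This is a simplified illustration of the idea, not the node's implementation:

```python
def recursive_split(text, separators=("\n\n", "\n", " "), chunk_size=100):
    """Recursively split text, preferring coarse separators (paragraphs)
    and falling back to finer ones (lines, words) for oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separators left: fall back to a hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece fits into the running chunk
        else:
            if current:
                chunks.append(current)
            if len(piece) > chunk_size:
                # Piece alone is too big: recurse with finer separators.
                chunks.extend(recursive_split(piece, rest, chunk_size))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Because splitting only descends to finer separators when necessary, semantically related text (a paragraph, a line) tends to stay in one chunk.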

Common Use Cases

Text Preparation

  • Prepare documents for embedding or vector storage
  • Break down large texts into manageable chunks for LLMs
  • Segment documents while preserving semantic relationships
  • Process diverse text sources for unified downstream handling

Performance Considerations

  • Lemmatization is more resource-intensive than stemming
  • Processing very large documents may require batch processing
  • Consider memory usage when processing large volumes of text
  • Balance chunk size with context requirements for downstream tasks
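
One way to keep memory bounded when chunking large volumes of text is to generate chunks lazily instead of materializing them all at once. A minimal generator-based sketch (names and defaults are illustrative):

```python
def iter_chunks(texts, chunk_size=500, chunk_overlap=50):
    """Lazily yield overlapping chunks one at a time, so memory use
    stays bounded regardless of how many input texts there are."""
    step = chunk_size - chunk_overlap
    for text in texts:
        for start in range(0, len(text), step):
            yield text[start:start + chunk_size]
```

Downstream consumers (embedding, indexing) can pull from the generator in batches rather than holding every chunk in memory simultaneously.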

Frequently Asked Questions

  • Why is my text processing so slow?
    Disable resource-intensive options like lemmatization when processing large volumes of text.
  • How do I prevent losing important information during processing?
    Adjust settings to preserve critical content, such as keeping numbers for financial text analysis.
  • What should I do if I’m experiencing memory errors?
    Process your text in smaller batches to reduce memory usage.


Additional Resources

spaCy Tokenization Documentation
Pinecone: Text Chunking Strategies for Vector Search