Code

What does it do?

The Preprocessor – Code connector is designed to intelligently parse and tokenize source code files, breaking them down into meaningful components such as functions, classes, and comments. This enables deeper semantic analysis, classification, embedding, or search by preparing code for downstream AI and data processing tasks. The node segments code into logical chunks by detecting syntax boundaries and respecting language-specific structures.

With the Preprocessor – Code connector, you can:

Automatically split code into logical chunks for analysis or embedding
Extract structural elements (functions, classes, comments) from source files
Prepare code for advanced AI workflows, such as search, summarization, or classification
Support multiple programming languages with automatic or manual language selection
Apply code normalization and formatting for consistent processing
Remove or preserve comments and docstrings based on your needs

Inputs and Outputs

Inputs

Text – Receives plain code or code-formatted text (raw content to be split and transformed)
Documents – Document objects containing code to be preprocessed

Outputs

Text – Preprocessed code content
Documents – Emits structured chunks of code as processed document segments (optimized for downstream components such as vector embeddings or LLMs)

Configuration Fields

Basic Configuration

Field	Description	Default	Options/Example	Notes
Language (Code Splitter Profile)	Programming language to use for parsing	Auto	auto, c, cpp, python, javascript, typescript	Automatically detect or specify manually
Maximum String Length	Maximum length (in characters) for each code chunk	–	`512`	Controls chunk size for downstream tasks

Code Normalization Options

Field	Description	Default	Notes
Remove Comments	Remove code comments	True	Can be configured to preserve docstrings
Format Code	Apply code formatting	True	Language-specific formatting rules
Remove Whitespace	Normalize whitespace	True	Preserves indentation structure
Normalize Identifiers	Standardize variable/function names	False	Useful for code comparison

Advanced Settings

Field	Description	Default	Notes
Preserve Docstrings	Keep documentation strings	True	Only applies when Remove Comments is True
Extract Functions	Extract function definitions	False	Creates separate outputs for each function
Extract Classes	Extract class definitions	False	Creates separate outputs for each class
Custom Patterns	Apply custom regex patterns	None	For specialized code cleaning

How do I use it?

To use the Preprocessor – Code connector in your workflow:

Add the Preprocessor – Code Connector
- Insert the Preprocessor – Code node into your pipeline where you want to process source code files
Connect Input
- Connect the input lane (text or documents) to the source of your code
- This could be a file dropper, file system, document parser, or any code source
Configure Parameters
- In the attributes editor, customize how code is processed
- Set the programming language (or use “Auto” for automatic detection)
- Configure maximum string length for chunk size control
- Adjust normalization options based on your needs
Connect Output
- The connector outputs structured code segments (as documents)
- Send these to downstream nodes for embedding, search, or further analysis

Example Use Cases

Prepare code for semantic search by splitting large files into manageable, meaningful chunks
Enable code summarization or documentation generation by extracting functions and classes
Feed code into embedding models for similarity search or clustering
Automate code review or analysis by breaking down files for AI-powered inspection
Process code repositories for knowledge base creation
Normalize code across different projects for comparison and analysis

Best Practices

Code Preparation

Specify the programming language explicitly for best results
Consider preserving docstrings for documentation-heavy code
Use function extraction for analyzing specific code components
Normalize identifiers when comparing code functionality across different implementations

Performance Considerations

Code formatting can be resource-intensive for large files
Processing very large codebases may require batch processing
Consider memory usage when processing large volumes of code

Troubleshooting

Processing Problems

Incorrect language detection – Specify language explicitly instead of using Auto
Important comments lost – Enable preserveDocstrings option
Code structure altered – Disable formatting if precise structure preservation is needed

Performance Issues

Slow processing – Disable resource-intensive options like formatting for large codebases
Memory errors – Process code in smaller batches

Summary Table of Parameters

Parameter	Description	Effect/Usage
Language	Programming language to use for parsing (auto, c, cpp, etc.)	Determines parsing rules and syntax
Maximum String Length	Max length (in characters) for each code chunk	Controls chunk size for downstream tasks
Remove Comments	Strip comments from code	Cleans code for analysis while optionally preserving docstrings
Format Code	Apply language-specific formatting	Standardizes code appearance and structure
Extract Functions/Classes	Separate functions and classes into individual outputs	Enables component-level analysis and processing

In summary:

The Preprocessor – Code connector prepares your source code for advanced AI workflows by parsing, chunking, and structuring it for downstream analysis, embedding, or search. It provides comprehensive code normalization and extraction capabilities to optimize code processing for machine learning and AI applications.