This example walks through a streamlined pipeline that classifies and anonymizes text before embedding. The integrated workflow categorizes content and removes personally identifiable information (PII) from text data, ensuring both regulatory compliance and data privacy while preparing the text for vector embedding and retrieval in AI applications.
Pipeline Overview

This is a Classification and Anonymization pipeline that processes text for privacy and categorization before embedding.
| Node | Function |
|---|---|
| Data Source | Pulls source files from the configured location (e.g., Google Drive) |
| Parser | Extracts plain text from the input files |
| Text Classification | Categorizes text into predefined classes/labels |
| Text Anonymizer | Removes or masks personally identifiable information (PII) |
| Preprocessor – General Text | Cleans and splits text into appropriate segments |
| Embedding – Transformer | Converts processed text into vector embeddings |
| Vector Store (Qdrant) | Stores embeddings for semantic search and retrieval |
Key Capabilities:
- Multi-class Classification: Categorizes text into multiple predefined categories, enabling content routing and specialized processing.
- PII Detection & Masking: Identifies and anonymizes personal information, ensuring compliance with privacy regulations.
- Configurable Chunking: Splits text into optimally sized segments, improving retrieval precision and relevance.
- Vector Transformation: Converts text to semantic embeddings, enabling meaning-based search and retrieval.
Workflow Breakdown:
1. Data Source Node
This node pulls your source files (e.g., Google Drive, S3, local docs).
How it works
- Click ▶️ on the Aparavi Sample Data node. You'll see two options in the Files section.
- If you select "Preselected Data Sets," the node fetches files from the configured drive path.
- If you select "Custom Data Sets," provide the path to your sample data in the text box.
- Learn more about the Google Drive source node here: link
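The node handles the fetch for you; as a rough local stand-in for what it does, a sketch might look like the following (the `./sample_data` path is a hypothetical placeholder for your configured location):

```python
# A local stand-in for the data source step: gather every file under a
# sample-data directory. The real node fetches from the configured drive.
from pathlib import Path

def load_sample_files(data_dir: str) -> list[Path]:
    """Return every regular file under data_dir, recursively."""
    return [p for p in Path(data_dir).rglob("*") if p.is_file()]

files = load_sample_files("./sample_data")  # hypothetical local path
print(f"Found {len(files)} files to process")
```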
2. Data Parser Node
In this node, we extract the content from the files. In this scenario, we are extracting text.
How it works
- Receives “Data” from the Sample Data node.
- Splits into multiple channels: Text, Table, Image, Audio, Video.
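To make the channel split concrete, here is a simplified sketch that routes files by extension; the actual Parser node performs real content extraction, so the mapping below is illustrative only:

```python
# Illustrative channel routing by file extension; the real Parser node
# inspects content, not just extensions.
from pathlib import Path

CHANNELS = {
    ".txt": "Text", ".md": "Text", ".csv": "Table",
    ".png": "Image", ".jpg": "Image",
    ".wav": "Audio", ".mp4": "Video",
}

def route(path: Path) -> str:
    """Map a file to its processing channel, defaulting to Text."""
    return CHANNELS.get(path.suffix.lower(), "Text")

print(route(Path("report.csv")))  # -> "Table"
```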
3. Text Classification Node
Analyzes and categorizes text into predefined classes or categories.
How it works
- Processes text using a machine learning model to identify content categories.
- Assigns one or more classification labels to the text.
- Provides confidence scores for each classification.
Configuration:
- Text classification helps identify content types for appropriate processing and handling. The node's configuration interface lets you select the classification categories to apply, such as sensitive-information policies for the United States, US healthcare, and Germany.
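The node's underlying model isn't specified here, so as a hedged illustration of multi-label classification with confidence scores, the sketch below uses Hugging Face's zero-shot pipeline; the model and labels are assumptions loosely based on the categories above:

```python
# A zero-shot classification sketch: assigns labels with confidence
# scores, mirroring the node's output. Model and labels are illustrative.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

labels = ["US sensitive information", "US healthcare", "German policy"]
result = classifier("Patient SSN 123-45-6789 was admitted on Monday.",
                    candidate_labels=labels, multi_label=True)

# One confidence score per label, as in the node's classification output.
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.2f}")
```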

4. Text Anonymizer Node
The Text Anonymizer node identifies and masks personally identifiable information (PII) in text to ensure privacy and compliance.
How it works
- Scans text for PII such as names, emails, phone numbers, addresses, etc.
- Replaces identified PII with generic tokens or masks.
- Preserves the semantic structure of the text while removing sensitive data.
Configuration:

| Field | Example Value | Description |
|---|---|---|
| PII Types to Detect | Names, Emails, Phone Numbers, Addresses | Categories of PII that will be identified and masked |
| Anonymization Method | Token Replacement | How PII is masked (e.g., with generic tokens or asterisks) |
| Custom Entities | Company-specific terms | Additional entity types to anonymize beyond standard PII |
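As a minimal sketch of the token-replacement method from the table (the real node uses far more robust PII detection than these illustrative regex patterns, which cover only easy cases like emails, US phone numbers, and SSNs):

```python
# Minimal regex-based token replacement: each matched PII span becomes a
# generic <TYPE> token, preserving the sentence structure around it.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def anonymize(text: str) -> str:
    """Replace each matched PII span with a generic <TYPE> token."""
    for token, pattern in PATTERNS.items():
        text = pattern.sub(f"<{token}>", text)
    return text

print(anonymize("Reach Jane at jane.doe@example.com or 555-123-4567."))
# -> "Reach Jane at <EMAIL> or <PHONE>."
```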
5. Preprocessor – General Text
The Preprocessor – General Text node splits large text blocks into smaller "documents" that are ready for embedding.
How it works
- Takes parser output, applies sanitization, chunking, and metadata tagging.
- Emits a stream of document objects.
Configuration:
It ensures that our documents are split into well-sized, context-preserving segments, optimizing them for embedding and retrieval in your RAG pipeline.

| Setting | Value/Option | Description |
|---|---|---|
| Text Splitter | Default Text Splitter | Balances structure and size for general-purpose use |
| Split by | String length | Splits text based on character count |
| String Length | 512 | Each segment is up to 512 characters long |
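A minimal sketch of the "split by string length" strategy configured above (the node's Default Text Splitter also balances structure and size, which this fixed-window version ignores):

```python
# Fixed-window chunking: consecutive segments of at most max_chars
# characters, matching the String Length = 512 setting above.
def split_by_length(text: str, max_chars: int = 512) -> list[str]:
    """Chop text into consecutive segments of at most max_chars."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

segments = split_by_length("Some long document text. " * 100)
print(len(segments), "segments; first has", len(segments[0]), "characters")
```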
6. Embedding – Transformer
The Embedding – Transformer node uses a sentence-transformer model to convert text (such as document segments or user questions) into vector embeddings, which are essential for semantic search and retrieval in a RAG pipeline.
How it works
- Documents transformer: turns each text chunk into a vector.
Configuration:
Available Options:
- Custom model: Use your own pre-trained or fine-tuned model.
- miniAll: A general-purpose model, suitable for a variety of tasks.
- miniLM: Optimized for general use, offering a good balance between performance and speed.
- mpnet: Another high-quality model, often providing strong results on semantic similarity tasks.
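As a hedged sketch of this step with the sentence-transformers library; `all-MiniLM-L6-v2` is an assumption, a common MiniLM-family model that plausibly corresponds to the miniLM option:

```python
# Encode anonymized text chunks into dense vectors. encode() returns one
# vector per input text (384 dimensions for MiniLM-family models).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
segments = ["First anonymized chunk.", "Second anonymized chunk."]

vectors = model.encode(segments)
print(vectors.shape)  # (2, 384)
```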

7. Vector Store (Qdrant) Node
The Qdrant Vector Store node is critical in a RAG pipeline. It is responsible for storing and retrieving vector embeddings, enabling efficient semantic search and context retrieval for downstream language models.
How it works
- Ingests document embeddings + metadata.
- On query, it computes similarity between the question vector and stored vectors.
- Returns top-K relevant document segments.
Configuration:

| Field | Example Value | Description |
|---|---|---|
| Type of Qdrant host | Your own Qdrant server | Specifies the Qdrant deployment (local, cloud, or custom server). |
| Host | localhost | The server address where Qdrant is running. |
| Port | 6333 | The port number Qdrant listens on (default: 6333). |
| Retrieval Score | Related | Sets the minimum similarity score for retrieving relevant vectors. |
| Collection | APARAVI | The name of the Qdrant collection used to store and query vectors. |
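Using the host, port, and collection values from the table, a minimal sketch with the qdrant-client Python library might look like this (the 384-dimensional vector size assumes a MiniLM-style embedding model, and the placeholder vector stands in for real embeddings):

```python
# Store an embedding with metadata, then query for the top-K most
# similar segments, mirroring the node's ingest and retrieval behavior.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(host="localhost", port=6333)

client.recreate_collection(
    collection_name="APARAVI",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

vector = [0.1] * 384  # placeholder for a real document embedding
client.upsert(
    collection_name="APARAVI",
    points=[PointStruct(id=1, vector=vector,
                        payload={"text": "anonymized chunk text"})],
)

# At query time, the question embedding is scored against stored vectors.
hits = client.search(collection_name="APARAVI", query_vector=vector, limit=5)
for hit in hits:
    print(hit.score, hit.payload["text"])
```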
Hit the Play Button ▶️ to run the pipeline.
Common Use Cases:
- Healthcare Data Processing: Healthcare organizations can use this pipeline to process patient records, ensuring PHI compliance by anonymizing sensitive information before embedding for AI analysis.
- Legal Document Analysis: Law firms can process case documents, automatically classifying content by legal domain while removing client identifiers to maintain confidentiality.
- Financial Services: Banks and financial institutions can process customer communications, removing account details and personal information while preserving the semantic meaning for trend analysis.
- Customer Support: Support teams can process ticket data to build knowledge bases while ensuring customer information is properly anonymized before embedding.
Frequently Asked Questions:
- What PII types are detected? Standard detection includes names, addresses, phone numbers, emails, SSNs, and credit card numbers. Custom entities can be added.
- How accurate is the classification? Accuracy depends on the model used and training data. Adjust confidence thresholds to balance precision and recall.
- Can I preserve the original text alongside anonymized versions? Yes, configure the anonymizer to output both original and anonymized versions, but ensure proper access controls.
