Document Extractor

Extract text content from PDF, DOCX, and XLSX files with flexible output formatting.

What is Document Extractor?

The Document Extractor node converts documents into readable text that other nodes in your workflow can process. It supports PDF, DOCX (Word), and XLSX (Excel) files, detects the file type automatically, and preserves document structure and formatting where the source format allows.

How to use it?

To extract content from your documents:

Connect Your File:
- Connect any file output from File Reader, File Writer, or other file-producing nodes
- The node detects whether the file is a PDF, DOCX, or XLSX, so you do not need to specify the file type

Configure Extraction Options:

For Excel Files (XLSX), when detected:
- Sheet Name: Specify a single sheet to extract (optional; otherwise all sheets are extracted)
- Include Sheet Names: Add sheet names to the output for context
- Include Metadata: Add file metadata to the output

For All File Types:
- Choose the output format based on how you plan to use the extracted content

Select Output Format:
- Extracted Text: Plain text content without formatting markup
- Markdown: Formatted text that preserves document structure
- HTML: Web-formatted output that maintains rich formatting

Use Extracted Content:
- Connect the output to text processing nodes like Document Splitter
- Send to LLM nodes for analysis or summarization
- Use with Text Embedder for semantic search applications
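
This page does not describe how the node performs automatic file type detection. As an illustration only, one common approach is to inspect the file content: PDF files begin with the `%PDF-` header, while DOCX and XLSX are both ZIP containers that differ in their main internal part. The function name `detect_document_type` below is hypothetical and is not part of the node's API.

```python
import io
import zipfile


def detect_document_type(data: bytes) -> str:
    """Return 'pdf', 'docx', 'xlsx', or 'unknown' based on file content.

    Hypothetical helper: a sketch of content-based detection, not the
    Document Extractor's actual implementation.
    """
    # PDF files begin with the "%PDF-" header.
    if data.startswith(b"%PDF-"):
        return "pdf"

    # DOCX and XLSX are both ZIP archives (Office Open XML), so the ZIP
    # signature alone is not enough; look at the main part inside.
    if data.startswith(b"PK\x03\x04"):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = set(archive.namelist())
        except zipfile.BadZipFile:
            return "unknown"
        if "word/document.xml" in names:
            return "docx"
        if "xl/workbook.xml" in names:
            return "xlsx"

    return "unknown"
```

Checking content rather than the file extension means a renamed or extension-less file is still routed to the right extraction path.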

Example of usage

Objective: Process a batch of research papers in PDF format to create a searchable knowledge base.

Document Extraction Workflow:

File Processing Setup:
- Connect File Reader to read PDF files from storage
- Connect the file output directly to Document Extractor (no format selection needed)
- Set output format to Markdown to preserve document structure

Extraction Configuration:
- File Input: Connected from File Reader (automatic format detection)
- Output Format: Markdown (to maintain headings and structure)
- The node detects that the file is a PDF and processes it accordingly

Post-Processing Chain:
- Connect Document Splitter to break text into manageable chunks
- Use Text Embedder to create vector representations
- Store in Vector Store for semantic search capabilities
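
To show why Markdown output helps the post-processing chain, here is a minimal Python sketch of heading-aware chunking. It is an assumption about what a splitter might do with Markdown input, not a description of how Document Splitter works; `split_markdown` and `max_chars` are hypothetical names.

```python
import re


def split_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Split Markdown at headings, then cap the length of each chunk.

    Hypothetical helper: illustrates heading-aware chunking only.
    """
    # Split just before every line that starts with 1-6 '#' and a space.
    sections = re.split(r"^(?=#{1,6} )", text, flags=re.MULTILINE)

    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Sections longer than max_chars are cut into fixed-size pieces.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks


extracted = "# Abstract\nShort summary.\n\n## Methods\nDetails of the study.\n"
for chunk in split_markdown(extracted):
    print(repr(chunk))
```

Because each chunk starts at a heading, the section title travels with its text into the embedding step, which plain-text output would not guarantee.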

Additional Examples:

Excel Data Processing:

File Reader → Document Extractor → JSON Query → Data Analysis

Sheet Name: "Sales_Data" (specific sheet, when Excel is detected)
Include Sheet Names: true (for multi-sheet context)
Output Format: Extracted Text → parse with JSON Query
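
The effect of the Sheet Name and Include Sheet Names options can be sketched in Python. This is an illustration under stated assumptions: the workbook is modeled as a plain dictionary of sheet name to rows, the tab-separated layout and the `Sheet:` label are invented for the example, and `workbook_to_text` is a hypothetical function, not the node's API. Include Metadata is left out.

```python
def workbook_to_text(sheets, sheet_name=None, include_sheet_names=False):
    """Render workbook rows as tab-separated plain text.

    sheets: dict mapping sheet name -> list of rows (lists of cell values).
    Hypothetical helper: the output layout is an assumption.
    """
    if sheet_name is not None:
        if sheet_name not in sheets:
            raise KeyError(f"No sheet named {sheet_name!r}")
        # Sheet Name set: extract only that sheet.
        selected = {sheet_name: sheets[sheet_name]}
    else:
        # Sheet Name empty: extract every sheet.
        selected = sheets

    blocks = []
    for name, rows in selected.items():
        lines = [
            "\t".join("" if cell is None else str(cell) for cell in row)
            for row in rows
        ]
        if include_sheet_names:
            # Label each block so multi-sheet output stays unambiguous.
            lines.insert(0, f"Sheet: {name}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


workbook = {
    "Sales_Data": [["Region", "Total"], ["North", 1200]],
    "Notes": [["draft"]],
}
print(workbook_to_text(workbook, sheet_name="Sales_Data", include_sheet_names=True))
```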

Word Document Analysis:

File Reader → Document Extractor → LLM → Summary Output

Auto-Detection: Node automatically recognizes DOCX format
Output Format: Markdown (preserves formatting)
Use Case: Automated document summarization

Multi-Format Document Processing:

Multiple File Readers → Single Document Extractor → Document Splitter → Vector Store

Single extractor handles all file types (PDF, DOCX, XLSX) automatically
Unified text processing pipeline after extraction
Consistent chunking and embedding for all document types

Additional information

Supported File Formats

PDF Documents:

Text-based PDFs: Direct text extraction with high accuracy
Complex Layouts: Handles multi-column layouts and formatted documents
Embedded Content: Extracts readable text while preserving document flow

DOCX (Word) Documents:

Rich Text: Preserves formatting like bold, italic, and headers
Document Structure: Maintains paragraph breaks and section organization
Tables and Lists: Converts structured content to readable format

XLSX (Excel) Spreadsheets:

Multiple Sheets: Extract from all sheets or specify individual sheets
Data Types: Handles text, numbers, dates, and formulas
Cell Structure: Preserves row and column relationships in output
Sheet Organization: Optional sheet names for multi-sheet workbooks

Output Format Options

Extracted Text (Plain Text):

Clean, readable text without formatting markup
Suitable for basic text analysis and processing
Smallest output size and lowest processing overhead
Compatible with all text-processing nodes

Markdown Format:

Preserves document structure with headers and formatting
Maintains tables, lists, and emphasis markers
Ideal for documents that need structural preservation
Good balance between formatting and readability

HTML Format:

Rich formatting with full styling preservation
Maintains complex layouts and visual elements
Suitable for web display or rich text applications
Larger output size due to markup tags
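
To make the trade-offs between the three formats concrete, the Python sketch below renders one small table as plain text, Markdown, and HTML. It is an illustration only: `render_table` is a hypothetical function, and the exact markup the node emits may differ.

```python
from html import escape


def render_table(rows, fmt):
    """Render rows (first row is the header) as text, markdown, or html.

    Hypothetical helper: shows the general shape of each format only.
    """
    header, *body = rows
    if fmt == "text":
        # Plain text: no markup, cells separated by tabs.
        return "\n".join("\t".join(row) for row in rows)
    if fmt == "markdown":
        # Markdown: a pipe table with a separator row under the header.
        lines = [
            "| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |",
        ]
        lines += ["| " + " | ".join(row) + " |" for row in body]
        return "\n".join(lines)
    if fmt == "html":
        # HTML: explicit tags, with special characters escaped.
        head = "".join(f"<th>{escape(cell)}</th>" for cell in header)
        body_rows = "".join(
            "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
            for row in body
        )
        return f"<table><tr>{head}</tr>{body_rows}</table>"
    raise ValueError(f"Unknown format: {fmt}")


table = [["Name", "Qty"], ["A&B", "2"]]
for fmt in ("text", "markdown", "html"):
    output = render_table(table, fmt)
    print(fmt, len(output), "characters")
    print(output)
```

The same content grows with each step from text to Markdown to HTML, which is the size cost noted above for markup-heavy formats.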

Common Use Cases

Research and Analysis:

Extract content from academic papers and research documents
Process legal documents for content analysis and search
Analyze business reports and financial documents

Content Management:

Convert legacy documents to searchable text formats
Build document search and retrieval systems
Migrate content from various formats to unified systems

Data Processing:

Extract tabular data from PDF reports
Process Excel spreadsheets for data analysis workflows
Convert document-based data to structured formats

Knowledge Management:

Build searchable knowledge bases from document collections
Create AI-powered document Q&A systems
Enable semantic search across diverse document types

The Document Extractor bridges static document formats and text processing workflows: it turns PDF, DOCX, and XLSX files into text that downstream nodes can split, embed, analyze, or summarize.
