Document Extractor

Extract text content from PDF, DOCX, and XLSX files with flexible output formatting.

What is Document Extractor?

The Document Extractor node converts documents into text that other nodes in your workflow can process. It supports PDF, DOCX (Word), and XLSX (Excel) files, detects the file type automatically, and preserves document structure and formatting where possible.

How to use it?

To extract content from your documents:

  1. Connect Your File:

    • Connect any file output from File Reader, File Writer, or other file-producing nodes
    • The node detects whether the file is a PDF, DOCX, or XLSX file, so you do not need to specify the file type manually
  2. Configure Extraction Options:

    For Excel Files (XLSX) - when detected:

    • Sheet Name: Specify a particular sheet to extract (optional)
    • Include Sheet Names: Add sheet names to the output for context
    • Include Metadata: Add file metadata information to the output

    For All File Types:

    • Choose output format based on how you plan to use the extracted content
  3. Select Output Format:

    • Extracted Text: Plain text content with basic formatting
    • Markdown: Formatted text that preserves document structure
    • HTML: Web-formatted output that maintains rich formatting
  4. Use Extracted Content:

    • Connect the output to text processing nodes like Document Splitter
    • Send to LLM nodes for analysis or summarization
    • Use with Text Embedder for semantic search applications
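
The automatic detection described in step 1 can be pictured with a short sketch. This is not the node's actual implementation, which is not documented here; it only illustrates one common approach, assuming detection is based on file contents. The function name detect_document_type is hypothetical. The sketch relies on two facts about the formats: PDF files start with the "%PDF-" signature, and DOCX and XLSX files are ZIP archives whose main part lives at word/document.xml and xl/workbook.xml respectively.

```python
import io
import zipfile


def detect_document_type(data: bytes) -> str:
    """Guess 'pdf', 'docx', or 'xlsx' from raw file bytes (hypothetical helper).

    Returns 'unknown' for anything else.
    """
    # PDF files begin with the "%PDF-" signature.
    if data.startswith(b"%PDF-"):
        return "pdf"

    # DOCX and XLSX are both ZIP containers; the path of the main
    # document part tells them apart.
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            names = set(archive.namelist())
        if "word/document.xml" in names:
            return "docx"
        if "xl/workbook.xml" in names:
            return "xlsx"

    return "unknown"
```

Inspecting contents rather than the file extension means a renamed or extension-less file would still be routed correctly, which fits the "no format selection needed" behavior described above.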

Example of usage

Objective: Process a batch of research papers in PDF format to create a searchable knowledge base.

Document Extraction Workflow:

  1. File Processing Setup:

    • Connect File Reader to read PDF files from storage
    • Connect the file output directly to Document Extractor (no format selection needed)
    • Set output format to Markdown to preserve document structure
  2. Extraction Configuration:

    • File Input: Connected from File Reader (automatic format detection)
    • Output Format: Markdown (to maintain headings and structure)
    • The node detects that the file is a PDF and processes it accordingly
  3. Post-Processing Chain:

    • Connect Document Splitter to break text into manageable chunks
    • Use Text Embedder to create vector representations
    • Store in Vector Store for semantic search capabilities
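
To make the post-processing chain more concrete, here is a minimal sketch of what a chunking step might do with the extracted text before embedding. It is an illustration only: the real Document Splitter node may split on sentences, headings, or tokens instead, and the function name split_text and its chunk_size and overlap parameters are assumptions, not documented settings.

```python
from typing import List


def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that overlap (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk reaches the end, so the final chunk is
        # never just a repeat of the previous chunk's overlap.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


chunks = split_text("abcdefghij", chunk_size=4, overlap=2)
# chunks == ["abcd", "cdef", "efgh", "ghij"]
```

The overlap keeps a little shared context between neighboring chunks, so a sentence cut at a chunk boundary is still searchable from either side.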

Additional Examples:

Excel Data Processing:

File Reader → Document Extractor → JSON Query → Data Analysis
  • Sheet Name: "Sales_Data" (specific sheet, when Excel is detected)
  • Include Sheet Names: true (for multi-sheet context)
  • Output Format: Extracted Text (parsed downstream by JSON Query)

Word Document Analysis:

File Reader → Document Extractor → LLM → Summary Output
  • Auto-Detection: Node automatically recognizes DOCX format
  • Output Format: Markdown (preserves formatting)
  • Use Case: Automated document summarization

Multi-Format Document Processing:

Multiple File Readers → Single Document Extractor → Document Splitter → Vector Store
  • Single extractor handles all file types (PDF, DOCX, XLSX) automatically
  • Unified text processing pipeline after extraction
  • Consistent chunking and embedding for all document types

Additional information

Supported File Formats

PDF Documents:

  • Text-based PDFs: Direct text extraction with high accuracy
  • Complex Layouts: Handles multi-column layouts and formatted documents
  • Embedded Content: Extracts readable text while preserving document flow

DOCX (Word) Documents:

  • Rich Text: Preserves formatting like bold, italic, and headers
  • Document Structure: Maintains paragraph breaks and section organization
  • Tables and Lists: Converts structured content to readable format

XLSX (Excel) Spreadsheets:

  • Multiple Sheets: Extract from all sheets or specify individual sheets
  • Data Types: Handles text, numbers, dates, and formulas
  • Cell Structure: Preserves row and column relationships in output
  • Sheet Organization: Optional sheet names for multi-sheet workbooks
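
The effect of the Sheet Name and Include Sheet Names options can be sketched as follows. This is a simplified model, not the node's real output format: it assumes a workbook has already been read into a dictionary of rows, that cells within a row are joined with tabs, and that a sheet name is emitted as a "Sheet: ..." line. The function render_workbook and its parameters are hypothetical names chosen to mirror the options above.

```python
from typing import Dict, List, Optional

Sheet = List[List[object]]  # rows of cell values


def render_workbook(
    sheets: Dict[str, Sheet],
    sheet_name: Optional[str] = None,
    include_sheet_names: bool = False,
) -> str:
    """Render workbook rows as tab-separated text (hypothetical helper)."""
    # Restrict to one sheet when a name is given, otherwise use all sheets.
    if sheet_name is not None:
        if sheet_name not in sheets:
            raise KeyError(f"Sheet not found: {sheet_name}")
        selected = {sheet_name: sheets[sheet_name]}
    else:
        selected = sheets

    blocks = []
    for name, rows in selected.items():
        # One line per row keeps the row and column relationships visible.
        lines = [
            "\t".join("" if cell is None else str(cell) for cell in row)
            for row in rows
        ]
        if include_sheet_names:
            lines.insert(0, f"Sheet: {name}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


workbook = {
    "Sales_Data": [["Region", "Total"], ["North", 120]],
    "Notes": [["draft", None]],
}
print(render_workbook(workbook, sheet_name="Sales_Data"))
print(render_workbook(workbook, include_sheet_names=True))
```

Including sheet names matters most when every sheet is extracted, since downstream nodes otherwise cannot tell which rows came from which sheet.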

Output Format Options

Extracted Text (Plain Text):

  • Clean, readable text without formatting markup
  • Suitable for basic text analysis and processing
  • Smallest output size and lowest processing overhead
  • Compatible with all text-processing nodes

Markdown Format:

  • Preserves document structure with headers and formatting
  • Maintains tables, lists, and emphasis markers
  • Ideal for documents that need structural preservation
  • Good balance between formatting and readability

HTML Format:

  • Rich formatting with full styling preservation
  • Maintains complex layouts and visual elements
  • Suitable for web display or rich text applications
  • Larger output size due to markup tags
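
The trade-off between the three output formats is easier to see on a tiny example. The sketch below renders the same two content blocks as plain text, Markdown, and HTML. It is only an illustration under simplifying assumptions: the node's real output will be richer, and the render function, the block structure, and the format names "text", "markdown", and "html" are hypothetical.

```python
import html
from typing import List, Tuple

Block = Tuple[str, str]  # (kind, text), where kind is "heading" or "paragraph"


def render(blocks: List[Block], output_format: str) -> str:
    """Render content blocks in one of three formats (hypothetical helper)."""
    if output_format == "text":
        # Plain text drops all structural markup.
        parts = [text for _, text in blocks]
    elif output_format == "markdown":
        # Markdown keeps structure with lightweight markers.
        parts = [f"# {text}" if kind == "heading" else text for kind, text in blocks]
    elif output_format == "html":
        # HTML wraps each block in tags and escapes special characters.
        parts = [
            f"<h1>{html.escape(text)}</h1>" if kind == "heading"
            else f"<p>{html.escape(text)}</p>"
            for kind, text in blocks
        ]
    else:
        raise ValueError(f"Unknown output format: {output_format}")
    return "\n\n".join(parts)


blocks = [("heading", "Results"), ("paragraph", "Growth was 5% & rising.")]
for fmt in ("text", "markdown", "html"):
    print(render(blocks, fmt))
```

Even in this small case the HTML result is the longest of the three because of its tags and escaped characters, while the plain text version loses the distinction between heading and paragraph.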

Common Use Cases

Research and Analysis:

  • Extract content from academic papers and research documents
  • Process legal documents for content analysis and search
  • Analyze business reports and financial documents

Content Management:

  • Convert legacy documents to searchable text formats
  • Build document search and retrieval systems
  • Migrate content from various formats to unified systems

Data Processing:

  • Extract tabular data from PDF reports
  • Process Excel spreadsheets for data analysis workflows
  • Convert document-based data to structured formats

Knowledge Management:

  • Build searchable knowledge bases from document collections
  • Create AI-powered document Q&A systems
  • Enable semantic search across diverse document types

The Document Extractor bridges static document formats and text processing workflows: once a PDF, DOCX, or XLSX file has been converted to plain text, Markdown, or HTML, downstream nodes can split, embed, search, or summarize its content.