aviasoletechnologies/DocMatch


DocMatch

License: MIT · Python 3.10+ · Streamlit App · Made by Aviasole

Overview

DocMatch is an intelligent document search and semantic matching tool that uses spaCy NLP and cosine similarity to find the most relevant sections in your documents. Supporting PDF, DOCX, and PPTX formats, it enables rapid information retrieval with natural language queries.

Key Features

  • 🔍 Semantic Search - Find relevant content using natural language queries
  • 📄 Multi-Format Support - PDF, DOCX, and PPTX files
  • 📊 Similarity Scoring - Confidence scores for every match
  • 💾 Search History - Cached results for repeated queries
  • Fast Processing - Optimized for large document sets
  • 🎯 PDF Highlighting - Visual results with highlighted text
  • 🖥️ Simple UI - Streamlit-based intuitive interface

Quick Start

1. Install Dependencies

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

# Download spaCy model (required)
python -m spacy download en_core_web_md

2. Configure Application

cp .env.example .env
# Edit .env if needed

3. Prepare Documents

# Create data directory
mkdir data

# Copy PDF, DOCX, or PPTX files into data folder
cp your_documents/* data/

4. Run Application

streamlit run app.py

Access at http://localhost:8501


Installation

Prerequisites

  • Python 3.10+
  • 4GB RAM minimum
  • 2GB disk space

Step-by-Step Setup

Windows

# Create virtual environment
python -m venv venv
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download spaCy language model
python -m spacy download en_core_web_md

# Run application
streamlit run app.py

macOS/Linux

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Download spaCy language model
python -m spacy download en_core_web_md

# Run application
streamlit run app.py

Docker Setup

# Build image
docker build -t docmatch .

# Run container
docker run -p 8501:8501 -v "$(pwd)/data:/app/data" docmatch

# Access at http://localhost:8501

Configuration

Environment Variables

# Application
APP_NAME=DocMatch
APP_TITLE=Document Search Tool
LOG_LEVEL=INFO

# Supported file formats
SUPPORTED_FORMATS=.pdf,.docx,.pptx

# Similarity threshold (0.0-1.0)
# Lower = more lenient matching
SIMILARITY_THRESHOLD=0.5

# spaCy language model
# Options: en_core_web_sm (small, ~12MB) - no static word vectors, weaker similarity
#          en_core_web_md (medium, 42MB) - RECOMMENDED
#          en_core_web_lg (large, 745MB) - Best accuracy
SPACY_MODEL=en_core_web_md

# Search History
ENABLE_HISTORY=true
HISTORY_FILE=history.json

# Performance
MAX_WORKERS=4
CACHE_EMBEDDINGS=true
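
A minimal sketch of how these variables could be read and validated in Python. It assumes the values are already present in the process environment (for example, loaded from .env by the application); the `Settings` class and `load_settings` function are hypothetical names for illustration, not part of DocMatch.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables documented above."""
    supported_formats: tuple[str, ...]
    similarity_threshold: float
    spacy_model: str
    enable_history: bool
    history_file: str
    max_workers: int
    cache_embeddings: bool


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env

    threshold = float(env.get("SIMILARITY_THRESHOLD", "0.5"))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("SIMILARITY_THRESHOLD must be within 0.0-1.0")

    max_workers = int(env.get("MAX_WORKERS", "4"))
    if max_workers < 1:
        raise ValueError("MAX_WORKERS must be at least 1")

    formats = env.get("SUPPORTED_FORMATS", ".pdf,.docx,.pptx")
    return Settings(
        supported_formats=tuple(
            f.strip().lower() for f in formats.split(",") if f.strip()
        ),
        similarity_threshold=threshold,
        spacy_model=env.get("SPACY_MODEL", "en_core_web_md"),
        enable_history=_as_bool(env.get("ENABLE_HISTORY", "true")),
        history_file=env.get("HISTORY_FILE", "history.json"),
        max_workers=max_workers,
        cache_embeddings=_as_bool(env.get("CACHE_EMBEDDINGS", "true")),
    )
```

Validating the threshold and worker count at startup surfaces a typo in .env immediately rather than during a search.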

Recommended Settings

Fast Performance (Less Accurate):

SPACY_MODEL=en_core_web_sm
SIMILARITY_THRESHOLD=0.6
MAX_WORKERS=8

Balanced (Recommended):

SPACY_MODEL=en_core_web_md
SIMILARITY_THRESHOLD=0.5
MAX_WORKERS=4

High Accuracy (Slower):

SPACY_MODEL=en_core_web_lg
SIMILARITY_THRESHOLD=0.4
MAX_WORKERS=2

Usage

Basic Workflow

1. Enter Directory Path

  • Enter absolute path to folder containing documents
  • Example: /home/user/documents or C:\Users\Documents\data
  • Supports nested folders (recursively searches subdirectories)
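
The recursive lookup described above can be sketched with the standard library. This is an illustration under assumptions: `discover_documents` is a hypothetical helper, and the actual discovery code in DocMatch may differ.

```python
from pathlib import Path

# Extensions taken from the SUPPORTED_FORMATS setting in this README.
SUPPORTED_FORMATS = frozenset({".pdf", ".docx", ".pptx"})


def discover_documents(root, formats=SUPPORTED_FORMATS):
    """Return supported files under root, including nested folders."""
    root_path = Path(root)
    if not root_path.is_dir():
        raise NotADirectoryError(f"Not a directory: {root}")
    # rglob("*") walks every subdirectory; suffix matching ignores case.
    return sorted(
        p for p in root_path.rglob("*")
        if p.is_file() and p.suffix.lower() in formats
    )
```

Calling `discover_documents("/home/user/documents")` would return a sorted list of `Path` objects, skipping files with other extensions.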

2. Enter Search Query

  • Type natural language question or search phrase
  • Examples:
    • "What are the main benefits?"
    • "How do I implement this feature?"
    • "List all safety procedures"

3. Configure Chunk Size

  • Adjust slider from 1 to 10
  • Chunk size = number of sentences grouped together
  • Larger chunks = better context, smaller = more specific results
  • Recommended: 3-5 for general documents
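
To make the chunk size concrete, here is a small sketch that groups sentences into chunks. It assumes non-overlapping chunks and uses a naive regex splitter as a stand-in for spaCy's sentence segmentation; both function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences, chunk_size):
    """Group consecutive sentences into chunks of chunk_size sentences."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        " ".join(sentences[i:i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]
```

With a chunk size of 2, five sentences yield three chunks, the last one holding the single leftover sentence.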

4. Execute Search

  • Click "Search" button
  • Wait for processing (typically 2-10 seconds)

5. Review Results

  • View similarity score (0.0 to 1.0, higher = better match)
  • Read most relevant text chunk
  • See matched file name
  • PDF results show highlighted match

Advanced Usage

Batch Processing

# Process multiple queries programmatically
from utils.cosineSimi import search

queries = [
    "What is the overview?",
    "How to get started?",
    "What are the features?"
]

for query in queries:
    result = search(dir="./data", query=query, chunk=3)
    print(f"Query: {result['query']}")
    print(f"Score: {result['similarity_score']}")

Custom Similarity Threshold

Adjust threshold in .env to filter results:

  • < 0.3: Very lenient (many false positives)
  • 0.3-0.5: Lenient (broad matches)
  • 0.5-0.7: Balanced (recommended)
  • > 0.7: Strict (only very close matches)
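
As a rough illustration of how a threshold filters results, the sketch below keeps only entries at or above the cutoff. The result keys mirror the search history format shown in this README; `filter_by_threshold` is a hypothetical helper, and whether DocMatch treats the boundary as inclusive is an assumption.

```python
def filter_by_threshold(results, threshold=0.5):
    """Keep results scoring at or above threshold, best match first."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be within 0.0-1.0")
    kept = [r for r in results if r["similarity_score"] >= threshold]
    return sorted(kept, key=lambda r: r["similarity_score"], reverse=True)
```

Raising the threshold from 0.5 to 0.7 on the same result list can only shrink it, which is why stricter settings may return nothing for short or vague queries.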

Architecture

┌──────────────────────────┐
│   Streamlit UI (app.py)  │
│  - Input: directory path │
│  - Input: search query   │
│  - Input: chunk size     │
└────────────┬─────────────┘
             │
    ┌────────▼────────┐
    │ File Discovery  │
    │ (.pdf, .docx,   │
    │  .pptx support) │
    └────────┬────────┘
             │
    ┌────────▼──────────────┐
    │ Format Conversion     │
    │ (DOCX→PDF, PPTX→PDF)  │
    └────────┬──────────────┘
             │
    ┌────────▼──────────────┐
    │ Text Extraction       │
    │ (PyPDF2, python-docx) │
    └────────┬──────────────┘
             │
    ┌────────▼──────────────┐
    │ Text Chunking         │
    │ (Divide into chunks)  │
    └────────┬──────────────┘
             │
    ┌────────▼──────────────┐
    │ spaCy NLP Processing  │
    │ - Vectorize text      │
    │ - Compute embeddings  │
    └────────┬──────────────┘
             │
    ┌────────▼──────────────┐
    │ Similarity Matching   │
    │ (Cosine Similarity)   │
    └────────┬──────────────┘
             │
    ┌────────▼──────────────┐
    │ Results               │
    │ - Best match          │
    │ - Score               │
    │ - Highlighted PDF     │
    └──────────────────────┘
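
The last two stages of the pipeline (vectorize, then rank by cosine similarity) can be sketched without any dependencies. Note the assumption: bag-of-words token counts stand in for spaCy's vectors here, so the scores will not match what DocMatch produces; only the ranking logic is illustrated. All function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for spaCy vectors: lowercase bag-of-words token counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def best_match(query, chunks):
    """Return (score, chunk) for the chunk most similar to the query."""
    if not chunks:
        raise ValueError("no chunks to search")
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(c)), c) for c in chunks]
    return max(scored, key=lambda pair: pair[0])
```

Swapping `embed` for a function that returns dense model vectors leaves `best_match` unchanged apart from the dot product and norms being computed over arrays.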

File Structure

├── app.py                      # Main Streamlit application
├── requirements.txt            # Python dependencies
├── .env.example               # Configuration template
├── README.md                  # This file
├── history.json               # Search history cache
├── data/                      # Your documents folder
│   ├── document1.pdf
│   ├── document2.docx
│   └── document3.pptx
└── utils/
    ├── cosineSimi.py          # Similarity search logic
    ├── highlight_pdf.py       # PDF highlighting
    └── json_writer.py         # History management

Troubleshooting

spaCy Model Not Found

Error: Can't find model 'en_core_web_md'

Solution:

python -m spacy download en_core_web_md
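
Because spaCy models installed this way are regular Python packages, their presence can be checked before the app starts. The following is a sketch; `model_is_installed` and `download_hint` are hypothetical helpers, not DocMatch functions.

```python
from importlib.util import find_spec


def model_is_installed(model_name="en_core_web_md"):
    """True if a top-level package with this name can be imported."""
    return find_spec(model_name) is not None


def download_hint(model_name="en_core_web_md"):
    """Command to suggest when the model package is missing."""
    return f"python -m spacy download {model_name}"
```

A startup check could print `download_hint()` when `model_is_installed()` returns False, which is friendlier than the raw error.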

LibreOffice Not Installed (DOCX/PPTX Conversion)

Error: libreoffice not found or installed

Solution:

# Ubuntu/Debian
sudo apt-get install libreoffice

# macOS
brew install libreoffice

# Windows
# Download from: https://www.libreoffice.org/download/
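
To confirm that LibreOffice is reachable from Python, a check along these lines may help. It is a sketch: the helper names are hypothetical, and it is an assumption that DocMatch invokes LibreOffice through its command line. The `--headless`, `--convert-to`, and `--outdir` options are standard LibreOffice flags.

```python
import shutil


def find_libreoffice():
    """Return the path of the LibreOffice executable on PATH, or None."""
    for name in ("soffice", "libreoffice"):
        path = shutil.which(name)
        if path:
            return path
    return None


def build_convert_command(executable, source, outdir):
    """Argument list that converts one document to PDF without a GUI."""
    return [
        executable,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(outdir),
        str(source),
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)`. If `find_libreoffice()` returns None after installation, the install directory is probably missing from PATH, which is common on Windows and macOS.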

PDF Highlighting Issues

Error: Could not highlight PDF

Solution:

  1. Ensure PDF is not corrupted
  2. Check write permissions in output directory
  3. Try with different PDF file

Memory Issues

Error: Out of memory

Solution:

  1. Process smaller document sets
  2. Use smaller spaCy model: en_core_web_sm
  3. Reduce MAX_WORKERS in .env
  4. Split large files before processing

Slow Processing

Takes > 30 seconds per search

Solution:

  1. Use faster model: en_core_web_sm
  2. Increase MAX_WORKERS in .env
  3. Reduce number of documents
  4. Use SSD instead of HDD

Performance Tips

  1. Model Selection

    • sm (small): ~12MB, 5-10 sec/search (no static word vectors, so similarity scores are less reliable)
    • md (medium): 42MB, 10-20 sec/search ⭐ Recommended
    • lg (large): 745MB, 20-40 sec/search
  2. Chunk Size Optimization

    • Smaller chunks: more specific, less context
    • Larger chunks: better context, less specificity
    • Ideal range: 3-5 sentences
  3. Document Organization

    • Keep document sets < 100 files for speed
    • Organize by category
    • Remove unnecessary files
  4. Hardware

    • SSD significantly faster than HDD
    • 8GB RAM sufficient for most use cases
    • Multi-core CPU helps (MAX_WORKERS setting)
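
A sketch of how a worker count such as MAX_WORKERS might be applied to per-file work. This assumes a thread pool, which mainly helps with I/O-bound steps such as reading files; CPU-bound steps in pure Python may benefit more from a process pool. `process_in_parallel` is a hypothetical helper, and how DocMatch actually uses MAX_WORKERS is not documented here.

```python
from concurrent.futures import ThreadPoolExecutor


def process_in_parallel(items, worker, max_workers=4):
    """Apply worker to each item concurrently, preserving input order."""
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return list(pool.map(worker, items))
```

For example, `process_in_parallel(paths, extract_text, max_workers=4)` would run a text-extraction function (hypothetical here) over several files at once.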

Advanced Features

Search History Caching

Repeated queries are cached for instant results:

{
  "search_history": [
    {
      "query": "benefits",
      "file_name": "guide.pdf",
      "most_similar_chunk": "...",
      "similarity_score": 0.89
    }
  ]
}
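
Reading and querying a history file in this format could look like the sketch below. The helper names are hypothetical and the real logic in utils/json_writer.py may differ. The lookup keys on the query text alone, which is an assumption; a cache that also keys on directory and chunk size would avoid stale hits.

```python
import json
from pathlib import Path


def load_history(path="history.json"):
    """Load the history file, or return an empty history if it is absent."""
    history_path = Path(path)
    if not history_path.exists():
        return {"search_history": []}
    with history_path.open(encoding="utf-8") as fh:
        return json.load(fh)


def find_cached(history, query):
    """Return the first stored entry for this exact query, or None."""
    for entry in history.get("search_history", []):
        if entry.get("query") == query:
            return entry
    return None


def save_history(history, path="history.json"):
    """Write the history back to disk as indented JSON."""
    with Path(path).open("w", encoding="utf-8") as fh:
        json.dump(history, fh, indent=2, ensure_ascii=False)
```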

Custom PDF Highlighting

Highlight custom text in PDFs:

from utils.highlight_pdf import highlight_pdf_main

output_path = highlight_pdf_main(
    path="document.pdf",
    text="search phrase"
)

Contributing

# Setup development
pip install pytest black flake8

# Format code
black .

# Check quality
flake8 .

# Run tests
pytest tests/

Security

  • 🔐 Search queries are logged locally only
  • 🔐 No data sent to external servers
  • 🔐 All processing happens on local machine
  • 🔐 History file can be cleared: rm history.json

License

MIT License - See LICENSE file


FAQ

Q: What file sizes are supported?
A: Up to 500MB per file; split larger files.

Q: Does it support OCR?
A: Currently text-based only; use OCR tools first for scanned PDFs.

Q: How accurate is the matching?
A: 85-95% accurate with proper chunk size tuning.

Q: Can I use this offline?
A: Yes, once dependencies and the spaCy model are installed.

Q: How do I improve search results?
A: Adjust chunk size and similarity threshold; use longer, more specific queries.

Q: Is there a batch API?
A: Not as a dedicated API yet; it is planned for a future release. In the meantime, queries can be looped over programmatically (see Batch Processing).


Performance Benchmarks

Document Set | Files | Size  | Model | Avg Search Time
-------------|-------|-------|-------|----------------
Small        | 10    | 50MB  | sm    | 2 sec
Medium       | 50    | 250MB | md    | 8 sec
Large        | 100   | 500MB | lg    | 20 sec
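
Search times depend on hardware and document content, so it can be worth measuring them locally. The sketch below times any search callable; `average_search_time` is a hypothetical helper, and `search_fn` is assumed to accept a single query string.

```python
import time


def average_search_time(search_fn, queries, repeats=3):
    """Average wall-clock seconds per call of search_fn over all queries."""
    if not queries or repeats < 1:
        raise ValueError("need at least one query and one repeat")
    timings = []
    for query in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            search_fn(query)
            timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```

Since repeated queries may be served from the history cache, disabling history (ENABLE_HISTORY=false) before benchmarking gives more representative numbers.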

About Aviasole

Aviasole is an AI development company specializing in cutting-edge artificial intelligence solutions. We create innovative POCs and production-ready applications that demonstrate the power and potential of modern AI technologies.

DocMatch is one of our showcase projects, demonstrating semantic search and intelligent document matching using NLP and cosine similarity.

Learn more: https://aviasole.com


Company

Aviasole - AI Development & Innovation
Website: https://aviasole.com

This project is proudly developed and maintained by Aviasole, a leading AI development company focused on creating innovative AI solutions for document processing and semantic search.

For more AI projects and solutions, visit aviasole.com


Last Updated: March 2026
Documentation Version: 1.0
