DocMatch

Overview

DocMatch is an intelligent document search and semantic matching tool that uses spaCy NLP and cosine similarity to find the most relevant sections in your documents. Supporting PDF, DOCX, and PPTX formats, it enables rapid information retrieval with natural language queries.

Key Features

🔍 Semantic Search - Find relevant content using natural language queries
📄 Multi-Format Support - PDF, DOCX, and PPTX files
📊 Similarity Scoring - Confidence scores for every match
💾 Search History - Cached results for repeated queries
⚡ Fast Processing - Optimized for large document sets
🎯 PDF Highlighting - Visual results with highlighted text
🖥️ Simple UI - Streamlit-based intuitive interface

Quick Start

1. Install Dependencies

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

# Download spaCy model (required)
python -m spacy download en_core_web_md

2. Configure Application

cp .env.example .env
# Edit .env if needed

3. Prepare Documents

# Create data directory
mkdir data

# Copy PDF, DOCX, or PPTX files into data folder
cp your_documents/* data/

4. Run Application

streamlit run app.py

Access at http://localhost:8501

Installation

Prerequisites

Python 3.10+
4GB RAM minimum
2GB disk space

Step-by-Step Setup

Windows

# Create virtual environment
python -m venv venv
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download spaCy language model
python -m spacy download en_core_web_md

# Run application
streamlit run app.py

macOS/Linux

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Download spaCy language model
python -m spacy download en_core_web_md

# Run application
streamlit run app.py

Docker Setup

# Build image
docker build -t docmatch .

# Run container
docker run -p 8501:8501 -v "$(pwd)/data:/app/data" docmatch

# Access at http://localhost:8501

Configuration

Environment Variables

# Application
APP_NAME=DocMatch
APP_TITLE=Document Search Tool
LOG_LEVEL=INFO

# Supported file formats
SUPPORTED_FORMATS=.pdf,.docx,.pptx

# Similarity threshold (0.0-1.0)
# Lower = more lenient matching
SIMILARITY_THRESHOLD=0.5

# spaCy language model
# Options: en_core_web_sm (small, 40MB)
#          en_core_web_md (medium, 42MB) - RECOMMENDED
#          en_core_web_lg (large, 745MB) - Best accuracy
SPACY_MODEL=en_core_web_md

# Search History
ENABLE_HISTORY=true
HISTORY_FILE=history.json

# Performance
MAX_WORKERS=4
CACHE_EMBEDDINGS=true
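
As a rough illustration of how these variables might be consumed, the sketch below reads them from a mapping (the process environment by default) and falls back to the defaults listed above. The `Settings` dataclass and `load_settings` function are hypothetical names used only for this example; the README does not show how DocMatch itself loads its configuration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the .env values shown above."""
    supported_formats: tuple
    similarity_threshold: float
    spacy_model: str
    enable_history: bool
    history_file: str
    max_workers: int


def load_settings(env=None):
    """Read settings from a mapping; defaults to os.environ when env is None."""
    env = os.environ if env is None else env
    threshold = float(env.get("SIMILARITY_THRESHOLD", "0.5"))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("SIMILARITY_THRESHOLD must be between 0.0 and 1.0")
    # Split the comma-separated extension list and normalise case.
    formats = tuple(
        ext.strip().lower()
        for ext in env.get("SUPPORTED_FORMATS", ".pdf,.docx,.pptx").split(",")
        if ext.strip()
    )
    return Settings(
        supported_formats=formats,
        similarity_threshold=threshold,
        spacy_model=env.get("SPACY_MODEL", "en_core_web_md"),
        enable_history=env.get("ENABLE_HISTORY", "true").strip().lower() == "true",
        history_file=env.get("HISTORY_FILE", "history.json"),
        max_workers=int(env.get("MAX_WORKERS", "4")),
    )
```

Validating the threshold range at load time means a mistyped value fails at startup rather than silently filtering out every result.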

Recommended Settings

Fast Performance (Less Accurate):

SPACY_MODEL=en_core_web_sm
SIMILARITY_THRESHOLD=0.6
MAX_WORKERS=8

Balanced (Recommended):

SPACY_MODEL=en_core_web_md
SIMILARITY_THRESHOLD=0.5
MAX_WORKERS=4

High Accuracy (Slower):

SPACY_MODEL=en_core_web_lg
SIMILARITY_THRESHOLD=0.4
MAX_WORKERS=2

Usage

Basic Workflow

1. Enter Directory Path

Enter absolute path to folder containing documents
Example: /home/user/documents or C:\Users\Documents\data
Supports nested folders (recursively searches subdirectories)
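
The recursive lookup described above can be pictured with a small sketch. It assumes the three extensions from SUPPORTED_FORMATS and uses `pathlib`; the `find_documents` helper is a hypothetical name, and DocMatch's own discovery code may work differently.

```python
from pathlib import Path

# Extensions taken from the SUPPORTED_FORMATS setting in this README.
SUPPORTED_FORMATS = {".pdf", ".docx", ".pptx"}


def find_documents(root):
    """Return supported files under root, including subdirectories, sorted by path."""
    root_path = Path(root)
    if not root_path.is_dir():
        raise NotADirectoryError(f"Not a directory: {root}")
    return sorted(
        p for p in root_path.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_FORMATS
    )
```

Comparing the lower-cased suffix means files such as `REPORT.PDF` are picked up as well.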

2. Enter Search Query

Type natural language question or search phrase
Examples:
- "What are the main benefits?"
- "How do I implement this feature?"
- "List all safety procedures"

3. Configure Chunk Size

Adjust slider from 1 to 10
Chunk size = number of sentences grouped together
Larger chunks = better context, smaller = more specific results
Recommended: 3-5 for general documents
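
To make the effect of the slider concrete, here is a minimal sketch of sentence chunking. It assumes non-overlapping groups of consecutive sentences; `chunk_sentences` is a hypothetical helper, and DocMatch may split or overlap chunks differently.

```python
def chunk_sentences(sentences, chunk_size):
    """Group consecutive sentences into chunks of chunk_size sentences each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        " ".join(sentences[i:i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]


sentences = [
    "Alpha is first.", "Beta is second.", "Gamma is third.",
    "Delta is fourth.", "Epsilon is fifth.",
]
# With chunk_size=2 the final chunk holds the single leftover sentence.
chunks = chunk_sentences(sentences, 2)
```

A chunk size of 1 scores every sentence on its own, while a larger value scores each group as one unit, which is why larger chunks carry more context but pinpoint less precisely.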

4. Execute Search

Click "Search" button
Wait for processing (typically 2-10 seconds)

5. Review Results

View similarity score (0.0 to 1.0, higher = better match)
Read most relevant text chunk
See matched file name
PDF results show highlighted match
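
The similarity score is what SIMILARITY_THRESHOLD in .env is compared against. The sketch below shows one way scored chunks could be filtered and ranked; the `filter_matches` helper and the inclusive comparison are assumptions made for illustration, not DocMatch's confirmed behavior.

```python
def filter_matches(matches, threshold=0.5):
    """Keep (chunk, score) pairs scoring at or above threshold, best first."""
    kept = [m for m in matches if m[1] >= threshold]
    return sorted(kept, key=lambda m: m[1], reverse=True)


# Toy scores standing in for real similarity results.
matches = [("intro", 0.42), ("safety rules", 0.81), ("appendix", 0.55)]
balanced = filter_matches(matches, threshold=0.5)
strict = filter_matches(matches, threshold=0.9)
```

Raising the threshold can leave no results at all, so an empty list is a normal outcome rather than an error.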

Advanced Usage

Batch Processing

# Process multiple queries programmatically
from utils.cosineSimi import search

queries = [
    "What is the overview?",
    "How to get started?",
    "What are the features?"
]

for query in queries:
    result = search(dir="./data", query=query, chunk=3)
    print(f"Query: {result['query']}")
    print(f"Score: {result['similarity_score']}")

Custom Similarity Threshold

Adjust threshold in .env to filter results:

< 0.3: Very lenient (many false positives)
0.3-0.5: Lenient (broad matches)
0.5-0.7: Balanced (recommended)
> 0.7: Strict (only very close matches)

Architecture

┌──────────────────────────┐
│   Streamlit UI (app.py)  │
│  - Input: directory path │
│  - Input: search query   │
│  - Input: chunk size     │
└────────────┬─────────────┘
             │
    ┌────────▼────────┐
    │ File Discovery  │
    │ (.pdf, .docx,   │
    │  .pptx support) │
    └────────┬────────┘
             │
    ┌────────▼──────────────┐
    │ Format Conversion     │
    │ (DOCX→PDF, PPTX→PDF) │
    └────────┬──────────────┘
             │
    ┌────────▼──────────────┐
    │ Text Extraction       │
    │ (PyPDF2, python-docx) │
    └────────┬──────────────┘
             │
    ┌────────▼──────────────┐
    │ Text Chunking         │
    │ (Divide into chunks)  │
    └────────┬──────────────┘
             │
    ┌────────▼──────────────┐
    │ spaCy NLP Processing  │
    │ - Vectorize text      │
    │ - Compute embeddings  │
    └────────┬──────────────┘
             │
    ┌────────▼──────────────┐
    │ Similarity Matching   │
    │ (Cosine Similarity)   │
    └────────┬──────────────┘
             │
    ┌────────▼──────────────┐
    │ Results               │
    │ - Best match          │
    │ - Score               │
    │ - Highlighted PDF     │
    └──────────────────────┘
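
The similarity matching stage in the diagram can be sketched in a few lines. This is a plain-Python illustration using toy vectors; in DocMatch the vectors come from spaCy, and the `cosine_similarity` and `best_match` helpers here are hypothetical names rather than functions from `utils/cosineSimi.py`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query_vec, chunk_vecs):
    """Return (index, score) of the chunk vector most similar to the query."""
    if not chunk_vecs:
        raise ValueError("no chunks to compare against")
    scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

Cosine similarity depends only on the direction of the vectors, not their length, so a short query can still match a long chunk about the same topic.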

File Structure

├── app.py                      # Main Streamlit application
├── requirements.txt            # Python dependencies
├── .env.example               # Configuration template
├── README.md                  # This file
├── history.json               # Search history cache
├── data/                      # Your documents folder
│   ├── document1.pdf
│   ├── document2.docx
│   └── document3.pptx
└── utils/
    ├── cosineSimi.py          # Similarity search logic
    ├── highlight_pdf.py       # PDF highlighting
    └── json_writer.py         # History management

Troubleshooting

spaCy Model Not Found

Error: Can't find model 'en_core_web_md'

Solution:

python -m spacy download en_core_web_md

LibreOffice Not Installed (DOCX/PPTX Conversion)

Error: libreoffice not found or installed

Solution:

# Ubuntu/Debian
sudo apt-get install libreoffice

# macOS
brew install libreoffice

# Windows
# Download from: https://www.libreoffice.org/download/
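
To check from Python whether LibreOffice is reachable before converting, something like the sketch below may help. It looks for the `soffice` or `libreoffice` executable on PATH and builds a headless PDF conversion command; both helper names are hypothetical, and the README does not show how DocMatch itself invokes LibreOffice.

```python
import shutil


def find_libreoffice():
    """Return the path of a LibreOffice executable found on PATH, or None."""
    for name in ("soffice", "libreoffice"):
        path = shutil.which(name)
        if path:
            return path
    return None


def build_convert_command(executable, source_file, output_dir):
    """Build a headless LibreOffice command that converts source_file to PDF."""
    return [
        executable, "--headless",
        "--convert-to", "pdf",
        "--outdir", str(output_dir),
        str(source_file),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. If `find_libreoffice()` returns None, install LibreOffice as described above or add its program folder to PATH.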

PDF Highlighting Issues

Error: Could not highlight PDF

Solution:

Ensure PDF is not corrupted
Check write permissions in output directory
Try with different PDF file

Memory Issues

Error: Out of memory

Solution:

Process smaller document sets
Use smaller spaCy model: en_core_web_sm
Reduce MAX_WORKERS in .env
Split large files before processing

Slow Processing

Takes > 30 seconds per search

Solution:

Use faster model: en_core_web_sm
Increase MAX_WORKERS in .env
Reduce number of documents
Use SSD instead of HDD

Performance Tips

Model Selection

- sm (small): 40MB, 5-10 sec/search
- md (medium): 42MB, 10-20 sec/search ⭐ Recommended
- lg (large): 745MB, 20-40 sec/search

Chunk Size Optimization

- Smaller chunks: more specific, less context
- Larger chunks: better context, less specificity
- Ideal range: 3-5 sentences

Document Organization

- Keep document sets < 100 files for speed
- Organize by category
- Remove unnecessary files

Hardware

- SSD significantly faster than HDD
- 8GB RAM sufficient for most use cases
- Multi-core CPU helps (MAX_WORKERS setting)
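
As a hedged illustration of what a worker-count setting such as MAX_WORKERS typically controls, the sketch below fans work out over a thread pool. The README does not say whether DocMatch uses threads or processes, so the `process_in_parallel` helper is purely an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def process_in_parallel(items, worker, max_workers=4):
    """Apply worker to every item using a thread pool; results keep input order."""
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, items))


# Toy workload: in practice the worker would extract text from one file.
lengths = process_in_parallel(["a", "bb", "ccc"], len, max_workers=2)
```

Threads mainly help when the work is I/O-bound, such as reading files from disk. More workers also means more documents held in memory at once, which is why lowering the setting is suggested under Memory Issues.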

Advanced Features

Search History Caching

Repeated queries are cached for instant results:

{
  "search_history": [
    {
      "query": "benefits",
      "file_name": "guide.pdf",
      "most_similar_chunk": "...",
      "similarity_score": 0.89
    }
  ]
}
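
Given the structure above, a cache lookup could be sketched as follows. The `find_cached_result` helper and the case-insensitive comparison are assumptions for illustration; the actual logic in `utils/json_writer.py` is not shown in this README and may also key on the directory or chunk size.

```python
import json


def find_cached_result(history, query):
    """Return the cached entry whose query matches (case-insensitive), or None."""
    wanted = query.strip().lower()
    for entry in history.get("search_history", []):
        if entry.get("query", "").strip().lower() == wanted:
            return entry
    return None


# Same shape as the history.json example above.
history = json.loads("""
{
  "search_history": [
    {"query": "benefits", "file_name": "guide.pdf",
     "most_similar_chunk": "...", "similarity_score": 0.89}
  ]
}
""")
hit = find_cached_result(history, " Benefits ")
miss = find_cached_result(history, "pricing")
```

A cached entry reflects the documents as they were when the query first ran, so clearing history.json after changing the document set avoids stale results.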

Custom PDF Highlighting

Highlight custom text in PDFs:

from utils.highlight_pdf import highlight_pdf_main

output_path = highlight_pdf_main(
    path="document.pdf",
    text="search phrase"
)

Contributing

# Setup development
pip install pytest black flake8

# Format code
black .

# Check quality
flake8 .

# Run tests
pytest tests/

Security

🔐 Search queries are logged locally only
🔐 No data sent to external servers
🔐 All processing happens on local machine
🔐 History file can be cleared: rm history.json

License

MIT License - See LICENSE file

FAQ

Q: What file sizes are supported? A: Up to 500MB per file; split larger files.

Q: Does it support OCR? A: Currently text-based only; use OCR tools first for scanned PDFs.

Q: How accurate is the matching? A: 85-95% accurate with proper chunk size tuning.

Q: Can I use this offline? A: Yes, once the dependencies and the spaCy model are installed.

Q: How do I improve search results? A: Adjust chunk size and similarity threshold; use longer, more specific queries.

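Q: Is there a batch API? A: Not a dedicated one yet; it is planned for a future release. In the meantime, queries can be scripted in a loop as shown under Batch Processing.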

Performance Benchmarks

Document Set   Files   Size    Model   Avg Search Time
Small          10      50MB    sm      2 sec
Medium         50      250MB   md      8 sec
Large          100     500MB   lg      20 sec

About Aviasole

Aviasole is an AI development company specializing in cutting-edge artificial intelligence solutions. We create innovative POCs and production-ready applications that demonstrate the power and potential of modern AI technologies.

DocMatch is one of our showcase projects demonstrating semantic search and intelligent document matching using NLP and cosine similarity matching.

Learn more: https://aviasole.com

Company

Aviasole - AI Development & Innovation
Website: https://aviasole.com

This project is proudly developed and maintained by Aviasole, a leading AI development company focused on creating innovative AI solutions for document processing and semantic search.

For more AI projects and solutions, visit aviasole.com

Last Updated: March 2026
Documentation Version: 1.0
