DocMatch is an intelligent document search and semantic matching tool that uses spaCy NLP and cosine similarity to find the most relevant sections in your documents. Supporting PDF, DOCX, and PPTX formats, it enables rapid information retrieval with natural language queries.
- 🔍 Semantic Search - Find relevant content using natural language queries
- 📄 Multi-Format Support - PDF, DOCX, and PPTX files
- 📊 Similarity Scoring - Confidence scores for every match
- 💾 Search History - Cached results for repeated queries
- ⚡ Fast Processing - Optimized for large document sets
- 🎯 PDF Highlighting - Visual results with highlighted text
- 🖥️ Simple UI - Streamlit-based intuitive interface
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
# Download spaCy model (required)
python -m spacy download en_core_web_md
cp .env.example .env
# Edit .env if needed
# Create data directory
mkdir data
# Copy PDF, DOCX, or PPTX files into data folder
cp your_documents/* data/
streamlit run app.py
Access at http://localhost:8501
- Python 3.10+
- 4GB RAM minimum
- 2GB disk space
# Create virtual environment (Windows)
python -m venv venv
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Download spaCy language model
python -m spacy download en_core_web_md
# Run application
streamlit run app.py
# Create virtual environment (Linux/macOS)
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Download spaCy language model
python -m spacy download en_core_web_md
# Run application
streamlit run app.py
# Build image
docker build -t docmatch .
# Run container
docker run -p 8501:8501 -v $(pwd)/data:/app/data docmatch
# Access at http://localhost:8501
# Application
APP_NAME=DocMatch
APP_TITLE=Document Search Tool
LOG_LEVEL=INFO
# Supported file formats
SUPPORTED_FORMATS=.pdf,.docx,.pptx
# Similarity threshold (0.0-1.0)
# Lower = more lenient matching
SIMILARITY_THRESHOLD=0.5
# spaCy language model
# Options:
#   en_core_web_sm (small, 40MB)
#   en_core_web_md (medium, 42MB) - RECOMMENDED
#   en_core_web_lg (large, 745MB) - Best accuracy
SPACY_MODEL=en_core_web_md
# Search History
ENABLE_HISTORY=true
HISTORY_FILE=history.json
# Performance
MAX_WORKERS=4
CACHE_EMBEDDINGS=true
Fast Performance (Less Accurate):
SPACY_MODEL=en_core_web_sm
SIMILARITY_THRESHOLD=0.6
MAX_WORKERS=8
Balanced (Recommended):
SPACY_MODEL=en_core_web_md
SIMILARITY_THRESHOLD=0.5
MAX_WORKERS=4
High Accuracy (Slower):
SPACY_MODEL=en_core_web_lg
SIMILARITY_THRESHOLD=0.4
MAX_WORKERS=2
- Enter absolute path to folder containing documents
- Example: /home/user/documents or C:\Users\Documents\data
- Supports nested folders (recursively searches subdirectories)
- Type natural language question or search phrase
- Examples:
  - "What are the main benefits?"
  - "How do I implement this feature?"
  - "List all safety procedures"
- Adjust slider from 1 to 10
- Chunk size = number of sentences grouped together
- Larger chunks = better context, smaller = more specific results
- Recommended: 3-5 for general documents
- Click "Search" button
- Wait for processing (typically 2-10 seconds)
- View similarity score (0.0 to 1.0, higher = better match)
- Read most relevant text chunk
- See matched file name
- PDF results show highlighted match
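The folder and chunk-size steps above can be sketched in a few lines of Python. This is a minimal illustration, not DocMatch's actual code: `discover_documents`, `split_sentences`, and `chunk_sentences` are hypothetical helper names, the regex sentence splitter is a stand-in for whatever sentence segmentation the tool really uses, and non-overlapping chunks are an assumption.

```python
import re
from pathlib import Path

# Assumed default, mirroring SUPPORTED_FORMATS in .env.example.
SUPPORTED_FORMATS = (".pdf", ".docx", ".pptx")


def discover_documents(root, formats=SUPPORTED_FORMATS):
    """Return supported files under root, including nested folders."""
    root_path = Path(root).expanduser().resolve()
    if not root_path.is_dir():
        raise NotADirectoryError(f"Not a folder: {root_path}")
    wanted = {ext.lower() for ext in formats}
    # rglob("*") walks every subdirectory recursively.
    return sorted(
        path
        for path in root_path.rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def chunk_sentences(sentences, chunk_size):
    """Group consecutive sentences into non-overlapping chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        " ".join(sentences[i:i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]


sample = (
    "DocMatch reads files. It splits them. It scores chunks. "
    "It shows results. Done."
)
sentences = split_sentences(sample)
small_chunks = chunk_sentences(sentences, 2)  # more specific, less context
one_chunk = chunk_sentences(sentences, 5)     # more context, less specific
```

With a chunk size of 2 the five sample sentences form three chunks, while a chunk size of 5 puts everything into a single chunk. That is the trade-off the slider controls.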
# Process multiple queries programmatically
from utils.cosineSimi import search

queries = [
    "What is the overview?",
    "How to get started?",
    "What are the features?",
]
for query in queries:
    result = search(dir="./data", query=query, chunk=3)
    print(f"Query: {result['query']}")
    print(f"Score: {result['similarity_score']}")
Adjust threshold in .env to filter results:
- < 0.3: Very lenient (many false positives)
- 0.3-0.5: Lenient (broad matches)
- 0.5-0.7: Balanced (recommended)
- > 0.7: Strict (only near-exact matches)
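To make the threshold bands concrete, here is a small sketch of how a threshold read from the environment could filter scored matches. `load_threshold` and `filter_matches` are hypothetical names, the sample scores are made up for illustration, and how DocMatch applies SIMILARITY_THRESHOLD internally is not shown in this README.

```python
import os


def load_threshold(default=0.5):
    """Read SIMILARITY_THRESHOLD from the environment, clamped to 0.0-1.0."""
    raw = os.environ.get("SIMILARITY_THRESHOLD", str(default))
    try:
        value = float(raw)
    except ValueError:
        value = default  # fall back when the setting is not a number
    return min(1.0, max(0.0, value))


def filter_matches(matches, threshold):
    """Keep matches at or above the threshold, best score first."""
    kept = [m for m in matches if m["similarity_score"] >= threshold]
    return sorted(kept, key=lambda m: m["similarity_score"], reverse=True)


# Made-up scores, for illustration only.
matches = [
    {"file_name": "guide.pdf", "similarity_score": 0.89},
    {"file_name": "notes.docx", "similarity_score": 0.42},
    {"file_name": "slides.pptx", "similarity_score": 0.61},
]
strict = filter_matches(matches, 0.7)    # keeps guide.pdf only
balanced = filter_matches(matches, 0.5)  # keeps guide.pdf and slides.pptx
```

Lowering the threshold to 0.4 would also admit notes.docx, which is the "lenient" behavior described above.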
┌──────────────────────────┐
│ Streamlit UI (app.py)    │
│ - Input: directory path  │
│ - Input: search query    │
│ - Input: chunk size      │
└────────────┬─────────────┘
             │
    ┌────────▼────────┐
    │ File Discovery  │
    │ (.pdf, .docx,   │
    │ .pptx support)  │
    └────────┬────────┘
             │
    ┌────────▼──────────────┐
    │ Format Conversion     │
    │ (DOCX→PDF, PPTX→PDF)  │
    └────────┬──────────────┘
             │
    ┌────────▼──────────────┐
    │ Text Extraction       │
    │ (PyPDF2, python-docx) │
    └────────┬──────────────┘
             │
    ┌────────▼──────────────┐
    │ Text Chunking         │
    │ (Divide into chunks)  │
    └────────┬──────────────┘
             │
    ┌────────▼──────────────┐
    │ spaCy NLP Processing  │
    │ - Vectorize text      │
    │ - Compute embeddings  │
    └────────┬──────────────┘
             │
    ┌────────▼──────────────┐
    │ Similarity Matching   │
    │ (Cosine Similarity)   │
    └────────┬──────────────┘
             │
    ┌────────▼──────────────┐
    │ Results               │
    │ - Best match          │
    │ - Score               │
    │ - Highlighted PDF     │
    └───────────────────────┘
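The similarity-matching stage at the end of this pipeline can be illustrated with plain Python. This is a sketch under stated assumptions: in DocMatch the vectors would come from the spaCy model, whereas the toy three-dimensional vectors below are invented, and `cosine_similarity` and `best_match` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as no match
    return dot / (norm_a * norm_b)


def best_match(query_vector, chunk_vectors):
    """Return (index, score) of the chunk most similar to the query."""
    if not chunk_vectors:
        raise ValueError("no chunks to compare against")
    scores = [cosine_similarity(query_vector, v) for v in chunk_vectors]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores[best_index]


# Toy vectors standing in for real embeddings.
query_vector = [1.0, 0.0, 1.0]
chunk_vectors = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query, different length
    [1.0, 1.0, 0.0],  # partially aligned
]
index, score = best_match(query_vector, chunk_vectors)
```

Because cosine similarity depends only on direction, the second chunk wins even though its vector is twice as long as the query vector.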
├── app.py                  # Main Streamlit application
├── requirements.txt        # Python dependencies
├── .env.example            # Configuration template
├── README.md               # This file
├── history.json            # Search history cache
├── data/                   # Your documents folder
│   ├── document1.pdf
│   ├── document2.docx
│   └── document3.pptx
└── utils/
    ├── cosineSimi.py       # Similarity search logic
    ├── highlight_pdf.py    # PDF highlighting
    └── json_writer.py      # History management
Error: Can't find model 'en_core_web_md'
Solution:
python -m spacy download en_core_web_md
Error: libreoffice not found or installed
Solution:
# Ubuntu/Debian
sudo apt-get install libreoffice
# macOS
brew install libreoffice
# Windows
# Download from: https://www.libreoffice.org/download/
Error: Could not highlight PDF
Solution:
- Ensure PDF is not corrupted
- Check write permissions in output directory
- Try with different PDF file
Error: Out of memory
Solution:
- Process smaller document sets
- Use smaller spaCy model: en_core_web_sm
- Reduce MAX_WORKERS in .env
- Split large files before processing
Takes > 30 seconds per search
Solution:
- Use faster model: en_core_web_sm
- Increase MAX_WORKERS in .env
- Reduce number of documents
- Use SSD instead of HDD
- Model Selection
  - sm (small): 40MB, 5-10 sec/search
  - md (medium): 42MB, 10-20 sec/search ⭐ Recommended
  - lg (large): 745MB, 20-40 sec/search
- Chunk Size Optimization
  - Smaller chunks: more specific, less context
  - Larger chunks: better context, less specificity
  - Ideal range: 3-5 sentences
- Document Organization
  - Keep document sets < 100 files for speed
  - Organize by category
  - Remove unnecessary files
- Hardware
  - SSD significantly faster than HDD
  - 8GB RAM sufficient for most use cases
  - Multi-core CPU helps (MAX_WORKERS setting)
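As a rough picture of what a MAX_WORKERS-style setting controls, the sketch below fans per-file work out over a worker pool. It is an assumption-laden illustration: `get_max_workers`, `process_all`, and `fake_extract` are hypothetical names, and whether DocMatch uses threads, processes, or something else is not stated here.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def get_max_workers(default=4):
    """Read MAX_WORKERS from the environment, falling back to a default."""
    raw = os.environ.get("MAX_WORKERS", str(default))
    try:
        workers = int(raw)
    except ValueError:
        workers = default
    return max(1, workers)  # a pool needs at least one worker


def process_all(paths, process_one, max_workers=None):
    """Apply process_one to every path in parallel, preserving input order."""
    workers = max_workers or get_max_workers()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_one, paths))


def fake_extract(path):
    # Placeholder for real per-file work such as text extraction.
    return f"text from {path}"


results = process_all(["a.pdf", "b.docx", "c.pptx"], fake_extract, max_workers=2)
```

Threads mainly help with I/O-bound steps such as reading files. For CPU-bound work a process pool may scale better, at the cost of more memory per worker, which is consistent with the advice to reduce MAX_WORKERS when memory is tight.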
Repeated queries are cached for instant results:
{
  "search_history": [
    {
      "query": "benefits",
      "file_name": "guide.pdf",
      "most_similar_chunk": "...",
      "similarity_score": 0.89
    }
  ]
}
Highlight custom text in PDFs:
from utils.highlight_pdf import highlight_pdf_main

output_path = highlight_pdf_main(
    path="document.pdf",
    text="search phrase",
)
# Setup development
pip install pytest black flake8
# Format code
black .
# Check quality
flake8 .
# Run tests
pytest tests/
- 🔐 Search queries are logged locally only
- 🔐 No data sent to external servers
- 🔐 All processing happens on local machine
- 🔐 History file can be cleared: rm history.json
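Because the history is a plain local JSON file, it can also be inspected or cleared from Python. The sketch below assumes the `search_history` layout shown earlier in this README. `load_history`, `find_cached`, and `clear_history` are hypothetical helpers, and the case-insensitive query match is an assumption rather than documented behavior.

```python
import json
from pathlib import Path


def load_history(history_file="history.json"):
    """Return the list of cached search entries, or [] if no file exists."""
    path = Path(history_file)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    return data.get("search_history", [])


def find_cached(query, entries):
    """Return the first entry whose query matches, ignoring case and padding."""
    wanted = query.strip().lower()
    for entry in entries:
        if entry.get("query", "").strip().lower() == wanted:
            return entry
    return None


def clear_history(history_file="history.json"):
    """Delete the history file; do nothing if it is already gone."""
    Path(history_file).unlink(missing_ok=True)
```

`clear_history()` is a cross-platform equivalent of `rm history.json`, which is handy on Windows where `rm` may not be available.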
MIT License - See LICENSE file
Q: What file sizes are supported? A: Up to 500MB per file; split larger files.
Q: Does it support OCR? A: Currently text-based only; use OCR tools first for scanned PDFs.
Q: How accurate is the matching? A: 85-95% accurate with proper chunk size tuning.
Q: Can I use this offline? A: Yes, once dependencies and spaCy model are installed.
Q: How do I improve search results? A: Adjust chunk size and similarity threshold; use longer, more specific queries.
Q: Is there a batch API? A: Not yet; coming in future release.
| Document Set | Files | Size | Model | Avg Search Time |
|---|---|---|---|---|
| Small | 10 | 50MB | sm | 2 sec |
| Medium | 50 | 250MB | md | 8 sec |
| Large | 100 | 500MB | lg | 20 sec |
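Numbers like those in the table depend heavily on hardware and documents, so it can be useful to measure your own setup. The helper below is a minimal timing sketch: `average_seconds` is a hypothetical name, and `fake_search` is a stand-in so the example runs without DocMatch installed.

```python
import statistics
import time


def average_seconds(func, *args, repeats=3, **kwargs):
    """Call func repeatedly and return the mean wall-clock seconds per call."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)


def fake_search(query):
    # Stand-in for a real search call; sleeps briefly to simulate work.
    time.sleep(0.01)
    return {"query": query}


elapsed = average_seconds(fake_search, "What is the overview?", repeats=3)
print(f"Average search time: {elapsed:.3f} sec")
```

To time a real search, pass the `search` function from `utils.cosineSimi` together with its `dir`, `query`, and `chunk` arguments in place of `fake_search`. Since repeated queries are cached, use a different query on each run if you want to measure uncached performance.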
Aviasole is an AI development company specializing in cutting-edge artificial intelligence solutions. We create innovative POCs and production-ready applications that demonstrate the power and potential of modern AI technologies.
DocMatch is one of our showcase projects demonstrating semantic search and intelligent document matching using NLP and cosine similarity matching.
Learn more: https://aviasole.com
Aviasole - AI Development & Innovation
Website: https://aviasole.com
Last Updated: March 2026
Documentation Version: 1.0