ValidationAI is an intelligent data validation and reconciliation platform that compares data from multiple sources (Excel files and PostgreSQL databases) using AI-powered fuzzy matching and comparison. Built with LangChain, Google Generative AI, and Streamlit, it provides enterprise-grade data validation capabilities with minimal configuration.
- 🤖 AI-Powered Comparison - Uses Google Generative AI for intelligent data analysis
- 🎯 Fuzzy Matching - Intelligent matching with configurable thresholds
- 📊 Multi-Source Validation - Compare Excel files against PostgreSQL databases
- 🔍 Detailed Reporting - Real-time validation results with comprehensive analytics
- 🚀 Easy Integration - Simple configuration with environment variables
- 📈 Scalable Architecture - Handles large datasets efficiently
- Python: 3.10 or higher
- PostgreSQL: 12 or higher (for database connection)
- Operating System: Windows, macOS, or Linux
- RAM: Minimum 4GB recommended
- Disk Space: 500MB for dependencies
- Google Cloud Account - For Generative AI API access
- PostgreSQL Database - For data source
cd ValidationAI
# On Windows
python -m venv venv
venv\Scripts\activate
# On macOS/Linux
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Copy the example file
cp .env.example .env
# Edit .env with your actual values
# See Configuration section below
streamlit run main.py
The application will open at http://localhost:8501
- Docker 20.10+
- Docker Compose 2.0+
# Build the Docker image
docker build -t validation-ai .
# Create .env file with your configuration
cp .env.example .env
# Edit .env with your values
# Run with Docker Compose
docker-compose up
# Application will be available at http://localhost:8501
docker-compose down
python -m venv venv
# Activate virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate
# Install requirements
pip install --upgrade pip
pip install -r requirements.txt
Ensure your PostgreSQL instance is running and accessible.
# Test connection (optional)
psql -h YOUR_HOST -U YOUR_USER -d YOUR_DATABASE
- Go to Google Cloud Console
- Create a new project or select an existing one
- Enable "Generative Language API"
- Create an API key:
  - Navigate to "Credentials"
  - Click "Create Credentials" → "API Key"
  - Copy the API key
cp .env.example .env
Edit the .env file with your configuration (see Configuration section)
streamlit run main.py
Create a .env file in the project root directory with the following variables:
# Google Generative AI Configuration
# Get your API key from: https://makersuite.google.com/app/apikey
GOOGLE_API_KEY=your_actual_google_api_key_here
# PostgreSQL Database Configuration
# Example: db.example.com
DB_HOST=your_database_host
# Database port (default: 5432)
DB_PORT=5432
# Database name
DB_NAME=your_database_name
# Database username
DB_USER=your_database_user
# Database password (use strong password in production)
DB_PASSWORD=your_database_password
# Application Settings
LOG_LEVEL=INFO
- Never commit the .env file - It contains sensitive information
- Use strong database passwords - Minimum 12 characters with special characters
- Restrict API key scope - Use Google Cloud Console to limit API key usage
- Environment-specific configs - Use different credentials for dev/staging/production
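The variables listed above can be checked once at startup so that a missing value fails early with a clear message. The following is a minimal sketch, not the project's actual loader: the `Settings` class and `load_settings` function are hypothetical names, and the choice of which variables count as required is an assumption based on the template above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Assumption: these are treated as required; DB_PORT and LOG_LEVEL have defaults.
REQUIRED_VARS = ("GOOGLE_API_KEY", "DB_HOST", "DB_NAME", "DB_USER", "DB_PASSWORD")


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values defined in .env."""
    google_api_key: str
    db_host: str
    db_port: int
    db_name: str
    db_user: str
    db_password: str
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, failing fast on gaps."""
    missing = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    if missing:
        raise ValueError("Missing required settings: " + ", ".join(missing))
    return Settings(
        google_api_key=env["GOOGLE_API_KEY"],
        db_host=env["DB_HOST"],
        db_port=int(env.get("DB_PORT", "5432")),  # default port from the template
        db_name=env["DB_NAME"],
        db_user=env["DB_USER"],
        db_password=env["DB_PASSWORD"],
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Passing the mapping as an argument keeps the function easy to test with a plain dictionary instead of the real process environment.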
streamlit run main.py
- Click "Browse files" in the sidebar
- Select your Excel file containing data to validate
- The file will be processed automatically
- Statement ID: Enter the database record identifier
- Agency Type: Select or enter agency classification
- Fuzzy Match Threshold: Adjust matching sensitivity (0-100, default: 80)
- Enable Swap: Toggle for agent NPN mapping
- Click "Search" button
- View real-time validation results
- Review detailed comparison report
- Check validation scores
- Review matching confidence levels
- Export results if needed
The fuzzy matching threshold determines matching sensitivity:
- 90-100: Very strict matching (high precision, may miss valid matches)
- 70-89: Standard matching (recommended for most cases)
- 50-69: Lenient matching (high recall, may include false positives)
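To see how the threshold changes the outcome for a single pair of values, here is a small sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for FuzzyWuzzy, so the exact scores may differ from what the application computes. The `similarity` and `is_match` helpers and the sample values are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score, ignoring case and outer whitespace."""
    ratio = SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()
    return round(ratio * 100)


def is_match(a: str, b: str, threshold: int = 80) -> bool:
    """Treat two values as matching when the score reaches the threshold."""
    return similarity(a, b) >= threshold


# Hypothetical values: one from the Excel file, one from the database.
excel_value = "Acme Insurance LLC"
db_value = "Acme Insurance"

for threshold in (60, 80, 95):
    print(threshold, is_match(excel_value, db_value, threshold))
```

A pair like this one, which differs only by a suffix, passes at the lenient and standard settings but is rejected at the strictest one. That is the precision and recall trade-off described above.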
For files with 10,000+ rows:
- Increase match threshold to reduce processing time
- Consider splitting into multiple validation batches
- Monitor database connection stability
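Splitting a large file into validation batches can be sketched with a small generator. This is an illustration only: `batches` is a hypothetical helper, and the batch size of 1,000 in the usage note is an arbitrary example, not a project setting.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(rows: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` rows without loading all rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

For example, `for chunk in batches(all_rows, 1000): ...` would validate 1,000 rows at a time, which keeps memory use bounded and limits how much work is lost if the database connection drops mid-run.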
┌─────────────────────────────────────────────────────────┐
│ Streamlit Web Interface (main.py)                       │
│ - File upload & configuration UI                        │
│ - Real-time result display                              │
└─────────────────┬───────────────────────────────────────┘
                  │
        ┌─────────┴──────────────┐
        │                        │
    ┌───▼──────────────┐  ┌──────▼────────────────┐
    │ Excel Processing │  │ Database Connection   │
    │ (Pandas)         │  │ (psycopg2)            │
    └───┬──────────────┘  └──────┬────────────────┘
        │                        │
        └─────────────┬──────────┘
                      │
            ┌─────────▼──────────────┐
            │ Data Comparison Engine │
            │ - FuzzyWuzzy matching  │
            │ - Pandas operations    │
            └─────────┬──────────────┘
                      │
            ┌─────────▼──────────────────┐
            │ AI Analysis & Validation   │
            │ (LangChain + Google GenAI) │
            │ - Tool-calling agent       │
            │ - Intelligent comparison   │
            └─────────┬──────────────────┘
                      │
            ┌─────────▼──────────────┐
            │ Results & Reporting    │
            │ - Console output       │
            │ - Session state        │
            └────────────────────────┘
- Streamlit application entry point
- UI configuration and layout
- Session state management
- Data validation orchestration
- PostgreSQL connection management
- Database query execution
- Data retrieval and transformation
- LangChain tool definitions
- Google Generative AI integration
- Comparison logic and analysis
Error: could not connect to server: Connection refused
Solutions:
- Verify PostgreSQL is running: systemctl status postgresql (Linux with systemd)
- Check host, port, and credentials in .env
- Ensure database exists: psql -l
- Test connection: psql -h HOST -U USER -d DATABASE
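If psql is not installed on the machine running the app, a plain TCP check can at least tell a network or port problem apart from a credentials problem. The sketch below uses only the standard library; `can_reach` is a hypothetical helper and does not authenticate against PostgreSQL.

```python
import socket


def can_reach(host: str, port: int = 5432, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

If `can_reach(DB_HOST, DB_PORT)` returns False, the server is down, the port is wrong, or a firewall is blocking it. If it returns True but the app still fails, the credentials or database name in .env are the more likely cause.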
Error: GOOGLE_API_KEY environment variable is not set
Solutions:
- Verify the .env file exists in the project root
- Check that the API key value in .env is correct
- Reload environment: source venv/bin/activate (Linux/macOS)
- Restart application: streamlit run main.py
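One way to check the .env contents without printing any secrets is to parse the file and report only which keys look wrong. This is a rough sketch with a deliberately simple parser (no multi-line values, no `export` prefixes). `parse_env` and `find_problem_keys` are hypothetical helpers, and the "your_" placeholder check assumes the values were copied from the template in the Configuration section.

```python
from typing import Dict, Iterable

REQUIRED_KEYS = ("GOOGLE_API_KEY", "DB_HOST", "DB_NAME", "DB_USER", "DB_PASSWORD")


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


def find_problem_keys(text: str, required: Iterable[str] = REQUIRED_KEYS) -> Dict[str, str]:
    """Map each problematic key to a short reason, never exposing its value."""
    values = parse_env(text)
    problems = {}
    for key in required:
        value = values.get(key, "")
        if not value:
            problems[key] = "missing or empty"
        elif value.startswith("your_"):
            problems[key] = "still the template placeholder"
    return problems
```

Calling `find_problem_keys(open(".env").read())` returns an empty dictionary when every required key has a real-looking value.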
ModuleNotFoundError: No module named 'streamlit'
Solutions:
- Activate virtual environment
- Reinstall dependencies: pip install -r requirements.txt
- Check Python version: python --version (must be 3.10+)
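The checks above can also be run from inside the interpreter that Streamlit will use, which catches the case where the wrong virtual environment is active. A minimal sketch follows; the module list in the usage note is an assumption drawn from the libraries named in this README, not from requirements.txt.

```python
import sys
from importlib.util import find_spec
from typing import Iterable, List, Tuple


def python_ok(minimum: Tuple[int, int] = (3, 10)) -> bool:
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]
```

For example, `missing_modules(["streamlit", "langchain", "pandas", "psycopg2", "fuzzywuzzy"])` lists whichever of those packages are absent from the active environment.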
Error: File format not recognized
Solutions:
- Ensure file is in .xlsx or .xls format
- Check file is not corrupted
- Try opening file in Excel first
- Convert to .xlsx if using .xls
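A file's extension does not always match its real format, for instance a CSV renamed to .xlsx. The sketch below inspects the first bytes of the file instead: .xlsx files are ZIP containers and legacy .xls files are OLE2 compound documents. `sniff_excel_format` is a hypothetical helper, and a ZIP signature only shows the container type, not that the contents are a valid workbook.

```python
ZIP_MAGIC = b"PK\x03\x04"                          # ZIP container, used by .xlsx
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"   # OLE2 compound file, used by .xls


def sniff_excel_format(header: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    if header.startswith(ZIP_MAGIC):
        return "xlsx"
    if header.startswith(OLE2_MAGIC):
        return "xls"
    return "unknown"
```

Reading the first eight bytes is enough, for example `sniff_excel_format(open(path, "rb").read(8))`. An "unknown" result usually means the file is in another format or is corrupted.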
Error: Connection timeout
Solutions:
- Check database server status
- Verify network connectivity
- Increase timeout settings in code (if needed)
- Try with smaller dataset first
# Set debug logging
export LOG_LEVEL=DEBUG
streamlit run main.py
# Display all configured variables
cat .env
# Verify specific variables
echo $GOOGLE_API_KEY
echo $DB_HOST
- Batch Processing
  - Process data in chunks for large files
  - Adjust chunk size based on available memory
- Database Indexing
  - Ensure database tables have proper indexes
  - Check query execution plans
- Connection Pooling
  - Reuse database connections
  - Monitor active connections
- Caching
  - Use Streamlit's @st.cache_data decorator
  - Cache expensive computations
We welcome contributions! Please follow these guidelines:
- Fork the repository
- Create a feature branch: git checkout -b feature/your-feature
- Make your changes and test thoroughly
- Commit with clear messages: git commit -m "Add feature description"
- Push to branch: git push origin feature/your-feature
- Submit a Pull Request
# Install development dependencies
pip install -r requirements.txt
# Run tests
pytest tests/
# Format code
black .
# Check code quality
flake8 .
- 🔐 Never commit the .env file containing sensitive credentials
- 🔐 Use environment variables for all secrets in production
- 🔐 Restrict API key scope using Google Cloud Console
- 🔐 Use HTTPS when deploying to production
- 🔐 Implement authentication for production deployments
- 🔐 Rotate credentials regularly for maximum security
For production environments:
# Use strong database passwords
# Implement API authentication
# Use HTTPS/TLS for all connections
# Enable audit logging
# Set up monitoring and alerts
# Implement rate limiting
Validates and compares data from Excel and database sources.
Parameters:
- statementid (int): Database statement identifier
- agency_type (str): Agency classification
- swap (bool): Enable agent NPN mapping
- file (UploadedFile): Excel file object
- query (str): Agency name query
- threshold (int): Fuzzy match threshold (0-100)
Returns:
- Validation results and comparison data
This project is licensed under the MIT License - see the LICENSE file for details.
For issues, questions, or suggestions:
- Open an issue on the project repository
- Contact the development team
- Check documentation for solutions
- Visit Aviasole.com for more information
Aviasole is an AI development company specializing in cutting-edge artificial intelligence solutions. We create innovative POCs and production-ready applications that demonstrate the power and potential of modern AI technologies.
ValidationAI is one of our showcase projects demonstrating enterprise-grade AI-powered data validation capabilities.
Learn more: https://aviasole.com
- Initial release
- AI-powered data validation
- Excel to PostgreSQL comparison
- Real-time validation interface
Built with:
- Streamlit - Web application framework
- LangChain - LLM framework
- Google Generative AI - Intelligent analysis
- Pandas - Data processing
- FuzzyWuzzy - Fuzzy matching
Q: What Excel formats are supported?
A: .xlsx and .xls formats are supported.
Q: Can I use this without PostgreSQL?
A: Currently, PostgreSQL is required. Contact support for alternative database support.
Q: How do I increase matching accuracy?
A: Adjust the fuzzy match threshold and ensure database data is properly cleaned and normalized.
Q: Is there a limit on file size?
A: Files up to 100MB are supported. For larger files, contact support.
Q: How do I deploy to production?
A: See the Production Deployment section in Security.
Last Updated: March 2026
Documentation Version: 1.0
Aviasole - AI Development & Innovation
Website: https://aviasole.com
This project is proudly developed and maintained by Aviasole, a leading AI development company focused on creating innovative AI solutions for enterprise challenges.