ValidationAI

Overview

ValidationAI is an intelligent data validation and reconciliation platform that compares data from multiple sources (Excel files and PostgreSQL databases) using AI-powered fuzzy matching and comparison. Built with LangChain, Google Generative AI, and Streamlit, it provides enterprise-grade data validation capabilities with minimal configuration.

Key Features

🤖 AI-Powered Comparison - Uses Google Generative AI for intelligent data analysis
🎯 Fuzzy Matching - Intelligent matching with configurable thresholds
📊 Multi-Source Validation - Compare Excel files against PostgreSQL databases
🔍 Detailed Reporting - Real-time validation results with comprehensive analytics
🚀 Easy Integration - Simple configuration with environment variables
📈 Scalable Architecture - Handles large datasets efficiently

Prerequisites

System Requirements

Python: 3.10 or higher
PostgreSQL: 12 or higher (for database connection)
Operating System: Windows, macOS, or Linux
RAM: Minimum 4GB recommended
Disk Space: 500MB for dependencies

Required Services

Google Cloud Account - For Generative AI API access
PostgreSQL Database - For data source

Installation

Quick Start

1. Clone or Download the Repository

cd ValidationAI

2. Create Virtual Environment

# On Windows
python -m venv venv
venv\Scripts\activate

# On macOS/Linux
python3 -m venv venv
source venv/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Configure Environment Variables

# Copy the example file
cp .env.example .env

# Edit .env with your actual values
# See Configuration section below

5. Run the Application

streamlit run main.py

The application will open at http://localhost:8501

Docker Setup

Prerequisites

Docker 20.10+
Docker Compose 2.0+

Build and Run

# Build the Docker image
docker build -t validation-ai .

# Create .env file with your configuration
cp .env.example .env
# Edit .env with your values

# Run with Docker Compose
docker-compose up

# Application will be available at http://localhost:8501

Stop Services

docker-compose down

Manual Setup

Step 1: Install Python Dependencies

python -m venv venv

# Activate virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

# Install requirements
pip install --upgrade pip
pip install -r requirements.txt

Step 2: PostgreSQL Configuration

Ensure your PostgreSQL instance is running and accessible.

# Test connection (optional)
psql -h YOUR_HOST -U YOUR_USER -d YOUR_DATABASE

Step 3: Google Cloud Setup

Go to Google Cloud Console
Create a new project or select existing one
Enable "Generative Language API"
Create an API key:
- Navigate to "Credentials"
- Click "Create Credentials" → "API Key"
- Copy the API key

Step 4: Environment Configuration

cp .env.example .env

Edit .env file with your configuration (see Configuration section)

Step 5: Run Application

streamlit run main.py

Configuration

Environment Variables

Create a .env file in the project root directory with the following variables:

# Google Generative AI Configuration
# Get your API key from: https://makersuite.google.com/app/apikey
GOOGLE_API_KEY=your_actual_google_api_key_here

# PostgreSQL Database Configuration
# Example: db.example.com
DB_HOST=your_database_host

# Database port (default: 5432)
DB_PORT=5432

# Database name
DB_NAME=your_database_name

# Database username
DB_USER=your_database_user

# Database password (use strong password in production)
DB_PASSWORD=your_database_password

# Application Settings
LOG_LEVEL=INFO

Configuration Best Practices

Never commit .env file - It contains sensitive information
Use strong database passwords - Minimum 12 characters with special characters
Restrict API key scope - Use Google Cloud Console to limit API key usage
Environment-specific configs - Use different credentials for dev/staging/production

Usage

Basic Workflow

1. Launch the Application

streamlit run main.py

2. Load Excel File

Click "Browse files" in the sidebar
Select your Excel file containing data to validate
The file will be processed automatically

3. Configure Validation

Statement ID: Enter the database record identifier
Agency Type: Select or enter agency classification
Fuzzy Match Threshold: Adjust matching sensitivity (0-100, default: 80)
Enable Swap: Toggle for agent NPN mapping

4. Execute Validation

Click "Search" button
View real-time validation results
Review detailed comparison report

5. Analyze Results

Check validation scores
Review matching confidence levels
Export results if needed

Advanced Usage

Custom Threshold Configuration

The fuzzy matching threshold determines matching sensitivity:

- 90-100: Very strict matching (high precision, may miss valid matches)
- 70-89:  Standard matching (recommended for most cases)
- 50-69:  Lenient matching (high recall, may include false positives)

Handling Large Datasets

For files with 10,000+ rows:

Increase match threshold to reduce processing time
Consider splitting into multiple validation batches
Monitor database connection stability

Architecture

System Components

┌─────────────────────────────────────────────────────────┐
│          Streamlit Web Interface (main.py)              │
│        - File upload & configuration UI                 │
│        - Real-time result display                       │
└─────────────────┬───────────────────────────────────────┘
                  │
        ┌─────────┴──────────────┐
        │                        │
    ┌───▼──────────────┐  ┌──────▼────────────────┐
    │ Excel Processing │  │ Database Connection  │
    │ (Pandas)         │  │ (psycopg2)           │
    └───┬──────────────┘  └──────┬────────────────┘
        │                        │
        └─────────────┬──────────┘
                      │
            ┌─────────▼──────────────┐
            │ Data Comparison Engine │
            │ - FuzzyWuzzy matching  │
            │ - Pandas operations    │
            └─────────┬──────────────┘
                      │
            ┌─────────▼──────────────────┐
            │ AI Analysis & Validation   │
            │ (LangChain + Google GenAI) │
            │ - Tool-calling agent       │
            │ - Intelligent comparison   │
            └─────────┬──────────────────┘
                      │
            ┌─────────▼──────────────┐
            │ Results & Reporting    │
            │ - Console output       │
            │ - Session state        │
            └────────────────────────┘

Key Components

`main.py`

Streamlit application entry point
UI configuration and layout
Session state management
Data validation orchestration

`utils/data_pg.py`

PostgreSQL connection management
Database query execution
Data retrieval and transformation

`utils/agent_tools.py`

LangChain tool definitions
Google Generative AI integration
Comparison logic and analysis

Troubleshooting

Common Issues

1. Database Connection Error

Error: could not connect to server: Connection refused

Solutions:

Verify PostgreSQL is running: systemctl status postgresql (Linux/macOS)
Check host, port, and credentials in .env
Ensure database exists: psql -l
Test connection: psql -h HOST -U USER -d DATABASE

2. API Key Error

Error: GOOGLE_API_KEY environment variable is not set

Solutions:

Verify .env file exists in project root
Check API key value in .env is correct
Reload environment: source venv/bin/activate (Linux/macOS)
Restart application: streamlit run main.py

3. Module Import Error

ModuleNotFoundError: No module named 'streamlit'

Solutions:

Activate virtual environment
Reinstall dependencies: pip install -r requirements.txt
Check Python version: python --version (must be 3.10+)

4. Excel File Format Error

Error: File format not recognized

Solutions:

Ensure file is .xlsx or .xls format
Check file is not corrupted
Try opening file in Excel first
Convert to .xlsx if using .xls

5. Timeout Issues

Error: Connection timeout

Solutions:

Check database server status
Verify network connectivity
Increase timeout settings in code (if needed)
Try with smaller dataset first

Enable Debug Mode

# Set debug logging
export LOG_LEVEL=DEBUG
streamlit run main.py

Check Environment Variables

# Display all configured variables
cat .env

# Verify specific variables
echo $GOOGLE_API_KEY
echo $DB_HOST

Performance Optimization

Best Practices

Batch Processing
- Process data in chunks for large files
- Adjust chunk size based on available memory
Database Indexing
- Ensure database tables have proper indexes
- Check query execution plans
Connection Pooling
- Reuse database connections
- Monitor active connections
Caching
- Use Streamlit's @st.cache_data decorator
- Cache expensive computations

Contributing

We welcome contributions! Please follow these guidelines:

Fork the repository
Create a feature branch: git checkout -b feature/your-feature
Make your changes and test thoroughly
Commit with clear messages: git commit -m "Add feature description"
Push to branch: git push origin feature/your-feature
Submit a Pull Request

Development Setup

# Install development dependencies
pip install -r requirements.txt

# Run tests
pytest tests/

# Format code
black .

# Check code quality
flake8 .

Security

Important Security Notes

🔐 Never commit .env file containing sensitive credentials
🔐 Use environment variables for all secrets in production
🔐 Restrict API key scope using Google Cloud Console
🔐 Use HTTPS when deploying to production
🔐 Implement authentication for production deployments
🔐 Rotate credentials regularly for maximum security

Production Deployment

For production environments:

# Use strong database passwords
# Implement API authentication
# Use HTTPS/TLS for all connections
# Enable audit logging
# Set up monitoring and alerts
# Implement rate limiting

API Reference

Main Functions

`fetch_data()`

Validates and compares data from Excel and database sources.

Parameters:

statementid (int): Database statement identifier
agency_type (str): Agency classification
swap (bool): Enable agent NPN mapping
file (UploadedFile): Excel file object
query (str): Agency name query
threshold (int): Fuzzy match threshold (0-100)

Returns:

Validation results and comparison data

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support & Contact

For issues, questions, or suggestions:

Open an issue on the project repository
Contact the development team
Check documentation for solutions
Visit Aviasole.com for more information

About Aviasole

Aviasole is an AI development company specializing in cutting-edge artificial intelligence solutions. We create innovative POCs and production-ready applications that demonstrate the power and potential of modern AI technologies.

ValidationAI is one of our showcase projects demonstrating enterprise-grade AI-powered data validation capabilities.

Learn more: https://aviasole.com

Changelog

Version 1.0.0 (Current)

Initial release
AI-powered data validation
Excel to PostgreSQL comparison
Real-time validation interface

Acknowledgments

Built with:

Streamlit - Web application framework
LangChain - LLM framework
Google Generative AI - Intelligent analysis
Pandas - Data processing
FuzzyWuzzy - Fuzzy matching

Frequently Asked Questions

Q: What Excel formats are supported? A: .xlsx and .xls formats are supported.

Q: Can I use this without PostgreSQL? A: Currently, PostgreSQL is required. Contact support for alternative database support.

Q: How do I increase matching accuracy? A: Adjust the fuzzy match threshold and ensure database data is properly cleaned and normalized.

Q: Is there a limit on file size? A: Files up to 100MB are supported. For larger files, contact support.

Q: How do I deploy to production? A: See the Production Deployment section in Security.

Last Updated: March 2026 Documentation Version: 1.0

Company

Aviasole - AI Development & Innovation Website: https://aviasole.com

This project is proudly developed and maintained by Aviasole, a leading AI development company focused on creating innovative AI solutions for enterprise challenges.

For more AI projects and solutions, visit aviasole.com

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
utils		utils
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt
setup.bat		setup.bat

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

ValidationAI

Overview

Key Features

Table of Contents

Prerequisites

System Requirements

Required Services

Installation

Quick Start

1. Clone or Download the Repository

2. Create Virtual Environment

3. Install Dependencies

4. Configure Environment Variables

5. Run the Application

Docker Setup

Prerequisites

Build and Run

Stop Services

Manual Setup

Step 1: Install Python Dependencies

Step 2: PostgreSQL Configuration

Step 3: Google Cloud Setup

Step 4: Environment Configuration

Step 5: Run Application

Configuration

Environment Variables

Configuration Best Practices

Usage

Basic Workflow

1. Launch the Application

2. Load Excel File

3. Configure Validation

4. Execute Validation

5. Analyze Results

Advanced Usage

Custom Threshold Configuration

Handling Large Datasets

Architecture

System Components

Key Components

main.py

utils/data_pg.py

utils/agent_tools.py

Troubleshooting

Common Issues

1. Database Connection Error

2. API Key Error

3. Module Import Error

4. Excel File Format Error

5. Timeout Issues

Enable Debug Mode

Check Environment Variables

Performance Optimization

Best Practices

Contributing

Development Setup

Security

Important Security Notes

Production Deployment

API Reference

Main Functions

fetch_data()

License

Support & Contact

About Aviasole

Changelog

Version 1.0.0 (Current)

Acknowledgments

Frequently Asked Questions

Company

About

Resources

Uh oh!

Stars

Watchers

`main.py`

`utils/data_pg.py`

`utils/agent_tools.py`

`fetch_data()`

Packages