Skip to content

aviasoletechnologies/ValidationAI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ValidationAI

License: MIT Python 3.10+ Streamlit App Made by Aviasole

Overview

ValidationAI is an intelligent data validation and reconciliation platform that compares data from multiple sources (Excel files and PostgreSQL databases) using AI-powered fuzzy matching and comparison. Built with LangChain, Google Generative AI, and Streamlit, it provides enterprise-grade data validation capabilities with minimal configuration.

Key Features

  • 🤖 AI-Powered Comparison - Uses Google Generative AI for intelligent data analysis
  • 🎯 Fuzzy Matching - Intelligent matching with configurable thresholds
  • 📊 Multi-Source Validation - Compare Excel files against PostgreSQL databases
  • 🔍 Detailed Reporting - Real-time validation results with comprehensive analytics
  • 🚀 Easy Integration - Simple configuration with environment variables
  • 📈 Scalable Architecture - Handles large datasets efficiently

Table of Contents


Prerequisites

System Requirements

  • Python: 3.10 or higher
  • PostgreSQL: 12 or higher (for database connection)
  • Operating System: Windows, macOS, or Linux
  • RAM: Minimum 4GB recommended
  • Disk Space: 500MB for dependencies

Required Services

  • Google Cloud Account - For Generative AI API access
  • PostgreSQL Database - For data source

Installation

Quick Start

1. Clone or Download the Repository

cd ValidationAI

2. Create Virtual Environment

# On Windows
python -m venv venv
venv\Scripts\activate

# On macOS/Linux
python3 -m venv venv
source venv/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Configure Environment Variables

# Copy the example file
cp .env.example .env

# Edit .env with your actual values
# See Configuration section below

5. Run the Application

streamlit run main.py

The application will open at http://localhost:8501


Docker Setup

Prerequisites

  • Docker 20.10+
  • Docker Compose 2.0+

Build and Run

# Build the Docker image
docker build -t validation-ai .

# Create .env file with your configuration
cp .env.example .env
# Edit .env with your values

# Run with Docker Compose
docker-compose up

# Application will be available at http://localhost:8501

Stop Services

docker-compose down

Manual Setup

Step 1: Install Python Dependencies

python -m venv venv

# Activate virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

# Install requirements
pip install --upgrade pip
pip install -r requirements.txt

Step 2: PostgreSQL Configuration

Ensure your PostgreSQL instance is running and accessible.

# Test connection (optional)
psql -h YOUR_HOST -U YOUR_USER -d YOUR_DATABASE

Step 3: Google Cloud Setup

  1. Go to Google Cloud Console
  2. Create a new project or select existing one
  3. Enable "Generative Language API"
  4. Create an API key:
    • Navigate to "Credentials"
    • Click "Create Credentials" → "API Key"
    • Copy the API key

Step 4: Environment Configuration

cp .env.example .env

Edit .env file with your configuration (see Configuration section)

Step 5: Run Application

streamlit run main.py

Configuration

Environment Variables

Create a .env file in the project root directory with the following variables:

# Google Generative AI Configuration
# Get your API key from: https://makersuite.google.com/app/apikey
GOOGLE_API_KEY=your_actual_google_api_key_here

# PostgreSQL Database Configuration
# Example: db.example.com
DB_HOST=your_database_host

# Database port (default: 5432)
DB_PORT=5432

# Database name
DB_NAME=your_database_name

# Database username
DB_USER=your_database_user

# Database password (use strong password in production)
DB_PASSWORD=your_database_password

# Application Settings
LOG_LEVEL=INFO

Configuration Best Practices

  1. Never commit .env file - It contains sensitive information
  2. Use strong database passwords - Minimum 12 characters with special characters
  3. Restrict API key scope - Use Google Cloud Console to limit API key usage
  4. Environment-specific configs - Use different credentials for dev/staging/production

Usage

Basic Workflow

1. Launch the Application

streamlit run main.py

2. Load Excel File

  • Click "Browse files" in the sidebar
  • Select your Excel file containing data to validate
  • The file will be processed automatically

3. Configure Validation

  • Statement ID: Enter the database record identifier
  • Agency Type: Select or enter agency classification
  • Fuzzy Match Threshold: Adjust matching sensitivity (0-100, default: 80)
  • Enable Swap: Toggle for agent NPN mapping

4. Execute Validation

  • Click "Search" button
  • View real-time validation results
  • Review detailed comparison report

5. Analyze Results

  • Check validation scores
  • Review matching confidence levels
  • Export results if needed

Advanced Usage

Custom Threshold Configuration

The fuzzy matching threshold determines matching sensitivity:

- 90-100: Very strict matching (high precision, may miss valid matches)
- 70-89:  Standard matching (recommended for most cases)
- 50-69:  Lenient matching (high recall, may include false positives)

Handling Large Datasets

For files with 10,000+ rows:

  1. Increase match threshold to reduce processing time
  2. Consider splitting into multiple validation batches
  3. Monitor database connection stability

Architecture

System Components

┌─────────────────────────────────────────────────────────┐
│          Streamlit Web Interface (main.py)              │
│        - File upload & configuration UI                 │
│        - Real-time result display                       │
└─────────────────┬───────────────────────────────────────┘
                  │
        ┌─────────┴──────────────┐
        │                        │
    ┌───▼──────────────┐  ┌──────▼────────────────┐
    │ Excel Processing │  │ Database Connection  │
    │ (Pandas)         │  │ (psycopg2)           │
    └───┬──────────────┘  └──────┬────────────────┘
        │                        │
        └─────────────┬──────────┘
                      │
            ┌─────────▼──────────────┐
            │ Data Comparison Engine │
            │ - FuzzyWuzzy matching  │
            │ - Pandas operations    │
            └─────────┬──────────────┘
                      │
            ┌─────────▼──────────────────┐
            │ AI Analysis & Validation   │
            │ (LangChain + Google GenAI) │
            │ - Tool-calling agent       │
            │ - Intelligent comparison   │
            └─────────┬──────────────────┘
                      │
            ┌─────────▼──────────────┐
            │ Results & Reporting    │
            │ - Console output       │
            │ - Session state        │
            └────────────────────────┘

Key Components

main.py

  • Streamlit application entry point
  • UI configuration and layout
  • Session state management
  • Data validation orchestration

utils/data_pg.py

  • PostgreSQL connection management
  • Database query execution
  • Data retrieval and transformation

utils/agent_tools.py

  • LangChain tool definitions
  • Google Generative AI integration
  • Comparison logic and analysis

Troubleshooting

Common Issues

1. Database Connection Error

Error: could not connect to server: Connection refused

Solutions:

  • Verify PostgreSQL is running: systemctl status postgresql (Linux/macOS)
  • Check host, port, and credentials in .env
  • Ensure database exists: psql -l
  • Test connection: psql -h HOST -U USER -d DATABASE

2. API Key Error

Error: GOOGLE_API_KEY environment variable is not set

Solutions:

  • Verify .env file exists in project root
  • Check API key value in .env is correct
  • Reload environment: source venv/bin/activate (Linux/macOS)
  • Restart application: streamlit run main.py

3. Module Import Error

ModuleNotFoundError: No module named 'streamlit'

Solutions:

  • Activate virtual environment
  • Reinstall dependencies: pip install -r requirements.txt
  • Check Python version: python --version (must be 3.10+)

4. Excel File Format Error

Error: File format not recognized

Solutions:

  • Ensure file is .xlsx or .xls format
  • Check file is not corrupted
  • Try opening file in Excel first
  • Convert to .xlsx if using .xls

5. Timeout Issues

Error: Connection timeout

Solutions:

  • Check database server status
  • Verify network connectivity
  • Increase timeout settings in code (if needed)
  • Try with smaller dataset first

Enable Debug Mode

# Set debug logging
export LOG_LEVEL=DEBUG
streamlit run main.py

Check Environment Variables

# Display all configured variables
cat .env

# Verify specific variables
echo $GOOGLE_API_KEY
echo $DB_HOST

Performance Optimization

Best Practices

  1. Batch Processing

    • Process data in chunks for large files
    • Adjust chunk size based on available memory
  2. Database Indexing

    • Ensure database tables have proper indexes
    • Check query execution plans
  3. Connection Pooling

    • Reuse database connections
    • Monitor active connections
  4. Caching

    • Use Streamlit's @st.cache_data decorator
    • Cache expensive computations

Contributing

We welcome contributions! Please follow these guidelines:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Make your changes and test thoroughly
  4. Commit with clear messages: git commit -m "Add feature description"
  5. Push to branch: git push origin feature/your-feature
  6. Submit a Pull Request

Development Setup

# Install development dependencies
pip install -r requirements.txt

# Run tests
pytest tests/

# Format code
black .

# Check code quality
flake8 .

Security

Important Security Notes

  • 🔐 Never commit .env file containing sensitive credentials
  • 🔐 Use environment variables for all secrets in production
  • 🔐 Restrict API key scope using Google Cloud Console
  • 🔐 Use HTTPS when deploying to production
  • 🔐 Implement authentication for production deployments
  • 🔐 Rotate credentials regularly for maximum security

Production Deployment

For production environments:

# Use strong database passwords
# Implement API authentication
# Use HTTPS/TLS for all connections
# Enable audit logging
# Set up monitoring and alerts
# Implement rate limiting

API Reference

Main Functions

fetch_data()

Validates and compares data from Excel and database sources.

Parameters:

  • statementid (int): Database statement identifier
  • agency_type (str): Agency classification
  • swap (bool): Enable agent NPN mapping
  • file (UploadedFile): Excel file object
  • query (str): Agency name query
  • threshold (int): Fuzzy match threshold (0-100)

Returns:

  • Validation results and comparison data

License

This project is licensed under the MIT License - see the LICENSE file for details.


Support & Contact

For issues, questions, or suggestions:

  • Open an issue on the project repository
  • Contact the development team
  • Check documentation for solutions
  • Visit Aviasole.com for more information

About Aviasole

Aviasole is an AI development company specializing in cutting-edge artificial intelligence solutions. We create innovative POCs and production-ready applications that demonstrate the power and potential of modern AI technologies.

ValidationAI is one of our showcase projects demonstrating enterprise-grade AI-powered data validation capabilities.

Learn more: https://aviasole.com


Changelog

Version 1.0.0 (Current)

  • Initial release
  • AI-powered data validation
  • Excel to PostgreSQL comparison
  • Real-time validation interface

Acknowledgments

Built with:


Frequently Asked Questions

Q: What Excel formats are supported? A: .xlsx and .xls formats are supported.

Q: Can I use this without PostgreSQL? A: Currently, PostgreSQL is required. Contact support for alternative database support.

Q: How do I increase matching accuracy? A: Adjust the fuzzy match threshold and ensure database data is properly cleaned and normalized.

Q: Is there a limit on file size? A: Files up to 100MB are supported. For larger files, contact support.

Q: How do I deploy to production? A: See the Production Deployment section in Security.


Last Updated: March 2026 Documentation Version: 1.0


Company

Aviasole - AI Development & Innovation Website: https://aviasole.com

This project is proudly developed and maintained by Aviasole, a leading AI development company focused on creating innovative AI solutions for enterprise challenges.

For more AI projects and solutions, visit aviasole.com

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors