SwapCodesDev/ScrapeBot

ScrapeBot: AI-Powered Structured Web Scraper

Auto-Parse is a developer-focused structured web scraper. You define a JSON schema visually, enter a target URL, and receive schema-validated JSON in real time, without maintaining brittle CSS selectors. A custom heuristic compression step removes more than 80% of DOM token noise, which saves context-window space and API cost.


⚡ Tech Stack & Architecture

  • Backend: Python 3.10+, FastAPI (Asynchronous High-Performance API), SQLAlchemy + SQLite (Job History & State Management)
  • Scraper Engine: Playwright (Headless Chromium Browser for dynamic rendering) + HTTPX (Fast static page retriever fallback)
  • Parser Engine: BeautifulSoup4 (HTML parsing) + Markdownify (Noise cleaning & markdown compression)
  • AI Core: Gemini 2.5 Flash (google-genai SDK) utilizing strict JSON response schemas (response_schema)
  • Frontend: Single-Page App (SPA) built using Next.js (App Router), TypeScript, Tailwind CSS, and Lucide React. The design system is themed under "Technical Brutalism & Data Precision" (obsidian #0e0e0e background, vibrant cyan/blue neon boundaries, sharp 0px corners, Space Grotesk header fonts, and monospaced live terminal feeds).

🛠️ Project Structure

ScrapeBot/
├── backend/
│   ├── .env                 # Environment secrets (GEMINI_API_KEY)
│   ├── config.py            # Configuration loader
│   ├── database.py          # SQLAlchemy SQLite connection and session makers
│   ├── main.py              # FastAPI server routes & background tasks orchestrator
│   ├── models.py            # SQLAlchemy database tables (Job, ExtractedData)
│   ├── parser.py            # bs4 DOM Cleaner and Gemini structured AI Extractor
│   ├── schemas.py           # Pydantic V2 Request & Response models
│   ├── scraper.py           # Playwright Chromium and HTTPX fetch mechanisms
│   └── data/                # Local database folder
│       └── autoparse.db     # Autogenerated SQLite database
├── frontend/
│   ├── package.json         # Node.js project dependencies
│   ├── next.config.ts       # Next.js workspace configurations
│   ├── src/                 # Application source code
│   │   ├── app/             # Page route and global layout definition
│   │   └── components/      # UI components (Sidebar, Overview, NewExtraction, Explorer, Settings)
│   └── public/              # Static assets
└── README.md                # System Setup and Deployment Guide

🚀 Installation & Local Setup

Follow these steps to set up and run Auto-Parse on your local machine:

1. Prerequisites

  • Python 3.10+ (Conda Virtual Environment recommended)
  • Node.js 18.0+ and npm package manager
  • Gemini API Key from Google AI Studio

2. Set Up the Environment

Create and activate a Python virtual environment:

# Using Conda:
conda create -n auto-parse-env python=3.10 -y
conda activate auto-parse-env

# Or using venv:
python -m venv venv
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

3. Install Dependencies

With the virtual environment active, install the required Python dependencies from the root requirements.txt:

pip install -r requirements.txt

4. Install Playwright Browser Binaries

Playwright needs a headless browser binary to render JavaScript-driven pages. Install the Chromium binary:

playwright install chromium

5. Configure Environment Variables

Copy the .env.example template to .env inside the backend/ directory and add your Gemini API key. The same command works in Windows PowerShell (where cp is an alias for Copy-Item), on Linux/macOS, and in Git Bash:

cp backend/.env.example backend/.env

Open backend/.env and update the values:

GEMINI_API_KEY=your_gemini_api_key_here
PORT=8000
HOST=127.0.0.1
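
The README does not show how backend/config.py reads these values, so the following is only a minimal sketch of one way to do it with the standard library. The names `parse_env_text`, `Settings`, and `settings_from` are hypothetical, and the real loader may well use a library instead.

```python
from dataclasses import dataclass
from typing import Mapping


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


@dataclass(frozen=True)
class Settings:
    # Hypothetical container; field names mirror the keys in backend/.env.
    gemini_api_key: str
    host: str = "127.0.0.1"
    port: int = 8000


def settings_from(values: Mapping[str, str]) -> Settings:
    """Build Settings, rejecting a missing or placeholder API key."""
    key = values.get("GEMINI_API_KEY", "")
    if not key or key == "your_gemini_api_key_here":
        raise ValueError("GEMINI_API_KEY is not set")
    return Settings(
        gemini_api_key=key,
        host=values.get("HOST", "127.0.0.1"),
        port=int(values.get("PORT", "8000")),
    )
```

Failing fast on the placeholder key surfaces a misconfigured .env at startup rather than on the first extraction request.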

🏁 Running the Application

Auto-Parse runs in a decoupled architecture where the Backend API and Frontend are served on separate local ports.

1. Start the Backend API Server

Navigate to the root directory and start the FastAPI server:

conda activate auto-parse-env
python -m backend.main

The server will start at http://127.0.0.1:8000 and automatically initialize/migrate the local SQLite database at backend/data/autoparse.db.

2. Start the Frontend Server

Navigate to the frontend/ directory, install dependencies, and launch the Next.js dev server:

# Navigate to the frontend
cd frontend

# Install Node modules
npm install

# Run the Next.js dev server
npm run dev

The user interface will be available in your browser at http://localhost:3000.


📈 System Features Walkthrough

1. Technical Brutalism Telemetry HUD

The dashboard exposes real-time pipeline status cards:

  • Success Rate: Share of extraction jobs that complete with schema-validated output.
  • Average Latency: Average end-to-end pipeline time, covering browser load, DOM cleanup, and LLM processing.
  • Scrape Pipelines: Number of executions stored in your local SQLite database.
  • Tokens Saved: Raw DOM tokens removed by the semantic HTML converter.
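
As a rough illustration of how the four cards could be derived from stored job rows, here is a sketch under assumptions. `JobRecord` and its fields are hypothetical and are not the project's actual `Job` model.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    # Hypothetical shape of one stored job; the real table may differ.
    status: str             # e.g. "completed" or "failed"
    latency_ms: float       # browser load + DOM cleanup + LLM time
    raw_tokens: int         # tokens in the raw DOM
    compressed_tokens: int  # tokens after cleaning


def summarize(jobs: list[JobRecord]) -> dict[str, float]:
    """Aggregate the dashboard metrics from a list of job records."""
    total = len(jobs)
    if total == 0:
        return {"success_rate": 0.0, "average_latency_ms": 0.0,
                "scrape_pipelines": 0, "tokens_saved": 0}
    completed = sum(1 for job in jobs if job.status == "completed")
    return {
        "success_rate": completed / total,
        "average_latency_ms": sum(job.latency_ms for job in jobs) / total,
        "scrape_pipelines": total,
        # Clamp at zero so a job that grew in size never counts as negative savings.
        "tokens_saved": sum(max(job.raw_tokens - job.compressed_tokens, 0)
                            for job in jobs),
    }
```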

2. Visual Schema Builder

Build schema constraints inside the configuration workspace with dynamic field rows (fields are custom-validated to avoid naming collisions and special characters):

  • Define Field Name (camelCase/snake_case)
  • Choose Data Type (String, Number, Boolean, Array of Strings)
  • Write a Description that tells the Gemini model exactly which content on the web page the field should capture.
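
The field rows above map naturally onto a JSON Schema-style object. The sketch below shows one possible translation, including the name checks for collisions and special characters. `TYPE_MAP`, `NAME_RE`, and `build_schema` are hypothetical names, and the README does not show how the project converts this into Gemini's `response_schema`.

```python
import re

# Assumed mapping from the UI's data types to JSON Schema fragments.
TYPE_MAP = {
    "String": {"type": "string"},
    "Number": {"type": "number"},
    "Boolean": {"type": "boolean"},
    "Array of Strings": {"type": "array", "items": {"type": "string"}},
}

# Accepts camelCase and snake_case identifiers; rejects spaces and symbols.
NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_schema(rows: list[tuple[str, str, str]]) -> dict:
    """Turn (name, data_type, description) rows into an object schema."""
    properties: dict[str, dict] = {}
    for name, data_type, description in rows:
        if not NAME_RE.fullmatch(name):
            raise ValueError(f"invalid field name: {name!r}")
        if name in properties:
            raise ValueError(f"duplicate field name: {name!r}")
        if data_type not in TYPE_MAP:
            raise ValueError(f"unsupported data type: {data_type!r}")
        properties[name] = {**TYPE_MAP[data_type], "description": description}
    return {
        "type": "object",
        "properties": properties,
        "required": list(properties),
    }
```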

3. Code & Log Terminal Console

  • Watch live-streaming crawler actions through a simulated monospaced console interface.
  • View DOM compression ratios (e.g. a 9,000-character DOM reduced by roughly 82% to 1,600 characters of Markdown context).
  • View the syntax-highlighted JSON result when the job completes, with immediate Copy/Download options.
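
To make the compression figure concrete, here is a simplified sketch of noise stripping and ratio calculation. It deliberately swaps in Python's built-in `html.parser` for the project's BeautifulSoup4 + Markdownify pipeline so that it runs without extra dependencies. `NOISE_TAGS`, `clean_html`, and `reduction_percent` are hypothetical names, and the project's actual heuristics are not documented here.

```python
from html.parser import HTMLParser

# Assumed set of tags whose contents carry no extractable data.
NOISE_TAGS = {"script", "style", "noscript", "svg", "iframe", "template"}


class _TextExtractor(HTMLParser):
    """Collect visible text while skipping everything inside noise tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.chunks.append(text)


def clean_html(html: str) -> str:
    """Return the visible text of a page, one text node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "\n".join(parser.chunks)


def reduction_percent(raw_chars: int, cleaned_chars: int) -> float:
    """Percentage of characters removed by cleaning."""
    if raw_chars <= 0:
        return 0.0
    return round((1 - cleaned_chars / raw_chars) * 100, 1)
```

Calling `reduction_percent(len(html), len(clean_html(html)))` on a fetched page gives the kind of ratio the console reports, measured in characters rather than model tokens.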

4. Developer API Snippet Generator

Exposes standard endpoints for third-party scripts. Every job configuration generates:

  • Ready-to-copy cURL request blocks.
  • Ready-to-copy Python requests scripts with pre-filled payloads, so developers can integrate structured scraping into their own data pipelines.
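
A generator like this can be approximated in a few lines. The sketch below is illustrative only: the endpoint path `/api/extract` and the payload keys are assumptions, since the README does not document the backend routes, and `build_curl` and `build_python_snippet` are hypothetical helper names.

```python
import json
import pprint
import shlex

# Hypothetical endpoint; check the FastAPI routes in backend/main.py for the real path.
ENDPOINT = "http://127.0.0.1:8000/api/extract"


def build_curl(endpoint: str, payload: dict) -> str:
    """Render a POSIX-shell cURL command that posts the payload as JSON."""
    body = json.dumps(payload)
    return (
        f"curl -X POST {shlex.quote(endpoint)} "
        "-H 'Content-Type: application/json' "
        f"-d {shlex.quote(body)}"
    )


def build_python_snippet(endpoint: str, payload: dict) -> str:
    """Render a small script that uses the third-party requests library."""
    # pprint emits Python literals (True/False/None), unlike json.dumps.
    literal = pprint.pformat(payload, sort_dicts=False)
    return (
        "import requests\n\n"
        f"payload = {literal}\n\n"
        f"response = requests.post({endpoint!r}, json=payload, timeout=60)\n"
        "response.raise_for_status()\n"
        "print(response.json())\n"
    )
```

Quoting through `shlex.quote` keeps the generated cURL command safe to paste even when a field description contains quotes or spaces.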

🛠️ Verification & End-to-End Scrape Test

You can test the scraper against a public sandbox site such as books.toscrape.com:

  1. Enter seed URL: https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
  2. Add schema fields:
    • book_title (String, "The full title of the book")
    • price (Number, "The price of the book as a floating number")
    • stock_availability (String, "Stock status, e.g. In stock or Out of stock")
  3. Toggle Dynamic Render (Playwright) to ON.
  4. Click LAUNCH_EXTRACTION.
  5. Switch to DATA & LOGS to view the live streaming logs, the DOM compression ratio, and the final validated JSON returned by Gemini.
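
If you copy the resulting JSON out of the UI, a quick script can confirm it matches the three fields defined in step 2. This is a sketch with a hypothetical helper, `check_result`, and the sample values in the usage note are placeholders rather than real scraper output.

```python
# Expected Python types for the fields defined in the walkthrough.
EXPECTED_TYPES = {
    "book_title": str,
    "price": (int, float),
    "stock_availability": str,
}


def check_result(result: dict) -> list[str]:
    """Return a list of problems; an empty list means the result conforms."""
    problems: list[str] = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in result:
            problems.append(f"missing field: {name}")
            continue
        value = result[name]
        # bool is a subclass of int, so reject it explicitly for Number fields.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    for name in result:
        if name not in EXPECTED_TYPES:
            problems.append(f"unexpected field: {name}")
    return problems
```

For example, `check_result({"book_title": "Example Title", "price": 12.5, "stock_availability": "In stock"})` returns an empty list, while a result with `price` as a string is reported as a type mismatch.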
