ScrapeBot: AI-Powered Structured Web Scraper

Auto-Parse is a modern, developer-centric structured web scraper. It allows users to visually define a JSON schema, enter a target URL, and retrieve schema-validated JSON data in real time, all without managing brittle CSS selectors. It reduces DOM token noise by >80% using custom heuristic token compression, saving context window space and API cost.

⚡ Tech Stack & Architecture

Backend: Python 3.10+, FastAPI (Asynchronous High-Performance API), SQLAlchemy + SQLite (Job History & State Management)
Scraper Engine: Playwright (Headless Chromium Browser for dynamic rendering) + HTTPX (Fast static page retriever fallback)
Parser Engine: BeautifulSoup4 (HTML parsing) + Markdownify (Noise cleaning & markdown compression)
AI Core: Gemini 2.5 Flash (google-genai SDK) utilizing strict JSON response schemas (response_schema)
Frontend: Single-Page App (SPA) built using Next.js (App Router), TypeScript, Tailwind CSS, and Lucide React. The design system is themed under "Technical Brutalism & Data Precision" (obsidian #0e0e0e background, vibrant cyan/blue neon boundaries, sharp 0px corners, Space Grotesk header fonts, and monospaced live terminal feeds).
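
The noise-cleaning step attributed to the Parser Engine above can be illustrated with a small sketch. It deliberately uses the standard-library html.parser instead of BeautifulSoup4 and Markdownify so that it runs without extra dependencies; the NOISE_TAGS list and the compress_html name are assumptions for illustration, not the project's actual heuristics.

```python
from html.parser import HTMLParser

# Tags whose contents rarely carry extractable data (assumed list).
NOISE_TAGS = {"script", "style", "noscript", "svg", "nav", "footer"}


class NoiseStripper(HTMLParser):
    """Collects visible text while skipping everything inside noise tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a noise tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def compress_html(html: str) -> str:
    """Return the visible text of `html`, one chunk per line."""
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

A real pipeline would likely keep some structure (headings, links, tables) as Markdown rather than flattening everything to text, which is the role Markdownify plays in the stack above.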

🛠️ Project Structure

ScrapeBot/
├── backend/
│   ├── .env                 # Environment secrets (GEMINI_API_KEY)
│   ├── config.py            # Configuration loader
│   ├── database.py          # SQLAlchemy SQLite connection and session makers
│   ├── main.py              # FastAPI server routes & background tasks orchestrator
│   ├── models.py            # SQLAlchemy database tables (Job, ExtractedData)
│   ├── parser.py            # bs4 DOM Cleaner and Gemini structured AI Extractor
│   ├── schemas.py           # Pydantic V2 Request & Response models
│   ├── scraper.py           # Playwright Chromium and HTTPX fetch mechanisms
│   └── data/                # Local database folder
│       └── autoparse.db     # Autogenerated SQLite database
├── frontend/
│   ├── package.json         # Node.js project dependencies
│   ├── next.config.ts       # Next.js workspace configurations
│   ├── src/                 # Application source code
│   │   ├── app/             # Page route and global layout definition
│   │   └── components/      # UI components (Sidebar, Overview, NewExtraction, Explorer, Settings)
│   └── public/              # Static assets
├── requirements.txt         # Python dependencies (installed in step 3)
└── README.md                # System Setup and Deployment Guide

🚀 Installation & Local Setup

Follow these steps to set up and run Auto-Parse on your local machine:

1. Prerequisites

Python 3.10+ (Conda Virtual Environment recommended)
Node.js 18.0+ and npm package manager
Gemini API Key from Google AI Studio

2. Set Up the Environment

Create and activate a Python virtual environment:

# Using Conda:
conda create -n auto-parse-env python=3.10 -y
conda activate auto-parse-env

# Or using venv:
python -m venv venv
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

3. Install Dependencies

Inside your virtual environment, install the required Python dependencies from the root requirements.txt:

pip install -r requirements.txt

4. Install Playwright Browser Binaries

Playwright needs headless browser binaries to render JavaScript-driven pages. Install the Chromium binary:

playwright install chromium

5. Configure Environment Variables

Copy the template .env.example file to .env inside the backend/ directory and add your Gemini API Key:

# Works on Windows PowerShell (where cp is an alias for Copy-Item), Linux/macOS, and Git Bash:
cp backend/.env.example backend/.env

Open backend/.env and update the values:

GEMINI_API_KEY=your_gemini_api_key_here
PORT=8000
HOST=127.0.0.1
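
As a rough illustration of how a configuration loader such as backend/config.py might consume these three variables, here is a minimal sketch. The Settings class and load_settings function are hypothetical names, and the sketch assumes the variables are already present in the process environment (it does not parse the .env file itself).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    host: str
    port: int


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping, defaulting to the process environment."""
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY", "").strip()
    # Fail early if the key is absent or the template placeholder was left in.
    if not api_key or api_key == "your_gemini_api_key_here":
        raise RuntimeError("GEMINI_API_KEY is missing or still set to the placeholder")
    return Settings(
        gemini_api_key=api_key,
        host=env.get("HOST", "127.0.0.1"),
        port=int(env.get("PORT", "8000")),
    )
```

Accepting the environment as a parameter keeps the loader easy to test without touching real secrets.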

🏁 Running the Application

Auto-Parse runs in a decoupled architecture where the Backend API and Frontend are served on separate local ports.

1. Start the Backend API Server

From the repository root, activate your environment and start the FastAPI server:

conda activate auto-parse-env
python -m backend.main

The server will start at http://127.0.0.1:8000 and automatically initialize/migrate the local SQLite database at backend/data/autoparse.db.

2. Start the Frontend Server

Navigate to the frontend/ directory, install dependencies, and launch the Next.js dev server:

# Navigate to the frontend
cd frontend

# Install Node modules
npm install

# Run the Next.js dev server
npm run dev

The user interface will be available in your browser at http://localhost:3000.

📈 System Features Walkthrough

1. Technical Brutalism Telemetry HUD

The dashboard exposes real-time pipeline status cards:

Success Rate: Ratio of completed schema-validated extractions.
Average Latency: Pipeline latency tracking including browser load, DOM cleanup, and LLM processing times.
Scrape Pipelines: Number of executions stored in your local SQLite database.
Tokens Saved: Number of raw DOM tokens removed by the semantic HTML converter.
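
To make the four cards concrete, the following sketch shows one way such metrics could be derived from stored job records. The JobRecord fields and the status values are hypothetical; the real Job table in backend/models.py may be shaped differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobRecord:
    status: str                  # hypothetical values: "completed" or "failed"
    latency_ms: Optional[float]  # end-to-end time; None if the job never finished
    raw_tokens: int              # tokens in the raw DOM
    compressed_tokens: int       # tokens after cleaning


def summarize(jobs):
    """Aggregate a list of JobRecord objects into dashboard metrics."""
    total = len(jobs)
    completed = [j for j in jobs if j.status == "completed"]
    latencies = [j.latency_ms for j in jobs if j.latency_ms is not None]
    return {
        "scrape_pipelines": total,
        "success_rate": len(completed) / total if total else 0.0,
        "average_latency_ms": sum(latencies) / len(latencies) if latencies else 0.0,
        "tokens_saved": sum(j.raw_tokens - j.compressed_tokens for j in jobs),
    }
```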

2. Visual Schema Builder

Build schema constraints inside the configuration workspace with dynamic field rows (fields are custom-validated to avoid naming collisions and special characters):

Define a Field Name (camelCase or snake_case).
Choose a Data Type (String, Number, Boolean, Array of Strings).
Write a Description that tells the Gemini model exactly which content to target on the web page.
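
The field rows described above map naturally onto a JSON-Schema-style object. The sketch below shows one possible conversion, including the collision and special-character checks; the exact naming rule (FIELD_NAME) and the build_schema function are assumptions, not the validation the application actually performs.

```python
import re

# UI type label -> JSON Schema fragment (labels taken from the list above).
TYPE_MAP = {
    "String": {"type": "string"},
    "Number": {"type": "number"},
    "Boolean": {"type": "boolean"},
    "Array of Strings": {"type": "array", "items": {"type": "string"}},
}

# Letters, digits and underscores, not starting with a digit (assumed rule).
FIELD_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_schema(rows):
    """rows: iterable of (name, type_label, description) tuples."""
    properties = {}
    for name, type_label, description in rows:
        if not FIELD_NAME.match(name):
            raise ValueError(f"invalid field name: {name!r}")
        if name in properties:
            raise ValueError(f"duplicate field name: {name!r}")
        if type_label not in TYPE_MAP:
            raise ValueError(f"unknown data type: {type_label!r}")
        properties[name] = {**TYPE_MAP[type_label], "description": description}
    return {"type": "object", "properties": properties, "required": list(properties)}
```

Marking every field as required is a design choice made here for simplicity; an optional-field toggle would only need to filter the "required" list.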

3. Code & Log Terminal Console

Watch live-streaming crawler actions in a simulated monospaced console.
View DOM compression ratios (e.g. a 9,000-character DOM reduced by roughly 82% to 1,600 characters of Markdown context).
See the syntax-highlighted JSON result when the job completes, with immediate Copy/Download options.
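
The compression ratio reported in the console is simple arithmetic: the share of raw characters removed by cleaning. A minimal sketch, with hypothetical function names and a made-up log format:

```python
def compression_ratio(raw_chars: int, compressed_chars: int) -> float:
    """Fraction of the raw DOM removed by cleaning, between 0.0 and 1.0."""
    if raw_chars <= 0:
        return 0.0
    return 1.0 - compressed_chars / raw_chars


def format_log_line(raw_chars: int, compressed_chars: int) -> str:
    """Render a console-style summary line (format is illustrative only)."""
    pct = compression_ratio(raw_chars, compressed_chars) * 100
    return f"DOM {raw_chars:,} chars -> {compressed_chars:,} chars ({pct:.0f}% reduction)"


print(format_log_line(10_000, 2_000))
# prints: DOM 10,000 chars -> 2,000 chars (80% reduction)
```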

4. Developer API Snippet Generator

Exposes standard endpoints for third-party scripts. Every job configuration generates:

Ready-to-copy cURL request blocks.
Ready-to-copy Python requests scripts with the job's payload filled in, so developers can integrate structured scraping into their own data pipelines.
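
A snippet generator of this kind can be sketched in a few lines. Note that the endpoint path "/api/extract" and the payload keys ("url", "fields", "dynamic_render") are hypothetical placeholders; check the routes in backend/main.py for the real ones.

```python
import json
import shlex


def build_curl(base_url: str, url: str, fields: list, dynamic_render: bool) -> str:
    """Return a shell-safe cURL command for one job configuration."""
    # Endpoint path and payload keys below are assumptions for illustration.
    payload = {"url": url, "fields": fields, "dynamic_render": dynamic_render}
    body = json.dumps(payload)
    return " ".join([
        "curl", "-X", "POST",
        shlex.quote(f"{base_url}/api/extract"),
        "-H", shlex.quote("Content-Type: application/json"),
        "-d", shlex.quote(body),
    ])
```

Quoting each argument with shlex.quote keeps the generated command safe to paste even when descriptions contain quotes or spaces.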

🛠️ Verification & End-to-End Scrape Test

You can test the scraper with a mock target like books.toscrape.com:

Enter seed URL: https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
Add schema fields:
- book_title (String, "The full title of the book")
- price (Number, "The price of the book as a floating number")
- stock_availability (String, "Stock status, e.g. In stock or Out of stock")
Toggle Dynamic Render (Playwright) to ON.
Click LAUNCH_EXTRACTION.
Switch to DATA & LOGS to view the live-streaming logs, the DOM compression ratio, and the final schema-validated JSON returned by Gemini.
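
If you want to double-check a result outside the UI, a small type check against the three fields above is enough. This is a sketch under assumptions: validate_result is a hypothetical helper, and the sample dictionary contains made-up values, not real output from the tool.

```python
def validate_result(fields, result):
    """Return a list of problems; an empty list means the result matches."""
    checks = {
        "String": lambda v: isinstance(v, str),
        # bool is a subclass of int in Python, so exclude it explicitly.
        "Number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
        "Boolean": lambda v: isinstance(v, bool),
        "Array of Strings": lambda v: isinstance(v, list)
        and all(isinstance(i, str) for i in v),
    }
    problems = []
    for name, type_label in fields:
        if name not in result:
            problems.append(f"missing field: {name}")
        elif not checks[type_label](result[name]):
            problems.append(f"wrong type for {name}: expected {type_label}")
    return problems


FIELDS = [("book_title", "String"), ("price", "Number"), ("stock_availability", "String")]

# Made-up sample values for illustration, not real output from the tool.
sample = {"book_title": "Example Title", "price": 12.5, "stock_availability": "In stock"}
print(validate_result(FIELDS, sample))  # prints []
```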
