Auto-Parse is a modern, developer-centric structured web scraper. It allows users to visually define a JSON schema, enter a target URL, and retrieve schema-validated JSON data in real time, all without managing brittle CSS selectors. It reduces DOM token noise by >80% using custom heuristic token compression, saving context window space and API cost.
- Backend: Python 3.10+, FastAPI (Asynchronous High-Performance API), SQLAlchemy + SQLite (Job History & State Management)
- Scraper Engine: Playwright (Headless Chromium Browser for dynamic rendering) + HTTPX (Fast static page retriever fallback)
- Parser Engine: BeautifulSoup4 (HTML parsing) + Markdownify (Noise cleaning & markdown compression)
- AI Core: Gemini 2.5 Flash (google-genai SDK) utilizing strict JSON response schemas (response_schema)
- Frontend: Single-Page App (SPA) built using Next.js (App Router), TypeScript, Tailwind CSS, and Lucide React. The design system is themed under "Technical Brutalism & Data Precision" (obsidian #0e0e0e background, vibrant cyan/blue neon boundaries, sharp 0px corners, Space Grotesk header fonts, and monospaced live terminal feeds).
ScrapeBot/
├── backend/
│ ├── .env # Environment secrets (GEMINI_API_KEY)
│ ├── config.py # Configuration loader
│ ├── database.py # SQLAlchemy SQLite connection and session makers
│ ├── main.py # FastAPI server routes & background tasks orchestrator
│ ├── models.py # SQLAlchemy database tables (Job, ExtractedData)
│ ├── parser.py # bs4 DOM Cleaner and Gemini structured AI Extractor
│ ├── schemas.py # Pydantic V2 Request & Response models
│ ├── scraper.py # Playwright Chromium and HTTPX fetch mechanisms
│ └── data/ # Local database folder
│ └── autoparse.db # Autogenerated SQLite database
├── frontend/
│ ├── package.json # Node.js project dependencies
│ ├── next.config.ts # Next.js workspace configurations
│ ├── src/ # Application source code
│ │ ├── app/ # Page route and global layout definition
│ │ └── components/ # UI components (Sidebar, Overview, NewExtraction, Explorer, Settings)
│ └── public/ # Static assets
└── README.md # System Setup and Deployment Guide
Follow these steps to set up and run Auto-Parse on your local machine:
- Python 3.10+ (Conda Virtual Environment recommended)
- Node.js 18.0+ and npm package manager
- Gemini API Key from Google AI Studio
Create and activate a Python virtual environment:
# Using Conda:
conda create -n auto-parse-env python=3.10 -y
conda activate auto-parse-env
# Or using venv:
python -m venv venv
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
Install all required Python dependencies using the root requirements.txt inside your virtual environment:
pip install -r requirements.txt
Playwright requires headless browser binaries to render dynamic JavaScript. Initialize the Chromium binary:
playwright install chromium
Copy the template .env.example file to .env inside the backend/ directory and add your Gemini API Key:
# On Windows PowerShell:
cp backend/.env.example backend/.env
# On Linux/macOS or Git Bash:
cp backend/.env.example backend/.env
Open backend/.env and update the values:
GEMINI_API_KEY=your_gemini_api_key_here
PORT=8000
HOST=127.0.0.1
Auto-Parse runs in a decoupled architecture where the Backend API and Frontend are served on separate local ports.
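The values in backend/.env could be read at startup along these lines. This is a minimal sketch using only the standard library, not the contents of config.py: the `Settings` class and `load_settings` function are hypothetical names, and it assumes the variables have already been exported into the process environment.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # Hypothetical container; the real config.py may be organized differently.
    gemini_api_key: str
    host: str = "127.0.0.1"
    port: int = 8000


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    # Reject both a missing key and the untouched template placeholder.
    if not key or key == "your_gemini_api_key_here":
        raise ValueError("GEMINI_API_KEY is missing or still set to the placeholder")
    return Settings(
        gemini_api_key=key,
        host=env.get("HOST", "127.0.0.1"),
        port=int(env.get("PORT", "8000")),
    )
```

Failing fast on a missing or placeholder key surfaces a configuration mistake before the first extraction job is launched.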
Navigate to the root directory and start the FastAPI server:
conda activate auto-parse-env
python -m backend.main
The server will start at http://127.0.0.1:8000 and automatically initialize/migrate the local SQLite database at backend/data/autoparse.db.
Navigate to the frontend/ directory, install dependencies, and launch the Next.js dev server:
# Navigate to the frontend
cd frontend
# Install Node modules
npm install
# Run the Next.js dev server
npm run dev
The user interface will be available in your browser at http://localhost:3000.
The dashboard exposes real-time pipeline status cards:
- Success Rate: Ratio of completed schema-validated extractions.
- Average Latency: Pipeline latency tracking including browser load, DOM cleanup, and LLM processing times.
- Scrape Pipelines: Number of executions stored in your local SQLite database.
- Tokens Saved: Metric showing raw DOM tokens shaved using our semantic HTML converter.
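The four cards above reduce to simple aggregates over stored jobs. The following is a minimal sketch, assuming a hypothetical `JobRecord` with `status`, `latency_ms`, `raw_tokens`, and `clean_tokens` fields; the actual columns in models.py may be named and typed differently.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    # Hypothetical fields; the real Job table may use different names.
    status: str        # e.g. "completed" or "failed"
    latency_ms: float  # browser load + DOM cleanup + LLM processing
    raw_tokens: int    # tokens in the raw DOM
    clean_tokens: int  # tokens left after Markdown conversion


def dashboard_metrics(jobs: list[JobRecord]) -> dict:
    """Compute the four dashboard figures from a list of job records."""
    total = len(jobs)
    if total == 0:
        return {
            "success_rate": 0.0,
            "average_latency_ms": 0.0,
            "scrape_pipelines": 0,
            "tokens_saved": 0,
        }
    completed = sum(1 for job in jobs if job.status == "completed")
    return {
        "success_rate": completed / total,
        "average_latency_ms": sum(job.latency_ms for job in jobs) / total,
        "scrape_pipelines": total,
        # Clamp at zero so a job that grew after cleaning cannot reduce the total.
        "tokens_saved": sum(max(job.raw_tokens - job.clean_tokens, 0) for job in jobs),
    }
```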
Build schema constraints inside the configuration workspace with dynamic field rows (fields are custom-validated to avoid naming collisions and special characters):
- Define Field Name (camelCase/snake_case)
- Choose Data Type (String, Number, Boolean, Array of Strings)
- Detail Description Context to guide the Gemini model on exactly what selector/content to target on the web page.
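The field rows described above can be thought of as compiling into a JSON Schema object. The sketch below is one way to do that under stated assumptions: `TYPE_MAP`, `FIELD_NAME`, and `build_schema` are hypothetical names, and the validation rules (a letter first, then letters, digits, or underscores; no duplicate names) are a guess at what "naming collisions and special characters" covers.

```python
import re

# Assumed mapping from the UI's data types to JSON Schema types.
TYPE_MAP = {
    "String": {"type": "string"},
    "Number": {"type": "number"},
    "Boolean": {"type": "boolean"},
    "Array of Strings": {"type": "array", "items": {"type": "string"}},
}

# camelCase or snake_case: a letter first, then letters, digits, or underscores.
FIELD_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def build_schema(fields: list[tuple[str, str, str]]) -> dict:
    """Turn (name, data_type, description) rows into a JSON Schema object."""
    properties = {}
    for name, data_type, description in fields:
        if not FIELD_NAME.fullmatch(name):
            raise ValueError(f"invalid field name: {name!r}")
        if name in properties:
            raise ValueError(f"duplicate field name: {name!r}")
        if data_type not in TYPE_MAP:
            raise ValueError(f"unsupported data type: {data_type!r}")
        properties[name] = {**TYPE_MAP[data_type], "description": description}
    # Every declared field is marked required so missing values are caught.
    return {"type": "object", "properties": properties, "required": list(properties)}
```

Carrying the description into the schema keeps the guidance for the model next to the field it applies to.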
- Watch live-streaming crawler actions through a simulated monospaced console interface.
- View DOM compression ratios (e.g. a 9,000-character DOM reduced by about 82% to 1,600 characters of Markdown context).
- Display syntax-highlighted structural JSON results when complete, with immediate Copy/Download options.
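To make the compression ratio concrete, here is a rough sketch of how noise removal and the ratio calculation could work. It deliberately swaps in the standard library's `html.parser` for the BeautifulSoup4 + Markdownify pipeline the project uses, so it only approximates the real cleaner; `SKIPPED_TAGS`, `TextExtractor`, and `compress` are hypothetical names.

```python
from html.parser import HTMLParser

# Tags whose contents carry no extractable text (an assumed list).
SKIPPED_TAGS = {"script", "style", "noscript", "svg"}


class TextExtractor(HTMLParser):
    """Collect visible text while skipping script/style-like subtrees."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            # Collapse runs of whitespace inside each text node.
            self.chunks.append(" ".join(data.split()))


def compress(html: str) -> tuple[str, float]:
    """Return the cleaned text and the fraction of characters removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    cleaned = "\n".join(parser.chunks)
    ratio = 1 - len(cleaned) / len(html) if html else 0.0
    return cleaned, ratio
```

The ratio here is measured in characters, matching the example figures above; a token-based figure would need a tokenizer and may differ.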
Auto-Parse exposes standard endpoints for third-party scripts. Every job configuration generates:
- Ready-to-copy cURL request blocks.
- Ready-to-copy Python requests scripts with visual payloads so developers can integrate structured scraping into their own data pipelines.
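As an illustration of what such an integration script might look like, the sketch below builds a POST request with the standard library's `urllib` (swapped in for `requests` to stay dependency-free). The route `/api/extract` and the payload keys `url`, `dynamic_render`, and `fields` are assumptions made for this example; use the snippet generated by the app or the routes in main.py for the real contract.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # backend address from the setup steps
EXTRACT_PATH = "/api/extract"        # hypothetical route, not confirmed by the project


def build_extraction_request(url, fields, dynamic_render=True):
    """Build (but do not send) a JSON POST request for an extraction job."""
    # Hypothetical payload keys; the real request model lives in schemas.py.
    payload = {"url": url, "dynamic_render": dynamic_render, "fields": fields}
    return urllib.request.Request(
        BASE_URL + EXTRACT_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending the request requires the backend to be running:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```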
You can test the scraper with a mock target like books.toscrape.com:
- Enter seed URL: https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
- Add schema fields:
  - book_title (String, "The full title of the book")
  - price (Number, "The price of the book as a floating number")
  - stock_availability (String, "Stock status, e.g. In stock or Out of stock")
- Toggle Dynamic Render (Playwright) to ON.
- Click LAUNCH_EXTRACTION.
- Switch to DATA & LOGS to view the live streaming logs, the DOM compression ratio, and the final schema-validated JSON returned by Gemini.
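Once the job finishes, the downloaded JSON can be sanity-checked against the three fields defined in this walkthrough. The helper below is a small sketch for that purpose, not part of Auto-Parse; `EXPECTED_TYPES` and `validate_result` are hypothetical names, and it assumes the result is a flat object keyed by field name.

```python
# Python types expected for the three fields in this walkthrough.
EXPECTED_TYPES = {
    "book_title": str,
    "price": (int, float),
    "stock_availability": str,
}


def validate_result(result: dict) -> list[str]:
    """Return a list of problems; an empty list means the result looks valid."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in result:
            problems.append(f"missing field: {name}")
            continue
        value = result[name]
        # bool is a subclass of int, so reject it explicitly for these fields.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    return problems
```

A price returned as a string such as "£51.77" would be reported as a type problem, which is a quick way to confirm that the Number type was honored.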