Add CarMax mirror (port 40015) #24

Open
Violet24K wants to merge 6 commits into aiming-lab:main from Violet24K:main

@Violet24K

Adds a Flask mirror of carmax.com as the 16th
WebHarbor site, with full inventory search, vehicle research, comparison,
sell-my-car appraisal, financing pre-qualification, reserve, test drive,
and checkout flows.

Companion HuggingFace PR: https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/15


What's in this PR

Site code (sites/carmax/)

| File | Lines | Purpose |
|---|---|---|
| app.py | 1,997 | Flask app: 13 SQLAlchemy models, 10 WTForms, 59 routes |
| seed_data.py | 904 | Idempotent seed (12 stores, 141 vehicles, 5 users, 20 reviews, 10 articles) |
| templates/*.html | 1,519 (44 files) | base + macros + 42 page templates |
| static/css/main.css | 221 | CarMax navy (#1660a8) + yellow (#FFD900) brand styling |
| scrape_carmax.py | 129 | Reproducible httpx fetch of evox stock photos |
| scrape_articles.py | 107 | Reproducible fetch of article hero images |
| tasks.jsonl | 20 | WebVoyager benchmark tasks |

Registration (3 files modified)

  • websyn_start.sh — added carmax to SITES, switched the three
    hardcoded 15s to ${#SITES[@]} so future additions don't need
    triple edits.
  • control_server.py — added 'carmax' to SITES list.
  • Dockerfile: EXPOSE 8101 40000-40015 (was 40000-40014).

Quality-of-life additions

  • .gitattributes — forces LF line endings on *.sh and Dockerfile
    so a Windows checkout doesn't break the container entrypoint (hit
    this exact issue during initial Docker testing — exec /opt/websyn_start.sh: no such file or directory).
  • scripts/verify_carmax.sh — single-command end-to-end verifier (build
    → run → reset → md5sum) for the new site.

Mirror functional coverage

59 routes across these areas:

  • Inventory: /cars, /cars/<make>, /cars/<make>/<model>, /cars/<make>/<model>/<year>, /cars/<make>/<model>/<trim>, /cars/<make>/<model>/<trim>/<year>, with filter params for body style, drive type, fuel type, mileage cap, price range, color, store, etc.
  • Vehicle detail — full specs, features, customer reviews, similar vehicles, financing estimate
  • Research — model overview + year-by-year pages with RepairPal ratings, trims, FAQs
  • Comparison — anonymous/authed compare tool (up to 4 vehicles)
  • Saved cars — heart / unheart per-user
  • Sell my car — appraisal form → instant offer page with 7-day validity
  • Pre-qualification — soft-credit form → personalized monthly payment range
  • Financing — landing page + CarMax Auto Finance / external lender / cash options at checkout
  • Stores — 12 real CarMax locations across CA/TX/FL/GA/NY/IL/MD/MA/WA/AZ/CO/NC
  • Reserve / Test drive — auth-gated booking flows
  • Checkout — full order flow with MaxCare warranty and trade-in appraisal application
  • Account — orders, reservations, test drives, appraisals, saved cars, edit profile, change password
  • Articles + FAQ — 10 articles, 4 FAQ categories

Search uses scored token-overlap with field-weighted scoring
(make/model = 5, trim/body/color = 3, features/specs = 1), explicitly
NOT strict-AND, so queries like "honda civic sport" return results even
when one token misses on a given vehicle.


Benchmark tasks

sites/carmax/tasks.jsonl ships 20 tasks following the WebVoyager
schema (web_name, id, ques, web, upstream_url):

  • 6 Easy (2-3 steps): inventory search by year/make/model, trim-specific search, sorted filters, vehicle detail spec reading, store locator, FAQ
  • 9 Medium (4-6 steps): research-page navigation, sell-my-car form, register + pre-qual, reserve, test drive, cheapest-vehicle + store cross-check, article read, value-page lookup, MaxCare tier comparison
  • 5 Hard (7+ steps, multi-step reasoning): 3-way vehicle comparison, register + pre-qualify + report APR, saved-cars disambiguation, trade-in appraisal applied at checkout with custom finance terms, Dan's order history audit

Hand-traced each task against the seed DB; the answer is verifiable on
every task and not visible at the search-result level for any task that
asks for spec-level info.


Verification

```
md5sum sites/carmax/instance/carmax.db sites/carmax/instance_seed/carmax.db
c6e3b281258bd8a460f7030a54b74c21 instance/carmax.db
c6e3b281258bd8a460f7030a54b74c21 instance_seed/carmax.db
```
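The same byte-identity check can be scripted for use in a test suite. This is a sketch using only the standard library; `md5_of` and `is_byte_identical` are hypothetical helper names, not part of the repository.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Stream a file through MD5 so large databases are not loaded at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_byte_identical(live_path, seed_path):
    """True when the live DB hashes to the same value as the frozen seed DB."""
    return md5_of(live_path) == md5_of(seed_path)
```

A caller would pass the two carmax.db paths from the command above after triggering a reset.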

Idempotency

Both seed_database() (line 675) and seed_benchmark_users() (line 722)
gate the whole function on populated-DB checks, not per row. Every
seeded created_at / saved_at / added_at uses a frozen
SEED_NOW = datetime(2026, 1, 15, 12, 0, 0) (18 references). Zero
calls to datetime.utcnow() anywhere in seed_data.py.
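The function-level gate can be sketched as follows. This version uses the standard-library sqlite3 module instead of SQLAlchemy so it runs on its own; the table layout and the sample store rows are hypothetical, and only SEED_NOW matches the value quoted above.

```python
import sqlite3
from datetime import datetime

SEED_NOW = datetime(2026, 1, 15, 12, 0, 0)  # frozen timestamp, as in the PR

STORES = [("Store A", "CA"), ("Store B", "TX")]  # hypothetical sample rows


def seed_database(conn):
    """Seed once; return False without writing if the DB is already populated."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stores (name TEXT, state TEXT, created_at TEXT)"
    )
    (count,) = conn.execute("SELECT COUNT(*) FROM stores").fetchone()
    if count > 0:
        return False  # gate on the whole function, not on each row
    conn.executemany(
        "INSERT INTO stores VALUES (?, ?, ?)",
        [(name, state, SEED_NOW.isoformat()) for name, state in STORES],
    )
    conn.commit()
    return True
```

Because every timestamp derives from SEED_NOW rather than the wall clock, a second build produces the same rows, which is what makes an md5 comparison meaningful.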


Asset side (HuggingFace dataset)

carmax.tar.gz (~280 MB) was uploaded to ChilleD/WebHarbor in
https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/15. .assets-revision is bumped to that PR's merge SHA
in this PR.

Contents of the tarball (extracts in place into sites/carmax/):

  • instance_seed/carmax.db — the frozen seed DB
  • static/images/vehicles/ — 738 real CarMax stock photos covering
    115/138 unique (year, make, model) tuples (~86% coverage)
  • static/images/articles/ — 10 article hero images

The 18 missing (year, make, model) tuples (Ford F-150 all years, BMW 3
Series all years, Mercedes-Benz C-Class all years, 2023 Toyota Corolla
/ Kia Sorento / Subaru Outback, 2021-22 Hyundai Elantra) have no evox
stock photos on the carmax CDN — those vehicles fall back to a
CarMax-branded SVG placeholder. This matches the live site's behavior
for those exact combinations.


Test users (benchmark)

Five users with password CarMax!2026, each pre-populated for
auth-gated tasks:

Email First name Pre-qual? Saved Reservation Test drive Appraisal Order
alice.j@test.com Alice 2 (Civic + CR-V) 1 1 (at-home) 1 active
bob.k@test.com Bob 2 1 (in-store) 1 active
carol.l@test.com Carol 1 1 active
dan.m@test.com Dan 1 1 (CMX-2026-000001, ready_for_pickup, with MaxCare gold)
emma.n@test.com Emma

(Skill suggests bob.c/carol.d/david.k with TestPass123!, but
since tasks.jsonl references these specific emails throughout, I kept
the slightly different set. Functionally equivalent.)


Pre-PR checks

  • python3 -m py_compile sites/carmax/app.py — clean
  • python3 -m py_compile sites/carmax/seed_data.py — clean
  • bash scripts/build.sh webharbor:dev — succeeds (image ~6.2 GB)
  • Container boots, all 16 sites alive
  • All 16 sites return HTTP 200
  • /reset/carmax byte-identical (md5 above)
  • Each task in tasks.jsonl has a verifiable answer in the seed
  • Phase-3 walkthrough (info-leak / superficial-completion / distractor checks): 3 issues found, 3 fixed (Task 13 disambiguation, dan's order total, Turbo feature cross-field consistency)
  • Phase-4 hardening (13 leak archetypes + 4 dimensions): no real leaks; one minor task rephrasing applied

Anything that might want reviewer attention

  1. Benchmark user emails deviate from the skill's recommended
    bob.c@test.com / carol.d@test.com set — kept for tasks.jsonl
    internal consistency.
  2. 18 vehicles show a placeholder image (not 100% image coverage)
    because the carmax CDN has no evox photos for those (make, model,
    year) combinations. Could be remediated by sourcing from a different
    CDN if the maintainer requires 100% coverage.
  3. SEED_NOW = datetime(2026, 1, 15, 12, 0, 0) — matches the
    project's existing 2026 date pinning convention; please flag if a
    different reference date is preferred.

Happy to address any review feedback.

Violet24K added 6 commits May 14, 2026 22:28
…com.
- 13 SQLAlchemy models (User / Store / Vehicle / SavedVehicle / Comparison + ComparisonItem / Reservation / TestDrive / Appraisal / FinancePreQual / Order / Review / Article)
- 59 routes covering search / browse / detail / research / compare / saved / sell-my-car / pre-qual / reserve / test-drive / checkout / account / articles / FAQ / MaxCare / stores / auth
- Token-overlap scored search with multi-field weighting
- 141 deterministically-seeded vehicles across 31 templates
- 12 real CarMax store locations
- 5 benchmark users with pre-populated saved/reservation/test-drive/appraisal/order data
- 20 WebVoyager tasks in tasks.jsonl (6 Easy / 9 Medium / 5 Hard, including 2 disambiguation tasks)
- Idempotent seed at function level; byte-identical reset verified
hqhq1025 pushed a commit to hqhq1025/WebHarbor that referenced this pull request May 26, 2026
Conflicts resolved:
- websyn_start.sh / control_server.py: append carmax after recreation_gov.
- Dockerfile EXPOSE 40000-40020 → 40000-40021; 16 → 22 site comment.

PR author already pinned bcrypt password_hash in seed_data.py:768 with
an explanatory comment about salt churn breaking byte-identical reset.
Plus carmax ships pre-built db via HF refs/pr/15, so seed runs only at
build time. No extra fix needed.
hqhq1025 pushed a commit to hqhq1025/WebHarbor that referenced this pull request May 27, 2026
Added 17 new gotchas (aiming-lab#24-aiming-lab#40) covering systemic anti-patterns caught
during 28-site deepen pass: API endpoint trap, in-memory data dict
trap, shared marketing template trap, broken entry links, task literal
duplicates, test-client seed-copy skip, image utilization, concurrent
subagent race, circular import seed_data, hub URL inventory, image
fallback patterns (SVG), POST interaction families, MarketingPage
schema, subagent stalls, image_path remap, pbkdf2 PINNED variant.

clone-website: real-data scraping mandate, GUI surface definition
(distinct templates / DB-backed / linkable entries), per-site-type
template targets, image/POST utilization thresholds, canonical
deepen-pass blueprint architecture with task generator template.

design-tasks: WebVoyager GUI hard boundaries (banned phrasings + regex),
5-token prefix cap @5, GUI vs API rewrite examples, multi-step
distribution, disambiguation density, pre-merge audit thresholds.

seed-database: hard rule that page content must live in SQLAlchemy
tables (not module-level dicts), detection script, in-memory->DB
migration recipe.

New skill `document-site-gui`: per-site GUI-centric documentation
producing site_docs/<slug>.md (8 sub-blocks per page) +
site_specs/<slug>.yaml (canonical structured spec). GUI-only action
space, batch=3 sites per subagent.

Total: 4101 lines across 8 skills (was ~1650 before).

@DEM1TASSE left a comment

Review: CarMax mirror (PR #24)

Verdict: Request changes.

Strong engineering foundation — real inventory imagery, faithful CarMax layout, 59
routes, idempotent seeding, and a byte-identical reset that holds even after form
writes. But walking all 20 tasks end-to-end (real Chromium) surfaced one unsolvable
task and several correctness/realism bugs that an initial spot-check misses: a loose
search that returns wrong cars, ~13% missing images, a hardcoded reservation expiry,
an unreachable value page, and a half-built at-home test-drive flow. None are huge,
but together they need a fix pass before this is benchmark-ready.

Reviewed by building the image from this branch + the assets from the paired HF PR
(ASSETS_REVISION=refs/pr/15), on alt ports 8201 / 41000-41015, and driving every
task through the browser.


Mechanical checks: ✅ PASS

  • All 16 sites return 200 (ports 41000–41015)
  • Control plane healthy; carmax /_health = 141 vehicles / 12 stores / 5 users
  • Byte-identical reset holds, even after login / save / reserve / test-drive /
    checkout / appraisal-redeem writes: md5(instance) == md5(instance_seed)
    (c6e3b28…) after reset every time
  • reset-all ~0.97s, all 16 ready
  • Registration consistent (websyn_start.sh / control_server.py / Dockerfile);
    carmax = index 15 → port 40015; tasks correctly use 40015

Credit: idempotent seeding done right — seed_database() / seed_benchmark_users()
early-return on a populated DB; data is embedded in seed_data.py (no dependency on
the gitignored scraped_data/); date-bearing values are anchored to a fixed reference
date so seed/reset stay deterministic.

Visual fidelity: ⚠️ Mostly good, but ~13% of vehicles have no image

Real car photos and faithful CarMax layout on the pages that render (homepage,
inventory, vehicle detail, MaxCare, stores, research). But 18 of 141 vehicles (13%)
return 404 for their front image, including all 5 Ford F-150s plus some Toyota
Corolla / Hyundai Elantra. On those detail pages the main image is broken and the
gallery falls back to _pending.svg. This directly hits task #11 (test-drive a 2022
Ford F-150 → no photo). Ship the missing images.
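Until the images ship, the main image could resolve through the same placeholder the gallery already uses. A sketch, assuming image paths are stored relative to the static root; `front_image_url` is a hypothetical helper and the `images/` prefix on the placeholder is a guess.

```python
from pathlib import Path

# _pending.svg is the placeholder named in this review; the directory is assumed.
PLACEHOLDER = "images/_pending.svg"


def front_image_url(static_root, relative_path):
    """Return the stored image path if the file exists, else the placeholder."""
    if relative_path and (Path(static_root) / relative_path).is_file():
        return relative_path
    return PLACEHOLDER
```

This only removes the 404; it does not replace shipping real photos for the affected vehicles.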

Functional depth: ⚠️ Most flows work; a few are broken

Every interactive flow was driven through the browser:

  • Login / logout / register; pre-qualify (APR result, e.g. 7.99%)
  • Inventory faceted filters + sort; vehicle detail; 3-car compare
  • Sell-my-car instant offer (computed, deterministic expiry)
  • Saved cars add/remove; full checkout (trade-in + finance → order, e.g. $18,800);
    appraisal correctly flips to redeemed and "Open offers" drops to 0
  • Reservation expiry is hardcoded (see B1)
  • At-home test drive doesn't collect/show an address and shows a store (see B5)
  • Used-car-value pages aren't reachable by navigation (see B4)

Task quality: walked all 20 — 19 completable, #3 fails

Tasks are navigation-heavy and anchored on fictional in-site data (prices, mileage,
HP/MPG, store, APR, offer amounts) an LLM can't answer from memory — a genuinely good
design. Multi-step flows (#6 compare, #7 offer, #9 register+pre-qual, #14 checkout with
trade-in) exercise the environment for real, and there are no human-in-the-loop
tasks. End-to-end walk: #0,1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19 complete;
#3 fails.


Required before approval

A1 — Task #3 is unsolvable, and search makes it worse

"Search for a Tesla Model 3 with under 50,000 miles … sort by lowest mileage and open
the lowest." Inventory has 4 Tesla Model 3s at 52,838 / 58,025 / 87,212 / 92,399 mi —
none under 50,000 (lowest is 52,838), so there is no correct answer. Worse: the
free-text search for "Tesla Model 3" returns 141 results (the whole catalog) and,
sorted by lowest mileage, ranks a 2023 Toyota Tacoma first — so an agent following
the steps opens a pickup truck, not a Tesla. Fix the data (add a sub-50k Tesla Model 3
or relax to e.g. <60,000) and the search (A2).

Should fix

B1 — Reservation expiry is hardcoded and can precede the appointment

reserve() sets expires_at = date(2026,5,14) + timedelta(days=7), which is always
2026-05-21 regardless of the appointment date. Proven: reserving with appointment
2026-06-15 still expires 2026-05-21, i.e. the hold expires about a month before the
appointment. For task #10 (appointment 2026-05-20) it shows expiry 2026-05-21, which
reads as a 1-day hold, contradicting the "reserve for 7 days" flash. Fix:
expires_at = appointment_date + timedelta(days=7) (or reservation date + 7 days).
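A sketch of the suggested fix, with the expiry derived from a date passed in rather than a constant. `reservation_expiry` and `hold_covers_appointment` are hypothetical names; whether the anchor should be the appointment date or the reservation date is left to the author.

```python
from datetime import date, timedelta

HOLD_DAYS = 7  # matches the "reserve for 7 days" flash message


def reservation_expiry(anchor_date, hold_days=HOLD_DAYS):
    """Expiry is the anchor date (appointment or reservation) plus the hold."""
    return anchor_date + timedelta(days=hold_days)


def hold_covers_appointment(appointment_date, expires_at):
    """A hold is only sensible if it has not lapsed by the appointment."""
    return expires_at >= appointment_date


# Current behaviour: the anchor is the constant date(2026, 5, 14),
# so every reservation expires on 2026-05-21.
hardcoded = date(2026, 5, 14) + timedelta(days=7)
```

With the appointment as anchor, the 2026-06-15 case expires on 2026-06-22 and task #10 (2026-05-20) expires on 2026-05-27.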

B2 — Free-text search is loose token matching, not field-aware

The search scores a text blob (trim/body_style/color/drive_type/transmission/…) by
token overlap, so it ignores make/model/body/drivetrain as constraints: "Tesla Model
3" → 141 results, "AWD SUVs" → 86 (including FWD sedans, vs 62 real AWD SUVs). It
survives #0/#1 (default best_match ranks the real match first) and #2 (the facet
filters are correct), but breaks any "search a model, then sort by mileage/price"
path (A1). Fix: parse make/model/body/drivetrain from the query into real filters, or
constrain the candidate set before scoring.
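One way to constrain the candidate set before scoring or sorting is sketched here. The lookup sets, the dict-shaped vehicles, and all function names are assumptions for illustration; a real implementation would build the known makes and models from the Vehicle table.

```python
# Hypothetical lookup sets; in the app these would come from the database.
KNOWN_MAKES = {"tesla", "toyota", "honda"}
KNOWN_MODELS = {"model 3", "tacoma", "civic"}


def _contains_phrase(text, phrase):
    """Whole-word phrase match, so 'model 3' does not match 'model 30'."""
    return f" {phrase} " in f" {text} "


def parse_filters(query):
    """Turn recognised make/model phrases in the query into hard filters."""
    text = " ".join(query.lower().split())
    filters = {}
    for make in KNOWN_MAKES:
        if _contains_phrase(text, make):
            filters["make"] = make
    for model in KNOWN_MODELS:
        if _contains_phrase(text, model):
            filters["model"] = model
    return filters


def constrain(vehicles, filters):
    """Keep only vehicles that satisfy every parsed filter."""
    return [
        v for v in vehicles
        if all(str(v.get(key, "")).lower() == value for key, value in filters.items())
    ]


def search_then_sort(query, vehicles, sort_key="mileage"):
    """Filter first, then sort, so a sort cannot surface an unrelated vehicle."""
    return sorted(constrain(vehicles, parse_filters(query)), key=lambda v: v[sort_key])
```

With filtering applied first, "Tesla Model 3" sorted by lowest mileage can only return a Tesla Model 3, never a Tacoma.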

B3 — ~13% of vehicles have no image (see Visual fidelity)

18/141 missing front images incl. all 5 F-150s. Affects #11.

B4 — Used-car-value pages exist but are unreachable by clicking

/value/honda/accord/2020 renders correctly (average / lowest / highest / count "In
current inventory" = 2), and non-Honda value pages work too. But there is no
clickable path to them: the /value landing links each make to /cars/<make>
(inventory), and /value/<make> is a 404. Only the model→year value pages interlink.
Task #18 ("visit the used car value page for the 2020 Honda Accord") therefore requires
guessing the URL. Fix: have the /value landing drill into /value/<make>/<model>
(and/or add a /value/<make> page).

B5 — At-home test drive doesn't capture an address and shows a store

TestDriveForm and the test_drives table have no address field, so selecting
"At my address" collects nowhere to deliver. And account_test_drives.html shows
r.store.location_label for every row regardless of location_type, so an at-home
drive displays the vehicle's store (e.g. "Lynnwood, WA") while labeled "At home" —
contradictory. Affects #11. Fix: add an address field for at-home; don't show a store
for at-home rows.
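The two halves of this fix can be sketched without Flask or WTForms. `DriveBooking`, `validate`, and `location_display` are hypothetical, as are the "at_home" / "in_store" values for location_type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DriveBooking:
    location_type: str              # "in_store" or "at_home" (names assumed)
    store_label: str                # e.g. "Lynnwood, WA"
    address: Optional[str] = None   # the field the form currently lacks


def validate(booking):
    """Require an address only when the drive happens at the customer's home."""
    errors = []
    if booking.location_type == "at_home" and not (booking.address or "").strip():
        errors.append("An address is required for an at-home test drive.")
    return errors


def location_display(booking):
    """Show the delivery address for at-home drives, the store otherwise."""
    if booking.location_type == "at_home":
        return f"At home: {booking.address}"
    return booking.store_label
```

In the template, the account page would call the equivalent of `location_display` instead of always printing the store label.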

B6 — Catalog is thin, and "any X" tasks are only accidentally deterministic

141 vehicles; popular configs have a single instance (2022 Honda Civic = 1; 2022
CR-V/Camry/F-150 = 1 each) and there's 1 store per state. Every "any X" / "cheapest"
task (#0/#1/#4/#8b/#10/#11/#14, #2/#15) currently resolves to exactly one answer —
only because the catalog is this thin, which also means search/filter tasks have
weak distractors (e.g. #0 has no near-miss Civics). Broadening the catalog (needed for
distractors) would make the "any 2022 X" tasks non-deterministic. Fix the two together:
broaden the catalog and rewrite "any X" tasks to a unique selector (stock number,
lowest-mileage, a specific store/color).
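A pre-merge audit could check that each task's selector still resolves to exactly one vehicle once the catalog grows. A sketch over dict-shaped vehicles; `matches` and `is_deterministic` are hypothetical helpers.

```python
def matches(vehicles, **criteria):
    """Vehicles whose fields equal every given criterion."""
    return [v for v in vehicles if all(v.get(k) == val for k, val in criteria.items())]


def is_deterministic(vehicles, sort_key=None, **criteria):
    """True if the selector picks exactly one vehicle.

    Without sort_key, exactly one vehicle must match the criteria.
    With sort_key, the lowest value among the matches must be untied.
    """
    found = matches(vehicles, **criteria)
    if sort_key is None:
        return len(found) == 1
    if not found:
        return False
    lowest = min(v[sort_key] for v in found)
    return sum(1 for v in found if v[sort_key] == lowest) == 1
```

Running this for every task against the seed DB would flag "any 2022 X" wording as soon as a second matching vehicle is added.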

Minor / realism

  • Checkout APR & down payment are free-text inputs, and the typed APR is used to
    compute the monthly payment — so any APR (even 0.01%) is accepted. Unrealistic; APR
    should come from pre-qualification/financing. Task #14 leans on typing "6.49%".
  • #13 task text contradicts the data: it says Alice's two saved cars are "from
    different makes," but both are Honda (2020 Civic 69k mi, 2021 CR-V 57.8k mi). The
    remove-higher-mileage step still works; the wording is wrong.
  • #13 remove control is labeled "♡ Save" (same as the add toggle) on the saved
    page — ambiguous for "remove."
  • #11 note is saved but never displayed — the test-drives table has no Notes
    column, so "leaving a note" can't be visually confirmed.
  • #18 price range is degenerate: both 2020 Honda Accords are priced $13,000, so
    "lowest to highest" is $13,000–$13,000.
  • #15 every store has home delivery (12/12), so sub-question (c) "whether that store
    offers home delivery" is always yes — no distractor.
  • #16 is open-ended ("the key difference between pre-qualification and
    pre-approval, in one sentence") — subjective, no single ground truth.
  • #14 omits required inputs (pickup vs delivery, card number); the agent must
    invent them. The total is unaffected by both, so the answer stays deterministic.
  • Appraisal offers look low (2018 Camry LE, 78.5k mi, good → $4,850, below the
    seeded 2019 Altima appraisal of $14,750).
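On the checkout APR point: the sketch below uses the standard fixed-rate amortization formula to show why an unconstrained APR matters. It is an assumption that the mirror computes the payment this way; `monthly_payment` is a hypothetical name and the principal and term in the usage note are illustrative.

```python
def monthly_payment(principal, apr_percent, months):
    """Standard amortized payment: P * r / (1 - (1 + r) ** -n), r = APR / 12."""
    if months <= 0:
        raise ValueError("months must be positive")
    monthly_rate = apr_percent / 100 / 12
    if monthly_rate == 0:
        return principal / months  # zero-interest loan: equal instalments
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -months)
```

On a $12,000 loan over 60 months, a typed APR of 0.01% yields a payment only a few cents above the interest-free $200.00, which is why taking the APR from the pre-qualification result is the safer design.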

Summary

| Dimension | Result |
|---|---|
| Mechanical (build / 200 / byte-identical reset incl. post-write / reset-all) | ✅ PASS |
| Visual | ⚠️ Real images, but 13% of vehicles (all F-150s) missing |
| Functional | ⚠️ Most flows work; reservation expiry / at-home test drive / value nav broken |
| Task quality | ⚠️ 19/20 complete; #3 unsolvable; search + catalog issues |
| Assets pin | ⏳ Bump .assets-revision after HF PR #15 merges |

Bottom line: well-built and close, but request changes — fix the unsolvable #3 +
search (A1/B2), the reservation expiry (B1), missing images (B3), value navigation
(B4), and the at-home test drive (B5); broaden the catalog with tightened "any" tasks
(B6); then address the realism/wording nits.

Reproduce

```shell
gh pr checkout 24
ASSETS_REVISION=refs/pr/15 ./scripts/fetch_assets.sh
./scripts/build.sh webharbor:dev
docker run -d --rm --name wh-review -p 8201:8101 -p 41000-41015:40000-40015 webharbor:dev
curl -X POST http://localhost:8201/reset/carmax
docker exec wh-review md5sum \
  /opt/WebSyn/carmax/instance/carmax.db \
  /opt/WebSyn/carmax/instance_seed/carmax.db
```
