Add CarMax mirror (port 40015) by Violet24K · Pull Request #24 · aiming-lab/WebHarbor

Violet24K · 2026-05-15T17:12:56Z

Adds a Flask mirror of carmax.com as the 16th
WebHarbor site, with full inventory search, vehicle research, comparison,
sell-my-car appraisal, financing pre-qualification, reserve, test drive,
and checkout flows.

Companion HuggingFace PR: https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/15

What's in this PR

Site code (`sites/carmax/`)

| File | Lines | Purpose |
| --- | --- | --- |
| `app.py` | 1,997 | Flask app: 13 SQLAlchemy models, 10 WTForms, 59 routes |
| `seed_data.py` | 904 | Idempotent seed (12 stores, 141 vehicles, 5 users, 20 reviews, 10 articles) |
| `templates/*.html` | 1,519 (44 files) | base + macros + 42 page templates |
| `static/css/main.css` | 221 | CarMax navy (`#1660a8`) + yellow (`#FFD900`) brand styling |
| `scrape_carmax.py` | 129 | Reproducible httpx fetch of evox stock photos |
| `scrape_articles.py` | 107 | Reproducible fetch of article hero images |
| `tasks.jsonl` | 20 | WebVoyager benchmark tasks |

Registration (3 files modified)

- websyn_start.sh: added carmax to SITES, switched the three hardcoded 15s to ${#SITES[@]} so future additions don't need triple edits.
- control_server.py: added 'carmax' to SITES list.
- Dockerfile: EXPOSE 8101 40000-40015 (was 40000-40014).

Quality-of-life additions

- .gitattributes: forces LF line endings on *.sh and Dockerfile so a Windows checkout doesn't break the container entrypoint (hit this exact issue during initial Docker testing: exec /opt/websyn_start.sh: no such file or directory).
- scripts/verify_carmax.sh: single-command end-to-end verifier (build → run → reset → md5sum) for the new site.

Mirror functional coverage

59 routes across these areas:

Inventory — /cars, /cars/<make>, /cars/<make>/<model>, /cars/<make>/<model>/<year>, /cars/<make>/<model>/<trim>, /cars/<make>/<model>/<trim>/<year>, with filter params for body style, drive type, fuel type, mileage cap, price range, color, store, etc.
Vehicle detail — full specs, features, customer reviews, similar vehicles, financing estimate
Research — model overview + year-by-year pages with RepairPal ratings, trims, FAQs
Comparison — anonymous/authed compare tool (up to 4 vehicles)
Saved cars — heart / unheart per-user
Sell my car — appraisal form → instant offer page with 7-day validity
Pre-qualification — soft-credit form → personalized monthly payment range
Financing — landing page + CarMax Auto Finance / external lender / cash options at checkout
Stores — 12 real CarMax locations across CA/TX/FL/GA/NY/IL/MD/MA/WA/AZ/CO/NC
Reserve / Test drive — auth-gated booking flows
Checkout — full order flow with MaxCare warranty and trade-in appraisal application
Account — orders, reservations, test drives, appraisals, saved cars, edit profile, change password
Articles + FAQ — 10 articles, 4 FAQ categories

Search uses scored token-overlap with field-weighted scoring
(make/model = 5, trim/body/color = 3, features/specs = 1), explicitly
NOT strict-AND, so queries like "honda civic sport" return results even
when one token misses on a given vehicle.
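
The scoring described above can be sketched roughly as follows. This is a minimal illustration, not the code in app.py: the field weights come from the description, while the dict-based vehicle layout and the names `tokenize`, `score_vehicle`, and `search` are assumptions made for the example.

```python
import re

# Field weights as stated in the PR description; representing a vehicle
# as a plain dict with these keys is an assumption made for this sketch.
FIELD_WEIGHTS = {
    "make": 5, "model": 5,
    "trim": 3, "body_style": 3, "color": 3,
    "features": 1, "specs": 1,
}


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def score_vehicle(query, vehicle):
    """Add a field's weight for every query token found in that field.

    A token that matches nothing contributes 0 instead of disqualifying
    the vehicle, which is what makes the search non-strict-AND.
    """
    score = 0
    for token in tokenize(query):
        for field, weight in FIELD_WEIGHTS.items():
            if token in tokenize(str(vehicle.get(field, ""))):
                score += weight
    return score


def search(query, vehicles):
    """Return vehicles with a positive score, best match first."""
    scored = [(score_vehicle(query, v), v) for v in vehicles]
    scored = [(s, v) for s, v in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [v for _, v in scored]
```

Under these assumptions, a query such as "honda civic sport" still returns a Civic LX (make and model match), ranked below a Civic Sport (make, model, and trim match).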

Benchmark tasks

sites/carmax/tasks.jsonl ships 20 tasks following the WebVoyager
schema (web_name, id, ques, web, upstream_url):

6 Easy (2-3 steps): inventory search by year/make/model, trim-specific search, sorted filters, vehicle detail spec reading, store locator, FAQ
9 Medium (4-6 steps): research-page navigation, sell-my-car form, register + pre-qual, reserve, test drive, cheapest-vehicle + store cross-check, article read, value-page lookup, MaxCare tier comparison
5 Hard (7+ steps, multi-step reasoning): 3-way vehicle comparison, register + pre-qualify + report APR, saved-cars disambiguation, trade-in appraisal applied at checkout with custom finance terms, Dan's order history audit

Hand-traced each task against the seed DB; the answer is verifiable on
every task and not visible at the search-result level for any task that
asks for spec-level info.
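
A schema check along these lines could guard the task file before merge. It is a sketch only: the five key names come from the schema listed above, and the `validate_tasks` helper is hypothetical, not part of the repository.

```python
import json

# Keys listed in the PR description for the WebVoyager task schema.
REQUIRED_KEYS = {"web_name", "id", "ques", "web", "upstream_url"}


def validate_tasks(lines):
    """Return (line_number, missing_keys) for every malformed record."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append((lineno, sorted(missing)))
    return problems
```

It can be run as `validate_tasks(open("sites/carmax/tasks.jsonl"))`; an empty list means every record carries all five keys.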

Verification

md5sum sites/carmax/instance/carmax.db sites/carmax/instance_seed/carmax.db
c6e3b281258bd8a460f7030a54b74c21 instance/carmax.db
c6e3b281258bd8a460f7030a54b74c21 instance_seed/carmax.db
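
The same comparison can be scripted without md5sum. The sketch below uses only hashlib from the standard library; the helper names `md5_of` and `is_byte_identical` are illustrative and do not exist in the repository.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Stream a file through MD5 so large DB files are not loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_byte_identical(live_db, seed_db):
    """True when the live DB hashes to the same value as the seed DB."""
    return md5_of(live_db) == md5_of(seed_db)
```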

Idempotency

Both seed_database() (line 675) and seed_benchmark_users() (line 722)
gate the whole function on populated-DB checks, not per row. Every
seeded created_at / saved_at / added_at uses a frozen
SEED_NOW = datetime(2026, 1, 15, 12, 0, 0) (18 references). Zero
calls to datetime.utcnow() anywhere in seed_data.py.
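
The pattern can be illustrated with a small standalone sketch. It uses the standard-library sqlite3 module rather than the project's SQLAlchemy models, and the `stores` table and store names are placeholders; only the function-level gate and the frozen SEED_NOW value mirror what is described above.

```python
import sqlite3
from datetime import datetime

# Frozen reference timestamp, as quoted in the PR description.
SEED_NOW = datetime(2026, 1, 15, 12, 0, 0)


def seed_database(conn):
    """Seed once; return False without writing if the DB is populated."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stores (name TEXT, created_at TEXT)"
    )
    # Function-level gate: one check up front, not one check per row.
    (count,) = conn.execute("SELECT COUNT(*) FROM stores").fetchone()
    if count > 0:
        return False
    # Every timestamp comes from SEED_NOW, never from the wall clock.
    rows = [(name, SEED_NOW.isoformat()) for name in ("Store A", "Store B")]
    conn.executemany("INSERT INTO stores VALUES (?, ?)", rows)
    conn.commit()
    return True
```

Calling the function a second time on the same connection leaves the rows untouched, which is the property the byte-identical reset relies on.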

Asset side (HuggingFace dataset)

carmax.tar.gz (~280 MB) was uploaded to ChilleD/WebHarbor in
https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/15. This PR
bumps .assets-revision to that PR's merge SHA.

Contents of the tarball (extracts in place into sites/carmax/):

instance_seed/carmax.db — the frozen seed DB
static/images/vehicles/ — 738 real CarMax stock photos covering
115/138 unique (year, make, model) tuples (~86% coverage)
static/images/articles/ — 10 article hero images

The 18 missing (year, make, model) tuples (Ford F-150 all years, BMW 3
Series all years, Mercedes-Benz C-Class all years, 2023 Toyota Corolla
/ Kia Sorento / Subaru Outback, 2021-22 Hyundai Elantra) have no evox
stock photos on the carmax CDN — those vehicles fall back to a
CarMax-branded SVG placeholder. This matches the live site's behavior
for those exact combinations.

Test users (benchmark)

Five users with password CarMax!2026, each pre-populated for
auth-gated tasks:

| Email | First name | Pre-qual? | Saved | Reservation | Test drive | Appraisal | Order |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `alice.j@test.com` | Alice | ✓ | 2 (Civic + CR-V) | 1 | 1 (at-home) | 1 active | - |
| `bob.k@test.com` | Bob | ✓ | 2 | - | 1 (in-store) | 1 active | - |
| `carol.l@test.com` | Carol | ✓ | 1 | - | - | 1 active | - |
| `dan.m@test.com` | Dan | - | 1 | - | - | - | 1 (CMX-2026-000001, ready_for_pickup, with MaxCare gold) |
| `emma.n@test.com` | Emma | ✓ | - | - | - | - | - |

(The skill suggests bob.c/carol.d/david.k with TestPass123!, but
because tasks.jsonl references these specific emails throughout, I kept
this slightly different set. It is functionally equivalent.)

Pre-PR checks

python3 -m py_compile sites/carmax/app.py — clean
python3 -m py_compile sites/carmax/seed_data.py — clean
bash scripts/build.sh webharbor:dev — succeeds (image ~6.2 GB)
Container boots, all 16 sites alive
All 16 sites return HTTP 200
/reset/carmax byte-identical (md5 above)
Each task in tasks.jsonl has a verifiable answer in the seed
Phase-3 walkthrough (info-leak / superficial-completion / distractor checks): 3 issues found, 3 fixed (Task 13 disambiguation, dan's order total, Turbo feature cross-field consistency)
Phase-4 hardening (13 leak archetypes + 4 dimensions): no real leaks; one minor task rephrasing applied

Anything that might want reviewer attention

Benchmark user emails deviate from the skill's recommended
bob.c@test.com / carol.d@test.com set — kept for tasks.jsonl
internal consistency.
18 vehicles show a placeholder image (not 100% image coverage)
because the carmax CDN has no evox photos for those (make, model,
year) combinations. Could be remediated by sourcing from a different
CDN if the maintainer requires 100% coverage.
SEED_NOW = datetime(2026, 1, 15, 12, 0, 0) — matches the
project's existing 2026 date pinning convention; please flag if a
different reference date is preferred.

Happy to address any review feedback.

…com.

- 13 SQLAlchemy models (User / Store / Vehicle / SavedVehicle / Comparison + ComparisonItem / Reservation / TestDrive / Appraisal / FinancePreQual / Order / Review / Article)
- 59 routes covering search / browse / detail / research / compare / saved / sell-my-car / pre-qual / reserve / test-drive / checkout / account / articles / FAQ / MaxCare / stores / auth
- Token-overlap scored search with multi-field weighting
- 141 deterministically-seeded vehicles across 31 templates
- 12 real CarMax store locations
- 5 benchmark users with pre-populated saved/reservation/test-drive/appraisal/order data
- 20 WebVoyager tasks in tasks.jsonl (6 Easy / 9 Medium / 5 Hard, including 2 disambiguation tasks)
- Idempotent seed at function level; byte-identical reset verified

Conflicts resolved:

- websyn_start.sh / control_server.py: append carmax after recreation_gov.
- Dockerfile EXPOSE 40000-40020 → 40000-40021; 16 → 22 site comment.

PR author already pinned bcrypt password_hash in seed_data.py:768 with an explanatory comment about salt churn breaking byte-identical reset. Plus carmax ships pre-built db via HF refs/pr/15, so seed runs only at build time. No extra fix needed.
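
Why salt churn breaks a byte-identical reset can be shown with a short sketch. It deliberately swaps in the standard library's PBKDF2 (hashlib.pbkdf2_hmac) as a stand-in for bcrypt so the example has no third-party dependency; the `hash_password` helper and the PINNED_SALT value are hypothetical and are not taken from seed_data.py.

```python
import hashlib
import os


def hash_password(password, salt=None):
    """PBKDF2-HMAC-SHA256 hash; draws a random salt when none is given."""
    if salt is None:
        salt = os.urandom(16)  # fresh salt: output differs on every call
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 1000)
    return salt.hex() + "$" + digest.hex()


# Hypothetical fixed salt, for illustration only.
PINNED_SALT = bytes.fromhex("00112233445566778899aabbccddeeff")
```

With a random salt, two seed runs store different hash strings for the same password, so the resulting DB files differ byte for byte. Passing a fixed salt, or storing a precomputed hash string as the commit message describes, keeps the stored value stable across runs.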

Added 17 new gotchas (aiming-lab#24-aiming-lab#40) covering systemic anti-patterns caught during 28-site deepen pass: API endpoint trap, in-memory data dict trap, shared marketing template trap, broken entry links, task literal duplicates, test-client seed-copy skip, image utilization, concurrent subagent race, circular import seed_data, hub URL inventory, image fallback patterns (SVG), POST interaction families, MarketingPage schema, subagent stalls, image_path remap, pbkdf2 PINNED variant.

clone-website: real-data scraping mandate, GUI surface definition (distinct templates / DB-backed / linkable entries), per-site-type template targets, image/POST utilization thresholds, canonical deepen-pass blueprint architecture with task generator template.

design-tasks: WebVoyager GUI hard boundaries (banned phrasings + regex), 5-token prefix cap @5, GUI vs API rewrite examples, multi-step distribution, disambiguation density, pre-merge audit thresholds.

seed-database: hard rule that page content must live in SQLAlchemy tables (not module-level dicts), detection script, in-memory->DB migration recipe.

New skill `document-site-gui`: per-site GUI-centric documentation producing site_docs/<slug>.md (8 sub-blocks per page) + site_specs/<slug>.yaml (canonical structured spec). GUI-only action space, batch=3 sites per subagent.

Total: 4101 lines across 8 skills (was ~1650 before).

DEM1TASSE

Review: CarMax mirror (PR #24)

Verdict: Request changes.

Strong engineering foundation — real inventory imagery, faithful CarMax layout, 59
routes, idempotent seeding, and a byte-identical reset that holds even after form
writes. But walking all 20 tasks end-to-end (real Chromium) surfaced one unsolvable
task and several correctness/realism bugs that an initial spot-check misses: a loose
search that returns wrong cars, ~13% missing images, a hardcoded reservation expiry,
an unreachable value page, and a half-built at-home test-drive flow. None are huge,
but together they need a fix pass before this is benchmark-ready.

Reviewed by building the image from this branch + the assets from the paired HF PR
(ASSETS_REVISION=refs/pr/15), on alt ports 8201 / 41000-41015, and driving every
task through the browser.

Mechanical checks: ✅ PASS

All 16 sites return 200 (ports 41000–41015)
Control plane healthy; carmax /_health = 141 vehicles / 12 stores / 5 users
Byte-identical reset holds, even after login / save / reserve / test-drive /
checkout / appraisal-redeem writes: md5(instance) == md5(instance_seed)
(c6e3b28…) after reset every time
reset-all ~0.97s, all 16 ready
Registration consistent (websyn_start.sh / control_server.py / Dockerfile);
carmax = index 15 → port 40015; tasks correctly use 40015
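
The index-to-port relationship noted above can be written down as a one-line mapping. This is a sketch under assumptions: the base port 40000 is inferred from the EXPOSE range, the `site_port` helper is hypothetical, and the placeholder site names stand in for the real SITES list, which is not reproduced here.

```python
BASE_PORT = 40000  # inferred from "EXPOSE 8101 40000-40015"


def site_port(sites, name, base_port=BASE_PORT):
    """Port for a site is the base port plus its position in the list."""
    return base_port + sites.index(name)


# Placeholder names for the 15 pre-existing sites; only "carmax" is real.
SITES = [f"site_{i}" for i in range(15)] + ["carmax"]
```

Because the port is derived from list position, the entry order must match across websyn_start.sh and control_server.py, which is why the review checks that registration is consistent.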

Credit: idempotent seeding done right — seed_database() / seed_benchmark_users()
early-return on a populated DB; data is embedded in seed_data.py (no dependency on
the gitignored scraped_data/); date-bearing values are anchored to a fixed reference
date so seed/reset stay deterministic.

Visual fidelity: ⚠️ Mostly good, but ~13% of vehicles have no image

Real car photos and faithful CarMax layout on the pages that render (homepage,
inventory, vehicle detail, MaxCare, stores, research). But 18 of 141 vehicles (13%)
return 404 for their front image — including all 5 Ford F-150s plus some Toyota
Corolla / Hyundai Elantra. On those detail pages the main image is broken and the
gallery falls back to _pending.svg. This directly hits task #11 (test-drive a 2022
Ford F-150 → no photo). Ship the missing images.

Functional depth: ⚠️ Most flows work; a few are broken

Every interactive flow was driven through the browser:

Login / logout / register; pre-qualify (APR result, e.g. 7.99%)
Inventory faceted filters + sort; vehicle detail; 3-car compare
Sell-my-car instant offer (computed, deterministic expiry)
Saved cars add/remove; full checkout (trade-in + finance → order, e.g. $18,800);
appraisal correctly flips to redeemed and "Open offers" drops to 0
Reservation expiry is hardcoded (see B1)
At-home test drive doesn't collect/show an address and shows a store (see B5)
Used-car-value pages aren't reachable by navigation (see B4)

Task quality: walked all 20 — 19 completable, #3 fails

Tasks are navigation-heavy and anchored on fictional in-site data (prices, mileage,
HP/MPG, store, APR, offer amounts) an LLM can't answer from memory — a genuinely good
design. Multi-step flows (#6 compare, #7 offer, #9 register+pre-qual, #14 checkout with
trade-in) exercise the environment for real, and there are no human-in-the-loop
tasks. End-to-end walk: #0,1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19 complete;
#3 fails.

Required before approval

A1 — Task #3 is unsolvable, and search makes it worse

"Search for a Tesla Model 3 with under 50,000 miles … sort by lowest mileage and open
the lowest." Inventory has 4 Tesla Model 3s at 52,838 / 58,025 / 87,212 / 92,399 mi —
none under 50,000 (lowest is 52,838), so there is no correct answer. Worse: the
free-text search for "Tesla Model 3" returns 141 results (the whole catalog) and,
sorted by lowest mileage, ranks a 2023 Toyota Tacoma first — so an agent following
the steps opens a pickup truck, not a Tesla. Fix the data (add a sub-50k Tesla Model 3
or relax to e.g. <60,000) and the search (A2).

Should fix

B1 — Reservation expiry is hardcoded and can precede the appointment

reserve() sets expires_at = date(2026,5,14) + timedelta(days=7) → always
2026-05-21, ignoring the appointment date. Proven: reserving with appointment
2026-06-15 still expires 2026-05-21 — i.e. the hold expires a month before the
appointment. For task #10 (appointment 2026-05-20) it shows expiry 2026-05-21, which
reads as a 1-day hold, contradicting the "reserve for 7 days" flash. Fix:
expires_at = appointment_date + timedelta(days=7) (or reservation date + 7 days).
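
The difference between the two behaviors can be sketched as below. The function names are illustrative; only the date(2026, 5, 14) anchor and the 7-day hold come from the review text.

```python
from datetime import date, timedelta

HOLD_DAYS = 7


def hardcoded_expiry():
    """Behavior described in B1: anchored to a fixed date."""
    return date(2026, 5, 14) + timedelta(days=HOLD_DAYS)


def expiry_for(appointment_date):
    """Suggested fix: anchor the hold to the appointment date."""
    return appointment_date + timedelta(days=HOLD_DAYS)
```

For an appointment on 2026-06-15 the hardcoded version yields 2026-05-21, before the appointment, while `expiry_for` yields 2026-06-22.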

B2 — Free-text search is loose token matching, not field-aware

The search scores a text blob (trim/body_style/color/drive_type/transmission/…) by
token overlap, so it ignores make/model/body/drivetrain as constraints: "Tesla Model
3" → 141 results, "AWD SUVs" → 86 (including FWD sedans, vs 62 real AWD SUVs). It
survives #0/#1 (default best_match ranks the real match first) and #2 (the facet
filters are correct), but breaks any "search a model, then sort by mileage/price"
path (A1). Fix: parse make/model/body/drivetrain from the query into real filters, or
constrain the candidate set before scoring.
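
One way to constrain the candidate set before sorting is sketched below. It is an illustration of the suggested fix, not the project's code: vehicles are plain dicts, only make and model are parsed, and `parse_filters` and `constrained_search` are hypothetical names.

```python
def parse_filters(query, vehicles):
    """Pull make/model constraints out of the query by matching known values."""
    # Pad with spaces so matches respect word boundaries.
    q = " " + " ".join(query.lower().split()) + " "
    filters = {}
    for field in ("make", "model"):
        known = {v[field].lower() for v in vehicles}
        hits = [k for k in known if f" {k} " in q]
        if hits:
            # Prefer the longest match when several known values appear.
            filters[field] = max(hits, key=len)
    return filters


def constrained_search(query, vehicles, sort_key="mileage"):
    """Apply parsed filters as hard constraints, then sort the survivors."""
    filters = parse_filters(query, vehicles)
    candidates = [
        v for v in vehicles
        if all(v[f].lower() == value for f, value in filters.items())
    ]
    return sorted(candidates, key=lambda v: v[sort_key])
```

With make and model enforced as filters, a "Tesla Model 3" query sorted by mileage can only surface Tesla Model 3s. If the query names no known make or model, the filter dict is empty and every vehicle remains a candidate, so a real implementation would still want scoring as a fallback.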

B3 — ~13% of vehicles have no image (see Visual fidelity)

18/141 missing front images incl. all 5 F-150s. Affects #11.

B4 — Used-car-value pages exist but are unreachable by clicking

/value/honda/accord/2020 renders correctly (average / lowest / highest / count "In
current inventory" = 2), and non-Honda value pages work too. But there is no
clickable path to them: the /value landing links each make to /cars/<make>
(inventory), and /value/<make> is a 404. Only the model→year value pages interlink.
Task #18 ("visit the used car value page for the 2020 Honda Accord") therefore requires
guessing the URL. Fix: have the /value landing drill into /value/<make>/<model>
(and/or add a /value/<make> page).

B5 — At-home test drive doesn't capture an address and shows a store

TestDriveForm and the test_drives table have no address field, so selecting
"At my address" collects nowhere to deliver. And account_test_drives.html shows
r.store.location_label for every row regardless of location_type, so an at-home
drive displays the vehicle's store (e.g. "Lynnwood, WA") while labeled "At home" —
contradictory. Affects #11. Fix: add an address field for at-home; don't show a store
for at-home rows.
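
The two fixes can be sketched in plain Python, independent of WTForms and the templates. Everything here is hypothetical: the `DriveRequest` dataclass, the "home" / "store" values for location_type, and the display strings are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DriveRequest:
    location_type: str                 # "store" or "home" (assumed values)
    store_label: Optional[str] = None
    address: Optional[str] = None


def validate(req):
    """Require an address for at-home drives and a store for in-store ones."""
    errors = []
    if req.location_type == "home" and not (req.address or "").strip():
        errors.append("address is required for an at-home test drive")
    if req.location_type == "store" and not req.store_label:
        errors.append("store is required for an in-store test drive")
    return errors


def display_location(req):
    """Show the delivery address for at-home rows, the store otherwise."""
    if req.location_type == "home":
        return f"At home: {req.address}"
    return req.store_label
```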

B6 — Catalog is thin, and "any X" tasks are only accidentally deterministic

141 vehicles; popular configs have a single instance (2022 Honda Civic = 1; 2022
CR-V/Camry/F-150 = 1 each) and there's 1 store per state. Every "any X" / "cheapest"
task (#0/#1/#4/#8b/#10/#11/#14, #2/#15) currently resolves to exactly one answer —
only because the catalog is this thin, which also means search/filter tasks have
weak distractors (e.g. #0 has no near-miss Civics). Broadening the catalog (needed for
distractors) would make the "any 2022 X" tasks non-deterministic. Fix the two together:
broaden the catalog and rewrite "any X" tasks to a unique selector (stock number,
lowest-mileage, a specific store/color).

Minor / realism

Checkout APR & down payment are free-text inputs, and the typed APR is used to
compute the monthly payment — so any APR (even 0.01%) is accepted. Unrealistic; APR
should come from pre-qualification/financing. Task #14 leans on typing "6.49%".
#13 task text contradicts the data: it says alice's two saved cars are "from
different makes," but both are Honda (2020 Civic 69k mi, 2021 CR-V 57.8k mi). The
remove-higher-mileage step still works; the wording is wrong.
#13 remove control is labeled "♡ Save" (same as the add toggle) on the saved
page — ambiguous for "remove."
#11 note is saved but never displayed — the test-drives table has no Notes
column, so "leaving a note" can't be visually confirmed.
#18 price range is degenerate: both 2020 Honda Accords are priced $13,000, so
"lowest to highest" is $13,000–$13,000.
#15 every store has home delivery (12/12), so sub-question (c) "whether that store
offers home delivery" is always yes — no distractor.
#16 is open-ended ("the key difference between pre-qualification and
pre-approval, in one sentence") — subjective, no single ground truth.
#14 omits required inputs (pickup vs delivery, card number); the agent must
invent them. The total is unaffected by both, so the answer stays deterministic.
Appraisal offers look low (2018 Camry LE, 78.5k mi, good → $4,850, below the
seeded 2019 Altima appraisal of $14,750).

Summary

| Dimension | Result |
| --- | --- |
| Mechanical (build / 200 / byte-identical reset incl. post-write / reset-all) | ✅ PASS |
| Visual | ⚠️ Real images, but 13% of vehicles (all F-150s) missing |
| Functional | ⚠️ Most flows work; reservation expiry / at-home test drive / value nav broken |
| Task quality | ⚠️ 19/20 complete; #3 unsolvable; search + catalog issues |
| Assets pin | ⏳ Bump `.assets-revision` after HF PR #15 merges |

Bottom line: well-built and close, but request changes — fix the unsolvable #3 +
search (A1/B2), the reservation expiry (B1), missing images (B3), value navigation
(B4), and the at-home test drive (B5); broaden the catalog with tightened "any" tasks
(B6); then address the realism/wording nits.

Reproduce

gh pr checkout 24
ASSETS_REVISION=refs/pr/15 ./scripts/fetch_assets.sh
./scripts/build.sh webharbor:dev
docker run -d --rm --name wh-review -p 8201:8101 -p 41000-41015:40000-40015 webharbor:dev
curl -X POST http://localhost:8201/reset/carmax
docker exec wh-review md5sum \
  /opt/WebSyn/carmax/instance/carmax.db \
  /opt/WebSyn/carmax/instance_seed/carmax.db

Violet24K added 6 commits May 14, 2026 22:28

adding carmax phase 1

427ad6a

phase 1 almost done

d3c4380

Create tasks.jsonl

177bdf1

phase1 docker check passed, phase 2 & 3 finished; update LR windows/l…

1b577f6

…inux issue

phase 3 done. scraped images

a680728

sarendis56 mentioned this pull request May 16, 2026

Add Compass real-estate mirror (port 40015) #25

Open

hqhq1025 mentioned this pull request May 26, 2026

Add Discogs mirror site (port 40015) #34

Open

DEM1TASSE suggested changes Jun 23, 2026

boyugou mentioned this pull request Jun 25, 2026

feat(drugs_com): add drugs.com mirror site (port 40016) #9

Open

Violet24K wants to merge 6 commits into aiming-lab:main from Violet24K:main
