Add CarMax mirror (port 40015) #24
…com.
- 13 SQLAlchemy models (User / Store / Vehicle / SavedVehicle / Comparison + ComparisonItem / Reservation / TestDrive / Appraisal / FinancePreQual / Order / Review / Article)
- 59 routes covering search / browse / detail / research / compare / saved / sell-my-car / pre-qual / reserve / test-drive / checkout / account / articles / FAQ / MaxCare / stores / auth
- Token-overlap scored search with multi-field weighting
- 141 deterministically-seeded vehicles across 31 templates
- 12 real CarMax store locations
- 5 benchmark users with pre-populated saved / reservation / test-drive / appraisal / order data
- 20 WebVoyager tasks in tasks.jsonl (6 Easy / 9 Medium / 5 Hard, including 2 disambiguation tasks)
- Idempotent seed at function level; byte-identical reset verified
Conflicts resolved:
- websyn_start.sh / control_server.py: append carmax after recreation_gov.
- Dockerfile: EXPOSE 40000-40020 → 40000-40021; 16 → 22 site comment.
PR author already pinned the bcrypt password_hash in seed_data.py:768 with an explanatory comment about salt churn breaking byte-identical reset. In addition, carmax ships a pre-built DB via HF refs/pr/15, so the seed runs only at build time. No extra fix needed.
Added 17 new gotchas (aiming-lab#24-aiming-lab#40) covering systemic anti-patterns caught during the 28-site deepen pass: API endpoint trap, in-memory data dict trap, shared marketing template trap, entry-link breakage, task literal duplicates, test-client seed-copy skip, image utilization, concurrent subagent race, circular import seed_data, hub URL inventory, image fallback patterns (SVG), POST interaction families, MarketingPage schema, subagent stalls, image_path remap, pbkdf2 PINNED variant.
- clone-website: real-data scraping mandate, GUI surface definition (distinct templates / DB-backed / linkable entries), per-site-type template targets, image/POST utilization thresholds, canonical deepen-pass blueprint architecture with task generator template.
- design-tasks: WebVoyager GUI hard boundaries (banned phrasings + regex), 5-token prefix cap @5, GUI vs API rewrite examples, multi-step distribution, disambiguation density, pre-merge audit thresholds.
- seed-database: hard rule that page content must live in SQLAlchemy tables (not module-level dicts), detection script, in-memory->DB migration recipe.
- New skill `document-site-gui`: per-site GUI-centric documentation producing site_docs/<slug>.md (8 sub-blocks per page) + site_specs/<slug>.yaml (canonical structured spec). GUI-only action space, batch=3 sites per subagent.
Total: 4101 lines across 8 skills (was ~1650 before).
DEM1TASSE
left a comment
Review: CarMax mirror (PR #24)
Verdict: Request changes.
Strong engineering foundation — real inventory imagery, faithful CarMax layout, 59
routes, idempotent seeding, and a byte-identical reset that holds even after form
writes. But walking all 20 tasks end-to-end (real Chromium) surfaced one unsolvable
task and several correctness/realism bugs that an initial spot-check misses: a loose
search that returns wrong cars, ~13% missing images, a hardcoded reservation expiry,
an unreachable value page, and a half-built at-home test-drive flow. None are huge,
but together they need a fix pass before this is benchmark-ready.
Reviewed by building the image from this branch + the assets from the paired HF PR
(ASSETS_REVISION=refs/pr/15), on alt ports 8201 / 41000-41015, and driving every
task through the browser.
Mechanical checks: ✅ PASS
- All 16 sites return 200 (ports 41000–41015)
- Control plane healthy; carmax /_health = 141 vehicles / 12 stores / 5 users
- Byte-identical reset holds, even after login / save / reserve / test-drive / checkout / appraisal-redeem writes: md5(instance) == md5(instance_seed) (c6e3b28…) after reset every time
- reset-all ~0.97s, all 16 ready
- Registration consistent (websyn_start.sh / control_server.py / Dockerfile); carmax = index 15 → port 40015; tasks correctly use 40015
Credit: idempotent seeding done right — seed_database() / seed_benchmark_users()
early-return on a populated DB; data is embedded in seed_data.py (no dependency on
the gitignored scraped_data/); date-bearing values are anchored to a fixed reference
date so seed/reset stay deterministic.
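The function-level gate credited above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: it uses the standard-library sqlite3 module instead of SQLAlchemy, and the vehicles table, its columns, and the sample rows are hypothetical. The SEED_NOW value is the one quoted in the PR description.

```python
import sqlite3
from datetime import datetime

# Frozen reference timestamp (value quoted in the PR description).
SEED_NOW = datetime(2026, 1, 15, 12, 0, 0)


def seed_database(conn: sqlite3.Connection) -> bool:
    """Seed once; return False without writing if the DB is already populated."""
    # Hypothetical schema, only for this sketch.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vehicles ("
        "id INTEGER PRIMARY KEY, make TEXT, model TEXT, created_at TEXT)"
    )
    # Function-level gate: one check up front, not one check per row.
    (count,) = conn.execute("SELECT COUNT(*) FROM vehicles").fetchone()
    if count:
        return False
    rows = [("Honda", "Civic"), ("Tesla", "Model 3")]  # hypothetical sample rows
    conn.executemany(
        "INSERT INTO vehicles (make, model, created_at) VALUES (?, ?, ?)",
        [(make, model, SEED_NOW.isoformat()) for make, model in rows],
    )
    conn.commit()
    return True
```

Because every timestamp comes from SEED_NOW rather than the wall clock, running the seed on different days yields the same rows, which is what a byte-identical reset depends on.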
Visual fidelity: ⚠️ Mostly good, but ~13% of vehicles have no image
Real car photos and faithful CarMax layout on the pages that render (homepage,
inventory, vehicle detail, MaxCare, stores, research). But 18 of 141 vehicles (13%)
return 404 for their front image — including all 5 Ford F-150s plus some Toyota
Corolla / Hyundai Elantra. On those detail pages the main image is broken and the
gallery falls back to _pending.svg. This directly hits task #11 (test-drive a 2022
Ford F-150 → no photo). Ship the missing images.
Functional depth: ⚠️ Most flows work; a few are broken
Every interactive flow was driven through the browser:
- Login / logout / register; pre-qualify (APR result, e.g. 7.99%)
- Inventory faceted filters + sort; vehicle detail; 3-car compare
- Sell-my-car instant offer (computed, deterministic expiry)
- Saved cars add/remove; full checkout (trade-in + finance → order, e.g. $18,800); appraisal correctly flips to redeemed and "Open offers" drops to 0
- Reservation expiry is hardcoded (see B1)
- At-home test drive doesn't collect/show an address and shows a store (see B5)
- Used-car-value pages aren't reachable by navigation (see B4)
Task quality: walked all 20 — 19 completable, #3 fails
Tasks are navigation-heavy and anchored on fictional in-site data (prices, mileage,
HP/MPG, store, APR, offer amounts) an LLM can't answer from memory — a genuinely good
design. Multi-step flows (#6 compare, #7 offer, #9 register+pre-qual, #14 checkout with
trade-in) exercise the environment for real, and there are no human-in-the-loop
tasks. End-to-end walk: #0,1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19 complete;
#3 fails.
Required before approval
A1 — Task #3 is unsolvable, and search makes it worse
"Search for a Tesla Model 3 with under 50,000 miles … sort by lowest mileage and open
the lowest." Inventory has 4 Tesla Model 3s at 52,838 / 58,025 / 87,212 / 92,399 mi —
none under 50,000 (lowest is 52,838), so there is no correct answer. Worse: the
free-text search for "Tesla Model 3" returns 141 results (the whole catalog) and,
sorted by lowest mileage, ranks a 2023 Toyota Tacoma first — so an agent following
the steps opens a pickup truck, not a Tesla. Fix the data (add a sub-50k Tesla Model 3
or relax to e.g. <60,000) and the search (B2).
Should fix
B1 — Reservation expiry is hardcoded and can precede the appointment
reserve() sets expires_at = date(2026,5,14) + timedelta(days=7) → always
2026-05-21, ignoring the appointment date. Proven: reserving with appointment
2026-06-15 still expires 2026-05-21 — i.e. the hold expires a month before the
appointment. For task #10 (appointment 2026-05-20) it shows expiry 2026-05-21, which
reads as a 1-day hold, contradicting the "reserve for 7 days" flash. Fix:
expires_at = appointment_date + 7 (or reservation date + 7).
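A minimal sketch of the first suggested fix, deriving the expiry from the appointment date. The function name and the HOLD_DAYS constant are hypothetical; only the 7-day window comes from the flash message quoted above.

```python
from datetime import date, timedelta

HOLD_DAYS = 7  # the "reserve for 7 days" window quoted in the review


def reservation_expiry(appointment_date: date, hold_days: int = HOLD_DAYS) -> date:
    """Expiry derived from the appointment date rather than a fixed constant."""
    return appointment_date + timedelta(days=hold_days)
```

With this shape, an appointment on 2026-06-15 expires on 2026-06-22, so the hold can no longer end before the appointment. Anchoring on the reservation date instead would need a separate check that the appointment falls inside the hold window.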
B2 — Free-text search is loose token matching, not field-aware
The search scores a text blob (trim/body_style/color/drive_type/transmission/…) by
token overlap, so it ignores make/model/body/drivetrain as constraints: "Tesla Model
3" → 141 results, "AWD SUVs" → 86 (including FWD sedans, vs 62 real AWD SUVs). It
survives #0/#1 (default best_match ranks the real match first) and #2 (the facet
filters are correct), but breaks any "search a model, then sort by mileage/price"
path (A1). Fix: parse make/model/body/drivetrain from the query into real filters, or
constrain the candidate set before scoring.
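One way to read "constrain the candidate set before scoring" is sketched below. It is an illustration under assumptions, not the site's search code: the Vehicle dataclass, parse_filters, and search are hypothetical names, and the query is matched against the catalog's own make and model vocabulary using simple whitespace-delimited matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vehicle:
    make: str
    model: str
    mileage: int


def parse_filters(query: str, vehicles: list[Vehicle]) -> dict[str, str]:
    """Pull a make and a model out of the query using the catalog's vocabulary."""
    text = f" {query.lower()} "  # pad so matches respect word boundaries
    filters: dict[str, str] = {}
    for v in vehicles:
        if f" {v.make.lower()} " in text:
            filters["make"] = v.make
        if f" {v.model.lower()} " in text:
            filters["model"] = v.model
    return filters


def search(query: str, vehicles: list[Vehicle]) -> list[Vehicle]:
    """Apply parsed filters as hard constraints, then sort by lowest mileage."""
    filters = parse_filters(query, vehicles)
    candidates = [
        v for v in vehicles
        if all(getattr(v, field) == value for field, value in filters.items())
    ]
    return sorted(candidates, key=lambda v: v.mileage)
```

In this sketch a query with no recognizable make or model still returns the whole catalog, so a real fix would also need to decide what an unmatched query should show.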
B3 — ~13% of vehicles have no image (see Visual fidelity)
18/141 missing front images incl. all 5 F-150s. Affects #11.
B4 — Used-car-value pages exist but are unreachable by clicking
/value/honda/accord/2020 renders correctly (average / lowest / highest / count "In
current inventory" = 2), and non-Honda value pages work too. But there is no
clickable path to them: the /value landing links each make to /cars/<make>
(inventory), and /value/<make> is a 404. Only the model→year value pages interlink.
Task #18 ("visit the used car value page for the 2020 Honda Accord") therefore requires
guessing the URL. Fix: have the /value landing drill into /value/<make>/<model>
(and/or add a /value/<make> page).
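As a rough sketch of the drill-down suggested here, the helper below builds the links a per-make value page could list from inventory rows. The function names, the dict-based rows, and the slug rule (lowercase, spaces to hyphens) are assumptions; only the /value/<make>/<model>/<year> URL shape comes from the review.

```python
def slug(text: str) -> str:
    """Hypothetical slug rule: lowercase, spaces replaced by hyphens."""
    return text.lower().replace(" ", "-")


def value_links(vehicles: list[dict]) -> dict[str, list[str]]:
    """Map each /value/<make> hub to its sorted /value/<make>/<model>/<year> pages."""
    links: dict[str, set[str]] = {}
    for v in vehicles:
        make, model = slug(v["make"]), slug(v["model"])
        hub = f"/value/{make}"
        links.setdefault(hub, set()).add(f"{hub}/{model}/{v['year']}")
    return {hub: sorted(pages) for hub, pages in links.items()}
```

Rendering these links on the /value landing and on each hub would give task #18 a clickable path from /value to /value/honda/accord/2020.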
B5 — At-home test drive doesn't capture an address and shows a store
TestDriveForm and the test_drives table have no address field, so selecting
"At my address" collects nowhere to deliver. And account_test_drives.html shows
r.store.location_label for every row regardless of location_type, so an at-home
drive displays the vehicle's store (e.g. "Lynnwood, WA") while labeled "At home" —
contradictory. Affects #11. Fix: add an address field for at-home; don't show a store
for at-home rows.
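The two parts of this fix can be sketched as a validation step and a display rule. Everything here is hypothetical (the DriveBooking dataclass, the "home" / "store" values, and the address field); the real change would go into TestDriveForm, the test_drives table, and account_test_drives.html.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DriveBooking:
    location_type: str              # "store" or "home" (hypothetical values)
    store_label: str                # e.g. "Lynnwood, WA"
    address: Optional[str] = None   # hypothetical new field for at-home drives


def validate(booking: DriveBooking) -> None:
    """Reject an at-home booking that has nowhere to deliver the car."""
    if booking.location_type == "home" and not (booking.address or "").strip():
        raise ValueError("An address is required for an at-home test drive")


def display_location(booking: DriveBooking) -> str:
    """Show the address for at-home rows and the store only for in-store rows."""
    if booking.location_type == "home":
        return f"At home: {booking.address}"
    return booking.store_label
```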
B6 — Catalog is thin, and "any X" tasks are only accidentally deterministic
141 vehicles; popular configs have a single instance (2022 Honda Civic = 1; 2022
CR-V/Camry/F-150 = 1 each) and there's 1 store per state. Every "any X" / "cheapest"
task (#0/#1/#4/#8b/#10/#11/#14, #2/#15) currently resolves to exactly one answer —
only because the catalog is this thin, which also means search/filter tasks have
weak distractors (e.g. #0 has no near-miss Civics). Broadening the catalog (needed for
distractors) would make the "any 2022 X" tasks non-deterministic. Fix the two together:
broaden the catalog and rewrite "any X" tasks to a unique selector (stock number,
lowest-mileage, a specific store/color).
Minor / realism
- Checkout APR & down payment are free-text inputs, and the typed APR is used to compute the monthly payment, so any APR (even 0.01%) is accepted. Unrealistic; APR should come from pre-qualification/financing. Task #14 leans on typing "6.49%".
- #13 task text contradicts the data: it says alice's two saved cars are "from different makes," but both are Honda (2020 Civic 69k mi, 2021 CR-V 57.8k mi). The remove-higher-mileage step still works; the wording is wrong.
- #13 remove control is labeled "♡ Save" (same as the add toggle) on the saved page, which is ambiguous for "remove."
- #11 note is saved but never displayed: the test-drives table has no Notes column, so "leaving a note" can't be visually confirmed.
- #18 price range is degenerate: both 2020 Honda Accords are priced $13,000, so "lowest to highest" is $13,000–$13,000.
- #15 every store has home delivery (12/12), so sub-question (c) "whether that store offers home delivery" is always yes, with no distractor.
- #16 is open-ended ("the key difference between pre-qualification and pre-approval, in one sentence"): subjective, no single ground truth.
- #14 omits required inputs (pickup vs delivery, card number); the agent must invent them. The total is unaffected by both, so the answer stays deterministic.
- Appraisal offers look low (2018 Camry LE, 78.5k mi, good → $4,850, below the seeded 2019 Altima appraisal of $14,750).
Summary
| Dimension | Result |
|---|---|
| Mechanical (build / 200 / byte-identical reset incl. post-write / reset-all) | ✅ PASS |
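| Visual | ⚠️ Mostly good, but ~13% of vehicles have no image |
| Functional | ⚠️ Most flows work; a few are broken |
| Task quality | 19 of 20 completable; #3 fails |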
| Assets pin | ⏳ Bump .assets-revision after HF PR #15 merges |
Bottom line: well-built and close, but request changes — fix the unsolvable #3 +
search (A1/B2), the reservation expiry (B1), missing images (B3), value navigation
(B4), and the at-home test drive (B5); broaden the catalog with tightened "any" tasks
(B6); then address the realism/wording nits.
Reproduce
gh pr checkout 24
ASSETS_REVISION=refs/pr/15 ./scripts/fetch_assets.sh
./scripts/build.sh webharbor:dev
docker run -d --rm --name wh-review -p 8201:8101 -p 41000-41015:40000-40015 webharbor:dev
curl -X POST http://localhost:8201/reset/carmax
docker exec wh-review md5sum \
/opt/WebSyn/carmax/instance/carmax.db \
/opt/WebSyn/carmax/instance_seed/carmax.db
Adds a Flask mirror of carmax.com as the 16th
WebHarbor site, with full inventory search, vehicle research, comparison,
sell-my-car appraisal, financing pre-qualification, reserve, test drive,
and checkout flows.
Companion HuggingFace PR: https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/15
What's in this PR
Site code (sites/carmax/)
- app.py
- seed_data.py
- templates/*.html
- static/css/main.css: … (#1660a8) + yellow (#FFD900) brand styling
- scrape_carmax.py
- scrape_articles.py
- tasks.jsonl
Registration (3 files modified)
- websyn_start.sh: added carmax to SITES, switched the three hardcoded 15s to ${#SITES[@]} so future additions don't need triple edits.
- control_server.py: added 'carmax' to SITES list.
- Dockerfile: EXPOSE 8101 40000-40015 (was 40000-40014).
Quality-of-life additions
- .gitattributes: forces LF line endings on *.sh and Dockerfile so a Windows checkout doesn't break the container entrypoint (hit this exact issue during initial Docker testing: exec /opt/websyn_start.sh: no such file or directory).
- scripts/verify_carmax.sh: single-command end-to-end verifier (build → run → reset → md5sum) for the new site.
Mirror functional coverage
59 routes across these areas:
/cars, /cars/<make>, /cars/<make>/<model>, /cars/<make>/<model>/<year>, /cars/<make>/<model>/<trim>, /cars/<make>/<model>/<trim>/<year>, with filter params for body style, drive type, fuel type, mileage cap, price range, color, store, etc.
Search uses scored token-overlap with field-weighted scoring
(make/model = 5, trim/body/color = 3, features/specs = 1), explicitly
NOT strict-AND, so queries like "honda civic sport" return results even
when one token misses on a given vehicle.
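The weighting described above can be illustrated with a small sketch. The weights are the ones stated in this description; the field keys, the dict-based vehicle, and the whitespace tokenizer are assumptions made for the example and may differ from app.py.

```python
# Weights as stated in the PR description; the field keys are hypothetical.
FIELD_WEIGHTS = {
    "make": 5, "model": 5,
    "trim": 3, "body_style": 3, "color": 3,
    "features": 1, "specs": 1,
}


def score(query: str, vehicle: dict) -> int:
    """Sum, per field, the field weight times the number of query tokens it contains."""
    tokens = set(query.lower().split())
    total = 0
    for field, weight in FIELD_WEIGHTS.items():
        field_tokens = set(str(vehicle.get(field, "")).lower().split())
        total += weight * len(tokens & field_tokens)
    return total
```

A Civic LX still scores on "honda civic sport" through its make and model even though "sport" misses, which is the non-strict-AND behavior described. Whether zero-scoring vehicles are dropped from the results is a separate decision this sketch does not make.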
Benchmark tasks
sites/carmax/tasks.jsonl ships 20 tasks following the WebVoyager schema (web_name, id, ques, web, upstream_url).
Hand-traced each task against the seed DB; the answer is verifiable on
every task and not visible at the search-result level for any task that
asks for spec-level info.
Verification
md5sum sites/carmax/instance/carmax.db sites/carmax/instance_seed/carmax.db
c6e3b281258bd8a460f7030a54b74c21 instance/carmax.db
c6e3b281258bd8a460f7030a54b74c21 instance_seed/carmax.db
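The same comparison can be scripted for repeated runs. This is a small sketch using only hashlib; the function names are hypothetical and the two paths are whatever DB files are being compared.

```python
import hashlib
from pathlib import Path


def md5_of(path) -> str:
    """Hex MD5 of a file, read in 1 MiB chunks to keep memory use flat."""
    digest = hashlib.md5()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reset_is_byte_identical(instance_db, seed_db) -> bool:
    """True when the live DB and the frozen seed DB hash to the same value."""
    return md5_of(instance_db) == md5_of(seed_db)
```

For example, reset_is_byte_identical("sites/carmax/instance/carmax.db", "sites/carmax/instance_seed/carmax.db") should return True right after a reset.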
Idempotency
Both seed_database() (line 675) and seed_benchmark_users() (line 722) gate the whole function on populated-DB checks, not per row. Every seeded created_at / saved_at / added_at uses a frozen SEED_NOW = datetime(2026, 1, 15, 12, 0, 0) (18 references). Zero calls to datetime.utcnow() anywhere in seed_data.py.
Asset side (HuggingFace dataset)
carmax.tar.gz (~280 MB) was uploaded to ChilleD/WebHarbor in https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/15.
.assets-revision is bumped to that PR's merge SHA in this PR.
Contents of the tarball (extracts in place into sites/carmax/):
- instance_seed/carmax.db: the frozen seed DB
- static/images/vehicles/: 738 real CarMax stock photos covering 115/138 unique (year, make, model) tuples (~86% coverage)
- static/images/articles/: 10 article hero images
The 18 missing (year, make, model) tuples (Ford F-150 all years, BMW 3
Series all years, Mercedes-Benz C-Class all years, 2023 Toyota Corolla
/ Kia Sorento / Subaru Outback, 2021-22 Hyundai Elantra) have no evox
stock photos on the carmax CDN, so those vehicles fall back to a
CarMax-branded SVG placeholder. This matches the live site's behavior
for those exact combinations.
Test users (benchmark)
Five users with password CarMax!2026, each pre-populated for auth-gated tasks:
- alice.j@test.com
- bob.k@test.com
- carol.l@test.com
- dan.m@test.com
- emma.n@test.com
(Skill suggests bob.c / carol.d / david.k with TestPass123!, but since tasks.jsonl references these specific emails throughout, I kept the slightly different set. Functionally equivalent.)
Pre-PR checks
- python3 -m py_compile sites/carmax/app.py: clean
- python3 -m py_compile sites/carmax/seed_data.py: clean
- bash scripts/build.sh webharbor:dev: succeeds (image ~6.2 GB)
- /reset/carmax byte-identical (md5 above)
- Every task in tasks.jsonl has a verifiable answer in the seed
Anything that might want reviewer attention
- … bob.c@test.com / carol.d@test.com set: kept for tasks.jsonl internal consistency.
- … because the carmax CDN has no evox photos for those (make, model, year) combinations. Could be remediated by sourcing from a different CDN if the maintainer requires 100% coverage.
- SEED_NOW = datetime(2026, 1, 15, 12, 0, 0): matches the project's existing 2026 date pinning convention; please flag if a different reference date is preferred.
Happy to address any review feedback.