Add Merriam-Webster mirror site #44
Adds a Flask mirror of merriam-webster.com as the 16th WebHarbor site: dictionary, thesaurus, word of the day, vocabulary quizzes, and full account/login flow. 142 real word entries, 30 thesaurus entries, 3 quizzes (10 questions each), all scraped from the live site. 20 WebVoyager-format benchmark tasks in tasks.jsonl. Registered as site index 15 (port 40015) in websyn_start.sh, control_server.py, and Dockerfile (EXPOSE 40000-40015).
Pre-PR checks (passed locally):
- docker build webharbor:dev (5.89GB)
- 16/16 sites return HTTP 200
- /reset/merriam_webster byte-identical (md5 a4248bef..)
- /reset-all 16 sites parallel ~1.1s
- 20/20 benchmark tasks walkable in container
- All 15 existing sites still byte-identical (no regression)
Assets: heavy assets (instance_seed/merriam_webster.db, 12 real images from MW games/quizzes) uploaded to HF dataset YuanDaozeiii/WebHarbor at revision 8866e560. .assets-revision pins to the fork until the HF PR adding merriam_webster.tar.gz to ChilleD/WebHarbor is merged.
Also fixes a pre-existing .gitignore bug where the inline comment on sites/*/scraped_data/ silently disabled the rule (gitignore does not support inline comments).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
hello, please check the repo~ @Raibows
Review: Merriam-Webster mirror (PR #44)
Verdict: Changes requested (mergeable after task-design fixes).
The infrastructure is solid — byte-identical reset, clean build, faithful and
correct page content. The real concern is task quality: a large share of the 20
tasks can be answered from an LLM's prior knowledge without ever navigating the site,
which undercuts the point of a web-agent benchmark. A couple of tasks are also
ill-posed for autonomous evaluation. None of this is hard to fix, but it should be
fixed before this is used as a benchmark.
Reviewed by driving a real Chromium (Playwright) against the image built from this
branch + the pinned HF assets, on alt ports 8201 / 41000-41015, plus side-by-side
screenshot comparison against the live merriam-webster.com.
Mechanical checks: ✅ PASS
- All 16 sites return 200 (ports 41000–41015)
- Control plane healthy: all 16 sites alive
- Byte-identical reset holds: after POST /reset/merriam_webster, md5(instance/…db) == md5(instance_seed/…db) (a4248be…)
- Reset wipes runtime writes: re-checked md5 after login (writes SearchHistory), save/remove (mutates SavedWord), and register (adds a User); still matches seed. Hard invariant is solid.
- reset-all completes in ~0.97s, all 16 sites ready
- HF revision exists and contains all 16 site tarballs
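For reference, the byte-identical check above can be reproduced outside the container with a few lines of Python. This is only a sketch: the helper names md5_of and reset_is_byte_identical are made up for illustration, and the two paths are whatever copies of the instance and seed databases you have on hand.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reset_is_byte_identical(instance_db: Path, seed_db: Path) -> bool:
    """True when the live DB hashes to the same value as the seed DB."""
    return md5_of(instance_db) == md5_of(seed_db)
```

Running it once right after a reset and again after a login, save/remove, or register followed by another reset covers the same invariant as the docker exec md5sum call in the Reproduce section.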
Credit: idempotent seeding is done right — both seed functions early-return on a
populated DB, and a fixed WOTD_ANCHOR = date(2025,1,15) is used in the seed instead
of date.today(). Three-place registration (websyn_start.sh, control_server.py,
Dockerfile EXPOSE) is consistent; merriam_webster is index 15 → port 40015.
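To make the seeding pattern concrete, here is a minimal sketch of an idempotent seed function. It is not the PR's code: the table layout, the seed_wotd name, and the sample words are assumptions for illustration, and it uses plain sqlite3 rather than whatever ORM the site uses. Only WOTD_ANCHOR = date(2025, 1, 15) is taken from the branch.

```python
import sqlite3
from datetime import date, timedelta

WOTD_ANCHOR = date(2025, 1, 15)  # fixed anchor, as used in the PR's seed

# Hypothetical sample data; the real seed loads the scraped entries.
SAMPLE_WORDS = ["serendipity", "nostalgia", "curiosity"]


def seed_wotd(conn: sqlite3.Connection) -> int:
    """Insert WOTD rows once; return the number of rows inserted."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS wotd (word TEXT PRIMARY KEY, feature_date TEXT)"
    )
    (count,) = conn.execute("SELECT COUNT(*) FROM wotd").fetchone()
    if count:
        return 0  # already populated: early return keeps re-seeding a no-op
    rows = [
        # Dates count back from the fixed anchor, never from date.today().
        (word, (WOTD_ANCHOR - timedelta(days=i)).isoformat())
        for i, word in enumerate(SAMPLE_WORDS)
    ]
    conn.executemany("INSERT INTO wotd VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)
```

Because the dates derive from a constant, rebuilding the seed on a different day produces the same rows, which is what lets the md5 of the seed DB stay stable.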
Visual fidelity: ✅ Acceptable (judged as "clear + content-correct on the solve path")
Bar applied: not pixel-identical to the real site, but task-solve-path pages must be
clear and show semantically correct content. For a text-heavy dictionary with
essentially no imagery, "no images" is fine — what matters is that the definitions,
etymologies, synonyms, and antonyms shown are the real, correct data.
- Inner pages (word detail, thesaurus, WOTD, games, quiz) are clear and readable
- Content is correct: spot-checked serendipity etymology / first-known-use, happy synonyms+antonyms, etc.; all match real Merriam-Webster data
- Search-miss is graceful: "No entries found … Check your spelling"
Non-blocking "could improve" (pixel/layout divergence from the real site):
- Homepage is much simpler than the real MW homepage: the real one is a
magazine-style page of large photographic feature cards (quizzes, articles, "Top
Lookups Right Now"); the mirror has only a text WOTD card + a Trending list.
- Logo is a left-aligned text wordmark across the whole site, not MW's centered
circular badge. Thesaurus uses green/red chips vs MW's orange theme + sense-grouped
sidebar. These don't impede the agent's observations, so they're cosmetic.
Functional depth: ✅ Mostly PASS
- Login, logout, register (with WTForms validation: bad email + mismatched
password → field errors, stays on /register)
- Search (exact headword redirects to the detail page) + autocomplete endpoint
- Save word → persists; remove saved word → list 3→2 with correct flash
- Quiz submit + scoring → score banner is correct (e.g. 6/10)
- Quiz accepts partial/empty submission and misrepresents unanswered
questions as correct. You can submit having answered only one question (or
none) — there is no "answer all questions" validation. The score banner is
right (e.g. 1/10), but the per-question review marks every question's correct
answer green ✓ and only marks a red ✗ on a choice you picked; an unanswered
question (picked is None) therefore renders identically to a correctly-answered
one: green ✓, no "not answered" indicator. Verified: answering 1 of 10 gives a
1/10 score, but 10 green ✓ / 0 red ✗ on the review page.
- Why it matters: an agent (or the screenshot-reading judge) can conclude all
answers were correct; and tasks #11/#12 ("answer all questions") can be
"completed" by answering one, with the result page hiding the skips.
- Root cause: quiz_result.html only branches on correct-answer vs picked-answer;
it never handles picked is None. quiz.html radios aren't required and
quiz_submit doesn't validate completeness.
- Fix: require all questions answered before scoring (or flag unanswered ones),
and render unanswered questions distinctly (e.g. "Not answered").
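One possible shape for the fix, sketched under assumptions: grade_quiz and its dict-based inputs are hypothetical and do not mirror the actual quiz_submit signature. The point is only that each question gets one of three states, so the template has something to branch on besides correct vs picked.

```python
from typing import Dict, Optional


def grade_quiz(correct: Dict[int, str], picked: Dict[int, Optional[str]]) -> dict:
    """Classify each question as 'correct', 'incorrect', or 'unanswered'."""
    status = {}
    for qid, answer in correct.items():
        choice = picked.get(qid)  # missing key and explicit None both mean skipped
        if choice is None:
            status[qid] = "unanswered"
        elif choice == answer:
            status[qid] = "correct"
        else:
            status[qid] = "incorrect"
    unanswered = [qid for qid, state in status.items() if state == "unanswered"]
    score = sum(1 for state in status.values() if state == "correct")
    return {
        "score": score,
        "total": len(correct),
        "status": status,
        "unanswered": unanswered,
    }
```

A submit handler could reject the form whenever unanswered is non-empty, or pass status through so the result page renders "Not answered" instead of a green ✓.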
Task quality: ⚠️ Changes requested — the main issue
Catalog is only 142 dictionary words (plus 30 thesaurus entries, 8 WOTD, 3
quizzes), skewed toward advanced/literary vocabulary. Common words (cry, baby, run,
dog, water…) are absent, so any off-script lookup dead-ends. The 20 tasks are all
self-consistent against this catalog (every referenced word exists), but the
catalog is thin for a "dictionary."
1. Most tasks are answerable from LLM prior knowledge without navigating (biggest issue)
This is a web-agent benchmark, but many tasks ask for dictionary facts a frontier
LLM already knows and can answer with zero site interaction:
| Task | Asks for | LLM can answer without the site? |
|---|---|---|
| #0 serendipity part of speech | noun | yes |
| #2 nostalgia source language | Greek | yes |
| #5 / #6 brave synonyms / calm antonyms | courageous / angry… | yes |
| #8 happy synonym + antonym | — | yes |
| #17 / #19 compare words' first-use / POS | — | likely |
Only the MW-specific respelling pronunciation (#0) and the exact first-known-use
year (#3 1909, #4 1827) are hard to produce from memory. The rest are
knowledge-recall, not web navigation.
2. No answer key + LLM-only judge amplifies #1
tasks.jsonl has no answer field, and there is no answer-key file anywhere. Grading
is agent_demo/eval_judge.py (LLM-as-judge) reading the trajectory — there is no
stored ground truth. Combined with #1: the judge LLM also knows the dictionary facts,
so an agent that hallucinates a plausible answer without opening the page can still
be marked success. The environment may never actually be exercised.
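A stored answer key would give the grader something fixed to compare against. The sketch below assumes a hypothetical "answer" field added to each tasks.jsonl line (no such field exists today) and does a normalized whole-token match; it is a starting point, not a replacement for eval_judge.py.

```python
import json
import re


def normalize(text: str) -> str:
    """Lowercase and collapse every non-alphanumeric run to a single space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def matches_answer_key(task_line: str, agent_answer: str) -> bool:
    """True if every expected answer appears as whole tokens in the reply."""
    task = json.loads(task_line)
    expected = task.get("answer")  # hypothetical field, not in tasks.jsonl yet
    if expected is None:
        raise ValueError("task has no stored answer to grade against")
    needles = expected if isinstance(expected, list) else [expected]
    haystack = f" {normalize(agent_answer)} "
    return all(f" {normalize(needle)} " in haystack for needle in needles)
```

A string match alone cannot tell whether the agent actually opened the page, so it would still need to be paired with a trajectory check (for example, that the word's detail URL was visited).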
3. Several tasks have no stable, verifiable answer (not just #18)
A cluster of tasks can't be graded against a fixed ground truth in autonomous
evaluation:
| Task | Problem | Severity |
|---|---|---|
| #18 "remove a word … ask me which one to remove if it's unclear" | Presupposes a human to ask; in autonomous eval there is none, so the agent guesses and "success" is undefined. | must fix |
| #9 "tell me today's featured word and its POS" | todays_wotd() rotates by date.today(), so the answer changes by run date; the page also shows a 2025 feature_date under a 2026 footer. | must fix |
| #11 "answer all questions and tell me your final score" | Score depends on the agent's choices — no fixed answer (and broken by the quiz bug above). | should fix |
| #12 "report how many you got correct out of the total" | Same as #11 — non-deterministic. | should fix |
| #16 "username of your choice, any email" | Free input, no fixed answer — but the goal ("verify you are logged in") is verifiable, so this one is acceptable. | borderline / OK |
Fixes: name the word to remove in #18 (e.g. "remove curiosity"); anchor #9 to a
fixed date or a specific dated word; for #11/#12 grade on "completed the quiz and
correctly reported the on-screen score" (and fix the quiz bug so the score page is
trustworthy).
Note: first-person phrasing elsewhere ("tell me…", "your word list", "your account"
in #0/#6/#8/#10/#14/#15) is just normal instruction phrasing — the agent reports to
the user and "your list" = the logged-in account's list. Those are fine. #13 ("how
many quizzes + difficulty of each") is deterministic (3 quizzes: easy/medium/medium).
4. Quiz tasks (#11/#12) have no deterministic answer + are mechanically trivial
"answer all questions and tell me your final score" — any score is valid, and
selecting all-first-choice without reading still "completes" it. These do resist the
knowledge shortcut (you must navigate the quiz and read the score off the page), which
is good, but they test only the mechanical flow. The quiz itself is a plain
radio-button form; consider richer, more interactive game formats (the real MW has
Wordle-style / Drop-a-Letter games) to make these meaningfully harder.
Suggested fixes
- Re-anchor tasks on MW-specific, on-page facts that resist prior knowledge:
exact first-known-use years, MW respelling pronunciations, the exact wording of a
numbered sense, the WOTD "Did You Know?" text, a specific example sentence, quiz
question wording; things the agent must read off the page.
- Broaden the catalog so off-script lookups don't dead-end, and so tasks have real
distractors.
- Fix #18 (name the word). Consider an answer key or stricter judge rubric so
knowledge-shortcut answers fail.
- WOTD "today" is run-date dependent (todays_wotd() rotates by
date.today().toordinal()), and the page renders a 2025 feature_date while the
footer says 2026, so task #9 ("today's featured word") has no stable answer. Anchor it
to a fixed date or target a specific dated word (#10 already does; good).
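As an illustration of the last point, the rotation can take its reference date as a parameter and default to the fixed anchor rather than the wall clock. This is a sketch only: wotd_for is a hypothetical stand-in for todays_wotd(), and the modulo rotation is assumed from the date.today().toordinal() behaviour described above.

```python
from datetime import date
from typing import List, Optional

WOTD_ANCHOR = date(2025, 1, 15)  # same fixed date the seed already uses


def wotd_for(words: List[str], today: Optional[date] = None) -> str:
    """Pick the featured word by rotating on a reference date's ordinal.

    Defaulting to WOTD_ANCHOR keeps the result identical on every run.
    """
    reference = today if today is not None else WOTD_ANCHOR
    return words[reference.toordinal() % len(words)]
```

With the default, task #9 has one fixed answer regardless of when the benchmark runs; passing an explicit date keeps the rotation testable.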
Assets PR: confirm before merge
.assets-revision pins repo: YuanDaozeiii/WebHarbor @ a591b293… — a personal HF
fork, not the canonical ChilleD/WebHarbor. Per the two-repo workflow the assets
should be merged into ChilleD and the pin updated to that merge SHA, otherwise
fetch_assets.sh breaks if the fork is deleted/rewritten. Please land on ChilleD
and re-pin (or confirm a ChilleD HF PR is open).
Summary
| Dimension | Result |
|---|---|
| Mechanical (build / 200 / byte-identical reset / reset-all) | ✅ PASS — solid |
| Visual (clear + content-correct on solve path) | ✅ Acceptable; homepage/logo cosmetics could improve |
| Functional (auth / search / save / remove / quiz) | ✅ Mostly PASS |
| Task quality | ⚠️ Changes requested |
| Assets pin | Confirm before merge |
Bottom line: strong engineering, but the task set needs a redesign pass so it
actually tests web navigation rather than LLM recall. Recommend changes before merge.
Evidence
Screenshots were captured during review (mirror pages vs. the live site,
side by side) — available on request.
Reproduce
gh pr checkout 44
./scripts/fetch_assets.sh
./scripts/build.sh webharbor:dev
docker run -d --rm --name wh-review -p 8201:8101 -p 41000-41015:40000-40015 webharbor:dev
for p in $(seq 41000 41015); do curl -so /dev/null -w "$p:%{http_code}\n" http://localhost:$p/; done
curl -X POST http://localhost:8201/reset/merriam_webster
docker exec wh-review md5sum \
/opt/WebSyn/merriam_webster/instance/merriam_webster.db \
/opt/WebSyn/merriam_webster/instance_seed/merriam_webster.db
Hi @YuanDaoze, thank you for your amazing contribution! Hi @DEM1TASSE, thank you for your thorough review; every point has now been addressed in my updates. I also added sections on deterministic verification, which the reviewer should contribute in the future. In this collaboration workflow, the Contributor proposes the tasks and the Reviewer later proposes the verifiers, independently.
…word seed) HF PR ChilleD/WebHarbor#29 merged; .assets-revision now points at the canonical dataset's main commit carrying the regenerated 156-word seed. Verified: fetch from ChilleD main -> build -> byte-identical reset (md5 79ae0eab…) holds; runtime writes wiped on reset.
Summary
Adds the 16th WebHarbor site: a Merriam-Webster mirror with dictionary,
thesaurus, Word of the Day, vocabulary quizzes, and login/account.
142 real word entries, 30 thesaurus entries, 3 quizzes (10 questions
each), 20 benchmark tasks in tasks.jsonl.
Paired HF PR
Heavy assets (instance_seed/merriam_webster.db, 12 real images) live in:
- .assets-revision to YuanDaozeiii/WebHarbor@8866e560
- .assets-revision to ChilleD merge SHA
Pre-PR checks (passed locally)