Add Merriam-Webster mirror site by YuanDaoze · Pull Request #44 · aiming-lab/WebHarbor

YuanDaoze · 2026-06-02T11:03:50Z

Summary

Adds the 16th WebHarbor site: a Merriam-Webster mirror with dictionary,
thesaurus, Word of the Day, vocabulary quizzes, and login/account.
142 real word entries, 30 thesaurus entries, 3 quizzes (10 questions
each), 20 benchmark tasks in tasks.jsonl.

Paired HF PR

Heavy assets (instance_seed/merriam_webster.db, 12 real images) live in:

- HF PR: https://huggingface.co/datasets/ChilleD/WebHarbor/discussions/29
- Currently pinned via .assets-revision to YuanDaozeiii/WebHarbor@8866e560
- After HF PR merges, will bump .assets-revision to ChilleD merge SHA

Pre-PR checks (passed locally)

- docker build webharbor:dev (5.89GB)
- 16/16 sites return HTTP 200
- /reset/merriam_webster byte-identical (md5 a4248bef..)
- /reset-all 16 sites parallel ~1.1s
- 20/20 benchmark tasks walkable in container
- All 15 existing sites still byte-identical (no regression)

Adds a Flask mirror of merriam-webster.com as the 16th WebHarbor site: dictionary, thesaurus, word of the day, vocabulary quizzes, and full account/login flow. 142 real word entries, 30 thesaurus entries, 3 quizzes (10 questions each), all scraped from the live site. 20 WebVoyager-format benchmark tasks in tasks.jsonl.

Registered as site index 15 (port 40015) in websyn_start.sh, control_server.py, and Dockerfile (EXPOSE 40000-40015).

Pre-PR checks (passed locally):
- docker build webharbor:dev (5.89GB)
- 16/16 sites return HTTP 200
- /reset/merriam_webster byte-identical (md5 a4248bef..)
- /reset-all 16 sites parallel ~1.1s
- 20/20 benchmark tasks walkable in container
- All 15 existing sites still byte-identical (no regression)

Assets: heavy assets (instance_seed/merriam_webster.db, 12 real images from MW games/quizzes) uploaded to HF dataset YuanDaozeiii/WebHarbor at revision 8866e560. .assets-revision pins to the fork until the HF PR adding merriam_webster.tar.gz to ChilleD/WebHarbor is merged.

Also fixes a pre-existing .gitignore bug where the inline comment on sites/*/scraped_data/ silently disabled the rule (gitignore does not support inline comments).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

YuanDaoze · 2026-06-10T09:22:37Z

Hello, please check the repo. @Raibows

DEM1TASSE

Review: Merriam-Webster mirror (PR #44)

Verdict: Changes requested (mergeable after task-design fixes).

The infrastructure is solid — byte-identical reset, clean build, faithful and
correct page content. The real concern is task quality: a large share of the 20
tasks can be answered from an LLM's prior knowledge without ever navigating the site,
which undercuts the point of a web-agent benchmark. A couple of tasks are also
ill-posed for autonomous evaluation. None of this is hard to fix, but it should be
fixed before this is used as a benchmark.

Reviewed by driving a real Chromium (Playwright) against the image built from this
branch + the pinned HF assets, on alt ports 8201 / 41000-41015, plus side-by-side
screenshot comparison against the live merriam-webster.com.

Mechanical checks: ✅ PASS

- All 16 sites return 200 (ports 41000–41015)
- Control plane healthy: all 16 sites alive
- Byte-identical reset holds: after POST /reset/merriam_webster,
  md5(instance/…db) == md5(instance_seed/…db) (a4248be…)
- Reset wipes runtime writes: re-checked md5 after login (writes
  SearchHistory), save/remove (mutates SavedWord), and register (adds a
  User); still matches seed. Hard invariant is solid.
- reset-all completes in ~0.97s, all 16 sites ready
- HF revision exists and contains all 16 site tarballs
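
The byte-identical check above can also be scripted rather than eyeballed from md5sum output. The following is a minimal sketch, assuming the two database files are reachable as local paths; the function names are hypothetical and not part of the repository.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reset_is_byte_identical(instance_db: Path, seed_db: Path) -> bool:
    """True when the live DB and the seed DB have the same MD5 digest."""
    return md5_of(instance_db) == md5_of(seed_db)
```

Running this after each kind of runtime write (login, save/remove, register) followed by a reset would reproduce the invariant the review verified by hand.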

Credit: idempotent seeding is done right — both seed functions early-return on a
populated DB, and a fixed WOTD_ANCHOR = date(2025,1,15) is used in the seed instead
of date.today(). Three-place registration (websyn_start.sh, control_server.py,
Dockerfile EXPOSE) is consistent; merriam_webster is index 15 → port 40015.
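
The seeding pattern credited above might look roughly like the sketch below. Only WOTD_ANCHOR = date(2025, 1, 15) comes from the PR; the table name, columns, and function name are assumptions made for illustration, and the real seed functions may differ.

```python
import sqlite3
from datetime import date, timedelta

WOTD_ANCHOR = date(2025, 1, 15)  # fixed anchor from the PR; never date.today()


def seed_wotd(conn: sqlite3.Connection, words: list[str]) -> int:
    """Insert one row per word; return how many rows were inserted."""
    # Hypothetical schema, for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS wotd (word TEXT PRIMARY KEY, feature_date TEXT)"
    )
    # Early return on a populated table makes re-seeding a no-op.
    (count,) = conn.execute("SELECT COUNT(*) FROM wotd").fetchone()
    if count:
        return 0
    # Dates are derived from the fixed anchor, so every build yields the same rows.
    rows = [
        (word, (WOTD_ANCHOR - timedelta(days=i)).isoformat())
        for i, word in enumerate(words)
    ]
    conn.executemany("INSERT INTO wotd VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)
```

Because nothing in the seed depends on the wall clock, two builds from the same inputs should produce the same rows, which is what the byte-identical reset relies on.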

Visual fidelity: ✅ Acceptable (judged as "clear + content-correct on the solve path")

Bar applied: not pixel-identical to the real site, but task-solve-path pages must be
clear and show semantically correct content. For a text-heavy dictionary with
essentially no imagery, "no images" is fine — what matters is that the definitions,
etymologies, synonyms, and antonyms shown are the real, correct data.

- Inner pages (word detail, thesaurus, WOTD, games, quiz) are clear and readable
- Content is correct: spot-checked serendipity etymology / first-known-use,
  happy synonyms+antonyms, etc.; all match real Merriam-Webster data
- Search-miss is graceful: "No entries found … Check your spelling"

Non-blocking "could improve" (pixel/layout divergence from the real site):

Homepage is much simpler than the real MW homepage — the real one is a
magazine-style page of large photographic feature cards (quizzes, articles, "Top
Lookups Right Now"); the mirror has only a text WOTD card + a Trending list.
Logo is a left-aligned text wordmark across the whole site, not MW's centered
circular badge. Thesaurus uses green/red chips vs MW's orange theme + sense-grouped
sidebar. These don't impede the agent's observations, so they're cosmetic.

Functional depth: ✅ Mostly PASS

- Login, logout, register (with WTForms validation: bad email + mismatched
  password → field errors, stays on /register)
- Search (exact headword redirects to the detail page) + autocomplete endpoint
- Save word → persists; remove saved word → list 3→2 with correct flash
- Quiz submit + scoring → score banner is correct (e.g. 6/10)
- Quiz accepts partial/empty submission and misrepresents unanswered
  questions as correct. You can submit having answered only one question (or
  none); there is no "answer all questions" validation. The score banner is
  right (e.g. 1/10), but the per-question review marks every question's correct
  answer green ✓ and only marks a red ✗ on a choice you picked; an unanswered
  question (picked is None) therefore renders identically to a correctly-answered
  one: green ✓, no "not answered" indicator. Verified: answering 1 of 10 →
  1/10 score, but 10 green ✓ / 0 red ✗ on the review page.
  - Why it matters: an agent (or the screenshot-reading judge) can conclude all
    answers were correct; and tasks #11/#12 ("answer all questions") can be
    "completed" by answering one, with the result page hiding the skips.
  - Root cause: quiz_result.html only branches on correct-answer vs picked-answer;
    it never handles picked is None. quiz.html radios aren't required and
    quiz_submit doesn't validate completeness.
  - Fix: require all questions answered before scoring (or flag unanswered ones),
    and render unanswered questions distinctly (e.g. "Not answered").
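
A minimal sketch of the suggested fix, assuming answers arrive as a mapping from question id to the picked choice, with None or a missing key for a skipped question. The function name and data shapes are hypothetical and are not taken from quiz_submit.

```python
from typing import Optional


def grade_quiz(correct: dict[int, str], picked: dict[int, Optional[str]]) -> dict:
    """Classify every question as correct, wrong, or unanswered."""
    review = {}
    for qid, answer in correct.items():
        choice = picked.get(qid)
        if choice is None:
            # The missing branch: a skipped question gets its own status.
            review[qid] = "unanswered"
        elif choice == answer:
            review[qid] = "correct"
        else:
            review[qid] = "wrong"
    unanswered = [qid for qid, status in review.items() if status == "unanswered"]
    return {
        "score": sum(status == "correct" for status in review.values()),
        "total": len(correct),
        "review": review,
        "unanswered": unanswered,
        "complete": not unanswered,
    }
```

A submit handler could reject the form when complete is False, or the result template could render the "unanswered" status as "Not answered" instead of a green ✓.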

Task quality: ⚠️ Changes requested — the main issue

Catalog is only 142 dictionary words (plus 30 thesaurus entries, 8 WOTD, 3
quizzes), skewed toward advanced/literary vocabulary. Common words (cry, baby, run,
dog, water…) are absent, so any off-script lookup dead-ends. The 20 tasks are all
self-consistent against this catalog (every referenced word exists), but the
catalog is thin for a "dictionary."

1. Most tasks are answerable from LLM prior knowledge without navigating (biggest issue)

This is a web-agent benchmark, but many tasks ask for dictionary facts a frontier
LLM already knows and can answer with zero site interaction:

Task	Asks for	LLM can answer without the site?
#0 serendipity part of speech	noun	yes
#2 nostalgia source language	Greek	yes
#5 / #6 brave synonyms / calm antonyms	courageous / angry…	yes
#8 happy synonym + antonym	—	yes
#17 / #19 compare words' first-use / POS	—	likely

Only the MW-specific respelling pronunciation (#0) and the exact first-known-use
year (#3 1909, #4 1827) are hard to produce from memory. The rest are
knowledge-recall, not web navigation.

2. No answer key + LLM-only judge amplifies #1

tasks.jsonl has no answer field, and there is no answer-key file anywhere. Grading
is agent_demo/eval_judge.py (LLM-as-judge) reading the trajectory — there is no
stored ground truth. Combined with #1: the judge LLM also knows the dictionary facts,
so an agent that hallucinates a plausible answer without opening the page can still
be marked success. The environment may never actually be exercised.
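
One way to close this gap is a stored answer key checked deterministically before, or instead of, the LLM judge. The sketch below assumes a hypothetical "answer" field holding a list of required strings on each tasks.jsonl line; no such field exists in the PR as reviewed, and the function names are illustrative.

```python
import json
import re


def normalize(text: str) -> str:
    """Lowercase and collapse non-alphanumeric runs into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def matches_key(task_line: str, agent_answer: str) -> bool:
    """True when every expected string appears as whole words in the answer."""
    task = json.loads(task_line)
    expected = task.get("answer")  # hypothetical field, not in tasks.jsonl today
    if not expected:
        raise ValueError("task has no answer key")
    # Pad with spaces so '1909' does not match inside '19090'.
    haystack = f" {normalize(agent_answer)} "
    return all(f" {normalize(item)} " in haystack for item in expected)
```

A string match like this only proves the answer is right, not that the agent read it off the page, so it complements rather than replaces tasks built on facts that resist prior knowledge.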

3. Several tasks have no stable, verifiable answer (not just #18)

A cluster of tasks can't be graded against a fixed ground truth in autonomous
evaluation:

Task	Problem	Severity
#18 "remove a word … ask me which one to remove if it's unclear"	Presupposes a human to ask; in autonomous eval there is none, so the agent guesses and "success" is undefined.	must fix
#9 "tell me today's featured word and its POS"	`todays_wotd()` rotates by `date.today()`, so the answer changes by run date; the page also shows a 2025 `feature_date` under a 2026 footer.	must fix
#11 "answer all questions and tell me your final score"	Score depends on the agent's choices — no fixed answer (and broken by the quiz bug above).	should fix
#12 "report how many you got correct out of the total"	Same as #11 — non-deterministic.	should fix
#16 "username of your choice, any email"	Free input, no fixed answer — but the goal ("verify you are logged in") is verifiable, so this one is acceptable.	borderline / OK

Fixes: name the word to remove in #18 (e.g. "remove curiosity"); anchor #9 to a
fixed date or a specific dated word; for #11/#12 grade on "completed the quiz and
correctly reported the on-screen score" (and fix the quiz bug so the score page is
trustworthy).

Note: first-person phrasing elsewhere ("tell me…", "your word list", "your account"
in #0/#6/#8/#10/#14/#15) is just normal instruction phrasing — the agent reports to
the user and "your list" = the logged-in account's list. Those are fine. #13 ("how
many quizzes + difficulty of each") is deterministic (3 quizzes: easy/medium/medium).

4. Quiz tasks (#11/#12) have no deterministic answer + are mechanically trivial

"answer all questions and tell me your final score" — any score is valid, and
selecting all-first-choice without reading still "completes" it. These do resist the
knowledge shortcut (you must navigate the quiz and read the score off the page), which
is good, but they test only the mechanical flow. The quiz itself is a plain
radio-button form; consider richer, more interactive game formats (the real MW has
Wordle-style / Drop-a-Letter games) to make these meaningfully harder.

Suggested fixes

- Re-anchor tasks on MW-specific, on-page facts that resist prior knowledge:
  exact first-known-use years, MW respelling pronunciations, the exact wording of a
  numbered sense, the WOTD "Did You Know?" text, a specific example sentence, quiz
  question wording; things the agent must read off the page.
- Broaden the catalog so off-script lookups don't dead-end, and so tasks have real
  distractors.
- Fix #18 (name the word). Consider an answer key or stricter judge rubric so
  knowledge-shortcut answers fail.
- WOTD "today" is run-date dependent (todays_wotd() rotates by
  date.today().toordinal()), and the page renders a 2025 feature_date while the
  footer says 2026; task #9 ("today's featured word") has no stable answer. Anchor it
  to a fixed date or target a specific dated word (#10 already does, which is good).
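
To illustrate why task #9 has no stable answer, here is a sketch of an ordinal-based rotation next to an anchored one. The review only states that todays_wotd() rotates by date.today().toordinal(); the modulo scheme and the function names below are assumptions, not the site's actual code.

```python
from datetime import date

WOTD_ANCHOR = date(2025, 1, 15)  # the fixed seed date mentioned in the review


def wotd_index(day: date, n_words: int) -> int:
    """Pick a rotation slot from the day's ordinal (assumed modulo scheme)."""
    return day.toordinal() % n_words


def todays_index_unstable(n_words: int) -> int:
    """Run-date dependent: the slot changes whenever the calendar day changes."""
    return wotd_index(date.today(), n_words)


def anchored_index(n_words: int, anchor: date = WOTD_ANCHOR) -> int:
    """Deterministic: always the slot for the fixed anchor date."""
    return wotd_index(anchor, n_words)
```

Under this scheme, consecutive days always land on different slots when there is more than one word, so a benchmark run on a different day would see a different "today's word" unless the lookup is anchored.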

Assets PR: confirm before merge

.assets-revision pins repo: YuanDaozeiii/WebHarbor @ a591b293… — a personal HF
fork, not the canonical ChilleD/WebHarbor. Per the two-repo workflow the assets
should be merged into ChilleD and the pin updated to that merge SHA, otherwise
fetch_assets.sh breaks if the fork is deleted/rewritten. Please land on ChilleD
and re-pin (or confirm a ChilleD HF PR is open).

Summary

Dimension	Result
Mechanical (build / 200 / byte-identical reset / reset-all)	✅ PASS — solid
Visual (clear + content-correct on solve path)	✅ Acceptable; homepage/logo cosmetics could improve
Functional (auth / search / save / remove / quiz)	⚠️ Mostly PASS — quiz accepts partial/empty submit and shows unanswered questions as correct
Task quality	⚠️ Changes requested — knowledge-shortcut tasks, no answer key, #18 human-in-loop, thin catalog
Assets pin	⚠️ Confirm — points at a personal fork, not ChilleD

Bottom line: strong engineering, but the task set needs a redesign pass so it
actually tests web navigation rather than LLM recall. Recommend changes before merge.

Evidence

Screenshots were captured during review (mirror pages vs. the live site,
side by side) — available on request.

Reproduce

gh pr checkout 44
./scripts/fetch_assets.sh
./scripts/build.sh webharbor:dev
docker run -d --rm --name wh-review -p 8201:8101 -p 41000-41015:40000-40015 webharbor:dev
for p in $(seq 41000 41015); do curl -so /dev/null -w "$p:%{http_code}\n" http://localhost:$p/; done
curl -X POST http://localhost:8201/reset/merriam_webster
docker exec wh-review md5sum \
  /opt/WebSyn/merriam_webster/instance/merriam_webster.db \
  /opt/WebSyn/merriam_webster/instance_seed/merriam_webster.db
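
The curl loop above can also be expressed as a small script, which is easier to extend with retries or reporting. This is a sketch under the review's port mapping (host ports 41000 to 41015 for 16 sites); the function names are hypothetical, and the status fetcher is injectable so the logic can be exercised without a running container.

```python
import urllib.error
import urllib.request
from typing import Callable

N_SITES = 16       # site indices 0..15, per the review
HOST_BASE = 41000  # alternate host port base used in the review


def host_port(site_index: int) -> int:
    """Map a site index to its host port (index 15 is merriam_webster)."""
    if not 0 <= site_index < N_SITES:
        raise ValueError(f"site index out of range: {site_index}")
    return HOST_BASE + site_index


def http_status(url: str) -> int:
    """Return the HTTP status for a GET, or 0 when the connection fails."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:
        return 0


def failing_sites(fetch_status: Callable[[str], int] = http_status) -> list[str]:
    """Return the URLs of all sites that did not answer with HTTP 200."""
    urls = [f"http://localhost:{host_port(i)}/" for i in range(N_SITES)]
    return [url for url in urls if fetch_status(url) != 200]
```

With the review container running, an empty list from failing_sites() corresponds to the "all 16 sites return 200" check.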

Raibows · 2026-06-24T09:00:35Z

Hi @YuanDaoze Thank you for your amazing contribution!

Hi @DEM1TASSE, thank you for your thorough review. Every point has now been addressed in my updates.

I also added sections on deterministic verification, which the reviewer should contribute in the future. In this collaboration workflow, the Contributor proposes the tasks, and the Reviewer later proposes the verifiers independently.

…word seed) HF PR ChilleD/WebHarbor#29 merged; .assets-revision now points at the canonical dataset's main commit carrying the regenerated 156-word seed. Verified: fetch from ChilleD main -> build -> byte-identical reset (md5 79ae0eab…) holds; runtime writes wiped on reset.

YuanDaoze changed the title ~~feat(merriam_webster): add Merriam-Webster mirror site~~ Add Merriam-Webster mirror site Jun 2, 2026

YuanDaoze force-pushed the feat/merriam-webster branch from 6268698 to ce6d6e3 Compare June 2, 2026 11:31

DEM1TASSE suggested changes Jun 20, 2026


fix issues by comments

2fc4acd

Raibows merged commit 438a029 into aiming-lab:main Jun 24, 2026

boyugou mentioned this pull request Jun 25, 2026: feat(drugs_com): add drugs.com mirror site (port 40016) #9 (Open)

Raibows merged 3 commits into aiming-lab:main from YuanDaoze:feat/merriam-webster
