
Add Merriam-Webster mirror site #44

Merged: Raibows merged 3 commits into aiming-lab:main from YuanDaoze:feat/merriam-webster
Jun 24, 2026
YuanDaoze commented Jun 2, 2026

Summary

Adds the 16th WebHarbor site: a Merriam-Webster mirror with dictionary,
thesaurus, Word of the Day, vocabulary quizzes, and login/account.
142 real word entries, 30 thesaurus entries, 3 quizzes (10 questions
each), 20 benchmark tasks in tasks.jsonl.

Paired HF PR

Heavy assets (instance_seed/merriam_webster.db, 12 real images) live in:

Pre-PR checks (passed locally)

  • docker build webharbor:dev (5.89GB)
  • 16/16 sites return HTTP 200
  • /reset/merriam_webster byte-identical (md5 a4248bef..)
  • /reset-all 16 sites parallel ~1.1s
  • 20/20 benchmark tasks walkable in container
  • All 15 existing sites still byte-identical (no regression)

YuanDaoze changed the title from "feat(merriam_webster): add Merriam-Webster mirror site" to "Add Merriam-Webster mirror site" on Jun 2, 2026

Adds a Flask mirror of merriam-webster.com as the 16th WebHarbor site:
dictionary, thesaurus, word of the day, vocabulary quizzes, and full
account/login flow. 142 real word entries, 30 thesaurus entries, 3
quizzes (10 questions each), all scraped from the live site. 20
WebVoyager-format benchmark tasks in tasks.jsonl.

Registered as site index 15 (port 40015) in websyn_start.sh,
control_server.py, and Dockerfile (EXPOSE 40000-40015).

Pre-PR checks (passed locally):
- docker build webharbor:dev (5.89GB)
- 16/16 sites return HTTP 200
- /reset/merriam_webster byte-identical (md5 a4248bef..)
- /reset-all 16 sites parallel ~1.1s
- 20/20 benchmark tasks walkable in container
- All 15 existing sites still byte-identical (no regression)

Assets: heavy assets (instance_seed/merriam_webster.db, 12 real images
from MW games/quizzes) uploaded to HF dataset YuanDaozeiii/WebHarbor at
revision 8866e560. .assets-revision pins to the fork until the HF PR
adding merriam_webster.tar.gz to ChilleD/WebHarbor is merged.

Also fixes a pre-existing .gitignore bug where the inline comment on
sites/*/scraped_data/ silently disabled the rule (gitignore does not
support inline comments).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

YuanDaoze force-pushed the feat/merriam-webster branch from 6268698 to ce6d6e3 on June 2, 2026 11:31

YuanDaoze (Author) commented:

hello, please check the repo~ @Raibows

DEM1TASSE left a comment

Review: Merriam-Webster mirror (PR #44)

Verdict: Changes requested (mergeable after task-design fixes).

The infrastructure is solid — byte-identical reset, clean build, faithful and
correct page content. The real concern is task quality: a large share of the 20
tasks can be answered from an LLM's prior knowledge without ever navigating the site,
which undercuts the point of a web-agent benchmark. A couple of tasks are also
ill-posed for autonomous evaluation. None of this is hard to fix, but it should be
fixed before this is used as a benchmark.

Reviewed by driving a real Chromium (Playwright) against the image built from this
branch + the pinned HF assets, on alt ports 8201 / 41000-41015, plus side-by-side
screenshot comparison against the live merriam-webster.com.


Mechanical checks: ✅ PASS

  • All 16 sites return 200 (ports 41000–41015)
  • Control plane healthy — all 16 sites alive
  • Byte-identical reset holds: after POST /reset/merriam_webster,
    md5(instance/…db) == md5(instance_seed/…db) (a4248be…)
  • Reset wipes runtime writes: re-checked md5 after login (writes
    SearchHistory), save/remove (mutates SavedWord), and register (adds a
    User) — still matches seed. Hard invariant is solid.
  • reset-all completes in ~0.97s, all 16 sites ready
  • HF revision exists and contains all 16 site tarballs
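
The byte-identical check above can also be scripted outside the container. The following is a minimal sketch, assuming the live and seed databases are reachable as local files; the helper names `md5_of` and `reset_is_byte_identical` are hypothetical and not part of the repository.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reset_is_byte_identical(live_db: Path, seed_db: Path) -> bool:
    """True when the live DB has the same digest as the seed DB."""
    return md5_of(live_db) == md5_of(seed_db)
```

Calling `reset_is_byte_identical` after each mutating flow (login, save/remove, register) followed by a reset would mirror the manual re-checks listed above.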

Credit: idempotent seeding is done right — both seed functions early-return on a
populated DB, and a fixed WOTD_ANCHOR = date(2025,1,15) is used in the seed instead
of date.today(). Three-place registration (websyn_start.sh, control_server.py,
Dockerfile EXPOSE) is consistent; merriam_webster is index 15 → port 40015.
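
For readers unfamiliar with the pattern, idempotent seeding with a fixed date anchor might look roughly like the sketch below. Only `WOTD_ANCHOR = date(2025, 1, 15)` comes from the PR; the `wotd` table, its columns, and the `seed_wotd` function are assumptions made for illustration, and the real seed functions may differ.

```python
import sqlite3
from datetime import date, timedelta

WOTD_ANCHOR = date(2025, 1, 15)  # fixed anchor from the PR, instead of date.today()


def seed_wotd(conn: sqlite3.Connection, words: list[str]) -> int:
    """Seed Word of the Day rows once; return the number of rows inserted."""
    # Hypothetical schema, used only for this sketch.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS wotd "
        "(word TEXT PRIMARY KEY, feature_date TEXT NOT NULL)"
    )
    already_seeded = conn.execute("SELECT COUNT(*) FROM wotd").fetchone()[0]
    if already_seeded:
        return 0  # early return on a populated DB keeps the seed idempotent
    rows = [
        (word, (WOTD_ANCHOR - timedelta(days=offset)).isoformat())
        for offset, word in enumerate(words)
    ]
    conn.executemany("INSERT INTO wotd (word, feature_date) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)
```

Because every date is derived from the anchor rather than the wall clock, rebuilding the seed on a different day produces the same rows, which is what makes a byte-identical reset possible.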

Visual fidelity: ✅ Acceptable (judged as "clear + content-correct on the solve path")

Bar applied: not pixel-identical to the real site, but task-solve-path pages must be
clear and show semantically correct content. For a text-heavy dictionary with
essentially no imagery, "no images" is fine; what matters is that the definitions,
etymologies, synonyms, and antonyms shown are the real, correct data.

  • Inner pages (word detail, thesaurus, WOTD, games, quiz) are clear and readable
  • Content is correct: spot-checked serendipity etymology / first-known-use,
    happy synonyms+antonyms, etc. — all match real Merriam-Webster data
  • Search-miss is graceful: "No entries found … Check your spelling"

Non-blocking "could improve" (pixel/layout divergence from the real site):

  • Homepage is much simpler than the real MW homepage — the real one is a
    magazine-style page of large photographic feature cards (quizzes, articles, "Top
    Lookups Right Now"); the mirror has only a text WOTD card + a Trending list.
  • Logo is a left-aligned text wordmark across the whole site, not MW's centered
    circular badge. Thesaurus uses green/red chips vs MW's orange theme + sense-grouped
    sidebar. These don't impede the agent's observations, so they're cosmetic.

Functional depth: ✅ Mostly PASS

  • Login, logout, register (with WTForms validation: bad email + mismatched
    password → field errors, stays on /register)
  • Search (exact headword redirects to the detail page) + autocomplete endpoint
  • Save word → persists; remove saved word → list 3→2 with correct flash
  • Quiz submit + scoring → score banner is correct (e.g. 6/10)
  • Quiz accepts partial/empty submission and misrepresents unanswered
    questions as correct. You can submit having answered only one question (or
    none); there is no "answer all questions" validation. The score banner is
    right (e.g. 1/10), but the per-question review marks every question's correct
    answer green ✓ and only marks a red ✗ on a choice you picked; an unanswered
    question (picked is None) therefore renders identically to a correctly-answered
    one: green ✓, no "not answered" indicator. Verified: answering 1 of 10 gives a
    1/10 score, but 10 green ✓ / 0 red ✗ on the review page.
    - Why it matters: an agent (or the screenshot-reading judge) can conclude all
      answers were correct; and tasks #11/#12 ("answer all questions") can be
      "completed" by answering one, with the result page hiding the skips.
    - Root cause: quiz_result.html only branches on correct-answer vs picked-answer;
      it never handles picked is None. quiz.html radios aren't required and
      quiz_submit doesn't validate completeness.
    - Fix: require all questions answered before scoring (or flag unanswered ones),
      and render unanswered questions distinctly (e.g. "Not answered").
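
One way the fix could be structured is sketched below, under stated assumptions: the `QuestionResult` class and the `grade_quiz` function are hypothetical and are not the PR's actual quiz_submit code. The idea is to give every question a three-way status so the result template has an explicit "unanswered" branch, and to return the list of skipped questions so the view can reject an incomplete submission.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QuestionResult:
    question_id: int
    picked: Optional[str]  # None means the question was skipped
    correct: str

    @property
    def status(self) -> str:
        """Three-way status so a template never conflates skipped with correct."""
        if self.picked is None:
            return "unanswered"
        return "correct" if self.picked == self.correct else "incorrect"


def grade_quiz(answer_key: dict[int, str], submitted: dict[int, str]):
    """Return (per-question results, score, ids of unanswered questions)."""
    results = [
        QuestionResult(qid, submitted.get(qid), correct)
        for qid, correct in answer_key.items()
    ]
    missing = [r.question_id for r in results if r.status == "unanswered"]
    score = sum(r.status == "correct" for r in results)
    return results, score, missing
```

A view could refuse to score while `missing` is non-empty, or score anyway and render each `"unanswered"` result with a "Not answered" label instead of a green ✓.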

Task quality: ⚠️ Changes requested — the main issue

Catalog is only 142 dictionary words (plus 30 thesaurus entries, 8 WOTD, 3
quizzes), skewed toward advanced/literary vocabulary. Common words (cry, baby, run,
dog, water…) are absent, so any off-script lookup dead-ends. The 20 tasks are all
self-consistent against this catalog (every referenced word exists), but the
catalog is thin for a "dictionary."

1. Most tasks are answerable from LLM prior knowledge without navigating (biggest issue)

This is a web-agent benchmark, but many tasks ask for dictionary facts a frontier
LLM already knows and can answer with zero site interaction:

| Task | Asks for | LLM can answer without the site? |
| --- | --- | --- |
| #0 | serendipity part of speech | yes (noun) |
| #2 | nostalgia source language | yes (Greek) |
| #5 / #6 | brave synonyms / calm antonyms | yes (courageous / angry…) |
| #8 | happy synonym + antonym | yes |
| #17 / #19 | compare words' first-use / POS | likely |

Only the MW-specific respelling pronunciation (#0) and the exact first-known-use
year (#3 1909, #4 1827) are hard to produce from memory. The rest are
knowledge-recall, not web navigation.

2. No answer key + LLM-only judge amplifies #1

tasks.jsonl has no answer field, and there is no answer-key file anywhere. Grading
is agent_demo/eval_judge.py (LLM-as-judge) reading the trajectory — there is no
stored ground truth. Combined with #1: the judge LLM also knows the dictionary facts,
so an agent that hallucinates a plausible answer without opening the page can still
be marked success. The environment may never actually be exercised.
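
A deterministic check could close this gap. The sketch below assumes each line of tasks.jsonl gains two hypothetical fields, `answer` (the stored ground truth) and `must_visit` (a URL fragment the trajectory has to contain); neither field exists in the PR, and the `/dictionary/` path used in the usage note is likewise an assumption.

```python
import json
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify(task_line: str, agent_answer: str, visited_urls: list[str]) -> bool:
    """Pass only if the stored answer appears AND the required page was opened."""
    task = json.loads(task_line)
    answered = normalize(task["answer"]) in normalize(agent_answer)
    navigated = any(task["must_visit"] in url for url in visited_urls)
    return answered and navigated
```

With a task line such as `{"id": 3, "answer": "1909", "must_visit": "/dictionary/"}`, an agent that states the right year without ever opening a dictionary page would fail, which is the behaviour an LLM-only judge cannot guarantee.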

3. Several tasks have no stable, verifiable answer (not just #18)

A cluster of tasks can't be graded against a fixed ground truth in autonomous
evaluation:

| Task | Problem | Severity |
| --- | --- | --- |
| #18 "remove a word … ask me which one to remove if it's unclear" | Presupposes a human to ask; in autonomous eval there is none, so the agent guesses and "success" is undefined. | must fix |
| #9 "tell me today's featured word and its POS" | todays_wotd() rotates by date.today(), so the answer changes by run date; the page also shows a 2025 feature_date under a 2026 footer. | must fix |
| #11 "answer all questions and tell me your final score" | Score depends on the agent's choices, so there is no fixed answer (and it is broken by the quiz bug above). | should fix |
| #12 "report how many you got correct out of the total" | Same as #11: non-deterministic. | should fix |
| #16 "username of your choice, any email" | Free input, no fixed answer, but the goal ("verify you are logged in") is verifiable, so this one is acceptable. | borderline / OK |

Fixes: name the word to remove in #18 (e.g. "remove curiosity"); anchor #9 to a
fixed date or a specific dated word; for #11/#12 grade on "completed the quiz and
correctly reported the on-screen score" (and fix the quiz bug so the score page is
trustworthy).

Note: first-person phrasing elsewhere ("tell me…", "your word list", "your account"
in #0/#6/#8/#10/#14/#15) is just normal instruction phrasing — the agent reports to
the user and "your list" = the logged-in account's list. Those are fine. #13 ("how
many quizzes + difficulty of each") is deterministic (3 quizzes: easy/medium/medium).

4. Quiz tasks (#11/#12) have no deterministic answer + are mechanically trivial

"answer all questions and tell me your final score" — any score is valid, and
selecting all-first-choice without reading still "completes" it. These do resist the
knowledge shortcut (you must navigate the quiz and read the score off the page), which
is good, but they test only the mechanical flow. The quiz itself is a plain
radio-button form; consider richer, more interactive game formats (the real MW has
Wordle-style / Drop-a-Letter games) to make these meaningfully harder.

Suggested fixes

  • Re-anchor tasks on MW-specific, on-page facts that resist prior knowledge:
    exact first-known-use years, MW respelling pronunciations, the exact wording of a
    numbered sense, the WOTD "Did You Know?" text, a specific example sentence, quiz
    question wording — things the agent must read off the page.
  • Broaden the catalog so off-script lookups don't dead-end, and so tasks have real
    distractors.
  • Fix #18 (name the word). Consider an answer key or stricter judge rubric so
    knowledge-shortcut answers fail.
  • WOTD "today" is run-date dependent (todays_wotd() rotates by
    date.today().toordinal()), and the page renders a 2025 feature_date while the
    footer says 2026 — task #9 ("today's featured word") has no stable answer. Anchor it
    to a fixed date or target a specific dated word (#10 already does — good).
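
A minimal sketch of the anchoring idea, assuming the rotation is an index into an ordered word list: only the function name `todays_wotd`, the `toordinal()` rotation, and the `date(2025, 1, 15)` anchor come from the review, while the signature and the injectable `today` parameter are assumptions.

```python
from datetime import date
from typing import Optional, Sequence

WOTD_ANCHOR = date(2025, 1, 15)  # same fixed anchor the seed already uses


def todays_wotd(words: Sequence[str], today: Optional[date] = None) -> str:
    """Pick the featured word by ordinal rotation from an injectable date.

    Defaulting to WOTD_ANCHOR rather than date.today() means every benchmark
    run sees the same "today", so task #9 has one stable answer.
    """
    day = today if today is not None else WOTD_ANCHOR
    return words[day.toordinal() % len(words)]
```

Passing an explicit `today` still allows the rotation to be exercised in tests without touching the wall clock.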

Assets PR: confirm before merge

.assets-revision pins repo: YuanDaozeiii/WebHarbor @ a591b293…, a personal HF
fork, not the canonical ChilleD/WebHarbor. Per the two-repo workflow the assets
should be merged into ChilleD and the pin updated to that merge SHA, otherwise
fetch_assets.sh breaks if the fork is deleted/rewritten. Please land on ChilleD
and re-pin (or confirm a ChilleD HF PR is open).


Summary

| Dimension | Result |
| --- | --- |
| Mechanical (build / 200 / byte-identical reset / reset-all) | ✅ PASS, solid |
| Visual (clear + content-correct on solve path) | ✅ Acceptable; homepage/logo cosmetics could improve |
| Functional (auth / search / save / remove / quiz) | ⚠️ Mostly PASS; quiz accepts partial/empty submit and shows unanswered questions as correct |
| Task quality | ⚠️ Changes requested: knowledge-shortcut tasks, no answer key, #18 human-in-loop, thin catalog |
| Assets pin | ⚠️ Confirm: points at a personal fork, not ChilleD |

Bottom line: strong engineering, but the task set needs a redesign pass so it
actually tests web navigation rather than LLM recall. Recommend changes before merge.

Evidence

Screenshots were captured during review (mirror pages vs. the live site,
side by side) — available on request.

Reproduce

gh pr checkout 44
./scripts/fetch_assets.sh
./scripts/build.sh webharbor:dev
docker run -d --rm --name wh-review -p 8201:8101 -p 41000-41015:40000-40015 webharbor:dev
for p in $(seq 41000 41015); do curl -so /dev/null -w "$p:%{http_code}\n" http://localhost:$p/; done
curl -X POST http://localhost:8201/reset/merriam_webster
docker exec wh-review md5sum \
  /opt/WebSyn/merriam_webster/instance/merriam_webster.db \
  /opt/WebSyn/merriam_webster/instance_seed/merriam_webster.db

Raibows (Contributor) commented Jun 24, 2026

Hi @YuanDaoze, thank you for your amazing contribution!

Hi @DEM1TASSE, thank you for your thorough review. Every point has now been addressed in my updates.

I also added sections on deterministic verification, which the reviewer should contribute in the future. In this collaboration workflow, the Contributor proposes the tasks, and the Reviewer later proposes the verifiers independently.

…word seed)

HF PR ChilleD/WebHarbor#29 merged; .assets-revision now points at the
canonical dataset's main commit carrying the regenerated 156-word seed.
Verified: fetch from ChilleD main -> build -> byte-identical reset (md5
79ae0eab…) holds; runtime writes wiped on reset.