tool: sync WRB soil descriptions from the mobile i18n into the DB #366

Merged: johannesparty merged 2 commits into main from feat/wrb-descriptions-sync on Jun 11, 2026

@johannesparty

Contributor

The mobile client keeps hand-cleaned, multi-language copies of the WRB soil Description/Management narratives in its i18n files (soil.match_info.<key>.{description,management}). The backend reads the same narratives from wrb_fao90_desc, where — it turns out — the non-English columns (fr, ks) are largely English placeholders, es is partial, and the text carries literal <br> tags.

scripts/wrb_descriptions_sync.py:

  • default: read-only HTML word-diff (mobile JSON vs DB), with <br>/whitespace normalized out so only genuine wording/translation differences show; introspects which language columns exist.
  • --write: replaces the DB values from the JSON source of truth in one transaction — widens the varchar(2000) columns to text, adds columns for new languages (ka/uk), rewrites every in-both soil, and emits the applied SQL for review.

Stdlib (difflib/html) + psycopg only. Run where the soil-id DB is reachable (e.g. inside the backend container).
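
For illustration, the read-only comparison could look roughly like the sketch below. It is not the code in scripts/wrb_descriptions_sync.py: the function names `normalize` and `word_diff_html` and the `<del>`/`<ins>` output markup are assumptions made for this example. It only shows how `<br>` tags and whitespace can be normalized away before a word-level diff built on difflib and html.

```python
import difflib
import html
import re

# Matches <br>, <br/>, <br /> in any letter case.
_BR = re.compile(r"<br\s*/?>", re.IGNORECASE)


def normalize(text: str) -> str:
    """Replace <br> tags with spaces, then collapse all whitespace runs."""
    return " ".join(_BR.sub(" ", text).split())


def word_diff_html(db_text: str, json_text: str) -> str:
    """Return an HTML fragment marking words removed from / added to the DB text.

    Hypothetical helper: both inputs are normalized first, so formatting-only
    differences produce no <del>/<ins> markup.
    """
    old = normalize(db_text).split()
    new = normalize(json_text).split()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        removed = html.escape(" ".join(old[i1:i2]))
        added = html.escape(" ".join(new[j1:j2]))
        if tag == "equal":
            parts.append(removed)
        if tag in ("delete", "replace"):
            parts.append(f"<del>{removed}</del>")
        if tag in ("insert", "replace"):
            parts.append(f"<ins>{added}</ins>")
    return " ".join(parts)
```

With this shape, two texts that differ only by `<br>` or spacing come back with no markup at all, so only genuine wording differences stand out in the report.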

After --write, regenerate and redistribute the soil-id-db dump (make dump_soil_id_db → make build_docker_image → push). gypsisols is in the app i18n but has no DB row (skipped); the sw → ks mapping is the existing Kiswahili column.

🤖 Generated with Claude Code

@johannesparty johannesparty changed the title Tool: sync WRB soil descriptions from the mobile i18n into the DB tool: sync WRB soil descriptions from the mobile i18n into the DB Jun 10, 2026
@johannesparty johannesparty force-pushed the feat/wrb-descriptions-sync branch 2 times, most recently from d9e5886 to c5b1497 Compare June 11, 2026 08:29
…90_desc

The mobile client keeps hand-cleaned, multi-language copies of the WRB soil
Description/Management narratives in its i18n files; the backend reads the same
narratives from the wrb_fao90_desc table, where the non-English columns are
largely English placeholders and the text carries literal <br> tags. The table
also stores a translated name per language (wrb_tax_<lang>) and is keyed by
wrb_tax (= the English name, which the soil-ID algorithm matches against).

scripts/wrb_descriptions_sync.py compares the two (default: a read-only HTML
word-diff with <br>/whitespace normalized out) and, with --write, rebuilds the
table from the JSON source of truth in one transaction: widens the varchar(2000)
description/management columns to text, adds columns for new languages, deletes
all rows, and reinserts one row per JSON soil (name + description + management
per language). A guard refuses to rebuild if the JSON does not cover every soil
the DB or algorithm (HWSD2 fao90_name) still needs.
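
A minimal sketch of the coverage guard described above, under stated assumptions: the names `missing_soils` and `guard_rebuild` are hypothetical, and comparing names case-insensitively is a guess about how JSON keys relate to wrb_tax values, not something this commit message confirms.

```python
def missing_soils(json_soils, db_soils, algorithm_soils):
    """Return soils the DB or the algorithm needs that the JSON does not cover.

    Assumption: names are compared after trimming and lower-casing.
    """
    def fold(names):
        return {name.strip().lower() for name in names}

    needed = fold(db_soils) | fold(algorithm_soils)
    return sorted(needed - fold(json_soils))


def guard_rebuild(json_soils, db_soils, algorithm_soils):
    """Raise before any destructive step if the JSON would drop a needed soil."""
    missing = missing_soils(json_soils, db_soils, algorithm_soils)
    if missing:
        raise ValueError(
            "refusing to rebuild; JSON is missing: " + ", ".join(missing)
        )
```

Running a check like this before the delete-and-reinsert step means an incomplete JSON source aborts the run instead of silently removing rows the algorithm still matches against.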

Languages are auto-discovered from the translation files; the only column-suffix
mismatch (sw -> ks) lives in DB_SUFFIX_EXCEPTIONS. scripts/README.md documents
the tool and a runbook for the planned ks -> sw rename that would remove even
that exception.
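
As a rough illustration of the discovery and suffix mapping described in this commit message: the sketch below assumes one translation file per language named `<lang>.json`, which is a guess about the mobile repo layout, and `discover_languages` and `db_columns` are hypothetical names. Only DB_SUFFIX_EXCEPTIONS and the sw -> ks pair come from the text above.

```python
from pathlib import PurePath

# The single mismatch named in the commit message: Swahili (sw) is stored
# in columns that carry the ks suffix.
DB_SUFFIX_EXCEPTIONS = {"sw": "ks"}


def discover_languages(filenames):
    """Derive language codes from translation file names (assumed <lang>.json)."""
    return sorted(
        {PurePath(name).stem for name in filenames if PurePath(name).suffix == ".json"}
    )


def db_columns(lang):
    """Return the (name, description, management) column names for a language."""
    suffix = DB_SUFFIX_EXCEPTIONS.get(lang, lang)
    return (f"wrb_tax_{suffix}", f"description_{suffix}", f"management_{suffix}")
```

Keeping the exception in one small map means a later ks -> sw rename only has to delete that entry.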

Stdlib (difflib/html) + psycopg only; run where the soil-id DB is reachable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@johannesparty johannesparty force-pushed the feat/wrb-descriptions-sync branch from c5b1497 to 7b5b02a Compare June 11, 2026 17:39
@johannesparty johannesparty requested a review from knipec June 11, 2026 17:46

@knipec knipec left a comment

Contributor

Mostly going off our discussion, not the code -- but sounds good!

Comment thread scripts/README.md
Comment thread scripts/README.md Outdated
Comment thread scripts/README.md Outdated
The backend only ever surfaces the English description/management, so the
soil-id library now fetches just that: get_WRB_descriptions / getSG_descriptions
select WRB_tax, Description_en, Management_en, and global_soil builds an
English-only siteDescription. The other languages stay in the table (kept current
by wrb_descriptions_sync, staged for a possible future multilingual API) but are
no longer read.
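
To make the English-only read path concrete, here is a sketch that uses sqlite3 as a stand-in for the real Postgres database. It is not the library's get_WRB_descriptions: the function names, the `?` parameter style, the demo schema, and the placeholder row text are all assumptions for illustration. Only the table name and the three selected columns come from the text above.

```python
import sqlite3

# Selects only the English columns; other language columns are never read.
ENGLISH_ONLY_QUERY = """
    SELECT WRB_tax, Description_en, Management_en
    FROM wrb_fao90_desc
    WHERE WRB_tax = ?
"""


def fetch_english_description(conn, wrb_tax):
    """Return the English description/management for one soil, or None."""
    row = conn.execute(ENGLISH_ONLY_QUERY, (wrb_tax,)).fetchone()
    if row is None:
        return None
    return {"WRB_tax": row[0], "Description_en": row[1], "Management_en": row[2]}


def demo_connection():
    """Build an in-memory table with placeholder text (not real soil data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wrb_fao90_desc ("
        "WRB_tax TEXT PRIMARY KEY, Description_en TEXT, "
        "Management_en TEXT, Description_sw TEXT)"
    )
    conn.execute(
        "INSERT INTO wrb_fao90_desc VALUES (?, ?, ?, ?)",
        ("Acrisols", "example description", "example management", None),
    )
    return conn
```

Because the query never names a non-English column, those columns can be renamed, left NULL, or dropped later without affecting this read.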

With the read path decoupled from the non-English columns, the sync drops the
sw->ks exception entirely (the DB_SUFFIX_EXCEPTIONS map is gone) and writes Swahili
to the correctly ISO-coded description_sw; the obsolete description_ks columns are
left in place (now nullable, all NULL) for a window-free DROP later -- see
scripts/README.md.

Global snapshots regenerated to the English-only siteDescription. The English
text still carries <br> here because it matches the current soil-id-db image;
it becomes clean once that image is rebuilt from the synced data.

(Includes a ruff-pre-commit version bump applied by the repo's update hook.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@johannesparty johannesparty force-pushed the feat/wrb-descriptions-sync branch from 06cff74 to 6e7ee55 Compare June 11, 2026 22:19
@johannesparty johannesparty merged commit 0b82d00 into main Jun 11, 2026
4 checks passed
@johannesparty johannesparty deleted the feat/wrb-descriptions-sync branch June 11, 2026 22:38
johannesparty added a commit to techmatters/terraso-backend that referenced this pull request Jun 11, 2026
…ath)

Pins soil-id to the tagged release where the WRB description read path is
English-only (techmatters/soil-id-algorithm#366): the library fetches only
Description_en/Management_en. soil-id's install_requires is unchanged from the
prior pin, so only the soil-id URL + archive hash move; the rest of the lock is
untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
johannesparty added a commit to techmatters/terraso-backend that referenced this pull request Jun 12, 2026
…2053)

* feat: expose global soil description & management via API and exports

Global (WRB/HWSD) soil matches carry description and management guidance as a
multilingual dict in the soil-id output, but resolve_soil_info only forwarded a
plain string (the US brief_narrative) and nulled the dict — so global matches
returned a null description and no management text. Now:

- resolve_soil_info returns the English description for both US (string) and
  global (Description_en) matches, plus a new management field from the global
  Management_en (None for US).
- normalize_soil_description() strips the <br> tags the global DB stores and
  collapses whitespace, matching the clients' hand-cleaned copies.
- SoilSeries gains a management field; fullDescriptionUrl is annotated US only
  (it is the SoilWeb series URL, absent for global matches).
- The export soil-id query requests management; CSV export gains Selected soil
  management / Top soil match management columns beside the description ones
  (JSON carries it automatically via passthrough). CSV snapshot fixtures
  regenerated with the new columns.
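
The US/global split in the first bullet can be sketched as follows. This is an illustration only, not resolve_soil_info itself: `english_description_and_management` is a hypothetical name, and the assumption that a global match carries one dict holding both Description_en and Management_en is inferred from the wording above rather than confirmed.

```python
def english_description_and_management(site_description):
    """Return (description, management) in English for a soil match.

    Assumed input shapes:
      - US match: a plain string (the brief narrative), no management text.
      - Global match: a dict with Description_en and Management_en keys.
    """
    if isinstance(site_description, dict):
        return (
            site_description.get("Description_en"),
            site_description.get("Management_en"),
        )
    # US matches (or a missing description) carry no management guidance.
    return site_description, None
```

Under these assumptions a global match no longer collapses to a null description, and US matches keep returning their string unchanged with management set to None.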

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor: drop <br> normalization, return raw English description/management

normalize_soil_description stripped <br> tags and collapsed whitespace from the
WRB descriptions, because the soil-id DB stored them with literal <br>. We're now
replacing those DB strings with the clean mobile-i18n copies (no <br>) via the
wrb_descriptions_sync tool, so the normalization is redundant — resolve_soil_info
returns Description_en / Management_en (and the US brief_narrative) verbatim.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* build: repin soil-id to 2026-06-11.1 (English-only description read path)

Pins soil-id to the tagged release where the WRB description read path is
English-only (techmatters/soil-id-algorithm#366): the library fetches only
Description_en/Management_en. soil-id's install_requires is unchanged from the
prior pin, so only the soil-id URL + archive hash move; the rest of the lock is
untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>