Block bots from crawling Mintlify static assets #45

Merged
divyasinghds merged 5 commits into develop from fix/robots-txt-block-static-assets-rebased
May 29, 2026

Conversation

@divyasinghds divyasinghds commented May 29, 2026

Re-opens #3 from a fresh branch off current main. The original branch was cut from an old "Initial commit" state, and rebasing it produced unrelated conflicts (favicon.svg, logo/light.svg, .mintignore, docs.json). The link-rot check on #3 was also failing on stale broken refs (getting-started/quick-setup, images/tracebloc-workflow-overview.png) that no longer exist on main.

Summary

Adds a custom robots.txt to block bots from crawling Mintlify static asset bundles.

Context

Clarity bot traffic data (Apr 10-16) shows ~200 bot requests/week hitting /mintlify-assets/_next/static/ JS/CSS chunks. These have zero SEO value and waste crawl budget.

Changes

New robots.txt file:

  • Blocks /mintlify-assets/ (JS/CSS/font bundles)
  • Keeps existing /cdn-cgi/ block
  • Preserves sitemap reference

Test plan

  • Verify docs.tracebloc.io/robots.txt returns updated rules after deploy
  • Confirm all docs pages are still crawlable
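
The test plan above can also be rehearsed offline before deploy. The sketch below is illustrative only: the robots.txt text is reconstructed from this PR's description (it is an assumption, not a copy of the committed file), the sample paths are hypothetical, and the check uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Assumed file contents, rebuilt from the PR description; the committed
# robots.txt may differ in ordering or comments.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /cdn-cgi/
Disallow: /mintlify-assets/

Sitemap: https://docs.tracebloc.io/sitemap.xml
"""

BASE_URL = "https://docs.tracebloc.io"


def load_rules(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_crawlable(parser: RobotFileParser, path: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `path` under the parsed rules."""
    return parser.can_fetch(agent, BASE_URL + path)


rules = load_rules(ASSUMED_ROBOTS_TXT)

# Hypothetical sample paths: one docs page and two blocked prefixes.
for path in (
    "/tools-help/tracebloc",
    "/mintlify-assets/_next/static/chunks/main.js",
    "/cdn-cgi/trace",
):
    print(path, "crawlable" if is_crawlable(rules, path) else "blocked")
```

To check the deployed file instead, the text fetched from docs.tracebloc.io/robots.txt can be passed to `load_rules` in place of the assumed string.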

Supersedes #3 (approved by @saadqbal x2).

🤖 Generated with Claude Code


Note

Low Risk
Docs-only crawl policy; no application code, auth, or data handling changes.

Overview
Adds a custom robots.txt for the docs site so crawlers skip low-value static paths while documentation pages remain open to User-agent: *.

New disallow rules: /cdn-cgi/ (unchanged intent from prior setup) and /mintlify-assets/ (Next/Mintlify JS, CSS, and font bundles that were drawing bot traffic without SEO benefit).

Sitemap still points at https://docs.tracebloc.io/sitemap.xml.

Reviewed by Cursor Bugbot for commit 23dd9aa. Bugbot is set up for automated code reviews on this repo. Configure here.

saadqbal and others added 5 commits May 11, 2026 14:41
Prod: enhance documentation and CI for training workflows and templates
* docs: consolidate Docker build into single multi-arch command (#35)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: remove open-source client claim from how-training-works (#36)

* docs: remove open-source client claim from how-training-works

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: say "contact us" instead of "open a support ticket"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Asad Iqbal (Saadi) <asad.dsoft@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: add automated upstream sync workflow (#27)

* docs: add automated upstream sync workflow

Adds a Claude-powered workflow that syncs docs pages with upstream README
changes from five source repos (tracebloc-py-package, client, start-training,
data-ingestors, model-zoo). Source repos fire repository_dispatch on push;
this repo's workflow fetches the upstream file, has Claude rewrite the target
.mdx in docs voice, and opens a PR.

- .github/sync-sources.yml: mapping of upstream files to docs pages
- .github/workflows/sync-docs.yml: dispatch + manual + cron-driven sync job
- .github/notify-docs.workflow-template.yml: template for source repos

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address bugbot issues in sync workflow

- Pass ANTHROPIC_API_KEY as anthropic_api_key input to claude-code-action
  instead of env var (action reads via core.getInput, not env).
- Move sync cache from .sync-cache/ to /tmp/sync-cache/ so untracked
  cache files are not picked up by create-pull-request.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: address remaining bugbot issues in sync workflow

- Add concurrency group so overlapping cron/dispatch/manual runs
  serialize instead of racing on the docs/sync-upstream branch
  (would otherwise fail with "failed to push some refs" and drop
  changes from the losing run).
- Pin yq to v4.44.3 instead of latest for deterministic builds.
- Restrict create-pull-request add-paths to **/*.mdx so stray edits
  outside docs pages cannot be staged into the sync PR.
- Note in the notify template that branches may need adjusting for
  repos using master (e.g. data-ingestors).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: accumulate sync runs onto existing PR branch

Previously each run checked out the default branch fresh and force-pushed
only the dispatched source's diff to docs/sync-upstream, silently
overwriting any earlier dispatched sources' pending changes.

Now the workflow:
- Checks if docs/sync-upstream exists on the remote; if so, checks it
  out so prior accumulated changes are part of the working tree.
- Resolves the default branch dynamically and passes it to peter-evans
  as the explicit base so the PR continues targeting the right branch
  even after we switched off it.

Result: sequential dispatches for different sources combine into one
PR instead of clobbering each other.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: read sync-sources.yml from base branch, not stale sync branch

After the previous fix switched the working tree to docs/sync-upstream
to accumulate changes, all subsequent reads of .github/sync-sources.yml
were coming from the (potentially stale) sync branch instead of the
base branch. If a new source were added or an instruction edited on
main while a sync PR was pending, the workflow would silently use the
outdated config.

Snapshot the mapping to /tmp/sync-sources.yml before any branch switch,
and point both the yq filter step and the Claude prompt at the snapshot.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com>

* docs: migrate join-use-case API examples to snake_case (tracebloc_package 0.7.0) (#14)

Sync hyperparameters and start-training pages with the legacy
tracebloc/documentation repo:

- Rename camelCase API methods to snake_case: upload_model,
  link_model_dataset, experiment_name, get_training_plan,
  learning_rate, loss_function, layers_freeze, early_stop_callback,
  reduce_lr_callback, model_checkpoint_callback,
  terminate_on_nan_callback, training_classes, data_type
- Rename trainingObject → training
- Update terminate-on-NaN description (any NaN loss)
- Use pip optional-extras syntax: tracebloc_package[pytorch|tensorflow|sklearn|all]

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: migrate SDK page to tracebloc 0.8.x (#39)

* docs: migrate SDK page to tracebloc 0.8.x (closes #38)

The SDK was renamed in tracebloc/tracebloc-py-package#135. `tracebloc==0.8.1`
is live on PyPI. Migrating Mintlify docs to the canonical name.

- Rename `tools-help/tracebloc-package.mdx` -> `tools-help/tracebloc.mdx`.
- Rewrite the page: `tracebloc` install + import, snake_case API
  (post-SDK.2), historical Note about the rename, link to redirect
  package on PyPI.
- Bump install pin to `>=0.8.0` (was `>=0.6.32`); add per-extra
  install options.
- `docs.json`:
  - Nav: `tools-help/tracebloc-package` -> `tools-help/tracebloc`.
  - Add `/tools-help/tracebloc-package` -> `/tools-help/tracebloc`
    redirect to preserve old inbound links.
  - Existing redirects pointing at `/tools-help/tracebloc-package`
    now point at `/tools-help/tracebloc`.
- Internal cross-links in faqs.mdx + key-terms.mdx -> new URL.
- `join-use-case/start-training.mdx` install snippet -> new name + pin.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: bump install pin to 0.8.1 (latest)

* docs: migrate SDK examples to snake_case API (post-SDK.2)

The 0.7.0 SDK.2 release renamed the public Python API to PEP 8 /
snake_case. The old camelCase forms still work via deprecation
aliases (with DeprecationWarning) but new examples should use the
canonical names.

Updates the three customer-facing pages that still showed the
camelCase API:

- `join-use-case/start-training.mdx` — the main walk-through.
- `join-use-case/hyperparameters.mdx` — the full reference table.
- `join-use-case/model-optimization.mdx` — pretrained-weights upload.

Method renames applied (per tracebloc-py-package/MIGRATION.md):
- `uploadModel` -> `upload_model` (+ `model_name=` kwarg)
- `linkModelDataset` -> `link_model_dataset` (+ `dataset_id=` kwarg)
- `getTrainingPlan` -> `get_training_plan`
- `experimentName` -> `experiment_name`
- `learningRate` -> `learning_rate`
- `lossFunction` -> `loss_function`
- `layersFreeze` -> `layers_freeze`
- `earlystopCallback` -> `early_stop_callback`
- `reducelrCallback` -> `reduce_lr_callback`
- `modelCheckpointCallback` -> `model_checkpoint_callback`
- `terminateOnNaNCallback` -> `terminate_on_nan_callback`
- `trainingClasses` -> `training_classes`
- `dataType` -> `data_type`

The `model_name` and `dataset_id` keyword names are no longer aliased
in 0.8.x — passing positional args still works, but the kwargs
`modelname=` / `datasetId=` raise TypeError, so the docs use the
explicit kwarg form everyone should adopt.

Also renames the local variable `trainingObject` -> `training_plan`
throughout, matching the canonical sample workflow in tracebloc's
project CLAUDE.md.
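
For readers unfamiliar with deprecation aliases, here is a minimal sketch of one common way to build them. It is not the SDK's actual implementation: `TrainingPlan`, `deprecated_alias`, and the method bodies are hypothetical stand-ins; only the method names come from the rename list above.

```python
import warnings


def deprecated_alias(new_name: str):
    """Build a method that warns, then forwards to the method `new_name`."""

    def alias(self, *args, **kwargs):
        warnings.warn(
            f"this name is deprecated; use {new_name}() instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not at alias()
        )
        return getattr(self, new_name)(*args, **kwargs)

    return alias


class TrainingPlan:  # hypothetical stand-in, not the real tracebloc class
    def __init__(self) -> None:
        self.name = None
        self.model = None

    def experiment_name(self, name: str) -> None:
        self.name = name

    def upload_model(self, model_name: str) -> None:
        self.model = model_name

    # Old camelCase spellings keep working but emit DeprecationWarning.
    experimentName = deprecated_alias("experiment_name")
    uploadModel = deprecated_alias("upload_model")
```

In this sketch `plan.uploadModel("resnet")` still works positionally, while `plan.uploadModel(modelname="resnet")` raises `TypeError` because the forwarded keyword no longer matches the `model_name` parameter. That mirrors the behavior described above, although the real package may reach it differently.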

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(ci): update sync-sources.yml dest after file rename (bugbot)

`tools-help/tracebloc-package.mdx` was renamed to `tools-help/tracebloc.mdx`
earlier in this PR, but the daily `sync-docs.yml` cron reads
`.github/sync-sources.yml` and would have either recreated the old
orphan path or failed outright — silently preventing upstream README
edits from reaching the new page.

Repointing the dest at `tools-help/tracebloc.mdx` keeps the upstream
README -> docs page sync working. The mapping `id` stays
`tracebloc-package` (it's a slug used for dispatch; changing it would
need a coordinated edit in the upstream notify workflow, which doesn't
exist yet — scope creep here).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix(ci): correct upstream refs in sync-sources.yml

Three entries pointed at `main` branches that do not exist in the
upstream repos, which would cause the sync fetch step to 404:
- tracebloc-py-package → develop (default; `main` does not exist; per
  the SDK repo's CLAUDE.md, develop is the canonical source of truth)
- data-ingestors → master (default branch)
- model-zoo → master (default branch)

Verified against the GitHub API for each repo. The `Readme.md` casing
flagged by bugbot is correct as-is: data-ingestors actually ships
`Readme.md` (mixed case), so the bugbot suggestion would have broken
the fetch — left unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(ci): pass source_id to yq via strenv() to avoid query injection

Bugbot flagged the previous `yq ".sources[] | select(.id == \"$target\")"`
pattern as shell-injectable. The specific RCE described doesn't
actually trigger — `DISPATCH_ID` / `INPUT_ID` are routed through
`env:` (Actions best practice) and bash does not re-tokenize
variable values inside double quotes, so `$()`, backticks, and `;`
in the value remain literal.

However, a `"` in the value would still terminate the yq string
literal at the yq parser level and could yield a malformed query or
unintended filter. Routing the value through `strenv(TARGET)` keeps
it entirely out of the yq expression syntax — defense in depth at
zero cost.
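
The difference between the two patterns can be illustrated without running yq. The sketch below is an assumption-laden illustration, not the workflow's code: the function names are hypothetical, and it only builds the expression text and environment that a yq v4 call would receive.

```python
import os


def interpolated_query(target: str) -> str:
    """Old pattern: the value is spliced into the yq expression text."""
    return f'.sources[] | select(.id == "{target}")'


# New pattern: the expression is a constant; yq reads the value from the
# TARGET environment variable through its strenv() operator.
STRENV_QUERY = ".sources[] | select(.id == strenv(TARGET))"


def strenv_invocation(target: str, config_path: str):
    """Return (argv, env) for a yq call that never embeds `target` in the query."""
    env = {**os.environ, "TARGET": target}
    argv = ["yq", STRENV_QUERY, config_path]
    return argv, env


# A value containing a double quote changes the interpolated expression...
tricky = 'a" or .id == "b'
print(interpolated_query(tricky))

# ...but leaves the strenv-based expression untouched.
argv, env = strenv_invocation(tricky, "/tmp/sync-sources.yml")
print(argv[1])
```

With yq v4 installed, `subprocess.run(argv, env=env, capture_output=True, text=True)` would execute the second form; the value then reaches yq only as data.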

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Asad Iqbal (Saadi) <asad.dsoft@gmail.com>
docs: clarify setup guide deploys single-node workspace
…43) (#44)

Two changes to the prepare-data and setup-guide pages driven by user
feedback after a fresh end-to-end setup:

- prepare-dataset.mdx: lead with the declarative YAML method (helm
  install tracebloc/ingestor --set-file ingestConfig=./ingest.yaml).
  The existing Python-template + Docker + kubectl flow stays as the
  advanced path for users who need custom processors. Calls out that
  ingest.yaml fields vary per category and points at the per-category
  examples in the data-ingestors repo.
- setup-guide.mdx: add a Note after the curl one-liner pointing at
  the helm upgrade command (--reset-then-reuse-values, --version) so
  users know how to upgrade an installer-deployed client without
  losing applied values.

Co-authored-by: Asad Iqbal (Saadi) <asad.dsoft@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-applied on top of current main (original branch fix/robots-txt-block-static-assets
was cut from an old initial-commit state and rebasing produced unrelated conflicts
in favicon/logo/.mintignore/docs.json).

Clarity data shows bots (Apple, OpenAI, Google) spending ~200 requests/week on
/mintlify-assets/_next/static/ JS/CSS chunks. These have zero SEO value.

Adds custom robots.txt that blocks /mintlify-assets/ while keeping the existing
/cdn-cgi/ block and sitemap reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mintlify Bot commented May 29, 2026

Preview deployment for your docs. Learn more about Mintlify Previews.

| Project | Status | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| tracebloc | 🟢 Ready | View Preview | May 29, 2026, 11:07 AM |

💡 Tip: Enable Workflows to automatically generate PRs for you.

@LukasWodka

👋 Heads-up — Code review queue is at 9 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@divyasinghds divyasinghds changed the base branch from main to develop May 29, 2026 11:11
@divyasinghds divyasinghds requested a review from aptracebloc May 29, 2026 11:11
@divyasinghds divyasinghds merged commit 7a432f3 into develop May 29, 2026
7 of 9 checks passed
@divyasinghds divyasinghds deleted the fix/robots-txt-block-static-assets-rebased branch May 29, 2026 11:12

3 participants