dcarnicer/sf-documentation-watcher

SF Documentation Watcher

Monitors the Salesforce Agentforce Life Sciences admin guide for content changes and sends Telegram notifications when documentation is updated.

How it works

  1. TOC extraction — fetches the main guide page and extracts all article URLs from the table of contents using Puppeteer + browserless/chrome (required because the site is fully JavaScript-rendered via Salesforce Experience Cloud / LWC).
  2. New section detection — compares the current TOC against the previous run and notifies if new articles appear.
  3. Content change detection — fetches each article, extracts visible text, and computes a SHA-256 hash. If the hash differs from the stored one, the content has changed.
  4. Git-based diffing — full article text is saved as .txt files in a content/ git repository. On the first run a baseline commit is made. On subsequent runs, any changes produce a new commit so you can review exactly what changed with git diff or git log -p.
  5. Telegram notifications — you receive a message when:
    • A new section appears in the guide
    • One or more articles change (with a link to the content repo)
    • A critical error occurs (browser unreachable, TOC unavailable)
    • More than 20% of pages fail in a single run
  6. Optional push — if CONTENT_REMOTE is set, the content repo is pushed to GitHub automatically after every commit.
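
Steps 2, 3 and 5 above can be sketched in a few lines. This is a minimal illustration, not the code in sfdc-watcher.mjs: the function names, the array-of-URLs TOC shape, and the way stored hashes are passed in are all assumptions made for the example.

```javascript
import { createHash } from "node:crypto";

// Step 3: SHA-256 of an article's visible text, as a hex string.
function hashText(text) {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Step 2: URLs present in the current TOC but not in the previous one.
// Both arguments are assumed to be arrays of article URLs.
function findNewArticles(previousToc, currentToc) {
  const seen = new Set(previousToc);
  return currentToc.filter((url) => !seen.has(url));
}

// Step 3: an article counts as changed only when a stored hash exists and differs.
// With no stored hash (first run) the text is treated as baseline, not as a change.
function hasChanged(storedHash, text) {
  return storedHash !== undefined && storedHash !== hashText(text);
}

// Step 5: true when more than 20% of pages failed in a single run.
function failureRateExceeded(failed, total, threshold = 0.2) {
  return total > 0 && failed / total > threshold;
}
```

Comparing hashes keeps the stored state small, while the full text kept in the content/ repository is what makes the later git diff readable.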

How scheduling works

The watcher runs inside a long-lived Docker container. A cron job inside the container triggers the script on the configured schedule (default: 2:00 AM daily). Nothing needs to be configured on the host machine; just keep the container running with docker compose up -d.

To change the schedule, set CRON_SCHEDULE in your .env using standard cron syntax:

CRON_SCHEDULE=0 2 * * *   # 2:00 AM every day (default)
CRON_SCHEDULE=0 8 * * 1   # 8:00 AM every Monday
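
A malformed schedule is easy to miss until the job silently fails to run. The check below is a hypothetical helper, not part of the repository, that rejects anything other than a five-field numeric cron expression. It deliberately ignores month and weekday names such as MON, which cron itself accepts.

```javascript
// Field order in standard cron syntax: minute hour day-of-month month day-of-week.
function isFiveFieldCron(expr) {
  const fields = expr.trim().split(/\s+/);
  // Accept only digits, "*", "/", "," and "-" in each field (an assumption of this sketch).
  return fields.length === 5 && fields.every((field) => /^[\d*\/,-]+$/.test(field));
}
```

For example, isFiveFieldCron("0 2 * * *") is true, while a four-field value such as "0 2 * *" is rejected.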

Quick start (Docker)

Requirements

  • Docker and Docker Compose
  • A Telegram bot and chat ID (see below)

1. Clone the repo

git clone git@github.com:dcarnicer/sf-documentation-watcher.git
cd sf-documentation-watcher

2. Create a Telegram bot

  1. Open Telegram and search for @BotFather
  2. Send /newbot and follow the prompts — copy the token you receive
  3. Send any message to your new bot (so it has a chat to respond to)
  4. Run the helper script to get your chat_id:
    TELEGRAM_TOKEN=123456:ABC... node get-chat-id.mjs
    It prints your chat_id. Copy it for the next step.
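
For context, a chat ID can be read from the Telegram Bot API getUpdates method, which returns the messages recently sent to the bot. The sketch below assumes that approach; it is not necessarily how get-chat-id.mjs is implemented, and the function names are made up for the example. It only looks at plain messages and ignores other update types.

```javascript
// Extract the unique chat IDs from a parsed getUpdates response body.
function extractChatIds(body) {
  if (!body || body.ok !== true || !Array.isArray(body.result)) return [];
  const ids = body.result
    .map((update) => update.message?.chat?.id)
    .filter((id) => id !== undefined);
  return [...new Set(ids)];
}

// Ask the Bot API for pending updates (uses the global fetch in Node 18+).
async function fetchChatIds(token) {
  const res = await fetch(`https://api.telegram.org/bot${token}/getUpdates`);
  return extractChatIds(await res.json());
}
```

An empty result usually means the bot has not received a message yet, which is why step 3 asks you to send one first.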

3. Create the .env file

cp .env.example .env

Edit .env with your values:

TELEGRAM_TOKEN=123456:ABC-your-token-here
TELEGRAM_CHAT_ID=123456789

4. (Optional) Track content changes in your own GitHub repo

If you want the downloaded article snapshots pushed to GitHub so you can browse diffs online:

  1. Create a new empty repository on GitHub (no README, no .gitignore)
  2. Add its SSH URL to .env:
    CONTENT_REMOTE=git@github.com:your-username/your-content-repo.git
  3. Make sure the machine running the watcher has an SSH key added to your GitHub account. Inside Docker, mount your SSH key by adding this to the watcher service in docker-compose.yml:
    volumes:
      - watcher-data:/data
      - ~/.ssh:/root/.ssh:ro

The first run pushes a baseline commit containing every article (192 at the time of writing). Subsequent runs push only when content changes.

5. Start

docker compose up -d

Docker pulls the images, builds the watcher container, and starts the cron schedule. The first run creates the baseline, and no Telegram notification is sent. From the second run onwards, any change triggers a notification.

6. Check logs

All commands below run on the host machine — you do not need to enter the container.

# Live logs from the watcher (updated on each cron run)
docker compose logs -f watcher

# Run manually without waiting for the cron schedule
docker exec sfdc-watcher node /app/sfdc-watcher.mjs

7. Inspect content and diffs

# See which articles changed in the last commit
docker exec sfdc-watcher git -C /data/content log -1 --name-only

# See exactly what changed (full diff)
docker exec sfdc-watcher git -C /data/content log -p -1

# Open a shell inside the container if you need to explore further
docker exec -it sfdc-watcher bash

8. Stop

docker compose down

Data in the watcher-data volume (content repo, state, logs) is preserved across restarts. To delete it as well:

docker compose down -v

Moving to another machine

  1. Install Docker
  2. Clone the repo:
    git clone git@github.com:dcarnicer/sf-documentation-watcher.git
    cd sf-documentation-watcher
  3. Create .env with your credentials (same as step 3 above)
  4. Start:
    docker compose up -d

The first run rebuilds the baseline (no notification). From the second run onwards, changes trigger Telegram notifications.


Terminal UI

When run interactively, the watcher shows a live interface that updates in place:

────────────────────────────────────────────────────────
  SF Documentation Watcher  ·  2026-04-01 02:00:00
────────────────────────────────────────────────────────

  ◆  TOC: 192 pages found

  ████████░░░░░░░░░░░░░░░░░░░░░░  52/192  27%  eta 213s
  ⟳  ind.lsc_customer_engagement_personas.htm
  ~  3 changed
  ✗  1 error(s)  ind.lsc_something.htm

────────────────────────────────────────────────────────
  ✓  192 pages  ·  3 changed  ·  1 error  ·  47m 12s
────────────────────────────────────────────────────────

  Failed pages:
    • ind.lsc_something.htm

When running via cron (no TTY), it automatically falls back to plain line-by-line output with no ANSI codes, suitable for log files.
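
The switch between the two output modes can be driven by the isTTY property of the output stream. The following is a rough sketch of that idea, assuming a 30-character bar like the one shown above. The function names and the plain-text line format are illustrative and are not taken from ui.mjs.

```javascript
// Build a fixed-width bar followed by the counters, e.g. "████░░░░  52/192  27%".
function renderBar(done, total, width = 30) {
  const ratio = total > 0 ? Math.min(done / total, 1) : 0;
  const filled = Math.round(ratio * width);
  const bar = "█".repeat(filled) + "░".repeat(width - filled);
  return `${bar}  ${done}/${total}  ${Math.round(ratio * 100)}%`;
}

// Redraw in place on a terminal; emit one plain line per update otherwise.
function reportProgress(done, total, stream = process.stdout) {
  if (stream.isTTY) {
    // "\r" returns to the start of the line and "\x1b[2K" erases it before redrawing.
    stream.write(`\r\x1b[2K${renderBar(done, total)}`);
  } else {
    stream.write(`progress ${done}/${total}\n`);
  }
}
```

Under cron, process.stdout is not a TTY, so only the plain branch runs and the log file stays free of escape sequences.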


Dependencies

Package         Version  Purpose
puppeteer-core  latest   Headless browser automation; connects to the browserless/chrome container
chalk           latest   Terminal colors and styling

Docker images:

Image               Purpose
browserless/chrome  Headless Chrome over WebSocket; required because Salesforce Help pages are fully JS-rendered

Performance

  • 15 second cooldown between pages
  • Reconnects to the browser every 20 pages to prevent connection drops
  • Each page times out after 60 seconds; failed pages are retried once
  • Full run takes ~50 minutes for 192 pages
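
The numbers above fit together: 192 pages at one 15 second cooldown each is 48 minutes before any page load time is counted. The sketch below shows one way the pacing rules could be expressed. The constant and function names are assumptions for illustration and do not come from sfdc-watcher.mjs.

```javascript
const COOLDOWN_MS = 15_000;
const RECONNECT_EVERY = 20;
const PAGE_TIMEOUT_MS = 60_000;

// Lower bound on run time in minutes: cooldowns only, ignoring page load time.
function minRunMinutes(pageCount, cooldownMs = COOLDOWN_MS) {
  return (pageCount * cooldownMs) / 60_000;
}

// True when the browser connection should be re-established before page `index` (0-based).
function shouldReconnect(index, every = RECONNECT_EVERY) {
  return index > 0 && index % every === 0;
}

// Reject if `task` takes longer than `ms`. The underlying work is not cancelled.
async function withTimeout(task, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  try {
    return await Promise.race([task(), timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Try once, then retry up to `retries` more times (one retry by default).
async function fetchWithRetry(task, retries = 1, timeoutMs = PAGE_TIMEOUT_MS) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await withTimeout(task, timeoutMs);
    } catch (err) {
      if (attempt >= retries) throw err;
    }
  }
}
```

Here task stands for any async function that loads one page, so the same wrapper works regardless of how the page is fetched.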

Roadmap

  • Content validation — before saving a page, check that the downloaded text is plausible: minimum length threshold, absence of error strings ("Sorry to interrupt", "Page not found", "Access denied", "CSS Error"), and basic sanity checks. Invalid pages would be skipped and retried on the next run rather than overwriting a good snapshot with a bad one. A standalone audit script could also scan the entire content/ repo and report suspicious files.

  • Multi-guide support — the watcher is currently hardcoded to the Agentforce Life Sciences admin guide. All Salesforce Help guides share the same URL pattern (help.salesforce.com/s/articleView?id=...) and page structure (LWC TOC, same DOM layout), so adding support for multiple guides via configuration would be straightforward. The main change would be accepting a list of source URLs in .env and namespacing the content files by guide.

  • Salesforce Developer Guides — developer documentation lives at developer.salesforce.com and uses a different site structure (static HTML, different navigation). Would need a separate fetcher and TOC extractor, but could share the same git-based diffing and Telegram notification logic.

  • AI-powered change summaries — instead of sending a raw list of changed files, use the Claude API to summarise the diff in plain language before the Telegram notification. The git diff is already available after each commit, so the flow would be: diff → Claude API → human-readable summary (e.g. "The installation prerequisites for package X have changed and a new step has been added to section Y") → Telegram. Optional feature, requires an Anthropic API key.

  • RAG-ready export — generate chunked, structured files from the article text suitable for ingestion into a vector database or knowledge base. Each chunk would include metadata (article ID, URL, section title, last updated) alongside the content, making it straightforward to build a retrieval-augmented generation pipeline on top of the documentation for AI agents to consume.

  • Google Drive integration — automatically upload the latest article snapshots (or the RAG-ready export) to a Google Drive folder after each run, so the documentation is accessible to other tools and team members without needing access to the git repo. Would use the Google Drive API with a service account for authentication.

  • Cascade failure detection — if a configurable number of consecutive errors is reached mid-run (e.g. 10 in a row), assume the browser or the site is temporarily unavailable, send a Telegram alert, sleep for a few hours, and then resume from where it left off rather than aborting the entire run. This would avoid losing a full night's run due to a transient issue.

  • Confluence integration (low priority) — publish the downloaded documentation to a Confluence space, keeping pages in sync with the Salesforce source. When changes are detected, the corresponding Confluence page would be updated automatically via the Confluence REST API.
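
As an illustration of the content validation item above, a plausibility check might look like the following. This is only a sketch of a planned feature: the 200-character minimum is a made-up threshold, and the function name and return shape are hypothetical. The error strings are the ones listed in the roadmap entry.

```javascript
const ERROR_MARKERS = ["Sorry to interrupt", "Page not found", "Access denied", "CSS Error"];
const MIN_LENGTH = 200; // hypothetical threshold, to be tuned against real articles

// Decide whether downloaded text is plausible enough to replace a stored snapshot.
function validateContent(text, minLength = MIN_LENGTH) {
  const trimmed = text.trim();
  if (trimmed.length < minLength) return { ok: false, reason: "too short" };
  const marker = ERROR_MARKERS.find((m) => trimmed.includes(m));
  if (marker) return { ok: false, reason: `contains "${marker}"` };
  return { ok: true, reason: null };
}
```

A page that fails the check would be skipped and retried on the next run, leaving the last good snapshot untouched.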


Files

File                Description
sfdc-watcher.mjs    Main watcher script
ui.mjs              Terminal UI: live progress bar, status and error tracking
get-chat-id.mjs     Helper to find your Telegram chat ID
Dockerfile          Watcher container image
docker-compose.yml  Orchestrates watcher + browserless
entrypoint.sh       Container startup: initialises git repo, sets up cron
.env.example        Credentials template
.env                Your credentials (never commit this file)

About

Scraper that watches Salesforce documentation and reports changes.
