A GitHub Action that evaluates Claude Code skills against YAML test cases with automated grading and PR reporting.
```yaml
- uses: skill-bench/skill-eval-action@v1
  with:
    skill-name: tf-guide
    skill-path: ./skills/tf-guide
    anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
```

Run skills in parallel - each skill gets its own job:
```yaml
name: Skill Eval
on:
  pull_request:
    paths:
      - 'skills/**'
permissions:
  contents: read
  pull-requests: write
jobs:
  eval:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    strategy:
      fail-fast: false
      matrix:
        skill:
          - tf-guide
          - k8s-operator-sdk
          - secure-gh-workflow
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
      - uses: skill-bench/skill-eval-action@v1
        with:
          skill-name: ${{ matrix.skill }}
          skill-path: skills/${{ matrix.skill }}
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
          pass-threshold: '80'
```

Automatically find and evaluate all skills that have `evals/` directories - no need to hardcode skill names:
```yaml
name: Skill Eval
on:
  pull_request:
    paths:
      - 'skills/**'
permissions:
  contents: read
  pull-requests: write
jobs:
  discover:
    runs-on: ubuntu-latest
    outputs:
      skills: ${{ steps.discover.outputs.skills }}
      count: ${{ steps.discover.outputs.count }}
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
        with:
          persist-credentials: false
          sparse-checkout: skills
      - name: Discover skills with evals
        id: discover
        run: |
          skills=$(find skills -name "*.yaml" -path "*/evals/*" -exec dirname {} \; | xargs -I{} dirname {} | xargs -I{} basename {} | sort -u | jq -R -s -c 'split("\n") | map(select(. != ""))')
          echo "skills=$skills" >> "$GITHUB_OUTPUT"
          echo "count=$(echo $skills | jq length)" >> "$GITHUB_OUTPUT"
      - name: Summary
        run: echo "Found ${{ steps.discover.outputs.count }} skills with evals"
  eval:
    needs: discover
    if: needs.discover.outputs.count > 0
    runs-on: ubuntu-latest
    timeout-minutes: 30
    strategy:
      fail-fast: false
      matrix:
        skill: ${{ fromJSON(needs.discover.outputs.skills) }}
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
      - uses: skill-bench/skill-eval-action@v1
        with:
          skill-name: ${{ matrix.skill }}
          skill-path: skills/${{ matrix.skill }}
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
          pass-threshold: '80'
```

Combine with dorny/paths-filter or `git diff` to evaluate only the skills that were modified in the PR:
```yaml
jobs:
  changed:
    runs-on: ubuntu-latest
    outputs:
      skills: ${{ steps.filter.outputs.skills }}
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
        with:
          persist-credentials: false
          fetch-depth: 0 # full history, so origin/main exists for the diff below
      - name: Find changed skills with evals
        id: filter
        run: |
          skills=$(git diff --name-only origin/main...HEAD -- 'skills/' | cut -d/ -f2 | sort -u | while read s; do
            [ -d "skills/$s/evals" ] && echo "$s"
          done | jq -R -s -c 'split("\n") | map(select(. != ""))')
          echo "skills=$skills" >> "$GITHUB_OUTPUT"
  eval:
    needs: changed
    if: needs.changed.outputs.skills != '[]'
    runs-on: ubuntu-latest
    timeout-minutes: 30
    strategy:
      fail-fast: false
      matrix:
        skill: ${{ fromJSON(needs.changed.outputs.skills) }}
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
      - uses: skill-bench/skill-eval-action@v1
        with:
          skill-name: ${{ matrix.skill }}
          skill-path: skills/${{ matrix.skill }}
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
```

The action evaluates one skill per invocation. Parallelism comes from the GitHub Actions matrix strategy:
| Approach | Skills in parallel | How |
|---|---|---|
| Static matrix | Up to 256 | List skills in matrix.skill |
| Dynamic matrix | Up to 256 | Use discover step + fromJSON() |
| Changed only | Varies | Filter by git diff |
| Sequential | 1 | No matrix (not recommended for >3 skills) |
Within a single skill, eval cases run sequentially to avoid Anthropic API rate limits.
| Input | Required | Default | Description |
|---|---|---|---|
| `skill-name` | Yes | - | Name of the skill to evaluate |
| `skill-path` | Yes | - | Path to the skill directory (must contain `SKILL.md` and `evals/`) |
| `anthropic-api-key` | Yes | - | Anthropic API key for the `claude` CLI |
| `pass-threshold` | No | `80` | Minimum pass rate (0-100) to succeed |
| `timeout` | No | `120` | Timeout per eval case in seconds |
| `allowed-tools` | No | `''` | Tool allow-list granted to the skill under test, forwarded to `claude --allowedTools` (e.g. `Bash(kubectl get:*),Bash(gh api:*),Read`). Per-case `allowed_tools` overrides it |
| `permission-mode` | No | `''` | Permission mode forwarded to `claude --permission-mode` (`default`, `acceptEdits`, `plan`, `bypassPermissions`). Per-case `permission_mode` overrides it |
| `post-comment` | No | `true` | Post results as a PR comment |
| `github-token` | No | `${{ github.token }}` | Token for PR comments |
| `upload-viewer` | No | `true` | Upload eval-viewer HTML as an artifact |
| `node-version` | No | `22` | Node.js version for `claude` CLI installation |
| `max-retries` | No | `3` | Max retry attempts per API call on timeout/error |
| `retry-delay` | No | `10` | Base delay between retries in seconds (multiplied by attempt number) |
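
The `retry-delay` description implies a linear backoff. The following is a minimal sketch of that arithmetic with the default values, assuming attempts are numbered from 1; `retry_wait` is a hypothetical helper for illustration, not part of the action.

```shell
# Wait before a given retry: base delay multiplied by the attempt number.
# retry_wait is a hypothetical helper, not the action's own code.
retry_wait() {
  # $1 = base delay in seconds, $2 = attempt number (assumed to start at 1)
  echo $(( $1 * $2 ))
}

max_retries=3   # default of the max-retries input
retry_delay=10  # default of the retry-delay input

attempt=1
while [ "$attempt" -le "$max_retries" ]; do
  echo "retry $attempt: wait $(retry_wait "$retry_delay" "$attempt")s"
  attempt=$(( attempt + 1 ))
done
```

Under these assumptions the waits are 10, 20, and 30 seconds, so a call that exhausts all retries spends up to 60 seconds waiting in addition to the time spent in the calls themselves.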
| Output | Description |
|---|---|
| `pass-rate` | Overall pass rate as percentage (0-100) |
| `passed` | Total criteria passed |
| `total` | Total criteria evaluated |
| `cases-run` | Number of eval cases executed |
eval YAML -> claude -p (execute) -> claude -p (grade) -> summary.json -> PR comment + artifact
- Discovers eval YAML files in `<skill-path>/evals/`
- Executes each case via `claude -p` with skill content injected
- Grades each response against criteria via a separate `claude -p` call
- Aggregates results and writes a GitHub Actions step summary
- Posts a PR comment with pass/fail table and failed criteria details
- Uploads an interactive eval viewer as an artifact
- Fails the step if pass rate is below threshold
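
The final threshold check can be sketched as follows. This assumes the pass rate is the share of passed criteria rounded down to a whole percentage; the action's actual rounding is not documented here, and `pass_rate` and `meets_threshold` are hypothetical helper names.

```shell
# Pass rate as an integer percentage of criteria passed.
# Rounding down is an assumption, not confirmed behavior of the action.
pass_rate() {
  # $1 = criteria passed, $2 = criteria evaluated
  if [ "$2" -eq 0 ]; then
    echo 0
    return
  fi
  echo $(( $1 * 100 / $2 ))
}

# Succeeds when the pass rate is at or above the threshold.
meets_threshold() {
  # $1 = passed, $2 = total, $3 = pass-threshold (0-100)
  [ "$(pass_rate "$1" "$2")" -ge "$3" ]
}

if meets_threshold 9 11 80; then
  echo "pass: $(pass_rate 9 11)% >= 80%"
else
  echo "fail: $(pass_rate 9 11)% < 80%"
fi
```

With 9 of 11 criteria passing, the rate under this sketch is 81%, which clears the default threshold of 80.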
Place YAML files in <skill-path>/evals/:
```yaml
# evals/001-basic-usage.yaml
name: Basic usage
prompt: "The user prompt that should trigger and test this skill"
files: # optional - temp files created before the test
  - path: "main.tf"
    content: |
      resource "aws_instance" "web" {}
criteria: # success criteria - ALL must pass
  - "Output contains a valid resource block"
  - "Uses for_each, not count, for multiple resources"
expect_skill: true # optional - default true
timeout: 120 # optional - default from action input
allowed_tools: # optional - overrides the action `allowed-tools` input
  - "Bash(kubectl get:*)"
  - "Read"
permission_mode: default # optional - overrides the action `permission-mode` input
```

Include at least one negative trigger case (`expect_skill: false`).
Skills that diagnose by running read-only commands (kubectl get, gh api, gcloud ... list) need those commands to actually execute. By default the model runs with no tool permissions, so every Bash call is denied and such skills fail spuriously. Grant a scoped allow-list via the allowed-tools input (or per-case allowed_tools):
```yaml
# action invocation
with:
  allowed-tools: "Bash(kubectl get:*),Bash(gh api:*),Read"
```

Prefer scoped allow-lists over `permission-mode: bypassPermissions` so read-only commands run while a skill's refusal-of-mutation behavior can still be tested. Use a per-case `allowed_tools` to widen scope for a single case (e.g. allow `Bash(kubectl delete:*)` only in a case that asserts the skill refuses to run it).
The action posts (or updates) a PR comment with:
- Pass/fail table with per-case results
- Collapsible failed criteria with evidence
- Eval metadata (time, tokens, cost, threshold)
Comments are upserted using an HTML marker - re-runs update the existing comment instead of creating duplicates.
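
The upsert decision can be sketched as below. The marker string, the tab-separated input format, and `find_marked_comment` are all assumptions made for illustration; the action's real marker and implementation are not shown in this README.

```shell
# Hypothetical marker; the action's real marker text may differ.
MARKER='<!-- skill-eval-results -->'

# Prints the id of the first comment whose body contains the marker.
# Input on stdin: one comment per line as "<id><TAB><body>".
find_marked_comment() {
  marker=$1
  tab=$(printf '\t')
  while IFS="$tab" read -r id body; do
    case $body in
      *"$marker"*) echo "$id"; return 0 ;;
    esac
  done
  return 1
}

# Two example comments; only the second carries the marker.
comments=$(printf '101\tLGTM\n102\t<!-- skill-eval-results --> 4/5 cases passed\n')

if id=$(printf '%s\n' "$comments" | find_marked_comment "$MARKER"); then
  echo "update comment $id"
else
  echo "create new comment"
fi
```

Because the marker is an HTML comment, it stays invisible in the rendered PR comment while still letting a re-run locate the comment it should update.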
LLM-based evals are non-deterministic. Each run, Claude generates a slightly different response, and the grader evaluates it slightly differently. The same skill without changes may produce different pass rates across runs.
This is why:
- The default `pass-threshold` is `80`, not `100`
- The agentskills.io best practices say "occasional flakiness is expected"
- Multiple runs + aggregation gives a more reliable picture
Options to reduce flakiness:
- Relax criteria - make them less brittle (e.g., "uses SHA pinning or explains how to resolve SHAs" instead of "all actions pinned to 40-char SHA")
- Run multiple times and average - aggregate results across runs for a stable signal
- Lower threshold - accept that 70-80% is a realistic pass rate for LLM evals
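
Averaging across runs can be as simple as taking the mean of the `pass-rate` output from several runs. The sketch below assumes the rates were collected by hand or by a separate script; `mean_pass_rate` is a hypothetical helper and the action does not provide this aggregation itself.

```shell
# Integer mean of the pass rates given as arguments (rounded down).
# mean_pass_rate is a hypothetical helper, not part of the action.
mean_pass_rate() {
  sum=0
  count=0
  for rate in "$@"; do
    sum=$(( sum + rate ))
    count=$(( count + 1 ))
  done
  if [ "$count" -eq 0 ]; then
    echo 0
    return
  fi
  echo $(( sum / count ))
}

# Three runs of the same unchanged skill, with made-up example rates.
echo "mean pass rate: $(mean_pass_rate 85 70 80)%"
```

A single run at 70% would fail a threshold of 75, while the three-run mean of 78% would pass, which is the kind of smoothing the aggregation option is after.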
Each eval case makes 2 API calls (execute + grade). A skill with 5 cases = 10 API calls. Set appropriate timeout values to limit runaway token usage. Use the "changed only" pattern to avoid evaluating unchanged skills on every PR.
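
As a rough budgeting aid, the call count scales with the number of skills in the matrix. The sketch below counts only the two base calls per case and ignores retries, which would add more; `api_calls` is a hypothetical helper.

```shell
# Base API calls for a PR run: skills x cases per skill x 2 (execute + grade).
# Retries are not counted. api_calls is a hypothetical helper.
api_calls() {
  # $1 = number of skills evaluated, $2 = eval cases per skill
  echo $(( $1 * $2 * 2 ))
}

echo "1 skill, 5 cases each: $(api_calls 1 5) calls"
echo "3 skills, 5 cases each: $(api_calls 3 5) calls"
```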
- `ANTHROPIC_API_KEY` as a repository secret
- Eval YAML files in the skill's `evals/` directory
- Skills must follow the Agent Skills format
MIT