Skill Eval Action

A GitHub Action that evaluates Claude Code skills against YAML test cases with automated grading and PR reporting.

Usage

Single skill

- uses: skill-bench/skill-eval-action@v1
  with:
    skill-name: tf-guide
    skill-path: ./skills/tf-guide
    anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}

Multiple skills (static matrix)

Run skills in parallel - each skill gets its own job:

name: Skill Eval
on:
  pull_request:
    paths:
      - 'skills/**'

permissions:
  contents: read
  pull-requests: write

jobs:
  eval:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    strategy:
      fail-fast: false
      matrix:
        skill:
          - tf-guide
          - k8s-operator-sdk
          - secure-gh-workflow
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6

      - uses: skill-bench/skill-eval-action@v1
        with:
          skill-name: ${{ matrix.skill }}
          skill-path: skills/${{ matrix.skill }}
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
          pass-threshold: '80'

Auto-discover all skills (dynamic matrix)

Automatically find and evaluate all skills that have evals/ directories - no need to hardcode skill names:

name: Skill Eval
on:
  pull_request:
    paths:
      - 'skills/**'

permissions:
  contents: read
  pull-requests: write

jobs:
  discover:
    runs-on: ubuntu-latest
    outputs:
      skills: ${{ steps.discover.outputs.skills }}
      count: ${{ steps.discover.outputs.count }}
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
        with:
          persist-credentials: false
          sparse-checkout: skills

      - name: Discover skills with evals
        id: discover
        run: |
          skills=$(find skills -name "*.yaml" -path "*/evals/*" -exec dirname {} \; | xargs -I{} dirname {} | xargs -I{} basename {} | sort -u | jq -R -s -c 'split("\n") | map(select(. != ""))')
          echo "skills=$skills" >> "$GITHUB_OUTPUT"
          echo "count=$(echo "$skills" | jq length)" >> "$GITHUB_OUTPUT"

      - name: Summary
        run: echo "Found ${{ steps.discover.outputs.count }} skills with evals"

  eval:
    needs: discover
    if: needs.discover.outputs.count > 0
    runs-on: ubuntu-latest
    timeout-minutes: 30
    strategy:
      fail-fast: false
      matrix:
        skill: ${{ fromJSON(needs.discover.outputs.skills) }}
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6

      - uses: skill-bench/skill-eval-action@v1
        with:
          skill-name: ${{ matrix.skill }}
          skill-path: skills/${{ matrix.skill }}
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
          pass-threshold: '80'
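
To check locally which skills the discover step would pick up, the shell pipeline can be approximated in Python. This is a rough sketch and not part of the action: the function name `discover_skills` is hypothetical, and it assumes eval files sit directly inside `<skill>/evals/`.

```python
import json
from pathlib import Path


def discover_skills(root: str = "skills") -> list[str]:
    """Return sorted names of skill directories that contain evals/*.yaml."""
    names = set()
    for yaml_file in Path(root).rglob("*.yaml"):
        # Keep only files whose immediate parent directory is named "evals".
        if yaml_file.parent.name == "evals":
            names.add(yaml_file.parent.parent.name)
    return sorted(names)


if __name__ == "__main__":
    # Prints a JSON array, the same shape the workflow passes to fromJSON().
    print(json.dumps(discover_skills()))
```

Run it from the repository root; the printed array should match the `skills` output of the discover job.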

Only evaluate changed skills

Combine with dorny/paths-filter or git diff to evaluate only the skills modified in the PR:

jobs:
  changed:
    runs-on: ubuntu-latest
    outputs:
      skills: ${{ steps.filter.outputs.skills }}
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
        with:
          persist-credentials: false
          fetch-depth: 0 # full history, so origin/main...HEAD can be resolved

      - name: Find changed skills with evals
        id: filter
        run: |
          skills=$(git diff --name-only origin/main...HEAD -- 'skills/' | cut -d/ -f2 | sort -u | while read -r s; do
            [ -d "skills/$s/evals" ] && echo "$s"
          done | jq -R -s -c 'split("\n") | map(select(. != ""))')
          echo "skills=$skills" >> "$GITHUB_OUTPUT"

  eval:
    needs: changed
    if: needs.changed.outputs.skills != '[]'
    runs-on: ubuntu-latest
    timeout-minutes: 30
    strategy:
      fail-fast: false
      matrix:
        skill: ${{ fromJSON(needs.changed.outputs.skills) }}
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6

      - uses: skill-bench/skill-eval-action@v1
        with:
          skill-name: ${{ matrix.skill }}
          skill-path: skills/${{ matrix.skill }}
          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}

Parallelism

The action evaluates one skill per invocation. Parallelism comes from the GitHub Actions matrix strategy:

Approach	Skills in parallel	How
Static matrix	Up to 256	List skills in `matrix.skill`
Dynamic matrix	Up to 256	Use discover step + `fromJSON()`
Changed only	Varies	Filter by git diff
Sequential	1	No matrix (not recommended for >3 skills)

Within a single skill, eval cases run sequentially to avoid Anthropic API rate limits.

Inputs

Input	Required	Default	Description
`skill-name`	Yes	-	Name of the skill to evaluate
`skill-path`	Yes	-	Path to the skill directory (must contain `SKILL.md` and `evals/`)
`anthropic-api-key`	Yes	-	Anthropic API key for the `claude` CLI
`pass-threshold`	No	`80`	Minimum pass rate (0-100) to succeed
`timeout`	No	`120`	Timeout per eval case in seconds
`allowed-tools`	No	`''`	Tool allow-list granted to the skill under test, forwarded to `claude --allowedTools` (e.g. `Bash(kubectl get:*),Bash(gh api:*),Read`). Per-case `allowed_tools` overrides it
`permission-mode`	No	`''`	Permission mode forwarded to `claude --permission-mode` (`default`, `acceptEdits`, `plan`, `bypassPermissions`). Per-case `permission_mode` overrides it
`post-comment`	No	`true`	Post results as a PR comment
`github-token`	No	`${{ github.token }}`	Token for PR comments
`upload-viewer`	No	`true`	Upload eval-viewer HTML as an artifact
`node-version`	No	`22`	Node.js version for claude CLI installation
`max-retries`	No	`3`	Max retry attempts per API call on timeout/error
`retry-delay`	No	`10`	Base delay between retries in seconds (multiplied by attempt number)
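
To make the `max-retries` and `retry-delay` defaults concrete, here is a small sketch of the wait schedule they imply. It assumes the wait before retry n is `retry-delay` multiplied by n, as the description suggests; the function name `retry_delays` is hypothetical and the action's real schedule may differ in detail.

```python
def retry_delays(max_retries: int = 3, retry_delay: int = 10) -> list[int]:
    """Seconds to wait before each retry, assuming delay = retry_delay * attempt."""
    return [retry_delay * attempt for attempt in range(1, max_retries + 1)]


delays = retry_delays()
print(delays)       # [10, 20, 30]
print(sum(delays))  # 60
```

Under this assumption, the defaults add at most 60 seconds of waiting per API call, on top of the per-case `timeout`.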

Outputs

Output	Description
`pass-rate`	Overall pass rate as percentage (0-100)
`passed`	Total criteria passed
`total`	Total criteria evaluated
`cases-run`	Number of eval cases executed

How it works

eval YAML -> claude -p (execute) -> claude -p (grade) -> summary.json -> PR comment + artifact

Discovers eval YAML files in <skill-path>/evals/
Executes each case via claude -p with skill content injected
Grades each response against criteria via a separate claude -p call
Aggregates results and writes a GitHub Actions step summary
Posts a PR comment with pass/fail table and failed criteria details
Uploads an interactive eval viewer as an artifact
Fails the step if pass rate is below threshold
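
The final threshold check can be sketched as follows. This assumes the pass rate is `passed` divided by `total` criteria, expressed as a percentage, which matches the output descriptions. The function names and the handling of an empty run (treated as 0%) are assumptions, not the action's confirmed behavior.

```python
def pass_rate(passed: int, total: int) -> float:
    """Percentage of criteria that passed; 0.0 when nothing was evaluated."""
    return 100.0 * passed / total if total else 0.0


def meets_threshold(passed: int, total: int, threshold: float = 80) -> bool:
    """True when the pass rate is at or above the threshold."""
    return pass_rate(passed, total) >= threshold


print(pass_rate(8, 10))        # 80.0
print(meets_threshold(8, 10))  # True: 80% is not below the default threshold
print(meets_threshold(7, 10))  # False
```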

Eval case format

Place YAML files in <skill-path>/evals/:

# evals/001-basic-usage.yaml
name: Basic usage
prompt: "The user prompt that should trigger and test this skill"
files:                          # optional - temp files created before the test
  - path: "main.tf"
    content: |
      resource "aws_instance" "web" {}
criteria:                       # success criteria - ALL must pass
  - "Output contains a valid resource block"
  - "Uses for_each, not count, for multiple resources"
expect_skill: true              # optional - default true
timeout: 120                    # optional - default from action input
allowed_tools:                  # optional - overrides the action `allowed-tools` input
  - "Bash(kubectl get:*)"
  - "Read"
permission_mode: default        # optional - overrides the action `permission-mode` input

Include at least one negative trigger case (expect_skill: false).
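
A quick way to enforce this recommendation is to check a suite after loading its YAML files into dictionaries. The helper below is a hypothetical sketch; it relies only on the documented default that `expect_skill` is true when omitted.

```python
def has_negative_case(cases: list[dict]) -> bool:
    """True if at least one case sets expect_skill to false."""
    # expect_skill defaults to true when the key is absent.
    return any(case.get("expect_skill", True) is False for case in cases)


cases = [
    {"name": "Basic usage"},
    {"name": "Unrelated question", "expect_skill": False},
]
print(has_negative_case(cases))  # True
```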

Granting tool permissions

Skills that diagnose by running read-only commands (kubectl get, gh api, gcloud ... list) need those commands to actually execute. By default the model runs with no tool permissions, so every Bash call is denied and such skills fail spuriously. Grant a scoped allow-list via the allowed-tools input (or per-case allowed_tools):

# action invocation
with:
  allowed-tools: "Bash(kubectl get:*),Bash(gh api:*),Read"

Prefer scoped allow-lists over permission-mode: bypassPermissions so read-only commands run while a skill's refusal-of-mutation behavior can still be tested. Use a per-case allowed_tools to widen scope for a single case (e.g. allow Bash(kubectl delete:*) only in a case that asserts the skill refuses to run it).
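
The override rule can be illustrated with a short sketch. It assumes that a per-case `allowed_tools` list replaces the action-level `allowed-tools` string rather than merging with it, and that the string is a plain comma-separated list; `effective_allowed_tools` is a hypothetical name, not a function the action exposes.

```python
from typing import Optional


def effective_allowed_tools(action_input: str, case_tools: Optional[list[str]]) -> list[str]:
    """Per-case allowed_tools wins; otherwise fall back to the action input."""
    if case_tools is not None:
        return list(case_tools)
    # The action input is a comma-separated string; empty entries are dropped.
    return [tool.strip() for tool in action_input.split(",") if tool.strip()]


default = "Bash(kubectl get:*),Bash(gh api:*),Read"
print(effective_allowed_tools(default, None))
print(effective_allowed_tools(default, ["Bash(kubectl delete:*)"]))
```

If the override does replace the default, a case that widens scope also has to repeat any read-only tools it still needs.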

PR comment

The action posts (or updates) a PR comment with:

Pass/fail table with per-case results
Collapsible failed criteria with evidence
Eval metadata (time, tokens, cost, threshold)

Comments are upserted using an HTML marker - re-runs update the existing comment instead of creating duplicates.
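
The upsert pattern works roughly like the sketch below: the report body carries a hidden HTML comment, and a re-run looks for that marker among the existing PR comments before deciding whether to create or update. The marker text and function names here are hypothetical, and the sketch only plans the operation rather than calling the GitHub API.

```python
from typing import Optional

MARKER = "<!-- skill-eval:tf-guide -->"  # hypothetical marker text


def find_existing_comment(comments: list[dict], marker: str = MARKER) -> Optional[int]:
    """Return the id of the first comment whose body contains the marker."""
    for comment in comments:
        if marker in (comment.get("body") or ""):
            return comment["id"]
    return None


def plan_upsert(comments: list[dict], report: str, marker: str = MARKER) -> dict:
    """Decide whether to update an existing comment or create a new one."""
    body = f"{marker}\n{report}"
    existing_id = find_existing_comment(comments, marker)
    if existing_id is None:
        return {"action": "create", "body": body}
    return {"action": "update", "comment_id": existing_id, "body": body}


existing = [{"id": 1, "body": "LGTM"}, {"id": 2, "body": MARKER + "\nold report"}]
print(plan_upsert(existing, "new report")["action"])  # update
```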

Non-determinism and flakiness

LLM-based evals are non-deterministic. On each run Claude may generate a different response and the grader may judge it differently, so an unchanged skill can produce different pass rates across runs.

This is why:

The default pass-threshold is 80, not 100
The agentskills.io best practices say "occasional flakiness is expected"
Aggregating multiple runs gives a more reliable picture than any single run

Options to reduce flakiness:

Relax criteria - make them less brittle (e.g., "uses SHA pinning or explains how to resolve SHAs" instead of "all actions pinned to 40-char SHA")
Run multiple times and average - aggregate results across runs for a stable signal
Lower threshold - accept that 70-80% is a realistic pass rate for LLM evals
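
Averaging across runs can be as simple as the sketch below, which takes the `passed` and `total` outputs from several runs of the same skill. The function name and the sample numbers are hypothetical; the action itself reports a single run.

```python
from statistics import mean


def aggregate_runs(runs: list[tuple[int, int]]) -> dict:
    """Summarize several runs given (passed, total) pairs of criteria counts."""
    # Runs with no criteria are skipped; at least one non-empty run is required.
    rates = [100.0 * passed / total for passed, total in runs if total]
    return {
        "mean": mean(rates),
        "min": min(rates),
        "max": max(rates),
    }


# Three hypothetical runs of the same, unchanged skill.
summary = aggregate_runs([(8, 10), (9, 10), (7, 10)])
print(summary)  # {'mean': 80.0, 'min': 70.0, 'max': 90.0}
```

The spread between `min` and `max` gives a rough sense of how much of a pass-rate change is noise rather than a real regression.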

Cost considerations

Each eval case makes 2 API calls (execute + grade). A skill with 5 cases = 10 API calls. Set appropriate timeout values to limit runaway token usage. Use the "changed only" pattern to avoid evaluating unchanged skills on every PR.
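
For a rough budget, the call count scales linearly with the number of cases. The sketch below ignores retries, which can add further calls, and uses hypothetical case counts.

```python
CALLS_PER_CASE = 2  # one execute call plus one grade call


def api_calls(cases_per_skill: list[int]) -> int:
    """Total API calls for a run, ignoring retries."""
    return sum(cases * CALLS_PER_CASE for cases in cases_per_skill)


print(api_calls([5]))        # 10
print(api_calls([5, 3, 8]))  # 32
```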

Requirements

ANTHROPIC_API_KEY as a repository secret
Eval YAML files in the skill's evals/ directory
Skills must follow the Agent Skills format

License

MIT
