feat(examples): CV screening demo with feedback-to-deploy walkthrough by mmabrouk · Pull Request #4607 · Agenta-AI/agenta

mmabrouk · 2026-06-09T21:42:56Z

What this is

A complete CV screening example that demonstrates the core Agenta loop: a prompt screens a candidate, the result lands in observability as a trace, a human gives feedback on it, and that feedback drives a prompt change that you evaluate and deploy. The prompt scores a CV against an IT Manager job spec on three booleans with reasons (tech_match, experience_match, overall_match) plus the list of missing requirements.

A recruiter uses a small Streamlit app (PDF upload to Markdown to the prompt fetched from the registry). An AI engineer works the prompt, the test set, and the evaluations in Agenta. The two sides stay split: screening.py owns the AI logic and tracing, app.py is only UI.

The walkthrough

The example is built around one story: the company's working language is German, but the prompt's job spec never says so.

The recruiter screens a strong IT manager who does not speak German. The app says "Advance to interview" (the miss), so they submit a thumbs-down with a comment.
The AI engineer filters traces by the user-feedback annotation, opens the bad trace, and opens its span in the playground. It lands on the exact prompt revision with the CV pre-filled.
They add Fluent German (the company's working language) to the must-have requirements and rerun. overall_match flips to false and German shows up in missing_requirements, while tech_match and experience_match stay true.
They add the CV to the test set as a new case with only expected_overall_match = false, leaving the other two expected columns empty.
They run an evaluation comparing the deployed prompt against the new one. The old prompt fails the new case; the new prompt passes it and keeps every original case passing, because every other candidate in the test set speaks German.
They deploy the new revision. The Streamlit app picks it up on the next screening, no code change.

Walkthrough video

What's in the example

config.py: the job spec, the prompt, and the structured-output JSON schema.
prepare_testset.py: builds data/testset.csv (27 curated CVs from the public opensporks/resumes dataset, all German speakers) and can upload it to Agenta. The demo candidate is deliberately excluded so the walkthrough adds it.
setup_app.py: idempotent setup. Creates the app and a default variant, sets the completion URI on every revision, deploys with the environment reference map, and archives any stray auto-named variant.
generate_traces.py: seeds screening traces (no feedback) before a demo.
screening.py, app.py: the AI logic and the Streamlit UI.
make_sample_pdfs.py, data/sample_cvs/: four sample CV PDFs, including the no-German demo candidate and a German-speaking strong match.
AGENTS.md: setup steps and the SDK/endpoint gotchas found while building this (completion URI, variant naming, deploy reference map).

Notes

The demo beats are verified, not assumed. Borderline CVs are model-unstable, so the labels follow the model's stable consensus, and the German requirement is added at the top of the must-haves where the flip is reliable (5/5 runs).
Pre-ran the before/after evaluation over the full test set twice: before scores 0.96 (only the new German case fails), after scores 1.00 with no regressions.
The code evaluator that scores each expected_* column is in Readme.md. Test it with a real evaluation run, not the evaluator test panel, which does not pass test set columns (AGE-3825).

vercel · 2026-06-09T21:43:03Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
agenta-documentation	Ready	Preview, Comment	Jun 15, 2026 8:38am

coderabbitai · 2026-06-09T21:46:21Z

📝 Walkthrough

Walkthrough

This PR introduces a complete, production-ready CV screening example for the Python SDK. It includes shared configuration with a structured JSON schema for classification results, scripts to prepare a curated test set from external resume data, an Agenta deployment script, and an interactive Streamlit demo that fetches prompts from Agenta, runs LLM screening, and collects user feedback as trace annotations.

Changes

CV Screening Example

Layer / File(s)	Summary
Overview, Documentation, and Infrastructure `examples/python/Readme.md`, `examples/python/cv-screening/Readme.md`, `examples/python/cv-screening/.env.example`, `examples/python/cv-screening/requirements.txt`, `examples/python/cv-screening/data/.gitignore`	Root README adds CV screening to use cases table. New cv-screening README documents the full end-to-end workflow. Environment template defines `AGENTA_API_KEY`, `AGENTA_HOST`, and `OPENAI_API_KEY`. Dependencies include Agenta SDK, OpenAI client, Streamlit, test-data tools, and PDF generation.
Shared Configuration and Agenta Deployment `examples/python/cv-screening/config.py`, `examples/python/cv-screening/create_app.py`	`config.py` defines app/variant slugs, system/user prompts with `{cv}` template, a strict JSON schema enforcing scores (1–5), requirement lists, classification enum, and reasoning. `create_app.py` initializes the Agenta client, creates the service completion app, publishes the prompt variant with the schema-based LLM config, and deploys to production.
Test Set Preparation and Sample Generation `examples/python/cv-screening/Readme.md` (test set docs), `examples/python/cv-screening/prepare_testset.py`, `examples/python/cv-screening/make_sample_pdfs.py`	README documents test set construction from external resume dataset with curated IT-manager classifications. `prepare_testset.py` downloads a public Hugging Face parquet, applies hand-curated resume ID mappings, converts resume HTML to Markdown, writes `data/testset.csv`, and optionally uploads to Agenta. `make_sample_pdfs.py` renders selected test CVs as PDF files under `data/sample_cvs/` with text normalization and FPDF styling.
Interactive Demo with PDF Upload and Feedback `examples/python/cv-screening/Readme.md` (demo walkthrough), `examples/python/cv-screening/app.py`	README walks through setup, Agenta deployment, test-set upload, playground iteration, Streamlit demo run, and feedback collection. `app.py` provides a Streamlit dashboard: PDF upload and Markdown conversion, fetches the production prompt from Agenta (with local fallback), invokes OpenAI chat completion with the schema, captures Agenta trace invocation IDs, renders classification banner and per-area score metrics with progress bars, requirement lists, and a feedback form (thumbs up/down + optional comment) that posts to Agenta trace data and triggers a success rerun.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant StreamlitApp as Streamlit App
  participant Agenta
  participant OpenAI
  
  User->>StreamlitApp: Upload CV PDF
  StreamlitApp->>StreamlitApp: Convert PDF to Markdown
  StreamlitApp->>Agenta: Fetch production prompt config
  Agenta-->>StreamlitApp: Return prompt + LLM config
  User->>StreamlitApp: Click "Screen CV" button
  StreamlitApp->>OpenAI: Call chat completion<br/>with prompt + schema
  OpenAI-->>StreamlitApp: Return structured JSON<br/>(scores, requirements, classification)
  StreamlitApp->>StreamlitApp: Render classification banner<br/>+ score metrics + requirements
  StreamlitApp->>Agenta: Capture trace invocation ID
  User->>StreamlitApp: Submit feedback<br/>(thumbs up/down + comment)
  StreamlitApp->>Agenta: POST feedback as trace annotation
  Agenta-->>StreamlitApp: Success response (200/202)
  StreamlitApp-->>User: Show feedback confirmation

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 21.05% which is insufficient. The required threshold is 60.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title accurately captures the main feature being added: a CV screening demo with feedback integration and evaluation walkthrough workflow.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, detailing the walkthrough story, components, and implementation.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch claude/cv-classifier-demo-oug3jb

Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

examples/python/cv-screening/requirements.txt (1)
1-19: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Pin dependency versions to avoid pulling vulnerable packages.

The requirements file specifies no version constraints, which means pip install will fetch the latest versions of all packages and their transitive dependencies. OSV Scanner has flagged numerous critical and high-severity vulnerabilities in transitive dependencies that could be pulled in, including:

aiohttp: 23 CRITICAL issues (SSRF, header injection, DoS, credential leaks)

gitpython: 9 CRITICAL issues (RCE, path traversal, arbitrary code execution)

litellm: 13 CRITICAL issues (SSTI, SQL injection, SSRF, eval-based RCE)

pillow: 6 CRITICAL issues (arbitrary code execution, buffer overflow, DoS)

pyarrow: 3 CRITICAL issues (arbitrary code execution)

While this is example code, users may run it in environments connected to real data or networks. Unpinned dependencies create a supply-chain risk.
🔒 Recommendation

Generate a pinned requirements.txt by running:
pip install -r requirements.txt
pip freeze > requirements.txt
Then review the frozen versions and update any packages flagged by pip-audit or OSV Scanner. Alternatively, specify minimum safe versions inline:
 # Agenta SDK + LLM client
-agenta
-openai
-python-dotenv
+agenta>=0.28.0
+openai>=1.0.0
+python-dotenv>=1.0.0
For the remaining packages, apply the same pattern after verifying secure minimum versions.

🧹 Nitpick comments (1)

examples/python/cv-screening/Readme.md (1)

20-22: 💤 Low value

Add language identifier to code fence.

The code fence starting at line 20 lacks a language identifier, triggering a markdownlint warning (MD040). While this is ASCII art rather than code, specifying text or leaving it as triple-backticks with no syntax highlighting improves consistency.

📝 Proposed fix

-```
+```text
 PDF upload ──> Markdown (markitdown) ──> prompt fetched from Agenta ──> LLM ──> structured scores

</details>

<!-- cr-comment:v1:63b9a63971a5e8574a05aef6 -->

</blockquote></details>

</blockquote></details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Organization UI

**Review profile**: CHILL

**Plan**: Pro Plus

**Run ID**: `7c1b7401-bc27-47de-b94d-d0d734c5558f`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between aed2d47357cc8d88347011835c7cc1f3f7f08ea7 and c28d1a2dca9c1982a3b8885929de57447d80e256.

</details>

<details>
<summary>⛔ Files ignored due to path filters (4)</summary>

* `examples/python/cv-screening/data/sample_cvs/candidate_chef.pdf` is excluded by `!**/*.pdf`
* `examples/python/cv-screening/data/sample_cvs/candidate_it_manager.pdf` is excluded by `!**/*.pdf`
* `examples/python/cv-screening/data/sample_cvs/candidate_it_supervisor.pdf` is excluded by `!**/*.pdf`
* `examples/python/cv-screening/data/testset.csv` is excluded by `!**/*.csv`

</details>

<details>
<summary>📒 Files selected for processing (10)</summary>

* `examples/python/Readme.md`
* `examples/python/cv-screening/.env.example`
* `examples/python/cv-screening/Readme.md`
* `examples/python/cv-screening/app.py`
* `examples/python/cv-screening/config.py`
* `examples/python/cv-screening/create_app.py`
* `examples/python/cv-screening/data/.gitignore`
* `examples/python/cv-screening/make_sample_pdfs.py`
* `examples/python/cv-screening/prepare_testset.py`
* `examples/python/cv-screening/requirements.txt`

</details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

A walkthrough demo for classifying CVs against a job spec with Agenta: - Curated test set of 30 real Markdown CVs (from the public opensporks/resumes dataset on Hugging Face, a mirror of the Kaggle Resume Dataset), hand-labeled against an IT Manager job spec - prepare_testset.py rebuilds the CSV reproducibly and can upload it to Agenta via the SDK - create_app.py creates the completion app with the screening prompt and structured-output JSON schema, and deploys it to production - Streamlit demo UI: PDF upload -> Markdown (markitdown) -> prompt fetched from the Agenta registry -> structured score dashboard - Sample CV PDFs (one per classification) generated from the test set https://claude.ai/code/session_01YMbf4sUb2VBFQHGNKv6yh3

The Streamlit app now shows a thumbs up/down form with an optional comment after each screening. Submitting it attaches the feedback to the screening's trace in Agenta as an annotation (evaluator slug 'user-feedback'), following the capture-user-feedback cookbook: the invocation link is captured inside the instrumented classify_cv call and the annotation is POSTed to /api/simple/traces/. Screening results now persist in session state so the result and feedback form survive Streamlit reruns. Entry scripts load .env via python-dotenv, matching the documented setup flow. https://claude.ai/code/session_01YMbf4sUb2VBFQHGNKv6yh3

…pt revision Move all the AI logic out of the Streamlit app into a new screening.py module (prompt fetch, the LLM call, tracing, feedback), leaving app.py as a UI-only shell. Any other frontend can import screening.py unchanged. Tracing improvements so screenings are easy to act on from the UI: - Auto-instrument the OpenAI client with OpenInference, so every trace has a child LLM span with the exact messages, token counts, and cost. - classify_cv takes its inputs as a dict whose keys match the prompt input variables ({"cv": ...}), and the prompt config is kept out of the trace (ignore_inputs). The span data then mirrors the completion app's inputs. - Link each span to the deployed prompt revision via ag.tracing.store_refs, so traces filter by app/environment and open in the playground on the right revision with inputs pre-filled. Also fix create_app.py to read variant.variant_version as an attribute (VariantManager now returns a ConfigurationResponse, not a dict).

The walkthrough needed a leaner story: the output schema is now tech_match / experience_match / overall_match, each with a short reason, plus the missing-requirements list. overall_match is a holistic hire-or-not judgment, so a requirement like a language can flip it while the other two stay true. The test set drops the bookkeeping columns and carries one expected_* column per dimension; empty cells are skipped by the code evaluator documented in the Readme.

Every test set CV now speaks German (the company's working language); the demo candidate (a strong IT manager with no German) is excluded from the set so the walkthrough adds it as a new test case. Relabeled the borderline rows to the model's stable consensus and dropped two bistable resumes, so the before/after evaluation is deterministic (before 0.96, after 1.00, no regressions). Verified the German flip is reliable when the requirement is added at the top of the must-haves. Adds setup_app.py: an idempotent setup that creates the app and a default variant explicitly, always sets the completion URI, deploys with the reference map, and archives stray auto-named variants. Adds generate_traces.py to seed traces, and AGENTS.md documenting setup and the SDK/endpoint gotchas. Feedback annotations now send a boolean score.

dosubot Bot added example python Pull requests that update Python code size:XL This PR changes 500-999 lines, ignoring generated files. labels Jun 9, 2026

mmabrouk marked this pull request as draft June 9, 2026 21:43

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

vercel Bot deployed to Preview June 10, 2026 20:26 View deployment

vercel Bot deployed to Preview June 11, 2026 12:01 View deployment

vercel Bot deployed to Preview June 11, 2026 12:03 View deployment

vercel Bot deployed to Preview June 11, 2026 12:38 View deployment

claude and others added 6 commits June 12, 2026 12:15

Fix make_sample_pdfs for the new test set columns

68ffcbd

mmabrouk force-pushed the claude/cv-classifier-demo-oug3jb branch from 9cb02f6 to f3bf581 Compare June 15, 2026 08:29

vercel Bot deployed to Preview June 15, 2026 08:31 View deployment

mmabrouk changed the title ~~Add CV screening example with curated resume test set~~ feat(examples): CV screening demo with feedback-to-deploy walkthrough Jun 15, 2026

mmabrouk marked this pull request as ready for review June 15, 2026 08:36

dosubot Bot added size:XXL This PR changes 1000+ lines, ignoring generated files. feature and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Jun 15, 2026

Add the walkthrough video to the readme

0cdeb65

mmabrouk changed the base branch from main to release/v0.103.5 June 15, 2026 08:38

vercel Bot deployed to Preview June 15, 2026 08:38 View deployment

mmabrouk requested a review from bekossy June 15, 2026 08:39

bekossy approved these changes Jun 15, 2026

View reviewed changes

dosubot Bot added the lgtm This PR has been approved by a maintainer label Jun 15, 2026

bekossy merged commit 1d42627 into release/v0.103.5 Jun 15, 2026
10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(examples): CV screening demo with feedback-to-deploy walkthrough#4607

feat(examples): CV screening demo with feedback-to-deploy walkthrough#4607
bekossy merged 7 commits into
release/v0.103.5from
claude/cv-classifier-demo-oug3jb

mmabrouk commented Jun 9, 2026 •

edited

Loading

Uh oh!

vercel Bot commented Jun 9, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated Code Review Effort

❌ Failed checks (1 warning)

Review ran into problems

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

mmabrouk commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this is

The walkthrough

Walkthrough video

What's in the example

Notes

Uh oh!

vercel Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated Code Review Effort

❌ Failed checks (1 warning)

Review ran into problems

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mmabrouk commented Jun 9, 2026 •

edited

Loading

vercel Bot commented Jun 9, 2026 •

edited

Loading

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading