feat(examples): CV screening demo with feedback-to-deploy walkthrough #4607

Merged
bekossy merged 7 commits into release/v0.103.5 from claude/cv-classifier-demo-oug3jb on Jun 15, 2026
mmabrouk (Member) commented Jun 9, 2026

What this is

A complete CV screening example that demonstrates the core Agenta loop: a prompt screens a candidate, the result lands in observability as a trace, a human gives feedback on it, and that feedback drives a prompt change that you evaluate and deploy. The prompt scores a CV against an IT Manager job spec on three booleans with reasons (tech_match, experience_match, overall_match) plus the list of missing requirements.
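As a rough illustration of the output described above, the sketch below parses one screening result into a typed object. Only `tech_match`, `experience_match`, `overall_match`, and `missing_requirements` come from this description; the flat `*_reason` keys, the `ScreeningResult` class, the `parse_screening` helper, and the sample reasons are assumptions made for illustration, and the actual schema in config.py may be laid out differently.

```python
import json
from dataclasses import dataclass, field
from typing import List

DIMENSIONS = ("tech_match", "experience_match", "overall_match")


@dataclass
class ScreeningResult:
    # Three boolean verdicts, each with a short reason (key names are assumed).
    tech_match: bool
    tech_match_reason: str
    experience_match: bool
    experience_match_reason: str
    overall_match: bool
    overall_match_reason: str
    missing_requirements: List[str] = field(default_factory=list)


def parse_screening(raw: str) -> ScreeningResult:
    """Parse the model's JSON output and reject non-boolean verdicts."""
    data = json.loads(raw)
    for key in DIMENSIONS:
        if not isinstance(data.get(key), bool):
            raise ValueError(f"{key} must be a boolean")
    return ScreeningResult(
        tech_match=data["tech_match"],
        tech_match_reason=str(data.get("tech_match_reason", "")),
        experience_match=data["experience_match"],
        experience_match_reason=str(data.get("experience_match_reason", "")),
        overall_match=data["overall_match"],
        overall_match_reason=str(data.get("overall_match_reason", "")),
        missing_requirements=[str(item) for item in data.get("missing_requirements", [])],
    )


# Made-up output for a strong candidate who lacks one must-have requirement.
example = json.dumps(
    {
        "tech_match": True,
        "tech_match_reason": "Covers the required stack.",
        "experience_match": True,
        "experience_match_reason": "Several years managing IT teams.",
        "overall_match": False,
        "overall_match_reason": "A must-have requirement is missing.",
        "missing_requirements": ["Fluent German"],
    }
)
result = parse_screening(example)
print(result.overall_match, result.missing_requirements)
```

Keeping the verdicts as plain booleans makes each one directly comparable to an expected value in a test set.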

A recruiter uses a small Streamlit app (PDF upload, conversion to Markdown, then screening with the prompt fetched from the registry). An AI engineer works on the prompt, the test set, and the evaluations in Agenta. The two sides stay split: screening.py owns the AI logic and tracing, and app.py is UI only.

The walkthrough

The example is built around one story: the company's working language is German, but the prompt's job spec never says so.

  1. The recruiter screens a strong IT manager who does not speak German. The app says "Advance to interview" (the miss), so they submit a thumbs-down with a comment.
  2. The AI engineer filters traces by the user-feedback annotation, opens the bad trace, and opens its span in the playground. It lands on the exact prompt revision with the CV pre-filled.
  3. They add Fluent German (the company's working language) to the must-have requirements and rerun. overall_match flips to false and German shows up in missing_requirements, while tech_match and experience_match stay true.
  4. They add the CV to the test set as a new case with only expected_overall_match = false, leaving the other two expected columns empty.
  5. They run an evaluation comparing the deployed prompt against the new one. The old prompt fails the new case; the new prompt passes it and keeps every original case passing, because every other candidate in the test set speaks German.
  6. They deploy the new revision. The Streamlit app picks it up on the next screening, no code change.
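Steps 4 and 5 depend on an evaluator that scores only the expected columns that are filled in. The sketch below shows one way that could work; it is not the evaluator from the Readme. The function names, the accepted true/false spellings, and the assumption that the output is a flat dict of booleans are all hypothetical.

```python
from typing import Optional

DIMENSIONS = ("tech_match", "experience_match", "overall_match")


def parse_expected(cell) -> Optional[bool]:
    """Return True/False for a filled cell, or None for an empty one."""
    text = "" if cell is None else str(cell).strip().lower()
    if text == "":
        return None
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise ValueError(f"unrecognized expected value: {cell!r}")


def score_case(testcase: dict, output: dict) -> Optional[float]:
    """Fraction of filled expected_* columns that the output matches.

    Empty columns are skipped; a case with no filled column returns None.
    """
    checks = []
    for dim in DIMENSIONS:
        expected = parse_expected(testcase.get(f"expected_{dim}"))
        if expected is None:
            continue  # empty cell: this dimension is not scored
        checks.append(output[dim] == expected)
    if not checks:
        return None
    return sum(checks) / len(checks)


# The new test case from step 4: only expected_overall_match is set.
new_case = {
    "expected_tech_match": "",
    "expected_experience_match": "",
    "expected_overall_match": "false",
}
old_prompt_output = {"tech_match": True, "experience_match": True, "overall_match": True}
new_prompt_output = {"tech_match": True, "experience_match": True, "overall_match": False}

print(score_case(new_case, old_prompt_output))  # old prompt fails the case
print(score_case(new_case, new_prompt_output))  # new prompt passes it
```

Leaving the other two columns empty keeps the new case focused on the one judgment the feedback was about, so it cannot fail for unrelated reasons.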

Walkthrough video

[Video: CV screening walkthrough]

What's in the example

  • config.py: the job spec, the prompt, and the structured-output JSON schema.
  • prepare_testset.py: builds data/testset.csv (27 curated CVs from the public opensporks/resumes dataset, all German speakers) and can upload it to Agenta. The demo candidate is deliberately excluded so the walkthrough adds it.
  • setup_app.py: idempotent setup. Creates the app and a default variant, sets the completion URI on every revision, deploys with the environment reference map, and archives any stray auto-named variant.
  • generate_traces.py: seeds screening traces (no feedback) before a demo.
  • screening.py, app.py: the AI logic and the Streamlit UI.
  • make_sample_pdfs.py, data/sample_cvs/: four sample CV PDFs, including the no-German demo candidate and a German-speaking strong match.
  • AGENTS.md: setup steps and the SDK/endpoint gotchas found while building this (completion URI, variant naming, deploy reference map).

Notes

  • The demo beats are verified, not assumed. Borderline CVs are model-unstable, so the labels follow the model's stable consensus, and the German requirement is added at the top of the must-haves where the flip is reliable (5/5 runs).
  • Pre-ran the before/after evaluation over the full test set twice: before scores 0.96 (only the new German case fails), after scores 1.00 with no regressions.
  • The code evaluator that scores each expected_* column is in Readme.md. Test it with a real evaluation run, not the evaluator test panel, which does not pass test set columns (AGE-3825).
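The stability check behind the first note can be sketched roughly as follows: run the same screening several times and accept a label only when every run agrees. The `consensus_label` helper and the "all runs agree" criterion are assumptions for illustration; the notes do not spell out the exact procedure used.

```python
from collections import Counter
from typing import Callable, Hashable, Tuple


def consensus_label(run: Callable[[], Hashable], n_runs: int = 5) -> Tuple[Hashable, bool]:
    """Call `run` n_runs times and return (most common label, whether all runs agreed)."""
    votes = Counter(run() for _ in range(n_runs))
    label, count = votes.most_common(1)[0]
    return label, count == n_runs


# Stand-ins for repeated model calls on one CV (hypothetical, no LLM involved).
stable_run = lambda: False                           # the same verdict every time
flaky_votes = iter([True, False, True, False, True])
flaky_run = lambda: next(flaky_votes)                # verdict changes between runs

print(consensus_label(stable_run))  # unanimous: safe to use as a label
print(consensus_label(flaky_run))   # mixed: relabel or drop the CV
```

A CV whose verdict changes between runs makes a before/after comparison noisy, which is why unstable rows are better relabeled or removed than kept with a coin-flip label.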

vercel Bot commented Jun 9, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
agenta-documentation | Ready | Preview, Comment | Jun 15, 2026 8:38am

dosubot Bot added the example, python (Pull requests that update Python code), and size:XL (This PR changes 500-999 lines, ignoring generated files) labels Jun 9, 2026
mmabrouk marked this pull request as draft June 9, 2026 21:43
coderabbitai Bot commented Jun 9, 2026

📝 Walkthrough

This PR introduces a complete, production-ready CV screening example for the Python SDK. It includes shared configuration with a structured JSON schema for classification results, scripts to prepare a curated test set from external resume data, an Agenta deployment script, and an interactive Streamlit demo that fetches prompts from Agenta, runs LLM screening, and collects user feedback as trace annotations.

Changes

CV Screening Example

Layer / File(s) and Summary

  • Overview, Documentation, and Infrastructure
    Files: examples/python/Readme.md, examples/python/cv-screening/Readme.md, examples/python/cv-screening/.env.example, examples/python/cv-screening/requirements.txt, examples/python/cv-screening/data/.gitignore
    Summary: Root README adds CV screening to use cases table. New cv-screening README documents the full end-to-end workflow. Environment template defines AGENTA_API_KEY, AGENTA_HOST, and OPENAI_API_KEY. Dependencies include Agenta SDK, OpenAI client, Streamlit, test-data tools, and PDF generation.
  • Shared Configuration and Agenta Deployment
    Files: examples/python/cv-screening/config.py, examples/python/cv-screening/create_app.py
    Summary: config.py defines app/variant slugs, system/user prompts with {cv} template, a strict JSON schema enforcing scores (1–5), requirement lists, classification enum, and reasoning. create_app.py initializes the Agenta client, creates the service completion app, publishes the prompt variant with the schema-based LLM config, and deploys to production.
  • Test Set Preparation and Sample Generation
    Files: examples/python/cv-screening/Readme.md (test set docs), examples/python/cv-screening/prepare_testset.py, examples/python/cv-screening/make_sample_pdfs.py
    Summary: README documents test set construction from external resume dataset with curated IT-manager classifications. prepare_testset.py downloads a public Hugging Face parquet, applies hand-curated resume ID mappings, converts resume HTML to Markdown, writes data/testset.csv, and optionally uploads to Agenta. make_sample_pdfs.py renders selected test CVs as PDF files under data/sample_cvs/ with text normalization and FPDF styling.
  • Interactive Demo with PDF Upload and Feedback
    Files: examples/python/cv-screening/Readme.md (demo walkthrough), examples/python/cv-screening/app.py
    Summary: README walks through setup, Agenta deployment, test-set upload, playground iteration, Streamlit demo run, and feedback collection. app.py provides a Streamlit dashboard: PDF upload and Markdown conversion, fetches the production prompt from Agenta (with local fallback), invokes OpenAI chat completion with the schema, captures Agenta trace invocation IDs, renders classification banner and per-area score metrics with progress bars, requirement lists, and a feedback form (thumbs up/down + optional comment) that posts to Agenta trace data and triggers a success rerun.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant StreamlitApp as Streamlit App
  participant Agenta
  participant OpenAI
  
  User->>StreamlitApp: Upload CV PDF
  StreamlitApp->>StreamlitApp: Convert PDF to Markdown
  StreamlitApp->>Agenta: Fetch production prompt config
  Agenta-->>StreamlitApp: Return prompt + LLM config
  User->>StreamlitApp: Click "Screen CV" button
  StreamlitApp->>OpenAI: Call chat completion<br/>with prompt + schema
  OpenAI-->>StreamlitApp: Return structured JSON<br/>(scores, requirements, classification)
  StreamlitApp->>StreamlitApp: Render classification banner<br/>+ score metrics + requirements
  StreamlitApp->>Agenta: Capture trace invocation ID
  User->>StreamlitApp: Submit feedback<br/>(thumbs up/down + comment)
  StreamlitApp->>Agenta: POST feedback as trace annotation
  Agenta-->>StreamlitApp: Success response (200/202)
  StreamlitApp-->>User: Show feedback confirmation

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 21.05% which is insufficient. The required threshold is 60.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Title check | ✅ Passed | The title accurately captures the main feature being added: a CV screening demo with feedback integration and evaluation walkthrough workflow.
Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, detailing the walkthrough story, components, and implementation.

Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.


coderabbitai Bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/python/cv-screening/requirements.txt (1)

1-19: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Pin dependency versions to avoid pulling vulnerable packages.

The requirements file specifies no version constraints, which means pip install will fetch the latest versions of all packages and their transitive dependencies. OSV Scanner has flagged numerous critical and high-severity vulnerabilities in transitive dependencies that could be pulled in, including:

  • aiohttp: 23 CRITICAL issues (SSRF, header injection, DoS, credential leaks)
  • gitpython: 9 CRITICAL issues (RCE, path traversal, arbitrary code execution)
  • litellm: 13 CRITICAL issues (SSTI, SQL injection, SSRF, eval-based RCE)
  • pillow: 6 CRITICAL issues (arbitrary code execution, buffer overflow, DoS)
  • pyarrow: 3 CRITICAL issues (arbitrary code execution)

While this is example code, users may run it in environments connected to real data or networks. Unpinned dependencies create a supply-chain risk.

🔒 Recommendation

Generate a pinned requirements.txt by running:

pip install -r requirements.txt
pip freeze > requirements.txt

Then review the frozen versions and update any packages flagged by pip-audit or OSV Scanner. Alternatively, specify minimum safe versions inline:

 # Agenta SDK + LLM client
-agenta
-openai
-python-dotenv
+agenta>=0.28.0
+openai>=1.0.0
+python-dotenv>=1.0.0

For the remaining packages, apply the same pattern after verifying secure minimum versions.

🧹 Nitpick comments (1)
examples/python/cv-screening/Readme.md (1)

20-22: 💤 Low value

Add language identifier to code fence.

The code fence starting at line 20 lacks a language identifier, triggering a markdownlint warning (MD040). While this is ASCII art rather than code, specifying text or leaving it as triple-backticks with no syntax highlighting improves consistency.

📝 Proposed fix
-```
+```text
 PDF upload ──> Markdown (markitdown) ──> prompt fetched from Agenta ──> LLM ──> structured scores

ℹ️ Review info

⚙️ Run configuration

  • Configuration used: Organization UI
  • Review profile: CHILL
  • Plan: Pro Plus
  • Run ID: 7c1b7401-bc27-47de-b94d-d0d734c5558f

📥 Commits

Reviewing files that changed from the base of the PR and between aed2d47357cc8d88347011835c7cc1f3f7f08ea7 and c28d1a2dca9c1982a3b8885929de57447d80e256.

⛔ Files ignored due to path filters (4)

  • examples/python/cv-screening/data/sample_cvs/candidate_chef.pdf is excluded by !**/*.pdf
  • examples/python/cv-screening/data/sample_cvs/candidate_it_manager.pdf is excluded by !**/*.pdf
  • examples/python/cv-screening/data/sample_cvs/candidate_it_supervisor.pdf is excluded by !**/*.pdf
  • examples/python/cv-screening/data/testset.csv is excluded by !**/*.csv

📒 Files selected for processing (10)

  • examples/python/Readme.md
  • examples/python/cv-screening/.env.example
  • examples/python/cv-screening/Readme.md
  • examples/python/cv-screening/app.py
  • examples/python/cv-screening/config.py
  • examples/python/cv-screening/create_app.py
  • examples/python/cv-screening/data/.gitignore
  • examples/python/cv-screening/make_sample_pdfs.py
  • examples/python/cv-screening/prepare_testset.py
  • examples/python/cv-screening/requirements.txt

Comment thread examples/python/cv-screening/app.py Outdated
Comment thread examples/python/cv-screening/app.py
Comment thread examples/python/cv-screening/create_app.py
Comment thread examples/python/cv-screening/prepare_testset.py
Comment thread examples/python/cv-screening/prepare_testset.py Outdated
claude and others added 6 commits June 12, 2026 12:15
A walkthrough demo for classifying CVs against a job spec with Agenta:

- Curated test set of 30 real Markdown CVs (from the public
  opensporks/resumes dataset on Hugging Face, a mirror of the Kaggle
  Resume Dataset), hand-labeled against an IT Manager job spec
- prepare_testset.py rebuilds the CSV reproducibly and can upload it
  to Agenta via the SDK
- create_app.py creates the completion app with the screening prompt
  and structured-output JSON schema, and deploys it to production
- Streamlit demo UI: PDF upload -> Markdown (markitdown) -> prompt
  fetched from the Agenta registry -> structured score dashboard
- Sample CV PDFs (one per classification) generated from the test set

https://claude.ai/code/session_01YMbf4sUb2VBFQHGNKv6yh3

The Streamlit app now shows a thumbs up/down form with an optional
comment after each screening. Submitting it attaches the feedback to
the screening's trace in Agenta as an annotation (evaluator slug
'user-feedback'), following the capture-user-feedback cookbook:
the invocation link is captured inside the instrumented classify_cv
call and the annotation is POSTed to /api/simple/traces/.

Screening results now persist in session state so the result and
feedback form survive Streamlit reruns. Entry scripts load .env via
python-dotenv, matching the documented setup flow.
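
The session-state persistence described above can be sketched roughly as follows. The helper takes any dict-like store so it runs without Streamlit; in the app, Streamlit's `st.session_state` would be passed in. The `get_or_run_screening` name, the `"screening"` key, and the stored fields are assumptions for illustration, not the code in app.py.

```python
from typing import Any, Callable, Dict, MutableMapping


def get_or_run_screening(
    state: MutableMapping[str, Any],
    cv_markdown: str,
    screen: Callable[[str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Return the stored screening for this CV, running `screen` only when needed.

    Streamlit re-executes the whole script on every interaction, so the result
    is kept in `state` to stay visible while the feedback form is submitted.
    """
    cached = state["screening"] if "screening" in state else None
    if cached is None or cached["cv"] != cv_markdown:
        cached = {
            "cv": cv_markdown,
            "result": screen(cv_markdown),
            "feedback_sent": False,
        }
        state["screening"] = cached
    return cached


# Simulate two script reruns with a plain dict standing in for st.session_state.
calls = []


def fake_screen(cv: str) -> Dict[str, Any]:
    calls.append(cv)  # count how often the (expensive) screening runs
    return {"overall_match": True}


state: Dict[str, Any] = {}
first = get_or_run_screening(state, "cv text", fake_screen)
second = get_or_run_screening(state, "cv text", fake_screen)
print(len(calls))  # the screening ran once across both reruns
```

Keying the cache on the CV text means uploading a different CV triggers a fresh screening, while submitting feedback on the current one does not.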

https://claude.ai/code/session_01YMbf4sUb2VBFQHGNKv6yh3

…pt revision

Move all the AI logic out of the Streamlit app into a new screening.py
module (prompt fetch, the LLM call, tracing, feedback), leaving app.py as
a UI-only shell. Any other frontend can import screening.py unchanged.

Tracing improvements so screenings are easy to act on from the UI:

- Auto-instrument the OpenAI client with OpenInference, so every trace has
  a child LLM span with the exact messages, token counts, and cost.
- classify_cv takes its inputs as a dict whose keys match the prompt input
  variables ({"cv": ...}), and the prompt config is kept out of the trace
  (ignore_inputs). The span data then mirrors the completion app's inputs.
- Link each span to the deployed prompt revision via ag.tracing.store_refs,
  so traces filter by app/environment and open in the playground on the
  right revision with inputs pre-filled.

Also fix create_app.py to read variant.variant_version as an attribute
(VariantManager now returns a ConfigurationResponse, not a dict).

The walkthrough needed a leaner story: the output schema is now
tech_match / experience_match / overall_match, each with a short reason,
plus the missing-requirements list. overall_match is a holistic
hire-or-not judgment, so a requirement like a language can flip it while
the other two stay true. The test set drops the bookkeeping columns and
carries one expected_* column per dimension; empty cells are skipped by
the code evaluator documented in the Readme.

Every test set CV now speaks German (the company's working language); the
demo candidate (a strong IT manager with no German) is excluded from the
set so the walkthrough adds it as a new test case. Relabeled the borderline
rows to the model's stable consensus and dropped two bistable resumes, so
the before/after evaluation is deterministic (before 0.96, after 1.00, no
regressions). Verified the German flip is reliable when the requirement is
added at the top of the must-haves.

Adds setup_app.py: an idempotent setup that creates the app and a default
variant explicitly, always sets the completion URI, deploys with the
reference map, and archives stray auto-named variants. Adds
generate_traces.py to seed traces, and AGENTS.md documenting setup and the
SDK/endpoint gotchas. Feedback annotations now send a boolean score.
mmabrouk force-pushed the claude/cv-classifier-demo-oug3jb branch from 9cb02f6 to f3bf581 June 15, 2026 08:29
mmabrouk changed the title from "Add CV screening example with curated resume test set" to "feat(examples): CV screening demo with feedback-to-deploy walkthrough" Jun 15, 2026
mmabrouk marked this pull request as ready for review June 15, 2026 08:36
dosubot Bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files) and feature labels and removed the size:XL (This PR changes 500-999 lines, ignoring generated files) label Jun 15, 2026
mmabrouk changed the base branch from main to release/v0.103.5 June 15, 2026 08:38
mmabrouk requested a review from bekossy June 15, 2026 08:39
dosubot Bot added the lgtm (This PR has been approved by a maintainer) label Jun 15, 2026
bekossy merged commit 1d42627 into release/v0.103.5 Jun 15, 2026
10 checks passed
Labels

example, feature, lgtm (This PR has been approved by a maintainer), python (Pull requests that update Python code), size:XXL (This PR changes 1000+ lines, ignoring generated files)

3 participants