feat(examples): CV screening demo with feedback-to-deploy walkthrough #4607
📝 Walkthrough
This PR introduces a complete, production-ready CV screening example for the Python SDK. It includes shared configuration with a structured JSON schema for classification results, scripts to prepare a curated test set from external resume data, an Agenta deployment script, and an interactive Streamlit demo that fetches prompts from Agenta, runs LLM screening, and collects user feedback as trace annotations.
Sequence Diagram(s)
sequenceDiagram
participant User
participant StreamlitApp as Streamlit App
participant Agenta
participant OpenAI
User->>StreamlitApp: Upload CV PDF
StreamlitApp->>StreamlitApp: Convert PDF to Markdown
StreamlitApp->>Agenta: Fetch production prompt config
Agenta-->>StreamlitApp: Return prompt + LLM config
User->>StreamlitApp: Click "Screen CV" button
StreamlitApp->>OpenAI: Call chat completion<br/>with prompt + schema
OpenAI-->>StreamlitApp: Return structured JSON<br/>(scores, requirements, classification)
StreamlitApp->>StreamlitApp: Render classification banner<br/>+ score metrics + requirements
StreamlitApp->>Agenta: Capture trace invocation ID
User->>StreamlitApp: Submit feedback<br/>(thumbs up/down + comment)
StreamlitApp->>Agenta: POST feedback as trace annotation
Agenta-->>StreamlitApp: Success response (200/202)
StreamlitApp-->>User: Show feedback confirmation
Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/python/cv-screening/requirements.txt (1)
1-19: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
Pin dependency versions to avoid pulling vulnerable packages.
The requirements file specifies no version constraints, which means `pip install` will fetch the latest versions of all packages and their transitive dependencies. OSV Scanner has flagged numerous critical and high-severity vulnerabilities in transitive dependencies that could be pulled in, including:
- aiohttp: 23 CRITICAL issues (SSRF, header injection, DoS, credential leaks)
- gitpython: 9 CRITICAL issues (RCE, path traversal, arbitrary code execution)
- litellm: 13 CRITICAL issues (SSTI, SQL injection, SSRF, eval-based RCE)
- pillow: 6 CRITICAL issues (arbitrary code execution, buffer overflow, DoS)
- pyarrow: 3 CRITICAL issues (arbitrary code execution)
While this is example code, users may run it in environments connected to real data or networks. Unpinned dependencies create a supply-chain risk.
🔒 Recommendation
Generate a pinned `requirements.txt` by running:

    pip install -r requirements.txt
    pip freeze > requirements.txt

Then review the frozen versions and update any packages flagged by `pip-audit` or OSV Scanner. Alternatively, specify minimum safe versions inline:

    # Agenta SDK + LLM client
    -agenta
    -openai
    -python-dotenv
    +agenta>=0.28.0
    +openai>=1.0.0
    +python-dotenv>=1.0.0

For the remaining packages, apply the same pattern after verifying secure minimum versions.
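A small check over `requirements.txt` is one way to catch unconstrained entries before review. The sketch below is an illustration and not part of the PR: the helper name `find_unpinned` and the sample lines are hypothetical, and it counts any common version operator (including a minimum such as `>=`) as a constraint.

```python
import re

# Operators that count as a version constraint for this check.
_SPECIFIER = re.compile(r"(===|==|~=|!=|>=|<=|>|<)")


def find_unpinned(lines):
    """Return requirement names that carry no version constraint."""
    unpinned = []
    for raw in lines:
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        # Skip blank lines and pip options such as "-r other.txt".
        if not line or line.startswith("-"):
            continue
        # Ignore environment markers after ";" when looking for operators.
        requirement = line.split(";", 1)[0].strip()
        if not _SPECIFIER.search(requirement):
            unpinned.append(requirement)
    return unpinned


# Hypothetical sample mirroring the first entries discussed above.
sample = [
    "# Agenta SDK + LLM client",
    "agenta>=0.28.0",
    "openai",
    "python-dotenv  # loads .env",
    "",
]
print(find_unpinned(sample))  # ['openai', 'python-dotenv']
```

A check like this only reports missing constraints; it does not verify that a given minimum version is free of known vulnerabilities, which still needs `pip-audit` or OSV Scanner.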
🧹 Nitpick comments (1)
examples/python/cv-screening/Readme.md (1)
20-22: 💤 Low value
Add language identifier to code fence.
The code fence starting at line 20 lacks a language identifier, triggering a markdownlint warning (MD040). While this is ASCII art rather than code, specifying `text` or leaving it as triple-backticks with no syntax highlighting improves consistency.
📝 Proposed fix

    -```
    +```text
     PDF upload ──> Markdown (markitdown) ──> prompt fetched from Agenta ──> LLM ──> structured scores

ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: `7c1b7401-bc27-47de-b94d-d0d734c5558f`
📥 Commits
Reviewing files that changed from the base of the PR and between aed2d47357cc8d88347011835c7cc1f3f7f08ea7 and c28d1a2dca9c1982a3b8885929de57447d80e256.
⛔ Files ignored due to path filters (4)
* `examples/python/cv-screening/data/sample_cvs/candidate_chef.pdf` is excluded by `!**/*.pdf`
* `examples/python/cv-screening/data/sample_cvs/candidate_it_manager.pdf` is excluded by `!**/*.pdf`
* `examples/python/cv-screening/data/sample_cvs/candidate_it_supervisor.pdf` is excluded by `!**/*.pdf`
* `examples/python/cv-screening/data/testset.csv` is excluded by `!**/*.csv`
📒 Files selected for processing (10)
* `examples/python/Readme.md`
* `examples/python/cv-screening/.env.example`
* `examples/python/cv-screening/Readme.md`
* `examples/python/cv-screening/app.py`
* `examples/python/cv-screening/config.py`
* `examples/python/cv-screening/create_app.py`
* `examples/python/cv-screening/data/.gitignore`
* `examples/python/cv-screening/make_sample_pdfs.py`
* `examples/python/cv-screening/prepare_testset.py`
* `examples/python/cv-screening/requirements.txt`
A walkthrough demo for classifying CVs against a job spec with Agenta:
- Curated test set of 30 real Markdown CVs (from the public opensporks/resumes dataset on Hugging Face, a mirror of the Kaggle Resume Dataset), hand-labeled against an IT Manager job spec
- prepare_testset.py rebuilds the CSV reproducibly and can upload it to Agenta via the SDK
- create_app.py creates the completion app with the screening prompt and structured-output JSON schema, and deploys it to production
- Streamlit demo UI: PDF upload -> Markdown (markitdown) -> prompt fetched from the Agenta registry -> structured score dashboard
- Sample CV PDFs (one per classification) generated from the test set
https://claude.ai/code/session_01YMbf4sUb2VBFQHGNKv6yh3
The Streamlit app now shows a thumbs up/down form with an optional comment after each screening. Submitting it attaches the feedback to the screening's trace in Agenta as an annotation (evaluator slug 'user-feedback'), following the capture-user-feedback cookbook: the invocation link is captured inside the instrumented classify_cv call and the annotation is POSTed to /api/simple/traces/.
Screening results now persist in session state so the result and feedback form survive Streamlit reruns.
Entry scripts load .env via python-dotenv, matching the documented setup flow.
https://claude.ai/code/session_01YMbf4sUb2VBFQHGNKv6yh3
…pt revision
Move all the AI logic out of the Streamlit app into a new screening.py
module (prompt fetch, the LLM call, tracing, feedback), leaving app.py as
a UI-only shell. Any other frontend can import screening.py unchanged.
Tracing improvements so screenings are easy to act on from the UI:
- Auto-instrument the OpenAI client with OpenInference, so every trace has
a child LLM span with the exact messages, token counts, and cost.
- classify_cv takes its inputs as a dict whose keys match the prompt input
variables ({"cv": ...}), and the prompt config is kept out of the trace
(ignore_inputs). The span data then mirrors the completion app's inputs.
- Link each span to the deployed prompt revision via ag.tracing.store_refs,
so traces filter by app/environment and open in the playground on the
right revision with inputs pre-filled.
Also fix create_app.py to read variant.variant_version as an attribute
(VariantManager now returns a ConfigurationResponse, not a dict).
The walkthrough needed a leaner story: the output schema is now tech_match / experience_match / overall_match, each with a short reason, plus the missing-requirements list. overall_match is a holistic hire-or-not judgment, so a requirement like a language can flip it while the other two stay true. The test set drops the bookkeeping columns and carries one expected_* column per dimension; empty cells are skipped by the code evaluator documented in the Readme.
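The leaner output shape described above can be sketched as a JSON Schema plus a minimal structural check. This is an illustration under stated assumptions, not the schema from `config.py`: the `*_match` and `missing_requirements` names come from the description, while the `*_reason` field names, the flat layout, and the helpers `build_schema` and `check_result` are hypothetical.

```python
DIMENSIONS = ("tech", "experience", "overall")


def build_schema():
    """Build a JSON Schema with a boolean and a short reason per dimension."""
    properties = {}
    for name in DIMENSIONS:
        properties[f"{name}_match"] = {"type": "boolean"}
        # Assumed field name for the short reason; the real schema may differ.
        properties[f"{name}_reason"] = {"type": "string"}
    properties["missing_requirements"] = {
        "type": "array",
        "items": {"type": "string"},
    }
    return {
        "type": "object",
        "properties": properties,
        "required": list(properties),
        "additionalProperties": False,
    }


def check_result(result, schema):
    """Minimal top-level check; returns a list of problems (empty if valid)."""
    type_map = {"boolean": bool, "string": str, "array": list}
    problems = []
    for key in schema["required"]:
        if key not in result:
            problems.append(f"missing field: {key}")
    for key, value in result.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unexpected field: {key}")
        elif not isinstance(value, type_map[spec["type"]]):
            problems.append(f"wrong type for {key}")
    return problems


# A requirement such as a language flips overall_match while the other two stay true.
example = {
    "tech_match": True,
    "tech_reason": "Covers the required stack.",
    "experience_match": True,
    "experience_reason": "Has led IT teams.",
    "overall_match": False,
    "overall_reason": "No German, the company's working language.",
    "missing_requirements": ["Fluent German"],
}
print(check_result(example, build_schema()))  # []
```

The check only inspects top-level fields and types; a full validator would also verify the array items.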
Every test set CV now speaks German (the company's working language); the demo candidate (a strong IT manager with no German) is excluded from the set so the walkthrough adds it as a new test case.
Relabeled the borderline rows to the model's stable consensus and dropped two bistable resumes, so the before/after evaluation is deterministic (before 0.96, after 1.00, no regressions). Verified the German flip is reliable when the requirement is added at the top of the must-haves.
Adds setup_app.py: an idempotent setup that creates the app and a default variant explicitly, always sets the completion URI, deploys with the reference map, and archives stray auto-named variants. Adds generate_traces.py to seed traces, and AGENTS.md documenting setup and the SDK/endpoint gotchas.
Feedback annotations now send a boolean score.
9cb02f6 to f3bf581
What this is
A complete CV screening example that demonstrates the core Agenta loop: a prompt screens a candidate, the result lands in observability as a trace, a human gives feedback on it, and that feedback drives a prompt change that you evaluate and deploy. The prompt scores a CV against an IT Manager job spec on three booleans with reasons (`tech_match`, `experience_match`, `overall_match`) plus the list of missing requirements.

A recruiter uses a small Streamlit app (PDF upload to Markdown to the prompt fetched from the registry). An AI engineer works the prompt, the test set, and the evaluations in Agenta. The two sides stay split: `screening.py` owns the AI logic and tracing, `app.py` is only UI.

The walkthrough
The example is built around one story: the company's working language is German, but the prompt's job spec never says so.
- … `user-feedback` annotation, opens the bad trace, and opens its span in the playground. It lands on the exact prompt revision with the CV pre-filled.
- … `Fluent German (the company's working language)` to the must-have requirements and rerun. `overall_match` flips to `false` and German shows up in `missing_requirements`, while `tech_match` and `experience_match` stay `true`.
- … `expected_overall_match = false`, leaving the other two expected columns empty.

Walkthrough video
What's in the example
- `config.py`: the job spec, the prompt, and the structured-output JSON schema.
- `prepare_testset.py`: builds `data/testset.csv` (27 curated CVs from the public `opensporks/resumes` dataset, all German speakers) and can upload it to Agenta. The demo candidate is deliberately excluded so the walkthrough adds it.
- `setup_app.py`: idempotent setup. Creates the app and a `default` variant, sets the completion URI on every revision, deploys with the environment reference map, and archives any stray auto-named variant.
- `generate_traces.py`: seeds screening traces (no feedback) before a demo.
- `screening.py`, `app.py`: the AI logic and the Streamlit UI.
- `make_sample_pdfs.py`, `data/sample_cvs/`: four sample CV PDFs, including the no-German demo candidate and a German-speaking strong match.
- `AGENTS.md`: setup steps and the SDK/endpoint gotchas found while building this (completion URI, variant naming, deploy reference map).

Notes
- … `expected_*` column is in `Readme.md`. Test it with a real evaluation run, not the evaluator test panel, which does not pass test set columns (AGE-3825).
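For orientation, a per-column comparison that skips empty expected cells could look roughly like the sketch below. It is not the evaluator from `Readme.md`: the function name `score_row`, its signature, the `expected_tech_match` and `expected_experience_match` column names, and the `"true"`/`"false"` cell encoding are assumptions made for illustration.

```python
# Assumed mapping from test set columns to output fields; only
# expected_overall_match is named explicitly in the description.
EXPECTED_COLUMNS = {
    "expected_tech_match": "tech_match",
    "expected_experience_match": "experience_match",
    "expected_overall_match": "overall_match",
}


def score_row(row, output):
    """Return the fraction of non-empty expected cells the output matches.

    Returns None when every expected cell is empty, so the row can be
    left out of an aggregate instead of counting as a pass or a fail.
    """
    checked = 0
    matched = 0
    for expected_col, output_key in EXPECTED_COLUMNS.items():
        cell = (row.get(expected_col) or "").strip().lower()
        if not cell:
            continue  # empty cell: this dimension is not judged for this row
        checked += 1
        expected = cell == "true"
        if output.get(output_key) is expected:
            matched += 1
    if checked == 0:
        return None
    return matched / checked


# The demo candidate's row: only overall_match carries an expectation.
row = {
    "expected_tech_match": "",
    "expected_experience_match": "",
    "expected_overall_match": "false",
}
before = {"tech_match": True, "experience_match": True, "overall_match": True}
after = {"tech_match": True, "experience_match": True, "overall_match": False}
print(score_row(row, before), score_row(row, after))  # 0.0 1.0
```

Skipping empty cells means a new test case can pin down just the dimension the feedback was about, without forcing labels for the other two.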