fix(deps): pin markitdown >=0.1.5 with narrow extras (closes #64)#65
Merged
Conversation
The bare `markitdown[all]` pulled ~67MB of unused deps (azure-*, pdfminer, pdfplumber, speechrecognition, youtube-transcript-api, pydub, xlrd, olefile) and let users land on pre-0.1.0 markitdown where pptx chart parse errors (`#N/A`, `#DIV/0!`, blanks → `ValueError` from python-pptx `CT_StrVal_NumVal_Composite.value`) propagate out and abort the whole conversion. Switch to `markitdown[docx,pptx,xlsx,xls]>=0.1.5` so we only install Office-format extras we actually use, and we always run against a version whose `_convert_chart_to_markdown` wraps the python-pptx call in `except Exception` — the offending chart degrades to `[unsupported chart]` instead of killing the file. Also add `.xls` to `SUPPORTED_EXTENSIONS` / `_SHORT_DOC_TYPES` so the `[xls]` extra has a code path that exercises it. Chart numeric data is still lost on bad cells (upstream limitation in python-pptx — no fix or open PR there). A higher-fidelity pptx path can be added behind a config flag if users need it.
Collaborator
Author
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. 🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
markitdown[all]withmarkitdown[docx,pptx,xlsx,xls]>=0.1.5— drops ~67 MB of unused deps (azure-*, pdfminer, pdfplumber, speechrecognition, youtube-transcript-api, pydub, xlrd-then-readded, olefile) and forces users off pre-0.1.0 markitdown..xlstoSUPPORTED_EXTENSIONS/_SHORT_DOC_TYPESso the[xls]extra is actually reachable.Why this closes #64
The reporter's traceback (
_markitdown.py:1715) is from the pre-0.1.0 monolithic markitdown, where aValueErrorfrom python-pptx'sCT_StrVal_NumVal_Composite.value(barefloat(self.v.text)on#N/A/#DIV/0!/ blank cells) propagates straight out and aborts the wholeopenkb addrun.From markitdown 0.1.2 onward,
_convert_chart_to_markdownis wrapped inexcept Exceptionand the offending chart degrades to[unsupported chart]. Pinning>=0.1.5guarantees users land on a version with that handler.Caveat
This is an "unblock the pipeline" fix, not a chart-data-preservation fix — the chart's other (numeric) cells are still lost on bad cells. The root cause sits in python-pptx and there is no upstream fix or open PR there; we considered a monkey-patch but dropped it as too invasive. A higher-fidelity pptx path (e.g. LibreOffice → PDF → existing PDF pipeline) behind a config flag is a reasonable follow-up if users hit this in practice.
Test plan
uv run --extra dev pytest→ 520 passeduv syncafter the dep change cleanly removes 17 packages, then re-addsxlrdfor[xls]from openkb.converter import MarkItDownstill imports