
fix(deps): pin markitdown >=0.1.5 with narrow extras (closes #64) #65

Merged: KylinMountain merged 1 commit into main from fix/pptx-chart-na-issue-64 on May 24, 2026

Conversation

@KylinMountain
Collaborator

Summary

  • Replace markitdown[all] with markitdown[docx,pptx,xlsx,xls]>=0.1.5. This drops ~67 MB of unused deps (azure-*, pdfminer, pdfplumber, speechrecognition, youtube-transcript-api, pydub, olefile; xlrd is removed along with [all] and then re-added by the [xls] extra) and forces users off pre-0.1.0 markitdown.
  • Add .xls to SUPPORTED_EXTENSIONS / _SHORT_DOC_TYPES so the [xls] extra is actually reachable.

Why this closes #64

The reporter's traceback (_markitdown.py:1715) is from the pre-0.1.0 monolithic markitdown, where a ValueError from python-pptx's CT_StrVal_NumVal_Composite.value (bare float(self.v.text) on #N/A / #DIV/0! / blank cells) propagates straight out and aborts the whole openkb add run.

From markitdown 0.1.2 onward, _convert_chart_to_markdown is wrapped in except Exception and the offending chart degrades to [unsupported chart]. Pinning >=0.1.5 guarantees users land on a version with that handler.
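
The difference between the two behaviours can be sketched without python-pptx or markitdown installed. This is an illustration under stated assumptions, not the libraries' actual code: `parse_cell` and `chart_to_markdown` are hypothetical stand-ins for the bare `float(...)` call and the `except Exception` wrapper described above, and the table layout is made up.

```python
def parse_cell(text: str) -> float:
    """Stand-in for the bare float(...) call on a chart cell's text."""
    # float() raises ValueError for "#N/A", "#DIV/0!" and "".
    return float(text)


def chart_to_markdown(cells: list[str]) -> str:
    """Stand-in for a converter that guards the chart parse."""
    try:
        values = [parse_cell(cell) for cell in cells]
    except Exception:
        # One bad cell degrades the whole chart instead of aborting the run.
        return "[unsupported chart]"
    return "| " + " | ".join(str(v) for v in values) + " |"
```

Calling `parse_cell("#N/A")` directly raises `ValueError`, which is the pre-0.1.0 failure mode; `chart_to_markdown(["1", "#N/A"])` returns the placeholder instead.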

Caveat

This is an "unblock the pipeline" fix, not a chart-data-preservation fix: when a chart contains a bad cell, the whole chart degrades to [unsupported chart], so its valid numeric cells are lost too. The root cause sits in python-pptx and there is no upstream fix or open PR there; we considered a monkey-patch but dropped it as too invasive. A higher-fidelity pptx path (e.g. LibreOffice → PDF → existing PDF pipeline) behind a config flag is a reasonable follow-up if users hit this in practice.

Test plan

  • uv run --extra dev pytest → 520 passed
  • uv sync after the dep change cleanly removes 17 packages, then re-adds xlrd for [xls]
  • Manual: from openkb.converter import MarkItDown still imports
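
As an extra manual check, the installed markitdown version could be compared against the 0.1.5 floor. The helper below is a hypothetical sketch, not part of this PR: it compares only the numeric release components and ignores pre-release suffixes, so `0.1.5b1` would count as `0.1.5`.

```python
import re


def version_at_least(installed: str, minimum: str) -> bool:
    """Compare dotted version strings by their leading numeric components."""

    def parse(version: str) -> list[int]:
        parts = []
        for part in version.split("."):
            match = re.match(r"\d+", part)
            # Components with no leading digits (e.g. "dev0") count as 0.
            parts.append(int(match.group()) if match else 0)
        return parts

    a, b = parse(installed), parse(minimum)
    # Pad the shorter list with zeros so "0.2" compares like "0.2.0".
    width = max(len(a), len(b))
    a += [0] * (width - len(a))
    b += [0] * (width - len(b))
    return a >= b
```

In an environment where the package is installed, the first argument could come from `importlib.metadata.version("markitdown")`.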

The bare `markitdown[all]` pulled ~67MB of unused deps (azure-*, pdfminer, pdfplumber, speechrecognition, youtube-transcript-api, pydub, xlrd, olefile) and let users land on pre-0.1.0 markitdown where pptx chart parse errors (`#N/A`, `#DIV/0!`, blanks → `ValueError` from python-pptx `CT_StrVal_NumVal_Composite.value`) propagate out and abort the whole conversion.

Switch to `markitdown[docx,pptx,xlsx,xls]>=0.1.5` so we only install Office-format extras we actually use, and we always run against a version whose `_convert_chart_to_markdown` wraps the python-pptx call in `except Exception` — the offending chart degrades to `[unsupported chart]` instead of killing the file. Also add `.xls` to `SUPPORTED_EXTENSIONS` / `_SHORT_DOC_TYPES` so the `[xls]` extra has a code path that exercises it.

Chart numeric data is still lost on bad cells (upstream limitation in python-pptx — no fix or open PR there). A higher-fidelity pptx path can be added behind a config flag if users need it.
@KylinMountain
Collaborator Author

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@KylinMountain KylinMountain merged commit 51b6d4f into main May 24, 2026
1 check passed

Development

Successfully merging this pull request may close these issues.

Error converting pptx file with charts