feat: add pypdfium2 as optional PDF parser #298

Open
okwn wants to merge 12 commits into VectifyAI:main from okwn:add-pypdfium2-parser
okwn commented May 25, 2026

Summary

Add pypdfium2 as an optional PDF parser. It provides 3-5x faster parsing and cleaner text extraction (no broken words, correct Unicode handling).

Changes

  • Add pypdfium2 as optional PDF parser (lazy-imported, not required)
  • Make PageIndexClient parser-agnostic; pdf_parser is configurable per index() call
  • Move pdf_parser off the doc dict and pass it via call arguments
  • Centralize default parser as DEFAULT_PDF_PARSER constant
  • Keep pdf_parser default in code, not config.yaml
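
A minimal sketch of how a lazy-imported optional parser and a centralized default might fit together. This is not the PR's actual code: the value of `DEFAULT_PDF_PARSER`, the `"pypdf2"` name, and the helpers `resolve_parser_name` and `extract_pages_pypdfium2` are assumptions made for illustration. Only the pypdfium2 calls (`PdfDocument`, `get_textpage`, `get_text_range`) come from that library's public API.

```python
# Hypothetical default; the PR does not say which parser is the default.
DEFAULT_PDF_PARSER = "pypdf2"

# Hypothetical set of supported parser names.
SUPPORTED_PDF_PARSERS = ("pypdf2", "pypdfium2")


def resolve_parser_name(pdf_parser=None):
    """Return the parser name to use, falling back to the code-level default."""
    name = pdf_parser or DEFAULT_PDF_PARSER
    if name not in SUPPORTED_PDF_PARSERS:
        raise ValueError(
            f"Unknown pdf_parser {name!r}; expected one of {SUPPORTED_PDF_PARSERS}"
        )
    return name


def extract_pages_pypdfium2(path):
    """Extract text per page with pypdfium2, importing it only when called."""
    try:
        # Lazy import: users who never select this parser do not need it installed.
        import pypdfium2 as pdfium
    except ImportError as exc:
        raise ImportError(
            "pdf_parser='pypdfium2' requires the optional pypdfium2 package"
        ) from exc

    pdf = pdfium.PdfDocument(path)
    try:
        pages = []
        for i in range(len(pdf)):
            page = pdf[i]
            textpage = page.get_textpage()
            pages.append(textpage.get_text_range())
            # Release native handles explicitly.
            textpage.close()
            page.close()
        return pages
    finally:
        pdf.close()
```

Keeping the import inside the function means the dependency is only needed by users who select that parser, which matches the "optional, not required" intent stated above.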

Testing

Default behavior is unchanged. Users can opt in by passing pdf_parser="pypdfium2".
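
The opt-in behavior could be checked along these lines. This is a sketch with stand-in parsers, not the real `PageIndexClient`: `SketchClient`, the `PARSERS` registry, the `"default"` parser name, and the shape of the returned dict are all hypothetical.

```python
DEFAULT_PDF_PARSER = "default"  # hypothetical name for the existing parser


def _parse_default(path):
    # Stand-in for the existing parser.
    return [f"default:{path}"]


def _parse_pypdfium2(path):
    # Stand-in for the pypdfium2-backed parser.
    return [f"pypdfium2:{path}"]


PARSERS = {
    "default": _parse_default,
    "pypdfium2": _parse_pypdfium2,
}


class SketchClient:
    """Hypothetical client whose parser is chosen per index() call."""

    def index(self, path, pdf_parser=DEFAULT_PDF_PARSER):
        try:
            parse = PARSERS[pdf_parser]
        except KeyError:
            raise ValueError(f"Unknown pdf_parser {pdf_parser!r}") from None
        # The parser choice is a call argument and is not stored on the doc dict.
        return {"path": path, "pages": parse(path)}


client = SketchClient()
default_doc = client.index("a.pdf")
opt_in_doc = client.index("a.pdf", pdf_parser="pypdfium2")
```

Because the parser is a keyword argument with a code-level default, existing callers that omit it keep the old behavior.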

2 participants