Can I convert from PDF to text preserving the original layout? #290

Closed Answered by mara004
gildofabregat asked this question in Q&A

pdfium provides range-based and bounds-based text extraction APIs. It can also tell you the position of individual characters on the page.
However, it does not expose APIs for layout analysis, such as detecting words, lines, and paragraphs/columns.

It should not mix up the two columns when extracting the text.
But it doesn't format the text output to visually reflect the original (in the sense of keeping the two columns side by side), i.e. it can't do what PyMuPDF's `python -m fitz gettext -mode layout` can. So the answer to your question is probably "no".
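
Since per-character positions are available, a rough layout-preserving dump could be approximated on the caller's side. The following is only a minimal sketch, not something pdfium or pypdfium2 ships: `render_layout`, `chars_from_pdf`, and the `char_width` / `line_height` defaults are hypothetical names and values chosen for illustration, and the helper assumes a pypdfium2 version whose text page offers `count_chars()`, `get_charbox()` and `get_text_range()`. It snaps each character to a monospaced grid, so it will be crude for proportional fonts, rotated text, or mixed font sizes.

```python
from typing import Dict, Iterable, List, Tuple

# (character, left, bottom, right, top) in PDF points, origin at the bottom-left
CharBox = Tuple[str, float, float, float, float]


def render_layout(chars: Iterable[CharBox], char_width: float = 6.0,
                  line_height: float = 12.0) -> str:
    """Place each character on a monospaced grid according to its position.

    char_width and line_height are assumed cell sizes, not values read
    from the PDF; they need tuning per document.
    """
    rows: Dict[int, Dict[int, str]] = {}
    for ch, left, bottom, _right, _top in chars:
        if ch.isspace():
            continue  # horizontal gaps come from positions, not space chars
        row = round(bottom / line_height)
        col = round(left / char_width)
        # If two characters land in the same cell, the later one wins.
        rows.setdefault(row, {})[col] = ch
    if not rows:
        return ""
    lines = []
    # PDF y grows upwards, so the highest row is the first line of text.
    for row in range(max(rows), min(rows) - 1, -1):
        cells = rows.get(row, {})
        width = max(cells) + 1 if cells else 0
        lines.append("".join(cells.get(c, " ") for c in range(width)))
    return "\n".join(lines)


def chars_from_pdf(path: str, page_index: int = 0) -> List[CharBox]:
    """Collect character boxes via pypdfium2 (assumed to be installed)."""
    import pypdfium2 as pdfium  # imported lazily so render_layout works without it

    pdf = pdfium.PdfDocument(path)
    textpage = pdf[page_index].get_textpage()
    out: List[CharBox] = []
    for i in range(textpage.count_chars()):
        left, bottom, right, top = textpage.get_charbox(i)
        out.append((textpage.get_text_range(i, 1), left, bottom, right, top))
    return out
```

Something like `print(render_layout(chars_from_pdf("input.pdf")))` (with a placeholder file name) would then print the two columns side by side, provided the assumed cell sizes roughly match the font used in the document.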
