Pdf Text Extractor

Ramón San Félix Ramón

1 month 3 weeks ago #539 by Ramón San Félix Ramón

Ramón San Félix Ramón created the code: Pdf Text Extractor

Today I would like to share a tool that I use before resorting to OCR: a simple function that extracts text from a PDF. I mean those PDFs where the text can be selected and copied, not the ones that are just images.

If you ever need to extract text from this type of PDF, you can do it with iTextSharp and its PdfTextExtractor class.

In the example I have created two functions. The first, of_pdftotxt, converts a PDF into a text file. The second, of_pdftoblob, returns the contents of the PDF as a blob. With this second function you can call FileWrite from PowerBuilder to write the blob to disk and get the same result as the first function.
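For readers who want to see the general shape of such a library, here is a minimal C# sketch of the two variants. It is not the code from the repository: the class and method names (PdfTextSketch, ExtractText, PdfToTxt, PdfToBlob) are hypothetical, and UTF-8 output is an assumption. The only iTextSharp 5 calls it relies on are PdfReader, NumberOfPages, Close and PdfTextExtractor.GetTextFromPage.

```csharp
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical sketch, not the author's implementation.
public static class PdfTextSketch
{
    // Reads every page and concatenates the text layer of each one.
    // Image-only PDFs have no text layer and yield little or no text.
    public static string ExtractText(string pdfPath)
    {
        var text = new StringBuilder();
        var reader = new PdfReader(pdfPath);
        try
        {
            // iTextSharp page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
            }
        }
        finally
        {
            reader.Close();
        }
        return text.ToString();
    }

    // Counterpart of a "PDF to blob" function: the extracted text as bytes.
    // UTF-8 is an assumption; the real library may use another encoding.
    public static byte[] PdfToBlob(string pdfPath)
    {
        return Encoding.UTF8.GetBytes(ExtractText(pdfPath));
    }

    // Counterpart of a "PDF to text file" function: writes the same bytes to disk.
    public static void PdfToTxt(string pdfPath, string txtPath)
    {
        File.WriteAllBytes(txtPath, PdfToBlob(pdfPath));
    }
}
```

Because PdfToTxt only writes what PdfToBlob returns, saving the blob yourself produces the same file, which is the relationship between the two functions described above.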

As a side note, I also tried to build the library with iText7 but, as with other examples, it gives an error when the functions are executed from the library. However, if I create a console program and test the same functions there, they run correctly. The same thing happened to me with the Digital Signature example and with the Filling in fields in a PDF form example. I don't know if I'm doing something wrong in iText7 or if I'm missing some reference. The fact is that with iTextSharp the functions work correctly both ways.

In the Visual Studio example I have left the iText7 code commented out next to the iTextSharp code, in case anyone dares to solve the mystery...

I am attaching a copy of the project as it stands today, but I always recommend downloading it from GitHub in case there is an update:

github.com/rasanfe/pbPdfExtractor

github.com/rasanfe/PdfExtractor

To keep up with what I publish, you can follow my blog, which is in Spanish:

rsrsystem.blogspot.com

This message has an attachment file.