Tess4j separate pages from pdf with multiple pages in Java Application

I have added Tess4j to my Java application and everything works well: my PDF document with more than 50 pages is recognized correctly and written to a text document as a string.
The problem is: how can I mark the end of each page of the PDF file in the text document, for example with a special string such as "++##- END--##++" that does not occur anywhere in the PDF?
Is this even possible with Tess4j?
Thank you

If you use the createDocuments method or the low-level TessResultRenderer API, the output text file will contain page-separator characters; the default separator is the form feed character (FF).
https://tesseract-ocr.github.io/tessdoc/FAQ.html#what-page-separators-are-used-in-txt-output-by-tesseract-400
http://tess4j.sourceforge.net/docs
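Once the text file contains form feed separators, swapping them for a custom marker is plain string processing. The following is a minimal sketch, not part of Tess4J: the class name PageMarkers and its methods are hypothetical, and it assumes the OCR output has already been read into a String and that the separator is still the default form feed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PageMarkers {
    // Form feed (U+000C): assumed to be the page separator in the OCR output.
    public static final String FORM_FEED = "\f";

    // Splits the OCR text into pages at every form feed.
    public static List<String> splitPages(String ocrText) {
        // Limit -1 keeps trailing empty strings so they can be handled explicitly.
        List<String> pages = new ArrayList<>(Arrays.asList(ocrText.split(FORM_FEED, -1)));
        // A separator after the last page leaves one empty trailing element; drop it.
        if (!pages.isEmpty() && pages.get(pages.size() - 1).isEmpty()) {
            pages.remove(pages.size() - 1);
        }
        return pages;
    }

    // Rebuilds the text with the marker on its own line after each page.
    public static String markPageEnds(String ocrText, String marker) {
        StringBuilder out = new StringBuilder();
        for (String page : splitPages(ocrText)) {
            out.append(page);
            if (!page.endsWith("\n")) {
                out.append('\n');
            }
            out.append(marker).append('\n');
        }
        return out.toString();
    }
}
```

Calling PageMarkers.markPageEnds(text, "++##- END--##++") on the recognized text would then put the marker after each page, whether the separator appears between pages or after every page.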

Related

Fastest Way to Read Number of Pages of Docx Files in Java (After Word Rendering)?

I create docx files using docx4j. After document creation I need to know the number of pages.
I know that I can read the persisted number of pages using docx4j as follows:
final WordprocessingMLPackage doc = ... // read doc
org.docx4j.openpackaging.parts.DocPropsExtendedPart docPropsExtendedPart = doc.getDocPropsExtendedPart();
org.docx4j.docProps.extended.Properties extendedProps = docPropsExtendedPart.getJaxbElement();
final Integer pages = extendedProps.getPages();
but it always returns 1 because that's the number of pages that is persisted to the docx file. Apache POI obviously returns the same result (XWPFDocument returning 1 number of pages for docx file).
When you open the document with Word you can notice that the number of pages is steadily updated for the first few seconds (which confirms that the initial number of pages is 1 and Word updates it dynamically after applying the styles etc. that influence the number of pages).
I read that you can convert it to PDF first and then read the number of pages. The sample PDF conversion provided by docx4j on github uses a commercial PDF converter so I cannot reuse this code. Furthermore, converting it to PDF first seems cumbersome and unnecessarily time-consuming to me.
Question: What is the fastest way to read the number of pages of docx files in Java without using commercial software?
//Edit:
The question "Number of pages in a word doc in java" unfortunately doesn't help me. As I wrote above, Apache POI (and other APIs) only read the persisted number of pages (which is 1). What I need is the actual number of pages as Word would display it when you open the file.

You could use documents4j (though this needs Word); see https://www.docx4java.org/blog/2020/03/documents4j-for-toc-update/
You may need to customise word_convert.vbs a little more.

PDFBox: convert PDF to text including chapter headline information

I am currently working on a project to extract the content of PDF files and search for certain keywords in them.
For extracting the content I am using PDFBox and this works fine.
The problem I now have encountered is that I want to be able to search for certain keywords only within chapter headlines.
At the moment my code for extracting looks like this:
PDDocument doc = PDDocument.load(pdfFile);
String text = new PDFTextStripper().getText(doc);
doc.close();
This only extracts the raw text of the file, with no information about headlines. I was not able to figure out how to use PDFBox to include such information. So I am not sure if this is even possible.
Does anybody have experience with this tool and can tell me whether this is possible with PDFBox and, if so, how to achieve it?
Kind regards

How to test the PDF generated using iText?

I have used iText in Java to convert HTML to PDF.
Now I want to test whether the generated PDF is correct, i.e. that all the contents are correct and in the correct positions.
Is there a way to test my code?

Basically, your question is about validating iText output.
If you do not trust the library that converts HTML to PDF, you probably do not trust it to read raw PDF data either. You can therefore use another library (PDF Clown) to parse the PDF as validation.
There are two approaches.
The first requires rasterizing the PDF (Ghostscript) and comparing the result to the HTML. The performance overhead is significant.
The second parses the document format. I went into depth on this in a previous answer about searching for text inside a PDF file,
where I covered searching for text as well as finding its position on the page.
I would suggest simply not validating the output unless you know something is wrong.
These libraries are widely used and well tested.

Two fields (out of 18) in a PDF form don't appear after filling and flattening with iText

I'm using the latest version of iText (5.5.0) to fill a pdf form and flatten it.
For some reason, when I fill the form using an FDF, everything works except for special characters like 'ç'. When I use an XFDF, the special characters appear but two fields do not ('comment' and 'datelicFormatted').
-> see https://upl.cases.lu/?action=d&id=82085749396826225834 for the template, the XFDF and a result
The form was created by converting a Word document and adding text fields with Acrobat 10.
The really strange part is when I ask not to flatten it: the fields contain the right values, they just seem to vanish when flattened.
Thank you for any help you could provide.

Your template uses the font Arial Unicode MS, which iText can interpret correctly only with an extra resource (the CMap UniKS-UTF16-H). By design, the ordinary iText JARs do not contain such resources. There is, however, one more iText JAR, itext-asia.jar, that contains the CMap resources. Just add itext-asia.jar to your classpath and the problematic fields will be flattened correctly. You can download itext-asia.jar at http://sourceforge.net/projects/itext/files/extrajars/extrajars-2.3.zip/download

Convert Word Document to PDF using Java

I want to convert an MS Word document to a PDF file using POI.jar (to read the MS Word content) and Itext.jar (to create the PDF file).
For plain text in MS Word, I am able to convert it to PDF. But I have a few images in the Word document, and I want to put those images in the PDF as well.
Could someone please help me out?

You're in luck: I just stumbled upon JODConverter. It uses OpenOffice to convert documents from Java and is very easy to use.

There is no free solution for this; you would have to buy something like the Aspose components. Alternatively, you can save the Word document as HTML and use any of the available HTML-to-PDF tools to convert it to PDF from Java. One of them is wkhtmltopdf.
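The HTML route can be driven from Java by launching wkhtmltopdf as an external process. This is only a rough sketch: it assumes the Word document has already been saved as HTML, that the wkhtmltopdf executable is installed and on the PATH, and that the basic "wkhtmltopdf input output" invocation is enough. The class HtmlToPdf and its method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class HtmlToPdf {
    // Builds the command line: wkhtmltopdf <input.html> <output.pdf>
    public static List<String> buildCommand(Path html, Path pdf) {
        return Arrays.asList("wkhtmltopdf", html.toString(), pdf.toString());
    }

    // Runs wkhtmltopdf and returns its exit code (0 normally means success).
    public static int convert(Path html, Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(html, pdf))
                .inheritIO() // forward the tool's console output to this process
                .start();
        return process.waitFor();
    }
}
```

Whether images survive the conversion depends on how the HTML export references them, so the result is worth checking on a sample document first.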
