Extract textual data from a PDF

I'm using a Java program to extract textual data from a PDF.
When I use this type of PDF I have no problem:
But when I use this type, the extraction is not performed:
Do you have any idea how to resolve this problem?

Try using iText 7 and the following code:
File inputFile = new File("path_to_your_pdf");
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
String text = PdfTextExtractor.getTextFromPage(pdfDocument.getPage(1));
pdfDocument.close();
Let us know what the output is, and whether it corresponds to what you'd expect.
As @mkl points out, this may simply be the difference between extracting form fields or not. In any case, links to your PDFs would be much appreciated, as well as some code.
You can of course extract both using iText.
Reading material:
http://developers.itextpdf.com/content/itext-7-examples/itext-7-form-examples
http://developers.itextpdf.com/content/itext-7-examples/itext-7-content-extraction-and-redaction

Related

Replacing text in XWPFParagraph without changing format of the docx file

I am developing a font converter app that converts Unicode font text to Krutidev/Shree Lipi (Marathi/Hindi) font text. The original docx file contains formatted words (color, font, text size, hyperlinks, etc.).
I want the final docx to keep the same formatting as the original docx after the words are converted from Unicode to the other font.
PFA.
Here is my code:
try {
fileInputStream = new FileInputStream("StartDoc.docx");
document = new XWPFDocument(fileInputStream);
XWPFWordExtractor extractor = new XWPFWordExtractor(document);
List<XWPFParagraph> paragraph = document.getParagraphs();
Converter data = new Converter() ;
for(XWPFParagraph p :document.getParagraphs())
{
for(XWPFRun r :p.getRuns())
{
String string2 = r.getText(0);
data.uniToShree(string2);
r.setText(string2,0);
}
}
//Write the Document in file system
FileOutputStream out = new FileOutputStream(new File("Output.docx"));
document.write(out);
out.close();
System.out.println("Output.docx written successfully");
}
catch (IOException e) {
System.out.println("We had an error while reading the Word Doc");
}

Thank you for the question.
I worked with POI some years ago, though on Excel workbooks, so I'll try to help you reach the root cause of your error.
The exception itself already carries useful debugging information.
A good first step to narrow down the error is to not replace the exception's own message with a fixed string.
Try printing the result of e.getLocalizedMessage() or e.getMessage() and see what you get.
Printing the stack trace with the printStackTrace method is also often useful to pinpoint where the error lies.
Share your findings from the above method calls so that we can help you debug the issue further.
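As a minimal sketch of the advice above, the catch block can report the exception itself rather than a fixed message. The class name ExceptionDiagnostics and the describe helper are hypothetical names used only for illustration; StartDoc.docx is the file name from the question, and only standard JDK classes are used.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

public class ExceptionDiagnostics {

    // Hypothetical helper: returns the exception message followed by the full stack trace.
    public static String describe(Throwable t) {
        StringWriter buffer = new StringWriter();
        PrintWriter writer = new PrintWriter(buffer);
        t.printStackTrace(writer);
        writer.flush();
        return "Message: " + t.getMessage() + System.lineSeparator()
                + "Stack trace:" + System.lineSeparator()
                + buffer.toString();
    }

    public static void main(String[] args) {
        // "StartDoc.docx" is the input file name used in the question.
        try (FileInputStream in = new FileInputStream("StartDoc.docx")) {
            System.out.println("Opened the file, first byte: " + in.read());
        } catch (IOException e) {
            // Report what actually went wrong instead of a fixed message.
            System.err.println(describe(e));
        }
    }
}
```

If the file is missing, the output names the actual exception type and the line where it was thrown, which is usually enough to tell a missing file apart from a parsing problem.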
[EDIT 1:]
So it seems you are able to process the file correctly as far as the font conversion of the data is concerned, but you are not able to reproduce the formatting of the original data in the converted file.
(so the message "We had an error while reading the Word Doc" is misleading ;) )
Now, there are 2 elements to a Word document:
Content
Structure or Schema
You are able to convert the data because you are working only on the content of your doc files.
To retain the formatting of the content, your solution also needs to be aware of the formatting of the doc files and handle it.
MS Word, which defines the .docx file format, follows a particular set of schemas that define the formatting rules. These schemas are defined in Microsoft's XML namespace packages [1].
You can obtain the XML (HTML) form of the doc file quite easily (see the steps in [1] or the code in [2]) and even apply different schemas, or possibly your own schema definitions based on those provided by Microsoft's namespaces. You can do this either programmatically, for which you need to become familiar with XML, XSL and XSLT concepts (w3schools [3] is a good starting point), although this approach is no less complex than writing your own version of MS Word; or by using MS Word's built-in tools as shown in [1].
[1]. https://www.microsoftpressstore.com/articles/article.aspx?p=2231769&seqNum=4#:~:text=During%20conversion%2C%20Word%20tags%20the,you%20can%20an%20HTML%20file.
[2]. https://svn.apache.org/repos/asf/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/converter/TestWordToHtmlConverter.java
[3]. https://www.w3schools.com/xml/
My answer gives only a cursory overview of how to achieve what you want; depending on your inclination and available time, use your discretion before choosing one path over the other.
Hope it helps!

IO Issue - Byte Array Image into XHTML(FlyingSaucer)

I have a solution that inserts strings into an XHTML document and prints the results as Reports. My employer has asked if we could pull images off their SQL database (stored as byte arrays) to insert into the Reports.
I am using FlyingSaucer as the XHTML interpreter and I've been using Java DOM to modify pre-stored reports that I have stored in the Report Generator's package.
The only solution I can think of at the moment is to construct the images, save them as a file, link the file in an img tag (or background-image) in a constructed report, print the report and then delete the file. This seems really sloppy and I imagine it will be very time consuming.
I can't help but feel there must be a more elegant solution. Any suggestions for inserting a byte array into html?

Read the image and convert it into its Base64-encoded form:
InputStream image = getClass().getClassLoader().getResourceAsStream("image.png");
String encodedImage = BaseEncoding.base64().encode(ByteStreams.toByteArray(image));
I've used BaseEncoding and ByteStreams from Google Guava.
Change the src attribute of the img element within your Document object:
Document doc = ...; // get Document from XHTMLPanel.getDocument() or create
// new one using DocumentBuilderFactory
doc.getElementById("myImage").getAttributes().getNamedItem("src").setNodeValue("data:image/png;base64," + encodedImage);
Unfortunately, FlyingSaucer does not support data URIs out of the box, so you'll have to create your own ReplacedElementFactory. Read the article "Using Data URLs for embedding images in Flying Saucer generated PDFs"; it contains a complete solution.
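If you would rather not depend on Guava, the same data URI can probably be built with the JDK alone (java.util.Base64, available since Java 8). The sketch below starts from a byte array, as in the question, so no temporary file is needed. The class name DataUriImage and both helper methods are hypothetical, the four bytes stand in for a real image loaded from the database, and the img element is looked up by tag name because getElementById only works when the parser knows which attribute is an ID. The FlyingSaucer side (the custom ReplacedElementFactory) is not covered here.

```java
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DataUriImage {

    // Hypothetical helper: builds "data:<mimeType>;base64,<payload>" from raw image bytes.
    public static String toDataUri(byte[] imageBytes, String mimeType) {
        return "data:" + mimeType + ";base64,"
                + Base64.getEncoder().encodeToString(imageBytes);
    }

    // Hypothetical helper: sets the src attribute of the first <img> element in the document.
    public static void setFirstImageSource(Document doc, String dataUri) {
        Element img = (Element) doc.getElementsByTagName("img").item(0);
        img.setAttribute("src", dataUri);
    }

    public static void main(String[] args) throws Exception {
        // Placeholder bytes; in the question these would come from the SQL database.
        byte[] imageBytes = {(byte) 0x89, 0x50, 0x4E, 0x47};

        // Build a tiny document containing a single <img> element.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element img = doc.createElement("img");
        doc.appendChild(html);
        html.appendChild(body);
        body.appendChild(img);

        setFirstImageSource(doc, toDataUri(imageBytes, "image/png"));
        System.out.println(img.getAttribute("src")); // prints data:image/png;base64,iVBORw==
    }
}
```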

RTF Line count in Java

Kindly let me know of any API to calculate the line count of an RTF document.
Apache POI or Aspose works for documents, but it's not able to find the line count for RTF.
Thanks.

Java already has a built-in RTF-Parser: RTFEditorKit.
Take a look at its read method.
For example:
test.rtf file contents
hello
stackoverflow
users
So, it has 3 lines separated by \n.
Code:
FileInputStream stream = new FileInputStream("test.rtf");
RTFEditorKit kit = new RTFEditorKit();
Document doc = kit.createDefaultDocument();
kit.read(stream, doc, 0);
String plainText = doc.getText(0, doc.getLength());
System.out.println(plainText.split("\\n").length);
Output = 3

You can use Aspose.Words for Java to get the number of lines of an RTF document. Please do the following:
Load the RTF file using the Document class
Get the BuiltInDocumentProperties object using the getBuiltInDocumentProperties method
Get the number of lines using the getLines method of the BuiltInDocumentProperties object
I hope this helps. Please note that I work as a developer evangelist at Aspose. If you need any help with Aspose, do let me know.

Wrong parsing with iText's PdfTextExtractor

I'm facing a problem when I try to read the content of a PDF document. I'm using iText 2.1.7 with Java, and I need to analyze the content of a PDF document. At first I was using PdfTextExtractor's getTextFromPage method, and it worked correctly, but only when the page contains just text. If the page contains an image, the String returned by getTextFromPage is a set of meaningless symbols (maybe a different character encoding?), and I lose the content of the whole page. I tried the latest version of iText and it works fine, but if I'm not wrong its license is not totally free (I'm working on a web application for a commercial customer, which serves PDFs on the fly), so I can't use it. I would really appreciate any suggestions.
In case you need it, here is the code:
PdfReader pdf = new PdfReader(doc); // doc is just a byte[]
int pageCount = pdf.getNumberOfPages();
for (int i = 1; i <= pageCount; i++) {
PdfTextExtractor pdfTextExtractor = new PdfTextExtractor(pdf);
String pageText = pdfTextExtractor.getTextFromPage(i);
}
Thanks in advance, regards.

I think that your PDF has an inline image. I do not think that iText 2.1.7 can deal with that.
You can find information regarding the license here

Convert Word (.docx !) to html with Apache POI + fr.opensagres.xdocreport

I was able to convert the old .doc files to HTML using only Apache POI.
For .docx, however, I had to use the fr.opensagres.xdocreport package.
The code is pretty straightforward:
XWPFDocument document = new XWPFDocument(inputStream);
OutputStream outputStream = new ByteArrayOutputStream();
XHTMLOptions options = XHTMLOptions.create().indent(4).setImageManager(new Base64EmbedImgManager());
XHTMLConverter.getInstance().convert(document, outputStream, options);
return outputStream.toString();
However, there are two issues:
embedded Excel charts are ignored (the .doc conversion with Apache POI converts them to images, just as it does with any other image)
text with various custom format combinations is not rendered correctly; several new lines are inserted unnecessarily (see input and output examples)
Does anybody know any way to fix this ?
Thank you.
