I have an application that iterates over a directory of PDF files and searches for a string. I am using PDFBox to extract the text from the PDFs, and the code is pretty straightforward. At first, searching through 13 files took a minute and a half, but I noticed that PDFBox was writing a lot of stuff to the log file. Changing the logging level helped a lot, but it is still taking over 30 seconds to load a page. Does anybody have any suggestions on how I can optimize the code, or another way to determine how many hits are in a document? I played around with Lucene, but it seems to only give you the number of hits in a directory, not the number of hits in a particular file.
Here is my code to get the text out of a PDF.
public static String parsePDF(String filename) throws IOException
{
    FileInputStream fi = new FileInputStream(new File(filename));
    try
    {
        // Parse the raw PDF into a COS document
        PDFParser parser = new PDFParser(fi);
        parser.parse();
        COSDocument cd = parser.getDocument();

        // Extract the plain text content
        PDFTextStripper stripper = new PDFTextStripper();
        String pdfText = stripper.getText(new PDDocument(cd));
        cd.close();
        return pdfText;
    }
    finally
    {
        fi.close(); // always release the file handle
    }
}
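For what it's worth, the same extraction can be written more compactly. A minimal sketch, assuming the same PDFBox 1.x API as above (PDDocument.load does the parsing for you):

public static String parsePDF(String filename) throws IOException
{
    PDDocument document = PDDocument.load(new File(filename));
    try
    {
        // PDFTextStripper walks every page and concatenates the text
        return new PDFTextStripper().getText(document);
    }
    finally
    {
        document.close(); // release the underlying file handle
    }
}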
Lucene would allow you to index each of the documents separately.
Instead of using PDFBox directly, you can use Apache Tika to extract the text and feed it to Lucene. Tika uses PDFBox internally, but it provides an easy-to-use API as well as the ability to extract content from many document types seamlessly.
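For illustration, a minimal extraction sketch using the Tika facade (the file name is a placeholder, and exception handling is omitted):

import java.io.File;
import org.apache.tika.Tika;

Tika tika = new Tika();
// auto-detects the content type (PDF, DOCX, ...) and returns plain text
String text = tika.parseToString(new File("some.pdf"));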
Once you have a Lucene document for each file in your directory, you can search against the complete index.
Lucene matches the search term and returns the number of results (files) whose content matches the query.
It is also possible to get the number of hits within each Lucene document/file using the Lucene API.
This is called the term frequency, and can be calculated for the document and field being searched upon.
Example from "In a Lucene / Lucene.net search, how do I count the number of hits per document?":
List<Integer> docIds = ...; // doc ids for documents that matched the query,
                            // sorted in ascending order
int totalFreq = 0;
TermDocs termDocs = reader.termDocs();
termDocs.seek(new Term("my_field", "congress"));
for (int id : docIds) {
    // skipTo advances to the first doc >= id and returns false when exhausted
    if (termDocs.skipTo(id) && termDocs.doc() == id) {
        totalFreq += termDocs.freq();
    }
}
I am developing a font converter app which will convert Unicode font text to Krutidev/Shree Lipi (Marathi/Hindi) font text. In the original docx file there are formatted words (i.e. color, font, size of the text, hyperlinks, etc.).
I want to keep the formatting of the final docx the same as the original docx after converting the words from Unicode to the other font.
Here is my Code
try {
    fileInputStream = new FileInputStream("StartDoc.docx");
    document = new XWPFDocument(fileInputStream);
    Converter data = new Converter();
    for (XWPFParagraph p : document.getParagraphs())
    {
        for (XWPFRun r : p.getRuns())
        {
            String string2 = r.getText(0);
            // assuming uniToShree returns the converted text,
            // write it back into the same run
            string2 = data.uniToShree(string2);
            r.setText(string2, 0);
        }
    }
    // Write the document to the file system
    FileOutputStream out = new FileOutputStream(new File("Output.docx"));
    document.write(out);
    out.close();
    System.out.println("Output.docx written successfully");
}
catch (IOException e) {
    System.out.println("We had an error while reading the Word Doc");
}
Thanks for asking.
I worked with POI some years ago, though on Excel workbooks, but I'll still try to help you reach the root cause of your error.
The Java runtime gives you good debugging information in the exception itself!
A good first step to disambiguate the error is to not overwrite the exception message that is handed to you.
Try printing the results of e.getLocalizedMessage() or e.getMessage() and see what you get.
Getting the stack trace using the printStackTrace method is also often useful to pinpoint where your error lies!
Share your findings from the above method calls so we can help you debug the issue further.
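For instance, the catch block from your code could surface the real error like this (a minimal sketch of the suggestions above):

catch (IOException e) {
    // print the real cause instead of a fixed message
    System.out.println(e.getLocalizedMessage());
    e.printStackTrace();
}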
[EDIT 1:]
So it seems you are able to process the file just right with respect to the font conversion of the data, but you are not able to reconstruct the formatting of the original data in the converted file.
(Thus, "We had an error while reading the Word Doc" is a lie getting printed ;) )
Now, there are 2 elements to a Word document:
Content
Structure or Schema
You are able to convert the data as you are working only on the content of your respective doc files.
In order to be able to retain the formatting of the contents, your solution needs to be aware of the formatting of the doc files as well and take care of that.
MS Word, which defines the doc files and their extension (.docx), follows a particular set of schemas that define the rules of formatting. These schemas are defined in Microsoft's XML namespace packages[1].
You can obtain the XML (HTML) form of the doc file you want quite easily (see the steps in [1] or the code in link [2]; a small sketch follows the references below). You can even apply different schemas, or possibly your own schema definitions based on the definitions provided by MS's namespaces. Doing that programmatically requires getting versed in XML, XSL and XSLT concepts (w3schools[3] is a good starting point), but that method is no less complex than writing your own version of MS Word; alternatively, you can use MS Word's inbuilt tools as shown in [1].
[1]. https://www.microsoftpressstore.com/articles/article.aspx?p=2231769&seqNum=4#:~:text=During%20conversion%2C%20Word%20tags%20the,you%20can%20an%20HTML%20file.
[2]. https://svn.apache.org/repos/asf/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/converter/TestWordToHtmlConverter.java
[3]. https://www.w3schools.com/xml/
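As a cursory illustration of link [2], here is a sketch of obtaining the HTML form of a document with POI's converter. Note this is the HWPF (.doc) side that [2] tests; treat the class names as assumptions to verify against your POI version, and for .docx you would need the XWPF equivalent:

import java.io.FileInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.w3c.dom.Document;

HWPFDocument wordDocument = new HWPFDocument(new FileInputStream("StartDoc.doc"));
// the converter fills an (initially empty) DOM document with the HTML
WordToHtmlConverter converter = new WordToHtmlConverter(
        DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
converter.processDocument(wordDocument);
Document html = converter.getDocument(); // serialize with javax.xml.transform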
My answer gives you a cursory overview of how to achieve what you want, but depending on your inclination and time availability, you may want to use your discretion before deciding which path to take.
Hope it helps!
I have been trying to edit different types of documents using Apache POI. The script should handle both the .doc and .docx extensions. I could successfully edit the .docx file using the XWPF API, and the required text was added at the end of the docx file.
For editing .doc files (which include a header, a footer and a few paragraphs), the following script is used, which uses HWPFDocument.
FileInputStream fis = new FileInputStream(args[0]);
POIFSFileSystem fs = new POIFSFileSystem(fis);
HWPFDocument doc = new HWPFDocument(fs);
// getRange() covers the main body text of the document
Range range = doc.getRange();
CharacterRun run = range.insertAfter("FROM SEHWAGGG A FOUUURRRRRR");
run.setBold(true);
run.setItalic(true);
The script works fine with normal documents that do not have a header and footer, but the issue seems to appear with complex documents. It inserts text, but in between the paragraphs (and at the beginning when using insertBefore()). No text replacement is required; I just have to put the text at the end of the document. I searched for similar scripts but most of them handle text replacement.
How can I add the text at the end, after all paragraphs?
I've tested it with a sample document. At first (with your original code) it completely destroyed the document. By changing the following line, the insert works fine for me:
// Old
Range range = doc.getRange();
// New
Range range = doc.getEndnoteRange();
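Putting it together, a minimal sketch (the file names are placeholders); the endnote range sits after the main body, so insertAfter lands after all of the paragraphs:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Range;

HWPFDocument doc = new HWPFDocument(new FileInputStream("input.doc"));
// insert at the endnote range so the run is appended after the body text
Range range = doc.getEndnoteRange();
CharacterRun run = range.insertAfter("FROM SEHWAGGG A FOUUURRRRRR");
run.setBold(true);
run.setItalic(true);
FileOutputStream out = new FileOutputStream("output.doc");
doc.write(out);
out.close();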
I'm afraid you are out of luck with HWPF, given the current state of the project.
I created a custom HWPF library for one of our clients, but the changes are not public. The changes were huge, so you can't spend - say - a week and assume that things will be fixed. You might get away with the current public HWPF when only some text needs to be replaced without changing the string length ("abc" -> "123" or "a " -> "1234").
I have a PDF file which has two types of pages: normal text pages and pages coming from scanned documents. The text content can be easily extracted using either PDFBox or Tika. However, these libraries can't do OCR, so I need something like Tess4J. But how can I combine Tess4J and PDFBox (or Tika) in order to extract all the content from both the text and the scanned pages?
EDIT:
I found a solution as follows, but it doesn't work well:
// 'pages' is the document's page list
for (PDPage page : pages) {
    // grab every image XObject on the page
    Map<String, PDXObjectImage> img = page.getResources().getImages();
    for (Map.Entry<String, PDXObjectImage> entry : img.entrySet()) {
        String k = entry.getKey();
        System.out.println(k);
        PDXObjectImage ci = entry.getValue();
        // writes the image to disk as k + "." + suffix
        ci.write2file(k);
        File imageFile = new File(k + ".jpg");
        Tesseract instance = Tesseract.getInstance();
        try {
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        }
        catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}
The problem is that although the saved image files have good quality, Tess4J does not handle them well and mostly extracts nonsense. However, Tess4J is able to OCR them very well if the PDF is simply passed to it in the first place.
In summary, extracting the images with PDFBox hurts the quality of the OCR process. Does anyone know why?
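One workaround, rather than pulling out the embedded image XObjects, is to render each whole page to a bitmap and OCR that. A sketch, assuming the PDFBox 1.x PDPage.convertToImage API that matches the code above (the 300 dpi value is an assumption worth tuning):

import java.awt.image.BufferedImage;
import org.apache.pdfbox.pdmodel.PDPage;

for (PDPage page : pages) {
    // render the full page (text + scan) instead of extracting images
    BufferedImage pageImage = page.convertToImage(BufferedImage.TYPE_INT_RGB, 300);
    Tesseract instance = Tesseract.getInstance();
    try {
        System.out.println(instance.doOCR(pageImage));
    } catch (TesseractException e) {
        System.err.println(e.getMessage());
    }
}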
I have a solution that inserts strings into an XHTML document and prints the results as Reports. My employer has asked if we could pull images off their SQL database (stored as byte arrays) to insert into the Reports.
I am using FlyingSaucer as the XHTML interpreter and I've been using Java DOM to modify pre-stored reports that I have stored in the Report Generator's package.
The only solution I can think of at the moment is to construct the images, save them as files, link each file in an img tag (or background-image) in a constructed report, print the report and then delete the files. This seems really sloppy, and I imagine it will be very time-consuming.
I can't help but feel there must be a more elegant solution. Any suggestions for inserting a byte array into html?
Read the image and convert it into its Base64-encoded form:
InputStream image = getClass().getClassLoader().getResourceAsStream("image.png");
String encodedImage = BaseEncoding.base64().encode(ByteStreams.toByteArray(image));
I've used BaseEncoding and ByteStreams from Google Guava.
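If you'd rather avoid the Guava dependency, the JDK can do the same (a sketch, assuming Java 8+ and a file on disk rather than a classpath resource):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

byte[] bytes = Files.readAllBytes(Paths.get("image.png"));
String encodedImage = Base64.getEncoder().encodeToString(bytes);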
Then change the src attribute of the img element within your Document object.
Document doc = ...; // get Document from XHTMLPanel.getDocument() or create
// new one using DocumentBuilderFactory
doc.getElementById("myImage").getAttributes().getNamedItem("src").setNodeValue("data:image/png;base64," + encodedImage);
Unfortunately, FlyingSaucer does not support data URIs out-of-the-box, so you'll have to create your own ReplacedElementFactory. Read the "Using Data URLs for embedding images in Flying Saucer generated PDFs" article - it contains a complete solution.
I am trying to pull data from Microsoft Word, translate it to an SQL statement and insert it into an Oracle database.
When the data in MS Word contains a new line created with [Shift-Enter] rather than just Enter, the extracted text contains an icon that looks like a box with a question mark.
In my example, the ET marks are just standard new lines made with the Enter key, and the ST marks are new lines made with the Shift-Enter combination. So when generating the SQL and inserting it into Oracle, Oracle treats that character not as text but as hex.
My question is: how do I convert the lines created by [Shift-Enter] to just a standard '\n'?
Thanks
Update
This is how I get the text information:
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream(file));
HWPFDocument doc = new HWPFDocument(fs);
WordExtractor we = new WordExtractor(doc);
text = we.getText();
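If it is indeed the soft line break, a minimal normalization sketch - assuming, as is commonly the case, that Word's Shift-Enter comes through in the extracted text as the vertical-tab character 0x0B (which is usually what that box glyph is):

// replace Word's soft line break (vertical tab, 0x0B) with a normal newline
text = text.replace('\u000B', '\n');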
Update with the answer:
This was a bug in poi-3.6; in poi-3.8 it shows up as \r.
What you're almost certainly seeing are "fields" in the Word document, which are special blocks of text such as links, macros, etc.
Option number one is to continue using WordExtractor, but call stripFields(String) on the resulting text before using it. That'll remove any of these fields from the text for you.
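A sketch of that option, reusing the extraction code from the question (stripFields is a static helper on WordExtractor):

WordExtractor we = new WordExtractor(doc);
// remove field markers (links, macros, ...) from the extracted text
String text = WordExtractor.stripFields(we.getText());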
The other option is to use a different way of getting the text out. WordToTextConverter is part of Apache POI; it is more complex code that handles more of the format and should skip these fields for you (WordExtractor is pretty simple and low-level). Another option is Apache Tika, which provides a common way of extracting text from a number of file formats. It has the proper code to deal with fields, and as an added bonus it'll be trivial for you to support .docx or .pdf when your requirements change!