How to extract content from a pdf having both text and images? - java

I have a pdf file which has two types of pages: normal text pages and pages coming from scanned documents. The text content can be easily extracted using either PDFBox or Tika. However, these libraries can't do OCR, so I need something like Tess4J. But how can I combine Tess4J and PDFBox (or Tika) in order to extract all the content from both the text pages and the scanned pages?
EDIT:
I found a solution as follows, but it doesn't work well:
for (PDPage page : pages) {
    // PDFBox 1.8.x: collect the image XObjects from the page's resources
    Map<String, PDXObjectImage> images = page.getResources().getImages();
    for (Map.Entry<String, PDXObjectImage> entry : images.entrySet()) {
        String key = entry.getKey();
        System.out.println(key);
        PDXObjectImage image = entry.getValue();
        // write2file appends the image's own suffix (e.g. ".jpg") to the base name
        image.write2file(key);
        File imageFile = new File(key + "." + image.getSuffix());
        Tesseract instance = Tesseract.getInstance();
        try {
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}
The problem is that, although the saved image files have good quality, Tess4J does not operate well on them and mostly extracts nonsense. However, Tess4J is able to OCR them very well if the pdf is simply passed to it in the first place.
In summary, extracting the images with PDFBox affects the quality of the OCR. Does anyone know why?
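If the embedded image XObjects are the culprit, one workaround is to skip image extraction entirely and OCR a rendering of each page instead. This is only a sketch, reusing the pages list from above and assuming PDFBox 1.8.x (PDPage.convertToImage) with Tess4J; 300 DPI is an arbitrary choice:
Tesseract tesseract = Tesseract.getInstance();
for (PDPage page : pages) {
    try {
        // render the whole page at 300 DPI instead of extracting the embedded image XObject
        BufferedImage rendered = page.convertToImage(BufferedImage.TYPE_INT_RGB, 300);
        System.out.println(tesseract.doOCR(rendered));
    } catch (IOException | TesseractException e) {
        System.err.println(e.getMessage());
    }
}
This way Tesseract sees the page exactly as it is displayed, so any resolution or color-space differences introduced when individual images are written out no longer come into play.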

Related

GSON / iText: Extract Text From PDF 1.7 byte[]

I'm automating tests using Rest-Assured and GSON and need to validate the contents of a PDF file that is returned in the response of a POST request. The content of the files varies and can contain anything from just text, to text and tables, or text, tables and graphics. Every page can, and most likely will, be different as far as glyph content goes. I am only concerned with ALL text on the pdf page, be it plain text, text inside a table, or text associated with (or inside of) an image. Since all pdfs returned by the request are different, I cannot define search areas (as far as I know). I just need to extract all text on the page.
I extract the pdf data into a byte array like so:
Gson pdfGson = new Gson();
byte[] pdfBytes = pdfGson.fromJson(this.response.as(JsonObject.class)
        .get("pdfData").getAsJsonObject().get("data").getAsJsonArray(), byte[].class);
(I've tried other extraction methods for the byte[], but this is the only way I've found that returns valid data.) This returns a very large byte[] like so:
[37, 91, 22, 45, 23, ...]
When I parse the array I run into the same issue as this question (except my pdf is 1.7), and I attempt to implement the accepted answer, adjusted for my purposes and as explained in the iText documentation:
byte[] decodedPdfBytes = PdfReader.decodeBytes(pdfBytes, new PdfDictionary(),
        FilterHandlers.getDefaultFilterHandlers());
IRandomAccessSource source = new RandomAccessSourceFactory().createSource(decodedPdfBytes);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ReaderProperties readerProperties = new ReaderProperties();
// Ineffective:
readerProperties.setPassword(user.password.getBytes());
PdfReader pdfReader = new PdfReader(source, readerProperties);
// Ineffective:
pdfReader.setUnethicalReading(true);
PdfDocument pdfDoc = new PdfDocument(pdfReader, new PdfWriter(baos));
// page numbers are 1-based and getNumberOfPages() is inclusive, so use <=
for (int i = 1; i <= pdfDoc.getNumberOfPages(); i++) {
    String text = PdfTextExtractor.getTextFromPage(pdfDoc.getPage(i));
    System.out.println(text);
}
This DOES decode the pdf page, and return text, but it is only the header text. No other text is returned.
For what it's worth, on the front end, when the user clicks the button to generate the pdf, it returns a blob containing the download data, so I'm relatively sure that the metadata is GSA encoded, but I'm not sure if that matters at all. I'm not able to share an example of the pdf docs due to sensitive material.
Any pointer in the right direction would be greatly appreciated! I've spent three days trying to find a solution.
For those looking for a solution - ultimately we wound up going a different route. We never found a solution to this specific issue.
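For anyone who lands here and wants to experiment further, one variation that might be worth trying (a sketch, assuming iText 7; not a confirmed fix) is to pass an explicit LocationTextExtractionStrategy, which orders text chunks by their position on the page rather than by content-stream order:
PdfReader reader = new PdfReader(
        new RandomAccessSourceFactory().createSource(decodedPdfBytes),
        new ReaderProperties());
PdfDocument pdfDoc = new PdfDocument(reader);
for (int i = 1; i <= pdfDoc.getNumberOfPages(); i++) {
    // LocationTextExtractionStrategy sorts glyphs geometrically before joining them
    String text = PdfTextExtractor.getTextFromPage(
            pdfDoc.getPage(i), new LocationTextExtractionStrategy());
    System.out.println(text);
}
pdfDoc.close();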

Extracting font size and boldness from PDF with PDF2Dom

PDF2Dom (based on the PDFBox library) is capable of converting PDFs to HTML format while preserving characteristics such as font size, boldness, etc. An example of this conversion is shown below:
private void generateHTMLFromPDF(String filename) throws Exception { // declared broadly for brevity
    PDDocument pdf = PDDocument.load(new File(filename));
    Writer output = new PrintWriter("src/output/pdf.html", "utf-8");
    new PDFDomTree().writeText(pdf, output);
    output.close();
}
I'm trying to parse an existing PDF and extract these characteristics on a line-by-line basis, and I wonder if there are any existing methods within PDF2Dom/PDFBox to parse these right from the PDF?
Another option would be to just use the HTML output and proceed from there, but that seems like an unnecessary detour.
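Plain PDFBox can report font information per text fragment without the HTML detour. A minimal sketch, assuming PDFBox 2.x (TextPosition exposes the font and size; boldness usually has to be inferred from the font name):
PDDocument pdf = PDDocument.load(new File(filename));
PDFTextStripper stripper = new PDFTextStripper() {
    @Override
    protected void writeString(String text, List<TextPosition> positions) throws IOException {
        for (TextPosition pos : positions) {
            // boldness is typically only visible in the font name, e.g. "Helvetica-Bold"
            System.out.println(pos.getUnicode() + "  "
                    + pos.getFont().getName() + "  " + pos.getFontSizeInPt());
        }
        super.writeString(text, positions);
    }
};
stripper.getText(pdf); // triggers the writeString callbacks
pdf.close();
Since writeString is called once per text chunk, the positions passed in can also be grouped to produce the line-by-line summary you describe.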

Parse PDF files with key - value pair using PDFBox

I'm currently parsing PDF with PDFBox library like this:
File f = new File("myFile.pdf");
PDDocument doc = PDDocument.load(f);
// for nicer console output I used PDFLayoutTextStripper (a PDFTextStripper subclass)
PDFTextStripper pdfStripper = new PDFLayoutTextStripper();
String text = pdfStripper.getText(doc);
System.out.println(text);
doc.close();
I'm getting really nice-looking output. My pdf files will have a structure like this:
My super pdf file which is first one of all
someKey1 someValue1
someKey2 someValue2
someKey3 someValue3
....
someKey1 someValue4
someKey2 someValue5
someKey3 someValue6
....
some header over here
and here would be my next pair
someKey4 someValue7
....
Is there any library that could get me all values for a given key, e.g. someKey1? Or is there maybe a better solution for parsing PDFs in Java?
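I'm not aware of a PDF library that does key/value lookup directly; one pragmatic option is to post-process the text that PDFTextStripper already gives you. A minimal sketch, assuming each pair sits on its own line as "key value" (someKey1 is just the key from the example above):
// collect every value that follows "someKey1" at the start of a line
Pattern pattern = Pattern.compile("^\\s*someKey1\\s+(.+)$", Pattern.MULTILINE);
Matcher matcher = pattern.matcher(text);
List<String> values = new ArrayList<>();
while (matcher.find()) {
    values.add(matcher.group(1).trim());
}
System.out.println(values); // e.g. [someValue1, someValue4]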

How to detect OCR in a scanned Document with pdfbox 2.0.0?

The problem: I have a large folder with many subfolders containing many pdfs. Some of them already have OCR applied, some don't. So I wanted to write a Java program to filter out the non-OCR PDFs and copy them to a hot folder.
I tested about 20 documents, and what they all have in common is that if you open them in a text editor, you can find the word 'font' in the OCR ones and you can't find it in the non-OCR ones. My question now is: how do I implement this check using PDFBox 2.0.0? All the solutions I found seem to work only with older versions, and I'm not able to find a solution in the documentation (which is clearly my fault).
Thanks in advance.
Here's how to find out if fonts are on the top level of a page:
PDDocument doc = PDDocument.load(new File(...));
PDPage page = doc.getPage(0); // 0-based
PDResources resources = page.getResources();
for (COSName fontName : resources.getFontNames())
{
    System.out.println(fontName.getName());
}
doc.close();
Re: mkl's suggestion, here's how to extract text:
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1); // 1 based
stripper.setEndPage(1);
String extractedText = stripper.getText(doc);
System.out.println(extractedText);
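To turn the two checks above into the actual filter, one simple heuristic (the combination is my assumption, not part of the original answer) is to flag a page that exposes neither fonts nor extractable text:
// a page is treated as "needs OCR" when it has no fonts and yields no text
boolean hasFonts = resources.getFontNames().iterator().hasNext();
boolean hasText = !extractedText.trim().isEmpty();
if (!hasFonts && !hasText) {
    System.out.println("Page 1 looks like a scan without a text layer - copy to hot folder");
}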

Optimize PDF Word Search

I have an application that iterates over a directory of pdf files and searches for a string. I am using PDFBox to extract the text from the PDFs, and the code is pretty straightforward. At first it was taking a minute and a half to load the results for a search through 13 files, but I noticed that PDFBox was putting a lot of stuff in the log file. I changed the logging level and that helped a lot, but it is still taking over 30 seconds to load a page. Does anybody have any suggestions on how I can optimize the code, or another way to determine how many hits are in a document? I played around with Lucene, but it seems to only give you the number of hits in a directory, not the number of hits in a particular file.
Here is my code to get the text out of a PDF.
public static String parsePDF(String filename) throws IOException
{
    FileInputStream fi = new FileInputStream(new File(filename));
    PDFParser parser = new PDFParser(fi);
    parser.parse();
    COSDocument cd = parser.getDocument();
    PDFTextStripper stripper = new PDFTextStripper();
    String pdfText = stripper.getText(new PDDocument(cd));
    cd.close();
    fi.close();
    return pdfText;
}
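If a full index feels like overkill and you only need a hit count per file, counting occurrences in the text returned by parsePDF is already enough. A trivial sketch (searchTerm is a hypothetical variable holding the search string; case-insensitive matching is my assumption):
String haystack = parsePDF(filename).toLowerCase();
String needle = searchTerm.toLowerCase();
int hits = 0;
// walk through the text and count every occurrence of the term
for (int idx = haystack.indexOf(needle); idx != -1; idx = haystack.indexOf(needle, idx + needle.length())) {
    hits++;
}
System.out.println(filename + ": " + hits + " hits");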
Lucene would allow you to index each of the documents separately.
Instead of using PDFBox directly, you can use Apache Tika to extract the text and feed it to Lucene. Tika uses PDFBox internally, but it provides an easy-to-use API as well as the ability to extract content from many document types seamlessly.
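For the extraction step, a minimal sketch of the Tika route (it only assumes Apache Tika on the classpath; the file name is a placeholder and exception handling is omitted):
Tika tika = new Tika();
// parseToString auto-detects the format and returns plain text (PDFs go through PDFBox)
String text = tika.parseToString(new File("somefile.pdf"));
The resulting string can then be put into a Lucene Document field and indexed as usual.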
Once you have a Lucene document for each file in your directory, you can perform searches against the complete index.
Lucene matches the search term and returns the number of results (files) whose content matched.
It is also possible to get the number of hits within each Lucene document/file using the Lucene API.
This is called the term frequency, and it can be calculated for the document and field being searched.
An example, taken from "In a Lucene / Lucene.net search, how do I count the number of hits per document?":
List<Integer> docIds = ... // doc ids for documents that matched the query,
                           // sorted in ascending order
int totalFreq = 0;
TermDocs termDocs = reader.termDocs();
termDocs.seek(new Term("my_field", "congress"));
for (int id : docIds) {
    termDocs.skipTo(id);
    totalFreq += termDocs.freq();
}
