I'm currently parsing a PDF with the PDFBox library like this:
File f = new File("myFile.pdf");
PDDocument doc = PDDocument.load(f);
PDFTextStripper pdfStripper = new PDFLayoutTextStripper(); //For better printing in console I used a PDFLayoutTextStripper
String text = pdfStripper.getText(doc);
System.out.println(text);
doc.close();
I'm getting really nice looking output. My PDF files will have a structure like this:
My super pdf file which is first one of all
someKey1 someValue1
someKey2 someValue2
someKey3 someValue3
....
someKey1 someValue4
someKey2 someValue5
someKey3 someValue6
....
some header over here
and here would be my next pair
someKey4 someValue7
....
Is there any library which could get me all values for, e.g., the key someKey1? Or maybe is there a better solution for parsing PDFs in Java?
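If the layout really is line-based like the example above, plain string handling on the stripped text may already be enough, so no extra library is strictly needed. A minimal sketch (the class and method names are just illustrative):

import java.util.ArrayList;
import java.util.List;

public class KeyValueExtractor {
    // Collects every value that follows the given key in the stripped text.
    // Assumes each "key value" pair sits on its own line, as in the structure above.
    public static List<String> findValues(String strippedText, String key) {
        List<String> values = new ArrayList<>();
        for (String line : strippedText.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.startsWith(key + " ")) {
                values.add(trimmed.substring(key.length()).trim());
            }
        }
        return values;
    }
}

With the structure above, findValues(text, "someKey1") would return someValue1 and someValue4.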
I'm building a PDF parser using Apache PDFBox. After parsing the plain text I run some algorithms and in the end output a JSON file. For some PDFs the output file contains UTF-8 encoding; for other PDFs it contains some form of what seems to be Latin-1 encoding (spaces show up as "\xa0" when the JSON file is opened in Python). I assume this must be a consequence of the fonts or some other characteristic of the PDF?
My code to read the plain text is as follows
PDDocument document = PDDocument.load(file);
//Instantiate PDFTextStripper class
PDFTextStripper pdfStripper = new PDFTextStripper();
//Retrieving text from PDF document
String text = pdfStripper.getText(document);
//Closing the document
document.close();
I've tried just saving the plain text:
PrintWriter out = new PrintWriter(outPath + ".txt");
out.print(text);
Even opening this plain text file in Python yields "\xa0" characters instead of spaces when the file is read into a dictionary, yielding the following result:
dict_keys(['1.\xa0lorem\xa0ipsum', '2.\xa0lorem\xa0ipsum\xa0lorem\xa0ipsum', '3.\xa0lorem', '4.\xa0lorem\xa0ipsum', '5.\xa0lorem\xa0ipsum'])
I'd like to make sure the text always gets encoded as utf-8. How do I go about doing this?
If you want to make sure your PrintWriter uses UTF-8 encoding, say so in the constructor:
PrintWriter out = new PrintWriter(outPath + ".txt", "UTF-8");
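Applied to the snippet from the question, with try-with-resources so the file is always closed (a minimal sketch; outPath and text are the variables from the question):

try (PrintWriter out = new PrintWriter(outPath + ".txt", "UTF-8")) {
    out.print(text);
}

Note that \xa0 is the non-breaking space character (U+00A0). If the PDF text actually contains that character, writing the file as UTF-8 simply preserves it, so you may additionally want to normalize it yourself, e.g. with text.replace('\u00A0', ' ').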
PDF2Dom (based on the PDFBox library) is capable of converting PDFs to HTML while preserving characteristics such as font size, boldness, etc. An example of this conversion is shown below:
private void generateHTMLFromPDF(String filename) throws Exception {
    PDDocument pdf = PDDocument.load(new File(filename));
    Writer output = new PrintWriter("src/output/pdf.html", "utf-8");
    new PDFDomTree().writeText(pdf, output);
    output.close();
    pdf.close();
}
I'm trying to parse an existing PDF and extract these characteristics on a line-by-line basis, and I wonder if there are any existing methods within PDF2Dom/PDFBox to parse these right from the PDF?
Another option would be to just use the HTML output and proceed from there, but that seems like an unnecessary detour.
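If you want to stay inside PDFBox rather than go through PDF2Dom's HTML, one option is to subclass PDFTextStripper and look at the TextPosition objects handed to writeString; each one carries the font and size used for that run of text. A rough sketch against the PDFBox 2.x API (the class name and the boldness heuristic are just illustrative):

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class FontAwareStripper extends PDFTextStripper {

    public FontAwareStripper() throws IOException {
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
        // Report the font and size of the first glyph of each text run.
        TextPosition first = textPositions.get(0);
        String fontName = first.getFont().getName();
        boolean looksBold = fontName != null && fontName.toLowerCase().contains("bold");
        System.out.println(text + "  [font=" + fontName
                + ", size=" + first.getFontSizeInPt() + ", bold=" + looksBold + "]");
        super.writeString(text, textPositions);
    }

    public static void main(String[] args) throws IOException {
        try (PDDocument doc = PDDocument.load(new File("myFile.pdf"))) {
            new FontAwareStripper().getText(doc);
        }
    }
}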
I am trying to extract tabular data from a PDF, and the first step of my algorithm is to convert the PDF to an HTML document.
How can I convert a PDF to HTML using the pdf2Dom library?
You can convert it using this:
private void generateHTMLFromPDF(String filename) throws Exception {
    PDDocument pdf = PDDocument.load(new File(filename));
    Writer output = new PrintWriter("src/output/pdf.html", "utf-8");
    new PDFDomTree().writeText(pdf, output);
    output.close();
    pdf.close();
}
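A minimal way to call it, assuming the output directory src/output exists (the wrapper class name is just an example and main would live in the same class, since the method is private):

public static void main(String[] args) throws Exception {
    new PdfToHtmlConverter().generateHTMLFromPDF("src/input/sample.pdf");
}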
reference - link
I want to check if a PDF file contains a long string, which is the text of a full XML document.
I can already open both files and extract the text. I've done that with the following code:
File temp = File.createTempFile("temp-pdf", ".tmp");
OutputStream out = new FileOutputStream(temp);
out.write(Base64.decodeBase64(testObject.getPdfAsDoc().getContent()));
out.close();

PDDocument document = PDDocument.load(temp);
PDFTextStripper pdfStripper = new PDFTextStripper();
String pdfText = pdfStripper.getText(document);

int posS = pdfText.indexOf("<?xml version");
int posE = pdfText.lastIndexOf("</ServiceSpecificationSchema:serviceSpecification>")
        + "</ServiceSpecificationSchema:serviceSpecification>".length();
pdfText = pdfText.substring(posS, posE);

String xmlText = testObject.getXmlAsDoc().getContent();
Now I have the problem that the lines of the two documents don't match because of formatting such as line breaks in the PDF text.
Example lines of TXT output from XML file:
<?xml version="1.0" encoding="UTF-8"?><ServiceSpecificationSchema:serviceSpecification xmlns:xs=" ..... >
Example lines of TXT output from PDF file:
<?xml version="1.0" encoding="UTF-8"?><ServiceSpecificationSchema:serviceSpecification
xmlns:xs=" ..... >
Second, I have page numbers between the XML tags from the PDF. Do you know a good way to remove these lines?
</operations>
Page 51 of 52
</consumerInterface>
What is the best approach to check whether the PDF contains the XML?
I've already tried removing all line breaks and whitespace from both files and comparing them, but if I do that, I can no longer tell which line contains the difference.
It does not have to be a valid XML file at the end.
Just want to post my solution in case others need it.
My code is a little too large to post here.
Basically, I extract the text from the PDF and remove strings like "Page x" and headlines from it. After that I remove all whitespace, as pointed out above. Finally I compare the extracted string character by character to tell my users where they have done something wrong in the text. This method works pretty well, even if the author does not care about formatting and just copies and pastes the whole XML document.
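For reference, a rough sketch of that approach (the page-footer pattern is just an assumption based on the example above; real headlines would need their own patterns):

// Strip page footers like "Page 51 of 52", then drop all whitespace so the
// PDF text and the XML can be compared character by character.
static String normalize(String s) {
    return s.replaceAll("(?m)^Page \\d+ of \\d+$", "")
            .replaceAll("\\s+", "");
}

String pdfNorm = normalize(pdfText);
String xmlNorm = normalize(xmlText);
int len = Math.min(pdfNorm.length(), xmlNorm.length());
for (int i = 0; i < len; i++) {
    if (pdfNorm.charAt(i) != xmlNorm.charAt(i)) {
        System.out.println("First difference at position " + i + ": '"
                + pdfNorm.charAt(i) + "' vs '" + xmlNorm.charAt(i) + "'");
        break;
    }
}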
I have a PDF file which has two types of pages: normal text pages and pages coming from scanned documents. Text content can be easily extracted using either PDFBox or Tika. However, these libraries can't do OCR, and for that I need something like Tess4J. But how can I combine Tess4J and PDFBox (or Tika) in order to extract all the content from both the text pages and the scanned pages?
EDIT:
I found a solution as follows, but it doesn't work well:
for (PDPage page : pages) {
    // Pull every image XObject from the page resources (PDFBox 1.x API).
    Map<String, PDXObjectImage> img = page.getResources().getImages();
    for (String k : img.keySet()) {
        System.out.println(k);
        PDXObjectImage ci = img.get(k);
        // Write the image to disk, then run Tess4J on the saved file.
        ci.write2file(k);
        File imageFile = new File(k + ".jpg");
        Tesseract instance = Tesseract.getInstance();
        try {
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}
The problem is that although the saved image files have good quality, Tess4J does not operate well on them and mostly extracts nonsense. However, Tess4J is able to OCR them very well if the PDF is simply passed to it in the first place.
In summary, extracting the images with PDFBox affects the quality of the OCR process. Does anyone know why?
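One common workaround is to render each full page to an image at a fixed DPI and OCR that, instead of extracting the embedded image XObjects. A sketch using PDFBox 2.x's PDFRenderer (a different API from the 1.x code above; 300 DPI and the file name are just example values):

import java.awt.image.BufferedImage;
import java.io.File;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class PdfOcr {
    public static void main(String[] args) throws Exception {
        try (PDDocument document = PDDocument.load(new File("scanned.pdf"))) {
            PDFRenderer renderer = new PDFRenderer(document);
            Tesseract tesseract = new Tesseract();
            // tesseract.setDatapath(...) may be needed depending on your installation.
            for (int i = 0; i < document.getNumberOfPages(); i++) {
                // Render the whole page as one image instead of pulling out embedded XObjects.
                BufferedImage pageImage = renderer.renderImageWithDPI(i, 300);
                try {
                    System.out.println(tesseract.doOCR(pageImage));
                } catch (TesseractException e) {
                    System.err.println(e.getMessage());
                }
            }
        }
    }
}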