I am trying to read a PDF file of the Gujarat electoral roll (sample file) and need to extract all the information in a structured format. I am using Apache PDFBox to extract text from the PDF file.
The problem is that certain characters are lost in the conversion and the converted text contains a lot of noise. Please find the converted file here.
The code:
import java.io.*;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.util.*;

public class Main {
    public static void main(String[] args){
        PDDocument pd;
        BufferedWriter wr;
        try {
            File input = new File("myPDF_manual.pdf");
            File output = new File("newPaperTestFile.txt"); // The text file where you are going to store the extracted data
            pd = PDDocument.load(input);
            PDFTextStripper stripper = new PDFTextStripper();
            wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output)));
            stripper.writeText(pd, wr);
            if (pd != null) {
                pd.close();
                wr.close();
                System.out.println(" file processed.");
            }
        } catch (Exception e){
            e.printStackTrace();
        }
    }
}
I also tried the getText() method of the PDFTextStripper class, but the result is the same.
I also tried converting the PDF to XML with the pdftohtml command-line utility on Linux, but some of the information is still lost there too. The XML file can be found here.
Please suggest any solution to this problem. The solution doesn't need to be Java-specific.
I am able to create a PDF file, but when I try to open the output file I get the error: "the file is damaged".
Here is my code, please help me.
String encodedBytes= "QmFzZTY0IGVuY29kaW5nIHNjaGVtZXMgYXJlIHVzZWQgd2hlbiBiaW5hcnkgZGF0YSBuZWVkcyB0byBiZSBzdG9yZWQgb3IgdHJhbnNmZXJyZWQgYXMgdGV4dHVhbCBkYXRh";
BASE64Decoder decoder = new BASE64Decoder();
byte[] decodedBytes = decoder.decodeBuffer(encodedBytes);
File file = new File("C:/Users/istest/Documents/test.pdf");
FileOutputStream fos = new FileOutputStream(file);
fos.write(decodedBytes);
Your string is not a valid PDF file.
A PDF file should start with its proper magic number (please refer to the Format indicators section of this link).
PDF files start with "%PDF" (hex 25 50 44 46),
or in Base64: JVBERi (the leading characters of "%PDF-" once encoded).
If you try your code with a valid PDF encoded string like this one, it might work.
But because you did not provide the code of your BASE64Decoder class, it is hard to be sure that it will work.
For that reason, here is a simple implementation using the java.util.Base64 class. Warning: do not copy/paste and run this example as is. To keep it short, the Base64 string in the code is a truncated, corrupted one, as the code comment notes; replace it with the complete valid string from the previous link first.
import java.io.File;
import java.io.FileOutputStream;
import java.util.Base64;

class Base64DecodePdf {
    public static void main(String[] args) {
        File file = new File("./test.pdf");
        try ( FileOutputStream fos = new FileOutputStream(file); ) {
            // To be short I use a corrupted PDF string, so make sure to use a valid one if you want to preview the PDF file
            String b64 = "JVBERi0xLjUKJYCBgoMKMSAwIG9iago8PC9GaWx0ZXIvRmxhdGVEZWNvZGUvRmlyc3QgMTQxL04gMjAvTGVuZ3==";
            byte[] decoder = Base64.getDecoder().decode(b64);
            fos.write(decoder);
            System.out.println("PDF File Saved");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Credit : source.
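The magic-number check described in this answer can also be done in code before anything is written to disk. The following is only a minimal sketch: the class name PdfMagicCheck and the method decodesToPdf are made up for illustration, and it only looks at the first four bytes, so a payload that passes is not guaranteed to be a complete, valid PDF.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper, not part of any library.
class PdfMagicCheck {
    // "%PDF" in ASCII is hex 25 50 44 46
    private static final byte[] PDF_MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the Base64 text decodes to bytes that begin with "%PDF". */
    static boolean decodesToPdf(String base64) {
        byte[] decoded;
        try {
            decoded = Base64.getDecoder().decode(base64);
        } catch (IllegalArgumentException e) {
            return false; // not valid Base64 at all
        }
        return decoded.length >= PDF_MAGIC.length
                && Arrays.equals(Arrays.copyOf(decoded, PDF_MAGIC.length), PDF_MAGIC);
    }

    public static void main(String[] args) {
        Base64.Encoder enc = Base64.getEncoder();
        // Plain text, like the string in the question: no PDF header
        String plainText = enc.encodeToString("just some text".getBytes(StandardCharsets.US_ASCII));
        // Only a PDF header line, enough to pass the magic-number test
        String pdfHeader = enc.encodeToString("%PDF-1.5\n".getBytes(StandardCharsets.US_ASCII));
        System.out.println(decodesToPdf(plainText)); // false
        System.out.println(decodesToPdf(pdfHeader)); // true
    }
}
```

Running the same check on the string from the question would return false, because it decodes to ordinary text rather than to a PDF header.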
I have a steganography project that hides a docx document inside a JPEG image. Using Apache POI, I can read the docx document, but only the text is read,
even though there are pictures in it.
Here is the code:
FileInputStream in = null;
try
{
in = new FileInputStream(directory);
XWPFDocument datax = new XWPFDocument(in);
XWPFWordExtractor extract = new XWPFWordExtractor(datax);
String DataFinal = extract.getText();
BufferedReader reader = new BufferedReader(new InputStreamReader(in));
String line = null;
this.isi_file = extract.getText();
}
catch (IOException x) {}
System.out.println("isi :" + this.isi_file);
How can I read all components of the docx document using Java? Thank you for your help.
Please check the documentation for the XWPFDocument class. It contains some useful methods, for example:
getAllPictures() returns a list of all pictures in the document;
getTables() returns a list of all tables in the document.
Your code snippet already contains the line XWPFDocument datax = new XWPFDocument(in);. After that line you can write something like:
// process all pictures in document
for (XWPFPictureData picture : datax.getAllPictures()) {
// get each picture as byte array
byte[] pictureData = picture.getData();
// process picture somehow
...
}
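As one possible way to "process picture somehow", the byte array can be inspected to decide which file extension to use when saving it. This is only a sketch using the standard library: ImageTypeSniffer and guessExtension are hypothetical names, it recognises just three common formats by their leading bytes, and anything else falls back to "bin".

```java
// Hypothetical helper, not part of Apache POI.
class ImageTypeSniffer {
    /** Guesses a file extension from the leading bytes of the image data. */
    static String guessExtension(byte[] data) {
        if (startsWith(data, 0x89, 'P', 'N', 'G')) return "png"; // PNG signature
        if (startsWith(data, 0xFF, 0xD8, 0xFF)) return "jpg";    // JPEG start-of-image marker
        if (startsWith(data, 'G', 'I', 'F', '8')) return "gif";  // "GIF87a" or "GIF89a"
        return "bin";                                            // unknown format
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // mask to compare the byte as an unsigned value
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pngHeader = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
        System.out.println(guessExtension(pngHeader)); // png
    }
}
```

Inside the loop above, the array returned by picture.getData() could be passed to such a helper before the bytes are written to a file.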
I have the task of extracting all the images from a docx file. I am using the snippet below, which relies on the Apache POI API.
File file = new File(InputFileString);
FileInputStream fs = new FileInputStream(file.getAbsolutePath());
//FileInputStream fs=new FileInputStream(src);
//create office word 2007+ document object to wrap the word file
XWPFDocument doc1x=new XWPFDocument(fs);
//get all images from the document and store them in the list piclist
List<XWPFPictureData> piclist=doc1x.getAllPictures();
//traverse through the list and write each image to a file
Iterator<XWPFPictureData> iterator=piclist.iterator();
int i=0;
while(iterator.hasNext()){
    XWPFPictureData pic=iterator.next();
    byte[] bytepic=pic.getData();
    BufferedImage imag=ImageIO.read(new ByteArrayInputStream(bytepic));
    ImageIO.write(imag, "jpg", new File("C:/imagefromword"+i+".jpg"));
    i++;
}
However, this code does not detect images that are in the header or footer sections of the document.
I have searched extensively and couldn't come up with anything useful.
Is there any way to capture the images in the footer section of the docx file?
I am no expert on Apache POI issues, but a simple search came up with this code:
package com.concretepage;
import java.io.FileInputStream;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.model.XWPFHeaderFooterPolicy;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFFooter;
import org.apache.poi.xwpf.usermodel.XWPFHeader;
public class ReadDOCXHeaderFooter {
public static void main(String[] args) {
try {
FileInputStream fis = new FileInputStream("D:/docx/read-test.docx");
XWPFDocument xdoc=new XWPFDocument(OPCPackage.open(fis));
XWPFHeaderFooterPolicy policy = new XWPFHeaderFooterPolicy(xdoc);
//read header
XWPFHeader header = policy.getDefaultHeader();
System.out.println(header.getText());
//read footer
XWPFFooter footer = policy.getDefaultFooter();
System.out.println(footer.getText());
} catch(Exception ex) {
ex.printStackTrace();
}
}
}
And the documentation page of XWPFHeaderFooter (which is the direct parent class of the XWPFFooter class in the above example) shows the same getAllPictures method you used to iterate over all the pictures in the document's body.
I'm on mobile, so I haven't really tested anything, but it seems straightforward enough to work.
Good luck!
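A different route that avoids POI altogether is to treat the .docx as the ZIP archive it is and list the image parts directly. This is a swapped-in technique, not the header/footer API from the answer above, and it rests on an assumption: that the file keeps its embedded images under word/media/, which is where Word normally puts them whether they are referenced from the body, a header or a footer. The class name DocxMediaLister is made up for this sketch, and it does not tell you which part of the document uses which image.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper using only the standard library.
class DocxMediaLister {
    /** Returns the names of all entries stored under word/media/ in the given .docx stream. */
    static List<String> listMedia(InputStream docx) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Assumption: images live under word/media/ in this file
                if (!entry.isDirectory() && entry.getName().startsWith("word/media/")) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a .docx file supplied by the caller
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String name : listMedia(in)) {
                System.out.println(name);
            }
        }
    }
}
```

To save the images instead of listing them, the bytes of each matching entry can be copied from the ZipInputStream while that entry is current.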
I am using some Arabic text in my app. On the simulator the Arabic text displays fine,
but on the device it does not display properly.
On the simulator it looks like مَرْحَبًا.
But on the device it looks like مرحبا.
What I need is مَرْحَبًا.
This answer shows how to create text resources for a MIDP application and how to load them at run-time. The technique is Unicode-safe, and so is suitable for all languages. The run-time code is small, fast, and uses relatively little memory.
Creating the Text Source
The process starts with creating a text file. When the file is loaded, each line becomes a separate String object, so you can create a file like:
اَللّٰهُمَّ اِنِّىْ اَسْئَلُكَ رِزْقًاوَّاسِعًاطَيِّبًامِنْ رِزْقِكَ
مَرْحَبًا
This needs to be in UTF-8 format. On Windows, you can create UTF-8 files in Notepad. Make sure you use Save As..., and select UTF-8 encoding.
Name the file arb.utf8.
This needs to be converted to a format that can be read easily by the MIDP application. MIDP does not provide convenient ways to read text files, like J2SE's BufferedReader. Unicode support can also be a problem when converting between bytes and characters. The easiest way to read text is to use DataInput.readUTF(). But to use this, we need to have written the text using DataOutput.writeUTF().
Below is a simple J2SE command-line program that reads the .utf8 file you saved from Notepad and creates a .res file to go in the JAR.
import java.io.*;
import java.util.*;

public class TextConverter {
    public static void main(String[] args) {
        if (args.length == 1) {
            String language = args[0];
            List<String> text = new Vector<String>();
            try {
                // read text from Notepad UTF-8 file
                InputStream in = new FileInputStream(language + ".utf8");
                try {
                    BufferedReader bufin = new BufferedReader(new InputStreamReader(in, "UTF-8"));
                    String s;
                    while ( (s = bufin.readLine()) != null ) {
                        // remove the byte order mark (U+FEFF) that Notepad adds at the start of the file
                        s = s.replaceAll("\ufeff", "");
                        text.add(s);
                    }
                } finally {
                    in.close();
                }
                // write it for easy reading in J2ME
                OutputStream out = new FileOutputStream(language + ".res");
                DataOutputStream dout = new DataOutputStream(out);
                try {
                    // first item is the number of strings
                    dout.writeShort(text.size());
                    // then the strings themselves
                    for (String s: text) {
                        dout.writeUTF(s);
                    }
                } finally {
                    dout.close();
                }
            } catch (Exception e) {
                System.err.println("TextConverter: " + e);
            }
        } else {
            System.err.println("syntax: TextConverter <language-code>");
        }
    }
}
To convert arb.utf8 to arb.res, run the converter as:
java TextConverter arb
Using the Text at Runtime
Place the .res file in the JAR.
In the MIDP application, the text can be read with this method:
public String[] loadText(String resName) throws IOException {
    String[] text;
    InputStream in = getClass().getResourceAsStream(resName);
    try {
        DataInputStream din = new DataInputStream(in);
        int size = din.readShort();
        text = new String[size];
        for (int i = 0; i < size; i++) {
            text[i] = din.readUTF();
        }
    } finally {
        in.close();
    }
    return text;
}
Load and use text like this:
String[] text = loadText("arb.res");
System.out.println("my arabic word from arb.res file ::"+text[0]+" second from arb.res file ::"+text[1]);
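The converter and the loader only work together because they agree on one layout: a short holding the number of strings, followed by each string written with writeUTF. A quick way to convince yourself that diacritics survive this format is an in-memory round trip on the desktop. This is a minimal J2SE sketch, not MIDP code; the class name ResRoundTrip and its two methods are made up for illustration, and the Arabic sample is written with Unicode escapes so the source file's own encoding does not matter.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical helper that mirrors the .res layout described above.
class ResRoundTrip {
    /** Writes a short count followed by each string via writeUTF. */
    static byte[] encode(String[] text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream dout = new DataOutputStream(bytes)) {
            dout.writeShort(text.length);
            for (String s : text) {
                dout.writeUTF(s);
            }
        }
        return bytes.toByteArray();
    }

    /** Reads the count, then that many strings via readUTF. */
    static String[] decode(byte[] data) throws IOException {
        try (DataInputStream din = new DataInputStream(new ByteArrayInputStream(data))) {
            String[] text = new String[din.readShort()];
            for (int i = 0; i < text.length; i++) {
                text[i] = din.readUTF();
            }
            return text;
        }
    }

    public static void main(String[] args) throws IOException {
        // An Arabic word with vowel marks, given as escapes, plus a plain ASCII string
        String[] original = {"\u0645\u064E\u0631\u0652\u062D\u064E\u0628\u064B\u0627", "hello"};
        String[] copy = decode(encode(original));
        System.out.println(Arrays.equals(original, copy)); // true
    }
}
```

If the round trip preserves the marks but the device still shows the text without them, the data is intact and the difference most likely lies in how the device's font renders combining characters.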
Hope this will help you. Thanks
I tried to display the contents of a .doc file on the console and got this error:
run:
The document is really a RTF file
Exception in thread "main" java.lang.NullPointerException
at DocReader.readDocFile(DocReader.java:36)
at DocReader.main(DocReader.java:47)
Java Result: 1
BUILD SUCCESSFUL (total time: 4 seconds)
Can anyone explain where I went wrong?
The code is:
import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
public class DocReader {
public void readDocFile() {
File docFile = null;
WordExtractor docExtractor = null ;
WordExtractor exprExtractor = null ;
try {
docFile = new File("C:\\web.doc");
FileInputStream fis=new FileInputStream(docFile.getAbsolutePath());
HWPFDocument doc=new HWPFDocument(fis);
docExtractor = new WordExtractor(doc);
}
catch(Exception exep)
{
System.out.println(exep.getMessage());
}
String [] docArray = docExtractor.getParagraphText();
for(int i=0;i<docArray.length;i++)
{
if(docArray[i] != null)
System.out.println("Line "+ i +" : " + docArray[i]);
}
}
public static void main(String[] args) {
DocReader reader = new DocReader();
reader.readDocFile();
}
}
The document is really a RTF file
That's the typical message of an IllegalArgumentException thrown by the HWPFDocument constructor. It means that the supplied file is actually an RTF file (for example, one saved by WordPad) whose .rtf extension has been incorrectly renamed to .doc.
Supply a real MS Word .doc file instead, and fix your code so that it does not continue after an exception has occurred. You need to rethrow it.
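If you want to catch this case before handing the file to POI, you can look at the first bytes yourself: RTF files begin with the characters {\rtf. The following is a rough sketch with made-up names (RtfSniffer, looksLikeRtf); it uses the C:\web.doc path from the question, only tests for the RTF signature, and does not prove that a file that passes is a valid Word document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Apache POI.
class RtfSniffer {
    // RTF files begin with the five ASCII characters {\rtf
    private static final byte[] RTF_MAGIC = "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes carry the RTF signature. */
    static boolean looksLikeRtf(byte[] head) {
        return head.length >= RTF_MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, RTF_MAGIC.length), RTF_MAGIC);
    }

    public static void main(String[] args) throws IOException {
        byte[] head = new byte[RTF_MAGIC.length];
        int n;
        // Path taken from the question
        try (InputStream in = Files.newInputStream(Paths.get("C:\\web.doc"))) {
            n = in.read(head); // may return fewer bytes than requested, or -1 for an empty file
        }
        if (looksLikeRtf(Arrays.copyOf(head, Math.max(n, 0)))) {
            // Fail fast instead of carrying on with a null extractor
            throw new IOException("File is RTF, not a binary Word .doc");
        }
        System.out.println("No RTF signature found");
    }
}
```

Throwing here, rather than printing and continuing, stops the program at the real cause instead of at the later NullPointerException.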
Just open the file in a word processor such as Microsoft Word, then use the "Save As" option and choose the .doc format.
The NullPointerException means that, at line 36 of DocReader.java, you are invoking a method on an object that has not been created yet. The solution is to make sure the instance is created before making that call.
UPDATE
My hunch is that the NullPointerException happens at docExtractor.getParagraphText() because docExtractor was never initialized. Instead of swallowing the exception, print the stack trace to find the actual problem, like this:
try {
...
}
catch(Exception exep) {
exep.printStackTrace(); // do this
}