Converting a PDF to text using Tesseract OCR - java

AIM: Convert a PDF to Base64, where the PDF can be either a regular (text) PDF or a scanned one.
I am using Tesseract OCR to convert scanned PDFs to text. Since I am working in Java, I am using the tess4j library for this.
The program flow I have in mind is as follows:
Get PDF file ---> Convert each page to an image using Ghost4j ---> Pass each image to tess4j for OCR ---> Convert the whole text to Base64.
I have been able to convert a PDF file to images using the following code:
package helpers;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.awt.Image;
import java.awt.image.RenderedImage;
import java.util.List;
import javax.imageio.ImageIO;
import org.ghost4j.document.DocumentException;
import org.ghost4j.document.PDFDocument;
import org.ghost4j.analyzer.FontAnalyzer;
import org.ghost4j.renderer.RendererException;
import org.ghost4j.renderer.SimpleRenderer;
import net.sourceforge.tess4j.*;
class encoder {
public static byte[] createByteArray(File pCurrentFolder, String pNameOfBinaryFile) {
String pathToBinaryData = pCurrentFolder.getAbsolutePath()+"/"+pNameOfBinaryFile;
File file = new File(pathToBinaryData);
if (!file.exists()) {
System.out.println(pNameOfBinaryFile+" could not be found in folder "+pCurrentFolder.getName());
return null;
}
FileInputStream fin = null;
try {
fin = new FileInputStream(file);
} catch (FileNotFoundException e) {
e.printStackTrace();
}
byte fileContent[] = new byte[(int) file.length()];
try {
if (fin != null)
fin.read(fileContent);
} catch (IOException e) {
e.printStackTrace();
}
return fileContent;
}
public void covertToImage(File pdfDoc) {
PDFDocument document = new PDFDocument();
try {
document.load(pdfDoc);
} catch (IOException e) {
e.printStackTrace();
}
SimpleRenderer renderer = new SimpleRenderer();
renderer.setResolution(300);
List<Image> images = null;
try {
images = renderer.render(document);
} catch (IOException e) {
e.printStackTrace();
} catch (RendererException e) {
e.printStackTrace();
} catch (DocumentException e) {
e.printStackTrace();
}
try {
if (images != null) {
// for testing only 1 page
ImageIO.write((RenderedImage) images.get(10), "png", new File("/home/cloudera/Downloads/1.png"));
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
public class encodeFile {
public static void main(String[] args) {
/* This part is for pure PDF files i.e. not scanned */
//byte[] arr = encoder.createByteArray(new File("/home/cloudera/Downloads/"), "test.pdf");
//String result = javax.xml.bind.DatatypeConverter.printBase64Binary(arr);
//System.out.println(result);
/* This part create the image for a page of scanned PDF file */
new encoder().covertToImage(new File("/home/cloudera/Downloads/isl99201.pdf")); // results in 1.png
/* This part is for OCR */
Tesseract instance = new Tesseract();
String res = instance.doOCR(new File("/home/cloudera/Downloads/1.png"));
System.out.println(res);
}
}
Running this produces the following errors.
This one occurs when I try to create an image from the PDF. If I remove tess4j from build.sbt, the image is created without any errors, but I need tess4j in the project.
Connected to the target VM, address: '127.0.0.1:46698', transport: 'socket'
Exception in thread "main" java.lang.AbstractMethodError: com.sun.jna.Structure.getFieldOrder()Ljava/util/List;
at com.sun.jna.Structure.fieldOrder(Structure.java:884)
at com.sun.jna.Structure.getFields(Structure.java:910)
at com.sun.jna.Structure.deriveLayout(Structure.java:1058)
at com.sun.jna.Structure.calculateSize(Structure.java:982)
at com.sun.jna.Structure.calculateSize(Structure.java:949)
at com.sun.jna.Structure.allocateMemory(Structure.java:375)
at com.sun.jna.Structure.<init>(Structure.java:184)
at com.sun.jna.Structure.<init>(Structure.java:172)
at com.sun.jna.Structure.<init>(Structure.java:159)
at com.sun.jna.Structure.<init>(Structure.java:151)
at org.ghost4j.GhostscriptLibrary$display_callback_s.<init>(GhostscriptLibrary.java:63)
at org.ghost4j.Ghostscript.buildNativeDisplayCallback(Ghostscript.java:381)
at org.ghost4j.Ghostscript.initialize(Ghostscript.java:336)
at org.ghost4j.renderer.SimpleRenderer.run(SimpleRenderer.java:105)
at org.ghost4j.renderer.AbstractRemoteRenderer.render(AbstractRemoteRenderer.java:86)
at org.ghost4j.renderer.AbstractRemoteRenderer.render(AbstractRemoteRenderer.java:70)
at helpers.encoder.covertToImage(encodeFile.java:62)
at helpers.encodeFile.main(encodeFile.java:86)
Disconnected from the target VM, address: '127.0.0.1:46698', transport: 'socket'
Process finished with exit code 1
This error occurs while passing any image to tess4j:
Connected to the target VM, address: '127.0.0.1:46133', transport: 'socket'
Exception in thread "main" java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract': Native library (linux-x86-64/libtesseract.so) not found in resource path (....)
at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:271)
at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:398)
at com.sun.jna.Library$Handler.<init>(Library.java:147)
at com.sun.jna.Native.loadLibrary(Native.java:412)
at com.sun.jna.Native.loadLibrary(Native.java:391)
at net.sourceforge.tess4j.util.LoadLibs.getTessAPIInstance(LoadLibs.java:78)
at net.sourceforge.tess4j.TessAPI.<clinit>(TessAPI.java:40)
at net.sourceforge.tess4j.Tesseract.init(Tesseract.java:360)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:273)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:205)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:189)
at helpers.encodeFile.main(encodeFile.java:89)
Disconnected from the target VM, address: '127.0.0.1:46133', transport: 'socket'
Process finished with exit code 1
I am working in IntelliJ using SBT on 64-bit CentOS 6.6. After some searching I understand the issues above, but I am facing two constraints:
The JNA library being pulled in is by default the latest version, i.e. 4.1.0. I read that an incompatibility between JNA and other libraries can cause this error, so I tried to specify the older version 3.4.0, but build.sbt keeps rejecting that.
I am on a 64-bit system and Tesseract would work with a 32-bit system. How should I integrate it into the project?
The following part of build.sbt handles all the required libraries:
"org.ghost4j" % "ghost4j" % "0.5.1",
"org.bouncycastle" % "bctsp-jdk14" % "1.46",
"net.sourceforge.tess4j" % "tess4j" % "2.0.0",
"com.github.jai-imageio" % "jai-imageio-core" % "1.3.0",
"net.java.dev.jna" % "jna" % "3.4.0" // does not make any difference as only 4.1.0 is installed.
Please help me out in this problem.
UPDATE: I added "net.java.dev.jna" % "jna" % "3.4.0" force() to build.sbt and it solved my first problem.

The solution to this issue lies in the Tesseract-API project that I found on GitHub. I forked it into my GitHub account, added a test for a scanned image and did some code refactoring. This way the library started to function properly. The scanned document I used for testing is here.
I built it successfully on Travis and it is now working fine on 32-bit as well as 64-bit systems.
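For the last step of the flow in the question (encoding the recognised text, or the raw PDF bytes, as Base64), a minimal sketch using only the JDK could look like this. It assumes Java 8 or later for java.util.Base64; the class and method names (Base64Step, encodeText, encodeFile) are made up for illustration and are not part of tess4j or Ghost4j. Reading with Files.readAllBytes also avoids relying on a single FileInputStream.read call, which is allowed to return fewer bytes than requested.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Hypothetical helper for the "convert to Base64" step; not part of any library.
final class Base64Step {

    // Encodes OCR output (a Java String) as Base64 over its UTF-8 bytes.
    static String encodeText(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Encodes the raw bytes of a file, e.g. a PDF that needs no OCR.
    static String encodeFile(Path file) throws IOException {
        return Base64.getEncoder().encodeToString(Files.readAllBytes(file));
    }

    // Reverses encodeText, handy for checking the round trip.
    static String decodeText(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }
}
```

The string returned by instance.doOCR(...) would be passed to encodeText. Note that javax.xml.bind.DatatypeConverter, used in the commented-out lines of the question, is no longer part of the JDK from Java 11 onwards.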

Related

Unable to locate and load file in Eclipse IDE, File not found

I'm new to Java, and I am facing this issue in Eclipse. Even after pointing it to the correct file, it shows a file-not-found error.
I am trying to compile code from a Java file using the Java Compiler API.
The code works fine in Visual Studio with everything placed in the root, but gives this error in Eclipse with all these directories.
Also, why are there three different src folders in the image?
My project structure
package com.example.app;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.io.File;
import java.io.IOException;
public class compilier {
public static void main(String[] args) throws IOException {
JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
int result = compiler.run(null, null, null, new File("com/example/app/Code.java").getAbsolutePath());
if (result == 0)
{
System.out.println("File Compiled");
}
try {
String package_dir = "/demo/src/main/java/com/example/app";
try{
ProcessBuilder builder = new ProcessBuilder("java", package_dir.concat("/Code"));
builder.redirectErrorStream(true);
File outfile = new File((package_dir.concat("/output.txt")));
builder.redirectOutput();
builder.start();
if (outfile.length() > 3000)
{
System.out.println("Exceeded buffer limit");
System.exit(1);
}
} catch(IOException e) {
e.printStackTrace();
}
} catch (Exception err) {
System.out.println("Error!");
err.printStackTrace();
}
}
}
Error Message
Your path looks wrong. The /demo directory would need to be in the root of your current drive.
Also, the output of a Maven build is found in the target directory. The Java class files are generated there, and the resource files are copied over from the src/main/resources hierarchy. The .java source files are not copied. You could add a Maven task to copy the .java files, but this would be very nonstandard.
Finally, you need to load resource files using the classpath. There are lots of examples on the Internet. Otherwise you may end up with a project that finds the file in Eclipse but not when deployed in a .jar or .war file.
Happy hunting.
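As a rough illustration of loading through the classpath rather than through a file path, here is a minimal sketch. The class name ResourceReader and the resource name in the usage note are hypothetical; it assumes Java 9 or later because of InputStream.readAllBytes.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads a text resource from the classpath.
final class ResourceReader {

    // A leading "/" makes the name relative to the classpath root,
    // e.g. "/com/example/app/Code.java" for a file under src/main/resources.
    static String readResource(String name) throws IOException {
        try (InputStream in = ResourceReader.class.getResourceAsStream(name)) {
            if (in == null) {
                // getResourceAsStream returns null instead of throwing.
                throw new FileNotFoundException("Not on classpath: " + name);
            }
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

With the file placed under src/main/resources/com/example/app/, a call such as ResourceReader.readResource("/com/example/app/Code.java") should behave the same in Eclipse and from a packaged .jar, since it no longer depends on the working directory.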

Project Stops working after compiling to jar

I want to create a .jar file that points to a .bat file to start a game server.
The Forge mod loader for the current version switched from starting the server via a .jar file to a .bat file, and my server provider currently has no solution for it. Small disclaimer: I haven't touched Java for 6 years, which is why I may not see the obvious.
For this, I found some code from Pavan.
There are two problems, though, for which I hope you may have a solution or some other workaround.
First, while in IntelliJ, "everything" works fine: main() runs and the "Hallo World" test .bat opens. After compiling it to a jar, nothing happens, even with a set file path.
Second, I've tried several spots, but System.exit(0) does not work after
int returnCode = CommandLineUtils.executeCommandLine(commandLine, systemOut, systemErr);
The code basically stops there and the process stays inactive, which could end badly on a game server where I have no access to the tools needed to clean this up myself... and I don't want to explain to Customer Support why there are 1000 instances of Java running in the background ;)
Regardless, thanks for your time and hopefully your help as well.
import java.io.File;
import java.io.OutputStreamWriter;
import org.codehaus.plexus.util.cli.CommandLineException;
import org.codehaus.plexus.util.cli.CommandLineUtils;
import org.codehaus.plexus.util.cli.Commandline;
import org.codehaus.plexus.util.cli.WriterStreamConsumer;
public class BatRunner {
public BatRunner() {
String batfile = "run.bat";
String directory = "C:\\Users\\User\\IdeaProjects";
try {
runProcess(batfile, directory);
} catch (CommandLineException e) {
e.printStackTrace();
}
}
public void runProcess(String batfile, String directory) throws CommandLineException {
Commandline commandLine = new Commandline();
File executable = new File(directory + "/" +batfile);
commandLine.setExecutable(executable.getAbsolutePath());
WriterStreamConsumer systemOut = new WriterStreamConsumer(
new OutputStreamWriter(System.out));
WriterStreamConsumer systemErr = new WriterStreamConsumer(
new OutputStreamWriter(System.out));
int returnCode = CommandLineUtils.executeCommandLine(commandLine, systemOut, systemErr);
System.exit(0);
if (returnCode != 0) {
System.out.println("Something Bad Happened!");
} else {
System.out.println("Taaa!! ddaaaaa!!");
}
}
public static void main(String[] args) {
new BatRunner();
}
}
Source: https://www.opencodez.com/java/how-to-execute-bat-file-from-java.htm/
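For comparison, the same launch can be sketched with the JDK's own ProcessBuilder, which removes the plexus-utils dependency from the picture when narrowing down the two problems. This is only a sketch under assumptions: the class name BatLauncher is made up, the script is assumed to be started through cmd.exe /c on Windows, and the script's own folder is assumed to be the right working directory.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical launcher; not taken from the linked article.
final class BatLauncher {

    // A .bat file is handed to cmd.exe; the absolute path avoids depending
    // on the directory the jar happens to be started from.
    static List<String> buildCommand(File batFile) {
        return Arrays.asList("cmd.exe", "/c", batFile.getAbsolutePath());
    }

    // Starts the script in its own folder and blocks until it exits.
    static int run(File batFile) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(batFile));
        builder.directory(batFile.getAbsoluteFile().getParentFile());
        builder.inheritIO(); // the child writes straight to this console
        Process process = builder.start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        File bat = new File(args.length > 0 ? args[0] : "run.bat");
        int exitCode = run(bat);
        System.out.println("Script finished with exit code " + exitCode);
        System.exit(exitCode);
    }
}
```

Note that waitFor() only returns once the script itself ends, so for a server that keeps running, any code placed after the call is not reached until the server stops. That is expected for a blocking call and may be what the question observes with executeCommandLine.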

Reading NFC Tag using JAVA Smart Card API not working on MAC OS

I am developing an application to read an NFC tag UID from an NFC reader (ACR122U-A9).
I used Java and the javax.smartcardio API to detect the NFC reader and read the NFC tag.
The application should display a notification when the NFC reader is connected to or disconnected from the PC. If the reader is connected and an NFC tag is presented, it should display a notification that an NFC tag is present.
I tried to find an event-based API to implement this but could not find one, so I used a Java Timer to poll for the NFC reader and the NFC tag.
The following is my sample Java code used for polling for the NFC reader and tag.
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.List;
import java.util.Timer;
import java.util.TimerTask;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.smartcardio.CardTerminal;
import javax.smartcardio.TerminalFactory;
/**
*
* @author sa
*/
public class NFC_Test {
/**
* @param args the command line arguments
*/
static Timer timer;
public static void main(String[] args) {
try {
timer = new Timer(); //At this line a new Thread will be created
timer.scheduleAtFixedRate(new NFC_Test.MyTask(), 0, 1000);
} catch (Exception ex) {
Logger.getLogger(NFC_Test.class.getName()).log(Level.SEVERE, null, ex);
}
}
static class MyTask extends TimerTask {
public void run() {
///////////////////This fix applied after reading thread at http://stackoverflow.com/a/16987873/1411888
try {
Class pcscterminal =
Class.forName("sun.security.smartcardio.PCSCTerminals");
Field contextId = pcscterminal.getDeclaredField("contextId");
contextId.setAccessible(true);
if (contextId.getLong(pcscterminal) != 0L) {
Class pcsc =
Class.forName("sun.security.smartcardio.PCSC");
Method SCardEstablishContext = pcsc.getDeclaredMethod(
"SCardEstablishContext", new Class[]{Integer.TYPE});
SCardEstablishContext.setAccessible(true);
Field SCARD_SCOPE_USER =
pcsc.getDeclaredField("SCARD_SCOPE_USER");
SCARD_SCOPE_USER.setAccessible(true);
long newId = ((Long) SCardEstablishContext.invoke(pcsc, new Object[]{Integer.valueOf(SCARD_SCOPE_USER.getInt(pcsc))})).longValue();
contextId.setLong(pcscterminal, newId);
}
} catch (Exception ex) {
}
///////////////////////////////////////////////////////////////////////////////////////////////////////////////
TerminalFactory factory = null;
List<CardTerminal> terminals = null;
try {
factory = TerminalFactory.getDefault();
terminals = factory.terminals().list();
} catch (Exception ex) { //
Logger.getLogger(NFC_Test.class.getName()).log(Level.SEVERE,null, ex);
}
if (factory != null && factory.terminals() != null && terminals
!= null && terminals.size() > 0) {
try {
CardTerminal terminal = terminals.get(0);
if (terminal != null) {
System.out.println(terminal);
if (terminal.isCardPresent()) {
System.out.println("Card");
} else {
System.out.println("No Card");
}
} else {
System.out.println("No terminal");
}
terminal = null;
} catch (Exception e) {
Logger.getLogger(NFC_Test.class.getName()).log(Level.SEVERE,null, e);
}
factory = null;
terminals = null;
Runtime.getRuntime().gc();
} else {
System.out.println("No terminal");
}
}
}
}
The above code works fine on Windows, but when I run it on Mac OS the application runs perfectly for 5-10 seconds and then suddenly crashes with the following memory error.
java(921,0x10b0c3000) malloc: *** mmap(size=140350941302784) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Java Result: 139
I searched the internet and could not find anything regarding the above memory error. I also included code for memory management to release the objects used in the timer task by assigning null to them.
I have used http://ludovicrousseau.blogspot.com/2010/06/pcsc-sample-in-java.html for reference.
I believe this was one of the errors I was getting when trying to track down the bugs with libj2pcsc.dylib on 64-bit Java on OS X. See also smartcardio thread on discussions.apple.com and my email to security-dev. Basically, the problem is that DWORD* should be a pointer to a 32-bit number on OS X, but Sun’s library assumed that it was a pointer to a 64-bit number. Then it dereferences that value and tries to malloc a buffer of that size, which can contain junk in the upper 32 bits. See Java_sun_security_smartcardio_PCSC_SCardListReaders in the source of pcsc.c
Potential workarounds:
Be very conservative with calls to Terminals.list() (which crashes intermittently), and don’t trust the results of Terminal.isCardPresent(), Terminals.waitForChange(long), or CardTerminal.waitForCard(boolean, long). My co-worker realized that he could call TerminalImpl.SCardGetStatusChange(long, long, int[], String[]) using reflection to get the right results. This is what we used to do. Very painful!
Fix the header files for libj2pcsc.dylib and recompile OpenJDK. This is what we do at my company right now.
Switch to a different implementation of javax.smartcardio. I know of two: my own jnasmartcardio and intarsys/smartcard-io. I have not tried my own library on NFC cards, however, but I welcome any bug reports and patches.
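The 32-bit versus 64-bit DWORD mix-up described above can be illustrated with the size from the malloc error in the question. The sketch below is not the JDK's native code; the class name DwordWidthDemo is made up, and the idea that 4096 is the length that was actually meant is an assumption based only on the low 32 bits of the reported number.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustration only: shows how a 32-bit length read as 64 bits picks up junk.
final class DwordWidthDemo {

    // Size reported by "mmap(size=...) failed" in the question.
    static final long REPORTED_SIZE = 140350941302784L;

    static int low32(long value) {
        return (int) value;
    }

    static int high32(long value) {
        return (int) (value >>> 32);
    }

    // Simulates native memory: a 4-byte length followed by 4 unrelated bytes,
    // then read back as if the length were 8 bytes wide.
    static long readAsSixtyFourBits(int realLength, int neighbouringBytes) {
        ByteBuffer memory = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
        memory.putInt(realLength);
        memory.putInt(neighbouringBytes);
        return memory.getLong(0);
    }

    public static void main(String[] args) {
        System.out.printf("low 32 bits  = %d%n", low32(REPORTED_SIZE));      // 4096
        System.out.printf("high 32 bits = 0x%X%n", high32(REPORTED_SIZE));   // 0x7FA6
        System.out.println(readAsSixtyFourBits(4096, 0x7FA6) == REPORTED_SIZE); // true
    }
}
```

The low half is a plausible buffer size, while the high half looks like leftover neighbouring memory, which is consistent with the explanation that the library dereferences a 64-bit value where only 32 bits were written.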

Recompile OpenCV Java for Eclipse

I'm using OpenCV for an object detection project. I'm trying to read frames from a stored video file using VideoCapture, but in OpenCV Java there is no current implementation. I followed the instructions in this post: open video file with opencv java, to edit the source files of OpenCV Java to allow this functionality. The problem is that I don't know how to recompile the files, since I originally just added the downloaded OpenCV jar file to my Eclipse project.
You should probably try JavaCV, an OpenCV wrapper for Java.
This post shows what you need to download/install to get things working on your system, but I'm sure you can find more updated posts around the Web.
One of the demos I present during my OpenCV mini-courses contains source code that uses JavaCV to load a video file and display it in a window:
import static com.googlecode.javacv.cpp.opencv_core.*;
import static com.googlecode.javacv.cpp.opencv_imgproc.*;
import static com.googlecode.javacv.cpp.opencv_highgui.*;
import com.googlecode.javacv.OpenCVFrameGrabber;
import com.googlecode.javacv.FrameGrabber;

public class OpenCV_tut4
{
    public static void main(String[] args)
    {
        FrameGrabber grabber = new OpenCVFrameGrabber("demo.avi");
        if (grabber == null)
        {
            System.out.println("!!! Failed OpenCVFrameGrabber");
            return;
        }
        cvNamedWindow("video_demo");
        try
        {
            grabber.start(); // initialize video capture
            IplImage frame = null;
            while (true)
            {
                frame = grabber.grab(); // capture a single frame
                if (frame == null)
                {
                    System.out.println("!!! Failed grab");
                    break;
                }
                cvShowImage("video_demo", frame);
                int key = cvWaitKey(33);
                if (key == 27) // ESC was pressed, abort!
                    break;
            }
        }
        catch (Exception e)
        {
            System.out.println("!!! An exception occurred");
        }
    }
}

Setting pptx Theme in Java

I am trying to merge some pptx documents programmatically using Java. I figured out how to do this in essence using Apache POI, but the documents I am trying to merge do not work.
After significant searching and trial and error, I figured out that the reason is that the pptx documents do not have theme information (i.e., if I open them in PowerPoint and check the slide master view, it's blank). If I go to the themes in the Design ribbon, select 'Office Theme' or another theme, and then save, the files merge charmingly. Otherwise, I run into the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Failed to fetch default style for otherStyle and level=0
at org.apache.poi.xslf.usermodel.XSLFTextParagraph.getDefaultMasterStyle(XSLFTextParagraph.java:1005)
at org.apache.poi.xslf.usermodel.XSLFTextParagraph.fetchParagraphProperty(XSLFTextParagraph.java:1029)
at org.apache.poi.xslf.usermodel.XSLFTextParagraph.isBullet(XSLFTextParagraph.java:654)
at org.apache.poi.xslf.usermodel.XSLFTextParagraph.copy(XSLFTextParagraph.java:1044)
at org.apache.poi.xslf.usermodel.XSLFTextShape.copy(XSLFTextShape.java:631)
at org.apache.poi.xslf.usermodel.XSLFSheet.appendContent(XSLFSheet.java:358)
at com.apsiva.main.Snippet.main(Snippet.java:28)
The following is the code I ran:
package com.apsiva.main;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.xslf.usermodel.SlideLayout;
import org.apache.poi.xslf.usermodel.XMLSlideShow;
import org.apache.poi.xslf.usermodel.XSLFSlide;
import org.apache.poi.xslf.usermodel.XSLFSlideLayout;
public class Snippet {
/** Merge the pptx files in the array <decks> to the desired destination
* chosen in <outputPath> */
public static void main(String[] args) {
try {
FileInputStream empty = new FileInputStream("C:/Users/Alex/workspace/OutputWorker/tmp/base2.pptx");
XMLSlideShow pptx;
pptx = new XMLSlideShow(empty);
XSLFSlideLayout defaultLayout = pptx.getSlideMasters()[0].getLayout(SlideLayout.TITLE_AND_CONTENT);
FileInputStream is = new FileInputStream("C:/Users/Alex/workspace/OutputWorker/tmp/noWork.pptx");
// FileInputStream is = new FileInputStream("C:/Users/Alex/workspace/OutputWorker/tmp/works2.pptx");
XMLSlideShow src = new XMLSlideShow(is);
is.close();
for (XSLFSlide srcSlide: src.getSlides()){
pptx.createSlide(defaultLayout).appendContent(srcSlide);
}
FileOutputStream out = new FileOutputStream("C:/POI-TEST-OUTPUT.pptx");
pptx.write(out);
out.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
I want to get these files to merge and I believe the solution is to programmatically assign the theme to the files. How can it be done?
Thank you for your consideration!
In some cases, when you have generated pptx files (e.g. JasperReports exports), invalid values might be written for different fields, for example line spacing (which can be a percentage) or special characters, and Apache POI XSLF doesn't know how to handle these values. When opening the file, PowerPoint automatically adjusts these values to valid ones. When using Apache POI, you have to identify these fields individually and adjust them manually.
I had a similar issue, but with line spacing, and worked around it by setting the value for each paragraph like this:
List<XSLFShape> shapes = srcSlide.getShapes();
for (XSLFShape xslfShape : shapes) {
    if (xslfShape instanceof XSLFTextShape) {
        List<XSLFTextParagraph> textParagraphs = ((XSLFTextShape) xslfShape).getTextParagraphs();
        for (XSLFTextParagraph textParagraph : textParagraphs) {
            textParagraph.setLineSpacing(10d);
        }
    }
}
This worked like a charm.
A more effective way is to do this directly on the XML object:
List<CTShape> ctShapes = srcSlide.getXmlObject().getCSld().getSpTree().getSpList();
for (CTShape ctShape : ctShapes) {
    // Shapes without a text body have no paragraphs to fix.
    if (ctShape.getTxBody() == null) {
        continue;
    }
    List<CTTextParagraph> ctTextParagraphs = ctShape.getTxBody().getPList();
    for (CTTextParagraph paragraph : ctTextParagraphs) {
        // Remove the explicit line spacing so the default applies.
        if (paragraph.getPPr() != null && paragraph.getPPr().getLnSpc() != null) {
            paragraph.getPPr().unsetLnSpc();
        }
    }
}
Another option is to patch Apache POI itself. In
/ApachePOI/src/ooxml/java/org/apache/poi/xslf/usermodel/XSLFTextParagraph.java
in the method
CTTextParagraphProperties getDefaultMasterStyle()
add:
if( o.length == 0 ) {
    return null;
}
