OCR implementation using Java - java

I have written Java code that converts an image into text, but it takes only a single image as input. I want the program to fetch the images from a folder and run OCR on each of them.
My code is:
import java.io.FileOutputStream;
import org.bytedeco.javacpp.*;
import org.junit.Test;
import static org.bytedeco.javacpp.lept.*;
import static org.bytedeco.javacpp.tesseract.*;
import static org.junit.Assert.assertTrue;
import java.io.File;
public class BasicTesseractExampleTest {
@Test
public void givenTessBaseApi_whenImageOcrd_thenTextDisplayed() throws Exception {
BytePointer outText;
TessBaseAPI api = new TessBaseAPI();
// Initialize tesseract-ocr with English, without specifying tessdata path
if (api.Init(".", "ENG") != 0) {
System.err.println("Could not initialize tesseract.");
System.exit(1);
}
PIX image = pixRead("IMG_0012 (1).jpg");
api.SetImage(image);
// Get OCR result
outText = api.GetUTF8Text();
String string = outText.getString();
assertTrue(!string.isEmpty());
System.out.println(string);
// Destroy used object and release memory
api.End();
outText.deallocate();
pixDestroy(image);
}
}

To read the list of files in a given path, use for example:
File f = new File("C:/programs");
File[] fileArray = f.listFiles();
Now you can check whether each File in fileArray is a directory and skip it if so:
if(fileArray[0].isDirectory()) continue;
To find the images you can, for example, check the ending of the file name:
fileArray[0].getName().endsWith(".jpg")
Do this check for every file in fileArray and call your method with the files that pass. To use the right file, change this line of your code:
PIX image = pixRead("IMG_0012 (1).jpg");
so that it passes the path of fileArray[?] (for example fileArray[?].getAbsolutePath()) instead of the fixed name, where ? must be replaced with the index of the current file.
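The folder scan described in this answer could be sketched roughly as follows. The class and method names are made up for illustration, and accepting ".jpeg" and upper-case extensions is an extra assumption that goes beyond the ".jpg" check in the answer.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: collects the JPEG files that sit directly inside a folder.
class ImageFolderScanner {
    static List<File> listJpegFiles(File folder) {
        List<File> images = new ArrayList<>();
        File[] entries = folder.listFiles();
        if (entries == null) {
            // listFiles() returns null when the path is not a directory or cannot be read
            return images;
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                continue; // skip sub-folders, as suggested in the answer
            }
            // Assumption: compare case-insensitively so "IMG.JPG" is also picked up
            String name = entry.getName().toLowerCase(Locale.ROOT);
            if (name.endsWith(".jpg") || name.endsWith(".jpeg")) {
                images.add(entry);
            }
        }
        return images;
    }
}
```

Each returned file could then be handed to the OCR code from the question, for instance by calling pixRead(file.getAbsolutePath()) inside a loop and releasing the PIX and the text pointer after every image.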

Related

Replace Pattern match with Preferred Text Java

Hey, I am trying to replace a regex pattern in a directory of files with the character 'X'. I started out trying to alter one file, but that is not working. I came up with the following code; any help would be appreciated.
My goal is to read all the file content, find the regex pattern and replace it.
Also, this code is not working: it runs but does nothing to the text file.
import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;
public class DataChange {
public static void main(String[] args) throws IOException {
String absolutePathOne = "C:\\Users\\hoflerj\\Desktop\\After\\test.txt";
String[] files = { "test.txt" };
for (String file : files) {
File f = new File(file);
String content = FileUtils.readFileToString(new File(absolutePathOne));
FileUtils.writeStringToFile(f, content.replaceAll("2018(.+)", "X"));
}
}
}
The content inside the file is:
3-MAAAA2017/2/00346
I am trying to have it read through and replace 2017/2/00346 with XXX's.
My goal is also to do this for about 3 files at one time.
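Two things in the posted code stand out: the pattern "2018(.+)" cannot match the sample line, which contains 2017, and the result is written to the relative path "test.txt" rather than to the file that was read. A minimal sketch of the replacement step alone is shown below. The class name and the pattern are assumptions inferred from the single sample line, not something stated in the question.

```java
import java.util.regex.Pattern;

// Hypothetical helper: masks identifiers shaped like 2017/2/00346.
class DigitMasker {
    // Assumed shape: four digits, a slash, one or more digits, a slash, one or more digits
    private static final Pattern ID = Pattern.compile("\\d{4}/\\d+/\\d+");

    static String mask(String content) {
        // Every match is replaced by the fixed text "XXX"
        return ID.matcher(content).replaceAll("XXX");
    }
}
```

With this, mask("3-MAAAA2017/2/00346") yields "3-MAAAAXXX". To cover several files, the same call could be applied in a loop, reading and writing the same absolute path for each file.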

How to read many images in java without memory leak

I need to read many images and process them one after another. At first I used the ImageIO library to read each image:
File outputfile = new File(uri);
BufferedImage imgBuff = ImageIO.read(outputfile);
imgBuff.flush();
imgBuff = null;
outputfile = null;
However, this takes up a lot of memory and my process crashes. After doing some research I found that there are many reported issues with reading many images using the Java ImageIO library. I used this simple program, with this image http://tinyurl.com/ku3ff7w, to verify that the memory leak was caused by reading the images:
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;
import org.junit.Test;
public class MemoryLeakTest {
static File outputfile = null;
static BufferedImage imgBuff = null;
public static void main(String args[]) {
String uri = "/home/user/Pictures/image.jpg";
outputfile = new File(uri);
for (int i = 0; i < 15000; i++) {
outputfile = new File(uri);
try {
imgBuff = ImageIO.read(outputfile);
} catch (IOException e) {
e.printStackTrace();
} finally {
if (imgBuff != null) {
imgBuff.flush();
imgBuff = null;
}
outputfile = null;
}
}
}
}
I have also tried using the ImageJ library, but the same problem occurred when converting the image to a BufferedImage:
ImagePlus bb = op.openImage(uri);
imgBuff = bb.getBufferedImage();
bb.killStack();
bb.flush();
bb.close();
I guess I could read the images as byte arrays and that would solve the problem, but that solution is not ideal. Does anyone know of a library or method to read many images in Java without running out of memory?
My solution was to use im4java (a Java interface to the ImageMagick command line).
That way I delegate the image manipulation to an external process, which keeps the JVM from running out of memory.
See http://im4java.sourceforge.net/ and http://www.imagemagick.org/
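The idea of delegating the work to an external process can be illustrated without im4java by calling the ImageMagick command line through ProcessBuilder. This is a sketch of the general approach, not the im4java API. It assumes an ImageMagick installation whose executable is named convert (newer releases use magick), and the class name and the resize operation are chosen only for illustration.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: runs an image operation outside the JVM.
class ExternalResize {
    // Builds the argument list for: convert <input> -resize <width>x<height> <output>
    static List<String> resizeCommand(String input, String output, int width, int height) {
        return Arrays.asList("convert", input, "-resize", width + "x" + height, output);
    }

    // Starts the external process and waits for it; the pixel data never enters the Java heap.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor(); // exit code 0 normally means success
    }
}
```

A caller would loop over the image paths and invoke run(resizeCommand(in, out, 800, 600)) for each one, checking the exit code before moving on.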

How to get pictures with names from an xls file using Apache POI

Using workbook.getAllPictures() I can get an array of picture data but unfortunately it is only the data and those objects have no methods for accessing the name of the picture or any other related information.
There is a HSSFPicture class which would contain all the details of the picture but how to get for example an array of those objects from the xls?
Update:
Found the SO question "How can I find a cell, which contain a picture in apache poi", which has a method for looping through all the pictures in the worksheet. That works.
Now that I was able to try the HSSFPicture class I found out that the getFileName() method is returning the file name without the extension. I can use the getPictureData().suggestFileExtension() to get a suggested file extension but I really would need to get the extension the picture had when it was added into the xls file. Would there be a way to get it?
Update 2:
The pictures are added into the xls with a macro. This is the part of macro that is adding the images into the sheet. fname is the full path and imageName is the file name, both are including the extension.
Set img = Sheets("Receipt images").Pictures.Insert(fname)
img.Left = 10
img.top = top + 10
img.Name = imageName
Set img = Nothing
The routine to check if the picture already exists in the Excel file.
For Each img In Sheets("Receipt images").Shapes
If img.Name = imageName Then
Set foundImage = img
Exit For
End If
Next
This recognizes that "image.jpg" is different from "image.gif", so the img.Name includes the extension.
The shape names are not exposed by the default POI objects, so if we need them we have to work with the underlying records. For shapes in HSSF that is mainly the EscherAggregate (http://poi.apache.org/apidocs/org/apache/poi/hssf/record/EscherAggregate.html), which we can get from the sheet. Through its parent class AbstractEscherHolderRecord we can get all EscherOptRecords, which contain the options of the shapes. The groupshape.shapenames can also be found among those options.
My example is not a complete solution. It is only provided to show which objects could be used to achieve this.
Example:
import org.apache.poi.hssf.usermodel.*;
import org.apache.poi.ss.usermodel.*;
import java.io.FileOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.hssf.record.*;
import org.apache.poi.ddf.*;
import java.util.List;
import java.util.ArrayList;
class ShapeNameTestHSSF {
public static void main(String[] args) {
try {
InputStream inp = new FileInputStream("workbook1.xls");
Workbook wb = WorkbookFactory.create(inp);
Sheet sheet = wb.getSheetAt(0);
EscherAggregate escherAggregate = ((HSSFSheet)sheet).getDrawingEscherAggregate();
EscherContainerRecord escherContainer = escherAggregate.getEscherContainer().getChildContainers().get(0);
//throws java.lang.NullPointerException if no Container present
List<EscherRecord> escherOptRecords = new ArrayList<EscherRecord>();
escherContainer.getRecordsById(EscherOptRecord.RECORD_ID, escherOptRecords);
for (EscherRecord escherOptRecord : escherOptRecords) {
for (EscherProperty escherProperty : ((EscherOptRecord)escherOptRecord).getEscherProperties()) {
System.out.println(escherProperty.getName());
if (escherProperty.isComplex()) {
System.out.println(new String(((EscherComplexProperty)escherProperty).getComplexData(), "UTF-16LE"));
} else {
if (escherProperty.isBlipId()) System.out.print("BlipId = ImageId = ");
System.out.println(((EscherSimpleProperty)escherProperty).getPropertyValue());
}
System.out.println("=============================");
}
System.out.println(":::::::::::::::::::::::::::::");
}
FileOutputStream fileOut = new FileOutputStream("workbook1.xls");
wb.write(fileOut);
fileOut.flush();
fileOut.close();
} catch (InvalidFormatException ifex) {
} catch (FileNotFoundException fnfex) {
} catch (IOException ioex) {
}
}
}
Again: this is not a ready-to-use solution. One cannot be provided here because of the complexity of the EscherRecords. To get the correct EscherRecords for the image shapes and their related EscherOptRecords, you may have to loop recursively through all EscherRecords in the EscherAggregate, checking whether each one is a container record and, if so, looping through its children, and so on.
Start here:
http://poi.apache.org/spreadsheet/quick-guide.html#Images
This tutorial can help you extract an image's information from an xls spreadsheet using Apache POI.

Get MIME type from dicom files in java

I have tried all the following:
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.nio.file.Files;
public class mimeDicom {
public static void main(String[] argvs) throws IOException{
String path = "Image003.dcm";
String[] mime = new String[3];
File file = new File(path);
mime[0] = Files.probeContentType(file.toPath());
mime[1] = URLConnection.guessContentTypeFromName(file.getName());
InputStream is = new BufferedInputStream(new FileInputStream(file));
mime[2] = URLConnection.guessContentTypeFromStream(is);
for(String m: mime)
System.out.println("mime: " + m);
}
}
But the results are still: mime: null for each of the tried methods above and I really want to know if the file is a DICOM as sometimes they don't have the extension or have a different one.
How can I know if the file is a DICOM from the path?
Note: this is not a duplicate of "How to accurately determine mime data from a file?", because the excellent list of magic numbers there doesn't cover DICOM files, and Apache Tika returns application/octet-stream, which doesn't really identify the file as an image and isn't useful because NIfTI files (among others) get exactly the same MIME type from Tika.
To determine whether a file is DICOM, your best bet is to parse the file yourself and check that it contains the magic bytes "DICM" at file offset 128.
The first 128 bytes are usually 0 but may contain anything.
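The check described in this answer could look roughly like the following. The class and method names are made up for illustration, and the sketch only tests for the "DICM" marker at offset 128; it does not validate anything else in the file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: tests for the "DICM" marker that follows the 128-byte preamble.
class DicomSniffer {
    private static final int PREAMBLE_LENGTH = 128;
    private static final byte[] MAGIC = "DICM".getBytes(StandardCharsets.US_ASCII);

    static boolean looksLikeDicom(Path file) throws IOException {
        byte[] header = new byte[PREAMBLE_LENGTH + MAGIC.length];
        try (InputStream in = Files.newInputStream(file)) {
            int total = 0;
            // read() may return fewer bytes than requested, so keep reading until the header is full
            while (total < header.length) {
                int n = in.read(header, total, header.length - total);
                if (n == -1) {
                    return false; // file is shorter than 132 bytes
                }
                total += n;
            }
        }
        // The preamble content is ignored; only bytes 128..131 are compared
        for (int i = 0; i < MAGIC.length; i++) {
            if (header[PREAMBLE_LENGTH + i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Because only the first 132 bytes are read, the check stays cheap even for large image files and does not depend on the file extension.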

generate executable jar at runtime

I'd like to write a Java app which can create executable jars at runtime. The "hello world" of what I want to do is write a Java app X that when run, generates an executable jar Y that when run, prints hello world (or perhaps another string not known until after Y is run).
How can I accomplish this?
The other answers require starting a new process; this method doesn't. Here are three class definitions that produce the hello world scenario described in the question.
When you run XMain.main, it generates /tmp/y.jar. Then, when you run this at the command line:
java -jar /tmp/y.jar cool
It prints:
Hello darling Y!
cool
example/YMain.java
package example;
import java.io.IOException;
import java.io.InputStream;
public class YMain {
public static void main(String[] args) throws IOException {
// Fetch and print message from X
InputStream fromx = YMain.class.getClassLoader().getResourceAsStream("fromx.txt");
System.out.println(new String(Util.toByteArray(fromx)));
// Print first command line argument
System.out.println(args[0]);
}
}
example/XMain.java
package example;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;
public class XMain {
public static void main(String[] args) throws IOException {
Manifest manifest = new Manifest();
manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
manifest.getMainAttributes().put(Attributes.Name.MAIN_CLASS, YMain.class.getName());
JarOutputStream jarOutputStream = new JarOutputStream(new FileOutputStream("/tmp/y.jar"), manifest);
// Add the main class
addClass(YMain.class, jarOutputStream);
// Add the Util class; Y uses it to read our secret message
addClass(Util.class, jarOutputStream);
// Add a secret message
jarOutputStream.putNextEntry(new JarEntry("fromx.txt"));
jarOutputStream.write("Hello darling Y!".getBytes());
jarOutputStream.closeEntry();
jarOutputStream.close();
}
private static void addClass(Class<?> c, JarOutputStream jarOutputStream) throws IOException
{
String path = c.getName().replace('.', '/') + ".class";
jarOutputStream.putNextEntry(new JarEntry(path));
jarOutputStream.write(Util.toByteArray(c.getClassLoader().getResourceAsStream(path)));
jarOutputStream.closeEntry();
}
}
example/Util.java
package example;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
public class Util {
public static byte[] toByteArray(InputStream in) throws IOException {
ByteArrayOutputStream out = new ByteArrayOutputStream();
byte[] buf = new byte[0x1000];
while (true) {
int r = in.read(buf);
if (r == -1) {
break;
}
out.write(buf, 0, r);
}
return out.toByteArray();
}
}
Do you have to write it in plain old Java? I'd use Gradle (a Groovy-based build tool). You can have a custom task write out the source files for Y (Groovy makes it really easy to write out templated files). Gradle makes it easy to generate an executable jar.
If you really want to roll your own from scratch, you'd need to call javac via the Process API to compile the source and then use ZipOutputStream to zip up the compiled files.
Maybe a bit more info about why you want to do this would help get better answers.
cheers
Lee
To elaborate on Lee's reply, you need to compile the source first. You can use Process or you can use the code from tools.jar directly as explained here. Then write out a MANIFEST.MF file and put it all together using ZipOutputStream as mentioned.
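The compile-then-package flow could be sketched as below. Instead of Process or the classes in tools.jar, this version uses the standard javax.tools.JavaCompiler API, which is a swapped-in technique and needs a JDK rather than a plain JRE at runtime. The class name RuntimeJarBuilder, the generated class Hello and the directory layout are assumptions made for the example.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Hypothetical helper: generates, compiles and packages a one-class program.
class RuntimeJarBuilder {
    static void buildHelloJar(Path workDir, Path jarFile, String message) throws IOException {
        // Assumption: message contains no quotes or backslashes, since it is pasted into source text
        String source = "public class Hello {\n"
                + "  public static void main(String[] args) {\n"
                + "    System.out.println(\"" + message + "\");\n"
                + "  }\n"
                + "}\n";
        Path sourceFile = workDir.resolve("Hello.java");
        Files.write(sourceFile, source.getBytes(StandardCharsets.UTF_8));

        // Step 1: compile the generated source into workDir
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler available; run on a JDK");
        }
        int status = compiler.run(null, null, null, "-d", workDir.toString(), sourceFile.toString());
        if (status != 0) {
            throw new IOException("Compilation failed with status " + status);
        }

        // Step 2: write the manifest and the class file into the jar
        Manifest manifest = new Manifest();
        manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        manifest.getMainAttributes().put(Attributes.Name.MAIN_CLASS, "Hello");
        try (OutputStream out = Files.newOutputStream(jarFile);
             JarOutputStream jar = new JarOutputStream(out, manifest)) {
            jar.putNextEntry(new JarEntry("Hello.class"));
            jar.write(Files.readAllBytes(workDir.resolve("Hello.class")));
            jar.closeEntry();
        }
    }
}
```

Running java -jar on the resulting file should then execute the generated main method. JarOutputStream writes META-INF/MANIFEST.MF itself when a Manifest is passed to its constructor, so no separate manifest file has to be created.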
Step 1: figure out how to do it manually using the command line.
Step 2: automate this by calling the program from within Java.
http://devdaily.com/java/edu/pj/pj010016/
For step 1 I would suggest using ant, since IDEs are not always automatable. So either write out all the files from Java, or include some of the ant configuration files as resources in the project.
