Is it possible with Tess4J to generate the byte[] of an OCR'd PDF instead of a physical file?
I need to make PDF files searchable via OCR. That works, but I would like to avoid writing the result to disk.
Tesseract tessInst = new Tesseract();
tessInst.setDatapath("C:\\Tess4J");
List<RenderedFormat> list = new ArrayList<RenderedFormat>();
list.add(RenderedFormat.PDF);
tessInst.createDocuments(inputFile.getPath(), "C:\\a\\b\\b\\Tess4J\\filename", list); // I don't want to create this file, I just need a byte[]!
Thx!
No, Tesseract does not support it. TessPDFRendererCreate expects a string for file path as input.
https://tesseract-ocr.github.io/tessapi/5.x/a00008.html
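Since the renderer insists on a path, one common workaround is to let it write into a temporary directory, read the bytes back, and delete the file straight away. The following is only a sketch of that idea using the JDK alone: the FileProducer interface and the renderToBytes helper are hypothetical names, and the lambda in main is a stand-in for the real Tess4J call (it assumes the renderer appends ".pdf" to the output base, which is worth checking against your Tess4J version).

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TempFileBytes {

    /** Hypothetical callback: anything that writes its result to "<outputBase><extension>". */
    public interface FileProducer {
        void produce(Path outputBase) throws Exception;
    }

    /** Runs the producer against a temporary output base and returns the produced file as bytes. */
    public static byte[] renderToBytes(FileProducer producer, String extension) throws Exception {
        Path dir = Files.createTempDirectory("ocr-");
        Path base = dir.resolve("result");
        Path produced = dir.resolve("result" + extension);
        try {
            producer.produce(base);
            return Files.readAllBytes(produced);
        } finally {
            // Clean up so that no file outlives the call.
            Files.deleteIfExists(produced);
            Files.deleteIfExists(dir);
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in producer. With Tess4J the lambda body would instead be something like
        // tessInst.createDocuments(inputPath, base.toString(), formats) (assumed usage).
        byte[] pdf = renderToBytes(
                base -> Files.write(Paths.get(base + ".pdf"),
                        "%PDF-stub".getBytes(StandardCharsets.US_ASCII)),
                ".pdf");
        System.out.println(pdf.length + " bytes read back");
    }
}
```

A file still touches the disk briefly, so this hides the step rather than removing it. Pointing java.io.tmpdir at a RAM-backed location can reduce the cost if that matters.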
I am using the Tesseract Java API (tess4J) to convert Tiff images to PDFs.
This works nicely, but I am forced to write both the source Tiff image and the output PDF to local filestore as actual physical files in order to use the TessAPI1.TessPDFRendererCreate API.
Please note the following about the code snippet below:
The input Tiff is originally a java.awt.image.BufferedImage, but I have to write it to a physical file (sourceTiffFile is a File object).
I must specify a file path for the output (pdfFullFilepath is a String representing an absolute path for the new PDF file).
try {
ImageIO.write(bufferedImage, "tiff", sourceTiffFile);
} catch (Exception ioe) {
//handling code...
}
TessResultRenderer renderer = TessAPI1.TessPDFRendererCreate(pdfFullFilepath, dataPath, 0);
TessAPI1.TessResultRendererInsert(renderer, TessAPI1.TessPDFRendererCreate(pdfFullFilepath, dataPath, 0));
int result = TessAPI1.TessBaseAPIProcessPages(handle, sourceTiffFile.getAbsolutePath(), null, 0, renderer);
I would really like to avoid creating physical files, but am not sure if it is possible with this API. Ideally, I would like to pass the Tiff as a java.awt.image.BufferedImage or a byte array and receive the output PDF as a byte array.
Any suggestions would be most welcome as always. Thank you :)
You can pass in ProcessPage API method a Pix, which can be converted from a BufferedImage, but the output will still be a physical file. Tesseract API dictates that.
https://tesseract-ocr.github.io/tessapi/4.0.0/a01625.html
http://tess4j.sourceforge.net/docs/docs-4.4/net/sourceforge/tess4j/TessAPI1.html
For example:
int result = TessAPI1.TessBaseAPIProcessPage(handle, LeptUtils.convertImageToPix(bufferedImage), page_index, "input file name", null, 0, renderer);
I got a strange issue with a GIF image in Java. The image is provided by an XML API as Base64 encoded string. To decode the Base64, I use the commons-codec library in version 1.13.
When I just decode the Base64 string and write the bytes out to a file, the image shows properly in browsers and MS Paint (nothing else to test here).
final String base64Gif = "[Base64 as provided by API]";
final byte[] sigImg = Base64.decodeBase64(base64Gif);
File sigGif = new File("C:/Temp/pod_1Z12345E5991872040.org.gif");
try (FileOutputStream fos = new FileOutputStream(sigGif)) {
fos.write(sigImg);
fos.flush();
}
The resulting file opened in MS Paint:
But when I now start consuming this file using Java (for example creating a PDF document from HTML using the openhtmltopdf library), it is corrupted and does not show properly.
final String htmlLetterStr = "[HTML as provided by API]";
final Document doc = Jsoup.parse(htmlLetterStr);
try (FileOutputStream fos = new FileOutputStream(new File("C:/Temp/letter_1Z12345E5991872040.pdf"))) {
PdfRendererBuilder builder = new PdfRendererBuilder();
builder.useFastMode();
builder.withW3cDocument(new W3CDom().fromJsoup(doc), "file:///C:/Temp/");
builder.toStream(fos);
builder.useDefaultPageSize(210, 297, BaseRendererBuilder.PageSizeUnits.MM);
builder.run();
fos.flush();
}
When I now open the resulting PDF, the image created above looks like this. It seems that only the first pixel lines are printed, some layer is missing, or something like that.
The same happens, if I read the image again with ImageIO and try to convert it into PNG. The resulting PNG looks exactly the same as the image printed in the PDF document.
How can I get the image to display properly in the PDF document?
Edit:
Link to original GIF Base64 as provided by API: https://pastebin.com/sYJv6j0h
As @haraldK pointed out in the comments, the GIF file provided via the XML API does not conform to the GIF standard and thus cannot be parsed correctly by Java's ImageIO API.
Since there does not seem to exist a pure Java tool to repair the file, the workaround I came up with now is to use ImageMagick via Java's Process API. Calling the convert command with the -coalesce option will parse the broken GIF and create a new one that does conform to the GIF standard.
// Decode broken GIF image and write to disk
final String base64Gif = "[Base64 as provided by API]";
final byte[] sigImg = Base64.decodeBase64(base64Gif);
Path gifPath = Paths.get("C:/Temp/pod_1Z12345E5991872040.tmp.gif");
if (!Files.exists(gifPath)) {
Files.createFile(gifPath);
}
Files.write(gifPath, sigImg, StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING);
// Use the Java Process API to call ImageMagick (on Linux you would use the 'convert' binary)
ProcessBuilder procBuild = new ProcessBuilder();
procBuild.command("C:\\Program Files\\ImageMagick-7.0.9-Q16\\magick.exe", "C:\\Temp\\pod_1Z12345E5991872040.tmp.gif", "-coalesce", "C:\\Temp\\pod_1Z12345E5991872040.gif");
Process proc = procBuild.start();
// Wait for ImageMagick to complete its work
proc.waitFor();
The newly created file can be read by Java's ImageIO API and be used as expected.
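If the round trip through the file system is unwanted, an external tool can often be fed through stdin and read back through stdout instead. Below is a minimal sketch of such a helper using only the JDK (Java 9+ for readAllBytes). The class and method names are made up for illustration, and the ImageMagick arguments in the usage note are an untested assumption about the installed version.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;

public class ProcessPipe {

    /** Sends input to the command's stdin and returns everything it writes to stdout. */
    public static byte[] pipeThrough(List<String> command, byte[] input)
            throws IOException, InterruptedException {
        Process proc = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show the tool's error messages
                .start();

        // Feed stdin on a separate thread so a full pipe buffer cannot deadlock the read below.
        Thread feeder = new Thread(() -> {
            try (OutputStream stdin = proc.getOutputStream()) {
                stdin.write(input);
            } catch (IOException ignored) {
                // The process closed its stdin early; the exit code check reports real failures.
            }
        });
        feeder.start();

        byte[] output;
        try (InputStream stdout = proc.getInputStream()) {
            output = stdout.readAllBytes();
        }
        feeder.join();

        int exitCode = proc.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command exited with code " + exitCode);
        }
        return output;
    }
}
```

With ImageMagick 7 the call would presumably look like pipeThrough(List.of("magick", "gif:-", "-coalesce", "gif:-"), sigImg), where "gif:-" asks for GIF data on stdin and stdout. Unlike the snippet above, this version also checks the exit code, so a failed conversion does not go unnoticed.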
To be specific about what I am asking: I do not want to modify any MP4 files, just extract metadata such as width, height, bitrate, and encoding, which is not in the MP3 tags.
I have tested Xuggle, which works, but I need a library that does not use JNI or any native code.
I already looked into MP4Parser and Apache Tika, but neither seems to extract this metadata; they only expose tag info or alter the file.
Is there such a Java library?
I actually found what I was looking for using Mp4Parser.
Here are a few simple lines of code that get what I wanted using Mp4Parser:
FileChannel fc = new FileInputStream("content/Video_720p_Madagascar-3.mp4").getChannel();
IsoFile isoFile = new IsoFile(fc);
MovieBox moov = isoFile.getMovieBox();
for(Box b : moov.getBoxes()) {
System.out.println(b);
}
b contains all the info I needed, now I just have to parse b to get exactly what I want.
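For anyone curious what "parsing b" involves, here is a rough, library-free sketch of reading width and height from the track header (tkhd) box. It is not how Mp4Parser does it: the class and method names are invented, it handles only the moov > trak > tkhd path described in ISO/IEC 14496-12, and it assumes the data fits into a single ByteBuffer (under 2 GB). The main method runs against a synthetic buffer rather than a real file.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class Mp4BoxSketch {

    /** Returns {width, height} of the first track with non-zero dimensions, or null. */
    public static int[] findDimensions(ByteBuffer buf, int start, int end) {
        int pos = start;
        while (pos + 8 <= end) {
            long size = buf.getInt(pos) & 0xFFFFFFFFL;          // 32-bit big-endian box size
            byte[] typeBytes = new byte[4];
            for (int i = 0; i < 4; i++) typeBytes[i] = buf.get(pos + 4 + i);
            String type = new String(typeBytes, StandardCharsets.ISO_8859_1);
            int header = 8;
            if (size == 1) {                                     // 64-bit size follows the type
                if (pos + 16 > end) return null;
                size = buf.getLong(pos + 8);
                header = 16;
            } else if (size == 0) {                              // box extends to the end
                size = end - pos;
            }
            if (size < header || size > end - pos) return null;  // malformed input
            int body = pos + header;
            int boxEnd = pos + (int) size;
            if (type.equals("moov") || type.equals("trak")) {
                int[] found = findDimensions(buf, body, boxEnd); // descend into container boxes
                if (found != null) return found;
            } else if (type.equals("tkhd")) {
                int version = buf.get(body) & 0xFF;
                // Width sits 76 bytes into a version 0 body and 88 bytes into a version 1 body.
                int widthPos = body + (version == 1 ? 88 : 76);
                if (widthPos + 8 <= boxEnd) {
                    int width = buf.getInt(widthPos) >>> 16;     // 16.16 fixed point
                    int height = buf.getInt(widthPos + 4) >>> 16;
                    if (width > 0 && height > 0) return new int[] {width, height};
                }
            }
            pos = boxEnd;
        }
        return null;
    }

    /** Builds a minimal moov > trak > tkhd (version 0) structure for demonstration. */
    public static ByteBuffer syntheticMovie(int width, int height) {
        ByteBuffer buf = ByteBuffer.allocate(108);
        putHeader(buf, 0, 108, "moov");
        putHeader(buf, 8, 100, "trak");
        putHeader(buf, 16, 92, "tkhd");
        buf.putInt(100, width << 16);   // tkhd body starts at 24, so width is at 24 + 76
        buf.putInt(104, height << 16);
        return buf;
    }

    private static void putHeader(ByteBuffer buf, int pos, int size, String type) {
        buf.putInt(pos, size);
        byte[] t = type.getBytes(StandardCharsets.ISO_8859_1);
        for (int i = 0; i < 4; i++) buf.put(pos + 4 + i, t[i]);
    }

    public static void main(String[] args) {
        ByteBuffer movie = syntheticMovie(1280, 720);
        int[] dims = findDimensions(movie, 0, movie.capacity());
        System.out.println(dims[0] + "x" + dims[1]); // prints 1280x720
    }
}
```

For a real file the buffer could come from FileChannel.map, with start = 0 and end = the file size. Audio tracks report zero dimensions, which is why the sketch skips them.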
At present (2015), Android SDK provides MediaMetadataRetriever class for extracting metadata:
MediaMetadataRetriever m = new MediaMetadataRetriever();
m.setDataSource(mInputFile);
String extractedHeight = m.extractMetadata(MediaMetadataRetriever.METADATA_KEY_VIDEO_HEIGHT);
String extractedWidth = m.extractMetadata(MediaMetadataRetriever.METADATA_KEY_VIDEO_WIDTH);
I have a byte[] of an image and I need to upload it as an image to picasa.
According to the documentation, an image is uploaded as follows.
MediaFileSource myMedia = new MediaFileSource(new File("lights.jpg"), "image/jpeg");
which means I need to create a File, out of the byte[].
The catch is, I have to do this without using FileOutputStream as it is not supported by Google App Engine (which is the environment I am using)
Is there any way to do this?
You don't have to use MediaFileSource to upload a photo. You can use MediaByteArraySource instead and pass it to photo.setMediaSource(...).
I have been stuck in converting WMF/EMF images into standard image format such as JPG or PNG using Java.
What are the best options available?
The Batik library is a toolkit to handle SVG in Java. There are converters included like WMFTranscoder to convert from WMF to SVG and JPEGTranscoder and PNGTranscoder to convert SVG to JPEG/PNG. See Transcoder API Docs for more details.
Another alternative is ImageMagick. It's not Java but has Java bindings: im4java and JMagick.
wmf is a vector file format. For best results, convert them to .svg or .pdf format.
I did it in two stages
1) wmf2fig --auto XXXX.wmf
2) fig2pdf --nogv XXXX.fig
I created a python script for bulk conversion
import subprocess as sbp
# List all WMF files in the current directory (text mode, so the names are str rather than bytes)
a = sbp.Popen("ls *.wmf", shell=True, stderr=sbp.PIPE, stdout=sbp.PIPE, universal_newlines=True)
filelist = a.communicate()[0].splitlines()
# Stage 1: convert every WMF file to FIG
for ele in filelist:
    cmdarg = 'wmf2fig --auto ' + ele.rsplit('.', 1)[0] + '.wmf'
    a = sbp.Popen(cmdarg, shell=True, stderr=sbp.PIPE, stdout=sbp.PIPE)
    out = a.communicate()
# Stage 2: convert every FIG file to PDF
for ele in filelist:
    cmdarg = 'fig2pdf --nogv ' + ele.rsplit('.', 1)[0] + '.fig'
    a = sbp.Popen(cmdarg, shell=True, stderr=sbp.PIPE, stdout=sbp.PIPE)
    out = a.communicate()
# Remove the intermediate FIG files
cmdarg = 'rm *.fig'
a = sbp.Popen(cmdarg, shell=True, stderr=sbp.PIPE, stdout=sbp.PIPE)
out = a.communicate()
If you are deploying your application in a Windows environment, then SWT can handle the conversion for you.
Image image = new Image(Display.getCurrent(), "test.wmf");
ImageLoader loader = new ImageLoader();
loader.data = new ImageData[] { image.getImageData() };
try(FileOutputStream stream = new FileOutputStream("test.png"))
{
loader.save(stream, SWT.IMAGE_PNG);
}
image.dispose();
The purpose of SWT is to provide a Java wrapper around native functionality, and in this case it is calling the windows GDI directly to get it to render the WMF.
I've created some wrappers around the Batik package (as mentioned by vanje's answer) some time ago, that provides ImageIO support for SVG and WMF/EMF.
With these plugins you should be able to write:
ImageIO.write(ImageIO.read(wmfFile), "png", pngFile);
Source code on GitHub.
While the ImageIO plugins are convenient, im4java and JMagick might still have better format support.
Here is one way.
1) Get (or make) a Java component that can render the files in question.
2) Create a BufferedImage the same size as the component needs to display the image.
3) Get the Graphics object from the BufferedImage.
4) Call renderComponent.paintComponent(Graphics)
5) Save the image using one of the ImageIO.write() variants.
See my answer to Swing: Obtain Image of JFrame for steps 2-5. Step 1. is something I'd ask Google about.
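The steps above (minus finding a real WMF/EMF renderer) can be sketched roughly as follows. DemoRenderer is a hypothetical stand-in that just fills its area with a colour, and the sketch calls the public paint(Graphics) method, which in turn invokes paintComponent, because paintComponent itself is protected.

```java
import java.awt.Color;
import java.awt.Graphics;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;
import javax.swing.JComponent;

public class ComponentToImage {

    /** Hypothetical stand-in for a component that knows how to draw the WMF/EMF content. */
    public static class DemoRenderer extends JComponent {
        @Override
        protected void paintComponent(Graphics g) {
            g.setColor(Color.RED);
            g.fillRect(0, 0, getWidth(), getHeight());
        }
    }

    /** Paints the component into an off-screen image of the given size. */
    public static BufferedImage render(JComponent component, int width, int height) {
        component.setSize(width, height);
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        try {
            component.paint(g); // public entry point; calls paintComponent() internally
        } finally {
            g.dispose();
        }
        return image;
    }

    /** Encodes the image as PNG in memory. */
    public static byte[] toPng(BufferedImage image) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(image, "png", out);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        BufferedImage image = render(new DemoRenderer(), 200, 100);
        byte[] png = toPng(image);
        System.out.println("PNG size in bytes: " + png.length);
    }
}
```

Writing to a ByteArrayOutputStream keeps the result in memory. Passing a File to ImageIO.write instead saves it to disk.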