OCR could not convert some characters properly in Java - java

I am trying to read text from elements that are rendered inside a canvas. I capture the canvas region as an image and then extract the text from that image with an OCR library. The problem is that some characters are not recognized correctly.
I am using Selenium webdriver, Java and Maven.
Code to extract text from image:
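import com.asprise.ocr.Ocr;
import java.io.File;

Ocr.setUp(); // one-time setup
Ocr ocr = new Ocr(); // create a new OCR engine
ocr.startEngine("eng", Ocr.SPEED_FASTEST); // English
// recognize the text in the captured image and return it as plain text
String textFromImage = ocr.recognize(new File[]{new File("E:\\Device.png")},
        Ocr.RECOGNIZE_TYPE_TEXT, Ocr.OUTPUT_FORMAT_PLAINTEXT);
System.out.println(textFromImage);
ocr.stopEngine(); // release the engine when done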
Currently I am trying to read the text in the Google search box shown in the image below:
Expected output: What is swift in iOS
Actual OCR output: Whatls swlflln IOS (as in the following screenshot):
Somehow the OCR does not convert some characters to match what is in the image. Is there any other solution for this?
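One thing that is often worth trying with small on-screen text is to enlarge the captured image and convert it to grayscale before handing it to the OCR engine. The sketch below is not from the original thread: it uses only the JDK, the class name OcrPreprocess and the method upscaleToGray are hypothetical, and whether it fixes this particular misreading is an assumption that has to be tested on the real screenshot.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of the Asprise library.
class OcrPreprocess {

    /** Returns a grayscale copy of src, enlarged by the given integer factor. */
    static BufferedImage upscaleToGray(BufferedImage src, int factor) {
        if (factor < 1) {
            throw new IllegalArgumentException("factor must be >= 1");
        }
        int width = src.getWidth() * factor;
        int height = src.getHeight() * factor;

        // TYPE_BYTE_GRAY drops the colour information while the image is redrawn.
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = out.createGraphics();
        try {
            // Bicubic interpolation keeps glyph edges smoother than the default.
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, width, height, null);
        } finally {
            g.dispose();
        }
        return out;
    }
}
```

The screenshot can be loaded with ImageIO.read(new File("E:\\Device.png")), passed through upscaleToGray with a factor of 2 or 3, written back with ImageIO.write(image, "png", file), and the new file given to ocr.recognize in place of the original.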

Related

Extract image data based on coordinates or Tesseract and write the content to a doc/docx Word file

I have an image and want to extract its data, with the same layout and in readable form, into a docx file using Python. I have tried the following:
Applying Tesseract to the image and converting it to PDF using pytesseract
Then converting the PDF to a Word file
But I am not able to maintain the layout and formatting.
This question has been answered before. You can use the pdf2image library for this issue:
from pdf2image import convert_from_path
pages = convert_from_path('sample.pdf', 400)  # 400 is the image quality in DPI (default 200)
pages[0].save("sample.png")

How to insert barcode in PDF using PDFBox 2.0.13?

I am trying to insert a barcode into a PDF using PDFBox 2.0.13. I tried using a BufferedImage for this, as described in
How to add Code128 Barcode image to existing pdf using pdfbox(1.8.12) with barcode4j library?
but that answer uses "new PDPixelMap(doc, bim)", and PDPixelMap is deprecated in 2.0.x.
My question is: how do we insert a barcode into a PDF with the APIs available in PDFBox 2.0.13 (probably a replacement for PDPixelMap), without using PDPixelMap? A code snippet would be great.
Use LosslessFactory like this:
PDImageXObject img = LosslessFactory.createFromImage(doc, bim);
contentStream.drawImage(img, x, y);

How can I write Java code with syntax highlighting using PDFBox

I am trying to create a Java project that simply prints a project's source code, be it Java, PHP, C++ or others.
I can create the PDF just fine with iText, but now I need to highlight the Java code I read in the same way a code editor like Sublime does. I discovered PDFBox, a library for creating and manipulating PDF files, but I can't find how to highlight code text (like Sublime does) using this library. Any help?
Copying from another SO question: highlight text using pdfbox when it's location in the pdf is known
// PDFBox 1.8.x API
PDDocument doc = PDDocument.load(/*path to the file*/);
PDPage page = (PDPage) doc.getDocumentCatalog().getAllPages().get(i);
List annots = page.getAnnotations();
PDAnnotationTextMarkup markup = new PDAnnotationTextMarkup(PDAnnotationTextMarkup.Su....);
markup.setRectangle(/*your PDRectangle*/);
markup.setQuadPoints(/*float array of size eight with all the vertices of the PDRectangle in anticlockwise order*/);
annots.add(markup);
doc.save(/*path to the output file*/);
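The eight-float array that the snippet above asks for can be built from the rectangle's corners. This is a minimal sketch that follows the ordering the answer describes (anticlockwise, here assumed to start at the lower-left corner); the class QuadPoints and its method are hypothetical and not part of PDFBox. If the highlight renders oddly in a given viewer, the corner order is the first thing to check.

```java
// Hypothetical helper, not part of PDFBox.
class QuadPoints {

    /**
     * Returns the four corners of a rectangle as x,y pairs in anticlockwise
     * order (PDF coordinates, y pointing up), starting at the lower-left corner:
     * lower-left, lower-right, upper-right, upper-left.
     */
    static float[] fromRectangle(float llx, float lly, float urx, float ury) {
        return new float[] {
            llx, lly,   // lower-left
            urx, lly,   // lower-right
            urx, ury,   // upper-right
            llx, ury    // upper-left
        };
    }
}
```

With a PDRectangle named rect, the four arguments would come from rect.getLowerLeftX(), rect.getLowerLeftY(), rect.getUpperRightX() and rect.getUpperRightY().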

java - OCR with Asprise library

I am making an Android app that captures a photo and saves the text from it using OCR. This is my code using the Asprise library, but something is wrong with the call to the "recognize" method:
Ocr.setUp();
Ocr ocr = new Ocr();
ocr.startEngine("eng", Ocr.SPEED_FASTEST);
String s = ocr.recognize(theImage, Ocr.RECOGNIZE_TYPE_ALL, Ocr.OUTPUT_FORMAT_PLAINTEXT);
ocr.stopEngine();
"theImage" is a Bitmap, but the method expects a "RenderedImage" there (I thought a Bitmap counted as rendered too). Also, the fourth parameter of "recognize" is "Object... propSpec", yet the sample on the official Asprise site passes only 3 parameters. The arguments on the "recognize" line are now underlined in red. What should I change so that my code works properly?
P.S. Of course, I've heard of the tess-two library, but adding it in Android Studio is a bit complicated for me (I don't know why it can't be added with a single line in build.gradle).
I have implemented what you want to do with the following code, and it works as I wanted. Other issues may come from the file readers on your PC: for example, if you want to OCR a PDF file, a PDF reader should be installed.
Ocr.setUp();
Ocr ocr = new Ocr();
ocr.startEngine("eng", Ocr.SPEED_FASTEST);
String s = ocr.recognize(new File[] {new File(path)},
Ocr.RECOGNIZE_TYPE_ALL, Ocr.OUTPUT_FORMAT_PLAINTEXT);
System.out.println("Result: \n" + s);
ocr.stopEngine();
System.out.println("---END---");

Parsing multiple images from a web page and display to an android device

I am trying to parse the images that are displayed at this link, http://lawncare.ncsu.edu/RSSFeed.aspx, and display them on an Android device. Right now I am only able to parse the text associated with the images. Can anyone suggest how to go about parsing these images? Preferably without JSoup, because I am already halfway through the code.
Try looking at this question: How to parse XML using the SAX parser
specifically the answer given by damiean and his solution.
This is the part of my code where I parse the RSS feed:
Bundle b = new Bundle();
b.putString("title", feed.getItem(position).getTitle());
b.putString("description", feed.getItem(position).getDescription());
b.putString("content", feed.getItem(position).getContent());
b.putString("link", feed.getItem(position).getLink());
b.putString("pubdate", feed.getItem(position).getPubDate().toString());
In my case the "content" field contains both the description (text) and the image. I passed the whole "content" field into a WebView, and it displays the image as well as the text correctly.
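If the image URLs themselves are needed (for example to load them into an ImageView instead of a WebView), they can be pulled out of the "content" HTML without JSoup. The following is only a rough sketch using java.util.regex; the class FeedImages and its method are hypothetical names, and a regular expression like this assumes reasonably well-formed img tags with quoted src attributes rather than handling arbitrary HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class FeedImages {

    // Matches <img ... src="..."> or <img ... src='...'>, case-insensitively.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    /** Returns the src value of every img tag found in the given HTML, in order. */
    static List<String> extractImageUrls(String html) {
        List<String> urls = new ArrayList<>();
        if (html == null) {
            return urls;
        }
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(2)); // group 2 is the text between the quotes
        }
        return urls;
    }
}
```

In the code above it would be called as FeedImages.extractImageUrls(feed.getItem(position).getContent()), and each returned URL could then be downloaded and displayed separately.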
