How to read PDF templates using java OCR

Can someone suggest a solution for the scenario below?
We have menus from restaurants. Each restaurant has its own menu. The goal is to identify the elements in the menu, such as menu items, toppings, prices etc., and update the database.
For example: a restaurant menu can contain menu items such as "Chicken", "Vegetarian" etc. under a group called "Sandwiches".
For that I am planning to use a Java implementation of OCR. Will this work out?

If you want to use OCR inside your code, you can go with Tesseract OCR with some native development. It is a very powerful library that produces output quickly. This link is for a wrapper class for Tesseract, or you can use Tess4J as an alternative to the first one. This is the same library used by Google, and you can also add support for multiple languages.

Convert the PDF to an image (using JavaCV etc.) and OCR it using Tesseract or Tess4J. It is not a permanent solution or the best one, but it works great!

If you are typing up the PDF, then using it, there's no need to do this; simply read the PDF (see below). However, if you are scanning in the PDF (an image, not text), you will need to resort to OCR.
To read the PDF from a file, you could use something like iText or PDFBox.

Interesting project! Java or any other language, I would think that OCR is not accurate enough for what you need. Menus are often printed with non-standard fonts and sometimes with background images making it difficult for OCR to accurately read every word. Then you have the challenge of formatting. Some menus may organize the content by Chicken, Vegetarian, Beef. Others may have categories like Light Fare, Entree, Appetizer, small plates.
This strikes me as a real data engineering challenge. While menus seem like they are hierarchical, their actual structure is very flexible and varies a great deal from one to another. Adding OCR adds typos to this whole mess, and now you cannot simply look for words like "chicken" because you may actually have Chicen or Cichen or (h1ckn.
Maybe I've never used really great OCR software and I'm imagining a problem that isn't there. I would think that most restaurants type their menus on computers and you are better off trying to get them to share those files with you.
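One common way to cope with OCR misspellings like the ones mentioned above is to compare each recognized word against a list of expected terms using edit distance. Here is a minimal sketch of that idea, not something from the original answer. The class name, the vocabulary, and the distance threshold are all assumptions made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

class FuzzyMenuMatcher {

    // Classic dynamic-programming Levenshtein distance
    // (counts single-character inserts, deletes and substitutions).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the closest known term (case-insensitive),
    // or null if nothing is within maxDistance edits.
    static String closest(String ocrWord, List<String> vocabulary, int maxDistance) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        String needle = ocrWord.toLowerCase();
        for (String term : vocabulary) {
            int d = levenshtein(needle, term.toLowerCase());
            if (d < bestDistance) {
                bestDistance = d;
                best = term;
            }
        }
        return bestDistance <= maxDistance ? best : null;
    }

    public static void main(String[] args) {
        // Hypothetical vocabulary; a real one would come from your database.
        List<String> vocabulary = Arrays.asList("Chicken", "Vegetarian", "Beef");
        for (String word : new String[] {"Chicen", "Cichen", "(h1ckn"}) {
            System.out.println(word + " -> " + closest(word, vocabulary, 3));
        }
    }
}
```

The threshold is a trade-off: a small value misses badly garbled words, while a large value starts matching unrelated words, so it would need tuning against real OCR output.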

Related

Alternative to Markdown with Color support

I am working on a note app (Android and a REST API built with PHP/Slim 3). I am wondering if there is something other than Markdown for saving notes in a readable and interchangeable format. The problem with Markdown for me is that there is no way to style text (e.g. colored text). It is also hard to extend Markdown with custom attributes.
I am already thinking of creating my own data format (or using XML), but this means a lot of work for parsing it. I like the idea of using a standard format to interchange notes between client and server and with other applications. But the feature set of Markdown is very limited (by design, for sure).
Do you have any tips on this topic?

This question verges on overly-broad, i.e. it may lead to an argument over technologies rather than a "this is the solution" situation.
That being said, here's an answer I think won't be controversial: when you say
"readable, interchangeable format... solution to style texts... custom attributes"
I think HTML. I don't recommend trying to roll your own format, because 1.) you are correct that it will be difficult and 2.) it will be even more difficult to match the feature sets of existing solutions.

To sum it up: I like the idea of using HTML instead of Markdown. It is an open standard format and exchangeable as well as human-readable.
The problem I see with all of these solutions: How to write a WYSIWYG-Editor with this in mind? I am already working with Markdown using the Markwon library: https://github.com/noties/Markwon
It is no problem to write Markdown in an Android EditText widget and render it. You can easily convert it back to plaintext (you can save it). It is much more complicated to get a WYSIWYG experience. You have to deal with every User input, writing in a second file or string which contains the Markup while the user just sees the rendered result. The user can edit/delete anything anywhere in the EditText and you have to take care that those changes will affect the Markdown String/File too. I didn't find an easy solution for this.
The easiest way would be to somehow parse the content of the EditText back to Markdown. But there is no getSpannables method or the like for the EditText widget. I am thinking of looping through the EditText to see what character is there and how it's formatted. But I think this will have disadvantages too, because there are other things like bulleted lists and checkboxes.

Is it possible to do this type of search in Java

I am stuck on a project at work that I do not think is really possible and I am wondering if someone can confirm my belief that it isn't possible or at least give me new options to look at.
We are doing a project for a client that involved a mass download of files from a server (easily done with ftp4j and a document name list), but now we need to sort through the data from the server. The client works with contracts and wants us to pull out relevant information such as: licensor, licensee, product, agreement date, termination date, royalties, restrictions.
Since the documents are completely unstandardized, is that even possible to do? I can imagine loading in the files and searching it but I would have no idea how to pull out information from a paragraph such as the licensor and restrictions on the agreement. These are not hashes but instead are just long contracts. Even if I were to search for 'Licensor' it will come up in the document multiple times. The documents aren't even in a consistent file format. Some are PDF, some are text, some are html, and I've even seen some that were as bad as being a scanned image in a pdf.
My boss keeps pushing for me to work on this project but I feel as if I am out of options. I primarily do web and mobile so big data is really not my strong area. Does this sound possible to do in a reasonable amount of time? (We're talking about at the very minimum 1000 documents). I have been working on this in Java.

I'll do my best to give you some information, as this is not my area of expertise. I would highly consider writing a script that identifies the type of file you are dealing with, and then calls the appropriate parsing methods to handle what you are looking for.
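To illustrate the "identify the type, then dispatch" idea, here is a rough sketch in Java (since that is what you are working in). It checks for the "%PDF-" signature that PDF files start with and otherwise falls back to the file extension. The class, enum, and method names are made up for this example, and the strings returned by nextStep are only placeholders for whatever extractor you end up plugging in.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class DocumentTypeSniffer {

    enum DocType { PDF, HTML, TEXT, UNKNOWN }

    // PDF files begin with the ASCII signature "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    static boolean startsWithPdfMagic(byte[] head) {
        if (head.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (head[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Fallback when the content gives no hint: trust the file extension.
    static DocType byExtension(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".pdf")) return DocType.PDF;
        if (name.endsWith(".html") || name.endsWith(".htm")) return DocType.HTML;
        if (name.endsWith(".txt")) return DocType.TEXT;
        return DocType.UNKNOWN;
    }

    // head = the first few bytes of the file, read by the caller.
    static DocType sniff(byte[] head, String fileName) {
        return startsWithPdfMagic(head) ? DocType.PDF : byExtension(fileName);
    }

    // Hypothetical dispatch table: describes which kind of extractor to call next.
    static String nextStep(DocType type) {
        switch (type) {
            case PDF:  return "extract text with a PDF library; OCR if it is a scanned image";
            case HTML: return "strip tags with an HTML parser";
            case TEXT: return "read the file directly";
            default:   return "set aside for manual review";
        }
    }
}
```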
Since you are dealing with big data, Python could be pretty useful. JavaScript would be my next choice.
If your overall code is written in Java, it should be very portable and flexible no matter which one you choose. Using a regex or a specific string search would be a good way to approach this.
If you are concerned only with "Licensor" followed by a name, you could identify the format of that particular instance and search for something similar using the regex you create. The same approach can be extended to the other fields you are searching for.
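As a rough sketch of the regex idea using java.util.regex: the two contract layouts handled below (a labelled "Licensor: ..." line, and a name followed by ("Licensor") after the word "between") are assumptions for illustration only. Real contracts will need patterns derived from the documents you actually have, and the class and method names here are made up.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LicensorFinder {

    // Assumed layout 1: a labelled line such as "Licensor: Acme Widgets, Inc."
    private static final Pattern LABELLED = Pattern.compile(
            "^\\s*Licensor\\s*:\\s*(.+?)\\s*$",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    // Assumed layout 2: ... between Acme Widgets, Inc. ("Licensor") and ...
    private static final Pattern DEFINED_TERM = Pattern.compile(
            "between\\s+(.+?)\\s*\\(\\s*\"Licensor\"\\s*\\)");

    // Tries each known layout in turn and returns the first captured name.
    static Optional<String> extractLicensor(String text) {
        for (Pattern pattern : new Pattern[] {LABELLED, DEFINED_TERM}) {
            Matcher matcher = pattern.matcher(text);
            if (matcher.find()) {
                return Optional.of(matcher.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample text, not taken from any real contract.
        String sample = "This Agreement is made between Acme Widgets, Inc. "
                + "(\"Licensor\") and Bob LLC (\"Licensee\").";
        System.out.println(extractLicensor(sample).orElse("(not found)"));
    }
}
```

Documents that match none of the patterns come back empty, which at least gives you a list of files that need a new pattern or a human look.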
For getting text from an image, try using the APIs on these pages:
How to read images using Java API?
Scanned Image to Readable Text
For text from a PDF:
https://www.idrsolutions.com/how-to-search-a-pdf-file-for-text/
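Also, a text-based PDF does contain the text itself (although usually inside compressed streams), so once you have extracted it with a PDF library you should most likely be able to search through it using a regex. That would be my method of attack, or possibly using String.split() and a StringBuilder that you can append to.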
For text from HTML doc:
Here is a cool HTML parser library: http://jericho.htmlparser.net/docs/index.html
A resource that teaches how to remove HTML tags and get the good stuff: http://www.rgagnon.com/javadetails/java-0424.html
If you need anything else, let me know. I'll do my best to find it!

Apache Tika can extract plain text from almost any commonly used file format.
But in the situation you describe, you would still need to analyze the text, as in "natural language recognition". That is a field where, although some advances have been made (by dedicated research teams spending many person-years!), computers still fail pretty badly (heck, even humans fail at it sometimes).
With the number of documents you mentioned (1000s), hire a temp worker and have the documents sorted and tagged by human brain power. It will be cheaper and you will have fewer misclassifications.

You can use Tika for text extraction. If there is a fixed pattern, you can extract information using regex or XPath queries. Another solution is to use Solr as shown in this video. You don't need Solr, but watch the video to get the idea.

How can I highlight text - strictly timed - a la Karaoke without Flash on a web page. What technology choice?

I would like to display the whole text of a poem, then have text highlighted according to a pre-established time sequence. Something like Karaoke, but without any sound track. A user would then be able to read it at exactly the "right" tempo.
I figure I can generate a subtitle track with the timing data (for example, with something like Aegisub, although this keeps crashing on my Mac). Something line by line, such as:
1
00:00:18,067 --> 00:00:20,067
Twinkle twinkle little star
2
00:00:20,467 --> 00:00:22,467
How I wonder what you are
... or better still, a word or syllable at a time.
I don't want to use Flash for iPad/iPhone reasons.
My exact question is this as I'm somewhat naive: What would be the best technology to use? I don't need an exact solution, just some pointers on where I should concentrate my efforts. Does Timed Text in HTML5 (TTML) have anything I could use on this? Or SMIL?

Someone posted a karaoke display engine built in JS: https://github.com/sk89q/ricekaraoke

You can use JavaScript and CSS to accomplish what you want. You can wrap each word in a span, then apply styles to the span elements at the proper timing intervals. If you can store timing information about when you want the corresponding words highlighted, you can use setInterval to add styles at the appropriate times. If you want to use HTML5 features, you might look into using Canvas or SVG to enable more advanced animations.
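The timing lookup itself does not depend on the language. On the page it would be written in JavaScript, but the logic is sketched here in Java to match the other examples on this page: parse subtitle-style cues like the ones in the question into millisecond ranges, then, on every timer tick, find which cue covers the elapsed time. The class and method names are invented for this sketch, and it assumes each cue has exactly one line of text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CueTimeline {

    static final class Cue {
        final long startMs;
        final long endMs;
        final String text;

        Cue(long startMs, long endMs, String text) {
            this.startMs = startMs;
            this.endMs = endMs;
            this.text = text;
        }
    }

    // Matches timing lines such as "00:00:18,067 --> 00:00:20,067".
    private static final Pattern TIMING = Pattern.compile(
            "(\\d{2}):(\\d{2}):(\\d{2}),(\\d{3}) --> (\\d{2}):(\\d{2}):(\\d{2}),(\\d{3})");

    // Converts the four groups starting at index g (hh, mm, ss, ms) to milliseconds.
    private static long toMillis(Matcher m, int g) {
        long hours = Long.parseLong(m.group(g));
        long minutes = Long.parseLong(m.group(g + 1));
        long seconds = Long.parseLong(m.group(g + 2));
        long millis = Long.parseLong(m.group(g + 3));
        return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis;
    }

    // Assumes the line right after each timing line is the cue text.
    static List<Cue> parse(String subtitles) {
        List<Cue> cues = new ArrayList<>();
        String[] lines = subtitles.split("\\R");
        for (int i = 0; i + 1 < lines.length; i++) {
            Matcher m = TIMING.matcher(lines[i].trim());
            if (m.matches()) {
                cues.add(new Cue(toMillis(m, 1), toMillis(m, 5), lines[i + 1].trim()));
            }
        }
        return cues;
    }

    // Index of the cue to highlight at elapsedMs, or -1 if none is active.
    static int activeCue(List<Cue> cues, long elapsedMs) {
        for (int i = 0; i < cues.size(); i++) {
            Cue cue = cues.get(i);
            if (elapsedMs >= cue.startMs && elapsedMs < cue.endMs) {
                return i;
            }
        }
        return -1;
    }
}
```

Comparing against the elapsed time since the start on each tick, rather than counting ticks, keeps the highlighting from drifting when a timer fires late.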

You can achieve a karaoke effect using a JavaScript library from Mozilla called popcorn.js. You can download it from http://mozillapopcorn.org/
Here is a tutorial http://net.tutsplus.com/articles/news/a-look-at-popcorn/
Here is a demo http://danharper.me/demo/a-look-at-popcorn/
Lots of links to related info at the bottom of the second link.

How to generate a printable output for a phonebook

I'm developing desktop software to manage people and telephones, and also to generate (export) a list of telephones (along with a summary of the cities) that can be printed (like a PDF). The telephone management part is ready and was made with Java and SWT/JFace. Exporting the list in a print-friendly format is what has become an issue.
I tried exporting the list in HTML with CSS, but the result is not the same in different browsers.
I was thinking about generating it in LaTeX, but creating a style is getting too complicated (I need an A7 page size, smaller fonts...).
What file format can be used to export this list? Is there an easy way to generate printable stuff?
Edit: forgot to mention that the file will be sent to a company to be printed.
Thanks!

Generate a PDF; it will look the same no matter what viewer they use. You can use iText to create the PDF. It is fairly straightforward for a simple PDF.

You could just draw an image. It will stay the same on different systems and it's easy to print. By drawing it, you can style it the way you imagine, without learning any document format. It should be easy to draw a simple table.
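A minimal sketch of drawing such a table with plain Java 2D follows, assuming an A7 page (74 x 105 mm) rendered at 300 dpi. The class name, margins, row height, font size, output file name, and sample entries are all made up for illustration.

```java
import java.awt.Color;
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

class PhonebookPageImage {

    // Converts millimetres to pixels at the given print resolution.
    static int mmToPixels(double mm, int dpi) {
        return (int) Math.round(mm / 25.4 * dpi);
    }

    // Draws one two-column row per entry (name, phone number) on an A7 page.
    static BufferedImage render(String[][] rows, int dpi) {
        int width = mmToPixels(74, dpi);
        int height = mmToPixels(105, dpi);
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, width, height);
            g.setColor(Color.BLACK);
            // On a BufferedImage the font size is effectively in pixels.
            g.setFont(new Font(Font.SANS_SERIF, Font.PLAIN, mmToPixels(2.5, dpi)));
            int margin = mmToPixels(5, dpi);
            int rowHeight = mmToPixels(4, dpi);
            int y = margin + rowHeight;
            for (String[] row : rows) {
                g.drawString(row[0], margin, y);
                g.drawString(row[1], width / 2, y);
                // Thin rule under each row.
                g.drawLine(margin, y + rowHeight / 4, width - margin, y + rowHeight / 4);
                y += rowHeight;
            }
        } finally {
            g.dispose();
        }
        return image;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder entries only.
        String[][] rows = {{"Alice Example", "555-0100"}, {"Bob Example", "555-0101"}};
        ImageIO.write(render(rows, 300), "png", new File("phonebook.png"));
    }
}
```

This sketch does not paginate or clip long names, so a real list would need to start a new image when the rows run past the bottom margin.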

Plain text is a very friendly format for me. Although, this could be done with HTML and CSS if you keep the style complexity to a minimum. Try reading:
http://www.smashingmagazine.com/2010/06/07/the-principles-of-cross-browser-css-coding/
And be careful when choosing your properties!

Rendering PDF on a website a la Google Documents

In a current project I need to display PDFs in a webpage. Right now we are embedding them with the Adobe PDF Reader, but I would rather have something more elegant (the reader does not integrate well, it cannot be overlaid with transparent regions, ...).
I envision something close to Google Documents, where they display PDFs as images but also allow text to be selected and copied out of the PDF (a requirement we have).
Does anybody know how they do this? Or of any library we could use to obtain a comparable result?
I know we could split the PDFs into images on server side, but this would not allow for the selection of text ...
Thanks in advance for any help
PS: Java based project, using wicket.

I have some suggestions, but it'll be definitely hard to implement this stuff. Good luck!
First approach:
First, use a library like pdf-renderer (https://pdf-renderer.dev.java.net/) to convert the PDF into an image. Store these images on your server or use a caching-technique. Converting PDF into an image is not hard.
Then, use the Type Select JavaScript library (http://www.typeselect.org/) to overlay textual data over your text. This text is selectable, while the real text is still in the original image. To get the original text, see the next approach, or do it yourself, see the conclusion.
The original text then must be overlaid on the image, which is a pain.
Second approach:
The PDF specifications allow textual information to be linked to a font. Most documents use a subset of Type-3 or Type-1 fonts which (often) use a standard character set (I thought it was Unicode, but I'm not sure). If your PDF document does not contain a standard character set (i.e. it has defined its own), it is impossible to know which characters correspond to which glyphs (symbols), and thus you are unable to convert it to a textual representation.
Read the PDF document, read the graphics-objects, parse the instructions (use the PDF specification for more insight in this process) for rendering text, converting them to HTML. The HTML conversion can select appropriate tags (like <H1> and <p>, but also <b> and <i>) based on the parameters of the fonts (their names and attributes) used and the instructions (letter spacing, line spacing, size, face) in the graphics-objects.
You can use the pdf-renderer library for reading and parsing the PDF files and then code a HTML translator yourself. This is not easy, and it does not cover all cases of PDF documents.
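To give an idea of the tag-selection step only (not the PDF parsing), here is a small sketch. It assumes you have already pulled a font name and size out of the graphics objects for each text run. The thresholds, the name-based checks, and the class itself are guesses for illustration, not rules from the PDF specification or from pdf-renderer.

```java
import java.util.Locale;

class FontTagHeuristic {

    // Picks an HTML tag from font attributes.
    // bodySize is the size you consider normal body text in this document.
    static String tagFor(String fontName, double fontSize, double bodySize) {
        String name = fontName.toLowerCase(Locale.ROOT);
        if (fontSize >= bodySize * 1.5) {
            return "h1"; // much larger than body text: treat as a heading
        }
        if (name.contains("bold")) {
            return "b";
        }
        if (name.contains("italic") || name.contains("oblique")) {
            return "i";
        }
        return "p";
    }

    // Escapes the characters that would otherwise be read as markup.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String wrap(String text, String tag) {
        return "<" + tag + ">" + escape(text) + "</" + tag + ">";
    }

    public static void main(String[] args) {
        System.out.println(wrap("Chapter One", tagFor("Times-Roman", 18, 10)));
        System.out.println(wrap("important", tagFor("Helvetica-Bold", 10, 10)));
    }
}
```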
In this approach you will lose the original look of the document. There are some PDF generation libraries which do not use the Adobe font techniques. This is also a problem with the first approach: even though you can see the text, you cannot select it (but this matches the behavior of the official Adobe Reader, so you might say it is not a big deal).
Conclusion:
You can choose the first approach, the second approach or both.
I wouldn't go in the direction of Optical Character Recognition (OCR), since it is really overkill for such a problem and also has several drawbacks. This is the approach Google uses: if there are characters which are unrecognized, a human being does the processing.
If you are into the human-processing thing; you can only use the Type Select library and PDF to Image conversion and do the OCR yourself, which is probably the easiest (human as a machine = intelligently cheap, lol) way to solve the problem.
