How to find a fuzzy word in a text? - Java

I'm using an OCR engine (Tesseract) to extract data from a document. This document must contain certain keywords to be valid, but OCR isn't perfect, so sometimes it may read, for example, "Technlquos" instead of "Techniques".
So I'm wondering if there is a way in Java to find "techniques" in a text even if the OCR read it as "Technlquos". The same goes for compound terms: searching for "Sciences Techniques" must accept "Sclences Technlquos". Something like finding the closest word to the searched word and accepting it if it's close enough (75% match, for example). I found some solutions here, but none of them answers my question.
Thank you

In other OCR libraries, this can be done by keeping recognized word variants in the resulting text. Most likely, "Techniques" is found but considered suspicious by your OCR. If there is an option to keep suspicious word recognition variants, then you will be able to search for them.
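
If your OCR cannot keep variants, a common fallback is edit-distance matching. Below is a minimal sketch in plain Java (class and method names are my own invention): it computes the Levenshtein distance, turns it into a similarity ratio, and slides a window of as many words as the query has, so compound terms like "Sciences Techniques" are matched as a whole.

    import java.util.Arrays;
    import java.util.Locale;

    public class FuzzyFinder {

        // Classic dynamic-programming Levenshtein (edit) distance.
        static int levenshtein(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + cost);
                }
            }
            return d[a.length()][b.length()];
        }

        // Similarity in [0, 1]; 1.0 means the strings are identical.
        static double similarity(String a, String b) {
            int max = Math.max(a.length(), b.length());
            return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
        }

        // Slides a window of query-length words over the text, accepting any
        // window at least `threshold` similar to the query (e.g. 0.75).
        static boolean containsFuzzy(String text, String query, double threshold) {
            String[] words = text.toLowerCase(Locale.ROOT).split("\\s+");
            String[] queryWords = query.toLowerCase(Locale.ROOT).split("\\s+");
            String needle = String.join(" ", queryWords);
            int n = queryWords.length;
            for (int i = 0; i + n <= words.length; i++) {
                String window = String.join(" ", Arrays.copyOfRange(words, i, i + n));
                if (similarity(window, needle) >= threshold) return true;
            }
            return false;
        }

        public static void main(String[] args) {
            // Prints "true": the OCR garble is ~84% similar to the query.
            System.out.println(containsFuzzy("Faculté des Sclences Technlquos", "Sciences Techniques", 0.75));
        }
    }

Apache Commons Text also ships a LevenshteinDistance class if you would rather use a library.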


Easiest, simplest, and fastest way to search within the eclipse workspace from a list of thousands of search strings?

Additional requirements:
It would be nice to save the search result for every search term. It would be even nicer to only save a result if certain strings are found in it
I am only interested in searching through *.java, *.xsd, and *.xml file extensions within the workspace
Notes:
There is a good chance that I will need to do this in the future
The search terms exist in a text file
I have seen a similar question asked but there were a limited number of answers and they did not answer this question. See Eclipse Text Search, automating loop over a list of search terms?
The Eclipse project is massive with many nested directories, so recursion could take a very long time.
The eclipse workspace (codebase/directory) exists in my C:/DEV directory
Some pseudo code:
for (String term : searchTerms) {
    for (String match : findMatches(term)) {   // findMatches: hypothetical search helper
        if (match.contains(requiredString)) {
            saveMatch(term, match);            // saveMatch: hypothetical persistence helper
        }
    }
}
Maybe there are tools out there that can do this?
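
For reference, a minimal sketch of a non-Eclipse route using java.nio.file (Java 11+ for Files.readString; terms.txt, the C:/DEV root, and plain printing instead of saving are assumptions taken from the question):

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.List;
    import java.util.Set;
    import java.util.stream.Stream;

    public class WorkspaceSearch {
        public static void main(String[] args) throws IOException {
            // The search terms exist in a text file, one per line.
            List<String> terms = Files.readAllLines(Paths.get("terms.txt"));
            Set<String> exts = Set.of(".java", ".xsd", ".xml");
            // Files.walk streams lazily, so the nested directories are
            // visited once without building the whole tree up front.
            try (Stream<Path> files = Files.walk(Paths.get("C:/DEV"))) {
                files.filter(Files::isRegularFile)
                     .filter(p -> exts.stream().anyMatch(p.toString()::endsWith))
                     .forEach(p -> {
                         String content;
                         try {
                             content = Files.readString(p);
                         } catch (IOException e) {
                             return; // skip unreadable files
                         }
                         for (String term : terms) {
                             if (content.contains(term)) {
                                 System.out.println(term + " -> " + p); // save/filter here
                             }
                         }
                     });
            }
        }
    }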

Java: Print Text With Strikethrough

I'm printing to a file. Is there a way to print the text with a strikethrough? I have done some googling but did not find any applicable answers.
You would have to save the file as PDF, HTML, or some kind of word-processor document. Simple text (or more correctly, plain text) does not have formatting ... in any language ...
I'd recommend HTML. It is simple to create (PDF is a pain), gives you the option of other formatting (people always end up asking for a heading), allows you to format as tables (managers love tables), and will open anywhere (could even be served on a web-server, eliminating printing and tree-killing altogether).
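
As a rough sketch of the HTML route (file name and text are made up), the <s> tag, or CSS text-decoration: line-through, renders the strikethrough:

    import java.io.FileNotFoundException;
    import java.io.PrintWriter;

    public class StrikethroughReport {
        public static void main(String[] args) throws FileNotFoundException {
            try (PrintWriter out = new PrintWriter("report.html")) {
                out.println("<html><body>");
                // <s> renders with a line through the text in any browser.
                out.println("<p><s>This line is struck through</s></p>");
                out.println("</body></html>");
            }
        }
    }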
If you want to force it, you can use the Unicode code point of letters that already include a stroke, like this:
"\u03C0" //π
http://unicode-table.com/de/0268/
This, as an example, is the ɨ (Latin small letter i with stroke).
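
In the same spirit, you can fake a strikethrough in plain text by following every character with the combining long stroke overlay (U+0336). Whether it actually displays struck through depends entirely on the font and viewer, so treat this as a best-effort sketch:

    public class Strike {
        // Appends U+0336 (COMBINING LONG STROKE OVERLAY) after each character;
        // rendering is font-dependent.
        static String strike(String s) {
            StringBuilder sb = new StringBuilder();
            for (char c : s.toCharArray()) {
                sb.append(c).append('\u0336');
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            System.out.println(strike("struck")); // s̶t̶r̶u̶c̶k̶ in a supporting font
        }
    }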

Is it possible to do this type of search in Java

I am stuck on a project at work that I do not think is really possible and I am wondering if someone can confirm my belief that it isn't possible or at least give me new options to look at.
We are doing a project for a client that involved a mass download of files from a server (easily done with ftp4j and a document name list), but now we need to sort through the data from the server. The client does work in contracts and wants us to pull out relevant information such as: Licensor, Licensee, Product, Agreement date, termination date, royalties, restrictions.
Since the documents are completely unstandardized, is that even possible? I can imagine loading in the files and searching them, but I would have no idea how to pull information such as the licensor and the restrictions on the agreement out of a paragraph. These are not hashes (key-value structures) but just long contracts. Even if I were to search for 'Licensor', it will come up in the document multiple times. The documents aren't even in a consistent file format: some are PDF, some are text, some are HTML, and I've even seen some that were as bad as a scanned image inside a PDF.
My boss keeps pushing for me to work on this project but I feel as if I am out of options. I primarily do web and mobile so big data is really not my strong area. Does this sound possible to do in a reasonable amount of time? (We're talking about at the very minimum 1000 documents). I have been working on this in Java.
I'll do my best to give you some information, as this is not my area of expertise. I would highly consider writing a script that identifies the type of file you are dealing with, and then calls the appropriate parsing methods to handle what you are looking for.
Since you are dealing with big data, Python could be pretty useful. JavaScript would be my next choice.
If your overall code is written in Java, it should be very portable and flexible no matter which one you choose. Using a regex or a specific string search would be a good way to approach this.
If you are concerned only with Licensor followed by a name, you could identify the format of that particular instance and search for something similar using the regex you create. This can be extrapolated to other instances of searching.
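
For instance, a hedged sketch of that idea (the "Licensor:" label format and the sample text are pure assumptions; real contracts will need per-format tuning):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LicensorScan {
        public static void main(String[] args) {
            String contractText = "Licensor: Acme Corporation; Licensee: Foo LLC";
            // Assumes the party name follows a "Licensor:" label and runs
            // until the next sentence-level delimiter.
            Pattern p = Pattern.compile("Licensor\\s*:\\s*([^.;\\n]+)");
            Matcher m = p.matcher(contractText);
            while (m.find()) {
                System.out.println("Candidate licensor: " + m.group(1).trim());
            }
        }
    }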
For getting text from an image, try using the API's on this page:
How to read images using Java API?
Scanned Image to Readable Text
For text from a PDF:
https://www.idrsolutions.com/how-to-search-a-pdf-file-for-text/
Also, the extracted PDF content is just text, so you should most likely be able to search through it using a regex. That would be my method of attack, or possibly using String.split() and appending to a StringBuilder.
For text from HTML doc:
Here is a cool HTML parser library: http://jericho.htmlparser.net/docs/index.html
A resource that teaches how to remove HTML tags and get the good stuff: http://www.rgagnon.com/javadetails/java-0424.html
If you need anything else, let me know. I'll do my best to find it!
Apache Tika can extract plain text from almost any commonly used file format.
But with the situation you describe, you would still need to analyze the text as in "natural language recognition". That's a field where, despite some advances (made by dedicated research teams spending many person-years!), computers still fail pretty badly (heck, even humans fail at it sometimes).
With the number of documents you mention (thousands), hire a temp worker and have the documents sorted/tagged by human brain power. It will be cheaper, and you will have fewer misclassifications.
You can use Tika for text extraction. If there is a fixed pattern, you can extract information using regex or XPath queries. Another solution is to use Solr, as shown in this video. You don't need Solr, but watch the video to get the idea.
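
As a minimal sketch of the Tika route (the file name is illustrative), the Tika facade reduces extraction to one call:

    import java.io.File;
    import org.apache.tika.Tika;

    public class ExtractText {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            // Detects the format (PDF, HTML, Word, plain text, ...) and
            // returns plain text; scanned images still need OCR first.
            String text = tika.parseToString(new File("contract.pdf"));
            System.out.println(text);
        }
    }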

How to avoid PDFBox appending separate words

I'm making an application which allows searching in PDFs using Apache Solr. I was having trouble finding certain terms in the PDFs.
I noticed that words in adjacent columns got joined together.
Example
Column1 | Column2
stack | overflow
Here PDFTextStripper would sometimes give me "stackoverflow" as the extracted text. This leads to bad tokenization in Solr, which prevents you from finding the term. (Yes, I know I can use wildcards, but those don't work in phrase queries.)
I have been looking at the sources to see what causes the problem. But it seems that the writePage method has to guess the spaces. I can't really change this since it seems very complex.
Are there any other solutions for getting good text extraction from a PDF with columns?
Maybe some other conversion program.
Maybe a patch for PDFBox.
Yes, I've seen similar questions, but they mostly handle the order of the extraction (which in my case doesn't matter that much).
I got the same problem while extracting text with PDFBox. I solved it by using the position information of each character: I took the x position and y position of every character and implemented a simple rule to distinguish words. Before that, my only word delimiter was the space character. I added one more rule: if the difference between the x positions of two characters is beyond a certain value (the value is your choice) while they are on the same line, i.e. at the same y coordinate (a different y coordinate certainly means a new word), I treat them as separate words. With this logic I was able to solve the problems with table content, new lines, etc.
This link will help you to get the position of characters from pdf with PDFbox.
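
For reference, here is a minimal sketch of that approach against the PDFBox 2.x API (the 3.0f gap threshold is a guess you would tune per document):

    import java.io.IOException;
    import java.util.List;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;

    public class GapAwareStripper extends PDFTextStripper {
        private static final float GAP_THRESHOLD = 3.0f; // tune for your PDFs

        public GapAwareStripper() throws IOException {
            super();
        }

        @Override
        protected void writeString(String text, List<TextPosition> positions) throws IOException {
            StringBuilder sb = new StringBuilder();
            TextPosition prev = null;
            for (TextPosition tp : positions) {
                if (prev != null) {
                    boolean sameLine = Math.abs(tp.getYDirAdj() - prev.getYDirAdj()) < 0.5f;
                    float gap = tp.getXDirAdj() - (prev.getXDirAdj() + prev.getWidthDirAdj());
                    // Same line but a large horizontal jump: treat as a word break.
                    if (sameLine && gap > GAP_THRESHOLD) {
                        sb.append(' ');
                    }
                }
                sb.append(tp.getUnicode());
                prev = tp;
            }
            writeString(sb.toString());
        }
    }

Use it like the stock stripper: new GapAwareStripper().getText(document).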

Searching for words like "UTTD_Equip_City_TE" in Lucene

Thanks for reading :)
I'm trying to search for words like "UTTD_Equip_City_TE" across RTF documents using Lucene. This word appears in two different forms:
«UTTD_Equip_City_TE»,
«UTTD_Equip_City_TE»
I first tried with StandardAnalyzer, but it seems to break down the word into "UTTD", "Equip", "City", and "TE".
Then I tried again using WhiteSpaceAnalyzer, but it doesn't seem to be working... (I don't know why).
Could you help me with how I should approach this problem? By the way, editing the Lucene source and recompiling it with Ant is not an option :(
Thanks.
EDIT: there are other texts in this document, too. For example:
SHIP TO LESSEE (EQUIPMENT location address): «UTTD_Equip_StreetAddress_TE», «UTTD_Equip_City_TE», «UTTD_Equip_State_MC»
Basically, I'm trying to index RTF files, and inside each RTF file are tables with variables. Variables are wrapped in « and ». I'm trying to search for those variables in the documents. I've tried searching for "«" + string + "»", but it hasn't worked...
This example could give a better picture: http://i.imgur.com/SwlO1.png
Please help.
KeywordAnalyzer tokenizes the entire field as a single string. It sounds like this might be what you're looking for, if the substrings are in different fields within your document.
See: KeywordAnalyzer
Instead, if you are adding the entire content of the document within a single field, and you want to search for a substring with embedded '_' characters within it, then I would think that WhitespaceAnalyzer would work. You stated that it didn't work, though. Can you tell us what the results were when you tried using WhitespaceAnalyzer? And did you use it for both Indexing and Querying?
I see two options here. In both cases you have to build a custom analyzer.
Option 1
Start with StandardTokenizer's grammar file and customize it so that text separated by '_' is emitted as a single token (refer to "Generating a custom Tokenizer for new TokenStream API using JFlex/JavaCC"). Build your Analyzer using this new Tokenizer along with a LowerCaseFilter.
Option 2
Write a custom Analyzer that is made of a WhitespaceTokenizer and custom TokenFilters. In these TokenFilters you decide how to act on the tokens returned by the WhitespaceTokenizer.
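
A minimal sketch of Option 2 (written against a recent Lucene API; the package locations of these classes have moved between versions): whitespace tokenization plus lowercasing, so a token like «UTTD_Equip_City_TE» survives as a single term.

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;

    public class UnderscoreAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            // Split on whitespace only, leaving '_' and '«»' intact.
            Tokenizer source = new WhitespaceTokenizer();
            // Lowercase so queries are case-insensitive.
            TokenStream filter = new LowerCaseFilter(source);
            return new TokenStreamComponents(source, filter);
        }
    }

Remember to use the same analyzer for both indexing and querying.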
Refer to http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/analysis/package-summary.html for more details on analysis
