I'm building an application that searches PDFs using Apache Solr. I was having trouble finding certain terms in the PDFs.
I noticed that words in adjacent columns got appended to each other.
Example:
Column1 | Column2
stack | overflow
Here the PDFTextStripper would sometimes give me stackoverflow as the extracted text. This leads to bad tokenization in Solr, which prevents you from finding the term. (Yes, I know I can use wildcards, but that doesn't work in phrase queries.)
I have been looking at the sources to see what causes the problem, but it seems that the writePage method has to guess where the spaces are. I can't really change this since it seems very complex.
Are there any other solutions to get a good text extraction from a pdf with columns?
Maybe some other conversion program?
Maybe a patch for PDFBox?
Yes, I've seen similar questions, but they mostly deal with the order of the extraction (which in my case doesn't matter that much).
I had the same problem while extracting text with PDFBox. I solved it by using the position information of each character: I took the x and y position of each character and implemented some simple logic to distinguish words. Before that, my only word delimiter was " " (space). I added one more rule: if the difference between the x positions of two consecutive characters exceeds a certain value (the value is your choice) and they are on the same line, that is, they have the same y coordinate (a different y coordinate certainly means a new word), I treat the second character as the start of a new word. With this logic I was able to solve the problems with table content, new lines, etc.
This link will help you to get the position of characters from pdf with PDFbox.
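The rule described above can be sketched in plain Java as follows. This is only an illustration under assumptions: Glyph, place, splitWords, maxGap and lineTolerance are hypothetical names, and the sketch does not call PDFBox itself. With PDFBox, the positions would presumably come from the TextPosition objects (getUnicode(), getXDirAdj(), getYDirAdj()) that PDFTextStripper passes to writeString.

```java
import java.util.ArrayList;
import java.util.List;

public class GapWordSplitter {

    /** Hypothetical holder for one extracted character and its page position. */
    public static final class Glyph {
        public final String text;
        public final float x;
        public final float y;

        public Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Demo helper: lays out the characters of a string on one line, step units apart. */
    public static List<Glyph> place(String text, float startX, float y, float step) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            glyphs.add(new Glyph(String.valueOf(text.charAt(i)), startX + i * step, y));
        }
        return glyphs;
    }

    /**
     * Splits glyphs (in reading order) into words. A new word starts at a space,
     * when the y coordinate changes by more than lineTolerance, or when the x
     * distance to the previous glyph on the same line exceeds maxGap.
     */
    public static List<String> splitWords(List<Glyph> glyphs, float maxGap, float lineTolerance) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : glyphs) {
            boolean isSpace = g.text.trim().isEmpty();
            boolean newLine = previous != null && Math.abs(g.y - previous.y) > lineTolerance;
            boolean wideGap = previous != null && !newLine && (g.x - previous.x) > maxGap;
            if ((isSpace || newLine || wideGap) && current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            if (!isSpace) {
                current.append(g.text);
            }
            previous = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

For the table example from the question, placing "stack" at x = 10 and "overflow" at x = 80 on the same line (6 units per character) and calling splitWords with maxGap = 10 yields the two separate words. The threshold has to be tuned per document, since it depends on font size and column spacing.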
I'm using an OCR engine (Tesseract) to extract data from a document. The document must contain certain keywords to be valid. OCR isn't perfect, so sometimes it may read, for example, "Technlquos" instead of "Techniques".
So I'm wondering if there is a way in Java to find "techniques" in a text even if the OCR read it as "Technlquos". The same goes for compound terms: searching for "Sciences Techniques" must accept "Sclences Technlquos". Something like finding the closest word to the searched word and accepting it if it's close enough (a 75% match, for example). I found some solutions here, but none of them answers my question.
Thank you
In other OCR libraries, this can be done by keeping recognized word variants in the resulting text. Most likely, your OCR does find "Techniques" as a candidate but considers it suspicious. If there is an option to keep suspicious word recognition variants, then you will be able to search for it.
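If the OCR engine offers no such option, the "close enough" matching described in the question can be approximated with edit distance. The following is a minimal sketch of that different technique (Levenshtein distance, normalized by word length), not of the variant-keeping approach above; the class and method names are hypothetical, and it assumes words are separated by whitespace and compared case-insensitively.

```java
import java.util.Locale;

public class FuzzyKeywordMatcher {

    /** Classic Levenshtein distance (insertions, deletions, substitutions), using two rows. */
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]: 1 minus the distance divided by the longer word's length. */
    public static double similarity(String a, String b) {
        String x = a.toLowerCase(Locale.ROOT);
        String y = b.toLowerCase(Locale.ROOT);
        int max = Math.max(x.length(), y.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(x, y) / max;
    }

    /** True if some run of consecutive words in text matches every word of phrase at or above threshold. */
    public static boolean containsApprox(String text, String phrase, double threshold) {
        String[] words = text.trim().split("\\s+");
        String[] wanted = phrase.trim().split("\\s+");
        for (int start = 0; start + wanted.length <= words.length; start++) {
            boolean all = true;
            for (int k = 0; k < wanted.length && all; k++) {
                all = similarity(words[start + k], wanted[k]) >= threshold;
            }
            if (all) {
                return true;
            }
        }
        return false;
    }
}
```

With this definition, "Technlquos" differs from "Techniques" by two substitutions out of ten characters, so the similarity is about 0.8 and passes a 0.75 threshold; "Sclences" against "Sciences" scores 0.875. A word-by-word threshold like this can also produce false positives on short words, so the threshold is a judgment call.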
I am iterating over XWPFParagraph instances coming from an XWPFDocument instance (using the "getParagraphs()" method). Is there a way to retrieve the page numbers where each paragraph is located from the XWPFParagraph instances?
To eventually turn Gagravarr's comment into a proper answer: No, this is not possible.
Doing so would require a full-blown Word rendering engine (i.e. MS Word itself) and even then you cannot be absolutely sure that page breaks will always occur at exactly those positions where they happened to be when the file was once created (think: missing fonts, missing pictures, different display options for vanished text and/or revision marking, different printer margins, etc.).
So claiming that some content in a Word file is on a certain line X on a certain page Y actually expresses a fundamental misunderstanding of the Word file format. There is simply no notion of line and page in there. It's all about runs and ranges.
In other words: only upon opening such a file with MS Word will those contents be rendered onto individual lines and pages. And the behavior of this renderer is unpredictable to a certain extent.
I'm seeing some strange behaviour and hope that someone might know what is going on.
For a personal project in Java I'm using Apache POI (version 3.9) to read and write Excel files.
In these Excel files there is a formula that I wanted to write in a different way.
I have a loop that sets the required formula string on my Excel object:
excelobject.setDataFormula("SUM(L" + counter + "-6,75)"); // it will look like SUM(L2-6,75) and so on
However, when I write these formulas to a file and check it, the formula has mysteriously changed to something like SUM(L2-6;75): the , has become a ;, and thus the formula does not work as intended.
Can someone explain to me why Apache POI's setFormula on a cell does this to a , ?
EDIT :
I changed my loop to use a double 6,75 instead of a string 6,75 and that seems to help when creating the formula. So the immediate question is fixed, though I am still curious about why this behaviour occurs.
Go to Control panel -> Region and Language -> Additional settings and change the List Separator to something other than a comma. You should use a character that doesn't appear in your file. Be sure to pick something that won't be added to the file in future either.
It changes from , to ; because , is being used as a delimiter: anywhere a , is found indicates a new cell. When your string is read, the comma is changed so that two cells won't be created where there should be only one. If you change the List Separator as described, the comma should remain unchanged.
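Independently of the regional setting, the ambiguity can be avoided by never putting a decimal comma into the formula text, which matches what the asker observed after switching to a double. The sketch below builds the string with a locale-independent format. It assumes that formula text handed to POI uses "." as the decimal separator and "," only between arguments; FormulaBuilder and sumMinus are hypothetical names, and the result would be passed to the asker's own setDataFormula method.

```java
import java.util.Locale;

public class FormulaBuilder {

    /**
     * Builds e.g. "SUM(L2-6.75)". Locale.ROOT keeps the decimal separator a dot
     * even when the JVM default locale would print a comma.
     */
    public static String sumMinus(String column, int row, double value) {
        return String.format(Locale.ROOT, "SUM(%s%d-%.2f)", column, row, value);
    }
}
```

For example, sumMinus("L", 2, 6.75) returns "SUM(L2-6.75)", whereas formatting the same number with a German locale would produce "6,75" and reintroduce the comma that gets read as an argument separator.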
I have a huge text file (more than 7 GB) containing account details. Each line contains the details of a single account plus other information. I want to read the accounts whose first 3 characters are "XBB". Searching line by line takes a very long time, so I want to jump directly to the lines that start with "XBB".
Is there any possible way to do that in Java, VB, or VB.NET?
If the lines are sorted by their first 3 characters, then you can do a binary search. This is straightforward if the lines are a fixed length. Otherwise, you will need to search for the start of each line at each step of the binary search.
If you know the index of the line, you can try going to it directly. Again, this is trivial if the lines are a fixed length; otherwise you will still have to probe and search a bit.
In Java, the tool to use for this is RandomAccessFile. I don't know about VB/VB.net.
Following the suggestion by Peter Lawrey, if you are willing to scan the file once, you can build an index of the offset into the file at which each 3-character prefix starts. You can then use this to very quickly get to the correct line.
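The index idea can be sketched as follows. This is an illustration under assumptions, not a tuned implementation: it assumes lines with the same prefix are stored contiguously (for example because the file is sorted), that prefixes are plain ASCII, and that lines end with "\n". PrefixOffsetIndex, build and linesWithPrefix are hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PrefixOffsetIndex {

    /** Scans the file once; maps each 3-character prefix to the byte offset of the first line that starts with it. */
    public static Map<String, Long> build(Path file) {
        Map<String, Long> index = new LinkedHashMap<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file), 1 << 16)) {
            long offset = 0;     // offset of the byte currently being examined
            long lineStart = 0;  // offset of the first byte of the current line
            StringBuilder prefix = new StringBuilder(3);
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\n') {
                    lineStart = offset + 1;
                    prefix.setLength(0);
                } else if (prefix.length() < 3) {
                    prefix.append((char) b);
                    if (prefix.length() == 3) {
                        index.putIfAbsent(prefix.toString(), lineStart);
                    }
                }
                offset++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return index;
    }

    /** Seeks straight to the first line with the prefix and reads until the prefix changes. */
    public static List<String> linesWithPrefix(Path file, Map<String, Long> index, String prefix) {
        List<String> result = new ArrayList<>();
        Long start = index.get(prefix);
        if (start == null) {
            return result;
        }
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(start);
            String line;
            while ((line = raf.readLine()) != null && line.startsWith(prefix)) {
                result.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

The index is small (one entry per distinct prefix), so it could be saved next to the data file and reused as long as the file does not change. RandomAccessFile.readLine is unbuffered and decodes bytes one by one, so for long runs of matching lines a buffered reader positioned at the same offset would likely be faster.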
It doesn't matter what language you use; the only way to find something is to search for it. You can use a search tool like Lucene to do the searching ahead of time, i.e., create a full-text search index, or you can do the searching when you need it, as you're doing now, but you won't be able to escape the searching part.
You can do this only if you have an index file, and that index file contains indexes for the particular column of data you want to search on.
The other option would be to load the file into a database, like SQL Server Express, and run a SQL query on it.
Use regular expressions (regex). With these you can define an expression that matches only those specific letters. Then, using a Scanner, you can look for only that sequence of letters.
Thanks for reading :)
I'm trying to search for words like "UTTD_Equip_City_TE" across RTF documents using Lucene. This word appears in two different forms:
«UTTD_Equip_City_TE»,
«UTTD_Equip_City_TE»
I first tried StandardAnalyzer, but it seems to break the word down into "UTTD", "Equip", "City", and "TE".
Then I tried again using WhitespaceAnalyzer, but it doesn't seem to be working... (I don't know why).
Could you help me with how I should approach this problem? By the way, editing the Lucene source and recompiling it with Ant is not an option :(
Thanks.
EDIT: there are other texts in this document, too. For example:
SHIP TO LESSEE (EQUIPMENT location address): «UTTD_Equip_StreetAddress_TE», «UTTD_Equip_City_TE», «UTTD_Equip_State_MC»
Basically, I'm trying to index RTF files, and inside each RTF file are tables with variables. Variables are wrapped in « and ». I'm trying to search for those variables in the documents. I've tried searching for "«" + string + "»", but it hasn't worked...
This example could give a better picture: http://i.imgur.com/SwlO1.png
Please help.
KeywordAnalyzer tokenizes the entire field as a single string. It sounds like this might be what you're looking for, if the substrings are in different fields within your document.
See: KeywordAnalyzer
Instead, if you are adding the entire content of the document within a single field, and you want to search for a substring with embedded '_' characters within it, then I would think that WhitespaceAnalyzer would work. You stated that it didn't work, though. Can you tell us what the results were when you tried using WhitespaceAnalyzer? And did you use it for both Indexing and Querying?
I see two options here. In both cases you have to build a custom analyzer.
Option 1
Start with StandardTokenizer's grammar file and customize it so that it emits text separated by '_' as a single token (refer to Generating a custom Tokenizer for new TokenStream API using JFlex/ Java CC). Build your Analyzer using this new Tokenizer along with LowerCaseFilter.
Option 2
Write a custom Analyzer that is made of WhitespaceTokenizer and custom TokenFilters. In these TokenFilters you decide how to act on the tokens returned by WhitespaceTokenizer.
Refer to http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/analysis/package-summary.html for more details on analysis
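To illustrate what the token filter in Option 2 would have to do, here is a plain-Java stand-in that does not use the Lucene classes: it splits on whitespace, strips the wrapping « » and surrounding punctuation from each token, and lower-cases the result. The names are hypothetical and the set of stripped characters is an assumption. Presumably this is also why whitespace splitting alone gave no hits: the tokens keep their guillemets and trailing commas, so "«UTTD_Equip_City_TE»," and "«UTTD_Equip_City_TE»" are different terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class VariableTokenSketch {

    // Characters removed from the start and end of each whitespace-separated token.
    // \u00AB and \u00BB are the opening and closing guillemets that wrap the variables.
    private static final String LEADING = "^[\u00AB(\"]+";
    private static final String TRAILING = "[\u00BB,.;:)\"]+$";

    /** Whitespace tokenization followed by edge stripping and lower-casing; '_' inside tokens is kept. */
    public static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String token = raw.replaceAll(LEADING, "").replaceAll(TRAILING, "");
            if (!token.isEmpty()) {
                out.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }
}
```

In a real analyzer the same normalization would have to be applied both at index time and to the query string, so that a search for UTTD_Equip_City_TE is reduced to the same term that was indexed.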