Searching for words like "UTTD_Equip_City_TE" in Lucene - java

Thanks for reading :)
I'm trying to search for words like "UTTD_Equip_City_TE" across RTF documents using Lucene. This word appears in two different forms:
«UTTD_Equip_City_TE»,
«UTTD_Equip_City_TE»
I first tried with StandardAnalyzer, but it seems to break down the word into "UTTD", "Equip", "City", and "TE".
Then I tried again using WhitespaceAnalyzer, but it doesn't seem to be working... (I don't know why).
Could you help me figure out how I should approach this problem? By the way, editing the Lucene source and recompiling it with Ant is not an option :(
Thanks.
EDIT: there is other text in this document, too. For example:
SHIP TO LESSEE (EQUIPMENT location address): «UTTD_Equip_StreetAddress_TE», «UTTD_Equip_City_TE», «UTTD_Equip_State_MC»
Basically, I'm trying to index RTF files, and inside each RTF file are tables with variables. Variables are wrapped with « and ». I'm trying to search for those variables in the documents. I've tried searching "«" + string + "»", but it hasn't worked...
This example could give a better picture: http://i.imgur.com/SwlO1.png
Please help.

KeywordAnalyzer treats the entire content of a field as a single token. It sounds like this might be what you're looking for, if the substrings are in different fields within your document.
See: KeywordAnalyzer
Instead, if you are adding the entire content of the document within a single field, and you want to search for a substring with embedded '_' characters in it, then I would think that WhitespaceAnalyzer would work. You stated that it didn't work, though. Can you tell us what the results were when you tried using WhitespaceAnalyzer? And did you use it for both indexing and querying?
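For reference, here is a minimal sketch against the Lucene 3.6 API with WhitespaceAnalyzer used on both sides (the field name and sample text are taken from the question). Note one likely gotcha: WhitespaceAnalyzer splits on whitespace only, so the guillemets and any trailing comma stay attached to the indexed token, and a query for the bare variable name will not match.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class WhitespaceDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_36);
        Directory dir = new RAMDirectory();

        // Index with the SAME analyzer you will query with.
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_36, analyzer));
        Document doc = new Document();
        doc.add(new Field("content",
                "SHIP TO LESSEE: «UTTD_Equip_City_TE», «UTTD_Equip_State_MC»",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // The indexed token is «UTTD_Equip_City_TE», including guillemets
        // and the comma, so the query term must match it exactly.
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        TopDocs hits = searcher.search(
                new TermQuery(new Term("content", "«UTTD_Equip_City_TE»,")), 10);
        System.out.println("hits: " + hits.totalHits);
    }
}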

I see two options here. In both cases you have to build a custom analyzer.
Option 1
Start with StandardTokenizer's grammar file and customize it so that it emits text separated by '_' as a single token (refer to "Generating a custom Tokenizer for new TokenStream API using JFlex/JavaCC"). Build your Analyzer using this new Tokenizer along with LowerCaseFilter.
Option 2
Write a custom Analyzer that is made of WhitespaceTokenizer and custom TokenFilters. In these TokenFilters you decide how to act on the tokens returned by WhitespaceTokenizer; a sketch follows after the reference below.
Refer to http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/analysis/package-summary.html for more details on analysis
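As a rough sketch of Option 2 against the Lucene 3.6 API (VariableAnalyzer and PunctuationTrimFilter are made-up names for illustration), an analyzer that keeps '_' inside tokens but trims the « » wrappers and trailing punctuation might look like this:

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public final class VariableAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new WhitespaceTokenizer(Version.LUCENE_36, reader);
        stream = new PunctuationTrimFilter(stream);   // strip «, », commas, etc.
        stream = new LowerCaseFilter(Version.LUCENE_36, stream);
        return stream;
    }

    // Trims leading/trailing punctuation but leaves embedded '_' alone.
    private static final class PunctuationTrimFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        PunctuationTrimFilter(TokenStream in) {
            super(in);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            String trimmed = termAtt.toString()
                    .replaceAll("^[«»\\p{Punct}]+", "")
                    .replaceAll("[«»\\p{Punct}]+$", "");
            termAtt.setEmpty().append(trimmed);
            return true;
        }
    }
}

Remember to use the same analyzer at both index and query time; with the LowerCaseFilter in place, query terms must be lowercased as well.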

Related

Why does Solr reindex data on highlighting?

I wrote a custom tokenizer for Solr. When I first add records to Solr, they go through my tokenizer and the other filters; while they are going through my tokenizer, I call a web service and add the needed attributes. After that I can use search without sending requests to the web service. But when I use search with highlighting, the data goes through my tokenizer again. What should I do so that it doesn't go through the tokenizer again?
When the highlighter is run on the text to highlight, the analyzer and tokenizer for the field are re-run on the text to score the different tokens against the submitted text, to determine which fragment is the best match for the query produced. You can see this code around line #62 of Highlighter.java in Lucene.
There are, however, a few options that might help avoid the need to re-parse the document text, all given as options on the community wiki for Highlighting:
For the standard highlighter:
It does not require any special datastructures such as termVectors, although it will use them if they are present. If they are not, this highlighter will re-analyze the document on-the-fly to highlight it. This highlighter is a good choice for a wide variety of search use-cases.
There are also two other Highlighter implementations you might want to look at, as either one uses support structures that might avoid the re-tokenizing / re-analysis of the field (I think testing it will be a lot quicker for you than for me right now).
FastVector Highlighter: The FastVector Highlighter requires term vector options (termVectors, termPositions, and termOffsets) on the field.
Postings Highlighter: The Postings Highlighter requires storeOffsetsWithPositions to be configured on the field. This is a much more compact and efficient structure than term vectors, but is not appropriate for huge numbers of query terms.
You can switch the highlighting implementation by using hl.useFastVectorHighlighter=true or by adding <highlighting class="org.apache.solr.highlight.PostingsSolrHighlighter"/> to your searchComponent definition.
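For example, the schema.xml side might look roughly like this (field and type names here are placeholders, and storeOffsetsWithPositions requires a reasonably recent Solr):

<!-- FastVector Highlighter: -->
<field name="content" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

<!-- Postings Highlighter: -->
<field name="content" type="text_general" indexed="true" stored="true"
       storeOffsetsWithPositions="true"/>

Either change requires re-indexing before the highlighter can use the new structures.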

How to directly read specific lines of data from a large text file without searching every line, in C or Java

I have a text file with account details; it is huge, more than 7 GB. Each line contains the details of a single account and other information. I want to read the account details whose first 3 characters are "XBB". Searching line by line would take a long time, so I want to jump directly to the particular lines that contain "XBB".
Is there any possible way to do that in Java, VB, or VB.NET?
If the lines are sorted by their first 3 characters, then you can do a binary search. This is straightforward if the lines are a fixed length. Otherwise, you will need to search for the start of each line at each step of the binary search.
If you know the index of the line, you can try going to it directly. Again, this is trivial if the lines are a fixed length; otherwise you will still have to probe and search a bit.
In Java, the tool to use for this is RandomAccessFile. I don't know about VB/VB.net.
Following the suggestion by Peter Lawrey, if you are willing to scan the file once, you can build an index of the offset into the file at which each 3-character prefix starts. You can then use this to very quickly get to the correct line.
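A sketch of the binary-search idea with RandomAccessFile (assuming a single-byte encoding, '\n' line endings, and lines sorted by their first 3 characters):

import java.io.IOException;
import java.io.RandomAccessFile;

public class PrefixSeek {

    // Returns the first line starting with the 3-character prefix, or null.
    public static String findFirst(RandomAccessFile f, String prefix) throws IOException {
        long lo = 0, hi = f.length();
        while (lo < hi) {
            long lineStart = startOfLine(f, (lo + hi) / 2);
            f.seek(lineStart);
            String line = f.readLine();
            String key = line.substring(0, Math.min(3, line.length()));
            if (key.compareTo(prefix) < 0) {
                lo = f.getFilePointer();   // continue after this line
            } else {
                hi = lineStart;            // continue at or before this line
            }
        }
        f.seek(lo);
        String line = f.readLine();
        return (line != null && line.startsWith(prefix)) ? line : null;
    }

    // Scans backwards from pos to just after the previous '\n' (or file start).
    private static long startOfLine(RandomAccessFile f, long pos) throws IOException {
        while (pos > 0) {
            f.seek(pos - 1);
            if (f.read() == '\n') {
                break;
            }
            pos--;
        }
        return pos;
    }
}

For the index variant, a single pass over the file can record the byte offset at which each distinct 3-character prefix first appears (e.g. in a HashMap<String, Long>), after which a lookup is one seek plus a few readLine calls.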
It doesn't matter what language you use; the only way to find something is to search for it. You can use a search tool like Lucene to do the searching ahead of time, i.e., create a full-text search index, or you can do the searching when you need to, as you're doing now, but you won't be able to escape the searching part.
You can do this only if you have an Index file, and that index file contains indexes for the particular column of data you want to search on.
The other option would be to load the file into a database, like SQL Server Express, and run a SQL query on it.
Use regular expressions (regex). With these you can write an expression that matches only those specific letters; then, using a Scanner, it will look for only that sequence of letters.

Replace Blank Lines and spaces java

I have a Java string like the one below, which has multiple lines and blank spaces. I need to remove all of them so that it becomes one line.
(These are XML tags; the editor is not allowing me to include the less-than symbol.)
<paymentAction>
Authorization
</paymentAction>
Should become
<paymentAction>AUTHORIZATION</paymentAction>
Thanks in advance
Calling theString.replaceAll("\\s+", "") will replace all whitespace sequences with the empty string. Just be sure that the text between the tags doesn't contain spaces, otherwise they'll get removed too.
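On the example from the question:

String xml = "<paymentAction>\n    Authorization\n</paymentAction>";
// Removes ALL whitespace, including any inside text nodes:
String oneLine = xml.replaceAll("\\s+", "");
System.out.println(oneLine);   // <paymentAction>Authorization</paymentAction>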
You essentially want to convert the XML you have to Canonical Form. Below is one way of doing it, but it requires you to use that library. If you don't want to depend upon external libraries, then another option for you is to use XSLT.
The Canonicalizer class at Apache XML Security project:
NOTE: Dealing with non-XML-aware APIs (String.replaceAll()) is generally not recommended, as you end up dealing with special/exception cases.
This is a start. Probably not enough, but should be in the right direction.
xml.replaceAll(">\\s*", ">").replaceAll("\\s*<", "<");
However, I'm tempted to say there has to be a way to create a document from the XML and then serialize it in canonical form as Pangea suggested.

Java API : downloading and calculating tf-idf for a given web page

I am new to IR techniques.
I'm looking for a Java-based API or tool that does the following:
Download the given set of URLs
Extract the tokens
Remove the stop words
Perform Stemming
Create Inverted Index
Calculate the TF-IDF
Kindly let me know how Lucene can be helpful to me.
Regards
Yuvi
You could try the Word Vector Tool - it's been a while since the latest release, but it works fine here. It should be able to perform all of the steps you mention. I've never used the crawler part myself, however.
Actually, TF-IDF is a score given to a term in a document, rather than the whole document.
If you just want the TF-IDFs per term in document, maybe use this method, without ever touching Lucene.
If you want to create a search engine, you need to do a bit more (such as extracting text from the given URLs, whose corresponding documents would probably not contain raw text). If this is the case, consider using Solr.
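For what it's worth, the per-term computation itself is small enough to sketch without Lucene (a toy version using raw counts; real implementations normalize tf and smooth the idf):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdf {

    // tf-idf for every term of one tokenized document against a corpus:
    // tf(t, d) = raw count of t in d; idf(t) = log(N / df(t)).
    public static Map<String, Double> tfIdf(List<String> doc,
                                            List<List<String>> corpus) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : doc) {
            tf.merge(t, 1, Integer::sum);
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = 0;
            for (List<String> d : corpus) {
                if (d.contains(e.getKey())) {
                    df++;   // df >= 1 as long as doc itself is in the corpus
                }
            }
            scores.put(e.getKey(),
                    e.getValue() * Math.log((double) corpus.size() / df));
        }
        return scores;
    }
}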

Regex a xml string

What would be the correct way to find a string like this in a large xml:
<ser:serviceItemValues>
<ord1:label>Start Type</ord1:label>
<ord1:value>Loop</ord1:value>
<ord1:valueCd/>
<ord1:activityCd>iactn</ord1:activityCd>
</ser:serviceItemValues>
First: in this XML there will be a lot of repeats of the element above with different values (Loop, etc.) and other XML elements in this document. Mainly what I am concerned with is whether there is a serviceItemValues element that does not have 'Loop' as its value. I tried this, but it doesn't seem to work:
private static Pattern LOOP_REGEX =
Pattern.compile("[\\p{Print}]*?<ord1:label>Start Type</ord1:label>[\\p{Print}]+[^(Loop)][\\p{Print}]+</ser:serviceItemValues>[\\p{Print}]*?", Pattern.CASE_INSENSITIVE|Pattern.MULTILINE);
Thanks
Regular expressions are not the best option when parsing large amounts of HTML or XML.
There are a number of ways you could handle this without relying on regular expressions. Depending on the libraries you have at your disposal, you may be able to find the elements you're looking for using XPath.
Here's a helpful tutorial that may help you on your way: http://www.totheriver.com/learn/xml/xmltutorial.html
Look up XPath, which is kinda like regex for XML. Sort of.
With XPath you write expressions that extract information from XML documents, so extracting the nodes which don't have Loop as a sub-node is exactly the sort of thing it's cut out for.
I haven't tried this, but as a first stab, I'd guess the XPath expression would look something like:
"//ser:serviceItemValues/ord1:value[text()!='Loop']/parent::*"
Regular expressions are not the right tool for this job. You should be using an XML parser. It's pretty simple to set up and use, and will probably take you less time to code than it will to come up with this regular expression.
I recommend using JDOM. It has an easy syntax. An example can be found here:
http://notetodogself.blogspot.com/2008/04/teamsite-dcr-java-parser.html
If the documents that you will be parsing are large, you should use a SAX parser; I recommend Xerces.
When dealing with XML, you should probably not use regular expressions to check the content. Instead, use either a SAX-based parsing routine to check the relevant contents, or a DOM-like model (preferably pull-based if you're dealing with large documents).
Of course, if you're trying to validate the document's contents somehow, you should probably use some schema tool (I'd go with RELAX NG or Schematron, but I guess you could use XML Schema).
As mentioned in the other answers, regular expressions are not the tool for the job. You need an XPath engine. If you want to do these things from the command line, though, I recommend installing XMLStarlet. I have very good experience with this tool in solving various XML-related tasks. Depending on your OS, you might be able to just install the xmlstarlet RPM or deb package. The Mac OS X ports collection includes the package as well, I think.
