I'm having a bit of trouble with a Firebase Query. I want to query for objects, where the objects child value contains a certain string. So far I have something that looks like this:
Firebase *ref = [[Firebase alloc] initWithUrl:#"https://dinosaur-facts.firebaseio.com/dinosaurs"];
[[[[ref queryOrderedByKey] queryStartingAtValue:#"b"] queryEndingAtValue:#"b~"]
observeEventType:FEventTypeChildAdded withBlock:^(FDataSnapshot *snapshot) {
NSLog(#"%#", snapshot.key);
}];
But that only gives objects that have a starting value of "b". I want objects that contains the string "b". How do I do that?
There are no contains or fuzzy matching methods in the query API, which you have probably already guessed if you've scanned the API and the guide on queries.
Not only has this subject been discussed ad nauseam on SO [1] [2] [3] [4] [5], but I've touched several times on why one should use a real search engine, instead of attempting this sort of half-hearted search approach.
There is a reason it's often easier to Google a website to find results than to use the built-in search, and this is a primary component of that failure.
With all of that said, the answer to your question of how to do this manually, since there is no built-in contains, is to set up a server-side process that loads/streams data into memory and does manual searching of the contents, preferably with some sort of caching.
But honestly, ElasticSearch is faster and simpler, and more efficient here. Since that's a vast topic, I'll defer you to the blog post on this subject.
Related
I'm having a bit of trouble with a Firebase Query. I want to query for objects, where the objects child value contains a certain string. So far I have something that looks like this:
Firebase *ref = [[Firebase alloc] initWithUrl:#"https://dinosaur-facts.firebaseio.com/dinosaurs"];
[[[[ref queryOrderedByKey] queryStartingAtValue:#"b"] queryEndingAtValue:#"b~"]
observeEventType:FEventTypeChildAdded withBlock:^(FDataSnapshot *snapshot) {
NSLog(#"%#", snapshot.key);
}];
But that only gives objects that have a starting value of "b". I want objects that contains the string "b". How do I do that?
There are no contains or fuzzy matching methods in the query API, which you have probably already guessed if you've scanned the API and the guide on queries.
Not only has this subject been discussed ad nauseam on SO [1] [2] [3] [4] [5], but I've touched several times on why one should use a real search engine, instead of attempting this sort of half-hearted search approach.
There is a reason it's often easier to Google a website to find results than to use the built-in search, and this is a primary component of that failure.
With all of that said, the answer to your question of how to do this manually, since there is no built-in contains, is to set up a server-side process that loads/streams data into memory and does manual searching of the contents, preferably with some sort of caching.
But honestly, ElasticSearch is faster and simpler, and more efficient here. Since that's a vast topic, I'll defer you to the blog post on this subject.
I'm writing a Java application that is using Apache Solr to index and search through a list of articles. A requirement I am dealing with is that when a user searches for something, we are supplying a list of recommended related search terms, and the user has the option to include those extra terms in their search. The problem I'm having, however, is that we want the user's original search term to be prioritized, and results that match that should appear before results that only match related terms.
My research suggests that Solr's boost function is the solution for this, but I'm having some trouble getting it to work with Spring. The code all runs fine and I get my search results as expected, but the boost function doesn't seem to actually be re-ordering my searches at all. For example, I'm trying to do something like this:
Query query = new SimpleQuery();
Criteria searchCriteria = Criteria.where("title").contains("A").boost((float) 2);
Criteria extraCriteria = Criteria.where("title").contains("B").boost((float) 1);
query.addCriteria(searchCriteria.or(extraCriteria));
In this example I would be searching for any document whose title contains "A" or "B", but I want to boost results that match "A" to the top of the list.
I've also tried using the Extended DisMax Query Parser with a different syntax to achieve the same result, with similar lack of success. To follow the same example pattern, I'm trying to use the expression criteria as follows:
Query query = new SimpleQuery();
Criteria searchCriteria = Criteria.where("title").expression("A^2.0 OR B^1.0");
query.setDefType("edismax");
query.addCriteria(searchCriteria);
Again I would expect this to return documents with titles matching "A" or "B" but boost results matching "A", and again it simply doesn't seem to actually affect the ordering of my results at all.
Okay, I figured out the problem here. Elsewhere in the code someone else had added this snippet:
query.setPageRequest(pageable);
This was done to support pagination of the search results, but the pageable object ALSO contained some sort orders that looks like they got added to the query as part of the .setPageRequest method. Something to look out for in the future, it looks like sorts override boosting when working with Spring Solr queries in this scenario.
Is there any build-in library in Java for searching strings in large files of about 100GB in java. I am currently using binary-search but it is not that efficient.
As far as I know Java does not contain any file search engine, with or without an index. There is a very good reason for that too: search engine implementations are intrinsically tied to both the input data set and the search pattern format. A minor variation in either could result in massive changes in the search engine.
For us to be able to provide a more concrete answer you need to:
Describe exactly the data set: the number, path structure and average size of files, the format of each entry and the format of each contained token.
Describe exactly your search patterns: are those fixed strings, glob patterns or, say, regular expressions? Do you expect the pattern to match a full line or a specific token in each line?
Describe exactly your desired search results: do you want exact or approximate matches? Do you want to get a position in a file, or extract specific tokens?
Describe exactly your requirements: are you able to build an index beforehand? Is the data set expected to be modified in real time?
Explain why can't you use third party libraries such as Lucene that are designed exactly for this kind of work.
Explain why your current binary search, which should have a complexity of O(logn) is not efficient enough. The only thing that might be be faster, with a constant complexity would involve the use of a hash table.
It might be best if you described your problem in broader terms. For example, one might assume from your sample data set that what you have is a set of words and associated offset or document identifier lists. A simple method to approach searching in such a set would be to store an word/file-position index in a hash table to be able to access each associated list in constant time.
If u doesn't want to use the tools built for search, then store the data in DB and use sql.
I m MCS 2nd year student.I m doing a project in Java in which I have different images. For storing description of say IMAGE-1, I have ArrayList named IMAGE-1, similarly for IMAGE-2 ArrayList IMAGE-2 n so on.....
Now I need to develop a search engine, in which i need to find a all image's whose description matches with a word entered in search engine..........
FOR EX If i enter "computer" then I should be able to find all images whose description contain "computer".
So my question is...
How should i do this efficiently?
How should i maintain all those
ArrayList since i can have 100 of
such...? or should i use another
data structure instead of ArrayList?
A simple implementation is to tokenize the description and use a Map<String, Collection<Item>> to store all items for a token.
Building:
for(String token: tokenize(description)) map.get(token).add(item)
(A collection is needed as multiple entries could be found for a token. The initialization of the collection is missing in the code. But the idea should be clear.)
Use:
List<Item> result = map.get("Computer")
The the general purpose HashMap implementation is not the most efficient in this case. When you start getting memory problems you can look into a tree implementation that is more efficient (like radix trees - implementation).
The next step could be to use some (in-memory) database. These could be relational (HSQL) or not (Berkeley DB).
If you have a small number of images and short descriptions (< 1000 characters), load them into an array and search for words using String.indexOf() (i.e. one entry in the array == one complete image description). This is efficient enough for, say, less than 10'000 images.
Use toLowerCase() to fold the case of the characters (so users will find "Computer" when they type "computer"). String.indexOf() will also work for short words (using "comp" to find "Computer" or "compare").
If you have lots of images and long descriptions and/or you want to give your users some comforts for the search (like Google does), then use Lucene.
There is no simple, easy-to-use data structure that supports efficient fulltext search.
But do you actually need efficiency? Is this a desktop app or a web app? In the former case, don't worry about efficiency, a modern CPU can search through megabytes of text in fractions of a second - simply look through all your descriptions using String.contains() (or a regexp to allow more flexible searches).
If you really need efficiency (such as for a webapp where many people could do searches at the same time), look into Apache Lucene.
As for your ArrayLists, it seems strange to use one for the description of a single image. Why a list, what does the index represent? Lines? If so, and unless you actually need to access lines directly, replace the lists with a simple String - it can contain newline characters just fine.
I would suggest you to use the Hashtable class or to organize your content into a tree to optimize searching.
I have a database of companies. My application receives data that references a company by name, but the name may not exactly match the value in the database. I need to match the incoming data to the company it refers to.
For instance, my database might contain a company with name "A. B. Widgets & Co Ltd." while my incoming data might reference "AB Widgets Limited", "A.B. Widgets and Co", or "A B Widgets".
Some words in the company name (A B Widgets) are more important for matching than others (Co, Ltd, Inc, etc). It's important to avoid false matches.
The number of companies is small enough that I can maintain a map of their names in memory, ie. I have the option of using Java rather than SQL to find the right name.
How would you do this in Java?
You could standardize the formats as much as possible in your DB/map & input (i.e. convert to upper/lowercase), then use the Levenshtein (edit) distance metric from dynamic programming to score the input against all your known names.
You could then have the user confirm the match & if they don't like it, give them the option to enter that value into your list of known names (on second thought--that might be too much power to give a user...)
Although this thread is a bit old, I recently did an investigation on the efficiency of string distance metrics for name matching and came across this library:
https://code.google.com/p/java-similarities/
If you don't want to spend ages on implementing string distance algorithms, I recommend to give it a try as the first step, there's a ~20 different algorithms already implemented (incl. Levenshtein, Jaro-Winkler, Monge-Elkan algorithms etc.) and its code is structured well enough that you don't have to understand the whole logic in-depth, but you can start using it in minutes.
(BTW, I'm not the author of the library, so kudos for its creators.)
You can use an LCS algorithm to score them.
I do this in my photo album to make it easy to email in photos and get them to fall into security categories properly.
LCS code
Example usage (guessing a category based on what people entered)
I'd do LCS ignoring spaces, punctuation, case, and variations on "co", "llc", "ltd", and so forth.
Have a look at Lucene. It's an open source full text search Java library with 'near match' capabilities.
Your database may suport the use of Regular Expressions (regex) - see below for some tutorials in Java - here's the link to the MySQL documentation (as an example):
http://dev.mysql.com/doc/refman/5.0/en/regexp.html#operator_regexp
You would probably want to store in the database a fairly complex regular express statement for each company that encompassed the variations in spelling that you might anticipate - or the sub-elements of the company name that you would like to weight as being significant.
You can also use the regex library in Java
JDK 1.4.2
http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html
JDK 1.5.0
http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Matcher.html
Using Regular Expressions in Java
http://www.regular-expressions.info/java.html
The Java Regex API Explained
http://www.sitepoint.com/article/java-regex-api-explained/
You might also want to see if your database supports Soundex capabilities (for example, see the following link to MySQL)
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
vote up 1 vote down
You can use an LCS algorithm to score them.
I do this in my photo album to make it easy to email in photos and get them to fall into security categories properly.
* LCS code
* Example usage (guessing a category based on what people entered)
to be more precise, better than Least Common Subsequence, Least Common Substring should be more precise as the order of characters is important.
You could use Lucene to index your database, then query the Lucene index. There are a number of search engines built on top of Lucene, including Solr.