Query String with DatastoreService - java

Using the DatastoreService how can I do queries for String containing some string similar to Java String:
contains
startsWith
endsWith

When querying against a String property, exact matches are the easiest, since that behavior works "out of the box".
"startsWith" queries can be done fairly easily by turning property startsWith: abc into property >= 'abc' and property < 'abd', where you calculate the end of the range.
"endsWith" can be done by storing a reversed copy of the String, and creating a query as above, but with the target reversed. I.e., property endsWith: 'abc' becomes `propertyReversed >= 'cba' and propertyReversed < 'cbb'.
"contains" is a large challenge. There are several approaches, and the right one for your situation depends on your situation. If the string is relatively short (e.g., a name of an address), you could store list of trailing substrings, matching against them with a range query as above.

As Dave mentions in his answer, contains is not available as a Datastore primitive. If you're looking for containment queries, the Search API is a good place to look (note: it's still in experimental).

Related

SQL: Long comporating or String, which one is faster

I'm running this off a java program which connects to my sql server on the same machine.
Basically I'm trying to call a certain 'String' which can be identified by the string self or by it's already stored 'long'(int64) which is a method that stores an unique long related to the string.
So in this case my question would be, would long comparison at a SQL lookup be faster or wouldn't it matter that much versus String comparison.
SELECT * FROM playerAccount WHERE playerName = {string in Java}
or
SELECT * FROM playerAccount WHERE nameHash = {long in Java}
Thanks in advance ;)
The comparison operation itself is rather negligible. However, in general in computer code, the comparison of the long is going to use fewer cycles than the comparison for a string.
The reason is that comparing the bits in a numeric value is unambiguous and the code doesn't need to worry about the length of the value. When comparing strings, the underlying code has to "parse" the strings, character by character, to make the comparison, figure out where they end, and handle collations and character pages.
But, this is rather unimportant. For speed, you want an index. And although an index using the numeric value might be an iota faster than an index using a string, this is the wrong criteria for choosing which to use. Your code should be designed to function correctly and to be maintainable. It is doubtful that an optimization of this sort would ever be necessary to achieve a real-world goal.
Generally comparing long values is faster than comparing string values.
If string and long are stored in a database the problem is not the comparison, but the presence or absence of an index.
So the better solution is using the long value with an index the database.
Please note that if "nameHash" is the hashCode of the field "playerName" the two search don't returns necessarely the same record, infact two players with different names can possible have the same hashCode, so consider exactly what are your needs and eventually update the code.

Solr: The default OR operator returns irrelevant results, when the fields are queried with multiple words

I need to make my Solr-based search return results if all of the search keywords appear anywhere in any of the search fields.
The current situation:
an example search query: keywords: "berlin house john" name: "berlin house john" name" author: "berlin house john" name"
Let's suppose that there is only one result, where keywords="house", name="berlin", and author="john" and there is no other possible permutation of these three words.
if the defaultOperator is OR, Solr returns a simple OR-ing of every keyword in every field, which is an enormous list, where of course, the best matching result is at the first position, but the next results have very little relevance (perhaps only one field matching), and they simply confuse the user.
On another hand, if i switch the default operator to AND, I get absolutely no results. I guess it is trying to find a perfect match for all three words, in all three fields, which of course, does not exist.
The search terms come to the application from a search input, in which, the user writes free text - there are no specific language conventions (hashtags or something).
I know that what I am asking about is possible because I have done it before with pure Lucene, and it worked. What am I doing wrong?
If you just need to make sure, all words appear in all fields I would suggest copying all relevant fields into one field at index time and query this one instead. To do so, you need to introduce a new field and then use copyField for all sourcefields you want to copy over. To copy all fields, use:
<copyField source="*" dest="text"/>
See http://wiki.apache.org/solr/SchemaXml#Copy_Fields for details.
An similar approach would be to use boolean algebra at query time. This is a bit different from the above solution.
Your query should look like
(keywords:"berlin" OR keywords:"house" OR keywords:"john") AND
(name:"berlin" OR name:"house" OR name:"john") AND
(author:"berlin" OR author:"house" OR author:"john")
which basically states: one or more terms must match in keyword and one or more terms must match in name and one or more terms must match in author.
From Solr 4, defaultOperator is deprecated. Please don't use it.
Also as for me defaultOperator works same as specified operator in query. I can't said why it is, its just my experience.
Please try query with param {!q.op=AND}
I guess you use default query parser, fix me if I am wrong

Data structure for search engine in JAVA?

I m MCS 2nd year student.I m doing a project in Java in which I have different images. For storing description of say IMAGE-1, I have ArrayList named IMAGE-1, similarly for IMAGE-2 ArrayList IMAGE-2 n so on.....
Now I need to develop a search engine, in which i need to find a all image's whose description matches with a word entered in search engine..........
FOR EX If i enter "computer" then I should be able to find all images whose description contain "computer".
So my question is...
How should i do this efficiently?
How should i maintain all those
ArrayList since i can have 100 of
such...? or should i use another
data structure instead of ArrayList?
A simple implementation is to tokenize the description and use a Map<String, Collection<Item>> to store all items for a token.
Building:
for(String token: tokenize(description)) map.get(token).add(item)
(A collection is needed as multiple entries could be found for a token. The initialization of the collection is missing in the code. But the idea should be clear.)
Use:
List<Item> result = map.get("Computer")
The the general purpose HashMap implementation is not the most efficient in this case. When you start getting memory problems you can look into a tree implementation that is more efficient (like radix trees - implementation).
The next step could be to use some (in-memory) database. These could be relational (HSQL) or not (Berkeley DB).
If you have a small number of images and short descriptions (< 1000 characters), load them into an array and search for words using String.indexOf() (i.e. one entry in the array == one complete image description). This is efficient enough for, say, less than 10'000 images.
Use toLowerCase() to fold the case of the characters (so users will find "Computer" when they type "computer"). String.indexOf() will also work for short words (using "comp" to find "Computer" or "compare").
If you have lots of images and long descriptions and/or you want to give your users some comforts for the search (like Google does), then use Lucene.
There is no simple, easy-to-use data structure that supports efficient fulltext search.
But do you actually need efficiency? Is this a desktop app or a web app? In the former case, don't worry about efficiency, a modern CPU can search through megabytes of text in fractions of a second - simply look through all your descriptions using String.contains() (or a regexp to allow more flexible searches).
If you really need efficiency (such as for a webapp where many people could do searches at the same time), look into Apache Lucene.
As for your ArrayLists, it seems strange to use one for the description of a single image. Why a list, what does the index represent? Lines? If so, and unless you actually need to access lines directly, replace the lists with a simple String - it can contain newline characters just fine.
I would suggest you to use the Hashtable class or to organize your content into a tree to optimize searching.

JCR 170 Data modeling: Node names

The situation:
Lets say we are implementing a blog engine based on JCR with support for localization.
The content structure looks something like this /blogname/content/[node name]
The problem:
What is the best way to name the content nodes (/blogname/content/[nodename]) to satisfy the following requirements:
The node name must be usable in HTML to support REST like URLs i.e.: blogname.com/content/nodename should point to a single content item.
The above requirement must not produce ugly URLs i.e.: /content/node_name is good, /content/node%20name is bad.
Programmatic retrieval should be easy given the node name i.e.: //content[#node_name=some-name]
The naming scheme must guarantee node name uniqueness.
PS: The JCR implementation used is JackRabbit
For 1. to 3. the answer is simple: just use characters you want to see in the node name, ie. escape whatever input string you have (eg. the blog post title) against a restricted character set such as the one for URIs.
For example, do not allow spaces (which are allowed for JCR node names, but would produce the ugly %20 in URLs) and other chars that must be encoded in URLs. You can remove those chars or simply replace them with a underscore, because that looks good in most cases.
Regarding unique names (4.), you can either include the current time incl. milliseconds into it or you explicitly check for collisions. The first might look a bit ugly, but should probably never fail for a blog scenario. The latter can be done by reacting upon the exception thrown if a node with such a name already exists and adding eg. an incrementing counter and try again (eg. my_great_post1, my_great_post2, etc.). You can also lock the parent node so that only one session can actually add a node at the same time, which avoids a trial loop, but comes at the cost of blocking.
Note: //content[#node_name=some-name] is not a valid JCR Xpath query. You probably want to use /jcr:root/content//some-name for that.
Regarding item 3. I recently learned that xpath queries do not allow items to start with a number. If your node name starts with a number it can still be queried by escaping the first byte of the name, but your queries will be more straightforward if you start all node names with a letter.
(I'm not sure about property names. Haven't ever seen one that didn't start with a letter.)
Unique names: To quickly generate a unique name from the first characters of a title plus a random number (to resolve conflicts), you could use the following algorithm:
String title = "JCR 170 Data modeling: Node names";
String name = title.substring(0, Math.min(title.length(), 10)).trim().replace(' ', '_');
if (name is not unique) {
name += "_";
Random r = new Random();
while (name is not unqiue) {
name += Integer.toString(r.nextInt(10));
}
}
The advantage to use a random number is: even if you have many similar names, this will resolve conflicts very quickly.

Matching inexact company names in Java

I have a database of companies. My application receives data that references a company by name, but the name may not exactly match the value in the database. I need to match the incoming data to the company it refers to.
For instance, my database might contain a company with name "A. B. Widgets & Co Ltd." while my incoming data might reference "AB Widgets Limited", "A.B. Widgets and Co", or "A B Widgets".
Some words in the company name (A B Widgets) are more important for matching than others (Co, Ltd, Inc, etc). It's important to avoid false matches.
The number of companies is small enough that I can maintain a map of their names in memory, ie. I have the option of using Java rather than SQL to find the right name.
How would you do this in Java?
You could standardize the formats as much as possible in your DB/map & input (i.e. convert to upper/lowercase), then use the Levenshtein (edit) distance metric from dynamic programming to score the input against all your known names.
You could then have the user confirm the match & if they don't like it, give them the option to enter that value into your list of known names (on second thought--that might be too much power to give a user...)
Although this thread is a bit old, I recently did an investigation on the efficiency of string distance metrics for name matching and came across this library:
https://code.google.com/p/java-similarities/
If you don't want to spend ages on implementing string distance algorithms, I recommend to give it a try as the first step, there's a ~20 different algorithms already implemented (incl. Levenshtein, Jaro-Winkler, Monge-Elkan algorithms etc.) and its code is structured well enough that you don't have to understand the whole logic in-depth, but you can start using it in minutes.
(BTW, I'm not the author of the library, so kudos for its creators.)
You can use an LCS algorithm to score them.
I do this in my photo album to make it easy to email in photos and get them to fall into security categories properly.
LCS code
Example usage (guessing a category based on what people entered)
I'd do LCS ignoring spaces, punctuation, case, and variations on "co", "llc", "ltd", and so forth.
Have a look at Lucene. It's an open source full text search Java library with 'near match' capabilities.
Your database may suport the use of Regular Expressions (regex) - see below for some tutorials in Java - here's the link to the MySQL documentation (as an example):
http://dev.mysql.com/doc/refman/5.0/en/regexp.html#operator_regexp
You would probably want to store in the database a fairly complex regular express statement for each company that encompassed the variations in spelling that you might anticipate - or the sub-elements of the company name that you would like to weight as being significant.
You can also use the regex library in Java
JDK 1.4.2
http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html
JDK 1.5.0
http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Matcher.html
Using Regular Expressions in Java
http://www.regular-expressions.info/java.html
The Java Regex API Explained
http://www.sitepoint.com/article/java-regex-api-explained/
You might also want to see if your database supports Soundex capabilities (for example, see the following link to MySQL)
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
vote up 1 vote down
You can use an LCS algorithm to score them.
I do this in my photo album to make it easy to email in photos and get them to fall into security categories properly.
* LCS code
* Example usage (guessing a category based on what people entered)
to be more precise, better than Least Common Subsequence, Least Common Substring should be more precise as the order of characters is important.
You could use Lucene to index your database, then query the Lucene index. There are a number of search engines built on top of Lucene, including Solr.

Categories