The situation:
Lets say we are implementing a blog engine based on JCR with support for localization.
The content structure looks something like this /blogname/content/[node name]
The problem:
What is the best way to name the content nodes (/blogname/content/[nodename]) to satisfy the following requirements:
The node name must be usable in HTML to support REST like URLs i.e.: blogname.com/content/nodename should point to a single content item.
The above requirement must not produce ugly URLs i.e.: /content/node_name is good, /content/node%20name is bad.
Programmatic retrieval should be easy given the node name i.e.: //content[#node_name=some-name]
The naming scheme must guarantee node name uniqueness.
PS: The JCR implementation used is JackRabbit
For 1. to 3. the answer is simple: just use characters you want to see in the node name, ie. escape whatever input string you have (eg. the blog post title) against a restricted character set such as the one for URIs.
For example, do not allow spaces (which are allowed for JCR node names, but would produce the ugly %20 in URLs) and other chars that must be encoded in URLs. You can remove those chars or simply replace them with a underscore, because that looks good in most cases.
Regarding unique names (4.), you can either include the current time incl. milliseconds into it or you explicitly check for collisions. The first might look a bit ugly, but should probably never fail for a blog scenario. The latter can be done by reacting upon the exception thrown if a node with such a name already exists and adding eg. an incrementing counter and try again (eg. my_great_post1, my_great_post2, etc.). You can also lock the parent node so that only one session can actually add a node at the same time, which avoids a trial loop, but comes at the cost of blocking.
Note: //content[#node_name=some-name] is not a valid JCR Xpath query. You probably want to use /jcr:root/content//some-name for that.
Regarding item 3. I recently learned that xpath queries do not allow items to start with a number. If your node name starts with a number it can still be queried by escaping the first byte of the name, but your queries will be more straightforward if you start all node names with a letter.
(I'm not sure about property names. Haven't ever seen one that didn't start with a letter.)
Unique names: To quickly generate a unique name from the first characters of a title plus a random number (to resolve conflicts), you could use the following algorithm:
String title = "JCR 170 Data modeling: Node names";
String name = title.substring(0, Math.min(title.length(), 10)).trim().replace(' ', '_');
if (name is not unique) {
name += "_";
Random r = new Random();
while (name is not unqiue) {
name += Integer.toString(r.nextInt(10));
}
}
The advantage to use a random number is: even if you have many similar names, this will resolve conflicts very quickly.
Related
I have a pretty large list containing many instances of one class, this class has many attributes(member variables). My problem is to find a feasible data structure to store these instances that allow searches based on multiple attributes like database search(i.e. A Student class, each student has age, date of birth, grade and GPA.find all 2nd year students whose ages are between 20 and 23). The Map seems not applicable as it only allow single key and if I create multi attribute index for searching, the big O is still not decreased. I also considered using trees like AVL tree, and I don't think it would work.
I'd be grateful if someone could give me some hints.
I think what you are looking for is an Inverted Index (using attribute name + value as keys) or possibly one Inverted Index per attribute. A search would build the intersection of all results found for each attribute.
You could do this:
Build an AVL tree with objects sorted by the most recurrent attribute (just one, e.g. "id" or "name").
Then create a search function that instead of taking a value, takes a Java lambda expression F (so your seacrh condition must be something like F(myObj) == true instead of myObj.deFaultAttribute == searchParameter)
For the example you gave, F could be something like ((myObj) -> myObj.year==2 && myObj.age>=20 && myObj.age<=23)
I hope it helps.
I need to make my Solr-based search return results if all of the search keywords appear anywhere in any of the search fields.
The current situation:
an example search query: keywords: "berlin house john" name: "berlin house john" name" author: "berlin house john" name"
Let's suppose that there is only one result, where keywords="house", name="berlin", and author="john" and there is no other possible permutation of these three words.
if the defaultOperator is OR, Solr returns a simple OR-ing of every keyword in every field, which is an enormous list, where of course, the best matching result is at the first position, but the next results have very little relevance (perhaps only one field matching), and they simply confuse the user.
On another hand, if i switch the default operator to AND, I get absolutely no results. I guess it is trying to find a perfect match for all three words, in all three fields, which of course, does not exist.
The search terms come to the application from a search input, in which, the user writes free text - there are no specific language conventions (hashtags or something).
I know that what I am asking about is possible because I have done it before with pure Lucene, and it worked. What am I doing wrong?
If you just need to make sure, all words appear in all fields I would suggest copying all relevant fields into one field at index time and query this one instead. To do so, you need to introduce a new field and then use copyField for all sourcefields you want to copy over. To copy all fields, use:
<copyField source="*" dest="text"/>
See http://wiki.apache.org/solr/SchemaXml#Copy_Fields for details.
An similar approach would be to use boolean algebra at query time. This is a bit different from the above solution.
Your query should look like
(keywords:"berlin" OR keywords:"house" OR keywords:"john") AND
(name:"berlin" OR name:"house" OR name:"john") AND
(author:"berlin" OR author:"house" OR author:"john")
which basically states: one or more terms must match in keyword and one or more terms must match in name and one or more terms must match in author.
From Solr 4, defaultOperator is deprecated. Please don't use it.
Also as for me defaultOperator works same as specified operator in query. I can't said why it is, its just my experience.
Please try query with param {!q.op=AND}
I guess you use default query parser, fix me if I am wrong
Using the DatastoreService how can I do queries for String containing some string similar to Java String:
contains
startsWith
endsWith
When querying against a String property, exact matches are the easiest, since that behavior works "out of the box".
"startsWith" queries can be done fairly easily by turning property startsWith: abc into property >= 'abc' and property < 'abd', where you calculate the end of the range.
"endsWith" can be done by storing a reversed copy of the String, and creating a query as above, but with the target reversed. I.e., property endsWith: 'abc' becomes `propertyReversed >= 'cba' and propertyReversed < 'cbb'.
"contains" is a large challenge. There are several approaches, and the right one for your situation depends on your situation. If the string is relatively short (e.g., a name of an address), you could store list of trailing substrings, matching against them with a range query as above.
As Dave mentions in his answer, contains is not available as a Datastore primitive. If you're looking for containment queries, the Search API is a good place to look (note: it's still in experimental).
I have one List in C#.This String array contains elements of Paragraph that are read from the Ms-Word File.for example,
list 0-> The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Finally the image displayed in the header will be added to finalize the report.
list 1->The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Various other elements of WordprocessingML will also be handled. By moving the formatting information into styles a higher degree of re-use is made possible. The document will be marked using custom XML tags and the insertion of other advanced elements such as a table of contents is discussed. But before all the advanced features can be added, the base of the document needs to be built.
Some thing like that.
Now My search String is :
The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Before going over all the elements which make up the sample documents a basic document structure needs to be laid out. When you take a WordprocessingML document and use the Windows Explorer shell to rename the docx extension to zip you will find many different elements, especially in larger documents.
I want to check my search String with that list elements.
my criteria is "If each list element contains 85% match or exact match of search string then we want to retrieve that list elements.
In our case,
list 0 -> more satisfies my search string.
list 1 -it also matches some text,but i think below not equal to my criteria...
How i do this kind of criteria based search on String...?
I have more confusion on my problem also
Welcome your ideas and thoughts...
The keyword is DISTANCE or "string distance". and also, "Paragraph similarity"
You seek to implement a function which would express as a scalar, say a percentage as suggested in the question, indicative of how similar a string is from another string.
Plain string distance functions such as hamming or Levenstein may not be appropriate, for they work at character level rather than at word level, but generally these algorithms convey the idea of what is needed.
Working at word level you'll probably also want to take into account some common NLP features, for example ignore (or give less weight to) very common words (such as 'the', 'in', 'of' etc.) and maybe allow for some forms of stemming. The order of the words, or for the least their proximity may also be of import.
One key factor to remember is that even with relatively short strings, many distances functions can be quite expensive, computationally speaking. Before selecting one particular algorithm you'll need to get an idea of the general parameters of the problem:
how many strings would have to be compared? (on average, maximum)
how many words/token do the string contain? (on average, max)
Is it possible to introduce a simple (quick) filter to reduce the number of strings to be compared ?
how fancy do we need to get with linguistic features ?
is it possible to pre-process the strings ?
Are all the records in a single language ?
Comparing Methods for Single Paragraph Similarity Analysis, a scholarly paper provides a survey of relevant techniques and considerations.
In a nutshell, the the amount of design-time and run-time one can apply this relatively open problem varies greatly and is typically a compromise between the level of precision desired vs. the run-time resources and the overall complexity of the solution which may be acceptable.
In its simplest form, when the order of the words matters little, computing the sum of factors based on the TF-IDF values of the words which match may be a very acceptable solution.
Fancier solutions may introduce a pipeline of processes borrowed from NLP, for example Part-of-Speech Tagging (say for the purpose of avoiding false positive such as "SAW" as a noun (to cut wood), and "SAW" as the past tense of the verb "to see". or more likely to filter outright some of the words based on their grammatical function), stemming and possibly semantic substitutions, concept extraction or latent semantic analysis.
You may want to look into lucene for Java or lucene.net for c#. I don't think it'll do the percentage requirement you want out of the box, but it's a great tool for doing text matching.
You maybe could run a separate query for each word, and then work out the percentage yourself of ones that matched.
Here's an idea (and not a solution by any means but something to get started with)
private IEnumerable<string> SearchList = GetAllItems(); // load your list
void Search(string searchPara)
{
char[] delimiters = new char[]{' ','.',','};
var wordsInSearchPara = searchPara.Split(delimiters, StringSplitOptions.RemoveEmptyEntries).Select(a=>a.ToLower()).OrderBy(a => a);
foreach (var item in SearchList)
{
var wordsInItem = item.Split(delimiters, StringSplitOptions.RemoveEmptyEntries).Select(a => a.ToLower()).OrderBy(a => a);
var common = wordsInItem.Intersect(wordsInSearchPara);
// now that you know the common items, you can get the differential
}
}
I'm currently working on implementing a fuzzy search for a terminology web service and I'm looking for suggestions on how I might improve the current implementation. It's too much code to share, but I think an explanation might suffice to prompt thoughtful suggestions. I realize it's a lot to read but I'd appreciate any help.
First, the terminology is basically just a number of names (or terms). For each word, we split it into tokens by space and then iterate through each character to add it to the trie. On a terminal node (such as when the character y in strawberry is reached) we store in a list an index to the master term list. So a terminal node can have multiple indices (since the terminal node for strawberry will match 'strawberry' and 'allergy to strawberry').
As for the actual search, the search query is also broken up into tokens by space. The search algorithm is run for each token. The first character of the search token must be a match (so traw will never match strawberry). After that, we go through children of each successive node. If there is child with a character that matches, we continue the search with the next character of the search token. If a child does not match the given character, we look at the children using the current character of the search token (so not advancing it). This is the fuzziness part, so 'stwb' will match 'strawberry'.
When we reach the end of the search token, we will search through the rest of the trie structure at that node to get all potential matches (since the indexes to the master term list are only on the terminal nodes). We call this the roll up. We store the indices by setting their value on a BitSet. Then, we simply and the BitSets from the results of each search token result. We then take, say, the first 1000 or 5000 indices from the anded BitSets and find the actual terms they correspond to. We use Levenshtein to score each term and then sort by score to get our final results.
This works fairly well and is pretty fast. There are over 390k nodes in the tree and over 1.1 million actual term names. However, there are problems with this as it stands.
For example, searching for 'car cat' will return Catheterization, when we don't want it to (since the search query is two words, the result should be at least two). That would be easy enough to check, but it doesn't take care of a situation like Catheterization Procedure, since it is two words. Ideally, we'd want it to match something like Cardiac Catheterization.
Based on the need to correct this, we came up with some changes. For one, we go through the trie in a mixed depth/breadth search. Essentially we go depth first as long as a character matches. Those child nodes that didn't match get added to a priority queue. The priority queue is ordered by edit distance, which can be calculated while searching the trie (since if there's a character match, distance remains the same and if not, it increases by 1). By doing this, we get the edit distance for each word.
We are no longer using the BitSet. Instead, it's a map of the index to a Terminfo object. This object stores the index of the query phrase and the term phrase and the score. So if the search is "car cat" and a term matched is "Catheterization procedure" the term phrase indices will be 1 as will the query phrase indices. For "Cardiac Catheterization" the term phrase indices will be 1,2 as will the query phrase indices. As you can see, it's very simple afterward to look at the count of term phrase indices and query phrase indices and if they aren't at least equal to the search word count, they can be discarded.
After that, we add up the edit distances of the words, remove the words from the term that match the term phrase index, and count the remaining letters to get the true edit distance. For example, if you matched the term "allergy to strawberries" and your search query was "straw" you would have a score of 7 from strawberries, then you'd use the term phrase index to discard strawberries from the term, and just count "allergy to" (minus the spaces) to get the score of 16.
This gets us the accurate results we expect. However, it is far too slow. Where before we could get 25-40 ms on one word search, now it could be as much as half a second. It's largely from things like instantiating TermInfo objects, using .add() operations, .put() operations and the fact that we have to return a large number of matches. We could limit each search to only return 1000 matches, but there's no guarantee that the first 1000 results for "car" would match any of the first 1000 matches for "cat" (remember, there are over 1.1. million terms).
Even for a single query word, like cat, we still need a large number of matches. This is because if we search for 'cat' the search is going to match car and roll up all the terminal nodes below it (which will be a lot). However, if we limited the number of results, it would place too heavy an emphasis on words that begin with the query and not the edit distance. Thus, words like catheterization would be more likely to be included than something like coat.
So, basically, are there any thoughts on how we could handle the problems that the second implementation fixed, but without as much of the speed slow down that it introduced? I can include some selected code if it might make things clearer but I didn't want to post a giant wall of code.
Wow... tough one.
Well why don't you implement lucene? It is the best and current state of the art when it comes to problems like yours afaik.
However I want to share some thoughts...
Fuziness isnt something like straw* its rather the mis typing of some words. And every missing/wrong character adds 1 to the distance.
Its generally very, very hard to have partial matching (wildcards) and fuzziness at the same time!
Tokenizing is generally a good idea.
Everything also heavily depends on the data you get. Are there spelling mistakes in the source files or only in the search queries?
I have seen some pretty nice implementations using multi dimensional range trees.
But I really think if you want to accomplish all of the above you need a pretty neat combination of a graph set and a nice indexing algorithm.
You could for example use a semantic database like sesame and when importing your documents import every token and document as a node. Then depending on position in the document etc you can add a weighted relation.
Then you need the tokens in some structure where you can do efficient fuzzy matches such as bk-trees.
I think you could index the tokens in a mysql database and do bit-wise comparision functions to get differences. Theres a function that returns all matching bits, if you translit your strings to ascii and group the bits you could achieve something pretty fast.
However if you matched the tokens to the string you can construct a hypothetical perfect match antity and query your semantic database for the nearest neighbours.
You would have to break the words apart into partial words when tokenizing to achieve partial matches.
However you can do also wildcard matches (prefix, suffix or both) but no fuzziness then.
You can also index the whole word or different concatenations of tokens.
However there may be special bk-tree implementations that support this but i have never seen one.
I did a number of iterations of a spelling corrector ages ago, and here's a recent description of the basic method. Basically the dictionary of correct words is in a trie, and the search is a simple branch-and-bound. I used repeated depth-first trie walk, bounded by lev. distance because, since each additional increment of distance results in much more of the trie being walked, the cost, for small distance, is basically exponential in the distance, so going to a combined depth/breadth search doesn't save much but makes it a lot more complicated.
(Aside: You'd be amazed how many ways physicians can try to spell "acetylsalicylic acid".)
I'm surprised at the size of your trie. A basic dictionary of acceptable words is maybe a few thousand. Then there are common prefixes and suffixes. Since the structure is a trie, you can connect together sub-tries and save a lot of space. Like the trie of basic prefixes can connect to the main dictionary, and then the terminal nodes of the main dictionary can connect to the trie of common suffixes (which can in fact contain cycles). In other words, the trie can be generalized into a finite state machine. That gives you a lot of flexibility.
REGARDLESS of all that, you have a performance problem. The nice thing about performance problems is, the worse they are, the easier they are to find. I've been a real pest on StackOverflow pointing this out. This link explains how to do it, links to a detailed example, and tries to dispel some popular myths. In a nutshell, the more time it is spending doing something that you could optimize, the more likely you will catch it doing that if you just pause it and take a look. My suspicion is that a lot of time is going into operations on overblown data structure, rather than just getting to the answer. That's a common situation, but don't fix anything until samples point you directly at the problem.