I need to extend REST API in java with Spring reaching to Marklogic database. I already have functionality using StructuredQueryBuilder and the search method from DocumentManagerImpl (package com.marklogic.client.impl), but the client expects highlighting fragments of answers matching the searched phrases in Polish language, including derivatives from the stems (there may be several keywords by which we search, but with the condition of joint occurrence in the result).
How to extend the search query to Marklogic in the simplest way and using the Java API from Marklogic to obtain additional information about the location of the searched phrases in the returned objects in one query to the database?
Should I put a custom dictionary for stemming in Marklogic? Are there any sources recommended by Marklogic where I can get dictionaries?
You can get snippets with highlighting via the Java API via code like this:
QueryManager mgr = client.newQueryManager();
SearchHandle handle = mgr.search(mgr.newStructuredQueryBuilder().term("quick"), new SearchHandle());
for (MatchDocumentSummary matchResult : handle.getMatchResults()) {
for (MatchLocation matchLocation : matchResult.getMatchLocations()) {
for (MatchSnippet snippet : matchLocation.getSnippets()) {
System.out.println(snippet.getText());
System.out.println(snippet.isHighlighted());
}
}
}
Custom dictionaries are covered at https://docs.marklogic.com/guide/search-dev/custom-dictionaries. I believe that once you've created a dictionary, you'll want to modify the Language setting on your database to use the new dictionary (I have not tried that before, but that appears to be the expected approach).
As for a Polish dictionary - there's a link to a repository of dictionaries at https://developer.marklogic.com/code/dictionaries-and-thesauri/, but there's not a Polish dictionary there. Building a complete dictionary would of course be a significant effort, though it sounds like if you're mostly interested in stemming on certain keywords, you could build a custom dictionary containing just those keywords and their stems.
Related
I'm having a bit of trouble with a Firebase Query. I want to query for objects, where the objects child value contains a certain string. So far I have something that looks like this:
Firebase *ref = [[Firebase alloc] initWithUrl:#"https://dinosaur-facts.firebaseio.com/dinosaurs"];
[[[[ref queryOrderedByKey] queryStartingAtValue:#"b"] queryEndingAtValue:#"b~"]
observeEventType:FEventTypeChildAdded withBlock:^(FDataSnapshot *snapshot) {
NSLog(#"%#", snapshot.key);
}];
But that only gives objects that have a starting value of "b". I want objects that contains the string "b". How do I do that?
There are no contains or fuzzy matching methods in the query API, which you have probably already guessed if you've scanned the API and the guide on queries.
Not only has this subject been discussed ad nauseam on SO [1] [2] [3] [4] [5], but I've touched several times on why one should use a real search engine, instead of attempting this sort of half-hearted search approach.
There is a reason it's often easier to Google a website to find results than to use the built-in search, and this is a primary component of that failure.
With all of that said, the answer to your question of how to do this manually, since there is no built-in contains, is to set up a server-side process that loads/streams data into memory and does manual searching of the contents, preferably with some sort of caching.
But honestly, ElasticSearch is faster and simpler, and more efficient here. Since that's a vast topic, I'll defer you to the blog post on this subject.
I'm having a bit of trouble with a Firebase Query. I want to query for objects, where the objects child value contains a certain string. So far I have something that looks like this:
Firebase *ref = [[Firebase alloc] initWithUrl:#"https://dinosaur-facts.firebaseio.com/dinosaurs"];
[[[[ref queryOrderedByKey] queryStartingAtValue:#"b"] queryEndingAtValue:#"b~"]
observeEventType:FEventTypeChildAdded withBlock:^(FDataSnapshot *snapshot) {
NSLog(#"%#", snapshot.key);
}];
But that only gives objects that have a starting value of "b". I want objects that contains the string "b". How do I do that?
There are no contains or fuzzy matching methods in the query API, which you have probably already guessed if you've scanned the API and the guide on queries.
Not only has this subject been discussed ad nauseam on SO [1] [2] [3] [4] [5], but I've touched several times on why one should use a real search engine, instead of attempting this sort of half-hearted search approach.
There is a reason it's often easier to Google a website to find results than to use the built-in search, and this is a primary component of that failure.
With all of that said, the answer to your question of how to do this manually, since there is no built-in contains, is to set up a server-side process that loads/streams data into memory and does manual searching of the contents, preferably with some sort of caching.
But honestly, ElasticSearch is faster and simpler, and more efficient here. Since that's a vast topic, I'll defer you to the blog post on this subject.
Is there any build-in library in Java for searching strings in large files of about 100GB in java. I am currently using binary-search but it is not that efficient.
As far as I know Java does not contain any file search engine, with or without an index. There is a very good reason for that too: search engine implementations are intrinsically tied to both the input data set and the search pattern format. A minor variation in either could result in massive changes in the search engine.
For us to be able to provide a more concrete answer you need to:
Describe exactly the data set: the number, path structure and average size of files, the format of each entry and the format of each contained token.
Describe exactly your search patterns: are those fixed strings, glob patterns or, say, regular expressions? Do you expect the pattern to match a full line or a specific token in each line?
Describe exactly your desired search results: do you want exact or approximate matches? Do you want to get a position in a file, or extract specific tokens?
Describe exactly your requirements: are you able to build an index beforehand? Is the data set expected to be modified in real time?
Explain why can't you use third party libraries such as Lucene that are designed exactly for this kind of work.
Explain why your current binary search, which should have a complexity of O(logn) is not efficient enough. The only thing that might be be faster, with a constant complexity would involve the use of a hash table.
It might be best if you described your problem in broader terms. For example, one might assume from your sample data set that what you have is a set of words and associated offset or document identifier lists. A simple method to approach searching in such a set would be to store an word/file-position index in a hash table to be able to access each associated list in constant time.
If u doesn't want to use the tools built for search, then store the data in DB and use sql.
I m MCS 2nd year student.I m doing a project in Java in which I have different images. For storing description of say IMAGE-1, I have ArrayList named IMAGE-1, similarly for IMAGE-2 ArrayList IMAGE-2 n so on.....
Now I need to develop a search engine, in which i need to find a all image's whose description matches with a word entered in search engine..........
FOR EX If i enter "computer" then I should be able to find all images whose description contain "computer".
So my question is...
How should i do this efficiently?
How should i maintain all those
ArrayList since i can have 100 of
such...? or should i use another
data structure instead of ArrayList?
A simple implementation is to tokenize the description and use a Map<String, Collection<Item>> to store all items for a token.
Building:
for(String token: tokenize(description)) map.get(token).add(item)
(A collection is needed as multiple entries could be found for a token. The initialization of the collection is missing in the code. But the idea should be clear.)
Use:
List<Item> result = map.get("Computer")
The the general purpose HashMap implementation is not the most efficient in this case. When you start getting memory problems you can look into a tree implementation that is more efficient (like radix trees - implementation).
The next step could be to use some (in-memory) database. These could be relational (HSQL) or not (Berkeley DB).
If you have a small number of images and short descriptions (< 1000 characters), load them into an array and search for words using String.indexOf() (i.e. one entry in the array == one complete image description). This is efficient enough for, say, less than 10'000 images.
Use toLowerCase() to fold the case of the characters (so users will find "Computer" when they type "computer"). String.indexOf() will also work for short words (using "comp" to find "Computer" or "compare").
If you have lots of images and long descriptions and/or you want to give your users some comforts for the search (like Google does), then use Lucene.
There is no simple, easy-to-use data structure that supports efficient fulltext search.
But do you actually need efficiency? Is this a desktop app or a web app? In the former case, don't worry about efficiency, a modern CPU can search through megabytes of text in fractions of a second - simply look through all your descriptions using String.contains() (or a regexp to allow more flexible searches).
If you really need efficiency (such as for a webapp where many people could do searches at the same time), look into Apache Lucene.
As for your ArrayLists, it seems strange to use one for the description of a single image. Why a list, what does the index represent? Lines? If so, and unless you actually need to access lines directly, replace the lists with a simple String - it can contain newline characters just fine.
I would suggest you to use the Hashtable class or to organize your content into a tree to optimize searching.
I have a database of companies. My application receives data that references a company by name, but the name may not exactly match the value in the database. I need to match the incoming data to the company it refers to.
For instance, my database might contain a company with name "A. B. Widgets & Co Ltd." while my incoming data might reference "AB Widgets Limited", "A.B. Widgets and Co", or "A B Widgets".
Some words in the company name (A B Widgets) are more important for matching than others (Co, Ltd, Inc, etc). It's important to avoid false matches.
The number of companies is small enough that I can maintain a map of their names in memory, ie. I have the option of using Java rather than SQL to find the right name.
How would you do this in Java?
You could standardize the formats as much as possible in your DB/map & input (i.e. convert to upper/lowercase), then use the Levenshtein (edit) distance metric from dynamic programming to score the input against all your known names.
You could then have the user confirm the match & if they don't like it, give them the option to enter that value into your list of known names (on second thought--that might be too much power to give a user...)
Although this thread is a bit old, I recently did an investigation on the efficiency of string distance metrics for name matching and came across this library:
https://code.google.com/p/java-similarities/
If you don't want to spend ages on implementing string distance algorithms, I recommend to give it a try as the first step, there's a ~20 different algorithms already implemented (incl. Levenshtein, Jaro-Winkler, Monge-Elkan algorithms etc.) and its code is structured well enough that you don't have to understand the whole logic in-depth, but you can start using it in minutes.
(BTW, I'm not the author of the library, so kudos for its creators.)
You can use an LCS algorithm to score them.
I do this in my photo album to make it easy to email in photos and get them to fall into security categories properly.
LCS code
Example usage (guessing a category based on what people entered)
I'd do LCS ignoring spaces, punctuation, case, and variations on "co", "llc", "ltd", and so forth.
Have a look at Lucene. It's an open source full text search Java library with 'near match' capabilities.
Your database may suport the use of Regular Expressions (regex) - see below for some tutorials in Java - here's the link to the MySQL documentation (as an example):
http://dev.mysql.com/doc/refman/5.0/en/regexp.html#operator_regexp
You would probably want to store in the database a fairly complex regular express statement for each company that encompassed the variations in spelling that you might anticipate - or the sub-elements of the company name that you would like to weight as being significant.
You can also use the regex library in Java
JDK 1.4.2
http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html
JDK 1.5.0
http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Matcher.html
Using Regular Expressions in Java
http://www.regular-expressions.info/java.html
The Java Regex API Explained
http://www.sitepoint.com/article/java-regex-api-explained/
You might also want to see if your database supports Soundex capabilities (for example, see the following link to MySQL)
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
vote up 1 vote down
You can use an LCS algorithm to score them.
I do this in my photo album to make it easy to email in photos and get them to fall into security categories properly.
* LCS code
* Example usage (guessing a category based on what people entered)
to be more precise, better than Least Common Subsequence, Least Common Substring should be more precise as the order of characters is important.
You could use Lucene to index your database, then query the Lucene index. There are a number of search engines built on top of Lucene, including Solr.