I am working through a final refinement posted by the client, which requires a case-insensitive query. I will briefly walk through how this simple program works.
First, in my Java class, I do a fairly simple parse of a web resource:
title=(String)results.get("title");
doc = docBuilder.parse("http://" + server + ":" + port + "/exist/rest/db/wb/xql/media_lookup.xql?" + "&title=" + title);
This Java statement references an XQuery file, "media_lookup.xql", which is stored on localhost; the only parameter we pass is the string "title".
Second, let's take a look at that XQuery file:
$title := request:get-parameter('title',""),
$mediaNodes := doc('/db/wb/portfolio/media_data.xml'),
$query := $mediaNodes//media[contains(title,$title)],
Then it evaluates that query. This XQuery gets the "title" parameter that is passed from our Java class and queries the "media_data" XML file stored in the database, which contains a bunch of media nodes, each with a 'title' element node. As you may expect, this simple query matches those media nodes whose 'title' element contains the value of the string 'title' as a substring. So if our 'title' is "Chi", it will return media nodes whose title may be "Chicago" or "Chicken".
The refinement request posted by the client is that there should be NO case-sensitivity. The intuitive way is to modify the XQuery statement by using a lower-case function in it, like:
$query := $mediaNodes//media[contains(lower-case(title/text()), lower-case($title))],
However, a problem arises: this modified query runs my machine into a memory overflow, since my "media_data.xml" is quite huge and contains thousands of millions of media nodes.
I assume the lower-case() function runs on each of the entries, thus causing the machine to crash.
I've talked with some experienced XQuery programmers, and they think I should use an index to solve this problem, which I will definitely research. But before that, I am posting this problem here to get other ideas or suggestions: do you think any other way may help? For example, could I tweak the Java parse statement to achieve the case-insensitivity? I think I have seen people do some string concatenation using "contains." in Java before passing the query to the server.
Any idea or help is welcome.
Such fears are not justified.
Any sane implementation of XPath uses automatic memory for its functions. This means that the memory required for evaluating a particular predicate, including the result of lower-case(), is freed (in languages with no garbage collection) or becomes unreferenced and ready for garbage collection immediately after the evaluation of the predicate.
A table index is probably not the solution, as the absence of an index will slow things down but not trigger a memory overflow.
I think your best bet is to duplicate the title in your database, copying it into an all-lowercase (or all-uppercase, which makes it clearer that it was converted) field, and to query the alternate title while presenting the normal title.
To save some processing, you can do the case conversion of $title before the query.
You can drop the ampersand in your URL; I'm not sure all webservers parse the ?& correctly.
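For example, a minimal sketch combining both suggestions on the Java side, reusing the variables from the question's snippet (the URLEncoder call is my addition, not in the original code):
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Lower-case the title in Java so the XQuery only has to compare against the
// pre-lowercased copy of the title, and drop the stray "&" after "?".
String title = (String) results.get("title");
String encoded = URLEncoder.encode(title.toLowerCase(), StandardCharsets.UTF_8);
doc = docBuilder.parse("http://" + server + ":" + port
        + "/exist/rest/db/wb/xql/media_lookup.xql?title=" + encoded);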
I don't have much code to post, and I'm quite confused about where to start. There's a lot of documentation online, and I can't seem to find what I'm looking for.
Suppose I have this query result saved into a StatementResult variable:
result = session.run("MATCH (n:person {tag1: 'Person1'})"
+ "RETURN [(n)-->(b) WHERE b:type1 | m.tag2]")
In the Neo4j browser, this returns a list of exactly what I'm looking for. My question is how we can access this in Java. I know how to access single values, but not a list of this type.
Any help would be appreciated.
Thanks.
Usually you just iterate over the statement result to access each record, and within each record you can access each named column. You didn't use any names.
The columns return Value objects, which you can then turn into the types you expect; in your case, into a list with asList().
See the API docs for StatementResult and Value.asList()
Also, your statement is not correct: you probably meant b where you wrote m, and you need to name your column (with AS) in order to access it.
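Put together, a minimal sketch (assuming the 1.x driver API implied by StatementResult; the AS tags alias and the b.tag2 fix are additions):
import java.util.List;
import org.neo4j.driver.v1.Record;
import org.neo4j.driver.v1.StatementResult;

// Assumes an open `session`, as in the question.
StatementResult result = session.run(
        "MATCH (n:person {tag1: 'Person1'}) "
      + "RETURN [(n)-->(b) WHERE b:type1 | b.tag2] AS tags");

while (result.hasNext()) {
    Record record = result.next();
    // asList() turns the list Value into a java.util.List
    List<Object> tags = record.get("tags").asList();
    System.out.println(tags);
}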
I am looking for help in replicating Java's String.hashCode() function in SQL. It is computed as:
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
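Here s[i] is the i-th character value and n is the length of the string; for example, "Hi".hashCode() works out to 'H'*31 + 'i' = 72*31 + 105 = 2337.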
I could do it easily with a Java program, but I have a lot of CLOBs to process, and I am thinking (perhaps incorrectly?) that the update would run faster on the server, without the network overhead. Does anybody have such a function?
Some requirements are that:
It be in SQL or Oracle's PL/SQL since I am doing this on Oracle (sadly)
It work on CLOBs, not just varchars
It can handle large CLOBs (>4K)
Also, it doesn't have to use Java's hashCode(); it can use a different hashing algorithm, like an MD5 sum, if that is easier. I will need to update about a million records and will be using the hash to indicate whether the source document (or conversion process) results in a changed document.
I discovered ora_hash() in Oracle, but it appears to consider only the first 4K of the document. Instead I am using:
CREATE OR REPLACE FUNCTION get_md5sum_clob_fn (i_clob IN CLOB)
RETURN RAW
IS
BEGIN
  -- DBMS_CRYPTO.HASH accepts a CLOB source directly and returns the raw digest
  RETURN DBMS_CRYPTO.HASH(src => i_clob, typ => DBMS_CRYPTO.HASH_MD5);
END;
I also only pass this function non-null CLOBs; otherwise there will be an error. It suffers from a few limitations:
I haven't tested whether it looks past 4K
It is Oracle specific.
So I am not accepting my own answer yet. I'd love to see a database independent solution.
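Not a server-side alternative, but for validating whatever port you end up with, here is a sketch of the same rolling hash in Java, streamed so CLOBs far larger than 4K are fine (class and method names are illustrative):
import java.io.IOException;
import java.io.Reader;
import java.sql.Clob;

public class ClobHash {
    // h = 31*h + c is exactly String.hashCode(); Java's int arithmetic
    // overflows the same way the String implementation does.
    static int hash(Reader reader) throws IOException {
        int h = 0;
        int c;
        while ((c = reader.read()) != -1) {
            h = 31 * h + c;
        }
        return h;
    }

    // Usage against a JDBC CLOB:
    static int hashClob(Clob clob) throws Exception {
        try (Reader r = clob.getCharacterStream()) {
            return hash(r);
        }
    }
}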
I was wondering if there is any way to interpret Lucene queries in simple terms.
For example :
Example # 1:
Input Query - name:John
Output - Interpreted as: Find all entries where attribute "name" is equal to "John".
Example # 2:
Input Query - name:John AND phoneNumber:1234
Output - Interpreted as: Find all entries where attribute "name" is equal to "John" and attribute "phoneNumber" is equal to "1234".
Any tutorials in this regard would be helpful.
Thanks
The Lucene documentation does a pretty decent job in explaining basic queries and their interpretation. It seems as though that's all you're looking for; once you get into some of the more advanced query types, it gets hairy, but the documentation should always be your first stop; it's fairly comprehensive.
Edit: Ah, you want automated query explanation. I don't know of any that currently exist; I think you'll have to write your own, but if you're starting with standard QueryParser Syntax, I think the best input for your interpreter would be the output of QueryParser.parse(). That breaks down the free text into Lucene query objects that shouldn't be too difficult to wrap in a utility function that outputs a plain-English string for each one.
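As a rough starting point, a sketch of that idea (only TermQuery and BooleanQuery are handled here; other query types fall back to Lucene's own toString(), and note that the analyzer will lower-case the terms):
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryExplainer {
    static String explain(Query q) {
        if (q instanceof TermQuery) {
            TermQuery tq = (TermQuery) q;
            return "attribute \"" + tq.getTerm().field()
                    + "\" is equal to \"" + tq.getTerm().text() + "\"";
        }
        if (q instanceof BooleanQuery) {
            // Join sub-clauses with "and"/"or" based on whether they are required
            StringBuilder sb = new StringBuilder();
            for (BooleanClause clause : (BooleanQuery) q) {
                if (sb.length() > 0) {
                    sb.append(clause.isRequired() ? " and " : " or ");
                }
                sb.append(explain(clause.getQuery()));
            }
            return sb.toString();
        }
        return q.toString(); // fallback for phrase, range, wildcard, ...
    }

    public static void main(String[] args) throws Exception {
        Query q = new QueryParser("name", new StandardAnalyzer())
                .parse("name:John AND phoneNumber:1234");
        System.out.println("Find all entries where " + explain(q));
    }
}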
I need to make my Solr-based search return results if all of the search keywords appear anywhere in any of the search fields.
The current situation:
an example search query: keywords:"berlin house john" name:"berlin house john" author:"berlin house john"
Let's suppose that there is only one result, where keywords="house", name="berlin", and author="john" and there is no other possible permutation of these three words.
If the defaultOperator is OR, Solr returns a simple OR-ing of every keyword in every field, which is an enormous list. Of course, the best-matching result is in the first position, but the following results have very little relevance (perhaps only one field matching), and they simply confuse the user.
On the other hand, if I switch the default operator to AND, I get absolutely no results. I guess it is trying to find a perfect match for all three words in all three fields, which, of course, does not exist.
The search terms come to the application from a search input in which the user writes free text; there are no specific language conventions (hashtags or anything similar).
I know that what I am asking about is possible because I have done it before with pure Lucene, and it worked. What am I doing wrong?
If you just need to make sure all words appear somewhere across the fields, I would suggest copying all relevant fields into one field at index time and querying that one instead. To do so, you need to introduce a new field and then use copyField for all source fields you want to copy over. To copy all fields, use:
<copyField source="*" dest="text"/>
See http://wiki.apache.org/solr/SchemaXml#Copy_Fields for details.
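Note that the dest field has to exist in the schema; a minimal sketch with illustrative field and type names:
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="keywords" dest="text"/>
<copyField source="name" dest="text"/>
<copyField source="author" dest="text"/>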
A similar approach would be to use boolean algebra at query time. This is a bit different from the above solution.
Your query should look like:
(keywords:"berlin" OR keywords:"house" OR keywords:"john") AND
(name:"berlin" OR name:"house" OR name:"john") AND
(author:"berlin" OR author:"house" OR author:"john")
which basically states: one or more terms must match in keywords, and one or more terms must match in name, and one or more terms must match in author.
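If you assemble that query string in the application, a minimal sketch (field names taken from the question; class and method names are illustrative):
public class SolrQueryBuilder {
    // Build per-field OR groups joined by AND from free-text input.
    static String buildQuery(String input) {
        String[] fields = {"keywords", "name", "author"};
        String[] terms = input.trim().split("\\s+");
        StringBuilder query = new StringBuilder();
        for (int f = 0; f < fields.length; f++) {
            if (f > 0) query.append(" AND ");
            query.append('(');
            for (int t = 0; t < terms.length; t++) {
                if (t > 0) query.append(" OR ");
                query.append(fields[f]).append(":\"").append(terms[t]).append('"');
            }
            query.append(')');
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // Prints the three AND-ed OR groups shown above
        System.out.println(buildQuery("berlin house john"));
    }
}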
From Solr 4, defaultOperator is deprecated. Please don't use it.
Also, in my experience, defaultOperator works the same as an operator specified in the query; I can't say why, that is just what I have observed.
Please try the query with the param {!q.op=AND}.
I guess you are using the default query parser; correct me if I am wrong.
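For example, the whole query would then be something like q={!q.op=AND}berlin house john (assuming the standard query parser).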
I have a varchar2-typed column, column1, in my database table.
I read this column from Java:
java.sql.ResultSet r = s.executeQuery("Select * from table1");
InputStream is = r.getBinaryStream("column1");
I do something after this code, but I could not read the whole value.
The text below is my row.
"Called Latent Semantic Indexing because of its ability to correlate semantically related terms that are latent in a collection of text, it was first applied to text at Bell Laboratories in the late 1980s. The method, also called latent semantic analysis (LSA), uncovers the underlying latent semantic structure in the usage of words in a body of text and how it can be used to extract the meaning of the text in response to user queries, commonly referred to as concept searches. Queries, or concept searches, against a set of documents that have undergone LSI will return results that are conceptually similar in meaning to the search criteria even if the results don’t share a specific word or words with the search criteria."
But I could read only this part of it:
Called Latent Semantic Indexing because of its ability to correlate semantically related terms
Why could I not read the whole of it?
Apparently you're not fully reading the InputStream. A common beginner's mistake is assuming that InputStream#available() returns the length of the stream and then reading only that number of bytes. This is not correct. You need to read it fully until InputStream#read() returns -1. See also the Java IO tutorial. Another possible cause is that the text contains newlines and you're using BufferedReader#readLine() to read it, calling it only once. This is also not correct. You need to call it in a loop until it returns null.
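For example, a minimal sketch of reading the stream fully (the UTF-8 charset is an assumption and must match your database encoding):
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamUtil {
    // Keep reading until read() returns -1 instead of trusting available()
    static String readFully(InputStream is) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = is.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toString("UTF-8");
    }
}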
But as it's a varchar field, why don't you just use ResultSet#getString()?
String column1 = r.getString("column1");