I have a varchar2 typed column1 in my database table .
I read this column from java
java.sql.ResultSet r = s.executeQuery("Select * from table1");
InputStream is = r.getBinaryStream("column1");
I do something after this code. But I could not read whole value
Below text is my row.
"Called Latent Semantic Indexing because of its ability to correlate semantically related terms that are latent in a collection of text, it was first applied to text at Bell Laboratories in the late 1980s. The method, also called latent semantic analysis (LSA), uncovers the underlying latent semantic structure in the usage of words in a body of text and how it can be used to extract the meaning of the text in response to user queries, commonly referred to as concept searches. Queries, or concept searches, against a set of documents that have undergone LSI will return results that are conceptually similar in meaning to the search criteria even if the results don’t share a specific word or words with the search criteria."
But I could read only this part of it
Called Latent Semantic Indexing because of its ability to correlate semantically related terms
Why I could not read whole of it ?
Apparently you're not fully reading the InputStream. A common beginner's mistake is assuming that InputStream#available() returns the length of the stream and then only that amount of bytes is been read. This is not correct. You need to read it fully until InputStream#read() method returns -1. See also the Java IO tutorial. Another possible cause is that the text contains newlines and you're using BufferedReader#readLine() to read it and it is been called only once. This is also not correct. You need to call it in a loop until it returns null.
But as it's a varchar field, why don't you just use ResultSet#getString()?
String column1 = r.getString("column1");
Related
I need to make my Solr-based search return results if all of the search keywords appear anywhere in any of the search fields.
The current situation:
an example search query: keywords: "berlin house john" name: "berlin house john" name" author: "berlin house john" name"
Let's suppose that there is only one result, where keywords="house", name="berlin", and author="john" and there is no other possible permutation of these three words.
if the defaultOperator is OR, Solr returns a simple OR-ing of every keyword in every field, which is an enormous list, where of course, the best matching result is at the first position, but the next results have very little relevance (perhaps only one field matching), and they simply confuse the user.
On another hand, if i switch the default operator to AND, I get absolutely no results. I guess it is trying to find a perfect match for all three words, in all three fields, which of course, does not exist.
The search terms come to the application from a search input, in which, the user writes free text - there are no specific language conventions (hashtags or something).
I know that what I am asking about is possible because I have done it before with pure Lucene, and it worked. What am I doing wrong?
If you just need to make sure, all words appear in all fields I would suggest copying all relevant fields into one field at index time and query this one instead. To do so, you need to introduce a new field and then use copyField for all sourcefields you want to copy over. To copy all fields, use:
<copyField source="*" dest="text"/>
See http://wiki.apache.org/solr/SchemaXml#Copy_Fields for details.
An similar approach would be to use boolean algebra at query time. This is a bit different from the above solution.
Your query should look like
(keywords:"berlin" OR keywords:"house" OR keywords:"john") AND
(name:"berlin" OR name:"house" OR name:"john") AND
(author:"berlin" OR author:"house" OR author:"john")
which basically states: one or more terms must match in keyword and one or more terms must match in name and one or more terms must match in author.
From Solr 4, defaultOperator is deprecated. Please don't use it.
Also as for me defaultOperator works same as specified operator in query. I can't said why it is, its just my experience.
Please try query with param {!q.op=AND}
I guess you use default query parser, fix me if I am wrong
Using the DatastoreService how can I do queries for String containing some string similar to Java String:
contains
startsWith
endsWith
When querying against a String property, exact matches are the easiest, since that behavior works "out of the box".
"startsWith" queries can be done fairly easily by turning property startsWith: abc into property >= 'abc' and property < 'abd', where you calculate the end of the range.
"endsWith" can be done by storing a reversed copy of the String, and creating a query as above, but with the target reversed. I.e., property endsWith: 'abc' becomes `propertyReversed >= 'cba' and propertyReversed < 'cbb'.
"contains" is a large challenge. There are several approaches, and the right one for your situation depends on your situation. If the string is relatively short (e.g., a name of an address), you could store list of trailing substrings, matching against them with a range query as above.
As Dave mentions in his answer, contains is not available as a Datastore primitive. If you're looking for containment queries, the Search API is a good place to look (note: it's still in experimental).
I have one List in C#.This String array contains elements of Paragraph that are read from the Ms-Word File.for example,
list 0-> The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Finally the image displayed in the header will be added to finalize the report.
list 1->The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Various other elements of WordprocessingML will also be handled. By moving the formatting information into styles a higher degree of re-use is made possible. The document will be marked using custom XML tags and the insertion of other advanced elements such as a table of contents is discussed. But before all the advanced features can be added, the base of the document needs to be built.
Some thing like that.
Now My search String is :
The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Before going over all the elements which make up the sample documents a basic document structure needs to be laid out. When you take a WordprocessingML document and use the Windows Explorer shell to rename the docx extension to zip you will find many different elements, especially in larger documents.
I want to check my search String with that list elements.
my criteria is "If each list element contains 85% match or exact match of search string then we want to retrieve that list elements.
In our case,
list 0 -> more satisfies my search string.
list 1 -it also matches some text,but i think below not equal to my criteria...
How i do this kind of criteria based search on String...?
I have more confusion on my problem also
Welcome your ideas and thoughts...
The keyword is DISTANCE or "string distance". and also, "Paragraph similarity"
You seek to implement a function which would express as a scalar, say a percentage as suggested in the question, indicative of how similar a string is from another string.
Plain string distance functions such as hamming or Levenstein may not be appropriate, for they work at character level rather than at word level, but generally these algorithms convey the idea of what is needed.
Working at word level you'll probably also want to take into account some common NLP features, for example ignore (or give less weight to) very common words (such as 'the', 'in', 'of' etc.) and maybe allow for some forms of stemming. The order of the words, or for the least their proximity may also be of import.
One key factor to remember is that even with relatively short strings, many distances functions can be quite expensive, computationally speaking. Before selecting one particular algorithm you'll need to get an idea of the general parameters of the problem:
how many strings would have to be compared? (on average, maximum)
how many words/token do the string contain? (on average, max)
Is it possible to introduce a simple (quick) filter to reduce the number of strings to be compared ?
how fancy do we need to get with linguistic features ?
is it possible to pre-process the strings ?
Are all the records in a single language ?
Comparing Methods for Single Paragraph Similarity Analysis, a scholarly paper provides a survey of relevant techniques and considerations.
In a nutshell, the the amount of design-time and run-time one can apply this relatively open problem varies greatly and is typically a compromise between the level of precision desired vs. the run-time resources and the overall complexity of the solution which may be acceptable.
In its simplest form, when the order of the words matters little, computing the sum of factors based on the TF-IDF values of the words which match may be a very acceptable solution.
Fancier solutions may introduce a pipeline of processes borrowed from NLP, for example Part-of-Speech Tagging (say for the purpose of avoiding false positive such as "SAW" as a noun (to cut wood), and "SAW" as the past tense of the verb "to see". or more likely to filter outright some of the words based on their grammatical function), stemming and possibly semantic substitutions, concept extraction or latent semantic analysis.
You may want to look into lucene for Java or lucene.net for c#. I don't think it'll do the percentage requirement you want out of the box, but it's a great tool for doing text matching.
You maybe could run a separate query for each word, and then work out the percentage yourself of ones that matched.
Here's an idea (and not a solution by any means but something to get started with)
private IEnumerable<string> SearchList = GetAllItems(); // load your list
void Search(string searchPara)
{
char[] delimiters = new char[]{' ','.',','};
var wordsInSearchPara = searchPara.Split(delimiters, StringSplitOptions.RemoveEmptyEntries).Select(a=>a.ToLower()).OrderBy(a => a);
foreach (var item in SearchList)
{
var wordsInItem = item.Split(delimiters, StringSplitOptions.RemoveEmptyEntries).Select(a => a.ToLower()).OrderBy(a => a);
var common = wordsInItem.Intersect(wordsInSearchPara);
// now that you know the common items, you can get the differential
}
}
I am going through a final refinement posted by the client, which needs me to do a case-insensitive query. I will basically walk through how this simple program works.
First of all, in my Java class, I did a fairly simple webpage parsing:
title=(String)results.get("title");
doc = docBuilder.parse("http://" + server + ":" + port + "/exist/rest/db/wb/xql/media_lookup.xql?" + "&title=" + title);
This Java statement references an XQuery file "media_lookup.xql" which is stored on localhost, and the only parameter we are passing is the string "title".
Secondly, let's take at look at that XQuery file:
$title := request:get-parameter('title',""),
$mediaNodes := doc('/db/wb/portfolio/media_data.xml'),
$query := $mediaNodes//media[contains(title,$title)],
Then it will evaluate that query. This XQuery will get the "title" parameter that are passes from our Java class, and query the "media_data" xml file stored in the database, which contains a bunch of media nodes with a 'title' element node. As you may expect, this simple query will just match those media nodes whose 'title' element contains a substring of what the value of string 'title' is. So if our 'title' is "Chi", it will return media nodes whose title may be "Chicago" or "Chicken".
The refinement request posted by the client is that there should be NO case-sensitivity. The very intuitive way is to modify the XQuery statement by using a lower-case function in it, like:
$query := $mediaNodes//media[contains(lower-case(title/text(),lower-case($title))],
However, the question comes: this modified query will run my machine into memory overflow. Since my "media_data.xml" is quite huge and contains thouands of millions of media nodes,
I assume the lower-case() function will run on each of the entries, thus causing the machine to crash.
I've talked with some experienced XQuery programmer, and they think I should use an index to solve this problem, and I will definitely research into that. But before that, I am just posting this problem here to get other ideas or any suggestions, do you think any other way may help? for example, could I tweak the Java parse statement to realize the case-insensitivity? Since I think I saw some people did some string concatenation by using "contains." in Java before passing it to the server.
Any idea or help is welcomed.
The refinement request posted by the
client is that there should be NO
case-sensitivity. The very intuitive
way is to modify the XQuery statement
by using a lower-case function in it,
like:
$query := $mediaNodes//media
[contains(lower-case(title/text(),lower-case($title))],
However, the question comes: this
modified query will run my machine
into memory overflow. Since my
"media_data.xml" is quite huge and
contains thousands of millions of media
nodes, I assume the lower-case()
function will run on each of the
entries, thus causing the machine to
crash.
Such fears are not justified.
Any sane implementation of XPath uses automatic memory for its functions. This means that the memory required for evaluating a particular predicate, including the result of lower-case() becomes freed (in languages with no garbage collection) or unreferenced and ready for garbage collection immediately after the evaluation of the predicate.
A table index probably is not the solution as absebse of an index will slow things down, but not trigger a memory overflow.
I think your best bet is to duplicate the title in your database copying it into an all-lowercase (or uppercase with makes more clear that it was converted) and query the alternate title while presenting the normal title.
To save some processing to you can do the case coversion of $product before the query.
You can drop the ampersand in your URL, I'm not sure all webservers parse the ?& correctly.
what is the best solution in terms of performance and "readability/good coding style" to represent a (Java) Enumeration (fixed set of constants) on the DB layer in regard to an integer (or any number datatype in general) vs a string representation.
Caveat: There are some database systems that support "Enums" directly but this would require to keept the Database Enum-Definition in sync with the Business-Layer-implementation. Furthermore this kind of datatype might not be available on all Database systems and as well might differ in the syntax => I am looking for an easy solution that is easy to mange and available on all database systems. (So my question only adresses the Number vs String representation.)
The Number representation of a constants seems to me very efficient to store (for example consumes only two bytes as integer) and is most likely very fast in terms of indexing, but hard to read ("0" vs. "1" etc)..
The String representation is more readable (storing "enabled" and "disabled" compared to a "0" and "1" ), but consumes much mor storage space and is most likely also slower in regard to indexing.
My questions is, did I miss some important aspects? What would you suggest to use for an enum representation on the Database layer.
Thank you very much!
In most cases, I prefer to use a short alphanumeric code, and then have a lookup table with the expanded text. When necessary I build the enum table in the program dynamically from the database table.
For example, suppose we have a field that is supposed to contain, say, transaction type, and the possible values are Sale, Return, Service, and Layaway. I'd create a transaction type table with code and description, make the codes maybe "SA", "RE", "SV", and "LY", and use the code field as the primary key. Then in each transaction record I'd post that code. This takes less space than an integer key in the record itself and in the index. Exactly how it is processed depends on the database engine but it shouldn't be dramatically less efficient than an integer key. And because it's mnemonic it's very easy to use. You can dump a record and easily see what the values are and likely remember which is which. You can display the codes without translation in user output and the users can make sense of them. Indeed, this can give you a performance gain over integer keys: In many cases the abbreviation is good for the users -- they often want abbreviations to keep displays compact and avoid scrolling -- so you don't need to join on the transaction table to get a translation.
I would definitely NOT store a long text value in every record. Like in this example, I would not want to dispense with the transaction table and store "Layaway". Not only is this inefficient, but it is quite possible that someday the users will say that they want it changed to "Layaway sale", or even some subtle difference like "Lay-away". Then you not only have to update every record in the database, but you have to search through the program for every place this text occurs and change it. Also, the longer the text, the more likely that somewhere along the line a programmer will mis-spell it and create obscure bugs.
Also, having a transaction type table provides a convenient place to store additional information about the transaction type. Never ever ever write code that says "if whatevercode='A' or whatevercode='C' or whatevercode='X' then ..." Whatever it is that makes those three codes somehow different from all other codes, put a field for it in the transaction table and test that field. If you say, "Well, those are all the tax-related codes" or whatever, then fine, create a field called "tax_related" and set it to true or false for each code value as appropriate. Otherwise when someone creates a new transaction type, they have to look through all those if/or lists and figure out which ones this type should be added to and which it shouldn't. I've read plenty of baffling programs where I had to figure out why some logic applied to these three code values but not others, and when you think a fourth value ought to be included in the list, it's very hard to tell whether it is missing because it is really different in some way, or if the programmer made a mistake.
The only type I don't create the translation table is when the list is very short, there is no additional data to keep, and it is clear from the nature of the universe that it is unlikely to ever change so the values can be safely hard-coded. Like true/false or positive/negative/zero or male/female. (And hey, even that last one, obvious as it seems, there are people insisting we now include "transgendered" and the like.)
Some people dogmatically insist that every table have an auto-generated sequential integer key. Such keys are an excellent choice in many cases, but for code lists, I prefer the short alpha key for the reasons stated above.
I would store the string representation, as this is easy to correlate back to the enum and much more stable. Using ordinal() would be bad because it can change if you add a new enum to the middle of the series, so you would have to implement your own numbering system.
In terms of performance, it all depends on what the enums would be used for, but it is most likely a premature optimization to develop a whole separate representation with conversion rather than just use the natural String representation.