How to improve performance of SMO classifier in weka? - java

I am using the Weka SMO classifier to classify documents. There are many parameters available for SMO, such as the kernel, tolerance, etc. I tested different parameter settings, but I do not get good results on a large data set.
With more than 90 categories, only 20% of the documents are classified correctly.
Can anyone tell me the best set of parameters to get the highest performance from SMO?

The principal issue here is not the classification itself, but rather selecting suitable features. Using raw HTML introduces a lot of noise, which in turn makes classification results very poor. Thus, to get good results, do the following:
Extract the relevant text. Do not just remove HTML tags, but get exactly the text describing the item.
Create a dictionary of key words, e.g. cappuccino, latte, white rice, etc.
Use stemming or lemmatization to get each word's base form and avoid counting, for example, "cotton" and "cottons" as 2 different words.
Make feature vectors from the text. The attributes (feature names) should be all words from your dictionary. The values may be: binary (1 if the word occurs in the text, 0 otherwise), integer (the number of occurrences of the word in the text), tf-idf (use this one if your texts have very different lengths), and others.
Only after all these steps can you use a classifier.
Most probably the classifier type won't play a big role here: dictionary-based features normally lead to quite accurate results regardless of the classification technique in use. You can use SVM (SMO), Naive Bayes, ANN or even kNN. More sophisticated methods include creating a category hierarchy where, for example, the category "coffee" is included in the category "drinks", which in turn is part of the category "food".
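The feature-vector step described above can be sketched in plain Java as follows. This is only an illustration: the class and method names are made up, the tokens are assumed to be already extracted and stemmed, and the tf-idf variant (raw count times ln(N/df)) is just one common formulation. Normalizing by text length is a frequent refinement that is left out here.

```java
import java.util.List;

// Hypothetical helper: turns a token list into dictionary-based feature vectors.
class DictionaryFeatures {

    // Integer features: number of occurrences of each dictionary word in the tokens.
    static int[] countVector(List<String> dictionary, List<String> tokens) {
        int[] v = new int[dictionary.size()];
        for (String t : tokens) {
            int i = dictionary.indexOf(t);
            if (i >= 0) {
                v[i]++;
            }
        }
        return v;
    }

    // Binary features: 1 if the dictionary word occurs in the tokens, 0 otherwise.
    static int[] binaryVector(List<String> dictionary, List<String> tokens) {
        int[] v = countVector(dictionary, tokens);
        for (int i = 0; i < v.length; i++) {
            v[i] = Math.min(v[i], 1);
        }
        return v;
    }

    // tf-idf features (assumed variant): count * ln(N / df), where df is the number
    // of corpus documents containing the word; words unseen in the corpus get 0.
    static double[] tfIdfVector(List<String> dictionary, List<String> tokens,
                                List<List<String>> corpus) {
        int[] tf = countVector(dictionary, tokens);
        double[] v = new double[tf.length];
        for (int i = 0; i < tf.length; i++) {
            int df = 0;
            for (List<String> doc : corpus) {
                if (doc.contains(dictionary.get(i))) {
                    df++;
                }
            }
            v[i] = df == 0 ? 0.0 : tf[i] * Math.log((double) corpus.size() / df);
        }
        return v;
    }
}
```

The resulting arrays would then be written out as Weka instances (one attribute per dictionary word) before training SMO or any other classifier.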

Related

Data retrieval / search in text

I am working on a personal project on data retrieval, for my own interest. I have one text file with the following format.
.I 1
.T
experimental investigation of the aerodynamics of a
wing in a slipstream . 1989
.A
brenckman,m.
.B
experimental investigation of the aerodynamics of a
wing in a slipstream .
.I 2
.T
simple shear flow past a flat plate in an incompressible fluid of small
viscosity .
.A
ting-yili
.B
some texts...
some more text....
.I 3
...
".I 1" indicate the beginning of chunk of text corresponding to doc ID1 and ".I 2" indicates the beginning of chunk of text corresponding to doc ID2.
I did:
split the docs and put them in separate files
delete stopwords (and, or, while, is, are, ...)
stem the words to get the root of each (achievement, achieve, achievable, ...all converted to achiv and so on)
and finally create a TreeMultiMap which looks like this:
{key: word} {Values are arraylist of docID and frequency of that word in that docID}
aerodynam [[Doc_00001,5],[Doc_01344,4],[Doc_00123,3]]
book [[Doc_00562,6],[Doc_01111,1]]
....
....
result [[Doc_00010,5]]
....
....
zzzz [[Doc_01235,1]]
Now my questions:
Suppose that a user is interested to know:
which documents contain achieving and book? (idea)
documents which contain achieving and skills but neither book nor video
documents which include Aerodynamic
and some other simple queries like this
(input) so suppose she enters
achieving AND book
(achieving AND skills) AND (NOT (book AND video))
Aerodynamic
.....and some other simple queries
(Output)
[Doc_00562,6],[Doc_01121,5],[Doc_01151,3],[Doc_00012,2],[Doc_00001,1]
....
As you can see, there might be
some precedence modifiers (parentheses, nested to an unknown depth)
precedence of AND, OR, NOT
and some other interesting challenges and issues
So, I would like to run the queries against the TreeMultimap, search in the words (keys) and return the values (lists of docs) to the user.
How should I think about this problem, and how should I design my solution? What articles or algorithms should I read? Any idea would be appreciated. (Thanks for reading this long post.)
The collection that you have used is the Cranfield test collection, which I believe has around 1,400 documents. For collections of this size it is fine to store the inverted list (the data structure that you have constructed) in memory with a hash-based or trie-based organization. For realistic collections of much larger size, often comprising millions of documents, you would find it difficult to store the inverted list entirely in memory.
Instead of reinventing the wheel, the practical solution is thus to make use of a standard text indexing (and retrieval) framework such as Lucene. This tutorial should help you to get started.
The questions that you seek to address can be answered by Boolean queries, where you specify a set of Boolean operators (AND, OR and NOT) between the constituent terms. Lucene supports this. Have a look at the API doc here and a related StackOverflow question here.
The Boolean query retrieval algorithm is very simple. The list elements (i.e. the document IDs) corresponding to each term are stored in sorted order, so that at run time the union and intersection can be computed in time linear in the sizes of the lists, i.e. O(n1+n2) (this is very similar to the merge step of mergesort).
You can find more information in this book chapter.
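As a rough illustration of the merge idea described above, here is a minimal sketch of the three Boolean operations on sorted lists of document IDs. The class and method names are invented for this example, and document IDs are assumed to be plain integers kept in ascending order without duplicates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: Boolean operations on sorted, duplicate-free posting lists.
class BooleanMerge {

    // a AND b: documents present in both lists.
    static List<Integer> and(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i), y = b.get(j);
            if (x == y) {
                out.add(x);
                i++;
                j++;
            } else if (x < y) {
                i++;
            } else {
                j++;
            }
        }
        return out;
    }

    // a OR b: documents present in at least one list.
    static List<Integer> or(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() || j < b.size()) {
            if (j == b.size() || (i < a.size() && a.get(i) < b.get(j))) {
                out.add(a.get(i++));
            } else if (i == a.size() || b.get(j) < a.get(i)) {
                out.add(b.get(j++));
            } else { // same ID in both lists: emit it once
                out.add(a.get(i));
                i++;
                j++;
            }
        }
        return out;
    }

    // a AND NOT b: documents in a that do not appear in b.
    static List<Integer> andNot(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int j = 0;
        for (int x : a) {
            while (j < b.size() && b.get(j) < x) {
                j++;
            }
            if (j == b.size() || b.get(j) != x) {
                out.add(x);
            }
        }
        return out;
    }
}
```

With these, a query such as (achieving AND skills) AND (NOT (book AND video)) becomes andNot(and(achieving, skills), and(book, video)) once each term has been looked up in the index. Parsing the query string with its parentheses and operator precedence is a separate step that is not shown here.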

Built-in libraries to perform effective searching on 100GB files

Is there any built-in library in Java for searching for strings in large files of about 100GB? I am currently using binary search, but it is not that efficient.
As far as I know Java does not contain any file search engine, with or without an index. There is a very good reason for that too: search engine implementations are intrinsically tied to both the input data set and the search pattern format. A minor variation in either could result in massive changes in the search engine.
For us to be able to provide a more concrete answer you need to:
Describe exactly the data set: the number, path structure and average size of files, the format of each entry and the format of each contained token.
Describe exactly your search patterns: are those fixed strings, glob patterns or, say, regular expressions? Do you expect the pattern to match a full line or a specific token in each line?
Describe exactly your desired search results: do you want exact or approximate matches? Do you want to get a position in a file, or extract specific tokens?
Describe exactly your requirements: are you able to build an index beforehand? Is the data set expected to be modified in real time?
Explain why you can't use third-party libraries such as Lucene that are designed exactly for this kind of work.
Explain why your current binary search, which should have a complexity of O(log n), is not efficient enough. The only thing that might be faster, with constant complexity, would involve the use of a hash table.
It might be best if you described your problem in broader terms. For example, one might assume from your sample data set that what you have is a set of words and associated offset or document identifier lists. A simple method to approach searching in such a set would be to store a word/file-position index in a hash table to be able to access each associated list in constant time.
If you don't want to use the tools built for search, then store the data in a database and use SQL.
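A minimal sketch of such a word/file-position index is shown below. It rests on several assumptions that are not in the question: each line of the file has the form word, a tab, then the associated list; the text is ASCII; and the set of distinct words is small enough for the map to fit in memory. The class name is hypothetical, and the unbuffered RandomAccessFile.readLine() used for the one-time scan would be slow on a real 100GB file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical index: maps each word to the byte offset of its line in the file.
class OffsetIndex {
    private final Path file;
    private final Map<String, Long> offsets = new HashMap<>();

    // Scans the file once and remembers where each word's line starts.
    OffsetIndex(Path file) {
        this.file = file;
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long pos = raf.getFilePointer();
            String line;
            while ((line = raf.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab > 0) {
                    offsets.put(line.substring(0, tab), pos);
                }
                pos = raf.getFilePointer();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hash lookup plus a single seek; returns the text after the tab, or null if absent.
    String lookup(String word) {
        Long pos = offsets.get(word);
        if (pos == null) {
            return null;
        }
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(pos);
            String line = raf.readLine();
            return line.substring(line.indexOf('\t') + 1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Each lookup costs one hash probe and one disk seek, independent of the file size. Whether this beats the existing binary search depends on details the question does not give.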

Is it possible to develop a criteria-based search on strings in C# or Java?

I have a List in C#. This list contains paragraphs that are read from an MS Word file. For example,
list 0-> The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Finally the image displayed in the header will be added to finalize the report.
list 1->The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Various other elements of WordprocessingML will also be handled. By moving the formatting information into styles a higher degree of re-use is made possible. The document will be marked using custom XML tags and the insertion of other advanced elements such as a table of contents is discussed. But before all the advanced features can be added, the base of the document needs to be built.
Something like that.
Now my search string is:
The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Before going over all the elements which make up the sample documents a basic document structure needs to be laid out. When you take a WordprocessingML document and use the Windows Explorer shell to rename the docx extension to zip you will find many different elements, especially in larger documents.
I want to check my search string against the list elements.
My criterion is: "If a list element is an exact match or at least an 85% match of the search string, then retrieve that list element."
In our case,
list 0 -> satisfies my search string better.
list 1 -> also matches some of the text, but I think it falls below my criterion.
How do I do this kind of criteria-based search on strings?
I am also still somewhat confused about my problem.
Your ideas and thoughts are welcome.
The keyword is DISTANCE or "string distance", and also "paragraph similarity".
You seek to implement a function which expresses as a scalar, say a percentage as suggested in the question, how similar one string is to another.
Plain string distance functions such as Hamming or Levenshtein may not be appropriate, for they work at character level rather than at word level, but generally these algorithms convey the idea of what is needed.
Working at word level, you'll probably also want to take into account some common NLP features, for example ignoring (or giving less weight to) very common words (such as 'the', 'in', 'of' etc.) and maybe allowing for some form of stemming. The order of the words, or at least their proximity, may also be important.
One key factor to remember is that even with relatively short strings, many distance functions can be quite expensive, computationally speaking. Before selecting one particular algorithm you'll need to get an idea of the general parameters of the problem:
how many strings would have to be compared? (on average, maximum)
how many words/tokens do the strings contain? (on average, maximum)
Is it possible to introduce a simple (quick) filter to reduce the number of strings to be compared ?
how fancy do we need to get with linguistic features ?
is it possible to pre-process the strings ?
Are all the records in a single language ?
Comparing Methods for Single Paragraph Similarity Analysis, a scholarly paper, provides a survey of relevant techniques and considerations.
In a nutshell, the amount of design-time and run-time effort one can apply to this relatively open problem varies greatly, and the choice is typically a compromise between the level of precision desired and the run-time resources and overall complexity of the solution that are acceptable.
In its simplest form, when the order of the words matters little, computing the sum of factors based on the TF-IDF values of the words which match may be a very acceptable solution.
Fancier solutions may introduce a pipeline of processes borrowed from NLP, for example part-of-speech tagging (say, to avoid false positives such as "saw" as a noun (a tool to cut wood) versus "saw" as the past tense of the verb "to see", or more likely to filter out some of the words outright based on their grammatical function), stemming, and possibly semantic substitutions, concept extraction or latent semantic analysis.
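To make the simplest form concrete, here is a rough Java sketch (the same idea carries over to C#) that scores a candidate paragraph by the IDF-weighted share of the search string's distinct words it contains. Everything here is an assumption made for illustration: the class and method names, the tokenization on non-alphanumeric characters, and the smoothed IDF formula ln((N+1)/(df+1)) + 1, which is only one of several common choices.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical scorer: IDF-weighted word overlap between a query and a candidate.
class ParagraphSimilarity {

    // Lower-cased distinct words of a text, split on anything that is not a-z or 0-9.
    static Set<String> words(String text) {
        Set<String> out = new HashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    // Returns a value in [0, 1]: weight of the query words found in the candidate,
    // divided by the weight of all query words. Rare words weigh more than common ones.
    static double score(String query, String candidate, List<String> corpus) {
        List<Set<String>> docs = new ArrayList<>();
        for (String p : corpus) {
            docs.add(words(p));
        }
        Set<String> cand = words(candidate);
        double matched = 0.0, total = 0.0;
        for (String w : words(query)) {
            int df = 0;
            for (Set<String> d : docs) {
                if (d.contains(w)) {
                    df++;
                }
            }
            double idf = Math.log((docs.size() + 1.0) / (df + 1.0)) + 1.0;
            total += idf;
            if (cand.contains(w)) {
                matched += idf;
            }
        }
        return total == 0.0 ? 0.0 : matched / total;
    }
}
```

A list element would then be retrieved when score(searchString, element, allElements) is at least 0.85. Word order and repeated words are ignored in this sketch, so it is a starting point rather than a full answer to the 85% criterion.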
You may want to look into Lucene for Java or Lucene.NET for C#. I don't think it will handle the percentage requirement out of the box, but it's a great tool for doing text matching.
You could perhaps run a separate query for each word, and then work out the percentage of words that matched yourself.
Here's an idea (and not a solution by any means but something to get started with)
private IEnumerable<string> SearchList = GetAllItems(); // load your list

void Search(string searchPara)
{
    char[] delimiters = new char[] { ' ', '.', ',' };
    // distinct lower-cased words of the search paragraph
    var wordsInSearchPara = searchPara.Split(delimiters, StringSplitOptions.RemoveEmptyEntries)
        .Select(a => a.ToLower()).Distinct().ToList();
    foreach (var item in SearchList)
    {
        var wordsInItem = item.Split(delimiters, StringSplitOptions.RemoveEmptyEntries)
            .Select(a => a.ToLower());
        // Intersect yields each common word once
        var common = wordsInItem.Intersect(wordsInSearchPara);
        // now that you know the common items, you can get the differential,
        // e.g. the share of the search paragraph's words found in this item
        double share = (double)common.Count() / wordsInSearchPara.Count;
    }
}

Data structure for search engine in JAVA?

I am an MCS 2nd-year student. I am doing a project in Java in which I have different images. To store the description of, say, IMAGE-1, I have an ArrayList named IMAGE-1; similarly, for IMAGE-2 an ArrayList named IMAGE-2, and so on.
Now I need to develop a search engine in which I need to find all images whose description matches a word entered in the search engine.
For example, if I enter "computer" then I should be able to find all images whose description contains "computer".
So my question is:
How should I do this efficiently?
How should I maintain all those ArrayLists, since I can have 100 of them? Or should I use another data structure instead of an ArrayList?
A simple implementation is to tokenize the description and use a Map<String, Collection<Item>> to store all items for a token.
Building:
for(String token: tokenize(description)) map.get(token).add(item)
(A collection is needed as multiple entries could be found for a token. The initialization of the collection is missing in the code. But the idea should be clear.)
Use:
Collection<Item> result = map.get("Computer");
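A slightly fuller sketch of the same idea, including the collection initialization that the snippet above leaves out, might look like this. The class name, the use of image IDs as plain strings, and the tokenization rule are assumptions made for the example; lower-casing the tokens is an addition so that "Computer" and "computer" hit the same entry.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory index from a word to the IDs of the images described with it.
class DescriptionIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    // Tokenizes the description and registers the image under every token.
    void add(String imageId, String description) {
        for (String token : description.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                // computeIfAbsent creates the set the first time a token is seen
                index.computeIfAbsent(token, k -> new LinkedHashSet<>()).add(imageId);
            }
        }
    }

    // Returns the IDs of all images whose description contains the word (case-insensitive).
    Set<String> search(String word) {
        return index.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}
```

For example, after add("IMAGE-1", "A desktop computer on a table") and add("IMAGE-2", "Computer mouse"), a call to search("computer") returns both IDs. Only whole words match here, so prefix searches such as "comp" would need a different structure.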
The general-purpose HashMap implementation is not the most efficient in this case. When you start getting memory problems you can look into a tree implementation that is more efficient (like radix trees - implementation).
The next step could be to use some (in-memory) database. These could be relational (HSQL) or not (Berkeley DB).
If you have a small number of images and short descriptions (< 1000 characters), load them into an array and search for words using String.indexOf() (i.e. one entry in the array == one complete image description). This is efficient enough for, say, fewer than 10,000 images.
Use toLowerCase() to fold the case of the characters (so users will find "Computer" when they type "computer"). String.indexOf() will also work for short words (using "comp" to find "Computer" or "compare").
If you have lots of images and long descriptions and/or you want to give your users some comforts for the search (like Google does), then use Lucene.
There is no simple, easy-to-use data structure that supports efficient fulltext search.
But do you actually need efficiency? Is this a desktop app or a web app? In the former case, don't worry about efficiency, a modern CPU can search through megabytes of text in fractions of a second - simply look through all your descriptions using String.contains() (or a regexp to allow more flexible searches).
If you really need efficiency (such as for a webapp where many people could do searches at the same time), look into Apache Lucene.
As for your ArrayLists, it seems strange to use one for the description of a single image. Why a list, what does the index represent? Lines? If so, and unless you actually need to access lines directly, replace the lists with a simple String - it can contain newline characters just fine.
I would suggest using the Hashtable class or organizing your content into a tree to optimize searching.

Matching inexact company names in Java

I have a database of companies. My application receives data that references a company by name, but the name may not exactly match the value in the database. I need to match the incoming data to the company it refers to.
For instance, my database might contain a company with name "A. B. Widgets & Co Ltd." while my incoming data might reference "AB Widgets Limited", "A.B. Widgets and Co", or "A B Widgets".
Some words in the company name (A B Widgets) are more important for matching than others (Co, Ltd, Inc, etc). It's important to avoid false matches.
The number of companies is small enough that I can maintain a map of their names in memory, i.e. I have the option of using Java rather than SQL to find the right name.
How would you do this in Java?
You could standardize the formats as much as possible in your DB/map & input (i.e. convert to upper/lowercase), then use the Levenshtein (edit) distance metric from dynamic programming to score the input against all your known names.
You could then have the user confirm the match & if they don't like it, give them the option to enter that value into your list of known names (on second thought--that might be too much power to give a user...)
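A minimal sketch of this approach follows. The normalization rules are assumptions chosen to fit the examples in the question: lower-casing, dropping punctuation and spaces, and removing a small hand-picked list of noise words such as "co" and "ltd". The class and method names are hypothetical, and the distance threshold would need tuning against real data to avoid false matches.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical matcher: normalizes company names, then compares them by edit distance.
class CompanyNames {
    // Assumed list of low-value words; extend it to suit the actual data.
    private static final Set<String> NOISE = new HashSet<>(
            Arrays.asList("co", "company", "ltd", "limited", "inc", "llc", "and"));

    // Lower-cases, splits on non-alphanumerics, drops noise words, joins without spaces.
    static String normalize(String name) {
        StringBuilder sb = new StringBuilder();
        for (String tok : name.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!tok.isEmpty() && !NOISE.contains(tok)) {
                sb.append(tok);
            }
        }
        return sb.toString();
    }

    // Classic dynamic-programming Levenshtein distance, keeping two rows of the table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    // Returns the known name closest to the input, or null if none is within maxDistance.
    static String bestMatch(String input, Iterable<String> known, int maxDistance) {
        String in = normalize(input);
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        for (String candidate : known) {
            int d = levenshtein(in, normalize(candidate));
            if (d <= maxDistance && d < bestDist) {
                best = candidate;
                bestDist = d;
            }
        }
        return best;
    }
}
```

With these rules, "A. B. Widgets & Co Ltd.", "AB Widgets Limited", "A.B. Widgets and Co" and "A B Widgets" all normalize to the same string, so they match with distance 0, and the edit distance only has to absorb genuine spelling differences.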
Although this thread is a bit old, I recently did an investigation on the efficiency of string distance metrics for name matching and came across this library:
https://code.google.com/p/java-similarities/
If you don't want to spend ages implementing string distance algorithms, I recommend giving it a try as a first step. About 20 different algorithms are already implemented (including Levenshtein, Jaro-Winkler, Monge-Elkan, etc.), and the code is structured well enough that you don't have to understand the whole logic in depth; you can start using it in minutes.
(BTW, I'm not the author of the library, so kudos to its creators.)
You can use an LCS algorithm to score them.
I do this in my photo album to make it easy to email in photos and get them to fall into security categories properly.
LCS code
Example usage (guessing a category based on what people entered)
I'd do LCS ignoring spaces, punctuation, case, and variations on "co", "llc", "ltd", and so forth.
Have a look at Lucene. It's an open source full text search Java library with 'near match' capabilities.
Your database may support the use of regular expressions (regex). See below for some tutorials in Java; here's the link to the MySQL documentation (as an example):
http://dev.mysql.com/doc/refman/5.0/en/regexp.html#operator_regexp
You would probably want to store in the database a fairly complex regular expression for each company that encompasses the variations in spelling that you might anticipate, or the sub-elements of the company name that you would like to weight as being significant.
You can also use the regex library in Java
JDK 1.4.2
http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html
JDK 1.5.0
http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Matcher.html
Using Regular Expressions in Java
http://www.regular-expressions.info/java.html
The Java Regex API Explained
http://www.sitepoint.com/article/java-regex-api-explained/
You might also want to see if your database supports Soundex capabilities (for example, see the following link to MySQL)
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
You can use an LCS algorithm to score them.
I do this in my photo album to make it easy to email in photos and get them to fall into security categories properly.
* LCS code
* Example usage (guessing a category based on what people entered)
To be more precise: rather than the Longest Common Subsequence, the Longest Common Substring should work better, as the order of characters is important.
You could use Lucene to index your database, then query the Lucene index. There are a number of search engines built on top of Lucene, including Solr.
