Binary search on a multidimensional String array - Java

Actually, I'm working on my homework assignment, and I'm really stuck.
I need to learn Java the right way. My teacher hasn't taught us binary search with Strings, so I ended up spending at least a few hours researching the topic.
I need a simple explanation and some code.
For example:
String[][] data = {{"John abc", "123"}, {"Nike cbd", "321"}};
I need to search for 'John' and get the output 'John abc, 123'.
Can somebody suggest some guidance on the principles of binary search?

Strings can be sorted and compared just like numbers, using alphabetic (lexicographic) string comparison. Assuming only English for simplicity, "ABD" is bigger than "ABC", and so forth.
So any binary search example that you find for numbers will also work on strings, provided that the list you have is sorted. The idea is simple: cut the range of candidates in half on each iteration until you find the right one.
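As a minimal sketch of that idea, the loop below is the usual numeric binary search with the comparison swapped for String.compareTo. The class and method names (StringBinarySearch, search) are made up for illustration, and the array must already be sorted in the same order that compareTo defines.

```java
import java.util.Arrays;

class StringBinarySearch {
    /** Returns the index of key in the sorted array, or -1 if it is absent. */
    static int search(String[] sorted, String key) {
        int low = 0;
        int high = sorted.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;          // midpoint, safe against int overflow
            int cmp = sorted[mid].compareTo(key);  // lexicographic comparison
            if (cmp < 0) {
                low = mid + 1;                     // key can only be in the upper half
            } else if (cmp > 0) {
                high = mid - 1;                    // key can only be in the lower half
            } else {
                return mid;                        // exact match
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] names = {"Nike", "John", "Adam", "Zoe"};
        Arrays.sort(names);                        // binary search needs sorted input
        System.out.println(search(names, "John")); // prints 1
        System.out.println(search(names, "Bob"));  // prints -1
    }
}
```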

Arrays.binarySearch only works on one-dimensional arrays.
So you have to narrow your array down to one dimension and then call binarySearch() on it.
Example:
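import java.util.Arrays;

for (String[] oneDimension : multiDimension) {
    // binarySearch requires sorted input, so sort each row first
    Arrays.sort(oneDimension);
    // index of an exact match within this row, or a negative value if absent
    int index = Arrays.binarySearch(oneDimension, "search-field");
}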
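For the data in the question, sorting inside each row would mix the name and number columns, and an exact-match search would not find 'John' in "John abc". One possible alternative, sketched here under the assumption that the first column is the search key and that a prefix match is wanted, is to sort the rows by that column and binary-search for the first row that is not smaller than the prefix. RowPrefixSearch, sortByFirstColumn and findByPrefix are hypothetical names.

```java
import java.util.Arrays;
import java.util.Comparator;

class RowPrefixSearch {
    /** Sorts the rows by their first column so that binary search is valid. */
    static void sortByFirstColumn(String[][] rows) {
        Arrays.sort(rows, Comparator.comparing((String[] row) -> row[0]));
    }

    /** Returns the first row whose first column starts with prefix, or null. */
    static String[] findByPrefix(String[][] sortedRows, String prefix) {
        int low = 0;
        int high = sortedRows.length;              // exclusive upper bound
        // Lower-bound search: find the first row whose key is >= prefix.
        while (low < high) {
            int mid = (low + high) >>> 1;
            if (sortedRows[mid][0].compareTo(prefix) < 0) {
                low = mid + 1;
            } else {
                high = mid;
            }
        }
        // Keys that start with the prefix sort directly after it, so checking
        // the row at the lower bound is enough.
        if (low < sortedRows.length && sortedRows[low][0].startsWith(prefix)) {
            return sortedRows[low];
        }
        return null;
    }

    public static void main(String[] args) {
        String[][] data = {{"John abc", "123"}, {"Nike cbd", "321"}};
        sortByFirstColumn(data);
        String[] hit = findByPrefix(data, "John");
        System.out.println(hit == null ? "not found" : String.join(", ", hit)); // prints: John abc, 123
    }
}
```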

Related

Comparing two strings using known algorithms

I'm trying to compare two strings (product names) using some well-known algorithms, such as Levenshtein distance, and a library of string similarity metrics (I got the best results with the SmithWatermanGotoh algorithm).
Two strings are:
iPhone 3gs 32 GB black
Apple iPhone 3 gs 16GB black
Levenshtein works pretty badly on the whole string if some words are in a different order (which is expected from how the algorithm works), so I tried to implement a word-by-word comparison.
The problem I'm facing is how to detect similar 'words' that are split by a space character ('3gs' -> '3 gs'; '32 GB' -> '16GB').
My code compares the shorter string (by word count, or by str.length if the counts are equal) with the longer one. Words are split into an ArrayList<String>. I combine each word from str1 with the others in the same string, creating a new ArrayList.
Here is some rough code:
foreach(str1)
    foreach(str2)
        res1 = getLevensteinDist
    endforeach
    foreach(combinedstr2)
        res2 = getLevensteinDist
    endforeach
    return getHigherPercent(res1, res2)
endforeach
This works if the words in str2 are split, but I can't figure out how to do the reverse: detect words in str2 that are split in str1.
I hope I'm at least a bit clear about what I'm trying to do. Any help is appreciated.
First of all, you should preprocess your strings: remove "a", "the", "as", "an" and all common verbs, numbers and so on from the input strings, and convert every plural form to the singular, so that all words are unified. Then you can apply a string matching algorithm, or just put the words into a hash map (or, if there are a lot of them, into a trie) and run your similarity algorithm.
Have a look at TF-IDF. It is specifically designed to compute similarities between textual features.
http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html
Try splitting one of the strings into words, then run SmithWaterman for each word and use the SmithWaterman scores as the similarity measure.
13 years ago I wrote my own implementation of a trigram fuzzy search algorithm, named the "Wilbur-Khovayko algorithm".
You can download it here: http://olegh.cc.st/wilbur-khovayko.tar.gz
It finds the "N closest terms" for the entered search term.
The list of terms is in the file termlist.txt.
N is in the variable lim, in the file findtest.c.
The algorithm is very quick: on an old 200 MHz Sun, it finds the 100 closest terms among 100,000 entries in ~0.3 seconds.
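To give a feel for trigram matching in general, here is a small sketch that scores two strings by the overlap of their three-character substrings (Jaccard similarity). This is not the Wilbur-Khovayko algorithm from the download above, whose details are not given here; the padding scheme, the scoring and the names TrigramSimilarity, trigrams and similarity are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class TrigramSimilarity {
    /** Collects the distinct 3-character substrings of the normalized text. */
    static Set<String> trigrams(String text) {
        // Lower-case, collapse whitespace, and pad so that word starts and ends
        // also produce trigrams (the padding is a common convention, assumed here).
        String s = "  " + text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim() + " ";
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    /** Jaccard similarity of the two trigram sets: 0.0 (disjoint) to 1.0 (identical). */
    static double similarity(String a, String b) {
        Set<String> ga = trigrams(a);
        Set<String> gb = trigrams(b);
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        int union = ga.size() + gb.size() - common.size();
        return (double) common.size() / union;
    }

    public static void main(String[] args) {
        System.out.println(similarity("iPhone 3gs 32 GB black", "Apple iPhone 3 gs 16GB black"));
        System.out.println(similarity("iPhone 3gs 32 GB black", "Samsung TV"));
    }
}
```

Because the score only depends on which trigrams occur, word order matters much less than it does for Levenshtein distance on the whole string.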

How to check if two Strings are approximately equal?

I'm making a chat responder for a game, and I want to know if there is a way to compare two strings and see if they are approximately equal to each other. For example:
if someone typed:
"Strength level?"
it would call a function,
then if someone else typed:
"Str level?"
it would call that same function. But I also want it to automatically detect what someone is trying to type if they make a typo or something like that. For example:
"Strength tlevel?"
would also make the function get called.
Is what I'm asking for something simple, or will it require me to write a big, irritating function to check the Strings?
If you've been baffled by my explanation (not really one of my strong points), then this is basically what I'm asking:
How can I check if two strings are similar to each other?
See this question and answer: Getting the closest string match
Using some heuristics and the Levenshtein distance algorithm, you can compute the similarity of two strings and take a guess at whether they're equal.
Your only option other than that would be a dictionary of accepted words similar to the one you're looking for.
You can use Levenshtein distance.
I believe you should use one of the edit distance algorithms to solve your problem. Here is, for example, a Levenshtein distance implementation in Java. You may use it to compare the words in the sentences and, if the sum of their edit distances is less than, for example, 10% of the sentence length, consider them equal.
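A compact sketch of that approach follows. The Levenshtein routine is the standard dynamic-programming version; the comparison itself is a simplification that measures the whole lower-cased string rather than summing per-word distances, and rounding the 10% allowance up (so that short inputs still tolerate one typo) is an assumption. EditDistance and roughlyEqual are hypothetical names.

```java
class EditDistance {
    /** Classic dynamic-programming Levenshtein distance using two rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                               // distance from "" to b[0, j)
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                               // distance from a[0, i) to ""
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** True if the distance is at most maxRatio of the longer string, rounded up. */
    static boolean roughlyEqual(String a, String b, double maxRatio) {
        String x = a.trim().toLowerCase();
        String y = b.trim().toLowerCase();
        int longest = Math.max(x.length(), y.length());
        if (longest == 0) {
            return true;
        }
        return levenshtein(x, y) <= Math.ceil(maxRatio * longest);
    }

    public static void main(String[] args) {
        System.out.println(roughlyEqual("Strength level?", "Strength tlevel?", 0.10)); // prints true
        System.out.println(roughlyEqual("Strength level?", "Str level?", 0.10));       // prints false
    }
}
```

Note that an abbreviation such as "Str" is five edits away from "Strength", so a distance threshold alone will not catch it; that case needs a list of accepted short forms.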
Perhaps what you need is a large dictionary of similar words and common spelling mistakes, which you would use to "translate" each word to one single entry or key.
This would be useful for custom words, so you could add "str" under the same key as "strength".
However, you could also add a few automated methods: for example, when your word isn't found in the dictionary, loop over the entries looking for a one-letter difference (either missing or replaced), and then recurse into deeper levels, e.g. two missing letters, and so on.
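A minimal sketch of such a dictionary, assuming a plain in-memory map and only the first level of the fallback (one inserted, missing or replaced letter), could look like this. AliasDictionary, add, lookup and withinOneEdit are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

class AliasDictionary {
    // Maps every accepted spelling (including the key itself) to its key.
    private final Map<String, String> canonical = new HashMap<>();

    void add(String key, String... spellings) {
        canonical.put(key, key);
        for (String s : spellings) {
            canonical.put(s, key);
        }
    }

    /** Returns the key for word, or null if no spelling is within one edit. */
    String lookup(String word) {
        String w = word.toLowerCase();
        String exact = canonical.get(w);
        if (exact != null) {
            return exact;
        }
        // Fallback: accept any known spelling that differs by a single letter.
        // If several qualify, which one is returned depends on map order.
        for (Map.Entry<String, String> e : canonical.entrySet()) {
            if (withinOneEdit(w, e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    /** True if a and b are equal or differ by one insertion, deletion or substitution. */
    static boolean withinOneEdit(String a, String b) {
        if (Math.abs(a.length() - b.length()) > 1) {
            return false;
        }
        int i = 0;
        int j = 0;
        boolean edited = false;
        while (i < a.length() && j < b.length()) {
            if (a.charAt(i) == b.charAt(j)) {
                i++;
                j++;
                continue;
            }
            if (edited) {
                return false;                      // second difference found
            }
            edited = true;
            if (a.length() > b.length()) {
                i++;                               // extra letter in a
            } else if (a.length() < b.length()) {
                j++;                               // missing letter in a
            } else {
                i++;                               // replaced letter
                j++;
            }
        }
        return true;
    }
}
```

With add("strength", "str"), both lookup("Str") and lookup("strengh") resolve to "strength", while an unrelated word such as "level" yields null.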
I found a few projects that do text to phonemes translations, don't know which one is best
http://mary.dfki.de/
http://www2.eng.cam.ac.uk/~tpl/asp/source/Phoneme.java
http://java.dzone.com/announcements/announcing-phonemic-10
If you want to find similar word beginnings, you can use a stemmer. Stemmers reduce words to a common beginning. The best-known algorithm is the Porter stemmer (http://tartarus.org/~martin/PorterStemmer).
Levenshtein, as pointed out above, is great, but computationally heavy for distances greater than one or two.

Spell checker solution in java

I need to implement a spell checker in Java. For example, for the string "sch aproblm iseasili solved" my output should be "such a problem is easily solved". The maximum length of the string to correct is 64. As you can see, the string can have spaces inserted in the wrong places or missing altogether, as well as misspelled words. I need a little help finding an efficient algorithm for producing the corrected string.
I am currently deleting all spaces from the string and inserting spaces in every possible position. For the word "hot" (the same applies to a sentence), I generate the following candidate strings, which are afterwards corrected word by word using Levenshtein distance: h o t; h ot; ho t; hot. That is 2^(string.length() - 1) candidate strings, so a string of length 64 produces 2^63 of them, which is far too many. Afterwards I need to process them one by one and select the best one according to these criteria:
- total edit distance (the smallest one wins)
- if several strings have the same edit distance, I have to choose the one with the fewest words
- if several strings have the same number of words, I have to choose the one whose words have the highest total frequency (I have a dictionary of the 8000 most frequent words along with their frequencies)
- and finally, if several strings have the same total frequency, I have to take the lexicographically smallest one.
So basically I generate all possible strings (inserting spaces in all possible positions of the original string), then one by one I calculate their total edit distance, number of words, etc., choose the best one, and output the corrected string. I want to know if there is an easier (in terms of efficiency) way of doing this, such as not having to generate all possible combinations of strings.
EDIT: I thought I should take another approach to this. Here is what I have in mind: I take the first letter of my string and extract from the dictionary all the words that begin with that letter. Then I process all of them and extract from my string all possible first words. To stay with my previous example, for the word "hot" generating all possible combinations gave 4 results, but with my new algorithm I obtain only 2, "hot" and "ho", so it's already an improvement. I still need a little help creating a recursive or dynamic programming (DP) algorithm for this. I need a way to store all possible strings for the first word, then for each of those all possible strings for the second word, and so on, and finally to concatenate all possibilities and add them to an array or something. There will still be a lot of combinations for large strings, but not as many as having to do ALL of them. Can someone help me with pseudocode or something, as this is not my strong suit?
EDIT2: Here is the code where I generate all the possible first words from my string: http://pastebin.com/d5AtZcth. I need to somehow extend this to do the same for the rest, combine each first word with each second word and so on, and store all of these concatenations in an array or something.
A few tips for you:
try correcting just small parts of the string, not everything at once.
90% of errors (IIRC) are within 1 edit distance of the source.
you can use a phonetic index to match words against words that sound alike.
you can assume most typos are QWERTY errors (j=>k, h=>g), and try to check them first.
A few more ideas can be found in this nice article:
http://norvig.com/spell-correct.html
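On the question of avoiding all 2^(n-1) splits: one way to sketch it is dynamic programming over positions in the space-stripped string, where the best correction of the first i characters is the best correction of some shorter prefix plus the closest dictionary word for the remaining chunk. The code below is only an illustration under several assumptions: it applies just the first two criteria from the question (total edit distance, then fewer words), it caps chunks at a made-up MAX_CHUNK of 12 characters, and it scans the whole dictionary for every chunk. SegmentingCorrector and correct are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SegmentingCorrector {
    private static final int MAX_CHUNK = 12;   // assumed cap on the length of one misspelled word
    private final List<String> dictionary;

    SegmentingCorrector(List<String> dictionary) {
        this.dictionary = dictionary;
    }

    String correct(String input) {
        String s = input.replace(" ", "");
        int n = s.length();
        int[] cost = new int[n + 1];        // cost[i]: smallest total edit distance for s[0, i)
        int[] words = new int[n + 1];       // words[i]: number of words in that best split
        int[] from = new int[n + 1];        // from[i]: where the last chunk of that split starts
        String[] word = new String[n + 1];  // word[i]: dictionary word chosen for the last chunk
        Arrays.fill(cost, Integer.MAX_VALUE);
        cost[0] = 0;
        for (int i = 1; i <= n; i++) {
            for (int j = Math.max(0, i - MAX_CHUNK); j < i; j++) {
                if (cost[j] == Integer.MAX_VALUE) {
                    continue;               // prefix s[0, j) has no correction
                }
                String chunk = s.substring(j, i);
                for (String w : dictionary) {
                    int c = cost[j] + levenshtein(chunk, w);
                    int k = words[j] + 1;
                    // Criterion 1: smaller total distance; criterion 2: fewer words.
                    if (c < cost[i] || (c == cost[i] && k < words[i])) {
                        cost[i] = c;
                        words[i] = k;
                        from[i] = j;
                        word[i] = w;
                    }
                }
            }
        }
        if (cost[n] == Integer.MAX_VALUE) {
            return input;                   // empty dictionary: nothing to correct with
        }
        List<String> out = new ArrayList<>();
        for (int i = n; i > 0; i = from[i]) {
            out.add(0, word[i]);            // walk the back-pointers from the end
        }
        return String.join(" ", out);
    }

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), sub);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For instance, with the toy dictionary the, cat, sat, the input "th ecet sat" comes back as "the cat sat". The work grows with string length times MAX_CHUNK times dictionary size instead of exponentially; the frequency and lexicographic tie-breaks from the question would slot into the same comparison.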

Concatenating RowFilter orFilters with andFilter in Java

None of the existing questions on this topic seems to answer the particular question I have.
My problem is this. I have a list of search terms, and for each term I find the edit distance to find possible misspelling of a word.
So for each word separated by a space, I have possible words each word could be.
For example: searching for green chilli might give us "fuzzy" words "green, greene and grain" and "chilli, chill and chilly".
Now I want the RowFilter to search for: "green OR greene OR grain" AND "chilli OR chill OR chilly".
I can't seem to find a way to do this in Java. I've looked all over the place but nothing talks about concatenating the OR and AND filters together in one RowFilter.
Would I have to roll my own solution based on the model? I suppose I can do this, but my method would most probably be naive at first and slow.
Any pointers as to how to roll my own solution for this or better yet, what's the Java way to do this right?
RowFilter.orFilter() and RowFilter.andFilter() seem apropos; each includes examples, and each accepts an Iterable of filters, so any number of filters can be combined and the results nested.
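As a sketch of how the two can be nested for the "green chilli" example: build one orFilter per search term from its fuzzy variants, then wrap those in a single andFilter. The helper names (FuzzyRowFilter, anyOf, allGroups) are made up, and matching each variant as a whole word, case-insensitively, in any column via regexFilter is an assumption about what the search should do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.swing.RowFilter;
import javax.swing.table.TableModel;

class FuzzyRowFilter {
    /** Matches rows that contain at least one of the variants as a whole word. */
    static RowFilter<TableModel, Integer> anyOf(List<String> variants) {
        List<RowFilter<TableModel, Integer>> ors = new ArrayList<>();
        for (String v : variants) {
            // (?i) = case-insensitive, \b = word boundary; no column indices
            // are given, so every column is checked.
            String regex = "(?i)\\b" + Pattern.quote(v) + "\\b";
            ors.add(RowFilter.<TableModel, Integer>regexFilter(regex));
        }
        return RowFilter.<TableModel, Integer>orFilter(ors);
    }

    /** Matches rows that satisfy every group: (a OR b OR ...) AND (x OR y OR ...). */
    static RowFilter<TableModel, Integer> allGroups(List<List<String>> groups) {
        List<RowFilter<TableModel, Integer>> ands = new ArrayList<>();
        for (List<String> group : groups) {
            ands.add(anyOf(group));
        }
        return RowFilter.<TableModel, Integer>andFilter(ands);
    }
}
```

The result can be passed to TableRowSorter.setRowFilter(), for example with the groups [green, greene, grain] and [chilli, chill, chilly].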

Is it possible to automate generation of wrong choices from a correct word?

The following list contains one correct word, "disastrous", and other incorrect words which sound like the correct word.
A. disastrus
B. disasstrous
C. desastrous
D. desastrus
E. disastrous
F. disasstrous
Is it possible to automate generation of wrong choices given a correct word, through some kind of java dictionary API?
No, there is nothing for this in the Java API. You can write a simple algorithm which will do the job.
Just make up some rules about letter permutations and doubling, and add the generated words to a Set until you have enough words.
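A rough sketch of that rule-based idea follows. The three rules used here (double a consonant, swap i and e, drop a vowel that sits next to another vowel) are just examples chosen to reproduce misspellings like those in the question, and Distractors and generate are hypothetical names.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class Distractors {
    private static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    /** Applies each rule once at every position and returns up to limit misspellings. */
    static Set<String> generate(String word, int limit) {
        Set<String> all = new LinkedHashSet<>();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            String head = word.substring(0, i);
            String tail = word.substring(i + 1);
            if (isVowel(c)) {
                // Rule: swap similar-sounding vowels.
                if (c == 'i') all.add(head + 'e' + tail);
                if (c == 'e') all.add(head + 'i' + tail);
                // Rule: drop a vowel that sits next to another vowel ("ou" -> "u").
                boolean nextToVowel = (i > 0 && isVowel(word.charAt(i - 1)))
                        || (i + 1 < word.length() && isVowel(word.charAt(i + 1)));
                if (nextToVowel) all.add(head + tail);
            } else {
                // Rule: double a consonant ("s" -> "ss").
                all.add(head + c + c + tail);
            }
        }
        Set<String> limited = new LinkedHashSet<>();
        for (String s : all) {
            if (limited.size() == limit) break;
            limited.add(s);
        }
        return limited;
    }

    public static void main(String[] args) {
        System.out.println(generate("disastrous", 50));
    }
}
```

For "disastrous" this yields, among others, "desastrous", "disasstrous" and "disastrus". Choices that combine two mistakes, such as "desastrus", could be produced by running the generated words through the rules a second time.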
There are a number of algorithms for matching words by sound - 'soundex' is the one that springs to mind, but I remember uncovering a few when I did some research on this a couple of years ago. I expect the problem you would find is that they take a word and return a value that represents how the word sounds so you can see if two spellings sound similar (so the words in the question should generate similar values); but I expect doing the reverse, i.e. taking the value and generating similar sounding spellings, would be quite hard.
