I am trying to generate all possible sentences from given tokens. It is a transliteration program. I have various possibilities for each token to be transliterated, and I want to generate all possible sentences. E.g. if the sentence is token1 token2 token3, and supposing token1 can be represented in 3 ways after transliteration, token2 in 2 ways and token3 in 4 ways, then the total number of possible sentences is 24. I developed a general tree and then perform a depth-first traversal to generate all possible sentences. The problem is that when the sentence becomes long, the number of possibilities increases and I get a "java.lang.OutOfMemoryError: Java heap space" error.
Is there any other way to generate all possible sentences? In some instances I need to generate millions of sentences. Please help!
You can't generate them all at once like that.
Depending on what you need them for, you should either do whatever that is with each sentence as it is generated, or write them to a file.
Another thought, which still might not work, would be to not store every possible value but instead store a set of references/relationships. You can make this much more complex with n-grams and Markov chains, or simply have a set of references, or even just a list of array indexes.
So, besides using storage space as a memory buffer, you can also invert the idea: instead of foo calling gen for the full set, have gen call foo after each sentence is generated.
[EDIT: looking back on this (I was interested to see any other answers), I want to clarify that the function foo is whatever you're using the sentences for and the function gen generates them (just in case it isn't clear, and especially for anyone whose first language isn't English).]
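To make that concrete, here is a minimal sketch of the gen-calls-foo idea (the names options, foo and SentenceGenerator are mine, not from the question): it walks the per-token transliterations depth-first and hands each finished sentence to a callback, so only one sentence is ever held in memory at a time.

    import java.util.List;
    import java.util.function.Consumer;

    class SentenceGenerator {
        // options.get(i) holds all transliterations of token i; foo receives each complete sentence
        static void generate(List<List<String>> options, Consumer<String> foo) {
            generate(options, 0, new StringBuilder(), foo);
        }

        private static void generate(List<List<String>> options, int index,
                                     StringBuilder current, Consumer<String> foo) {
            if (index == options.size()) {
                foo.accept(current.toString().trim());   // one finished sentence: use it or write it out
                return;
            }
            int mark = current.length();
            for (String choice : options.get(index)) {
                current.append(choice).append(' ');
                generate(options, index + 1, current, foo);
                current.setLength(mark);                 // backtrack before trying the next choice
            }
        }
    }

Called as, say, SentenceGenerator.generate(options, writer::println), it writes the millions of sentences straight to a file without ever building the full set in memory.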
Related
I am making a dictionary application. I am using the Pearson Dictionary API for it. I need to generate a word so that I can query that word for its definition.
PROBLEM
I know how to generate a random word but I don't know how to generate a meaningful English word.
I tried to solve this problem by requesting a JSON response and checking results[] (results[] holds the definitions for the word) in the response. So, if results[].length > 0 then the word is a valid English word.
But the solution above has its own serious problem: suppose I want to generate a 5-letter word; there are as many as 26^5 = 11,881,376 different combinations, whereas there aren't nearly that many meaningful 5-letter English words. As the number of letters in the word increases, the number of combinations increases too. Thus, generating a meaningful word can take a very long time.
How can I check if the generated word is a meaningful English word or not? Isn't there any feasible programmatic way of doing this?
OR Is there any other way I could solve this Problem?
As far as I can see, you either generate random strings of letters and check to see if they're words (which, as you realise, is a very slow, hit-or-miss approach) or you store a list of "known good" words and select randomly from that list.
How big that list needs to be depends on what you're trying to achieve.
According to this page the OED has around 171,476 main entries, not including variants like plurals (cat, cats), standard variants (sit, sitting), or words that have multiple classes (e.g. dog can be a noun [the animal] or a verb [to follow persistently], etc.). According to this page an average adult knows between 20,000 and 35,000 words, so a prudent selection of 50,000 should cover most general-purpose uses.
The answers to this question (now closed) provide a number of sources for word lists. Examining one of them (originally provided by infochimps.org but available as a simple text list on GitHub) shows that the average length of its 350,000+ words is just under 10 characters. On Linux (and possibly other flavours) /usr/share/dict/words may be a useful place to start.
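A rough sketch of the "select randomly from a known-good list" approach (the file path is just an assumption; any of the word lists above works the same way):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.Random;

    public class RandomDictionaryWord {
        public static void main(String[] args) throws IOException {
            // Load the whole word list into memory (a few MB at most) and pick one line at random.
            List<String> words = Files.readAllLines(Paths.get("/usr/share/dict/words"));
            String word = words.get(new Random().nextInt(words.size()));
            System.out.println(word);
        }
    }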
There is this beautiful text file containing all English words:
https://github.com/AlexHakman/Java-challenge/blob/master/words.txt
You can then generate 5-letter words based on what's inside this text document :)
Check the length of each line, or just generate a word and compare it against the text file :)
Instead of doing it randomly (since you would then need to spend time verifying), just store a dictionary of the words that you require and use it as a lookup table.
A relatively complete dictionary for English is about 2 MB compressed, like the one here: http://wordlist.aspell.net/12dicts/
Even for an Android app, unless you're targeting really underpowered devices, that shouldn't be too big.
You can use SQLite to store the data, so it may take up a bit more storage, but you get SQL as your query language rather than making up your own.
Since you also need a bit of randomness, each row can include some sort of randomized key that you can then query on.
If you really wanted to limit it to 5 characters, then just use a subset of the dictionary. But this approach also lets you use an arbitrary length, or even length ranges (e.g. 2 to 10 characters).
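As a minimal sketch of the dictionary-table idea (the words.db file and the table/column names are assumptions, and ORDER BY RANDOM() is used here as the simplest source of randomness rather than the pre-stored random key described above):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class RandomWordQuery {
        public static void main(String[] args) throws SQLException {
            // Assumes a SQLite database words.db with a table words(word TEXT)
            // and a SQLite JDBC driver (e.g. xerial sqlite-jdbc) on the classpath.
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:words.db");
                 PreparedStatement ps = conn.prepareStatement(
                         "SELECT word FROM words WHERE length(word) = ? ORDER BY RANDOM() LIMIT 1")) {
                ps.setInt(1, 5);                       // or any length range you like
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        System.out.println(rs.getString("word"));
                    }
                }
            }
        }
    }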
I have to process 450 unique strings about 500 million times. Each string has a unique integer identifier. There are two options for me to use:
1. I can append the identifier to the string, and on arrival of the string I can split it to get the identifier and use it.
2. I can store the 450 strings in a HashMap<String, Integer>, and on arrival of the string I can query the HashMap to get the identifier.
Can someone suggest which option will be more efficient in terms of processing?
It all depends on the sizes of the strings, etc.
You can do all sorts of things.
You can use a binary search to get the index in a list, and at that index is the identifier.
You can hash just the first 2 characters, rather than the entire string, that would likely be faster than the binary search, assuming the strings have an OK distribution.
You can use the first character, or the first two characters if they're unique, as a "perfect index" into a 256- or 65K-entry array that points to the identifier.
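A minimal sketch of that "perfect index" idea (class and variable names are made up; it assumes the first two characters of each of the 450 strings are unique and fall in the 0-255 range):

    import java.util.Arrays;
    import java.util.Map;

    class PrefixIndex {
        private final int[] idByPrefix = new int[65536];    // one slot per possible 2-char prefix

        PrefixIndex(Map<String, Integer> knownStrings) {     // the 450 strings and their identifiers
            Arrays.fill(idByPrefix, -1);                     // -1 marks "no string with this prefix"
            for (Map.Entry<String, Integer> e : knownStrings.entrySet()) {
                String s = e.getKey();
                idByPrefix[(s.charAt(0) << 8) | s.charAt(1)] = e.getValue();
            }
        }

        // Lookup is a single array read: no hashing, no splitting, no object creation.
        int idOf(String s) {
            return idByPrefix[(s.charAt(0) << 8) | s.charAt(1)];
        }
    }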
Also, if your identifier is numeric, it's better to pre-calculate that, rather than convert it on the fly all the time. Text -> Binary is actually rather expensive (Binary -> Text is worse). So it's probably nice to avoid that if possible.
But it behooves you to work the problem. 1 million of anything at 1 ms each is roughly 17 minutes of processing. At 500M, every microsecond wasted adds up to 8+ minutes of extra processing. You may well not care, but this just demonstrates that at these scales "every little bit helps".
So, don't take our word for it; test different things to find what gives you the best result for your working set, and then go with that. Also consider excessive object creation, and avoid it. Normally I don't give it a second thought; object creation is fast, but a nanosecond is a nanosecond.
If you're working in Java and you don't REALLY need Unicode (i.e. you're working with single characters in the 0-255 range), I wouldn't use strings at all; I'd work with raw bytes. Strings are based on Java characters, which are UTF-16. Java Readers convert UTF-8 into UTF-16 every. single. time. 500 million times. Yup! Another few microseconds. 8 microseconds per read adds an hour to your processing.
So, again, look in all the corners.
Or don't: write it the easy way, fire it up, run it over the weekend, and be done with it.
If each String has a unique identifier, then retrieval is O(1) only in the case of a HashMap.
I wouldn't suggest the first method, because you would be splitting strings for 450 × 500M arrivals, unless your order is one string repeated 500M times and then on to the next. As Will said, appending a numeric id to strings and then retrieving it might seem straightforward, but it is not recommended.
So if your data is static (just the 450 strings), put them in a HashMap and experiment with it. Good luck.
Use the HashMap<String, Integer>. Splitting a string to get the identifier is an expensive operation because it involves creating new Strings.
I don't think anyone is going to be able to give you a convincing "right" answer, especially since you haven't provided all of the background / properties of the computation. (For example, the average length of the strings could make a lot of difference.)
So I think your best bet would be to write a benchmark ... using the actual strings that you are going to be processing.
I'd also look for a way to extract and test the "unique integer identifier" that doesn't entail splitting the string.
Splitting the string should work faster if you write your code well enough. In fact, if you already have the int id, I see no reason to send only the string and maintain a mapping.
Putting it into a HashMap would require hashing the incoming string every time. So you are basically comparing the performance of the hashing function vs. the code you write to append (prepending might be a bit trickier) on the sending end and to parse on the receiving end.
OTOH, only 450 strings isn't a big deal, and if you're into it, writing your own hashing algorithm/function would actually be the most elegant and performant option.
The following list contains 1 correct word, "disastrous", and other incorrect words which sound like the correct word:
A. disastrus
B. disasstrous
C. desastrous
D. desastrus
E. disastrous
F. disasstrous
Is it possible to automate the generation of wrong choices, given a correct word, through some kind of Java dictionary API?
No, there is nothing for this in the Java API. You can write a simple algorithm which will do the job.
Just make up some rules about letter permutations and doubling, and add the generated words to a Set until you get enough words.
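A rough sketch of what such rules could look like (the rules and the class name here are made up, just to illustrate the approach): swap vowels and double consonants, collecting the results in a Set.

    import java.util.LinkedHashSet;
    import java.util.Set;

    class MisspellingGenerator {
        static Set<String> generate(String word, int wanted) {
            Set<String> out = new LinkedHashSet<>();
            String vowels = "aeiou";
            char[] chars = word.toCharArray();
            for (int i = 0; i < chars.length && out.size() < wanted; i++) {
                char c = chars[i];
                if (vowels.indexOf(c) >= 0) {
                    // rule 1: replace this vowel with each other vowel
                    for (char v : vowels.toCharArray()) {
                        if (v != c && out.size() < wanted) {
                            char[] copy = chars.clone();
                            copy[i] = v;
                            out.add(new String(copy));
                        }
                    }
                } else {
                    // rule 2: double this consonant
                    out.add(word.substring(0, i + 1) + c + word.substring(i + 1));
                }
            }
            out.remove(word);   // never offer the correct spelling as a wrong choice
            return out;
        }
    }

MisspellingGenerator.generate("disastrous", 5) then yields choices such as "ddisastrous" and "desastrous"; add or tune rules until the results are convincing enough.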
There are a number of algorithms for matching words by sound - 'soundex' is the one that springs to mind, but I remember uncovering a few when I did some research on this a couple of years ago. I expect the problem you would find is that they take a word and return a value that represents how the word sounds so you can see if two spellings sound similar (so the words in the question should generate similar values); but I expect doing the reverse, i.e. taking the value and generating similar sounding spellings, would be quite hard.
Is there a dictionary I can download for Java?
I want to have a program that takes a few random letters and sees if they can be rearranged into a real word by checking them against the dictionary.
Is there a dictionary I can download for Java?
Others have already answered this... Maybe you weren't simply talking about a dictionary file but about a spellchecker?
I want to have a program that takes a few random letters and sees if they can be rearranged into a real word by checking them against the dictionary
That is different. How fast do you want this to be? How many words are in the dictionary, and how many words, of up to what length, do you want to check?
In case you want a spellchecker (which is not entirely clear from your question), Jazzy is a spellchecker for Java that has links to a lot of dictionaries. It's not bad, but the various implementations are horribly inefficient (it's OK for small dictionaries, but it's an amazing waste when you have several hundred thousand words).
Now if you just want to solve the specific problem you describe, you can:
parse the dictionary file and create a map: (letters in sorted order, set of matching words)
then, for any set of random letters: sort them and see if you have an entry in the map (if you do, the entry's value contains all the words that you can make with these letters).
abracadabra : (aaaaabbcdrr, (abracadabra))
carthorse : (acehorrst, (carthorse) )
orchestra : (acehorrst, (carthorse,orchestra) )
etc...
Now say you take a few random letters and get "hsotrerca"; you sort them to get "acehorrst", and using that as a key you get all the (valid) anagrams...
This works because what you described is a special (easy) case: all you need is to sort your letters and then use an O(1) map lookup.
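A small sketch of that map, assuming a plain one-word-per-line dictionary file (class and method names are mine):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.*;

    class AnagramIndex {
        private final Map<String, Set<String>> byKey = new HashMap<>();

        AnagramIndex(Path dictionaryFile) throws IOException {
            for (String word : Files.readAllLines(dictionaryFile)) {
                // key: the word's letters in sorted order, e.g. "carthorse" -> "acehorrst"
                byKey.computeIfAbsent(sortLetters(word), k -> new TreeSet<>()).add(word);
            }
        }

        // All dictionary words that can be formed from exactly these letters.
        Set<String> anagramsOf(String letters) {
            return byKey.getOrDefault(sortLetters(letters), Collections.emptySet());
        }

        private static String sortLetters(String s) {
            char[] c = s.toLowerCase().toCharArray();
            Arrays.sort(c);
            return new String(c);
        }
    }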
For more complicated spell checking, where there may be errors, you need something to come up with "candidates" (words that may be correct but misspelled) [like, say, using the Soundex, Metaphone or Double Metaphone algorithms] and then use things like the Levenshtein edit-distance algorithm to check candidates against known good words (or the much more complicated tree based on the Levenshtein edit distance that Google uses for its "find as you type"):
http://en.wikipedia.org/wiki/Levenshtein_distance
As a funny side note, an optimized dictionary representation can store hundreds of thousands and even millions of words in less than 10 bits per word (yup, you read that correctly: less than 10 bits per word) and yet allow very fast lookups.
Dictionaries are usually programming-language agnostic. If you try to google it without using the keyword "java", you may get better results. E.g. searching for free dictionary download turns up, among others, dicts.info.
OpenOffice dictionaries are easy to parse line-by-line.
You can read it into memory (remember, it's a lot of memory):
List<String> words = IOUtils.readLines(new FileInputStream("dicfile.txt")); (from commons-io)
Thus you get a List of all the words. Alternatively, you can use the LineIterator if you encounter memory problems.
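For the LineIterator variant, a small sketch (the file name is just a placeholder) that streams the file instead of holding it all in memory:

    import java.io.File;
    import java.io.IOException;
    import org.apache.commons.io.FileUtils;
    import org.apache.commons.io.LineIterator;

    class DictionaryScan {
        static void scan(File dictFile) throws IOException {
            LineIterator it = FileUtils.lineIterator(dictFile, "UTF-8");
            try {
                while (it.hasNext()) {
                    String word = it.nextLine();
                    // check the candidate word here
                }
            } finally {
                LineIterator.closeQuietly(it);   // only one line is held in memory at a time
            }
        }
    }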
If you are on a Unix-like OS, look in /usr/share/dict.
Here's one:
http://java.sun.com/docs/books/tutorial/collections/interfaces/examples/dictionary.txt
You can use the standard Java file handling to read the word on each line:
http://www.java-tips.org/java-se-tips/java.io/how-to-read-file-in-java.html
Check out http://sourceforge.net/projects/test-dictionary/; it might give you some clues.
I am not sure if there are any such libraries available for download, but you can definitely dig through sourceforge.net to see if there are any, or to see how people have used dictionaries: http://sourceforge.net/search/?type_of_search=soft&words=java+dictionary
I've got about 2500 short phrases in a file. I want to be able to find phrases as I type possible substrings of them. My app has a text box and a list of phrases. The text box is initially empty and the list contains all 2500 phrases, since the empty string is a substring of all of them. As I type in the text box, the list updates so that it always only contains phrases which contain the text box's value as a substring.
At the moment I have one of Google's Multimaps, specifically:
LinkedHashMultimap<String, String>
with every single possible substring mapped to its possible matches. This takes a while to load (about a second), and I think it must be taking up quite a bit of space (which may be a concern in the future). It's very fast for lookups, though.
Is there a way I could do this with some other data structure or strategy that would be quicker to load and take less space (possibly at the expense of the speed of the lookups)?
If your list only contains 2500 elements, a simple loop and checking contains() on all of them should be fast enough.
If it grows bigger and/or is too slow, you can apply some easy optimizations:
Don't search immediately as the user types each character, but introduce some delay. So if he types "foobar" really fast, you only search for "foobar", not first "f" then "fo" then "foo",...
Reuse your previous results: if the user first types "foo" and then extends that to "foobar", don't search in the whole original list again, but search inside the results for "foo" (because everything that contains "foobar" must contain "foo").
In my experience, these basic optimizations already get you quite far.
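A small sketch of both optimizations together (class and field names are made up): the search only runs against the previous results when the new query merely extends the old one.

    import java.util.ArrayList;
    import java.util.List;

    class PhraseFilter {
        private final List<String> allPhrases;
        private String lastQuery = "";
        private List<String> lastResults;

        PhraseFilter(List<String> phrases) {
            this.allPhrases = phrases;
            this.lastResults = phrases;
        }

        // Call this after the short typing delay mentioned above.
        List<String> filter(String query) {
            // If the new query just extends the old one, only re-check the previous hits:
            // anything containing "foobar" must also contain "foo".
            List<String> candidates = query.startsWith(lastQuery) ? lastResults : allPhrases;
            List<String> results = new ArrayList<>();
            for (String phrase : candidates) {
                if (phrase.contains(query)) {
                    results.add(phrase);
                }
            }
            lastQuery = query;
            lastResults = results;
            return results;
        }
    }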
Now, if the list grows so big that even that is too slow, some "smarter" optimizations as proposed in other answers here (tries, suffix trees,...) would be needed.
You'll want to look into using the Trie data structure.
Try simply looping over the entire list and calling contains() - doing that 2500 times is probably completely unnoticeable.
You definitely need a suffix tree (wiki).
(I think this implementation could be OK: link)
EDIT:
I've read your comment: you shouldn't blindly check if the string is a substring somewhere in your phrase; you usually start with a word, not with a space. So maybe it's better to tokenize the words inside your phrase?
Are you allowed to do that? Otherwise, the best way is to build an automaton for every phrase or to use similar algorithms (for example the Karp-Rabin string search algorithm).
Wouter Coekaerts has a good approach, but I would go a bit further.
Don't bring up anything when the textbox contains a single character. The results won't be useful. You may find that this is true for two characters as well.
Precompute the results for two characters. When there are two characters bring up the precomputed list.
When a third character is added, do the 'contains' search on the list you currently have displayed (anything that doesn't contain c1c2 can't contain c1c2c3). By now the list should be small enough that 'contains' has perfectly adequate performance.
Similarly for four characters etc.
As said above, put in a little delay before starting the search. Or, better still, arrange for a search to be killed if another character is typed before it finishes.
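A sketch of the precomputation part (names are made up): every 2-character substring is mapped to the phrases containing it, and longer queries narrow that precomputed list with contains().

    import java.util.*;

    class TwoCharIndex {
        private final Map<String, Set<String>> byPair = new HashMap<>();

        TwoCharIndex(List<String> phrases) {
            for (String phrase : phrases) {
                for (int i = 0; i + 2 <= phrase.length(); i++) {
                    byPair.computeIfAbsent(phrase.substring(i, i + 2),
                                           k -> new LinkedHashSet<>()).add(phrase);
                }
            }
        }

        List<String> matchesFor(String query) {
            if (query.length() < 2) {
                return Collections.emptyList();          // too short to show anything useful
            }
            Set<String> candidates =
                    byPair.getOrDefault(query.substring(0, 2), Collections.emptySet());
            List<String> out = new ArrayList<>();
            for (String phrase : candidates) {
                if (phrase.contains(query)) {            // trivially true when the query has length 2
                    out.add(phrase);
                }
            }
            return out;
        }
    }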