What type of Trie is this? - java

I want to add words to an open-source Java word-splitting program for Khmer (a language that does not have spaces between words). The developers have not worked on it in a long time, and I haven't been able to contact them for details (http://sourceforge.net/projects/khmer/files/Khmer%20Word%20Breaking/Khmer%20Word%20Breaking%20program%20V1.0/). Supposedly the word list was created from a Khmer dictionary, and I would like to re-create the file to include more words.
Can anyone identify what format the word dictionary is in (I believe it is some type of Trie)? Here are the first few lines:
0ឳមអគណជយឍឫហកដពទឱលថឦឡញឩខនឧផប។ឋវឭឈឃឥឌឰឪសងចភធឯតឆរ
1ទ
0ក
1
1ីែមគួណជយ៍ៀហកទុលេញ៉ឺនំឹៃូឈឃោាឿសងចិ្ធើតៅរ
1គនសងរ
0ទ
0ា
0យ
0ព
0ន
1
1រ
0ា
0ស
0ី
1
And does anyone know how I would go about making a new one? (I have a large word list, but I am not sure how to get it into this format.)
Thanks!

After a quick look through the code, I have a theory.
Create a SearchTree, which extends TreeItem. For each word in your dictionary, call addWord from TreeItem. When the iteration is done, call export on the SearchTree. Use the new file as the word input file.
Additionally, there may be an undocumented parameter for khwrdbrk.jar, --create, that will read the words for the new tree from standard input.
Again, just a theory, but let me know what happens if you test it out.
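To make the addWord idea concrete, here is a minimal generic trie sketch. It is not the project's SearchTree or TreeItem code, which I have not reproduced; the class name TrieSketch and its methods are made up for illustration, and it says nothing about how the real export format is laid out.

```java
import java.util.Map;
import java.util.TreeMap;

// Generic trie sketch. NOT the Khmer project's SearchTree/TreeItem classes;
// all names here are hypothetical.
class TrieSketch {
    // One child node per next character (Khmer letters fit in a single Java char).
    private final Map<Character, TrieSketch> children = new TreeMap<>();
    // True if the path from the root to this node spells a complete word.
    private boolean endOfWord;

    // Walk down the tree one character at a time, creating nodes as needed.
    void addWord(String word) {
        TrieSketch node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieSketch());
        }
        node.endOfWord = true;
    }

    // A word is present only if its path exists and ends on a flagged node.
    boolean contains(String word) {
        TrieSketch node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.endOfWord;
    }

    public static void main(String[] args) {
        TrieSketch root = new TrieSketch();
        root.addWord("car");
        root.addWord("cart");
        System.out.println(root.contains("car"));
        System.out.println(root.contains("ca"));
    }
}
```

Each node keeps a flag saying whether a word ends there, plus its child characters, which is the usual shape of a trie used for word breaking.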

Related

Randomly Generate Meaningful(Valid) English Words In Android Application

I am making a Dictionary Application. I am using Pearson Dictionary API for the same. I need to generate a word so that I could query that word for its definition.
PROBLEM
I know how to generate a random word but I don't know how to generate a meaningful English word.
I tried to solve this problem by requesting a JSON response and checking the results[] array in the response (results[] holds the definitions for the word). So, if results.length > 0, the word is a valid English word.
But the solution above has a serious problem of its own: suppose I want to generate a 5-letter word. There are as many as 26^5 = 11,881,376 different combinations, whereas there are far fewer meaningful 5-letter English words. As the number of letters in the word increases, the number of combinations increases too. Thus, generating a meaningful word can take a very long time.
How can I check if the generated word is a meaningful English word or not? Isn't there any feasible programmatic way of doing this?
Or is there any other way I could solve this problem?
As far as I can see, you either generate random strings of letters and check to see if they're words (which, as you realise, is a very slow, hit-or-miss approach) or you store a list of "known good" words and select randomly from that list.
How big that list needs to be depends on what you're trying to achieve.
According to this page the OED has around 171,476 main entries, not including variants like plurals (cat, cats), standard variants (sit, sitting), nor words that have multiple classes (e.g. dog can be a noun [the animal] or a verb [to follow persistently] etc.). According to this page an average adult knows between 20,000 and 35,000 words, so a prudent selection of 50,000 should cover most general purpose uses.
The answers to this question (now closed) provide a number of sources for word-lists. Examining one of them (originally provided by infochimps.org but available as a simple text-list on github) shows that the average length of 350,000+ words is just under 10 characters. For Linux (and possibly other flavours) /usr/share/dict/words may be a useful place to start.
There is this beautiful text file containing a large list of English words:
https://github.com/AlexHakman/Java-challenge/blob/master/words.txt
You can then pick 5-letter words based on what's inside this text document :)
Check the length of each line and keep the ones that fit, or just generate a word and compare it with the text file :)
Instead of generating words randomly, which means spending time verifying them, just store a dictionary of the words you require and use it as a lookup table.
A relatively complete dictionary for English is about 2 MB compressed, like the one here: http://wordlist.aspell.net/12dicts/
Even for an Android app, unless you're targeting really underpowered devices, that shouldn't be too big.
You can use SQLite to store the data. It may take up a bit more storage, but you get SQL as your query language rather than making up your own.
Since you also need a bit of randomness, each row can carry some sort of randomized key that you can query on.
If you really want to limit it to 5 characters, just use a subset of the dictionary. But this approach also lets you have an arbitrary length, or even length ranges (e.g. 2 to 10 characters).
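A minimal sketch of the "pick from a known-good list" idea, assuming the word list has already been loaded into memory as a List<String>. The class and method names (RandomWordPicker, pick) are made up for illustration; on Android you would probably load the list from an asset or a database instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical helper: choose a random real word within a length range.
class RandomWordPicker {
    // Returns a random word with minLen <= length <= maxLen,
    // or null if the list has no such word.
    static String pick(List<String> words, int minLen, int maxLen, Random rng) {
        List<String> candidates = new ArrayList<>();
        for (String w : words) {
            if (w.length() >= minLen && w.length() <= maxLen) {
                candidates.add(w);
            }
        }
        if (candidates.isEmpty()) {
            return null;
        }
        return candidates.get(rng.nextInt(candidates.size()));
    }

    public static void main(String[] args) {
        // Tiny stand-in for a real word list file.
        List<String> words = Arrays.asList("cat", "house", "apple", "garden");
        System.out.println(pick(words, 5, 5, new Random()));
    }
}
```

Every word returned is valid by construction, so no API call is needed to verify it. If you pick many words of the same length, filter the list once and reuse the filtered list.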

String name; Method 1: At least 2 words, max 4 words, no special signs and only letters

I've been searching around and haven't quite found my answer.
At the moment my group and I have created a few classes resembling a bank, with Customer and Account and so on.
I've been struggling lately with trying to improve and secure our code by making our variable called "name" accept only certain inputs.
In this case, I want to make it possible for the person to enter a name only as follows:
At least 2 words = (For the word part, I've seen code where you count the white space between words, but I don't know yet what to do about the last word, since there won't be any white space after it)
Max 4 words = (Same thing here)
No special signs such as ,!%¤"#()=%/'¨. = (for this, I've read something about "Matcher and Pattern")
Now, I'm quite new to Java and I'm not asking anyone for code. I'm asking for someone to point me in the right direction, because a lot of what I've seen, like Matcher and Pattern, are things that you import by downloading utils and stuff. I reckon that's not needed and there should be a simpler, more basic way, as I'm not trying to get ahead of myself by copying code just to get it done.
So yeah, the String "name" is used a lot in our main class "Banklogic", where almost every method that adds something has the variable "name" in it, so it's quite important that I get this done.
I hope I was clear enough, and any help would be appreciated! I'm going to set the alarm for 3 hours before school to see what you guys have come up with, so I can try to complete the code before our meeting! Thanks a lot in advance :)
Since you asked for hints, you can use Regex to add such rules.
For numbers only:
if (string.matches("[0-9\\s]+"))
// allow insertion of data, else not
As for rules related to word count:
string.trim().split("\\s+") creates an array of the words separated by whitespace. You can count the number of elements in this array and allow or disallow the input based on that.
As for no signs and only letters:
if (string.matches("[a-zA-Z\\s]+"))
// allow input, else not
You can use a DocumentFilter to implement these checks. A document filter only lets text be entered if you allow it.
I hope this helped as a hint.
Also, note that \\s matches whitespace (\\W, by contrast, matches any non-word character, which would let punctuation through). If you don't want to allow whitespace, remove \\s from the character class.
String.matches and String.split are part of the standard library, so nothing has to be downloaded, which makes this a simple way of doing the task.
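Putting the hints together, a minimal sketch might look like this. The class and method names (NameValidator, isValidName) are made up for illustration, and it assumes "letters" means plain a-z and A-Z; it uses \\s for whitespace, since \\W would match any non-word character, punctuation included.

```java
// Hypothetical helper; uses only java.lang.String, so no imports are needed.
class NameValidator {
    // Valid means: 2 to 4 words, made of letters a-z/A-Z only,
    // separated by whitespace.
    static boolean isValidName(String name) {
        if (name == null) {
            return false;
        }
        String trimmed = name.trim();
        // The whole string must consist of letters and whitespace only.
        // The trailing + matters: without it, matches() accepts one character only.
        if (!trimmed.matches("[a-zA-Z\\s]+")) {
            return false;
        }
        // Splitting on runs of whitespace counts the last word too,
        // even though no space follows it.
        int wordCount = trimmed.split("\\s+").length;
        return wordCount >= 2 && wordCount <= 4;
    }

    public static void main(String[] args) {
        System.out.println(isValidName("John Smith"));
        System.out.println(isValidName("John Sm!th"));
    }
}
```

If names with accented letters should be accepted, the character class could be changed to [\\p{L}\\s], which matches letters from any alphabet.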
EDIT:
This is a Class I wrote a little while ago to achieve such tasks. Just in case if you are interested....

Having issues with apostrophes in strings (Scala)

I'm running into some weird issues in Scala right now. I'm writing a spell checker and the dictionary is in a .txt file that is being read in and stored in a map. In my dictionary is the word "Boston's". I did a check to see if "Boston's" was in the map by using the contains method and it's there. However, the real issue arises when I do the spell check on a document.
"Boston's" is being read in from the document I'm spell checking and stored in a ListBuffer, but when I check if my "dictionary" map contains it, it says it doesn't. So I did a println on both instances of "Boston's" (in my "dictionary" map and in my "wordToBeChecked" list) and I noticed something odd:
Both are there, but they look different. The one in my wordToBeChecked list looks as if it contains a single quote rather than an apostrophe. I've been trying to fix this for hours, but now I'm officially stumped.

Is it possible to automate generation of wrong choices from a correct word?

The following list contains one correct word, "disastrous", and other incorrect words which sound like the correct word:
A. disastrus
B. disasstrous
C. desastrous
D. desastrus
E. disastrous
F. disasstrous
Is it possible to automate generation of wrong choices given a correct word, through some kind of java dictionary API?
No, there is nothing for this in the Java API. You can write a simple algorithm which will do the job.
Just make up some rules about letter permutations and doubling, and add the generated words to a Set until you have enough words.
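A minimal sketch of that idea, with three rules I made up for illustration (double a letter, drop a letter, replace a vowel with another vowel). The class and method names are hypothetical, and the output is not guaranteed to sound like the original word.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical generator of plausible wrong spellings, one edit per candidate.
class MisspellingGenerator {
    private static final String VOWELS = "aeiou";

    // Collects up to 'limit' distinct misspellings of 'word'.
    static Set<String> generate(String word, int limit) {
        Set<String> out = new LinkedHashSet<>();
        for (int i = 0; i < word.length() && out.size() < limit; i++) {
            char c = word.charAt(i);
            String head = word.substring(0, i);
            String tail = word.substring(i + 1);
            // Rule 1: double the letter at position i.
            add(out, word, head + c + c + tail, limit);
            // Rule 2: drop the letter at position i.
            add(out, word, head + tail, limit);
            // Rule 3: swap a vowel for each other vowel.
            if (VOWELS.indexOf(c) >= 0) {
                for (char v : VOWELS.toCharArray()) {
                    add(out, word, head + v + tail, limit);
                }
            }
        }
        return out;
    }

    // The Set removes duplicates; the correct spelling is never added.
    private static void add(Set<String> out, String correct, String candidate, int limit) {
        if (out.size() < limit && !candidate.equals(correct)) {
            out.add(candidate);
        }
    }

    public static void main(String[] args) {
        System.out.println(generate("disastrous", 6));
    }
}
```

Each candidate differs from the correct word by a single edit, so a choice like "desastrus" (two edits) would need the rules applied twice. In practice you would probably shuffle the set and pick a few choices from it.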
There are a number of algorithms for matching words by sound - 'soundex' is the one that springs to mind, but I remember uncovering a few when I did some research on this a couple of years ago. I expect the problem you would find is that they take a word and return a value that represents how the word sounds so you can see if two spellings sound similar (so the words in the question should generate similar values); but I expect doing the reverse, i.e. taking the value and generating similar sounding spellings, would be quite hard.

is there a dictionary i can download for java?

Is there a dictionary I can download for Java?
I want to have a program that takes a few random letters and sees if they can be rearranged into a real word by checking them against the dictionary.
"Is there a dictionary I can download for Java?"
Others have already answered this... Maybe you weren't simply talking about a dictionary file but about a spellchecker?
"I want to have a program that takes a few random letters and sees if they can be rearranged into a real word by checking them against the dictionary"
That is different. How fast do you want this to be? How many words in the dictionary and how many words, up to which length, do you want to check?
In case you want a spellchecker (which is not entirely clear from your question), Jazzy is a spellchecker for Java that has links to a lot of dictionaries. It's not bad, but the various implementations are horribly inefficient (it's OK for small dictionaries, but it's an amazing waste when you have several hundred thousand words).
Now if you just want to solve the specific problem you describe, you can:
parse the dictionary file and create a map: (letters in sorted order, set of matching words)
then for any number of random letters: sort them and see if you have an entry in the map (if you do, the entry's value contains all the words that you can make with these letters).
abracadabra : (aaaaabbcdrr, (abracadabra))
carthorse : (acehorrst, (carthorse) )
orchestra : (acehorrst, (carthorse,orchestra) )
etc...
Now you take, say, nine random letters and get "hsotrerca"; you sort them to get "acehorrst", and using that as a key you get all the (valid) anagrams...
This works because what you described is a special (easy) case: all you need is sort your letters and then use an O(1) map lookup.
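The sorted-letters map can be sketched as follows. The class and method names (AnagramIndex, key, add, lookup) are made up for illustration, and the three words stand in for a real dictionary file.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical anagram index: sorted letters -> all words made of those letters.
class AnagramIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    // The key is the word's letters in sorted order, e.g. "orchestra" -> "acehorrst".
    static String key(String letters) {
        char[] chars = letters.toLowerCase(Locale.ROOT).toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Register a dictionary word under its sorted-letter key.
    void add(String word) {
        index.computeIfAbsent(key(word), k -> new TreeSet<>()).add(word);
    }

    // Sort the given letters and do one hash lookup; empty set if no word matches.
    Set<String> lookup(String letters) {
        return index.getOrDefault(key(letters), Collections.emptySet());
    }

    public static void main(String[] args) {
        AnagramIndex idx = new AnagramIndex();
        for (String w : new String[] {"abracadabra", "carthorse", "orchestra"}) {
            idx.add(w);
        }
        System.out.println(idx.lookup("hsotrerca")); // prints [carthorse, orchestra]
    }
}
```

Building the index costs one sort per dictionary word; after that, each query is a sort of the input letters plus a single map lookup.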
For more complicated spell checking, where there may be errors, you need something to come up with "candidates" (words that may be correct but misspelled) [like, say, the Soundex, Metaphone or Double Metaphone algorithms] and then something like the Levenshtein edit-distance algorithm to check candidates against known good words (or the much more complicated tree built on Levenshtein edit distance that Google uses for its "find as you type"):
http://en.wikipedia.org/wiki/Levenshtein_distance
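For reference, a minimal sketch of the textbook dynamic-programming version of Levenshtein distance (the class and method names are made up for illustration). It counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other.

```java
// Hypothetical helper implementing the classic two-row Levenshtein algorithm.
class EditDistance {
    static int levenshtein(String a, String b) {
        // prev[j] = distance between the first i-1 chars of a and the first j chars of b.
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two arrays instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
    }
}
```

Comparing a misspelled word against every dictionary entry this way is slow for large dictionaries, which is why a candidate-generation step usually comes first.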
As a funny side note, an optimized dictionary representation can store hundreds of thousands and even millions of words in less than 10 bits per word (yup, you read that correctly: less than 10 bits per word) and yet allow very fast lookup.
Dictionaries are usually programming-language agnostic. If you try to google it without using the keyword "java", you may get better results. E.g. searching for "free dictionary download" gives, among others, dicts.info.
OpenOffice dictionaries are easy to parse line-by-line.
You can read it into memory (remember, it takes a lot of memory):
List<String> words = IOUtils.readLines(new FileInputStream("dicfile.txt"), StandardCharsets.UTF_8); (from commons-io)
Thus you get a List of all words. Alternatively, you can use the LineIterator if you encounter memory problems.
If you are on a Unix-like OS, look in /usr/share/dict.
Here's one:
http://java.sun.com/docs/books/tutorial/collections/interfaces/examples/dictionary.txt
You can use the standard Java file handling to read the word on each line:
http://www.java-tips.org/java-se-tips/java.io/how-to-read-file-in-java.html
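A minimal sketch of reading one word per line with only the standard library. The class and method names are made up for illustration; the demo reads from a StringReader so that it runs without a file on disk.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical loader: one word per line, blank lines skipped.
class WordListReader {
    static Set<String> readWords(BufferedReader reader) {
        Set<String> words = new HashSet<>();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers do not need a try/catch.
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // For a real file, something like
        // Files.newBufferedReader(Paths.get("dictionary.txt"), StandardCharsets.UTF_8)
        // could replace the StringReader ("dictionary.txt" is a placeholder path).
        BufferedReader reader = new BufferedReader(new StringReader("cat\ndog\n\nbird\n"));
        System.out.println(readWords(reader).size()); // prints 3
    }
}
```

A HashSet gives fast "is this a real word?" checks afterwards, at the cost of holding the whole list in memory.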
Check out - http://sourceforge.net/projects/test-dictionary/, it might give you some clue
I am not sure if there are any such libraries available for download! But I guess you can definitely dig through sourceforge.net to see if there are any, or to see how people have used dictionaries - http://sourceforge.net/search/?type_of_search=soft&words=java+dictionary
