Source for iterating through all words of the English dictionary - Java

I need to iterate through all the words of an English dictionary and filter them by part of speech (noun, verb, and so on) and by certain other traits. Is there anything I could use as a source for these words?

One thing to note about WordNet: it does not include 'stop words'. Some people have published lists of stop words online, but I'm not sure how complete they are.
Some stop words are: 'the', 'that', 'I', 'to', 'from', 'whose'.
A larger list is here:
http://www.d.umn.edu/~tpederse/Group01/WordNet/wordnet-stoplist.html
For a list of words see this sourceforge project:
http://wordlist.sourceforge.net/
You may also want to search for the use cases of such a list in order to find a suitable data source.
For instance:
Spell checking algorithms use a word list (standalone spell checkers, word processing apps like OpenOffice, etc.).
Word game algorithms use word lists (Scrabble-type games, vocabulary education games, crossword puzzle generators).
Password cracking algorithms use word lists to help find weak passwords.
outpost9.com/files/WordLists.html
Also, there are several Java APIs to choose from, and only some work with the latest dictionary (3.1). The one by MIT uses Java 5 and works with WordNet 3.1.

I recommend WordNet from princeton.edu. It is a popular English lexical database with word attributes such as:
Short definition
Part of speech, e.g. noun, verb, adjective, &c.
Synonyms and groupings
There is a WordNet Java API from smu.edu that will simplify using WordNet in your application. You might also download the database and parse it yourself, as it's only 12MB compressed.
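If you go the parse-it-yourself route, a minimal sketch might look like the following. It assumes the database has been unpacked so that `dict/index.noun` exists (that path is an assumption), and it relies on the index file layout as I understand it from WordNet's database documentation: license header lines begin with spaces, and each data line starts with the lemma followed by a one-letter part-of-speech code. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.stream.Stream;

final class WordNetIndexReader {

    /** A lemma and its part-of-speech code as found in an index file. */
    static final class Entry {
        final String lemma;
        final char pos;

        Entry(String lemma, char pos) {
            this.lemma = lemma;
            this.pos = pos;
        }
    }

    /** Parses one line of an index file; header and malformed lines yield empty. */
    static Optional<Entry> parseLine(String line) {
        if (line.isEmpty() || line.charAt(0) == ' ') {
            return Optional.empty(); // license header or blank line
        }
        String[] fields = line.split(" ");
        if (fields.length < 2 || fields[1].length() != 1) {
            return Optional.empty();
        }
        // Multi-word lemmas are stored with underscores, e.g. ice_cream.
        return Optional.of(new Entry(fields[0].replace('_', ' '), fields[1].charAt(0)));
    }

    public static void main(String[] args) throws IOException {
        // Assumed location of the unpacked database file.
        Path index = Paths.get(args.length > 0 ? args[0] : "dict/index.noun");
        try (Stream<String> lines = Files.lines(index, StandardCharsets.ISO_8859_1)) {
            lines.map(WordNetIndexReader::parseLine)
                 .filter(Optional::isPresent)
                 .map(Optional::get)
                 .filter(e -> e.pos == 'n')
                 .limit(20)
                 .forEach(e -> System.out.println(e.lemma));
        }
    }
}
```

Running the same loop over the verb, adjective and adverb index files would cover the other parts of speech; filtering on further traits would need the data files or one of the APIs mentioned above.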

Related

normalizing String with adding appropriate spacing

This is probably a very broad question for Stack Overflow, but here it goes.
I'm trying to normalize words within a sentence, for example:
INPUT:
I developGeographicallydispersed teams through good ASDWEQ.
OUTPUT
(Notice the spaces between develop Geographically dispersed)
I develop Geographically dispersed teams through good ASDWEQ.
Using an external API (e.g. the Google API) is not an option,
so I need to design our own in-house Java API.
The obvious and naive solution would be something like this:
for each word in sentence do:
    if word is in dictionary then ignore
    else:
        if word is reducible to a sequence of dictionary words then split
        else ignore
od;
So before I start with this approach, my question is: is there a better way of doing it, for example an open-source library, or even a different approach?
Have you had a look at Flex and Bison? They help you create a scanner and define patterns for text processing; in your case you would need to find a way to map the parser onto an existing dictionary.
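For reference, the naive dictionary approach from the question can be sketched in plain Java with a small dynamic-programming pass. This is only an illustration under stated assumptions: the dictionary is an in-memory set of lower-case words, tokens are separated by single spaces, and the class name `WordSplitter` is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class WordSplitter {
    private final Set<String> dictionary; // lower-case words

    WordSplitter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    private boolean isWord(String s) {
        return dictionary.contains(s.toLowerCase(Locale.ROOT));
    }

    /** Splits a token into dictionary words, or returns it unchanged if that is impossible. */
    String splitToken(String token) {
        int n = token.length();
        if (n == 0 || isWord(token)) {
            return token;
        }
        // prev[i] = start of the last word in a segmentation of token[0..i), or -1 if none.
        int[] prev = new int[n + 1];
        Arrays.fill(prev, -1);
        prev[0] = 0;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (prev[start] >= 0 && isWord(token.substring(start, end))) {
                    prev[end] = start;
                    break; // smallest start, i.e. the longest possible final word
                }
            }
        }
        if (prev[n] < 0) {
            return token; // not reducible to dictionary words
        }
        List<String> parts = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            parts.add(0, token.substring(prev[end], end));
        }
        return String.join(" ", parts);
    }

    /** Applies splitToken to every space-separated token of the sentence. */
    String normalize(String sentence) {
        StringBuilder out = new StringBuilder();
        for (String token : sentence.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(splitToken(token));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<>(Arrays.asList(
                "i", "develop", "geographically", "dispersed", "teams", "through", "good"));
        System.out.println(new WordSplitter(words)
                .normalize("I developGeographicallydispersed teams through good ASDWEQ."));
    }
}
```

With a real dictionary, short entries such as "a" or "I" tend to produce spurious splits, so a minimum word length or a frequency-based score would probably be needed on top of this.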

How to identify plurals of noun

I've seen people ask similar questions, but without any good answers. I have now run into the same problem; can anyone help?
See below:
Input: a list of words
Output: identify the nouns in plural form and convert them to their singular form if possible
WordNet will be able to help with stripping plurals. It is a full morphological dictionary of the English language.
http://wordnet.princeton.edu/
JAWS is a simple Java API that talks to WordNet, though others exist.
http://lyle.smu.edu/~tspell/jaws/index.html
Note that WordNet will not deal perfectly with the various idiosyncrasies of English. From their FAQ:
Along with a set of irregular forms (e.g. children - child), it uses a
sequence of simple rules, stripping common English endings until it
finds a word form present in WordNet. Furthermore, it assumes its
input is a valid inflected form. So, it will take "childes" to
"child", even though "childes" is not a word.

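The approach described in the FAQ excerpt above (an exception list plus suffix stripping checked against a dictionary) can be sketched roughly as follows. This is not WordNet's actual implementation: the suffix table is a simplified assumption, the noun set and exception map are supplied by the caller, and the class name is hypothetical.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

final class NaiveSingularizer {
    // Suffix rewrite rules tried in order: {plural ending, singular ending}.
    private static final String[][] RULES = {
        {"ches", "ch"}, {"shes", "sh"}, {"ses", "s"}, {"xes", "x"},
        {"zes", "z"}, {"ies", "y"}, {"men", "man"}, {"s", ""}
    };

    private final Set<String> nouns;             // known singular nouns, lower case
    private final Map<String, String> irregular; // e.g. children -> child

    NaiveSingularizer(Set<String> nouns, Map<String, String> irregular) {
        this.nouns = nouns;
        this.irregular = irregular;
    }

    /** Returns the singular form if the word looks like the plural of a known noun. */
    Optional<String> singular(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        String exception = irregular.get(w);
        if (exception != null) {
            return Optional.of(exception);
        }
        for (String[] rule : RULES) {
            if (w.endsWith(rule[0])) {
                String candidate = w.substring(0, w.length() - rule[0].length()) + rule[1];
                if (nouns.contains(candidate)) {
                    return Optional.of(candidate);
                }
            }
        }
        return Optional.empty(); // not recognised as a plural
    }
}
```

Like the FAQ warns, a scheme of this kind assumes its input is a valid inflected form, so it will happily reduce non-words whose stripped form happens to be in the noun set.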
Defining words using Java

I was wondering if there is an API in Java that can define words and find the origins of words. I remember searching for this a while back and seeing "apache commons", but I am not sure.
Basically, the user will enter a word such as "overflow" and the program will define it. So I am looking for an API that can define words and find their origins. For example, the word "recherche" would have the origin "French".
WordNet will give you half of what you are looking for: you can look up the definition for a word. Note that there are several implementations of WordNet for Java: jwi, jaws, Dan Bikel's, WordnetAPI. Some of these might be easier to use for your purpose than jwordnet suggested by miku (I have only used jaws and jwi).
Note: WordNet will not give you origins (AFAIK). I'm not aware of any software that does.
Note: You will have to provide the lemma of a word to be able to look it up in the dictionary. This means that you will have to apply some Natural Language Processing (NLP) techniques if you want to do this automatically on a free-text document (which can contain inflected forms). If you go this route, I'd suggest the GATE project's Morph plugin.
WordNet, maybe? There is a Java wrapper for it: http://sourceforge.net/projects/jwordnet/
Another list of NLP toolkits:
http://en.wikipedia.org/wiki/List_of_natural_language_processing_toolkits
To detect a language:
http://www.jroller.com/melix/entry/nlp_in_java_a_language
There is a website for etymology: http://www.etymonline.com/
It gives the result:
recherche
1722, from Fr. recherché "carefully sought out," pp. of rechercher "to seek out." Commonly used 19c. of food, styles, etc., to denote obscure excellence.
I don't know if they have an API, but you could use some sort of script to query it.
Then find a good way of detecting "Fr." in the sentence above.
Cheers,
Erik
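One possible way of detecting "Fr." in such an entry is a regular expression that looks for "from" followed by a capitalised abbreviation. The sketch below is only an illustration: the abbreviation table is hypothetical apart from "Fr." (which appears in the entry quoted above), and how the entry text is obtained is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EtymologyOrigin {
    // Hypothetical abbreviation table; extend it with whatever the source text uses.
    private static final Map<String, String> ABBREVIATIONS = new LinkedHashMap<>();
    static {
        ABBREVIATIONS.put("Fr.", "French");
        ABBREVIATIONS.put("L.", "Latin");
        ABBREVIATIONS.put("Gk.", "Greek");
        ABBREVIATIONS.put("Ger.", "German");
    }

    // Matches "from <abbreviation>", where the abbreviation is capitalised and ends in a period.
    private static final Pattern FROM = Pattern.compile("\\bfrom\\s+([A-Z][A-Za-z]*\\.)");

    /** Returns the language of the first recognised "from X." abbreviation, if any. */
    static Optional<String> firstOrigin(String etymology) {
        Matcher m = FROM.matcher(etymology);
        while (m.find()) {
            String language = ABBREVIATIONS.get(m.group(1));
            if (language != null) {
                return Optional.of(language);
            }
        }
        return Optional.empty();
    }
}
```

Calling `EtymologyOrigin.firstOrigin(...)` on the entry text quoted above should yield "French"; entries that spell the language out in full would need additional patterns.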
Have you looked at JWKTL?
"Wiktionary is a multilingual, web-based, freely available dictionary,
thesaurus and phrase book, designed as the lexical companion to
Wikipedia. Lately, it has been recognized as a promising lexical
semantic resource for natural language processing applications."
Using this, you can see the etymology of words.

How to develop Nutch for better Arabic searching technology?

I am a Computer Science student working on a project based on the Nutch search engine. I want to develop Java algorithms to better index and search Arabic websites. How can I optimize for this purpose? Any ideas?
The Arabic alphabet has 28 letters (29 if hamza is counted separately), and some of them have variant forms, such as alif (أ).
If you can make the search tolerant of these variants, i.e. allow spelling differences in these characters,
then e.g. أحمد and احمد and إحمد and آحمد can be treated as close results, even though they are encoded with different Unicode code points.
Moreover, you could derive roots from words to allow searching across singulars, plurals, verbs, nouns, etc.
So if someone typed قال (said), you could include in the search terms the words قول (saying), يقول (to say), مقال (a saying), etc.
This will require a complicated engine.
Finally, consider tashkeel (vowel diacritics), which are optional in typing: a query that includes them could be treated as a more specific search, while a query without them should still match.
e.g. رجل could match رَجُلٌ (meaning a man) or رَجَلَ (meaning walked on foot) or رِجْل (leg)
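The alif-variant and tashkeel tolerance described above can be approximated by normalizing text the same way at index time and at query time. The following is a minimal sketch, not a Nutch plugin: the class name is made up, and it only folds the three alif variants and strips the diacritics in the range U+064B to U+0652. Lucene's Arabic analysis classes offer comparable normalization and may be a better starting point inside Nutch.

```java
final class ArabicNormalizer {

    /** Folds alif variants to bare alif and strips tashkeel so spelling variants compare equal. */
    static String normalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '\u064B' && c <= '\u0652') {
                continue; // tashkeel: fathatan (U+064B) through sukun (U+0652)
            }
            switch (c) {
                case '\u0622': // alif with madda above
                case '\u0623': // alif with hamza above
                case '\u0625': // alif with hamza below
                    out.append('\u0627'); // bare alif
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }
}
```

Keeping the original, un-normalized form in a second field would still allow a query typed with tashkeel to be ranked as the more specific match.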
I hope this helps.

How do I identify language of a text document in Java?

Is there an existing Java library that could tell me whether a String contains English language text or not (e.g. I need to be able to distinguish French or Italian text -- the function needs to return false for French and Italian, and true for English)?
There are various techniques, and a robust method would combine several of them:
look at the frequencies of groups of n letters (say, groups of 3 letters or trigrams) in your text and see if they are similar to the frequencies found for the language you are testing against
look at whether the instances of frequent words in the given language match the frequencies found in your text (this tends to work better for longer texts)
does the text contain characters which strongly narrow it down to a particular language? (e.g. if the text contains an upside down question mark there's a good chance it's Spanish)
can you "loosely parse" certain features in the text that would indicate a particular language, e.g. if it contains a match to the following regular expression, you could take this as a strong clue that the language is French:
\bvous\s+\p{L}+ez\b
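In Java, the regular expression above could be applied like this. It is a small sketch: the class and method names are made up, and case-insensitive matching is an added assumption so that a sentence-initial "Vous" also counts.

```java
import java.util.regex.Pattern;

final class FrenchClue {
    // "vous" followed by a word ending in "ez", e.g. "vous parlez".
    private static final Pattern VOUS_EZ = Pattern.compile(
            "\\bvous\\s+\\p{L}+ez\\b",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    /** Returns true if the text contains the "vous ...ez" construction. */
    static boolean looksFrench(String text) {
        return VOUS_EZ.matcher(text).find();
    }
}
```

A match is a strong hint rather than proof, so it is best combined with the frequency-based checks.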
To get you started, here are frequent trigram and word counts for English, French and Italian (copied and pasted from some code-- I'll leave it as an exercise to parse them):
Locale.ENGLISH,
"he_=38426;the=38122;nd_=20901;ed_=20519;and=18417;ing=16248;to_=15295;ng_=15281;er_=15192;at_=14219",
"the=11209;and=6631;to=5763;of=5561;a=5487;in=3421;was=3214;his=2313;that=2311;he=2115",
Locale.FRENCH,
"es_=38676;de_=28820;ent=21451;nt_=21072;e_d=18764;le_=17051;ion=15803;s_d=15491;e_l=14888;la_=14260",
"de=10726;la=5581;le=3954;" + ((char)224) + "=3930;et=3563;des=3295;les=3277;du=2667;en=2505;un=1588",
Locale.ITALIAN,
"re_=7275;la_=7251;to_=7208;_di=7170;_e_=7031;_co=5919;che=5876;he_=5622;no_=5546;di_=5460",
"di=7014;e=4045;il=3313;che=3006;la=2943;a=2541;in=2434;per=2165;del=2013;un=1945",
(Trigram counts are per million characters; word counts are per million words. The '_' character represents a word boundary.)
As I recall, the figures are cited in the Oxford Handbook of Computational Linguistics and are based on a sample of newspaper articles. If you have a corpus of text in these languages, it's easy enough to derive similar figures yourself.
If you want a really quick-and-dirty way of applying the above, try:
consider each sequence of three characters in your text (replacing word boundaries with '_')
for each trigram that matches one of the frequent ones for the given language, increment that language's "score" by 1 (more sophisticatedly, you could weight according to the position in the list)
at the end, assume the language is that with the highest score
optionally, do the same for the common words (combine scores)
Obviously, this can then be refined, but you might find that this simple solution is good enough for what you want, since you're essentially interested in "English or not".
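A rough Java sketch of the quick-and-dirty trigram scoring described above follows. It reuses the trigram strings listed earlier and ignores the counts, adding 1 per matching trigram. The class name, the normalization details (lower-casing, collapsing every run of non-letters into one '_') and the choice to return an empty result when nothing matches are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

final class TrigramLanguageGuesser {
    private static final Map<Locale, Set<String>> FREQUENT = new LinkedHashMap<>();
    static {
        FREQUENT.put(Locale.ENGLISH, keys(
            "he_=38426;the=38122;nd_=20901;ed_=20519;and=18417;ing=16248;to_=15295;ng_=15281;er_=15192;at_=14219"));
        FREQUENT.put(Locale.FRENCH, keys(
            "es_=38676;de_=28820;ent=21451;nt_=21072;e_d=18764;le_=17051;ion=15803;s_d=15491;e_l=14888;la_=14260"));
        FREQUENT.put(Locale.ITALIAN, keys(
            "re_=7275;la_=7251;to_=7208;_di=7170;_e_=7031;_co=5919;che=5876;he_=5622;no_=5546;di_=5460"));
    }

    /** Extracts the trigrams from a "key=count;key=count" string, ignoring the counts. */
    private static Set<String> keys(String table) {
        Set<String> result = new HashSet<>();
        for (String entry : table.split(";")) {
            result.add(entry.substring(0, entry.indexOf('=')));
        }
        return result;
    }

    /** Lower-cases the text and replaces every run of non-letters with a single '_'. */
    static String normalize(String text) {
        String s = "_" + text.toLowerCase(Locale.ROOT) + "_";
        return s.replaceAll("[^\\p{L}]+", "_");
    }

    /** Counts, per language, how many trigrams of the text are among its frequent ones. */
    static Map<Locale, Integer> scores(String text) {
        String s = normalize(text);
        Map<Locale, Integer> scores = new LinkedHashMap<>();
        for (Locale locale : FREQUENT.keySet()) {
            scores.put(locale, 0);
        }
        for (int i = 0; i + 3 <= s.length(); i++) {
            String trigram = s.substring(i, i + 3);
            for (Map.Entry<Locale, Set<String>> e : FREQUENT.entrySet()) {
                if (e.getValue().contains(trigram)) {
                    scores.merge(e.getKey(), 1, Integer::sum);
                }
            }
        }
        return scores;
    }

    /** Returns the highest-scoring language, or empty if no frequent trigram was seen. */
    static Optional<Locale> guess(String text) {
        Locale best = null;
        int bestScore = 0;
        for (Map.Entry<Locale, Integer> e : scores(text).entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return Optional.ofNullable(best);
    }

    static boolean isEnglish(String text) {
        return guess(text).filter(Locale.ENGLISH::equals).isPresent();
    }

    public static void main(String[] args) {
        System.out.println(guess("The cat and the dog went to the market"));
    }
}
```

With only ten trigrams per language, very short inputs will often tie or score zero, so this is best treated as a baseline for longer texts.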
Have you tried Apache Tika? It has a good API for language detection, and it can support additional languages by loading the respective profiles.
You could try comparing each word to an English, French, or Italian dictionary. Keep in mind though some words may appear in multiple dictionaries.
Here's an interesting blog post that discusses this concept. The examples are in Scala, but you should be able to apply the same general concepts to Java.
If you are looking at individual characters or words, this is a tough problem. Since you're working with a whole document, however, there might be some hope. Unfortunately, I don't know of an existing library to do this.
In general, one would need a fairly comprehensive word list for each language. Then examine each word in the document. If it appears in the dictionary for a language, give that language a "vote". Some words will appear in more than one language, and sometimes a document in one language will use loanwords from another language, but a document wouldn't have to be very long before you saw a very clear trend toward one language.
Some of the best word lists for English are those used by Scrabble players. These lists probably exist for other languages too. The raw lists can be hard to find via Google, but they are out there.
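The voting idea can be sketched in a few lines of Java. The sketch assumes the word lists have already been loaded into lower-case sets keyed by language name; the class name and that representation are illustrative choices, not an existing library.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class WordListVoter {
    private final Map<String, Set<String>> wordLists; // language name -> lower-case words

    WordListVoter(Map<String, Set<String>> wordLists) {
        this.wordLists = wordLists;
    }

    /** Counts, per language, how many words of the text appear in that language's list. */
    Map<String, Integer> votes(String text) {
        Map<String, Integer> votes = new LinkedHashMap<>();
        for (String language : wordLists.keySet()) {
            votes.put(language, 0);
        }
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (word.isEmpty()) {
                continue;
            }
            for (Map.Entry<String, Set<String>> e : wordLists.entrySet()) {
                if (e.getValue().contains(word)) {
                    votes.merge(e.getKey(), 1, Integer::sum);
                }
            }
        }
        return votes;
    }
}
```

A word shared between languages simply votes for each of them, which matches the observation that the trend only becomes clear over a reasonably long document.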
There's no "good" way of doing this, IMO. All answers on this topic can get very complicated. The obvious part is to check for characters that occur in French and Italian but not in English, and return false if any are found.
However, what if the text is French but has no special characters? Suppose you have a whole sentence. You could look up each word in a dictionary for each language, and if the sentence scores more French points than English points, it's not English. This handles the common words that French, Italian and English share.
Good luck.
