I am a Computer Science student working on a project based on the Nutch search engine. I want to develop Java algorithms to index and search Arabic websites more effectively. How can I optimize for this purpose? Any ideas?
The Arabic alphabet has 29 letters, and some of these letters have variant forms, like the Alif (أ), which can appear in several shapes.
If you can make the engine tolerant of these variant forms, i.e. allow spelling mistakes on these characters, you can treat spellings such as أحمد, احمد, إحمد and آحمد as close results, even though they have different UTF-8 values.
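As a minimal sketch (plain Java, not Nutch-specific), you can normalize the Alif variants to the bare Alif before indexing and before searching, so those spellings compare equal:

```java
public class AlifNormalizer {
    /** Replaces the hamza-carrying Alif variants with the bare Alif (ا). */
    public static String normalize(String input) {
        return input
                .replace('\u0623', '\u0627')  // أ (Alif with hamza above) -> ا
                .replace('\u0625', '\u0627')  // إ (Alif with hamza below) -> ا
                .replace('\u0622', '\u0627'); // آ (Alif with madda) -> ا
    }
}
```

Running the same normalization over both the indexed text and the query is what makes the engine "sub-alphabet tolerant": all four spellings of أحمد collapse to the same indexed term.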
Moreover, if you can derive roots from words, you can allow searching across singulars, plurals, verbs, nouns, etc.
So if someone types قال (said), you can include in the searched terms the words قول (saying), يقول (to say), مقال (a saying), and so on.
It will require a fairly sophisticated engine to do this.
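A full morphological analyzer is well beyond a snippet, but the idea can be sketched with a hand-built map from a term to its derived forms. The map below is hypothetical and purely illustrative; a real system would replace it with an Arabic stemmer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class QueryExpander {
    // Hypothetical hand-built table of derived forms; in practice this
    // would come from a morphological analyzer, not a hard-coded map.
    private static final Map<String, List<String>> RELATED = Map.of(
            "قال", List.of("قول", "يقول", "مقال"));

    /** Returns the term itself plus any known derived forms to search for. */
    public static List<String> expand(String term) {
        List<String> out = new ArrayList<>();
        out.add(term);
        out.addAll(RELATED.getOrDefault(term, List.of()));
        return out;
    }
}
```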
Finally, consider tashkeel (the optional vowel diacritics): when they are typed, you could treat them as a more specific search, but you should also allow ignoring them.
E.g. رجل could match رَجُلٌ (a man), رَجَلَ (walked on foot), or رِجْل (leg).
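The tashkeel marks sit in their own Unicode range (U+064B through U+0652), so ignoring them is just a matter of stripping those code points before comparison. A minimal sketch:

```java
public class TashkeelStripper {
    // Arabic diacritic marks (fathatan through sukun): U+064B..U+0652.
    private static final String TASHKEEL = "[\\u064B-\\u0652]";

    /** Removes the optional vowel marks so رَجُلٌ and رجل compare equal. */
    public static String strip(String input) {
        return input.replaceAll(TASHKEEL, "");
    }
}
```

To support the "more specific search" behavior, you would index the stripped form but keep the original: a query with tashkeel can then be matched exactly, while a bare query matches the stripped form.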
I hope this helps.
I am making an internationalized app in Java. I need a list of all the letters in a language, starting from the Locale. There are some questions like Alphabet constant in Java? or Create Alphabet List from list : Java which touch on the issue, but I'm wondering: is there a Utils class or something where it's already defined, from which I can get a list of chars or a String containing all the letters in the alphabet of a language by its Locale?
You can refer to this library and its methods: com.ibm.icu.util.LocaleData. Pass Locale.ENGLISH as the argument to get the letters of English.
There are several issues here.
First, I have to point out that there are many languages that aren't alphabetic. Chinese and Japanese are obvious examples of ideographic languages. Unfortunately, it will be very hard, next to impossible, to create a list of all the characters in these languages.
Second, although the Common Locale Data Repository, and as a consequence ICU, have predefined sets of index exemplars and example characters, this information is far from complete.
Third, there are languages that use more than one script (i.e. writing system). Depending on the source of your locale, you may or may not know which characters need to be displayed.
Finally, it is hard to give you the right answer when you haven't provided your use case. The design of your application may impose serious limitations on usability or localizability...
I've seen people ask similar questions, but without any good answer. I now have the same question; can anyone help?
See below:
Input: a list of words
Output: identify nouns in their plural forms, convert them into their singular forms if possible
WordNet will be able to help with stripping plurals. It is a full morphological dictionary of the English language.
http://wordnet.princeton.edu/
JAWS is a simple Java API which talks to WordNet, though others exist.
http://lyle.smu.edu/~tspell/jaws/index.html
Note that WordNet will not perfectly deal with the various idiosyncrasies of English; from their FAQ:
Along with a set of irregular forms (e.g. children - child), it uses a sequence of simple rules, stripping common English endings until it finds a word form present in WordNet. Furthermore, it assumes its input is a valid inflected form. So, it will take "childes" to "child", even though "childes" is not a word.
I need to iterate through all the words of an English dictionary and filter certain ones based on whether they are nouns, verbs, or anything else, and on certain other traits. Is there anything I could use as a source for these words?
Just wanted to mention, with regard to WordNet, that there are "stop words" which are not included. Some people online have made lists of stop words, but I'm not sure how complete they are.
Some stop words are: 'the', 'that', 'I', 'to', 'from', 'whose'.
A larger list is here:
http://www.d.umn.edu/~tpederse/Group01/WordNet/wordnet-stoplist.html
For a list of words see this sourceforge project:
http://wordlist.sourceforge.net/
You may also want to search for the usecases of such a list, in order to find a suitable data source.
For instance:
Spell checking algorithms use a word list (stand alone spell checkers, word processing apps like OpenOffice, etc).
Word game algorithms use words (Scrabble type games, vocabulary education games, crossword puzzle generators)
Password cracking algorithms use words to help find weak passwords.
outpost9.com/files/WordLists.html
Also, there are several Java APIs to choose from, and only some work with the latest dictionary (3.1). The one by MIT uses Java 5 and works with WordNet 3.1.
I recommend WordNet from princeton.edu. It is a popular English lexical database with word attributes such as:
Short definition
Part of speech, e.g. noun, verb, adjective, etc.
Synonyms and groupings
There is a WordNet Java API from smu.edu that will simplify using WordNet in your application. You might also download the database and parse it yourself, as it's only 12 MB compressed.
From Personal names in a global application: What to store and How can I validate a name, middle name, and last name using regex in Java?
I have read that you can't really validate names because of international possibilities: long names, multiple names, unusual names. The general verdict is to avoid it and play it safe instead, which means allowing all possible characters and combinations and just printing them as HTML-safe markup.
But what about special characters, the Shift + "one to nine" series and others? Should I just allow them to be placed in the database and "play safe", or should I screen them out?
I would also want users of my program to input names responsibly (though I can't guarantee that), but at least at some point there should be enforced rules, without totally locking out people who legitimately have a reason to use $ or # in their names.
I'm on PHP and JS, but the same goes for all languages that use input validation.
EDIT:
I do have to note that it does not really mean just Shift 1-9; that's just what I call them. It also includes special characters outside the 1-9. Sorry for the confusion.
Here's the thing: my application is like a library application. A book has a title, an author, and a year. While the title and year may go in one table, I want the authors listed in another table. These inputs come from the users. Now I'm going to implement autocomplete for the authors, but the data for the autocomplete is based on the users' input, so the reliability of the autocomplete data depends on the author names the users enter.
Just like Facebook: how do they implement this? I haven't seen any friend using special characters, unlike in the Friendster days, when every time I searched, people with numeric or special-character names came up first. Not really great for autocomplete.
Shift + "one to nine" doesn’t really specify a set of characters, as it depends on the keyboard what such combinations produce. If you mean the characters in Shift positions of keys 0 to 9 in standard US keyboards, then I have to admit that I have never seen a person’s real name (as opposite to nicknames) with such characters. But I would not bet on their absolute absence from names. Yesterday, I learned that some orthography of the Venetian language uses “£” (pound sign) as a letter. Moreover, people might use easily available characters as replacements of characters they cannot easily produce on a keyboard, e.g. using “!” instead of “ǃ” (U+01C3 Latin letter retroflex click) or “e^” instead of “ê”.
The question is what you expect to gain by excluding some characters. To catch typos?
Is there an existing Java library that could tell me whether a String contains English language text or not (e.g. I need to be able to distinguish French or Italian text -- the function needs to return false for French and Italian, and true for English)?
There are various techniques, and a robust method would combine various ones:
look at the frequencies of groups of n letters (say, groups of 3 letters or trigrams) in your text and see if they are similar to the frequencies found for the language you are testing against
look at whether the instances of frequent words in the given language match the frequencies found in your text (this tends to work better for longer texts)
does the text contain characters which strongly narrow it down to a particular language? (e.g. if the text contains an upside down question mark there's a good chance it's Spanish)
can you "loosely parse" certain features in the text that would indicate a particular language, e.g. if it contains a match to the following regular expression, you could take this as a strong clue that the language is French:
\bvous\s+\p{L}+ez\b
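As a quick sketch, that clue can be checked directly with Java's regex engine (\p{L} matches any Unicode letter in Java regexes):

```java
import java.util.regex.Pattern;

public class FrenchClue {
    // "vous" followed by a word ending in -ez (e.g. "vous parlez")
    // is a strong signal that the text is French.
    private static final Pattern VOUS_EZ =
            Pattern.compile("\\bvous\\s+\\p{L}+ez\\b");

    public static boolean looksFrench(String text) {
        return VOUS_EZ.matcher(text).find();
    }
}
```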
To get you started, here are frequent trigram and word counts for English, French and Italian (copied and pasted from some code-- I'll leave it as an exercise to parse them):
Locale.ENGLISH,
"he_=38426;the=38122;nd_=20901;ed_=20519;and=18417;ing=16248;to_=15295;ng_=15281;er_=15192;at_=14219",
"the=11209;and=6631;to=5763;of=5561;a=5487;in=3421;was=3214;his=2313;that=2311;he=2115",
Locale.FRENCH,
"es_=38676;de_=28820;ent=21451;nt_=21072;e_d=18764;le_=17051;ion=15803;s_d=15491;e_l=14888;la_=14260",
"de=10726;la=5581;le=3954;" + ((char)224) + "=3930;et=3563;des=3295;les=3277;du=2667;en=2505;un=1588",
Locale.ITALIAN,
"re_=7275;la_=7251;to_=7208;_di=7170;_e_=7031;_co=5919;che=5876;he_=5622;no_=5546;di_=5460",
"di=7014;e=4045;il=3313;che=3006;la=2943;a=2541;in=2434;per=2165;del=2013;un=1945",
(Trigram counts are per million characters; word counts are per million words. The '_' character represents a word boundary.)
As I recall, the figures are cited in the Oxford Handbook of Computational Linguistics and are based on a sample of newspaper articles. If you have a corpus of text in these languages, it's easy enough to derive similar figures yourself.
If you want a really quick-and-dirty way of applying the above, try:
consider each sequence of three characters in your text (replacing word boundaries with '_')
for each trigram that matches one of the frequent ones for the given language, increment that language's "score" by 1 (more sophisticatedly, you could weight according to the position in the list)
at the end, assume the language is that with the highest score
optionally, do the same for the common words (combine scores)
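The quick-and-dirty steps above can be sketched like this, reusing the data format shown earlier (a semicolon-separated list of trigram=count pairs, with '_' for word boundaries). This is the unweighted variant: each matching trigram scores one point.

```java
import java.util.ArrayList;
import java.util.List;

public class TrigramScorer {
    /** Parses "abc=123;def=456" into an ordered list of trigrams. */
    static List<String> parseTrigrams(String data) {
        List<String> trigrams = new ArrayList<>();
        for (String entry : data.split(";"))
            trigrams.add(entry.split("=")[0]);
        return trigrams;
    }

    /** Counts how many trigrams of the text occur in a language's frequent list. */
    static int score(String text, List<String> frequent) {
        // Replace word boundaries with '_', as in the frequency data.
        String t = "_" + text.toLowerCase().replaceAll("\\s+", "_") + "_";
        int score = 0;
        for (int i = 0; i + 3 <= t.length(); i++)
            if (frequent.contains(t.substring(i, i + 3)))
                score++;
        return score;
    }
}
```

Scoring a text against each language's list and picking the highest total implements the "assume the language is that with the highest score" step; weighting by list position and adding the word counts are straightforward refinements.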
Obviously, this can then be refined, but you might find that this simple solution is good enough for what you want, since you're essentially interested in "English or not".
Have you tried Apache Tika? It has a good API to detect language, and it can also support different languages by loading the respective profiles.
You could try comparing each word to an English, French, or Italian dictionary. Keep in mind, though, that some words may appear in multiple dictionaries.
Here's an interesting blog post that discusses this concept. The examples are in Scala, but you should be able to apply the same general concepts to Java.
If you are looking at individual characters or words, this is a tough problem. Since you're working with a whole document, however, there might be some hope. Unfortunately, I don't know of an existing library to do this.
In general, one would need a fairly comprehensive word list for each language. Then examine each word in the document. If it appears in the dictionary for a language, give that language a "vote". Some words will appear in more than one language, and sometimes a document in one language will use loanwords from another language, but a document wouldn't have to be very long before you saw a very clear trend toward one language.
Some of the best word lists for English are those used by Scrabble players. These lists probably exist for other languages too. The raw lists can be hard to find via Google, but they are out there.
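The voting scheme above can be sketched like this. The word sets passed in are assumed to come from whatever word lists you source; note that the default `\W` boundary in Java regexes is ASCII-only, so for accented text you would want `Pattern.UNICODE_CHARACTER_CLASS`:

```java
import java.util.Map;
import java.util.Set;

public class LanguageVoter {
    /**
     * Gives each language one vote per word of the text found in its
     * word list, and returns the language with the most votes.
     */
    static String guess(String text, Map<String, Set<String>> dictionaries) {
        String best = null;
        int bestVotes = -1;
        for (Map.Entry<String, Set<String>> e : dictionaries.entrySet()) {
            int votes = 0;
            for (String word : text.toLowerCase().split("\\W+"))
                if (e.getValue().contains(word))
                    votes++;
            if (votes > bestVotes) {
                bestVotes = votes;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

As the answer notes, loanwords and shared words blur individual votes, but over a document of any length the totals trend clearly toward one language.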
There's no "good" way of doing this, IMO. Any answer on this topic can get very complicated. The obvious part is to check for characters that appear in French or Italian but not in English, and return false for those.
However, what if the word is French but has no special characters? Suppose you have a whole sentence: you could match each word against dictionaries, and if the sentence scores more French points than English points, it's not English. This handles the common words that French, Italian, and English share.
Good Luck.