Java API for plural forms of English words


Is there a Java API that provides the plural form of English words (e.g. cacti for cactus)?

Check Evo Inflector, which implements an English pluralization algorithm based on Damian Conway's paper "An Algorithmic Approach to English Pluralization".
The library is tested against data from Wiktionary and reports a 100% success rate for the 1000 most used English words and a 70% success rate for all the words listed in Wiktionary.
If you want even more accuracy you can take a Wiktionary dump and parse it to create a database of singular-to-plural mappings. Bear in mind that, due to the open nature of Wiktionary, some of its data might be incorrect.
Example Usage:
English.plural("Facility", 1); // == "Facility"
English.plural("Facility", 2); // == "Facilities"

jibx-tools provides a convenient pluralizer/depluralizer.
Groovy test:
NameConverter nameTools = new DefaultNameConverter();
assert nameTools.depluralize("apples") == "apple"
assert nameTools.pluralize("apple") == "apples"

I know there is a simple pluralize() function in Ruby on Rails; maybe you could get at it through JRuby. The problem really isn't easy: I have seen pages of pluralization rules, and even those were not complete. Some rules are not algorithmic. They depend on the origin of the stem and similar information that isn't easily obtained. So you have to decide how perfect you want to be.
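To make the trade-off concrete, here is a minimal sketch of the rule-based approach: a small exception table for irregular nouns, followed by a couple of suffix rules. The class name SimplePluralizer and its word list are hypothetical and are not taken from Evo Inflector, jibx-tools, or Rails; the sketch assumes Java 9+ (for Map.of) and returns lower-case results. It covers only a handful of patterns and will get many real words wrong.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not part of any library named above.
final class SimplePluralizer {
    // Irregular forms cannot be derived from spelling, so they are looked up first.
    private static final Map<String, String> IRREGULAR = Map.of(
            "cactus", "cacti",
            "goose", "geese",
            "ox", "oxen",
            "deer", "deer",
            "sheep", "sheep");

    static String plural(String word) {
        String lower = word.toLowerCase();
        String irregular = IRREGULAR.get(lower);
        if (irregular != null) {
            return irregular;
        }
        int n = lower.length();
        // Consonant + y becomes -ies (facility -> facilities); vowel + y just takes -s (day -> days).
        if (n > 1 && lower.endsWith("y") && "aeiou".indexOf(lower.charAt(n - 2)) < 0) {
            return lower.substring(0, n - 1) + "ies";
        }
        // Sibilant endings take -es (box -> boxes, church -> churches).
        if (lower.endsWith("s") || lower.endsWith("x") || lower.endsWith("z")
                || lower.endsWith("ch") || lower.endsWith("sh")) {
            return lower + "es";
        }
        // Default rule: append -s.
        return lower + "s";
    }
}
```

Anything that depends on word origin (cactus/cacti, index/indices) has to go into the exception table, which is why complete coverage is hard.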

For Java, have a look at ModeShape's Inflector class in the package org.modeshape.common.text. Or google for "inflector" and "randall hauch".

It's hard to find this kind of API. You may instead need to find a web service that can serve your purpose. Check this. I am not sure whether it can help you.
(I tried entering the word cacti and got cactus somewhere in the response.)

If you can harness JavaScript, I created a lightweight (7.19 KB) JavaScript library for this. Or you could port my script over to Java. It is very easy to use:
pluralizer.run('goose') --> 'geese'
pluralizer.run('deer') --> 'deer'
pluralizer.run('can') --> 'cans'
https://github.com/rhroyston/pluralizer-js
BTW: It looks like cacti to cactus is a very special conversion (most people are going to say '1 cactus' anyway). It is easy to add if you want it. The source code is easy to read and update.

Wolfram|Alpha returns a list of inflected forms for a given word.
See this as an example:
http://www.wolframalpha.com/input/?i=word+cactus+inflected+forms
And here is their API:
http://products.wolframalpha.com/api/

Related

Determine the Plurality of a Noun/Verb

I have a program that randomly generates sentences based on a set of text documents listing nouns, verbs, adjectives, and adverbs. Does anyone know a way to determine whether a noun or verb is plural or singular, or whether there are any text documents that contain a list of singular nouns/verbs and plural nouns? I'm doing this all in Java, and I have a decent idea of how to get information off a website, so if there are any websites that could do that as well, I'd also appreciate those.

I am afraid you cannot solve this with a fixed list of words, especially for verbs. Consider the sentences:
You are free. We are free.
In the first one, are is singular; in the second, it is plural. Using a proper tagger, as @jdaz suggests, is the only way to do this reliably.
If you work with English or one of a few other supported languages, StanfordNLP is an excellent choice. If you need broad language coverage, you can use UDPipe, which is written natively in C++ but has a Java binding.

The first step would be to look it up in a list. For English you can reduce the size of the list by only including singular nouns, and then apply some basic string processing to find plurals: if your word ends in -s and is not in the list, cut off the -s and look again. If it now is in the list, it was a simple plural (car/cars). If not, continue. If it ends in -ies, remove that, append -y and look again. Now you will capture remedies/remedy. There are a number of such patterns you can use.
Some irregular nouns need to be in an exception list (ox/oxen), but there aren't that many. Some words of course are unspecified, like sheep, data, or police. Here you need to look at the context: if the noun is followed by a singular verb (e.g. eats, or is), then it would be singular as well.
With (English) verbs you can generally only identify the third person singular (with a similar procedure to the one used for nouns; you'd need a list of exceptions for verbs ending in -s, such as kiss). Forms of to be are more helpful, but the second person singular is an issue (are). However, unless you have direct speech in your texts, it will not be used very frequently.
Part of speech taggers can also only make these decisions on context, so I don't think they will be much of a help here. It's likely to be overkill. A couple of word lists and simple heuristic rules will probably give you equal or better accuracy using far fewer resources. This is the way these things were done before large amounts of annotated data were available.
In the end it depends on your circumstances. It might be quicker to simply use an existing tagger, but for this limited problem you might get better accuracy and speed with the rule-based approach (or even a combined one for accuracy).
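The list-plus-suffix procedure for nouns described in this answer could be sketched roughly as follows. The class NounNumber and its tiny word sets are hypothetical placeholders; a real program would load the lists from files, and the -es rule is an extra pattern assumed here for words like kisses. Context checks (such as looking at the following verb) are not covered. The sketch assumes Java 9+ for Set.of.

```java
import java.util.Set;

// Hypothetical sketch of the lookup-and-strip heuristic; the word lists are illustrative only.
final class NounNumber {
    enum GrammaticalNumber { SINGULAR, PLURAL, UNSPECIFIED, UNKNOWN }

    private static final Set<String> SINGULAR_NOUNS = Set.of("car", "remedy", "ox", "kiss", "box");
    private static final Set<String> IRREGULAR_PLURALS = Set.of("oxen", "geese", "children");
    private static final Set<String> UNSPECIFIED = Set.of("sheep", "data", "police");

    static GrammaticalNumber classify(String noun) {
        String w = noun.toLowerCase();
        // Words whose number can only be decided from context.
        if (UNSPECIFIED.contains(w)) return GrammaticalNumber.UNSPECIFIED;
        // Exception list for irregular plurals (ox/oxen).
        if (IRREGULAR_PLURALS.contains(w)) return GrammaticalNumber.PLURAL;
        // Known singular, including singular nouns that happen to end in -s (kiss).
        if (SINGULAR_NOUNS.contains(w)) return GrammaticalNumber.SINGULAR;
        // remedies -> remedy
        if (w.endsWith("ies") && SINGULAR_NOUNS.contains(w.substring(0, w.length() - 3) + "y")) {
            return GrammaticalNumber.PLURAL;
        }
        // kisses -> kiss, boxes -> box
        if (w.endsWith("es") && SINGULAR_NOUNS.contains(w.substring(0, w.length() - 2))) {
            return GrammaticalNumber.PLURAL;
        }
        // cars -> car
        if (w.endsWith("s") && SINGULAR_NOUNS.contains(w.substring(0, w.length() - 1))) {
            return GrammaticalNumber.PLURAL;
        }
        return GrammaticalNumber.UNKNOWN;
    }
}
```

Words that return UNSPECIFIED or UNKNOWN are the ones where context, or a tagger, would have to take over.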

Optimize a Regex

I'm using the following code to discard unsupported physical interfaces / subinterfaces from routers that connect to a big ISP network (by big I mean tens of thousands of routers):
private final static Pattern INTERFACES_TO_FILTER =
Pattern.compile("unrouted VLAN|GigabitEthernet.+-mpls layer|FastEthernet.+-802\\.1Q vLAN subif");
// Simplification
List<String> interfaces;
// lots of irrelevant code to query the routers
for (String intf : interfaces) {
if (INTERFACES_TO_FILTER.matcher(intf).find()) {
// code to prevent the interface from being used
}
}
The idea is discarding entries such as:
unrouted VLAN 2000 for GigabitEthernet2/11.2000
GigabitEthernet1/2-mpls layer
FastEthernet6/0/3.2000-802.1Q vLAN subif
This code is hit often enough (several times per minute) over huge sets of interfaces (some routers have 50k+ subinterfaces). Caching doesn't really help much either, because new subinterfaces are being configured and discarded very often. The plan is to optimize the regex so that the procedure completes a tad faster (every nanosecond counts). Can you guys enlighten me?
Note: mpls layer and 802.1Q are supported for other kinds of interfaces; unrouted VLANs aren't.

There are some string search algorithms that allow you to try to search in a string of length n for k strings at once cheaper than the obvious O(n*k) cost.
They usually compare a rolling hash against a list of existing hashes of your words. A prime example of this would be the Rabin-Karp algorithm. The wiki page even has a section about this. There are more advanced versions of the principle out there as well, but it's easy to understand the principle.
I have no idea whether there are already Java libraries that do this (I'd think so), but that's what I'd try, although 5 strings is rather few here (and the different lengths make it more complex too). So it is better to check whether a good KMP string search isn't faster. I'd think that would be by far the best solution, really (the default Java API uses a naive string search, so use a library).
About your regexes: backtracking regex implementation for performance critical search code? I doubt that's a good idea.
PS: If you'd post a testset and a test harness for your problem, chances are good people would see how much they could beat the favorite - has worked before.. human nature is so easy to trick :)

I'm answering my own question for future reference, although the credit goes to @piotrekkr, since he was the one who pointed the way. Kudos also to @JB and @ratchet. I ended up using matches(), and the logic using indexOf and several contains calls was almost as fast (that's news to me; I always assumed that a single regex would be faster than several calls to contains).
Here's a solution that is several times faster (according to the profiler, about 7 times less time is spent in Matcher class methods):
^(?:unrouted VLAN.++|GigabitEthernet.+?-mpls layer|FastEthernet.+?-802\\.1Q vLAN subif)$
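For comparison, the plain-string variant mentioned in this answer might look roughly like the sketch below. The class and method names are hypothetical, and it uses startsWith/endsWith rather than the indexOf/contains calls the answer refers to. It is meant to mirror the anchored regex above when used with matches(), assuming interface names contain no line terminators. No timing claim is made here; it would need to be measured against the regex on real data.

```java
// Hypothetical regex-free filter mirroring the anchored pattern above under matches().
final class InterfaceFilter {
    private static final String UNROUTED = "unrouted VLAN";

    static boolean isUnsupported(String intf) {
        // "unrouted VLAN.++" needs the prefix plus at least one more character.
        if (intf.startsWith(UNROUTED) && intf.length() > UNROUTED.length()) {
            return true;
        }
        return hasMiddle(intf, "GigabitEthernet", "-mpls layer")
                || hasMiddle(intf, "FastEthernet", "-802.1Q vLAN subif");
    }

    // True when s is prefix + at least one character + suffix, like prefix.+?suffix with matches().
    private static boolean hasMiddle(String s, String prefix, String suffix) {
        return s.length() > prefix.length() + suffix.length()
                && s.startsWith(prefix)
                && s.endsWith(suffix);
    }
}
```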

If your problem is that you have a number of long string constants you're searching for, I would recommend using a Java analog of the standard C tool "lex".
A quick search took me to JFlex. I haven't used this particular tool and there may be others available, but it is an example of the kind of tool I would look for.

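If you must use a regex for this, try changing to this one:
^(?:unrouted VLAN)|(?:GigabitEthernet.+?-mpls layer)|(?:FastEthernet.+?-802\.1Q vLAN subif)
^ makes the engine match from the beginning of the string, not anywhere in the string (as written, it anchors only the first alternative; wrap the whole alternation in a single (?:...) group to anchor all three)
.+? makes + ungreedy
(?:...) makes () a non-capturing group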
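A small sketch of how this suggestion could be compiled in Java so that the ^ anchor covers every alternative. The class name AnchoredFilter is hypothetical, and the pattern is a variation on the one above (one outer non-capturing group instead of three separate ones), not the answerer's exact regex.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper; the single outer (?:...) group makes ^ apply to all three alternatives.
final class AnchoredFilter {
    static final Pattern ANCHORED = Pattern.compile(
            "^(?:unrouted VLAN|GigabitEthernet.+?-mpls layer|FastEthernet.+?-802\\.1Q vLAN subif)");

    static boolean isUnsupported(String intf) {
        // find() is enough here because the pattern itself is anchored at the start.
        return ANCHORED.matcher(intf).find();
    }
}
```

Compiling the pattern once into a static final field, as the question already does, avoids recompiling it on every call.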

searching list of tens or few hundreds short text strings, sorting by relevance

I have a list of people that I'd like to search through. I need to know 'how much' each item matches the string it is being tested against.
The list is rather small, currently 100+ names, and it probably won't reach 1000 anytime soon.
Therefore I assumed it would be OK to keep the whole list in memory and do the searching using something Java offers out-of-the-box or using some tiny library that just implements one or two testing algorithms. (In other words without bringing-in any complicated/overkill solution that stores indexes or relies on a database.)
What would be your choice in such case please?
EDIT: It seems that Levenshtein comes closest to what I need out of what has been advised. The only problem is that it gets easily fooled when the search query is "John" and the names in the list are significantly longer.

You should look at various string comparison algorithms and see which one suits your data best. Options include Jaro-Winkler, Smith-Waterman, etc. Look up SimMetrics, a F/OSS library that offers a very comprehensive set of string comparison algorithms.
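As a baseline to compare such algorithms against, here is a minimal sketch of Levenshtein-based scoring that sidesteps the "short query versus long name" problem raised in the question's edit by scoring the query against each whitespace-separated token of the name and keeping the best result. The class NameMatcher and the per-token idea are assumptions made for illustration; they do not come from SimMetrics or any other library.

```java
// Hypothetical in-memory scorer: edit distance per name token, normalized to [0, 1].
final class NameMatcher {
    // Classic dynamic-programming Levenshtein distance using two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Best similarity between the query and any single token of the name (1.0 = exact token match).
    static double score(String query, String name) {
        String q = query.toLowerCase();
        double best = 0.0;
        for (String token : name.toLowerCase().split("\\s+")) {
            int max = Math.max(q.length(), token.length());
            if (max == 0) {
                continue;
            }
            double similarity = 1.0 - (double) levenshtein(q, token) / max;
            best = Math.max(best, similarity);
        }
        return best;
    }
}
```

With a list of a few hundred names, scoring every entry and sorting by the result is cheap enough to do on each search.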

If you are looking for a 'how much' match, you should use Soundex. Here is a Java implementation of this algorithm.

Check out Double Metaphone, an improved soundex from 1990.
http://commons.apache.org/codec/userguide.html
http://svn.apache.org/viewvc/commons/proper/codec/trunk/src/java/org/apache/commons/codec/language/DoubleMetaphone.java?view=markup

In my opinion, the Jaro-Winkler algorithm will suit your requirement best.
Here is a short summary of the Jaro-Winkler distance algorithm
One PDF that compares different algorithms --> Link to PDF
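For readers who want to try it without pulling in a library, the following is a sketch of the Jaro-Winkler similarity as it is commonly described, using the usual scaling factor of 0.1 and a common prefix capped at 4 characters. The class name is hypothetical, and the code is case-sensitive, so names may need to be normalized first.

```java
// Hypothetical stand-alone Jaro-Winkler similarity; 1.0 means identical strings.
final class JaroWinkler {
    static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        int len1 = s1.length(), len2 = s2.length();
        if (len1 == 0 || len2 == 0) return 0.0;
        // Characters count as matching only within this distance of each other.
        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] matched1 = new boolean[len1];
        boolean[] matched2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int start = Math.max(0, i - window);
            int end = Math.min(i + window + 1, len2);
            for (int j = start; j < end; j++) {
                if (matched2[j] || s1.charAt(i) != s2.charAt(j)) continue;
                matched1[i] = true;
                matched2[j] = true;
                matches++;
                break;
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order in the two strings.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < len1; i++) {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        double transpositions = outOfOrder / 2.0;
        return (m / len1 + m / len2 + (m - transpositions) / m) / 3.0;
    }

    static double similarity(String s1, String s2) {
        double j = jaro(s1, s2);
        // Boost the score for a shared prefix of up to 4 characters.
        int maxPrefix = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < maxPrefix && s1.charAt(prefix) == s2.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }
}
```

The prefix boost is what makes this measure a reasonable fit for names, where the first few letters tend to be typed correctly.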

implementing unification algorithm

I have spent the last 5 days working to understand how the unification algorithm works in Prolog.
Now I want to implement this algorithm in Java.
I thought the best way might be to manipulate the string and decompose it into its parts using some data structure such as a stack.
To make it clear, suppose the user input is:
a(X,c(d,X)) = a(2,c(d,Y)).
I already read it as one string and split it into two strings (Expression1 and Expression2).
Now, how can I tell whether the next character(s) form a variable, a constant, etc.?
I can do it with nested ifs, but that does not seem like a good solution to me.
I tried to use inheritance, but the problem remains: how can I know the type of the characters being read?

First you need to parse the inputs and build expression trees. Then apply Milner's unification algorithm (or some other unification algorithm) to figure out the mapping of variables to constants and expressions.
A really good description of Milner's algorithm may be found in the Dragon Book: "Compilers: Principles, Techniques and Tools" by Aho, Sethi and Ullman. (Milner's algorithm can also cope with unification of cyclic graphs, and the Dragon Book presents it as a way to do type inference.) By the sounds of it, you could benefit from learning a bit about parsing, which is also covered by the Dragon Book.
EDIT: Other answers have suggested using a parser generator; e.g. ANTLR. That's good advice, but (judging from your example) your grammar is so simple that you could also get by with using StringTokenizer and a hand-written recursive descent parser. In fact, if you've got the time (and inclination) it is worth implementing the parser both ways as a learning exercise.
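A hand-written recursive descent parser for terms like the one in the question could look roughly like this sketch. It reads characters directly instead of using StringTokenizer, and it decides what a name is by its first character (upper case or underscore for a variable, digit for a number, anything else an atom), following the usual Prolog convention. The classes TermParser and Node are hypothetical, and only the tiny grammar term ::= name | name '(' term {',' term} ')' is handled. It assumes Java 9+ for List.of.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical recursive descent parser for terms such as a(X,c(d,X)).
final class TermParser {
    enum Kind { VARIABLE, ATOM, NUMBER, COMPOUND }

    static final class Node {
        final Kind kind;
        final String name;
        final List<Node> args;

        Node(Kind kind, String name, List<Node> args) {
            this.kind = kind;
            this.name = name;
            this.args = args;
        }

        @Override
        public String toString() {
            if (kind != Kind.COMPOUND) return name;
            StringBuilder sb = new StringBuilder(name).append('(');
            for (int i = 0; i < args.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(args.get(i));
            }
            return sb.append(')').toString();
        }
    }

    private final String src;
    private int pos = 0;

    private TermParser(String src) {
        this.src = src;
    }

    static Node parse(String text) {
        TermParser p = new TermParser(text.replaceAll("\\s+", ""));
        Node node = p.term();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("Unexpected input at position " + p.pos);
        }
        return node;
    }

    // term ::= name | name '(' term {',' term} ')'
    private Node term() {
        int start = pos;
        while (pos < src.length()
                && (Character.isLetterOrDigit(src.charAt(pos)) || src.charAt(pos) == '_')) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("Expected a term at position " + pos);
        }
        String name = src.substring(start, pos);
        if (pos < src.length() && src.charAt(pos) == '(') {
            pos++; // consume '('
            List<Node> args = new ArrayList<>();
            args.add(term()); // recurse for each argument
            while (pos < src.length() && src.charAt(pos) == ',') {
                pos++;
                args.add(term());
            }
            if (pos >= src.length() || src.charAt(pos) != ')') {
                throw new IllegalArgumentException("Expected ')' at position " + pos);
            }
            pos++; // consume ')'
            return new Node(Kind.COMPOUND, name, args);
        }
        // The first character decides the kind of a simple term.
        char first = name.charAt(0);
        if (Character.isUpperCase(first) || first == '_') return new Node(Kind.VARIABLE, name, List.of());
        if (Character.isDigit(first)) return new Node(Kind.NUMBER, name, List.of());
        return new Node(Kind.ATOM, name, List.of());
    }
}
```

Calling TermParser.parse on each side of the = sign yields the two expression trees that a unification routine can then walk.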

It sounds like this problem is more to do with parsing than unification specifically. Using something like ANTLR might help in terms of turning the original string into some kind of tree structure.
(It's not quite clear what you mean by "do it by nested", but if you mean that you're doing something like trying to read an expression, and recursing when meeting each "(", then that's actually one of the right ways to do it -- this is at heart what the code that ANTLR generates for you will do.)
If you are more interested in the mechanics of unifying things than you are in parsing, then one perfectly good way to do this is to construct the internal representation in code directly, and put off the parsing aspect for now. This can get a bit annoying during development, as your Prolog-style statements are now a rather verbose set of Java statements, but it lets you focus on one problem at a time, which is usually helpful.
(If you structure things this way, this should make it straightforward to insert a proper parser later, that will produce the same sort of tree as you have until then been constructing by hand. This will let you attack the two problems separately in a reasonably neat fashion.)

Before you get to do the semantics of the language, you have to convert the text into a form that's easy to operate on. This process is called parsing and the semantic representation is called an abstract syntax tree (AST).
A simple recursive descent parser for Prolog might be hand-written, but it's more common to use a parser toolkit such as Rats! or ANTLR.
In an AST for Prolog, you might have a Term class, with CompoundTerm, Variable, and Atom all being Terms. Polymorphism allows the arguments to a compound term to be any Term.
Your unification algorithm then becomes unifying the name of any compound term, and recursively unifying the value of each argument of corresponding compound terms.
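The class layout and recursive unification described here might be sketched as follows. The names (Unifier, Term, Var, Atom, Compound) are hypothetical, numbers are treated as atoms for brevity, there is no occurs check, and the bindings map may be left partially filled when unification fails. It is a simplified illustration, not Milner's algorithm from the Dragon Book. It assumes Java 9+ for List.of.

```java
import java.util.List;
import java.util.Map;

// Hypothetical term classes plus a simple recursive unifier over a bindings map.
final class Unifier {
    interface Term {}

    static final class Var implements Term {
        final String name;
        Var(String name) { this.name = name; }
    }

    static final class Atom implements Term {
        final String name;
        Atom(String name) { this.name = name; }
    }

    static final class Compound implements Term {
        final String functor;
        final List<Term> args;
        Compound(String functor, Term... args) {
            this.functor = functor;
            this.args = List.of(args);
        }
    }

    // Follows variable bindings until reaching an unbound variable or a non-variable term.
    static Term resolve(Term t, Map<String, Term> bindings) {
        while (t instanceof Var && bindings.containsKey(((Var) t).name)) {
            t = bindings.get(((Var) t).name);
        }
        return t;
    }

    // Returns true if a and b unify, recording any new variable bindings in the map.
    static boolean unify(Term a, Term b, Map<String, Term> bindings) {
        a = resolve(a, bindings);
        b = resolve(b, bindings);
        if (a instanceof Var) {
            if (b instanceof Var && ((Var) a).name.equals(((Var) b).name)) return true;
            bindings.put(((Var) a).name, b);
            return true;
        }
        if (b instanceof Var) {
            bindings.put(((Var) b).name, a);
            return true;
        }
        if (a instanceof Atom && b instanceof Atom) {
            return ((Atom) a).name.equals(((Atom) b).name);
        }
        if (a instanceof Compound && b instanceof Compound) {
            Compound ca = (Compound) a;
            Compound cb = (Compound) b;
            // Functor name and arity must agree before the arguments are compared.
            if (!ca.functor.equals(cb.functor) || ca.args.size() != cb.args.size()) return false;
            for (int i = 0; i < ca.args.size(); i++) {
                if (!unify(ca.args.get(i), cb.args.get(i), bindings)) return false;
            }
            return true;
        }
        return false;
    }
}
```

For the question's example, unifying a(X,c(d,X)) with a(2,c(d,Y)) first binds X to 2 and then, because X resolves to 2, binds Y to 2 as well.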

Need some help with String.format

I'm trying to find a complete tutorial about formatting strings in Java.
I need to create a receipt, like this:
       HEADER IN MIDDLE
''''''''''''''''''''''''''''''
Item1                    Price
Item2 x 5                Price
Item3 that has a very
long name....            Price
''''''''''''''''''''''''''''''
Netprice:                  xxx
Grossprice:                xxx
VAT:                       xxx
Shipping cost:             xxx
Total:                     xxx
''''''''''''''''''''''''''''''
       FOOTER IN MIDDLE

The format to pass to String.format is documented here:
http://java.sun.com/j2se/1.5.0/docs/api/java/util/Formatter.html#syntax
From the page:
The format specifiers for general, character, and numeric types have the following syntax:
%[argument_index$][flags][width][.precision]conversion
The optional argument_index is a decimal integer indicating the position of the argument in the argument list. The first argument is referenced by "1$", the second by "2$", etc.
The optional flags is a set of characters that modify the output format. The set of valid flags depends on the conversion.
The optional width is a non-negative decimal integer indicating the minimum number of characters to be written to the output.
The optional precision is a non-negative decimal integer usually used to restrict the number of characters. The specific behavior depends on the conversion.
The required conversion is a character indicating how the argument should be formatted. The set of valid conversions for a given argument depends on the argument's data type.
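Applied to the receipt in the question, the flags and width parts of that syntax could be used as in the sketch below. The class ReceiptFormat, the 30-character line width, and the 20/10 column split are assumptions chosen for illustration: "%-20s" left-justifies a label in 20 columns (the "-" flag), "%10s" right-justifies a value in 10 columns, and "%10.2f" does the same for a number with two decimals.

```java
import java.util.Locale;

// Hypothetical helpers for a fixed-width, 30-column text receipt.
final class ReceiptFormat {
    static final int WIDTH = 30;

    // Label left-justified in 20 columns, value right-justified in 10 columns.
    static String row(String label, String value) {
        return String.format("%-20s%10s", label, value);
    }

    // Same layout for numeric amounts, fixed to two decimals; Locale.ROOT keeps '.' as the separator.
    static String amount(String label, double value) {
        return String.format(Locale.ROOT, "%-20s%10.2f", label, value);
    }

    // Centers text by right-justifying it in a field that ends half the free space from the right edge.
    static String center(String text) {
        int pad = Math.max(0, (WIDTH - text.length()) / 2);
        return String.format("%" + (pad + text.length()) + "s", text);
    }
}
```

Width is only a minimum, so a label longer than 20 characters pushes the price to the right; long item names such as Item3 would need to be wrapped by hand before formatting.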

Formatting strings is somewhat complicated for this kind of requirement, so it is better to go for a reporting tool that uses the format you have given. That would be the better approach: either Crystal Reports or some other tool that is easy to implement.

Trying to do this by formatting a string will cost you too much time and too many nerves. I would suggest a templating engine like StringTemplate or something similar.
By doing this you will separate the presentation from the data, and that will be a very good thing in the long run.

See if these classes in the java.text package can help:
Format
MessageFormat
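As a small illustration of what MessageFormat offers, the sketch below fills numbered placeholders in a pattern. The class name ReceiptLines and the pattern text are made up for this example, and a fixed locale is passed so the number is rendered the same way everywhere.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Hypothetical helper showing numbered placeholders with java.text.MessageFormat.
final class ReceiptLines {
    // {0} is the item name, {1} the quantity.
    static String itemLine(String item, int quantity) {
        MessageFormat format = new MessageFormat("{0} x {1}", Locale.ROOT);
        return format.format(new Object[] {item, quantity});
    }
}
```

MessageFormat handles substitution but not column alignment, so it would still need to be combined with width-based formatting for the receipt layout.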

Yes, as solairaja said, if you are planning to create reports or receipts you can go for a reporting tool such as Crystal Reports:
Crystal Report Crystal Report Tutorial
Or, if you plan to stick with string formatting itself, then "StringBuffer" would be the best option because you can play around with it.

You should probably look at Java templating tools for this sort of multi-line reporting formatting.
Velocity is simple and forgiving of errors. Freemarker is very powerful but more intolerant. I would perhaps look at Velocity initially, and if you have to do more of this sort of work, take a further look at Freemarker.

It looks like the community's general advice is that a reporting tool is the better approach to your problem.
Here is a detailed list of open source Java charting and reporting tools:
http://java-source.net/open-source/charting-and-reporting
The best known is, in my opinion, Jasper Reports. A lot of resources about it are available on the web.
