Better way to parse the file? - java

I have a sample data of employees.
Brad Senior
<Fname>Brad Junior</Fname>
CHICAGO, March 6 1990 - He is a great Java Developer.
He has worked in XYZ company.
Data is in the format:
Person's name
<Fname> xxx </Fname> // Optional
Current Location, DOB - Description about his work.
I am able to parse it using BufferedReader, but only with a long chain of conditions.
Is there a better way to parse this content (e.g. regex) and store it in an Employee object?
I cannot use external libraries.
Thanks.

You could use a parser generator such as Cup.
It's really useful when the format becomes more complex. It also makes maintenance of the parser easier if the file format is extended.

"Better" is a matter of opinion, so your question is inherently unanswerable. As Tobías stated, a parser is arguably the way to go, and your options are vast. I recommend the industry-standard ANTLR over Cup, because I could not find any license information for Cup anywhere.
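Since the question rules out external libraries, here is a minimal sketch of the regex route using only java.util.regex. It assumes one record per input string, "\n" line endings after normalization, and a " - " separator between the date of birth and the description. The class name EmployeeParser and the Employee field names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmployeeParser {

    /** Plain value holder; the field names are assumptions based on the described format. */
    static final class Employee {
        final String name;
        final String fname;        // null when the optional <Fname> line is absent
        final String location;
        final String dob;
        final String description;

        Employee(String name, String fname, String location, String dob, String description) {
            this.name = name;
            this.fname = fname;
            this.location = location;
            this.dob = dob;
            this.description = description;
        }
    }

    // Line 1: name. Optional line 2: <Fname>...</Fname>.
    // Then "Location, DOB - Description", where the description may span several lines.
    private static final Pattern RECORD = Pattern.compile(
            "(?<name>[^\\n]+)\\n"
          + "(?:<Fname>\\s*(?<fname>[^<\\n]*?)\\s*</Fname>\\n)?"
          + "(?<location>[^,\\n]+),\\s*(?<dob>[^\\n]*?)\\s+-\\s+"
          + "(?<description>.*)",
            Pattern.DOTALL);

    static Employee parse(String record) {
        // Normalize Windows line endings so the pattern only has to deal with "\n".
        Matcher m = RECORD.matcher(record.replace("\r\n", "\n").trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized record: " + record);
        }
        return new Employee(
                m.group("name").trim(),
                m.group("fname"),
                m.group("location").trim(),
                m.group("dob"),
                m.group("description").trim());
    }

    public static void main(String[] args) {
        String sample = "Brad Senior\n"
                + "<Fname>Brad Junior</Fname>\n"
                + "CHICAGO, March 6 1990 - He is a great Java Developer.\n"
                + "He has worked in XYZ company.";
        Employee e = parse(sample);
        System.out.println(e.name + " | " + e.fname + " | " + e.location + " | " + e.dob);
    }
}
```

A single pattern keeps the format description in one place, but it gets hard to read as the format grows, which is where the parser generators mentioned above start to pay off.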

Related

Obtaining the Subject of a String in Java

Suppose I tell my Java program to find the subject of the sentence:
I enjoy spending time with my family.
My program should output:
Tell me more about your family.
How would it go about doing this?
EDIT
I could do this by having an array of String and have that filled with every noun in the English dictionary, but is there a simpler way?
This is way too open-ended a question. But a good place to start would be to learn about Natural Language Processing concepts and then look at using a framework like CoreNLP. It breaks down sentences into a parse tree and you can use this to identify parts of speech and things like the subject of a sentence. This is probably your best bet if you want a reasonably-reliable method.

Localizing a string containing list of names

I have a string containing a list of names like below:
"John asked Kim, Kelly, Lee and Bob about the new year plans". The number of names in the list can vary.
How can I localize this in Java?
I am thinking about ResourceBundle and MessageFormat. How will I write the pattern for this in MessageFormat?
Is there any better approach?
Localizing an (inline) list is more than just translating the word “and.” CLDR deals with the issue of formatting lists; check out their page on lists. I’m afraid ICU doesn’t have support for this yet, so you might need to code it separately.
Another issue is that you cannot expect to be able to use names as such in sentences like this. Many languages require the object to be in an inflected form, for example. In Finnish, your sample sentence would read as “John kysyi Kimiltä, Kellyltä, Leeltä ja Bobilta uudenvuoden suunnitelmista.” So you may need to find out and include different inflected forms of the names. Moreover, if the language used does not use the Latin alphabet, you may need transliterated forms of the names (e.g., in Arabic, John is جون). There are other problems as well. In Russian, the verb corresponding to “asked” depends on the gender of the subject (e.g., спросила vs. спросил).
I know this sounds complex, but localization is often complex. If you target a limited set of languages only, things can be much easier, so it is important to define your goals, perhaps accepting some simplifications that may result in grammatically incorrect expressions. But for localization that is to cover a wide range of languages, you may need to make the generating function localized. That is, you would have, for each language, a function that accepts a list of names as arguments and returns a string representing the statement, possibly using resource files containing information (transliterated form, different inflected forms, gender) about proper names that may appear.
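A rough sketch of what such a per-language generating function could look like, using only the two sentences quoted in this thread. The Generator interface, the class name, and the hard-coded lookup table are hypothetical; in practice the inflected forms would come from resource files. The escape \u00e4 is the letter “ä”, written that way so the source compiles regardless of file encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LocalizedQuestion {

    /** Hypothetical contract: one implementation per target language. */
    interface Generator {
        String asked(String asker, List<String> others);
    }

    /** Joins items with a separator, using a different separator before the last item. */
    static String joinList(List<String> items, String separator, String lastSeparator) {
        int n = items.size();
        if (n == 0) return "";
        if (n == 1) return items.get(0);
        return String.join(separator, items.subList(0, n - 1)) + lastSeparator + items.get(n - 1);
    }

    static final Generator ENGLISH = (asker, others) ->
            asker + " asked " + joinList(others, ", ", " and ") + " about the new year plans";

    // Inflected Finnish forms taken from the example sentence above; a real
    // application would load these from a resource file instead.
    static final Map<String, String> FINNISH_INFLECTED = new HashMap<>();
    static {
        FINNISH_INFLECTED.put("Kim", "Kimilt\u00e4");
        FINNISH_INFLECTED.put("Kelly", "Kellylt\u00e4");
        FINNISH_INFLECTED.put("Lee", "Leelt\u00e4");
        FINNISH_INFLECTED.put("Bob", "Bobilta");
    }

    static final Generator FINNISH = (asker, others) -> {
        List<String> inflected = new ArrayList<>();
        for (String name : others) {
            String form = FINNISH_INFLECTED.get(name);
            if (form == null) {
                throw new IllegalArgumentException("No inflected form known for " + name);
            }
            inflected.add(form);
        }
        return asker + " kysyi " + joinList(inflected, ", ", " ja ") + " uudenvuoden suunnitelmista";
    };

    public static void main(String[] args) {
        List<String> others = Arrays.asList("Kim", "Kelly", "Lee", "Bob");
        System.out.println(ENGLISH.asked("John", others));
        System.out.println(FINNISH.asked("John", others));
    }
}
```

The point of the design is that each language owns its whole sentence, so word order, list separators, and name inflection can all differ without any shared pattern string.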
In some situations, you might even consider generating the sentence in English, then sending it to an online translator. For example, Google Translator can deal with some of the issues that I mentioned. It surely produces wrong translations a lot, but for sentences with grammatically very simple structure, it might be a pragmatic solution, if you can accept some amount of errors. If you consider trying this, make sure you test sufficiently how the automatic translator can handle the specific sentences you will use. Quite often you can improve the results by reformulating the sentences. Dividing a sentence with several clauses into separate sentences often helps. But even your simple sentence causes problems in automatic translation.
You might avoid some complications if you can reformulate the sentence structure, e.g. so that all the nouns appear in the subject position and you avoid “packed” expressions like “new year plans.” For example, “John asked what plans Kim, Kelly, Lee, and Bob have for the new year” would be simpler, both for automatic translation and for pattern-based localization.
You could do something like:
"{0} asked {1} about the new year plans"
where 0 is the first name and 1 is a comma-separated list of the other names.
Hope this helps.
I see an answer was already accepted, I'm just adding this here as an alternative. The code has hard coded values for the data, but is only meant to present an idea that can be refined:
import java.text.MessageFormat;

MessageFormat people = new MessageFormat("{0} asked {1,choice,0#no one|1#{2}|2#{2} and {3}|2<{2}, and {3}} about the new year plans");
String john = "John";
// Each row: asker, number of people asked, all names but the last (comma-separated), last name
Object[][] parties = new Object[][] { {john, 0}, {john, 1, "Kim"}, {john, 2, "Kim", "Kelly"}, {john, 4, "Kim, Kelly, Lee", "Bob"} };
for (final Object[] strings : parties) {
    System.out.println(people.format(strings));
}
This outputs the following:
John asked no one about the new year plans
John asked Kim about the new year plans
John asked Kim and Kelly about the new year plans
John asked Kim, Kelly, Lee, and Bob about the new year plans
Determining the number of names that is used for the 2nd argument and creating the comma-delimited string for the 3rd argument isn't displayed in that sample, but can easily be done instead of using the hard coded values I used.
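One way that missing step might look, as a sketch only: a helper that takes the list of names, derives the count, the comma-delimited string, and the last name, and feeds them to the same English pattern. The class and method names are invented for illustration.

```java
import java.text.MessageFormat;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class AskedSentence {

    // Same pattern as in the answer above: {1} is the count, {2} holds all names
    // except the last (comma-separated), and {3} holds the last name.
    private static final String PATTERN =
            "{0} asked {1,choice,0#no one|1#{2}|2#{2} and {3}|2<{2}, and {3}} about the new year plans";

    static String format(String asker, List<String> others) {
        int n = others.size();
        // With zero names both slots stay empty; with one name it goes into {2}.
        String head = n == 0 ? "" : String.join(", ", others.subList(0, Math.max(1, n - 1)));
        String last = n < 2 ? "" : others.get(n - 1);
        return new MessageFormat(PATTERN).format(new Object[] {asker, n, head, last});
    }

    public static void main(String[] args) {
        System.out.println(format("John", Collections.<String>emptyList()));
        System.out.println(format("John", Arrays.asList("Kim")));
        System.out.println(format("John", Arrays.asList("Kim", "Kelly")));
        System.out.println(format("John", Arrays.asList("Kim", "Kelly", "Lee", "Bob")));
    }
}
```

The ", " separator is still hard-coded for English here, so this covers the counting logic only, not the list-formatting concerns raised in the other answers.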
For localization, the normal approach is to use external language packs: files that contain the text you are going to display, with a name/key assigned to each text. The program then loads each text by its key.
You could combine your ResourceBundle (for I18N) with a MessageFormat (to replace placeholders with the names) : "{0} asked {1} about the new year plans"
It would be up to you to prepare the names beforehand, though.

Java API for plural forms of English words

Are there any Java API(s) which will provide plural form of English words (e.g. cacti for cactus)?
Check Evo Inflector, which implements an English pluralization algorithm based on Damian Conway's paper "An Algorithmic Approach to English Pluralization".
The library is tested against data from Wiktionary and reports a 100% success rate for the 1000 most used English words and a 70% success rate for all the words listed in Wiktionary.
If you want even more accuracy, you can take a Wiktionary dump and parse it to create a database of singular-to-plural mappings. Take into account that, due to the open nature of Wiktionary, some data there might be incorrect.
Example Usage:
English.plural("Facility", 1); // == "Facility"
English.plural("Facility", 2); // == "Facilities"
jibx-tools provides a convenient pluralizer/depluralizer.
Groovy test:
NameConverter nameTools = new DefaultNameConverter();
assert nameTools.depluralize("apples") == "apple"
assert nameTools.pluralize("apple") == "apples"
I know there is simple pluralize() function in Ruby on Rails, maybe you could get that through JRuby. The problem really isn't easy, I saw pages of rules on how to pluralize and it wasn't even complete. Some rules are not algorithmic - they depend on stem origin etc. which isn't easily obtained. So you have to decide how perfect you want to be.
Considering Java, have a look at ModeShape's Inflector class in the package org.modeshape.common.text. Or google for "inflector" and "randall hauch".
It's hard to find this kind of API. Instead, you may need to find a web service that serves your purpose. Check this. I am not sure if this can help you.
(I tried to put word cacti and got cactus somewhere in the response).
If you can harness javascript, I created a lightweight (7.19 KB) javascript for this. Or you could port my script over to Java. Very easy to use:
pluralizer.run('goose') --> 'geese'
pluralizer.run('deer') --> 'deer'
pluralizer.run('can') --> 'cans'
https://github.com/rhroyston/pluralizer-js
BTW: It looks like cacti to cactus is a super special conversion (most ppl are going to say '1 cactus' anyway). Easy to add that if you want to. The source code is easy to read / update.
Wolfram|Alpha returns a list of inflected forms for a given word.
See this as an example:
http://www.wolframalpha.com/input/?i=word+cactus+inflected+forms
And here is their API:
http://products.wolframalpha.com/api/

Open Source Text Localization Library

Is there an open source project that handles localizing tokenized string text to other languages, with complex handling for grammar and spelling (definite, indefinite, plural, singular) and, for languages like German, handling of masculine, feminine, and neuter?
Most localization frameworks do wholesale replace of strings and don't take into account tokenized strings that might refer to objects that in some languages could be masculine/feminine/neuter.
The programming language I'm looking for is Javascript/Java/Actionscript/Python, it'd be nice if there was a programming-language independent data-format for creating the string tables.
To answer your question, I've not heard of any such framework.
From the limited amount that I understand and have heard about this topic, I'd say that this is beyond the state of the art ... certainly if you are trying to do this across multiple languages.
Here are some relevant resources:
"Open-Source Software and Localization" by Frank Bergman [2005]
Plural forms in GNU gettext - http://www.gnu.org/software/hello/manual/gettext/Plural-forms.html

Need some help with String.format

I'm trying to find a complete tutorial about formatting strings in java.
I need to create a receipt, like this:
HEADER IN MIDDLE
''''''''''''''''''''''''''''''
Item1 Price
Item2 x 5 Price
Item3 that has a very
long name.... Price
''''''''''''''''''''''''''''''
Netprice: xxx
Grossprice: xxx
VAT: xxx
Shipping cost: xxx
Total: xxx
''''''''''''''''''''''''''''''
FOOTER IN MIDDLE
The format to pass to String.format is documented here:
http://java.sun.com/j2se/1.5.0/docs/api/java/util/Formatter.html#syntax
From the page:
The format specifiers for general, character, and numeric types have the following syntax:
%[argument_index$][flags][width][.precision]conversion
The optional argument_index is a decimal integer indicating the position of the argument in the argument list. The first argument is referenced by "1$", the second by "2$", etc.
The optional flags is a set of characters that modify the output format. The set of valid flags depends on the conversion.
The optional width is a non-negative decimal integer indicating the minimum number of characters to be written to the output.
The optional precision is a non-negative decimal integer usually used to restrict the number of characters. The specific behavior depends on the conversion.
The required conversion is a character indicating how the argument should be formatted. The set of valid conversions for a given argument depends on the argument's data type.
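As a rough illustration of how flags, width, and precision combine for a receipt like the one in the question, here is a small sketch. The 30-character line width, the 20/10 column split, the class name, and the sample amounts are all assumptions made for the example.

```java
import java.util.Locale;

class ReceiptDemo {
    static final int WIDTH = 30; // assumed total line width

    /** Centers text within WIDTH by prepending spaces (no trailing padding). */
    static String center(String text) {
        int pad = Math.max(0, (WIDTH - text.length()) / 2);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < pad; i++) {
            sb.append(' ');
        }
        return sb.append(text).toString();
    }

    /**
     * "%-20.20s": '-' flag left-justifies, width 20 pads, precision 20 truncates long labels.
     * "%10.2f": right-justified in 10 characters with two decimal places.
     * Locale.ROOT keeps the decimal separator a '.' regardless of the default locale.
     */
    static String row(String label, double amount) {
        return String.format(Locale.ROOT, "%-20.20s%10.2f", label, amount);
    }

    /** A separator line made of WIDTH apostrophes, as in the question's layout. */
    static String rule() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < WIDTH; i++) {
            sb.append('\'');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(center("HEADER IN MIDDLE"));
        System.out.println(rule());
        System.out.println(row("Item1", 9.99));
        System.out.println(row("Item2 x 5", 24.5));
        System.out.println(row("Item3 that has a very long name", 3.0));
        System.out.println(rule());
        System.out.println(row("Total:", 37.49));
        System.out.println(rule());
        System.out.println(center("FOOTER IN MIDDLE"));
    }
}
```

Truncating long item names with the precision is the simplest option; wrapping them onto a second line, as in the question's mock-up, would need extra code that splits the label before formatting.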
Formatting strings is somewhat complicated for this kind of requirement, so a better approach is to use a reporting tool that reproduces the format you have given.
Either Crystal Reports or some other tool that is easy to implement would do.
Trying to do this by formatting a string will cost you too much time and nerves. I would suggest a templating engine like StringTemplate or something similar.
By doing this you will separate the presentation from the data, and that will be a very good thing in the long run.
See if these classes in java.text package can help..
Format
MessageFormat
Yes, as solairaja said, if you are planning to create reports or receipts, you can use a reporting tool such as Crystal Reports.
Or, if you plan to stick with string formatting itself, StringBuffer would be the best option because you can build up and adjust the output piece by piece.
You should probably look at Java templating tools for this sort of multi-line reporting formatting.
Velocity is simple and forgiving of errors. Freemarker is very powerful but more intolerant. I would perhaps look at Velocity initially, and if you have to do more of this sort of work, take a further look at Freemarker.
Looks like the general advice from the community as a better approach to solve your problem is using a reporting tool.
Here you have a detailed list of open source Java charting and reporting tools:
http://java-source.net/open-source/charting-and-reporting
The best known is, in my opinion, Jasper Reports. A lot of resources about it are available on the web.
