Normalizing/unaccenting text in Java

How can I normalize/unaccent text in Java? I am currently using java.text.Normalizer:
Normalizer.normalize(str, Normalizer.Form.NFD)
    .replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
But it is far from perfect. For example, it leaves the Norwegian characters æ and ø untouched. Does anyone know of an alternative? I am looking for something that would convert characters in all sorts of languages down to just the a-z range. I realize there are different ways to do this (e.g. should æ be encoded as 'a', 'e' or even 'ae'?) and I'm open to any solution. I'd prefer not to write something myself, since I think it's unlikely that I would be able to do this well for all languages. Performance is NOT critical.
The use case: I want to convert a user-entered name to a plain a-z ranged name. The converted name will be displayed to the user, so I want it to match as closely as possible what the user wrote in their original language.
EDIT:
Alright people, thanks for negging the post and not addressing my question, yay! :) Maybe I should have left out the use case. But please allow me to clarify. I need to convert the name in order to store it internally. I have no control over the choice of letters allowed here. The name will be visible to the user in, for example, the URL. The same way that your user name on this forum is normalized and shown to you in the URL if you click on your name. This forum converts a name like "Bășan" to "baan" and a name like "Øyvind" to "yvind". I believe it can be done better. I am looking for ideas and preferably a library function to do this for me. I know I cannot get it perfectly right, I know that "o" and "ø" are different, etc, but if my name is "Øyvind" and I register on an online forum, I would likely prefer that my user name is "oyvind" and not "yvind". Hope this makes sense! Thanks!
(And NO, we will not allow the user to pick his own user name. I am really just looking for an alternative to java.text.Normalizer. Thanks!)

Assuming you have considered ALL of the implications of what you're doing, ALL the ways it can go wrong, and what you'll do when you get Chinese pictograms and other things that have no equivalent in the Latin alphabet...
There's no library that I know of that does what you want. If you have a list of equivalences (as you say, the 'æ' to 'ae' or whatever), you could store them in a file (or, if you're doing this a lot, in a sorted array in memory, for performance reasons) and then do a lookup-and-replace by character. If you have the space in memory to store an array indexed by Unicode value, being able to run through each character and do a straight lookup would be the most efficient.
i.e., \u1234 => lookupArray[0x1234] => 'q'
or whatever.
so you'll have a loop that looks like:
StringBuilder buf = new StringBuilder();
for (int i = 0; i < string.length(); i++) {
    buf.append(lookupArray[string.charAt(i)]); // a char index is already its Unicode value
}
I wrote that from scratch, so there are probably some bad method calls or something.
You'll have to do something to handle decomposed characters, probably with a lookahead buffer.
Good luck - I'm sure this is fraught with pitfalls.
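As a rough illustration of that table-driven approach, here is a minimal sketch. The table entries and the class/method names are invented for the example; a real table would need careful curating per language, and this version handles the BMP only:

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class Unaccent {
    // Hypothetical equivalence table -- just a few sample entries.
    private static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('\u00E6', "ae"); // æ
        TABLE.put('\u00F8', "o");  // ø
        TABLE.put('\u00E5', "aa"); // å
        TABLE.put('\u00DF', "ss"); // ß
    }

    public static String toAscii(String name) {
        // Decompose first so ordinary accents fall away as combining marks,
        // then push the leftovers (æ, ø, ...) through the table.
        String s = Normalizer.normalize(name.toLowerCase(Locale.ROOT), Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) { // BMP only; surrogate pairs ignored
            char c = s.charAt(i);
            String mapped = TABLE.get(c);
            if (mapped != null) {
                out.append(mapped);
            } else if (c >= 'a' && c <= 'z') {
                out.append(c);
            } // anything else is dropped, as before
        }
        return out.toString();
    }
}
```

With this table, "Øyvind" comes out as "oyvind" instead of "yvind"; anything not in the table still vanishes, which is exactly why the table has to be curated.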

Related

How to tell if a string of characters makes intelligible words

So, I'm working on a simple mobile app project (mostly for fun) that uses an OCR library (tesseract) on Android to scan a camera picture, do some stuff with the text, and return it to the user.
What I'm wondering is if anyone out there knows of a way to programmatically (or statistically) tell if a String of characters makes actual words or if it's just nonsense. (I'm only targeting the English language at this point, FYI)
For example, OCR may read a picture and it might return
String returned = "The quick brown fox."
Or, it might read another picture and return
String returned = "$. _- %/ hj #;+__~"
Obviously, the first string returned makes words and the second is just gibberish. I'm wondering if anyone has ideas for a way to easily differentiate between good return and nonsense return.
Run some character frequencies and some other statistics. I would look for the frequency and placement of whitespace, sizes of words, and frequency of symbols that I would and wouldn't expect to find in the content I expect my users to be taking pictures of.
If you're expecting large amounts of text, maybe check the frequencies on the alphabet and see if they match up with the known character frequencies in English. If you're expecting receipts, look for a lot more numbers than usual.
In the end, you could let the user decide if it's really what they wanted. All the analysis could just warn the user with a "We don't believe this is what you want" warning they could ignore.
I used concepts like these to solve a Project Euler problem about knowing when text is properly decrypted.
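A crude version of that frequency idea can be sketched like this (the class name, method name, and the 0.8 threshold are all invented for the example; the threshold would need tuning against real OCR output):

```java
public class GibberishCheck {
    // Fraction of characters that are letters or whitespace; OCR gibberish
    // tends to be heavy on symbols and stray punctuation.
    public static boolean looksLikeText(String s) {
        if (s.isEmpty()) return false;
        int ok = 0;
        for (char c : s.toCharArray()) {
            if (Character.isLetter(c) || Character.isWhitespace(c)) ok++;
        }
        return ok / (double) s.length() > 0.8; // threshold is a guess; tune it
    }
}
```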
An easy solution is to have a dictionary of valid words and see if the returned words are in the dictionary.
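The dictionary idea is only a few lines; in this sketch the word set is a stand-in for a real word list, and the majority-vote rule is an arbitrary choice:

```java
import java.util.Locale;
import java.util.Set;

public class DictionaryCheck {
    // Returns true when at least half of the extracted words are in the dictionary.
    public static boolean mostlyRealWords(String s, Set<String> dict) {
        String[] words = s.toLowerCase(Locale.ROOT).split("[^a-z]+");
        int hits = 0, total = 0;
        for (String w : words) {
            if (w.isEmpty()) continue; // split can yield a leading empty token
            total++;
            if (dict.contains(w)) hits++;
        }
        return total > 0 && hits * 2 >= total; // majority vote; tune as needed
    }
}
```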

How should I specify Asian char, and String, constants in Java?

I need to tokenize Japanese sentences. What are best practices for representing the char values of kana and kanji? This is what I might normally do:
String s = "a";
String token = sentence.split(s)[0];
But, the following is not good in my opinion:
String s = String.valueOf('あ'); // a Japanese kana character
String token = sentence.split(s)[0];
because people who read my source might not be able to read, or display, Japanese characters. I'd prefer not to insult anyone by writing the actual character. I'd want a "romaji", or something, representation. This is an example of the really stupid "solution" I am using:
char YaSmall_hira_char = (char) 12419; // [ゃ] <--- small
char Ya_hira_char = (char) 12420; // [や]
char Toshi_kj_char = (char) 24180; // [年]
char Kiku_kj_char = (char) 32862; // [聞]
That looks absolutely ridiculous. And, it's not sustainable because there are over 2,000 Japanese characters...
My IDE, and my java.io.InputStreamReaders, are all set to UTF-8, and my code is working fine. But the specter of character-encoding bugs is hanging over my head, because I just don't understand how to represent Asian characters as chars.
I need to clean-up this garbage I wrote, but I don't know which direction to go. Please help.
because people who read my source might not be able to read, or display, Japanese characters.
Then how could they do anything useful with your code, when dealing with such characters is an integral part of it?
Just make sure your development environment is set up correctly to support these characters in source code and that you have procedures in place to ensure everyone who works with the code will get the same correct setup. At the very least document it in your project description.
Then there is nothing wrong with using those characters directly in your source.
I agree that what you are currently doing is unsustainable. It is horribly verbose, and probably a waste of your time anyway.
You need to ask yourself who exactly you expect to read your code:
A native Japanese speaker / writer can read the Kana. They don't need the romaji, and would probably consider it an impediment to readability.
A non-Japanese speaker would not be able to discern the meaning of the characters whether they are written as Kana or as romaji. Your effort would be wasted on them.
The only people who might be helped by romaji would be non-native Japanese speakers who haven't learned to read / write Kana (yet). And I imagine they could easily find a desktop tool / app for mapping Kana to romaji.
So let's step back to your example which you think is "not good".
String s = String.valueOf('あ'); // a Japanese kana character
String token = sentence.split(s)[0];
Even to someone (like me) who can't read (or speak) Japanese, the surface meaning of that code is clear. You are splitting the String using a Japanese character as the separator.
Now, I don't understand the significance of that character. But I wouldn't if it were a constant with a romaji name either. Besides, the chances are that I don't need to know in order to understand what the application is doing. (If I do need to know, I'm probably the wrong person to be reading the code. Decent Japanese language skills are mandatory for your application domain!!)
The issue you raised about not being able to display the Japanese characters is easy to solve. The programmer simply needs to upgrade to software that can display Kana. Any decent Java IDE will be able to cope ... if properly configured. Besides, if this is a real concern, the proper solution (for the programmer!) is to use Java's Unicode escape sequence mechanism to represent the characters; e.g.
String s = String.valueOf('\uxxxx'); // (replace xxxx with hex unicode value)
The Java JDK includes tools that can rewrite Java source code to add or remove Unicode escaping. All the programmer needs to do is to "escape" the code before trying to read it.
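For instance, the four constants from the question could be written with escapes plus a comment showing the glyph (the constant names here are renamed versions of the asker's own; the code points match the decimal values 12419, 12420, 24180 and 32862 given above):

```java
public class KanaConstants {
    // Unicode escapes keep the source file ASCII-only;
    // the comments show the glyph and its decimal code point.
    public static final char SMALL_YA_HIRA = '\u3083'; // ゃ (12419)
    public static final char YA_HIRA       = '\u3084'; // や (12420)
    public static final char TOSHI_KJ      = '\u5E74'; // 年 (24180)
    public static final char KIKU_KJ       = '\u805E'; // 聞 (32862)
}
```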
Aside: You wrote this:
"I'd prefer to not insult anyone by writing the actual character."
What? No Westerner would or should consider Kana an insult! They may not be able to read it, but that's not an insult / insulting. (And if they do feel genuinely insulted, then frankly that's their problem ... not yours.)
The only thing that matters here is whether non-Japanese-reading people can fully understand your code ... and whether that's a problem you ought to be trying to solve. Worrying about solving unsolvable problems is not a fruitful activity.
Michael has the right answer, I think. (Posting this as an Answer rather than a Comment because Comment sizes are limited; apologies to those who are picky about the distinction.)
If anyone is working with your code, it will be because they need to alter how Japanese sentences are tokenized. They had BETTER be able to deal with Japanese characters at least to some degree, or they'll be unable to test any changes they make.
As you've pointed out, the alternatives are certainly no more readable. Maybe less so; even without knowing Japanese I can read your code and know that you are using the 'あ' character as your delimiter, so if I see that character in an input string I know what the output will be. I have no idea what the character means, but for this simple bit of code analysis I don't need to.
If you want to make it a bit easier for those of us who don't know the full alphabet, then when referring to single characters you could give us the Unicode value in a comment. But any Unicode-capable text editor ought to have a function that tells us the numeric value of the character we've pointed at -- Emacs happily tells me that it's #x3042 -- so that would purely be a courtesy to those of us who probably shouldn't be messing with your code anyway.

Java & Regex: Matching a substring that is not preceded by specific characters

This is one of those questions that has been asked and answered hundreds of times over, but I'm having a hard time adapting other solutions to my needs.
In my Java-application I have a method for censoring bad words in chat messages. It works for most of my words, but there is one particular (and popular) curse word that I can't seem to get rid of. The word is "faen" (which is simply a modern slang for "satan", in the language in question).
Using the pattern "fa+e+n" for matching multiple A's and E's actually works; however, in this language, the word for "that couch" or "that sofa" is "sofaen". I've tried a lot of different approaches, using variations of [^so] and (?!=so), but so far I haven't been able to find a way to match one and not the other.
The real goal here, is to be able to match the bad words, regardless of the number of vowels, and regardless of any non-letters in between the components of the word.
Here's a few examples of what I'm trying to do:
"String containing faen" Should match
"String containing sofaen" Should not match
"Non-letter-censored string with f-a#a-e.n" Should match
"Non-letter-censored string with sof-a#a-e.n" Should not match
Any tips to set me off in the right direction on this?
You want something like \bf([^a-z\s]*a)+([^a-z\s]*e)+[^a-z\s]*n, matched case-insensitively. The leading \b fails when a letter immediately precedes the f, which rules out "sofaen", and [^a-z\s]* lets punctuation sit between the letters without allowing a match to span whitespace. Note that this is the bare regular expression; in a Java string literal every backslash must be doubled: "\\bf([^a-z\\s]*a)+([^a-z\\s]*e)+[^a-z\\s]*n".
Note also that this isn't perfect, but does handle the situations that you have suggested.
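A hedged sketch of that idea in Java (the class and method names are invented for the example); the word boundary makes a preceding letter, as in "sofaen", block the match, while non-letter junk between repeated vowels is allowed:

```java
import java.util.regex.Pattern;

public class CurseFilter {
    // \b fails when a letter precedes the f (as in "sofaen");
    // [^a-z\s]* allows punctuation between letters but not whitespace.
    private static final Pattern BAD = Pattern.compile(
            "\\bf([^a-z\\s]*a)+([^a-z\\s]*e)+[^a-z\\s]*n",
            Pattern.CASE_INSENSITIVE);

    public static boolean containsBadWord(String s) {
        return BAD.matcher(s).find();
    }
}
```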
It's a terrible idea to begin with. You think your users would write something like "f-aeen" to avoid your filter, but would not come up with "ffaen" or "-faen" or whatever variation you did not prepare for? This is a race you cannot win, and the real loser is usability.

Need some ideas on how to accomplish this in Java (parsing strings)

Sorry I couldn't think of a better title, but thanks for reading!
My ultimate goal is to read a .java file, parse it, and pull out every identifier. Then store them all in a list. Two preconditions are that there are no comments in the file and that all identifiers are composed of letters only.
Right now I can read the file, parse it by spaces, and store everything in a list. If anything in the list is a java reserved word, it is removed. Also, I remove any loose symbols that are not attached to anything (brackets and arithmetic symbols).
Now I am left with a bunch of weird strings, but at least they have no spaces in them. I know I am going to have to re-parse everything with a . delimiter in order to pull out identifiers like System.out.print, but what about strings like this example:
Logger.getLogger(MyHash.class.getName()).log(Level.SEVERE,
After re-parsing by . I will be left with more crazy strings like:
getLogger(MyHash
getName())
log(Level
SEVERE,
How am I going to be able to pull out all the identifiers while leaving out all the trash? Just keep re-parsing by every symbol that could exist in java code? That seems rather lame and time consuming. I am not even sure if it would work completely. So, can you suggest a better way of doing this?
There are several solutions that you can use, other than hacking your-own parser:
Use an existing parser, such as this one.
Use BCEL to read bytecode, which includes all fields and variables.
Hack into the compiler or run-time, using annotation processing or mirrors - I'm not sure you can find all identifiers this way, but fields and parameters for sure.
I wouldn't separate the entire file at once according to whitespace. Instead, I would scan the file letter-by-letter, saving every character in a buffer until I'm sure an identifier has been reached.
In pseudo-code:
clean buffer
for each letter l in file:
    if l is ' (a single quote)
        toggle "character mode"
    if l is " (a double quote)
        toggle "string mode"
    if l is a letter AND "character mode" is off AND "string mode" is off
        add l to end of buffer
    else
        if buffer is NOT a keyword or a literal
            add buffer to list of identifiers
        clean buffer
Notice some lines here hide further complexity - for example, to check whether the buffer is a literal you need to check for true, false, and null.
In addition, there are more bugs in the pseudo-code - it will identify things like the e and L parts of literals (e in floating-point literals, L in long literals) as identifiers as well. I suggest adding additional "modes" to take care of them, but it's a bit tricky.
Also there are a few more things to handle if you want to make sure it's accurate - for example, you have to make sure you work with Unicode. I would strongly recommend studying the lexical structure of the language, so you won't miss anything.
EDIT:
This solution can easily be extended to deal with identifiers with numbers, as well as with comments.
Small bug above - you need to handle \" differently than ", same with \' and '.
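Under the question's preconditions (no comments, identifiers made of letters only), the pseudo-code above might look like the following in Java. The keyword set is deliberately abbreviated for the sketch, and escaped quotes inside literals are not handled, per the "small bug" note:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class IdentifierScanner {
    // Abbreviated for the sketch; the full reserved-word list is much longer.
    private static final Set<String> KEYWORDS = Set.of(
            "public", "private", "static", "void", "class", "int",
            "new", "return", "if", "else", "for", "while",
            "true", "false", "null");

    public static List<String> identifiers(String source) {
        List<String> out = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        boolean inString = false, inChar = false;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '"' && !inChar) { inString = !inString; continue; }
            if (c == '\'' && !inString) { inChar = !inChar; continue; }
            if (!inString && !inChar && Character.isLetter(c)) {
                buf.append(c);          // still inside a possible identifier
            } else {
                flush(buf, out);        // anything else ends the current token
            }
        }
        flush(buf, out);
        return out;
    }

    private static void flush(StringBuilder buf, List<String> out) {
        if (buf.length() > 0 && !KEYWORDS.contains(buf.toString())) {
            out.add(buf.toString());
        }
        buf.setLength(0);
    }
}
```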
Wow, ok. Parsing is hard -- really hard -- to do right. Rolling your own java parser is going to be incredibly difficult to do right. You'll find there are a lot of edge cases you're just not prepared for. To really do it right, and handle all the edge cases, you'll need to write a real parser. A real parser is composed of a number of things:
A lexical analyzer to break the input up into logical chunks
A grammar to determine how to interpret the aforementioned chunks
The actual "parser" which is generated from the grammar using a tool like ANTLR
A symbol table to store identifiers in
An abstract syntax tree to represent the code you've parsed
Once you have all that, you can have a real parser. Of course you could skip the abstract syntax tree, but you need pretty much everything else. That leaves you with writing about 1/3 of a compiler. If you truly want to complete this project yourself, you should see if you can find an example for ANTLR which contains a preexisting java grammar definition. That'll get you most of the way there, and then you'll need to use ANTLR to fill in your symbol table.
Alternately, you could go with the clever solutions suggested by Little Bobby Tables (awesome name, btw Bobby).

Parsing of data structure in a plain text file

How would you parse, in Java, a structure similar to this:
\\Header (name)\\\
1JohnRide 2MarySwanson
1 password1
2 password2
\\\1 block of data name\\\
1.ABCD
2.FEGH
3.ZEY
\\\2-nd block of data name\\\
1. 123232aDDF dkfjd ksksd
2. dfdfsf dkfjd
....
etc
Suppose, it comes from a text buffer (plain file).
Each line of text is "\n"-delimited. Spaces are used between the words.
The structure is more or less defined. There may sometimes be ambiguity, though, since the number of fields in each line of information may differ, some blocks of data may be missing, and the number of lines in each block may vary as well.
The question is how to do it most effectively?
First solution that comes to my head is to use regular expressions.
But are there other solutions? Problem-oriented? Maybe some java library already written?
Check out UTAH: https://github.com/sonalake/utah-parser
It's a tool that's pretty good at parsing this kind of semi structured text
As no one recommended any library, my suggestion would be: use regex.
From what you have posted it looks like the data is delimited by whitespace. One idea is to use a Scanner or a StringTokenizer to get one token at a time. You can then check the first char of a token to see if it is a digit (in which case the part of the token after the digit(s) will be the data, if there is any).
This sounds like a homework problem so I'm going to try to answer it in such a way to help guide you (not give the final solution).
First, you need to consider each object of data you're reading. Is it a number then a text field? A number then 3 text fields? Variable numbers and text fields?
After that you need to determine what you're going to use to delimit each field and each object. For example, in many files you'll see something like a semi-colon between the fields and a new line for the end of the object. From what you said it sounds like yours is different.
If an object can go across multiple lines you'll need to bear that in mind (don't stop partway through an object).
Hopefully that helps. If you research this and you're still having problems post the code you've got so far and some sample data and I'll help you to solve your problems (I'll teach you to fish....not give you fish :-) ).
If the fields are fixed length, you could use a DataInputStream to read your file. Or, since your format is line-based, you could use a BufferedReader to read lines and write yourself a state machine which knows what kind of line to expect next, given what it's already seen. Once you have each line as a string, then you just need to split the data appropriately.
E.g., the password can be gotten from your password line like this:
final int pos = line.indexOf(' ');
String passwd = line.substring(pos + 1);
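A sketch of that state-machine idea, assuming (as in the sample) that block-header lines are wrapped in backslashes; the class name, method name, and "preamble" label are invented for the example:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BlockParser {
    // Lines starting with backslashes begin a new named block; everything
    // else is collected under the most recent block name.
    public static Map<String, List<String>> parse(String text) {
        Map<String, List<String>> blocks = new LinkedHashMap<>();
        String current = "preamble";                // lines before any header
        blocks.put(current, new ArrayList<>());
        try (BufferedReader r = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.startsWith("\\\\")) {      // e.g. \\Header (name)\\
                    current = line.replace("\\", "").trim();
                    blocks.put(current, new ArrayList<>());
                } else if (!line.trim().isEmpty()) {
                    blocks.get(current).add(line);
                }
            }
        } catch (IOException e) {                   // cannot happen with a StringReader
            throw new IllegalStateException(e);
        }
        return blocks;
    }
}
```

Splitting each collected line into fields (by the first space, as in the password example above) would then be a separate, per-block step.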