I need to tokenize Japanese sentences. What are best practices for representing the char values of kana and kanji? This is what I might normally do:
String s = "a";
String token = sentence.split(s)[0];
But, the following is not good in my opinion:
String s = String.valueOf('あ'); // a Japanese kana character
String token = sentence.split(s)[0];
because people who read my source might not be able to read, or display, Japanese characters. I'd prefer to not insult anyone by writing the actual character. I'd want a "romaji", or something, representation. This is an example of the really stupid "solution" I am using:
char YaSmall_hira_char = (char) 12419; // [ゃ] <--- small
char Ya_hira_char = (char) 12420; // [や]
char Toshi_kj_char = (char) 24180; // [年]
char Kiku_kj_char = (char) 32862; // [聞]
That looks absolutely ridiculous. And, it's not sustainable because there are over 2,000 Japanese characters...
My IDE, and java.io.InputStreamReaders, are all set to UTF-8, and my code is working fine. But the specter of character encoding bugs is hanging over my head because I just don't understand how to represent Asian characters as chars.
I need to clean-up this garbage I wrote, but I don't know which direction to go. Please help.
because people who read my source might not be able to read, or display, Japanese characters.
Then how could they do anything useful with your code, when dealing with such characters is an integral part of it?
Just make sure your development environment is set up correctly to support these characters in source code and that you have procedures in place to ensure everyone who works with the code will get the same correct setup. At the very least document it in your project description.
Then there is nothing wrong with using those characters directly in your source.
I agree that what you are currently doing is unsustainable. It is horribly verbose, and probably a waste of your time anyway.
You need to ask yourself who exactly you expect to read your code:
A native Japanese speaker / writer can read the Kana. They don't need the romaji, and would probably consider it an impediment to readability.
A non-Japanese speaker would not be able to discern the meaning of the characters whether they are written as Kana or as romaji. Your effort would be wasted on them.
The only people who might be helped by romaji would be non-native Japanese speakers who haven't learned to read / write Kana (yet). And I imagine they could easily find a desktop tool / app for mapping Kana to romaji.
So let's step back to your example which you think is "not good".
String s = String.valueOf('あ'); // a Japanese kana character
String token = sentence.split(s)[0];
Even to someone (like me) who can't read (or speak) Japanese, the surface meaning of that code is clear. You are splitting the String using a Japanese character as the separator.
Now, I don't understand the significance of that character. But I wouldn't if it was a constant with a romanji name either. Besides, the chances are that I don't need to know in order to understand what the application is doing. (If I do need to know, I'm probably the wrong person to be reading the code. Decent Japanese language skills are mandatory for your application domain!!)
The issue you raised about not being able to display the Japanese characters is easy to solve. The programmer simply needs to upgrade to software that can display Kana. Any decent Java IDE will be able to cope ... if properly configured. Besides, if this is a real concern, the proper solution (for the programmer!) is to use Java's Unicode escape sequence mechanism to represent the characters; e.g.
String s = String.valueOf('\uxxxx'); // (replace xxxx with hex unicode value)
The Java JDK includes a tool (native2ascii) that can rewrite Java source code to add or remove Unicode escaping. All the programmer needs to do is to "un-escape" the code before trying to read it.
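For example, the kana split from the question can be written in pure ASCII (U+3042 is あ, HIRAGANA LETTER A; the sample sentence here is made up):

```java
public class KanaSplit {
    public static void main(String[] args) {
        // '\u3042' is HIRAGANA LETTER A (あ), written as a pure-ASCII escape
        String s = String.valueOf('\u3042');
        String token = "xyz\u3042abc".split(s)[0];
        System.out.println(token); // prints "xyz"
    }
}
```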
Aside: You wrote this:
"I'd prefer to not insult anyone by writing the actual character."
What? No Westerner would or should consider Kana an insult! They may not be able to read it, but that's not an insult / insulting. (And if they do feel genuinely insulted, then frankly that's their problem ... not yours.)
The only thing that matters here is whether non-Japanese-reading people can fully understand your code ... and whether that's a problem you ought to be trying to solve. Worrying about solving unsolvable problems is not a fruitful activity.
Michael has the right answer, I think. (Posting this as an Answer rather than a Comment because Comment sizes are limited; apologies to those who are picky about the distinction.)
If anyone is working with your code, it will be because they need to alter how Japanese sentences are tokenized. They had BETTER be able to deal with Japanese characters at least to some degree, or they'll be unable to test any changes they make.
As you've pointed out, the alternatives are certainly no more readable. Maybe less so; even without knowing Japanese I can read your code and know that you are using the 'あ' character as your delimiter, so if I see that character in an input string I know what the output will be. I have no idea what the character means, but for this simple bit of code analysis I don't need to.
If you want to make it a bit easier for those of us who don't know the full alphabet, then when referring to single characters you could give us the Unicode value in a comment. But any Unicode-capable text editor ought to have a function that tells us the numeric value of the character we've pointed at -- Emacs happily tells me that it's #x3042 -- so that would purely be a courtesy to those of us who probably shouldn't be messing with your code anyway.
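If you do want to leave such a breadcrumb, the code point is easy to compute in Java itself (the variable name here is just illustrative):

```java
public class CodePointNote {
    public static void main(String[] args) {
        char sep = '\u3042'; // HIRAGANA LETTER A (あ)
        // format the character's numeric value the way the Unicode charts do
        System.out.println(String.format("U+%04X", (int) sep)); // prints "U+3042"
    }
}
```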
Related
How can I normalize/unaccent text in Java? I am currently using java.text.Normalizer:
Normalizer.normalize(str, Normalizer.Form.NFD)
.replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
But it is far from perfect. For example, it leaves Norwegian characters æ and ø untouched. Does anyone know of an alternative? I am looking for something that would convert characters in all sorts of languages to just the a-z range. I realize there are different ways to do this (e.g. should æ be encoded as 'a', 'e' or even 'ae'?) and I'm open for any solution. I prefer to not write something myself since I think it's unlikely that I will be able to do this well for all languages. Performance is NOT critical.
The use case: I want to convert a user entered name to a plain a-z ranged name. The converted name will be displayed to the user, so I want it to match as close as possible what the user wrote in his original language.
EDIT:
Alright people, thanks for negging the post and not addressing my question, yay! :) Maybe I should have left out the use case. But please allow me to clarify. I need to convert the name in order to store it internally. I have no control over the choice of letters allowed here. The name will be visible to the user in, for example, the URL. The same way that your user name on this forum is normalized and shown to you in the URL if you click on your name. This forum converts a name like "Bășan" to "baan" and a name like "Øyvind" to "yvind". I believe it can be done better. I am looking for ideas and preferably a library function to do this for me. I know I can not get it right, I know that "o" and "ø" are different, etc, but if my name is "Øyvind" and I register on an online forum, I would likely prefer that my user name is "oyvind" and not "yvind". Hope that this makes any sense! Thanks!
(And NO, we will not allow the user to pick his own user name. I am really just looking for an alternative to java.text.Normalizer. Thanks!)
Assuming you have considered ALL of the implications of what you're doing, ALL the ways it can go wrong, and what you'll do when you get Chinese pictograms and other things that have no equivalent in the Latin alphabet...
There's not a library that I know of that does what you want. If you have a list of equivalencies (as you say, the 'æ' to 'ae' or whatever), you could store them in a file (or, if you're doing this a lot, in a sorted array in memory, for performance reason) and then do a lookup and replace by character. If you have the space in memory to store the (# of unicode characters) as a char array, being able to run through the unicode values of each character and do a straight lookup would be the most efficient.
i.e., \u1234 => lookupArray[0x1234] => 'q'
or whatever.
so you'll have a loop that looks like:
StringBuilder buf = new StringBuilder();
for (int i = 0; i < string.length(); i++) {
    buf.append(lookupArray[string.charAt(i)]); // a char widens to its UTF-16 code unit value
}
I wrote that from scratch, so there are probably some bad method calls or something.
You'll have to do something to handle decomposed characters, probably with a lookahead buffer.
Good luck - I'm sure this is fraught with pitfalls.
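For what it's worth, a minimal sketch combining the Normalizer pass with such a hand-rolled fallback table (the mappings and the toAscii name are my own illustrative choices, not a standard set):

```java
import java.text.Normalizer;
import java.util.Map;

public class Slugify {
    // illustrative fallback mappings for characters NFD leaves untouched
    private static final Map<Character, String> FALLBACK = Map.of(
            'æ', "ae", 'Æ', "Ae", 'ø', "o", 'Ø', "O", 'ß', "ss");

    static String toAscii(String s) {
        // first strip combining marks, then map the leftovers by table
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{Mn}+", "");
        StringBuilder out = new StringBuilder();
        for (char c : decomposed.toCharArray()) {
            String repl = FALLBACK.get(c);
            out.append(repl != null ? repl : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAscii("Øyvind")); // prints "Oyvind"
    }
}
```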
The following code is very well known to convert accented chars into plain Text:
Normalizer.normalize(text, Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
I replaced my "hand made" method with this one, but I need to understand the "regex" part of the replaceAll.
1) What is "InCombiningDiacriticalMarks" ?
2) Where is the documentation of it? (and similars?)
Thanks.
\p{InCombiningDiacriticalMarks} is a Unicode block property. In JDK7, you will be able to write it using the two-part notation \p{Block=CombiningDiacriticalMarks}, which may be clearer to the reader. It is documented here in UAX#44: “The Unicode Character Database”.
What it means is that the code point falls within a particular range, a block, that has been allocated to the things going by that name. This is a poor approach, because there is no guarantee that a code point in that range is or is not any particular thing, nor that code points outside that block are not of essentially the same character.
For example, there are Latin letters in the \p{Latin_1_Supplement} block, like é, U+00E9. However, there are things that are not Latin letters there, too. And of course there are also Latin letters all over the place.
Blocks are nearly never what you want.
In this case, I suspect that you may want to use the property \p{Mn}, a.k.a. \p{Nonspacing_Mark}. All the code points in the Combining_Diacriticals block are of that sort. There are also (as of Unicode 6.0.0) 1087 Nonspacing_Marks that are not in that block.
That is almost the same as checking for \p{Bidi_Class=Nonspacing_Mark}, but not quite, because that group also includes the enclosing marks, \p{Me}. If you want both, you could say [\p{Mn}\p{Me}] if you are using a default Java regex engine, since it only gives access to the General_Category property.
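A quick sketch of that substitution with the stock Java regex engine:

```java
import java.text.Normalizer;

public class StripMarks {
    public static void main(String[] args) {
        // decompose first, then drop nonspacing and enclosing marks
        String stripped = Normalizer.normalize("café", Normalizer.Form.NFD)
                .replaceAll("[\\p{Mn}\\p{Me}]+", "");
        System.out.println(stripped); // prints "cafe"
    }
}
```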
You’d have to use JNI to get at the ICU C++ regex library the way Google does in order to access something like \p{BC=NSM}, because right now only ICU and Perl give access to all Unicode properties. The normal Java regex library supports only a couple of standard Unicode properties. In JDK7 though there will be support for the Unicode Script property, which is just about infinitely preferable to the Block property. Thus you can in JDK7 write \p{Script=Latin} or \p{SC=Latin}, or the short-cut \p{Latin}, to get at any character from the Latin script. This leads to the very commonly needed [\p{Latin}\p{Common}\p{Inherited}].
Be aware that that will not remove what you might think of as “accent” marks from all characters! There are many it will not do this for. For example, you cannot convert Đ to D or ø to o that way. For that, you need to reduce code points to those that match the same primary collation strength in the Unicode Collation Table.
Another place where the \p{Mn} thing fails is of course enclosing marks like \p{Me}, obviously, but also there are \p{Diacritic} characters which are not marks. Sadly, you need full property support for that, which means JNI to either ICU or Perl. Java has a lot of issues with Unicode support, I’m afraid.
Oh wait, I see you are Portuguese. You should have no problems at all then if you only are dealing with Portuguese text.
However, you don’t really want to remove accents, I bet, but rather you want to be able to match things “accent-insensitively”, right? If so, then you can do so using the ICU4J (ICU for Java) collator class. If you compare at the primary strength, accent marks won’t count. I do this all the time because I often process Spanish text. I have an example of how to do this for Spanish sitting around here somewhere if you need it.
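The same primary-strength comparison is also available in the stock JDK via java.text.Collator, if pulling in ICU4J is too heavy (locale choice here is illustrative):

```java
import java.text.Collator;
import java.util.Locale;

public class AccentInsensitive {
    public static void main(String[] args) {
        Collator c = Collator.getInstance(Locale.FRENCH);
        c.setStrength(Collator.PRIMARY); // ignore accent and case differences
        System.out.println(c.compare("café", "cafe")); // prints 0 (equal)
    }
}
```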
Took me a while, but I fished them all out:
Here's regex that should include all the zalgo chars including ones bypassed in 'normal' range.
([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F\u0483-\u0486\u05C7\u0610-\u061A\u0656-\u065F\u0670\u06D6-\u06ED\u0711\u0730-\u073F\u0743-\u074A\u0F18-\u0F19\u0F35\u0F37\u0F72-\u0F73\u0F7A-\u0F81\u0F84\u0e00-\u0eff\uFC5E-\uFC62])
Hope this saves you some time.
I recently learned that Unicode is permitted within Java source code not only as Unicode characters (e.g. double π = Math.PI;) but also as escaped sequences (e.g. double \u03C0 = Math.PI;).
The first variant makes sense to me - it allows programmers to name variables and methods in an international language of their choice. However, I don't see any practical application of the second approach.
Here are a few pieces of code to illustrate usage, tested with Java SE 6 and NetBeans 6.9.1:
This code will print out 3.141592653589793
public static void main(String[] args) {
double π = Math.PI;
System.out.println(\u03C0);
}
Explanation: π and \u03C0 are the same Unicode character
This code will not print out anything
public static void main(String[] args) {
double π = Math.PI; /\u002A
System.out.println(π);
/* a comment */
}
Explanation: The code above actually encodes:
public static void main(String[] args) {
double π = Math.PI; /*
System.out.println(π);
/* a comment */
}
Which comments out the print statement.
Just from my examples, I notice a number of potential problems with this language feature.
First, a bad programmer could use it to secretly comment out bits of code, or create multiple ways of identifying the same variable. Perhaps there are other horrible things that can be done that I haven't thought of.
Second, there seems to be a lack of support among IDEs. Neither NetBeans nor Eclipse provided the correct code highlighting for the examples. In fact, NetBeans even marked a syntax error (though compilation was not a problem).
Finally, this feature is poorly documented and not commonly accepted. Why would a programmer use something in his code that other programmers will not be able to recognize and understand? In fact, I couldn't even find something about this on the Hidden Java Features question.
My question is this:
Why does Java allow escaped Unicode sequences to be used within syntax?
What are some "pros" of this feature that have allowed it to stay a part of Java, despite its many "cons"?
Unicode escape sequences allow you to store and transmit your source code in pure ASCII and still use the entire range of Unicode characters. This has two advantages:
No risk of non-ASCII characters getting broken by tools that can't handle them. This was a real concern back in the early 1990s when Java was designed. Sending an email containing non-ASCII characters and having it arrive unmangled was the exception rather than the norm.
No need to tell the compiler and editor/IDE which encoding to use for interpreting the source code. This is still a very valid concern. Of course, a much better solution would have been to have the encoding as metadata in a file header (as in XML), but this hadn't yet emerged as a best practice back then.
The first variant makes sense to me - it allows programmers to name variables and methods in an international language of their choice. However, I don't see any practical application of the second approach.
Both will result in exactly the same byte code and have the same power as a language feature. The only difference is in the source code.
First, a bad programmer could use it to secretly comment out bits of code, or create multiple ways of identifying the same variable.
If you're concerned about a programmer deliberately sabotaging your code's readability, this language feature is the least of your problems.
Second, there seems to be a lack of support among IDEs.
That's hardly the fault of the feature or its designers. But then, I don't think it was ever intended to be used "manually". Ideally, the IDE would have an option to let you enter the characters normally and have them displayed normally, but automatically save them as Unicode escape sequences. There may even already be plugins or configuration options that make the IDEs behave that way.
But in general, this feature seems to be very rarely used and probably therefore badly supported. But how could the people who designed Java around 1993 have known that?
The nice thing about the \u03C0 encoding is that it is much less likely to be munged by a text editor with the wrong encoding settings. For example a bug in my software was caused by the accidental transformation from UTF-8 é into a MacRoman é by a misconfigured text editor. By specifying the Unicode codepoint, it's completely unambiguous what you mean.
The \uXXXX syntax allows Unicode characters to be represented unambiguously in a file with an encoding not capable of expressing them directly, or if you want a representation guaranteed to be usable even in the lowest common denominator, namely a 7-bit ASCII encoding.
You could represent all your characters with \uXXXX, even spaces and letters, but there is rarely a need to.
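A tiny demonstration that the escape and the literal character are indistinguishable to the compiler:

```java
public class EscapeDemo {
    public static void main(String[] args) {
        // '\u03C0' and 'π' are the very same character after translation
        char escaped = '\u03C0';
        char literal = 'π';
        System.out.println(escaped == literal); // prints "true"
    }
}
```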
First, thank you for the question. I think it is very interesting.
Second, the reason is that a Java source file is text that can itself use various charsets. For example, the default charset in Eclipse is Cp1255. This encoding does not support characters like π. I think they had in mind programmers who have to work on systems that do not support Unicode, and wanted to allow those programmers to create Unicode-enabled software. That was the reason for supporting the \u notation.
The language spec says why this is permitted. There might be other unstated reasons, and unintended benefits and consequences; but this provides a direct answer to the question (emphasis mine):
A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:
A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.
...
I have a snippet of code that looks like this:
double Δt = lastPollTime - pollTime;
double α = 1 - Math.exp(-Δt / τ);
average += α * (x - average);
Just how bad an idea is it to use unicode characters in Java identifiers? Or is this perfectly acceptable?
It's a bad idea, for various reasons.
Many people's keyboards do not support these characters. If I were to maintain that code on a qwerty keyboard (or any other without Greek letters), I'd have to copy and paste those characters all the time.
Some people's editors or terminals might not display these characters properly. For example, some editors (unfortunately) still default to some ISO-8859 (Latin) variant. The main reason why ASCII is still so prevalent is that it nearly always works.
Even if the characters can be rendered properly, they may cause confusion. Straight from Sun (emphasis mine):
Identifiers that have the same external appearance may yet be different. For example, the identifiers consisting of the single letters LATIN CAPITAL LETTER A (A, \u0041), LATIN SMALL LETTER A (a, \u0061), GREEK CAPITAL LETTER ALPHA (A, \u0391), CYRILLIC SMALL LETTER A (a, \u0430) and MATHEMATICAL BOLD ITALIC SMALL A (a, \ud835\udc82) are all different.
...
Unicode composite characters are different from the decomposed characters. For example, a LATIN CAPITAL LETTER A ACUTE (Á, \u00c1) could be considered to be the same as a LATIN CAPITAL LETTER A (A, \u0041) immediately followed by a NON-SPACING ACUTE (´, \u0301) when sorting, but these are different in identifiers.
This is in no way an imaginary problem: α (U+03b1 GREEK SMALL LETTER ALPHA) and ⍺ (U+237a APL FUNCTIONAL SYMBOL ALPHA) are different characters!
There is no way to tell which characters are valid. The characters from your code work, but when I use the FUNCTIONAL SYMBOL ALPHA my Java compiler complains about "illegal character: \9082". Even though the functional symbol would be more appropriate in this code. There seems to be no solid rule about which characters are acceptable, except asking Character.isJavaIdentifierPart().
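Character.isJavaIdentifierPart (and its sibling isJavaIdentifierStart) is indeed the way to probe this; e.g.:

```java
public class IdentProbe {
    public static void main(String[] args) {
        // GREEK SMALL LETTER ALPHA is a letter, so it is legal in identifiers
        System.out.println(Character.isJavaIdentifierPart('\u03B1')); // true
        // APL FUNCTIONAL SYMBOL ALPHA is a math symbol, so it is not
        System.out.println(Character.isJavaIdentifierPart('\u237A')); // false
    }
}
```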
Even though you may get it to compile, it seems doubtful that all Java virtual machine implementations have been rigorously tested with Unicode identifiers. If these characters are only used for variables in method scope, they should get compiled away, but if they are class members, they will end up in the .class file as well, possibly breaking your program on buggy JVM implementations.
That code looks good as it uses the correct symbols, but how many of your team will know the keystrokes for those symbols?
I would use an English representation just to make it easier to type. Also, others might not have a character set that supports those symbols set up on their PC.
That code is fine to read, but horrible to maintain - I suggest use plain English identifiers like so:
double deltaTime = lastPollTime - pollTime;
double alpha = 1 - Math.exp(-deltaTime / tau);
It is perfectly acceptable if it is acceptable in your working group. A lot of the answers here operate on the arrogant assumption that everybody programs in English. Non-English programmers are by no means rare these days and they're getting less rare at an accelerating rate. Why should they restrict themselves to English versions when they have a perfectly good language at their disposal?
Anglophone arrogance aside, there are other legitimate reasons for using non-English identifiers. If you're writing mathematics packages, for example, using Greek is fine if your target is fellow mathematicians. Why should people type out "delta" in your workgroup when everybody can understand "Δ" and likely type it more quickly? Almost any problem domain will have its own jargon and sometimes that jargon is expressed in something other than the Latin alphabet. Why on Earth would you want to try and jam everything into ASCII?
It's an excellent idea. Honest. It's just not easily practicable at this time. Let's keep a reference to it for the future. I would love to see triangles, circles, squares, etc. as part of program code. But for now, please do try to re-write it the way Crozin suggests.
Why not?
If the people working on that code can type those easily, it's acceptable.
But god help those who can't display unicode, or who can't type them.
In a perfect world, this would be the recommended way.
Unfortunately you run into character encodings when moving outside of plain 7-bit ASCII characters (UTF-8 is different from ISO-Latin-1, which is different from UTF-16, etc.), meaning that you eventually will run into problems. This happened to me when moving from Windows to Linux. Our national Scandinavian characters broke in the process, but fortunately they were only in strings. We then used the \u encoding for all of those.
If you can be absolutely certain that you will never, ever run into such a thing - for instance if your files contain a proper BOM - then by all means, do this. It will make your code more readable. If at least the smallest amount of doubt, then don't.
(Please note that the "use non-English languages" is a different matter. I'm just thinking in using symbols instead of letters).
Sorry I couldn't think of a better title, but thanks for reading!
My ultimate goal is to read a .java file, parse it, and pull out every identifier. Then store them all in a list. Two preconditions are that there are no comments in the file and that all identifiers are composed of letters only.
Right now I can read the file, parse it by spaces, and store everything in a list. If anything in the list is a java reserved word, it is removed. Also, I remove any loose symbols that are not attached to anything (brackets and arithmetic symbols).
Now I am left with a bunch of weird strings, but at least they have no spaces in them. I know I am going to have to re-parse everything with a . delimiter in order to pull out identifiers like System.out.print, but what about strings like this example:
Logger.getLogger(MyHash.class.getName()).log(Level.SEVERE,
After re-parsing by . I will be left with more crazy strings like:
getLogger(MyHash
getName())
log(Level
SEVERE,
How am I going to be able to pull out all the identifiers while leaving out all the trash? Just keep re-parsing by every symbol that could exist in java code? That seems rather lame and time consuming. I am not even sure if it would work completely. So, can you suggest a better way of doing this?
There are several solutions that you can use, other than hacking your-own parser:
Use an existing parser, such as this one.
Use BCEL to read bytecode, which includes all fields and variables.
Hack into the compiler or run-time, using annotation processing or mirrors - I'm not sure you can find all identifiers this way, but fields and parameters for sure.
I wouldn't separate the entire file at once according to whitespace. Instead, I would scan the file letter-by-letter, saving every character in a buffer until I'm sure an identifier has been reached.
In pseudo-code:
clean buffer
for each letter l in file:
    if l is ' then toggle "character mode"
    if l is " then toggle "string mode"
    if l is a letter AND "character mode" is off AND "string mode" is off:
        add l to end of buffer
    else:
        if buffer is NOT a keyword or a literal:
            add buffer to list of identifiers
        clean buffer
Notice some lines here hide further complexity - for example, to check if the buffer is a literal you need to check for both true, false, and null.
In addition, there are more bugs in the pseudo-code - it will identify things like the e and L parts of literals (e in floating-point literals, L in long literals) as identifiers as well. I suggest adding additional "modes" to take care of them, but it's a bit tricky.
Also, there are a few more things to handle if you want to make sure it's accurate - for example, you have to make sure you work with Unicode. I would strongly recommend investigating the lexical structure of the language, so you won't miss anything.
EDIT:
This solution can easily be extended to deal with identifiers with numbers, as well as with comments.
Small bug above - you need to handle \" differently than ", same with \' and '.
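The pseudo-code can be sketched in Java roughly like this (the keyword list is deliberately abbreviated, and escaped quotes and literal suffixes are still unhandled, as noted above):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class IdentifierScanner {
    // deliberately abbreviated; a real scanner needs every keyword and literal
    private static final Set<String> RESERVED = Set.of(
            "public", "static", "void", "int", "double",
            "true", "false", "null");

    static List<String> scan(String source) {
        List<String> ids = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        boolean inString = false, inChar = false;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '"' && !inChar) {
                inString = !inString;            // toggle "string mode"
            } else if (c == '\'' && !inString) {
                inChar = !inChar;                // toggle "character mode"
            } else if (Character.isLetter(c) && !inString && !inChar) {
                buf.append(c);                   // still inside a word
                continue;
            }
            if (buf.length() > 0) {              // word ended: flush the buffer
                String word = buf.toString();
                if (!RESERVED.contains(word)) {
                    ids.add(word);
                }
                buf.setLength(0);
            }
        }
        if (buf.length() > 0 && !RESERVED.contains(buf.toString())) {
            ids.add(buf.toString());             // flush a trailing word
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(scan("int foo = bar + 1;")); // prints [foo, bar]
    }
}
```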
Wow, ok. Parsing is hard -- really hard -- to do right. Rolling your own java parser is going to be incredibly difficult to do right. You'll find there are a lot of edge cases you're just not prepared for. To really do it right, and handle all the edge cases, you'll need to write a real parser. A real parser is composed of a number of things:
A lexical analyzer to break the input up into logical chunks
A grammar to determine how to interpret the aforementioned chunks
The actual "parser" which is generated from the grammar using a tool like ANTLR
A symbol table to store identifiers in
An abstract syntax tree to represent the code you've parsed
Once you have all that, you can have a real parser. Of course you could skip the abstract syntax tree, but you need pretty much everything else. That leaves you with writing about 1/3 of a compiler. If you truly want to complete this project yourself, you should see if you can find an example for ANTLR which contains a preexisting java grammar definition. That'll get you most of the way there, and then you'll need to use ANTLR to fill in your symbol table.
Alternately, you could go with the clever solutions suggested by Little Bobby Tables (awesome name, btw Bobby).