In my Java application I have been passed in a string that looks like this:
"\u00a5123"
When printing that string into the console, I get the same string as the output (as expected).
However, I want the Unicode escape converted into the actual yen symbol when the string is printed (\u00a5 -> yen symbol). How would I go about doing this?
That is, the output should look like this: "[yen symbol]123"
I wrote a little program:
public static void main(String[] args) {
    System.out.println("\u00a5123");
}
Its output:
¥123
That is exactly the output you say you want, so I suspect something else is going on. What version of Java are you using?
Edit:
In response to your clarification, there are a couple of different techniques. The most straightforward is to look for a "\u" followed by four hex digits, extract those digits, and replace the whole sequence with the character that has that code (using the Character class). This of course assumes the string contains no "\u" that is meant to be taken literally.
I am not aware of any built-in facility that parses a String as though it were an encoded Java string literal.
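The search-and-replace approach described above could be sketched roughly like this. It is a minimal sketch, not a full parser: the class and method names are made up for illustration, and it assumes every backslash-u sequence in the input is a real escape (it does not handle a doubled backslash).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescaper {
    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces each escape sequence in the input with the character it names.
    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one unchanged.
            out.append(input, last, m.start());
            // Parse the four hex digits and append the resulting UTF-16 code unit.
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslash makes this a literal six-character escape at run time.
        String raw = "\\u00a5123";
        System.out.println(unescape(raw));
    }
}
```

Whether the yen sign actually shows up correctly still depends on the console's encoding.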
As has been mentioned before, these strings will have to be parsed to get the desired result.
Tokenize the string by using \u as separator. For example: \u63A5\u53D7 => { "63A5", "53D7" }
Process these strings as follows:
String hex = "63A5";
int intValue = Integer.parseInt(hex, 16);
System.out.println((char)intValue);
You're probably going to have to write a parser for these, unless you can find one in a third-party library. There is nothing in the JDK that is meant for parsing these for you. I know because I fairly recently had an idea to use these kinds of escapes as a way to smuggle Unicode through a Latin-1-only database. (I ended up doing something else, by the way.)
I will tell you that java.util.Properties escapes and unescapes Unicode characters in this manner when reading and writing files (since the files have to be ASCII). The methods it uses for this are private, so you can't call them, but you could use the JDK source code to inspire your solution.
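Although the conversion methods themselves are private, the public Properties.load(Reader) method applies them, so one possible workaround is to feed the string through it as a property value. This is only a sketch under some assumptions: the key "k" and the class name are arbitrary, and the input must be a single line. Note that load() also processes other backslash escapes and strips leading whitespace from the value, so it is not a general-purpose unescaper.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertiesUnescape {
    // Loads the input as the value of a throwaway property so that
    // Properties.load() performs the escape conversion for us.
    static String unescape(String input) {
        Properties props = new Properties();
        try {
            // "k" is an arbitrary placeholder key.
            props.load(new StringReader("k=" + input));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return props.getProperty("k");
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the escape literal until load() sees it.
        System.out.println(unescape("\\u00a5123"));
    }
}
```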
The Integer.parseInt snippet above could be replaced with a hex literal when the code is known at compile time:
System.out.println((char)0x63A5);
Here is code to print all of the box-drawing Unicode characters (U+2500 to U+257F).
public static void printBox()
{
    for (int i = 0x2500; i <= 0x257F; i++)
    {
        System.out.printf("0x%x : %c\n", i, (char) i);
    }
}
If a name such as "ОХ699" is typed using a different keyboard layout, “OX” is flagged as non-English characters, even though they appear to be English characters. Is there any way to convert characters like these directly to English characters?
I tried the following code to convert "OX" to the English letters "OX":
String subjectString = "ОХ699";
subjectString = Normalizer.normalize(subjectString, Normalizer.Form.NFD);
String resultString = subjectString.replaceAll("[^\\x00-\\x7F]", "");
But it does not convert them to English letters; it removes them instead.
Showing output : "699"
Expected output : "OX699"
This is not possible with the standard library alone; you have to implement your own translation table. The mapping is also ambiguous: some people want to translate Р (R in Cyrillic) to p, and others want r. There is a further problem with Chinese characters and emoji, which have no sensible ASCII equivalent.
There is a Linux program, uni2ascii, that does exactly what you want. You can see how it is implemented at https://salsa.debian.org/debian/uni2ascii/-/blob/master/uni2ascii.c (see the extremely big switch statements).
There is also a very simplified Python clone of this program: https://github.com/ajanin/uni2ascii/blob/master/uni2ascii/__init__.py#L65 . You can copy that switch and implement the translation in your app.
Or install uni2ascii on the server and just call it (or call it using JNI).
Either way, the common practice is simply to ignore and skip non-ASCII characters.
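EDIT: I found an implementation in the Lucene engine, ASCIIFoldingFilter. Note that this link points to the C# port (Lucene.NET); the Java original is the class of the same name in Apache Lucene: https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Analysis.Common/Analysis/Miscellaneous/ASCIIFoldingFilter.cs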
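For the specific case in the question, a hand-written translation table might look something like the sketch below. The table is an illustrative subset of Cyrillic capitals that look like Latin capitals, chosen by visual similarity (so Р maps to P, not R); the class name and the choice of mappings are assumptions, not a standard.

```java
import java.util.Map;

class HomoglyphFolder {
    // Illustrative subset only: Cyrillic capitals mapped to the Latin
    // capitals they resemble. A real table would need many more entries.
    private static final Map<Character, Character> LOOKALIKES = Map.ofEntries(
        Map.entry('\u0410', 'A'),
        Map.entry('\u0412', 'B'),
        Map.entry('\u0415', 'E'),
        Map.entry('\u041A', 'K'),
        Map.entry('\u041C', 'M'),
        Map.entry('\u041D', 'H'),
        Map.entry('\u041E', 'O'),
        Map.entry('\u0420', 'P'),
        Map.entry('\u0421', 'C'),
        Map.entry('\u0422', 'T'),
        Map.entry('\u0425', 'X')
    );

    // Replaces every character found in the table; leaves all others untouched.
    static String fold(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            Character mapped = LOOKALIKES.get(c);
            sb.append(mapped != null ? mapped.charValue() : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Cyrillic O and Cyrillic Ha followed by ASCII digits.
        System.out.println(fold("\u041E\u0425699")); // prints OX699
    }
}
```

Map.ofEntries requires Java 9 or later; on older versions a HashMap filled in a static initializer would do the same job.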
I'm using JFlex and I have to recognize character literals, which can be:
Normal chars, like 'a'
Numbers, like '\126'
I've made this regular expression (Integer is a macro already defined):
Character = (\'.\')|(\'\\{Integer}\')
I don't know if it's OK, but my real problem is that I don't know what code to write to turn both kinds of string into Characters, because this doesn't work:
{Character} { this.yylval = new Character(yytext());
return Parser.CHARACTER; }
Any idea?
You have to write valid Java: the only constructor for Character is Character(char) but you are invoking Character(String).
You need to extract what you want from yytext().
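As a rough idea of what that extraction could look like, here is a small helper. It is a sketch under assumptions: the class and method names are hypothetical, the input is assumed to be exactly what the Character rule matched (quotes included), and the digits after the backslash are assumed to be a decimal character code, which the question does not actually specify.

```java
class CharLiteral {
    // Converts the matched lexeme (a quoted char or a quoted
    // backslash-plus-digits code) into the char it denotes.
    static char toChar(String text) {
        // Strip the surrounding single quotes.
        String body = text.substring(1, text.length() - 1);
        if (body.length() > 1 && body.charAt(0) == '\\') {
            // Assumption: the digits are a decimal character code.
            return (char) Integer.parseInt(body.substring(1));
        }
        // Plain single character between the quotes.
        return body.charAt(0);
    }

    public static void main(String[] args) {
        System.out.println(toChar("'a'"));      // prints a
        System.out.println(toChar("'\\126'"));  // prints ~ (decimal code 126)
    }
}
```

In the lexer action you would then pass yytext() to this helper and wrap the result with Character.valueOf(...) before assigning it to yylval.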
This is the useful part of code:
java.util.List<Element> elems = src.getAllElements();
Iterator it = elems.iterator();
Element el;
String key,value,date="",place="";
String [] data;
int k=0;
Segment content;
String contentstr;
String classname;
while(it.hasNext()){
    el = (Element)it.next();
    if(el.getName().equals("span"))
    {
        classname=el.getAttributeValue("class");
        if(classname.equals("edit_body"))
        {
            //java.util.List<Element> elemsinner = el.getChildElements();
            //Iterator itinner = elemsinner.iterator();
            content=el.getContent();
            contentstr=content.toString();
            if(true)
            {
                System.out.println("Done!");
                System.out.println(classname);
                System.out.println(contentstr);
            }
        }
    }
}
No output. But if I remove the if(classname.equals("edit_body")) condition it does print (in one of the iterations):
Done!
edit_body
"I honestly think it is better to be a failure at something you love than to be a success at something you hate."
Can't get the bug part... help!
I am using an external java library BTW for html parsing.
BTW there are two errors at the start of the output, which appear in both cases, with or without the if condition:
Dec 20, 2012 11:53:11 AM net.htmlparser.jericho.LoggerProviderJava$JavaLogger error SEVERE: EndTag br at (r1992,c60,p94048) not recognised as type '/normal' because its name and closing delimiter are separated by characters other than white space
Dec 20, 2012 11:53:11 AM net.htmlparser.jericho.LoggerProviderJava$JavaLogger error SEVERE: Encountered possible EndTag at (r1992,c60,p94048) whose content does not match a registered EndTagType
I hope those aren't the cause of the problem.
OK guys, somebody please explain this to me! "edit_body".equals(el.getAttributeValue("class")) worked!!
I just had exactly the same problem.
I managed to solve it by using: SomeStringVar.replaceAll("\\P{Print}","");.
This removes every character that is not printable ASCII from the variable, including characters that you can't see (the strings look equal even though they are not really equal).
I applied this to each variable involved in the comparison, and it worked for me as well.
It looks like you have leading or trailing whitespace in your classname.
Try using this:
if(classname.trim().equals("edit_body"))
This will trim any whitespace at either end.
Firstly, String.equals() is NOT broken. It works for millions of other programs / programmers. This is NOT the cause of your problems (unless you or someone has deliberately modified ... and broken your Java installation ...)
So why can two apparently equal strings compare as unequal?
There could be leading or trailing whitespace characters on the String.
There could be embedded non-printing characters.
There could be pairs of Unicode characters that look the same when you display them with a typical font, but in fact are not the same. For instance, the Greek block contains characters that look like Latin vowels but are in fact different codes, and hence are not equal.
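One way to tell which of these cases you are hitting is to print the code point of every character in the string. This is a small sketch; the class and method names are made up for illustration.

```java
class StringInspector {
    // Returns the code points of s as space-separated U+XXXX values,
    // which makes whitespace, invisible characters and lookalikes visible.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // A trailing no-break space is invisible when printed normally.
        System.out.println(describe("ab\u00A0")); // prints U+0061 U+0062 U+00A0
        // Greek omicron looks like Latin o but has a different code.
        System.out.println(describe("\u03BF"));   // prints U+03BF
    }
}
```

Calling describe() on both the attribute value and the literal "edit_body" and comparing the two outputs should show exactly where they differ.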
As a test, change the code to:
classname="edit_body"; //<- hardcode
if(classname.equals("edit_body"))
If the code enters the if statement now, then there must be some difference in the string content when you use the original "classname=el.getAttributeValue("class");".
In that case, loop over the individual characters and compare them to find the difference.
If the code still doesn't enter the if statement, either your code is not compiling and you are running old code, or your Java installation is broken ;-)
Or, if Java is anything like .NET (I don't know Java):
is "el.getAttributeValue" typed as String?
If it is typed as Object, then the if statement would not be entered, since those are two different instances of the same string.
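equals() is a method of the String class, so the text you compare against must be a String literal in double quotes. Single quotes denote a char, which can hold only one character.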
if(someString.equals("something")) ✓
if(someString.equals('something')) ×
I have to print a non-English string in a Java program. I have the string with me. How do I get the Unicode values of its constituent characters so that I can embed the string within the program?
In which codepage do you have that string? Java sources can be in any encoding, so you can put that string right in the source and use compiler's options to set the code page. See NetBeans -> Project node -> Properties -> Source -> Encoding.
The source files were being encoded using "MacRoman" (I found this under Project Properties -> Resource -> Text file encoding). I changed it to "UTF-8", then embedded the actual non-English string in the program and printed it. It worked.
You were perhaps corrupting data either on save or during compilation. Source code doesn't carry any intrinsic encoding information, so it is easy to corrupt string literals that contain characters outside the basic "ASCII" range. Consider using Unicode escape sequences in your source files to avoid this problem. You either do that or you ensure that anyone who comes into contact with the source handles it appropriately at all times - the first way is easier.
If this is for a commercial application, consider externalizing the strings to a resource file.
Java: a rough guide to character encoding
Java: character inspector application
As previous answers said, you can definitely write strings containing characters that can't be encoded in the conventional ISO-8859-1 or US-ASCII character sets directly in the source file. You do need to make sure your IDE saves the file as UTF-8, and you may need to add "-encoding UTF-8" to your javac command to ensure javac reads it correctly.
But I think you're wondering about how to embed the string using "\uXXXX" syntax, perhaps to avoid any issues with the source file encoding. This short code snippet will probably work for you; it crudely assumes that any character whose UTF-16 value is over 255 needs to be escaped.
public static void main(String[] args) {
    String s = args[0];
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c < 256) {
            System.out.print(c);
        } else {
            // Pad to four hex digits: an escape needs exactly four, and Integer.toHexString() yields only three for values below 0x1000.
            System.out.print(String.format("\\u%04x", (int) c));
        }
    }
}
python -c "print repr('text goes here'.decode('utf-8'))"
This one-liner uses Python 2 syntax. The encoding may not always be 'utf-8', but that is a sane starting point.
I just came across something like this:
String sample = "somejunk+%3cfoobar%3e+morestuff";
Printed out, sample looks like this:
somejunk+<foobar>+morestuff
How does that work? U+003c and U+003e are the Unicode codes for the less than and greater than signs, respectively, which seems like more than a coincidence, but I've never heard of Java automatically doing something like this. I figured it'd be an easy thing to pop into Google, but it turns out Google doesn't like the percent sign.
That string is probably URL-encoded. You'd decode it in Java using URLDecoder:
String res = java.net.URLDecoder.decode(sample, "UTF-8");
URLDecoder turns each + into a space. If the plus signs should survive decoding, you can do something like this:
String sample = "somejunk+%3cfoobar%3e+morestuff";
String result = URLDecoder.decode(sample.replaceAll("\\+", "%2B"), "UTF8");
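To make the difference between the two calls concrete, here is a small self-contained comparison. The class and method names are made up for illustration; the checked exception is wrapped only to keep the example short, since UTF-8 is always available.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class UrlDecodeDemo {
    // Plain decoding: percent escapes are decoded and + becomes a space.
    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Protects plus signs by percent-encoding them first, so they decode back to +.
    static String decodeKeepingPlus(String s) {
        return decode(s.replace("+", "%2B"));
    }

    public static void main(String[] args) {
        String sample = "somejunk+%3cfoobar%3e+morestuff";
        System.out.println(decode(sample));            // prints somejunk <foobar> morestuff
        System.out.println(decodeKeepingPlus(sample)); // prints somejunk+<foobar>+morestuff
    }
}
```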
Java does support Unicode escapes in char and String literals, but not URL encoding.
Unicode escapes use the form '\uXXXX', where XXXX is a UTF-16 code unit written as exactly four hexadecimal digits.
Curious tidbit: The grammar allows 'u' to occur multiple times, so that '\uuuuuuuu0041' is a valid Unicode escape (for 'A').
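A quick way to convince yourself of this is to compile something like the following sketch (the class and field names are arbitrary). Both literals denote the same character because the compiler translates the escapes before it even tokenizes the source.

```java
class MultiUEscape {
    // The usual form of the escape for capital A.
    static final char SINGLE = '\u0041';
    // The same escape written with several u characters in a row.
    static final char MANY = '\uuuuuuuu0041';

    public static void main(String[] args) {
        System.out.println(SINGLE == MANY); // prints true
        System.out.println(MANY);           // prints A
    }
}
```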