Map supplementary Unicode characters to BMP (if possible) - java

I ran into the issue that my XML parser (VTD-XML) doesn't seem to be able to handle Unicode supplementary characters (please correct me if I'm already wrong here). It seems the parser only uses the lower 16 bits of such characters.
I cannot switch to another parser within the project I'm working on. I am parsing Medline abstracts (https://www.ncbi.nlm.nih.gov/pubmed), and it seems that documents containing supplementary characters have been added over the last year (e.g. https://www.ncbi.nlm.nih.gov/pubmed/?term=26855708, end of the results section).
As a quick and dirty fix I would just delete all characters above 0xFFFF from the documents. Obviously, that will destroy some expressions in the document texts and so I'm not really happy with that solution.
Since I can't change the parser, I was wondering whether it is possible to map supplementary characters to characters within the BMP that are likely to have a glyph of similar appearance, where such a character exists.
Of course I welcome any other idea. It would even be fine to replace the supplementary characters with some kind of placeholder and then put the original character back in but this seems error prone. Better ideas?
Edit: Here is some - hopefully - minimal example of how this issue comes up with VTD-XML:
@Test
public void parseUnicodeBeyondBMP() throws NavException, FileNotFoundException, IOException, EncodingException, EOFException, EntityException, ParseException {
// character code point 0x10400
String unicode = "<supplementary>\uD801\uDC00</supplementary>";
byte[] unicodeBytes = unicode.getBytes("UTF-8");
assertEquals(unicode, new String(unicodeBytes, "UTF-8"));
VTDGen vg = new VTDGen();
vg.setDoc(unicodeBytes);
vg.parse(false);
VTDNav vn = vg.getNav();
long fragment = vn.getContentFragment();
int offset = (int) fragment;
int length = (int) (fragment >> 32);
String originalBytePortion = new String(Arrays.copyOfRange(unicodeBytes, offset, offset+length), "UTF-8");
String vtdString = vn.toRawString(offset, length);
// this actually succeeds
assertEquals("\uD801\uDC00", originalBytePortion);
// this fails ;-( the returned character is Ѐ, codepoint 0x400, thus the high surrogate is missing
assertEquals("\uD801\uDC00", vtdString);
}
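One way the mapping asked about above could be approached is Unicode compatibility normalization: some supplementary characters, such as the Mathematical Alphanumeric Symbols, have an NFKC form that lies inside the BMP. The following is only a sketch under that assumption. The class name SupplementaryMapper, the method toBmp and the "[U+...]" placeholder format are made up for this illustration, and characters without a BMP equivalent (such as U+10400 from the test above) simply become placeholders.

```java
import java.text.Normalizer;

class SupplementaryMapper {
    /**
     * Replaces every supplementary code point by its NFKC form when that form
     * lies entirely in the BMP, and by a textual placeholder otherwise.
     * (Hypothetical helper, not part of VTD-XML or the JDK.)
     */
    static String toBmp(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isSupplementaryCodePoint(cp)) {
                sb.appendCodePoint(cp); // BMP characters pass through unchanged
                continue;
            }
            String original = new String(Character.toChars(cp));
            String normalized = Normalizer.normalize(original, Normalizer.Form.NFKC);
            boolean allBmp = normalized.codePoints()
                    .noneMatch(Character::isSupplementaryCodePoint);
            // Placeholder format is an assumption; pick one that cannot occur in the data
            sb.append(allBmp ? normalized : String.format("[U+%X]", cp));
        }
        return sb.toString();
    }
}
```

A placeholder that encodes the code point, as here, can be turned back into the original character after parsing, which is less error prone than deleting the character.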

Related

Splitting a string with byte length limits in java

I want to split a String into a String[] array whose elements meet the following conditions.
s.getBytes(encoding).length should not exceed maxsize (an int).
If I join the split strings with a StringBuilder or the + operator, the result should be exactly the original string.
The input string may have unicode characters which can have multiple bytes when encoded in e.g. UTF-8.
The desired prototype is shown below.
public static String[] SplitStringByByteLength(String src,String encoding, int maxsize)
And the testing code:
public boolean isNice(String str, String encoding, int max)
{
//boolean success=true;
StringBuilder b=new StringBuilder();
String[] splitted= SplitStringByByteLength(str,encoding,max);
for(String s: splitted)
{
if(s.getBytes(encoding).length>max)
return false;
b.append(s);
}
if(str.compareTo(b.toString())!=0)
return false;
return true;
}
Though it seems easy when the input string has only ASCII characters, the fact that it could contain multibyte characters confuses me.
Thank you in advance.
Edit: I added my code implementation. (Inefficient)
public static String[] SplitStringByByteLength(String src,String encoding, int maxsize) throws UnsupportedEncodingException
{
ArrayList<String> splitted=new ArrayList<String>();
StringBuilder builder=new StringBuilder();
int i=0;
// stop at the end of the string; Java strings have no '\0' terminator
while(i<src.length())
{
String tmp=builder.toString();
char c=src.charAt(i);
builder.append(c);
if(builder.toString().getBytes(encoding).length>maxsize)
{
// tmp is the longest chunk that still fits; c starts the next chunk
splitted.add(tmp);
builder=new StringBuilder();
builder.append(c);
}
++i;
}
// keep the last, possibly shorter, chunk
if(builder.length()>0)
splitted.add(builder.toString());
return splitted.toArray(new String[splitted.size()]);
}
Is this the only way to solve this problem?
The class CharsetEncoder has provision for your requirement. Extract from the Javadoc of the encode method:
public final CoderResult encode(CharBuffer in,
ByteBuffer out,
boolean endOfInput)
Encodes as many characters as possible from the given input buffer, writing the results to the given output buffer...
In addition to reading characters from the input buffer and writing bytes to the output buffer, this method returns a CoderResult object to describe its reason for termination:
...
CoderResult.OVERFLOW indicates that there is insufficient space in the output buffer to encode any more characters. This method should be invoked again with an output buffer that has more remaining bytes. This is typically done by draining any encoded bytes from the output buffer.
A possible code could be:
public static String[] SplitStringByByteLength(String src,String encoding, int maxsize) {
Charset cs = Charset.forName(encoding);
CharsetEncoder coder = cs.newEncoder();
ByteBuffer out = ByteBuffer.allocate(maxsize); // output buffer of required size
CharBuffer in = CharBuffer.wrap(src);
List<String> ss = new ArrayList<>(); // a list to store the chunks
int pos = 0;
while(true) {
CoderResult cr = coder.encode(in, out, true); // try to encode as much as possible
int newpos = src.length() - in.length();
String s = src.substring(pos, newpos);
ss.add(s); // add what has been encoded to the list
pos = newpos; // store new input position
out.rewind(); // and rewind output buffer
if (! cr.isOverflow()) {
break; // everything has been encoded
}
}
return ss.toArray(new String[0]);
}
This will split the original string in chunks that when encoded in bytes fit as much as possible in byte arrays of the given size (assuming of course that maxsize is not ridiculously small).
The problem lies in the existence of Unicode "supplementary characters" (see Javadoc of the Character class), that take up two "character places" (a surrogate pair) in a String, and you shouldn't split your String in the middle of such a pair.
An easy approach to splitting is to assume the worst case, in which every Unicode code point takes four bytes in UTF-8 (the maximum), and, for a limit of 400 bytes, to split the string after every 99 code points (using string.offsetByCodePoints(pos, 99)). In most cases you won't fill the 400 bytes, but you'll be on the safe side.
Some words about code points and characters
When Java started, Unicode had fewer than 65536 characters, so Java decided that 16 bits were enough for a character. Later the Unicode standard exceeded the 16-bit limit, and Java had a problem: a single Unicode element (now called a "code point") no longer fit into a single Java character.
They decided to go for an encoding into 16-bit entities, being 1:1 for most usual code points, and occupying two "characters" for the exotic code points beyond the 16-bit limit (the pair built from so-called "surrogate characters" from a spare code range below 65535). So now it can happen that e.g. string.charAt(5) and string.charAt(6) must be seen in combination, as a "surrogate pair", together encoding one Unicode code point.
That's the reason why you shouldn't split a string at an arbitrary index.
To help the application programmer, the String class then got a new set of methods, working in code point units, and e.g. string.offsetByCodePoints(pos, 99) means: from the index pos, advance by 99 code points forward, giving an index that will often be pos+99 (in case the string doesn't contain anything exotic), but might be up to pos+198, if all the following string elements happen to be surrogate pairs.
Using the code-point methods, you are safe not to land in the middle of a surrogate pair.
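The worst-case approach described above could be sketched roughly like this. It assumes UTF-8 as the target encoding (at most four bytes per code point); the class name CodePointSplitter and the method split are invented for the example. It is deliberately simple rather than efficient, since codePointCount rescans the rest of the string on every pass.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointSplitter {
    /**
     * Splits src into chunks of at most maxBytes / 4 code points each, so that
     * every chunk fits into maxBytes bytes when encoded as UTF-8.
     */
    static List<String> split(String src, int maxBytes) {
        int perChunk = maxBytes / 4; // worst case: 4 UTF-8 bytes per code point
        if (perChunk < 1) {
            throw new IllegalArgumentException("maxBytes must be at least 4");
        }
        List<String> chunks = new ArrayList<>();
        int pos = 0;
        while (pos < src.length()) {
            // Never ask offsetByCodePoints for more code points than are left
            int remaining = src.codePointCount(pos, src.length());
            int end = src.offsetByCodePoints(pos, Math.min(perChunk, remaining));
            chunks.add(src.substring(pos, end)); // end never falls inside a surrogate pair
            pos = end;
        }
        return chunks;
    }
}
```

Chunks produced this way are usually shorter than the limit allows; that is the price for never having to encode anything while splitting.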

How to convert clob to string with encoding in java

We are doing massive batch of xml processing and the logic to convert clob to string is shown below.
import java.sql.Clob
import org.apache.commons.io.IOUtils
String extractXml(Clob xmlClob) {
log.info "DefaultCharset: " + groovy.util.CharsetToolkit.getDefaultSystemCharset()
String sourceXml
try {
sourceXml = new String(IOUtils.toByteArray(xmlClob?.getCharacterStream()), encoding) // 1. Encoding not working
sourceXml = new String(IOUtils.toByteArray(xmlClob?.getCharacterStream(), encoding), encoding) // 2. Encoding working
} catch (Exception e) {
...
}
return sourceXml
}
My queries:
a. I am not sure why (1) doesn't work even though I am using getCharacterStream() instead of getAsciiStream(),
while (2) seems to work fine. Maybe it is because I am explicitly overriding the system encoding?
b. Solution (2) looks a bit odd, as the encoding is specified twice (once for the byte array and once for the string creation).
I am not sure whether there are any performance issues, and I wonder if there are better ways to write it.
c. I thought of dropping the Apache Commons libraries and using a plain Java solution.
The surprising thing is that I did not give any explicit encoding, but it seems to work perfectly.
Is it because it streams characters straight into the string buffer?
/*
* works perfectly and returns the text correctly encoded
*/
String extractXmlWithoutApacheCommons(Clob xmlClob) {
log.info "DefaultCharset: " + groovy.util.CharsetToolkit.getDefaultSystemCharset()
StringBuffer sb = new StringBuffer((int) xmlClob.length())
try {
Reader r = xmlClob.getCharacterStream()
char[] cbuf = new char[2048]
int n = 0
while ((n = r.read(cbuf, 0, cbuf.length)) != -1) {
if (n > 0) {
sb.append(cbuf, 0, n)
}
}
} catch (Exception e) {
...
}
return sb.toString()
}
Can you please shed some light on these points?
The Clob already has an encoding. It's whatever you've specified in the database, and once you read it on Java side it'll be a String (with the implicit UTF-16 encoding, not that it matters at all).
Whatever you think you're doing with all those encoding tricks is wrong and useless. You only need to specify an encoding when turning bytes to chars or the other way around. You're dealing with chars only (except in your first example where you for some unknown reason want to turn them to bytes).
If you want to use IOUtils, then readFully(Reader input, char[] buffer) would be the method to use.
The platform default encoding has no effect in this whole question, since you shouldn't be working with bytes at all.
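To illustrate the point that a charset only matters at a char/byte boundary, here is a small sketch using plain JDK classes. The class and method names are made up. It plausibly mirrors what goes wrong in variant (1) of the question, on the assumption that the intermediate bytes there are produced with a different charset than the one used to decode them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingRoundTrip {
    /** Encodes chars to bytes with one charset and decodes them with another. */
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith); // chars -> bytes: a charset is needed
        return new String(bytes, decodeWith);     // bytes -> chars: a charset is needed again
    }

    public static void main(String[] args) {
        String xml = "<a>caf\u00e9</a>";
        // Same charset in both directions: the text survives
        System.out.println(xml.equals(
                roundTrip(xml, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));
        // Different charsets: the accented character is damaged
        System.out.println(xml.equals(
                roundTrip(xml, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8)));
    }
}
```

Reading the Clob through its character stream avoids the detour through bytes entirely, so neither charset choice comes into play.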
Edit:
A slightly more modern way with the standard JDK classes would be to use Reader.read(CharBuffer target) like
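Reader r = xmlClob.getCharacterStream();
CharBuffer cb = CharBuffer.allocate((int) xmlClob.length());
// stop at end of stream or once the buffer is full (read() returns 0 on a full buffer)
while (cb.hasRemaining() && r.read(cb) != -1)
;
// flip so that toString() covers the characters just read
cb.flip();
return cb.toString();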
but it doesn't really make a huge difference (it's a bit nicer looking).

How to compare Chinese characters in Java using 'equals()'

I want to compare a string portion (i.e. character) against a Chinese character. I assume due to the Unicode encoding it counts as two characters, so I'm looping through the string with increments of two. Now I ran into a roadblock where I'm trying to detect the '兒' character, but equals() doesn't match it, so what am I missing ? This is the code snippet:
for (int CharIndex = 0; CharIndex < tmpChar.length(); CharIndex=CharIndex+2) {
// Account for 'r' like in dianr/huir
if (tmpChar.substring(CharIndex,CharIndex+2).equals("兒")) {
Also, feel free to suggest a more elegant way to parse this ...
[UPDATE] Some pics from the debugger, showing that it doesn't match, even though it should. I pasted the Chinese character from the spreadsheet I use as input, so I don't think it's a copy and paste issue (unless the unicode gets lost along the way)
oh, dang, apparently it does not work simply copy and pasting:
Use CharSequence.codePoints(), which returns a stream of the codepoints, rather than having to deal with chars:
tmpChar.codePoints().forEach(c -> {
if (c == '兒') {
// ...
}
});
(Of course, you could have used tmpChar.codePoints().filter(c -> c == '兒').forEach(c -> { /* ... */ })).
Either work with chars, treating 兒 as a substring:
String s = ...;
if (s.contains("兒")) { ... }
int position = s.indexOf("兒");
if (position != -1) {
int position2 = position + "兒".length();
s = s.substring(0, position) + "*" + s.substring(position2);
}
if (s.startsWith("兒", i)) {
// At position i there is a 兒.
}
Or work with code points, where 兒 is a single code point. As that is not really easier, substrings of variable length seem fine.
if (tmpChar.substring(CharIndex,CharIndex+2).equals("兒")) {
is your problem. 兒 is only one UTF-16 code unit, so a two-char substring can never equal it. Java uses UTF-16, and many Chinese characters can be represented in a single code unit; other characters, however, need two.
There are a variety of APIs on the String class for coping.
As offered in another answer, obtaining the IntStream from codePoints() gives you a 32-bit code point for each character. You can compare that to the code point value of the character you are looking for.
Or, you can use the ICU4J library with a richer set of facilities for all of this.
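For illustration, a loop that steps through a string by whole code points, so that it works both for one-unit characters and for surrogate pairs, might look like the sketch below. The class name CodePointScan and the method count are made up; the escape used in main is the character from the question, U+5152.

```java
class CodePointScan {
    /** Counts occurrences of the code point target, advancing by whole code points. */
    static int count(String s, int target) {
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp == target) {
                n++;
            }
            i += Character.charCount(cp); // 1 for BMP characters, 2 for surrogate pairs
        }
        return n;
    }

    public static void main(String[] args) {
        String er = "\u5152"; // the character from the question, inside the BMP
        System.out.println(er.length()); // number of UTF-16 code units it occupies
        System.out.println(count("dian\u5152 hui\u5152", 0x5152));
    }
}
```

Stepping by Character.charCount(cp) replaces the fixed increment of two from the question, which skips every other character when the text consists of BMP characters.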

Create string with emoji unicode flag countries

I need to create a String with a country flag Unicode emoji. I did this:
StringBuffer sb = new StringBuffer();
sb.append(StringEscapeUtils.unescapeJava("\\u1F1EB"));
sb.append(StringEscapeUtils.unescapeJava("\\u1F1F7"));
I expected one country flag, but I didn't get it. How can I get a Unicode country flag emoji in a String from the Unicode code points?
The problem is that the "\uXXXX" notation takes exactly 4 hexadecimal digits, forming a 16-bit char.
You have Unicode code points above the 16-bit range, both U+1F1EB and U+1F1F7. Each of these is represented with two chars, a so-called surrogate pair.
You can either use the codepoints to create a string:
int[] codepoints = {0x1F1EB, 0x1F1F7};
String s = new String(codepoints, 0, codepoints.length);
Or use the surrogate pairs, derivable like this:
System.out.print("\"");
for (char ch : s.toCharArray()) {
System.out.printf("\\u%04X", (int)ch);
}
System.out.println("\"");
Giving
"\uD83C\uDDEB\uD83C\uDDF7"
Response to the comment: How to Decode
"\uD83C\uDDEB" are two surrogate 16 bit chars representing U+1F1EB and "\uD83C\uDDF7" is the surrogate pair for U+1F1F7.
private static final int CP_REGIONAL_INDICATOR = 0x1F1E6; // REGIONAL INDICATOR SYMBOL LETTER A; A-Z flag codes.
/**
* Get the flag codes of two (or one) regional indicator symbols.
* @param s string starting with 1 or 2 regional indicator symbols.
* @return one or two ASCII letters for the flag, or null.
*/
public static String regionalIndicator(String s) {
int cp0 = regionalIndicatorCodePoint(s);
if (cp0 == -1) {
return null;
}
StringBuilder sb = new StringBuilder();
sb.append((char)(cp0 - CP_REGIONAL_INDICATOR + 'A'));
int n0 = Character.charCount(cp0);
int cp1 = regionalIndicatorCodePoint(s.substring(n0));
if (cp1 != -1) {
sb.append((char)(cp1 - CP_REGIONAL_INDICATOR + 'A'));
}
return sb.toString();
}
private static int regionalIndicatorCodePoint(String s) {
if (s.isEmpty()) {
return -1;
}
int cp0 = s.codePointAt(0);
return CP_REGIONAL_INDICATOR > cp0 || cp0 >= CP_REGIONAL_INDICATOR + 26 ? -1 : cp0;
}
System.out.println("Flag: " + regionalIndicator("\uD83C\uDDEB\uD83C\uDDF7"));
Flag: FR
You should be able to do that simply using toChars from java.lang.Character.
This works for me:
StringBuffer sb = new StringBuffer();
sb.append(Character.toChars(127467));
sb.append(Character.toChars(127479));
System.out.println(sb);
prints 🇫🇷, which the client can choose to display as a French flag, or in other ways.
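Building on the code point approach, the two numbers do not have to be looked up by hand: the regional indicator symbols for A to Z occupy the consecutive code points U+1F1E6 to U+1F1FF. A minimal sketch, with the class name FlagEmoji and the method flagOf invented for this example:

```java
import java.util.Locale;

class FlagEmoji {
    // REGIONAL INDICATOR SYMBOL LETTER A; B to Z follow consecutively
    private static final int REGIONAL_INDICATOR_A = 0x1F1E6;

    /** Builds the flag sequence for a two-letter country code such as "FR". */
    static String flagOf(String countryCode) {
        if (countryCode == null || !countryCode.matches("[A-Za-z]{2}")) {
            throw new IllegalArgumentException("expected two ASCII letters");
        }
        StringBuilder sb = new StringBuilder();
        for (char c : countryCode.toUpperCase(Locale.ROOT).toCharArray()) {
            // appendCodePoint writes the surrogate pair for us
            sb.appendCodePoint(REGIONAL_INDICATOR_A + (c - 'A'));
        }
        return sb.toString();
    }
}
```

flagOf("FR") yields the same four chars as the "\uD83C\uDDEB\uD83C\uDDF7" literal; whether they are drawn as a flag still depends on the font and platform.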
If you want to use emojis often, it could be good to use a library that would handle that unicode stuff for you: emoji-java
You would just add the maven dependency:
<dependency>
<groupId>com.vdurmont</groupId>
<artifactId>emoji-java</artifactId>
<version>1.0.0</version>
</dependency>
And call the EmojiManager:
Emoji emoji = EmojiManager.getForAlias("fr");
System.out.println("HEY: " + emoji.getUnicode());
The entire list of supported emojis is here.
I suppose you want to achieve something like this
Let me give you two examples of the Unicode escapes for country flags:
for ROMANIA ---> \uD83C\uDDF7\uD83C\uDDF4
for AMERICA ---> \uD83C\uDDFA\uD83C\uDDF8
You can get these and other country flag escapes from this site: Emoji Unicodes
On the site you will see a table with a lot of emoji. Select the FLAGS tab of that table (it is easy to find) and all the country flags will appear. Select exactly one flag from the list. A text code will then appear in the message box; it is not important. What matters is the right-hand side of the page, where the flag and country name of your selection appear. Click on that, and on the page that opens find the table named Emoji Character Encoding Data. Scroll to the last part of the table, to the row labelled C/C++/Java Src; there you will find the correct escape sequence for the flag. Always pick the long form shown above (two surrogate pairs); if you are not careful you may pick a short, single escape instead.
Finally, here is sample code from an Android app of mine; it works the same way in plain Java.
ArrayList<String> listLanguages = new ArrayList<>();
listLanguages.add("\uD83C\uDDFA\uD83C\uDDF8 " + getString(R.string.English));
listLanguages.add("\uD83C\uDDF7\uD83C\uDDF4 " + getString(R.string.Romanian));
Another simple custom example:
String flagCountryName = "\uD83C\uDDEF\uD83C\uDDF2 Jamaica";
You can use this variable where you need it. This will show you the flag of Jamaica in front of the text.
This is all, if you did not understand something just ask.
Look at Creating Unicode character from its number
Could not get my machine to print the Unicode you have there, but for other values it works.

Losing a Unicode/ASCII element when parsing an HTML document with Jsoup

I ran into strange behavior when parsing an HTML page that contains a Unicode/ASCII element. Here is the example: git://gist.github.com/2995626.git.
What I did is:
File layout = new File(html_file);
Document doc = Jsoup.parse(layout, "UTF-8");
System.out.println(doc.toString());
What I expected was the HTML triangle, but it is converted to "â–¼". Do you have any suggestions?
Thanks in advance.
Jsoup is perfectly capable of parsing HTML using UTF-8. Even more, it's its default character encoding already. Your problem is caused elsewhere. Based on the information provided so far, I can see two possible problem causes:
The HTML file was originally not saved using UTF-8 (or perhaps it's one step before; it's originally not been read using UTF-8).
The stdout (there where the System.out goes to) does not use UTF-8.
If you make sure that both are correctly set, then your problem should disappear. If not, then there's another possible cause which is not guessable based on the information provided so far in your question. At least, this blog should bring a lot of new insight: Unicode - How to get the characters right?
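The symptom in the question can be reproduced without Jsoup, which supports the two causes listed above. The sketch below assumes the triangle is U+25BC and that the wrong charset involved is windows-1252; both are guesses based on the garbled output shown, and the class name MojibakeDemo is made up.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    /** Decodes the UTF-8 bytes of text with another charset, as a misconfigured reader or console would. */
    static String misdecode(String text, Charset wrong) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrong);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String triangle = "\u25BC"; // BLACK DOWN-POINTING TRIANGLE
        // Wrap stdout so that this program itself writes UTF-8
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(misdecode(triangle, Charset.forName("windows-1252"))); // three garbled characters
        out.println(triangle); // the triangle, if the terminal also uses UTF-8
    }
}
```

The three bytes of the UTF-8 encoded triangle come out as three separate windows-1252 characters, which matches the output quoted in the question; that points to a charset mismatch on input or output rather than to Jsoup.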
The problem is caused by Unicode escapes. Here is an example; try the code below, and the result will show you why your code is not working.
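public static void main(String[] argv) {
// the backslashes are doubled so that the string contains literal escape sequences
String test = "Ch\\u00e0o bu\\u1ed5i s\\u00e1ng";
System.out.println(unicode2String(test));
}
/**
* Convert Unicode escape sequences to a string
*/
public static String unicode2String(String unicode) {
StringBuffer string = new StringBuffer();
String[] hex = unicode.split("\\\\u");
string.append(hex[0]);
for (int i = 1; i < hex.length; i++) {
// convert the four hex digits of each escape to a char
int data = Integer.parseInt(hex[i].substring(0, 4), 16);
// append the char, then the plain text that followed the escape
string.append((char) data);
string.append(hex[i].substring(4));
}
return string.toString();
}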
Maybe your code should be as follows:
System.out.println(unicode2String(doc.toString()));
