Change InputStream charset after being set - java

If a string of data contains characters with different encodings, is there a way to change the charset encoding after an input stream has been created, or any suggestion on how this could be achieved?
Example to help explain:
// data: read the first 4 characters using UTF-8 and the next 4 characters using ISO-8859-2?
String data = "testўёѧẅ";
// uses the platform's default charset; could pass in a charset
try (InputStream in = new ByteArrayInputStream(data.getBytes())) {
    // an InputStreamReader, to use char instead of byte, would probably be clearer, but hopefully the idea comes across
    byte[] bytes = new byte[4];
    while (in.read(bytes) != -1) {
        // TODO: change the charset here to UTF-8 then read values
        // TODO: change the charset here to ISO-8859-2 then read values
    }
}
Been looking at decoders, might be the way to go:
What is CharsetDecoder.decode(ByteBuffer, CharBuffer, endOfInput)
Encoding conversion in java
Attempt using same input stream:
String data = "testўёѧẅ";
InputStream inputStream = new ByteArrayInputStream(data.getBytes());
Reader r = new InputStreamReader(inputStream, "UTF-8");
int intch;
int count = 0;
while ((intch = r.read()) != -1) {
    System.out.println((char) intch);
    if ((++count) == 4) {
        r = new InputStreamReader(inputStream, Charset.forName("ISO-8859-2"));
    }
}
// outputs "test" but not the 2nd part

Assuming that you know there will be n UTF-8 characters and m ISO-8859-2 characters in your stream (n=4, m=4 in your example), you can do this by using two different InputStreamReaders working on the same InputStream:
try (InputStream in = new ByteArrayInputStream(data.getBytes())) {
    InputStreamReader inUtf8 = new InputStreamReader(in, StandardCharsets.UTF_8);
    InputStreamReader inIso88592 = new InputStreamReader(in, Charset.forName("ISO-8859-2"));
    // read `n` characters using inUtf8, then read `m` characters using inIso88592
}
Note that you need to read characters, not bytes (i.e. check how many characters have been read so far, as in UTF-8 a single character may be encoded as 1-4 bytes).
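Note also that this can fail in practice: an InputStreamReader may read ahead on the underlying stream, which is why the attempt in the question printed only test — by the time the second reader was created, the first one had already consumed all the bytes of the ByteArrayInputStream. The CharsetDecoder the question mentions avoids this, because a decoder advances its ByteBuffer only past the bytes it actually consumed. A minimal sketch (using the Latin-2-encodable stand-ins śćźż for the second part, since the Cyrillic letters from the example cannot be encoded in ISO-8859-2):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

Charset latin2 = Charset.forName("ISO-8859-2");
// hypothetical input: 4 chars encoded as UTF-8 followed by 4 chars encoded as ISO-8859-2
ByteBuffer in = ByteBuffer.allocate(32);
in.put("test".getBytes(StandardCharsets.UTF_8));
in.put("śćźż".getBytes(latin2));
in.flip();

CharBuffer out = CharBuffer.allocate(4); // room for exactly n = 4 characters
CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder();
utf8.decode(in, out, false); // stops when `out` is full; `in` is left at the first undecoded byte
out.flip();
System.out.println(out); // test

out.clear();
latin2.newDecoder().decode(in, out, true); // decode the remaining m = 4 characters
out.flip();
System.out.println(out); // śćźż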

A String contains Unicode, so it can combine all language scripts.
String data = "testўёѧẅ";
For that, String uses a char array, where a char is a UTF-16 code unit. Sometimes a Unicode symbol, a code point, needs to be encoded as two chars, so char maps Unicode code points exactly only for a part of Unicode. Here something like this might do:
String d1 = data.substring(0, 4);
byte[] b1 = d1.getBytes(StandardCharsets.UTF_8); // Binary data, UTF-8 text
String d2 = data.substring(4);
Charset charset = Charset.forName("ISO-8859-2");
byte[] b2 = d2.getBytes(charset); // Binary data, Latin-2 text
The number of bytes does not need to correspond to the number of code points.
Also, é might be 1 code point, é, or two code points: e and a zero-width combining accent ´.
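A quick way to see that difference, using java.text.Normalizer (a sketch; NFC composes, NFD decomposes):

import java.text.Normalizer;

String composed = Normalizer.normalize("é", Normalizer.Form.NFC);
String decomposed = Normalizer.normalize("é", Normalizer.Form.NFD);
System.out.println(composed.codePoints().count());   // 1 (é)
System.out.println(decomposed.codePoints().count()); // 2 (e + combining acute accent)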
To split text by script or Unicode block:
data.codePoints().forEach(cp -> System.out.printf("%-35s - %-25s - %s%n",
    Character.getName(cp),
    Character.UnicodeBlock.of(cp),
    Character.UnicodeScript.of(cp)));
Name - Unicode block - Script
LATIN SMALL LETTER T - BASIC_LATIN - LATIN
LATIN SMALL LETTER E - BASIC_LATIN - LATIN
LATIN SMALL LETTER S - BASIC_LATIN - LATIN
LATIN SMALL LETTER T - BASIC_LATIN - LATIN
CYRILLIC SMALL LETTER SHORT U - CYRILLIC - CYRILLIC
CYRILLIC SMALL LETTER IO - CYRILLIC - CYRILLIC
CYRILLIC SMALL LETTER LITTLE YUS - CYRILLIC - CYRILLIC
LATIN SMALL LETTER W WITH DIAERESIS - LATIN_EXTENDED_ADDITIONAL - LATIN


How to split a string containing non ascii characters based on the byte size limit?
I want to split the string below and add the pieces to a List, where the split is based on a size limit, e.g. 3 bytes.
The problem is that an extended ASCII character takes 2 bytes, and after the split the data becomes junk, as shown in the actual output.
What I want is the expected output given below; it's OK to write only 2 bytes if we come across a non-ASCII character. Please let me know how to resolve it.
Problem:
String words = "Hello woræd æåéøòôóâ";
List<String> payloads = new ArrayList<>();
try (ByteArrayOutputStream outStream = new ByteArrayOutputStream()) {
    byte[] chars = words.getBytes(StandardCharsets.UTF_8);
    for (byte ch : chars) {
        outStream.write(ch);
        if (outStream.size() >= 3) {
            String s = outStream.toString("UTF-8");
            payloads.add(s);
            outStream.flush();
            outStream.reset();
        }
    }
    payloads.add(outStream.toString("UTF-8"));
    outStream.flush();
    System.out.println(payloads);
} catch (IOException e) {
    e.printStackTrace();
}
Actual Output: [Hel, lo , wor, æd, �, �å, é�, �ò, ô�, �â, ]
Expected output: [Hel, lo , wor, æd, ,æ, å, é, ø, ò, ô, ó, â]
It's UTF-8. UTF-8 is designed so that you can easily detect character boundaries.
So: convert String to UTF-8 bytes.
Then backtrack until the first excluded byte is a legitimate 'first byte', i.e. not 10xxxxxx. You are now positioned at a character boundary.
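A minimal sketch of that backtracking (assuming no single code point needs more bytes than the limit, which holds here with a 3-byte limit and 1-2 byte characters):

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

static List<String> splitByByteLimit(String text, int limit) {
    byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
    List<String> chunks = new ArrayList<>();
    int start = 0;
    while (start < bytes.length) {
        int end = Math.min(start + limit, bytes.length);
        // backtrack while `end` points at a continuation byte (10xxxxxx),
        // so the first excluded byte is a legitimate first byte
        while (end > start && end < bytes.length && (bytes[end] & 0xC0) == 0x80) {
            end--;
        }
        chunks.add(new String(bytes, start, end - start, StandardCharsets.UTF_8));
        start = end;
    }
    return chunks;
}

System.out.println(splitByByteLimit("Hello woræd æåéøòôóâ", 3));
// [Hel, lo , wor, æd,  æ, å, é, ø, ò, ô, ó, â]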

Char coding in Java

I open a file with Notepad, write "ą" there, save and close.
I try to read this file in two ways.
First:
InputStream inputStream = Files.newInputStream(Paths.get("file.txt"));
int result = inputStream.read();
System.out.println(result);
System.out.println((char) result);
196
Ä
Second:
InputStream inputStream = Files.newInputStream(Paths.get("file.txt"));
Reader reader = new InputStreamReader(inputStream);
int result = reader.read();
System.out.println(result);
System.out.println((char) result);
261
ą
Questions:
1) In binary mode, why is this letter saved as 196? Why not as 261?
2) This letter is saved as 196 in which encoding?
I am trying to understand why there are differences.
UTF-8 encodes values from the range U+0080 - U+07FF as two bytes in the form 110xxxxx 10xxxxxx (more at wiki). So there are only xxxxx xxxxxx = 11 bits available for the value.
ą is indexed as U+0105, where 0105 is the hexadecimal value (261 in decimal). In binary it can be represented as
01 05 (hex)
00000001 00000101 (bin)
xxx xxxxxxxx <- values for U+0080 - U+07FF range encode only those bits
001 00000101 <- which means `x` will be replaced by only this part
So UTF-8 encoding will add 110xxxxx 10xxxxxx mask which means it will combine
110xxxxx 10xxxxxx
00100 000101
into (two bytes):
11000100 10000101
Now, InputStream reads data as raw bytes. So when you call inputStream.read(); the first time, you get 11000100, which is 196 in decimal. Calling inputStream.read(); a second time would return 10000101, which is 133 in decimal.
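Those two bytes are easy to verify (a small check):

import java.nio.charset.StandardCharsets;

for (byte b : "ą".getBytes(StandardCharsets.UTF_8)) {
    int unsigned = b & 0xFF;
    System.out.println(Integer.toBinaryString(unsigned) + " = " + unsigned);
}
// 11000100 = 196
// 10000101 = 133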
Readers were introduced in Java 1.1 so we could avoid this kind of mess in our code. Instead we can specify what encoding the Reader should use (or let it use the default one) to get a properly decoded value, in this case 00000001 00000101 (without the mask), which is equal to 0105 in hexadecimal form and 261 in decimal form.
In short:
use Readers (with a properly specified encoding) if you want to read data as text;
use Streams if you want to read data as raw bytes.
You read these two letters with different encodings; you can check which encoding a reader uses via InputStreamReader::getEncoding.
String s = "ą";
char iso_8859_1 = new String(s.getBytes(), "iso-8859-1").charAt(0);
char utf_8 = new String(s.getBytes(), "utf-8").charAt(0);
System.out.println((int) iso_8859_1 + " " + iso_8859_1);
System.out.println((int) utf_8 + " " + utf_8);
The output is
196 Ä
261 ą
Try using an InputStreamReader with UTF-8 encoding, which matches the encoding used to write the file from Notepad++:
// this will use UTF-8 encoding by default
BufferedReader in = Files.newBufferedReader(Paths.get("file.txt"));
String str;
if ((str = in.readLine()) != null) {
    System.out.println(str);
}
in.close();
I don't have an exact, reproducible answer for why you see that output, but if you read with the wrong encoding, you won't necessarily get back what you saved. For example, if the single character ą were encoded as two bytes but you read them as ASCII, you might get back two characters, which would not match your original file.
You are getting the decimal values of Latin letters.
You need to save the file with the UTF-8 encoding standard,
and make sure you read it with the same standard.
0x0105 261 LATIN SMALL LETTER A WITH OGONEK ą
0x00C4 196 LATIN CAPITAL LETTER A WITH DIAERESIS Ä
Refer to: https://www.ssec.wisc.edu/~tomw/java/unicode.html

java convert utf-8 2 byte char to 1 byte char

There are many similar questions, but none has helped me.
UTF-8 can be 1 byte, or 2, 3, 4.
ISO-8859-15 is always 2 bytes.
But I need 1-byte characters, like code page 863 (IBM863).
http://en.wikipedia.org/wiki/Code_page_863
For example, "é" is code point 233 and is 2 bytes long in UTF-8; how can I convert it to IBM863 (1 byte) in Java?
Is that possible while running on a JVM with -Dfile.encoding=UTF-8?
Of course that conversion would mean that some characters could be lost, because IBM863 is smaller.
But I need the language-specific characters, like French è, é etc.
Edit1:
String text = "text with é";
Socket socket = getPrinterSocket(printer);
BufferedWriter bwOut = getPrinterWriter(printer, socket);
...
bwOut.write("PRTXT \"" + text + "\n");
...
if (socket != null)
{
    bwOut.close();
    socket.close();
}
else
{
    bwOut.flush();
}
It's going to a label printer with Fingerprint 8.2.
Edit 2:
private BufferedWriter getPrinterWriter(PrinterLocal printer, Socket socket)
    throws IOException
{
    return new BufferedWriter(new OutputStreamWriter(socket.getOutputStream()));
}
First of all: there is no such thing as "1 byte char" or, in fact, "n byte char" for whatever n.
In Java, a char is a UTF-16 code unit; depending on the (Unicode) code point, either one, or two chars, are necessary to represent a code point.
You can use the following methods:
Character.toChars() to turn a Unicode code point into a char array representing this code point;
a CharsetEncoder to perform the char[] to byte[] conversion;
a CharsetDecoder to perform the byte[] to char[] conversion.
You obtain the latter two from a Charset's newEncoder() and newDecoder() methods.
It is crucially important here to know what your input is exactly: is it a code point, is it an encoded byte array? You'll have to adapt your code depending on this.
Final note: the file.encoding setting defines the default charset to use when you don't specify a charset, for instance in a FileReader constructor; you should avoid not specifying a charset to begin with!
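Putting those pieces together, a minimal sketch (assuming the input is the code point for é, U+00E9; unmappable characters are replaced rather than raising an error):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

char[] chars = Character.toChars(0x00E9); // 'é' as a char array

CharsetEncoder encoder = Charset.forName("cp863").newEncoder()
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
// encode() throws the checked CharacterCodingException
ByteBuffer bytes = encoder.encode(CharBuffer.wrap(chars));
System.out.printf("é -> 0x%02X (one byte)%n", bytes.get(0) & 0xFF);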
byte[] someUtf8Bytes = ...
String decoded = new String(someUtf8Bytes, StandardCharsets.UTF_8);
byte[] someIso15Bytes = decoded.getBytes("ISO-8859-15");
byte[] someCp863Bytes = decoded.getBytes("cp863");
If you start with a string, just use getBytes with a proper encoding.
If you want to write strings with a proper encoding to a socket, you can either use OutputStream instead of PrintStream or Writer and send byte arrays, or you can do:
new BufferedWriter(new OutputStreamWriter(socket.getOutputStream(), "cp863"))

convert charset X to unicode in Java

How do you convert a specific charset to unicode in Java?
Charsets have been discussed quite a lot here, but I think this one hasn't been covered yet.
I have a hex string that meets the criterion length % 4 == 0 (e.g. \ud3faef8e). Usually I just display it in an HTML container, adding &#x to the front and ; to the back of each hex quadruple.
But in this case the following procedure led to the correct output (non-Java):
paste hex string into Hex-Editor and save the file to test.txt (utf-8)
open the file with Notepad++
change the encoding to Simplified Chinese (GB2312)
Now I'm trying to do the same in Java.
// having hex, convert to ascii
String ascii = "";
for (int cnt = 0; cnt <= unicode.length() - 2; cnt += 2) {
    String tmp = unicode.substring(cnt, cnt + 2);
    int decimal = Integer.parseInt(tmp, 16);
    ascii += (char) decimal;
}
// writing ascii to file at this point leads to the same result as in step 2 before
try {
    // get the bytes
    byte[] utf8 = ascii.getBytes("UTF-8"); // == UTF8
    // convert to gb2312
    String converted = new String(utf8, "GB2312"); // == EUC_CN
    // write to file (writer with declared UTF-8)
    writeToFile(converted, 20 + cntu);
    cntu++;
} catch (Exception e) {
    System.err.println(e.getMessage());
}
The output matches the expected output, except that this character randomly shows up: �. Why does it come up, and how can I get rid of it?
In the end, what I'd like is the converted text as Unicode again, to be able to display it with my original approach (폴), but I haven't figured out a way to get back to the hex values (they don't meet the criterion length % 4 == 0). How do I get the hex values of the characters?
update1
To be more precise regarding the input: I assumed it was Unicode because the String starts with \u, which would be sufficient for my usual approach, but not in the case I am describing above.
update2
the writeToFile method
FileOutputStream fos = new FileOutputStream("test" + id + ".txt");
Writer out = new OutputStreamWriter(fos, "UTF8");
out.write(str);
out.close();
I tried with GB2312 as well, but there is no change. I still get the ? in between the correct characters.
update3
The expected output for \ud3f6ef8e is 遇飵; you get to it when following steps 1 to 3 (with HxD as an example of a hex editor).
There was no indication that I should delete my question, so I'm writing my final comment as the answer.
I was misinterpreting the incoming hex digits: they were in a specific charset, not Unicode, so they represented the hex values of characters in that charset. What I'm doing now is new String(byteArray, "CharsetName") and getting (int) s.charAt(i) to obtain the Unicode value, which I write to HTML. Thanks for your ideas and hints.
for more details see this answer here: https://stackoverflow.com/a/4049781/1338732 , and this question here: How to convert UTF-8 to unicode in Java?
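A sketch of that final approach, assuming the hex digits d3f6ef8e really are GB2312-encoded bytes (which should yield the 遇飵 expected in the question):

import java.nio.charset.Charset;

String hex = "d3f6ef8e"; // charset-encoded bytes, not Unicode escapes
byte[] bytes = new byte[hex.length() / 2];
for (int i = 0; i < bytes.length; i++) {
    bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
}
String s = new String(bytes, Charset.forName("GB2312")); // decode with the source charset

// emit each char as an HTML numeric character reference, as in the original approach
StringBuilder html = new StringBuilder();
for (int i = 0; i < s.length(); i++) {
    html.append("&#x").append(Integer.toHexString(s.charAt(i))).append(';');
}
System.out.println(s + " -> " + html);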

ReadLine and encoding of the extended ascii table

Good day.
I have an ASCII file with Spanish words. They contain only characters between A and Z, plus Ñ, ASCII code 165 (http://www.asciitable.com/).
I read this file with this source code:
InputStream is = ctx.getAssets().open(filenames[lang_code][w]);
InputStreamReader reader1 = new InputStreamReader(is, "UTF-8");
BufferedReader reader = new BufferedReader(reader1, 8000);
try {
    while ((line = reader.readLine()) != null) {
        workOn(line);
        // do a lot of things with line
    }
    reader.close();
    is.close();
} catch (IOException e) { e.printStackTrace(); }
What I called workOn() here is a function that should extract the character codes from the strings; it is something like this:
private static void workOn(String s) {
    byte b;
    for (int w = 0; w < s.length(); w++) {
        b = (byte) s.charAt(w);
        // etc etc etc
    }
}
Unfortunately what happens here is that I cannot identify b as an ASCII code when it represents the letter Ñ. The value of b is correct for any ASCII letter, but for Ñ it is -3, which, taken as an unsigned byte, is 253, i.e. the extended ASCII character ². Nothing similar to Ñ...
What happens here? How should I get this simple ASCII code?
What is getting me mad is that I cannot find the correct encoding. Even if I browse the UTF-8 table (http://www.utf8-chartable.de/), Ñ is 209 dec, while 253 dec is ý and 165 dec is ¥. Again, not even related to what I need.
So... help me please! :(
Are you sure that the source file you are reading is UTF-8 encoded? In UTF-8 encoding, all values greater than 127 are reserved for multi-byte sequences, and they are never seen standing on their own.
My guess is that the file you are reading is encoded using code page 437, the original IBM PC character set. In that character set, Ñ is represented by decimal 165.
Many modern systems use ISO-8859-1, which happens to be equivalent to the first 256 characters of the Unicode character set. In it, the Ñ character is decimal 209. In a comment, the author clarified that a 209 is actually in the file.
If the file was really UTF-8 encoded, then the Ñ would be represented as a two-byte sequence, and would be neither the value 165 nor the value 209.
Based on the above assumption that the file is ISO-8859-1 encoded, you should be able to solve the situation by using:
InputStreamReader reader1 = new InputStreamReader(is, "ISO-8859-1");
This will translate to the Unicode characters, and you should then find the character Ñ represented by decimal 209.
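A minimal check of that decoding (assuming the file really does contain the single byte 0xD1):

import java.nio.charset.StandardCharsets;

byte[] fileBytes = { (byte) 0xD1 }; // Ñ in ISO-8859-1
String s = new String(fileBytes, StandardCharsets.ISO_8859_1);
System.out.println(s + " = " + (int) s.charAt(0)); // Ñ = 209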
