Convert InputStream from ISO-8859-1 to UTF-8 - java

I have a file in ISO-8859-1 containing German umlauts and I need to unmarshal it using JAXB. But before that, I need the content in UTF-8.
@Override
public List<Usage> convert(InputStream input) {
    try {
        InputStream inputWithNamespace = addNamespaceIfMissing(input);
        inputWithNamespace = convertFileToUtf(inputWithNamespace);
        ORDR order = xmlUnmarshaller.unmarshall(inputWithNamespace, ORDR.class);
        ...
I get the "file" as an InputStream. My idea was to read the file's content in UTF-8 and make another InputStream to use. This is what I've tried:
private InputStream convertFileToUtf(InputStream inputStream) throws IOException {
    byte[] bytesInIso = ByteStreams.toByteArray(inputStream);
    String stringIso = new String(bytesInIso);
    byte[] bytesInUtf = new String(bytesInIso, ISO_8859_1).getBytes(UTF_8);
    String stringUtf = new String(bytesInUtf);
    return new ByteArrayInputStream(bytesInUtf);
}
I have those two Strings just to check the contents, but even reading the ISO file gives question marks (?) where the umlauts are, and converting that to UTF-8 gives strange characters like ½ and so on.
UPDATE
byte[] bytesInIso = ByteStreams.toByteArray(inputWithNamespace);
String contentInIso = new String(bytesInIso);
byte[] bytesInUtf = new String(bytesInIso, ISO_8859_1).getBytes(UTF_8);
String contentInUtf = new String(bytesInUtf);
Verifying contentInIso prints question marks instead of the umlauts, and checking contentInUtf shows characters like "�" instead of the umlauts.
@Override
public List<Usage> convert(InputStream input) {
    try {
        InputStream inputWithNamespace = addNamespaceIfMissing(input);
        byte[] bytesInIso = ByteStreams.toByteArray(inputWithNamespace);
        String contentInIso = new String(bytesInIso);
        byte[] bytesInUtf = new String(bytesInIso, ISO_8859_1).getBytes(UTF_8);
        String contentInUtf = new String(bytesInUtf);
        ORDR order = xmlUnmarshaller.unmarshall(inputWithNamespace, ORDR.class);
This convert method is called by another one, processUsageFile:
private void processUsageFile(File usageFile) {
    try (FileInputStream fileInputStream = new FileInputStream(usageFile)) {
        usageImporterService.importUsages(usageFile.getName(), fileInputStream, getUsageTypeValidated(usageFile.getName()));
        log.info("Usage file {} imported successfully. Moving to archive directory", usageFile.getName());
If I take the code I have written under the UPDATE heading and put it immediately after the try, the first contentInIso has question marks but contentInUtf has the umlauts. Then, going into convert, JAXB throws an exception that the file has a premature end of file.

Regarding the behaviour you are getting,
String stringIso = new String(bytesInIso);
In this step, you construct a new String by decoding the specified array of bytes using the platform's default charset.
Since this is probably not ISO_8859_1, I think the String you are looking at becomes garbled here.
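A minimal sketch of the conversion with the charset made explicit (this keeps Guava's ByteStreams from your code; the method name is illustrative and the usual java.io / java.nio.charset.StandardCharsets imports are assumed):
private InputStream convertIsoToUtf8(InputStream inputStream) throws IOException {
    // read the raw bytes once; the original stream cannot be read again afterwards
    byte[] bytesInIso = ByteStreams.toByteArray(inputStream);
    // decode explicitly as ISO-8859-1 instead of the platform default
    String content = new String(bytesInIso, StandardCharsets.ISO_8859_1);
    // re-encode as UTF-8 for the new stream
    return new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
}
Note also that in your UPDATE snippet the original inputWithNamespace is consumed by ByteStreams.toByteArray before it reaches the unmarshaller, which would explain the "premature end of file" error; the unmarshaller should be given the new ByteArrayInputStream instead.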

Related

How to deserialize avro files

I would like to read an HDFS folder containing Avro files with Spark. Then I would like to deserialize the Avro events contained in these files. I would like to do it without the com.databricks library (or any other that allows doing it easily).
The problem is that I have difficulties with the deserialization.
I assume that my Avro file is compressed with Snappy because at the beginning of the file (just after the schema), I have
avro.codecsnappy
written. It is then followed by readable and unreadable characters.
My first attempt to deserialize the Avro events is the following:
public static String deserialize(String message) throws IOException {
    Schema.Parser schemaParser = new Schema.Parser();
    Schema avroSchema = schemaParser.parse(defaultFlumeAvroSchema);
    DatumReader<GenericRecord> specificDatumReader = new SpecificDatumReader<GenericRecord>(avroSchema);
    byte[] messageBytes = message.getBytes();
    Decoder decoder = DecoderFactory.get().binaryDecoder(messageBytes, null);
    GenericRecord genericRecord = specificDatumReader.read(null, decoder);
    return genericRecord.toString();
}
This function works when I want to deserialize an Avro file that doesn't have the avro.codecsnappy marker in it. When it does, I get the error:
Malformed data : length is negative : -50
So I tried another way of doing it, which is:
private static void deserialize2(String path) throws IOException {
    DatumReader<GenericRecord> reader = new GenericDatumReader<>();
    DataFileReader<GenericRecord> fileReader =
            new DataFileReader<>(new File(path), reader);
    System.out.println(fileReader.getSchema().toString());
    GenericRecord record = new GenericData.Record(fileReader.getSchema());
    int numEvents = 0;
    while (fileReader.hasNext()) {
        fileReader.next(record);
        ByteBuffer body = (ByteBuffer) record.get("body");
        CharsetDecoder decoder = Charsets.UTF_8.newDecoder();
        System.out.println("Position of the index " + body.position());
        System.out.println("Size of the array : " + body.array().length);
        String bodyStr = decoder.decode(body).toString();
        System.out.println("THE BODY STRING ---> " + bodyStr);
        numEvents++;
    }
    fileReader.close();
}
and it returns the following output:
Position of the index 0
Size of the array : 127482
THE BODY STRING --->
I can see that the array isn't empty, but it just returns an empty string.
How can I proceed?
Use this when converting to string:
String bodyStr = new String(body.array());
System.out.println("THE BODY STRING ---> " + bodyStr);
Source: https://www.mkyong.com/java/how-do-convert-byte-array-to-string-in-java/
Well, it seems that you are on the right track. However, your ByteBuffer might not expose a proper byte[] array to decode, so let's try the following instead:
byte[] bytes = new byte[body.remaining()];
body.get(bytes);
String result = new String(bytes, "UTF-8"); // Maybe you need to change charset
This should work; you have shown in your question that the ByteBuffer contains actual data. As noted in the code example, you might have to change the charset.
List of charsets: https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html
Also useful: https://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer.html

Java Spring returning CSV file encoded in UTF-8 with BOM

Apparently, for Excel to open CSV files nicely, the file should have the Byte Order Mark at the start. The CSV download is implemented by writing into the HttpServletResponse's output stream in the controller, as the data is generated during the request. I get an exception when I try to write the BOM bytes - java.io.CharConversionException: Not an ISO 8859-1 character: [] (even though the encoding I specified is UTF-8).
The controller's method in question
@RequestMapping("/monthly/list")
public List<MonthlyDetailsItem> queryDetailsItems(
        MonthlyDetailsItemQuery query,
        @RequestParam(value = "format", required = false) String format,
        @RequestParam(value = "attachment", required = false, defaultValue = "false") Boolean attachment,
        HttpServletResponse response) throws Exception
{
    // load item list
    List<MonthlyDetailsItem> list = detailsSvc.queryMonthlyDetailsForList(query);
    // adjust format
    format = format != null ? format.toLowerCase() : "json";
    if (!Arrays.asList("json", "csv").contains(format)) format = "json";
    // modify common response headers
    response.setCharacterEncoding("UTF-8");
    if (attachment)
        response.setHeader("Content-Disposition", "attachment;filename=duomenys." + format);
    // build csv
    if ("csv".equals(format)) {
        response.setContentType("text/csv; charset=UTF-8");
        response.getOutputStream().print("\ufeff");
        response.getOutputStream().write(buildMonthlyDetailsItemCsv(list).getBytes("UTF-8"));
        return null;
    }
    return list;
}
I have just come across this same problem. The solution which works for me is to get the output stream from the response object and write to it as follows:
// first create an array for the Byte Order Mark
final byte[] bom = new byte[] { (byte) 239, (byte) 187, (byte) 191 };
try (OutputStream os = response.getOutputStream()) {
    os.write(bom);
    final PrintWriter w = new PrintWriter(new OutputStreamWriter(os, "UTF-8"));
    w.print(data);
    w.flush();
    w.close();
} catch (IOException e) {
    // log it
}
So UTF-8 is specified on the OutputStreamWriter.
As an addendum to this, I should add that the same application needs to allow users to upload files, which may or may not have BOMs. This can be dealt with by using the class org.apache.commons.io.input.BOMInputStream, then using that to construct an org.apache.commons.csv.CSVParser.
The BOMInputStream includes a method hasBOM() to detect if the file has a BOM or not.
One gotcha that I first fell into was that the hasBOM() method reads (obviously!) from the underlying stream, so the way to deal with this is to first mark the stream and then, if the test shows it doesn't have a BOM, reset the stream. The code I use for this looks like the following:
try (InputStream is = uploadFile.getInputStream();
        BufferedInputStream buffIs = new BufferedInputStream(is);
        BOMInputStream bomIn = new BOMInputStream(buffIs);) {
    buffIs.mark(LOOKAHEAD_LENGTH);
    // this should allow us to deal with csv's with or without BOMs
    final boolean hasBOM = bomIn.hasBOM();
    final BufferedReader buffReadr = new BufferedReader(
            new InputStreamReader(hasBOM ? bomIn : buffIs, StandardCharsets.UTF_8));
    // if this stream does not have a BOM, then we must reset the stream as the test
    // for a BOM will have consumed some bytes
    if (!hasBOM) {
        buffIs.reset();
    }
    // collect the validated entity details
    final CSVParser parser = CSVParser.parse(buffReadr,
            CSVFormat.DEFAULT.withFirstRecordAsHeader());
    // Do stuff with the parser
    ...
}
// Catch and clean up
Hope this helps someone.
It doesn't make much sense: the BOM is for UTF-16; there is no byte order with UTF-8. The encoding you've set with setCharacterEncoding is used for getWriter, not for getOutputStream.
UPDATE:
OK, try this:
if ("csv".equals(format)) {
response.setContentType("text/csv; charset=UTF-8");
PrintWriter out = response.getWriter();
out.print("\uFEFF");
out.print(buildMonthlyDetailsItemCsv(list));
return null;
}
I'm assuming that method buildMonthlyDetailsItemCsv returns a String.

How to retrieve the encoding of an XML file to parse it correctly? (Best Practice)

My application downloads XML files that happen to be encoded either in UTF-8 or in ISO-8859-1 (the software that generates those files is crappy, so it does that). I'm from Germany, so we're using umlauts (ä, ü, ö), and it really makes a difference how those files are encoded.
I know that the XmlPullParser has a method .getInputEncoding() which correctly detects how my files are encoded. However, I have to set the encoding for my FileInputStream already (which is before I get to call .getInputEncoding()). So far I'm just using a BufferedReader to read the XML file, search for the entry that specifies the encoding, and then instantiate my PullParser afterwards.
private void setFileEncoding() {
    try {
        bufferedReader.reset();
        String firstLine = bufferedReader.readLine();
        int start = firstLine.indexOf("encoding=") + 10; // +10 to actually start after "encoding="
        String encoding = firstLine.substring(start, firstLine.indexOf("\"", start));
        // now set the encoding to the reader to be used for parsing afterwards
        bufferedReader = new BufferedReader(new InputStreamReader(fileInputStream, encoding));
        bufferedReader.mark(0);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Is there a different way to do this? Can I take advantage of the .getInputEncoding() method? Right now the method seems kind of useless to me, because how does it help if I've already had to set the encoding before being able to check for it?
If you trust the creator of the XML to have set the encoding correctly in the XML declaration, you can sniff it as you're doing. However, be aware that it can be wrong; it can disagree with the actual encoding.
If you want to detect the encoding directly, independently of the (potentially wrong) XML declaration encoding setting, use a library such as ICU CharsetDetector or the older jChardet.
ICU CharsetDetector:
CharsetDetector detector;
CharsetMatch match;
byte[] byteData = ...;
detector = new CharsetDetector();
detector.setText(byteData);
match = detector.detect();
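The detected match can then be used to decode the bytes, for example (a small sketch assuming ICU4J's CharsetMatch API):
String charsetName = match.getName();
// decode the original bytes with the detected charset
String text = new String(byteData, Charset.forName(charsetName));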
jChardet:
// Initialize the nsDetector()
int lang = (argv.length == 2) ? Integer.parseInt(argv[1])
                              : nsPSMDetector.ALL;
nsDetector det = new nsDetector(lang);
// Set an observer...
// The Notify() will be called when a matching charset is found.
det.Init(new nsICharsetDetectionObserver() {
    public void Notify(String charset) {
        HtmlCharsetDetector.found = true;
        System.out.println("CHARSET = " + charset);
    }
});
URL url = new URL(argv[0]);
BufferedInputStream imp = new BufferedInputStream(url.openStream());
byte[] buf = new byte[1024];
int len;
boolean done = false;
boolean isAscii = true;
while ((len = imp.read(buf, 0, buf.length)) != -1) {
    // Check if the stream is only ascii.
    if (isAscii)
        isAscii = det.isAscii(buf, len);
    // DoIt if non-ascii and not done yet.
    if (!isAscii && !done)
        done = det.DoIt(buf, len, false);
}
det.DataEnd();
if (isAscii) {
    System.out.println("CHARSET = ASCII");
    found = true;
}
You may be able to get the correct character set from the Content-Type header, if your server sends it correctly.
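For example, a rough sketch of pulling the charset out of the Content-Type header when the file is fetched over HTTP (the URL and the fallback are illustrative, and the header may be absent or wrong):
URLConnection conn = new URL("http://example.com/data.xml").openConnection();
String contentType = conn.getContentType(); // e.g. "text/xml; charset=ISO-8859-1"
String charset = "UTF-8"; // fallback if the header gives no charset
if (contentType != null) {
    for (String part : contentType.split(";")) {
        part = part.trim();
        if (part.toLowerCase().startsWith("charset=")) {
            charset = part.substring("charset=".length());
        }
    }
}
Reader reader = new InputStreamReader(conn.getInputStream(), charset);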

How to read/write extended ASCII characters as a string into ANSI coded text file in java

This is my encryption program, primarily used to encrypt files (text).
This part of the program converts List<Integer> elements into byte[] and writes them into a text file. Unfortunately I cannot provide the algorithm.
void printit(List<Integer> prnt, File outputFile) throws IOException
{
    StringBuilder building = new StringBuilder(prnt.size());
    for (Integer element : prnt)
    {
        int elmnt = element;
        //building.append(getascii(elmnt));
        building.append((char) elmnt);
    }
    String encryptdtxt = building.toString();
    //System.out.println(encryptdtxt);
    byte[] outputBytes = encryptdtxt.getBytes();
    FileOutputStream outputStream = new FileOutputStream(outputFile);
    outputStream.write(outputBytes);
    outputStream.close();
}
This is the decryption part, where the program gets its input from a .enc file:
void getfyle(File inputFile) throws IOException
{
    FileInputStream inputStream = new FileInputStream(inputFile);
    byte[] inputBytes = new byte[(int) inputFile.length()];
    inputStream.read(inputBytes);
    inputStream.close();
    String fylenters = new String(inputBytes);
    for (char a : fylenters.toCharArray())
    {
        usertext.add((int) a);
    }
    for (Integer bk : usertext)
    {
        System.out.println(bk);
    }
}
Since the methods used in my algorithm require List<Integer>, the byte[] gets converted to a String first and then to a List<Integer>, and vice versa.
The elements written to the file during encryption do not match the elements read back from the .enc file.
Is my method of converting List<Integer> to byte[] correct?
Or is something else wrong? I do know that Java can't print extended ASCII characters, so I used this. But even this failed; it gives a lot of ?s.
Is there a solution?
Please help me, and also: how do I do it for other formats (.png, .mp3, etc.)?
The format of the encrypted file can be anything (it needn't be .enc).
Thanks.
There are thousands of different 'extended ASCII' codes and Java supports about a hundred of them, but you have to tell it which Charset to use or the default will often cause data corruption.
Representing arbitrary "binary" bytes in hex or Base64 is common and often necessary. But IF the bytes will be stored and/or transmitted in ways that preserve all 256 values, often called "8-bit clean" - and File{Input,Output}Stream does - you can use "ISO-8859-1", which maps Java char codes 0-255 to and from bytes 0-255 without loss, because Unicode is based partly on 8859-1.
On input, read (into) a byte[] and then use new String(bytes, charset), where charset is either the name "ISO-8859-1" or the java.nio.charset.Charset object for that name, available as java.nio.charset.StandardCharsets.ISO_8859_1; or create an InputStreamReader on a stream reading the bytes from a buffer or directly from the file, using that charset name or object, and read chars and/or a String from the Reader.
On output, use String.getBytes(charset), where charset is that charset name or object, and write the byte[]; or create an OutputStreamWriter on a stream writing the bytes to a buffer or the file, using that charset name or object, and write chars and/or a String to the Writer. (A minimal sketch of this round trip follows below.)
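For instance, a minimal sketch of the ISO-8859-1 round trip described above (someString and outputFile are illustrative names):
// write: chars 0-255 map one-to-one to bytes 0-255
byte[] outBytes = someString.getBytes(StandardCharsets.ISO_8859_1);
java.nio.file.Files.write(outputFile.toPath(), outBytes);
// read: bytes 0-255 map back to the same char values
byte[] inBytes = java.nio.file.Files.readAllBytes(outputFile.toPath());
String roundTripped = new String(inBytes, StandardCharsets.ISO_8859_1);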
But you don't actually need char and String and Charset at all. You actually want to write a series of Integers as bytes, and read a series of bytes as Integers. So just do that:
void printit(List<Integer> prnt, File outputFile) throws IOException
{
    byte[] outputBytes = new byte[prnt.size()];
    int i = 0;
    for (Integer element : prnt) outputBytes[i++] = (byte) element;
    FileOutputStream outputStream = new FileOutputStream(outputFile);
    outputStream.write(outputBytes);
    outputStream.close();
    // or replace the previous three lines by one:
    // java.nio.file.Files.write(outputFile.toPath(), outputBytes);
}
void getfyle(File inputFile) throws IOException
{
    FileInputStream inputStream = new FileInputStream(inputFile);
    byte[] inputBytes = new byte[(int) inputFile.length()];
    inputStream.read(inputBytes);
    inputStream.close();
    // or replace those four lines with:
    // byte[] inputBytes = java.nio.file.Files.readAllBytes(inputFile.toPath());
    for (byte b : inputBytes) System.out.println(b & 0xFF);
    // or if you really wanted a list, not just a printout:
    ArrayList<Integer> list = new ArrayList<Integer>(inputBytes.length);
    for (byte b : inputBytes) list.add(b & 0xFF);
    // return list or store it or whatever
}
Arbitrary data bytes are not all convertible to any given character encoding, and encryption creates data bytes that include all values 0-255.
If you must convert the encrypted data to a string format, the standard methods are to convert it to Base64 or hexadecimal.
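For example, a quick sketch using java.util.Base64 (Java 8+; encryptedBytes is a placeholder for your encrypted byte[]):
// encode the encrypted bytes to a printable string
String base64 = java.util.Base64.getEncoder().encodeToString(encryptedBytes);
// decode the string back to the original bytes
byte[] decoded = java.util.Base64.getDecoder().decode(base64);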
In the encryption part:
for (Integer element : prnt)
{
    int elmnt = element;
    //building.append(getascii(elmnt));
    char b = Integer.toString(elmnt).charAt(0);
    building.append(b);
}
This will convert an int to a char, like 1 to '1' and 5 to '5'.

How to save Chinese Characters to file with java?

I use the following code to save Chinese characters into a .txt file, but when I opened it with Wordpad, I couldn't read it.
StringBuffer Shanghai_StrBuf = new StringBuffer("\u4E0A\u6D77");
boolean Append = true;
FileOutputStream fos;
fos = new FileOutputStream(FileName, Append);
for (int i = 0; i < Shanghai_StrBuf.length(); i++) {
    fos.write(Shanghai_StrBuf.charAt(i));
}
fos.close();
What can I do? I know that if I cut and paste Chinese characters into Wordpad, I can save them into a .txt file. How do I do that in Java?
There are several factors at work here:
Text files have no intrinsic metadata for describing their encoding (for all the talk of angle-bracket taxes, there are reasons XML is popular)
The default encoding for Windows is still an 8-bit (or double-byte) "ANSI" character set with a limited range of values - text files written in this format are not portable
To tell a Unicode file from an ANSI file, Windows apps rely on the presence of a byte order mark at the start of the file (not strictly true - Raymond Chen explains). In theory, the BOM is there to tell you the endianess (byte order) of the data. For UTF-8, even though there is only one byte order, Windows apps rely on the marker bytes to automatically figure out that it is Unicode (though you'll note that Notepad has an encoding option on its open/save dialogs).
It is wrong to say that Java is broken because it does not write a UTF-8 BOM automatically. On Unix systems, it would be an error to write a BOM to a script file, for example, and many Unix systems use UTF-8 as their default encoding. There are times when you don't want it on Windows, either, like when you're appending data to an existing file: fos = new FileOutputStream(FileName,Append);
Here is a method of reliably appending UTF-8 data to a file:
private static void writeUtf8ToFile(File file, boolean append, String data)
        throws IOException {
    boolean skipBOM = append && file.isFile() && (file.length() > 0);
    Closer res = new Closer();
    try {
        OutputStream out = res.using(new FileOutputStream(file, append));
        Writer writer = res.using(new OutputStreamWriter(out, Charset
                .forName("UTF-8")));
        if (!skipBOM) {
            writer.write('\uFEFF');
        }
        writer.write(data);
    } finally {
        res.close();
    }
}
Usage:
public static void main(String[] args) throws IOException {
    String chinese = "\u4E0A\u6D77";
    boolean append = true;
    writeUtf8ToFile(new File("chinese.txt"), append, chinese);
}
Note: if the file already existed, you chose to append, and the existing data wasn't UTF-8 encoded, the only thing that code will create is a mess.
Here is the Closer type used in this code:
public class Closer implements Closeable {
    private Closeable closeable;
    public <T extends Closeable> T using(T t) {
        closeable = t;
        return t;
    }
    @Override public void close() throws IOException {
        if (closeable != null) {
            closeable.close();
        }
    }
}
This code makes a Windows-style best guess about how to read the file based on byte order marks:
private static final Charset[] UTF_ENCODINGS = { Charset.forName("UTF-8"),
        Charset.forName("UTF-16LE"), Charset.forName("UTF-16BE") };

private static Charset getEncoding(InputStream in) throws IOException {
    charsetLoop: for (Charset encodings : UTF_ENCODINGS) {
        byte[] bom = "\uFEFF".getBytes(encodings);
        in.mark(bom.length);
        for (byte b : bom) {
            if ((0xFF & b) != in.read()) {
                in.reset();
                continue charsetLoop;
            }
        }
        return encodings;
    }
    return Charset.defaultCharset();
}
private static String readText(File file) throws IOException {
    Closer res = new Closer();
    try {
        InputStream in = res.using(new FileInputStream(file));
        InputStream bin = res.using(new BufferedInputStream(in));
        Reader reader = res.using(new InputStreamReader(bin, getEncoding(bin)));
        StringBuilder out = new StringBuilder();
        for (int ch = reader.read(); ch != -1; ch = reader.read())
            out.append((char) ch);
        return out.toString();
    } finally {
        res.close();
    }
}
Usage:
public static void main(String[] args) throws IOException {
    System.out.println(readText(new File("chinese.txt")));
}
(System.out uses the default encoding, so whether it prints anything sensible depends on your platform and configuration.)
If you can rely that the default character encoding is UTF-8 (or some other Unicode encoding), you may use the following:
Writer w = new FileWriter("test.txt");
w.append("上海");
w.close();
The safest way is to always explicitly specify the encoding:
Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
w.append("上海");
w.close();
P.S. You may use any Unicode characters in Java source code, even as method and variable names, if the -encoding parameter for javac is configured right. That makes the source code more readable than the escaped \uXXXX form.
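For instance, the compilation might look like this (MyClass.java is a hypothetical file name):
javac -encoding UTF-8 MyClass.java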
Be very careful with the approaches proposed. Even specifying the encoding for the file as follows:
Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
will not work if you're running under an operating system like Windows. Even setting the system property for file.encoding to UTF-8 does not fix the issue. This is because Java fails to write a byte order mark (BOM) for the file. Even if you specify the encoding when writing out to a file, opening the same file in an application like Wordpad will display the text as garbage because it doesn't detect the BOM. I tried running the examples here in Windows (with a platform/container encoding of CP1252).
The following bug exists to describe the issue in Java:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
The solution for the time being is to write the byte order mark yourself to ensure the file opens correctly in other applications. See this for more details on the BOM:
http://mindprod.com/jgloss/bom.html
and for a more correct solution see the following link:
http://tripoverit.blogspot.com/2007/04/javas-utf-8-and-unicode-writing-is.html
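A rough sketch of writing the BOM yourself, along the lines of those links (the file name is illustrative, and this assumes a fresh file rather than appending):
Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
w.write('\uFEFF');      // the BOM; the UTF-8 writer encodes it as the bytes EF BB BF
w.write("\u4E0A\u6D77");
w.close();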
Here's one way among many. Basically, we're just specifying that the conversion be done to UTF-8 before outputting bytes to the FileOutputStream:
String FileName = "output.txt";
StringBuffer Shanghai_StrBuf=new StringBuffer("\u4E0A\u6D77");
boolean Append=true;
Writer writer = new OutputStreamWriter(new FileOutputStream(FileName,Append), "UTF-8");
writer.write(Shanghai_StrBuf.toString(), 0, Shanghai_StrBuf.length());
writer.close();
I manually verified this against the images at http://www.fileformat.info/info/unicode/char/. In the future, please follow Java coding standards, including lower-case variable names. It improves readability.
Try this,
StringBuffer Shanghai_StrBuf=new StringBuffer("\u4E0A\u6D77");
boolean Append=true;
Writer out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(FileName, Append), "UTF8"));
for (int i = 0; i < Shanghai_StrBuf.length(); i++)
    out.write(Shanghai_StrBuf.charAt(i));
out.close();
