Reading yEnc-encoded data into Java

I have this text file that I encoded with a yEnc encoder, and I intended to decode it with a Java yEnc decoder. The yEnc decoder isn't the problem.
I read the encoded file in and then print the lines on the console, but the characters get misrepresented:
THIS
q˜œ‹–74˜“›ŸJsnJJJJ
TURNS INTO
q?˜?œ‹–74˜“›Ÿ?JsnJJJJ
This happens not only when I print it on the console, but also when writing it to a file, so the decoded file will not match the original file from before encoding. How can I solve this? I really have no idea. I tried replacing the "?" characters, but that didn't have any effect.
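yEnc output is 8-bit binary data, so reading it through a character Reader (or printing it to the console) replaces bytes that are invalid in the platform charset with "?". One sketch of a fix, assuming the decoder can accept a byte array (the decoder call below is a hypothetical placeholder), is to read the file as raw bytes:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class YencReadSketch {
    public static void main(String[] args) throws IOException {
        // Read the encoded file as raw bytes - no charset decoding,
        // so no byte can be mangled into '?' along the way.
        byte[] encoded = Files.readAllBytes(Paths.get("encoded.ync"));

        // Hypothetical: pass the bytes to the yEnc decoder directly
        // instead of converting them to a String first.
        // byte[] decoded = yencDecoder.decode(encoded);
        System.out.println(encoded.length + " bytes read");
    }
}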

Related

Reading and Writing Text files with UTF-16LE encoding and Apache Commons IO

I have written an application in Java and duplicated it in C#. The application reads and writes text files with tab-delimited data to be used by HMI software. The HMI software requires UTF or ANSI encoding for the degree symbol to be displayed correctly, or I would just use ASCII, which seems to work fine. The C# application can open files saved by either one with no problem. The Java application reads the files it saved perfectly, but there is a small problem that crops up when reading the files saved with C#: it throws a NumberFormatException when parsing the first character in the file to an int. This character is always a "1". I have opened both files in EditPad Lite and they appear to be identical, even when viewed with encoding shown, and the encoding is UTF-16LE. I'm racking my brain on this; any help would be appreciated.
lines = FileUtils.readLines(file, "UTF-16LE");
Integer.parseInt(line[0])
I cannot see any difference between the file saved in C# and the one saved in Java.
Screen Shot of Data in EditPad Lite
if (lines.get(0).split("\\t")[0].length() == 2) {
    lines.set(0, lines.get(0).substring(1));
}
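The length check above works because an invisible extra character is still attached to the leading "1" (it turns out to be a BOM, as the answer below explains). A more direct variant, assuming lines is the List<String> returned by FileUtils.readLines, is to test for the BOM character itself:
// Strip a leading BOM (U+FEFF) if the decoder left it in the first line.
String first = lines.get(0);
if (!first.isEmpty() && first.charAt(0) == '\uFEFF') {
    lines.set(0, first.substring(1));
}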
Your .NET code is probably writing a BOM. Compliant Unicode readers strip off any BOM, since it is metadata, not part of the text data.
Your Java code explicitly specifies the byte order:
FileUtils.readLines(file, "UTF-16LE");
It's somewhat of a Catch-22: if the source has a BOM, then you can read it as "UTF-16". If it doesn't, then you can read it as "UTF-16LE" or "UTF-16BE", as long as you know which one it is.
So, either write it with a BOM and read it without specifying the byte order, or, write it without a BOM and read it specifying the byte order.
With a BOM:
[C#]
File.WriteAllLines(file, lines, Encoding.Unicode);
[Java]
FileUtils.readLines(file, "UTF-16");
Without a BOM:
[C#]
File.WriteAllLines(file, lines, new UnicodeEncoding(false));
[Java]
FileUtils.readLines(file, "UTF-16LE");
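The examples above cover writing from C# and reading from Java. For completeness, here is a sketch of the "with a BOM" variant written from the Java side; the class and method names are mine, not from the original answer:
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class Utf16BomWriter {
    // Writes UTF-16LE with an explicit BOM, so that a reader using
    // plain "UTF-16" can detect the byte order from the first two bytes.
    public static void write(String path, List<String> lines) throws IOException {
        try (Writer w = new OutputStreamWriter(
                Files.newOutputStream(Paths.get(path)), StandardCharsets.UTF_16LE)) {
            w.write('\uFEFF'); // BOM character, serialized as FF FE
            for (String line : lines) {
                w.write(line);
                w.write(System.lineSeparator());
            }
        }
    }
}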
In my Java code I read the file normally; I just specified the character encoding in the InputStreamReader:
File file = new File(fileName);
InputStreamReader fis = new InputStreamReader(new FileInputStream(file), "UTF-16LE");
BufferedReader br = new BufferedReader(fis);
String line = br.readLine();

Java: formatting of text

I load a txt file as a collection of strings, then save it in an HSQLDB database. When I load it from the DB and print it in a TextArea, the text comes out like this:
Quando il flusso � maggiore nella narice di destra � la Nadi Pingala a
predominare. L'energia vitale � molto pi� attiva e di conseguenza
saremo pi� forti fisicamente, saremo pi� introversi e solari. Durante
il sonno tende a non.
How can I format it normally?
Try converting your output text to UTF-8 or ISO-8859-1.
The original text file must be in UTF-8; otherwise it will be necessary to convert at file-reading time with:
new String(String_Readed_From_File.getBytes("ISO-8859-1"), "UTF-8");
That is, if the encoding of the file is ISO-8859-1; otherwise adapt it to the file's actual encoding.
Based on what you're saying, you're trying to read your text file as if it were encoded in UTF-8, but it is not. Therefore you're failing at the initial step of reading the file, and nothing you do afterwards can recover from this failure; it is useless to speak about what to do after reading the file.
We cannot guess what the real encoding of your initial file is. You would need to put the file somewhere for us to download it. All you've shown so far is that it is not in UTF-8 (because if it were, you would not have the problem described).
You've said that you're using this code:
new String(encoded, "UTF-8");
Because "encoded" contains the bytes of your file, and your file is not in UTF-8, this instruction is wrong. You need to replace "UTF-8" with whatever is the true encoding of your file.
For instance it might be:
new String(encoded, StandardCharsets.ISO_8859_1);
Another solution would be to leave your Java code as-is, and instead make its assumption that the file is in UTF-8 correct. For that, use a text editor such as Notepad++, tell it to convert the file to UTF-8, and save.
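If you would rather have the read fail loudly than silently produce � replacement characters, the standard library's CharsetDecoder can be configured to reject malformed input. A sketch (the file name is a placeholder):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StrictUtf8Read {
    public static void main(String[] args) throws IOException {
        // REPORT makes the decoder throw MalformedInputException instead
        // of substituting U+FFFD, so a wrong encoding is caught immediately.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                Files.newInputStream(Paths.get("input.txt")), decoder))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}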

Character encoding in csv

We have a requirement to pick data from an Oracle DB table and dump it into a CSV file and a plain pipe-separated text file, and to give the user a link in the application so they can view the generated CSV/text files.
As a lot of parsing was involved, we wrote a Unix shell script and are calling it from our Struts/J2EE application.
Earlier we were losing the Chinese and Roman characters in the generated files, and the generated files had the us-ascii charset (checked using file -i). We then set NLS_LANG=AMERICAN_AMERICA.AL32UTF8, and this gave us UTF-8 files.
But the characters were still gibberish, so we tried the iconv command and converted the UTF-8 files to the UTF-16LE charset:
iconv -f utf-8 -t utf-16le $recordFile > $tempFile
This works fine for the generated text file, but in the CSV the Chinese and Roman characters are still not correct. However, if we open this CSV file in Notepad, add a newline by pressing the Enter key, save it, and open it with MS Excel, all characters come out fine, including the Chinese and Roman ones, but now the text of each row is on a single line instead of split into columns.
Not sure what's going on.
Java code
PrintWriter out = servletResponse.getWriter();
servletResponse.setContentType("application/vnd.ms-excel; charset=UTF-8");
servletResponse.setCharacterEncoding("UTF-8");
servletResponse.setHeader("Content-Disposition","attachment; filename="+ fileName.toString());
FileInputStream fileInputStream = new FileInputStream(fileLoc + fileName);
int i;
while ((i = fileInputStream.read()) != -1) {
    out.write(i);
}
fileInputStream.close();
out.close();
Please let me know if I missed out any details.
Thanks to all for taking the time to go through this.
Was able to solve it. First, as mentioned by Aaron, I removed the UTF-16LE encoding to avoid future issues and encoded the files as UTF-8. Changing the PrintWriter in the Java code to an OutputStream let me see the correct characters in my text file.
The CSV was still showing garbage. It turns out we need to prepend EF BB BF at the beginning of the file, since BOM-aware software like MS Excel needs it. Changing the Java code as below did the trick for the CSV:
OutputStream out = servletResponse.getOutputStream();
out.write(239); // 0xEF
out.write(187); // 0xBB
out.write(191); // 0xBF
FileInputStream fileInputStream = new FileInputStream(fileLoc + fileName);
int i;
while ((i = fileInputStream.read()) != -1) {
    out.write(i);
}
fileInputStream.close();
out.flush();
out.close();
As always with Unicode problems, every single step of the transformation chain must work perfectly. If you make a mistake in one place, data will be silently corrupted. There is no easy way to figure out where it happens; you have to debug the code or write unit tests.
The Java code above only works if the file actually contains UTF-8 encoded data; it doesn't "magically" figure out what's in the file and convert it to UTF-8. So if the file already contains garbage, you just slap a "this is UTF-8" label on it, but it's still garbage.
That means you need to create test cases which take known test data and move it through every step of the chain: inserting into the database, reading from the database, writing to CSV, writing to the text file, reading those files, and downloading them to the user.
For each step, you need to write a unit test which takes a known Unicode string like abc öäü, processes it, and then checks the result. To make it easier to type in Java code, use "abc \u00f6\u00e4\u00fc". You may also want to add spaces at the beginning and end of the string to see whether they are properly preserved.
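A minimal sketch of one such round-trip test, covering just the file write/read step (JUnit 4 assumed; everything else is plain JDK):
import static org.junit.Assert.assertEquals;

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import org.junit.Test;

public class EncodingRoundTripTest {
    @Test
    public void utf8FileRoundTripPreservesUmlauts() throws Exception {
        String expected = " abc \u00f6\u00e4\u00fc "; // note leading/trailing spaces
        File f = File.createTempFile("roundtrip", ".txt");
        // Write the known string as UTF-8, read it back as UTF-8,
        // and verify nothing was corrupted along the way.
        Files.write(f.toPath(), expected.getBytes(StandardCharsets.UTF_8));
        String actual = new String(Files.readAllBytes(f.toPath()), StandardCharsets.UTF_8);
        assertEquals(expected, actual);
    }
}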
file -i doesn't help you much here, since it just guesses at what the file contains. There is no indicator (data or metadata) in a text file which says "this is UTF-8". UTF-16 supports a BOM header for this, but almost no one uses UTF-16, so many tools don't support it (properly).

how to read utf-8 chars in opencsv

I am trying to read from a CSV file. The file contains UTF-8 characters. So, based on Parse CSV file containing a Unicode character using OpenCSV and How read Japanese fields from CSV file into java beans?, I just wrote:
CSVReader reader = new CSVReader(new InputStreamReader(new FileInputStream("data.csv"), "UTF-8"), ';');
But it does not work. The >>Sí, es nuevo<< text is displayed correctly in Notepad, Excel, and various other text editing tools, but when I parse the file via opencsv I get >>S�, es nuevo<< (the í is a special character, if you were wondering ;).
What am I doing wrong?
You can use UTF-16LE as the encoding; I used it to write a file containing Japanese.
Thanks aioobe. It turned out the file was not really UTF-8, despite most Windows programs showing it as such. Notepad++ was the only one that did not show the file as UTF-8 encoded, and after converting the data file, the code works.
Use the code below for your issue; it might be helpful to you...
String value = URLEncoder.encode(msg[no], "UTF-8");
Thanks,
Yash
Use ISO-8859-1, ISO-8859-14, ISO-8859-15, ISO-8859-10, ISO-8859-13, or ISO-8859-2 instead of UTF-8.

I have UTF-8 - but still get "Invalid byte 1 of 1-byte UTF-8 sequence"

I create an XML String on the fly (NOT reading it from a file). Then I use Cocoon 3 to transform it via FOP to a PDF; Xerces runs somewhere in the middle. When I use hardcoded data, everything works. As soon as I put a German umlaut into the database and enrich my XML with that data, I get:
Caused by: org.apache.cocoon.pipeline.ProcessingException: Can't parse the XML string.
at org.apache.cocoon.sax.component.XMLGenerator$StringGenerator.execute(XMLGenerator.java:326)
at org.apache.cocoon.sax.component.XMLGenerator.execute(XMLGenerator.java:104)
at org.apache.cocoon.pipeline.AbstractPipeline.invokeStarter(AbstractPipeline.java:146)
at org.apache.cocoon.pipeline.AbstractPipeline.execute(AbstractPipeline.java:76)
at de.grobmeier.tab.webapp.modules.documents.InvoicePipeline.generateInvoice(InvoicePipeline.java:74)
... 87 more
Caused by: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:684)
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:554)
I then debugged my app and found out that my "Ä" (which comes from the database) has a byte value of 196, which is C4 in hex. This is what I expected according to http://www.utf8-zeichentabelle.de/.
I do not know why my code fails.
I then tried to add a BOM manually, like this:
byte[] bom = new byte[3];
bom[0] = (byte) 0xEF;
bom[1] = (byte) 0xBB;
bom[2] = (byte) 0xBF;
String myString = new String(bom) + inputString;
I know this is not exactly good, but I tried it; of course it failed. I also tried to add an XML header in front:
<?xml version="1.0" encoding="UTF-8"?>
Which failed too. Then I combined the two. Failed.
After all that, I tried something like this:
xmlInput = new String(xmlInput.getBytes("UTF8"), "UTF8");
Which in fact does nothing, because the string is already UTF-8. Still it fails.
So... any ideas what I am doing wrong and what Xerces is expecting from me?
Thanks
Christian
If your database contains only a single byte (with value 0xC4) then you aren't using UTF-8 encoding.
The character "LATIN CAPITAL LETTER A WITH DIAERESIS" has a code-point value U+00C4, but UTF-8 can't encode that in a single byte. If you check the third column "UTF-8 (hex.)" on UTF8-zeichentabelle.de you'll see that UTF-8 encodes that as 0xC3 84 (two bytes).
Please read Joel's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" for more info.
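A two-line experiment makes the difference concrete; this is plain JDK code, not taken from the question:
import java.nio.charset.StandardCharsets;

public class ShowEncodedBytes {
    public static void main(String[] args) {
        // UTF-8 encodes U+00C4 ("Ä") as two bytes: C3 84.
        for (byte b : "\u00c4".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // prints: C3 84
        // ISO-8859-1 encodes it as the single byte C4 - exactly
        // what was found in the database.
        for (byte b : "\u00c4".getBytes(StandardCharsets.ISO_8859_1)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // prints: C4
    }
}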
EDIT: Christian found the answer himself; it turned out to be a problem in the Cocoon 3 SAX component (I guess in the alpha 3 version). If you pass XML as a String into the XMLGenerator class, something goes wrong during SAX parsing, causing this mess.
I looked up the code to find the actual problem in Cocoon-stax:
if (XMLGenerator.this.logger.isDebugEnabled()) {
    XMLGenerator.this.logger.debug("Using a string to produce SAX events.");
}
XMLUtils.toSax(new ByteArrayInputStream(this.xmlString.getBytes()), XMLGenerator.this.getSAXConsumer());
As you can see, the call to getBytes() creates a byte array in the JRE's default encoding, which will then fail to parse: the XML declares itself to be UTF-8, whereas the data is now in bytes again, likely in your Windows codepage.
As a workaround, one can use the following:
new org.apache.cocoon.sax.component.XMLGenerator(xmlInput.getBytes("UTF-8"), "UTF-8");
This will trigger the right internal actions (as Christian found out by experimenting with the API).
I've opened an issue in Apache's bug tracker.
EDIT 2: The issue is fixed and will be included in an upcoming release.
The C4 you see on that page refers to the Unicode code point, U+00C4. The byte sequence used to represent such a code point in UTF-8 is NOT "\xC4". What you want is what's in the "UTF-8 (hex.)" column, namely "\xC3\x84".
Therefore, your data is not in UTF-8.
You can read about how data is encoded in UTF-8 here.
I'm running Windows 7 with TextPad as the text editor for manually building the XML data file. I was getting the MalformedByteSequenceException. The encoding declared in my XML file was UTF-8. After poking around, I found that my editor had a tool "Tools ... Convert to DOS". I did that, re-saved the file, and the exception went away; my code ran fine.
I then looked at the default encoding for that file type in my editor. It was ASCII, though when I changed the XML encoding attribute to ASCII, I got a different MalformedByteSequenceException.
So on Windows systems, you might try keeping the XML encoding at UTF-8 but saving the file DOS-encoded. I did not dig any further into why this works.
