SAX Parser doesn't recognize windows-1255 encoding

I'm working on an RSS parser in Android
(upgrading a parser I found on the internet).
From what I know, the SAX parser recognizes the encoding automatically from the XML declaration, but when I try to parse a feed that declares windows-1255 encoding, it fails to parse and throws an exception.
I tried a few things:
final InputSource source = new InputSource(feed);
Reader isr = new InputStreamReader(feed);
source.setCharacterStream(isr);
I even tried telling it the specific encoding:
source.setEncoding("Windows-1255");
I also tried looking at the locator:
@Override
public void setDocumentLocator(Locator locator) {
}
And it reports the encoding as UTF-16.
Please help me solve this annoying problem!
Sorry for the mess with the code snippets; the code button refuses to work for some reason.

Chances are the platform itself doesn't know about the "windows-1255" encoding. After all, it's a Windows-based encoding; I wouldn't rely on it being available on other platforms, particularly mobile ones where things are generally cut down to the "must-have" options.

You need to set the encoding on the InputStreamReader:
Reader isr = new InputStreamReader(feed, "windows-1255");
final InputSource source = new InputSource(isr);
From the Javadoc, the logic for reading from an InputSource goes something like this:
Is there a character stream? If there is, use that. (This is what happens if you use a Reader such as InputStreamReader.)
No character stream? Use the byte stream (InputStream):
Is there an encoding set on the InputSource? Use that.
No encoding set? Try to detect the encoding from the XML declaration in the file.
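To illustrate the byte-stream branch of that logic, here is a minimal sketch (assuming feed is the raw InputStream of the RSS feed, as in the question):
InputSource source = new InputSource(feed); // byte stream, no character stream set
source.setEncoding("windows-1255");         // honored only because no Reader is attached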

Related

How to convert MultipartFile to UTF-8 always while using CSVFormat?

I am using a Spring Boot REST API to upload a CSV file as a MultipartFile. The CSVFormat class of org.apache.commons.csv is used to format the MultipartFile, and CSVParser is used to parse it; the iterated records are stored in a MySQL database.
csvParser = CSVFormat.DEFAULT
.withDelimiter(separator)
.withIgnoreSurroundingSpaces()
.withQuote('"')
.withHeader(CsvHeaders.class)
.parse(new InputStreamReader(csvFile.getInputStream()));
The observation is that when the CSV files are uploaded with a UTF-8 charset, everything works fine. But if the CSV file is in a different encoding (ANSI etc.), German and other non-ASCII characters are decoded to random symbols.
For example, äößü come out as ����
I tried the following to specify the encoding explicitly, but it did not work either:
csvParser = CSVFormat.DEFAULT
.withDelimiter(separator)
.withIgnoreSurroundingSpaces()
.withQuote('"')
.withHeader(CsvHeaders.class)
.parse(new InputStreamReader(csvFile.getInputStream(), StandardCharsets.UTF_8));
Can you please advise. Thank you so much in advance.
What you did, new InputStreamReader(csvFile.getInputStream(), StandardCharsets.UTF_8), tells the CSV parser that the content of the input stream is UTF-8 encoded.
Since UTF-8 is (usually) the default encoding, this is effectively the same as using new InputStreamReader(csvFile.getInputStream()).
If I understand your question correctly, this is not what you intended. Instead, you want to automatically choose the right encoding based on the import file, right?
Unfortunately, the CSV format does not store which encoding was used.
There are libraries you could use to guess the most probable encoding based on the characters contained in the file. While they are pretty accurate, they are still guessing, and there is no guarantee that you will get the right encoding in the end.
Depending on your use case, it might be easier to just agree with the consumer on a fixed encoding (e.g. they can upload UTF-8 or ANSI, but not both).
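If you do want to guess, here is a sketch using ICU4J's CharsetDetector (one of several such libraries; the result is a best guess, not a guarantee):
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

// Detect the most probable charset from the raw bytes, then build the Reader with it.
byte[] bytes = csvFile.getBytes(); // MultipartFile#getBytes() (may throw IOException)
CharsetDetector detector = new CharsetDetector();
detector.setText(bytes);
CharsetMatch match = detector.detect();
Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), match.getName());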
Try as shown below, which worked for me for the same issue:
new InputStreamReader(csvFile.getInputStream(), "UTF-8")

Reading UTF-8 .properties files in Java 1.5?

I have a project where everything is in UTF-8. I was using the Properties.load(Reader) method to read properties files in this encoding. But now, I need to make the project compatible with Java 1.5, and the mentioned method doesn't exist in Java 1.5. There is only a load method that takes an InputStream as a parameter, which is assumed to be in ISO-8859-1.
Is there any simple way to make my project 1.5-compatible without having to change all the .properties files to ISO-8859-1? I don't really want to have a mix of encodings in my project (encodings are enough of a time sink one at a time, let alone mixed) or to convert my whole project to ISO-8859-1.
By "a simple way" I mean "without creating a custom Properties class from scratch".
Could you use XML properties instead? As I understand the spec, .properties files should be in ISO-8859-1; if you want other characters, they should be escaped, e.g. using the native2ascii tool.
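For example, a sketch (Properties.loadFromXML exists since Java 1.5 and reads the encoding from the XML declaration, so UTF-8 works out of the box; the file name is hypothetical):
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Properties;

Properties props = new Properties();
InputStream in = new FileInputStream("messages.xml"); // hypothetical file name
try {
    props.loadFromXML(in); // encoding is taken from the XML declaration
} finally {
    in.close();
}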
One strategy that might work for this situation is as follows:
Read the data from the Reader into a ByteArrayOutputStream.
Once that is complete, call toByteArray() (but see the correction below).
With the byte[], construct a ByteArrayInputStream.
Use the ByteArrayInputStream in Properties.load(InputStream).
As pointed out, the above fails to actually convert the character set from UTF-8 to ISO-8859-1. To fix that, a tweak:
after the ByteArrayOutputStream has been filled, instead of calling toByteArray(),
call toString("ISO-8859-1") to get an ISO-8859-1 encoded String, and then
call String.getBytes() to get the byte[].
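A sketch of the corrected idea (decode the UTF-8 text to a String, then re-encode it as ISO-8859-1 bytes; this only round-trips correctly if the file, though UTF-8 encoded, uses only characters that also exist in ISO-8859-1; classes are from java.io/java.util):
// Decode the UTF-8 file, re-encode it as ISO-8859-1 bytes, and hand
// those to Properties.load(InputStream), which expects ISO-8859-1.
Reader reader = new InputStreamReader(new FileInputStream("app.properties"), "UTF-8"); // hypothetical file
StringWriter sw = new StringWriter();
char[] buf = new char[4096];
for (int n; (n = reader.read(buf)) != -1; ) {
    sw.write(buf, 0, n);
}
reader.close();
Properties props = new Properties();
props.load(new ByteArrayInputStream(sw.toString().getBytes("ISO-8859-1")));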
What you can do is start a thread that reads the data using a BufferedReader, then writes it out to a PipedOutputStream, which is in turn linked to a PipedInputStream that load uses.
PipedOutputStream pos = new PipedOutputStream();
PipedInputStream pis = new PipedInputStream(pos);
ReaderRunnable reader = new ReaderRunnable(pos, new File("utfproperty.properties"));
Thread t = new Thread(reader);
t.start();
properties.load(pis);
t.join();
The BufferedReader reads the data one character at a time, and if it detects a character outside the US-ASCII (i.e. low 7-bit) range, it writes "\u" plus the four-digit hex character code to the PipedOutputStream.
ReaderRunnable would be a class that looks like:
public class ReaderRunnable implements Runnable {
    private final OutputStream os;
    private final File f;

    public ReaderRunnable(OutputStream os, File f) {
        this.os = os;
        this.f = f;
    }

    public void run() {
        try {
            // Open the file and read it, escaping any non-US-ASCII characters.
            Reader in = new BufferedReader(new InputStreamReader(new FileInputStream(f), "UTF-8"));
            for (int c; (c = in.read()) != -1; ) {
                if (c < 128) os.write(c); // US-ASCII passes through unchanged
                else os.write(String.format("\\u%04x", c).getBytes("US-ASCII")); // escape the rest
            }
            in.close();
            os.close(); // lets Properties.load(pis) see end-of-stream
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
Now, after writing all that, I thought that someone must have had this problem before and solved it, and the best place to look for these things is Apache Commons. Fortunately, they have an implementation there:
https://commons.apache.org/io/apidocs/org/apache/commons/io/input/ReaderInputStream.html
The implementation from Apache is not without flaws, though: your input file, even if it is UTF-8, must only contain characters from the ISO-8859-1 character set. The design I provided above can handle that situation.
Depending on your build engine you can \uXXXX-escape the properties into the build target directory. Maven can filter them via the native2ascii-maven-plugin.
What I personally do in my projects is keep my properties in UTF-8 files with the extension .uproperties and convert them to ISO-8859-1 .properties files at build time using native2ascii.exe. This allows me to maintain my properties in UTF-8, and the Ant script does everything else for me.
What I just experienced is: make all your .java files UTF-8 encoded as well (not only the properties files where you store UTF-8 characters). This way there is no need to use an InputStreamReader either. Also, make sure to compile with UTF-8 encoding.
This has worked for me without any added UTF-8 parameter.
To test this, write a simple stub program in Eclipse and change the encoding of that Java file by going to the file's Properties > Resource section and setting the encoding to UTF-8.

Stop Jsoup from encoding

I'm trying to parse a URL with Jsoup which contains the following text: Ætterni.
After parsing the document, the same string looks like this: &AElig;tterni.
How do I prevent this from happening? I want the document 1:1, exactly as it was.
Code:
doc = Jsoup.connect(url).get();
String docEncoding = doc.outputSettings().charset().name();
OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(localLink), docEncoding);
writer.write(doc.html());
writer.close();
Use
doc.outputSettings().escapeMode(EscapeMode.xhtml);
to avoid the entity conversion.
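In context, that might look like this (a sketch; EscapeMode here is org.jsoup.nodes.Entities.EscapeMode):
Document doc = Jsoup.connect(url).get();
// With the xhtml escape mode only the bare XML entities (lt, gt, amp, quot)
// are escaped, so Æ stays a raw character as long as the output charset can encode it.
doc.outputSettings().escapeMode(Entities.EscapeMode.xhtml);
String html = doc.html();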
You don't seem to be utilizing Jsoup's powers in any way here. I'd just stream the HTML plain using java.net.URL. That way you have a 1:1 copy of the response.
InputStream input = new URL(url).openStream();
OutputStream output = new FileOutputStream(localLink);
// Now copy input to output the usual Java IO way.
You should not use a Reader/Writer for this, as it may malform the characters of sources in an unknown encoding, because the platform default encoding would be used instead.
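For completeness, the copy step might look like this (a plain byte-for-byte copy; the buffer size is arbitrary):
byte[] buffer = new byte[8192];
for (int length; (length = input.read(buffer)) != -1; ) {
    output.write(buffer, 0, length);
}
input.close();
output.close();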

Howto let the SAX parser determine the encoding from the xml declaration?

I'm trying to parse XML files from different sources (over which I have little control). Most of them are encoded in UTF-8 and don't cause any problems using the following snippet:
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
InputSource is = new InputSource(getInputStream());
parser.parse(is, handler);
Since SAX defaults to UTF-8 this is fine. However some of the documents declare:
<?xml version="1.0" encoding="ISO-8859-1"?>
Even though ISO-8859-1 is declared, SAX still defaults to UTF-8.
Only if I add:
is.setEncoding("ISO-8859-1");
will SAX use the correct encoding.
How can I let SAX automatically detect the correct encoding from the XML declaration without setting it explicitly? I need this because I don't know beforehand what the encoding of the file will be.
Thanks in advance,
Allan
Use an InputStream as the argument to InputSource when you want SAX to autodetect the encoding.
If you want to force a specific encoding, use a Reader with that encoding, or the setEncoding method.
Why? Because encoding autodetection algorithms require the raw bytes, not data already converted to characters.
The question in the subject is: how to let the SAX parser determine the encoding from the XML declaration? I found Allan's answer to that question misleading, so I provided this alternative, based on Jörn Horstmann's comment and my later experience.
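In code, the byte-stream variant looks like this (a sketch reusing the question's getInputStream() and FeedHandler):
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
// Pass the raw byte stream; SAX sniffs the encoding from the
// XML declaration (or BOM) instead of assuming one.
InputSource is = new InputSource(getInputStream());
parser.parse(is, new FeedHandler());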
I found the answer myself.
The SAX parser uses the InputSource internally, and from the InputSource docs:
The SAX parser will use the InputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly, disregarding any text encoding declaration found in that stream. If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification. If neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.
So basically you need to pass a character stream to the parser for it to pick up the correct encoding. See the solution below:
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
Reader isr = new InputStreamReader(getInputStream());
InputSource is = new InputSource();
is.setCharacterStream(isr);
parser.parse(is, handler);

File is not saved in UTF-8 encoding even when I set encoding to UTF-8

When I check my file with Notepad++ it's in ANSI encoding. What am I doing wrong here?
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file), "UTF8");
try {
    out.write(text);
    out.flush();
} finally {
    out.close();
}
UPDATE:
This is solved now. The reason JBoss did not understand my XML wasn't the encoding; it was the naming of my XML file. Thanks all for the help, even though there really wasn't any problem...
If you're creating an XML file (as your comments imply), I would strongly recommend that you use the XML libraries to output this and write the correct XML encoding header. Otherwise your character encoding won't conform to XML standards and other tools (like your JBoss instance) will rightfully complain.
// Prepare the DOM document for writing
Source source = new DOMSource(doc);
// Prepare the output file
File file = new File(filename);
Result result = new StreamResult(file);
// Write the DOM document to the file, declaring the encoding explicitly
Transformer xformer = TransformerFactory.newInstance().newTransformer();
xformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
xformer.transform(source, result);
There's no such thing as plain text. The problem is that an application is decoding character data without you telling it which encoding the data uses.
Although many Microsoft apps rely on the presence of a Byte Order Mark to indicate a Unicode file, this is by no means standard. The Unicode BOM FAQ says more.
You can add a BOM to your output by writing the character '\uFEFF' at the start of the stream. This should be enough for applications that rely on BOMs.
UTF-8 is designed to be, in the common case, rather indistinguishable from ANSI. So when you write text to a file and encode the text with UTF-8, in the common case, it looks like ANSI to anyone else who opens the file.
UTF-8 is 1-byte-per-character for all ASCII characters, just like ANSI.
UTF-8 has all the same bytes for the ASCII characters as ANSI does.
UTF-8 does not have any special header characters, just as ANSI does not.
It's only when you start to get into the non-ASCII codepoints that things start looking different.
But in the common case, byte-for-byte, ANSI and UTF-8 are identical.
If there is no BOM (and Java doesn't output one for UTF8; it doesn't even recognize it), the text is identical in ANSI and UTF8 encoding as long as only characters in the ASCII range are being used. Therefore Notepad++ cannot detect any difference.
(And there seems to be an issue with UTF8 in Java anyway...)
The IANA registered type is "UTF-8", not "UTF8". However, Java should throw an exception for invalid encodings, so that's probably not the problem.
I suspect that Notepad is the problem. Examine the text using a hexdump program, and you should see it properly encoded.
Did you try writing a BOM at the beginning of the file? A BOM is the only thing that can tell the editor the file is in UTF-8. Otherwise, the UTF-8 file can look just like Latin-1 or extended ANSI.
You can do it like this:
public final static byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
...
OutputStream os = new FileOutputStream(file);
os.write(UTF8_BOM);
os.flush();
OutputStreamWriter out = new OutputStreamWriter(os, "UTF8");
try {
    out.write(text);
    out.flush();
} finally {
    out.close();
}
