I am trying to parse an RSS feed and have a problem with encoding.
If the encoding is UTF-8 the result is correct, but other encodings, especially windows-1251, come out wrong.
The code is below:
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
InputStream in = new URL(channel.getUrl()).openStream();
XMLEventReader eventReader = inputFactory.createXMLEventReader(in);
I don't want to save the content to a local file after reading it. Can anybody help?
It is very difficult to guess the encoding just by analyzing the bytes from the input stream. Therefore, when you do not specify it, the platform's default encoding is normally used.
However, an XMLInputFactory can create an XMLEventReader that uses a specific encoding. Just call XMLInputFactory.createXMLEventReader(InputStream stream, String encoding).
That means you must know the encoding in advance. Maybe there is a contract for the interface you are serving.
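A minimal sketch of that overload, assuming the feed from the question is windows-1251 (the URL is a placeholder for channel.getUrl()):

import java.io.InputStream;
import java.net.URL;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;

public class FeedReader {
    public static void main(String[] args) throws Exception {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        InputStream in = new URL("http://example.com/feed.xml").openStream(); // placeholder URL
        // Tell StAX which encoding to use instead of relying on a default
        XMLEventReader eventReader = inputFactory.createXMLEventReader(in, "windows-1251");
        while (eventReader.hasNext()) {
            System.out.println(eventReader.nextEvent());
        }
    }
}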
I am using the Oracle query below to retrieve data from an Oracle database. My column type is XMLTYPE:
"select a.xmlrecord.getClobVal() xmlrecord from " + tablename + " a"
The reason I am using getClobVal() is that getStringVal() has a limitation in Oracle: it cannot retrieve more than 4000 characters.
Currently I am extracting the data from the database and sending it directly to the SAX parser. Below is the piece of code I'm using:
while (orset.next()) {
    Reader reader = new BufferedReader(orset.getCharacterStream("xmlrecord")); // retrieve the CLOB as a character stream
    InputSource is = new InputSource(reader);
    is.setEncoding("UTF-8");
    sp.parse(is, handler);
}
The problem is that we are unable to retrieve UTF-8 characters even though I set UTF-8 in my code.
Kindly assist.
Your reader is a character stream, not a byte stream. Encodings are ignored for character streams and take effect only on byte streams, so if you want the encoding to be honored, build your InputSource from a byte stream (an InputStream) instead of a character stream.
I am quoting two sources below:
Class InputSource
The SAX parser will use the InputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly, disregarding any text encoding declaration found in that stream. If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification. If neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.
setEncoding
This method has no effect when the application provides a character stream.
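To illustrate the difference, here is a minimal, self-contained sketch; the XML string and class name are made up for the example:

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import org.xml.sax.InputSource;

public class InputSourceEncodingDemo {
    public static void main(String[] args) {
        String xml = "<greeting>\u0645\u0631\u062D\u0628\u0627</greeting>"; // Arabic sample text

        // Character stream: the Reader already delivers decoded characters,
        // so setEncoding is ignored (this is the situation in the question).
        InputSource fromReader = new InputSource(new StringReader(xml));
        fromReader.setEncoding("UTF-8"); // no effect

        // Byte stream: setEncoding (or the encoding in the XML declaration) is honored.
        InputStream bytes = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        InputSource fromStream = new InputSource(bytes);
        fromStream.setEncoding("UTF-8"); // used to decode the bytes
    }
}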
UTF-8 is working fine with the character-stream result set.
The piece of code above did return UTF-8 characters; the real problem was that the Windows machine did not have support for the required characters installed.
Finally we installed an Arabic (UTF-8) language package on the Windows PC and the issue was resolved.
I'm working on an RSS parser in Android
(upgrading a parser I found on the internet).
From what I know, the SAX parser recognizes the encoding automatically from the XML declaration, but when I try to parse a feed that declares windows-1255 encoding it fails and throws an exception.
I tried a few things:
final InputSource source = new InputSource(feed);
Reader isr = new InputStreamReader(feed);
source.setCharacterStream(isr);
I even tried telling it the specific encoding:
source.setEncoding("Windows-1255");
I also tried looking at the locator:
@Override
public void setDocumentLocator(Locator locator) {
}
And it reports the encoding as UTF-16.
Please help me solve this annoying problem!
Sorry for the mess with the code snippets; the code button refuses to work for some reason.
Chances are the platform itself doesn't know about the "windows-1255" encoding. After all, it's a Windows-based encoding - I wouldn't want to rely on it being available on any other platforms, particularly mobile ones where things are generally cut down to the "must-have" options.
You need to set the encoding on the InputStreamReader:
Reader isr = new InputStreamReader(feed, "windows-1255");
final InputSource source = new InputSource(isr);
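If that still fails, it is worth checking whether the runtime knows the charset at all, since that was the suspicion above. A small sketch; the class and method names are mine, not from the question:

import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import org.xml.sax.InputSource;

public final class FeedSource {
    // Returns an InputSource for a windows-1255 feed, or null if this runtime lacks the charset.
    static InputSource forWindows1255(InputStream feed) {
        if (!Charset.isSupported("windows-1255")) {
            return null; // caller must fall back, e.g. fetch the feed transcoded to UTF-8
        }
        Reader isr = new InputStreamReader(feed, Charset.forName("windows-1255"));
        return new InputSource(isr);
    }
}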
From the javadoc, the logic for reading from an InputSource goes something like this:
Is there a character stream? If there is, use that. (This is what happens when you pass a Reader such as InputStreamReader.)
Otherwise, use the byte stream (InputStream):
Is there an encoding set on the InputSource? Use that.
No encoding set? Try to detect the encoding from the XML declaration.
I've tried several popular CSV-to-Java deserializers - OpenCSV, JSefa, and Smooks - and none of them correctly reads this file:
First Name,Last Name
エリック,山中
花子,鈴木
一郎,鈴木
裕子,田中
政治,山村
into my java object collection.
OpenCsv code:
HeaderColumnNameTranslateMappingStrategy<Contact> strat = new HeaderColumnNameTranslateMappingStrategy<Contact>();
strat.setType(Contact.class);
strat.setColumnMapping(colNameTranslateMap);
CsvToBean<Contact> csv = new CsvToBean<Contact>();
InputStreamReader fileReader = new InputStreamReader(new FileInputStream(file), "UTF-8");
contacts = csv.parse(strat, new CSVReader(fileReader));
I've tried setting the charset to UTF-8, UTF-16, and ISO-8859-1 when creating the InputStreamReader over the FileInputStream, but the collection is never populated properly. As seen in the debugger and in System.out, the fields contain garbage and the number of records is often wrong.
FileInputStream is for reading streams of binary data, like an MP3 or a PNG. Instead of a FileInputStream, use a FileReader for reading streams of characters.
To be blunt: who cares which charsets you tried if they didn't work? You need to figure out what encoding the CSV file is actually using, and set that encoding when reading the file. From the FileReader documentation on specifying the encoding:
The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
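One quick way to get a hint about the file's actual encoding is to look at its first bytes for a byte order mark. A small sketch; the file name is a placeholder:

import java.io.FileInputStream;
import java.io.InputStream;

public class BomSniffer {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream("contacts.csv")) { // placeholder file name
            int b0 = in.read(), b1 = in.read(), b2 = in.read();
            if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
                System.out.println("UTF-8 BOM");
            } else if (b0 == 0xFF && b1 == 0xFE) {
                System.out.println("UTF-16LE BOM");
            } else if (b0 == 0xFE && b1 == 0xFF) {
                System.out.println("UTF-16BE BOM");
            } else {
                System.out.println("No BOM; inspect a hex dump of some non-ASCII text to identify the encoding");
            }
        }
    }
}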
I'm trying to parse XML files from different sources (over which I have little control). Most of them are encoded in UTF-8 and don't cause any problems with the following snippet:
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
InputSource is = new InputSource(getInputStream());
parser.parse(is, handler);
Since SAX defaults to UTF-8 this is fine. However, some of the documents declare:
<?xml version="1.0" encoding="ISO-8859-1"?>
Even though ISO-8859-1 is declared, SAX still defaults to UTF-8.
Only if I add:
is.setEncoding("ISO-8859-1");
will SAX use the correct encoding.
How can I let SAX automatically detect the correct encoding from the XML declaration without setting it explicitly? I need this because I don't know beforehand what the encoding of the file will be.
Thanks in advance,
Allan
Use an InputStream as the argument to InputSource when you want SAX to autodetect the encoding.
If you want to set a specific encoding, use a Reader constructed with that encoding, or the setEncoding method.
Why? Because encoding autodetection algorithms need the raw bytes, not data already converted to characters.
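A minimal sketch of the autodetect path; the file name is a placeholder, and DefaultHandler stands in for the question's FeedHandler:

import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class AutodetectParse {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // Pass the raw bytes; do NOT wrap them in a Reader, so the parser can
        // see <?xml version="1.0" encoding="ISO-8859-1"?> and decode accordingly.
        InputStream raw = new FileInputStream("feed.xml"); // placeholder file name
        parser.parse(new InputSource(raw), new DefaultHandler());
    }
}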
The question in the subject is: how do you let the SAX parser determine the encoding from the XML declaration? I found Allan's answer to the question misleading, so I provided this alternative, based on Jörn Horstmann's comment and my later experience.
I found the answer myself.
The SAX parser uses InputSource internally, and from the InputSource docs:
The SAX parser will use the InputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly, disregarding any text encoding declaration found in that stream. If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification. If neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.
So basically you need to pass a character stream to the parser for it to pick up the correct encoding. See the solution below:
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
Reader isr = new InputStreamReader(getInputStream());
InputSource is = new InputSource();
is.setCharacterStream(isr);
parser.parse(is, handler);
When I check my file with Notepad++ it's in ANSI encoding. What am I doing wrong here?
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file), "UTF8");
try {
    out.write(text);
    out.flush();
} finally {
    out.close();
}
UPDATE:
This is solved now. The reason JBoss didn't understand my XML wasn't the encoding after all; it was the naming of my XML file. Thanks all for the help, even though there really wasn't any problem...
If you're creating an XML file (as your comments imply), I would strongly recommend that you use the XML libraries to output this and write the correct XML encoding header. Otherwise your character encoding won't conform to XML standards and other tools (like your JBoss instance) will rightfully complain.
// Prepare the DOM document for writing
Source source = new DOMSource(doc);
// Prepare the output file
File file = new File(filename);
Result result = new StreamResult(file);
// Write the DOM document to the file
Transformer xformer = TransformerFactory.newInstance().newTransformer();
xformer.transform(source, result);
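If needed, the output encoding can also be set explicitly on the transformer before calling transform (this line is my addition, not part of the original snippet):
xformer.setOutputProperty(javax.xml.transform.OutputKeys.ENCODING, "UTF-8");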
There's no such thing as plain text. The problem is that an application is decoding character data without you telling it which encoding the data uses.
Although many Microsoft apps rely on the presence of a Byte Order Mark to indicate a Unicode file, this is by no means standard. The Unicode BOM FAQ says more.
You can add a BOM to your output by writing the character '\uFEFF' at the start of the stream. More info here. This should be enough for applications that rely on BOMs.
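For example, a sketch along the lines of the original snippet (file and text are the variables from the question):
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
try {
    out.write('\uFEFF'); // BOM; the writer encodes it as EF BB BF
    out.write(text);
} finally {
    out.close();
}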
UTF-8 is designed to be, in the common case, rather indistinguishable from ANSI. So when you write text to a file and encode the text with UTF-8, in the common case, it looks like ANSI to anyone else who opens the file.
UTF-8 is 1-byte-per-character for all ASCII characters, just like ANSI.
UTF-8 has all the same bytes for the ASCII characters as ANSI does.
UTF-8 does not have any special header characters, just as ANSI does not.
It's only when you start to get into the non-ASCII codepoints that things start looking different.
But in the common case, byte-for-byte, ANSI and UTF-8 are identical.
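A small check that illustrates this; the class name and strings are made up:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiBytesDemo {
    public static void main(String[] args) {
        String ascii = "plain ASCII text";
        // Byte-for-byte identical in UTF-8 and a Latin "ANSI" codepage
        System.out.println(Arrays.equals(
                ascii.getBytes(StandardCharsets.UTF_8),
                ascii.getBytes(StandardCharsets.ISO_8859_1))); // true

        String accented = "caf\u00E9"; // 'é' is outside ASCII
        System.out.println(Arrays.equals(
                accented.getBytes(StandardCharsets.UTF_8),        // C3 A9 for 'é'
                accented.getBytes(StandardCharsets.ISO_8859_1))); // E9 for 'é'
        // false: the encodings diverge once non-ASCII characters appear
    }
}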
If there is no BOM (and Java doesn't write one for UTF-8, nor does it recognize one when reading), the text is identical in ANSI and UTF-8 encoding as long as only characters in the ASCII range are used. Therefore Notepad++ cannot detect any difference.
(And there seems to be a known issue with UTF-8 BOM handling in Java anyway...)
The IANA registered type is "UTF-8", not "UTF8". However, Java should throw an exception for invalid encodings, so that's probably not the problem.
I suspect that Notepad++ is the problem. Examine the text with a hex dump program, and you should see that it is properly encoded.
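If you don't have a hex dump tool handy, a few lines of Java will do; the file name is a placeholder:

import java.nio.file.Files;
import java.nio.file.Paths;

public class HexPeek {
    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get("out.txt")); // placeholder file name
        for (int i = 0; i < Math.min(data.length, 32); i++) {
            System.out.printf("%02X ", data[i] & 0xFF);
        }
        System.out.println();
        // Non-ASCII characters written as UTF-8 appear as multi-byte sequences (e.g. C3 A9 for 'é').
    }
}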
Did you try writing a BOM at the beginning of the file? A BOM is the only thing that can tell the editor the file is in UTF-8; otherwise, a UTF-8 file can just look like Latin-1 or extended ANSI.
You can do it like this:
public final static byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
...
OutputStream os = new FileOutputStream(file);
os.write(UTF8_BOM);
os.flush();
OutputStreamWriter out = new OutputStreamWriter(os, "UTF8");
try {
    out.write(text);
    out.flush();
} finally {
    out.close();
}