safest way to read clob into xml parser - java

I'm getting an input stream from a Clob in Oracle 11 (using the Oracle 11 JDBC driver) and passing the input stream to an XML parser in Java:
java.sql.Clob clob = resultSet.getClob("myClob");
InputStream is = clob.getAsciiStream();
MyDom dom = MyDomParser.parse(is);
Wondering if using a character stream would be safer, e.g. instead:
Reader r = clob.getCharacterStream();
MyDom dom = MyDomParser.parse(r);
My thinking is that getCharacterStream() might be doing some decoding that helps guarantee well-formed text is returned. I'm not sure if there is any real difference between the two ways of reading the Clob shown here.

There is not much difference, but getCharacterStream() is safer for Unicode data: the driver decodes the database character set into Java characters, whereas getAsciiStream() assumes the content is ASCII and can corrupt non-ASCII characters. Check the link:
http://community.actian.com/wiki/Manipulating_SQL_CLOB_data_with_JDBC
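For illustration, a minimal sketch of the character-stream approach with resource handling added (MyDomParser is the asker's own class, assumed here to accept a Reader):

java.sql.Clob clob = resultSet.getClob("myClob");
// getCharacterStream() hands back already-decoded characters, so the
// database character set cannot be misread as ASCII along the way.
try (Reader r = clob.getCharacterStream()) {
    MyDom dom = MyDomParser.parse(r);
}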

Related

How to convert MultipartFile to UTF-8 always while using CSVFormat?

I am using a Spring Boot REST API to upload a CSV file as a MultipartFile. The CSVFormat class of org.apache.commons.csv is used to format the MultipartFile, and CSVParser is used to parse it; the iterated records are stored in a MySQL database.
csvParser = CSVFormat.DEFAULT
.withDelimiter(separator)
.withIgnoreSurroundingSpaces()
.withQuote('"')
.withHeader(CsvHeaders.class)
.parse(new InputStreamReader(csvFile.getInputStream()));
The observation is that when the CSV files are uploaded with a charset of UTF-8, it works well. But if the CSV file is in a different encoding (ANSI etc.), German and other language characters are decoded to random symbols.
Example äößü are encoded to ����
I tried the below to specify the encoding standard, it did not work too.
csvParser = CSVFormat.DEFAULT
.withDelimiter(separator)
.withIgnoreSurroundingSpaces()
.withQuote('"')
.withHeader(CsvHeaders.class)
.parse(new InputStreamReader(csvFile.getInputStream(), StandardCharsets.UTF_8));
Can you please advise. Thank you so much in advance.
What you did new InputStreamReader(csvFile.getInputStream(), StandardCharsets.UTF_8) tells the CSV parser that the content of the inputstream is UTF-8 encoded.
Since UTF-8 is (usually) the standard encoding, this is effectively the same as using new InputStreamReader(csvFile.getInputStream()).
If I get your question correctly, this is not what you intended. Instead you want to automatically choose the right encoding based on the Import-file, right?
Unfortunately, the CSV format does not store the information about which encoding was used.
There are some libraries you could use to guess the most probable encoding based on the characters contained in the file. While they are pretty accurate, they are still guessing and there is no guarantee that you will get the right encoding in the end.
Depending on your use case, it might be easier to just agree with the consumer on a fixed encoding (e.g. they can upload UTF-8 or ANSI, but not both).
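As an illustration of the detection approach mentioned above, here is a minimal sketch assuming the juniversalchardet library is on the classpath; it guesses the charset from the raw bytes and falls back to UTF-8 when detection fails:

import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import org.mozilla.universalchardet.UniversalDetector;

byte[] bytes = csvFile.getBytes(); // MultipartFile content
UniversalDetector detector = new UniversalDetector(null);
detector.handleData(bytes, 0, bytes.length);
detector.dataEnd();
String detected = detector.getDetectedCharset(); // may be null
Charset charset = detected != null ? Charset.forName(detected) : StandardCharsets.UTF_8;
// Still a guess, not a guarantee; log the detected charset for troubleshooting.
csvParser = CSVFormat.DEFAULT
        .withDelimiter(separator)
        .withIgnoreSurroundingSpaces()
        .withQuote('"')
        .withHeader(CsvHeaders.class)
        .parse(new InputStreamReader(new ByteArrayInputStream(bytes), charset));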
Try as shown below, which worked for me for the same issue:
new InputStreamReader(csvFile.getInputStream(), "UTF-8")

UTF-8 in clobval query and sax parser

I am using the below Oracle query to retrieve data from an Oracle database. My column type is XMLTYPE:
"select a.xmlrecord.getClobVal() xmlrecord from " + tablename + " a"
The reason why I am using getClobVal() is that we have a limitation in getStringVal(), where we cannot retrieve more than 4000 characters in Oracle.
Currently I am extracting the data from database and sending it directly to sax parser. Below is the piece of code which I'm using
while (orset.next()) {
    Reader reader = new BufferedReader(orset.getCharacterStream("xmlrecord")); // retrieve the CLOB content
    InputSource is = new InputSource(reader);
    is.setEncoding("UTF-8");
    sp.parse(is, handler);
}
The problem is that we are unable to retrieve UTF-8 characters even though I set the encoding to UTF-8 in my code.
Kindly assist.
Your reader is a character stream, not a byte stream. Encodings are ignored for character streams and have an effect only on byte streams, so if you wish to control the encoding, build your InputSource from a byte stream instead of a character stream (see the sketch after the quoted documentation below).
I am quoting two sources below,
Class InputSource
The SAX parser will use the InputSource object to determine how to
read XML input. If there is a character stream available, the parser
will read that stream directly, disregarding any text encoding
declaration found in that stream. If there is no character stream, but
there is a byte stream, the parser will use that byte stream, using
the encoding specified in the InputSource or else (if no encoding is
specified) autodetecting the character encoding using an algorithm
such as the one in the XML specification. If neither a character
stream nor a byte stream is available, the parser will attempt to open
a URI connection to the resource identified by the system identifier.
setEncoding
This method has no effect when the application provides a character
stream.
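For illustration, a minimal sketch of the byte-stream variant: the driver-decoded characters are re-encoded as UTF-8 bytes so the parser's encoding handling actually applies (getSubString is standard java.sql.Clob API):

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.xml.sax.InputSource;

while (orset.next()) {
    java.sql.Clob clob = orset.getClob("xmlrecord");
    // The JDBC driver has already decoded the database character set.
    String xml = clob.getSubString(1, (int) clob.length());
    // Hand the parser bytes, not chars, so setEncoding takes effect.
    InputSource is = new InputSource(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    is.setEncoding("UTF-8");
    sp.parse(is, handler);
}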
UTF-8 worked fine with the character-stream ResultSet after all.
The above piece of code returned UTF-8 characters correctly; the problem was that the Windows machine did not support the UTF-8 character set for display.
Finally, we installed a package for Arabic characters (UTF-8) on the Windows PC, and the issue was resolved.

How to convert byte[] from ANSI to UTF-8

byte[] data;
ResultSet resultSet;
data = resultSet.getBytes("xml"); // XML stored as ANSI in the database (MS SQL)
I am trying to convert XML to UTF-8 type.
Please help me figure this out.
You might use the constructor of String that takes an encoding: decode the bytes with their actual encoding, then re-encode as UTF-8 if needed; for example, new String(data, "Cp1252").getBytes(StandardCharsets.UTF_8), where "Cp1252" is a common "ANSI" code page (adjust to the actual one).
It's probably easiest to just call resultSet.getString("xml");
You probably should just feed the bytes directly to an XML parser. XML almost requires the encoding to be specified, and the parser will figure it out on its own.
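For illustration, a minimal sketch of that approach, assuming the XML carries a correct encoding declaration in its prolog:

import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

byte[] data = resultSet.getBytes("xml");
// Feeding raw bytes lets the parser honor the prolog's encoding declaration
// (or autodetect it), instead of us guessing a charset up front.
Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new ByteArrayInputStream(data));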

How to read Japanese fields from a CSV file into Java beans?

I've tried several popular CSV-to-Java deserializers - OpenCSV, JSefa, and Smooks - and none correctly reads the file:
First Name,Last Name
エリック,山中
花子,鈴木
一郎,鈴木
裕子,田中
政治,山村
into my java object collection.
OpenCsv code:
HeaderColumnNameTranslateMappingStrategy<Contact> strat =
        new HeaderColumnNameTranslateMappingStrategy<Contact>();
strat.setType(Contact.class);
strat.setColumnMapping(colNameTranslateMap);
CsvToBean<Contact> csv = new CsvToBean<Contact>();
InputStreamReader fileReader =
        new InputStreamReader(new FileInputStream(file), "UTF-8");
contacts = csv.parse(strat, new CSVReader(fileReader));
I've tried setting the Charset to UTF-8, UTF-16 and ISO-8859-1 when I create the FileInputStream, but the collection is never populated properly. As seen in the debugger and System.out the fields contain garbage and often the number of records is wrong.
FileInputStream is for reading streams of binary data, like an MP3 or a PNG. For streams of characters, use a Reader.
To be blunt: it doesn't matter which charsets you tried if they didn't work. You need to figure out what encoding the CSV file is actually using, and set that encoding when reading the file. To specify the encoding, note what the FileReader documentation says:
The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
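For illustration, a minimal sketch under the assumption that the file came from a Japanese Windows environment and is therefore likely Shift_JIS; substitute the file's actual encoding once you have determined it:

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// "Shift_JIS" is an assumption here, not a detected fact.
Reader fileReader = new InputStreamReader(
        new FileInputStream(file), Charset.forName("Shift_JIS"));
contacts = csv.parse(strat, new CSVReader(fileReader));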

Stop Jsoup from encoding

I'm trying to parse a URL with Jsoup which contains the following text: Ætterni.
After parsing the document, the same string looks like this: &AElig;tterni (the character has been converted to an HTML entity).
How do I prevent this from happening? I want the document 1:1, exactly like it was.
Code:
doc = Jsoup.connect(url).get();
String docEncoding=doc.outputSettings().charset().name();
OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(localLink),docEncoding);
writer.write(doc.html());
writer.close();
Use
doc.outputSettings().escapeMode(EscapeMode.xhtml);
to avoid the entity conversion.
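A minimal sketch of the whole flow with that setting applied (EscapeMode is org.jsoup.nodes.Entities.EscapeMode):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;

Document doc = Jsoup.connect(url).get();
// xhtml escape mode escapes only the bare minimum (&, <, >, quotes),
// so Æ is written out as the literal character rather than an entity.
doc.outputSettings().escapeMode(Entities.EscapeMode.xhtml);
String html = doc.html();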
You don't seem to be utilizing Jsoup's parsing powers in any way anyway. I'd just stream the HTML plain using java.net.URL. This way you get a 1:1 copy of the response.
InputStream input = new URL(url).openStream();
OutputStream output = new FileOutputStream(localLink);
// Copy input to output the usual Java IO way.
input.transferTo(output); // Java 9+; on older Java, loop over a byte[] buffer
input.close();
output.close();
You should not use a Reader/Writer for this, as that may malform characters of sources in an unknown encoding, because the platform default encoding would be used instead.
