How to Decide InputStream Encoding? - java

My objective is to download an XML feed into an InputStream, then convert it to a String so that it may be used with XmlPullParser.
I convert the InputStream into a String like this:
InputStream input_stream = connection.getInputStream();
StringBuilder sb = new StringBuilder();
BufferedReader br = new BufferedReader(new InputStreamReader(input_stream, "UTF-8"));
String line;
while ((line = br.readLine()) != null) {
    sb.append(line);
}
Here's the problem: some XML feeds declare a specific encoding. Take this one for example:
http://voxinox.ch/podcasts/valdo/feed.xml
If I default to "UTF-8", some characters from the feed show up as a black rhombus with a question mark in it (the U+FFFD replacement character). If I use the encoding specified in the XML header (iso-8859-1) it works, which is no surprise.
The thing is, how do I decide which encoding to use before I start reading the input stream, when the stream itself contains the encoding declaration? Is there a better way of doing this?
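To make the symptom concrete, here is a minimal sketch that decodes the same bytes twice, once with the charset they were written in and once with UTF-8. The class name WrongCharsetDemo and the sample word are made up for illustration; only the standard java.nio.charset API is used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongCharsetDemo {
    /** Decodes raw bytes with an explicit charset instead of the platform default. */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "cafe" with an e-acute, as an ISO-8859-1 feed would send it:
        // the accented letter is the single byte 0xE9
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        String right = decode(raw, StandardCharsets.ISO_8859_1);
        String wrong = decode(raw, StandardCharsets.UTF_8);

        // The matching charset round-trips the text
        System.out.println(right.equals("caf\u00e9"));
        // A lone 0xE9 is malformed UTF-8, so the decoder substitutes U+FFFD
        System.out.println(wrong.contains("\uFFFD"));
    }
}
```

The bytes themselves are never damaged by the download; the damage happens at the moment they are turned into characters with the wrong charset.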

Here is an example of how I get the encoding from an XML input stream:
FileInputStream finput = new FileInputStream(myFile);
String encoding = getInputEncoding(finput);
Log.d("Encoding: ", "> " + encoding);
// Returns the encoding named in the XML declaration, or "" if none is found.
// Closing the reader also closes finput, so reopen the file before parsing it.
public String getInputEncoding(FileInputStream finput) {
    String encoding = "";
    if (finput != null) {
        try {
            // For ASCII-compatible encodings the declaration decodes safely as ISO-8859-1
            BufferedReader myReader = new BufferedReader(new InputStreamReader(finput, "ISO-8859-1"));
            String getline = myReader.readLine();
            myReader.close();
            Log.d("Line: ", "> " + getline);
            String[] separated = getline.split("encoding=\"");
            if (separated.length > 1) {
                encoding = separated[1].split("\"")[0];
            }
        } catch (Exception e) {
            Log.e("Encoding: ", "could not read the XML declaration", e);
        }
    }
    return encoding;
}
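A variation on the same idea that does not need to reopen the stream: peek at the first bytes with mark/reset, pull the encoding out of the XML declaration, then rewind and decode. This is only a sketch under stated assumptions. The names XmlEncodingSniffer, detect, readAll and PEEK_LIMIT are made up here, the fallback to UTF-8 is my choice, and it does not handle byte order marks or UTF-16 feeds. If the String is only needed to feed XmlPullParser, it may be simpler to call setInput(inputStream, null), since the XmlPullParser documentation says a null encoding asks the parser to try to detect it.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XmlEncodingSniffer {
    // How many bytes to inspect; the XML declaration sits at the very start
    private static final int PEEK_LIMIT = 1024;
    private static final Pattern ENCODING = Pattern.compile(
            "<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Peeks at the start of the stream, rewinds it, and returns the declared charset. */
    static Charset detect(BufferedInputStream in) throws IOException {
        in.mark(PEEK_LIMIT);
        byte[] head = new byte[PEEK_LIMIT];
        int n = in.readNBytes(head, 0, PEEK_LIMIT); // requires Java 9+
        in.reset();
        // ISO-8859-1 maps every byte to a char, so this decode cannot fail
        String prolog = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(prolog);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return StandardCharsets.UTF_8; // assumption: no declaration means UTF-8
    }

    /** Reads the whole stream into a String using the detected charset. */
    static String readAll(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        Charset charset = detect(in);
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            for (int k; (k = reader.read(buf)) != -1; ) {
                sb.append(buf, 0, k);
            }
        }
        return sb.toString();
    }
}
```

With the code from the question this would be used as String xml = XmlEncodingSniffer.readAll(connection.getInputStream());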

Related

Chinese characters from HTML text print properly for some websites, but not others

I was trying to print out the HTML text for https://top.baidu.com and https://www.qq.com, which both use GB2312 character encoding. It prints normally to the console except for the Chinese characters, which come out as unreadable text like ��㿴���ģ�ȫ�й��...
However, the Chinese characters come out just fine when I change the address to https://www.sina.com.cn or https://world.taobao.com, both of which use UTF-8.
Other than nicely asking Baidu and QQ to switch to UTF-8, is there anything I can do about this? Here is my code.
try {
String address1 = "https://top.baidu.com"; //unreadable
String address2 = "https://www.qq.com"; //also unreadable
String address3 = "https://www.sina.com.cn"; //readable
String address4 = "https://world.taobao.com"; //readable, too
URL url = new URL(address1);
StringBuilder htmlText = new StringBuilder();
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
InputStream stream = connection.getInputStream();
InputStreamReader reader = new InputStreamReader(stream);
int data = reader.read();
while (data != -1) {
char current = (char) data;
htmlText.append(current);
data = reader.read();
}
System.out.println(htmlText);
} catch (Exception e) {
e.printStackTrace();
}
After reading Andreas's comment, I looked up an alternative constructor for InputStreamReader and came up with the following.
InputStreamReader reader = new InputStreamReader(stream, Charset.forName("GB2312"));
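Hard-coding GB2312 fixes the first two sites but would break the UTF-8 ones. One way around that, sketched below, is to take the charset from the Content-Type response header when the server sends one and fall back to a default otherwise. The class and method names (ContentTypeCharset, fromContentType) are hypothetical, and this assumes the server actually declares the charset in the header; pages that only declare it in an HTML meta tag would need a different check.

```java
import java.nio.charset.Charset;
import java.util.Locale;

class ContentTypeCharset {
    /** Returns the charset named in a Content-Type value, or the fallback. */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // A Content-Type looks like: text/html; charset=GB2312
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).replace("\"", "").trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both malformed and unsupported charset names
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

In the code above the reader would then be created as new InputStreamReader(stream, ContentTypeCharset.fromContentType(connection.getContentType(), StandardCharsets.UTF_8)).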

How to guarantee a java POST request string / text to be UTF-8 encoding

I have a text message (a String) containing letters like ä, ü, ß. I want everything to be UTF-8 encoded. When I write the string to a file or print it to the console, everything is fine. But when I send the same string to a web service, I get � instead of ä, ü, ß.
I read the file from a Servlet.
Do I really have to use the following 2 lines to get a UTF-8 encoded text?
byte [] bray = text.getBytes("UTF-8");
text = new String(bray);
public static String readAsStream_UTF8(String filePathName){
String text ="";
InputStream input = Thread.currentThread().getContextClassLoader().getResourceAsStream("resources/"+filePathName);
if(input == null){
System.out.println("Inputstream null.");
}else{
InputStreamReader isr = null;
try {
isr = new InputStreamReader((InputStream)input, "UTF-8");
BufferedReader reader = new BufferedReader(isr);
StringBuilder sb = new StringBuilder();
String sCurrentLine;
while ((sCurrentLine = reader.readLine()) != null) {
sb.append(sCurrentLine);
}
text= sb.toString();
//it works only if I use the following 2 lines
byte [] bray = text.getBytes("UTF-8");
text = new String(bray);
} catch (Exception e1) {
e1.printStackTrace();
}
}
return text;
}
My sendPOST method looks something like the following:
String charset = "UTF-8";
OutputStreamWriter writer = null;
HttpURLConnection con = null;
String response_txt ="";
InputStream iss = null;
try {
URL url = new URL(urlService);
con = (HttpURLConnection)url.openConnection();
con.setDoOutput(true); //triggers POST
con.setDoInput(true);
con.setRequestMethod("POST");
con.setRequestProperty("accept-charset", charset);
//con.setRequestProperty("Content-Type", "application/soap+xml");
con.setRequestProperty("Content-Type", "application/soap+xml;charset=UTF-8");
writer = new OutputStreamWriter(con.getOutputStream());
writer.write(msg); //send POST data string
writer.flush();
writer.close();
What do I have to do to force the msg that is sent to the web service to really be UTF-8 encoded?
If you know the encoding of the file which you want to send you don't need to convert it to an intermediary string. Simply copy its bytes to the output:
// inputstream to a UTF-8 encoded resource file
InputStream in = Thread.currentThread().getContextClassLoader().getResourceAsStream("resources/"+filePathName);
HttpURLConnection con = ...
// set contenttype and encoding
con.setRequestProperty("Content-Type", "application/soap+xml;charset=UTF-8");
// copy input to output
copy(in, con.getOutputStream());
using some copy function.
Additionally you could also set the Content-Length header to the size of the resource file.
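A minimal sketch of such a copy function follows, together with a helper for the case where the message really does start out as a String. The names PostBodyEncoding, copy and writeUtf8 are made up for illustration. The point of writeUtf8 is that new OutputStreamWriter(out) with no charset argument uses the platform default encoding, which is a likely reason the Content-Type header and the bytes on the wire disagree in the question's sendPOST code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class PostBodyEncoding {
    /** Copies all bytes unchanged and returns how many were copied. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        for (int n; (n = in.read(buf)) != -1; ) {
            out.write(buf, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }

    /** Writes a String body as UTF-8 regardless of the platform default. */
    static void writeUtf8(String msg, OutputStream out) throws IOException {
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writer.write(msg);
        writer.flush();
    }
}
```

On Java 9 and later, in.transferTo(out) does the same job as copy. With writeUtf8(msg, con.getOutputStream()) the getBytes/new String round trip from the question should no longer be needed.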

How to read the blob data from servlet request object

There are client and server components. The client sends data to the server with a POST request, converting the data to a blob first to make the transfer more secure.
Can anyone suggest how to convert that blob data to a String object on the server side (Java)? I have tried the code below.
Way 1):
==============================
String streamLength = request.getHeader("Content-Length");
int streamIntLength = Integer.parseInt(streamLength);
byte[] bytes = new byte[streamIntLength];
request.getInputStream().read(bytes, 0, bytes.length);
String content = DatatypeConverter.printBase64Binary(bytes);
System.out.println(content);
The output of the code above looks like junk data:
dABlAG0AcABsAGEAdABlAD0AMgAzADUAUgBfAFAAcgBvAHYAaQBkAGUAcgBfA
Way 2) :
======
BufferedReader reader = new BufferedReader(new InputStreamReader(
request.getInputStream()));
StringBuilder sb = new StringBuilder();
for (String line; (line = reader.readLine()) != null;) {
String str = new String(line.getBytes());
System.out.println(str);
}
Please suggest a solution; neither of the approaches above worked.
The code below works for me.
StringBuilder stringBuilder = new StringBuilder();
BufferedReader bufferedReader = null;
try {
String streamLength = request.getHeader("Content-Length");
int streamIntLength = Integer.parseInt(streamLength);
InputStream inputStream = request.getInputStream();
if (inputStream != null) {
bufferedReader = new BufferedReader(new InputStreamReader(
inputStream));
char[] charBuffer = new char[streamIntLength];
int bytesRead = -1;
while ((bytesRead = bufferedReader.read(charBuffer)) > 0) {
stringBuilder.append(charBuffer, 0, bytesRead);
}
} else {
stringBuilder.append("");
}
} catch (IOException ex) {
throw ex;
}
String body = stringBuilder.toString();
//System.out.println(body);
byte[] bytes = body.getBytes();
System.out.println(StringUtils.newStringUtf16Le(bytes));
From the first approach, it looks like the data is encoded (possibly in Base64 format). After decoding it, what problem are you facing? If the data is a String that was then encoded to Base64, you should get the actual string back after decoding it (assuming the platform locales on the client and server side are the same).
If it is binary data, it is better to keep it in a byte stream. If you still want to convert it to a string, the first approach looks okay.
If this binary data represents some kind of file, you can get the related information from the HTTP headers and write it to a temporary location for further use.
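The Base64 sample in the question starts with dABlAG0A, which decodes to the bytes 74 00 65 00 6D 00, that is, text with a zero byte after every ASCII letter. That pattern suggests the body is UTF-16LE text, which matches the working code's use of newStringUtf16Le. A sketch that reads the whole body as bytes and decodes it once with an explicit charset is below. RequestBodyDecoder, readFully and decode are hypothetical names, and UTF-16LE is an inference from the sample, not something the client is known to guarantee.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

class RequestBodyDecoder {
    /** Reads until end of stream; a single read() call may return only part of the body. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        for (int n; (n = in.read(buf)) != -1; ) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Decodes the raw body once, with the charset the client used. */
    static String decode(InputStream body, Charset charset) throws IOException {
        return new String(readFully(body), charset);
    }
}
```

In the servlet this would be called as RequestBodyDecoder.decode(request.getInputStream(), StandardCharsets.UTF_16LE). Decoding the bytes directly avoids the detour through the platform default charset that the working code takes with InputStreamReader and getBytes().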

Character looks like "?" at Reading the Content of an Uploaded File

I have a client that uploads a VCF file. On the server side I receive the file, read its contents, and save them to a text file. But there is a character problem when I read it: Turkish characters come out as "?". My read code is here:
FileItemStream item = null;
ServletFileUpload upload = new ServletFileUpload();
FileItemIterator iterator = upload.getItemIterator(request);
String encoding = null;
while (iterator.hasNext()) {
item = iterator.next();
if ("fileUpload".equals(item.getFieldName())) {
InputStreamReader isr = new InputStreamReader(item.openStream(), "UTF-8");
String str = "";
String temp="";
BufferedReader br = new BufferedReader(isr);
while((temp=br.readLine()) != null){
str +=temp;
}
br.close();
File f = new File("C:/sedat.txt");
BufferedWriter buf = new BufferedWriter(new FileWriter(f));
buf.write(str);
buf.close();
}
}
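FileWriter writes with the platform default encoding. Create the writer with an explicit UTF-8 encoding instead:
BufferedWriter buf = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(f), "UTF-8"));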
If this is production code, I would recommend writing the output straight to the file and not accumulating it in a string first. You could also avoid any potential encoding issues by reading the source as an InputStream and writing it as an OutputStream, skipping the conversion to characters altogether.
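A minimal sketch of that byte-for-byte approach, using java.nio.file.Files.copy. The class name UploadSaver and the method save are made up for illustration. Since no characters are ever decoded, the Turkish letters reach the file exactly as the client sent them.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class UploadSaver {
    /** Streams the upload to the target file unchanged; returns the byte count. */
    static long save(InputStream upload, Path target) throws IOException {
        // REPLACE_EXISTING overwrites a previous upload with the same name
        return Files.copy(upload, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

In the loop from the question this would be called as UploadSaver.save(item.openStream(), Paths.get("C:/sedat.txt")).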

How to convert the DataInputStream to the String in Java?

I have a question about Java. I have used URLConnection to retrieve a DataInputStream, and I want to convert the DataInputStream into a String variable. What should I do? Can anyone help me? Thank you.
The following is my code:
URL data = new URL("http://google.com");
URLConnection dataConnection = data.openConnection();
DataInputStream dis = new DataInputStream(dataConnection.getInputStream());
String data_string;
// convert the DataInputStream to the String
import java.net.*;
import java.io.*;
class ConnectionTest {
public static void main(String[] args) {
try {
URL google = new URL("http://www.google.com/");
URLConnection googleConnection = google.openConnection();
DataInputStream dis = new DataInputStream(googleConnection.getInputStream());
StringBuffer inputLine = new StringBuffer();
String tmp;
// DataInputStream.readLine() is deprecated: it does not convert bytes to characters properly
while ((tmp = dis.readLine()) != null) {
inputLine.append(tmp);
System.out.println(tmp);
}
//use inputLine.toString(); here it would have whole source
dis.close();
} catch (MalformedURLException me) {
System.out.println("MalformedURLException: " + me);
} catch (IOException ioe) {
System.out.println("IOException: " + ioe);
}
}
}
This is what you want.
You can use commons-io IOUtils.toString(dataConnection.getInputStream(), encoding) in order to achieve your goal.
DataInputStream is not used for what you want - i.e. you want to read the content of a website as String.
If you want to read data from a generic URL (such as www.google.com), you probably don't want to use a DataInputStream at all. Instead, create a BufferedReader and read line by line with the readLine() method. Use the URLConnection.getContentType() method to find out the content's charset (you will need this in order to create your reader properly).
Example:
URL data = new URL("http://google.com");
URLConnection dataConnection = data.openConnection();
// Find out charset, default to ISO-8859-1 if unknown
String charset = "ISO-8859-1";
String contentType = dataConnection.getContentType();
if (contentType != null) {
    int pos = contentType.indexOf("charset=");
    if (pos != -1) {
        charset = contentType.substring(pos + "charset=".length());
        // Drop any parameters that follow, plus optional quotes
        int end = charset.indexOf(';');
        if (end != -1) {
            charset = charset.substring(0, end);
        }
        charset = charset.replace("\"", "").trim();
    }
}
// Create reader and read string data
BufferedReader r = new BufferedReader(
        new InputStreamReader(dataConnection.getInputStream(), charset));
StringBuilder sb = new StringBuilder();
String line;
while ((line = r.readLine()) != null) {
    sb.append(line).append('\n');
}
r.close();
String content = sb.toString();
