Get content from a website in UTF-8 format - Java

I want to know how to get the content of a website in UTF-8 format.
I have written the following code:
try {
String webnames = "http://pathivu.com";
URL url = new URL(webnames);
URLConnection urlc = url.openConnection();
//BufferedInputStream buffer = new BufferedInputStream(urlc.getInputStream());
BufferedReader buffer = new BufferedReader(new InputStreamReader(urlc.getInputStream(), "UTF8"));
StringBuilder builder = new StringBuilder();
int byteRead;
while ((byteRead = buffer.read()) != -1)
builder.append((char) byteRead);
buffer.close();
String text=builder.toString();
System.out.println(text);
}
catch (IOException e)
{
e.printStackTrace();
}
But I can't get the text in the correct format.
Thanks in advance.

The problem might be that your console or System.out is not using UTF-8.
Try writing the text to a file instead.
Set the console stream via System.setOut(..).
You may have to use -Dfile.encoding=utf-8 or an OutputStreamWriter.
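A minimal sketch of the System.setOut(..) suggestion, assuming the terminal itself can display UTF-8. The class name Utf8Console and the helper utf8Stream are made up for illustration; only PrintStream and System.setOut come from the JDK.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class Utf8Console {
    // Hypothetical helper: wraps any stream in a PrintStream that always encodes text as UTF-8.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Replace System.out so println no longer depends on the platform default encoding.
        System.setOut(utf8Stream(System.out));
        // Sample non-ASCII text, written with escapes so the source file encoding does not matter.
        System.out.println("\u00e4\u00fc\u00df");
    }
}
```

If the output is still wrong after this, the console font or code page is the more likely cause, and writing to a file is the easier way to check the downloaded text.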

Your code looks OK. The problem is probably that the server is not sending the data in UTF-8.
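To illustrate what that looks like, here is a small sketch that decodes the same byte with two charsets. It assumes a server that sends ISO-8859-1; the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

final class WrongCharsetDemo {
    // Decodes the bytes as UTF-8, which is what the question's code does.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decodes the bytes with the charset the (assumed) server really used.
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] latin1Bytes = { (byte) 0xE9 }; // the letter e with an acute accent in ISO-8859-1
        // 0xE9 on its own is not valid UTF-8, so it becomes the replacement character U+FFFD.
        System.out.println(decodeAsUtf8(latin1Bytes).equals("\uFFFD"));
        // With the matching charset the letter survives.
        System.out.println(decodeAsLatin1(latin1Bytes).equals("\u00e9"));
    }
}
```

Both lines print true. Once the replacement character is in the string the original byte is lost, so the fix has to happen where the InputStreamReader is created.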

Related

Chinese characters from HTML text print properly for some websites, but not others

I was trying to print out the HTML text for https://top.baidu.com and https://www.qq.com, which both use GB2312 character encoding. It prints normally to the console except for the Chinese characters, which come out as unreadable text like ��㿴���ģ�ȫ�й��...
However, the Chinese characters come out just fine when I change the address to https://www.sina.com.cn or https://world.taobao.com, both of which use UTF-8.
Other than nicely asking Baidu and QQ to switch to UTF-8, is there anything I can do about this? Here is my code.
try {
String address1 = "https://top.baidu.com"; //unreadable
String address2 = "https://www.qq.com"; //also unreadable
String address3 = "https://www.sina.com.cn"; //readable
String address4 = "https://world.taobao.com"; //readable, too
URL url = new URL(address1);
StringBuilder htmlText = new StringBuilder();
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
InputStream stream = connection.getInputStream();
InputStreamReader reader = new InputStreamReader(stream);
int data = reader.read();
while (data != -1) {
char current = (char) data;
htmlText.append(current);
data = reader.read();
}
System.out.println(htmlText);
} catch (Exception e) {
e.printStackTrace();
}
After reading Andreas's comment, I looked up an alternative constructor for InputStreamReader and came up with the following.
InputStreamReader reader = new InputStreamReader(stream, Charset.forName("GB2312"));
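Rather than hard-coding GB2312 per site, one option is to read the charset from the response's Content-Type header. The sketch below is only an assumption about how that could be done; ContentTypeCharset and fromContentType are hypothetical names, and it does not look at charset declarations inside the HTML itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ContentTypeCharset {
    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=GB2312"; returns the fallback when none is usable.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // malformed or unsupported charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(fromContentType("text/html", StandardCharsets.UTF_8));
    }
}
```

With the code from the question this would be used as new InputStreamReader(stream, ContentTypeCharset.fromContentType(connection.getContentType(), StandardCharsets.UTF_8)). Some servers leave the charset out of the header and declare it only in a meta tag, in which case the fallback is used and may still be wrong.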

Java gzip pdf from url to file - result gives minor character mismatch

I'm trying to download a gzipped PDF from a URL, unpack it, and write it to a file. It almost works, but some characters in the PDF produced by my code don't match the real PDF. I checked this by opening both PDFs in Notepad.
I provide some short text samples from the two pdfs.
From my code:
’8 /qªMiUe°Ä[H`ðKíulýªäqvA®v8;xÒhÖßÚ²ý!Æ¢ØK$áýçpF[¸t1#y$93
From the real pdf:
ƒ8 /qªMiUe°Ä[H`ðKíulªäqvA®—v8;ŸÒhÖßÚ²!ˆ¢ØK$áçpF[¸t1#y$‘‹3
Here is my code:
public void readPDFfromURL(String urlStr) throws IOException {
URL myURL = new URL(urlStr);
HttpURLConnection urlCon = (HttpURLConnection) myURL.openConnection();
urlCon.setRequestProperty("Accept-Encoding", "gzip");
urlCon.setRequestProperty("Content-Type", "application/pdf");
urlCon.setRequestMethod("GET");
urlCon.setDoInput(true);
urlCon.connect();
Reader reader;
if ("gzip".equals(urlCon.getContentEncoding())) {
reader = new InputStreamReader(new GZIPInputStream(urlCon.getInputStream()));
}
else {
reader = new InputStreamReader(urlCon.getInputStream());
}
FileOutputStream fos = new FileOutputStream("document.pdf");
int data = reader.read();
while(data != -1) {
char c = (char) data;
fos.write(c);
data = reader.read();
}
fos.close();
reader.close();
}
I can open the pdf, and it has the correct amount of pages, but the pages are all blank.
My initial thought is that it might have something to do with character encoding, like some setting in my Java project, IntelliJ, etc.
Alternatively, I don't actually need to put it in a file. I just need to download it so I can upload it to another place. However, the pdf should of course be working in either case. I'm really just putting it in an actual file to check if it works.
Thank you for your help!
Here is my new implementation, which solves my question:
public void readPDFfromURL(String urlStr) throws IOException {
URL myURL = new URL(urlStr);
HttpURLConnection urlCon = (HttpURLConnection) myURL.openConnection();
urlCon.setRequestProperty("Accept-Encoding", "gzip");
urlCon.setRequestProperty("Content-Type", "application/pdf");
urlCon.setRequestMethod("GET");
urlCon.setDoInput(true);
urlCon.connect();
GZIPInputStream reader = new GZIPInputStream(urlCon.getInputStream());
FileOutputStream fos = new FileOutputStream("document.pdf");
byte[] buffer = new byte[1024];
int len;
while((len = reader.read(buffer)) != -1){
fos.write(buffer, 0, len);
}
fos.close();
reader.close();
}
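The first version most likely failed because InputStreamReader decodes the PDF bytes into characters, and a PDF is binary data, not text. The new version works because it stays with bytes all the way. Note that it always wraps the stream in GZIPInputStream, which throws if the server answers without compression. The sketch below keeps the Content-Encoding check from the first attempt; the class and helper names are made up, and byte arrays stand in for the network and the file so it can run on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class BinaryDownload {
    // Copies every byte unchanged; no Reader is involved, so nothing is decoded as text.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int len;
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len);
        }
    }

    // Unwraps the body only when the server reported Content-Encoding: gzip.
    static InputStream decodeBody(InputStream body, String contentEncoding) throws IOException {
        return "gzip".equalsIgnoreCase(contentEncoding) ? new GZIPInputStream(body) : body;
    }

    // In-memory stand-in for the download: turns a response body into the file bytes.
    static byte[] readBody(byte[] wireBytes, String contentEncoding) {
        try (InputStream in = decodeBody(new ByteArrayInputStream(wireBytes), contentEncoding)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            copy(in, out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: compresses data the way a gzip-enabled server would.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = new byte[256];
        for (int i = 0; i < original.length; i++) {
            original[i] = (byte) i; // every possible byte value, like real binary data
        }
        byte[] restored = readBody(gzip(original), "gzip");
        System.out.println(Arrays.equals(original, restored));
    }
}
```

In the real method, decodeBody(urlCon.getInputStream(), urlCon.getContentEncoding()) and a FileOutputStream would take the place of the byte arrays.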

How to guarantee a java POST request string / text to be UTF-8 encoding

I have a text message/string with letters like ä, ü, ß. I want everything to be UTF-8 encoded. When I write to a file or print the string to the console, everything is fine. But when I send the same string to a web service, I get � instead of ä, ü, ß.
I read the file from a Servlet.
Do I really have to use the following 2 lines to get a UTF-8 encoded text?
byte [] bray = text.getBytes("UTF-8");
text = new String(bray);
Here is the method that reads the file:
public static String readAsStream_UTF8(String filePathName){
String text ="";
InputStream input = Thread.currentThread().getContextClassLoader().getResourceAsStream("resources/"+filePathName);
if(input == null){
System.out.println("Inputstream null.");
}else{
InputStreamReader isr = null;
try {
isr = new InputStreamReader((InputStream)input, "UTF-8");
BufferedReader reader = new BufferedReader(isr);
StringBuilder sb = new StringBuilder();
String sCurrentLine;
while ((sCurrentLine = reader.readLine()) != null) {
sb.append(sCurrentLine);
}
text= sb.toString();
//it works only if I use the following 2 lines
byte [] bray = text.getBytes("UTF-8");
text = new String(bray);
} catch (Exception e1) {
e1.printStackTrace();
}
}
return text;
}
My sendPOST method looks something like the following:
String charset = "UTF-8";
OutputStreamWriter writer = null;
HttpURLConnection con = null;
String response_txt ="";
InputStream iss = null;
try {
URL url = new URL(urlService);
con = (HttpURLConnection)url.openConnection();
con.setDoOutput(true); //triggers POST
con.setDoInput(true);
con.setRequestMethod("POST");
con.setRequestProperty("accept-charset", charset);
//con.setRequestProperty("Content-Type", "application/soap+xml");
con.setRequestProperty("Content-Type", "application/soap+xml;charset=UTF-8");
writer = new OutputStreamWriter(con.getOutputStream());
writer.write(msg); //send POST data string
writer.flush();
writer.close();
What do I have to do to force the msg that is sent to the web service to really be UTF-8 encoded?
If you know the encoding of the file which you want to send you don't need to convert it to an intermediary string. Simply copy its bytes to the output:
// inputstream to a UTF-8 encoded resource file
InputStream in = Thread.currentThread().getContextClassLoader().getResourceAsStream("resources/"+filePathName);
HttpURLConnection con = ...
// set contenttype and encoding
con.setRequestProperty("Content-Type", "application/soap+xml;charset=UTF-8");
// copy input to output
copy(in, con.getOutputStream());
using some copy function.
Additionally you could also set the Content-Length header to the size of the resource file.
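If the message has to stay a String, the likely culprit in the question's sendPOST code is new OutputStreamWriter(con.getOutputStream()), which uses the platform default charset. A minimal sketch of passing the charset explicitly follows; Utf8Post and its methods are illustrative names, and a ByteArrayOutputStream stands in for the connection's output stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class Utf8Post {
    // Writes the message as UTF-8 regardless of the platform default encoding.
    static void writeUtf8(String msg, OutputStream out) throws IOException {
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writer.write(msg);
        writer.flush();
    }

    // Same encoding as a byte array, useful when the Content-Length header is needed.
    static byte[] utf8Bytes(String msg) {
        return msg.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream(); // stands in for con.getOutputStream()
        writeUtf8("\u00e4\u00fc\u00df", body);
        System.out.println(body.size()); // 6: each of the three letters takes two bytes in UTF-8
    }
}
```

With that in place, the getBytes("UTF-8") / new String(bray) pair in the reading method should no longer be needed, since it only works by accident of the default charset.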

How to read the blob data from servlet request object

There are client and server components. The client sends the data to the server in a more secure way by converting it to a blob and sending it with the POST method.
Can anyone suggest how to convert that blob data to a String object on the server side (Java)? I have tried the code below.
Way 1):
==============================
String streamLength = request.getHeader("Content-Length");
int streamIntLength = Integer.parseInt(streamLength);
byte[] bytes = new byte[streamIntLength];
request.getInputStream().read(bytes, 0, bytes.length);
String content = DatatypeConverter.printBase64Binary(bytes);
System.out.println(content);
Output of the above code: some junk data is displayed.
dABlAG0AcABsAGEAdABlAD0AMgAzADUAUgBfAFAAcgBvAHYAaQBkAGUAcgBfA
Way 2) :
======
BufferedReader reader = new BufferedReader(new InputStreamReader(
request.getInputStream()));
StringBuilder sb = new StringBuilder();
for (String line; (line = reader.readLine()) != null;) {
String str = new String(line.getBytes());
System.out.println(str);
}
Please suggest a solution; neither of the approaches above worked.
The code below works for me.
StringBuilder stringBuilder = new StringBuilder();
BufferedReader bufferedReader = null;
try {
String streamLength = request.getHeader("Content-Length");
int streamIntLength = Integer.parseInt(streamLength);
InputStream inputStream = request.getInputStream();
if (inputStream != null) {
bufferedReader = new BufferedReader(new InputStreamReader(
inputStream));
char[] charBuffer = new char[streamIntLength];
int bytesRead = -1;
while ((bytesRead = bufferedReader.read(charBuffer)) > 0) {
stringBuilder.append(charBuffer, 0, bytesRead);
}
} else {
stringBuilder.append("");
}
} catch (IOException ex) {
throw ex;
}
String body = stringBuilder.toString();
//System.out.println(body);
byte[] bytes = body.getBytes();
System.out.println(StringUtils.newStringUtf16Le(bytes));
From the first approach, it looks like the data is encoded (possibly in Base64 format). After decoding it, what problem are you facing? If the data is a String that was then encoded to Base64, you should get the actual string back after decoding it (assuming the platform locales on the client and server side are the same).
If it is binary data, it is better to keep it in a byte stream. If you still want to convert it to a string, then the first approach looks okay.
If this binary data represents some kind of file, you can get the related information from the HTTP headers and write it to a temp location for further use.
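Decoding the Base64 text quoted in the question gives bytes where every second byte is zero, which matches the UTF-16LE decoding used in the working code. A shorter route, sketched below under that assumption, is to read the raw request bytes and decode them directly. BlobDecode, readAll and decodeUtf16Le are hypothetical names; in the servlet, request.getInputStream() would be passed to readAll.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class BlobDecode {
    // Reads until end of stream; a single read(bytes, 0, n) call may return fewer than n bytes.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int len;
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len);
        }
        return out.toByteArray();
    }

    // Assumes the client sent UTF-16LE text, as the question's sample suggests.
    static String decodeUtf16Le(byte[] body) {
        return new String(body, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // First 24 characters of the Base64 output quoted in the question.
        byte[] body = Base64.getDecoder().decode("dABlAG0AcABsAGEAdABlAD0A");
        System.out.println(decodeUtf16Le(body)); // prints "template="
    }
}
```

This avoids the detour through a Reader with the default charset followed by body.getBytes(), which can lose bytes when the default charset cannot represent them.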

How to send local .png to .php file on server using java application?

I have a local .png file that I want to send using POST data to a .php script that will save the data to a .png file on the server. How do I do this? Do I have to encode or something? All I have is a File and a way to POST data.
Here is how I am sending the .png:
public static byte[] imageToByte(File file) throws FileNotFoundException {
FileInputStream fis = new FileInputStream(file);
ByteArrayOutputStream bos = new ByteArrayOutputStream();
byte[] buf = new byte[1024];
try {
for (int readNum; (readNum = fis.read(buf)) != -1;) {
bos.write(buf, 0, readNum);
}
} catch (IOException ex) {
}
byte[] bytes = bos.toByteArray();
return bytes;
}
public static void sendPostData(String url, HashMap<String, String> data)
throws Exception {
URL siteUrl = new URL(url);
HttpURLConnection conn = (HttpURLConnection) siteUrl.openConnection();
conn.setRequestMethod("POST");
conn.setDoOutput(true);
conn.setDoInput(true);
DataOutputStream out = new DataOutputStream(conn.getOutputStream());
Set keys = data.keySet();
Iterator keyIter = keys.iterator();
String content = "";
for (int i = 0; keyIter.hasNext(); i++) {
Object key = keyIter.next();
if (i != 0) {
content += "&";
}
content += key + "=" + URLEncoder.encode(data.get(key), "UTF-8");
}
System.out.println(content);
out.writeBytes(content);
out.flush();
out.close();
BufferedReader in = new BufferedReader(new InputStreamReader(
conn.getInputStream()));
String line = "";
while ((line = in.readLine()) != null) {
System.out.println(line);
}
in.close();
}
The PHP script:
<?
// Config
$uploadBase = "../screenshots/";
$uploadFilename = $_GET['user'] . ".png";
$uploadPath = $uploadBase . $uploadFilename;
// Upload directory
if(!is_dir($uploadBase))
mkdir($uploadBase);
// Grab the data
$incomingData = $_POST['img'];
// Valid data?
if(!$incomingData || !isset($_POST['img']))
die("No input data");
// Write to disk
$fh = fopen($uploadPath, 'w') or die("Error opening file");
fwrite($fh, $incomingData) or die("Error writing to file");
fclose($fh) or die("Error closing file");
echo "Success";
?>
I must admit, I am surprised that you almost get the correct file. When a browser sends a file, the form tag declares an encoding: enctype="multipart/form-data". I don't know the details (it is defined in https://www.rfc-editor.org/rfc/rfc2388), but each form field, including the file, is sent as its own part of the request body instead of as a URL-encoded key=value pair. Anyhow, you can forget about the internals if you use an HTTP client library like the one from Apache HttpComponents.
My minimalistic code works:
$body = file_get_contents('php://input');
$fh = fopen('file.txt', 'w') or die("Error opening file");
fwrite($fh, $body) or die("Error writing to file");
fclose($fh);
Upload with:
curl --upload-file download.txt http://example.com/upload.php
Note that this uses the PUT method (curl --upload-file sends a PUT request), so set the request method to PUT.
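If the upload has to stay a form-encoded POST with an img field, as in the question, one workaround is to Base64-encode the bytes before URL-encoding them, so the binary data survives the trip through a String. This is only a sketch under that assumption: FormImageField and encodeField are invented names, and the PHP script would have to be changed to call base64_decode on $_POST['img'] before writing the file.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Base64;

final class FormImageField {
    // Builds one "name=value" pair whose value is the Base64 text of the bytes,
    // URL-encoded so that '+', '/' and '=' are not misread by the server.
    static String encodeField(String name, byte[] data) {
        String base64 = Base64.getEncoder().encodeToString(data);
        try {
            return name + "=" + URLEncoder.encode(base64, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The 8-byte PNG file signature stands in for the real image bytes.
        byte[] pngSignature = { (byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };
        System.out.println(encodeField("img", pngSignature)); // prints img=iVBORw0KGgo%3D
    }
}
```

The resulting string can be passed to the existing out.writeBytes(content) call because it contains only ASCII characters. Base64 makes the payload roughly a third larger, so for big files the raw PUT or a multipart request is the better fit.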
