UTF-8 Encoding in java, retrieving data from website

UTF-8 Encoding in java, retrieving data from website - java

I'm trying to get data from website which is encoded in UTF-8 and insert them into the database (MYSQL). Database is also encoded in UTF-8.
This is the method I use to download data from specific site.
public String download(String url) throws java.io.IOException {
java.io.InputStream s = null;
java.io.InputStreamReader r = null;
StringBuilder content = new StringBuilder();
try {
s = (java.io.InputStream)new URL(url).getContent();
r = new java.io.InputStreamReader(s, "UTF-8");
char[] buffer = new char[4*1024];
int n = 0;
while (n >= 0) {
n = r.read(buffer, 0, buffer.length);
if (n > 0) {
content.append(buffer, 0, n);
}
}
}
finally {
if (r != null) r.close();
if (s != null) s.close();
}
return content.toString();
}
If encoding is set to 'UTF-8' (r = new java.io.InputStreamReader(s, "UTF-8"); ) data inserted into database seems to look OK, but when I try to display it, I am getting something like this: C�te d'Ivoire, instead of Côte d'Ivoire.
All my websites are encoded in UTF-8.
Please help.
If encoding is set to 'windows-1252' (r = new java.io.InputStreamReader(s, "windows-1252"); ) everything works fine and I am getting Côte d'Ivoire on my website (), but in java this title looks like 'C?´te d'Ivoire' what breaks other things, such as for example links. What does it mean ?

I would consider using commons-io, they have a function doing what you want to do:link
That is replace your code with this:
public String download(String url) throws java.io.IOException {
java.io.InputStream s = null;
String content = null;
try {
s = (java.io.InputStream)new URL(url).getContent();
content = IOUtils.toString(s, "UTF-8")
}
finally {
if (s != null) s.close();
}
return content.toString();
}
if that nots doing start looking into if you can store it to file correctly to eliminate the possibility that your db isn't set up correctly.

Java
The problem seems to lie in the HttpServletResponse , if you have a servlet or jsp page. Make sure to set your HttpServletResponse encoding to UTF-8.
In a jsp page or in the doGet or doPost of a servlet, before any content is sent to the response, just do :
response.setCharacterEncoding("UTF-8");
PHP
In PHP, try to use the utf8-encode function after retrieving from the database.

Is your database encoding set to UTF-8 for both server, client, connection and have the tables been created with that encoding? Check 'show variables' and 'show create table <one-of-the-tables>'

If encoding is set to 'UTF-8' (r = new java.io.InputStreamReader(s, "UTF-8"); ) data inserted into database seems to look OK, but when I try to display it, I am getting something like this: C�te d'Ivoire, instead of Côte d'Ivoire.
Thus, the encoding during the display is wrong. How are you displaying it? As per the comments, it's a PHP page? If so, then you need to take two things into account:
Write them to HTTP response output using the same encoding, thus UTF-8.
Set content type to UTF-8 so that the webbrowser knows which encoding to use to display text.
As per the comments, you have apparently already done 2. Left behind 1, in PHP you need to install mb_string and set mbstring.http_output to UTF-8 as well. I have found this cheatsheet very useful.

Related

How to retrieve the encoding of an XML file to parse it correctly? (Best Practice)

My application downloads xml files that happen to be either encoded in UTF-8 or ISO-8859-1 (the software that generates those files is crappy so it does that). I'm from Germany so we're using Umlauts (ä,ü,ö) so it really makes a difference how those files are encoded.
I know that the XmlPullParser has a method .getInputEncoding() which correctly detects how my files are encoded. However I have to set the encoding in my FileInputStream already (which is before I get to call .getInputEncoding()). So far I'm just using a BufferedReader to read the XML file and search for the entry that specifies the encoding and then instantiate my PullParser afterwards.
private void setFileEncoding() {
try {
bufferedReader.reset();
String firstLine = bufferedReader.readLine();
int start = firstLine.indexOf("encoding=") + 10; // +10 to actually start after "encoding="
String encoding = firstLine.substring(start, firstLine.indexOf("\"", start));
// now set the encoding to the reader to be used for parsing afterwards
bufferedReader = new BufferedReader(new InputStreamReader(fileInputStream, encoding));
bufferedReader.mark(0);
} catch (IOException e) {
e.printStackTrace();
}
}
Is there a different way to do this? Can I take advantage of the .getInputEncoding method? Right now the method seems kinda useless to me because how does my encoding matter if I've already had to set it before being able to check for it.

If you trust the creator of the XML to have set the encoding correctly in the XML declaration, you can sniff it as you're doing. However, be aware that it can be wrong; it can disagree with the actual encoding.
If you want to detect the encoding directly, independently of the (potentially wrong) XML declaration encoding setting, use a library such as ICU CharsetDetector or the older jChardet.
ICU CharsetDetector:
CharsetDetector detector;
CharsetMatch match;
byte[] byteData = ...;
detector = new CharsetDetector();
detector.setText(byteData);
match = detector.detect();
jChardet:
// Initalize the nsDetector() ;
int lang = (argv.length == 2)? Integer.parseInt(argv[1])
: nsPSMDetector.ALL ;
nsDetector det = new nsDetector(lang) ;
// Set an observer...
// The Notify() will be called when a matching charset is found.
det.Init(new nsICharsetDetectionObserver() {
public void Notify(String charset) {
HtmlCharsetDetector.found = true ;
System.out.println("CHARSET = " + charset);
}
});
URL url = new URL(argv[0]);
BufferedInputStream imp = new BufferedInputStream(url.openStream());
byte[] buf = new byte[1024] ;
int len;
boolean done = false ;
boolean isAscii = true ;
while( (len=imp.read(buf,0,buf.length)) != -1) {
// Check if the stream is only ascii.
if (isAscii)
isAscii = det.isAscii(buf,len);
// DoIt if non-ascii and not done yet.
if (!isAscii && !done)
done = det.DoIt(buf,len, false);
}
det.DataEnd();
if (isAscii) {
System.out.println("CHARSET = ASCII");
found = true ;
}

You may be able to get the correct character-set from the content-type header, if your server sends it correctly.

How to get encoding type of a .txt or .sql file

Is there a possibility to get the encoding of a existing .txt file? for example: you know a customer needs a specific encoding and you want to automize the process of .sql-data delivery. then you read out the endcoding from a client config and compare it to the current encoding of the file to be delivered. if they differ you change the encoding. could not find a solution till now. any help would be appreciated.

There is no explicit declaration of text encoding in files, but you can guess the encoding by analyzing specific byte sequences that are characteristic of a certain encoding.
Chardet does exactly that and tries to guess. If it can't say for sure what the encoding is, it will give you a list with confidence values (e.g. "90% this is utf8"). The project includes both a Python module and a command line tool. For a Java version, see JChardet.
My 2cents: if you just need a quick way to detect, the command line chardet tool is the way to go.

juniversalchardet is one of the best available API for detecting the encoding type. Please checkout this link. You can go through the list of encoding types supported by it
Working Example from the site
import org.mozilla.universalchardet.UniversalDetector;
public class TestDetector {
public static void main(String[] args) throws java.io.IOException {
byte[] buf = new byte[4096];
String fileName = args[0];
java.io.FileInputStream fis = new java.io.FileInputStream(fileName);
// (1)
UniversalDetector detector = new UniversalDetector(null);
// (2)
int nread;
while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
detector.handleData(buf, 0, nread);
}
// (3)
detector.dataEnd();
// (4)
String encoding = detector.getDetectedCharset();
if (encoding != null) {
System.out.println("Detected encoding = " + encoding);
} else {
System.out.println("No encoding detected.");
}
// (5)
detector.reset();
}
}
Hope this helps!

Parsing PDF that has been downloaded from internet

I have searched questions about this topic on stackoverflow. They really helped me but I stuck again.
My problem is that I need do write a method that downloads pdf from a site like (www.example.com/abc.pdf) and then I want to read the output. I don't want to save this file, just read in system out. I don't need to put bytes to fileoutputstream. I tried to cast bytes to char to get characters ( it can be dumbest solution ). But I got unknown characters. Any idea or am I understood it in a wrong way?
Here is the code and its output:
String textlink="http://www.selab.isti.cnr.it/ws-mate/example.pdf";// it comes from main class
public String HtmlTest(String textLink) throws IOException{
StringBuilder sd=new StringBuilder();
URL link=new URL(textLink);
URLConnection urlConn = link.openConnection();
BufferedInputStream in = null;
try
{
in = new BufferedInputStream(urlConn.getInputStream());
byte data[] = new byte[1024];
in.read(data, 0, 1024);
for (int j = 0; j < data.length; j++) {
if(j%100==0){
sd.append((char)data[j]+"\n"); // i used this for making readable text
}
else{
sd.append((char)data[j]);
}
}
}
finally
{
if (in != null)
in.close();
}
return sd.toString();
}
Output
run:
%
PDF-1.3
%ￇ￬ﾏﾢ
7 0 obj
<</Length 8 0 R/Filter /FlateDecode>>
stream
xﾜﾭY[ﾓￛﾶ￮ﾳ&?BoNf,,q%￠ﾼ4￞x&ﾞ6ﾩﾛlￓ
ﾗﾼ￐ﾽￋZeﾑ￲f￻￫￻ﾁ

You're not going to get very far trying to read a .pdf file as though it were basically a text file. For starters, the "text" is in a compressed binary format; there are other issues you'll probably also have to deal with.
STRONG SUGGESTION:
Use a Java .pdf library like Apache PDFBox
IMHO>.

java convert to utf-8 from postgres

I'm generating html files whit this data (stored in postgres):
The html files are generated as UTF-8, but the string looks like they appear in the DB.
How I can do to make the text appear correctly? Like: Últiles de Escritorio
Note. I'm not able to change postgres configuration, I'm using Java 1.6, Postgres 8.4, JDBC
UPDATE:
I use this code to create the html files:
public static void stringToFile (String file_name, String file_content) throws IOException {
OutputStream out = new FileOutputStream(file_name);
OutputStreamWriter writer = new OutputStreamWriter(out, "UTF-8");
try {
try {
writer.write(file_content);
} finally {
writer.close();
}
} finally {
out.close();
}
}
And I use it like:
StringBuilder html_content = new StringBuilder();
ResultSet result_set = statement.executeQuery(sql_query);
while (result_set.next()) {
html_content.append(String.format('<li>%s</li>', result_set.getString(1)));
}
Utils.stringToFile('thehtmlfile.html', html_content.toString());
UPDATE: [SOLVED]
This works for me:
new String(str.getBytes("ISO-8859-1"), "UTF-8")

I hope you're sure that the error isn't happening before you add the string to the db.
Because you can't change your db settings, this isn't so easy to handle, first of all you have to know in which format the text will be saved in your db, see here.
Than you should be a familiar with UTF16 see here and here.
Now, after you are familiar with codecs you wanna use, you have to create the correct utf16 value for each character you get from the db. I will just point out how it could work, the correct implementation you have to do by your own.
public char createUTF16char( char first, char second ) {
char res = first;
res = res << 8;
res = res & (0x0F & second);
return res;
}
This code should just combine the last 8bit of each char ( first and second ) to a new char.
Maybe this is the operation you need but it depends on the coding what is used on the server.
Sincerely

Convert byte[] to Base64 string for data URI

I know this has probably been asked 10000 times, however, I can't seem to find a straight answer to the question.
I have a LOB stored in my db that represents an image; I am getting that image from the DB and I would like to show it on a web page via the HTML IMG tag. This isn't my preferred solution, but it's a stop-gap implementation until I can find a better solution.
I'm trying to convert the byte[] to Base64 using the Apache Commons Codec in the following way:
String base64String = Base64.encodeBase64String({my byte[]});
Then, I am trying to show my image on my page like this:
<img src="data:image/jpg;base64,{base64String from above}"/>
It's displaying the browser's default "I cannot find this image", image.
Does anyone have any ideas?
Thanks.

I used this and it worked fine (contrary to the accepted answer, which uses a format not recommended for this scenario):
StringBuilder sb = new StringBuilder();
sb.append("data:image/png;base64,");
sb.append(StringUtils.newStringUtf8(Base64.encodeBase64(imageByteArray, false)));
contourChart = sb.toString();

According to the official documentation Base64.encodeBase64URLSafeString(byte[] binaryData) should be what you're looking for.
Also mime type for JPG is image/jpeg.

That's the correct syntax. It might be that your web browser does not support the data URI scheme. See Which browsers support data URIs and since which version?
Also, the JPEG MIME type is image/jpeg.

You may also want to consider streaming the images out to the browser rather than encoding them on the page itself.
Here's an example of streaming an image contained in a file out to the browser via a servlet, which could easily be adopted to stream the contents of your BLOB, rather than a file:
public void doGet(HttpServletRequest req, HttpServletResponse resp)
throws ServletException, IOException
{
ServletOutputStream sos = resp.getOutputStream();
try {
final String someImageName = req.getParameter(someKey);
// encode the image path and write the resulting path to the response
File imgFile = new File(someImageName);
writeResponse(resp, sos, imgFile);
}
catch (URISyntaxException e) {
throw new ServletException(e);
}
finally {
sos.close();
}
}
private void writeResponse(HttpServletResponse resp, OutputStream out, File file)
throws URISyntaxException, FileNotFoundException, IOException
{
// Get the MIME type of the file
String mimeType = getServletContext().getMimeType(file.getAbsolutePath());
if (mimeType == null) {
log.warn("Could not get MIME type of file: " + file.getAbsolutePath());
resp.setStatus(HttpServletResponse.SC_INTERNAL_SERVER_ERROR);
return;
}
resp.setContentType(mimeType);
resp.setContentLength((int)file.length());
writeToFile(out, file);
}
private void writeToFile(OutputStream out, File file)
throws FileNotFoundException, IOException
{
final int BUF_SIZE = 8192;
// write the contents of the file to the output stream
FileInputStream in = new FileInputStream(file);
try {
byte[] buf = new byte[BUF_SIZE];
for (int count = 0; (count = in.read(buf)) >= 0;) {
out.write(buf, 0, count);
}
}
finally {
in.close();
}
}

If you don't want to stream from a servlet, then save the file to a directory in the webroot and then create the src pointing to that location. That way the web server does the work of serving the file. If you are feeling particularly clever, you can check for an existing file by timestamp/inode/crc32 and only write it out if it has changed in the DB which can give you a performance boost. This file method also will automatically support ETag and if-modified-since headers so that the browser can cache the file properly.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

UTF-8 Encoding in java, retrieving data from website - java

Is your database encoding set to UTF-8 for both server, client, connection and have the tables been created with that encoding? Check 'show variables' and 'show create table <one-of-the-tables>'

Related

How to retrieve the encoding of an XML file to parse it correctly? (Best Practice)

How to get encoding type of a .txt or .sql file

Parsing PDF that has been downloaded from internet

java convert to utf-8 from postgres

Convert byte[] to Base64 string for data URI

Categories

Resources