MySQL - Hebrew characters become question marks in DB - java

I have a DB in which all columns are set to "hebrew_general_ci".
When I insert Hebrew values into my DB manually, or through Postman, I can see that the values in the DB are indeed in Hebrew.
But when I insert the values from my app (an Android app written in Java), the values become question marks: ????
I tried to encode my text as UTF-8 in the app itself, but it didn't work.
Here is the code that is supposed to do this:
private String POST(String url, String jsonParamsAsString) {
    String result = "";
    String fixedUrl = url.replace(" ","%20");
    try {
        DefaultHttpClient httpClient = new DefaultHttpClient();
        HttpPost postRequest = new HttpPost(fixedUrl);
        byte ptext[] = jsonParamsAsString.getBytes();
        jsonParamsAsString = new String(ptext, "UTF-8");
        StringEntity input = new StringEntity(jsonParamsAsString);
        input.setContentType("application/json; charset=utf-8" );
        //input.setContentType("application/json");
        postRequest.setEntity(input);
        HttpResponse response = httpClient.execute(postRequest);
        result = convertInputStreamToString(response.getEntity().getContent());
        /*byte ptext[] = result.getBytes();
        result = new String(ptext, "UTF-8");*/
    } catch (Exception e) {
        Log.d("InputStream", e.getLocalizedMessage());
    }
    return result;
}

You need to set the encoding of your database to utf8 / utf8_general_ci, or to utf8mb4 / utf8mb4_general_ci if you are running a recent version of MySQL (utf8mb4 was added in 5.5.3) and need to handle emoji. Here is the documentation. You don't need to set your table to a character set for a particular language. I've used the above settings and they handle Arabic, Russian, Chinese, English, etc. out of the box; they are language agnostic. Good luck.
Edit: You also need to make sure your connection string has the following two parameters: useUnicode=yes and characterEncoding=UTF-8
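A possible reason the app produces question marks while Postman does not: as far as I know, the single-argument StringEntity constructor in the legacy Apache HttpClient encodes the body as ISO-8859-1 by default, and ISO-8859-1 has no Hebrew letters. Check this against the library version you use. The sketch below uses only the JDK to show the effect of such a lossy encoding; the class name, host, and database name are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class HebrewEncodingDemo {
    // The Hebrew word "shalom", written with Unicode escapes so the source stays ASCII.
    static final String HEBREW = "\u05E9\u05DC\u05D5\u05DD";

    // Encode with ISO-8859-1 and decode again: each Hebrew letter becomes '?'.
    static String roundTripLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Encode with UTF-8 and decode again: the text survives unchanged.
    static String roundTripUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Hypothetical JDBC URL carrying the two parameters named in the answer.
    static String jdbcUrl(String hostAndPort, String database) {
        return "jdbc:mysql://" + hostAndPort + "/" + database
                + "?useUnicode=yes&characterEncoding=UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(roundTripLatin1(HEBREW));              // ????
        System.out.println(roundTripUtf8(HEBREW).equals(HEBREW)); // true
        System.out.println(jdbcUrl("localhost:3306", "mydb"));
    }
}
```

In the app itself, passing the charset explicitly, for example new StringEntity(jsonParamsAsString, "UTF-8"), should avoid the lossy default. The getBytes() / new String(ptext, "UTF-8") pair in the question does not change how the request body is encoded.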

Related

Do not Encode Euro symbol when using java.net.URLEncoder

Is it possible to retain the Euro symbol after encoding? For example:
HttpClient httpClient = getHttpClient();
// set POST method details
PostMethod post = new PostMethod(url_p);
post.setRequestHeader(
"Content-Type", PostMethod.FORM_URL_ENCODED_CONTENT_TYPE);
String beforeEncoding = "Price is €100";
String afterEncoding = java.net.URLEncoder.encode(beforeEncoding,UTF-8);
post.setRequestBody(afterEncoding);
httpClient.executeMethod(post);
It displays Price+is+%80100
Is it possible to display Price+is+€100 instead?
If the communication uses UTF-8 (which is sensible for the € symbol), you should do:
String afterEncoding = java.net.URLEncoder.encode(beforeEncoding, "UTF-8");
or, on Java 10 and later:
String afterEncoding = java.net.URLEncoder.encode(beforeEncoding, StandardCharsets.UTF_8);
The overloaded encode without an encoding argument is deprecated anyway.
Note that System.out uses the platform's encoding: System.getProperty("file.encoding") or Charset.defaultCharset().
After the comment:
Or do not encode at all, and set the encoding of the body.
PostMethod post = new PostMethod(url);
post.getParams().setContentCharset("UTF-8");
No, not with URLEncoder.encode. If your example is that trivial, you may stick to a simple String.replace:
String beforeEncoding = "Price is €100";
String afterEncoding = beforeEncoding.replace(' ', '+');
System.out.println(afterEncoding);
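The %80 in the question is what the Euro sign becomes when it is encoded as windows-1252 (byte 0x80), so the platform default charset was probably used rather than UTF-8. A minimal sketch that compares both charsets; the class and method names are illustrative, and it assumes the JDK ships the windows-1252 charset, which standard desktop JDKs do.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class EuroEncodingDemo {
    // "Price is " + EURO SIGN (U+20AC) + "100"
    static final String TEXT = "Price is \u20AC100";

    // Form-encodes the text using the named charset.
    static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode(TEXT, "UTF-8"));        // Price+is+%E2%82%AC100
        System.out.println(encode(TEXT, "windows-1252")); // Price+is+%80100
    }
}
```

Either way the Euro sign is percent-encoded; URLEncoder never leaves a non-ASCII character as-is.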

Java - how to encode URL path for non Latin characters

Currently I have final URL url = new URL(urlString); but I ran into a server that does not support non-ASCII characters in the path.
Using Java (Android) I need to encode URL from
http://acmeserver.com/download/agc/fcms/儿子去哪儿/儿子去哪儿.png
to
http://acmeserver.com/download/agc/fcms/%E5%84%BF%E5%AD%90%E5%8E%BB%E5%93%AA%E5%84%BF/%E5%84%BF%E5%AD%90%E5%8E%BB%E5%93%AA%E5%84%BF.png
just like browsers do.
I checked URLEncoder.encode(s, "UTF-8"); but it also encodes the / slashes:
http%3A%2F%2Facmeserver.com%2Fdownload%2Fagc%2Ffcms%2F%E5%84%BF%E5%AD%90%E5%8E%BB%E5%93%AA%E5%84%BF%2F%E5%84%BF%E5%AD%90%E5%8E%BB%E5%93%AA%E5%84%BF.png
Is there a simple way to do it without parsing the string that the method gets?
from http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars
B.2.1 Non-ASCII characters in URI attribute values Although URIs do
not contain non-ASCII values (see [URI], section 2.1) authors
sometimes specify them in attribute values expecting URIs (i.e.,
defined with %URI; in the DTD). For instance, the following href value
is illegal:
...
We recommend that user agents adopt the following convention for
handling non-ASCII characters in such cases:
Represent each character in UTF-8 (see [RFC2279]) as one or more
bytes.
Escape these bytes with the URI escaping mechanism (i.e., by
converting each byte to %HH, where HH is the hexadecimal notation of
the byte value).
You should encode just the special characters and then join the parts together. If you tried to encode the entire URI, you'd run into problems.
Stick with:
String query = URLEncoder.encode("apples oranges", "utf-8");
String url = "http://stackoverflow.com/search?q=" + query;
Check out this great guide on URL encoding.
That being said, a little bit of searching suggests that there may be other ways to do what you want:
Give this a try:
String urlStr = "http://abc.dev.domain.com/0007AC/ads/800x480 15sec h.264.mp4";
URL url = new URL(urlStr);
URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
url = uri.toURL();
(You will need to have those spaces encoded so you can use it for a request.)
This takes advantage of a couple features available to you in Android
classes. First, the URL class can break a url into its proper
components so there is no need for you to do any string search/replace
work. Secondly, this approach takes advantage of the URI class
feature of properly escaping components when you construct a URI via
components rather than from a single string.
The beauty of this approach is that you can take any valid url string
and have it work without needing any special knowledge of it yourself.
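The component-based approach described above can be sketched as follows. One addition that is mine rather than the answer's: the sketch calls toASCIIString() on the resulting URI, because uri.toURL() on its own keeps non-ASCII characters as they are, and toASCIIString() is what percent-encodes them as UTF-8. The class and method names are illustrative.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

final class UrlPathEncoder {
    // Returns the URL with spaces and non-ASCII characters percent-encoded.
    static String toAsciiUrl(String urlString) {
        try {
            // URL splits the string into components without encoding anything.
            URL url = new URL(urlString);
            // The multi-argument URI constructor quotes illegal characters per component.
            URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(),
                    url.getPort(), url.getPath(), url.getQuery(), url.getRef());
            // toASCIIString() additionally encodes non-ASCII characters as UTF-8 bytes.
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Not a usable URL: " + urlString, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toAsciiUrl(
                "http://abc.dev.domain.com/0007AC/ads/800x480 15sec h.264.mp4"));
        // http://abc.dev.domain.com/0007AC/ads/800x480%2015sec%20h.264.mp4
    }
}
```

One caveat: the multi-argument URI constructors always quote the percent sign, so an input that is already percent-encoded would be encoded a second time. Use this only on raw, unencoded URL strings.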
final URL url = new URL( new URI(urlString).toASCIIString() );
worked for me.
I did it as below, which is cumbersome:
//was: final URL url = new URL(urlString);
String asciiString;
try {
    asciiString = new URL(urlString).toURI().toASCIIString();
} catch (URISyntaxException e1) {
    Log.e(TAG, "Error new URL(urlString).toURI().toASCIIString() " + urlString + " : " + e1);
    return null;
}
Log.v(TAG, urlString+" -> "+ asciiString );
final URL url = new URL(asciiString);
url is later used in
connection = (HttpURLConnection) url.openConnection();

JSON Jackson + HTTPClient with german umlauts

I'm having a problem with a JSON string containing German umlauts, which I acquire with the Apache HTTP client.
The mapping of JSON strings only works if the string does not contain any German umlaut; otherwise I get a "JsonMappingException: Can not deserialize instance of [...] out of START_ARRAY".
The Apache HTTP client is set with "Accept-Charset" to HTTP.UTF_8, but as a result I always get e.g. "\u00fc" instead of "ü". When I manually replace e.g. "\u00fc" with "ü", the mapping works perfectly.
How can I get a UTF-8 encoded JSON response from the Apache HTTP client?
Or is the server output the problem?
params.setParameter(HttpProtocolParams.USE_EXPECT_CONTINUE, false);
HttpProtocolParams.setVersion(params, HttpVersion.HTTP_1_1);
HttpProtocolParams.setContentCharset(params, HTTP.UTF_8);
httpclient = new DefaultHttpClient(params);
HttpGet httpGetContentLoad = new HttpGet(url);
httpGetContentLoad.setHeader("Accept-Charset", "utf-8");
httpGetContentLoad.setParams(params);
response = httpclient.execute(httpGetContentLoad);
entity = response.getEntity();
String loadedContent = null;
if (entity != null)
{
    loadedContent = EntityUtils.toString(entity, HTTP.UTF_8);
    entity.consumeContent();
}
if (HttpStatus.SC_OK != response.getStatusLine().getStatusCode())
{
    throw new Exception("Loading content failed");
}
closeConnection();
return loadedContent;
And the json code is mapped here:
String jsonMetaData = loadGetRequestContent(getLatestEditionUrl(newspaperEdition));
Newspaper loadedNewspaper = mapper.readValue(jsonMetaData, Newspaper.class);
loadedNewspaper.setEdition(newspaperEdition);
Update 1:
jsonMetaData is of type String and contains the fetched JSON code.
Update 2:
I use this code to transform the JSON output to my needs:
public static String convertJsonLatestEditionMeta(String jsonCode)
{
    jsonCode = jsonCode.replaceFirst("\\[\"[A-Za-z0-9-[:blank:]]+\",\\{", "{\"edition\":\"an-a1\",");
    jsonCode = jsonCode.replaceFirst("\"pages\":\\{", "\"pages\":\\[");
    jsonCode = Helper.replaceLast(jsonCode, "}}}]", "}]}");
    jsonCode = jsonCode.replaceAll("\"[\\d]*\"\\:\\{\"", "\\{\"");
    return jsonCode;
}
Update 3:
JSON conversion example:
JSON code before conversion:
["Newspaper title",
{
"date":"20130103",
"pages":
{
"1": {"ressort":"ressorttitle1","pdfpfad":"pathToPdf1","number":1,"size":281506},
"2":{"ressort":"ressorttitle2","pdfpfad":"pathToPdf2","number":2,"size":281533},
[...]
}
}
]
JSON code after conversion:
{
"edition":"Newspaper title",
"date":"20130103",
"pages":
[
{"ressort":"Resorttitle1","pdfpfad":"pathToPdf1","number":1,"size":281506},
{"ressort":"Resorttitle2","pdfpfad":"pathToPdf2","number":2,"size":281533},
[...]
]
}
Solution:
I started using GSON as @Boris suggested and the problem with umlauts is gone! Furthermore, GSON really seems to be faster than Jackson.
A workaround would be to replace the characters manually following this table:
Sign    Unicode representation
Ä, ä    \u00c4, \u00e4
Ö, ö    \u00d6, \u00f6
Ü, ü    \u00dc, \u00fc
ß       \u00df
€       \u20ac
Try parsing like this:
entity = response.getEntity();
Newspaper loadedNewspaper=mapper.readValue(entity.getContent(), Newspaper.class);
No reason to go through String, Jackson parses InputStreams directly. Also Jackson will automatically detect the encoding if you use my proposed approach.
EDIT: By the way, consider using the GSON JSON parsing library. It is even faster than Jackson and easier to use. However, Jackson recently started parsing XML too, which is a virtue.
EDIT 2: After all the details you have added, I would suppose the problem is with the server implementation of the services. The umlauts do not need to be Unicode-escaped in the JSON, since UTF-8 is its native encoding. Instead of manually replacing e.g. "\u00fc" with "ü", why don't you do it via regex?
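The regex idea could look roughly like the sketch below, which turns each backslash-u escape with four hex digits into the character it stands for. The class and method names are made up. Treat it as a workaround only: it does not distinguish an escaped backslash followed by "u00fc" from a real escape, and a JSON parser normally decodes these escapes by itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeUnescape {
    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces every matched escape sequence with the character it encodes.
    static String unescape(String input) {
        Matcher matcher = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            char decoded = (char) Integer.parseInt(matcher.group(1), 16);
            // quoteReplacement guards against '$' and '\' in the replacement text.
            matcher.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The input holds a backslash, 'u' and four hex digits, not the umlaut itself.
        System.out.println(unescape("{\"ressort\":\"M\\u00fcnchen\"}"));
    }
}
```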

java convert to utf-8 from postgres

I'm generating HTML files with this data (stored in Postgres):
The HTML files are generated as UTF-8, but the strings look the way they appear in the DB.
How can I make the text appear correctly? Like: Últiles de Escritorio
Note: I'm not able to change the Postgres configuration. I'm using Java 1.6, Postgres 8.4, JDBC.
UPDATE:
I use this code to create the html files:
public static void stringToFile (String file_name, String file_content) throws IOException {
    OutputStream out = new FileOutputStream(file_name);
    OutputStreamWriter writer = new OutputStreamWriter(out, "UTF-8");
    try {
        try {
            writer.write(file_content);
        } finally {
            writer.close();
        }
    } finally {
        out.close();
    }
}
And I use it like:
StringBuilder html_content = new StringBuilder();
ResultSet result_set = statement.executeQuery(sql_query);
while (result_set.next()) {
    html_content.append(String.format("<li>%s</li>", result_set.getString(1)));
}
Utils.stringToFile("thehtmlfile.html", html_content.toString());
UPDATE: [SOLVED]
This works for me:
new String(str.getBytes("ISO-8859-1"), "UTF-8")
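Why the one-liner works: if UTF-8 bytes were decoded as ISO-8859-1 somewhere along the way, every byte became one char, so re-encoding as ISO-8859-1 recovers the original bytes exactly and they can then be decoded as UTF-8. A small round-trip sketch under that assumption; the class and method names are illustrative, and on Java 1.6 you would pass the charset names as strings, as in the line above, because StandardCharsets only exists from Java 7.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {
    // Simulates the damage: UTF-8 bytes wrongly decoded as ISO-8859-1.
    static String misdecode(String correct) {
        return new String(correct.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Undoes the damage: recover the raw bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00DA is the capital U with acute accent.
        String original = "\u00DAtiles de Escritorio";
        String garbled = misdecode(original);
        System.out.println(garbled.length() - original.length()); // 1
        System.out.println(repair(garbled).equals(original));     // true
    }
}
```

The repair only makes sense for strings that were damaged in exactly this way; applying it to text that is already correct will corrupt any non-ASCII characters.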
I hope you're sure that the error isn't happening before you add the string to the DB.
Because you can't change your DB settings, this isn't so easy to handle. First of all, you have to know in which encoding the text is saved in your DB, see here.
Then you should become familiar with UTF-16, see here and here.
Now, once you are familiar with the encodings you want to use, you have to create the correct UTF-16 value for each character you get from the DB. I will just point out how it could work; the correct implementation is up to you.
public char createUTF16char(char first, char second) {
    // Use the low 8 bits of first as the high byte and the low 8 bits of second as the low byte.
    return (char) (((first & 0xFF) << 8) | (second & 0xFF));
}
This code just combines the low 8 bits of each char (first and second) into a new char.
Maybe this is the operation you need, but it depends on the encoding that is used on the server.
Sincerely

UTF-8 Encoding in java, retrieving data from website

I'm trying to get data from a website which is encoded in UTF-8 and insert it into the database (MySQL). The database is also encoded in UTF-8.
This is the method I use to download data from a specific site.
public String download(String url) throws java.io.IOException {
    java.io.InputStream s = null;
    java.io.InputStreamReader r = null;
    StringBuilder content = new StringBuilder();
    try {
        s = (java.io.InputStream)new URL(url).getContent();
        r = new java.io.InputStreamReader(s, "UTF-8");
        char[] buffer = new char[4*1024];
        int n = 0;
        while (n >= 0) {
            n = r.read(buffer, 0, buffer.length);
            if (n > 0) {
                content.append(buffer, 0, n);
            }
        }
    }
    finally {
        if (r != null) r.close();
        if (s != null) s.close();
    }
    return content.toString();
}
If encoding is set to 'UTF-8' (r = new java.io.InputStreamReader(s, "UTF-8"); ) data inserted into database seems to look OK, but when I try to display it, I am getting something like this: C�te d'Ivoire, instead of Côte d'Ivoire.
All my websites are encoded in UTF-8.
Please help.
If encoding is set to 'windows-1252' (r = new java.io.InputStreamReader(s, "windows-1252"); ) everything works fine and I am getting Côte d'Ivoire on my website, but in Java this title looks like 'C?´te d'Ivoire', which breaks other things, such as links. What does this mean?
I would consider using commons-io, which has a function that does what you want: IOUtils.toString.
That is, replace your code with this:
public String download(String url) throws java.io.IOException {
    java.io.InputStream s = null;
    String content = null;
    try {
        s = (java.io.InputStream) new URL(url).getContent();
        content = IOUtils.toString(s, "UTF-8");
    }
    finally {
        if (s != null) s.close();
    }
    return content;
}
If that doesn't do it, check whether you can store the result to a file correctly, to eliminate the possibility that your DB isn't set up correctly.
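If you would rather not add a dependency, roughly the same thing can be done with the JDK alone. This is a sketch, not what commons-io does internally; the class and method names are made up, and it needs Java 7 or later for try-with-resources.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class StreamText {
    // Reads the whole stream as text in the given charset and closes it afterwards.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder content = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4 * 1024];
            int n;
            // read() returns -1 once the end of the stream is reached.
            while ((n = reader.read(buffer)) != -1) {
                content.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return content.toString();
    }
}
```

With the code from the question this would be called as StreamText.readAll((InputStream) new URL(url).getContent(), StandardCharsets.UTF_8).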
Java
The problem seems to lie in the HttpServletResponse, if you have a servlet or JSP page. Make sure to set your HttpServletResponse encoding to UTF-8.
In a JSP page, or in the doGet or doPost of a servlet, before any content is sent to the response, just do:
response.setCharacterEncoding("UTF-8");
PHP
In PHP, try to use the utf8_encode function after retrieving from the database.
Is your database encoding set to UTF-8 for server, client, and connection alike, and have the tables been created with that encoding? Check 'show variables' and 'show create table <one-of-the-tables>'
If encoding is set to 'UTF-8' (r = new java.io.InputStreamReader(s, "UTF-8"); ) data inserted into database seems to look OK, but when I try to display it, I am getting something like this: C�te d'Ivoire, instead of Côte d'Ivoire.
Thus, the encoding during display is wrong. How are you displaying it? As per the comments, it's a PHP page? If so, then you need to take two things into account:
1. Write them to the HTTP response output using the same encoding, thus UTF-8.
2. Set the content type to UTF-8 so that the web browser knows which encoding to use to display the text.
As per the comments, you have apparently already done 2. That leaves 1: in PHP you need to install mbstring and set mbstring.http_output to UTF-8 as well. I have found this cheatsheet very useful.
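Both symptoms from the question can be reproduced in plain Java, which may help to see which step uses the wrong encoding. This is only a simulation under my assumptions about where the mismatch happens; the class and method names are made up, and it assumes the JDK provides the windows-1252 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DisplayMismatchDemo {
    // U+00F4 is the o with circumflex.
    static final String NAME = "C\u00F4te d'Ivoire";

    // Output written in a single-byte encoding but shown by the browser as UTF-8:
    // the lone byte 0xF4 is not valid UTF-8 and becomes the replacement character.
    static String latin1ShownAsUtf8(String text) {
        return new String(text.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    // UTF-8 bytes read as windows-1252: the two bytes of the accented letter
    // turn into two separate characters.
    static String utf8ReadAsCp1252(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        System.out.println(latin1ShownAsUtf8(NAME).indexOf('\uFFFD')); // 1
        System.out.println(utf8ReadAsCp1252(NAME).length());           // 14
    }
}
```

The first case matches the page showing a replacement character in place of the accented letter; the second matches the two-character garbage seen in Java when the stream is read as windows-1252.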
