Turkish characters are displayed as '?' and are inserted into the database incorrectly.
I can edit the text in phpMyAdmin and make it Turkish again. After that, everything works perfectly.
My encoding filter:
private String encoding;
public void init(FilterConfig config) throws ServletException {
encoding = config.getInitParameter("requestEncoding");
if (encoding == null)
encoding = "UTF-8";
}
public void doFilter(ServletRequest request, ServletResponse response,
FilterChain next) throws IOException, ServletException {
// Respect the client-specified character encoding
// (see HTTP specification section 3.4.1)
if (null == request.getCharacterEncoding())
request.setCharacterEncoding(encoding);
/**
* Set the default response content type and encoding
*/
response.setContentType("text/html; charset=UTF-8");
response.setCharacterEncoding("UTF-8");
next.doFilter(request, response);
}
public void destroy() {
}
I don't have the Turkish character problem on localhost. I see the problem only on www.gurkanilleez.com.
Question marks for each non-English character? What probably happened:
you had utf8-encoded data (good)
SET NAMES latin1 was in effect (default, but wrong)
the column was declared CHARACTER SET latin1 (default, but wrong)
Use utf8 instead of latin1.
For JDBC, include ?useUnicode=yes&characterEncoding=UTF-8 in the getConnection() call.
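A minimal sketch of what that looks like in code, assuming a MySQL server and the Connector/J driver on the classpath. The class name, host, and database name are hypothetical placeholders; only the two query parameters come from the advice above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

final class Utf8Jdbc {
    // Builds a JDBC URL that asks the driver to exchange text as UTF-8.
    // host and database are placeholders supplied by the caller.
    static String utf8Url(String host, String database) {
        return "jdbc:mysql://" + host + "/" + database
                + "?useUnicode=yes&characterEncoding=UTF-8";
    }

    // Opens a connection with the UTF-8 parameters applied.
    // Needs a reachable server and a JDBC driver at runtime.
    static Connection open(String host, String database, String user, String password)
            throws SQLException {
        return DriverManager.getConnection(utf8Url(host, database), user, password);
    }
}
```

The URL parameters only cover the connection; the column character set mentioned above still has to be fixed on the database side.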
There are several threads about Turkish character set problems on Stack Overflow; read them.
I cannot comment because of my reputation. I visited your website and the Turkish characters appear as they should. So is there no longer a problem?
In a Java servlet I'm doing:
protected void handleRequests(HttpServletRequest request, HttpServletResponse response) throws IOException {
PrintWriter pw = response.getWriter();
/*...*/
Vector<String> buf = new Vector<>();
for(...) {
ret.add(">žd¿ [?²„·ÜðÈ ‘");
}
/*JSONArray*/ responseArray.put(responseArray.length(), buf);
/*...*/
pw.println(responseArray);
pw.close();
}
In a web page client javascript I'm doing a XMLHttpRequest and the reply is incorrect, looks like: >?d¿ [\u001a?²\u201e·ÜðÈ \u2018
(for the above >žd¿ [?²„·ÜðÈ ‘ input)
Then I tried on the servlet:
ret.add(URLEncoder.encode(">žd¿ [?²„·ÜðÈ ‘", "UTF-8"));
and I get:
%3E%C5%BEd%C2%BF%C2%A0%5B%7F%1A%3F%C2%B2%E2%80%9E%C2%B7%C3%9C%C3%B0%C3%88%C2%A0%E2%80%98
in javascript, then I apply:
unescape(reply.replace(/\+/g,' ')) (the replace is because + signs are not converted to spaces)
which nets me:
>žd¿ [?²â??·Ã?ðÃ? â
What do I do wrong?
(Some other questions tell me the servlet should send the data as UTF-8. But when do I encode as UTF-8: before placing the string inside a JSON object (I use org.json), or after, by calling .toString on the JSON response array and converting that to UTF-8 before PrintWriter.println?)
P.S. This is not all my code; I've inherited it, and I'm lacking some of the theoretical background.
Edit:
doing a decodeURIComponent(reply).replace(/\+/g,' ') in javascript seems to do the trick. But I could not find the difference between URLEncoder.encode and decodeURIComponent. Is the +/space the only mismatch?
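On the +/space mismatch: Java's URLEncoder produces application/x-www-form-urlencoded output, in which a space becomes +, while decodeURIComponent only understands percent-escapes. A minimal sketch, assuming the payload has to stay URL-encoded, is to turn + into %20 on the server so that the client needs no extra replace step. The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class JsSafeEncoder {
    // URL-encodes value as UTF-8 and rewrites the form-style '+' (space) as "%20".
    // A literal '+' in the input is already emitted as "%2B" by URLEncoder,
    // so every '+' left in the encoded output stands for a space.
    static String encodeForDecodeURIComponent(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

With this, the JavaScript side can call decodeURIComponent(reply) directly.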
decodeURIComponent nets the expected result
decodeURIComponent("%3E%C5%BEd%C2%BF%C2%A0%5B%7F%1A%3F%C2%B2%E2%80%9E%C2%B7%C3%9C%C3%B0%C3%88%C2%A0%E2%80%98");
">žd¿ [?²„·ÜðÈ ‘"
This is my code. I am trying to send a message over SMPP, but '?' appears in the output:
public class Encoding
{
public static void main(String[] args) throws SocketTimeoutException, AlreadyBoundException, VersionException, SMPPProtocolException, UnsupportedOperationException, IOException
{
SubmitSM sm=new SubmitSM();
String strMessage="Pour se désinscrire du service TT ZONE, envoyez GRATUITEMENT « DTTZ » ";
String utf8 = new String(strMessage.getBytes("UTF-8"));
UCS2Encoding uc = UCS2Encoding.getInstance(true);
sm.setDataCoding(2);
sm.setMessageText(utf8);
System.out.println(sm.getMessageText());
}
}
Your problem is here:
String strMessage="Pour se désinscrire du service TT ZONE, envoyez GRATUITEMENT « DTTZ » ";
String utf8 = new String(strMessage.getBytes("UTF-8"));
Why do you do that at all? Since the UCS2Encoding class accepts a String as an argument, it will take care of the encoding itself.
Just do:
sm.setMessageText(strMessage);
As I mentioned in the other question you asked, you are mixing a LOT of concepts. Remember that a String is a sequence of chars; it is independent of the encoding you use. The fact that internally Java uses UTF-16 is totally irrelevant here. It could use UTF-32 or EBCDIC, or even use carrier pigeons, the process itself would not change:
                 encode           decode
String (char[]) --------> byte[] --------> String (char[])
And by using the String constructor taking a byte array as an argument, you create a sequence of chars from these bytes using the default JVM encoding, which may or may not be UTF-8.
In particular, if you are using Windows, the default encoding is often windows-1252 (it depends on the locale and the Java version). Let us replace encode and decode above with the charset names. What you do is:
                 UTF-8           windows-1252
String (char[]) -------> byte[] --------------> String (char[])
"Houston, we have a problem!"
For more details, see the javadocs for Charset, CharsetEncoder and CharsetDecoder.
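The two diagrams above can be reproduced with a few lines of Java. This is only an illustrative sketch with a hypothetical class name; it encodes a String as UTF-8 and then decodes the bytes with a charset chosen by the caller, so you can compare a matching decode with a mismatched one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RoundTripDemo {
    // Encodes text as UTF-8, then decodes the resulting bytes with decodeWith.
    // When decodeWith is UTF-8 the text survives; any other charset may garble it.
    static String decodeUtf8BytesAs(String text, Charset decodeWith) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8); // encode step
        return new String(bytes, decodeWith);                 // decode step
    }
}
```

Calling decodeUtf8BytesAs("é", StandardCharsets.UTF_8) returns the original character, while passing Charset.forName("windows-1252") returns the two characters "Ã©", because the two UTF-8 bytes are each read as a separate windows-1252 character.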
In my application I'm getting the user info from LDAP and sometimes the full username comes in a wrong charset. For example:
ТеÑÑ61 ТеÑÑовиÑ61
The name can also be in English or in Russian and be displayed correctly. If the username changes, it is updated in the database. Even if I change the value in the db, it won't solve the problem.
I can fix it before saving by doing this
new String(incorrect.getBytes("ISO-8859-1"), "UTF-8");
However, if I apply it to a string that already contains correct Russian characters (e.g., "Тест61 Тестович61"), I get something like "????61 ????????61".
Can you please suggest something that can determine the charset of a string?
Strings in Java, AFAIK, do not retain their original encoding; they are always stored internally in some Unicode form.
You want to detect the charset of the original stream/bytes, which is why I think your String.getBytes() call is too late.
Ideally, if you can get the input stream you are reading from, you can run it through something like this: http://code.google.com/p/juniversalchardet/
There are plenty of other charset detectors out there as well.
I had the same problem. Tika is too large and juniversalchardet does not detect ISO-8859-1. So I wrote my own, and it is now working well in production:
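public String convert(String value, String fromEncoding, String toEncoding)
        throws UnsupportedEncodingException {
    // Re-interpret the bytes of value (encoded as fromEncoding) as toEncoding.
    return new String(value.getBytes(fromEncoding), toEncoding);
}
public String charset(String value, String[] charsets) throws UnsupportedEncodingException {
    String probe = StandardCharsets.UTF_8.name();
    for (String c : charsets) {
        // Charset.forName throws for unknown names rather than returning null.
        Charset charset = Charset.forName(c);
        // The candidate matches if a round trip through UTF-8 reproduces the value.
        if (value.equals(convert(convert(value, charset.name(), probe), probe, charset.name()))) {
            return c;
        }
    }
    return StandardCharsets.UTF_8.name();
}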
Full description here: Detect the charset in Java strings.
I recommend Apache Tika's CharsetDetector; it is easy to use and robust.
CharsetDetector detector = new CharsetDetector();
detector.setText(yourStr.getBytes());
detector.detect(); // <- return the result, you can check by .getName() method
You can also let the detector decode the bytes into a String directly; the second argument is a hint for the declared encoding, here utf-8:
detector.getString(yourStr.getBytes(), "utf-8");
I highly appreciate Lluís Turró Cutiller's answer (+1), but want to add a variant based on that.
private String convert(String value, Charset fromEncoding, Charset toEncoding) throws UnsupportedEncodingException {
return new String(value.getBytes(fromEncoding), toEncoding);
}
private boolean probe(String value, Charset charset) throws UnsupportedEncodingException {
Charset probe = StandardCharsets.UTF_8;
return value.equals(convert(convert(value, charset, probe), probe, charset));
}
public String convert(String value, Charset charsetWanted, List<Charset> charsetsOther) throws UnsupportedEncodingException {
if (probe(value, charsetWanted)) {
return value;
}
for (Charset other: charsetsOther) {
if (probe(value, other)) {
return convert(value, other, charsetWanted);
}
}
System.err.println("WARNING: Could not convert string: " + value);
return value;
}
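The round-trip check above can be made stricter by decoding with a CharsetDecoder that reports malformed input instead of silently replacing it. The following is only a sketch with a hypothetical class name, and it assumes the damage is the one described in the question: UTF-8 bytes that were decoded as ISO-8859-1. Strings that are already correct are returned unchanged.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {
    // Returns the repaired text if value looks like UTF-8 that was decoded as
    // ISO-8859-1; otherwise returns value unchanged.
    static String repair(String value) {
        // Real Cyrillic (or any char above U+00FF) cannot come from an
        // ISO-8859-1 decode, so such strings are left alone.
        if (!StandardCharsets.ISO_8859_1.newEncoder().canEncode(value)) {
            return value;
        }
        byte[] bytes = value.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // REPORT makes the decoder throw on byte sequences that are not valid UTF-8.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return value; // genuine Latin-1 text such as "Côte"
        }
    }
}
```

This is still a heuristic: it cannot help if bytes were already lost before the String was created, and it does not cover other mismatches such as windows-1252.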
Your LDAP database is set up incorrectly. The application putting data into it should convert to a known character set encoding, in your case, likely UTF_16. Pick a standard. All methods of detecting encoding are guesses.
The application writing the value is the only one that knows definitively which encoding it is using and can properly convert to another encoding such as UTF_16.
In your web-application, you may declare an encoding-filter that makes sure you receive data in the right encoding.
<filter>
<description>Explicitly set the encoding of the page to UTF-8</description>
<filter-name>encodingFilter</filter-name>
<filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
<init-param>
<param-name>encoding</param-name>
<param-value>UTF-8</param-value>
</init-param>
<init-param>
<param-name>forceEncoding</param-name>
<param-value>true</param-value>
</init-param>
</filter>
The Spring-provided filter makes sure that the controllers/servlets receive parameters in UTF-8.
I'm trying to get data from a website which is encoded in UTF-8 and insert it into a MySQL database. The database is also encoded in UTF-8.
This is the method I use to download data from a specific site.
public String download(String url) throws java.io.IOException {
java.io.InputStream s = null;
java.io.InputStreamReader r = null;
StringBuilder content = new StringBuilder();
try {
s = (java.io.InputStream)new URL(url).getContent();
r = new java.io.InputStreamReader(s, "UTF-8");
char[] buffer = new char[4*1024];
int n = 0;
while (n >= 0) {
n = r.read(buffer, 0, buffer.length);
if (n > 0) {
content.append(buffer, 0, n);
}
}
}
finally {
if (r != null) r.close();
if (s != null) s.close();
}
return content.toString();
}
If encoding is set to 'UTF-8' (r = new java.io.InputStreamReader(s, "UTF-8"); ) data inserted into database seems to look OK, but when I try to display it, I am getting something like this: C�te d'Ivoire, instead of Côte d'Ivoire.
All my websites are encoded in UTF-8.
Please help.
If the encoding is set to 'windows-1252' (r = new java.io.InputStreamReader(s, "windows-1252"); ), everything works fine and I get Côte d'Ivoire on my website, but in Java this title looks like 'C?´te d'Ivoire', which breaks other things, such as links. What does this mean?
I would consider using commons-io; its IOUtils.toString function does what you want.
That is, replace your code with this:
public String download(String url) throws java.io.IOException {
java.io.InputStream s = null;
String content = null;
try {
s = (java.io.InputStream)new URL(url).getContent();
content = IOUtils.toString(s, "UTF-8");
}
finally {
if (s != null) s.close();
}
return content.toString();
}
If that doesn't work, check whether you can store the content to a file correctly, to eliminate the possibility that your db isn't set up correctly.
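A minimal sketch of such a file check, using only the JDK. The class and method names are hypothetical; the idea is to write the downloaded String as UTF-8, read it back as UTF-8, and compare, so that the database is taken out of the picture.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8FileCheck {
    // Writes content to a temporary file as UTF-8, reads it back as UTF-8,
    // and returns what was read. The temporary file is deleted afterwards.
    static String roundTrip(String content) {
        try {
            Path tmp = Files.createTempFile("utf8-check", ".txt");
            try {
                Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If roundTrip(download(url)) equals the downloaded String and the file looks right in a UTF-8 aware editor, the download step is probably fine and the database or display layer is the next thing to inspect.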
Java
The problem seems to lie in the HttpServletResponse, if you have a servlet or JSP page. Make sure to set your HttpServletResponse encoding to UTF-8.
In a JSP page or in the doGet or doPost of a servlet, before any content is sent to the response, just do:
response.setCharacterEncoding("UTF-8");
PHP
In PHP, try to use the utf8_encode function after retrieving from the database.
Is your database encoding set to UTF-8 for the server, client, and connection, and have the tables been created with that encoding? Check 'show variables' and 'show create table <one-of-the-tables>'.
If encoding is set to 'UTF-8' (r = new java.io.InputStreamReader(s, "UTF-8"); ) data inserted into database seems to look OK, but when I try to display it, I am getting something like this: C�te d'Ivoire, instead of Côte d'Ivoire.
Thus, the encoding during the display is wrong. How are you displaying it? As per the comments, it's a PHP page? If so, then you need to take two things into account:
1. Write them to the HTTP response output using the same encoding, thus UTF-8.
2. Set the content type to UTF-8 so that the web browser knows which encoding to use to display the text.
As per the comments, you have apparently already done 2. That leaves 1: in PHP you need to install mbstring and set mbstring.http_output to UTF-8 as well. I have found this cheatsheet very useful.
I have a struts2 web application which accepts both POST and GET requests in many different charsets, does conversion of them into utf-8, displays the correct utf-8 characters on the screen and then writes them into utf-8 database.
I have tried at least 5 different methods for doing a simple lossless charset conversion of windows-1250 to utf-8 to start with, and none of them worked. Utf-8 being the "larger set", it should work without a problem (at least this is my understanding).
Can you propose how to do a charset conversion from windows-1250 to utf-8, and is it possible that struts2 is doing something weird with the params charset, which would explain why I can't seem to get it right?
This is my latest attempt:
String inputData = getSimpleParamValue("some_input_param_from_get");
Charset inputCharset = Charset.forName("windows-1250");
Charset utfCharset = Charset.forName("UTF-8");
CharsetDecoder decoder = inputCharset.newDecoder();
CharsetEncoder encoder = utfCharset.newEncoder();
String decodedData = "";
try {
ByteBuffer inputBytes = ByteBuffer.wrap(inputData.getBytes()); // I've tried putting UTF-8 here as well, with no luck
CharBuffer chars = decoder.decode(inputBytes);
ByteBuffer utfBytes = encoder.encode(chars);
decodedData = new String(utfBytes.array());
} catch (CharacterCodingException e) {
logger.error(e);
}
Any ideas on what to try to get this working?
Thank you and best regards,
Bozo
I'm not sure of your scenario. In Java, a String is Unicode; one only deals with charset conversion when one has to convert between a String and a binary representation.
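To make that boundary concrete, here is a minimal sketch (class and method names are hypothetical) in which windows-1250 bytes are decoded into a String and then encoded as UTF-8 bytes. It assumes you have access to the raw bytes, which is no longer the case once a framework has already produced a String for you.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Transcode {
    // Decodes input as windows-1250 and re-encodes the resulting text as UTF-8.
    // The String in the middle has no charset of its own; only the two
    // byte arrays do.
    static byte[] windows1250ToUtf8(byte[] input) {
        String text = new String(input, Charset.forName("windows-1250"));
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

For example, the single windows-1250 byte 0x9E (the letter ž) comes out as the two UTF-8 bytes 0xC5 0xBE.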
In your example, when getSimpleParamValue("some_input_param_from_get") is called, inputData should already have the "correct" String; the conversion from the stream of bytes (that travelled from the client browser to the web server) to a string should have already taken place (this is the responsibility of the web server and the web layer of your application).
For this, I enforce UTF-8 for the web transmission by placing a filter in the web.xml (before Struts), for example:
public class CharsetFilter implements Filter {
public void destroy() {}
public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {
HttpServletRequest req = (HttpServletRequest) request;
HttpServletResponse res = (HttpServletResponse) response;
req.setCharacterEncoding("UTF-8");
chain.doFilter(req, res);
String contentType = res.getContentType();
if( contentType !=null && contentType.startsWith("text/html"))
res.setCharacterEncoding("UTF-8");
}
public void init(FilterConfig filterConfig) throws ServletException {
}
}
If you cannot do this, and if your getSimpleParamValue() "errs" in the charset conversion (e.g., it assumed the byte stream was UTF-8 when it was windows-1250), you now have an "incorrect" string, and you must try to recover it by undoing and redoing the byte-to-string conversion. In that case you must know both the wrong AND the correct charset and, worse, deal with the possibility of missing chars (if the bytes were interpreted as UTF-8, the decoder might have found illegal byte sequences).
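A minimal sketch of that undo-and-redo step, with hypothetical names. It is only lossless when the wrongly assumed charset maps every byte to a char, as ISO-8859-1 does; if the wrong charset was UTF-8, malformed sequences have typically already been replaced and cannot be recovered this way.

```java
import java.nio.charset.Charset;

final class Redecode {
    // Undoes a decode that used the charset "assumed" and redoes it with "actual".
    static String redecode(String wrong, Charset assumed, Charset actual) {
        byte[] originalBytes = wrong.getBytes(assumed); // undo the wrong decoding
        return new String(originalBytes, actual);       // redo it with the right charset
    }
}
```

For instance, a parameter sent as windows-1250 but decoded by the container as ISO-8859-1 could be recovered with redecode(value, StandardCharsets.ISO_8859_1, Charset.forName("windows-1250")).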
If you have to deal with this in a Struts2 action, I'd say you are in trouble; you should deal with it explicitly before or after it (in the upper web layer, or in the database driver, file encoding, or whatever).