Text encoding converts junk character in Play! 1.2.4 framework

Issue: Text becomes garbled when it is stored through the Play! 1.2.4 framework.
Context: We are trying to store the text "《我叫MT繁體版》台港澳專屬伺服器上線！" from an input text field in MySQL using the Play! 1.2.4 framework.
Steps that we followed:
1) A UI gets the input from the user. It can be text in any language, so we tried Japanese characters. Note: the page is set to UTF-8 character encoding.
2) The form is posted to a Play! controller, which just reads the input and stores it using a Play! model. The snippet is shown below:
public static void text_create() throws UnsupportedEncodingException,
ParseException {
System.out.println("params :: text string value :: " + params.get("text"));
String oldString = params.get("text");
// Converting the input string (which is in UTF-8 format) and parsing it as Windows-1252
String newString = new String(oldString.getBytes(), "WINDOWS-1252");
// 1. passing encoded text to mysql.
// 2. TextCheck table and the column 'text' has encoding and collation format as UTF-8.
// 3. TextCheck > text column mentioned as String in model.
TextCheck a = new TextCheck(newString);
List<Object> text = TextCheck.TextList();
render(a,text);
}
It stores as TEXT value as "ã€Šæˆ‘å�«MTç¹�é«”ç‰ˆã€‹å�°æ¸¯æ¾³å°ˆå±¬ä¼ºæœ�å™¨ä¸Šç·šï¼�"
The problem is the � characters in the middle of the value. When I read this raw data from MySQL using other platforms such as Java, Ruby, or some other language, the text converts back, but the � characters come out as junk.
Note: Interestingly, when I read it back through the same Play! framework, it all looks fine; even the junk characters are read correctly.
Question: Why are those junk characters there?

The problem is the following line:
String newString = new String(oldString.getBytes(), "WINDOWS-1252");
This looks like nonsense to me. Java stores all strings internally as UTF-16, so you can't adjust the encoding of a Java string in the manner you've attempted here.
The getBytes() method returns the bytes of the string in the platform default encoding. You then convert these bytes into a new string using a (probably) different charset. The result is almost certain to be broken.
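As a rough illustration of where the � characters come from, the sketch below reproduces the pattern from the question, assuming the platform default encoding is UTF-8 (the class and method names are made up for this example). The UTF-8 bytes of 叫 are E5 8F AB, and byte 0x8F has no character assigned in windows-1252, so Java substitutes U+FFFD and the original byte is lost.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Hypothetical helper: encode as UTF-8, then wrongly decode as windows-1252,
    // which is what the question's code does when the platform default is UTF-8.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        String original = "\u53EB"; // the character 叫 from the sample text
        String broken = misdecode(original);
        // One character in, three characters out, one of them U+FFFD.
        System.out.println(broken.length());
        System.out.println(broken.indexOf('\uFFFD'));
    }
}
```

In most setups the fix is to drop the re-encoding step and hand params.get("text") to the model unchanged, provided the request itself is decoded as UTF-8.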

Related

How to handle U+FFFC (object replacement character) in URL

I am using Jsoup to scrape URLs, and one of the URLs I keep getting has the U+FFFC symbol in it. I have tried decoding the URL:
url = URLDecoder.decode(url, "UTF-8" );
but the symbol still remains in the decoded string.
I can't find much online about this other than that it is "The object replacement character, sometimes used to represent an embedded object in a document when it is converted to plain text."
But if this is the case, I should be able to print the symbol as plain text. However, when I run the following (with the symbol between the quotes)
System.out.println("");
I get a compilation error, and the file reverts back to the last save.
Sample URL: https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/
NOTE: If you decode the url then compare it to the decoded url it comes back as not the same e.g.:
String url = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/", "UTF-8");
if(url.contains("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles?/")){
System.out.println("The same");
}else {
System.out.println("Not the same");
}

That's not a compilation error. That's the Eclipse code editor telling you it can't save the source code to a file, because you have told it to save the file in the cp1252 encoding, and that encoding can't express U+FFFC.
Put differently, your development environment is currently configured to store source code in the cp1252 encoding, which doesn't support the character you want. You can either configure your development environment to store source code in a more flexible encoding (such as UTF-8, as the error message suggests), or avoid having that character in your source code, for instance by using its Unicode escape sequence instead:
System.out.println("\ufffc");
Note that as far as the Java language and runtime are concerned, U+FFFC is a character like any other, so there may not be a particular need to "handle" it. Also, I am unsure why you'd expect URLDecoder to do anything if the URL hasn't been URL-encoded to begin with.
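A minimal sketch of the comparison from the question, written with the Unicode escape so that the source file stays pure ASCII. The class and method names are invented for illustration, and the Charset overload of URLDecoder.decode assumes Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

final class ObjectReplacementDemo {
    // Decodes percent-encoded sequences as UTF-8 (Charset overload, Java 10+).
    static String decode(String url) {
        return URLDecoder.decode(url, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decoded = decode(
            "https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/");
        // %ef%bf%bc is the UTF-8 encoding of U+FFFC, so compare against the
        // escape rather than against a literal '?' placeholder.
        System.out.println(decoded.contains("\uFFFC"));
        System.out.println(decoded.endsWith("roles\uFFFC/"));
    }
}
```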

"ef bf bc" is a 3 bytes UTF-8 character so as the error says, there's no representation for that character in "CP1252" Windows page encoding.
An option could be to replace that percent encoding sequence with an ascii representation to make the filename for saving:
String url = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/".replace("%ef%bf%bc", "-xEFxBFxBC"), "UTF-8");
url ==> "https://www.breightgroup.com/job/hse-advisor-emb ... contract-roles-xEFxBFxBC/"
Another option is to use a CharsetDecoder:
String urlDec = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/", "UTF-8");
CharsetDecoder decoder = Charset.forName("CP1252").newDecoder().onMalformedInput(CodingErrorAction.REPLACE).onUnmappableCharacter(CodingErrorAction.REPLACE);
ByteBuffer buffer = ByteBuffer.wrap(urlDec.getBytes(Charset.forName("UTF-8")));
String result = decoder.decode(buffer).toString();
Result:
"https://www.breightgroup.com/job/hse-advisor-embedded-contract-rolesï¿¼/"

I resolved the issue by simply skipping URLs that contain this symbol, because there are other URLs with invisible Unicode symbols that couldn't be converted, etc.
So I just compare each URL to the following regex, and if it returns false I bypass that URL. Hope this helps someone out:
boolean newURL = url.matches("^[a-zA-Z0-9_:;/.&|%!+=#?-]*$");
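A small sketch of how that filter behaves, using the sample URL from the question (the class and method names are hypothetical). Because '%' is in the allowed set, a URL that is still percent-encoded passes the check; only the decoded form, which contains U+FFFC itself, is rejected.

```java
final class UrlFilterDemo {
    // Same pattern as above: accept only this fixed set of ASCII characters.
    static boolean isPlainAscii(String url) {
        return url.matches("^[a-zA-Z0-9_:;/.&|%!+=#?-]*$");
    }

    public static void main(String[] args) {
        String raw =
            "https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/";
        String decoded =
            "https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles\uFFFC/";
        System.out.println(isPlainAscii(raw));     // still percent-encoded
        System.out.println(isPlainAscii(decoded)); // contains U+FFFC
    }
}
```

If the goal is to skip every URL that carries such a symbol, the check probably has to run on the decoded string.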

What is the correct format for the API SurveyQuestionImage.Data field?

I am working with the GCS API, attempting to create a survey with image data.
I am using the NuGet package Google.Apis.ConsumerSurveys.v2 version 1.14.0.564 on the .Net platform. I can create surveys that do not contain image data without problem. However, when I try to create a survey with image data I receive an error from the API.
I have on hand base64 encoded png format image data. My images display properly in an IMG tag on a web page when the src attribute is set to
'data:image/png;base64,<image base64 string>'
I want to send this image data to the API to populate the survey image. My understanding is that I need to set the Data property of the Google.Apis.ConsumerSurveys.v2.Data.SurveyQuestionImage object to a string containing the image data. I have not been successful.
I first decode my base64 string to a byte array:
byte[] bytes = Convert.FromBase64String(<image base64 string>);
I have tried setting the Data property in the SurveyQuestionImage object as:
image.Data = Encoding.Unicode.GetString(bytes);
This results in this error from the API:
Google.Apis.Requests.RequestError Invalid value for ByteString: <the Data string>
I have also tried converting the byte array to a hexadecimal encoded string as:
StringBuilder sb = new StringBuilder(bytes.Length);
foreach (Byte b in bytes)
{
sb.Append(b.ToString("X2"));
}
image.Data = sb.ToString();
This results in the more hopeful error:
Google.Apis.Requests.RequestError Invalid Value supplied to API: image_data was bad. Request Id: 579665c300ff05e6c316a09e600001737e3430322d747269616c320001707573682d30372d32322d72313000010112 [400] Errors [ Message[Invalid Value supplied to API: image_data was bad. Request Id: 579665c300ff05e6c316a09e600001737e3430322d747269616c320001707573682d30372d32322d72313000010112] Location[ - ] Reason[INVALID_VALUE] Domain[global] ]
Does anyone know the correct format for the Data property of the Google.Apis.ConsumerSurveys.v2.Data.SurveyQuestionImage object?

The data needs to be base64 encoded and also "urlsafe" or "websafe" depending on what language you are using. (python and java, respectively)
In other words, you'll need to base64 encode first and then make the result web safe:
Web-safe encoding uses '-' instead of '+' and '_' instead of '/'.
Hope this helps!
For c# users, check out this technique for making websafe b64:
How to achieve Base64 URL safe encoding in C#?
For .net users, look at the comments in this question:
Converting string to web-safe Base64 format
And also this link for more info about .net specific options for encoding:
http://www.codeproject.com/Tips/76650/Base-base-url-base-url-and-z-base-encoding
And to specifically answer the original poster, try this for converting your byte array to a string.
public static string ToBase64ForUrlString(byte[] input)
{
StringBuilder result = new StringBuilder(Convert.ToBase64String(input).TrimEnd('='));
result.Replace('+', '-');
result.Replace('/', '_');
return result.ToString();
}
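For comparison, here is a Java sketch of the same web-safe encoding using the JDK's java.util.Base64 (Java 8+). The class name is made up, and stripping the '=' padding simply mirrors the TrimEnd('=') in the C# version; whether the API requires that is an assumption.

```java
import java.util.Base64;

final class WebSafeBase64 {
    // URL-safe alphabet ('-' and '_' instead of '+' and '/'), no '=' padding.
    static String encode(byte[] input) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(input);
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFB, (byte) 0xFF};
        // Standard Base64 would give "+/8=" for these two bytes.
        System.out.println(encode(sample));
    }
}
```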

Why is my UTF-8 encoded data not staying "UTF-8" encoded?

The problem I'm trying to fix is this:
Users of our application are copying and pasting characters from Windows documents, such as Word files, and our application is not recognizing single and double quotes or bullets.
These are the steps I've taken so far to get this data into UTF format:
Inside servers.xml, in the Connector tag, I added the attribute URIEncoding="UTF-8".
In the bean charged with storing the input, I created a byte[] from the String holding the inputNote text using UTF-8, then built a String from those UTF-8 bytes and assigned it back to the inputNoteText String. The condensed code is directly below.
byte[] bytesInUTF8inputNoteText = inputNoteText.getBytes("UTF-8");
inputNoteText = new String(bytesInUTF8inputNoteText, "UTF-8");
this.var = inputNoteText;
In the variable setter charged with holding the result from the db query, setNoteText(noteText), I convert the note data coming from the database query into bytes in UTF-8 format, then convert it back into a String and set the String noteText property. The code is also below.
public void setNoteText(String noteText) throws UnsupportedEncodingException {
byte[] bytesInUTF8inputNoteText = noteText.getBytes("UTF-8");
String noteTextUTF8 = new String(bytesInUTF8inputNoteText, "UTF-8");
this.noteText = noteTextUTF8;}
In SQL Server I changed the data type from text to nvarchar(MAX) to store the data in Unicode, even though that is a different Unicode encoding.
What I see when I copy/paste from an MS Word doc into our JSF input textbox:
In Eclipse, if I set a watch on the property in the bean, once the data in that String property has been converted into UTF-8, all characters are in UTF-8 format. When I post to SQL Server, the string of data held in the nvarchar(max) column shows all characters correctly. When the resultSet is returned and the holding property is populated with the String returned from the db query, it also shows everything correctly formatted in UTF-8.
BUT somewhere between the correct string value sitting in the property that's tied to the JSF page and the JSF page itself (JSF 1.2, by the way), the value is being unformatted, so that I see question marks where I should see single/double quotes and bullet points.
I hope that someone has run into this type of issue before and can shed some light on what I need to do to fix this. It seems kind of like a JSF bug. Thanks in advance for your input!

try this
noteText = new String(noteText.getBytes("ISO-8859-1"), "UTF-8");

When you copy and paste from Windows documents, the encoding is not UTF-8 but Windows-1252 (http://en.wikipedia.org/wiki/Windows-1252). Note the cells marked with thick green borders in the table there. These characters DON'T have the same byte values in UTF-8 (or in ISO-8859-1), so you will have to use the Windows-1252 encoding while reading.
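A short sketch of what happens to Word-style punctuation under different charsets. The class and method names are invented, and whether the JSF page goes through an ISO-8859-1 step is only an assumption; the point is that encoding a curly quote with a charset that lacks it yields a literal '?', while a UTF-8 encode followed by a UTF-8 decode changes nothing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SmartQuoteDemo {
    // Bytes 0x93, 0x94 and 0x95 are the curly double quotes and the bullet in windows-1252.
    static String decodeCp1252(byte[] bytes) {
        return new String(bytes, Charset.forName("windows-1252"));
    }

    // Encoding and decoding with the same charset returns an equal string.
    static String roundTripUtf8(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    // ISO-8859-1 has no curly quotes, so getBytes substitutes the byte for '?'.
    static byte[] encodeLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String quotes = decodeCp1252(new byte[] {(byte) 0x93, (byte) 0x94, (byte) 0x95});
        System.out.println(quotes.equals("\u201C\u201D\u2022"));
        System.out.println(roundTripUtf8(quotes).equals(quotes));
        System.out.println((char) encodeLatin1("\u201C")[0]);
    }
}
```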

throw exception when string is not encoded in UTF-8

I've got a method where one of the input parameters is String xml. I want to add a check on the encoding of that XML: if any character is in an encoding other than UTF-8, an error should be thrown.
Can you please tell me the easiest way to create and test this?
I've used something like this:
String xml = IOUtils.toString(new FileInputStream("c:/encoding.xml"));
Document doc = builder.parse(IOUtils.toInputStream(xml, "UTF-8"));
I added letters like Ľ,Š,Ť,Ž,ľ,š,ť,ž and saved the file as cp1250,
but there was no error.
What am I doing wrong?

Java cannot natively detect which encoding a file uses. A file is just a sequence of bytes that can be interpreted however you like, and by default Java has no way to add meaning. I recommend using this library (no, I didn't write it):
http://code.google.com/p/juniversalchardet/
Follow these instructions (copy-pasted from that link):
How to use it
1) Construct an instance of org.mozilla.universalchardet.UniversalDetector.
2) Feed some data (typically several thousand bytes) to the detector by calling UniversalDetector.handleData().
3) Notify the detector of the end of data by calling UniversalDetector.dataEnd().
4) Get the detected encoding name by calling UniversalDetector.getDetectedCharset().
5) Don't forget to call UniversalDetector.reset() before you reuse the detector instance.

String xml = IOUtils.toString(new FileInputStream("c:/encoding.xml"));
If this IOUtils is org.apache.commons.io.IOUtils then its Javadoc says
"Get the contents of an InputStream as a String using the default character encoding of the platform."
As you are saving as cp1250, I guess cp1250 is also your platform character encoding. What your code would be doing is:
1) Read the file as a byte stream
2) Convert the byte stream to chars using cp1250 (platform encoding)
3) Transform the chars to Java's internal representation (UTF-16)
4) Convert from UTF-16 to UTF-8
5) Create the XML document
That will always work, because cp1250 really is your file encoding, UTF-16 has every character in cp1250, and UTF-8 has every character in UTF-16.
If you want to read the bytes as UTF-8 and avoid automatic conversions, you should use one of the two-parameter variants of IOUtils.toString():
public static String toString(InputStream input, Charset encoding)
public static String toString(InputStream input, String encoding)
So I would try:
// Helper import: I always forget if the constant is "UTF8" or "UTF-8"
import org.apache.commons.lang.CharEncoding;
String xml = IOUtils.toString(new FileInputStream("c:/encoding.xml"), CharEncoding.UTF_8);
Document doc = builder.parse(IOUtils.toInputStream(xml, CharEncoding.UTF_8));
The rule of thumb here is: NEVER do any byte-to-string / string-to-byte conversion without specifying the source / destination encoding.
A minor rule of thumb would be: Unless you need to use some other encoding, use UTF-8 everywhere.
Both of those rules of thumb are independent of your programming language of choice.
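If the goal is an actual exception on invalid input, note that Java's default decoding paths substitute U+FFFD for malformed bytes rather than failing. A minimal sketch of a strict check using only the JDK follows; the class and method names are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {
    // Decodes the bytes as UTF-8 and throws instead of inserting U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Convenience wrapper that reports validity as a boolean.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] cp1250 = {(byte) 0x8A}; // the letter S with caron as saved in cp1250
        byte[] utf8 = "\u0160".getBytes(StandardCharsets.UTF_8);
        System.out.println(isValidUtf8(cp1250));
        System.out.println(isValidUtf8(utf8));
    }
}
```

This only proves that the bytes form valid UTF-8. A file in another encoding can occasionally pass by coincidence, and plain ASCII always passes.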

UTF-8 character encoding in Java

I am having some problems getting some French text to convert to UTF8 so that it can be displayed properly, either in a console, text file or in a GUI element.
The original string is
HANDICAP╔ES
which is supposed to be
HANDICAPÉES
Here is a code snippet that shows how I am using the jackcess Database driver to read in the Access MDB file in an Eclipse/Linux environment.
Database database = Database.open(new File(filepath));
Table table = database.getTable(tableName, true);
Iterator<Map<String, Object>> rowIter = table.iterator();
while (rowIter.hasNext()) {
Map<String, Object> row = rowIter.next();
// convert fields to UTF
Map<String, Object> rowUTF = new HashMap<String, Object>();
try {
for (String key : row.keySet()) {
Object o = row.get(key);
if (o != null) {
String valueCP850 = o.toString();
// String nameUTF8 = new String(valueCP850.getBytes("CP850"), "UTF8"); // does not work!
String valueISO = new String(valueCP850.getBytes("CP850"), "ISO-8859-1");
String valueUTF8 = new String(valueISO.getBytes(), "UTF-8"); // works!
rowUTF.put(key, valueUTF8);
}
}
} catch (UnsupportedEncodingException e) {
System.err.println("Encoding exception: " + e);
}
}
In the code you'll see where I want to convert directly to UTF8, which doesn't seem to work, so I have to do a double conversion. Also note that there doesn't seem to be a way to specify the encoding type when using the jackcess driver.
Thanks,
Cam

New analysis, based on new information.
It looks like your problem is with the encoding of the text before it was stored in the Access DB. It seems it had been encoded as ISO-8859-1 or windows-1252, but decoded as cp850, resulting in the string HANDICAP╔ES being stored in the DB.
Having correctly retrieved that string from the DB, you're now trying to reverse the original encoding error and recover the string as it should have been stored: HANDICAPÉES. And you're accomplishing that with this line:
String valueISO = new String(valueCP850.getBytes("CP850"), "ISO-8859-1");
getBytes("CP850") converts the character ╔ to the byte value 0xC9, and the String constructor decodes that according to ISO-8859-1, resulting in the character É. The next line:
String valueUTF8 = new String(valueISO.getBytes(), "UTF-8");
...does nothing. getBytes() encodes the string in the platform default encoding, which is UTF-8 on your Linux system. Then the String constructor decodes it with the same encoding. Delete that line and you should still get the same result.
More to the point, your attempt to create a "UTF-8 string" was misguided. You don't need to concern yourself with the encoding of Java's strings--they're always UTF-16. When bringing text into a Java app, you just need to make sure you decode it with the correct encoding.
And if my analysis is correct, your Access driver is decoding it correctly; the problem is at the other end, possibly before the DB even comes into the picture. That's what you need to fix, because that new String(getBytes()) hack can't be counted on to work in all cases.
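To make the suspected origin of the corruption concrete, here is a sketch that runs the mistake in the forward direction: encode as ISO-8859-1, then decode as cp850. The class and method names are hypothetical, and it assumes the JDK in use ships the IBM850 charset. It also shows that a UTF-8 encode followed by a UTF-8 decode leaves the string unchanged.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Cp850Demo {
    private static final Charset CP850 = Charset.forName("IBM850");

    // Simulates the suspected upstream bug: ISO-8859-1 bytes read as cp850.
    static String corrupt(String text) {
        return new String(text.getBytes(StandardCharsets.ISO_8859_1), CP850);
    }

    // Encoding and decoding with the same charset gives back an equal string.
    static String utf8RoundTrip(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String intended = "HANDICAP\u00C9ES"; // E with acute accent
        String stored = corrupt(intended);    // byte 0xC9 is a box-drawing corner in cp850
        System.out.println(stored.equals("HANDICAP\u2554ES"));
        System.out.println(utf8RoundTrip(intended).equals(intended));
    }
}
```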
Original analysis, based on no information. :-/
If you're seeing HANDICAP╔ES on the console, there's probably no problem. Given this code:
System.out.println("HANDICAPÉES");
The JVM converts the (Unicode) string to the platform default encoding, windows-1252, before sending it to the console. Then the console decodes that using its own default encoding, which happens to be cp850. So the console displays it wrong, but that's normal. If you want it to display correctly, you can change the console's encoding with this command:
CHCP 1252
To display the string in a GUI element, such as a JLabel, you don't have to do anything special. Just make sure you use a font that can display all the characters, but that shouldn't be a problem for French.
As for writing to a file, just specify the desired encoding when you create the Writer:
OutputStreamWriter osw = new OutputStreamWriter(
new FileOutputStream("myFile.txt"), "UTF-8");
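A slightly fuller sketch of the same idea, with the writer closed via try-with-resources. The class and method names are invented; an in-memory stream stands in for the FileOutputStream so the resulting bytes can be inspected, but any OutputStream can be passed in.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class Utf8WriterDemo {
    // Writes text to the stream as UTF-8, whatever the platform default is.
    static void writeUtf8(OutputStream out, String text) {
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for inspection: returns the bytes that would end up in the file.
    static byte[] encodeInMemory(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeUtf8(buffer, text);
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encodeInMemory("\u00C9"); // E with acute accent
        for (byte b : bytes) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```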

String s = "HANDICAP╔ES";
System.out.println(new String(s.getBytes("CP850"), "ISO-8859-1")); // HANDICAPÉES
This shows the correct string value. This means that it was originally encoded/decoded with ISO-8859-1 and then incorrectly encoded with CP850 (originally CP1252, a.k.a. Windows ANSI, is indeed also possible, as pointed out in a comment, since É has the same code point there as in ISO-8859-1).
Align your environment and binary pipelines so that they all use one and the same character encoding. You can't and shouldn't convert between them. You would risk losing information in the non-ASCII range that way.
Note: do NOT use the above code snippet to "fix" the problem! That would not be the right solution.
Update: you are apparently still struggling with the problem. I'll repeat the important parts of the answer:
Align your environment and binary pipelines so that they all use one and the same character encoding.
You cannot and should not convert between them. You would risk losing information in the non-ASCII range that way.
Do NOT use the above code snippet to "fix" the problem! That would not be the right solution.
To fix the problem you need to choose character encoding X which you'd like to use throughout the entire application. I suggest UTF-8. Update MS Access to use encoding X. Update your development environment to use encoding X. Update the java.io readers and writers in your code to use encoding X. Update your editor to read/write files with encoding X. Update the application's user interface to use encoding X. Do not use Y or Z or whatever at some step. If the characters are already corrupted in some datastore (MS Access, files, etc), then you need to fix it by manually replacing the characters right there in the datastore. Do not use Java for this.
If you're actually using the "command prompt" as user interface, then you're actually lost. It doesn't support UTF-8. As suggested in the comments and in the article linked in the comments, you need to create a Swing application instead of relying on the restricted command prompt environment.

You can specify the encoding when establishing the connection. This approach worked perfectly and solved my encoding problem:
DatabaseImpl open = DatabaseImpl.open(new File("main.mdb"), true, null, Database.DEFAULT_AUTO_SYNC, java.nio.charset.Charset.availableCharsets().get("windows-1251"), null, null);
Table table = open.getTable("FolderInfo");

Using "ISO-8859-1" helped me deal with the French characters.
