URL encoding arbitrary characters - java

I need to submit application/x-www-form-urlencoded data to a web server.
The server expects the data to be encoded using ISO-8859-1.
Unfortunately URLEncoder.encode(string, "ISO-8859-1"); does not always work.
Any character that is not part of ISO-8859-1 gets encoded as %3F (which is '?').
Firefox handles those chars in some other way that works on the server side.
\uFEFF (Zero Width No-Break Space) gets encoded to %26%2365279%3B which is exactly what I need.
Could anyone please tell me how to mimic this behaviour/what FF does?

To answer my own question:
FF converts the unmappable chars to decimal HTML entities and encodes those using the charset.
\uFEFF -> & #65279; (ignore the space in between) -> %26%2365279%3B
( %26 = & | %23 = # | %3B = ; )
Here is a method that does the first step in Java:
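// requires: import java.nio.charset.Charset; import java.nio.charset.CharsetEncoder;
public static String htmlEscapeUnmappableCharacters(String source, String charset) {
    CharsetEncoder cse = Charset.forName(charset).newEncoder();
    StringBuilder sb = new StringBuilder();
    // Walk by code point so a surrogate pair yields one entity, not two.
    for (int i = 0; i < source.length(); ) {
        int cp = source.codePointAt(i);
        String ch = new String(Character.toChars(cp));
        if (cse.canEncode(ch)) {
            sb.append(ch);
        } else {
            // Unmappable: emit a decimal HTML entity such as &#65279;
            sb.append('&').append('#').append(cp).append(';');
        }
        i += Character.charCount(cp);
    }
    return sb.toString();
}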
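The two steps described above (escape the unmappable characters, then URL-encode with the target charset) can be combined as in the sketch below. The class and method names (FormEncoder, escapeUnmappable, encode) are made up for this illustration, and the sketch only mimics the behaviour described in this answer; it does not claim to reproduce everything a browser does.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class FormEncoder {

    // Step 1: replace every code point the charset cannot represent
    // with a decimal HTML entity (&#NNNN;).
    static String escapeUnmappable(String source, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder sb = new StringBuilder();
        source.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (encoder.canEncode(ch)) {
                sb.append(ch);
            } else {
                sb.append("&#").append(cp).append(';');
            }
        });
        return sb.toString();
    }

    // Step 2: URL-encode the escaped text using the same charset.
    static String encode(String value, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        try {
            return URLEncoder.encode(escapeUnmappable(value, charset), charsetName);
        } catch (UnsupportedEncodingException e) {
            // Not expected: Charset.forName already accepted the name.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("a\uFEFFb", "ISO-8859-1"));
    }
}
```

For the input a\uFEFFb the sketch prints a%26%2365279%3Bb, which matches the encoding quoted in the question.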

Related

Change InputStream charset after being set

If a string of data contains characters with different encodings, is there a way to change charset encoding after an input stream is created or suggestions on how it could be achieved?
Example to help explain:
// data need to read first 4 characters using UTF-8 and next 4 characters using ISO-8859-2?
String data = "testўёѧẅ";
// use default charset of platform, could pass in a charset
try (InputStream in = new ByteArrayInputStream(data.getBytes())) {
// probably an input stream reader to use char instead of byte would be clearer but hopefully the idea comes across
byte[] bytes = new byte[4];
while (in.read(bytes) != -1) {
// TODO: change the charset here to UTF-8 then read values
// TODO: change the charset here to ISO-8859-2 then read values
}
}
Been looking at decoders, might be the way to go:
What is CharsetDecoder.decode(ByteBuffer, CharBuffer, endOfInput)
Encoding conversion in java
Attempt using same input stream:
String data = "testўёѧẅ";
InputStream inputStream = new ByteArrayInputStream(data.getBytes());
Reader r = new InputStreamReader(inputStream, "UTF-8");
int intch;
int count = 0;
while ((intch = r.read()) != -1) {
System.out.println((char) intch);
if ((++count) == 4) {
r = new InputStreamReader(inputStream, Charset.forName("ISO-8859-2"));
}
}
//outputs test and not the 2nd part
Assuming that you know there will be n UTF-8 characters and m ISO 8859-2 characters in your stream (n=4, m=4 in your example), you can do this by using two different InputStreamReaders working on the same InputStream:
try (InputStream in = new ByteArrayInputStream(data.getBytes())) {
InputStreamReader inUtf8 = new InputStreamReader(in, StandardCharsets.UTF_8);
InputStreamReader inIso88592 = new InputStreamReader(in, Charset.forName("ISO-8859-2"));
// read `n` characters using inUtf8, then read `m` characters using inIso88592
}
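Note that you need to count characters, not bytes (i.e. check how many characters have been read so far), because in UTF-8 a single character may be encoded in 1 to 4 bytes. Also be aware that an InputStreamReader may read ahead more bytes from the underlying stream than the current read needs, so the second reader can find the stream already drained; the attempt in the question shows exactly that.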
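If all the bytes are available in memory, the charset switch can also be done on a ByteBuffer with a CharsetDecoder, which consumes only the bytes it needs for the characters it produces. The following is a minimal sketch under that assumption. The names MixedCharsetDecode, decodeMixed and sampleBytes are hypothetical, and the sample uses Polish letters for the second part because ISO-8859-2 can actually represent them.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MixedCharsetDecode {

    // Decodes the first n chars with `first`, then everything left with `second`.
    static String decodeMixed(byte[] bytes, int n, Charset first, Charset second) {
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer head = CharBuffer.allocate(n);
        // Decoding stops when `head` is full (or at bytes `first` cannot decode);
        // `in` is left positioned at the first unconsumed byte.
        first.newDecoder().decode(in, head, false);
        head.flip();
        return head.toString() + second.decode(in);
    }

    // Hypothetical sample data: "test" as UTF-8 followed by four ISO-8859-2 characters.
    static byte[] sampleBytes() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] a = "test".getBytes(StandardCharsets.UTF_8);
        byte[] b = "\u0142\u00f3d\u017a".getBytes(Charset.forName("ISO-8859-2"));
        out.write(a, 0, a.length);
        out.write(b, 0, b.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Charset latin2 = Charset.forName("ISO-8859-2");
        System.out.println(decodeMixed(sampleBytes(), 4, StandardCharsets.UTF_8, latin2));
    }
}
```

The sketch ignores the CoderResult of the first decode call, so it does not report malformed input in the first part; a real implementation would probably want to check it.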
String contains Unicode so it can combine all language scripts.
String data = "testўёѧẅ";
A String is a sequence of char values, where each char is a UTF-16 code unit. Some Unicode symbols (code points) need to be encoded as two chars, so a char maps exactly to a code point for only part of Unicode. For this example, splitting by char index will do:
String d1 = data.substring(0, 4);
byte[] b1 = d1.getBytes(StandardCharsets.UTF_8); // Binary data, UTF-8 text
String d2 = data.substring(4);
Charset charset = Charset.forName("ISO-8859-2");
byte[] b2 = d2.getBytes(charset); // Binary data, Latin-2 text; characters Latin-2 cannot represent become '?'
The number of bytes does not need to correspond to the number of code points.
Also, é might be one code point (é) or two code points: e followed by a zero-width combining ´.
To split text by script or Unicode block:
data.codePoints().forEach(cp -> System.out.printf("%-35s - %-25s - %s%n",
Character.getName(cp),
Character.UnicodeBlock.of(cp),
Character.UnicodeScript.of(cp)));
Name: Unicode block: Script:
LATIN SMALL LETTER T - BASIC_LATIN - LATIN
LATIN SMALL LETTER E - BASIC_LATIN - LATIN
LATIN SMALL LETTER S - BASIC_LATIN - LATIN
LATIN SMALL LETTER T - BASIC_LATIN - LATIN
CYRILLIC SMALL LETTER SHORT U - CYRILLIC - CYRILLIC
CYRILLIC SMALL LETTER IO - CYRILLIC - CYRILLIC
CYRILLIC SMALL LETTER LITTLE YUS - CYRILLIC - CYRILLIC
LATIN SMALL LETTER W WITH DIAERESIS - LATIN_EXTENDED_ADDITIONAL - LATIN
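Building on the per-code-point listing above, the text can be cut into runs wherever the script changes. This is only a sketch: ScriptRuns and splitByScript are names invented here, and characters with script COMMON or INHERITED (spaces, punctuation, combining marks) simply start their own run, which may not be what a real application wants.

```java
import java.util.ArrayList;
import java.util.List;

public class ScriptRuns {

    // Splits text into maximal runs of code points that share a Unicode script.
    static List<String> splitByScript(String text) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Character.UnicodeScript last = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (last != null && script != last) {
                runs.add(current.toString()); // script changed: close the run
                current.setLength(0);
            }
            current.appendCodePoint(cp);
            last = script;
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        // Same text as in the question, written with escapes: test + three Cyrillic letters + one Latin letter
        System.out.println(splitByScript("test\u045e\u0451\u0467\u1e85"));
    }
}
```

For the question's string this yields three runs (Latin, Cyrillic, Latin), in line with the script column of the listing above.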

Java convert unicode code point to string

How can UTF-8 value like =D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0 be converted in Java?
I have tried something like:
Character.toCodePoint((char)(Integer.parseInt("D0", 16)),(char)(Integer.parseInt("93", 16)));
but it does not convert to a valid code point.
That string is an encoding of bytes in hex, so the best way is to decode the string into a byte[], then call new String(bytes, StandardCharsets.UTF_8).
Update
Here is a slightly more direct version of decoding the string, than provided by "sstan" in another answer. Of course both versions are good, so use whichever makes you more comfortable, or write your own version.
String src = "=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0";
assert src.length() % 3 == 0;
byte[] bytes = new byte[src.length() / 3];
for (int i = 0, j = 0; i < bytes.length; i++, j+=3) {
assert src.charAt(j) == '=';
bytes[i] = (byte)(Character.digit(src.charAt(j + 1), 16) << 4 |
Character.digit(src.charAt(j + 2), 16));
}
String str = new String(bytes, StandardCharsets.UTF_8);
System.out.println(str);
Output
Газета
In UTF-8, a single character is not always encoded with the same number of bytes. Depending on the character, it may require 1, 2, 3, or even 4 bytes to be encoded. Therefore, it's definitely not a trivial matter to try to map UTF-8 bytes yourself to a Java char, which uses UTF-16 encoding, where each char is encoded using 2 bytes. Not to mention that, depending on the character (code point > 0xffff), you may also have to worry about dealing with surrogate characters, which is just one more complication that you can easily get wrong.
All this to say that Andreas is absolutely right. You should focus on parsing your string to a byte array, and then let the built-in libraries convert the UTF-8 bytes to a Java string for you. From a Java String, it's trivial to extract the Unicode code points if that's what you want.
Here is some sample code that shows one way this can be achieved:
public static void main(String[] args) throws Exception {
String src = "=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0";
// Parse string into hex string tokens.
String[] tokens = Arrays.stream(src.split("="))
.filter(s -> s.length() != 0)
.toArray(String[]::new);
// Convert the hex string representations to a byte array.
byte[] utf8bytes = new byte[tokens.length];
for (int i = 0; i < utf8bytes.length; i++) {
utf8bytes[i] = (byte) Integer.parseInt(tokens[i], 16);
}
// Convert UTF-8 bytes to Java String.
String str = new String(utf8bytes, StandardCharsets.UTF_8);
// Display string + individual unicode code points.
System.out.println(str);
str.codePoints().forEach(System.out::println);
}
Output:
Газета
1043
1072
1079
1077
1090
1072
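Both versions above assume that every byte of the input is written as =XX. If the input may also contain plain characters between the escapes, a variant along the following lines could be used. This is a sketch under the assumption that unescaped characters are ASCII; EqualsHexDecoder and decode are names invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class EqualsHexDecoder {

    // Turns =XX escapes into bytes, copies other characters as-is,
    // then decodes the collected bytes as UTF-8.
    static String decode(String src) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '=' && i + 2 < src.length()) {
                int hi = Character.digit(src.charAt(i + 1), 16);
                int lo = Character.digit(src.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    bytes.write((hi << 4) | lo);
                    i += 2; // skip the two hex digits
                    continue;
                }
            }
            bytes.write(c); // literal character, assumed to be ASCII
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode("=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0"));
    }
}
```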

convert charset X to unicode in Java

How do you convert a specific charset to unicode in Java?
Charsets have been discussed quite a lot here, but I think this one hasn't been covered yet.
I have a hex string that meets the criterion length%4==0 (e.g. \ud3faef8e). Usually I just display this in an HTML container and add &#x to the front and ; to the back of each hex quadruple.
But in this case the following (non-Java) procedure led to the correct output:
1. paste the hex string into a hex editor and save the file as test.txt (utf-8)
2. open the file with Notepad++
3. change the encoding to Simplified Chinese (GB2312)
Now I'm trying to do the same in Java.
// having hex convert to ascii
String ascii = "";
for (int cnt = 0; cnt <= unicode.length() - 2; cnt += 2) {
String tmp = unicode.substring(cnt, cnt + 2);
int decimal = Integer.parseInt(tmp, 16);
ascii += (char) decimal;
}
// writing ascii to file at this point leads to the same result as in step 2 before
try {
// get the bytes
byte[] utf8 = ascii.getBytes("UTF-8"); // == UTF8
// convert to gb2312
String converted = new String(utf8, "GB2312"); // == EUC_CN
// write to file (writer with declared UTF-8)
writeToFile(converted, 20 + cntu);
cntu++;
} catch (Exception e) {
System.err.println(e.getMessage());
}
The output matches the expected output, except that the following character shows up at random positions: �. Why does it appear, and how can I get rid of it?
In the end, what I'd like to get is the converted Unicode again, so that I can display it with my original approach (폴), but I haven't figured out a way to get the hex values again (they don't meet the length%4==0 criterion). How do I get the hex values of the characters?
update1
to be more precise, regarding the input, I'm assuming that it is Unicode, because of the start of the String with \u, which would be sufficient for my usual approach, but not in the case I am describing above.
update2
the writeToFile method
FileOutputStream fos = new FileOutputStream("test" + id + ".txt");
Writer out = new OutputStreamWriter(fos, "UTF8");
out.write(str);
out.close();
I tried with GB2312 as well, but there is no change. I still get the ? in between the correct characters.
update3
The expected output for \ud3f6ef8e is 遇飵; you get it by following steps 1 to 3 (with HxD as an example of a hex editor).
There was no indication that I should delete my question, so I'm writing my final comment as the answer.
I was misinterpreting the incoming hex digits. They were in a specific charset and not Unicode, so they represented the hex values of a character in that charset. What I'm doing now is new String(byteArray, "CharsetName"); and then (int) s.charAt(i) to get the Unicode value and write it to HTML. Thanks for your ideas and hints.
For more details see this answer: https://stackoverflow.com/a/4049781/1338732 , and this question: How to convert UTF-8 to unicode in Java?
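The approach described in this answer (parse the hex digits into bytes, decode them with the real charset, then write out the Unicode values) could look roughly like the sketch below. HexToEntities and toEntities are names invented here. The demo call uses UTF-8 so that it runs on any JVM; for the data in the question one would pass the actual charset name (for example "GB2312"), provided the JVM supports it.

```java
import java.nio.charset.Charset;

public class HexToEntities {

    // Interprets `hex` as bytes in the given charset and returns the decoded
    // text as a sequence of hexadecimal HTML entities (&#x...;).
    static String toEntities(String hex, String charsetName) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        String text = new String(bytes, Charset.forName(charsetName));
        StringBuilder html = new StringBuilder();
        // Iterate by code point so characters outside the BMP give one entity each.
        text.codePoints().forEach(cp ->
                html.append("&#x").append(Integer.toHexString(cp)).append(';'));
        return html.toString();
    }

    public static void main(String[] args) {
        // Two Cyrillic letters given as UTF-8 bytes in hex
        System.out.println(toEntities("d093d0b0", "UTF-8"));
    }
}
```

Using codePoints() instead of charAt(i) is a small deviation from the answer's wording; it only matters for characters that need a surrogate pair.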

Java URL encoding

From my web application I am doing a redirect to an external URL which has some credentials as a part of the URL string. I would like to encode the credential part alone before redirection. I have the following URL:
String url1 = "http://servername:7778/reports/rwservlet?server=server1&ORACLE_SHUTDOWN=YES&PARAMFORM=no&report=test.rdf&desformat=pdf&desname=test.pdf&destype=cache&param1=56738&faces-redirect=true&";
I am encoding it as:
String URL = "userid=username/passwd#DBname";
String encodedURL = URLEncoder.encode(URL, "UTF-8");
String redirectURL = url1 + encodedURL;
The URL generated by this code is
http://servername:7778/reports/rwservlet?server=server1&ORACLE_SHUTDOWN=YES&PARAMFORM=no&report=test.rdf&desformat=pdf&desname=test.pdf&destype=cache&param1=56738&faces-redirect=true&userid=%3Dusername%2Fpasswd%40DBname
As we can see towards the end of the encoded URL, only the special characters like / have been encoded. i.e. userid=username/passwd#DBname has become userid=%3Dusername%2Fpasswd%40DBname
I want to generate a URL which will have the entire string "username/passwd#DBname" encoded. Something like:
userid=%61%62
How can I achieve this?
So in fact you want the URL to become somewhat unreadable, without the receiver needing any extra decoding. (A Base64 encoding, with + and / replaced by URL-safe characters, would need such decoding.)
Yes, you may abuse the URL encoding for that.
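// requires: import java.nio.charset.StandardCharsets;
String encodeURL(String s) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8); // no checked exception, unlike getBytes("UTF-8")
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
        String hex = String.format("%%%02X", b & 0xFF); // mask so negative bytes print as two hex digits
        sb.append(hex);
    }
    return sb.toString();
}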
%% being the percentage sign itself, and %02X hex, 2 digits, zero-filled, capitals.
Mind that some browsers will display such links decoded, on mouse-over. But you are just redirecting.
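To see that such an over-encoded value is still ordinary percent-encoding, the sketch below encodes every byte and then decodes the result again with the standard URLDecoder. PercentEncodeAll and encodeAll are names made up for this example; the credential string is the one from the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class PercentEncodeAll {

    // Percent-encodes every byte of the UTF-8 form, including letters and digits.
    static String encodeAll(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%%%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String encoded = encodeAll("username/passwd#DBname");
        System.out.println("userid=" + encoded);
        // A standard decoder restores the original text.
        System.out.println(URLDecoder.decode(encoded, "UTF-8"));
    }
}
```

Keep in mind that this only obscures the value; anyone can decode it, so it is not a way to protect the credentials.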

How to remove bad characters that are not suitable for utf8 encoding in MySQL?

I have dirty data. Sometimes it contains characters like this. I use this data to make queries like
WHERE a.address IN ('mydatahere')
For this character I get
org.hibernate.exception.GenericJDBCException: Illegal mix of collations (utf8_bin,IMPLICIT), (utf8mb4_general_ci,COERCIBLE), (utf8mb4_general_ci,COERCIBLE) for operation ' IN '
How can I filter out characters like this? I use Java.
Thanks.
When I had a problem like this, I used a Perl script to ensure that the data was converted to valid UTF-8, using code like this:
use Encode;
binmode(STDOUT, ":utf8");
while (<>) {
print Encode::decode('UTF-8', $_);
}
This script takes (possibly corrupted) UTF-8 on stdin and re-prints valid UTF-8 to stdout. Invalid characters are replaced with � (U+FFFD, Unicode replacement character).
If you run this script on good UTF-8 input, output should be identical to input.
If you have data in a database, it makes sense to use DBI to scan your table(s) and scrub all data using this approach to make sure that everything is valid UTF-8.
This is a Perl one-liner version of the same script:
perl -MEncode -e "binmode STDOUT,':utf8';while(<>){print Encode::decode 'UTF-8',\$_}" < bad.txt > good.txt
EDIT: Added Java-only solution.
This is an example of how to do this in Java:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
public class UtfFix {
public static void main(String[] args) throws InterruptedException, CharacterCodingException {
CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPLACE);
decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
ByteBuffer bb = ByteBuffer.wrap(new byte[] {
(byte) 0xD0, (byte) 0x9F, // 'П'
(byte) 0xD1, (byte) 0x80, // 'р'
(byte) 0xD0, // corrupted UTF-8, was 'и'
(byte) 0xD0, (byte) 0xB2, // 'в'
(byte) 0xD0, (byte) 0xB5, // 'е'
(byte) 0xD1, (byte) 0x82 // 'т'
});
CharBuffer parsed = decoder.decode(bb);
System.out.println(parsed);
// this prints: Пр�вет (the corrupted byte is replaced with U+FFFD; a console that cannot display it may show ?)
}
}
You can encode and then decode it to/from UTF-8:
String label = "look into my eyes 〠.〠";
Charset charset = Charset.forName("UTF-8");
label = charset.decode(charset.encode(label)).toString();
System.out.println(label);
output:
look into my eyes ?.?
edit: I think this might only work on Java 6.
You can filter surrogate characters with this regex:
String str = "𠀀"; //U+20000, represented by 2 chars in java (UTF-16 surrogate pair)
str = str.replaceAll( "([\\ud800-\\udbff\\udc00-\\udfff])", "");
System.out.println(str.length()); //0
Once you convert the byte array to String on the java machine, you'll get (by default on most machines) UTF-16 encoded string. The proper solution to get rid of non UTF-8 characters is with the following code:
String[] values = {"\\xF0\\x9F\\x98\\x95", "\\xF0\\x9F\\x91\\x8C", "/*", "look into my eyes 〠.〠", "fkdjsf ksdjfslk", "\\xF0\\x80\\x80\\x80", "aa \\xF0\\x9F\\x98\\x95 aa", "Ok"};
for (int i = 0; i < values.length; i++) {
System.out.println(values[i].replaceAll(
//"[\\\\x00-\\\\x7F]|" + //single-byte sequences 0xxxxxxx - commented because of capital letters
"[\\\\xC0-\\\\xDF][\\\\x80-\\\\xBF]|" + //double-byte sequences 110xxxxx 10xxxxxx
"[\\\\xE0-\\\\xEF][\\\\x80-\\\\xBF]{2}|" + //triple-byte sequences 1110xxxx 10xxxxxx * 2
"[\\\\xF0-\\\\xF7][\\\\x80-\\\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
, ""));
}
or if you want to validate if some string contains non utf8 characters you would use Pattern.matches like:
String[] values = {"\\xF0\\x9F\\x98\\x95", "\\xF0\\x9F\\x91\\x8C", "/*", "look into my eyes 〠.〠", "fkdjsf ksdjfslk", "\\xF0\\x80\\x80\\x80", "aa \\xF0\\x9F\\x98\\x95 aa", "Ok"};
for (int i = 0; i < values.length; i++) {
System.out.println(Pattern.matches(
".*(" +
//"[\\\\x00-\\\\x7F]|" + //single-byte sequences 0xxxxxxx - commented because of capital letters
"[\\\\xC0-\\\\xDF][\\\\x80-\\\\xBF]|" + //double-byte sequences 110xxxxx 10xxxxxx
"[\\\\xE0-\\\\xEF][\\\\x80-\\\\xBF]{2}|" + //triple-byte sequences 1110xxxx 10xxxxxx * 2
"[\\\\xF0-\\\\xF7][\\\\x80-\\\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
+ ").*"
, values[i]));
}
For making a whole web app be UTF8 compatible read here:
How to get UTF-8 working in Java webapps
More on Byte Encodings and Strings.
You can check your pattern here.
The same in PHP here.
Maybe this will help someone, as it helped me.
public static String removeBadChars(String s) {
if (s == null) return null;
StringBuilder sb = new StringBuilder();
for(int i=0;i<s.length();i++){
if (Character.isSurrogate(s.charAt(i))) continue; // drop both halves of a surrogate pair, not only the high one
sb.append(s.charAt(i));
}
return sb.toString();
}
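The same idea can be written in terms of code points. The sketch below assumes that the "bad" characters are the ones outside the Basic Multilingual Plane, which need 4 bytes in UTF-8 and therefore do not fit MySQL's 3-byte utf8 charset; whether that is the actual cause of the error in the question is an assumption. StripSupplementary and stripSupplementary are hypothetical names.

```java
public class StripSupplementary {

    // Keeps only BMP code points; drops supplementary characters
    // (e.g. most emoji) and any unpaired surrogates.
    static String stripSupplementary(String s) {
        if (s == null) return null;
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         .filter(cp -> !Character.isSupplementaryCodePoint(cp))
         .filter(cp -> cp < Character.MIN_SURROGATE || cp > Character.MAX_SURROGATE)
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // "a", U+1F615 (written as a surrogate pair), "b"
        System.out.println(stripSupplementary("a\uD83D\uDE15b"));
    }
}
```

If the column can be switched to utf8mb4 instead, no characters have to be thrown away at all.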
In PHP - I approach this by only allowing printable data. This really helps in cleaning the data for DB.
It's pre-processing though and sometimes you don't have that luxury.
$str = preg_replace('/[[:^print:]]/', '', $str);
