URL decoding in Java for non-ASCII characters

I'm trying to decode URLs containing percent-encoded characters in Java.
I've tried using the java.net.URI class to do the job, but it doesn't always work correctly.
String test = "https://fr.wikipedia.org/wiki/Fondation_Alliance_fran%C3%A7aise";
URI uri = new URI(test);
System.out.println(uri.getPath());
For the test String "https://fr.wikipedia.org/wiki/Fondation_Alliance_fran%C3%A7aise", the result is correct "/wiki/Fondation_Alliance_française" (%C3%A7 is correctly replaced by ç).
But for some other test strings, like "http://sv.wikipedia.org/wiki/Anv%E4ndare:Lsjbot/Statistik#Drosophilidae", it gives an incorrect result "/wiki/Anv�ndare:Lsjbot/Statistik" (%E4 is replaced by � instead of ä).
I did some testing with getRawPath() and URLDecoder class.
System.out.println(URLDecoder.decode(uri.getRawPath(), "UTF8"));
System.out.println(URLDecoder.decode(uri.getRawPath(), "ISO-8859-1"));
System.out.println(URLDecoder.decode(uri.getRawPath(), "WINDOWS-1252"));
Depending on the test String, I get correct results with different encodings:
For %C3%A7, I get a correct result with "UTF-8" encoding as expected, and incorrect results with "ISO-8859-1" or "WINDOWS-1252" encoding
For %E4, it's the opposite.
For both test URLs, I get the correct page if I put them in the Chrome address bar.
How can I correctly decode the URL in all situations?
Thanks for any help
==== Answer ====
Thanks to the suggestions in McDowell's answer below, it now seems to work. Here's the code I now have:
private static void appendBytes(ByteArrayOutputStream buf, String data) throws UnsupportedEncodingException {
byte[] b = data.getBytes("UTF8");
buf.write(b, 0, b.length);
}
private static byte[] parseEncodedString(String segment) throws UnsupportedEncodingException {
ByteArrayOutputStream buf = new ByteArrayOutputStream(segment.length());
int last = 0;
int index = 0;
while (index < segment.length()) {
if (segment.charAt(index) == '%') {
appendBytes(buf, segment.substring(last, index));
if ((index + 2 < segment.length()) && // both hex digits must lie inside the string
("ABCDEFabcdef0123456789".indexOf(segment.charAt(index + 1)) >= 0) &&
("ABCDEFabcdef0123456789".indexOf(segment.charAt(index + 2)) >= 0)) {
buf.write((byte) Integer.parseInt(segment.substring(index + 1, index + 3), 16));
index += 3;
} else if ((index + 1 < segment.length()) && // the next character must exist
(segment.charAt(index + 1) == '%')) {
buf.write((byte) '%');
index += 2;
} else {
buf.write((byte) '%');
index++;
}
last = index;
} else {
index++;
}
}
appendBytes(buf, segment.substring(last));
return buf.toByteArray();
}
private static String parseEncodedString(String segment, Charset... encodings) {
if ((segment == null) || (segment.indexOf('%') < 0)) {
return segment;
}
try {
byte[] data = parseEncodedString(segment);
for (Charset encoding : encodings) {
try {
if (encoding != null) {
return encoding.newDecoder().
onMalformedInput(CodingErrorAction.REPORT).
decode(ByteBuffer.wrap(data)).toString();
}
} catch (CharacterCodingException e) {
// Incorrect encoding, try next one
}
}
} catch (UnsupportedEncodingException e) {
// Nothing to do
}
return segment;
}

Anv%E4ndare
As PopoFibo says this is not a valid UTF-8 encoded sequence.
You can do some tolerant best-guess decoding:
public static String parse(String segment, Charset... encodings) {
byte[] data = parse(segment);
for (Charset encoding : encodings) {
try {
return encoding.newDecoder()
.onMalformedInput(CodingErrorAction.REPORT)
.decode(ByteBuffer.wrap(data))
.toString();
} catch (CharacterCodingException notThisCharset_ignore) {}
}
return segment;
}
private static byte[] parse(String segment) {
ByteArrayOutputStream buf = new ByteArrayOutputStream();
Matcher matcher = Pattern.compile("%([A-Fa-f0-9][A-Fa-f0-9])")
.matcher(segment);
int last = 0;
while (matcher.find()) {
appendAscii(buf, segment.substring(last, matcher.start()));
byte hex = (byte) Integer.parseInt(matcher.group(1), 16);
buf.write(hex);
last = matcher.end();
}
appendAscii(buf, segment.substring(last));
return buf.toByteArray();
}
private static void appendAscii(ByteArrayOutputStream buf, String data) {
byte[] b = data.getBytes(StandardCharsets.US_ASCII);
buf.write(b, 0, b.length);
}
This code will successfully decode the given strings:
for (String test : Arrays.asList("Fondation_Alliance_fran%C3%A7aise",
"Anv%E4ndare")) {
String result = parse(test, StandardCharsets.UTF_8,
StandardCharsets.ISO_8859_1);
System.out.println(result);
}
Note that this isn't some foolproof system that allows you to ignore correct URL encoding. It works here because v%E4n - the byte sequence 76 E4 6E - is not a valid sequence as per the UTF-8 scheme and the decoder can detect this.
If you reverse the order of the encodings the first string can happily (but incorrectly) be decoded as ISO-8859-1.
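The effect of the charset order can be seen in isolation with a small sketch. The class name CharsetOrderDemo and the helper strictDecode are made up for this illustration; they are not part of the code above. The helper returns null when the bytes are not valid in the given charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class CharsetOrderDemo {

    // Decodes strictly: returns null instead of substituting U+FFFD
    // when the bytes are not valid in the given charset.
    static String strictDecode(byte[] data, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] cedilla = {(byte) 0xC3, (byte) 0xA7}; // bytes from %C3%A7
        byte[] umlaut = {'v', (byte) 0xE4, 'n'};     // bytes from v%E4n

        // Valid UTF-8, so UTF-8 first gives the intended character.
        System.out.println(strictDecode(cedilla, StandardCharsets.UTF_8));
        // ISO-8859-1 accepts any byte, so it "succeeds" with two wrong characters.
        System.out.println(strictDecode(cedilla, StandardCharsets.ISO_8859_1));
        // Not valid UTF-8: the strict decoder reports an error (null here).
        System.out.println(strictDecode(umlaut, StandardCharsets.UTF_8));
        // Falls through to ISO-8859-1, which yields the intended text.
        System.out.println(strictDecode(umlaut, StandardCharsets.ISO_8859_1));
    }
}
```

Because ISO-8859-1 maps every byte to a character, it never fails, so it only makes sense as the last fallback in the list.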
Note: HTTP doesn't care about percent-encoding and you can write a web server that accepts http://foo/%%%%% as a valid form. The URI spec mandates UTF-8 but this was done retroactively. It is really up to the server to describe what form its URIs should be and if you have to handle arbitrary URIs you need to be aware of this legacy.
I've written a bit more about URLs and Java here.

Related

Encode String to Base36

I am currently working on an algorithm to encode an arbitrary string, containing any possible character, as a Base36 string.
I have tried the following, but it doesn't work.
public static String encode(String str) {
return new BigInteger(str, 16).toString(36);
}
I guess it's because the string is not just a hex string. If I pass the string "Hello22334!", I get a NumberFormatException.
My approach would be to convert each character to a number. Convert the numbers to the hexadecimal representation, and then convert the hexstring to Base36.
Is my approach okay or is there a simpler or better way?

First you need to convert your string to a number, represented as a sequence of bytes. That is what an encoding is for; I highly recommend UTF-8.
Then you need to convert that number (the sequence of bytes) to a string in base 36.
byte[] bytes = string.getBytes(StandardCharsets.UTF_8);
String base36 = new BigInteger(1, bytes).toString(36);
To decode:
byte[] bytes = new BigInteger(base36, 36).toByteArray();
// Thanks to @Alok for pointing out the need to remove leading zeroes.
int zeroPrefixLength = zeroPrefixLength(bytes);
String string = new String(bytes, zeroPrefixLength, bytes.length-zeroPrefixLength, StandardCharsets.UTF_8);
private int zeroPrefixLength(final byte[] bytes) {
for (int i = 0; i < bytes.length; i++) {
if (bytes[i] != 0) {
return i;
}
}
return bytes.length;
}
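Put together, the round trip might look like the following sketch. The class name Base36Text and the method names are chosen for this example only. One limitation worth stating: any NUL characters at the very start of the input are lost, because leading zero bytes do not change the numeric value.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;

final class Base36Text {

    static String encode(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        // Signum 1 keeps the number non-negative even when the first byte
        // has its high bit set (as with non-ASCII text).
        return new BigInteger(1, bytes).toString(36);
    }

    static String decode(String base36) {
        byte[] bytes = new BigInteger(base36, 36).toByteArray();
        // toByteArray() may prepend a zero sign byte; skip leading zeros.
        int start = 0;
        while (start < bytes.length && bytes[start] == 0) {
            start++;
        }
        return new String(bytes, start, bytes.length - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = encode("Hello22334!");
        System.out.println(encoded);         // digits 0-9 and letters a-z only
        System.out.println(decode(encoded)); // the original text again
    }
}
```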

From Base10 to Base36
public static String toBase36(String str) {
try {
return Long.toString(Long.valueOf(str), 36).toUpperCase();
} catch (NumberFormatException | NullPointerException ex) {
ex.printStackTrace();
}
return null;
}
From Base36String to Base10
public static String fromBase36(String b36) {
try {
BigInteger base = new BigInteger( b36, 36);
return base.toString(10);
}catch (Exception e){
e.printStackTrace();
}
return null;
}

what this android function returns

I am trying to decode an APK file and need to find out what the m21862a function returns.
Specifically, I need the hash value that is sent in the request to https://api.SOMESITE.net/external/auth. How is it generated?
Here is the relevant part of the code:
a = HttpTools.m22199a("https://api.somesite.net/external/hello", false);
String str = BuildConfig.FLAVOR;
str = BuildConfig.FLAVOR;
str = BuildConfig.FLAVOR;
try {
str = ((String) new JSONObject(a).get("token")) + ZaycevApp.f15130a.m21564W();
Logger.m22256a("ZAuth", "token - " + str);
str = m21862a(str);
a = new JSONObject(HttpTools.m22199a(String.format("https://api.SOMESITE.net/external/auth?code=%s&hash=%s", new Object[]{a, str}), false)).getString("token");
if (!ae.m21746b((CharSequence) a)) {
ZaycevApp.f15130a.m21595f(a);
}
}
I need to know what the m21862a function does. Is there a PHP equivalent of m21862a? Here is the m21862a function:
private String m21862a(String str) {
try {
MessageDigest instance = MessageDigest.getInstance("MD5");
instance.update(str.getBytes());
byte[] digest = instance.digest();
StringBuffer stringBuffer = new StringBuffer();
for (byte b : digest) {
String toHexString = Integer.toHexString(b & RadialCountdown.PROGRESS_ALPHA);
while (toHexString.length() < 2) {
toHexString = "0" + toHexString;
}
stringBuffer.append(toHexString);
}
return stringBuffer.toString();
} catch (Exception e) {
Logger.m22252a((Object) this, e);
return BuildConfig.FLAVOR;
}
}

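The function computes the MD5 digest of the input, masks each byte of the digest with RadialCountdown.PROGRESS_ALPHA using a bitwise AND (the constant name is probably a decompiler artifact standing in for 0xFF), converts the result to hex (left-padded with 0 to two characters), and appends that to the output.
If the mask is indeed 0xFF, the result is the usual 32-character lowercase hex MD5 string, which is what PHP's md5() returns.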
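For comparison, here is a cleaned-up Java sketch of what the decompiled method appears to do. It rests on two assumptions: that RadialCountdown.PROGRESS_ALPHA equals 0xFF, and that the platform default charset used by str.getBytes() is UTF-8 (as it is on Android). The class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Hex {

    static String md5Hex(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // b & 0xFF maps the signed byte to 0..255 (assumed value of the
                // decompiled constant); %02x pads to two lowercase hex digits.
                hex.append(String.format("%02x", b & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5Hex("hello")); // 5d41402abc4b2a76b9719d911017c592
    }
}
```

Under those assumptions the value sent as hash would be the plain MD5 hex string of the token concatenated with whatever m21564W() returns.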

String invalid length after writing to StringBuilder and ByteArrayOutputStream from FileInputStream, issue with "null characters"

The goal is to read a file name from a file. The name field is at most 100 bytes long; the actual name is padded to that length with null bytes.
Here is what it looks like in GNU nano
Where .PKGINFO is the valid file name, and the ^@ represent "null bytes".
I tried here with StringBuilder
package falken;
import java.io.*;
public class Testing {
public Testing() {
try {
FileInputStream tarIn = new FileInputStream("/home/gala/falken_test/test.tar");
final int byteOffset = 0;
final int readBytesLength = 100;
StringBuilder stringBuilder = new StringBuilder();
for ( int bytesRead = 1, n, total = 0 ; (n = tarIn.read()) != -1 && total < readBytesLength ; bytesRead++ ) {
if (bytesRead > byteOffset) {
stringBuilder.append((char) n);
total++;
}
}
String out = stringBuilder.toString();
System.out.println(">" + out + "<");
System.out.println(out.length());
} catch (Exception e) {
/*
This is a pokemon catch not used in final code
*/
e.printStackTrace();
}
}
}
But it gives an invalid String length of 100, while the output in IntelliJ shows the correct string between the >< signs.
>.PKGINFO<
100
Process finished with exit code 0
But when I paste it here on StackOverflow I get the correct string followed by unknown "null-characters", and its size is actually 100.
>.PKGINFO <
What regex can I use to get rid of the characters after the valid file name?
The file I am reading is ASCII encoded.
I also tried ByteArrayOutputStream, with the same result
package falken;
import java.io.*;
import java.nio.charset.StandardCharsets;
public class Testing {
public Testing() {
try {
FileInputStream tarIn = new FileInputStream("/home/gala/falken_test/test.tar");
final int byteOffset = 0;
final int readBytesLength = 100;
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
for ( int bytesRead = 1, n, total = 0 ; (n = tarIn.read()) != -1 && total < readBytesLength ; bytesRead++ ) {
if (bytesRead > byteOffset) {
byteArrayOutputStream.write(n);
total++;
}
}
String out = byteArrayOutputStream.toString();
System.out.println(">" + out + "<");
System.out.println(out.length());
} catch (Exception e) {
/*
This is a pokemon catch not used in final code
*/
e.printStackTrace();
}
}
}
What could be the issue here?

Well, it seems to be reading null characters as actual characters, spaces in fact. If it's possible, see if you can read the filename, then, cut out the null characters. In your case, you need a data.trim(); and a data2 = data.substring(0,(data.length()-1))

You need to stop appending to the string buffer once you read the first null character from the file.
You seem to want to read a tar archive, have a look at the following code which should get you started.
byte[] buffer = new byte[500]; // POSIX tar header is 500 bytes
FileInputStream is = new FileInputStream("test.tar");
int read = is.read(buffer);
// check number of bytes read; don't bother if not at least the whole
// header has been read
if (read == buffer.length) {
// search for first null byte; this is the end of the name
int offset = 0;
while (offset < 100 && buffer[offset] != 0) {
offset++;
}
// create string from byte buffer using ASCII as the encoding (other
// encodings are not supported by tar)
String name = new String(buffer, 0, offset,
StandardCharsets.US_ASCII);
System.out.println("'" + name + "'");
}
is.close();
You really shouldn't use trim() on the filename, this will break whenever you encounter a filename with leading or trailing blanks.
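The same idea can be tried without a file, on an in-memory buffer. This is only a sketch: the class name TarNameField and the helper readName are invented here, and the 100-byte field length is the one given in the question.

```java
import java.nio.charset.StandardCharsets;

final class TarNameField {

    static final int NAME_FIELD_LENGTH = 100;

    // Returns the bytes before the first NUL (or the whole field) as a string.
    static String readName(byte[] header) {
        int limit = Math.min(NAME_FIELD_LENGTH, header.length);
        int end = 0;
        while (end < limit && header[end] != 0) {
            end++;
        }
        return new String(header, 0, end, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] header = new byte[NAME_FIELD_LENGTH]; // Java zero-fills new arrays
        byte[] name = ".PKGINFO".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(name, 0, header, 0, name.length);

        // Decoding the whole field keeps the NUL padding as U+0000 characters.
        String padded = new String(header, StandardCharsets.US_ASCII);
        System.out.println(padded.length());           // 100
        System.out.println(readName(header).length()); // 8
    }
}
```

This also explains the length of 100 in the question: the padding bytes were converted to characters that the console simply does not display.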

Java java.io.IOException: Not in GZIP format

I searched for an example of how to compress a string in Java.
I have a function to compress then uncompress. The compress seems to work fine:
public static String encStage1(String str)
{
String format1 = "ISO-8859-1";
String format2 = "UTF-8";
if (str == null || str.length() == 0)
{
return str;
}
System.out.println("String length : " + str.length());
ByteArrayOutputStream out = new ByteArrayOutputStream();
String outStr = null;
try
{
GZIPOutputStream gzip = new GZIPOutputStream(out);
gzip.write(str.getBytes());
gzip.close();
outStr = out.toString(format2);
System.out.println("Output String lenght : " + outStr.length());
} catch (Exception e)
{
e.printStackTrace();
}
return outStr;
}
But the reverse is complaining about the string not being in GZIP format, even when I pass the return from encStage1 straight back into the decStage3:
public static String decStage3(String str)
{
if (str == null || str.length() == 0)
{
return str;
}
System.out.println("Input String length : " + str.length());
String outStr = "";
try
{
String format1 = "ISO-8859-1";
String format2 = "UTF-8";
GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(str.getBytes(format2)));
BufferedReader bf = new BufferedReader(new InputStreamReader(gis, format2));
String line;
while ((line = bf.readLine()) != null)
{
outStr += line;
}
System.out.println("Output String lenght : " + outStr.length());
} catch (Exception e)
{
e.printStackTrace();
}
return outStr;
}
I get this error when I call with a string return from encStage1:
public String encIDData(String idData)
{
String tst = "A simple test string";
System.out.println("Enc 0: " + tst);
String stg1 = encStage1(tst);
System.out.println("Enc 1: " + toHex(stg1));
String dec1 = decStage3(stg1);
System.out.println("unzip: " + toHex(dec1));
}
Output/Error:
Enc 0: A simple test string
String length : 20
Output String lenght : 40
Enc 1: 1fefbfbd0800000000000000735428efbfbdefbfbd2defbfbd495528492d2e51282e29efbfbdefbfbd4b07005aefbfbd21efbfbd14000000
Input String length : 40
java.io.IOException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:137)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)

A small error is:
gzip.write(str.getBytes());
takes the default platform encoding, which on Windows will never be ISO-8859-1. Better:
gzip.write(str.getBytes(format1));
You could consider using "Cp1252" (Windows Latin-1, for some European languages) instead of "ISO-8859-1" (Latin-1). Cp1252 additionally covers typographic ("curly") quotes and similar characters.
The major error is converting the compressed bytes to a String. Java separates binary data (byte[], InputStream, OutputStream) from text (String, char, Reader, Writer) which internally is always kept in Unicode. A byte sequence does not need to be valid UTF-8. You might get away by converting the bytes as a single byte encoding (ISO-8859-1 for instance).
The best way would be
gzip.write(str.getBytes(StandardCharsets.UTF_8));
So you have full Unicode, every script may be combined.
And uncompressing to a ByteArrayOutputStream and new String(baos.toByteArray(), StandardCharsets.UTF_8).
Using BufferedReader on an InputStreamReader with UTF-8 is okay too, but a readLine throws away the newline characters
outStr += line + "\r\n"; // Or so.
Clean answer:
public static byte[] encStage1(String str) throws IOException
{
try (ByteArrayOutputStream out = new ByteArrayOutputStream())
{
try (GZIPOutputStream gzip = new GZIPOutputStream(out))
{
gzip.write(str.getBytes(StandardCharsets.UTF_8));
}
return out.toByteArray();
//return out.toString(StandardCharsets.ISO_8859_1);
// Some single byte encoding
}
}
public static String decStage3(byte[] str) throws IOException
{
ByteArrayOutputStream baos = new ByteArrayOutputStream();
try (GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(str)))
{
int b;
while ((b = gis.read()) != -1) {
baos.write((byte) b);
}
}
return new String(baos.toByteArray(), StandardCharsets.UTF_8);
}

Using toString/getBytes to encode and decode binary data is the wrong approach. Use something like Base64 encoding for this purpose (java.util.Base64 in JDK 1.8).
As proof, try this simple test, which fails because the byte 0xff is not valid UTF-8 and does not survive the round trip:
import org.testng.annotations.Test;
import java.io.ByteArrayOutputStream;
import static org.testng.Assert.assertEquals;
public class SimpleTest {
@Test
public void test() throws Exception {
final String CS = "utf-8";
byte[] b0 = {(byte) 0xff};
ByteArrayOutputStream out = new ByteArrayOutputStream();
out.write(b0);
out.close();
byte[] b1 = out.toString(CS).getBytes(CS);
assertEquals(b0, b1);
}
}
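The Base64 suggestion could be sketched as follows, for cases where the compressed data really has to travel as a String. The class and method names are invented for this example; java.util.Base64 requires Java 8 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipBase64 {

    static String compressToBase64(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Base64 turns arbitrary bytes into plain ASCII text without loss.
        return Base64.getEncoder().encodeToString(out.toByteArray());
    }

    static String decompressFromBase64(String base64) {
        byte[] compressed = Base64.getDecoder().decode(base64);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip =
                     new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String packed = compressToBase64("A simple test string");
        System.out.println(packed);
        System.out.println(decompressFromBase64(packed));
    }
}
```

Base64 makes the string about a third longer than the compressed bytes, so for short inputs like the test string the result is larger than the original text.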

How to convert binary text into useable file

So I use the following methods to convert any kind of file into a binary text file (by which I mean only 1's and 0's in human-readable form).
The file is converted to a byte array by 'convertFileToByteArray()', then written to a .txt file by 'convertByteArrayToBitTextFile()'.
public static byte[] convertFileToByteArray(String path) throws IOException
{
File file = new File(path);
byte[] fileData;
fileData = new byte[(int)file.length()];
FileInputStream in = new FileInputStream(file);
in.read(fileData);
in.close();
return fileData;
}
public static boolean convertByteArrayToBitTextFile(String path, byte[] bytes)
{
String content = convertByteArrayToBitString(bytes);
try
{
PrintWriter out = new PrintWriter(path);
out.println(content);
out.close();
return true;
}
catch (FileNotFoundException e)
{
return false;
}
}
public static String convertByteArrayToBitString(byte[] bytes)
{
String content = "";
for (int i = 0; i < bytes.length; i++)
{
content += String.format("%8s", Integer.toBinaryString(bytes[i] & 0xFF)).replace(' ', '0');
}
return content;
}
Edit: Additional Code:
public static byte[] convertFileToByteArray(String path) throws IOException
{
File file = new File(path);
byte[] fileData;
fileData = new byte[(int)file.length()];
FileInputStream in = new FileInputStream(file);
in.read(fileData);
in.close();
return fileData;
}
public static boolean convertByteArrayToBitTextFile(String path, byte[] bytes)
{
try
{
PrintWriter out = new PrintWriter(path);
for (int i = 0; i < bytes.length; i++)
{
out.print(String.format("%8s", Integer.toBinaryString(bytes[i] & 0xFF)).replace(' ', '0'));
}
out.close();
return true;
}
catch (FileNotFoundException e)
{
return false;
}
}
public static boolean convertByteArrayToByteTextFile(String path, byte[] bytes)
{
try
{
PrintWriter out = new PrintWriter(path);
for(int i = 0; i < bytes.length; i++)
{
out.print(bytes[i]);
}
out.close();
return true;
}
catch (FileNotFoundException e)
{
return false;
}
}
public static boolean convertByteArrayToRegularFile(String path, byte[] bytes)
{
try
{
PrintWriter out = new PrintWriter(path);
for(int i = 0; i < bytes.length; i++)
{
out.write(bytes[i]);
}
out.close();
return true;
}
catch (FileNotFoundException e)
{
return false;
}
}
public static boolean convertBitFileToByteTextFile(String path)
{
try
{
byte[] b = convertFileToByteArray(path);
convertByteArrayToByteTextFile(path, b);
return true;
}
catch (IOException e)
{
return false;
}
}
I do this to try methods of compression on a very fundamental level, so please let's not discuss why use human-readable form.
Now this works quite well so far, however I got two problems.
1)
It takes foreeeever (>20 Minutes for 230KB into binary text). Is this just a by-product of the relatively complicated conversion or are there other methods to do this faster?
2) and main problem:
I have no idea how to convert the files back to what they used to be. Renaming from .txt to .exe does not work (not too surprising, as the resulting file is many times larger than the original).
Is this still possible, or did I lose information about what the file is supposed to represent by converting it to a human-readable text file?
If so, do you know any alternative that prevents this?
Any help is appreciated.

The thing that'll cost you most time is the construction of an ever increasing String. A better approach would be to write the data as soon as you have it.
The other problem is very easy. You know that every sequence of eight characters ('0' or '1') was made from a byte. Hence, you know the values of each character in an 8-character block:
01001010
       ^----- 0*1
      ^------ 1*2
     ^------- 0*4
    ^-------- 1*8
   ^--------- 0*16
  ^---------- 0*32
 ^----------- 1*64
^------------ 0*128
              -----
      64+8+2 = 74
You only need to add the values where an '1' is present.
You can do it in Java like this, without even knowing the individual bit values:
String sbyte = "01001010";
int bytevalue = 0;
for (int i = 0; i < 8; i++) {
bytevalue *= 2; // shifts the bit pattern to the left 1 position
if (sbyte.charAt(i) == '1') bytevalue += 1;
}

Use StringBuilder to avoid generating enormous numbers of unused String instances.
Better yet, write directly to the PrintWriter instead of building it in-memory at all.
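Loop through every 8-character subsequence and call Integer.parseInt(text, 2), casting the result to byte, to parse it back to a byte. (Byte.parseByte(text, 2) throws a NumberFormatException for values above 127, such as "11111111".)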
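Both directions could be sketched like this. The class name BitText and its method names are invented for the example; it works on in-memory data and leaves file reading and writing to the caller.

```java
final class BitText {

    // Eight '0'/'1' characters per byte, most significant bit first.
    static String toBitString(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 8);
        for (byte b : bytes) {
            for (int bit = 7; bit >= 0; bit--) {
                sb.append(((b >> bit) & 1) == 1 ? '1' : '0');
            }
        }
        return sb.toString();
    }

    static byte[] fromBitString(String bits) {
        if (bits.length() % 8 != 0) {
            throw new IllegalArgumentException("length must be a multiple of 8");
        }
        byte[] bytes = new byte[bits.length() / 8];
        for (int i = 0; i < bytes.length; i++) {
            // Integer.parseInt accepts 0..255; the cast folds 128..255
            // into Java's signed byte range.
            bytes[i] = (byte) Integer.parseInt(bits.substring(i * 8, i * 8 + 8), 2);
        }
        return bytes;
    }

    public static void main(String[] args) {
        byte[] original = {0x4A, (byte) 0xFF};
        String bits = toBitString(original);
        System.out.println(bits); // 0100101011111111
        byte[] restored = fromBitString(bits);
        System.out.println(restored[0] + " " + restored[1]); // 74 -1
    }
}
```

When reading the text file back, strip the trailing line break that println added before calling fromBitString, and write the restored bytes with a FileOutputStream rather than a PrintWriter so that no character encoding is applied.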
