Encode URL with US-ASCII character set - java

I refer to the following web site:
http://coderstoolbox.net/string/#!encoding=xml&action=encode&charset=us_ascii
Choosing "URL", "Encode", and "US-ASCII", the input is converted to the desired output.
How do I produce the same output with Java code?
Thanks in advance.

I used this and it seems to work fine.
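import java.util.regex.Pattern;
import java.util.stream.Collectors;

public static String encode(String input) {
    // Letters and digits are copied unchanged; every other character is escaped.
    Pattern doNotReplace = Pattern.compile("[a-zA-Z0-9]");
    return input.chars().mapToObj(c -> {
        String s = String.valueOf((char) c);
        if (doNotReplace.matcher(s).matches()) {
            return s;
        }
        // %02X gives upper-case hex padded to two digits. Only the escape is
        // upper-cased, so letters from the input keep their original case.
        String hex = String.format("%02X", c);
        return "%" + (c < 256 ? hex : "U" + hex);
    }).collect(Collectors.joining(""));
}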
PS: I use 256 as the cutoff so that the U prefix is added only to characters above the single-byte range (code points 0 to 255). Characters below 256, which include all of standard ASCII (0 to 127), need no prefix.
Alternate option:
There is a built-in Java class, java.net.URLEncoder, that does URL encoding, but it works a little differently. For example, it replaces the space character with + rather than %20, and a few other characters are treated differently too. See if it helps:
String encoded = URLEncoder.encode(input, "US-ASCII");
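If the %20 form is needed, one possible adjustment is to post-process the URLEncoder output. This is only a sketch, and not necessarily what the site above does; the class and method names are made up for illustration. A literal + in the input is already encoded as %2B, so any + left in the output can only stand for a space:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class AsciiUrlEncode {
    // Hypothetical helper: URLEncoder output with '+' (an encoded space) rewritten as %20.
    static String encodeAscii(String input) {
        try {
            return URLEncoder.encode(input, "US-ASCII").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // US-ASCII is one of the charsets every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeAscii("a b&c+d")); // a%20b%26c%2Bd
    }
}
```

Characters outside US-ASCII cannot be represented in that charset, so check how they come out before relying on this for non-ASCII input.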
Hope this helps!

You can use ESAPI.encoder().encodeForURL(linkString).
For more details on the encoding that encodeForURL applies, see https://en.wikipedia.org/wiki/Percent-encoding
Please comment if that does not satisfy your requirement or if you face any other issue.
Thanks

Related

Need help on validating domains on the basis of ASCII and BASE64 encoded UTF-8 string

I am doing some tests related to LDAP in Java using JDK 1.7.
I have a configuration file from which I read the value of a property such as "dc=domain1,dc=com", which I later pass to LDAP for search operations.
I want to validate the value coming from the properties file; it should contain only ASCII or Base64-encoded UTF-8 strings.
I have written the following regex to validate the string, but it seems to have some issues.
Here is my sample code:
public class ValidateDN {
    public static void main(String[] args) {
        String istr = "dc=domain1,dc=com";
        String myregex = "^dc=[a-zA-Z0-9\\-\\.]*[,dc=[a-zA-Z0-9\\-\\.]*]*";
        if (istr.matches(myregex)) {
            System.out.println("String matches");
        } else {
            System.out.println("String not matching");
        }
    }
}
It should pass all strings like:
dc=com
dc=domain1,dc=com
dc=domain2,dc=domain1,dc=com
It should fail for the values:
dc=domain1,dc=com,d
dc=domain1,dc=com,dc
(incomplete key or invalid syntax)
Can anyone suggest what should be done here to validate this properly?
You have a major error in your regex: you're using square brackets instead of parentheses. Square brackets define a character class, which matches any single one of the listed characters, not a sequence of characters.
Further, your regex can be simplified to:
(dc=[\w-]+,?)*
As LDAP DNs may contain spaces, you may want to consider using:
(\s*dc\s*=\s*[\w-]+\s*,?)*
Remember to escape the backslashes as necessary when inserting into your code.
I believe the problem you are having is due to the structure of your regex.
Your regex:
"^dc=[a-zA-Z0-9\\-\\.]*[,dc=[a-zA-Z0-9\\-\\.]*]*"
has a flaw in the second character class. Specifically:
[,dc=[a-zA-Z0-9\\-\\.]*]*
It should be changed to (,dc=[a-zA-Z0-9\\-\\.]*)* so that the literal ",dc=" is matched as a sequence, followed by the inner character class.
The complete regex that should work is:
^dc=[a-zA-Z0-9\\-\\.]*(,dc=[a-zA-Z0-9\\-\\.]*)*
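To check the grouped form against the sample values from the question, a small test harness could look like the sketch below. The class and method names are hypothetical, and it uses + instead of * on the value part so that an empty value such as "dc=" is rejected. That goes beyond what the question states, so treat it as an assumption about the requirement:

```java
import java.util.regex.Pattern;

class ValidateDnSketch {
    // Grouped with parentheses, as suggested above. The '-' at the end of the
    // character class is literal, and '.' needs no escape inside a class.
    private static final Pattern DN =
            Pattern.compile("dc=[a-zA-Z0-9.-]+(,dc=[a-zA-Z0-9.-]+)*");

    static boolean isValidDn(String value) {
        // matches() requires the whole string to match, so no ^ or $ is needed.
        return DN.matcher(value).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "dc=com",
            "dc=domain1,dc=com",
            "dc=domain2,dc=domain1,dc=com",
            "dc=domain1,dc=com,d",
            "dc=domain1,dc=com,dc"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValidDn(s));
        }
    }
}
```

The first three samples are accepted and the last two are rejected, which is the behaviour the question asks for.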

Regular expression String for URL in JAVA

I would like to validate URLs in Java with a regular expression. I found this comment and tried to use it in my code as follows:
private static final String PATTERN_URL = "/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+#)?[A-Za-z0-9.-]+(:[0-9]+)?|(?:www.|[-;:&=\+\$,\w]+#)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%#.\w_]*)#?(?:[\w]*))?)/";
.....
if (!urlString.matches(PATTERN_URL)) {
System.err.println("Invalid URL");
return false;
}
But I got a compile-time error when declaring my PATTERN_URL variable. I have no idea how to format it, and I am worried that it will become an invalid regex if I modify it. Can anyone fix it for me without changing the original meaning? Thanks for your help.
Your regex looks fine. You just need to format it for a Java string by escaping all the backslashes:
\ --> \\
Resulting in this:
"/((([A-Za-z]{3,9}:(?:\\/\\/)?)(?:[-;:&=\\+\\$,\\w]+#)?[A-Za-z0-9.-]+(:[0-9]+)?|(?:www.|[-;:&=\\+\\$,\\w]+#)[A-Za-z0-9.-]+)((?:\\/[\\+~%\\/.\\w-_]*)?\\??(?:[-\\+=&;%#.\\w_]*)#?(?:[\\w]*))?)/"
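When the Java compiler reads this string literal, each \\ becomes a single \, so java.util.regex.Pattern receives exactly the regex you want. Note that the leading and trailing / are regex-literal delimiters (as in JavaScript); Java treats them as literal characters, so remove them if the pattern is meant to match a whole URL with matches(). You can verify what the pattern contains by printing it: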
System.out.println(Pattern.compile(PATTERN_URL));
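The same escaping rule can be seen on a much smaller pattern. This is just an illustrative sketch with a made-up class name and a made-up regex, not part of the original URL pattern:

```java
class EscapeDemo {
    // In source code every regex backslash is written twice.
    static final String REGEX = "\\d+\\.\\d+";

    public static void main(String[] args) {
        // The string the regex engine sees has single backslashes.
        System.out.println(REGEX);                 // \d+\.\d+
        System.out.println("3.14".matches(REGEX)); // true
        System.out.println("3x14".matches(REGEX)); // false
    }
}
```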

Japanese Character Encoding in Base64

I have been asked to fix a bug in our email processing software.
When a message whose subject is encoded in RFC 2047 like this:
=?ISO-2022-JP?B?GyRCR1s/LiVGJTklSC1qRnxLXDhsGyhC?=
is received, it is incorrectly decoded - one of the Japanese characters is not rendered properly. It is rendered like this: 配信テスト?日本語 when it should be 配信テスト㈱日本語
(I do not understand Japanese.) Clearly one of the characters, the one that looks like it is in brackets, has not been rendered.
The decoding is carried out by javax.mail.internet.MimeUtility.decodeText()
If I try it with an on-line decoder (the only one I've found is here) it seems to work OK, so I was suspecting a bug in MimeUtility.
So I tried some experiments, in the form of this little program:
import java.nio.charset.Charset;
import org.apache.commons.codec.binary.Base64;

public class Encoding {
    private static final Charset CHARSET = Charset.forName("ISO-2022-JP");

    public static void main(String[] args) {
        String control = "繋がって";
        String subject = "配信テスト㈱日本語";
        // Round-trip each string through Base64 using ISO-2022-JP.
        String controlBase64 = japaneseToBase64(control);
        System.out.println(controlBase64);
        System.out.println(base64ToJapanese(controlBase64));
        String subjectBase64 = japaneseToBase64(subject);
        System.out.println(subjectBase64);
        System.out.println(base64ToJapanese(subjectBase64));
    }

    private static String japaneseToBase64(String in) {
        return Base64.encodeBase64String(in.getBytes(CHARSET));
    }

    private static String base64ToJapanese(String in) {
        return new String(Base64.decodeBase64(in), CHARSET);
    }
}
(The Base64 and Hex classes are in org.apache.commons.codec.binary)
When I run it, here's the output:
GyRCN1IkLCRDJEYbKEI=
繋がって
GyRCR1s/LiVGJTklSCEpRnxLXDhsGyhC
配信テスト?日本語
The first, shorter Japanese string is a control, and this returns the same as the input, having been converted into Base64 and back again, using Charset ISO-2022-JP. All OK there.
The second Japanese string is the one with the dodgy character. As you see, it returns with a ? instead of the character. The Base64 encoding output is also different from the original subject encoding.
Sorry if this is long, I wanted to be thorough. What's going on, and how can I decode this character correctly?
The bug is not in your software, but the subject string itself is incorrectly encoded. Other software may be able to decode the text by making further assumptions about the content, just as it is often assumed that characters in the range 0x80-0x9f are Cp1252-encoded, although ISO-8859-1 or ISO-8859-15 is specified.
ISO-2022-JP is a multi-charset encoding that uses escape sequences to switch between the character sets actually in use. Your encoded string starts with ESC $ B, indicating that the character set JIS X 0208-1983 is used. The offending character is encoded as 0x2d6a. That code point is not defined in that character set; it was added later in JIS X 0213:2000, a newer version of the JIS X character set specifications.
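To see those bytes for yourself, you can Base64-decode the subject from the question and dump it as hex. The sketch below uses java.util.Base64 from the JDK (Java 8 or later) rather than commons-codec, and the class and method names are only illustrative:

```java
import java.util.Base64;

class Iso2022JpDump {
    // Formats bytes as space-separated, upper-case, two-digit hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Base64 payload of the RFC 2047 subject quoted in the question.
        byte[] raw = Base64.getDecoder().decode("GyRCR1s/LiVGJTklSC1qRnxLXDhsGyhC");
        System.out.println(toHex(raw));
    }
}
```

The dump starts with 1B 24 42 (ESC $ B), contains the pair 2D 6A for the problem character, and ends with 1B 28 42 (ESC ( B), which switches back to ASCII.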
Try using "MS932" or "Shift-JIS" as your encoding, that is:
private static final Charset CHARSET = Charset.forName("MS932");
Japanese uses several scripts, such as kanji and katakana, and some encodings, like Cp132, do not support all Japanese characters. The problem you face is caused by the "ISO-2022-JP" encoding used in your code.
ISO-2022-JP uses pairs of bytes, called ku and ten, that index into a 94×94 table of characters. The pair that fails has ku 12 and ten 73, which is not listed in the table of valid characters I have (based on JIS X 0208). All of ku=12 seems to be unused.
Wikipedia doesn't list any updates to JIS X 0208, either. Perhaps the sender is using some sort of vendor-defined extension?
Despite the fact that ISO-2022-JP is a variable width encoding, it seems as though Java doesn't support the section of the character set that it lies in (possibly as a result of the missing escape sequences in ISO-2022-JP-2 that are present in ISO-2022-JP-3 and ISO-2022-JP-2004 which aren't supported). UTF-8, UTF-16 and UTF-32 do however support all of the characters.
UTF-32:
AAB+SwAAMEwAADBjAAAwZg==
繋がって
AACRTQAAT+EAADDGAAAwuQAAMMgAADIxAABl5QAAZywAAIqe
配信テスト㈱日本語
As an extra tidbit, regardless of whether UTF-32 was used, when the strings were printed as-is they retained their natural encoding and appeared normally.

How to remove Ascii code from the JTextArea?

My Java program gets some weather information from an API, but the text contains weird letters that look like ASCII codes.
Here is an example:
Min temp: 0&#xB0;C (32&#xB0;F) which should be: Min temp: 0C (32F) (I think).
How can I change it?
One solution is to strip the symbol before displaying the text:
String withoutDegSymbol = str.replaceAll("°", "");
where str contains your temperature data.
Try this:
String s = "0°C (32°F)".replaceAll("[\u0080-\u00FF]", "");
or, if you have HTML character references in your text, use
String s = "Min temp: 0&#xB0;C (32&#xB0;F)".replaceAll("&#x.+?;", "");
If you have non-ASCII characters in your source code, did your IDE ask which format to use when you saved it? In Eclipse, for example, if the source contains a non-ASCII character, it prompts you to save the file as UTF-8. Hope this helps.
You need to know that
&#xB0;
is the HTML entity (hex) encoding of the Unicode character 'DEGREE SIGN'.
So if this is the only special character in your text (the degree sign), you can convert it manually like this:
String s = "Min temp: 0&#xB0;C (32&#xB0;F)".replaceAll("&#xB0;", "°");
System.out.println(s);
If there are other special characters as well, you can use the StringEscapeUtils class or the jsoup library to convert them to Java strings.
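If you would rather not add a library, hexadecimal character references can also be decoded with plain java.util.regex. The following is only a sketch under that assumption: the class and method names are made up, and it handles only the &#x...; form, not decimal references or named entities such as &deg;:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HexEntityDecoder {
    private static final Pattern HEX_REF = Pattern.compile("&#x([0-9A-Fa-f]{1,6});");

    // Replaces each &#xHHHH; reference with the character it stands for.
    static String decodeHexRefs(String text) {
        Matcher m = HEX_REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1), 16);
            // Leave the reference untouched if it is not a valid code point.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeHexRefs("Min temp: 0&#xB0;C (32&#xB0;F)"));
    }
}
```

The result can then be passed to the JTextArea with setText as usual.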

How to find out if string has already been URL encoded?

How could I check if string has already been encoded?
For example, if I encode TEST==, I get TEST%3D%3D. If I encode that string again, I get TEST%253D%253D, so I need to know beforehand whether it is already encoded.
I have encoded parameters saved, and I need to search for them. I don't know whether the input parameters will be encoded or not, so I have to work out whether to encode or decode them before searching.
Decode, compare to original. If it does differ, original is encoded. If it doesn't differ, original isn't encoded. But still it says nothing about whether the newly decoded version isn't still encoded. A good task for recursion.
I hope one can't write a quine in urlencode, or this algorithm would get stuck.
Exception: when a string contains a "+" character, the URL decoder replaces it with a space even if the string is not URL encoded.
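In Java, the decode-and-compare idea could be sketched roughly as follows, using a bounded loop instead of recursion. The class name, method name, and the limit of 10 rounds are arbitrary choices for illustration. The "+" caveat above applies here too, and URLDecoder throws IllegalArgumentException for a malformed escape such as a lone "%":

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class EncodeOnce {
    // Decodes until the value stops changing, then encodes exactly once.
    static String encodeOnce(String value) {
        try {
            String current = value;
            for (int i = 0; i < 10; i++) { // arbitrary upper bound on decoding rounds
                String decoded = URLDecoder.decode(current, "UTF-8");
                if (decoded.equals(current)) {
                    break; // nothing left to decode
                }
                current = decoded;
            }
            return URLEncoder.encode(current, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeOnce("TEST=="));         // TEST%3D%3D
        System.out.println(encodeOnce("TEST%3D%3D"));     // TEST%3D%3D
        System.out.println(encodeOnce("TEST%253D%253D")); // TEST%3D%3D
    }
}
```

This cannot tell a doubly encoded value from a value whose real content happens to look encoded, so it only helps when you know the raw data never contains such sequences.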
Use regexp to check if your string contains illegal characters (i.e. characters which cannot be found in URL-encoded string, like whitespace).
Try decoding the url. If the resulting string is shorter than the original then the original URL was already encoded, else you can safely encode it (either it is not encoded, or even post encoding the url stays as is, so encoding again will not result in a wrong url). Below is sample pseudo (inspired by ruby) code:
# Returns encoded URL for any given URL after determining whether it is already encoded or not
def escape(url)
  unescaped_url = URI.unescape(url)
  if (unescaped_url.length < url.length)
    return url
  else
    return URI.escape(url)
  end
end
You can't know for sure, unless your strings conform to a certain pattern or you keep track of your strings. As you noted yourself, a string that is already encoded can be encoded again, so you can't be 100% sure by looking at the string itself.
Check your URL for suspicious characters[1].
List of candidates:
WHITE_SPACE ,", < , > , { , } , | , \ , ^ , ~ , [ , ] , . and `
I use:
private static boolean isAlreadyEncoded(String passedUrl) {
    boolean isEncoded = true;
    if (passedUrl.matches(".*[\\ \"\\<\\>\\{\\}|\\\\^~\\[\\]].*")) {
        isEncoded = false;
    }
    return isEncoded;
}
For the actual encoding I proceed with:
https://stackoverflow.com/a/49796882/1485527
Note: Even if your URL doesn't contain unsafe characters, you might still want to apply, e.g., Punycode encoding to the host name. So there is still plenty of room for additional checks.
[1] A list of candidates can be found in the section "unsafe" of the URL spec at Page 2.
In my understanding '%' or '#' should be left out in the encoding check, since these characters can occur in encoded URLs as well.
Using Spring UriComponentsBuilder:
import java.net.URI;
import org.springframework.web.util.UriComponentsBuilder;

private URI getProperlyEncodedUri(String uriString) {
    try {
        return URI.create(uriString);
    } catch (IllegalArgumentException e) {
        return UriComponentsBuilder.fromUriString(uriString).build().toUri();
    }
}
If you want to be sure that string is encoded correctly (if it needs to be encoded) - just decode and encode it once again.
metacode:
100%_correctly_encoded_string = encode(decode(input_string))
An already encoded string will remain untouched, an unencoded string will be encoded, and a string with only URL-allowed characters will remain untouched too.
According to the spec (https://www.rfc-editor.org/rfc/rfc3986) all URLs MUST start with a scheme followed by a :
Since colons are required as the delimiter between a scheme and the rest of the URI, any string that contains a colon is not encoded.
(This assumes you will not be given an incomplete URI with no scheme.)
So you can test whether the string contains a colon. If it does not, URL-decode it. If the decoded string contains a colon, the original string was URL encoded. If it still does not, check whether the decoded string differs from the original: if so, URL-decode again; if not, it is not a valid URI.
You can make this loop simpler if you know what schemes you can expect.
Thanks to this answer, I wrote a JavaScript function that encodes the URL exactly once with encodeURI, so you can call it without needing to know whether the URL is already encoded.
ES6:
var getUrlEncoded = sURL => {
  if (decodeURI(sURL) === sURL) return encodeURI(sURL)
  return getUrlEncoded(decodeURI(sURL))
}
Pre ES6:
var getUrlEncoded = function(sURL) {
  if (decodeURI(sURL) === sURL) return encodeURI(sURL)
  return getUrlEncoded(decodeURI(sURL))
}
Here are some tests so you can see the URL is only encoded once:
getUrlEncoded("https://example.com/media/Screenshot27 UI Home.jpg")
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(encodeURI("https://example.com/media/Screenshot27 UI Home.jpg"))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(encodeURI(encodeURI("https://example.com/media/Screenshot27 UI Home.jpg")))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(decodeURI("https://example.com/media/Screenshot27 UI Home.jpg"))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(decodeURI(decodeURI("https://example.com/media/Screenshot27 UI Home.jpg")))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
