Invalid URI with Chinese characters (Java)

I am having trouble opening a URL connection when the URL contains Chinese characters. The same code works with Latin characters:
String xstr = "维也纳恩斯特哈佩尔球场" ;
URI uri = new URI("http","ajax.googleapis.com","/ajax/services/language/detect","v=1.0&q="+xstr,null);
URL url = uri.toURL();
URLConnection connection = url.openConnection();
InputStream is = connection.getInputStream() ;
The getInputStream() call results in:
java.lang.IllegalArgumentException: Invalid uri 'http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&q=???????????': Invalid query

The problem is caused by the fact that URI.toURL() doesn't percent-encode non-ASCII characters. Use the following instead:
URL url = new URL(uri.toASCIIString());
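To make the difference visible, here is a minimal sketch using only java.net.URI. The class and method names (AsciiUriDemo, toAsciiUrl) are made up for illustration, and the query uses Unicode escapes for the two characters 中文 rather than the string from the question, so that the source file stays pure ASCII.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class AsciiUriDemo {
    // Hypothetical helper: builds a URI from unencoded parts and returns its US-ASCII form.
    static String toAsciiUrl(String host, String path, String query) {
        try {
            // The multi-argument constructor keeps non-ASCII letters as they are.
            URI uri = new URI("http", host, path, query, null);
            // toASCIIString() percent-encodes them as UTF-8 bytes.
            return uri.toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes for two Chinese characters (assumed sample input).
        String query = "v=1.0&q=\u4E2D\u6587";
        System.out.println(toAsciiUrl("ajax.googleapis.com", "/ajax/services/language/detect", query));
        // prints http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&q=%E4%B8%AD%E6%96%87
    }
}
```

The resulting string contains only ASCII characters, so it can be passed to new URL(...) as in the answer above.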

axtavt's answer above saved me from insanity, thanks! Just one comment (I could not figure out how to comment below the answer:)
If you start with a URL, you need to encode quotes before you build the URI:
String s = "your_url?with=\"quotes\"";
URI su = new URI(s.replaceAll("\"", "%22"));
URL ur = new URL( su.toASCIIString());

I think it is related to the "UTF-8" charset. Have a look at this topic to learn more, and also at this one about Chinese in Java.

Per the URI RFC (see section 2.4), non-US-ASCII characters aren't valid in a URI. You must encode them.

Related

encode special character in URL

{URL}/text=Congratulations%21+You+are+eligible+for+.%0A
%0A = New line encoded character
I am passing an encoded newline in a parameter. The problem is that when I build the above URL, the % is encoded again as %25,
so the URL becomes {URL}/text=Congratulations%21+You+are+eligible+for+.%250A
I do not understand why URIBuilder encodes an already encoded character.
I used the code below to build the URI:
URI url = new URIBuilder("URL").build();
If you don't need url encoding why do you use URIBuilder at all? You could simply create a new URI.
You need #buildFromEncoded if you want to feed in pre-encoded strings.
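The same double-encoding effect can be reproduced with plain java.net.URI, which may help to see what is going on. This is only a sketch with the JDK class, not the URIBuilder from the question; the class name PreEncodedDemo, its methods, and the example.com host are assumptions for illustration. The multi-argument constructors treat their input as unencoded and always quote '%', while parsing a complete string keeps existing escapes.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class PreEncodedDemo {
    // Multi-argument constructor: the path is treated as raw text, so '%' becomes %25.
    static String fromRawParts(String host, String path) {
        try {
            return new URI("http", host, path, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Parsing a complete string: escapes that are already present are left alone.
    static String fromEncoded(String encodedUrl) {
        return URI.create(encodedUrl).toString();
    }

    public static void main(String[] args) {
        System.out.println(fromRawParts("example.com", "/text=eligible+for+.%0A"));
        // prints http://example.com/text=eligible+for+.%250A
        System.out.println(fromEncoded("http://example.com/text=eligible+for+.%0A"));
        // prints http://example.com/text=eligible+for+.%0A
    }
}
```

So a string that is already percent-encoded should go through an API that expects encoded input, and raw text through one that encodes it, but never both.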

Extract parameters from URL

I have a problem with the + character (and maybe others). URIBuilder is supposed to take a decoded URL, but when I extract the query parameters the + is replaced:
String decodedUrl = "www.foo.com?sign=AZrhQaTRSiys5GZtlwZ+H3qUyIY=&more=boo";
URIBuilder builder = new URIBuilder(decodedUrl);
List<NameValuePair> params = builder.getQueryParams();
String sign = params.get(0).getValue();
The value of sign is AZrhQaTRSiys5GZtlwZ H3qUyIY=, with a space instead of the +. How can I extract the correct value?
Another way is:
URI uri = new URI(decodedUrl);
String query = uri.getQuery();
The value of query is sign=AZrhQaTRSiys5GZtlwZ+H3qUyIY=&more=boo, which is correct in this case, but I would have to parse it myself. Is there another way to do that?
Use it differently:
String decodedUrl = "www.foo.com";
URIBuilder builder = new URIBuilder(decodedUrl);
builder.addParameter("sign", "AZrhQaTRSiys5GZtlwZ+H3qUyIY=");
builder.addParameter("more", "boo");
List<NameValuePair> params = builder.getQueryParams();
String sign = params.get(0).getValue();
The addParameter method adds parameters to the builder. The builder's constructor should receive the base URL only.
If this URL is given to you as is, then the + is already decoded and stands for the space character. If you are the one who generates this URL, you probably skipped the URL encoding step (which can be done using the code snippet above).
Read a bit about URL encoding: http://en.wikipedia.org/wiki/Query_string#URL_encoding
That is because a space sent as a parameter in a URL is encoded as +. There are rules about which characters are valid in a URL; see the URL RFC.
It is necessary to encode any characters disallowed in a URL, including spaces and other binary data not in the allowed character set, using the standard convention of the "%" character followed by two hexadecimal digits.
If you want a literal + in a URL, you need to encode it as %2B. For example, 2+2 is encoded as 2%2B2, and "i am" as i+am. So in your case I believe you are getting the correct result, since AZrhQaTRSiys5GZtlwZ+H3qUyIY decodes to AZrhQaTRSiys5GZtlwZ H3qUyIY.
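The rule for + can be checked with the JDK's own java.net.URLEncoder and URLDecoder. This is a small sketch rather than the URIBuilder code from the question; the class name PlusSignDemo and its two wrapper methods are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class PlusSignDemo {
    // Encodes a single query value (form encoding): space -> '+', '+' -> %2B.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Decodes a single query value: '+' -> space, %2B -> '+'.
    static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String sign = "AZrhQaTRSiys5GZtlwZ+H3qUyIY=";
        System.out.println(encode(sign));         // AZrhQaTRSiys5GZtlwZ%2BH3qUyIY%3D
        System.out.println(decode(encode(sign))); // AZrhQaTRSiys5GZtlwZ+H3qUyIY=
        System.out.println(decode(sign));         // AZrhQaTRSiys5GZtlwZ H3qUyIY=
        System.out.println(encode("i am"));       // i+am
    }
}
```

The third line of output is the situation from the question: a value that was never encoded gets decoded anyway, and the + turns into a space.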

Regular expression for retrieving http:// from the given URL

I want to retrieve http:// or https://, depending on the protocol being used, from a given URL.
How can I do it using Pattern and Matcher?
If there is some other way to do it, please suggest a snippet.
Here is my URL:
https://www.google.co.in/#q=retrieving+http:%2F%2F+from+url+using+java+regular+expression+
Thanks in advance.
Another way:
String urlString = "https://www.google.co.in";
URL url = new URL(urlString);
String protocol = url.getProtocol();
A simple pattern such as
^(https?://)
should match.
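Since the question asks for Pattern and Matcher specifically, here is a minimal sketch of how the pattern above could be applied. The class and method names (SchemePrefix, schemePrefix) are hypothetical, and returning an Optional is just one possible design choice.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SchemePrefix {
    // Matches "http://" or "https://" at the very start of the input.
    private static final Pattern SCHEME = Pattern.compile("^(https?://)");

    // Returns the matched prefix, or an empty Optional if the URL does not start with one.
    static Optional<String> schemePrefix(String url) {
        Matcher matcher = SCHEME.matcher(url);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String url = "https://www.google.co.in/#q=retrieving+http:%2F%2F+from+url";
        System.out.println(schemePrefix(url).orElse("(no match)")); // https://
    }
}
```

Because the pattern is anchored with ^, the "http:" that appears later inside the fragment is not matched.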

Encoding an URL sent to a server (not in query)

I need to test my server against several URLs daily, since these URLs are updated by my users, and this will be done in Java. However, these URLs contain unusual characters (like the German umlaut). Basically, what I am doing is:
// urlsToCheck is a placeholder name for the list of URLs to check
for (String theUrl : urlsToCheck) {
URL u = new URL(theUrl);
URLConnection connection = u.openConnection();
// read the content and handle it
}
Now, what I've found is that org.apache.commons.codec.net.URLCodec is fine for encoding a string to paste into the query string, but it is not as suitable for encoding unusual URLs into their hex counterparts. Here are some examples of URLs:
http:// www.example com/u/überraum-03/
http:// www.example com/u/são-paulo-dude/
http:// www.example com/u/håkon-hellström/
The desired result for the first would be;
http:// www.example com/u/%c3%9berraum-03/
Is there any library in Apache Commons, or in Java itself, that converts special characters in the ACTUAL URL (not the query string, and therefore not replacing the same set of characters)?
Thank you for your time.
Edited
Firefox translates "yr.no/place/Norway/Nordland/Moskenes/Å/data.html" into "yr.no/place/Norway/Nordland/Moskenes/%C3%85/data.html" (try this by entering the first URL, pressing Enter, then copying the URL into a document). It is this effect that I am looking for, since this is the actual translation. Most likely either Firefox knows that Å is a problem, or it tries multiple versions, or it accepts the server's "Location" header; either way, there is a transformation from "Å" to "%C3%85" on only a subset of the URL. This is the function we need.
Edited
I just verified that the code given by the commenter sadly does not work. As an example, try this:
try{
String urlStr = "http://www.yr.no/place/Norway/Nordland/Moskenes/Å/data.html";
URL u=new URL(urlStr);
URI uri = new URI(u.getProtocol(),
u.getUserInfo(), u.getHost(), u.getPort(),
u.getPath(), u.getQuery(),
null); // removing ref
URL urlObj = uri.toURL();
HttpURLConnection connection = (HttpURLConnection) urlObj.openConnection();
connection.setInstanceFollowRedirects(false);
connection.connect();
for (int i=0;i<connection.getHeaderFields().size();i++)
System.out.println(connection.getHeaderFieldKey(i)+": "+connection.getHeaderField(i));
System.exit(0);
}catch(Exception e){e.printStackTrace();};
This yields a 404 error. Strangely enough, the encoded version does not work either.
If you need a URL that is a valid URI (RFC 2396 compliant), you can create one like this in Java:
String urlString = "http://www.example.com/u/håkon-hellström/";
URL url = new URL(urlString);
URI uri = new URI(url.getProtocol(),url.getAuthority(), url.getPath(), url.getQuery(), url.getRef());
url = new URL(uri.toASCIIString());
That being said, all three sample strings you provided are RFC 2396 compliant and do not need to be encoded. I am assuming the spaces in the authority part of the URLs you provided are typos.
EDIT:
I updated the code block above. By using URI.toASCIIString() you can limit the resulting URI to only US-ASCII characters (other characters are encoded). The resulting string can then be used to create a new, valid URL.
http://www.example.com/u/håkon-hellström/
changes to
http://www.example.com/u/h%C3%A5kon-hellstr%C3%B6m/
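The conversion above can be wrapped in a small helper. This is a sketch under a few assumptions: the class and method names (UrlPathEncoder, toAsciiUrl) are invented here, the input is assumed to contain no percent-escapes yet (the multi-argument URI constructor would encode their '%' again), and the non-ASCII letters are written as Unicode escapes to keep the source ASCII-only.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

final class UrlPathEncoder {
    // Hypothetical helper: turns an unencoded URL string into an ASCII-only URL string.
    static String toAsciiUrl(String rawUrl) {
        try {
            // URL parses leniently and does not touch non-ASCII characters.
            URL url = new URL(rawUrl);
            // Rebuild as a URI from the individual components, then ask for the ASCII form.
            URI uri = new URI(url.getProtocol(), url.getAuthority(),
                    url.getPath(), url.getQuery(), url.getRef());
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // \u00E5 and \u00F6 are the a-ring and o-umlaut from the example URL.
        System.out.println(toAsciiUrl("http://www.example.com/u/h\u00E5kon-hellstr\u00F6m/"));
        // prints http://www.example.com/u/h%C3%A5kon-hellstr%C3%B6m/
    }
}
```

Note that new URL(String) is deprecated in recent JDK releases; it still works, but the compiler may print a deprecation note.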

Escape non english characters in a url

How can I escape non-English characters like "ö" in my URL? They cause a 404 response. I am using Java. Please help me.
For example, by using URL encoding (percent-encoding) as specified in RFC 3986 (http://tools.ietf.org/html/rfc3986). Please also have a look at: http://en.wikipedia.org/wiki/Percent-encoding
Java provides some methods to do this:
http://download.oracle.com/javase/1.4.2/docs/api/java/net/URLEncoder.html
Be aware of different encodings like ISO-8859-1/15 and UTF-8. Depending on the encoding, an 'ö' will for example be encoded as %F6 (ISO-8859-1) or %C3%B6 (UTF-8).
Use URLEncoder/URLDecoder in the java.net package.
Try java.net.URLEncoder.
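The charset dependence mentioned above is easy to verify with java.net.URLEncoder. A minimal sketch follows; the class name CharsetEncodingDemo and its wrapper method are made up for illustration. Keep in mind that URLEncoder implements form encoding, so it is meant for individual query values rather than for a whole URL or a path.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class CharsetEncodingDemo {
    // Percent-encodes the text using the named charset.
    static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String oUmlaut = "\u00F6"; // the o-umlaut, written as an escape to keep the source ASCII
        System.out.println(encode(oUmlaut, "ISO-8859-1")); // %F6
        System.out.println(encode(oUmlaut, "UTF-8"));      // %C3%B6
    }
}
```

Which of the two a server expects depends on the server; UTF-8 is what RFC 3986 recommends for new URI schemes.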
I had a similar problem: there was a 'ü' in the URL path. After a few hours of experimenting with various SO posts I got this (from here):
URL url = new URL(urlString);
URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
url = new URL(uri.toASCIIString());
The trick is in how the URI is converted to a URL. Most answers ended with a call to URI.toURL(). While that approach correctly encodes whitespace and non-letter characters, it does not encode non-ASCII letters. URI.toASCIIString() solves that problem.
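The difference between the two conversions can be seen side by side in a short sketch. The class and method names (ToUrlVsAscii, viaToUrl, viaAscii) and the example.com host are assumptions for illustration; the path contains a space and a 'ü' written as a Unicode escape.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;

final class ToUrlVsAscii {
    // Builds a URI from an unencoded path; the constructor already quotes the space.
    private static URI build(String path) throws URISyntaxException {
        return new URI("http", "example.com", path, null);
    }

    // Conversion via URI.toURL(): non-ASCII letters stay unencoded.
    static String viaToUrl(String path) {
        try {
            return build(path).toURL().toString();
        } catch (URISyntaxException | MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Conversion via URI.toASCIIString(): non-ASCII letters are percent-encoded as UTF-8.
    static String viaAscii(String path) {
        try {
            return build(path).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String path = "/a b/\u00FC"; // space plus u-umlaut
        System.out.println(viaToUrl(path)); // space becomes %20, the umlaut is left as is
        System.out.println(viaAscii(path)); // http://example.com/a%20b/%C3%BC
    }
}
```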
