I have to use HttpClient 2.0 (I cannot use anything newer), and I am running into the following issue. When I use the method (POST, in this case), it encodes the parameters as hexadecimal ASCII escapes, and the spaces are turned into "+" (something that the receiver does not want).
Does anyone know a way to avoid it?
Thanks a lot.
Even your browser does that, converting the space character into +. See here: http://download.oracle.com/javase/1.5.0/docs/api/java/net/URLEncoder.html
It URL-encodes a string, converting unsafe characters into percent-escaped bytes (recommended encoding: UTF-8).
When encoding a String, the following rules apply:
The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.
The special characters ".", "-", "*", and "_" remain the same.
The space character " " is converted into a plus sign "+".
All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte. The recommended encoding scheme to use is UTF-8. However, for compatibility reasons, if an encoding is not specified, then the default encoding of the platform is used.
Also, see here http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1
Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., '%0D%0A').
The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.
To answer your question: if you receive an encoded string and do not want it encoded, URLDecoder.decode will undo the encoding for you.
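As a minimal round-trip sketch (the sample string is made up for illustration), this shows the encoding the question complains about and how URLDecoder reverses it:

```java
import java.net.URLDecoder;
import java.net.URLEncoder;

// URLEncoder turns a space into '+' and '&' into %26;
// URLDecoder reverses both transformations.
public class EncodeDemo {
    public static void main(String[] args) throws Exception {
        String encoded = URLEncoder.encode("a b&c", "UTF-8");
        System.out.println(encoded);                             // a+b%26c
        System.out.println(URLDecoder.decode(encoded, "UTF-8")); // a b&c
    }
}
```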
You could in theory avoid this by constructing the query string or request body containing parameters by hand.
But this would be a bad thing to do, because the HTML, HTTP, URL and URI specs all mandate that reserved characters in request parameters are encoded. And if you violate this, you may find that server-side HTTP stacks, proxies and so on reject your requests as invalid, or misbehave in other ways.
The correct way to deal with this issue is to do one of the following:
If the server is implemented in Java EE technology, use the relevant servlet API methods (e.g. ServletRequest.getParameter(...)) to fetch the request parameters. These take care of any decoding for you.
If the parameters are part of a URL query string, you can instantiate a Java URL or URI object and use the getter to return you the query with the encoding removed.
If your server is implemented some other way (or if you need to unpick the request URL's query string or POST data yourself), then use URLDecoder.decode or equivalent to remove the % encoding and replace each + with a space ... after you have figured out where the query and parameter boundaries are.
I have a question regarding the base64Encode function in the indexer. When I encode via Java, I get a different result than the Azure indexer produces:
In azure
{
sourceString= "00cbc05fc051e634d7d485c7879fe7bdb4f6509a"
base64EncodedString= "MDBjYmMwNWZjMDUxZTYzNGQ3ZDQ4NWM3ODc5ZmU3YmRiNGY2NTA5YQ2",
}
In Java
{
sourceString= "00cbc05fc051e634d7d485c7879fe7bdb4f6509a"
base64EncodedString= "MDBjYmMwNWZjMDUxZTYzNGQ3ZDQ4NWM3ODc5ZmU3YmRiNGY2NTA5YQ==",
}
Why does the Azure result end in "2" while the Java result ends in "=="?
Both are decoded to the same string.
The "2" at the end from indexer field mappings represents there are 2 equal signs in "==".
Standard base64 encoding uses equal signs as padding characters at the end of a string to make the length a multiple of 4, but they're not necessary to decode the original string.
Since standard encoding uses characters that are meaningful in URL query strings, and encoded strings are sometimes passed through the URL, there are variants that swap out or omit those characters to make the encoding URL-safe.
The indexer has 2 implementations of base64Encode and defaults to using HttpServerUtility.UrlTokenEncode, which replaces all equal signs at the end of encoded strings with the count of those equal signs. The other implementation simply omits the equal signs, and you can choose between the two behaviors by setting useHttpServerUtilityUrlTokenEncode (defaults to true but you probably want false).
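The trailing-count behavior can be sketched in plain Java. This is a hedged imitation of what HttpServerUtility.UrlTokenEncode does (URL-safe alphabet, padding replaced by its count), not the actual .NET implementation:

```java
import java.util.Base64;

// Imitation of HttpServerUtility.UrlTokenEncode: standard base64, then
// '+' -> '-', '/' -> '_', and trailing '=' padding replaced by its count.
public class UrlTokenBase64 {
    static String urlTokenEncode(byte[] bytes) {
        String std = Base64.getEncoder().encodeToString(bytes);
        int pad = 0;
        while (std.endsWith("=")) {          // strip and count the padding
            std = std.substring(0, std.length() - 1);
            pad++;
        }
        return std.replace('+', '-').replace('/', '_') + pad;
    }

    static byte[] urlTokenDecode(String token) {
        int pad = token.charAt(token.length() - 1) - '0';  // last char is the count
        StringBuilder std = new StringBuilder(
                token.substring(0, token.length() - 1)
                     .replace('-', '+').replace('_', '/'));
        for (int i = 0; i < pad; i++) std.append('=');     // restore padding
        return Base64.getDecoder().decode(std.toString());
    }

    public static void main(String[] args) {
        String s = "00cbc05fc051e634d7d485c7879fe7bdb4f6509a";
        System.out.println(urlTokenEncode(s.getBytes()));
        // MDBjYmMwNWZjMDUxZTYzNGQ3ZDQ4NWM3ODc5ZmU3YmRiNGY2NTA5YQ2
    }
}
```

Running this on the question's source string reproduces the indexer's "...YQ2" output.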
You can encode the string 00>00?00 in the indexer/Java to see exactly which behavior you're getting, and check this table to see how to convert between them.
N.B. - using standard base64 decoding with HttpServerUtility.UrlTokenEncode is very misleading and should be avoided. Try encoding and decoding a, aa, aaa, sometimes you get the original string back and sometimes you don't.
I have an API at the following path
/v0/segments/ch/abc/view/status/ACTIVE?sc=%s&expiryGteInMs=%d
I am building a Client using the URIBuilder in Java.
return UriBuilder
.fromUri(config.getHost())
.path(String.format(config.getPath(), request.getList(), request.getTime()))
.build();
The request contains a list to be substituted in place of %s and the time to be substituted in place of %d. But the request being formed has a path like this
/v0/segments/ch/abc/view/status/ACTIVE%3Fsc=FK,GR&expiryGteInMs=1611081000000
Basically the "?" character is being replaced by %3F. Can somebody help me with this, please?
P.S.: I know that we can use the .queryParam option available in UriBuilder. I am looking for the actual reason why this is happening.
Most probably the library you are using is encoding the URL, and ? encodes to %3F.
Why this happens (in short): each URL component may contain only a specific set of characters, and a literal ? is not allowed inside a path segment, so in order to transmit this character as data it must be encoded (so-called percent-encoding).
A bit longer explanation (taken from here):
URL encoding converts characters into a format that can be transmitted over the Internet.
URLs can only be sent over the Internet using the ASCII character-set.
Since URLs often contain characters outside the ASCII set, the URL has to be converted into a valid ASCII format.
URL encoding replaces unsafe ASCII characters with a "%" followed by two hexadecimal digits.
URLs cannot contain spaces. URL encoding normally replaces a space with a plus (+) sign or with %20.
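The distinction can be seen with plain java.net.URI (host and path here are hypothetical, chosen to mirror the question): a '?' handed to the path component is treated as data and escaped, while the same text handed to the query component keeps its delimiter role.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Shows why UriBuilder-style APIs escape '?' that is embedded in a path.
public class QueryVsPath {
    public static void main(String[] args) throws URISyntaxException {
        // '?' passed as part of the *path* component is escaped to %3F
        URI asPath = new URI("http", "example.com",
                "/v0/segments/ACTIVE?sc=FK,GR", null, null);
        System.out.println(asPath); // http://example.com/v0/segments/ACTIVE%3Fsc=FK,GR

        // the same text supplied as the *query* component keeps its '?'
        URI asQuery = new URI("http", "example.com",
                "/v0/segments/ACTIVE", "sc=FK,GR", null);
        System.out.println(asQuery); // http://example.com/v0/segments/ACTIVE?sc=FK,GR
    }
}
```

This is why .queryParam (or the query argument of URI) is the right channel for parameters: the builder cannot know that a '?' inside a path string was meant as a delimiter.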
I know there are other questions but they seem to have answers which are assumptions rather than being definitive.
My limited understanding of cookie values is that:
semicolons are already used to separate attributes within a single cookie;
equals signs are used to separate cookie names from values;
colons are used to separate multiple cookies within a header.
Are there any other "special" characters?
Some other Q&As suggest that one base64-encode the value, but this may of course include equals signs, which are not valid.
I have also seen suggestions that values may be quoted; this, however, leads to other questions:
Do the special characters need to be quoted?
Do quoted values support the usual backslash escaping mechanisms?
RFC
I read a few RFCs, including some of the many cookie RFCs, but I am still unsure, as each cross-references another RFC with no definitive, simple explanation or sample that answers my query.
Hopefully no one will say "read the RFC", because the question then becomes: which RFC?
I think I have also read that different browsers have slightly different rules, so please note this in your answers if it matters.
The latest RFC is 6265, and it states that previous Cookie RFCs are obsoleted.
Here's what the syntax rules in the RFC say:
cookie-pair = cookie-name "=" cookie-value
cookie-name = token
cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
; US-ASCII characters excluding CTLs,
; whitespace, DQUOTE, comma, semicolon,
; and backslash
Thus:
The special characters are white-space characters, double quote, comma, semicolon and backslash. Equals is not a special character.
The special characters cannot be used at all, with the exception that double quotes may surround the value.
Special characters cannot be quoted.
Backslash does not act as an escape.
It follows that base-64 encoding can be used, because equals is not special.
Finally, from what I can tell, the RFC 6265 cookie values are defined so that they will work with any browser that implements any of the Cookie RFCs. However, if you tried to use cookie values that don't conform to RFC 6265 (but do arguably do conform to earlier RFCs), you may find that cookie behavior varies with different browsers.
In short, conform to the letter of RFC 6265 and you should be fine.
If you need pass cookie values that include any of the forbidden characters, your application needs to do its own encoding and decoding of the values; e.g. using base64.
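As a sketch of that approach in Java (the class name is made up; this simply uses the JDK's URL-safe Base64 variant, which avoids '=', '+' and '/' entirely):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Wraps arbitrary text in a cookie-safe form: URL-safe Base64 alphabet
// ([A-Za-z0-9_-]) with padding omitted, so no RFC 6265 forbidden characters.
public class CookieValueCodec {
    static String encode(String value) {
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(value.getBytes(StandardCharsets.UTF_8));
    }

    static String decode(String encoded) {
        return new String(Base64.getUrlDecoder().decode(encoded),
                StandardCharsets.UTF_8);
    }
}
```

Encode before setting the cookie, decode after reading the value back; the Java URL-safe decoder accepts unpadded input.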
There was a mention of base64, so here is a cooked cookie solution using it. The functions implement a modified version of base64 that only uses [0-9a-zA-Z_-].
You can use it for both the name and value parts of cookies; it is binary-safe, as they say.
The gzdeflate/gzinflate pair takes back the ~30% of space overhead created by base64; I could not resist using it. Note that PHP's gzdeflate/gzinflate is available at most hosting companies, but not all.
//write
setcookie
(
'mycookie'
,code_base64_FROM_bytes_cookiesafe(gzdeflate($mystring))
,time()+365*24*3600
);
//read
$mystring=gzinflate(code_bytes_FROM_base64_cookiesafe($_COOKIE['mycookie']));
function code_base64_FROM_bytes_cookiesafe($bytes)
{
//safe for name and value part [0-9a-zA-Z_-]
return strtr(base64_encode($bytes),Array
(
'/'=>'_',
'+'=>'-',
'='=>'',
' '=>'',
"\n"=>'',
"\r"=>'',
));
}
function code_bytes_FROM_base64_cookiesafe($enc)
{
$enc=str_pad($enc,strlen($enc)+(4-strlen($enc)%4)%4,'=',STR_PAD_RIGHT);//add back the stripped = padding
$enc=chunk_split($enc);//inserts \r\n every 76 chars
return base64_decode(strtr($enc,Array
(
'_'=>'/',
'-'=>'+',
)));
}
What is the best practice for replacing Unicode characters in a URL?
For example, if I have a multilingual website that supports East European languages, how should I format the URL so that it always contains valid characters?
What you want to do is called slugify.
$slugified_url_part = iconv('utf-8', 'us-ascii//TRANSLIT', $url_part);
The above code will turn non-ASCII characters into their closest ASCII equivalents.
You should also trim whitespace and replace inner whitespace with a dash or underscore.
Making all characters lowercase is also common.
Slugify is handy for memorable URLs and for SEO.
You could of course use percent-encoding, but that can look ugly.
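For the Java side, a comparable slugify routine can be sketched with Normalizer, which plays roughly the role of iconv's //TRANSLIT for accented Latin characters (this is my own illustration, not a standard library function):

```java
import java.text.Normalizer;
import java.util.Locale;

// Slugify sketch: decompose accented characters (NFD), strip the combining
// marks, lowercase, then collapse runs of non-alphanumerics into dashes.
public class Slugify {
    static String slugify(String input) {
        String noAccents = Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");       // drop combining marks
        return noAccents.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", "-")   // non-alphanumerics -> dash
                .replaceAll("(^-|-$)", "");      // trim leading/trailing dashes
    }
}
```

Note that stripping combining marks only helps for scripts with Latin decompositions; characters with no decomposition (e.g. Cyrillic) need a separate transliteration table.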
Use Percent-encoding. Most languages have a helper function already built in.
Percent-encoding, also known as URL encoding, is a mechanism for encoding information in a Uniform Resource Identifier (URI) under certain circumstances. Although it is known as URL encoding it is, in fact, used more generally within the main Uniform Resource Identifier (URI) set, which includes both Uniform Resource Locator (URL) and Uniform Resource Name (URN). As such it is also used in the preparation of data of the "application/x-www-form-urlencoded" media type, as is often used in email messages and the submission of HTML form data in HTTP requests.
When using PHP, you can use urlencode() to build your URLs.
The tags on this question are a bit confusing, containing both PHP and Java.
For the Java side.
Use URLEncoder.encode("Your String Here", "UTF-8");
I am working with apache http client 4 for all of my web accesses.
This means that every query I issue has to pass the URI syntax checks.
One of the sites I am trying to access uses %uXXXX "UNICODE" escapes as the encoding of its URL GET params, i.e.:
http://maya.tase.co.il/bursa/index.asp?view=search&company_group=147&srh_txt=%u05E0%u05D9%u05D1&arg_comp=&srh_from=2009-06-01&srh_until=2010-02-16&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press=
(the param "srh_txt=%u05E0%u05D9%u05D1" encodes srh_txt=ניב in UNICODE)
The problem is that URI doesn't support this encoding (it only supports UTF-8 percent-escapes).
The really big issue here is that this site expects its params to be encoded this way, so any attempt to convert the url using String.format("http://...srh_txt=%s&...", URLEncoder.encode("ניב", "UTF8"))
results in a url which is legal and can be used to construct a URI, but the site responds to it with an error message, since it's not the encoding it expects.
By the way, a URL object can be created, and even used to connect to the web site, using the unconverted url.
Is there any way of creating URI in non UTF-8 encoding?
Is there any way of working with apache httpclient 4 with regular URL(and not URI)?
thanks,
Niv
(the param "srh_txt=%u05E0%u05D9%u05D1" encodes srh_txt=ניב in UNICODE)
It doesn't really. That's not URL-encoding, and the sequence %u is invalid in a URL.
%u05E0%u05D9%u05D1 encodes ניב only in JavaScript's oddball escape syntax. escape is the same as URL-encoding for all ASCII characters except +, but the %u#### escapes it produces for Unicode characters are completely its own invention.
(One should, in general, never use escape. Using encodeURIComponent instead produces the correct URL-encoded UTF-8, ניב=%D7%A0%D7%99%D7%91.)
If a site requires %u#### sequences in its query string, it is very badly broken.
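That said, if you are stuck talking to such a site, the %u#### format can be produced by hand. This is a hedged sketch imitating JavaScript's legacy escape() (the allowed-character set below is escape's, not URL-encoding's):

```java
// Imitation of JavaScript's legacy escape(): alphanumerics and @*_+-./
// pass through, other chars below U+0100 become %XX, the rest %uXXXX.
public class JsEscape {
    static String jsEscape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z') || "@*_+-./".indexOf(c) >= 0) {
                out.append(c);                             // left unescaped
            } else if (c < 256) {
                out.append(String.format("%%%02X", (int) c));   // %XX
            } else {
                out.append(String.format("%%u%04X", (int) c));  // %uXXXX
            }
        }
        return out.toString();
    }
}
```

For the string in the question, jsEscape("ניב") yields %u05E0%u05D9%u05D1, the form the site expects; you would have to splice this into the URL string yourself, since URI builders will reject it.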
Is there any way of creating URI in non UTF-8 encoding?
Yes, URIs may use any character encoding you like. It is conventionally UTF-8; that's what IRI requires and what browsers will usually submit if the user types non-ASCII characters into the address bar, but URI itself concerns itself only with bytes.
So you could convert ניב to %F0%E9%E1. There would be no way for the web app to tell that those bytes represented characters encoded in code page 1255 (Hebrew, similar to ISO-8859-8). But it does appear to work, on the link above, which the UTF-8 version does not. Oh dear!