I am posting temperature values from my Java code to OpenTSDB. In one of the tags I want to record the measurement type, i.e. whether the reading is in °C or °F. So I tried to post the Unicode escape "\u00b0" from Java; in System.out.println I can see the degree symbol, but when I post it, OpenTSDB does not accept the value.
I also read the article that defines the characters accepted by OpenTSDB (in the Metrics and Tags section), and it says that Unicode letters are accepted. But when I try to send the Unicode degree sign, it doesn't work.
So does it accept these characters? How can I send them?
http://opentsdb.net/docs/build/html/user_guide/writing.html
The following rules apply to metric and tag values:
Strings are case sensitive, i.e. "Sys.Cpu.User" will be stored separately from "sys.cpu.user"
Spaces are not allowed.
Only the following characters are allowed: a to z, A to Z, 0 to 9, -, _, ., / or Unicode letters (as per the specification)
But in fact, no characters other than those mentioned above are supported by OpenTSDB.
As of OpenTSDB version 2.3 there is support for specifying additional allowed characters via a config variable (cross-posting from "OpenTSDB: Is Space character allowed in Metric and tag information"):
tsd.core.tag.allow_specialchars = !##$%^&*()_+{}|: <>?~`-=[]\;',./°
http://opentsdb.net/docs/build/html/user_guide/configuration.html gives more details
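For illustration, here is a minimal Java sketch (class name, host, metric and tag names are placeholders) that posts a data point with a ° in a tag value via OpenTSDB's HTTP /api/put endpoint; it will only be accepted if tsd.core.tag.allow_specialchars includes °:
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class TsdbPut {
    public static void main(String[] args) throws Exception {
        // Placeholder TSD location; adjust host/port for your setup.
        URL url = new URL("http://localhost:4242/api/put");
        // The "unit" tag value contains U+00B0; this is only accepted if
        // tsd.core.tag.allow_specialchars includes the ° character.
        String json = "{\"metric\":\"room.temperature\",\"timestamp\":"
                + (System.currentTimeMillis() / 1000)
                + ",\"value\":21.5,\"tags\":{\"unit\":\"\u00b0C\"}}";

        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "application/json; charset=UTF-8");
        con.setDoOutput(true);
        try (OutputStream os = con.getOutputStream()) {
            // Send the body as UTF-8 so the degree sign survives the wire.
            os.write(json.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + con.getResponseCode());
    }
}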
I found an interesting regex in a Java project: "[\\p{C}&&\\S]"
I understand that the && means "set intersection", and \S is "non-whitespace", but what is \p{C}, and is it okay to use?
The java.util.regex.Pattern documentation doesn't mention it. The only similar class on the list is \p{Cntrl}, but they behave differently: they both match on control characters, but \p{C} matches twice on Unicode characters above U+FFFF, such as PILE OF POO:
public class StrangePattern {
    public static void main(String[] argv) {
        // As far as I can tell, this is the simplest way to create a String
        // with code points above U+FFFF.
        String poo = new String(Character.toChars(0x1F4A9));
        System.out.println(poo);                               // prints `💩`
        System.out.println(poo.replaceAll("\\p{C}", "?"));     // prints `??`
        System.out.println(poo.replaceAll("\\p{Cntrl}", "?")); // prints `💩`
    }
}
The only mention I've found anywhere is here:
\p{C} or \p{Other}: invisible control characters and unused code points.
However, \p{Other} does not seem to exist in Java, and the matching code points are not unused.
My Java version info:
$ java -version
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)
Bonus question: what is the likely intent of the original pattern, "[\\p{C}&&\\S]"? It occurs in a method which validates a string before it is sent in an email: if that pattern is matched, an exception with the message "Invalid string" is raised.
Buried down in the Pattern docs under Unicode Support, we find the following:
This class is in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression, plus RL2.1 Canonical Equivalents.
...
Categories may be specified with the optional prefix Is: Both \p{L} and \p{IsL} denote the category of Unicode letters. Same as scripts and blocks, categories can also be specified by using the keyword general_category (or its short form gc) as in general_category=Lu or gc=Lu.
The supported categories are those of The Unicode Standard in the version specified by the Character class. The category names are those defined in the Standard, both normative and informative.
From Unicode Technical Standard #18, we find that C is defined to match any Other General_Category value, and that support for this is part of the requirements for Level 1 conformance. Java implements \p{C} because it claims conformance to Level 1 of UTS #18.
It probably should support \p{Other}, but apparently it doesn't.
Worse, it's violating RL1.7, required for Level 1 conformance, which requires that matching happen by code point instead of code unit:
To meet this requirement, an implementation shall handle the full range of Unicode code points, including values from U+FFFF to U+10FFFF. In particular, where UTF-16 is used, a sequence consisting of a leading surrogate followed by a trailing surrogate shall be handled as a single code point in matching.
There should be no matches for \p{C} in your test string, because your test string should be matched as a single emoji code point with General_Category=So (Other Symbol) instead of as two surrogates.
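One way to see the code-point-level classification that RL1.7 calls for is to ask the Character API directly, since Character.getType(int) works on whole code points. A small sketch (the class name is mine):
public class CodePointCategory {
    public static void main(String[] args) {
        String poo = new String(Character.toChars(0x1F4A9));
        // Iterating by code point (not by char) classifies the emoji once,
        // as OTHER_SYMBOL (General_Category=So), not as two surrogates.
        poo.codePoints().forEach(cp ->
            System.out.printf("U+%04X -> type=%d (OTHER_SYMBOL=%d)%n",
                cp, Character.getType(cp), Character.OTHER_SYMBOL));
    }
}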
According to https://regex101.com/, \p{C} matches
Invisible control characters and unused code points
(The \ has to be escaped because this is a Java string literal, so the string "\\p{C}" is the regex \p{C}.)
I'm guessing this is a quick "hacked string check", since a \p{C} character should probably never appear inside a valid, human-readable string. The author should have left a comment, though, because what a check actually tests and what its author wanted to test are usually two different things.
Anything other than a valid two-letter Unicode category code, or a single letter that begins a Unicode category code, is illegal, since Java supports only one- and two-letter abbreviations for Unicode categories. That's why \p{Other} doesn't work here.
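A quick sketch to confirm which property names the Pattern class accepts (the long name is rejected when the pattern is compiled, not silently ignored):
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class CategoryNames {
    public static void main(String[] args) {
        Pattern.compile("\\p{C}");         // fine: one-letter category abbreviation
        Pattern.compile("\\p{Cs}");        // fine: two-letter category abbreviation
        try {
            Pattern.compile("\\p{Other}"); // long Unicode category name
        } catch (PatternSyntaxException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}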
\p{C} matches twice on Unicode characters above U+FFFF, such as PILE OF POO.
Right. Java uses UTF-16 internally, and 💩 is encoded as two 16-bit code units (0xD83D 0xDCA9) that form a surrogate pair (a high surrogate followed by a low surrogate). Since \p{C} matches each half separately
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
you see two matches in the result set.
What is the likely intent of the original pattern, [\\p{C}&&\\S]?
I don't see a very compelling reason, but it seems the developer was worried about characters in category Other (for example, to keep spammy emoji out of an email's subject line) and simply tried to block them.
As for the bonus question: the expression [\\p{C}&&\\S] finds control-type characters, excluding whitespace characters such as tabs or line feeds. These characters have no value in regular mail, so it is a good idea to filter them out (or, as in this case, to declare the email content faulty). Be aware that the double backslashes (\\) are only needed to escape the expression for the Java compiler; the actual regular expression is [\p{C}&&\S].
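To make the intent concrete, here is a minimal sketch of such a validator (the class and method names are hypothetical, but the pattern and the "Invalid string" message follow the question):
import java.util.regex.Pattern;

public class MailTextValidator {
    // \p{C} is General_Category "Other"; intersecting it with \S
    // (non-whitespace) leaves control-type characters that are not
    // ordinary tabs or line breaks.
    private static final Pattern INVALID = Pattern.compile("[\\p{C}&&\\S]");

    static void validate(String s) {
        if (INVALID.matcher(s).find()) {
            throw new IllegalArgumentException("Invalid string");
        }
    }

    public static void main(String[] args) {
        validate("Hello,\r\n\tworld!");   // OK: CR, LF and TAB are whitespace
        try {
            validate("Hello\u0000world"); // NUL is \p{C} and not \s
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // Invalid string
        }
    }
}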
Our application sends strings which are then localized on the client side. Sometimes those are whole strings, sometimes only substrings, so we have to mark them. It would be best if this used only Unicode, as that wouldn't require any protocol changes.
Example:
"Length: (mark)10(mark)"
where 10 is a length in cm that should be converted so it is displayed in inches or mm.
Are Unicode special characters (U+FFF0..U+FFFF) the right choice for marking such special substrings in text?
No, code points in the Specials block have their own uses. Using them for other purposes may result in unexpected effects. Even if you code all the processing yourself, the incoming data might contain those code points. It is of course possible to detect them and filter them out, but it is better to use code points that cannot clash with any assigned code points.
Use code points in the range U+FDD0..U+FDEF. They are designated as "noncharacters" and are intended for internal use inside an application. See the Unicode FAQ section "Private-Use Characters, Noncharacters & Sentinels".
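For illustration, a minimal Java sketch using two noncharacters as start/end markers (the constant names and the choice of U+FDD0/U+FDD1 are just an example):
public class Markers {
    // U+FDD0 and U+FDD1 are noncharacters: guaranteed never to be assigned,
    // so they cannot clash with real text coming in from outside.
    static final char MARK_START = '\uFDD0';
    static final char MARK_END   = '\uFDD1';

    static String mark(String value) {
        return MARK_START + value + MARK_END;
    }

    public static void main(String[] args) {
        String msg = "Length: " + mark("10");
        // The client can later locate and convert the marked span:
        int from = msg.indexOf(MARK_START) + 1;
        int to   = msg.indexOf(MARK_END);
        System.out.println("marked value = " + msg.substring(from, to)); // 10
    }
}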
I am trying to detect special characters in a string. One option is to build a set of characters that count as special and then check against it with some Java logic. But then I have to make sure I include all the special characters.
Is there a better way of doing this?
You need to decide what constitutes a special character. One method that may be of interest is Character.getType(char) which returns an int which will match one of the constant values of Character such as Character.LOWERCASE_LETTER or Character.CURRENCY_SYMBOL. This lets you determine the general category of a character, and then you need to decide which categories count as 'special' characters and which you will accept as part of text.
Note that Java uses UTF-16 to encode its char and String values, and consequently you may need to deal with supplementary characters (see the link in the description of the getType method). This is a nuisance, but the Character class offers methods that help you detect this situation and work around it. See the Character.isSupplementaryCodePoint(int) and Character.codePointAt(char[], int) methods.
Also be aware that Java 6 is far less knowledgeable about Unicode than Java 7 is. The newest version of Java has added far more to its Unicode database, but code running on Java 6 will not recognise some (actually quite a few) exotic code points as being part of a Unicode block or general category, so you need to bear this in mind when writing your code.
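Putting those pieces together, here is a rough sketch that iterates by code point and classifies each one with Character.getType; which categories count as "special" (the whitelist below) is entirely an example choice:
public class SpecialCharScan {
    // Example policy: letters, digits and spaces are "normal",
    // everything else is "special". Adjust the whitelist to taste.
    static boolean isSpecial(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UPPERCASE_LETTER:
            case Character.LOWERCASE_LETTER:
            case Character.OTHER_LETTER:
            case Character.DECIMAL_DIGIT_NUMBER:
            case Character.SPACE_SEPARATOR:
                return false;
            default:
                return true;
        }
    }

    public static void main(String[] args) {
        String s = "abc £10 \uD83D\uDCA9";
        // Advance by charCount so surrogate pairs are handled as one unit.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("U+%04X special=%b%n", cp, isSpecial(cp));
            i += Character.charCount(cp);
        }
    }
}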
It sounds like you would like to remove all control characters from a Unicode string. You can accomplish this by using a Unicode character category identifier in a regex. The category "Cc" contains those characters, see http://www.fileformat.info/info/unicode/category/Cc/list.htm.
myString = myString.replaceAll("[\\p{Cc}]+", "");
I know there are other questions, but they seem to have answers that are assumptions rather than being definitive.
My limited understanding of cookie syntax is that:
semi-colons are already used to separate cookies attributes within a single cookie.
equals signs are used to separate cookie names and values
colons are used to separate multiple cookies within a header.
Are there any other "special" characters?
Some other Q&As suggest base64-encoding the value, but the encoded output may of course include equals signs, which are not valid.
I have also seen suggestions that values may be quoted; this, however, leads to other questions:
Do the special characters need to be quoted?
Do quoted values support the usual backslash escaping mechanisms?
RFC
I read a few RFCs, including some of the many cookie RFCs, but I am still unsure, as each cross-references another RFC and so on, with no definitive, simple explanation or sample that "answers" my query.
Hopefully no one will say "read the RFC", because the question then becomes: which RFC?
I think I have also read that different browsers have slightly different rules, so please note this in your answers if it matters.
The latest RFC is 6265, and it states that previous Cookie RFCs are obsoleted.
Here's what the syntax rules in the RFC say:
cookie-pair = cookie-name "=" cookie-value
cookie-name = token
cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
; US-ASCII characters excluding CTLs,
; whitespace DQUOTE, comma, semicolon,
; and backslash
Thus:
The special characters are white-space characters, double quote, comma, semicolon and backslash. Equals is not a special character.
The special characters cannot be used at all, with the exception that double quotes may surround the value.
Special characters cannot be quoted.
Backslash does not act as an escape.
It follows that base-64 encoding can be used, because equals is not special.
Finally, from what I can tell, the RFC 6265 cookie values are defined so that they will work with any browser that implements any of the Cookie RFCs. However, if you tried to use cookie values that don't conform to RFC 6265 (but arguably do conform to earlier RFCs), you may find that cookie behavior varies across browsers.
In short, conform to the letter of RFC 6265 and you should be fine.
If you need to pass cookie values that include any of the forbidden characters, your application needs to do its own encoding and decoding of the values, e.g. using base64.
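For example, a minimal Java sketch (the class name is hypothetical) using the URL-safe Base64 alphabet of java.util.Base64, which stays within RFC 6265's cookie-octet set once the '=' padding is dropped:
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class CookieCodec {
    // URL-safe Base64 uses only [0-9A-Za-z_-]; dropping the '=' padding
    // keeps the whole value inside the allowed cookie-octet characters.
    static String encode(String value) {
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(value.getBytes(StandardCharsets.UTF_8));
    }

    static String decode(String cookieValue) {
        // The URL-safe decoder accepts unpadded input.
        return new String(Base64.getUrlDecoder().decode(cookieValue),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String enc = encode("name=\"value\"; with, specials\\");
        System.out.println(enc);         // safe to use as a cookie value
        System.out.println(decode(enc)); // round-trips exactly
    }
}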
Base64 was mentioned, so here is a cooked cookie solution using it. The functions implement a modified version of Base64 that uses only [0-9a-zA-Z_-].
You can use it for both the name and the value part of cookies, and it is binary safe, as they say.
The gzdeflate/gzinflate step wins back the 30% or so of space added by Base64; I could not resist using it. Note that PHP's gzdeflate/gzinflate are available at most hosting companies, but not all.
// write
setcookie(
    'mycookie',
    code_base64_FROM_bytes_cookiesafe(gzdeflate($mystring)),
    time() + 365*24*3600
);

// read
$mystring = gzinflate(code_bytes_FROM_base64_cookiesafe($_COOKIE['mycookie']));
function code_base64_FROM_bytes_cookiesafe($bytes)
{
    // safe for both the name and value part of a cookie: [0-9a-zA-Z_-]
    return strtr(base64_encode($bytes), array(
        '/'  => '_',
        '+'  => '-',
        '='  => '',
        ' '  => '',
        "\n" => '',
        "\r" => '',
    ));
}
function code_bytes_FROM_base64_cookiesafe($enc)
{
    // add back the stripped '=' padding: the length must be a multiple of 4
    $pad = strlen($enc) % 4;
    if ($pad) {
        $enc .= str_repeat('=', 4 - $pad);
    }
    $enc = chunk_split($enc); // inserts \r\n every 76 chars; base64_decode skips whitespace
    return base64_decode(strtr($enc, array(
        '_' => '/',
        '-' => '+',
    )));
}
I have to use HttpClient 2.0 (I cannot use anything newer), and I am running into the following issue: when I use the method (POST, in this case), it encodes the parameters as hexadecimal ASCII escapes, and spaces are turned into "+" (something the receiver doesn't want).
Does anyone know a way to avoid it?
Thanks a lot.
Even your browser does that, converting the space character into +. See http://download.oracle.com/javase/1.5.0/docs/api/java/net/URLEncoder.html
It URL-encodes the string, converting it to the application/x-www-form-urlencoded form (UTF-8 is the recommended charset).
When encoding a String, the following rules apply:
The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.
The special characters ".", "-", "*", and "_" remain the same.
The space character " " is converted into a plus sign "+".
All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte. The recommended encoding scheme to use is UTF-8. However, for compatibility reasons, if an encoding is not specified, then the default encoding of the platform is used.
Also, see here http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1
Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., '%0D%0A').
The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.
To answer your question: if you do not want the encoded form, I guess URLDecoder.decode will help you undo the encoding on the receiving side.
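A small demonstration of the round trip (UTF-8 as the charset is my assumption; pick whatever your receiver expects):
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class FormCodecDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String raw = "degrees celsius";
        // URLEncoder turns the space into '+' ...
        String encoded = URLEncoder.encode(raw, "UTF-8");
        System.out.println(encoded); // degrees+celsius
        // ... and URLDecoder.decode reverses both '+' and the %HH escapes.
        System.out.println(URLDecoder.decode(encoded, "UTF-8")); // degrees celsius
    }
}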
You could in theory avoid this by constructing the query string or request body containing parameters by hand.
But this would be a bad thing to do, because the HTML, HTTP, URL and URI specs all mandate that reserved characters in request parameters are encoded. And if you violate this, you may find that server-side HTTP stacks, proxies and so on reject your requests as invalid, or misbehave in other ways.
The correct way to deal with this issue is to do one of the following:
If the server is implemented with Java EE technology, use the relevant servlet API methods (e.g. ServletRequest.getParameter(...)) to fetch the request parameters. These will take care of any decoding for you.
If the parameters are part of a URL query string, you can instantiate a Java URL or URI object and use the getter to return you the query with the encoding removed.
If your server is implemented some other way (or if you need to unpick the request URL's query string or POST data yourself), then use URLDecoder.decode or equivalent to remove the % encoding and replace the +'s ... after you have figured out where the query and parameter boundaries, etc., are.
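As a rough sketch of that last option (splitting on the boundaries first, decoding the pieces afterwards; UTF-8 and the class name are my assumptions):
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryStringParser {
    // Split a raw query string (or x-www-form-urlencoded body) on '&' and
    // '=' first, and only then decode each piece, so that encoded '&'/'='
    // characters inside values cannot be mistaken for boundaries.
    static Map<String, String> parse(String query) throws UnsupportedEncodingException {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String name  = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? ""   : pair.substring(eq + 1);
            params.put(URLDecoder.decode(name, "UTF-8"),
                       URLDecoder.decode(value, "UTF-8"));
        }
        return params;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(parse("a=1&msg=hello+world&sym=%C2%B0C"));
        // {a=1, msg=hello world, sym=°C}
    }
}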