Why do DecimalFormat ".#" and "0.#" give different results for 23.0?

Why does java.text.DecimalFormat produce the following results:
new DecimalFormat("0.#").format(23.0) // result: "23"
new DecimalFormat(".#").format(23.0) // result: "23.0"
I would have expected the result to be 23 in both cases, because special character # omits zeros. How does the leading special character 0 affect the fraction part? (Tried to match/understand it with the BNF given in javadoc, but failed to do so.)

The second format seems to be invalid according to the JavaDoc, but somehow it parses without error anyway.
Pattern:
    PositivePattern
    PositivePattern ; NegativePattern
PositivePattern:
    Prefix(opt) Number Suffix(opt)
NegativePattern:
    Prefix(opt) Number Suffix(opt)
Prefix:
    any Unicode characters except \uFFFE, \uFFFF, and special characters
Suffix:
    any Unicode characters except \uFFFE, \uFFFF, and special characters
Number:
    Integer Exponent(opt)
    Integer . Fraction Exponent(opt)
Integer:
    MinimumInteger
    #
    # Integer
    # , Integer
MinimumInteger:
    0
    0 MinimumInteger
    0 , MinimumInteger
Fraction:
    MinimumFraction(opt) OptionalFraction(opt)
MinimumFraction:
    0 MinimumFraction(opt)
OptionalFraction:
    # OptionalFraction(opt)
Exponent:
    E MinimumExponent
MinimumExponent:
    0 MinimumExponent(opt)
In this case I'd expect the behaviour of the formatter to be undefined. That is, it may produce any old thing and we can't rely on that being consistent or meaningful in any way. So, I don't know why you're getting the 23.0, but you can assume that it's nonsense that you should avoid in your code.
Update:
I've just run a debugger through Java 7's DecimalFormat library. The code not only explicitly says that '.#' is allowed, there is a comment in there (java.text.DecimalFormat:2582-2593) that says it's allowed, and an implementation that allows it (line 2597). This seems to be in violation of the documented BNF for the pattern.
Given that this is not documented behaviour, you really shouldn't rely on it as it's liable to change between versions of Java or even library implementations.

The following source comment explains the rather unintuitive handling of ".#". Lines 3383-3385 in my DecimalFormat.java file (JDK 8) have the following comment:
// Handle patterns with no '0' pattern character. These patterns
// are legal, but must be interpreted. "##.###" -> "#0.###".
// ".###" -> ".0##".
Seems like the developers have chosen to interpret ".#" as ".0" (by the same rule that turns ".###" into ".0##"), instead of what you expected ("0.#").
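One way to see how the formatter interpreted each pattern is to ask it for its normalized pattern and its minimum fraction digits. This is only a sketch: the class and method names are made up for illustration, and the symbols are pinned to Locale.US so the decimal separator is '.'. No expected output is given for ".#", since that behaviour is undocumented and may differ between Java versions.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Hypothetical demo class; the name is not from the original posts.
class DecimalFormatPatternDemo {

    // Builds a formatter with fixed symbols so the output does not depend on the default locale.
    static DecimalFormat formatter(String pattern) {
        return new DecimalFormat(pattern, DecimalFormatSymbols.getInstance(Locale.US));
    }

    // Reports how a pattern was interpreted and what it produces for one value.
    static String describe(String pattern, double value) {
        DecimalFormat df = formatter(pattern);
        return "pattern=" + pattern
                + " normalized=" + df.toPattern()
                + " minFractionDigits=" + df.getMinimumFractionDigits()
                + " result=" + df.format(value);
    }

    public static void main(String[] args) {
        System.out.println(describe("0.#", 23.0));
        System.out.println(describe(".#", 23.0));
    }
}
```

If the two lines report different minimum fraction digits, that difference is what produces "23" in one case and "23.0" in the other.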


Undocumented Java regex character class: \p{C}

I found an interesting regex in a Java project: "[\\p{C}&&\\S]"
I understand that the && means "set intersection", and \S is "non-whitespace", but what is \p{C}, and is it okay to use?
The java.util.regex.Pattern documentation doesn't mention it. The only similar class on the list is \p{Cntrl}, but they behave differently: they both match on control characters, but \p{C} matches twice on Unicode characters above U+FFFF, such as PILE OF POO:
public class StrangePattern {
    public static void main(String[] argv) {
        // As far as I can tell, this is the simplest way to create a String
        // with code points above U+FFFF.
        String poo = new String(Character.toChars(0x1F4A9));
        System.out.println(poo); // prints `💩`
        System.out.println(poo.replaceAll("\\p{C}", "?")); // prints `??`
        System.out.println(poo.replaceAll("\\p{Cntrl}", "?")); // prints `💩`
    }
}
The only mention I've found anywhere is here:
\p{C} or \p{Other}: invisible control characters and unused code points.
However, \p{Other} does not seem to exist in Java, and the matching code points are not unused.
My Java version info:
$ java -version
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)
Bonus question: what is the likely intent of the original pattern, "[\\p{C}&&\\S]"? It occurs in a method which validates a string before it is sent in an email: if that pattern is matched, an exception with the message "Invalid string" is raised.
Buried down in the Pattern docs under Unicode Support, we find the following:
This class is in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression, plus RL2.1 Canonical Equivalents.
...
Categories may be specified with the optional prefix Is: Both \p{L}
and \p{IsL} denote the category of Unicode letters. Same as scripts
and blocks, categories can also be specified by using the keyword
general_category (or its short form gc) as in general_category=Lu or
gc=Lu.
The supported categories are those of The Unicode Standard in the
version specified by the Character class. The category names are those
defined in the Standard, both normative and informative.
From Unicode Technical Standard #18, we find that C is defined to match any Other General_Category value, and that support for this is part of the requirements for Level 1 conformance. Java implements \p{C} because it claims conformance to Level 1 of UTS #18.
It probably should support \p{Other}, but apparently it doesn't.
Worse, it's violating RL1.7, required for Level 1 conformance, which requires that matching happen by code point instead of code unit:
To meet this requirement, an implementation shall handle the full range of Unicode code points, including values from U+FFFF to U+10FFFF. In particular, where UTF-16 is used, a sequence consisting of a leading surrogate followed by a trailing surrogate shall be handled as a single code point in matching.
There should be no matches for \p{C} in your test string, because your test string should be matched as a single emoji code point with General_Category=So (Other Symbol) instead of as two surrogates.
According to https://regex101.com/, \p{C} matches
Invisible control characters and unused code points
(the \ has to be escaped because java string, so string \\p{C} is regex \p{C})
I'm guessing this is a 'hacked string check', as a \p{C} character probably should never appear inside a valid (character-filled) string, but the author should have left a comment, since what they checked and what they wanted to check are usually two different things.
Anything other than a valid two-letter Unicode category code or a single letter that begins a Unicode category code is illegal since Java supports only single letter and two-letter abbreviations for Unicode categories. That's why \p{Other} doesn't work here.
\p{C} matches twice on Unicode characters above U+FFFF, such as PILE
OF POO.
Right. Java uses UTF-16 encoding internally for Unicode characters, and 💩 is encoded as two 16-bit code units (0xD83D 0xDCA9), called a surrogate pair (a high surrogate followed by a low surrogate), and since \p{C} matches each half separately
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16
encoding.
you see two matches in the result set.
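The difference between code units and code points that this answer relies on can be checked directly with the Character and String APIs. The sketch below (class and method names are illustrative) only inspects the string. It does not try to reproduce the regex result from the question.

```java
// Hypothetical demo class; the name is illustrative.
class SurrogatePairDemo {

    // Number of UTF-16 code units, which is what String.length() counts.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String poo = new String(Character.toChars(0x1F4A9));
        System.out.println(codeUnits(poo));  // 2
        System.out.println(codePoints(poo)); // 1
        // Each half, looked at on its own, is in category Cs (surrogate) ...
        System.out.println(Character.getType(poo.charAt(0)) == Character.SURROGATE);    // true
        System.out.println(Character.getType(poo.charAt(1)) == Character.SURROGATE);    // true
        // ... while the code point as a whole is in category So (other symbol).
        System.out.println(Character.getType(poo.codePointAt(0)) == Character.OTHER_SYMBOL); // true
    }
}
```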
What is the likely intent of the original pattern, [\\p{C}&&\\S]?
I don't see a very good reason, but it seems the developer was worried about characters in category Other (for example, wanting to avoid spammy emoji in an email's subject) and simply tried to block them.
As for the Bonus question: the expression [\\p{C}&&\\S] finds control characters excluding whitespace characters like tabs or line feeds in Java. These characters have no value in regular mails and therefore it is a good idea to filter them away (or, as in this case, declare an email content as faulty). Be aware that the double backslashes (\\) are only necessary to escape the expression for Java processing. The correct regular expression would be: [\p{C}&&\S]
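As a rough illustration of that reading, here is a small validator built around the same pattern. The class and method names are assumptions for the example, not the code of the project mentioned in the question.

```java
import java.util.regex.Pattern;

// Hypothetical validator; names are made up for this sketch.
class ControlCharCheck {

    // Same pattern as in the question: a character in category "Other" that is not whitespace.
    private static final Pattern NON_WHITESPACE_OTHER = Pattern.compile("[\\p{C}&&\\S]");

    // Returns true when the text contains no such character.
    static boolean isClean(String text) {
        return !NON_WHITESPACE_OTHER.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(isClean("plain text"));        // true
        System.out.println(isClean("tab\tand\nnewline")); // true: both are whitespace
        System.out.println(isClean("bell\u0007"));        // false: U+0007 is a control character
    }
}
```

Tabs and line feeds are control characters too, but the intersection with \S removes them from the class, so only the non-whitespace ones are rejected.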

Replacing Emoji Unicode Range from Arabic Tweets using Java

I am trying to replace emoji from Arabic tweets using java.
I used this code:
String line = "اييه تقولي اجل الارسنال تعادل امس بعد ما كان فايز 😂😂";
Pattern unicodeOutliers = Pattern.compile("([\u1F601-\u1F64F])", Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(line);
line = unicodeOutlierMatcher.replaceAll(" $1 ");
But it is not replacing them. Even if I match only the character itself, "\u1F602", it is not replaced. Maybe that is because there are 5 digits after the u? I am not sure; it's just a guess.
Note that:
1- the emoticon at the end of the tweet (😂) is "U+1F602", which is "face with tears of joy"
2- this question is not a duplicate for this question.
Any Ideas?
From the Javadoc for the Pattern class
A Unicode character can also be represented in a regular-expression by
using its Hex notation(hexadecimal code point value) directly as
described in construct \x{...}, for example a supplementary character
U+2011F can be specified as \x{2011F}, instead of two consecutive
Unicode escape sequences of the surrogate pair \uD840\uDD1F.
This means that the regular expression that you're looking for is ([\x{1F601}-\x{1F64F}]). Of course, when you write this as a Java String literal, you must escape the backslashes.
Pattern unicodeOutliers = Pattern.compile("([\\x{1F601}-\\x{1F64F}])");
Note that the construct \x{...} is only available from Java 7.
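Put together with the replacement from the question, this could look roughly as follows. The class and method names are illustrative, and the sample text is a shortened stand-in for the tweet, with the emoticon written as its surrogate pair so the source file stays ASCII.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names are made up for this sketch.
class EmoticonSpacer {

    // Code points U+1F601..U+1F64F, written with the \x{...} construct (Java 7+).
    private static final Pattern EMOTICON = Pattern.compile("([\\x{1F601}-\\x{1F64F}])");

    // Surrounds every matched emoticon with spaces, as the replaceAll in the question intended.
    static String padEmoticons(String line) {
        return EMOTICON.matcher(line).replaceAll(" $1 ");
    }

    public static void main(String[] args) {
        // U+1F602 (face with tears of joy), twice, as surrogate pairs.
        String tweet = "goal \uD83D\uDE02\uD83D\uDE02";
        System.out.println(padEmoticons(tweet));
    }
}
```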
Java 5 and 6
If you are stuck running your program on Java 5 or 6 JVM, and you want to match characters in the range from U+1F601 to U+1F64F, use surrogate pairs in the character class:
Pattern emoticons = Pattern.compile("[\uD83D\uDE01-\uD83D\uDE4F]");
This method is valid even in Java 7 and above, since in Sun/Oracle's implementation, if you decompile Pattern.compile() method, the String containing the pattern is converted into an array of code points before compilation.
Java 7 and above
You can use the construct \x{...} in David Wallace's answer, which is available from Java 7.
Or alternatively, you can also specify the whole Emoticons Unicode block, which spans from code point U+1F600 (instead of U+1F601) to U+1F64F.
Pattern emoticons = Pattern.compile("\\p{InEmoticons}");
Since Emoticons block support is added in Java 7, this method is also only valid from Java 7.
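The one-code-point difference between the explicit range and the block can be checked with a few probes. This is a sketch for Java 7 or later; the class, field and method names are illustrative.

```java
import java.util.regex.Pattern;

// Hypothetical comparison class; names are made up for this sketch.
class EmoticonRangeVsBlock {

    // Range U+1F601..U+1F64F, written with literal surrogate pairs in the Java string.
    static final Pattern RANGE = Pattern.compile("[\uD83D\uDE01-\uD83D\uDE4F]");

    // The whole Emoticons block, U+1F600..U+1F64F.
    static final Pattern BLOCK = Pattern.compile("\\p{InEmoticons}");

    // Tests a single code point against a pattern.
    static boolean matches(Pattern p, int codePoint) {
        return p.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(RANGE, 0x1F602)); // true
        System.out.println(matches(BLOCK, 0x1F602)); // true
        System.out.println(matches(RANGE, 0x1F600)); // false: just below the range
        System.out.println(matches(BLOCK, 0x1F600)); // true: first code point of the block
    }
}
```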
Although the other methods are preferred, you can specify supplemental characters by specifying the escape in the regex. While there is no reason to do this in the source code, this change in Java 7 corrects the behavior in applications where regex is used for searching, and directly pasting the character is not possible.
Pattern emoticons = Pattern.compile("[\\uD83D\\uDE01-\\uD83D\\uDE4F]");
/!\ Warning
Never ever mix the syntax together when you specify a supplemental code point, like:
"[\\uD83D\uDE01-\\uD83D\\uDE4F]"
"[\uD83D\\uDE01-\\uD83D\\uDE4F]"
Those will specify to match the code point U+D83D and the range from code point U+DE01 to code point U+1F64F in Oracle's implementation.
Note
In Java 5 and 6, Oracle's implementation of Pattern.u() doesn't collapse valid regex-escaped surrogate pairs such as "\\uD83D\\uDE01". As a result, the pattern is interpreted as 2 lone surrogates, which will fail to match anything.

BigDecimal.floatValue versus Float.valueOf

Has anybody ever come across this:
System.out.println("value of: " + Float.valueOf("3.0f")); // ok, prints 3.0
System.out.println(new BigDecimal("3.0f").floatValue()); // NumberFormatException
I would argue that the lack of consistency here is a bug, where BigDecimal(String) doesn't follow the same spec as Float.valueOf() (I checked the JDK doc).
I'm using a library that forces me to go through BigDecimal, but it can happen that I have to send "3.0f" there. Is there a known workaround? (The BigDecimal is created inside the library, so I cannot access it.)
The second example would never work since the documentation doesn't mention anything concerning an f in the String:
The String representation consists of an optional sign, '+' ('\u002B') or '-' ('\u002D'), followed by a sequence of zero or more decimal digits ("the integer"), optionally followed by a fraction, optionally followed by an exponent...
A workaround could be simply stripping the f off of the String. It should be valid then.
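A minimal sketch of that workaround, assuming the string can be cleaned up before it is handed to the library. The helper name is hypothetical, and it only deals with a single trailing f, F, d or D; other forms that Float.valueOf accepts, such as hexadecimal floating-point literals or "NaN", are not handled.

```java
import java.math.BigDecimal;

// Hypothetical helper; the name is made up for this sketch.
class LenientDecimalParser {

    // Removes one trailing Java float/double suffix (f, F, d, D) and parses the rest.
    static BigDecimal parse(String text) {
        String s = text.trim();
        if (!s.isEmpty()) {
            char last = s.charAt(s.length() - 1);
            if (last == 'f' || last == 'F' || last == 'd' || last == 'D') {
                s = s.substring(0, s.length() - 1);
            }
        }
        // Still throws NumberFormatException for any other invalid input.
        return new BigDecimal(s);
    }

    public static void main(String[] args) {
        System.out.println(parse("3.0f").floatValue()); // 3.0
        System.out.println(parse("3.0"));               // 3.0
    }
}
```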
BigDecimal has its own documentation. As per BigDecimal javadoc
this constructor is compatible with the values returned by
Float.toString(float) and Double.toString(double).
It doesn't mention anything about Float.valueOf().

Checking for specific strings with regex

I have a list of arbitrary length of Type String, I need to ensure each String element in the list is alphanumerical or numerical with no spaces and special characters such as - \ / _ etc.
Example of accepted strings include:
J0hn-132ss/sda
Hdka349040r38yd
Hd(ersd)3r4y743-2\d3
123456789
Examples of unacceptable strings include:
Hello
Joe
King
etc basically no words.
I’m currently using stringInstance.matches("regex") but not too sure on how to write the appropriate expression
if (str.matches("^[a-zA-Z0-9_/-\\|]*$")) return true;
else return false;
This method will always return true for words that don't conform to the format I mentioned.
A description of the regex I’m looking for in English would be something like:
Any String, where the String contains characters from (a-zA-Z AND 0-9 AND special characters)
OR (0-9 AND Special characters)
OR (0-9)
Edit: I have come up with the following expression, which works, but I feel that it may be bad in terms of being unclear or too complex.
The expression:
(([\\pL\\pN\\pP]+[\\pN]+|[\\pN]+[\\pL\\pN\\pP]+)|([\\pN]+[\\pP]*)|([\\pN]+))+
I've used this website to help me: http://xenon.stanford.edu/~xusch/regexp/analyzer.html
Note that I’m still new to regex
WARNING: “Never” Write A-Z
All instances of ranges like A-Z or 0-9 that occur outside an RFC definition are virtually always ipso facto wrong in Unicode. In particular, things like [A-Za-z] are horrible antipatterns: they’re sure giveaways that the programmer has a caveman mentality about text that is almost wholly inappropriate this side of the Millennium. The Unicode patterns work on ASCII, but the ASCII patterns break on Unicode, sometimes in ways that leave you open to security violations. Always write the Unicode version of the pattern no matter whether you are using 1970s data or modern Unicode, because that way you won’t screw up when you actually use real Java character data. It’s like the way you use your turn signal even when you “know” there is no one behind you, because if you’re wrong, you do no harm, whereas the other way, you very most certainly do. Get used to using the 7 Unicode categories:
\pL for Letters. Notice how \pL is a lot shorter to type than [A-Za-z].
\pN for Numbers.
\pM for Marks that combine with other code points.
\pS for Symbols, Signs, and Sigils. :)
\pP for Punctuation.
\pZ for Separators like spaces (but not control characters)
\pC for other invisible formatting and Control characters, including unassigned code points.
Solution
If you just want a pattern, you want
^[\pL\pN]+$
although in Java 7 you can do this:
(?U)^\w+$
assuming you don’t mind underscores and letters with arbitrary combining marks. Otherwise you have to write the very awkward:
(?U)^[[:alpha:]\pN]+$
The (?U) is new to Java 7. It corresponds to the Pattern class’s UNICODE_CHARACTER_CLASSES compilation flag. It switches the POSIX character classes like [:alpha:] and the simple shortcuts like \w to actually work with the full Java character set. Normally, they work only on the 1970sish ASCII set, which can be a security hole.
There is no way to make Java 7 always do this with its patterns without being told to, but you can write a frontend function that does this for you. You just have to remember to call yours instead.
Note that patterns in Java before v1.7 cannot be made to work according to the way UTS#18 on Unicode Regular Expressions says they must. Because of this, you leave yourself open to a wide range of bugs, infelicities, and paradoxes if you do not use the new Unicode flag. For example, the trivial and common pattern \b\w+\b will not be found to match anywhere at all within the string "élève", let alone in its entirety.
Therefore, if you are using patterns in pre-1.7 Java, you need to be extremely careful, far more careful than anyone ever is. You cannot use any of the POSIX charclasses or charclass shortcuts, including \w, \s, and \b, all of which break on anything but stone-age ASCII data. They cannot be used on Java’s native character set.
In Java 7, they can — but only with the right flag.
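The effect of the flag on \w can be seen with a short comparison. This is a sketch for Java 7 or later, with illustrative class and field names; the accented letters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are made up for this sketch.
class UnicodeWordDemo {

    // Default behaviour: \w only covers [a-zA-Z_0-9].
    static final Pattern ASCII_WORD = Pattern.compile("^\\w+$");

    // With the flag (equivalent to the embedded (?U)), \w follows the Unicode definition.
    static final Pattern UNICODE_WORD =
            Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASS);

    public static void main(String[] args) {
        String word = "\u00E9l\u00E8ve"; // "eleve" with e-acute and e-grave
        System.out.println(ASCII_WORD.matcher(word).matches());   // false
        System.out.println(UNICODE_WORD.matcher(word).matches()); // true
    }
}
```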
It is possible to rephrase the description of the needed regex as "contains at least one number", so the following would work: /.*[\pN].*/. Or, if you would like to limit your search to letters, numbers and punctuation, you should use /[\pL\pN\pP]*[\pN][\pL\pN\pP]*/. I've tested it on your examples and it works fine.
You can further refine your regexp by using lazy quantifiers like this /.*?[\pN].*?/. This way it would fail faster if there are no numbers.
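In Java, the letters, numbers and punctuation variant can be tried against the examples from the question like this. The class and method names are illustrative. One caveat of this sketch: characters in the Symbol categories, such as + or $, are not punctuation and would be rejected.

```java
import java.util.regex.Pattern;

// Hypothetical checker; names are made up for this sketch.
class ContainsNumberCheck {

    // Letters, numbers and punctuation only, with at least one number somewhere.
    private static final Pattern AT_LEAST_ONE_NUMBER =
            Pattern.compile("[\\pL\\pN\\pP]*\\pN[\\pL\\pN\\pP]*");

    // matches() requires the whole string to fit the pattern, so no anchors are needed.
    static boolean isAccepted(String s) {
        return AT_LEAST_ONE_NUMBER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAccepted("J0hn-132ss/sda"));         // true
        System.out.println(isAccepted("Hd(ersd)3r4y743-2\\d3"));  // true
        System.out.println(isAccepted("123456789"));              // true
        System.out.println(isAccepted("Hello"));                  // false: no number
    }
}
```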
I would like to recommend a great book on regular expressions: Mastering Regular Expressions. It has a great introduction, an in-depth explanation of how regular expressions work, and a chapter on regular expressions in Java.
It looks like you just want to make sure that there are no spaces in the string. If so, you can do this very simply:
return str.indexOf(" ") == -1;
This will return true if there are no spaces (valid by my understanding of your rules), and false if there is a space anywhere in the string (invalid).
Here is a partial answer, which does 0-9 and special characters OR 0-9.
^([\d]+|[\\/\-_]*)*$
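This can be read as ((1 or more digits) OR (0 or more special chars \ / - _)) 0 or more times. It accepts strings made up of digits only, or of digits mixed with these special characters. Note that, as written, it does not actually require a digit: it also accepts the empty string and strings consisting only of special characters.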
I used regex tester to test several of the strings.
Adding alphabetic characters seems easy, but a repetition of the given regexp may be required.

How to check the ranges of numbers in ANTLR 3?

I know this might end up being language specific, so a Java or Python solution would be acceptable.
Given the grammar:
MONTH : DIGIT DIGIT ;
DIGIT : ('0'..'9') ;
I want a check constraint on MONTH to ensure the value is between 01 and 12. Where do I start looking, and how do I specify this constraint as a rule?
You can embed custom code by wrapping { and } around it. So you could do something like:
MONTH
  : DIGIT DIGIT
    {
      int month = Integer.parseInt(getText());
      // do your check here
    }
  ;
As you can see, I called getText() to get a hold of the matched text of the token.
Note that I assumed you're referencing this MONTH rule from another lexer rule. If you're going to throw an exception when month < 1 or month > 12, then whenever your source contains an illegal month value, none of the parser rules will ever be matched. Although lexer and parser rules can be mixed in one .g grammar file, the input source is first tokenized based on the lexer rules, and only once that has happened are the parser rules matched.
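The check that stands in for the "do your check here" comment could be as small as the helper below, which an action might call with the result of getText(). The class and method names are hypothetical, and what to do when the check fails (throw, report an error, emit a different token) is deliberately left open.

```java
// Hypothetical helper that a lexer action could call; names are made up for this sketch.
class MonthCheck {

    // Returns true when the two-digit token text denotes a month from 01 to 12.
    static boolean isValidMonth(String tokenText) {
        // The lexer rule DIGIT DIGIT guarantees that the text is exactly two decimal digits.
        int month = Integer.parseInt(tokenText);
        return month >= 1 && month <= 12;
    }

    public static void main(String[] args) {
        System.out.println(isValidMonth("01")); // true
        System.out.println(isValidMonth("12")); // true
        System.out.println(isValidMonth("13")); // false
        System.out.println(isValidMonth("00")); // false
    }
}
```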
You can use this free online utility Regex_For_Range to generate a regular expression for any continuous integer range. For the values 01-12 (with allowed leading 0's) the utility gives:
0*([1-9]|1[0-2])
From here you can see that if you want to constrain this to just the 2-digit strings '01' through '12', then adjust this to read:
0[1-9]|1[0-2]
For days 01-31 we get:
0*([1-9]|[12][0-9]|3[01])
And for the years 2000-2099 the expression is simply:
20[0-9]{2}
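Outside of a grammar, the same expressions can be used with java.util.regex for a quick sanity check. In this sketch the class and field names are illustrative; matches() is used so that each expression has to cover the whole input.

```java
import java.util.regex.Pattern;

// Hypothetical holder for the range expressions above; names are made up for this sketch.
class DateFieldPatterns {

    static final Pattern MONTH = Pattern.compile("0[1-9]|1[0-2]");               // exactly 01..12
    static final Pattern DAY = Pattern.compile("0*([1-9]|[12][0-9]|3[01])");     // 1..31, leading zeros allowed
    static final Pattern YEAR = Pattern.compile("20[0-9]{2}");                   // 2000..2099

    // Whole-string match of the input against one of the patterns.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(MONTH, "12"));  // true
        System.out.println(matches(MONTH, "13"));  // false
        System.out.println(matches(DAY, "31"));    // true
        System.out.println(matches(DAY, "32"));    // false
        System.out.println(matches(YEAR, "2099")); // true
        System.out.println(matches(YEAR, "2100")); // false
    }
}
```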
