Are escape characters in Java platform-dependent?

I just read this question about comparing "%n" and "\n"
What's up with Java's "%n" in printf?
The answer confirms that %n can be used across platforms, while \n cannot. So I wonder about other escape characters such as \t, \b, \', \", \\ and so on. Are they all platform-dependent just like \n?

The String escape codes mean the same thing on all platforms. They map to specified Unicode codepoints that in turn correspond to standard 7-bit ASCII control characters.
The only (theoretical) concern might be some native character set which didn't have a way of representing the equivalent of those codepoints / characters. I'm pretty sure you'd be OK on ancient 6-bit and 5-bit character sets from 50+ years ago.
However, if you are trying to output text in the platform preferred form, you do need to consider two things:
Different platforms use different character sequences as the preferred way to designate an "end of line". (Or line separator ...)
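The default TAB stop positions are not fixed by the character itself. Terminals and consoles conventionally place them every 8 character positions, but editors and IDEs are often configured for 4, so the rendered width of a tab can differ from one environment to another.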
So when you format data for fixed-width character display (e.g. on a "console"), you need to consider these platform dependencies.
There is also some uncertainty / variability about what will "happen" when you send those characters to a display, or include them in a file. But that's not really Java's fault, or anything that Java could address.
By contrast, "%n" ... in the context of a format string ... means the platform preferred line separator. So, on Linux/UNIX (and modern macOS) it means "\n", on Windows it means "\r\n", and on classic Mac OS (before OS X) it meant "\r". Note that this ONLY applies to format Strings; i.e. the first argument to String.format(...), or something else that does that style of formatting.
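As a minimal sketch of the first point (the class and method names are illustrative, not taken from the answer), the escape sequences can be checked against their ASCII values. These numbers are fixed by the Java language, so they should be the same on any platform:

```java
// Illustrative sketch: escape sequences denote fixed code points on every platform.
class EscapeCodePoints {
    // Widens a char to its numeric code so it can be compared with ASCII values.
    static int codeOf(char c) {
        return c;
    }

    public static void main(String[] args) {
        System.out.println(codeOf('\b'));  // 8  (backspace)
        System.out.println(codeOf('\t'));  // 9  (horizontal tab)
        System.out.println(codeOf('\n'));  // 10 (line feed)
        System.out.println(codeOf('\f'));  // 12 (form feed)
        System.out.println(codeOf('\r'));  // 13 (carriage return)
        System.out.println(codeOf('\"'));  // 34 (double quote)
        System.out.println(codeOf('\''));  // 39 (single quote)
        System.out.println(codeOf('\\'));  // 92 (backslash)
    }
}
```

What a terminal or printer then does with code 8 or 12 is outside Java's control, which is the variability the answer describes.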

\t \' \" and \\ will most likely act in the same way across all platforms as they represent real ASCII characters and there are not many platforms left that do not implement the full ASCII character set.
\b - well that's a different matter. That will almost certainly not do the same thing across any platforms as it is supposed to implement the BEL control code which, in itself, is not platform generic.
What were you hoping to get from your ... in the question?
Added: It seems \b is backspace - still unlikely to be cross-platform though.
Added: And as for \f - just don't use it as it will probably only ever do something that stops working when you replace your printer - if it ever actually does something at all.

Some platforms use \r\n as a new line, others \n. Using %n ensures that the right newline is emitted in the output.
That has nothing to do with the backslash character preceding characters to designate special characters like the ones you mentioned. Feel free to use it in your source code.

Related

printf: Difference between \n and %n [duplicate]

I'm reading Effective Java and it uses %n for the newline character everywhere. I have used \n rather successfully for newline in Java programs.
Which is the 'correct' one? What's wrong with \n ? Why did Java change this C convention?
From a quick google:
There is also one specifier that doesn't correspond to an argument. It is "%n" which outputs a line break. A "\n" can also be used in some cases, but since "%n" always outputs the correct platform-specific line separator, it is portable across platforms whereas "\n" is not.
Please refer
https://docs.oracle.com/javase/tutorial/java/data/numberformat.html
Original source
%n is portable across platforms
\n is not.
See the formatting string syntax in the reference documentation:
'n' (line separator): The result is the platform-specific line separator
While \n is the correct newline character for Unix-based systems, other systems may use different characters to represent the end of a line. In particular, Windows systems use \r\n, and classic Mac OS systems used \r.
By using %n in your format string, you tell Java to use the value returned by System.getProperty("line.separator"), which is the line separator for the current system.
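A small sketch of that relationship, assuming nothing beyond the standard library (the class name LineSeparatorDemo and the helper percentN are made up for illustration):

```java
// Illustrative sketch: %n in a format string expands to the platform line separator.
class LineSeparatorDemo {
    // Expands %n through the formatter, the same way printf would.
    static String percentN() {
        return String.format("%n");
    }

    public static void main(String[] args) {
        // Compares the expansion of %n with the JVM's line separator.
        System.out.println(percentN().equals(System.lineSeparator()));
        // The escape sequence, by contrast, is always one character with code 10.
        System.out.println((int) "\n".charAt(0));
    }
}
```

On Windows percentN() should be two characters long (\r\n); on Linux and macOS it should be one (\n).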
Warning:
If you're doing NETWORKING code, you might prefer the certainty of \n, as opposed to %n which may send different characters across the network, depending upon what platform it's running on.
"correct" depends on what exactly it is you are trying to do.
\n will always give you a "unix style" line ending.
\r\n will always give you a "dos style" line ending.
%n will give you the line ending for the platform you are running on
C handles this differently. You can choose to open a file in either "text" or "binary" mode. If you open the file in binary mode \n will give you a "unix style" line ending and "\r\n" will give you a "dos style" line ending. If you open the file in "text" mode on a dos/windows system then when you write \n the file handling code converts it to \r\n. So by opening a file in text mode and using \n you get the platform specific line ending.
I can see why the designers of java didn't want to replicate C's hacky ideas regarding "text" and "binary" file modes.
Notice these answers are only true when using System.out.printf() or System.out.format() or the Formatter object. If you use %n in System.out.println(), it will simply produce a %n, not a newline.
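To see the difference without relying on a real console, the following sketch writes to an in-memory stream. The class PercentNDemo and its two helper methods are hypothetical names used only for this illustration:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

// Illustrative sketch: %n is only interpreted by the formatting methods.
class PercentNDemo {
    // println treats its argument as plain text, so the characters % and n survive.
    static String viaPrintln() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buffer);
        out.println("done%n");
        out.flush();
        return buffer.toString();
    }

    // printf treats its first argument as a format string, so %n becomes a line separator.
    static String viaPrintf() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buffer);
        out.printf("done%n");
        out.flush();
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.print(viaPrintln()); // the literal text done%n, then a line break
        System.out.print(viaPrintf());  // the text done, then a line break
    }
}
```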
In Java, \n always generates the \u000A linefeed character. To get the correct line separator for a particular platform, use %n.
So use \n when you are sure that you need the \u000A linefeed character, for example in networking.
In all other situations use %n.
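For the networking case, a hedged sketch: the protocol here is imaginary, and WireFormat and encodeLine are hypothetical names. The point is only that the terminator is written explicitly, so the bytes on the wire do not depend on the sending platform:

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch: a line-based message with an explicit, platform-independent terminator.
class WireFormat {
    // Appends a single LF (code 10) regardless of the platform line separator.
    static byte[] encodeLine(String text) {
        return (text + "\n").getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] message = encodeLine("PING");
        System.out.println(message.length);              // 5
        System.out.println(message[message.length - 1]); // 10
    }
}
```

A protocol that specifies \r\n as its terminator would hard-code "\r\n" in the same way; using %n here would send different bytes from Windows and Linux hosts.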
The %n format specifier is a line separator that's portable across operating systems. However, it is only interpreted in format strings; passed to System.out.print or System.out.println, it is printed literally.
When writing platform-appropriate output through a format string, prefer %n over \n.

Is there "oldline" character in Java?

Is there an opposite of the newline character '\n' in Java that moves back to the previous line in the console?
ASCII doesn't standardize a "line starve" or reverse line feed control character. Some character based terminals/terminal emulators recognize control code sequences that move the cursor up a line; these aren't Java-specific, and depend on your OS and configuration. Here's a starting point if you're using Linux: http://www.kernel.org/doc/man-pages/online/pages/man4/console_codes.4.html
Java supports Unicode, which has the character "REVERSE LINE FEED" (U+008D). In Java it would be '\u008D' (as a char) or "\u008D" (as a String). Whether this would do what you want on a console, printout, or whatever, depends on the device. Java does not define any behavior for that character.
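As a sketch of the terminal control sequence approach: the console_codes page linked above lists ESC [ n A as "move cursor up n rows". The class and method names below are illustrative, and whether anything visible happens depends entirely on the terminal; many IDE consoles simply print the raw characters.

```java
// Illustrative sketch: builds the "cursor up" control sequence understood by many terminals.
class CursorUp {
    // Returns ESC, '[', the row count, and 'A'. Java itself attaches no meaning to this string.
    static String cursorUp(int rows) {
        return "\u001B[" + rows + "A";
    }

    public static void main(String[] args) {
        System.out.println("first line");
        // On a terminal that supports the sequence, this moves up one row,
        // returns to column 0, and overwrites the previous line.
        System.out.print(cursorUp(1) + "\r" + "overwritten");
        System.out.println();
    }
}
```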

Java regex to distinguish special characters while allowing non english chars

I am trying to do the above. One option is to build a set of special characters and then handle them with some Java logic. But then I have to make sure I include all special characters.
Is there a better way of doing this?
You need to decide what constitutes a special character. One method that may be of interest is Character.getType(char) which returns an int which will match one of the constant values of Character such as Character.LOWERCASE_LETTER or Character.CURRENCY_SYMBOL. This lets you determine the general category of a character, and then you need to decide which categories count as 'special' characters and which you will accept as part of text.
Note that Java uses UTF-16 to encode its char and String values, and consequently you may need to deal with supplementary characters (see the link in the description of the getType method). This is a nuisance, but the Character class does offer methods which help you detect this situation and work around it. See the Character.isSupplementaryCodePoint(int) and Character.codePointAt(char[], int) methods.
Also be aware that Java 6 is far less knowledgeable about Unicode than is Java 7. The newest version of Java has added far more to its Unicode database, but code running on Java 6 will not recognise some (actually quite a few) exotic codepoints as being part of a Unicode block or general category, so you need to bear this in mind when writing your code.
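A minimal sketch of the Character.getType approach, assuming Java 8 or later for String.codePoints(). The class name, the method names, and above all the choice of which categories count as ordinary text are assumptions for illustration; you would adjust the list to your own definition of "special":

```java
// Illustrative sketch: classify characters by Unicode general category.
class CharCategories {
    // Accepts letters and numbers only; this category list is an assumption, not a standard.
    static boolean isPlainText(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UPPERCASE_LETTER:
            case Character.LOWERCASE_LETTER:
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
            case Character.DECIMAL_DIGIT_NUMBER:
            case Character.LETTER_NUMBER:
            case Character.OTHER_NUMBER:
                return true;
            default:
                return false;
        }
    }

    // Iterates by code point, so supplementary characters are not split into surrogates.
    static boolean containsSpecial(String s) {
        return !s.codePoints().allMatch(CharCategories::isPlainText);
    }

    public static void main(String[] args) {
        System.out.println(containsSpecial("h\u00E9llo123")); // false: accented letter plus digits
        System.out.println(containsSpecial("a-b"));           // true: hyphen is punctuation
    }
}
```

If decomposed accents should also pass, the mark categories (for example Character.NON_SPACING_MARK) would need to be added to the accepted list.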
It sounds like you would like to remove all control characters from a Unicode string. You can accomplish this by using a Unicode character category identifier in a regex. The category "Cc" contains those characters, see http://www.fileformat.info/info/unicode/category/Cc/list.htm.
myString = myString.replaceAll("[\\p{Cc}]+", "");

Checking for specific strings with regex

I have a list of arbitrary length of Type String, I need to ensure each String element in the list is alphanumerical or numerical with no spaces and special characters such as - \ / _ etc.
Example of accepted strings include:
J0hn-132ss/sda
Hdka349040r38yd
Hd(ersd)3r4y743-2\d3
123456789
Examples of unacceptable strings include:
Hello
Joe
King
etc basically no words.
I’m currently using stringInstance.matches("regex") but not too sure on how to write the appropriate expression
if (str.matches("^[a-zA-Z0-9_/-\\|]*$")) return true;
else return false;
This method will always return true for words that don't conform to the format I mentioned.
A description of the regex I’m looking for in English would be something like:
Any String, where the String contains characters from (a-zA-Z AND 0-9 AND special characters)
OR (0-9 AND Special characters)
OR (0-9)
Edit: I have come up with the following expression, which works, but I feel that it may be bad in terms of being unclear or too complex.
The expression:
(([\\pL\\pN\\pP]+[\\pN]+|[\\pN]+[\\pL\\pN\\pP]+)|([\\pN]+[\\pP]*)|([\\pN]+))+
I've used this website to help me: http://xenon.stanford.edu/~xusch/regexp/analyzer.html
Note that I’m still new to regex
WARNING: “Never” Write A-Z
All instances of ranges like A-Z or 0-9 that occur outside an RFC definition are virtually always ipso facto wrong in Unicode. In particular, things like [A-Za-z] are horrible antipatterns: they’re sure giveaways that the programmer has a caveman mentality about text that is almost wholly inappropriate this side of the Millennium. The Unicode patterns work on ASCII, but the ASCII patterns break on Unicode, sometimes in ways that leave you open to security violations. Always write the Unicode version of the pattern no matter whether you are using 1970s data or modern Unicode, because that way you won’t screw up when you actually use real Java character data. It’s like the way you use your turn signal even when you “know” there is no one behind you, because if you’re wrong, you do no harm, whereas the other way, you very most certainly do. Get used to using the 7 Unicode categories:
\pL for Letters. Notice how \pL is a lot shorter to type than [A-Za-z].
\pN for Numbers.
\pM for Marks that combine with other code points.
\pS for Symbols, Signs, and Sigils. :)
\pP for Punctuation.
\pZ for Separators like spaces (but not control characters)
\pC for other invisible formatting and Control characters, including unassigned code points.
Solution
If you just want a pattern, you want
^[\pL\pN]+$
although in Java 7 you can do this:
(?U)^\w+$
assuming you don’t mind underscores and letters with arbitrary combining marks. Otherwise you have to write the very awkward:
(?U)^[[:alpha:]\pN]+$
The (?U) is new to Java 7. It corresponds to the Pattern class’s UNICODE_CHARACTER_CLASSES compilation flag. It switches the POSIX character classes like [:alpha:] and the simple shortcuts like \w to actually work with the full Java character set. Normally, they work only on the 1970sish ASCII set, which can be a security hole.
There is no way to make Java 7 always do this with its patterns without being told to, but you can write a frontend function that does this for you. You just have to remember to call yours instead.
Note that patterns in Java before v1.7 cannot be made to work according to the way UTS#18 on Unicode Regular Expressions says they must. Because of this, you leave yourself open to a wide range of bugs, infelicities, and paradoxes if you do not use the new Unicode flag. For example, the trivial and common pattern \b\w+\b will not be found to match anywhere at all within the string "élève", let alone in its entirety.
Therefore, if you are using patterns in pre-1.7 Java, you need to be extremely careful, far more careful than anyone ever is. You cannot use any of the POSIX charclasses or charclass shortcuts, including \w, \s, and \b, all of which break on anything but stone-age ASCII data. They cannot be used on Java’s native character set.
In Java 7, they can — but only with the right flag.
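A short sketch of the flag in use, assuming Java 7 or later. Pattern.UNICODE_CHARACTER_CLASSES is the real flag named above; the class UnicodeWord and its methods are hypothetical, with compileUnicode standing in for the kind of frontend function the answer suggests:

```java
import java.util.regex.Pattern;

// Illustrative sketch: the same pattern with and without UNICODE_CHARACTER_CLASSES.
class UnicodeWord {
    private static final String WORD = "^\\w+$";

    // Frontend helper: always compiles with the Unicode flag, equivalent to prefixing (?U).
    static Pattern compileUnicode(String regex) {
        return Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASSES);
    }

    // Default behaviour: \w covers only [a-zA-Z_0-9].
    static boolean matchesAsciiWord(String s) {
        return Pattern.compile(WORD).matcher(s).find();
    }

    // With the flag, \w follows the Unicode definition of a word character.
    static boolean matchesUnicodeWord(String s) {
        return compileUnicode(WORD).matcher(s).find();
    }

    public static void main(String[] args) {
        String word = "\u00E9l\u00E8ve"; // the French word from the answer, written with escapes
        System.out.println(matchesAsciiWord(word));   // false
        System.out.println(matchesUnicodeWord(word)); // true
    }
}
```

Production code would normally cache the compiled Pattern instead of recompiling it on every call.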
It is possible to rephrase the description of the needed regex as "contains at least one number", so the following would work: /.*[\pN].*/. Or, if you would like to limit your search to letters, numbers, and punctuation, you should use /[\pL\pN\pP]*[\pN][\pL\pN\pP]*/. I've tested it on your examples and it works fine.
You can further refine your regexp by using lazy quantifiers like this: /.*?[\pN].*?/. This way the leading part stops at the first number instead of running to the end of the string and backtracking.
I would like to recommend a great book on regular expressions: Mastering Regular Expressions. It has a great introduction, an in-depth explanation of how regular expressions work, and a chapter on regular expressions in Java.
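Translated into Java string syntax, the "letters, numbers and punctuation with at least one number" pattern from this answer might look like the sketch below. The class and method names are illustrative; note that the slashes are dropped and each backslash is doubled inside the string literal:

```java
import java.util.regex.Pattern;

// Illustrative sketch: accept strings of letters, numbers, and punctuation containing a number.
class ContainsNumber {
    private static final Pattern AT_LEAST_ONE_NUMBER =
            Pattern.compile("[\\pL\\pN\\pP]*\\pN[\\pL\\pN\\pP]*");

    // matches() requires the whole string to fit the pattern, so no ^ or $ is needed.
    static boolean accepts(String s) {
        return AT_LEAST_ONE_NUMBER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("J0hn-132ss/sda"));         // true
        System.out.println(accepts("Hd(ersd)3r4y743-2\\d3"));  // true
        System.out.println(accepts("123456789"));              // true
        System.out.println(accepts("Hello"));                  // false
    }
}
```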
It looks like you just want to make sure that there are no spaces in the string. If so, you can do this very simply:
return str.indexOf(" ") == -1;
This will return true if there are no spaces (valid by my understanding of your rules), and false if there is a space anywhere in the string (invalid).
Here is a partial answer, which does 0-9 and special characters OR 0-9.
^([\d]+|[\\/\-_]*)*$
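This can be read as ((1 or more digits) OR (0 or more special characters \ / - _)) repeated 0 or more times. It accepts digits-only strings and mixtures of digits and these special characters. Note that, as written, it does not actually require a digit: because every part is optional, it also matches the empty string and strings consisting only of special characters.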
I used regex tester to test several of the strings.
Adding alphabetic characters seems easy, but a repetition of the given regexp may be required.
