How to remove ANSI control chars (VT100) from a Java String - java

I am working with automation and using Jsch to connect to remote boxes and automate some tasks.
I am having problems parsing the command results because sometimes they come with ANSI control chars.
I've already seen this answer and this other one, but they do not point to any library that does this. I don't want to reinvent the wheel if one exists, and I don't feel confident with those answers.
Right now, I am trying this, but I am not really sure it's complete enough.
reply = reply.replaceAll("\\[..;..[m]|\\[.{0,2}[m]|\\(Page \\d+\\)|\u001B\\[[K]|\u001B|\u000F", "");
How to remove ANSI control chars (VT100) from a Java String?

Most ANSI VT100 sequences have the format ESC [, optionally followed by a number or by two numbers separated by ;, followed by some character that is not a digit or ;. So something like
reply = reply.replaceAll("\u001B\\[[\\d;]*[^\\d;]","");
or
reply = reply.replaceAll("\\e\\[[\\d;]*[^\\d;]",""); // \e matches escape character
should catch most of them, I think. There may be other cases that you could add individually. (I have not tested this.)
Some of the alternatives in the regex you posted start with \\[, rather than the escape character, which may mean that you could be deleting some text you're not supposed to delete, or deleting part of a control sequence but leaving the ESC character in.
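A minimal sketch of the first pattern above, wrapped in a helper so the regex is compiled once. The class and method names (AnsiStripper, strip) are made up for illustration, and this only covers ESC [ style sequences, not every control code a terminal can emit.

```java
import java.util.regex.Pattern;

final class AnsiStripper {
    // ESC, '[', any run of digits/semicolons, then one character that is neither.
    private static final Pattern CSI = Pattern.compile("\u001B\\[[\\d;]*[^\\d;]");

    // Hypothetical helper: returns the input with every matching sequence removed.
    static String strip(String s) {
        return CSI.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String reply = "\u001B[1;32mdone\u001B[0m\u001B[K";
        System.out.println(strip(reply)); // prints "done"
    }
}
```

Because every alternative starts with the escape character, ordinary text such as "[text]" is left alone.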

Related

Avoid / Non-capture non printable Unicode characters regex [duplicate]

So, I'm having an issue. I'm catching some stuff from a Logger, And the output looks something like this:
11:41:19 [INFO] ←[35;1m[Server] hi←[m
I need to know how to remove those pesky ANSI color codes (or to parse them).
If they're intact, they should consist of ESC (U+001B) plus [ plus a semicolon-separated list of numbers, plus m. (See https://stackoverflow.com/a/9943250/978917.) In that case, you can remove them by writing:
final String msgWithoutColorCodes =
msgWithColorCodes.replaceAll("\u001B\\[[;\\d]*m", "");
. . . or you can take advantage of them by using less -r when examining your logs. :-)
(Note: this is specific to color codes. If you also find other ANSI escape sequences, you'll want to generalize that a bit. I think a fairly general regex would be \u001B\\[[;\\d]*[ -/]*[@-~], since the final character of such a sequence falls in the range @ to ~. You may find http://en.wikipedia.org/wiki/ANSI_escape_code to be helpful.)
If the sequences are not intact — that is, if they've been mangled in some way — then you'll have to investigate and figure out exactly what mangling has happened.
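Since the question also mentions parsing the codes, here is a rough sketch that pulls the numeric parameters out of each color sequence and then strips them. It assumes the sequences are intact; the class and method names (SgrParser, parameters) are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SgrParser {
    // ESC, '[', a semicolon-separated list of numbers (possibly empty), then 'm'.
    private static final Pattern SGR = Pattern.compile("\u001B\\[([;\\d]*)m");

    // Hypothetical helper: the raw parameter string of each color sequence, in order.
    static List<String> parameters(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = SGR.matcher(s);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "11:41:19 [INFO] \u001B[35;1m[Server] hi\u001B[m";
        System.out.println(parameters(line));                 // prints [35;1, ]
        System.out.println(SGR.matcher(line).replaceAll("")); // prints 11:41:19 [INFO] [Server] hi
    }
}
```

An empty parameter string (as in ESC [ m) conventionally means "reset", so it is kept in the list rather than skipped.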
How about this regex
replaceAll("\u001B\\[(\\d{1,2}(;\\d{1,2})*)?m", "");
Based on the format found here: http://bluesock.org/~willg/dev/ansi.html

Are escape characters in Java platform-dependent?

I just read this question about comparing "%n" and "\n"
What's up with Java's "%n" in printf?
The answer confirms that %n can be used across platforms, while \n cannot. So I wonder about other escape characters such as \t, \b, \', \", \\ ... Are they all platform-dependent just like \n?
The String escape codes mean the same thing on all platforms. They map to specified Unicode codepoints that in turn correspond to standard 7-bit ASCII characters.
The only (theoretical) concern might be some native character set which didn't have a way of representing the equivalent of those codepoints / characters. I'm pretty sure you'd be OK on ancient 6-bit and 5-bit character sets from 50+ years ago.
However, if you are trying to output text in the platform preferred form, you do need to consider two things:
Different platforms use different character sequences as the preferred way to designate an "end of line". (Or line separator ...)
The default TAB stop positions vary with whatever displays the text (terminal, console, or editor). Terminals conventionally use every 8 character positions, while editors are often configured for 4.
So when you format data for fixed-width character display (e.g. on a "console"), you need to consider these platform dependencies.
There is also some uncertainty / variability about what will "happen" when you send those characters to a display, or include them in a file. But that's not really Java's fault, or anything that Java could address.
By contrast, "%n" ... in the context of a format string ... means the platform preferred line separator. So, on Linux/UNIX (including modern macOS) it means "\n", on Windows it means "\r\n", and on classic Mac OS it meant "\r". Note that this ONLY applies to format Strings; i.e. the first argument to String.format(...), or something else that does that style of formatting.
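A small sketch of the difference, assuming nothing beyond the standard library. The class name LineSeparatorDemo and its two helpers are only for illustration.

```java
final class LineSeparatorDemo {
    // %n is expanded by the formatter to the platform line separator.
    static String viaFormat() {
        return String.format("a%nb");
    }

    // \n in a literal is always the single character U+000A.
    static String viaLiteral() {
        return "a\nb";
    }

    public static void main(String[] args) {
        System.out.println(viaFormat().equals("a" + System.lineSeparator() + "b")); // prints true
        System.out.println((int) viaLiteral().charAt(1));                           // prints 10
    }
}
```

The first check holds on any platform because it compares against System.lineSeparator() rather than a hard-coded sequence.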
\t \' \" and \\ will most likely act in the same way across all platforms as they represent real ASCII characters and there are not many platforms left that do not implement the full ASCII character set.
\b - well that's a different matter. That will almost certainly not do the same thing across any platforms as it is supposed to implement the BEL control code which, in itself, is not platform generic.
What were you hoping to get from your ... in the question?
Added: It seems \b is backspace - still unlikely to be cross-platform though.
Added: And as for \f - just don't use it as it will probably only ever do something that stops working when you replace your printer - if it ever actually does something at all.
Some platforms use \r\n as a new line, others use \n. Using %n will ensure the right new line is emitted in the output.
That has nothing to do with the backslash character preceding characters to designate special characters like the ones you mentioned. Feel free to use it in your source code.

How to avoid backslash in java

First look at the code below
public static void main(String args[])
{
System.out.println(new Stringtest().test("The system has saved your payment under transaction number \369825655."));
}
private String test(String aa)
{
return aa.substring(58);
}
So logically this method should print 369825655. But it is printing 9825655 because of that backslash. Now I want the whole number. What should I do? I cannot change the \ to \\ because the text is coming from a website.
\ is an escape character. You need to escape the escape character.
System.out.println(new Stringtest().test("The system has saved your payment under transaction number \\369825655."));
For more on escape sequences, see this Oracle document. Essentially the backslash tells the system that the character(s) after it should be interpreted in a special manner (not simply as plain text). When you escape the escape, it gets treated specially as well... as plain text, instead of as an escape character.
It can get confusing sometimes but is intuitive once you learn the core concepts.
EDIT: If you can't control the format the String comes to you in, you might be out of luck. I've been debugging this in Eclipse, and it seems like as soon as you create your String with that, the escape character gets processed and you lose the first two digits of your transaction number. You may need to get your database guys (or whomever formatted this terrible String) to change their implementation for you to do what you need to do. The Eclipse debugger suggests this, at least.
It just so happens that \36 is a valid octal escape, so it compiles and gets interpreted as a single ASCII control character (code 30) that doesn't show up. In other cases, an invalid escape sequence is rejected by the compiler rather than throwing an Exception at runtime.
In my own testing, it seems that as soon as the String literal is declared/created, the loss of information occurs. So there will be no way to recover it after that to my knowledge.
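For what it's worth, escape processing is done by the compiler on literals in source code, so text that arrives at runtime (from a website, a file, a database) keeps its backslash as an ordinary character. A rough sketch of both cases follows; the class and helper names (BackslashDemo, afterPrefix) are made up, and the doubled backslash in the second literal only stands in for real runtime input.

```java
final class BackslashDemo {
    // Hypothetical helper: everything after the first occurrence of prefix.
    static String afterPrefix(String text, String prefix) {
        return text.substring(text.indexOf(prefix) + prefix.length());
    }

    public static void main(String[] args) {
        // In source code, "\36" is an octal escape for one character (code 30).
        String literal = "\36";
        System.out.println(literal.length());        // prints 1
        System.out.println((int) literal.charAt(0)); // prints 30

        // Simulated runtime input: the string really contains one backslash.
        String fromWebsite = "transaction number \\369825655.";
        String id = afterPrefix(fromWebsite, "number ").replace("\\", "");
        System.out.println(id); // prints 369825655.
    }
}
```

So the digits are only lost when the text is pasted into a Java literal without doubling the backslash.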
Try \\369825655 - basically it escapes the escape sign. Then replace all \ characters with empty characters.
Add another backslash (escape character)
System.out.println(new Stringtest().test("The system has saved your payment under transaction number \\369825655."));

Java String#contains() using String#matches() with escape character

I need a simple way to implement the contains function using matches. I believe this is my starting point:
xxx.matches("'.*yyy.*'");
But I need to make it a universal method and pre-process whatever I search for to be accepted by matches! This must be done using only the escape '\' character!
Imagine a string SEARCH_FOR that can contain some special characters that must be "regex escaped"...
String SEARCH_FOR="*.\\";
xxx.matches("'.*" + SEARCH_FOR + ".*'");
Are there any catches? Special situations? Any other special chars that should be taken into account?
Are you looking for Pattern.quote(String) ?
This escapes special characters for you.
EDIT:
After reading the comments, I really hope you try Pattern.quote(yourString.toLowerCase()) as it sounds like you've been using Pattern.quote(yourString).toLowerCase(). If DataNucleus is applying the regex then there should be no problems with using the \Q and \E escape sequence.
Since you have really asked for it, ".\\".replaceAll("(\\.|\\$|\\+|\\*|\\\\)", "\\\\$1") outputs \.\\
This will escape .'s, $'s, +'s, *'s and \'s. Note that the security of this is now all upon you. If you don't escape something you needed to, or you escape it incorrectly, you will either allow people to use regex inside the search term when you weren't expecting it, or it won't return the results that you were expecting.
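For comparison, a minimal sketch of the Pattern.quote route when the regex is evaluated by plain Java. Whether DataNucleus accepts \Q...\E in its own matches() is something you would need to check; the class and method names here (ContainsViaMatches, contains) are invented for the example.

```java
import java.util.regex.Pattern;

final class ContainsViaMatches {
    // Hypothetical helper: emulates String#contains using String#matches.
    static boolean contains(String haystack, String needle) {
        // (?s) lets '.' span line breaks; Pattern.quote makes the needle literal.
        return haystack.matches("(?s).*" + Pattern.quote(needle) + ".*");
    }

    public static void main(String[] args) {
        String searchFor = "*.\\"; // the three characters * . \
        System.out.println(contains("a*.\\b", searchFor)); // prints true
        System.out.println(contains("abc", searchFor));    // prints false
    }
}
```

Pattern.quote covers every metacharacter, so there is no list of special characters to maintain by hand.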
