JSCH Library: Getting strange character while reading readLine() [duplicate] - java

I am working with automation and using Jsch to connect to remote boxes and automate some tasks.
I am having a problem parsing the command results because they sometimes contain ANSI control characters.
I've already seen this answer and this other one, but neither points to a library for this. I don't want to reinvent the wheel if one already exists, and I don't feel confident in those answers.
Right now I am trying this, but I am not sure it is complete enough.
reply = reply.replaceAll("\\[..;..[m]|\\[.{0,2}[m]|\\(Page \\d+\\)|\u001B\\[[K]|\u001B|\u000F", "");
How to remove ANSI control chars (VT100) from a Java String?

Most ANSI VT100 sequences have the format ESC [, optionally followed by a number or by two numbers separated by ;, followed by some character that is not a digit or ;. So something like
reply = reply.replaceAll("\u001B\\[[\\d;]*[^\\d;]","");
or
reply = reply.replaceAll("\\e\\[[\\d;]*[^\\d;]",""); // \e matches escape character
should catch most of them, I think. There may be other cases that you could add individually. (I have not tested this.)
Some of the alternatives in the regex you posted start with \\[, rather than the escape character, which may mean that you could be deleting some text you're not supposed to delete, or deleting part of a control sequence but leaving the ESC character in.
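To illustrate the point about anchoring on the escape character, here is a minimal sketch that compares the pattern from this answer with one of the unanchored alternatives from the question. The class name, method names, and the sample string are made up for illustration; only the two regexes come from the text above.

```java
class AnsiStripDemo {
    // Pattern from the answer: ESC, "[", digits/semicolons, then one final character.
    static final String ANCHORED = "\u001B\\[[\\d;]*[^\\d;]";

    // One alternative from the question's regex; it does not require the ESC character.
    static final String LOOSE = "\\[.{0,2}[m]";

    static String stripAnsi(String s) {
        return s.replaceAll(ANCHORED, "");
    }

    static String stripLoose(String s) {
        return s.replaceAll(LOOSE, "");
    }

    public static void main(String[] args) {
        // Hypothetical reply: colored "error", plain text containing "[10m]", then an erase-line code.
        String reply = "\u001B[1;31merror\u001B[0m in array[10m] \u001B[K";

        // Anchored pattern removes only real sequences and keeps "array[10m]".
        System.out.println(stripAnsi(reply));

        // Loose pattern deletes "[10m" from ordinary text and leaves stray ESC characters behind.
        System.out.println(stripLoose(reply).replace('\u001B', '?'));
    }
}
```

The second call replaces any leftover ESC with `?` only so that the stray characters are visible when printed.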

Related

Avoid / Non-capture non printable Unicode characters regex [duplicate]

So, I'm having an issue. I'm capturing output from a Logger, and it looks something like this:
11:41:19 [INFO] ←[35;1m[Server] hi←[m
I need to know how to remove those pesky ANSI color codes (or how to parse them).
If they're intact, they should consist of ESC (U+001B) plus [ plus a semicolon-separated list of numbers, plus m. (See https://stackoverflow.com/a/9943250/978917.) In that case, you can remove them by writing:
final String msgWithoutColorCodes =
msgWithColorCodes.replaceAll("\u001B\\[[;\\d]*m", "");
. . . or you can take advantage of them by using less -r when examining your logs. :-)
(Note: this is specific to color codes. If you also find other ANSI escape sequences, you'll want to generalize that a bit. I think a fairly general regex would be \u001B\\[[;\\d]*[ -/]*[@-~]. You may find http://en.wikipedia.org/wiki/ANSI_escape_code to be helpful.)
If the sequences are not intact — that is, if they've been mangled in some way — then you'll have to investigate and figure out exactly what mangling has happened.
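As a rough sketch of the two patterns in this answer, the following applies them to the log line from the question. The class and method names are hypothetical. The general pattern here ends in `[@-~]`, on the assumption that the final byte of a CSI sequence lies in the range 0x40 to 0x7E; sequences that use other parameter characters (for example `?`) are not covered.

```java
class ColorCodeDemo {
    // Color codes only: ESC [ numbers-and-semicolons m
    static String stripColors(String s) {
        return s.replaceAll("\u001B\\[[;\\d]*m", "");
    }

    // More general: ESC [ parameters, optional intermediate bytes, one final byte in @..~
    static String stripCsi(String s) {
        return s.replaceAll("\u001B\\[[;\\d]*[ -/]*[@-~]", "");
    }

    public static void main(String[] args) {
        // Log line from the question, with the arrow symbol written as the real ESC character.
        String log = "11:41:19 [INFO] \u001B[35;1m[Server] hi\u001B[m";
        System.out.println(stripColors(log)); // prints "11:41:19 [INFO] [Server] hi"

        // A clear-screen and a cursor-position sequence are not color codes,
        // so only the general pattern removes them.
        String other = "\u001B[2Jx\u001B[10;20H";
        System.out.println(stripCsi(other));  // prints "x"
    }
}
```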
How about this regex
replaceAll("\\d{1,2}(;\\d{1,2})?", "");
Based on the format found here: http://bluesock.org/~willg/dev/ansi.html

Java print string as unicode

I was processing some Twitter data using Java. I read the tweets from a file, do some processing, and print them to stdout.
The text in file looks like this:
"RT #Bollogosta319a: #BuyBookSilentSinners \u262fGain Followers\n\u262fRT This\n\u262fMUST FOLLOW ME I FOLLOW BACK\n\u262fFollow everyone who rts\n\u262fGain\n #ANDROID \u2026"
I read it in, and print it out to stdout. The output is supposed to be:
"RT #Bollogosta319a: #BuyBookSilentSinners ☯Gain Followers\n☯RT This\n☯MUST FOLLOW ME I FOLLOW BACK\n☯Follow everyone who rts\n☯Gain\n #ANDROID …"
But my output is like this:
"RT #Bollogosta319a: #BuyBookSilentSinners ?Gain Followers
?RT This
?MUST FOLLOW ME I FOLLOW BACK
?Follow everyone who rts
?Gain
#ANDROID ?"
So, it seems that I have two problems to deal with:
1. print the exact Unicode character instead of Unicode string
2. keep "\n" as it is, instead of a newline in the output.
How can I do this? (Dealing with different encodings in Java is driving me crazy.)
I don't know how you are parsing the file, but the method you are using seems to be interpreting escape codes (like \n and \u262f). To leave instances of \n in the file literally, you could replace \n with \\n prior to using whatever means of interpreting the escape codes. The \\ will be converted to a single \, and the n will be left alone. Have you tried using a plain java.io.FileReader to read the file? That may be simpler.
The Unicode symbols may actually be read correctly; many terminals do not support the full range of Unicode characters and print some symbol in place of those it does not understand. Perhaps your program prints ☯ and the terminal simply doesn't know how to render it, so it prints a ? instead.
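One way to get both behaviours the question asks for is to decode only the \uXXXX escapes yourself and leave everything else, including a literal \n, untouched. This is a different technique from the replace-before-parsing idea above, and the sketch below is only an assumption about what the input looks like (raw text containing backslash escapes). The class and method names are hypothetical. Printing through a UTF-8 `PrintStream` helps only if the terminal itself can render the characters.

```java
import java.io.PrintStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapeDemo {
    // A backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces each matched escape with the character it names; other text is copied as-is.
    static String decodeUnicodeEscapes(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' or '\' being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Raw text as it might appear in the file: backslashes are literal here.
        String raw = "\\u262fGain Followers\\n\\u262fRT This \\u2026";

        // Write bytes as UTF-8 instead of relying on the platform default encoding.
        PrintStream utf8 = new PrintStream(System.out, true, "UTF-8");
        utf8.println(decodeUnicodeEscapes(raw));
    }
}
```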

Regular expression, excluding .. in suffix of email addy [duplicate]

This is homework. I've been working on it for a while and have done lots of reading, and I feel I have gotten pretty familiar with regex for a beginner.
I am trying to find a regular expression for validating a list of email addresses. Two addresses are giving me problems: I can't get both to validate correctly at the same time. I've gone through a dozen different expressions that work for all the other emails on the list, but none handles those two together.
First, the addresses.
me@example..com - invalid
someone.nothere@1.0.0.127 - valid
The part of my expression which validates the suffix
I originally started with
@.+\\.[[a-z]0-9]+
I also had a second pattern for checking some more invalid addresses. I checked each email against both patterns, one for validity and the other for invalidity, but my professor said he wanted it all in one expression.
@[[\\w]+\\.[\\w]+]+
or
@[\\w]+\\.[\\w]+
I've tried writing it many, many different ways, but I'm pretty sure I was just using different syntax to express these two expressions.
I know what I want it to do: I want it to match a character class of "character+"."character+"+
The plus sign means at least one. It works for the invalid address when I only allow the character class to repeat one time (and obviously the IP doesn't get matched), but when I allow the character class to repeat itself it matches the second period even though it isn't preceded by a character. I don't understand why.
I've even tried grouping everything with () and putting {1} after the escaped . and changing the \w to a-z and replacing + with {1,}; nothing seems to require the period to be surrounded by characters.
You need a negative look-ahead:
@\w+\.(?!\.)
See http://www.regular-expressions.info/lookaround.html
Test in Perl:
Perl> $_ = 'someone.nothere@1.0.0.127'
someone.nothere@1.0.0.127
Perl> print "OK\n" if /\@\w+\.(?!\.)/
OK
1
Perl> $_ = 'me@example..com'
me@example..com
Perl> print "OK\n" if /\@\w+\.(?!\.)/
Perl>
@([\\w]+\\.)+[\\w]+
Matches at least one word character, followed by a '.'. This is repeated at least once, and is then followed by at least one more word character.
I think you want this:
@[\\w]+(\\.[\\w]+)+
This matches a "word" followed by one or more "." "word" sequences. (You can also do the grouping the other way around; e.g. see Dailin's answer.)
The problem with what you were doing before was that you were trying to embed a repeat inside a character class. That doesn't make sense, and there is no syntax that would support it. A character class defines a set of characters and matches against one character. Nothing more.
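A minimal Java sketch of the "word followed by one or more .word groups" idea, checked against the two addresses from the question. It assumes the addresses are written with the usual @ separator. The local-part pattern `[\\w.]+` and the class and method names are assumptions added for illustration; the question only discusses the part after the separator.

```java
import java.util.regex.Pattern;

class EmailSuffixDemo {
    // Local part (assumed): word characters and dots.
    // Suffix: a word, then one or more ".word" groups, so two dots in a row cannot match.
    private static final Pattern EMAIL = Pattern.compile("[\\w.]+@\\w+(\\.\\w+)+");

    static boolean isValid(String address) {
        // matches() requires the whole string to fit the pattern.
        return EMAIL.matcher(address).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("me@example..com"));           // prints "false"
        System.out.println(isValid("someone.nothere@1.0.0.127")); // prints "true"
    }
}
```

Because the repeat is applied to a group rather than placed inside a character class, every dot in the suffix has to be followed by at least one word character.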
The official standard, RFC 2822, describes the syntax that valid email addresses must adhere to. You can implement it with this regular expression:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
A more practical implementation of RFC 2822 (if we omit the syntax using double quotes and square brackets), which will still match 99.99% of all email addresses in actual use today, is:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?

In Java (Pig) Regex, how could I do the following?

I have data coming in a txt file delimited by pipes. The unfortunate thing is 2 fields can have multiple values. To separate these multiples, the sender used pipes again, but put quotes around it. My regex worked for months until a certain rare situation...
Regex currently:
([^\|]*)\|"?([^"]*)"?\|([^\|]*)\|"?([^"]*)"?
And it worked for the following situation which happens most of the time:
abc|"part1|part2"|abc|"tool1|tool2"
But this case is where the ([^"]*) jumps ahead and takes all from the blank to the end of the quotes:
abc||abc|"tool1|tool2"
So I realize I must account for when there is a pipe next instead of a quote.
Just not sure how.............
P.S. For those PIG people that might be looking at this, I removed a backslash from each escape, to make it look more like Java, but in PIG you need 2, fyi.
In your expression you need to specify that the part between |s can be either quoted or not quoted. You can do it as follows:
(("[^"]*")|((?!")[^|]*))
Now you can repeat this part several times with |s in between, as you need.
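As a sketch of how the repeated field pattern might look in plain Java, the following builds a four-field expression from the "quoted or not quoted" part and runs it on both sample lines from the question. The class name, the `split` helper, and the quote-stripping step are assumptions for illustration. The escaping shown is for Java string literals; as the question notes, Pig needs its own escaping.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PipeFieldsDemo {
    // One field: a quoted run (which may contain pipes), or an unquoted run with no pipes.
    private static final String FIELD = "(\"[^\"]*\"|(?!\")[^|]*)";

    // Four fields separated by literal pipes.
    private static final Pattern LINE =
            Pattern.compile(FIELD + "\\|" + FIELD + "\\|" + FIELD + "\\|" + FIELD);

    static String[] split(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected line format: " + line);
        }
        String[] fields = new String[4];
        for (int i = 0; i < fields.length; i++) {
            String f = m.group(i + 1);
            // Drop the surrounding quotes from quoted fields.
            boolean quoted = f.length() >= 2 && f.startsWith("\"") && f.endsWith("\"");
            fields[i] = quoted ? f.substring(1, f.length() - 1) : f;
        }
        return fields;
    }

    public static void main(String[] args) {
        // The common case and the rare case with an empty second field.
        System.out.println(Arrays.toString(split("abc|\"part1|part2\"|abc|\"tool1|tool2\"")));
        System.out.println(Arrays.toString(split("abc||abc|\"tool1|tool2\"")));
    }
}
```

In the second line the empty field matches the unquoted alternative with zero characters, so the quoted fourth field is no longer swallowed.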
