Avoid / Non-capture non printable Unicode characters regex [duplicate]

So, I'm having an issue. I'm catching some output from a Logger, and it looks something like this:
11:41:19 [INFO] ←[35;1m[Server] hi←[m
I need to know how to remove those pesky ANSI color codes (or how to parse them).

If they're intact, they should consist of ESC (U+001B) plus [ plus a semicolon-separated list of numbers, plus m. (See https://stackoverflow.com/a/9943250/978917.) In that case, you can remove them by writing:
final String msgWithoutColorCodes =
msgWithColorCodes.replaceAll("\u001B\\[[;\\d]*m", "");
. . . or you can take advantage of them by using less -r when examining your logs. :-)
(Note: this is specific to color codes. If you also find other ANSI escape sequences, you'll want to generalize that a bit. I think a fairly general regex would be \u001B\\[[;\\d]*[ -/]*[@-~]. You may find http://en.wikipedia.org/wiki/ANSI_escape_code to be helpful.)
If the sequences are not intact — that is, if they've been mangled in some way — then you'll have to investigate and figure out exactly what mangling has happened.
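As a rough sketch of the more general pattern mentioned in the note above, the following assumes the sequences are intact and start with ESC followed by [. The class and method names (AnsiStripper, strip) are made up for illustration.

```java
import java.util.regex.Pattern;

final class AnsiStripper {
    // ESC [ , then parameter characters (digits and ';'), then optional
    // intermediate characters (space through '/'), then one final
    // character in the range '@' through '~'.
    private static final Pattern CSI =
            Pattern.compile("\u001B\\[[;\\d]*[ -/]*[@-~]");

    // Hypothetical helper: returns the text with matching sequences removed.
    static String strip(String text) {
        return CSI.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String line = "11:41:19 [INFO] \u001B[35;1m[Server] hi\u001B[m";
        System.out.println(strip(line)); // prints "11:41:19 [INFO] [Server] hi"
    }
}
```

Text that merely looks like a color code but lacks the leading ESC character is left alone, so the timestamp and the bracketed tags survive.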

How about this regex
replaceAll("\\d{1,2}(;\\d{1,2})?", "");
Based on the format found here: http://bluesock.org/~willg/dev/ansi.html

Related

Java regex to match quoted numbers

I need to clean up a JSON including incorrectly quoted numbers via a short Java (not JS!) Regex snippet. Example for what I have:
[{"series":"a","x":"1","y":"111.71"},{"series":"a","x":"2","y":"120.25"}]
Example for what I would need to get:
[{"series":"a","x":1,"y":111.71},{"series":"a","x":2,"y":120.25}]
So I only need to match and eliminate quote characters if preceded or followed by [0-9], but how to avoid replacing part of the number is beyond my lowly regex skills.
Any help greatly appreciated!
EDIT (2nd round):
Thanks for the fast feedback! I'm not too worried about false positives since I can control the contents of the descriptors, and I'll make sure they're text-only. Spaces can be avoided as well, only negative numbers might occur - good one! Separators are always commas (",") for the JSON, and the decimals of the double values (there can be any number of them) are always separated by dots ("."). I cannot fix the JSON source unfortunately, and I definitely want to clean this up in Java.
Trying out the suggestions now and reporting back. I'll also toy around with this: http://www.regular-expressions.info/lookaround.html#lookbehind

How about replaceAll("\"(-?\\d+([.]\\d+)?)\"","$1");

This works for your specific example, but would not work if other numbers have a different format (see my comment):
String s = "[{\"series\":\"a\",\"x\":\"1\",\"y\":\"111.71\"},{\"series\":\"a\",\"x\":\"2\",\"y\":\"120.25\"}]";
String clean = s.replaceAll("\"(\\d+\\.?\\d*)\"", "$1");
System.out.println(clean);
outputs:
[{"series":"a","x":1,"y":111.71},{"series":"a","x":2,"y":120.25}]
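For what it's worth, here is a small self-contained sketch that combines the two suggestions above, so that an optional minus sign is handled as well. The class and method names are invented for the example, and it assumes (as stated in the question) that no legitimate string value consists of digits only.

```java
final class QuotedNumbers {
    // Drops the quotes around anything that looks like -?digits(.digits)?
    // Assumption: purely numeric strings never need to stay quoted.
    static String unquoteNumbers(String json) {
        return json.replaceAll("\"(-?\\d+(?:\\.\\d+)?)\"", "$1");
    }

    public static void main(String[] args) {
        String s = "[{\"series\":\"a\",\"x\":\"1\",\"y\":\"111.71\"},"
                 + "{\"series\":\"a\",\"x\":\"2\",\"y\":\"-120.25\"}]";
        System.out.println(unquoteNumbers(s));
    }
}
```

Values such as "a" or "a1" keep their quotes, because the pattern requires the whole quoted content to be a number.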

Why isn't char of type SPACE_SEPARATOR recognized as whitespace?

I have a String like "12 345 678" and I wanted to remove the whitespace (because of conversion to int). So I did the usual: myString.replaceAll("\\s", "");, but what a surprise! It did nothing, the space was still there.
When I investigated further, I figured out that this space character is of type Character.SPACE_SEPARATOR (Character.getType(myString.charAt(<positionOfSpaceChar>))).
What I don't get is why this obvious space character (from Unicode category Zs, http://www.fileformat.info/info/unicode/category/Zs/list.htm) isn't recognized as whitespace (not even by Character.isWhitespace(char)).
Reading through the Java API hasn't been helpful (so far).
note: In the end, I just want to remove that character... and I will probably find a way how to do it, but I'm really interested in some explanation of why it's behaving like this. Thanks

Your problem is that \s is defined as [ \t\n\x0B\f\r]. What you want to use is \p{javaWhitespace}, which is defined as all characters for which java.lang.Character.isWhitespace() is true.
Not sure if it applies in this case, but note that a non-breaking space is not considered whitespace. Character.SPACE_SEPARATOR is generally whitespace, but '\u00A0', '\u2007', '\u202F' are not included because they are non-breaking. If you want to include non-breaking spaces, then include those 3 characters explicitly in addition to \p{javaWhitespace}. It's kind of a pain, but that's the way it is.
Actually, in your specific case of converting to int, I'd recommend:
myString.replaceAll("\\D", "");
to strip out everything that is not a digit.
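To make the difference concrete, here is a minimal sketch that assumes the troublesome character is a no-break space (U+00A0); the question does not say which Zs character it actually was. The class and method names are illustrative only.

```java
final class SpaceStripper {
    // \p{javaWhitespace} plus the three non-breaking spaces that
    // Character.isWhitespace() deliberately excludes.
    private static final String ALL_SPACES =
            "[\\p{javaWhitespace}\\u00A0\\u2007\\u202F]";

    static String stripSpaces(String s) {
        return s.replaceAll(ALL_SPACES, "");
    }

    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    public static void main(String[] args) {
        String input = "12\u00A0345\u00A0678"; // assumed: no-break spaces
        // Plain \s does not match U+00A0, so the string is unchanged.
        System.out.println(input.replaceAll("\\s", "").length()); // 10
        System.out.println(Integer.parseInt(stripSpaces(input))); // 12345678
        System.out.println(Integer.parseInt(digitsOnly(input)));  // 12345678
    }
}
```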

With regex, is using both "is" and "is not" range definitions within the same range possible?

Note: I'm using a 3rd party app that uses regex for searches which has its own flavor but almost always works like java's flavor of regex. Of course this may not matter.
After searching for many different ways of this same question (phrased many ways), I did not see any tutorials, examples, or even mentions of whether it is possible to use both an "is" (positive?) and "is not" (negative?) definition within the same range.
I can't test the example right now in the app to see if my ideas work, because the amount of data being searched is massive and it would screw up the matches it has already gathered. I'm only asking because of this.
Here are examples of what I thought might work but caused the tester to act weird:
[\w^\s<>.!?]{2}
[\w|^\s<>.!?]{2}
I would rather have it work the way I think the first one would work (any digit, lower case, or upper case character, or other normal character that is not a space, >, <, period, !, or ?) rather than the second, which only has an or operator.
The regex testers I used gave me different funky results which is what is confusing me.
Also note: I'm using this within a capture group which is followed by a catch-everything match, which I may or may not be using properly. So if you'd like to include how to properly follow what I'm attempting with that, feel free. I AM MAINLY JUST CURIOUS WHETHER THIS IS POSSIBLE OR NOT, OR WHETHER IT IS AN IMPROPER METHOD.

Why do you need the \w at all?
[^\s<>.!?]{2}
This already matches all alphanumeric characters since they are neither space nor any of the punctuation characters you mentioned.
In general, you can subtract character classes to some degree. For example, to match alphanumerics excluding digits, you can do
[^\W\d]
because [^\W] matches the same as \w, and \d is subtracted from that because it's in a negated character class.
Edit:
Some regex engines (like XPath, .NET and JGSoft) allow flexible character class subtraction like this:
[a-z-[e-g]]
to match any character from the range [a-z], excluding e, f and g. Java does not support this syntax, but it can express the same thing with character class intersection: [a-z&&[^e-g]].
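Since the question mentions a Java-like flavor, here is a short sketch of how both subtraction styles behave in java.util.regex, which supports intersection with && inside a character class. Whether the third-party app in the question accepts && is not known, and the class and helper names below are made up.

```java
import java.util.regex.Pattern;

final class ClassSubtraction {
    // Intersection form: a-z AND NOT e-g.
    static final Pattern A_TO_Z_WITHOUT_EFG = Pattern.compile("[a-z&&[^e-g]]");

    // Double-negation form: NOT (non-word OR digit), i.e. word chars minus digits.
    static final Pattern WORD_WITHOUT_DIGITS = Pattern.compile("[^\\W\\d]");

    // Hypothetical helper: true if the whole string matches the pattern.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(A_TO_Z_WITHOUT_EFG, "d"));  // true
        System.out.println(matches(A_TO_Z_WITHOUT_EFG, "f"));  // false
        System.out.println(matches(WORD_WITHOUT_DIGITS, "a")); // true
        System.out.println(matches(WORD_WITHOUT_DIGITS, "5")); // false
    }
}
```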

Another possibility is to use two ranges and combine them; e.g.
([\w]|[^\s<>.!?]){2}
However, this does bring up the question of what you are actually trying to express here. Because this example (as I've rewritten it) doesn't make a lot of sense.
What it says is "a word character, or any character that is not whitespace or certain punctuation". But the class of characters that are not "whitespace or certain punctuation" ALREADY includes all of the word characters. So, unless you mean something different, the \w is redundant.

From your question, it looks like a no-space regex would match your needs, you can achieve that with:
[\S]{2}

Checking for specific strings with regex

I have a list of arbitrary length of Type String, I need to ensure each String element in the list is alphanumerical or numerical with no spaces and special characters such as - \ / _ etc.
Example of accepted strings include:
J0hn-132ss/sda
Hdka349040r38yd
Hd(ersd)3r4y743-2\d3
123456789
Examples of unacceptable strings include:
Hello
Joe
King
etc basically no words.
I’m currently using stringInstance.matches("regex") but not too sure on how to write the appropriate expression
if (str.matches("^[a-zA-Z0-9_/-\\|]*$")) return true;
else return false;
This method will always return true for words that don't conform to the format I mentioned.
A description of the regex I’m looking for in English would be something like:
Any String, where the String contains characters from (a-zA-Z AND 0-9 AND special characters)
OR (0-9 AND Special characters)
OR (0-9)
Edit: I have come up with the following expression, which works, but I feel that it may be bad in terms of being unclear or too complex.
The expression:
(([\\pL\\pN\\pP]+[\\pN]+|[\\pN]+[\\pL\\pN\\pP]+)|([\\pN]+[\\pP]*)|([\\pN]+))+
I've used this website to help me: http://xenon.stanford.edu/~xusch/regexp/analyzer.html
Note that I’m still new to regex

WARNING: “Never” Write A-Z
All instances of ranges like A-Z or 0-9 that occur outside an RFC definition are virtually always ipso facto wrong in Unicode. In particular, things like [A-Za-z] are horrible antipatterns: they’re sure giveaways that the programmer has a caveman mentality about text that is almost wholly inappropriate this side of the Millennium. The Unicode patterns work on ASCII, but the ASCII patterns break on Unicode, sometimes in ways that leave you open to security violations. Always write the Unicode version of the pattern no matter whether you are using 1970s data or modern Unicode, because that way you won’t screw up when you actually use real Java character data. It’s like the way you use your turn signal even when you “know” there is no one behind you, because if you’re wrong, you do no harm, whereas the other way, you very most certainly do. Get used to using the 7 Unicode categories:
\pL for Letters. Notice how \pL is a lot shorter to type than [A-Za-z].
\pN for Numbers.
\pM for Marks that combine with other code points.
\pS for Symbols, Signs, and Sigils. :)
\pP for Punctuation.
\pZ for Separators like spaces (but not control characters)
\pC for other invisible formatting and Control characters, including unassigned code points.
Solution
If you just want a pattern, you want
^[\pL\pN]+$
although in Java 7 you can do this:
(?U)^\w+$
assuming you don’t mind underscores and letters with arbitrary combining marks. Otherwise you have to write the very awkward:
(?U)^[[:alpha:]\pN]+$
The (?U) is new to Java 7. It corresponds to the Pattern class’s UNICODE_CHARACTER_CLASSES compilation flag. It switches the POSIX character classes like [:alpha:] and the simple shortcuts like \w to actually work with the full Java character set. Normally, they work only on the 1970sish ASCII set, which can be a security hole.
There is no way to make Java 7 always do this with its patterns without being told to, but you can write a frontend function that does this for you. You just have to remember to call yours instead.
Note that patterns in Java before v1.7 cannot be made to work according to the way UTS#18 on Unicode Regular Expressions says they must. Because of this, you leave yourself open to a wide range of bugs, infelicities, and paradoxes if you do not use the new Unicode flag. For example, the trivial and common pattern \b\w+\b will not be found to match anywhere at all within the string "élève", let alone in its entirety.
Therefore, if you are using patterns in pre-1.7 Java, you need to be extremely careful, far more careful than anyone ever is. You cannot use any of the POSIX charclasses or charclass shortcuts, including \w, \s, and \b, all of which break on anything but stone-age ASCII data. They cannot be used on Java’s native character set.
In Java 7, they can — but only with the right flag.
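A small sketch of the patterns discussed above, assuming Java 7 or later and a precomposed (NFC) input string. The class and method names are invented for the illustration; only Pattern.UNICODE_CHARACTER_CLASSES comes from the standard library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeWordDemo {
    // Default \w: ASCII letters, digits and underscore only.
    static boolean isAsciiWord(String s) {
        return Pattern.compile("^\\w+$").matcher(s).find();
    }

    // Same pattern with the flag that (?U) switches on.
    static boolean isUnicodeWord(String s) {
        return Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASSES)
                .matcher(s).find();
    }

    // Category-based version: letters and numbers, no underscore.
    static boolean isLettersAndNumbers(String s) {
        return Pattern.compile("^[\\pL\\pN]+$").matcher(s).find();
    }

    // First \b\w+\b match under (?U), or null if there is none.
    static String firstWord(String s) {
        Matcher m = Pattern.compile("(?U)\\b\\w+\\b").matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String word = "\u00E9l\u00E8ve"; // "élève", written with escapes
        System.out.println(isAsciiWord(word));         // false
        System.out.println(isUnicodeWord(word));       // true
        System.out.println(isLettersAndNumbers(word)); // true
    }
}
```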

It is possible to rephrase the description of the needed regex as "contains at least one number", so the following would work: /.*[\pN].*/. Or, if you would like to limit your search to letters, numbers and punctuation, you should use /[\pL\pN\pP]*[\pN][\pL\pN\pP]*/. I've tested it on your examples and it works fine.
You can further refine your regexp by using lazy quantifiers like this /.*?[\pN].*?/. This way it would fail faster if there are no numbers.
I would like to recommend a great book on regular expressions: Mastering Regular Expressions. It has a great introduction, an in-depth explanation of how regular expressions work and a chapter on regular expressions in Java.
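Here is how the second pattern could be tried from Java against the examples in the question. This is only a sketch: the class and method names are hypothetical, and String.matches-style whole-string matching is assumed.

```java
import java.util.regex.Pattern;

final class ContainsNumber {
    // Letters, numbers and punctuation only, with at least one number.
    private static final Pattern ALLOWED_WITH_NUMBER =
            Pattern.compile("[\\pL\\pN\\pP]*\\pN[\\pL\\pN\\pP]*");

    static boolean isAccepted(String s) {
        return ALLOWED_WITH_NUMBER.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "J0hn-132ss/sda", "Hdka349040r38yd", "Hd(ersd)3r4y743-2\\d3",
            "123456789", "Hello", "Joe", "King"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isAccepted(s));
        }
    }
}
```

The first four samples are accepted and the three plain words are rejected; a string containing a space is rejected too, since spaces are in none of the three categories.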

It looks like you just want to make sure that there are no spaces in the string. If so, you can do this very simply:
return str.indexOf(" ") == -1;
This will return true if there are no spaces (valid by my understanding of your rules), and false if there is a space anywhere in the string (invalid).

Here is a partial answer, which does 0-9 and special characters OR 0-9.
^[\\/\-_]*\d[\d\\/\-_]*$
This can be read as: zero or more special characters (\ / - _), then a digit, then any mix of digits and special characters. It requires a digit, will take digits only, and will reject strings consisting of only special characters.
I used regex tester to test several of the strings.
Adding alphabetic characters seems easy, but a repetition of the given regexp may be required.

Regex to find variables and ignore methods

I'm trying to write a regex that finds all variables (and only variables, ignoring methods completely) in a given piece of JavaScript code. The actual code (the one which executes regex) is written in Java.
For now, I've got something like this:
Matcher matcher=Pattern.compile(".*?([a-z]+\\w*?).*?").matcher(string);
while(matcher.find()) {
System.out.println(matcher.group(1));
}
So, when value of "string" is variable*func()*20
printout is:
variable
func
Which is not what I want. The simple negation of ( won't do, because it makes regex catch unnecessary characters or cuts them off, but still functions are captured. For now, I have the following code:
Matcher matcher=Pattern.compile(".*?(([a-z]+\\w*)(\\(?)).*?").matcher(formula);
while(matcher.find()) {
if(matcher.group(3).isEmpty()) {
System.out.println(matcher.group(2));
}
}
It works, the printout is correct, but I don't like the additional check. Any ideas? Please?
EDIT (2011-04-12):
Thank you for all the answers. There were questions about why I would need something like that. And you are right: in the case of bigger, more complicated scripts, the only sane solution would be parsing them. In my case, however, this would be excessive. The scraps of JS I'm working on are intended to be simple formulas, something like (a+b)/2. No comments, string literals, arrays, etc. Only variables and (probably) some built-in functions. I need the list of variables to check whether they can be initialized at this point (and whether they are initialized at all). I realize that all of it can be done manually with RPN as well (which would be safer), but these formulas are going to be wrapped in a bigger script and evaluated in a web browser, so it's more convenient this way.
This may be a bit dirty, but it's assumed that whoever is writing these formulas (probably me, most of the time) knows what they are doing and is able to check whether the formulas are working correctly.
If anyone finds this question wanting to do something similar, they should know the risks/difficulties. I do, at least I hope so ;)

Taking all the sound advice about how regex is not the best tool for the job into consideration is important. But you might get away with a quick and dirty regex if your rule is simple enough (and you are aware of the limitations of that rule):
Pattern regex = Pattern.compile(
    "\\b # word boundary\n" +
    "[A-Za-z]# 1 ASCII letter\n" +
    "\\w* # 0+ alnums\n" +
    "\\b # word boundary\n" +
    "(?! # Lookahead assertion: Make sure there is no...\n" +
    " \\s* # optional whitespace\n" +
    " \\( # opening parenthesis\n" +
    ") # ...at this position in the string",
    Pattern.COMMENTS);
This matches an identifier as long as it's not followed by a parenthesis. Of course, now you need group(0) instead of group(1). And of course this matches lots of other stuff (inside strings, comments, etc.)...
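For completeness, a compact sketch of how that pattern might be used in a find loop. It is the same regex written without comments, and the class and method names are made up. It assumes simple formulas of the kind described in the question, with no string literals, comments or property access.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class VariableFinder {
    // An identifier that is not followed by optional whitespace and "(".
    private static final Pattern IDENTIFIER_NOT_CALLED =
            Pattern.compile("\\b[A-Za-z]\\w*\\b(?!\\s*\\()");

    static List<String> findVariables(String formula) {
        List<String> names = new ArrayList<>();
        Matcher m = IDENTIFIER_NOT_CALLED.matcher(formula);
        while (m.find()) {
            names.add(m.group()); // group(0): the whole match
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(findVariables("variable*func()*20")); // [variable]
        System.out.println(findVariables("(a+b)/2"));            // [a, b]
    }
}
```

No extra check on a third group is needed, because the lookahead already rejects names that are followed by an opening parenthesis.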

If you are rethinking using regex and wondering what else you could do, you could consider using an AST instead to access your source programmatically. This answer shows you could use the Eclipse Java AST to build a syntax tree for Java source. I guess you could do something similar for JavaScript.

A regex won't cut it in this case because Java isn't regular. Your best bet is to get a parser that understands Java syntax and build onto that. Luckily, ANTLR has a Java 1.6 grammar (and 1.5 grammar).
For your rather limited use case you could probably easily extend the variable assignment rules and get the info you need. It's a bit of a learning curve but this will probably be your best bet for a quick and accurate solution.

It's pretty well established that regex cannot be reliably used to parse structured input. See here for the famous response: RegEx match open tags except XHTML self-contained tags
As any given sequence of characters may or may not change meaning depending on previous or subsequent sequences of characters, you cannot reliably identify a syntactic element without both lexing and parsing the input text. Regex can be used for the former (breaking an input stream into tokens), but cannot be used reliably for the latter (assigning meaning to tokens depending on their position in the stream).
