Java Regex for multiline text

Java Regex for multiline text - java

I need to match a string against a regex in Java. The string is multiline and therefore contains multiple \n like the followings
String text = "abcde\n"
+ "fghij\n"
+ "klmno\n";
String regex = "\\S*";
System.out.println(text.matches(regex));
I only want to match whether the text contains at least a non-whitespace character. The output is false. I have also tried \\S*(\n)* for the regex, which also returns false.
In the real program, both the text and regex are not hard-coded. What is the right regex to check is a multiline string contains any non-whitespace character?

The problem is not to do with the multi lines, directly. It is that matches matches the whole string, not just a part of it.
If you want to check for at least one non-whitespace character, use:
"\\s*\\S[\\s\\S]*"
Which means
Zero or more whitespace characters at the start of the string
One non-whitespace character
Zero or more other characters (whitespace or non-whitespace) up to the end of the string

If you just want to check whether there is at least one non white space character in the string, you can just trim the text and check the size without involving regex at all.
String text = "abcde\n"
+ "fghij\n"
+ "klmno\n";
if (!text.trim().isEmpty()){
//your logic here
}
If you really want to use regex, you can use a simple regex like below.
String text = "abcde\n"
+ "fghij\n"
+ "klmno\n";
String regex = ".*\\S+.*";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(string);
if (matcher.find()){
// your logic here
}

Using String.matches()
!text.matches("\\s*")
Check if the input text consist solely of whitespace characters (this includes newlines), invert the match result with !
Using Matcher.find()
Pattern regexp = Pattern.compile("\\S");
regexp.matcher(text).find()
Will search for the first non-whitespace character, which is more efficient as it will stop on the first match and also uses a pre-compiled pattern.

Related

Java regex validate a string consisting of repeated sequences of comma separated digits enclosed in parentheses

I have an input containing repeated pattern like this:
(3,5)(6,7)(8,9).....
I am working on a regex to validate the above string pattern.
I tried:
Pattern.compile("\\((\\d+),(\\d+)\\)")

If you have a specific pattern that may repeat one or more times in a string, and you want to make sure your string only consists of the repeated occurrences of the same pattern, you may use
^(?:YOUR_PATTERN)+$
\A(?:YOUR_PATTERN)+\z
Note that $ also matches before a final newline char, that is why \z anchor is preferred when you need to match the very end of the string.
If you allow an empty string use the * quantifier instead of +:
^(?:YOUR_PATTERN)*$
\A(?:YOUR_PATTERN)*\z
In this case, the YOUR_PATTERN is \(\d+,\d+\), thus, the repeated sequence validating pattern will be \A(?:\(\d+,\d+\))+\z.
In Java, you may omit the ^/$ and \A/\z anchors when validating a string with .matches() method since it requires a full string match:
boolean isValid = text.matches("(?:\\(\\d+,\\d+\\))+");
Or, decalre the Pattern class instance first and then create a matcher with the string as input and run Matcher#matches()
Pattern p = Pattern.compile("(?:\\(\\d+,\\d+\\))+");
Matcher m = p.matcher(text);
if (m.matches()) {
// The text is valid!
}

Erase any string that doesn't match a pattern using replaceall()

I need to replace ALL characters that don't follow a pattern with "".
I have strings like:
MCC-QX-1081
TEF-CO-QX-4949
SPARE-QX-4500
So far the closest I am using the following regex.
String regex = "[^QX,-,\\d]";
Using the replaceAll String method I get QX1081 and the expected result is QX-1081

You're using a character class which matches single characters, not patterns.
You want something like
String resultString = subjectString.replaceAll("^.*?(QX-\\d+)?$", "$1");
which works as long as nothing follows the QX-digits part in your strings.

Put the dash at the end of the regex: [^QX,\d-]
Next you just have to substring to filter out the first dash.
Don't know exactly what you expect for all strings but if you want to match a dash in a character class then it must be set as last character.

You are using a character class where you have to either escape the hyphen or put it at the start or at the end like [^QX,\d-] or else you are matching a range from a comma to a comma. But changing that will give you -QX-1081 which is not the desired result.
You could match your pattern and then replace with the first capturing group $1:
^(?:[A-Z]+-)+(QX-\d+)$
In Java you have to double escape matching a digit \\d
That will match:
^ Start of the string
(?:[A-Z]+-)+ Repeat 1+ times one or more uppercase charactacters followed by a hyphen
(QX-\d+) Capture in a group QX- followed by 1+ digits
$ End of the string
For example:
String result = "MCC-QX-1081".replaceAll("^(?:[A-Z]+-)+(QX-\\d+)$", "$1");
System.out.println(result); // QX-1081
See the Regex demo | Java demo
Note that if you are doing just 1 replacement, you could also use replaceFirst

A regex that doesn't match with this character sequence

Here is my Regex, I am trying to search all special characters so that I can escape them.
(\(|\)|\[|\]|\{|\}|\?|\+|\\|\.|\$|\^|\*|\||\!|\&|\-|\#|\#|\%|\_|\"|\:|\<|\>|\/|\;|\'|\`|\~)
My problem here is, I don't want to escape some sepcial characters only when the come in a sequence
like this (.*)
So, Lets consider an example.
Sting message = "Hi, Mr.Xyz! Your account number is :- (1234567890) , (,*) &$#%#*(....))(((";
After escaping according to current regex what i get is,
Hi, Mr\.Xyz\! Your account number is \:\- \(1234567890\) , \(,\*\) \&\$\#\%\#\*\(\.\.\.\.\)\)\(\(\(
But is don't want to escape this part (.*) want to keep it as it is.
My above regex is only used for searching, So i just don't want to match with this part (.*) and my problem will be solved
Can anyone suggest regex that doesn't escape that part of the string?

See #nhahtdh for how to do this with a regex.
As an alternative, Here is a solution which does not use a regex, using Guava's CharMatcher instead:
private static final CharMatcher SPECIAL
= CharMatcher.anyOf("allspecialcharshere");
private static final String NO_ESCAPE = "(.*)";
public String doEncode(String input)
{
StringBuilder sb = new StringBuilder(input.length());
String tmp = input;
while (!tmp.isEmpty()) {
if (tmp.startsWith(NO_ESCAPE)) {
sb.append(NO_ESCAPE);
tmp = tmp.substring(NO_ESCAPE.length());
continue;
}
char c = tmp.charAt(0);
if (SPECIAL.matches(c))
sb.append('\\');
sb.append(c);
tmp = tmp.substring(1);
}
return sb.toString();
}

This answer is to demonstrate the possibility only. Using it in production code is questionable.
It is possible with Java String replaceAll function:
String input = "Hi, Mr.Xyz! Your account number is :- (1234567890) , (.*) &$#%#*(....))(((";
String output = input.replaceAll("\\G((?:[^()\\[\\]{}?+\\\\.$^*|!&##%_\":<>/;'`~-]|\\Q(.*)\\E)*+)([()\\[\\]{}?+\\\\.$^*|!&##%_\":<>/;'`~-])", "$1\\\\$2");
Result:
"Hi, Mr\.Xyz\! Your account number is \:\- \(1234567890\) , (.*) \&\$\#\%\#\*\(\.\.\.\.\)\)\(\(\("
Another test:
String input = "(.*) sdfHi test message <> >>>>><<<<f<f<,,,,<> <>(.*) sdf (.*) sdf (.*)";
Result:
"(.*) sdfHi test message \<\> \>\>\>\>\>\<\<\<\<f\<f\<,,,,\<\> \<\>(.*) sdf (.*) sdf (.*)"
Explanation
Raw regex:
\G((?:[^()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-]|\Q(.*)\E)*+)([()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-])
Note that \ is escaped once more when the regex is specified inside the string, and " needs to be escaped. The resulting regex in string can be seen above.
Raw replacement string:
$1\\$2
Since $ has special meaning in replacement string, and you want to keep it for $2, you need to escape the \ so that \ won't escape the $. And putting the replacement string in quoted string, you need to double up the number of \ to escape the \.
Before we dissect the monster, let's talk about the idea. We will consume non-special characters, and the sequence that we don't want to replace, and as many times as possible. The next character will either be a special character not forming sequence we don't want to replace, or is the end of the string (which means that we have found all character that needs replacing if any).
Naturally, we can think of any arbitrary string as consisting of many of the following pattern consecutively: [0 or more (non-special character or special pattern not to be replace)][special character], and the string ends with [0 or more (non-special character or special pattern not to be replace)].
replaceAll function when used with a regex without \G may find matches that are not consecutive, which can cut in the middle of the sequence not to be replaced and mess it up. \G means the boundary of last match, and can be used to make sure the next match starts from where the last match left off.
\G: Starts from last match
((?:[^()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-]|\Q(.\*)\E)*+): Capture 0 or more of, the non-special character or the special pattern not to be replaced. Note that I have added the possessive qualifier + after *. This will prevent the engine from backtracking when it cannot find the special character that we specify after this.
[^()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-]: Negated character class of special characters.
\Q(.*)\E: Special sequence (.*) not to be replaced, literal quoted by \Q and \E.
([()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-]): Capture the single special character.
The whole regex will match string with minimum length of 1 (the special character). The first capturing group contains the parts that shouldn't be replaced, and the 2nd capturing group contains the special character that should be replaced.

How can I remove all leading and trailing punctuation?

I want to remove all the leading and trailing punctuation in a string. How can I do this?
Basically, I want to preserve punctuation in between words, and I need to remove all leading and trailing punctuation.
., #, _, &, /, - are allowed if surrounded by letters
or digits
\' is allowed if preceded by a letter or digit
I tried
Pattern p = Pattern.compile("(^\\p{Punct})|(\\p{Punct}$)");
Matcher m = p.matcher(term);
boolean a = m.find();
if(a)
term=term.replaceAll("(^\\p{Punct})", "");
but it didn't work!!

Ok. So basically you want to find some pattern in your string and act if the pattern in matched.
Doing this the naiive way would be tedious. The naiive solution could involve something like
while(myString.StartsWith("." || "," || ";" || ...)
myString = myString.Substring(1);
If you wanted to do a bit more complex task, it could be even impossible to do the way i mentioned.
Thats why we use regular expressions. Its a "language" with which you can define a pattern. the computer will be able to say, if a string matches that pattern. To learn about regular expressions, just type it into google. One of the first links: http://www.codeproject.com/Articles/9099/The-30-Minute-Regex-Tutorial
As for your problem, you could try this:
myString.replaceFirst("^[^a-zA-Z]+", "")
The meaning of the regex:
the first ^ means that in this pattern, what comes next has to be at
the start of the string.
The [] define the chars. In this case, those are things that are NOT
(the second ^) letters (a-zA-Z).
The + sign means that the thing before it can be repeated and still
match the regex.
You can use a similar regex to remove trailing chars.
myString.replaceAll("[^a-zA-Z]+$", "");
the $ means "at the end of the string"

You could use a regular expression:
private static final Pattern PATTERN =
Pattern.compile("^\\p{Punct}*(.*?)\\p{Punct}*$");
public static String trimPunctuation(String s) {
Matcher m = PATTERN.matcher(s);
m.find();
return m.group(1);
}
The boundary matchers ^ and $ ensure the whole input is matched.
A dot . matches any single character.
A star * means "match the preceding thing zero or more times".
The parentheses () define a capturing group whose value is retrieved by calling Matcher.group(1).
The ? in (.*?) means you want the match to be non-greedy, otherwise the trailing punctuation would be included in the group.

Use this tutorial on patterns. You have to create a regex that matches string starting with alphabet or number and ending with alphabet or number and do inputString.matches("regex")

What's wrong with this regex?

I need to match Twitter-Hashtags within an Android-App, but my code doesn't seem to do what it's supposed to.
What I came up with is:
ArrayList<String> tags = new ArrayList<String>(0);
Pattern p = Pattern.compile("\b#[a-z]+", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(tweet); // tweet contains the tweet as a String
while(m.find()){
tags.add(m.group());
}
The variable tweet contains a regular tweet including hashtags - but find() doesn't trigger. So I guess my regular expression is wrong.

Your regex fails because of the \b word boundary anchor. This anchor only matches between a non-word character and a word-character (alphanumeric character). So putting it directly in front of the # causes the regex to fail unless there is an alphanumeric character before the #! Your regex would match a hashtag in foobarfoo#hashtag blahblahblah but not in foobarfoo #hashtag blahblahblah.
Use #\w+ instead, and remember, inside a string, you need to double the backslashes:
Pattern p = Pattern.compile("#\\w+");

Your pattern should be "#(\\w+)" if you are trying to just match the hash tag. Using this and the tweet "retweet pizza to #pizzahut", doing m.group() would give "#pizzahut" and m.group(1) would give "pizzahut".
Edit: Note, the html display is messing with the backslashes for escape, you'll need to have two for the w in your string literal in Java.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Java Regex for multiline text - java

Related

Java regex validate a string consisting of repeated sequences of comma separated digits enclosed in parentheses

Erase any string that doesn't match a pattern using replaceall()

A regex that doesn't match with this character sequence

How can I remove all leading and trailing punctuation?

What's wrong with this regex?

Categories

Resources