Java Regex: Match any word from pattern - java

I'm trying to implement a search function.
The user types a phrase and I want to match any word from the phrase and the phrase itself in an array of strings.
The problem is that the phrase is stored in a variable, so the Pattern.compile method won't interpret its special characters.
I'm using the following flags for the compile method:
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.LITERAL |
Pattern.MULTILINE
How could I achieve the desired result?
Thanks in advance.
edit:
For example, the phrase:
"Dog cats donuts"
would result in the pattern:
Dogs | cats | donuts | Dogs cats donuts

Split the user-specified phrase by \s+ into, say, arr.
Build the following pattern:
"\\b(?:" + Pattern.quote(arr[0]) + "|" + Pattern.quote(arr[1]) + "|" + Pattern.quote(arr[2]) + ... + ")\\b"
Compile without the Pattern.LITERAL option.
In other words, if you want your pattern to match words in a user-specified phrase, you have to use alternation (the pipes) so that any one of those words can be considered a match. However, the Pattern.LITERAL option makes the alternation operators literal too, so you have to "literalize" just the words themselves, using the Pattern.quote(...) method. The \\b are word boundaries, so that a word in the user's phrase like "bar" does not match inside text like "barrage".
Edit. In response to your edit. If you want to match the longest possible match, e.g. not "Dogs" and "cats" and "donuts" but rather "Dogs cats donuts", you should place the complete phrase in the beginning of the alternation series, e.g.
\\b(Dogs cats donuts|Dogs|cats|donuts)\\b
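The steps above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the class name PhraseSearch and the helpers buildPattern and findAll are made up for illustration, and it assumes every word of the phrase starts and ends with a letter, digit or underscore (otherwise \b will not line up with it).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhraseSearch {
    // Hypothetical helper: full phrase first, then each word, all quoted literally.
    static Pattern buildPattern(String phrase) {
        String trimmed = phrase.trim();
        List<String> alternatives = new ArrayList<>();
        alternatives.add(Pattern.quote(trimmed));          // longest alternative goes first
        for (String word : trimmed.split("\\s+")) {
            alternatives.add(Pattern.quote(word));         // special characters stay literal
        }
        String regex = "\\b(?:" + String.join("|", alternatives) + ")\\b";
        // No Pattern.LITERAL here, otherwise the pipes would be literal as well.
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    // Hypothetical helper: collect every match found in the text.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        Pattern p = buildPattern("Dog cats donuts");
        System.out.println(findAll(p, "My dog cats donuts daily"));   // [dog cats donuts]
        System.out.println(findAll(p, "Cats love DONUTS, dogs too")); // [Cats, DONUTS]
    }
}
```

Note that "dogs" is not reported in the second call, because the word boundary after "dog" fails when an "s" follows.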

Try this:
String regex = "\\b(" + phrase + "|" + phrase.replaceAll("\\s+", "|") + ")\\b";
In action:
String phrase = "Dog cats donuts";
String regex = "\\b(" + phrase + "|" + phrase.replaceAll("\\s+", "|") + ")\\b";
System.out.println(regex);
Output:
\b(Dog cats donuts|Dog|cats|donuts)\b

Related

Replacing certain combination of characters

I'm trying to remove the first bad characters (CAP letter + dot + Space) of this.
A. Shipping Length of Unit
C. OVERALL HEIGHT
Overall Weigth
X. Max Cutting Height
I tried something like that, but it doesn't work:
string.replaceAll("[A-Z]+". ", "");
The result should look like this:
Shipping Length of Unit
OVERALL HEIGHT
Overall Weigth
Max Cutting Height
This should work:
string.replaceAll("^[A-Z]\\. ", "")
Examples
"A. Shipping Length of Unit".replaceAll("^[A-Z]\\. ", "")
// => "Shipping Length of Unit"
"Overall Weigth".replaceAll("^[A-Z]\\. ", "")
// => "Overall Weigth"
input.replaceAll("[A-Z]\\.\\s", "");
[A-Z] matches an upper case character from A to Z
\. matches the dot character
\s matches any white space character
However, this will replace every character sequence that matches the pattern.
For matching a sequence at the beginning you should use
input.replaceAll("^[A-Z]\\.\\s", "");
Without seeing your code it is hard to tell what the problem is, but in my experience this is a common mistake that many of us make in our early days:
String string = "A. Test String";
string.replaceAll("^[A-Z]\\. ", "");
System.out.println(string);
String is an immutable class in Java, which means that once you have created an object it cannot be changed. So when we call replaceAll on an existing String, it simply creates a new String object, which you need to assign to a new variable or use to overwrite the existing value, as below:
String string = "A. Test String";
string = string.replaceAll("^[A-Z]\\. ", "");
System.out.println(string);
Try this :
myString.replaceAll("([A-Z]\\.\\s)","")
[A-Z] : match a single character in the range between A and Z.
\. : match the dot character.
\s : match any whitespace character.
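Putting the answers above together, a minimal sketch might look like this. The class name PrefixStripper and both method names are invented for illustration. The second method assumes the four labels arrive as one multi-line string, which the question does not state; in that case the (?m) flag is needed so that ^ matches at the start of every line.

```java
class PrefixStripper {
    // Strip a leading "X. " from a single line; the result must be used, Strings are immutable.
    static String stripPrefix(String line) {
        return line.replaceAll("^[A-Z]\\.\\s", "");
    }

    // Assumption: several labels in one string, separated by newlines.
    // (?m) turns on MULTILINE so that ^ also matches right after each line break.
    static String stripPrefixes(String text) {
        return text.replaceAll("(?m)^[A-Z]\\.\\s", "");
    }

    public static void main(String[] args) {
        System.out.println(stripPrefix("A. Shipping Length of Unit")); // Shipping Length of Unit
        System.out.println(stripPrefix("Overall Weigth"));             // Overall Weigth
        System.out.println(stripPrefixes("C. OVERALL HEIGHT\nX. Max Cutting Height"));
    }
}
```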

Java pattern for [j-*]

Please help me with the pattern matching. I want to build a pattern which will match the word starting with j- or c- in the following in a string (Say for example)
[j-test] is a [c-test]'s name with [foo] and [bar]
The pattern needs to find [j-test] and [c-test] (brackets inclusive).
What I have tried so far?
String template = "[j-test] is a [c-test]'s name with [foo] and [bar]";
Pattern patt = Pattern.compile("\\[[*[j|c]\\-\\w\\-\\+\\d]+\\]");
Matcher m = patt.matcher(template);
while (m.find()) {
System.out.println(m.group());
}
And its giving output like
[j-test]
[c-test]
[foo]
[bar]
which is wrong. Please help me, thanks for your time on this thread.
Inside a character class, you don't need alternation to match j or c. A character class by itself means "match any single character from the ones inside it", so [jc] will match either j or c.
Also, you don't need to spell out what follows j- or c-, since you don't care about it as long as the bracketed word starts with j- or c-.
Simply use this pattern:
Pattern patt = Pattern.compile("\\[[jc]-[^\\]]*\\]");
To explain:
Pattern patt = Pattern.compile("(?x) " // Embedded flag for Pattern.COMMENTS
+ "\\[ " // Match starting `[`
+ " [jc] " // Match j or c
+ " - " // then a hyphen
+ " [^ " // A negated character class
+ " \\]" // Match any character except ]
+ " ]* " // 0 or more times
+ "\\] "); // till the closing ]
Using the (?x) flag makes the regex engine ignore whitespace in the pattern. It is often helpful for writing readable regexes.
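For reference, the compact pattern above can be dropped into the loop from the question. The following is a small sketch; the class name BracketTagFinder and the method findTags are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketTagFinder {
    // "[" + j or c + "-" + anything except "]" + "]"
    private static final Pattern TAG = Pattern.compile("\\[[jc]-[^\\]]*\\]");

    // Hypothetical helper: return every [j-...] or [c-...] tag, brackets included.
    static List<String> findTags(String template) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(template);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        String template = "[j-test] is a [c-test]'s name with [foo] and [bar]";
        System.out.println(findTags(template)); // [[j-test], [c-test]]
    }
}
```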

How do I find a group of words using Reg-ex?

Here is the code:
String Str ="Animals \n" +
"Dog \n" +
"Cat \n" +
"Fruits \n" +
"Apple \n" +
"Banana \n" +
"Watermelon \n" +
"Sports \n" +
"Soccer \n" +
"Volleyball \n";
The Str basically has 3 categories (Animals, Fruits, Sports). Each of them in separate line. Using Regular Expression, how do I find the Fruits' contents, which will give me the output like this:
Apple
Banana
Watermelon
I would like an explanation that goes with your answer as well, so that I will have a better understand about this problem.
Thanks. :)
Assuming that you want to extract the text between the word "Fruits" and the word "Sports" you could use a regular expression with a capturing group. This way, if a string matches then you still have to extract the group that contains the text that you want.
For example:
Pattern p = Pattern.compile("Fruits(.*?)Sports", Pattern.DOTALL);
// Fruits         : the literal string "Fruits"
// (.*?)          : capture everything in between
// Sports         : the literal string "Sports"
// Pattern.DOTALL : tells the regex to treat newlines like normal characters
Alternatively, you can use a more advanced regular expression with a positive lookbehind and a positive lookahead. The regular expression still looks for text between the words "Fruits" and "Sports", but does not consider those strings themselves part of the match.
Pattern p = Pattern.compile("(?<=Fruits).*?(?=Sports)", Pattern.DOTALL);
I would start by splitting the string into an array of lines (String[] lines = Str.split("\n");), then loop through the array, adding elements to their proper categories as you go along and switching categories whenever you see a heading.
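As a rough illustration of the capturing-group approach, here is a minimal sketch. The class name CategoryExtractor and the method itemsBetween are made up for this example, and it assumes the heading words occur only as headings in the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CategoryExtractor {
    // Hypothetical helper: capture the text between two headings, then split it into trimmed lines.
    static List<String> itemsBetween(String text, String start, String end) {
        Pattern p = Pattern.compile(
                Pattern.quote(start) + "(.*?)" + Pattern.quote(end), Pattern.DOTALL);
        Matcher m = p.matcher(text);
        List<String> items = new ArrayList<>();
        if (m.find()) {
            for (String line : m.group(1).split("\n")) {
                String item = line.trim();
                if (!item.isEmpty()) {      // skip the blank remainder of the heading line
                    items.add(item);
                }
            }
        }
        return items;
    }

    public static void main(String[] args) {
        String str = "Animals \nDog \nCat \nFruits \nApple \nBanana \nWatermelon \n"
                + "Sports \nSoccer \nVolleyball \n";
        System.out.println(itemsBetween(str, "Fruits", "Sports")); // [Apple, Banana, Watermelon]
    }
}
```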

Regex to find words with letters and numbers separated or not by symbols

I need to build a regex that match words with these patterns:
Letters and numbers:
A35, 35A, B503X, 1ABC5
Letters and numbers separated by "-", "/", "\":
AB-10, 10-AB, A10-BA, BA-A10, etc...
I wrote this regex for it:
\b[A-Za-z]+(?=[(?<!\-|\\|\/)\d]+)[(?<!\-|\\|\/)\w]+\b|\b[0-9]+(?=[(?<!\-|\\|\/)A-Za-z]+)[(?<!\-|\\|\/)\w]+\b
It works partially, but it also matches words made of only letters or only numbers separated by symbols.
Example:
10-10, open-office, etc.
And I don't want these matches.
I guess that my regex is very repetitive and somewhat ugly.
But it's what I have for now.
Could anyone help me?
I'm using java/groovy.
Thanks in advance.
Interesting challenge. Here is a java program with a regex that picks out the types of "words" you are after:
import java.util.regex.*;
public class TEST {
public static void main(String[] args) {
String s = "A35, 35A, B503X, 1ABC5 " +
"AB-10, 10-AB, A10-BA, BA-A10, etc... " +
"10-10, open-office, etc.";
Pattern regex = Pattern.compile(
"# Match special word having one letter and one digit (min).\n" +
"\\b # Match first word having\n" +
"(?=[-/\\\\A-Za-z]*[0-9]) # at least one number and\n" +
"(?=[-/\\\\0-9]*[A-Za-z]) # at least one letter.\n" +
"[A-Za-z0-9]+ # Match first part of word.\n" +
"(?: # Optional extra word parts\n" +
" [-/\\\\] # separated by -, / or a backslash\n" +
" [A-Za-z0-9]+ # Match extra word part.\n" +
")* # Zero or more extra word parts.\n" +
"\\b # Start and end on a word boundary",
Pattern.COMMENTS);
Matcher regexMatcher = regex.matcher(s);
while (regexMatcher.find()) {
System.out.print(regexMatcher.group() + ", ");
}
}
}
Here is the correct output:
A35, 35A, B503X, 1ABC5, AB-10, 10-AB, A10-BA, BA-A10,
Note that the only complex regexes which are "ugly", are those that are not properly formatted and commented!
Just use this:
([a-zA-Z]+[-\/\\]?[0-9]+|[0-9]+[-\/\\]?[a-zA-Z]+)
As a Java string literal, only the backslashes need doubling (the forward slash needs no escaping):
([a-zA-Z]+[-/\\\\]?[0-9]+|[0-9]+[-/\\\\]?[a-zA-Z]+)
Excuse me for writing my solution in Python; I don't know enough Java to write it in Java.
pat = re.compile('(?=(?:([A-Z])|[0-9])' ## This part verifies that
'[^ ]*' ## there are at least one
'(?(1)\d|[A-Z]))' ## letter and one digit.
'('
'(?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9])' # start of second group
'[A-Z0-9-/\\\\]*'
'[A-Z0-9](?= |\Z|,)' # end of second group
')',
re.IGNORECASE) # this group 2 catches the string
My solution captures the desired string in the second group: ((?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9])[A-Z0-9-/\\\\]*[A-Z0-9](?= |\Z|,))
The part before it verifies that at least one letter and at least one digit are present in the captured string:
(?(1)\d|[A-Z]) is a conditional regex that means "if group(1) captured something, then there must be a digit here, otherwise there must be a letter"
Group(1) is ([A-Z]) in (?=(?:([A-Z])|[0-9])
(?:([A-Z])|[0-9]) is a non-capturing group that matches a letter (captured) OR a digit, so when it matches a letter, group(1) isn't empty
The flag re.IGNORECASE allows matching both upper-case and lower-case letters.
In the second group, I am obliged to write (?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9]) because lookbehind assertions of non-fixed length are not allowed. This part matches one letter or digit (so not '-') that is preceded by a blank or a comma, or that sits at the start of the string.
Conversely, (?= |\Z|,) means "followed by a blank, the end of the string, or a comma".
This regex assumes that the characters '-', '/' and '\' can't be the first or the last character of a captured string. Is that right?
import re
pat = re.compile('(?=(?:([A-Z])|[0-9])' ## (from here) This part verifies that
'[^ ]*' # there are at least one
'(?(1)\d|[A-Z]))' ## (to here) letter and one digit.
'((?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9])'
'[A-Z0-9-/\\\\]*'
'[A-Z0-9](?= |\Z|,))',
re.IGNORECASE) # this group 2 catches the string
ch = "ALPHA13 10 ZZ 10-10 U-R open-office ,10B a10 UCS5000 -TR54 code vg4- DV-3000 SEA 300-BR gt4/ui bn\\3K"
print([mat.group(2) for mat in pat.finditer(ch)])
s = "A35, 35A, B503X,1ABC5 " +\
"AB-10, 10-AB, A10-BA, BA-A10, etc... " +\
"10-10, open-office, etc."
print([mat.group(2) for mat in pat.finditer(s)])
result
['ALPHA13', '10B', 'a10', 'UCS5000', 'DV-3000', '300-BR', 'gt4/ui', 'bn\\3K']
['A35', '35A', 'B503X', '1ABC5', 'AB-10', '10-AB', 'A10-BA', 'BA-A10']
My first pass yields
(^|\s)(?!\d+[-/\\]?\d+(\s|$))(?![A-Z]+[-/\\]?[A-Z]+(\s|$))([A-Z0-9]+[-/\\]?[A-Z0-9]+)(\s|$)
Sorry, but it's not formatted as a Java string (you'll need to double the backslashes in \s etc.). Also, you can't use \b here, because a word boundary occurs next to any character that is not alphanumeric or an underscore, and that includes the separators. So I used \s and the start and end of the string instead.
This is still a bit raw
EDIT
Version 2, slightly better, but performance could be improved by using possessive quantifiers. It matches ABC76, AB-32, 3434-F etc., but not ABC or 19\23 etc.
((?<=^)|(?<=\s))(?!\d+[-/\\]?\d+(\s|$))(?![A-Z]+[-/\\]?[A-Z]+(\s|$))([A-Z0-9]+[-/\\]?[A-Z0-9]+)((?=$)|(?=\s))
A condition of the form (A OR NOT A) can be omitted, so the symbols can safely be ignored.
for (String word : "10 10-10 open-office 10B A10 UCS5000 code DV-3000 300-BR".split (" "))
if (word.matches ("(.*[A-Za-z].*[0-9].*)|(.*[0-9].*[A-Za-z].*)"))
System.out.println (word); // do something with the word
You didn't mention -x4, 4x-, 4-x-, -4-x or -4-x-; I expect them all to match.
My expression looks just for something-alpha-something-digits-something, where something might be alpha, digits or symbols, and the opposite: something-digits-something-alpha-something. If something else might occur, like !#$~()[]{} and so on, it would get longer.
Tested with scala:
scala> for (word <- "10 10-10 open-office 10B A10 UCS5000 code DV-3000 300-BR".split (" ")
| if word.matches ("(.*[A-Za-z].*[0-9].*)|(.*[0-9].*[A-Za-z].*)")) yield word
res89: Array[java.lang.String] = Array(10B, A10, UCS5000, DV-3000, 300-BR)
Slightly modified to filter matches:
String s = "A35, 35A, B53X, 1AC5, AB-10, 10-AB, A10-BA, BA-A10, etc. -4x, 4x- -4-x- 10-10, oe-oe, etc";
Pattern pattern = java.util.regex.Pattern.compile ("\\b([^ ,]*[A-Za-z][^ ,]*[0-9])[^ ,]*|([^ ,]*[0-9][^ ,]*[A-Za-z][^ ,]*)\\b");
Matcher matcher = pattern.matcher (s);
while (matcher.find ()) { System.out.print (matcher.group () + "|"); }
But I still have an error, which I can't find:
A35|35A|B53X|1AC5|AB-10|10-AB|A10-BA|BA-A10|-4x|4x|-4-x|
4x should be 4x-, and -4-x should be -4-x-.
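A likely cause of the truncated matches is the trailing \\b: there is no word boundary between "-" and a following space, so the engine backtracks until the match ends on a letter or digit. One possible way around it, sketched below, is to delimit tokens by space and comma with lookarounds instead of \\b. The class name MixedTokenFinder and the method findMixed are hypothetical, and the sketch assumes, as this answer does, that symbols at the edges of a token should be kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MixedTokenFinder {
    // A token is a run of characters other than space and comma.
    // (?<![^ ,])         : token start (start of input, or preceded by space/comma)
    // (?=[^ ,]*[A-Za-z]) : the token contains at least one letter
    // (?=[^ ,]*[0-9])    : the token contains at least one digit
    // [^ ,]+             : consume the whole token, symbols included
    private static final Pattern MIXED = Pattern.compile(
            "(?<![^ ,])(?=[^ ,]*[A-Za-z])(?=[^ ,]*[0-9])[^ ,]+");

    // Hypothetical helper: return every token that mixes letters and digits.
    static List<String> findMixed(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = MIXED.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String s = "A35, 35A, B53X, 1AC5, AB-10, 10-AB, A10-BA, BA-A10, etc. "
                + "-4x, 4x- -4-x- 10-10, oe-oe, etc";
        // Prints: A35|35A|B53X|1AC5|AB-10|10-AB|A10-BA|BA-A10|-4x|4x-|-4-x-
        System.out.println(String.join("|", findMixed(s)));
    }
}
```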

Can you help with regular expressions in Java?

I have a bunch of strings which may or may not have random symbols and numbers in them. Some examples are:
contains(reserved[j])){
close();
i++){
letters[20]=word
I want to find any character that is NOT a letter, and replace it with a white space, so the above examples look like:
contains reserved j
close
i
letters word
What is the best way to do this?
It depends what you mean by "not a letter", but assuming you mean that letters are a-z or A-Z then try this:
s = s.replaceAll("[^a-zA-Z]", " ");
If you want to collapse multiple symbols into a single space then add a plus at the end of the regular expression.
s = s.replaceAll("[^a-zA-Z]+", " ");
yourInputString = yourInputString.replaceAll("[^\\p{Alpha}]", " ");
^ denotes "all characters except"
\p{Alpha} denotes all alphabetic characters
See Pattern for details.
I want to find any character that is NOT a letter
That will be [^\p{Alpha}]+. The [] indicate a character class. The \p{Alpha} matches any alphabetic character (both uppercase and lowercase; it does basically the same as \p{Upper}\p{Lower} and a-zA-Z). The ^ inside the character class inverts the match. The + indicates one or more matches in sequence.
and replace it with a white space
That will be " ".
Summarized:
string = string.replaceAll("[^\\p{Alpha}]+", " ");
Also see the java.util.regex.Pattern javadoc for a concise overview of available patterns. You can learn more about regexes at the great site http://regular-expression.info.
Use the regexp /[^a-zA-Z]/ which means, everything that is not in the a-z/A-Z characters
In ruby I would do:
"contains(reserved[j]))".gsub(/[^a-zA-Z]/, " ")
=> "contains reserved j "
In Java should be something like:
import java.util.regex.*;
...
String inputStr = "contains(reserved[j])){";
String patternStr = "[^a-zA-Z]";
String replacementStr = " ";
// Compile regular expression
Pattern pattern = Pattern.compile(patternStr);
// Replace all occurrences of pattern in input
Matcher matcher = pattern.matcher(inputStr);
String output = matcher.replaceAll(replacementStr);
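The variants above can be compared side by side in a small sketch. The class name NonLetterCleaner and both method names are invented here, and the trim() call is an assumption based on the expected output in the question, which shows no trailing space. In Java, \p{Alpha} covers only ASCII letters by default, so the second method uses \p{L} in case "letter" should include non-ASCII letters.

```java
class NonLetterCleaner {
    // Replace each run of non-ASCII-letter characters with one space.
    static String lettersOnlyAscii(String s) {
        return s.replaceAll("[^a-zA-Z]+", " ").trim();
    }

    // Same idea, but \p{L} keeps any Unicode letter (assumption: accented letters count as letters).
    static String lettersOnlyUnicode(String s) {
        return s.replaceAll("[^\\p{L}]+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(lettersOnlyAscii("contains(reserved[j])){")); // contains reserved j
        System.out.println(lettersOnlyAscii("letters[20]=word"));        // letters word
        System.out.println(lettersOnlyAscii("caf\u00e9[1]"));            // caf
        System.out.println(lettersOnlyUnicode("caf\u00e9[1]"));          // caf followed by e-acute
    }
}
```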
