Regular expression matching whole word OR operator - java

I am trying to match whole words in some lines and want to know how to use OR in a regex.
If I use only one keyword, it works fine. Example:
String regex = ".*\\b" + "KEYWORD1" + "\\b.*";
With several keywords separated by |, it does not match as expected:
String regex = ".*\\b" + "KEYWORD1|KEYWORD2|KEYWORD3" + "\\b.*";
for (int i = start; i < end; i++) {
    if (lines[i].matches(regex)) {
        System.out.println("Matches");
    }
}

You want:
String regex = ".*\\b(KEYWORD1|KEYWORD2|KEYWORD3)\\b.*";
Originally, your regex was being evaluated like this:
.*\bKEYWORD1
|
KEYWORD2
|
KEYWORD3\b.*
But you want:
.*\b
(
KEYWORD1
|
KEYWORD2
|
KEYWORD3
)
\b.*
This cool tool can help you analyse regexes and find bugs like this one.

The pipe character | can be used as an OR operator, which is called alternation in regex.
To get this to work properly in your example, you just need to create a group around the alternation to be sure that you are doing the OR only on the keywords you are interested in, for example:
String regex = ".*\\b(KEYWORD1|KEYWORD2|KEYWORD3)\\b.*";
What you currently have would mean .*\\bKEYWORD1 OR KEYWORD2 OR KEYWORD3\\b.*.
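To see the difference in practice, here is a minimal sketch that runs both patterns against a sample line. The class name, the helper method and the sample strings are made up for illustration; only the two patterns come from the question and the answers above.

```java
import java.util.regex.Pattern;

final class KeywordAlternation {
    // Ungrouped: parsed as  .*\bKEYWORD1  |  KEYWORD2  |  KEYWORD3\b.*
    static final Pattern UNGROUPED =
            Pattern.compile(".*\\bKEYWORD1|KEYWORD2|KEYWORD3\\b.*");

    // Grouped: the alternation is confined to the three keywords.
    static final Pattern GROUPED =
            Pattern.compile(".*\\b(KEYWORD1|KEYWORD2|KEYWORD3)\\b.*");

    // Same check as String.matches: the whole line must match the pattern.
    static boolean matchesLine(Pattern pattern, String line) {
        return pattern.matcher(line).matches();
    }

    public static void main(String[] args) {
        String line = "some text KEYWORD2 more text";
        System.out.println(matchesLine(UNGROUPED, line)); // false
        System.out.println(matchesLine(GROUPED, line));   // true
    }
}
```

With the ungrouped pattern, the middle alternative is just KEYWORD2, so it only matches a line that consists of that keyword and nothing else.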

Related

Split a string with multiple delimiters while keeping these delimiters

Let's say we have a string:
String x = "a| b |c& d^ e|f";
What I want to obtain looks like this:
a
|
b
|
c
&
d
^
e
|
f
How can I achieve this by using x.split(regex)? Namely what is the regex?
I tried this link: How to split a string, but also keep the delimiters?
It gives a very good explanation on how to do it with one delimiter.
However, using and modifying that to fit multiple delimiters (lookahead and lookbehind mechanism) is not that obvious to someone who is not familiar with regex.
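The regex for splitting on optional spaces after a word boundary is
\\b\\s*
Note that \\b matches at a position where a word character (a letter, a digit or an underscore) meets a non-word character or the start or end of the string; \\s* then matches any number of whitespace characters that follow, and the string is split there. Whitespace that comes before a word boundary (as in "| b") stays attached to the preceding token, so trim the results if that matters.
Here is some sample Java code (on IDEONE):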
String str = "a| b |c& d^ e|f";
String regex = "\\b\\s*";
String[] spts = str.split(regex);
for (int i = 0; i < spts.length && i < 20; i++) {
    System.out.println(spts[i]);
}
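As a rough sketch of how the split above could be wrapped up, the helper below splits on \\b\\s* and then trims each piece, dropping any empty ones. The class and method names are invented for this example. It assumes, as in the question, that the operands are made of word characters and the delimiters are single non-word characters; two adjacent delimiters such as "||" would come back as one token.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitKeepDelimiters {
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        // Split at word boundaries, swallowing whitespace that follows one.
        for (String piece : input.split("\\b\\s*")) {
            // Pieces such as "| " keep a trailing space, so trim them.
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints a, |, b, |, c, &, d, ^, e, |, f on separate lines.
        for (String token : tokens("a| b |c& d^ e|f")) {
            System.out.println(token);
        }
    }
}
```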

How can I look for two specific characters in a string?

String abc = "||:::|:|::";
It should return true if there are two | characters and three : characters.
I'm not sure how to use "regex" or if it's the right method to use. There's no specific pattern in the abc String.
Using a regex would be a bad idea, especially if there's no specific order to them. Make a function that counts the number of times a character appears in a string, and use that:
public int count(String base, char toFind)
{
    int count = 0;
    char[] haystack = base.toCharArray();
    for (int i = 0; i < haystack.length; i++)
        if (haystack[i] == toFind)
            count++;
    return count;
}
String abc = "||:::|:|::";
// pass char literals ('|'), not Strings ("|"), to match the char parameter
if (count(abc, '|') >= 2 && count(abc, ':') >= 3)
{
    //Do some code here
}
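My favorite method for counting the occurrences of a character in a string is int num = s.length() - s.replace("|","").length(); (use replace rather than replaceAll here, because replaceAll would treat | as a regex alternation and remove nothing). You can do that for both characters and test those ints.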
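A small sketch of the length-difference idea, under the assumption that the thresholds are "at least two |" and "at least three :" as in the other answers. The class and method names are hypothetical. It uses String.replace, which takes literal text, so | needs no escaping; the literal must not be empty.

```java
final class CharCount {
    // Counts non-overlapping occurrences of a non-empty literal in s.
    static int countLiteral(String s, String literal) {
        // String.replace treats its arguments as plain text, not as a regex.
        int removed = s.length() - s.replace(literal, "").length();
        return removed / literal.length();
    }

    static boolean hasEnough(String s) {
        return countLiteral(s, "|") >= 2 && countLiteral(s, ":") >= 3;
    }

    public static void main(String[] args) {
        System.out.println(hasEnough("||:::|:|::")); // true
        System.out.println(hasEnough("|::::::"));    // false
    }
}
```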
If you want to test all conditions in one regex you can use look-ahead (?=condition).
Your regex can look like
String regex =
    "(?=(.*[|]){2})"//contains at least two |
    + "(?=(.*:){3})"//contains at least three :
    + "[|:]+";//is built only from : and | characters
Now you can use it with matches like
String abc = "||:::|:|::";
System.out.println(abc.matches(regex));//true
abc = "|::::::";
System.out.println(abc.matches(regex));//false
Anyway, you can avoid regex and write your own method that counts the | and : characters in your string and checks whether these counts are greater than or equal to 2 and 3. You can use StringUtils.countMatches from Apache Commons, so your test code could look like
public static boolean testString(String s){
    int pipes = StringUtils.countMatches(s, "|");
    int colons = StringUtils.countMatches(s, ":");
    return pipes>=2 && colons>=3;
}
or
public static boolean testString(String s){
    return StringUtils.countMatches(s, "|")>=2
        && StringUtils.countMatches(s, ":")>=3;
}
This assumes you are looking for the two '|' to appear one after the other, likewise the three ':', with the pipes coming before the colons. In that case a single regular expression will do:
".*\\|\\|.*:::.*"
If you only want to check that the characters are present, irrespective of their order, use the String.matches method with the following two regular expressions, combined with a logical AND:
".*\\|.*\\|.*"
".*:.*:.*:.*"
Note that | has to be escaped, because an unescaped | is the alternation operator.
Here is a cheat sheet for regular expressions. It's fairly simple to learn. Look at groups and quantifiers in the document to understand the above expressions.
Haven't tested it, but this should work
Pattern.compile("^(?=(?:.*[|]){2})(?=(?:.*:){3}).*$");
Each lookahead (?=...) scans the string from the start: the first checks whether | occurs at least twice anywhere in it, and the second does the same for :, which has to occur at least three times. The trailing .* then consumes the string itself, since lookaheads match without consuming anything.

Split String at n-th character preserving words

Expanding on this answer, using this regex (?<=\\G.{" + count + "}); I would also like to modify the expression to not split words in the middle.
Example:
String string = "Hello I would like to split this string preserving these words";
If I split on 10 characters, it looks like this:
[Hello I wo, uld like t, o split th, is string , preserving, these wor, ds]
Question:
Is this even possible using only regex, or would a lexer or some other string manipulation be needed?
UPDATE
This is what I want to use it on:
+ -------------------------------------------JVM Information------------------------------------------ +
| sun.boot.class.path : C:\Program Files\Java\jdk1.6.0_33\jre\lib\resources.jar;C:\Program Files\Java\ |
| jdk1.6.0_33\jre\lib\rt.jar;C:\Program Files\Java\jdk1.6.0_33\jre\lib\sunrsasig |
| n.jar;C:\Program Files\Java\jdk1.6.0_33\jre\lib\jsse.jar;C:\Program Files\Java |
| \jdk1.6.0_33\jre\lib\jce.jar;C:\Program Files\Java\jdk1.6.0_33\jre\lib\charset |
| s.jar;C:\Program Files\Java\jdk1.6.0_33\jre\lib\modules\jdk.boot.jar;C:\Progra |
| m Files\Java\jdk1.6.0_33\jre\classes |
+ ---------------------------------------------------------------------------------------------------- +
The box surrounding it has the character limit minus the key width; however, this does not look good. This example is also not the only use case; I use that box for multiple types of information.
I have looked at this problem and none of those replies actually convinced me! Here is my version. It is very likely that it can be improved.
public static String[] splitPreservingWords(String text, int length) {
    return text.replaceAll("(?:\\s*)(.{1,"+ length +"})(?:\\s+|\\s*$)", "$1\n").split("\n");
}
"not split words in the middle" does not define what should happen in case of "not splitting".
Given the split length being 10 and the string:
Hello I would like to split this string preserving these words
If you want to split right after a word, resulting in the list:
Hello I would, like to split, this string, preserving, these words
You can accomplish all kinds of tricky "splits" by using plain matching.
Simply match all occurrences of this expression:
(?s)\G.{10,}?\b
(Using (?s) to turn on the DOTALL flag.)
In Perl it's as simple as @array = $str =~ /\G.{10,}?\b/gs, but Java seems to lack a quick function to return all matches, so you'd probably have to use a matcher and push the results on to an array/list.
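In Java, the matcher loop mentioned above might look roughly like the sketch below. The class and method names are invented for illustration. Two details are additions of this sketch, not part of the answer: each chunk is trimmed (matches after the first start with the separating space), and any tail left over after the last match is appended so that no text is lost.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordBoundaryChunks {
    static List<String> chunks(String text, int minLength) {
        // \G anchors each match at the end of the previous one;
        // DOTALL corresponds to the (?s) flag in the answer.
        Pattern p = Pattern.compile("\\G.{" + minLength + ",}?\\b", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        List<String> parts = new ArrayList<>();
        int end = 0;
        while (m.find()) {
            parts.add(m.group().trim());
            end = m.end();
        }
        // Keep whatever is left if it is too short to match.
        String rest = text.substring(end).trim();
        if (!rest.isEmpty()) {
            parts.add(rest);
        }
        return parts;
    }

    public static void main(String[] args) {
        String s = "Hello I would like to split this string preserving these words";
        // [Hello I would, like to split, this string, preserving, these words]
        System.out.println(chunks(s, 10));
    }
}
```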
No regex, but it seems to work:
List<String> parts = new ArrayList<String>();
// stop once the remainder fits within n characters
while (string.length() > n) {
    // look for space to the left of n-th character
    int index = string.lastIndexOf(" ", n);
    if (index == -1) {
        // no space to the left (very long word) -> next space to the right
        // change this to 'index = n' to break words in this case
        index = string.indexOf(" ", n);
    }
    if (index == -1) {
        break;
    }
    parts.add(string.substring(0, index));
    string = string.substring(index+1);
}
parts.add(string);
This will first look if there is a space to the left of the n-th character. In this case, the string is split there. Otherwise, it looks for the next space to the right. Alternatively, you could break the word in this case.
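For reference, the same loop can be packaged as a self-contained method; this is only a sketch, and the class and method names are made up. It adds one assumption of its own: the loop stops as soon as the remainder fits within n characters, so a short final line is not split again at its last space.

```java
import java.util.ArrayList;
import java.util.List;

final class WordWrap {
    static List<String> wrap(String text, int n) {
        List<String> parts = new ArrayList<>();
        String rest = text;
        while (rest.length() > n) {
            // Last space at or before index n, if any.
            int index = rest.lastIndexOf(' ', n);
            if (index == -1) {
                // Very long word: fall back to the next space to the right.
                index = rest.indexOf(' ', n);
            }
            if (index == -1) {
                break; // no space left at all
            }
            parts.add(rest.substring(0, index));
            rest = rest.substring(index + 1);
        }
        parts.add(rest);
        return parts;
    }

    public static void main(String[] args) {
        String s = "Hello I would like to split this string preserving these words";
        // [Hello I, would like, to split, this, string, preserving, these, words]
        System.out.println(wrap(s, 10));
    }
}
```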

How can I find overlapping sets of words with regex expression?

Right now I have a regex that looks like "\\w+ \\w+" to find 2-word phrases; however, the matches do not overlap. For example, if my sentence was "The dog ran inside", the output would show "The dog", "ran inside" when I need it to show "The dog", "dog ran", "ran inside". I know there's a way to do this, but I'm just way too new to regex to know how.
Thanks!
You can do this with a lookahead, a capturing group and a word boundary anchor:
Pattern regex = Pattern.compile("\\b(?=(\\w+ \\w+))");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    matchList.add(regexMatcher.group(1));
}
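A self-contained version of the lookahead approach could look like the sketch below; the class and method names are hypothetical, and the pattern is the one from the answer. Because the lookahead consumes nothing, the matcher moves on one position at a time and each word can start a new pair.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OverlappingPairs {
    // Zero-width match at a word boundary; group 1 captures the two words ahead.
    private static final Pattern PAIR = Pattern.compile("\\b(?=(\\w+ \\w+))");

    static List<String> pairs(String sentence) {
        List<String> result = new ArrayList<>();
        Matcher m = PAIR.matcher(sentence);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // [The dog, dog ran, ran inside]
        System.out.println(pairs("The dog ran inside"));
    }
}
```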
This is not possible purely with regex, you can't match the same characters twice ("dog" can't be in two separate groups). Something like this doesn't need regex at all, you can simply split the string by spaces and combine it however you like:
>>> words = "The dog ran inside".split(" ")
>>> [" ".join(words[i:i+2]) for i in range(len(words)-1)]
['The dog', 'dog ran', 'ran inside']
If that doesn't solve your problem please provide more details about what exactly you're trying to accomplish.
Use a lookahead to get the second word, then concatenate the non-lookahead part with the lookahead part.
# This is Perl. The important bits:
#
# $1 is what the first parens captured.
# $2 is what the second parens captured.
# . is the concatenation operator (like Java's "+").
while (/(\w+)(?=(\s+\w+))/g) {
    my $phrase = $1 . $2;
    ...
}
Sorry, don't know enough Java, but this should be easy enough to do in Java too.
The easy way (and the faster one for a big String) is to use split:
final String[] arrStr = "The dog ran inside".split(" ");
for (int i = 0, n = arrStr.length - 1; i < n; i++) {
    System.out.format("%s %s%n", arrStr[i], arrStr[i + 1]);
}
Output:
The dog
dog ran
ran inside
I found no trick for this with regex.

Java regex split on whitespace not preceded or followed by single or double quotes

I can't get this to work..
I have a String which I want to split on spaces. However, I do not want to split inside string literals, that is, text which is inside double or single quotes.
Example
Splitting the following string:
private String words = " Hello, today is nice " ;
..should produce the following tokens:
private
String
words
=
" Hello, today is nice "
;
What kind of regex can I use for this?
The regex ([^ "]+)|("[^"]*") should match all the tokens (note the +: with * the first alternative could match the empty string, and the quoted alternative would never be reached). Drawing on my limited knowledge of Java and http://www.regular-expressions.info/java.html, you should be able to do something like this:
// Please excuse any syntax errors, I'm used to C#
Pattern pattern = Pattern.compile("([^ \"]+)|(\"[^\"]*\")");
Matcher matcher = pattern.matcher(theString);
while (matcher.find())
{
    // do something with matcher.group();
}
Have you tried this?
((['"]).*?\2|\S+)
Here is what it does:
( <= Group everything
(['"]) <= Find a single or double quote
.*? <= Capture everything after the quote (ungreedy)
\2 <= Find the same single or double quote as we had before
| <= Or
\S+ <= Non space characters (one at least)
)
On another note, if you want to create a parser, write a parser and don't use regexes.
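For the example in the question, a matcher loop over a pattern of this kind might look like the sketch below. The class and method names are made up, and the outer group is dropped, so the backreference becomes \1 instead of \2. This is only a sketch under the assumption that quotes are balanced and do not start in the middle of a word; escaped quotes inside a literal are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuoteAwareTokenizer {
    // Either a quoted run closed by the same quote character, or a run of non-spaces.
    private static final Pattern TOKEN = Pattern.compile("(['\"]).*?\\1|\\S+");

    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "private String words = \" Hello, today is nice \" ;";
        // Prints the six tokens from the question, one per line.
        for (String token : tokens(line)) {
            System.out.println(token);
        }
    }
}
```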
