How do I find a group of words using Reg-ex? - java

Here is the code:
String Str ="Animals \n" +
"Dog \n" +
"Cat \n" +
"Fruits \n" +
"Apple \n" +
"Banana \n" +
"Watermelon \n" +
"Sports \n" +
"Soccer \n" +
"Volleyball \n";
Str has 3 categories (Animals, Fruits, Sports), each on a separate line. Using a regular expression, how do I find the contents of Fruits, so that the output looks like this:
Apple
Banana
Watermelon
I would like an explanation to go with your answer as well, so that I get a better understanding of this problem.
Thanks. :)

Assuming that you want to extract the text between the word "Fruits" and the word "Sports" you could use a regular expression with a capturing group. This way, if a string matches then you still have to extract the group that contains the text that you want.
For example:
Pattern p = Pattern.compile("Fruits(.*?)Sports", Pattern.DOTALL);
// Fruits         : the literal string "Fruits"
// (.*?)          : capture everything in between, as few characters as possible
// Sports         : the literal string "Sports"
// Pattern.DOTALL : tells the regex to treat newlines like normal characters
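As a minimal sketch of the extraction step, the following reads group 1 from the match and splits it into trimmed lines. The class name FruitsCapture and the method fruits are made up for illustration, and the sketch assumes the "Sports" heading always follows the "Fruits" block.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class FruitsCapture {
    // Returns the non-empty, trimmed lines found between "Fruits" and "Sports".
    static List<String> fruits(String text) {
        Pattern p = Pattern.compile("Fruits(.*?)Sports", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        List<String> items = new ArrayList<>();
        if (m.find()) {
            // group(1) is the text captured by (.*?)
            for (String line : m.group(1).split("\n")) {
                String item = line.trim();
                if (!item.isEmpty()) {
                    items.add(item);
                }
            }
        }
        return items;
    }

    public static void main(String[] args) {
        String str = "Animals \nDog \nCat \nFruits \nApple \nBanana \nWatermelon \n"
                + "Sports \nSoccer \nVolleyball \n";
        for (String item : fruits(str)) {
            System.out.println(item);
        }
    }
}
```

If no match is found, the sketch returns an empty list rather than throwing.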
Alternatively, you can use a more advanced regular expression with a positive lookbehind and a positive lookahead. The regular expression still looks for text between the words "Fruits" and "Sports", but those strings themselves are not part of the match.
Pattern p = Pattern.compile("(?<=Fruits).*?(?=Sports)", Pattern.DOTALL);

I would start by splitting the string into an array of lines (String[] lines = Str.split("\n");), then loop through the array, adding elements to their proper categories as you go along and switching categories whenever you see a heading.
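A rough sketch of that loop, assuming the set of headings is known in advance (the question does not say how a heading is recognised, so that set is an assumption here, as are the names CategorySplitter and group):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class for illustration only.
class CategorySplitter {
    // Groups each non-heading line under the most recent heading seen.
    static Map<String, List<String>> group(String text, Set<String> headings) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        List<String> current = null;
        for (String raw : text.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (headings.contains(line)) {
                // Switch to a new category
                current = new ArrayList<>();
                result.put(line, current);
            } else if (current != null) {
                current.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "Animals \nDog \nCat \nFruits \nApple \nBanana \nWatermelon \n"
                + "Sports \nSoccer \nVolleyball \n";
        Map<String, List<String>> byCategory =
                group(str, Set.of("Animals", "Fruits", "Sports"));
        System.out.println(byCategory.get("Fruits"));
    }
}
```

Lines that appear before any heading are dropped in this sketch; whether that is the right behaviour depends on the real input.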

Related

How do I replace a certain char in between 2 strings using regex

I'm new to regex and have been trying to work this out on my own, but I can't get it working. I have an input that contains start and end flags, and I want to replace a certain char, but only if it's between the flags.
For example, say the start flag is START, the end flag is END, the char I'm trying to replace is ", and I want to replace it with \".
I would call input.replaceAll(regex, "\\\\\"");
I tried making a regex that only matches the correct " chars, but so far I have only been able to get it to match all chars between the flags and not just the " chars. -> (?<=START)(.*)(?=END)
Example input:
This " is START an " example input END string ""
START This is a "" second example END
This" is "a START third example END " "
Expected output:
This " is START an \" example input END string ""
START This is a \"\" second example END
This" is "a START third example END " "
Find all characters between START and END, and for those characters replace " with \".
To achieve this, apply a replacer function to all matches of characters between START and END:
string = Pattern.compile("(?<=START).*?(?=END)").matcher(string)
.replaceAll(mr -> mr.group().replace("\"", "\\\\\""));
which produces your expected output.
Some notes on how this works.
The first step is to match all characters between START and END, which uses lookarounds with a reluctant quantifier:
(?<=START).*?(?=END)
The ? after the .* changes the match from greedy (as many chars as possible while still matching) to reluctant (as few chars as possible while still matching). This prevents the middle quote in the following input from being altered:
START a"b END c"d START e"f END
A greedy quantifier will match from the first START all the way past the next END to the last END, incorrectly including c"d.
The next step is, for each match, to replace " with \". The full match is group 0, or just MatchResult#group(), and we don't need regex for this replacement: a plain string replace is enough (and yes, replace() replaces all occurrences).
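To make the greedy versus reluctant difference concrete, here is a small sketch that runs both quantifiers over the sample input above. It needs Java 9 or later for Matcher.replaceAll(Function). The class name QuoteEscaper is hypothetical, and the sketch swaps in Matcher.quoteReplacement so that any backslash or dollar sign already present in the matched text is not reinterpreted as replacement syntax; that is a variation on the answer's code, not the same code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class QuoteEscaper {
    static final String RELUCTANT = "(?<=START).*?(?=END)";
    static final String GREEDY = "(?<=START).*(?=END)";

    // Escapes every double quote inside each region matched by the given regex.
    static String escape(String input, String regex) {
        return Pattern.compile(regex).matcher(input)
                .replaceAll(mr -> Matcher.quoteReplacement(
                        mr.group().replace("\"", "\\\"")));
    }

    public static void main(String[] args) {
        String input = "START a\"b END c\"d START e\"f END";
        // Reluctant: c"d is left alone
        System.out.println(escape(input, RELUCTANT));
        // Greedy: c"d is escaped too, which is not what was asked for
        System.out.println(escape(input, GREEDY));
    }
}
```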
For now I've been able to solve it by creating 3 capture groups and continuously replacing the match until there are no more matches left. In this case I even had to insert a replacement identifier, because replacing with \" would keep the " char there and create an infinite loop. Then, when there are no more matches left, I replace my identifier, and I'm now getting the expected result.
I still feel like there has to be a much cleaner way to do this using only 1 replace statement...
Code that worked for me:
class Playground {
public static void main(String[ ] args) {
String input = "\"ThSTARTis is a\" te\"\"stEND \" !!!";
String regex = "(.*START.+)\"+(.*END+.*)";
while(input.matches(regex)){
input = input.replaceAll(regex, "$1---replace---$2");
}
String result = input.replace("---replace---", "\\\"");
System.out.println(result);
}
}
Output:
"ThSTARTis is a\" te\"\"stEND " !!!
I would love any suggestions as to how I could solve this in a better/cleaner way.
Another option is to make use of the \G anchor with 2 capture groups. In the replacement use the 2 capture groups followed by \"
(?:(START)(?=.*END)|\G(?!^))((?:(?!START|END)(?>\\+\"|[^\r\n\"]))*)\"
Explanation
(?: Non capture group
(START)(?=.*END) Capture group 1, match START and assert there is END to the right
| Or
\G(?!^) Assert the current position at the end of the previous match, but not at the start of the input
) Close non capture group
( Capture group 2
(?: Non capture group
(?!START|END) Negative lookahead, assert not START or END directly to the right
(?>\\+\"|[^\r\n\"]) Match 1+ times \ followed by " or match any char except " or a newline
)* Close the non capture group and optionally repeat it
) Close group 2
\" Match "
For example:
String regex = "(?:(START)(?=.*END)|\\G(?!^))((?:(?!START|END)(?>\\\\+\\\"|[^\\r\\n\\\"]))*)\\\"";
String string = "This \" is START an \" example input END string \"\"\n"
+ "START This is a \"\" second example END\n"
+ "This\" is \"a START third example END \" \"";
String subst = "$1$2\\\\\"";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(string);
String result = matcher.replaceAll(subst);
System.out.println(result);
Output
This " is START an \" example input END string ""
START This is a \"\" second example END
This" is "a START third example END " "

Java - Regex Match Multiple Words

Let's say that you want to match a string with the following regex:
".when is (\w+)." - I am trying to get the event after 'when is'
I can get the event with matcher.group(index), but this doesn't work if the event is something like Veteran's Day, since it is two words. I am only able to get the first word after 'when is'.
What regex should I use to get all of the words after 'when is'?
Also, let's say I want to capture someone's birthday, like
'when is * birthday'
How do I capture all of the text between is and birthday with regex?
You could try this:
^when is (.*)$
This will find a string that starts with when is and capture everything else to the end of the line.
The regex will return one group. You can access it like so:
String line = "when is Veteran's Day.";
Pattern pattern = Pattern.compile("^when is (.*)$");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
    // group(0) is the whole match; group(1) is the only capturing group
    System.out.println("group 0: " + matcher.group(0));
    System.out.println("group 1: " + matcher.group(1));
}
And the output should be:
group 0: when is Veteran's Day.
group 1: Veteran's Day.
If you want to allow whitespace to be matched, you should explicitly allow whitespace.
([\w\s]+)
However, roydukkey's solution will work if you want to capture everything after when is.
Don't use regular expressions when you don't need to! Although regular expressions are elegant in that a single string can describe the matching logic for you, they add compilation and matching overhead that simple use cases do not need.
If you are trying to get the word after "when is" ending by a space, you could do something like this:
String start = "when is ";
String end = " ";
int startLocation = fullString.indexOf(start) + start.length();
String afterStart = fullString.substring(startLocation, fullString.length());
String word = afterStart.substring(0, afterStart.indexOf(end));
If you know the last word is Day, you can just make end = "Day" and add the length of that string to the end index of the second substring, so that Day is included.
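The snippet above throws if "when is " is missing or if no space follows the word. A guarded sketch of the same indexOf/substring idea might look like this (the class and method names are invented for the example, and returning null for "not found" is just one possible convention):

```java
// Hypothetical helper class for illustration only.
class WhenIsParser {
    // Returns the first word after the start marker, or null if the marker is absent.
    static String firstWordAfter(String fullString, String start) {
        int idx = fullString.indexOf(start);
        if (idx < 0) {
            return null; // marker not found
        }
        String afterStart = fullString.substring(idx + start.length());
        int end = afterStart.indexOf(' ');
        // No trailing space means the word runs to the end of the string
        return end < 0 ? afterStart : afterStart.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(firstWordAfter("when is Christmas this year", "when is "));
        System.out.println(firstWordAfter("when is Easter", "when is "));
    }
}
```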
You can express this as a character class and include spaces in it: when is ([\w ]+).
\w only includes word characters, which doesn't include spaces. Use [\w ]+ instead.
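One caveat worth checking with the question's own example: \w does not include the apostrophe either, so [\w ]+ stops at "Veteran" in "Veteran's Day". The sketch below compares the classes; adding ' to the class is an assumption about what counts as part of an event name, and the last pattern is one possible answer to the "between is and birthday" part of the question. The class name EventExtractor is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class EventExtractor {
    // Returns group 1 of the first match, or null when nothing matches.
    static String capture(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "when is Veteran's Day.";
        System.out.println(capture("when is (\\w+)", line));       // stops at the apostrophe
        System.out.println(capture("when is ([\\w ]+)", line));    // still stops at the apostrophe
        System.out.println(capture("when is ([\\w' ]+)", line));   // apostrophe allowed
        // Reluctant match for the text between "is" and "birthday"
        System.out.println(capture("when is (.*?) birthday", "when is John's birthday"));
    }
}
```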

Java Regex : String Formatting

After running this
Names.replaceAll("^(\\w)\\w+", "$1.")
I have a String Like
Names = F.DA, ABC, EFG
I want a String format like
F.DA, A.BC & E.FG
How do I do that ?
Update :
If I had a name Like
Robert Filip, Robert Morris, Cirstian Jed
I want like
R.Filip, R.Morris & C.Jed
I would also be happy if you could suggest a good resource on Java regex.
You need to assign the result back to names: Strings are immutable, so the replaceAll method does not replace in place but returns a new String:
names = names.replaceAll(", (?=[^,]*$)", " & ")
Following should work for you:
String names = "Robert Filip, Robert Morris, Cirstian Jed, S.Smith";
String repl = names.replaceAll("((?:^|[^A-Z.])[A-Z])[a-z]*\\s(?=[A-Z])", "$1.")
.replaceAll(", (?=[^,]*$)", " & ");
System.out.println(repl); //=> R.Filip, R.Morris, C.Jed & S.Smith
Explanation:
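The 1st replaceAll call matches either the start of the string or a character that is neither an uppercase letter nor a dot, followed by a capital letter (together captured as group #1), then 0 or more lowercase letters and a whitespace character that must be followed by a capital letter. It replaces the match with $1 followed by a dot, which drops the rest of the first name.
The 2nd replaceAll call matches a comma and space that are not followed by any further comma (that is, the last one) and replaces them with the literal string " & ".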
Try this
String names = "Amal.PM , Rakesh.KR , Ajith.N";
names = names.replaceAll(" , (?=[^,]*$)", " & ");
System.out.println("New String : "+names);

Java Regex: Match any word from pattern

I'm trying to implement a search function.
The user types a phrase and I want to match any word from the phrase and the phrase itself in an array of strings.
The problem is that the phrase is stored in a variable, so the Pattern.compile method won't interpret its special characters.
I'm using the following flags for the compile method:
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.LITERAL |
Pattern.MULTILINE
How could I achieve the desired result?
Thanks in advance.
edit:
For example, the phrase:
"Dog cats donuts"
would result in the pattern:
Dog | cats | donuts | Dog cats donuts
Split the user-specified phrase by \s+ into, say, arr.
Build the following pattern:
"\\b(?:" + Pattern.quote(arr[0]) + "|" + Pattern.quote(arr[1]) + "|" + Pattern.quote(arr[2]) + ... + ")\\b"
Compile without the Pattern.LITERAL option.
In other words, if you want your patterns to match words in a user-specified phrase, you have to use alternation (the pipes) so that any one of those words can be considered a match. However, using the Pattern.LITERAL option makes the alternation operators literal—therefore you have to "literalize" just the words themselves, using the Pattern.quote(...) method. The \\b are word boundaries so that you do not match, say, a word in the user's phrase like "bar" when encountering text like "barrage".
Edit. In response to your edit. If you want to match the longest possible match, e.g. not "Dog" and "cats" and "donuts" but rather "Dog cats donuts", you should place the complete phrase at the beginning of the alternation series, e.g.
\\b(Dog cats donuts|Dog|cats|donuts)\\b
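Putting those steps together, a minimal sketch could build the pattern for a phrase with any number of words, quoting each piece and placing the whole phrase first. The names PhraseMatcher, build and matches are made up for the example, and the sketch assumes the phrase is non-empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class PhraseMatcher {
    // Builds \b(?:<whole phrase>|<word1>|<word2>|...)\b with every part quoted.
    static Pattern build(String phrase) {
        String trimmed = phrase.trim();
        StringBuilder sb = new StringBuilder("\\b(?:");
        sb.append(Pattern.quote(trimmed)); // whole phrase first so it wins over single words
        for (String word : trimmed.split("\\s+")) {
            sb.append('|').append(Pattern.quote(word));
        }
        sb.append(")\\b");
        // Pattern.LITERAL is deliberately left out so the alternation stays active
        return Pattern.compile(sb.toString(),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    // Collects every match of the phrase pattern in the given text.
    static List<String> matches(String phrase, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = build(phrase).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(matches("Dog cats donuts",
                "My dog chased two cats. Dog cats donuts!"));
    }
}
```

Because of the word boundaries, a phrase word such as "bar" does not match inside "barrage".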
Try this:
String regex = "\\b(" + phrase + "|" + phrase.replaceAll("\\s+", "|") + ")\\b";
In action:
String phrase = "Dog cats donuts";
String regex = "\\b(" + phrase + "|" + phrase.replaceAll("\\s+", "|") + ")\\b";
System.out.println(regex);
Output:
\b(Dog cats donuts|Dog|cats|donuts)\b

Can you help with regular expressions in Java?

I have a bunch of strings which may or may not have random symbols and numbers in them. Some examples are:
contains(reserved[j])){
close();
i++){
letters[20]=word
I want to find any character that is NOT a letter, and replace it with a white space, so the above examples look like:
contains reserved j
close
i
letters word
What is the best way to do this?
It depends what you mean by "not a letter", but assuming you mean that letters are a-z or A-Z then try this:
s = s.replaceAll("[^a-zA-Z]", " ");
If you want to collapse multiple symbols into a single space then add a plus at the end of the regular expression.
s = s.replaceAll("[^a-zA-Z]+", " ");
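As a quick check against the examples in the question, here is a small sketch that applies the collapsing version and trims the result. The trim() call is an addition, assumed from the expected output showing no leading or trailing spaces, and the class name NonLetterCleaner is hypothetical.

```java
// Hypothetical helper class for illustration only.
class NonLetterCleaner {
    // Replaces each run of non-letters with one space, then trims the ends.
    static String clean(String s) {
        return s.replaceAll("[^a-zA-Z]+", " ").trim();
    }

    public static void main(String[] args) {
        String[] samples = {
            "contains(reserved[j])){", "close();", "i++){", "letters[20]=word"
        };
        for (String sample : samples) {
            System.out.println(clean(sample));
        }
    }
}
```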
yourInputString = yourInputString.replaceAll("[^\\p{Alpha}]", " ");
^ denotes "all characters except"
\p{Alpha} denotes all alphabetic characters
See Pattern for details.
I want to find any character that is NOT a letter
That will be [^\p{Alpha}]+. The [] delimit a character class. The \p{Alpha} matches any alphabetic character (both uppercase and lowercase; it is basically the same as [\p{Upper}\p{Lower}] or a-zA-Z). The ^ at the start of the class negates it. The + indicates one or more matches in sequence.
and replace it with a white space
That will be " ".
Summarized:
string = string.replaceAll("[^\\p{Alpha}]+", " ");
Also see the java.util.regex.Pattern javadoc for a concise overview of available patterns. You can learn more about regexes at the great site http://regular-expression.info.
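What counts as "a letter" matters here: by default \p{Alpha} in Java covers only ASCII letters, while \p{L} covers any Unicode letter. The sketch below shows the difference on an accented character; the class name and the sample string are made up for the example.

```java
// Hypothetical helper class for illustration only.
class LetterClasses {
    // Treats only ASCII letters as letters.
    static String asciiOnly(String s) {
        return s.replaceAll("[^\\p{Alpha}]+", " ");
    }

    // Treats any Unicode letter as a letter.
    static String unicodeLetters(String s) {
        return s.replaceAll("[^\\p{L}]+", " ");
    }

    public static void main(String[] args) {
        String sample = "caf\u00e9 au lait"; // "café au lait"
        System.out.println(asciiOnly(sample));      // the accented e is replaced
        System.out.println(unicodeLetters(sample)); // the accented e is kept
    }
}
```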
Use the regexp /[^a-zA-Z]/, which means everything that is not in the a-z/A-Z character ranges.
In Ruby I would do:
"contains(reserved[j]))".gsub(/[^a-zA-Z]/, " ")
=> "contains reserved j "
In Java it should be something like:
import java.util.regex.*;
...
String inputStr = "contains(reserved[j])){";
String patternStr = "[^a-zA-Z]";
String replacementStr = " ";
// Compile regular expression
Pattern pattern = Pattern.compile(patternStr);
// Replace all occurrences of pattern in input
Matcher matcher = pattern.matcher(inputStr);
String output = matcher.replaceAll(replacementStr);
