I am relatively new to Java and I need some help to extract multiple substrings from a string. An example of a string is as given below:
String = "How/WRB can/MD I/PRP find/VB a/DT list/NN of/IN celebrities/NNS '/POS real/JJ names/NNS ?/."
Desired result: WRB MD PRP VB DT NN IN NNS POS JJ NNS
I have a text file with possibly thousands of similar POS-tagged lines that I need to extract the POS tags from and do some calculation based on the POS tags.
I have tried using tokenizer but didn't really get the result I wanted. I even tried using split() and saving to arrays because I need to store it and use it later and that still didn't work.
Lastly, I tried using Pattern and Matcher, but I am having problems with the regex: it returns the tag together with the leading forward slash.
Regex: [\/](.*?)\s\b
Result: /WRB /MD ....
If there's a better way to do this, please let me know. Otherwise, can anyone help me figure out what's wrong with my regex?
This should work:
String string = "How/WRB can/MD I/PRP find/VB a/DT list/NN of/IN celebrities/NNS '/POS real/JJ names/NNS ?/.";
System.out.println(string.replaceAll("[^/]+/([^ ]+ ?)", "$1"));
Prints: WRB MD PRP VB DT NN IN NNS POS JJ NNS .
If you still want to use pattern matching, look at positive lookbehinds. A lookbehind lets you match a word that is preceded by a slash without including the slash itself in the match.
An example would be something like this:
(?<=/).+?(?= |$)
This matches anything that is preceded by a slash and followed by a space OR the end of the string; neither the slash nor the space is part of the match.
Here is a working example written in Java:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.LinkedList;
public class SO {
public static void main(String[] args) {
String string = "How/WRB can/MD I/PRP find/VB a/DT list/NN of/IN celebrities/NNS '/POS real/JJ names/NNS ?/.";
Pattern pattern = Pattern.compile("(?<=/).+?(?= |$)");
Matcher matcher = pattern.matcher(string);
LinkedList<String> list = new LinkedList<String>();
// Loop through and find all matches and store them into the List
while(matcher.find()) {
list.add(matcher.group());
}
// Print out the contents of this List
for(String match : list) {
System.out.println(match);
}
}
}
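The original attempt returned the slash because matcher.group() returns the whole match. A capturing group is another way around that, without lookbehind. This is only a minimal sketch, assuming that a tag never contains whitespace; the class and method names (PosTagExtractor, extractTags) are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PosTagExtractor {
    // A slash followed by the tag; only the tag is inside the capturing group.
    private static final Pattern TAG = Pattern.compile("/(\\S+)");

    static List<String> extractTags(String line) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = TAG.matcher(line);
        while (matcher.find()) {
            // group(1) is the tag without the slash; group() would include the slash
            tags.add(matcher.group(1));
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(extractTags("How/WRB can/MD I/PRP find/VB"));
        // prints [WRB, MD, PRP, VB]
    }
}
```

Note that the final token of the example sentence, ?/., would produce the tag "." with this pattern, so filter it out if punctuation tags are not wanted.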
String string = "How/WRB can/MD I/PRP find/VB a/DT list/NN of/IN celebrities/NNS '/POS real/JJ names/NNS ?/.";
string = string.replaceAll("\\S+/", "").replace(".", "");
System.out.println(string);
What about str = str.replaceAll("\\S+/", "")? It removes every run of non-whitespace characters that is followed by a slash, together with the slash.
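Since the question mentions that split() did not work out and that the tags are needed for later calculations, here is a rough sketch of a split-based variant that also counts how often each tag occurs. The counting step is only an assumed example of such a calculation, and PosTagCounter and countTags are hypothetical names:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PosTagCounter {
    static Map<String, Integer> countTags(String line) {
        // LinkedHashMap keeps the tags in order of first appearance
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : line.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash < 0) {
                continue; // token has no tag, skip it
            }
            String tag = token.substring(slash + 1);
            counts.merge(tag, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countTags("a/DT list/NN of/IN celebrities/NNS real/JJ names/NNS"));
        // prints {DT=1, NN=1, IN=1, NNS=2, JJ=1}
    }
}
```

Using lastIndexOf means a word that itself contains a slash still yields only the part after the last slash as its tag.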
At the moment I have: text.split("[^\\w+]")
But I also need to include words like: Can't but not something like: 'HEART'
I can't find a solution that splits a text into words, including letters, numbers and the apostrophe if it is between other letters. Thanks
If you want to match words using \w, instead of using split you can use word boundaries and assert not ' at the left and at the right.
\b(?<!')\w+(?:'\w+)*\b(?!')
In Java
String regex = "\\b(?<!')\\w+(?:'\\w+)*\\b(?!')";
String string = "Can't but not something like: 'HEART'";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println(matcher.group(0));
}
Output
Can't
but
not
something
like
It may be simpler to get rid of the single quotes/apostrophes when they occur before/after the word, and then split using the initial delimiter pattern with excluded apostrophe:
String text = "Modern Talking's Hit: 'You're my heart, you're my soul', 1985";
String[] words = text.replaceAll("(?:^|\\W)'|'(?:\\W|$)", "").split("[^\\w^']+");
System.out.println(Arrays.toString(words));
Output:
[Modern, Talking's, Hit, You're, my, heart, you're, my, soul, 1985]
Instead of splitting, you could use the Pattern and MatchResult classes to list the words you want with the regex \w+('\w+)? (Matcher.results() requires Java 9 or later):
import java.util.regex.Pattern;
import java.util.regex.MatchResult;
String regex = "\\w+('\\w+)?";
String text = "sampl'e 'text'";
String[] words = Pattern.compile(regex)
.matcher(text)
.results()
.map(MatchResult::group)
.toArray(String[]::new);
You could also split for a whitespace surrounded (or not) by apostrophes
text.split("'?\\s'?");
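For reference, the stream-based approach above can be wrapped into a small runnable class. This is just a sketch, assuming Java 9 or later for Matcher.results(); WordMatcher and words are illustrative names:

```java
import java.util.Arrays;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

class WordMatcher {
    // A run of word characters, optionally followed by an apostrophe and more word characters
    private static final Pattern WORD = Pattern.compile("\\w+(?:'\\w+)?");

    static String[] words(String text) {
        return WORD.matcher(text)
                .results()                 // Stream<MatchResult>, Java 9+
                .map(MatchResult::group)
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("sampl'e 'text'")));
        // prints [sampl'e, text]
    }
}
```

Unlike the word-boundary pattern with lookarounds, this keeps a quoted word such as 'HEART' and merely drops its surrounding quotes.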
I want to split a sentence having spaces or any special character into an array of words with spaces or special character also an element of array.
Sentence like:
aman,amit and sumit went to top-up
should be split into an array of String:
{"aman",",","amit"," ","and"," ","sumit"," ","went"," ","to"," ","top","-","up"}
Please suggest any regex or logic to split the same using java.
I missed one thing in my question. I also need to split on numeric characters. But using split("\\b") does not split a string like
abc12def
into
{"abc","12","def"} or {"abc","1","2","def"}
It seems all you need is to match either word characters (\w+) or non-word ones (\W+). Combine these with an alternation operator and - perhaps - add a Pattern.UNICODE_CHARACTER_CLASS (or its inline/embedded version (?U)) to make the pattern Unicode-aware:
String value = "aman,amit and sumit went to top-up";
String pattern = "(?U)\\w+|\\W+";
List<String> lst = new ArrayList<>();
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(value);
while (m.find())
lst.add(m.group(0));
System.out.println(lst);
See the Java demo
I hope the below code snippet helps you solve this.
public static void main(final String[] args) {
String message = "aman,amit and sumit went to top-up";
String[] messages = message.split("\\b");
for(String string : messages) {
System.out.println(string);
}
}
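Neither \w+|\W+ nor split("\\b") separates digits from letters, because \w covers both. One possible way to handle the abc12def follow-up is to give digits their own alternative. This is a sketch rather than a definitive answer; the Tokenizer class and tokenize method are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Tokenizer {
    // Alternatives, tried in order:
    //   \d+        a run of digits
    //   [^\W\d]+   a run of word characters that are not digits (letters, underscore)
    //   \W+        a run of non-word characters (spaces, punctuation)
    private static final Pattern TOKEN = Pattern.compile("\\d+|[^\\W\\d]+|\\W+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("abc12def")); // prints [abc, 12, def]
        System.out.println(tokenize("top-up"));   // prints [top, -, up]
    }
}
```

To get one token per digit ({"abc","1","2","def"}), replace \d+ with \d in the pattern.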
Maybe someone could help me. I'm trying to include a regex in my Java code that matches every part of a string except ZZ78. I'd like to know what is missing in the regex I have.
The input string is str = "ab57cdZZ78efghZZ7ij#klmZZ78noCODpqrZZ78stuvw27z#xyzZZ78"
and I'm trying this regex: (?:(?![ZZ78]).)* but if you test it against the string
at http://regexpal.com/, you'll see that it does not work completely.
String str = "ab57cdZZ78efghZZ7ij#klmZZ78noCODpqrZZ78stuvw27z#xyzZZ78";
Pattern pattern = Pattern.compile("(?:(?![ZZ78]).)*");
the matched strings should be
ab57cd
efghZZ7ij#klm
noCODpqr
stuvw27z#xyz
Update:
Hello Avinash Raj and Chthonic Project. Thanks so much for your help and solutions provided.
I originally thought of the split method, but I was trying to avoid getting empty strings in the result
when, for example, the delimiter string is at the beginning or at the end of the main string.
Then I thought that a regex could help me extract everything except "ZZ78", avoiding
empty results in the output.
Below I show the code using the split method (Chthonic's) and the regex (Avinash's). Both produce empty
strings if the commented-out "if()" conditions are not used.
Is using those "if()" conditions the only way to avoid printing empty strings, or could the regex
be tweaked a little so that it matches only non-empty strings?
This is the code I have tested so far:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTest {
public static void main(String[] args) {
System.out.println("########### Matches with Split ###########");
String str = "ZZ78ab57cdZZ78efghZZ7ij#klmZZ78noCODpqrZZ78stuvw27z#xyzZZ78";
for (String s : str.split("ZZ78")) {
//if ( !s.isEmpty() ) {
System.out.println("This is a match <<" + s + ">>");
//}
}
System.out.println("##########################################");
System.out.println("########### Matches with Regex ###########");
String s = "ZZ78ab57cdZZ78efghZZ7ij#klmZZ78noCODpqrZZ78stuvw27z#xyzZZ78";
Pattern regex = Pattern.compile("((?:(?!ZZ78).)*)(ZZ78|$)");
Matcher matcher = regex.matcher(s);
while(matcher.find()){
//if ( !matcher.group(1).isEmpty() ) {
System.out.println("This is a match <<" + matcher.group(1) + ">>");
//}
}
}
}
**and the output (without using the "if()" conditions):**
########### Matches with Split ###########
This is a match <<>>
This is a match <<ab57cd>>
This is a match <<efghZZ7ij#klm>>
This is a match <<noCODpqr>>
This is a match <<stuvw27z#xyz>>
##########################################
########### Matches with Regex ###########
This is a match <<>>
This is a match <<ab57cd>>
This is a match <<efghZZ7ij#klm>>
This is a match <<noCODpqr>>
This is a match <<stuvw27z#xyz>>
This is a match <<>>
Thanks for help so far.
Thanks in advance
Update #2:
Excellent both of your answers and solutions. Now it works very nice. This is the final code I've tested with both solutions.
Many thanks again.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTest {
public static void main(String[] args) {
System.out.println("########### Matches with Split ###########");
String str = "ZZ78ab57cdZZ78efghZZ7ij#klmZZ78noCODpqrZZ78stuvw27z#xyzZZ78";
Arrays.stream(str.split("ZZ78")).filter(s -> !s.isEmpty()).forEach(System.out::println);
System.out.println("##########################################");
System.out.println("########### Matches with Regex ###########");
String s = "ZZ78ab57cdZZ78efghZZ7ij#klmZZ78noCODpqrZZ78stuvw27z#xyzZZ78";
Pattern regex = Pattern.compile("((?:(?!ZZ78).)*)(ZZ78|$)");
Matcher matcher = regex.matcher(s);
ArrayList<String> allMatches = new ArrayList<String>();
ArrayList<String> list = new ArrayList<String>();
while(matcher.find()){
allMatches.add(matcher.group(1));
}
for (String s1 : allMatches)
if (!s1.equals(""))
list.add(s1);
System.out.println(list);
}
}
And output:
########### Matches with Split ###########
ab57cd
efghZZ7ij#klm
noCODpqr
stuvw27z#xyz
##########################################
########### Matches with Regex ###########
[ab57cd, efghZZ7ij#klm, noCODpqr, stuvw27z#xyz]
The easiest way to do this is as follows:
public static void main(String[] args) {
String str = "ab57cdZZ78efghZZ7ij#klmZZ78noCODpqrZZ78stuvw27z#xyzZZ78";
for (String s : str.split("ZZ78"))
System.out.println(s);
}
The output, as expected, is:
ab57cd
efghZZ7ij#klm
noCODpqr
stuvw27z#xyz
If the pattern used to split the string is at the beginning (i.e. "ZZ78" in your example code), the first element returned will be an empty string, as you have already noted. To avoid that, all you need to do is filter the array. This is essentially the same as putting an if, but you can avoid the extra condition line this way. I would do this as follows (in Java 8):
String test_str = ...; // whatever string you want to test it with
Arrays.stream(test_str.split("ZZ78")).filter(s -> !s.isEmpty()).forEach(System.out::println);
You need to remove the character class, since [ZZ78] matches a single character from the given list (Z, 7 or 8) rather than the sequence ZZ78. But (?:(?!ZZ78).)* alone won't give the matches you want either. Consider ab57cdZZ78 as the input string. First, (?:(?!ZZ78).)* matches ab57cd. At the following Z, the condition (?!ZZ78), which means "the text at this position must not start with ZZ78", fails, so the match ends there. The regex engine then moves on to the second Z and checks (?!ZZ78) again. Because the text at the second Z is Z78, not ZZ78, the condition holds and Z78 gets matched, which is not what you want. Consuming the delimiter as part of the match avoids this:
String s = "ab57cdZZ78efghZZ7ij#klmZZ78noCODpqrZZ78stuvw27z#xyzZZ78";
Pattern regex = Pattern.compile("((?:(?!ZZ78).)*)(ZZ78|$)");
Matcher matcher = regex.matcher(s);
while(matcher.find()){
System.out.println(matcher.group(1));
}
Output:
ab57cd
efghZZ7ij#klm
noCODpqr
stuvw27z#xyz
Explanation:
((?:(?!ZZ78).)*) Captures zero or more characters into group 1, stopping before the next ZZ78.
(ZZ78|$) Captures the following ZZ78, or matches at the end of the line, as group 2.
Group 1 therefore contains the text between the ZZ78 delimiters (possibly empty).
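To see the stray match that the tempered pattern produces when the delimiter is not consumed, the following small sketch collects every match of (?:(?!ZZ78).)* on ab57cdZZ78. The class and method names are made up for this illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemperedDotDemo {
    // Collects the full text of every match, including empty ones
    static List<String> allMatches(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(allMatches("(?:(?!ZZ78).)*", "ab57cdZZ78"));
        // prints [ab57cd, , Z78, ]
    }
}
```

The empty entries are zero-length matches at the first Z and at the end of the input; Z78 is the unwanted match that starts at the second Z.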
Update:
String s = "ZZ78ab57cdZZ78efghZZ7ij#klmZZ78noCODpqrZZ78stuvw27z#xyzZZ78";
Pattern regex = Pattern.compile("((?:(?!ZZ78).)*)(ZZ78|$)");
Matcher matcher = regex.matcher(s);
ArrayList<String> allMatches = new ArrayList<String>();
ArrayList<String> list = new ArrayList<String>();
while(matcher.find()){
allMatches.add(matcher.group(1));
}
for (String s1 : allMatches)
if (!s1.equals(""))
list.add(s1);
System.out.println(list);
Output:
[ab57cd, efghZZ7ij#klm, noCODpqr, stuvw27z#xyz]
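On the question of whether the regex itself can be tweaked so that it never yields empty strings: one possible sketch is to require at least one character (+ instead of *) and to use a lookbehind so that a match may only start at the beginning of the input or right after a delimiter. This is an alternative to filtering, not the approach from the answers above; NonEmptySegments and segments are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonEmptySegments {
    // (?<=^|ZZ78)      the match must start at the input start or right after ZZ78
    // (?:(?!ZZ78).)+   one or more characters, stopping before the next ZZ78
    private static final Pattern SEGMENT = Pattern.compile("(?<=^|ZZ78)(?:(?!ZZ78).)+");

    static List<String> segments(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = SEGMENT.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "ZZ78ab57cdZZ78efghZZ7ij#klmZZ78noCODpqrZZ78stuvw27z#xyzZZ78";
        System.out.println(segments(s));
        // prints [ab57cd, efghZZ7ij#klm, noCODpqr, stuvw27z#xyz]
    }
}
```

The lookbehind is what prevents the stray Z78 match: without it, a match could begin in the middle of a delimiter.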
I'm looking for the regex pattern, not the Java code, to match the last word in an English (or European language) sentence. If the last word is, in this case, "hi" then I want to match "hi" and not "hi."
The regex (\w+)\.$ will match "hi.", whereas the output should be just "hi". What's the correct regex?
thufir#dur:~/NetBeansProjects/regex$
thufir#dur:~/NetBeansProjects/regex$ java -jar dist/regex.jar
trying
a b cd efg hi
matches:
hi
trying
a b cd efg hi.
matches:
thufir#dur:~/NetBeansProjects/regex$
code:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String matchesLastWordFine = "a b cd efg hi";
lastWord(matchesLastWordFine);
String noMatchFound = matchesLastWordFine + ".";
lastWord(noMatchFound);
}
private static void lastWord(String sentence) {
System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
Pattern pattern = Pattern.compile("(\\w+)$");
Matcher matcher = pattern.matcher(sentence);
String match = null;
while (matcher.find()) {
match = matcher.group();
System.out.println(match);
}
}
}
My code is in Java, but that's neither here nor there. I'm strictly looking for the regex, not the Java code. (Yes, I know it's possible to strip out the last character with Java.)
What regex should I put in the pattern?
You can use a lookahead assertion. For example, to match the sentence without the period:
[\w\s]+(?=\.)
and, for just the last word (the word before the "."):
[\w]+(?=\.)
If you need to have the whole match be the last word you can use lookahead.
\w+(?=(\.))
This matches a set of word characters that are followed by a period, without matching the period.
If you want the last word in the line, regardless of whether the line ends with the end of a sentence or not, you can use:
\w+(?=(\.?$))
Or if you want to also include ,!;: etc then
\w+(?=(\p{Punct}?$))
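Although the question asks only for the pattern, a quick way to check the lookahead variant against both test sentences is a small harness like the one below. It is a sketch with made-up names (LastWord, lastWord) and drops the capturing parentheses inside the lookahead, since they are not needed here:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastWord {
    // Word characters that are followed by optional punctuation and the end of the input
    private static final Pattern LAST_WORD = Pattern.compile("\\w+(?=\\p{Punct}?$)");

    // Returns the last word, or null when the input does not end in a word
    // followed by at most one punctuation character
    static String lastWord(String sentence) {
        Matcher m = LAST_WORD.matcher(sentence);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(lastWord("a b cd efg hi"));  // prints hi
        System.out.println(lastWord("a b cd efg hi.")); // prints hi
    }
}
```

Because the punctuation sits inside the lookahead, matcher.group() is already the bare word and no capturing group is needed.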
You can use matcher.group(1) to get the content of the first capturing group ((\w+) in your case). To say a little more, matcher.group(0) returns the full match. So your regex is almost correct. One possible improvement concerns your use of $, which matches at the end of the line. Use it only if your sentence fills the line exactly!
With the regular expression (\w+)\p{Punct} you get a group count of 1: matcher.group(0) holds the whole match including the punctuation, and matcher.group(1) holds the word without it.
To write the regular expression in Java, use: "(\\w+)\\p{Punct}"
To test your regular expressions online with Java (and actually a lot of other languages) see RegexPlanet
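The difference between group(0) and group(1) can be checked with a few lines of Java. This is only an illustrative sketch; GroupDemo and fullAndWord are hypothetical names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns { whole match, captured word } for the first word followed by punctuation,
    // or null when there is no such word
    static String[] fullAndWord(String sentence) {
        Matcher m = Pattern.compile("(\\w+)\\p{Punct}").matcher(sentence);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] groups = fullAndWord("a b cd efg hi.");
        System.out.println(groups[0]); // prints hi.
        System.out.println(groups[1]); // prints hi
    }
}
```

Keep in mind that without the $ anchor, find() returns the first word that is followed by punctuation, which is not necessarily the last word of the sentence.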
With the $ anchor you will only get a match at the end of a line. So if you have multiple sentences on one line, you will not get a match for the ones in the middle.
So you should just use:
(\w+)\.
the capture group will give the correct match.
You can see an example here
I don't understand why really, but this works:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String matchesLastWordFine = "a b cd efg hi";
lastWord(matchesLastWordFine);
String noMatchFound = matchesLastWordFine + ".";
lastWord(noMatchFound);
}
private static void lastWord(String sentence) {
System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
Pattern pattern = Pattern.compile("(\\w+)"); //(\w+)\.
Matcher matcher = pattern.matcher(sentence);
String match = null;
while (matcher.find()) {
match = matcher.group();
}
System.out.println(match);
}
}
I guess regex \w+ will match all the words (doh). Then the last word is what I was after. Too simple, really, I was trying to exclude punctuation, but I guess regex does that automagically for you..?
I'd like to get a portion of a matched string coming from a Matcher, like this:
Pattern pat = Pattern.compile("a.*l.*z");
Matcher match = pat.matcher("abcdlmnoz"); // I'd want to get bcd AND mno
ArrayList<String> values = match.magic(); //here is where your magic happens =)
ArrayList<String> is only for this example; I would be happy to receive either a List or individual String items. The best would be what .htaccess files and RewriteRules do:
RewriteRule (.*)/path?(.*) $1/$2/modified-path/
Well, putting those (.*) into $arguments would be as cool as an ArrayList or accessing String separately. I've been looking for something at Java Matcher API, but I didn't happen to see anything useful inside.
Thanks in advance, guys.
You can capture groups in a regexp match using parentheses:
Pattern pat = Pattern.compile("a(.*)l(.*)z");
Matcher match = pat.matcher("abcdlmnoz");
boolean b = match.matches(); // don't forget to attempt the match
Then use match.group(n) to get that portion of the capture. The groups are stored in the match object.
Capturing Groups (Oracle)
Look at the matcher's "group" method and peruse the doc you linked to for references to groups, which is what the parentheses in the regex do :)
...
String testStr = "abcdlmnoz";
String myRE = "a(.*)l(.*)z";
Pattern myRECompiled = Pattern.compile (myRE,
Pattern.DOTALL);
Matcher myMatcher = myRECompiled.matcher (testStr);
myMatcher.find ();
System.out.println (myMatcher.group (1));
System.out.println (myMatcher.group (2));
...
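Putting the pieces together, something close to the magic() call from the question can be sketched as a helper that returns all captured groups as a List. CaptureGroups and captures are hypothetical names, and the replacement string in main is just an assumed example in the spirit of the RewriteRule:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroups {
    // Returns the contents of groups 1..n if the whole input matches, else an empty list
    static List<String> captures(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> groups = new ArrayList<>();
        if (m.matches()) {
            // group(0) is the whole match, so the captured parts start at index 1
            for (int i = 1; i <= m.groupCount(); i++) {
                groups.add(m.group(i));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(captures("a(.*)l(.*)z", "abcdlmnoz"));
        // prints [bcd, mno]

        // $1 and $2 in a replacement string refer to the captured groups
        System.out.println("abcdlmnoz".replaceAll("a(.*)l(.*)z", "$1/$2/modified"));
        // prints bcd/mno/modified
    }
}
```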