I'm trying to tokenize a string input, but I can't get my head around how to do it.
The idea is to split the string into alphabetic words and individual non-alphabetic symbols.
For example, the string "Test, ( abc)" would be split into ["Test", ",", "(", "abc", ")"].
Right now I use this regular expression:
"(?<=[a-zA-Z])(?=[^a-zA-Z])"
but it doesn't do what I want.
Any ideas what else I could use?
I see that you want to group the letters (like Test and abc) but not group the non-alphabetic characters. I also see that you do not want spaces in the result. For this I would use "(\\w+|\\W)" after removing all spaces from the string to match. Note that \w also matches digits and the underscore, not only letters.
Sample code
String str = "Test, ( abc)";
str = str.replaceAll(" ",""); // in case you do not want space as separate char.
Pattern pattern = Pattern.compile("(\\w+|\\W)");
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
System.out.println(matcher.group());
}
Output
Test
,
(
abc
)
I hope this answers your question.
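If the tokens really must be letters only, as the question says, a variant with explicit character classes may be closer to the goal. This is only a minimal sketch: the class name LetterTokenizer and the method tokenize are made up for illustration, and it assumes ASCII letters. Whitespace is skipped by the pattern itself, so the input does not have to be stripped first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class LetterTokenizer {
    // [a-zA-Z]+    : one run of ASCII letters
    // [^a-zA-Z\s]  : exactly one character that is neither a letter nor whitespace
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z]+|[^a-zA-Z\\s]");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Test, ( abc)")); // [Test, ,, (, abc, )]
    }
}
```

Unlike \w, this treats digits and underscores as separate one-character symbols, so "a1_b" comes out as a, 1, _, b.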
Try this:
public static ArrayList<String> res(String a) {
    ArrayList<String> strs = new ArrayList<>();
    // First split on whitespace, then cut each chunk before and after
    // every non-word character, so tokens stay in input order
    // (e.g. "(abc)" gives "(", "abc", ")").
    for (String token : a.split("\\s+")) {
        for (String str : token.split("(?<=\\W)|(?=\\W)")) {
            if (!str.isEmpty()) strs.add(str);
        }
    }
    return strs;
}
Try this:
String s = "I want to walk my dog, and why not?";
Pattern pattern = Pattern.compile("(\\w+|[^\\w\\s])"); // [^\\w\\s] matches one symbol but skips whitespace
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
System.out.println(matcher.group());
}
Outputs:
I
want
to
walk
my
dog
,
and
why
not
?
\w matches word characters ([A-Za-z0-9_]), so \\w+ collects each run of them into one token, and each punctuation character becomes a token of its own.
I guess in the simplest form, split using
"(?<=[a-zA-Z])(?=[^\\sa-zA-Z])|(?<=[^\\sa-zA-Z])(?=[a-zA-Z])|\\s+"
Explained
(?<= [a-zA-Z] ) # Letter behind
(?= [^\sa-zA-Z] ) # not letter/wsp ahead
| # or,
(?<= [^\sa-zA-Z] ) # Not letter/wsp behind
(?= [a-zA-Z] ) # letter ahead
| # or,
\s+ # whitespace (discarded)
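A short sketch of how this split pattern could be used. The class name SplitTokenizer and the trim() call are additions for illustration; trimming avoids an empty first token when the input starts with whitespace.

```java
import java.util.Arrays;

// Hypothetical wrapper around the split pattern from the answer above.
class SplitTokenizer {
    static final String BOUNDARY =
            "(?<=[a-zA-Z])(?=[^\\sa-zA-Z])"     // letter behind, symbol ahead
            + "|(?<=[^\\sa-zA-Z])(?=[a-zA-Z])"  // symbol behind, letter ahead
            + "|\\s+";                          // runs of whitespace are consumed

    static String[] tokenize(String input) {
        return input.trim().split(BOUNDARY);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("Test, ( abc)"))); // [Test, ,, (, abc, )]
    }
}
```

One caveat: the pattern has no boundary between two adjacent symbols, so "a,(b" yields a, ",(" and b. If every symbol has to be its own token, a find() loop with a one-symbol alternative is probably the easier route.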
I am trying to get all the matching groups in my string.
My regular expression is "(?<!')/|/(?!')". I am trying to split the string using Pattern and Matcher. The string needs to be split on /, but a / surrounded by single quotes ('/') must be skipped. For example, "One/Two/Three'/'3/Four" needs to be split as ["One", "Two", "Three'/'3", "Four"], but without using the .split method.
I am currently using the code below:
// String to be scanned to find the pattern.
String line = "Test1/Test2/Tt";
String pattern = "(?<!')/|/(?!')";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher m = r.matcher(line);
if (m.matches()) {
System.out.println("Found value: " + m.group(0) );
} else {
System.out.println("NO MATCH");
}
But it always says "NO MATCH". What am I doing wrong, and how do I fix it?
Thanks in advance
To get the matches without using split, you might use
[^'/]+(?:'/'[^'/]*)*
Explanation
[^'/]+ Match 1+ times any char except ' or /
(?: Non capture group
'/'[^'/]* Match '/' followed by optionally matching any char except ' or /
)* Close group and optionally repeat it
String regex = "[^'/]+(?:'/'[^'/]*)*";
String string = "One/Two/Three'/'3/Four";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println(matcher.group(0));
}
Output
One
Two
Three'/'3
Four
Edit
If you do not want to split, you might also use a pattern that matches everything except /, and allows a / only when it is surrounded by single quotes:
[^/]+(?:(?<=')/(?=')[^/]*)*
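As for why the code in the question always prints "NO MATCH": Matcher.matches() only succeeds when the whole input matches the pattern, whereas find() looks for the next matching substring. A minimal sketch of the difference, together with the first pattern above collected into a list. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class SlashFields {
    // Separator pattern from the question.
    private static final Pattern SEPARATOR = Pattern.compile("(?<!')/|/(?!')");
    // Field pattern from the answer: text without ' or /, optionally followed by '/' groups.
    private static final Pattern FIELD = Pattern.compile("[^'/]+(?:'/'[^'/]*)*");

    // true only if the entire line is a single separator
    static boolean isWholeMatch(String line) {
        return SEPARATOR.matcher(line).matches();
    }

    // true if a separator occurs anywhere in the line
    static boolean containsSeparator(String line) {
        return SEPARATOR.matcher(line).find();
    }

    static List<String> fields(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(isWholeMatch("Test1/Test2/Tt"));      // false
        System.out.println(containsSeparator("Test1/Test2/Tt")); // true
        System.out.println(fields("One/Two/Three'/'3/Four"));    // [One, Two, Three'/'3, Four]
    }
}
```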
Try this.
String line = "One/Two/Three'/'3/Four";
Pattern pattern = Pattern.compile("('/'|[^/])+");
Matcher m = pattern.matcher(line);
while (m.find())
System.out.println(m.group());
output:
One
Two
Three'/'3
Four
Here is a simple pattern matching all the desired /, so you can split on them:
(?<=[^'])\/(?=')|(?<=')\/(?=[^'])|(?<=[^'])\/(?=[^'])
The logic is as follows: there are 4 cases:
1. / is surrounded by ', i.e. '/'
2. / is preceded by ', i.e. '/
3. / is followed by ', i.e. /'
4. / is surrounded by characters other than '
You want to exclude only case 1. So we need a regex for the other three cases; I have written three similar regexes and joined them with alternation.
Explanation of the first part (the other two are analogous):
(?<=[^']) - positive lookbehind, asserts that what precedes is different from ' (negated character class [^'])
\/ - matches / literally
(?=') - positive lookahead, asserts that what follows is '
Try something like this:
String line = "One/Two/Three'/'3/Four";
String pattern = "([^/]+'/'\\d)|[^/]+"; // the backslash must be doubled inside a Java string literal
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(line);
boolean found = false;
while(m.find()) {
System.out.println("Found value: " + m.group() );
found = true;
}
if(!found) {
System.out.println("NO MATCH");
}
Output:
Found value: One
Found value: Two
Found value: Three'/'3
Found value: Four
Below is my Java code to delete all pairs of adjacent matching letters, but I am having problems with the Java Matcher class.
My Approach
I am trying to find all runs of successive repeated characters in the input, e.g.
aaa, bb, ccc, ddd
Next, replace each odd-length match with a single occurrence of the character and each even-length match with "", i.e.
aaa -> a
bb -> ""
ccc -> c
ddd -> d
s has a single occurrence, so it's not matched by the regex pattern and is excluded from the substitution.
I am calling Matcher.appendReplacement to do conditional replacement of the patterns matched in input, based on the group length (even or odd).
Code:
public static void main(String[] args) {
String s = "aaabbcccddds";
int i=0;
StringBuffer output = new StringBuffer();
Pattern repeatedChars = Pattern.compile("([a-z])\\1+");
Matcher m = repeatedChars.matcher(s);
while(m.find()) {
if(m.group(i).length()%2==0)
m.appendReplacement(output, "");
else
m.appendReplacement(output, "$1");
i++;
}
m.appendTail(output);
System.out.println(output);
}
Input : aaabbcccddds
Actual Output : aaabbcccds (only replacing ddd with d but skipping aaa, bb and ccc)
Expected Output : acds
This can be done in a single replaceAll call like this:
String repl = str.replaceAll( "(?:(.)\\1)+", "" );
The regex (?:(.)\\1)+ matches every run of adjacent identical pairs and replaces it with the empty string, which leaves one character from each run of odd length.
Code using Pattern and Matcher:
final Pattern p = Pattern.compile( "(?:(.)\\1)+" );
Matcher m = p.matcher( "aaabbcccddds" );
String repl = m.replaceAll( "" );
//=> acds
You can try it like this:
public static void main(String[] args) {
String s = "aaabbcccddds";
StringBuffer output = new StringBuffer();
Pattern repeatedChars = Pattern.compile("(\\w)(\\1+)");
Matcher m = repeatedChars.matcher(s);
while(m.find()) {
if(m.group(2).length()%2!=0)
m.appendReplacement(output, "");
else
m.appendReplacement(output, "$1");
}
m.appendTail(output);
System.out.println(output);
}
It is similar to yours, but when you take just the first group you get only the first character, so its length is always 1. That's why I introduce a second group, which holds the remaining adjacent repeated characters. Since its length is one less than the length of the whole run, I reverse the odd/even logic and voila -
acds
is printed.
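The loop from the question can also be kept almost as it is: calling m.group() (group 0, the whole run) instead of m.group(i) gives the length that the odd/even test needs, so the counter i becomes unnecessary. A minimal sketch, with a made-up class name RunCollapser:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; the loop mirrors the one in the question.
class RunCollapser {
    private static final Pattern REPEATED = Pattern.compile("([a-z])\\1+");

    static String collapse(String s) {
        Matcher m = REPEATED.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // m.group() is the whole run, e.g. "aaa"; group(1) is its first character.
            String replacement = (m.group().length() % 2 == 0) ? "" : "$1";
            m.appendReplacement(out, replacement);
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapse("aaabbcccddds")); // acds
    }
}
```

Each run is handled on its own in a single pass, so pairs that only become adjacent after a removal are not collapsed again: "abba" becomes "aa", not the empty string.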
You don't need multiple if statements. Try:
(?:(\\w)(?:\\1\\1)+|(\\w)\\2+)(?!\\1|\\2)
Replace with $1
Java code:
str.replaceAll("(?:(\\w)(?:\\1\\1)+|(\\w)\\2+)(?!\\1|\\2)", "$1");
Regex breakdown:
(?: Start of non-capturing group
(\\w) Capture a word character
(?:\\1\\1)+ Match one or more further pairs of the same character (an odd-length run in total)
| Or
(\\w) Capture a word character
\\2+ Match one or more repetitions of the same character
) End of non-capturing group
(?!\\1|\\2) Not followed by the captured character
Using Pattern and Matcher with StringBuffer:
StringBuffer output = new StringBuffer();
Pattern repeatedChars = Pattern.compile("(?:(\\w)(?:\\1\\1)+|(\\w)\\2+)(?!\\1|\\2)");
Matcher m = repeatedChars.matcher(s);
while(m.find()) m.appendReplacement(output, "$1");
m.appendTail(output);
System.out.println(output);
I need help with a simple task in java. I have the following sentence:
Where Are You [Employee Name]?
your have a [Shift] shift..
I need to extract the strings that are surrounded by [ and ] signs.
I was thinking of using the split method with " " as the parameter and then finding the single words, but that breaks if the phrase I'm looking for contains " ". Using indexOf might be an option as well, only I don't know how to tell that I have reached the end of the String.
What is the best way to perform this task?
Any help would be appreciated.
Try with regex \[(.*?)\] to match the words.
\[: escaped [ for literal match as it is a meta char.
(.*?) : match everything in a non-greedy way.
Sample code:
Pattern p = Pattern.compile("\\[(.*?)\\]");
Matcher m = p.matcher("Where Are You [Employee Name]? your have a [Shift] shift.");
while(m.find()) {
System.out.println(m.group(1)); // group(1) is the text between the brackets; m.group() would include the [ and ]
}
Here is a Java regular expression that extracts the text between two brackets, including white space:
import java.util.regex.*;
class Main
{
public static void main(String[] args)
{
String txt="[ Employee Name ]";
String re1=".*?";
String re2="( )";
String re3="((?:[a-z][a-z]+))"; // Word 1
String re4="( )";
String re5="((?:[a-z][a-z]+))"; // Word 2
String re6="( )";
Pattern p = Pattern.compile(re1+re2+re3+re4+re5+re6,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher m = p.matcher(txt);
if (m.find())
{
String ws1=m.group(1);
String word1=m.group(2);
String ws2=m.group(3);
String word2=m.group(4);
String ws3=m.group(5);
System.out.print("("+ws1.toString()+")"+"("+word1.toString()+")"+"("+ws2.toString()+")"+"("+word2.toString()+")"+"("+ws3.toString()+")"+"\n");
}
}
}
If you want to ignore the white space, remove the "( )" groups.
This is a Scanner-based solution:
Scanner sc = new Scanner("Where Are You [Employee Name]? your have a [Shift] shift..");
for (String s; (s = sc.findWithinHorizon("(?<=\\[).*?(?=\\])", 0)) != null;) {
System.out.println(s);
}
output
Employee Name
Shift
Use a StringBuilder (I assume you don't need synchronization).
As you suggested, indexOf() with your square-bracket delimiters will give you a starting index and an ending index. Use substring(startIndex + 1, endIndex) to get exactly the string you want; the end index of substring is exclusive, so no - 1 is needed.
I'm not sure what you meant by the end of the String, but indexOf("[") is the start and indexOf("]") is the end; indexOf returns -1 when there is no further occurrence.
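A minimal sketch of that indexOf approach. The class name BracketExtractor and the method extract are made up, and it assumes brackets are not nested. The loop stops when indexOf returns -1, which is the "end of the String" signal.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; assumes brackets are not nested.
class BracketExtractor {
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        int open = text.indexOf('[');
        while (open >= 0) {                      // -1 means no further '['
            int close = text.indexOf(']', open + 1);
            if (close < 0) {
                break;                           // unmatched '[': stop
            }
            // substring's end index is exclusive, so this drops both brackets
            found.add(text.substring(open + 1, close));
            open = text.indexOf('[', close + 1);
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "Where Are You [Employee Name]? your have a [Shift] shift..";
        System.out.println(extract(s)); // [Employee Name, Shift]
    }
}
```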
That's pretty much the use case for a regular expression.
Try "(\\[[\\w ]*\\])" as your expression.
Pattern p = Pattern.compile("(\\[[\\w ]*\\])");
Matcher m = p.matcher("Where Are You [Employee Name]? your have a [Shift] shift..");
if (m.find()) {
String found = m.group();
}
What does this expression do?
First it defines a group (...)
Then it defines the starting point for that group. \[ matches a literal [; since [ is a metacharacter in regular expressions, it has to be escaped with \, which is itself reserved in Java strings and has to be escaped with another \.
Then it defines the body of the group, [\w ]*. Here the character class [] contains \w (any letter, digit or underscore) and a blank. The * means zero or more of the preceding character class.
Then it defines the endpoint of the group, \]
and closes the group )
I have a string like this:
abc:def,ghi,jkl;mno:pqr,stu;vwx:yza,aaa,bbb;
I want to split first on ; and then on :
Finally, the output should contain only the part to the right of each :, i.e. my output should be
def, ghi, jkl, pqr, stu, yza, aaa, bbb
This can be done by splitting twice, i.e. once on ; and then on :, and then pattern matching to find just the part to the right of the :. However, is there a better, more efficient way to achieve this?
So basically you want to fetch the content between : and ;, with : on the left and ; on the right.
You can use this regex: -
"(?<=:)(.*?)(?=;)"
This contains a look-behind for : and a look-ahead for ;, so it matches the string preceded by a colon (:) and followed by a semicolon (;).
Regex Explanation: -
(?<= // Look behind assertion.
: // Check for preceding colon (:)
)
( // Capture group 1
. // Any character except newline
* // repeated 0 or more times
? // Reluctant matching. (Match shortest possible string)
)
(?= // Look ahead assertion
; // Check for string followed by `semi-colon (;)`
)
Here's the working code: -
String str = "abc:def,ghi,jkl;mno:pqr,stu;vwx:yza,aaa,bbb;";
Matcher matcher = Pattern.compile("(?<=:)(.*?)(?=;)").matcher(str);
StringBuilder builder = new StringBuilder();
while (matcher.find()) {
builder.append(matcher.group(1)).append(", ");
}
System.out.println(builder.substring(0, builder.lastIndexOf(",")));
OUTPUT: -
def,ghi,jkl, pqr,stu, yza,aaa,bbb
String[] tabS="abc:def,ghi,jkl;mno:pqr,stu;vwx:yza,aaa,bbb;".split(";");
StringBuilder sb = new StringBuilder();
Pattern patt = Pattern.compile("(.*:)(.*)");
String sep = ",";
for (String string : tabS) {
sb.append(patt.matcher(string).replaceAll("$2 ")); // ' ' after $2 == ';' replaced
sb.append(sep);
}
System.out.println(sb.substring(0,sb.lastIndexOf(sep)));
output
def,ghi,jkl ,pqr,stu ,yza,aaa,bbb
Don't pattern match in Java unless you have to; if the ':' character cannot appear in the field name (abc in your example), you can use indexOf(":") to figure out the "right part".
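A rough sketch of what that could look like, assuming each record contains at most one relevant ':' and records are separated by ';'. The class name FieldValues and the method rightParts are hypothetical.

```java
// Hypothetical helper: keeps only the part after the first ':' of each record.
class FieldValues {
    static String rightParts(String input) {
        StringBuilder sb = new StringBuilder();
        // split(";") drops the trailing empty string after the last ';'
        for (String record : input.split(";")) {
            int colon = record.indexOf(':');
            if (colon < 0) {
                continue;                // record without a field name: skip it
            }
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(record.substring(colon + 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(rightParts("abc:def,ghi,jkl;mno:pqr,stu;vwx:yza,aaa,bbb;"));
        // def,ghi,jkl, pqr,stu, yza,aaa,bbb
    }
}
```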
I have a text string that looks as follows:
word word word {{t:word word|word}} word word {{t:word|word}} word word...
I want to extract all strings that start with "{{t" and end with "}}". I don't care about the rest. I don't know in advance the number of words in "{{..|..}}". If the words inside weren't separated by spaces, splitting the text on spaces would work. I'm not sure how to write a regular expression to get this done. I thought about running over the text char by char and storing everything between "{{t:" and "}}", but I would like to know a cleaner way to do the same.
Thank you!
EDIT
Expected output from above:
An array of strings String[] a where a[0] is {{t:word word|word}} and a[1] is {{t:word|word}}.
How about this (using non-greedy matching, so that it doesn't capture ":word word|word}} word word {{t:word|word" as a single group):
String s = "word word word {{t:word word|word}} word word {{t:word|word}} word word";
Pattern p = Pattern.compile("\\{\\{t:(.*?)\\}\\}");
Matcher m = p.matcher(s);
while (m.find()) {
//System.out.println(m.group(1));
System.out.println(m.group());
}
Edit:
changed to m.group() so that results contain delimiters.
Using the java.util.regex package works wonders here:
Pattern p = Pattern.compile("\\{\\{t(.*?)\\}\\}");//escaping + capturing group
Matcher m = p.matcher(str);
Set<String> result = new HashSet<String>();//can also be a list or whatever
while(m.find()){
result.add(m.group(1));
}
The capturing group can also span the entire regex to include the {{ and }}, like so: "(\\{\\{t.*?\\}\\})"
This worked for me:
import java.util.regex.*;
class WordTest {
public static void main( String ... args ) {
String input = "word word word {{t:word word|word}} word word {{t:word|word}} word word...";
Pattern p = Pattern.compile("(\\{\\{.*?\\}\\})");
Matcher m = p.matcher( input );
while( m.find() ) {
System.out.println( m.group(1) );
}
}
}
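To get the String[] the question asks for, the matches can be collected in a list and converted at the end. A minimal sketch, assuming every wanted tag starts with "{{t:"; the class name TemplateTags and the method extract are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes tags look like {{t:...}} and are not nested.
class TemplateTags {
    // .*? is reluctant, so each match ends at the nearest "}}"
    private static final Pattern TAG = Pattern.compile("\\{\\{t:.*?\\}\\}");

    static String[] extract(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            tags.add(m.group()); // whole match, delimiters included
        }
        return tags.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String s = "word word word {{t:word word|word}} word word {{t:word|word}} word word...";
        System.out.println(Arrays.toString(extract(s)));
        // [{{t:word word|word}}, {{t:word|word}}]
    }
}
```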