How to split a string every N words [duplicate] - java

This question already has answers here:
How to split a String by space
(17 answers)
How to split a string array into small chunk arrays in java?
(17 answers)
Splitting at every n-th separator, and keeping the character
(4 answers)
Closed last year.
I want to split one big string into smaller parts, so given for example:
"A B C D E F G H I J K L"
I want to get array (String []): [A,B,C,D], [E,F,G,H], [I,J,K,L]
Is there any regex for that or I need to do that manually so first to split every space and then concat every N words. ??

You can create a regex that describes this pattern.
e.g. "((?:\w+\s*){4})"
Or in simple words:
The \w+\s* part means that there are 1 or multiple word-characters (e.g. text, digits) followed by 0, 1 or multiple whitespace characters.
It is surrounded in braces and followed by {4} to indicate that we want this to occur 4 times.
Finally that again is wrapped in braces, because we want to capture that result.
By contrast the braces which were used to specify {4} are preceded by a (?: ...) prefix, which makes it a "non-capturing-group". We don't want to capture the individual matches just yet.
You can use that pattern in java to extract each chunk of 4 occurrences.
And than next, you can simply split each individual result with a second regex, \s+ ( = whitespace)
Edit
One more thing, you may notice that the first matched group also contains whitespace at the end. You can get rid of that with a more advanced regex: ((?:\w+\s+){3}(?:\w+))\s*

You could use regex for this:
e.g.:
String x = "AAS BASD CAFAS DAFASF EASFASF FAFSASF GA HASF IAS JAS KAS LSA";
ArrayList<String> found = new ArrayList<>();
Pattern pattern = Pattern.compile("(\\w+\\s\\w+\\s\\w+)");
Matcher m = pattern.matcher(x);
while (m.find()) {
String s = m.group();
found.add(s);
}
//if you want to convert your List to an Array
String[] result = found.toArray(new String[0]);
System.out.println(Arrays.toString(result));
Result: [AAS BASD CAFAS DAFASF, EASFASF FAFSASF GA HASF, IAS JAS KAS LSA]
This pattern ("(\\w+\\s\\w+\\s\\w+\\s\\w+)") matches 4 words separated by one space. The loop iterates over every found match and adds it to your result list.

There are multiple ways you can achieve this,
for ex. let your string be
String str = "A B C D E F G H I J K L";
one way to split it would be using regular expression
java.util.Arrays.toString(str.split("(?<=\\G....)"))
here the .... represent how many characters in each string, another way to specify the pattern would be .{4}
another way would be
Iterable<String> strArr = Splitter.fixedLength(3).split(str );
there could be more ways to achieve the same

Related

Split a word by a char in Java [duplicate]

This question already has answers here:
How to split a string, but also keep the delimiters?
(24 answers)
How do I split a string in Java?
(39 answers)
Closed 3 years ago.
Consider the following example. I would like to divide the String into two parts by the char 'T'
// input
String toDivideStr = "RaT15544";
// output
first= "RaT";
second = "15544";
I've tried this:
String[] first = toDivideStr.split("T",0);
Output:
first = "Ra"
second = "15544"
How do I achieve this?
What you need to to, is locate the last "T", then split:
StringToD.substring(StringToD.lastIndexOf("T") + 1)
You could use a positive lookahead to assert a digit and a positive lookbehind to assert RaT.
(?<=RaT)(?=\\d)
For example:
String str = "RaT15544";
for (String element : str.split("(?<=RaT)(?=\\d)"))
System.out.println(element);
Regex demo | Java demo
You can use positive look-ahead with split limit parameter for this. (?=\\d)
With only T in the split method parameter, what happens is the regex engine consumes this T. Hence the two string split that occurs doesn't have T. To avoid consuming the characters, we can use non-consumeing look-ahead.
(?=\\d) - This will match the first number that is encountered but it will not consume this number
public static void main(String[] args) {
String s = "RaT15544";
String[] ss = s.split("(?=\\d)", 2);
System.out.println(ss[0] + " " + ss[1]);
}
The below regex can be used to split the alphabets and numbers separately.
String StringToD = "RaT15544";
String[] parts = StringToD.split("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)");
System.out.println(parts[0]);
System.out.println(parts[1]);

Java split returns white spaces in result

I'm using the function "split" on this string:
p(80,2)
I would like to obtain just the two numbers, so this is what I do:
String[] split = msg.msgContent().split("[p(,)]")
The regex is correct (or at least, I think so) since it splits the two numbers and puts them in the vector "split", but it turns out that this vector has a length of 4, and the first two positions are occupied by white spaces.
In fact, if I print each vector position, this is the result:
Split:
80
2
I've tried adding \\s to the regex to match with white spaces, but since there are none in my string, it didn't work.
You don't need split here, just use a simple regex to extract the digits from your string:
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher(msg.msgContent());
while (m.find()) {
String number = m.group();
// add to array
}
Note that String#split takes a regex, and the regex you passed doesn't match the pattern you're looking for.
You might want to read the documentation of Pattern and Matcher for more information about the solution above.
split accepts a regular expression as parameter, and this is a character class: [p(,)].
Given that your code is splitting on all characters in the class:
"p(80,2)" will return an array {"", "80", "2"}
I know is not very beautiful:
List<String> collect = Pattern.compile("[^\\d]+")
.splitAsStream(s)
.filter(s -> s.length() > 0)
.collect(Collectors.toList());
Since you're splitting on p and (, the first two characters of your string are resulting in splits. I would split on the comma after replacing the p, (, and ). Like this:
String x = "p(80,2)";
String [] y = x.replaceAll("[p()]", "").split(",");
Split it's not really what you need here, but if you want to use it you can do something like that:
"p(80,2)".replace("p(", "").replace(")", "").split(",")
Results with
[80, 2]

Regular Expression for Extracting Operands from Mathematical Expression

No question on SO addresses my particular problem. I know very little about regular expression. I am building an expression parser in Java using Regex Class for that purpose. I want to extract Operands, Arguments, Operators, Symbols and Function Names from expression and then save to ArrayList. Currently I am using this logic
String string = "2!+atan2(3+9,2+3)-2*PI+3/3-9-12%3*sin(9-9)+(2+6/2)" //This is just for testing purpose later on it will be provided by user
List<String> res = new ArrayList<>();
Pattern pattern = Pattern.compile((\\Q^\\E|\\Q/\\E|\\Q-\\E|\\Q-\\E|\\Q+\\E|\\Q*\\E|\\Q)\\E|\\Q)\\E|\\Q(\\E|\\Q(\\E|\\Q%\\E|\\Q!\\E)) //This string was build in a function where operator names were provided. Its mean that user can add custom operators and custom functions
Matcher m = pattern.matcher(string);
int pos = 0;
while (m.find())
{
if (pos != m.start())
{
res.add(string.substring(pos, m.start()))
}
res.add(m.group())
pos = m.end();
}
if (pos != string.length())
{
addToTokens(res, string.substring(pos));
}
for(String s : res)
{
System.out.println(s);
}
Output:
2
!
+
atan2
(
3
+
9
,
2
+
3
)
-
2
*
PI
+
3
/
3
-
9
-
12
%
3
*
sin
(
9
-
9
)
+
(
2
+
6
/
2
)
Problem is that now Expression can contain Matrix with user defined format. I want to treat every Matrix as a Operand or Argument in case of functions.
Input 1:
String input_1 = "2+3-9*[{2+3,2,6},{7,2+3,2+3i}]+9*6"
Output Should be:
2
+
3
-
9
*
[{2+3,2,6},{7,2+3,2+3i}]
+
9
*
6
Input 2:
String input_2 = "{[2,5][9/8,func(2+3)]}+9*8/5"
Output Should be:
{[2,5][9/8,func(2+3)]}
+
9
*
8
/
5
Input 3:
String input_3 = "<[2,9,2.36][2,3,2!]>*<[2,3,9][23+9*8/8,2,3]>"
Output Should be:
<[2,9,2.36][2,3,2!]>
*
<[2,3,9][23+9*8/8,2,3]>
I want that now ArrayList should contain every Operand, Operators, Arguments, Functions and symbols at each index. How can I achieve my desired output using Regular expression. Expression validation is not required.
I think you can try with something like:
(?<matrix>(?:\[[^\]]+\])|(?:<[^>]+>)|(?:\{[^\}]+\}))|(?<function>\w+(?=\())|(\d+[eE][-+]\d+)|(?<operand>\w+)|(?<operator>[-+\/*%])|(?<symbol>.)
DEMO
elements are captured in named capturing groups. If you don't need it, you can use short:
\[[^\]]+\]|<[^>]+>|\{[^\}]+\}|\d+[eE][-+]\d+|\w+(?=\()|\w+|[-+\/*%]|.
The \[[^\]]+\]|<[^>]+>|\{[^\}]+\} match opening bracket ({, [ or <), non clasing bracket characters, and closing bracket (},],>) so if there are no nested same-type brackets, there is no problem.
Implementatin in Java:
public class Test {
public static void main(String[] args) {
String[] expressions = {"2!+atan2(3+9,2+3)-2*PI+3/3-9-12%3*sin(9-9)+(2+6/2)", "2+3-9*[{2+3,2,6},{7,2+3,2+3i}]+9*6",
"{[2,5][9/8,func(2+3)]}+9*8/5","<[2,9,2.36][2,3,2!]>*<[2,3,9][23 + 9 * 8 / 8, 2, 3]>"};
Pattern pattern = Pattern.compile("(?<matrix>(?:\\[[^]]+])|(?:<[^>]+>)|(?:\\{[^}]+}))|(?<function>\\w+(?=\\())|(?<operand>\\w+)|(?<operator>[-+/*%])|(?<symbol>.)");
for(String expression : expressions) {
List<String> elements = new ArrayList<String>();
Matcher matcher = pattern.matcher(expression);
while (matcher.find()) {
elements.add(matcher.group());
}
for (String element : elements) {
System.out.println(element);
}
System.out.println("\n\n\n");
}
}
}
Explanation of alternatives:
\[[^\]]+\]|<[^>]+>|\{[^\}]+\} - match opening bracket of given
type, character which are not closing bracket of that type
(everything byt not closing bracket), and closing bracket of that
type,
\d+[eE][-+]\d+ = digit, followed by e or E, followed by operator +
or -, followed by digits, to capture elements like 2e+3
\w+(?=\() - match one or more word characters (A-Za-z0-9_) if it is
followed by ( for matching functions like sin,
\w+ - match one or more word characters (A-Za-z0-9_) for matching
operands,
[-+\/*%] - match one character from character class, to match
operators
. - match any other character, to match other symbols
Order of alternatives is quite important, as last alternative . will match any character, so it need to be last option. Similar case with \w+(?=\() and \w+, the second one will match everything like previous one, however if you don't wont to distinguish between functions and operands, the \w+ will be enough for all of them.
In longer exemple the part (?<name> ... ) in every alternative, is a named capturing group, and you can see in demo, how it group matched fragments in gorups like: operand, operator, function, etc.
With regular expressions you cannot match any level of nested balanced parentheses.
For example, in your second example {[2,5][9/8,func(2+3)]} you need to match the opening brace with the close brace, but you need to keep track of how many opening and closing inner braces/parens/etc there are. That cannot be done with regular expressions.
If, on the other hand, you simplify your problem to remove any requirement for balancing, then you probably can handle with regular expressions.

Java Regular expressions issue - Can't match two strings in the same line [duplicate]

This question already has answers here:
What do 'lazy' and 'greedy' mean in the context of regular expressions?
(13 answers)
Closed 8 years ago.
just experiencing some problems with Java Regular expressions.
I have a program that reads through an HTML file and replaces any string inside the #VR# characters, i.e. #VR#Test1 2 3 4#VR#
However my issue is that, if the line contains more than two strings surrounded by #VR#, it does not match them. It would match the leftmost #VR# with the rightmost #VR# in the sentence and thus take whatever is in between.
For example:
#VR#Google#VR#
My code would match
URL-GOES-HERE#VR#" target="_blank" style="color:#f4f3f1; text-decoration:none;" title="ContactUs">#VR#Google
Here is my Java code. Would appreciate if you could help me to solve this:
Pattern p = Pattern.compile("#VR#.*#VR#");
Matcher m;
Scanner scanner = new Scanner(htmlContent);
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
m = p.matcher(line);
StringBuffer sb = new StringBuffer();
while (m.find()) {
String match_found = m.group().replaceAll("#VR#", "");
System.out.println("group: " + match_found);
}
}
I tried replacing m.group() with m.group(0) and m.group(1) but nothing. Also m.groupCount() always returns zero, even if there are two matches as in my example above.
Thanks, your help will be very much appreciated.
Your problem is that .* is "greedy"; it will try to match as long a substring as possible while still letting the overall expression match. So, for example, in #VR# 1 #VR# 2 #VR# 3 #VR#, it will match 1 #VR# 2 #VR# 3.
The simplest fix is to make it "non-greedy" (matching as little as possible while still letting the expression match), by changing the * to *?:
Pattern p = Pattern.compile("#VR#.*?#VR#");
Also m.groupCount() always returns zero, even if there are two matches as in my example above.
That's because m.groupCount() returns the number of capture groups (parenthesized subexpressions, whose corresponding matched substrings retrieved using m.group(1) and m.group(2) and so on) in the underlying pattern. In your case, your pattern has no capture groups, so m.groupCount() returns 0.
You can try the regular expression:
#VR#(((?!#VR#).)+)#VR#
Demo:
private static final Pattern REGEX_PATTERN =
Pattern.compile("#VR#(((?!#VR#).)+)#VR#");
public static void main(String[] args) {
String input = "#VR#Google#VR# ";
System.out.println(
REGEX_PATTERN.matcher(input).replaceAll("$1")
); // prints "Google "
}

split a string in java into equal length substrings while maintaining word boundaries

How to split a string into equal parts of maximum character length while maintaining word boundaries?
Say, for example, if I want to split a string "hello world" into equal substrings of maximum 7 characters it should return me
"hello "
and
"world"
But my current implementation returns
"hello w"
and
"orld "
I am using the following code taken from Split string to equal length substrings in Java to split the input string into equal parts
public static List<String> splitEqually(String text, int size) {
// Give the list the right capacity to start with. You could use an array
// instead if you wanted.
List<String> ret = new ArrayList<String>((text.length() + size - 1) / size);
for (int start = 0; start < text.length(); start += size) {
ret.add(text.substring(start, Math.min(text.length(), start + size)));
}
return ret;
}
Will it be possible to maintain word boundaries while splitting the string into substring?
To be more specific I need the string splitting algorithm to take into account the word boundary provided by spaces and not solely rely on character length while splitting the string although that also needs to be taken into account but more like a max range of characters rather than a hardcoded length of characters.
If I understand your problem correctly then this code should do what you need (but it assumes that maxLenght is equal or greater than longest word)
String data = "Hello there, my name is not importnant right now."
+ " I am just simple sentecne used to test few things.";
int maxLenght = 10;
Pattern p = Pattern.compile("\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)", Pattern.DOTALL);
Matcher m = p.matcher(data);
while (m.find())
System.out.println(m.group(1));
Output:
Hello
there, my
name is
not
importnant
right now.
I am just
simple
sentecne
used to
test few
things.
Short (or not) explanation of "\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)" regex:
(lets just remember that in Java \ is not only special in regex, but also in String literals, so to use predefined character sets like \d we need to write it as "\\d" because we needed to escape that \ also in string literal)
\G - is anchor representing end of previously founded match, or if there is no match yet (when we just started searching) beginning of string (same as ^ does)
\s* - represents zero or more whitespaces (\s represents whitespace, * "zero-or-more" quantifier)
(.{1,"+maxLenght+"}) - lets split it in more parts (at runtime :maxLenght will hold some numeric value like 10 so regex will see it as .{1,10})
. represents any character (actually by default it may represent any character except line separators like \n or \r, but thanks to Pattern.DOTALL flag it can now represent any character - you may get rid of this method argument if you want to start splitting each sentence separately since its start will be printed in new line anyway)
{1,10} - this is quantifier which lets previously described element appear 1 to 10 times (by default will try to find maximal amout of matching repetitions),
.{1,10} - so based on what we said just now, it simply represents "1 to 10 of any characters"
( ) - parenthesis create groups, structures which allow us to hold specific parts of match (here we added parenthesis after \\s* because we will want to use only part after whitespaces)
(?=\\s|$) - is look-ahead mechanism which will make sure that text matched by .{1,10} will have after it:
space (\\s)
OR (written as |)
end of the string $ after it.
So thanks to .{1,10} we can match up to 10 characters. But with (?=\\s|$) after it we require that last character matched by .{1,10} is not part of unfinished word (there must be space or end of string after it).
Non-regex solution, just in case someone is more comfortable (?) not using regular expressions:
private String justify(String s, int limit) {
StringBuilder justifiedText = new StringBuilder();
StringBuilder justifiedLine = new StringBuilder();
String[] words = s.split(" ");
for (int i = 0; i < words.length; i++) {
justifiedLine.append(words[i]).append(" ");
if (i+1 == words.length || justifiedLine.length() + words[i+1].length() > limit) {
justifiedLine.deleteCharAt(justifiedLine.length() - 1);
justifiedText.append(justifiedLine.toString()).append(System.lineSeparator());
justifiedLine = new StringBuilder();
}
}
return justifiedText.toString();
}
Test:
String text = "Long sentence with spaces, and punctuation too. And supercalifragilisticexpialidocious words. No carriage returns, tho -- since it would seem weird to count the words in a new line as part of the previous paragraph's length.";
System.out.println(justify(text, 15));
Output:
Long sentence
with spaces,
and punctuation
too. And
supercalifragilisticexpialidocious
words. No
carriage
returns, tho --
since it would
seem weird to
count the words
in a new line
as part of the
previous
paragraph's
length.
It takes into account words that are longer than the set limit, so it doesn't skip them (unlike the regex version which just stops processing when it finds supercalifragilisticexpialidosus).
PS: The comment about all input words being expected to be shorter than the set limit, was made after I came up with this solution ;)

Categories