How to split a string without losing any word? - java

I am using Eclipse for Java and I want to split an input line without losing any characters.
For example, the input line is:
IPOD6 1 USD6IPHONE6 16G,64G,128G USD9,USD99,USD999MACAIR 2013-2014 USD123MACPRO 2013-2014,2014-2015 USD899,USD999
and the desired output is:
IPOD6 1 USD6
IPHONE6 16G,64G,128G USD9,USD99,USD999
MACAIR 2013-2014 USD123
MACPRO 2013-2014,2014-2015 USD899,USD999
I was using split("(?<=\\bUSD\\d{1,99}+)") but it doesn't work.

You just need to add a non-word boundary \B inside the positive lookbehind. \B matches between two word characters or between two non-word characters. The pattern won't split between USD9 and the comma in the substring USD9, because that position is a word boundary: 9 is a word character and , is a non-word character. It does split between USD6 and IPHONE6 because that position is a non-word boundary: 6 and I are both word characters.
String s = "IPOD6 1 USD6IPHONE6 16G,64G,128G USD9,USD99,USD999MACAIR 2013-2014 USD123MACPRO 2013-2014,2014-2015 USD899,USD999";
String[] parts = s.split("(?<=\\bUSD\\d{1,99}+\\B)");
for(String i: parts)
{
System.out.println(i);
}
Output:
IPOD6 1 USD6
IPHONE6 16G,64G,128G USD9,USD99,USD999
MACAIR 2013-2014 USD123
MACPRO 2013-2014,2014-2015 USD899,USD999

Without making it too complicated, use this pattern:
(?=IPOD|IPHONE|MAC)
and replace each match with a newline. After that it is easy to capture the lines or split them into an array.
Or maybe this pattern:
((USD\d+,?)+)
and replace each match with $1\n.
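Both suggestions can be tried directly in Java. The following is only a minimal sketch: the class and method names are made up for illustration, and it assumes Java 8 or later, where a zero-width match at the start of the input does not produce an empty leading element.

```java
import java.util.Arrays;

class ProductLineSplit {
    static final String INPUT = "IPOD6 1 USD6IPHONE6 16G,64G,128G USD9,USD99,USD999"
            + "MACAIR 2013-2014 USD123MACPRO 2013-2014,2014-2015 USD899,USD999";

    // Hypothetical helper: split at every position that is followed by a product prefix.
    static String[] splitBeforeProduct(String line) {
        return line.split("(?=IPOD|IPHONE|MAC)");
    }

    // Hypothetical helper: append a newline after every run of comma-separated USD prices.
    static String breakAfterPrices(String line) {
        return line.replaceAll("((USD\\d+,?)+)", "$1\n");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitBeforeProduct(INPUT)));
        System.out.print(breakAfterPrices(INPUT));
    }
}
```

The lookahead variant depends on knowing the product prefixes in advance, while the price-based variant depends only on the USD format.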

Related

Regex not matching against ampersand

I'm trying to match the following regex:
\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|&|and)\.?\b
In other words, a word boundary followed by any of the strings above (optionally followed by a period character) followed by a word boundary.
I'm trying to match this in Java, but the ampersand will not match. For example:
Pattern p = Pattern.compile(
"\\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|&|and)\\.?\\b",
Pattern.CASE_INSENSITIVE);
String result = p.matcher("mr one and mrs.two and three & four").replaceAll(" ");
System.out.println("["+result+"]");
The output of this is: [ one two three & four]
I've also tried this at regex101, and again the ampersand does not match: https://regex101.com/r/klkmwl/1
Escaping the ampersand does not make a difference, and I've tried using the hex escape sequence \x26 instead of ampersand (as suggested in this question). Why is this not matching?
Your regex will match an ampersand if it is located in between word chars, e.g. three&four, see this regex demo. This happens because \b before a non-word char requires a word char to appear immediately before it. Also, as there is a \b after an optional dot, both the dot and ampersand will only match if there is a word char immediately on the left.
You need to re-write the pattern so that the word boundaries are applied to the words rather than symbols:
Pattern p = Pattern.compile(
"(?:\\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|and)\\b|&)\\.?",
Pattern.CASE_INSENSITIVE);
See the regex demo online.
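To see the difference described above, here is a small sketch that checks whether each pattern ever matches the ampersand. The class name and the findsAmpersand helper are hypothetical, introduced only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AmpersandBoundaryDemo {
    // Pattern from the question: \b is required on both sides of every alternative, including &.
    static final Pattern ORIGINAL = Pattern.compile(
            "\\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|&|and)\\.?\\b",
            Pattern.CASE_INSENSITIVE);

    // Rewritten pattern: \b guards only the words; & is matched on its own.
    static final Pattern REWRITTEN = Pattern.compile(
            "(?:\\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|and)\\b|&)\\.?",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true if any match of p in input starts with an ampersand.
    static boolean findsAmpersand(Pattern p, String input) {
        Matcher m = p.matcher(input);
        while (m.find()) {
            if (m.group().startsWith("&")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(findsAmpersand(ORIGINAL, "three & four"));  // false
        System.out.println(findsAmpersand(ORIGINAL, "three&four"));    // true
        System.out.println(findsAmpersand(REWRITTEN, "three & four")); // true
    }
}
```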
The problem is due to the use of word boundaries. There is no word boundary before or after a non-word character like & unless a word character sits directly next to it.
In place of word boundaries you can use lookarounds:
(?<!\w)(?:[jsdm]r|mr?s|miss|messrs|mmes|prof|rev|&|and)\.?(?!\w)
(?<!\w): make sure that the previous character is not a word character
(?!\w): make sure that the next character is not a word character
Note some tweaks to your regex to make it shorter.
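A short sketch of the lookaround approach follows. It assumes the rev alternative from the question's pattern; the class name and the matches helper are hypothetical. Note that this version matches & only when it is not glued to a word character, which is the opposite of what \b did.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundTitleDemo {
    static final Pattern TITLES = Pattern.compile(
            "(?<!\\w)(?:[jsdm]r|mr?s|miss|messrs|mmes|prof|rev|&|and)\\.?(?!\\w)",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns every match in order of appearance.
    static List<String> matches(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = TITLES.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(matches("mr one and three & four")); // [mr, and, &]
        System.out.println(matches("three&four"));              // []
    }
}
```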

Find regular expression of length specified and starting and ending also specified in Java

I want to find all the words of length 3 that start with 'l' and end with 'f'.
Here's my code:
Pattern pt = Pattern.compile("\\bl.+?f{3}\\b");
Matcher mt = pt.matcher("#Java life! Go ahead Java,lyf,fly,luf,loof");
while(mt.find()) {
System.out.println(mt.group());
}
It shows nothing. I also tried Pattern pt = Pattern.compile("l.+?f{3}"); but I still don't get the expected output.
The output should be:
lyf luf
You can use a word boundary \b, then match for l, a word character \w and then f ending with a word boundary \b.
\bl\wf\b
Explanation
Match a word boundary \b
Match l
Match a word character \w (\w is a shorthand character class that matches the ASCII characters [A-Za-z0-9_])
Match an f
Match a word boundary \b
The regex you need is
\bl\wf\b
Explanation:
Since your word must be three character long, that means there can only be one letter between l and f, so that's why I didn't put a quantifier there.
Your regex is wrong because
f{3} means 3 f's, not 3 character long in total
. matches everything, including non word characters. Use \w instead.
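Plugged into the code from the question, the suggested pattern could be used roughly like this. The class name and the find helper are hypothetical and exist only to keep the sketch self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreeLetterWords {
    // \b l \w f \b : exactly one word character between 'l' and 'f'
    private static final Pattern L_X_F = Pattern.compile("\\bl\\wf\\b");

    // Hypothetical helper: collects every match in order of appearance.
    static List<String> find(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = L_X_F.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // "life" and "loof" are too long, and "fly" does not start with 'l'
        System.out.println(find("#Java life! Go ahead Java,lyf,fly,luf,loof")); // [lyf, luf]
    }
}
```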

Java regex text replace

I have text like this
Some. / text to-match (1)
I want to replace characters such as . / ( ) and spaces with _, so that the result is:
Some_text_to_match_1
What pattern should I use?
You may trim the string from non-word chars on both ends (with .replaceAll("^\\W+|\\W+$", "")), and then replace 1 or more non-word character chunks with _ inside the string (with .replaceAll("\\W+", "_")):
String s = "Some. / text to-match (1)";
s = s.replaceAll("^\\W+|\\W+$", "").replaceAll("\\W+", "_");
System.out.println(s);
See the Java demo
Details:
\W matches a non-word character
+ matches 1 or more occurrences of the subpattern this quantifier modifies.
Since we need to use 2 different replacements when trimming the string and then replacing non-word chars inside it, we cannot use just 1 replaceAll.
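The effect of the trimming step is easiest to see by comparing the two-step version with a single replacement. This is only an illustrative sketch; the class and method names are hypothetical.

```java
class UnderscoreSlug {
    // Trim non-word characters at both ends first, then collapse inner runs to "_".
    static String trimThenReplace(String s) {
        return s.replaceAll("^\\W+|\\W+$", "").replaceAll("\\W+", "_");
    }

    // Single replacement only: a trailing ")" also becomes "_".
    static String replaceOnly(String s) {
        return s.replaceAll("\\W+", "_");
    }

    public static void main(String[] args) {
        String s = "Some. / text to-match (1)";
        System.out.println(trimThenReplace(s)); // Some_text_to_match_1
        System.out.println(replaceOnly(s));     // Some_text_to_match_1_
    }
}
```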

Regular expression for splitting a String while preserving whitespace

I am working on an Android project that needs to split a String into tokens while preserving whitespace, without splitting at non-word characters like #, & etc.
Using \b splits at any non-word character, so I need a way to split the string as follows.
Input: (. indicates whitespace)
A.A#..A##
Desired output:
A
.
A#
..
A##
So these 5 lines are the 5 values I would like in an array or similar. That means the 4th element of the result-array contains 2 spaces.
I think this is what you want:
(?<=\S)(?=\s)|(?<=\s)(?=\S)
Basically I'm saying "if the previous character is a non-space and the next is a space or if the previous is a space and the next is a non-space, then split".
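As a rough sketch, the pattern can be passed to String.split, using real spaces instead of the "." placeholder from the question. The class and method names are hypothetical.

```java
class KeepWhitespaceSplit {
    // Split at every boundary between a whitespace and a non-whitespace character.
    // Both alternatives are zero-width, so no characters are consumed.
    static String[] tokens(String s) {
        return s.split("(?<=\\S)(?=\\s)|(?<=\\s)(?=\\S)");
    }

    public static void main(String[] args) {
        for (String t : tokens("A A#  A##")) {
            System.out.println("[" + t + "]"); // brackets make the whitespace tokens visible
        }
    }
}
```

For the input above this prints five tokens: A, one space, A#, two spaces, and A##.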
Use StringTokenizer:
StringTokenizer st = new StringTokenizer("A.A#..A##", "."); // first argument is the string to split, the second is the set of delimiter characters (here "." stands for whitespace)
while(st.hasMoreTokens())
System.out.println(st.nextToken());
output will be:
A
A#
A##
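Try:
String s = "A.A#..A##";
// Strings are immutable, so the result of replaceAll must be assigned back.
s = s.replaceAll("\\.{2,}", "."); // collapse runs of the placeholder into a single one
// Escape the dot: an unescaped "." is a regex that matches any character.
String[] out = s.split("\\.");
This gives the array [A, A#, A##]. Note that the whitespace runs themselves are not kept.
Don't forget to replace the "." with actual spaces :)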

How to split a string with any whitespace chars as delimiters

What regex pattern would I need to pass to java.lang.String.split() to split a String into an array of substrings using all whitespace characters (' ', '\t', '\n', etc.) as delimiters?
Something along the lines of
myString.split("\\s+");
This groups all white spaces as a delimiter.
So if I have the string:
"Hello[space character][tab character]World"
This should yield the strings "Hello" and "World" and omit the empty space between the [space] and the [tab].
As VonC pointed out, the backslash should be escaped, because the Java compiler first interprets backslash escapes in the string literal, and only the result is handed to the regex engine. What you want the regex engine to receive is \s, which means you need to write "\\s" in the source. It can get a bit confusing.
The \\s is equivalent to [ \\t\\n\\x0B\\f\\r].
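A quick sketch of the behaviour described above, including one caveat: leading whitespace yields an empty first element unless the string is trimmed first. The class name and the words helper are hypothetical.

```java
import java.util.Arrays;

class WhitespaceSplit {
    // Hypothetical helper: trim first so that leading whitespace
    // does not produce an empty first element.
    static String[] words(String s) {
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // A space followed by a tab counts as one delimiter.
        System.out.println(Arrays.toString("Hello \tWorld".split("\\s+"))); // [Hello, World]
        // Leading whitespace: the first of the 3 elements is an empty string.
        System.out.println("  Hello World".split("\\s+").length);           // 3
        System.out.println(Arrays.toString(words("  Hello World  ")));      // [Hello, World]
    }
}
```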
In most regex dialects there is a set of convenient shorthand character classes you can use for this kind of thing. These are good ones to remember:
\w - Matches any word character.
\W - Matches any nonword character.
\s - Matches any white-space character.
\S - Matches anything but white-space characters.
\d - Matches any digit.
\D - Matches anything except digits.
A search for "Regex Cheatsheets" should reward you with a whole lot of useful summaries.
To get this working in Javascript, I had to do the following:
myString.split(/\s+/g)
"\\s+" should do the trick
You may also have a Unicode non-breaking space (\u00A0) in the input:
String[] elements = s.split("[\\s\\xA0]+"); // also treat the Unicode non-breaking space as a delimiter
String string = "Ram is going to school";
String[] arrayOfString = string.split("\\s+");
Apache Commons Lang has a method to split a string with whitespace characters as delimiters:
StringUtils.split("abc def")
http://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html#split(java.lang.String)
This might be easier to use than a regex pattern.
All you need is to split using one of the predefined character classes of the Java regex engine,
namely the whitespace character class:
\d Represents a digit: [0-9]
\D Represents a non-digit: [^0-9]
\s Represents a whitespace character including [ \t\n\x0B\f\r]
\S Represents a non-whitespace character as [^\s]
\v Represents a vertical whitespace character as [\n\x0B\f\r\x85\u2028\u2029]
\V Represents a non-vertical whitespace character as [^\v]
\w Represents a word character as [a-zA-Z_0-9]
\W Represents a non-word character as [^\w]
Here, the key point to remember is that the lowercase \s represents all kinds of whitespace, including a single space, a tab character, and similar characters.
So, if you try something like this:
String theString = "Java \tProgramming"; // a space followed by a tab
String[] allParts = theString.split("\\s+");
you will get the desired output: Java and Programming as separate elements.
Hope, this might help you the best!!!
To split a string with any Unicode whitespace, you need to use
s.split("(?U)\\s+")
         ^^^^
The (?U) inline embedded flag option is the equivalent of Pattern.UNICODE_CHARACTER_CLASS that enables \s shorthand character class to match any characters from the whitespace Unicode category.
If you want to split with whitespace and keep the whitespaces in the resulting array, use
s.split("(?U)(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)")
See the regex demo. See Java demo:
String s = "Hello\t World\u00A0»";
System.out.println(Arrays.toString(s.split("(?U)\\s+"))); // => [Hello, World, »]
System.out.println(Arrays.toString(s.split("(?U)(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)")));
// => [Hello, , World, , »]
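To illustrate what the (?U) flag changes, the following sketch splits a string containing only a non-breaking space (U+00A0) with and without the flag. The class name and the count helper are hypothetical.

```java
class UnicodeWhitespaceSplit {
    // "a" and "b" separated by a single non-breaking space (U+00A0)
    static final String SAMPLE = "a\u00A0b";

    // Hypothetical helper: number of elements produced by splitting SAMPLE.
    static int count(String regex) {
        return SAMPLE.split(regex).length;
    }

    public static void main(String[] args) {
        // Without (?U), \s covers only [ \t\n\x0B\f\r], so nothing is split.
        System.out.println(count("\\s+"));     // 1
        // With (?U), \s also matches the non-breaking space.
        System.out.println(count("(?U)\\s+")); // 2
    }
}
```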
Since it is a regular expression, and I'm assuming you would also not want non-alphanumeric characters like commas, dots, etc. that could be surrounded by blanks (e.g. "one , two" should give [one][two]), it should be the following (shown as a JavaScript regex literal; in Java, pass "[\\s\\W]+" to split):
myString.split(/[\s\W]+/)
You can split a string by line breaks with the following statement:
String textStr[] = yourString.split("\\r?\\n");
You can split a string by whitespace with the following statement:
String textStr[] = yourString.split("\\s+");
String str = "Hello World";
String res[] = str.split("\\s+");
Study this code.. good luck
import java.util.*;

class Demo {
    public static void main(String args[]) {
        Scanner input = new Scanner(System.in);
        System.out.print("Input String : ");
        String s1 = input.nextLine();
        // Split on runs of whitespace, also treating the non-breaking space (\xA0) as a delimiter
        String[] tokens = s1.split("[\\s\\xA0]+");
        System.out.println(tokens.length);
        for (String s : tokens) {
            System.out.println(s);
        }
    }
}
