Regular expression for splitting a String while preserving whitespace - java

I am working on an Android project that needs to split a String into tokens while preserving whitespace, and without splitting at non-word characters like #, & etc.
Using \b splits at any non-word character, so I need a way to split the string in the following way.
Input: (. indicates whitespace)
A.A#..A##
Desired output:
A
.
A#
..
A##
So these 5 lines are the 5 values I would like in an array or similar. That means the 4th element of the result-array contains 2 spaces.

I think this is what you want:
(?<=\S)(?=\s)|(?<=\s)(?=\S)
Basically I'm saying "if the previous character is a non-space and the next is a space or if the previous is a space and the next is a non-space, then split".
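As a rough sketch of how that pattern behaves with String.split, the snippet below uses real spaces where the question writes dots. The class and method names (WhitespaceSplitDemo, tokenize) are made up for illustration and are not from the original answer.

```java
import java.util.Arrays;

class WhitespaceSplitDemo {
    // Zero-width split points: wherever a non-space meets a space, in either order.
    static final String BOUNDARY = "(?<=\\S)(?=\\s)|(?<=\\s)(?=\\S)";

    // Hypothetical helper: splits at the boundaries and keeps the whitespace runs as tokens.
    static String[] tokenize(String input) {
        return input.split(BOUNDARY);
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("A A#  A##");
        // Wrap each token in brackets so the whitespace tokens are visible.
        for (String token : tokens) {
            System.out.println("[" + token + "]");
        }
        System.out.println(Arrays.asList(tokens).size()); // 5
    }
}
```

Because both alternatives are zero-width, no characters are consumed by the split, so the two-space run survives as a single token.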

Use StringTokenizer:
StringTokenizer st = new StringTokenizer("A.A#..A##", "."); // first argument is the string to split, second is the set of delimiter characters (here "." stands in for whitespace)
while (st.hasMoreTokens()) {
    System.out.println(st.nextToken());
}
Output will be:
A
A#
A##
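Note that the two-argument constructor discards the delimiters. StringTokenizer also has a three-argument constructor whose returnDelims flag hands the delimiters back as tokens, but each delimiter character arrives as its own one-character token, so a run of two spaces comes back as two tokens rather than one. A small sketch of that behaviour (the class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerWithDelims {
    // Hypothetical helper: collects every token, delimiters included.
    static List<String> tokenize(String input) {
        // The third argument (returnDelims = true) makes the delimiters tokens too.
        StringTokenizer st = new StringTokenizer(input, " ", true);
        List<String> tokens = new ArrayList<>();
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Yields 6 tokens: the two adjacent spaces are returned separately.
        for (String token : tokenize("A A#  A##")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

So this gets close to the desired output, but adjacent whitespace characters would still have to be merged by hand.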

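Try:
String s = "A.A#..A##";
// Strings are immutable, so keep the result of replace(); collapse the longer run first
s = s.replace("...", ".").replace("..", ".");
// split() takes a regex, so the dot has to be escaped
String[] out = s.split("\\.");
This gives you an array of the non-whitespace tokens; the whitespace runs themselves are dropped.
Don't forget to replace the "." with actual spaces :)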

Related

Java Regex - Remove Non-Alphanumeric characters except line breaks

I'm trying to remove all the non-alphanumeric characters from a String in Java but keep the carriage returns. I have the following regular expression, but it keeps joining words before and after a line break.
[^\\p{Alnum}\\s]
How would I be able to preserve the line breaks or convert them into spaces so that I don't have words joining?
An example of this issue is shown below:
Original Text
and refreshingly direct
when compared with the hand-waving of Swinburne.
After Replacement:
and refreshingly directwhen compared with the hand-waving of Swinburne.
You may add these chars to the regex, not \s, as \s matches any whitespace:
String reg = "[^\\p{Alnum}\n\r]";
Or, you may use character class subtraction:
String reg = "[\\P{Alnum}&&[^\n\r]]";
Here, \P{Alnum} matches any non-alphanumeric and &&[^\n\r] prevents a LF and CR from matching.
A Java test:
String s = "&&& Text\r\nNew line".replaceAll("[^\\p{Alnum}\n\r]+", "");
System.out.println(s);
// => Text
//    Newline
Note that there are more line break chars than LF and CR. Since Java 8, the \R construct matches any style of line break; it is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029].
So, to exclude matching any line breaks, you may use
String reg = "[^\\p{Alnum}\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]+";
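The question also asks about converting line breaks into spaces so that words do not run together. One possible sketch, assuming Java 8 or later for \R, is to do it in two passes. The class and method names are made up for illustration.

```java
class LineBreakCleaner {
    // Hypothetical helper: turns each line break into one space,
    // then strips everything that is neither alphanumeric nor a space.
    static String clean(String text) {
        return text
                .replaceAll("\\R", " ")              // \R treats \r\n as a single line break
                .replaceAll("[^\\p{Alnum} ]+", "");  // drop punctuation and other symbols
    }

    public static void main(String[] args) {
        String input = "and refreshingly direct\r\nwhen compared with the hand-waving of Swinburne.";
        System.out.println(clean(input));
        // and refreshingly direct when compared with the handwaving of Swinburne
    }
}
```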
You can use the regex [^A-Za-z0-9\\n\\r], for example:
String result = str.replaceAll("[^a-zA-Z0-9\\n\\r]", "");
Example
Input
aaze03.aze1654aze987 */-a*azeaze\n hello *-*/zeaze+64\nqsdoi
Output
aaze03aze1654aze987aazeaze
hellozeaze64
qsdoi
I made a mistake with my code. I was reading in a file line by line and building the String, but didn't add a space at the end of each line. Therefore there were no actual line breaks to replace.
That's a perfect case for Guava's CharMatcher:
String input = "and refreshingly direct\n\rwhen compared with the hand-waving of Swinburne.";
String output = CharMatcher.javaLetterOrDigit().or(CharMatcher.whitespace()).retainFrom(input);
Output will be:
and refreshingly direct
when compared with the handwaving of Swinburne

How to split a string without losing any word?

I am using Eclipse for Java and I want to split an input line without losing any characters.
For example, the input line is:
IPOD6 1 USD6IPHONE6 16G,64G,128G USD9,USD99,USD999MACAIR 2013-2014 USD123MACPRO 2013-2014,2014-2015 USD899,USD999
and the desired output is:
IPOD6 1 USD6
IPHONE6 16G,64G,128G USD9,USD99,USD999
MACAIR 2013-2014 USD123
MACPRO 2013-2014,2014-2015 USD899,USD999
I was using split("(?<=\\bUSD\\d{1,99}+)") but it doesn't work.
You just need to add a non-word boundary \B inside the positive look-behind. \B matches between two word characters or between two non-word characters. It won't split between USD9 and the comma in the substring USD9, because a word boundary exists there: 9 is a word character and , is a non-word character. It does split between USD6 and IPHONE6 because a non-word boundary \B exists there: 6 and I are both word characters.
String s = "IPOD6 1 USD6IPHONE6 16G,64G,128G USD9,USD99,USD999MACAIR 2013-2014 USD123MACPRO 2013-2014,2014-2015 USD899,USD999";
String[] parts = s.split("(?<=\\bUSD\\d{1,99}+\\B)");
for (String i : parts) {
    System.out.println(i);
}
Output:
IPOD6 1 USD6
IPHONE6 16G,64G,128G USD9,USD99,USD999
MACAIR 2013-2014 USD123
MACPRO 2013-2014,2014-2015 USD899,USD999
Without making it too complicated, use this pattern:
(?=IPOD|IPHONE|MAC)
and replace each match with a new line. Now it is easy to capture or split into an array.
Or maybe this pattern:
((USD\d+,?)+)
and replace with $1\n
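The lookahead pattern can also be passed straight to String.split instead of going through a replacement. The sketch below assumes the product prefixes (IPOD, IPHONE, MAC) are known in advance, as that answer does, and relies on the Java 8+ rule that a zero-width match at the start of the input produces no empty leading element. The class and method names are made up for illustration.

```java
class ProductLineSplitter {
    // Hypothetical helper: splits just before each known product prefix.
    static String[] splitRecords(String line) {
        return line.split("(?=IPOD|IPHONE|MAC)");
    }

    public static void main(String[] args) {
        String s = "IPOD6 1 USD6IPHONE6 16G,64G,128G USD9,USD99,USD999"
                 + "MACAIR 2013-2014 USD123MACPRO 2013-2014,2014-2015 USD899,USD999";
        // Prints one record per line, four lines in total.
        for (String record : splitRecords(s)) {
            System.out.println(record);
        }
    }
}
```

The trade-off against the \B look-behind is that this version has to be updated whenever a new product prefix appears.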

Trying to remove up to 2 digits following a match

Say I have a few string like Foo3,5bar, Foo14,5bar and Foo23,42bar
I want to remove the second number, following the comma, as well as the comma, using Java Regex.
So far, I've tried String.replaceAll("(?<=Foo\d{1,2}),\d{1,2}", ""), using (?<=Foo\d{1,2}),\d{1,2} as my regex, but it's not working.
Use String#replaceAll that has regex support:
String str = "Foo3,4HelloFoo5,3World";
str = str.replaceAll("(\\d),\\d+", "$1"); // Foo3HelloFoo5World
OR else if you want to restrict matching to max 2 digits after comma then use:
str = str.replaceAll("(\\d),\\d{1,2}", "$1"); // Foo3HelloFoo5World
Live Demo: http://ideone.com/5P1guJ
str = str.replaceFirst(",\\d+$", ""); // note: $ only matches when the digits end the string
What about Integer.valueOf(str.substring(str.lastIndexOf(",") + 1))?
Is regex a necessity?
Your regex is almost correct. You forgot to escape the backslashes in the Java string literal. The correct one is:
(?<=Foo\\d{1,2}),\\d{1,2}
Note the \\ instead of \.
See https://ideone.com/W7IuT1 for a demo on your strings.
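For reference, a minimal sketch that applies the corrected look-behind to the three strings from the question. The class and method names are made up for illustration.

```java
class SecondNumberRemover {
    // The bounded look-behind requires "Foo" plus one or two digits before the comma.
    static final String PATTERN = "(?<=Foo\\d{1,2}),\\d{1,2}";

    // Hypothetical helper: drops the comma and the number that follows it.
    static String strip(String input) {
        return input.replaceAll(PATTERN, "");
    }

    public static void main(String[] args) {
        System.out.println(strip("Foo3,5bar"));   // Foo3bar
        System.out.println(strip("Foo14,5bar"));  // Foo14bar
        System.out.println(strip("Foo23,42bar")); // Foo23bar
    }
}
```

Java accepts this look-behind because its length is bounded ({1,2}); an unbounded quantifier such as \d+ inside the look-behind is not guaranteed to be accepted.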

Splitting a Java String with '.'

I have
1. This is a test message
I want to print
This is a test message
I am trying
String delimiter=".";
String[] parts = line.split(delimiter);
int gg=parts.length;
Then I want to print the array:
for (int k ;k <gg;K++)
parts[k];
But my gg is always 0.
Am I missing anything?
All I need is to remove the number, the . and the whitespace.
The number can be a 1-digit or a 5-digit number.
You are using "." as a delimiter, so you need to escape the . char to remove its special meaning.
The . char in a regex means "any character", so your split is splitting on every character, which is obviously not what you are after.
Use "\\." as the delimiter.
For more information on pre-defined character classes you can have a look at the tutorial.
For more information on regex on general (includes the above) you can try this tutorial
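To make the difference concrete, here is a small sketch comparing the unescaped and escaped delimiters on the input from the question. Pattern.quote is a standard alternative to escaping by hand; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class DotSplitDemo {
    // Hypothetical helper: lets Pattern.quote do the escaping.
    static String[] splitOnDot(String line) {
        return line.split(Pattern.quote("."));
    }

    public static void main(String[] args) {
        String line = "1. This is a test message";

        // Unescaped: every character is a delimiter, so only empty strings
        // remain, and split() drops trailing empty strings.
        System.out.println(line.split(".").length);   // 0

        // Escaped: splits on the literal dot only.
        System.out.println(line.split("\\.").length); // 2

        System.out.println(splitOnDot(line)[1].trim()); // This is a test message
    }
}
```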
EDIT:
P.S. What you are up to (removing the number) can be achieved with a one-liner, using the String.replaceAll() method.
System.out.println(line.replaceAll("[0-9]+\\.\\s+", ""));
will provide output
This is a test message
For your input example.
The idea is: [0-9] is any digit, and the + indicates that there can be any number of them greater than 0. The \\. is a literal dot (escaped as mentioned above), and the \\s+ is at least one whitespace character.
It is all replaced with an empty string.
Note however, for strings like: "1. this is a 2. test" - it will provide "this is a test", and remove the "2. " as well, so think carefully if that is indeed what you are after.
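If only the leading number should go, one option is to anchor the pattern to the start of the string and use replaceFirst. This is a sketch of that variation, not part of the original answer; the class and method names are made up for illustration.

```java
class PrefixNumberRemover {
    // Hypothetical helper: removes a leading "<digits>. " prefix only.
    static String stripPrefix(String line) {
        // ^ pins the match to the start, so later "2. " occurrences are untouched.
        return line.replaceFirst("^[0-9]+\\.\\s+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripPrefix("1. This is a test message")); // This is a test message
        System.out.println(stripPrefix("1. this is a 2. test"));      // this is a 2. test
    }
}
```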
Use the following code:
String delimiter = "\\."; // the dot must be escaped because . is a regex metacharacter
String[] parts = line.split(delimiter);
For your for loop:
for (int k = 0; k < parts.length; k++)
    System.out.println(parts[k]);
Try this:
String delimiter = "\\.";
"." has a special meaning in a regular expression.
If you are just trying to remove the prefix numbers, then you can do it in one line. I'm not sure whether you actually want to split on multiple dots.
String s = "1. with single digit";
String s2 = "999. with multiple digits";
String s3 = "999. with multiple digits . and . dots";
assertEquals("with single digit", (s.substring(s.indexOf(".") + 1).trim()));
assertEquals("with multiple digits", (s2.substring(s2.indexOf(".") + 1).trim()));
assertEquals("with multiple digits . and . dots", (s3.substring(s3.indexOf(".") + 1).trim()));

How to split a string with any whitespace chars as delimiters

What regex pattern would I need to pass to java.lang.String.split() to split a String into an Array of substrings using all whitespace characters (' ', '\t', '\n', etc.) as delimiters?
Something along the lines of
myString.split("\\s+");
This groups all white spaces as a delimiter.
So if I have the string:
"Hello[space character][tab character]World"
This should yield the strings "Hello" and "World" and omit the empty space between the [space] and the [tab].
As VonC pointed out, the backslash should be escaped, because the Java compiler first interprets backslash escapes in the string literal, and only the result is handed to the regex engine. What you want is the literal "\s", which means you need to write "\\s". It can get a bit confusing.
The \\s is equivalent to [ \\t\\n\\x0B\\f\\r].
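A short sketch of why the + matters, and of one edge case worth knowing: leading whitespace produces an empty first element unless the string is trimmed first. The class and method names are made up for illustration.

```java
import java.util.Arrays;

class WhitespaceDelimiterDemo {
    // Hypothetical helper: trims first so leading whitespace yields no empty token.
    static String[] words(String input) {
        return input.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String s = "Hello \tWorld"; // a space followed by a tab

        // Without +, each whitespace char is its own delimiter: [Hello, , World]
        System.out.println(Arrays.toString(s.split("\\s")));

        // With +, the whole run is one delimiter: [Hello, World]
        System.out.println(Arrays.toString(s.split("\\s+")));

        // Leading whitespace gives an empty first element: [, a, b]
        System.out.println(Arrays.toString(" a b".split("\\s+")));

        // Trimming first avoids it: [a, b]
        System.out.println(Arrays.toString(words(" a b")));
    }
}
```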
In most regex dialects there are a set of convenient character summaries you can use for this kind of thing - these are good ones to remember:
\w - Matches any word character.
\W - Matches any nonword character.
\s - Matches any white-space character.
\S - Matches anything but white-space characters.
\d - Matches any digit.
\D - Matches anything except digits.
A search for "Regex Cheatsheets" should reward you with a whole lot of useful summaries.
To get this working in Javascript, I had to do the following:
myString.split(/\s+/g)
"\\s+" should do the trick
You may also have a Unicode non-breaking space (\u00A0)...
String[] elements = s.split("[\\s\\xA0]+"); // also treat the Unicode non-breaking space as a delimiter
String string = "Ram is going to school";
String[] arrayOfString = string.split("\\s+");
Apache Commons Lang has a method to split a string with whitespace characters as delimiters:
StringUtils.split("abc def")
http://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html#split(java.lang.String)
This might be easier to use than a regex pattern.
All you need is to split on one of the predefined character classes of the Java regex engine, namely the whitespace character class:
\d Represents a digit: [0-9]
\D Represents a non-digit: [^0-9]
\s Represents a whitespace character including [ \t\n\x0B\f\r]
\S Represents a non-whitespace character as [^\s]
\v Represents a vertical whitespace character as [\n\x0B\f\r\x85\u2028\u2029]
\V Represents a non-vertical whitespace character as [^\v]
\w Represents a word character as [a-zA-Z_0-9]
\W Represents a non-word character as [^\w]
Here, the key point to remember is that the lowercase \s represents all types of whitespace, including a single space [ ], a tab character [\t] or anything similar.
So, if you try something like this:
String theString = "Java<a space><a tab>Programming";
String[] allParts = theString.split("\\s+");
You will get the desired output.
To split a string with any Unicode whitespace, you need to use
s.split("(?U)\\s+")
         ^^^^
The (?U) inline embedded flag option is the equivalent of Pattern.UNICODE_CHARACTER_CLASS that enables \s shorthand character class to match any characters from the whitespace Unicode category.
If you want to split with whitespace and keep the whitespaces in the resulting array, use
s.split("(?U)(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)")
A Java demo:
String s = "Hello\t World\u00A0»";
System.out.println(Arrays.toString(s.split("(?U)\\s+"))); // => [Hello, World, »]
System.out.println(Arrays.toString(s.split("(?U)(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)")));
// => [Hello, , World, , »]
Since it is a regular expression, and I'm assuming you would also not want non-alphanumeric chars like commas, dots, etc. that could be surrounded by blanks (e.g. "one , two" should give [one][two]), it should be:
myString.split(/[\s\W]+/)
You can split a string by line break using the following statement:
String textStr[] = yourString.split("\\r?\\n");
You can split a string by whitespace using the following statement:
String textStr[] = yourString.split("\\s+");
String str = "Hello World";
String res[] = str.split("\\s+");
Study this code.. good luck
import java.util.*;

class Demo {
    public static void main(String args[]) {
        Scanner input = new Scanner(System.in);
        System.out.print("Input String : ");
        String s1 = input.nextLine();
        // Split on runs of whitespace, including the non-breaking space
        String[] tokens = s1.split("[\\s\\xA0]+");
        System.out.println(tokens.length);
        for (String s : tokens) {
            System.out.println(s);
        }
    }
}
