Check an expression with a regex

I have this method, which I use to create an array of a conditional expression.
private void convertToList() {
String regex = "[-]?[0-9]+([eE][-]?[0-9]+)?|([-+/*\\\\^])|([()])|(!)|(>=)|(<=)|(<)|(>)|(&&)|(==)|(!=)|([|][|])|(\\[)|(\\])|(and)|(or)|(not)|(true)|(false)|([A-Za-z_][A-Za-z_0-9]*)";
Matcher m3 = Pattern.compile(regex).matcher(this.stringExp);
this.arrayExp = new ArrayList<String>(this.stringExp.length());
while (m3.find()) {
arrayExp.add(m3.group());
}
}
The expression can contain words, numbers and operators (which you can see in the regex).
Now I want to check if the expression is valid before tokenizing. I've tried this, but it doesn't work.
private static void checkString(String s){
String regex = "[-]?[0-9]+([eE][-]?[0-9]+)?|([-+/*\\\\^])|([()])|(!)|(>=)|(<=)|(<)|(>)|(&&)|(==)|(!=)|([|][|])|(\\[)|(\\])|(and)|(or)|(not)|(true)|(false)|([A-Za-z_][A-Za-z_0-9]*)";
Matcher m3 = Pattern.compile(regex).matcher(s);
if (m3.matches()){
System.out.println("OK");
} else {
System.out.println("Not ok");
}
}
Examples of valid strings:
"a + b < 5"
"a <= b && c > 1 || a == 4"
Is there any way to do that?

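You are probably having problems with spaces: your example strings contain spaces, but no alternative in the regex matches whitespace. Note also that matches() requires the entire input to match the pattern, and your pattern describes only a single token, so a string made of several tokens can never match.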
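A minimal sketch of one way to do the check, assuming "valid" only means that the string consists entirely of known tokens separated by optional whitespace (grammar is not checked, so "+ +" would pass). The names ExpressionCheck and isTokenizable are hypothetical. The token pattern is adapted from the question: the two-character operators are moved ahead of the one-character ones so that != is not read as ! followed by a stray =, and the keyword alternatives are dropped because the identifier alternative already covers them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExpressionCheck {
    // Token alternatives adapted from the question; longer operators come first.
    private static final String TOKEN =
            "[-]?[0-9]+(?:[eE][-]?[0-9]+)?"   // numbers
            + "|>=|<=|==|!=|&&|[|][|]"        // two-character operators
            + "|[-+/*\\\\^<>!()\\[\\]]"       // one-character operators and brackets
            + "|[A-Za-z_][A-Za-z_0-9]*";      // identifiers and keywords

    // Optional whitespace followed by exactly one token.
    private static final Pattern NEXT_TOKEN = Pattern.compile("\\s*(?:" + TOKEN + ")");

    static boolean isTokenizable(String s) {
        Matcher m = NEXT_TOKEN.matcher(s);
        int pos = 0;
        while (pos < s.length()) {
            m.region(pos, s.length());
            if (!m.lookingAt()) {
                // No token starts here: acceptable only if the rest is whitespace.
                return s.substring(pos).trim().isEmpty();
            }
            pos = m.end();
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isTokenizable("a + b < 5"));                  // true
        System.out.println(isTokenizable("a <= b && c > 1 || a == 4")); // true
        System.out.println(isTokenizable("a $ b"));                      // false
    }
}
```

Matching one token at a time with lookingAt() avoids wrapping the whole alternation in a repeated group, which could backtrack heavily on long invalid input.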

Related

Remove the intersections of multiple regular expressions?

Pattern[] a =new Pattern[2];
a[0] = Pattern.compile("[$£€]?\\s*\\d*[\\.]?[pP]?\\d*\\d");
a[1] = Pattern.compile("Rs[.]?\\s*[\\d,]*[.]?\\d*\\d");
Ex: Rs.150 is detected by a[1] and 150 is detected by a[0].
How can I remove such intersections, so that Rs.150 is detected only by a[1] and not by a[0]?

You can use the | operator inside your regular expression. Then call Matcher#group(int) to see which pattern your input matched. This method returns null for a group that did not take part in the match.
Sample code
public static void main(String[] args) {
// Build regexp
final String MONEY_REGEX = "[$£€]?\\s*\\d*[\\.]?[pP]?\\d*\\d";
final String RS_REGEX = "Rs[.]?\\s*[\\d,]*[.]?\\d*\\d";
// Separate them with '|' operator and wrap them in two distinct matching groups
final String MONEY_OR_RS = String.format("(%s)|(%s)", MONEY_REGEX, RS_REGEX);
// Prepare some sample inputs
String[] inputs = new String[] { "$100", "Rs.150", "foo" };
Pattern p = Pattern.compile(MONEY_OR_RS);
// Test each input
Matcher m = null;
for (String input : inputs) {
if (m == null) {
m = p.matcher(input);
} else {
m.reset(input);
}
if (m.matches()) {
System.out.println(String.format("m.group(1) => %s\nm.group(2) => %s\n", m.group(1), m.group(2)));
} else {
System.out.println(input + " doesn't match regexp.");
}
}
}
Output
m.group(1) => $100
m.group(2) => null
m.group(1) => null
m.group(2) => Rs.150
foo doesn't match regexp.

Use an initial test to switch between expressions. How fast and/or smart this initial test is is up to you.
In this case you could do something like:
if (input.startsWith("Rs.") && a[1].matcher(input).matches()) {
return true;
}
and put it in front of your method that does the testing.
Simply putting the most common regular expressions in front of the array may help as well of course.
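As a rough sketch of the ordering idea, assuming whole-string matching is what is wanted: test the more specific Rs pattern first and fall back to the generic one only when it fails. The class name CurrencyClassifier, the method classify, and the labels it returns are hypothetical; the two patterns are the ones from the question, with £ and € written as Unicode escapes.

```java
import java.util.regex.Pattern;

class CurrencyClassifier {
    // a[1] from the question: the more specific pattern, checked first.
    private static final Pattern RS =
            Pattern.compile("Rs[.]?\\s*[\\d,]*[.]?\\d*\\d");
    // a[0] from the question (\u00A3 is the pound sign, \u20AC the euro sign).
    private static final Pattern MONEY =
            Pattern.compile("[$\u00A3\u20AC]?\\s*\\d*[\\.]?[pP]?\\d*\\d");

    static String classify(String input) {
        if (RS.matcher(input).matches()) {
            return "RS";
        }
        if (MONEY.matcher(input).matches()) {
            return "MONEY";
        }
        return "NONE";
    }

    public static void main(String[] args) {
        System.out.println(classify("Rs.150")); // RS
        System.out.println(classify("150"));    // MONEY
        System.out.println(classify("foo"));    // NONE
    }
}
```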

Description
Use a negative lookahead to match the a[1] Rs.150 format while at the same time preventing the a[0] 150 format.
Generic expression: (?! the a[0] regex goes here ) followed by the a[1] expression
Basic regex statement: (?![$£€]?\s*\d*[\.]?[pP]?\d*\d)Rs[.]?\s*[\d,]*[.]?\d*\d
Escaped for Java: (?![$£€]?\\s*\\d*[\\.]?[pP]?\\d*\\d)Rs[.]?\\s*[\\d,]*[.]?\\d*\\d

How can I get inside parentheses value in a string?

String str= "United Arab Emirates Dirham (AED)";
I need only the text AED.

Compiles and prints "AED". It even works for multiple pairs of parentheses:
import java.util.regex.*;
public class Main
{
public static void main (String[] args)
{
String example = "United Arab Emirates Dirham (AED)";
Matcher m = Pattern.compile("\\(([^)]+)\\)").matcher(example);
while(m.find()) {
System.out.println(m.group(1));
}
}
}
The regex means:
\\(: a literal (
(: start of the capturing group
[: start of a character class
^: negates the class
): together with the preceding ^, this means "any character except )"
+: one or more characters from the [] class
): end of the capturing group
\\): a literal closing parenthesis

I can't figure out how to split on the parentheses. Any help would be highly appreciated.
String.split takes a regular expression, so some characters have a special meaning and must be escaped.
I think what you are looking for is
str = str.split("[\\(\\)]")[1];
This splits on parentheses: the character class means "( or )". The double \\ escapes the parenthesis, which is a reserved character in regular expressions.
If you wanted to split on a . you would have to use split("\\.") to escape the dot as well.
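A small sketch of the same idea, with a guard for input that has no parentheses and with Pattern.quote as an alternative to escaping by hand. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class SplitEscaping {
    // Splits on either parenthesis; inside a character class they need no escaping.
    static String insideParens(String s) {
        String[] parts = s.split("[()]");
        // parts[1] exists only if an opening parenthesis was followed by some text.
        return parts.length > 1 ? parts[1] : null;
    }

    // Pattern.quote turns any literal text into a regex that matches it verbatim.
    static String[] splitOnLiteral(String s, String literal) {
        return s.split(Pattern.quote(literal));
    }

    public static void main(String[] args) {
        System.out.println(insideParens("United Arab Emirates Dirham (AED)")); // AED
        System.out.println(splitOnLiteral("a.b.c", ".").length);               // 3
    }
}
```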

This works...
String str = "United Arab Emirates Dirham (AED)";
String answer = str.substring(str.indexOf("(")+1,str.indexOf(")"));

I know this was asked over 4 years ago, but for anyone with the same/similar question that lands here (as I did), there is something even simpler than using regex:
String result = StringUtils.substringBetween(str, "(", ")");
In your example, result would be returned as "AED". I would recommend the StringUtils library for various kinds of (relatively simple) string manipulation; it handles things like null inputs automatically, which can be convenient.
Documentation for substringBetween():
https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html#substringBetween-java.lang.String-java.lang.String-java.lang.String-
There are two other versions of this function, depending on whether the opening and closing delimiters are the same, and whether the delimiter(s) occur(s) in the target string multiple times.

You could try:
String str = "United Arab Emirates Dirham (AED)";
int firstBracket = str.indexOf('(');
String contentOfBrackets = str.substring(firstBracket + 1, str.indexOf(')', firstBracket));
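Note that indexOf returns -1 when the character is missing, in which case substring throws. A defensive sketch of the same approach follows; the class name ParenContent, the method between, and the choice to return null when no pair is found are all assumptions.

```java
class ParenContent {
    // Returns the text between the first '(' and the next ')', or null if absent.
    static String between(String s) {
        int open = s.indexOf('(');
        if (open < 0) {
            return null;
        }
        int close = s.indexOf(')', open + 1);
        if (close < 0) {
            return null;
        }
        return s.substring(open + 1, close);
    }

    public static void main(String[] args) {
        System.out.println(between("United Arab Emirates Dirham (AED)")); // AED
        System.out.println(between("no parentheses here"));               // null
    }
}
```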

I can suggest two ways:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String[] args){
System.out.println(getParenthesesContent1("United Arab Emirates Dirham (AED)"));
System.out.println(getParenthesesContent2("United Arab Emirates Dirham (AED)"));
}
public static String getParenthesesContent1(String str){
return str.substring(str.indexOf('(')+1,str.indexOf(')'));
}
public static String getParenthesesContent2(String str){
final Pattern pattern = Pattern.compile("^.*\\((.*)\\).*$");
final Matcher matcher = pattern.matcher(str);
if (matcher.matches()){
return matcher.group(1);
} else {
return null;
}
}
}

Yes, you can try different techniques, such as using
string.indexOf("(");
to get the index, and then use
string.substring(from, to)

Using regular expression:
String text = "United Arab Emirates Dirham (AED)";
Pattern pattern = Pattern.compile(".* \\(([A-Z]+)\\)");
Matcher matcher = pattern.matcher(text);
if (matcher.matches()) {
System.out.println("found: " + matcher.group(1));
}

The answer from @Cœur is probably correct, but unfortunately it did not work for me.
I had text similar to this:
mrewrwegsg {text in between braces} njanfjaenfjie a {text in between braces}
It was way larger, but let's consider only this short part.
I used this code to get each text between braces and print it to the console:
Pattern p = Pattern.compile("\\{.*?\\}");
Matcher m = p.matcher(test);
while(m.find()) {
System.out.println(m.group().subSequence(1, m.group().length()-1));
}
It might be more complicated than @Cœur's answer, but maybe there is someone like me who will find my answer useful.

The method below can extract the String inside any bracket pair: '(', '{', '[', ')', '}', ']'.
public String getStringInsideChars(String original, char c){
this.original=original;
original = cleaning(original);
for(int i = 0; i < original.length(); i++){
if(original.charAt(i) == c){
String temp = original.substring(i + 1, original.length());
i = original.length();//end for
for(int k = temp.length() - 1; k >= 0; k--){
if(temp.charAt(k) == getReverseBracket(c)){
original = temp.substring(0, k);
k = -1; // end for
}
}
}
}
return original;
}
private char getReverseBracket(char c){
return c == '(' ? ')' :
c == '{' ? '}' :
c == '[' ? ']' :
c == ')' ? '(' :
c == '}' ? '{' :
c == ']' ? '[' : c;
}
public String cleaning(String original) {
this.original = original;
return original.replaceAll(String.valueOf('\t'), "").replaceAll(System.getProperty("line.separator"), "");
}
You can use it like below :
getStringInsideChars("{here is your string}", '{')
it will return "here is your string"

Why is this String not matching this regex?

I've got the following code:
public class testMatch {
public static void main(String[] args) {
String dummyMessage = "asdfasdfsadfsadfasdf 3 sdfasdfasdfasdf";
String expression = "3";
if (dummyMessage.matches(expression)){
System.out.println("MATCH!");
} else {
System.out.println("NO MATCH!");
}
}
}
I'd expect this to be a successful match as the dummyMessage contains the expression 3 but when I run this snippet the code prints NO MATCH!
I don't get what I'm doing wrong.
OKAY STOP ANSWERING! .*3.* works
This is an over simplification of an issue I have in some live code, the regex is configurable, and up until now matching the entire string has been okay, I've now had to match a part of the string and was wondering why it wasn't working.

It matches against the whole string, i.e. like ^3$ in most other regex implementations. So 3 does not match e.g. 333 or your string. But .*3.* would do the job.
However, if you just want to test if "3" is contained in your string you don't need a regex at all. Use dummyMessage.contains(expression) instead.

String#matches(regex) tells whether or not this string matches the given regular expression.
Your string dummyMessage doesn't match expression, because matches checks whether the whole of dummyMessage is 3. You probably want String.contains(charseq) instead.
String dummyMessage = "asdfasdfsadfsadfasdf 3 sdfasdfasdfasdf";
String expression = "3";
if (dummyMessage.contains(expression)){
System.out.println("MATCH!");
} else {
System.out.println("NO MATCH!");
}

You should match the whole string for matches to return true. Maybe try using .*3.*.

It will match for such regex: .*3.*

Use contains(expression)
String dummyMessage = "asdfasdfsadfsadfasdf 3 sdfasdfasdfasdf";
String expression = "3";
if (dummyMessage.contains(expression)) {
System.out.println("MATCH!");
} else {
System.out.println("NO MATCH!");
}

String#matches() tests whether the string matches the regular expression completely. To make it work, replace
expression = "3"
with
expression = ".*3.*"
To match a substring within a string, use the Matcher#find() method.
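Since the regex in the question is configurable, a small helper built on find() may be more convenient than rewriting every configured pattern as .*…​.* by hand. This is only a sketch; the class name PartialMatch and the method containsMatch are hypothetical.

```java
import java.util.regex.Pattern;

class PartialMatch {
    // True if the regex matches anywhere inside the input (find semantics).
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String dummyMessage = "asdfasdfsadfsadfasdf 3 sdfasdfasdfasdf";
        System.out.println(dummyMessage.matches("3"));           // false: whole string must match
        System.out.println(containsMatch("3", dummyMessage));    // true: a partial match is enough
        System.out.println(containsMatch("^3$", dummyMessage));  // false: anchors still apply
    }
}
```

If the same pattern is applied to many messages, compiling it once and reusing the Pattern avoids recompiling on every call.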

your regexp should rather be .*3.*

The matches() method of the String class checks whether the whole string matches.
I modified your code to:
public class testMatch
{
public static void main(String[] args)
{
String dummyMessage = "asdfasdfsadfsadfasdf 3 sdfasdfasdfasdf";
String expression = ".*3.*";
if (dummyMessage.matches(expression))
{
System.out.println("MATCH!");
}
else
{
System.out.println("NO MATCH!");
}
}
}
and it now works

You may be looking for matcher.find:
String message = "asdfasdfsadfsadfasdf 3 sdfasdf3asdfasdf";
String expression = "3";
// Really only need to do this once.
Pattern pattern = Pattern.compile(expression);
// Do this once for each message.
Matcher matcher = pattern.matcher(message);
if (matcher.find()) {
do {
System.out.println("MATCH! At " + matcher.start() + "-" + matcher.end());
} while ( matcher.find() );
} else {
System.out.println("NO MATCH!");
}

Change the original regex accordingly - it is currently incorrect and does not match:
String expression = "(.*)3(.*)";
Or just use String.contains() - I'd say that is a lot more appropriate for this situation.

Either you do it that way:
String dummyMessage = "asdfasdfsadfsadfasdf 3 sdfasdfasdfasdf";
String expression = "3";
Pattern p = Pattern.compile(".*3.*");
Matcher m = p.matcher(dummyMessage);
boolean b = m.matches();
if (b) {
System.out.println("MATCH!");
}
else {
System.out.println("NO MATCH!");
}
Or this way:
String dummyMessage = "asdfasdfsadfadfasdf 3 sdfasdfasdfasdf";
String expression = "3";
if (dummyMessage.contains(expression)) {
System.out.println("MATCH!");
}
else {
System.out.println("NO MATCH!");
}

Check if a String contains a special character

How do you check if a String contains a special character like:
[,],{,},{,),*,|,:,>,

Pattern p = Pattern.compile("[^a-z0-9 ]", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("I am a string");
boolean b = m.find();
if (b)
System.out.println("There is a special character in my string");

If you want to require LETTERS, SPECIAL CHARACTERS and NUMBERS in your password, with a length of at least 8 characters, then use this code; it works well:
public static boolean Password_Validation(String password)
{
if(password.length()>=8)
{
Pattern letter = Pattern.compile("[a-zA-Z]");
Pattern digit = Pattern.compile("[0-9]");
Pattern special = Pattern.compile ("[!@#$%&*()_+=|<>?{}\\[\\]~-]");
//Pattern eight = Pattern.compile (".{8}");
Matcher hasLetter = letter.matcher(password);
Matcher hasDigit = digit.matcher(password);
Matcher hasSpecial = special.matcher(password);
return hasLetter.find() && hasDigit.find() && hasSpecial.find();
}
else
return false;
}

You can use the following code to detect special character from string.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class DetectSpecial{
public int getSpecialCharacterCount(String s) {
if (s == null || s.trim().isEmpty()) {
System.out.println("Incorrect format of string");
return 0;
}
Pattern p = Pattern.compile("[^A-Za-z0-9]");
Matcher m = p.matcher(s);
// Count every character that is not an ASCII letter or digit
int count = 0;
while (m.find()) {
count++;
}
if (count > 0)
System.out.println("There is a special character in my string ");
else
System.out.println("There is no special char.");
return count;
}
}

If it matches the regex [a-zA-Z0-9 ]*, then there are no special characters in it.

What exactly do you call a "special character"? If you mean something like "anything that is not alphanumeric", you can use the org.apache.commons.lang.StringUtils class (methods isAlpha/isNumeric/isWhitespace/isAsciiPrintable).
If it is not so trivial, you can use a regex that defines the exact character list you accept and match the string against it.

This is tested in android 7.0 up to android 10.0 and it works
Use this code to check if string contains special character and numbers:
name = firstname.getText().toString(); //name is the variable that holds the string value
Pattern special= Pattern.compile("[^a-z0-9 ]", Pattern.CASE_INSENSITIVE);
Pattern number = Pattern.compile("[0-9]", Pattern.CASE_INSENSITIVE);
Matcher matcher = special.matcher(name);
Matcher matcherNumber = number.matcher(name);
boolean containsSymbols = matcher.find();
boolean containsNumber = matcherNumber.find();
if(containsSymbols){
//string contains special symbol/character
}
else if(containsNumber){
//string contains numbers
}
else{
//string doesn't contain special characters or numbers
}

All depends on exactly what you mean by "special". In a regex you can specify
\W to mean a non-word character (anything other than a letter, digit, or underscore)
\p{Punct} to mean punctuation characters
I suspect that the latter is what you mean. But if not use a [] list to specify exactly what you want.
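To make the difference between the two classes concrete, here is a short sketch. The class and method names are hypothetical; \p{Punct} and \W are the standard java.util.regex classes mentioned above.

```java
import java.util.regex.Pattern;

class SpecialCharClasses {
    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}");
    private static final Pattern NON_WORD = Pattern.compile("\\W");

    // True if the string contains at least one ASCII punctuation character.
    static boolean hasPunctuation(String s) {
        return PUNCT.matcher(s).find();
    }

    // True if the string contains anything other than letters, digits, or '_'.
    static boolean hasNonWord(String s) {
        return NON_WORD.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasPunctuation("a_b")); // true: '_' is punctuation
        System.out.println(hasNonWord("a_b"));     // false: '_' is a word character
        System.out.println(hasPunctuation("a b")); // false: a space is not punctuation
        System.out.println(hasNonWord("a b"));     // true: a space is a non-word character
    }
}
```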

Have a look at the java.lang.Character class. It has some test methods and you may find one that fits your needs.
Examples: Character.isSpaceChar(c) or !Character.isJavaLetter(c) (isJavaLetter is deprecated; Character.isJavaIdentifierStart(c) is its replacement)

This worked for me:
String s = "string";
if (Pattern.matches("[a-zA-Z]+", s)) {
System.out.println("clear");
} else {
System.out.println("buzz");
}

First you have to exhaustively identify the special characters that you want to check.
Then you can write a regular expression and use
public boolean matches(String regex)

//without using regular expression........
String specialCharacters=" !#$%&'()*+,-./:;<=>?@[]^_`{|}~0123456789";
String name="3_ saroj#";
String str2[]=name.split("");
for (int i=0;i<str2.length;i++)
{
if (specialCharacters.contains(str2[i]))
{
System.out.println("true");
//break;
}
else
System.out.println("false");
}

Pattern p = Pattern.compile("\\p{Punct}");
Matcher m = p.matcher("Afsff%esfsf098");
// find() looks for a punctuation character anywhere in the input
boolean b = m.find();
if (b == true)
System.out.println("There is a sp. character in my string");
else
System.out.println("There is no sp. char.");

//this is an updated version of the code that I posted
/*
The isValidName method checks that the name passed as argument does not contain:
1. a null value or a space
2. any special character
3. digits (0-9)
Explanation:
str2 is a String array that stores the name, split into single-character strings
count is the number of special characters found
The method returns true only if all the conditions are satisfied
*/
public boolean isValidName(String name)
{
// Check for null first: calling name.split on a null reference would throw
if (name==null)
{
return false;
}
String specialCharacters=" !#$%&'()*+,-./:;<=>?@[]^_`{|}~0123456789";
String str2[]=name.split("");
int count=0;
for (int i=0;i<str2.length;i++)
{
if (specialCharacters.contains(str2[i]))
{
count++;
}
}
return count==0;
}

Visit each character in the string to see if that character is in a blacklist of special characters; this is O(n*m).
The pseudo-code is:
for each char in string:
if char in blacklist:
...
The complexity can be slightly improved by sorting the blacklist so that you can early-exit each check. However, the string find function is probably native code, so this optimisation - which would be in Java byte-code - could well be slower.
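The pseudo-code above could look like this in Java. Note that this sketch swaps in a HashSet for the blacklist, which is a different technique from the sorted-list idea: each lookup is then expected constant time, so the scan is O(n) overall. The blacklist contents are taken from the characters listed in the question, and the names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class BlacklistCheck {
    private static final Set<Character> BLACKLIST = new HashSet<>();
    static {
        // Characters from the question; adjust to whatever counts as "special".
        for (char c : "[]{}()*|:>".toCharArray()) {
            BLACKLIST.add(c);
        }
    }

    // Visits each character once and stops at the first blacklisted one.
    static boolean containsSpecial(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (BLACKLIST.contains(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsSpecial("plain text")); // false
        System.out.println(containsSpecial("a|b"));        // true
    }
}
```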

The line String str2[]=name.split(""); yields an extra element in the array.
Let me explain with an example:
"Aditya".split("") returns [, A, d, i, t, y, a], with an empty string at the front. (This was the behavior before Java 8; since Java 8 a zero-width match at the start of the input no longer produces that leading empty string.)
So the code from saroj routray does not work as expected on those older versions.
I have modified it; the code below works as expected:
public static boolean isValidName(String inputString) {
// Check for null first: calling inputString.length() on null would throw
if (inputString == null) {
return false;
}
String specialCharacters = " !#$%&'()*+,-./:;<=>?@[]^_`{|}~0123456789";
String[] strlCharactersArray = new String[inputString.length()];
for (int i = 0; i < inputString.length(); i++) {
strlCharactersArray[i] = Character
.toString(inputString.charAt(i));
}
//now strlCharactersArray=[A, d, i, t, y, a]
int count = 0;
for (int i = 0; i < strlCharactersArray.length; i++) {
if (specialCharacters.contains( strlCharactersArray[i])) {
count++;
}
}
return count == 0;
}

Convert the string into char array with all the letters in lower case:
char c[] = str.toLowerCase().toCharArray();
Then you can use Character.isLetterOrDigit(c[index]) to find out which index has special characters.

Use the static method matches(regex, input) of the java.util.regex.Pattern class.
regex: letters in lower and upper case and the digits 0-9
input: the String you want to check for special characters
It returns true if the string contains only letters and digits, and false otherwise.
Example:
String isin = "12GBIU34RT12";
if (Pattern.matches("[a-zA-Z0-9]+", isin)) {
System.out.println("Valid isin");
} else {
System.out.println("Invalid isin");
}

Is there a way to split strings with String.split() and include the delimiters? [duplicate]

I have a multiline string which is delimited by a set of different delimiters:
(Text1)(DelimiterA)(Text2)(DelimiterC)(Text3)(DelimiterB)(Text4)
I can split this string into its parts, using String.split, but it seems that I can't get the actual string, which matched the delimiter regex.
In other words, this is what I get:
Text1
Text2
Text3
Text4
This is what I want
Text1
DelimiterA
Text2
DelimiterC
Text3
DelimiterB
Text4
Is there any JDK way to split the string using a delimiter regex but also keep the delimiters?

You can use lookahead and lookbehind, which are features of regular expressions.
System.out.println(Arrays.toString("a;b;c;d".split("(?<=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("(?=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("((?<=;)|(?=;))")));
And you will get:
[a;, b;, c;, d]
[a, ;b, ;c, ;d]
[a, ;, b, ;, c, ;, d]
The last one is what you want.
((?<=;)|(?=;)) matches the empty position just before or just after a ;.
EDIT: Fabian Steeg's comments on readability are valid. Readability is always a problem with regular expressions. One thing I do to make regular expressions more readable is to create a variable, the name of which represents what the regular expression does. You can even put placeholders (e.g. %1$s) and use Java's String.format to replace the placeholders with the actual string you need to use; for example:
static public final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";
public void someMethod() {
final String[] aEach = "a;b;c;d".split(String.format(WITH_DELIMITER, ";"));
...
}
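A possible extension of the placeholder idea, as a sketch: if the delimiter is meant literally, quoting it with Pattern.quote before formatting keeps characters such as . or | from being read as regex syntax. The class name SplitKeepDelimiter and the method splitKeeping are hypothetical; WITH_DELIMITER is the constant from the answer above.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitKeepDelimiter {
    static final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";

    // Splits around a literal delimiter and keeps the delimiter as its own element.
    static String[] splitKeeping(String text, String literalDelimiter) {
        String quoted = Pattern.quote(literalDelimiter);
        return text.split(String.format(WITH_DELIMITER, quoted));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKeeping("a;b;c;d", ";"))); // [a, ;, b, ;, c, ;, d]
        System.out.println(Arrays.toString(splitKeeping("a.b", ".")));     // [a, ., b]
    }
}
```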

You want to use lookarounds, and split on zero-width matches. Here are some examples:
public class SplitNDump {
static void dump(String[] arr) {
for (String s : arr) {
System.out.format("[%s]", s);
}
System.out.println();
}
public static void main(String[] args) {
dump("1,234,567,890".split(","));
// "[1][234][567][890]"
dump("1,234,567,890".split("(?=,)"));
// "[1][,234][,567][,890]"
dump("1,234,567,890".split("(?<=,)"));
// "[1,][234,][567,][890]"
dump("1,234,567,890".split("(?<=,)|(?=,)"));
// "[1][,][234][,][567][,][890]"
dump(":a:bb::c:".split("(?=:)|(?<=:)"));
// "[][:][a][:][bb][:][:][c][:]"
dump(":a:bb::c:".split("(?=(?!^):)|(?<=:)"));
// "[:][a][:][bb][:][:][c][:]"
dump(":::a::::b b::c:".split("(?=(?!^):)(?<!:)|(?!:)(?<=:)"));
// "[:::][a][::::][b b][::][c][:]"
dump("a,bb:::c d..e".split("(?!^)\\b"));
// "[a][,][bb][:::][c][ ][d][..][e]"
dump("ArrayIndexOutOfBoundsException".split("(?<=[a-z])(?=[A-Z])"));
// "[Array][Index][Out][Of][Bounds][Exception]"
dump("1234567890".split("(?<=\\G.{4})"));
// "[1234][5678][90]"
// Split at the end of each run of the same letter
dump("Boooyaaaah! Yippieeee!!".split("(?<=(?=(.)\\1(?!\\1))..)"));
// "[Booo][yaaaa][h! Yipp][ieeee][!!]"
}
}
And yes, that is triply-nested assertion there in the last pattern.
Related questions
Java split is eating my characters.
Can you use zero-width matching regex in String split?
How do I convert CamelCase into human-readable names in Java?
Backreferences in lookbehind
See also
regular-expressions.info/Lookarounds

A very naive solution that doesn't involve regex would be to perform a string replace on your delimiter, along these lines (assuming a comma as the delimiter):
fullString.replace(",", "~,~")
where you can replace the tilde (~) with an appropriate unique delimiter.
If you then split on your new delimiter, I believe you will get the desired result.
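A minimal sketch of this replace-then-split approach, assuming the marker string never occurs in the input. The class name ReplaceThenSplit, the method splitKeeping, and its parameters are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class ReplaceThenSplit {
    // Surrounds every delimiter with the marker, then splits on the marker.
    // Assumption: the marker does not appear anywhere in the original text.
    static String[] splitKeeping(String text, String delimiter, String marker) {
        String marked = text.replace(delimiter, marker + delimiter + marker);
        return marked.split(Pattern.quote(marker));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKeeping("a,b,c", ",", "~"))); // [a, ,, b, ,, c]
    }
}
```

Two adjacent delimiters produce an empty string between them, so callers that do not want empty elements need to filter them out.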

import java.util.regex.*;
import java.util.LinkedList;
public class Splitter {
private static final Pattern DEFAULT_PATTERN = Pattern.compile("\\s+");
private Pattern pattern;
private boolean keep_delimiters;
public Splitter(Pattern pattern, boolean keep_delimiters) {
this.pattern = pattern;
this.keep_delimiters = keep_delimiters;
}
public Splitter(String pattern, boolean keep_delimiters) {
this(Pattern.compile(pattern==null?"":pattern), keep_delimiters);
}
public Splitter(Pattern pattern) { this(pattern, true); }
public Splitter(String pattern) { this(pattern, true); }
public Splitter(boolean keep_delimiters) { this(DEFAULT_PATTERN, keep_delimiters); }
public Splitter() { this(DEFAULT_PATTERN); }
public String[] split(String text) {
if (text == null) {
text = "";
}
int last_match = 0;
LinkedList<String> splitted = new LinkedList<String>();
Matcher m = this.pattern.matcher(text);
while (m.find()) {
splitted.add(text.substring(last_match,m.start()));
if (this.keep_delimiters) {
splitted.add(m.group());
}
last_match = m.end();
}
splitted.add(text.substring(last_match));
return splitted.toArray(new String[splitted.size()]);
}
public static void main(String[] argv) {
if (argv.length != 2) {
System.err.println("Syntax: java Splitter <pattern> <text>");
return;
}
Pattern pattern = null;
try {
pattern = Pattern.compile(argv[0]);
}
catch (PatternSyntaxException e) {
System.err.println(e);
return;
}
Splitter splitter = new Splitter(pattern);
String text = argv[1];
int counter = 1;
for (String part : splitter.split(text)) {
System.out.printf("Part %d: \"%s\"\n", counter++, part);
}
}
}
/*
Example:
> java Splitter "\W+" "Hello World!"
Part 1: "Hello"
Part 2: " "
Part 3: "World"
Part 4: "!"
Part 5: ""
*/
I don't really like the other way, where you get an empty element in front and back. A delimiter is usually not at the beginning or at the end of the string, thus you most often end up wasting two good array slots.
Edit: Fixed limit cases. Commented source with test cases can be found here: http://snippets.dzone.com/posts/show/6453

Pass true as the third argument. It will return the delimiters as well.
new StringTokenizer(str, delimiters, true);
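For illustration, a short sketch of that constructor in use. Keep in mind that StringTokenizer treats the delimiter argument as a set of single characters, not as a regex, so each delimiter comes back as a one-character token. The class name TokenizerWithDelims and the method tokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerWithDelims {
    // Returns text pieces and delimiter characters in their original order.
    static List<String> tokens(String text, String delimiterChars) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiterChars, true);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a;b,c", ";,")); // [a, ;, b, ,, c]
    }
}
```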

I know this is a very-very old question and answer has also been accepted. But still I would like to submit a very simple answer to original question. Consider this code:
String str = "Hello-World:How\nAre You&doing";
String[] inputs = str.split("(?!^)\\b");
for (int i=0; i<inputs.length; i++) {
System.out.println("a[" + i + "] = \"" + inputs[i] + '"');
}
OUTPUT:
a[0] = "Hello"
a[1] = "-"
a[2] = "World"
a[3] = ":"
a[4] = "How"
a[5] = "
"
a[6] = "Are"
a[7] = " "
a[8] = "You"
a[9] = "&"
a[10] = "doing"
I am just using word boundary \b to delimit the words except when it is start of text.

I got here late, but returning to the original question, why not just use lookarounds?
Pattern p = Pattern.compile("(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)");
System.out.println(Arrays.toString(p.split("'ab','cd','eg'")));
System.out.println(Arrays.toString(p.split("boo:and:foo")));
output:
[', ab, ',', cd, ',', eg, ']
[boo, :, and, :, foo]
EDIT: What you see above is what appears on the command line when I run that code, but I now see that it's a bit confusing. It's difficult to keep track of which commas are part of the result and which were added by Arrays.toString(). SO's syntax highlighting isn't helping either. In hopes of getting the highlighting to work with me instead of against me, here's how those arrays would look if I were declaring them in source code:
{ "'", "ab", "','", "cd", "','", "eg", "'" }
{ "boo", ":", "and", ":", "foo" }
I hope that's easier to read. Thanks for the heads-up, @finnw.

I had a look at the above answers and honestly I find none of them satisfactory. What you want to do is essentially mimic the Perl split functionality. Why Java doesn't allow this and have a join() method somewhere is beyond me but I digress. You don't even need a class for this really. It's just a function. Run this sample program:
Some of the earlier answers have excessive null-checking, which I recently wrote a response to a question here:
https://stackoverflow.com/users/18393/cletus
Anyway, the code:
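import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Split {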
public static List<String> split(String s, String pattern) {
assert s != null;
assert pattern != null;
return split(s, Pattern.compile(pattern));
}
public static List<String> split(String s, Pattern pattern) {
assert s != null;
assert pattern != null;
Matcher m = pattern.matcher(s);
List<String> ret = new ArrayList<String>();
int start = 0;
while (m.find()) {
ret.add(s.substring(start, m.start()));
ret.add(m.group());
start = m.end();
}
ret.add(start >= s.length() ? "" : s.substring(start));
return ret;
}
private static void testSplit(String s, String pattern) {
System.out.printf("Splitting '%s' with pattern '%s'%n", s, pattern);
List<String> tokens = split(s, pattern);
System.out.printf("Found %d matches%n", tokens.size());
int i = 0;
for (String token : tokens) {
System.out.printf(" %d/%d: '%s'%n", ++i, tokens.size(), token);
}
System.out.println();
}
public static void main(String args[]) {
testSplit("abcdefghij", "z"); // "abcdefghij"
testSplit("abcdefghij", "f"); // "abcde", "f", "ghij"
testSplit("abcdefghij", "j"); // "abcdefghi", "j", ""
testSplit("abcdefghij", "a"); // "", "a", "bcdefghij"
testSplit("abcdefghij", "[bdfh]"); // "a", "b", "c", "d", "e", "f", "g", "h", "ij"
}
}

I like the idea of StringTokenizer because it is an Enumeration.
But it is also obsolete, and replaced by String.split, which returns a boring String[] (and does not include the delimiters).
So I implemented a StringTokenizerEx which is an Iterable, and which takes a true regexp to split a string.
A true regexp means it is not a 'Character sequence' repeated to form the delimiter:
'o' will only match 'o', and split 'ooo' into three delimiters, with two empty strings in between:
[o], '', [o], '', [o]
But the regexp o+ will return the expected result when splitting "aooob"
[], 'a', [ooo], 'b', []
To use this StringTokenizerEx:
final StringTokenizerEx aStringTokenizerEx = new StringTokenizerEx("boo:and:foo", "o+");
final String firstDelimiter = aStringTokenizerEx.getDelimiter();
for(String aString: aStringTokenizerEx )
{
// uses the split String detected and memorized in 'aString'
final String nextDelimiter = aStringTokenizerEx.getDelimiter();
}
The code of this class is available at DZone Snippets.
As usual for a code-challenge response (one self-contained class with test cases included), copy-paste it (in a 'src/test' directory) and run it. Its main() method illustrates the different usages.
Note: (late 2009 edit)
The article Final Thoughts: Java Puzzler: Splitting Hairs does a good job of explaining the bizarre behavior in String.split().
Josh Bloch even commented in response to that article:
Yes, this is a pain. FWIW, it was done for a very good reason: compatibility with Perl.
The guy who did it is Mike "madbot" McCloskey, who now works with us at Google. Mike made sure that Java's regular expressions passed virtually every one of the 30K Perl regular expression tests (and ran faster).
The Google common-library Guava contains also a Splitter which is:
simpler to use
maintained by Google (and not by you)
So it may be worth checking out. From their initial rough documentation (pdf):
JDK has this:
String[] pieces = "foo.bar".split("\\.");
It's fine to use this if you want exactly what it does:
- regular expression
- result as an array
- its way of handling empty pieces
Mini-puzzler: ",a,,b,".split(",") returns...
(a) "", "a", "", "b", ""
(b) null, "a", null, "b", null
(c) "a", null, "b"
(d) "a", "b"
(e) None of the above
Answer: (e) None of the above.
",a,,b,".split(",")
returns
"", "a", "", "b"
Only trailing empties are skipped! (Who knows the workaround to prevent the skipping? It's a fun one...)
In any case, our Splitter is simply more flexible: The default behavior is simplistic:
Splitter.on(',').split(" foo, ,bar, quux,")
--> [" foo", " ", "bar", " quux", ""]
If you want extra features, ask for them!
Splitter.on(',')
.trimResults()
.omitEmptyStrings()
.split(" foo, ,bar, quux,")
--> ["foo", "bar", "quux"]
Order of config methods doesn't matter -- during splitting, trimming happens before checking for empties.
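The mini-puzzler quoted above can be reproduced with the JDK alone. The workaround it hints at is the two-argument overload String.split(regex, limit): a negative limit keeps trailing empty strings. A small sketch, with hypothetical class and method names:

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Default behavior: trailing empty strings are removed.
    static String[] dropTrailing(String s) {
        return s.split(",");
    }

    // A negative limit keeps every piece, including trailing empty strings.
    static String[] keepTrailing(String s) {
        return s.split(",", -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(dropTrailing(",a,,b,"))); // [, a, , b]
        System.out.println(Arrays.toString(keepTrailing(",a,,b,"))); // [, a, , b, ]
    }
}
```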

Here is a simple, clean implementation which is consistent with Pattern#split and works with variable-length patterns, which lookbehind cannot support, and it is easier to use. It is similar to the solution provided by @cletus.
public static String[] split(CharSequence input, String pattern) {
return split(input, Pattern.compile(pattern));
}
public static String[] split(CharSequence input, Pattern pattern) {
Matcher matcher = pattern.matcher(input);
int start = 0;
List<String> result = new ArrayList<>();
while (matcher.find()) {
result.add(input.subSequence(start, matcher.start()).toString());
result.add(matcher.group());
start = matcher.end();
}
if (start != input.length()) result.add(input.subSequence(start, input.length()).toString());
return result.toArray(new String[0]);
}
I don't do null checks here; Pattern#split doesn't, so why should I? I don't like the if at the end, but it is required for consistency with Pattern#split. Otherwise I would unconditionally append, resulting in an empty string as the last element of the result if the input string ends with the pattern.
I convert to String[] for consistency with Pattern#split. I use new String[0] rather than new String[result.size()]; see here for why.
Here are my tests:
@Test
public void splitsVariableLengthPattern() {
String[] result = Split.split("/foo/$bar/bas", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "/foo/", "$bar", "/bas" }, result);
}
@Test
public void splitsEndingWithPattern() {
String[] result = Split.split("/foo/$bar", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "/foo/", "$bar" }, result);
}
@Test
public void splitsStartingWithPattern() {
String[] result = Split.split("$foo/bar", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "", "$foo", "/bar" }, result);
}
@Test
public void splitsNoMatchesPattern() {
String[] result = Split.split("/foo/bar", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "/foo/bar" }, result);
}
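For comparison, here is a sketch of the lookaround-based split mentioned at the top of this answer. It assumes a single-character delimiter, which is the case where lookbehind has no trouble; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper showing the lookaround approach for a one-character delimiter.
final class LookaroundSplit {

    static String[] splitKeepingDelimiter(String input, char delimiter) {
        // Quote the delimiter so characters such as '$' or '.' are matched literally.
        String d = Pattern.quote(String.valueOf(delimiter));
        // Split at the zero-width positions just after and just before each delimiter.
        return input.split("(?<=" + d + ")|(?=" + d + ")");
    }

    public static void main(String[] args) {
        // prints [a, ;, b, ;, c]
        System.out.println(Arrays.toString(splitKeepingDelimiter("a;b;c", ';')));
    }
}
```

A delimiter such as \$\w+ has no fixed length, which is why the Matcher-based split above is the more general choice.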

I will also post my working versions (the first is really similar to Markus's).
public static String[] splitIncludeDelimeter(String regex, String text){
List<String> list = new LinkedList<>();
Matcher matcher = Pattern.compile(regex).matcher(text);
int now, old = 0;
while(matcher.find()){
now = matcher.end();
list.add(text.substring(old, now));
old = now;
}
if(list.size() == 0)
return new String[]{text};
//adding rest of a text as last element
String finalElement = text.substring(old);
list.add(finalElement);
return list.toArray(new String[list.size()]);
}
And here is a second solution, which in my tests is around 50% faster than the first one:
public static String[] splitIncludeDelimeter2(String regex, String text){
List<String> list = new LinkedList<>();
Matcher matcher = Pattern.compile(regex).matcher(text);
StringBuffer stringBuffer = new StringBuffer();
while(matcher.find()){
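// quoteReplacement keeps '$' and '\' in the matched text from being read as group references
matcher.appendReplacement(stringBuffer, Matcher.quoteReplacement(matcher.group()));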
list.add(stringBuffer.toString());
stringBuffer.setLength(0); //clear buffer
}
matcher.appendTail(stringBuffer); // append the rest of the string
list.add(stringBuffer.toString());
return list.toArray(new String[list.size()]);
}

Another candidate solution using a regex. Retains token order, correctly matches multiple tokens of the same type in a row. The downside is that the regex is kind of nasty.
package javaapplication2;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class JavaApplication2 {
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
String num = "58.5+variable-+98*78/96+a/78.7-3443*12-3";
// Terrifying regex:
// (a)|(b)|(c) match a or b or c
// where
// (a) is one or more digits optionally followed by a decimal point
// followed by one or more digits: (\d+(\.\d+)?)
// (b) is one of the set + * / - occurring once: ([+*/-])
// (c) is a sequence of one or more lowercase latin letter: ([a-z]+)
Pattern tokenPattern = Pattern.compile("(\\d+(\\.\\d+)?)|([+*/-])|([a-z]+)");
Matcher tokenMatcher = tokenPattern.matcher(num);
List<String> tokens = new ArrayList<>();
while (!tokenMatcher.hitEnd()) {
if (tokenMatcher.find()) {
tokens.add(tokenMatcher.group());
} else {
// report error
break;
}
}
System.out.println(tokens);
}
}
Sample output:
[58.5, +, variable, -, +, 98, *, 78, /, 96, +, a, /, 78.7, -, 3443, *, 12, -, 3]

I don't know of an existing function in the Java API that does this (which is not to say it doesn't exist), but here's my own implementation (one or more delimiters will be returned as a single token; if you want each delimiter to be returned as a separate token, it will need a bit of adaptation):
static String[] splitWithDelimiters(String s) {
if (s == null || s.length() == 0) {
return new String[0];
}
LinkedList<String> result = new LinkedList<String>();
StringBuilder sb = null;
boolean wasLetterOrDigit = !Character.isLetterOrDigit(s.charAt(0));
for (char c : s.toCharArray()) {
if (Character.isLetterOrDigit(c) ^ wasLetterOrDigit) {
if (sb != null) {
result.add(sb.toString());
}
sb = new StringBuilder();
wasLetterOrDigit = !wasLetterOrDigit;
}
sb.append(c);
}
result.add(sb.toString());
return result.toArray(new String[0]);
}

I suggest using Pattern and Matcher, which will almost certainly achieve what you want. Your regular expression will need to be somewhat more complicated than what you are using in String.split.

I don't think it is possible with String#split, but you can use a StringTokenizer. It won't let you define the delimiter as a regex, though, only as a set of single characters:
new StringTokenizer("Hello, world. Hi!", ",.!", true); // true for returnDelims
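A minimal sketch of how that tokenizer could be drained into a list. The wrapper class and method are made up for illustration; the StringTokenizer constructor and methods are the standard JDK ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical wrapper around java.util.StringTokenizer.
final class TokenizerDemo {

    static List<String> tokenizeWithDelims(String input, String delimiterChars) {
        // The third argument (returnDelims = true) makes each delimiter
        // character come back as its own one-character token.
        StringTokenizer tokenizer = new StringTokenizer(input, delimiterChars, true);
        List<String> tokens = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Hello, ,,  world, .,  Hi, !]
        System.out.println(tokenizeWithDelims("Hello, world. Hi!", ",.!"));
    }
}
```

Note that the spaces stay attached to the following word, because a space is not in the delimiter set.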

If you can afford it, use Java's replace(CharSequence target, CharSequence replacement) method to insert another delimiter, and then split on that one.
Example:
I want to split the string "boo:and:foo" and keep each ':' at the start of the piece to its right.
String str = "boo:and:foo";
str = str.replace(":","newdelimiter:");
String[] tokens = str.split("newdelimiter");
Important note: This only works if you have no further "newdelimiter" in your String! Thus, it is not a general solution.
But if you know a CharSequence of which you can be sure that it will never appear in the String, this is a very simple solution.
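The caveat above can be turned into an explicit check. The following is only a sketch under that assumption: the class name, method name, and the choice to throw an exception are all illustrative, not part of the original answer.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper that guards the "temporary delimiter" trick.
final class MarkerSplit {

    static String[] splitKeepingDelimiterOnRight(String input, String delimiter, String marker) {
        // The trick is only safe when the marker never occurs in the input.
        if (marker.isEmpty() || input.contains(marker)) {
            throw new IllegalArgumentException("marker must be non-empty and absent from the input");
        }
        // Put the marker in front of every delimiter, then split on the marker.
        String marked = input.replace(delimiter, marker + delimiter);
        return marked.split(Pattern.quote(marker));
    }

    public static void main(String[] args) {
        // prints [boo, :and, :foo]
        System.out.println(Arrays.toString(
                splitKeepingDelimiterOnRight("boo:and:foo", ":", "newdelimiter")));
    }
}
```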

Quick answer: split on a zero-width boundary such as \b. I will experiment to see if it works in Java (I have used it in PHP and JS).
It is possible and it kind of works, but it might split too much. It really depends on the string you want to split and the result you need. Give us more details and we can help you better.
Another way is to write your own split, capturing the delimiter (supposing it is variable) and adding it to the result afterward.
My quick test:
String str = "'ab','cd','eg'";
String[] stra = str.split("\\b");
for (String s : stra) System.out.print(s + "|");
System.out.println();
Result:
'|ab|','|cd|','|eg|'|
A bit too much... :-)

I tweaked Pattern.split() to include the matched pattern in the list.
Added:
// add match to the list
matchList.add(input.subSequence(start, end).toString());
Full source
public static String[] inclusiveSplit(String input, String re, int limit) {
int index = 0;
boolean matchLimited = limit > 0;
ArrayList<String> matchList = new ArrayList<String>();
Pattern pattern = Pattern.compile(re);
Matcher m = pattern.matcher(input);
// Add segments before each match found
while (m.find()) {
int end = m.end();
if (!matchLimited || matchList.size() < limit - 1) {
int start = m.start();
String match = input.subSequence(index, start).toString();
matchList.add(match);
// add match to the list
matchList.add(input.subSequence(start, end).toString());
index = end;
} else if (matchList.size() == limit - 1) { // last one
String match = input.subSequence(index, input.length())
.toString();
matchList.add(match);
index = end;
}
}
// If no match was found, return this
if (index == 0)
return new String[] { input.toString() };
// Add remaining segment
if (!matchLimited || matchList.size() < limit)
matchList.add(input.subSequence(index, input.length()).toString());
// Construct result
int resultSize = matchList.size();
if (limit == 0)
while (resultSize > 0 && matchList.get(resultSize - 1).equals(""))
resultSize--;
String[] result = new String[resultSize];
return matchList.subList(0, resultSize).toArray(result);
}

Here's a groovy version based on some of the code above, in case it helps. It's short, anyway. Conditionally includes the head and tail (if they are not empty). The last part is a demo/test case.
List splitWithTokens(str, pat) {
def tokens=[]
def lastMatch=0
def m = str=~pat
while (m.find()) {
if (m.start() > 0) tokens << str[lastMatch..<m.start()]
tokens << m.group()
lastMatch=m.end()
}
if (lastMatch < str.length()) tokens << str[lastMatch..<str.length()]
tokens
}
[['<html><head><title>this is the title</title></head>',/<[^>]+>/],
['before<html><head><title>this is the title</title></head>after',/<[^>]+>/]
].each {
println splitWithTokens(*it)
}

An extremely naive and inefficient solution, which works nevertheless. Use split twice on the string and then interleave the two arrays:
String temp[]=str.split("\\W");
String temp2[]=str.split("\\w||\\s");
int i=0;
for(String string:temp)
System.out.println(string);
String temp3[]=new String[temp.length-1];
for(String string:temp2)
{
System.out.println(string);
if((string.equals("")!=true)&&(string.equals("\\s")!=true))
{
temp3[i]=string;
i++;
}
// System.out.println(temp.length);
// System.out.println(temp2.length);
}
System.out.println(temp3.length);
String[] temp4=new String[temp.length+temp3.length];
int j=0;
for(i=0;i<temp.length;i++)
{
temp4[j]=temp[i];
j=j+2;
}
j=1;
for(i=0;i<temp3.length;i++)
{
temp4[j]=temp3[i];
j+=2;
}
for(String s:temp4)
System.out.println(s);

String expression = "((A+B)*C-D)*E";
expression = expression.replaceAll("\\+", "~+~");
expression = expression.replaceAll("\\*", "~*~");
expression = expression.replaceAll("-", "~-~");
expression = expression.replaceAll("/", "~/~");
expression = expression.replaceAll("\\(", "~(~"); //also you can use [(] instead of \\(
expression = expression.replaceAll("\\)", "~)~"); //also you can use [)] instead of \\)
expression = expression.replaceAll("~~", "~");
if(expression.startsWith("~")) {
expression = expression.substring(1);
}
String[] expressionArray = expression.split("~");
System.out.println(Arrays.toString(expressionArray));

One of the subtleties in this question involves the "leading delimiter" question: if you are going to have a combined array of tokens and delimiters you have to know whether it starts with a token or a delimiter. You could of course just assume that a leading delim should be discarded but this seems an unjustified assumption. You might also want to know whether you have a trailing delim or not. This sets two boolean flags accordingly.
Written in Groovy but a Java version should be fairly obvious:
String tokenRegex = /[\p{L}\p{N}]+/ // a String in Groovy, Unicode alphanumeric
def finder = phraseForTokenising =~ tokenRegex
// NB in Groovy the variable 'finder' is then of class java.util.regex.Matcher
def finderIt = finder.iterator() // extra method added to Matcher by Groovy magic
int start = 0
boolean leadingDelim, trailingDelim
def combinedTokensAndDelims = [] // create an array in Groovy
while( finderIt.hasNext() )
{
def token = finderIt.next()
int finderStart = finder.start()
String delim = phraseForTokenising[ start .. finderStart - 1 ]
// Groovy: above gets slice of String/array
if( start == 0 ) leadingDelim = finderStart != 0
if( start > 0 || leadingDelim ) combinedTokensAndDelims << delim
combinedTokensAndDelims << token // add element to end of array
start = finder.end()
}
// start == 0 indicates no tokens found
if( start > 0 ) {
// finish by seeing whether there is a trailing delim
trailingDelim = start < phraseForTokenising.length()
if( trailingDelim ) combinedTokensAndDelims << phraseForTokenising[ start .. -1 ]
println( "leading delim? $leadingDelim, trailing delim? $trailingDelim, combined array:\n $combinedTokensAndDelims" )
}
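A possible Java rendering of the Groovy code above, as a sketch only. The class name and its fields are invented for illustration; it keeps the same token regex and the same convention that an input with no tokens yields an empty list and both flags false.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical result holder: the combined tokens and delimiters plus the two flags.
final class TokensAndDelims {
    final List<String> parts = new ArrayList<>();
    boolean leadingDelim;
    boolean trailingDelim;

    // Same token definition as the Groovy version: runs of Unicode letters and digits.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    static TokensAndDelims split(String phrase) {
        TokensAndDelims out = new TokensAndDelims();
        Matcher finder = TOKEN.matcher(phrase);
        int start = 0; // end of the previous token, 0 until the first token is found
        while (finder.find()) {
            if (finder.start() > start) {
                // Text before the very first token is a leading delimiter.
                if (start == 0) out.leadingDelim = true;
                out.parts.add(phrase.substring(start, finder.start()));
            }
            out.parts.add(finder.group());
            start = finder.end();
        }
        // start == 0 means no tokens were found; otherwise look for a trailing delimiter.
        if (start > 0 && start < phrase.length()) {
            out.trailingDelim = true;
            out.parts.add(phrase.substring(start));
        }
        return out;
    }

    public static void main(String[] args) {
        TokensAndDelims result = split("  hello, world!");
        System.out.println("leading delim? " + result.leadingDelim
                + ", trailing delim? " + result.trailingDelim
                + ", combined: " + result.parts);
    }
}
```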

If you want to keep the character, use the split method with a limit argument (a loophole in .split() that keeps trailing empty strings) and re-append the character yourself.
See this example:
public class SplitExample {
public static void main(String[] args) {
String str = "Javathomettt";
System.out.println("method 1");
System.out.println("Returning words:");
String[] arr = str.split("t", 40);
for (String w : arr) {
System.out.println(w+"t");
}
System.out.println("Split array length: "+arr.length);
System.out.println("method 2");
System.out.println(str.replaceAll("t", "\n"+"t"));
}
}

I don't know Java too well, but if you can't find a Split method that does that, I suggest you just make your own.
string[] mySplit(string s,string delimiter)
{
string[] result = s.Split(delimiter);
for(int i=0;i<result.Length-1;i++)
{
result[i] += delimiter; //this one would add the delimiter to each items end except the last item,
//you can modify it however you want
}
return result;
}
string[] res = mySplit(myString,myDelimiter);
It's not too elegant, but it'll do.
