java Regex - split but ignore text inside quotes?


Using only regular expression methods, String.replaceAll, and ArrayList, how can I split a String into tokens but ignore delimiters that appear inside quotes?
The delimiter is any character that is neither alphanumeric nor part of quoted text.
For example, the string:
hello^world'this*has two tokens'
should output:
hello
worldthis*has two tokens

I know there is a damn good accepted answer already, but I would like to add another regex-based (and, may I say, simpler) approach: split the given text on any non-alphanumeric delimiter that is not inside single quotes.
Regex:
/(?=(([^']+'){2})*[^']*$)[^a-zA-Z\d]+/
This matches a run of non-alphanumeric characters only if it is followed by an even number of single quotes, in other words only if it lies outside single quotes.
Code:
String string = "hello^world'this*has two tokens'#2ndToken";
System.out.println(Arrays.toString(
    string.split("(?=(([^']+'){2})*[^']*$)[^a-zA-Z\\d]+")));
Output:
[hello, world'this*has two tokens', 2ndToken]
Demo:
Here is a live working Demo of the above code.

Use a Matcher to identify the parts you want to keep, rather than the parts you want to split on:
String s = "hello^world'this*has two tokens'";
Pattern pattern = Pattern.compile("([a-zA-Z0-9]+|'[^']*')+");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(0));
}
See it working online: ideone
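The question's expected output has the quote characters removed, which the loop above does not do. A minimal sketch of one way to add that step, reusing the same pattern and calling replaceAll on each match; the class and method names here are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedTokenMatcher {
    // Same pattern as in the answer: runs of alphanumerics and/or quoted sections.
    private static final Pattern TOKEN = Pattern.compile("([a-zA-Z0-9]+|'[^']*')+");

    // Hypothetical helper: collects every match and strips the single quotes from it.
    public static List<String> tokens(String s) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(s);
        while (matcher.find()) {
            result.add(matcher.group().replaceAll("'", ""));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String token : tokens("hello^world'this*has two tokens'")) {
            System.out.println(token);
        }
    }
}
```

For the question's input this should yield the two tokens hello and worldthis*has two tokens.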

You cannot in any reasonable way. You are posing a problem that regular expressions aren't good at.

Do not use a regular expression for this. It won't work. Use / write a parser instead.
You should use the right tool for the right task.
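For comparison, a hand-written scanner along the lines suggested here could look roughly like the sketch below. It is only an illustration under some assumptions: the class and method names are invented, quotes are dropped from the output as in the question's example, an unterminated quote treats the rest of the input as quoted, and Character.isLetterOrDigit is used, which accepts more than the ASCII range a-z, A-Z, 0-9.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareTokenizer {

    // Splits on non-alphanumeric characters, except while inside single quotes.
    // The quote characters themselves are not copied into the tokens.
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (char c : input.toCharArray()) {
            if (c == '\'') {
                inQuotes = !inQuotes;            // toggle state, drop the quote
            } else if (inQuotes || Character.isLetterOrDigit(c)) {
                current.append(c);               // character belongs to the token
            } else if (current.length() > 0) {
                tokens.add(current.toString());  // a delimiter ends the token
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());      // flush the last token
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("hello^world'this*has two tokens'")) {
            System.out.println(token);
        }
    }
}
```

A single pass with one boolean of state is enough here because the quotes cannot nest.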

Related

how to break the string using keywords using regex

I have a scenario where I need to break the input string below into parts based on keywords, using regex.
The keywords are UPRCAS, REPLC, LOWCAS and TUPIL.
String input = "UPRCAS-0004-abcdREPLC-0003-123TUPIL-0005-adf2344LOWCAS-0003-ABCD";
The output should be as follows:
UPRCAS-0004-abcd
REPLC-0003-123
TUPIL-0005-adf2344
LOWCAS-0003-ABCD
How can I achieve this using Java regex?
I have tried splitting on '-' and splitting with a regex, but both approaches give an array of strings that I then have to process again, combining three strings to form UPRCAS-0004-abcd. This does not seem efficient, since it needs an extra array and a second pass over it.
String[] tokens = input.split("-");
String[] r = input.split("(?=\\p{Upper})");
Please let me know whether the string can be split with a regex based on the keywords. Basically, I need to extract the string between keyword boundaries.
Edited question after understanding the limitation of the existing problem:
The regex should be generic and extract the strings from the input between the uppercase characters.
The regex should not contain the keywords themselves.
I understand that it is a bad idea to add a new keyword to the regex pattern every time. I expect the solution to be as generic as possible.
Thanks all for your time. Really appreciate it.

Split using the following regex:
(?=UPRCAS|REPLC|LOWCAS|TUPIL)
The (?=xxx) construct is a zero-width positive lookahead, meaning that it matches the empty position immediately before one of the 4 keywords without consuming any characters.
See Regular-Expressions.info for more information: Lookahead and Lookbehind Zero-Length Assertions
Test
String input = "UPRCAS-0004-abcdREPLC-0003-123TUPIL-0005-adf2344LOWCAS-0003-ABCD";
String[] output = input.split("(?=UPRCAS|REPLC|LOWCAS|TUPIL)");
for (String value : output) {
    System.out.println(value);
}
Output
UPRCAS-0004-abcd
REPLC-0003-123
TUPIL-0005-adf2344
LOWCAS-0003-ABCD

You can try this regex:
\w+-\w+-(?:[a-z0-9]+|[A-Z]+)
Demo: https://regex101.com/r/etKBjI/3
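A rough sketch of how this keyword-free pattern could be applied from Java with a Matcher. The class and method names are invented for illustration. The pattern assumes that each payload is either all lowercase letters and digits or all uppercase letters; an all-uppercase payload directly followed by another keyword would run into that keyword.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordExtractor {
    // word-word-(lowercase/digit run | uppercase run); no keyword names involved.
    private static final Pattern RECORD =
            Pattern.compile("\\w+-\\w+-(?:[a-z0-9]+|[A-Z]+)");

    // Hypothetical helper: returns every record found in the input, in order.
    public static List<String> extract(String input) {
        List<String> records = new ArrayList<>();
        Matcher matcher = RECORD.matcher(input);
        while (matcher.find()) {
            records.add(matcher.group());
        }
        return records;
    }

    public static void main(String[] args) {
        String input = "UPRCAS-0004-abcdREPLC-0003-123TUPIL-0005-adf2344LOWCAS-0003-ABCD";
        for (String record : extract(input)) {
            System.out.println(record);
        }
    }
}
```

Matching the records directly avoids the intermediate array that the question wanted to get rid of.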

Java ignore special characters in string matching

I want to match two strings in Java, e.g.:
text: János
searchExpression: Janos
Since I don't want to replace all special characters, I thought I could just treat the á as a wildcard, so that anything would match for this character. For instance, if I search János for Jxnos, it should be found. Of course, there could be multiple special characters in the text. Does anyone have an idea how I could achieve this with a pattern matcher, or do I have to compare char by char?

Use the Pattern and Matcher classes with J\\Snos as the regex. \\S matches any non-whitespace character.
String str = "foo János bar Jxnos";
Matcher m = Pattern.compile("J\\Snos").matcher(str);
while (m.find()) {
    System.out.println(m.group());
}
Output:
János
Jxnos

A possible solution would be to strip the accents with the StringUtils.stripAccents(input) method from Apache Commons Lang:
String input = StringUtils.stripAccents("János");
System.out.println(input); //Janos
Make sure to also read up on the more elaborate approaches based on the Normalizer class: Is there a way to get rid of accents and convert a whole string to regular letters?
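A minimal sketch of the Normalizer-based idea, using only the JDK. The class and method names are hypothetical. It decomposes each accented letter into a base letter plus combining marks and then deletes the marks; letters that have no such decomposition (for example ø or ß) are left unchanged.

```java
import java.text.Normalizer;

class AccentInsensitiveMatch {

    // Decompose to NFD, then remove all combining marks (Unicode category M).
    public static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Hypothetical helper: compares two strings after stripping accents from both.
    public static boolean equalsIgnoringAccents(String text, String search) {
        return stripAccents(text).equals(stripAccents(search));
    }

    public static void main(String[] args) {
        // "\u00e1" is á, written as an escape to avoid source-encoding issues.
        System.out.println(stripAccents("J\u00e1nos"));
        System.out.println(equalsIgnoringAccents("J\u00e1nos", "Janos"));
    }
}
```

Unlike the wildcard approach, this would not treat Jxnos as a match for János, since only accents are ignored.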

How can I obtain what .* matched in a regular expression?

I have thousands of different regular expressions and they look like this:
^Mozilla.*Android.*AppleWebKit.*Chrome.*OPR\/([0-9\.]+)
How do I obtain the substrings that match each .* in the regex? For example, for the above regex, I would get four substrings for the four different .*s. In addition, I don't know in advance how many .*s there are. I could possibly find out by doing some simple operation on the given regex string, but that would add complexity to the program. I process a fairly large amount of data, so efficiency really matters here.

Replace the .*s with (.*)s and use matcher.group(n). For instance:
Pattern p = Pattern.compile("1(.*)2(.*)3");
Matcher m = p.matcher("1abc2xyz3");
m.find();
System.out.println(m.group(2));
Output:
xyz
Notice how the match of the second (.*) was returned (since m.group(2) was used).
Also, since you mentioned you won't know how many .*s your regex will contain, there is a matcher.groupCount() method you can use, if the only capturing groups in your regex will indeed be (.*)s.
For your own enlightenment, try reading about capturing groups.
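When the number of groups is not known in advance, the groups can be read in a loop bounded by groupCount(). A small sketch with invented class and method names; note that groupCount() counts every capturing group in the pattern, not only the (.*) ones:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WildcardCaptures {

    // Returns the text of every capturing group of the first match, in order.
    // The list is empty if the regex does not match at all.
    public static List<String> captures(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> groups = new ArrayList<>();
        if (m.find()) {
            // Group 0 is the whole match, so capturing groups start at 1.
            for (int i = 1; i <= m.groupCount(); i++) {
                groups.add(m.group(i));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(captures("1(.*)2(.*)3", "1abc2xyz3"));
    }
}
```

If the same regex is applied to many inputs, compiling the Pattern once and reusing it would likely matter more for throughput than the loop itself.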

"How do I get those substrings that match the .* in the regex? For example, for the above regex, I would get four substrings for four different .*s."
Use groups: (.*)
"In addition, I don't know in advance how many .*s there are"
Build your regex string, then replace .* with (.*):
String myRegex = "your regex here";
myRegex = myRegex.replace(".*","(.*)");
"even though I can possibly find out by doing some simple operation on the given regex string, but that would impose more complexity on the program"
If you don't know how the regex is constructed and it is not built by your application, the only option is to process it after you receive it. If you are building the regex yourself, append (.*) to the regex string instead of .*
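A sketch of the rewrite step applied to the regex from the question, with hypothetical helper names. The plain string replace is an assumption that only holds if every .* in the pattern is a bare wildcard; it would also rewrite a .* that sits inside a character class, follows an escaped backslash, or is already wrapped in parentheses.

```java
import java.util.regex.Pattern;

class WrapWildcards {

    // Turns every literal ".*" in the regex source into a capturing "(.*)".
    public static String wrap(String regex) {
        return regex.replace(".*", "(.*)");
    }

    // Number of capturing groups in the pattern; no input needs to be matched.
    public static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        String original = "^Mozilla.*Android.*AppleWebKit.*Chrome.*OPR\\/([0-9\\.]+)";
        String wrapped = wrap(original);
        System.out.println(wrapped);
        // Four new groups plus the version group that was already there.
        System.out.println(groupCount(wrapped));
    }
}
```

Because the original pattern already contains a capturing group for the version number, the .* matches are groups 1 to 4 and the version moves to group 5.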

Java: regex - how do i get the first quote text

As a beginner with regex, I believe I'm about to ask something too simple, but I'll ask anyway and hope you won't mind helping me.
Let's say I have a text like "hello 'cool1' word! 'cool2'"
and I want to get the text of the first quote (which is 'cool1' without the ').
What should my pattern be? And when using a matcher, how do I guarantee that it returns the first quote and not the second?
(Please suggest a regex-only solution.)

Use this regular expression:
'([^']*)'
Use as follows:
String s = "hello 'cool1' word! 'cool2'";
Pattern pattern = Pattern.compile("'([^']*)'");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
    System.out.println(matcher.group(1)); // prints cool1
}
Or this if you know that there are no new-line characters in your quoted string:
'(.*?)'
"When using a matcher, how do I guarantee that it returns the first quote and not the second?"
It will find the first quoted string first because it searches from left to right. If you ask it for the next match, it will give you the second quoted string.

If you want to find the first quote's text without the ', you can use lookahead and lookbehind, like this:
(?<=').*?(?=')
For example:
System.out.println("hello 'cool1' word! 'cool2'".replaceFirst("(?<=').*?(?=')", "ABC"));
//out -> hello 'ABC' word! 'cool2'

You could just split the string on quotes and take the second piece (which lies between the first and second quotes).
If you insist on regex, try this:
/^.*?'(.*?)'/
Make sure dot-all mode is enabled (Pattern.DOTALL in Java) so that . also matches line breaks, unless you know you'll never have newlines in your input. Then take capturing group 1 from the result; that is your string.
To support double quotes too:
/^.*?(['"])(.*?)\1/
Then take capturing group 2.
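Both suggestions from this answer, translated into Java as a rough sketch. The class and method names are invented, and the regex is compiled with Pattern.DOTALL so that . also matches line breaks. The split variant assumes the quotes are balanced; for input such as "it's" it would return whatever follows the apostrophe.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstQuoted {
    // Group 1 is the opening quote character, group 2 the quoted text;
    // \1 requires the closing quote to be the same character as the opening one.
    private static final Pattern QUOTED =
            Pattern.compile("^.*?(['\"])(.*?)\\1", Pattern.DOTALL);

    // Split on single quotes: the piece at index 1 lies between the first two quotes.
    public static String viaSplit(String s) {
        String[] parts = s.split("'");
        return parts.length > 1 ? parts[1] : null;
    }

    // Regex variant that accepts single or double quotes; null if nothing is quoted.
    public static String viaRegex(String s) {
        Matcher m = QUOTED.matcher(s);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String s = "hello 'cool1' word! 'cool2'";
        System.out.println(viaSplit(s));
        System.out.println(viaRegex(s));
    }
}
```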

regex to find substring between special characters

I am running into this problem in Java.
I have data strings that contain entities enclosed between & and ; for example:
&Text.ABC;, &Links.InsertSomething;
These entities can be anything from the ini file we have.
I need to find these strings in the input string and remove them. There can be zero, one or more occurrences of these entities in the input string.
I am trying to use a regex to match them, but it is not working.
Can anyone suggest a regex for this problem?
Thanks!

Here is the regex:
"&[A-Za-z]+(\\.[A-Za-z]+)*;"
It starts by matching the character &, followed by one or more letters, uppercase or lowercase ([A-Za-z]+). Then it matches a dot followed by one or more letters (\\.[A-Za-z]+). This dot-and-letters group can occur any number of times, including zero. Finally, it matches the ; character.
You can use this regex in Java like this:
Pattern p = Pattern.compile("&[A-Za-z]+(\\.[A-Za-z]+)*;"); // java.util.regex.Pattern
String subject = "foo &Bar; baz\n";
String result = p.matcher(subject).replaceAll("");
Or just
"foo &Bar; baz\n".replaceAll("&[A-Za-z]+(\\.[A-Za-z]+)*;", "");
If you want to remove whitespace after the matched tokens as well, you can use this regex:
"&[A-Za-z]+(\\.[A-Za-z]+)*;\\s*" // the "\\s*" matches any number of whitespace
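Put together as a small runnable sketch, with the two patterns precompiled so they can be reused across many input strings. The class and method names are hypothetical, and entity names are assumed to consist only of letters and dots, as in the regex above.

```java
import java.util.regex.Pattern;

class EntityStripper {
    // &Name; or &Name.Part.More; with letters only in each segment.
    private static final Pattern ENTITY =
            Pattern.compile("&[A-Za-z]+(\\.[A-Za-z]+)*;");

    // Same pattern, but also consuming any whitespace that follows the entity.
    private static final Pattern ENTITY_AND_SPACE =
            Pattern.compile("&[A-Za-z]+(\\.[A-Za-z]+)*;\\s*");

    public static String strip(String input) {
        return ENTITY.matcher(input).replaceAll("");
    }

    public static String stripWithWhitespace(String input) {
        return ENTITY_AND_SPACE.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("foo &Bar; baz"));
        System.out.println(stripWithWhitespace("foo &Text.ABC; baz"));
    }
}
```

An input with no entities is returned unchanged, which covers the "none" case from the question.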

And there is a nice online regular expression tester which uses the java regexp library.
http://www.regexplanet.com/simple/index.html

You can try:
input=input.replaceAll("&[^.]+\\.[^;]+;(,\\s*&[^.]+\\.[^;]+;)*","");
