java - split string using regular expression - java

I need to split a string where there's a comma, but it depends where the comma is placed.
As an example, consider the following:
C=75,user_is_active(A,B),user_is_using_app(A,B),D=78
I'd like the String.split() function to separate them like this:
C=75
user_is_active(A,B)
user_is_using_app(A,B)
D=78
I can only think of one thing but I'm not sure how it'd be expressed in regex.
The characters/words within the brackets are always capital. In other words, there won't be a situation where I will have user_is_active(a,b).
Is there a way to do this?

If you don't have more than one level of parentheses, you could do a split on a comma that isn't followed by a closing ) before an opening (:
String[] splitArray = subjectString.split(
"(?x), # Verbose regex: Match a comma\n" +
"(?! # unless it's followed by...\n" +
" [^(]* # any number of characters except (\n" +
" \\) # and a )\n" +
") # end of lookahead assertion");
Your proposed rule would translate as
String[] splitArray = subjectString.split(
"(?x) # Verbose regex\n" +
"(?<!\\p{Lu}) # Unless it's preceded by an uppercase letter,\n" +
", # match a comma\n" +
"(?!\\p{Lu}) # unless it's followed by an uppercase letter");
but then you would miss a split in a text like
Org=NASA,Craft=Shuttle

Consider using a parser generator for parsing this kind of query, e.g. JavaCC or ANTLR.

As an alternative, if you need more than one level of parentheses, you can create a little string parser for parsing the string character by character.
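A minimal sketch of such a character-by-character parser, assuming that parentheses are the only nesting construct and that commas at depth 0 are the separators. The class name TopLevelSplitter and the method name split are made up for this illustration:

```java
import java.util.ArrayList;
import java.util.List;

class TopLevelSplitter {
    // Splits on commas that sit outside all parentheses (depth 0).
    // Assumption: parentheses are balanced; a stray ")" is kept as plain text.
    static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;
        for (char c : s.toCharArray()) {
            if (c == '(') {
                depth++;
            } else if (c == ')' && depth > 0) {
                depth--;
            }
            if (c == ',' && depth == 0) {
                // Top-level comma: close the current part and start a new one.
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("C=75,user_is_active(A,B),user_is_using_app(A,B),D=78"));
        System.out.println(split("a,f(b,g(c,d)),e"));
    }
}
```

Unlike the lookahead-based split, this keeps working when calls are nested, because the depth counter tracks how many parentheses are currently open.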

Related

Regex replace word while preserving spaces/punctuation

I am trying to go through a document and change all instances of a name using regular expressions in Java. My code looks something like this:
Pattern replaceWordPattern = Pattern.compile("(^|\\s)" + replaceWord + "^|\\W");
followed by:
String line = matcher.replaceAll("Alice");
The problem is that this does not preserve the spaces or punctuation or other non-word characters that followed. If I had "Jack jumped" it becomes "Alicejumped". Does anyone know a way to fix this?
\W consumes the space after the replaceWord. Replace ^|\\W with the word boundary \\b, which does not consume any characters. Consider doing the same for the first delimiter group, as I suspect you do not want to consume anything there either.
Pattern replaceWordPattern = Pattern.compile("\\b" + replaceWord + "\\b");
If the semantics of word boundaries do not suit you, consider using lookahead and lookbehind constructs, which do not consume input either.
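As a rough, self-contained sketch of the word-boundary approach: the class and method names below are hypothetical, and Pattern.quote and Matcher.quoteReplacement are added on the assumption that the names are plain text rather than regex fragments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameReplacer {
    // Replaces whole-word occurrences of oldName; \b is zero-width, so the
    // surrounding spaces and punctuation are left untouched.
    static String replaceName(String text, String oldName, String newName) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(oldName) + "\\b");
        return p.matcher(text).replaceAll(Matcher.quoteReplacement(newName));
    }

    public static void main(String[] args) {
        // prints "Alice jumped. Jackson did not; Alice, too."
        System.out.println(replaceName("Jack jumped. Jackson did not; Jack, too.", "Jack", "Alice"));
    }
}
```

"Jackson" is left alone because there is no word boundary between "Jack" and "son".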
You're missing brackets on the second non-whitespace character expression:
Pattern replaceWordPattern = Pattern.compile("[^|\\s]" + replaceWord + "[^|\\W]");

What is the regex for checking string A for the presence of any 3 consecutive characters in string B?

For example:
I have a username string: "johnwatson@221b.bakerstreet"
I want to search some password string to make sure it doesn't contain any 3 consecutive letters in the username, e.g.: no "joh", "ohn", "hnw", etc...
I'm aware of a functional way to do this, but is there a way to do this with regex?
Short answer: no, you should do this in your application code by generating all the 3-letter substrings and checking if the password contains any of them.
But if you feel adventurous, you could still summon a bloody regex monster from a 19th century gothic novel to achieve this.
See @sln's and @Floris's answers for that.
My 2cents: that's a very, very bad idea. Regex are great when you have a fixed, regular syntactical structure to recognize, which is not your case.
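The application-code approach could look roughly like this. It is only a sketch: the names PasswordCheck and sharesTrigram are invented here, and the comparison is assumed to be case-sensitive and to cover all characters, not just letters.

```java
class PasswordCheck {
    // Returns true if any 3 consecutive characters of the username
    // also occur in the password (case-sensitive comparison).
    static boolean sharesTrigram(String username, String password) {
        for (int i = 0; i + 3 <= username.length(); i++) {
            if (password.contains(username.substring(i, i + 3))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(sharesTrigram("johnwatson", "xxohnxx"));      // true ("ohn")
        System.out.println(sharesTrigram("johnwatson", "youcantguess")); // false
    }
}
```

If the check should ignore case, lowercasing both strings before the loop would be one way to do it.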
Capture 3, consume 1
Taking a guess: concatenate username + newline + password.
(actually not a guess)
Context: no dot-all mode
If it matches, then it's an error.
# johnwatson@221b.bakerstreet\nPassword
# (?=(...)[^\n]*\n(?:(?!\1).)*\1)
(?= # Lookahead assertion start
( . . . ) # Capture 3 non-newline chars
[^\n]* \n # Get up to and the next newline
(?: # Cluster group start
(?! \1 ) # Backref check, not the current 3 char string in front of us
. # This char is ok, consume it in the assertion context
)* # Cluster group end, do 0 to many times
\1 # Here, found a user name sub-string
# in the password, it will match now
) # Lookahead assertion end
Heavily inspired by @sln's answer, I would like to offer the following solution:
First - concatenate your user name and password into a single string, separated by a newline (assuming that newline does not otherwise occur in either user name or password; reasonable assumption, I think).
Next, test the resulting string with the following expression:
(?=(...).*\n.*\1)
How this works:
(?= ) - positive lookahead: "somewhere we can match this"
(...) - three consecutive characters - 'capture group'. We can refer to these as \1
.*\n - followed by "anything" up to a newline character
.*\1 - followed by "anything" up to a repeat of the first match (the ...)
This is going to try as hard as it can to find a match (that's what regex tries to do). If it succeeds, it means that there was a repeat of three consecutive characters that occurred before the \n in the part after the \n. So try to test for the above; if it succeeds, your "rule" was violated.
edit - example of complete (tested, working) Java code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class PasswordTester
{
    public static void main(String[] args)
    {
        String username = "johnwatson@221bakerstreet.com";
        String password = "youcantmakethisup";
        // Join with a newline, which is assumed not to occur in either string
        String input = username + "\n" + password;
        System.out.println("testing " + input);
        // Look for three characters before the newline that recur after it
        Pattern p = Pattern.compile("(?=(...).*\\n.*\\1)");
        Matcher m = p.matcher(input);
        if (m.find()) {
            System.out.println("the three character sequence '" + m.group(1) + "' was repeated");
        }
        else System.out.println("the password is good");
    }
}
Output:
testing johnwatson@221bakerstreet.com
youcantmakethisup
the three character sequence 'ake' was repeated
I don't think so. Regular expressions don't have "memory" and doing what you want requires memory of previously-matched characters. This might be possible with some of the more evil Perl extensions to regular expressions (inline code?), I'm not sure, but I don't believe this is possible with "pure" regular expressions.

Regex Lookahead and Lookbehinds: followed by this or that

I'm trying to write a regular expression that checks ahead to make sure there is either a whitespace character OR an opening parenthesis after the words I'm searching for.
Also, I want it to look back and make sure it is preceded by either a non-Word (\W) or nothing at all (i.e. it is the beginning of the statement).
So far I have,
"(\\W?)(" + words.toString() + ")(\\s | \\()"
However, this also matches the stuff at either ends - I want this pattern to match ONLY the word itself - not the stuff around it.
I'm using Java flavor Regex.
As you tagged your question yourself, you need lookarounds:
String regex = "(?<=\\W|^)(" + Pattern.quote(words.toString()) + ")(?= |[(])";
(?<=X) means "preceded by X"
(?<!X) means "not preceded by X"
(?=X) means "followed by X"
(?!X) means "not followed by X"
What about the word itself: will it always start with a word character (i.e., one that matches \w)? If so, you can use a word boundary for the leading condition.
"\\b" + theWord + "(?=[\\s(])"
Otherwise, you can use a negative lookbehind:
"(?<!\\w)" + theWord + "(?=[\\s(])"
I'm assuming the word is either quoted like so:
String theWord = Pattern.quote(words.toString());
...or doesn't need to be.
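To illustrate that the lookarounds leave the surrounding characters out of the match, here is a small sketch built on the negative-lookbehind variant. KeywordFinder and findStarts are hypothetical names, and the word is assumed to be literal text, hence Pattern.quote.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordFinder {
    // Returns the start index of every occurrence of word that is not preceded
    // by a word character and is followed by whitespace or "(".
    static List<Integer> findStarts(String text, String word) {
        Pattern p = Pattern.compile("(?<!\\w)" + Pattern.quote(word) + "(?=[\\s(])");
        Matcher m = p.matcher(text);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            // m.group() is exactly the word; the lookarounds consume nothing.
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // "elif(" is skipped (preceded by a word character) and so is the
        // trailing "if" (nothing follows it); prints [0, 15]
        System.out.println(findStarts("if (x) elif(y) if(z) if", "if"));
    }
}
```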
If you don't want a group to be captured by the matching, you can use the special construct (?:X)
So, in your case:
"(?:\\W?)(" + words.toString() + ")(?:\\s|\\()"
You will then have only two groups: group(0) for the whole match and group(1) for the word you are looking for.

Replace all content within braces?

In the end I need a regex that converts a phone number into an E.164-conformant number. For now I have this:
result = s.replaceAll("[(*)|+| ]", "");
It removes the spaces, the "+" sign and the parentheses "()" just fine. But it does not remove the content of the parentheses; I want e.g. the number +49 (0)11 111 11 11 to become 49111111111.
How can I get this to work?
You can do it, but what if there's more than just a zero between parentheses?
result = s.replaceAll("\\([^()]*\\)|[*+ ]+", "");
As a verbose regex:
result = s.replaceAll(
"(?x) # Allow comments in the regex. \n" +
"\\( # Either match a ( \n" +
"[^()]* # then any number of characters except parentheses \n" +
"\\) # then a ). \n" +
"| # Or \n" +
"[*+\\ ]+ # Match one or more asterisks, pluses or spaces", "");
[(*)|+| ]
is a character class, matching any single parenthesis, asterisk, bar, plus or space character. Get rid of the square brackets and use something like
s.replaceAll("\\(.*?\\)|\\D", "");
This will remove anything between (and including) parentheses, as well as anything else that is not a digit. Note that this will not handle nested parentheses very well - it will eat everything from an open parenthesis to the first closing one it finds, so would change (123(45)67) into 67 (the unbalanced close parenthesis being removed as it's a \D)
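A small sketch that combines the two ideas from the answers above: a parenthesized group without nested parentheses, plus \D for everything else that is not a digit. The names PhoneNormalizer and digitsOnly are made up, and the method only returns the bare digits shown in the question; it does not validate the number or restore a leading "+".

```java
class PhoneNormalizer {
    // Removes "(...)" groups including their content, then any remaining non-digit.
    // Assumption: parentheses are not nested.
    static String digitsOnly(String s) {
        return s.replaceAll("\\([^()]*\\)|\\D", "");
    }

    public static void main(String[] args) {
        // prints 49111111111
        System.out.println(digitsOnly("+49 (0)11 111 11 11"));
    }
}
```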
You might try this: "(\\(\\d+\\))|\\+|\\s". Removes the paren's and contents, plus sign, and space.
I think you are expecting a little too much magic from character classes. Firstly, in character classes, don't use |. It is just another character that will be matched by the character class. Simply list all the characters you want to include without any delimiters.
Secondly, a character class really just matches single characters. So (*) inside a character class can by definition do nothing more than match ( or * (literally) or ). If you are 100% sure that your input will never have nested or unmatched parentheses, then you can do something like this:
"(?:\\([^)]*\\)|\\D)+"

java regex string split by " not \"

I need to write a simple Java program that converts MySQL INSERT lines into CSV files (one CSV file per MySQL table).
Is a regex the best solution for this in Java?
My main problem is how to correctly match a value like this: 'this is \'cool\'...'
(i.e. how to ignore an escaped ')
example:
INSERT INTO `table1` VALUES ('this is \'cool\'...' ,'some2');
INSERT INTO `table1` (`field1`,`field2`) VALUES ('this is \'cool\'...' ,'some2');
Thanks
Assuming that your SQL statements are syntactically valid, you could use
Pattern regex = Pattern.compile("'(?:\\\\.|[^'\\\\])*'");
to get a regex that matches all single-quoted strings, ignoring escaped characters inside them.
Explanation without all those extra backslashes:
' # Match '
(?: # Either match...
\\. # an escaped character
| # or
[^'\\] # any character except ' or \
)* # any number of times.
' # Match '
Given the string
'this', 'is a \' valid', 'string\\', 'even \\\' with', 'escaped quotes.\\\''
this matches
'this'
'is a \' valid'
'string\\'
'even \\\' with'
'escaped quotes.\\\''
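Applied to one of the INSERT lines from the question, the pattern can be used with Matcher.find() to collect the quoted values. This is only a sketch: QuotedValueExtractor and extract are illustrative names, and the values are returned with their quotes and escapes still in place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedValueExtractor {
    // A quote, then any run of escaped characters or non-quote/non-backslash
    // characters, then the closing quote.
    private static final Pattern QUOTED = Pattern.compile("'(?:\\\\.|[^'\\\\])*'");

    // Returns every single-quoted string literal, quotes included.
    static List<String> extract(String sql) {
        List<String> values = new ArrayList<>();
        Matcher m = QUOTED.matcher(sql);
        while (m.find()) {
            values.add(m.group());
        }
        return values;
    }

    public static void main(String[] args) {
        String sql = "INSERT INTO `table1` VALUES ('this is \\'cool\\'...' ,'some2');";
        for (String value : extract(sql)) {
            System.out.println(value);
        }
    }
}
```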
You can match the characters within non-escaped quotes by using this regex:
(?<!\\)'(.*?)(?<!\\)'
This uses a negative lookbehind to assert that the character before each quote is not a backslash. Note that it treats any quote preceded by a backslash as escaped, so it fails on a value that ends in an escaped backslash, such as 'string\\'.
In Java, you have to double-escape (once for the String, once for the regex), so it looks like:
String regex = "(?<!\\\\)'(.*?)(?<!\\\\)'";
If you are working on Linux, I would use sed to do all the work.
Four backslashes (two to represent a backslash) plus dot. "'(\\\\.|.)*'"
Although regexes give you a very powerful mechanism to parse text, I think you might be better off with a non-regex parser. I think your code will be easier to write, easier to understand and have fewer bugs.
Something like:
find "INSERT INTO"
find table name
find column names
find "VALUES"
find value set (loop this part)
Writing the regex to do all of the above, with optional column values and an optional number of value sets is non-trivial and error-prone.
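The "find value set" step could be hand-written along these lines. This is a deliberately simplified sketch, not a full SQL parser: InsertValueScanner and firstTuple are invented names, only the first tuple after VALUES is read, the keyword is located with a plain indexOf (so it is assumed not to appear in a table or column name), and a backslash escape is reduced to the character that follows it.

```java
import java.util.ArrayList;
import java.util.List;

class InsertValueScanner {
    // Returns the unquoted, unescaped values of the first "(...)" tuple after VALUES.
    static List<String> firstTuple(String sql) {
        int i = sql.indexOf("VALUES");
        if (i < 0) throw new IllegalArgumentException("no VALUES clause");
        i = sql.indexOf('(', i);
        if (i < 0) throw new IllegalArgumentException("no value tuple");
        List<String> values = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (i++; i < sql.length(); i++) {
            char c = sql.charAt(i);
            if (inQuotes) {
                if (c == '\\' && i + 1 < sql.length()) {
                    // Simplification: keep the escaped character as-is (\' becomes ').
                    current.append(sql.charAt(++i));
                } else if (c == '\'') {
                    inQuotes = false;
                } else {
                    current.append(c);
                }
            } else if (c == '\'') {
                inQuotes = true;
            } else if (c == ',' || c == ')') {
                // End of one value; ")" also ends the tuple.
                values.add(current.toString());
                current.setLength(0);
                if (c == ')') break;
            } else if (!Character.isWhitespace(c)) {
                current.append(c); // unquoted value such as a number or NULL
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String sql = "INSERT INTO `table1` (`field1`,`field2`) VALUES ('this is \\'cool\\'...' ,'some2');";
        System.out.println(firstTuple(sql));
    }
}
```

Writing the values out as CSV would still need CSV quoting of fields that contain commas or double quotes, which is left out here.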
You have to use \\\\. In a Java string literal, \\ represents a single \, because the backslash introduces escape sequences such as \n and \t. In a regex, a literal backslash must itself be escaped as \\, so matching one backslash takes four in the Java source.
