java regex string split by " not \" - java

I need to write a simple program in Java to convert MySQL INSERT lines into CSV files (one CSV file per MySQL table).
Is regex the best solution for this in Java?
My main problem is how to correctly match a value like this: 'this is \'cool\'...'
(that is, how to ignore an escaped ').
Example:
INSERT INTO `table1` VALUES ('this is \'cool\'...' ,'some2');
INSERT INTO `table1` (`field1`,`field2`) VALUES ('this is \'cool\'...' ,'some2');
Thanks

Assuming that your SQL statements are syntactically valid, you could use
Pattern regex = Pattern.compile("'(?:\\\\.|[^'\\\\])*'");
to get a regex that matches all single-quoted strings, ignoring escaped characters inside them.
Explanation without all those extra backslashes:
' # Match '
(?: # Either match...
\\. # an escaped character
| # or
[^'\\] # any character except ' or \
)* # any number of times.
' # Match '
Given the string
'this', 'is a \' valid', 'string\\', 'even \\\' with', 'escaped quotes.\\\''
this matches
'this'
'is a \' valid'
'string\\'
'even \\\' with'
'escaped quotes.\\\''
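As a minimal sketch of how the pattern above could be applied in Java (the class name QuotedStringFinder and the method findQuoted are made up for illustration; the input is one of the INSERT lines from the question):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedStringFinder {
    // Same pattern as above: a quote, then escaped characters or
    // anything that is neither a quote nor a backslash, then a quote.
    private static final Pattern QUOTED = Pattern.compile("'(?:\\\\.|[^'\\\\])*'");

    // Hypothetical helper: collects every single-quoted string, quotes included.
    static List<String> findQuoted(String sql) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(sql);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String sql = "INSERT INTO `table1` VALUES ('this is \\'cool\\'...' ,'some2');";
        for (String value : findQuoted(sql)) {
            System.out.println(value);
        }
        // prints:
        // 'this is \'cool\'...'
        // 'some2'
    }
}
```

The matches still contain the surrounding quotes and the backslashes, so they would need to be unescaped before being written to a CSV file.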

You can match on chars within non-escaped quotes by using this regex:
(?<!\\)'([^'])(?<!\\)`
This uses a negative look-behind to assert that the character before the quote is not a backslash.
In Java, you have to double-escape (once for the String, once for the regex), so it looks like:
String regex = "(?<!\\\\)'([^'])(?<!\\\\)`";
If you are working on Linux, I would use sed to do all the work.

Four backslashes (two to represent a backslash) plus dot. "'(\\\\.|.)*'"

Although regexes give you a very powerful mechanism to parse text, I think you might be better off with a non-regex parser. I think your code will be easier to write, easier to understand and have fewer bugs.
Something like:
find "INSERT INTO"
find table name
find column names
find "VALUES"
find value set (loop this part)
Writing the regex to do all of the above, with optional column values and an optional number of value sets is non-trivial and error-prone.
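The "find value set" step above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: the class and method names are hypothetical, only the first value set is read, the keyword VALUES is assumed to be upper case and not to occur inside an identifier, and a backslash simply makes the next character literal (so MySQL escapes such as \n are not translated).

```java
import java.util.ArrayList;
import java.util.List;

class InsertValueScanner {
    // Hypothetical helper: returns the unquoted, unescaped values of the
    // first "(...)" group that follows the VALUES keyword.
    static List<String> parseFirstValueSet(String sql) {
        int i = sql.indexOf("VALUES");
        if (i < 0) {
            throw new IllegalArgumentException("no VALUES keyword");
        }
        i = sql.indexOf('(', i);
        if (i < 0) {
            throw new IllegalArgumentException("no value set after VALUES");
        }
        List<String> values = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuote = false;
        for (i = i + 1; i < sql.length(); i++) {
            char c = sql.charAt(i);
            if (inQuote) {
                if (c == '\\' && i + 1 < sql.length()) {
                    current.append(sql.charAt(++i)); // escaped char, kept literally
                } else if (c == '\'') {
                    inQuote = false;                 // closing quote
                } else {
                    current.append(c);
                }
            } else if (c == '\'') {
                inQuote = true;                      // opening quote
            } else if (c == ',') {
                values.add(current.toString());      // end of one value
                current.setLength(0);
            } else if (c == ')') {
                values.add(current.toString());      // end of the value set
                return values;
            } else if (!Character.isWhitespace(c)) {
                current.append(c);                   // unquoted token, e.g. a number
            }
        }
        throw new IllegalArgumentException("unterminated value set");
    }

    public static void main(String[] args) {
        String sql = "INSERT INTO `table1` (`field1`,`field2`) VALUES ('this is \\'cool\\'...' ,'some2');";
        System.out.println(parseFirstValueSet(sql));
        // prints: [this is 'cool'..., some2]
    }
}
```

Because the loop tracks whether it is inside a quoted value, commas and parentheses inside quotes do not end a value, which is the part that is awkward to express in a single regex.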

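You have to use \\\\. In a Java string literal, \\ is one \, because the backslash introduces escape sequences for whitespace and control characters (\n, \t, ...). In a regex, a literal backslash is also written as \\, and each of those two backslashes needs its own escaping in the string literal.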

Related

How to replace a space exactly with "\\\\s+" [duplicate]

I'm trying to convert the String \something\ into the String \\something\\ using replaceAll, but I keep getting all kinds of errors. I thought this was the solution:
theString.replaceAll("\\", "\\\\");
But this gives the below exception:
java.util.regex.PatternSyntaxException: Unexpected internal error near index 1
The String#replaceAll() interprets the argument as a regular expression. The \ is an escape character in both String and regex. You need to double-escape it for regex:
string.replaceAll("\\\\", "\\\\\\\\");
But you don't necessarily need regex for this, simply because you want an exact character-by-character replacement and you don't need patterns here. So String#replace() should suffice:
string.replace("\\", "\\\\");
Update: as per the comments, you appear to want to use the string in JavaScript context. You'd perhaps better use StringEscapeUtils#escapeEcmaScript() instead to cover more characters.
TLDR: use theString = theString.replace("\\", "\\\\"); instead.
Problem
replaceAll(target, replacement) uses regular expression (regex) syntax for target and, partially, for replacement.
The problem is that \ is a special character both in regex (where, for example, \d represents a digit) and in String literals (where, for example, "\n" represents a line separator and \" escapes a double quote that would otherwise end the literal).
In both cases, to create a \ symbol we escape it (make it a literal instead of a special character) by placing an additional \ before it (just as we escape " in string literals via \").
So the target regex representing the \ symbol needs to hold \\, and the string literal representing that text needs to look like "\\\\".
So we escaped \ twice:
once in regex \\
once in String literal "\\\\" (each \ is represented as "\\").
In the replacement, \ is also special. It lets us escape the other special character, $, which via the $x notation refers to the text matched by the capturing group with index x. For example, "012".replaceAll("(\\d)", "$1$1") matches each digit, places it in capturing group 1, and replaces it with two copies of itself, resulting in "001122".
So again, to make the replacement represent a \ literal we need to escape it with an additional \, which means that:
the replacement must hold two backslash characters \\
and the String literal which represents \\ looks like "\\\\"
BUT since we want the replacement to produce two backslashes, we need "\\\\\\\\" (each \ represented by one "\\\\").
So version with replaceAll can look like
replaceAll("\\\\", "\\\\\\\\");
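To illustrate how the replacement argument treats $ and \ specially, here is a small sketch (the class and method names are invented for this example):

```java
class ReplacementEscapes {
    // $1 in the replacement refers to capturing group 1.
    static String duplicateDigits(String s) {
        return s.replaceAll("(\\d)", "$1$1");
    }

    // \$ in the replacement is a literal dollar sign; the following $1 is group 1.
    // The Java literal "\\$$1" holds the four characters \ $ $ 1.
    static String addDollarSign(String s) {
        return s.replaceAll("(\\d+)", "\\$$1");
    }

    public static void main(String[] args) {
        System.out.println(duplicateDigits("012"));   // prints: 001122
        System.out.println(addDollarSign("cost 15")); // prints: cost $15
    }
}
```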
Easier way with replaceAll
To make our life easier, Java provides tools to automatically escape text for the target and replacement parts. So now we can focus only on strings and forget about regex syntax:
replaceAll(Pattern.quote(target), Matcher.quoteReplacement(replacement))
which in our case can look like
replaceAll(Pattern.quote("\\"), Matcher.quoteReplacement("\\\\"))
Even better: use replace
If we don't really need regex syntax support, let's not involve replaceAll at all. Instead, let's use replace. Both methods replace all occurrences of the target, but replace doesn't involve regex syntax. So you could simply write
theString = theString.replace("\\", "\\\\");
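A short sketch comparing the three variants side by side (the class and method names are only illustrative); all of them should turn \something\ into \\something\\:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackslashDoubling {
    // Hand-escaped regex and replacement.
    static String withReplaceAll(String s) {
        return s.replaceAll("\\\\", "\\\\\\\\");
    }

    // Let the library escape both the target and the replacement.
    static String withQuoting(String s) {
        return s.replaceAll(Pattern.quote("\\"), Matcher.quoteReplacement("\\\\"));
    }

    // No regex involved at all.
    static String withReplace(String s) {
        return s.replace("\\", "\\\\");
    }

    public static void main(String[] args) {
        String input = "\\something\\";            // the text \something\
        System.out.println(withReplaceAll(input)); // prints: \\something\\
        System.out.println(withQuoting(input));    // prints: \\something\\
        System.out.println(withReplace(input));    // prints: \\something\\
    }
}
```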
To avoid this sort of trouble, you can use replace (which takes a plain string) instead of replaceAll (which takes a regular expression). You will still need to escape backslashes, but not in the wild ways required with regular expressions.
You'll need to escape the (escaped) backslash in the first argument, as it is a regular expression. The replacement (2nd argument, see Matcher#replaceAll(String)) also gives backslashes a special meaning, so you'll have to escape those too:
theString.replaceAll("\\\\", "\\\\\\\\");
Yes... by the time the regex compiler sees the pattern you've given it, it sees only a single backslash (since Java's lexer has turned the double backwhack into a single one). You need to replace "\\\\" with "\\\\", believe it or not! Java really needs a good raw string syntax.

mongo query that has regex returns null for string that contains special character as ^ [duplicate]

I am trying to find the following text in my string : '***'
The thing is that the C# Regex mechanism doesn't allow me to do the following:
new Regex("***", RegexOptions.CultureInvariant | RegexOptions.Compiled);
due to
ArgumentException: "parsing "*" - Quantifier {x,y} following nothing."
Obviously it thinks that my stars are regular expression quantifiers.
Is there a way to tell the Regex mechanism to treat stars as just stars and nothing else?
* in Regex means:
Matches the previous element zero or more times.
So you need to use \* or [*] instead.
Explanation:
\
When followed by a character that is not recognized as an escaped character in this and other tables in this topic, matches that character. For example, \* is the same as \x2A.
[ character_group ]
Matches any single character in character_group.
You need to escape the star with a backslash: @"\*"

Replace word using java regex but not quotes

I want to replace a word in a sentence using Java regex replace.
The test string is a_b a__b a_bced adbe a_bc_d 'abcd' ''abcd''
I want to replace all the words which start with a and end with d.
I'm using String.replaceAll("(?i)\\ba[a-zA-Z0-9_.]*d\\b","temp").
It replaces this as a_b a__b temp adbe a_bc_d 'temp' ''temp''
What should my regex be if I don't want to consider the strings in quotes?
I used String.replaceAll("[^'](?i)\\ba[a-zA-Z0-9_.]*d\\b[^']","temp")
It replaces this as a_b a__btempadbe temp'abcd' ''abcd''.
It also removes the spaces next to that word.
Is there any way to replace only the strings that are not inside quotes?
PS: there is a workaround for this: String.replaceAll("[^'](?i)\\ba[a-zA-Z0-9_.]*d\\b[^']"," temp "). But it fails in some cases.
What should my regex be if I want to replace a word in a sentence without considering strings inside quotes?
Thanks in advance!
You can use lookaround assertions:
string = string.replaceAll("(?i)(?<!')\\ba[a-zA-Z0-9_.]*d\\b(?!')", "temp");
RegEx Demo
Read more about lookarounds
Testing if there's or not a quote before and after the target is a wrong approach because you can't know if the described quote is an opening quote or a closing quote. (try to add a quote at the start of your test string and test a naive pattern, you will see: 'inside'a_outside_d'inside').
The only way to know if something is inside or outside quotes is to check the string from the beginning (or from the end, but that is less handy and potentially error-prone if quotes aren't balanced). To do that, you must describe all possible substrings before the target, for example:
\G([^a']*+(?:'[^']*'[^a']*|\Ba+[^a']*|a(?!\w*d\b)[^a']*)*+)\ba\w*d\b
details:
\G # matches the start of the string or the position after the previous match
(
[^a']*+ # all that isn't an "a" or a quote
(?:
'[^']*' [^a']* # content between quotes
|
\Ba+ [^a']* # "a" not at the start of a word
|
a(?!\w*d\b) [^a']* # "a" at the start of a word that doesn't end with "d"
)*+
) # all that can be before the target in a capture group
\ba\w*d\b # the target
Don't forget to escape backslashes in the java string: \ => \\.
To perform the replacement, you need to refer to the capture group 1:
$1temp
Note: to handle escaped quotes between quotes, change '[^']*' to: '[^\\']*+(?s:\\.[^\\']*)*+'.
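As a sketch of how the two approaches could look in Java (the class and method names are hypothetical; the second method uses the \G pattern above without escaped-quote handling and without the (?i) flag):

```java
class QuoteAwareReplace {
    // Lookaround version: only checks the characters directly around the word.
    static String replaceNaive(String s) {
        return s.replaceAll("(?i)(?<!')\\ba[a-zA-Z0-9_.]*d\\b(?!')", "temp");
    }

    // \G version: consumes everything before the target, including whole
    // quoted sections, into group 1 and puts it back with $1.
    static String replaceOutsideQuotes(String s) {
        String regex = "\\G([^a']*+(?:'[^']*'[^a']*|\\Ba+[^a']*|a(?!\\w*d\\b)[^a']*)*+)\\ba\\w*d\\b";
        return s.replaceAll(regex, "$1temp");
    }

    public static void main(String[] args) {
        String input = "'x abcd y' abcd";
        System.out.println(replaceNaive(input));         // prints: 'x temp y' temp
        System.out.println(replaceOutsideQuotes(input)); // prints: 'x abcd y' temp
    }
}
```

The first call also replaces the word inside the quotes because no quote is directly adjacent to it; the second one skips the whole quoted section.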
Demo: click the Java button.

Regular expression that matches "{$" AND NOT matches "\{$"

I am working on a project with lexical analysis and basically I have to generate tokens that are text and that are not text.
Tokens that are text are considered all characters until the "{$" sequence.
Tokens that are not text are considered all characters inside the "{$" and "$}" sequences.
Note that the "{$" character sequence can be escaped by writing "\{$" so this also becomes a part of text.
My job is to read a String of text, and for that I am using Regular expressions.
I am using the Java Scanner and Pattern classes and this is my work so far:
String text = "This is \\{$ just text$}\nThis is {$not_text$}.";
Scanner sc = new Scanner(text);
Pattern textPattern = Pattern.compile("{\\$"); // insert working regex here
sc.useDelimiter(textPattern);
System.out.println(sc.next());
This is what should be printed out:
This is \{$ just text$}
This is
How do I make a regex for the following logical statement:
match "{$" AND NOT match "\{$"
You can use Negative Look-Behind (?<!\\) in front of \{\$ to ensure that escaped curly braces are not matched:
(?<!\\)\{\$
Demo
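A minimal sketch of the look-behind used as a Scanner delimiter, following the question's setup (the class and method names are invented for illustration). Note the assumption baked into this pattern: any {$ directly preceded by a backslash counts as escaped, even when that backslash is itself escaped.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

class UnescapedDelimiter {
    // "{$" that is not immediately preceded by a backslash.
    private static final Pattern OPEN_TAG = Pattern.compile("(?<!\\\\)\\{\\$");

    // Hypothetical helper: returns the text before the first unescaped "{$".
    static String firstTextToken(String text) {
        try (Scanner sc = new Scanner(text)) {
            sc.useDelimiter(OPEN_TAG);
            return sc.next();
        }
    }

    public static void main(String[] args) {
        String text = "This is \\{$ just text$}\nThis is {$not_text$}.";
        System.out.println(firstTextToken(text));
        // prints:
        // This is \{$ just text$}
        // This is
    }
}
```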
Possible solution:
String text = "This is \\{$ just text$}\nThis is {$not_text$}.";
Pattern textPattern = Pattern.compile(
        "(?<text>(?:\\\\.|(?!\\{\\$).)+)"      // text - `\x` or non-start-of `{$`
        + "|"                                   // OR
        + "(?<nonText>\\{\\$.*?\\$\\})");       // non-text
Matcher m = textPattern.matcher(text);
while (m.find()) {
    if (m.group(1) != null) {
        System.out.println("text : " + m.group("text"));
    } else {
        System.out.println("non-text : " + m.group("nonText"));
    }
}
System.out.println("\01234");
Explanation:
From what I see, you want \ to be special character used for escaping.
Problem now is to determine where \ is meant to escape character/sequence after it, and when it should be treated as simple printable character (literal).
(possible problem)
Lets say that you have text dir1\dir2\ and you want to add after it non-text foo. How would you write it?
You could try writing dir1\dir2\{$foo$} but this could mean that you just escaped {$ which would prevent foo from being seen as non-text.
In Java, String literals faced same problem since \ can be used to create other special characters using
pairs \n \r \t \"
Unicode codepoints \uFFFF
octal format \012.
Solution used in Java (and many other languages) was making \ always special character which to create \ literal required escaping it with another \ (there was no real need to add yet another special character for that). So to represent \ we need to write it as \\.
So if we have text dir1\dir2\ we would need to write it as dir1\\dir2\\. This would allow us to concatenate to it {$non-text$} without fear that this last \\ placed right before {$ will be causing misinterpretation of it and prevent seeing it as non-text sequence.
So now when we see dir1\\dir2\\{$foo$} we can interpret {$ properly.
From this point I am assuming you are also using this approach which ensures proper interpretation of \.
Now, lets try to create rule which will let us find/separate text and non-text characters.
Based on our example we know that dir1\\dir2\\{$foo$} is: text dir1\\dir2\\ and non-text {$foo$}.
So, as you can see, splitting on a {$ that is not preceded by \ can sometimes fail (namely when {$ is preceded by an even, non-zero number of \ characters, as in the example above).
A probably simpler solution is to accept
for text:
\\. - regex representing any character preceded by \ (this handles the \\ literal and the escaped \{, which in turn lets the rest of the $..$} part be accepted as text)
(?!\{\$). - regex representing any character that is not the { starting a {$ area.
for non-text:
\{\$.*?\$\} - regex representing {$...$}; we know it is unescaped because the \\. alternative has already accepted all escaped characters.

Regular expression to match strings enclosed in square brackets or double quotes

I need 2 simple reg exps that will:
Match if a string is contained within square brackets ([] e.g [word])
Match if string is contained within double quotes ("" e.g "word")
\[\w+\]
"\w+"
Explanation:
The \[ and \] escape the special bracket characters to match their literals.
The \w means "any word character", usually considered same as alphanumeric or underscore.
The + means one or more of the preceding item.
The " are literal characters.
NOTE: If you want to ensure the whole string matches (not just part of it), prefix with ^ and suffix with $.
And next time, you should be able to answer this yourself, by reading regular-expressions.info
Update:
Ok, so based on your comment, what you appear to be wanting to know is if the first character is [ and the last ] or if the first and last are both " ?
If so, these will match those:
^\[.*\]$ (or ^\\[.*\\]$ in a Java String)
^".*"$
However, unless you need to do some special checking with the centre characters, you could simply use:
if ( MyString.startsWith("[") && MyString.endsWith("]") )
and
if ( MyString.startsWith("\"") && MyString.endsWith("\"") )
I suspect this would be faster than a regex.
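Both variants could be sketched in Java like this (the class and method names are made up; the length check is an addition of this sketch, so that a lone " is not counted as a quoted string):

```java
class DelimiterCheck {
    // String.matches requires the whole string to match, so ^ and $ are implied.
    static boolean bracketedByRegex(String s) {
        return s.matches("\\[.*\\]");
    }

    static boolean bracketedByPrefixSuffix(String s) {
        return s.length() >= 2 && s.startsWith("[") && s.endsWith("]");
    }

    static boolean quotedByRegex(String s) {
        return s.matches("\".*\"");
    }

    static boolean quotedByPrefixSuffix(String s) {
        return s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"");
    }

    public static void main(String[] args) {
        System.out.println(bracketedByRegex("[word]"));        // prints: true
        System.out.println(bracketedByPrefixSuffix("[word]")); // prints: true
        System.out.println(quotedByRegex("\"word\""));         // prints: true
        System.out.println(quotedByPrefixSuffix("\""));        // prints: false
    }
}
```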
Important issues that may make this hard/impossible in a regex:
Can [] be nested (e.g. [foo [bar]])? If so, then a traditional regex cannot help you. Perl's extended regexes can, but it is probably better to write a parser.
Can [, ], or " appear escaped (e.g. "foo said \"bar\"") in the string? If so, see How can I match double-quoted strings with escaped double-quote characters?
Is it possible for there to be more than one instance of these in the string you are matching? If so, you probably want to use the non-greedy quantifier modifier (i.e. ?) to get the smallest string that matches: /(".*?"|\[.*?\])/g
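A Java counterpart of that non-greedy expression might look like the following sketch (the class and method names are hypothetical; nesting and escaped delimiters are not handled):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShortestDelimited {
    // Reluctant quantifiers (.*?) stop at the first closing delimiter.
    private static final Pattern DELIMITED = Pattern.compile("\".*?\"|\\[.*?\\]");

    static List<String> findAll(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = DELIMITED.matcher(s);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(findAll("say \"hi\" and [x] then \"bye\""));
        // prints: ["hi", [x], "bye"]
    }
}
```

With the greedy .* instead, the first alternative would match everything from the first " to the last one in a single match.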
Based on comments, you seem to want to match things like "this is a "long" word"
#!/usr/bin/perl
use strict;
use warnings;
my $s = 'The non-string "this is a crazy "string"" is bad (has own delimiter)';
print $s =~ /^.*?(".*").*?$/, "\n";
Are they two separate expressions?
[[A-Za-z]+]
\"[A-Za-z]+\"
If they are in a single expression:
[[\"]+[a-zA-Z]+[]\"]+
Remember that in .net you'll need to escape the double quotes " by ""
