Java Regex find/replace pattern in SQL comments

I want to find/replace a character/pattern ONLY inside SQL comments (single-line comments -- and block comments /* */). The source string is an SQL script.
At the moment I am searching for a semicolon (;) within comments and want to replace it with a blank space.
Source
CREATE OR REPLACE PROCEDURE TESTDBA.PROC_REUSING_BINDED_VAR_N_DSQL
AS
a NUMBER:=2;
b NUMBER:=3; -- jladjfljaf; lakjflajf
-- alksdjflkjaf ladkjf
v_plsql_tx VARCHAR2(2000);
begin
v_plsql_tx := 'BEGIN ' || ' :1 := :1 + :2; ' || 'END;';
execute immediate v_plsql_tx
using in out a, b;
insert into testdba.NEW_TESTING_TABLE(CHARACTER_VALUE) VALUES('a='||a);
end PROC_REUSING_BINDED_VAR_N_DSQL;
-- lajdflajf
/*lakjdfljalfdk; alkdjf*/
/*asdf
;
asdfa*/
/*
adf
asd asdf
*/
Can you please suggest something.
Thanks

I would do it like this:
try {
Pattern regex = Pattern.compile("(?:/\\*[^;]*?\\*/)|(?:--[^;]*?$)", Pattern.DOTALL | Pattern.MULTILINE);
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
// matched text: regexMatcher.group()
// match start: regexMatcher.start()
// match end: regexMatcher.end()
}
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}
The above will give you all comments that contain no ';'. I would then go through the SQL file line by line, and whenever I hit the start of a comment I would check whether that comment is in my list of matches. If it is not, the comment contains a ';', so I would replace every ; with ' ' in the whole comment. Of course you have to find where the comment ends, but that is easy: a -- comment ends at the end of the same line, and a /* comment ends at the first */. This way the same code handles any number of ; in a comment.
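A minimal sketch of a single-pass variation on this idea (not the exact line-by-line procedure described above). It assumes non-nested block comments and standard single-quoted SQL strings; the class and method names are made up for illustration. String literals are matched first so that a "--" or "/*" inside a literal is not mistaken for a comment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: blanks out ';' inside SQL comments only.
final class CommentSemicolonStripper {
    // Alternation order matters: a quoted string is consumed first,
    // so comment markers inside a literal are skipped.
    private static final Pattern TOKEN = Pattern.compile(
            "'(?:[^']|'')*'"       // single-quoted string ('' is an escaped quote)
          + "|(--[^\\r\\n]*)"      // group 1: line comment up to end of line
          + "|(/\\*.*?\\*/)",      // group 2: block comment, shortest match, not nested
            Pattern.DOTALL);

    static String blankSemicolonsInComments(String sql) {
        Matcher m = TOKEN.matcher(sql);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String token = m.group();
            boolean isComment = m.group(1) != null || m.group(2) != null;
            // Strings are copied back unchanged; comments lose their semicolons.
            String replacement = isComment ? token.replace(';', ' ') : token;
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String sql = "b NUMBER:=3; -- one; two\n"
                   + "v := 'END;';\n"
                   + "/*x; y*/";
        System.out.println(blankSemicolonsInComments(sql));
    }
}
```

Semicolons in code and inside string literals are left alone; only those inside -- and /* */ comments become spaces. Quoted identifiers and vendor-specific quoting are not handled here.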

Probably the best bet is to use two regexes and two Patterns (one single line and one multi-line).
Single Line: \\-\\-[^;]*(;) -- not sure the best way to find multiple ; within a line
Multi-line: /\\*[^;(\\*/)]*?(;)[^;]*?\\*/ -- something like this anyway

What you need to find out first: There are two ways that multi-line comments can be handled.
A single "*/" closes all currently open "/*".
For every "/*" you need a corresponding "*/" (nested comments).
The first one is relatively easy to implement. The second can only be done by either deep magic regex (read: unmaintainable by future coders) or with a short program.
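The first one is pretty easy: Using "/\*.*;.*\*/" (with DOTALL enabled) will give you a match whenever there is an embedded semicolon. Note that .* is greedy and can run across a "*/" into the code that follows, so a stricter form such as "/\*(?:(?!\*/).)*?;.*?\*/" is safer.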
The second one will need a bit of programming. If you encounter a ";", you will need to check if you're currently inside a comment. You can know by just sequentially reading the file (ignoring the carriage return/line feeds) and incrementing a number whenever you encounter a "/*" and decrement the number when encountering a "*/". If the number is at least 1, your semicolon is inside a comment.
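The counter described above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: it handles nested /* */ comments, but ignores -- comments and string literals, and the class and method names are invented.

```java
// Hypothetical helper: tracks /* ... */ nesting depth while scanning the script.
final class NestedCommentScanner {
    static String blankSemicolonsInNestedComments(String sql) {
        StringBuilder out = new StringBuilder(sql.length());
        int depth = 0; // number of currently open "/*"
        int i = 0;
        while (i < sql.length()) {
            if (sql.startsWith("/*", i)) {
                depth++;
                out.append("/*");
                i += 2;
            } else if (depth > 0 && sql.startsWith("*/", i)) {
                depth--;
                out.append("*/");
                i += 2;
            } else {
                char c = sql.charAt(i++);
                // A semicolon is inside a comment when at least one "/*" is open.
                out.append(depth > 0 && c == ';' ? ' ' : c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(blankSemicolonsInNestedComments("a; /* b; /* c; */ d; */ e;"));
    }
}
```

For the non-nested variant, it is enough to reset the depth to 0 on the first "*/" instead of decrementing it.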

Related

Reading text from a file using regular expressions

I have a text file containing numbers and characters, broken into 3 columns, and I can't figure out what regular expression I need. The columns are separated by ; and after the third column the data continues on the next line. I know the majority of my code is working properly, and I've narrowed down the problem to this section of code.
I've tried looking up Java regular expressions and I can't seem to find what I'm trying to accomplish.
while ((line = br.readLine()) != null) {
// Searches the file that matches a specific value
if (!line.isEmpty() || line.matches("Need regular expression here that skips over the two columns and reads the last")) {
if (isValid(line)) {
System.out.println(line + "IS Valid");
} else {
System.out.println(line + "IS NOT VALID");
}
}
}
In the console after reading the file it should say
"12345";"12";"tacobell#yahoo.com"; IS valid
"123456";"31";"Taco . bell#yahoo.com"; IS NOT VALID
It must contain the whole line when writing out to the console not just the third column.
^[^;]*;[^;]*;([^ ]*);$
That will give you a match only if the third column contains no spaces (so it will match "12345";"12";"tacobell#yahoo.com";, but it will not match "123456";"31";"Taco . bell#yahoo.com";).
The parentheses are a capture group, so you can extract that column by grabbing group #1 (not group #0) from the capture results.
The ^ at the beginning means that this pattern has to start at the beginning of a line, and the $ at the end means that this pattern has to end at the end of a line. If that's not the case for your input, you will have to adjust it. For example, if you had trailing whitespace after the last column, you might do:
^[^;]*;[^;]*;([^ ]*);[ ]*$
If you had trailing whitespace and the last semicolon was optional, you'd do:
^[^;]*;[^;]*;([^ ]*);?[ ]*$
One last thing: I'm using [ ] to indicate whitespace, but that only includes the basic space character. It doesn't include tabs, newlines, or any other type of whitespace. It's better to use \s if you want to include all of those, but in Java string syntax you have to escape the backslash, so it would look like this:
Pattern.compile("^[^;]*;[^;]*;([^ ]*);?\\s*$")
This is the reason why well-designed programming languages have a specialized regular expression syntax. It gets even crazier if you want to match a literal backslash:
Pattern.compile("\\\\")
In JavaScript, this would just be:
/\\/
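Here is a small sketch of how the pattern could be used to validate a whole line and still pull out the third column. One adjustment that is my own and not from the answer above: the capture group uses [^ ;]* instead of [^ ]*, because once the trailing semicolon is optional the greedy [^ ]* would otherwise swallow that semicolon into group 1. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper around the three-column pattern.
final class ThirdColumnCheck {
    // Two columns without ';', then a third column without spaces or ';',
    // an optional trailing ';' and optional trailing whitespace.
    private static final Pattern ROW =
            Pattern.compile("^[^;]*;[^;]*;([^ ;]*);?\\s*$");

    // Returns the third column, or null when the line does not match.
    static String thirdColumn(String line) {
        Matcher m = ROW.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    // Builds the console message for the whole line, not just the third column.
    static String describe(String line) {
        return line + (thirdColumn(line) != null ? " IS valid" : " IS NOT VALID");
    }

    public static void main(String[] args) {
        System.out.println(describe("\"12345\";\"12\";\"tacobell#yahoo.com\";"));
        System.out.println(describe("\"123456\";\"31\";\"Taco . bell#yahoo.com\";"));
    }
}
```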

Regex Pattern to negate a character

I have a statement like the one below:
#sv_q = " INSERT INTO alertuser.REALTIME
I'm trying to create a regex that matches this kind of line, from a set of lines in a file, whenever the insert keyword is used. But since this line has a '#' at the beginning, I don't want it to be considered. I tried the regex below, but the line still gets matched. Can anyone suggest how to achieve this?
static Pattern tracePattern = Pattern.compile("(?!\\#).*insert\\s*into",Pattern.CASE_INSENSITIVE);
Matcher localMatcher = tracePattern.matcher(line);
if (localMatcher.find()) {
// doing some checks
}
If you want to use .find(), you need to anchor the pattern at the beginning with ^ or \A. Besides, you'd better use word boundaries to only match whole words insert and into, and use \s+ instead of \s* to enforce at least 1 occurrence of a whitespace between insert and into:
"^(?!#).*\\binsert\\s+into\\b"
You could shorten your solution to
if (s.matches("(?i)(?!#).*\\binsert\\s+into\\b.*")) { /* doing some checks */ }
The matches() method requires a full string match, and thus, you need to add the .* at the end. Also, the Pattern.CASE_INSENSITIVE option can be used inline with the help of the embedded flag option (?i). If your input can contain line breaks, use (?si) instead of (?i).
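As a quick illustration of the anchored find() variant, here is a sketch that compiles the pattern once and tests single lines. The class and method names are made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper: true when a line contains "insert into" and does not start with '#'.
final class InsertLineFilter {
    private static final Pattern INSERT_NOT_COMMENTED =
            Pattern.compile("^(?!#).*\\binsert\\s+into\\b", Pattern.CASE_INSENSITIVE);

    static boolean isActiveInsert(String line) {
        // find() is enough because the pattern is anchored with ^.
        return INSERT_NOT_COMMENTED.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(isActiveInsert("#sv_q = \" INSERT INTO alertuser.REALTIME"));
        System.out.println(isActiveInsert("sv_q = \" INSERT INTO alertuser.REALTIME"));
    }
}
```

If the '#' can be preceded by indentation, the lookahead would need to become (?!\\s*#).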

Java Regular Expression Negative Look Ahead Finding Wrong Match

Assume I have the following string.
create or replace package test as
-- begin null; end;/
end;
/
I want a regular expression that will find the semicolon not preceded by a set of "--" double dashes on the same line. I'm using the following pattern "(?!--.*);" and I'm still getting matches for the two semicolons on the 2nd line.
I feel like I'm missing something about negative look aheads but I can't figure out what.
If you want to match semicolons only on the lines which do not start with --, this regex should do the trick:
^(?!--).*(;)
Example
I only made a few changes from your regex:
Multi-line mode, so we can use ^ and $ and search by line
^ at the beginning to indicate start of a line
.* between the negative lookahead and the semicolon, because otherwise with the first change it would try to match something like ^;, which is wrong
(I also added parentheses around the semicolon so the demo page displays the result more clearly, but this is not necessary and you can change to whatever is most convenient for your program.)
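A short sketch of this pattern in Java, assuming the text uses \n line breaks; the class and method names are invented. Because .* is greedy, only the last semicolon on each qualifying line is captured, and a trailing comment on a code line (for example "x := 1; -- a; b") is not excluded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: offsets of the last ';' on every line that does not start with "--".
final class UncommentedLineSemicolons {
    private static final Pattern P =
            Pattern.compile("^(?!--).*(;)", Pattern.MULTILINE);

    static List<Integer> semicolonOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = P.matcher(text);
        while (m.find()) {
            offsets.add(m.start(1)); // position of the captured semicolon
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "create or replace package test as\n"
                    + "-- begin null; end;/\n"
                    + "end;\n"
                    + "/";
        System.out.println(semicolonOffsets(text));
    }
}
```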
First of all, what you need is a negative lookbehind (?<!) and not a negative lookahead (?!) since you want to check what's behind your potential match.
Even with that, you won't be able to use a negative lookbehind in your case, since Java's regex engine does not support unbounded-length lookbehind such as (?<!--.*). It needs to know the maximum number of characters to look behind your potential match.
With that said, wouldn't it be simpler in your case to just split your String on line breaks and then remove the lines that start with "--"?
The reason "(?!--.*);" isn't working is that the negative lookahead asserts, when positioned before a ;, that the next two chars are not --, which of course succeeds every time (; is never --).
In Java, to match a ; that doesn't have -- anywhere before it:
"\\G(((?<!--)[^;])*);"
To see this in action using a replaceAll() call:
String s = "foo; -- begin null; end;";
s = s.replaceAll("\\G(((?<!--)[^;])*);", "$1!");
System.out.println(s);
Output:
foo! -- begin null; end;
This shows that only semicolons before a double dash are matched.

How to determine substring OR match?

I have a regex that has an | (or) in it and I would like to determine what part of the or matched in the regex:
Possible Inputs:
-- Input 1 --
Stuff here to keep.
-- First --
all of this below
gets
deleted
-- Input 2 --
Stuff here to keep.
-- Second --
all of this below
gets
deleted
Regex to match part of an incoming input source and determine what part of the | (or) was matched? "-- First --" or "-- Second --"
Pattern PATTERN = Pattern.compile("^(.*?)-+ *(?:First|Second) *-+", Pattern.DOTALL);
Matcher m = PATTERN.matcher(text);
if (m.find()) {
// How can I tell if the regex matched "First" or "Second"?
}
How can I tell which input was matched (First or Second)?
The regular expression does not contain that information. However, you could use some additional groups to figure it out.
Example pattern: (?:(First)|(Second))
On the string First the second capture group will not take part in the match, and on Second the first one will not. In Java, group(n) returns null for a group that did not participate, so a simple null check on the groups will tell you which part of the regex matched.
EDIT: I assumed that First and Second were used as placeholders for the sake of simplicity and actually represent more complex expressions. If you are really looking to find which of two strings was matched, then having a single capture group (like this: (First|Second)) and comparing its content with First will do the job just fine.
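Applied to the pattern from the question, the extra-groups idea might look like the sketch below. The class and method names are hypothetical; group 1 is the text to keep, and groups 2 and 3 are the two alternatives.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reports which alternative of the regex matched.
final class WhichAlternative {
    private static final Pattern P =
            Pattern.compile("^(.*?)-+ *(?:(First)|(Second)) *-+", Pattern.DOTALL);

    static String matchedBranch(String text) {
        Matcher m = P.matcher(text);
        if (!m.find()) {
            return "none";
        }
        // Only one of the two groups participates; the other one is null.
        return m.group(2) != null ? "First" : "Second";
    }

    public static void main(String[] args) {
        String input = "-- Input 1 --\nStuff here to keep.\n-- First --\nall of this below";
        System.out.println(matchedBranch(input));
    }
}
```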
Because RegExes are stateless there is no way to tell by using only one regex.
The solution is to use two different RegExes and make a case decision.
However, you can use group() which returns the last match as String.
You can test this for .contains("First").
if(m.group().contains("First")) {
// case 1
} else {
// case 2
}

Differentiating SQL strings from comments

I have files of SQL code that I want to beautify, and I'm having trouble with differentiating between whether a certain line/part of the code is a String or a comment.
My current process is I do a Pattern/Matcher search through the file and pull out the strings with the regex N?'([']{2}|[^'])*+'(?!') and the comments with \s*--.*?\n|/\*.*?\*/, and put them in their respective storage arrays to avoid formatting them.
EXAMPLES:
WHERE y = 'STRING' -> WHERE y = THIS_IS_A_STRING and strings[0] = 'STRING'
SELECT x --do not format-> SELECT x THIS_IS_A_COMMENT and comments[0] = --do not format
After beautifying everything, I then go through and search for THIS_IS_A_STRING and THIS_IS_A_COMMENT and restore their respective values from the arrays.
The problem I'm running into is if a comment has an apostrophe in it, or if a SQL string has double dashes in it. I can fix one problem, but it causes the other, depending on whether I choose to preserve strings or comments first.
For example:
--Don't format this with preserving strings going first will match 'nt format this all the way through to the next ', (due to the ability to have multiline strings).
On the flip side, if I choose to preserve comments first:
SELECT x FROM y WHERE z = '--THIS_IS_AS_STRING--', it will detect the -- and store everything until the next newline into the comments array.
Any help would greatly be appreciated.
EDIT: I know I should probably do this with a SQL parser, but I have been working on this mainly with regex, and this is the last step I need to finish.
I made this regexp:
/^(([^\\'"\-]+|\-[^\\'"\-]|\\.)+|-?'([^\\']+|\\.)+'|-?"([^\\"]+|\\.)+")+\-\-[^\n]+/
It matches these rules for SQL comments:
A comment row ends with --, the comment text, and a line break.
Before the comment we can have:
any chars except \'"-
a - if not followed by any of \'"-
a \ followed by any character, including \'"-
a pair of ' that doesn't have a ' between them, unless it has an odd number of \ in front of it.
a pair of " that doesn't have a " between them, unless it has an odd number of \ in front of it.
The pairs can have a single - in front of them, but not two.
Did I miss something?
This link may help:
Java Regex find/replace pattern in SQL comments
I'll paste the code here:
try {
Pattern regex = Pattern.compile("(?:/\\*[^;]*?\\*/)|(?:--[^;]*?$)", Pattern.DOTALL | Pattern.MULTILINE);
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
// matched text: regexMatcher.group()
// match start: regexMatcher.start()
// match end: regexMatcher.end()
}
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}
I would replace the comment first, and then use the replaced string as input for string regex. This way the regex will not confuse string and comment.
While I realize that Song is looking for a regex solution for this problem, I would like to point out that SQL is not regular (https://stackoverflow.com/a/5639859/2503659), hence no regex solution exists.
With that said, I think others have given good solutions for common scenarios.
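One common way around the "which goes first" problem is to match strings and comments in a single pass, so whichever token starts first in the text wins. The following is only a sketch of that idea and not a full SQL parser: it assumes non-nested block comments and '' as the only quote escape, and the class, field and group names are invented. The placeholder names are taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: swaps strings and comments for placeholders in one pass.
final class SqlProtector {
    private static final Pattern TOKEN = Pattern.compile(
            "(?<str>N?'(?:''|[^'])*')"              // string literal, optional N prefix
          + "|(?<cmt>--[^\\r\\n]*|/\\*.*?\\*/)",    // line comment or block comment
            Pattern.DOTALL);

    final List<String> strings = new ArrayList<>();
    final List<String> comments = new ArrayList<>();

    String protect(String sql) {
        Matcher m = TOKEN.matcher(sql);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            if (m.group("str") != null) {
                strings.add(m.group("str"));
                m.appendReplacement(out, "THIS_IS_A_STRING");
            } else {
                comments.add(m.group("cmt"));
                m.appendReplacement(out, "THIS_IS_A_COMMENT");
            }
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        SqlProtector p = new SqlProtector();
        System.out.println(p.protect("SELECT x --Don't format\nFROM y WHERE z = '--THIS_IS--'"));
        System.out.println(p.comments + " " + p.strings);
    }
}
```

The apostrophe in the comment never starts a string because the comment has already consumed it, and the dashes inside the literal never start a comment for the same reason.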
