Differentiating SQL strings from comments - java

I have files of SQL code that I want to beautify, and I'm having trouble with differentiating between whether a certain line/part of the code is a String or a comment.
My current process is I do a Pattern/Matcher search through the file and pull out the strings with the regex N?'([']{2}|[^'])*+'(?!') and the comments with \s*--.*?\n|/\*.*?\*/, and put them in their respective storage arrays to avoid formatting them.
EXAMPLES:
WHERE y = 'STRING' -> WHERE y = THIS_IS_A_STRING and strings[0] = 'STRING'
SELECT x --do not format-> SELECT x THIS_IS_A_COMMENT and comments[0] = --do not format
After beautifying everything, I then go through and search for THIS_IS_A_STRING and THIS_IS_A_COMMENT and restore their respective values from the arrays.
The problem I'm running into is if a comment has an apostrophe in it, or if a SQL string has double dashes in it. I can fix one problem, but it causes the other, depending on whether I choose to preserve strings or comments first.
For example:
--Don't format this with preserving strings going first will match 'nt format this all the way through to the next ', (due to the ability to have multiline strings).
On the flip side, if I choose to preserve comments first:
SELECT x FROM y WHERE z = '--THIS_IS_AS_STRING--', it will detect the -- and store everything until the next newline into the comments array.
Any help would greatly be appreciated.
EDIT: I know I should probably do this with a SQL parser, but I have been working on this mainly with regex, and this is the last step I need to finish.

I made this regexp:
/^(([^\\'"\-]+|\-[^\\'"\-]|\\.)+|-?'([^\\']+|\\.)+'|-?"([^\\"]+|\\.)+")+\-\-[^\n]+/
to match these rules for SQL comments:
A comment row ends with --, the comment text, and a line break.
Before the comment we can have:
any characters except \'"-
a - if not followed by any of \'"-
a \ followed by any character, including \'"-
a pair of ' that doesn't have a ' between them, unless it has an odd number of \ in front of it.
a pair of " that doesn't have a " between them, unless it has an odd number of \ in front of it.
the pairs can have a single - in front of them, but not two
Did I miss something?

This link may help:
Java Regex find/replace pattern in SQL comments
I paste the code here
try {
Pattern regex = Pattern.compile("(?:/\\*[^;]*?\\*/)|(?:--[^;]*?$)", Pattern.DOTALL | Pattern.MULTILINE);
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
// matched text: regexMatcher.group()
// match start: regexMatcher.start()
// match end: regexMatcher.end()
}
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}
I would replace the comment first, and then use the replaced string as input for string regex. This way the regex will not confuse string and comment.

While I realize that Song is looking for a regex solution for this problem, I would like to point out that SQL is not regular (https://stackoverflow.com/a/5639859/2503659), hence no regex solution exists.
With that said, I think others have given good solutions for common scenarios.
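One way around the ordering problem from the question is to look for strings and comments in a single pass, so that whichever token starts first wins. The sketch below is only an illustration under stated assumptions, not a full SQL lexer: the class name SqlMasker and its fields are hypothetical, the string alternative is a simplified form of the question's regex, and dialect features such as nested block comments or backslash escapes are assumed to be absent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: masks string literals and comments in one left-to-right scan.
final class SqlMasker {
    // One alternation: string literal | line comment | block comment.
    // The regex engine tries all three at each position, so the token that starts first wins.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<str>N?'(?:''|[^'])*')|(?<line>--[^\\n]*)|(?<block>/\\*.*?\\*/)",
            Pattern.DOTALL);

    final List<String> strings = new ArrayList<>();
    final List<String> comments = new ArrayList<>();

    String mask(String sql) {
        Matcher m = TOKEN.matcher(sql);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(sql, last, m.start());      // copy untouched SQL between tokens
            if (m.group("str") != null) {
                strings.add(m.group());
                out.append("THIS_IS_A_STRING");
            } else {
                comments.add(m.group());
                out.append("THIS_IS_A_COMMENT");
            }
            last = m.end();
        }
        out.append(sql, last, sql.length());       // copy the tail after the last token
        return out.toString();
    }
}
```

Because an apostrophe inside a comment is consumed as part of the comment match, and a -- inside a string is consumed as part of the string match, neither can start a token of the other kind.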

Related

Regex function to find specific depth in recursive

I have the following scenario where I am supposed to use regex (Java/PCRE) on a line of code to strip off certain function calls and keep only the innermost value they wrap, as in the example below:
Input
ArrayNew(1) = adjustalpha(shadowcolor, CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))
Output : Replace Regex
ArrayNew(1) = adjustalpha(shadowcolor, Me.bezierviewshadow.getTag)
Here CInt, Math.Truncate, and ObjectToNumber are removed, leaving the output shown above.
The functions CInt and Math.Truncate keep changing to CStr, Math.Random, etc., so the regex cannot be hardcoded.
I tried a lot of options from Stack Overflow, but most did not work.
It would also be nice if the query were customizable, e.g. given CInt, return everything that the CInt call wraps (find a text, then everything between its first ( and the matching ), ignoring balanced parenthesis pairs in between).
I know it's not pretty, but it's your fault to use raw regex for this :)
@Test
void unwrapCIntCall() {
    String input = "ArrayNew(1) = adjustalpha(shadowcolor, CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))";
    String expectedOutput = "ArrayNew(1) = adjustalpha(shadowcolor, Me.bezierviewshadow.getTag)";
    // Three nested calls are matched literally; the innermost argument is captured as $1.
    String output = input.replaceAll("CInt\\s*\\(\\s*Math\\.Truncate\\s*\\(\\s*ObjectToNumber\\s*\\(\\s*(.*)\\s*\\)\\s*\\)\\s*\\)", "$1");
    assertEquals(expectedOutput, output);
}
Now some explanation; the \\s* parts allow any number of any whitespace character, where they are. In the pattern, I used (.*) in the middle, which means I match anything there, but it's fine*. I used (.*) instead of .* so that particular section gets captured as capturing group $1 (because $0 is always the whole match). The interesting part being captured, I can refer them in the replacement string.
*as long as you don't have multiple of such assignments within one string. Otherwise, you should break up the string into parts which contain only one such assignment and apply this replacement for each of those strings. Or, try (.*?) instead of (.*), it compiles for me - AFAIK that makes the .* match as few characters as possible.
If the methods actually being called vary, then replace their names in the regex with the variations you expect, e.g. replace CInt with (?:CInt|CStr), Math\\.Truncate with Math\\.(?:Truncate|Random), etc. (Using (?: instead of ( makes that group non-capturing, so it won't take up a $1, $2, etc. slot.)
If that gets too complicated, then you should really think about whether you want to do it with regex at all, or whether it'd be easier to just write a relatively longer function with plain string methods, like indexOf and substring :)
Bonus; if absolutely everything varies, but the call depth, then you might try this one:
String output = input.replaceAll("[\\w\\d.]+\\s*\\(\\s*[\\w\\d.]+\\s*\\(\\s*[\\w\\d.]+\\s*\\(\\s*(.*)\\s*\\)\\s*\\)\\s*\\)", "$1");
Yes, it's definitely a nightmare to read, but as far as I understand, you are after this monster :)
You can use ([^()]*) instead of (.*) to prevent deeper nested expressions. Note, that fine control of depth is a real weakness of everyday regular expressions.
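The plain string-method route mentioned above can be sketched as follows. This is a minimal illustration, assuming each wrapper takes exactly one argument and that parentheses are balanced; the class and method names (CallUnwrapper, matchingParen, innermost, unwrap) are hypothetical.

```java
// Hypothetical helper: unwraps nested single-argument calls without a fixed-depth regex.
final class CallUnwrapper {
    /** Index of the ')' matching the '(' at openIdx, or -1 if unbalanced. */
    static int matchingParen(String s, int openIdx) {
        int depth = 0;
        for (int i = openIdx; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                return i;
            }
        }
        return -1;
    }

    /** Strips wrappers of the form name(...) until the innermost argument remains. */
    static String innermost(String expr) {
        String e = expr.trim();
        int open = e.indexOf('(');
        // Stop unless the whole expression is exactly one call: name( ... )
        if (open <= 0 || matchingParen(e, open) != e.length() - 1) {
            return e;
        }
        if (!e.substring(0, open).trim().matches("[\\w.]+")) {
            return e;
        }
        return innermost(e.substring(open + 1, e.length() - 1));
    }

    /** Replaces the first call to fn in input with its innermost argument. */
    static String unwrap(String input, String fn) {
        int start = input.indexOf(fn);
        if (start < 0) {
            return input;
        }
        int open = input.indexOf('(', start + fn.length());
        int close = open < 0 ? -1 : matchingParen(input, open);
        if (close < 0) {
            return input;
        }
        return input.substring(0, start)
                + innermost(input.substring(open + 1, close))
                + input.substring(close + 1);
    }
}
```

Calling unwrap(input, "CInt") on the question's input works regardless of how many wrappers are nested inside CInt, which is the part a fixed-depth regex cannot express.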

How would I use regex to output a specific set of Strings java?

I have a String output of a very long row of movie titles and music album titles.
e.g.
[(Pixel Quality) (Year of Release) MovieTitle.ext,...... Albumname-artistname.ext]
i.e. [(HD 1080p) (2015) Batman vs Superman.mov,........tearsinheavan-ericclapton.mp3,.......]
I am trying to tell the movies and music apart using regular expressions. A movie has a pixel quality, a release date, a movie title and an extension (.mov, .flv, etc.), while music has an album name followed by a - and the artist name, with an extension (.mp3, .aax, ...).
The expected output would be (Pixel Quality) (Year of Release) MovieTitle.ext for a movie, and Albumname-artistname.ext for music.
I am not too familiar with regex; I only know how to match single characters or a specific word. I can't seem to output the whole pixel quality, year of release and movietitle.ext, only the specific words or single characters I've matched.
Method I used to try and find the "categories".
public void FindPatterns () {
String patternFilms = ("REGEX PATTERN?");
Pattern pattern = Pattern.compile(patternFilms);
for (String name : names) {
Matcher matcher = pattern.matcher(name);
while(matcher.find()){
System.out.println(matcher.group());
}
}
}
UPDATE:
I've attempted to fiddle around with the regex patterns in my code, and I get nothing but syntax errors being flagged, asking me to delete tokens. I can't find a clear enough example for what I am trying to achieve.
Just in case I've been putting the pattern in the wrong place this whole time: I've been putting the regex pattern in the String patternFilms, and "REGEX PATTERN?" is just a placeholder where I am asking if that is the correct place to put the pattern.
On the Java side of things, your code needs to extract each individual group as a named or indexed group. That's (relatively) the easy part though. Before you get to that point, it sounds like you need help with your pattern, so let's look at that first.
Build up your regex piece by piece. A tool that allows you to quickly iterate your regex is useful. I like https://regex101.com/.
What you need to do is select "matching groups" out of the input String. So you want to match everything that you can throw away (things like commas and parentheses), as well as the data you want to extract. For the data you want to extract, surround the regex for each of those pieces of data in parentheses to denote the group.
Your input strings have lots of characters that have special meaning inside a regex, like [ and (. So if you want to match them explicitly, they need to be "escaped". Also keep in mind that when you translate your regex to Java, the \ character is itself an escape for a Java String, so it needs to be escaped too with another \. So, for example, a regex to match a [ character would be defined like \\[.
So, start by matching the entire input:
^.*$
The ^ and $ characters are "anchors" that mean "beginning of the input" and "end of the input" respectively. The . just matches any character, and the * matches the previous token (any character) 0, 1, or more times (so everything).
In regex 101, this will highlight the entire input.
The entire string is surrounded with square brackets, so let's match those, and remember they need to be escaped:
^\[.*\]$
Now let's start breaking up the individual components. The first two are delimited by parentheses, and remember we need to escape parentheses, so let's match (something) (something) something:
^\[\(.*\) \(.*\) .*\]$
The whole input should be highlighted again. Let's pull out the two pieces of data we just identified by surrounding them in parentheses:
^\[\((.*)\) \((.*)\) .*\]$
Now you should see the matches highlighted and shown over on the right side. Now continue to build up the regex, replacing that last .* with more specific matches.
Comment on this answer if you run into any particular issue!
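Carried over to Java, the last pattern from the walkthrough might be used like this. It is a minimal sketch: the class name MovieEntryParser and the parse method are hypothetical, and the input is assumed to be a single bracketed movie entry rather than the full comma-separated list.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper showing how the capturing groups are read back in Java.
final class MovieEntryParser {
    // Same regex as the final step of the walkthrough; every \ is doubled
    // because \ is also the escape character inside a Java string literal.
    static final Pattern ENTRY = Pattern.compile("^\\[\\((.*)\\) \\((.*)\\) .*\\]$");

    /** Returns {quality, year}, or null if the input does not match. */
    static String[] parse(String input) {
        Matcher m = ENTRY.matcher(input);
        if (!m.matches()) {
            return null;
        }
        // group(0) is the whole match; group(1) and group(2) are the parenthesized parts.
        return new String[] { m.group(1), m.group(2) };
    }
}
```

Replacing the trailing .* with its own group, as the answer suggests, would expose the title and extension through group(3) in the same way.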
It looks like it's parenthesized and then comma separated, so something along the lines of ^\[\((.*)\)\((.*?)\),(.*),(.*)\]$
^ matches the start of the line, and $ matches the end of the line
\ escapes characters that have special regex meaning, like [. You need \[ and \( to match literal brackets and parentheses.
(...) marks a group, so that you can extract it when you get a match.
.* is just zero or more repetitions of any character. Use .+ to get one or more repetitions.
Also, add " *" where needed to match spaces.
An example in Perl:
echo "(hd)(2015) Avatar.ext, Douchebagson.ext" | perl -pe "s/^\((.*)\) *\((.*)\) *(.*) *, *(.*)$/\1,\2,\3,\4/g"
hd,2015,Avatar.ext,Douchebagson.ext
What's happening is a substitution. We're substituting the input string with <1st part>,<2nd part>,.... The result is a csv-format that can be interpreted by your language of choice, Excel or what ever.
\((.*)\) matches everything within parentheses. The parentheses are not part of the capturing group, since the literal parentheses \( and \) are outside the capturing clause (.*).
^ and $ are not necessary here, but can be used to anchor a match to the beginning or the end of the line.
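The same substitution can be written in Java with String.replaceAll, where $1 to $4 play the role of Perl's \1 to \4. A minimal sketch; the class and method names are hypothetical.

```java
// Hypothetical helper: Java counterpart of the Perl one-liner above.
final class CsvRewrite {
    static String toCsv(String line) {
        // Backslashes are doubled for the Java string literal; $n refers to capturing group n.
        return line.replaceAll("^\\((.*)\\) *\\((.*)\\) *(.*) *, *(.*)$", "$1,$2,$3,$4");
    }
}
```

For the sample line from the Perl example, toCsv returns hd,2015,Avatar.ext,Douchebagson.ext.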
Note: Since it's a school assignment, I'm not going to explain what's happening, so I'm leaving that to your imagination. You should be able to explain it to your teacher.
Try following code:
String data = "(HD 1080p) (2015) Batman vs Superman.mov," +
"tearsinheavan-ericclapton.mp3," +
"(HD 1080p) (2015) Batman vs Superman.mov," +
"tearsinheavan-ericclapton.mp3,(HD 1080p) (2015) Batman vs Superman.mov," +
"tearsinheavan-ericclapton.mp3,";
String rxString = "(?ism)(?<movie>\\(.*?\\) \\(\\d{4}\\).*?\\." +
"\\w+(?=[,\n]))|(?<music>[^(,\n]*?\\-[^,]+)";
Pattern regex = Pattern.compile(rxString);
Matcher regexMatcher = regex.matcher(data);
while (regexMatcher.find()) {
String movie = regexMatcher.group("movie");
String music = regexMatcher.group("music");
if(movie!=null) {
System.out.printf("Movie:\t%s\n", movie);
}
if(music!=null) {
System.out.printf("Music:\t%s\n", music);
}
}
It will printout:
Movie: (HD 1080p) (2015) Batman vs Superman.mov
Music: tearsinheavan-ericclapton.mp3
Movie: (HD 1080p) (2015) Batman vs Superman.mov
Music: tearsinheavan-ericclapton.mp3
Movie: (HD 1080p) (2015) Batman vs Superman.mov
Music: tearsinheavan-ericclapton.mp3

Java Regex find/replace pattern in SQL comments

I want to find/replace a character/pattern ONLY inside SQL comments ( single line comments -- and block comments /* */). The source string is an SQL script.
At the moment I am searching for a semi-colon (;) within comments and want to replace it with blank space.
Source
CREATE OR REPLACE PROCEDURE TESTDBA.PROC_REUSING_BINDED_VAR_N_DSQL
AS
a NUMBER:=2;
b NUMBER:=3; -- jladjfljaf; lakjflajf
-- alksdjflkjaf ladkjf
v_plsql_tx VARCHAR2(2000);
begin
v_plsql_tx := 'BEGIN ' || ' :1 := :1 + :2; ' || 'END;';
execute immediate v_plsql_tx
using in out a, b;
insert into testdba.NEW_TESTING_TABLE(CHARACTER_VALUE) VALUES('a='||a);
end PROC_REUSING_BINDED_VAR_N_DSQL;
-- lajdflajf
/*lakjdfljalfdk; alkdjf*/
/*asdf
;
asdfa*/
/*
adf
asd asdf
*/
Can you please suggest something.
Thanks
I would do this like this :
try {
Pattern regex = Pattern.compile("(?:/\\*[^;]*?\\*/)|(?:--[^;]*?$)", Pattern.DOTALL | Pattern.MULTILINE);
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
// matched text: regexMatcher.group()
// match start: regexMatcher.start()
// match end: regexMatcher.end()
}
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}
The above will give you all comments that contain no ';'. Then I would iterate line by line through the SQL file, and when I encountered a line that had a comment I would check whether that comment is in my list of matches. If not, I would replace ; with ' ' in the whole comment. Of course you will have to find where the comment ends, but this is easy: a -- comment ends at the end of the same line, and a /* comment ends at the first */. This way you can replace any number of ; with the same code.
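A more direct variant of the same idea is to match every comment and rewrite it on the spot. The sketch below is an assumption-laden illustration rather than the answerer's exact method: it also matches single-quoted string literals so that a -- or /* inside a string is skipped, it assumes block comments do not nest, and the class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: blanks out ';' inside SQL comments only.
final class CommentSemicolonStripper {
    // String literals come first so that "--" or "/*" inside quotes is not
    // mistaken for a comment; only comments are captured in group 1.
    private static final Pattern TOKEN = Pattern.compile(
            "'(?:''|[^'])*'|(--[^\\n]*|/\\*.*?\\*/)", Pattern.DOTALL);

    static String strip(String sql) {
        Matcher m = TOKEN.matcher(sql);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String comment = m.group(1);
            // Strings are copied back unchanged; comments get ';' replaced by ' '.
            String replacement = comment == null ? m.group() : comment.replace(';', ' ');
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Semicolons in code and inside string literals such as 'END;' are left alone, because only group 1 (the comment alternative) is modified.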
Probably the best bet is to use two regexes and two Patterns (one single line and one multi-line).
Single Line: \\-\\-[^;]*(;) -- not sure the best way to find multiple ; within a line
Multi-line: /\\*[^;(\\*/)]*?(;)[^;]*?\\*/ -- something like this anyway
What you need to find out first: There are two ways that multi-line comments can be handled.
A single "*/" closes all currently open "/*".
For every "/*" you need a corresponding "*/" (nested comments).
The first one is relatively easy to implement. The second can only be done by either deep magic regex (read: unmaintainable by future coders) or with a short program.
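The first one is pretty easy: using "/\*.*;.*\*/" will give you a match whenever there is an embedded semicolon. Note that . only crosses line breaks with DOTALL enabled, and that .* can run past the end of one comment into a later one, so a match may cover more than a single comment.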
The second one will need a bit of programming. If you encounter a ";", you will need to check if you're currently inside a comment. You can know by just sequentially reading the file (ignoring the carriage return/line feeds) and incrementing a number whenever you encounter a "/*" and decrement the number when encountering a "*/". If the number is at least 1, your semicolon is inside a comment.
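The counting approach for nested comments could look roughly like this. It is a sketch of the description above under the same simplifications: string literals and -- comments are ignored, and the class and method names are hypothetical.

```java
// Hypothetical helper: replaces ';' with ' ' wherever the /* ... */ nesting depth is at least 1.
final class NestedCommentScanner {
    static String blankSemicolons(String sql) {
        StringBuilder out = new StringBuilder(sql.length());
        int depth = 0; // number of currently open "/*"
        for (int i = 0; i < sql.length(); i++) {
            char c = sql.charAt(i);
            if (sql.startsWith("/*", i)) {
                depth++;
                out.append("/*");
                i++; // skip the '*'
            } else if (depth > 0 && sql.startsWith("*/", i)) {
                depth--;
                out.append("*/");
                i++; // skip the '/'
            } else {
                out.append(c == ';' && depth > 0 ? ' ' : c);
            }
        }
        return out.toString();
    }
}
```

A production version would also need to skip string literals, since a /* inside quotes would otherwise be counted as the start of a comment.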

Parsing quoted text in java

Is there an easy way to parse quoted text as a string in Java? I have lines like this to parse:
author="Tolkien, J.R.R." title="The Lord of the Rings"
publisher="George Allen & Unwin" year=1954
and all I want is Tolkien, J.R.R.,The Lord of the Rings,George Allen & Unwin, 1954 as strings.
You could either use a regex like
"(.+)"
It will match any characters between quotes. In Java that would be:
Pattern p = Pattern.compile("\"(.+)\"");
Matcher m = p.matcher("author=\"Tolkien, J.R.R.\"");
while (m.find()) {
    System.out.println(m.group(1));
}
Note that group(1) is used: it is the first capturing group (the text between the quotes), whereas group(0) is the full match including the quotes.
Of course, you could also use substring to select everything except the first and last character:
String quoted = "\"Tolkien, J.R.R.\"";
String unquoted;
// Strip the quotes only when the value both starts and ends with one.
if (quoted.length() >= 2 && quoted.startsWith("\"") && quoted.endsWith("\"")) {
    unquoted = quoted.substring(1, quoted.length() - 1);
} else {
    unquoted = quoted;
}
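For lines that hold several key=value pairs, a greedy "(.+)" would run from the first quote to the last one on the line. A possible refinement, shown here only as a sketch, uses [^"]* so each match stops at the next quote, and adds an alternative for unquoted values such as year=1954. The class name, the method name and the assumption that values never contain escaped quotes are all mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the values of key="quoted" and key=bare pairs.
final class KeyValueParser {
    // group(2): text between quotes; group(3): an unquoted value up to the next whitespace.
    private static final Pattern PAIR =
            Pattern.compile("(\\w+)=(?:\"([^\"]*)\"|(\\S+))");

    static List<String> values(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            out.add(m.group(2) != null ? m.group(2) : m.group(3));
        }
        return out;
    }
}
```

Applied to the two sample lines from the question, this yields Tolkien, J.R.R. and The Lord of the Rings for the first line, and George Allen & Unwin and 1954 for the second.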
There are some fancy pattern regex nonsense things that fancy people and fancy programmers like to use.
I like to use String.split(). It's a simple function and does what you need it to do.
So if I have a String word: "hello" and I want to take out "hello", I can simply do this:
myStr = string.split("\"")[1];
This will cut the string into bits based on the quote marks.
If I want to be more specific, I can do
myStr = string.split("word: \"")[1].split("\"")[0];
That way I cut it with word: " and "
Of course, you run into problems if word: " is repeated twice, which is what patterns are for. I don't think you'll have to deal with that problem for your specific question.
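Also, be cautious around characters like . that have a special meaning in regular expressions. split() takes a regex, so those characters will trigger funny behavior. I think that prefixing them with "\\" (which is a single \ to the regex engine) will escape those funny rules. Someone correct me if I'm wrong.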
Best of luck!
Can you presume your document is well-formed and does not contain syntax errors? If so, you are simply interested in every other token after using String.split().
If you need something more robust, you may need to use the Scanner class (or a StringBuffer and a for loop ;-)) to pick out the valid tokens, taking into account additional criteria beyond "I saw a quotation mark somewhere".
For example, some reasons you might need a more robust solution than splitting the string blindly on quotation marks: perhaps it's only a valid token if the quotation mark starting it comes immediately after an equals sign. Or perhaps you need to handle values that are not quoted as well as quoted ones? Will \" need to be handled as an escaped quotation mark, or does that count as the end of the string? Can it have either single or double quotes (e.g. HTML), or will it always be correctly formatted with double quotes?
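For the well-formed case, the "every other token" idea can be sketched like this. The class and method names are hypothetical, and the input is assumed to contain only balanced, unescaped double quotes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: after splitting on '"', the odd-indexed pieces are the quoted values.
final class QuoteSplitter {
    static List<String> quotedTokens(String line) {
        String[] parts = line.split("\"");
        List<String> values = new ArrayList<>();
        for (int i = 1; i < parts.length; i += 2) {
            values.add(parts[i]);
        }
        return values;
    }
}
```

Note that an unquoted value such as year=1954 lands in an even-indexed piece and is therefore not returned, which is one of the limitations listed above.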
One robust way would be to think like a compiler and use a Java based Lexer (such as JFlex), but that might be overkill for what you need.
If you prefer a low-level approach, you could iterate through your input stream character by character using a while loop, and when you see an =" start copying the characters into a StringBuffer until you find another non-escaped ", either concatenating to the various wanted parsed values or adding them to a List of some sort (depending on what you plan to do with your data). Then continue reading until you encounter your start token (eg: =") again, and repeat.
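The low-level loop described above might look like the following. It is a sketch under assumptions: the input is already available as a String rather than a stream, \ escapes the next character, StringBuilder stands in for StringBuffer, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: character-by-character scan for values that start with =" .
final class QuotedValueScanner {
    static List<String> scan(String input) {
        List<String> values = new ArrayList<>();
        int i = 0;
        // Jump to each start token (=") in turn.
        while ((i = input.indexOf("=\"", i)) >= 0) {
            i += 2; // move past the start token
            StringBuilder value = new StringBuilder();
            // Copy characters until the next non-escaped quote.
            while (i < input.length() && input.charAt(i) != '"') {
                if (input.charAt(i) == '\\' && i + 1 < input.length()) {
                    i++; // drop the backslash, keep the escaped character
                }
                value.append(input.charAt(i++));
            }
            values.add(value.toString());
            i++; // skip the closing quote
        }
        return values;
    }
}
```

Collecting into a List keeps the values separate; concatenating them instead is a one-line change with String.join.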

Java - Unknown characters passing as [a-zA-z0-9]*?

I'm no expert in regex but I need to parse some input I have no control over, and make sure I filter away any strings that don't have A-z and/or 0-9.
When I run this,
Pattern p = Pattern.compile("^[a-zA-Z0-9]*$"); //fixed typo
if(!p.matcher(gottenData).matches())
System.out.println(someData); //someData contains gottenData
certain spaces + an unknown symbol somehow slip through the filter (gottenData is the red rectangle):
In case you're wondering, it DOES also display Text, it's not all like that.
For now, I don't mind the [?] as long as it also contains some string along with it.
Please help.
[EDIT] As far as I can tell from the (very large) input, the [?]'s are either white space or nothing at all; maybe there's some sort of encoding issue, or perhaps it has something to do with #text nodes (the input is XML).
The * quantifier matches "zero or more", which means the pattern also matches an empty string, that is, one that contains none of the characters in your class. Try the + quantifier, which means "one or more": ^[a-zA-Z0-9]+$ will match strings made up of alphanumeric characters only. ^.*[a-zA-Z0-9]+.*$ will match any string containing one or more alphanumeric characters, although the leading .* will make it much slower. If you use Matcher.find() instead of Matcher.matches(), a full string match is not required and you can use the regex [a-zA-Z0-9]+. (Matcher.lookingAt() does not require a full match either, but it only matches at the start of the input.)
You have an error in your regex: instead of [a-zA-z0-9]* it should be [a-zA-Z0-9]*.
You don't need ^ and $ around the regex.
Matcher.matches() always matches the complete string.
String gottenData = "a ";
Pattern p = Pattern.compile("[a-zA-z0-9]*");
if (!p.matcher(gottenData).matches())
System.out.println("doesn't match.");
this prints "doesn't match."
The correct answer is a combination of the above answers. First, I imagine your intended character class is [a-zA-Z0-9]. Note that A-z isn't as bad as you might think: it includes all characters in the ASCII range between A and z, which is the letters plus a few extras (specifically [, \, ], ^, _ and `).
A second potential problem, as Martin mentioned, is that you may need to add the start and end anchors if you want the string to consist only of letters and numbers.
Finally, you use the * quantifier, which means 0 or more, so the pattern can match 0 characters and matches() will return true for an empty string. What you need is the + quantifier. So the pattern you are most likely looking for is:
^[a-zA-Z0-9]+$
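The points made in these answers can be checked with a few calls. A minimal sketch; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper contrasting the corrected pattern with the A-z typo.
final class AlnumCheck {
    // matches() already requires the whole input to match, so ^ and $ are optional here.
    private static final Pattern ALNUM = Pattern.compile("[a-zA-Z0-9]+");

    /** True only for non-empty strings made up of ASCII letters and digits. */
    static boolean isAlnum(String s) {
        return s != null && ALNUM.matcher(s).matches();
    }

    /** The typo version: A-z also admits [ \ ] ^ _ ` and * admits the empty string. */
    static boolean matchesTypoClass(String s) {
        return s.matches("[a-zA-z0-9]*");
    }
}
```

With these definitions, isAlnum rejects "", "a " and "a_b", while matchesTypoClass accepts both "" and "_".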
You have to change the regexp to "^[a-zA-Z0-9]*$" to ensure that you are matching the entire string
Looks like it should be "a-zA-Z0-9", not "a-zA-z0-9", try correcting that...
Did anyone consider adding a space to the regex: [a-zA-Z0-9 ]*? This should match any normal text with letters, numbers and spaces. If you want quotes and other special characters, add them to the regex too.
You can quickly test your regex at http://www.regexplanet.com/simple/
You can check whether an input value contains only letters and numbers by using the regex ^[a-zA-Z0-9]*$
If the value contains only letters and digits, it will show match, e.g. riz99, riz99z
otherwise it will show not match, e.g. 99z., riz99.z, riz99.9
Example code (JavaScript):
if (e.target.value.match('^[a-zA-Z0-9]*$')) {
    console.log('match')
} else {
    console.log('not match')
}
