find the last match with java/regex

I have a dynamic text that contains "font-family", for example:
style="font-family: "Calibri","sans-serif"; font-size:11pt";
I want to remove the entire font-family declaration.
I was using this code:
patron = Pattern.compile("font-family:(.*?);");
encaja = patron.matcher(cadena);
nueva = encaja.replaceAll("");
But it removes the text in a way that isn't useful for me:
style="Calibri","sans-serif"; font-size: 11pt;"
What I want is:
style=" font-size: 11pt;"
I also tried using this pattern
font-family:[^(&.*;)]*?;
But it doesn't work.
Can you help me?
Thanks
EDIT
More case examples:
in: style="font-size:15px; font-family:Arial; mso-ascii-theme-font: minor-latin; "
output: style="font-size:15px; mso-ascii-theme-font: minor-latin;"
in: style="font-family:Arial,Aás;; font-size:11pt; mso-fareast-mso-fareast-theme-font: minor-latin;"
output:style="font-size:11pt; mso-fareast-mso-fareast-theme-font: minor-latin;"

You can use this:
String result = yourstr.replaceAll("(?i)font-family:(?>[^;&\"]++|&(?>quot|ntilde);)*(?>;\\s*+|(?=\"))", "");
pattern description:
(?i) # make the pattern case-insensitive
font-family:
(?> # open an atomic group
[^;&\"]++ # all characters except ; & and " one or more times (possessive)
| # OR
& # literal &
(?> # put the different possibilities here
quot
|
ntilde
)
; # literal ;
)* # repeat the atomic group zero or more times
(?>
;\\s*+ # literal ; and trailing spaces
|
(?=\") # followed by " (last value of the attribute without trailing ; )
)
Another, less safe way (IMO): skip all letters that are between a & and a ; :
String result = yourstr.replaceAll("(?i)font-family:(?>[^;&\"]++|&[a-z]++;)*(?>;\\s*+|(?=\"))", "");
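A minimal sketch of how the first pattern could be wrapped and tried against the inputs from the question. The class name FontFamilyStripper and the method strip are made up for this illustration, and the second sample assumes the raw markup carries the quotes as &quot; entities, which is what the pattern seems to expect:

```java
import java.util.regex.Pattern;

// Hypothetical helper around the pattern from the answer above.
class FontFamilyStripper {
    // Compiled once; identical to the first regex shown in the answer.
    private static final Pattern FONT_FAMILY = Pattern.compile(
        "(?i)font-family:(?>[^;&\"]++|&(?>quot|ntilde);)*(?>;\\s*+|(?=\"))");

    // Removes every font-family declaration, including its trailing ';' and spaces.
    static String strip(String markup) {
        return FONT_FAMILY.matcher(markup).replaceAll("");
    }

    public static void main(String[] args) {
        String[] samples = {
            "style=\"font-size:15px; font-family:Arial; mso-ascii-theme-font: minor-latin; \"",
            // Assumption: quotes inside the attribute are encoded as &quot;
            "style=\"font-family: &quot;Calibri&quot;,&quot;sans-serif&quot;; font-size:11pt\"",
            // Last declaration without a trailing ';' (handled by the lookahead)
            "style=\"font-size:11pt; font-family:Arial\""
        };
        for (String s : samples) {
            System.out.println(strip(s));
        }
    }
}
```

Compiling the pattern once is only a convenience; String.replaceAll with the same regex behaves the same way.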

Try this:
newstr = str.replaceFirst("font-family:\\s?([^\\s]+)", "");

Related

Newline in datatable Gherkin/Cucumber

I have this datatable in my cucumber scenario:
| name | value |
| Description | one \n two \n three |
I want the values to appear in the textarea like this:
one
two
three
Because I need to make bullet points out of them.
So my actual question is, is it possible to use newline characters in one line or is there a better way to approach this?
EDIT: to clarify, it's not working with the code written above:
WebDriverException: unknown error: Runtime.evaluate threw exception: SyntaxError: Invalid or unexpected token
EDIT 2: I'm using a bit of unusual code to access the value, seeing as it is a p element and this is normally not possible:
js.executeScript("document.getElementsByTagName('p')[0].innerHTML = ' " + row.get("value") + " ' ");
This has been working for other rows, though. Is it failing because I'm using \n now?

You can try this way:
WebDriver driver = new ChromeDriver();
driver.get("https://stackoverflow.com/questions/51786797/newline-in-datatable-gherkin-cucumber/51787544#51787544");
Thread.sleep(3000); // pause to wait until page loads
String s = "SOME<br>WORDS";
JavascriptExecutor js = (JavascriptExecutor) driver;
js.executeScript("document.getElementsByTagName('div')[0].innerHTML = '" + s + "';");
Output:
SOME
WORDS
So the main idea is to use the <br> tag as the line separator.
In your case it would be like this:
| name | value |
| Description | one<br>two<br>three |
and code would be:
// make sure, that row.get("value") returns a string
js.executeScript("document.getElementsByTagName('p')[0].innerHTML = ' " + row.get("value") + " ' ");
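If you would rather keep \n in the feature file, the cell could be converted before it is handed to executeScript. The sketch below is plain Java with a made-up helper name (CellToHtml.toHtml); it assumes the cell may arrive either with real line breaks or with the two literal characters \ and n, and handles both:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper: turns a data table cell into <br>-separated HTML text.
class CellToHtml {
    static String toHtml(String cell) {
        // Normalise a literal backslash-n to a real line break first,
        // then split on any line break and trim the padding around each part.
        String[] lines = cell.replace("\\n", "\n").split("\\R");
        return Arrays.stream(lines)
                     .map(String::trim)
                     .collect(Collectors.joining("<br>"));
    }

    public static void main(String[] args) {
        // Cell text as written in the question's data table.
        System.out.println(toHtml("one \\n two \\n three"));
    }
}
```

The result contains no raw line breaks, so it no longer ends the JavaScript string literal early. A single quote inside the value would still break the literal and would need separate escaping.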

Consume only the commented (/** ... */) section of a Java file through ANTLR 4 and skip the rest

I'm new to ANTLR and getting familiar with ANTLR 4. How can I consume only the commented section (/** ... */) of a Java file (or any file) and skip the rest?
I have the following file, "t.txt":
t.txt
/**
#Key1("value1")
#Key2("value2")
*/
This is the text that we need to skip. Only wanted to read the above commented section.
//END_OF_FILE
And my grammar file is as follows:
MyGrammar.g4
grammar MyGrammar;
file : (pair | LINE_COMMENT)* ;
pair : ID VALUE ;
ID : '#' ('A'..'Z') (~('('|'\r'|'\n') | '\\)')* ;
VALUE : '(' (~('\r'|'\n'))*;
COMMENT : '/**' .*? '*/';
WS : [\t\r\n]+ -> skip;
LINE_COMMENT
: '#' ~('\r'|'\n')* ('\r'|'\n'|EOF)
;
I know the COMMENT rule will read the commented section, but I'm stuck on how to skip the rest of the file content and force ANTLR to read ID and VALUE from the COMMENT content only.

You can use lexical modes for this. Simply switch to another mode when the lexer stumbles upon "/**" and ignore everything else.
Note that lexical modes cannot be used in a combined grammar. You will have to define separate lexer and parser grammars.
A small demo:
AnnotationLexer.g4
lexer grammar AnnotationLexer;
ANNOTATION_START
: '/**' -> mode(INSIDE), skip
;
IGNORE
: . -> skip
;
mode INSIDE;
ID
: '#' [A-Z] (~[(\r\n] | '\\)')*
;
VALUE
: '(' ~[\r\n]*
;
ANNOTATION_END
: '*/' -> mode(DEFAULT_MODE), skip
;
IGNORE_INSIDE
: [ \t\r\n] -> skip
;
AnnotationParser.g4
parser grammar AnnotationParser;
options {
tokenVocab=AnnotationLexer;
}
parse
: pair* EOF
;
pair
: ID VALUE {System.out.println("ID=" + $ID.text + ", VALUE=" + $VALUE.text);}
;
And now simply use the lexer and parser:
String input = "/**\n" +
"\n" +
"#Key1(\"value1\")\n" +
"#Key2(\"value2\")\n" +
"\n" +
"*/\n" +
"\n" +
"This is the text that we need to skip. Only wanted to read the above commented section.\n" +
"\n" +
"//END_OF_FILE";
AnnotationLexer lexer = new AnnotationLexer(new ANTLRInputStream(input));
AnnotationParser parser = new AnnotationParser(new CommonTokenStream(lexer));
parser.parse();
which will produce the following output:
ID=#Key1, VALUE=("value1")
ID=#Key2, VALUE=("value2")
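For comparison only, here is a rough sketch of the same extraction done with java.util.regex instead of ANTLR. This is a different technique from the lexical modes shown above: the class name AnnotationScan and both patterns are assumptions written for this example, with the ID and VALUE shapes loosely mirroring the lexer rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based stand-in for the two-mode lexer above.
class AnnotationScan {
    // Everything between "/**" and the next "*/" (DOTALL so '.' spans lines).
    private static final Pattern BLOCK =
        Pattern.compile("/\\*\\*(.*?)\\*/", Pattern.DOTALL);
    // ID: '#', an upper-case letter, then anything up to '(' or a line break.
    // VALUE: '(' and the rest of the line.
    private static final Pattern PAIR =
        Pattern.compile("(#[A-Z][^(\\r\\n]*)(\\([^\\r\\n]*)");

    static List<String> pairs(String source) {
        List<String> out = new ArrayList<>();
        Matcher block = BLOCK.matcher(source);
        while (block.find()) {
            // Only text inside a /** ... */ block is searched for pairs.
            Matcher pair = PAIR.matcher(block.group(1));
            while (pair.find()) {
                out.add("ID=" + pair.group(1) + ", VALUE=" + pair.group(2));
            }
        }
        return out;
    }
}
```

Text outside the comment block is never searched, which plays the role of the IGNORE rule. The ANTLR version is still the better fit once the annotation syntax grows beyond one pair per line.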

Get contents of brackets using regex in a list of values

I'm trying to look for a regex (Coldfusion or Java) that can get me the contents between the brackets for each (param \d+) without fail. I've tried dozens of different types of regexes and the closest one I got is this one:
\(param \d+\) = \[(type='[^']*', class='[^']*', value='(?:[^']|'')*', sqltype='[^']*')\]
Which would be perfect, if the string that I get back from CF escaped single quotes from the value parameter. But it doesn't so it fails miserably. Going the route of a negative lookahead like so:
\[(type='[^']*', class='[^']*', value='(?:(?!', sqltype).)*', sqltype='[^']*')\]
is great, unless for some unnatural reason there's a piece of code that quite literally has , sqltype in the value. I find it hard to believe I can't simply tell a regex to scoop out the contents of every opening and closing bracket it finds, but then again, I don't know enough regex to know its limits.
Here's an example string of what I'm trying to parse:
(param 1) = [type='IN', class='java.lang.Integer', value='47', sqltype='cf_sql_integer'] , (param 2) = [type='IN', class='java.lang.String', value='asf , O'Reilly, really?', sqltype='cf_sql_varchar'] , (param 3) = [type='IN', class='java.lang.String', value='Th[is]is Ev'ery'thing That , []can break it ', sqltype= ', sqltype='cf_sql_varchar']
For the curious this is a sub-question to Copyable Coldfusion SQL Exception.
EDIT
This is my attempt at implementing @Mena's answer in CF9.1. Sadly it doesn't finish processing the string. I had to replace the \\ with \ just to get it to run at first, but my implementation might still be at fault.
This is the string given (pipes are just to denote boundary):
| (param 1) = [type='IN', class='java.lang.Integer', value='47', sqltype='cf_sql_integer'] , (param 2) = [type='IN', class='java.lang.String', value='asf , O'Reilly], really?', sqltype='cf_sql_varchar'] , (param 3) = [type='IN', class='java.lang.String', value='Th[is]is Ev'ery'thing That , []can break it ', sqltype ', sqltype='cf_sql_varchar'] |
This is my implementation:
<cfset var outerPat = createObject("java","java.util.regex.Pattern").compile(javaCast("string", "\((.+?)\)\s?\=\s?\[(.+?)\](\s?,|$)"))>
<cfset var innerPat = createObject("java","java.util.regex.Pattern").compile(javaCast("string", "(.+?)\s?\=\s?'(.+?)'\s?,\s?"))>
<cfset var outerMatcher = outerPat.matcher(javaCast("string", arguments.params))>
<cfdump var="Start"><br />
<cfloop condition="outerMatcher.find()">
<cfdump var="#outerMatcher.group(1)#"> (<cfdump var="#outerMatcher.group(2)#">)<br />
<cfset var innerMatcher = innerPat.matcher(javaCast("string", outerMatcher.group(2)))>
<cfloop condition="innerMatcher.find()">
<cfoutput>|__</cfoutput><cfdump var="#innerMatcher.group(1)#"> --> <cfdump var="#innerMatcher.group(2)#"><br />
</cfloop>
<br />
</cfloop>
<cfabort>
And this is what printed:
Start
param 1 ( type='IN', class='java.lang.Integer', value='47', sqltype='cf_sql_integer' )
|__ type --> IN
|__ class --> java.lang.Integer
|__ value --> 47
param 2 ( type='IN', class='java.lang.String', value='asf , O'Reilly )
|__ type --> IN
|__ class --> java.lang.String
End

Here's a Java regex pattern that works for your sample input.
(?x)
# lookbehind to check for start of string or previous param
# java lookbehinds must have max length, so limits sqltype
(?<=^|sqltype='cf_sql_[a-z]{1,16}']\ ,\ )
# capture the full string for replacing in the orig sql
# and just the position to verify against the match position
(\(param\ (\d+)\))
\ =\ \[
# type and class wont contain quotes
type='([^']++)'
,\ class='([^']++)'
# match any non-quote, then lazily keep going
,\ value='([^']++.*?)'
# sqltype is always alphanumeric
,\ sqltype='cf_sql_[a-z]+'
\]
# lookahead to check for end of string or next param
(?=$|\ ,\ \(param\ \d+\)\ =\ \[)
(The (?x) flag enables comment mode, which ignores unescaped whitespace and anything between a hash and the end of the line.)
And here's that pattern implemented in CFML (tested on CF9,0,1,274733). It uses cfRegex (a library which makes it easier to work with Java regex in CFML) to get the results of that pattern, and then does a couple of checks to make sure the expected number of params are found.
<cfsavecontent variable="Input">
(param 1) = [type='IN', class='java.lang.Integer', value='47', sqltype='cf_sql_integer']
, (param 2) = [type='IN', class='java.lang.String', value='asf , O'Reilly, really?', sqltype='cf_sql_varchar']
, (param 3) = [type='IN', class='java.lang.String', value='Th[is]is Ev'ery'thing That , []can break it ', sqltype= ', sqltype='cf_sql_varchar']
</cfsavecontent>
<cfset Input = trim(Input).replaceall('\n','')>
<cfset cfcatch =
{ params = input
, sql = 'SELECT stuff FROM wherever WHERE (param 3) is last param'
}/>
<cfsavecontent variable="ParamRx">(?x)
# lookbehind to check for start or previous param
# java lookbehinds must have max length, so limits sqltype
(?<=^|sqltype='cf_sql_[a-z]{1,16}']\ ,\ )
# capture the full string for replacing in the orig sql
# and just the position to verify against the match position
(\(param\ (\d+)\))
\ =\ \[
# type and class wont contain quotes
type='([^']++)'
,\ class='([^']++)'
# match any non-quote, then lazily keep going if needed
,\ value='([^']++.*?)'
# sqltype is always alphanumeric
,\ sqltype='cf_sql_[a-z]+'
\]
# lookahead to check for end or next param
(?=$|\ ,\ \(param\ \d+\)\ =\ \[)
</cfsavecontent>
<cfset FoundParams = new Regex(ParamRx).match
( text = cfcatch.params
, returntype = 'full'
)/>
<cfset LastParamPos = cfcatch.sql.lastIndexOf('(param ') + 7 />
<cfset LastParam = ListFirst( Mid(cfcatch.sql,LastParamPos,3) , ')' ) />
<cfif LastParam NEQ ArrayLen(FoundParams) >
<cfset ProblemsDetected = true />
<cfelse>
<cfset ProblemsDetected = false />
<cfloop index="i" from=1 to=#ArrayLen(FoundParams)# >
<cfif i NEQ FoundParams[i].Groups[2] >
<cfset ProblemsDetected = true />
</cfif>
</cfloop>
</cfif>
<cfif ProblemsDetected>
<big>Something went wrong!</big>
<cfelse>
<big>All seems fine</big>
</cfif>
<cfdump var=#FoundParams# />
This will actually work if you embed an entire param inside the value of another param. It fails if you try two (or more), but at least the checks should detect this failure.
Here's what the dump output should look like:
Hopefully everything here makes sense - let me know if you have any questions.
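For anyone who wants to try the pattern outside CFML, here is a small Java sketch. It uses the same regex written without the (?x) flag, so the escaped spaces become plain spaces and the comments move into Java. The class name ParamValues and the method values are invented for this example; it only collects capture group 5, the value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical Java harness for the pattern described above.
class ParamValues {
    private static final Pattern PARAM = Pattern.compile(
        "(?<=^|sqltype='cf_sql_[a-z]{1,16}'] , )"  // start of input or end of previous param
      + "(\\(param (\\d+)\\))"                      // group 1: "(param N)", group 2: N
      + " = \\["
      + "type='([^']++)'"                           // group 3: type
      + ", class='([^']++)'"                        // group 4: class
      + ", value='([^']++.*?)'"                     // group 5: value, extended lazily
      + ", sqltype='cf_sql_[a-z]+'"
      + "\\]"
      + "(?=$| , \\(param \\d+\\) = \\[)");         // end of input or start of next param

    static List<String> values(String params) {
        List<String> out = new ArrayList<>();
        Matcher m = PARAM.matcher(params);
        while (m.find()) {
            out.add(m.group(5));
        }
        return out;
    }
}
```

As in the original pattern, an empty value ('') is not matched, because the value group requires at least one non-quote character.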

I would probably use a dedicated parser for that, but here's an example on how to do it with two Patterns and nested loops:
// the input String
String input = "(param 1) = " +
"[type='IN', class='java.lang.Integer', value='47', sqltype='cf_sql_integer'] , " +
"(param 2) = " +
"[type='IN', class='java.lang.String', value='asf , O'Reilly, really?', " +
"sqltype='cf_sql_varchar'] , " +
"(param 3) = " +
"[type='IN', class='java.lang.String', value='Th[is]is Ev'ery'thing That , " +
"[]can break it ', sqltype= ', sqltype='cf_sql_varchar']";
// the Pattern defining the round-bracket expression and the following
// square-bracket list. Both values within the brackets are grouped for back-reference
// note that what prevents the 3rd case from breaking is that the closing square bracket
// is expected to be either followed by optional space + comma, or end of input
Pattern outer = Pattern.compile("\\((.+?)\\)\\s?\\=\\s?\\[(.+?)\\](\\s?,|$)");
// the Pattern defining the key-value pairs within the square-bracket groups
// note that both key and value are grouped for back-reference
Pattern inner = Pattern.compile("(.+?)\\s?\\=\\s?'(.+?)'\\s?,\\s?");
Matcher outerMatcher = outer.matcher(input);
// iterating over the outer Pattern (type x) = [myKey = myValue, ad lib.], or end of input
while (outerMatcher.find()) {
System.out.println(outerMatcher.group(1));
Matcher innerMatcher = inner.matcher(outerMatcher.group(2));
// iterating over the inner Pattern myKey = myValue
while (innerMatcher.find()) {
System.out.println("\t" + innerMatcher.group(1) + " --> " + innerMatcher.group(2));
}
}
Output:
param 1
type --> IN
class --> java.lang.Integer
value --> 47
param 2
type --> IN
class --> java.lang.String
value --> asf , O'Reilly, really?
param 3
type --> IN
class --> java.lang.String
value --> Th[is]is Ev'ery'thing That , []can break it

Java Regex find Oracle Single Line comments Except in a String

Find Oracle single line comments except the ones that appear inside a string.
For example:
-- This is a valid single line comment
But
'This is a string -- and it is not a comment';
I am using this regex to find single line comments
--.*$
A few cases can be handled this way, but there are several complex ones. You can use this script for reference:
-- this is a single line comment
CREATE OR REPLACE PROCEDURE "MAIL_WITH_ATTACHMENT" ( )
IS
tmp varchar(2) ; -- this is a comment
tmp1 varchar(2) := 'some texxt'; -- this is another comment
tmp2 varchar(3) := 'some more --text'; -- this is one more comment
tmp3 varchar(4) := 'this regex isn't --working properly'; -- Don't you think this is another comment
BEGIN
'--This is a Mime message, which your current mail reader may not' || crlf ||
' some more -- characters in a string';
mesg:= crlf ||
'--This is a Mime message, which your current mail reader may not' || crlf ||
' some more -- characters in a string';
END;
Result must be this
[1] : -- this is a single line comment
[2] : -- this is a comment
[3] : -- this is another comment
[4] : -- this is one more comment
[5] : -- Don't you think this is another comment
Thanks

Personally, I'd use an SQL parser to strip these comments. The problem with regex is that it's not really aware of its surroundings: regex has a hard time figuring out if a single quote is inside a comment, or if -- is inside a string literal.
You can work around this with a regex that matches from the start of each line and also matches string literals, which makes it behave more like a lexical analyzer (the first stage of parsing).
Such a regex could look like this:
(?m)^((?:(?!--|').|'(?:''|[^'])*')*)--.*$
A quick break down of the regex:
(?m) # enable multi-line mode
^ # match the start of the line
( # start match group 1
(?: # start non-capturing group 1
(?!--|'). # if there's no '--' or single quote ahead, match any char (except a line break)
| # OR
'(?:''|[^'])*' # match a string literal
)* # end non-capturing group 1 and repeat it zero or more times
) # end match group 1
--.*$ # match a comment all the way to the end of the line
In plain English that would read like: from each start of a line, try to match zero or more:
string literals ('(?:''|[^'])*');
or any character as long as it's not a single quote, a line break char or a - that is a part of a comment ((?!--|').).
and store this match in group 1. Then match a comment (--.*$).
So now all you need to do is replace this pattern with whatever is matched in group 1. A demo:
String sql = "-- this is a single line comment\n" +
"\n" +
"CREATE OR REPLACE PROCEDURE \"MAIL_WITH_ATTACHMENT\" ( ) \n" +
"IS \n" +
"tmp varchar(2) ; -- this is a comment \n" +
"tmp1 varchar(2) := 'some texxt'; -- this is another comment\n" +
"tmp2 varchar(3) := 'some more --text'; -- this is one more comment\n" +
"tmp3 varchar(4) := 'this regex isn''t --working properly'; -- Don't you think this is another comment\n" +
"BEGIN\n" +
"\n" +
" '--This is a Mime message, which your current mail reader may not' || crlf ||\n" +
" ' some more -- characters in a string';\n" +
"\n" +
" mesg:= crlf ||\n" +
" '--This is a Mime message, which your current mail reader may not' || crlf ||\n" +
" ' some more -- characters in a string';\n" +
"END; ";
String stripped = sql.replaceAll("(?m)^((?:(?!--|').|'(?:''|[^'])*')*)--.*$", "$1[REMOVED COMMENT]");
System.out.println(stripped);
which will print:
[REMOVED COMMENT]
CREATE OR REPLACE PROCEDURE "MAIL_WITH_ATTACHMENT" ( )
IS
tmp varchar(2) ; [REMOVED COMMENT]
tmp1 varchar(2) := 'some texxt'; [REMOVED COMMENT]
tmp2 varchar(3) := 'some more --text'; [REMOVED COMMENT]
tmp3 varchar(4) := 'this regex isn''t --working properly'; [REMOVED COMMENT]
BEGIN
'--This is a Mime message, which your current mail reader may not' || crlf ||
' some more -- characters in a string';
mesg:= crlf ||
'--This is a Mime message, which your current mail reader may not' || crlf ||
' some more -- characters in a string';
END;
EDIT
And if you only want to extract the comments, wrap the capture group around --.*$ and use a Pattern & Matcher to find() the matches:
Matcher m = Pattern.compile("(?m)^(?:(?!--|').|'(?:''|[^'])*')*(--.*)$").matcher(sql);
while(m.find()) {
System.out.println(m.group(1));
}
which will print:
-- this is a single line comment
-- this is a comment
-- this is another comment
-- this is one more comment
-- Don't you think this is another comment
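To illustrate the "lexical analyzer" idea without a regex, here is a rough hand-written scanner. It is a different technique from the pattern above and only a sketch: the class name SqlLineComments is invented, and it deliberately ignores /* ... */ comments and double-quoted identifiers. It tracks whether the current position is inside a single-quoted literal, treats '' as an escaped quote, and collects every -- comment found outside a literal.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical character-by-character scanner for "--" line comments.
class SqlLineComments {
    static List<String> extract(String sql) {
        List<String> comments = new ArrayList<>();
        boolean inString = false;
        int i = 0;
        int n = sql.length();
        while (i < n) {
            char c = sql.charAt(i);
            if (inString) {
                if (c == '\'') {
                    // '' inside a literal is an escaped quote, not the end of it.
                    if (i + 1 < n && sql.charAt(i + 1) == '\'') {
                        i += 2;
                        continue;
                    }
                    inString = false;
                }
                i++;
            } else if (c == '\'') {
                inString = true;
                i++;
            } else if (c == '-' && i + 1 < n && sql.charAt(i + 1) == '-') {
                // Comment runs to the end of the line (or the end of the input).
                int end = sql.indexOf('\n', i);
                if (end < 0) {
                    end = n;
                }
                comments.add(sql.substring(i, end));
                i = end;
            } else {
                i++;
            }
        }
        return comments;
    }
}
```

Like the regex, this relies on quotes inside literals being doubled (isn''t). With Windows line endings the trailing \r stays part of the comment text.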

This should help if you read the input line by line:
str = str.replaceAll("'{1}.*'{1}", "").replaceFirst(".*--", "--");
Input: -sd '--asdsa ---asdsadasdsad' || ' asdsad' || 'asdsadasd '--here x something
Output: --here x something
Edit: final version after three edits :)

This regex should work fine:
Pattern p = Pattern.compile("^[^']*('[^']*'[^']*)*(--.*)$");
except for the case [5]. But before starting to overcomplicate the regex, are you sure that Oracle doesn't complain about that string?
EDIT
This is the code I've used to test the regex
String[] text =
{
"-- this is a single line comment",
"",
"CREATE OR REPLACE PROCEDURE \"MAIL_WITH_ATTACHMENT\" ( ) ",
"IS ",
"tmp varchar(2) ; -- this is a comment ",
"tmp1 varchar(2) := 'some texxt'; -- this is another comment",
"tmp2 varchar(3) := 'some more --text'; 'blah --blah' -- this is one more comment",
"tmp3 varchar(4) := 'this regex isn't --working properly'; -- Don't you think this is another comment",
"BEGIN",
"",
" '--This is a Mime message, which your current mail reader may not' || crlf ||",
" ' some more -- characters in a string';",
"",
" mesg:= crlf ||",
" '--This is a Mime message, which your current mail reader may not' || crlf ||",
" ' some more -- characters in a string';", "END; ", };
Pattern p = Pattern.compile("^[^']*('[^']*'[^']*)*(--.*)$");
Matcher m = p.matcher("");
for (String s : text) {
m.reset(s);
if (m.find()) {
System.out.println(m.group(m.groupCount()));
}
}
And here's the output:
-- this is a single line comment
-- this is a comment
-- this is another comment
-- this is one more comment
--working properly'; -- Don't you think this is another comment
As you can see, the last line of the output is "wrong". But, as you said, Oracle doesn't like such a string either. Once you correct isn't to isn''t, the output will also be correct.

How can I access blocks of text as an attribute that are matched using a greedy=false option in ANTLR?

I have a rule in my ANTLR grammar like this:
COMMENT : '/*' (options {greedy=false;} : . )* '*/' ;
This rule simply matches C-style comments, so it will accept any pair of /* and */ with any arbitrary text lying in between, and it works fine.
What I want to do now is capture all the text between the /* and the */ when the rule matches, to make it accessible to an action. Something like this:
COMMENT : '/*' e=((options {greedy=false;} : . )*) '*/' {System.out.println("got: " + $e.text);} ;
This approach doesn't work: during parsing it gives "no viable alternative" upon reaching the first character after the "/*".
I'm not really clear on whether or how this can be done. Any suggestions or guidance are welcome, thanks.

Note that you can simply do:
getText().substring(2, getText().length()-2)
on the COMMENT token since the first and the last 2 characters will always be /* and */.
You could also drop the options {greedy=false;} : part, since both .* and .+ are non-greedy by default (the * and + operators are greedy when applied to anything other than the . wildcard) (i).
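The substring arithmetic can be checked in isolation with a tiny plain-Java sketch. The class and method names are made up for this example; tokenText stands in for whatever getText() returns on the COMMENT token.

```java
// Hypothetical stand-alone check of the substring(2, length - 2) idea.
class CommentBody {
    // Drops the leading "/*" and the trailing "*/" from a comment token's text.
    static String body(String tokenText) {
        return tokenText.substring(2, tokenText.length() - 2);
    }

    public static void main(String[] args) {
        System.out.println(">" + body("/* abc */") + "<");
    }
}
```

This assumes the text really starts with /* and ends with */, which the lexer rule guarantees for a matched token.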
EDIT
Or use setText(...) on the Comment token to discard the /* and */ immediately. A little demo:
file T.g:
grammar T;
@parser::members {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRStringStream(
"/* abc */ \n" +
" \n" +
"/* \n" +
" DEF \n" +
"*/ "
);
TLexer lexer = new TLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
TParser parser = new TParser(tokens);
parser.parse();
}
}
parse
: ( Comment {System.out.printf("parsed :: >\%s<\%n", $Comment.getText());} )+ EOF
;
Comment
: '/*' .* '*/' {setText(getText().substring(2, getText().length()-2));}
;
Space
: (' ' | '\t' | '\r' | '\n') {skip();}
;
Then generate a parser & lexer, compile all .java files and run the parser containing the main method:
java -cp antlr-3.2.jar org.antlr.Tool T.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar TParser
(or `java -cp .;antlr-3.2.jar TParser` on Windows)
which will produce the following output:
parsed :: > abc <
parsed :: >
DEF
<
(i) The Definitive ANTLR Reference, Chapter 4, Extended BNF Subrules, page 86.

Try this:
COMMENT :
'/*' {StringBuilder comment = new StringBuilder();} ( options {greedy=false;} : c=. {comment.appendCodePoint(c);} )* '*/' {System.out.println(comment.toString());};
Another way which will actually return the StringBuilder object so you can use it in your program:
COMMENT returns [StringBuilder comment]:
'/*' {comment = new StringBuilder();} ( options {greedy=false;} : c=. {comment.append((char)c);} )* '*/';
