I am trying to extract the words that start with a hash (#) symbol from a line.
Suppose the line is :
#This #is#the line #containing multiple #tags.
The regex which I am using is :
(?:^|\s)(#\w+)
The answer which I'm getting is :
#This , #is , #containing , #tags
The output should be
#This , #is#the , #containing , #tags.
Please help.
Thanks
# is a non-word character. As such, the output that you get is expected.
Instead of looking for word characters, match anything that is not a space.
(?:^|\s)(#[^ ]+)
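The question does not name a language, so here is a minimal sketch in Java of how the pattern above could be applied. The class and method names (HashtagExtractor, extractTags) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HashtagExtractor {
    // Regex from the answer: a '#' at the start of the line or after whitespace,
    // followed by one or more characters that are not a space.
    private static final Pattern TAG = Pattern.compile("(?:^|\\s)(#[^ ]+)");

    // Hypothetical helper: collects every tag found in the line.
    public static List<String> extractTags(String line) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(line);
        while (m.find()) {
            tags.add(m.group(1)); // group 1 excludes the leading whitespace
        }
        return tags;
    }

    public static void main(String[] args) {
        // prints [#This, #is#the, #containing, #tags.]
        System.out.println(extractTags("#This #is#the line #containing multiple #tags."));
    }
}
```

Note that the trailing dot stays attached to the last tag, because a dot is also "not a space".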
I need to parse a Python file that contains multiple SQL queries and find the start and end position of each query, so that I can extract only the query part, using Java.
I am using the .contains function to check for sql(''' as the opening of a query. For the closing I look for ''') but in some cases ''') appears in the middle of the query, when a variable is involved, and that should not be detected as the end of the query.
Something like this :
spark.sql(''' SELECT .......
FROM.....
WHERE xxx IN ('''+ Variable +''')
''')
Here the second-to-last line is also detected as the end of the query if I use line.contains(" ''') "), which is wrong.
All I can think of is to check for the newline character as the end of the query, since each query is separated by two empty lines. So I tried if (line.contains(" ''')\n") and if (line.contains(" ''')\r\n") but neither works for me.
Kindly let me know of any other way to do this.
Note that I do not have the privilege to change the query file.
Thanks
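I believe a simple contains won't solve this problem. If you read the file line by line, the line terminator is typically stripped from each line, so a check for \n never matches.
You will have to use Pattern on the whole text if you want to match across \n.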
String query = "spark.sql(''' SELECT .......\n" +
"FROM..... \n" +
"WHERE xxx IN ('''+ Variable +''')\n" +
"''')";
Pattern pattern = Pattern.compile("^spark\\.sql\\('''(.*)'''\\)$", Pattern.DOTALL);
System.out.println(pattern.matcher(query).find());
Output:
true
Pattern.DOTALL tells Java to allow the dot to match newline characters, too.
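To pull several queries out of one file, the same idea can be extended with a reluctant quantifier and MULTILINE mode. This is only a sketch, and it assumes that the real closing ''') always sits at the start of its own line (possibly indented), as in the example from the question. The names SparkSqlExtractor and extractQueries are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SparkSqlExtractor {
    // Assumption: the closing ''') is the first thing on its line.
    // DOTALL lets '.' cross line breaks; MULTILINE lets '^' match at each line start.
    private static final Pattern QUERY = Pattern.compile(
            "spark\\.sql\\('''(.*?)^[ \\t]*'''\\)",
            Pattern.DOTALL | Pattern.MULTILINE);

    // Hypothetical helper: returns the text of every query in the source.
    public static List<String> extractQueries(String source) {
        List<String> queries = new ArrayList<>();
        Matcher m = QUERY.matcher(source);
        while (m.find()) {
            // m.start(1) and m.end(1) give the start and end positions of the query text.
            queries.add(m.group(1).trim());
        }
        return queries;
    }

    public static void main(String[] args) {
        String source = "spark.sql(''' SELECT a\n"
                + "FROM t\n"
                + "WHERE xxx IN ('''+ Variable +''')\n"
                + "''')\n";
        System.out.println(extractQueries(source));
    }
}
```

A ''') in the middle of a line is skipped because '^' only matches at a line start. A query written entirely on one line would not be found by this sketch.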
I need a Regex to merge multiple numbers in a line without merging them all together.
Example line :
Hello World9.99 123 456.00 7 890 123.45 0.97
My desired output is :
Hello World9.99 123456.00 7890123.45 0.97
I know basic regex but am not experienced with lookaheads/behinds.
So far I created this method :
final String regex = "(?<!\\.\\d{1,3})\\s+(?=\\d{1,3}\\.?\\d{2}?)";
public String mergeNumbers(String s) {
    return s.replaceAll(regex, "");
}
This works fine if the number tied to the word has a dot.
But I just can't figure out how to match this line without a dot at the beginning :
Hello World99 123 456.00 7 890 123.45 0.97
This is returning :
Hello World99123456.00 7890123.45 0.97
but I want :
Hello World99 123456.00 7890123.45 0.97
So my question is :
How can I modify my regex to match both cases?
I suggest using
.replaceAll("\\b(?<!\\.)(\\d+)\\s+(?=\\d)", "$1")
Details:
\b - a word boundary
(?<!\.) - there can be no . immediately before the current location
(\d+) - Group 1 (referred to with $1 backreference from the string replacement pattern): one or more digits
\s+ - 1+ whitespaces
(?=\d) - there must be a digit immediately to the right of the current location.
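As a quick check, the suggested replacement can be run against both sample lines from the question. The wrapper class name NumberMerger is made up for this sketch.

```java
public class NumberMerger {
    // Regex from the answer: digits that start at a word boundary and are not
    // preceded by a dot, followed by whitespace and then another digit.
    private static final String REGEX = "\\b(?<!\\.)(\\d+)\\s+(?=\\d)";

    public static String mergeNumbers(String s) {
        // Keep the digits (group 1) and drop the whitespace after them.
        return s.replaceAll(REGEX, "$1");
    }

    public static void main(String[] args) {
        // prints Hello World9.99 123456.00 7890123.45 0.97
        System.out.println(mergeNumbers("Hello World9.99 123 456.00 7 890 123.45 0.97"));
        // prints Hello World99 123456.00 7890123.45 0.97
        System.out.println(mergeNumbers("Hello World99 123 456.00 7 890 123.45 0.97"));
    }
}
```

The 99 in World99 is left alone because there is no word boundary between a letter and a digit, so the match cannot start there.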
I have a string that I want to parse with a Java or Python regexp:
something (\var1 \var2 \var3 $var4 #var5 $var6 *fdsfdsfd #uytuytuyt fdsgfdgfdgf aaabbccc)
The number of variables is unknown. Their exact names are unknown. Their names may or may not start with "\", "$", "*" or "#", and they're delimited by whitespace.
I'd like to parse them separately, that is, in capture groups, if possible. How can I do that? The output I want is a list of:
[\var1 , \var2 , \var3 , $var4 , #var5 , $var6 , *fdsfdsfd , #uytuytuyt , fdsgfdgfdgf , aaabbccc]
I don't need the java or python code, I just need the regexp. My incomplete one is:
something\s\(.+\)
something\s\((.+)\)
With this regex you capture the string that contains all the variables. Then split it on whitespace, since you are sure that they are delimited by whitespace.
import re

# Raw string so that \s and \( reach the regex engine unchanged
m = re.search(r'something\s\((.+)\)', input_string)
if m:
    list_of_vars = m.group(1).split()
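Since the question also mentions Java, the same capture-then-split approach could look roughly like this there. It is a sketch only, and the names VariableParser and parseVariables are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VariableParser {
    // Same regex as above: capture everything between the parentheses.
    private static final Pattern LIST = Pattern.compile("something\\s\\((.+)\\)");

    // Hypothetical helper: returns the whitespace-delimited names, or an empty list.
    public static List<String> parseVariables(String input) {
        Matcher m = LIST.matcher(input);
        if (!m.find()) {
            return Collections.emptyList();
        }
        // Split the captured text on runs of whitespace.
        return Arrays.asList(m.group(1).trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String input = "something (\\var1 \\var2 \\var3 $var4 #var5 $var6 "
                + "*fdsfdsfd #uytuytuyt fdsgfdgfdgf aaabbccc)";
        System.out.println(parseVariables(input));
    }
}
```

A single regex cannot return an unknown number of separate capture groups, which is why the split step is done outside the pattern.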
A log file has these patterns appearing more than once in a line.
For example, the file may look like
dsads utc-hour_of_year:2013-07-30T17 jdshkdsjhf utc-week_of_year:2013-W31 dskjdskf
utc-week_of_year:2013-W31 dskdsld fdsfd
dshdskhkds utc-month_of_year:2013-07 gfdkjlkdf
I want to replace all date specific info with "Y"
I tried :
replaceAll("_year:.*\\s", "_year:Y "); but it removes everything that occurs after the first replacement, due to the greedy match of .*:
dsads utc-hour_of_year:Y
utc-week_of_year:Y
dshdskhkds utc-month_of_year:Y
but the expected result is:
dsads utc-hour_of_year:Y jdshkdsjhf utc-week_of_year:Y dskjdskf
utc-week_of_year:Y dskdsld fdsfd
dshdskhkds utc-month_of_year:Y gfdkjlkdf
Try using a reluctant quantifier: _year:.*?\s.
.replaceAll("_year:.*?\\s", "_year:Y ")
System.out
.println("utc-hour_of_year:2013-07-30T17 dsfsdgfsgf utc-week_of_year:2013-W31 dsfsdgfsdgf"
.replaceAll("_year:.*?\\s", "_year:Y "));
utc-hour_of_year:Y dsfsdgfsgf utc-week_of_year:Y dsfsdgfsdgf
I am not sure what you are really trying to do, so this answer is based only on your example. If you want something else, leave a comment below or edit your question with more specific information or an example.
It removes everything after _year: because you are using .*\\s, which means:
.* zero or more of any character (except line terminators),
\\s followed by a whitespace character.
So in the sentence
utc-hour_of_year:2013-07-30T17 dsfsdgfsgf utc-week_of_year:2013-W31 dsfsdgfsdgf
it will match
utc-hour_of_year:2013-07-30T17 dsfsdgfsgf utc-week_of_year:2013-W31 dsfsdgfsdgf
           ^------------------ from here to here ------------------^
because the * quantifier is greedy by default. To make it reluctant, add ? after *, for example:
"_year:.*?\\s"
Or, even better, instead of .*? match only non-whitespace characters using \\S, which is the negation of \\s and can also be written as [^\\s]. Also, if your data can appear at the end of your input, you probably shouldn't require \\s at the end of your regex or add a space in your replacement, so try one of these:
.replaceAll("_year:\\S*", "_year:Y")
.replaceAll("_year:\\S*\\s", "_year:Y ")
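To see the difference between the variants side by side, here is a small sketch that applies the greedy, reluctant and \\S* patterns to a line from the question. The class and method names are made up for illustration.

```java
public class YearMasker {
    // Greedy: .* runs to the last whitespace on the line.
    public static String greedy(String line) {
        return line.replaceAll("_year:.*\\s", "_year:Y ");
    }

    // Reluctant: .*? stops at the first whitespace after _year:
    public static String reluctant(String line) {
        return line.replaceAll("_year:.*?\\s", "_year:Y ");
    }

    // Non-whitespace only: also works when the date is the last thing on the line.
    public static String nonSpace(String line) {
        return line.replaceAll("_year:\\S*", "_year:Y");
    }

    public static void main(String[] args) {
        String line = "dsads utc-hour_of_year:2013-07-30T17 jdshkdsjhf utc-week_of_year:2013-W31 dskjdskf";
        System.out.println(greedy(line));    // dsads utc-hour_of_year:Y dskjdskf
        System.out.println(reluctant(line)); // dsads utc-hour_of_year:Y jdshkdsjhf utc-week_of_year:Y dskjdskf
        System.out.println(nonSpace(line));  // dsads utc-hour_of_year:Y jdshkdsjhf utc-week_of_year:Y dskjdskf
    }
}
```

The reluctant version leaves a line untouched when nothing follows the date, because it still requires a whitespace character after it. The \\S* version has no such requirement.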
I'm trying to create ANTLR grammar that will parse the following input:
#code 123 some arbitrary text
I would like to split it into three tokens: #code, 123 and any text after the space. It should be something very simple, but I can't understand how to make it work.
It doesn't sound like a good problem for ANTLR.
You can define tokens like AT : '#' [a-z]+, NUMBER : [0-9]+, WORD : [a-z]+, SIGNIFICANT_SPACE : [ ]+ and WS : [\n] {skip();}
Then use a grammar rule like
AT NUMBER (SIGNIFICANT_SPACE | WORD)+
and reconstruct the words and spaces, but it seems wrong.
You may also look at the filter option in ANTLR. You can use it to parse part of the input, then examine the character ranges of the tokens to get the parts of the line that were filtered out.
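If a full grammar turns out to be more than the problem needs, a plain regular expression may be enough for a single-line format like this. The following is a sketch in plain Java, not ANTLR, and it assumes the line always has the shape "#name number rest". The names LineSplitter and split are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineSplitter {
    // Assumed shape: '#' plus word characters, then digits, then the remaining text.
    private static final Pattern LINE = Pattern.compile("^(#\\w+)\\s+(\\d+)\\s+(.*)$");

    // Hypothetical helper: returns the three parts, or an empty list if the line does not fit.
    public static List<String> split(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return Collections.emptyList();
        }
        return Arrays.asList(m.group(1), m.group(2), m.group(3));
    }

    public static void main(String[] args) {
        // prints [#code, 123, some arbitrary text]
        System.out.println(split("#code 123 some arbitrary text"));
    }
}
```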
This is very simple.
program : start ID end;
start : '#' ID;
end : ANYTHING;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*;
ANYTHING : .*;
WS : (' '|'\n'|'\r'|'\t')+ {$channel = HIDDEN;};
Other than that, you just need to expand on the rules to suit your purposes.
This should work with ANTLR3, but I don't know about ANTLR2.