Java regex: need one regex to match all the formats specified

A log file has these patterns, which can appear more than once in a line.
For example, the file may look like:
dsads utc-hour_of_year:2013-07-30T17 jdshkdsjhf utc-week_of_year:2013-W31 dskjdskf
utc-week_of_year:2013-W31 dskdsld fdsfd
dshdskhkds utc-month_of_year:2013-07 gfdkjlkdf
I want to replace all date-specific info with "Y".
I tried:
replaceAll("_year:.*\\s", "_year:Y ");
but it removes everything after the first replacement, due to the greedy match of .*:
dsads utc-hour_of_year:Y
utc-week_of_year:Y
dshdskhkds utc-month_of_year:Y
but the expected result is:
dsads utc-hour_of_year:Y jdshkdsjhf utc-week_of_year:Y dskjdskf
utc-week_of_year:Y dskdsld fdsfd
dshdskhkds utc-month_of_year:Y gfdkjlkdf

Try using a reluctant quantifier: _year:.*?\s
.replaceAll("_year:.*?\\s", "_year:Y ")
For example:
System.out.println("utc-hour_of_year:2013-07-30T17 dsfsdgfsgf utc-week_of_year:2013-W31 dsfsdgfsdgf"
        .replaceAll("_year:.*?\\s", "_year:Y "));
prints
utc-hour_of_year:Y dsfsdgfsgf utc-week_of_year:Y dsfsdgfsdgf

I am not sure what you are really trying to do, and this answer is based only on your example. If you want to do something else, leave a comment below or edit your question with more specific information or an example.
It removes everything after _year: because you are using .*\\s, which means
.* zero or more of any character (except line terminators),
\\s followed by a whitespace character
so in the sentence
utc-hour_of_year:2013-07-30T17 dsfsdgfsgf utc-week_of_year:2013-W31 dsfsdgfsdgf
it will match
utc-hour_of_year:2013-07-30T17 dsfsdgfsgf utc-week_of_year:2013-W31 dsfsdgfsdgf
// from the first "_year:" (after utc-hour_of) through the space before the final dsfsdgfsdgf
because by default the * quantifier is greedy. To make it reluctant you need to add ? after *, so try
"_year:.*?\\s"
or, even better, instead of .*? match only non-whitespace characters using \\S, which is the negation of \\s and can also be written as [^\\s]. Also, if your data can appear at the end of your input, you probably shouldn't require \\s at the end of your regex (or add a space in your replacement), so try one of these:
.replaceAll("_year:\\S*", "_year:Y")
.replaceAll("_year:\\S*\\s", "_year:Y ")
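The difference between the two suggestions above can be checked with a small sketch. The class and method names (`YearMasker`, `maskReluctant`, `maskNonSpace`) are made up for illustration, and the last input line, where the date is the final token, is an assumed case that the question does not show.

```java
final class YearMasker {
    // Reluctant version: needs a whitespace character after the date,
    // so a date at the very end of the input is left untouched.
    static String maskReluctant(String line) {
        return line.replaceAll("_year:.*?\\s", "_year:Y ");
    }

    // \S* version: consumes only the non-whitespace run after "_year:",
    // so it also works when the date is the last token on the line.
    static String maskNonSpace(String line) {
        return line.replaceAll("_year:\\S*", "_year:Y");
    }

    public static void main(String[] args) {
        String[] lines = {
            "dsads utc-hour_of_year:2013-07-30T17 jdshkdsjhf utc-week_of_year:2013-W31 dskjdskf",
            "utc-week_of_year:2013-W31 dskdsld fdsfd",
            "dshdskhkds utc-month_of_year:2013-07" // assumed case: date at end of line
        };
        for (String line : lines) {
            System.out.println(maskReluctant(line));
            System.out.println(maskNonSpace(line));
        }
    }
}
```

On the first two lines both methods give the same result; on the last one only `maskNonSpace` replaces the date, because there is no trailing whitespace for `\\s` to match.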

Related

Regex for matching strings between '=' and '/' or "=" and end of string

I am looking for regex which can help me replace strings like
source=abc/task=cde/env=it --> source='abc'/task='cde'/env='it'
To be more precise, I want to take each string that starts after an = and ends at either a / or the end of the string, and wrap it in single quotes.
I tried code like this:
"source=abc/task=cde/env=it".replaceAll("=(.*?)/","'$1'")
But that results in
source'abc'task'cde'env=it

Using lookahead and look behind:
(?<==)([^/]*)((?=/)|$)
Lookbehind allows you to specify what comes before your match. In this case an equals: (?<==).
The main match in my regex looks for any non-slash character, zero or more times: ([^/]*)
Lookahead allows you to specify what comes after your match. In this case, a slash: (?=/).
The $ matches the end of the line, so that the last item in your test data is also quoted. ((?=/)|$) combines this with the lookahead, meaning "either a slash comes after the match or this is the end of the line".
Here it is in action in a test.
@Test
public void test_quote_items() {
    String regex = "(?<==)([^/]*)((?=/)|$)";
    String actual = "source=abc/task=cde/env=it".replaceAll(regex, "'$1'");
    String expected = "source='abc'/task='cde'/env='it'";
    assertEquals(expected, actual);
}

Try
String input = "source=abc/task=cde/env=it".replaceAll("=(.*?)(/|$)","='$1'/");
The problems I found are that you are not putting the = back in the replacement, and that the pattern does not handle the end of the string, where there is no / to match.
Output:
source='abc'/task='cde'/env='it'/
If you don't want the trailing '/', it is trivial to remove.
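One way to avoid the trailing slash altogether, sketched here as a variation rather than as what the answer proposes, is to put whatever the second group matched back with `$2`. The class and method names are hypothetical.

```java
final class QuoteValues {
    static String quote(String input) {
        // Group 1 is the value after "=".
        // Group 2 is the delimiter: "/" in the middle, empty at end of input.
        // Re-inserting $2 means no extra "/" is appended after the last value.
        return input.replaceAll("=(.*?)(/|$)", "='$1'$2");
    }

    public static void main(String[] args) {
        System.out.println(quote("source=abc/task=cde/env=it"));
        // prints: source='abc'/task='cde'/env='it'
    }
}
```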

Regex pattern matching is getting timed out

I want to split an input string based on a regex pattern using the Pattern.split(String) API. The regex uses lookaheads and lookbehinds. It is supposed to split on a delimiter (,) and needs to ignore the delimiter if it is enclosed in double quotes ("x,y").
The regex is - (?<!(?<!\Q\\E)\Q\\E)\Q,\E(?=(?:[^\Q"\E]*(?<=\Q,\E)\Q"\E[[^\Q,\E|\Q"\E] | [\Q"\E]]+[^\Q"\E]*[^\Q\\E]*[\Q"\E]*)*[^\Q"\E]*$)
The input string for which this split call is getting timed out is -
"","1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]","QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"
I read that lookaround techniques are heavy and can cause timeouts if the string is too long. If I remove the backslashes enclosing [\"BOLT,HI-JOK\"] at the end of the string, then the regex is able to detect the delimiters and split.
The pattern also does not detect the first delimiter at [STIFFENER]","QH20426AD3 with the above string. But if I remove the backslashes enclosing [\"BOLT,HI-JOK\"] at the end of the string, then the regex is able to detect it.
I am not very experienced with lookarounds in regex. Can someone please give hints about how I can optimize this regex and avoid timeouts?
Any pointers or article links are appreciated!

If you want to split on a comma only when it is directly followed by a double-quoted string (from an opening to a closing double quote), you can use:
,(?="[^"\\]*(?:\\.[^"\\]*)*")
The pattern matches:
, Match a comma
(?= Positive lookahead
"[^"\\]* Match " and 0+ times any char except " or \
(?:\\.[^"\\]*)*" Optionally repeat matching \ followed by any char (the .) to consume an escaped character, then again any chars other than " and \, and finally match the closing "
) Close lookahead
String string = "\"\",\"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]\",\"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\\\"BOLT,HI-JOK\\\"]\"\n";
String[] parts = string.split(",(?=\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")");
for (String part : parts)
System.out.println(part);
Output
""
"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]"
"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"

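An alternative to splitting, shown here only as a sketch and not as part of the answer above, is to match the quoted fields themselves with `Matcher.find()`. It reuses the same quoted-string sub-pattern, without any lookaround, and assumes every field in the line is wrapped in double quotes. The class name, method name, and shortened sample line are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedFields {
    // A double-quoted field: runs of characters that are neither " nor \,
    // interleaved with backslash-escaped characters.
    private static final Pattern FIELD =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"");

    static List<String> fields(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            result.add(m.group()); // includes the surrounding quotes
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened, made-up version of the input from the question.
        String line = "\"\",\"0011,- [BRACKET],0017,- [FRAME]\",\"LL510A3-10 [\\\"BOLT,HI-JOK\\\"]\"";
        for (String field : fields(line)) {
            System.out.println(field);
        }
    }
}
```

Because each character is consumed by exactly one branch of the pattern, there is no ambiguity for the engine to backtrack over, which is the property that matters for avoiding timeouts on long lines.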
How to remove comma after a word pattern in java

Please help me find a regex to remove the comma after a word pattern in Java.
Assume I would like to delete the comma after each pattern, where the input is <Word$TAG>, <Word$TAG>, <Word$TAG>, <Word$TAG>, <Word$TAG> and I want my output to be <Word$TAG> <Word$TAG> <Word$TAG> <Word$TAG> <Word$TAG>. If I use .replaceAll(), it will replace all commas, but the Word in <Word$TAG> may itself contain a comma (,).
For example, Input.txt is as follows:
mms§NNP_ACRON, site§N_NN, pe§PSP, ,,,,,§RD_PUNC, link§N_NN, ....§RD_PUNC, CID§NNP_ACRON, team§N_NN, :)§E
and Output.txt
mms§NNP_ACRON site§N_NN pe§PSP ,,,,,§RD_PUNC link§N_NN ....§RD_PUNC CID§NNP_ACRON team§N_NN :)§E

You could search for ", " and replace it with " " (a space), as below. Strings are immutable, so assign the result:
one = one.replace(", ", " ");
If you think you may have "myString, ,,," or multiple spaces in between, then you could use replaceAll with a regex like
one = one.replaceAll(",\\s+", " ");

Try this:
(?<=[^,\s]),
Replace with the empty string. See the demo:
http://regex101.com/r/lZ5mN8/5
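In Java the lookbehind suggestion could be applied roughly as follows. This is a sketch with hypothetical class and method names; `\u00a7` is the § character, written as an escape so the source compiles regardless of file encoding.

```java
final class CommaStripper {
    // Removes every comma whose preceding character is neither a comma
    // nor whitespace, i.e. a comma that directly follows a tag.
    static String strip(String s) {
        return s.replaceAll("(?<=[^,\\s]),", "");
    }

    public static void main(String[] args) {
        String in = "mms\u00a7NNP_ACRON, pe\u00a7PSP, ,,,,,\u00a7RD_PUNC, :)\u00a7E";
        System.out.println(strip(in));
    }
}
```

Note that this pattern would also remove a comma in the middle of a word (for example in `a,b§TAG`), since that comma is preceded by a non-comma, non-space character too.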

Match the data you want, not the one you don't want.
You probably want ([^ ]+), and then keep the captured (parenthesized) data, separated by whitespace.
You might even want to narrow it down to ([^ ]+§[^ ]+),. Usually, stricter is better.
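A minimal sketch of this "match what you want" idea, assuming the stricter pattern and a `$1` back-reference in the replacement (the answer does not spell out the replacement, so that part is an assumption, as are the class and method names):

```java
final class TokenKeeper {
    // Matches a whole token of the form word§tag followed by a comma and
    // keeps only the token (group 1), dropping the trailing comma.
    // \u00a7 is the § separator, written as an escape for encoding safety.
    static String dropTrailingCommas(String s) {
        return s.replaceAll("([^ ]+\u00a7[^ ]+),", "$1");
    }

    public static void main(String[] args) {
        String in = "mms\u00a7NNP_ACRON, pe\u00a7PSP, ,,,,,\u00a7RD_PUNC, :)\u00a7E";
        System.out.println(dropTrailingCommas(in));
    }
}
```

Because the whole token is matched, commas inside the word part (as in `,,,,,§RD_PUNC`) are consumed by the first `[^ ]+` and preserved by `$1`.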

You could use a positive lookahead assertion to match all the commas which are followed by a space or end of the line anchor.
String s = "mms§NNP_ACRON, site§N_NN, pe§PSP, ,,,,,§RD_PUNC, link§N_NN, ....§RD_PUNC, CID§NNP_ACRON, team§N_NN, :)§E";
System.out.println(s.replaceAll(",(?=\\s|$)",""));
Output:
mms§NNP_ACRON site§N_NN pe§PSP ,,,,,§RD_PUNC link§N_NN ....§RD_PUNC CID§NNP_ACRON team§N_NN :)§E

Multiline Regex Matching Issue

I have the following string that I am trying to match via regex:
;IF TEST_DATE <= 200112 THEN E>=90 AND S>=90
OR P = "25" ENDIF
IF TEST_DATE >= 200201 AND TEST_DATE < 200407 THEN E>=89
AND S>=90 OR P = "25" ENDIF
I am using the following regex in an attempt to match from the semicolon (intended to be a comment) until the first ENDIF:
;\s*IF (\d|\D)+ ENDIF
Unfortunately, this pattern matches all the way until the second ENDIF. I've tried various solutions using the Java Pattern.DOTALL, as well as the (?s) flag, with no luck.

You are using a greedy quantifier, so (\d|\D)+ matches everything up to the last ENDIF.
You need to use the reluctant quantifier +? if you want your regex to stop matching at the first ENDIF:
;\s*IF (\d|\D)+? ENDIF

Use the non-greedy quantifier.
;\s*IF (\d|\D)*? ENDIF
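The reluctant match can be tried out with `Pattern` and `Matcher`. In this sketch `(\d|\D)` is swapped for `.` with `Pattern.DOTALL`, which also matches across line breaks; that substitution, like the class and method names, is a choice made for the example and not something the answers above prescribe.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstEndif {
    // DOTALL lets "." match line terminators; "+?" stops at the first " ENDIF".
    private static final Pattern COMMENTED_IF =
            Pattern.compile(";\\s*IF .+? ENDIF", Pattern.DOTALL);

    // Returns the first commented-out IF ... ENDIF block, or null if none.
    static String firstBlock(String text) {
        Matcher m = COMMENTED_IF.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = ";IF TEST_DATE <= 200112 THEN E>=90 AND S>=90\n"
                + "OR P = \"25\" ENDIF\n"
                + "IF TEST_DATE >= 200201 AND TEST_DATE < 200407 THEN E>=89\n"
                + "AND S>=90 OR P = \"25\" ENDIF";
        System.out.println(firstBlock(text));
    }
}
```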

Regex: strip all tags except those containing keyword "univ"

[introduction][position]Lead Researcher and Research Manager[/position] in the [affiliation]Web Search and Mining Group, Microsoft Research[/affiliation]</b>.
I am a [position]lead researcher[/position] at [affiliation]Microsoft Research[/affiliation]. I am also [position]adjunct professor[/position] of [affiliation]Peking University[/affiliation], [affiliation]Xian Jiaotong University[/affiliation] and [affiliation]Nankai University[/affiliation].
I joined [affiliation]Microsoft Research[/affiliation] in June 2001. Prior to that, I worked at the Research Laboratories of NEC Corporation.
I obtained a [bsdegree]B.S.[/bsdegree] in [bsmajor]Electrical Engineering[/bsmajor] from [bsuniv]Kyoto University[/bsuniv] in [bsdate]1988[/bsdate] and a [msdegree]M.S.[/msdegree] in [msmajor]Computer Science[/msmajor] from [msuniv]Kyoto University[/msuniv] in [msdate]1990[/msdate]. I earned my [phddegree]Ph.D.[/phddegree] in [phdmajor]Computer Science[/phdmajor] from the [phduniv]University of Tokyo[/phduniv] in [phddate]1998[/phddate].
I am interested in [interests]statistical learning[/interests], [interests]natural language processing[/interests], [interests]data mining, and information retrieval[/interests].[/introduction]
I'm able to strip all tags from the paragraph above with:
String stripped = html.replaceAll("\\[.*?\\]", "");
But I'd like to keep three pairs of tags in the paragraph: [bsuniv][/bsuniv], [msuniv][/msuniv] and [phduniv][/phduniv]. In other words, I don't want to strip the tags containing the keyword "univ". I can't find a convenient way to rewrite the regular expression. Can anyone help me?

You can use a negative lookahead assertion here:
str = str.replaceAll("\\[(.(?!univ))*?\\]", "");
or:
str = str.replaceAll("\\[((?!univ).)*?\\]", "");
Both of them will give you the desired output for your input. There is only one difference:
The first one matches a character and then does the negative lookahead after it: only if that character is not followed by univ does it move on to the next character. Because the check happens after the first character, a tag that begins with univ (such as [univ]) would still be stripped.
The second one does the negative lookahead at the position before every character: only if the text starting there is not univ does it go ahead and match a single character.
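As a quick check of the second pattern on a shortened sentence from the question (the class and method names are hypothetical):

```java
final class TagStripper {
    // Strips [tag] and [/tag] markers unless the text between the
    // brackets contains "univ" anywhere.
    static String stripExceptUniv(String text) {
        return text.replaceAll("\\[((?!univ).)*?\\]", "");
    }

    public static void main(String[] args) {
        String in = "a [bsdegree]B.S.[/bsdegree] from "
                + "[bsuniv]Kyoto University[/bsuniv] in [bsdate]1988[/bsdate]";
        System.out.println(stripExceptUniv(in));
        // prints: a B.S. from [bsuniv]Kyoto University[/bsuniv] in 1988
    }
}
```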

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.