Write a Regex for a string - java

Hi, I have a file with a lot of bad data lines. I've identified the lines with bad data, but the file is too big to fix manually. The problem may recur in the future, so I'm writing a small tool in Java that removes the bad segments matching an input regex.
An example of Bad data is
ABC*HIK*UG*XY\17
I'm trying to write a regex for the above string. So far,
only "(^ABC)" works, and ABC is removed.
When I use this, nothing happens:
"(^ABC*.XY\17$)"
Please give your inputs.
EDITED:
The answer works perfectly, but
if my input file contains this
ABC
123
ABC*HIK*UG*XY\17
1025
KHJ*YU*789
I should get output like
ABC
123
1025
KHJ*YU*789
but I'm getting this, with an empty line left where the bad line was removed:
ABC
123
1025
KHJ*YU*789

Change your pattern to,
"^ABC.*XY\\\\17$"
In a Java string literal, each regex backslash must itself be escaped, so four backslashes ("\\\\") are needed to match a single literal \ character. The pattern that matches any character zero or more times is .*, not * on its own. You also don't need to put your pattern inside a capturing group.
String s = "ABC\n" +
"123\n" +
"ABC*HIK*UG*XY\\17\n" +
"1025\n" +
"KHJ*YU*789";
System.out.println(s.replaceAll("(?m)^ABC.*XY\\\\17\n?", ""));
Output:
ABC
123
1025
KHJ*YU*789
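Because the regex uses anchors and the input contains several lines, we need the multi-line modifier (?m) so that ^ and $ match at the start and end of each line rather than only at the start and end of the whole input. The trailing \n? also consumes the line break, so no empty line is left behind.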
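For the tool described in the question, filtering line by line avoids the multi-line modifier altogether. The following is only a minimal sketch under that assumption; the class name BadLineFilter and the method removeBadLines are hypothetical and not part of the original answer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class BadLineFilter {
    // Keeps only the lines that do NOT fully match the given regex.
    // matches() requires the whole line to match, so no ^ or $ is needed.
    static List<String> removeBadLines(List<String> lines, String regex) {
        Pattern bad = Pattern.compile(regex);
        return lines.stream()
                .filter(line -> !bad.matcher(line).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "ABC", "123", "ABC*HIK*UG*XY\\17", "1025", "KHJ*YU*789");
        // "\\\\" in Java source is the regex \\, which matches one literal backslash.
        removeBadLines(input, "ABC.*XY\\\\17").forEach(System.out::println);
    }
}
```

For a very large file, the same filter could presumably be applied to the stream returned by Files.lines(path) instead of an in-memory list.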

Related

Reluctant java regex patterns in jEdit [duplicate]

Here's something I'm trying to do with regular expressions, and I can't figure out how. I have a big file, and strings abc, 123 and xyz that appear multiple times throughout the file.
I want a regular expression to match a substring of the big file that begins with abc, contains 123 somewhere in the middle, ends with xyz, and there are no other instances of abc or xyz in the substring besides the start and the end.
Is this possible with regular expressions?
When your left- and right-hand delimiters are single characters, it can be easily solved with negated character classes. So, if your match is between a and c and should not contain b (literally), you may use
a[^abc]*c
This is the same technique you use when you want to make sure there is a b in between the closest a and c:
a[^abc]*b[^ac]*c
When your left- and right-hand delimiters are multi-character strings, you need a tempered greedy token:
abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz
To make sure it matches across lines, use the re.DOTALL flag when compiling the regex in Python (Pattern.DOTALL in Java).
Note that to achieve a better performance with such a heavy pattern, you should consider unrolling it. It can be done with negated character classes and negative lookaheads.
Pattern details:
abc - match abc
(?:(?!abc|xyz|123).)* - match any character that is not the starting point of an abc, xyz or 123 character sequence
123 - a literal string 123
(?:(?!abc|xyz).)* - any character that is not the starting point of an abc or xyz character sequence
xyz - a trailing substring xyz
See the Python demo:
import re
p = re.compile(r'abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz', re.DOTALL)
s = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz"
print(p.findall(s))
# => ['abc 123 xyz', 'abc 123 xyz', 'abc text 123 xyz']
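Since the question is about Java regex patterns, the same tempered greedy token can be tried with java.util.regex. This is a sketch only; the class name TemperedTokenDemo and the helper findAll are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemperedTokenDemo {
    // DOTALL lets . match line terminators, like re.DOTALL in the Python demo.
    static final Pattern P = Pattern.compile(
            "abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz", Pattern.DOTALL);

    // Collects every non-overlapping match, similar to Python's findall.
    static List<String> findAll(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz";
        System.out.println(findAll(s));
        // prints [abc 123 xyz, abc 123 xyz, abc text 123 xyz]
    }
}
```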
Using PCRE, a solution would be the pattern below.
It uses the m flag. If you want to match only whole lines, add ^ at the beginning and $ at the end.
abc(?!.*(abc|xyz).*123).*123(?!.*(abc|xyz).*xyz).*xyz
The comment by hvd is quite appropriate, and this just provides an example. In SQL, for instance, I think it would be clearer to do:
where val like 'abc%123%xyz' and
val not like 'abc%abc%' and
val not like '%xyz%xyz'
I imagine something quite similar is simple to do in other environments.
You could use lookaround.
/^abc(?!.*abc).*123.*(?<!xyz.*)xyz$/g
(I've not tested it. Note that the unbounded lookbehind (?<!xyz.*) is rejected by many regex engines, including Java's.)

multiple regular expressions vs search algorithm

I have a text file where every line is a random combination of any of the following groups
Numbers - English Letters - Arabic Letters - Punctuation
\w, which is composed of a-zA-Z0-9_, for the first 2 groups
\p{InArabic} for the third group
\p{Punct}, which is composed of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~, for the fourth group
I got this info from here
I read a line. The ONLY time I do something to this line is if the line contains Arabic letters AND (English letters OR Unicode symbols).
After reading this post and this post I came up with the following expression. Obviously it's wrong as my output is all wrong >.<
pattern = Pattern.compile("(?=\\p{InArabic})(?=[a-zA-Z])");
Here's the input
1
1a
a!
aش
شa
ششa
aشش
شaش
aشa
!aش
The first three shouldn't be matched, but my output shows that NONE of them match.
Edit: sorry, I just realized that I forgot to change my title. If any of you feel that searching performs better, please suggest a search algorithm. Using a search algorithm instead of a regex looks ugly, but I'd go with it if it performed better. Thanks to the posts I read, I learned that I can make the regex faster by putting this in the constructor, so that it is executed only once instead of on every iteration of my loop:
pattern = Pattern.compile("(?=\\p{InArabic})(?=[a-zA-Z])");
matcher = pattern.matcher("");
To follow your idea, the correct pattern is:
pattern = Pattern.compile("(?=.*\\p{InArabic})(?=.*[a-zA-Z\\p{Punct}])");
The same position in a string cannot be followed by an Arabic letter and by a punctuation character or a Latin letter at the same time. In other words, you have written a condition that is always false. Adding .* allows the characters to be anywhere in the string.
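The corrected lookahead pattern can be checked against a few of the sample lines from the question. This is a small sketch; the class name MixedLineCheck and the method isMixed are hypothetical, and the Arabic letter is written as the escape \u0634 only to keep the source ASCII.

```java
import java.util.regex.Pattern;

public class MixedLineCheck {
    // Compiled once and reused, as the question suggests.
    static final Pattern MIXED =
            Pattern.compile("(?=.*\\p{InArabic})(?=.*[a-zA-Z\\p{Punct}])");

    // find() succeeds at position 0 when both lookaheads can be satisfied.
    static boolean isMixed(String line) {
        return MIXED.matcher(line).find();
    }

    public static void main(String[] args) {
        String[] lines = {"1", "1a", "a!", "a\u0634", "\u0634a", "!a\u0634"};
        for (String line : lines) {
            System.out.println(line + " -> " + isMixed(line));
        }
    }
}
```

With this input, the first three lines report false and the last three report true.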
If you want a more optimised pattern, you can use Jason C's idea but with negated character classes to reduce the backtracking:
pattern = Pattern.compile("\\p{InArabic}[^a-zA-Z\\p{Punct}]*[a-zA-Z\\p{Punct}]|[a-zA-Z\\p{Punct}]\\P{InArabic}*\\p{InArabic}");
If you want to find a line with a mix, all you really need are 2 boundary condition checks.
A successful match indicates a mix.
# "\\p{InArabic}(?=[\\w\\p{Punct}])|(?<=[\\w\\p{Punct}])\\p{InArabic}"
\p{InArabic}
(?= [\w\p{Punct}] )
|
(?<= [\w\p{Punct}] )
\p{InArabic}

Differentiating SQL strings from comments

I have files of SQL code that I want to beautify, and I'm having trouble with differentiating between whether a certain line/part of the code is a String or a comment.
My current process is I do a Pattern/Matcher search through the file and pull out the strings with the regex N?'([']{2}|[^'])*+'(?!') and the comments with \s*--.*?\n|/\*.*?\*/, and put them in their respective storage arrays to avoid formatting them.
EXAMPLES:
WHERE y = 'STRING' -> WHERE y = THIS_IS_A_STRING and strings[0] = 'STRING'
SELECT x --do not format-> SELECT x THIS_IS_A_COMMENT and comments[0] = --do not format
After beautifying everything, I then go through and search for THIS_IS_A_STRING and THIS_IS_A_COMMENT and restore their respective values from the arrays.
The problem I'm running into is if a comment has an apostrophe in it, or if a SQL string has double dashes in it. I can fix one problem, but it causes the other, depending on whether I choose to preserve strings or comments first.
For example:
With strings preserved first, --Don't format this will match 'nt format this all the way through to the next ' (because strings can span multiple lines).
On the flip side, if I choose to preserve comments first:
In SELECT x FROM y WHERE z = '--THIS_IS_AS_STRING--', it will detect the -- and store everything up to the next newline in the comments array.
Any help would greatly be appreciated.
EDIT: I know I should probably do this with a SQL parser, but I have been working on this mainly with regex, and this is the last step I need to finish.
I made this regexp:
/^(([^\\'"\-]+|\-[^\\'"\-]|\\.)+|-?'([^\\']+|\\.)+'|-?"([^\\"]+|\\.)+")+\-\-[^\n]+/
to match these rules for SQL comments:
A comment row ends with --, the comment text, and a line break.
Before the comment we can have:
any characters except \'"-
a - if not followed by any of \'"-
a \ followed by any character, including \'"-
a pair of ' that doesn't have a ' between them, unless it has an odd number of \ in front of it.
a pair of " that doesn't have a " between them, unless it has an odd number of \ in front of it.
The pairs can have a single - in front of them, but not 2.
Did I miss something?
This link may help:
Java Regex find/replace pattern in SQL comments
I paste the code here:
try {
    Pattern regex = Pattern.compile("(?:/\\*[^;]*?\\*/)|(?:--[^;]*?$)", Pattern.DOTALL | Pattern.MULTILINE);
    Matcher regexMatcher = regex.matcher(subjectString);
    while (regexMatcher.find()) {
        // matched text: regexMatcher.group()
        // match start: regexMatcher.start()
        // match end: regexMatcher.end()
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}
I would replace the comment first, and then use the replaced string as input for string regex. This way the regex will not confuse string and comment.
While I realize that Song is looking for a regex solution for this problem, I would like to point out that SQL is not regular (https://stackoverflow.com/a/5639859/2503659), hence no regex solution exists.
With that said, I think others have given good solutions for common scenarios.
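One common way around the ordering problem in the question, which none of the answers above spell out, is to match strings and comments in a single pass with one alternation, so that whichever construct starts first consumes its own text. The sketch below only illustrates that idea: the class name SqlTokenSketch and the group names str and cmt are hypothetical, and the pattern covers only single-quoted strings with '' escapes, -- line comments and /* */ block comments, not dialect-specific syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SqlTokenSketch {
    // Group "str": optional N prefix, then a quoted string where '' is an escaped quote.
    // Group "cmt": a -- comment up to the end of the line, or a /* ... */ block comment.
    static final Pattern TOKEN = Pattern.compile(
            "(?<str>N?'(?:''|[^'])*')|(?<cmt>--[^\\n]*|/\\*.*?\\*/)",
            Pattern.DOTALL);

    // Returns each string or comment found, labelled with its kind, in source order.
    static List<String> classify(String sql) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(sql);
        while (m.find()) {
            String kind = (m.group("str") != null) ? "STRING" : "COMMENT";
            out.add(kind + ": " + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String sql = "--Don't format this\n"
                + "SELECT x FROM y WHERE z = '--NOT_A_COMMENT--'";
        classify(sql).forEach(System.out::println);
    }
}
```

Because the comment on the first line is consumed as a whole, the apostrophe inside it never opens a string, and the dashes inside the quoted value never open a comment.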

Regular expression for UK postal codes

I'm making an application that asks the user to enter a postcode and outputs the postcode if it is valid.
I found the following pattern, which works correctly:
String pattern = "^([A-PR-UWYZ](([0-9](([0-9]|[A-HJKSTUW])?)?)|([A-HK-Y][0-9]([0-9]|[ABEHMNPRVWXY])?)) [0-9][ABD-HJLNP-UW-Z]{2})";
I don't know much about regex and it would be great if someone could talk me through this statement. I mainly don't understand the ? and use of ().
Your regex has the following:
^ and $ - anchors for the start and end of the matching input (the pattern as posted contains only ^).
[A-PR-UWYZ] - Any character among A to P or R to U or W,Y,Z. Characters enclosed in square brackets form a character class, which allows any of the enclosed characters and - is for indicating a sequence of characters like [A-D] allowing A,B,C or D.
([0-9]|[A-HJKSTUW])? - An optional character any of 0-9 or characters indicated by [A-HJKSTUW]. ? makes the preceding part optional. | is for an OR. The () combines the two parts to be ORed. Here you may use [0-9A-HJKSTUW] instead of this.
[ABD-HJLNP-UW-Z]{2} - Sequence of length 2 formed by characters allowed by the character class. {2} indicates the length 2. So [ABD-HJLNP-UW-Z]{2} is equivalent to [ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]
The ? means "occurs 0 or 1 times", and the parentheses do grouping as you might expect; quantifiers apply to whole groups. A regex tutorial is probably the best thing here:
http://www.vogella.com/articles/JavaRegularExpressions/article.html
I had a brief look and it seems reasonable. For practice/play, see this applet:
http://www.cis.upenn.edu/~matuszek/General/RegexTester/regex-tester.html
A simple example: (ab)?
means 'ab' once or not at all
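To see the optional groups in action, the pattern from the question can be run against a few inputs. This is a sketch for experimentation only; the class name PostcodeCheck and the method isValid are hypothetical, and the sample inputs are chosen just to exercise the two alternatives.

```java
import java.util.regex.Pattern;

public class PostcodeCheck {
    // The pattern exactly as posted in the question.
    static final Pattern POSTCODE = Pattern.compile(
            "^([A-PR-UWYZ](([0-9](([0-9]|[A-HJKSTUW])?)?)|([A-HK-Y][0-9]([0-9]|[ABEHMNPRVWXY])?)) [0-9][ABD-HJLNP-UW-Z]{2})");

    // matches() requires the whole input to match the pattern.
    static boolean isValid(String s) {
        return POSTCODE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("M1 1AE"));   // first alternative; the optional part is skipped
        System.out.println(isValid("SW1A 1AA")); // second alternative; the optional letter is present
        System.out.println(isValid("QQ1 1AA"));  // Q is excluded by [A-PR-UWYZ]
        System.out.println(Pattern.matches("x(ab)?y", "xy")); // (ab)? matched zero times
    }
}
```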

Java - Unknown characters passing as [a-zA-z0-9]*?

I'm no expert in regex but I need to parse some input I have no control over, and make sure I filter away any strings that don't have A-z and/or 0-9.
When I run this,
Pattern p = Pattern.compile("^[a-zA-Z0-9]*$"); //fixed typo
if(!p.matcher(gottenData).matches())
System.out.println(someData); //someData contains gottenData
certain spaces + an unknown symbol somehow slip through the filter (gottenData is the red rectangle):
In case you're wondering, it DOES also display Text, it's not all like that.
For now, I don't mind the [?] as long as it also contains some string along with it.
Please help.
[EDIT] As far as I can tell from the (very large) input, the [?]'s are either whitespace or nothing at all; maybe there's some sort of encoding issue, or perhaps it has something to do with #text nodes (the input is XML).
The * quantifier matches "zero or more", which means it will also match a string that does not contain any of the characters in your class, i.e. the empty string. Try the + quantifier, which means "one or more": ^[a-zA-Z0-9]+$ will match strings made up of alphanumeric characters only. ^.*[a-zA-Z0-9]+.*$ will match any string containing one or more alphanumeric characters, although the leading .* will make it much slower. If you use Matcher.find() instead of Matcher.matches(), a full string match is not required and you can use the regex [a-zA-Z0-9]+. Matcher.lookingAt() also drops the full-match requirement, but it only matches at the start of the input.
You have an error in your regex: instead of [a-zA-z0-9]* it should be [a-zA-Z0-9]*.
You don't need ^ and $ around the regex.
Matcher.matches() always matches the complete string.
String gottenData = "a ";
Pattern p = Pattern.compile("[a-zA-Z0-9]*");
if (!p.matcher(gottenData).matches())
    System.out.println("doesn't match.");
this prints "doesn't match."
The correct answer is a combination of the above answers. First, I imagine your intended character class is [a-zA-Z0-9]. Note that A-z isn't as bad as you might think: it includes all characters in the ASCII range between A and z, which is the letters plus a few extras (specifically [, \, ], ^, _ and `).
A second potential problem, as Martin mentioned, is that you may need to add the start and end anchors if you want the string to consist only of letters and numbers.
Finally, you use the * quantifier, which means 0 or more, so the pattern can match 0 characters: an empty string passes matches(), and with find() the pattern would succeed on any input. What you need is the + quantifier. So the pattern you are most likely looking for is:
^[a-zA-Z0-9]+$
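The difference between the quantifiers and between the Matcher methods discussed above can be checked with a short program. This is a sketch; the class name QuantifierDemo and the helpers onlyAlnum and containsAlnum are made up for illustration.

```java
import java.util.regex.Pattern;

public class QuantifierDemo {
    static final Pattern STAR = Pattern.compile("[a-zA-Z0-9]*");
    static final Pattern PLUS = Pattern.compile("[a-zA-Z0-9]+");

    // True only if the whole string is one or more alphanumeric characters.
    static boolean onlyAlnum(String s) {
        return PLUS.matcher(s).matches();
    }

    // True if at least one alphanumeric character occurs anywhere in the string.
    static boolean containsAlnum(String s) {
        return PLUS.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(STAR.matcher("").matches());      // true: * accepts zero characters
        System.out.println(onlyAlnum(""));                   // false: + needs at least one
        System.out.println(onlyAlnum("riz99"));              // true
        System.out.println(onlyAlnum("riz 99"));             // false: the space is not in the class
        System.out.println(containsAlnum(" a "));            // true: find() searches anywhere
        System.out.println(PLUS.matcher(" a").lookingAt());  // false: lookingAt() starts at index 0
    }
}
```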
You have to change the regexp to "^[a-zA-Z0-9]*$" to ensure that you are matching the entire string
Looks like it should be "a-zA-Z0-9", not "a-zA-z0-9", try correcting that...
Did anyone consider adding a space to the regex: [a-zA-Z0-9 ]*? This should match any normal text with letters, numbers and spaces. If you want quotes and other special characters, add them to the regex too.
You can quickly test your regex at http://www.regexplanet.com/simple/
You can check whether the input value contains only letters and digits by using the regex ^[a-zA-Z0-9]*$
If the value contains only letters and digits, it matches, e.g. riz99, riz99z;
otherwise it does not match, e.g. 99z., riz99.z, riz99.9
Example code:
if (e.target.value.match('^[a-zA-Z0-9]*$')) {
    console.log('match')
} else {
    console.log('not match')
}
