I have to validate the lines of a text file. A line looks like this:
"Field1" "Field2" "Field3 Field_3.1 Field3.2" 23 3445 "Field5".
The delimiter is a single space (\s). If more than one space is present outside of the text fields, the line should be rejected. For example:
Note: \s appears in the line as a literal space, not as the characters \s. I write spaces as \s below only for readability.
Invalid:
"Field1"\s\s"Field2" "Field3 Field_3.1 Field3.2" 23\s\s3445 "Field5". //two or more spaces between "Field1" and "Field2", or between the numeric fields 23 and 3445
Valid:
"Field1\s\s" "\s\sField2" "Field3\s\sField_3.1\s\sField3.2" 23 3445 "Field5". //two or more spaces within the third field "Field3 Field_3.1 Field3.2", or at the end/beginning of any field, as in the first two fields
I created the pattern below to validate the spaces in between. But it does not work as expected when a field wrapped in double quotes contains more than two strings and a number, like "Field3 Field_3.1 123".
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class SpaceValidation
{
    public static void main(String[] args)
    {
        String spacePattern_1 = "[\"^\\n]\\s{2,}?(\".*\")|\\s\\s\\d|\\d\\s\\s";
        String line1 = "Field3 Field_3.1 "; // valid, and the pattern doesn't flag it as invalid - works as expected
        String line2 = "Field3 Field_3.1 123"; // valid, but the pattern flags it as invalid - not working as expected
        Pattern pattern = Pattern.compile(spacePattern_1);
        Matcher matLine1 = pattern.matcher(line1);
        Matcher matLine2 = pattern.matcher(line2);
        if (matLine1.find())
        {
            System.out.println("Invalid Line1");
        }
        if (matLine2.find())
        {
            System.out.println("Invalid Line2");
        }
    }
}
I have also tried the pattern given below, but I have to avoid it because of reported backtracking issues. It does not work either when a field has more than two subfields and the line contains runs of two or more spaces.
(\".*\")\\s{2,}?(\".*\")|\\s\\s\\d|\\d\\s\\s
// * or . shouldn't be present more than once in the same condition to prevent backtracking, hence I have to use negation of \\n in the above code
Kindly let me know how I could resolve this using pattern for fields such as "field3 field3.1 123", which is a valid field. Thanks in advance.
EDIT:
After a little tinkering, I narrowed the issue down to digits. The line is flagged as invalid only if the third subfield is numeric ("Field 3 Field3.1 123"). For letters it works fine.
The \\s\\s\\d part of the pattern seems to be the culprit. That condition flags the third subfield (the numeric subfield 123) as invalid. But I need it to validate numeric fields outside of the double quotes.
You can use
^(?:\"[^\"]*\"|\d+)(?:\s(?:\"[^\"]*\"|\d+))*$
If you are using it to extract lines from a multiline document:
(?m)^(?:\"[^\"\n\r]*\"|\d+)(?:\h(?:\"[^\"\n\r]*\"|\d+))*\r?$
See the regex demo.
Details:
^ - start of a string (line, if you use (?m) or Pattern.MULTILINE)
(?:\"[^\"]*\"|\d+) - either " + zero or more chars other than " + ", or one or more digits
(?:\s(?:\"[^\"]*\"|\d+))* - zero or more sequences of:
    \s - a single whitespace
    (?:\"[^\"]*\"|\d+) - either " + zero or more chars other than " + ", or one or more digits
$ - end of string
The second pattern contains \h instead of \s to only match horizontal whitespaces, [^\"\n\r] matches any char other than ", line feed and carriage return.
In Java:
String pattern = "^(?:\"[^\"]*\"|\\d+)(?:\\s(?:\"[^\"]*\"|\\d+))*$";
String multilinePattern = "(?m)^(?:\"[^\"\n\r]*\"|\\d+)(?:\\h(?:\"[^\"\n\r]*\"|\\d+))*\r?$";
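As a rough illustration, the sketch below wraps the first pattern in a small helper and runs it on sample lines modelled on the question. The names LineValidator and isValid are made up for this example, and the sample lines omit the sentence-ending period shown in the question.

```java
import java.util.regex.Pattern;

class LineValidator {
    // Quoted fields or digit runs, separated by exactly one whitespace character.
    private static final Pattern LINE =
            Pattern.compile("^(?:\"[^\"]*\"|\\d+)(?:\\s(?:\"[^\"]*\"|\\d+))*$");

    // Hypothetical helper: true when the whole line fits the pattern.
    static boolean isValid(String line) {
        return LINE.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            // single spaces between fields
            "\"Field1\" \"Field2\" \"Field3 Field_3.1 Field3.2\" 23 3445 \"Field5\"",
            // extra spaces only inside quoted fields
            "\"Field1  \" \"  Field2\" \"Field3  Field_3.1  123\" 23 3445 \"Field5\"",
            // two spaces between the first two quoted fields
            "\"Field1\"  \"Field2\" \"Field3 Field_3.1 Field3.2\" 23 3445 \"Field5\"",
            // two spaces between the numeric fields
            "\"Field1\" \"Field2\" 23  3445 \"Field5\""
        };
        for (String line : samples) {
            System.out.println(isValid(line) + " <- " + line);
        }
    }
}
```

The first two samples should be accepted and the last two rejected, because extra spaces are only tolerated inside the quoted alternatives.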
I have a set of inputs: ++++, ----, +-+-. Out of these inputs I want only the strings that contain nothing but + symbols.
If you want to see if a String contains nothing but + characters, write a loop to check it:
private static boolean containsOnly(String input, char ch) {
    if (input.isEmpty())
        return false;
    for (int i = 0; i < input.length(); i++)
        if (input.charAt(i) != ch)
            return false;
    return true;
}
Then call it to check:
System.out.println(containsOnly("++++", '+')); // prints: true
System.out.println(containsOnly("----", '+')); // prints: false
System.out.println(containsOnly("+-+-", '+')); // prints: false
UPDATE
If you must do it using regex (worse performance), then you can do any of these:
// escape special character '+'
input.matches("\\++")
// '+' not special in a character class
input.matches("[+]+")
// if "+" is dynamic value at runtime, use quote() to escape for you,
// then use a repeating non-capturing group around that
input.matches("(?:" + Pattern.quote("+") + ")+")
Replace final + with * in each of these, if an empty string should return true.
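The Pattern.quote() variant can be sketched as a small helper for the case where the symbol is only known at runtime. The names OnlySymbol and consistsOnlyOf are invented for this example.

```java
import java.util.regex.Pattern;

class OnlySymbol {
    // True if input is non-empty and consists solely of repetitions of symbol.
    // Pattern.quote() escapes the symbol, so "+" or "." are treated literally.
    static boolean consistsOnlyOf(String input, String symbol) {
        return input.matches("(?:" + Pattern.quote(symbol) + ")+");
    }

    public static void main(String[] args) {
        for (String input : new String[] {"++++", "----", "+-+-", ""}) {
            System.out.println("\"" + input + "\" -> " + consistsOnlyOf(input, "+"));
        }
    }
}
```

Because the group repeats the whole quoted value, a multi-character symbol such as "ab" also works: "abab" is accepted, "aba" is not.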
The regular expression for checking if a string is composed of only one repeated symbol is
^(.)\1*$
If you only want lines composed by '+', then it's
^\++$, or ^++*$ if your regex implementation does not support + (meaning "one or more").
For a sequence of the same symbol, use
(.)\1+
as the regular expression. For example, when the whole string has to match, this accepts +++ and --- but rejects +--.
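In Java the backreference approach might look like the sketch below. The class and method names are hypothetical. It also shows why it matters whether the whole string must match (String.matches) or a match anywhere is enough (Matcher.find).

```java
import java.util.regex.Pattern;

class RepeatedSymbol {
    private static final Pattern RUN = Pattern.compile("(.)\\1+");

    // Whole string is one symbol repeated one or more times.
    static boolean isOneRepeatedSymbol(String input) {
        return input.matches("(.)\\1*");
    }

    // Some symbol occurs at least twice in a row somewhere in the string.
    static boolean containsRun(String input) {
        return RUN.matcher(input).find();
    }

    public static void main(String[] args) {
        for (String input : new String[] {"++++", "----", "+-+-", "+--"}) {
            System.out.println(input + " whole=" + isOneRepeatedSymbol(input)
                    + " run=" + containsRun(input));
        }
    }
}
```

For "+--" the whole-string check fails, while the search still finds the run "--".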
Regex pattern: ^[^\+]*?\+[^\+]*$
This will only permit one plus sign per string.
Explanation:
^ #From start of string
[^\+]* #Match 0 or more non plus characters
\+ #Match 1 plus character
[^\+]* #Match 0 or more non plus characters
$ #End of string
Edit: I have just read the comments under the question. I didn't copy the regex posted there; we simply arrived at the same idea.
Whoops, when using matches() you can drop the ^ and $ anchors:
input.matches("[^\\+]*?\\+[^\\+]*")
I have seen "," replaced with "." by using ".$"|",$", but this logic does not work with letters.
I need to replace the last letter of every word in the string EXAMPLE_TEST with another letter, using Java.
This is my code:
Pattern replace = Pattern.compile("n$"); // this is where the real problem is
Matcher matcher2 = replace.matcher(EXAMPLE_TEST);
EXAMPLE_TEST = matcher2.replaceAll("k");
I also tried "//n$", "\n$", etc.
Please help me find a solution.
Input text => njan ayman
Output text => njak aymak
Instead of the end-of-string anchor $, use a word boundary \b:
String s = "njan ayman";
s = s.replaceAll("n\\b", "k");
System.out.println(s); //=> "njak aymak"
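To make the difference concrete, here is a small sketch that applies both anchors to the input from the question. The class and method names are only illustrative.

```java
class LastLetterDemo {
    // $ matches only at the end of the input, so only the last word changes.
    static String replaceAtEndOfString(String s) {
        return s.replaceAll("n$", "k");
    }

    // \b matches at the end of every word, so each trailing n changes.
    static String replaceAtEndOfWords(String s) {
        return s.replaceAll("n\\b", "k");
    }

    public static void main(String[] args) {
        String s = "njan ayman";
        System.out.println(replaceAtEndOfString(s)); // njan aymak
        System.out.println(replaceAtEndOfWords(s));  // njak aymak
    }
}
```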
You can use lookahead and group matching:
String EXAMPLE_TEST = "njan ayman";
String s = EXAMPLE_TEST.replaceAll("(n)(?=\\s|$)", "k");
System.out.println("s = " + s); // prints: s = njak aymak
Explanation:
(n) - the matched word character
(?=\\s|$) - which is followed by a space or at the end of the line (lookahead)
The above is only an example. If you want to replace every comma with a period, the middle line should be changed to:
s = s.replaceAll("(,)(?=\\s|$)", "\\.");
Here's how I would set it up:
(?=.\b)\w
Which in Java would need to be escaped as following:
(?=.\\b)\\w
It translates to something like "a word character (\w) for which the lookahead (?=...) confirms that this single character (.) is followed by a word boundary (\b)", in other words the last character of a word.
String s = "njan ayman aowkdwo wdonwan. wadawd,.. wadwdawd;";
s = s.replaceAll("(?=.\\b)\\w", "");
System.out.println(s); //nja ayma aowkdw wdonwa. wadaw,.. wadwdaw;
This removes the last character of all words, but leaves following non-alphanumeric characters. You can specify only specific characters to remove/replace by changing the . to something else.
However, the other answers are perfectly good and might achieve exactly what you are looking for.
// oldLetter and newLetter are Strings holding one letter each
if (word.endsWith(oldLetter)) {
    word = word.substring(0, word.length() - 1) + newLetter;
}
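Applied to a whole sentence, the endsWith/substring idea could be sketched as below. The helper name replaceLastLetter is hypothetical, and the sketch assumes that words are separated by single spaces.

```java
class WordEndings {
    // Replaces the last letter of every space-separated word that ends with oldLetter.
    static String replaceLastLetter(String text, char oldLetter, char newLetter) {
        String[] words = text.split(" ");
        for (int i = 0; i < words.length; i++) {
            String word = words[i];
            if (word.endsWith(String.valueOf(oldLetter))) {
                words[i] = word.substring(0, word.length() - 1) + newLetter;
            }
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        System.out.println(replaceLastLetter("njan ayman", 'n', 'k')); // njak aymak
    }
}
```

This avoids regular expressions entirely, at the cost of handling only plain space separators.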
I want to detect if a String is a decimal by using a regular expression. My question is more on how to use the regular expression mechanism than detecting that a String is a decimal. I use the RegExp class provided by GWT.
String regexDecimal = "\\d+(?:\\.\\d+)?";
RegExp regex = RegExp.compile(regexDecimal);
String[] decimals = { "one", "+2", "-2", ".4", "-.4", ".5", "2.5" };
for (int i = 0; i < decimals.length; i++) {
    System.out.println(decimals[i] + " "
        + decimals[i].matches(regexDecimal) + " "
        + regex.test(decimals[i]) + " "
        + regex.exec(decimals[i]));
}
The output:
one false false null
+2 false true 2
-2 false true 2
.4 false true 4
-.4 false true 4
.5 false true 5
2.5 true true 2.5
I was expecting that both methods String.matches() and RegExp.test() return the same result.
So what's the difference between both methods?
How to use the RegExp.test() to get the same behaviour?
Try to change the regex to
"^\\d+(?:\\.\\d+)?$"
Explanation:
The backslashes are doubled because the regex is written as a Java string literal.
Starting the regex with ^ forces it to match from the very start of the string.
Ending the regex with $ forces the match to extend to the very end of the string.
This is how you get GWT RegExp.test() to behave like String.matches().
I don't know the difference, but I would say that RegExp.test() is correct, because your regex matches as soon as there is a digit anywhere in your string, whereas String.matches() behaves as if there were anchors around the regex.
\\d+(?:\\.\\d+)?
Your non capturing group is optional, so one \\d ([0-9]) is enough to match, no matter what is around.
When you add anchors to your regex, that means it has to match the string from the start to the end, then RegExp.test() will probably show the same results.
^\\d+(?:\\.\\d+)?$
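The same distinction can be reproduced with plain java.util.regex. In the sketch below, Matcher.find() stands in for the search behaviour described for RegExp.test(); that correspondence is an assumption made for illustration, not GWT code, and the class and method names are invented.

```java
import java.util.regex.Pattern;

class FindVersusMatches {
    private static final Pattern UNANCHORED = Pattern.compile("\\d+(?:\\.\\d+)?");
    private static final Pattern ANCHORED = Pattern.compile("^\\d+(?:\\.\\d+)?$");

    // Whole-string match, like String.matches().
    static boolean matchesWhole(String s) {
        return UNANCHORED.matcher(s).matches();
    }

    // Search anywhere in the string (stand-in for a test()-style search).
    static boolean findsUnanchored(String s) {
        return UNANCHORED.matcher(s).find();
    }

    // Search with explicit anchors, which behaves like a whole-string match here.
    static boolean findsAnchored(String s) {
        return ANCHORED.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"one", "+2", "-2", ".4", "-.4", ".5", "2.5"}) {
            System.out.println(s + " " + matchesWhole(s) + " "
                    + findsUnanchored(s) + " " + findsAnchored(s));
        }
    }
}
```

With the anchors in place, the search agrees with the whole-string match for every input in the list.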
I need to build a regex that matches words with these patterns:
Letters and numbers:
A35, 35A, B503X, 1ABC5
Letters and numbers separated by "-", "/", "\":
AB-10, 10-AB, A10-BA, BA-A10, etc...
I wrote this regex for it:
\b[A-Za-z]+(?=[(?<!\-|\\|\/)\d]+)[(?<!\-|\\|\/)\w]+\b|\b[0-9]+(?=[(?<!\-|\\|\/)A-Za-z]+)[(?<!\-|\\|\/)\w]+\b
It works partially, but it also matches words that contain only letters or only numbers separated by symbols.
Example:
10-10, open-office, etc.
I don't want these to match.
I guess that my regex is very repetitive and somewhat ugly.
But it's what I have for now.
Could anyone help me?
I'm using java/groovy.
Thanks in advance.
Interesting challenge. Here is a java program with a regex that picks out the types of "words" you are after:
import java.util.regex.*;
public class TEST {
    public static void main(String[] args) {
        String s = "A35, 35A, B503X, 1ABC5 " +
                   "AB-10, 10-AB, A10-BA, BA-A10, etc... " +
                   "10-10, open-office, etc.";
        Pattern regex = Pattern.compile(
            "# Match special word having one letter and one digit (min).\n" +
            "\\b                       # Match first word having\n" +
            "(?=[-/\\\\A-Za-z]*[0-9])  # at least one number and\n" +
            "(?=[-/\\\\0-9]*[A-Za-z])  # at least one letter.\n" +
            "[A-Za-z0-9]+              # Match first part of word.\n" +
            "(?:                       # Optional extra word parts\n" +
            "  [-/\\\\]                # separated by -, / or backslash\n" +
            "  [A-Za-z0-9]+            # Match extra word part.\n" +
            ")*                        # Zero or more extra word parts.\n" +
            "\\b                       # Start and end on a word boundary",
            Pattern.COMMENTS);
        Matcher regexMatcher = regex.matcher(s);
        while (regexMatcher.find()) {
            System.out.print(regexMatcher.group() + ", ");
        }
    }
}
Here is the correct output:
A35, 35A, B503X, 1ABC5, AB-10, 10-AB, A10-BA, BA-A10,
Note that the only complex regexes which are "ugly", are those that are not properly formatted and commented!
Just use this:
([a-zA-Z]+[-\/\\]?[0-9]+|[0-9]+[-\/\\]?[a-zA-Z]+)
In a Java string literal, the backslashes must be doubled (the forward slash needs no escaping):
([a-zA-Z]+[-/\\\\]?[0-9]+|[0-9]+[-/\\\\]?[a-zA-Z]+)
Excuse me for writing my solution in Python; I don't know enough Java to write it in Java.
pat = re.compile('(?=(?:([A-Z])|[0-9])' ## This part verifies that
'[^ ]*' ## there are at least one
'(?(1)\d|[A-Z]))' ## letter and one digit.
'('
'(?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9])' # start of second group
'[A-Z0-9-/\\\\]*'
'[A-Z0-9](?= |\Z|,)' # end of second group
')',
re.IGNORECASE) # this group 2 catches the string
My solution captures the desired string in the second group: ((?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9])[A-Z0-9-/\\\\]*[A-Z0-9](?= |\Z|,))
The part before it verifies that at least one letter and at least one digit are present in the captured string:
(?(1)\d|[A-Z]) is a conditional regex that means "if group(1) captured something, then there must be a digit here, otherwise there must be a letter"
Group(1) is ([A-Z]) in (?=(?:([A-Z])|[0-9])
(?:([A-Z])|[0-9]) is a non-capturing group that matches a letter (captured) OR a digit, so when it matches a letter, group(1) isn't empty
The flag re.IGNORECASE makes the pattern match both upper-case and lower-case letters.
In the second group, I am obliged to write (?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9]) because lookbehind assertions of non-fixed length are not allowed. This part matches one character that can't be '-', preceded by a blank, a comma, or the start of the string.
Conversely, (?= |\Z|,) means "followed by a blank, the end of the string, or a comma".
This regex assumes that the characters '-', '/' and '\' can't be the first or the last character of a captured string. Is that right?
import re
pat = re.compile('(?=(?:([A-Z])|[0-9])' ## (from here) This part verifies that
'[^ ]*' # there are at least one
'(?(1)\d|[A-Z]))' ## (to here) letter and one digit.
'((?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9])'
'[A-Z0-9-/\\\\]*'
'[A-Z0-9](?= |\Z|,))',
re.IGNORECASE) # this group 2 catches the string
ch = "ALPHA13 10 ZZ 10-10 U-R open-office ,10B a10 UCS5000 -TR54 code vg4- DV-3000 SEA 300-BR gt4/ui bn\\3K"
print [ mat.group(2) for mat in pat.finditer(ch) ]
s = "A35, 35A, B503X,1ABC5 " +\
"AB-10, 10-AB, A10-BA, BA-A10, etc... " +\
"10-10, open-office, etc."
print [ mat.group(2) for mat in pat.finditer(s) ]
Result:
['ALPHA13', '10B', 'a10', 'UCS5000', 'DV-3000', '300-BR', 'gt4/ui', 'bn\\3K']
['A35', '35A', 'B503X', '1ABC5', 'AB-10', '10-AB', 'A10-BA', 'BA-A10']
My first pass yields
(^|\s)(?!\d+[-/\\]?\d+(\s|$))(?![A-Z]+[-/\\]?[A-Z]+(\s|$))([A-Z0-9]+[-/\\]?[A-Z0-9]+)(\s|$)
Sorry, but it's not Java-formatted (you'll need to double the backslashes, e.g. \\s). Also, you can't use \b, because a word boundary occurs next to any character that is not alphanumeric or an underscore, including the separators, so I used \s and the start and end of the string instead.
This is still a bit raw.
EDIT
Version 2 is slightly better, but its performance could be improved by using possessive quantifiers. It matches ABC76, AB-32, 3434-F, etc., but not ABC or 19\23.
((?<=^)|(?<=\s))(?!\d+[-/\\]?\d+(\s|$))(?![A-Z]+[-/\\]?[A-Z]+(\s|$))([A-Z0-9]+[-/\\]?[A-Z0-9]+)((?=$)|(?=\s))
A condition of the form (A OR NOT A) can be omitted, so the symbols can safely be ignored.
for (String word : "10 10-10 open-office 10B A10 UCS5000 code DV-3000 300-BR".split (" "))
    if (word.matches ("(.*[A-Za-z].*[0-9].*)|(.*[0-9].*[A-Za-z].*)"))
        // do something
You didn't mention -x4, 4x-, 4-x-, -4-x or -4-x-, I expect them all to match.
My expression just looks for something-alpha-something-digits-something, where something might be alpha, digits or symbols, and the opposite: something-digits-something-alpha-something. If something else might occur, like !#$~()[]{} and so on, it would get longer.
Tested with scala:
scala> for (word <- "10 10-10 open-office 10B A10 UCS5000 code DV-3000 300-BR".split (" ")
     | if word.matches ("(.*[A-Za-z].*[0-9].*)|(.*[0-9].*[A-Za-z].*)")) yield word
res89: Array[java.lang.String] = Array(10B, A10, UCS5000, DV-3000, 300-BR)
Slightly modified to filter matches:
String s = "A35, 35A, B53X, 1AC5, AB-10, 10-AB, A10-BA, BA-A10, etc. -4x, 4x- -4-x- 10-10, oe-oe, etc";
Pattern pattern = java.util.regex.Pattern.compile ("\\b([^ ,]*[A-Za-z][^ ,]*[0-9])[^ ,]*|([^ ,]*[0-9][^ ,]*[A-Za-z][^ ,]*)\\b");
Matcher matcher = pattern.matcher (s);
while (matcher.find ()) { System.out.print (matcher.group () + "|"); }
But I still have an error that I can't find:
A35|35A|B53X|1AC5|AB-10|10-AB|A10-BA|BA-A10|-4x|4x|-4-x|
4x should be 4x-, and -4-x should be -4-x-.
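A likely cause is the trailing \b: in "4x- " there is no word boundary between '-' and the space, so the regex backtracks to the last word character and stops at "4x". Note also that, because of the alternation, the leading \b applies only to the first alternative and the trailing \b only to the second. The sketch below is one possible rewrite, not the original approach: it replaces \b with lookarounds on the delimiters (space and comma) and checks for a letter and a digit with two lookaheads. The names MixedTokens and find are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MixedTokens {
    // A token is a maximal run of characters other than space and comma.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<![^ ,])"           // preceded by a delimiter or the start of input
          + "(?=[^ ,]*[A-Za-z])"   // the token contains at least one letter
          + "(?=[^ ,]*[0-9])"      // the token contains at least one digit
          + "[^ ,]+"               // the token itself
          + "(?![^ ,])");          // followed by a delimiter or the end of input

    // Returns every token that contains both a letter and a digit.
    static List<String> find(String s) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(s);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "A35, 35A, B53X, 1AC5, AB-10, 10-AB, A10-BA, BA-A10, etc. -4x, 4x- -4-x- 10-10, oe-oe, etc";
        System.out.println(String.join("|", find(s)));
    }
}
```

With this pattern the tokens 4x- and -4-x- keep their trailing '-', since the end of a token is defined by the delimiters rather than by word characters.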