Regular expression is not accepted - java

I have implemented the code to count the occurrence of words in a text. However, my regular expression is not accepted for some reason and I get the following error:
Exception in thread "main" java.util.regex.PatternSyntaxException: Unclosed character class near index 12
My code is:
import java.util.*;
public class CountOccurrenceOfWords {
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
// TODO code application logic here
char lf = '\n';
String text = "It was the best of times, it was the worst of times," +
lf +
"it was the age of wisdom, it was the age of foolishness," +
lf +
"it was the epoch of belief, it was the epoch of incredulity," +
lf +
"it was the season of Light, it was the season of Darkness," +
lf +
"it was the spring of hope, it was the winter of despair," +
lf +
"we had everything before us, we had nothing before us," +
lf +
"we were all going direct to Heaven, we were all going direct" +
lf +
"the other way--in short, the period was so far like the present" +
lf +
"period, that some of its noisiest authorities insisted on its" +
lf +
"being received, for good or for evil, in the superlative degree" +
lf +
"of comparison only." +
lf +
"There were a king with a large jaw and a queen with a plain face," +
lf +
"on the throne of England; there were a king with a large jaw and" +
lf +
"a queen with a fair face, on the throne of France. In both" +
lf +
"countries it was clearer than crystal to the lords of the State" +
lf +
"preserves of loaves and fishes, that things in general were" +
lf +
"settled for ever";
TreeMap<String, Integer> map = new TreeMap<String, Integer>();
String[] words = text.split("[\n\t\r.,;:!?(){");
for(int i = 0; i < words.length; i++){
String key = words[i].toLowerCase();
if(key.length() > 0) {
if(map.get(key) == null){
map.put(key, 1);
}
else{
int value = map.get(key);
value++;
map.put(key, value);
}
}
}
Set<Map.Entry<String, Integer>> entrySet = map.entrySet();
//Get key and value from each entry
for(Map.Entry<String, Integer> entry: entrySet){
System.out.println(entry.getValue() + "\t" + entry.getKey());
}
}
}
Also, could you please give me a hint on how I can order the words alphabetically? Thank you in advance.

You missed the closing "]" at the end of your regular expression.
"[\n\t\r.,;:!?(){" is not correct.
You need to change your regular expression to "[\n\t\r.,;:!?(){]"
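As a minimal sketch of the corrected split (not the asker's exact program): the class name WordCountSketch, the method countWords and the DELIMITERS constant are made up for illustration, and the delimiter set additionally contains a space, '}' and '-', which is an assumption on my part since the posted character class has no space in it.

```java
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    // Closed character class. The space, '}' and '-' are assumed additions;
    // inside [...] the characters . ? ( ) { } need no escaping.
    static final String DELIMITERS = "[ \\n\\t\\r.,;:!?(){}-]+";

    static Map<String, Integer> countWords(String text) {
        // TreeMap keeps its keys sorted, so iteration is already alphabetical.
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.split(DELIMITERS)) {
            if (!word.isEmpty()) {
                counts.merge(word.toLowerCase(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "It was the best of times, it was the worst of times,";
        countWords(text).forEach((word, n) -> System.out.println(n + "\t" + word));
    }
}
```

Because a TreeMap iterates over its keys in their natural order, the words come out alphabetically without any extra sorting step.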

Outside a character class you need to escape special regex characters such as (, ), [, ?, . and { with a backslash, e.g. \[. Inside a character class most of them are already literal, so the real problem here is the [ that is never closed by a ]. You may also consider the predefined character class \s for whitespace; it matches \r, \t and more.

Your problem is an unclosed character class in your regular expression. Regex has some predefined special characters which you need to escape when you want to match them literally.
A character class is:
With a "character class", also called "character set", you can tell the regex engine to match only one out of several characters. Simply place the characters you want to match between square brackets.
Source
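This means you have to either escape the opening bracket and the other special characters:
\[\n\t\r\.,;:!\?\(\)\{
Or close the character class:
[\n\t\r\.,;:!\?\(\){]
Outside a character class you need to escape the dot, the question mark, the parentheses and the brace. Inside one these escapes are optional, and for splitting on any one of these characters the closed character class is the form you want.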

Related

Regex display with arrays

So I have a regex question. When running this code
if (str1.trim().contains(search2)){
String str3 = str1;
str3 = str3.replaceAll("[^-?0-9]+", " ");
System.out.println("location: " + Arrays.asList(str3.trim().split(" ")));
System.out.println(" ");
}
it produces
location: [290, -70]
Is it possible to drop the bracket characters, turning "[x, x]" into "x x", so that only the values are shown within quotes?
location: "290 -70"?
I'm kinda new to regex so I tried some things like .replace("[", " "); but it did not work.
EDIT ----
Here's my entire code.
public static void main (String [] args) throws IOException {
BufferedReader in = new BufferedReader (new FileReader ("/Users/Dannybwee/Documents/workspace/csc199/src/csc199/test.txt"));
String str;
List<String> finallist = new ArrayList<String>();
while ((str = in.readLine()) != null){
finallist.add(str);
}
String search = "node";
String search2 = "position";
for (String str1: finallist) {
if (str1.trim().contains(search)){
System.out.print("{ key " + str1+ ",\n" +
"name: " + str1 + ",\n" +
"Truth: 'Tainted'," + "\n" +
"False: 'NotTainted, \n");
}
if (str1.trim().contains(search2)){
String str3 = str1;
str3 = str3.replaceAll("[^-?0-9]+", " ");
System.out.println("location: " + Arrays.asList(str3.trim().split(" ")));
System.out.println("}");
}
}
}
What I'm trying to do is take a text file and change the formatting of the text. I thought it would be easiest to read the file and scan for what needs to change. For instance, all I want is to replace the brackets in the output above.
So basically I want it to output location: "290 -70" instead of location: [290, -70], without the comma and brackets.
I'm splitting because the line is positions = (number number); and what I'm trying to do is just extract the numbers from that line.
Then if you split, you get ["(number", "number)"].
You want to remove the round brackets, not the square ones. And you have already done that: replacing [^-?0-9]+ removes every run of characters other than 0-9, - and ?.
You don't need to split anything.
if (str1.trim().contains(search2)){
str1 = str1.replaceAll("[^-?0-9]+", " ").trim();
System.out.println("location: \"" + str1 + "\"");
System.out.println("}");
}
You could also forget the regex entirely and use str1.substring(1, str1.length() - 1)
By the way, if you are trying to produce JSON, it isn't valid. The keys need to be quoted
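A small sketch of the no-split approach, assuming an input line in the "position = (290 -70);" shape (the sample line, the class name ReplaceAllSketch and the method extractNumbers are hypothetical, not taken from the asker's file):

```java
class ReplaceAllSketch {
    // Replaces every run of characters other than digits, '-' and '?'
    // with one space, then trims the leading and trailing space.
    static String extractNumbers(String line) {
        return line.replaceAll("[^-?0-9]+", " ").trim();
    }

    public static void main(String[] args) {
        // Hypothetical input line, modelled on "positions = (number number);".
        String line = "position = (290 -70);";
        System.out.println("location: \"" + extractNumbers(line) + "\"");
    }
}
```

The trim() call matters because the text before the first number and after the last one is also collapsed to a space.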
You can specify a literal bracket with the backslash "escape character": \[. The same goes for the other characters that have a special meaning in regex:
\\ , \. , \( ... etc
It is important to note that in a Java string literal the backslash itself must be escaped, so you need two backslashes wherever the regex needs one:
\\[, \\\\, \\., \\( ... etc
You can implement this into your existing code, or you could make your life a little easier by using a pattern matcher.
Pattern p = Pattern.compile("\\D+?(-?\\d++)\\D+?(-?\\d++)\\D*");
Matcher m = p.matcher(STRING);
String results = null;
if (m.matches()) { // group(n) is only valid after a successful match
results = "location: "+m.group(1)+" "+m.group(2);
}
\\D+? eliminates non-digit (0-9) characters reluctantly, this will spare the '-' when found.
(-?\\d++) will capture m.group(n) which will possessively contain as many digits as it can find in a row. Since the '-' was spared earlier it should be present for this capture if at all.
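Here is the matcher idea as a self-contained sketch. The class name PositionSketch, the method location and the sample line are assumptions for illustration; the pattern itself is the one quoted above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PositionSketch {
    private static final Pattern TWO_NUMBERS =
            Pattern.compile("\\D+?(-?\\d++)\\D+?(-?\\d++)\\D*");

    // Returns the formatted location, or null when the line does not
    // contain two numbers preceded by at least one non-digit character.
    static String location(String line) {
        Matcher m = TWO_NUMBERS.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return "location: \"" + m.group(1) + " " + m.group(2) + "\"";
    }

    public static void main(String[] args) {
        // Hypothetical input line, modelled on "positions = (number number);".
        System.out.println(location("position = (290 -70);"));
    }
}
```

Calling matches() (or find()) first is required; calling group() on a Matcher that has not matched throws an IllegalStateException.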

Regular Expression That Contains All Of The Specific Letters In Java

I have a regular expression that selects all the words containing all (not just any) of the specified letters. It works fine in Notepad++.
Regular Expression Pattern;
^(?=.*B)(?=.*T)(?=.*L).+$
Input Text File;
AL
BAL
BAK
LABAT
TAL
LAT
BALAT
LA
AB
LATAB
TAB
And output of the regular expression in notepad++;
LABAT
BALAT
LATAB
Since it works in Notepad++, I tried the same regular expression in Java, but it simply failed.
Here is my test code;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.lev.kelimelik.resource.*;
public class Test {
public static void main(String[] args) {
String patternString = "^(?=.*B)(?=.*T)(?=.*L).+$";
String dictionary =
"AL" + "\n"
+"BAL" + "\n"
+"BAK" + "\n"
+"LABAT" + "\n"
+"TAL" + "\n"
+"LAT" + "\n"
+"BALAT" + "\n"
+"LA" + "\n"
+"AB" + "\n"
+"LATAB" + "\n"
+"TAB" + "\n";
Pattern p = Pattern.compile(patternString, Pattern.DOTALL);
Matcher m = p.matcher(dictionary);
while(m.find())
{
System.out.println("Match: " + m.group());
}
}
}
The output is erroneous, as shown below;
Match: AL
BAL
BAK
LABAT
TAL
LAT
BALAT
LA
AB
LATAB
TAB
My question is simply, what is the java-compatible version of this regular expression?
Java-specific answer
In real life, we rarely need to validate several lines at once, and I see that you in fact just use the input as an array of test data. The most common scenario is reading input line by line and performing checks on each line. I agree that the solution would look a bit different in Notepad++, but in Java each line should be checked separately.
That said, you should not copy the same approach across platforms. What is good in Notepad++ does not have to be good in Java.
I suggest this almost regex-free approach (String#split() still uses it):
String dictionary_str =
"AL" + "\n"
+"BAL" + "\n"
+"BAK" + "\n"
+"LABAT" + "\n"
+"TAL" + "\n"
+"LAT" + "\n"
+"BALAT" + "\n"
+"LA" + "\n"
+"AB" + "\n"
+"LATAB" + "\n"
+"TAB" + "\n";
String[] dictionary = dictionary_str.split("\n"); // Split into lines
for (int i=0; i<dictionary.length; i++) // Iterate through lines
{
if(dictionary[i].indexOf("B") > -1 && // There must be B
dictionary[i].indexOf("T") > -1 && // There must be T
dictionary[i].indexOf("L") > -1) // There must be L
{
System.out.println("Match: " + dictionary[i]); // No need matching, print the whole line
}
}
See IDEONE demo
Original regex-based answer
You should not rely on .* ever. This construct causes backtracking issues all the time. In this case, you can easily optimize it with a negated character class and possessive quantifiers:
^(?=[^B]*+B)(?=[^T]*+T)(?=[^L]*+L)
The regex breakdown:
^ - start of string
(?=[^B]*+B) - right at the start of the string, check for at least one B presence that may be preceded with 0 or more characters other than B
(?=[^T]*+T) - still right at the start of the string, check for at least one T presence that may be preceded with 0 or more characters other than T
(?=[^L]*+L)- still right at the start of the string, check for at least one L presence that may be preceded with 0 or more characters other than L
See Java demo:
String patternString = "^(?=[^B]*+B)(?=[^T]*+T)(?=[^L]*+L)";
String[] dictionary = {"AL", "BAL", "BAK", "LABAT", "TAL", "LAT", "BALAT", "LA", "AB", "LATAB", "TAB"};
for (int i=0; i<dictionary.length; i++)
{
Pattern p = Pattern.compile(patternString);
Matcher m = p.matcher(dictionary[i]);
if(m.find())
{
System.out.println("Match: " + dictionary[i]);
}
}
Output:
Match: LABAT
Match: BALAT
Match: LATAB
Change your Pattern to:
String patternString = ".*(?=.*B)(?=.*L)(?=.*T).*";
Output
Match: LABAT
Match: BALAT
Match: LATAB
I did not debug your situation, but I think your problem is caused by matching the entire string rather than individual words.
You're matching "AL\nBAL\nBAK\nLABAT\n" plus some more. Of course that string has all the required characters. You can see it in the fact that your output only contains one Match: prefix.
Please have a look at this answer. You need to use Pattern.MULTILINE.
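A minimal sketch of the Pattern.MULTILINE suggestion, keeping the asker's original pattern and only swapping the flag (the class name MultilineSketch and the method wordsWithAllLetters are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineSketch {
    static List<String> wordsWithAllLetters(String dictionary) {
        // MULTILINE makes ^ and $ match at line boundaries. Without DOTALL
        // the dot stops at '\n', so each lookahead stays inside one line.
        Pattern p = Pattern.compile("^(?=.*B)(?=.*T)(?=.*L).+$", Pattern.MULTILINE);
        Matcher m = p.matcher(dictionary);
        List<String> hits = new ArrayList<>();
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String dictionary = "AL\nBAL\nBAK\nLABAT\nTAL\nLAT\nBALAT\nLA\nAB\nLATAB\nTAB\n";
        for (String word : wordsWithAllLetters(dictionary)) {
            System.out.println("Match: " + word);
        }
    }
}
```

With Pattern.DOTALL, as in the question, the dot also matches line breaks, which is why the whole dictionary came back as a single match.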

regex seems to be off for special characters (e.g. +-.,!##$%^&*;)

I am using regex to print out a string, adding a new line once a character limit is reached. I don't want to split a word if it hits the limit (the word should start on the next line instead), unless a run of concatenated characters exceeds the limit, in which case I just continue the rest of the word on the next line. However, when I hit special characters (e.g. +-.,!##$%^&*;), as you'll see when I test my code below, one additional character is added to the limit for some reason. Why is this?
My function is:
public static String limiter(String str, int lim) {
str = str.trim().replaceAll(" +", " ");
str = str.replaceAll("\n +", "\n");
Matcher mtr = Pattern.compile("(.{1," + lim + "}(\\W|$))|(.{0," + lim + "})").matcher(str);
String newStr = "";
int ctr = 0;
while (mtr.find()) {
if (ctr == 0) {
newStr += (mtr.group());
ctr++;
} else {
newStr += ("\n") + (mtr.group());
}
}
return newStr ;
}
So my input is:
String str = " The 123456789 456789 +-.,!##$%^&*();\\/|<>\"\' fox jumpeded over the uf\n 2 3456 green fence ";
With a character line limit of 7.
It outputs:
456789 +
-.,!##$%
^&*();\/
|<>"
When the correct output should be:
456789
+-.,!##
$%^&*()
;\/|<>"
My code is linked to an online compiler you can run here:
https://ideone.com/9gckP1
You need to replace the (\W|$) with \b as your intention is to match whole words (and \b provides this functionality). Also, since you do not need trailing whitespace on newly created lines, you need to also use \s*.
So, use
Matcher mtr = Pattern.compile("(?U)(.{1," + lim + "}\\b\\s*)|(.{0," + lim + "})").matcher(str);
See demo
Note that (?U) is used here to "fix" the word boundary behavior to keep it in sync with \w (so that diacritics were not considered word characters).
In your pattern, \\W is part of the first capturing group. It is adding this one (non-word) character to the .{1,limit} pattern.
Try with: "(.{1," + lim + "})(\\W|$)|(.{0," + lim + "})"
(I can't currently use your regex online compiler)
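To see the \b variant in isolation, here is a rough sketch. It is not the asker's limiter: the class name LineWrapSketch and the method wrap are hypothetical, it returns a list instead of a joined string, and it strips trailing whitespace and skips the empty match that the second alternative produces at the end of the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineWrapSketch {
    static List<String> wrap(String text, int limit) {
        // First alternative: up to `limit` characters ending on a word
        // boundary, plus any whitespace that follows. Second alternative:
        // a hard cut for runs that contain no word boundary at all.
        Pattern p = Pattern.compile(
                "(?U)(.{1," + limit + "}\\b\\s*)|(.{0," + limit + "})");
        Matcher m = p.matcher(text);
        List<String> lines = new ArrayList<>();
        while (m.find()) {
            String line = m.group().replaceAll("\\s+$", ""); // drop trailing whitespace
            if (!line.isEmpty()) {                           // skip the empty match at the end
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : wrap("456789 +-.,!##$%^&*()", 7)) {
            System.out.println(line);
        }
    }
}
```

Since \b is zero-width, it does not add a character to the line the way (\W|$) does; the run of symbols contains no word boundary, so it falls through to the second alternative and is cut at exactly the limit.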

Regex for circle and polygon string with decimal/integer values

I'm trying to create regex patterns to be used in Java for the following two strings:
CIRCLE ( (187.8562 ,-88.562 ) , 0.774 )
and
POLYGON ( (17.766 55.76676,77.97666 -32.866888,54.97799 54.2131,67.666777 24.9771,17.766 55.76676) )
Please note that
one or more white spaces may exist anywhere, except between the letters of a word and between the digits of a number. [UPDATED]
CIRCLE and POLYGON words are fixed but are not case sensitive. [UPDATED]
For the 2nd string the number of point sets is not fixed. Here I've given 5 point sets for simplicity.
points are sets of decimal/integer numbers [UPDATED]
a positive decimal number can have a + sign [UPDATED]
a leading zero is not mandatory for a decimal number [UPDATED]
For a polygon at least 3 point sets are required. And also first & last point set will be the same (enclosed polygon) [UPDATED]
Any help or suggestion will be appreciated.
I've tried as:
(CIRCLE)(\\s+)(\\()(\\s+)(\\()(\\s+)([+-]?\\d*\\.\\d+)(?![-+0-9\\.])(\\s+)(,)(\\s+)([+-]?\\d*\\.\\d+)(?![-+0-9\\.])(\\s+)(\\))(\\s+)(,)(\\s+)([+-]?\\d*\\.\\d+)(?![-+0-9\\.])(\\s+)(\\))
Could you please provide me with a working regex pattern for those two strings?
I suggest you remove the spaces from your string before submitting it to the regex.
Circle:
CIRCLE\(\(-?\d+\.\d+,-?\d+\.\d+\),[-]?\d+\.\d+\)
Polygon:
POLYGON\(\((-?\d+\.\d+\s+-?\d+\.\d+,)+-?\d+\.\d+\s+-?\d+\.\d+\)\)
Circle including spaces:
CIRCLE\s*\(\s*\(\s*-?\d+\.\d+\s*,\s*-?\d+\.\d+\s*\)\s*,\s*-?\d+\.\d+\s*\)
Polygon including spaces:
POLYGON\s*\(\s*\(\s*(-?\d+\.\d+\s+-?\d+\.\d+\s*,\s*)+\s*-?\d+\.\d+\s+-?\d+\.\d+\s*\)\s*\)
Circle including spaces updated:
/CIRCLE\s*\(\s*\(\s*[+-]?\d*\.\d+\s*,\s*[+-]?\d*\.\d+\s*\)\s*,\s*[+-]?\d*\.\d+\s*\)/i
Polygon including spaces updated:
/POLYGON\s*\(\s*\(\s*([+-]?\d*\.\d+)\s+([+-]?\d*\.\d+)\s*(,\s*[+-]?\d*\.\d+\s+[+-]?\d*\.\d+)+\s*,\s*\1\s+\2\s*\)\s*\)/i
UPDATED ANSWER:
This matches the examples from the question and comments:
(CIRCLE|POLYGON)([( ]+)([+ \-\.]?(\d+)?([ \.]\d+[ ,)]+))+
Any help or suggestion will be appreciated.
My suggestion is to break it up into pieces. Just as you'd want to break up a large, complex function into smaller functions so that each part is easy to see and understand, you want to break up a large, complex regex pattern into smaller patterns for the same reason. For example:
private interface Patterns {
String UNSIGNED_INTEGER = "(?:0|[1-9]\\d*+)";
String DECIMAL_PART = "(?:[.]\\d++)";
String UNSIGNED_NUMBER_WITH_INTEGER_PART =
"(?:" + UNSIGNED_INTEGER + DECIMAL_PART + "?+)";
String UNSIGNED_NUMBER =
"(?:" + UNSIGNED_NUMBER_WITH_INTEGER_PART + "|" + DECIMAL_PART + ")";
String NUMBER = "(?:[+-]?+" + UNSIGNED_NUMBER + ")";
String SPACE_SEPARATED_PAIR = "(?:" + NUMBER + "\\s++" + NUMBER + ")";
String OPTIONAL_SPACE = "(?:\\s*+)";
String LPAREN = "(?:" + OPTIONAL_SPACE + "[(]" + OPTIONAL_SPACE + ")";
String RPAREN = "(?:" + OPTIONAL_SPACE + "[)]" + OPTIONAL_SPACE + ")";
String COMMA = "(?:" + OPTIONAL_SPACE + "," + OPTIONAL_SPACE + ")";
Pattern CIRCLE = Pattern.compile(
OPTIONAL_SPACE + "CIRCLE" + OPTIONAL_SPACE + LPAREN +
LPAREN +
NUMBER + COMMA + NUMBER +
RPAREN + COMMA +
NUMBER +
RPAREN + OPTIONAL_SPACE,
Pattern.CASE_INSENSITIVE);
Pattern POLYGON = Pattern.compile(
OPTIONAL_SPACE + "POLYGON" + OPTIONAL_SPACE + LPAREN +
LPAREN +
SPACE_SEPARATED_PAIR + "(?:" + COMMA + SPACE_SEPARATED_PAIR + "){3,}+" +
RPAREN +
RPAREN + OPTIONAL_SPACE,
Pattern.CASE_INSENSITIVE);
}
Notes:
The above is not tested. My goal was to show you how to do this maintainably, rather than to simply do it for you. (It should work as-is, though, unless I have typos or whatnot.)
Note the pervasive use of non-capture groups (?:...). This allows each subpattern to be a separate module; for example, something like COMMA + "+" is well-defined as meaning "one or more commas, plus optional spaces".
Also note the pervasive use of possessive quantifiers like ?+ and *+ and ++. It's easier to tell what is matched by a given occurrence of NUMBER when you know that NUMBER will never "stop short" before a trailing digit. (Imagine having a function whose behavior depended on the code that runs after it. That would be confusing, right? Well, the non-possessive quantifiers can change their meaning depending on what follows, which can have similarly confusing results for large, complex regexes.) This also has considerable performance benefits in the event of a near-match.
I made no attempt to detect the "And also first & last point set will be the same (enclosed polygon)" case. Regexes are not suited to this, since regexes are string-description language, and "same" in this case is not a string concept but a mathematical one. (It's easy to tell that 1 +0.3 is equivalent to +1.0 .30 if you use something like BigDecimal to store the actual values; but to try to express that using a regex would be pure folly.)
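The "first and last point are the same" check can be done in plain code after the regex has validated the shape. A rough sketch using BigDecimal follows; the class name ClosedPolygonSketch, the method isClosed and the simplified number pattern are my own assumptions, and the input is assumed to have already matched a POLYGON pattern.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClosedPolygonSketch {
    // Simplified number pattern: optional sign, then digits with an
    // optional fraction, or a bare fraction such as ".30".
    private static final Pattern NUMBER =
            Pattern.compile("[+-]?(?:\\d+(?:\\.\\d+)?|\\.\\d+)");

    // Assumes `polygon` has already been validated against a POLYGON regex.
    static boolean isClosed(String polygon) {
        List<BigDecimal> values = new ArrayList<>();
        Matcher m = NUMBER.matcher(polygon);
        while (m.find()) {
            values.add(new BigDecimal(m.group()));
        }
        int n = values.size();
        if (n < 4 || n % 2 != 0) {
            return false; // need at least two complete x/y pairs
        }
        // compareTo ignores scale, so 1 equals 1.0 and 0.3 equals .30
        return values.get(0).compareTo(values.get(n - 2)) == 0
                && values.get(1).compareTo(values.get(n - 1)) == 0;
    }

    public static void main(String[] args) {
        System.out.println(isClosed("POLYGON((1 +0.3,2 2,3 3,+1.0 .30))"));
    }
}
```

BigDecimal.compareTo is used rather than equals because equals also compares the scale and would treat 1 and 1.0 as different.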

What does regex "\\p{Z}" mean?

I am working with some Java code that has a statement like
String tempAttribute = ((String) attributes.get(i)).replaceAll("\\p{Z}","")
I am not used to regex, so what does it mean? (If you could point me to a website for learning the basics of regex, that would be wonderful.) I've seen that a string like
ept as y gets transformed into eptasy, but this doesn't seem right. I believe the guy who wrote this maybe wanted to trim leading and trailing spaces.
It removes all the whitespace (replaces all whitespace matches with empty strings).
A wonderful regex tutorial is available at regular-expressions.info.
A citation from this site:
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
The OP stated that the code fragment was in Java. To comment on the statement:
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
the sample code below shows that this does not apply in Java.
public static void main(String[] args) {
// some normal white space characters
String str = "word1 \t \n \f \r " + '\u000B' + " word2";
// various regex patterns meant to remove ALL white spaces
String s = str.replaceAll("\\s", "");
String p = str.replaceAll("\\p{Space}", "");
String b = str.replaceAll("\\p{Blank}", "");
String z = str.replaceAll("\\p{Z}", "");
// \\s removed all white spaces
System.out.println("s [" + s + "]\n");
// \\p{Space} removed all white spaces
System.out.println("p [" + p + "]\n");
// \\p{Blank} removed only \t and spaces not \n\f\r
System.out.println("b [" + b + "]\n");
// \\p{Z} removed only spaces not \t\n\f\r
System.out.println("z [" + z + "]\n");
// NOTE: \p{Separator} throws a PatternSyntaxException
try {
String t = str.replaceAll("\\p{Separator}","");
System.out.println("t [" + t + "]\n"); // N/A
} catch ( Exception e ) {
System.out.println("throws " + e.getClass().getName() +
" with message\n" + e.getMessage());
}
} // public static void main
The output for this is:
s [word1word2]
p [word1word2]
b [word1
word2]
z [word1
word2]
throws java.util.regex.PatternSyntaxException with message
Unknown character property name {Separator} near index 12
\p{Separator}
^
This shows that in Java \\p{Z} removes the spaces but not \t, \n, \f or \r, so it does not match "any kind of whitespace or invisible separator".
These results also show that in Java \\p{Separator} throws a PatternSyntaxException.
First of all, \p means you are going to match a class, i.e. a collection of characters, not a single one. For reference, this is the Javadoc of the Pattern class: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Unicode scripts, blocks, categories and binary properties are written with the \p and \P constructs as in Perl. \p{prop} matches if the input has the property prop, while \P{prop} does not match if the input has that property.
And then Z is the name of a class (collection, set) of characters. In this case, it is the abbreviation of Separator. Separator contains 3 subclasses: Space_Separator (Zs), Line_Separator (Zl) and Paragraph_Separator (Zp).
To see which characters those classes contain, refer to the Unicode Character Database or
Unicode Character Categories
More document: http://www.unicode.org/reports/tr18/#General_Category_Property
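To make the Zs/Zl distinction concrete, here is a small sketch (the class name SeparatorSketch and the method stripSeparators are made up for illustration). It relies on NO-BREAK SPACE (U+00A0) being in Space_Separator and LINE SEPARATOR (U+2028) being in Line_Separator, while tab and line feed are control characters and not in Z at all.

```java
class SeparatorSketch {
    // Removes every character whose Unicode general category is
    // Zs, Zl or Zp. Control characters such as \t and \n are kept.
    static String stripSeparators(String s) {
        return s.replaceAll("\\p{Z}", "");
    }

    public static void main(String[] args) {
        String nbsp = String.valueOf((char) 0x00A0);    // NO-BREAK SPACE, category Zs
        String lineSep = String.valueOf((char) 0x2028); // LINE SEPARATOR, category Zl
        String input = "a b" + nbsp + "c" + lineSep + "d\te\nf";
        // The space, U+00A0 and U+2028 are removed; \t and \n survive.
        System.out.println(stripSeparators(input).equals("abcd\te\nf"));
    }
}
```

If the intent really was to trim only leading and trailing spaces, as the asker suspects, String.trim() would be the usual tool rather than replaceAll with \p{Z}.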
