How to create a Pattern matching given set of chars? - java

I get a set of chars, e.g. as a String containing all of them and need a charclass Pattern matching any of them. For example
for "abcde" I want "[a-e]"
for "[]^-" I want "[-^\\[\\]]"
How can I create a compact solution and how to handle border cases like empty set and set of all chars?
What chars need to be escaped?
Clarification
I want to create a charclass Pattern, i.e. something like "[...]", no repetitions and no such stuff. It must work for any input, that's why I'm interested in the corner cases, too.

Here's a start:
import java.util.*;
public class RegexUtils {
private static String encode(char c) {
switch (c) {
case '[':
case ']':
case '\\':
case '-':
case '^':
return "\\" + c;
default:
return String.valueOf(c);
}
}
public static String createCharClass(char[] chars) {
if (chars.length == 0) {
return "[^\\u0000-\\uFFFF]";
}
StringBuilder builder = new StringBuilder();
boolean includeCaret = false;
boolean includeMinus = false;
List<Character> set = new ArrayList<Character>(new TreeSet<Character>(toCharList(chars)));
if (set.size() == 1<<16) {
return "[\\w\\W]";
}
for (int i = 0; i < set.size(); i++) {
int rangeLength = discoverRange(i, set);
if (rangeLength > 2) {
builder.append(encode(set.get(i))).append('-').append(encode(set.get(i + rangeLength)));
i += rangeLength;
} else {
switch (set.get(i)) {
case '[':
case ']':
case '\\':
builder.append('\\').append(set.get(i));
break;
case '-':
includeMinus = true;
break;
case '^':
includeCaret = true;
break;
default:
builder.append(set.get(i));
break;
}
}
}
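// A lone '^' would produce the invalid class "[^]", so escape it when it would come first.
builder.append(includeCaret ? (builder.length() == 0 && !includeMinus ? "\\^" : "^") : "");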
builder.insert(0, includeMinus ? "-" : "");
return "[" + builder + "]";
}
private static List<Character> toCharList(char[] chars) {
List<Character> list = new ArrayList<Character>();
for (char c : chars) {
list.add(c);
}
return list;
}
private static int discoverRange(int index, List<Character> chars) {
int range = 0;
for (int i = index + 1; i < chars.size(); i++) {
if (chars.get(i) - chars.get(i - 1) != 1) break;
range++;
}
return range;
}
public static void main(String[] args) {
System.out.println(createCharClass("daecb".toCharArray()));
System.out.println(createCharClass("[]^-".toCharArray()));
System.out.println(createCharClass("".toCharArray()));
System.out.println(createCharClass("d1a3e5c55543b2000".toCharArray()));
System.out.println(createCharClass("!-./0".toCharArray()));
}
}
As you can see, the input:
"daecb".toCharArray()
"[]^-".toCharArray()
"".toCharArray()
"d1a3e5c55543b2000".toCharArray()
"!-./0".toCharArray()
prints:
[a-e]
[-\[\]^]
[^\u0000-\uFFFF]
[0-5a-e]
[!\--0]
The corner cases in a character class are:
\
[
]
which will need a \ to be escaped. The character ^ doesn't need an escape if it's not placed at the start of a character class, and the - does not need to be escaped when it's placed at the start, or end of the character class (hence the boolean flags in my code).

The empty set is [^\u0000-\uFFFF], and the set of all the characters is [\u0000-\uFFFF]. Not sure what you need the former for as it won't match anything. I'd throw an IllegalArgumentException() on an empty string instead.
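To make the two corner cases concrete, here is a minimal sketch that compiles both patterns and checks them against an ordinary string. The class name CornerCaseClasses and the sample text are only illustrative; the pattern strings are the ones used in the code above.

```java
import java.util.regex.Pattern;

class CornerCaseClasses {
    // "No character at all": the negation of the whole char range.
    static final Pattern NONE = Pattern.compile("[^\\u0000-\\uFFFF]");
    // "Any character": the form createCharClass returns for the full set.
    static final Pattern ALL = Pattern.compile("[\\w\\W]");

    public static void main(String[] args) {
        String sample = "abc [-^] \n\t";
        // The empty class finds nothing in this text.
        System.out.println(NONE.matcher(sample).find());
        // The full class matches every char, so removing all matches leaves nothing.
        System.out.println(ALL.matcher(sample).replaceAll("").isEmpty());
    }
}
```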
What chars need to be escaped?
- ^ \ [ ] - that's all of them, I've actually tested it. And unlike some other regex implementations [ is considered a meta character inside a character class, possibly due to the possibility of using inner character classes with operators.
The rest of the task sounds easy, but rather tedious. First you need to select the unique characters. Then loop through them, appending to a StringBuilder and escaping where necessary. If you want character ranges, you need to sort the characters first and select contiguous ranges while looping. If you want the - to be at the beginning of the class with no escaping, then set a flag, but don't append it. After the loop, if the flag is set, prepend - to the result before wrapping it in [].

Match all characters: ".*" (zero or more repetitions, *, of matching any character, .).
Match a blank line: "^$" (match the start of a line, ^, and the end of a line, $; note that there is nothing to match in between).
Not sure if the last pattern is exactly what you wanted, as there are different interpretations of "match nothing".

A quick, dirty, and almost-not-pseudo-code answer:
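StringBuilder sb = new StringBuilder("[");
// the characters that need a backslash inside a character class
Set<Character> metaChars = new HashSet<>(Arrays.asList('[', ']', '\\', '-', '^'));
while (sourceString.length() != 0) {
char c = sourceString.charAt(0);
sb.append(metaChars.contains(c) ? "\\" + c : String.valueOf(c));
// Strings are immutable: keep the result, which drops every occurrence of c
sourceString = sourceString.replace(String.valueOf(c), "");
}
sb.append("]");
//...check here for the appropriate sb.length cases before compiling ("[]" is not a valid pattern)
// e.g, 2 = empty, all chars equals the count of whatever set qualifies as all chars, etc
Pattern p = Pattern.compile(sb.toString());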
This gives you the unique chars you want to match, with meta-characters escaped. It will not convert things into ranges (which I think is fine; doing so smells like premature optimization to me). You can do some post tests for simple set cases, like matching sb against digits, non-digits, etc., but unless you know that's going to buy you a lot of performance (or the simplification is the point of this program), I wouldn't bother.
If you really want to do ranges, you could instead sourceString.toCharArray(), sort that, iterate deleting repetitions and doing some sort of range check and replacing meta characters as you add the contents to StringBuilder.
EDIT: I actually kind of liked the toCharArray version, so pseudo-coded it out as well:
//...check for empty here, if not...
char[] sourceC = sourceString.toCharArray();
Arrays.sort(sourceC);
lastC = sourceC[0];
StringBuilder sb = new StringBuilder("[");
StringBuilder range = new StringBuilder();
for (int i=1; i<sourceC.length; i++) {
if (lastC == sourceC[i]) continue;
if (//.. next char in sequence..//) //..add to range
else {
// check range size, append accordingly to sb as a single item, range, etc
}
lastC = sourceC[i];
}
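For what it's worth, the toCharArray pseudo-code above could be filled in roughly as follows. This is only a sketch under a few assumptions: the names CharClassSketch, escape and charClass are made up for illustration, all five special characters are always escaped rather than repositioned, and the empty input is rejected with an exception as suggested in another answer.

```java
import java.util.Arrays;

class CharClassSketch {
    // Escape the five characters identified above as special inside [...].
    static String escape(char c) {
        return "[]\\-^".indexOf(c) >= 0 ? "\\" + c : String.valueOf(c);
    }

    static String charClass(String source) {
        if (source.isEmpty()) {
            throw new IllegalArgumentException("empty character set");
        }
        char[] chars = source.toCharArray();
        Arrays.sort(chars);
        StringBuilder sb = new StringBuilder("[");
        int i = 0;
        while (i < chars.length) {
            char start = chars[i];
            char end = start;
            // Extend the run while the next char is a duplicate or the direct successor.
            while (i + 1 < chars.length && chars[i + 1] - end <= 1) {
                end = chars[++i];
            }
            i++;
            sb.append(escape(start));
            if (end - start >= 2) {
                sb.append('-');          // three or more consecutive chars: write a range
            }
            if (end != start) {
                sb.append(escape(end));  // for a run of two this just adds the second char
            }
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        System.out.println(charClass("daecb"));
        System.out.println(charClass("d1a3e5c55543b2000"));
        System.out.println(charClass("[]^-"));
    }
}
```

Always escaping costs a few extra backslashes compared with moving - and ^ to a safe position, but it removes the need for the flags.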

Related

Capitalize each word in a string, how does this code check the previous space

I recently came across this solution to writing a program that capitalizes each word in a string and I'm trying really hard to understand it, but I can't get over one hurdle.
public class test {
public static void main(String[] args) {
// create a string
String message = "everyone loves java";
// stores each characters to a char array
char[] charArray = message.toCharArray();
boolean foundSpace = true;
for(int i = 0; i < charArray.length; i++) {
// if the array element is a letter
if(Character.isLetter(charArray[i])) {
// check space is present before the letter
if(foundSpace) {
// change the letter into uppercase
charArray[i] = Character.toUpperCase(charArray[i]);
foundSpace = false;
}
}
else {
// if the new character is not character
foundSpace = true;
}
}
// convert the char array to the string
message = String.valueOf(charArray);
System.out.println("Message: " + message);
}
}
I understand everything in this code except for how it checks if the previous character was a space. The comment says it does this with the if(foundSpace) statement, but I don't see any reason why that would work since it's never specified that it's looking for a space. Any pointers in the right direction would be appreciated
Edit: Thanks for everyone’s answers, I think I finally get it now!
Look at the code inside the loop, stripped down a bit:
for(int i = 0; i < charArray.length; i++) {
if(Character.isLetter(charArray[i])) {
// check space is present before the letter
if(foundSpace) {
// Some code.
foundSpace = false;
}
// Here, foundSpace is false.
} else {
foundSpace = true;
}
}
The code doesn't care about what your variables or methods are called: it's checking if the character meets some criterion (Character.isLetter), and afterwards it has recorded whether or not the character met that criterion (in foundSpace).
This means that foundSpace contains whether the previous character met that criterion, and allows you to decide to do something on the current character based on what the previous one was: in your case, uppercase that character if the previous character didn't meet the criterion.
foundSpace is initially set to true, which means that the loop treats a string which starts with a letter as if there was a space before it, and so that will be capitalized.
Really, there is just a mismatch between the criterion used, and the name of the variable:
If the criterion should be "is it whitespace", use Character.isWhitespace as your criterion, and keep the flag name as foundSpace;
If the criterion really is "is it a letter", use Character.isLetter as your criterion, and name the flag foundLetter (and invert its sense, so check if (!foundLetter) etc, because it's easier to deal with "positive" checks (if (foundLetter)) rather than "negative" checks (if (!foundLetter))).
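As an illustration of the second option, here is a small sketch in which the flag is named after the criterion that is actually tested. The class and method names (CapitalizeSketch, capitalizeWords) are made up for this example; the logic is the same as in the question's loop.

```java
class CapitalizeSketch {
    static String capitalizeWords(String message) {
        char[] chars = message.toCharArray();
        // Whether the previous character was a letter; false before the first character.
        boolean foundLetter = false;
        for (int i = 0; i < chars.length; i++) {
            boolean isLetter = Character.isLetter(chars[i]);
            // A letter that does not follow another letter starts a new word.
            if (isLetter && !foundLetter) {
                chars[i] = Character.toUpperCase(chars[i]);
            }
            foundLetter = isLetter;
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println("Message: " + capitalizeWords("everyone loves java"));
    }
}
```

Note that, exactly like the original, this treats any non-letter as a word boundary, so "a1b" becomes "A1B".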
foundSpace is only set to false if the condition:
(Character.isLetter(charArray[i]))
is triggered.
If the condition is not triggered, i.e. if the character is not a letter, then foundSpace will be set to true.
This means that anything that is not a letter will set foundSpace equal to true.
As Silvio pointed out, this includes other non space characters such as 1, $, and ".
Instead of checking whether the character is a letter, it would be better to check whether charArray[i] is whitespace, and to store that for the next iteration once the current character has been handled.
Like so:
foundSpace = Character.isWhitespace(charArray[i]);
More info about the Character class can be found below:
https://www.geeksforgeeks.org/character-equals-method-in-java-with-examples/
The code you've shown could have some undesirable side effects - especially if foundSpace is relied upon later on - because it could potentially lead you to believe that a space exists in a word, when it's not a space at all but rather a character that isn't a letter like "$".
In the context of your entire program, this looks like:
public class test {
public static void main(String[] args) {
// create a string
String message = "everyone loves java";
// stores each characters to a char array
char[] charArray = message.toCharArray();
boolean foundSpace = true;
for(int i = 0; i < charArray.length; i++) {
// if the array element is a letter and the previous character was whitespace
if(Character.isLetter(charArray[i]) && foundSpace) {
// change the letter into uppercase
charArray[i] = Character.toUpperCase(charArray[i]);
}
// record, for the next iteration, whether this character is whitespace
foundSpace = Character.isWhitespace(charArray[i]);
}
// convert the char array to the string
message = String.valueOf(charArray);
System.out.println("Message: " + message);
}
}
Some edits made to the code to fix compilation errors, thanks to those in the comments for your suggestions.

StreamTokenizer mangles integers and loose periods

I've appropriated and modified the below code which does a pretty good job of tokenizing Java code using Java's StreamTokenizer. Its number handling is problematic, though:
It turns all integers into doubles. I can get past that by testing num % 1 == 0, but this feels like a hack.
More critically, a . following whitespace is treated as a number. "Class .method()" is legal Java syntax, but the resulting tokens are [Word "Class"], [Whitespace " "], [Number 0.0], [Word "method"], [Symbol "("], and [Symbol ")"]
I'd be happy turning off StreamTokenizer's number parsing entirely and parsing the numbers myself from word tokens, but commenting st.parseNumbers() seems to have no effect.
public class JavaTokenizer {
private String code;
private List<Token> tokens;
public JavaTokenizer(String c) {
code = c;
tokens = new ArrayList<>();
}
public void tokenize() {
try {
// Create the tokenizer
StringReader sr = new StringReader(code);
StreamTokenizer st = new StreamTokenizer(sr);
// Java-style tokenizing rules
st.parseNumbers();
st.wordChars('_', '_');
st.eolIsSignificant(false);
// Don't want whitespace tokens
//st.ordinaryChars(0, ' ');
// Strip out comments
st.slashSlashComments(true);
st.slashStarComments(true);
// Parse the file
int token;
do {
token = st.nextToken();
switch (token) {
case StreamTokenizer.TT_NUMBER:
// A number was found; the value is in nval
double num = st.nval;
if(num % 1 == 0)
tokens.add(new IntegerToken((int)num));
else
tokens.add(new FPNumberToken(num));
break;
case StreamTokenizer.TT_WORD:
// A word was found; the value is in sval
String word = st.sval;
tokens.add(new WordToken(word));
break;
case '"':
// A double-quoted string was found; sval contains the contents
String dquoteVal = st.sval;
tokens.add(new DoubleQuotedStringToken(dquoteVal));
break;
case '\'':
// A single-quoted string was found; sval contains the contents
String squoteVal = st.sval;
tokens.add(new SingleQuotedStringToken(squoteVal));
break;
case StreamTokenizer.TT_EOL:
// End of line character found
tokens.add(new EOLToken());
break;
case StreamTokenizer.TT_EOF:
// End of file has been reached
tokens.add(new EOFToken());
break;
default:
// A regular character was found; the value is the token itself
char ch = (char) st.ttype;
if(Character.isWhitespace(ch))
tokens.add(new WhitespaceToken(ch));
else
tokens.add(new SymbolToken(ch));
break;
}
} while (token != StreamTokenizer.TT_EOF);
sr.close();
} catch (IOException e) {
}
}
public List<Token> getTokens() {
return tokens;
}
}
parseNumbers() is "on" by default. Use resetSyntax() to turn off number parsing and all other predefined character types, then enable what you need.
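A minimal sketch of that approach, assuming you only need words, symbols and skipped whitespace: after resetSyntax() every character is ordinary, so letters, digits and _ are re-enabled as word characters and digits then arrive inside word tokens. The class name and the "Word:"/"Symbol:" strings are placeholders for the Token classes in the question, and quote handling is left out for brevity.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class NoNumberTokenizerDemo {
    static List<String> tokenize(String code) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(code));
        st.resetSyntax();              // clears number parsing along with everything else
        st.wordChars('a', 'z');
        st.wordChars('A', 'Z');
        st.wordChars('0', '9');        // digits become part of words, e.g. "42"
        st.wordChars('_', '_');
        st.whitespaceChars(0, ' ');    // skip whitespace instead of emitting tokens for it
        st.slashSlashComments(true);
        st.slashStarComments(true);
        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    tokens.add("Word:" + st.sval);
                } else {
                    tokens.add("Symbol:" + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Class .method()"));
        System.out.println(tokenize("x = 42"));
    }
}
```

With this setup the dangling period comes back as an ordinary symbol, and deciding whether a word such as 42 is a number is left to your own code.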
That said, manual number parsing might get tricky with accounting for dots and exponents... With a scanner and regular expressions it should be relatively straightforward to implement your own tokenizer, tailored exactly to your needs. For an example, you may want to take a look at the Tokenizer inner class here: https://github.com/stefanhaustein/expressionparser/blob/master/core/src/main/java/org/kobjects/expressionparser/ExpressionParser.java (about 120 LOC at the end)
I'll look into parboiled when I have a chance. In the meantime, the disgusting workaround I implemented to get it working is:
private static final String DANGLING_PERIOD_TOKEN = "___DANGLING_PERIOD_TOKEN___";
Then in tokenize()
//a period following whitespace, not followed by a digit is a "dangling period"
code = code.replaceAll("(?<=\\s)\\.(?![0-9])", " "+DANGLING_PERIOD_TOKEN+" ");
And in the tokenization loop
case StreamTokenizer.TT_WORD:
// A word was found; the value is in sval
String word = st.sval;
if(word.equals(DANGLING_PERIOD_TOKEN))
tokens.add(new SymbolToken('.'));
else
tokens.add(new WordToken(word));
break;
This solution is specific to my needs, since I don't care what the original whitespace was (it adds some around the inserted "token").

Java efficiently replace unless matches complex regular expression

I have over a gigabyte of text that I need to go through and surround punctuation with spaces (tokenizing). I have a long regular expression (1818 characters, though that's mostly lists) that defines when punctuation should not be separated. Being long and complicated makes it hard to use groups with it, though I wouldn't leave that out as an option since I could make most groups non-capturing (?:).
Question: How can I efficiently replace certain characters that don't match a particular regular expression?
I've looked into using lookaheads or similar, and I haven't quite figured it out, but it seems to be terribly inefficient anyway. It would likely be better than using placeholders though.
I can't seem to find a good "replace with a bunch of different regular expressions for both finding and replacing in one pass" function.
Should I do this line by line instead of operating on the whole text?
String completeRegex = "[^\\w](("+protectedPrefixes+")|(("+protectedNumericOnly+")\\s*\\p{N}))|"+protectedRegex;
Matcher protectedM = Pattern.compile(completeRegex).matcher(s);
ArrayList<String> protectedStrs = new ArrayList<String>();
//Take note of the protected matches.
while (protectedM.find()) {
protectedStrs.add(protectedM.group());
}
//Replace protected matches.
String replaceStr = "<PROTECTED>";
s = protectedM.replaceAll(replaceStr);
//Now that it's safe, separate punctuation.
s = s.replaceAll("([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])"," $1 ");
// These are for apostrophes. Can these be combined with either the protecting regular expression or the one above?
s = s.replaceAll("([\\p{N}\\p{L}])'(\\p{L})", "$1 '$2");
s = s.replaceAll("([^\\p{L}])'([^\\p{L}])", "$1 ' $2");
Note the two additional replacements for apostrophes. Using placeholders protects against those replacements as well, but I'm not really concerned with apostrophes or single quotes in my protecting regex anyway, so it's not a real concern.
I'm rewriting what I considered very inefficient Perl code with my own in Java, keeping track of speed, and things were going fine until I started replacing the placeholders with the original strings. With that addition it's too slow to be reasonable (I've never seen it get even close to finishing).
//Replace placeholders with original text.
String resultStr = "";
String currentStr = "";
int currentPos = 0;
int[] protectedArray = replaceStr.codePoints().toArray();
int protectedLen = protectedArray.length;
int[] strArray = s.codePoints().toArray();
int protectedCount = 0;
for (int i=0; i<strArray.length; i++) {
int pt = strArray[i];
// System.out.println("pt: "+pt+" symbol: "+String.valueOf(Character.toChars(pt)));
if (protectedArray[currentPos]==pt) {
if (currentPos == protectedLen - 1) {
resultStr += protectedStrs.get(protectedCount);
protectedCount++;
currentPos = 0;
} else {
currentPos++;
}
} else {
if (currentPos > 0) {
resultStr += replaceStr.substring(0, currentPos);
currentPos = 0;
currentStr = "";
}
resultStr += ParseUtils.getSymbol(pt);
}
}
s = resultStr;
This code may not be the most efficient way to return the protected matches. What is a better way? Or better yet, how can I replace punctuation without having to use placeholders?
I don't know exactly how big your in-between strings are, but I suspect that you can do somewhat better than using Matcher.replaceAll, speed-wise.
You're doing 3 passes across the string, each time creating a new Matcher instance, and then creating a new String; and because you're using + to concatenate the strings, you're creating a new string which is the concatenation of the in-between string and the protected group, and then another string when you concatenate this to the current result. You don't really need all of these extra instances.
Firstly, you should accumulate the resultStr in a StringBuilder, rather than via direct string concatenation. Then you can proceed something like:
StringBuilder resultStr = new StringBuilder();
int currIndex = 0;
while (protectedM.find()) {
protectedStrs.add(protectedM.group());
// process the unprotected text between the previous match and this one
appendInBetween(resultStr, str, currIndex, protectedM.start());
resultStr.append(protectedM.group());
currIndex = protectedM.end();
}
// the text after the last protected match is unprotected too
appendInBetween(resultStr, str, currIndex, str.length());
where appendInBetween is a method implementing the equivalent to the replacements, just in a single pass:
void appendInBetween(StringBuilder resultStr, String s, int start, int end) {
// Pass the whole input string and the bounds, rather than taking a substring.
// Allocate roughly enough space up-front.
resultStr.ensureCapacity(resultStr.length() + end - start);
for (int i = start; i < end; ++i) {
char c = s.charAt(i);
// Check if c matches "([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])".
if (!(Character.isLetter(c)
|| Character.isDigit(c)
|| Character.getType(c) == Character.NON_SPACING_MARK
|| "_-<>'".indexOf(c) != -1)) { // no backslash here: in the regex it only escapes the '-'
resultStr.append(' ');
resultStr.append(c);
resultStr.append(' ');
} else if (c == '\'' && i > 0 && i + 1 < s.length()) {
// We have a quote that's not at the beginning or end.
// Call these 3 characters bcd, where c is the quote.
char b = s.charAt(i - 1);
char d = s.charAt(i + 1);
if ((Character.isDigit(b) || Character.isLetter(b)) && Character.isLetter(d)) {
// If the 3 chars match "([\\p{N}\\p{L}])'(\\p{L})"
resultStr.append(' ');
resultStr.append(c);
} else if (!Character.isLetter(b) && !Character.isLetter(d)) {
// If the 3 chars match "([^\\p{L}])'([^\\p{L}])"
resultStr.append(' ');
resultStr.append(c);
resultStr.append(' ');
} else {
resultStr.append(c);
}
} else {
// Everything else, just append.
resultStr.append(c);
}
}
}
Obviously, there is a maintenance cost associated with this code - it is undeniably more verbose. But the advantage of doing it explicitly like this (aside from the fact it is just a single pass) is that you can debug the code like any other - rather than it just being the black box that regexes are.
I'd be interested to know if this works any faster for you!
At first I thought that appendReplacement wasn't what I was looking for, but indeed it was. Since it's replacing the placeholders at the end that slowed things down, all I really needed was a way to dynamically replace matches:
StringBuffer replacedBuff = new StringBuffer();
Matcher replaceM = Pattern.compile(replaceStr).matcher(s);
int index = 0;
while (replaceM.find()) {
replaceM.appendReplacement(replacedBuff, "");
replacedBuff.append(protectedStrs.get(index));
index++;
}
replaceM.appendTail(replacedBuff);
s = replacedBuff.toString();
Reference: Second answer at this question.
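For anyone who wants to try this in isolation, here is a self-contained sketch of the same idea. The class and method names (PlaceholderRestore, restore) and the sample strings are made up; the one deliberate difference from the snippet above is that the original text is passed through Matcher.quoteReplacement, so that any $ or \ in it is taken literally by appendReplacement.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderRestore {
    static String restore(String text, String placeholder, List<String> originals) {
        // Pattern.quote treats the placeholder as literal text, whatever it contains.
        Matcher m = Pattern.compile(Pattern.quote(placeholder)).matcher(text);
        StringBuffer out = new StringBuffer(text.length());
        int index = 0;
        while (m.find()) {
            // Copies the text before the match, then the next saved original.
            m.appendReplacement(out, Matcher.quoteReplacement(originals.get(index++)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> saved = Arrays.asList("$5.00", "Dr. Smith");
        System.out.println(restore("Pay <PROTECTED> to <PROTECTED> .", "<PROTECTED>", saved));
    }
}
```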
Another option to consider:
During the first pass through the String, to find the protected Strings, take the start and end indices of each match, replace the punctuation for everything outside of the match, add the matched String, and then keep going. This takes away the need to write a String with placeholders, and requires only one pass through the entire String. It does, however, require many separate small replacement operations. (By the way, be sure to compile the patterns before the loop, as opposed to using String.replaceAll()). A similar alternative is to add the unprotected substrings together, and then replace them all at the same time. However, the protected strings would then have to be added to the replaced string at the end, so I doubt this would save time.
int currIndex = 0;
while (protectedM.find()) {
protectedStrs.add(protectedM.group());
String substr = s.substring(currIndex,protectedM.start());
substr = p1.matcher(substr).replaceAll(" $1 ");
substr = p2.matcher(substr).replaceAll("$1 '$2");
substr = p3.matcher(substr).replaceAll("$1 ' $2");
resultStr += substr+protectedM.group();
currIndex = protectedM.end();
}
Speed comparison for 100,000 lines of text:
Original Perl script: 272.960579875 seconds
My first attempt: Too long to finish.
With appendReplacement(): 14.245160866 seconds
Replacing while finding protected: 68.691842962 seconds
Thank you, Java, for not letting me down.

Regex to capture groups

My group could either be of the form x/y, x.y or x_y.z. Each group is separated by an underscore. The groups are unordered.
Example:
ABC/DEF_abc.def_PQR/STU_ghi_jkl.mno
I would like to capture the following:
ABC/DEF
abc.def
PQR/STU
ghi_jkl.mno
I have done this using a fairly verbose string iteration and parsing method (shown below), but am wondering if a simple regex can accomplish this.
private static ArrayList<String> go(String s){
ArrayList<String> list = new ArrayList<String>();
boolean inSlash = false;
int pos = 0 ;
boolean inDot = false;
for(int i = 0 ; i < s.length(); i++){
char c = s.charAt(i);
switch (c) {
case '/':
inSlash = true;
break;
case '_':
if(inSlash){
list.add(s.substring(pos,i));
inSlash = false;
pos = i+1 ;
}
else if (inDot){
list.add(s.substring(pos,i));
inDot = false;
pos = i+1;
}
break;
case '.':
inDot = true;
break;
default:
break;
}
}
list.add(s.substring(pos));
System.out.println(list);
return list;
}
Have a try with:
((?:[^_./]+/[^_./]+)|(?:[^_./]+\.[^_./]+)|(?:[^_./]+(?:_[^_./]+)+\.[^_./]+))
I don't know java syntax but in Perl:
#!/usr/bin/perl
use 5.10.1;
use strict;
use warnings;
my $str = q!ABC/DEF_abc.def_PQR/STU_ghi_jkl.mno_a_b_c.z_a_b_c_d.z_a_b_c_d_e.z!;
my $re = qr!((?:[^_./]+/[^_./]+)|(?:[^_./]+\.[^_./]+)|(?:[^_./]+(?:_[^_./]+)+\.[^_./]+))!;
while($str=~/$re/g) {
say $1;
}
will produce:
ABC/DEF
abc.def
PQR/STU
ghi_jkl.mno
a_b_c.z
a_b_c_d.z
a_b_c_d_e.z
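Since the question is about Java, the same expression can be used there with the backslashes doubled inside the string literal. A rough sketch, with the class and method names (GroupSplitter, groups) chosen only for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupSplitter {
    // Same three alternatives as the Perl version: x/y, x.y, or x_y(_...).z
    private static final Pattern GROUP = Pattern.compile(
            "((?:[^_./]+/[^_./]+)|(?:[^_./]+\\.[^_./]+)|(?:[^_./]+(?:_[^_./]+)+\\.[^_./]+))");

    static List<String> groups(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = GROUP.matcher(s);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String g : groups("ABC/DEF_abc.def_PQR/STU_ghi_jkl.mno")) {
            System.out.println(g);
        }
    }
}
```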
There might be a problem with the underscore since it's not always a separator.
Maybe: ((?<=_)\w+_)?\w+[./]\w+
This regex would probably do (tested with .Net regular expressions):
[a-zA-Z]+[./][a-zA-Z]+|[a-zA-Z]+_[a-zA-Z]+\.[a-zA-Z]+
(If you know your input is well formed there is no need to explicitly match the separator)
This one goes with positive lookahead instead of alternations
[A-Za-z]+(_(?=[A-Za-z]+\.[A-Za-z]+))?[A-Za-z]+[/.][A-Za-z]+

Java String Special character replacement

I have a string which contains alphanumeric and special characters.
I need to replace each and every special character with some string.
For example:
Input string = "ja*va st&ri%n#&"
Expected output = "jaasteriskvaspacestandripercentagenatand"
* = "asterisk"
& = "and"
% = "percentage"
# = "at"
thanks,
Unless you're absolutely desperate for performance, I'd use a very simple approach:
String result = input.replace("*", "asterisk")
.replace("%", "percentage")
.replace("#", "at"); // Add more to taste :)
(Note that there's a big difference between replace and replaceAll - the latter takes a regular expression. It's easy to get the wrong one and see radically different effects!)
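A quick way to see the difference, using a made-up sample string: replace treats its first argument as literal text, while replaceAll compiles it as a regular expression, so "." means any character and a bare "*" is not even a valid pattern.

```java
import java.util.regex.PatternSyntaxException;

class ReplaceVsReplaceAll {
    public static void main(String[] args) {
        String s = "a.b*c";
        // Literal replacement: only the dot itself is replaced.
        System.out.println(s.replace(".", "dot"));
        // Regex replacement: "." matches every character.
        System.out.println(s.replaceAll(".", "x"));
        // Escaping the dot makes replaceAll behave like replace here.
        System.out.println(s.replaceAll("\\.", "dot"));
        try {
            s.replaceAll("*", "asterisk");
        } catch (PatternSyntaxException e) {
            // A lone "*" is a dangling quantifier, so the pattern does not compile.
            System.out.println("invalid regex: " + e.getDescription());
        }
    }
}
```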
An alternative would be something like:
public static String replaceSpecial(String input)
{
// Output will be at least as long as input
StringBuilder builder = new StringBuilder(input.length());
for (int i = 0; i < input.length(); i++)
{
char c = input.charAt(i);
switch (c)
{
case '*': builder.append("asterisk"); break;
case '%': builder.append("percentage"); break;
case '#': builder.append("at"); break;
default: builder.append(c); break;
}
}
return builder.toString();
}
Take a look at the following java.lang.String methods:
replace()
replaceAll()
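If the list of special characters is likely to grow, a table-driven variant of the loop may be easier to maintain. This is only a sketch: the class name and the map are illustrative, and the entry for the space character is an assumption inferred from the expected output in the question, which contains "space".

```java
import java.util.HashMap;
import java.util.Map;

class SpecialCharReplacer {
    private static final Map<Character, String> NAMES = new HashMap<>();
    static {
        NAMES.put('*', "asterisk");
        NAMES.put('&', "and");
        NAMES.put('%', "percentage");
        NAMES.put('#', "at");
        NAMES.put(' ', "space");   // assumed from the expected output
    }

    static String replaceSpecial(String input) {
        StringBuilder builder = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String name = NAMES.get(c);
            // Characters without an entry are copied unchanged.
            if (name != null) {
                builder.append(name);
            } else {
                builder.append(c);
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceSpecial("ja*va st&ri%n#&"));
    }
}
```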
