Can this regex be further optimized?

Can this regex be further optimized? - java

I wrote this regex to parse entries from srt files.
(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$
I don't know if it matters, but this is done using Scala programming language (Java Engine, but literal strings so that I don't have to double the backslashes).
The s{1,2} is used because some files will only have line breaks \n and others will have line breaks and carriage returns \n\r
The first (?s) enables DOTALL mode so that the third capturing group can also match line breaks.
My program basically breaks a srt file using \n\r?\n as a delimiter and use Scala nice pattern matching feature to read each entry for further processing:
val EntryRegex = """(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$""".r
def apply(string: String): Entry = string match {
case EntryRegex(start, end, text) => Entry(0, timeFormat.parse(start),
timeFormat.parse(end), text);
}
Sample entries:
One line:
1073
01:46:43,024 --> 01:46:45,015
I am your father.
Two Lines:
160
00:20:16,400 --> 00:20:19,312
<i>Help me, Obi-Wan Kenobi.
You're my only hope.</i>
The thing is, the profiler shows me that this parsing method is by far the most time consuming operation in my application (which does intensive time math and can even reencode the file several times faster than what it takes to read and parse the entries).
So any regex wizards can help me optimize it? Or maybe I should sacrifice regex / pattern matching succinctness and try an old school java.util.Scanner approach?
Cheers,

(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$
In Java, $ means the end of input or the beginning of a line-break immediately preceding the end of input. \z means unambiguously end of input, so if that is also the semantics in Scala, then \r?$ is redundant and $ would do just as well. If you really only want a CR at the end and not CRLF then \r?\z might be better.
The (?s) should also make (.+)\r? redundant since the + is greedy, the . should always expand to include the \r. If you do not want the \r included in that third capturing group, then make the match lazy : (.+?) instead of (.+).
Maybe
(?s)^\d++\s\s?(.{12}) --> (.{12})\s\s?(.+?)\r?\z
Other fine high-performance alternatives to regular expressions that will run inside a JVM &| CLR include JavaCC and ANTLR. For a Scala only solution, see http://jim-mcbeath.blogspot.com/2008/09/scala-parser-combinators.html

I'm not optimistic, but here are two things to try:
you could do is move the (?s) to just before you need it.
remove the \r?$ and use a greedy .++ for the text .+
^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(?s)(.++)$
To really get good performance, I would refactor the code and regex to use findAllIn. The current code is doing a regex for every Entry in your file. I imagine the single findAllIn regex would perform better...But maybe not...

Check this out:
(?m)^\d++\r?+\n(.{12}) --> (.{12})\r?+\n(.++(?>\r?+\n.++)*+)$
This regex matches a complete .srt file entry in place. You don't have to split the contents up on line breaks first; that's a huge waste of resources.
The regex takes advantage of the fact that there's exactly one line separator (\n or \r\n) separating the lines within an entry (multiple line separators are used to separate entries from each other). Using \r?+\n instead of \s{1,2} means you can never accidentally match two line separators (\n\n) when you only wanted to match one.
This way, too, you don't have to rely on the . in (?s) mode. #Jacob was right about that: it's not really helping you, and it's killing your performance. But (?m) mode is helpful, for correctness as well as performance.
You mentioned java.util.Scanner; this regex would work very nicely with findWithinHorizon(0). But I'd be surprised if Scala doesn't offer a nice, idiomatic way to use it as well.

I wouldn't use java.util.Scanner or even strings. Everything you're doing will work perfectly on a byte stream as long as you can assume UTF-8 encoding of your files (or a lack of unicode). You should be able to speed things up by at least 5x.
Edit: this is just a lot of low-level fiddling of bytes and indices. Here's something based loosely on things I've done before, which seems about 2x-5x faster, depending on file size, caching, etc.. I'm not doing the date parsing here, just returning strings, and I'm assuming the files are small enough to fit in a single block of memory (i.e. <2G). This is being rather pedantically careful; if you know, for example, that the date string format is always okay, then the parsing can be faster yet (just count the characters after the first line of digits).
import java.io._
abstract class Entry {
def isDefined: Boolean
def date1: String
def date2: String
def text: String
}
case class ValidEntry(date1: String, date2: String, text: String) extends Entry {
def isDefined = true
}
object NoEntry extends Entry {
def isDefined = false
def date1 = ""
def date2 = ""
def text = ""
}
final class Seeker(f: File) {
private val buffer = {
val buf = new Array[Byte](f.length.toInt)
val fis = new FileInputStream(f)
fis.read(buf)
fis.close()
buf
}
private var i = 0
private var d1,d2 = 0
private var txt,n = 0
def isDig(b: Byte) = ('0':Byte) <= b && ('9':Byte) >= b
def nextNL() {
while (i < buffer.length && buffer(i) != '\n') i += 1
i += 1
if (i < buffer.length && buffer(i) == '\r') i += 1
}
def digits() = {
val zero = i
while (i < buffer.length && isDig(buffer(i))) i += 1
if (i==zero || i >= buffer.length || buffer(i) != '\n') {
nextNL()
false
}
else {
nextNL()
true
}
}
def dates(): Boolean = {
if (i+30 >= buffer.length) {
i = buffer.length
false
}
else {
d1 = i
while (i < d1+12 && buffer(i) != '\n') i += 1
if (i < d1+12 || buffer(i)!=' ' || buffer(i+1)!='-' || buffer(i+2)!='-' || buffer(i+3)!='>' || buffer(i+4)!=' ') {
nextNL()
false
}
else {
i += 5
d2 = i
while (i < d2+12 && buffer(i) != '\n') i += 1
if (i < d2+12 || buffer(i) != '\n') {
nextNL()
false
}
else {
nextNL()
true
}
}
}
}
def gatherText() {
txt = i
while (i < buffer.length && buffer(i) != '\n') {
i += 1
nextNL()
}
n = i-txt
nextNL()
}
def getNext: Entry = {
while (i < buffer.length) {
if (digits()) {
if (dates()) {
gatherText()
return ValidEntry(new String(buffer,d1,12), new String(buffer,d2,12), new String(buffer,txt,n))
}
}
}
return NoEntry
}
}
Now that you see that, aren't you glad that the regex solution was so quick to code?

Related

Java efficiently replace unless matches complex regular expression

I have over a gigabyte of text that I need to go through and surround punctuation with spaces (tokenizing). I have a long regular expression (1818 characters, though that's mostly lists) that defines when punctuation should not be separated. Being long and complicated makes it hard to use groups with it, though I wouldn't leave that out as an option since I could make most groups non-capturing (?:).
Question: How can I efficiently replace certain characters that don't match a particular regular expression?
I've looked into using lookaheads or similar, and I haven't quite figured it out, but it seems to be terribly inefficient anyway. It would likely be better than using placeholders though.
I can't seem to find a good "replace with a bunch of different regular expressions for both finding and replacing in one pass" function.
Should I do this line by line instead of operating on the whole text?
String completeRegex = "[^\\w](("+protectedPrefixes+")|(("+protectedNumericOnly+")\\s*\\p{N}))|"+protectedRegex;
Matcher protectedM = Pattern.compile(completeRegex).matcher(s);
ArrayList<String> protectedStrs = new ArrayList<String>();
//Take note of the protected matches.
while (protectedM.find()) {
protectedStrs.add(protectedM.group());
}
//Replace protected matches.
String replaceStr = "<PROTECTED>";
s = protectedM.replaceAll(replaceStr);
//Now that it's safe, separate punctuation.
s = s.replaceAll("([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])"," $1 ");
// These are for apostrophes. Can these be combined with either the protecting regular expression or the one above?
s = s.replaceAll("([\\p{N}\\p{L}])'(\\p{L})", "$1 '$2");
s = s.replaceAll("([^\\p{L}])'([^\\p{L}])", "$1 ' $2");
Note the two additional replacements for apostrophes. Using placeholders protects against those replacements as well, but I'm not really concerned with apostrophes or single quotes in my protecting regex anyway, so it's not a real concern.
I'm rewriting what I considered very inefficient Perl code with my own in Java, keeping track of speed, and things were going fine until I started replacing the placeholders with the original strings. With that addition it's too slow to be reasonable (I've never seen it get even close to finishing).
//Replace placeholders with original text.
String resultStr = "";
String currentStr = "";
int currentPos = 0;
int[] protectedArray = replaceStr.codePoints().toArray();
int protectedLen = protectedArray.length;
int[] strArray = s.codePoints().toArray();
int protectedCount = 0;
for (int i=0; i<strArray.length; i++) {
int pt = strArray[i];
// System.out.println("pt: "+pt+" symbol: "+String.valueOf(Character.toChars(pt)));
if (protectedArray[currentPos]==pt) {
if (currentPos == protectedLen - 1) {
resultStr += protectedStrs.get(protectedCount);
protectedCount++;
currentPos = 0;
} else {
currentPos++;
}
} else {
if (currentPos > 0) {
resultStr += replaceStr.substring(0, currentPos);
currentPos = 0;
currentStr = "";
}
resultStr += ParseUtils.getSymbol(pt);
}
}
s = resultStr;
This code may not be the most efficient way to return the protected matches. What is a better way? Or better yet, how can I replace punctuation without having to use placeholders?

I don't know exactly how big your in-between strings are, but I suspect that you can do somewhat better than using Matcher.replaceAll, speed-wise.
You're doing 3 passes across the string, each time creating a new Matcher instance, and then creating a new String; and because you're using + to concatenate the strings, you're creating a new string which is the concatenation of the in-between string and the protected group, and then another string when you concatenate this to the current result. You don't really need all of these extra instances.
Firstly, you should accumulate the resultStr in a StringBuilder, rather than via direct string concatenation. Then you can proceed something like:
StringBuilder resultStr = new StringBuilder();
int currIndex = 0;
while (protectedM.find()) {
protectedStrs.add(protectedM.group());
appendInBetween(resultStr, str, current, protectedM.str());
resultStr.append(protectedM.group());
currIndex = protectedM.end();
}
resultStr.append(str, currIndex, str.length());
where appendInBetween is a method implementing the equivalent to the replacements, just in a single pass:
void appendInBetween(StringBuilder resultStr, String s, int start, int end) {
// Pass the whole input string and the bounds, rather than taking a substring.
// Allocate roughly enough space up-front.
resultStr.ensureCapacity(resultStr.length() + end - start);
for (int i = start; i < end; ++i) {
char c = s.charAt(i);
// Check if c matches "([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])".
if (!(Character.isLetter(c)
|| Character.isDigit(c)
|| Character.getType(c) == Character.NON_SPACING_MARK
|| "_\\-<>'".indexOf(c) != -1)) {
resultStr.append(' ');
resultStr.append(c);
resultStr.append(' ');
} else if (c == '\'' && i > 0 && i + 1 < s.length()) {
// We have a quote that's not at the beginning or end.
// Call these 3 characters bcd, where c is the quote.
char b = s.charAt(i - 1);
char d = s.charAt(i + 1);
if ((Character.isDigit(b) || Character.isLetter(b)) && Character.isLetter(d)) {
// If the 3 chars match "([\\p{N}\\p{L}])'(\\p{L})"
resultStr.append(' ');
resultStr.append(c);
} else if (!Character.isLetter(b) && !Character.isLetter(d)) {
// If the 3 chars match "([^\\p{L}])'([^\\p{L}])"
resultStr.append(' ');
resultStr.append(c);
resultStr.append(' ');
} else {
resultStr.append(c);
}
} else {
// Everything else, just append.
resultStr.append(c);
}
}
}
Ideone demo
Obviously, there is a maintenance cost associated with this code - it is undeniably more verbose. But the advantage of doing it explicitly like this (aside from the fact it is just a single pass) is that you can debug the code like any other - rather than it just being the black box that regexes are.
I'd be interested to know if this works any faster for you!

At first I thought that appendReplacement wasn't what I was looking for, but indeed it was. Since it's replacing the placeholders at the end that slowed things down, all I really needed was a way to dynamically replace matches:
StringBuffer replacedBuff = new StringBuffer();
Matcher replaceM = Pattern.compile(replaceStr).matcher(s);
int index = 0;
while (replaceM.find()) {
replaceM.appendReplacement(replacedBuff, "");
replacedBuff.append(protectedStrs.get(index));
index++;
}
replaceM.appendTail(replacedBuff);
s = replacedBuff.toString();
Reference: Second answer at this question.
Another option to consider:
During the first pass through the String, to find the protected Strings, take the start and end indices of each match, replace the punctuation for everything outside of the match, add the matched String, and then keep going. This takes away the need to write a String with placeholders, and requires only one pass through the entire String. It does, however, require many separate small replacement operations. (By the way, be sure to compile the patterns before the loop, as opposed to using String.replaceAll()). A similar alternative is to add the unprotected substrings together, and then replace them all at the same time. However, the protected strings would then have to be added to the replaced string at the end, so I doubt this would save time.
int currIndex = 0;
while (protectedM.find()) {
protectedStrs.add(protectedM.group());
String substr = s.substring(currIndex,protectedM.start());
substr = p1.matcher(substr).replaceAll(" $1 ");
substr = p2.matcher(substr).replaceAll("$1 '$2");
substr = p3.matcher(substr).replaceAll("$1 ' $2");
resultStr += substr+protectedM.group();
currIndex = protectedM.end();
}
Speed comparison for 100,000 lines of text:
Original Perl script: 272.960579875 seconds
My first attempt: Too long to finish.
With appendReplacement(): 14.245160866 seconds
Replacing while finding protected: 68.691842962 seconds
Thank you, Java, for not letting me down.

confusion in behavior of capturing groups in java regex

In this answer I recommended using
s.replaceFirst("\\.0*$|(\\.\\d*?)0+$", "$1");
but two people complained that the result contained the string "null", e.g., 23.null. This could be explained by $1 (i.e., group(1)) being null, which could be transformed via String.valueOf to the string "null". However, I always get the empty string. My testcase covers it and
assertEquals("23", removeTrailingZeros("23.00"));
passes. Is the exact behavior undefined?

The documentation of Matcher class from the reference implementation doesn't specify the behavior of appendReplacement method when a capturing group which doesn't capture anything (null) is specified in the replacement string. While the behavior of group method is clear, nothing is mentioned in appendReplacement method.
Below are 3 exhibits of difference in implementation for the case above:
The reference implementation does not append anything (or we can say append an empty string) for the case above.
GNU Classpath and Android's implementation appends null for the case above.
Some code has been omitted for the sake of brevity, and is indicated by ....
1) Sun/Oracle JDK, OpenJDK (Reference implementation)
For the reference implementation (Sun/Oracle JDK and OpenJDK), the code for appendReplacement doesn't seem to have changed from Java 6, and it will not append anything when a capturing group doesn't capture anything:
} else if (nextChar == '$') {
// Skip past $
cursor++;
// The first number is always a group
int refNum = (int)replacement.charAt(cursor) - '0';
if ((refNum < 0)||(refNum > 9))
throw new IllegalArgumentException(
"Illegal group reference");
cursor++;
// Capture the largest legal group string
...
// Append group
if (start(refNum) != -1 && end(refNum) != -1)
result.append(text, start(refNum), end(refNum));
} else {
Reference
jdk6/98e143b44620
jdk8/687fd7c7986d
2) GNU Classpath
GNU Classpath, which is a complete reimplementation of Java Class Library has a different implementation for appendReplacement in the case above. In Classpath, the classes in java.util.regex package in Classpath is just a wrapper for classes in gnu.java.util.regex.
Matcher.appendReplacement calls RE.getReplacement to process replacement for the matched portion:
public Matcher appendReplacement (StringBuffer sb, String replacement)
throws IllegalStateException
{
assertMatchOp();
sb.append(input.subSequence(appendPosition,
match.getStartIndex()).toString());
sb.append(RE.getReplacement(replacement, match,
RE.REG_REPLACE_USE_BACKSLASHESCAPE));
appendPosition = match.getEndIndex();
return this;
}
RE.getReplacement calls REMatch.substituteInto to get the content of the capturing group and appends its result directly:
case '$':
int i1 = i + 1;
while (i1 < replace.length () &&
Character.isDigit (replace.charAt (i1)))
i1++;
sb.append (m.substituteInto (replace.substring (i, i1)));
i = i1 - 1;
break;
REMatch.substituteInto appends the result of REMatch.toString(int) directly without checking whether the capturing group has captured anything:
if ((input.charAt (pos) == '$')
&& (Character.isDigit (input.charAt (pos + 1))))
{
// Omitted code parses the group number into val
...
if (val < start.length)
{
output.append (toString (val));
}
}
And REMatch.toString(int) returns null when the capturing group doesn't capture (irrelevant code has been omitted).
public String toString (int sub)
{
if ((sub >= start.length) || sub < 0)
throw new IndexOutOfBoundsException ("No group " + sub);
if (start[sub] == -1)
return null;
...
}
So in GNU Classpath's case, null will be appended to the string when a capturing group which fails to capture anything is specified in the replacement string.
3) Android Open Source Project - Java Core Libraries
In Android, Matcher.appendReplacement calls private method appendEvaluated, which in turn directly appends the result of group(int) to the replacement string.
public Matcher appendReplacement(StringBuffer buffer, String replacement) {
buffer.append(input.substring(appendPos, start()));
appendEvaluated(buffer, replacement);
appendPos = end();
return this;
}
private void appendEvaluated(StringBuffer buffer, String s) {
boolean escape = false;
boolean dollar = false;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == '\\' && !escape) {
escape = true;
} else if (c == '$' && !escape) {
dollar = true;
} else if (c >= '0' && c <= '9' && dollar) {
buffer.append(group(c - '0'));
dollar = false;
} else {
buffer.append(c);
dollar = false;
escape = false;
}
}
// This seemingly stupid piece of code reproduces a JDK bug.
if (escape) {
throw new ArrayIndexOutOfBoundsException(s.length());
}
}
Since Matcher.group(int) returns null for capturing group which fails to capture, Matcher.appendReplacement appends null when the capturing group is referred to in the replacement string.
It is most likely that the 2 people complaining to you are running their code on Android.

Having had a careful look at the Javadoc, I conclude that:
$1 is equivalent to calling group(1), which is specified to return null when the group didn't get captured.
The handling of nulls in the replacement expression is unspecified.
The wording of the relevant parts of the Javadoc is on the whole surprisingly vague (emphasis mine):
Dollar signs may be treated as references to captured subsequences as described above...

You have two alternatives | or-ed together, but only the second is between ( ) hence if the first alternative is matched, group 1 is null.
In general place the parentheses around all alternatives
In your case you want to replace
"xxx.00000" by "xxx" or else
"xxx.yyy00" by "xxx.yyy"
Better do that in two steps, as that is more readable:
"xxx.y*00" by "xxx.y*" then
"xxx." by "xxx"
This does a bit extra, changing an initial "1." to "1".
So:
.replaceFirst("(\\.\\d*?)0+$", "$1").replaceFirst("\\.$", "");

Tokenizing an algebraic expression in string format

I"m trying to take a string that represents a full algebraic excpression, such as x = 15 * 6 / 3 which is a string, and tokenize it into its individual components. So the first would be x, then =, then 15, then *, 6, / and finally 3.
The problem I am having is actually parsing through the string and looking at the individual characters. I can't think of a way to do this without a massive amount of if statements. Surely there has to be a better way tan specifically defining each individual case and testing for it.

For each type of token, you'll want to figure out how to identify:
when you're starting to read a particular token
if you're continuing to read the same token, or if you've started a different one
Let's take your example: x=15*6/3. Let's assume that you cannot rely on the fact that there are spaces in between each token. In that case, it's trivial: your new token starts when you reach a space.
You can break down the character types into letters, digits, and symbols. Let's call the token types Variable, Operator, and Number.
A letter indicates a Variable token has started. It continues until you read a non-letter.
A symbol indicates the start of an Operator token. I only see single symbols, but you can have groups of symbols correspond to different Operator tokens.
A digit indicates the start of a Number token. (Let's assume integers for now.) The Number token continues until you read a non-digit.
Basically, that's how a simple symbolic parser works. Now, if you add in negative numbers (where the '-' symbol can have multiple meanings), or parentheses, or function names (like sin(x)) then things get more complicated, but it amounts to the same set of rules, now just with more choices.

create regular expression for each possible element: integer, variable, operator, parentheses.
combine them using the | regular expression operator into one big regular expression with capture groups to identify which one matched.
in a loop match the head of the remaining string and break off the matched part as a token. the type of the token depends on which sub-expression matched as described in 2.
or
use a lexer library, such as the one in antlr or javacc

This is from my early expression evaluator that takes an infix expression like yours and turns it into postfix to evaluate. There are methods that help the parser but I think they're pretty self documenting. Mine uses symbol tables to check tokens against. It also allows for user defined symbols and nested assignments and other things you may not need/want. But it shows how I handled your issue without using niceties like regex which would simplify this task tremendously. In addition everything shown is of my own implementation - stack and queue as well - everything. So if anything looks abnormal (unlike Java imps) that's because it is.
This section of code is important not to answer your immediate question but to show the necessary work to determine the type of token you're dealing with. In my case I had three different types of operators and two different types of operands. Based on either the known rules or rules I chose to enforce (when appropriate) it was easy to know when something was a number (starts with a number), variable/user symbol/math function (starts with a letter), or math operator (is: /,*,-,+) . Note that it only takes seeing the first char to know the correct extraction rules. From your example, if all your cases are as simple, you'd only have to handle two types, operator or operand. Nonetheless the same logic will apply.
protected Queue<Token> inToPostParse(String exp) {
// local vars
inputExp = exp;
offset = 0;
strLength = exp.length();
String tempHolder = "";
char c;
// the program runs in a loop so make sure you're dealing
// with an empty queue
q1.reset();
for (int i = offset; tempHolder != null && i < strLength; ++i) {
c = exp.charAt(i);
// Spaces are useless so skip them
if (c == ' ') { continue; }
// If c is a letter
if ((c >= 'A' && c <= 'Z')
|| (c >= 'a' && c <= 'z')) {
// Here we know it must be a user symbol possibly undefined
// at this point or an function like SIN, ABS, etc
// We extract, based on obvious rules, the op
tempHolder = extractPhrase(i); // Used to be append sequence
if (ut.isTrigOp(tempHolder) || ut.isAdditionalOp(tempHolder)) {
s1.push(new Operator(tempHolder, "Function"));
} else {
// If not some math function it is a user defined symbol
q1.insert(new Token(tempHolder, "User"));
}
i += tempHolder.length() - 1;
tempHolder = "";
// if c begins with a number
} else if (c >= '0' && c <= '9') {
try {
// Here we know that it must be a number
// so we extract until we reach a non number
tempHolder = extractNumber(i);
q1.insert(new Token(tempHolder, "Number"));
i += tempHolder.length() - 1;
tempHolder = "";
}
catch (NumberFormatException nfe) {
return null;
}
// if c is in the math symbol table
} else if (ut.isMathOp(String.valueOf(c))) {
String C = String.valueOf(c);
try {
// This is where the magic happens
// Here we determine the "intersection" of the
// current C and the top of the stack
// Based on the intersection we take action
// i.e., in math do you want to * or + first?
// Depending on the state you may have to move
// some tokens to the queue before pushing onto the stack
takeParseAction(C, ut.findIntersection
(C, s1.showTop().getSymbol()));
}
catch (NullPointerException npe) {
s1(C);
}
// it must be an invalid expression
} else {
return null;
}
}
u2();
s1.reset();
return q1;
}
Basically I have a stack (s1) and a queue (q1). All variables or numbers go into the queue. Any operators trig, math, parens, etc.. go on the stack. If the current token is to be put on the stack you have to check the state (top) to determine what parsing action to take (i.e., what to do based on math precedence). Sorry if this seems like useless information. I imagine if you're parsing a math expression it's because at some point you plan to evaluate it. IMHO, postfix is the easiest so I, regardless of input format, change it to post and evaluate with one method. If your O is different - do what you like.
Edit: Implementations
The extract phrase and number methods, which you may be most interested in, are as follows:
protected String extractPhrase(int it) {
String phrase = new String();
char c;
for ( ; it < inputExp.length(); ++it) {
c = inputExp.charAt(it);
if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
|| (c >= '0' && c <= '9')) {
phrase += String.valueOf(c);
} else {
break;
}
}
return phrase;
}
protected String extractNumber(int it) throws NumberFormatException {
String number = new String();
int decimals = 0;
char c;
for ( ; it < strLength; ++it) {
c = inputExp.charAt(it);
if (c >= '0' && c <= '9') {
number += String.valueOf(c);
} else if (c == '.') {
++decimals;
if (decimals < 2) {
number += ".";
} else {
throw new NumberFormatException();
}
} else {
break;
}
}
return number;
}
Remember - By the time they enter these methods I've already been able to deduce what type it is. This allows you to avoid the seemingly endless while-if-else chain.

Are components always separated by space character like in your question? if so, use algebricExpression.split(" ") to get a String[] of components.
If no such restrictions can be assumed, a possible solution can be to iterate over the input, and switch the Character.getType() of the current index, somthing like that:
ArrayList<String> getExpressionComponents(String exp) {
ArrayList<String> components = new ArrayList<String>();
String current = "";
int currentSequenceType = Character.UNASSIGNED;
for (int i = 0 ; i < exp.length() ; i++) {
if (currentSequenceType != Character.getType(exp.charAt(i))) {
if (current.length() > 0) components.add(current);
current = "";
currentSequenceType = Character.getType(exp.charAt(i));
}
switch (Character.getType(exp.charAt(i))) {
case Character.DECIMAL_DIGIT_NUMBER:
case Character.MATH_SYMBOL:
case Character.START_PUNCTUATION:
case Character.END_PUNCTUATION:
case Character.LOWERCASE_LETTER:
case Character.UPPERCASE_LETTER:
// add other required types
current = current.concat(new String(new char[] {exp.charAt(i)}));
currentSequenceType = Character.getType(exp.charAt(i));
break;
default:
current = "";
currentSequenceType = Character.UNASSIGNED;
break;
}
}
return components;
}
You can easily change the cases to meet with other requirements, such as split non-digit chars to separate components etc.

How to get good perfomance of Regex in java

Below is example of text:
String id = "A:abc,X:def,F:xyz,A:jkl";
Below is regex:
Pattern p = Pattern.compile("(.*,)?[AC]:[^:]+$");
if(p.matcher(id).matches()) {
System.out.println("Hello world!")
}
When executed above code should print Hello world!.
Does this regex can be modified to gain more performance?

As I can't see your entire code, I can only assume that you do the pattern compilation inside your loop/method/etc. One thing that can improve performance is to compile at the class level and not recompile the pattern each time. Other than that, I don't see much else that you could change.

Pattern p = Pattern.compile(".*[AC]:[^:]+$");
if(p.matcher(id).matches()) {
System.out.println("Hello world!")
}
As you seem to only be interested if it the string ends in A or C followed by a colon and some characters which aren't colons you can just use .* instead of (.*,)? (or do you really want to capture the stuff before the last piece?)
If the stuff after the colon is all lower case you could even do
Pattern p = Pattern.compile(".*[AC]:[a-z]+$");
And if you are going to match this multiple times in a row (e.g. loop) be sure to compile the pattern outside of the loop.
e,g
Pattern p = Pattern.compile(".*[AC]:[a-z]+$");
Matcher m = p.matcher(id);
while(....) {
...
// m.matches()
...
// prepare for next loop m.reset(newvaluetocheck);
}

Move Pattern instantiation to a final static field (erm, constant), in your current code you're recompiling essentially the same Pattern every single time (no, Pattern doesn't cache anything!). That should give you some noticeable performance boost right off the bat.

Do you even need to use regualr expressions? It seems there isn't a huge variety in what you are testing.
If you need to use the regex as others have said, compiling it only once makes sense and if you only need to check the last token maybe you could simplify the regex to: [AC]:[^:]{3}$.
Could you possibly use something along these lines (untested...)?
private boolean isId(String id)
{
char[] chars = id.toCharArray();
boolean valid = false;
int length = chars.length;
if (length >= 5 && chars[length - 4] == ':')
{
char fifthToLast = chars[length - 5];
if (fifthToLast == 'A' || fifthToLast == 'C')
{
valid = true;
for (int i = length - 1; i >= length - 4; i--)
{
if (chars[i] == ':')
{
valid = false;
break;
}
}
}
}
return valid;
}

Best way to encode text data for XML in Java?

Very similar to this question, except for Java.
What is the recommended way of encoding strings for an XML output in Java. The strings might contain characters like "&", "<", etc.

As others have mentioned, using an XML library is the easiest way. If you do want to escape yourself, you could look into StringEscapeUtils from the Apache Commons Lang library.

Very simply: use an XML library. That way it will actually be right instead of requiring detailed knowledge of bits of the XML spec.

Just use.
<![CDATA[ your text here ]]>
This will allow any characters except the ending
]]>
So you can include characters that would be illegal such as & and >. For example.
<element><![CDATA[ characters such as & and > are allowed ]]></element>
However, attributes will need to be escaped as CDATA blocks can not be used for them.

This question is eight years old and still not a fully correct answer! No, you should not have to import an entire third party API to do this simple task. Bad advice.
The following method will:
correctly handle characters outside the basic multilingual plane
escape characters required in XML
escape any non-ASCII characters, which is optional but common
replace illegal characters in XML 1.0 with the Unicode substitution character. There is no best option here - removing them is just as valid.
I've tried to optimise for the most common case, while still ensuring you could pipe /dev/random through this and get a valid string in XML.
public static String encodeXML(CharSequence s) {
StringBuilder sb = new StringBuilder();
int len = s.length();
for (int i=0;i<len;i++) {
int c = s.charAt(i);
if (c >= 0xd800 && c <= 0xdbff && i + 1 < len) {
c = ((c-0xd7c0)<<10) | (s.charAt(++i)&0x3ff); // UTF16 decode
}
if (c < 0x80) { // ASCII range: test most common case first
if (c < 0x20 && (c != '\t' && c != '\r' && c != '\n')) {
// Illegal XML character, even encoded. Skip or substitute
sb.append("�"); // Unicode replacement character
} else {
switch(c) {
case '&': sb.append("&"); break;
case '>': sb.append(">"); break;
case '<': sb.append("<"); break;
// Uncomment next two if encoding for an XML attribute
// case '\'' sb.append("&apos;"); break;
// case '\"' sb.append("""); break;
// Uncomment next three if you prefer, but not required
// case '\n' sb.append("
"); break;
// case '\r' sb.append("
"); break;
// case '\t' sb.append(" "); break;
default: sb.append((char)c);
}
}
} else if ((c >= 0xd800 && c <= 0xdfff) || c == 0xfffe || c == 0xffff) {
// Illegal XML character, even encoded. Skip or substitute
sb.append("�"); // Unicode replacement character
} else {
sb.append("&#x");
sb.append(Integer.toHexString(c));
sb.append(';');
}
}
return sb.toString();
}
Edit: for those who continue to insist it foolish to write your own code for this when there are perfectly good Java APIs to deal with XML, you might like to know that the StAX API included with Oracle Java 8 (I haven't tested others) fails to encode CDATA content correctly: it doesn't escape ]]> sequences in the content. A third party library, even one that's part of the Java core, is not always the best option.

This has worked well for me to provide an escaped version of a text string:
public class XMLHelper {
/**
* Returns the string where all non-ascii and <, &, > are encoded as numeric entities. I.e. "<A & B >"
* .... (insert result here). The result is safe to include anywhere in a text field in an XML-string. If there was
* no characters to protect, the original string is returned.
*
* #param originalUnprotectedString
* original string which may contain characters either reserved in XML or with different representation
* in different encodings (like 8859-1 and UFT-8)
* #return
*/
public static String protectSpecialCharacters(String originalUnprotectedString) {
if (originalUnprotectedString == null) {
return null;
}
boolean anyCharactersProtected = false;
StringBuffer stringBuffer = new StringBuffer();
for (int i = 0; i < originalUnprotectedString.length(); i++) {
char ch = originalUnprotectedString.charAt(i);
boolean controlCharacter = ch < 32;
boolean unicodeButNotAscii = ch > 126;
boolean characterWithSpecialMeaningInXML = ch == '<' || ch == '&' || ch == '>';
if (characterWithSpecialMeaningInXML || unicodeButNotAscii || controlCharacter) {
stringBuffer.append("&#" + (int) ch + ";");
anyCharactersProtected = true;
} else {
stringBuffer.append(ch);
}
}
if (anyCharactersProtected == false) {
return originalUnprotectedString;
}
return stringBuffer.toString();
}
}

Try this:
String xmlEscapeText(String t) {
StringBuilder sb = new StringBuilder();
for(int i = 0; i < t.length(); i++){
char c = t.charAt(i);
switch(c){
case '<': sb.append("<"); break;
case '>': sb.append(">"); break;
case '\"': sb.append("""); break;
case '&': sb.append("&"); break;
case '\'': sb.append("&apos;"); break;
default:
if(c>0x7e) {
sb.append("&#"+((int)c)+";");
}else
sb.append(c);
}
}
return sb.toString();
}

StringEscapeUtils.escapeXml() does not escape control characters (< 0x20). XML 1.1 allows control characters; XML 1.0 does not. For example, XStream.toXML() will happily serialize a Java object's control characters into XML, which an XML 1.0 parser will reject.
To escape control characters with Apache commons-lang, use
NumericEntityEscaper.below(0x20).translate(StringEscapeUtils.escapeXml(str))

public String escapeXml(String s) {
return s.replaceAll("&", "&").replaceAll(">", ">").replaceAll("<", "<").replaceAll("\"", """).replaceAll("'", "&apos;");
}

For those looking for the quickest-to-write solution: use methods from apache commons-lang:
StringEscapeUtils.escapeXml10() for xml 1.0
StringEscapeUtils.escapeXml11() for xml 1.1
StringEscapeUtils.escapeXml() is now deprecated, but was used commonly in the past
Remember to include dependency:
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.5</version> <!--check current version! -->
</dependency>

While idealism says use an XML library, IMHO if you have a basic idea of XML then common sense and performance says template it all the way. It's arguably more readable too. Though using the escaping routines of a library is probably a good idea.
Consider this: XML was meant to be written by humans.
Use libraries for generating XML when having your XML as an "object" better models your problem. For example, if pluggable modules participate in the process of building this XML.
Edit: as for how to actually escape XML in templates, use of CDATA or escapeXml(string) from JSTL are two good solutions, escapeXml(string) can be used like this:
<%#taglib prefix="fn" uri="http://java.sun.com/jsp/jstl/functions"%>
<item>${fn:escapeXml(value)}</item>

The behavior of StringEscapeUtils.escapeXml() has changed from Commons Lang 2.5 to 3.0.
It now no longer escapes Unicode characters greater than 0x7f.
This is a good thing, the old method was to be a bit to eager to escape entities that could just be inserted into a utf8 document.
The new escapers to be included in Google Guava 11.0 also seem promising:
http://code.google.com/p/guava-libraries/issues/detail?id=799

While I agree with Jon Skeet in principle, sometimes I don't have the option to use an external XML library. And I find it peculiar the two functions to escape/unescape a simple value (attribute or tag, not full document) are not available in the standard XML libraries included with Java.
As a result and based on the different answers I have seen posted here and elsewhere, here is the solution I've ended up creating (nothing worked as a simple copy/paste):
public final static String ESCAPE_CHARS = "<>&\"\'";
public final static List<String> ESCAPE_STRINGS = Collections.unmodifiableList(Arrays.asList(new String[] {
"<"
, ">"
, "&"
, """
, "&apos;"
}));
private static String UNICODE_NULL = "" + ((char)0x00); //null
private static String UNICODE_LOW = "" + ((char)0x20); //space
private static String UNICODE_HIGH = "" + ((char)0x7f);
//should only be used for the content of an attribute or tag
public static String toEscaped(String content) {
String result = content;
if ((content != null) && (content.length() > 0)) {
boolean modified = false;
StringBuilder stringBuilder = new StringBuilder(content.length());
for (int i = 0, count = content.length(); i < count; ++i) {
String character = content.substring(i, i + 1);
int pos = ESCAPE_CHARS.indexOf(character);
if (pos > -1) {
stringBuilder.append(ESCAPE_STRINGS.get(pos));
modified = true;
}
else {
if ( (character.compareTo(UNICODE_LOW) > -1)
&& (character.compareTo(UNICODE_HIGH) < 1)
) {
stringBuilder.append(character);
}
else {
//Per URL reference below, Unicode null character is always restricted from XML
//URL: https://en.wikipedia.org/wiki/Valid_characters_in_XML
if (character.compareTo(UNICODE_NULL) != 0) {
stringBuilder.append("&#" + ((int)character.charAt(0)) + ";");
}
modified = true;
}
}
}
if (modified) {
result = stringBuilder.toString();
}
}
return result;
}
The above accommodates several different things:
avoids using char based logic until it absolutely has to - improves unicode compatibility
attempts to be as efficient as possible given the probability is the second "if" condition is likely the most used pathway
is a pure function; i.e. is thread-safe
optimizes nicely with the garbage collector by only returning the contents of the StringBuilder if something actually changed - otherwise, the original string is returned
At some point, I will write the inversion of this function, toUnescaped(). I just don't have time to do that today. When I do, I will come update this answer with the code. :)

Note: Your question is about escaping, not encoding. Escaping is using <, etc. to allow the parser to distinguish between "this is an XML command" and "this is some text". Encoding is the stuff you specify in the XML header (UTF-8, ISO-8859-1, etc).
First of all, like everyone else said, use an XML library. XML looks simple but the encoding+escaping stuff is dark voodoo (which you'll notice as soon as you encounter umlauts and Japanese and other weird stuff like "full width digits" (&#FF11; is 1)). Keeping XML human readable is a Sisyphus' task.
I suggest never to try to be clever about text encoding and escaping in XML. But don't let that stop you from trying; just remember when it bites you (and it will).
That said, if you use only UTF-8, to make things more readable you can consider this strategy:
If the text does contain '<', '>' or '&', wrap it in <![CDATA[ ... ]]>
If the text doesn't contain these three characters, don't warp it.
I'm using this in an SQL editor and it allows the developers to cut&paste SQL from a third party SQL tool into the XML without worrying about escaping. This works because the SQL can't contain umlauts in our case, so I'm safe.

If you are looking for a library to get the job done, try:
Guava 26.0 documented here
return XmlEscapers.xmlContentEscaper().escape(text);
Note: There is also an xmlAttributeEscaper()
Apache Commons Text 1.4 documented here
StringEscapeUtils.escapeXml11(text)
Note: There is also an escapeXml10() method

To escape XML characters, the easiest way is to use the Apache Commons Lang project, JAR downloadable from: http://commons.apache.org/lang/
The class is this: org.apache.commons.lang3.StringEscapeUtils;
It has a method named "escapeXml", that will return an appropriately escaped String.

You could use the Enterprise Security API (ESAPI) library, which provides methods like encodeForXML and encodeForXMLAttribute. Take a look at the documentation of the Encoder interface; it also contains examples of how to create an instance of DefaultEncoder.

Use JAXP and forget about text handling it will be done for you automatically.

Here's an easy solution and it's great for encoding accented characters too!
String in = "Hi Lârry & Môe!";
StringBuilder out = new StringBuilder();
for(int i = 0; i < in.length(); i++) {
char c = in.charAt(i);
if(c < 31 || c > 126 || "<>\"'\\&".indexOf(c) >= 0) {
out.append("&#" + (int) c + ";");
} else {
out.append(c);
}
}
System.out.printf("%s%n", out);
Outputs
Hi Lârry & Môe!

Try to encode the XML using Apache XML serializer
//Serialize DOM
OutputFormat format = new OutputFormat (doc);
// as a String
StringWriter stringOut = new StringWriter ();
XMLSerializer serial = new XMLSerializer (stringOut,
format);
serial.serialize(doc);
// Display the XML
System.out.println(stringOut.toString());

Just replace
& with &
And for other characters:
> with >
< with <
\" with "
' with &apos;

Here's what I found after searching everywhere looking for a solution:
Get the Jsoup library:
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.12.1</version>
</dependency>
Then:
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Entities
import org.jsoup.parser.Parser
String xml = '''<?xml version = "1.0"?>
<SOAP-ENV:Envelope
xmlns:SOAP-ENV = "http://www.w3.org/2001/12/soap-envelope"
SOAP-ENV:encodingStyle = "http://www.w3.org/2001/12/soap-encoding">
<SOAP-ENV:Body xmlns:m = "http://www.example.org/quotations">
<m:GetQuotation>
<m:QuotationsName> MiscroSoft#G>>gle.com </m:QuotationsName>
</m:GetQuotation>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>'''
Document doc = Jsoup.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), "UTF-8", "", Parser.xmlParser())
doc.outputSettings().charset("UTF-8")
doc.outputSettings().escapeMode(Entities.EscapeMode.base)
println doc.toString()
Hope this helps someone

I have created my wrapper here, hope it will helps a lot, Click here You can modify depends on your requirements

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Can this regex be further optimized? - java

Related

Java efficiently replace unless matches complex regular expression

confusion in behavior of capturing groups in java regex

Tokenizing an algebraic expression in string format

How to get good perfomance of Regex in java

Best way to encode text data for XML in Java?

Categories

Resources