How to optimize do-while loops? - java

I want to execute a loop as long as a certain condition holds. At the end, I want to return the last value that was read inside the loop.
Non-realworld example:
teststring = " abcde";
String letter = null;
do {
letter = reader.read(); //reads the teststring char by char
} while (letter.equals(" "));
return letter; //return "a"
Could this be simplified from a coding point of view, e.g. by transforming it from a do-while loop into a plain while loop?

In any Java version you can fold the assignment into the loop condition:
while((letter=reader.read()).equals(" ")){
}
return letter;

If you are reading from a Reader, read() returns an int: either the character that was read, or -1 at the end of input.
int ch;
while((ch = reader.read()) == ' ');
return ch;
Note: " " is a String and ' ' is a char.
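A self-contained sketch of the Reader approach above, assuming the input comes from a StringReader. The class and method names (SkipSpaces, firstNonSpace) are only illustrative. It returns the raw int so the caller can still tell end of input (-1) apart from a real character:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class SkipSpaces {
    /** Returns the first character that is not a space, or -1 if the input ends first. */
    public static int firstNonSpace(Reader reader) throws IOException {
        int ch;
        // The assignment happens inside the condition, so no do-while is needed.
        while ((ch = reader.read()) == ' ') {
            // nothing to do: keep skipping spaces
        }
        return ch;
    }

    public static void main(String[] args) throws IOException {
        int ch = firstNonSpace(new StringReader(" abcde"));
        System.out.println(ch == -1 ? "no character found" : String.valueOf((char) ch));
    }
}
```

Casting to char only after the -1 check avoids returning a bogus character for whitespace-only input.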

Not sure which is more efficient, but you could do something like:
return teststring.trim().charAt(0);

do {
...
} while (<condition>);
I am going to explain your question on do/while vs while alone.
The do is only a label. It has no impact on efficiency. The while at the bottom is effectively an if(condition) goto line #, where line # is the do. The "do" is simply a way of telling the compiler what number you want in that goto statement at the bottom.
Putting the while statement at the top would actually be less efficient because it means the condition has to be evaluated on the first iteration. Perhaps your reader does need to be checked on the first iteration, then it should be a while statement, but that requires more work, you see?
Second, even transforming it into a plain while statement still places an unconditional goto at the bottom and a conditional goto at the top, so even though it looks like less code, it could compile to more.

I think it would be easier to just use String.toCharArray() and a For Each loop like
String teststring = " abcde";
for (char ch : teststring.toCharArray()) {
if (ch != ' ') return ch; // <-- 'a'
}
throw new ParseException("Whitespace only", 0); // assuming java.text.ParseException, which also needs an error offset
But, you could use a StringReader and you're using char (a primitive), so I think you've asked for
String teststring = " abcde";
StringReader reader = new StringReader(teststring);
try {
int letter;
do {
letter = reader.read();
} while (letter == ' ');
return ((char) letter);
} catch (IOException e) {
e.printStackTrace();
}
throw new ParseException("Whitespace only", 0); // assuming java.text.ParseException, which also needs an error offset
or return a default value if the character isn't found.

Java efficiently replace unless matches complex regular expression

I have over a gigabyte of text that I need to go through and surround punctuation with spaces (tokenizing). I have a long regular expression (1818 characters, though that's mostly lists) that defines when punctuation should not be separated. Being long and complicated makes it hard to use groups with it, though I wouldn't leave that out as an option since I could make most groups non-capturing (?:).
Question: How can I efficiently replace certain characters that don't match a particular regular expression?
I've looked into using lookaheads or similar, and I haven't quite figured it out, but it seems to be terribly inefficient anyway. It would likely be better than using placeholders though.
I can't seem to find a good "replace with a bunch of different regular expressions for both finding and replacing in one pass" function.
Should I do this line by line instead of operating on the whole text?
String completeRegex = "[^\\w](("+protectedPrefixes+")|(("+protectedNumericOnly+")\\s*\\p{N}))|"+protectedRegex;
Matcher protectedM = Pattern.compile(completeRegex).matcher(s);
ArrayList<String> protectedStrs = new ArrayList<String>();
//Take note of the protected matches.
while (protectedM.find()) {
protectedStrs.add(protectedM.group());
}
//Replace protected matches.
String replaceStr = "<PROTECTED>";
s = protectedM.replaceAll(replaceStr);
//Now that it's safe, separate punctuation.
s = s.replaceAll("([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])"," $1 ");
// These are for apostrophes. Can these be combined with either the protecting regular expression or the one above?
s = s.replaceAll("([\\p{N}\\p{L}])'(\\p{L})", "$1 '$2");
s = s.replaceAll("([^\\p{L}])'([^\\p{L}])", "$1 ' $2");
Note the two additional replacements for apostrophes. Using placeholders protects against those replacements as well, but I'm not really concerned with apostrophes or single quotes in my protecting regex anyway, so it's not a real concern.
I'm rewriting what I considered very inefficient Perl code with my own in Java, keeping track of speed, and things were going fine until I started replacing the placeholders with the original strings. With that addition it's too slow to be reasonable (I've never seen it get even close to finishing).
//Replace placeholders with original text.
String resultStr = "";
String currentStr = "";
int currentPos = 0;
int[] protectedArray = replaceStr.codePoints().toArray();
int protectedLen = protectedArray.length;
int[] strArray = s.codePoints().toArray();
int protectedCount = 0;
for (int i=0; i<strArray.length; i++) {
int pt = strArray[i];
// System.out.println("pt: "+pt+" symbol: "+String.valueOf(Character.toChars(pt)));
if (protectedArray[currentPos]==pt) {
if (currentPos == protectedLen - 1) {
resultStr += protectedStrs.get(protectedCount);
protectedCount++;
currentPos = 0;
} else {
currentPos++;
}
} else {
if (currentPos > 0) {
resultStr += replaceStr.substring(0, currentPos);
currentPos = 0;
currentStr = "";
}
resultStr += ParseUtils.getSymbol(pt);
}
}
s = resultStr;
This code may not be the most efficient way to return the protected matches. What is a better way? Or better yet, how can I replace punctuation without having to use placeholders?
I don't know exactly how big your in-between strings are, but I suspect that you can do somewhat better than using Matcher.replaceAll, speed-wise.
You're doing 3 passes across the string, each time creating a new Matcher instance, and then creating a new String; and because you're using + to concatenate the strings, you're creating a new string which is the concatenation of the in-between string and the protected group, and then another string when you concatenate this to the current result. You don't really need all of these extra instances.
Firstly, you should accumulate the resultStr in a StringBuilder, rather than via direct string concatenation. Then you can proceed something like:
StringBuilder resultStr = new StringBuilder();
int currIndex = 0;
while (protectedM.find()) {
protectedStrs.add(protectedM.group());
appendInBetween(resultStr, str, currIndex, protectedM.start());
resultStr.append(protectedM.group());
currIndex = protectedM.end();
}
resultStr.append(str, currIndex, str.length());
where appendInBetween is a method implementing the equivalent to the replacements, just in a single pass:
void appendInBetween(StringBuilder resultStr, String s, int start, int end) {
// Pass the whole input string and the bounds, rather than taking a substring.
// Allocate roughly enough space up-front.
resultStr.ensureCapacity(resultStr.length() + end - start);
for (int i = start; i < end; ++i) {
char c = s.charAt(i);
// Check if c matches "([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])".
if (!(Character.isLetter(c)
|| Character.isDigit(c)
|| Character.getType(c) == Character.NON_SPACING_MARK
|| "_-<>'".indexOf(c) != -1)) {
resultStr.append(' ');
resultStr.append(c);
resultStr.append(' ');
} else if (c == '\'' && i > 0 && i + 1 < s.length()) {
// We have a quote that's not at the beginning or end.
// Call these 3 characters bcd, where c is the quote.
char b = s.charAt(i - 1);
char d = s.charAt(i + 1);
if ((Character.isDigit(b) || Character.isLetter(b)) && Character.isLetter(d)) {
// If the 3 chars match "([\\p{N}\\p{L}])'(\\p{L})"
resultStr.append(' ');
resultStr.append(c);
} else if (!Character.isLetter(b) && !Character.isLetter(d)) {
// If the 3 chars match "([^\\p{L}])'([^\\p{L}])"
resultStr.append(' ');
resultStr.append(c);
resultStr.append(' ');
} else {
resultStr.append(c);
}
} else {
// Everything else, just append.
resultStr.append(c);
}
}
}
Obviously, there is a maintenance cost associated with this code - it is undeniably more verbose. But the advantage of doing it explicitly like this (aside from the fact it is just a single pass) is that you can debug the code like any other - rather than it just being the black box that regexes are.
I'd be interested to know if this works any faster for you!
At first I thought that appendReplacement wasn't what I was looking for, but indeed it was. Since it's replacing the placeholders at the end that slowed things down, all I really needed was a way to dynamically replace matches:
StringBuffer replacedBuff = new StringBuffer();
Matcher replaceM = Pattern.compile(replaceStr).matcher(s);
int index = 0;
while (replaceM.find()) {
replaceM.appendReplacement(replacedBuff, "");
replacedBuff.append(protectedStrs.get(index));
index++;
}
replaceM.appendTail(replacedBuff);
s = replacedBuff.toString();
Reference: Second answer at this question.
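To make the placeholder round trip above concrete, here is a minimal self-contained sketch. It is not the original code: the PROTECTED pattern is a hypothetical stand-in for the long protecting regex (it only keeps decimal numbers intact), the class and method names are made up, and I added \s to the excluded character class so that plain spaces are not padded. It also assumes the input never contains the literal text <PROTECTED>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProtectedTokenizer {
    private static final String PLACEHOLDER = "<PROTECTED>";
    // Hypothetical stand-in for the real protecting regex: decimal numbers only.
    private static final Pattern PROTECTED = Pattern.compile("\\p{N}+\\.\\p{N}+");
    // Punctuation to separate; whitespace is excluded here (an assumption, see above).
    private static final Pattern PUNCT =
            Pattern.compile("([^\\p{L}\\p{N}\\p{Mn}_\\-<>'\\s])");
    private static final Pattern HOLDER = Pattern.compile(Pattern.quote(PLACEHOLDER));

    public static String tokenize(String s) {
        // Pass 1: remember each protected match and swap in the placeholder.
        List<String> saved = new ArrayList<>();
        Matcher pm = PROTECTED.matcher(s);
        StringBuffer masked = new StringBuffer();
        while (pm.find()) {
            saved.add(pm.group());
            pm.appendReplacement(masked, PLACEHOLDER);
        }
        pm.appendTail(masked);

        // Pass 2: surround the remaining punctuation with spaces.
        String spaced = PUNCT.matcher(masked).replaceAll(" $1 ");

        // Pass 3: put the saved strings back, in order, with appendReplacement.
        Matcher hm = HOLDER.matcher(spaced);
        StringBuffer out = new StringBuffer();
        int index = 0;
        while (hm.find()) {
            // quoteReplacement keeps any '$' or '\' in the saved text literal.
            hm.appendReplacement(out, Matcher.quoteReplacement(saved.get(index++)));
        }
        hm.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Pi is 3.14, ok."));
    }
}
```

All three patterns are compiled once as constants, so nothing is recompiled per line when this is called in a loop over a large file.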
Another option to consider:
During the first pass through the String, to find the protected Strings, take the start and end indices of each match, replace the punctuation for everything outside of the match, add the matched String, and then keep going. This takes away the need to write a String with placeholders, and requires only one pass through the entire String. It does, however, require many separate small replacement operations. (By the way, be sure to compile the patterns before the loop, as opposed to using String.replaceAll()). A similar alternative is to add the unprotected substrings together, and then replace them all at the same time. However, the protected strings would then have to be added to the replaced string at the end, so I doubt this would save time.
int currIndex = 0;
while (protectedM.find()) {
protectedStrs.add(protectedM.group());
String substr = s.substring(currIndex,protectedM.start());
substr = p1.matcher(substr).replaceAll(" $1 ");
substr = p2.matcher(substr).replaceAll("$1 '$2");
substr = p3.matcher(substr).replaceAll("$1 ' $2");
resultStr += substr+protectedM.group();
currIndex = protectedM.end();
}
Speed comparison for 100,000 lines of text:
Original Perl script: 272.960579875 seconds
My first attempt: Too long to finish.
With appendReplacement(): 14.245160866 seconds
Replacing while finding protected: 68.691842962 seconds
Thank you, Java, for not letting me down.

delimiters check using stack

The code checks whether the delimiters in a string are balanced. I've been using a stack to solve this. I traverse the string to the end; whenever an opening delimiter is encountered I push it onto the stack, and for each closing delimiter I check whether the stack is empty (and report an error if it is), then pop the stack and compare the popped character with the closing delimiter. I ignore all other characters in the string.
At the end of the traversal I check whether the stack is empty (that is, whether all the opening delimiters were balanced out). If it's not empty, I report an error.
Although I have cross-checked it many times, the code seems to report every string as invalid (i.e., with unbalanced delimiters). Here's the code:
import java.util.*;
public class delimiter {
public static void main(String args[]){
String s1 = "()";
String s2 = "[}[]";
if(delimitercheck(s1)){
System.out.println("s1 is a nice text!");
}
else
System.out.println("S1 is not nice");
if(delimitercheck(s2)){
System.out.println("s2 is a nice text!");
}
else
System.out.println("S2 is not nice");
}
public static boolean delimitercheck(String s){
Stack<Character> stk = new Stack<Character>();
if(s==null||s.length()==0)//if it's a null string return true
return true;
for(int i=0;i<s.length();i++){
if(s.charAt(i)=='('||s.charAt(i)=='{'||s.charAt(i)=='['){
stk.push(s.charAt(i));
}
if(s.charAt(i)==')'||s.charAt(i)=='}'||s.charAt(i)==']'){
if(stk.isEmpty()){
return false;
}
if(stk.peek()==s.charAt(i)){
stk.pop();
}
}
}
if(stk.isEmpty()){
return true;
}
else
return false;
}
}
Can anyone point out where I am going wrong?
Your error is here :
if(stk.peek()==s.charAt(i)){
stk.pop();
}
The i'th character shouldn't be equal to stk.peek(). It should be closing it. i.e. if stk.peek() == '{', s.charAt(i) should be '}', and so on.
In addition, if the current closing parenthesis doesn't match the top of the stack, you should return false.
You can either have a separate condition for each type of parenthesis, or you can create a Map<Character,Character> that maps each opening parenthesis to its corresponding closing parenthesis, and then your condition will become :
if(map.get(stk.peek())==s.charAt(i)){
stk.pop();
} else {
return false;
}
where map can be initialized to :
Map<Character,Character> map = new HashMap<>();
map.put('(',')');
map.put('{','}');
map.put('[',']');
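Putting the map-based check together, a complete version might look like the sketch below. It is not the asker's class: the names (DelimiterCheck, isBalanced) are made up, and I swapped Stack for ArrayDeque, which is the usual stack type in current Java code.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class DelimiterCheck {
    // Maps each opening delimiter to the closing delimiter it expects.
    private static final Map<Character, Character> PAIRS = new HashMap<>();
    static {
        PAIRS.put('(', ')');
        PAIRS.put('{', '}');
        PAIRS.put('[', ']');
    }

    public static boolean isBalanced(String s) {
        if (s == null) {
            return true; // same convention as the question: null counts as balanced
        }
        Deque<Character> stack = new ArrayDeque<>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (PAIRS.containsKey(c)) {
                stack.push(c);                       // opening delimiter
            } else if (PAIRS.containsValue(c)) {     // closing delimiter
                // Fail on a stray closer, or when the last opener expects a different closer.
                if (stack.isEmpty() || PAIRS.get(stack.pop()) != c) {
                    return false;
                }
            }
            // every other character is ignored
        }
        return stack.isEmpty(); // leftover openers mean the string is unbalanced
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("()"));   // true
        System.out.println(isBalanced("[}[]")); // false
    }
}
```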
Yes, when encountering a closing bracket, you check whether it is equal to the opening bracket on the stack, which is not correct.
if(stk.peek()==s.charAt(i)){
stk.pop();
}
should be replaced with something similar to
Character toCheck = s.charAt(i);
Character peek = stk.peek();
if (toCheck == ')') {
if (peek == '(') {
stk.pop();
} else {
return false;
}
} else if ( // ... check all three bracket types
And please stick to braces for every if-statement: there's nothing more tedious than one day tracking down an error caused by omitted braces.
You are comparing the value on the stack, which is an opening delimiter, directly with a closing delimiter. So for the first example you are checking '(' == ')', which is never true. Instead, for every closing delimiter you should check that the top of the stack holds its matching opening delimiter, i.e. ')' must be paired with '('. Hope that makes sense.

java: code keeps looping

The below code is giving me a headache: It's supposed to jump out of the do--while loop after replacing all \n's, but it doesn't. Any ideas how to solve this?
public String invoerenTemplate(){
String templateGescheiden = null;
String teHerkennenTemplate = Input.readLine();
String uitvoer = teHerkennenTemplate;
do {
templateGescheiden = teHerkennenTemplate.substring(0, teHerkennenTemplate.indexOf(" "));
templateGescheiden += " ";
if (templateGescheiden.charAt(0) == '\\' && templateGescheiden.charAt(1) == 'n') {
teHerkennenTemplate = teHerkennenTemplate.replace(templateGescheiden, "\n");
uitvoer = uitvoer.replace(templateGescheiden, "\n");
}
teHerkennenTemplate = teHerkennenTemplate.substring(teHerkennenTemplate.indexOf(" "));
System.out.println(uitvoer);
} while (teHerkennenTemplate.length() > 0);
return uitvoer;
}
EDIT:
I now placed this line: teHerkennenTemplate.trim(); just beneath my if-statement, but now it gives me a StringIndexOutOfRange: 0 error at my first line of my if-statement
I have noticed a couple of problems with the above code, although it is difficult to tell why you are taking the approach that you are to the solution.
The main thing I noticed is that your replace statements do NOT remove the \n characters
teHerkennenTemplate = teHerkennenTemplate.replace(templateGescheiden, "\n");
uitvoer = uitvoer.replace(templateGescheiden, "\n");
From Java Documentation:
replace(CharSequence target, CharSequence replacement):
Replaces each substring of this string that matches the literal target sequence with the specified literal replacement sequence.
So, you are replacing your string templateGescheiden with \n each time you loop.
Another issue would be the improper shortening of your teHerkennenTemplate string each loop, which is causing it not to terminate correctly. It will always shorten from the next space character to the end of the string (inclusive) - meaning it will never be an empty string, but will always have a " ".
My advice would be to debug and go step-by-step to see where the shortening and string manipulation is not doing what you want, then evaluate why and modify the code appropriately
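The shortening problem described above is easy to reproduce in isolation. A small sketch (the class name ShrinkDemo and the helper dropFirstWord are made up for illustration) showing why substring(indexOf(" ")) gets stuck, and one way to make the string actually shrink:

```java
public class ShrinkDemo {
    /** Drops everything up to and including the first space; returns "" when no space is left. */
    public static String dropFirstWord(String s) {
        int space = s.indexOf(' ');
        return space == -1 ? "" : s.substring(space + 1);
    }

    public static void main(String[] args) {
        String stuck = "a b";
        stuck = stuck.substring(stuck.indexOf(" ")); // " b"
        stuck = stuck.substring(stuck.indexOf(" ")); // still " b": the space is now at index 0
        System.out.println("[" + stuck + "]");

        String rest = "a b";
        while (rest.length() > 0) {
            rest = dropFirstWord(rest); // skips past the space, so the loop terminates
        }
        System.out.println("done");
    }
}
```

Note also that String is immutable, so a bare call such as teHerkennenTemplate.trim(); has no effect unless the result is assigned back.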
There's a variety of things wrong with the code:
the index of a newline character is found in the string with indexOf("\n").
the substring of teHerkennenTemplate doesn't take into account that it starts with a space, which causes the loop to continue forever.
The simplest way to do what you want is with a regular expression:
"test \n test \n".replaceAll("\n", "")
Will return (note that the spaces on both sides of the first newline are kept):
"test  test "
If you're set on using a loop then this will do the same:
public static String invoerenTemplate(String teHerkennenTemplate)
{
StringBuilder result = new StringBuilder();
while (teHerkennenTemplate.length() > 0)
{
int index = teHerkennenTemplate.indexOf("\n");
if (index == -1)
{
// No newline left: append the remainder and stop, otherwise the loop never ends.
result.append(teHerkennenTemplate);
break;
}
result.append(teHerkennenTemplate.substring(0, index));
teHerkennenTemplate = teHerkennenTemplate.substring(index + 1);
}
return result.toString();
}

regular expression for \" in java

I need to write a regular expression for string read from a file
apple,boy,cat,"dog,cat","time\" after\"noon"
I need to split it into
apple
boy
cat
dog,cat
time"after"noon
I tried using
Pattern pattern =
Pattern.compile("[\\\"]");
String items[]=pattern.split(match);
for the second part, but I could not get the right answer. Can you help me with this?
Since your question is more of a parsing problem than a regex problem, here's another solution that will work:
public class CsvReader {
Reader r;
int row, col;
boolean endOfRow;
public CsvReader(Reader r){
this.r = r instanceof BufferedReader ? r : new BufferedReader(r);
this.row = -1;
this.col = 0;
this.endOfRow = true;
}
/**
* Returns the next string in the input stream, or null when no input is left
* @return
* @throws IOException
*/
public String next() throws IOException {
int i = r.read();
if(i == -1)
return null;
if(this.endOfRow){
this.row++;
this.col = 0;
this.endOfRow = false;
} else {
this.col++;
}
StringBuilder b = new StringBuilder();
outerLoop:
while(true){
char c = (char) i;
if(i == -1)
break;
if(c == ','){
break;
} else if(c == '\n'){
endOfRow = true;
break;
} else if(c == '\\'){
i = r.read();
if(i == -1){
break;
} else {
b.append((char)i);
}
} else if(c == '"'){
while(true){
i = r.read();
if(i == -1){
break outerLoop;
}
c = (char)i;
if(c == '\\'){
i = r.read();
if(i == -1){
break outerLoop;
} else {
b.append((char)i);
}
} else if(c == '"'){
r.mark(2);
i = r.read();
if(i == '"'){
b.append('"');
} else {
r.reset();
break;
}
} else {
b.append(c);
}
}
} else {
b.append(c);
}
i = r.read();
}
return b.toString().trim();
}
public int getColNum(){
return col;
}
public int getRowNum(){
return row;
}
public static void main(String[] args){
try {
String input = "apple,boy,cat,\"dog,cat\",\"time\\\" after\\\"noon\"\nquick\"fix\" hello, \"\"\"who's there?\"";
System.out.println(input);
Reader r = new StringReader(input);
CsvReader csv = new CsvReader(r);
String s;
while((s = csv.next()) != null){
System.out.println("R" + csv.getRowNum() + "C" + csv.getColNum() + ": " + s);
}
} catch(IOException e){
e.printStackTrace();
}
}
}
Running this code, I get the output:
R0C0: apple
R0C1: boy
R0C2: cat
R0C3: dog,cat
R0C4: time" after"noon
R1C0: quickfix hello
R1C1: "who's there?
This should fit your needs pretty well.
A few disclaimers, though:
It won't catch errors in the syntax of the CSV format, such as an unescaped quotation mark in the middle of a value.
It won't perform any character conversion (such as converting "\n" to a newline character). Backslashes simply cause the following character to be treated literally, including other backslashes. (That should be easy enough to alter if you need additional functionality)
Some csv files escape quotes by doubling them rather than using a backslash, this code now looks for both.
Edit: Looked up the csv format, discovered there's no real standard, but updated my code to catch quotes escaped by doubling rather than backslashes.
Edit 2: Fixed. Should work as advertised now. Also modified it to test the tracking of row and column numbers.
First thing: String.split() uses the regex to find the separators, not the substrings.
Edit: I'm not sure if this can be done with String.split(). I think the only way you could deal with the quotes while only matching the comma would be with lookahead and lookbehind, and that's going to break in quite a lot of cases.
Edit2: I'm pretty sure it can be done with a regular expression. And I'm sure this one case could be solved with string.split() -- but a general solution wouldn't be simple.
Basically, you're looking for anything that isn't a comma as input [^,], you can handle quotes as a separate character. I've gotten most of the way there myself. I'm getting this as output:
apple
boy
cat
dog
cat
time\" after\"noon
But I'm not sure why it has so many blank lines.
My complete code is:
String input = "apple,boy,cat,\"dog,cat\",\"time\\\" after\\\"noon\"";
Pattern pattern =
Pattern.compile("(\\s|[^,\"\\\\]|(\\\\.)||(\".*\"))*");
Matcher m = pattern.matcher(input);
while(m.find()){
System.out.println(m.group());
}
But yeah, I'll echo the guy above and say that if there's no requirement to use a regular expression, then it's probably simpler to do it manually.
But then I guess I'm almost there. It's spitting out ... oh hey, I see what's going on here. I think I can fix that.
But I'm going to echo the guy above and say that if there's no requirement to use a regular expression, it's probably better to do it one character at a time and implement the logic manually. If your regex isn't picture-perfect, then it could cause all kinds of unpredictable weirdness down the line.
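For what it's worth, the blank lines in the regex attempt above appear to come from the pattern being able to match the empty string (the || contributes an empty alternative and the whole group is under *), so find() reports a zero-length match at every comma. A minimal sketch that avoids this, assuming fields are either double-quoted with backslash escapes or plain runs of non-comma characters. The class and method names (CsvRegexSplit, split) are made up, and empty fields are silently skipped:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CsvRegexSplit {
    // Either a quoted field (group 1) that may contain backslash escapes,
    // or a non-empty run of characters that are not commas (group 2).
    private static final Pattern FIELD =
            Pattern.compile("\"((?:[^\"\\\\]|\\\\.)*)\"|([^,]+)");

    public static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            String quoted = m.group(1);
            if (quoted != null) {
                // Drop the backslash from escapes such as \" inside quoted fields.
                fields.add(quoted.replaceAll("\\\\(.)", "$1"));
            } else {
                fields.add(m.group(2));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String input = "apple,boy,cat,\"dog,cat\",\"time\\\" after\\\"noon\"";
        for (String field : split(input)) {
            System.out.println(field);
        }
    }
}
```

Neither alternative can match zero characters outside of quotes, so the commas are simply skipped between matches. This is still only a sketch: it does not validate malformed input, which is where the hand-written parser is the safer choice.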
I am not really sure about this, but you could have a go at Pattern.compile("[\\\\\"]");
\ is an escape character and to detect a \ in the expression, \\\\ could be used.
A similar thing worked for me in another context and I hope it solves your problem too.

Java - "String index out of range" exception

I wrote this little function just for practice, but an exception ("String index out of range: 29") is thrown and I don't know why...
(I know this isn't the best way to write this function and that I could use regular expressions.)
This is the code:
public String retString(String x)
{
int j=0;
int i=0;
StringBuffer y = new StringBuffer(x);
try
{
while ( y.charAt(i) != '\0' )
{
if (y.charAt(i) != ' ')
{
y.setCharAt(j, y.charAt(i));
i++;
j++;
}
else
{
y.setCharAt(j, y.charAt(i));
i++;
j++;
while (y.charAt(i) == ' ')
i++;
}
}
y.setCharAt(j,'\0');
}
finally
{
System.out.println("lalalalololo " );
}
return y.toString();
}
Are you translating this code from another language? You are looping through the string until you reach a null character ("\0"), but Java doesn't conventionally use these in strings. In C, this would work, but in your case you should try
i < y.length()
instead of
y.charAt(i) != '\0'
Additionally, the
y.setCharAt(j,'\0')
at the end of your code will not resize the string, if that is what you are expecting. You should instead try
y.setLength(j)
This exception is an IndexOutOfBoundsException but more particularly, a StringIndexOutOfBoundsException (which is derived from IndexOutOfBoundsException). The reason for receiving an error such as this is because you are exceeding the bounds of an indexable collection. This is something C/C++ does not do (you check bounds of collections manually) whereas Java has these built into their collections to avoid issues such as this. In this case, you're using the String object like an array (probably what it is in implementation) and going over the boundary of the String.
Java does not expose the null terminator in the public interface of String. In other words, you cannot determine the end of the String by searching for the null terminator. Rather, the ideal way to do this is by ensuring you do not exceed the length of the string.
Java strings are not null-terminated. Use String.length() to determine where to stop.
Looks like you are a C/C++ programmer coming to java ;)
Once you have gone out of range with .charAt (), it doesn't reach null, it reaches a StringIndexOutOfBoundsException. So in this case, you will need a for loop that goes from 0 to y.length()-1.
a much better implementation (with regex) is simply return x.replaceAll("\\s+"," "); (this even collapses other whitespace; note it is called on the String x, since StringBuffer has no replaceAll)
and StringBuffer.length() is constant time (no slow null termination semantics in java)
and similarly x.charAt(x.length()); will also throw a StringIndexOutOfBoundsException (and not return \0 like you'd expect in C)
for the fixed code:
while ( y.length()>i)//use length from the buffer
{
if (y.charAt(i) != ' ')
{
y.setCharAt(j, y.charAt(i));
i++;
j++;
}
else
{
y.setCharAt(j, y.charAt(i));
i++;
j++;
while (i < y.length() && y.charAt(i) == ' ') // bounds check: the string may end with spaces
i++;
}
}
y.setLength(j);//using setLength to actually set the length
btw a StringBuilder is a faster implementation (no unnecessary synchronization)
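Along those lines, here is a sketch of the same function written with a StringBuilder and an explicit length check instead of a '\0' sentinel. The class and method names (CollapseSpaces, collapse) are made up, and it collapses only the space character, like the original loop:

```java
public class CollapseSpaces {
    /** Replaces every run of consecutive spaces with a single space. */
    public static String collapse(String x) {
        StringBuilder out = new StringBuilder(x.length());
        boolean previousWasSpace = false;
        // i < x.length() is the only end-of-string test Java needs.
        for (int i = 0; i < x.length(); i++) {
            char c = x.charAt(i);
            if (c != ' ' || !previousWasSpace) {
                out.append(c); // keep non-spaces and the first space of each run
            }
            previousWasSpace = (c == ' ');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + collapse("a   b  c  ") + "]"); // [a b c ]
    }
}
```

Building a new string avoids the in-place index juggling, so there is no second index that can run past the end.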
