How can I replace:
((90+1)%(100-4)) + ((90+1)%(100-4/(6-4))) - (var1%(var2%var3(var4-var5)))
with
XYZ((90+1),(100-4)) + XYZ((90+1),100-4/(6-4)) - XYZ(var1,XYZ(var2,var3(var4-var5)))
with regex?
Thanks,
J
This doesn't really look like a very good job for a regex. It looks like you might want to write a quick recursive descent parser instead. If I understand you correctly, you want to replace the infix operator % with a function name XYZ?
So (expression % expression) becomes XYZ(expression, expression)
This looks like a good resource to study: http://www.cs.uky.edu/~lewis/essays/compilers/rec-des.html
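As a rough illustration of the recursive idea (not a full recursive descent parser), here is a minimal sketch. It assumes that every % sits directly inside its own pair of parentheses and that each parenthesized group contains at most one top-level %. The class name ModToCall and its helper methods are made up for this example.

```java
class ModToCall {
    // Rewrites every "(left % right)" group as "XYZ(left,right)",
    // recursing into both operands and into ordinary parenthesized groups.
    static String rewrite(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c != '(') {
                out.append(c);
                i++;
                continue;
            }
            int close = matchingParen(s, i);
            String inner = s.substring(i + 1, close);
            int pct = topLevelPercent(inner);
            if (pct >= 0) {
                // The group's own parentheses become the parentheses of the call.
                out.append("XYZ(")
                   .append(rewrite(inner.substring(0, pct)))
                   .append(',')
                   .append(rewrite(inner.substring(pct + 1)))
                   .append(')');
            } else {
                out.append('(').append(rewrite(inner)).append(')');
            }
            i = close + 1;
        }
        return out.toString();
    }

    // Index of the ')' that closes the '(' at position open.
    static int matchingParen(String s, int open) {
        int depth = 0;
        for (int i = open; i < s.length(); i++) {
            if (s.charAt(i) == '(') {
                depth++;
            } else if (s.charAt(i) == ')' && --depth == 0) {
                return i;
            }
        }
        throw new IllegalArgumentException("Unbalanced parentheses at index " + open);
    }

    // Index of the first '%' that is not nested inside parentheses, or -1.
    static int topLevelPercent(String s) {
        int depth = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
            } else if (c == '%' && depth == 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(rewrite(
            "((90+1)%(100-4)) + ((90+1)%(100-4/(6-4))) - (var1%(var2%var3(var4-var5)))"));
    }
}
```

For the expression from the question this prints XYZ((90+1),(100-4)) + XYZ((90+1),(100-4/(6-4))) - XYZ(var1,XYZ(var2,var3(var4-var5))). A % that is not wrapped in its own parentheses is left untouched, so real operator precedence would need a proper parser.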
I don't know much about regex, but try looking at this, especially 9 and 10:
http://www.mkyong.com/regular-expressions/10-java-regular-expression-examples-you-should-know/
And of course:
http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
You could at least check them out until an in depth answer comes along.
See this code:
String input = "((90+1)%(100-4)) + ((90+1)%(100-4/(6-4))) - (var1%(var2%var3(var4-var5)))";
input = input.replaceAll("%", ",");
int level = 0;
List<Integer> targetStack = new ArrayList<Integer>();
List<Integer> splitIndices = new ArrayList<Integer>();
// add the index of last character as default checkpoint
splitIndices.add(input.length());
for (int i = input.length() - 1; i >= 0; i--) {
if (input.charAt(i) == ',') {
targetStack.add(level - 1);
} else if (input.charAt(i) == ')') {
level++;
}
else if (input.charAt(i) == '(') {
level--;
if (!targetStack.isEmpty() && level == targetStack.get(targetStack.size() - 1)) {
splitIndices.add(i);
}
}
}
Collections.reverse(splitIndices); // reversing the indices so that they are in increasing order
StringBuilder result = new StringBuilder();
for (int i = 1; i < splitIndices.size(); i++) {
result.append("XYZ");
result.append(input.substring(splitIndices.get(i - 1), splitIndices.get(i)));
}
System.out.println(result);
The output is as you expect it:
XYZ((90+1),(100-4)) + XYZ((90+1),(100-4/(6-4))) - XYZ(var1,XYZ(var2,var3(var4-var5)))
However, keep in mind that it is a bit hacky and might not work exactly as you expect. By the way, I had to change the expected output slightly by adding a pair of brackets: XYZ((90+1),(100-4/(6-4))), because otherwise you were not following your own conventions. Hopefully this code helps you. For me it was a good exercise at least.
Would it satisfy your requirements to do the following:
Find ( at first position or preceded by space and replace it with XYZ(
Find % and replace it with ,
If those two instructions are sufficient and satisfactory, then you could transform the original string in three "moves":
Replace ^\( with XYZ(
Replace " \(" (a space followed by an opening parenthesis) with " XYZ("
Replace % with ,
I have over a gigabyte of text that I need to go through and surround punctuation with spaces (tokenizing). I have a long regular expression (1818 characters, though that's mostly lists) that defines when punctuation should not be separated. Being long and complicated makes it hard to use groups with it, though I wouldn't leave that out as an option since I could make most groups non-capturing (?:).
Question: How can I efficiently replace certain characters that don't match a particular regular expression?
I've looked into using lookaheads or similar, and I haven't quite figured it out, but it seems to be terribly inefficient anyway. It would likely be better than using placeholders though.
I can't seem to find a good "replace with a bunch of different regular expressions for both finding and replacing in one pass" function.
Should I do this line by line instead of operating on the whole text?
String completeRegex = "[^\\w](("+protectedPrefixes+")|(("+protectedNumericOnly+")\\s*\\p{N}))|"+protectedRegex;
Matcher protectedM = Pattern.compile(completeRegex).matcher(s);
ArrayList<String> protectedStrs = new ArrayList<String>();
//Take note of the protected matches.
while (protectedM.find()) {
protectedStrs.add(protectedM.group());
}
//Replace protected matches.
String replaceStr = "<PROTECTED>";
s = protectedM.replaceAll(replaceStr);
//Now that it's safe, separate punctuation.
s = s.replaceAll("([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])"," $1 ");
// These are for apostrophes. Can these be combined with either the protecting regular expression or the one above?
s = s.replaceAll("([\\p{N}\\p{L}])'(\\p{L})", "$1 '$2");
s = s.replaceAll("([^\\p{L}])'([^\\p{L}])", "$1 ' $2");
Note the two additional replacements for apostrophes. Using placeholders protects against those replacements as well, but I'm not really concerned with apostrophes or single quotes in my protecting regex anyway, so it's not a real concern.
I'm rewriting what I considered very inefficient Perl code with my own in Java, keeping track of speed, and things were going fine until I started replacing the placeholders with the original strings. With that addition it's too slow to be reasonable (I've never seen it get even close to finishing).
//Replace placeholders with original text.
String resultStr = "";
String currentStr = "";
int currentPos = 0;
int[] protectedArray = replaceStr.codePoints().toArray();
int protectedLen = protectedArray.length;
int[] strArray = s.codePoints().toArray();
int protectedCount = 0;
for (int i=0; i<strArray.length; i++) {
int pt = strArray[i];
// System.out.println("pt: "+pt+" symbol: "+String.valueOf(Character.toChars(pt)));
if (protectedArray[currentPos]==pt) {
if (currentPos == protectedLen - 1) {
resultStr += protectedStrs.get(protectedCount);
protectedCount++;
currentPos = 0;
} else {
currentPos++;
}
} else {
if (currentPos > 0) {
resultStr += replaceStr.substring(0, currentPos);
currentPos = 0;
currentStr = "";
}
resultStr += ParseUtils.getSymbol(pt);
}
}
s = resultStr;
This code may not be the most efficient way to return the protected matches. What is a better way? Or better yet, how can I replace punctuation without having to use placeholders?
I don't know exactly how big your in-between strings are, but I suspect that you can do somewhat better than using Matcher.replaceAll, speed-wise.
You're doing 3 passes across the string, each time creating a new Matcher instance, and then creating a new String; and because you're using + to concatenate the strings, you're creating a new string which is the concatenation of the in-between string and the protected group, and then another string when you concatenate this to the current result. You don't really need all of these extra instances.
Firstly, you should accumulate the resultStr in a StringBuilder, rather than via direct string concatenation. Then you can proceed something like:
StringBuilder resultStr = new StringBuilder();
int currIndex = 0;
while (protectedM.find()) {
protectedStrs.add(protectedM.group());
appendInBetween(resultStr, str, currIndex, protectedM.start());
resultStr.append(protectedM.group());
currIndex = protectedM.end();
}
// The tail after the last protected match still needs its punctuation separated.
appendInBetween(resultStr, str, currIndex, str.length());
where appendInBetween is a method implementing the equivalent to the replacements, just in a single pass:
void appendInBetween(StringBuilder resultStr, String s, int start, int end) {
// Pass the whole input string and the bounds, rather than taking a substring.
// Allocate roughly enough space up-front.
resultStr.ensureCapacity(resultStr.length() + end - start);
for (int i = start; i < end; ++i) {
char c = s.charAt(i);
// Check if c matches "([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])".
if (!(Character.isLetter(c)
|| Character.isDigit(c)
|| Character.getType(c) == Character.NON_SPACING_MARK
|| "_-<>'".indexOf(c) != -1)) {
resultStr.append(' ');
resultStr.append(c);
resultStr.append(' ');
} else if (c == '\'' && i > 0 && i + 1 < s.length()) {
// We have a quote that's not at the beginning or end.
// Call these 3 characters bcd, where c is the quote.
char b = s.charAt(i - 1);
char d = s.charAt(i + 1);
if ((Character.isDigit(b) || Character.isLetter(b)) && Character.isLetter(d)) {
// If the 3 chars match "([\\p{N}\\p{L}])'(\\p{L})"
resultStr.append(' ');
resultStr.append(c);
} else if (!Character.isLetter(b) && !Character.isLetter(d)) {
// If the 3 chars match "([^\\p{L}])'([^\\p{L}])"
resultStr.append(' ');
resultStr.append(c);
resultStr.append(' ');
} else {
resultStr.append(c);
}
} else {
// Everything else, just append.
resultStr.append(c);
}
}
}
Obviously, there is a maintenance cost associated with this code - it is undeniably more verbose. But the advantage of doing it explicitly like this (aside from the fact it is just a single pass) is that you can debug the code like any other - rather than it just being the black box that regexes are.
I'd be interested to know if this works any faster for you!
At first I thought that appendReplacement wasn't what I was looking for, but indeed it was. Since it's replacing the placeholders at the end that slowed things down, all I really needed was a way to dynamically replace matches:
StringBuffer replacedBuff = new StringBuffer();
Matcher replaceM = Pattern.compile(replaceStr).matcher(s);
int index = 0;
while (replaceM.find()) {
replaceM.appendReplacement(replacedBuff, "");
replacedBuff.append(protectedStrs.get(index));
index++;
}
replaceM.appendTail(replacedBuff);
s = replacedBuff.toString();
Reference: Second answer at this question.
Another option to consider:
During the first pass through the String, to find the protected Strings, take the start and end indices of each match, replace the punctuation for everything outside of the match, add the matched String, and then keep going. This takes away the need to write a String with placeholders, and requires only one pass through the entire String. It does, however, require many separate small replacement operations. (By the way, be sure to compile the patterns before the loop, as opposed to using String.replaceAll()). A similar alternative is to add the unprotected substrings together, and then replace them all at the same time. However, the protected strings would then have to be added to the replaced string at the end, so I doubt this would save time.
int currIndex = 0;
while (protectedM.find()) {
protectedStrs.add(protectedM.group());
String substr = s.substring(currIndex,protectedM.start());
substr = p1.matcher(substr).replaceAll(" $1 ");
substr = p2.matcher(substr).replaceAll("$1 '$2");
substr = p3.matcher(substr).replaceAll("$1 ' $2");
resultStr += substr+protectedM.group();
currIndex = protectedM.end();
}
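The loop above relies on p1, p2 and p3 being compiled ahead of time and never shows the text after the last protected match. Here is a minimal self-contained sketch of the same idea, assuming the three patterns are exactly the replacement regexes from the question. The class and method names are made up for illustration, and a StringBuilder replaces the += concatenation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProtectedTokenizer {
    // Compiled once, instead of on every String.replaceAll() call.
    static final Pattern P1 = Pattern.compile("([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])");
    static final Pattern P2 = Pattern.compile("([\\p{N}\\p{L}])'(\\p{L})");
    static final Pattern P3 = Pattern.compile("([^\\p{L}])'([^\\p{L}])");

    // Applies the three punctuation replacements to one unprotected chunk.
    static String separate(String chunk) {
        chunk = P1.matcher(chunk).replaceAll(" $1 ");
        chunk = P2.matcher(chunk).replaceAll("$1 '$2");
        return P3.matcher(chunk).replaceAll("$1 ' $2");
    }

    // Copies protected matches verbatim and tokenizes everything in between.
    static String tokenize(String s, Pattern protectedPattern) {
        StringBuilder result = new StringBuilder(s.length() * 2);
        Matcher m = protectedPattern.matcher(s);
        int curr = 0;
        while (m.find()) {
            result.append(separate(s.substring(curr, m.start())));
            result.append(m.group());
            curr = m.end();
        }
        // Do not forget the text after the last protected match.
        result.append(separate(s.substring(curr)));
        return result.toString();
    }
}
```

Because each chunk is processed on its own, the apostrophe patterns cannot see characters across a chunk boundary, so results may differ slightly from the placeholder approach when a protected match sits right next to an apostrophe.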
Speed comparison for 100,000 lines of text:
Original Perl script: 272.960579875 seconds
My first attempt: Too long to finish.
With appendReplacement(): 14.245160866 seconds
Replacing while finding protected: 68.691842962 seconds
Thank you, Java, for not letting me down.
I would like to use a regular expression for the following problem:
SOME_RANDOM_TEXT
should be converted to:
someRandomText
So the _(any char) sequence should be replaced with just that character in upper case. Using a regex tool, I came up with something like this:
_\w and $&
How do I get only the second character (the one after the underscore) in the replacement? Any advice? Thanks.
It might be easier simply to String.split("_") and then rejoin, capitalising the first letter of each string in your collection.
Note that Apache Commons has lots of useful string-related stuff, including a join() method.
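A minimal sketch of the split-and-rejoin idea using only the JDK (no Apache Commons). The class name CamelCaseSplit is made up for this example, and Locale.ROOT is an assumption to keep the case conversion locale-independent.

```java
import java.util.Locale;

class CamelCaseSplit {
    // Lower-cases the input, splits on '_' and capitalises every part except the first.
    static String toCamelCase(String value) {
        String[] parts = value.toLowerCase(Locale.ROOT).split("_");
        StringBuilder sb = new StringBuilder(value.length());
        for (String part : parts) {
            if (part.isEmpty()) {
                continue; // skip empty parts caused by leading or doubled underscores
            }
            if (sb.length() == 0) {
                sb.append(part);
            } else {
                sb.append(Character.toUpperCase(part.charAt(0)))
                  .append(part, 1, part.length());
            }
        }
        return sb.toString();
    }
}
```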
The problem is that case conversion from lowercase to uppercase is not supported by java.util.regex.Pattern.
This means you will need to do the conversion programmatically, as Brian suggested. See also this thread.
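One way to do the conversion programmatically while still using a regex to find the spots is Matcher.appendReplacement. The sketch below captures only the letter after the underscore in group 1 (instead of the whole match, $&) and upper-cases it in Java code. The class name and the choice of the pattern _([a-z]) are assumptions for this example.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CamelCaseRegex {
    // Group 1 is the single letter that follows an underscore.
    private static final Pattern UNDERSCORE_LETTER = Pattern.compile("_([a-z])");

    static String toCamelCase(String value) {
        Matcher m = UNDERSCORE_LETTER.matcher(value.toLowerCase(Locale.ROOT));
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // The regex only locates the match; the upper-casing happens in Java.
            String upper = m.group(1).toUpperCase(Locale.ROOT);
            m.appendReplacement(sb, Matcher.quoteReplacement(upper));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```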
You can also write a simple method to do this. It's more complicated but better optimized (note that it works on bytes, so it is only safe for ASCII input):
public static String toCamelCase(String value) {
value = value.toLowerCase();
byte[] source = value.getBytes();
int maxLen = source.length;
byte[] target = new byte[maxLen];
int targetIndex = 0;
for (int sourceIndex = 0; sourceIndex < maxLen; sourceIndex++) {
byte c = source[sourceIndex];
if (c == '_') {
if (sourceIndex < maxLen - 1)
source[sourceIndex + 1] = (byte) Character.toUpperCase(source[sourceIndex + 1]);
continue;
}
target[targetIndex++] = source[sourceIndex];
}
return new String(target, 0, targetIndex);
}
I like Apache commons libraries, but sometimes it's good to know how it works and be able to write some specific code for jobs like this.
Here is another problem I am trying to solve (note: this is not homework, just something that popped into my head) to improve my problem-solving skills in Java. I want to display this:
Students ID #
Carol McKane 920 11
James Eriol 154 10
Elainee Black 462 12
What I want is for the 3rd column to display the number of characters, not counting the spaces. Give me some tips on how to do this, or point me to the relevant Java APIs, because I'm not yet that familiar with Java's string APIs. Thanks.
It sounds like you just want something like:
public static int countNonSpaces(String text) {
int count = 0;
for (int i = 0; i < text.length(); i++) {
if (text.charAt(i) != ' ') {
count++;
}
}
return count;
}
You may want to modify this to use Character.isWhitespace instead of only checking for ' '. Also note that this will count characters outside the Basic Multilingual Plane (which are stored as surrogate pairs) as two characters. Whether that will be a problem for you or not depends on your use case...
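A sketch of a variant that applies both suggestions: it uses Character.isWhitespace and iterates over code points, so a supplementary character counts once. The class and method names are made up for this example.

```java
class NonSpaceCounter {
    // Counts code points that are not whitespace.
    static int countNonWhitespace(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (!Character.isWhitespace(codePoint)) {
                count++;
            }
            // Advance by 1 or 2 chars, depending on whether this is a surrogate pair.
            i += Character.charCount(codePoint);
        }
        return count;
    }
}
```

With the names from the question, "Carol McKane" gives 11, "James Eriol" gives 10 and "Elainee Black" gives 12.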
Think of solving a problem and presenting the answer as two very different steps. I won't help you with the presentation in a table, but to count the number of characters in a String (without spaces) you can use this:
String name = "Carol McKane";
int numberOfCharacters = name.replaceAll("\\s", "").length();
The regular expression \\s matches all whitespace characters in the name string, and replaces them with "", or nothing.
Probably the shortest and easiest way:
String[][] students = { { "Carol McKane", "James Eriol", "Elainee Black" }, { "920", "154", "462" } };
for (int i = 0 ; i < students[0].length; i++) {
System.out.println(students[0][i] + "\t" + students[1][i] + "\t" + students[0][i].replace( " ", "" ).length() );
}
replace() returns a new string in which every occurrence of the substring " " has been removed; calling length() on that temporary, space-free string gives you the count.
The String name will remain unchanged.
http://docs.oracle.com/javase/7/docs/api/java/lang/String.html
cheers
To learn more about it, you should read the API documentation for String and Character.
Here are some examples of how to do it:
// variation 1
int count1 = 0;
for (char character : text.toCharArray()) {
if (Character.isLetter(character)) {
count1++;
}
}
This uses a special short form of the "for" statement. Here's the long form for better understanding:
// variation 2
int count2 = 0;
for (int i = 0; i < text.length(); i++) {
char character = text.charAt(i);
if (Character.isLetter(character)) {
count2++;
}
}
BTW, removing whitespaces via replace method is not a good coding style to me and not quite helpful for understanding how string class works.
I am new here, and I am having a hard time figuring out how to write code that takes a word as input and checks whether its beginning mirrors its end. For example, the input abba should be reported as evenly symmetric and aba as oddly symmetric.
Please show me how:(
Just two main things.
First, I want to know whether the word has an odd or even number of letters (divide the number of letters by 2: if the result ends in .5, the word is oddly symmetric; if it is an integer, it is evenly symmetric).
Second, I want to compare the letters by position (i.e. 1 with n, 2 with n-1, 3 with n-2, ...), which is the main idea of the check. If a single letter is left over in an oddly symmetric word, ignore it.
I appreciate any headstart or idea:) Thanks!
Thanks KDiTraglia, I made the code and compiled and here is what I put. I am not getting any further.
Reported problem:
Exception in thread "main" java.lang.Error: Unresolved compilation problems:
reverse cannot be resolved or is not a field
reverse cannot be resolved or is not a field
Syntax error, insert ") Statement" to complete IfStatement
This is what I got with KDiTraglia's help:
public class WordSymmetric {
public static void main(String[] args) {
String word = "abccdccba";
if ( (word.length() % 2) == 1 ) {
System.out.println("They are oddly symmetric");
//odd
}
else {
System.out.println("They are evenly symmetric");
//even
}
int halfLength = word.length() / 2;
String firstHalf = word.substring(0, halfLength);
String secondHalf = word.substring(halfLength, word.length());
System.out.println(secondHalf.reverse());
if (firstHalf.equals(secondHalf.reverse()) {
System.out.println("They match");
//they match
}
} }
String does not have a reverse method. You could use the apache commons lang library for this purpose:
http://commons.apache.org/lang/api-release/org/apache/commons/lang3/StringUtils.html#reverse%28java.lang.String%29
The reverse() approach is very clean and readable. Unfortunately, there is no reverse() method for Strings, so you would either have to use an external library (StringUtils from the Apache Commons Lang 3 library has a reverse method) or code it yourself.
public static String reverse(String inputString) {
StringBuilder reverseString = new StringBuilder();
for(int i = inputString.length(); i > 0; --i) {
char result = inputString.charAt(i-1);
reverseString.append(result);
}
return reverseString.toString();
}
(This only works for characters that can fit into a char. So if you need something more general, you would have to expand it.)
Then you can just have a method like this:
enum ePalindromResult { NO_PALINDROM, PALINDROM_ODD, PALINDROM_EVEN };
public static ePalindromResult checkForPalindrom(String inputStr) {
// this uses the org.apache.commons.lang3.StringUtils class:
if (inputStr.equals(StringUtils.reverse(inputStr))) {
if (inputStr.length() % 2 == 0) return ePalindromResult.PALINDROM_EVEN;
else return ePalindromResult.PALINDROM_ODD;
} else return ePalindromResult.NO_PALINDROM;
}
System.out.println(secondHalf.reverse());
There is no reverse() method defined for String.
I would probably loop over the word from index 0 to the middle (word.length() / 2) and compare the character at the current index (word.charAt(i)) with the corresponding one from the other half (word.charAt(word.length() - 1 - i)).
This is just a rough draft; you will probably need to think about the loop's end index, depending on whether the word has an odd or even number of letters.
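The loop described above could look roughly like this. It is a sketch only; the class name, method name and the returned strings are made up for illustration. With integer division, word.length() / 2 skips the middle letter of an odd-length word automatically, so the same end index works for both cases.

```java
class SymmetryCheck {
    // Compares position i with its mirror position n - 1 - i for the first half of the word.
    static String classify(String word) {
        int n = word.length();
        for (int i = 0; i < n / 2; i++) {
            if (word.charAt(i) != word.charAt(n - 1 - i)) {
                return "not symmetric";
            }
        }
        return (n % 2 == 0) ? "evenly symmetric" : "oddly symmetric";
    }
}
```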
You can adapt this :
final char[] word = "abccdccba".toCharArray(); // also works with "abccccba"
final int t = word.length;
boolean ok = true;
for (int i = t / 2; i > 0; i--) {
if (word[i - 1] != word[t - i]) {
ok = false;
break;
}
System.out.println(word[i - 1] + "\t" + word[word.length - i]);
}
System.out.println(ok);
Console :
c c
c c
b b
a a
true
Use class StringBuffer instead of String
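One reading of this suggestion: StringBuffer (and its unsynchronized sibling StringBuilder) do have a reverse() method, which String lacks. A minimal sketch, with a made-up class name:

```java
class ReverseDemo {
    // Wraps the word in a StringBuilder to get access to reverse().
    static boolean isSymmetric(String word) {
        String reversed = new StringBuilder(word).reverse().toString();
        return reversed.equals(word);
    }
}
```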
Below is an example of the text:
String id = "A:abc,X:def,F:xyz,A:jkl";
Below is the regex:
Pattern p = Pattern.compile("(.*,)?[AC]:[^:]+$");
if(p.matcher(id).matches()) {
System.out.println("Hello world!");
}
When executed, the above code should print Hello world!.
Can this regex be modified to improve performance?
As I can't see your entire code, I can only assume that you do the pattern compilation inside your loop/method/etc. One thing that can improve performance is to compile at the class level and not recompile the pattern each time. Other than that, I don't see much else that you could change.
Pattern p = Pattern.compile(".*[AC]:[^:]+$");
if(p.matcher(id).matches()) {
System.out.println("Hello world!");
}
As you seem to be interested only in whether the string ends in A or C followed by a colon and some characters which aren't colons, you can just use .* instead of (.*,)? (or do you really want to capture the stuff before the last piece?)
If the stuff after the colon is all lower case you could even do
Pattern p = Pattern.compile(".*[AC]:[a-z]+$");
And if you are going to match this multiple times in a row (e.g. loop) be sure to compile the pattern outside of the loop.
e.g.
Pattern p = Pattern.compile(".*[AC]:[a-z]+$");
Matcher m = p.matcher(id);
while(....) {
...
// m.matches()
...
// prepare for next loop
m.reset(newvaluetocheck);
}
Move the Pattern instantiation to a static final field (erm, constant). In your current code you're recompiling essentially the same Pattern every single time (no, Pattern doesn't cache anything!). That should give you a noticeable performance boost right off the bat.
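A minimal sketch of that change, using the regex from the question. The class, field and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class IdMatcher {
    // Compiled once when the class is loaded and reused for every call.
    private static final Pattern ID_PATTERN = Pattern.compile("(.*,)?[AC]:[^:]+$");

    static boolean isId(String id) {
        return ID_PATTERN.matcher(id).matches();
    }
}
```

Pattern instances are immutable and safe to share between threads; only the Matcher objects created from them are not.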
Do you even need to use regular expressions? It seems there isn't a huge variety in what you are testing.
If you need to use the regex then, as others have said, compiling it only once makes sense. If you only need to check the last token, maybe you could simplify the regex to [AC]:[^:]{3}$ (applied with Matcher.find() rather than matches(), and assuming the value after the colon is always three characters long).
Could you possibly use something along these lines (untested...)?
private boolean isId(String id)
{
char[] chars = id.toCharArray();
boolean valid = false;
int length = chars.length;
if (length >= 5 && chars[length - 4] == ':')
{
char fifthToLast = chars[length - 5];
if (fifthToLast == 'A' || fifthToLast == 'C')
{
valid = true;
for (int i = length - 1; i > length - 4; i--) // check only the three characters after the colon
{
if (chars[i] == ':')
{
valid = false;
break;
}
}
}
}
return valid;
}