Split String While Ignoring Escaped Character - java

I want to split a string on spaces, ignoring spaces that are inside single quotes, and ignoring single quotes that are escaped (i.e., \').
I have the following, adapted from another question.
String s = "Some message I want to split 'but keeping this a\'s a single string' Voila!";
for (String a : s.split(" (?=([^\']*\'[^\"]*\')*[^\']*$)")) {
    System.out.println(a);
}
The output of the above code is
Some
message
I
want
to
split
'but
keeping
this
'a's a single string'
Voila!
However, I need single quotes to be ignored if they are escaped (\'), which the above does not do. I also need the first and last single quotes removed, and the backslashes removed if and only if they are escaping a single quote (so that 'this is a \'string' would become this is a 'string). I have no idea how to use regex. How would I accomplish this?

You need to use a negative lookbehind to take care of escaped single quotes:
String str =
    "Some message I want to split 'but keeping this a\\'s a single string' Voila!";
String[] toks = str.split(" +(?=((.*?(?<!\\\\)'){2})*[^']*$)");
for (String tok : toks)
    System.out.printf("<%s>%n", tok);
output:
<Some>
<message>
<I>
<want>
<to>
<split>
<'but keeping this a\'s a single string'>
<Voila!>
PS: As you noted, the escaped single quote must be typed as \\' in the string literal; otherwise Java treats \' as a plain '.

Or you could use this pattern to capture what you want (written as a raw regex; double each backslash inside a Java string literal):
('(?:[^']|(?<=\\)')*'|\S+)

I was really overthinking this one.
This should work, and the best part is that it doesn't use lookarounds at all (so it works in nearly every regex implementation, most famously JavaScript):
('[^']*?(?:\\'[^']*?)*'|[^\s]+)
Instead of using a split, use a match to build an array with this regex.
My objectives were:
It can distinguish an escaped apostrophe from an unescaped one (of course).
It's fast. The behemoth I wrote before actually took time.
It works with multiple subquotes; a lot of the suggestions here don't.
Test String: Discerning between 'the single quote\'s double purpose' as a 'quote marker', like ", and a 'a cotraction\'s marker.'.
If you asked the author and he was speaking in the third person, he would say 'CFQueryParam\'s example is contrived, and he knew that but he had the world\'s most difficult time thinking up an example.'
Some message I want to split 'but keeping this a\'s a single string' Voila!
Result: Discerning, between, 'the single quote\'s double purpose', as, a, 'quote marker',,, like, ",, and, a, 'a cotraction\'s marker.',.,
If, you, asked, the, author, and, he, was, speaking, in, the, third, person,, he, would, say, 'CFQueryParam\'s example is contrived, and he knew that but he had the world\'s most difficult time thinking up an example.',
Some, message, I, want, to, split, 'but keeping this a\'s a single string', Voila!
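For the Java case in the question, the "match instead of split" idea could look roughly like the sketch below. It uses the same pattern as above (with the backslashes doubled for a Java string literal) and then strips the outer quotes and unescapes \' as the question asked. The class and method names (QuoteAwareTokenizer, tokenize) are made up for illustration, and the sketch assumes the quotes in the input are balanced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAwareTokenizer {
    // Either a single-quoted chunk that may contain \' or a run of non-whitespace.
    private static final Pattern TOKEN =
            Pattern.compile("'[^']*?(?:\\\\'[^']*?)*'|[^\\s]+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String tok = m.group();
            // For quoted tokens, drop the outer quotes and turn \' back into '.
            if (tok.length() >= 2 && tok.startsWith("'") && tok.endsWith("'")) {
                tok = tok.substring(1, tok.length() - 1).replace("\\'", "'");
            }
            tokens.add(tok);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String s = "Some message I want to split 'but keeping this a\\'s a single string' Voila!";
        for (String t : tokenize(s)) {
            System.out.println("<" + t + ">");
        }
    }
}
```

Backslashes that are not followed by a single quote are left untouched, which matches the "if and only if" requirement in the question.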

Related

Regex function to find specific depth in recursive

I have the following scenario where I am supposed to use regex (Java/PCRE) on a line of code to strip off certain function calls and keep only the value passed to them, as in the example below:
Input
ArrayNew(1) = adjustalpha(shadowcolor, CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))
Output (after the regex replace)
ArrayNew(1) = adjustalpha(shadowcolor, Me.bezierviewshadow.getTag)
Here CInt, Math.Truncate, and ObjectToNumber are removed, retaining only the innermost argument, as shown above.
The functions CInt, Math.Truncate, etc. keep changing (to CStr, Math.Random, and so on), so the regex cannot be hardcoded.
I tried a lot of options from Stack Overflow, but most did not work.
It would also be nice if the query were customizable, so that searching for CInt returns everything the CInt call wraps (find a name, then everything between its opening ( and the matching ), skipping balanced parenthesis pairs in between).
I know it's not pretty, but it's your fault to use raw regex for this :)
@Test
void unwrapCIntCall() {
    String input = "ArrayNew(1) = adjustalpha(shadowcolor, CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))";
    String expectedOutput = "ArrayNew(1) = adjustalpha(shadowcolor, Me.bezierviewshadow.getTag)";
    String output = input.replaceAll("CInt\\s*\\(\\s*Math\\.Truncate\\s*\\(\\s*ObjectToNumber\\s*\\(\\s*(.*)\\s*\\)\\s*\\)\\s*\\)", "$1");
    assertEquals(expectedOutput, output);
}
Now some explanation: the \\s* parts allow any amount of whitespace at those positions. In the pattern, I used (.*) in the middle, which means I match anything there, but it's fine*. I used (.*) instead of .* so that that section gets captured as capturing group $1 ($0 is always the whole match). With the interesting part captured, I can refer to it in the replacement string.
*As long as you don't have several such assignments within one string. Otherwise, you should break the string up into parts that each contain only one such assignment and apply the replacement to each part. Or try (.*?) instead of (.*); it compiles for me, and AFAIK it makes the .* match as few characters as possible.
If the methods actually being called vary, replace their names in the regex with the alternatives you expect: for example, replace CInt with (?:CInt|CStr), and Math\\.Truncate with Math\\.(?:Truncate|Random), etc. (Using (?: instead of ( makes the group non-capturing, so it won't take up a $1, $2, etc. slot.)
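Here is a small sketch of that variation with non-capturing alternations. The class name UnwrapVariants and the particular alternatives (CInt|CStr, Truncate|Random) are only examples taken from the question; I also used ([^()]*) for the captured part so the match cannot run past the three closing parentheses.

```java
class UnwrapVariants {
    // (?:...) groups alternatives without capturing, so the payload stays in $1.
    private static final String PATTERN =
            "(?:CInt|CStr)\\s*\\(\\s*Math\\.(?:Truncate|Random)\\s*\\(\\s*"
            + "ObjectToNumber\\s*\\(\\s*([^()]*)\\s*\\)\\s*\\)\\s*\\)";

    static String unwrap(String input) {
        return input.replaceAll(PATTERN, "$1");
    }

    public static void main(String[] args) {
        System.out.println(unwrap(
                "ArrayNew(1) = adjustalpha(shadowcolor, CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))"));
        System.out.println(unwrap("x = f(a, CStr(Math.Random(ObjectToNumber(b.c))))"));
    }
}
```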
If that gets too complicated, then you should really think about whether you want to do it with regex at all, or whether it'd be easier to just write a somewhat longer function with plain string methods, like indexOf and substring :)
Bonus: if absolutely everything varies except the call depth, then you might try this one:
String output = input.replaceAll("[\\w\\d.]+\\s*\\(\\s*[\\w\\d.]+\\s*\\(\\s*[\\w\\d.]+\\s*\\(\\s*(.*)\\s*\\)\\s*\\)\\s*\\)", "$1");
Yes, it's definitely a nightmare to read, but as far as I understand, you are after this monster :)
You can use ([^()]*) instead of (.*) to keep the capture from spanning more deeply nested expressions. Note that fine control of depth is a real weakness of everyday regular expressions.
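For the "everything between the first ( and its matching )" part of the question, the plain-string route mentioned above might look like this. It is only a sketch: argumentOf and BalancedArgument are hypothetical names, the search uses a simple indexOf (so it would also hit the name inside a longer identifier), and parentheses inside string literals are not treated specially.

```java
class BalancedArgument {
    /**
     * Returns the text between the parentheses of the first call to name,
     * or null if the name is absent or the parentheses are unbalanced.
     */
    static String argumentOf(String input, String name) {
        int start = input.indexOf(name);
        if (start < 0) {
            return null;
        }
        int open = input.indexOf('(', start + name.length());
        if (open < 0) {
            return null;
        }
        int depth = 0;
        for (int i = open; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '(') {
                depth++;                  // entering a nested call
            } else if (c == ')' && --depth == 0) {
                return input.substring(open + 1, i); // matching close found
            }
        }
        return null; // ran out of input before the parentheses balanced
    }

    public static void main(String[] args) {
        String line = "ArrayNew(1) = adjustalpha(shadowcolor, CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))";
        System.out.println(argumentOf(line, "CInt"));
        System.out.println(argumentOf(line, "ObjectToNumber"));
    }
}
```

Counting depth by hand gives the control over nesting that the regex versions lack.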

How to remove this type of quotes “” instead of this type of quote "" in Java?

To remove quotation marks in Java, I understand I can use
replaceAll("\"", "");
Ex: "Hello World" becomes Hello World.
However, it only removes this type of quotation marks "". Is there a way to remove quotes like this “Hello World” ?
If you simply want to remove those 3 kinds of double-quotes, irrespective of the context:
replaceAll("[\"“”]", "");
If there are other kinds of quote characters that you want to remove, just add them before the ].
These pages list some of the other quote characters that you might encounter:
https://unicode-table.com/en/sets/quotation-marks/
https://en.wikipedia.org/wiki/Quotation_mark
And also see:
Is there a regex to grab all quotation marks?
which talks about the difficulty in creating a regex to match all of them in a future-proof fashion.
Note that since we are including some "funky" characters (non-ASCII) in the source code (above), it is important that the Java compiler is aware of the character encoding that the source code uses. We could avoid that by using Unicode escapes instead. For example:
replaceAll("[\"\u201c\u201d]", "");
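As a complete, runnable sketch of the Unicode-escape variant (the class and method names are just placeholders), assuming only the straight quote and the two curly double quotes U+201C and U+201D need to go:

```java
class QuoteStripper {
    // Removes ", \u201c (left curly) and \u201d (right curly) wherever they appear.
    static String stripDoubleQuotes(String s) {
        return s.replaceAll("[\"\u201c\u201d]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripDoubleQuotes("\u201cHello World\u201d"));
        System.out.println(stripDoubleQuotes("\"Hello World\""));
    }
}
```

Because the source contains only ASCII, it compiles the same way regardless of the encoding the compiler assumes.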
You may try a regex replacement here, e.g.
String input = "“Hello World”";
System.out.println(input.replaceAll("“(.*?)”", "$1")); // prints Hello World

java regular expression exclusion list pattern

I understand that [^abc] will match anything other than a, b, and c. What if I want it to match anything other than a ".."? So far the exclusion list I have is:
[^<>:\"/\|?*]+
I want to add ".." to this exclusion list as well. So in English it would be "if it's anything other than the left bracket, right bracket, double quote, asterisk, double dot (".."), or the rest of the characters here, then it should match".
The test case I need to pass is:
foo/../baz needs to be /baz
bar/../../foo needs to be /../foo
Not a Java expert, but it looks like you have a negated character class defined there. A character class is basically a list of characters in that class, or in your case, not in that class, and it is matched against one character of the string at a time.
It seems that you're most likely after a match for the string "..". If so, I think you just need a specific regex for it. Maybe this would do the trick:
\.\.
A dot "." by itself of course matches any single character, so the backslash escapes are needed to match an actual string of two dots instead of any two characters.
Ok I have been playing around for a bit and this will prevent a match if the .. string is present:
^(?:(?!(\.\.)).)*$
I'm going to carry on, but you might consider simply running two separate regexes and making sure neither matches.
If you are not particularly interested in the delimiters themselves, you could use this:
String source = "aa^bb<cc>dd:ee\"ff/gg\\hh|ii?jj*kk:ll/mm..mn";
String regex = "[<>:\"/\\\\|?*^]+|[.]{2}";
String[] splits = source.split(regex);
System.out.println(Arrays.toString(splits));
Output
[aa, bb, cc, dd, ee, ff, gg, hh, ii, jj, kk, ll, mm, mn]
If Java can do lookahead assertions, one way is this:
(?:(?!\.\.)[^<>:"/|?*])+ untested
Edit: the above will match up to the first ..
It's not clear what you are trying to do, but to validate the entire string against these conditions, simply add the ^ and $ anchors:
^(?:(?!\.\.)[^<>:"/|?*])+$ untested
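Java does support lookahead, so the anchored pattern can be tried directly. A minimal sketch, assuming the goal is to validate a whole string (the names NoDoubleDot and isValid are mine, not from the question):

```java
import java.util.regex.Pattern;

class NoDoubleDot {
    // Each character must not be in the excluded set and must not start a "..".
    private static final Pattern VALID =
            Pattern.compile("^(?:(?!\\.\\.)[^<>:\"/|?*])+$");

    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("foo.bar"));  // single dots are allowed
        System.out.println(isValid("foo..bar")); // contains ".."
        System.out.println(isValid("foo<bar"));  // contains an excluded character
    }
}
```

The negative lookahead is checked before every character, which is what lets a character class reject a two-character sequence.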
@xonegirlz - The way to exclude .. depends on what you're trying to do. Your "test case I need to pass" is some help, but the statement "needs to be" is vague.
It would be helpful if you could state it like
"I'm writing a function X that, given a String like Y,
would return a String like Z.
I'm trying to use a regex-replace to find ______ in Y
and replace all except _______ to return Z".
Your current question asks only about negated character classes, and the answer to that is: you can't exclude a sequence like ".." that way.

Parsing quoted text in java

Is there an easy way to parse quoted text into strings in Java? I have lines like these to parse:
author="Tolkien, J.R.R." title="The Lord of the Rings"
publisher="George Allen & Unwin" year=1954
and all I want is Tolkien, J.R.R., The Lord of the Rings, George Allen & Unwin, and 1954 as strings.
You could use a regex like
"(.+?)"
It matches the text between a pair of quotes; the lazy quantifier +? stops at the nearest closing quote, so several quoted values on one line are matched separately. In Java that would be:
Pattern p = Pattern.compile("\"(.+?)\"");
Matcher m = p.matcher("author=\"Tolkien, J.R.R.\"");
while (m.find()) {
    System.out.println(m.group(1));
}
Note that group(1) is used: it is the first capturing group, i.e. the text between the quotes. group(0) is the whole match, including the quotes.
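To also pick up unquoted values such as year=1954, the same find loop can be run with a slightly wider pattern. This is a sketch under the assumption that every item has the form key="value" or key=value with no escaped quotes inside; QuotedValues and values are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedValues {
    // key="quoted value" (group 1) or key=bareValue (group 2)
    private static final Pattern PAIR =
            Pattern.compile("\\w+=(?:\"([^\"]*)\"|(\\S+))");

    static List<String> values(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            // Exactly one of the two groups participates in each match.
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "author=\"Tolkien, J.R.R.\" title=\"The Lord of the Rings\" "
                + "publisher=\"George Allen & Unwin\" year=1954";
        for (String v : values(line)) {
            System.out.println(v);
        }
    }
}
```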
Of course, you could also use substring to select everything except the first and last characters:
String quoted = "\"Tolkien, J.R.R.\"";
String unquoted;
if (quoted.length() >= 2 && quoted.indexOf("\"") == 0 && quoted.lastIndexOf("\"") == quoted.length() - 1) {
    unquoted = quoted.substring(1, quoted.length() - 1);
} else {
    unquoted = quoted;
}
There are some fancy pattern regex nonsense things that fancy people and fancy programmers like to use.
I like to use String.split(). It's a simple function and does what you need it to do.
So if I have the string word: "hello" and I want to extract hello, I can simply do this:
myStr = string.split("\"")[1];
This will cut the string into bits based on the quote marks.
If I want to be more specific, I can do
myStr = string.split("word: \"")[1].split("\"")[0];
That way I cut it first at word: " and then at the next ".
Of course, you run into problems if word: " appears more than once, which is what patterns are for. I don't think you'll have to deal with that problem for your specific question.
Also, be cautious around characters like . because split takes a regex, so regex metacharacters will trigger funny behavior. Prefixing the character with \\ in the Java string literal (which yields a single \ in the regex) escapes it, and Pattern.quote(...) escapes an entire literal delimiter.
Best of luck!
Can you presume your document is well-formed and does not contain syntax errors? If so, you are simply interested in every other token after using String.split().
If you need something more robust, you may need to use the Scanner class (or a StringBuffer and a for loop ;-)) to pick out the valid tokens, taking into account additional criteria beyond "I saw a quotation mark somewhere".
For example, here are some reasons you might need a more robust solution than splitting the string blindly on quotation marks. Perhaps it's only a valid token if the quotation mark starting it comes immediately after an equals sign. Perhaps you need to handle unquoted values as well as quoted ones. Will \" need to be handled as an escaped quotation mark, or does it count as the end of the string? Can values use either single or double quotes (e.g. HTML), or will they always be correctly formatted with double quotes?
One robust way would be to think like a compiler and use a Java based Lexer (such as JFlex), but that might be overkill for what you need.
If you prefer a low-level approach, you could iterate through your input stream character by character using a while loop, and when you see an =" start copying the characters into a StringBuffer until you find another non-escaped ", either concatenating to the various wanted parsed values or adding them to a List of some sort (depending on what you plan to do with your data). Then continue reading until you encounter your start token (eg: =") again, and repeat.
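A rough sketch of that character-by-character loop is below. It assumes a value starts at the quote in =" and ends at the next quote that is not preceded by a backslash, collects the values in a List, and uses StringBuilder in place of StringBuffer since no synchronization is needed. QuoteScanner and quotedValues are made-up names, and unquoted values are simply skipped.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteScanner {
    static List<String> quotedValues(String input) {
        List<String> values = new ArrayList<>();
        StringBuilder current = null; // non-null while inside a quoted value
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (current == null) {
                // Start token: a quote that comes right after an equals sign.
                if (c == '"' && i > 0 && input.charAt(i - 1) == '=') {
                    current = new StringBuilder();
                }
            } else if (c == '\\' && i + 1 < input.length() && input.charAt(i + 1) == '"') {
                current.append('"'); // escaped quote is part of the value
                i++;                 // skip the quote we just consumed
            } else if (c == '"') {
                values.add(current.toString()); // unescaped quote ends the value
                current = null;
            } else {
                current.append(c);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String line = "author=\"Tolkien, J.R.R.\" title=\"The \\\"Lord\\\" of the Rings\" year=1954";
        for (String v : quotedValues(line)) {
            System.out.println(v);
        }
    }
}
```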

Regular expression removing all words shorter than n

Well, I'm looking for a regexp in Java that deletes all words shorter than 3 characters.
I thought something like \s\w{1,2}\s would grab all the 1 and 2 letter words (a whitespace, one to two word characters and another whitespace), but it just doesn't work.
Where am I wrong?
I've got it working fairly well, but it took two passes.
public static void main(String[] args) {
    String passage = "Well, I'm looking for a regexp in Java that deletes all words shorter than 3 characters.";
    System.out.println(passage);
    passage = passage.replaceAll("\\b[\\w']{1,2}\\b", "");
    passage = passage.replaceAll("\\s{2,}", " ");
    System.out.println(passage);
}
The first pass replaces all words containing fewer than three characters with an empty string. Note that I had to include the apostrophe in the character class because the word "I'm" was giving me trouble without it. You may find other special characters in your text that you also need to include here.
The second pass is necessary because the first pass left a few spots where there were double spaces. This just collapses all occurrences of 2 or more spaces down to one. It's up to you whether you need to keep this or not, but I think it's better with the spaces collapsed.
Output:
Well, I'm looking for a regexp in Java that deletes all words shorter than 3 characters.
Well, looking for regexp Java that deletes all words shorter than characters.
If you don't want the whitespace matched, you might want to use
\b\w{1,2}\b
to get the word boundaries.
That's working for me in RegexBuddy using the Java flavor; for the test string
"The dog is fun a cat"
it highlights "is" and "a". Similarly for words at the beginning/end of a line.
You might want to post a code sample.
(And, as GameFreak just posted, you'll still end up with double spaces.)
EDIT:
\b\w{1,2}\b\s?
is another option. This will partially fix the space-stripping issue, although words at the end of a string or followed by punctuation can still cause issues. For example, "A dog is fun no?" becomes "dog fun ?" In any case, you're still going to have issues with capitalization (dog should now be Dog).
Try: \b\w{1,2}\b although you will still have to get rid of the double spaces that will show up.
If you have a string like this:
hello there my this is a short word
This regex will match all words in the string greater than or equal to 3 characters in length:
\w{3,}
Resulting in:
hello there this short word
That, to me, is the easiest approach. Why try to match what you don't want, when you can match what you want a lot easier? No double spaces, no leftovers, and the punctuation is under your control. The other approaches break on multiple spaces and aren't very robust.
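In Java, the "match what you want" approach could be sketched like this, joining the kept words with single spaces. The class and method names are placeholders, and punctuation is dropped because \w does not match it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongWords {
    // Keeps only runs of at least minLength word characters.
    static String keepLongWords(String text, int minLength) {
        Pattern p = Pattern.compile("\\w{" + minLength + ",}");
        Matcher m = p.matcher(text);
        List<String> words = new ArrayList<>();
        while (m.find()) {
            words.add(m.group());
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        System.out.println(keepLongWords("hello there my this is a short word", 3));
    }
}
```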
