Parsing quoted text in java

Is there an easy way to parse quoted text into strings in Java? I have lines like these to parse:
author="Tolkien, J.R.R." title="The Lord of the Rings"
publisher="George Allen & Unwin" year=1954
and all I want is Tolkien, J.R.R.,The Lord of the Rings,George Allen & Unwin, 1954 as strings.

You could use a regex like
"([^"]*)"
It matches any run of characters between a pair of quotes. In Java that would be:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
Pattern p = Pattern.compile("\"([^\"]*)\"");
Matcher m = p.matcher("author=\"Tolkien, J.R.R.\"");
while (m.find()) {
    System.out.println(m.group(1));
}
Note that group(1) is used: it is the first capturing group, i.e. the text between the quotes. group(0) is the full match, including the quotes.
Of course, you could also use substring to select everything except the first and last characters:
String quoted = "\"Tolkien, J.R.R.\"";
String unquoted;
if (quoted.length() >= 2 && quoted.startsWith("\"") && quoted.endsWith("\"")) {
    unquoted = quoted.substring(1, quoted.length() - 1);
} else {
    unquoted = quoted;
}
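To cover the whole input line from the question, including the unquoted year, the regex approach can be extended with an alternative for bare values. This is a minimal sketch, not the only way to do it; the class name QuotedValues, the method values and the assumption that keys consist of word characters are mine, not part of the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedValues {
    // key="quoted value"  or  key=bareValue
    // group(1) holds a quoted value (without the quotes), group(2) a bare one.
    private static final Pattern FIELD =
            Pattern.compile("\\w+=(?:\"([^\"]*)\"|(\\S+))");

    static List<String> values(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            result.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "author=\"Tolkien, J.R.R.\" title=\"The Lord of the Rings\" "
                + "publisher=\"George Allen & Unwin\" year=1954";
        // Prints the four values as a list
        System.out.println(values(line));
    }
}
```

Escaped quotes inside a value are not handled here; that would need a different pattern or a hand-written scanner.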

There are some fancy pattern regex nonsense things that fancy people and fancy programmers like to use.
I like to use String.split(). It's a simple function and does what you need it to do.
So if I have a String word: "hello" and I want to take out "hello", I can simply do this:
myStr = string.split("\"")[1];
This will cut the string into bits based on the quote marks.
If I want to be more specific, I can do
myStr = string.split("word: \"")[1].split("\"")[0];
That way I cut it with word: " and "
Of course, you run into problems if word: " is repeated twice, which is what patterns are for. I don't think you'll have to deal with that problem for your specific question.
Also, be cautious around characters like . and other regex metacharacters. split takes a regex, so those characters will trigger funny behavior. Prefixing such a character with a backslash (written "\\" in a Java string literal) escapes it, and Pattern.quote() escapes a whole string.
Best of luck!

Can you presume your document is well-formed and does not contain syntax errors? If so, you are simply interested in every other token after using String.split().
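As a rough sketch of the "every other token" idea, assuming well-formed input where quotes always come in pairs (the class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

class EveryOtherToken {
    /** Returns the tokens at odd indexes after splitting on the quote character. */
    static List<String> quoted(String line) {
        String[] parts = line.split("\"");
        List<String> values = new ArrayList<>();
        // parts[0] is the text before the first quote, parts[1] the first quoted value, and so on.
        for (int i = 1; i < parts.length; i += 2) {
            values.add(parts[i]);
        }
        return values;
    }
}
```

Unquoted values such as year=1954 are not picked up by this approach.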
If you need something more robust, you may need to use the Scanner class (or a StringBuffer and a for loop ;-)) to pick out the valid tokens, taking into account additional criteria beyond "I saw a quotation mark somewhere".
For example, some reasons you might need a more robust solution than splitting the string blindly on quotation marks: perhaps it's only a valid token if the quotation mark starting it comes immediately after an equals sign. Or perhaps you need to handle unquoted values as well as quoted ones? Will \" need to be handled as an escaped quotation mark, or does that count as the end of the string? Can it have either single or double quotes (e.g. HTML), or will it always be correctly formatted with double quotes?
One robust way would be to think like a compiler and use a Java based Lexer (such as JFlex), but that might be overkill for what you need.
If you prefer a low-level approach, you could iterate through your input stream character by character using a while loop, and when you see an =" start copying the characters into a StringBuffer until you find another non-escaped ", either concatenating to the various wanted parsed values or adding them to a List of some sort (depending on what you plan to do with your data). Then continue reading until you encounter your start token (eg: =") again, and repeat.
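The low-level loop described above could look roughly like this. It is a sketch under two assumptions that the question does not confirm: a backslash escapes the character that follows it, and a value without a closing quote is dropped. It uses StringBuilder instead of StringBuffer since no synchronization is needed; QuoteScanner is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteScanner {
    /** Collects every value that follows =" up to the next unescaped ". */
    static List<String> quotedValues(String input) {
        List<String> values = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            int start = input.indexOf("=\"", i);        // look for the start token
            if (start < 0) {
                break;
            }
            StringBuilder value = new StringBuilder();
            int j = start + 2;                          // first character after ="
            boolean closed = false;
            while (j < input.length()) {
                char c = input.charAt(j);
                if (c == '\\' && j + 1 < input.length()) {
                    value.append(input.charAt(j + 1));  // assumed rule: backslash escapes the next char
                    j += 2;
                } else if (c == '"') {
                    closed = true;                      // unescaped quote ends the value
                    j++;
                    break;
                } else {
                    value.append(c);
                    j++;
                }
            }
            if (closed) {
                values.add(value.toString());           // unterminated values are dropped
            }
            i = j;                                      // continue after this value
        }
        return values;
    }
}
```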

Related

Regex function to find specific depth in recursive

I have the following scenario where I am supposed to use regex (Java/PCRE) on a line of code to strip off certain function calls and keep only the value passed to them, as in the example below:
Input
ArrayNew(1) = adjustalpha(shadowcolor, CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))
Output : Replace Regex
ArrayNew(1) = adjustalpha(shadowcolor, Me.bezierviewshadow.getTag)
Here CInt, Math.Truncate, and ObjectToNumber are removed, giving the output shown above.
The functions CInt and Math.Truncate keep changing to CStr, Math.Random, etc., so the regex cannot be hardcoded.
I tried a lot of options from Stack Overflow, but most did not work.
It would also be nice if the query were customizable, so that CInt returns everything the CInt call wraps (find a name, then everything between its first ( and the matching ), skipping balanced parenthesis pairs in between).

I know it's not pretty, but it's your fault to use raw regex for this :)
@Test
void unwrapCIntCall() {
String input = "ArrayNew(1) = adjustalpha(shadowcolor, CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))";
String expectedOutput = "ArrayNew(1) = adjustalpha(shadowcolor, Me.bezierviewshadow.getTag)";
String output = input.replaceAll("CInt\\s*\\(\\s*Math\\.Truncate\\s*\\(\\s*ObjectToNumber\\s*\\(\\s*(.*)\\s*\\)\\s*\\)\\s*\\)", "$1");
assertEquals(expectedOutput, output);
}
Now some explanation; the \\s* parts allow any number of any whitespace character, where they are. In the pattern, I used (.*) in the middle, which means I match anything there, but it's fine*. I used (.*) instead of .* so that particular section gets captured as capturing group $1 (because $0 is always the whole match). The interesting part being captured, I can refer them in the replacement string.
*as long as you don't have multiple of such assignments within one string. Otherwise, you should break up the string into parts which contain only one such assignment and apply this replacement for each of those strings. Or, try (.*?) instead of (.*), it compiles for me - AFAIK that makes the .* match as few characters as possible.
If the methods actually being called vary, then replace their names in the regex with the variations you expect, e.g. replace CInt with (?:CInt|CStr), Math\\.Truncate with Math\\.(?:Truncate|Random), etc. (Using (?: instead of ( makes the group non-capturing, so it won't take up a $1, $2, etc. slot.)
If that gets too complicated, then you should really think about whether you want to do it with regex at all, or whether it'd be easier to just write a somewhat longer function with plain string methods, like indexOf and substring :)
Bonus; if absolutely everything varies, but the call depth, then you might try this one:
String output = input.replaceAll("[\\w\\d.]+\\s*\\(\\s*[\\w\\d.]+\\s*\\(\\s*[\\w\\d.]+\\s*\\(\\s*(.*)\\s*\\)\\s*\\)\\s*\\)", "$1");
Yes, it's definitely a nightmare to read, but as far as I understand, you are after this monster :)
You can use ([^()]*) instead of (.*) to prevent deeper nested expressions. Note, that fine control of depth is a real weakness of everyday regular expressions.
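For the plain-string alternative mentioned above (indexOf and substring), one possible sketch counts parenthesis depth to find the matching closing parenthesis, which is the part ordinary regexes struggle with. CallUnwrapper and unwrap are hypothetical names. The name is located by a simple text search, so it would also hit a longer identifier that contains it, and parentheses inside string literals are not treated specially.

```java
class CallUnwrapper {
    /** Replaces the first call name(...) in input with the text between its parentheses. */
    static String unwrap(String input, String name) {
        int nameAt = input.indexOf(name);
        if (nameAt < 0) {
            return input;                               // name not present
        }
        int open = nameAt + name.length();
        while (open < input.length() && Character.isWhitespace(input.charAt(open))) {
            open++;                                     // allow "Math.Truncate (" with spaces
        }
        if (open >= input.length() || input.charAt(open) != '(') {
            return input;                               // the name is not followed by a call
        }
        int depth = 0;
        for (int i = open; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                // i is the parenthesis that balances the one at open
                String inner = input.substring(open + 1, i).trim();
                return input.substring(0, nameAt) + inner + input.substring(i + 1);
            }
        }
        return input;                                   // unbalanced parentheses: leave unchanged
    }
}
```

Calling unwrap once each for CInt, Math.Truncate and ObjectToNumber on the input from the question produces the expected output line, and the list of names can come from configuration rather than being baked into a pattern.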

java String.replaceAll char between two numbers

I would like to replace every '-' character that sits between two numbers, or between a number and a '.', with '&'. For example:
String input = "2.1(-7-11.3)-12.1*-2.3-.11";
String output = "2.1(-7&11.3)-12.1*-2.3&.11";
I have something like this, but I am trying to find a simpler way.
public void preperString(String input) {
input=input.replaceAll(" ","");
input=input.replaceAll(",",".");
input=input.replaceAll("-","&");
input=input.replaceAll("\\(&","\\(-");
input=input.replaceAll("\\[&","\\[-");
input=input.replaceAll("\\+&","\\+-");
input=input.replaceAll("\\*&","\\*-");
input=input.replaceAll("/&","/-");
input=input.replaceAll("\\^&","\\^-");
input=input.replaceAll("&&","&-");
input=input.replaceFirst("^&","-");
for (String s : input.split("[^.\\-\\d]")) {
if (!s.equals(""))
numbers.add(Double.parseDouble(s)); // numbers: presumably a List<Double> field declared elsewhere
}
}

You can do it in one shot with capturing groups:
String input = "2.1(-7-11.3)-12.1*-2.3-.11";
input = input.replaceAll("([\\d.])-([\\d.])", "$1&$2");
Output
2.1(-7&11.3)-12.1*-2.3&.11
The pattern ([\\d.])-([\\d.]) matches a hyphen (-) that sits between two digits (\d), or between a digit and a dot (.).

Let me guess. You don't really have a use for & here; you're just trying to replace certain minus signs with & so that they won't interfere with the split that you're trying to use to find all the numbers (so that the split doesn't return "-7-11" as one of the array elements, in your original example). Is that correct?
If my guess is right, then the correct answer is: don't use split. It is the wrong tool for the job. The purpose of split is to split up a string by looking for delimiter patterns (such as a sequence of whitespace or a comma); but where the format of the elements between the delimiters doesn't much matter. In your case, though, you are looking for elements of a particular numeric format (it might start with -, and otherwise will have at least one digit and at most one period; I don't know what your exact requirements are). In this case, instead of split, the right way to do this is to create a regular expression for the pattern you want your numbers to have, and then use m.find in a loop (where m is a Matcher) to get all your numbers.
If you need to treat some - characters differently (e.g. in -7-11, where you want the second - to be an operator and not part of -11), then you can make special checks for that in your loop, and skip over the - signs that you know you want to treat as operators.
It's simpler, readers will understand what you're trying to do, and it's less error-prone because all you have to do is make sure your pattern for expressing numbers accurately reflects what you're looking for.
It's common for newer Java programmers to think regexes and split are magic tools that can solve everything. But often the result ends up being too complex (code uses overly complicated regexes, or relies on trickery like having to replace characters with & temporarily). I cannot look at your original code and convince myself that it works right. It's not worth it.
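A sketch of the Matcher.find loop described above. The number format (optional leading -, digits with an optional fraction, or a bare fraction such as .11) and the rule that a - directly after a digit or a dot is an operator are my reading of the example in the question, not confirmed requirements. NumberFinder is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberFinder {
    // Optional sign, then digits with an optional fraction, or a bare fraction like .11
    private static final Pattern NUMBER =
            Pattern.compile("-?(?:\\d+(?:\\.\\d*)?|\\.\\d+)");

    static List<Double> numbers(String input) {
        List<Double> result = new ArrayList<>();
        Matcher m = NUMBER.matcher(input);
        while (m.find()) {
            String token = m.group();
            int start = m.start();
            // Special check: a '-' directly after a digit or a dot is an operator, not a sign.
            if (token.charAt(0) == '-' && start > 0) {
                char before = input.charAt(start - 1);
                if (Character.isDigit(before) || before == '.') {
                    token = token.substring(1);
                }
            }
            result.add(Double.parseDouble(token));
        }
        return result;
    }
}
```

For the input 2.1(-7-11.3)-12.1*-2.3-.11 this yields 2.1, -7.0, 11.3, -12.1, -2.3 and 0.11, with no temporary & replacement needed.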

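You can use lookbehind and lookahead to require a digit or a dot on each side of the hyphen:
input = input.replaceAll("(?<=[\\d.])-(?=[\\d.])", "&");
Because lookarounds do not consume the neighbouring characters, this also handles chains such as 1-2-3.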

Split String While Ignoring Escaped Character

I want to split a string along spaces, ignoring spaces if they are contained inside single quotes, and ignoring single quotes if they are escaped (i.e., \' )
I have the following completed from another question.
String s = "Some message I want to split 'but keeping this a\'s a single string' Voila!";
for (String a : s.split(" (?=([^\']*\'[^\"]*\')*[^\']*$)")) {
System.out.println(a);
}
The output of the above code is
Some
message
I
want
to
split
'but
keeping
this
'a's a single string'
Voila!
However, I need single quotes to be ignored if they are escaped ( \' ), which the above does not do. I also need the first and last single quotes removed, and the backslashes removed if and only if they are escaping a single quote (so that 'this is a \'string' would become this is a 'string). I have no idea how to use regex. How would I accomplish this?

You need to use a negative lookbehind to take care of escaped single quotes:
String str =
"Some message I want to split 'but keeping this a\\'s a single string' Voila!";
String[] toks = str.split( " +(?=((.*?(?<!\\\\)'){2})*[^']*$)" );
for (String tok: toks)
System.out.printf("<%s>%n", tok);
output:
<Some>
<message>
<I>
<want>
<to>
<split>
<'but keeping this a\'s a single string'>
<Voila!>
PS: As you noted, an escaped single quote needs to be typed as \\' in a Java string literal; otherwise it will be treated as a plain '.

Or you could use this pattern to capture what you want (written as a plain regex; double each backslash in a Java string literal):
('(?:[^']|(?<=\\)')*'|\S+)

I was really overthinking this one.
This should work, and the best part is that it doesn't use lookarounds at all (so it works in nearly every regex implementation, most famously JavaScript):
('[^']*?(?:\\'[^']*?)*'|[^\s]+)
Instead of using a split, use a match to build an array with this regex.
My objectives were:
It can discern between an escaped apostrophe and an unescaped one (of course).
It's fast. The behemoth I wrote before actually took time.
It works with multiple subquotes; a lot of the suggestions here don't.
Test String: Discerning between 'the single quote\'s double purpose' as a 'quote marker', like ", and a 'a cotraction\'s marker.'.
If you asked the author and he was speaking in the third person, he would say 'CFQueryParam\'s example is contrived, and he knew that but he had the world\'s most difficult time thinking up an example.'
Some message I want to split 'but keeping this a\'s a single string' Voila!
Result: Discerning, between, 'the single quote\'s double purpose', as, a, 'quote marker',,, like, ",, and, a, 'a cotraction\'s marker.',.,
If, you, asked, the, author, and, he, was, speaking, in, the, third, person,, he, would, say, 'CFQueryParam\'s example is contrived, and he knew that but he had the world\'s most difficult time thinking up an example.',
Some, message, I, want, to, split, 'but keeping this a\'s a single string', Voila!
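In Java, the match-instead-of-split idea could be sketched as follows. The pattern is a common variant rather than the exact regex above: inside quotes it accepts either an ordinary character or a backslash followed by any character. The sketch also strips the outer quotes and unescapes \' as the question asked. QuoteAwareSplitter is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAwareSplitter {
    // Either a quoted run (where a backslash escapes the next character),
    // captured without its outer quotes in group(1), or a run of non-whitespace.
    private static final Pattern TOKEN =
            Pattern.compile("'((?:[^'\\\\]|\\\\.)*)'|\\S+");

    static List<String> tokens(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            if (m.group(1) != null) {
                // Quoted token: outer quotes are already excluded; turn \' into '
                out.add(m.group(1).replace("\\'", "'"));
            } else {
                out.add(m.group());
            }
        }
        return out;
    }
}
```

An apostrophe in the middle of an unquoted word (as in don't) stays part of that word, because the quoted alternative is only tried where a token starts.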

How to remove duplicate characters in a string using regex?

I need to replace the duplicate characters in a string. I tried using
outputString = str.replaceAll("(.)(?=.*\\1)", "");
This replaces the duplicate characters but the position of the characters changes as shown below.
input
haih
output
aih
But I need to get an output hai. That is the order of the characters that appear in the string should not change. Given below are the expected outputs for some inputs.
input
aaaassssddddd
output
asd
input
cdddddggggeeccc
output
cdge
How can this be achieved?

It seems like your code is leaving the last character, so how about this?
outputString = new StringBuilder(str).reverse().toString();
// outputString is now hiah
outputString = outputString.replaceAll("(.)(?=.*\\1)", "");
// outputString is now iah
outputString = new StringBuilder(outputString).reverse().toString();
// outputString is now hai

Overview
It's possible with Oracle's implementation, but I wouldn't recommend this answer for many reasons:
It relies on a bug in the implementation, which interprets *, + or {n,} as {0, 0x7FFFFFFF}, {1, 0x7FFFFFFF} and {n, 0x7FFFFFFF} respectively, and so allows the look-behind to contain such quantifiers. Since it relies on a bug, there is no guarantee that it will work the same way in the future.
It is an unmaintainable mess. Anyone with basic Java knowledge can read normal code, but the regex in this answer can be understood at a glance only by people who know the ins and outs of the regex implementation.
Therefore, this answer is for educational purpose, rather than something to be used in production code.
Solution
Here is the one-liner replaceAll regex solution:
String output = input.replaceAll("(.)(?=(.*))(?<=(?=\\1.*?\\1\\2$).+)","")
Printing out the regex:
(.)(?=(.*))(?<=(?=\1.*?\1\2$).+)
What we want to do is to look-behind to see whether the same character has appeared before or not. The capturing group (.) at the beginning captures the current character, and the look-behind group is there to check whether the character has appeared before. So far, so good.
However, since the backreference \1 doesn't have an obvious length, it can't appear in the look-behind directly.
This is where we make use of the bug to look-behind up to the beginning of the string, then use a look-ahead inside the look-behind to include the backreference, as you can see (?<=(?=...).+).
This is not the end of the problem, though. While the non-assertion pattern inside look-behind .+ can't advance past the position after the character in (.), the look-ahead inside can. As a simple test:
"haaaaaaaaa".replaceAll("h(?<=(?=(.*)).*)","$1")
> "aaaaaaaaaaaaaaaaaa"
To make sure that the search doesn't spill beyond the current character, I capture the rest of the string in a look-ahead (?=(.*)) and use it to "mark" the current position (?=\\1.*?\\1\\2$).
Can this be done in one replacement without using look-behind?
I think it is impossible. We need to differentiate the first appearance of a character with subsequent appearance of the same character. While we can do this for one fixed character (e.g. a), the problem requires us to do so for all characters in the string.
For your information, this is for removing all subsequent appearance of a fixed character (h is used here):
.replaceAll("^([^h]*h[^h]*)|(?!^)\\Gh+([^h]*)","$1$2")
To do this for multiple characters, we must keep track of whether the character has appeared before or not, across matches and for all characters. The regex above shows the across matches part, but the other condition kinda makes this impossible.
We obviously can't do this in a single match, since subsequent occurrences can be non-contiguous and arbitrary in number.
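For comparison, the "normal code" route mentioned in the overview might look like this. It is a sketch that works on char values, so characters outside the Basic Multilingual Plane (surrogate pairs) are not treated as single characters. Dedup is a made-up name.

```java
import java.util.HashSet;
import java.util.Set;

class Dedup {
    /** Keeps the first occurrence of each character and preserves the original order. */
    static String removeDuplicates(String str) {
        Set<Character> seen = new HashSet<>();
        StringBuilder sb = new StringBuilder(str.length());
        for (char c : str.toCharArray()) {
            if (seen.add(c)) {      // add() returns false if c was already seen
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```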

Java - Unknown characters passing as [a-zA-z0-9]*?

I'm no expert in regex but I need to parse some input I have no control over, and make sure I filter away any strings that don't have A-z and/or 0-9.
When I run this,
Pattern p = Pattern.compile("^[a-zA-Z0-9]*$"); //fixed typo
if(!p.matcher(gottenData).matches())
System.out.println(someData); //someData contains gottenData
certain spaces + an unknown symbol somehow slip through the filter (gottenData is the red rectangle):
In case you're wondering, it DOES also display Text, it's not all like that.
For now, I don't mind the [?] as long as it also contains some string along with it.
Please help.
[EDIT] As far as I can tell from the (very large) input, the [?]'s are either whitespace or nothing at all; maybe there's some sort of encoding issue, or perhaps it has something to do with #text nodes (the input is XML).

The * quantifier matches "zero or more", which means it will match a string that does not contain any of the characters in your class. Try the + quantifier, which means "One or more": ^[a-zA-Z0-9]+$ will match strings made up of alphanumeric characters only. ^.*[a-zA-Z0-9]+.*$ will match any string containing one or more alphanumeric characters, although the leading .* will make it much slower. If you use Matcher.lookingAt() instead of Matcher.matches, it will not require a full string match and you can use the regex [a-zA-Z0-9]+.

You have an error in your regex: instead of [a-zA-z0-9]* it should be [a-zA-Z0-9]*.
You don't need ^ and $ around the regex.
Matcher.matches() always matches the complete string.
String gottenData = "a ";
Pattern p = Pattern.compile("[a-zA-Z0-9]*");
if (!p.matcher(gottenData).matches())
System.out.println("doesn't match.");
this prints "doesn't match."

The correct answer is a combination of the above answers. First, I imagine your intended character class is [a-zA-Z0-9]. Note that A-z isn't as bad as you might think: it includes all characters in the ASCII range between A and z, which is the letters plus a few extras (specifically [, \, ], ^, _ and `).
A second potential problem, as Martin mentioned, is that you may need to add the start and end anchors if you want the string to consist only of letters and numbers.
Finally, you use the * quantifier, which means 0 or more. It can therefore match 0 characters, so an empty string passes the check. What you need is the + quantifier. So the pattern you are most likely looking for is:
^[a-zA-Z0-9]+$
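A small check of the points made in these answers: what + changes compared with *, and what the A-z typo lets through. AlnumCheck and isAlphanumeric are illustrative names only.

```java
import java.util.regex.Pattern;

class AlnumCheck {
    // matches() must consume the whole input, so ^ and $ are implied here.
    private static final Pattern ALNUM = Pattern.compile("[a-zA-Z0-9]+");

    static boolean isAlphanumeric(String s) {
        return ALNUM.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAlphanumeric("riz99"));                 // true
        System.out.println(isAlphanumeric(""));                      // false: + needs at least one character
        System.out.println(isAlphanumeric("a "));                    // false: the space is not in the class
        System.out.println(Pattern.matches("[a-zA-Z0-9]*", ""));     // true: * accepts the empty string
        System.out.println(Pattern.matches("[a-zA-z0-9]+", "a_b"));  // true: A-z also covers _
    }
}
```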

You have to change the regexp to "^[a-zA-Z0-9]*$" to ensure that you are matching the entire string

Looks like it should be "a-zA-Z0-9", not "a-zA-z0-9", try correcting that...

Did anyone consider adding a space to the regex: [a-zA-Z0-9 ]*? This should match any normal text made of letters, digits and spaces. If you want quotes and other special characters, add them to the regex too.
You can quickly test your regex at http://www.regexplanet.com/simple/

You can check whether the input value consists only of letters and digits by using the regex ^[a-zA-Z0-9]*$
If the value contains only letters and digits, it matches, e.g. riz99, riz99z
Otherwise it does not match, e.g. 99z., riz99.z, riz99.9
Example code (JavaScript):
if (e.target.value.match('^[a-zA-Z0-9]*$')) {
    console.log('match')
}
else {
    console.log('not match')
}