Java/Kotlin: Tokenize a string ignoring the contents of nested quotes

I would like to split a string on spaces but keep the spaces inside quotes (and the quotes themselves). The catch is that the quotes can be nested, and I need this to work for both single and double quotes. So, from the line this "'"is a possible option"'" and ""so is this"" and '''this one too''' and even ""mismatched quotes" I would like to get [this, "'"is a possible option"'", and, ""so is this"", and, '''this one too''', and, even, ""mismatched quotes"].
Similar questions have been asked before, but none matches mine exactly. The existing solutions fall short: one uses a matcher (which splits """x""" into [""", x"""], so it is not what I need), and another uses Apache Commons (which handles """x""" but not ""x"", since it consumes the first two double quotes and leaves the last two attached to x). There are also suggestions to write a function by hand, but that would be a last resort.

You can achieve that with the following regex: ["']+[^"']+?["']+. Using that pattern you retrieve the indices where you want to split like this:
val indices = Regex(pattern).findAll(this).map{ listOf(it.range.start, it.range.endInclusive) }.flatten().toMutableList()
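The rest is building the list out of substrings. Here is the complete function:
// Note: any text after the last match is not returned.
fun String.splitByPattern(pattern: String): List<String> {
    // start and inclusive end index of every match, flattened into one list
    val indices = Regex(pattern).findAll(this).map { listOf(it.range.start, it.range.endInclusive) }.flatten().toMutableList()
    var lastIndex = 0
    return indices.mapIndexed { i, ele ->
        // even entries are match starts (cut before the match),
        // odd entries are inclusive match ends (cut just after the match)
        val end = if (i % 2 == 0) ele else ele + 1
        substring(lastIndex, end).apply {
            lastIndex = end
        }
    }
}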
Usage:
val str = """
this "'"is a possible option"'" and ""so is this"" and '''this one too''' and even ""mismatched quotes"
""".trim()
println(str.splitByPattern("""["']+[^"']+?["']+"""))
Output:
[this , "'"is a possible option"'", and , ""so is this"", and , '''this one too''', and even , ""mismatched quotes"]
Try it out on Kotlin's playground!
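If the goal is the exact list from the question (with plain words also split on whitespace), one possible alternative is to match tokens instead of computing split indices. The sketch below is in Java and assumes that quoted groups are separated from neighbouring words by whitespace. QuoteTokenizer and tokenize are hypothetical names, and the pattern is the one above extended with a \S+ alternative for unquoted words:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the answer above.
final class QuoteTokenizer {
    // Either a run of quotes, some non-quote text, and a closing run of quotes,
    // or (if no quote starts here) a run of non-whitespace characters.
    private static final Pattern TOKEN =
            Pattern.compile("[\"']+[^\"']+?[\"']+|\\S+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group()); // whitespace between tokens is skipped by find()
        }
        return tokens;
    }
}
```

Because the quoted alternative is tried first, a token that starts with a quote keeps its inner spaces, while everything else falls through to the plain-word alternative.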

Related

Android String.split("") returning extra element

I am trying to split a word into its individual letters.
I tried both String.split("") and String.split("|"); however, when I split a word, it creates an extra empty element.
Example:
String word = "word";
int n = word.length();
Log.i("20",Integer.toString(n));
String[] letters = word.split("|");
Log.i("25",Integer.toString(letters.length));
The output in the Android Monitor is:
07-21 15:50:23.084 5711-5711/com.strizhevskiy.movetester I/20: 4
07-21 15:50:23.085 5711-5711/com.strizhevskiy.movetester I/25: 5
I put the individual letters into TextView blocks and I can actually see an extra empty TextView.
When I test these methods in regular desktop Java, the output is the expected answer: 4.
I am almost tempted to think this is an actual bug in Android's implementation of the method.

I think you want something like this, rather than splitting a word that has no delimiters:
public Character[] toCharacterArray(String s) {
    if (s == null) {
        return null;
    }
    int len = s.length();
    Character[] array = new Character[len];
    for (int i = 0; i < len; i++) {
        array[i] = s.charAt(i); // autoboxing; the Character constructor is deprecated
    }
    return array;
}
I hope this helps!

It's hard to say whether it's a bug or expected behavior, because what you are doing doesn't make sense. You are splitting the string on a logical OR (split expects a regular expression, not a plain string), so the result on Android may differ from the result in regular Java, and I don't see any issue with that.
Anyway, there are many ordinary ways to achieve what you want, e.g. iterating over the word character by character in a loop, or simply using String's toCharArray method.

Thank you for the suggestions. My current work-around is to split into a mock array and copy it into a fresh array using System.arraycopy(), skipping the first element.
String[] mockLetters = word.split("");
int n = word.length();
String[] letters = new String[n];
System.arraycopy(mockLetters,1,letters,0,n);
I appreciate the suggestions to use toCharArray(). However, these letters then get put into TextViews, and TextView doesn't seem to accept char. I could, of course, make it work, but I've decided to stick with what I currently have.
Tom, in a comment to my question, answered my underlying issue:
Why does String.split() work differently in Android than it does in Java?
Apparently the rules for String.split() changed with Java 8.

Try passing 0 as the limit, per the documentation below, so that trailing empty strings are discarded.
String[] split(String regex, int limit)
If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
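For reference, the Java 8 change mentioned above concerns a zero-width match at the start of the input: it no longer produces a leading empty string, so "word".split("") has four elements on Java 8 and later, while older runtimes (such as the Android version in the question) return five. A version-independent sketch that avoids split altogether and still yields Strings for the TextViews might look like this. Letters and toLetters are hypothetical names:

```java
// Hypothetical helper: one single-character String per char of the word.
final class Letters {
    static String[] toLetters(String word) {
        String[] letters = new String[word.length()];
        for (int i = 0; i < word.length(); i++) {
            // String.valueOf(char) gives a String a TextView can display
            letters[i] = String.valueOf(word.charAt(i));
        }
        return letters;
    }
}
```

This works per UTF-16 char, which is fine for plain letters; characters outside the Basic Multilingual Plane would need code-point handling instead.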

java regex: is it possible to find number of captures in a match without looping

My code looks like this, and it works fine for finding all the numbers in the matrix, but it seems overly complicated to me.
String attr = "matrix(1 0 0 1 22.51 35)";
Pattern nums = Pattern.compile("(-*\\d+(?:\\.\\d+)*)");
Matcher m2 = nums.matcher(attr);
while (m2.find()) {
    Log.i(logTag, "s = " + m2.group(0));
}
I would like to allocate an array and then assign values to it, so I could do something like:
Pattern nums = Pattern.compile("(-*\\d+(?:\\.\\d+)*)");
Matcher m2 = nums.matcher(attr);
String[] matches = new String[number_of_matches];
int index = 0;
while (m2.find()) {
    matches[index++] = m2.group(0);
}
Is this possible? I've searched for several hours and can't find anything like this in native Java. I've found several pieces of example code that implement this functionality, but it's something I'd expect to find in the regex library.
In Perl my code would look like:
$s = "matrix(1 0 0 1 22.51 35)";
@x = ($s =~ m{(\d+(?:\.\d+)*)}g);
x @x
0 1
1 0
2 0
3 1
4 22.51
5 35

No, there's no method for that functionality.
If it existed, it would do exactly what you describe: go over all matches and count them (and then reset to the beginning of the string).
If you really need to know the count up front, you can write the counting function yourself.
But I would suggest using a List (such as an ArrayList) instead of an array; then you don't need to know the number of matches up front, and the List interface is generally much more convenient to use than an array.
(Your Perl code also returns a variable-size list rather than a fixed-size array, if my rusty Perl knowledge is not mistaken.)

You should be using the List interface for these kinds of operations anyway.
Pattern nums = Pattern.compile("(-*\\d+(?:\\.\\d+)*)");
Matcher m2 = nums.matcher(attr);
List<String> matches = new ArrayList<String>();
while (m2.find()) {
    matches.add(m2.group(0)); // group(0) is the whole match
}

You can't do it. With find() you start scanning a string that you don't know much about beforehand; you learn more only once a match succeeds. From the docs:
If the match succeeds then more information can be obtained via the start, end, and group methods, and subsequent invocations of the find() method will start at the first character not matched by this match.
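Putting the answers together, a minimal sketch could collect every match into a List and convert to an array only at the end, so the number of matches never has to be known up front. MatchCollector, findAll and findAllAsArray are hypothetical names; the pattern is the one from the question. On Java 9 and later, Matcher.results() offers a stream-based alternative to the loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built around the pattern from the question.
final class MatchCollector {
    private static final Pattern NUMBER = Pattern.compile("-*\\d+(?:\\.\\d+)*");

    // Collects every match; the list grows as needed.
    static List<String> findAll(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = NUMBER.matcher(input);
        while (m.find()) {
            matches.add(m.group()); // the whole match, same as group(0)
        }
        return matches;
    }

    // Converts to an array once the final size is known.
    static String[] findAllAsArray(String input) {
        return findAll(input).toArray(new String[0]);
    }
}
```

For the input "matrix(1 0 0 1 22.51 35)" this yields the six values 1, 0, 0, 1, 22.51 and 35.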

Scala HashMap throwing key not found exception

I am very new to Scala and would appreciate any help (I have looked everywhere and spent the last 8 hours trying to figure this out).
Currently I have
def apply(file: String): Iterator[String] = {
  scala.io.Source.fromFile(file).getLines().map(_.toLowerCase)
}
as well as
def groupFreq[A, B](xs: Iterator[A], f: A => B): HashMap[B, Int] = {
  var freqMap = new HashMap[B, Int]
  for (x <- xs) freqMap = freqMap + (f(x) -> (freqMap.getOrElse(f(x), 0) + 1))
  freqMap
}
apply just takes a file of words that we pass in.
groupFreq takes xs: Iterator[A] and a grouping function f that converts A values to their B groups.
The function returns a HashMap that, for each B group, counts the number of A values that fell into the group.
I use both of these functions in charFreq, which returns a HashMap that counts how many times each Char appears throughout the entire file. If a char does not appear anywhere in the file, there should be no mapping for it.
def charFreq(file: String): HashMap[Char, Int] = {
  var it = Iterator[Char]()
  val words = apply(file)
  for {
    xs <- words
  } yield { it = it ++ xs.toIterator }
  val chars = it
  val grouper = (x: Char) => x
  groupFreq(chars, grouper)
}
My solution compiles, and apply and groupFreq work as intended, but when I run charFreq, it says
charFreq threw an exception: java.util.NoSuchElementException: key not found: d
I believe I'm doing something wrong, most likely with my for loop and yield, but I've gone through the logic many times and I don't get why it doesn't work.
Google and Stack Overflow have recommended flatMap, but I couldn't get that to work either.
Any help would be appreciated. Keep in mind this is a class assignment with the skeleton methods set up, so I cannot change the signatures of apply, groupFreq and charFreq; I can only change the bodies, which I have tried to do.

I can't reproduce your error with some random text files of strings. I suspect it occurred in an earlier iteration of groupFreq() without a getOrElse()-style check.
However, when I run your code, I end up with an empty map from the call to charFreq(). You're correct that the loop/yield in charFreq() is problematic. It's easier to see when you put a val l = in front of the for and check the value in an IDE, which shows that l is of type Iterator[Unit]. Because an Iterator is lazy and l is never consumed, the body of the yield never runs, so it stays empty.
You don't need vars for the for loop. A Scala for is not a C-style for loop; it is equivalent to calling flatMap/map over its elements (though others can express this much better than I). The yield collects the value of its body for each element into the result.
Here are two ways to get an Iterator[Char] for your call to groupFreq():
1> Remove the unnecessary var it and fill chars directly with a for comprehension:
val chars = for {
  xs <- words
  i <- xs.toIterator
} yield i
2> Call flatMap directly on the words val:
val chars = words.flatMap(s => s)

A. Regarding your problem, there are a couple of issues I can spot in the code:
The way you build up an iterator (in charFreq) seems too heavy. words.toIterator would suffice.
The way you update the map also seems strange to me. With a mutable map, I would rather do:
val mapped = f(x)
if (!(freqMap contains mapped)) freqMap(mapped) = 0
freqMap(mapped) += 1
B. As far as I understand, this problem can be solved with a one-liner (which is why Scala is so cool of course ;-) )
def charFreq(file: String) =
  file.toCharArray.groupBy(m => m).map(m => (m._1, m._2.size))
Explanation:
1) toCharArray converts your string into an array of Char elements.
2) groupBy(m => m) groups together all elements with the same value; the result is of type Map[Char,Array[Char]], where every char is mapped to the array of all occurrences of that char in your string.
3) Now all we need is to map each entry of the Map[Char,Array[Char]] to an entry of a Map[Char,Int] by using map(m => (m._1, m._2.size)), which takes every element (key -> value), leaves the key intact and transforms the value (an array) into the size of that array.
4) If your input string is going to be very large (I haven't evaluated that, but if it's in the ballpark of MB I'd start to worry), then I would probably use another solution, with a mutable map that I'd fill up while iterating over the source:
def charFreq(hugeFile: String) = {
  // create a mutable map, which can be updated when needed
  val mm = scala.collection.mutable.Map[Char, Int]()
  // iterate over the string
  for (m <- hugeFile) {
    // ensure that our map contains an entry for the given character
    if (!(mm contains m)) mm(m) = 0
    mm(m) = mm(m) + 1
  }
  // return the result as an immutable map
  mm.toMap
}
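For readers following along in Java, the language of the other threads on this page, the same "default to 0, then add 1" counting can be sketched with Map.merge. This is only an illustration of the idea and does not match the assignment's Scala signatures. CharFreq and count are hypothetical names:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical Java counterpart of the mutable-map approach above.
final class CharFreq {
    static Map<Character, Integer> count(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) {
            // insert 1 for a new key, otherwise add 1 to the existing count
            freq.merge(c, 1, Integer::sum);
        }
        return freq;
    }
}
```

Characters that never occur simply have no entry, which matches the requirement in the question.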

Replace all occurrences of a sub-string in a larger string with the value in an array

I want to replace i# (look at the example inputs/outputs) with the corresponding value from an array, but I'm not sure how to do this with Java and regex.
Assume you have an array with: [3,2,1,0]
Example inputs:
i0
i1^2
(i1+2)+5
2*5+i1
i1+i2-i3
1+2
Example output:
3 [why? input is i0 and index 0 = 3 in the array]
2^2
(2+2)+5
2*5+2
2+1-0
1+2
Regex is here:
http://rubular.com/r/KXbCQnbs8K
REGEX = i{1}(\d+)
Code:
private String replace(String input) {
    StringBuffer s = new StringBuffer();
    Pattern regex = Pattern.compile(REGEX);
    Matcher m = regex.matcher(input);
    if (!m.find(0)) {
        return input;
    } else {
        m.reset();
    }
    while (m.find()) {
        m.appendReplacement(s, getRealValue(m.group(1)));
    }
    return s.toString();
}
private String getRealValue(String val) {
    int value = Integer.parseInt(val);
    return String.valueOf(array.get(value));
}
Assume i#s given are always valid. My code works for some cases, but fails in most. Any help? Thanks!
EDIT:
I'm not sure how to tell it to add the last part (for example: +5 in i0+5).
i0 -- works
i1^2 -- doesn't work
(i1+2)+5 -- doesn't work
2*5+i1 -- works
i1+i2-i3 -- doesn't work
1+2 -- works
1+i2 -- works
I want to modify the regex to "i{1}(\d+)(.*)"
if(lastMatch()){ //if last match is true
s += m.group(2) //concat the last group (ie. "+5" in "i0+5")
}
But I don't know the correct syntax for that.

So it fails... but what about the output is wrong? That's a very useful bit of information that you've neglected to mention. In the future, try to think about why the output is wrong; this will help you figure out what the program is doing wrong.
By the looks of it, you've forgotten to call Matcher.appendTail(StringBuffer) after you've done all the replacements. appendTail appends any characters remaining after the last match, e.g. "i0[this bit]".
I assume the wrong output was
i1^2 -> 2
(i1+2)+5 -> (2
Looking at this output, it would have been much faster to figure out what was going wrong: the code forgets to add the last bit of the string at the end. From there, you can either find a way to handle it yourself or read the API to see whether a method does it in one simple step.
Example code
while (m.find()) {
    m.appendReplacement(s, getRealValue(m.group(1)));
}
m.appendTail(s); // you missed out this line
return s.toString();
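For completeness, here is a self-contained sketch of the whole method with the appendTail call in place. IndexReplacer is a hypothetical name, and the values are passed in as a parameter rather than read from the field used in the question. Matcher.quoteReplacement is an extra precaution not mentioned above; it only matters if a replacement could contain $ or \.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the question's logic.
final class IndexReplacer {
    private static final Pattern INDEX = Pattern.compile("i(\\d+)");

    // Replaces each i<n> in the input with values.get(n).
    static String replace(String input, List<Integer> values) {
        Matcher m = INDEX.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int index = Integer.parseInt(m.group(1));
            String value = String.valueOf(values.get(index));
            // copies the text before the match, then the replacement
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out); // copies whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(3, 2, 1, 0);
        System.out.println(replace("(i1+2)+5", values)); // (2+2)+5
        System.out.println(replace("i1+i2-i3", values)); // 2+1-0
    }
}
```

When the input contains no match, the loop never runs and appendTail copies the whole input, so the separate find(0)/reset() check from the question is not needed.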

Splitting string N into N/X strings

I would like some guidance on how to split a string into N separate strings based on an arithmetic operation, for example string.length()/300.
I am aware of ways to do it with delimiters, such as
testString.split(",");
but how does one use greedy/reluctant/possessive quantifiers with the split method?
Update: As requested, here is an example of what I am looking to achieve:
String X = "32028783836295C75546F7272656E745C756E742E657865000032002E002E005C0";
Resulting in X/3 (more or less... done by hand)
X[0] = 32028783836295C75546F
X[1] = 6E745C756E742E6578650
X[2] = 65000032002E002E005C0
Don't worry about explaining how to put it into the array; I have no problem with that, only with how to split using an arithmetic operation instead of a delimiter.

You could do that by splitting on (?<=\G.{5}) whereby the string aaaaabbbbbccccceeeeefff would be split into the following parts:
aaaaa
bbbbb
ccccc
eeeee
fff
The \G matches the (zero-width) position where the previous match occurred. Initially, \G starts at the beginning of the string. Note that by default the . meta char does not match line breaks, so if you want it to match every character, enable DOT-ALL: (?s)(?<=\G.{5}).
A demo:
class Main {
    public static void main(String[] args) {
        int N = 5;
        String text = "aaaaabbbbbccccceeeeefff";
        String[] tokens = text.split("(?<=\\G.{" + N + "})");
        for (String t : tokens) {
            System.out.println(t);
        }
    }
}
which can be tested online here: http://ideone.com/q6dVB
EDIT
Since you asked for documentation on regex, here are the specific tutorials for the topics the suggested regex contains:
\G, see: http://www.regular-expressions.info/continue.html
(?<=...), see: http://www.regular-expressions.info/lookaround.html
{...}, see: http://www.regular-expressions.info/repeat.html

If there's a fixed length that you want each String to be, you can use Guava's Splitter:
int length = string.length() / 300;
Iterable<String> splitStrings = Splitter.fixedLength(length).split(string);
Each String in splitStrings with the possible exception of the last will have a length of length. The last may have a length between 1 and length.
Note that unlike String.split, which first builds an ArrayList<String> and then uses toArray() on that to produce the final String[] result, Guava's Splitter is lazy and doesn't do anything with the input string when split is called. The actual splitting and returning of strings is done as you iterate through the resulting Iterable. This allows you to just iterate over the results without allocating a data structure and storing them all or to copy them into any kind of Collection you want without going through the intermediate ArrayList and String[]. Depending on what you want to do with the results, this can be considerably more efficient. It's also much more clear what you're doing than with a regex.
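To illustrate what "lazy" means here without pulling in Guava, the following plain-JDK sketch produces each chunk only when the iterator is advanced. It is not Guava's implementation; FixedLengthChunks is a hypothetical class written for this example only:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical helper, not part of Guava: yields fixed-length chunks lazily.
final class FixedLengthChunks implements Iterable<String> {
    private final String text;
    private final int length;

    FixedLengthChunks(String text, int length) {
        if (length <= 0) {
            throw new IllegalArgumentException("length must be positive");
        }
        this.text = text;
        this.length = length;
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int pos = 0; // start of the next chunk

            @Override
            public boolean hasNext() {
                return pos < text.length();
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                // the last chunk may be shorter than length
                int end = Math.min(pos + length, text.length());
                String chunk = text.substring(pos, end);
                pos = end;
                return chunk;
            }
        };
    }
}
```

Nothing is computed when the object is constructed; a for-each loop over new FixedLengthChunks(text, 5) creates one substring per iteration and never builds an intermediate list or array.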

How about plain old String.substring? On older JVMs it is memory-friendly, as it reuses the original char array; since Java 7u6, substring copies the characters instead.

Well, I think this is probably as efficient a way to do this as any other.
int n = 300; // number of parts
int sublen = testString.length() / n; // integer division; the remainder goes to the last part
String[] subs = new String[n];
for (int i = 0; i < n; i++) {
    int start = i * sublen;
    // the last part also takes any leftover characters
    int end = (i == n - 1) ? testString.length() : start + sublen;
    subs[i] = testString.substring(start, end);
}
You can do it faster if you need the items as char[] arrays rather than as individual Strings, depending on how you need to use the results, e.g. using testString.toCharArray().

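Dunno, you'll probably need a method that takes a string and an int number of parts and returns an array of strings. A sketch (requires java.util.ArrayList and java.util.List):
public static String[] splitInto(String splitString, int parts) {
    // chunk length; at least 1 so the loop always advances
    int dlength = Math.max(1, splitString.length() / parts);
    List<String> retVal = new ArrayList<String>();
    for (int i = 0; i < splitString.length(); i += dlength) {
        // clamp the end index so the last chunk may be shorter
        int end = Math.min(i + dlength, splitString.length());
        retVal.add(splitString.substring(i, end));
    }
    // if the length is not divisible by parts, more than parts chunks are returned
    return retVal.toArray(new String[0]);
}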
