context.getText() excludes spaces in ANTLR4 - java

getText() returns the complete statement without the spaces between the words. One way to keep the spaces is to include them in the grammar. But is there any other way to get the complete String with the spaces included?

Yes, there is (assuming you are using ParserRuleContext.getText()). The idea is to ask the input char stream for a range of characters. The position values are stored in the start and stop tokens of the context.
Here's some code (converted from C++, so it might not be 100% correct):
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.ParserRuleContext;
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.misc.Interval;

String sourceTextForContext(ParserRuleContext context) {
    // The first and last tokens matched by the rule carry absolute char positions.
    Token startToken = context.getStart();
    Token stopToken = context.getStop();
    CharStream cs = startToken.getTokenSource().getInputStream();
    int stopIndex = stopToken != null ? stopToken.getStopIndex() : -1;
    return cs.getText(new Interval(startToken.getStartIndex(), stopIndex));
}
Since this retrieval function uses absolute character indexes into the input stream, the result is unaffected by any whitespace rule in the grammar, including one that skips whitespace.

Related

finding the middle index of a substring when there are duplicates in the string

I was working on a Java coding problem and encountered the following issue.
Problem:
Given a string, does "xyz" appear in the middle of the string? To define middle, we'll say that the number of chars to the left and right of the "xyz" must differ by at most one.
xyzMiddle("AAxyzBB") → true
xyzMiddle("AxyzBBB") → false
My Code:
public boolean xyzMiddle(String str) {
boolean result=false;
if(str.length()<3)result=false;
if(str.length()==3 && str.equals("xyz"))result=true;
for(int j=0;j<str.length()-3;j++){
if(str.substring(j,j+3).equals("xyz")){
String rightSide=str.substring(j+3,str.length());
int rightLength=rightSide.length();
String leftSide=str.substring(0,j);
int leftLength=leftSide.length();
int diff=Math.abs(rightLength-leftLength);
if(diff>=0 && diff<=1)result=true;
else result=false;
}
}
return result;
}
Output I am getting:
It works for most of the test cases but fails for certain edge cases involving more than one occurrence of "xyz" in the string.
Example:
xyzMiddle("xyzxyzAxyzBxyzxyz")
My present method is taking the "xyz" starting at the index 0. I understood the problem. I want a solution where the condition is using only string manipulation functions.
NOTE: I need to solve this using string manipulations like substrings. I am not considering using list, stringbuffer/builder etc. Would appreciate answers which can build up on my code.
There is no need to loop at all, because you only want to check if xyz is in the middle.
The string is of the form
prefix + "xyz" + suffix
The content of the prefix and suffix is irrelevant; the only thing that matters is they differ in length by at most 1.
Depending on the length of the string (and assuming it is at least 3):
Prefix and suffix must have the same length if the (string's length - the length of xyz) is even. In this case:
int prefixLen = (str.length()-3)/2;
result = str.substring(prefixLen, prefixLen+3).equals("xyz");
Otherwise, prefix and suffix differ in length by 1. In this case:
int minPrefixLen = (str.length()-3)/2;
int maxPrefixLen = minPrefixLen+1;
result = str.substring(minPrefixLen, minPrefixLen+3).equals("xyz") || str.substring(maxPrefixLen, maxPrefixLen+3).equals("xyz");
In fact, you don't even need the substring here. You can do it with str.regionMatches instead, and avoid creating the substrings, e.g. for the first case:
result = str.regionMatches(prefixLen, "xyz", 0, 3);
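As a minimal sketch, the two cases above can be folded into one method. The class name XyzMiddleCheck is only illustrative; the logic assumes the same prefix + "xyz" + suffix reasoning and uses regionMatches so that no substrings are created.

```java
class XyzMiddleCheck {
    static boolean xyzMiddle(String str) {
        if (str.length() < 3) {
            return false; // too short to contain "xyz"
        }
        int rest = str.length() - 3;  // characters outside the "xyz"
        int minPrefixLen = rest / 2;  // shortest prefix that still counts as "middle"
        if (str.regionMatches(minPrefixLen, "xyz", 0, 3)) {
            return true;
        }
        // When rest is odd, the prefix may also be one character longer.
        return rest % 2 == 1 && str.regionMatches(minPrefixLen + 1, "xyz", 0, 3);
    }
}
```

With this check, xyzMiddle("xyzxyzAxyzBxyzxyz") is true, because only the candidate position at index 7 is tested and the occurrences at the edges are never looked at.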
Super easy solution:
1. Use Apache StringUtils to split the string. Specifically, splitByWholeSeparatorPreserveAllTokens.
2. Think about the problem. Specifically, if the token is in the middle of the string then there must be an even number of tokens returned by the split call (see step 1 above). Zero counts as an even number here.
3. If the number of tokens is even, add the lengths of the first group (first half of the tokens) and compare it to the lengths of the second group.
4. Pay attention to details: an empty token indicates an occurrence of the token itself. You can count this as zero length, count as the length of the token, or count it as literally any number as long as you always count it as the same number.
5. if (lengthFirstHalf == lengthSecondHalf) token is in middle.
Building on your code, I left the cases str.length() < 3 and str.length() == 3 unchanged.
Taking inspiration from @Andy's answer, I considered the pattern
prefix+'xyz'+suffix
and, while looking for matches, I also check whether each one respects the middle rule as you defined it. If a match that respects the rule is found, the loop stops and the method returns true; otherwise the loop continues with the next match.
public boolean xyzMiddle(String str) {
    boolean result = false;
    if (str.length() < 3) {
        result = false;
    } else if (str.length() == 3 && str.equals("xyz")) {
        result = true;
    } else {
        int k = str.indexOf("xyz");
        while (k != -1 && !result) {
            // check if this match is in the middle
            int preLen = k;                       // characters before the match
            int sufLen = str.length() - (k + 3);  // characters after the match
            if (Math.abs(preLen - sufLen) <= 1) {
                result = true;                    // ends the while loop
            } else {
                k = str.indexOf("xyz", k + 1);    // look for the next occurrence
            }
        }
    }
    return result;
}

How to compare Chinese characters in Java using 'equals()'

I want to compare a string portion (i.e. character) against a Chinese character. I assume due to the Unicode encoding it counts as two characters, so I'm looping through the string with increments of two. Now I ran into a roadblock where I'm trying to detect the '兒' character, but equals() doesn't match it, so what am I missing ? This is the code snippet:
for (int CharIndex = 0; CharIndex < tmpChar.length(); CharIndex=CharIndex+2) {
// Account for 'r' like in dianr/huir
if (tmpChar.substring(CharIndex,CharIndex+2).equals("兒")) {
Also, feel free to suggest a more elegant way to parse this ...
[UPDATE] Some pics from the debugger, showing that it doesn't match, even though it should. I pasted the Chinese character from the spreadsheet I use as input, so I don't think it's a copy and paste issue (unless the unicode gets lost along the way)
oh, dang, apparently it does not work simply copy and pasting:
Use CharSequence.codePoints(), which returns a stream of the codepoints, rather than having to deal with chars:
tmpChar.codePoints().forEach(c -> {
if (c == '兒') {
// ...
}
});
(Of course, you could have used tmpChar.codePoints().filter(c -> c == '兒').forEach(c -> { /* ... */ })).
Either work with chars, treating 兒 as a substring:
String s = ...;
if (s.contains("兒")) { ... }
int position = s.indexOf("兒");
if (position != -1) {
int position2 = position + "兒".length();
s = s.substring(0, position) + "*" + s.substring(position2);
}
if (s.startsWith("兒", i)) {
// At position i there is a 兒.
}
Or work with code points, where 兒 is a single code point. As that is not really easier, the substring approach with a variable length seems fine.
if (tmpChar.substring(CharIndex,CharIndex+2).equals("兒")) {
is your problem. 兒 is only one UTF-16 code unit, i.e. a single Java char, so a two-char substring can never equal it. Many Chinese characters can be represented in UTF-16 in one code unit; Java uses UTF-16. However, other characters need two code units.
There are a variety of APIs on the String class for coping.
As offered in another answer, obtaining the IntStream from codepoints allows you to get a 32-bit code point for each character. You can compare that to the code point value for the character you are looking for.
Or, you can use the ICU4J library with a richer set of facilities for all of this.
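To make the difference between chars and code points concrete, here is a small sketch that walks a string one code point at a time using only String.codePointAt and Character.charCount. The class and method names are illustrative, and the code point U+5152 passed in the usage note is the character 兒 from the question.

```java
class CodePointScan {
    // Counts how often the given code point occurs in s.
    static int countCodePoint(String s, int target) {
        int count = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);     // full code point, even if it spans two chars
            if (cp == target) {
                count++;
            }
            i += Character.charCount(cp);  // advance by 1 or 2 chars as needed
        }
        return count;
    }
}
```

Calling countCodePoint(tmpChar, 0x5152) would count the occurrences of 兒 without any assumption about how many chars each character occupies.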

How to find duplicates inside a string?

I want to find out if a string that is comma separated contains only the same values:
test,asd,123,test
test,test,test
Here the 2nd string contains only the word "test". I'd like to identify these strings.
As I want to iterate over 100GB, performance matters a lot.
Which might be the fastest way of determining a boolean result if the string contains only one value repeatedly?
public static boolean stringHasOneValue(String string) {
String value = null;
for (String split : string.split(",")) {
if (value == null) {
value = split;
} else {
if (!value.equals(split)) return false;
}
}
return true;
}
No need to split the string at all, in fact no need for any string manipulation.
Find the first word (indexOf comma).
Check the remaining string length is an exact multiple of that word+the separating comma. (i.e. (length - foundLength) % (foundLength + 1) == 0)
Loop through the remainder of the string checking the found word against each portion of the string. Just keep two indexes into the same string and move them both through it. Make sure you check the commas too (i.e. bob,bob,bob matches bob,bobabob does not).
As assylias pointed out there is no need to reset the pointers, just let them run through the String and compare the 1st with 2nd, 2nd with 3rd, etc.
Example loop, you will need to tweak the exact position of startPos to point to the first character after the first comma:
for (int i=startPos;i<str.length();i++) {
if (str.charAt(i) != str.charAt(i-startPos)) {
return false;
}
}
return true;
You won't be able to do it much faster than this given the format the incoming data is arriving in but you can do it with a single linear scan. The length check will eliminate a lot of mismatched cases immediately so is a simple optimization.
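Put together, the steps above might look like the following sketch. The class and method names are made up for illustration, and it assumes that values never contain a comma themselves.

```java
class SingleValueScan {
    static boolean hasOneValue(String s) {
        int comma = s.indexOf(',');
        if (comma == -1) {
            return true; // no separator: the string holds a single value
        }
        int period = comma + 1; // length of the first word plus its comma
        // n equal words joined by commas have length n * period - 1.
        if ((s.length() + 1) % period != 0) {
            return false;
        }
        // Compare every character with the one exactly one period earlier;
        // this checks the words and the commas in a single pass.
        for (int i = period; i < s.length(); i++) {
            if (s.charAt(i) != s.charAt(i - period)) {
                return false;
            }
        }
        return true;
    }
}
```

The length check runs first so that inputs such as "ab,ab,a", which would survive the character comparison on their own, are rejected.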
Calling split might be expensive, especially with 100 GB of data.
Consider something like below (NOT tested and might require a bit of tweaking the index values, but I think you will get the idea) -
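public static boolean stringHasOneValue(String string) {
    String separator = ",";
    int firstSeparator = string.indexOf(separator); // index of the first separator, i.e. the comma
    if (firstSeparator == -1) {
        return true; // no separator: the string holds a single value
    }
    String firstValue = string.substring(0, firstSeparator); // first value of the comma-separated string
    int lengthOfIncrement = firstValue.length() + 1; // the value plus one to account for the comma
    if ((string.length() + 1) % lengthOfIncrement != 0) {
        return false; // the length cannot be made up of equally long values
    }
    for (int i = lengthOfIncrement; i < string.length(); i += lengthOfIncrement) {
        // each block must start right after a comma and repeat the first value
        String currentValue = string.substring(i, i + firstValue.length());
        if (string.charAt(i - 1) != ',' || !firstValue.equals(currentValue)) {
            return false;
        }
    }
    return true;
}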
Complexity is O(n), assuming Java's implementation of substring is efficient. If it is not, you can write your own substring method that takes the required number of characters from the String.
For fun, here is a one-liner (@Tim's answer is more efficient):
System.out.println((new HashSet<String>(Arrays.asList("test,test,test".split(","))).size()==1));

Regex not valid when trying to invalidate alpha's and non-zeros

I wrote a method which takes in a String and checks the following conditions:
If String is "quit", it will terminate the program.
If the String is any value other than an integer, it should return "Invalid input ".
Any negative integers and also 0 should return "Invalid input".
However, when I passed in 10, it returned as "Invalid input"?
Please advise:
public static String validate(String input) {
Pattern pattern = Pattern.compile(".*[^1-9].*");
StringBuilder results = new StringBuilder();
if (input.equals("quit")) {
System.exit(1);
} else if (!pattern.matcher(input).matches() == false) {
results.append("Invalid input ");
results.append("'");
results.append(input);
results.append("'");
}
return results.toString();
}
What's wrong with what I am doing?
You should write a pattern for what you expect instead of for what you don't.
Describing what you want is always simpler than describing everything else.
So you expect :
Pattern acceptPattern = Pattern.compile("[1-9][0-9]*");
You may also consider making your conditional expression simpler, and correct, by not using both ! and == false at the same time.
Which gives:
if (!acceptPattern.matcher(input).matches()) { /* invalid input code */ }
or
if (acceptPattern.matcher(input).matches() == false) { /* invalid input code */ }
Note:
You wrote if (!A == false), which reduces to if (A == true) and then to if (A); with a pattern describing valid input, that is the inverse of the check you need.
It looks like you want to match one or more digits, where the first one is not a zero.
[1-9]\d*
If you want to force it to be the entire string, you can add anchors, like this:
^[1-9]\d*$
Your regex string doesn't allow for the presence of a zero (not just a lone zero).
That is, the string ".*[^1-9].*" is looking for "any number of characters, something that isn't 1-9, and any number of characters". When it finds the zero, it gives you your incorrect result.
Check out What is the regex for "Any positive integer, excluding 0" for how to change this.
Probably the most helpful solution on that page is the regex [0-9]*[1-9][0-9]* (for a valid integer). This allows for leading zeros and/or internal zeros, both of which could be present in a valid integer. In using Matcher#matches you also ensure that this regex matches the whole input, not just part of it (without the need to add in beginning and end anchors -- ^$).
Also, the line else if (!pattern.matcher(input).matches() == false) could be made a lot more clear.... maybe try else if (pattern.matcher(input).matches()) instead?
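As a rough sketch of the accept-pattern approach suggested in the answers, the validation could look like the following. The class name and constant are hypothetical, the "quit" handling from the question is left out, and the pattern [1-9][0-9]* deliberately rejects leading zeros; swap in [0-9]*[1-9][0-9]* if those should be allowed.

```java
class InputValidator {
    // Describes valid input: a positive integer without leading zeros.
    private static final java.util.regex.Pattern POSITIVE_INT =
            java.util.regex.Pattern.compile("[1-9][0-9]*");

    // Returns an error message for invalid input, or an empty string if valid.
    static String validate(String input) {
        // matches() tests the whole input, so no ^ and $ anchors are needed.
        if (!POSITIVE_INT.matcher(input).matches()) {
            return "Invalid input '" + input + "'";
        }
        return "";
    }
}
```

With this version, "10" is accepted because the zero is allowed after the first digit, while "0", "-5" and "abc" are all reported as invalid.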

Java String.indexOf and empty Strings

I'm curious why the String.indexOf is returning a 0 (instead of -1) when asking for the index of an empty string within a string.
The Javadocs only say this method returns the index in this string of the specified string, -1 if the string isn't found.
To me this behavior seems highly unexpected, I would have expected a -1. Any ideas why this unexpected behavior is going on? I would at the least think this is worth a note in the method's Javadocs...
System.out.println("FOO".indexOf("")); // outputs 0 wtf!!!
System.out.println("FOO".indexOf("bar")); // outputs -1 as expected
System.out.println("FOO".indexOf("F")); // outputs 0 as expected
System.out.println("".indexOf("")); // outputs 0 as expected, I think
The empty string is everywhere, and nowhere. It is within all strings at all times, permeating the essence of their being, yet as you seek it you shall never catch a glimpse.
How many empty strings can you fit at the beginning of a string? Mu
The student said to the teacher,
Teacher, I believe that I have found the nature of the empty string. The empty string is like a particle of dust, and it floats freely through a string as dust floats freely through the room, glistening in a beam of sunlight.
The teacher responded to the student,
Hmm. A fine notion. Now tell me, where is the dust, and where is the sunlight?
The teacher struck the student with a strap and instructed him to continue his meditation.
Well, if it helps, you can think of "FOO" as "" + "FOO".
int number_of_empty_strings_in_string_named_text = text.length() + 1
All characters are separated by an empty String. Additionally empty String is present at the beginning and at the end.
By using the expression "", you are referring to the empty string: a String object of length zero. (This is not Java's null reference; calling indexOf on null would throw a NullPointerException.) The empty string marks the absence of any characters at a location.
So, by saying "".indexOf( "" ), you are really asking:
Where does the empty string occur in my empty string?
It returns zero, since the empty string sits at the very beginning of the empty string.
Adding anything to the string would make it non-empty.
Using an algebraic approach, "" is the neutral element of string concatenation: x + "" == x and "" + x == x (although + is non commutative here).
Then it must also be:
x.indexOf ( y ) == i and i != -1
<==> x.substring ( 0, i ) + y + x.substring ( i + y.length () ) == x
when y = "", this holds if i == 0 and x.substring ( 0, 0 ) == "".
I didn't design Java, but I guess mathematicians participated in it...
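The identity above can be checked mechanically. The following is only an illustrative sketch (the class and method names are invented here) that rebuilds x from the pieces around the match and compares the result with x.

```java
class EmptyStringIdentity {
    // Checks the identity from the text: if x.indexOf(y) == i and i != -1,
    // then x.substring(0, i) + y + x.substring(i + y.length()) equals x.
    static boolean identityHolds(String x, String y) {
        int i = x.indexOf(y);
        if (i == -1) {
            return true; // nothing to check when y does not occur in x
        }
        String rebuilt = x.substring(0, i) + y + x.substring(i + y.length());
        return rebuilt.equals(x);
    }
}
```

When y is empty, the identity alone would be satisfied by any i from 0 to x.length(); indexOf reports the first such position, which is 0.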
If we look inside the String implementation of "foo".indexOf(""), we arrive at this method:
public int indexOf(String str) {
byte coder = coder();
if (coder == str.coder()) {
return isLatin1() ? StringLatin1.indexOf(value, str.value)
: StringUTF16.indexOf(value, str.value);
}
if (coder == LATIN1) { // str.coder == UTF16
return -1;
}
return StringUTF16.indexOfLatin1(value, str.value);
}
If we look inside of any of the called indexOf(value, str.value) methods we find a condition that says:
if the length of the second parameter (the string we are searching for) is 0, return 0:
public static int indexOf(byte[] value, byte[] str) {
if (str.length == 0) {
return 0;
}
...
This is just defensive coding for an edge case. It is necessary because the next method called, which does the actual searching by comparing the bytes of the strings (a string is backed by a byte array), would otherwise throw an ArrayIndexOutOfBoundsException:
public static int indexOf(byte[] value, int valueCount, byte[] str, int strCount, int fromIndex) {
byte first = str[0];
...
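To see why the guard matters, here is a simplified stand-in for such a search routine. It is not the JDK's actual code, only a naive sketch over char arrays with invented names; without the length check, the read of str[0] would fail for an empty search string.

```java
class NaiveIndexOf {
    static int indexOf(char[] value, char[] str) {
        if (str.length == 0) {
            return 0; // guard: the empty string is reported at index 0
        }
        char first = str[0]; // would throw for an empty str without the guard
        int max = value.length - str.length;
        for (int i = 0; i <= max; i++) {
            if (value[i] != first) {
                continue; // first character does not match here
            }
            int j = 1;
            while (j < str.length && value[i + j] == str[j]) {
                j++;
            }
            if (j == str.length) {
                return i; // all characters matched
            }
        }
        return -1;
    }
}
```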
This question is actually two questions:
Why should a string contain the empty string?
Why should the empty string be found specifically at index zero?
Answering #1:
A string contains the empty string in order to be in accordance with Set Theory, according to which:
The empty set is a subset of every set including itself.
This also means that even the empty string contains the empty string, and the following statement proves it:
assert "".indexOf( "" ) == 0;
I am not sure why mathematicians have decided that it should be so, but I am pretty sure they have their reasons, and it appears that these reasons can be explained in layman's terms, as various youtube videos seem to do, (for example, https://www.youtube.com/watch?v=1nBKadtFViM) although I have not actually viewed any of those videos, because #AintNoBodyGotNoTimeFoDat.
Answering #2:
The empty string can be found specifically at index zero of any string, because why not? In other words, if not at index zero, then at which index? Index zero is as good as any other index, and index zero is guaranteed to be a valid index for all strings except for the trifling exception of the empty string.
