"Find substring in char[]" getting unexpected results

Disclaimer: This is a bit of a homework question. I'm attempting to write a contains(java.lang.String subString) method that returns an int value representing the index of the comparison string within the primary string, for a custom-made String class.
Some of the rules:
No collection classes
Only charAt() and toCharArray() are allowed from the java String class (but methods from other classes are allowed)
Assume length() returns the length of the primary string (which is exactly what it does)
My Code:
public int contains(java.lang.String subString) {
this.subString = subString;
char[] arrSubStr = this.subString.toCharArray();
//Create initial fail
int index = -1;
//Make sure comparison subString is the same length or shorter than the primary string
if(arrSubStr.length > length()) {
return index;
}
//Steps to perform if initial conditions are met
else {
//Compare first character of subString to each character in primary string
for(int i = 0; i < length(); i++) {
//When a match is found...
if(arrSubStr[0] == this.content[i]) {
//...make sure that the subString is not longer than the remaining length of the primary string
if(arrSubStr.length > length() - i) {
return index;
}
//Proceed matching remainder of subString
else {
//Record the index of the beginning of the subString contained in primary string
index = i;
//Starting with second character of subString...
for(int j = 1; j < arrSubStr.length;) {
//...compare with subsequent chars of primary string,
//and if a failure of match is found, reset index to failure (-1)
if(arrSubStr[j] != this.content[j+i]) {
index = -1;
return index;
}
//If we get here, it means whole subString match found
//Return the index (=i) we set earlier
else {
return index;
}
}
}
}
}
}
return index;
}
Results from testing:
Primary string: asdfg
Comparison string: donkey
Result: -1 [PASS]
Primary string: asdfg
Comparison string: asdfg
Result: 0 [PASS]
Primary string: asdfg
Comparison string: g
Result: 4 [PASS]
Primary string: asasasf
Comparison string: asd
Result: 0 [FAIL] (should be -1)
Primary string: asasasf
Comparison string: asf
Result: 0 [FAIL] (should be 4)
The comments reflect how the code is intended to work. However, it's clear that when it reaches the second for loop, the logic breaks down somehow and gives the results above. But I can't see the problem. Could I get a second set of eyes on this?

//If we get here, it means whole subString match found
//Return the index (=i) we set earlier
else {
return index;
}
Unfortunately, this assumption is not correct. If you get there, it only means that the second characters of the two strings are identical: both branches of the if-else contain a return, so the loop body runs at most once.
The fix is probably easy now that the problem is diagnosed, but I want to go a bit further. On a daily basis we try to write code that is maintainable, reusable and testable.
This basically means that the function here could easily be sliced up into several small functions, invoked one after the other, for which we could write unit tests and get quick feedback on whether each piece of logic behaves as expected.
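As a rough illustration of that kind of split (a sketch only: the class name SubstringSearch and the helper names matchesAt and indexOf are made up here, and plain char[] parameters stand in for the custom class's content field):

```java
// Hypothetical helper class; the names are illustrative, not from the original post.
final class SubstringSearch {

    // True if needle occurs in haystack starting exactly at index start.
    static boolean matchesAt(char[] haystack, char[] needle, int start) {
        if (needle.length > haystack.length - start) {
            return false; // needle would run past the end of haystack
        }
        for (int j = 0; j < needle.length; j++) {
            if (haystack[start + j] != needle[j]) {
                return false;
            }
        }
        return true; // only reached after ALL chars were compared
    }

    // Index of the first occurrence of needle in haystack, or -1 if there is none.
    static int indexOf(char[] haystack, char[] needle) {
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            if (matchesAt(haystack, needle, i)) {
                return i;
            }
        }
        return -1;
    }
}
```

Each helper can then be unit tested on its own, for example matchesAt with a start index near the end of the array.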

With suggestions from Jai and azurefrog in the comments, I was able to solve the issues by re-writing the logic to the following (somewhat abridged):
if(arrSubStr.length > length()) {
return index;
}
//Steps to perform if initial conditions are met
else {
//Compare first character of subString to each character in primary string
for(int i = 0; i < length(); i++) {
//When a match is found...
if(arrSubStr[0] == this.content[i]) {
//...make sure that the subString is not longer than the remaining length of the primary string
if(arrSubStr.length <= length() - i) {
//Record the index of the beginning of the subString contained in primary string
index = i;
//Starting with second character of subString...
for(int j = 1; j < arrSubStr.length; j++) {
//...compare with subsequent chars of primary string,
//and if a failure of match is found, reset index to failure (-1)
if(arrSubStr[j] != this.content[j+i]) {
index = -1;
break;
}
}
}
}
}
}
return index;
Essentially, I removed all of the return statements from within the loops. Simply setting the index value appropriately and making use of the final (outside) return statement was, in hindsight, the correct way to approach the problem. I then also added a break; to the inner for loop so that a failed match lets the outer loop carry on to the next position. I'm sure there's still unnecessary code in there, but while it's still passing the requisite tests, I'm encouraged to leave it the hell alone. :)
I'm still a novice at Java, so I hope this explanation made sense.
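One thing worth double-checking in the loop above: because the outer loop keeps running after a full match, a later partial match can reset index to -1 (for example, primary string asdasf with comparison string asd), and a later full match can overwrite an earlier one. A minimal sketch of one way to guard against that, assuming the class keeps its characters in a char[] field named content as in the question (the class name MyString and its constructor are hypothetical):

```java
// Hypothetical stand-in for the custom string class from the question.
final class MyString {
    private final char[] content;

    MyString(String s) {
        this.content = s.toCharArray();
    }

    int length() {
        return content.length;
    }

    // Returns the index of the first occurrence of subString, or -1 if absent.
    int contains(String subString) {
        char[] arrSubStr = subString.toCharArray();
        // Only start positions that leave room for the whole subString are tried.
        for (int i = 0; i + arrSubStr.length <= length(); i++) {
            boolean matched = true;
            for (int j = 0; j < arrSubStr.length; j++) {
                if (arrSubStr[j] != content[i + j]) {
                    matched = false;
                    break; // give up on this start position only
                }
            }
            if (matched) {
                return i; // stop at the first full match so it cannot be overwritten
            }
        }
        return -1;
    }
}
```

This does put one return back inside the outer loop, but it sits after the inner loop has finished, so it is only reached once every character has been compared.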

Related

finding the middle index of a substring when there are duplicates in the string

I was working on a Java coding problem and encountered the following issue.
Problem:
Given a string, does "xyz" appear in the middle of the string? To define middle, we'll say that the number of chars to the left and right of the "xyz" must differ by at most one
xyzMiddle("AAxyzBB") → true
xyzMiddle("AxyzBBB") → false
My Code:
public boolean xyzMiddle(String str) {
boolean result=false;
if(str.length()<3)result=false;
if(str.length()==3 && str.equals("xyz"))result=true;
for(int j=0;j<str.length()-3;j++){
if(str.substring(j,j+3).equals("xyz")){
String rightSide=str.substring(j+3,str.length());
int rightLength=rightSide.length();
String leftSide=str.substring(0,j);
int leftLength=leftSide.length();
int diff=Math.abs(rightLength-leftLength);
if(diff>=0 && diff<=1)result=true;
else result=false;
}
}
return result;
}
Output I am getting:
It works for most of the test cases but fails for certain edge cases involving more than one occurrence of "xyz" in the string
Example:
xyzMiddle("xyzxyzAxyzBxyzxyz")
My present method is taking the "xyz" starting at the index 0. I understood the problem. I want a solution where the condition is using only string manipulation functions.
NOTE: I need to solve this using string manipulations like substrings. I am not considering using List, StringBuffer/StringBuilder etc. I would appreciate answers that build on my code.

There is no need to loop at all, because you only want to check if xyz is in the middle.
The string is of the form
prefix + "xyz" + suffix
The content of the prefix and suffix is irrelevant; the only thing that matters is they differ in length by at most 1.
Depending on the length of the string (and assuming it is at least 3):
Prefix and suffix must have the same length if the (string's length - the length of xyz) is even. In this case:
int prefixLen = (str.length()-3)/2;
result = str.substring(prefixLen, prefixLen+3).equals("xyz");
Otherwise, prefix and suffix differ in length by 1. In this case:
int minPrefixLen = (str.length()-3)/2;
int maxPrefixLen = minPrefixLen+1;
result = str.substring(minPrefixLen, minPrefixLen+3).equals("xyz") || str.substring(maxPrefixLen, maxPrefixLen+3).equals("xyz");
In fact, you don't even need the substring here. You can do it with str.regionMatches instead, and avoid creating the substrings, e.g. for the first case:
result = str.regionMatches(prefixLen, "xyz", 0, 3);
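Putting the two cases together, a minimal sketch might look like the following (the class name XyzMiddle is only for illustration). When the leftover length is even there is a single candidate position; when it is odd, the position one character to the right is also tried:

```java
final class XyzMiddle {
    static boolean xyzMiddle(String str) {
        int rest = str.length() - 3; // chars left over for prefix + suffix
        if (rest < 0) {
            return false; // too short to contain "xyz"
        }
        int minPrefixLen = rest / 2;
        // regionMatches compares in place, so no substring objects are created.
        if (str.regionMatches(minPrefixLen, "xyz", 0, 3)) {
            return true;
        }
        // With an odd remainder the prefix may also be one char longer.
        return rest % 2 == 1 && str.regionMatches(minPrefixLen + 1, "xyz", 0, 3);
    }
}
```

For the problematic input from the question, xyzxyzAxyzBxyzxyz, the only candidate position is index 7, which is where the middle "xyz" starts.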

Super easy solution:
1. Use Apache StringUtils to split the string. Specifically, splitByWholeSeparatorPreserveAllTokens.
2. Think about the problem. Specifically, if the token is in the middle of the string then there must be an even number of tokens returned by the split call (see step 1 above). Zero counts as an even number here.
3. If the number of tokens is even, add the lengths of the first group (first half of the tokens) and compare it to the lengths of the second group.
4. Pay attention to details: an empty token indicates an occurrence of the token itself. You can count this as zero length, count it as the length of the token, or count it as literally any number as long as you always count it as the same number.
5. if (lengthFirstHalf == lengthSecondHalf) token is in middle.

Working from your code, I left the cases str.length()<3 and str.length()==3 unchanged.
Taking inspiration from @Andy's answer, I considered the pattern
prefix+"xyz"+suffix
and, while looking for matches, I also check whether each one respects the "middle" rule as you defined it. If a match that respects the rule is found, the loop breaks and the method returns success; otherwise the loop continues with the next occurrence.
public boolean xyzMiddle(String str) {
boolean result=false;
if(str.length()<3)
result=false;
else if(str.length()==3 && str.equals("xyz"))
result=true;
else{
int k=0;
while(k<str.length()){
k=str.indexOf("xyz",k);
if(k==-1)
break; //no further occurrences of "xyz"
//check if match is in the middle
int preLen=k; //number of chars before the match
int sufLen=str.length()-(k+3); //number of chars after the match
if(preLen==sufLen || preLen==sufLen-1 || preLen==sufLen+1){
result=true;
break; //a match in the middle was found
}
k++; //continue searching after the start of this occurrence
}
}
return result;
}

Boolean always returns false possibly due to values not increasing

This method, moreVowels, is intended to be able to count the amount of vowels and consonants in the String entered, and return true if the amount of vowels is greater than the amount of consonants. Sadly this code always returns false, and I cannot understand why. Here is the method stated:
public Boolean moreVowels()
{ vowelCount = 0;
consonantCount = 0;
for(int i = 0; i < word.length(); i++)
{
if ("AEOIUY".contains(word.substring(i,i++)) || "aeoiuy".contains(word.substring(i,i++)))
{
vowelCount++;
}
if ("BCDFGHJKLMNPQRSTVWXZ".contains(word.substring(i,i++)) || "abcdefghijklmnopqrstuvwxyz".contains(word.substring(i,i++)))
{
consonantCount++;
}
}
if (vowelCount > consonantCount)
{
return true;
}
else
{
return false;
}
}
I believe it is always returning false due to the loop not actually increasing the counts, but I'm not quite sure why not. Thank you for reading, I'm sure the answer is something silly that I failed to recognize.

First, you should not use substring(i,i++), but substring(i,i+1). Otherwise, you'll increase i, making your code skip letters.
"abcdefghijklmnopqrstuvwxyz".contains(word.substring(i,i+1)) looks like a mistake. It will cause consonantCount to increase in each loop for every lowercase letter.
If you're only dealing with words (no spaces etc.), then every character is either a consonant or a vowel, so you don't need the second if. You could get the consonant count by subtracting vowelCount from the length.
Furthermore, if you convert the i-th character to uppercase, you can omit the || "aeoiuy".contains(...) part.
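Combining these points, a minimal sketch could look like this (the class name VowelCounter and the static signature taking the word as a parameter are assumptions; like the question, it treats Y as a vowel, and it ignores anything that is not an ASCII letter):

```java
final class VowelCounter {
    // Follows the question in treating Y as a vowel; non-letters are ignored (an assumption).
    static boolean moreVowels(String word) {
        int vowelCount = 0;
        int consonantCount = 0;
        for (int i = 0; i < word.length(); i++) {
            // Upper-casing once means only one vowel list is needed.
            char c = Character.toUpperCase(word.charAt(i));
            if ("AEIOUY".indexOf(c) >= 0) {
                vowelCount++;
            } else if (c >= 'A' && c <= 'Z') {
                consonantCount++;
            }
        }
        return vowelCount > consonantCount;
    }
}
```

Using charAt(i) instead of substring(i, i+1) also avoids creating a new String for every character.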

The other answer and comment already show the problems with your code. I just want to add a possible stream-based solution that can reduce the possibility of errors by replacing the index-based looping and local variables:
private static boolean moreVowels(String word)
{
return word.chars()
.mapToObj(c -> Character.toString((char) c).toUpperCase())
.mapToInt(c -> "AEIOUY".contains(c) ? 1 : "BCDFGHJKLMNPQRSTVWXZ".contains(c) ? -1 : 0)
.sum() > 0;
}
You can apply the use of toUpperCase() to your own implementation as well to make the if statements a bit shorter (again avoiding possible errors).

How to find duplicates inside a string?

I want to find out if a string that is comma separated contains only the same values:
test,asd,123,test
test,test,test
Here the 2nd string contains only the word "test". I'd like to identify these strings.
As I want to iterate over 100GB, performance matters a lot.
Which might be the fastest way of determining a boolean result if the string contains only one value repeatedly?
public static boolean stringHasOneValue(String string) {
String value = null;
for (String split : string.split(",")) {
if (value == null) {
value = split;
} else {
if (!value.equals(split)) return false;
}
}
return true;
}

No need to split the string at all, in fact no need for any string manipulation.
Find the first word (indexOf comma).
Check the remaining string length is an exact multiple of that word plus the separating comma (i.e. (length - foundLength) % (foundLength + 1) == 0).
Loop through the remainder of the string checking the found word against each portion of the string. Just keep two indexes into the same string and move them both through it. Make sure you check the commas too (i.e. bob,bob,bob matches bob,bobabob does not).
As assylias pointed out there is no need to reset the pointers, just let them run through the String and compare the 1st with 2nd, 2nd with 3rd, etc.
Example loop, you will need to tweak the exact position of startPos to point to the first character after the first comma:
for (int i=startPos;i<str.length();i++) {
if (str.charAt(i) != str.charAt(i-startPos)) {
return false;
}
}
return true;
You won't be able to do it much faster than this given the format the incoming data is arriving in but you can do it with a single linear scan. The length check will eliminate a lot of mismatched cases immediately so is a simple optimization.
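For reference, the steps above could be assembled roughly as follows. This is a sketch, not a tuned implementation; the class and method names are made up, and it assumes one record per String with a plain comma as separator:

```java
final class SingleValueCheck {
    // True if every comma-separated value in s equals the first one.
    static boolean hasOneValue(String s) {
        int firstComma = s.indexOf(',');
        if (firstComma < 0) {
            return true; // no separator: a single value
        }
        int period = firstComma + 1; // length of the first word plus its comma
        // What follows the first word must be a whole number of (comma + word) blocks.
        if ((s.length() - firstComma) % period != 0) {
            return false;
        }
        // Compare each char with the one a full period earlier; commas are compared too.
        for (int i = period; i < s.length(); i++) {
            if (s.charAt(i) != s.charAt(i - period)) {
                return false;
            }
        }
        return true;
    }
}
```

In this form the length check is needed for correctness as well as speed: without it, a truncated last value such as test,test,tes would pass the character comparison. Unlike the split-based version, a trailing comma is treated as a mismatch here.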

Calling split might be expensive, especially with 100 GB of data.
Consider something like below (NOT tested and might require a bit of tweaking the index values, but I think you will get the idea) -
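public static boolean stringHasOneValue(String string) {
String separator = ",";
int firstSeparator = string.indexOf(separator); //index of the first separator i.e. the comma
if (firstSeparator == -1) {
return true; // no separator at all, so there is only one value
}
String firstValue = string.substring(0, firstSeparator); // first value of the comma separated string
int lengthOfIncrement = firstValue.length() + 1; // the string plus one to accommodate for the comma
// start at the second value and jump from value to value
for (int i = lengthOfIncrement; i < string.length(); i += lengthOfIncrement) {
int end = Math.min(i + firstValue.length(), string.length());
String currentValue = string.substring(i, end);
if (!firstValue.equals(currentValue)) {
return false;
}
// each value must be followed by the end of the string or by a comma
if (end < string.length() && string.charAt(end) != ',') {
return false;
}
}
return true;
}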
Complexity is O(n), assuming Java's implementation of substring is efficient. If not, you can write your own substring method that takes the required number of characters from the String.

Just for fun, a one-liner
(@Tim's answer is more efficient):
System.out.println((new HashSet<String>(Arrays.asList("test,test,test".split(","))).size()==1));

Comparing String Integers Issue

I have a scanner that reads a 7-character alphanumeric code (entered by the user). The String variable is called "code".
The last character of the code (7th character, 6th index) MUST BE NUMERIC, while the rest may be either numeric or alphabetical.
So, I set out to write a check that would stop the rest of the method from executing if the last character in the code was anything but a digit (0 - 9).
However, my code does not work as expected: even if my code ends in a digit between 0 and 9, the if condition is met and the program prints "last character in code is non-numerical".
Example code: 45m4av7
characterAtEnd prints out as the string character 7, as it should.
However, my program still tells me my code ends non-numerically.
I'm aware that my number values are string characters, but it shouldn't matter, should it?
Also, I apparently cannot compare actual integer values with a "|", which is mainly why I'm using String.valueOf and taking the string characters of 0-9.
String characterAtEnd = String.valueOf(code.charAt(code.length()-1));
System.out.println(characterAtEnd);
if(!characterAtEnd.equals(String.valueOf(0|1|2|3|4|5|6|7|8|9))){
System.out.println("INVALID CRC CODE: last character in code in non-numerical.");
System.exit(0);
I cannot for the life of me, figure out why my program is telling me my code (that has a 7 at the end) ends non-numerically. It should skip the if statement and continue on. right?

The String contains method will work here:
String digits = "0123456789";
digits.contains(characterAtEnd); // true if ends with digit, false otherwise
String.valueOf(0|1|2|3|4|5|6|7|8|9) is actually "15", which of course can never be equal to the last character. This should make sense, because 0|1|2|3|4|5|6|7|8|9 evaluates to 15 using integer math, which then gets converted to a String.
Alternatively, try this:
String code = "45m4av7";
char characterAtEnd = code.charAt(code.length() - 1);
System.out.println(characterAtEnd);
if(characterAtEnd < '0' || characterAtEnd > '9'){
System.out.println("INVALID CRC CODE: last character in code in non-numerical.");
System.exit(0);
}
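To see both points in one place, here is a small sketch (the class name LastCharCheck and the helper endsWithAsciiDigit are illustrative names, not part of the original code). It prints what the bitwise expression evaluates to and wraps the range comparison in a reusable method:

```java
final class LastCharCheck {
    // True if code is non-empty and its last char is one of '0'..'9'.
    static boolean endsWithAsciiDigit(String code) {
        if (code.isEmpty()) {
            return false;
        }
        char last = code.charAt(code.length() - 1);
        return last >= '0' && last <= '9';
    }

    public static void main(String[] args) {
        // Bitwise OR merges the bits of all ten numbers into a single int.
        System.out.println(0|1|2|3|4|5|6|7|8|9);           // prints 15
        System.out.println(endsWithAsciiDigit("45m4av7")); // prints true
        System.out.println(endsWithAsciiDigit("45m4avx")); // prints false
    }
}
```

Character.isDigit would also work, but it accepts digits from other scripts as well, which may or may not be what you want for this kind of code.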

You are doing bitwise operations here: if(!characterAtEnd.equals(String.valueOf(0|1|2|3|4|5|6|7|8|9)))
Check out the difference between | and ||
This bit of code should accomplish your task using regular expressions:
String code = "45m4av7";
if (!code.matches("^.+?\\d$")){
System.out.println("INVALID CRC CODE");
}
Also, for reference, this method sometimes comes in handy in similar situations:
/* returns true if someString actually ends with the specified suffix */
someString.endsWith(suffix);
As endsWith(suffix) does not take regular expressions, if you wanted to go through all possible lower-case alphabet values, you'd need to do something like this:
/* ASCII approach */
String s = "hello";
boolean endsInLetter = false;
for (int i = 97; i <= 122; i++) {
if (s.endsWith(String.valueOf(Character.toChars(i)))) {
endsInLetter = true;
}
}
System.out.println(endsInLetter);
/* String approach */
String alphabet = "abcdefghijklmnopqrstuvwxyz";
boolean endsInLetter2 = false;
for (int i = 0; i < alphabet.length(); i++) {
if (s.endsWith(String.valueOf(alphabet.charAt(i)))) {
endsInLetter2 = true;
}
}
System.out.println(endsInLetter2);
Note that neither of the aforementioned approaches is a good idea: they are clunky and rather inefficient.
Going off of the ASCII approach, you could even do something like this:
ASCII reference : http://www.asciitable.com/
int i = (int)code.charAt(code.length() - 1);
/* Corresponding ASCII values to digits */
if(i <= 57 && i >= 48){
System.out.println("Last char is a digit!");
}
If you want a one-liner, stick to regular expressions, for example:
System.out.println((!code.matches("^.+?\\d$")? "Invalid CRC Code" : "Valid CRC Code"));
I hope this helps!

Java: Efficient way to determine if a String meets several criteria?

I would like to find an efficient way (not scanning the String 10,000 times, or creating lots of intermediary Strings for holding temporary results, or string bashing, etc.) to write a method that accepts a String and determine if it meets the following criteria:
It is at least 2 characters in length
The first character is uppercased
The remaining substring after the first character contains at least 1 lowercased character
Here's my attempt so far:
private boolean isInProperForm(final String token) {
if(token.length() < 2)
return false;
char firstChar = token.charAt(0);
String restOfToken = token.substring(1);
String firstCharAsString = firstChar + "";
String firstCharStrToUpper = firstCharAsString.toUpperCase();
// TODO: Giving up because this already seems way too complicated/inefficient.
// Ignore the '&& true' clause - left it there as a placeholder so it wouldn't give a compile error.
if(firstCharStrToUpper.equals(firstCharAsString) && true)
return true;
// Presume false if we get here.
return false;
}
But as you can see I already have 1 char and 3 temp strings, and something just doesn't feel right. There's got to be a better way to write this. It's important because this method is going to get called thousands and thousands of times (for each tokenized word in a text document). So it really really needs to be efficient.
Thanks in advance!

This function should cover it. Each char is examined only once and no objects are created.
public static boolean validate(String token) {
if (token == null || token.length() < 2) return false;
if (!Character.isUpperCase(token.charAt(0))) return false;
for (int i = 1; i < token.length(); i++)
if (Character.isLowerCase(token.charAt(i))) return true;
return false;
}

The first criterion is simply the length; this value is available from the String object in constant time, so checking it does not require traversing the string.
You can use Character.isUpperCase() to determine if the first char is upper case. This does not require traversing the string either.
The last criterion requires a single traversal of the string, stopping when you first find a lower case character.
P.S. An alternative for criteria 2 and 3 combined is to use a regex (not more efficient, but more elegant):
return token.matches("[A-Z].*[a-z].*");
The regex is checking if the string starts with an upper case letter, and then followed by any sequence which contains at least one lower case character.
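If the regex route is taken for thousands of tokens, note that String.matches compiles the pattern on each call, so it is usually worth compiling it once and reusing it. A sketch (the class name ProperForm and the constant name are illustrative); keep in mind that [A-Z] and [a-z] only cover ASCII letters, unlike Character.isUpperCase and Character.isLowerCase:

```java
import java.util.regex.Pattern;

final class ProperForm {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern PROPER_FORM = Pattern.compile("[A-Z].*[a-z].*");

    static boolean isInProperForm(String token) {
        // matches() requires the whole token to fit the pattern.
        return token != null && PROPER_FORM.matcher(token).matches();
    }
}
```

The minimum length of 2 is implied by the pattern, since it needs one upper case letter followed by at least one lower case letter.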

It is at least 2 characters in length
The first character is uppercased
The remaining substring after the first character contains at least 1 lowercased character
Code:
private boolean isInProperForm(final String token) {
if(token.length() < 2) return false;
if(!Character.isUpperCase(token.charAt(0))) return false;
for(int i = 1; i < token.length(); i++) {
if(Character.isLowerCase(token.charAt(i))) {
return true; // our last criteria, so we are free
// to return on a met condition
}
}
return false; // didn't meet the last criteria, so we return false
}
If you added more criteria, you'd have to revise the last condition.

What about:
return token.matches("[A-Z].*[a-z].*");
This regular expression starts with an uppercase letter and has at least one following lowercase letter and therefore meets your requirements.

To find if the first character is uppercase:
Character.isUpperCase(token.charAt(0))
To check if there is at least one lowercase:
if(Pattern.compile("[a-z]").matcher(token).find()) {
//At least one lowercase
}

To check if first char is uppercase you can use:
Character.isUpperCase(s.charAt(0))

return token.matches("[A-Z].*[a-z].*");
