Prefix search using BinarySearch

Prefix search using BinarySearch - java

Using prefix string i need to display all possible string from ArrayList using BinarySearch. Is it possible tell me the wright way.
BinarySearch(myList, SearchString);

To every String prefix and non-empty String suffix applies prefix < prefix+suffix.
So you can use Collections.<String>binarySearch(List<String>,String) to search for the position of prefix. Of course, the existence or absence of the prefix does not say anything about the existence or absence of the prefix+suffix Strings. So if the index is negative, convert it using -index-1 and check that position whether it is equal to the list’s size (in this case no prefixed String was found) or if the String at that index has the prefix. If the index was not negative, i.e. prefix without a suffix was found you have to decide whether to include Strings with empty suffix or not. Since binarySearch will return an arbitrary index in the case of multiple occurrences you have to use the returned index to go linearly to the first occurrence, either backward or forward depending on your decision whether prefix without a suffix shall be included or not.
Once you have the first position you could use binarySearch again to find the end of all prefixed Strings by searching for the smallest String which is greater than any prefixed String. This String can be constructed by incrementing the last character of the prefix by one. Here again it doesn’t matter whether that String is really there, it just gives us the delimiter for the range of prefixed Strings. So you will convert negative values using -index-1 and have the first index of a String without the prefix; it will be the size of the list if no such String exists.
public static List<String> findAllPrefixed(
List<String> list, String prefix, boolean includeEmptySuffixed)
{
int first=Collections.binarySearch(list, prefix);
if(first<0)
{
first=-first-1;
if(first==list.size() || !list.get(first).startsWith(prefix))
return Collections.emptyList();
}
else
{
if(includeEmptySuffixed)
while(first>0 && list.get(first-1).equals(prefix)) first--;
else
{
do first++; while(first<list.size() && list.get(first).equals(prefix));
if(first==list.size() || !list.get(first).startsWith(prefix))
return Collections.emptyList();
}
}
// the conditional is just a small optimization
List<String> notSmaller=first>0? list.subList(first, list.size()): list;
final int p = prefix.length()-1;
if(p<0) return notSmaller;//empty prefix, there are no larger values
final String after=prefix.substring(0, p)+(char)(prefix.charAt(p)+1);
int last=Collections.binarySearch(notSmaller, after);
if(last<0) last=-last-1;
// could just do notSmaller.subList(0,last); but this here reduces heap usage
return last==notSmaller.size()? notSmaller: list.subList(first, first+last);
}

Related

Count the Characters in a String Recursively & treat "eu" as a Single Character

I am new to Java, and I'm trying to figure out how to count Characters in the given string and threat a combination of two characters "eu" as a single character, and still count all other characters as one character.
And I want to do that using recursion.
Consider the following example.
Input:
"geugeu"
Desired output:
4 // g + eu + g + eu = 4
Current output:
2
I've been trying a lot and still can't seem to figure out how to implement it correctly.
My code:
public static int recursionCount(String str) {
if (str.length() == 1) {
return 0;
}
else {
String ch = str.substring(0, 2);
if (ch.equals("eu") {
return 1 + recursionCount(str.substring(1));
}
else {
return recursionCount(str.substring(1));
}
}
}

OP wants to count all characters in a string but adjacent characters "ae", "oe", "ue", and "eu" should be considered a single character and counted only once.
Below code does that:
public static int recursionCount(String str) {
int n;
n = str.length();
if(n <= 1) {
return n; // return 1 if one character left or 0 if empty string.
}
else {
String ch = str.substring(0, 2);
if(ch.equals("ae") || ch.equals("oe") || ch.equals("ue") || ch.equals("eu")) {
// consider as one character and skip next character
return 1 + recursionCount(str.substring(2));
}
else {
// don't skip next character
return 1 + recursionCount(str.substring(1));
}
}
}

Recursion explained
In order to address a particular task using Recursion, you need a firm understanding of how recursion works.
And the first thing you need to keep in mind is that every recursive solution should (either explicitly or implicitly) contain two parts: Base case and Recursive case.
Let's have a look at them closely:
Base case - a part that represents a simple edge-case (or a set of edge-cases), i.e. a situation in which recursion should terminate. The outcome for these edge-cases is known in advance. For this task, base case is when the given string is empty, and since there's nothing to count the return value should be 0. That is sufficient for the algorithm to work, outcomes for other inputs should be derived from the recursive case.
Recursive case - is the part of the method where recursive calls are made and where the main logic resides. Every recursive call eventually hits the base case and stars building its return value.
In the recursive case, we need to check whether the given string starts from a particular string like "eu". And for that we don't need to generate a substring (keep in mind that object creation is costful). instead we can use method String.startsWith() which checks if the bytes of the provided prefix string match the bytes at the beginning of this string which is chipper (reminder: starting from Java 9 String is backed by an array of bytes, and each character is represented either with one or two bytes depending on the character encoding) and we also don't bother about the length of the string because if the string is shorter than the prefix startsWith() will return false.
Implementation
That said, here's how an implementation might look:
public static int recursionCount(String str) {
if(str.isEmpty()) {
return 0;
}
return str.startsWith("eu") ?
1 + recursionCount(str.substring(2)) : 1 + recursionCount(str.substring(1));
}
Note: that besides from being able to implement a solution, you also need to evaluate it's Time and Space complexity.
In this case because we are creating a new string with every call time complexity is quadratic O(n^2) (reminder: creation of the new string requires allocating the memory to coping bytes of the original string). And worse case space complexity also would be O(n^2).
There's a way of solving this problem recursively in a linear time O(n) without generating a new string at every call. For that we need to introduce the second argument - current index, and each recursive call should advance this index either by 1 or by 2 (I'm not going to implement this solution and living it for OP/reader as an exercise).
In addition
In addition, here's a concise and simple non-recursive solution using String.replace():
public static int count(String str) {
return str.replace("eu", "_").length();
}
If you would need handle multiple combination of character (which were listed in the first version of the question) you can make use of the regular expressions with String.replaceAll():
public static int count(String str) {
return str.replaceAll("ue|au|oe|eu", "_").length();
}

finding the middle index of a substring when there are duplicates in the string

I was working on a Java coding problem and encountered the following issue.
Problem:
Given a string, does "xyz" appear in the middle of the string? To define middle, we'll say that the number of chars to the left and right of the "xyz" must differ by at most one
xyzMiddle("AAxyzBB") → true
xyzMiddle("AxyzBBB") → false
My Code:
public boolean xyzMiddle(String str) {
boolean result=false;
if(str.length()<3)result=false;
if(str.length()==3 && str.equals("xyz"))result=true;
for(int j=0;j<str.length()-3;j++){
if(str.substring(j,j+3).equals("xyz")){
String rightSide=str.substring(j+3,str.length());
int rightLength=rightSide.length();
String leftSide=str.substring(0,j);
int leftLength=leftSide.length();
int diff=Math.abs(rightLength-leftLength);
if(diff>=0 && diff<=1)result=true;
else result=false;
}
}
return result;
}
Output I am getting:
Running for most of the test cases but failing for certain edge cases involving more than once occurence of "xyz" in the string
Example:
xyzMiddle("xyzxyzAxyzBxyzxyz")
My present method is taking the "xyz" starting at the index 0. I understood the problem. I want a solution where the condition is using only string manipulation functions.
NOTE: I need to solve this using string manipulations like substrings. I am not considering using list, stringbuffer/builder etc. Would appreciate answers which can build up on my code.

There is no need to loop at all, because you only want to check if xyz is in the middle.
The string is of the form
prefix + "xyz" + suffix
The content of the prefix and suffix is irrelevant; the only thing that matters is they differ in length by at most 1.
Depending on the length of the string (and assuming it is at least 3):
Prefix and suffix must have the same length if the (string's length - the length of xyz) is even. In this case:
int prefixLen = (str.length()-3)/2;
result = str.substring(prefixLen, prefixLen+3).equals("xyz");
Otherwise, prefix and suffix differ in length by 1. In this case:
int minPrefixLen = (str.length()-3)/2;
int maxPrefixLen = minPrefixLen+1;
result = str.substring(minPrefixLen, minPrefixLen+3).equals("xyz") || str.substring(maxPrefixLen, maxPrefixLen+3).equals("xyz");
In fact, you don't even need the substring here. You can do it with str.regionMatches instead, and avoid creating the substrings, e.g. for the first case:
result = str.regionMatches(prefixLen, "xyz", 0, 3);

Super easy solution:
Use Apache StringUtils to split the string.
Specifically, splitByWholeSeparatorPreserveAllTokens.
Think about the problem.
Specifically, if the token is in the middle of the string then there must be an even number of tokens returned by the split call (see step 1 above).
Zero counts as an even number here.
If the number of tokens is even, add the lengths of the first group (first half of the tokens) and compare it to the lengths of the second group.
Pay attention to details,
an empty token indicates an occurrence of the token itself.
You can count this as zero length, count as the length of the token, or count it as literally any number as long as you always count it as the same number.
if (lengthFirstHalf == lengthSecondHalf) token is in middle.

Managing your code, I left unchanged the cases str.lengt<3 and str.lengt==3.
Taking inspiration from #Andy's answer, I considered the pattern
prefix+'xyz'+suffix
and, while looking for matches I controlled also if they respect the rule IsMiddle, as you defined it. If a match that respect the rule is found, the loop breaks and return a success, else the loop continue.
public boolean xyzMiddle(String str) {
boolean result=false;
if(str.length()<3)
result=false;
else if(str.length()==3 && str.equals("xyz"))
result=true;
else{
int preLen=-1;
int sufLen=-2;
int k=0;
while(k<str.lenght){
if(str.indexOf('xyz',k)!=-1){
count++;
k=str.indexOf('xyz',k);
//check if match is in the middle
preLen=str.substring(0,k).lenght;
sufLen=str.substring(k+3,str.lenght-1).lenght;
if(preLen==sufLen || preLen==sufLen-1 || preLen==sufLen+1){
result=true;
k=str.length; //breaks the while loop
}
else
result=false;
}
else
k++;
}
}
return result;
}

Finding the strings in a TreeSet that start with a given prefix

I'm trying to find the strings in a TreeSet<String> that start with a given prefix. I found a previous question asking for the same thing — Searching for a record in a TreeSet on the fly — but the answer given there doesn't work for me, because it assumes that the strings don't include Character.MAX_VALUE, and mine can.
(The answer there is to use treeSet.subSet(prefix, prefix + Character.MAX_VALUE), which gives all strings between prefix (inclusive) and prefix + Character.MAX_VALUE (exclusive), which comes out to all strings that start with prefix except those that start with prefix + Character.MAX_VALUE. But in my case I need to find all strings that start with prefix, including those that start with prefix + Character.MAX_VALUE.)
How can I do this?

To start with, I suggest re-examining your requirements. Character.MAX_VALUE is U+FFFF, which is not a valid Unicode character and never will be; so I can't think of a good reason why you would need to support it.
But if there's a good reason for that requirement, then — you need to "increment" your prefix to compute the least string that's greater than all strings starting with your prefix. For example, given "city", you need "citz". You can do that as follows:
/**
* #param prefix
* #return The least string that's greater than all strings starting with
* prefix, if one exists. Otherwise, returns Optional.empty().
* (Specifically, returns Optional.empty() if the prefix is the
* empty string, or is just a sequence of Character.MAX_VALUE-s.)
*/
private static Optional<String> incrementPrefix(final String prefix) {
final StringBuilder sb = new StringBuilder(prefix);
// remove any trailing occurrences of Character.MAX_VALUE:
while (sb.length() > 0 && sb.charAt(sb.length() - 1) == Character.MAX_VALUE) {
sb.setLength(sb.length() - 1);
}
// if the prefix is empty, then there's no upper bound:
if (sb.length() == 0) {
return Optional.empty();
}
// otherwise, increment the last character and return the result:
sb.setCharAt(sb.length() - 1, (char) (sb.charAt(sb.length() - 1) + 1));
return Optional.of(sb.toString());
}
To use it, you need to use subSet when the above method returns a string, and tailSet when it returns nothing:
/**
* #param allElements - a SortedSet of strings. This set must use the
* natural string ordering; otherwise this method
* may not behave as intended.
* #param prefix
* #return The subset of allElements containing the strings that start
* with prefix.
*/
private static SortedSet<String> getElementsWithPrefix(
final SortedSet<String> allElements, final String prefix) {
final Optional<String> endpoint = incrementPrefix(prefix);
if (endpoint.isPresent()) {
return allElements.subSet(prefix, endpoint.get());
} else {
return allElements.tailSet(prefix);
}
}
See it in action at: http://ideone.com/YvO4b3.

If anybody is looking for a shorter version of ruakh's answer:
First element is actually set.ceiling(prefix),and last - you have to increment the prefix and use set.floor(next_prefix)
public NavigableSet<String> subSetWithPrefix(NavigableSet<String> set, String prefix) {
String first = set.ceiling(prefix);
char[] chars = prefix.toCharArray();
if(chars.length>0)
chars[chars.length-1] = (char) (chars[chars.length-1]+1);
String last = set.floor(new String(chars));
if(first==null || last==null || last.compareTo(first)<0)
return new TreeSet<>();
return set.subSet(first, true, last, true);
}

How to find duplicates inside a string?

I want to find out if a string that is comma separated contains only the same values:
test,asd,123,test
test,test,test
Here the 2nd string contains only the word "test". I'd like to identify these strings.
As I want to iterate over 100GB, performance matters a lot.
Which might be the fastest way of determining a boolean result if the string contains only one value repeatedly?
public static boolean stringHasOneValue(String string) {
String value = null;
for (split : string.split(",")) {
if (value == null) {
value = split;
} else {
if (!value.equals(split)) return false;
}
}
return true;
}

No need to split the string at all, in fact no need for any string manipulation.
Find the first word (indexOf comma).
Check the remaining string length is an exact multiple of that word+the separating comma. (i.e. length-1 % (foundLength+1)==0)
Loop through the remainder of the string checking the found word against each portion of the string. Just keep two indexes into the same string and move them both through it. Make sure you check the commas too (i.e. bob,bob,bob matches bob,bobabob does not).
As assylias pointed out there is no need to reset the pointers, just let them run through the String and compare the 1st with 2nd, 2nd with 3rd, etc.
Example loop, you will need to tweak the exact position of startPos to point to the first character after the first comma:
for (int i=startPos;i<str.length();i++) {
if (str.charAt(i) != str.charAt(i-startPos)) {
return false;
}
}
return true;
You won't be able to do it much faster than this given the format the incoming data is arriving in but you can do it with a single linear scan. The length check will eliminate a lot of mismatched cases immediately so is a simple optimization.

Calling split might be expensive - especially if it is 200 GB data.
Consider something like below (NOT tested and might require a bit of tweaking the index values, but I think you will get the idea) -
public static boolean stringHasOneValue(String string) {
String seperator = ",";
int firstSeparator = string.indexOf(seperator); //index of the first separator i.e. the comma
String firstValue = string.substring(0, firstSeparator); // first value of the comma separated string
int lengthOfIncrement = firstValue.length() + 1; // the string plus one to accommodate for the comma
for (int i = 0 ; i < string.length(); i += lengthOfIncrement) {
String currentValue = string.substring(i, firstValue.length());
if (!firstValue.equals(currentValue)) {
return false;
}
}
return true;
}
Complexity O(n) - assuming Java implementations of substring is efficient. If not - you can write your own substring method that takes the required no of characters from the String.

for a crack just a line code:
(#Tim answer is more efficient)
System.out.println((new HashSet<String>(Arrays.asList("test,test,test".split(","))).size()==1));

Java: Efficient way to determine if a String meets several criteria?

I would like to find an efficient way (not scanning the String 10,000 times, or creating lots of intermediary Strings for holding temporary results, or string bashing, etc.) to write a method that accepts a String and determine if it meets the following criteria:
It is at least 2 characters in length
The first character is uppercased
The remaining substring after the first character contains at least 1 lowercased character
Here's my attempt so far:
private boolean isInProperForm(final String token) {
if(token.length() < 2)
return false;
char firstChar = token.charAt(0);
String restOfToken = token.substring(1);
String firstCharAsString = firstChar + "";
String firstCharStrToUpper = firstCharAsString.toUpperCase();
// TODO: Giving up because this already seems way too complicated/inefficient.
// Ignore the '&& true' clause - left it there as a placeholder so it wouldn't give a compile error.
if(firstCharStrToUpper.equals(firstCharAsString) && true)
return true;
// Presume false if we get here.
return false;
}
But as you can see I already have 1 char and 3 temp strings, and something just doesn't feel right. There's got to be a better way to write this. It's important because this method is going to get called thousands and thousands of times (for each tokenized word in a text document). So it really really needs to be efficient.
Thanks in advance!

This function should cover it. Each char is examined only once and no objects are created.
public static boolean validate(String token) {
if (token == null || token.length() < 2) return false;
if (!Character.isUpperCase(token.charAt(0)) return false;
for (int i = 1; i < token.length(); i++)
if (Character.isLowerCase(token.charAt(i)) return true;
return false;

The first criteria is simply the length - this data is cached in the string object and is not requiring traversing the string.
You can use Character.isUpperCase() to determine if the first char is upper case. No need as well to traverse the string.
The last criteria requires a single traversal on the string- and stop when you first find a lower case character.
P.S. An alternative for the 2+3 criteria combined is to use a regex (not more efficient - but more elegant):
return token.matches("[A-Z].*[a-z].*");
The regex is checking if the string starts with an upper case letter, and then followed by any sequence which contains at least one lower case character.

It is at least 2 characters in length
The first character is
uppercased
The remaining substring after the first character contains
at least 1 lowercased character
Code:
private boolean isInProperForm(final String token) {
if(token.length() < 2) return false;
if(!Character.isUpperCase(token.charAt(0)) return false;
for(int i = 1; i < token.length(); i++) {
if(Character.isLowerCase(token.charAt(i)) {
return true; // our last criteria, so we are free
// to return on a met condition
}
}
return false; // didn't meet the last criteria, so we return false
}
If you added more criteria, you'd have to revise the last condition.

What about:
return token.matches("[A-Z].*[a-z].*");
This regular expression starts with an uppercase letter and has at least one following lowercase letter and therefore meets your requirements.

To find if the first character is uppercase:
Character.isUpperCase(token.charAt(0))
To check if there is at least one lowercase:
if(Pattern.compile("[a-z]").matcher(token).find()) {
//At least one lowercase
}

To check if first char is uppercase you can use:
Character.isUpperCase(s.charAt(0))

return token.matches("[A-Z].[a-z].");

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Prefix search using BinarySearch - java

Using prefix string i need to display all possible string from ArrayList using BinarySearch. Is it possible tell me the wright way. BinarySearch(myList, SearchString);

Related

Count the Characters in a String Recursively & treat "eu" as a Single Character

finding the middle index of a substring when there are duplicates in the string

Finding the strings in a TreeSet that start with a given prefix

How to find duplicates inside a string?

Java: Efficient way to determine if a String meets several criteria?

Categories

Resources