Map words to single characters

I'm building a hash function that should map any String (max length 100 characters) to a single [A-Z] character (I'm using it for sharding purposes).
I came up with this simple Java function. Is there any way to make it faster?
public static final char stringToChar(final String s) {
long counter = 0;
for (char c : s.toCharArray()) {
counter += c;
}
return (char)('A'+(counter%26));
}

A quick trick to get an even distribution of the "shards" is to use a hash function.
I suggest this method, which uses the default Java String.hashCode() function:
public static char getShardLabel(String string) {
int hash = string.hashCode();
// using Math.floorMod instead of the % operator because '%' can produce negative outputs
int hashMod = Math.floorMod(hash, 26);
return (char)('A'+(hashMod));
}
As pointed out here this method is considered "even enough".
Based on a quick test it looks faster than the solution you suggested.
On 80kk (80 million) strings of various lengths:
getShardLabel took 65 milliseconds
stringToChar took 571 milliseconds
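The comment about negative outputs in getShardLabel can be checked directly. The following is a minimal sketch, not part of the original answer; the class name ShardLabelDemo and both method names are made up for illustration, and -7 simply stands in for a negative hashCode() value.

```java
class ShardLabelDemo {
    // Same logic as getShardLabel: floorMod keeps the index in 0..25
    // even when the hash is negative.
    static char shardWithFloorMod(int hash) {
        return (char) ('A' + Math.floorMod(hash, 26));
    }

    // Naive variant: % keeps the sign of the dividend, so a negative hash
    // yields a negative index and a character outside A-Z.
    static char shardWithRemainder(int hash) {
        return (char) ('A' + (hash % 26));
    }

    public static void main(String[] args) {
        int negativeHash = -7; // hypothetical stand-in for a negative String.hashCode()
        System.out.println(shardWithFloorMod(negativeHash));        // T (index 19)
        System.out.println((int) shardWithRemainder(negativeHash)); // 58, the code of ':' (not a letter)
    }
}
```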

Related

Count the Characters in a String Recursively & treat "eu" as a Single Character

I am new to Java, and I'm trying to figure out how to count the characters in a given string, treating the two-character combination "eu" as a single character while still counting every other character as one.
And I want to do that using recursion.
Consider the following example.
Input:
"geugeu"
Desired output:
4 // g + eu + g + eu = 4
Current output:
2
I've been trying a lot and still can't seem to figure out how to implement it correctly.
My code:
public static int recursionCount(String str) {
if (str.length() == 1) {
return 0;
}
else {
String ch = str.substring(0, 2);
if (ch.equals("eu")) {
return 1 + recursionCount(str.substring(1));
}
else {
return recursionCount(str.substring(1));
}
}
}

OP wants to count all characters in a string but adjacent characters "ae", "oe", "ue", and "eu" should be considered a single character and counted only once.
The code below does that:
public static int recursionCount(String str) {
int n;
n = str.length();
if(n <= 1) {
return n; // return 1 if one character left or 0 if empty string.
}
else {
String ch = str.substring(0, 2);
if(ch.equals("ae") || ch.equals("oe") || ch.equals("ue") || ch.equals("eu")) {
// consider as one character and skip next character
return 1 + recursionCount(str.substring(2));
}
else {
// don't skip next character
return 1 + recursionCount(str.substring(1));
}
}
}

Recursion explained
In order to address a particular task using Recursion, you need a firm understanding of how recursion works.
And the first thing you need to keep in mind is that every recursive solution should (either explicitly or implicitly) contain two parts: Base case and Recursive case.
Let's have a look at them closely:
Base case - a part that represents a simple edge case (or a set of edge cases), i.e. a situation in which recursion should terminate. The outcome for these edge cases is known in advance. For this task, the base case is when the given string is empty, and since there's nothing to count, the return value should be 0. That is sufficient for the algorithm to work; outcomes for other inputs should be derived from the recursive case.
Recursive case - the part of the method where recursive calls are made and where the main logic resides. Every recursive call eventually hits the base case and starts building its return value.
In the recursive case, we need to check whether the given string starts with a particular string like "eu". For that we don't need to generate a substring (keep in mind that object creation is costly). Instead we can use the method String.startsWith(), which checks whether the bytes of the provided prefix match the bytes at the beginning of this string, which is cheaper (reminder: starting from Java 9, String is backed by an array of bytes, and each character is represented with either one or two bytes depending on the character encoding). We also don't need to worry about the length of the string, because if the string is shorter than the prefix, startsWith() will return false.
Implementation
That said, here's how an implementation might look:
public static int recursionCount(String str) {
if(str.isEmpty()) {
return 0;
}
return str.startsWith("eu") ?
1 + recursionCount(str.substring(2)) : 1 + recursionCount(str.substring(1));
}
Note that besides being able to implement a solution, you also need to evaluate its time and space complexity.
In this case, because we are creating a new string with every call, the time complexity is quadratic, O(n^2) (reminder: creating the new string requires allocating memory and copying the bytes of the original string). The worst-case space complexity would also be O(n^2).
There's a way of solving this problem recursively in linear time, O(n), without generating a new string at every call. For that we need to introduce a second argument, the current index, and each recursive call should advance this index by either 1 or 2 (I'm not going to implement this solution and am leaving it for the OP/reader as an exercise).
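For readers who want to compare their attempt at the index-based version, here is one possible sketch. It is not the answer author's code; the class name EuCounter and the helper countFrom are made up for illustration. It relies on the two-argument String.startsWith(prefix, offset) overload, so no substring is created.

```java
class EuCounter {
    // Public entry point: starts the recursion at index 0.
    static int recursionCount(String str) {
        return countFrom(str, 0);
    }

    // Counts characters from position index to the end, treating "eu" as one.
    // Only the index advances; the string itself is never copied.
    private static int countFrom(String str, int index) {
        if (index >= str.length()) {
            return 0; // base case: nothing left to count
        }
        // startsWith(prefix, offset) tests the prefix at the given position
        int step = str.startsWith("eu", index) ? 2 : 1;
        return 1 + countFrom(str, index + step);
    }

    public static void main(String[] args) {
        System.out.println(recursionCount("geugeu")); // 4
    }
}
```

The recursion depth still grows with the length of the input, so very long strings can overflow the call stack; the sketch only removes the repeated copying.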
In addition
In addition, here's a concise and simple non-recursive solution using String.replace():
public static int count(String str) {
return str.replace("eu", "_").length();
}
If you need to handle multiple combinations of characters (which were listed in the first version of the question), you can make use of regular expressions with String.replaceAll():
public static int count(String str) {
return str.replaceAll("ue|au|oe|eu", "_").length();
}

Why use bit shifting instead of a for loop?

I created the following code to find parity of a binary number (i.e output 1 if the number of 1's in the binary word is odd, output 0 if the number of 1's is even).
public class CalculateParity {
String binaryword;
int totalones = 0;
public CalculateParity(String binaryword) {
this.binaryword = binaryword;
getTotal();
}
public int getTotal() {
for(int i=0; i<binaryword.length(); i++) {
if (binaryword.charAt(i) == '1'){
totalones += 1;
}
}
return totalones;
}
public int calcParity() {
if (totalones % 2 == 1) {
return 1;
}
else {
return 0;
}
}
public static void main(String[] args) {
CalculateParity bin = new CalculateParity("1011101");
System.out.println(bin.calcParity());
}
}
However, all of the solutions I find online almost always deal with using bit shift operators, XORs, unsigned shift operations, etc., like this solution I found in a data structure book:
public static short parity(long x){
short result = 0;
while (x != 0) {
result ^= (x & 1);
x >>>= 1;
}
return result;
}
Why is this the case? What makes bitwise operators more of a valid/standard solution than the solution I came up with, which is simply iterating through a binary word of type String? Is a bitwise solution more efficient? I appreciate any help!

The code that you have quoted uses a loop as well (i.e., while):
public static short parity(long x){
short result = 0;
while (x != 0) {
result ^= (x & 1);
x >>>= 1;
}
return result;
}
You need to acknowledge that you are using a string that you know beforehand will be composed only of digits, conveniently in a binary representation. Naturally, given those constraints, one does not need bitwise operations; instead, one just parses char by char and does the desired computation.
On the other hand, if you receive a long as a parameter, as the method you have quoted does, then it comes in handy to use bitwise operations to go through the number one bit at a time and perform the desired computation.
One could also convert the long into a string and apply the same logic code-wise that you have applied, but first, one would have to convert that long into binary. That approach would add unnecessary steps and more code, and would perform worse. The same probably applies vice versa if you have a String with your constraints. Nevertheless, a String is not a number, even if it is composed only of digits, which makes a type that represents a number (e.g., long) an even more desirable approach.
Another thing that you are missing is that you already did some of the heavy lifting by converting a number to binary and encoding it into a String: new CalculateParity("1011101");. So you kind of skip a step there. Now try to use your approach, but this time using "93", and find the parity.
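To make the "93" exercise concrete, here is a hedged sketch; the class name ParityDemo and the method parityViaString are made up for illustration. It computes the parity of 93 with the quoted bitwise loop and via the string route, which needs the extra conversion to binary text first. Long.bitCount is shown only as a further JDK alternative.

```java
class ParityDemo {
    // Bitwise parity, following the book solution quoted in the question.
    static short parity(long x) {
        short result = 0;
        while (x != 0) {
            result ^= (x & 1); // flip result for every set bit
            x >>>= 1;          // unsigned shift so negative inputs terminate
        }
        return result;
    }

    // String-based route: convert the number to binary text, then count '1's.
    static int parityViaString(long x) {
        String binary = Long.toBinaryString(x);
        int ones = 0;
        for (int i = 0; i < binary.length(); i++) {
            if (binary.charAt(i) == '1') {
                ones++;
            }
        }
        return ones % 2;
    }

    public static void main(String[] args) {
        System.out.println(Long.toBinaryString(93)); // 1011101
        System.out.println(parity(93));              // 1 (five bits set)
        System.out.println(parityViaString(93));     // 1
        System.out.println(Long.bitCount(93) & 1);   // 1
    }
}
```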

If you want to know whether the number of 1s in a String is even, I think the method below is better.
If you convert a String to a long and the String is longer than 64 characters, an error will occur.
Both of the methods you mention are O(n), so there won't be a big difference in performance, but the shift method is more precise and uses slightly fewer CPU cycles.
private static boolean isEven(String s){
char[] chars = s.toCharArray();
int i = 0;
for(char c : chars){
i ^= (c & 1); // '0' is 48 and '1' is 49, so the low bit is the digit's value
}
return i == 0;
}

You use a string based method for a string input. Good choice.
The code you quote uses an integer-based method for an integer input. An equally good choice.

How to handle the time complexity for permutation of strings during anagrams search?

I have a program that checks whether two strings are anagrams.
It works fine for input strings shorter than 10 characters.
When I input two strings of equal length that are longer than 10 characters, the program runs and doesn't produce an answer.
My idea is that if two strings are anagrams, one string must be a permutation of the other.
This program generates all the permutations of one string and then checks whether any of them matches the other string. I want the comparison to ignore case.
It returns false when no matching string is found or when the strings being compared differ in length; otherwise it returns true.
public class Anagrams {
static ArrayList<String> str = new ArrayList<>();
static boolean isAnagram(String a, String b) {
// there is no need for checking these two
// strings because their length doesn't match
if (a.length() != b.length())
return false;
Anagrams.permute(a, 0, a.length() - 1);
for (String string : Anagrams.str)
if (string.equalsIgnoreCase(b))
// returns true if there is a matching string
// for b in the permuted string list of a
return true;
// returns false if there is no matching string
// for b in the permuted string list of a
return false;
}
private static void permute(String str, int l, int r) {
if (l == r)
// adds the permuted strings to the ArrayList
Anagrams.str.add(str);
else {
for (int i = l; i <= r; i++) {
str = Anagrams.swap(str, l, i);
Anagrams.permute(str, l + 1, r);
str = Anagrams.swap(str, l, i);
}
}
}
public static String swap(String a, int i, int j) {
char temp;
char[] charArray = a.toCharArray();
temp = charArray[i];
charArray[i] = charArray[j];
charArray[j] = temp;
return String.valueOf(charArray);
}
}
1. I want to know why this program can't process larger strings
2. I want to know how to fix this problem
Can you figure it out?

To solve this problem and check whether two strings are anagrams you don't actually need to generate every single permutation of the source string and then match it against the second one. What you can do instead, is count the frequency of each character in the first string, and then verify whether the same frequency applies for the second string.
The solution above requires one pass over each string, hence Θ(n) time complexity. In addition, you need auxiliary storage for counting characters, which is Θ(1) space for a fixed-size alphabet. These are asymptotically tight bounds.
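The counting idea can be sketched as follows. This is an illustrative version, not the answer author's code: the class name AnagramByCounting is made up, and the case-insensitive comparison is an assumption carried over from the question.

```java
import java.util.HashMap;
import java.util.Map;

class AnagramByCounting {
    static boolean isAnagram(String a, String b) {
        if (a.length() != b.length()) {
            return false; // different lengths can never be anagrams
        }
        // First pass: count each character of a, ignoring case.
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : a.toCharArray()) {
            counts.merge(Character.toLowerCase(c), 1, Integer::sum);
        }
        // Second pass: consume the counts with b; a missing or exhausted
        // entry means b has a character that a cannot supply.
        for (char c : b.toCharArray()) {
            char key = Character.toLowerCase(c);
            Integer remaining = counts.get(key);
            if (remaining == null || remaining == 0) {
                return false;
            }
            counts.put(key, remaining - 1);
        }
        return true; // equal lengths and every character of b was matched
    }

    public static void main(String[] args) {
        System.out.println(isAnagram("Listen", "Silent")); // true
        System.out.println(isAnagram("aab", "abb"));       // false
    }
}
```

Because the lengths are equal, matching every character of b against the counts from a is enough; no final check for leftover counts is needed.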

You're doing it in a very expensive way. The time complexity here is factorial, because you're generating permutations, and factorials grow very fast. That is why it takes so long to get any output once the input is longer than 10 characters.
11 factorial = 39916800
12 factorial = 479001600
13 factorial = 6227020800
and so on...
So don't think you're not getting an output for big inputs; you will eventually get one.
If you go to something like 20-30 factorial, I think it will take years to produce any output, and the list that stores every permutation will run out of memory long before that.
Fact: 50 factorial is so big that it is more than the number of sand grains on Earth, and computers give up when they have to deal with numbers that big.
That is why you are asked to include special characters in passwords: it makes the number of possible combinations so big that computers would need years to crack them by trying every one. Encryption also depends on that weakness of computers.
So you don't have to, and should not, do that to solve it (because computers are not very good at it); it is overkill.
Why don't you take each character from one string and match it against every character of the other string? It will be quadratic in the worst case.
And if you sort both the strings then you can just say
string1.equals(string2)
true means anagram
false means not anagram
and it will take linear time, apart from the time taken for sorting.

You can first get arrays of characters from these strings, then sort them, and then compare the two sorted arrays. This method works with both regular characters and surrogate pairs.
public static void main(String[] args) {
System.out.println(isAnagram("ABCD", "DCBA")); // true
System.out.println(isAnagram("𝗔𝗕𝗖𝗗", "𝗗𝗖𝗕𝗔")); // true
}
static boolean isAnagram(String a, String b) {
// invalid incoming data
if (a == null || b == null
|| a.length() != b.length())
return false;
char[] aArr = a.toCharArray();
char[] bArr = b.toCharArray();
Arrays.sort(aArr);
Arrays.sort(bArr);
return Arrays.equals(aArr, bArr);
}
See also: Check if one array is a subset of the other array - special case
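One caveat worth noting: toCharArray() yields UTF-16 code units, so sorting it splits surrogate pairs, and two different sets of supplementary characters can in rare cases sort to the same array. A hedged variant that compares whole code points is sketched below; the class name AnagramByCodePoints is made up for illustration.

```java
import java.util.Arrays;

class AnagramByCodePoints {
    static boolean isAnagram(String a, String b) {
        // invalid incoming data
        if (a == null || b == null) {
            return false;
        }
        // codePoints() yields one int per Unicode character, so a surrogate
        // pair stays together instead of being sorted as two separate chars.
        int[] aPoints = a.codePoints().sorted().toArray();
        int[] bPoints = b.codePoints().sorted().toArray();
        return Arrays.equals(aPoints, bPoints);
    }

    public static void main(String[] args) {
        System.out.println(isAnagram("ABCD", "DCBA")); // true
        // U+10000 and U+10401 versus U+10001 and U+10400: the four chars are
        // the same in both strings, but the code points differ.
        System.out.println(isAnagram("\uD800\uDC00\uD801\uDC01",
                                     "\uD800\uDC01\uD801\uDC00")); // false
    }
}
```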

How to automatically create a string with 1s in Java

I am wondering if there is a shorter way in Java to create a String with a number of 1s. I would like to create a string like 111, or 1111 or 11111 without using loops or recursive calls.
For example, in Perl code, something like '0b' . ('1' x $numberOf1s) would return 11 if numberOf1s is 2 and 111 if numberOf1s is 3.
Thanks

StringUtils, provided by the Apache Commons jar, has many static methods that can be used. For example, StringUtils has a method repeat(String str, int repeat). Example:
String str = StringUtils.repeat("1",5);
See the doc here StringUtils's repeat method

If you have a maximum number of 1s that you would want to generate, you can do this:
private static final String ALL_ONES = "11111111111111111111111111"; // max # of 1s
public String getNOnes(int n) {
// perhaps should do some error checking here
return ALL_ONES.substring(0, n);
}
If you have no maximum in mind, you could use #f1sh's answer:
public String getNOnes(int n) {
char [] ones = new char[n];
Arrays.fill(ones, '1');
return new String(ones);
}
But the entire problem seems to have ridiculous requirements.

In short: no.
You can use something like Arrays.fill(char[] arr, char value) to fill up a whole char array and then make a String out of it, but internally it uses a for loop anyways.
Also: what requirement would disallow a for loop?

You can try with something like
new String(new char[5]).replace('\0','1')
but replace iterates over all characters in char[] which are by default set to '\0'.
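If Java 11 or later is available (an assumption about your environment), String.repeat does this in a single call with no visible loop, although it still iterates internally. A minimal sketch, with the class name OnesDemo and the method ones made up for illustration:

```java
class OnesDemo {
    // String.repeat(int) was added in Java 11.
    static String ones(int n) {
        return "1".repeat(n);
    }

    public static void main(String[] args) {
        System.out.println(ones(5)); // 11111
    }
}
```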

(Any power of two) - 1 converted to binary is a string of all 1s.
For Example
4-1 = 3 = binary 11
8-1 = 7 = binary 111
16-1= 15 = binary 1111 and so on.
I used this fact to write the following code...
BigInteger will produce any size of string but will be a bit slow. If you need a string of size below 64 you can use long in the same logic.
private static String stringOf1s(int size)
{
BigInteger powerOfTwo = BigInteger.TWO.pow(size);
return powerOfTwo.subtract(BigInteger.ONE).toString(2);
}
private static String stringOfOnes(int size)
{
long powerOfTwo = 1L << size; // exact for 1 <= size <= 63, unlike (long) Math.pow(2, size)
return Long.toBinaryString(powerOfTwo-1);
}

intersection of two strings using Java HashSet

I am trying to learn Java by doing some assignments from a Stanford class and am having trouble answering this question.
boolean stringIntersect(String a, String b, int len): Given 2 strings,
consider all the substrings within them of length len. Returns true if
there are any such substrings which appear in both strings. Compute
this in O(n) time using a HashSet.
I can't figure out how to do it using a Hashset because you cannot store repeating characters. So stringIntersect(hoopla, loopla, 5) should return true.
thanks!
Edit: Thanks so much for all your prompt responses. It was helpful to see explanations as well as code. I guess I couldn't see why storing substrings in a hashset would make the algorithm more efficient. I originally had a solution like :
public static boolean stringIntersect(String a, String b, int len) {
assert (len>=1);
if (len>a.length() || len>b.length()) return false;
String s1=new String(),s2=new String();
if (a.length()<b.length()){
s1=a;
s2=b;
}
else {
s1=b;
s2=a;
}
int index = 0;
while (index<=s1.length()-len){
if (s2.contains(s1.substring(index,index+len)))return true;
index++;
}
return false;
}

I'm not sure I understand what you mean by "you cannot store repeating characters". A HashSet is a Set, so it can do two things: you can add values to it, and you can check if a value is already in it. In this case, the problem wants you to answer the question by storing strings, not chars, in the HashSet. To do this in Java:
Set<String> stringSet = new HashSet<String>();
Try breaking this problem into two parts:
1. Generate all the substrings of length len of a string
2. Use this to solve the problem.
The hint for part two is:
Step 1: For the first string enter the substrings into a hashset
Step 2: For the second string, check the values in the hashset
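The two steps can be sketched in Java as follows. This is one possible reading of the hint rather than a reference solution; the class name SubstringIntersect is made up for illustration.

```java
import java.util.HashSet;
import java.util.Set;

class SubstringIntersect {
    static boolean stringIntersect(String a, String b, int len) {
        if (len < 1 || len > a.length() || len > b.length()) {
            return false; // no substring of that length exists in both
        }
        // Step 1: store every substring of a with length len.
        Set<String> seen = new HashSet<>();
        for (int i = 0; i + len <= a.length(); i++) {
            seen.add(a.substring(i, i + len));
        }
        // Step 2: look up every substring of b with length len.
        for (int i = 0; i + len <= b.length(); i++) {
            if (seen.contains(b.substring(i, i + len))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(stringIntersect("hoopla", "loopla", 5)); // true ("oopla")
    }
}
```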
Note (Advanced): this problem is poorly specified. Entering and checking strings in a hash table takes time proportional to the length of the string. For a string a of length n you have O(n-k) substrings of length k. So for a string a of length n and a string b of length m you have O((n-k)*k+(m-k)*k). This is not really O(n), since your running time for k = n/2 is O((n/2)*(n/2)) = O(n^2).
Edit: So what if you actually want to do this in O(n) (or perhaps O(n+m+k))? My belief is that the original homework was asking for something like the algorithm I described above. But we can do better. What's more, we can do better and still make a HashSet the crucial tool for our algorithm. The idea is to perform our search using a "rolling hash". Wikipedia describes a couple: http://en.wikipedia.org/wiki/Rolling_hash, but we will implement our own.
A simple solution would be to XOR the values of the character hashes together. This would allow us to add a new char to the hash in O(1) and remove one in O(1), making computing the next hash trivial. But this simple algorithm won't work, for two reasons:
The character hashes may not provide sufficient entropy. Okay, we don't know if we will have this problem, but let's solve it anyway, just for fun.
We will hash permutations to the same value ... "abc" should not have the same hash as "cba".
To solve the first problem we can use an idea from AI, namely let's steal from Zobrist hashing. The idea is to assign every possible character a random value of a greater length. If we were using ASCII, we could easily create an array with all the ASCII characters, but that will run into problems when using Unicode characters. The alternative is to assign values lazily.
object LazyCharHash{
private val map = HashMap.empty[Char,Int]
private val r = new Random
def lHash(c: Char): Int = {
val d = map.get(c)
d match {
case None => {
map.put(c,r.nextInt)
lHash(c)
}
case Some(v) => v
}
}
}
This is Scala code. Scala tends to be less verbose than Java, but still allows me to use Java collections, so I will be using imperative-style Scala throughout. It wouldn't be that hard to translate.
The second problem can be solved as well. First, instead of using a pure XOR, we combine our XOR with a shift, so the hash function is now:
def fullHash(s: String) = {
var h = 0
for(i <- 0 until s.length){
h = h >>> 1
h = h ^ LazyCharHash.lHash(s.charAt(i))
}
h
}
Of course, using fullHash won't give a performance advantage. It is just a specification.
We need a way of using our hash function to store values in the HashSet (I promised we would use it). We can just create a wrapper class:
class HString(hash: Int, string: String){
def getHash = hash
def getString = string
override def equals(otherHString: Any): Boolean = {
otherHString match {
case other: HString => (hash == other.getHash) && (string == other.getString)
case _ => false
}
}
override def hashCode = hash
}
Okay, to make the hashing function rolling, we just have to XOR out the value associated with the character we will no longer be using. To do that, we just shift that value by the appropriate amount.
def stringIntersect(a: String, b: String, len: Int): Boolean = {
val stringSet = new HashSet[HString]()
var h = 0
for(i <- 0 until len){
h = h >>> 1
h = h ^ LazyCharHash.lHash(a.charAt(i))
}
stringSet.add(new HString(h,a.substring(0,len)))
for(i <- len until a.length){
h = h >>> 1
h = h ^ (LazyCharHash.lHash(a.charAt(i - len)) >>> (len))
h = h ^ LazyCharHash.lHash(a.charAt(i))
stringSet.add(new HString(h,a.substring(i - len + 1,i + 1)))
}
...
You can figure out how to finish this code on your own.
Is this O(n)? Well, it depends on what you mean. Big O, big Omega, and big Theta are all measures of bounds. They could serve as measures of the worst case of the algorithm, the best case, or something else. In this case these modifications give expected O(n) performance, but this only holds if we avoid hash collisions. It still takes O(n) to tell whether two Strings are equal. This random approach works pretty well, and you can scale up the size of the random bit arrays to make it work better, but it does not have guaranteed performance.
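For readers who prefer Java, the rolling idea can be sketched as below. Note the substitutions: this is not a translation of the Scala code above. It swaps in a polynomial rolling hash (Rabin-Karp style) for the Zobrist-style shift and XOR, and a HashMap from hash to start indices for the HashSet of wrapper objects. The class name RollingIntersect and the multiplier 31 are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RollingIntersect {
    private static final int BASE = 31; // arbitrary multiplier (assumption)

    static boolean stringIntersect(String a, String b, int len) {
        if (len < 1 || len > a.length() || len > b.length()) {
            return false;
        }
        // BASE^(len-1), used to remove the outgoing character. int overflow
        // wraps around, which is fine: all arithmetic is consistently mod 2^32.
        int highPower = 1;
        for (int i = 1; i < len; i++) {
            highPower *= BASE;
        }
        // Map each window hash of a to the start indices that produced it.
        Map<Integer, List<Integer>> windows = new HashMap<>();
        int h = 0;
        for (int i = 0; i < a.length(); i++) {
            if (i >= len) {
                h -= a.charAt(i - len) * highPower; // drop the outgoing char
            }
            h = h * BASE + a.charAt(i);             // add the incoming char
            if (i >= len - 1) {
                windows.computeIfAbsent(h, k -> new ArrayList<>()).add(i - len + 1);
            }
        }
        // Roll the same hash over b and look each window up.
        h = 0;
        for (int i = 0; i < b.length(); i++) {
            if (i >= len) {
                h -= b.charAt(i - len) * highPower;
            }
            h = h * BASE + b.charAt(i);
            if (i >= len - 1) {
                for (int start : windows.getOrDefault(h, List.of())) {
                    // Confirm the match to rule out hash collisions.
                    if (a.regionMatches(start, b, i - len + 1, len)) {
                        return true;
                    }
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(stringIntersect("hoopla", "loopla", 5)); // true
    }
}
```

As with the Scala version, the linear running time is only an expectation: collisions, or many repeated windows that hash alike but fail the confirmation step, can make it slower.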

You should not store characters in the Hashset, but substrings.
When considering string "hoopla": if you store the substrings "hoopl" and "oopla" in the Hashset (linear operation), then it's linear again to find if one of the substrings of "loopla" matches.

I don't know how they're thinking you're supposed to use the HashSet but I ended up doing a solution like this:
public class StringComparator {
public static boolean compare( String a, String b, int len ) {
Set<String> pieces = new HashSet<String>();
for ( int x = 0; (x + len) <= a.length(); x++ ) {
pieces.add( a.substring( x, x + len ) );
}
for ( String piece : pieces ) {
if ( b.contains(piece) ) {
return true;
}
}
return false;
}
}
