String data manipulation with Maps for very large data input


I have solved the Two Strings problem on HackerRank.
Here is the problem.
Given two strings, determine if they share a common substring. A
substring may be as small as one character.
For example, the words "a", "and", "art" share the common substring "a".
The words "be" and "cat" do not share a substring.
Function Description
Complete the function twoStrings in the editor below. It should return
a string, either YES or NO based on whether the strings share a common
substring.
twoStrings has the following parameter(s):
s1, s2: two strings to analyze.
Output Format
For each pair of strings, return YES or NO.
However, my code exceeds the time limit when the input strings are very long. Any suggestions to improve efficiency? I think I could improve the substring search by using the Stream API, but I'm not sure how to use it in this context. Could someone please help me understand this better?
public static void main(String[] args) {
String s1 = "hi";
String s2 = "world";
checkSubStrings(s1, s2);
}
static void checkSubStrings(String s1, String s2) {
Map<String, Long> s1Map = new HashMap<>();
Map<String, Long> s2Map = new HashMap<>();
findAllSubStrings(s1, s1Map);
findAllSubStrings(s2, s2Map);
boolean isContain = s2Map.entrySet().stream().anyMatch(i -> s1Map.containsKey(i.getKey()) );
if (isContain) {
System.out.println("YES");
} else {
System.out.println("NO");
}
}
static void findAllSubStrings(String s, Map<String, Long> map) {
for (int i = 0; i < s.length(); i++) {
String subString = s.substring(i);
for (int j = subString.length(); j > 0; j--) {
String subSubString = subString.substring(0, j);
if (map.containsKey(subSubString)) {
map.put(subSubString, map.get(subSubString) + 1);
} else {
if (!subSubString.equals(""))
map.put(subSubString, 1L);
}
}
}
}
Update
I just solved the question by replacing the Maps with HashSets of single characters.
Now it runs within the time limit even with very large Strings.
static String twoStrings(String s1, String s2) {
String result = null;
Set<Character> s1Set = new HashSet<>();
Set<Character> s2Set = new HashSet<>();
for(char a : s1.toCharArray()){
s1Set.add(a);
}
for(char a : s2.toCharArray()){
s2Set.add(a);
}
boolean isContain = s2Set.stream().anyMatch(s1Set::contains);
if(isContain){
result = "YES";
} else {
result = "NO";
}
return result;
}

If 2 strings share an N (>=2) character substring, they also share an N-1 character substring (because you can chop a character off the end of the common substring, and this will still be found in both strings). Extending this argument, they also share a 1-character substring.
As such, all you need to check are single-character substrings.
Fill your maps with single-character substrings instead, and you will avoid creating (and checking) unnecessary substrings. (And just use a Set instead of a Map, you never use the counts).
// Yields a `Set<Integer>`, which can be used directly to check.
return s.codePoints().boxed().collect(Collectors.toSet());
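Putting the pieces above together, a minimal sketch of the whole check might look like this. The class name TwoStringsSketch and the helper codePointSet are made up for illustration; only twoStrings comes from the problem statement.

```java
import java.util.Set;
import java.util.stream.Collectors;

class TwoStringsSketch {

    // Collects the distinct code points of s.
    // boxed() is needed because codePoints() returns an IntStream.
    static Set<Integer> codePointSet(String s) {
        return s.codePoints().boxed().collect(Collectors.toSet());
    }

    // Returns "YES" if the strings share at least one character, "NO" otherwise.
    static String twoStrings(String s1, String s2) {
        Set<Integer> seen = codePointSet(s1);
        // anyMatch short-circuits on the first shared code point,
        // so no second set has to be built for s2.
        return s2.codePoints().anyMatch(seen::contains) ? "YES" : "NO";
    }
}
```

Both strings are scanned at most once, so the work grows linearly with the input length instead of with the number of substrings.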

Related

Finding the first Non-repeating Character in the given string, not able to pass a few test cases due to Timeout

I'm working on a Problem from CodeSignal:
Given a String s consisting of the alphabet only, return the first
non-repeated element. Otherwise, return '_'.
Example: input -
s="abacabad", output - 'c'.
I came up with the following code. It passes only 16/19 test cases. Is there a way to solve this problem in O(n) or O(1)?
My code:
public char solution(String s) {
ArrayList<Character> hs = new ArrayList<>();
for (char c:s.toCharArray()) {
hs.add(c);
}
for (int j=0; j<s.length(); j++) {
if ( 1 == Collections.frequency(hs, s.charAt(j))) {
return s.charAt(j);
}
}
return '_';
}

The minimal possible time complexity for this task is linear O(n), because we need to examine every character in the given string to find out whether a particular character is unique.
Your current solution runs in O(n^2): Collections.frequency() iterates over all characters in the string, and it is called once for every character. That's basically a brute-force implementation.
We can generate a map Map<Character,Boolean>, which associates each character with a boolean value denoting whether it's repeated or not.
That avoids iterating over the given string multiple times.
Then we need to iterate over the map's entries to find the first non-repeated character. LinkedHashMap is used as the Map implementation because it preserves insertion order, which ensures that the returned non-repeated character is the first one encountered in the given string.
To update the Map I've used Java 8 method merge(), which expects three arguments: a key, a value, and a function responsible for merging the old value and the new one.
public char solution(String s) {
Map<Character, Boolean> isNonRepeated = getMap(s);
for (Map.Entry<Character, Boolean> entry: isNonRepeated.entrySet()) {
if (entry.getValue()) {
return entry.getKey();
}
}
return '_';
}
public Map<Character, Boolean> getMap(String s) {
Map<Character, Boolean> isNonRepeated = new LinkedHashMap<>();
for (int i = 0; i < s.length(); i++) {
isNonRepeated.merge(s.charAt(i), true, (v1, v2) -> false);
}
return isNonRepeated;
}
If you're comfortable with streams, this problem can be addressed in one statement (the algorithm remains the same and the time complexity is linear as well):
public char solution(String s) {
return s.chars()
.mapToObj(c -> (char) c)
.collect(Collectors.toMap( // creates intermediate Map<Character, Boolean>
Function.identity(), // key
c -> true, // value - first occurrence, character is considered to be non-repeated
(v1, v2) -> false, // resolving values, character is proved to be a duplicate
LinkedHashMap::new
))
.entrySet().stream()
.filter(Map.Entry::getValue)
.findFirst()
.map(Map.Entry::getKey)
.orElse('_');
}

Here is a slightly different approach using both a Set to account for duplicates, and a Queue to hold candidates before a possible duplicate is discovered.
- Iterate over the characters of the string.
- Try to add the character to the seen set. If it was not already there, also add it to the candidates queue.
- Otherwise, if it has already been "seen", try to remove it from the candidates queue.
By the time this gets done, the head of the queue should contain the first, non-repeating character. If the queue is empty, return the default value as no unique character was found.
public char solution(String s) {
Queue<Character> candidates = new LinkedList<>();
Set<Character> seen = new HashSet<>();
for (char c : s.toCharArray()) {
if (seen.add(c)) {
candidates.add(c);
} else {
candidates.remove(c);
}
}
return candidates.isEmpty() ? '_' : candidates.peek();
}
I have done pretty extensive testing of this and it has yet to fail. It is also comparatively very efficient. But as can happen, I may have overlooked something.
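One possible variation, not part of the answer above: LinkedList.remove(Object) scans the list, so a LinkedHashSet could hold the candidates instead, keeping insertion order while allowing constant-time removal on average. The class and method names below are hypothetical.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

class FirstUniqueSketch {

    // Same idea as the Set + Queue approach, with a LinkedHashSet as the candidate store.
    static char firstNonRepeating(String s) {
        Set<Character> seen = new HashSet<>();
        Set<Character> candidates = new LinkedHashSet<>(); // keeps first-seen order
        for (char c : s.toCharArray()) {
            if (seen.add(c)) {
                candidates.add(c);      // first occurrence: still a candidate
            } else {
                candidates.remove(c);   // repeated: no longer a candidate (no-op if already removed)
            }
        }
        // The first remaining candidate is the first non-repeated character.
        return candidates.isEmpty() ? '_' : candidates.iterator().next();
    }
}
```

For a 26-letter alphabet the difference is unlikely to matter, since the queue never holds more than 26 entries; the variant mainly helps when the set of possible characters is large.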

One technique would be a 2 pass solution using a frequency/count array for each character.
public static char firstNonRepeatingChar(String s) {
int[] frequency = new int[26]; // this is O(1) space complexity because alphabet is finite of 26 letters
/* First Pass - Fill our frequency array */
for(int i = 0; i < s.length(); i++) {
frequency[s.charAt(i) - 'a']++;
}
/* Second Pass - Look up our frequency array */
for(int i = 0; i < s.length(); i++) {
if(frequency[s.charAt(i) - 'a'] == 1) {
return s.charAt(i);
}
}
/* Not Found */
return '_';
}
This solution runs in O(2n) = O(n) time and O(1) space, because the frequency array has a fixed size of 26 (the lowercase English letters a-z). It wouldn't work for other alphabets, or for uppercase letters, without changing how the array is indexed.
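For input that is not limited to the 26 English letters, the same two-pass counting idea could be sketched with a map keyed by code point instead of a fixed array. This is an assumption-laden variant, not the answer's code: the class name is made up, and it returns a String rather than a char so that characters outside the Basic Multilingual Plane fit.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FirstUniqueCodePointSketch {

    // Returns the first non-repeated code point as a String, or "_" if there is none.
    static String firstNonRepeating(String s) {
        // LinkedHashMap preserves the order in which code points were first seen.
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        // First pass: count every code point.
        s.codePoints().forEach(cp -> counts.merge(cp, 1, Integer::sum));
        // Second pass: walk the distinct code points in first-seen order.
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            if (e.getValue() == 1) {
                return new String(Character.toChars(e.getKey()));
            }
        }
        return "_";
    }
}
```

The time stays linear, but the space is now proportional to the number of distinct characters rather than constant.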

Replace multiple replaceAll with a cleaner way of coding in a string

I need to perform multiple replaceAll commands on a string and I wonder if there is a clean way to do it. This is how it currently looks:
newString = oldString.replaceAll("α","a").replaceAll("β","b").replace("c","σ") /* This goes on for over 60 replacements*/;

I have implemented a specialized solution if you only want to replace a single Character with a single Character or another String:
private static Map<Character, Character> REPLACEMENTS = new HashMap<>();
static {
REPLACEMENTS.put('α','a');
REPLACEMENTS.put('β','b');
}
public static String replaceChars(String input) {
StringBuilder sb = new StringBuilder(input.length());
for(int i = 0;i<input.length();++i) {
char currentChar = input.charAt(i);
sb.append(REPLACEMENTS.getOrDefault(currentChar, currentChar));
}
return sb.toString();
}
This implementation avoids excessive string copies / complex regexes and thus should perform really well compared to an implementation that uses either replace or replaceAll. You can change the replacement to String too but replacing whole Strings instead of Characters is more complicated - I would prefer a regex then.
EDIT:
Here is a solution for whole Strings in the above style, but I would recommend that you look into other solutions such as a regex, because its performance characteristics are not as good as those of the Character example above. Furthermore, it's more complex and error-prone, although a simple test showed that it works correctly. It still avoids the string copies, so it may be preferable in performance-sensitive scenarios.
private static Map<String, String> REPLACEMENTS = new HashMap<>();
static {
REPLACEMENTS.put("aa","AA");
REPLACEMENTS.put("bb","BB");
}
public static String replace(String input) {
StringBuilder sb = new StringBuilder(input.length());
for (int i = 0; i < input.length(); ++i) {
i += replaceFrom(input, i, sb);
}
return sb.toString();
}
private static int replaceFrom(String input, int startIndex, StringBuilder sb) {
for (Map.Entry<String, String> replacement : REPLACEMENTS.entrySet()) {
String toMatch = replacement.getKey();
if (input.startsWith(toMatch, startIndex)) {
sb.append(replacement.getValue());
//we matched the whole word skip all matched characters
//not just the first
return toMatch.length() - 1;
}
}
sb.append(input.charAt(startIndex));
return 0;
}
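The regex alternative mentioned in this answer could be sketched as follows: all keys are combined into a single alternation, and each match is looked up in the map. The class name and the KEYS constant are made up for illustration; Pattern.quote, Matcher.appendReplacement and Matcher.quoteReplacement are standard java.util.regex calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RegexReplaceSketch {

    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put("aa", "AA");
        REPLACEMENTS.put("bb", "BB");
    }

    // One alternation of all keys; each key is quoted so regex metacharacters match literally.
    private static final Pattern KEYS = Pattern.compile(
            REPLACEMENTS.keySet().stream()
                    .map(Pattern::quote)
                    .collect(Collectors.joining("|")));

    static String replace(String input) {
        Matcher m = KEYS.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the replacement text literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(REPLACEMENTS.get(m.group())));
        }
        m.appendTail(sb); // copy whatever follows the last match
        return sb.toString();
    }
}
```

The input is scanned once regardless of how many replacements there are. If one key is a prefix of another, the alternation tries keys in map order, so longer keys would need to be inserted first.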

You can do something like this. Map will contain the mappings and all you have to do is to loop through the mappings and call replace.
public static void main(String[] args) {
// your input
String old = "something";
// the mappings
Map<Character, Character> mappings = new HashMap<>();
mappings.put('α','a');
// loop through the mappings and perform the action
for (Map.Entry<Character, Character> entry : mappings.entrySet()) {
old = old.replace(entry.getKey(), entry.getValue());
}
}

how to access String... (varargs) to get specific characters and save them into String

As mentioned in the title, I have this code:
String a = flett("AM ","L","GEDS","ORATKRR","","R TRTE","IO","TGAUU");
public static String flett(String... s){
StringBuilder merge = new StringBuilder();
for (int i = 0; i < s.length; i++) {
merge.append(s.charAt(i));
}
return merge;
}
I get an error at charAt(i). Why?
How can I, for example, read every character of every String in the array s and save them into merge, or read a specific character, such as the first character of each String, and save those into merge?

s[i].charAt(j);
where i - the index of an array, j - the index of a letter within a String.
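As a minimal sketch of that indexing, the following method appends every character of every argument. The names VarargsSketch and concatAll are hypothetical. Note that the result also has to be converted with toString(), because a StringBuilder cannot be returned where a String is declared.

```java
class VarargsSketch {

    // Appends every character of every argument, in order.
    static String concatAll(String... s) {
        StringBuilder merge = new StringBuilder();
        for (int i = 0; i < s.length; i++) {          // i: index into the array of Strings
            for (int j = 0; j < s[i].length(); j++) { // j: index of a character within s[i]
                merge.append(s[i].charAt(j));
            }
        }
        return merge.toString(); // StringBuilder -> String
    }
}
```

Inside the method, `s` is an ordinary String[], which is why `s.charAt(i)` does not compile while `s[i].charAt(j)` does.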
A Java 8 method that collects the first letter of each array element might look like
public String flett(String... s) {
return Arrays.stream(s)
.map(i -> i.length() > 0 ? String.valueOf(i.charAt(0)) : "")
.collect(Collectors.joining());
}
For the array "AM ","L","GEDS","ORATKRR","","R TRTE","IO","TGAUU", it results in "ALGORIT".

You have to accept a variable number of String parameters, then concatenate the first characters of all non-empty parameters and return the concatenated result:
public static void main(String[] args) {
String s = flett("AM ","L","GEDS","ORATKRR","","R TRTE","IO","TGAUU", "HOLA", "MMMMH");
System.out.println(s);
}
// Please note the parameter, it takes a various amount of Strings
public static String flett(String ... values) {
// create something that concatenates Strings (other options possible)
StringBuilder sb = new StringBuilder();
// the parameters are now an array of Strings, which you can "foreach"
for (String s : values) {
// check for empty ones and skip those
if (!s.equals("")) {
// append the first character of a valid parameter
sb.append(s.charAt(0));
}
}
return sb.toString();
}
Be surprised by the output…

This method takes some Strings and creates a String from the first character of each one.
public static String flett(String... s) {
StringBuilder res = new StringBuilder(s.length);
for (String a : s) {
if (!a.isEmpty()) {
res.append(a.charAt(0));
}
}
return res.toString();
}

Java equivalent of Python "join" method for array? [duplicate]

See Related .NET question
I'm looking for a quick and easy way to do exactly the opposite of split
so that it will cause ["a","b","c"] to become "a,b,c"
Iterating through an array requires either adding a condition (if this is not the last element, add the separator) or using substring to remove the last separator.
I'm sure there is a certified, efficient way to do it (Apache Commons?)
How do you prefer doing it in your projects?

Using Java 8 you can do this in a very clean way:
String.join(delimiter, elements);
This works in three ways:
1) directly specifying the elements
String joined1 = String.join(",", "a", "b", "c");
2) using arrays
String[] array = new String[] { "a", "b", "c" };
String joined2 = String.join(",", array);
3) using iterables
List<String> list = Arrays.asList(array);
String joined3 = String.join(",", list);

If you're on Android you can use TextUtils.join(delimiter, tokens)

I prefer Guava over Apache StringUtils for this particular problem:
Joiner.on(separator).join(array)
Compared to StringUtils, the Joiner API has a fluent design and is a bit more flexible, e.g. null elements may be skipped or replaced by a placeholder. Also, Joiner has a feature for joining maps with a separator between key and value.

Apache Commons Lang does indeed have a StringUtils.join method which will connect String arrays together with a specified separator.
For example:
String[] s = new String[] {"a", "b", "c"};
String joined = StringUtils.join(s, ","); // "a,b,c"
However, I suspect that, as you mention, there must be some kind of conditional or substring processing in the actual implementation of the above mentioned method.
If I were to perform the String joining and didn't have any other reasons to use Commons Lang, I would probably roll my own to reduce the number of dependencies to external libraries.

A fast and simple solution without any 3rd party includes.
public static String strJoin(String[] aArr, String sSep) {
StringBuilder sbStr = new StringBuilder();
for (int i = 0, il = aArr.length; i < il; i++) {
if (i > 0)
sbStr.append(sSep);
sbStr.append(aArr[i]);
}
return sbStr.toString();
}

"I'm sure there is a certified, efficient way to do it (Apache Commons?)"
Yes, apparently it's
StringUtils.join(array, separator)
http://www.java2s.com/Code/JavaAPI/org.apache.commons.lang/StringUtilsjoinObjectarrayStringseparator.htm

With Java 1.8 there is a new StringJoiner class - so no need for Guava or Apache Commons:
String str = new StringJoiner(",").add("a").add("b").add("c").toString();
Or using a collection directly with the new stream api:
String str = Arrays.asList("a", "b", "c").stream().collect(Collectors.joining(","));

Even easier, you can just use Arrays, which gives you a String with the values of the array separated by ", ":
String concat = Arrays.toString(myArray);
so you will end up with this: concat = "[a, b, c]"
Update
You can then get rid of the brackets using a substring, as suggested by Jeff:
concat = concat.substring(1, concat.length() -1);
so you end up with concat = "a, b, c"
if you want to use Kotlin:
val concat = myArray.joinToString(separator = ",") //"a,b,c"

You can use replace and replaceAll with regular expressions.
String[] strings = {"a", "b", "c"};
String result = Arrays.asList(strings).toString().replaceAll("(^\\[|\\]$)", "").replace(", ", ",");
Because Arrays.asList().toString() produces: "[a, b, c]", we do a replaceAll to remove the first and last brackets and then (optionally) you can change the ", " sequence for "," (your new separator).
A stripped version (fewer chars):
String[] strings = {"a", "b", "c"};
String result = ("" + Arrays.asList(strings)).replaceAll("(^.|.$)", "").replace(", ", "," );
Regular expressions are very powerful, especially with the String methods "replaceFirst" and "replaceAll". Give them a try.

All of these other answers include runtime overhead; approaches like ArrayList.toString().replaceAll(...) are very wasteful.
I will give you the optimal algorithm with zero overhead;
it doesn't look as pretty as the other options, but internally, this is what they are all doing (after piles of other hidden checks, multiple array allocation and other crud).
Since you already know you are dealing with strings, you can save a bunch of array allocations by performing everything manually. This isn't pretty, but if you trace the actual method calls made by the other implementations, you'll see it has the least runtime overhead possible.
public static String join(String separator, String ... values) {
if (values.length==0)return "";//need at least one element
//all string operations use a new array, so minimize all calls possible
char[] sep = separator.toCharArray();
// determine final size and normalize nulls
int totalSize = (values.length - 1) * sep.length;// separator size
for (int i = 0; i < values.length; i++) {
if (values[i] == null)
values[i] = "";
else
totalSize += values[i].length();
}
//exact size; no bounds checks or resizes
char[] joined = new char[totalSize];
int pos = 0;
//note, we are iterating all the elements except the last one
for (int i = 0, end = values.length-1; i < end; i++) {
System.arraycopy(values[i].toCharArray(), 0,
joined, pos, values[i].length());
pos += values[i].length();
System.arraycopy(sep, 0, joined, pos, sep.length);
pos += sep.length;
}
//now, add the last element;
//this is why we checked values.length == 0 off the hop
System.arraycopy(values[values.length-1].toCharArray(), 0,
joined, pos, values[values.length-1].length());
return new String(joined);
}

it's in StringUtils:
http://www.java2s.com/Code/JavaAPI/org.apache.commons.lang/StringUtilsjoinObjectarrayStringseparator.htm

This option is fast and clear:
public static String join(String separator, String... values) {
StringBuilder sb = new StringBuilder(128);
int end = 0;
for (String s : values) {
if (s != null) {
sb.append(s);
end = sb.length();
sb.append(separator);
}
}
return sb.substring(0, end);
}

This small function always comes in handy.
public static String join(String[] strings, int startIndex, String separator) {
StringBuffer sb = new StringBuffer();
for (int i=startIndex; i < strings.length; i++) {
if (i != startIndex) sb.append(separator);
sb.append(strings[i]);
}
return sb.toString();
}

The approach that I've taken has evolved since Java 1.0 to provide readability and maintain reasonable options for backward-compatibility with older Java versions, while also providing method signatures that are drop-in replacements for those from apache commons-lang. For performance reasons, I can see some possible objections to the use of Arrays.asList but I prefer helper methods that have sensible defaults without duplicating the one method that performs the actual work. This approach provides appropriate entry points to a reliable method that does not require array/list conversions prior to calling.
Possible variations for Java version compatibility include substituting StringBuffer (Java 1.0) for StringBuilder (Java 1.5), switching out the Java 1.5 iterator and removing the generic wildcard (Java 1.5) from the Collection (Java 1.2). If you want to take backward compatibility a step or two further, delete the methods that use Collection and move the logic into the array-based method.
public static String join(String[] values)
{
return join(values, ',');
}
public static String join(String[] values, char delimiter)
{
return join(Arrays.asList(values), String.valueOf(delimiter));
}
// To match Apache commons-lang: StringUtils.join(values, delimiter)
public static String join(String[] values, String delimiter)
{
return join(Arrays.asList(values), delimiter);
}
public static String join(Collection<?> values)
{
return join(values, ',');
}
public static String join(Collection<?> values, char delimiter)
{
return join(values, String.valueOf(delimiter));
}
public static String join(Collection<?> values, String delimiter)
{
if (values == null)
{
return new String();
}
StringBuffer strbuf = new StringBuffer();
boolean first = true;
for (Object value : values)
{
if (!first) { strbuf.append(delimiter); } else { first = false; }
strbuf.append(value.toString());
}
return strbuf.toString();
}

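public String join(String[] str, String separator){
String retval = "";
for (String s: str){ retval+= separator + s;}
// replaceFirst would interpret the separator as a regex (e.g. "." or "|"),
// so strip the leading separator by its length instead
return retval.isEmpty() ? retval : retval.substring(separator.length());
}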

Better way to detect if a string contains multiple words

I am trying to create a program that detects if multiple words are in a string as fast as possible, and if so, executes a behavior. Preferably, I would like it to detect the order of these words too but only if this can be done fast. So far, this is what I have done:
if (input.contains("adsf") && input.contains("qwer")) {
execute();
}
As you can see, doing this for multiple words would become tiresome. Is this the only way or is there a better way of detecting multiple substrings? And is there any way of detecting order?

I'd create a regular expression from the words:
Pattern pattern = Pattern.compile("(?=.*adsf)(?=.*qwer)");
if (pattern.matcher(input).find()) {
execute();
}
For more details, see this answer: https://stackoverflow.com/a/470602/660143
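Building that expression from a list of words, rather than hard-coding it, could be sketched as below. The class and method names are hypothetical. Pattern.quote is used so that words containing regex metacharacters match literally, and DOTALL is an added assumption so that `.` also spans line breaks. The second method is one possible way to additionally require the words in a given order.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class WordPatternSketch {

    // Matches (via find()) if every word occurs somewhere in the input, in any order.
    static Pattern allOf(List<String> words) {
        String lookaheads = words.stream()
                .map(w -> "(?=.*" + Pattern.quote(w) + ")") // one lookahead per word
                .collect(Collectors.joining());
        return Pattern.compile(lookaheads, Pattern.DOTALL);
    }

    // Matches (via find()) only if the words occur in the given order, without overlapping.
    static Pattern inOrder(List<String> words) {
        String sequence = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining(".*"));
        return Pattern.compile(sequence, Pattern.DOTALL);
    }
}
```

Usage would be along the lines of `WordPatternSketch.allOf(Arrays.asList("adsf", "qwer")).matcher(input).find()`. Compiling the pattern once and reusing it avoids rebuilding it for every input.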

Editor's note: Despite being heavily upvoted and accepted, this does not function the same as the code in the question. execute is called on the first match, like a logical OR.
You could use an array:
String[] matches = new String[] {"adsf", "qwer"};
for (String s : matches)
{
if (input.contains(s))
{
execute();
break;
}
}
This is as efficient as the code you posted, but more maintainable. Looking for a more efficient solution sounds like a micro-optimization that should be ignored until it is proven to be a bottleneck in your code. In any case, with a huge set of strings the solution could be a trie.

In Java 8 you could do
public static boolean containsWords(String input, String[] words) {
return Arrays.stream(words).allMatch(input::contains);
}
Sample usage:
String input = "hello, world!";
String[] words = {"hello", "world"};
if (containsWords(input, words)) System.out.println("Match");

This is a classical interview and CS problem.
The Rabin-Karp algorithm is usually what people first talk about in interviews. The basic idea is that as you go through the string, you add the current character to the hash. If the hash matches the hash of one of your match strings, you know that you might have a match. This avoids having to scan back and forth into your match strings.
https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
Other typical topics for that interview question are to consider a trie structure to speed up the lookup. If you have a large set of match strings, you have to always check a large set of match strings. A trie structure is more efficient to do that check.
https://en.wikipedia.org/wiki/Trie
Additional algorithms are:
- Aho–Corasick https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm
- Commentz-Walter https://en.wikipedia.org/wiki/Commentz-Walter_algorithm
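To make the rolling-hash idea concrete, here is a minimal single-pattern Rabin-Karp sketch. It is a simplification of what is described above: it searches for one needle rather than a set of match strings, and BASE, MOD and all names are arbitrary choices made for illustration.

```java
class RabinKarpSketch {

    private static final int BASE = 257;             // arbitrary multiplier
    private static final long MOD = 1_000_000_007L;  // arbitrary large prime modulus

    // Returns the index of the first occurrence of needle in haystack, or -1.
    static int indexOf(String haystack, String needle) {
        int n = haystack.length(), m = needle.length();
        if (m == 0) return 0;
        if (m > n) return -1;

        // BASE^(m-1) mod MOD: the weight of the window's leading character.
        long high = 1;
        for (int i = 1; i < m; i++) high = (high * BASE) % MOD;

        // Hash the needle and the first window of the haystack.
        long needleHash = 0, windowHash = 0;
        for (int i = 0; i < m; i++) {
            needleHash = (needleHash * BASE + needle.charAt(i)) % MOD;
            windowHash = (windowHash * BASE + haystack.charAt(i)) % MOD;
        }

        for (int i = 0; ; i++) {
            // Equal hashes only suggest a match; confirm by comparing characters.
            if (windowHash == needleHash && haystack.startsWith(needle, i)) return i;
            if (i + m >= n) return -1; // no further window fits
            // Slide the window: drop haystack[i], append haystack[i + m].
            windowHash = (windowHash - haystack.charAt(i) * high % MOD + MOD) % MOD;
            windowHash = (windowHash * BASE + haystack.charAt(i + m)) % MOD;
        }
    }
}
```

For several needles of the same length, the needle hashes could be kept in a set and each window hash looked up there. For needles of mixed lengths, a trie or Aho-Corasick is usually the better fit.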

If you have a lot of substrings to look up, then a regular expression probably isn't going to be much help, so you're better off putting the substrings in a list, then iterating over them and calling input.indexOf(substring) on each one. This returns an int index of where the substring was found. If you throw each result (except -1, which means that the substring wasn't found) into a TreeMap (where index is the key and the substring is the value), then you can retrieve them in order by calling keySet() on the map.
Map<Integer, String> substringIndices = new TreeMap<Integer, String>();
List<String> substrings = new ArrayList<String>();
substrings.add("asdf");
// etc.
for (String substring : substrings) {
int index = input.indexOf(substring);
if (index != -1) {
substringIndices.put(index, substring);
}
}
for (Integer index : substringIndices.keySet()) {
System.out.println(substringIndices.get(index));
}

Use a tree structure to hold the substrings per codepoint. This eliminates the need to
Note that this is efficient only if the needle set is almost constant. Individual additions or removals of substrings are not inefficient, but rebuilding the tree from a large set of strings on every search would definitely slow it down.
StringSearcher:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.HashMap;
class StringSearcher{
private NeedleTree needles = new NeedleTree(-1);
private boolean caseSensitive;
private List<Integer> lengths = new ArrayList<>();
private int maxLength;
public StringSearcher(List<String> inputs, boolean caseSensitive){
this.caseSensitive = caseSensitive;
for(String input : inputs){
if(!lengths.contains(input.length())){
lengths.add(input.length());
}
NeedleTree tree = needles;
for(int i = 0; i < input.length(); i++){
tree = tree.child(caseSensitive ? input.codePointAt(i) : Character.toLowerCase(input.codePointAt(i)));
}
tree.markSelfSet();
}
maxLength = Collections.max(lengths);
}
public boolean matches(String haystack){
if(!caseSensitive){
haystack = haystack.toLowerCase();
}
for(int i = 0; i < haystack.length(); i++){
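// clamp the end index so the window never runs past the end of the haystack
String substring = haystack.substring(i, Math.min(haystack.length(), i + maxLength)); // maybe we can even skip this and use from haystack directly?
NeedleTree tree = needles;
for(int j = 0; j < substring.length(); j++){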
tree = tree.childOrNull(substring.codePointAt(j));
if(tree == null){
break;
}
if(tree.isSelfSet()){
return true;
}
}
}
return false;
}
}
NeedleTree.java:
import java.util.HashMap;
import java.util.Map;
class NeedleTree{
private int codePoint;
private boolean selfSet;
private Map<Integer, NeedleTree> children = new HashMap<>();
public NeedleTree(int codePoint){
this.codePoint = codePoint;
}
public NeedleTree childOrNull(int codePoint){
return children.get(codePoint);
}
public NeedleTree child(int codePoint){
NeedleTree child = children.get(codePoint);
if(child == null){
child = new NeedleTree(codePoint); // Map.put returns the previous value (null here), so keep our own reference
children.put(codePoint, child);
}
return child;
}
public boolean isSelfSet(){
return selfSet;
}
public void markSelfSet(){
selfSet = true;
}
}

I think a better approach would be something like this: put all the values into one string and use indexOf to check whether each value occurs (a result of -1 means it was not found).
String s = "123";
System.out.println(s.indexOf("1")); // 0
System.out.println(s.indexOf("2")); // 1
System.out.println(s.indexOf("5")); // -1
