Detecting duplicates in a file generated using the sliding window concept

I am working on a project where I have to parse a text file and divide the strings into substrings of a length that the user specifies. Then I need to detect the duplicates in the results.
So the original file would look like this:
ORIGIN
1 gatccaccca tctcggtctc ccaaagtgct aggattgcag gcctgagcca ccgcgcccag
61 ctgccttgtg cttttaatcc cagcactttc agaggccaag gcaggcgatc agctgaggtc
121 aggagttcaa gaccagcctg gccaacatgg tgaaacccca tctctaatac aaatacaaaa
181 aaaaaacaaa aaacgttagc caggaatgag gcccggtgct tgtaatccta aggaaggaga
241 ccaccactcc tcctgctgcc cttcccttcc ccacaccgct tccttagttt ataaaacagg
301 gaaaaaggga gaaagcaaaa agcttaaaaa aaaaaaaaaa cagaagtaag ataaatagct
I loop over the file and generate a line of the strings then use line.toCharArray() to slide over the resulting line and divide according to the user specification. So if the substrings are of length 4 the result would look like this:
GATC
ATCC
TCCA
CCAC
CACC
ACCC
CCCA
CCAT
CATC
ATCT
TCTC
CTCG
TCGG
CGGT
GGTC
GTCT
TCTC
CTCC
TCCC
CCCA
CCAA
Here is my code for splitting:
try (Scanner scanner = new Scanner(toSplit)) {
    while (scanner.hasNextLine()) {
        String line = scanner.nextLine();
        char[] chars = line.toCharArray();
        // Slide a window of length k over the line.
        for (int i = 0; i < chars.length - (k - 1); i++) {
            String s = new String(chars, i, k);
            if (!s.contains("N")) {
                System.out.println(s);
            }
        }
    }
}
My question is: given that the input file can be huge, how can I detect duplicates in the results?

If you want to check for duplicates, a Set is a good choice to hold and test the data. Please say in which context you want to detect the duplicates: words, lines, or output characters.
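A minimal sketch of the Set idea applied to the sliding-window substrings from the question. The class and method names (KmerDuplicates, findDuplicates) are made up for illustration, and the whole sequence is assumed to fit in memory as one String, which may not hold for a huge file:

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

class KmerDuplicates {
    // Returns every length-k substring that occurs more than once,
    // ordered by the position where it was first seen repeated.
    static Set<String> findDuplicates(String sequence, int k) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (int i = 0; i + k <= sequence.length(); i++) {
            String kmer = sequence.substring(i, i + k);
            if (kmer.contains("N")) {
                continue; // same filter as in the question's code
            }
            // add() returns false when the element was already in the set.
            if (!seen.add(kmer)) {
                duplicates.add(kmer);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // First 24 bases of the sample input, upper-cased.
        System.out.println(findDuplicates("GATCCACCCATCTCGGTCTCCCAA", 4));
        // prints [TCTC, CCCA]
    }
}
```

Memory use grows with the number of distinct substrings, so this is only practical while that number stays manageable.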

You can use a Bloom filter or a table of hash counts to detect possible duplicates, and then make a second pass over the file to check whether those "duplicate candidates" are true duplicates.
Example with a hash table:
// First pass: count how many times each hash bucket is hit.
int hashSpace = 65536;
int[] substringHashes = new int[hashSpace];
for (String s : tokens) {
    // hashCode() can be negative, so use floorMod rather than %.
    substringHashes[Math.floorMod(s.hashCode(), hashSpace)]++;
}
// Second pass: look only at strings whose bucket was hit more than once and check
// whether they are really repeated. The set holds candidates only, which saves a lot of memory.
Set<String> set = new HashSet<String>();
for (String s : tokens) {
    if (substringHashes[Math.floorMod(s.hashCode(), hashSpace)] > 1) {
        boolean repeated = !set.add(s);
        if (repeated) {
            // TODO whatever
        }
    }
}
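The Bloom filter variant mentioned above could look roughly like the sketch below. Everything here is an assumption made for illustration: the class name, the choice of three hash functions, and the way the second hash is derived from hashCode(). In a real run the second pass would re-read the file instead of iterating over an in-memory list.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class BloomDuplicates {
    private static final int NUM_HASHES = 3; // illustrative choice, not tuned

    // Derives the i-th bit index from two base hashes (double hashing).
    private static int bitIndex(String s, int i, int numBits) {
        int h1 = s.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) | 1; // hypothetical second hash, forced odd
        return Math.floorMod(h1 + i * h2, numBits);
    }

    static Set<String> findDuplicates(List<String> tokens, int numBits) {
        BitSet bits = new BitSet(numBits);
        Set<String> candidates = new HashSet<>();

        // Pass 1: a token whose bits are all set already was *possibly* seen before.
        for (String s : tokens) {
            boolean maybeSeen = true;
            for (int i = 0; i < NUM_HASHES; i++) {
                int idx = bitIndex(s, i, numBits);
                if (!bits.get(idx)) {
                    maybeSeen = false;
                    bits.set(idx);
                }
            }
            if (maybeSeen) {
                candidates.add(s);
            }
        }

        // Pass 2: count exactly, but only for the candidates.
        Map<String, Integer> counts = new HashMap<>();
        for (String s : tokens) {
            if (candidates.contains(s)) {
                counts.merge(s, 1, Integer::sum);
            }
        }

        // False positives from pass 1 end up with a count of 1 and are dropped.
        Set<String> duplicates = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                duplicates.add(e.getKey());
            }
        }
        return duplicates;
    }
}
```

A true duplicate always becomes a candidate, because its first occurrence sets all of its bits; a smaller bit array only increases the number of false candidates that pass 2 has to discard.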

You could do something like this:
Map<String, Integer> substringMap = new HashMap<>();
int index = 0;
Set<String> duplicates = new HashSet<>();
For each substring you pull out of the file, add it to substringMap only if it's not a duplicate (or if it is a duplicate, add it to duplicates):
if (substringMap.putIfAbsent(substring, index) == null) {
++index;
} else {
duplicates.add(substring);
}
You can then pull out all the substrings with ease:
String[] substringArray = new String[substringMap.size()];
for (Map.Entry<String, Integer> substringEntry : substringMap.entrySet()) {
substringArray[substringEntry.getValue()] = substringEntry.getKey();
}
And voila! An array of output in the original order with no duplicates, plus a set of all the substrings that were duplicates, with very nice performance.

Related

Comparing array items, index out of bound

I have a piece of code and I'm not sure how to deal with an issue in it, so please review the method below. I searched for a solution, but none of the ones I found fit my needs, so I am asking for advice here. The method takes a String and removes duplicated characters; for example, the input ABBCDEF should return ABCDEF. When accessing i+1 in the last iteration I get an IndexOutOfBoundsException, so I iterate only until string.length-1, but then I lose the last element. What is the smartest solution in your opinion? Thanks.
public String removeDuplicates(String source){
if(source.length() < 2){
return source;
}
StringBuilder noDuplicates = new StringBuilder();
char[] string = source.toCharArray();
for(int i = 0; i < string.length-1; i++){
if(string[i] != string[i+1]){
noDuplicates.append(string[i]);
}
}
return noDuplicates.toString();
}

You could do this like so: append the first character in source, and then only append subsequent characters if they are not equal to the previously-appended character.
if (source.isEmpty()) {
return source; // Or "", it doesn't really matter.
}
StringBuilder sb = new StringBuilder();
sb.append(source.charAt(0));
for (int i = 1; i < source.length(); ++i) {
char c = source.charAt(i);
if (c != sb.charAt(sb.length() - 1)) {
sb.append(c);
}
}
return sb.toString();
But if you wanted to do this more concisely, you could do it with regex:
return source.replaceAll("(.)\\1+", "$1");
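As a small illustration of what that regex does, here is a sketch wrapped in a hypothetical helper (the names RegexCollapse and collapseRuns are not from the answer). Note that `.` does not match line terminators by default, so runs of newlines would be left alone.

```java
class RegexCollapse {
    // Collapses each run of a repeated character into a single character.
    static String collapseRuns(String source) {
        // (.) captures one character; \1+ matches one or more repeats of it.
        // The whole run is replaced by the captured character ($1).
        return source.replaceAll("(.)\\1+", "$1");
    }

    public static void main(String[] args) {
        System.out.println(collapseRuns("ABBCDEF")); // prints ABCDEF
        System.out.println(collapseRuns("AAABBBA")); // prints ABA
    }
}
```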

You could simply append the last character after the loop:
public String removeDuplicates(String source){
...
noDuplicates.append(string[string.length - 1]);
return noDuplicates.toString();
}
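Written out in full, that suggestion might look like the sketch below. It reuses the method from the question; only the wrapper class name and the renamed array variable are additions.

```java
class AdjacentDuplicates {
    static String removeDuplicates(String source) {
        if (source.length() < 2) {
            return source;
        }
        StringBuilder noDuplicates = new StringBuilder();
        char[] chars = source.toCharArray();
        // Keep a character only when the next one differs, so each run
        // contributes nothing until its final character.
        for (int i = 0; i < chars.length - 1; i++) {
            if (chars[i] != chars[i + 1]) {
                noDuplicates.append(chars[i]);
            }
        }
        // The last character has no successor, so it is always kept.
        noDuplicates.append(chars[chars.length - 1]);
        return noDuplicates.toString();
    }
}
```

For the input ABBCDEF this yields ABCDEF, and a trailing run such as ABB still ends with a single B.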

You have a simple logic error.
You convert your string to a char array. That is fine, but the length property of an array counts its elements the human way, starting from 1: with 1 element the length is 1, with 2 it is 2, with 3 it is 3, and so on, while the last valid index is length - 1.
So when you access string[i + 1] with i = string.length - 1, you go one character too far.
You could change the loop condition to
i <= string.length - 2
which is the same as i < string.length - 1. The loop then never looks at the last character as string[i], so that character still has to be appended after the loop.
Or you could write a string iterator to be able to access the next element, but that seems like overkill for this example.

This is just what LinkedHashSet was made for! Under the hood it's a HashSet with a linked list running through its entries to keep track of insertion order, so you can remove duplicates by adding to the set, then reconstruct the string with guaranteed ordering. Note that this removes every repeated character, not only adjacent ones.
public static String removeDuplicates(String source) {
Set<String> dupeSet = new LinkedHashSet<>();
for (Character v : source.toCharArray()) {
dupeSet.add(v.toString());
}
return String.join("", dupeSet);
}

If you wish to remove all repeating characters regardless of their position in the given String, you might want to consider the chars() method, which provides an IntStream of the chars; that stream has a distinct() method to filter out repeating values. You can then put them back together with a StringBuilder like so:
public class RemoveDuplicatesTest {
public static void main(String[] args) {
String value = "ABBCDEFE";
System.out.println("No Duplicates: " + removeDuplicates(value));
}
public static String removeDuplicates(String value) {
StringBuilder result = new StringBuilder();
value.chars().distinct().forEach(c -> result.append((char) c));
return result.toString();
}
}

Java: Removing an empty Element from String Array [duplicate]

I have an array String ar[] = {"HalloWelt", " "}; and ar.length is 2.
It holds two values: "HalloWelt" at index 0 and a blank string at index 1.
How can I remove the blank entry at index 1 and still end up with a String array, since that is required for the next task? A series of conversions is fine as long as the result is a String array.
My attempt
public String[] arraysWhiteSpaceEliminator(String[] arr) {
int k=0; //Identify how big the array should be i.e. till it reaches an empty index.
for(int i=0; i<bsp.length;i++) {
arr[i].trim();
System.out.println(arr[i].isEmpty());
if(arr[i].isEmpty()) {
}
else {
k = k+1; //if the index isn't empty == +1;
}
}
String[] clearnArray = new String[k];
for(int s = 0; s<k; s++) {
clearnArray [s] = arr[s]; //define New Array till we reach the empty index.
//System.out.println(clearnArray [s]+" " +s);
}
return clearnArray ;
};
The logic is very simple:
Identify how big clearnArray should be.
Iterate through the original array with .trim() to remove whitespace and check whether isEmpty().
Add 1 to k if the element isn't empty.
Create clearnArray with k as its size.
Loop through the original array up to k and add all the items to clearnArray.
Issue: .trim() and .isEmpty() don't register that the element is empty.

A solution with streams:
String[] clean = Arrays.stream(ar)
.map(String::trim)
.filter(Predicate.isEqual("").negate())
.toArray(String[]::new);
Note that this assumes none of the array elements are null. If this is a possibility, simply add the following stage before the map:
.filter(Objects::nonNull)
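Put together with the null check, the pipeline might look like the following sketch. The class and method names are placeholders, and a plain lambda is used instead of Predicate.isEqual("").negate() purely for readability.

```java
import java.util.Arrays;
import java.util.Objects;

class BlankRemover {
    // Returns a new array without null, empty, or whitespace-only entries.
    static String[] removeBlank(String[] input) {
        return Arrays.stream(input)
                .filter(Objects::nonNull)   // drop nulls before calling trim()
                .map(String::trim)          // trim() returns a new String
                .filter(s -> !s.isEmpty())  // drop entries that were only whitespace
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String[] ar = {"HalloWelt", " "};
        System.out.println(removeBlank(ar).length); // prints 1
    }
}
```

The surviving elements are the trimmed strings, so leading and trailing whitespace is removed from them as well.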

The problem with your code is that after counting to find k, you just write the first k elements from the original array. To solve the problem by your technique, you need to check each element of the input array again to see if it's empty (after trimming), and only write to the new array the elements which pass the test.
The code can be simplified using the "enhanced" for loop, since you don't need indices for the original array. (The variable i keeps track of the current index in the new array.) Note also that strings are immutable, so calling .trim() does nothing if you don't use the result anywhere. Your code also refers to bsp which is not defined, so I changed that.
int k = 0;
for(String s : arr) {
s = s.trim();
if(!s.isEmpty()) {
k++;
}
}
String[] cleanArray = new String[k];
int i = 0;
for(String s : arr) {
s = s.trim();
if(!s.isEmpty()) {
cleanArray[i] = s;
i++;
}
}
return cleanArray;

Calculate the number of non-null elements and create an array of that size, like
String[] strs = ...;
int count = 0;
for (int i = 0; i < strs.length; i++) {
if (strs[i] != null) count++;
}
String newStrArray[] = new String[count];
int idx = 0;
for (int i = 0; i < strs.length; i++) {
if (strs[i] != null) newStrArray[idx++] = strs[i];
}
return newStrArray;
You could also probably make this prettier using streams. However I haven't used streaming functionality in Java, so I can't help there.
Two things to note:
Unless you are serializing or the nulls are causing other problems, trimming the array just to get rid of the nulls probably won't have an impact on memory, as the size of an array entry (a reference, typically 4 or 8 bytes) is very likely inconsequential compared to the memory block allocated for the array object.
Converting first to a List and then back to an array is lazy and possibly inefficient. ArrayList, for example, will likely include extra space in the array it creates internally so that you can add more elements to the ArrayList without having to create a whole new internal array.

In your method, create a new list:
List<String> cleanedList = new ArrayList<>();
Then iterate through ar (ar = {"HalloWelt", " "}), add only the non-empty values to cleanedList, and return it as an array:
return cleanedList.toArray(new String[0]);
Like below:
List<String> cleanedList = new ArrayList<>();
for (String s : arr) {
    s = s.trim();
    if (!s.isEmpty()) {
        cleanedList.add(s);
    }
}
// toArray() without an argument returns Object[], so pass a String[] to get a String[] back.
return cleanedList.toArray(new String[0]);

Reduced String - compress

I am practicing String programming examples. I would like to reduce a given string: a character should be eliminated if it occurs an even number of times.
Example: input aaabbc, output should be ac.
I used a HashMap to store each character with its count, and then check value % 2 to decide whether to skip the character or append it to the output. But some of the test cases are failing on HackerRank. Could you please help me identify the problem?
static String super_reduced_string(String s){
HashMap<Character, Integer> charCount = new HashMap<Character, Integer>();
StringBuilder output = new StringBuilder();
if (s == null || s.isEmpty()) {
return "Empty String";
}
char[] arr = s.toCharArray();
for (int i = 0; i < s.length(); i++) {
char c = arr[i];
if (!charCount.containsKey(c)) {
charCount.put(c,1);
} else {
charCount.put(c,charCount.get(c)+1);
}
}
for (char c:charCount.keySet()) {
if (charCount.get(c) % 2 != 0) {
output.append(c);
}
}
return output.toString();
}

The problem lies in how you are selecting the output. HashSets and HashMaps do not have any ordering associated with them. In your test case, the output could be either ac or ca.
To solve this, you can do a variety of things. The quickest way I see is to take your original string, let's say s, and call
s = s.replace(String.valueOf(c), "");
for every char c you need to remove. (Strings are immutable, so the result has to be assigned back.)
I doubt this is anywhere near the most efficient way to solve this, but it should work.
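Another way to keep the original character order is to count with a LinkedHashMap, which iterates in insertion order. The sketch below implements the rule exactly as the question states it (drop characters whose total count is even); the source does not confirm that this is the rule the judge applies, and the names OddCountFilter and keepOddCounts are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class OddCountFilter {
    // Keeps one copy of each character whose total count is odd,
    // in order of first appearance.
    static String keepOddCounts(String s) {
        Map<Character, Integer> counts = new LinkedHashMap<>();
        for (char c : s.toCharArray()) {
            counts.merge(c, 1, Integer::sum); // insert 1 or add 1 to the existing count
        }
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            if (e.getValue() % 2 != 0) {
                out.append(e.getKey());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(keepOddCounts("aaabbc")); // prints ac
    }
}
```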

How to define a string array in java and use it in switch case

Hi dear friends, I want to define a String array in Java and use each cell of that array in a switch-case to count how often each string occurs. For example, can you help me fix the code below? Thanks.
int i=20;
String [] str =new String[i];
string[0]="This";
string[1]="is";
string[2]="a";
string[3]="Test";
string[4]="This";
string[5]="This";
string[6]="a";
string[7]="a";
string[8]="a";
string[20]="Test";
switch(i)
{
case(0):this++
break;
case(1):is++
break;
case(2):a++
break;
case(3):test++
break;
}

Trying to enumerate all the different possible strings in the array is generally a bad idea and not the correct way to write your program. Sure your example works, but what would happen if your set of possible strings was not just {'this', 'is', 'a', 'test'}, but instead had say 10000 elements? What if you didn't know exactly what String elements were in the array? As a previous user mentioned, you want to use a HashMap<String, Integer>.
String[] arr = yourStringArray; // wherever your Strings are coming from
Map<String, Integer> strCounts = new HashMap<String, Integer>(); // stores the strings you find and their counts
for (int i = 0; i < arr.length; i++) {
    String str = arr[i];
    if (strCounts.containsKey(str)) {
        strCounts.put(str, strCounts.get(str) + 1); // if you've already seen the String before, increment its count
    } else {
        strCounts.put(str, 1); // otherwise, add the String to the HashMap, along with a count of 1
    }
}
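The same counting can be written more compactly with Map.merge (available since Java 8). This is a sketch with hypothetical names; it skips null entries because the question's array has 20 slots and only some of them are filled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class WordCounter {
    // Counts how often each string occurs, in order of first appearance.
    static Map<String, Integer> countWords(String[] words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String w : words) {
            if (w != null) { // unfilled array slots are null
                counts.merge(w, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = {"This", "is", "a", "Test", "This", "This", "a", "a", "a", "Test"};
        System.out.println(countWords(words)); // prints {This=3, is=1, a=4, Test=2}
    }
}
```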

While I don't really understand what you are trying to do, here are my two cents. You are probably getting an ArrayIndexOutOfBoundsException. This is because an array of size 20 can hold exactly 20 items, at indices 0 to 19. So string[20]="Test"; will give an error because index 20 is out of bounds.

I would do it this way:
Integer count;
String currentWord;
Map<String, Integer> wordMap = new HashMap<>();
StringTokenizer tokenizer = new StringTokenizer(mySentence);
while (tokenizer.hasMoreTokens()) {
    currentWord = tokenizer.nextToken();
    count = wordMap.get(currentWord); // null if the word has not been seen yet
    if (count == null) wordMap.put(currentWord, 1);
    else wordMap.put(currentWord, count + 1);
}

Replacing the output of one method into the output of another method

So I have this method that should read a file and detect whether the text after the symbol is a number or a word. If it is a number, I want to delete the symbol in front of it, translate the number into binary, and replace it in the file. If it is a word, I want to assign it the number 16 the first time, and for each new word that appears I want to add 1 to the previously assigned number.
Here's the input that I'm using:
Here's my method:
try {
ReadFile files = new ReadFile(file.getPath());
String[] anyLines = files.OpenFile();
int i;
int wordValue = 16;
// to keep track words that are already used
Map<String, Integer> wordValueMap = new HashMap<String, Integer>();
for (String line : anyLines) {
if (!line.startsWith("#")) {
continue;
}
line = line.substring(1);
Integer binaryValue = null;
if (line.matches("\\d+")) {
binaryValue = Integer.parseInt(line);
}
else if (line.matches("\\w+")) {
binaryValue = wordValueMap.get(line);
// if the map doesn't contain the word value, then assign and store it
if (binaryValue == null) {
binaryValue = wordValue;
wordValueMap.put(line, binaryValue);
++wordValue;
}
}
// --> I want to replace with this
System.out.println(Integer.toBinaryString(binaryValue));
}
for (i=0; i<anyLines.length; i++) {
// --> Here are a bunch of instructions that replace certain strings - they are the lines after # symbols <--
// --> I'm not going to list them ... <--
System.out.println(anyLines[i]);
So the question is, how do I replace the lines that start with "#", line by line and in order?
I basically want the output to look like this:
101
1110110000010000
10000
1110001100001000
10001
1110101010001000
10001
1111000010001000
10000
1110001110001000
10010
1110001100000110
10011
1110101010000111
10010
1110101010000111

I don't quite understand the logic. If you are simply trying to replace all the # symbols in order, why not read all the numbers into a List in order, until you see a # symbol? Then you can start replacing them in order from that List (or a Queue, since you want first in, first out). Does that satisfy your requirements?
If you must keep the wordValueMap, the code below should loop through the lines after you have populated the wordValueMap and write them to the console. It uses the same logic that you used to populate the map in the first place and outputs the values that should be replaced.
boolean foundAt = false;
for (i=0; i<anyLines.length; i++) {
// --> Here are a bunch of instructions that replace certain strings - they are the lines after # symbols <--
// --> I'm not going to list them ... <--
if (anyLines[i].startsWith("#")) {
foundAt = true;
String theLine = anyLines[i].substring(1);
Integer theInt = null;
if (theLine.matches("\\d+")) {
theInt = Integer.parseInt(theLine);
}
else {
theInt = wordValueMap.get(anyLines[i].substring(1));
}
if(theInt!=null) {
System.out.println(Integer.toBinaryString(theInt));
}
else {
//ERROR
}
}
else if(foundAt) {
System.out.println(anyLines[i]);
}
}
When I run this loop, I get the output you were looking for from your question:
101
1110110000010000
10000
1110001100001000
10001
1110101010001000
10001
1111000010001000
10000
1110001110001000
10010
1110001100000110
10011
1110101010000111
10010
1110101010000111
I hope this helps, but take a look at my question above to see if you can do this in a more straightforward manner.
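The two loops can also be folded into a single pass that builds the word table and emits the replaced lines at the same time. This is only a sketch under assumptions: the class and method names are invented, the lines are taken from an in-memory List rather than the ReadFile helper, and unlike the loop above it passes through every line that does not start with "#", including those before the first "#".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SymbolTranslator {
    // Replaces each "#..." line with the binary form of its value and
    // copies all other lines unchanged.
    static List<String> translate(List<String> lines) {
        Map<String, Integer> wordValues = new HashMap<>();
        int nextWordValue = 16; // the first new word gets 16, as in the question
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (!line.startsWith("#")) {
                out.add(line);
                continue;
            }
            String symbol = line.substring(1);
            int value;
            if (symbol.matches("\\d+")) {
                value = Integer.parseInt(symbol);
            } else {
                Integer known = wordValues.get(symbol);
                if (known == null) {
                    // A word seen for the first time gets the next free number.
                    known = nextWordValue++;
                    wordValues.put(symbol, known);
                }
                value = known;
            }
            out.add(Integer.toBinaryString(value));
        }
        return out;
    }
}
```

With the made-up input lines #5, #i, #sum, #i, the result would be 101, 10000, 10001, 10000: the number is converted directly, and each word keeps the value it was given on first use.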
