HashSet duplicate values but I don't have custom objects - java

I am using the jsoup parser to extract my anchor tags, and then I add the links to a HashSet.
The code is as follows.
I am posting my entire code. I understand the issue arises because I am using toString and the value would change. My goal is, when I get a bunch of links, to eliminate near-duplicates such as http://cse.syr.edu and http://cse.syr.edu/ so that my HashSet contains only unique elements. How could I do this?
for (Element link : links)
{
    String test = link.attr("abs:href");
    if (!test.contains("http://cse.syr.edu"))
        continue;
    else if (h.isEmpty()) {
        h.add(test);
    }
    else if (h.contains(test) || h.contains(test + "/")) // I now removed (test+"/")
        continue;
    else {
        h.add(test);
    }
}
I have updated my question now. Thanks, RJ.

There's probably whitespace in your Strings. HashSet works just fine.

If we are talking about java.util.HashSet, the most likely explanation is that your diagnosis of the problem is incorrect. Make sure that the strings in the set are indeed identical (and not subtly different), and that you are not accidentally re-creating or clearing the HashSet between adding identical strings.
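Both answers suggest the strings only look identical. A minimal sketch of normalizing each link before adding it, assuming the only differences that matter are surrounding whitespace and one trailing slash (LinkNormalizer, normalize and uniqueLinks are hypothetical names, not part of jsoup):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class LinkNormalizer {
    // Hypothetical helper: trims whitespace and drops a single trailing slash.
    static String normalize(String url) {
        String s = url.trim();
        if (s.endsWith("/")) {
            s = s.substring(0, s.length() - 1);
        }
        return s;
    }

    // Adds the normalized form of every link; the set ignores repeats.
    static Set<String> uniqueLinks(List<String> urls) {
        Set<String> unique = new LinkedHashSet<>();
        for (String url : urls) {
            unique.add(normalize(url));
        }
        return unique;
    }
}
```

With this, http://cse.syr.edu and http://cse.syr.edu/ collapse into one entry. Links that differ in other ways (case, query strings, fragments) would still count as distinct.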

Related

Java - A loop in a loop and the format gets lost

I'm French, so please excuse my English if it is not entirely correct.
To explain the context: I currently have a list of String arrays named "tempCustomerDrugsIdsList" (var1) and another list of String arrays named "tempDrugsTableList" (var2).
When I run a "for" loop over "var1" and then another one over "var2", "var2" loses its format, i.e. upper case is replaced by lower case and spaces are deleted.
I tested with another loop over variables of the same type (but empty), and the result was the same, so I think the problem comes from the way I am using Java. Having worked in VB.NET before, I must have picked up some bad habits!
I don't know how to solve this problem; I've only been working in Java for 2 weeks.
Thank you for helping me.
[EDIT]
My problem was:
List<String[]> tempDrugsTableList = otherList;
But this code doesn't duplicate the list.
AxelH gave me the following solution:
List<String[]> tempDrugsTableList = new ArrayList<String[]>(otherList);
Well, you are not making a "copy" of the list with
tempDrugsTableListCopy = tempDrugsTableList; // Get copy of original tempDrugsTableList for comparison
You are only sharing the reference: every update made through tempDrugsTableListCopy is also made to the original list (same reference, same address in memory). Since you are updating that copy in the following loops, you update the original list too. What you want is to clone the list.
You could do it simply with copyList = new ArrayList<>(originalList); or, for a deep clone, you need to iterate over the elements and duplicate each one. (The arrays need to be duplicated too if you change the values in them.)
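The difference between a shallow copy and a deep copy of a List<String[]> can be sketched as follows. ListCopyDemo and its method names are made up for illustration; cloning each array is enough here only because String objects are immutable:

```java
import java.util.ArrayList;
import java.util.List;

class ListCopyDemo {
    // Shallow copy: a new list, but it still holds the same String[] objects.
    static List<String[]> shallowCopy(List<String[]> original) {
        return new ArrayList<>(original);
    }

    // Deep copy: a new list holding a clone of every array.
    static List<String[]> deepCopy(List<String[]> original) {
        List<String[]> copy = new ArrayList<>(original.size());
        for (String[] row : original) {
            copy.add(row.clone());
        }
        return copy;
    }
}
```

Adding or removing rows in a shallow copy leaves the original list alone, but writing into one of its arrays still changes the original. Only the deep copy is safe to modify freely.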
"String[]" tmpCustomerIds means you are getting a string array from a string array, which you would be using in a 2d array. Try it with just "String" in the for each loops. I am assuming you are using 1d arrays in this case.

Proper mathematical way to explain in a comment that there are no duplicate items in a set

I'm writing some code, and I want it to be well documented.
There is a part of the code where I check that, when there is an attempt to insert a new element E into a list L, the element E is unique (so no other element in L is equal to it).
I'm having difficulty writing a user-friendly mathematical comment, something that looks like the example below.
Suppose a function changes the field E.color to "Black" for every element E in list L, but only if E.size > 10.
In that case I would write the comment:
[ X.color="Black" | X in L, X.size > 10]
But for the scenario above I couldn't find any satisfactory mathematical comment.
A mathematical set by definition has no duplicates, so perhaps using a set rather than a list would solve your problem.
However, if that's too hard to change now, then you could write something like:
[ L.insert(E) | E not in L ]
where E is the element and L is the list.
An exhaustive answer to your question requires two observations:
Best coding practices require you to know the collections well and when to use each one, so that you pick the right collection for the right job. In this case, as advised in other comments, you need to use a Set instead of a List. A HashSet uses a HashMap under the hood, with your elements as keys and a shared dummy object as every value. Every time you add an element to the Set, its hash code is calculated and the element is compared using equals to the existing elements in the same bucket, so no duplicates are allowed.
I really appreciate the fact that you want to write good comments; however, you don't need to here, for the following reasons:
List and Set behaviour is already extensively documented, so nobody expects you to comment on it;
Books like Refactoring and Clean Code teach us that good code should never need comments, as it should be self-explanatory. That means that your method/class/variable names should tell me what the code is doing.
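The two options discussed above, guarding the insert on a List and letting a Set reject duplicates, can be sketched like this. UniqueInsertDemo and insertIfAbsent are hypothetical names chosen so that the method name carries the meaning instead of a comment:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class UniqueInsertDemo {
    // [ L.insert(E) | E not in L ]: add E only when no equal element exists.
    // Returns true if the element was inserted.
    static <T> boolean insertIfAbsent(List<T> list, T element) {
        if (list.contains(element)) {
            return false;
        }
        list.add(element);
        return true;
    }

    public static void main(String[] args) {
        List<String> list = new ArrayList<>();
        System.out.println(insertIfAbsent(list, "E")); // true
        System.out.println(insertIfAbsent(list, "E")); // false

        // A Set gives the same guarantee without the explicit check.
        Set<String> set = new HashSet<>();
        System.out.println(set.add("E")); // true
        System.out.println(set.add("E")); // false
    }
}
```

The List version scans the whole list on every insert, so the Set is usually the better fit unless the order or index access of a List is needed.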

Efficient data structure that checks for existence of String

I am writing a program which will add a growing number of unique strings to a data structure. Once this is done, I later need to constantly check for the existence of a string in it.
If I were to use an ArrayList, I believe checking for the existence of some specified string would iterate through all items until a matching string is found (or reach the end and return false).
However, with a HashMap I know that I can use the String as the key and store any non-null object as the value, making the lookup constant time and therefore faster. However, I am not keen on filling a HashMap where the value is completely arbitrary. Is there a readily available data structure that uses hash functions, but doesn't require a value to be placed?
If I were to use an ArrayList I believe checking for the existence of some specified string would iterate through all items until a matching string is found
Correct, checking a list for an item is linear in the number of entries of the list.
However, I am not keen on filling a HashMap where the value is completely arbitrary
You don't have to: Java provides a HashSet<T> class, which is very much like a HashMap without the value part.
You can put all your strings there, and then check for the presence or absence of other strings in expected constant time:
Set<String> knownStrings = new HashSet<String>();
... // Fill the set with strings
if (knownStrings.contains(myString)) {
    ...
}
It depends on many factors, including the number of strings you have to feed into that data structure (do you know the number in advance, or have a rough idea?), and what you expect the hit/miss ratio to be.
A very efficient data structure to use is a trie or a radix tree; they are basically made for that. For an explanation of how they work, see the Wikipedia entry (a follow-up to the radix tree definition is in this page). There are Java implementations (one of them is here; however, I have a fixed set of strings to inject, which is why I use a builder).
If your number of strings is really huge and you expect a significant share of lookups to miss, then you might also consider using a Bloom filter. The catch is that it is probabilistic: it can report false positives, but a "not there" answer is always correct and very quick. Here also, there are implementations in Java (Guava has an implementation for instance).
Otherwise, well, a HashSet...
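For readers who have not seen a trie, here is a minimal sketch of the idea, not the implementation this answer links to. SimpleTrie is a hypothetical class that stores one node per character and marks the nodes where a word ends:

```java
import java.util.HashMap;
import java.util.Map;

class SimpleTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord;
    }

    private final Node root = new Node();

    // Walks down the tree, creating missing nodes, and marks the last one.
    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.endOfWord = true;
    }

    // True only if the exact word was added, not merely a prefix of one.
    boolean contains(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.endOfWord;
    }
}
```

Lookup cost depends on the length of the searched string rather than on how many strings are stored. A map per node is memory-hungry, though, which is one reason compact radix trees exist.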
A HashSet is probably the right answer, but if you choose (e.g. for simplicity) to search a list, it's probably more efficient to concatenate your words into a String with separators:
String wordList = "$word1$word2$word3$word4$...";
Then create a search argument with your word between the separators:
String searchArg = "$" + searchWord + "$";
Then search with, say, contains:
boolean wordFound = wordList.contains(searchArg);
You can maybe make this a tiny bit more efficient by using StringBuilder to build the searchArg.
As others mentioned, HashSet is the way to go. But if the size is going to be large and you are fine with false positives (e.g. when checking whether a username exists), you can use a Bloom filter (a probabilistic data structure) as well.

How to compare two paragraphs of text?

I need to remove duplicated paragraphs from a text with many paragraphs.
I use the class java.security.MessageDigest to calculate each paragraph's MD5 hash value, and then add these hash values to a Set.
If add() returns false, it means the latest paragraph is a duplicate.
Is there any risk in doing it this way?
Apart from String.equals(), is there any other way to do it?
Before hashing, you could normalize the paragraphs, e.g. by removing punctuation, converting to lower case and collapsing extra whitespace.
After normalizing, paragraphs that differ only in those respects would get the same hash.
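One possible normalization, as a rough sketch: ParagraphNormalizer is a hypothetical helper, and it assumes that ASCII punctuation, letter case and runs of whitespace are the only differences to ignore.

```java
import java.util.Locale;

class ParagraphNormalizer {
    // Lower-cases, strips ASCII punctuation, and collapses whitespace runs.
    static String normalize(String paragraph) {
        return paragraph
                .toLowerCase(Locale.ROOT)
                .replaceAll("\\p{Punct}", "")
                .replaceAll("\\s+", " ")
                .trim();
    }
}
```

The normalized string is what gets hashed or added to the set. The original paragraph should be kept alongside it if the output must preserve the author's formatting.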
If the MD5 hash is not yet in the set, it means the paragraph is unique. But the opposite is not true. So if you find that the hash is already in the set, you could potentially have a non-duplicate with the same hash value. This would be very unlikely, but you'll have to test that paragraph against all others to be sure. For that String.equals would do.
Moreover, you should consider carefully what you call unique (regarding typos, whitespace, capitals, and so on), but that would be the case with any method.
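The check described above, comparing by MD5 first and confirming with String.equals on a hash match, might look like the following sketch. Md5Dedup and its method names are hypothetical, and UTF-8 is an assumed encoding:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Md5Dedup {
    // Returns the MD5 digest of the text as a lower-case hex string.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Keeps the first occurrence of each paragraph, in original order.
    static List<String> removeDuplicates(List<String> paragraphs) {
        Map<String, List<String>> seenByHash = new HashMap<>();
        List<String> unique = new ArrayList<>();
        for (String p : paragraphs) {
            List<String> sameHash =
                    seenByHash.computeIfAbsent(md5Hex(p), k -> new ArrayList<>());
            // The equals-based contains() guards against hash collisions.
            if (!sameHash.contains(p)) {
                sameHash.add(p);
                unique.add(p);
            }
        }
        return unique;
    }
}
```

Because every candidate is confirmed with equals, a collision can never cause a distinct paragraph to be dropped. It only costs one extra comparison.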
There's no need to calculate the MD5 hash; just use a HashSet and put the strings themselves into this set. This will use the String#hashCode() method to compute a hash value for each String and check if it's already in the set.
public Set<String> removeDuplicates(String[] paragraphs) {
    // A Set silently ignores elements that are already present
    Set<String> set = new LinkedHashSet<String>();
    for (String p : paragraphs) {
        set.add(p);
    }
    return set;
}
Using a LinkedHashSet even keeps the original order of the paragraphs.
As others have suggested, you should be aware that minute differences in punctuation, white space, line breaks etc. may render your hashes different for paragraphs that are essentially the same.
Perhaps you should consider a less brittle metric, such as cosine similarity, which is well suited for matching paragraphs.
Cheers,
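As a rough illustration of the cosine similarity idea, here is a sketch over simple word counts. CosineSimilarity is a hypothetical class, splitting on non-word characters is an assumed tokenization, and the threshold for calling two paragraphs duplicates is left open:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class CosineSimilarity {
    // Counts how often each lower-cased word occurs in the text.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of the word-count vector.
    static double norm(Map<String, Integer> counts) {
        double sum = 0;
        for (int c : counts.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between the two word-count vectors (0.0 to 1.0).
    static double similarity(String a, String b) {
        Map<String, Integer> ca = termCounts(a);
        Map<String, Integer> cb = termCounts(b);
        double dot = 0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            Integer other = cb.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double na = norm(ca);
        double nb = norm(cb);
        if (na == 0 || nb == 0) {
            return 0.0; // at least one text has no words
        }
        return dot / (na * nb);
    }
}
```

Unlike a hash comparison, this requires comparing paragraphs pairwise, so it scales quadratically with the number of paragraphs unless candidates are pre-filtered.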
I think this is a good way. However, there are some things to keep in mind:
Please note that calculating a cryptographic hash such as MD5 is a comparatively heavy operation. This could make your program slow if you had to repeat it for millions of paragraphs.
Even with this approach, slightly different paragraphs (with typos, for example) could go undetected. If this is a concern, you should normalize the paragraphs before calculating the hash (converting them to lower case, removing extra spaces and so on).

Efficient way to implement a String Array "is in" method using Java

I have a requirement to present highly structured information picked from a highly un-structured web service. In order to display the info correctly, I have to do a lot of String matches and duplicate removals to ensure I'm picking the right combination of elements.
One of my challenges involves determining if a String is in an Array of Strings.
My dream is to do "searchString.isIn(stringArray);" but I realize the String class doesn't provide for that.
Is there a more efficient way of doing this beyond this stub?:
private boolean isIn(String searchString, String[] searchArray)
{
    for (String singleString : searchArray)
    {
        if (singleString.equals(searchString))
            return true;
    }
    return false;
}
Thanks!
You may want to look into HashMap or HashSet, both of which give constant time retrieval, and it's as easy as going:
hashSet.contains(searchString)
Additionally, HashSet (and HashMap for its keys) prevents duplicate elements.
If you need to keep them in order of insertion, you can look into their Linked counterparts, and if you need to keep them sorted, TreeSet and TreeMap can help (note, however, that the TreeSet and TreeMap do not provide constant time retrieval).
Everybody else seems to be viewing this question in a broader scope (which is certainly valid). I am only answering this bit:
One of my challenges involves determining if a String is in an Array of Strings.
That's simple:
return Arrays.asList(arr).contains(str);
Reference:
Arrays.asList(array)
If you are doing this a lot, you can initially sort the array and do a binary search for your strings.
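The sort-once, search-many approach can be sketched with java.util.Arrays. SortedLookup is a hypothetical wrapper, and it assumes the array contains no null elements:

```java
import java.util.Arrays;

class SortedLookup {
    private final String[] sorted;

    // Copies and sorts the input once so the caller's array is left untouched.
    SortedLookup(String[] values) {
        sorted = values.clone();
        Arrays.sort(sorted);
    }

    // Binary search: a non-negative result is the index of a match.
    boolean contains(String s) {
        return Arrays.binarySearch(sorted, s) >= 0;
    }
}
```

Each lookup is logarithmic in the array length. The sort only pays off if the array is searched many times and does not change between searches.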
As mentioned, a HashMap or HashSet can provide better performance than what you've mentioned. It depends greatly on how well your hash function distributes the keys and how many buckets are in the Map.
You could also keep a sorted list and perform a binary search on that list which could perform slightly better, though you pay the cost of sorting. If it's a one time sort, then that's not a big deal. If the list is constantly changing, you may pay a larger cost.
Lastly, you could consider a Trie structure. I think this would be the fastest way to search, but that's a gut reaction. I don't have the numbers to support that.
As explained before, you can use a Set (see http://download.oracle.com/javase/1.5.0/docs/api/java/util/Set.html and especially the boolean contains(Object o) method) for that purpose. Here is a quick 'n dirty example that demonstrates this:
String[] a = {"a", "2"};
Set<String> hashSet = new HashSet<String>();
Collections.addAll(hashSet, a);
System.out.println(hashSet.contains("a")); // Returns true
System.out.println(hashSet.contains("2")); // Returns true
System.out.println(hashSet.contains("e")); // Returns false
Hope this helps ;)
As Zach has pointed out, you can use a HashSet to prevent duplicates, and use its contains method to search for a string; it returns true when a match is found. If you store objects of your own class rather than Strings, you also need to override equals (and hashCode, consistently with it) in your class.
@Override
public boolean equals(Object other) {
    return other instanceof L && this.l == ((L) other).l;
}
If the search space (your collection of strings) is limited, then I agree with the answers already posted. If, however, you have a large collection of strings and need to perform a sufficient number of searches on it (to outweigh the setup overhead), you might also consider encoding the search strings in a trie data structure. Again, this would only be advantageous if there are enough strings and you search enough times to justify the setup overhead.
