How to remove all specific elements from Vector - java

As for the question in the title, I already have a solution, but my approach seems to waste resources by creating extra List objects.
So my question is: Do we have a more efficient approach for this?
In my case, I want to remove the extra space " " and extra "a" elements from a Vector.
My vector includes:
{"a", "rainy", " ", "day", "with", " ", "a", "cold", "wind", "day", "a"}
Here is my code:
List lt = new LinkedList();
lt = new ArrayList();
lt.add("a");
lt.add(" ");
vec1.removeAll(lt);
As you can see, the Vector contains extra spaces. This happens because I use the Vector to read and chunk the words from a Word document, and sometimes the document contains extra spaces caused by human error.

Your current approach does suffer from the problem that deleting an element from a Vector is an O(N) operation ... and you are potentially doing this M times (5 in your example).
Assuming that you have multiple "stop words" and that you can change the data structures, here's a version that should (in theory) be more efficient:
public List<String> removeStopWords(
        List<String> input, HashSet<String> stopWords) {
    List<String> output = new ArrayList<String>(input.size());
    for (String elem : input) {
        if (!stopWords.contains(elem)) {
            output.add(elem);
        }
    }
    return output;
}
// This could be saved somewhere, assuming that you are always filtering
// out the same stopwords.
HashSet<String> stopWords = new HashSet<String>();
stopWords.add(" ");
stopWords.add("a");
... // and more
List<String> newList = removeStopWords(list, stopWords);
Points of note:
The above creates a new list. If you have to reuse the existing list, clear it and then addAll the new list elements. (This is another O(N-M) step ... so don't if you don't have to.)
If there are multiple stop words then using a HashSet will be more efficient; e.g. if done as above. I'm not sure exactly where the break even point is (versus using a List), but I suspect it is between 2 and 3 stopwords.
The above creates a new list, but it only copies N - M elements. By contrast, the removeAll algorithm when applied to a Vector could copy O(NM) elements.
Don't use a Vector unless you need a thread-safe data structure. An ArrayList has a similar internal data structure, and doesn't incur synchronization overheads on each call.
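If the existing list really has to be modified in place, one option on Java 8 or later is Collection.removeIf combined with a HashSet of stop words. The following is only a sketch under that assumption, not part of the answer above; the class name StopWordFilter and the method name removeStopWordsInPlace are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordFilter {
    // Removes, in place, every element of words that occurs in stopWords.
    // Each membership test is a hash lookup instead of a scan of a list.
    static void removeStopWordsInPlace(List<String> words, Set<String> stopWords) {
        words.removeIf(stopWords::contains);
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>(Arrays.asList(
                "a", "rainy", " ", "day", "with", " ", "a", "cold", "wind", "day", "a"));
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", " "));
        removeStopWordsInPlace(words, stopWords);
        System.out.println(words); // prints [rainy, day, with, cold, wind, day]
    }
}
```

The same call works on a Vector, since removeIf is declared on Collection. Whether it beats building a new list depends on the list implementation, so it is worth measuring.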

Related

How to find the number of unique words in array list

So I am trying to create a for loop to find the unique elements in an ArrayList.
I already have an ArrayList filled with user input of 20 places (repeats are allowed), but I am stuck on how to count the number of different places entered in the list, excluding duplicates. (I would like to avoid using hash.)
Input:
[park, park, sea, beach, town]
Output:
[Number of unique places = 4]
Here's a rough example of the code I'm trying to make:
public static void main(String[] args) {
ArrayList<City> place = new ArrayList();
Scanner sc = new Scanner(System.in);
for(...) { // this is just to receive 20 inputs from users using the scanner
...
}
// This is where I am lost on creating a for loop...
}
You can use a Set for that.
https://docs.oracle.com/javase/7/docs/api/java/util/Set.html
Store the list data in the Set. A Set does not hold duplicates, so its size is the number of elements without duplicates.
Use this method to get the set size:
https://docs.oracle.com/javase/7/docs/api/java/util/Set.html#size()
Sample code:
List<String> citiesWithDuplicates =
Arrays.asList(new String[] {"park", "park", "sea", "beach", "town"});
Set<String> cities = new HashSet<>(citiesWithDuplicates);
System.out.println("Number of unique places = " + cities.size());
If you are able to use Java 8, you can use the distinct method of Java streams:
long numOfUniquePlaces = list.stream().distinct().count();
Otherwise, using a set is the easiest solution. Since you don't want to use "hash", use a TreeSet (although HashSet is in most cases the better solution). If that is not an option either, you'll have to manually check for each element whether it's a duplicate or not.
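To make the two options above concrete, here is a small self-contained sketch. The class and method names (UniquePlaces, countWithStream, countWithTreeSet) are made up for this example, and it assumes the places are plain non-null strings.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class UniquePlaces {
    // Java 8+: count() returns a long, not an int.
    static long countWithStream(List<String> places) {
        return places.stream().distinct().count();
    }

    // TreeSet keeps elements sorted and unique by comparison rather than by hash code.
    // It rejects null elements when natural ordering is used.
    static int countWithTreeSet(List<String> places) {
        return new TreeSet<>(places).size();
    }

    public static void main(String[] args) {
        List<String> places = Arrays.asList("park", "park", "sea", "beach", "town");
        System.out.println("Number of unique places = " + countWithStream(places));  // 4
        System.out.println("Number of unique places = " + countWithTreeSet(places)); // 4
    }
}
```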
One way that comes to mind (without using Set or hashvalues) is to make a second list.
ArrayList<City> places = new ArrayList<>();
//Fill array
ArrayList<String> uniquePlaces = new ArrayList<>();
for (City city : places){
if (!uniquePlaces.contains(city.getPlace())){
uniquePlaces.add(city.getPlace());
}
}
//number of unique places:
int uniqueCount = uniquePlaces.size();
Note that this is not super efficient =D
If you do not want to use implementations of the Set or Map interfaces (which would solve your problem in one line of code) and you want to stick with ArrayList, I suggest using something like the Collections.sort() method. It will sort your elements. Then iterate through the sorted list and compare adjacent elements to count the duplicates. This trick can make solving your iteration problem easier.
Anyway, I strongly recommend using one of the implementations of Set interface.
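The sort-then-scan idea described above could look roughly like this. It is a sketch that assumes the elements are non-null Strings; the names SortedUniqueCount and countUnique are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SortedUniqueCount {
    // Counts distinct values by sorting a copy and counting positions
    // where the value differs from its predecessor.
    static int countUnique(List<String> places) {
        List<String> sorted = new ArrayList<>(places); // leave the caller's list untouched
        Collections.sort(sorted);
        int unique = 0;
        for (int i = 0; i < sorted.size(); i++) {
            if (i == 0 || !sorted.get(i).equals(sorted.get(i - 1))) {
                unique++;
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> places = Arrays.asList("park", "park", "sea", "beach", "town");
        System.out.println("Number of unique places = " + countUnique(places)); // 4
    }
}
```

Sorting costs O(n log n), which is better than the O(n^2) of repeated contains() calls but still slower than a HashSet for large inputs.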
Use the following approach. It adds an element to the distinct list only at its last occurrence, so each value ends up in the list exactly once even if it is duplicated several times.
List<String> citiesWithDuplicates = Arrays.asList(new String[] {
"park", "park", "sea", "beach", "town", "park", "beach" });
List<String> distinctCities = new ArrayList<String>();
int currentIndex = 0;
for (String city : citiesWithDuplicates) {
int index = citiesWithDuplicates.lastIndexOf(city);
if (index == currentIndex) {
distinctCities.add(city);
}
currentIndex++;
}
System.out.println("[ Number of unique places = "
+ distinctCities.size() + "]");
Well, if you do not want to use any HashSets or similar options, a quick and dirty nested for-loop like this does the trick (it is just slow as hell if you have a lot of items; 20 would be just fine):
int differentCount = 0;
for (int i = 0; i < place.size(); i++) {
    boolean seenBefore = false;
    // Compare only against earlier elements, so an element never matches itself.
    for (int j = 0; j < i; j++) {
        if (place.get(i).equals(place.get(j))) {
            seenBefore = true;
            break;
        }
    }
    if (!seenBefore)
        differentCount++;
}
System.out.printf("Number of unique places = %d\n", differentCount);

What is the fastest way to find orphans between two large (size ~900K ) Vectors of Strings in Java?

I'm currently working on a Java program that is required to handle large amounts of data. I have two Vectors...
Vector collectionA = new Vector();
Vector collectionB = new Vector();
...and both of them will contain around 900,000 elements during processing.
I need to find all items in collectionB that are not contained in collectionA. Right now, this is how I'm doing it:
for (int i=0;i<collectionA.size();i++) {
if(!collectionB.contains(collectionA.elementAt(i))){
// do stuff if orphan is found
}
}
But this causes the program to run for many hours, which is unacceptable.
Is there any way to tune this so that I can cut my running time significantly?
I think I've read once that using ArrayList instead of Vector is faster. Would using ArrayLists instead of Vectors help for this issue?
Use a HashSet for the lookups.
Explanation:
Currently your program has to test every item in collectionB to see if it is equal to the item in collectionA that it is currently handling (the contains() method will need to check each item).
You should do:
Set<String> set = new HashSet<String>(collectionB);
for (Iterator i = collectionA.iterator(); i.hasNext(); ) {
if (!set.contains(i.next())) {
// handle
}
}
Using the HashSet will help, because the set will calculate a hash for each element and store the element in a bucket associated with a range of hash values. When checking whether an item is in the set, the hash value of the item will directly identify the bucket the item should be in. Now only the items in that bucket have to be checked.
Using a SortedSet like TreeSet would also be an improvement over Vector, since to find the item, only the position it would be in has to be checked, instead of all positions. Which Set implementation would perform best depends on the data.
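As a self-contained illustration of the HashSet lookup described above, here is a minimal sketch. The names OrphanFinder and findOrphans are invented for the example, and it assumes the elements are Strings with the usual equals/hashCode behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class OrphanFinder {
    // Returns the elements of candidates that do not occur in reference,
    // in the order in which they appear in candidates.
    static List<String> findOrphans(Collection<String> candidates, Collection<String> reference) {
        Set<String> lookup = new HashSet<>(reference); // built once, O(size of reference)
        List<String> orphans = new ArrayList<>();
        for (String candidate : candidates) {
            if (!lookup.contains(candidate)) {         // expected O(1) per lookup
                orphans.add(candidate);
            }
        }
        return orphans;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("x", "y", "z");
        List<String> b = Arrays.asList("y");
        System.out.println(findOrphans(a, b)); // prints [x, z]
    }
}
```

With two collections of n elements each, this does roughly n hash insertions and n hash lookups instead of up to n * n equality checks.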
If ordering of the elements doesn't matter, I would go for HashSets, and do it as follows:
Set<String> a = new HashSet<>();
Set<String> b = new HashSet<>();
// ...
b.removeAll(a);
So in essence, you're removing from set b all the elements that are in set a, leaving the asymmetric set difference. Note that the removeAll method does modify set b, so if that's not what you want, you would need to make a copy first.
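If set b must stay intact, the copy mentioned above can be made inside a small helper. This is a sketch only; SetDifference and difference are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SetDifference {
    // Returns a new set holding the elements of b that are not in a.
    // Neither argument is modified.
    static Set<String> difference(Set<String> b, Set<String> a) {
        Set<String> result = new HashSet<>(b); // defensive copy of b
        result.removeAll(a);
        return result;
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<>(Arrays.asList("park", "sea"));
        Set<String> b = new HashSet<>(Arrays.asList("sea", "beach", "town"));
        System.out.println(difference(b, a).size()); // prints 2
        System.out.println(b.size());                // prints 3, b is unchanged
    }
}
```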
To find out whether HashSet or TreeSet is more efficient for this type of operation, I ran the below code with both types, and used Guava's Stopwatch to measure execution time.
@Test
public void perf() {
Set<String> setA = new HashSet<>();
Set<String> setB = new HashSet<>();
for (int i=0; i < 900000; i++) {
String uuidA = UUID.randomUUID().toString();
String uuidB = UUID.randomUUID().toString();
setA.add(uuidA);
setB.add(uuidB);
}
Stopwatch stopwatch = Stopwatch.createStarted();
setB.removeAll(setA);
System.out.println(stopwatch.elapsed(TimeUnit.MILLISECONDS));
}
On my modest development machine, using Oracle JDK 7, the TreeSet variant is about 4 times slower (~450ms) than the HashSet variant (~105ms).

How to add to an arraylist of linkedlists?

I am sorry if this is a stupid question but I am new to Java linkedlists and arraylists.
What I wish to do is this:
I have a text file that I run through word for word. I want to create an ArrayList of LinkedLists, with each unique word in the text followed in its linked list by the words that follow it in the text.
Consider this piece of text: The cat walks to the red tree.
I want the Arraylist of LinkedLists to be like this:
The - cat - red
|
cat - walks
|
to - the
|
red - tree
What I have now is this:
while(dataFile.hasNext()){
secondWord = dataFile.next();
nWords++;
if(nWords % 1000 ==0) System.out.println(nWords+" words");
//and put words into list if not already there
//check if this word is already in the list
if(follows.contains(firstWord)){
//add the next word to it's linked list
((LinkedList)(firstWord)).add(secondWord);
}
else{
//create new linked list for this word and then add next word
follows.add(new LinkedList<E>().add(firstWord));
((LinkedList)(firstWord)).add(secondWord);
}
//go on to next word
firstWord = secondWord;
}
And it gives me plenty of errors.
What is the best way to do this? (With linked lists. I know hashtables and binary trees are better, but I need to use linked lists.)
An ArrayList is not the best data structure for purpose of your outer list, and at least part of your difficulty stems from incorrect use of a list of lists.
In your implementation, presumably follows is an ArrayList of LinkedLists declared like this:
ArrayList<LinkedList<String>> follows = new ArrayList<>();
The result of follows.contains(firstWord) will never be true, because follows contains elements of type LinkedList, not String. firstWord is a String, and so would not be an element of follows, but would be the first element of a LinkedList which is an element of follows.
The solution offered below uses a Map, or more specifically a HashMap, for the outer list follows. A Map is preferable because when searching for the first word, the amortized look-up time will be O(1) using a map versus O(n) for a list.
String firstWord = dataFile.next().toLowerCase();
Map<String, List<String>> follows = new HashMap<>();
int nWords = 0;
while (dataFile.hasNext())
{
String secondWord = dataFile.next().toLowerCase();
nWords++;
if (nWords % 1000 == 0)
{
System.out.println(nWords + " words");
}
//and put words into list if not already there
//check if this word is already in the list
if (follows.containsKey(firstWord))
{
//add the next word to its linked list
List<String> list = follows.get(firstWord);
if (!list.contains(secondWord))
{
list.add(secondWord);
}
}
else
{
//create new linked list for this word and then add next word
List<String> list = new LinkedList<String>();
list.add(secondWord);
follows.put(firstWord, list);
}
//go on to next word
firstWord = secondWord;
}
The map will look like this:
the: [cat, red]
cat: [walks]
to: [the]
red: [tree]
walks: [to]
I also made the following changes to your implementation:
Don't add duplicates to the list of following words. Note that a Set would be a more appropriate data structure for this task, but you clearly state that a requirement is to use LinkedList.
Use String.toLowerCase() to move all strings to lower case, so that "the" and "The" are treated equivalently. (Be sure you apply this to the initial value of firstWord as well, which doesn't appear in the code you provided.)
Note that both this solution and your original attempt assume that punctuation has already been removed.
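If the outer structure really must be an ArrayList of LinkedLists, as the question asks, one way is to store the head word as the first element of each inner list and search for it explicitly. The sketch below is an illustration under that assumption, not the answerer's code; FollowLists, findList and build are invented names, and the text is assumed to be whitespace-separated with punctuation already removed.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class FollowLists {
    // Linear search for the inner list whose first element is word; null if absent.
    static LinkedList<String> findList(List<LinkedList<String>> follows, String word) {
        for (LinkedList<String> list : follows) {
            if (list.getFirst().equals(word)) {
                return list;
            }
        }
        return null;
    }

    static List<LinkedList<String>> build(String text) {
        List<LinkedList<String>> follows = new ArrayList<>();
        String[] words = text.trim().toLowerCase().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            LinkedList<String> list = findList(follows, words[i]);
            if (list == null) {
                list = new LinkedList<>();
                list.add(words[i]);          // head word goes first
                follows.add(list);
            }
            // Skip the head (index 0) when checking for an already recorded follower.
            if (!list.subList(1, list.size()).contains(words[i + 1])) {
                list.add(words[i + 1]);
            }
        }
        return follows;
    }

    public static void main(String[] args) {
        System.out.println(build("The cat walks to the red tree"));
        // prints [[the, cat, red], [cat, walks], [walks, to], [to, the], [red, tree]]
    }
}
```

Each lookup is O(n) in the number of distinct words, which is exactly the cost the Map-based version avoids.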
You should not work directly with concrete class implementations; use their interfaces instead to ease development (among other reasons). So, instead of type casting every now and then, declare your variable as List and only name the class when initializing it. Since you haven't posted the relevant code to redefine it, here is an example of this:
List<List<String>> listOfListOfString = new LinkedList<>(); //assuming Java 7 or later used
List<String> listOne = new ArrayList<>();
listOne.add("hello");
listOne.add("world");
listOfListOfString.add(listOne);
List<String> listTwo = new ArrayList<>();
listTwo.add("bye");
listTwo.add("world");
listOfListOfString.add(listTwo);
for (List<String> list : listOfListOfString) {
System.out.println(list);
}
This will print:
[hello, world]
[bye, world]
Note that now you can change the implementation of any of listOne or listTwo to LinkedList:
List<String> listOne = new LinkedList<>();
//...
List<String> listTwo = new LinkedList<>();
And the code will behave the same. No need to do any typecast to make it work.
Related:
What does it mean to "program to an interface"?

How to transform each set of two elements of a source list into a transformed list?

I have a List<String> with elements like:
"<prefix-1>/A",
"<prefix-1>/B",
"<prefix-2>/A",
"<prefix-2>/B",
"<prefix-3>/A",
"<prefix-3>/B",
that is, for every <prefix>, there are two entries: <prefix>/A, <prefix>/B. (My list is already sorted, the prefixes might have different length.)
I want the list of prefixes:
"<prefix-1>",
"<prefix-2>",
"<prefix-3>",
What is a good way to transform a source list, when multiple (but always a constant amount of elements) correspond to one element in the transformed list?
Thank you for your consideration
If the prefixes are always a constant length, you can trim them out and put them into a Set:
List<String> elements = // initialize here
Set<String> prefixes = new HashSet<String>();
for( String element : elements) {
String prefix = element.substring(0,"<prefix-n>".length());
prefixes.add(prefix);
}
// Prefixes now has a unique set of prefixes.
You can do the same thing with regular expressions if you have a variable length prefix, or if you have more complex conditions.
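For variable-length prefixes, the regular-expression variant mentioned above might be sketched as follows. It assumes the prefix is everything before the last '/' in each element; PrefixExtractor and prefixes are hypothetical names, and a LinkedHashSet is used here (instead of the HashSet above) so that the original order is preserved.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixExtractor {
    // Group 1 captures everything before the last '/' (assumed prefix format).
    private static final Pattern PREFIX = Pattern.compile("^(.*)/[^/]*$");

    static Set<String> prefixes(List<String> elements) {
        Set<String> result = new LinkedHashSet<>(); // unique, insertion-ordered
        for (String element : elements) {
            Matcher m = PREFIX.matcher(element);
            if (m.matches()) {                      // elements without '/' are skipped
                result.add(m.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> elements = Arrays.asList(
                "<prefix-1>/A", "<prefix-1>/B", "<prefix-22>/A", "<prefix-22>/B");
        System.out.println(prefixes(elements)); // prints [<prefix-1>, <prefix-22>]
    }
}
```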
Here is a solution that does not change the order of prefixes in the result. Since the elements are pre-sorted, you can take elements until you find a prefix that differs from the last taken element, and add new elements to the result, like this:
List<String> res = new ArrayList<String>();
String last = null;
for (String s : src) {
String cand = s.substring(0, s.lastIndexOf('/'));
// initially, last is null, so the first item will always be taken
if (!cand.equals(last)) {
// The assignment of last happens together with addition.
// If you think it's not overly readable, you can move it out.
res.add(last = cand);
}
}
Here is a demo on ideone.
If the number of structurally similar elements is always the same, then you can just loop over the beginning of the list to find out this number, and then skip elements to construct the rest.
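One reading of this suggestion, sketched under the assumptions that every element contains a '/' and that every prefix has the same number of entries (GroupedPrefixes, prefixOf and prefixes are invented names):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GroupedPrefixes {
    // Assumes s contains at least one '/'; otherwise substring throws.
    static String prefixOf(String s) {
        return s.substring(0, s.lastIndexOf('/'));
    }

    static List<String> prefixes(List<String> src) {
        List<String> result = new ArrayList<>();
        if (src.isEmpty()) {
            return result;
        }
        // Determine the group size from the first run of equal prefixes.
        String first = prefixOf(src.get(0));
        int groupSize = 1;
        while (groupSize < src.size() && prefixOf(src.get(groupSize)).equals(first)) {
            groupSize++;
        }
        // Then take one element per group and skip the rest.
        for (int i = 0; i < src.size(); i += groupSize) {
            result.add(prefixOf(src.get(i)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> src = Arrays.asList(
                "<prefix-1>/A", "<prefix-1>/B",
                "<prefix-2>/A", "<prefix-2>/B",
                "<prefix-3>/A", "<prefix-3>/B");
        System.out.println(prefixes(src)); // prints [<prefix-1>, <prefix-2>, <prefix-3>]
    }
}
```

If the groups turn out to have different sizes, this silently produces wrong results, so the comparison-based loop shown earlier is the safer choice.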
public List<String> getMyList(String prefix) {
    List<String> selected = new ArrayList<String>();
    for (String s : mainList) {
        // or .contains(), depending on what you want exactly
        if (s.endsWith(prefix.toLowerCase()))
            selected.add(s);
    }
    return selected;
}

List filtering : recreate from empty list, or copy and delete elements?

I have an ArrayList, and I need to filter it (only to remove some elements).
I can't modify the original list.
What is my best option regarding performances :
Recreate another list from the original one, and remove items from it :
code :
List<Foo> newList = new ArrayList<Foo>(initialList);
for (Foo item : initialList) {
if (...) {
newList.remove(item);
}
}
Create an empty list, and add items :
code :
List<Foo> newList = new ArrayList<Foo>(initialList.size());
for (Foo item : initialList) {
if (...) {
newList.add(item);
}
}
Which of these options is the best ? Should I use anything else than ArrayList ? (I can't change the type of the original list though)
As a side note, approximatively 80% of the items will be kept in the list. The list contains from 1 to around 20 elements.
Best option is to go with what is easiest to write and maintain.
If performance is a problem, you should profile the application afterwards rather than optimize prematurely.
In addition, I'd use filtering from a library like google-collections or commons collections to make the code more readable:
Collection<T> newCollection = Collections2.filter(initialList, new Predicate<T>() {
    public boolean apply(T item) {
        return (...); // apply your test here
    }
});
Anyway, as it seems you are optimizing for the performance, I'd go with System.arraycopy if you indeed want to keep most of the original items:
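String[] src = initialList.toArray(new String[initialList.size()]);
String[] arr = new String[src.length];
int dstIndex = 0, blockStartIdx = 0, blockSize = 0;
for (int currIdx = 0; currIdx < src.length; currIdx++) {
    String item = src[currIdx];
    if (item.length() <= 4) {
        // Item is filtered out: copy the pending block of kept items.
        if (blockSize > 0)
            System.arraycopy(src, blockStartIdx, arr, dstIndex, blockSize);
        dstIndex += blockSize;
        blockSize = 0;
    } else {
        if (blockSize == 0)
            blockStartIdx = currIdx;
        blockSize++;
    }
}
// Copy the final block, which is still pending if the list ends with kept items.
if (blockSize > 0)
    System.arraycopy(src, blockStartIdx, arr, dstIndex, blockSize);
dstIndex += blockSize;
// Only the first dstIndex slots of arr are filled.
List<String> newList = new ArrayList<String>(Arrays.asList(arr).subList(0, dstIndex));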
It seems to be about 20% faster than your option 3. Even more so (40%) if you can skip the new ArrayList creation at the end.
See: http://pastebin.com/sDhV8BUL
You might want to go with creating a new list from the initial one and removing from it. There would be fewer method calls that way, since you're keeping ~80% of the original items.
Other than that, I don't know of any way to filter the items.
Edit: Apparently Google Collections has something that might interest you?
As @Sanjay says, "when in doubt, measure". But creating an empty ArrayList and then adding items to it is the most natural implementation and your first goal should be to write clear, understandable code. And I'm 99.9% sure it will be the faster one too.
Update: By copying the old List to a new one and then striking out the elements you don't want, you incur the cost of element removal. The ArrayList.remove() method needs to iterate up to the end of the array on each removal, copying each reference down a position in the list. This almost certainly will be more expensive than simply creating a new ArrayList and adding elements to it.
Note: Be sure to allocate the new ArrayList to an initial capacity set to the size of the old List to avoid reallocation costs.
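The "create an empty list and add" option, with the capacity hint from the note above, can be wrapped in a small generic helper. This sketch assumes Java 8 for java.util.function.Predicate; ListFilter and filter are hypothetical names, not an existing library API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class ListFilter {
    // Returns a new list holding the elements of source accepted by keep.
    // source itself is never modified.
    static <T> List<T> filter(List<T> source, Predicate<? super T> keep) {
        // Pre-size to the source size so the backing array never has to grow.
        List<T> result = new ArrayList<>(source.size());
        for (T item : source) {
            if (keep.test(item)) {
                result.add(item);   // appending at the end shifts nothing
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> places = Arrays.asList("park", "sea", "beach", "town");
        System.out.println(filter(places, s -> s.length() > 4)); // prints [beach]
    }
}
```

Each add appends at the end of the backing array, whereas each remove on a copy has to shift every later element down by one.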
The second is faster (iterate and add to the second list as needed). Note that the first avoids a ConcurrentModificationException only because it iterates over initialList while removing from the copy; removing from the same list you are iterating over would throw.
As for the result type, that depends on what you are going to need the filtered list for.
I'd first follow the age old advice; when in doubt, measure.
Should I use anything else than ArrayList ?
That depends on what kind of operations you would be performing on the filtered list, but ArrayList is usually a good bet unless you are doing something which really shouldn't be backed by a contiguous list of elements (i.e. arrays).
List newList = new ArrayList(initialList.size());
I don't mean to nitpick, but if your new list won't exceed 80% of the initial size, why not fine tune the initial capacity to ((int)(initialList.size() * .8) + 1)?
Since I'm only getting suggestions here, I decided to run my own benchmark to be sure.
Here are the conclusions (with an ArrayList of String).
Solution 1, remove items from the copy : 2400 ms.
Solution 2, create an empty list and fill it : 1600 ms. newList = new ArrayList<Foo>();
Solution 3, same as 2, except you set the initial size of the List : 1530 ms. newList = new ArrayList<Foo>(initialList.size());
Solution 4, same as 2, except you set the initial size of the List + 1 : 1500 ms. newList = new ArrayList<Foo>(initialList.size() + 1); (as explained by @Soronthar)
Source : http://pastebin.com/c2C5c9Ha
