creating dataset in weka if values are lists/arrays

creating dataset in weka if values are lists/arrays - java

if the elements are just nominal or string values we can use Instance object to represent that particular instance. Also for the Instances dataset we can get attributes by pre-defining them. But i have question. what is the approach if we want to use collections as values to the attribute elements?
for ex:
weka.core.Attribute attribute1 = new weka.core.Attribute("list1");
weka.core.Attribute attribute2 = new weka.core.Attribute("list2");
weka.core.Attribute classAttribute = new weka.core.Attribute("Function");
FastVector fvWekaAttributes = new FastVector(3);
fvWekaAttributes.addElement(attribute1);
fvWekaAttributes.addElement(attribute2);
fvWekaAttributes.addElement(classAttribute);
is the way we create the attributes if two are nominal values and one is string(class). and the way we add elements in to any dataset(ex:trainInstances), we create Instance object and add like this:
Instance iExample = new Instance(3);
iExample.setValue((weka.core.Attribute)fvWekaAttributes.elementAt(0), 10);
iExample.setValue((weka.core.Attribute)fvWekaAttributes.elementAt(0), 15);
iExample.setValue((weka.core.Attribute)fvWekaAttributes.elementAt(2), "F1");
trainInstances.add(iExample);
this is ok, but what should i use to store Lists/collections instead of single nominal values. I want to do like this:
int[] list1={10,20,30,40};
int[] list2={90,80,70,60};
iExample.setValue((weka.core.Attribute)fvWekaAttributes.elementAt(0), **list1**);
iExample.setValue((weka.core.Attribute)fvWekaAttributes.elementAt(0), **list2**);
iExample.setValue((weka.core.Attribute)fvWekaAttributes.elementAt(2), "F1");
trainInstances.add(iExample);
to be more specific, these lists might change their sizes sometimes. i..e, in this example we see each list of length 4 in size but should support lists of different sizes in other Instance objects.
Is that possible using WEKA or any learning API. If so, please provide me the resources. It is mandatory to my master thesis..

In order to keep their Instances (dataset) objects as compact as possible, weka uses an index-value method to represent each value of a string or nominal Attribute. Each weka Instance (row in a dataset) only stores the index associated with the value for the attribute.
You are probably going to have to decide if the list element (as a whole) is more important that the individual elements of the list. If so, you will need to enumerate each of the possible lists that can occur as a value for that Attribute, and this list will need to be provided to the Attribute when it is created. If this is reasonable, you may decide to convert the lists to strings (i.e. list1="10,20,30,40").
If the individual elements of the list have value, it may be easier to create separate Attributes to identify if elements occur in the list or not.
If there is a fixed limit to the number of elements that occur in a list (and especially if the order of the list is important), you may consider having a separate Attribute for each list element. (i.e. Attibute("first_element_of_list"), Attribute("second_element_of_list"), ... etc)
If there is a fixed number of values that may occur on those lists and/or if the order is not important, you may consider having boolean Attribute to indicate if a specified element occurs in the list. (i.e. Attribute("10_in_list"), Attribute("20_in_list"), ... etc)

Related

Is there an efficient way to get elements from one index to another index in a collection without changing it to an ArrayList?

I'm trying to implement pagination within a java/spring boot app on a Collection that is returned from a function. I want to get pages that are ordered by each elements "startTime". So if the user asks for page 2 and each page has 10 items in it, then I would want to give the user the items with the top 10-20 most recent start times. I've since tried two approaches:
a) Converting the returned collection into an array and then using IntStream on it to get elements from one index to another.
final exampleClass[] example = exampleCollection.toArray(new exampleClass[0]);
Collection<exampleClass> examplePage = IntStream.range(start, end)
...
b) Converting the returned collection into an ArrayList and then using Pageable/PageRequest to create a new page from that ArrayList.
The problem is that these seem very inefficient since I first have to change the Collection to an ArrayList or array and then operate on it. I would like to know if there are more efficient ways to turn collections into structures that I can iterate on using indices so that I can implement pagination. Or, if there are some Spring functions for creating pages that don't require non-Collection parameters. However, I can't find any Spring functions for creating pages that do.
Also, is there any difference in runtime between
List<exampleClass> x = new ArrayList<>(exampleCollection);
and
List<exampleClass> x = (List<exampleClass>)exampleCollection;

I would like to know if there are more efficient ways to turn collections into structures that I can iterate on using indices
The only efficient way is to check by instanceof if your collection is indeed a List. If it is, then you can cast it, and simply use e.g. sublist(start, stop) to produce your paginated result.
Please note that accessing an element by its index might not be efficient either. In a LinkedList, accessing an element is a O(N) operation, so accesing M elements by index produces a O(M*N) operation, whereas using sublist() is a O(M+N) operation.
There is a specialization of the List interface that is used to mark lists that are fast at being accessed by index, and that is : RandomAccess, you may or may not want to check on that to decide on the best strategy.
Also, is there any difference in runtime between
List x = new ArrayList<>(exampleCollection);
and
List<exampleClass> x = (List<exampleClass>)exampleCollection;
There absolutely is.
The second is a cast, and has virtually no cost. Just beware that x and exampleCollection are one and the same object (modifying one is the same as modifying the other). Obviously, a cast may fail with a ClassCastException if exampleCollection is not actually a list.
The first performs a copy, which has a cost in both CPU (traversal of exampleCollection) and memory (allocating an array of the collection's size). Both are pretty low for small collections, but your mileage may vary.
In this copy case, modifying one collection does nothing to the other, you get a copy.

A collection does not have to have consistent iteration order: if you call iterator() twice you may get two different sequences. Converting the collection into an array or a list is the best solution.
As for the second question: This line of code:
List<exampleClass> x = new ArrayList<>(exampleCollection);
creates a new ArrayList, which is a shallow copy of the original collection. That is, it contains pointers to the same objects as the original collection, but the list itself is new and you could for example sort the list, or add or remove items, without affecting the original collection. Compared to:
List<exampleClass> x = (List<exampleClass>)exampleCollection;
Assuming exampleCollection is actually a List, this gives you a pointer to that list with a new data type. If you make changes like sort or add or remove items, you will see these modifications in exampleCollection. On the other hand if exampleCollection is not a List, you will get a run-time error (ClassCastException).

Issue iterating over custom writable component in reducer

I am using a custom writable class as VALUEOUT in the map phase in my MR job where the class has two fields, A org.apache.hadoop.io.Text and org.apache.hadoop.io.MapWritable. In my reduce function I iterate through the values for each key and I perform two operations, 1. filter, 2. aggregate. In the filter, I have some rules to check if certain values in the MapWritable(with key as Text and value as IntWritable or DoubleWritable) satisfy certain conditions and then I simply add them to an ArrayList. At the end of the filter operation, I have a filtered list of my custom writable objects. At the aggregate phase, when I access the objects, it turns out that the last object that was successfully filtered in, has overwritten all other objects in the arraylist. After going through some similar issues with lists on SO where the last object overwrite all the others, I confirmed that I do not have static fields nor am I reusing the same custom writable by setting different values(which was quoted as the possible reasons for such an issue). For each key in the reducer I have made sure that the CustomWritable, Text key and the MapWritable are new objects.
In addition, I also performed a simple test by eliminating the filter & aggregate operations in my reduce and just iterated through the values and added them to an ArrayList using a for loop. In the loop, everytime I added a CustomWritable into the list, I logged the values of all the contents of the List. I logged before and after adding the element to the list. Both logs presented that the previous set of elements have been overwritten. I am confused on how this could even happen. As soon as the next element in the iterable of values was accessed by the loop for ( CustomWritable result : values ), the list content was modified. I am unable to figure out the reason for this behaviour. If anyone can shed some light on this, it would be really helpful. Thanks.

The"values" iterator in the reducer reuses the value as you iterate. It's a technique used for performance and smaller memory footprint. Behind the scenes, Hadoop deserializes the next record into the same Java object. If you need to "remember" an object, you'll need to clone it.
You can take advantage of the Writable interface and use the raw bytes to populate a new object.
IntWritable first = WritableUtils.clone(values.next(), context.getConfiguration());
IntWritable second = WritableUtils.clone(values.next(), context.getConfiguration());

Java - How to separate a list based on a property of it's elements

I have a list of objects which I want to perform an operation on. However I firstly need to divide the list into separate lists such that all items with the same parentID are in the same list, and then the operation is performed on each list separately (the reason being that the operation takes the parentID of the objects as a parameter).
What is the best way to separate a list based on a given property of it's elements, as required here? The highest number of objects that will be passed in the original list is < 10,000 and normally will be < 1,000.
All help is much appreciated!

It sounds like you might want to use Multimaps.index from Guava. That will build you a multi-map, where each key has a collection of elements.
The keyFunction passed into index would be a Function which just retrieves the property from a single element.

Create a
Map <IdType, List<YourObject>> map
loop thru the list, and for each id do something like
List theList = map.get(id);
if (theList == null ) {
// create a new list, add it to the map under the id
}
// add the item to theList
then you can loop thru the map's entries and you have a list of objects for each id. This approach does not require you to know how many different ids are in your list to begin with....

I would recommend writing an Iterator that wraps an Iterator, returning only elements that match what you want. You could then write an implementation of Iterable that takes an Iterable, returning such an iterator (this would allow you to use an enhanced for loop).

If you're okay with adding a 3rd party library, Google's Guava supplies various utilities which could help you out.
Specifically, use Collections2.transform like this:
Collection myOriginalList;
Collection mySplitList1 = Collections2.transform(myOriginalList, new Function() { /* method to filter out parent ID 1 */ });
... // repeat for each parent id you're interested in

When to use a Map instead of a List in Java?

I didn't get the sense of Maps in Java. When is it recommended to use a Map instead of a List?

Say you have a bunch of students with names and student IDs. If you put them in a List, the only way to find the student with student_id = 300 is to look at each element of the list, one at a time, until you find the right student.
With a Map, you associate each student's ID and the student instance. Now you can say, "get me student 300" and get that student back instantly.
Use a Map when you need to pick specific members from a collection. Use a List when it makes no sense to do so.
Say you had exactly the same student instances but your task was to produce a report of all students' names. You'd put them in a List since there would be no need to pick and choose individual students and thus no need for a Map.

Java map: An object that maps keys to values. A map cannot contain duplicate keys; each key can map to at most one value.
Java list: An ordered collection (also known as a sequence). The user of this interface has precise control over where in the list each element is inserted. The user can access elements by their integer index (position in the list), and search for elements in the list.
The difference is that they are different. Map is a mapping of key/values, a list of a list of items.

I thinks its a lot the question of how you want to access your data. With a map you can "directly" access your items with a known key, in a list you would have to search for it, evan if its sorted.
Compare:
List<MyObject> list = new ArrayList<MyObject>();
//Fill up the list
// Want to get object "peter"
for( MyObject m : list ) {
if( "peter".equals( m.getName() ) {
// found it
}
}
In a map you can just type
Map<String, MyObject> map = new HashMap<String, MyObject>();
// Fill map
MyObject getIt = map.get("peter");
If you have data to process and need to do it with all objects anyway, a list is what you want. If you want to process single objects with well known key, a map is better.
Its not the full answer (just my 2...) but I hope it might help you.

A map is used as an association of a key and a value. With a list you have basically only values.
The indexes in List are always int, whereas in Map you can have another Object as a key.
Resources :
sun.com - Introduction to the Collections Framework, Map

Depends on your performance concerns. A Map more explicitly a HashMap will guarantee O(1) on inserts and removes. A List has at worst O(n) to find an item. So if you would be so kind as to elaborate on what your scenario is we may help more.

Its probably a good idea to revise Random Access Vs Sequential Access Data Structures. They both have different run time complexities and suitable for different type of contexts.

When you want to map instead of list. The names of those interfaces have meaning, and you shouldn't ignore it.
Use a map when you want your data structure to represent a mapping for keys to values. Use a list when you want your data to be stored in an arbitrary, ordered format.

Map and List serve different purpose.
List holds collection of items. Ordered (you can get item by index).
Map holds mapping key -> value. E.g. map person to position: "JBeg" -> "programmer". And it is unordered. You can get value by key, but not by index.

Maps store data objects with unique keys,therefore provides fast access to stored objects. You may use ConcurrentHashMap in order to achieve concurrency in multi-threaded environments.
Whereas lists may store duplicate data and you have to iterate over the data elements in order to access a particular element, therefore provide slow access to stored objects.
You may choose any data structure depending upon your requirement.

Difference between HashMap and ArrayList in Java?

In Java, ArrayList and HashMap are used as collections. But I couldn't understand in which situations we should use ArrayList and which times to use HashMap. What is the major difference between both of them?

You are asking specifically about ArrayList and HashMap, but I think to fully understand what is going on you have to understand the Collections framework. So an ArrayList implements the List interface and a HashMap implements the Map interface. So the real question is when do you want to use a List and when do you want to use a Map. This is where the Java API documentation helps a lot.
List:
An ordered collection (also known as a
sequence). The user of this interface
has precise control over where in the
list each element is inserted. The
user can access elements by their
integer index (position in the list),
and search for elements in the list.
Map:
An object that maps keys to values. A
map cannot contain duplicate keys;
each key can map to at most one value.
So as other answers have discussed, the list interface (ArrayList) is an ordered collection of objects that you access using an index, much like an array (well in the case of ArrayList, as the name suggests, it is just an array in the background, but a lot of the details of dealing with the array are handled for you). You would use an ArrayList when you want to keep things in sorted order (the order they are added, or indeed the position within the list that you specify when you add the object).
A Map on the other hand takes one object and uses that as a key (index) to another object (the value). So lets say you have objects which have unique IDs, and you know you are going to want to access these objects by ID at some point, the Map will make this very easy on you (and quicker/more efficient). The HashMap implementation uses the hash value of the key object to locate where it is stored, so there is no guarentee of the order of the values anymore. There are however other classes in the Java API that can provide this, e.g. LinkedHashMap, which as well as using a hash table to store the key/value pairs, also maintains a List (LinkedList) of the keys in the order they were added, so you can always access the items again in the order they were added (if needed).

If you use an ArrayList, you have to access the elements with an index (int type). With a HashMap, you can access them by an index of another type (for example, a String)
HashMap<String, Book> books = new HashMap<String, Book>();
// String is the type of the index (the key)
// and Book is the type of the elements (the values)
// Like with an arraylist: ArrayList<Book> books = ...;
// Now you have to store the elements with a string key:
books.put("Harry Potter III", new Book("JK Rownling", 456, "Harry Potter"));
// Now you can access the elements by using a String index
Book book = books.get("Harry Potter III");
This is impossible (or much more difficult) with an ArrayList. The only good way to access elements in an ArrayList is by getting the elements by their index-number.
So, this means that with a HashMap you can use every type of key you want.
Another helpful example is in a game: you have a set of images, and you want to flip them. So, you write a image-flip method, and then store the flipped results:
HashMap<BufferedImage, BufferedImage> flipped = new HashMap<BufferedImage, BufferedImage>();
BufferedImage player = ...; // On this image the player walks to the left.
BufferedImage flippedPlayer = flip(player); // On this image the player walks to the right.
flipped.put(player, flippedPlayer);
// Now you can access the flipped instance by doing this:
flipped.get(player);
You flipped player once, and then store it. You can access a BufferedImage with a BufferedImage as key-type for the HashMap.
I hope you understand my second example.

Not really a Java specific question. It seems you need a "primer" on data structures. Try googling "What data structure should you use"
Try this link http://www.devx.com/tips/Tip/14639
From the link :
Following are some tips for matching the most commonly used data structures with particular needs.
When to use a Hashtable?
A hashtable, or similar data structures, are good candidates if the stored data is to be accessed in the form of key-value pairs. For instance, if you were fetching the name of an employee, the result can be returned in the form of a hashtable as a (name, value) pair. However, if you were to return names of multiple employees, returning a hashtable directly would not be a good idea. Remember that the keys have to be unique or your previous value(s) will get overwritten.
When to use a List or Vector?
This is a good option when you desire sequential or even random access. Also, if data size is unknown initially, and/or is going to grow dynamically, it would be appropriate to use a List or Vector. For instance, to store the results of a JDBC ResultSet, you can use the java.util.LinkedList. Whereas, if you are looking for a resizable array, use the java.util.ArrayList class.
When to use Arrays?
Never underestimate arrays. Most of the time, when we have to use a list of objects, we tend to think about using vectors or lists. However, if the size of collection is already known and is not going to change, an array can be considered as the potential data structure. It's faster to access elements of an array than a vector or a list. That's obvious, because all you need is an index. There's no overhead of an additional get method call.
4.Combinations
Sometimes, it may be best to use a combination of the above approaches. For example, you could use a list of hashtables to suit a particular need.
Set Classes
And from JDK 1.2 onwards, you also have set classes like java.util.TreeSet, which is useful for sorted sets that do not have duplicates. One of the best things about these classes is they all abide by certain interface so that you don't really have to worry about the specifics. For e.g., take a look at the following code.
// ...
List list = new ArrayList();
list.add(

Use a list for an ordered collection of just values. For example, you might have a list of files to process.
Use a map for a (usually unordered) mapping from key to value. For example, you might have a map from a user ID to the details of that user, so you can efficiently find the details given just the ID. (You could implement the Map interface by just storing a list of keys and a list of values, but generally there'll be a more efficient implementation - HashMap uses a hash table internally to get amortised O(1) key lookup, for example.)

A Map vs a List.
In a Map, you have key/value pairs. To access a value you need to know the key. There is a relationship that exists between the key and the value that persists and is not arbitrary. They are related somehow. Example: A persons DNA is unique (the key) and a persons name (the value) or a persons SSN (the key) and a persons name (the value) there is a strong relationship.
In a List, all you have are values (a persons name), and to access it you need to know its position in the list (index) to access it. But there is no permanent relationship between the position of the value in the list and its index, it is arbitrary.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.