Create Weka Instance with string attribute - java

I'm trying to convert an ArrayList from custom code I have inherited into a Weka Instances structure so I can use the Weka IBk classifier on it.
In each instance the features are represented with a HashMap. So if I'm classifying a film review, for example, a feature might be a HashMap entry of ("funny", 2), 2 being the number of occurrences of the word "funny".
Although there's probably a better way, I'm iterating over my instances to try to convert them to Weka Instances.
The problem is I can't call instance.setValue("funny", 2), as setValue() expects an (int, double) pair rather than a String key. Is there a way to do this, or should I be approaching it a different way?

You can create one attribute per key (get all your distinct keys into a list and sort it to keep the order fixed). The attributes will have numeric values, which are the numbers of occurrences.
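A minimal sketch of that conversion, assuming Weka 3.7+ (which has DenseInstance) and that the inherited data can be viewed as a List<Map<String, Integer>> of term counts; all names here are illustrative:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public static Instances toWekaInstances(List<Map<String, Integer>> docs) {
    // Collect all distinct feature keys and sort them so the attribute order is fixed.
    TreeSet<String> keys = new TreeSet<>();
    for (Map<String, Integer> doc : docs) {
        keys.addAll(doc.keySet());
    }

    // One numeric attribute per key; its value will be the occurrence count.
    ArrayList<Attribute> attributes = new ArrayList<>();
    for (String key : keys) {
        attributes.add(new Attribute(key));
    }

    Instances instances = new Instances("reviews", attributes, docs.size());
    for (Map<String, Integer> doc : docs) {
        double[] vals = new double[attributes.size()];   // absent words stay 0
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            // Look the attribute up by name instead of setValue("funny", 2);
            // there is also a setValue(Attribute, double) overload on Instance.
            vals[instances.attribute(e.getKey()).index()] = e.getValue();
        }
        instances.add(new DenseInstance(1.0, vals));
    }
    return instances;
}

Before training IBk you would still need a class attribute (e.g. a nominal one for the review label) and a call to instances.setClassIndex(...); the sketch omits that.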

Related

ArrayList of objects vs ArrayList of HashMaps

I have a file containing approximately 10,000 JSON dumps. Each JSON has about 20 fields, of which only 5 are of use to me. I need to iterate over the file, parse each JSON, and store the relevant elements for further processing.
In Java, what would be an efficient data structure to store the relevant JSON fields? I am torn between an ArrayList of objects (for which I will create a bean to hold the various fields) and an ArrayList of HashMaps (where each of the relevant JSON fields will be stored as key-value pairs).
Which of the two is better in regards to memory usage and computation?
It depends on your use case. If you are going to use all five fields together, e.g. putting them in a database or displaying them in a UI, then the first approach (a list of beans) is better. If you are going to use the fields selectively (one of the five fields here, another there), then the second approach (a list of hash maps) is better.
A List of beans has better type safety and readability. Use that until you can prove there is a problem with that approach.
If you have a fixed set of fields, an Object will be smaller than a HashMap.
The HashMap has to store the keys as Strings for each instance. Also, accessing the fields of an Object will be much faster: accessing a field of an Object is a single bytecode operation, while accessing a HashMap requires computing the hash for the given field and then accessing an element in an array.
Regardless, performance is probably not a dominant factor for this particular problem, and using an Object will probably be more readable.
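To make the trade-off concrete, here is a hedged sketch with illustrative field names (your five fields will differ); the bean gives typed, compile-time-checked access where the map forces a runtime string key and a cast:

import java.util.HashMap;
import java.util.Map;

public class LogEntry {
    private final String id;
    private final String user;
    private final long timestamp;

    public LogEntry(String id, String user, long timestamp) {
        this.id = id;
        this.user = user;
        this.timestamp = timestamp;
    }

    public String getId() { return id; }
    public String getUser() { return user; }
    public long getTimestamp() { return timestamp; }

    public static void main(String[] args) {
        // Bean: typed, compile-time checked access.
        LogEntry bean = new LogEntry("42", "alice", 1700000000L);
        String typedUser = bean.getUser();

        // Map: keys are runtime strings, every read needs a cast.
        Map<String, Object> map = new HashMap<>();
        map.put("id", "42");
        map.put("user", "alice");
        map.put("timestamp", 1700000000L);
        String untypedUser = (String) map.get("user");
    }
}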

MapReduce with "customized" key

I have the following problem: I have a lot of data in the form of key-value pairs. The key is some id and the value some piece of text. My aim is to group those objects into clusters where the text pieces are "similar" in some way. So it looks like a task for MapReduce, if I take the text pieces as keys and the ids as values. But such keys are not the traditional way of using MapReduce, and as I am not really aware of the internal implementation of MapReduce frameworks, I am not sure this approach works. So my idea in detail is:
1. Take some MapReduce implementation in Java (Hadoop, GridGain)
2. Create a special class for my text pieces (say TextKey)
3. Override equals() of the class, packing the text comparison logic there (say Levenshtein distance comparison, or whatever)
4. Override compareTo() to allow the MapReduce framework to sort by key (say in lexicographical order)
5. Probably override hashCode()
6. Map my data to key-value pairs: keys -> text pieces, packed in the TextKey class; values -> ids
7. Simply reduce by collecting the ids of every "equal" (actually similar) key
Can MapReduce work that way?
In GridGain this can be easily solved by storing your text keys in a partitioned data grid. GridGain Data Grid will automatically partition your data set across the cluster based on keys, so as long as your similar text pieces properly implement the standard Java hashCode() and equals(), you should be fine.
You can also send affinity-based MapReduce tasks in GridGain to make sure that your jobs end up on the same node as the data, to avoid redundant data movement should you need to run some computations on your data going forward. This can be achieved by executing the GridProjection.affinityRun(...) methods.
Right after the map phase, its output is partitioned using a Partitioner (HashPartitioner by default, but you can provide your own Partitioner). Your TextKey should implement an LSH-style hashCode() so that similar text values are likely to go to the same partition.
If the keys are Strings/Text objects, the default sorter will work, but I think this is not going to affect your result given the scenario you described.
The problem is at the grouper, which passes each group within a partition to a single reduce call. By default this grouper iterates through the partition, which is sorted by this point, and forms groups out of equal values. In your case you should make sure the grouping is done not by equality but by similarity. So your TextKey should also implement the compareTo() method and take care to return 0 if the LSH hash codes are the same.
In conclusion, you can go with the default data path (i.e., default Partitioner, Sorter, Grouper), but your TextKey (which should implement WritableComparable) should do the magic in its hashCode() and compareTo() methods.
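A hedged sketch of such a key, assuming the classic org.apache.hadoop.io API; lshHash() is a placeholder for whatever locality-sensitive hash you choose:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class TextKey implements WritableComparable<TextKey> {
    private String text;

    public TextKey() { }                           // no-arg constructor required by Hadoop
    public TextKey(String text) { this.text = text; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(text);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        text = in.readUTF();
    }

    // Similar texts land in the same bucket, so the default HashPartitioner
    // routes them to the same reducer.
    @Override
    public int hashCode() {
        return lshHash(text);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof TextKey && lshHash(((TextKey) o).text) == lshHash(text);
    }

    // Order by LSH bucket only: keys in the same bucket compare as 0,
    // so the default grouper hands them to a single reduce() call.
    @Override
    public int compareTo(TextKey other) {
        return Integer.compare(lshHash(text), lshHash(other.text));
    }

    private static int lshHash(String s) {
        // Toy stand-in, NOT a real LSH: bucket by first character.
        // Replace with e.g. a MinHash/SimHash bucket id.
        return s.isEmpty() ? 0 : Character.toLowerCase(s.charAt(0));
    }
}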

Java map content comparison

Here is a tricky data structure and data organization case.
I have an application that reads data from large files and produces objects of various types (e.g., Boolean, Integer, String) that are categorized in a few (less than a dozen) groups and then stored in a database.
Each object is currently stored in a single HashMap<String, Object> data structure. Each such HashMap corresponds to a single category (group). Each database record is built from the information in all the objects contained in all categories (HashMap data structures).
A requirement has appeared for checking whether subsequent records are "equivalent" in the number and type of columns, where equivalence must be verified across all maps by comparing the name (HashMap key) and the type (actual class) of each stored object.
I am looking for an efficient way of implementing this functionality, while maintaining the original object categorization, because listing objects by category in the fastest possible way is also a requirement.
An idea would be to just sort the keys (e.g., by replacing each HashMap with a TreeMap) and then walk over all maps. An alternative would be to just copy everything in a TreeMap for comparison purposes only.
What would be the most efficient way of implementing this functionality?
Also, how would you go about finding the difference (i.e., the fields added and those removed) between successive records?
Create a meta SortedSet in which you store all the created maps.
That means a SortedSet<Map<String,Object>>, e.g. a TreeSet with a custom Comparator<Map<String,Object>> which checks exactly your requirements: the same number and names of keys, and the same object type per value.
You can then use the contains() method of this meta set structure to find out if a similar record already exists.
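A hedged sketch of such a comparator (it assumes non-null values, which holds for your Boolean/Integer/String case); it returns 0 exactly when two maps have the same keys with the same value classes:

import java.util.Comparator;
import java.util.Iterator;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

Comparator<Map<String, Object>> byStructure = (m1, m2) -> {
    int c = Integer.compare(m1.size(), m2.size());
    if (c != 0) return c;
    // Walk both key sets in sorted order, comparing names and value classes.
    Iterator<String> i1 = new TreeSet<>(m1.keySet()).iterator();
    Iterator<String> i2 = new TreeSet<>(m2.keySet()).iterator();
    while (i1.hasNext()) {
        String k1 = i1.next(), k2 = i2.next();
        c = k1.compareTo(k2);
        if (c != 0) return c;
        c = m1.get(k1).getClass().getName().compareTo(m2.get(k2).getClass().getName());
        if (c != 0) return c;
    }
    return 0;   // same keys, same value types: structurally equivalent
};

SortedSet<Map<String, Object>> seen = new TreeSet<>(byStructure);
// seen.contains(categoryMap) is now a structural-equivalence check.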
==== EDIT ====
Since I misunderstood the relation between database records and the maps in the first place, I of course have to change the semantics of my answer a little.
Still, I would use the mentioned SortedSet<Map<String,Object>>, but of course the Map<String,Object> would now be the map you and havexy suggested.
On the other hand, it could be a step forward to use a Set<Set<KeyAndType>> or SortedSet<Set<KeyAndType>>, where your KeyAndType contains only the key and the type, with an appropriate Comparable implementation or equals() with hashCode().
Why? You asked how to find the differences between two records. If each record relates to one of those inner Set<KeyAndType> sets, you can easily use retainAll() to form the intersection of two successive sets.
If you compare this to the idea of a SortedSet<Map<String,Object>>: in both cases the logic which differentiates between the fields sits in the comparator, one time comparing inner sets, the other time comparing inner maps. Since this information gets lost once the surrounding set is constructed, it will be hard to get the differences between two records later on if you do not have another, reduced structure which is easy to use for finding such differences. And since such a Set<KeyAndType> can act as a key as well as an easy basis for comparison between two records, it is a good candidate for both purposes.
If, furthermore, you want to keep the relation between such a Set<KeyAndType> and your record or its group of Map<String,Object> structures, your meta structure could be something like:
Map<Set<KeyAndType>,DatabaseRecord> or Map<Set<KeyAndType>,GroupOfMaps>, implemented by a simple LinkedHashMap, which allows simple iteration in the original order.
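A hedged sketch of the KeyAndType idea and the retainAll()/removeAll() diff; the names are illustrative:

import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

public final class KeyAndType {
    private final String key;
    private final Class<?> type;

    public KeyAndType(String key, Class<?> type) {
        this.key = key;
        this.type = type;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof KeyAndType)) return false;
        KeyAndType other = (KeyAndType) o;
        return key.equals(other.key) && type.equals(other.type);
    }

    @Override
    public int hashCode() {
        return Objects.hash(key, type);
    }

    // The structural signature of one record: all maps of all its categories.
    public static Set<KeyAndType> signatureOf(Iterable<Map<String, Object>> categoryMaps) {
        Set<KeyAndType> signature = new HashSet<>();
        for (Map<String, Object> map : categoryMaps) {
            for (Map.Entry<String, Object> e : map.entrySet()) {
                signature.add(new KeyAndType(e.getKey(), e.getValue().getClass()));
            }
        }
        return signature;
    }
}

Diffing two successive records is then plain set algebra:

Set<KeyAndType> removed = new HashSet<>(previous);
removed.removeAll(current);    // fields that disappeared
Set<KeyAndType> added = new HashSet<>(current);
added.removeAll(previous);     // fields that appeared
Set<KeyAndType> common = new HashSet<>(previous);
common.retainAll(current);     // the intersection mentioned above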
One solution is to keep both the category-based HashMaps and a combined TreeMap. This has slightly higher memory requirements, though not by much, as you will just keep the same references in both of them.
So whenever you add to or remove from a HashMap, you do the same operation in the TreeMap too. This way both will always be in sync.
You can then use the TreeMap for comparison, whether you want to compare the types of the objects or their actual contents.
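A small sketch of that dual bookkeeping (names assumed); it presumes keys are unique across categories, otherwise the combined map would need qualified keys:

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class RecordStore {
    private final Map<String, Map<String, Object>> byCategory = new HashMap<>();
    private final TreeMap<String, Object> combined = new TreeMap<>();

    void put(String category, String key, Object value) {
        byCategory.computeIfAbsent(category, c -> new HashMap<>()).put(key, value);
        combined.put(key, value);   // same reference, so little extra memory
    }

    void remove(String category, String key) {
        Map<String, Object> m = byCategory.get(category);
        if (m != null) m.remove(key);
        combined.remove(key);
    }

    Map<String, Object> category(String category) {   // fast per-category listing
        return byCategory.getOrDefault(category, Map.of());
    }

    TreeMap<String, Object> sortedView() {             // use this for comparisons
        return combined;
    }
}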

Storing ArrayList and HashMap using java.util.properties

How can I store an ArrayList and/or a HashMap variable using java.util.Properties? If it's not possible, what other class can I use to store application configuration?
If you just need to serialize your collections into Strings, I highly recommend XStream. It uses reflection to serialize a class into XML. There is documentation if the default behavior doesn't work for the class you want to serialize, but the following has worked for me every time so far:
import com.thoughtworks.xstream.XStream;

XStream xstream = new XStream();
String xml = xstream.toXML(myObject);
MyClass deserializedObject = (MyClass) xstream.fromXML(xml);
assert deserializedObject.equals(myObject);
So... if "don't do that" doesn't work for you, then you need to encode the data somehow. One common technique is to prepend some string to the name of each element. For example if I have a map MyMap containing a->1, b->2, c->3, I might store in the properties file:
MyMap.a=1
MyMap.b=2
MyMap.c=3
For lists, you can do the same, just mapping indices to values. So if MyList contains {a,b,c}:
MyList.0=a
MyList.1=b
MyList.2=c
This is a hack, and everything everyone else said is true. But sometimes you gotta do what you gotta do.
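A hedged sketch of that encoding and its inverse (the map name and contents are illustrative):

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

Map<String, String> myMap = new HashMap<>();
myMap.put("a", "1");
myMap.put("b", "2");
myMap.put("c", "3");

// Write: prefix every key with the map's name.
Properties props = new Properties();
for (Map.Entry<String, String> e : myMap.entrySet()) {
    props.setProperty("MyMap." + e.getKey(), e.getValue());
}

// Read: strip the prefix from every matching property name.
Map<String, String> restored = new HashMap<>();
for (String name : props.stringPropertyNames()) {
    if (name.startsWith("MyMap.")) {
        restored.put(name.substring("MyMap.".length()), props.getProperty(name));
    }
}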
Properties is basically a Map<String, String>, meaning both key and value must be String objects. If you want more advanced configuration, you could go with Spring. It's an excellent framework and I use it in every project. Spring config files are extremely flexible.
java.util.Properties is only intended to be used with String keys and values. It does inherit the put() and putAll() methods from Hashtable, but it's rarely a good idea to use those to "cheat". Have you considered just storing your configuration information in a HashMap rather than a Properties object? You would have to customize the serialization a bit, but you're going to have to do that in any case as you can't take advantage of the default loading functionality of the Properties class.
Storing a HashMap would be easy, since each key and value in the Map can be represented by a corresponding key and value in the Properties object (see the setProperty() method in Properties).
For the ArrayList you could do something similar, the keys would be the indexes and the values the items in the corresponding indexes.
In both cases, remember that a properties file only stores strings, so you'd have to devise a way to represent the keys and values in your objects as strings.
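For example, a hedged sketch of the list round-trip under the index-as-key scheme described above (assuming the values are already strings; the property name prefix is illustrative):

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

List<String> myList = List.of("a", "b", "c");

// Write: the index becomes part of the property name.
Properties props = new Properties();
for (int i = 0; i < myList.size(); i++) {
    props.setProperty("MyList." + i, myList.get(i));
}

// Read: walk consecutive indexes until one is missing.
List<String> restoredList = new ArrayList<>();
for (int i = 0; props.getProperty("MyList." + i) != null; i++) {
    restoredList.add(props.getProperty("MyList." + i));
}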

Best Java data structure to store a 3 column oracle table? 3 column array? or double map?

What is the best data structure to store an Oracle table that's about 140 rows by 3 columns? I was thinking about a multidimensional array.
By best I do not necessarily mean most efficient (but I'd be curious to know your opinions), since the program will run as a job with plenty of time to run, but I do have some restrictions:
It is possible for multiple keys to be "null" at first, so the first column might have multiple null values. I also need to be able to access elements from the other columns. Is there anything better than a linear search to access the data?
So again, something like [][][] would work... but is there something like a 3-column map where I can access elements by the key or by the second column? I know maps have only two values.
All data will probably be strings or cast as strings.
Thanks
A custom class with 3 fields, and a java.util.List of that class.
There's no benefit in shoe-horning data into arrays in this case, you get no improvement in performance, and certainly no improvement in code maintainability.
This is another example of people writing FORTRAN in an object-oriented language.
Java's about objects. You'd be much better off if you started using objects to abstract your problem, hide details away from clients, and reduce coupling.
What sensible object, with meaningful behavior, do those three items represent? I'd start with that, and worry about the data structures and persistence later.
All data will probably be strings or cast as strings.
This is fine if they really are strings, but I'd encourage you to look deeper and see if you can do better.
For example, if you write an application that uses credit scores you might be tempted to persist it as a number column in a database. But you can benefit from looking at the problem harder and encapsulating that value into a CreditScore object. When you have that, you realize that you can add something like units ("FICO" versus "TransUnion"), scale (range from 0 to 850), and maybe some rich behavior (e.g., rules governing when to reorder the score). You encapsulate everything into a single object instead of scattering the logic for operating on credit scores all over your code base.
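A hedged sketch of what that encapsulation might look like; the bureau names, range check, and refresh rule are illustrative, not a real scoring API:

import java.time.LocalDate;

public final class CreditScore {
    public enum Bureau { FICO, TRANSUNION }

    private final int value;         // scale: 0 to 850
    private final Bureau bureau;     // "units": which bureau produced it
    private final LocalDate ordered;

    public CreditScore(int value, Bureau bureau, LocalDate ordered) {
        if (value < 0 || value > 850) {
            throw new IllegalArgumentException("score out of range: " + value);
        }
        this.value = value;
        this.bureau = bureau;
        this.ordered = ordered;
    }

    public int value() { return value; }
    public Bureau bureau() { return bureau; }

    // Rich behavior lives with the data: a rule for when to reorder the score.
    public boolean needsRefresh(LocalDate today) {
        return ordered.plusMonths(6).isBefore(today);
    }
}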
Start thinking less in terms of tables and columns and more about objects. Or switch languages. Python has the notion of tuples built in. Maybe that will work better for you.
If you need to access your data by one key and also by another key, then I would just use two maps and define a separate class to hold your record.
class Record {
    String field1;
    String field2;
    String field3;
}
and
Map<String, Record> firstKeyMap = new HashMap<String, Record>();
Map<String, Record> secondKeyMap = new HashMap<String, Record>();
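Usage might then look like this (the values are illustrative). Both maps hold the same Record reference, so either column gives an O(1) lookup instead of a linear search, and rows whose first key is still null can simply be skipped in the first index:

Record r = new Record();
r.field1 = "id-001";      // may be null at first
r.field2 = "alpha";
r.field3 = "some payload";

if (r.field1 != null) {   // only index non-null keys
    firstKeyMap.put(r.field1, r);
}
secondKeyMap.put(r.field2, r);

// O(1) lookup by either column; both return the same object.
Record byFirst = firstKeyMap.get("id-001");
Record bySecond = secondKeyMap.get("alpha");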
I'd create an object which maps your record, and then create a collection of those objects.
