Java - WEKA - Add new categories to training set

I'm using WEKA to train a categorization Java program. There are initially several categories, let's say 10, and the system must work with those initial categories and start training. In order to do that...:
String[] categories = {"cat1", "cat2", ..., "cat10"};

public SomeClassifier(String[] categories) {
    // Create a FastVector of attributes.
    FastVector attributes = new FastVector(3);
    // Add attribute for holding property one.
    attributes.addElement(new Attribute(P1_ATTRIBUTE, (FastVector) null));
    // Add attribute for holding property two.
    attributes.addElement(new Attribute(P2_ATTRIBUTE, (FastVector) null));
    // Add the class attribute with the category values.
    FastVector values = new FastVector(categories.length);
    for (int i = 0; i < categories.length; i++) {
        values.addElement(categories[i]);
    }
    attributes.addElement(new Attribute(CATEGORY_ATTRIBUTE, values));
    // Create the dataset with an initial capacity of 25 and set the class index.
    Instances myInstances = new Instances(SOME_NAME, attributes, 25);
    myInstances.setClassIndex(myInstances.numAttributes() - 1);
}
OK, now, time goes by and I want to add a new category to my training set (let's say, "cat11"), which is already being trained with some success. How can I accomplish this? WEKA documentation says "Once an attribute has been created, it can't be changed".
So, maybe I can take out the Attribute from the Instances object, recreate the Attribute and then insert it again... or will that mess everything up?
Thanks in advance.

OK, apparently there is no way to do this with this implementation of Naïve Bayes. When the classifier is initialized, the prior probabilities of all categories must sum to 1; introducing a new category with probability != 0 during training would push the sum above 1 and make the classifier behave strangely. Moreover, the classifier may initialize its algorithm (the calculation of conditional probabilities and the iterations) based on the number of categories, so adding one after creation would mean rebuilding the algorithm in some way.
So, that leaves a question open... what classification mechanism can I use that allows me to introduce new categories over time?

Related

Creating a Sorted 2D Array from custom Object properties

I have a List of objects created from a map like so:
Map incomingRequest = (Map)object;
List accounts = (List)incomingRequest.get("accountList");
In addition, I loop through these objects, pulling them out one by one via the index of that list, and create the account object like so:
for (int accountRow = 0; accountRow < accounts.size(); accountRow++){
Account account = (Account)accounts.get(accountRow);
There is a method on this Account object that I can use to get an identifier to sort on, called like so: account.getComp_id().getIcLine(). This gives me a non-unique number that I can use to group with. I now need to do some calculations involving only the groups of like IcLine properties.
My thought is to create a 2D ArrayList so that I can loop through each sorted array of objects sharing the same IcLine number. However, I currently can't figure out exactly how I would do that after googling around and trying to work through it. I feel like this is a good job for recursion, but I can't figure out how to create the 2D ArrayList I need. Your guidance is appreciated
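Rather than a 2D ArrayList or recursion, the grouping described above can be done with a single pass into a Map keyed by the IcLine number. Below is a minimal sketch; the `Account` class here is a hypothetical stand-in for the real one, which exposes the id through `account.getComp_id().getIcLine()`:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupByIcLine {
    // Minimal stand-in for the Account class from the question.
    static class Account {
        private final int icLine;
        Account(int icLine) { this.icLine = icLine; }
        int getIcLine() { return icLine; }
    }

    // Groups accounts by their (non-unique) IcLine number in one pass.
    static Map<Integer, List<Account>> groupByIcLine(List<Account> accounts) {
        Map<Integer, List<Account>> groups = new HashMap<>();
        for (Account account : accounts) {
            groups.computeIfAbsent(account.getIcLine(), k -> new ArrayList<>())
                  .add(account);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Account> accounts = List.of(
            new Account(1), new Account(2), new Account(1), new Account(3));
        Map<Integer, List<Account>> groups = groupByIcLine(accounts);
        System.out.println(groups.get(1).size()); // two accounts share IcLine 1
    }
}
```

Each value in the map is then exactly one "row" of the 2D structure you were after, and you can run your per-group calculations by iterating over `groups.values()`.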

Cull all duplicates in a set

I'm using Set to isolate the unique values of a List (in this case, I'm getting a set of points):
Set<PVector> pointSet = new LinkedHashSet<PVector>(listToCull);
This will return a set of unique points, but for every item in listToCull, I'd like to test the following: if there is a duplicate, cull all of the duplicate items. In other words, I want pointSet to represent the set of items in listToCull which are already unique (every item in pointSet had no duplicate in listToCull). Any ideas on how to implement?
EDIT - I think my first question needs more clarification. Below is some code which will execute what I'm asking for, but I'd like to know if there is a faster way. Assuming listToCull is a list of PVectors with duplicates:
Set<PVector> pointSet = new LinkedHashSet<PVector>(listToCull);
List<PVector> uniqueItemsInListToCull = new ArrayList<PVector>();
for (PVector pt : pointSet) {
    int counter = 0;
    for (PVector ptCheck : listToCull) {
        if (pt.equals(ptCheck)) {
            counter++;
        }
    }
    if (counter < 2) {
        uniqueItemsInListToCull.add(pt);
    }
}
uniqueItemsInListToCull will be different from pointSet. I'd like to do this without loops if possible.
You will have to do some programming yourself: create two empty sets; one will contain the unique elements, the other the duplicates. Then loop through the elements of listToCull. For each element, check whether it is in the duplicates set; if it is, ignore it. Otherwise, check whether it is in the unique-elements set; if it is, remove it there and add it to the duplicates set. Otherwise, add it to the unique-elements set.
If your PVector class has a good hashCode() method, HashSets are quite efficient, so the performance of this will not be too bad.
Untested:
Set<PVector> uniques = new HashSet<>();
Set<PVector> duplicates = new HashSet<>();
for (PVector p : listToCull) {
    if (!duplicates.contains(p)) {
        if (uniques.contains(p)) {
            uniques.remove(p);
            duplicates.add(p);
        } else {
            uniques.add(p);
        }
    }
}
Alternatively, you may use a third-party library which offers a Bag or MultiSet. This allows you to count how many occurrences of each element are in the collection, and then at the end discard all elements whose count is different from 1.
What you are looking for is the intersection:
Assuming that PVector (terrible name by the way) implements hashCode() and equals() correctly a Set will eliminate duplicates.
If you want a intersection of the List and an existing Set create a Set from the List then use Sets.intersection() from Guava to get the ones common to both sets.
public static <E> Sets.SetView<E> intersection(Set<E> set1, Set<?> set2)
Returns an unmodifiable view of the intersection of two sets. The returned set contains all
elements that are contained by both backing sets. The iteration order
of the returned set matches that of set1. Results are undefined if
set1 and set2 are sets based on different equivalence relations (as
HashSet, TreeSet, and the keySet of an IdentityHashMap all are).
Note: The returned view performs slightly better when set1 is the
smaller of the two sets. If you have reason to believe one of your
sets will generally be smaller than the other, pass it first.
Unfortunately, since this method sets the generic type of the returned
set based on the type of the first set passed, this could in rare
cases force you to make a cast, for example:
Set<Object> aFewBadObjects = ...
Set<String> manyBadStrings = ...

// impossible for a non-String to be in the intersection
@SuppressWarnings("unchecked")
Set<String> badStrings = (Set<String>) Sets.intersection(
    aFewBadObjects, manyBadStrings);

This is unfortunate, but should come up only very rarely.
You can also do union, complement, difference and cartesianProduct as well as filtering very easily.
So you want pointSet to hold the items in listToCull which have no duplicates? Is that right?
I would be inclined to create a Map, then iterate twice over the list, the first time putting a value of zero in for each PVector, the second time adding one to the value for each PVector, so at the end you have a map with counts. Now you're interested in the keys of the map for which the value is exactly equal to one.
It's not perfectly efficient - you're operating on list items more times than absolutely necessary - but it's quite clean and simple.
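The count-map approach described above can be sketched like this. It counts occurrences in one pass and keeps only the elements seen exactly once; the element type is generic here (a stand-in for PVector, which comes from Processing), and the class name is hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CullDuplicates {
    // Keeps only the elements of the input that occur exactly once.
    // One counting pass over the list, one pass over the count map.
    static <T> List<T> uniqueOnly(List<T> input) {
        Map<T, Integer> counts = new HashMap<>();
        for (T item : input) {
            counts.merge(item, 1, Integer::sum); // add 1, or start at 1
        }
        List<T> result = new ArrayList<>();
        for (Map.Entry<T, Integer> e : counts.entrySet()) {
            if (e.getValue() == 1) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> pts = List.of("a", "b", "a", "c");
        System.out.println(uniqueOnly(pts)); // only "b" and "c" survive
    }
}
```

This relies on the element type having sensible equals() and hashCode() implementations, the same requirement the Set-based approaches have.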
OK, here's the solution I've come up with, I'm sure there are better ones out there but this one's working for me. Thanks to all who gave direction!
To get unique items, you can use a Set, where listToCull is a list of PVectors with duplicates:
List<PVector> culledList = new ArrayList<PVector>();
Set<PVector> pointSet = new LinkedHashSet<PVector>(listToCull);
culledList.addAll(pointSet);
To go further, suppose you want a list where you've removed all items in listToCull which have a duplicate. You can iterate through the list and test whether each item is already in the set. This lets us use one loop, rather than a nested loop:
Set<PVector> pointSet = new HashSet<PVector>();
Set<PVector> removalList = new HashSet<PVector>(); // items to remove
for (PVector pt : listToCull) {
    if (pointSet.contains(pt)) {
        removalList.add(pt);
    } else {
        pointSet.add(pt);
    }
}
pointSet.removeAll(removalList);
List<PVector> onlyUniquePts = new ArrayList<PVector>();
onlyUniquePts.addAll(pointSet);

Merge CSV files with dynamic headers in Java

I have two or more .csv files which have the following data:
//CSV#1
Actor.id, Actor.DisplayName, Published, Target.id, Target.ObjectType
1, Test, 2014-04-03, 2, page
//CSV#2
Actor.id, Actor.DisplayName, Published, Object.id
2, Testing, 2014-04-04, 3
Desired Output file:
//CSV#Output
Actor.id, Actor.DisplayName, Published, Target.id, Target.ObjectType, Object.id
1, Test, 2014-04-03, 2, page,
2, Testing, 2014-04-04, , , 3
For the case some of you might wonder: the "." in the header is just an additional information in the .csv file and shouldn't be treated as a separator (the "." results from the conversion of a json-file to csv, respecting the level of the json-data).
My problem is that so far I have not found any solution that accepts different column counts.
Is there a fine way to achieve this? I did not have code so far, but I thought the following would work:
Read two or more files and add each row to a HashMap<Integer, String> // Integer = lineNumber, String = data, so that each file gets its own HashMap
Iterate through all indices and add the data to a new HashMap.
Why I think this thought is not so good:
If the header and the row data from file 1 differs from file 2 (etc.) the order won't be kept right.
I think this might result if I do the suggested thing:
//CSV#Suggested
Actor.id, Actor.DisplayName, Published, Target.id, Target.ObjectType, Object.id
1, Test, 2014-04-03, 2, page //wrong, because one "," is missing
2, Testing, 2014-04-04, 3 // wrong, because the 3 does not belong to Target.id. Furthermore the empty values won't be considered.
Is there a handy way I can merge the data of two or more files without(!) knowing how many elements the header contains?
This isn't the only answer but hopefully it can point you in a good direction. Merging is hard, you're going to have to give it some rules and you need to decide what those rules are. Usually you can break it down to a handful of criteria and then go from there.
I wrote a "database" to deal with situations like this a while back:
https://github.com/danielbchapman/groups
It is basically just a Map<Integer, Map<Integer, Map<String, String>>>, which isn't all that complicated. What I'd recommend is you read each row into a structure similar to:
(Set One) -> Map<Column, Data>
(Set Two) -> Map<Column, Data>
A Bidi map (as suggested in the comments) will make your lookups faster but carries some pitfalls if you have duplicate values.
Once you have these structures, your lookup can be as simple as:
public List<Data> process(Data one, Data two) // pseudo code
{
    List<Data> result = new ArrayList<>();
    for (Row row : one)
    {
        Id id = row.getId();
        Row additional = two.lookup(id);
        if (additional != null)
            merge(row, additional);
        result.add(row);
    }
    return result;
}

public void merge(Row a, Row b)
{
    // Your logic here... either mutating or returning a copy.
}
Nowhere in this solution am I worried about the columns as this is just acting on the raw data-types. You can easily remap all the column names either by storing them each time you do a lookup or by recreating them at output.
The reason I linked my project is that I'm pretty sure I have a few methods in there (such as outputting column names, etc.) that might save you considerable time and point you in the right direction.
I do a lot of TSV processing in my line of work and maps are my best friends.
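As a rough sketch of the header-union idea from the question: collect the union of all headers in insertion order, turn each data row into a header-to-value map, then emit every row against the full header list, leaving blanks for missing columns. This assumes plain comma-separated lines with no quoting, as in the sample data; the class name is hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CsvMerge {
    // Merges CSV contents (each file given as a list of lines, first line =
    // header) into one table whose header is the union of all input headers.
    static List<String> merge(List<List<String>> files) {
        Set<String> allHeaders = new LinkedHashSet<>(); // keeps first-seen order
        List<Map<String, String>> rows = new ArrayList<>();
        for (List<String> file : files) {
            String[] headers = file.get(0).split(",\\s*");
            allHeaders.addAll(Arrays.asList(headers));
            for (int i = 1; i < file.size(); i++) {
                String[] values = file.get(i).split(",\\s*");
                Map<String, String> row = new LinkedHashMap<>();
                for (int j = 0; j < headers.length && j < values.length; j++) {
                    row.put(headers[j], values[j]);
                }
                rows.add(row);
            }
        }
        List<String> out = new ArrayList<>();
        out.add(String.join(", ", allHeaders));
        for (Map<String, String> row : rows) {
            List<String> cells = new ArrayList<>();
            for (String h : allHeaders) {
                cells.add(row.getOrDefault(h, "")); // blank for missing columns
            }
            out.add(String.join(", ", cells));
        }
        return out;
    }
}
```

Because rows are keyed by header name rather than by position, a value like the `3` under `Object.id` can never slide into the `Target.id` column, which was the failure mode the question worried about.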

object search optimization in memory

I have an ArrayList of MyObjects. The MyObjects class has more than 10 properties, but I need to search on only 4 of them. The user will press a button and will be able to select values for property1.
Let's say the user selects property1Value1, property1Value2 and property1Value4; then he presses a button and makes a selection for property2 values: property2Value1, property2Value5, property2Value7 and so on. Those are filter1 and filter2.
property2Value2, property2Value3 and property2Value4 are not visible to the user because they were filtered out by filter1. It is like doing a search before entering a new filter screen.
I need to store somewhere what was selected at each filter, because when the user navigates back I must show him the selected values.
It is easier to understand with pictures, since something similar is implemented on eBay:
No filters at beginning: user able to select all values for each property:
The user selected "Tablet" for type property. - a search is done and some property values aren't visible anymore:
The second filter value is selected:
Pressing the search (automatically), I should do something like this in SQL:
SELECT * FROM MyObjects WHERE
( (property1 = property1Value1) || (property1 = property1Value2) || (property1 = property1Value4) )
AND
( (property2 = property2Value1) || (property2 = property2Value5) )
Since I have the objects in memory, I don't think it's a good idea to create an SQLite database, write everything out and then select. In the iOS implementation I wrote a very complex caching algorithm, caching the filter values separately, with a lot of auxiliary index holders (at least 20), because for each filter there is some extra work to do (not mentioned here), and the data is stored only once.
I am reluctant to rewrite that iOS algorithm for Android; there must be something easier.
Edit:
Basically I need to rewrite that SQL search as Java object searching.
Edit2:
Based on answer with Multimap.
The Multimap is no better than a HashMap<String, ArrayList<Integer>>
where the key is the value of a property (e.g. property2Value3) and the value is a list of indexes into my ArrayList<MyObjects> (1, 2, 3, 4, 5... 100).
I would need to build up, for each filter and each filter value, the HashMap<String, ArrayList<Integer>>, and then I am exactly where I was on iOS... maybe with a few auxiliary collections less.
Any idea?
What you're talking about is basically indexing. A similar approach to what you describe is perfectly manageable in Java, it just takes the same careful coding it would in Objective C.
You haven't specified much about questions like whether multiple items are allowed to have the same values in their fields, so I'll presume they are. In that case, here's how I'd start:
Use Guava's Multimap, probably HashMultimap, where the key is the property being indexed, and each object being indexed gets put into the map under that key.
When you're trying to search on multiple fields, call multimap.get(property) to get a Collection of all of the objects that match that property and keep only the objects that match all the properties:
Set<Item> items = new HashSet<Item>(typeMultimap.get("tablet"));
items.retainAll(productLineMultimap.get("Galaxy Tab"));
// your results are now in "items"
If your property list is stable, write a wrapper Indexer class that has fields for all of the Multimaps and ensures that objects are inserted into and removed from all of the property indexes, and maybe has convenience wrappers for the map getters.
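For reference, here is a self-contained sketch of the same indexing idea without the Guava dependency, using a plain HashMap<String, Set<T>> as a minimal multimap stand-in. All names are hypothetical; items are represented as Strings for brevity:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class PropertyIndex {
    // Minimal stand-in for Guava's HashMultimap<String, T>: one index per
    // property, mapping a property value to the items that have it.
    static class Index<T> {
        private final Map<String, Set<T>> map = new HashMap<>();
        void put(String value, T item) {
            map.computeIfAbsent(value, k -> new HashSet<>()).add(item);
        }
        // Returns a copy so callers can retainAll() without mutating the index.
        Set<T> get(String value) {
            return new HashSet<>(map.getOrDefault(value, Set.of()));
        }
    }

    public static void main(String[] args) {
        Index<String> typeIndex = new Index<>();
        Index<String> lineIndex = new Index<>();
        typeIndex.put("tablet", "item1");
        typeIndex.put("tablet", "item2");
        typeIndex.put("phone", "item3");
        lineIndex.put("Galaxy Tab", "item2");
        lineIndex.put("Galaxy Tab", "item3");

        // AND the filters: keep only the items matching both property values.
        Set<String> result = typeIndex.get("tablet");
        result.retainAll(lineIndex.get("Galaxy Tab"));
        System.out.println(result); // [item2]
    }
}
```

An OR within one property (property1 = value1 OR value2) is the union of the per-value sets from the same index; the AND across properties is the retainAll() intersection shown above, mirroring the SQL in the question.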
How does MySQL execute that SQL behind the scenes? A MyISAM table has one file holding the data and another file holding the ID positions.
SELECT * FROM mytable puts all IDs into the result set, because there is no filter.
Because of the *, all fields are copied for each ID. This is equivalent to:
ArrayList<MyObject> result = new ArrayList<MyObject>();
for (int i = 0; i < listMyObjects.size(); i++) {
    if (true == true) { // SELECT * FROM has a hidden WHERE 1, which is always true
        result.add(listMyObjects.get(i));
    }
}
In the case of a filter, it should have a list of filters:
ArrayList<String> filterByProperty1 = new ArrayList<String>();
In the filter interface I will add some Strings: property1Value1, property1Value2... The search algorithm will be:
ArrayList<MyObject> result = new ArrayList<MyObject>();
for (int i = 0; i < listMyObjects.size(); i++) {
    MyObject curMyObject = listMyObjects.get(i);
    // let's see if it passes the filter, if a filter exists
    boolean property1Allow = false;
    boolean property2Allow = false;
    if (filterByProperty1.size() > 0) {
        String theCurProperty1Value = curMyObject.getProperty1();
        if (filterByProperty1.contains(theCurProperty1Value)) {
            property1Allow = true;
        }
    } else { // no filter by property1: allowed to add to result
        property1Allow = true;
    }
    // do the same with property2, 3 and 4 (omitted for brevity)
    if (property1Allow && property2Allow) {
        result.add(curMyObject);
    }
}
Not sure if this is a lot slower, but at least I have escaped from the tens or hundreds of auxiliary collections and indexes. After this I will add the extra required logic and it is done.

2-dimensional object that can grow in java

I need to associate a unique key with each of a number of Rectangle objects in Java. The keys are of type double, and the values are Rectangle objects.
Currently, I have the rectangles in a vector, but they are not of much use to me unless I can also access their keys, as specified in the first paragraph above.
I would make a 2D array, with the first column being the key and the second column being the rectangle, but the number of rows in the array will need to change all the time, so I do not think an array would work. I have looked into Vectors and ArrayLists, but I am concerned about being able to search and slice the data.
Can anyone show me some simple java code for creating and then accessing a 2D data set with a variable number of rows?
Currently, my prototype looks like:
ArrayList<Double> PeakList = new ArrayList<Double>();
Vector<Rectangle> peakVector = new Vector<Rectangle>();
Vector<Double> keyVector = new Vector<Double>();

if (PeakList.contains((double) i + newStartingPoint)) {
    Rectangle myRect = new Rectangle(x2 - 5, y2 - 5, 10, 10);
    boolean rectFound = peakVector.contains(myRect);
    System.out.println("rectFound is: " + rectFound);
    Double myPeak = (double) i + newStartingPoint;
    if (!rectFound) {
        peakVector.add(myRect);
        keyVector.add(myPeak);
        System.out.println("rectFound was added.");
    } else {
        System.out.println("rectFound was NOT added.");
    }
}
I then enumerate through the data for subsequent processing with something like the following:
Enumeration<Rectangle> e = peakVector.elements();
while (e.hasMoreElements()) {
    Rectangle r = e.nextElement();
    g2.fillRect(r.x, r.y, 10, 10);
}
As you can see, there is no way to subsequently integrate the keys with the rectangles. That is why I am looking for a 2D object to use. Can anyone show me how to fix this code so that I can associate keys with rectangles and subsequently access the appropriately associated data?
Why not simply use a HashMap<Double, Rectangle>?
Edit: no, there are significant problems with this, since there's no guarantee that two doubles will equal each other even though numerically they should. Does it have to be Double? Could you use some other numeric or String representation, such as a Long? Is there a physical reality that you're trying to model?
Why not use a Map? They are specifically designed to associate keys with values. You can iterate through the keys of the map with keySet(), the values with values(), and both the keys and values at the same time with entrySet().
A Map will surely be the right answer; you don't need to worry about the cardinality of the domain or of the codomain of the mapping function. Having double as the key datatype prevents you from using some of the predefined types.
I would go with a TreeMap<Double, Rectangle>, because the natural ordering is used to sort the entries inside the structure, so a double key is perfectly allowed. But you may have problems with retrieval. (I have actually used floats as map keys myself and, with some precautions, never had a problem, but it mostly depends on the nature of your data.)
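To illustrate the retrieval concern, here is a hedged sketch (class and method names hypothetical) of a TreeMap-based index that looks up the rectangle whose key is nearest to a query, within a tolerance. This sidesteps exact floating-point equality, which is the main pitfall of using Double as a map key:

```java
import java.awt.Rectangle;
import java.util.Map;
import java.util.TreeMap;

public class RectangleIndex {
    private final TreeMap<Double, Rectangle> byKey = new TreeMap<>();

    void put(double key, Rectangle r) {
        byKey.put(key, r);
    }

    // Returns the rectangle stored under the key nearest to 'key',
    // or null if no stored key is within 'tolerance' of it.
    Rectangle getNear(double key, double tolerance) {
        Map.Entry<Double, Rectangle> floor = byKey.floorEntry(key);   // largest key <= key
        Map.Entry<Double, Rectangle> ceil = byKey.ceilingEntry(key);  // smallest key >= key
        Map.Entry<Double, Rectangle> best = floor;
        if (ceil != null && (best == null
                || Math.abs(ceil.getKey() - key) < Math.abs(best.getKey() - key))) {
            best = ceil;
        }
        return (best != null && Math.abs(best.getKey() - key) <= tolerance)
                ? best.getValue() : null;
    }

    public static void main(String[] args) {
        RectangleIndex index = new RectangleIndex();
        index.put(1.0000001, new Rectangle(10, 10, 5, 5));
        // Found despite the inexact key, because it is within tolerance.
        Rectangle r = index.getNear(1.0, 1e-6);
        System.out.println(r != null); // true
    }
}
```

If exact equality is acceptable for your data (for example, keys built from the same arithmetic every time), a plain HashMap<Double, Rectangle> works too; the tolerance lookup is only needed when the keys arrive from independent computations.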
