Can I have a vector attribute in WEKA? - java

I am new to WEKA/machine learning and I am trying to create a model in which a single feature is a vector of 8 integers (ranging from 0-11) containing information about past choices. For example, [0,1,8,4,4,2,2,6] would mean that 0 was chosen in the last iteration, 1 was chosen two iterations ago, etc. Each choice has an impact on the next in this case, and the order is important.
I was wondering if it is possible to represent this in WEKA as a single feature. I am currently representing the values as individual features, but this does not make the relation or order between them obvious, and I was wondering if there is a better way to do it. Any input is appreciated, thanks!

Weka's ARFF format does not offer an attribute type that would allow you to encapsulate an ordered vector. Attributes are basically independent columns and the relational attribute type does not enforce an ordering either.
Your data sounds more like a time series. If that is the case, you could look at the time series support in Weka.
If your data does not represent a time series, then you may have to fall back on feature engineering. You can use the AddExpression filter to create new attributes based on values from other attributes (e.g., the difference between two attributes).
With the MultiFilter you can combine an arbitrary number of filters into a single one, which you can then use in conjunction with the FilteredClassifier meta-classifier.
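A minimal sketch of that combination, assuming the data has already been loaded into an Instances object; the expression, attribute names, and the choice of SMO as base classifier are illustrative assumptions:

import weka.classifiers.functions.SMO;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.MultiFilter;
import weka.filters.unsupervised.attribute.AddExpression;

// Hypothetical: derive a new attribute from the first two choice columns.
AddExpression diff = new AddExpression();
diff.setExpression("a2-a1"); // a1, a2 refer to attributes 1 and 2
diff.setName("delta_last_two");

// Bundle any number of filters into a single filter.
MultiFilter multi = new MultiFilter();
multi.setFilters(new Filter[] { diff });

// The meta-classifier applies the filter before training and prediction.
FilteredClassifier fc = new FilteredClassifier();
fc.setFilter(multi);
fc.setClassifier(new SMO()); // any base classifier works here
fc.buildClassifier(trainingData); // trainingData: your loaded Instances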

Related

Alternate approach to storing complex data types in room database

Existing approach: Currently we use TypeConverters to help the database store and retrieve complex data types (POJO class objects). But that involves serializing and deserializing the objects, which seems unnecessary when we only need a simple primitive data type like an int, string, float, etc.
My approach: I am thinking of breaking the complex data type down into primitive ones and creating separate columns to store them. When we need a simple primitive type from the database, we won't have to go through the process of deserializing complex objects.
I have tried my approach and it is working, but I'm not sure of the corner cases that may arise while implementing this approach in big projects.
I am still new to this and need help in finding the pros and cons of my approach.
There are some who advocate storing a representation of objects as a single column. This is fine if you just want to store and retrieve the objects and then work with the objects. The code itself can often be shorter.
If you want to manipulate the underlying values (fields in the object) embedded in the representation via the power of SQLite, then matters can get quite complex, and perhaps inefficient, due to the much higher likelihood of full table scans caused by the likely lack of available/usable indexes.
e.g. if myvalue is a value within a representation (typically JSON), then to find rows with this value you would have to use something like
@Query("SELECT * FROM the_table WHERE the_one_for_many_values_column LIKE '%myvalue%'")
or
@Query("SELECT * FROM the_table WHERE instr(the_one_for_many_values_column, 'myvalue') > 0")
as opposed to myvalue being stored in a column of its own (say the_value), in which case you could use
@Query("SELECT * FROM the_table WHERE the_value LIKE 'myvalue'")
The former two have the flaw that if myvalue is stored elsewhere within the representation, then that row is also included even though it should not be. Other than the fact that LIKE is case-independent, the third query is an exact match.
An index on the the_value column may also improve performance.
Additionally, the representation will undoubtedly add bloat (separators and descriptors of the values) and thus require more storage. This is compounded because the same data would often be stored multiple times, whilst a normalised relational approach may well store just one instance of the data (with just up to 8 bytes per repetition for a 1-many relationship, indexes excluded).
Just 32 bytes of bloat already reaches the maximum needed to cater for a many-many relationship (8 bytes for the parent, 8 bytes for the child, and two 8-byte columns in the mapping table); anything beyond that exceeds it.
As the SQLite API utilises Cursors (buffering) to retrieve extracted data, a greater storage requirement means fewer rows can be held at once by a CursorWindow (the limited-size buffer that is loaded with x rows of output). There is also greater potential, again due to the bloat, of a single row being larger than the CursorWindow permits.
In short, for smaller, simpler projects not greatly concerned with performance, storing representations via TypeConverters could be the more convenient and practical approach. For larger, more complex projects, unleashing the relational aspects of SQLite on data that is related rather than embedded within a representation could well be the way to go.
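Worth noting: Room's @Embedded annotation already implements the "break the object into columns" approach the question describes, without a TypeConverter. A minimal sketch with made-up classes:

import androidx.room.Embedded;
import androidx.room.Entity;
import androidx.room.PrimaryKey;

// Hypothetical POJO we want stored as individual columns, not as a blob.
class Dimensions {
    public int width;
    public int height;
}

@Entity
public class Item {
    @PrimaryKey(autoGenerate = true)
    public long id;

    // @Embedded flattens Dimensions into width/height columns on the item
    // table, so no (de)serialization round trip is needed and each field
    // can be queried and indexed directly via SQL.
    @Embedded
    public Dimensions dimensions;
}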

A three-dimensional data structure that holds items in a positional relationship to each other

I have an assignment on data structures; the prof wants us to use different kinds of DS in the project, but I don't know what he means by "a three-dimensional data structure that holds items in a positional relationship to each other. Each cell in the data structure can hold multiple items."
I tried ArrayLists of objects and queues with objects!
Any idea what kind of DS I can try, to save my time?
thanks
If you are allowed to use Guava, then I would consider a Multimap of MyObj indexed by XyzCoord, where XyzCoord is a custom object to hold three positional numbers, and MyObj is the custom object you wish to store one or more of at various coordinates.
Avoiding Guava, you can use a standard Map of List<MyObj>. It could also be indexed by a List<Integer> of length 3.
The fact is that there are many, many ways to do this. Your question may be a bit too broad as a result. Have a look at the collection classes some more and try to ask specific questions about each one if you don't know how they are used.
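For illustration, a minimal sketch of the Guava variant; XyzCoord and the stored strings are made-up names:

import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.Multimap;
import java.util.Objects;

// Hypothetical coordinate key; equals/hashCode make it usable as a map key.
final class XyzCoord {
    final int x, y, z;
    XyzCoord(int x, int y, int z) { this.x = x; this.y = y; this.z = z; }
    @Override public boolean equals(Object o) {
        if (!(o instanceof XyzCoord)) return false;
        XyzCoord c = (XyzCoord) o;
        return x == c.x && y == c.y && z == c.z;
    }
    @Override public int hashCode() { return Objects.hash(x, y, z); }
}

// Each cell (coordinate) can hold multiple items, as the assignment requires.
Multimap<XyzCoord, String> grid = ArrayListMultimap.create();
grid.put(new XyzCoord(1, 2, 3), "first item");
grid.put(new XyzCoord(1, 2, 3), "second item");
System.out.println(grid.get(new XyzCoord(1, 2, 3))); // [first item, second item]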
The simplest spatial data structure is a 3D array. In Java you can create one as follows:
Object[][][] my3DArray = new Object[10][10][10];
Here you can store 10*10*10 = 1000 objects in a spatial relationship to each other. Unfortunately, there are only 10 possible coordinates in each dimension.
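Since each cell must hold multiple items, you can store a collection per cell instead of a single Object. A sketch; the unchecked cast is the usual Java workaround for arrays of a generic type:

import java.util.ArrayList;
import java.util.List;

// Arrays of a generic type cannot be created directly in Java, hence the cast.
@SuppressWarnings("unchecked")
List<Object>[][][] grid = (List<Object>[][][]) new List[10][10][10];
grid[3][5][7] = new ArrayList<>();
grid[3][5][7].add("first item in cell (3,5,7)");
grid[3][5][7].add("second item in cell (3,5,7)");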
If you want something more efficient, look into quadtrees/octrees, kd-trees (as mentioned by @BeyelerStudios in the comments), R-trees, LSH (locality-sensitive hashing), or even space-filling curves (Z-curve, Hilbert curve, ...). These are just the main families; there are many variants of each type.
EDIT to answer comment.
Except for the 3D array approach, all the solutions above are quite space efficient. The most space efficient one may be the PH-tree (a type of quadtree, developed by myself); in some cases it may require less memory than a plain array of coordinates. Unfortunately, it is comparatively complex to understand and implement.
If you want to use a 1D sorting scheme, such as array or list, try using a space filling curve. The Z-curve may be easiest.
Using space-filling curves, you can calculate a 'key' (z-key or Morton number) for each point in space and then sort these keys in the array or list. In such an ordered list/array, direct neighbors are likely (but not guaranteed) to be close in 3D space. Conversely, points that are close in 3D space tend (but are not guaranteed) to be close together in the list/array.
For integer coordinates, the z-key (also called the Morton number) can be calculated by interleaving the bits of the coordinates. You can also do this for floating-point values, but you need to be careful with negative values, otherwise you may get a rift between positive and negative values.
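A minimal sketch of the integer case, using the widely used magic-number dilation constants for three 21-bit non-negative coordinates packed into one 64-bit key:

// Spread the lower 21 bits of v so that there are two zero bits between
// consecutive bits (standard 3D Morton dilation).
static long spreadBits(long v) {
    v &= 0x1FFFFFL; // keep 21 bits
    v = (v | (v << 32)) & 0x1F00000000FFFFL;
    v = (v | (v << 16)) & 0x1F0000FF0000FFL;
    v = (v | (v << 8))  & 0x100F00F00F00F00FL;
    v = (v | (v << 4))  & 0x10C30C30C30C30C3L;
    v = (v | (v << 2))  & 0x1249249249249249L;
    return v;
}

// Interleave the bits of x, y and z into a single sortable z-key.
static long mortonKey(int x, int y, int z) {
    return spreadBits(x) | (spreadBits(y) << 1) | (spreadBits(z) << 2);
}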

How to improve performance of SMO classifier in weka?

I am using the Weka SMO classifier to classify documents. There are many parameters available for SMO, such as the kernel, tolerance, etc. I tested different parameters, but I am not getting good results on a large data set.
With more than 90 categories, only 20% of documents are getting correctly classified.
Can anyone please tell me the best set of parameters to get the highest performance from SMO?
The principal issue here is not classification itself, but rather selecting suitable features. Using raw HTML leads to a lot of noise, which in turn makes classification results very poor. Thus, to get good results, do the following:
Extract the relevant text. Don't just remove HTML tags; get exactly the text describing the item.
Create a dictionary of key words, e.g. cappuccino, latte, white rice, etc.
Use stemming or lemmatization to get each word's base form and avoid counting, for example, "cotton" and "cottons" as 2 different words.
Make feature vectors from the text. Attributes (feature names) should be all the words from your dictionary. Values may be: binary (1 if the word occurs in the text, 0 otherwise), integer (the number of occurrences of the word in the text), tf-idf (use this one if your texts have very different lengths), and others.
And only after all these steps can you use the classifier.
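In Weka specifically, the StringToWordVector filter covers most of steps 2-4 (tokenization, word counts, tf-idf). A minimal sketch, assuming rawData is an Instances object with a string attribute; the words-to-keep value is a made-up starting point to tune:

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

StringToWordVector s2wv = new StringToWordVector();
s2wv.setLowerCaseTokens(true);
s2wv.setOutputWordCounts(true); // integer counts instead of binary presence
s2wv.setTFTransform(true);      // together with IDF this gives tf-idf weighting
s2wv.setIDFTransform(true);
s2wv.setWordsToKeep(1000);      // rough dictionary size; tune for your corpus
s2wv.setInputFormat(rawData);
Instances vectors = Filter.useFilter(rawData, s2wv);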
Most probably the classifier type won't play a big role here: dictionary-based features normally lead to quite accurate results regardless of the classification technique in use. You can use SVM (SMO), Naive Bayes, an ANN, or even kNN. More sophisticated methods include creating a category hierarchy where, for example, the category "coffee" is included in the category "drinks", which in turn is part of the category "food".

Most efficient way to check if row exists in grid Java

All,
I am wondering what's the most efficient way to check if a row already exists in a List<Set<Foo>>. A Foo object has a key/value pair (as well as other fields which aren't applicable to this question). Each Set in the List is unique.
As an example:
List[
Set<Foo>[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:3][Foo_Key:C, Foo_Value:4]
Set<Foo>[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:2][Foo_Key:C, Foo_Value:4]
Set<Foo>[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:3][Foo_Key:C, Foo_Value:3]
]
I want to be able to check if a new Set (Ex: Set[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:3][Foo_Key:C, Foo_Value:4]) exists in the List.
Each Set could contain anywhere from 1-20 Foo objects, and the List can contain anywhere from 1-100,000 Sets. Foos are not guaranteed to be in the same order in each Set (so they would have to be pre-sorted into a consistent order somehow, like with a TreeSet).
Idea 1: Would it make more sense to turn this into a matrix? Where each column would be the Foo_Key and each row would contain a Foo_Value?
Ex:
A B C
-----
1 3 4
1 2 4
1 3 3
And then look for a row containing the new values?
Idea 2: Would it make more sense to create a hash of each Set and then compare it to the hash of a new Set?
Is there a more efficient way I'm not thinking of?
Thanks
If you use TreeSets for your Sets, can't you just do list.contains(set), since a TreeSet will handle the equals check?
Also, consider using Guava's Multiset class.
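Building on the contains() idea: java.util.Set defines equals() and hashCode() element-wise and independently of element order, so if Foo itself has value-based equals/hashCode, wrapping the List in a HashSet<Set<Foo>> gives O(1) membership checks with no pre-sorting. A sketch with made-up fields:

import java.util.*;

// Hypothetical Foo with value-based equals/hashCode.
final class Foo {
    final String key;
    final int value;
    Foo(String key, int value) { this.key = key; this.value = value; }
    @Override public boolean equals(Object o) {
        if (!(o instanceof Foo)) return false;
        Foo f = (Foo) o;
        return value == f.value && key.equals(f.key);
    }
    @Override public int hashCode() { return Objects.hash(key, value); }
}

// rows is the existing List<Set<Foo>>; index it once for O(1) lookups.
List<Set<Foo>> rows = new ArrayList<>();
Set<Set<Foo>> index = new HashSet<>(rows);

Set<Foo> candidate = new HashSet<>(Arrays.asList(
        new Foo("A", 1), new Foo("B", 3), new Foo("C", 4)));
boolean exists = index.contains(candidate); // order of Foos does not matter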
I would recommend you use a less weird data structure. As for finding stuff: generally hashes, sorting + binary searching, or trees are the ways to go, depending on how much insertion/deletion you expect. Read a book on basic data structures and algorithms instead of trying to re-invent the wheel.
Lastly: if this is not a purely academic question, loop through the lists and do the comparison. Most likely, that is acceptably fast. Even 100,000 entries will take a fraction of a second, and therefore not matter in 99% of all use cases.
I like to quote Knuth: Premature optimisation is the root of all evil.

How to best represent Constants (Enums) in the Database (INT vs VARCHAR)?

What is the best solution, in terms of performance and readability/good coding style, to represent a (Java) enumeration (a fixed set of constants) on the DB layer: an integer (or any number datatype in general) vs. a string representation?
Caveat: Some database systems support "enums" directly, but this would require keeping the database enum definition in sync with the business-layer implementation. Furthermore, this kind of datatype might not be available on all database systems and the syntax might differ => I am looking for a solution that is easy to manage and available on all database systems. (So my question only addresses the number vs. string representation.)
The number representation of a constant seems very efficient to store (for example, it consumes only two bytes as an integer) and is most likely very fast in terms of indexing, but it is hard to read ("0" vs. "1", etc.).
The string representation is more readable (storing "enabled" and "disabled" compared to "0" and "1"), but consumes much more storage space and is most likely also slower in regard to indexing.
My question is: did I miss some important aspects? What would you suggest using for an enum representation on the database layer?
Thank you very much!
In most cases, I prefer to use a short alphanumeric code, and then have a lookup table with the expanded text. When necessary I build the enum table in the program dynamically from the database table.
For example, suppose we have a field that is supposed to contain, say, transaction type, and the possible values are Sale, Return, Service, and Layaway. I'd create a transaction type table with code and description, make the codes maybe "SA", "RE", "SV", and "LY", and use the code field as the primary key. Then in each transaction record I'd post that code.
This takes less space than an integer key in the record itself and in the index. Exactly how it is processed depends on the database engine, but it shouldn't be dramatically less efficient than an integer key.
And because it's mnemonic it's very easy to use. You can dump a record and easily see what the values are and likely remember which is which. You can display the codes without translation in user output and the users can make sense of them. Indeed, this can give you a performance gain over integer keys: in many cases the abbreviation is good for the users -- they often want abbreviations to keep displays compact and avoid scrolling -- so you don't need to join on the transaction table to get a translation.
I would definitely NOT store a long text value in every record. Like in this example, I would not want to dispense with the transaction table and store "Layaway". Not only is this inefficient, but it is quite possible that someday the users will say that they want it changed to "Layaway sale", or even some subtle difference like "Lay-away". Then you not only have to update every record in the database, but you have to search through the program for every place this text occurs and change it. Also, the longer the text, the more likely that somewhere along the line a programmer will mis-spell it and create obscure bugs.
Also, having a transaction type table provides a convenient place to store additional information about the transaction type. Never ever ever write code that says "if whatevercode='A' or whatevercode='C' or whatevercode='X' then ..." Whatever it is that makes those three codes somehow different from all other codes, put a field for it in the transaction table and test that field. If you say, "Well, those are all the tax-related codes" or whatever, then fine, create a field called "tax_related" and set it to true or false for each code value as appropriate. Otherwise when someone creates a new transaction type, they have to look through all those if/or lists and figure out which ones this type should be added to and which it shouldn't. I've read plenty of baffling programs where I had to figure out why some logic applied to these three code values but not others, and when you think a fourth value ought to be included in the list, it's very hard to tell whether it is missing because it is really different in some way, or if the programmer made a mistake.
The only time I don't create the translation table is when the list is very short, there is no additional data to keep, and it is clear from the nature of the universe that it is unlikely ever to change, so the values can be safely hard-coded. Like true/false or positive/negative/zero or male/female. (And hey, even that last one, obvious as it seems, there are people insisting we now include "transgendered" and the like.)
Some people dogmatically insist that every table have an auto-generated sequential integer key. Such keys are an excellent choice in many cases, but for code lists, I prefer the short alpha key for the reasons stated above.
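For illustration, one common way to keep the short code and the Java enum in sync is to carry the code on the enum itself (using the transaction-type example above):

// Each constant carries its database code; the enum is the single source
// of truth for the code-to-constant mapping.
public enum TransactionType {
    SALE("SA"), RETURN("RE"), SERVICE("SV"), LAYAWAY("LY");

    private final String code;

    TransactionType(String code) { this.code = code; }

    public String getCode() { return code; }

    // Resolve a code read from the database back to the enum constant.
    public static TransactionType fromCode(String code) {
        for (TransactionType t : values()) {
            if (t.code.equals(code)) return t;
        }
        throw new IllegalArgumentException("Unknown transaction type code: " + code);
    }
}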
I would store the string representation, as this is easy to correlate back to the enum and much more stable. Using ordinal() would be bad because it can change if you add a new enum constant in the middle of the series, so you would have to implement your own numbering system.
In terms of performance, it all depends on what the enums would be used for, but it is most likely a premature optimization to develop a whole separate representation with conversion rather than just use the natural String representation.
