JSAT: Data wrangling / manipulating - java

After building a prototype in R (using dplyr), I need to build a model that can be deployed to our Java-based server infrastructure. Right now, I'm using the JSAT machine-learning library.
What is the best way to wrangle data?
None of the collection-like types from the JSAT package (ClassificationDataSet, RegressionDataSet, DataSet) seem to support even basic tasks like:
Filtering out datapoints based on conditions
Splitting the dataset into two (different sized) datasets, e.g. training and testing dataset
Mutating or adding new rows based on the values of other rows

1) This isn't currently supported in JSAT; JSAT is a library of machine-learning algorithms, and dataframe-like operations are not a goal of the project in any way. I'm not sure why you would want to be filtering out data in a production system; there is no reason you couldn't do that in a better-suited tool and then export the data for JSAT to build the model.
2) All DataSet objects inherit a randomSplit method that can do what you have asked for.
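A minimal sketch of that (the exact randomSplit overload is an assumption from the JSAT API; the arguments are split proportions):

import java.util.List;
import jsat.classifiers.ClassificationDataSet;

// Assuming 'data' is an existing ClassificationDataSet;
// randomSplit partitions it by the given proportions (assumed signature).
List<ClassificationDataSet> splits = data.randomSplit(0.8, 0.2);
ClassificationDataSet train = splits.get(0);
ClassificationDataSet test = splits.get(1);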
3) See 1; I'm not sure what the use case is for adding "new rows based on the values of other rows". All the different DataSet classes support adding new data points; you just have to create them yourself, as sketched below.
source: I'm the author of JSAT
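For point 3, a rough sketch of creating a data point yourself and adding it (the constructor and method shapes are assumptions from the JSAT API):

import jsat.classifiers.CategoricalData;
import jsat.classifiers.ClassificationDataSet;
import jsat.linear.DenseVector;

// A dataset with 2 numeric features, no categorical features,
// and a binary class label (assumed constructor shape).
ClassificationDataSet data =
        new ClassificationDataSet(2, new CategoricalData[0], new CategoricalData(2));

// Derive new values from existing points however you like, then add the point.
DenseVector v = new DenseVector(new double[] { 1.5, -0.25 });
data.addDataPoint(v, new int[0], 1); // numeric values, categorical values, class label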

Related

How to capture formulas and support formula evaluation in java web application

We have a requirement to incorporate an Excel-based tool into a Java web application. This Excel tool has a set of master data and a couple of result outputs computed with formulas over the master data.
The master data can be captured in relational database tables. We are looking for the best way to capture, validate, and evaluate formulas.
So far we have looked at using scripting engines such as Nashorn to provide formula support via eval. We would like to know how people are doing this in other places.
I've searched and found two possible libraries that could be useful for you; please have a look:
http://mathparser.org/
http://mathparser.org/mxparser-hello-world/mxparser-hello-world-java/
https://lallafa.objecthunter.net/exp4j/
https://lallafa.objecthunter.net/exp4j/#Evaluating_an_expression_asynchronously
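For example, a minimal exp4j sketch along the lines of its hello-world example:

import net.objecthunter.exp4j.Expression;
import net.objecthunter.exp4j.ExpressionBuilder;

// Parse the formula once, then bind variable values and evaluate.
Expression e = new ExpressionBuilder("3 * sin(y) - 2 / (x - 2)")
        .variables("x", "y")
        .build()
        .setVariable("x", 2.3)
        .setVariable("y", 3.14);
double result = e.evaluate();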
It depends on how big your data is and what your required SLA is, and also on what kinds of formulas and other functions you want to support.
For example, consider a function like sum or max over master data stored in a relational table containing 10K rows. You could pull all of this data into the Java app and compute the sum (or run any other function). However, imagine the table contained 500K rows: streaming all 500K rows to the Java app would take time and consume a lot of CPU and network bandwidth (database resources, local CPU resources). A better-optimized approach in that case is to index that column in the database and let the database do the hard work for you.
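A rough sketch of pushing the aggregation down to the database over JDBC (the connection details, table, and column names here are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Let the database compute the aggregate over the (indexed) column,
// instead of streaming every row into the JVM.
try (Connection con = DriverManager.getConnection(jdbcUrl, user, password);
     PreparedStatement ps = con.prepareStatement("SELECT SUM(amount) FROM master_data");
     ResultSet rs = ps.executeQuery()) {
    if (rs.next()) {
        System.out.println("sum = " + rs.getDouble(1));
    }
}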
Personally, I don't like using eval. I would rather parse the user input to determine what actions to take.
I am assuming the data is not big enough to warrant big-data tools.

How do I store objects if I want to search them by multiple attributes later?

I want to code a simple project in Java in order to keep track of my watched/owned TV shows, movies, books, etc.
Searching and retrieving the metadata from an API (themovieDB, Google Books) is already working.
How would I store some of this metadata together with user input (like progress or rating)?
I'm planning on displaying the data in a table-like form (example). Users should also be able to search the local data by multiple attributes. Is there any easy way to do this? I have already thought about a database, since that seemed like the easiest solution.
Any suggestions?
Thanks in advance!
You can use a lightweight database such as H2, HSQLDB, or SQLite. These databases can be embedded in the Java app itself and do not require an extra server.
If your data is small, you can also save it as XML or JSON using any XML or JSON parser (e.g. Gson).
Your DB table will have the various attributes fetched from the API as well as the user inputs. You can write queries on top of these DBs to fetch and show the various results.
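For instance, a minimal embedded H2 sketch (the file name and schema are made up for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// H2 in embedded mode: the database lives in a local file, no server needed.
try (Connection con = DriverManager.getConnection("jdbc:h2:./media", "sa", "");
     Statement st = con.createStatement()) {
    st.execute("CREATE TABLE IF NOT EXISTS shows(" +
            "id IDENTITY, title VARCHAR(255), rating INT, progress INT)");
    st.execute("INSERT INTO shows(title, rating, progress) VALUES('Arrival', 9, 60)");
    // Search by multiple attributes with plain SQL.
    ResultSet rs = st.executeQuery(
            "SELECT title FROM shows WHERE rating >= 8 AND progress < 100");
    while (rs.next()) {
        System.out.println(rs.getString("title"));
    }
}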
Either write everything to files, or store everything on a database. It depends on what you want though.
If you choose to write everything to files, you'll have to implement both the writing and the reading to suit your needs. You'll also have to deal with read/write bugs and performance issues yourself.
If you choose a database, you'll just have to implement the high level read and write methods, i.e., the methods that format the data and store it on the appropriate tables. The actual reading and writing is already implemented and optimized for performance.
Overall, databases are usually the smart choice. However, be careful which one you choose: some types might be better for reading, while others are better for writing. You should carefully evaluate what's best, given your problem's domain.
There are many ways to accomplish this but as another user posted, a database is the clear choice.
However, if you're looking to make a program to learn with, or something simple for personal use, you could also use a multidimensional array of strings to hold the name of the program as well as any other metadata fields, and treat the array like a table in Excel. This is not the best way to do it, but you can get away with it with very simple code. To search, you would only need to loop through the array elements and check that the name of the program (i.e. movieArray[x][0]) matches the search string. Once located, you can perform actions on or edit the other array indexes pertaining to that movie.
For a little more versatility, you would create a class to hold the movie information, with fields for any metadata. The advantage here is that the metadata fields can have different types rather than having to conform to the array's element type, and they're packaged together in each instance of the class. If you're getting the info from an API, you can create or update the objects from the API response. These objects can be stored in an ArrayList and searched with a loop that checks for a certain value, e.g.:
// Linear search by exact title; assumes movieArrayList is a List<Movie>.
for (Movie m : movieArrayList) {
    if (m.getTitle().equals("Arrival")) {
        return m;
    }
}
Alternatively, of course, for large scale a database would be the best answer, but it all depends on what this is really for and what its needs will be in the real world.

Document classification using libsvm in java

I am using the libsvm library for document classification of resumes. I have multiple resumes and I need to classify them. Do I need multi-label classification or multi-class classification in this case? Which of the above options should I consider? Please also suggest a way to do it.
Your requirement is not straightforward. In order to develop such a system you need to work through several steps, for example:
You need a data set of different types of documents (various types of resumes)
Then you need to identify what kinds of features can be used to separate them (how are you going to distinguish them, and based on what: e.g. resume length, word counts, content of the resume header, etc.)
Then you need to prepare sets of feature vectors in order to train the SVM (if you only need to classify resumes as relevant or irrelevant, that is two classes; if there are more than two classes, this is multi-class, and LibSVM supports multi-class classification)
When training, you need to perform scaling and cross-validation in order to increase the accuracy (read here)
You need to complete the above steps in order to make successful predictions; a sketch of the training and prediction step follows.
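A minimal sketch with libsvm's Java API, assuming you have already turned each resume into a numeric feature vector (the feature extraction itself is the real work and is not shown):

import libsvm.*;

// One svm_node[] per resume: (index, value) pairs for its features.
svm_problem prob = new svm_problem();
prob.l = trainVectors.length;             // number of training resumes (assumed variable)
prob.x = trainVectors;                    // svm_node[][] of feature vectors
prob.y = trainLabels;                     // double[] of class labels (0, 1, 2, ... for multi-class)

svm_parameter param = new svm_parameter();
param.svm_type = svm_parameter.C_SVC;     // classification
param.kernel_type = svm_parameter.LINEAR; // a common choice for text features
param.C = 1.0;
param.cache_size = 100;
param.eps = 1e-3;

svm_model model = svm.svm_train(prob, param);
double predictedClass = svm.svm_predict(model, someResumeVector);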

Three arff files majority voting

I have three different ARFF files that contain different classification information about the same instances, so that each line of each ARFF file concerns the same instance but contains different information about it. I would like to build a new classifier that takes a majority vote over three classifiers, each applied to its own ARFF data file, with cross-validation.
Any clue or hint is highly appreciated...
This is a very basic proposition, assuming you would use Java to train and evaluate your ensemble:
Prepare each of the three datasets according to their attribute requirements, or, if possible, use the same dataset for all three models with an attribute filter for each classifier (never tried this)
Train each of the three classifiers using the required training/attribute data
Code the Majority Vote rules at the end of your evaluation process
Evaluate your model on your Testing Set.
There may be other ways to do this (such as using the Vote combiner with AttributeSelectedClassifier), but doing this in code may give you more control and flexibility over what you are trying to combine. A rough sketch of the by-code route follows.
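The sketch below uses Weka; the classifier choices and file names are placeholders, and for a fair estimate you would apply the vote on held-out folds rather than on the training data as done here:

import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Load the three ARFF files; row i of each file describes the same instance.
Instances d1 = DataSource.read("view1.arff");
Instances d2 = DataSource.read("view2.arff");
Instances d3 = DataSource.read("view3.arff");
for (Instances d : new Instances[] { d1, d2, d3 }) {
    d.setClassIndex(d.numAttributes() - 1);
}

Classifier c1 = new J48();        c1.buildClassifier(d1);
Classifier c2 = new NaiveBayes(); c2.buildClassifier(d2);
Classifier c3 = new SMO();        c3.buildClassifier(d3);

// Majority vote per instance; ties fall back to the first classifier.
for (int i = 0; i < d1.numInstances(); i++) {
    double p1 = c1.classifyInstance(d1.instance(i));
    double p2 = c2.classifyInstance(d2.instance(i));
    double p3 = c3.classifyInstance(d3.instance(i));
    double vote = (p1 == p2 || p1 == p3) ? p1 : (p2 == p3 ? p2 : p1);
    System.out.println("instance " + i + " -> class " + vote);
}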

How to classify documents indexed with lucene

I have classified a set of documents with Lucene (fields: content, category). Each document has its own category, but some of them are labeled as uncategorized. Is there any way to classify these documents easily in Java?
Classification is a broad problem in the field of machine learning/statistics. After reading your question, it sounds like what you have done so far is akin to a SQL GROUP BY clause (though in Lucene). If you want the machine to classify the documents, then you need to know machine-learning algorithms like neural networks, Bayesian classifiers, SVMs, etc. There are excellent libraries available in Java for these tasks. For this to work you will need features (a set of attributes extracted from the data) on which you can train your algorithm so that it may predict your classification label.
There are some good APIs in Java (which allow you to concentrate on code without going too deep into the mathematical theory behind those algorithms, though knowing it would be very advantageous). Weka is good. I also came across a couple of books from Manning which handle these tasks well. Here you go:
Chapter 10 (Classification) of Collective Intelligence in Action: http://www.manning.com/alag/
Chapter 5 (Classification) of Algorithms of Intelligent Web: http://www.manning.com/marmanis/
These are absolutely fantastic materials (for Java people) on classification, particularly suited for people who just don't want to dive into the theory (though it is very essential :)) and just quickly want working code.
Collective Intelligence in Action solves the classification problem using JDM and Weka, so have a look at those two for your tasks.
Yes, you can use similarity queries such as those implemented by the MoreLikeThisQuery class for this kind of thing (assuming you have some large text field in the documents of your Lucene index). Have a look at the javadoc of the underlying MoreLikeThis class for details on how it works.
To turn your Lucene index into a text classifier you have two options:
For any new text to classify, query for the top 10 or 50 most similar documents that have at least one category, sum the category occurrences among those "neighbors", and pick the top 3 most frequent categories among those similar documents (for instance).
Alternatively, you can index a new set of aggregate documents, one per category, by concatenating (all or a sample of) the text of the documents in that category. Then run the similarity query with your input text directly against those "fake" documents.
The first strategy is known in machine learning as k-Nearest Neighbors classification. The second is a hack :)
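A rough sketch of the first strategy (the like(...) signature differs across Lucene versions, and the reader, analyzer, and field names are assumptions):

import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;

// Build a similarity query from the text of the uncategorized document.
MoreLikeThis mlt = new MoreLikeThis(reader); // reader: an open IndexReader
mlt.setAnalyzer(analyzer);
mlt.setFieldNames(new String[] { "content" });
Query query = mlt.like("content", new StringReader(newText));

// Vote over the categories of the 50 nearest neighbors.
IndexSearcher searcher = new IndexSearcher(reader);
Map<String, Integer> votes = new HashMap<>();
for (ScoreDoc sd : searcher.search(query, 50).scoreDocs) {
    Document doc = searcher.doc(sd.doc);
    String category = doc.get("category");
    if (category != null) {
        votes.merge(category, 1, Integer::sum);
    }
}
// The most frequent categories among the neighbors are the predictions.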
If you have many categories (say more than 1000) the second option might be better (faster to classify). I have not run any clean performance evaluation though.
You might also find this blog post interesting.
If you want to use Solr, you need to enable the MoreLikeThisHandler and set termVectors=true on the content field.
The sunburnt Solr client for Python is able to perform MLT queries. Here is a prototype Python classifier that uses Solr for classification with an index of Wikipedia categories:
https://github.com/ogrisel/pignlproc/blob/master/examples/topic-corpus/categorize.py
As of Lucene 5.2.1, you can use indexed documents to classify new documents. Out of the box, Lucene offers a naive Bayes classifier, a k-Nearest Neighbor classifier (based on the MoreLikeThis class), and a Perceptron-based classifier.
The drawback is that all of these classes are marked with experimental warnings and documented with links to Wikipedia.
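For what it's worth, a sketch of the k-Nearest Neighbor classifier from that module (this follows the 5.x API, and signatures have moved around in later versions; the LeafReader over the index is assumed to exist):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.KNearestNeighborClassifier;
import org.apache.lucene.util.BytesRef;

// Train on the already-categorized documents in the index, then classify new text.
KNearestNeighborClassifier knn = new KNearestNeighborClassifier(10); // k = 10 neighbors
knn.train(leafReader, "content", "category", new StandardAnalyzer());
ClassificationResult<BytesRef> result = knn.assignClass("text of an uncategorized document");
System.out.println(result.getAssignedClass().utf8ToString());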
