DataModel usage with FileItemSimilarity in Mahout - java

I'm building a recommender where the actual similarity computation is done with ItemSimilarityJob and the result is then loaded into a non-distributed recommender through FileItemSimilarity.
All this works so far (2), but there's one thing I'm a bit puzzled about.
When instantiating the recommender (GenericItemBasedRecommender), I have to pass along a data model, which would be FileDataModel in my case. Since the similarity computation has already taken place, I don't really know what data I should put into the model.
Clearly the model is used to determine the maximum and minimum preference values and the item and user IDs. Regarding the users, I'm planning to have only anonymous "profiles" anyway, so would it be OK to pass along fake data?
How is that supposed to work? The Mahout examples (1) and the MiA book don't give any answers on that, but both state that pre-computation is the way to go :(
(1) I'm running on Mahout 0.7 but also looked into trunk already.
(2) I had to transfer the generated similarity matrix into a textual format myself of course.

You should pass the same DataModel that was fed to the similarity computation. The recommender's output is certainly a function of the similarities, but it is also a function of the original data, of course. That's why it's an input.
You could in theory build similarities off a different DataModel than the data you are actually making recommendations from. It's possible and might make sense in some cases but is not normal.
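To see why the preference data still matters once the similarities are precomputed, here is a minimal plain-Java sketch of an item-based estimate as a similarity-weighted average of the user's own preferences. It is a simplification for illustration, not Mahout's actual code; the class name ItemBasedEstimate, the method estimate and the sample IDs and values are all made up.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy item-based estimate: similarities are precomputed, preferences come from the data model. */
class ItemBasedEstimate {

    /**
     * Estimates a user's preference for a target item as the similarity-weighted
     * average of the preferences that user has already expressed.
     *
     * @param userPrefs    itemID -> preference of one user (the data model part)
     * @param simsToTarget itemID -> precomputed similarity of that item to the target item
     * @return the estimate, or NaN if none of the user's items has a known similarity
     */
    static double estimate(Map<Long, Double> userPrefs, Map<Long, Double> simsToTarget) {
        double weightedSum = 0.0;
        double totalSimilarity = 0.0;
        for (Map.Entry<Long, Double> pref : userPrefs.entrySet()) {
            Double sim = simsToTarget.get(pref.getKey());
            if (sim == null) {
                continue; // no precomputed similarity for this item
            }
            weightedSum += sim * pref.getValue();
            totalSimilarity += sim;
        }
        return totalSimilarity == 0.0 ? Double.NaN : weightedSum / totalSimilarity;
    }

    public static void main(String[] args) {
        // Hypothetical user who rated item 1 with 5.0 and item 2 with 1.0.
        Map<Long, Double> userPrefs = new HashMap<>();
        userPrefs.put(1L, 5.0);
        userPrefs.put(2L, 1.0);

        // Hypothetical precomputed similarities of items 1 and 2 to target item 3.
        Map<Long, Double> simsToTarget = new HashMap<>();
        simsToTarget.put(1L, 0.75);
        simsToTarget.put(2L, 0.25);

        System.out.println(estimate(userPrefs, simsToTarget));
    }
}
```

With fake or empty preference data the estimate changes or becomes undefined, even though the similarity file is identical.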

Related

Predictive model using TensorFlow

My goal is to generate a predictive model using TensorFlow in Java, but I first want to ensure that my goal is achievable. Firstly, if I have a bunch of parameters and each set of parameters is assigned an output, is it possible to train a model to predict an output given similar parameters? I am able to get hundreds of thousands of samples (if needed) in order to train it, so is this possible?
Secondly, after the model is trained, how fast can it actually generate results?
Lastly, assuming everything up until this point checks out, what is the best method in Java's TensorFlow to train a model with data that has multiple parameters associated with an outcome? Also, if a given piece of data satisfies two results, can both be returned as options ordered from most likely to least likely?
Just to clarify, I am not asking someone to make this for me; I am just trying to make sure that a solution exists and is quick (if it's slow I could just go back to brute forcing, which I am trying to move away from since it is slow and resource intensive). Also, if you have any pointers on getting started with this, I would greatly appreciate it!
Your question is very, very general, but I'll try to offer some insight:
Firstly, if I have a bunch of parameters and each set of parameters is assigned an output is it possible to train a model to predict an output given similar parameters?
Taking a set of parameters (known as the feature set X) and making predictions of another set of parameters (known as the output set Y) is the primary purpose of machine learning. Exactly how to do this requires many steps, how to do it well takes a lot of experience... However if you are asking if it is possible in principle, that depends on the specific feature set X, and output set Y.
I am able to get hundreds of thousands samples (if needed) in order to train it so is this possible?
The trick to machine learning is that the data must be of sufficient quantity and quality. Judging whether it is takes domain-specific knowledge.
Are you able to provide any specifics about your data to help us understand?

Function fit and data-fitting with AnyLogic sims

In a simulation I get some data that looks like an arctan or tanh function.
I want to implement a function fit in Java to get the parameters of this function for optimization. For other functions I used, for example, the Apache code for fitting polynomial and Gaussian functions, but I couldn't find a solution for tangent.
To be honest, I don't know how to write such a function fit, so maybe someone can help me fix this problem or knows whether a function fit already exists for such functions.
There is an example model called "Calibration of agent based SIR model" that does what you are looking for: it calibrates model parameters so the output matches a given function (not tangent in this example, but easy to adjust).
Short answer
AnyLogic does not have any data-fitting capabilities built-in, other than simple interpolation of discrete data (see Table Functions in the help). So
(a) if you needed to do it in-model (e.g., driven by some model state), you'd need to find a suitable Java library that did what was missing in what you'd already tried (Apache Commons), and call that from the AnyLogic model;
(b) if you could do it outside the model, use a data-fitting tool like Stat::Fit (which exists as a plug-in for some sim tools like Simul8, but not for AnyLogic).
Longer answer
Based on your additional explanatory comments, it sounds like this is a question where it's crucial to properly explain your context, and perhaps you don't need to use data-fitting at all (and there may be a more 'AnyLogic-centric' way of approaching it in that case). Particularly around the intended interaction between simulation and (mathematical) Gurobi optimisation; note that AnyLogic has built-in heuristic optimisation via OptQuest so any normal discussion of 'optimisation' with AnyLogic is referring to that.
On the one hand you seem to suggest you want to fit a function to some input data to your simulation. (You talk about having Excel inputs and wanting to fit a curve to it.)
On the other hand, you seem to suggest you want an approach where you are optimising at intermediate time intervals based on run-time model state. But what is the optimiser determining and how do its results affect the ongoing execution of the simulation? You say "So it is not about an optimization of the whole model but of intermediate results. Since I didn't find a solution for this". What 'solution' are you looking for? This sounds like an approach where you're modelling decisions for time period N being made inside the simulation, where those decisions are based on an optimisation using the outcomes from period N-1 as its inputs (and thus the optimisation is effectively based on a simplified emulation of the simulation using a function, since the simulation is already supposed to be the most-accurate computational representation of the real-world system).
So perhaps(?) you're saying that you are emulating/approximating the simulation as a function of its input data (where you happen to think a tangent function fits). In which case the original suggestion (a) is probably the only thing that makes sense. Though, even then, when you are optimising for anything after the first time period, the 'inputs' are no longer the original model inputs; they are some representation of the simulation's current state/outcomes (so it's not clear that this relates to the Excel input data directly, and so maybe I'm barking up the wrong tree).
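If suggestion (a) is the route taken and no library fits, a small tanh fit can also be written in plain Java and called from the model. The sketch below assumes the simplified model y = a * tanh(b * x) (no offsets), which is an assumption and not something stated in the question. It scans b over a caller-supplied range and solves for a in closed form at each step, since the model is linear in a once b is fixed. The class name TanhFit and the sample data are hypothetical.

```java
/** Least-squares fit of y = a * tanh(b * x) by scanning b; a has a closed form for fixed b. */
class TanhFit {

    /** Returns {a, b} with the smallest sum of squared residuals among the scanned values of b. */
    static double[] fit(double[] x, double[] y, double bMin, double bMax, int steps) {
        double bestA = 0.0;
        double bestB = bMin;
        double bestErr = Double.POSITIVE_INFINITY;
        for (int i = 0; i <= steps; i++) {
            double b = bMin + (bMax - bMin) * i / steps;
            // For fixed b: a = sum(y * t) / sum(t * t), where t = tanh(b * x).
            double sumYT = 0.0;
            double sumTT = 0.0;
            for (int k = 0; k < x.length; k++) {
                double t = Math.tanh(b * x[k]);
                sumYT += y[k] * t;
                sumTT += t * t;
            }
            if (sumTT == 0.0) {
                continue; // model is identically zero for this b
            }
            double a = sumYT / sumTT;
            double err = 0.0;
            for (int k = 0; k < x.length; k++) {
                double r = y[k] - a * Math.tanh(b * x[k]);
                err += r * r;
            }
            if (err < bestErr) {
                bestErr = err;
                bestA = a;
                bestB = b;
            }
        }
        return new double[] {bestA, bestB};
    }

    public static void main(String[] args) {
        // Synthetic, noise-free sample data generated with a = 2.0 and b = 0.5.
        double[] x = new double[21];
        double[] y = new double[21];
        for (int i = 0; i < x.length; i++) {
            x[i] = -5.0 + 0.5 * i;
            y[i] = 2.0 * Math.tanh(0.5 * x[i]);
        }
        double[] p = fit(x, y, 0.01, 5.0, 4990);
        System.out.printf("a = %.3f, b = %.3f%n", p[0], p[1]);
    }
}
```

A scan like this is slow compared with a proper nonlinear least-squares solver, but it has no starting-value problems and is easy to check by hand. A model with extra offsets would need a real optimizer.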

ML and DL4J Phases by Example

I have a large S3 bucket full of photos of 4 different types of animals. My foray into ML will be to see if I can successfully get Deep Learning 4 Java (DL4J) to be shown a new arbitrary photo of one of those 4 species and get it to consistently, correctly guess which animal it is.
My understanding is that I must first perform a "training phase" which effectively builds up an (in-memory) neural network that consists of nodes and weights derived from both this S3 bucket (input data) and my own coding and usage of the DL4J library.
Once trained (meaning, once I have an in-memory neural net built up), my understanding is that I can then enter zero or more "testing phases" where I give a single new image as input, let the program decide what type of animal it thinks the image is of, and then manually mark the output as being correct (the program guessed right) or incorrect with corrections (the program guessed wrong, and oh by the way, such and so was the correct answer). My understanding is that these test phases should help tweak your algorithms and minimize error.
Finally, it is my understanding that the library can then be used in a live "production phase" whereby the program is just responding to images as inputs and making decisions as to what it thinks they are.
All this to ask: is my understanding of ML and DL4J's basic methodology correct, or am I misled in any way?
Training: that's the same in any framework. You can also persist the neural network, either with the Java-based SerializationUtils or, in the newer release, with ModelSerializer.
This is more of an integrations play than a "can it do x?"
DL4J can integrate with Kafka/Spark Streaming and do online/mini-batch learning.
The neural nets are embeddable in a production environment.
My only tip here is to ensure that you have the same data pipeline for training as well as testing.
This is mainly to ensure consistency between the data you train on and the data you test on.
Also, ensure you have minibatch(true) (the default) if you are doing mini-batch/online learning, or minibatch(false) if you are training on the whole dataset at once.
I would also suggest using StandardScaler (https://github.com/deeplearning4j/nd4j/blob/master/nd4j-backends/nd4j-api-parent/nd4j-api/src/main/java/org/nd4j/linalg/dataset/api/iterator/StandardScaler.java) or something similar for persisting global statistics around your data. Much of the data pipeline will depend on the libraries you use to build it, though.
I would assume you would want to normalize your data in some way though.
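The "same pipeline for training and testing" advice can be illustrated without any library. The sketch below is plain Java, not the ND4J StandardScaler API. It computes the mean and standard deviation from the training values only and then reuses those statistics for any later data. The class name FeatureStandardizer and the sample numbers are hypothetical.

```java
/** Standardizes one feature using statistics computed from the training data only. */
class FeatureStandardizer {
    private double mean;
    private double std = 1.0;

    /** Computes the mean and population standard deviation of the training values. */
    void fit(double[] train) {
        if (train.length == 0) {
            throw new IllegalArgumentException("training data must not be empty");
        }
        double sum = 0.0;
        for (double v : train) {
            sum += v;
        }
        mean = sum / train.length;
        double squared = 0.0;
        for (double v : train) {
            squared += (v - mean) * (v - mean);
        }
        std = Math.sqrt(squared / train.length);
        if (std == 0.0) {
            std = 1.0; // constant feature: avoid dividing by zero
        }
    }

    /** Applies the stored training statistics; call this for training, test and production data alike. */
    double[] transform(double[] values) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (values[i] - mean) / std;
        }
        return out;
    }

    public static void main(String[] args) {
        FeatureStandardizer scaler = new FeatureStandardizer();
        scaler.fit(new double[] {2, 4, 4, 4, 5, 5, 7, 9}); // mean 5, standard deviation 2
        double[] test = scaler.transform(new double[] {9, 5});
        System.out.println(test[0] + " " + test[1]);
    }
}
```

The point is that fit is called once, on training data, and the resulting statistics are persisted and reused rather than recomputed on the test set.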

Encog - How to load training data for Neural Network

The NeuralDataSet objects that I've seen in action haven't been anything but XOR which is just two small data arrays... I haven't been able to figure out anything from the documentation on MLDataSet.
It seems like everything must be loaded at once. However, I would like to loop through the training data until I reach EOF and then count that as one epoch. In everything I've seen, all the data must be loaded into one 2D array from the beginning. How can I get around this?
I've read this question, and the answers didn't really help me. And besides that, I haven't found a similar question asked on here.
This is possible: you can either use an existing data set implementation that supports streaming operation, or implement your own on top of whatever source you have. Check out the MLDataSet interface, its BasicMLDataSet implementation, and the SQLNeuralDataSet code as an example. You will have to implement a codec if you have a specific format. For CSV there is an implementation already, though I haven't checked whether it is memory based.
Remember when doing this that your data will be streamed fully for each epoch and from my experience that is a much higher bottleneck than the actual computation of the network.
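The streaming idea itself, independent of Encog's classes, looks roughly like the plain-Java sketch below: the file is re-read from the start for every epoch, one row is parsed at a time, and reaching EOF ends the epoch. The names StreamingEpochs and RowHandler are hypothetical. In a real setup the handler would feed each row to whatever training step the network exposes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Streams a CSV file once per epoch instead of loading it into a 2D array. */
class StreamingEpochs {

    /** Callback invoked once for every parsed training row. */
    interface RowHandler {
        void accept(double[] row);
    }

    /** Reads the file from start to EOF for each epoch; returns the total number of rows handled. */
    static long run(Path csv, int epochs, RowHandler handler) throws IOException {
        long rows = 0;
        for (int epoch = 0; epoch < epochs; epoch++) {
            try (BufferedReader in = Files.newBufferedReader(csv)) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.trim().isEmpty()) {
                        continue; // skip blank lines
                    }
                    String[] cells = line.split(",");
                    double[] row = new double[cells.length];
                    for (int i = 0; i < cells.length; i++) {
                        row[i] = Double.parseDouble(cells[i].trim());
                    }
                    handler.accept(row);
                    rows++;
                }
            } // EOF reached: this epoch is complete
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Tiny XOR file as stand-in data: two inputs and one expected output per row.
        Path csv = Files.createTempFile("xor", ".csv");
        Files.write(csv, Arrays.asList("0,0,0", "0,1,1", "1,0,1", "1,1,0"));
        long rows = run(csv, 3, row -> { /* train on this row here */ });
        System.out.println("rows processed: " + rows);
        Files.delete(csv);
    }
}
```

Only one row is held in memory at a time, at the price of re-reading and re-parsing the file for every epoch, which is the I/O bottleneck mentioned above.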

Identify an english word as a thing or product?

Write a program with the following objective -
be able to identify whether a word/phrase represents a thing/product. For example -
1) "A glove comprising at least an index finger receptacle, a middle finger receptacle.." <-Be able to identify glove as a thing/product.
2) "In a window regulator, especially for automobiles, in which the window is connected to a drive..." <- be able to identify regulator as a thing.
Doing this tells me that the text is talking about a thing/product. By contrast, the following text talks about a process instead of a thing/product -> "An extrusion coating process for the production of flexible packaging films of nylon coated substrates consisting of the steps of..."
I have millions of such texts; hence, manually doing it is not feasible. So far, with the help of using NLTK + Python, I have been able to identify some specific cases which use very similar keywords. But I have not been able to do the same with the kinds mentioned in the examples above. Any help will be appreciated!
What you want to do is actually pretty difficult. It is a sort of (very specific) semantic labelling task. The possible solutions are:
create your own labelling algorithm, create training data, test, eval and finally tag your data
use an existing knowledge base (lexicon) to extract semantic labels for each target word
The first option is a complex research project in itself. Do it if you have the time and resources.
The second option will only give you the labels that are available in the knowledge base, and these might not match your wishes. I would give it a try with python, NLTK and Wordnet (interface already available), you might be able to use synset hypernyms for your problem.
This task is a named entity recognition (NER) problem.
EDIT: There is no clean definition of NER in the NLP community, so one can say this is not an NER task but an instance of the more general sequence labeling problem. Anyway, there is still no tool that can do this out of the box.
Out of the box, Stanford NLP can only recognize the following types:
Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical
(MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION,
SET) entities
so it is not suitable for solving this task. There are some commercial solutions that can possibly do the job; they can be readily found by googling "product name named entity recognition", and some of them offer free trial plans. I don't know of any free, ready-to-deploy solution.
Of course, you can create your own model by hand-annotating about 1000 or so sentences containing product names and training a classifier such as a Conditional Random Field classifier with some basic features (here is a documentation page that explains how to do that with Stanford NLP). This solution should work reasonably well, although it won't be perfect of course (no system will be perfect, but some solutions are better than others).
EDIT: This is a complex task per se, but not that complex unless you want state-of-the-art results. You can create a reasonably good model in just 2-3 days. Here is an (example) step-by-step instruction on how to do this using an open source tool:
1. Download CRF++ and look at the provided examples; they are in a simple text format.
2. Annotate your data in a similar manner:
a OTHER
glove PRODUCT
comprising OTHER
...
and so on.
3. Split your annotated data into two files, train (80%) and dev (20%), and use the following baseline template features (paste them into the template file):
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
4. Run
crf_learn template train.txt model
crf_test -m model dev.txt > result.txt
Look at result.txt. one column will contain your hand-labeled data and other - machine predicted labels. You can then compare these, compute accuracy etc. After that you can feed new unlabeled data into crf_test and get your labels.
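The comparison step can be scripted. The plain-Java sketch below assumes that each non-empty line of result.txt ends with the hand-labeled tag followed by the predicted tag, separated by whitespace, and that blank lines separate sentences. Check this layout against your own output first. The class name CrfAccuracy and the sample lines are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Token-level accuracy from lines whose last two columns are the gold tag and the predicted tag. */
class CrfAccuracy {

    /** Returns the fraction of tokens where gold and predicted tags match, or NaN if there are no tokens. */
    static double accuracy(List<String> lines) {
        int total = 0;
        int correct = 0;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // blank line: sentence boundary
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length < 2) {
                continue; // not enough columns to compare
            }
            String gold = cols[cols.length - 2];
            String predicted = cols[cols.length - 1];
            total++;
            if (gold.equals(predicted)) {
                correct++;
            }
        }
        return total == 0 ? Double.NaN : (double) correct / total;
    }

    public static void main(String[] args) {
        // Made-up sample in the assumed layout: token, gold tag, predicted tag.
        List<String> sample = Arrays.asList(
                "a OTHER OTHER",
                "glove PRODUCT PRODUCT",
                "comprising OTHER PRODUCT",
                "",
                "window OTHER OTHER");
        System.out.println(accuracy(sample));
    }
}
```

For a real run, the lines could be read with Files.readAllLines(Paths.get("result.txt")). Per-label precision and recall for PRODUCT would be more informative than plain accuracy when most tokens are OTHER.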
As I said, this won't be perfect, but I would be very surprised if it weren't reasonably good (I actually solved a very similar task not long ago), and it is certainly better than just using a few keywords/templates.
ENDNOTE: this ignores many things and some best-practices in solving such tasks, won't be good for academic research, not 100% guaranteed to work, but still useful for this and many similar problems as relatively quick solution.
