I'm interested in AI, and two days ago I came across an interesting recent development in this area called ES-HyperNEAT. First there was NEAT, then HyperNEAT, then ES-HyperNEAT.
Here are some links on the topic:
http://eplex.cs.ucf.edu/hyperNEATpage/
http://eplex.cs.ucf.edu/ESHyperNEAT/
So I've downloaded the Java version of AHNI, but there is no tutorial; I guess the developers took it for granted that it's easy to use. I don't know how to implement a solution to the following problem. It doesn't seem very hard, but could someone show me how to get started?
Input looks like this:
Date , A , B , C , D
2013-07-26,18.94,19.06,18.50,18.63
2013-07-25,18.85,19.26,18.55,19.04
2013-07-24,19.32,19.40,18.47,18.99
2013-07-23,20.15,20.30,19.16,19.22 <-- Predict it ? [ Output ]
2013-07-22,20.09,20.23,19.80,20.03 <-- Start Date
2013-07-19,20.08,20.48,19.76,20.02
2013-07-18,19.88,20.68,19.64,20.12
2013-07-17,19.98,20.07,19.69,19.83
2013-07-16,20.38,20.49,19.51,19.92
......
2013-07-02,18.19,18.20,17.32,17.69
2013-07-01,18.38,18.96,17.95,18.15 <-- End Date
The program should read the above data from the Start Date, counting back n days to the End Date, and train on those data; the correct output is always the next day's D value. How can this be implemented with ES-HyperNEAT?
Specifically:
[1] Which classes do I call to start the process?
[2] How do I tell it which fields in the input file to gather data from? In this case it can ignore the Date field and gather data from A, B, C, D [not normalized to 0,1].
[3] How do I tell it that the correct result is the next day's D value?
[4] How do I specify that the program should start from line x at the Start Date and read data through line y at the End Date?
Is there something like: myProgram.start(FilePath, Delimiter, Field2, Field3, ..., Line_X, Line_Y, ...)?
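To make the requirement concrete, here is a plain-Java sketch of the windowing I have in mind. Nothing here is AHNI-specific; the class and method names are made up, and it assumes rows are ordered newest-first, as in the sample above, so the "next day" for line i is line i - 1.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the desired windowing; NOT part of the AHNI API.
public class TimeSeriesWindows {

    // One training case: n days of A,B,C,D values plus the next day's D.
    public static class Sample {
        final double[] inputs; // n * 4 values
        final double target;   // next day's D value
        Sample(double[] inputs, double target) {
            this.inputs = inputs;
            this.target = target;
        }
    }

    // Builds one sample per window of n consecutive rows between the
    // (0-based) start and end line numbers, skipping the Date column.
    public static List<Sample> load(String path, int n, int startLine, int endLine)
            throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(path));
        List<Sample> samples = new ArrayList<>();
        for (int i = startLine; i + n - 1 <= endLine && i - 1 >= 1; i++) {
            double[] inputs = new double[n * 4];
            for (int j = 0; j < n; j++) {
                String[] f = lines.get(i + j).split(",");
                for (int k = 0; k < 4; k++) {
                    inputs[j * 4 + k] = Double.parseDouble(f[k + 1].trim());
                }
            }
            // The target is the D column of the line above the window.
            String[] next = lines.get(i - 1).split(",");
            samples.add(new Sample(inputs, Double.parseDouble(next[4].trim())));
        }
        return samples;
    }
}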
The readme.txt (which you can see at https://github.com/OliverColeman/ahni) contains some info about getting started with your own experiments; see in particular the DEVELOPMENT AND CREATING NEW EXPERIMENTS section. There is currently no code specific to performing time-series prediction in AHNI, so you would have to extend one of the base fitness function classes (see the readme). Your code would need to do the things you ask about (points 2-4), but you could create a fairly generic time-series prediction class that is configured via the .properties file to specify those things. If you do this, then feel free to contribute it and we'll add it to the AHNI software on GitHub :).
AHNI is intended as a research platform to support my own research (and hopefully others along the way), rather than an "easy to use, throw generic machine learning problem X at it" kind of software package (depending on your definition of "easy to use"). I try to keep the code clean, well-organised and the API well-documented so that others may use it, but creating a full-blown tutorial (and functionality) for the many possible use-cases is beyond the scope of the project (though of course I'd gladly include tutorials written by others).
Before going further, I recommend considering the following:
When googling around for previous research on using HyperNEAT for time-series prediction, I came across a question I asked several years ago that is similar to yours and that I had completely forgotten about (I was surprised to see my name attached to it! :)): http://tech.groups.yahoo.com/group/neat/message/5470 The reply to that question is good food for thought on the matter. Additionally:
(ES-)HyperNEAT is designed to exploit geometric regularities (patterns, correlations) in the input or output (see http://eplex.cs.ucf.edu/papers/gauci_nc10.pdf), so one question worth exploring is whether the data contains regularities that can be represented geometrically. In my question I suggested plotting some window of the time series on a 2D plane, which the 2D input layer of the network "sees", similar to the approach used in http://eplex.cs.ucf.edu/papers/verbancsics_gecco10.pdf. However, it sounds like NEAT, using a recurrent network, might be just as good if not better than HyperNEAT for this kind of problem.
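To make the plotting idea concrete, here is a minimal sketch (not AHNI code; all names are illustrative) of turning a window of values into a 2D grid that a HyperNEAT substrate's input layer could "see":

// Sketch only: discretise a window of values into a rows x window.length
// grid, one column per day, with a 1 in the cell for that day's value.
public class SeriesToGrid {

    public static double[][] plot(double[] window, int rows) {
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double v : window) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double[][] grid = new double[rows][window.length];
        for (int col = 0; col < window.length; col++) {
            // Scale the value into [0, rows - 1] and mark that cell.
            double scaled = (window[col] - min) / (max - min + 1e-9);
            int row = (int) Math.round(scaled * (rows - 1));
            grid[row][col] = 1.0;
        }
        return grid;
    }
}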
Related
I want to create an algorithm that searches job descriptions for given words (like Java, Angular, Docker, etc). My algorithm works, but it is rather naive. For example, it cannot detect the word Java if it is contained in another word (such as JavaEE). When I check for substrings, I have the problem that, for example, Java is recognized in the word JavaScript, which I want to avoid. I could of course make an explicit case distinction here, but I'm more looking for a general solution.
Are there any particular techniques or approaches that try to solve this problem?
Unfortunately, I don't have the amount of data necessary for data-driven approaches like machine learning.
Train a simple word2vec language model on your whole corpus of job description text. Then use your own logic to find the keywords: when you find a match that isn't exact, consult the model's list of similar words.
For example, if you're searching for Java but also find JavaScript, use your word vectors to check whether there is any similarity between them (in other words, whether they have ever been used in a similar context). Java and JavaEE have probably been used in the same sentence before, but Java and JavaScript, or Angular and Angularentwicklung, have not.
It may seem a bit like over-engineering, but it's not :).
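To sketch what that matching step could look like in Java, assuming you have already trained vectors on your corpus (e.g. with a word2vec library) and loaded them into a map; the map, names, and threshold here are purely illustrative:

import java.util.Map;

// Sketch of the matching logic: accept a fuzzy match only if the
// word vectors say the two words appear in similar contexts.
public class KeywordMatcher {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Accepts a match like "javaee" for "java" only if the vectors agree.
    static boolean isVariantOf(String keyword, String candidate,
                               Map<String, double[]> vectors, double threshold) {
        if (candidate.equals(keyword)) return true;      // exact match
        if (!candidate.contains(keyword)) return false;  // not even a substring
        double[] kv = vectors.get(keyword);
        double[] cv = vectors.get(candidate);
        if (kv == null || cv == null) return false;      // unseen word
        return cosine(kv, cv) >= threshold;              // similar contexts?
    }
}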
I spent some time researching my problem, and I found that identifying certain words, even if they don't match 1:1, is not a trivial problem. You could solve the problem by listing synonyms for the words you are looking for, or you could build a rule-based named entity recognition service. But that is both error-prone and maintenance-intensive.
Probably the best way to solve my problem is to build a named entity recognition service using machine learning. I am currently watching a video series that looks very promising for the given problem. --> https://www.youtube.com/playlist?list=PL2VXyKi-KpYs1bSnT8bfMFyGS-wMcjesM
I will comment on this answer when I am done with my work to give feedback to those who are facing the same problem.
The only NeuralDataSet objects that I've seen in action have been for XOR, which is just two small data arrays; I haven't been able to figure out anything from the documentation on MLDataSet.
It seems like everything must be loaded at once. I would like to loop through the training data until I reach EOF and then count that as one epoch. However, in everything I've seen, all the data must be loaded into one 2D array from the beginning. How can I get around this?
I've read this question, and the answers didn't really help me. And besides that, I haven't found a similar question asked on here.
This is possible: you can either use an existing implementation of a data set that supports streaming operation, or you can implement your own on top of whatever source you have. Check out the MLDataSet interface (BasicMLDataSet is a memory-based implementation) and the SQLNeuralDataSet code as an example. You will have to implement a codec if you have a specific format. For CSV there is an implementation already, though I haven't checked whether it is memory based.
Remember when doing this that your data will be streamed in full for each epoch, and in my experience that is a much bigger bottleneck than the actual computation of the network.
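As a rough illustration of the streaming idea (this is not the exact Encog API; a real implementation would wrap something like this in an MLDataSet), here is a source that re-reads a CSV file from disk on each pass, so one full iteration corresponds to one epoch:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;

// Sketch: streams rows from disk instead of holding a 2D array in memory.
public class CsvStream implements Iterable<double[]> {
    private final String path;

    public CsvStream(String path) {
        this.path = path;
    }

    @Override
    public Iterator<double[]> iterator() {
        try {
            final BufferedReader reader = new BufferedReader(new FileReader(path));
            final String firstLine = reader.readLine();
            return new Iterator<double[]>() {
                String current = firstLine;

                @Override
                public boolean hasNext() {
                    return current != null;
                }

                @Override
                public double[] next() {
                    String[] fields = current.split(",");
                    double[] row = new double[fields.length];
                    for (int i = 0; i < fields.length; i++) {
                        row[i] = Double.parseDouble(fields[i]);
                    }
                    try {
                        current = reader.readLine();
                        if (current == null) {
                            reader.close(); // EOF reached: one epoch done
                        }
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                    return row;
                }
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}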
Write a program with the following objective:
be able to identify whether a word/phrase represents a thing/product. For example:
1) "A glove comprising at least an index finger receptacle, a middle finger receptacle.." <-Be able to identify glove as a thing/product.
2) "In a window regulator, especially for automobiles, in which the window is connected to a drive..." <- be able to identify regulator as a thing.
Doing this tells me that the text is talking about a thing/product. As a contrast, the following text talks about a process instead of a thing/product: "An extrusion coating process for the production of flexible packaging films of nylon coated substrates consisting of the steps of..."
I have millions of such texts, so doing this manually is not feasible. So far, using NLTK and Python, I have been able to identify some specific cases that use very similar keywords, but I have not been able to handle the kinds of examples above. Any help will be appreciated!
What you want to do is actually pretty difficult. It is a sort of (very specific) semantic labelling task. The possible solutions are:
create your own labelling algorithm, create training data, test, eval and finally tag your data
use an existing knowledge base (lexicon) to extract semantic labels for each target word
The first option is a complex research project in itself. Do it if you have the time and resources.
The second option will only give you the labels that are available in the knowledge base, and these might not match your wishes. I would give it a try with Python, NLTK, and WordNet (an interface is already available); you might be able to use synset hypernyms for your problem.
This task is called the named entity recognition (NER) problem.
EDIT: There is no clean definition of NER in the NLP community, so one could say this is not a NER task but an instance of the more general sequence labeling problem. Either way, there is still no tool that can do this out of the box.
Out of the box, Stanford NLP can only recognize the following types:
Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical
(MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION,
SET) entities
so it is not suitable for solving this task. There are some commercial solutions that may be able to do the job; they can be readily found by googling "product name named entity recognition", and some of them offer free trial plans. I don't know of any free, ready-to-deploy solution.
Of course, you can create your own model by hand-annotating about 1000 or so sentences containing product names and training a classifier such as a Conditional Random Field classifier with some basic features (here is a documentation page that explains how to do that with Stanford NLP). This solution should work reasonably well, though it won't be perfect (no system is perfect, but some solutions are better than others).
EDIT: This is a complex task per se, but not that complex unless you want state-of-the-art results. You can create a reasonably good model in just 2-3 days. Here are example step-by-step instructions for doing this with an open-source tool:
Download CRF++ and look at the provided examples; they are in a simple text format.
Annotate your data in a similar manner:
a OTHER
glove PRODUCT
comprising OTHER
...
and so on.
Split your annotated data into two files: train (80%) and dev (20%).
Use the following baseline template features (paste them into the template file):
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
Run:
crf_learn template train.txt model
crf_test -m model dev.txt > result.txt
Look at result.txt: one column will contain your hand-labeled data and another the machine-predicted labels. You can then compare them, compute accuracy, etc. After that you can feed new unlabeled data into crf_test and get your labels.
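If it helps, here is a tiny Java snippet for the comparison step. It assumes the crf_test output layout, where the gold label is the second-to-last column and the predicted label is the last one; the file name is just an example.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Sketch: count per-token accuracy over CRF++'s result.txt.
public class CrfAccuracy {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("result.txt"));
        int total = 0, correct = 0;
        for (String line : lines) {
            if (line.trim().isEmpty()) continue; // sentence boundary
            String[] cols = line.trim().split("\\s+");
            String gold = cols[cols.length - 2];
            String predicted = cols[cols.length - 1];
            total++;
            if (gold.equals(predicted)) correct++;
        }
        System.out.printf("token accuracy: %.2f%% (%d/%d)%n",
                100.0 * correct / total, correct, total);
    }
}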
As I said, this won't be perfect, but I will be very surprised if it isn't reasonably good (I actually solved a very similar task not long ago), and it is certainly better than just using a few keywords/templates.
ENDNOTE: this ignores many things and some best practices for solving such tasks, won't be good for academic research, and is not 100% guaranteed to work, but it is still useful as a relatively quick solution for this and many similar problems.
I want to get Tomcat log file data for a given date range (from and to) and display it as a file in Java.
Can anyone please guide me on how to do this?
Thanks in advance.
Create a file with some sample lines containing log entries from Tomcat.
Start writing some code to read at least one of these lines. Here's a resource that will help you with this task:
http://docs.oracle.com/javase/tutorial/essential/io/file.html
Once you read the line, keep studying on how to iterate through all the lines within the file.
The next step is to identify the date. You can assume that the characters will always be in the same position and retrieve the specific digits you want (e.g., myString.charAt(6)), convert them to numbers, and code some comparison to find the range of log entries you want (which can get quite messy). Alternatively, you can move on to regular expressions and use the Java time/date API for the comparison.
How about finding a match between the "from" parameter used as input and each line? Keep iterating through the lines and, when you find "from", start adding lines to a List of log entries; when you find "to", break the loop, and you will have the log entries you need.
I'm not going to share any code here because, well.. you probably know the classic saying: "give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime"
Keep in mind the "DTSTTCPW" principle:
http://www.xprogramming.com/Practices/PracSimplest.html
I will probably regret stepping in here, as I have seen where these sort of posts head.
To answer your question from the comments above: DTSTTCPW is "Do The Simplest Thing That Could Possibly Work". In other words, figure out what you need to do, then do it in the easiest way possible.
Step 1
So, let's take the problem at hand. You need to read in the log files from the Tomcat logs. Do you know where those logs are? If not, figure that out. If so, good... on to Step 2.
Step 2
Can you read in a file in Java? There are several ways to do it. I would suggest looking at File, FileInputStream, and BufferedReader as a start. You can find these in the good-ol-Java API documentation, which I will note is VERY clear and concise. Once you can read in a file, go to Step 3.
Step 3
Using File, get a list of all files in the Tomcat log folder. Again, the Java API will help here. I would suggest looking at the set of list() methods, which return an array of Strings naming the files in the directory (assuming you created the File from the directory path to the Tomcat logs, as mentioned before). I would suggest adding a FilenameFilter to keep just the files you are interested in, as in the sketch below.
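For illustration, a small sketch of this step; the directory path and name pattern are examples, not fixed Tomcat conventions:

import java.io.File;
import java.io.FilenameFilter;

// Sketch: list only the log files of interest in the Tomcat log folder.
public class ListLogs {
    public static String[] logFiles(String logDir) {
        File dir = new File(logDir);
        return dir.list(new FilenameFilter() {
            @Override
            public boolean accept(File d, String name) {
                // Adjust the pattern to match your own log file names.
                return name.startsWith("catalina.") && name.endsWith(".log");
            }
        });
    }
}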
Step 4
Start opening and reading files. Here we can get a little "fancy" by only reading in files that were modified between your start and end dates, since anything outside that range probably would not contain log entries for the range you are looking for. BufferedReader is good for text files, and it even has a readLine(). Once you have the file reading in place, move to Step 5.
Step 5
Upon reading a line, parse it to see what date/time stamp is on it. There are lots of ways to do this: split() works well if your lines are delimited by spaces, or substring() if you know the fields fall on fixed index bounds. Regex matching/parsing can be used as well if you have something more complicated. Use the KISS principle here... don't make it hard if it doesn't need to be. Note: I am not going to get into a discussion about regex here. There are entire volumes of information on regex, how to write it, etc. Matching a date/time string with regex would not be complicated, and I am sure you can Google and find a solution for that.
Step 6
When you pull out the date, convert it to a Java Date (see SimpleDateFormat) and determine whether that line is in range for what you are looking for, as in the sketch below. If it is, put it somewhere (a List, an array of Strings... whatever) and move on to the next line. Keep doing this until all lines of all files you intend to look at are done. In the end, you should have the lines you are trying to find. Return those.
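A sketch of the date check; the pattern is an example, so match it to the timestamp format your logs actually use:

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Sketch: parse the extracted date text and test it against the range.
public class DateInRange {
    private static final SimpleDateFormat FORMAT =
            new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    public static boolean inRange(String dateText, Date from, Date to)
            throws ParseException {
        Date d = FORMAT.parse(dateText);
        return !d.before(from) && !d.after(to); // inclusive on both ends
    }
}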
Step 7
Print them out in your JSP. Again, not going to go into depth here as there are probably 1,000,000 possible ways you could decide to do this. I would suggest looking at Spring's Model, putting the data there, and printing it on the JSP. There are JSP tags for looping and such...Google is your friend.
Bonus Information
This is the Bonus round, where we discuss what makes a good StackOverflow post and what doesn't.
First, do your own research. What I mean by that is: If you have a question like "How do I build a rocket" you are not going to get very good answers. If you try to start your own rocket, and run into issues with the boosters, then you can ask a more specific question like "I am having premature failure in my booster. I think my O2 mixture is wrong. For a 2-ton solid booster, what is the proper O2 mixture to get maximum lift". See? The second question is MUCH more specific and answerable. Could you imagine how long an answer would be to "how do I build a rocket"? I mean, you could probably write several books on the subject, right? Almost like asking "how do I write this program as a JSP"...
Second, after you have a more specific and better-researched question, MAKE SURE YOU HAVE CODE EXAMPLES!!!! Some questions might not warrant code, but most do. If you are asking something, and you can produce some code that people can use to see what you are doing, then provide the code. We cannot read minds, nor do we want to waste our time writing our own sample trying to reproduce your problem.
Third, CODE SAMPLES SHOULD BE SHORT, CONCISE, AND MOST IMPORTANTLY RELEVANT!!! If you are asking about A, but provide a sample of B, then cluttering the question with B only prevents us from answering A.
Fourth, POSTS SHOULD ONLY CONTAIN ONE QUESTION, NOT MULTI-PART QUESTIONS!!! We have all done this, and probably still do it, but don't ask questions that require more than one answer. How are you going to reward people if one person answers one half of the question while another answers the other half? You can't accept two answers, can you? People are here for the "free internet points".
I am developing a financial manager in my free time with Java and a Swing GUI. When the user adds a new entry, he is prompted to fill in: Moneyamount, Date, Comment and Section (e.g. Car, Salary, Computer, Food, ...).
The sections are created "on the fly". When the user enters a new section, it will be added to the section-jcombobox for further selection. The other point is, that the comments could be in different languages. So the list of hard coded words and synonyms would be enormous.
So, my question is: is it possible to analyse the comment (e.g. "Fuel", "Car service", "Lunch at **") and preselect a fitting Section?
My first thought was, do it with a neural network and learn from the input, if the user selects another section.
But my problem is, I don't know how to start at all. I tried "encog" with Eclipse and did some tutorials (XOR, ...), but all of them only use doubles as input/output.
Could anyone give me a hint on how to start, or suggest any other possible solution for this?
Here is a runnable JAR (current development state, requires Java 7) and the Sourceforge page.
Forget about neural networks. This is a highly technical and specialized field of artificial intelligence that is probably not suitable for your problem and requires solid expertise. Besides, there are a lot of simpler and better solutions for your problem.
First obvious solution: build a list of words and synonyms for all your sections and parse for these synonyms. You can then collect comments online for synonym analysis, or parse the comments/sections provided by your users to statistically detect relations between words, etc.
There is an infinite number of possible solutions, ranging from the simplest to the most overkill. Now you need to decide whether this feature of your system is critical (prefilling? probably not, then)... and what any development effort will bring you. One hour of work could bring you an 80%-satisfying feature, while aiming for 90% could cost a week of work. Is it really worth it?
Go for the simplest solution and tackle the real challenge of any dev project: delivering. Once your app is delivered, then you can always go back and improve as needed.
String myString = paramInput;
if (myString.contains("FUEL")) {
    // do the fuel functionality
}
In a simple app, if you will only have some specific sections, you can take the string from the comment and check whether it contains certain keywords, then set the value of Section accordingly (as in the snippet above).
If you have a lot of categories, I would use something like Apache Lucene, where you could index all the categories with their names and potential keywords/phrases that might appear in a user's description. Then you could simply run the description through Lucene and use the top-matched category as a "best guess".
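A sketch of how that could look, with the API roughly as in recent Lucene versions; exact classes and signatures vary between releases, so treat this as an outline rather than drop-in code:

import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Sketch: index one document of keywords per section, then use the
// comment as a query and take the top hit as the best guess.
public class SectionGuesser {
    public static String guess(Map<String, String> sectionKeywords, String comment)
            throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            for (Map.Entry<String, String> e : sectionKeywords.entrySet()) {
                Document doc = new Document();
                doc.add(new TextField("name", e.getKey(), Field.Store.YES));
                doc.add(new TextField("keywords", e.getValue(), Field.Store.NO));
                writer.addDocument(doc);
            }
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("keywords", analyzer);
            TopDocs hits = searcher.search(parser.parse(QueryParser.escape(comment)), 1);
            if (hits.scoreDocs.length == 0) return null; // no match at all
            return searcher.doc(hits.scoreDocs[0].doc).get("name");
        }
    }
}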
P.S. Neural network inputs and outputs will always be doubles or floats with values between 0 and 1. As for how to implement string matching with one, I wouldn't even know where to start.
It seems to me that the following will do:
hard word statistics
maybe a stemming class (English/Spanish) that reduces a word like "lunches" to "lunch"
a list of the most frequent non-words (the, at, a, for, ...)
The best fit is a linear problem, so it is a theoretical fit for a neural net, but why not go straight for the numerical best fit?
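A small sketch of how those pieces could fit together; the stop-word list and the stemming rules here are deliberately crude placeholders:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: stop-word filtering, very naive stemming, and plain word counts.
public class WordStats {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "at", "a", "for", "and", "or"));

    // Extremely crude stemming: "lunches" -> "lunch", "cars" -> "car".
    static String stem(String word) {
        if (word.endsWith("es")) return word.substring(0, word.length() - 2);
        if (word.endsWith("s")) return word.substring(0, word.length() - 1);
        return word;
    }

    // Counts the stemmed, non-stop-word tokens of a comment into the map.
    static void count(String comment, Map<String, Integer> counts) {
        for (String token : comment.toLowerCase().split("\\W+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            String s = stem(token);
            Integer c = counts.get(s);
            counts.put(s, c == null ? 1 : c + 1);
        }
    }
}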
A machine learning algorithm such as an artificial neural network doesn't seem like the best solution here. ANNs can be used for multi-class classification (i.e. 'to which of the provided pre-trained classes does the input belong?', not just 'does the input represent an X?'), which fits your use case. The problem is that they are supervised learning methods, and as such you need to provide a list of pairs of keywords and classes (Sections) that spans every possible input your users will provide. This is impossible, and in practice ANNs are re-trained when more data is available to produce better results and create a more accurate decision boundary / representation of the function that maps inputs to outputs. This also assumes that you know all possible classes before you start and that each of those classes has training input values that you provide.
The issue is that the input to your ANN (a list of characters or a numerical hash of the string) provides no context by which to classify. There's no higher level information provided that describes the word's meaning. This means that a different word that hashes to a numerically close value can be misclassified if there was insufficient training data.
(As maclema said, the output from an ANN will always be floats with each value representing proximity to a class - or a class with a level of uncertainty.)
A better solution would be to employ some kind of word-relation or synonym graph. A Bag of words model might be useful here.
Edit: In light of your comment that you don't know the Sections beforehand,
an easy solution to program would be to provide a list of keywords in a file that gets updated as people use the program. Simply storing a mapping of provided comments -> Sections, which you will already have in your database, would allow you to filter out non-keywords (and, or, the, ...). One option is then to find the list of Sections that each typed keyword belongs to, suggest multiple Sections, and let the user pick one; the feedback you get from user selections would improve future suggestions. Another would be to calculate a Bayesian probability - the probability that a word belongs to Section X given the previously stored mappings - for all keywords and Sections, and either take the modal Section or normalise over each unique keyword and take the mean. The probabilities will of course need to be updated as you gather more information; perhaps this could be done with each new addition in a background thread. A sketch of the Bayesian variant follows.
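A sketch of the Bayesian variant (smoothing and normalisation are left out for brevity; all names are illustrative):

import java.util.HashMap;
import java.util.Map;

// Sketch: estimate P(section | word) from the stored comment -> Section
// history and suggest the section with the highest summed probability.
public class SectionSuggester {
    // counts.get(word).get(section) = how often 'word' appeared in a
    // comment that the user filed under 'section'.
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    public void learn(String[] keywords, String section) {
        for (String w : keywords) {
            Map<String, Integer> perSection = counts.get(w);
            if (perSection == null) {
                perSection = new HashMap<>();
                counts.put(w, perSection);
            }
            Integer c = perSection.get(section);
            perSection.put(section, c == null ? 1 : c + 1);
        }
    }

    public String suggest(String[] keywords) {
        Map<String, Double> scores = new HashMap<>();
        for (String w : keywords) {
            Map<String, Integer> perSection = counts.get(w);
            if (perSection == null) continue; // unseen keyword
            int total = 0;
            for (int c : perSection.values()) total += c;
            for (Map.Entry<String, Integer> e : perSection.entrySet()) {
                // P(section | word) = count(word, section) / count(word)
                double p = (double) e.getValue() / total;
                Double s = scores.get(e.getKey());
                scores.put(e.getKey(), s == null ? p : s + p);
            }
        }
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best; // null if nothing matched
    }
}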