Learning Algorithm and Saving Data in Software - java

I'm coming from a web-development background and I am wondering how I would make a learning algorithm in Java/C++. Not so much the algorithm part, but making the program "remember" what it learned from the previous day. I would think something like saving a file, but I suspect there might be an easier way. Apologies if this question is just over the top stupid. Thanks.

I think that would depend a bit on the problem domain. You might want to store learned "facts" or "relationships" in a DB so that they can be easily searched. If you are training a neural network, then you'd probably just dump the network state to a file. In general, I think once you have a mechanism that does the learning, the appropriate storage representation will be relatively apparent.
Maybe if you can flesh out your plan on what kind of learning you'd like to implement, people can provide more guidance on what the implementation should look like, including the state storage.

Not stupid, but a little ill-formed maybe.
What you're going to do as your program "learns" something is update the state of some data structure. If you want to retain that state, you need to persist the data structure to some external store. That means translating the data structure to some external format that you can read back in without loss.
Java provides a straightforward way to do this via the Serializable interface: you serialize the data by sending Serializable objects out through an ObjectOutputStream, and an ObjectInputStream will reload them later.
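For concreteness, here's a minimal sketch of that round trip. The Memory class, its contents, and the file name are illustrative choices for the example, not anything prescribed:

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;

// A data structure the program updates as it "learns", persisted between runs.
public class Memory implements Serializable {
    private static final long serialVersionUID = 1L;
    private final Map<String, Double> facts = new HashMap<>();

    public void learn(String fact, double weight) {
        facts.put(fact, weight);
    }

    // Write the whole object graph out through an ObjectOutputStream.
    public void save(File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(this);
        }
    }

    // Read it back the next day with an ObjectInputStream.
    public static Memory load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (Memory) in.readObject();
        }
    }
}
```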

If you want to access and save large amounts of data maybe a database would work well. This would allow you to structure the data and look it up in an easier manner. I'm not too well versed on the subject but I think for remembering and recalling things a database would be vastly superior to a file system.

A robust/flexible solution in Java (C++ too, but I wouldn't know how) would be using a database. Java 1.6 comes with the Apache Derby database, which can be embedded in your apps. Using JPA (the Java Persistence API) makes it easy to interface with any database you can find drivers for.
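As a rough sketch of what the JPA route can look like (the Fact entity and the "learner" persistence-unit name are assumptions for this example; the persistence unit would be defined in META-INF/persistence.xml, pointing at an embedded database such as jdbc:derby:learnerdb;create=true):

```java
import javax.persistence.*;

// A hypothetical entity; JPA maps it to a database table for you.
@Entity
class Fact {
    @Id
    @GeneratedValue
    Long id;
    String name;
    double weight;
}

public class LearnerStore {
    public static void main(String[] args) {
        // "learner" is the assumed persistence-unit name from persistence.xml.
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("learner");
        EntityManager em = emf.createEntityManager();

        em.getTransaction().begin();
        Fact fact = new Fact();
        fact.name = "sky-is-blue";
        fact.weight = 0.9;
        em.persist(fact); // the INSERT is handled by the JPA provider
        em.getTransaction().commit();

        em.close();
        emf.close();
    }
}
```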

You should look into neural network software development. Here's a collection of nice neural network libraries for different languages. I am not sure if this is the easiest way, but once accomplished it would be very handy.

Related

Online java applet that reads and writes from/to a mySQL database?

I'm considering creating a little chatbot that learns from users, similar to Clever-Bot (but very different in the way it learns), but I need a way to interface whatever language I use with MySQL.
I was thinking that Java would be the smarter option, though after hunting around it seems a bit difficult to integrate with a server. Is this true? If not, how would I go about doing that?
Otherwise, would it be smarter to use JavaScript/jQuery/PHP? I'm quite good with these and haven't got a whole lot of experience with Java (but it would be good practice).
Thoughts?
For creating chat bots, consider AIML, and an interpreter that will help you use your AIML files. Going for a backend database isn't a good thing, I would say, since AIML is built for artificial-intelligence work, a chat bot in your case. Most of the famous chat bots on the Internet use AIML; A.L.I.C.E. is one example.

How to organize data in Java

I was wondering what the easiest and smartest way is to organize data for easy retrieval and manipulation. I am creating a program in Java that will keep track of employee information, such as names, numbers, addresses, phone numbers, etc. The obvious solution would be to save the information in a text file, but that doesn't seem very smart or elegant. I was looking at databases, but they seem like overkill, since this information will only be accessed by one person at a time.
Databases are not overkill. They're great for organizing data even for embedded systems. (Take a look at SQLite, for example.) Other possibilities, depending on how much data you're talking about, are XML files (for which there are several APIs you could use) and Properties persisted to files.
Answering this kind of question is always like trying to answer, "what kind of car should I buy?" without more information on your needs.
Are you building a desktop application? Web application? Terminal app? What kind of information will you be storing? How much? Etc.
Databases aren't anything to fear, to be sure; they can offer structured data storage in an easily retrievable and reliable format. You can also look into embedded databases like HSQL or SQLite, or non-relational models like MongoDB (best if you have highly unstructured data).
But I would suggest you share a bit more information on your needs first, then we might be able to tell you if you need the SUV or the sedan.
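To give a sense of how little ceremony an embedded database actually needs, here is a minimal JDBC sketch. It assumes HSQLDB on the classpath; the file name and schema are purely illustrative, and SQLite or Derby would look much the same with a different connection URL:

```java
import java.sql.*;

public class EmployeeStore {
    public static void main(String[] args) throws SQLException {
        // "jdbc:hsqldb:file:employeedb" stores the database in local files;
        // no server process is needed.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hsqldb:file:employeedb", "SA", "")) {

            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS employee ("
                        + "id INT PRIMARY KEY, name VARCHAR(100), phone VARCHAR(20))");
            }

            // Parameterized insert keeps the data and the SQL separate.
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO employee VALUES (?, ?, ?)")) {
                ps.setInt(1, 1);
                ps.setString(2, "Jane Doe");
                ps.setString(3, "555-0100");
                ps.executeUpdate();
            }

            // Structured lookup, which a flat text file can't give you.
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT name, phone FROM employee")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + ": " + rs.getString("phone"));
                }
            }
        }
    }
}
```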

Is there an SQL alternative like Python's Shelve for Android/Java?

Shelve is an ultra-simple No-SQL persistence layer which allows you to trivially persist a mapping of objects. It's a commonly used package in Python because it allows you to trivially add persistence to any application.
Its simplistic nature means it's somewhat limited, but it's surprisingly useful. You can map any arbitrary hashable key onto any serializable object.
Does something like this exist for Android? I'm writing a very simple app, and I've noticed that I'm spending a lot of time faffing around with table structures, select & insert statements. That's the sort of thing I almost never do in Python since I'd usually have some kind of NoSQL alternative.
I'm not expecting it to work exactly the same way - clearly Python and Java are languages with very different characteristics. I just want something that's nearly as simple to use and requires less manual SQL faffing.
One more thing - this is a fairly trivial app. I'd prefer to introduce the bare minimum of additional project dependencies. Preference will be given to solutions which require nothing more than the Android APIs.
You said preference to Java API answers so I probably won't get preference, but Couchbase Mobile is the best Android No-SQL I have ever come across.
http://www.couchbase.org/get/couchbase-mobile-for-android/current
You can use SharedPreferences, or the newer Jetpack DataStore. It's pretty much what you need; you don't need full SQLite power for this.
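A minimal sketch of the SharedPreferences route; the wrapper class and the preferences-file name "app_shelf" are made up for the example:

```java
import android.content.Context;
import android.content.SharedPreferences;

// A tiny shelve-like key/value store backed by SharedPreferences.
public class Shelf {
    private final SharedPreferences prefs;

    public Shelf(Context context) {
        prefs = context.getSharedPreferences("app_shelf", Context.MODE_PRIVATE);
    }

    public void put(String key, String value) {
        prefs.edit().putString(key, value).apply(); // persisted asynchronously
    }

    public String get(String key) {
        return prefs.getString(key, null); // null if the key was never stored
    }
}
```

For anything beyond flat strings and primitives you would still need to serialize (e.g. to JSON) before storing, which is where this stops being shelve-like.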

Android object handling / persistence

I am pretty far into my first Android application, and I have the sneaking suspicion that I'm "Doing It Wrong". My app talks to a Ruby on Rails server and serializes objects back and forth via XML. Before I knew what was happening, I found myself knee deep in writing my own crappy ORM, a problem which is compounded by the fact that I haven't written any Java since high school.
My conflict here is that I want my client-side (android) app to be capable of serializing via a variety of methods, such as HTTP/XML, to a local database, or out to the local filesystem. I started out with the Strategy pattern, but I feel like my solution is badly lacking.
For one, should I re-implement all of Rails model validation on the client side, because I don't know if I'm always going to be working with Rails on the other side? The even bigger issue is that right now I can only represent flat objects as key-values, as my code can't handle nested objects like a true ORM.
I'm sure Android devs deal with this all the time, so I'm interested to hear what other people do to cope with these issues.
I wouldn't approach your Android application as an extension of a Ruby app - rather a consumer of an API. If you can, try to expose your server application as JSON (or another format, but JSON is the most lightweight) and consume these APIs from your Android application; you would most likely have fewer problems, as JSON is already in key/value format.
I have not written Android objects to SQLite yet, but I have written them as both Parcelable objects and to the SharedPreferences. Both of these strategies are sufficient for small to mid-range apps. For data-intensive apps, obviously you will have to take it a step further to SQLite.
There are some great articles for these approaches: Managing State.
It boils down to designing your objects in a way that can be serialized easily. That means no circular references or extremely complex objects. This shouldn't be a very large problem, especially if your data is in JSON format already. You simply need to extend some classes and add functions that return a Parcelable or string representation, so your objects can be saved that way (see the sketch after this answer).
I would avoid cloning your server-side objects and validation in Android as it then requires modifying both sources if you make small changes. The server should handle all data and validation and you should simply be requesting, caching, and sending data from Android.
I'd be interested to hear if there are any challenges to writing objects to SQLite, but I imagine it's not that much more of a step from the details I've outlined above. Hope this helps in some capacity!
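To illustrate the Parcelable approach mentioned above, a minimal sketch (the Employee class and its fields are invented for the example):

```java
import android.os.Parcel;
import android.os.Parcelable;

public class Employee implements Parcelable {
    private final String name;
    private final int id;

    public Employee(String name, int id) {
        this.name = name;
        this.id = id;
    }

    // Rebuild the object from a Parcel, reading fields in the order written.
    private Employee(Parcel in) {
        name = in.readString();
        id = in.readInt();
    }

    @Override
    public void writeToParcel(Parcel dest, int flags) {
        dest.writeString(name);
        dest.writeInt(id);
    }

    @Override
    public int describeContents() {
        return 0;
    }

    // Required by the framework to re-inflate Employees from Parcels.
    public static final Creator<Employee> CREATOR = new Creator<Employee>() {
        @Override
        public Employee createFromParcel(Parcel in) {
            return new Employee(in);
        }

        @Override
        public Employee[] newArray(int size) {
            return new Employee[size];
        }
    };
}
```

Nested objects can be handled by making them Parcelable as well and using writeParcelable/readParcelable for those fields.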
Hessian is great for RPC. You don't have to do any serialization yourself. It doesn't use XML, so it's more efficient and more appropriate for a mobile platform.
I haven't done much persistence storage on Android, but I think you need to use SQLiteDatabase and make your own Cursor that serializes and deserializes your objects so they can be added to the database. A possible solution would be to extend SQLiteCursor or AbstractCursor.
Otherwise I don't think there is another solution apart from, possibly, "hardcore" Serializable (which I suspect may be too much for a phone, though I may be wrong).
I think you might be going too heavy for a smart phone application. I would look at using RESTful style web services with JSON content.
Looking at your question, I get the feeling that you may just be over-complicating your requirements. Why can't you just use the JSON format to represent your object data in a portable way? Then you will be able to store it either on the file system or in a simple text field in a database. You can leverage the android-active-record library for transparent DB persistence (http://code.google.com/p/android-active-record)

Python, PyTables, Java - tying all together

Question in nutshell
What is the best way to get Python and Java to play nice with each other?
More detailed explanation
I have a somewhat complicated situation. I'll try my best to explain both in pictures and words. Here's the current system architecture:
We have an agent-based modeling simulation written in Java. It has options of either writing locally to CSV files, or remotely via a connection to a Java server to an HDF5 file. Each simulation run spits out over a gigabyte of data, and we run the simulation dozens of times. We need to be able to aggregate over multiple runs of the same scenario (with different random seeds) in order to see some trends (e.g. min, max, median, mean). As you can imagine, trying to move around all these CSV files is a nightmare; there are multiple files produced per run, and like I said some of them are enormous. That's the reason we've been trying to move towards an HDF5 solution, where all the data for a study is stored in one place, rather than scattered across dozens of plain text files. Furthermore, since it is a binary file format, we should get significant space savings as compared to uncompressed CSVs.
As the diagram shows, the current post-processing we do of the raw output data from simulation also takes place in Java, and reads in the CSV files produced by local output. This post-processing module uses JFreeChart to create some charts and graphs related to the simulation.
The Problem
As I alluded to earlier, the CSVs are really untenable and are not scaling well as we generate more and more data from simulation. Furthermore, the post-processing code is doing more than it should have to do, essentially performing the work of a very, very poor man's relational database: making joins across 'tables' (CSV files) based on foreign keys (the unique agent IDs). It is also difficult in this system to visualize the data in other ways (e.g. Prefuse, Processing, JMonkeyEngine, or getting some subset of the raw data to play with in MATLAB or SPSS).
Solution?
My group decided we really need a way of filtering and querying the data we have, as well as performing cross-table joins. Given this is a write-once, read-many situation, we really don't need the overhead of a real relational database; instead we just need some way to put a nicer front end on the HDF5 files. I found a few papers about this, such as one describing how to use [XQuery as the query language on HDF5 files][3], but the paper describes having to write a compiler to convert from XQuery/XPath into the native HDF5 calls, which is way beyond our needs.
Enter [PyTables][4]. It seems to do exactly what we need: it provides two different ways of querying data, either through Python list comprehensions or through [in-kernel (C level) searches][5].
The proposed architecture I envision is this:
What I'm not really sure how to do is to link together the python code that will be written for querying, with the Java code that serves up the HDF5 files, and the Java code that does the post processing of the data. Obviously I will want to rewrite much of the post-processing code that is implicitly doing queries and instead let the excellent PyTables do this much more elegantly.
Java/Python options
A simple Google search turns up a few options for [communicating between Java and Python][7], but I am so new to the topic that I'm looking for some actual expertise and criticism of the proposed architecture. It seems like the Python process should be running on the same machine as the Datahose so that the large .h5 files do not have to be transferred over the network; rather, the much smaller, filtered views of them would be transmitted to the clients. [Pyro][8] seems to be an interesting choice - does anyone have experience with that?
This is an epic question, and there are lots of considerations. Since you didn't mention any specific performance or architectural constraints, I'll try and offer the best well-rounded suggestions.
The initial plan of using PyTables as an intermediary layer between your other elements and the datafiles seems solid. However, one design constraint that wasn't mentioned is one of the most critical of all data processing: which of these data processing tasks can be done in batch style, and which are more of a live stream?
This differentiation between "we know exactly our input and output and can just do the processing" (batch) and "we know our input and what needs to be available for something else to ask" (live) makes all the difference to an architectural question. Looking at your diagram, there are several relationships that imply the different processing styles.
Additionally, on your diagram you have components of different types all using the same symbols. It makes it a little bit difficult to analyze the expected performance and efficiency.
Another constraint that's significant is your IT infrastructure. Do you have high-speed networked storage available? If you do, intermediary files become a brilliant, simple, and fast way of sharing data between the elements of your infrastructure for all batch processing needs. You mentioned running your PyTables-using application on the same server that's running the Java simulation. However, that means that server will experience load for both writing and reading the data. (That is to say, the simulation environment could be affected by the needs of unrelated software when they query the data.)
To answer your questions directly:
PyTables looks like a nice match.
There are many ways for Python and Java to communicate, but consider a language-agnostic communication method so these components can be changed later if necessary. This can be as simple as finding libraries that support both Java and Python and trying them; the API you choose to implement with whatever library should be the same anyway. (XML-RPC would be fine for prototyping, as it's in Python's standard library; Google's Protocol Buffers or Facebook's Thrift make good production choices.) But don't underestimate how great and simple just "writing things to intermediary files" can be if the data is predictable and batchable.
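As a taste of the XML-RPC option, here is a rough Java client sketch using the Apache XML-RPC library. The server URL, the method name "query.filter", and its arguments are assumptions for illustration, with the matching server imagined as a few lines of Python's standard-library SimpleXMLRPCServer:

```java
import java.net.URL;
import org.apache.xmlrpc.client.XmlRpcClient;
import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;

public class QueryClient {
    public static void main(String[] args) throws Exception {
        // Point the client at the (hypothetical) Python query service.
        XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
        config.setServerURL(new URL("http://localhost:8000/RPC2"));

        XmlRpcClient client = new XmlRpcClient();
        client.setConfig(config);

        // Ask the Python side for a filtered view of one run's data.
        Object result = client.execute("query.filter",
                new Object[]{"run42.h5", "median"});
        System.out.println(result);
    }
}
```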
To help with the design process more and flesh out your needs:
It's easy to look at a small piece of the puzzle, make some reasonable assumptions, and jump into solution evaluation. But it's even better to look at the problem holistically with a clear understanding of your constraints. May I suggest this process:
Create two diagrams of your current architecture, physical and logical.
On the physical diagram, create boxes for each physical server and diagram the physical connections between each.
Be certain to label the resources available to each server and the type and resources available to each connection.
Include physical hardware that isn't involved in your current setup if it might be useful. (If you have a SAN available, but aren't using it, include it in case the solution might want to.)
On the logical diagram, create boxes for every application that is running in your current architecture.
Include relevant libraries as boxes inside the application boxes. (This is important, because your future solution diagram currently has PyTables as a box, but it's just a library and can't do anything on its own.)
Draw on-disk resources (like the HDF5 and CSV files) as cylinders.
Connect the applications with arrows to other applications and resources as necessary. Always draw the arrow from the "actor" to the "target". So if an app writes an HDF5 file, the arrow goes from the app to the file. If an app reads a CSV file, the arrow goes from the app to the file.
Every arrow must be labeled with the communication mechanism. Unlabeled arrows show a relationship, but they don't show what relationship and so they won't help you make decisions or communicate constraints.
Once you've got these diagrams done, make a few copies of them, and then right on top of them start to do data-flow doodles. With a copy of the diagram for each "end point" application that needs your original data, start at the simulation and end at the end point with a pretty much solid flowing arrow. Any time your data arrow flows across a communication/protocol arrow, make notes of how the data changes (if any).
At this point, if you and your team all agree on what's on paper, then you've explained your current architecture in a manner that should be easily communicable to anyone. (Not just helpers here on stackoverflow, but also to bosses and project managers and other purse holders.)
To start planning your solution, look at your dataflow diagrams and work your way backwards from endpoint to startpoint and create a nested list that contains every app and intermediary format on the way back to the start. Then, list requirements for every application. Be sure to feature:
What data formats or methods can this application use to communicate?
What data does it actually want? (Is this always the same, or does it change on a whim depending on other requirements?)
How often does it need it?
Approximately what resources does the application need?
What does the application do now that it doesn't do well?
What could this application do that would help, but that it isn't doing?
If you do a good job with this list, you can see how this will help define what protocols and solutions you choose. You look at the situations where the data crosses a communication line, and you compare the requirements list for both sides of the communication.
You've already described one particular situation: you have quite a bit of Java post-processing code that is doing "joins" on tables of data in CSV files; that's a "does now but doesn't do well". So you look at the other side of that communication to see if the other side can do that thing well. At this point, the other side is the CSV file and, before that, the simulation, so no, there's nothing that can do that better in the current architecture.
So you've proposed a new Python application that uses the PyTables library to make that process better. Sounds good so far! But in your next diagram, you added a bunch of other things that talk to "PyTables". Now we've extended past the understanding of the group here at StackOverflow, because we don't know the requirements of those other applications. But if you make the requirements list like mentioned above, you'll know exactly what to consider. Maybe your Python application using PyTables to provide querying on the HDF5 files can support all of these applications. Maybe it will only support one or two of them. Maybe it will provide live querying to the post-processor, but periodically write intermediary files for the other applications. We can't tell, but with planning, you can.
Some final guidelines:
Keep things simple! The enemy here is complexity. The more complex your solution, the more difficult it is to implement and the more likely it is to fail. Use the smallest number of operations, and use the least complex operations. Sometimes just one application to handle the queries for all the other parts of your architecture is the simplest. Sometimes an application to handle "live" queries and a separate application to handle "batch requests" is better.
Keep things simple! It's a big deal! Don't write anything that can already be done for you. (This is why intermediary files can be so great, the OS handles all the difficult parts.) Also, you mention that a relational database is too much overhead, but consider that a relational database also comes with a very expressive and well-known query language, the network communication protocol that goes with it, and you don't have to develop anything to use it! Whatever solution you come up with has to be better than using the off-the-shelf solution that's going to work, for certain, very well, or it's not the best solution.
Refer to your physical layer documentation frequently so you understand the resource use of your considerations. A slow network link or putting too much on one server can both rule out otherwise good solutions.
Save those docs. Whatever you decide, the documentation you generated in the process is valuable. Wiki them or file them away so you can whip them out again when the topic comes up.
And the answer to the direct question, "How do I get Python and Java to play nice together?" is simply "use a language-agnostic communication method." The truth of the matter is that Python and Java are both not important to the problem set you describe. What's important is the data that's flowing through it. Anything that can easily and effectively share data is going to be just fine.
Do not make this more complex than it needs to be.
Your Java process can -- simply -- spawn a separate subprocess to run your PyTables queries. Let the Operating System do what OS's do best.
Your Java application can simply fork a process which has the necessary parameters as command-line options. Then your Java can move on to the next thing while Python runs in the background.
This has HUGE advantages in terms of concurrent performance. Your Python "backend" runs concurrently with your Java simulation "front end".
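A minimal sketch of that pattern from the Java side; the script name "query.py" and its arguments are invented for the example:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class QueryRunner {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Spawn the (hypothetical) PyTables query script as a subprocess,
        // passing parameters as command-line options.
        ProcessBuilder pb = new ProcessBuilder("python", "query.py", "--study", "run42.h5");
        pb.redirectErrorStream(true); // merge stderr into stdout
        Process process = pb.start();

        // Read the filtered results the Python side prints to stdout.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }

        int exitCode = process.waitFor();
        System.out.println("query.py exited with " + exitCode);
    }
}
```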
You could try Jython, a Python interpreter for the JVM which can import Java classes.
Jython project homepage
Unfortunately, that's all I know on the subject.
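For completeness, a minimal sketch of embedding Jython from Java, assuming the standalone Jython jar is on the classpath. One caveat worth knowing: Jython runs pure-Python code only, so C-backed libraries like PyTables won't import under it, which matters for this particular question.

```java
import org.python.core.PyObject;
import org.python.util.PythonInterpreter;

public class JythonDemo {
    public static void main(String[] args) {
        // Run Python source inside the JVM and pull results back into Java.
        PythonInterpreter interp = new PythonInterpreter();
        interp.exec("result = sum(range(10))");
        PyObject result = interp.get("result");
        System.out.println("Python computed: " + result);
    }
}
```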
Not sure if this is good etiquette; I couldn't fit all my comments into a normal comment, and the post has had no activity for 8 months.
Just wanted to see how this was going for you? We have a very very very similar situation where I work - only the simulation is written in C and the storage format is binary files. Every time a boss wants a different summary we have to make/modify handwritten code to do summaries. Our binary files are about 10 GB in size and there is one of these for every year of the simulation, so as you can imagine, things get hairy when we want to run it with different seeds and such.
I've just discovered pyTables and had a similar idea to yours. I was hoping to change our storage format to hdf5 and then run our summary reports/queries using pytables. Part of this involves joining tables from each year. Have you had much luck doing these types of "joins" using pytables?
