Workflow of Vowpal Wabbit - Java

I am having trouble understanding how to work with Vowpal Wabbit from my Java application (the specific tool doesn't really matter here; it could be any similar one).
There are several steps to working with this program:
1. prepare data
2. train a model
3. ?
4. profit
What will step 3 be?
I have found two ways of working with vowpal from my Java app.
One of them is creating a vowpal process with the necessary parameters, such as the path to a trained model. But there is a problem: I am not sure this approach behaves well in a concurrent environment. Of course, I could spawn a separate process for each thread, but that would not be acceptable.
Another way is running a vowpal daemon and connecting to it over a socket. I see problems here too. For instance, I have to open a socket connection and send some data to the daemon. Then I have to wait for a result, but I don't know when the result will be ready. Also, when I receive data from the daemon, I don't know which chunk of data is the last one. The result is just a string, and its format doesn't let me parse the output stream reliably.
Are there other ways of working with Vowpal Wabbit that are more productive and more reliable?
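Regarding the daemon approach: when vw is started in daemon mode (e.g. `vw --daemon --port 26542`), it answers line by line, one newline-terminated prediction per newline-terminated example sent, so responses can be matched to requests by counting lines. A minimal client sketch under that assumption (host and port are arbitrary choices):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical client for a vw daemon started with: vw --daemon --port 26542
public class VwDaemonClient {
    public static String predict(String host, int port, String exampleLine) throws Exception {
        try (Socket socket = new Socket(host, port);
             Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
            out.write(exampleLine + "\n");  // examples are newline-terminated
            out.flush();
            return in.readLine();           // one response line per example line
        }
    }
}
```

Because one line in maps to one line out, the "which chunk is the last" problem disappears: read exactly as many lines as you sent examples.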

Related

Communicate between 2 different java processes

I have 2 Java processes. Process1 is responsible for importing some external data into the database; Process2 runs the rest of the application using the same database, i.e. it hosts the web module and everything else. Process1 would normally import data once a day.
What I require is that when Process1 has finished its work, it should notify Process2 so that Process2 can perform some subsequent tasks. That is it; this will be the limit of their interaction with each other. No other data has to be shared later.
Now, I know I can do this in one of the following ways:
Have Process1 write an entry in the database when it has finished its execution, and have a daemon thread in Process2 look for that entry. Once this entry is read, complete the task in Process2. Even though this might be the easiest to implement in the existing ecosystem, I think having a thread poll the database just for one notification looks kind of ugly. However, it could be optimised by starting the thread only when the import job starts and killing it after the notification is received.
Use a socket. I have never worked with sockets before, so this might be an interesting learning curve. But after my initial reading I am afraid it might be overkill.
Use RMI
I would like to hear from people who have worked on similar problems: which approach did you choose, and why? I would also like to know what an appropriate solution for my problem would be.
Edit:
I went through this, but found that for someone new to interprocess communication it lacks basic examples. That is what I am looking for in this post.
I would say take a look at Chronicle-Queue
It uses a memory-mapped file and stores data off-heap (so no problem with GC). It also provides TCP replication for failover scenarios.
It scales pretty well and supports distributed processing when more than one machine is available.
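If the socket option ends up being chosen after all, very little code is needed for a one-shot notification. A minimal sketch in plain Java (the port and the message text are arbitrary choices):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ImportNotification {
    // Process2 side: block until Process1 connects and announces completion.
    public static String awaitNotification(ServerSocket server) throws Exception {
        try (Socket client = server.accept();
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(client.getInputStream(), StandardCharsets.UTF_8))) {
            return in.readLine();  // e.g. "IMPORT_DONE"
        }
    }

    // Process1 side: call once the import has finished.
    public static void notifyDone(String host, int port) throws Exception {
        try (Socket socket = new Socket(host, port);
             Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8)) {
            out.write("IMPORT_DONE\n");
            out.flush();
        }
    }
}
```

Since only one notification ever flows between the processes, this avoids both the database polling thread and the heavier RMI machinery.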

Java File Single Writer, Single reader

I'm looking for a very basic IPC mechanism between Java programs. I prefer not to use sockets, because my 'agent' spawns new JVMs, and setting up sockets in such an environment is a bit more complicated.
I was thinking about having two files per spawned JVM: in and out. On in, the agent sends commands to the worker, and on out, the worker sends responses back to the agent.
The big problem is that so far I haven't managed to get the communication up and running. Just creating an ObjectOutputStream/ObjectInputStream doesn't work out of the box, because the readObject method isn't blocking: it throws an EOFException when there is no content instead of blocking. Luckily that was easy to fix by adding a delay and retrying a bit later.
So I got my POC up and running, but eventually I ran into a stream corruption issue. Apparently, even in append-only mode, you can still run into corruption issues. So I started to look at FileLock, but now I'm running into a "main" java.lang.Error: java.io.IOException: Bad file descriptor.
So far the 'let's do the simple file thing' approach has been quite an undertaking, and I'm not sure I'm on the right path at all. I don't want to introduce a heavyweight solution like JMS, or even a less heavyweight one like sockets. Does anyone know something extremely simple that solves this particular problem? My preference is still for a file-based approach.
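One way to sidestep both the EOF problem and the stream corruption is to frame each message yourself: the writer appends a 4-byte length prefix followed by the payload, and the reader polls the file length until a complete frame is available. A minimal sketch of that idea (the polling interval and the string payloads are arbitrary; real commands could be serialized objects instead):

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class FileChannelIpc {
    // Writer side: append one length-prefixed message and flush it to disk.
    public static void send(String file, String message) throws Exception {
        byte[] payload = message.getBytes(StandardCharsets.UTF_8);
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(file, true))) {
            out.writeInt(payload.length);  // 4-byte length prefix
            out.write(payload);
            out.flush();
        }
    }

    // Reader side: poll from a given file offset until one complete frame is present.
    public static String receive(String file, long offset) throws Exception {
        while (true) {
            try (RandomAccessFile in = new RandomAccessFile(file, "r")) {
                if (in.length() >= offset + 4) {
                    in.seek(offset);
                    int len = in.readInt();
                    if (in.length() >= offset + 4 + len) {  // whole frame written?
                        byte[] payload = new byte[len];
                        in.readFully(payload);
                        return new String(payload, StandardCharsets.UTF_8);
                    }
                }
            }
            Thread.sleep(10);  // arbitrary polling interval
        }
    }
}
```

Because the reader only consumes a frame once the declared number of bytes is fully on disk, a partially written message is never mistaken for a corrupt stream, and no locking is needed for the single-writer/single-reader case.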

How can I have Java trigger C++ programs and vice versa when data is written to protocol buffers?

Long story short, I have a Java process that reads and writes data to/from a process. I have a C++ program that takes the data, processes it and then needs to pass it back to Java so that Java can write it to a database.
The Java program pulls its data from Hadoop, so once the Hadoop process kicks off it gets flooded with data, but the actual processing (done by the C++ program) cannot handle all the data at once. So I need a way to control the flow as well. Also, to complicate the problem (but simplify my work), I do the Java side and my friend does the C++ side, and we are trying to keep our programs as independent as possible.
That's the problem. I found Google Protocol Buffers, and it seems pretty cool for passing data between the programs, but I'm unsure how the Java program saving data can trigger the C++ program to process it, and then, when the C++ program saves the results, how the Java program will be triggered to save them (this is for one or a few records, but we plan to process billions of records).
What is the best approach to this problem? Is there a simple way of doing this?
The simplest approach may be to use a TCP socket connection: the Java program sends the work to be done, and the C++ program sends back the results.
Since you're going to want to scale this solution, I suggest using ZMQ.
Have your Java app still pull the data from Hadoop.
It will then push the data out using a PUSH socket.
Here you will have as many C++ workers as needed, processing this data and accepting connections on PULL sockets. This is scalable to as many processors/cores/etc. as you need.
When each worker is finished, it will push the results out on a PUSH socket to the 'storing' Java program, which accepts the info on a PULL socket.
It looks something like this example (the standard divide-and-conquer methodology).
This process is scalable to as many workers as necessary, as your first Java program will block (but will still be processing) when there aren't any available workers. So long as your final Java program is fast, you will see this scale really nicely.
The emitting and saving programs can even be combined into one; just use a zmq_poll device :)
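The PUSH/PULL pipeline above can be sketched on the Java side with the JeroMQ binding (the endpoint names and message format are arbitrary; a stand-in worker thread plays the role of the C++ process here so the sketch is self-contained, and in production the `inproc://` endpoints would be `tcp://` ones):

```java
import org.zeromq.SocketType;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;

public class Pipeline {
    public static String runOnce(String work) {
        try (ZContext context = new ZContext()) {
            // Ventilator: the Java side that pulls from Hadoop and fans work out.
            ZMQ.Socket sender = context.createSocket(SocketType.PUSH);
            sender.bind("inproc://work");      // in production: tcp://*:5557

            // Sink: the Java side that collects results from the workers.
            ZMQ.Socket sink = context.createSocket(SocketType.PULL);
            sink.bind("inproc://results");     // in production: tcp://*:5558

            // Stand-in for a C++ worker: PULL one unit of work, PUSH one result.
            Thread worker = new Thread(() -> {
                ZMQ.Socket in = context.createSocket(SocketType.PULL);
                in.connect("inproc://work");
                ZMQ.Socket out = context.createSocket(SocketType.PUSH);
                out.connect("inproc://results");
                out.send("processed:" + in.recvStr());
            });
            worker.start();

            sender.send(work);          // PUSH blocks when no worker is available,
            return sink.recvStr();      // which gives you the flow control for free
        }
    }
}
```

The flow control the question asks about comes from PUSH semantics: when all workers are busy, the sender blocks, so Hadoop's flood is throttled to what the C++ side can absorb.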

J2EE - implementing constantly running component/daemon

I am designing a server application that is supposed to crunch a lot of data continuously and present results on demand via a web interface.
The operating scheme goes roughly like this:
An electronic sensor array constantly spills data into a ramdisk through USB
A "flusher" application processes the data as fast as it can and loads it into the db (staging area)
Using triggers, the db performs calculations on the data and stores the results in another schema (data area)
A client webapp can display the processed data in graphs/reports etc. on demand
The solution would ideally look like this:
Database server - PostgreSQL
Have an administration web interface that can monitor the flusher (e.g. records processed per hour) and, if the flusher is implemented as a separate daemon, control it.
Flusher and Client applications written in Java, ideally using J2EE
Now, the problem that keeps bugging me, and whose answer I cannot find, is how to go about writing the flusher component, i.e. a process that constantly runs in the background, in J2EE.
By scouring the web, basically three possibilities emerged:
a) Write the flusher as a message-driven bean and control it from the master application using JMS. However: I don't like the idea of having an MDB running constantly, and I'm not even sure that is possible.
b) Write the flusher as an EJB and control it using the Timer/Scheduling service. However: the events are not really timed; it just needs to run in an infinite loop until told to stop, so this seems like the wrong use of the technology.
c) Write the flusher as a separate Java application, run it as an OS service (Linux or Windows), and control it with startup scripts through a ProcessBuilder invoked from an EJB. To monitor its status, use JMS. However: this just seems an overly complicated solution, platform-dependent and maybe even unreliable; and since an EJB should not spawn/manage its own threads, which ProcessBuilder basically does, it just seems wrong.
Basically, none of these look right to me, and I cannot figure out what the right solution would be in the Java/J2EE world.
Thank you
Thomas
I would write the "Flusher" app as a stand alone Java process. Perhaps use something like Java Service Wrapper to turn it into a service for your OS. I'm not very familiar with the options for interfacing with a RAM disk via Java, but you're either going to end up with an InputStream which you can keep open for the life of the process and continually read from, or you're going to continually poll from inside a while loop. It's perfectly ok to do something like the following:
private volatile boolean stopFlag;
...
while (!stopFlag) {
    processNextInput();
}
Then you would have some other mechanism in another thread that could set stopFlag to true when you wanted to terminate the process.
As for monitoring the flusher, JMX seems like a good solution; that's exactly what it was intended for. You would create an MBean that exposes whatever status or statistics you want, and other processes could then connect to that MBean and query it for that data.
The "Client" app would then be a simple servlet application that does reporting on your database and provides a pretty front end for the MBean from your flusher. Alternatively, you could just monitor the flusher using a JMX console and not involve the client with that piece of the system at all.
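A minimal sketch of such an MBean, registered on the platform MBean server (the bean name, the `RecordsProcessed` attribute and the ObjectName are all invented for illustration):

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

public class FlusherMonitoring {
    // Management interface exposed to JMX clients.
    public interface FlusherStatsMBean {
        long getRecordsProcessed();
    }

    public static class FlusherStats implements FlusherStatsMBean {
        private volatile long recordsProcessed;
        public long getRecordsProcessed() { return recordsProcessed; }
        public void recordProcessed() { recordsProcessed++; }  // called by the flusher loop
    }

    public static FlusherStats register() throws Exception {
        FlusherStats stats = new FlusherStats();
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // StandardMBean avoids the strict "XMBean" interface-naming convention;
        // the domain/name string is an arbitrary choice.
        server.registerMBean(new StandardMBean(stats, FlusherStatsMBean.class),
                new ObjectName("flusher:type=FlusherStats"));
        return stats;
    }
}
```

Once registered, a JMX console (or the admin webapp) can read `RecordsProcessed` remotely without any extra plumbing in the flusher itself.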
I don't think EJBs really make sense for this system. I'm somewhat biased against EJBs, so take my advice with a grain of salt, but to me I don't really see a need for them in this application.

JDK 6: Is there a way to run a new java process that executes the main method of a specified class

I'm trying to develop an application that, just before quitting, has to start a new daemon process that executes the main method of a class.
I require that after the main application quits, the daemon process must still be running.
It is a Java stored procedure running on an Oracle DB, so I can't use Runtime.exec, because I can't locate the Java class from the operating system shell: it is defined in database structures rather than in file-system files.
In particular, the desired behaviour during a remote database session should be that I can
call a first Java method that starts the daemon process and then quits, leaving the daemon process running,
and then (with the daemon process up and session control back, because the last call terminated)
call a method that communicates with the daemon process (which finally quits at the end of the communication).
Is this possible?
Thanks
Update
My exact need is to create and load (with the best possible performance) a big text file into the database, given that the host doesn't have file transfer services, from a Java JDK 6 client application connecting to an Oracle 11gR1 DB using the JDBC-11g OCI driver.
I have already developed a working solution by calling a procedure that stores the LOB (large object) given as input into a file, but that method uses too many intermediate structures, which I want to avoid.
So I thought about creating a ServerSocket on the DB side with a first call, and later connecting to it to establish a direct and fast data transfer.
The problem is that the Java procedure that creates the ServerSocket can't quit while leaving a Thread/Process listening on that socket, and the client, to be sure that the ServerSocket has been created, can't run a separate thread to handle the rest of the job.
Hope to be clear
I'd be surprised if this was possible. In effect you'd be able to saturate the DB Server machine with an indefinite number of daemon processes.
If such a thing is possible the technique is likely to be Oracle-specific.
Perhaps you could achieve your desired effect using database triggers, or other such event driven Database capabilities.
I'd recommend explaining the exact problem you are trying to solve: why do you need a daemon? My instinct is that trying to manage your daemon's lifecycle is going to get horribly complex. You may well need to deal with problems such as preventing two instances from being launched, unexpected termination of the daemon, and taking the daemon down when maintenance is needed. This sort of stuff can get really messy.
If, for example, you want to run some Java code every hour, then almost certainly there are simpler ways to achieve that effect. Operating systems and databases tend to have nice methods for initiating work at desired times, so having a stored procedure called when you need it is probably a capability already present in your environment. Hence all you need to do is put your desired code in the stored procedure; there is no need to hand-craft the scheduling, initiation and management. One quite significant advantage of this approach is that you end up using a technique that other folks in your environment already understand.
Writing the kind of code you're considering is very interesting and great fun, but in commercial environments it is often a waste of effort.
Make another jar for your other main class, and from your main application launch that jar using the Runtime.getRuntime().exec() method, which runs an external program (another JVM) executing your other main class.
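A sketch of that idea using ProcessBuilder, the more convenient wrapper around Runtime.exec (the classpath `daemon.jar` and class `daemon.Main` in the comment are hypothetical placeholders; the caller supplies the real ones):

```java
import java.io.File;

public class DaemonLauncher {
    // Start a second JVM running a given main class; the child process keeps
    // running after this method returns, e.g. launchJvm("daemon.jar", "daemon.Main").
    public static Process launchJvm(String classpath, String mainClass) throws Exception {
        String javaBin = System.getProperty("java.home") + File.separator + "bin"
                + File.separator + "java";  // reuse the current JVM's java binary
        ProcessBuilder pb = new ProcessBuilder(javaBin, "-cp", classpath, mainClass);
        pb.redirectErrorStream(true);       // merge stderr into stdout for easier logging
        return pb.start();                  // non-blocking; the child runs independently
    }
}
```

Note, however, that whether this works inside an Oracle Java stored procedure depends on the database's permission settings, as discussed below.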
The way you start subprocesses in Java is Runtime.exec() (or its more convenient wrapper, ProcessBuilder). If that doesn't work, you're out of luck, unless you can use native code to implement equivalent functionality (ask another question here to learn how to start subprocesses at the C++ level), but that would be at least as error-prone as using the standard methods.
I'd be startled if an application server like Oracle allowed you access either to the functionality of starting subprocesses or to that of loading native code; both can cause tremendous mischief, so untrusted code is barred from them. Looking over your edit, your best approach is going to be to rethink how you tackle the real problem, e.g. by using NIO to manage the sockets more efficiently (and try not to create extra files on disk; you'll just have to add elaborate code to clean them up…).