I have a small question that might be answered by a more experienced programmer than me.
I have to plot two streams of data in real time. In the end I need to build 5 different plots, because each plot has to show a different unit of measurement.
The first stream contains 4 values and one timestamp; the second stream has one value and one timestamp.
So my idea is to run two threads that handle the two streams, which arrive at different intervals (every 6 s for the first and every 1 s for the second). Both threads should be able to receive the data and update the graphs.
My question is: which language do you think would be better between Python and Java, to implement this?
My fear is that using Java will result in a very slow UI.
It would be nice to have an answer supported by some considerations.
Thank you.
My goal is to generate a predictive model using TensorFlow in Java, but I first want to make sure that my goal is achievable. Firstly, if I have a bunch of parameters and each set of parameters is assigned an output, is it possible to train a model to predict an output given similar parameters? I am able to get hundreds of thousands of samples (if needed) in order to train it, so is this possible?
Secondly, after the model is trained, how fast can it actually generate results?
Lastly, assuming everything up to this point checks out, what is the best method in TensorFlow for Java to train a model with data that has multiple parameters associated with an outcome? Also, if a given piece of data satisfies two results, can both be returned as options ordered from most likely to least likely?
Also, just to clarify, I am not asking someone to make this for me; I am just trying to make sure that a solution exists and is quick (if it’s slow I could just go back to brute forcing, which I am trying to move away from since it is kind of slow and resource intensive). Also, if you have any pointers on getting started tackling this, I would greatly appreciate it!
Your question is very, very general, but I'll try to offer some insight:
Firstly, if I have a bunch of parameters and each set of parameters is assigned an output is it possible to train a model to predict an output given similar parameters?
Taking a set of parameters (known as the feature set X) and making predictions of another set of parameters (known as the output set Y) is the primary purpose of machine learning. Doing this involves many steps, and doing it well takes a lot of experience. However, if you are asking whether it is possible in principle, that depends on the specific feature set X and output set Y.
I am able to get hundreds of thousands of samples (if needed) in order to train it, so is this possible?
The trick to machine learning is that the data must be of sufficient quantity and quality. Judging whether it is takes domain-specific knowledge.
Are you able to provide any specifics about your data to help us understand?
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I am working on a project where I was provided a Java matrix-multiplication program that can run in a distributed system. It is run like so:
usage: java Coordinator maxtrix-dim number-nodes coordinator-port-num
For example:
java blockMatrixMultiplication.Coordinator 25 25 54545
Here's a snapshot of what the output looks like:
I want to extend this code with some kind of failsafe ability, and am curious about how I would create checkpoints within a running matrix multiplication calculation. The general idea is to recover to where it was in a computation (but it doesn't need to be so fine grained; recovering to the beginning, i.e. row 0 column 0, is enough).
My first idea is to use log files (like Apache log4j), where I would be logging the relevant matrix status. Then, if we forcibly shut down the app in the middle of a calculation, we could recover to a reasonable checkpoint.
Should I use MySQL for such a task (or maybe a more lightweight database)? Or would a basic log file (and using some useful Apache libraries) be good enough? Any tips appreciated, thanks.
source-code :
MatrixMultiple
Coordinator
Connection
DataIO
Worker
If I understand the problem correctly, all you need to do is recover your place in a single matrix calculation in the event of a crash or if the application is quit half way through.
Minimum Viable Solution
The simplest approach would be to recover just the two matrices you were actively multiplying, but none of your progress, and multiply them from the beginning the next time you load the application.
The Process:
At the beginning of public static int[][] multiplyMatrix(int[][] a, int[][] b) in your MatrixMultiple class, create a file, let's call it recovery_data.txt, with the state of the two arrays being multiplied (parameters a and b). Alternatively, you could use a simple database for this.
At the end of public static int[][] multiplyMatrix(int[][] a, int[][] b) in your MatrixMultiple class, right before you return, clear the contents of the file, or wipe your database.
When the program is initially run, most likely near the beginning of main(String[] args), you should check whether the text file is non-empty. If it is, multiply the matrices stored in the file and display the output; otherwise proceed as usual.
Notes on implementation:
Using a simple text file or a full-fledged relational database is a decision you are going to have to make, mostly based on real-world data that only you can really know, but in my mind a text file wins out in most situations, and here are my reasons why. You are going to want to read the data sequentially to rebuild your matrix, so being relational is not that useful. Databases are harder to work with than a text file (not too hard, but there is no question which is simpler), and since you would not make much use of querying, that extra effort isn't balanced out by the ways they usually make a programmer's life easier.
Consider how you are going to store your arrays. In a text file you have several options. My recommendation would be to store each row as a line of text, with values separated by spaces, commas, or some other character, and then put a blank line before the second matrix. I think a similar approach is used in crAlexander's Answer here, but I have not tested his code. Alternatively, you could use something more complicated like JSON, but I think that would be too heavy-handed to justify. If you are using a database, then the relational structure should make several logical arrangements for your data apparent as well.
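As a rough illustration of the three steps and the text format described above, here is a minimal single-machine sketch. The class MatrixRecovery and all of its method names are hypothetical and not part of the program in the question; it assumes space-separated integers, one row per line, with a blank line between the two matrices, and it uses a plain triple loop in place of the distributed multiplication.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the original program. */
final class MatrixRecovery {

    /** Step 1: store a and b, one row per line, with a blank line between them. */
    static void save(Path file, int[][] a, int[][] b) {
        List<String> lines = new ArrayList<>();
        appendRows(lines, a);
        lines.add(""); // blank line separates the two matrices
        appendRows(lines, b);
        try {
            Files.write(file, lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Step 3: returns {a, b} from an unfinished run, or null if there is nothing to recover. */
    static int[][][] load(Path file) {
        try {
            if (!Files.exists(file) || Files.size(file) == 0) {
                return null;
            }
            List<int[]> a = new ArrayList<>();
            List<int[]> b = new ArrayList<>();
            List<int[]> current = a;
            for (String line : Files.readAllLines(file)) {
                if (line.trim().isEmpty()) {
                    current = b; // everything after the blank line belongs to b
                    continue;
                }
                String[] parts = line.trim().split("\\s+");
                int[] row = new int[parts.length];
                for (int i = 0; i < parts.length; i++) {
                    row[i] = Integer.parseInt(parts[i]);
                }
                current.add(row);
            }
            return new int[][][] {a.toArray(new int[0][]), b.toArray(new int[0][])};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Step 2: empty the file once the result is available. */
    static void clear(Path file) {
        try {
            Files.write(file, new byte[0]);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Wraps a multiplication with the save-before / clear-after steps. */
    static int[][] multiplyWithRecovery(Path file, int[][] a, int[][] b) {
        save(file, a, b);
        int[][] result = multiply(a, b);
        clear(file);
        return result;
    }

    /** Plain triple-loop multiplication, standing in for the distributed version. */
    static int[][] multiply(int[][] a, int[][] b) {
        int[][] c = new int[a.length][b[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < b[0].length; j++) {
                for (int k = 0; k < b.length; k++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
        return c;
    }

    private static void appendRows(List<String> lines, int[][] m) {
        for (int[] row : m) {
            StringBuilder sb = new StringBuilder();
            for (int v : row) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(v);
            }
            lines.add(sb.toString());
        }
    }
}
```

At startup, a non-null result from MatrixRecovery.load(...) would mean the previous run did not finish, and the two recovered matrices can simply be multiplied again.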
Strategic Checkpoints
You expressed interest in saving some calculations by taking advantage of the possibility that some of them were already handled the last time the program ran. Let's first look at the pros and cons of adding checkpoints after every row has been processed, as best I can see them.
Pros:
Save computation time next time the program is run, if the system had been closed.
Cons:
Making the extra writes will either use more nodes if distributed (more on that later) or increase the general latency of calculations, because you now have to throw in a write operation for every checkpoint.
More complicated to implement (but probably not by too much)
If my comments on the implementation of the Minimum Viable Solution convinced you that you could get away with a text file and would not need an RDBMS, note that I take back the parts about not leveraging queries and about everything being accessed sequentially, so here a database is perhaps the smarter choice.
I'm not saying that checkpoints are definitely not the better solution, just that I don't know if they are worth it, but here is what I would consider:
Do you expect people to be quitting half way through a calculation frequently relative to the total amount of calculations they will be running? If you think this feature will be used a lot, then the pro of adding checkpoints becomes much more significant relative to the con of it slowing down calculations as a whole.
Does it take a long time to complete a typical calculation that people are providing the program? If so, the added latency I mentioned in the cons is (percentage wise) smaller, and so perhaps more tolerable, but users are already less happy with performance, and so that cancels out some of the effect there. It also makes the argument for checkpointing more significant because it has the potential to save more time.
And so I would only recommend checkpointing like this if you expect a relatively large amount of instances where this is happening, and if it takes a relatively large amount of time to complete a calculation.
If you decide to go with checkpoints, then modify the approach to:
After every row of the result has been processed, write the content of that row to your database or, if you use the text file, append it to the end of the file, after another empty line to separate it from the last matrix.
On startup, if you need to finish a calculation that has already begun, compute and distribute only the rows that have yet to be considered, and retrieve the content of the other rows from your database.
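The row-level variant could look roughly like the following sketch. It is a single-machine illustration under several assumptions: the class RowCheckpoint and its methods are hypothetical, finished result rows go to a separate checkpoint file (one row per line) rather than to the end of the recovery file, and nothing verifies that the checkpoint belongs to the same pair of input matrices.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: one finished result row per line in a checkpoint file. */
final class RowCheckpoint {

    /** Reads the result rows that a previous run already finished. */
    static List<int[]> loadCompletedRows(Path file) {
        List<int[]> rows = new ArrayList<>();
        try {
            if (!Files.exists(file)) {
                return rows;
            }
            for (String line : Files.readAllLines(file)) {
                if (line.trim().isEmpty()) {
                    continue;
                }
                String[] parts = line.trim().split("\\s+");
                int[] row = new int[parts.length];
                for (int i = 0; i < parts.length; i++) {
                    row[i] = Integer.parseInt(parts[i]);
                }
                rows.add(row);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    /** Multiplies a by b, reusing checkpointed rows and appending each new row as it finishes. */
    static int[][] multiplyResuming(Path file, int[][] a, int[][] b) {
        List<int[]> done = loadCompletedRows(file);
        int[][] result = new int[a.length][];
        for (int i = 0; i < a.length; i++) {
            if (i < done.size()) {
                result[i] = done.get(i); // already computed by an earlier run
                continue;
            }
            result[i] = multiplyRow(a[i], b);
            appendRow(file, result[i]);  // checkpoint after every row
        }
        return result;
    }

    /** Computes one row of the product: aRow times b. */
    static int[] multiplyRow(int[] aRow, int[][] b) {
        int[] out = new int[b[0].length];
        for (int j = 0; j < out.length; j++) {
            for (int k = 0; k < aRow.length; k++) {
                out[j] += aRow[k] * b[k][j];
            }
        }
        return out;
    }

    private static void appendRow(Path file, int[] row) {
        StringBuilder sb = new StringBuilder();
        for (int v : row) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(v);
        }
        try {
            Files.write(file, Collections.singletonList(sb.toString()),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The caller would delete the checkpoint file once the full result has been used. A crash in the middle of a write could leave a truncated last line, so a real implementation would probably want to validate the row length when loading.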
A quick point on implementing frequent checkpoints: You could greatly reduce the extra latency from adding in frequent checkpoints by pushing this task out to an additional thread. Doing this would use more processes, and there is always some latency in actually spawning the process or thread, but you do not have to wait for the entire write operation to be completed before proceeding.
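One way to push the checkpoint writes onto another thread is a single-threaded executor, sketched below. AsyncCheckpointWriter is a hypothetical name, and the line format (row index followed by the row's values) is an assumption made for this example. A single worker thread keeps the rows in submission order.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: queues checkpoint writes so the calculation does not wait for disk I/O. */
final class AsyncCheckpointWriter implements AutoCloseable {
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    private final Path file;

    AsyncCheckpointWriter(Path file) {
        this.file = file;
    }

    /** Returns immediately; the row is appended to the file by the background thread. */
    void submitRow(int rowIndex, int[] row) {
        // Format on the caller's thread so later changes to the array cannot affect the write.
        StringBuilder sb = new StringBuilder().append(rowIndex);
        for (int v : row) {
            sb.append(' ').append(v);
        }
        String line = sb.toString();
        writer.execute(() -> {
            try {
                Files.write(file, Collections.singletonList(line),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    /** Waits (up to a minute) for queued writes to finish. */
    @Override
    public void close() {
        writer.shutdown();
        try {
            writer.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The trade-off is that rows still sitting in the queue are lost if the process is killed, so the checkpoint may lag slightly behind the calculation.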
A quick warning on the implementation of any such failsafe method
If there is an unchecked edge case where some sort of invalid matrix crashes the program, this failsafe now bricks the program entirely by retrying the same input on every start. To combat this, I see some obvious solutions, but perhaps with a bit of thought you could modify my approaches into something you prefer:
Use a lot of try and catch statements. If you get any sort of error that seems to be caused by malformed data, wipe your recovery file, or modify it to add a note that tells your program to treat it as a special case. A good treatment of this special case may be to display the two matrices at start with an explanation that your program failed to multiply them, likely due to malformed content.
Add data in your file/database on how many times the program has quit while solving the current problem. If this is not the first resume, treat it like the special case in the above option.
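The second option, counting resume attempts, might be sketched like this. ResumeGuard and the idea of a separate counter file are assumptions made for illustration; the count could just as well live in the recovery file or database.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: tracks how often the current calculation has been started. */
final class ResumeGuard {

    /** Returns how many attempts have been recorded so far (0 if none). */
    static int readAttempts(Path counterFile) {
        try {
            if (!Files.exists(counterFile)) {
                return 0;
            }
            String text = new String(Files.readAllBytes(counterFile), StandardCharsets.UTF_8).trim();
            return text.isEmpty() ? 0 : Integer.parseInt(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Call before starting or resuming a calculation; returns the new attempt count. */
    static int recordAttempt(Path counterFile) {
        int attempts = readAttempts(counterFile) + 1;
        try {
            Files.write(counterFile, String.valueOf(attempts).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return attempts;
    }

    /** Call after a calculation finishes successfully. */
    static void reset(Path counterFile) {
        try {
            Files.deleteIfExists(counterFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If recordAttempt(...) returns a value greater than 1 at startup, the program has already tried this input once, and it can fall back to the special-case handling instead of retrying blindly.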
I hope that this provided enough information for you to implement your failsafe in the way that makes the most sense given what you suspect the realistic use to be, and note that there are perhaps other ways you could approach this problem as well, and these could equally have their own lists of pros and cons to take into consideration.
The NeuralDataSet objects that I've seen in action haven't been anything but XOR which is just two small data arrays... I haven't been able to figure out anything from the documentation on MLDataSet.
It seems like everything must be loaded at once. However, I would like to loop through the training data until I reach EOF and then count that as 1 epoch. In everything I've seen, all the data must be loaded into one 2D array from the beginning. How can I get around this?
I've read this question, and the answers didn't really help me. And besides that, I haven't found a similar question asked on here.
This is possible. You can either use an existing data set implementation that supports streaming operation, or implement your own on top of whatever source you have. Check out BasicMLDataSet and the SQLNeuralDataSet code as examples. You will have to implement a codec if you have a specific format. For CSV there is an implementation already, though I haven't checked whether it is memory based.
Remember when doing this that your data will be streamed in full for each epoch, and in my experience that is a much bigger bottleneck than the actual computation of the network.
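To make the "one pass over the file equals one epoch" idea concrete, here is a library-independent sketch. It deliberately does not use Encog's classes: StreamingEpochs, SampleConsumer, and runEpoch are hypothetical names, and the CSV layout (input columns first, ideal output columns after them) is an assumption. The consumer stands in for whatever per-sample training call your own data set implementation or trainer would make.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: streams training samples from a CSV file, one full pass per epoch. */
final class StreamingEpochs {

    /** Receives one sample at a time; stands in for a per-sample training step. */
    interface SampleConsumer {
        void accept(double[] input, double[] ideal);
    }

    /** Reads the file to EOF once (one epoch) and returns the number of samples seen. */
    static long runEpoch(Path csv, int inputCount, SampleConsumer trainer) {
        try (BufferedReader reader = Files.newBufferedReader(csv)) {
            long samples = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;
                }
                String[] parts = line.split(",");
                double[] input = new double[inputCount];
                double[] ideal = new double[parts.length - inputCount];
                for (int i = 0; i < parts.length; i++) {
                    double value = Double.parseDouble(parts[i].trim());
                    if (i < inputCount) {
                        input[i] = value;
                    } else {
                        ideal[i - inputCount] = value;
                    }
                }
                trainer.accept(input, ideal); // only one row is held in memory at a time
                samples++;
            }
            return samples;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling runEpoch once per epoch re-reads the whole file each time, which is exactly the I/O cost mentioned above.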
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I have been experimenting with ways to use the processing power of two computers together as one (not by physically connecting them, but by splitting the task in half and each computer does a half, then the result from the "helper" computer is sent back to be combined with the result from the "main" computer via internet)
I've been using this method to compute fractal images and it works great. The left half and the right half of the image are computed on separate computers, then combined into one. The process of sending one half of the image to the other computer and combining them takes maybe a second, so the efficiency is great and cuts time down by about half.
The problem comes when you want to do this "multi computer processing" with something that needs data exchanged very frequently.
For example, I'd like to use this for something like an n-body simulation. You need the data exchange to happen multiple times per second, so if the exchange takes about a second, it actually takes much longer to use two computers than it would with one.
So how do online video games do it? The players around you, what they are doing, what they are wearing, everything going on has to be exchanged between everyone playing many times per second.
I'm just looking for general ideas on how to send larger amounts of data and at fast speeds.
The way I have been doing it is with PHP on a free hosting site. The helper computer will compute its half of the data then sends it to the PHP file which saves that data somewhere. Then the main computer reads this and combines it with the data it computed already.
I have a feeling PHP isn't the way to go, but I don't know much about this sort of thing.
Your first step will be to move from using HTTP Requests to using Sockets directly - this will give you much more control over the communication, and give you improved performance by reducing the overhead of the HTTP protocol (this is potentially pretty significant). Plus, with sockets you can more easily have your programs communicate to each other directly, rather than through the PHP-based software.
There are a ton of guides online as to how you would do this sort of system, and I would recommend Googling things like "game networking" and "distributed computing".
Here is one series of articles that I have found useful in the past, that covers the sort of things that you will want to read about: http://gafferongames.com/networking-for-game-programmers/
(He doesn't use Java, but the ideas are universal)
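As a starting point, a direct socket exchange in Java could look like the sketch below. HalfResultExchange and its methods are hypothetical names, and the wire format (an int length followed by that many doubles) is an assumption chosen for the example. It uses one TCP connection per transfer; for many exchanges per second you would more likely keep a single connection open and reuse it.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.net.Socket;

/** Hypothetical sketch: the helper computer sends its results straight to the main computer. */
final class HalfResultExchange {

    /** Helper side: connects to the main computer and sends the array in binary form. */
    static void send(String host, int port, double[] values) {
        try (Socket socket = new Socket(host, port);
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(socket.getOutputStream()))) {
            out.writeInt(values.length); // length prefix so the receiver knows how much to read
            for (double v : values) {
                out.writeDouble(v);
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Main side: waits for one connection on the given server socket and reads the array. */
    static double[] receive(ServerSocket server) {
        try (Socket socket = server.accept();
             DataInputStream in = new DataInputStream(
                     new BufferedInputStream(socket.getInputStream()))) {
            int length = in.readInt();
            double[] values = new double[length];
            for (int i = 0; i < length; i++) {
                values[i] = in.readDouble();
            }
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The main computer would create a ServerSocket on a port of its choosing and call receive(...) each step, while the helper calls send(...) with the main computer's address. Across the internet this also requires the main computer's port to be reachable (for example via port forwarding).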
So I've got these huge text files that are filled with a single comma-delimited record per line. I need a way to process the files line by line, removing lines that meet certain criteria. Some of the removals are easy, such as when one of the fields is shorter than a certain length. The hardest criterion is that these lines all have timestamps. Many records are identical except for their timestamps, and I have to remove all but one of the records that are identical and within 15 seconds of one another.
So I'm wondering if some others can come up with the best approach for this. I did come up with a small program in Java that accomplishes the task, using JodaTime for the timestamp stuff, which makes it really easy. However, the initial way I coded the program was running into OutOfMemoryError (Java heap space) errors. I refactored the code a bit and it seemed OK for the most part, but I do still believe it has some memory issues, as once in a while the program just seems to get hung up. That, and it just seems to take way too long. I'm not sure if this is a memory leak issue, a poor coding issue, or something else entirely. And yes, I tried increasing the heap size significantly but was still having issues.
I will say that the program needs to be in either Perl or Java. I might be able to make a python script work too but I'm not overly familiar with python. As I said, the timestamp stuff is easiest (to me) in Java because of the JodaTime library. I'm not sure how I'd accomplish the timestamp stuff in Perl. But I'm up for learning and using whatever would work best.
I will also add that the files being read in vary tremendously in size, but some big ones are around 100 MB with something like 1.3 million records.
My code essentially reads in all the records and puts them into a Hashmap with the keys being a specific subset of the data from a record that similar records would share. So a subset of the record not including the timestamps which would be different. This way you'd end up with some number of records with identical data but that occurred at different times. (So completely identical minus the timestamps).
The value of each key then, is a Set of all records that have the same subset of data. Then I simply iterate through the Hashmap, taking each set and iterating through it. I take the first record and compare its times to all the rest to see if they're within 15 seconds. If so the record is removed. Once that set is finished it's written out to a file until all the records have been gone through. Hopefully that makes sense.
This works but clearly the way I'm doing it is too memory intensive. Anyone have any ideas on a better way to do it? Or, a way I can do this in Perl would actually be good because trying to insert the Java program into the current implementation has caused a number of other headaches. Though perhaps that's just because of my memory issues and poor coding.
Finally, I'm not asking someone to write the program for me. Pseudo code is fine. Though if you have ideas for Perl I could use more specifics. The main thing I'm not sure how to do in Perl is the time comparison stuff. I've looked a little into Perl libraries but haven't seen anything like JodaTime (though I haven't looked much). Any thoughts or suggestions are appreciated. Thank you.
Reading all the rows in is not ideal, because you need to store the whole lot in memory.
Instead you could read line by line, writing out the records that you want to keep as you go. You could keep a cache of the rows you've hit previously, bounded to those within 15 seconds of the current line's timestamp. In very rough pseudo-code, for every line you'd read:
var line = ReadLine();
DiscardAnythingInCacheOlderThan(line.Date().Minus(15 seconds));
if (!cache.ContainsSomethingMatching(line)) {
    // it's a line we want to keep
    WriteLine(line);
}
UpdateCache(line); // make sure we store this line so we don't write it out again.
As pointed out, this assumes that the lines are in time stamp order. If they aren't, then I'd just use UNIX sort to make it so they are, as that'll quite merrily handle extremely large files.
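In Java, the windowed cache could be sketched roughly as follows. Several things here are assumptions, since the record layout is not given: the first comma-separated field is taken to be a timestamp in epoch seconds, everything after the first comma is treated as the identity of the record, and the input is already sorted by time. WindowedDedup is a hypothetical name. As in the pseudo-code, a suppressed duplicate still refreshes the cache, so a chain of duplicates each less than 15 seconds apart is collapsed into its first record.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.util.AbstractMap;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: streams records and drops duplicates seen within the last 15 seconds. */
final class WindowedDedup {
    static final long WINDOW_SECONDS = 15;

    static void filter(BufferedReader in, PrintWriter out) {
        Map<String, Long> lastSeen = new HashMap<>();                  // record key -> latest timestamp
        Deque<Map.Entry<String, Long>> byTime = new ArrayDeque<>();    // same entries in arrival order
        try {
            String line;
            while ((line = in.readLine()) != null) {
                int comma = line.indexOf(',');
                if (comma < 0) {
                    out.println(line); // no timestamp field: pass through unchanged
                    continue;
                }
                // Assumed layout: "<epochSeconds>,<rest of record>"
                long time = Long.parseLong(line.substring(0, comma).trim());
                String key = line.substring(comma + 1);

                // Discard anything in the cache older than the window.
                while (!byTime.isEmpty() && byTime.peekFirst().getValue() < time - WINDOW_SECONDS) {
                    Map.Entry<String, Long> old = byTime.pollFirst();
                    // Remove from the map only if the key has not been seen again since.
                    lastSeen.remove(old.getKey(), old.getValue());
                }

                if (!lastSeen.containsKey(key)) {
                    out.println(line); // no matching record within the window: keep it
                }
                lastSeen.put(key, time);
                byTime.addLast(new AbstractMap.SimpleImmutableEntry<>(key, time));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        out.flush();
    }
}
```

Memory use is bounded by the number of records that fall inside any 15-second span rather than by the size of the file.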
You might read the file and output just the line numbers to be deleted (to be sorted and used in a separate pass.) Your hash map could then contain just the minimum data needed plus the line number. This could save a lot of memory if the data needed is small compared to the line size.