I have a group of nodes that send measurements to a bootstrap server. In the end I want the bootstrap server to sum all the measurements and write the result to a file. One way to do that is to overwrite the file each time a measurement message is received (after adding it to the current sum), but this would be very inefficient. I want to store the measurement data and write it to the file only once, after the simulation has completed.
The problem is that the simulator code I am using is not under my control; it's a library. So I can't tell exactly when the simulation is going to end (and hence I can't tell which measurement message will be the last one).
I naively tried to store the measurement data in a static class, but this data is not accessible when the simulation terminates. Is there any other way I can do this?
Thanks,
I would find the last message using a timeout.
Write to disk when you have new data but have not received anything for a while, e.g. a second.
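A minimal sketch of that idea, assuming Java 11+ and that the server calls a method for each incoming message. The names IdleFlushAggregator and onMeasurement are made up for illustration and are not part of any simulator API. Each message cancels the pending write and schedules a new one, so the file is only written once the stream has gone quiet:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: sums measurements in memory and writes the total after an idle period. */
class IdleFlushAggregator implements AutoCloseable {
    private final Path output;
    private final long idleMillis;
    // Daemon thread so the timer never keeps the JVM alive on its own.
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "idle-flush");
                t.setDaemon(true);
                return t;
            });
    private double sum;
    private ScheduledFuture<?> pendingFlush;

    IdleFlushAggregator(Path output, long idleMillis) {
        this.output = output;
        this.idleMillis = idleMillis;
    }

    /** Call this for every measurement message that arrives. */
    synchronized void onMeasurement(double value) {
        sum += value;
        if (pendingFlush != null) {
            pendingFlush.cancel(false); // a newer message arrived, so postpone the write
        }
        pendingFlush = timer.schedule(this::flush, idleMillis, TimeUnit.MILLISECONDS);
    }

    /** Overwrites the output file with the current sum. */
    private synchronized void flush() {
        try {
            Files.writeString(output, Double.toString(sum));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        timer.shutdownNow();
    }
}
```

If messages resume after a quiet period, the file is simply rewritten with the larger sum after the next quiet period, so the timeout only has to be longer than the usual gap between messages.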
If you cannot keep the data you need in the process (which it seems you can't, since the static class failed), you need to persist it some other way. An on-disk file is one option; a database is another common one.
Related
I need to save data from a Java desktop application. The main part of the data is the text of around 50 labels, which are spread over 5 Java GUI classes. The rest are some simple application settings.
Now I am quite unsure about how to save this data. A friend told me to use a random access file and to write some kind of "serializable" object. At the moment I am using a .txt file and a FileReader/FileWriter. But this seems impractical for 50-100 values if you have to search for the position in the .txt on every update. I have the same problem with a random access file.
I thought about using some kind of embedded DB like "h2", but I don't know if this is too much and too complicated for such a small program.
Another question is how to set the text of all labels at program start. One way I am thinking about is to have a big list of all labels with fixed positions and, after reading the data from wherever it is stored, to go over this list and set the labels. Another way would be to give every label an id.
But maybe there is a much better way. I don't know how to access the labels by names read from the data.
As for saving serializable objects: can I save all the GUI objects, or do I need to combine the data in one class?
Maybe someone could give some nice advice =)
For such a small number of labels, I would just keep all data in memory. On app initialization load the file, and on every edit write the entire file from scratch.
(If you are concerned about reliability in the face of power loss and random crashes during write you need to be careful here. For example, write the new data to a different file, fsync() then atomically rename the new file to the desired filename.)
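In Java, the write-fsync-rename sequence could look roughly like the sketch below. AtomicFileWriter is a made-up name; the calls themselves are standard java.nio. FileChannel.force(true) plays the role of fsync(), and Files.move with ATOMIC_MOVE does the rename:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: replaces a file's content without ever exposing a half-written file. */
final class AtomicFileWriter {
    static void writeAtomically(Path target, String content) throws IOException {
        // The temp file must be in the same directory so the rename stays on one filesystem.
        Path dir = target.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, target.getFileName().toString(), ".tmp");
        try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            ByteBuffer buffer = ByteBuffer.wrap(content.getBytes(StandardCharsets.UTF_8));
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            channel.force(true); // flush data and metadata to the storage device
        }
        // Whether ATOMIC_MOVE replaces an existing target is implementation specific
        // according to the Javadoc; on POSIX filesystems it behaves like rename(2).
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

Calling writeAtomically(path, allLabelsAsText) on every edit means a crash leaves either the old file or the new one on disk, never a mixture.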
I'm not sure I understand your serialization problem -- but it seems like you have some sort of language translation layer that tells the GUI elements what to display. If so, then yes - I would store the labels in a central class (say LabelsMap) and have the other classes refer to data in that class using some constant keys. E.g.,
myButton.setText(labelsMap.get(CANCEL_BUTTON_LABEL));
where CANCEL_BUTTON_LABEL is some constant or enum value.
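One possible shape for such a class, as a sketch only: the enum LabelKey and its two constants are placeholders for your real labels, and java.util.Properties is used here as the storage format simply because it already knows how to load and store key/value text files.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Properties;

/** Placeholder keys; a real application would list all of its labels here. */
enum LabelKey { CANCEL_BUTTON_LABEL, OK_BUTTON_LABEL }

/** Central, in-memory store for label texts, keyed by enum constant. */
final class LabelsMap {
    private final Map<LabelKey, String> texts = new EnumMap<>(LabelKey.class);

    /** Returns the stored text, or the key name if nothing has been stored yet. */
    String get(LabelKey key) {
        return texts.getOrDefault(key, key.name());
    }

    void put(LabelKey key, String text) {
        texts.put(key, text);
    }

    /** Converts to Properties, which can be written with Properties.store(). */
    Properties toProperties() {
        Properties props = new Properties();
        texts.forEach((key, text) -> props.setProperty(key.name(), text));
        return props;
    }

    /** Rebuilds the map from Properties read with Properties.load(). */
    static LabelsMap fromProperties(Properties props) {
        LabelsMap map = new LabelsMap();
        for (LabelKey key : LabelKey.values()) {
            String text = props.getProperty(key.name());
            if (text != null) {
                map.put(key, text);
            }
        }
        return map;
    }
}
```

Each GUI class then only needs a reference to the shared LabelsMap instance and never has to know where the texts are stored.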
I am trying to write a huge amount of data, fetched from a MySQL db, to CSV using supercsv. How can I manage the performance issue in a simple way? Does supercsv have any limits when writing?
Since you included almost no detail in your question about how you are approaching the problem, it's hard to make concrete recommendations. So, here's a general one:
Unless you are writing your file to a really slow medium (some old USB stick or something), the slowest step in your process should be reading the data from the database.
There are two general ways to structure your program:
The bad way: Reading all the data from the database into your application's memory first and then, in a second step, writing it all in one shot to the csv file.
The right way: "Stream" the data from the db into the csv file, i.e. write the data to the csv file as it comes in to your application (record by record or batch by batch).
The idea is to set up something usually referred to as a "pipeline". Think of it like conveyor belt construction in a factory: You have multiple steps in your process of assembling some widget. What you don't want to do is have station 1 process all widgets and have stations 2 and 3 sit idle meanwhile, and then pass the whole container of widgets to station 2 to begin work, while stations 1 and 3 sit idle and so forth. Instead, station 1 needs to send small batches (1 at a time or 10 at a time or so) of widgets that are done to station 2 immediately so that they can start working on it as soon as possible. The goal is to keep all stations as busy as possible at all times.
In your example, station 1 is mysql retrieving the records, station 2 is your application that forwards (and processes?) them, and station 3 is supercsv. So, simply make sure that supercsv can start working as soon as possible, rather than having to wait for mysql to finish the entire request.
If you do this right, you should be able to generate the csv file as quickly as mysql can throw records at you*, and then, if it's still too slow, you need to rethink your database backend.
*I haven't used supercsv yet, so I don't know how well it performs, but given how trivial its job is and how popular it is, I would find it hard to believe that it would end up performing less well (as measured in processing time needed for one record) than mysql in this task. But this might be something that is worth verifying...
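To make the "right way" concrete, here is a rough sketch in plain Java. It deliberately does not use supercsv or JDBC: the Iterator stands in for the database cursor (with JDBC you would loop over ResultSet.next() instead), the hand-written escape method stands in for the CSV library, and CsvStreamer is a made-up name. The point is only that each record is written as soon as it is read, so memory use stays constant.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.Iterator;

/** Hypothetical helper: writes rows to CSV one at a time instead of collecting them first. */
final class CsvStreamer {
    /** Streams every row to the writer and returns how many rows were written. */
    static long stream(Iterator<String[]> rows, Writer out) throws IOException {
        long count = 0;
        while (rows.hasNext()) {
            String[] row = rows.next(); // only one row is held in memory at a time
            for (int i = 0; i < row.length; i++) {
                if (i > 0) {
                    out.write(',');
                }
                out.write(escape(row[i]));
            }
            out.write('\n');
            count++;
        }
        out.flush();
        return count;
    }

    /** Quotes a field if it contains a comma, a quote, or a newline. */
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }
}
```

Wrap the target in a BufferedWriter so that the small writes are batched. Also check how your JDBC driver fetches results: if it buffers the whole result set before returning the first row, the pipeline degenerates back into the "bad way" no matter how the loop is written.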
So, I have been playing with Cassandra, and have set up a cluster with three nodes. I am trying to figure out how redundancy works with ConsistencyLevels. Currently, I am writing data with ConsistencyLevel.ALL and am reading data with ConsistencyLevel.ONE. From what I have been reading, this seems to make sense. I have three Cassandra nodes, and I want to write to all three of them. I only care about reading from one of them, so I will take the first response. To test this, I have written a bunch of data (again, with ConsistencyLevel.ALL). I then kill one of my nodes (not the "seed" or "listen_address" machine).
When I then try to read, I expect, maybe after some delay, to get my data back. Initially, I get a TimeoutException... which I expect. This is what one gets when Cassandra is trying to deal with an unexpected node loss, right? After about 20 seconds, I try again, and now am getting an UnavailableException, which is described as "Not all the replicas required could be created and/or read".
Well, I don't care about all the replicas... just one (as in ConsistencyLevel.ONE on my get statement), right?
Am I missing the ConsistencyLevel point here? How can I configure this to still get my information if a node dies?
Thanks
It sounds like you have Replication Factor (RF) set to 1, meaning only one node holds any given row. Thus, when you take a node down, no matter what consistency level you use, you won't be able to read or write 1/3 of your data. Your expectations match what should happen with RF = 3.
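The arithmetic behind this can be written down in a few lines. This is an illustration of the rule, not Cassandra code; ReplicaMath and its methods are made-up names. A request succeeds only if the number of live replicas for that row is at least the number the consistency level asks for:

```java
/** Illustration only: how many live replicas a request needs for a given consistency level. */
final class ReplicaMath {
    enum Level { ONE, QUORUM, ALL }

    /** Number of replicas that must respond, given the replication factor. */
    static int required(Level level, int replicationFactor) {
        switch (level) {
            case ONE:
                return 1;
            case QUORUM:
                return replicationFactor / 2 + 1;
            default:
                return replicationFactor; // ALL
        }
    }

    /** True if enough replicas of the row are still alive to serve the request. */
    static boolean available(Level level, int replicationFactor, int liveReplicas) {
        return liveReplicas >= required(level, replicationFactor);
    }
}
```

With RF = 1, a row whose single replica lives on the dead node has zero live replicas, so even ONE fails. With RF = 3 and one node down, two replicas remain: reads at ONE succeed, but writes at ALL would now fail.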
I am developing an application to process (and merge) several large Java serialized objects (on the order of GBs in size) using the Hadoop framework. Hadoop distributes the blocks of a file across different hosts. But since deserialization requires all the blocks to be present on a single host, this is going to hurt performance drastically. How can I deal with this situation, where the blocks can't be processed individually, unlike text files?
There are two issues: one is that each file must (in the initial stage) be processed in whole: the mapper that sees the first byte must handle all the rest of that file. The other problem is locality: for best efficiency, you'd like all the blocks for each such file to reside on the same host.
Processing files in whole:
One simple trick is to have the first-stage mapper process a list of filenames, not their contents. If you want 50 map jobs to run, make 50 files each with that fraction of the filenames. This is easy and works with java or streaming hadoop.
Alternatively, use a non-splittable input format such as NonSplitableTextInputFormat.
For more details, see "How do I process files, one per map?" and "How do I get each of my maps to work on one complete input-file?" on the hadoop wiki.
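For the first trick, producing the listings is just a matter of dealing the filenames out into N groups. A small sketch in plain Java (FileListSplitter is a made-up name, and writing each group to its own listing file is left out):

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: deals filenames round-robin into a fixed number of listings. */
final class FileListSplitter {
    /** Returns 'parts' lists; list i holds every filename whose index is i modulo 'parts'. */
    static List<List<String>> split(List<String> fileNames, int parts) {
        List<List<String>> listings = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            listings.add(new ArrayList<>());
        }
        for (int i = 0; i < fileNames.size(); i++) {
            listings.get(i % parts).add(fileNames.get(i));
        }
        return listings;
    }
}
```

Each listing then becomes one small input file, and the mapper that receives it opens and processes the named files itself.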
Locality:
This leaves a problem, however: the blocks you are reading from are distributed all across the HDFS, which is normally a performance gain but here a real problem. I don't believe there's any way to chain certain blocks to travel together in the HDFS.
Is it possible to place the files in each node's local storage? This is actually the most performant and easiest way to solve this: have each machine start jobs to process all the files in e.g. /data/1/**/*.data (being as clever as you care to be about efficiently using local partitions and number of CPU cores).
If the files originate from a SAN or from say s3 anyway, try just pulling from there directly: it's built to handle the swarm.
A note on using the first trick: If some of the files are much larger than others, put them alone in the earliest-named listing, to avoid issues with speculative execution. You might turn off speculative execution for such jobs anyway if the tasks are dependable and you don't want some batches processed multiple times.
It sounds like your input file is one big serialized object. Is that the case? Could you make each item its own serialized value with a simple key?
For example, if you wanted to use Hadoop to parallelize the resizing of images, you could serialize each image individually and give it a simple index key. Your input file would be a text file of key/value pairs, where the key is the index and the value is the serialized blob.
I use this method when doing simulations in Hadoop. My serialized blob is all the data needed for the simulation and the key is simply an integer representing a simulation number. This allows me to use Hadoop (in particular Amazon Elastic Map Reduce) like a grid engine.
I think the basic (unhelpful) answer is that you can't really do this, since this runs directly counter to the MapReduce paradigm. Units of input and output for mappers and reducers are records, which are relatively small. Hadoop operates in terms of these, not file blocks on disk.
Are you sure your process needs everything on one host? Anything that I'd describe as a merge can be implemented pretty cleanly as a MapReduce where there is no such requirement.
If you mean that you want to ensure certain keys (and their values) end up on the same reducer, you can use a Partitioner to define how keys are mapped onto reducer instances. Depending on your situation, this may be what you really are after.
I'll also say it kind of sounds like you are trying to operate on HDFS files, rather than write a Hadoop MapReduce. So maybe your question is really about how to hold open several SequenceFiles on HDFS, read their records and merge, manually. This isn't a Hadoop question then, but, still doesn't need blocks to be on one host.
I challenge you :)
I have a process that someone already implemented. I will try to describe the requirements, and I was hoping I could get some input to the "best way" to do this.
It's for a financial institution.
I have a routing framework that allows me to receive files and send requests to other systems. I have a database I can use as I wish, but only I and my software have access to it.
The facts
Via the routing framework I receive a file.
Each line in this file follows a fixed-length format with the identification of a person and an amount (+ lots of other stuff).
99% of the time this file is below 100 MB (around 800 bytes per line, i.e. 2.2 MB = 2,600 lines).
Once a year we have 1-3 GB of data instead.
Running on an "appserver"
I can fork subprocesses as I like. (within reason)
I cannot ensure consistency when running for more than two days. Subprocesses may die, the connection to the db/framework might be lost, files might move.
I can NOT send reliable messages via the framework. The call is synchronous, so I must wait for the answer.
It's possible/likely that sending these getPerson requests will crash my "process" when sending LOTS of them.
We're using java.
Requirements
I must return a file with all the data, and I must add some more info to some of the lines (about 25-50% of the lines; at least 25,000).
This info I can only get by doing a getPerson request via the framework to another system. One per person. Takes between 200 and 400msec.
It must be able to complete within two days
Nice to have
Checkpointing. If I'm going to run for a long time, I sure would like to be able to restart the process without starting from the top.
...
How would you design this?
I will later add the current "hack" and my brief idea
========== Current solution ================
It's running on BEA/Oracle Weblogic Integration, not by choice but by definition
When the file is received, each line is read into a database table with the columns
id, line, status, batchfilename, and the status is set to 'Needs processing'.
When all lines are in the database, the rows are separated by id mod 4 and one process is started for each quarter of the rows. Each line that needs it is enriched by the getPerson call and its status is set to 'Processed' (38.0000 in the current batch).
When all 4 quarters of the rows have been processed, a writer process starts: it selects 100 rows at a time from the database, writes them to the file and updates their status to 'Written'.
When all is done, the new file is handed back to the routing framework, and an "I'm done" email is sent to the operations crew.
The 4 processing processes can and will fail, so it's possible to restart them with an HTTP GET to a servlet on WLI.
Simplify as much as possible.
The batches (trying to process them as units, and their various sizes) appear to be discardable in terms of the simplest process. It sounds like the rows are atomic, not the batches.
Feed all the lines as separate atomic transactions through an asynchronous FIFO message queue, with a good mechanism for detecting (and appropriately logging and routing) failures. Then you can deal with the problems strictly on an exception basis. (A queue table in your database can probably work.)
Maintain batch identity only with a column in the message record, and summarize batches by that means however you need, whenever you need.
When you receive the file, parse it and put the information in the database.
Make one table with a record per line that will need a getPerson request.
Have one or more threads get records from this table, perform the request and put the completed record back in the table.
Once all records are processed, generate the complete file and return it.
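A rough sketch of steps two and three, with an in-memory list standing in for the database table and a function parameter standing in for the getPerson call. EnrichmentRun, Row and processPending are made-up names for illustration. The important property is that a row's status only changes after its request has succeeded, so calling the method again after a crash picks up exactly the rows that are still pending:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.UnaryOperator;

/** Illustration only: restartable enrichment of rows with a small worker pool. */
final class EnrichmentRun {
    enum Status { NEEDS_PROCESSING, PROCESSED }

    /** Stand-in for one record of the database table. */
    static final class Row {
        final int id;
        volatile String line;
        volatile Status status = Status.NEEDS_PROCESSING;

        Row(int id, String line) {
            this.id = id;
            this.line = line;
        }
    }

    /** Processes every row still marked NEEDS_PROCESSING; safe to call again after a failure. */
    static void processPending(List<Row> table, UnaryOperator<String> getPerson, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (Row row : table) {
            if (row.status != Status.NEEDS_PROCESSING) {
                continue; // checkpoint: finished rows are never requested again
            }
            pool.submit(() -> {
                try {
                    row.line = getPerson.apply(row.line);
                    row.status = Status.PROCESSED;
                } catch (RuntimeException e) {
                    // Leave the row as NEEDS_PROCESSING so that a later run retries it.
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

With a real table, the status update and the enriched line would be written in the same transaction. A small, fixed number of threads also keeps the number of concurrent getPerson requests bounded, which matters if sending too many at once can crash the process.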
If the processing of the file takes 2 days, then I would start by implementing some sort of resume feature. Split the large file into smaller ones and process them one by one. If the processing is interrupted for some reason, you will not have to start all over again.
By splitting the larger file into smaller files then you could also use more servers to process the files.
You could also use a mass loader (Oracle's SQL Loader, for example) to load the large amount of data from the file into the table, again adding a column to mark whether the line has been processed, so you can pick up where you left off if the process crashes.
The return value could be many small files which at the end would be combined into large single file. If the database approach is chosen you could also save the results in a table which could then be extracted to a csv file.