I want to know how to process and manage a tabular data stream in Java.
Consider a table of records with the schema (name, age, zip-code, disease), whose records are read and processed tuple by tuple over time, as a stream. I want to manage these stream tuples so that the processed tuples are saved with the schema (age, zip-code, disease); the name attribute is to be removed.
For example: read Tuple 1 (han, 25, 12548, flu) at time t1
publish Tuple 1* (25, 12548, flu)
read Tuple 2 (alex, 27, 12544, cancer) at time t2
output Tuple 2* (27, 12544, cancer)
... and so on. Can anyone help me?
Here are some suggestions for a framework you can base your final application on.
First, make classes to represent your input and output records. We'll call them InRecord and OutRecord for the sake of discussion, but you can give them whatever names make sense for you. Give them private fields to hold the necessary data and public getter/setter methods to access the data.
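For example, a rough sketch (the field names are assumptions based on the schema in the question):
public class InRecord {
    private String name;
    private int age;
    private String zipCode;
    private String disease;

    public int getAge() { return age; }
    public String getZipCode() { return zipCode; }
    public String getDisease() { return disease; }
    public void setName(String name) { this.name = name; }
    public void setAge(int age) { this.age = age; }
    public void setZipCode(String zipCode) { this.zipCode = zipCode; }
    public void setDisease(String disease) { this.disease = disease; }
    // getName() follows the same pattern
}

public class OutRecord {
    private final int age;
    private final String zipCode;
    private final String disease;

    public OutRecord(int age, String zipCode, String disease) {
        this.age = age;
        this.zipCode = zipCode;
        this.disease = disease;
    }

    public int getAge() { return age; }
    public String getZipCode() { return zipCode; }
    public String getDisease() { return disease; }
}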
Second, define an interface for an input supplier; let's call it InputSupplier for this discussion. It will need setup (open()) and tear-down (close()) methods to be called at the start and end of processing, and a getNext() method that returns the next available InRecord. You'll need to decide how it indicates end-of-input: either define that getNext() returns null when there are no more input records, or provide a hasNext() method that returns true or false to indicate whether another input record is available.
Third, define an interface for an output consumer (OutputConsumer). You'll want to have open() and close() methods, as well as an accept(OutRecord) method.
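A minimal sketch of both interfaces, using the getNext()-returns-null convention for end-of-input (which the process method below relies on):
public interface InputSupplier {
    void open();
    void close();
    InRecord getNext(); // returns null when there are no more input records
}

public interface OutputConsumer {
    void open();
    void close();
    void accept(OutRecord record);
}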
With this infrastructure in place, you can write your processing method:
public static void process(InputSupplier in, OutputConsumer out) {
    in.open();
    out.open();
    InRecord inrec;
    while ((inrec = in.getNext()) != null) {
        OutRecord outrec = new OutRecord(inrec.getAge(), inrec.getZipCode(), inrec.getDisease());
        out.accept(outrec);
    }
    out.close();
    in.close();
}
Finally, write some "dummy" I/O classes, one that implements InputSupplier and another that implements OutputConsumer. For test purposes, your input supplier can just return a few hand-created records and your output consumer could just print on the console the output records you send it.
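A rough sketch of such test classes, reusing the records and interfaces sketched above (the sample values are made up):
import java.util.ArrayList;
import java.util.List;

public class TestInput implements InputSupplier {
    private final List<InRecord> records = new ArrayList<>();
    private int index = 0;

    public void open() {
        // hand-create a couple of records for testing
        records.add(record("han", 25, "12548", "flu"));
        records.add(record("alex", 27, "12544", "cancer"));
    }

    public void close() { }

    public InRecord getNext() {
        return index < records.size() ? records.get(index++) : null;
    }

    private static InRecord record(String name, int age, String zip, String disease) {
        InRecord r = new InRecord();
        r.setName(name);
        r.setAge(age);
        r.setZipCode(zip);
        r.setDisease(disease);
        return r;
    }
}

public class TestOutput implements OutputConsumer {
    public void open() { }
    public void close() { }

    public void accept(OutRecord record) {
        // just print each output record to the console
        System.out.println(record.getAge() + ", " + record.getZipCode() + ", " + record.getDisease());
    }
}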
Then all you need is a main method to tie it all together:
public static void main(String[] args) {
    InputSupplier in = new TestInput();    // our "dummy" input supplier class
    OutputConsumer out = new TestOutput(); // our "dummy" output consumer
    process(in, out);
}
For the "real" application you'd write a "real" input supplier class, still implementing the InputSupplier interface, that can read from from a database or an Excel file or whatever input source, and an new output consumer class, still implementing the OutputConsumer interface, that can take output records and store them into whatever appropriate format. Your processing logic won't have to change, because you coded it in terms of InputSupplier and OutputConsumer interfaces. Now just tweak main a bit and you've got your final app:
public static void main(String[] args) {
    InputSupplier in = new RealInput();    // our "real" input supplier class
    OutputConsumer out = new RealOutput(); // our "real" output consumer
    process(in, out);
}
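For instance, here is a rough sketch of a file-backed input supplier; the constructor argument, the comma-separated format, and the column order name, age, zip-code, disease are all assumptions, and the main method above would then pass it the file name:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.UncheckedIOException;

public class RealInput implements InputSupplier {
    private final String fileName;
    private BufferedReader reader;

    public RealInput(String fileName) {
        this.fileName = fileName;
    }

    public void open() {
        try {
            reader = new BufferedReader(new FileReader(fileName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public void close() {
        try {
            reader.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public InRecord getNext() {
        try {
            String line = reader.readLine();
            if (line == null) {
                return null; // end of input
            }
            String[] parts = line.split(","); // assumed order: name, age, zip-code, disease
            InRecord r = new InRecord();
            r.setName(parts[0].trim());
            r.setAge(Integer.parseInt(parts[1].trim()));
            r.setZipCode(parts[2].trim());
            r.setDisease(parts[3].trim());
            return r;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}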
Related
I have a file of records; each row begins with a timestamp and then a few fields. It implements Iterable:
@SuppressWarnings("unchecked")
@Override
public <E extends MarkedPoint> Stream<E> stream()
{
    return (Stream<E>) StreamSupport.stream(spliterator(), false);
}
I would like to implement, with lambda expressions and the streams API, what is essentially not just a filter but a mapping/accumulator that merges neighboring records (stream elements coming from the Iterable interface) that have the same timestamp. I would need an interface something like this:
MarkedPoint prevPoint = null;
void nextPoint(MarkedPoint in, Stream<MarkedPoint> inputStream, Stream<MarkedPoint> outputStream )
{
while ( prevPoint.time == in.time )
{
updatePrevPoint(in);
in = inputStream.next();
}
outputStream.emit(in);
prevPoint = in;
}
That is rough pseudocode of what I imagine is close to how some API is supposed to be used. Can someone please point me towards the most straightforward way of implementing this stream transformation? The resulting stream will necessarily have the same number of elements as the input or fewer, as it is essentially a filter plus an optional transformation applied when records occurring at the same timestamp are encountered.
Thanks in advance
Streams don't work like that; there can be only one terminal operation (consumer). What you seem to be asking for is an on-the-fly reduction with a possible consumption of the next element(s) within your class. No dice with the standard stream API.
You could first create a list of un-merged lines, then create an iterator that peeks at the next element(s) and merges them before returning the next merged element.
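A rough sketch of that second approach, written as a helper method that first collects the points and then merges neighbours with equal timestamps; MarkedPoint's time field is taken from the question, and merge(prev, p) is a hypothetical helper that combines two points:
import java.util.ArrayList;
import java.util.List;

static List<MarkedPoint> mergeByTimestamp(Iterable<MarkedPoint> points) {
    List<MarkedPoint> merged = new ArrayList<>();
    MarkedPoint prev = null;
    for (MarkedPoint p : points) {
        if (prev != null && prev.time == p.time) {
            prev = merge(prev, p);               // hypothetical: combine neighbours with the same timestamp
            merged.set(merged.size() - 1, prev); // replace the last element with the merged one
        } else {
            merged.add(p);
            prev = p;
        }
    }
    return merged; // merged.stream() then gives the reduced stream
}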
I'm having trouble properly implementing the following scenario using RxJava (v1.2.1):
I need to handle a request for some data object. I have a meta-data copy of this object which I can return immediately, while making an API call to a remote server to retrieve the whole object data. When I receive the data from the API call I need to process the data before emitting it.
My solution currently looks like this:
return Observable.just(localDataCall())
.concatWith(externalAPICall().map(new DataProcessFunction()));
The first Observable, localDataCall(), should emit the local data, which is then concatenated with the remote API call, externalAPICall(), mapped to the DataProcessFunction.
This solution works but it has a behavior that is not clear to me. When the local data call returns its value, this value goes through the DataProcessFunction even though it's not connected to the first call.
Any idea why this is happening? Is there a better implementation for my use case?
I believe that the issue lies in some part of your code that has not been provided. The data returned from localDataCall() is independent of the new DataProcessFunction() object, unless somewhere within localDataCall you use another DataProcessFunction.
To prove this to you I will create a small example using io.reactivex:rxjava:1.2.1:
public static void main(String[] args){
Observable.just(foo())
.concatWith(bar().map(new IntMapper()))
.subscribe(System.out::println);
}
static int foo() {
System.out.println("foo");
return 0;
}
static Observable<Integer> bar() {
System.out.println("bar");
return Observable.just(1, 2);
}
static class IntMapper implements Func1<Integer, Integer>
{
@Override
public Integer call(Integer integer)
{
System.out.println("IntMapper " + integer);
return integer + 5;
}
}
This prints to the console:
foo
bar
0
IntMapper 1
6
IntMapper 2
7
As can be seen, the value 0 created in foo never gets processed by IntMapper; IntMapper#call is only called twice for the values created in bar. The same can be said for the value created by localDataCall. It will not be mapped by the DataProcessFunction object passed to your map call. Just like bar and IntMapper, only values returned from externalAPICall will be processed by DataProcessFunction.
.concatWith() concatenates all items emitted by one observable with all items emitted by the other observable, so it is no wonder that .map() is being called twice.
But I do not understand why you need localDataCall() at all in this scenario. Perhaps you could use .switchIfEmpty() or .switchOnNext() instead.
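For example, if the local copy is only meant as a fallback for when the remote call produces nothing, a rough sketch (RxJava 1.x; Data is a placeholder for whatever type the calls return) could look like this:
// Emit the processed remote data; fall back to the unprocessed local copy only if
// the remote observable completes without emitting anything.
Observable<Data> remote = externalAPICall().map(new DataProcessFunction());
return remote.switchIfEmpty(Observable.just(localDataCall()));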
I will begin with an example. Suppose the input data is something like
User1,product1,time1
User1,product2,time2
User1,product3,time3
User2,product2,time2
User2,product4,time6
Now, the expected output is that I have to insert the data into a database (Aerospike, a key-value store, in my case), where the data should be formatted as
User1, [ [product1,time1],[product2,time2],[product3,time3] ]
User2, [ [product2,time2],[product4,time6] ]
So in the Mapper I output the following:
UserID, [productid,timestamp]
Please do not assume that [x,y] means I am outputting a list; I may send data from the mapper in any form, for example by writing the data in a custom object.
So at the receiver side I have data in the format
User1, [ [product1,time1],[product2,time2],[product3,time3] ]
User2, [ [product2,time2],[product4,time6] ]
Now I can do two things
a) I can write the logic to push this data to the database in the reducer itself
(I don't want to do this)
b) What I want to do is: when we call Context.write(), I want the data to be written to the database.
Please help me understand how this could be done and, if possible, attach a code snippet or pseudocode.
PS: What does Context.write() do? Where does it write to? What steps and phases does it go through?
As far as my understanding goes, invoking context.write() involves a certain number of steps.
In the driver we have to specify the output format. Now let's see what happens if we want to write to a file.
For writing to a text file we specify something like:
job.setOutputFormatClass(TextOutputFormat.class);
Now, if we look at the implementation, the TextOutputFormat class extends FileOutputFormat (an abstract class), which implements the OutputFormat interface, and the OutputFormat interface provides two methods:
1) getRecordWriter
2) checkOutputSpecs
Now, what happens is this: the output format class just tells what kind of record you want to write, and how it is written is determined by the record writer. The record writer simply gets a key object and a value object (where the value could be a single item or a list), and in the implementation of the record writer we specify the actual logic for how the record should be written.
Now, coming back to the original question: how should the record be written to a database, in my case Aerospike?
I created a custom OutputFormat, say:
public class AerospikeOutputFormat extends OutputFormat {
//Return a new instance of record writer
@Override
public RecordWriter getRecordWriter(TaskAttemptContext context) throws IOException, InterruptedException {
return new AerospikeRecordWriter(context.getConfiguration(), new Progressable() {
@Override
public void progress() {
}
});
}
}
Now we have to define a custom record writer which would get a key and a value and would write the data to the database
public class RSRVRecordWriter<KK, VV> extends RecordWriter<KK, VV> {
    @Override
    public void write(KK key, VV value) throws IOException {
        //Now here we can have an instance of aerospikeclient from a singleton class and then we could do client.put()
    }
    @Override
    public void close(TaskAttemptContext context) throws IOException {
        //Close the client / flush any pending writes here
    }
}
Above code is just a snippet, proper design strategy must be taken.
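With those classes in place, the driver registers the custom format just as it registered TextOutputFormat earlier:
job.setOutputFormatClass(AerospikeOutputFormat.class);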
PS: Aerospike has given a record writer which could be extended to match your needs at this link
class CSVReader {

    private List<String> output;
    private InputStream input;

    public CSVReader(InputStream input) {
        this.input = input;
    }

    public void read() throws Exception {
        //do something with the inputstream
        // create output list.
    }

    public List<String> getOutput() {
        return Collections.unmodifiableList(output);
    }
}
I am trying to create a simple class which will be part of a library. I would like to create code that satisfies the following conditions:
handles all potential errors or wraps them into library errors and
throws them.
creates meaningful and complete object states (no incomplete object structures).
easy to utilize by developers using the library
Now, when I evaluated the code above against these goals, I realized that I had failed badly. A developer using this code would have to write something like this:
CSVReader reader = new CSVReader(new FileInputStream("test.csv"));
reader.read();
reader.getOutput();
I see the following issues straight away -
- the developer has to call read() before getOutput(). There is no way to know this intuitively, and it is probably bad design.
So, I decided to fix the code and write something like this
public List<String> getOutput() throws IOException{
if(output==null)
read();
return Collections.unmodifiableList(output);
}
OR this
public List<String> getOutput() {
if(output==null)
throw new IncompleteStateException("invoke read before getoutput()");
return Collections.unmodifiableList(output);
}
OR this
public CSVReader(InputStream input) {
read(); //throw runtime exception
}
OR this
public List<String> read() throws IOException {
//read and create output list.
// return list
}
What is a good way to achieve my goals? Should the object state always be well defined? There is never a state where "output" is not defined, so should I create the output as part of the constructor? Or should the class ensure that a created instance is always valid, by calling "read" whenever it finds that "output" is not defined, and just throw a runtime exception? What is a good approach / best practice here?
I would make read() private and have getOutput() call it as an implementation detail. If the point of exposing read() is to lazy-load the file, you can do that while exposing getOutput() only:
public List<String> getOutput() {
if (output == null) {
try {
output = read();
} catch (IOException e) {
//here you either wrap into your own exception and then declare it in the signature of getOutput, or just not catch it and make getOutput `throws IOException`
}
}
return Collections.unmodifiableList(output);
}
The advantage of this is that the interface of your class is very simple: you give me an input (via the constructor), I give you an output (via getOutput), with no magic order of calls, while preserving lazy loading, which is nice if the file is big.
Another advantage of removing read from the public API is that you can go from lazy loading to eager loading and vice versa without affecting your clients. If you expose read you have to account for it being called in all possible states of your object (before it has loaded, while it is already loading, after it has already loaded). In short, always expose as little as possible.
So to address your specific questions:
Yes, the object state should always be well defined. Your point that the client has no intuitive way to know it must call read first is indeed a design smell.
Yes, you could call read in the constructor and eagerly load everything upfront. Deciding whether to lazy-load is an implementation detail dependent on your context; it should not matter to a client of your class.
Throwing an exception if read has not been called again puts the burden of calling things in the right, implicit order on the client, which is unnecessary; since, as you note, output is never really undefined, the implementation itself can make the risk-free decision of when to call read.
I would suggest you make your class as small as possible, dropping the getOutput() method all together.
The idea is to have a class that reads a CSV file and returns a list, representing the result. To achieve this, you can expose a single read() method, that will return a List<String>.
Something like:
public class CSVReader {
private final InputStream input;
public CSVReader(String filename) throws FileNotFoundException {
this.input = new FileInputStream(filename);
}
public List<String> read() {
// perform the actual reading here
}
}
You have a well defined class, a small interface to maintain and the instances of CSVReader are immutable.
Have getOutput() check whether the output is null (or out of date) and load it automatically if so. This means a user of your class does not have to care about the internal state of the class's file management.
However, you may also want to expose a read function so that the user can choose to load the file when it is convenient. If you make the class for a concurrent environment, I would recommend doing so.
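If several threads may call getOutput() concurrently, one simple (if coarse) way to keep the lazy load safe is to synchronize it; a rough sketch, reusing the read() method and output field from the question:
public synchronized List<String> getOutput() throws Exception {
    if (output == null) {
        read(); // first caller loads the file; synchronization keeps it from happening twice
    }
    return Collections.unmodifiableList(output);
}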
The first approach takes away some flexibility from the API: before the change the user could call read() in a context where an exception is expected, and then call getOutput() exception-free as many times as he pleases. Your change forces the user to catch a checked exception in contexts where it wasn't necessary before.
The second approach is how it should have been done in the first place: since calling read() is a prerequisite of calling getOutput(), it is a responsibility of your class to "catch" your users when they "forget" to make a call to read().
The third approach hides IOException, which may be a legitimate exception to catch. There is no way to let the user know if the exception is going to be thrown or not, which is a bad practice when designing runtime exceptions.
The root cause of your problem is that the class has two orthogonal responsibilities:
Reading a CSV, and
Storing the result of a read for later use.
If you separate these two responsibilities from each other, you would end up with a cleaner design, in which the users would have no confusion over what they must call, and in what order:
interface CSVData {
List<String> getOutput();
}
class CSVReader {
public static CSVData read(InputStream input) throws IOException {
...
}
}
You could combine the two into a single class with a factory method:
class CSVData {
private CSVData() { // No user instantiation
}
// Getting data is exception-free
public List<String> getOutput() {
...
}
// Creating instances requires a factory call
public static CSVData read(InputStream input) throws IOException {
...
}
}
I have a class that needs to provide a fast classification service. For example, I want to write code like "classify("Ac Kd Kh 3c 3s")" that quickly returns TWO_PAIR. (This isn't the actual application, but you get the gist.)
Because I need the classification to be quick I want to precompute, and then store, a look-up table that lists the classification output for all possible inputs. In the interest of time I want to parallelize this precomputation. HOWEVER, attempting to use "classifySlowly" from a 2nd thread creates a deadlock.
public class ClassificationService {
enum CLASS {TYPE_A, TYPE_B, ...};
static CLASS[] preComputedClassLookUpTable;
static {
preComputedClassLookUpTable = constructLookUpTableInParallel();
}
//Note: using this method from within constructLookUpTableInParallel causes deadlock
private static CLASS classifySlowly(Object classifyMe) {
//do time intensive work to classify the input
// -- uses no other methods from this class
return classification;
}
public static CLASS classify(Object classifyMe) {
//use the lookup table to do a quick classification
return classification;
}
}
So my question is: Is there a way to precompute this lookup table IN PARALLEL from within the static initializer block?
The only (poor) alternative I see is to switch from:
preComputedClassLookUpTable = constructLookUpTableInParallel();
To:
preComputedClassLookUpTable = loadLookUpTableFromFile();
if(preComputedClassLookUpTable == null) {
System.out.println("WARNING: Construction incomplete, Must call computeAndSaveLookUpTableFile();}
}
I thought this would be too much but here is the implementation of constructLookUpTableInParallel
private static CLASS[] constructLookUpTableInParallel() {
//build a collection of Runnables by iterating over all possible input Objects
//wrap each possible input in an anonymous Runnable that calls classifySlowly.
//submit the collection of Runnables to a new ExecutorService
//process runnables...
//shutdown executor service
}
////////END OF POORLY WORDED ORIGINAL QUESTION ///////////
The solution that works somewhat cleanly is splitting the classifySlowly(Object classifyMe) and classify(Object classifyMe) methods into two different classes.
This will allow the (first) class that contains "public static CLASS classifySlowly(Object classifyMe)" to be fully loaded by the time the (second) class that contains "public static CLASS classifyQuickly(Object classifyMe)" needs to use the classifySlowly method. Now that the second class's static initialization block doesn't require any of its own static methods, it can be fully parallelized.
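In outline, the split could look like this (class names, the placeholder enum, and the placeholder bodies are only illustrative):
enum CLASS { TYPE_A, TYPE_B } // placeholder enum for the sketch

class SlowClassifier {
    // no static initializer of its own, so it is fully initialized before
    // FastClassifier's static block starts calling it from worker threads
    static CLASS classifySlowly(Object classifyMe) {
        // ... time-intensive classification work ...
        return CLASS.TYPE_A; // placeholder result
    }
}

class FastClassifier {
    static final CLASS[] LOOKUP_TABLE = constructLookUpTableInParallel();

    static CLASS classify(Object classifyMe) {
        return LOOKUP_TABLE[0]; // placeholder index; a real version maps classifyMe to a table index
    }

    private static CLASS[] constructLookUpTableInParallel() {
        // built in parallel by calling SlowClassifier.classifySlowly for every possible
        // input (see the sketch after the next answer); a single call stands in here
        return new CLASS[] { SlowClassifier.classifySlowly(new Object()) };
    }
}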
"So my question is: Is there a way to precompute this lookup table IN PARALLEL from within the static initalizer block?"
Yes, it's practically certain there is a way. Just allocate the array and launch a Runnable for each array element. Give each Runnable a reference to the array and the index it is computing, have it do the computation without locking, and then lock when assigning the result to the array element.
Note/disclaimer: this answer is based on the rather incomplete information given in the question...
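With that caveat, here is a rough sketch of the idea; classifySlowly and the enumeration of all possible inputs stand in for the question's own, and instead of locking on each assignment this version gives each task its own slot and waits on the Futures:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

static CLASS[] constructLookUpTableInParallel(Object[] allPossibleInputs) throws Exception {
    CLASS[] table = new CLASS[allPossibleInputs.length];
    ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    List<Future<?>> pending = new ArrayList<>();
    for (int i = 0; i < allPossibleInputs.length; i++) {
        final int index = i; // each task owns exactly one slot, so the writes never collide
        pending.add(pool.submit(() -> {
            table[index] = classifySlowly(allPossibleInputs[index]);
        }));
    }
    for (Future<?> f : pending) {
        f.get(); // waiting on each task also makes its write to the table visible here
    }
    pool.shutdown();
    return table;
}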