There's some domain knowledge/business logic baked into the problem I'm trying to solve but I'll try to boil it down to the basics as much as possible.
Say I have an interface defined as follows:
public interface Stage<I, O> {
StageResult<O> process(StageResult<I> input) throws StageException;
}
This represents a stage in a multi-stage data processing pipeline; my idea is to break the data processing into sequential (non-branching) independent steps (such as read from file, parse network headers, parse message payloads, convert format, write to file) represented by individual Stage implementations. Ideally I'd implement a FileInputStage, a NetworkHeaderParseStage, a ParseMessageStage, a FormatStage, and a FileOutputStage, then have some sort of
Stage<A, C> compose(Stage<A, B> stage1, Stage<B, C> stage2);
method such that I can eventually compose a bunch of stages into a final stage that looks like FileInput -> FileOutput.
Is this something (specifically the compose method, or a similar mechanism for aggregating many stages into one stage) even supported by the Java type system? I'm hacking away at it now and I'm ending up in a very ugly place involving reflection and lots of unchecked generic types.
Am I heading off in the wrong direction or is this even a reasonable thing to try to do in Java? Thanks so much in advance!
You didn't post enough implementation details to show where the type-safety issues are, but here is my take on how you could address the problem:
First, don't make the whole thing too generic; make your stages specific regarding their inputs and outputs.
Then create a composite stage which implements Stage and combines two stages into one final result.
Here is a very simple implementation:
public class StageComposit<A, B, C> implements Stage<A, C> {
final Stage<A, B> stage1;
final Stage<B, C> stage2;
public StageComposit(Stage<A, B> stage1, Stage<B, C> stage2) {
this.stage1 = stage1;
this.stage2 = stage2;
}
@Override
public StageResult<C> process(StageResult<A> input) {
return stage2.process(stage1.process(input));
}
}
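If you still want the compose method from your question, it can be a thin wrapper around this composite. A minimal sketch (the Stages holder class is just an illustrative name; it assumes the Stage and StageComposit types shown here):
public final class Stages {
    private Stages() { }
    // Chains two stages by delegating to the composite above.
    public static <A, B, C> Stage<A, C> compose(Stage<A, B> stage1, Stage<B, C> stage2) {
        return new StageComposit<A, B, C>(stage1, stage2);
    }
}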
The StageResult class:
public class StageResult<O> {
final O result;
public StageResult(O result) {
this.result = result;
}
public O get() {
return result;
}
}
Example specific Stages:
public class EpochInputStage implements Stage<Long, Date> {
@Override
public StageResult<Date> process(StageResult<Long> input) {
return new StageResult<Date>(new Date(input.get()));
}
}
public class DateFormatStage implements Stage<Date, String> {
@Override
public StageResult<String> process(StageResult<Date> input) {
return new StageResult<String>(
new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
.format(input.get()));
}
}
public class InputSplitStage implements Stage<String, List<String>> {
@Override
public StageResult<List<String>> process(StageResult<String> input) {
return new StageResult<List<String>>(
Arrays.asList(input.get().split("[-:\\s]")));
}
}
And finally a small test demonstrating how to combine them all:
public class StageTest {
@Test
public void process() {
EpochInputStage efis = new EpochInputStage();
DateFormatStage dfs = new DateFormatStage();
InputSplitStage iss = new InputSplitStage();
Stage<Long, String> sc1 =
new StageComposit<Long, Date, String>(efis, dfs);
Stage<Long, List<String>> sc2 =
new StageComposit<Long, String, List<String>>(sc1, iss);
StageResult<List<String>> result =
sc2.process(new StageResult<Long>(System.currentTimeMillis()));
System.out.print(result.get());
}
}
The output for the current time would be a list of strings:
[2015, 06, 24, 16, 27, 55]
As you can see, there are no type-safety issues or any type casts. When you need to handle other types of inputs and outputs, or convert them to suit the next stage, just write a new Stage and hook it up in your stage-processing chain.
You may want to consider using a composite pattern or a decorator pattern. For the decorator, each stage wraps or decorates the previous stage. To do this, have each stage implement the interface as you are doing, and additionally allow a stage to contain another stage.
The process() method does not need to accept a StageResult parameter anymore since it can call the contained Stage's process() method itself, get the StageResult and perform its own processing, returning another StageResult.
One advantage is that you can restructure your pipeline at run time.
Each Stage that may contain another can extend ComposableStage, and each stage that is an end point of the process can extend LeafStage. Note that I just used those terms to name the classes by function; you can create more imaginative names.
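A minimal sketch of that idea (this is my own reading; the parameterless interface and the ComposableStage/LeafStage names are illustrative, reusing the StageResult and StageException types from the question):
public interface PullStage<O> {
    // Produces this stage's result, pulling whatever it needs from the stage it wraps.
    StageResult<O> process() throws StageException;
}

// A stage that decorates another stage: it asks the wrapped stage for its result first.
public abstract class ComposableStage<I, O> implements PullStage<O> {
    private final PullStage<I> wrapped;

    protected ComposableStage(PullStage<I> wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public final StageResult<O> process() throws StageException {
        return transform(wrapped.process());
    }

    // Each concrete stage implements only its own transformation step.
    protected abstract StageResult<O> transform(StageResult<I> input) throws StageException;
}

// A stage at the start of the pipeline that produces its result on its own.
public abstract class LeafStage<O> implements PullStage<O> { }
Restructuring the pipeline at run time then amounts to wrapping a LeafStage in a different chain of ComposableStage instances.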
I want to use Stream API to keep track of a variable while changing it with functions.
My code:
public String encoder(String texteClair) {
for (Crypteur crypteur : algo) {
texteClair = crypteur.encoder(texteClair);
}
return texteClair;
}
I have a list of classes that have methods and I want to put a variable inside all of them, like done in the code above.
It works perfectly, but I was wondering how it could be done with streams?
Could we use reduce()?
Use an AtomicReference, which is effectively final, but its wrapped value may change:
public String encoder(String texteClair) {
AtomicReference<String> ref = new AtomicReference<>(texteClair);
algo.stream().forEach(c -> ref.updateAndGet(c::encoder)); // credit Ole V.V
return ref.get();
}
Could we use reduce()?
I guess we could. But keep in mind that it's not the best case to use streams.
Because you've mentioned "classes" in plural, I assume that Crypteur is either an abstract class or an interface. As a general rule you should favor interfaces over abstract classes, so I'll assume that Crypteur is an interface (if it's not, that's not a big issue) and that it has at least one implementation similar to this:
public interface Encoder {
String encoder(String str);
}
public class Crypteur implements Encoder {
private UnaryOperator<String> operator;
public Crypteur(UnaryOperator<String> operator) {
this.operator = operator;
}
@Override
public String encoder(String str) {
return operator.apply(str);
}
}
Then you can utilize your encoders with a stream like this:
public static void main(String[] args) {
List<Crypteur> algo =
List.of(new Crypteur(str -> str.replaceAll("\\p{Punct}|\\p{Space}", "")),
new Crypteur(str -> str.toUpperCase(Locale.ROOT)),
new Crypteur(str -> str.replace('A', 'W')));
String result = encode(algo, "Every piece of knowledge must have a single, unambiguous, authoritative representation within a system");
System.out.println(result);
}
public static String encode(Collection<Crypteur> algo, String str) {
return algo.stream()
.reduce(str,
(String result, Crypteur encoder) -> encoder.encoder(result),
(result1, result2) -> { throw new UnsupportedOperationException(); });
}
Note that the combiner, which is used in parallel streams to combine partial results, deliberately throws an exception to indicate that this task isn't parallelizable. All transformations must be applied sequentially; we can't, for instance, apply some encoders to the given string, apply the rest of them separately to the given string, and then merge the two results.
Output
EVERYPIECEOFKNOWLEDGEMUSTHWVEWSINGLEUNWMBIGUOUSWUTHORITWTIVEREPRESENTWTIONWITHINWSYSTEM
Does the Java Iterator interface require us to return a new object on each call to the next() method? I went through the documentation and there is no obligation to return a new object per call, but this causes many ambiguities. It seems that the Hadoop MapReduce framework breaks some undocumented rule, which causes many problems in my simple program (including when using Java 8 Streams). It returns the same object with different content each time I call next() on the Iterator (although this is contrary to my expectation, it seems it does not break the documented rule of the Iterator interface). I want to know why this happens: is it MapReduce's fault, or is it Java's fault for not documenting that Iterator must return a new instance on every call to next()?
For the sake of simplicity, and to show what is happening in Hadoop MapReduce, I wrote my own Iterator that is similar to what MapReduce does, so you can understand what I'm getting at (it is not a flawless program and might have many problems, but please focus on the concept I'm trying to show).
Imagine I have the following Hospital Entity:
@Getter
@Setter
@AllArgsConstructor
@ToString
public class Hospital {
private AREA area;
private int patients;
public Hospital(AREA area, int patients) {
this.area = area;
this.patients = patients;
}
public Hospital() {
}
}
For which I have written the following MyCustomHospitalIterable:
public class MyCustomHospitalIterable implements Iterable<Hospital> {
private List<Hospital> internalList;
private CustomHospitalIteration customIteration = new CustomHospitalIteration();
public MyCustomHospitalIterable(List<Hospital> internalList) {
this.internalList = internalList;
}
@Override
public Iterator<Hospital> iterator() {
return customIteration;
}
public class CustomHospitalIteration implements Iterator<Hospital> {
private int currentIndex = 0;
private Hospital currentHospital = new Hospital();
@Override
public boolean hasNext() {
if (MyCustomHospitalIterable.this.internalList.size() - 1 > currentIndex) {
currentIndex++;
return true;
}
return false;
}
@Override
public Hospital next() {
Hospital hospital =
MyCustomHospitalIterable.this.internalList.get(currentIndex);
currentHospital.setArea(hospital.getArea());
currentHospital.setPatients(hospital.getPatients());
return currentHospital;
}
}
}
Here, instead of returning a new object on each next() call, I return the same object with different content. You might ask what the advantage of doing this is: it matters in MapReduce because, with big data, they don't want to create a new object for every element for performance reasons. Does this break any documented rule of the Iterator interface?
Now let's see some consequences of having implemented Iterable that way:
Consider the following simple program:
public static void main(String[] args) {
List<Hospital> hospitalArray = Arrays.asList(
new Hospital(AREA.AREA1, 10),
new Hospital(AREA.AREA2, 20),
new Hospital(AREA.AREA3, 30),
new Hospital(AREA.AREA1, 40));
MyCustomHospitalIterable hospitalIterable = new MyCustomHospitalIterable(hospitalArray);
List<Hospital> hospitalList = new LinkedList<>();
Iterator<Hospital> hospitalIter = hospitalIterable.iterator();
while (hospitalIter.hasNext()) {
Hospital hospital = hospitalIter.next();
System.out.println(hospital);
hospitalList.add(hospital);
}
System.out.println("---------------------");
System.out.println(hospitalList);
}
It is quite illogical and counterintuitive that the output of the program is as follows:
Hospital{area=AREA2, patients=20}
Hospital{area=AREA3, patients=30}
Hospital{area=AREA1, patients=40}
---------------------
[Hospital{area=AREA1, patients=40}, Hospital{area=AREA1, patients=40}, Hospital{area=AREA1, patients=40}]
And to make it worse, imagine what happens when we are working with Streams in Java. What would be the output of the following program?
public static void main(String[] args) {
List<Hospital> hospitalArray = Arrays.asList(
new Hospital(AREA.AREA1, 10),
new Hospital(AREA.AREA2, 20),
new Hospital(AREA.AREA3, 30),
new Hospital(AREA.AREA1, 40));
MyCustomHospitalIterable hospitalIterable = new MyCustomHospitalIterable(hospitalArray);
Map<AREA, Integer> sortedHospital =
StreamSupport.stream(hospitalIterable.spliterator(), false)
.collect(Collectors.groupingBy(
Hospital::getArea, Collectors.summingInt(Hospital::getPatients)));
System.out.println(sortedHospital);
}
It depends on whether we use a parallel Stream or a sequential one:
With the sequential one, the output is as follows:
{AREA2=20, AREA1=40, AREA3=30}
and with the parallel one it is:
{AREA1=120}
As a user, I want to use interfaces as they are and not have to worry about their implementations.
The problem is that here I know how MyCustomHospitalIterable is implemented, but in Hadoop MapReduce I have to implement a method like the one below, and I have no idea where the Iterable<IntWritable> comes from or what its implementation is. I just want to use it as a pure Iterable, but as I showed above it does not work as expected:
public void reduce(Text key, Iterable<IntWritable> values, Context context
) throws IOException, InterruptedException {
List<IntWritable> list = new LinkedList<>();
Iterator<IntWritable> iter = values.iterator();
while (iter.hasNext()) {
IntWritable count = iter.next();
System.out.println(count);
list.add(count);
}
System.out.println("---------------------");
System.out.println(list);
}
Here are my questions:
Why is my simple program broken?
Is it MapReduce's fault for not following the undocumented, conventional behavior of Iterable and Iterator (or is there documentation for this behavior that I haven't noticed)?
Or is it Java's fault for not documenting that Iterable and Iterator must return a new object on each call to next()?
Or is it my fault as a programmer?
It is very unusual for an Iterable to return the same mutable object with different content. I did not find anything about this in the Java language reference, though I have not searched much. It is simply too error-prone to be correct usage.
Your mention of other tools, like Streams, is apt.
Also, the upcoming Java record type is intended exactly for such tuple-like usage, of course as multiple immutable objects. "Your" Iterable suffers from not being usable with collections, unless one does a .next().clone() or similar.
This weakness of the Iterable is in the same category as using a mutable object as a Map key. It is dead wrong.
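If you cannot change such an Iterable, one defensive workaround (a sketch against the Hospital class from the question) is to copy each element before storing it:
List<Hospital> hospitalList = new LinkedList<>();
Iterator<Hospital> hospitalIter = hospitalIterable.iterator();
while (hospitalIter.hasNext()) {
    Hospital shared = hospitalIter.next();
    // Copy the current contents before the iterator overwrites the shared instance again.
    hospitalList.add(new Hospital(shared.getArea(), shared.getPatients()));
}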
I have a controller method that gets data from the request and, based on the subject variable from the request, decides which function to call. (For project reasons I cannot use a separate controller method for each subject value.)
For now I use a switch, but I think it breaks the Open/Closed Principle (because every time a new type of subject is added I have to add a new case to the switch) and is not a good design. How can I refactor this code?
Subject subject = ... //(type of enum)
JSONObject data = request.getData("data");
switch(subject) {
case SEND_VERIFY:
send_foo1(data.getString("foo1_1"), data.getString("foo1_2"));
break;
case do_foo2:
foo2(data.getInt("foo2_b"), data.getInt("foo2_cc"));
break;
case do_foo3:
do_foo3_for(data.getString("foo3"));
break;
// some more cases
}
While I am not sure which OO principle this snippet violates, there is indeed a more robust way to achieve the logic: tie the processing for each enum value to the enum class.
You will need to generalize the processing into an interface:
public interface SubjectProcessor
{
void process(JSONObject data);
}
and create concrete implementations for each enum value:
public class SendVerifySubjectProcessor implements SubjectProcessor
{
@Override
public void process(JSONObject data) {
String foo1 = data.getString("foo1_1");
String foo2 = data.getString("foo1_2");
...
}
}
Once you have that class hierarchy, you can associate each enum value with a concrete processor:
public enum Subject
{
SEND_VERIFY(new SendVerifySubjectProcessor()),
do_foo2(new Foo2SubjectProcessor()),
...
private SubjectProcessor processor;
Subject(SubjectProcessor processor) {
this.processor = processor;
}
public void process(JSONObject data) {
this.processor.process(data);
}
}
This eliminates the need for the switch statement in the controller:
Subject subject = ... //(type of enum)
JSONObject data = request.getData("data");
subject.process(data);
EDIT:
Following the good comment, you can use the java.util.function.Consumer functional interface instead of the custom SubjectProcessor one. You can decide whether to write concrete classes or use lambda expressions.
public class SendVerifySubjectProcessor implements Consumer<JSONObject>
{
@Override
public void accept(JSONObject data) {
String foo1 = data.getString("foo1_1");
String foo2 = data.getString("foo1_2");
...
}
}
OR
public enum Subject
{
SEND_VERIFY(data -> {
String foo1 = data.getString("foo1_1");
String foo2 = data.getString("foo1_2");
...
}),
...
private Consumer<JSONObject> processor;
Subject(Consumer<JSONObject> processor) {
this.processor = processor;
}
public void process(JSONObject data) {
this.processor.accept(data);
}
}
// SubjectsMapping.java
Map<Subject, Consumer<JSONObject>> tasks = new HashMap<>();
tasks.put(SEND_VERIFY,
data -> send_foo1(data.getString("foo1_1"), data.getString("foo1_2")));
tasks.put(do_foo2,
data -> foo2(data.getInt("foo2_b"), data.getInt("foo2_cc")));
tasks.put(do_foo3, data -> do_foo3_for(data.getString("foo3")));
// In your controller class where currently `switch` code written
if (tasks.containsKey(subject)) {
tasks.get(subject).accept(data);
} else {
throw new IllegalArgumentException("No suitable task");
}
You can maintain the Map<Subject, Consumer<JSONObject>> tasks configuration in a separate class rather than mixing it with the if (tasks.containsKey(subject)) code. When you need another feature, you configure one more entry in this map.
The other answers seem great; as an addition, I would suggest using EnumMap for storing enums as keys, as it might be more efficient than a standard Map. I think it's also worth mentioning that the Strategy pattern is used here to call a specific action for each key of the Map without building long switch statements.
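For example, the task map from the previous answer could be declared as an EnumMap (a small sketch reusing the same Subject values and handler lambdas):
// EnumMap is backed by an array indexed on the enum's ordinal, so lookups avoid hashing.
Map<Subject, Consumer<JSONObject>> tasks = new EnumMap<>(Subject.class);
tasks.put(Subject.SEND_VERIFY,
    data -> send_foo1(data.getString("foo1_1"), data.getString("foo1_2")));
tasks.put(Subject.do_foo2,
    data -> foo2(data.getInt("foo2_b"), data.getInt("foo2_cc")));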
I'm kind of confused how to use MergeHub.
I'm designing a flow graph that uses Flow.mapAsync(), where the given function creates another flow graph, and then runs it with Sink.ignore(), and returns that CompletionStage as the value for Flow.mapAsync() to wait for. The nested flow will return elements via the Sink returned from materializing the MergeHub.
The issue is that I need to provide the Function which starts the nested flow to Flow.mapAsync() when I'm creating the top-level flow graph, but that requires it to have access to the materialized value returned from materializing the result of MergeHub.of(). How do I get that materialized value before starting the flow graph?
The only way I can see right now is to implement the Function to block until the Sink has been provided (after starting the top-level flow graph), but that seems pretty hacky.
So, something like
class MapAsyncFunctor implements Function<T, CompletionStage<Done>> {...}
MapAsyncFunctor mapAsyncFunctor = new MapAsyncFunctor();
RunnableGraph<Sink<T>> graph = createGraph(mapAsyncFunctor);
Sink<T> sink = materializer.materialize(graph);
mapAsyncFunctor.setSink(sink); // Graph execution blocked in background in call to mapAsyncFunctor.apply() until this is done
Edit: I've created the following class
public final class Channel<T>
{
private final Sink<T, NotUsed> m_channelIn;
private final Source<T, NotUsed> m_channelOut;
private final UniqueKillSwitch m_killSwitch;
public Channel(Class<T> in_class, Materializer in_materializer)
{
final Source<T, Sink<T, NotUsed>> source = MergeHub.of(in_class);
final Sink<T, Source<T, NotUsed>> sink = BroadcastHub.of(in_class);
final Pair<Pair<Sink<T, NotUsed>, UniqueKillSwitch>, Source<T, NotUsed>> matVals = in_materializer.materialize(source.viaMat(KillSwitches.single(), Keep.both()).toMat(sink, Keep.both()));
m_channelIn = matVals.first().first();
m_channelOut = matVals.second();
m_killSwitch = matVals.first().second();
}
public Sink<T, NotUsed> in()
{
return m_channelIn;
}
public Source<T, NotUsed> out()
{
return m_channelOut;
}
public void close()
{
m_killSwitch.shutdown();
}
}
so that I can get a Source/Sink pair to use in building the graph. Is this a good idea? Will I 'leak' these channels if I don't explicitly close() them?
I'll only ever need to use .out() once for my use-case.
With MergeHub you always need to materialize the hub sink before doing anything else.
Sink<T, NotUsed> toConsumer = MergeHub.of(String.class, 16).to(consumer).run(materializer);
You can then distribute it to all the bits of code that need to materialize streams into it to send it data. Following your snippet above, a possible approach might be passing the Sink to your functor at construction time:
class MapAsyncFunctor implements Function<T, CompletionStage<Done>> {
private Sink<T, NotUsed> sink;
public MapAsyncFunctor(Sink<T, NotUsed> sink) {
this.sink = sink;
}
@Override
public CompletionStage<Done> apply(T t) { /* run substream into sink */ }
}
MapAsyncFunctor mapAsyncFunctor = new MapAsyncFunctor(toConsumer);
// run your flow with mapAsync on the above functor
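For what it's worth, a rough sketch of what the apply() body might look like under these assumptions (the materializer field and the buildNestedSource() helper are hypothetical, not part of the original snippet):
@Override
public CompletionStage<Done> apply(T t) {
    // Build the nested stream for this element, mirror its elements into the hub sink,
    // and hand mapAsync the CompletionStage from Sink.ignore() to wait on.
    return buildNestedSource(t)   // hypothetical helper returning a Source<T, NotUsed>
        .alsoTo(sink)             // feed every element into the MergeHub sink
        .runWith(Sink.ignore(), materializer);
}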
More info on MergeHub can be found in the docs.
I am going to develop a web crawler in Java to capture hotel room prices from hotel websites.
In this case I want to capture room price with the room type and the meal type, so my algorithm should be intelligent to handle that.
For example:
Room type: Deluxe
Meal type: HalfBoard
price : $20.00
The main problem is that room prices can be presented in different ways on different hotel sites, so my algorithm should be independent of any particular hotel site.
I plan to use the above room types and meal types as fuzzy sets and compare the words in a web page against those fuzzy sets using a suitable membership function.
Is anyone experienced with this, or does anyone have an idea for my problem?
There are two ways to approach this problem:
You can customize your crawler to understand the formats used by different Websites; or
You can come up with a general ("fuzzy") solution.
(1) will, by far, be the easiest. Ideally you want to create some tools that make this easier so you can create a filter for any new site in minimal time. IMHO your time will be best spent with this approach.
(2) has lots of problems. Firstly it will be unreliable. You will come across formats you don't understand or (worse) get wrong. Second, it will require a substantial amount of development to get something working. This is the sort of thing you use when you're dealing with thousands or millions of sites.
With hundreds of sites you will get better and more predictable results with (1).
As with all problems, good design lets you deliver value and adapt to situations you haven't considered much more quickly than a general solution does.
Start by writing something that parses the data from one provider - the one with the simplest format to handle. Find a way to adapt that handler into your crawler. Be sure to encapsulate construction - you should always do this anyway...
public class RoomTypeExtractor
{
private RoomTypeExtractor() { }
public static RoomTypeExtractor GetInstance()
{
return new RoomTypeExtractor();
}
public String GetRoomType(String content)
{
// BEHAVIOR #1
}
}
The GetInstance() method lets you promote to a Strategy pattern practically for free.
Then add your second provider type. Say, for instance, that you have a slightly more complex data format which is a little more prevalent than the first format. Start by refactoring what was your concrete room type extractor class into an abstraction with a single variation behind it and have the GetInstance() method return an instance of the concrete type:
public abstract class RoomTypeExtractor
{
public static RoomTypeExtractor GetInstance()
{
return SimpleRoomTypeExtractor.GetInstance();
}
public abstract String GetRoomType(String content);
}
public final class SimpleRoomTypeExtractor extends RoomTypeExtractor
{
private SimpleRoomTypeExtractor() { }
public static SimpleRoomTypeExtractor GetInstance()
{
return new SimpleRoomTypeExtractor();
}
public String GetRoomType(String content)
{
// BEHAVIOR #1
}
}
Create another variation that implements the Null Object pattern...
public class NullRoomTypeExtractor extends RoomTypeExtractor
{
private NullRoomTypeExtractor() { }
public static NullRoomTypeExtractor GetInstance()
{
return new NullRoomTypeExtractor();
}
public String GetRoomType(String content)
{
// whatever "no content" behavior you want... I chose returning null
return null;
}
}
Add a base class that will make it easier to work with the Chain of Responsibility pattern that is in this problem:
public abstract class ChainLinkRoomTypeExtractor extends RoomTypeExtractor
{
private final RoomTypeExtractor next_;
protected ChainLinkRoomTypeExtractor(RoomTypeExtractor next)
{
next_ = next;
}
public final String GetRoomType(String content)
{
if (CanHandleContent(content))
{
return GetRoomTypeFromUnderstoodFormat(content);
}
else
{
return next_.GetRoomType(content);
}
}
protected abstract boolean CanHandleContent(String content);
protected abstract String GetRoomTypeFromUnderstoodFormat(String content);
}
Now, refactor the original implementation to have a base class that joins it into a Chain of Responsibility...
public final class SimpleRoomTypeExtractor extends ChainLinkRoomTypeExtractor
{
private SimpleRoomTypeExtractor(RoomTypeExtractor next)
{
super(next);
}
public static SimpleRoomTypeExtractor GetInstance(RoomTypeExtractor next)
{
return new SimpleRoomTypeExtractor(next);
}
protected boolean CanHandleContent(String content)
{
// return whether or not content contains the right format
}
protected String GetRoomTypeFromUnderstoodFormat(String content)
{
// BEHAVIOR #1
}
}
Be sure to update RoomTypeExtractor.GetInstance():
public static RoomTypeExtractor GetInstance()
{
RoomTypeExtractor extractor = NullRoomTypeExtractor.GetInstance();
extractor = SimpleRoomTypeExtractor.GetInstance(extractor);
return extractor;
}
Once that's done, create a new link for the Chain of Responsibility...
public final class MoreComplexRoomTypeExtractor extends ChainLinkRoomTypeExtractor
{
private MoreComplexRoomTypeExtractor(RoomTypeExtractor next)
{
super(next);
}
public static MoreComplexRoomTypeExtractor GetInstance(RoomTypeExtractor next)
{
return new MoreComplexRoomTypeExtractor(next);
}
protected boolean CanHandleContent(String content)
{
// Check for presence of format #2
}
protected String GetRoomTypeFromUnderstoodFormat(String content)
{
// BEHAVIOR #2
}
}
Finally, add the new link to the chain. If this is a more common format, you might want to give it higher priority by putting it higher in the chain (the real forces that govern the order of the chain will become apparent when you do this):
public static RoomTypeExtractor GetInstance()
{
RoomTypeExtractor extractor = NullRoomTypeExtractor.GetInstance();
extractor = SimpleRoomTypeExtractor.GetInstance(extractor);
extractor = MoreComplexRoomTypeExtractor.GetInstance(extractor);
return extractor;
}
As time passes, you may want to add ways to dynamically add new links to the Chain of Responsibility, as pointed out by Cletus, but the fundamental principle here is Emergent Design. Start with high quality. Keep quality high. Drive with tests. Do those three things and you will be able to use the fuzzy logic engine between your ears to overcome almost any problem...
EDIT
Translated to Java. Hope I did that right; I'm a little rusty.
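As an addendum to the note above about dynamically adding new links to the Chain of Responsibility: one possible, purely illustrative way is to fold a list of link factories (java.util.function.Function) around the Null Object terminator:
public final class RoomTypeExtractorChainBuilder
{
    private final List<Function<RoomTypeExtractor, RoomTypeExtractor>> links = new ArrayList<>();

    // Register a link factory, e.g. MoreComplexRoomTypeExtractor::GetInstance.
    public RoomTypeExtractorChainBuilder addLink(Function<RoomTypeExtractor, RoomTypeExtractor> linkFactory)
    {
        links.add(linkFactory);
        return this;
    }

    // Links added last wrap the chain built so far, so they get first crack at the content.
    public RoomTypeExtractor build()
    {
        RoomTypeExtractor extractor = NullRoomTypeExtractor.GetInstance();
        for (Function<RoomTypeExtractor, RoomTypeExtractor> linkFactory : links)
        {
            extractor = linkFactory.apply(extractor);
        }
        return extractor;
    }
}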