Word count number is always changing when using Flink - Java

I am trying to create a word count example with Flink. Here is the link for the words data (this is the example from Flink's GitHub account).
When I count the words with a simple Java program:
public static void main(String[] args) throws Exception {
int count = 0;
for (String eachSentence : WordCountData.WORDS){
String[] splittedSentence = eachSentence.toLowerCase().split("\\W+");
for (String eachWord: splittedSentence){
count++;
}
}
System.out.println(count);
// result is 287
}
Now, when I do this with Flink, I first split the sentences into words.
DataStream<Tuple2<String, Integer>> readWordByWordStream = splitSentenceWordByWord(wordCountDataSource);
//...
public DataStream<Tuple2<String, Integer>> splitSentenceWordByWord(DataStream<String> wordDataSourceStream)
{
DataStream<Tuple2<String, Integer>> wordByWordStream = wordDataSourceStream.flatMap(new TempTransformation());
return wordByWordStream;
}
Here is my TempTransformation class:
public class TempTransformation extends RichFlatMapFunction<String, Tuple2<String, Integer>> {
@Override
public void flatMap(String input, Collector<Tuple2<String, Integer>> collector) throws Exception
{
String[] splittedSentence = input.toLowerCase().split("\\W+");
for (String eachWord : splittedSentence)
{
collector.collect(new Tuple2<String, Integer>(eachWord, 1));
}
}
}
Now I am going to count the words by converting the stream to a KeyedStream (keyed by word):
public SingleOutputStreamOperator<String> keyedStreamExample(DataStream<Tuple2<String, Integer>> wordByWordStream)
{
return wordByWordStream.keyBy(0).timeWindow(Time.milliseconds(1)).apply(new TempWindowFunction());
}
TempWindowFunction():
public class TempWindowFunction extends RichWindowFunction<Tuple2<String, Integer>, String, Tuple, TimeWindow> {
private Logger logger = LoggerFactory.getLogger(TempWindowFunction.class);
private int count = 0;
@Override
public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple2<String, Integer>> input, Collector<String> out) throws Exception
{
logger.info("Key is:' {} ' and collected element for that key and count: {}", (Object) tuple.getField(0), count);
StringBuilder builder = new StringBuilder();
for (Tuple2 each : input)
{
String key = (String) each.getField(0);
Integer value = (Integer) each.getField(1);
String tupleStr = "[ " + key + " , " + value + "]";
builder.append(tupleStr);
count ++;
}
logger.info("All tuples {}", builder.toString());
logger.info("Exit method");
logger.info("----");
}
}
After running this job with Flink's local environment, the outputs keep changing; here are a few samples:
18:09:40,086 INFO com.sampleFlinkProject.transformations.TempWindowFunction - Key is:' rub ' and collected element for that key and count: 86
18:09:40,086 INFO TempWindowFunction - All tuples [ rub , 1]
18:09:40,086 INFO TempWindowFunction - Exit method
18:09:40,086 INFO TempWindowFunction - ----
18:09:40,086 INFO TempWindowFunction - Key is:' for ' and collected element for that key and count: 87
18:09:40,086 INFO TempWindowFunction - All tuples [ for , 1]
18:09:40,086 INFO TempWindowFunction - Exit method
18:09:40,086 INFO TempWindowFunction - ----
// another running outputs:
18:36:21,660 INFO TempWindowFunction - Key is:' for ' and collected element for that key and count: 103
18:36:21,660 INFO TempWindowFunction - All tuples [ for , 1]
18:36:21,660 INFO TempWindowFunction - Exit method
18:36:21,660 INFO TempWindowFunction - ----
18:36:21,662 INFO TempWindowFunction - Key is:' coil ' and collected element for that key and count: 104
18:36:21,662 INFO TempWindowFunction - All tuples [ coil , 1]
18:36:21,662 INFO TempWindowFunction - Exit method
18:36:21,662 INFO TempWindowFunction - ----
Lastly, here is the execution setup
//...
final StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
env.setParallelism(1);
//...
Why is Flink giving different outputs for each execution?

One source of non-determinism in your application is the processing time windows (which are 1 ms long). Whenever you use processing time for windowing, then the windows end up containing whatever events happen to show up and get processed during the time interval. (Event time windows do behave deterministically, since they are based on timestamps in the events.) Having the windows be so short is going to exaggerate this effect.
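For reference, here is a minimal sketch (my own assumption, not part of the original answer) of the same pipeline with the 1 ms processing-time window removed. With keyBy plus sum nothing depends on wall-clock time, so the stream of running per-word counts is identical on every run (with parallelism 1). It reuses the TempTransformation and WordCountData classes from the question.
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DeterministicWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
        env.setParallelism(1);

        // WordCountData.WORDS is the sample data referenced in the question
        DataStream<String> lines = env.fromElements(WordCountData.WORDS);

        DataStream<Tuple2<String, Integer>> counts = lines
                .flatMap(new TempTransformation()) // the splitter from the question
                .keyBy(value -> value.f0)          // key by the word itself
                .sum(1);                           // running total per word, no window

        counts.print();
        env.execute("deterministic word count");
    }
}
If you do need windows, assigning event-time timestamps and using event-time windows (as noted above) also gives reproducible results, because window assignment then depends only on the data, not on processing speed.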

Related

Apache Flink batch-mode FileSink to S3 can't finish in JetBrains IDEA

What we are trying to do: we are evaluating Flink to perform batch processing using DataStream API in BATCH mode.
Minimal application to reproduce the issue:
FileSystem.initialize(GlobalConfiguration.loadConfiguration(System.getenv("FLINK_CONF_DIR")))
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setRuntimeMode(RuntimeExecutionMode.BATCH)
val inputStream = env.fromSource(
FileSource.forRecordStreamFormat(new TextLineFormat(), new Path("s3://testtest/2022/04/12/")).build(), WatermarkStrategy.noWatermarks()
.withTimestampAssigner(new SerializableTimestampAssigner[String]() {
override def extractTimestamp(element: String, recordTimestamp: Long): Long = -1
}), "MySourceName"
)
.map(str => {
val jsonNode = JsonUtil.getJSON(str)
val log = JsonUtil.getJSONString(jsonNode, "log")
if (StringUtils.isNotBlank(log)) {
log
} else {
""
}
})
.filter(StringUtils.isNotBlank(_))
val sink: FileSink[BaseLocation] = FileSink
// .forBulkFormat(new Path("/Users/temp/flinksave"), AvroWriters.forSpecificRecord(classOf[BaseLocation]))
.forBulkFormat(new Path("s3://testtest/avro"), AvroWriters.forSpecificRecord(classOf[BaseLocation]))
.withRollingPolicy(OnCheckpointRollingPolicy.build())
.withOutputFileConfig(config)
.build()
inputStream.map(data => {
val baseLocation = new BaseLocation()
baseLocation.setRegion(data)
baseLocation
}).sinkTo(sink)
inputStream.print("input:")
env.execute()
Flink version: 1.14.2
The program executes normally when the path is local.
The program does not give an error when the path is changed to s3://, but I do not see any files being written to S3 either.
This problem does not exist in stand-alone mode, only in the local development environment (JetBrains IDEA). Is it because I am missing some configuration? I have already configured flink-config.yaml like this:
s3.access-key: test
s3.secret-key: test
s3.endpoint: http://127.0.0.1:39000
Log:
18:42:25,524 INFO org.apache.flink.connector.base.source.reader.SourceReaderBase [] - Finished reading split(s) [0000000002]
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.SourceReaderBase [] - Finished reading split(s) [0000000001]
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager [] - Closing splitFetcher 0 because it is idle.
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager [] - Closing splitFetcher 0 because it is idle.
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher [] - Shutting down split fetcher 0
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher [] - Shutting down split fetcher 0
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher [] - Split fetcher 0 exited.
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher [] - Split fetcher 0 exited.
18:42:25,525 INFO org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] - Subtask 11 (on host '') is requesting a file source split
18:42:25,525 INFO org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] - No more splits available for subtask 11
18:42:25,525 INFO org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] - Subtask 8 (on host '') is requesting a file source split
18:42:25,525 INFO org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] - No more splits available for subtask 8
18:42:25,525 INFO org.apache.flink.connector.base.source.reader.SourceReaderBase [] - Reader received NoMoreSplits event.
18:42:25,526 INFO org.apache.flink.connector.base.source.reader.SourceReaderBase [] - Reader received NoMoreSplits event.

Iterate over different columns using withColumn in Java Spark

I have to modify a Dataset<Row> according to some rules that are in a List<Row>.
I want to iterate over the Dataset<Row> columns using Dataset.withColumn(...) as seen in the next example:
(import necessary libraries...)
SparkSession spark = SparkSession
.builder()
.appName("appname")
.config("spark.some.config.option", "some-value")
.getOrCreate();
Dataset<Row> dfToModify = spark.read().table("TableToModify");
List<Row> ListWithInfo = new ArrayList<>();
ListWithInfo.add(0,RowFactory.create("field1", "input1", "output1", "conditionAux1"));
ListWithInfo.add(1,RowFactory.create("field1", "input1", "output1", "conditionAux2"));
ListWithInfo.add(2,RowFactory.create("field1", "input2", "output3", "conditionAux3"));
ListWithInfo.add(3,RowFactory.create("field2", "input3", "output4", "conditionAux4"));
.
.
.
for (Row row : ListWithInfo) {
String field = row.getString(0);
String input = row.getString(1);
String output = row.getString(2);
String conditionAux = row.getString(3);
dfToModify = dfToModify.withColumn(field,
when(dfToModify.col(field).equalTo(input)
.and(dfToModify.col("conditionAuxField").equalTo(conditionAux))
,output)
.otherwise(dfToModify.col(field)));
}
The code does work as it should, but when there are more than 50 "rules" in the List, the program doesn't finish and this output is shown on the screen:
20/01/27 17:48:18 INFO spark.ContextCleaner: Cleaned accumulator 1653
20/01/27 17:48:18 INFO spark.ContextCleaner: Cleaned accumulator 1650
20/01/27 17:48:18 INFO spark.ContextCleaner: Cleaned accumulator 1635
20/01/27 17:48:18 INFO spark.ContextCleaner: Cleaned accumulator 1641
20/01/27 17:48:18 INFO spark.ContextCleaner: Cleaned accumulator 1645
20/01/27 17:48:18 INFO spark.ContextCleaner: Cleaned accumulator 1646
20/01/27 17:48:18 INFO storage.BlockManagerInfo: Removed broadcast_113_piece0 on **************** in memory (size: 14.5 KB, free: 3.0 GB)
20/01/27 17:48:18 INFO storage.BlockManagerInfo: Removed broadcast_113_piece0 on ***************** in memory (size: 14.5 KB, free: 3.0 GB)
20/01/27 17:48:18 INFO spark.ContextCleaner: Cleaned accumulator 1639
20/01/27 17:48:18 INFO spark.ContextCleaner: Cleaned accumulator 1649
20/01/27 17:48:18 INFO spark.ContextCleaner: Cleaned accumulator 1651
20/01/27 17:49:18 INFO spark.ExecutorAllocationManager: Request to remove executorIds: 6
20/01/27 17:49:18 INFO cluster.YarnClientSchedulerBackend: Requesting to kill executor(s) 6
20/01/27 17:49:18 INFO cluster.YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 6
20/01/27 17:49:18 INFO spark.ExecutorAllocationManager: Removing executor 6 because it has been idle for 60 seconds (new desired total will be 0)
20/01/27 17:49:19 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
20/01/27 17:49:19 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 6.
20/01/27 17:49:19 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 0)
20/01/27 17:49:19 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
20/01/27 17:49:19 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
20/01/27 17:49:19 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, *********************, 43387, None)
20/01/27 17:49:19 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
20/01/27 17:49:19 INFO cluster.YarnScheduler: Executor 6 on **************** killed by driver.
20/01/27 17:49:19 INFO spark.ExecutorAllocationManager: Existing executor 6 has been removed (new total is 0)
20/01/27 17:49:20 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
20/01/27 17:49:21 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
20/01/27 17:49:22 INFO yarn.SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
.
.
.
.
Is there any way to make it more efficient using Java Spark (without using a for loop or something similar)?
Finally, I used the withColumns method of the Dataset<Row> object. This method needs two arguments:
.withColumns(Seq<String> ColumnsNames, Seq<Column> ColumnsValues);
and the column names in the Seq<String> cannot be duplicated.
The code is as follows:
SparkSession spark = SparkSession
.builder()
.appName("appname")
.config("spark.some.config.option", "some-value")
.getOrCreate();
Dataset<Row> dfToModify = spark.read().table("TableToModify");
List<Row> ListWithInfo = new ArrayList<>();
ListWithInfo.add(0,RowFactory.create("field1", "input1", "output1", "conditionAux1"));
ListWithInfo.add(1,RowFactory.create("field1", "input1", "output1", "conditionAux2"));
ListWithInfo.add(2,RowFactory.create("field1", "input2", "output3", "conditionAux3"));
ListWithInfo.add(3,RowFactory.create("field2", "input3", "output4", "conditionAux4"));
.
.
.
// initialize values for fields and conditions
String field_ant = ListWithInfo.get(0).getString(0).toLowerCase();
String first_input = ListWithInfo.get(0).getString(1);
String first_output = ListWithInfo.get(0).getString(2);
String first_conditionAux = ListWithInfo.get(0).getString(3);
Column whenColumn = when(dfToModify.col(field_ant).equalTo(first_input)
.and(dfToModify.col("conditionAuxField").equalTo(lit(first_conditionAux)))
,first_output);
// lists with the names of the fields and the conditions
List<Column> whenColumnList = new ArrayList<>();
List<String> fieldsNameList = new ArrayList<>();
for (Row row : ListWithInfo.subList(1,ListWithInfo.size())) {
String field = row.getString(0);
String input = row.getString(1);
String output = row.getString(2);
String conditionAux = row.getString(3);
if (field.equals(field_ant)) {
// if field equals field_ant, the new condition is added to the previous one
whenColumn = whenColumn.when(dfToModify.col(field).equalTo(input)
.and(dfToModify.col("conditionAuxField").equalTo(lit(conditionAux)))
,output);
} else {
// if field is different from the previous one:
// close the conditions for this field
whenColumn = whenColumn.otherwise(dfToModify.col(field_ant));
// add to the lists the field(String) and the conditions (columns)
whenColumnList.add(whenColumn);
fieldsNameList.add(field_ant);
// and initialize the conditions for the new field
whenColumn = when(dfToModify.col(field).equalTo(input)
.and(dfToModify.col("branchField").equalTo(lit(branch)))
,output);
}
field_ant = field;
}
// add last values
whenColumnList.add(whenColumn);
fieldsNameList.add(field_ant);
// transform list to Seq
Seq<Column> whenColumnSeq = JavaConversions.asScalaBuffer(whenColumnList).seq();
Seq<String> fieldsNameSeq = JavaConversions.asScalaBuffer(fieldsNameList).seq();
Dataset<Row> dfModified = dfToModify.withColumns(fieldsNameSeq, whenColumnSeq);

Combine Mono with Flux

I want to create a service that combines results from two reactive sources.
One produces a Mono and the other produces a Flux. For merging, I need the same Mono value for every element the Flux emits.
For now I have something like this:
Flux.zip(
service1.getConfig(), //produces flux
service2.getContext() //produces mono
.cache().repeat()
)
This gives me what I need:
service2 is called only once
the context is provided for every configuration
the resulting flux has as many elements as there are configurations
But I have noticed that repeat() emits a massive number of elements after the context is cached. Is this a problem?
Is there something I can do to limit the number of repeats to the number of received configurations, yet still make both requests simultaneously?
Or is this not an issue, and I can safely ignore those extra emitted elements?
I tried to use combineLatest, but depending on timing some configuration elements can get lost and not processed.
EDIT
Looking at the suggestions from @Ricard Kollcaku, I have created a sample test that shows why this is not what I'm looking for.
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;
import reactor.test.StepVerifier;
public class SampleTest
{
Logger LOG = LoggerFactory.getLogger(SampleTest.class);
AtomicLong counter = new AtomicLong(0);
Flux<String> getFlux()
{
return Flux.fromStream(() -> {
LOG.info("flux started");
sleep(1000);
return Stream.of("a", "b", "c");
}).subscribeOn(Schedulers.parallel());
}
Mono<String> getMono()
{
return Mono.defer(() -> {
counter.incrementAndGet();
LOG.info("mono started");
sleep(1000);
return Mono.just("mono");
}).subscribeOn(Schedulers.parallel());
}
private void sleep(final long milis)
{
try
{
Thread.sleep(milis);
}
catch (final InterruptedException e)
{
e.printStackTrace();
}
}
@Test
void test0()
{
final Flux<String> result = Flux.zip(
getFlux(),
getMono().cache().repeat()
.doOnNext(n -> LOG.warn("signal on mono", n)),
(s1, s2) -> s1 + " " + s2
);
assertResults(result);
}
@Test
void test1()
{
final Flux<String> result =
getFlux().flatMap(s -> Mono.zip(Mono.just(s), getMono(),
(s1, s2) -> s1 + " " + s2));
assertResults(result);
}
@Test
void test2()
{
final Flux<String> result = getFlux().flatMap(s -> getMono().map((s1 -> s + " " + s1)));
assertResults(result);
}
void assertResults(final Flux<String> result)
{
final Flux<String> flux = result;
StepVerifier.create(flux)
.expectNext("a mono")
.expectNext("b mono")
.expectNext("c mono")
.verifyComplete();
Assertions.assertEquals(1L, counter.get());
}
}
Looking at the test results for test1 and test2
2020-01-20 12:55:22.542 INFO [] [] [ parallel-3] SampleTest : flux started
2020-01-20 12:55:24.547 INFO [] [] [ parallel-4] SampleTest : mono started
2020-01-20 12:55:24.547 INFO [] [] [ parallel-5] SampleTest : mono started
2020-01-20 12:55:24.548 INFO [] [] [ parallel-6] SampleTest : mono started
expected: <1> but was: <3>
I need to reject your proposal. In both cases getMono is:
- invoked as many times as there are items in the flux
- invoked after the first element of the flux arrives
Those are exactly the interactions I want to avoid. My services make HTTP requests under the hood, and they may be time consuming.
My current solution does not have this problem, but if I add a logger to my zip I get this:
2020-01-20 12:55:20.505 INFO [] [] [ parallel-1] SampleTest : flux started
2020-01-20 12:55:20.508 INFO [] [] [ parallel-2] SampleTest : mono started
2020-01-20 12:55:21.523 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.528 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.529 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.529 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.529 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.529 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.530 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.530 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.530 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.530 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.531 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.531 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.531 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.531 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.531 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.532 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.532 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.532 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.532 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.533 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.533 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.533 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.533 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.533 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.533 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.533 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.534 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.534 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.534 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.534 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.534 WARN [] [] [ parallel-2] SampleTest : signal on mono
2020-01-20 12:55:21.535 WARN [] [] [ parallel-2] SampleTest : signal on mono
As you can see, a lot of elements are emitted by the combined cache().repeat(), and I want to know whether this is an issue and, if so, how to avoid it (while keeping a single invocation of the mono and parallel invocation of both sources).
I think what you are trying to achieve could be done with Flux.join. Here is some example code:
Flux<Integer> flux = Flux.concat(Mono.just(1).delayElement(Duration.ofMillis(100)),
Mono.just(2).delayElement(Duration.ofMillis(500))).log();
Mono<String> mono = Mono.just("a").delayElement(Duration.ofMillis(50)).log();
List<String> list = flux.join(mono, (v1) -> Flux.never(), (v2) -> Flux.never(), (x, y) -> {
return x + y;
}).collectList().block();
System.out.println(list);
Libraries like Project Reactor and RxJava try to provide as many combinations of their capabilities as possible, but they do not give users access to the instruments for combining capabilities themselves. As a result, there are always corner cases which are not covered.
My own DF4J is, as far as I know, the only asynchronous library which provides the means to combine capabilities. For example, this is how a user can zip a Flux and a Mono (of course, this class is not part of DF4J itself):
import org.df4j.core.dataflow.Actor;
import org.df4j.core.port.InpFlow;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
abstract class ZipActor<T1, T2> extends Actor {
InpFlow<T1> inpFlow = new InpFlow<>(this);
InpFlow<T2> inpScalar = new InpFlow<>(this);
ZipActor(Flux<T1> flux, Mono<T2> mono) {
flux.subscribe(inpFlow);
mono.subscribe(inpScalar);
}
@Override
protected void runAction() throws Throwable {
if (inpFlow.isCompleted()) {
stop();
return;
}
T1 element1 = inpFlow.removeAndRequest();
T2 element2 = inpScalar.current();
runAction(element1, element2);
}
protected abstract void runAction(T1 element1, T2 element2);
}
and this is how it can be used:
@Test
public void ZipActorTest() {
Flux<Integer> flux = Flux.just(1,2,3);
Mono<Integer> mono = Mono.just(5);
ZipActor<Integer, Integer> actor = new ZipActor<Integer, Integer>(flux, mono){
@Override
protected void runAction(Integer element1, Integer element2) {
System.out.println("got:"+element1+" and:"+element2);
}
};
actor.start();
actor.join();
}
The console output is as follows:
got:1 and:5
got:2 and:5
got:3 and:5
You can do it with just a simple change:
getFlux()
.flatMap(s -> Mono.zip(Mono.just(s),getMono(), (s1, s2) -> s1+" "+s2))
.subscribe(System.out::println);
Flux<String> getFlux(){
return Flux.just("a","b","c");
}
Mono<String> getMono(){
return Mono.just("mono");
}
If you don't want to use zip, you can achieve the same results using flatMap:
getFlux()
.flatMap(s -> getMono()
.map((s1 -> s + " " + s1)))
.subscribe(System.out::println);
}
Flux<String> getFlux() {
return Flux.just("a", "b", "c");
}
Mono<String> getMono() {
return Mono.just("mono");
}
In both cases the result is:
a mono
b mono
c mono
EDIT
OK, now I understand it better. Can you try this solution?
getMono().
flatMapMany(s -> getFlux().map(s1 -> s1 + " " + s))
.subscribe(System.out::println);
Flux<String> getFlux() {
return Flux.defer(() -> {
System.out.println("init flux");
return Flux.just("a", "b", "c");
});
}
Mono<String> getMono() {
return Mono.defer(() -> {
System.out.println("init Mono");
return Mono.just("sss");
});
}

MongoDB error on update method during multithreading

I'm working with Apache Storm and I want to write to a MongoDB database, but sometimes it throws an exception:
Caused by: com.mongodb.MongoException$DuplicateKey: { "serverUsed" : "127.0.0.1:27017" , "ok" : 1 , "n" : 0 , "updatedExisting" : false , "err" : "E11000 duplicate key error collection: TesiMarco.UserPostNew_Hampshire index: _id_ dup key: { : \"mainelyinspired\" }" , "code" : 11000}
while using the parallelism option. In particular, my bolt is executing:
public void execute(Tuple input, BasicOutputCollector collector) {
String user=input.getString(0);
DBObject query=new BasicDBObject("_id",user);
DBObject toUpdate=new BasicDBObject("$inc",new BasicDBObject("numeroPost",1));
collection.update(query,toUpdate,true,false);
}
but it hits a duplicate key error. How can I execute this part with multiple threads?
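As a hedged sketch (my own assumption, not from the original thread): concurrent upserts on the same _id can race and briefly raise E11000, and a common workaround is to retry the update when the duplicate key error (code 11000) appears, since the retry then finds the document the competing thread just inserted. It reuses the field names from the question; the retry count is arbitrary.
public void execute(Tuple input, BasicOutputCollector collector) {
    String user = input.getString(0);
    DBObject query = new BasicDBObject("_id", user);
    DBObject toUpdate = new BasicDBObject("$inc", new BasicDBObject("numeroPost", 1));
    // retry a few times: a concurrent upsert for the same user may have just
    // inserted the document, in which case the next attempt updates it instead
    for (int attempt = 0; ; attempt++) {
        try {
            collection.update(query, toUpdate, true, false);
            return;
        } catch (MongoException e) {
            if (e.getCode() != 11000 || attempt >= 3) {
                throw e; // not a duplicate-key race, or retries exhausted
            }
        }
    }
}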

How to calculate difference between current and previous row in Spark JavaRDD

I parsed a .log file into a JavaRDD, sorted this JavaRDD, and now I have, for example, oldJavaRDD:
2016-03-28 | 11:00 | X | object1 | region1
2016-03-28 | 11:01 | Y | object1 | region1
2016-03-28 | 11:05 | X | object1 | region1
2016-03-28 | 11:09 | X | object1 | region1
2016-03-28 | 11:00 | X | object2 | region1
2016-03-28 | 11:01 | Z | object2 | region1
How can I get newJavaRDD for saving it to the DB?
The new JavaRDD structure has to be:
2016-03-28 | 9 | object1 | region1
2016-03-28 | 1 | object2 | region1
So, I have to calculate the time between the current and previous row (also using the X, Y, Z flag in some cases to decide whether or not to add the time to the result) and add a new element to the JavaRDD whenever the date, objectName, or objectRegion changes.
I can do it using this type of code (map), but I think it's not good and not the fastest way:
JavaRDD<NewObject> newJavaRDD = oldJavaRDD.map { r ->
String datePrev[] = ...
if (datePrev != dateCurr ...) {
return newJavaRdd;
} else {
return null;
}
}
First, your code example references newJavaRDD from within a transformation that creates newJavaRDD - that's impossible on a few different levels:
You can't reference a variable on the right-hand-side of that variable's declaration...
You can't use an RDD within a transformation on an RDD (same one or another one - that doesn't matter) - anything inside a transformation must be serialized by Spark, and Spark can't serialize its own RDDs (that would make no sense)
So, how should you do that?
Assuming:
Your intention here is to get a single record for each combination of date + object + region
There shouldn't be too many records for each such combination, so it's safe to groupBy these fields as key
You can groupBy the key fields, and then mapValues to get the "minute distance" between the first and last record (the function passed to mapValues can be changed to contain your exact logic if I didn't get it right). I'll use the Joda-Time library for the time calculations:
public static void main(String[] args) {
// some setup code for this test:
JavaSparkContext sc = new JavaSparkContext("local", "test");
// input:
final JavaRDD<String[]> input = sc.parallelize(Lists.newArrayList(
// date time ? object region
new String[]{"2016-03-28", "11:00", "X", "object1", "region1"},
new String[]{"2016-03-28", "11:01", "Y", "object1", "region1"},
new String[]{"2016-03-28", "11:05", "X", "object1", "region1"},
new String[]{"2016-03-28", "11:09", "X", "object1", "region1"},
new String[]{"2016-03-28", "11:00", "X", "object2", "region1"},
new String[]{"2016-03-28", "11:01", "Z", "object2", "region1"}
));
// grouping by key:
final JavaPairRDD<String, Iterable<String[]>> byObjectAndDate = input.groupBy(new Function<String[], String>() {
@Override
public String call(String[] record) throws Exception {
return record[0] + record[3] + record[4]; // date, object, region
}
});
// mapping each "value" (all records matching a key) to the result
final JavaRDD<String[]> result = byObjectAndDate.mapValues(new Function<Iterable<String[]>, String[]>() {
@Override
public String[] call(Iterable<String[]> records) throws Exception {
final Iterator<String[]> iterator = records.iterator();
String[] previousRecord = iterator.next();
int diffMinutes = 0;
for (String[] record : records) {
if (record[2].equals("X")) { // if I got your intention right...
final LocalDateTime prev = getLocalDateTime(previousRecord);
final LocalDateTime curr = getLocalDateTime(record);
diffMinutes += Period.fieldDifference(prev, curr).toStandardMinutes().getMinutes();
}
previousRecord = record;
}
return new String[]{
previousRecord[0],
Integer.toString(diffMinutes),
previousRecord[3],
previousRecord[4]
};
}
}).values();
// do whatever with "result"...
}
// extracts a Joda LocalDateTime from a "record"
static LocalDateTime getLocalDateTime(String[] record) {
return LocalDateTime.parse(record[0] + " " + record[1], formatter);
}
static final DateTimeFormatter formatter = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm");
P.S. In Scala this would take about 8 lines... :/
