Project Reactor: Schedulers#parallel & Schedulers#elastic purpose - java

I am learning Project Reactor where I am exploring Schedulers factory.
I tried the following code:
ExecutorService executorService = Executors.newFixedThreadPool(10);
Flux.range(1,4)
.map(i -> {
logger.info(i +" [MAP] " + Thread.currentThread().getName());
return 10 / i;
})
.publishOn(Schedulers.fromExecutorService(executorService)) // .publishOn(Schedulers.parallel())
.subscribe(
n -> {
logger.info("START "+((Long)(System.currentTimeMillis() % 10000000L)).toString());
try {
Thread.sleep(100);
} catch (InterruptedException e) {
e.printStackTrace();
}
logger.info(n.toString());
logger.info("END "+((Long)(System.currentTimeMillis() % 10000000L)).toString());
}
);
executorService.shutdown();
This code was tried with Schedulers.parallel() and Schedulers.elastic() as well. Also, tried with subscribeOn() operator to see similar results.
The logs are:
02:07:30.142 [main] INFO - 1 [MAP] main
02:07:30.143 [main] INFO - 2 [MAP] main
02:07:30.143 [main] INFO - 3 [MAP] main
02:07:30.143 [main] INFO - 4 [MAP] main
02:07:30.143 [pool-1-thread-2] INFO - START 1050143
02:07:30.247 [pool-1-thread-2] INFO - 10
02:07:30.247 [pool-1-thread-2] INFO - END 1050247
02:07:30.247 [pool-1-thread-2] INFO - START 1050247
02:07:30.350 [pool-1-thread-2] INFO - 5
02:07:30.350 [pool-1-thread-2] INFO - END 1050350
02:07:30.350 [pool-1-thread-2] INFO - START 1050350
02:07:30.455 [pool-1-thread-2] INFO - 3
02:07:30.455 [pool-1-thread-2] INFO - END 1050455
02:07:30.455 [pool-1-thread-2] INFO - START 1050455
02:07:30.557 [pool-1-thread-2] INFO - 2
02:07:30.558 [pool-1-thread-2] INFO - END 1050558
Since the Flux's elements are ordered and operated upon in sequence (apparent from the logs above), having multiple threads for an operator (or operator chain) for one element does not make sense. I am sure I am either misinterpreting the Schedulers or lack somewhere in my basic understanding. Can someone point me to the right direction?
I understand the purpose of Schedulers to make the processing asynchronous and unhold the main thread. But why would anyone want to give multiple threads to the operator(s) when operated at one element at a time.
Does it makes sense only when we deal with flatMap operator?

Related

Java reactor `suscribe` is sometime blocking, sometime not

I have been playing around for some time with reactor, but I still need to get something.
This piece of code
Flux.range(1, 1000)
.delayElements(Duration.ofNanos(1))
.map(integer -> integer + 1)
.subscribe(System.out::println);
System.out.println("after");
Returns:
after
2
3
4
which is expected as the documentation of subscribe states: this will immediately return control to the calling thread.
Why, then, this piece of code:
Flux.range(1, 1000)
.map(integer -> integer + 1)
.subscribe(System.out::println);
returns
1
2
...
1000
1001
after
I can never figure out when subscribe will block or not, and that's very annoying when writing batches.
If anyone has the answer, that would be amazing
There is no blocking code in your snippet.
In first example you use .delayElements() and it switches the executing to another thread and releases your main thread. So you can see your System.out.println("after"); executing in Main thread immediately, whilst the reactive chain is being executed on parallel-n threads.
Your first example:
18:49:29.195 [main] INFO com.example.demo.FluxTest - AFTER
18:49:29.199 [parallel-1] INFO com.example.demo.FluxTest - v: 2
18:49:29.201 [parallel-2] INFO com.example.demo.FluxTest - v: 3
18:49:29.202 [parallel-3] INFO com.example.demo.FluxTest - v: 4
18:49:29.203 [parallel-4] INFO com.example.demo.FluxTest - v: 5
18:49:29.205 [parallel-5] INFO com.example.demo.FluxTest - v: 6
But your second example does not switch the executing thread, so your reactive chain executes on Main thread. And after it completes it continues to execute your System.out.println("after");
18:51:28.490 [main] INFO com.example.demo.FluxTest - v: 995
18:51:28.490 [main] INFO com.example.demo.FluxTest - v: 996
18:51:28.490 [main] INFO com.example.demo.FluxTest - v: 997
18:51:28.490 [main] INFO com.example.demo.FluxTest - v: 998
18:51:28.490 [main] INFO com.example.demo.FluxTest - v: 999
18:51:28.490 [main] INFO com.example.demo.FluxTest - v: 1000
18:51:28.490 [main] INFO com.example.demo.FluxTest - v: 1001
18:51:28.491 [main] INFO com.example.demo.FluxTest - AFTER
EDIT:
If you want to switch the thread in your second snippet, basically you have two options:
Add subscribeOn(<Scheduler>) in any place of your reactive chain. Then the whole subscription process will happen on a thread from scheduler you provided.
Add publishOn(<Scheduler>), for example, after Flux.range(), then the emitting itself will happen on your calling thread, but the downstream will be executed on a thread from the scheduler you provided

apoc.periodic.iterate fails with exception: java.util.concurrent.RejectedExecutionException

I am trying run the annotation function of graphaware within Neo4J (see documentation here). I have a set of 5000 nodes (KnowledgeArticles) with textual data in the content property. To annotate those I run the following query in Neo4J desktop:
CALL apoc.periodic.iterate(
"MATCH (n:KnowledgeArticle) RETURN n",
"CALL ga.nlp.annotate({text: n.content, id: id(n)})
YIELD result MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)", {batchSize:1, iterateList:true})
After annotating approximately 200 to 300 KnowledgeArticles the database shuts down and provides the error:
Neo.ClientError.Procedure.ProcedureCallFailed: Failed to invoke procedure `apoc.periodic.iterate`: Caused by:
java.util.concurrent.RejectedExecutionException: Task
java.util.concurrent.FutureTask#373b81ee rejected from
java.util.concurrent.ThreadPoolExecutor#285a2901[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 288]
I have experimented using different values for batchSize or setting iterateList to false, but none of this helped.
Also, I have tried performing the above iterate call limiting it only to 150 nodes. This works fine for the first time I call it, but when I perform it for a second time it again provides the same error, stating that the completed_task is about 200 to 300. The processor in the back thus seems to 'remember' the amount of tasks it has run in total as of the first time the database has started.
Could you help me resolve this issue. I want to run the above query not necessarily from Neo4j desktop, but eventually with py2neo from Python using graph.run([iterate-query]). If there is thus any way of solving this from Python, that would be even better.
Thank you!
PS. The debug log provides the following output (as of the last few iterations of the annotation up until the shut down):
2019-05-21 12:46:10.359+0000 INFO [c.g.n.p.p.AnnotatedTextPersister] Start storing annotatedText 251906
2019-05-21 12:46:13.784+0000 INFO [c.g.n.p.p.AnnotatedTextPersister] end storing annotatedText 251906. It took: 3425
2019-05-21 12:46:13.786+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:13.788+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:13.800+0000 INFO [c.g.n.u.ProcessorUtils] Taking default pipeline from configuration : myPipeline
2019-05-21 12:46:13.868+0000 INFO [c.g.n.p.s.StanfordTextProcessor] Time for pipeline annotation (myPipeline): 67. Text length: 954
2019-05-21 12:46:13.869+0000 INFO [c.g.n.NLPManager] Time to annotate 68
2019-05-21 12:46:13.869+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:13.869+0000 INFO [c.g.n.p.p.AnnotatedTextPersister] Start storing annotatedText 251907
2019-05-21 12:46:15.848+0000 INFO [c.g.n.p.p.AnnotatedTextPersister] end storing annotatedText 251907. It took: 1978
2019-05-21 12:46:15.848+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:15.862+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:15.915+0000 INFO [c.g.n.u.ProcessorUtils] Taking default pipeline from configuration : myPipeline
2019-05-21 12:46:16.294+0000 INFO [c.g.n.p.s.StanfordTextProcessor] Time for pipeline annotation (myPipeline): 378. Text length: 2641
2019-05-21 12:46:16.295+0000 INFO [c.g.n.NLPManager] Time to annotate 379
2019-05-21 12:46:16.296+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:16.296+0000 INFO [c.g.n.p.p.AnnotatedTextPersister] Start storing annotatedText 251908
2019-05-21 12:46:16.421+0000 INFO [o.n.k.a.DatabaseAvailabilityGuard] Database graph.db is unavailable.
2019-05-21 12:46:17.018+0000 INFO [c.g.s.f.b.GraphAwareServerBootstrapper] stopped
2019-05-21 12:46:17.020+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutdown started
2019-05-21 12:46:17.149+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutting down 'graph.db' database.
2019-05-21 12:46:17.150+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutdown started
2019-05-21 12:46:17.164+0000 INFO [o.n.b.i.BackupServer] BackupServer communication server shutting down and unbinding from /127.0.0.1:6362
2019-05-21 12:46:17.226+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by database shutdown # txId: 7720 checkpoint started...
2019-05-21 12:46:17.247+0000 INFO [o.n.k.i.s.c.CountsTracker] Rotated counts store at transaction 7720 to [/Users/{my.user.name}/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-e2babea7-0332-4c2c-bf1d-076d4feed49a/installation-3.5.4/data/databases/graph.db/neostore.counts.db.a], from [/Users/{my.user.name}/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-e2babea7-0332-4c2c-bf1d-076d4feed49a/installation-3.5.4/data/databases/graph.db/neostore.counts.db.b].
2019-05-21 12:46:17.644+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by database shutdown # txId: 7720 checkpoint completed in 418ms
2019-05-21 12:46:17.647+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] No log version pruned, last checkpoint was made in version 3
2019-05-21 12:46:17.698+0000 INFO [o.n.i.d.DiagnosticsManager] --- STOPPING diagnostics START ---
2019-05-21 12:46:17.700+0000 INFO [o.n.i.d.DiagnosticsManager] --- STOPPING diagnostics END ---
2019-05-21 12:46:17.706+0000 INFO [c.g.r.BaseGraphAwareRuntime] Shutting down GraphAware Runtime...
2019-05-21 12:46:17.709+0000 INFO [c.g.r.m.BaseModuleManager] Shutting down module UIDM
2019-05-21 12:46:17.709+0000 INFO [c.g.r.m.BaseModuleManager] Shutting down module NLP
2019-05-21 12:46:17.712+0000 INFO [c.g.r.s.RotatingTaskScheduler] Terminating task scheduler...
2019-05-21 12:46:17.712+0000 INFO [c.g.r.s.RotatingTaskScheduler] Task scheduler terminated successfully.
2019-05-21 12:46:17.714+0000 INFO [c.g.r.BaseGraphAwareRuntime] GraphAware Runtime shut down.

Basic questions about Reactor signals

I have some questions regarding the output of the following code:
Flux.just("a", "b", "c", "d")
.log(null, Level.INFO, true) // line: 18
.flatMap(value ->
Mono.just(value.toUpperCase()).publishOn(Schedulers.elastic()), 2)
.log(null, Level.INFO, true) // line: 21
.take(3)
.log(null, Level.INFO, true) // line: 23
.subscribe(x ->
System.out.println("Thread: " + Thread.currentThread().getName() +
" , " + x));
Thread.sleep(1000 * 1000);
Output:
1. 11:29:11 [main] INFO - | onSubscribe([Synchronous Fuseable] FluxArray.ArraySubscription) Flux.log(App.java:18)
2. 11:29:11 [main] INFO - onSubscribe(FluxFlatMap.FlatMapMain) Flux.log(App.java:21)
3. 11:29:11 [main] INFO - onSubscribe(FluxTake.TakeSubscriber) Flux.log(App.java:23)
4. 11:29:11 [main] INFO - request(unbounded) Flux.log(App.java:23)
5. 11:29:11 [main] INFO - request(unbounded) Flux.log(App.java:21)
6. 11:29:11 [main] INFO - | request(2) Flux.log(App.java:18)
7. 11:29:11 [main] INFO - | onNext(a) Flux.log(App.java:18)
8. 11:29:11 [main] INFO - | onNext(b) Flux.log(App.java:18)
9. 11:29:11 [elastic-2] INFO - onNext(A) Flux.log(App.java:21)
10. 11:29:11 [elastic-2] INFO - onNext(A) Flux.log(App.java:23)
11. Thread: elastic-2 , A
12. 11:29:11 [elastic-2] INFO - | request(1) Flux.log(App.java:18)
13. 11:29:11 [main] INFO - | onNext(c) Flux.log(App.java:18)
14. 11:29:11 [elastic-3] INFO - onNext(B) Flux.log(App.java:21)
15. 11:29:11 [elastic-3] INFO - onNext(B) Flux.log(App.java:23)
16. Thread: elastic-3 , B
17. 11:29:11 [elastic-3] INFO - | request(1) Flux.log(App.java:18)
18. 11:29:11 [elastic-3] INFO - | onNext(d) Flux.log(App.java:18)
19. 11:29:11 [elastic-3] INFO - | onComplete() Flux.log(App.java:18)
20. 11:29:11 [elastic-3] INFO - onNext(C) Flux.log(App.java:21)
21. 11:29:11 [elastic-3] INFO - onNext(C) Flux.log(App.java:23)
22. Thread: elastic-3 , C
23. 11:29:11 [elastic-3] INFO - cancel() Flux.log(App.java:21)
24. 11:29:11 [elastic-3] INFO - onComplete() Flux.log(App.java:23)
25. 11:29:11 [elastic-3] INFO - | cancel() Flux.log(App.java:18)
Questions: Each question is about a specific line inside the output (not a line in the code). I also added my answers to some of them but I'm not sure I'm correct.
When subscribing, the subscribe operation ask for unbounded amount of elements. Then why the event: request(unbounded) is going down in the pipeline instead of going up? My answer: The request for unbounded amount is going up to take and then take sending it down again.
flatMap send cancel signal. Why doesn't take sends it instead?
Last question: There is more then one terminal signal in the output. Isn't it a vaulation of reactive-streams spec?
In that case, will be produced ONLY one terminal signal.
Flux.just("a", "b", "c", "d")
.log(null, Level.INFO, true) // line: 18
.flatMap(value ->
Mono.just(value.toUpperCase()).publishOn(Schedulers.elastic()), 2)
.log(null, Level.INFO, true) // line: 21
.take(3)
.log(null, Level.INFO, true) // line: 23
.subscribe(x ->
System.out.println("Thread: " + Thread.currentThread().getName() +
" , " + x), t -> {}, () -> System.out.println("Completed ""Only Once"));
The tricky part here is that each Reactor 3 operator has its own life, and they all are playing by the same rule - emit onComplete to notify downstream operator that there is no data anymore.
Since you have .log() operator and three different points thus you will observe three independent onComplete signals from .just, from .flatMap, and from .take(3).
Firstly, you will see onComplete from .just because the default behavior of .flatMap is 'ok, let's try to request first concurrency elements, and then let's see how it goes', since .just may produce (in your case) only 4 elements, on 2 (which is concurrency level in your example) requested demand it will emit 2 onNext and after two request(1) you will see onComplete. In turn, emitted onComplete lets the .flatMap knows that when 4 flatted streams emit their .onComplete signals, it will be allowed to emit its own onComplete to downstream.
In turn, downstream is .take(3) operator which is also after first three elements will emit its own onComplete signal without waiting for upstream onComplete. Since there is .log operator after .take this signal also will be recorded.
Finally, in your flow, you have 3 independent log operators, which will record 3 independent onComplete from 3 independent operators, but despite that fact, the final terminal .subscribe will receive only one onComplete from the first operator up to the flow.
Small update regarding .take behavior
The central idea of .take is taking elements until the remaining count has been satisfied. Since the upstream may produce more than was requested we need to have a mechanism to prevent sending more data. One of the mechanisms that Reactive-Streams spec offers to us is collaborations over Subscription. Subscription has two primary methods - request - to show the demand and cancel - to show that data is not needed anymore even if requested demand was not satisfied.
In case of .take operator, initial demand is Long.MAX_VALUE, which considers as unbounded demand. Therefore, the only way to stop consuming potentially infinitive stream of data is to cancel subscription, or in other word unsubscribe
Hope it helps you :)

No results and no log information in Apache JMeter

I have set up a class extending AbstractJavaSamplerClientcontaining a setupTest, runTest and getDefaultParameters method, all written accordingly to the jMeter's templates and online examples.
Have added a Logger, tried setting it up normally (LoggingManager.getLoggerForClass();) and using super as well (super.getLogger();) and added multiple Logger.info and Logger.error
I'm using a Java request and giving a Jar containing my classes have completed the user.properties with the classpath. The right class is selected when running the tests.
When I run my tests, no custom logs appear. No error either. And I'm getting this:
2016/02/03 13:34:42 INFO - jmeter.engine.StandardJMeterEngine: Running the test!
2016/02/03 13:34:42 INFO - jmeter.samplers.SampleEvent: List of sample_variables: []
2016/02/03 13:34:42 INFO - jmeter.gui.util.JMeterMenuBar: setRunning(true,*local*)
2016/02/03 13:34:42 INFO - jmeter.engine.StandardJMeterEngine: Starting ThreadGroup: 1 : Group Test
2016/02/03 13:34:42 INFO - jmeter.engine.StandardJMeterEngine: Starting 1 threads for group Group Test.
2016/02/03 13:34:42 INFO - jmeter.engine.StandardJMeterEngine: Thread will start next loop on error
2016/02/03 13:34:42 INFO - jmeter.threads.ThreadGroup: Starting thread group number 1 threads 1 ramp-up 1 perThread 1000.0 delayedStart=false
2016/02/03 13:34:42 INFO - jmeter.threads.ThreadGroup: Started thread group number 1
2016/02/03 13:34:42 INFO - jmeter.engine.StandardJMeterEngine: All thread groups have been started
2016/02/03 13:34:42 INFO - jmeter.threads.JMeterThread: Thread started: Group Test 1-1
2016/02/03 13:34:42 INFO - jmeter.threads.JMeterThread: Thread is done: Group Test 1-1
2016/02/03 13:34:42 INFO - jmeter.threads.JMeterThread: Thread finished: Group Test 1-1
2016/02/03 13:34:42 INFO - jmeter.engine.StandardJMeterEngine: Notifying test listeners of end of test
2016/02/03 13:34:42 INFO - jmeter.gui.util.JMeterMenuBar: setRunning(false,*local*)
I've checked the user.properties and the logs are written to jmeter.log
I feel like my class is not even tested since the logs are not displaying anything beside "Starting/Done/Finshed Thread".
When I add a result tree or any other result thread, it stays empty.
What could be set up wrong here?
EDIT:
Didn't use the right JAVa request (Configurations>Default Java Request != Single Java Request). From that point I saw I had an error with my Jonas thus not connecting and explaining the non existent results.
As per example given in WebSocket Testing With Apache JMeter article:
Necessary imports:
import org.apache.jorphan.logging.LoggingManager;
import org.apache.log.Logger;
Initialization:
private static final Logger log = LoggingManager.getLoggerForClass();
Usage:
log.info("your log message");
log.error("your error message", exception);

algebraic error when running "aggregate" function on dataset

I'm learning hadoop/pig/hive through running through tutorials on hortonworks.com
I have indeed tried to find a link to the tutorial, but unfortunately it only ships with the ISA image that they provide to you. It's not actually hosted on their website.
batting = load 'Batting.csv' using PigStorage(',');
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;
I've copied their code exactly as it was stated in the tutorial and I'm getting this output:
2013-06-14 14:34:37,969 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1.1.3.0.0-107 (rexported) compiled May 20 2013, 03:04:35
2013-06-14 14:34:37,977 [main] INFO org.apache.pig.Main - Logging error messages to: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0020/attempt_201306140401_0020_m_000000_0/work/pig_1371245677965.log
2013-06-14 14:34:38,412 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /usr/lib/hadoop/.pigbootup not found
2013-06-14 14:34:38,598 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://sandbox:8020
2013-06-14 14:34:38,998 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: sandbox:50300
2013-06-14 14:34:40,819 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
2013-06-14 14:34:40,827 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: HASH_JOIN,GROUP_BY
2013-06-14 14:34:41,115 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2013-06-14 14:34:41,160 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
2013-06-14 14:34:41,201 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage->POForEach to POJoinPackage
2013-06-14 14:34:41,213 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 3
2013-06-14 14:34:41,213 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-reduce splittees.
2013-06-14 14:34:41,214 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 out of total 3 MR operators.
2013-06-14 14:34:41,214 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2
2013-06-14 14:34:41,488 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2013-06-14 14:34:41,551 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2013-06-14 14:34:41,555 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2013-06-14 14:34:41,559 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=6398990
2013-06-14 14:34:41,559 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2013-06-14 14:34:44,244 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job5371236206169131677.jar
2013-06-14 14:34:49,495 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job5371236206169131677.jar created
2013-06-14 14:34:49,517 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up multi store job
2013-06-14 14:34:49,529 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2013-06-14 14:34:49,530 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2013-06-14 14:34:49,530 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2013-06-14 14:34:49,755 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2013-06-14 14:34:50,144 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2013-06-14 14:34:50,145 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2013-06-14 14:34:50,256 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2013-06-14 14:34:50,316 [JobControl] INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library
2013-06-14 14:34:50,444 [JobControl] INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev cf4e7cbf8ed0f0622504d008101c2729dc0c9ff3]
2013-06-14 14:34:50,665 [JobControl] WARN org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library is available
2013-06-14 14:34:50,666 [JobControl] INFO org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop library
2013-06-14 14:34:50,666 [JobControl] INFO org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library loaded
2013-06-14 14:34:50,680 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2013-06-14 14:34:52,796 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201306140401_0021
2013-06-14 14:34:52,796 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases batting,grp_data,max_runs,runs
2013-06-14 14:34:52,796 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: batting[1,10],runs[2,7],max_runs[4,11],grp_data[3,11] C: max_runs[4,11],grp_data[3,11] R: max_runs[4,11]
2013-06-14 14:34:52,796 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://sandbox:50030/jobdetails.jsp?jobid=job_201306140401_0021
2013-06-14 14:36:01,993 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2013-06-14 14:36:04,767 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2013-06-14 14:36:04,768 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201306140401_0021 has failed! Stop running all dependent jobs
2013-06-14 14:36:04,768 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2013-06-14 14:36:05,029 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2106: Error executing an algebraic function
2013-06-14 14:36:05,030 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2013-06-14 14:36:05,042 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.2.0.1.3.0.0-107 0.11.1.1.3.0.0-107 mapred 2013-06-14 14:34:41 2013-06-14 14:36:05 HASH_JOIN,GROUP_BY
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_201306140401_0021 batting,grp_data,max_runs,runs MULTI_QUERY,COMBINER Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201306140401_0021_m_000000
Input(s):
Failed to read data from "hdfs://sandbox:8020/user/hue/batting.csv"
Output(s):
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201306140401_0021 -> null,
null
2013-06-14 14:36:05,042 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2013-06-14 14:36:05,043 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias join_data
Details at logfile: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0020/attempt_201306140401_0020_m_000000_0/work/pig_1371245677965.log
When switching this part: MAX(runs.runs) to avg(runs.runs) then I am getting a completely different issue:
2013-06-14 14:38:25,694 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1.1.3.0.0-107 (rexported) compiled May 20 2013, 03:04:35
2013-06-14 14:38:25,695 [main] INFO org.apache.pig.Main - Logging error messages to: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0022/attempt_201306140401_0022_m_000000_0/work/pig_1371245905690.log
2013-06-14 14:38:26,198 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /usr/lib/hadoop/.pigbootup not found
2013-06-14 14:38:26,438 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://sandbox:8020
2013-06-14 14:38:26,824 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: sandbox:50300
2013-06-14 14:38:28,238 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve avg using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0022/attempt_201306140401_0022_m_000000_0/work/pig_1371245905690.log
Anybody know what the issue might be?
I am sure lot of people would have figured this out. I combined Eugene's solution with the original code from Hortonworks such that we get the exact output as specific in the tutorial.
Following code works and produces exact output as specified in the tutorial:
batting = LOAD 'Batting.csv' using PigStorage(',');
runs_raw = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
runs = FILTER runs_raw BY runs > 0;
grp_data = group runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;
Note: line "runs = FILTER runs_raw BY runs > 0;" is additional than what has been provided by Hortonworks, thanks to Eugene for sharing working code which I used to modify original Hortonworks code to make it work.
UDFs are case sensitive, so at least to answer the second part of your question - you'll need to use AVG(runs.runs) instead of avg(runs.runs)
It's likely that once you correct your syntax you'll get the original error you reported...
i am having the same exact same issue with exact same log output, but this solution doesn't work because i believe changing MAX with AVG here dumps the whole purpose of this hortonworks.com tutorial - it was to get the MAX runs by playerID for each year.
UPDATE
Finally i got it resolved - you have to either remove the first line in Batting.csv (column names) or edit your Pig Latin code like this:
batting = LOAD ‘Batting.csv’ using PigStorage(‘,’);
runs_raw = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
runs = FILTER runs_raw BY runs > 0;
grp_data = group runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
dump max_runs;
After that you should be able to complete tutorial correctly and get the proper result.
It also looks like this is due to the "bug" in the older versions of Pig rhich was used in the tutorial
Please specify appropriate data type for playerID, year & runs like below:
runs = FOREACH batting GENERATE $0 as playerID:int, $1 as year:chararray, $8 as runs:int;
Not, it should work.

Categories