I am learning Project Reactor and exploring the Schedulers factory.
I tried the following code:
ExecutorService executorService = Executors.newFixedThreadPool(10);
Flux.range(1, 4)
    .map(i -> {
        logger.info(i + " [MAP] " + Thread.currentThread().getName());
        return 10 / i;
    })
    .publishOn(Schedulers.fromExecutorService(executorService)) // .publishOn(Schedulers.parallel())
    .subscribe(
        n -> {
            logger.info("START " + ((Long) (System.currentTimeMillis() % 10000000L)).toString());
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            logger.info(n.toString());
            logger.info("END " + ((Long) (System.currentTimeMillis() % 10000000L)).toString());
        }
    );
executorService.shutdown();
I also tried this code with Schedulers.parallel() and Schedulers.elastic(), and with the subscribeOn() operator, and saw similar results.
The logs are:
02:07:30.142 [main] INFO - 1 [MAP] main
02:07:30.143 [main] INFO - 2 [MAP] main
02:07:30.143 [main] INFO - 3 [MAP] main
02:07:30.143 [main] INFO - 4 [MAP] main
02:07:30.143 [pool-1-thread-2] INFO - START 1050143
02:07:30.247 [pool-1-thread-2] INFO - 10
02:07:30.247 [pool-1-thread-2] INFO - END 1050247
02:07:30.247 [pool-1-thread-2] INFO - START 1050247
02:07:30.350 [pool-1-thread-2] INFO - 5
02:07:30.350 [pool-1-thread-2] INFO - END 1050350
02:07:30.350 [pool-1-thread-2] INFO - START 1050350
02:07:30.455 [pool-1-thread-2] INFO - 3
02:07:30.455 [pool-1-thread-2] INFO - END 1050455
02:07:30.455 [pool-1-thread-2] INFO - START 1050455
02:07:30.557 [pool-1-thread-2] INFO - 2
02:07:30.558 [pool-1-thread-2] INFO - END 1050558
Since the Flux's elements are ordered and operated upon in sequence (apparent from the logs above), having multiple threads available to an operator (or operator chain) that processes one element at a time does not make sense. I am sure I am either misinterpreting the Schedulers or lacking something in my basic understanding. Can someone point me in the right direction?
I understand that the purpose of Schedulers is to make the processing asynchronous and free up the main thread. But why would anyone want to give multiple threads to the operator(s) when they operate on one element at a time?
Does it make sense only when we deal with the flatMap operator?
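For illustration, here is a hypothetical sketch of what I mean (my own variation on the code above, reusing the same logger; not taken from any tutorial), where flatMap lets each element be processed on its own scheduler thread:

Flux.range(1, 4)
    .flatMap(i -> Mono.fromCallable(() -> {
                logger.info(i + " [INNER] " + Thread.currentThread().getName()); // may print a different thread per element
                return 10 / i;
            })
            .subscribeOn(Schedulers.parallel())) // each inner Mono is subscribed on a parallel worker
    .subscribe(n -> logger.info("GOT " + n));

With this shape, the four inner Monos can run concurrently, which seems to be where a multi-threaded Scheduler would actually pay off.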
The following code is quoted from Java 8 in Action, section 11.4.3:
public Stream<CompletableFuture<String>> findPricesStream(String product) {
    return shops.stream()
            .map(shop -> CompletableFuture.supplyAsync(() -> shop.getPrice(product), executor))
            .map(future -> future.thenApply(Quote::parse))
            .map(future -> future.thenCompose(quote ->
                    CompletableFuture.supplyAsync(() -> Discount.applyDiscount(quote), executor)));
}
Along with the code, the author includes a figure expressing that applyDiscount() runs in the same thread as getPrice(), which I strongly doubt: there are two calls with the Async suffix here, which means the second call should run on another thread.
I tested it locally with the following code:
private static void testBasic() {
    out.println("*****************************************");
    out.println("********** TESTING thenCompose **********");
    CompletableFuture[] futures = IntStream.rangeClosed(0, LEN).boxed()
            .map(i -> CompletableFuture.supplyAsync(() -> runStage1(i), EXECUTOR_SERVICE))
            .map(future -> future.thenCompose(i ->
                    CompletableFuture.supplyAsync(() -> runStage2(i), EXECUTOR_SERVICE)))
            .toArray(CompletableFuture[]::new);
    CompletableFuture.allOf(futures).join();
}
The output further demonstrates my point. Is it correct?
*****************************************
********** TESTING thenCompose **********
Start: stage - 1 - value: 0 - thread name: pool-1-thread-1
Start: stage - 1 - value: 1 - thread name: pool-1-thread-2
Start: stage - 1 - value: 2 - thread name: pool-1-thread-3
Start: stage - 1 - value: 3 - thread name: pool-1-thread-4
Finish: stage - 1 - value: 3 - thread name: pool-1-thread-4 - time cost: 1520
Start: stage - 2 - value: 3 - thread name: pool-1-thread-5
Finish: stage - 1 - value: 0 - thread name: pool-1-thread-1 - time cost: 1736
Start: stage - 2 - value: 0 - thread name: pool-1-thread-6
Finish: stage - 1 - value: 2 - thread name: pool-1-thread-3 - time cost: 1761
Start: stage - 2 - value: 2 - thread name: pool-1-thread-7
Finish: stage - 2 - value: 2 - thread name: pool-1-thread-7 - time cost: 446
Finish: stage - 1 - value: 1 - thread name: pool-1-thread-2 - time cost: 2249
Start: stage - 2 - value: 1 - thread name: pool-1-thread-8
Finish: stage - 2 - value: 3 - thread name: pool-1-thread-5 - time cost: 828
Finish: stage - 2 - value: 0 - thread name: pool-1-thread-6 - time cost: 704
Finish: stage - 2 - value: 1 - thread name: pool-1-thread-8 - time cost: 401
Is Java 8 in Action wrong about this?
Thank you, @Holger. You have made it crystal clear to me now which thread executes async and non-async methods, especially after checking the specification, which further demonstrates your point:
Actions supplied for dependent completions of non-async methods may be performed by the thread that completes the current CompletableFuture, or by any other caller of a completion method.
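As a quick sanity check, here is a small snippet of my own (hypothetical, not from the book) showing that a non-async dependent action attached to an already-completed future runs on the thread that calls the completion method:

CompletableFuture<String> done = CompletableFuture.completedFuture("quote");
done.thenApply(q -> {
    // non-async dependent stage on an already-completed future:
    // executed synchronously on the caller's thread (here, main)
    System.out.println("thenApply ran on: " + Thread.currentThread().getName());
    return q.toUpperCase();
}).join();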
As a first note, that code is distracting from what’s happening due to the unnecessary splitting into multiple Stream operations.
Further, there is no sense in doing
future.thenCompose(quote ->
    CompletableFuture.supplyAsync(() -> Discount.applyDiscount(quote), executor))
instead of
future.thenApplyAsync(quote -> Discount.applyDiscount(quote), executor)
So, a simpler example doing the same would be
public Stream<CompletableFuture<String>> findPricesStream(String product) {
    return shops.stream().map(
        shop -> CompletableFuture
            .supplyAsync(() -> shop.getPrice(product), executor)
            .thenApply(Quote::parse)
            .thenApplyAsync(quote -> Discount.applyDiscount(quote), executor));
}
However, you are right, there is no guarantee that getPrice and applyDiscount run in the same thread, unless the executor is a single-threaded executor.
You may interpret “executor thread” as “one of the executor’s threads”, but even then, there is a dangerously wrong point in the diagram, namely “new Quote(price)”, which apparently actually means “Quote::parse”. That step does not belong on the right side, as the actual thread evaluating the function passed to thenApply is unspecified. It may be one of the executor’s threads upon completion of the previous stage, but it may also be “your thread” right when calling thenApply, e.g. if the asynchronous operation managed to complete in between.
CompletableFuture offers no way to enforce the use of the first stage’s completing thread for the dependent actions.
Unless you use simple sequential code instead, of course:
public Stream<CompletableFuture<String>> findPricesStream(String product) {
    return shops.stream().map(shop -> CompletableFuture.supplyAsync(
            () -> Discount.applyDiscount(Quote.parse(shop.getPrice(product))), executor));
}
Then, the picture of a linear thread on the right-hand side will be correct.
I have some questions regarding the output of the following code:
Flux.just("a", "b", "c", "d")
.log(null, Level.INFO, true) // line: 18
.flatMap(value ->
Mono.just(value.toUpperCase()).publishOn(Schedulers.elastic()), 2)
.log(null, Level.INFO, true) // line: 21
.take(3)
.log(null, Level.INFO, true) // line: 23
.subscribe(x ->
System.out.println("Thread: " + Thread.currentThread().getName() +
" , " + x));
Thread.sleep(1000 * 1000);
Output:
1. 11:29:11 [main] INFO - | onSubscribe([Synchronous Fuseable] FluxArray.ArraySubscription) Flux.log(App.java:18)
2. 11:29:11 [main] INFO - onSubscribe(FluxFlatMap.FlatMapMain) Flux.log(App.java:21)
3. 11:29:11 [main] INFO - onSubscribe(FluxTake.TakeSubscriber) Flux.log(App.java:23)
4. 11:29:11 [main] INFO - request(unbounded) Flux.log(App.java:23)
5. 11:29:11 [main] INFO - request(unbounded) Flux.log(App.java:21)
6. 11:29:11 [main] INFO - | request(2) Flux.log(App.java:18)
7. 11:29:11 [main] INFO - | onNext(a) Flux.log(App.java:18)
8. 11:29:11 [main] INFO - | onNext(b) Flux.log(App.java:18)
9. 11:29:11 [elastic-2] INFO - onNext(A) Flux.log(App.java:21)
10. 11:29:11 [elastic-2] INFO - onNext(A) Flux.log(App.java:23)
11. Thread: elastic-2 , A
12. 11:29:11 [elastic-2] INFO - | request(1) Flux.log(App.java:18)
13. 11:29:11 [main] INFO - | onNext(c) Flux.log(App.java:18)
14. 11:29:11 [elastic-3] INFO - onNext(B) Flux.log(App.java:21)
15. 11:29:11 [elastic-3] INFO - onNext(B) Flux.log(App.java:23)
16. Thread: elastic-3 , B
17. 11:29:11 [elastic-3] INFO - | request(1) Flux.log(App.java:18)
18. 11:29:11 [elastic-3] INFO - | onNext(d) Flux.log(App.java:18)
19. 11:29:11 [elastic-3] INFO - | onComplete() Flux.log(App.java:18)
20. 11:29:11 [elastic-3] INFO - onNext(C) Flux.log(App.java:21)
21. 11:29:11 [elastic-3] INFO - onNext(C) Flux.log(App.java:23)
22. Thread: elastic-3 , C
23. 11:29:11 [elastic-3] INFO - cancel() Flux.log(App.java:21)
24. 11:29:11 [elastic-3] INFO - onComplete() Flux.log(App.java:23)
25. 11:29:11 [elastic-3] INFO - | cancel() Flux.log(App.java:18)
Questions: each question is about a specific line in the output (not a line in the code). I have also added my own answers to some of them, but I'm not sure they are correct.
When subscribing, the subscribe operation asks for an unbounded number of elements. Then why does the request(unbounded) event go down the pipeline instead of up? My answer: the request for an unbounded amount goes up to take, and then take sends it down again.
flatMap sends the cancel signal. Why doesn't take send it instead?
Last question: there is more than one terminal signal in the output. Isn't that a violation of the Reactive Streams spec?
In that case, ONLY one terminal signal will be produced:
Flux.just("a", "b", "c", "d")
.log(null, Level.INFO, true) // line: 18
.flatMap(value ->
Mono.just(value.toUpperCase()).publishOn(Schedulers.elastic()), 2)
.log(null, Level.INFO, true) // line: 21
.take(3)
.log(null, Level.INFO, true) // line: 23
.subscribe(x ->
System.out.println("Thread: " + Thread.currentThread().getName() +
" , " + x), t -> {}, () -> System.out.println("Completed ""Only Once"));
The tricky part here is that each Reactor 3 operator has its own life, and they all play by the same rule: emit onComplete to notify the downstream operator that there is no more data.
Since you have a .log() operator at three different points, you will observe three independent onComplete signals: from .just, from .flatMap, and from .take(3).
First, you will see onComplete from .just. The default behavior of .flatMap is 'OK, let's request the first concurrency elements, and then let's see how it goes'. Since .just can produce (in your case) only 4 elements, against the initially requested demand of 2 (the concurrency level in your example) it will emit 2 onNext signals, and after two subsequent request(1) calls you will see its onComplete. In turn, that onComplete lets .flatMap know that once the 4 flattened streams have emitted their own onComplete signals, it is allowed to emit its own onComplete downstream.
In turn, the downstream is the .take(3) operator, which after the first three elements will emit its own onComplete signal without waiting for the upstream onComplete. Since there is a .log operator after .take, this signal will also be recorded.
Finally, your flow has 3 independent log operators, which record 3 independent onComplete signals from 3 independent operators; despite that, the terminal .subscribe will receive only one onComplete, from the operator directly upstream of it.
Small update regarding .take behavior
The central idea of .take is to consume elements until the given count has been satisfied. Since the upstream may produce more than was requested, we need a mechanism to prevent it from sending more data. One of the mechanisms the Reactive Streams spec offers us is collaboration over the Subscription. A Subscription has two primary methods: request, to signal demand, and cancel, to signal that the data is not needed anymore, even if the requested demand has not been satisfied.
In the case of the .take operator, the initial demand is Long.MAX_VALUE, which is treated as unbounded demand. Therefore, the only way to stop consuming a potentially infinite stream of data is to cancel the subscription, in other words to unsubscribe.
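To make that concrete, here is a minimal sketch (my own illustration, not Reactor's actual .take implementation) of a subscriber that signals unbounded demand and then cancels after three elements, which is essentially the contract .take(3) follows with its upstream (BaseSubscriber is reactor.core.publisher.BaseSubscriber; Subscription is org.reactivestreams.Subscription):

Flux.range(1, 1_000_000) // stands in for a potentially infinite source
    .subscribe(new BaseSubscriber<Integer>() {
        private int seen;

        @Override
        protected void hookOnSubscribe(Subscription subscription) {
            request(Long.MAX_VALUE); // unbounded initial demand, like .take
        }

        @Override
        protected void hookOnNext(Integer value) {
            System.out.println("got " + value);
            if (++seen == 3) {
                cancel(); // demand was unbounded, so cancelling is the only way to stop
            }
        }
    });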
Hope it helps you :)
I have set up a class extending AbstractJavaSamplerClient containing setupTest, runTest and getDefaultParameters methods, all written according to JMeter's templates and online examples.
I have added a Logger, tried setting it up normally (LoggingManager.getLoggerForClass();) as well as via super (super.getLogger();), and added multiple Logger.info and Logger.error calls.
I'm using a Java Request sampler with a JAR containing my classes, and I have added the classpath to user.properties. The right class is selected when running the tests.
When I run my tests, no custom logs appear. No error either. And I'm getting this:
2016/02/03 13:34:42 INFO - jmeter.engine.StandardJMeterEngine: Running the test!
2016/02/03 13:34:42 INFO - jmeter.samplers.SampleEvent: List of sample_variables: []
2016/02/03 13:34:42 INFO - jmeter.gui.util.JMeterMenuBar: setRunning(true,*local*)
2016/02/03 13:34:42 INFO - jmeter.engine.StandardJMeterEngine: Starting ThreadGroup: 1 : Group Test
2016/02/03 13:34:42 INFO - jmeter.engine.StandardJMeterEngine: Starting 1 threads for group Group Test.
2016/02/03 13:34:42 INFO - jmeter.engine.StandardJMeterEngine: Thread will start next loop on error
2016/02/03 13:34:42 INFO - jmeter.threads.ThreadGroup: Starting thread group number 1 threads 1 ramp-up 1 perThread 1000.0 delayedStart=false
2016/02/03 13:34:42 INFO - jmeter.threads.ThreadGroup: Started thread group number 1
2016/02/03 13:34:42 INFO - jmeter.engine.StandardJMeterEngine: All thread groups have been started
2016/02/03 13:34:42 INFO - jmeter.threads.JMeterThread: Thread started: Group Test 1-1
2016/02/03 13:34:42 INFO - jmeter.threads.JMeterThread: Thread is done: Group Test 1-1
2016/02/03 13:34:42 INFO - jmeter.threads.JMeterThread: Thread finished: Group Test 1-1
2016/02/03 13:34:42 INFO - jmeter.engine.StandardJMeterEngine: Notifying test listeners of end of test
2016/02/03 13:34:42 INFO - jmeter.gui.util.JMeterMenuBar: setRunning(false,*local*)
I've checked the user.properties and the logs are written to jmeter.log
I feel like my class is not even being tested, since the logs show nothing besides "Starting/Done/Finished Thread".
When I add a Results Tree or any other result listener, it stays empty.
What could be set up wrong here?
EDIT:
I didn't use the right Java Request (Configurations > Default Java Request != Single Java Request). From that point I saw I had an error with my JOnAS server, which wasn't connecting, explaining the nonexistent results.
As per the example given in the WebSocket Testing With Apache JMeter article:
Necessary imports:
import org.apache.jorphan.logging.LoggingManager;
import org.apache.log.Logger;
Initialization:
private static final Logger log = LoggingManager.getLoggerForClass();
Usage:
log.info("your log message");
log.error("your error message", exception);
I'm learning hadoop/pig/hive through running through tutorials on hortonworks.com
I have indeed tried to find a link to the tutorial, but unfortunately it only ships with the ISA image that they provide to you. It's not actually hosted on their website.
batting = load 'Batting.csv' using PigStorage(',');
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;
I've copied their code exactly as it was stated in the tutorial and I'm getting this output:
2013-06-14 14:34:37,969 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1.1.3.0.0-107 (rexported) compiled May 20 2013, 03:04:35
2013-06-14 14:34:37,977 [main] INFO org.apache.pig.Main - Logging error messages to: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0020/attempt_201306140401_0020_m_000000_0/work/pig_1371245677965.log
2013-06-14 14:34:38,412 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /usr/lib/hadoop/.pigbootup not found
2013-06-14 14:34:38,598 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://sandbox:8020
2013-06-14 14:34:38,998 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: sandbox:50300
2013-06-14 14:34:40,819 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
2013-06-14 14:34:40,827 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: HASH_JOIN,GROUP_BY
2013-06-14 14:34:41,115 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2013-06-14 14:34:41,160 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
2013-06-14 14:34:41,201 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage->POForEach to POJoinPackage
2013-06-14 14:34:41,213 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 3
2013-06-14 14:34:41,213 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-reduce splittees.
2013-06-14 14:34:41,214 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 out of total 3 MR operators.
2013-06-14 14:34:41,214 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2
2013-06-14 14:34:41,488 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2013-06-14 14:34:41,551 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2013-06-14 14:34:41,555 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2013-06-14 14:34:41,559 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=6398990
2013-06-14 14:34:41,559 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2013-06-14 14:34:44,244 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job5371236206169131677.jar
2013-06-14 14:34:49,495 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job5371236206169131677.jar created
2013-06-14 14:34:49,517 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up multi store job
2013-06-14 14:34:49,529 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2013-06-14 14:34:49,530 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2013-06-14 14:34:49,530 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2013-06-14 14:34:49,755 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2013-06-14 14:34:50,144 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2013-06-14 14:34:50,145 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2013-06-14 14:34:50,256 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2013-06-14 14:34:50,316 [JobControl] INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library
2013-06-14 14:34:50,444 [JobControl] INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev cf4e7cbf8ed0f0622504d008101c2729dc0c9ff3]
2013-06-14 14:34:50,665 [JobControl] WARN org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library is available
2013-06-14 14:34:50,666 [JobControl] INFO org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop library
2013-06-14 14:34:50,666 [JobControl] INFO org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library loaded
2013-06-14 14:34:50,680 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2013-06-14 14:34:52,796 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201306140401_0021
2013-06-14 14:34:52,796 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases batting,grp_data,max_runs,runs
2013-06-14 14:34:52,796 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: batting[1,10],runs[2,7],max_runs[4,11],grp_data[3,11] C: max_runs[4,11],grp_data[3,11] R: max_runs[4,11]
2013-06-14 14:34:52,796 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://sandbox:50030/jobdetails.jsp?jobid=job_201306140401_0021
2013-06-14 14:36:01,993 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2013-06-14 14:36:04,767 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2013-06-14 14:36:04,768 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201306140401_0021 has failed! Stop running all dependent jobs
2013-06-14 14:36:04,768 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2013-06-14 14:36:05,029 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2106: Error executing an algebraic function
2013-06-14 14:36:05,030 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2013-06-14 14:36:05,042 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.2.0.1.3.0.0-107 0.11.1.1.3.0.0-107 mapred 2013-06-14 14:34:41 2013-06-14 14:36:05 HASH_JOIN,GROUP_BY
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_201306140401_0021 batting,grp_data,max_runs,runs MULTI_QUERY,COMBINER Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201306140401_0021_m_000000
Input(s):
Failed to read data from "hdfs://sandbox:8020/user/hue/batting.csv"
Output(s):
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201306140401_0021 -> null,
null
2013-06-14 14:36:05,042 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2013-06-14 14:36:05,043 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias join_data
Details at logfile: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0020/attempt_201306140401_0020_m_000000_0/work/pig_1371245677965.log
When I switch this part, MAX(runs.runs), to avg(runs.runs), I get a completely different issue:
2013-06-14 14:38:25,694 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1.1.3.0.0-107 (rexported) compiled May 20 2013, 03:04:35
2013-06-14 14:38:25,695 [main] INFO org.apache.pig.Main - Logging error messages to: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0022/attempt_201306140401_0022_m_000000_0/work/pig_1371245905690.log
2013-06-14 14:38:26,198 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /usr/lib/hadoop/.pigbootup not found
2013-06-14 14:38:26,438 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://sandbox:8020
2013-06-14 14:38:26,824 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: sandbox:50300
2013-06-14 14:38:28,238 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve avg using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0022/attempt_201306140401_0022_m_000000_0/work/pig_1371245905690.log
Anybody know what the issue might be?
I am sure a lot of people have figured this out by now. I combined Eugene's solution with the original code from Hortonworks so that we get the exact output specified in the tutorial.
The following code works and produces the exact output specified in the tutorial:
batting = LOAD 'Batting.csv' USING PigStorage(',');
runs_raw = FOREACH batting GENERATE $0 AS playerID, $1 AS year, $8 AS runs;
runs = FILTER runs_raw BY runs > 0;
grp_data = GROUP runs BY (year);
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;
join_max_run = JOIN max_runs BY ($0, max_runs), runs BY (year, runs);
join_data = FOREACH join_max_run GENERATE $0 AS year, $2 AS playerID, $1 AS runs;
DUMP join_data;
Note: line "runs = FILTER runs_raw BY runs > 0;" is additional than what has been provided by Hortonworks, thanks to Eugene for sharing working code which I used to modify original Hortonworks code to make it work.
UDFs are case sensitive, so, at least to answer the second part of your question: you'll need to use AVG(runs.runs) instead of avg(runs.runs).
It's likely that once you correct your syntax you'll get the original error you reported...
I am having the exact same issue with the exact same log output, but this solution doesn't work for me, because I believe replacing MAX with AVG here defeats the whole purpose of this hortonworks.com tutorial - it was to get the MAX runs by playerID for each year.
UPDATE
Finally I got it resolved: you have to either remove the first line of Batting.csv (the column names) or edit your Pig Latin code like this:
batting = LOAD 'Batting.csv' using PigStorage(',');
runs_raw = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
runs = FILTER runs_raw BY runs > 0;
grp_data = group runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
dump max_runs;
After that you should be able to complete the tutorial correctly and get the proper result.
It also looks like this is due to a "bug" in the older version of Pig which was used in the tutorial.
Please specify an appropriate data type for playerID, year & runs, like below:
runs = FOREACH batting GENERATE $0 as playerID:int, $1 as year:chararray, $8 as runs:int;
Now it should work.