Print inside MapReduce class - java

I have this MapReduce example [1], and I want to print info to stdout and to a log file [3]. It seems that the logs don't print anything. How can I make my map class print output?
I have also configured yarn-site.xml to retain logs [2]. Although the logs are retained in the /app-logs dir, the userlogs dir that contains the output of the job execution is deleted at the end of the job. How can I make MapReduce not delete the files in the userlogs dir?
I am using Yarn.
Thanks,
[1] Wordcount example with just the map part.
public class MyWordCount {

    public static class MyMap extends Mapper {

        Log log = LogFactory.getLog(MyWordCount.class);
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            System.out.println("HERRE");
            log.info("HERRRRRE");
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);
            }
        }

        public void run(Context context) throws IOException, InterruptedException {
            setup(context);
            try {
                while (context.nextKeyValue()) {
                    System.out.println("Key: " + context.getCurrentKey() + " Value: " + context.getCurrentValue());
                    map(context.getCurrentKey(), context.getCurrentValue(), context);
                }
            } finally {
                cleanup(context);
            }
        }

        public void cleanup(Mapper.Context context) {}
    }
}
[2] yarn-site.xml
<!-- job history -->
<property> <name>yarn.log-aggregation-enable</name> <value>true</value> </property>
<property> <name>yarn.nodemanager.log.retain-seconds</name> <value>900000</value> </property>
<property> <name>yarn.nodemanager.remote-app-log-dir</name> <value>/app-logs</value> </property>
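For reference, the NodeManager also has a delete-delay setting that postpones cleanup of the per-container local dirs and log dirs (including userlogs) after an application finishes; the 600-second value below is only an illustrative assumption, not part of the original configuration:
<!-- assumption: keep container local dirs and logs for 10 minutes after the application finishes -->
<property> <name>yarn.nodemanager.delete.debug-delay-sec</name> <value>600</value> </property>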
[3] log output
Log Type: stderr
Log Upload Time: 24-Sep-2015 12:45:19
Log Length: 317
Java HotSpot(TM) Client VM warning: You have loaded library /home/xubuntu/Programs/hadoop-2.6.0/lib/native/libhadoop.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
Log Type: stdout
Log Upload Time: 24-Sep-2015 12:45:19
Log Length: 0
Log Type: syslog
Log Upload Time: 24-Sep-2015 12:45:19
Log Length: 2604
2015-09-24 12:45:04,569 WARN [main] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-09-24 12:45:05,139 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2015-09-24 12:45:05,412 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2015-09-24 12:45:05,413 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system started
2015-09-24 12:45:05,462 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
2015-09-24 12:45:05,463 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1443113036547_0001, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@1b5a082)
2015-09-24 12:45:05,847 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
2015-09-24 12:45:06,915 INFO [main] org.apache.hadoop.mapred.YarnChild: mapreduce.cluster.local.dir for child: /tmp/hadoop-temp/nm-local-dir/usercache/xubuntu/appcache/application_1443113036547_0001
2015-09-24 12:45:07,604 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
2015-09-24 12:45:09,402 INFO [main] org.apache.hadoop.mapred.Task: Using ResourceCalculatorProcessTree : [ ]
2015-09-24 12:45:10,187 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: hdfs://hadoop-coc-1:9000/input1/b.txt:0+21
2015-09-24 12:45:10,812 INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1443113036547_0001_m_000000_0 is done. And is in the process of committing
2015-09-24 12:45:10,969 INFO [main] org.apache.hadoop.mapred.Task: Task attempt_1443113036547_0001_m_000000_0 is allowed to commit now
2015-09-24 12:45:10,993 INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of task 'attempt_1443113036547_0001_m_000000_0' to hdfs://192.168.10.110:9000/output1-1442847968/_temporary/1/task_1443113036547_0001_m_000000
2015-09-24 12:45:11,135 INFO [main] org.apache.hadoop.mapred.Task: Task 'attempt_1443113036547_0001_m_000000_0' done.
2015-09-24 12:45:11,135 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system...
2015-09-24 12:45:11,136 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped.
2015-09-24 12:45:11,136 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system shutdown complete.

I have found the error. It is a bug in my code.
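For reference, here is a minimal sketch of a mapper written purely against the new mapreduce API (an assumption about the intended fix, not the poster's actual corrected code), where the logger output lands in the task's syslog and System.out in its stdout container log:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMap extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final Log LOG = LogFactory.getLog(MyMap.class);
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        LOG.info("map() called for key " + key);            // appears in the task syslog
        System.out.println("map() called for key " + key);  // appears in the task stdout log
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}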

Related

Apache Camel Route is getting started automatically

The route loadfile starts automatically when I start the main class.
On exception, when the process should finish, it starts loadfile again and again.
It should start from the timer and then call the loadfile route, but loadfile is also starting independently, as well as from the timer.
CamelContext context = new DefaultCamelContext(sr);
try {
    context.addRoutes(new RouteBuilder() {
        @Override
        public void configure() throws Exception {
            onException(Exception.class)
                .log(LoggingLevel.INFO, "Extype:${exception.message}")
                .stop();
            from("timer://alertstrigtimer?period=60s&repeatCount=1")
                .startupOrder(1)
                .log(LoggingLevel.INFO, "*******************************Job-Alert-System: Started: alertstrigtimer******************************")
                .to("direct:loadFile").stop();
            from("direct:loadFile").routeId("loadfile")
                .log(LoggingLevel.INFO, "*******************************Job-Alert-System: Started: direct:loadFile******************************")
                .from(getTriggerFileURI(getWorkFilePath(), getWorkFileName())).choice()
                .
                .
        }
    });
    context.start();
    Thread.sleep(40000);
Following is the log:
[main] INFO org.apache.camel.impl.DefaultCamelContext - Apache Camel 2.21.1 (CamelContext: camel-1) is starting
[main] INFO org.apache.camel.management.ManagedManagementStrategy - JMX is enabled
[main] INFO org.apache.camel.impl.converter.DefaultTypeConverter - Type converters loaded (core: 194, classpath: 14)
[main] INFO org.apache.camel.impl.DefaultCamelContext - StreamCaching is not in use. If using streams then its recommended to enable stream caching. See more details at http://camel.apache.org/stream-caching.html
[main] INFO org.apache.camel.impl.DefaultCamelContext - Route: route1 started and consuming from: timer://alertstrigtimer?period=60s&repeatCount=1
[main] INFO org.apache.camel.impl.DefaultCamelContext - Skipping starting of route loadfile as its configured with autoStartup=false
[main] INFO org.apache.camel.impl.DefaultCamelContext - Route: loadDataAndAlerts started and consuming from: direct://loadDataAndAlerts
[main] INFO org.apache.camel.impl.DefaultCamelContext - Total 4 routes, of which 2 are started
[main] INFO org.apache.camel.impl.DefaultCamelContext - Apache Camel 2.21.1 (CamelContext: camel-1) started in 0.761 seconds
[Camel (camel-1) thread #1 - timer://alertstrigtimer] INFO route1 - *******************************Job-Alert-System: Started: alertstrigtimer******************************
[Camel (camel-1) thread #2 - timer://alertstrigtimer] INFO loadfile - *******************************Job-Alert-System: Started: direct:loadFile******************************
[Camel (camel-1) thread #1 - file://null] INFO loadfile - *******************************Job-Alert-System: Started: direct:loadFile******************************
The problem could be caused by the line .from(getTriggerFileURI(getWorkFilePath(), getWorkFileName())) in the loadfile route. A route with multiple from endpoints is known as Multiple Inputs, and this pattern was removed in Camel 3.x.
From Red Hat:
from("URI1").from("URI2").from("URI3").to("DestinationUri");
..., exchanges from each of the input endpoints,
URI1, URI2, and URI3, are processed independently of each other and in
separate threads. In fact, you can think of the preceding route as
being equivalent to the following three separate routes:
from("URI1").to("DestinationUri");
from("URI2").to("DestinationUri");
from("URI3").to("DestinationUri");
Rather than using multiple from endpoints (extra independent inputs), try the content enricher pattern (pollEnrich for the file component), as sketched below.
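A hedged sketch of that approach (the file URI, timeout, and route IDs are placeholders, not the poster's actual values):

import org.apache.camel.LoggingLevel;
import org.apache.camel.builder.RouteBuilder;

public class AlertRoutes extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        from("timer://alertstrigtimer?period=60s&repeatCount=1")
            .routeId("alertstrigtimer")
            .log(LoggingLevel.INFO, "Job-Alert-System: timer fired")
            // pollEnrich pulls one file into the current exchange instead of
            // registering a second from(...) consumer on the same route
            .pollEnrich("file:data/inbox?fileName=trigger.csv", 10000)
            .to("direct:loadFile");

        from("direct:loadFile")
            .routeId("loadfile")
            .log(LoggingLevel.INFO, "Job-Alert-System: processing trigger file")
            // ... the original choice()/processing logic would continue here
            .to("log:done");
    }
}

With this shape, loadfile has only the direct: consumer, so it can no longer start consuming files independently of the timer.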

apoc.periodic.iterate fails with exception: java.util.concurrent.RejectedExecutionException

I am trying to run the annotation function of GraphAware within Neo4j (see documentation here). I have a set of 5000 nodes (KnowledgeArticles) with textual data in the content property. To annotate those I run the following query in Neo4j Desktop:
CALL apoc.periodic.iterate(
"MATCH (n:KnowledgeArticle) RETURN n",
"CALL ga.nlp.annotate({text: n.content, id: id(n)})
YIELD result MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)", {batchSize:1, iterateList:true})
After annotating approximately 200 to 300 KnowledgeArticles the database shuts down and provides the error:
Neo.ClientError.Procedure.ProcedureCallFailed: Failed to invoke procedure `apoc.periodic.iterate`: Caused by:
java.util.concurrent.RejectedExecutionException: Task
java.util.concurrent.FutureTask@373b81ee rejected from
java.util.concurrent.ThreadPoolExecutor@285a2901[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 288]
I have experimented using different values for batchSize or setting iterateList to false, but none of this helped.
Also, I have tried performing the above iterate call limited to only 150 nodes. This works fine the first time I call it, but when I perform it a second time it gives the same error again, stating that the completed task count is about 200 to 300. The executor in the background thus seems to 'remember' the total number of tasks it has run since the database was first started.
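To illustrate: the [Terminated, pool size = 0, ...] part of the message matches how a java.util.concurrent thread pool behaves once it has been shut down (I am only assuming the plugin uses such a pool internally); a plain-Java sketch of the same rejection:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RejectedDemo {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        pool.submit(() -> System.out.println("first task runs"));
        pool.shutdown();
        // After shutdown the pool refuses new work with RejectedExecutionException,
        // while keeping the completed-task count from before the shutdown.
        pool.submit(() -> System.out.println("never runs"));
    }
}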
Could you help me resolve this issue? I want to run the above query not necessarily from Neo4j Desktop, but eventually with py2neo from Python using graph.run([iterate-query]). If there is any way of solving this from Python, that would be even better.
Thank you!
PS. The debug log provides the following output (from the last few iterations of the annotation up until the shutdown):
2019-05-21 12:46:10.359+0000 INFO [c.g.n.p.p.AnnotatedTextPersister] Start storing annotatedText 251906
2019-05-21 12:46:13.784+0000 INFO [c.g.n.p.p.AnnotatedTextPersister] end storing annotatedText 251906. It took: 3425
2019-05-21 12:46:13.786+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:13.788+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:13.800+0000 INFO [c.g.n.u.ProcessorUtils] Taking default pipeline from configuration : myPipeline
2019-05-21 12:46:13.868+0000 INFO [c.g.n.p.s.StanfordTextProcessor] Time for pipeline annotation (myPipeline): 67. Text length: 954
2019-05-21 12:46:13.869+0000 INFO [c.g.n.NLPManager] Time to annotate 68
2019-05-21 12:46:13.869+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:13.869+0000 INFO [c.g.n.p.p.AnnotatedTextPersister] Start storing annotatedText 251907
2019-05-21 12:46:15.848+0000 INFO [c.g.n.p.p.AnnotatedTextPersister] end storing annotatedText 251907. It took: 1978
2019-05-21 12:46:15.848+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:15.862+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:15.915+0000 INFO [c.g.n.u.ProcessorUtils] Taking default pipeline from configuration : myPipeline
2019-05-21 12:46:16.294+0000 INFO [c.g.n.p.s.StanfordTextProcessor] Time for pipeline annotation (myPipeline): 378. Text length: 2641
2019-05-21 12:46:16.295+0000 INFO [c.g.n.NLPManager] Time to annotate 379
2019-05-21 12:46:16.296+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:16.296+0000 INFO [c.g.n.p.p.AnnotatedTextPersister] Start storing annotatedText 251908
2019-05-21 12:46:16.421+0000 INFO [o.n.k.a.DatabaseAvailabilityGuard] Database graph.db is unavailable.
2019-05-21 12:46:17.018+0000 INFO [c.g.s.f.b.GraphAwareServerBootstrapper] stopped
2019-05-21 12:46:17.020+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutdown started
2019-05-21 12:46:17.149+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutting down 'graph.db' database.
2019-05-21 12:46:17.150+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutdown started
2019-05-21 12:46:17.164+0000 INFO [o.n.b.i.BackupServer] BackupServer communication server shutting down and unbinding from /127.0.0.1:6362
2019-05-21 12:46:17.226+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by database shutdown @ txId: 7720 checkpoint started...
2019-05-21 12:46:17.247+0000 INFO [o.n.k.i.s.c.CountsTracker] Rotated counts store at transaction 7720 to [/Users/{my.user.name}/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-e2babea7-0332-4c2c-bf1d-076d4feed49a/installation-3.5.4/data/databases/graph.db/neostore.counts.db.a], from [/Users/{my.user.name}/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-e2babea7-0332-4c2c-bf1d-076d4feed49a/installation-3.5.4/data/databases/graph.db/neostore.counts.db.b].
2019-05-21 12:46:17.644+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by database shutdown @ txId: 7720 checkpoint completed in 418ms
2019-05-21 12:46:17.647+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] No log version pruned, last checkpoint was made in version 3
2019-05-21 12:46:17.698+0000 INFO [o.n.i.d.DiagnosticsManager] --- STOPPING diagnostics START ---
2019-05-21 12:46:17.700+0000 INFO [o.n.i.d.DiagnosticsManager] --- STOPPING diagnostics END ---
2019-05-21 12:46:17.706+0000 INFO [c.g.r.BaseGraphAwareRuntime] Shutting down GraphAware Runtime...
2019-05-21 12:46:17.709+0000 INFO [c.g.r.m.BaseModuleManager] Shutting down module UIDM
2019-05-21 12:46:17.709+0000 INFO [c.g.r.m.BaseModuleManager] Shutting down module NLP
2019-05-21 12:46:17.712+0000 INFO [c.g.r.s.RotatingTaskScheduler] Terminating task scheduler...
2019-05-21 12:46:17.712+0000 INFO [c.g.r.s.RotatingTaskScheduler] Task scheduler terminated successfully.
2019-05-21 12:46:17.714+0000 INFO [c.g.r.BaseGraphAwareRuntime] GraphAware Runtime shut down.

How to fix logger randomly logging on the same line, even though it is set to start each log with a new line

I am using log4j to log events in Java code. I have it set to start each log line on a new line, with a timestamp, thread, log level, and the class where the log call runs. So the configuration looks like this:
LoggerContext loggerContext = (LoggerContext) LoggerFactory.getILoggerFactory();
logger = loggerContext.getLogger("com.asdf");
logger.setAdditive(true);

PatternLayoutEncoder encoder = new PatternLayoutEncoder();
encoder.setContext(loggerContext);
encoder.setPattern("%-5level %d [%thread:%M:%caller{1}]: %message%n");
encoder.start();

cucumberAppender = new CucumberAppender();
cucumberAppender.setName("cucumber-appender");
cucumberAppender.setContext(loggerContext);
cucumberAppender.setScenario(scenario);
cucumberAppender.setEncoder(encoder);
cucumberAppender.start();
logger.addAppender(cucumberAppender);
loggerContext.start();

logger().info("*********************************************");
logger().info("* Starting Scenario - {}", scenario.getName());
logger().info("*********************************************\n");
}

@After
public void showScenarioResult(Scenario scenario) throws InterruptedException {
    logger().info("**************************************************************");
    logger().info("* {} Scenario - {} ", scenario.getStatus(), scenario.getName());
    logger().info("**************************************************************\n");
    cucumberAppender.writeToScenario();
    cucumberAppender.stop();
    logger.detachAppender(cucumberAppender);
    logger.detachAndStopAllAppenders();
}
which most of the time outputs the log correctly, like so:
15:59:25.448 [main] INFO com.asdf.runner.steps.StepHooks -
********************************************* 15:59:25.449 [main] INFO com.asdf.runner.steps.StepHooks - * Starting Scenario - Check Cache 15:59:25.450 [main] INFO com.asdf.runner.steps.StepHooks -
********************************************* 15:59:25.558 [main] DEBUG org.cache2k.core.util.Log - New instance, using SLF4J logging 15:59:25.575 [main] INFO org.cache2k.core.Cache2kCoreProviderImpl - cache2k starting. version=1.0.1.Final, build=undefined, defaultImplementation=HeapCache 15:59:25.629 [main] DEBUG org.cache2k.CacheManager:default - open name=default, id=wvl973, classloaderId=6us14y
However, sometimes the next logger entry is written onto the previous line, without the new line, like below:
15:59:27.353 [main] INFO com.asdf.cache.CacheService - Creating a cache for [Kafka] service with specific types.15:59:27.354 [main] INFO com.asdf.runner.steps.StepHooks - **************************************************************
15:59:27.354 [main] INFO com.asdf.runner.steps.StepHooks - * PASSED Scenario - Check Cache
15:59:27.354 [main] INFO com.asdf.runner.steps.StepHooks - **************************************************************
As you can see, the first StepHooks line continues on the same line as the CacheService one, which is unaesthetic.
What can I change so that the logger always starts each entry on a new line, without exceptions like this?

Hadoop Pipes Wordcount example: NullPointerException in LocalJobRunner

I am trying to run the sample from this tutorial about Hadoop Pipes:
Compiling succeeds and everything seems fine. However, when the job runs it fails with a NullPointerException. I tried many ways and read many similar questions, but wasn't able to find an actual solution for this problem.
Note: I am running on a single machine in a pseudo-distributed environment.
hadoop pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriters=true -input /input -output /output -program /bin/wordcount
DEPRECATED: Use of this script to execute mapred command is deprecated.
Instead use the mapred command for it.
15/02/18 01:09:02 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/02/18 01:09:02 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/02/18 01:09:02 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
15/02/18 01:09:03 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
15/02/18 01:09:04 INFO mapred.FileInputFormat: Total input paths to process : 1
15/02/18 01:09:04 INFO mapreduce.JobSubmitter: number of splits:1
15/02/18 01:09:04 INFO Configuration.deprecation: hadoop.pipes.java.recordreader is deprecated. Instead, use mapreduce.pipes.isjavarecordreader
15/02/18 01:09:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local143452495_0001
15/02/18 01:09:06 INFO mapred.LocalDistributedCacheManager: Localized hdfs://localhost:9000/bin/wordcount as file:/tmp/hadoop-abdulrahman/mapred/local/1424214545411/wordcount
15/02/18 01:09:06 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/02/18 01:09:06 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/02/18 01:09:06 INFO mapreduce.Job: Running job: job_local143452495_0001
15/02/18 01:09:06 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
15/02/18 01:09:06 INFO mapred.LocalJobRunner: Waiting for map tasks
15/02/18 01:09:06 INFO mapred.LocalJobRunner: Starting task: attempt_local143452495_0001_m_000000_0
15/02/18 01:09:06 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
15/02/18 01:09:06 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/input/data.txt:0+68
15/02/18 01:09:07 INFO mapred.MapTask: numReduceTasks: 1
15/02/18 01:09:07 INFO mapreduce.Job: Job job_local143452495_0001 running in uber mode : false
15/02/18 01:09:07 INFO mapreduce.Job: map 0% reduce 0%
15/02/18 01:09:07 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
15/02/18 01:09:07 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
15/02/18 01:09:07 INFO mapred.MapTask: soft limit at 83886080
15/02/18 01:09:07 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
15/02/18 01:09:07 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
15/02/18 01:09:07 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
15/02/18 01:09:08 INFO mapred.LocalJobRunner: map task executor complete.
15/02/18 01:09:08 WARN mapred.LocalJobRunner: job_local143452495_0001
java.lang.Exception: java.lang.NullPointerException
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.mapred.pipes.Application.<init>(Application.java:104)
at org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:69)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/02/18 01:09:08 INFO mapreduce.Job: Job job_local143452495_0001 failed with state FAILED due to: NA
15/02/18 01:09:08 INFO mapreduce.Job: Counters: 0
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:264)
at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:503)
at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:518)
Edit: I downloaded the source code of Hadoop and tracked down where the exception is happening; it seems that the exception occurs in the initialization stage, so the code inside the mapper/reducer isn't really the problem.
The function in Hadoop that produces the exception is this one:
/** Run a set of tasks and waits for them to complete. */
private void runTasks(List<RunnableWithThrowable> runnables,
    ExecutorService service, String taskType) throws Exception {
  // Start populating the executor with work units.
  // They may begin running immediately (in other threads).
  for (Runnable r : runnables) {
    service.submit(r);
  }

  try {
    service.shutdown(); // Instructs queue to drain.

    // Wait for tasks to finish; do not use a time-based timeout.
    // (See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6179024)
    LOG.info("Waiting for " + taskType + " tasks");
    service.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
  } catch (InterruptedException ie) {
    // Cancel all threads.
    service.shutdownNow();
    throw ie;
  }

  LOG.info(taskType + " task executor complete.");

  // After waiting for the tasks to complete, if any of these
  // have thrown an exception, rethrow it now in the main thread context.
  for (RunnableWithThrowable r : runnables) {
    if (r.storedException != null) {
      throw new Exception(r.storedException);
    }
  }
}
The problem, though, is that it stores the exception and then rethrows it wrapped, which is preventing me from knowing the actual source of the exception.
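That said, new Exception(r.storedException) keeps the original failure attached as the cause, which is why the log still shows the "Caused by: java.lang.NullPointerException" frame; a generic sketch of walking such a cause chain (plain Java, not Hadoop-specific):

public class CauseChainDemo {
    public static void main(String[] args) {
        try {
            // stand-in for the rethrow in runTasks(): the real failure is wrapped
            throw new Exception(new NullPointerException("original failure"));
        } catch (Exception e) {
            // the wrapped exception stays reachable through getCause()
            for (Throwable t = e; t != null; t = t.getCause()) {
                System.err.println(t);
            }
        }
    }
}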
Any help?
Also, if you need me to post more details please let me know.
Thank you,
So after a lot of research, I found out that the problem was actually caused by this line in pipes/Application.java (line 104):
byte[] password= jobToken.getPassword();
I changed the code and recompiled Hadoop:
byte[] password= "no password".getBytes();
if (jobToken != null)
{
password= jobToken.getPassword();
}
I got this from here
This solved the problem, and my program currently runs, but I am facing another problem where the program hangs at map 0% reduce 0%.
I will open another topic for that question.
Thank you,

algebraic error when running "aggregate" function on dataset

I'm learning Hadoop/Pig/Hive by running through tutorials on hortonworks.com.
I have indeed tried to find a link to the tutorial, but unfortunately it only ships with the ISA image that they provide to you; it's not actually hosted on their website.
batting = load 'Batting.csv' using PigStorage(',');
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;
I've copied their code exactly as it was stated in the tutorial and I'm getting this output:
2013-06-14 14:34:37,969 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1.1.3.0.0-107 (rexported) compiled May 20 2013, 03:04:35
2013-06-14 14:34:37,977 [main] INFO org.apache.pig.Main - Logging error messages to: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0020/attempt_201306140401_0020_m_000000_0/work/pig_1371245677965.log
2013-06-14 14:34:38,412 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /usr/lib/hadoop/.pigbootup not found
2013-06-14 14:34:38,598 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://sandbox:8020
2013-06-14 14:34:38,998 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: sandbox:50300
2013-06-14 14:34:40,819 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
2013-06-14 14:34:40,827 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: HASH_JOIN,GROUP_BY
2013-06-14 14:34:41,115 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2013-06-14 14:34:41,160 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
2013-06-14 14:34:41,201 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage->POForEach to POJoinPackage
2013-06-14 14:34:41,213 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 3
2013-06-14 14:34:41,213 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-reduce splittees.
2013-06-14 14:34:41,214 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 out of total 3 MR operators.
2013-06-14 14:34:41,214 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2
2013-06-14 14:34:41,488 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2013-06-14 14:34:41,551 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2013-06-14 14:34:41,555 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2013-06-14 14:34:41,559 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=6398990
2013-06-14 14:34:41,559 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2013-06-14 14:34:44,244 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job5371236206169131677.jar
2013-06-14 14:34:49,495 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job5371236206169131677.jar created
2013-06-14 14:34:49,517 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up multi store job
2013-06-14 14:34:49,529 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2013-06-14 14:34:49,530 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2013-06-14 14:34:49,530 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2013-06-14 14:34:49,755 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2013-06-14 14:34:50,144 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2013-06-14 14:34:50,145 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2013-06-14 14:34:50,256 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2013-06-14 14:34:50,316 [JobControl] INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library
2013-06-14 14:34:50,444 [JobControl] INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev cf4e7cbf8ed0f0622504d008101c2729dc0c9ff3]
2013-06-14 14:34:50,665 [JobControl] WARN org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library is available
2013-06-14 14:34:50,666 [JobControl] INFO org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop library
2013-06-14 14:34:50,666 [JobControl] INFO org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library loaded
2013-06-14 14:34:50,680 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2013-06-14 14:34:52,796 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201306140401_0021
2013-06-14 14:34:52,796 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases batting,grp_data,max_runs,runs
2013-06-14 14:34:52,796 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: batting[1,10],runs[2,7],max_runs[4,11],grp_data[3,11] C: max_runs[4,11],grp_data[3,11] R: max_runs[4,11]
2013-06-14 14:34:52,796 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://sandbox:50030/jobdetails.jsp?jobid=job_201306140401_0021
2013-06-14 14:36:01,993 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2013-06-14 14:36:04,767 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2013-06-14 14:36:04,768 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201306140401_0021 has failed! Stop running all dependent jobs
2013-06-14 14:36:04,768 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2013-06-14 14:36:05,029 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2106: Error executing an algebraic function
2013-06-14 14:36:05,030 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2013-06-14 14:36:05,042 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.2.0.1.3.0.0-107 0.11.1.1.3.0.0-107 mapred 2013-06-14 14:34:41 2013-06-14 14:36:05 HASH_JOIN,GROUP_BY
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_201306140401_0021 batting,grp_data,max_runs,runs MULTI_QUERY,COMBINER Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201306140401_0021_m_000000
Input(s):
Failed to read data from "hdfs://sandbox:8020/user/hue/batting.csv"
Output(s):
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201306140401_0021 -> null,
null
2013-06-14 14:36:05,042 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2013-06-14 14:36:05,043 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias join_data
Details at logfile: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0020/attempt_201306140401_0020_m_000000_0/work/pig_1371245677965.log
When I switch this part, MAX(runs.runs), to avg(runs.runs), I get a completely different issue:
2013-06-14 14:38:25,694 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1.1.3.0.0-107 (rexported) compiled May 20 2013, 03:04:35
2013-06-14 14:38:25,695 [main] INFO org.apache.pig.Main - Logging error messages to: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0022/attempt_201306140401_0022_m_000000_0/work/pig_1371245905690.log
2013-06-14 14:38:26,198 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /usr/lib/hadoop/.pigbootup not found
2013-06-14 14:38:26,438 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://sandbox:8020
2013-06-14 14:38:26,824 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: sandbox:50300
2013-06-14 14:38:28,238 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve avg using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0022/attempt_201306140401_0022_m_000000_0/work/pig_1371245905690.log
Anybody know what the issue might be?
I am sure a lot of people have already figured this out. I combined Eugene's solution with the original code from Hortonworks so that we get the exact output specified in the tutorial.
The following code works and produces the exact output specified in the tutorial:
batting = LOAD 'Batting.csv' using PigStorage(',');
runs_raw = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
runs = FILTER runs_raw BY runs > 0;
grp_data = group runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;
Note: line "runs = FILTER runs_raw BY runs > 0;" is additional than what has been provided by Hortonworks, thanks to Eugene for sharing working code which I used to modify original Hortonworks code to make it work.
UDFs are case sensitive, so at least to answer the second part of your question - you'll need to use AVG(runs.runs) instead of avg(runs.runs)
It's likely that once you correct your syntax you'll get the original error you reported...
I am having the exact same issue with the exact same log output, but this solution doesn't work because I believe changing MAX to AVG here defeats the whole purpose of this hortonworks.com tutorial - it was to get the MAX runs by playerID for each year.
UPDATE
Finally I got it resolved - you have to either remove the first line of Batting.csv (the column names) or edit your Pig Latin code like this:
batting = LOAD 'Batting.csv' using PigStorage(',');
runs_raw = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
runs = FILTER runs_raw BY runs > 0;
grp_data = group runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
dump max_runs;
After that you should be able to complete the tutorial correctly and get the proper result.
It also looks like this is due to a "bug" in the older version of Pig which was used in the tutorial.
Please specify appropriate data types for playerID, year, and runs, like below:
runs = FOREACH batting GENERATE $0 as playerID:int, $1 as year:chararray, $8 as runs:int;
Now it should work.
