I'm learning hadoop/pig/hive through running through tutorials on hortonworks.com
I have indeed tried to find a link to the tutorial, but unfortunately it only ships with the ISA image that they provide to you. It's not actually hosted on their website.
batting = load 'Batting.csv' using PigStorage(',');
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;
I've copied their code exactly as it was stated in the tutorial and I'm getting this output:
2013-06-14 14:34:37,969 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1.1.3.0.0-107 (rexported) compiled May 20 2013, 03:04:35
2013-06-14 14:34:37,977 [main] INFO org.apache.pig.Main - Logging error messages to: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0020/attempt_201306140401_0020_m_000000_0/work/pig_1371245677965.log
2013-06-14 14:34:38,412 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /usr/lib/hadoop/.pigbootup not found
2013-06-14 14:34:38,598 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://sandbox:8020
2013-06-14 14:34:38,998 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: sandbox:50300
2013-06-14 14:34:40,819 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
2013-06-14 14:34:40,827 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: HASH_JOIN,GROUP_BY
2013-06-14 14:34:41,115 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2013-06-14 14:34:41,160 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
2013-06-14 14:34:41,201 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage->POForEach to POJoinPackage
2013-06-14 14:34:41,213 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 3
2013-06-14 14:34:41,213 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-reduce splittees.
2013-06-14 14:34:41,214 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 out of total 3 MR operators.
2013-06-14 14:34:41,214 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2
2013-06-14 14:34:41,488 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2013-06-14 14:34:41,551 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2013-06-14 14:34:41,555 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2013-06-14 14:34:41,559 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=6398990
2013-06-14 14:34:41,559 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2013-06-14 14:34:44,244 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job5371236206169131677.jar
2013-06-14 14:34:49,495 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job5371236206169131677.jar created
2013-06-14 14:34:49,517 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up multi store job
2013-06-14 14:34:49,529 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2013-06-14 14:34:49,530 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2013-06-14 14:34:49,530 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2013-06-14 14:34:49,755 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2013-06-14 14:34:50,144 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2013-06-14 14:34:50,145 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2013-06-14 14:34:50,256 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2013-06-14 14:34:50,316 [JobControl] INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library
2013-06-14 14:34:50,444 [JobControl] INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev cf4e7cbf8ed0f0622504d008101c2729dc0c9ff3]
2013-06-14 14:34:50,665 [JobControl] WARN org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library is available
2013-06-14 14:34:50,666 [JobControl] INFO org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop library
2013-06-14 14:34:50,666 [JobControl] INFO org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library loaded
2013-06-14 14:34:50,680 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2013-06-14 14:34:52,796 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201306140401_0021
2013-06-14 14:34:52,796 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases batting,grp_data,max_runs,runs
2013-06-14 14:34:52,796 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: batting[1,10],runs[2,7],max_runs[4,11],grp_data[3,11] C: max_runs[4,11],grp_data[3,11] R: max_runs[4,11]
2013-06-14 14:34:52,796 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://sandbox:50030/jobdetails.jsp?jobid=job_201306140401_0021
2013-06-14 14:36:01,993 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2013-06-14 14:36:04,767 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2013-06-14 14:36:04,768 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201306140401_0021 has failed! Stop running all dependent jobs
2013-06-14 14:36:04,768 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2013-06-14 14:36:05,029 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2106: Error executing an algebraic function
2013-06-14 14:36:05,030 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2013-06-14 14:36:05,042 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.2.0.1.3.0.0-107 0.11.1.1.3.0.0-107 mapred 2013-06-14 14:34:41 2013-06-14 14:36:05 HASH_JOIN,GROUP_BY
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_201306140401_0021 batting,grp_data,max_runs,runs MULTI_QUERY,COMBINER Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201306140401_0021_m_000000
Input(s):
Failed to read data from "hdfs://sandbox:8020/user/hue/batting.csv"
Output(s):
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201306140401_0021 -> null,
null
2013-06-14 14:36:05,042 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2013-06-14 14:36:05,043 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias join_data
Details at logfile: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0020/attempt_201306140401_0020_m_000000_0/work/pig_1371245677965.log
When switching this part: MAX(runs.runs) to avg(runs.runs) then I am getting a completely different issue:
2013-06-14 14:38:25,694 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1.1.3.0.0-107 (rexported) compiled May 20 2013, 03:04:35
2013-06-14 14:38:25,695 [main] INFO org.apache.pig.Main - Logging error messages to: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0022/attempt_201306140401_0022_m_000000_0/work/pig_1371245905690.log
2013-06-14 14:38:26,198 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /usr/lib/hadoop/.pigbootup not found
2013-06-14 14:38:26,438 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://sandbox:8020
2013-06-14 14:38:26,824 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: sandbox:50300
2013-06-14 14:38:28,238 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve avg using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0022/attempt_201306140401_0022_m_000000_0/work/pig_1371245905690.log
Anybody know what the issue might be?
I am sure lot of people would have figured this out. I combined Eugene's solution with the original code from Hortonworks such that we get the exact output as specific in the tutorial.
Following code works and produces exact output as specified in the tutorial:
batting = LOAD 'Batting.csv' using PigStorage(',');
runs_raw = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
runs = FILTER runs_raw BY runs > 0;
grp_data = group runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;
Note: line "runs = FILTER runs_raw BY runs > 0;" is additional than what has been provided by Hortonworks, thanks to Eugene for sharing working code which I used to modify original Hortonworks code to make it work.
UDFs are case sensitive, so at least to answer the second part of your question - you'll need to use AVG(runs.runs) instead of avg(runs.runs)
It's likely that once you correct your syntax you'll get the original error you reported...
i am having the same exact same issue with exact same log output, but this solution doesn't work because i believe changing MAX with AVG here dumps the whole purpose of this hortonworks.com tutorial - it was to get the MAX runs by playerID for each year.
UPDATE
Finally i got it resolved - you have to either remove the first line in Batting.csv (column names) or edit your Pig Latin code like this:
batting = LOAD ‘Batting.csv’ using PigStorage(‘,’);
runs_raw = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
runs = FILTER runs_raw BY runs > 0;
grp_data = group runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
dump max_runs;
After that you should be able to complete tutorial correctly and get the proper result.
It also looks like this is due to the "bug" in the older versions of Pig rhich was used in the tutorial
Please specify appropriate data type for playerID, year & runs like below:
runs = FOREACH batting GENERATE $0 as playerID:int, $1 as year:chararray, $8 as runs:int;
Not, it should work.
Related
I can't seem to get GATK to recognise the number of available threads. I am running GATK (4.2.4.1) in a conda environment which is part of a nextflow (v20.10.0) pipeline I'm writing. For whatever reason, I cannot get GATK to see there is more than one thread. I've tried different node types, increasing and decreasing the number of cpus available, providing java arguments such as -XX:ActiveProcessorCount=16, using taskset, but it always just detects 1.
Here is the command from the .command.sh:
gatk HaplotypeCaller \
--tmp-dir tmp/ \
-ERC GVCF \
-R VectorBase-54_AgambiaePEST_Genome.fasta \
-I AE12A_S24_BP.bam \
-O AE12A_S24_BP.vcf
And here is the top of the .command.log file:
12:10:00.695 INFO HaplotypeCaller - ------------------------------------------------------------
12:10:00.695 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.2.4.1
12:10:00.695 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
12:10:00.696 INFO HaplotypeCaller - Executing on Linux v4.18.0-193.6.3.el8_2.x86_64 amd64
12:10:00.696 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v11.0.13+7-b1751.21
12:10:00.696 INFO HaplotypeCaller - Start Date/Time: 9 February 2022 at 12:10:00 GMT
12:10:00.696 INFO HaplotypeCaller - ------------------------------------------------------------
12:10:00.696 INFO HaplotypeCaller - ------------------------------------------------------------
12:10:00.697 INFO HaplotypeCaller - HTSJDK Version: 2.24.1
12:10:00.697 INFO HaplotypeCaller - Picard Version: 2.25.4
12:10:00.697 INFO HaplotypeCaller - Built for Spark Version: 2.4.5
12:10:00.697 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
12:10:00.697 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:10:00.697 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:10:00.697 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:10:00.697 INFO HaplotypeCaller - Deflater: IntelDeflater
12:10:00.697 INFO HaplotypeCaller - Inflater: IntelInflater
12:10:00.697 INFO HaplotypeCaller - GCS max retries/reopens: 20
12:10:00.698 INFO HaplotypeCaller - Requester pays: disabled
12:10:00.698 INFO HaplotypeCaller - Initializing engine
12:10:01.126 INFO HaplotypeCaller - Done initializing engine
12:10:01.129 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
12:10:01.143 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
12:10:01.143 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
12:10:01.162 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/home/anaconda3/envs/NF_GATK/share/gatk4-4.2.4.1-0/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
12:10:01.169 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/home/anaconda3/envs/NF_GATK/share/gatk4-4.2.4.1-0/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
12:10:01.209 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
12:10:01.210 INFO IntelPairHmm - Available threads: 1
12:10:01.210 INFO IntelPairHmm - Requested threads: 4
12:10:01.210 WARN IntelPairHmm - Using 1 available threads, but 4 were requested
12:10:01.210 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
12:10:01.271 INFO ProgressMeter - Starting traversal
I found a thread on the broad institute website suggesting it might be the OMP library, but this is seemingly loaded, and I'm using the version they suggested updating to...
Needless to say, this is a little slow. I can always parallelise by using the -L option, but this doesn't solve that every step in the pipeline will be very slow.
Thanks in advance.
In case anyone else has the same problem, it turned out I had to configure the submission as an MPI job.
So on the HPC I use, here is the nextflow process:
process DNA_HCG {
errorStrategy { sleep(Math.pow(2, task.attempt) * 600 as long); return 'retry' }
maxRetries 3
maxForks params.HCG_Forks
tag { SampleID+"-"+chrom }
executor = 'pbspro'
clusterOptions = "-lselect=1:ncpus=${params.HCG_threads}:mem=${params.HCG_memory}gb:mpiprocs=1:ompthreads=${params.HCG_threads} -lwalltime=${params.HCG_walltime}:00:00"
publishDir(
path: "${params.HCDir}",
mode: 'copy',
)
input:
each chrom from chromosomes_ch
set SampleID, path(bam), path(bai) from processed_bams
path ref_genome
path ref_dict
path ref_index
output:
tuple chrom, path("${SampleID}_${chrom}.vcf") into HCG_ch
path("${SampleID}_${chrom}.vcf.idx") into idx_ch
beforeScript 'module load anaconda3/personal; source activate NF_GATK'
script:
"""
mkdir tmp
n_slots=`expr ${params.GVCF_threads} / 2 - 3`
if [ \$n_slots -le 0 ]; then n_slots=1; fi
taskset -c 0-\${n_slots} gatk --java-options \"-Xmx${params.HCG_memory}G -XX:+UseParallelGC -XX:ParallelGCThreads=\${n_slots}\" HaplotypeCaller \\
--tmp-dir tmp/ \\
--pair-hmm-implementation AVX_LOGLESS_CACHING_OMP \\
--native-pair-hmm-threads \${n_slots} \\
-ERC GVCF \\
-L ${chrom} \\
-R ${ref_genome} \\
-I ${bam} \\
-O ${SampleID}_${chrom}.vcf ${params.GVCF_args}
"""
}
I think I solved this problem (at least for me, it worked well on SLURM). This comes from how GATK is configured for parallelizing jobs: it's based on OpenMP, so you should add to the beginning of your script something like this:
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
source
I am trying to execute MultiGpuLenetMnistExample.java
and i have received following error
"
...
12:41:24.129 [main] INFO Test - Load data....
12:41:24.716 [main] INFO Test - Build model....
12:41:25.500 [main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [JCublasBackend] backend
ND4J CUDA build version: 10.1.243
CUDA device 0: [Quadro K4000]; cc: [3.0]; Total memory: [3221225472];
12:41:26.692 [main] INFO org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for OpenMP: 32
12:41:26.746 [main] INFO org.nd4j.nativeblas.Nd4jBlas - Number of threads used for OpenMP BLAS: 0
12:41:26.755 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Backend used: [CUDA]; OS: [Windows 8.1]
12:41:26.755 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Cores: [24]; Memory: [3,5GB];
12:41:26.755 [main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Blas vendor: [CUBLAS]
12:41:26.755 [main] INFO org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner - Device Name: [Quadro K4000]; CC: [3.0]; Total/free memory: [3221225472]
12:41:26.844 [main] INFO org.deeplearning4j.nn.multilayer.MultiLayerNetwork - Starting MultiLayerNetwork with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]
12:41:27.957 [main] DEBUG org.nd4j.jita.allocator.impl.MemoryTracker - Free memory on device_0: 2709856256
Exception in thread "main" java.lang.RuntimeException: cudaGetSymbolAddress(...) failed; Error code: [13]
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.createShapeInfo(CudaExecutioner.java:2557)
at org.nd4j.linalg.api.shape.Shape.createShapeInformation(Shape.java:3282)
at org.nd4j.linalg.api.ndarray.BaseShapeInfoProvider.createShapeInformation(BaseShapeInfoProvider.java:76)
at org.nd4j.jita.constant.ProtectedCudaShapeInfoProvider.createShapeInformation(ProtectedCudaShapeInfoProvider.java:96)
at org.nd4j.jita.constant.ProtectedCudaShapeInfoProvider.createShapeInformation(ProtectedCudaShapeInfoProvider.java:77)
at org.nd4j.linalg.jcublas.CachedShapeInfoProvider.createShapeInformation(CachedShapeInfoProvider.java:44)
at org.nd4j.linalg.api.ndarray.BaseNDArray.<init>(BaseNDArray.java:211)
at org.nd4j.linalg.jcublas.JCublasNDArray.<init>(JCublasNDArray.java:383)
at org.nd4j.linalg.jcublas.JCublasNDArrayFactory.create(JCublasNDArrayFactory.java:1543)
at org.nd4j.linalg.jcublas.JCublasNDArrayFactory.create(JCublasNDArrayFactory.java:1538)
at org.nd4j.linalg.factory.Nd4j.create(Nd4j.java:4298)
at org.nd4j.linalg.factory.Nd4j.create(Nd4j.java:3986)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.init(MultiLayerNetwork.java:688)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.init(MultiLayerNetwork.java:604)
at Test.main(Test.java:80)
Process finished with exit code 1 "
is there any workaround about this problem?
2 options here: either build dl4j from sources for your target compute capability (3.0) or wait for next release, since we’re going to bring it back for 1 additional release.
At this point cc 3.0 is just considered deprecated by most frameworks afaik 😞
I am learning Project Reactor where I am exploring Schedulers factory.
I tried the following code:
ExecutorService executorService = Executors.newFixedThreadPool(10);
Flux.range(1,4)
.map(i -> {
logger.info(i +" [MAP] " + Thread.currentThread().getName());
return 10 / i;
})
.publishOn(Schedulers.fromExecutorService(executorService)) // .publishOn(Schedulers.parallel())
.subscribe(
n -> {
logger.info("START "+((Long)(System.currentTimeMillis() % 10000000L)).toString());
try {
Thread.sleep(100);
} catch (InterruptedException e) {
e.printStackTrace();
}
logger.info(n.toString());
logger.info("END "+((Long)(System.currentTimeMillis() % 10000000L)).toString());
}
);
executorService.shutdown();
This code was tried with Schedulers.parallel() and Schedulers.elastic() as well. Also, tried with subscribeOn() operator to see similar results.
The logs are:
02:07:30.142 [main] INFO - 1 [MAP] main
02:07:30.143 [main] INFO - 2 [MAP] main
02:07:30.143 [main] INFO - 3 [MAP] main
02:07:30.143 [main] INFO - 4 [MAP] main
02:07:30.143 [pool-1-thread-2] INFO - START 1050143
02:07:30.247 [pool-1-thread-2] INFO - 10
02:07:30.247 [pool-1-thread-2] INFO - END 1050247
02:07:30.247 [pool-1-thread-2] INFO - START 1050247
02:07:30.350 [pool-1-thread-2] INFO - 5
02:07:30.350 [pool-1-thread-2] INFO - END 1050350
02:07:30.350 [pool-1-thread-2] INFO - START 1050350
02:07:30.455 [pool-1-thread-2] INFO - 3
02:07:30.455 [pool-1-thread-2] INFO - END 1050455
02:07:30.455 [pool-1-thread-2] INFO - START 1050455
02:07:30.557 [pool-1-thread-2] INFO - 2
02:07:30.558 [pool-1-thread-2] INFO - END 1050558
Since the Flux's elements are ordered and operated upon in sequence (apparent from the logs above), having multiple threads for an operator (or operator chain) for one element does not make sense. I am sure I am either misinterpreting the Schedulers or lack somewhere in my basic understanding. Can someone point me to the right direction?
I understand the purpose of Schedulers to make the processing asynchronous and unhold the main thread. But why would anyone want to give multiple threads to the operator(s) when operated at one element at a time.
Does it makes sense only when we deal with flatMap operator?
I have this example of MapReduce [1], and I want to print info in the stdout and in a log file[3]. It seems that the logs isn’t print something. How can I make my map class print output?
I also have configured yarn-site.xml to retain log[2]. Although the logs are retained in the /app-logs dir, the userlogs dir that contains the output of the job execution is deleted at the end of the job execution. How can I make MapReduce to not delete files in the userlogs dir?
I am using Yarn.
Thanks,
[1] Wordcount exampla with just the map part.
public class MyWordCount {
public static class MyMap extends Mapper {
Log log = LogFactory.getLog(MyWordCount.class);
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
StringTokenizer itr = new StringTokenizer(value.toString());
System.out.println("HERRE");
log.info("HERRRRRE");
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, one);
}
}
public void run(Context context) throws IOException, InterruptedException {
setup(context);
try {
while (context.nextKeyValue()) {
System.out.println("Key: " + context.getCurrentKey() + " Value: " + context.getCurrentValue());
map(context.getCurrentKey(), context.getCurrentValue(), context);
}
} finally {
cleanup(context);
}
}
public void cleanup(Mapper.Context context) {}
}
[2] yarn-site.xml
<!-- job history -->
<property> <name>yarn.log-aggregation-enable</name> <value>true</value> </property>
<property> <name>yarn.nodemanager.log.retain-seconds</name> <value>900000</value> </property>
<property> <name>yarn.nodemanager.remote-app-log-dir</name> <value>/app-logs</value> </property>
[3] log output
Log Type: stderr
Log Upload Time: 24-Sep-2015 12:45:19
Log Length: 317
Java HotSpot(TM) Client VM warning: You have loaded library /home/xubuntu/Programs/hadoop-2.6.0/lib/native/libhadoop.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
Log Type: stdout
Log Upload Time: 24-Sep-2015 12:45:19
Log Length: 0
Log Type: syslog
Log Upload Time: 24-Sep-2015 12:45:19
Log Length: 2604
2015-09-24 12:45:04,569 WARN [main] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-09-24 12:45:05,139 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2015-09-24 12:45:05,412 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2015-09-24 12:45:05,413 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system started
2015-09-24 12:45:05,462 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
2015-09-24 12:45:05,463 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1443113036547_0001, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier#1b5a082)
2015-09-24 12:45:05,847 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
2015-09-24 12:45:06,915 INFO [main] org.apache.hadoop.mapred.YarnChild: mapreduce.cluster.local.dir for child: /tmp/hadoop-temp/nm-local-dir/usercache/xubuntu/appcache/application_1443113036547_0001
2015-09-24 12:45:07,604 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
2015-09-24 12:45:09,402 INFO [main] org.apache.hadoop.mapred.Task: Using ResourceCalculatorProcessTree : [ ]
2015-09-24 12:45:10,187 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: hdfs://hadoop-coc-1:9000/input1/b.txt:0+21
2015-09-24 12:45:10,812 INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1443113036547_0001_m_000000_0 is done. And is in the process of committing
2015-09-24 12:45:10,969 INFO [main] org.apache.hadoop.mapred.Task: Task attempt_1443113036547_0001_m_000000_0 is allowed to commit now
2015-09-24 12:45:10,993 INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of task 'attempt_1443113036547_0001_m_000000_0' to hdfs://192.168.10.110:9000/output1-1442847968/_temporary/1/task_1443113036547_0001_m_000000
2015-09-24 12:45:11,135 INFO [main] org.apache.hadoop.mapred.Task: Task 'attempt_1443113036547_0001_m_000000_0' done.
2015-09-24 12:45:11,135 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system...
2015-09-24 12:45:11,136 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped.
2015-09-24 12:45:11,136 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system shutdown complete.
I have found the error. It is a bug in my code.
I'm getting an exception from a Pig script and haven't been able to nail down the cause. I'm fairly new to Pig & have searched for
various topics based on the exception I'm getting but haven't been
able to find anything meaningful. From the grunt shell & log I've
looked for different variations of these - unable to read pigs
manifest file java.lang.NegativeArraySizeException: -1 ERROR 1066:
Unable to open iterator for alias F. Backend error : -1
I'm using Hadoop version 2.0.0-cdh4.6.0 & Pig version 0.11.0, running
from the Grunt shell.
My Pig script reads a file, does some manipulation on the data
(including calling a Java UDF), joins to an HBase table, then DUMPs
the output. Pretty simple. I can DUMP the intermediate result (alias
B) and the data looks fine.
I've tested the Java function from Pig using the same input file and
have seen it return values as I'd expect, and I've tested the function
locally outside the Pig script. The Java function is provided a number
of days from 01-01-1900 & uses joda-time v2.7 to return a Datetime.
Initially, the UDF was accepting a tuple as input. I've tried changing
the UDF input format to Byte and most recently String and casting to
Datetime in Pig upon returning, but am still getting the same error.
When I change my Pig script merely to not call the UDF it works fine.
The NegativeArray error sounds like the data is out of whack for the
Dump, possibly from some kind of format issue, but I don't see how.
Pig script
A = LOAD 'tst2_SplitGroupMax.txt' using PigStorage(',')
as (id:bytearray, year:int, doy:int, month:int, dayOfMonth:int,
awh_minTemp:double, awh_maxTemp:double,
nws_minTemp:double, nws_maxTemp:double,
wxs_minTemp:double, wxs_maxTemp:double,
tcc_minTemp:double, tcc_maxTemp:double
) ;
register /import/pool2/home/NA1000APP-TPSDM/ejbles/Test-0.0.1-SNAPSHOT-jar-with-dependencies.jar;
B = FOREACH A GENERATE id as msmtid, SUBSTRING(id,0,8) as gridid, SUBSTRING(id,9,20) as msmt_days,
year, doy, month, dayOfMonth,
CONCAT(CONCAT(CONCAT((chararray)year,'-'),CONCAT((chararray)month,'-')),(chararray)dayOfMonth) as msmt_dt,
ToDate(monutil.geoloc.GridIDtoDatetime(id)) as func_msmt_dt,
awh_minTemp, awh_maxTemp,
nws_minTemp, nws_maxTemp,
wxs_minTemp, wxs_maxTemp,
tcc_minTemp, tcc_maxTemp
;
E = LOAD 'hbase://wxgrid_detail' using org.apache.pig.backend.hadoop.hbase.HBaseStorage
('loc:country, loc:fips, loc:l1 ,loc:l2, loc:latitude, loc:longitude',
'-loadKey=true -caster=HBaseBinaryConverter')
as (wxgrid:bytearray, country:chararray, fips:chararray, l1:chararray, l2:chararray,
latitude:double, longitude:double);
F = join B by gridid, E by wxgrid;
DUMP F; --- This is where I get the exception
Here's an excerpt from what's returned in the Grunt shell -
2015-06-15 12:23:24,204 [main] WARN
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure. 2015-06-15 12:23:24,205 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- job job_201502081759_916870 has failed! Stop running all dependent jobs 2015-06-15 12:23:24,205 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 100% complete 2015-06-15 12:23:24,221 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR: -1 2015-06-15
12:23:24,221 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil -
1 map reduce job(s) failed! 2015-06-15 12:23:24,223 [main] WARN
org.apache.pig.tools.pigstats.ScriptState - unable to read pigs
manifest file 2015-06-15 12:23:24,224 [main] INFO
org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt
Features
2.0.0-cdh4.6.0 na1000app-tpsdm 2015-06-15 12:22:39 2015-06-15 12:23:24 HASH_JOIN
Failed!
Failed Jobs: JobId Alias Feature Message Outputs
job_201502081759_916870 A,B,E,F HASH_JOIN Message: Job failed!
hdfs://nameservice1/tmp/temp-238648079/tmp-1338617620,
Input(s): Failed to read data from "hbase://wxgrid_detail" Failed to
read data from
"hdfs://nameservice1/user/na1000app-tpsdm/tst2_SplitGroupMax.txt"
Output(s): Failed to produce result in
"hdfs://nameservice1/tmp/temp-238648079/tmp-1338617620"
Counters: Total records written : 0 Total bytes written : 0 Spillable
Memory Manager spill count : 0 Total bags proactively spilled: 0 Total
records proactively spilled: 0
Job DAG: job_201502081759_916870
2015-06-15 12:23:24,224 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Failed! 2015-06-15 12:23:24,234 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator
for alias F. Backend error : -1 Details at logfile:
/import/pool2/home/NA1000APP-TPSDM/ejbles/pig_1434388844905.log
And here's the log -
Backend error message
--------------------- java.lang.NegativeArraySizeException: -1 at org.apache.hadoop.hbase.util.Bytes.readByteArray(Bytes.java:148) at
org.apache.hadoop.hbase.mapreduce.TableSplit.readFields(TableSplit.java:133)
at
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:73)
at
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:44)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit.readFields(PigSplit.java:233)
at
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:73)
at
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:44)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:356)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:640)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) at
org.apache.hadoop.mapred.Child$4.run(Ch
Pig Stack Trace
--------------- ERROR 1066: Unable to open iterator for alias F. Backend error : -1
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable
to open iterator for alias F. Backend error : -1 at
org.apache.pig.PigServer.openIterator(PigServer.java:828) at
org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:696)
at
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:320)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69) at
org.apache.pig.Main.run(Main.java:538) at
org.apache.pig.Main.main(Main.java:157) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597) at
org.apache.hadoop.util.RunJar.main(RunJar.java:208) Caused by:
java.lang.NegativeArraySizeException: -1 at
org.apache.hadoop.hbase.util.Bytes.readByteArray(Bytes.java:148) at
org.apache.hadoop.hbase.mapreduce.TableSplit.readFields(TableSplit.java:133)
at
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:73)
at
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:44)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit.readFields(PigSplit.java:233)
at
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:73)
at
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:44)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:356)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:640)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)