KiteSDK MapReduce: EOF exception during parquet file load - java

I have a Hadoop MapReduce job which uses the Kite SDK DatasetKeyInputFormat. It is configured to read a Parquet file.
Every time I run the job I get the following exception:
Error: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at parquet.hadoop.ParquetInputSplit.readArray(ParquetInputSplit.java:304)
at parquet.hadoop.ParquetInputSplit.readFields(ParquetInputSplit.java:263)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:372)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:754)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
The same file can be read successfully by the MapReduce jobs that Hive creates, i.e. I can query it successfully.
To isolate the possible issue, I created a MapReduce job based on the Kite SDK example for MapReduce, but I still get the same exception.
Note: the Avro and CSV formats work fine.
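For reference, here is a minimal sketch of the kind of job wiring involved, modeled on the Kite SDK MapReduce example; the dataset URI and output path below are placeholders, not the actual ones from my job:

import org.apache.avro.generic.GenericData;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.kitesdk.data.mapreduce.DatasetKeyInputFormat;

public class ParquetReadJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(ParquetReadJob.class);
        // Placeholder URI standing in for the Parquet-backed dataset described above.
        DatasetKeyInputFormat.configure(job)
                .readFrom("dataset:hdfs:/datasets/events")
                .withType(GenericData.Record.class);
        job.setMapperClass(Mapper.class); // identity mapper, map-only job
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(GenericData.Record.class);
        job.setOutputValueClass(NullWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("/tmp/parquet-read-out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that the EOFException above is thrown while the map task deserializes the ParquetInputSplit (see MapTask.getSplitDetails in the trace), i.e. before any mapper code runs.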

Related

Filepath on Mac for local Parquet files in Spark program in Java

I have a small Spark program in Java that reads Parquet files from a local directory on a Mac.
I have been trying to do this multiple ways, but nothing seems to work.
Dataset<Row> dsuomcategoryconvfactor = spark.read().parquet(path + "file:///usr/local⁩/ParquetData/data1.parquet");
I think I am giving Spark the wrong path, and it's throwing the error below.
20/01/06 10:58:29 INFO SharedState: Warehouse path is 'file:/usr/local/Cellar/apache-spark/2.4.4/libexec/work/driver-20200106105812-0006/spark-warehouse'.
20/01/06 10:58:29 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:65)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: org.apache.spark.sql.AnalysisException: Path does not exist: file:/usr/local⁩/ParquetData/data1.parquet;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:558)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:644)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:628)
This works fine when run from an IDE, but when I submit the job from the shell using spark-submit, this error is thrown.
Any help would be appreciated.
Thanks!
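For comparison, a minimal sketch of reading a local Parquet file with a single fully qualified file: URI; it assumes the file really is at /usr/local/ParquetData/data1.parquet and uses a local master so the path resolves on the driver:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LocalParquetRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("LocalParquetRead")
                .master("local[*]") // local mode, so file: paths resolve on the driver
                .getOrCreate();
        // One complete URI; no base path concatenated in front of "file://".
        Dataset<Row> df = spark.read().parquet("file:///usr/local/ParquetData/data1.parquet");
        df.show();
        spark.stop();
    }
}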

Mapr distcp: FileNotFoundException

I am trying to copy files between two MapR clusters:
hadoop distcp maprfs:///path/to/source/hive/table maprfs:///path/to/destination/hive/table
The source table has some data, but the destination table is empty; I have only run the create table command on the destination.
But for the above distcp command, I get:
java.io.FileNotFoundException: Requested file maprfs:/var/mapr/cluster/yarn/rm/staging/some_user/.staging/job_1528383939830_608130 does not exist.
at com.mapr.fs.MapRFileSystem.getMapRFileStatus(MapRFileSystem.java:1262)
at com.mapr.fs.MapRFileSystem.getFileStatus(MapRFileSystem.java:927)
at com.mapr.fs.MFS.getFileStatus(MFS.java:151)
at org.apache.hadoop.fs.AbstractFileSystem.resolvePath(AbstractFileSystem.java:467)
at org.apache.hadoop.mapred.YARNRunner.createApplicationSubmissionContext(YARNRunner.java:370)
at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:306)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:243)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.tools.DistCp.createAndSubmitJob(DistCp.java:183)
at org.apache.hadoop.tools.DistCp.execute(DistCp.java:153)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:126)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:430)
What is this staging file, and how do I get it? Is my syntax correct?
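One way to sanity-check the failing path from the submitting node is to ask the FileSystem API directly whether the staging root exists and what permissions it has. A minimal sketch, assuming the MapR client configuration is on the classpath; the staging path is taken from the stack trace above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StagingDirCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Staging root reported in the FileNotFoundException above.
        Path staging = new Path("maprfs:///var/mapr/cluster/yarn/rm/staging");
        FileSystem fs = staging.getFileSystem(conf);
        boolean exists = fs.exists(staging);
        System.out.println(staging + " exists: " + exists);
        if (exists) {
            System.out.println("permissions: " + fs.getFileStatus(staging).getPermission());
        }
    }
}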

I am getting org.apache.hadoop.ipc.RemoteException: java.io.IOException in a MapReduce job

I installed Hadoop using the following steps:
Installed Hadoop from the tar file
Created an hdfs user and group and assigned them to the Hadoop folder
Created HDFS directories for the NameNode and DataNode in the /opt folder
The configuration files are also set.
But when I try to run hadoop jar hadoop-examples-1.0.0.jar pi 4 100, I get this error:
2014-11-05 12:12:02,978 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
2014-11-05 12:12:02,978 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/hadoop-hdfs/mapred/system/jobtracker.info" - Aborting...
2014-11-05 12:12:02,979 WARN org.apache.hadoop.mapred.JobTracker: Writing to file hdfs://hostname:9000/tmp/hadoop-hdfs/mapred/system/jobtracker.info failed!
2014-11-05 12:12:02,979 WARN org.apache.hadoop.mapred.JobTracker: FileSystem is not ready yet!
2014-11-05 12:12:02,982 WARN org.apache.hadoop.mapred.JobTracker: Failed to initialize recovery manager.
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /tmp/hadoop-hdfs/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1556)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
at org.apache.hadoop.ipc.Client.call(Client.java:1066)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at com.sun.proxy.$Proxy5.addBlock(Unknown Source)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at com.sun.proxy.$Proxy5.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3507)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3370)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2700(DFSClient.java:2586)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2826)
One thing I want to mention here is that I set the HDFS paths to the /mnt directory, but HDFS is still pointing to /tmp/hadoop-hdfs.
Please give some suggestions.
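For reference, one way to check which directories the client actually resolves is to print the effective configuration. A minimal sketch, using the Hadoop 1.x property names (dfs.name.dir, dfs.data.dir) and assuming the site files are on the classpath:

import org.apache.hadoop.conf.Configuration;

public class PrintHdfsPaths {
    public static void main(String[] args) {
        // new Configuration() loads core-default.xml and core-site.xml from the classpath.
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");
        // hadoop.tmp.dir defaults to /tmp/hadoop-${user.name}; many other paths derive from it.
        System.out.println("hadoop.tmp.dir = " + conf.get("hadoop.tmp.dir"));
        System.out.println("dfs.name.dir   = " + conf.get("dfs.name.dir"));
        System.out.println("dfs.data.dir   = " + conf.get("dfs.data.dir"));
    }
}

If these still print /tmp-based values, the /mnt overrides are not being picked up from the configuration files.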
Check all the paths of the action node you are trying to run; this usually occurs due to wrong input/output paths being provided.
Also, if you are rerunning a workflow job, make sure all the events and properties provided in coordinator.xml (or job.xml) are present in job.properties, because rerunning a workflow job does not refer to job.xml, unlike a normal (scheduled) coordinator job run.

Error running hadoop job from Eclipse

I am facing some issues while trying to launch a MapReduce job on our Hadoop cluster from Eclipse. I have added a folder named "conf" as a class folder, and under that folder I have imported "core-site.xml", "hdfs-site.xml", "mapred-site.xml" and "hbase-site.xml". My Hadoop cluster runs Hadoop 0.20.205.0 and HBase 0.94.1. We are able to successfully submit jobs to the cluster using the "hadoop jar" command. Since this is very cumbersome, I want to set up Eclipse so that I can submit Hadoop jobs to the cluster by just running the program.
After I added the required dependencies to the project, I get the following exception when I run the "PiEstimator.java" example (from Hadoop 0.20.205.0).
Number of Maps = 4
Samples per Map = 4
Exception in thread "main" org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.NoSuchMethodException: org.apache.hadoop.hdfs.protocol.ClientProtocol.create(java.lang.String, org.apache.hadoop.fs.permission.FsPermission, java.lang.String, boolean, boolean, short, long)
at java.lang.Class.getMethod(Class.java:1632)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
at org.apache.hadoop.ipc.Client.call(Client.java:1066)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at com.sun.proxy.$Proxy1.create(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at com.sun.proxy.$Proxy1.create(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3245)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:713)
at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:182)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:555)
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:892)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:393)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:284)
at com.amazon.seo.mapreduce.examples.PiEstimator.estimate(PiEstimator.java:265)
at com.amazon.seo.mapreduce.examples.PiEstimator.run(PiEstimator.java:325)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at com.amazon.seo.mapreduce.examples.PiEstimator.main(PiEstimator.java:333)
Can you please help me understand what part of my setup is wrong and how to fix it?
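For context, a minimal sketch of how the submission is wired up, using the stock org.apache.hadoop.examples.PiEstimator in place of our internal copy, and assuming the cluster's site files are on the classpath via the "conf" class folder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.PiEstimator;
import org.apache.hadoop.util.ToolRunner;

public class SubmitFromEclipse {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Explicitly load the cluster configs imported into the "conf" class
        // folder, so the job goes to the remote cluster instead of running locally.
        conf.addResource("core-site.xml");
        conf.addResource("hdfs-site.xml");
        conf.addResource("mapred-site.xml");
        // PiEstimator.run expects: <number of maps> <samples per map>
        System.exit(ToolRunner.run(conf, new PiEstimator(), new String[] {"4", "4"}));
    }
}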
Were you able to resolve this issue?
I've resolved a similar error with the class definition as follows:
Create the JAR as a runnable JAR file (File >> Export >> Runnable JAR file)
export HADOOP_CLASSPATH =
This will make Hadoop pick up the correct class from your JAR file.
Unfortunately, I believe that you will have to upgrade your Hadoop version to at least 2.5.0.

Hadoop Pig Cassandra get_range_slices error

I'm using Cassandra 1.0.9 with the latest Pig and Hadoop to execute MapReduce tasks.
It's just a simple task scripted in Pig to extract two columns from the Cassandra database.
It seems to work, and then it encounters this problem:
java.lang.RuntimeException: org.apache.thrift.TApplicationException: Internal error processing get_range_slices
at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:334)
at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:348)
at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:222)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:178)
at org.apache.cassandra.hadoop.pig.CassandraStorage.getNext(Unknown Source)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:194)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: org.apache.thrift.TApplicationException: Internal error processing get_range_slices
at org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:754)
at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:734)
at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:289)
... 17 more
Is there a way around it? I could show the Pig script on request.
