I am trying to implement a recommender system using Spark's collaborative filtering.
First I train the model and save it to disk:
MatrixFactorizationModel model = trainModel(inputDataRdd);
model.save(jsc.sc(), "/op/tc/model/");
When I load the model in a separate process, the program fails with the exception below:
Code:
static JavaSparkContext jsc;
private static Options options;

static {
    SparkConf conf = new SparkConf().setAppName("TC recommender application");
    conf.set("spark.driver.allowMultipleContexts", "true");
    jsc = new JavaSparkContext(conf);
}
MatrixFactorizationModel model = MatrixFactorizationModel.load(jsc.sc(),
        "/op/tc/model/");
Exception:
Exception in thread "main" java.io.IOException: Not a file: maprfs:/op/tc/model/data
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:324)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
at org.apache.spark.rdd.RDD$$anonfun$aggregate$1.apply(RDD.scala:1114)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.aggregate(RDD.scala:1107)
at org.apache.spark.mllib.recommendation.MatrixFactorizationModel.countApproxDistinctUserProduct(MatrixFactorizationModel.scala:96)
at org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:126)
at com.aexp.cxp.recommendation.ProductRecommendationIndividual.main(ProductRecommendationIndividual.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:742)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Is there any configuration I need to set to load the model? Any suggestion would be a great help.
In Spark, as in any other distributed computing framework, it is important to understand where the code runs when you are trying to debug it. It is also important to have access to the various types of logs. For example, on YARN, you would have:
the master logs, if you record them yourself
the aggregated slave logs (thanks YARN, useful feature!)
the YARN node manager logs (which will, for example, tell you why a container was killed, etc.)
etc.
Digging into Spark issues can be quite time-consuming if you don't look in the right place from the start. Now, more specifically on this question, you have a clear stack trace, which is not always the case, so you should use it to your advantage.
The top of the stack trace is:
Exception in thread "main" java.io.IOException: Not a file: maprfs:/op/tc/model/data
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:324)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
...
As you can see, the Spark job was executing a map operation when it failed. Who executes a map? The slaves. Therefore, you have to make sure your file is available on all slaves, not only on the master.
More generally, you always need to make a clear distinction in your head between the code you are writing for the master and the code you are writing for the slaves. This will help you detect this kind of interaction, as well as references to non-serializable objects and other such common mistakes.
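If you want to see what the load step is actually pointed at, a small diagnostic sketch such as the one below can help; the path is the one from the question, and the listing itself is only an assumption about what is worth checking. MatrixFactorizationModel.save writes data and metadata subdirectories under the given path, so the load will fail if /op/tc/model/data is missing or unreadable from the cluster.
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Diagnostic sketch: list what the cluster filesystem holds under the model
// directory before calling MatrixFactorizationModel.load (path from the question).
FileSystem fs = FileSystem.get(jsc.hadoopConfiguration());
for (FileStatus status : fs.listStatus(new Path("/op/tc/model/"))) {
    System.out.println(status.getPath() + "  isDirectory=" + status.isDirectory());
}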
Related
I am trying to connect to MongoDB using Spark (the Java Spark API).
When I try to submit the job, it fails with the error message below:
20/07/05 17:32:00 ERROR DefaultMongoPartitioner:
----------------------------------------
WARNING: MongoDB version < 3.2 detected.
----------------------------------------
With legacy MongoDB installations you will need to explicitly configure the Spark Connector with a partitioner.
This can be done by:
* Setting a "spark.mongodb.input.partitioner" in SparkConf.
* Setting in the "partitioner" parameter in ReadConfig.
* Passing the "partitioner" option to the DataFrameReader.
The following Partitioners are available:
* MongoShardedPartitioner - for sharded clusters, requires read access to the config database.
* MongoSplitVectorPartitioner - for single nodes or replicaSets. Utilises the SplitVector command on the primary.
* MongoPaginateByCountPartitioner - creates a specific number of partitions. Slow as requires a query for every partition.
* MongoPaginateBySizePartitioner - creates partitions based on data size. Slow as requires a query for every partition.
Exception in thread "main" java.lang.UnsupportedOperationException: The DefaultMongoPartitioner requires MongoDB >= 3.2
at com.mongodb.spark.rdd.partitioner.DefaultMongoPartitioner.partitions(DefaultMongoPartitioner.scala:58)
at com.mongodb.spark.rdd.MongoRDD.getPartitions(MongoRDD.scala:137)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1928)
at org.apache.spark.rdd.RDD.count(RDD.scala:1164)
at org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:440)
at org.apache.spark.api.java.AbstractJavaRDDLike.count(JavaRDDLike.scala:46)
at com.virtualpairprogrammers.JavaIntroduction.main(JavaIntroduction.java:30)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:750)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I have tried the following options, but it still throws the same error message:
SparkConf conf = new SparkConf()
.setAppName("MongoSparkConnectorTour")
.set("spark.app.id", "MongoSparkConnectorTour")
.set("spark.mongodb.input.uri", uri)
.set("spark.mongodb.output.uri", uri)
.set("partitioner", "MongoPaginateBySizePartitioner")
.set("spark.mongodb.input.partitionerOptions.partitionSizeMB", "64");
Let me know if I am missing something here that is causing the error.
I am not able to identify why it still falls back to the DefaultMongoPartitioner.
Thanks in Advance,
Sam
There was an issue with the parameters being sent. Below is the correct format:
SparkConf conf = new SparkConf()
.setAppName("MongoSparkConnectorTour")
.set("spark.app.id", "MongoSparkConnectorTour")
.set("spark.mongodb.input.uri", uri)
.set("spark.mongodb.output.uri", uri)
.set("spark.mongodb.input.partitioner", "MongoPaginateBySizePartitioner")
.set("spark.mongodb.input.partitionerOptions.partitionSizeMB", "64");
The issue is solved and I am able to connect and extract data without any problems.
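For completeness, a minimal read sketch using the corrected settings might look like the following; the uri and conf variables are the ones above, and MongoSpark.load(JavaSparkContext) is the standard connector entry point, but the rest is just an assumed usage example.
import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.rdd.api.java.JavaMongoRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.Document;

// Build the context from the corrected SparkConf; the connector picks up the
// spark.mongodb.input.* settings, including the partitioner, from there.
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaMongoRDD<Document> mongoRdd = MongoSpark.load(jsc);
System.out.println("Documents read: " + mongoRdd.count());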
I have a docker container running the gremlin-server.
It was started via:
./bin/gremlin-server.sh conf/gremlin-server/gremlin-server.yaml
From within a docker container, running this image:
https://hub.docker.com/r/janusgraph/janusgraph
The server is up and listening on port 8182:
$ docker ps
6019adda6081 janusgraph/janusgraph "docker-entrypoint.s…" 2 days ago Up 26 hours 0.0.0.0:8182->8182/tcp
I am interested in using a schema and indexes.
Janus offers this here: https://docs.janusgraph.org/basics/schema/
The following is the configuration I use to attempt to connect to the gremlin-server:
AbstractConfiguration config = new BaseConfiguration();
config.setListDelimiter('/');
// contents of conf/remote-graph.properties
config.setProperty("gremlin.remote.driver.sourceName", "g");
config.setProperty("gremlin.remote.remoteConnectionClass", "org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection");
// contents of conf/remote-objects.yaml:
config.setProperty("clusterConfiguration.hosts", databaseUrl);
config.setProperty("clusterConfiguration.port", 8182);
config.setProperty("clusterConfiguration.serializer.className", "org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0/");
config.setProperty("storage.backend", "cql");
config.setProperty("clusterConfiguration.serializer.config.ioRegistries", "org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry");
When I call
GraphTraversalSource g = traversal().withRemote(config);
I get a traversal source and everything seems fine. However, to use the management features that Janus provides, I seem to need a JanusGraphManagement object. I cannot take the generic Graph object above and cast it to a JanusGraph. The docs suggest using a JanusGraphFactory: https://docs.janusgraph.org/basics/configuration/#janusgraphfactory
So I call
JanusGraph janusGraph = JanusGraphFactory.open(config);
I get the following stack trace:
Exception in thread "main" java.lang.IllegalArgumentException: Could not find implementation class: org.janusgraph.diskstorage.cql.CQLStoreManager
at org.janusgraph.util.system.ConfigurationUtil.instantiate(ConfigurationUtil.java:60)
at org.janusgraph.diskstorage.Backend.getImplementationClass(Backend.java:440)
at org.janusgraph.diskstorage.Backend.getStorageManager(Backend.java:411)
at org.janusgraph.graphdb.configuration.builder.GraphDatabaseConfigurationBuilder.build(GraphDatabaseConfigurationBuilder.java:50)
at org.janusgraph.core.JanusGraphFactory.open(JanusGraphFactory.java:161)
at org.janusgraph.core.JanusGraphFactory.open(JanusGraphFactory.java:132)
at org.janusgraph.core.JanusGraphFactory.open(JanusGraphFactory.java:112)
at com.activitystream.database.GraphMigration.migrateDatabase(GraphMigration.java:69)
at com.activitystream.runners.persistence.DataStores.migrateDatabase(DataStores.java:27)
at com.activitystream.runners.persistence.EntityPersistenceRunner.main(EntityPersistenceRunner.java:23)
Caused by: java.lang.ClassNotFoundException: org.janusgraph.diskstorage.cql.CQLStoreManager
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:315)
at org.janusgraph.util.system.ConfigurationUtil.instantiate(ConfigurationUtil.java:56)
... 9 more
Is it possible to modify the schema over a remote connection?
If it is not possible, how can one modify the schema?
Any insight would be appreciated.
You basically have two choices - either:
Interact with your JanusGraphManagement object by way of scripts sent to Gremlin Server (typically by way of a session but I guess you could package an entire "management script" together and submit it as one request) or
Bypass Gremlin Server and instantiate your JanusGraphManagement object locally, as directed in the JanusGraph documentation.
There is no way to return a JanusGraphManagement object to your client, as it is not a serializable object that can be sent back from the server.
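As a rough sketch of the first option (the host, port and property key name below are placeholders, and it assumes the server binds the graph under the variable graph, as the stock JanusGraph server configuration does), a management script can be submitted through the driver like this:
import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;

// Send the whole management operation to the server as one script, so the
// JanusGraphManagement object only ever lives on the server side.
Cluster cluster = Cluster.build("localhost").port(8182).create();
Client client = cluster.connect();
String script =
        "mgmt = graph.openManagement();"
      + "mgmt.makePropertyKey('name').dataType(String.class).make();"
      + "mgmt.commit()";
client.submit(script).all().get();
cluster.close();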
We have a requirement to load multiple Parquet files into a Spark session and count the number of records in each file. We wanted to do this in parallel, so we chose the Java 8 parallel stream approach, but it throws an exception. If we use a sequential stream there are no issues, but we want to load all the files in parallel.
List<String> list = Arrays.asList("hdfs://NNcluster/finalsnappy/2019/4/8291/table_8291_69_2019_04_01", "hdfs://NNcluster/finalsnappy/2019/4/8291/table_8291_69_2019_04_02");
list.parallelStream().forEach(fileName -> {
    SparkSession spark = get_spark_session(); // gets the Spark session
    Dataset<Row> tmpDS = spark.read().format("parquet").load(fileName);
    tmpDS.show();
    System.out.println(tmpDS.count());
});
The following exception occurs when the parallelStream() method is used:
java.lang.ClassNotFoundException: org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat$$anonfun$11$$anonfun$12
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:429)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:85)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:173)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:794)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:793)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:555)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.inferSchema(ParquetFileFormat.scala:241)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:176)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
at org.apache.jsp.test_jsp.lambda$0(test_jsp.java:210)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
All the JARs are properly loaded. If we remove the parallelStream() call from the list, it works without issues, as below:
List<String> list = Arrays.asList("hdfs://NNcluster/finalsnappy/2019/4/8291/table_8291_69_2019_04_01", "hdfs://NNcluster/finalsnappy/2019/4/8291/table_8291_69_2019_04_02");
list.forEach(fileName -> {
    SparkSession spark = get_spark_session(); // gets the Spark session
    Dataset<Row> tmpDS = spark.read().format("parquet").load(fileName);
    tmpDS.show();
    System.out.println(tmpDS.count());
});
Could someone please help us resolve this issue?
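Not part of the original post, but one possible workaround sketch: DataFrameReader.load accepts several paths at once, so the files can be handed to Spark in a single call and Spark parallelizes the read across its executors itself, without driving the session from a Java parallel stream. The paths below are the ones from the question.
import static org.apache.spark.sql.functions.input_file_name;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Load both locations in one call and count records per underlying file.
// Note: input_file_name() returns the part-file path, so with directory inputs
// the counts are per part-file; aggregate further if per-directory counts are needed.
SparkSession spark = get_spark_session(); // gets the Spark session, as in the question
Dataset<Row> all = spark.read().format("parquet")
        .load("hdfs://NNcluster/finalsnappy/2019/4/8291/table_8291_69_2019_04_01",
              "hdfs://NNcluster/finalsnappy/2019/4/8291/table_8291_69_2019_04_02");
all.groupBy(input_file_name()).count().show();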
I'm trying to prototype an app to use Hadoop as a datastore and I'm falling over at the first hurdle. I've got access to a Hadoop cluster and I purloined a test sample from Spring to try out the first baby step:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.junit.jupiter.api.Test;
import java.io.PrintWriter;
import java.net.URI;
import java.util.Scanner;
public class HdfsTest {

    @Test
    public void testHdfs() throws Exception {
        System.setProperty("HADOOP_USER_NAME", "adam");

        // Path that we need to create in HDFS.
        // Just like Unix/Linux file systems, the HDFS file system starts with "/"
        final Path path = new Path("/usr/adam/junk.txt");

        // Uses try-with-resources in order to avoid close calls on resources.
        // Creates an anonymous subclass of DistributedFileSystem to allow calling
        // initialize, as the DFS will not be usable otherwise.
        try (
            final DistributedFileSystem dFS = new DistributedFileSystem() {
                {
                    initialize(new URI("hdfs://hanameservice/user/adam"),
                            new Configuration());
                }
            };
            // Gets an output stream for the path using the DFS instance
            final FSDataOutputStream streamWriter = dFS.create(path);
            // Wraps the output stream in a PrintWriter for higher-level methods
            final PrintWriter writer = new PrintWriter(streamWriter)
        ) {
            // Writes a couple of lines to the file using the print writer
            writer.println("bungalow bill");
            writer.println("what did you kill");
            System.out.println("File written to HDFS successfully!");
        }
    }
}
These are the Hadoop libraries I'm using:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.8.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.8.1</version>
</dependency>
Could I be missing a dependency?
This is the logging with the errors; it seems there are two separate errors, though.
2017-06-23 16:01:38.787 WARN --- [ main] org.apache.hadoop.util.Shell : Did not find winutils.exe: {}
java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
at org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:528)
at org.apache.hadoop.util.Shell.getHadoopHomeDir(Shell.java:549)
at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:572)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:669)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1445)
at org.apache.hadoop.fs.FileSystem.initialize(FileSystem.java:221)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:145)
at com.bp.gis.tardis.HdfsTest$1.<init>(HdfsTest.java:34)
at com.bp.gis.tardis.HdfsTest.testHdfs(HdfsTest.java:31)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:316)
at org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:114)
at org.junit.jupiter.engine.descriptor.MethodTestDescriptor.lambda$invokeTestMethod$6(MethodTestDescriptor.java:171)
at org.junit.jupiter.engine.execution.ThrowableCollector.execute(ThrowableCollector.java:40)
at org.junit.jupiter.engine.descriptor.MethodTestDescriptor.invokeTestMethod(MethodTestDescriptor.java:168)
at org.junit.jupiter.engine.descriptor.MethodTestDescriptor.execute(MethodTestDescriptor.java:115)
at org.junit.jupiter.engine.descriptor.MethodTestDescriptor.execute(MethodTestDescriptor.java:57)
at org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.lambda$execute$1(HierarchicalTestExecutor.java:81)
at org.junit.platform.engine.support.hierarchical.SingleTestExecutor.executeSafely(SingleTestExecutor.java:66)
at org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.execute(HierarchicalTestExecutor.java:76)
at org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.lambda$execute$1(HierarchicalTestExecutor.java:91)
at org.junit.platform.engine.support.hierarchical.SingleTestExecutor.executeSafely(SingleTestExecutor.java:66)
at org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.execute(HierarchicalTestExecutor.java:76)
at org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.lambda$execute$1(HierarchicalTestExecutor.java:91)
at org.junit.platform.engine.support.hierarchical.SingleTestExecutor.executeSafely(SingleTestExecutor.java:66)
at org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.execute(HierarchicalTestExecutor.java:76)
at org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.execute(HierarchicalTestExecutor.java:51)
at org.junit.platform.engine.support.hierarchical.HierarchicalTestEngine.execute(HierarchicalTestEngine.java:43)
at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:137)
at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:87)
at org.junit.platform.launcher.Launcher.execute(Launcher.java:93)
at com.intellij.junit5.JUnit5IdeaTestRunner.startRunnerWithArgs(JUnit5IdeaTestRunner.java:61)
at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)
at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:496)
... 35 common frames omitted
2017-06-23 16:01:39.449 WARN --- [ main] org.apache.hadoop.util.NativeCodeLoader : Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
java.lang.IllegalArgumentException: java.net.UnknownHostException: hanameservice
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:418)
at org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:130)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:343)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:287)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:156)
at com.bp.gis.tardis.HdfsTest$1.<init>(HdfsTest.java:34)
at com.bp.gis.tardis.HdfsTest.testHdfs(HdfsTest.java:31)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:316)
at org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:114)
at org.junit.jupiter.engine.descriptor.MethodTestDescriptor.lambda$invokeTestMethod$6(MethodTestDescriptor.java:171)
at org.junit.jupiter.engine.execution.ThrowableCollector.execute(ThrowableCollector.java:40)
at org.junit.jupiter.engine.descriptor.MethodTestDescriptor.invokeTestMethod(MethodTestDescriptor.java:168)
at org.junit.jupiter.engine.descriptor.MethodTestDescriptor.execute(MethodTestDescriptor.java:115)
at org.junit.jupiter.engine.descriptor.MethodTestDescriptor.execute(MethodTestDescriptor.java:57)
at org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.lambda$execute$1(HierarchicalTestExecutor.java:81)
at org.junit.platform.engine.support.hierarchical.SingleTestExecutor.executeSafely(SingleTestExecutor.java:66)
at org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.execute(HierarchicalTestExecutor.java:76)
at org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.lambda$execute$1(HierarchicalTestExecutor.java:91)
at org.junit.platform.engine.support.hierarchical.SingleTestExecutor.executeSafely(SingleTestExecutor.java:66)
at org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.execute(HierarchicalTestExecutor.java:76)
at org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.lambda$execute$1(HierarchicalTestExecutor.java:91)
at org.junit.platform.engine.support.hierarchical.SingleTestExecutor.executeSafely(SingleTestExecutor.java:66)
at org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.execute(HierarchicalTestExecutor.java:76)
at org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.execute(HierarchicalTestExecutor.java:51)
at org.junit.platform.engine.support.hierarchical.HierarchicalTestEngine.execute(HierarchicalTestEngine.java:43)
at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:137)
at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:87)
at org.junit.platform.launcher.Launcher.execute(Launcher.java:93)
at com.intellij.junit5.JUnit5IdeaTestRunner.startRunnerWithArgs(JUnit5IdeaTestRunner.java:61)
at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
Caused by: java.net.UnknownHostException: hanameservice
... 36 more
How do I sort this out? My contact person for the Hadoop cluster I'm trying to connect to is not familiar with the hdfs: protocol, and their frame of reference seems to be entirely manual rather than programmatic. They want me to log in to an edge node and run scripts there in a shell. I feel I should be asking them particular questions, but I'm not sure what.
There are two distinct problems:
It appears you are running from a Windows host. On Windows, Hadoop requires native code extensions so that it can integrate with the OS correctly for things like file access semantics and permissions. Notice that the exception message contains a link to an Apache Hadoop wiki page: WindowsProblems. That page contains information on how to handle this.
There is a failure to establish a socket connection to host "hanameservice". This is most likely not a real hostname, but rather a logical name used for HDFS High Availability. Internally, the HDFS client code would map this logical name to one of two real NameNode hostnames, but only if the configuration is complete. You likely do not have a complete set of the configuration files (core-site.xml and hdfs-site.xml) from the cluster. You would need the complete configuration on your local system for this to work, as in the sketch below.
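A minimal sketch of what the client side needs (the local file paths are placeholders): copy the cluster's core-site.xml and hdfs-site.xml to the client machine and add them to the Configuration, so the HA nameservice "hanameservice" can be resolved to the real NameNode hosts.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// With the cluster's client configuration on the classpath or added explicitly,
// FileSystem.get() can resolve hdfs://hanameservice instead of throwing
// UnknownHostException.
Configuration conf = new Configuration();
conf.addResource(new Path("/local/hadoop-conf/core-site.xml"));
conf.addResource(new Path("/local/hadoop-conf/hdfs-site.xml"));
FileSystem fs = FileSystem.get(conf);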
They want me to login to an edge node and run scripts there in a shell.
Overall, this may be the shortest path for you, rather than trying to work through the Windows integration and configuration. If you wrap your code in the Hadoop Tool interface, build it as a jar, and copy that jar to the edge node, you'll be able to run it as hadoop jar your-app.jar. You'll be running inside a known working environment, with no need to sort out the native code extensions and no need to worry about whether or not your configuration is complete and up to date with the cluster configuration.
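A rough sketch of that approach (the class name and output path are placeholders): wrap the write in the Tool interface so it can be launched on the edge node with hadoop jar your-app.jar and pick up the cluster configuration automatically.
import java.io.PrintWriter;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class HdfsWriteTool extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() carries the cluster configuration injected by "hadoop jar".
        FileSystem fs = FileSystem.get(getConf());
        try (PrintWriter writer = new PrintWriter(fs.create(new Path("/user/adam/junk.txt")))) {
            writer.println("bungalow bill");
            writer.println("what did you kill");
        }
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new HdfsWriteTool(), args));
    }
}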
I am new to Apache Storm and am currently trying the pluggable scheduler to control task assignment: which task should run on which supervisor.
I tried setting the "supervisor.scheduler.meta" value in the storm.yaml file on the supervisor node as shown below, and when I try to run the supervisor I end up with an IllegalArgumentException. I am using Apache Storm 0.10.0. Could you please guide me in solving this issue? Please find the configuration file and error log below.
storm.yaml
-----------
supervisor.scheduler.meta: "special-supervisor"
error-log
----
java.lang.IllegalArgumentException: field supervisor.scheduler.meta 'special-supervisor' must be a 'java.util.Map'
at backtype.storm.config$fn$reify__880.validateField(config.clj:58)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
at backtype.storm.config$validate_configs_with_schemas.invoke(config.clj:118)
at backtype.storm.config$read_storm_config.invoke(config.clj:123)
at backtype.storm.command.config_value$_main.invoke(config_value.clj:22)
at clojure.lang.AFn.applyToHelper(AFn.java:154)
at clojure.lang.AFn.applyTo(AFn.java:144)
Map entries need to have a key and a value. For example:
supervisor.scheduler.meta:
name: "special-supervisor"
where "name" is the key and "special-supervisor" is the value.