I am trying to run this example from the Spark documentation and I am getting the error below. I get the same error using the Java version of the example as well. The exact line where the error occurs is:
idfModel = idf.fit(featurizedData)
Py4JJavaError: An error occurred while calling o1142.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 256.0 failed 1 times, most recent failure: Lost task 7.0 in stage 256.0 (TID 3308, localhost): java.lang.NullPointerException
The data I'm using is obtained by reading a JSON file with a few thousand records. In Java I'm reading the file as follows:
DataFrame myData = sqlContext.read().json("myJsonFile.json");
The rest of the code is exactly the same as in the example linked above. featurizedData is a valid DataFrame; I printed its schema and the first element, and everything looks as expected. I have no idea why I'm getting a NullPointerException.
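For reference, my featurization code mirrors the documentation example; in Java it looks roughly like this (the "text" input column name is an assumption based on my JSON schema, and the NullPointerException comes from the final fit call):

// classes from org.apache.spark.ml.feature
Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
DataFrame wordsData = tokenizer.transform(myData);

HashingTF hashingTF = new HashingTF()
        .setInputCol("words")
        .setOutputCol("rawFeatures")
        .setNumFeatures(20);
DataFrame featurizedData = hashingTF.transform(wordsData);

IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
IDFModel idfModel = idf.fit(featurizedData);   // <-- NullPointerException is thrown here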
The problem is that you have nan as the value of the text field in some of your records.
Since the question is tagged with PySpark, use:
data_nan_imputed = data.fillna("unknown", subset=["text_col1", .., "text_coln"])
This is good practice if you have a number of text columns that you want to combine into a single text column. Otherwise, you can also use
data_nan_dropped = data.dropna()
to get rid of the rows containing nan values and then fit on this dataset. Hopefully, it will work.
For Scala or Java, use similar nan-filling statements.
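In Java, for example, the equivalent goes through the DataFrame na() functions; a minimal sketch (column names are placeholders):

// Fill nulls in the text columns with a placeholder value:
DataFrame imputed = myData.na().fill("unknown", new String[] {"text_col1", "text_col2"});

// Or drop every row that contains a null:
DataFrame dropped = myData.na().drop();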
I implemented a Kinesis stream consumer client using the Node wrapper and I am getting the MultiLangDaemon execution error shown below.
Starting MultiLangDaemon ...
java.lang.IllegalArgumentException: No enum constant software.amazon.kinesis.common.InitialPositionInStream.TRIM_HORIZON
at java.lang.Enum.valueOf(Enum.java:238)
at software.amazon.kinesis.common.InitialPositionInStream.valueOf(InitialPositionInStream.java:21)
at software.amazon.kinesis.multilang.config.MultiLangDaemonConfiguration$2.convert(MultiLangDaemonConfiguration.java:208)
at org.apache.commons.beanutils.ConvertUtilsBean.convert(ConvertUtilsBean.java:491)
at org.apache.commons.beanutils.BeanUtilsBean.setProperty(BeanUtilsBean.java:1007)
at software.amazon.kinesis.multilang.config.KinesisClientLibConfigurator.lambda$getConfiguration$0(KinesisClientLibConfigurator.java:65)
at java.lang.Iterable.forEach(Iterable.java:75)
at java.util.Collections$SynchronizedCollection.forEach(Collections.java:2064)
at software.amazon.kinesis.multilang.config.KinesisClientLibConfigurator.getConfiguration(KinesisClientLibConfigurator.java:63)
at software.amazon.kinesis.multilang.MultiLangDaemonConfig.<init>(MultiLangDaemonConfig.java:101)
at software.amazon.kinesis.multilang.MultiLangDaemonConfig.<init>(MultiLangDaemonConfig.java:74)
at software.amazon.kinesis.multilang.MultiLangDaemonConfig.<init>(MultiLangDaemonConfig.java:58)
at software.amazon.kinesis.multilang.MultiLangDaemon.buildMultiLangDaemonConfig(MultiLangDaemon.java:171)
at software.amazon.kinesis.multilang.MultiLangDaemon.main(MultiLangDaemon.java:220)
No enum constant software.amazon.kinesis.common.InitialPositionInStream.TRIM_HORIZON
I have already cross-checked the properties file; the keys listed there include initialPositionInStream, processingLanguage, streamName, executableName, etc.
TRIM_HORIZON is set as the value for initialPositionInStream, so I am not sure why the software.amazon.kinesis.common.InitialPositionInStream enum is reported as missing this value.
I am using the Node consumer client as described here:
https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-implementation-app-nodejs.html
Just want to help the community by answering with the fix for this specific error: it was a .properties file update for the respective parameter key name. I am still getting a couple of other java.lang.RuntimeExceptions, which I will try to resolve and, if needed, ask about in separate threads.
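For reference, the relevant entries in the consumer's .properties file look roughly like this (values are placeholders; the key names must match exactly what MultiLangDaemon expects):

# excerpt from the KCL consumer .properties file
executableName = node sample_kcl_app.js
streamName = my-stream
applicationName = my-kcl-app
processingLanguage = nodejs
initialPositionInStream = TRIM_HORIZON
regionName = us-east-1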
We are reading data from ORC files and writing it back in ORC and Parquet format using MultipleOutputs. Our job is map-only and does not have a reducer.
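For context, the named outputs are wired up roughly like this (the output name and value classes are simplified placeholders; the Parquet output is registered the same way):

// In the driver:
MultipleOutputs.addNamedOutput(job, "orc", OrcNewOutputFormat.class, NullWritable.class, Writable.class);

// In the mapper:
private MultipleOutputs<NullWritable, Writable> mos;

@Override
protected void setup(Context context) {
    mos = new MultipleOutputs<NullWritable, Writable>(context);
}

// records are written with mos.write("orc", NullWritable.get(), orcRow);

@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();   // the stack traces below are thrown while close() flushes the ORC writers
}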
We are getting the following errors in some cases, which fail the entire job. I think both errors are related, but I am not sure why they do not occur for every job.
Let me know if more information is required.
Error: java.lang.RuntimeException: Overflow of newLength. smallBuffer.length=1073741824, nextElemLength=300947
Error: java.lang.ArrayIndexOutOfBoundsException: 1000
at org.apache.orc.impl.writer.StringTreeWriter.writeBatch(StringTreeWriter.java:70)
at org.apache.orc.impl.writer.StructTreeWriter.writeRootBatch(StructTreeWriter.java:56)
at org.apache.orc.impl.WriterImpl.addRowBatch(WriterImpl.java:546)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushInternalBatch(WriterImpl.java:297)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.close(WriterImpl.java:334)
at org.apache.hadoop.hive.ql.io.orc.OrcNewOutputFormat$OrcRecordWriter.close(OrcNewOutputFormat.java:67)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs$RecordWriterWithCounter.close(MultipleOutputs.java:375)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.close(MultipleOutputs.java:574)
Error: java.lang.NullPointerException
at java.lang.System.arraycopy(Native Method)
at org.apache.orc.impl.DynamicByteArray.add(DynamicByteArray.java:115)
at org.apache.orc.impl.StringRedBlackTree.addNewKey(StringRedBlackTree.java:48)
at org.apache.orc.impl.StringRedBlackTree.add(StringRedBlackTree.java:60)
at org.apache.orc.impl.writer.StringTreeWriter.writeBatch(StringTreeWriter.java:70)
at org.apache.orc.impl.writer.StructTreeWriter.writeRootBatch(StructTreeWriter.java:56)
at org.apache.orc.impl.WriterImpl.addRowBatch(WriterImpl.java:546)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushInternalBatch(WriterImpl.java:297)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.close(WriterImpl.java:334)
at org.apache.hadoop.hive.ql.io.orc.OrcNewOutputFormat$OrcRecordWriter.close(OrcNewOutputFormat.java:67)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs$RecordWriterWithCounter.close(MultipleOutputs.java:375)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.close(MultipleOutputs.java:574)
In my case the solution was to change orc.rows.between.memory.checks (or spark.hadoop.orc.rows.between.memory.checks) from its default value of 5000 to 1, because it seems the ORC writer cannot handle adding abnormally large rows to a stripe.
The value can probably be tuned further to achieve a better safety/performance balance.
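A minimal sketch of setting it for a plain MapReduce driver (for Spark, the same value goes in via the spark.hadoop.-prefixed key instead):

// org.apache.hadoop.conf.Configuration / org.apache.hadoop.mapreduce.Job
Configuration conf = new Configuration();
conf.set("orc.rows.between.memory.checks", "1");   // default is 5000
Job job = new Job(conf, "orc-rewrite");            // then configure MultipleOutputs etc. as before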
I'm using the Neuroph 2.9 framework to code an ANN to predict housing prices. I want to get the network error after every epoch (to show the improvement of the error on a chart), but this causes an error.
// create multi layer perceptron
System.out.println("Creating neural network");
MultiLayerPerceptron neuralNet = new MultiLayerPerceptron(
        TransferFunctionType.SIGMOID, inputsCount, hiddentsCount1,
        outputsCount);
// set learning parameters
MomentumBackpropagation learningRule = new MomentumBackpropagation();
learningRule.setLearningRate(0.3);
learningRule.setMomentum(0.5);
learningRule.setNeuralNetwork(neuralNet);
learningRule.setTrainingSet(TrainSet);
learningRule.doOneLearningIteration(TrainSet);
I get this:
Exception in thread "main" java.lang.NullPointerException
at org.neuroph.nnet.learning.MomentumBackpropagation.updateNeuronWeights(MomentumBackpropagation.java:72)
at org.neuroph.nnet.learning.BackPropagation.calculateErrorAndUpdateOutputNeurons(BackPropagation.java:83)
at org.neuroph.nnet.learning.BackPropagation.updateNetworkWeights(BackPropagation.java:53)
at org.neuroph.core.learning.SupervisedLearning.learnPattern(SupervisedLearning.java:190)
at org.neuroph.core.learning.SupervisedLearning.doLearningEpoch(SupervisedLearning.java:165)
at org.neuroph.core.learning.IterativeLearning.doOneLearningIteration(IterativeLearning.java:245)
at com.thao.Main.main(Main.java:76)
The problem is that when I use learningRule.learn(TrainSet); it is fine and no error comes out. The documentation is so poor that it is hard to tell the functions apart and choose the right one for what I want to do.
What I found is that the doOneLearningIteration function does not work on its own because the learning rule is not initialized internally. Therefore, to make it run, we need to either override that behaviour or run one epoch first and then call doOneLearningIteration.
That works for me.
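A minimal sketch of that workaround, assuming the Neuroph 2.9 API from the code above (TrainSet is the training DataSet):

// Run one full epoch via learn() so the learning rule is initialized internally:
learningRule.setMaxIterations(1);
learningRule.learn(TrainSet);

// After that, single iterations work and the error can be sampled per epoch for the chart:
for (int epoch = 0; epoch < 1000; epoch++) {
    learningRule.doOneLearningIteration(TrainSet);
    System.out.println("epoch " + epoch + " error = " + learningRule.getTotalNetworkError());
}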
I've tested my app using Elasticsearch with a very simple piece of code, like this:
Node node = nodeBuilder()
        .settings(Settings.settingsBuilder()
                .put("cluster.name", "elasticsearch")
                .put("clster.transport.sniff", true)
                .put("path.home", "/home/kenny/Program/Java/elastic"))
        .node();
But I got an error like this:
Exception in thread "main" java.lang.NullPointerException
at java.io.Reader.<init>(Reader.java:78)
at java.io.InputStreamReader.<init>(InputStreamReader.java:113)
at org.elasticsearch.node.internal.InternalSettingsPreparer.randomNodeName(InternalSettingsPreparer.java:198)
at org.elasticsearch.node.internal.InternalSettingsPreparer.finalizeSettings(InternalSettingsPreparer.java:177)
at org.elasticsearch.node.internal.InternalSettingsPreparer.prepareEnvironment(InternalSettingsPreparer.java:101)
at org.elasticsearch.node.Node.<init>(Node.java:128)
at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:145)
at org.elasticsearch.node.NodeBuilder.node(NodeBuilder.java:152)
at TryElastic.main(TryElastic.java:56)
I don't know how to solve this problem; I've tried looking for a solution. Line 56 in the error log refers to the .node() call above. Do you have a suggestion, or is there something I need to add to my code?
Thanks.
The only way this can happen is due to a misconfiguration of path.home.
When Elasticsearch tries to generate a random node name for your instance, it looks for a file at {path.home}/config/names.txt.
If the file cannot be found, you'll get a (rather unfriendly and unhelpful) NullPointerException.
So the solution is to check that "/home/kenny/Program/Java/elastic" is really the top-level of an ES installation.
See here for docs on the correct directory layout.
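A sketch of the corrected setup, assuming /home/kenny/elasticsearch-2.x is the root of an unpacked Elasticsearch distribution (so that {path.home}/config/names.txt exists):

Node node = nodeBuilder()
        .settings(Settings.settingsBuilder()
                .put("cluster.name", "elasticsearch")
                // path.home must point at the ES installation root (bin/, config/, lib/, ...)
                .put("path.home", "/home/kenny/elasticsearch-2.x"))
        .node();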
I have written a MapReduce program using Mahout. The map output value is ClusterWritable. When I run the code in Eclipse, it runs with no error, but when I run the jar file in the terminal, it shows the exception:
java.io.IOException: wrong value class: org.apache.mahout.math.VectorWritable is not class org.apache.mahout.clustering.iterator.ClusterWritable
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:988)
at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:74)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:498)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at org.apache.mahout.clustering.canopy.CanopyMapper.cleanup(CanopyMapper.java:59)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
The output code in my map method is:
context.write(new Text(), new ClusterWritable());
But I don't know why it says that the value type is VectorWritable.
The mapper being run, which produces the stack trace above, is Mahout's CanopyMapper, not the custom one you've written.
The CanopyMapper.cleanup method outputs (key: Text, value: VectorWritable).
See CanopyMapper.java
See also CanopyDriver.java and its buildClustersMR method, where the MR job is configured with the mapper, the reducer, and the appropriate output key/value classes.
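If the intention is to run your own mapper (the one doing context.write(new Text(), new ClusterWritable())), the job has to be configured to use it explicitly; a hypothetical driver sketch (class names are placeholders):

Job job = new Job(conf, "my-clustering-step");
job.setJarByClass(MyDriver.class);                        // placeholder driver class
job.setMapperClass(MyClusterMapper.class);                // your mapper, not Mahout's CanopyMapper
job.setNumReduceTasks(0);                                 // if this step has no reducer
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(ClusterWritable.class);           // must match what the mapper writes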
You didn't state it, so I'm guessing that you're using more than one MR job in a data-flow pipeline. Check that the outputs of each job in the pipeline are valid/expected inputs for the next job. Consider using Cascading/Scalding to define your data flow (see http://www.slideshare.net/melrief/scalding-programming-model-for-hadoop).
Consider using the Mahout user mailing list to post Mahout-related questions.