We are reading data from ORC files and writing them back in ORC and Parquet formats using MultipleOutputs. Our job is map-only and does not have a reducer.
We are getting the following errors in some cases, which fail the entire job. I think both errors are related, but I am not sure why they do not occur for every job.
Let me know if more information is required.
Error: java.lang.RuntimeException: Overflow of newLength. smallBuffer.length=1073741824, nextElemLength=300947
Error: java.lang.ArrayIndexOutOfBoundsException: 1000
at org.apache.orc.impl.writer.StringTreeWriter.writeBatch(StringTreeWriter.java:70)
at org.apache.orc.impl.writer.StructTreeWriter.writeRootBatch(StructTreeWriter.java:56)
at org.apache.orc.impl.WriterImpl.addRowBatch(WriterImpl.java:546)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushInternalBatch(WriterImpl.java:297)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.close(WriterImpl.java:334)
at org.apache.hadoop.hive.ql.io.orc.OrcNewOutputFormat$OrcRecordWriter.close(OrcNewOutputFormat.java:67)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs$RecordWriterWithCounter.close(MultipleOutputs.java:375)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.close(MultipleOutputs.java:574)
Error: java.lang.NullPointerException
at java.lang.System.arraycopy(Native Method)
at org.apache.orc.impl.DynamicByteArray.add(DynamicByteArray.java:115)
at org.apache.orc.impl.StringRedBlackTree.addNewKey(StringRedBlackTree.java:48)
at org.apache.orc.impl.StringRedBlackTree.add(StringRedBlackTree.java:60)
at org.apache.orc.impl.writer.StringTreeWriter.writeBatch(StringTreeWriter.java:70)
at org.apache.orc.impl.writer.StructTreeWriter.writeRootBatch(StructTreeWriter.java:56)
at org.apache.orc.impl.WriterImpl.addRowBatch(WriterImpl.java:546)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushInternalBatch(WriterImpl.java:297)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.close(WriterImpl.java:334)
at org.apache.hadoop.hive.ql.io.orc.OrcNewOutputFormat$OrcRecordWriter.close(OrcNewOutputFormat.java:67)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs$RecordWriterWithCounter.close(MultipleOutputs.java:375)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.close(MultipleOutputs.java:574)
In my case the solution was to change orc.rows.between.memory.checks (or spark.hadoop.orc.rows.between.memory.checks) from its default value of 5000 to 1, because it seems the ORC writer cannot handle adding abnormally large rows to a stripe.
The value can probably be tuned further to achieve a better safety/performance balance.
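If you are running a plain MapReduce job as in the question, a minimal sketch of setting this on the job configuration looks like the snippet below (the job name is just a placeholder, and whether the Hive ORC writer picks the value up from the job Configuration may depend on your Hive/ORC versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Check writer memory after every row instead of every 5000, so a stripe is flushed
// before an abnormally large row overflows the dictionary buffer.
conf.setInt("orc.rows.between.memory.checks", 1);
Job job = Job.getInstance(conf, "orc-rewrite");  // placeholder job name; build the rest of the job from this conf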
I implemented a Kinesis stream consumer client using the Node.js wrapper and I am getting the MultiLangDaemon execution error shown below.
Starting MultiLangDaemon ...
java.lang.IllegalArgumentException: No enum constant software.amazon.kinesis.common.InitialPositionInStream.TRIM_HORIZON
at java.lang.Enum.valueOf(Enum.java:238)
at software.amazon.kinesis.common.InitialPositionInStream.valueOf(InitialPositionInStream.java:21)
at software.amazon.kinesis.multilang.config.MultiLangDaemonConfiguration$2.convert(MultiLangDaemonConfiguration.java:208)
at org.apache.commons.beanutils.ConvertUtilsBean.convert(ConvertUtilsBean.java:491)
at org.apache.commons.beanutils.BeanUtilsBean.setProperty(BeanUtilsBean.java:1007)
at software.amazon.kinesis.multilang.config.KinesisClientLibConfigurator.lambda$getConfiguration$0(KinesisClientLibConfigurator.java:65)
at java.lang.Iterable.forEach(Iterable.java:75)
at java.util.Collections$SynchronizedCollection.forEach(Collections.java:2064)
at software.amazon.kinesis.multilang.config.KinesisClientLibConfigurator.getConfiguration(KinesisClientLibConfigurator.java:63)
at software.amazon.kinesis.multilang.MultiLangDaemonConfig.<init>(MultiLangDaemonConfig.java:101)
at software.amazon.kinesis.multilang.MultiLangDaemonConfig.<init>(MultiLangDaemonConfig.java:74)
at software.amazon.kinesis.multilang.MultiLangDaemonConfig.<init>(MultiLangDaemonConfig.java:58)
at software.amazon.kinesis.multilang.MultiLangDaemon.buildMultiLangDaemonConfig(MultiLangDaemon.java:171)
at software.amazon.kinesis.multilang.MultiLangDaemon.main(MultiLangDaemon.java:220)
No enum constant software.amazon.kinesis.common.InitialPositionInStream.TRIM_HORIZON
I have already cross-checked the properties file; the keys listed there include
initialPositionInStream, processingLanguage, streamName, executableName, etc.
TRIM_HORIZON is set as the value for initialPositionInStream, so I am not sure why the enum software.amazon.kinesis.common.InitialPositionInStream appears to be missing this value.
I am using the Node.js consumer client as described here:
https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-implementation-app-nodejs.html
Just want to help the community by posting the fix for this specific error: it was a .properties file update for the respective parameter key name. I am still getting a couple of other java.lang.RuntimeException errors, which I will try to resolve and, if needed, ask about in separate threads.
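For reference, assuming the standard key names from the KCL sample consumer properties file, the relevant line should look like this (no prefix on the key, and the value spelled exactly like the enum constant):

initialPositionInStream = TRIM_HORIZON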
I am trying to run the class Word2VecSentimentRNN from the following link:
https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/recurrent/word2vecsentiment/Word2VecSentimentRNN.java
The example is a big one, hence I have given a link to it instead of pasting it here.
I have also downloaded the sample vector file from the following link:
https://github.com/mmihaltz/word2vec-GoogleNews-vectors
I am getting the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Cannot allocate 3103474 + 3600000000 bytes (> Pointer.maxBytes)
at org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:484)
at org.bytedeco.javacpp.Pointer.init(Pointer.java:118)
at org.bytedeco.javacpp.FloatPointer.allocateArray(Native Method)
at org.bytedeco.javacpp.FloatPointer.<init>(FloatPointer.java:68)
at org.nd4j.linalg.api.buffer.BaseDataBuffer.<init>(BaseDataBuffer.java:457)
at org.nd4j.linalg.api.buffer.FloatBuffer.<init>(FloatBuffer.java:57)
at org.nd4j.linalg.api.buffer.factory.DefaultDataBufferFactory.createFloat(DefaultDataBufferFactory.java:238)
at org.nd4j.linalg.factory.Nd4j.createBuffer(Nd4j.java:1201)
at org.nd4j.linalg.factory.Nd4j.createBuffer(Nd4j.java:1176)
at org.nd4j.linalg.api.ndarray.BaseNDArray.<init>(BaseNDArray.java:230)
at org.nd4j.linalg.cpu.nativecpu.NDArray.<init>(NDArray.java:111)
at org.nd4j.linalg.cpu.nativecpu.CpuNDArrayFactory.create(CpuNDArrayFactory.java:247)
at org.nd4j.linalg.factory.Nd4j.create(Nd4j.java:4261)
at org.nd4j.linalg.factory.Nd4j.create(Nd4j.java:4227)
at org.nd4j.linalg.factory.Nd4j.create(Nd4j.java:3501)
at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readBinaryModel(WordVectorSerializer.java:219)
at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.loadGoogleModel(WordVectorSerializer.java:118)
at com.nyu.sentimentanalysis.core.Word2VecSentimentRNN.run(Word2VecSentimentRNN.java:77)
I have tried launching the application with the parameters -Xmx2g and -Xms2g, and even changed the values a few times to check whether it helps.
Kindly let me know what I should do; I am stuck here.
I had this problem running standard Word2vec code; the system would die after a while with an OutOfMemoryError.
The following settings worked for me for sustaining long-term load in production for a DL4J/ND4J-based app using the Word2vec pre-trained model:
java -Xmx2G -Dorg.bytedeco.javacpp.maxbytes=6G -Dorg.bytedeco.javacpp.maxphysicalbytes=6G
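For context, the 3600000000 bytes in the error is roughly the size of the GoogleNews vectors (3M words x 300 floats), which ND4J allocates off-heap through JavaCPP, so raising -Xmx alone does not help; the javacpp limits are what matter. A full launch command might look like the line below (the jar name is a placeholder for your own build; the main class is the one from the stack trace in the question):

java -Xmx2G -Dorg.bytedeco.javacpp.maxbytes=6G -Dorg.bytedeco.javacpp.maxphysicalbytes=6G -cp sentiment-app.jar com.nyu.sentimentanalysis.core.Word2VecSentimentRNN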
I am trying to run this example from the Spark documentation and am getting the error below. I get the same error using the Java version of the example as well. The exact line where I get the error is:
idfModel = idf.fit(featurizedData)
Py4JJavaError: An error occurred while calling o1142.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 256.0 failed 1 times, most recent failure: Lost task 7.0 in stage 256.0 (TID 3308, localhost): java.lang.NullPointerException
The data I'm using is obtained by reading a JSON file with a few thousand records. In Java I'm reading the file as follows:
DataFrame myData = sqlContext.read().json("myJsonFile.json");
The rest of the code is exactly the same as in the example linked above. featurizedData is a valid DataFrame; I printed its schema and the first element, and everything looks as expected. I have no idea why I'm getting a NullPointerException.
The problem is that some rows have NaN in the text column(s).
Since the question is tagged with PySpark, use
data_nan_imputed = data.fillna("unknown", subset=["text_col1", .., "text_coln"])
This is a good practice if you have a number of text columns that you want to combine into a single text column. Otherwise, you can also use
data_nan_dropped = data.dropna()
to get rid of the rows containing NaN values and then fit on this dataset. Hopefully, it will work.
For Scala or Java, use similar NaN-filling statements.
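In the Java DataFrame API used in the question, a rough equivalent would be the snippet below (the column name "text" is only an assumption about your schema):

// Fill nulls in the text column with a placeholder token before featurizing
DataFrame myDataFilled = myData.na().fill("unknown", new String[] {"text"});
// or drop any rows that contain nulls entirely
DataFrame myDataDropped = myData.na().drop();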
I have written a MapReduce program using Mahout. The map output value is ClusterWritable. When I run the code in Eclipse it runs with no errors, but when I run the jar file in the terminal it throws the exception:
java.io.IOException: wrong value class: org.apache.mahout.math.VectorWritable is not class org.apache.mahout.clustering.iterator.ClusterWritable
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:988)
at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:74)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:498)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at org.apache.mahout.clustering.canopy.CanopyMapper.cleanup(CanopyMapper.java:59)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
The output code in the map method is:
context.write(new Text(), new ClusterWritable());
but I don't know why it says the value type is VectorWritable.
The mapper being run, resulting in the stack trace above, is Mahout's CanopyMapper, not the custom one you've written.
CanopyMapper's cleanup method outputs (key: Text, value: VectorWritable).
See CanopyMapper.java
See also CanopyDriver.java and its buildClustersMR method, where the MR job is configured with the mapper, reducer, and appropriate output key/value classes.
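As a hedged illustration (the driver and mapper class names below are placeholders, not Mahout API): the output value class declared on the job must match what the mapper that actually runs writes, otherwise SequenceFileOutputFormat rejects the record exactly as in the stack trace above.

Job job = new Job(conf, "clustering-step");        // conf is your Hadoop Configuration
job.setMapperClass(MyClusterMapper.class);         // verify this is really the mapper that runs
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(ClusterWritable.class);    // must match context.write(new Text(), new ClusterWritable())
job.setOutputFormatClass(SequenceFileOutputFormat.class);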
You didn't state it, so I'm guessing that you're using more than one MR job in a data-flow pipeline. Check that the outputs of each job in the pipeline are valid/expected inputs for the next job. Consider using Cascading/Scalding to define your data flow (see http://www.slideshare.net/melrief/scalding-programming-model-for-hadoop).
Consider using the Mahout user mailing list to post Mahout-related questions.
I am fairly new to Java and am currently working with Struts 1.3. My question is more related to Java itself.
Following is the scenario; I am not sure about the best approach to follow.
I want to build error-reporting functionality in my project, where I copy the stack trace and insert it into a database table, say an "errorlog" table, with the type of exception, date, user, and other fields. But now what I have to do is:
I have to identify the package name, file, method, and line number where the exception occurred, with the help of the server logs I have when an exception occurs.
The problem is that the log gives the exception, file, and line number, but it also contains frames that are not in my packages, such as HttpServlet.service(HttpServlet.java:710), as shown below:
java.lang.NullPointerException
com.candidate.query.CandSearchAction.execute(CandSearchAction.java:34)
org.apache.struts.action.RequestProcessor.processActionPerform(RequestProcessor.java:484)
org.apache.struts.action.RequestProcessor.process(RequestProcessor.java:274)
org.apache.struts.action.ActionServlet.process(ActionServlet.java:1482)
org.apache.struts.action.ActionServlet.doPost(ActionServlet.java:525)
javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
In the above case, I got a NullPointerException in my CandSearchAction.java at line number 34.
Now I want to know how to differentiate or extract only the lines that point to my Java/JSP files, so that I can filter them accordingly.
Can anyone please help with this situation or suggest how to approach the problem? Is there any library I can use, or anything else that could help me?
Waiting for a response that would help me.
If your code follows the same basic structure as in your example, you could parse the stack trace for anything starting with one of your package names and only use those lines.
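A minimal sketch of that idea, assuming your application packages start with com.candidate (the prefix array is an assumption; adjust it to your own package names):

import java.util.ArrayList;
import java.util.List;

public class StackTraceFilter {

    // Package prefixes that identify your own code; everything else is framework noise.
    private static final String[] OWN_PACKAGES = { "com.candidate" };

    public static List<StackTraceElement> ownFrames(Throwable t) {
        List<StackTraceElement> frames = new ArrayList<StackTraceElement>();
        for (StackTraceElement element : t.getStackTrace()) {
            for (String prefix : OWN_PACKAGES) {
                if (element.getClassName().startsWith(prefix)) {
                    frames.add(element);
                    break;
                }
            }
        }
        return frames;
    }
}

The first element of the returned list is usually the frame you want to log: getClassName(), getMethodName(), getFileName() and getLineNumber() give you the package/class, method, file and line number for your errorlog table.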