The implementation below throws an OutOfMemoryError for a 1 GB file with 4 GB of heap space. Files.lines returns a stream, but collecting it with Collectors.joining exhausts the heap.
Can we stream the file with a smaller memory footprint using jOOQ and JDBC, while preserving the original line separators?
Stream<String> lines = Files.lines(path);

dsl.createTable(TABLE1)
   .column(COL1, SQLDataType.CLOB)
   .column(COL2, SQLDataType.CLOB)
   .execute();

dsl.insertInto(TABLE1)
   .columns(COL1, COL2)
   .values(VAL1, lines.collect(Collectors.joining(System.lineSeparator())))
   .execute();
Error ->
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332) ~[na:1.8.0_141]
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) ~[na:1.8.0_141]
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448) ~[na:1.8.0_141]
at java.lang.StringBuilder.append(StringBuilder.java:136) ~[na:1.8.0_141]
at java.lang.StringBuilder.append(StringBuilder.java:76) ~[na:1.8.0_141]
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:484) ~[na:1.8.0_141]
at java.lang.StringBuilder.append(StringBuilder.java:166) ~[na:1.8.0_141]
at java.util.StringJoiner.add(StringJoiner.java:185) ~[na:1.8.0_141]
at java.util.stream.Collectors$$Lambda$491/1699129225.accept(Unknown Source) ~[na:na]
at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169) ~[na:1.8.0_141]
at java.util.Iterator.forEachRemaining(Iterator.java:116) ~[na:1.8.0_141]
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) ~[na:1.8.0_141]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[na:1.8.0_141]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[na:1.8.0_141]
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[na:1.8.0_141]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[na:1.8.0_141]
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) ~[na:1.8.0_141]
jOOQ's default data type binding for the CLOB data type treats CLOB as an ordinary String, which works well for small to mid-sized LOBs. For larger LOBs, the streaming version of the JDBC API is more suitable. Ideally, you would create your own data type binding, where you optimise your write operation for streaming:
https://www.jooq.org/doc/latest/manual/code-generation/custom-data-type-bindings
For example:
class StreamingLobBinding implements Binding<String, File> {
    ...

    @Override
    public void set(BindingSetStatementContext<File> ctx) {
        // Ideally: register the input stream somewhere for explicit resource management
        ctx.statement()
           .setBinaryStream(ctx.index(), new FileInputStream(ctx.value()));
    }
}
You can then either apply this binding to your column and have the code generator pick it up, as documented in the link above, or apply it to a single usage only:
DataType<File> fileType = COL2.getDataType().asConvertedDataType(new StreamingLobBinding());
Field<File> fileCol = DSL.field(COL2.getName(), fileType);
dsl.insertInto(TABLE1)
   .columns(COL1, fileCol)
   .values(VAL1, DSL.val(theFile, fileType))
   .execute();
Note that currently, you may need to register your input stream in some ThreadLocal to remember it and clean it up after statement execution. A future jOOQ version will offer an SPI to handle this: https://github.com/jOOQ/jOOQ/issues/7591
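For example, a minimal sketch of that ThreadLocal bookkeeping, using a hypothetical OpenStreams helper (the class and its methods are assumptions for illustration, not jOOQ API):
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: remembers streams opened while binding values so they can be
// closed once the statement has been executed.
final class OpenStreams {
    private static final ThreadLocal<List<InputStream>> STREAMS =
            ThreadLocal.withInitial(ArrayList::new);

    static InputStream register(InputStream in) {
        STREAMS.get().add(in);
        return in;
    }

    // Call this from a finally block after dsl.insertInto(...).execute()
    static void closeAll() {
        for (InputStream in : STREAMS.get()) {
            try {
                in.close();
            } catch (IOException ignored) {
            }
        }
        STREAMS.get().clear();
    }
}
Inside the binding's set() method, the FileInputStream would then be wrapped in OpenStreams.register(...), and the calling code would invoke OpenStreams.closeAll() in a finally block around execute().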
Since sqlite-jdbc throws a SQLFeatureNotSupportedException for setBinaryStream, there is one more approach, as suggested in the GitHub issue above: insert the row with a NULL CLOB first, then append the file contents in chunks.
// Insert the row once with a NULL CLOB, then append the file contents in chunks
dsl.insertInto(TABLE1)
   .columns(TABLE1.COL1, TABLE1.COL2)
   .values("ABB", null)
   .execute();

final int CHUNK_SIZE = 100 * 1000 * 1000; // flush roughly every 100 million chars
Field<String> coalesce = DSL.coalesce(TABLE1.COL2, "");
char[] buffer = new char[CHUNK_SIZE];
StringBuilder chunk = new StringBuilder();

try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
    int charsRead;
    while ((charsRead = reader.read(buffer)) != -1) {
        chunk.append(buffer, 0, charsRead);
        if (chunk.length() >= CHUNK_SIZE) {
            dsl.update(TABLE1)
               .set(TABLE1.COL2, DSL.concat(coalesce, DSL.val(chunk.toString())))
               .where(TABLE1.COL1.eq("ABB"))
               .execute();
            chunk.setLength(0);
        }
    }
    if (chunk.length() > 0) {
        dsl.update(TABLE1)
           .set(TABLE1.COL2, DSL.concat(coalesce, DSL.val(chunk.toString())))
           .where(TABLE1.COL1.eq("ABB"))
           .execute();
    }
}
Related
I'm attempting to use the Java Spark libraries with a cluster running Spark 2.3.0 over Hadoop 3.1.0 (and using those versions of the Java libraries).
I've run into a problem where I simply cannot use groupByKey, and I am at a loss to explain why. Any attempted usage of groupByKey for any reason in any circumstance is returning a java.lang.IllegalArgumentException.
I've boiled this down to about the simplest test I can think of:
package com.failuretest;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
public class TestReport {

    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("TestReport").set("spark.executor.memory", "20G");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> test = sc.parallelize(generateTestData());
        test.saveAsTextFile("/TEST/testfile1");
        test.mapToPair(line -> {
            String[] testParts = line.split(" ");
            return new Tuple2<String, String>(testParts[0], testParts[1]);
        }).groupByKey().saveAsTextFile("/TEST/testfile2");
        sc.close();
    }

    private static List<String> generateTestData() {
        List<String> testList = new ArrayList<String>();
        int keyCount = 0;
        int valCount = 0;
        while (valCount++ < 2000000) {
            if (valCount % 10 == 0) {
                keyCount++;
            }
            testList.add("Key" + keyCount + " " + "Val" + valCount);
        }
        return testList;
    }
}
I'm just programmatically creating an RDD that produces 10 values per key, then creating my JavaPairRDD with a simple split, then attempting groupByKey.
When it runs, I receive the following stack:
Exception in thread "main" java.lang.IllegalArgumentException
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKeyWithClassTag$1.apply(PairRDDFunctions.scala:88)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKeyWithClassTag$1.apply(PairRDDFunctions.scala:77)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:77)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$1.apply(PairRDDFunctions.scala:505)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$1.apply(PairRDDFunctions.scala:498)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.groupByKey(PairRDDFunctions.scala:498)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$3.apply(PairRDDFunctions.scala:641)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$3.apply(PairRDDFunctions.scala:641)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.groupByKey(PairRDDFunctions.scala:640)
at org.apache.spark.api.java.JavaPairRDD.groupByKey(JavaPairRDD.scala:559)
at com.failuretest.TestReport.main(TestReport.java:22)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
It doesn't get any further than the groupByKey (I'm writing a file above with the results, but it really doesn't matter since it never gets there).
I can run it all day long in my local dev instance, but running spark-submit with a jar containing the above fails every time in the cluster.
I'm really not sure where to go from here - what I am trying to do is a bit of a challenge if I cannot group by key.
Am I messing up? Is this a version conflict somewhere?
Dave
I actually figured this out before posting this, but in the interests of helping others...
I discovered that one of my colleagues had decided to play around with Java 10 on this particular cluster. I moved it back to Java 8 (sorry, didn't try 9) and the problem went away.
Dave
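If you suspect the same cause, one quick check is to log which JVM the driver and executors actually run on; Spark 2.3 is only supported on Java 8, so anything newer points at this failure mode. A minimal sketch (class and app names are illustrative):
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class JvmVersionCheck {
    public static void main(String[] args) {
        // The driver JVM: values like "9" or "10" here would explain the ClosureCleaner failure above.
        System.out.println("Driver JVM: " + System.getProperty("java.version"));

        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("JvmVersionCheck"));

        // Executors may run a different JVM than the driver. If the driver itself is on an
        // unsupported JVM, this map call fails with the same IllegalArgumentException,
        // which is diagnostic in its own right.
        String executorJvm = sc.parallelize(Arrays.asList(1))
                .map(x -> System.getProperty("java.version"))
                .first();
        System.out.println("Executor JVM: " + executorJvm);
        sc.close();
    }
}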
I'm hitting a GC overhead limit exceeded error in Spark using spark_apply. Here are my specs:
sparklyr v0.6.2
Spark v2.1.0
4 workers with 8 cores and 29G of memory
The closure get_dates pulls data from Cassandra one row at a time. There are about 200k rows in total. The process runs for about an hour and a half and then gives me this memory error.
I've experimented with spark.driver.memory which is supposed to increase the heap size, but it's not working.
Any ideas? Usage below
> config <- spark_config()
> config$spark.executor.cores = 1 # this ensures a max of 32 separate executors
> config$spark.cores.max = 26 # this ensures that cassandra gets some resources too, not all to spark
> config$spark.driver.memory = "4G"
> config$spark.driver.memoryOverhead = "10g"
> config$spark.executor.memory = "4G"
> config$spark.executor.memoryOverhead = "1g"
> sc <- spark_connect(master = "spark://master",
+ config = config)
> accounts <- sdf_copy_to(sc, insight %>%
+ # slice(1:100) %>%
+ {.}, "accounts", overwrite=TRUE)
> accounts <- accounts %>% sdf_repartition(78)
> dag <- spark_apply(accounts, get_dates, group_by = c("row"),
+ columns = list(row = "integer",
+ last_update_by = "character",
+ last_end_time = "character",
+ read_val = "numeric",
+ batch_id = "numeric",
+ fail_reason = "character",
+ end_time = "character",
+ meas_type = "character",
+ svcpt_id = "numeric",
+ org_id = "character",
+ last_update_date = "character",
+ validation_status = "character"
+ ))
> peak_usage <- dag %>% collect
Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.sql.execution.SparkPlan$$anon$1.next(SparkPlan.scala:260)
at org.apache.spark.sql.execution.SparkPlan$$anon$1.next(SparkPlan.scala:254)
at scala.collection.Iterator$class.foreach(Iterator.scala:743)
at org.apache.spark.sql.execution.SparkPlan$$anon$1.foreach(SparkPlan.scala:254)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:276)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:275)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2375)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2375)
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2778)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2375)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2351)
at sparklyr.Utils$.collect(utils.scala:196)
at sparklyr.Utils.collect(utils.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sparklyr.Invoke$.invoke(invoke.scala:102)
at sparklyr.StreamHandler$.handleMethodCall(stream.scala:97)
at sparklyr.StreamHandler$.read(stream.scala:62)
at sparklyr.BackendHandler.channelRead0(handler.scala:52)
at sparklyr.BackendHandler.channelRead0(handler.scala:14)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
Maybe I have misread your example but the memory problem seems to occur when you collect and not when you use spark_apply. Try
config$spark.driver.maxResultSize <- XXX
where XXX is what you expect to need (I have set it to 4G for a similar job). See https://spark.apache.org/docs/latest/configuration.html for further details.
This is a GC problem, so you could try configuring your JVM with different arguments. Are you using G1 as your GC?
If you are not able to provide more memory and GC collection times are the issue, you could try another JVM (maybe Zing from Azul Systems?).
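For example, GC flags can be passed through Spark's standard configuration properties. A minimal sketch using the JVM-side SparkConf API (in sparklyr the same keys can be set on the spark_config() object; the flag values are only illustrative, and driver-side options generally need to be set before the driver JVM starts, e.g. via spark-submit):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class G1Config {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("G1Config")
                // Ask the executor JVMs to use the G1 collector and log GC activity
                .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:+PrintGCDetails")
                // Driver-side equivalent (only effective if set before the driver starts)
                .set("spark.driver.extraJavaOptions", "-XX:+UseG1GC");

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... run the job ...
        sc.close();
    }
}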
I've set the overhead memory needed for spark_apply using spark.yarn.executor.memoryOverhead. I've found that using the by= argument of sdf_repartition is useful, and using group_by= in spark_apply also helps. The more you are able to split up your data between executors, the better.
I am trying to run a Spark sample in local mode, but am getting the following stack trace:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/internal/io/HadoopMapReduceCommitProtocol
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.internal.SQLConf$.<init>(SQLConf.scala:383)
at org.apache.spark.sql.internal.SQLConf$.<clinit>(SQLConf.scala)
at org.apache.spark.sql.internal.StaticSQLConf$$anonfun$buildConf$1.apply(SQLConf.scala:930)
at org.apache.spark.sql.internal.StaticSQLConf$$anonfun$buildConf$1.apply(SQLConf.scala:928)
at org.apache.spark.internal.config.TypedConfigBuilder$$anonfun$createWithDefault$1.apply(ConfigBuilder.scala:122)
at org.apache.spark.internal.config.TypedConfigBuilder$$anonfun$createWithDefault$1.apply(ConfigBuilder.scala:122)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.internal.config.TypedConfigBuilder.createWithDefault(ConfigBuilder.scala:122)
at org.apache.spark.sql.internal.StaticSQLConf$.<init>(SQLConf.scala:937)
at org.apache.spark.sql.internal.StaticSQLConf$.<clinit>(SQLConf.scala)
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$sessionStateClassName(SparkSession.scala:962)
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:111)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878)
at com.megaport.PipelineExample$.main(PipelineExample.scala:37)
at com.megaport.PipelineExample.main(PipelineExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.internal.io.HadoopMapReduceCommitProtocol
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
I can see the class in the GitHub repo, but it is not in the Maven lib or in the distro's spark-core_2.11-2.0.2.jar (I have the distro bundled with Hadoop).
The code I am trying to run is taken from the examples in the Spark distro, and it fails at the getOrCreate stage...
// scalastyle:off println
package com.megaport
// $example on$
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
// $example off$
import org.apache.spark.sql.SparkSession
object PipelineExample {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("My Spark Application") // optional and will be autogenerated if not specified
      .master("local[*]")              // avoid hardcoding the deployment environment
      // .enableHiveSupport()          // self-explanatory, isn't it?
      .getOrCreate

    // $example on$
    // Prepare training documents from a list of (id, text, label) tuples.
    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    // Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))

    // Fit the pipeline to training documents.
    val model = pipeline.fit(training)

    // Now we can optionally save the fitted pipeline to disk
    model.write.overwrite().save("/tmp/spark-logistic-regression-model")

    // We can also save this unfit pipeline to disk
    pipeline.write.overwrite().save("/tmp/unfit-lr-model")

    // And load it back in during production
    val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")

    // Prepare test documents, which are unlabeled (id, text) tuples.
    val test = spark.createDataFrame(Seq(
      (4L, "spark i j k"),
      (5L, "l m n"),
      (6L, "mapreduce spark"),
      (7L, "apache hadoop")
    )).toDF("id", "text")

    // Make predictions on test documents.
    model.transform(test)
      .select("id", "text", "probability", "prediction")
      .collect()
      .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
        println(s"($id, $text) --> prob=$prob, prediction=$prediction")
      }
    // $example off$

    spark.stop()
  }
}
Well, if it's not in your Java library, then you should download the dependent JAR and add it.
Check this SO question for more details:
How to import a jar in Eclipse
I am trying to search an index in my project.
When I search with the searchText
semester:"S1 / 2016" AND status:Validated AND emp_id>0
I get the results properly.
When I search with the searchText
semester:"S1 / 2016" AND status:Validated AND emp_id>500
I get the exception below:
java.lang.NegativeArraySizeException
at com.google.appengine.api.search.dev.GenericScorer.search(GenericScorer.java:196)
at com.google.appengine.api.search.dev.LocalSearchService.searchForApp(LocalSearchService.java:584)
at com.google.appengine.api.search.dev.LocalSearchService.search(LocalSearchService.java:534)
at sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.google.appengine.tools.development.ApiProxyLocalImpl$AsyncApiCall.callInternal(ApiProxyLocalImpl.java:541)
at com.google.appengine.tools.development.ApiProxyLocalImpl$AsyncApiCall.call(ApiProxyLocalImpl.java:484)
at com.google.appengine.tools.development.ApiProxyLocalImpl$AsyncApiCall.call(ApiProxyLocalImpl.java:461)
at java.util.concurrent.Executors$PrivilegedCallable$1.run(Executors.java:493)
at java.security.AccessController.doPrivileged(Native Method)
at java.util.concurrent.Executors$PrivilegedCallable.call(Executors.java:490)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Below is my code where the search is performed:
public Results<ScoredDocument> retrieveDocuments(String searchText) {
    if (searchText.length() > 2000) {
        throw new InternalException(new Exception("Error - Too long query !!"),
                BaseValidationMessages.SEARCH_STRING_EXCEEDS_LIMIT);
    }
    QueryOptions options = QueryOptions.newBuilder().setOffset(0).setLimit(2)
            .setSortOptions(createSortOptions("emp_id", "asc")).build();
    Query query = Query.newBuilder().setOptions(options).build(searchText);
    IndexSpec indexSpec = IndexSpec.newBuilder().setName("Beneficiaries").build();
    Index index = SearchServiceFactory.getSearchService().getIndex(indexSpec);
    return index.search(query);
}
Google App Engine throws a NegativeArraySizeException if the offset passed to the index search is out of range of the available results count.
For example, consider that there are 50 documents available in the search index for the given query text. If the offset set in the search query options is more than 50, executing the search throws a NegativeArraySizeException:
QueryOptions options = QueryOptions.newBuilder().setOffset(51)
So before setting the offset, make sure documents are available for that offset by checking the previous results.
First Query:
QueryOptions options = QueryOptions.newBuilder().setOffset(0).setLimit(50).build();
IndexSpec indexSpec = IndexSpec.newBuilder().setName("Beneficiaries").build();
Index index = SearchServiceFactory.getSearchService().getIndex(indexSpec);
Results<ScoredDocument> results = index.search(query);
If the above query returns exactly 50 results, search for the next set starting at offset 50:
QueryOptions options = QueryOptions.newBuilder().setOffset(50)
If fewer than 50 results are returned, we can assume there are no more results beyond that offset and stop searching.
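Putting that together, a minimal sketch of the paging loop, using the same com.google.appengine.api.search classes as above (the page size and processing step are illustrative):
int pageSize = 50;
int offset = 0;
IndexSpec indexSpec = IndexSpec.newBuilder().setName("Beneficiaries").build();
Index index = SearchServiceFactory.getSearchService().getIndex(indexSpec);

while (true) {
    QueryOptions options = QueryOptions.newBuilder()
            .setOffset(offset)
            .setLimit(pageSize)
            .build();
    Query query = Query.newBuilder().setOptions(options).build(searchText);
    Results<ScoredDocument> results = index.search(query);

    for (ScoredDocument document : results) {
        // process each document here (illustrative)
    }

    // A short page means there are no more documents, so we never ask for an out-of-range offset.
    if (results.getNumberReturned() < pageSize) {
        break;
    }
    offset += pageSize;
}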
I have broken a wiki XML dump into many small parts of 1M and tried to clean it (after it was cleaned with another program by somebody else).
I get an out-of-memory error which I don't know how to solve. Can anyone enlighten me?
I get the following error message:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.<init>(FreqProxTermsWriterPerField.java:212)
at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.newInstance(FreqProxTermsWriterPerField.java:235)
at org.apache.lucene.index.ParallelPostingsArray.grow(ParallelPostingsArray.java:48)
at org.apache.lucene.index.TermsHashPerField$PostingsBytesStartArray.grow(TermsHashPerField.java:252)
at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:292)
at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:151)
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:645)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:342)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:301)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:241)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:454)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1541)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1256)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1237)
at qa.main.ja.Indexing$$anonfun$5$$anonfun$apply$4.apply(SearchDocument.scala:234)
at qa.main.ja.Indexing$$anonfun$5$$anonfun$apply$4.apply(SearchDocument.scala:224)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.Iterator$class.foreach(Iterator.scala:750)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at qa.main.ja.Indexing$$anonfun$5.apply(SearchDocument.scala:224)
at qa.main.ja.Indexing$$anonfun$5.apply(SearchDocument.scala:220)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
Where line 234 is as follows:
writer.addDocument(document)
It is adding some documents to Lucene
and where line 224 is as follows:
for (doc <- target_xml \\ "doc") yield {
It is the first line of a for loop for adding various elements as fields in the index.
Is it a code problem, setting problem or hardware problem?
EDIT
Hi, this is my for loop:
(for (knowledgeFile <- knowledgeFiles) yield {
  System.err.println(s"processing file: ${knowledgeFile}")
  val target_xml = XML.loadString(" <file>" + cleanFile(knowledgeFile).mkString + "</file>")
  for (doc <- target_xml \\ "doc") yield {
    val id = (doc \ "@id").text
    val title = (doc \ "@title").text
    val text = doc.text
    val document = new Document()
    document.add(new StringField("id", id, Store.YES))
    document.add(new TextField("text", new StringReader(title + text)))
    writer.addDocument(document)
    val xml_doc = <page><title>{ title }</title><text>{ text }</text></page>
    id -> xml_doc
  }
}).flatten.toArray
The inner loop just loops through every doc element, and the outer loop loops through every file. Is the nested for the source of the problem?
Below is the cleanFile function for reference:
def cleanFile(fileName: String): Array[String] = {
  val tagRe = """<\/?doc.*?>""".r
  val lines = Source.fromFile(fileName).getLines.toArray
  val outLines = new Array[String](lines.length)
  for ((line, lineNo) <- lines.zipWithIndex) yield {
    if (tagRe.findFirstIn(line) != None) {
      outLines(lineNo) = line
    } else {
      outLines(lineNo) = StringEscapeUtils.escapeXml11(line)
    }
  }
  outLines
}
Thanks again
Looks like you would want to try increasing the heap size with the -Xmx JVM argument.
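For example (the 4g value and class name are only illustrative), you could launch the indexer with something like java -Xmx4g ... and verify from inside the JVM that the setting took effect:
public class HeapCheck {
    public static void main(String[] args) {
        // Prints the maximum heap the JVM may use; when started with "java -Xmx4g HeapCheck"
        // this should report roughly 4096 MB (minus some collector overhead).
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap: " + maxMb + " MB");
    }
}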