Iterative GraphFrames AggregateMessages hitting memory limits - java

I'm using GraphFrame's aggregateMessages capability to build a custom clustering algorithm. I tested the algorithm on a small sample dataset (~100 items) and verified that it works. But when I run it on my real dataset of 50k items, I get OOM errors after ~10 iterations. Interestingly, the first few iterations complete in a couple of minutes and memory stays in the normal range; it's after iteration 6 that memory usage creeps up to ~30 GB and eventually blows up. I am running this on a 2-node cluster with 16 cores and 32 GB of memory.
Since this is an iterative algorithm, and since memory only grows after each iteration, I wonder if I need to release memory somehow. I added the unpersist blocks at the end of the loop, but that hasn't helped.
Are there any other efficiencies I could apply? Are there best practices for using GraphFrames in an iterative setting?
Another thing I've noticed: on the executor page of the Spark UI, the used "storage memory" is only ~300 MB, yet the Spark process is in fact taking ~30 GB. Not sure if this is a memory leak!
while (true) {
    System.out.println("[" + new Date() + "] Running " + i);

    Dataset<Row> lastRoutesDs = groups;

    // Unwind the current groups so that each route item becomes a vertex id
    Dataset<Row> groupUnwind = groups.withColumn("id", explode(col("routeItems")));

    GraphFrame gf = new GraphFrame(groupUnwind, edgesDs);

    // Send each destination's route to the source when the source is not already part of it,
    // then keep the best incoming route per group
    Dataset<Row> lvl1 = gf.aggregateMessages()
            .sendToSrc(when(
                    callUDF("contains_in_array_str", AggregateMessages.dst().getField("routeItems"),
                            AggregateMessages.src().getField("id")).equalTo(false),
                    struct(AggregateMessages.dst().getField("routeItems").as("routeItems"),
                            AggregateMessages.dst().getField("routeScores").as("routeScores"),
                            AggregateMessages.dst().getField("grpId").as("grpId"),
                            AggregateMessages.dst().getField("grpScore").as("grpScore"),
                            AggregateMessages.edge().getField("score").as("edgeScore"))))
            .agg(collect_set(AggregateMessages.msg()).as("incomings"))
            .withColumn("inItem", explode(col("incomings")))
            .groupBy("id", "inItem.grpId")
            .agg(first("inItem.routeItems").as("routeItems"), first("inItem.routeScores").as("routeScores"),
                    first("inItem.grpScore").as("grpScore"), collect_list("inItem.edgeScore").as("inScores"))
            .groupBy("grpId")
            .agg(bestRouteAgg.apply(col("routeItems"), col("routeScores"), col("inScores"), col("grpScore"),
                    col("id"), col("grpScore")).as("best"))
            .withColumn("newScore", callUDF("calcRouteScores", expr("size(best.routeItems)+1"),
                    col("best.routeScores"), col("best.inScores")))
            .withColumn("edgeCount", expr("size(best.routeScores)"))
            .persist(StorageLevel.MEMORY_AND_DISK());

    // Write out the routes whose score has exceeded the threshold
    lvl1.filter("newScore > " + groupMaxScore)
            .withColumn("itr", lit(i))
            .select("grpId", "best.routeItems", "best.routeScores", "best.grpScore", "edgeCount", "itr")
            .write()
            .mode(SaveMode.Append)
            .json(workspaceDir + "clusters-rank-collect");

    if (lvl1.count() == 0) {
        System.out.println("****** End reached " + i);
        break;
    }

    // The remaining routes grow by one node and become the input of the next iteration
    Dataset<Row> newGroups = lvl1.filter("newScore <= " + groupMaxScore)
            .withColumn("routeItems_new",
                    callUDF("merge2Array", col("best.routeItems"), array(col("best.newNode"))))
            .withColumn("routeScores_new",
                    callUDF("merge2ArrayDouble", col("best.routeScores"), col("best.inScores")))
            .select(col("grpId"), col("routeItems_new").as("routeItems"),
                    col("routeScores_new").as("routeScores"), col("newScore").as("grpScore"));

    // Checkpoint every other iteration to truncate the lineage
    if (i > 0 && (i % 2) == 0) {
        newGroups = newGroups.checkpoint();
    }

    newGroups = newGroups.persist(StorageLevel.DISK_ONLY());
    System.out.println(newGroups.count());

    // Release the previous iteration's cached datasets
    groups.unpersist();
    lastRoutesDs.unpersist();
    groupUnwind.unpersist();
    lvl1.unpersist();

    groups = newGroups;
    i++;
}
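For completeness, the checkpoint() call in the loop assumes a checkpoint directory was configured once before entering it; that setup looks roughly like this (spark here stands for the job's SparkSession, and the path is just a placeholder for my environment):

// Set once, before the loop: Dataset.checkpoint() needs a directory to write to.
spark.sparkContext().setCheckpointDir(workspaceDir + "checkpoints");

// On newer Spark versions, localCheckpoint() truncates lineage without a configured
// directory, at the cost of fault tolerance for the checkpointed data:
// newGroups = newGroups.localCheckpoint();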

Related

Get value from website but Java code does not work. How to fix?

My code is not working. It throws an error on this line: int temp = Integer.parseInt(currentTemp.substring(0, currentTemp.indexOf("˚ "))); I have tried several approaches but could not fix it. Maybe a different factor is affecting it. Any ideas on how to fix this?
The error is here:
Background: # darksky.feature:5
Given I am on Darksky Home Page # DarkskySD.iAmOnDarkskyHomePage()
Current Temp: 43°
Current Temp:43˚ Rain.
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.substring(String.java:1967)
at framework.DarkskyTS.etr(DarkskyTS.java:138)
at stepdefinition.DarkskySD.currentTempGreaterOrless(DarkskySD.java:39)
at ✽.Then I verify current temp is not greater or less then temps from daily timeline(darksky.feature:25)
#currenttempgreaterorless
Scenario: Verify Current Temperature should not be greater or less than the Temperature from Daily Timeline # darksky.feature:24
Then I verify current temp is not greater or less then temps from daily timeline # DarkskySD.currentTempGreaterOrless()
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.substring(String.java:1967)
at framework.DarkskyTS.etr(DarkskyTS.java:138)
at stepdefinition.DarkskySD.currentTempGreaterOrless(DarkskySD.java:39)
at ✽.Then I verify current temp is not greater or less then temps from daily timeline(darksky.feature:25)
Failed scenarios:
darksky.feature:24 # Scenario: Verify Current Temperature should not be greater or less than the Temperature from Daily Timeline
1 Scenarios (1 failed)
2 Steps (1 failed, 1 passed)
0m5.234s
public void tempValue() {
    String currentTemp = SharedSD.getDriver().findElement(By.cssSelector(".summary.swap")).getText();
    System.out.println("Current Temp:" + currentTemp);
    List<WebElement> tempsInTimeLine = SharedSD.getDriver().findElements(By.cssSelector(".temps span:last-child"));
    int temp = Integer.parseInt(currentTemp.substring(0, currentTemp.indexOf("˚ ")));
    int highestInTimeLine = temp;
    int lowestInTimeLine = temp;
    for (WebElement tempInTime : tempsInTimeLine) {
        String sLIneTemp = tempInTime.getText();
        int lineTemp = Integer.parseInt(sLIneTemp.substring(0, sLIneTemp.indexOf("˚ ")));
        if (lineTemp > highestInTimeLine) {
            highestInTimeLine = lineTemp;
        }
        if (lineTemp < lowestInTimeLine) {
            lowestInTimeLine = lineTemp;
        }
        //int lineTemp = Integer.parseInt(sLIneTemp.substring(0, sLIneTemp.indexOf("˚ ")));
    }
    System.out.println("Highest Temp:" + highestInTimeLine);
    System.out.println("Lowest Temp:" + lowestInTimeLine);
}
Instead of using indexOf("° "), try using indexOf("°"). It will work. The space you are putting after the degree symbol is unnecessary.
That message is telling you that "˚ " cannot be found in the string you are trying to parse. Like @Ashish, I suspect this is caused by the space after your degree symbol. Your new line should look like:
int lineTemp = Integer.parseInt(sLIneTemp.substring(0, sLIneTemp.indexOf("˚")));
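If you also want to guard against the symbol being missing altogether (a sketch of my own, not required for the fix), check indexOf's return value before calling substring:

String sLIneTemp = tempInTime.getText();          // e.g. "43˚" or "43° Rain."
int cut = sLIneTemp.indexOf('˚');                 // -1 when the ring/degree symbol is absent
if (cut < 0) {
    cut = sLIneTemp.indexOf('°');                 // the page may emit either character
}
if (cut > 0) {
    int lineTemp = Integer.parseInt(sLIneTemp.substring(0, cut).trim());
    // ... compare against highestInTimeLine / lowestInTimeLine as before
}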
Here is an answer to the question:
driver.get("https://darksky.net/forecast/40.7127,-74.0059/us12/en");
String currentTemp = driver.findElement(By.cssSelector(".summary.swap")).getText();
System.out.println("Current Temp:" + currentTemp);
List<WebElement> tempsInTimeLine = driver.findElements(By.cssSelector(".temps span:last-child"));
int temp = Integer.parseInt(currentTemp.substring(0, 2));
int highestInTimeLine = temp;
int lowestInTimeLine = temp;
for (WebElement tempInTime : tempsInTimeLine) {
    String sLIneTemp = tempInTime.getText();
    int lineTemp = Integer.parseInt(sLIneTemp.substring(0, 2));
    if (lineTemp > highestInTimeLine) {
        highestInTimeLine = lineTemp;
    }
    if (lineTemp < lowestInTimeLine) {
        lowestInTimeLine = lineTemp;
    }
}
System.out.println("Highest Temp:" + Integer.toString(highestInTimeLine));
System.out.println("Lowest Temp:" + Integer.toString(lowestInTimeLine));
I ran into the same issue, and I am not sure whether my environment was the cause or not.
When I debugged that indexOf("°") call, I saw that the parameter was actually a mangled two-character sequence ("Â°", the Cp1252 misreading of the UTF-8 bytes for "°") rather than just "°". I changed it to the char literal '°' and it worked.
However, when running mvn clean install, I noticed a warning:
[WARNING] Using platform encoding (Cp1252 actually) to copy filtered resources, i.e. build is platform dependent!
So the file was UTF-8 (not sure whether it had a BOM or not) and was being misinterpreted by Maven.
Going to their FAQ, they suggest adding:
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
under <properties>.
I added it, ran git checkout -- path/to/file.java to get the original code/encoding and re-ran mvn clean install.
The warning was no longer emitted and the issue was fixed.
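To illustrate the mix-up (a sketch of my own, not part of the original build): the UTF-8 bytes for "°" decode to two characters under Cp1252, so a literal compiled with the wrong encoding never matches the text coming from the page.

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DegreeEncodingDemo {
    public static void main(String[] args) {
        byte[] utf8Degree = "°".getBytes(StandardCharsets.UTF_8);               // 0xC2 0xB0
        String misread = new String(utf8Degree, Charset.forName("windows-1252"));
        System.out.println(misread);                                             // prints "Â°", not "°"
        System.out.println(misread.equals("°"));                                 // false
    }
}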

BigQueue disk space not clearing

I am using a Java persistent queue named BigQueue. It stores its data on disk, and bigQueue.gc() is meant to clear the used disk space. However, bigQueue.gc() is not clearing the used disk space, and the disk usage is continuously increasing.
IBigQueue bigQueue = new BigQueueImpl("/home/test/BigQueueNew", "demo1");

for (int i = 0; i < 10000; i++) {
    ManagedObject mo = new ManagedObject();
    mo.setName("Aravind " + i);
    bigQueue.enqueue(serialize(mo));
}

while (!bigQueue.isEmpty()) {
    ManagedObject mo = (ManagedObject) deserialize(bigQueue.dequeue());
    System.out.println("Key Dqueue ME");
}

bigQueue.close();
// bigQueue.removeAll();
// bigQueue.gc();
// System.out.println("Big Queue is " + bigQueue.isEmpty() + " Size is " + bigQueue.size());
In case someone else is looking at this as well: if you are using Java 11 on Ubuntu, this could be a known issue. Refer to the link below.
Unless it is fixed upstream, you could download the source and fix it yourself.
https://github.com/bulldog2011/bigqueue/issues/39
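Independent of that bug, note that in the snippet above gc() is only ever reached after close() (and is commented out). A minimal sketch of the intended usage, based on the bulldog2011/bigqueue API already used in the question, calls gc() on the open queue once the backlog has been consumed:

IBigQueue bigQueue = new BigQueueImpl("/home/test/BigQueueNew", "demo1");

// ... enqueue and dequeue as in the question ...

// Reclaim the page files of already-consumed data while the queue is still open.
bigQueue.gc();

// Or remove all remaining items together with their backing files:
// bigQueue.removeAll();

bigQueue.close();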

sparklyr failing with java.lang.OutOfMemoryError: GC overhead limit exceeded

I'm hitting a GC overhead limit exceeded error in Spark using spark_apply. Here are my specs:
sparklyr v0.6.2
Spark v2.1.0
4 workers with 8 cores and 29G of memory
The closure get_dates pulls data from Cassandra one row at a time. There are about 200k rows in total. The process runs for about an hour and a half and then gives me this memory error.
I've experimented with spark.driver.memory, which is supposed to increase the heap size, but it's not working.
Any ideas? Usage below:
> config <- spark_config()
> config$spark.executor.cores = 1 # this ensures a max of 32 separate executors
> config$spark.cores.max = 26 # this ensures that cassandra gets some resources too, not all to spark
> config$spark.driver.memory = "4G"
> config$spark.driver.memoryOverhead = "10g"
> config$spark.executor.memory = "4G"
> config$spark.executor.memoryOverhead = "1g"
> sc <- spark_connect(master = "spark://master",
+ config = config)
> accounts <- sdf_copy_to(sc, insight %>%
+ # slice(1:100) %>%
+ {.}, "accounts", overwrite=TRUE)
> accounts <- accounts %>% sdf_repartition(78)
> dag <- spark_apply(accounts, get_dates, group_by = c("row"),
+ columns = list(row = "integer",
+ last_update_by = "character",
+ last_end_time = "character",
+ read_val = "numeric",
+ batch_id = "numeric",
+ fail_reason = "character",
+ end_time = "character",
+ meas_type = "character",
+ svcpt_id = "numeric",
+ org_id = "character",
+ last_update_date = "character",
+ validation_status = "character"
+ ))
> peak_usage <- dag %>% collect
Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.sql.execution.SparkPlan$$anon$1.next(SparkPlan.scala:260)
at org.apache.spark.sql.execution.SparkPlan$$anon$1.next(SparkPlan.scala:254)
at scala.collection.Iterator$class.foreach(Iterator.scala:743)
at org.apache.spark.sql.execution.SparkPlan$$anon$1.foreach(SparkPlan.scala:254)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:276)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:275)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2375)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2375)
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2778)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2375)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2351)
at sparklyr.Utils$.collect(utils.scala:196)
at sparklyr.Utils.collect(utils.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sparklyr.Invoke$.invoke(invoke.scala:102)
at sparklyr.StreamHandler$.handleMethodCall(stream.scala:97)
at sparklyr.StreamHandler$.read(stream.scala:62)
at sparklyr.BackendHandler.channelRead0(handler.scala:52)
at sparklyr.BackendHandler.channelRead0(handler.scala:14)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
Maybe I have misread your example but the memory problem seems to occur when you collect and not when you use spark_apply. Try
config$spark.driver.maxResultSize <- XXX
where XXX is what you expect to need (I have set it to 4G for a similar job). See https://spark.apache.org/docs/latest/configuration.html for further details.
This is a GC problem. Maybe you should try configuring your JVM with other arguments; are you using G1 as your GC?
If you are not able to provide more memory and you have issues with GC collection times, you could try another JVM (maybe Zing from Azul Systems?).
I've set the overhead memory needed for spark_apply using spark.yarn.executor.memoryOverhead. I've found that using the by= argument of sdf_repartition is useful, and using the group_by= argument of spark_apply also helps. The more you are able to split up your data between executors, the better.

Sigar ProcCpu gather method always returns 0 for percentage value

I'm using Sigar to try to get the CPU and memory usage of individual processes (under Windows). I am able to get these stats correctly for the system as a whole with the code below:
Sigar sigar = new Sigar();
long totalMemory = sigar.getMem().getTotal() / 1024 / 1024;
model.addAttribute("totalMemory", totalMemory);
double usedPercentage = sigar.getMem().getUsedPercent();
model.addAttribute("usedPercentage", String.format("%.2f", usedPercentage));
double freePercentage = sigar.getMem().getFreePercent();
model.addAttribute("freePercentage", String.format("%.2f", freePercentage));
double cpuUsedPercentage = sigar.getCpuPerc().getCombined() * 100;
model.addAttribute("cpuUsedPercentage", String.format("%.2f", cpuUsedPercentage));
This displays the following quite nicely in my web page :
Total System Memory : 16289 MB
Used Memory Percentage : 66.81 %
Free Memory Percentage : 33.19 %
CPU Usage : 30.44 %
Now I'm trying to get info from individual processes such as Java and SQL Server and, while the memory is correctly gathered, the CPU usage for both processes is ALWAYS 0. Below is the code I'm using :
Sigar sigar = new Sigar();
List<ProcessInfo> processes = new ArrayList<>();
ProcessFinder processFinder = new ProcessFinder(sigar);
long[] javaPIDs = null;
Long sqlPID = null;
try {
    javaPIDs = processFinder.find("Exe.Name.ct=" + "java.exe");
    sqlPID = processFinder.find("Exe.Name.ct=" + "sqlservr.exe")[0];
} catch (Exception ex) {
}
int i = 0;
while (i < javaPIDs.length) {
    Long javaPID = javaPIDs[i];
    ProcessInfo javaProcess = new ProcessInfo();
    javaProcess.setPid(javaPID);
    javaProcess.setName("Java");
    ProcMem javaMem = new ProcMem();
    javaMem.gather(sigar, javaPID);
    javaProcess.setMemoryUsage(javaMem.getResident() / 1024 / 1024);
    MultiProcCpu javaCpu = new MultiProcCpu();
    javaCpu.gather(sigar, javaPID);
    javaProcess.setCpuUsage(String.format("%.2f", javaCpu.getPercent() * 100));
    processes.add(javaProcess);
    i++;
}
if (sqlPID != null) {
    ProcessInfo sqlProcess = new ProcessInfo();
    sqlProcess.setPid(sqlPID);
    sqlProcess.setName("SQL Server");
    ProcMem sqlMem = new ProcMem();
    sqlMem.gather(sigar, sqlPID);
    sqlProcess.setMemoryUsage(sqlMem.getResident() / 1024 / 1024);
    ProcCpu sqlCpu = new MultiProcCpu();
    sqlCpu.gather(sigar, sqlPID);
    sqlProcess.setCpuUsage(String.format("%.2f", sqlCpu.getPercent()));
    processes.add(sqlProcess);
}
model.addAttribute("processes", processes);
I have tried both ProcCpu and MultiProcCpu, and both of them always return 0.0 even when I can see Java using 15% CPU in Task Manager. The documentation for the Sigar library is virtually non-existent, but the research I did suggests that I appear to be doing this correctly.
Does anyone know what I'm doing wrong?
Thanks!
I found the issue while continuing to search online. Basically, the Sigar library can only retrieve correct CPU values after it has been observing the process for some time. My issue was that I was initializing a new Sigar instance every time the page was displayed. I made my Sigar instance global to my Spring controller, and now it returns correct percentages.
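A minimal sketch of that fix (the class and method names are illustrative, not from the original post): keep one long-lived Sigar instance and reuse it across requests, so ProcCpu can compute a delta between successive gather() calls.

import org.hyperic.sigar.ProcCpu;
import org.hyperic.sigar.Sigar;
import org.hyperic.sigar.SigarException;

public class ProcessStatsService {

    // One long-lived instance; per-process CPU percentages are derived from the
    // difference between successive gather() calls on the same Sigar object.
    private final Sigar sigar = new Sigar();

    public double cpuPercent(long pid) throws SigarException {
        ProcCpu cpu = new ProcCpu();
        cpu.gather(sigar, pid);          // the very first call after startup may still report 0.0
        return cpu.getPercent() * 100;   // fraction -> percent
    }
}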

Orientdb - SQL query with millions of vertices causes Java OutOfMemory error

I need to create edges between all vertices of class V1 and all vertices of class V2. My classes have 2-3 million vertices each. A double for loop with a SELECT * FROM V1 and a SELECT * FROM V2 gives a Java OutOfMemory (heap space) error (see below). This is an offline process that will be performed once or twice if needed (not a frequent operation), as the graph will not be regularly updated by the users, only by myself.
How can I do this in batches (using SELECT...LIMIT or g.getVertices()) to avoid the error?
Here's my code:
OrientGraphNoTx G = MyOrientDBFactory.getNoTx();
G.setUseLightweightEdges(false);
G.declareIntent(new OIntentMassiveInsert());

for (Vertex p1 : (Iterable<Vertex>) G.command(new OCommandSQL("SELECT * FROM V1")).execute())
{
    for (Vertex p2 : (Iterable<Vertex>) G.command(new OCommandSQL("SELECT * FROM V2")).execute())
    {
        if (p1.getProperty("prop1") == p2.getProperty("prop1"))
        {
            //p1.addEdge("MyEdge", p2);
            G.command(new OCommandSQL("create edge MyEdge from " + p1.getId() + " to " + p2.getId() + " retry 100")).execute();
        }
    }
}
G.shutdown();
OrientDB 2.1.5 with Java/Graph API
NetBeans 8.1 with VM options -Xmx4096m and -Dstorage.diskCache.bufferSize=7200
Error message in console:
2016-05-24 15:48:06:112 INFO {db=MyDB} [TIP] Query 'SELECT * FROM V1' returned a result set with more than 10000 records. Check if you really need all these records, or reduce the resultset by using a LIMIT to improve both performance and used RAM
[OProfilerStub] java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid7896.hprof ...
Error message in NetBeans output:
Exception in thread "main" com.orientechnologies.orient.enterprise.channel.binary.OResponseProcessingException: Exception during response processing.
at com.orientechnologies.orient.enterprise.channel.binary.OChannelBinaryAsynchClient.throwSerializedException(OChannelBinaryAsynchClient.java:443)
at com.orientechnologies.orient.enterprise.channel.binary.OChannelBinaryAsynchClient.handleStatus(OChannelBinaryAsynchClient.java:398)
at com.orientechnologies.orient.enterprise.channel.binary.OChannelBinaryAsynchClient.beginResponse(OChannelBinaryAsynchClient.java:282)
at com.orientechnologies.orient.enterprise.channel.binary.OChannelBinaryAsynchClient.beginResponse(OChannelBinaryAsynchClient.java:171)
at com.orientechnologies.orient.client.remote.OStorageRemote.beginResponse(OStorageRemote.java:2166)
at com.orientechnologies.orient.client.remote.OStorageRemote.command(OStorageRemote.java:1189)
at com.orientechnologies.orient.client.remote.OStorageRemoteThread.command(OStorageRemoteThread.java:444)
at com.orientechnologies.orient.core.command.OCommandRequestTextAbstract.execute(OCommandRequestTextAbstract.java:63)
at com.tinkerpop.blueprints.impls.orient.OrientGraphCommand.execute(OrientGraphCommand.java:49)
at xx.xxx.xxx.xx.MyEdge.(MyEdge.java:40)
at xx.xxx.xxx.xx.GMain.main(GMain.java:60)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
As a workaround, you can use code similar to the following:
Iterable<Vertex> cv1 = g.command(new OCommandSQL("SELECT count(*) FROM V1")).execute();
long counterv1 = cv1.iterator().next().getProperty("count");
int[] ids = g.getRawGraph().getMetadata().getSchema().getClass("V1").getClusterIds();
long repeat = counterv1 / 10000;
long rest = counterv1 - (repeat * 10000);
List<Vertex> v1 = new ArrayList<Vertex>();
int rid = 0;
for (int i = 0; i < repeat; i++) {
    Iterable<Vertex> v = g.command(new OCommandSQL("SELECT * FROM V1 WHERE #rid >= " + ids[0] + ":" + rid + " limit 10000")).execute();
    CollectionUtils.addAll(v1, v.iterator());
    rid = 10000 * (i + 1);
}
if (rest > 0) {
    Iterable<Vertex> v = g.command(new OCommandSQL("SELECT * FROM V1 WHERE #rid >= " + ids[0] + ":" + rid + " limit " + rest)).execute();
    CollectionUtils.addAll(v1, v.iterator());
}
Hope it helps.
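Building on that, the collected v1 list can then be matched against V2 without loading V2 whole. A sketch of my own (not part of the original answer), reusing the graph handle g and the create edge SQL from the question, and assuming prop1 is indexed on V2:

for (Vertex p1 : v1) {
    Object key = p1.getProperty("prop1");
    // Look up only the V2 vertices that share prop1 with this V1 vertex.
    Iterable<Vertex> matches = g.command(
            new OCommandSQL("SELECT * FROM V2 WHERE prop1 = ?")).execute(key);
    for (Vertex p2 : matches) {
        g.command(new OCommandSQL(
                "create edge MyEdge from " + p1.getId() + " to " + p2.getId() + " retry 100")).execute();
    }
}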
