Spark join/groupby datasets take a lot time

Spark join/groupby datasets take a lot time - java

I have 2 datasets(tables) with 35kk+ rows.
I try to join(or group by) this datasets by some id. (in common it will be one-to-one)
But this operation takes a lot time: 25+ h.
Filters only works fine: ~20 mins.
Env: emr-5.3.1
Hadoop distribution:Amazon
Applications:Ganglia 3.7.2, Spark 2.1.0, Zeppelin 0.6.2
Instance type: m3.xlarge
Code (groupBy):
Dataset<Row> dataset = ...
...
.groupBy("id")
.agg(functions.min("date"))
.withColumnRenamed("min(date)", "minDate")
Code (join):
...
.join(dataset2, dataset.col("id").equalTo(dataset2.col("id")))
Also I found this message in EMR logs:
HashAggregateExec: spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current version of codegened fast hashmap does not support this aggregate.

There Might be a possibility of Data getting Skewed. We faced this. Check your joining column. This happens mostly if your joining column has NULLS
Check Data Stored pattern with :
select joining_column, count(joining_col) from <tablename>
group by joining_col
This will give you an idea that whether the data in your joining column is Evenly distributed

Related

Apache Camel SQL component select all records in a batch mode

I'm using apache camel as an ETL from (select *...) PostgreSQL to (insert...) MariaDB .
In the PostgreSQL there are a lot of records (more then 1M) and I want to do it in a batch way.
I've tried with several flag (batchCount, batchSize) but non of them worked.
I've also search in Apache Camel docs, without any success.
from("sql:SELECT * FROM my_schema.trees?dataSource=#postgersqlDataSource&batch=true")
.convertBodyTo(String.class)
.process(ex -> {
log.info("batch insert for single table");
List<Map<String, Object>> rows = ex.getIn().getBody(List.class);
log.info(String.format("Value: %s", rows.size()));
})
.to("stream:out");
But the program crashed because it load everything to the memory (with 1 records it worked of course).
Any advise?
it runs overs Spring boot.

The batch option is only for producer (eg to).
https://camel.apache.org/components/3.20.x/sql-component.html
Instead take a look at outputType=StreamList where you can combine this with split EIP (in streaming mode) to process the rows without loading all into memory.
This also mean you process 1 row at a time
from sql
split
process (1 row here)

Mongo DB / No duplicates

I have have a mongo collection that keeps state records for devices. Thus, there could be multiple records per device. What I would like to do is create a query through the mongoTemplate that gets the latest record for each device.
Here's the constraints:
Pass in a Set<'String'> name_ids, regular field within mongo collection not the _id or found within the _id
get only the latest record for each device with matching name_id
return List<'DeviceStateData'> (No duplicates should be found with the same name_id)
example of collection object:
{
_id: "241324123412",
name_id: "flyingMan",
powerState:"ON",
timeStamp: ISODate('')
}
Thanks

You should look on Distinct function.
Here you can find details with Spring.

Neo4j Bulk Data - Create Relationship [OutOfMemory Exception]

I am using Neo4j Procedure to create relationships on bulk data.
Initially insert that all data using load csv.
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///XXXX.csv" AS row
....
data size is too large[10M] but its successfully executed
my problem is i want to create relationships between this all nodes many-many
but i got exception [OutMemoryException] while executing queries
MATCH(n1:x{REMARKS :"LATEST"}) MATCH(n2:x{REMARKS :"LATEST"}) WHERE n1.DIST_ID=n2.ENROLLER_ID CREATE (n1)-[:ENROLLER]->(n2) ;
I have already created Indexing and Constraints also
Any idea please help me?

The problem is that your query is performed in one transaction, which leads to the exception [OutMemoryException]. And this is a problem, since at this moment the possibility of periodic transactions only have to load the CSV. So, you can, for example, re-read the CSV after first load:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///XXXX.csv" AS row
MATCH (n1:x{REMARKS :"LATEST", DIST_ID: row.DIST_ID})
WITH n1
MATCH(n2:x{REMARKS :"LATEST"}) WHERE n1.DIST_ID=n2.ENROLLER_ID
CREATE (n1)-[:ENROLLER]->(n2) ;
Or try the trick with periodic committing from the APOC library:
call apoc.periodic.commit("
MATCH (n2:x {REMARKS:'Latest'}) WHERE exists(n2.ENROLLER_ID)
WITH n2 LIMIT {perCommit}
OPTIONAL MATCH (n1:x {REMARKS:'Latest'}) WHERE n1.DIST_ID = n2.ENROLLER_ID
WITH n2, collect(n1) as n1s
FOREACH(n1 in n1s|
CREATE (n1)-[:ENROLLER]->(n2)
)
REMOVE n2.ENROLLER_ID
RETURN count(n2)",
{perCommit: 1000}
)
P.S. ENROLLER_ID property is used as a flag for selecting nodes for processing. Of course, you can use another flag, which is set in the processing.
Or a more accurate with apoc.periodic.iterate:
CALL apoc.periodic.iterate("
MATCH (n1:x {REMARKS:'Latest'})
MATCH (n2:x {REMARKS:'Latest'}) WHERE n1.DIST_ID = n2.ENROLLER_ID
RETURN n1,n2
","
WITH {n1} as n1, {n2} as n2
MERGE (n1)-[:ENROLLER]->(n2)
", {batchSize:10000, parallel:true}
)

DB2 ERRORCODE 4499 SQLSTATE=58009

On our production application we recently become weird error from DB2:
Caused by: com.ibm.websphere.ce.cm.StaleConnectionException: [jcc][t4][2055][11259][4.13.80] The database manager is not able to accept new requests, has terminated all requests in progress, or has terminated your particular request due to an error or a force interrupt. ERRORCODE=-4499, SQLSTATE=58009
This occurs when hibernate tries to select data from one big table(More than 6 milions records and 320 columns).
I observed that when ResultSet lower that 10 elements, hibernate selects successfully.
Our architecture:
Spring 4.0.3
Hibernate 4.3.5
DB2 v10 z/Os
Websphere 7.0.0.31(with JDBC V9.7FP5)
This select works when I tried to executed this in Data Studio or when app is started localy from Tomcat(connected to production Data Source). I suppose that Data Source on Websphere is not corectly configured, but I tried some modifications and without results. I also tried to update JDBC Driver but that not helped. Actually I become then ERRORCODE = -1244.
Ok, so now I'm looking for any help ;).
I can obviously provide additional information when needed.
Maybe someone fighted earlier with this problem?
Thanks in advance!

We have the same problem and finally solved by running REORG and RUNSTAT on the table(s). In our case, databse and tables were damaged and after running both mentioned operations, it resolved.

This occurs when hibernate tries to select data from one big table(More than 6 milions records and 320 columns)
6 million records with 320 columns seems huge to be read at once through hibernate. How you tried creating a database cursor and streaming few records at a time? In plain JDBC it is done as follows
Statement stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(50); //fetch only 50 records at a time
while with hibernate you would need the below code
Query query = session.createQuery(query);
query.setReadOnly(true);
query.setFetchSize(50);
ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
// iterate over results
while (results.next()) {
Object row = results.get();
// process row then release reference
// you may need to flush() as well
}
results.close();
This allows you to stream over the result set, however Hibernate will still cache results in the Session, so you’ll need to call session.flush() every so often. If you are only reading data, you might consider using a StatelessSession, though you should read its documentation beforehand.
Analyze the database table locking impact when using this approach.

produce hfiles for multiple tables to bulk load in a single map reduce

I am using mapreduce and HfileOutputFormat to produce hfiles and bulk load them directly into the hbase table.
Now, while reading the input files, I want to produce hfiles for two tables and bulk load the outputs in a single mapreduce.
I searched the web and see some links about MultiHfileOutputFormat and couldn't find a real solution to that.
Do you think that it is possible?

My way is :
use HFileOutputFormat as well, when the job is completed , doBulkLoad, write into table1.
set a List puts in mapper, and a MAX_PUTS value in global.
when puts.size()>MAX_PUTS, do:
String tableName = conf.get("hbase.table.name.dic", table2);
HTable table = new HTable(conf, tableName);
table.setAutoFlushTo(false);
table.setWriteBufferSize(1024*1024*64);
table.put(puts);
table.close();
puts.clear();
notice:you mast have a cleanup function to write the left puts .

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Spark join/groupby datasets take a lot time - java

Related

Apache Camel SQL component select all records in a batch mode

Mongo DB / No duplicates

Neo4j Bulk Data - Create Relationship [OutOfMemory Exception]

DB2 ERRORCODE 4499 SQLSTATE=58009

produce hfiles for multiple tables to bulk load in a single map reduce

Categories

Resources