Spark Application FAIR Scheduling - java

I have a job that iterates over a table's columns to get the distinct values of each one. Each query takes about 6 seconds, but they don't use the full CPUs. That's why I decided to use FAIR scheduling within an application, so the resources can be fully used. Currently this application has 4 cores and 10 GB of RAM.
I have added the following lines to my spark-defaults.conf file:
spark.scheduler.mode FAIR
spark.scheduler.allocation.file /bin/spark/pools.xml
I have created the following pool:
<pool name="filters">
  <schedulingMode>FAIR</schedulingMode>
  <weight>1000</weight>
  <minShare>0</minShare>
</pool>
And this is my code:
List<ColumnMetadata> fields = getCategoryFieldsFromViewMetadata(...);
Dataset<Row> dsCube = sqlContext.sql("...");
dsCube = dsCube
    .select(JavaConversions.asScalaBuffer(filterColumns))
    .persist(StorageLevel.MEMORY_ONLY());
dsCube.createOrReplaceTempView("filter_temp");

sqlContext.sparkContext().setLocalProperty("spark.scheduler.pool", "filters");
fields.parallelStream().forEach((ColumnMetadata field) -> {
    Dataset<Row> temp = sqlContext.sql("select distinct tenant_id, user_domain, cube_name, field, value "
            + "from filter_temp");
    saveDataFrameToMySQL("analytics_cubes_filters", temp, SaveMode.Append); // Here I save the results to a MySQL table.
});
sqlContext.sparkContext().setLocalProperty("spark.scheduler.pool", null);
The filters pool is being used (I can see it in the Spark application GUI) and the jobs are executed in parallel, but where each query previously took 6 seconds in FIFO mode, 4 parallel queries in FAIR mode now take 24 seconds. I have checked the CPU usage and it looks the same as before, when FIFO mode was being used.
Am I missing something?
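One detail that may be worth checking: setLocalProperty("spark.scheduler.pool", ...) only tags jobs submitted from the thread that calls it, while parallelStream() runs the lambda on common ForkJoinPool worker threads, so the queries submitted there may not land in the filters pool at all. A minimal sketch that sets the pool inside each task (reusing the filter_temp view and the saveDataFrameToMySQL helper from the question) would be:

fields.parallelStream().forEach((ColumnMetadata field) -> {
    // Tag the job submitted from this worker thread with the "filters" pool;
    // local properties are per-thread, so set it where the job is actually submitted.
    sqlContext.sparkContext().setLocalProperty("spark.scheduler.pool", "filters");
    Dataset<Row> temp = sqlContext.sql(
        "select distinct tenant_id, user_domain, cube_name, field, value from filter_temp");
    saveDataFrameToMySQL("analytics_cubes_filters", temp, SaveMode.Append);
});

Note also that with only 4 cores, running 4 of these queries at once mostly divides the same CPU time between them, so 4 x 6 s taking roughly 24 s of wall-clock time is about what fair sharing would predict unless the single query was genuinely leaving cores idle.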

Related

SAP ASE 16 Sybase Database Duplicate records polling

I am trying to read records from an SAP ASE 16 database table concurrently using Java to increase performance. I am using a SELECT ... FOR UPDATE query to read the table records concurrently; here, two threads are trying to read records from a single table.
I am executing this in a microservice-based environment.
I have done below configuration on database:
• I have enabled select for update using: sp_configure "select for update", 1
• To set locking scheme I have used: alter table poll lock datarows
Table name: poll
This is the query that I am trying to execute:
SQL Query:
SELECT e_id, e_name, e_status FROM poll WHERE (e_status = 'new') FOR UPDATE
UPDATE poll SET e_status = 'polled' WHERE (e_id = #e_id)
Problem:
For some reason I am getting duplicate records when executing the queries above, for the majority of the records, sometimes more than 200 or 300 of them.
It seems like the locks are not being acquired during execution of the statements above. Is there any configuration that I am missing on the database side? Does it have anything to do with shared locks versus exclusive locks?
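For context, a minimal JDBC sketch of the read-then-mark flow described above, with both statements kept in one transaction so the FOR UPDATE locks are held until the status change is committed (table and column names are from the question; e_id is assumed to be numeric here):

// conn is an open java.sql.Connection to the ASE database
conn.setAutoCommit(false);
try (PreparedStatement select = conn.prepareStatement(
         "SELECT e_id, e_name, e_status FROM poll WHERE e_status = 'new' FOR UPDATE");
     PreparedStatement update = conn.prepareStatement(
         "UPDATE poll SET e_status = 'polled' WHERE e_id = ?")) {
    try (ResultSet rs = select.executeQuery()) {
        while (rs.next()) {
            // Mark every locked row before committing, so the second thread
            // no longer sees it as 'new' once the locks are released.
            update.setLong(1, rs.getLong("e_id"));
            update.executeUpdate();
        }
    }
    conn.commit();   // releases the exclusive locks taken by FOR UPDATE
} catch (SQLException e) {
    conn.rollback();
    throw e;
}

If the select and the update run on separate connections, or with autocommit on, the locks from FOR UPDATE can be released before the status is updated, which could explain two threads picking up the same rows.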

Concurrency Batch filling the database

I have an embedded SQL database that contains 2 million+ rows with String and Integer fields. The database is filled by addBatch and executeBatch operations, where one batch = 100,000 requests.
The function which creates one batch:
private static final int LIMIT = 100_000;

public void insertData(Data data) throws SQLException {
    if (insertCounter >= LIMIT) {
        flushToDb();   // presumably executeBatch() + commit, then reset insertCounter
    }
    prepareInsert.setString(1, data.getString());
    prepareInsert.setString(2, data.getString());
    prepareInsert.setString(3, data.getString());
    prepareInsert.setString(4, data.getString());
    prepareInsert.setInt(5, data.getInteger());
    prepareInsert.setString(6, data.getString());
    prepareInsert.setInt(7, data.getInteger());
    prepareInsert.addBatch();
    insertCounter++;
}
When I use only one thread, the database is filled in 13 seconds.
However, when I try to add concurrency, my performance does not increase.
In my case I create
executorService = Executors.newFixedThreadPool(THREAD_NUMBER);
It executes the insertData tasks from a BlockingQueue concurrently, but my program's running time increases to 18 seconds.
In the project I use the HSQL database because it supports concurrent write and read operations.
I'd like to hear your ideas on how to improve my multi-threaded solution for filling the database.
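A minimal sketch of one possible multi-threaded layout, assuming each worker thread gets its own Connection and PreparedStatement (a single shared PreparedStatement would force the threads to serialize on it); the table layout, the Data type and the jdbcUrl/queue variables are placeholders for whatever the real project uses:

// One worker: drain a shared BlockingQueue<Data> and batch-insert with its own connection.
Runnable worker = () -> {
    try (Connection conn = DriverManager.getConnection(jdbcUrl);
         PreparedStatement ps = conn.prepareStatement(
             "INSERT INTO my_table VALUES (?, ?, ?, ?, ?, ?, ?)")) {
        conn.setAutoCommit(false);
        int counter = 0;
        Data data;
        // Simplification: stop once the queue stays empty for a second.
        while ((data = queue.poll(1, TimeUnit.SECONDS)) != null) {
            ps.setString(1, data.getString());
            // ... set the remaining parameters as in insertData() ...
            ps.addBatch();
            if (++counter % 100_000 == 0) {
                ps.executeBatch();   // flush this thread's batch
                conn.commit();
            }
        }
        ps.executeBatch();           // flush whatever is left
        conn.commit();
    } catch (SQLException | InterruptedException e) {
        throw new RuntimeException(e);
    }
};

ExecutorService executorService = Executors.newFixedThreadPool(THREAD_NUMBER);
for (int i = 0; i < THREAD_NUMBER; i++) {
    executorService.submit(worker);
}
executorService.shutdown();

Even with this layout, an embedded HSQLDB writes everything through one storage engine in the same process, so the insert path may simply not scale with more writer threads; measuring with 2 threads before going higher would show whether the bottleneck is the queue, the JDBC layer, or the storage itself.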

Spark join/groupby datasets take a lot of time

I have 2 datasets (tables) with 35 million+ rows.
I try to join (or group by) these datasets by some id (in general it will be one-to-one).
But this operation takes a lot of time: 25+ hours.
Filters alone work fine: ~20 minutes.
Env: emr-5.3.1
Hadoop distribution: Amazon
Applications: Ganglia 3.7.2, Spark 2.1.0, Zeppelin 0.6.2
Instance type: m3.xlarge
Code (groupBy):
Dataset<Row> dataset = ...
...
.groupBy("id")
.agg(functions.min("date"))
.withColumnRenamed("min(date)", "minDate")
Code (join):
...
.join(dataset2, dataset.col("id").equalTo(dataset2.col("id")))
Also I found this message in EMR logs:
HashAggregateExec: spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current version of codegened fast hashmap does not support this aggregate.
There might be a possibility of the data getting skewed. We faced this. Check your joining column; this happens mostly if your joining column has NULLs.
Check the stored data pattern with:
select joining_col, count(joining_col) from <tablename>
group by joining_col
This will give you an idea of whether the data in your joining column is evenly distributed.
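In Spark itself, a rough way to run the same check on the dataset from the question (and to count rows with a NULL join key) might be:

// Row count per join key, biggest keys first: one huge key means skew.
dataset.groupBy("id")
       .count()
       .orderBy(functions.desc("count"))
       .show(20);

// Rows with a NULL id can't match anything in the join; a large count here
// supports the skew/NULL theory above.
long nullKeys = dataset.filter(dataset.col("id").isNull()).count();
System.out.println("rows with NULL id: " + nullKeys);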

Aerospike Abnormal Increase In Memory Consumption when using UDF via Java Client (Query Aggregation)

So let me explain first. I have a sample snippet of Java code that looks like this:
Statement statement = new Statement();
statement.setNamespace("foo");
statement.setSetName("bar");
statement.setAggregateFunction(Thread.currentThread().getContextClassLoader(),
        "udf/resource/path", "udfFilename", "udfFunctionName",
        "args1", "args2", "args3", "args4");

ResultSet rs = aerospikeClient.getClient().queryAggregate(null, statement);
while (rs.next()) {
    // insert logic code here
}
With that snippet of sample code, I was able to use a UDF written in Lua, as shown in the Aerospike documentation. The UDF just searches through multiple bins and returns its findings; it never persists nor transforms any data.
Now the thing is, when the function that invokes the UDF through this code runs, AMC (Aerospike Management Console) shows the spawned aggregation jobs marked as "done(ok)" but never marked as completed, and they stay under the "Running Jobs" table instead of the "Completed Jobs" table (see the pictures below).
Jobs under the Running Jobs table
Jobs under the Completed Jobs table
Under the Bash "top" command, I saw the memory usage of the Aerospike server keep growing as the jobs grew in number, until the Aerospike server eventually failed because it had maxed out the machine's memory.
My questions are:
Is it possible for the jobs to let go of these resources (if they really are what is behind the abnormal memory increase)?
If it isn't the jobs that are responsible, what is?
***EDITED:
Sample Lua Code:
local function map_request(record)
  return map {response = record.response,
              templateId = record.templateId,
              id = record.id,
              requestSent = record.requestSent,
              dateReplied = record.dateReplied}
end

function checkResponse(stream, responseFilter, templateId, validityPeriod, currentDate)
  local function filterResponse(record)
    if responseFilter ~= "FOO" and validityPeriod > 0 then
      return (record.response == responseFilter) and
             (record.templateId == templateId) and
             (record.dateReplied + validityPeriod) > currentDate
    else
      return (record.response == responseFilter) and
             (record.templateId == templateId)
    end
  end
  return stream:filter(filterResponse):map(map_request)
end
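On the first question, one thing that might be worth verifying on the client side is that the ResultSet returned by queryAggregate is always closed; leaving it open keeps the query's client-side resources referenced. A small variation of the Java snippet above (same statement and client as before):

ResultSet rs = aerospikeClient.getClient().queryAggregate(null, statement);
try {
    while (rs.next()) {
        Object row = rs.getObject();   // the map built by map_request() in the Lua UDF
        // insert logic code here
    }
} finally {
    rs.close();   // release the aggregation's resources even if the loop throws
}

Whether this also stops the server-side memory growth is not certain, but it at least rules out abandoned client result sets.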

MySQL Batch Inserts - Crashing with OutOfMemory Exception

I am trying to insert two million rows into a MySQL table with Batch Insert. Following is the code I have.
public void addItems(List<Item> items) {
    try {
        conn = getConnection();
        st = conn.prepareStatement(insertStatement);
        for (Item item : items) {
            int index = 1;
            st.setString(index++, item.getA());
            st.setString(index++, item.getB());
            st.setLong(index++, item.getC());
            st.setInt(index++, item.getD());
            st.setFloat(index++, item.getE());
            st.setInt(index++, item.getF());
            st.setString(index++, item.getG());
            st.setString(index++, item.getH());
            st.addBatch();
        }
        st.executeBatch();
        st.clearBatch();
    } catch (SQLException e) {
        e.printStackTrace();
    }
}
I call this addItems() function multiple times (sequentially), and I pass no more than 100 items per call. What I observe is that each addItems() call returns successfully, and I process more and more data (in fact, all 2 million rows) by calling addItems() sequentially, but finally my program crashes with an OutOfMemoryException, and I find that only 100 of the 2 million rows Java processed were actually inserted into the table. I have also set autoCommit to true.
Other parameters that would be of interest -
MySQL
buffer_pool_size = default value(128 MB)
log_file_size = default value(5 MB)
DB Connection String "jdbc:mysql://host:port/database?useServerPrepStmts=false&rewriteBatchedStatements=true";
I have already allocated 512MB to Java process.
Maximum number of connections: 10
Min connections: 1
Questions -
1. Is the preparedStatement.executeBatch() call an asynchronous operation, or does the MySQL connector buffer these calls before sending them to the database?
2. How do I ensure that 100 rows are committed first, and only then process the next set of rows?
3. Will increasing buffer_pool_size and log_file_size help make inserts faster? I do not have access to the DB host, so I have not tried this yet; I will try it when I have access.
4. How do I solve this issue? I cannot get further because of this.
1. You can always look at the code to figure out stuff like this. Looking at the source code here, lines 1443-1447, it seems the answer is: it depends. For example, on the version, or on whether the batch size is larger than 3 (otherwise it's not worth the effort).
4. What I did in a similar situation was to execute the batch after every X rows (let's say, 100).
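For point 4, a sketch of that chunked pattern applied to addItems() from the question, committing every 100 rows; autoCommit is switched off here (an assumption) so the commit point is explicit:

conn.setAutoCommit(false);
int count = 0;
for (Item item : items) {
    int index = 1;
    st.setString(index++, item.getA());
    // ... set the remaining parameters exactly as in addItems() ...
    st.addBatch();
    if (++count % 100 == 0) {
        st.executeBatch();   // send this chunk to MySQL
        st.clearBatch();     // drop the chunk from client memory
        conn.commit();       // make it durable before building the next chunk
    }
}
st.executeBatch();           // flush the final partial chunk
st.clearBatch();
conn.commit();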
