Fetching Result From Java API from Pentaho Kettle Job

I have a job in Pentaho. The job has many sub-jobs and many transformations. Most of the transformations write to a table. I would like to get stat information like the following:
Table1 Finished processing (I=0, O=0, R=86400, W=86400, U=0, E=0)
Table2 Finished processing (I=0, O=0, R=86400, W=86400, U=0, E=0)
Table3 Finished processing (I=0, O=0, R=86400, W=86400, U=0, E=0)
My code is below. With this code, I'm only getting the result of the last transformation. For example, if I run 40 transformations, my result is just the 40th transformation's result, but I would like to see the results of all 40 transformations.
KettleEnvironment.init();
JobMeta jobMeta = new JobMeta("Job.kjb", null);
Job job = new Job(null, jobMeta);
job.start();
job.waitUntilFinished();
Result result = job.getResult();
System.out.println("dfffdgfdg: "+result.getLogText());

Use the logging system. On each transformation of interest, right-click anywhere on the canvas, select Settings, then Logging, and set up the data you want to collect stats on (for example, for the Output line, select the step that writes the data to the table you want to monitor). I suggest you start with the defaults.
After that, press the SQL button, and Pentaho Data Integration will create a table in a database with the relevant columns. Each time you (or anyone using the same repository) runs the transformation, a row will be added to that table. After that, just SELECT * FROM TRANSFORMATION_LOG.
In the last Pentaho Meetup, I explained why you should do that at transformation level and at job level (although you can automate this if you know how to navigate a repository). You'll also find a pointer to a GitHub repo with a JSP you can copy/paste into your Pentaho BA Server's WEB-INF, so that you get exactly what you are after on the web server.
Do not hesitate to ask for more info or provide feedback.
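Alternatively, if you want to stay in the Java API, one way to get per-transformation stats is to iterate the per-entry results rather than the single job-level Result. A minimal sketch, assuming Job.getJobEntryResults() and JobEntryResult are available in your Kettle version (method names are from memory, so verify them against your PDI release):
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.Result;
import org.pentaho.di.job.Job;
import org.pentaho.di.job.JobEntryResult;
import org.pentaho.di.job.JobMeta;

public class JobStats {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        JobMeta jobMeta = new JobMeta("Job.kjb", null);
        Job job = new Job(null, jobMeta);
        job.start();
        job.waitUntilFinished();

        // One JobEntryResult per executed job entry (sub-job or transformation),
        // instead of only the last Result returned by job.getResult()
        for (JobEntryResult entryResult : job.getJobEntryResults()) {
            Result r = entryResult.getResult();
            System.out.println(entryResult.getJobEntryName()
                    + " Finished processing (I=" + r.getNrLinesInput()
                    + ", O=" + r.getNrLinesOutput()
                    + ", R=" + r.getNrLinesRead()
                    + ", W=" + r.getNrLinesWritten()
                    + ", U=" + r.getNrLinesUpdated()
                    + ", E=" + r.getNrErrors() + ")");
        }
    }
}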

Related

Spring Batch job step setup definition

I'm developing a Spring Batch application, a technology which, by the way, I'm new to.
I have done some tutorials and read some docs in order to prepare myself for this development.
I'm already "comfortable" with some of the most common APIs (ItemReader, ItemProcessor, ItemWriter, Steps, Tasklets, Jobs, Parameters...).
My requirement is simple.
1 - Read some data from CSV file.
2 - Fetch an Entity from database by each line of the CSV file.
3 - Update the state of the Entity.
4 - Export a new CSV file with some generated data from each Entity.
My problem is not how to fetch, how to update, or how to export a CSV file, but more conceptually how to set up my job.
The way I see it, I'd like to end up with a job something like:
1 - ItemReader -> to read the whole CSV file.
2 - ItemProcessor -> to update the entity.
3 - ItemWriter -> to persist the entity.
4 - ItemWriter -> to export the new CSV file based on the entity state.
Does it make sense? Is there a better way? Am I missing some pitfalls?
Updating line by line might not be the best idea. Instead, I suggest the following (a configuration sketch follows after this answer):
Job 1
Read entire file
Write it to the tmp table
Here write an update query that joins the original table and the tmp table on the primary key.
After the query executes, call the second job.
Job 2
Read records from table
Write it to the file
Finally, clear the tmp table for the next job sequence.
This answer is based on my own thought process; there might be better approaches as well.
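For illustration, here is a minimal wiring sketch of the first job (load the CSV into a tmp table, then run one set-based update), in Spring Batch 4 Java config. The CsvLine fields, the tmp_cost table, the file name, and the UPDATE ... FROM syntax (PostgreSQL-style) are assumptions; adapt them to your schema and database:
import javax.sql.DataSource;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;
import org.springframework.jdbc.core.JdbcTemplate;

@Configuration
@EnableBatchProcessing
public class CsvUpdateJobConfig {

    // One parsed CSV row (fields are hypothetical)
    public static class CsvLine {
        private String id;
        private String state;
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public String getState() { return state; }
        public void setState(String state) { this.state = state; }
    }

    @Bean
    public Job csvUpdateJob(JobBuilderFactory jobs, StepBuilderFactory steps, DataSource ds) {
        // Step 1: read the entire CSV file and write it into the tmp table
        Step loadTmpTable = steps.get("loadTmpTable")
                .<CsvLine, CsvLine>chunk(500)
                .reader(new FlatFileItemReaderBuilder<CsvLine>()
                        .name("csvReader")
                        .resource(new FileSystemResource("input.csv"))
                        .delimited()
                        .names("id", "state")
                        .targetType(CsvLine.class)
                        .build())
                .writer(new JdbcBatchItemWriterBuilder<CsvLine>()
                        .dataSource(ds)
                        .sql("INSERT INTO tmp_cost (id, state) VALUES (:id, :state)")
                        .beanMapped()
                        .build())
                .build();

        // Step 2: one set-based UPDATE joining the original table and the tmp table
        Step updateFromTmp = steps.get("updateFromTmp")
                .tasklet((contribution, chunkContext) -> {
                    new JdbcTemplate(ds).update(
                            "UPDATE original o SET state = t.state "
                          + "FROM tmp_cost t WHERE t.id = o.id");
                    return RepeatStatus.FINISHED;
                })
                .build();

        return jobs.get("csvUpdateJob")
                .start(loadTmpTable)
                .next(updateFromTmp)
                .build();
    }
}
The second job (read the updated records and export them to a new CSV) would be wired the same way, with a JDBC reader and a FlatFileItemWriter.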

Spark join/groupby datasets take a lot time

I have 2 datasets (tables) with 35M+ rows.
I try to join (or group by) these datasets by some id (in general it will be one-to-one).
But this operation takes a lot of time: 25+ hours.
Filters alone work fine: ~20 minutes.
Env: emr-5.3.1
Hadoop distribution: Amazon
Applications: Ganglia 3.7.2, Spark 2.1.0, Zeppelin 0.6.2
Instance type: m3.xlarge
Code (groupBy):
Dataset<Row> dataset = ...
...
.groupBy("id")
.agg(functions.min("date"))
.withColumnRenamed("min(date)", "minDate")
Code (join):
...
.join(dataset2, dataset.col("id").equalTo(dataset2.col("id")))
Also, I found this message in the EMR logs:
HashAggregateExec: spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current version of codegened fast hashmap does not support this aggregate.
There might be a possibility of the data getting skewed. We faced this. Check your joining column; this happens mostly if the joining column has NULLs.
Check the data distribution with:
SELECT joining_col, COUNT(joining_col) FROM <tablename>
GROUP BY joining_col
This will give you an idea of whether the data in your joining column is evenly distributed.
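To check for that (and for NULL keys) directly in the Spark Java API, a short illustrative sketch; the column name "id" comes from the question, everything else is an assumption:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

public class SkewCheck {
    // Shows the heaviest join keys, then joins after dropping NULL keys on both sides
    public static Dataset<Row> joinWithoutNullKeys(Dataset<Row> dataset, Dataset<Row> dataset2) {
        // A null key or a handful of very large counts in this output indicates skew
        dataset.groupBy("id")
               .count()
               .orderBy(functions.col("count").desc())
               .show(20, false);

        Dataset<Row> left = dataset.filter(functions.col("id").isNotNull());
        Dataset<Row> right = dataset2.filter(functions.col("id").isNotNull());
        return left.join(right, left.col("id").equalTo(right.col("id")));
    }
}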

Neo4j Bulk Data - Create Relationship [OutOfMemory Exception]

I am using a Neo4j procedure to create relationships on bulk data.
Initially I inserted all that data using LOAD CSV:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///XXXX.csv" AS row
....
The data size is quite large [10M], but it executed successfully.
My problem is that I want to create many-to-many relationships between all these nodes,
but I get an [OutOfMemoryException] while executing this query:
MATCH (n1:x {REMARKS: "LATEST"}) MATCH (n2:x {REMARKS: "LATEST"}) WHERE n1.DIST_ID = n2.ENROLLER_ID CREATE (n1)-[:ENROLLER]->(n2);
I have already created indexes and constraints as well.
Any ideas? Please help.
The problem is that your query is performed in a single transaction, which leads to the [OutOfMemoryException]. And that is a problem, because at the moment periodic commits are only available for LOAD CSV. So you can, for example, re-read the CSV after the first load:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///XXXX.csv" AS row
MATCH (n1:x{REMARKS :"LATEST", DIST_ID: row.DIST_ID})
WITH n1
MATCH(n2:x{REMARKS :"LATEST"}) WHERE n1.DIST_ID=n2.ENROLLER_ID
CREATE (n1)-[:ENROLLER]->(n2) ;
Or try the trick with periodic committing from the APOC library:
call apoc.periodic.commit("
MATCH (n2:x {REMARKS:'Latest'}) WHERE exists(n2.ENROLLER_ID)
WITH n2 LIMIT {perCommit}
OPTIONAL MATCH (n1:x {REMARKS:'Latest'}) WHERE n1.DIST_ID = n2.ENROLLER_ID
WITH n2, collect(n1) as n1s
FOREACH(n1 in n1s|
CREATE (n1)-[:ENROLLER]->(n2)
)
REMOVE n2.ENROLLER_ID
RETURN count(n2)",
{perCommit: 1000}
)
P.S. The ENROLLER_ID property is used as a flag to select nodes for processing. Of course, you can use another flag that is set during processing.
Or, more precisely, with apoc.periodic.iterate:
CALL apoc.periodic.iterate("
MATCH (n1:x {REMARKS:'Latest'})
MATCH (n2:x {REMARKS:'Latest'}) WHERE n1.DIST_ID = n2.ENROLLER_ID
RETURN n1,n2
","
WITH {n1} as n1, {n2} as n2
MERGE (n1)-[:ENROLLER]->(n2)
", {batchSize:10000, parallel:true}
)

spring batch: how to write valid data to one table, and invalid data another table

I have a CSV file like:
day,cost
20140101, 20
2014-01-5, 20
20140101, ab
So there is some invalid data, and I want to load the valid data into table_normal and the invalid data into table_unnormal.
The final data should be:
For table_normal:
day,cost
20140101, 20
For table_unnormal:
day,cost,reason
2014-01-5, 20, 'invalid day'
20140101, ab, 'invalid cost'
I know how to get the reason in the processor, but how can the job write to different tables?
I can suggest 3 ways to do this, none of which is very direct and easy.
a) Write your own custom JDBC ItemWriter - you can filter any way you want, and you should be able to write some records to table_normal and some to table_unnormal.
b) Use a CompositeItemWriter - both delegate writers will get the "full record list" from the processor. You can then filter out the records needed in each writer. Very similar to (a).
c) If you can do 2 passes over the input, you can write your job in two steps:
Step 1: Read records --> process only bad records --> write to table_unnormal
Step 2: Read records --> process only good records --> write to table_normal
There isn't a good built-in feature to handle this scenario in Spring Batch directly (at least none I am aware of).
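A variation on option (b) that avoids filtering inside each delegate is ClassifierCompositeItemWriter, which routes each item to exactly one delegate writer. A minimal sketch, where the CostRecord type, its reason field, and the two table-specific writers are hypothetical:
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.support.ClassifierCompositeItemWriter;

public class RoutingWriter {

    public static class CostRecord {
        public String day;
        public String cost;
        public String reason;   // set by the processor for invalid rows, null otherwise
    }

    // normalTableWriter inserts into table_normal, unnormalTableWriter into table_unnormal
    public static ItemWriter<CostRecord> routingWriter(ItemWriter<CostRecord> normalTableWriter,
                                                       ItemWriter<CostRecord> unnormalTableWriter) {
        ClassifierCompositeItemWriter<CostRecord> writer = new ClassifierCompositeItemWriter<>();
        // Route each record to the writer for its target table
        writer.setClassifier(record ->
                record.reason == null ? normalTableWriter : unnormalTableWriter);
        return writer;
    }
}
Note that ClassifierCompositeItemWriter does not open or close its delegates, so stream-based delegates (for example, file writers) still need to be registered on the step separately.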

Produce HFiles for multiple tables to bulk load in a single MapReduce

I am using MapReduce and HFileOutputFormat to produce HFiles and bulk load them directly into the HBase table.
Now, while reading the input files, I want to produce HFiles for two tables and bulk load the outputs in a single MapReduce job.
I searched the web and saw some links about MultiHFileOutputFormat, but couldn't find a real solution for that.
Do you think that it is possible?
My way is:
Use HFileOutputFormat as usual; when the job is completed, doBulkLoad to write into table1.
Keep a List of puts in the mapper, and a global MAX_PUTS value.
When puts.size() > MAX_PUTS, do:
// write the buffered puts directly to the second table, then reset the buffer
String tableName = conf.get("hbase.table.name.dic", table2);
HTable table = new HTable(conf, tableName);
table.setAutoFlushTo(false);
table.setWriteBufferSize(1024 * 1024 * 64);
table.put(puts);
table.close();
puts.clear();
Note: you must have a cleanup() method to write the remaining puts (see the sketch below).
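For context, a hypothetical mapper skeleton showing where the buffered puts get flushed; the input key/value types, the property name, and the threshold are placeholders based on the snippet above:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// HFile output for table1 goes through context.write(...) as usual; puts destined
// for table2 are buffered here and written directly in batches.
public abstract class TwoTableMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    private static final int MAX_PUTS = 10000;          // illustrative threshold
    private final List<Put> puts = new ArrayList<>();

    // Call this from map() for every row that belongs to table2
    protected void bufferTable2Put(Configuration conf, Put put) throws IOException {
        puts.add(put);
        if (puts.size() > MAX_PUTS) {
            flushTable2Puts(conf);
        }
    }

    private void flushTable2Puts(Configuration conf) throws IOException {
        String tableName = conf.get("hbase.table.name.dic", "table2");
        HTable table = new HTable(conf, tableName);
        table.setAutoFlushTo(false);
        table.setWriteBufferSize(1024 * 1024 * 64);
        table.put(puts);
        table.close();
        puts.clear();
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        if (!puts.isEmpty()) {
            flushTable2Puts(context.getConfiguration());  // write the leftover puts
        }
    }
}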
