Camel import data between two servers - Java

I am integrating two systems and have to copy data from a client table on one server into a table on another server, without any business logic or data modification, once per week. Each run has to insert all of the data. So I wrote a Camel configuration for it (attached as screenshots). It works for a small amount of data, but when the client table has more than 20,000 rows I get java.lang.OutOfMemoryError: GC overhead limit exceeded. I tried increasing the Java memory settings, e.g. "set JAVA_OPTS=-Dfile.encoding=UTF-8 -Xms2048m -Xmx16384m -XX:PermSize=1012m -XX:MaxPermSize=2048m -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit", but it does not help.

I am working on the same kind of tool.
To query and then insert large tables I use the JDBC component's StreamList output type.
In this example I work with two data sources: dataSource1 for the query and dataSource2 for the insert.
The query against dataSource1 is processed as a stream, and the body is then split row by row, so not all the data is held in memory at once. Each row is written to a file, and a prepared insert statement is built and executed against dataSource2.
private void createRoute(String tableName) {
this.from("timer://timer1?repeatCount=1") //
.setBody(this.simple("delete from " + tableName)) //
.to("jdbc:dataSource2") //
.setBody(this.simple("select * from " + tableName)) //
.to("jdbc:dataSource1?outputType=StreamList") //
// .to("log:stream") //
.split(this.body())//
.streaming() //
// .to("log:row") //
.multicast().to("direct:file", "direct:insert") //
.end().log("End");
this.from("direct:file")
// .marshal(jsonFormat) //
.marshal(new CsvDataFormat().setDelimiter(';'))//
.to("stream:file?fileName=output/" + tableName + ".txt").log("Data: ${body}");
this.from("direct:insert").process(new Processor() {
@Override
public void process(Exchange exchange) throws Exception {
StringBuilder insert = new StringBuilder("insert into ").append(tableName).append(" (");
StringBuilder values = new StringBuilder("values(");
LinkedHashMap<?, ?> body = (LinkedHashMap<?, ?>) exchange.getIn().getBody();
Iterator<String> i = (Iterator<String>) body.keySet().iterator();
while (i.hasNext()) {
String key = i.next();
insert.append(key);
values.append(":?" + key);
exchange.getOut().getHeaders().put(key, body.get(key));
if (i.hasNext()) {
insert.append(",");
values.append(",");
} else {
insert.append(") ");
values.append(") ");
}
}
String sql = insert.append(values).toString();
exchange.getOut().setBody(sql);
}
}) //
.log("SQL: ${body}")//
.to("jdbc:dataSource2?useHeadersAsParameters=true");

Related

How to properly iterate Bigquery TableResult in Java

I am trying to iterate over the rows of a TableResult using getValues(), as below.
If I use getValues(), it retrieves only the first page of rows. I want to iterate over all the rows using getValues() and NOT using iterateAll().
In the code below the problem is that it runs forever: while (results.hasNextPage()) never ends. What is the problem in the code below?
{
query = "select from aa.bb.cc";
QueryJobConfiguration queryConfig =
QueryJobConfiguration.newBuilder(query)
.setPriority(QueryJobConfiguration.Priority.BATCH)
.build();
TableResult results = bigquery.query(queryConfig);
int i = 0;
int j=0;
while(results.hasNextPage()) {
j++;
System.out.println("page " + j);
System.out.println("Data Extracted::" + i + " records");
for (FieldValueList row : results.getNextPage().getValues()) {
i++;
}
}
System.out.println("Total Count::" + results.getTotalRows());
System.out.println("Data Extracted::" + i + " records");
}
I have only 200,000 records in the source table. Below is the output; I had to stop the process manually.
page 1
Data Extracted::0 records
page 2
Data Extracted::85242 records
page 3
Data Extracted::170484 records
page 4
Data Extracted::255726 records
page 5
Data Extracted::340968 records
page 6
Data Extracted::426210 records
page 7
Data Extracted::511452 records
page 8
Data Extracted::596694 records
.......
.......
.......
.......
In short, you need to update the TableResult variable with the value returned by getNextPage(). If you don't update it, you will keep looping over the same results over and over, which is why you are getting far more records than the table contains in your output.
If you check the following samples, BigQuery Pagination and Using the Java Client Library, there are ways to deal with paginated results, although they are not specific to single-run queries.
As shown in the code below, which is partially based on the pagination sample, you need to use the output of getNextPage() to update the results variable and proceed with the next iteration of the while loop, until all pages but the last have been iterated.
QueryRun.java
package com.projects;
// [START bigquery_query]
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobId;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.BigQuery.QueryResultsOption;
import java.util.UUID;
public class QueryRun {
public static void main(String[] args) {
String projectId = "bigquery-public-data";
String datasetName = "covid19_ecdc_eu";
String tableName = "covid_19_geographic_distribution_worldwide";
String query =
"SELECT * "
+ " FROM `"
+ projectId
+ "."
+ datasetName
+ "."
+ tableName
+ "`"
+ " LIMIT 100";
System.out.println(query);
query(query);
}
public static void query(String query) {
try {
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
QueryJobConfiguration queryConfig = QueryJobConfiguration.newBuilder(query).build();
// Create a job ID so that we can safely retry.
JobId jobId = JobId.of(UUID.randomUUID().toString());
Job queryJob = bigquery.create(JobInfo.newBuilder(queryConfig).setJobId(jobId).build());
TableResult results = queryJob.getQueryResults(QueryResultsOption.pageSize(10));
int i = 0;
int j =0;
// get all paged data except last line
while(results.hasNextPage()) {
j++;
for (FieldValueList row : results.getValues()) {
i++;
}
results = results.getNextPage();
print_msg(i,j);
}
// last line run
j++;
for (FieldValueList row : results.getValues()) {
i++;
}
print_msg(i,j);
System.out.println("Query performed successfully.");
} catch (BigQueryException | InterruptedException e) {
System.out.println("Query not performed \n" + e.toString());
}
}
public static void print_msg(int i,int j)
{
System.out.println("page " + j);
System.out.println("Data Extracted::" + i + " records");
}
}
// [END bigquery_query]
output:
SELECT * FROM `bigquery-public-data.covid19_ecdc_eu.covid_19_geographic_distribution_worldwide` LIMIT 100
page 1
Data Extracted::10 records
page 2
Data Extracted::20 records
page 3
Data Extracted::30 records
page 4
Data Extracted::40 records
page 5
Data Extracted::50 records
page 6
Data Extracted::60 records
page 7
Data Extracted::70 records
page 8
Data Extracted::80 records
page 9
Data Extracted::90 records
page 10
Data Extracted::100 records
Query performed successfully.
As a final note, there is no official sample about pagination for queries, so I'm not totally sure of the recommended way to handle pagination with Java; it is not quite clear on the BigQuery for Java documentation page. If you can update your question with your approach to pagination I would appreciate it.
If you have issues running the attached sample, please see the Using the BigQuery Java client sample, its GitHub page and the pom.xml file inside of it, and check whether you are in compliance with it.
I am probably late with this response, but reading the Java client guide (https://cloud.google.com/bigquery/docs/quickstarts/quickstart-client-libraries#complete_source_code) it says:
Iterate over the QueryResponse to get all the rows in the results. The iterator automatically handles pagination. Each FieldList exposes the columns by numeric index or column name.
That said, it should be easier to simply use the iterateAll() method.
Let me know if I am wrong.
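For example, a minimal sketch of that approach (untested; it assumes a BigQuery client instance named bigquery, as in the question) could look like this:
// iterateAll() transparently fetches every page, so no manual paging is needed.
QueryJobConfiguration queryConfig =
        QueryJobConfiguration.newBuilder("select * from aa.bb.cc").build();
TableResult results = bigquery.query(queryConfig);

long i = 0;
for (FieldValueList row : results.iterateAll()) {
    i++; // process each row here
}
System.out.println("Data Extracted::" + i + " records");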

JDBC callable statement executes locally, but "The value is not set for the parameter number 2" when run in Jenkins

I'm using jdbc.SQLServerDriver in a Java Maven project to insert test records into a test reporting database. When I run a test locally either through Intellij or by running 'mvn clean compile', 'mvn test' in Powershell, the records are successfully inserted into the database. However, when I run through Jenkins (declarative pipeline), I get the following error message: com.microsoft.sqlserver.jdbc.SQLServerException: The value is not set for the parameter number 2.
I have looked at several resources around CallableStatement and several StackOverflow posts about the error message, but I do not understand why the parameter would be set when running locally, but not in Jenkins.
Here is my Jenkinsfile:
pipeline {
agent any
stages {
stage('Compile') {
steps {
withMaven(maven: 'LocalMaven'){
bat 'mvn clean compile'
}
}
}
stage('Test') {
steps {
withMaven(maven: 'LocalMaven'){
bat 'mvn test'
}
}
}
}
}
Here is my code to execute the stored proc:
public static void ExecuteStoredProc(String procedureName, Hashtable parameters, Connection connection)
{
try {
String paraAppender;
StringBuilder builder = new StringBuilder();
// Build the parameters list to be passed to the stored proc
for (int i = 0; i < parameters.size(); i++) {
builder.append("?,");
}
paraAppender = builder.toString();
paraAppender = paraAppender.substring(0,
paraAppender.length() - 1);
CallableStatement stmt = connection.prepareCall("{Call "
+ procedureName + "(" + paraAppender + ")}");
// Creates Enumeration for getting the keys for the parameters
Enumeration params = parameters.keys();
// Iterate in all the Elements till there is no keys
while (params.hasMoreElements()) {
// Get the Key from the parameters
String paramsName = (String) params.nextElement();
// Set parameter name and value
stmt.setString(paramsName, parameters.get(paramsName)
.toString());
}
// Execute Query
stmt.execute();
} catch (Exception e) {
System.out.println(procedureName);
System.out.println(parameters.keySet());
System.out.println(e.getMessage());
}
}
}
Here are the values I am passing in the Hashtable:
public static void CreateRun(Connection connection)
{
//Params
Hashtable table = new Hashtable();
table.put("Machine", "Machine");
table.put("ClientOS", "CLientOS");
table.put("Environment", "Environment");
table.put("Browser", "Browser");
table.put("BrowserVersion", "BrowserVersion");
table.put("RunBuild", "RunBuild");
table.put("DevMachine", "1");
table.put("ExpectedCases", "1");
DatabaseUtil.ExecuteStoredProc("sp_CreateRun",table, connection );
}
And here is the stored proc:
... PROC [dbo].[sp_CreateRun]
@Machine varchar(45),
@ClientOS varchar(45),
@Environment varchar(45),
@Browser varchar(45),
@BrowserVersion varchar(45),
@RunBuild varchar(45),
@DevMachine bit,
@ExpectedCases int
AS
BEGIN
INSERT into Run (Start_Time, End_Time, Machine, Client_OS, Environment, Browser, Browser_Version, Run_Build, Dev_Machine, Expected_Cases)
values (GETDATE(),GetDate(),@Machine,@ClientOS,@Environment,@Browser, @BrowserVersion,@RunBuild,@DevMachine,@ExpectedCases)
END
Thanks in advance for taking a look.
Thanks, all, for the helpful comments. I never did figure out how to make this work with CallableStatement, but I did get it working both locally and on Jenkins using Spring SimpleJdbcCall. I really like how much cleaner the code is now.
public static void ExecuteStoredProc(String procedureName, Map<String, String> parameters)
{
JdbcTemplate template = new JdbcTemplate(SpringJDBCConfig.sqlDataSource());
MapSqlParameterSource sqlParameterSource = new MapSqlParameterSource();
for (String key : parameters.keySet()) {
sqlParameterSource.addValue(key, parameters.get(key));
}
SimpleJdbcCall call = new SimpleJdbcCall(template)
.withCatalogName("matrix")
.withSchemaName("dbo")
.withProcedureName(procedureName);
call.execute(sqlParameterSource);
}
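For completeness, here is a hypothetical sketch of how the original CreateRun helper could call this Map-based version (same test values as before; DatabaseUtil is the class name from the original code):
public static void CreateRun()
{
    // Build the named parameters expected by sp_CreateRun
    Map<String, String> params = new HashMap<>();
    params.put("Machine", "Machine");
    params.put("ClientOS", "ClientOS");
    params.put("Environment", "Environment");
    params.put("Browser", "Browser");
    params.put("BrowserVersion", "BrowserVersion");
    params.put("RunBuild", "RunBuild");
    params.put("DevMachine", "1");
    params.put("ExpectedCases", "1");
    DatabaseUtil.ExecuteStoredProc("sp_CreateRun", params);
}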

VSAM file locking when writing to it using Java JDBC

This is my first time trying to read and write to a VSAM file. What I did was:
Created a map for the file using VSE Navigator
Added the Java beans VSE Connector library to my Eclipse Java project
Used the code shown below to write to and read from the KSDS file
Reading the file is not a problem, but writing to it only works if I first go on the mainframe and close the file before running my Java program, and it then locks the file for about an hour; you cannot open the file on the mainframe or do anything with it.
Can anybody help with this problem? Is there a special setting that I need to set up for the file on the mainframe? Why do I first need to close the file in CICS to be able to write to it? And why does it lock the file after writing to it?
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.sql.*;
public class testVSAM {
public static void main(String argv[]){
Integer test = Integer.valueOf(2893);
String vsamCatalog = "VSESP.USER.CATALOG";
String FlightCluster = "FLIGHT.ORDERING.FLIGHTS";
String FlightMapName = "FLIGHT.TEST2.MAP";
try{
String ipAddr = "10.1.1.1";
String userID = "USER1";
String password = "PASSWORD";
java.sql.Connection jdbcCon;
java.sql.Driver jdbcDriver = (java.sql.Driver) Class.forName(
"com.ibm.vse.jdbc.VsamJdbcDriver").newInstance();
// Build the URL to use to connect
String url = "jdbc:vsam:"+ipAddr;
// Assign properties for the driver
java.util.Properties prop = new java.util.Properties();
prop.put("port", test);
prop.put("user", userID);
prop.put("password", password);
// Connect to the driver
jdbcCon = DriverManager.getConnection(url,prop);
try {
java.sql.PreparedStatement pstmt = jdbcCon.prepareStatement(
"INSERT INTO "+vsamCatalog+"\\"+FlightCluster+"\\"+FlightMapName+
" (RS_SERIAL1,RS_SERIAL2,RS_QTY1,RS_QTY2,RS_UPDATE,RS_UPTIME,RS_EMPNO,RS_PRINTFLAG,"+
"RS_PART_S,RS_PART_IN_A_P,RS_FILLER)"+" VALUES(?,?,?,?,?,?,?,?,?,?,?)");
//pstmt.setString(1, "12345678901234567890123003");
pstmt.setString(1, "1234567890");
pstmt.setString(2,"1234567890123");
pstmt.setInt(3,00);
pstmt.setInt(4,003);
pstmt.setString(5,"151209");
pstmt.setString(6, "094435");
pstmt.setString(7,"09932");
pstmt.setString(8,"P");
pstmt.setString(9,"Y");
pstmt.setString(10,"Y");
pstmt.setString(11," ");
// Execute the query
int num = pstmt.executeUpdate();
System.out.println(num);
pstmt.close();
}
catch (SQLException t)
{
System.out.println(t.toString());
}
try
{
// Get a statement
java.sql.Statement stmt = jdbcCon.createStatement();
// Execute the query ...
java.sql.ResultSet rs = stmt.executeQuery(
"SELECT * FROM "+vsamCatalog+"\\"+FlightCluster+"\\"+FlightMapName);
while (rs.next())
{
System.out.println(rs.getString("RS_SERIAL1") + " " + rs.getString("RS_SERIAL2")+ " " + rs.getString("RS_UPTIME")+ " " + rs.getString("RS_UPDATE"));
}
rs.close();
stmt.close();
}
catch (SQLException t)
{
}
}
catch (Exception e)
{
// do something appropriate with the exception, *at least*:
e.printStackTrace();
}
}
}
Note: the OS is z/VSE
The short answer to your original question is that KSDS VSAM is not a DBMS.
As you have discovered, you can define the VSAM file such that you can update it both from batch and from CICS, but as #BillWoodger points out, you must serialize your updates yourself.
Another approach would be to do all updates from the CICS region, and have your Java application send a REST or SOAP or MQ message to CICS to request its updates. This does require there be a CICS program to catch the requests from the Java application and perform the updates.
The IBM mainframe under z/VSE has different partitions that run different jobs, for example partition F7 for CICS and partition F8 for batch jobs.
When you define a new VSAM file you have to set the SHAREOPTIONS of the file. I had defined the file with SHAREOPTIONS (2 3); the 2 means that only one partition can write to the file.
So when the batch program (in a different partition from the CICS partition) that was called from Java tried to write to the file, it could not do so unless I closed the file in CICS first.
To fix it I redefined the file with SHAREOPTIONS (4 3); the 4 means that multiple partitions of the mainframe can write to it, which fixed the problem.
Below is a part of the definition code where you set the SHAREOPTION:
* $$ JOB JNM=DEFFI,CLASS=9,DISP=D,PRI=9
* $$ LST CLASS=X,DISP=H,PRI=2,REMOTE=0,USER=JAVI
// JOB DEFFI
// EXEC IDCAMS,SIZE=AUTO
DEFINE CLUSTER -
( -
NAME (FLIGHT.ORDERING.FLIGHTS) -
RECORDS (2000 1000) -
INDEXED -
KEYS (26 0) -
RECORDSIZE (128 128) -
SHAREOPTIONS (4 3) -
VOLUMES (SYSWKE) -
) -
.
.
.

Pelops Java Client to insert into Cassandra Database

I recently started working with the Cassandra database. I was able to set up a single-node cluster on my local machine.
Now I want to write some sample data to the Cassandra database using the Pelops client.
Below is the keyspace and column family I have created so far-
create keyspace my_keyspace with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:1};
use my_keyspace;
create column family users with column_type = 'Standard' and comparator = 'UTF8Type';
Below is the code that I have so far. I have made some progress: earlier I was getting a different exception, which I was able to fix. Now I am getting another exception.
public class MyPelops {
private static final Logger log = Logger.getLogger(MyPelops.class);
public static void main(String[] args) throws Exception {
// A comma separated List of Nodes
String NODES = "localhost";
// Thrift Connection Pool
String THRIFT_CONNECTION_POOL = "Test Cluster";
// Keyspace
String KEYSPACE = "my_keyspace";
// Column Family
String COLUMN_FAMILY = "users";
Cluster cluster = new Cluster(NODES, 9160);
Pelops.addPool(THRIFT_CONNECTION_POOL, cluster, KEYSPACE);
Mutator mutator = Pelops.createMutator(THRIFT_CONNECTION_POOL);
log.info("- Write Column -");
mutator.writeColumn(
COLUMN_FAMILY,
"Row1",
new Column().setName(" Name ".getBytes()).setValue(
" Test One ".getBytes()));
mutator.writeColumn(
COLUMN_FAMILY,
"Row1",
new Column().setName(" Work ".getBytes()).setValue(
" Engineer ".getBytes()));
log.info("- Execute -");
mutator.execute(ConsistencyLevel.ONE);
Selector selector = Pelops.createSelector(THRIFT_CONNECTION_POOL);
int columnCount = selector.getColumnCount(COLUMN_FAMILY, "Row1",
ConsistencyLevel.ONE);
log.info("- Column Count = " + columnCount);
List<Column> columnList = selector
.getColumnsFromRow(COLUMN_FAMILY, "Row1",
Selector.newColumnsPredicateAll(true, 10),
ConsistencyLevel.ONE);
log.info("- Size of Column List = " + columnList.size());
for (Column column : columnList) {
log.info("- Column: (" + new String(column.getName()) + ","
+ new String(column.getValue()) + ")");
}
log.info("- All Done. Exit -");
System.exit(0);
}
}
Whenever I run this program, I get this exception:
Exception in thread "main" org.scale7.cassandra.pelops.exceptions.InvalidRequestException: Column timestamp is required
The exception is thrown as soon as this line executes: mutator.execute(ConsistencyLevel.ONE);
As I mentioned above, I am new to the Cassandra database and to the Pelops client; this is my first time working with them. Can anyone help me with this problem, step by step? I am running Cassandra 1.2.3 on my local box.
Any step-by-step guidance on how to insert data into a Cassandra database would help me a lot in understanding how Cassandra works.
Thanks in advance.
Each Cassandra column is a key-value-timestamp triplet. You didn't set the timestamp on your columns:
Column c = new Column();
c.setTimestamp(System.currentTimeMillis());
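Applied to the code in the question, that could look like this (the Thrift setters can be chained, so this is just the original writeColumn call plus an explicit timestamp):
mutator.writeColumn(
        COLUMN_FAMILY,
        "Row1",
        new Column().setName(" Name ".getBytes())
                    .setValue(" Test One ".getBytes())
                    .setTimestamp(System.currentTimeMillis()));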
You can also let the client create the column for you, which makes the job easier:
mutator.writeColumn(
COLUMN_FAMILY,
"Row1",
mutator.newColumn(" Name ", " Test One "));
In this way you can avoid both setting the timestamp (the client will do it for you) and using getBytes() on String.
Regards, Carlo

What is the fastest way to bulk load data into HBase programmatically?

I have a Plain text file with possibly millions of lines which needs custom parsing and I want to load it into an HBase table as fast as possible (using Hadoop or HBase Java client).
My current solution is based on a MapReduce job without the Reduce part. I use FileInputFormat to read the text file so that each line is passed to the map method of my Mapper class. At this point the line is parsed to form a Put object which is written to the context. Then, TableOutputFormat takes the Put object and inserts it to table.
This solution yields an average insertion rate of 1,000 rows per second, which is less than what I expected. My HBase setup is in pseudo distributed mode on a single server.
One interesting thing is that during insertion of 1,000,000 rows, 25 Mappers (tasks) are spawned but they run serially (one after another); is this normal?
Here is the code for my current solution:
public static class CustomMap extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
protected void map(LongWritable key, Text value, Context context) throws IOException {
Map<String, String> parsedLine = parseLine(value.toString());
Put row = new Put(Bytes.toBytes(parsedLine.get(keys[1])));
for (String currentKey : parsedLine.keySet()) {
row.add(Bytes.toBytes(currentKey),Bytes.toBytes(currentKey),Bytes.toBytes(parsedLine.get(currentKey)));
}
try {
context.write(new ImmutableBytesWritable(Bytes.toBytes(parsedLine.get(keys[1]))), row);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
public int run(String[] args) throws Exception {
if (args.length != 2) {
return -1;
}
conf.set("hbase.mapred.outputtable", args[1]);
// I got these conf parameters from a presentation about Bulk Load
conf.set("hbase.hstore.blockingStoreFiles", "25");
conf.set("hbase.hregion.memstore.block.multiplier", "8");
conf.set("hbase.regionserver.handler.count", "30");
conf.set("hbase.regions.percheckin", "30");
conf.set("hbase.regionserver.globalMemcache.upperLimit", "0.3");
conf.set("hbase.regionserver.globalMemcache.lowerLimit", "0.15");
Job job = new Job(conf);
job.setJarByClass(BulkLoadMapReduce.class);
job.setJobName(NAME);
TextInputFormat.setInputPaths(job, new Path(args[0]));
job.setInputFormatClass(TextInputFormat.class);
job.setMapperClass(CustomMap.class);
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(Put.class);
job.setNumReduceTasks(0);
job.setOutputFormatClass(TableOutputFormat.class);
job.waitForCompletion(true);
return 0;
}
public static void main(String[] args) throws Exception {
Long startTime = Calendar.getInstance().getTimeInMillis();
System.out.println("Start time : " + startTime);
int errCode = ToolRunner.run(HBaseConfiguration.create(), new BulkLoadMapReduce(), args);
Long endTime = Calendar.getInstance().getTimeInMillis();
System.out.println("End time : " + endTime);
System.out.println("Duration milliseconds: " + (endTime-startTime));
System.exit(errCode);
}
I've gone through a process that is probably very similar to yours of attempting to find an efficient way to load data from an MR into HBase. What I found to work is using HFileOutputFormat as the OutputFormatClass of the MR.
Below is the basis of my code that I have to generate the job and the Mapper map function which writes out the data. This was fast. We don't use it anymore, so I don't have numbers on hand, but it was around 2.5 million records in under a minute.
Here is the (stripped down) function I wrote to generate the job for my MapReduce process to put data into HBase
private Job createCubeJob(...) {
//Build and Configure Job
Job job = new Job(conf);
job.setJobName(jobName);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);
job.setMapperClass(HiveToHBaseMapper.class);//Custom Mapper
job.setJarByClass(CubeBuilderDriver.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(HFileOutputFormat.class);
TextInputFormat.setInputPaths(job, hiveOutputDir);
HFileOutputFormat.setOutputPath(job, cubeOutputPath);
Configuration hConf = HBaseConfiguration.create(conf);
hConf.set("hbase.zookeeper.quorum", hbaseZookeeperQuorum);
hConf.set("hbase.zookeeper.property.clientPort", hbaseZookeeperClientPort);
HTable hTable = new HTable(hConf, tableName);
HFileOutputFormat.configureIncrementalLoad(job, hTable);
return job;
}
This is my map function from the HiveToHBaseMapper class (slightly edited).
public void map(WritableComparable key, Writable val, Context context)
throws IOException, InterruptedException {
try{
Configuration config = context.getConfiguration();
String[] strs = val.toString().split(Constants.HIVE_RECORD_COLUMN_SEPARATOR);
String family = config.get(Constants.CUBEBUILDER_CONFIGURATION_FAMILY);
String column = strs[COLUMN_INDEX];
double value = Double.parseDouble(strs[VALUE_INDEX]); // numeric cell value used in the ternary below
String sKey = generateKey(strs, config);
byte[] bKey = Bytes.toBytes(sKey);
Put put = new Put(bKey);
put.add(Bytes.toBytes(family), Bytes.toBytes(column), (value <= 0)
? Bytes.toBytes(Double.MIN_VALUE)
: Bytes.toBytes(value));
ImmutableBytesWritable ibKey = new ImmutableBytesWritable(bKey);
context.write(ibKey, put);
context.getCounter(CubeBuilderContextCounters.CompletedMapExecutions).increment(1);
}
catch(Exception e){
context.getCounter(CubeBuilderContextCounters.FailedMapExecutions).increment(1);
}
}
I'm pretty sure this isn't going to be a copy-and-paste solution for you. Obviously the data I was working with here didn't need any custom processing (that was done in an MR job before this one). The main thing I want to provide out of this is the HFileOutputFormat; the rest is just an example of how I used it. :)
I hope it gets you onto a solid path to a good solution. :)
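One thing to keep in mind that is not shown above: HFileOutputFormat only writes HFiles to the job's output path; they still have to be handed over to HBase afterwards, either with the completebulkload tool or programmatically. A rough sketch of the programmatic variant, assuming the same hConf, tableName and cubeOutputPath as in createCubeJob above and the classic org.apache.hadoop.hbase.mapreduce API:
// After job.waitForCompletion(true) succeeds, move the generated HFiles into the table's regions.
org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles loader =
        new org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles(hConf);
loader.doBulkLoad(cubeOutputPath, new org.apache.hadoop.hbase.client.HTable(hConf, tableName));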
One interesting thing is that during insertion of 1,000,000 rows, 25 Mappers (tasks) are spawned but they run serially (one after another); is this normal?
The mapreduce.tasktracker.map.tasks.maximum parameter, which defaults to 2, determines the maximum number of map tasks that can run in parallel on a node. Unless it is changed, you should see at most 2 map tasks running simultaneously on each node.
