unexpected multiple execution of mapper intended to run once - java

I tried to write a very simple job with only 1 mapper and no reducer to write some data to HBase. In the mapper I simply open a connection to HBase, write a few rows of data to a table and then close the connection. In the job driver I am using JobConf.setNumMapTasks(1); and JobConf.setNumReduceTasks(0); to specify that only 1 mapper and no reducer are to be executed. I am also setting the reducer class to IdentityReducer in the JobConf.
The strange behavior I am observing is that the job successfully writes the data to the HBase table, but after that I see in the logs that it keeps opening a connection to HBase and closing it again. This goes on for 20-30 minutes, after which the job is declared to have completed with 100% success. At the end, when I check the _success file created by the dummy data I put in OutputCollector.collect(...), I see hundreds of rows of dummy data when there should be only 1.
The following is the code for the job driver:
public int run(String[] arg0) throws Exception {
Configuration config = HBaseConfiguration.create(getConf());
ensureRequiredParametersExist(config);
ensureOptionalParametersExist(config);
JobConf jobConf = new JobConf(config, getClass());
jobConf.setJobName(config.get(ETLJobConstants.ETL_JOB_NAME));
//set map specific configuration
jobConf.setNumMapTasks(1);
jobConf.setMaxMapAttempts(1);
jobConf.setInputFormat(TextInputFormat.class);
jobConf.setMapperClass(SingletonMapper.class);
jobConf.setMapOutputKeyClass(LongWritable.class);
jobConf.setMapOutputValueClass(Text.class);
//set reducer specific configuration
jobConf.setReducerClass(IdentityReducer.class);
jobConf.setOutputKeyClass(LongWritable.class);
jobConf.setOutputValueClass(Text.class);
jobConf.setOutputFormat(TextOutputFormat.class);
jobConf.setNumReduceTasks(0);
//set job specific configuration details like input file name etc
FileInputFormat.setInputPaths(jobConf, jobConf.get(ETLJobConstants.ETL_JOB_FILE_INPUT_PATH));
System.out.println("setting output path to : " + jobConf.get(ETLJobConstants.ETL_JOB_FILE_OUTPUT_PATH));
FileOutputFormat.setOutputPath(jobConf,
new Path(jobConf.get(ETLJobConstants.ETL_JOB_FILE_OUTPUT_PATH)));
JobClient.runJob(jobConf);
return 0;
}
The driver class extends Configured and implements Tool (I used the sample from the Definitive Guide).
The following is my Mapper's map method, where I simply open the connection to HBase, do a preliminary check to make sure the table exists, then write the rows and close the table.
public void map(LongWritable arg0, Text arg1,
OutputCollector<LongWritable, Text> arg2, Reporter arg3)
throws IOException {
HTable aTable = null;
HBaseAdmin admin = null;
try {
arg3.setStatus("started");
/*
* set-up hbase config
*/
admin = new HBaseAdmin(conf);
/*
* open connection to table
*/
String tableName = conf.get(ETLJobConstants.ETL_JOB_TABLE_NAME);
HTableDescriptor htd = new HTableDescriptor(toBytes(tableName));
String colFamilyName = conf.get(ETLJobConstants.ETL_JOB_TABLE_COLUMN_FAMILY_NAME);
byte[] tablename = htd.getName();
/* call function to ensure table with 'tablename' exists */
/*
* loop and put the file data into the table
*/
aTable = new HTable(conf, tableName);
DataRow row = /* logic to generate data */
while (row != null) {
byte[] rowKey = toBytes(row.getRowKey());
Put put = new Put(rowKey);
for (DataNode node : row.getRowData()) {
put.add(toBytes(colFamilyName), toBytes(node.getNodeName()),
toBytes(node.getNodeValue()));
}
aTable.put(put);
arg3.setStatus("xoxoxoxoxoxoxoxoxoxoxoxo added another data row to hbase");
row = fileParser.getNextRow();
}
aTable.flushCommits();
arg3.setStatus("xoxoxoxoxoxoxoxoxoxoxoxo Finished adding data to hbase");
} finally {
if (aTable != null) {
aTable.close();
}
if (admin != null) {
admin.close();
}
}
arg2.collect(new LongWritable(10), new Text("something"));
arg3.setStatus("xoxoxoxoxoxoxoxoxoxoxoxoadded some dummy data to the collector");
}
As you can see near the end, I write some dummy data (10, 'something') to the collector, and I see hundreds of rows of this data in the _success file after the job has terminated.
I can't work out why the mapper code is run over and over instead of just once. Any help would be greatly appreciated.

Calling JobConf.setNumMapTasks(1) only gives Hadoop a hint that you would like 1 mapper if possible, unlike setNumReduceTasks, which sets exactly the number you specify. The actual number of map tasks is determined by the input splits.
That's why more mappers are run and you see all these rows.
For more details, please read this post.
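Separately from the number of map tasks, map() is invoked once per input record (with TextInputFormat, once per line of the input file), so any work placed inside map() repeats for every line. The following is a minimal sketch of guarding one-time work with a flag. It deliberately uses no Hadoop classes so that it runs on its own: OnceOnlyMapper and runOver are hypothetical stand-ins for a mapper and for the framework's record loop, not Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OnceOnlyMapperSketch {

    /** Hypothetical stand-in for a mapper; not a Hadoop class. */
    public static class OnceOnlyMapper {
        private boolean done = false;
        public int expensiveRuns = 0;
        public final List<String> output = new ArrayList<>();

        /** Called once per input record, like map() in a real job. */
        public void map(long offset, String line) {
            if (done) {
                return; // the one-time work has already happened
            }
            done = true;
            expensiveRuns++;                 // placeholder for "open table, write rows, close"
            output.add("10\tsomething");     // the dummy record from the question
        }
    }

    /** Simulates the framework feeding every line of a split to the same mapper. */
    public static OnceOnlyMapper runOver(List<String> lines) {
        OnceOnlyMapper mapper = new OnceOnlyMapper();
        long offset = 0;
        for (String line : lines) {
            mapper.map(offset, line);
            offset += line.length() + 1;
        }
        return mapper;
    }

    public static void main(String[] args) {
        OnceOnlyMapper m = runOver(Arrays.asList("line 1", "line 2", "line 3"));
        System.out.println(m.expensiveRuns + " expensive run(s), " + m.output.size() + " output row(s)");
    }
}
```

In a real job using the old org.apache.hadoop.mapred API, an alternative to the flag is to move the one-time work into the mapper's configure(JobConf) or close() method, each of which runs once per map task rather than once per record.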

Related

How can I create a table using Mybatis and SQLite?

I am trying to create a new database and new table using Mybatis and SQLite. I found from previous answers (1, 2, 3) that Mybatis does support using CREATE and ALTER statements, by marking them as "UPDATE" within Mybatis mapper syntax. However, those questions/answers were using Mapper XML whereas I'm using annotations, and also none were using SQLite.
SQLite creates a new database as soon as you open a new connection to it, so it doesn't matter if the DB exists before or not. A new database is created with a size of zero bytes, which is fine (SQLite treats a 0 byte file as an empty database). But after the table creation I would expect the database size to be non-zero as it stores the table structure for that table. After running my code which I think should create the table (I'm checking my syntax against this answer), the database size still reads as 0 bytes, which says to me that the table has not actually been created. What am I doing wrong?
My Java code to test this scenario:
public class Example {
public static void main(String[] args) {
String userHomePath = System.getProperty("user.home");
File exampleDb = new File(userHomePath, "example.sqlite3");
String jdbcConnectionString = "jdbc:sqlite:" + exampleDb.getAbsolutePath();
DataSource dataSource = new PooledDataSource("org.sqlite.JDBC", jdbcConnectionString, null, null);
Environment environment = new Environment("Main", new JdbcTransactionFactory(), dataSource);
Configuration configuration = new Configuration(environment);
configuration.addMapper(GenericMapper.class);
SqlSessionFactoryBuilder builder = new SqlSessionFactoryBuilder();
SqlSessionFactory sessionFactory = builder.build(configuration);
try (SqlSession session = sessionFactory.openSession()) {
GenericMapper genericMapper = session.getMapper(GenericMapper.class);
genericMapper.createExampleTableIfMissing();
}
}
}
My mapper:
public interface GenericMapper {
@Update("CREATE TABLE IF NOT EXISTS extbl (id INTEGER PRIMARY KEY AUTOINCREMENT)")
void createExampleTableIfMissing();
}
Checking the file after this code has run:
C:\Users\me>dir example.sqlite3
Volume in drive C is Windows
Volume Serial Number is D4DE-B46A
Directory of C:\Users\me
12/04/2021 18:14 0 example.sqlite3
1 File(s) 0 bytes
0 Dir(s) 27,326,779,392 bytes free
C:\Users\me>

How to ignore existing elements of tables when testing database with DBUnit

I'm using DBUnit for testing my database.
My database is not empty, so I want to ignore the existing elements and test only the data inserted by my test.
This is an example of how a test runs:
1- The table contains 10 elements
2- DBUnit inserts some data from the dataset (3 elements)
3- My test inserts data (1 element)
4- My expected dataset contains 4 elements: the 3 elements defined in the first dataset and the element just added by the test
5- So when I assert that the actual and expected tables are equal, I get an error, which is normal because my table already contains elements.
The question is:
Is there any way to ignore the elements already in the database in the assert?
I just want to test the data inserted by the dataset and the test.
This is the code :
@Override
protected IDataSet getDataSet() throws Exception {
// convert the XML file into a dataset
URL url = this.getClass().getResource("/dataset-peqt2-init.xml");
File testFile = new File(url.getFile());
return new FlatXmlDataSetBuilder().build(testFile);
}
@Override
protected DatabaseOperation getSetUpOperation() throws Exception
{
return DatabaseOperation.REFRESH;
}
/**
* Reset the state of database
* Called before every test
*/
@Override
protected DatabaseOperation getTearDownOperation() throws Exception
{
return DatabaseOperation.DELETE;
}
/**
* get the actual table from the database
* @param tableName
* @return
* @throws Exception
* @throws SQLException
* @throws DataSetException
*/
private ITable getActualTable(String tableName) throws Exception, SQLException, DataSetException {
// get the actual table values
IDatabaseConnection connection = getConnection();
IDataSet databaseDataSet = connection.createDataSet();
return databaseDataSet.getTable(tableName);
}
/**
* get the expected table from the dataset
* @param tableName
* @param fileName
* @return
* @throws Exception
*/
private ITable getExpectedTable(String tableName, String fileName) throws Exception {
// get the expected table values
URL url = this.getClass().getResource("/"+fileName);
File testFile = new File(url.getFile());
IDataSet expectedDataSet = new FlatXmlDataSetBuilder().build(testFile);
return expectedDataSet.getTable(tableName);
}
@Test
public void test01_insert() throws SQLException, Exception {
File file = new File(SynchroDerbi.class.getResource("/test-insert.lst").getFile());
log.debug("test01_insert() avec ref : "+file.getName());
SynchroDerbi.run(file);
String fileName = "dataset-insert-expected.xml";
actualTable = getActualTable("equipment");
expectedTable = getExpectedTable("equipment",fileName);
Assertion.assertEqualsIgnoreCols(expectedTable, actualTable, new String[]{"id","idSite"});
}
Return DatabaseOperation.CLEAN_INSERT from getSetUpOperation().
That way the setup operation first deletes all records from the tables and then inserts your dataset.
Your tests shouldn't depend on the current state of the system. So instead of asserting equality you should use "contains" checks: get the results from a select and then assert that they contain the rows you just inserted.
If you want stricter checks (which may hurt the maintainability of the tests), you can count the records BEFORE, insert N records, and then check that BEFORE + N = AFTER.
PS: DBUnit is not a very flexible or maintainable tool. Instead you could use your system's own code to save the state. That way, if a column changes, you don't need to change the DBUnit data. To ensure your tests don't step on each other's toes, use data randomization. Using your system's code may further help with isolation if you can start and roll back transactions within the tests.
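The "contains" check and the BEFORE + N = AFTER check described above can be sketched in plain Java. This is only an illustration under stated assumptions: rows are modelled as lists of strings rather than read through DBUnit, and containsAllRows and grewBy are hypothetical helper names, not part of DBUnit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ContainsCheckSketch {

    /** True when every expected row appears somewhere in the actual rows. */
    public static boolean containsAllRows(List<List<String>> actual, List<List<String>> expected) {
        return actual.containsAll(expected);
    }

    /** Stricter variant: the table grew by exactly the number of inserted rows. */
    public static boolean grewBy(int countBefore, int inserted, int countAfter) {
        return countBefore + inserted == countAfter;
    }

    public static void main(String[] args) {
        // Rows that were already in the table before the test started.
        List<List<String>> table = new ArrayList<>();
        table.add(Arrays.asList("old-1", "siteA"));
        table.add(Arrays.asList("old-2", "siteB"));
        int before = table.size();

        // Rows inserted by the dataset and by the test itself.
        List<List<String>> inserted = Arrays.asList(
                Arrays.asList("new-1", "siteC"),
                Arrays.asList("new-2", "siteC"));
        table.addAll(inserted);

        System.out.println("contains inserted rows: " + containsAllRows(table, inserted));
        System.out.println("grew by N: " + grewBy(before, inserted.size(), table.size()));
    }
}
```

Note that containsAll ignores how many times a row occurs, so the count check is what catches accidental duplicate inserts.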

hbase java code returns null for a get but hbase shell get command returns record

I have just started using HBase and am not a proficient Java programmer. I created a debug program to test our current HBase program, which puts and gets records and also serves as a deduping mechanism. The debug program checks whether certain ids are present in the HBase table; they should have been inserted by the other program. When I do a get, most records are there, but some come back as null (not found). When I check manually from the HBase shell and request the same id, it returns the row with a timestamp. Is there something I am not understanding here? Are multiple versions of a record kept in HBase? I assumed HBase made records unique based on the id provided.
// code to get record
public static byte[] getPreHbase(String provid, String commentId) throws IOException {
provid = "98";
commentId = commentId.trim();
String rec = provid + "." + commentId;
byte [] value= "test".getBytes();
try{
Get g = new Get(Bytes.toBytes(rec));
Result r = htableII.get(g);
value = r.getValue(Bytes.toBytes("cmmnttest"),Bytes.toBytes("cmmntposts"));
String valueStr = Bytes.toString(value);
}catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return value;
}
As I mentioned, this only happens for some ids; others are returned correctly. This is the manual call in the shell:
get 'hb_test', '98.1010000000003_1asdfghjkl'
COLUMN CELL
cmmnttest:cmmntposts timestamp=1420659812914,
value= 1010000000003_1asdfghjkl
1 row(s) in 0.0140 seconds

Bulk Insert Data into HBase using MapReduce

I need to insert 400 million rows into a HBase table.
Schema looks something like this
where I am generating key by simply concatenating int and int and value as System.nanoTime()
my mapper looks something like this
public class DatasetMapper extends TableMapper<Text, LongWritable> {
private static Configuration conf = HBaseConfiguration.create();
public void map(Text key, LongWritable values, Context context) throws Exception {
// instantiate HTable object that connects to table name
HTable htable = new HTable(conf, "temp"); // already created temp table
htable.setAutoFlush(false);
htable.setWriteBufferSize(1024 * 1024 * 12);
// construct key
int i = 0, j = 0;
for (i = 0; i < 400000000; i++) {
String rowkey = Integer.toString(i).concat(Integer.toString(j));
Long value = Math.abs(System.nanoTime());
Put put = new Put(Bytes.toBytes(rowkey));
put.add(Bytes.toBytes("location"), Bytes.toBytes("longlat"), Bytes.toBytes(value));
htable.put(put);
j++;
htable.flushCommits();
}
}
}
and my job looks like this
Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"initdb");
job.setJarByClass(DatasetMapper.class); // class that contains mapper
TableMapReduceUtil.initTableMapperJob(
null, // input table
null,
DatabaseMapper.class, // mapper class
null, // mapper output key
null, // mapper output value
job);
TableMapReduceUtil.initTableReducerJob(
"temp", // output table
null, // reducer class
job);
job.setNumReduceTasks(0);
boolean b = job.waitForCompletion(true);
if (!b) {
throw new IOException("error with job!");
}
The job runs but inserts 0 records. I know I am making some mistake but I am not able to catch it as I am new to HBase. Please help me.
thanks
First things first: the name of your mapper is DatasetMapper, but in your job config you have specified DatabaseMapper. I wonder how it runs without any error.
Next, it looks like you have mixed up TableMapper and Mapper usage. HBase's TableMapper is an abstract class that extends Hadoop's Mapper and makes it convenient to read from HBase, while TableReducer helps with writing back to HBase. You are trying to put data from your Mapper and are using TableReducer at the same time. Your mapper will actually never get called.
Either use TableReducer to put the data or use just a Mapper. If you really want to do it in your Mapper, you can use the TableOutputFormat class. See the example given on page 301 of HBase: The Definitive Guide. This is the Google Books link
HTH
P.S. : You might find these links helpful in learning HBase+MR integration properly :
Link 1.
Link 2.

ormlite with persistent h2 db - new tables do not get persisted

When I create a new H2 database via ORMLite, the database file gets created, but after I close my application all the data stored in the database is lost:
JdbcConnectionSource connection =
new JdbcConnectionSource("jdbc:h2:file:" + path.getAbsolutePath() + ".h2.db");
TableUtils.createTable(connection, SomeClass.class);
Dao<SomeClass, Integer> dao = DaoManager.createDao(connection, SomeClass.class);
SomeClass sc = new SomeClass(id, ...);
dao.create(sc);
SomeClass retrieved = dao.queryForId(id);
System.out.println("" + retrieved);
This code produces good results: it prints the object that I stored.
But when I start the application again, this time without creating the table and storing a new object, I get an exception telling me that the required table does not exist:
JdbcConnectionSource connection =
new JdbcConnectionSource("jdbc:h2:file:" + path.getAbsolutePath() + ".h2.db");
Dao<SomeClass, Integer> dao = DaoManager.createDao(connection, SomeClass.class);
SomeClass retrieved = dao.queryForId(id); // will produce an exception..
System.out.println("" + retrieved);
The following worked fine for me when I ran it once and then a second time with createTable turned off. The second insert gave me a primary key violation, of course, but that was expected. It created the file with (as @Thomas mentioned) a ".h2.db.h2.db" suffix.
Some questions:
After you run your application the first time, can you see the path file being created?
Is it on permanent storage and not in some temporary location cleared by the OS?
Any chance some other part of your application is clearing it before the database code begins?
Hope this helps.
@Test
public void testStuff() throws Exception {
File path = new File("/tmp/x");
JdbcConnectionSource connection = new JdbcConnectionSource("jdbc:h2:file:"
+ path.getAbsolutePath() + ".h2.db");
// TableUtils.createTable(connection, SomeClass.class);
Dao<SomeClass, Integer> dao = DaoManager.createDao(connection,
SomeClass.class);
int id = 131233;
SomeClass sc = new SomeClass(id, "fopewjfew");
dao.create(sc);
SomeClass retrieved = dao.queryForId(id);
System.out.println("" + retrieved);
connection.close();
}
I can see Russia from my house:
> ls -l /tmp/
...
-rw-r--r-- 1 graywatson wheel 14336 Aug 31 08:47 x.h2.db.h2.db
Did you close the database? It is closed automatically but it's better to close it manually (so recovery is faster).
In many cases the database URL is the problem. Are you sure the same path is used in both cases? Otherwise you end up with two databases. By the way, ".h2.db" is added automatically, you don't need to add it manually.
To better analyze the problem, you could append ;TRACE_LEVEL_FILE=2 to the database URL, and then check in the *.trace.db file what SQL statements were executed against the database.
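The two URL suggestions above (leave off ".h2.db" because H2 appends it, and append ;TRACE_LEVEL_FILE=2 for tracing) can be combined in a small helper. This is a minimal sketch that only builds the URL string; h2FileUrl is a hypothetical helper name, and the /tmp/x path is just the one used in the test earlier in this thread.

```java
import java.io.File;

public class H2UrlSketch {

    /**
     * Builds a file-based H2 JDBC URL from a base path.
     * The ".h2.db" suffix is deliberately not appended here,
     * since the database adds it on its own.
     */
    public static String h2FileUrl(File basePath, boolean traceToFile) {
        String url = "jdbc:h2:file:" + basePath.getAbsolutePath();
        if (traceToFile) {
            url += ";TRACE_LEVEL_FILE=2"; // setting quoted in the answer above
        }
        return url;
    }

    public static void main(String[] args) {
        File path = new File("/tmp/x");
        System.out.println(h2FileUrl(path, false));
        System.out.println(h2FileUrl(path, true));
    }
}
```

Building the URL in one place also makes it easier to confirm that both runs of the application point at the same database file.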
