Some inserts are not appearing in DB table - java

I am trying to insert records after reading them from files. There are a bunch of files; I read and parse one file at a time and insert all the records from that file in one go. This works fine in most cases, but I have noticed that there is one file, with 879 records, whose records are not getting inserted or are not appearing in the DB table. The same code works fine if I remove or add any one record from that file. The DB is Oracle 12c.
No error is thrown for the file whose records are not inserted.
Can anybody please point out what could cause such behavior?
This is what the code looks like:
ExecutorService executorService = Executors.newFixedThreadPool(threadCount);
List<CompletableFuture<Void>> completableFutures =
pnlFiles.stream().map(pnlFileForTheBook -> CompletableFuture.runAsync(() -> {
String pnlFileName = "";
String cobDate = "";
String filePath = "";
try {
File localPnlFile = nasDataLoader.downloadFile(pnlFileForTheBook, filePath);
if(localPnlFile != null) {
List<PnlFile> pnlFileLines = parse(localPnlFile, PnlFile.class);
List<PnlFileDsl> pnlFileDsls = pnlFileLines.stream().map(line -> {
return new PnlFileDsl(line, cobDate);
}).collect(Collectors.toList());
persistenceManager.persist(pnlFileDsls);
}
}
catch () {}
}, executorService)).collect(Collectors.toList());
List<Void> result = completableFutures.stream().map(CompletableFuture::join)
.collect(Collectors.toList());
The same code works with all other files, and the file whose records are not appearing has no apparent issues.

Can we use cosmosContainer.queryItems() method to execute the delete query on cosmos container

I have a Java method in which I use the following code to fetch data from Azure Cosmos DB:
Iterable<FeedResponse<Object>> feedResponseIterator =
cosmosContainer
.queryItems(sqlQuery, queryOptions, Object.class)
.iterableByPage(continuationToken, pageSize);
The whole method looks like this:
public List<LinkedHashMap> getDocumentsFromCollection(
String containerName, String partitionKey, String sqlQuery) {
List<LinkedHashMap> documents = new ArrayList<>();
String continuationToken = null;
do {
CosmosQueryRequestOptions queryOptions = new CosmosQueryRequestOptions();
CosmosContainer cosmosContainer = createContainerIfNotExists(containerName, partitionKey);
Iterable<FeedResponse<Object>> feedResponseIterator =
cosmosContainer
.queryItems(sqlQuery, queryOptions, Object.class)
.iterableByPage(continuationToken, pageSize);
int pageCount = 0;
for (FeedResponse<Object> page : feedResponseIterator) {
long startTime = System.currentTimeMillis();
// Access all the documents in this result page
page.getResults().forEach(document -> documents.add((LinkedHashMap) document));
// Along with page results, get a continuation token
// which enables the client to "pick up where it left off"
// in accessing query response pages.
continuationToken = page.getContinuationToken();
pageCount++;
log.info(
"Cosmos Collection {} deleted {} page with {} number of records in {} ms time",
containerName,
pageCount,
page.getResults().size(),
(System.currentTimeMillis() - startTime));
}
} while (continuationToken != null);
log.info(containerName + " Collection has been collected successfully");
return documents;
}
My question: can we use the same line of code to execute a delete query such as (DELETE * FROM c)? If so, what would it return in the Iterable<FeedResponse> feedResponseIterator object?
SQL statements can only be used for reads. Delete operations must be done item by item using deleteItem().
Here are Java SDK samples (sync and async) for all document operations in Cosmos DB.
Java v4 SDK Document Samples
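A minimal sketch of the query-then-delete pattern described above. To keep it runnable without the Azure SDK, `ItemStore` and `InMemoryStore` are hypothetical stand-ins for a container and are not part of any SDK. With the v4 SDK, the read step would be a `queryItems` call selecting the id and partition key of each document, and the delete step would be `CosmosContainer.deleteItem(id, partitionKey, options)`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DeleteByQuerySketch {

    /** Hypothetical stand-in for a container: a read query plus per-item deletes. */
    interface ItemStore {
        List<Map<String, String>> queryAll();            // plays the role of a SELECT query
        void deleteItem(String id, String partitionKey); // plays the role of a point delete
    }

    /** In-memory implementation, present only to make the sketch runnable. */
    static class InMemoryStore implements ItemStore {
        private final Map<String, Map<String, String>> items = new LinkedHashMap<>();

        void add(String id, String partitionKey) {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("id", id);
            doc.put("pk", partitionKey);
            items.put(partitionKey + "/" + id, doc);
        }

        @Override
        public List<Map<String, String>> queryAll() {
            // Return a copy so the caller can delete while iterating over the result.
            return new ArrayList<>(items.values());
        }

        @Override
        public void deleteItem(String id, String partitionKey) {
            items.remove(partitionKey + "/" + id);
        }

        int size() {
            return items.size();
        }
    }

    /** Reads every item, then deletes them one at a time. Returns the number deleted. */
    static int deleteAll(ItemStore store) {
        int deleted = 0;
        for (Map<String, String> doc : store.queryAll()) {
            store.deleteItem(doc.get("id"), doc.get("pk"));
            deleted++;
        }
        return deleted;
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        store.add("1", "books");
        store.add("2", "books");
        store.add("1", "films");
        System.out.println("deleted=" + deleteAll(store) + " remaining=" + store.size());
    }
}
```

The point of the sketch is only the shape of the loop: the query returns identifiers, and each delete is a separate call that needs both the id and the partition key value.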

Writing from Spark to HBase : org.apache.spark.SparkException: Task not serializable

I'm working on a heatmap project for my university. We have to read some data (212 GB) from a text file (coordinates, height), then put it in HBase so that a web client built with Express can retrieve it.
I practiced using a 144 MB file, and this works:
SparkConf conf = new SparkConf().setAppName("PLE");
JavaSparkContext context = new JavaSparkContext(conf);
JavaRDD<String> data = context.textFile(args[0]);
Connection co = ConnectionFactory.createConnection(getConf());
createTable(co);
Table table = co.getTable(TableName.valueOf(TABLE_NAME));
Put put = new Put(Bytes.toBytes("KEY"));
for (String s : data.collect()) {
String[] tmp = s.split(",");
put.addImmutable(FAMILY,
Bytes.toBytes(tmp[2]),
Bytes.toBytes(tmp[0]+","+tmp[1]));
}
table.put(put);
But now that I use the 212 GB file, I get memory errors. I guess the collect method gathers all the data in memory, so 212 GB is too much.
So now I'm trying this:
SparkConf conf = new SparkConf().setAppName("PLE");
JavaSparkContext context = new JavaSparkContext(conf);
JavaRDD<String> data = context.textFile(args[0]);
Connection co = ConnectionFactory.createConnection(getConf());
createTable(co);
Table table = co.getTable(TableName.valueOf(TABLE_NAME));
Put put = new Put(Bytes.toBytes("KEY"));
data.foreach(line ->{
String[] tmp = line.split(",");
put.addImmutable(FAMILY,
Bytes.toBytes(tmp[2]),
Bytes.toBytes(tmp[0]+","+tmp[1]));
});
table.put(put);
And I'm getting "org.apache.spark.SparkException: Task not serializable". I searched for it and tried some fixes based on what I read here, without success: Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects
I don't really understand everything in that topic. I'm just a student; maybe the answer to my problem is obvious, maybe not. Anyway, thanks in advance!
As a rule of thumb, serializing database connections (of any type) doesn't make sense. They are not designed to be serialized and deserialized, Spark or not.
Create a connection for each partition:
data.foreachPartition(partition -> {
Connection co = ConnectionFactory.createConnection(getConf());
... // All required setup
Table table = co.getTable(TableName.valueOf(TABLE_NAME));
Put put = new Put(Bytes.toBytes("KEY"));
while (partition.hasNext()) {
String line = partition.next();
String[] tmp = line.split(",");
put.addImmutable(FAMILY,
Bytes.toBytes(tmp[2]),
Bytes.toBytes(tmp[0]+","+tmp[1]));
}
table.put(put); // write the accumulated cells, as the question's code does after its loop
... // Clean connections
});
I also recommend reading Design Patterns for using foreachRDD from the official Spark Streaming programming guide.
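To see why the version in the question fails, here is a small stand-alone sketch that uses plain Java serialization, which is what Spark uses for closures by default. `FakeConnection` and `SerializableTask` are hypothetical names invented for this example; `FakeConnection` just stands in for any non-serializable resource such as an HBase `Table` or `Put` created on the driver.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class ClosureSerializationDemo {

    /** A task that must be serializable, like a function shipped to executors. */
    interface SerializableTask extends Serializable {
        void run(String line);
    }

    /** Hypothetical stand-in for a database connection; deliberately NOT Serializable. */
    static class FakeConnection {
        void write(String line) {
            // Pretend to send the line somewhere.
        }
    }

    /** Returns true if Java serialization accepts the task, false if it rejects it. */
    static boolean canSerialize(SerializableTask task) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(task);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Captures a connection created outside the lambda, so the capture must be serialized too. */
    static SerializableTask capturingTask() {
        FakeConnection outer = new FakeConnection();
        return line -> outer.write(line);
    }

    /** Creates the connection inside the lambda, so nothing non-serializable is captured. */
    static SerializableTask selfContainedTask() {
        return line -> {
            FakeConnection inner = new FakeConnection();
            inner.write(line);
        };
    }

    public static void main(String[] args) {
        System.out.println("capturing task serializable:      " + canSerialize(capturingTask()));
        System.out.println("self-contained task serializable: " + canSerialize(selfContainedTask()));
    }
}
```

The first task is rejected because the captured `FakeConnection` has to travel with the lambda; the second is accepted because the resource is created where the code runs. `foreachPartition` applies the same idea while paying the connection cost once per partition instead of once per line.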

Java compare two csv files [closed]

So I have two CSV files I wish to compare.
Each file could be as large as 20 MB.
Each line has the key followed by the data, i.e. key,data
but the data itself is also comma-separated.
csv1.csv
KEY , DATA
AB45,12,15,65,NN
AB46,12,15,64,YY
AB47,45,85,95,YN
csv2.csv
AB45,12,15,65,NN
AB46,15,15,65,YY
AB48,65,45,60,YY
What I want to do is read both files and compare the data for each key.
I was thinking of reading each file line by line and adding the lines to a TreeMap. I can then compare the data for each key and, if there is a difference, write it to another file.
Any advice?
I am unsure how to read the files so as to extract just the keys and data efficiently.
Use a dedicated CSV parsing library to speed things up. With uniVocity-parsers you can parse these 20 MB files in 100 ms or less. The following solution is a bit involved, in order to avoid loading too much data into memory. Check the uniVocity-parsers tutorial; there are many ways to accomplish what you need with this library.
First we read one of the CSV files, and generate a Map:
public static void main(String... args) {
//First we parse one file (ideally the smaller one)
CsvParserSettings settings = new CsvParserSettings();
//here we tell the parser to read the CSV headers
settings.setHeaderExtractionEnabled(true);
CsvParser parser = new CsvParser(settings);
//Parse all data into a list.
List<String[]> records = parser.parseAll(new File("/path/to/csv1.csv"));
//Convert that list into a map. The first column of this input will produce the keys.
Map<String, String[]> mapOfRecords = toMap(records);
//this is where the magic happens.
processFile(new File("/path/to/csv2.csv"), new File("/path/to/diff.csv"), mapOfRecords);
}
This is the code to generate a Map from the list of records:
/* Converts a list of records to a map. Uses element at index 0 as the key */
private static Map<String, String[]> toMap(List<String[]> records) {
HashMap<String, String[]> map = new HashMap<String, String[]>();
for (String[] row : records) {
//column 0 will always have an ID.
map.put(row[0], row);
}
return map;
}
With the map of records, we can process your second file and generate another with any updates found:
private static void processFile(final File input, final File output, final Map<String, String[]> mapOfExistingRecords) {
//configures a new parser again
CsvParserSettings settings = new CsvParserSettings();
settings.setHeaderExtractionEnabled(true);
//All parsed rows will be submitted to the following Processor. This way you won't have to store all rows in memory.
settings.setProcessor(new RowProcessor() {
//will write the changed rows to another file
CsvWriter writer;
@Override
public void processStarted(ParsingContext context) {
CsvWriterSettings settings = new CsvWriterSettings(); //configure at will
writer = new CsvWriter(output, settings);
}
@Override
public void rowProcessed(String[] row, ParsingContext context) {
// Incoming rows will have the ID at index 0.
// If the map contains the ID, we'll get a row
String[] existingRow = mapOfExistingRecords.get(row[0]);
if (!Arrays.equals(row, existingRow)) {
writer.writeRow(row);
}
}
@Override
public void processEnded(ParsingContext context) {
writer.close();
}
});
CsvParser parser = new CsvParser(settings);
//the parse() method will submit all rows to the RowProcessor defined above. All differences will be
//written to the output file.
parser.parse(input);
}
This should work just fine. I hope it helps you.
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
I work with a lot of CSV file comparisons for my job. I didn't know Python before I started working, but I picked it up really quickly. If you want to compare CSV files quickly, Python is a wonderful way to go, and it's fairly easy to pick up if you know Java.
I modified a script I use to fit your basic use case (you'll need to modify it a bit more to do exactly what you want). It runs in under a few seconds when I use it to compare CSV files with millions of rows. If you need to do this in Java, you can pretty much transfer this to some Java methods. There are similar CSV libraries you can use that will replace all the csv functions below.
import csv, sys, itertools
def getKeyPosition(header_row, key_value):
    counter = 0
    for header in header_row:
        if (header == key_value):
            return counter
        counter += 1
# This will create a dictionary of your rows by their key. (key is the column location)
def getKeyDict(csv_reader, key_position):
    key_dict = {}
    row_counter = 0
    unique_records = 0
    for row in csv_reader:
        row_counter += 1
        if row[key_position] not in key_dict:
            key_dict.update({row[key_position]: row})
            unique_records += 1
    # My use case requires a lot of checking for duplicates
    if unique_records != row_counter:
        print "Duplicate Keys in File"
    return key_dict
def main():
    f1 = open(sys.argv[1])
    f2 = open(sys.argv[2])
    f1_csv = csv.reader(f1)
    f2_csv = csv.reader(f2)
    f1_header = next(f1_csv)
    f2_header = next(f2_csv)
    f1_header_key_position = getKeyPosition(f1_header, "KEY")
    f2_header_key_position = getKeyPosition(f2_header, "KEY")
    f1_row_dict = getKeyDict(f1_csv, f1_header_key_position)
    f2_row_dict = getKeyDict(f2_csv, f2_header_key_position)
    outputFile = open("KeyDifferenceFile.csv" , 'w')
    writer = csv.writer(outputFile)
    writer.writerow(f1_header)
    # Here's the logic for comparing rows
    for key, row_1 in f1_row_dict.iteritems():
        # Do whatever comparisons you need here.
        if key not in f2_row_dict:
            print "Oh no, this key doesn't exist in the file 2"
        if key in f2_row_dict:
            row_2 = f2_row_dict.get(key)
            if row_1 != row_2:
                print "oh no, the two rows don't match!"
            # You can get more header keys to compare by if you want.
            data_position = getKeyPosition(f2_header, "DATA")
            row_1_data = row_1[data_position]
            row_2_data = row_2[data_position]
            if row_1_data != row_2_data:
                print "oh no, the data doesn't match!"
            # Here's how you'd write the rows
            row_to_write = []
            # Differences between the columns
            for row_1_column, row_2_column in itertools.izip(row_1_data, row_2_data):
                row_to_write.append(row_1_column - row_2_column)
            writer.writerow(row_to_write)
    # Make sure to close those files!
    f1.close()
    f2.close()
    outputFile.close()
main()
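For reference, a rough Java version of the same dictionary-based comparison, using only the standard library and the TreeMap idea from the question. It is a sketch under simplifying assumptions: lines are plain `key,data` with no quoted fields, and any header line has already been dropped. The class and method names (`CsvKeyDiff`, `indexByKey`, `diff`) and the report labels are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class CsvKeyDiff {

    /** Indexes lines of the form key,data by their first column. A later duplicate key overwrites an earlier one. */
    static Map<String, String> indexByKey(List<String> lines) {
        Map<String, String> byKey = new TreeMap<>();
        for (String line : lines) {
            int comma = line.indexOf(',');
            if (comma < 0) {
                continue; // skip blank or malformed lines
            }
            byKey.put(line.substring(0, comma).trim(), line.substring(comma + 1).trim());
        }
        return byKey;
    }

    /** Compares two files key by key and reports changed, missing and extra keys in sorted key order. */
    static List<String> diff(List<String> first, List<String> second) {
        Map<String, String> a = indexByKey(first);
        Map<String, String> b = indexByKey(second);
        TreeSet<String> keys = new TreeSet<>(a.keySet());
        keys.addAll(b.keySet());

        List<String> report = new ArrayList<>();
        for (String key : keys) {
            String left = a.get(key);
            String right = b.get(key);
            if (left == null) {
                report.add("ONLY_IN_2," + key + "," + right);
            } else if (right == null) {
                report.add("ONLY_IN_1," + key + "," + left);
            } else if (!left.equals(right)) {
                report.add("CHANGED," + key + "," + left + " -> " + right);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Sample rows taken from the question.
        List<String> csv1 = Arrays.asList("AB45,12,15,65,NN", "AB46,12,15,64,YY", "AB47,45,85,95,YN");
        List<String> csv2 = Arrays.asList("AB45,12,15,65,NN", "AB46,15,15,65,YY", "AB48,65,45,60,YY");
        for (String entry : diff(csv1, csv2)) {
            System.out.println(entry);
        }
    }
}
```

For real files, the lists could come from `java.nio.file.Files.readAllLines`; two 20 MB files fit comfortably in memory this way. If the data may contain quoted commas, a proper CSV parser is the safer choice.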

unexpected multiple execution of mapper intended to run once

I tried to write a very simple job with only one mapper and no reducer to write some data to HBase. In the mapper I simply open a connection to HBase, write a few rows of data to a table, and then close the connection. In the job driver I am using JobConf.setNumMapTasks(1); and JobConf.setNumReduceTasks(0); to specify that only one mapper and no reducer are to be executed. I am also setting the reducer class to IdentityReducer in jobConf. The strange behavior I am observing is that the job successfully writes the data to the HBase table, but after that I see in the logs that it keeps opening a connection to HBase and closing it again. This goes on for 20-30 minutes, after which the job is declared to have completed with 100% success. At the end, when I check the _success file created by the dummy data I put in OutputCollector.collect(...), I see hundreds of rows of dummy data when there should be only one.
Following is the code for job driver
public int run(String[] arg0) throws Exception {
Configuration config = HBaseConfiguration.create(getConf());
ensureRequiredParametersExist(config);
ensureOptionalParametersExist(config);
JobConf jobConf = new JobConf(config, getClass());
jobConf.setJobName(config.get(ETLJobConstants.ETL_JOB_NAME));
//set map specific configuration
jobConf.setNumMapTasks(1);
jobConf.setMaxMapAttempts(1);
jobConf.setInputFormat(TextInputFormat.class);
jobConf.setMapperClass(SingletonMapper.class);
jobConf.setMapOutputKeyClass(LongWritable.class);
jobConf.setMapOutputValueClass(Text.class);
//set reducer specific configuration
jobConf.setReducerClass(IdentityReducer.class);
jobConf.setOutputKeyClass(LongWritable.class);
jobConf.setOutputValueClass(Text.class);
jobConf.setOutputFormat(TextOutputFormat.class);
jobConf.setNumReduceTasks(0);
//set job specific configuration details like input file name etc
FileInputFormat.setInputPaths(jobConf, jobConf.get(ETLJobConstants.ETL_JOB_FILE_INPUT_PATH));
System.out.println("setting output path to : " + jobConf.get(ETLJobConstants.ETL_JOB_FILE_OUTPUT_PATH));
FileOutputFormat.setOutputPath(jobConf,
new Path(jobConf.get(ETLJobConstants.ETL_JOB_FILE_OUTPUT_PATH)));
JobClient.runJob(jobConf);
return 0;
}
The driver class extends Configured and implements Tool (I used the sample from the Definitive Guide).
Following is the code in my Mapper's map method, where I simply open the connection to HBase, do some preliminary checks to make sure the table exists, then write the rows and close the table.
public void map(LongWritable arg0, Text arg1,
OutputCollector<LongWritable, Text> arg2, Reporter arg3)
throws IOException {
HTable aTable = null;
HBaseAdmin admin = null;
try {
arg3.setStatus("started");
/*
* set-up hbase config
*/
admin = new HBaseAdmin(conf);
/*
* open connection to table
*/
String tableName = conf.get(ETLJobConstants.ETL_JOB_TABLE_NAME);
HTableDescriptor htd = new HTableDescriptor(toBytes(tableName));
String colFamilyName = conf.get(ETLJobConstants.ETL_JOB_TABLE_COLUMN_FAMILY_NAME);
byte[] tablename = htd.getName();
/* call function to ensure table with 'tablename' exists */
/*
* loop and put the file data into the table
*/
aTable = new HTable(conf, tableName);
DataRow row = /* logic to generate data */
while (row != null) {
byte[] rowKey = toBytes(row.getRowKey());
Put put = new Put(rowKey);
for (DataNode node : row.getRowData()) {
put.add(toBytes(colFamilyName), toBytes(node.getNodeName()),
toBytes(node.getNodeValue()));
}
aTable.put(put);
arg3.setStatus("xoxoxoxoxoxoxoxoxoxoxoxo added another data row to hbase");
row = fileParser.getNextRow();
}
aTable.flushCommits();
arg3.setStatus("xoxoxoxoxoxoxoxoxoxoxoxo Finished adding data to hbase");
} finally {
if (aTable != null) {
aTable.close();
}
if (admin != null) {
admin.close();
}
}
arg2.collect(new LongWritable(10), new Text("something"));
arg3.setStatus("xoxoxoxoxoxoxoxoxoxoxoxoadded some dummy data to the collector");
}
As you can see near the end, I write some dummy data (10, 'something') to the collector, and I see hundreds of rows of this data in the _success file after the job has terminated.
I can't identify why the mapper code is restarted multiple times over and over instead of running just once. Any help would be greatly appreciated.
Calling JobConf.setNumMapTasks(1) only tells Hadoop that you would like one mapper, if possible; it is a hint. setNumReduceTasks, by contrast, sets exactly the number you specify.
That's why more mappers are run, and why you see all those rows.
For more details, please read this post.
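To make the "hint" concrete, here is a small sketch of how the number of map tasks follows from the input size rather than from the requested count. It assumes the split-size rule of the old-API `FileInputFormat`, `max(minSize, min(goalSize, blockSize))`, as I read it; it ignores the small slop factor Hadoop applies to the last split, so check it against your Hadoop version. `SplitCountSketch` and its method names are invented for this example.

```java
class SplitCountSketch {

    /** Split size rule assumed here: max(minSize, min(goalSize, blockSize)). */
    static long splitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    /**
     * Approximate number of map tasks for a single input file.
     * requestedMaps is the value passed to setNumMapTasks; minSize must be at least 1.
     */
    static long estimateMapTasks(long fileSize, int requestedMaps, long minSize, long blockSize) {
        long goalSize = fileSize / Math.max(1, requestedMaps);
        long size = splitSize(goalSize, minSize, blockSize);
        return (fileSize + size - 1) / size; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // One map task requested, 128 MB blocks, default minimum split size of 1 byte.
        System.out.println("1024 MB file: " + estimateMapTasks(1024 * mb, 1, 1, 128 * mb));
        System.out.println("10 MB file:   " + estimateMapTasks(10 * mb, 1, 1, 128 * mb));
        // Raising the minimum split size to the file size forces a single split.
        System.out.println("1024 MB file, min split 1024 MB: "
                + estimateMapTasks(1024 * mb, 1, 1024 * mb, 128 * mb));
    }
}
```

Under these assumptions, a 1024 MB file with 128 MB blocks still produces 8 splits when one map task is requested, because the block size caps the split size. Independently of the task count, map() is called once for every input record, so a map body that opens a connection and emits a dummy row does so once per input line; that may also explain the repeated connections and the hundreds of rows. With the old API, once-per-task work usually belongs in configure() and close().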

hbase java code returns null for a get but hbase shell get command returns record

I have just started using HBase and am not a proficient Java programmer. I created a debug program to test the current HBase program, which puts and gets records and also serves as a deduping mechanism. The debug program checks whether certain ids that should have been inserted by the other program are present in the HBase table. When I do a get, most records are there, but some come back as null (not found). When I manually check from the hbase shell and request the same id, it returns the row with a timestamp. Is there something I am not understanding here? Are multiple versions of a record kept in HBase? I assumed HBase made records unique based on the id provided.
// code to get record
public static byte[] getPreHbase(String provid, String commentId) throws IOException {
provid = "98";
commentId = commentId.trim();
String rec = provid + "." + commentId;
byte [] value= "test".getBytes();
try{
Get g = new Get(Bytes.toBytes(rec));
Result r = htableII.get(g);
value = r.getValue(Bytes.toBytes("cmmnttest"),Bytes.toBytes("cmmntposts"));
String valueStr = Bytes.toString(value);
}catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return value;
}
As I mentioned, this happens only for some ids, while others are returned. This is the manual call in the shell:
get 'hb_test', '98.1010000000003_1asdfghjkl'
COLUMN CELL
cmmnttest:cmmntposts timestamp=1420659812914,
value= 1010000000003_1asdfghjkl
1 row(s) in 0.0140 seconds
