I want to copy data from one HBase table to another using the Java API, but I am not able to find one. Is there any Java API to do this?
Thanks.
The following is far from the most optimized approach, but from the tone of the question it seems performance is not the critical factor here.
First, you need to set up your HBaseConfiguration and your input / output tables:
Configuration config = HBaseConfiguration.create();
HTable inputTable = new HTable(config, "input_table");
HTable outputTable = new HTable(config, "output_table");
What you want is a Scan, which lets you perform a range scan over the table. You define the query parameters by adding columns (and, optionally, filters) to the Scan object:
Scan scan = new Scan(Bytes.toBytes("smith-"));
scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("givenName"));
scan.addColumn(Bytes.toBytes("contactinfo"), Bytes.toBytes("email"));
scan.setFilter(new PageFilter(25));
Now you are ready to run the scan against the input table and iterate over the results:
ResultScanner scanner = inputTable.getScanner(scan);
for (Result result : scanner) {
putToOutputTable(result);
}
To save to the second table, you can either issue Puts from within the for loop, or aggregate the results into a List for a bulk put (a sketch of the bulk variant follows the method below).
protected void putToOutputTable(Result result) throws IOException {
    // Re-use the source row key for the destination row.
    Put p = new Put(result.getRow());
    // Retrieve the map of families to their most recent qualifiers and values.
    NavigableMap<byte[], NavigableMap<byte[], byte[]>> map = result.getNoVersionMap();
    for (Map.Entry<byte[], NavigableMap<byte[], byte[]>> family : map.entrySet()) {
        for (Map.Entry<byte[], byte[]> column : family.getValue().entrySet()) {
            // Family, qualifier and value must all be byte arrays -- HBase is
            // all about byte arrays. The column family must already exist in
            // the output table's schema; the qualifier can be anything.
            p.add(family.getKey(), column.getKey(), column.getValue());
        }
    }
    outputTable.put(p);
}
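If you prefer the bulk variant mentioned above, the (old) HTable API also accepts a list of Puts. A minimal sketch, reusing inputTable, outputTable and scan from the setup above:
// Buffer one Put per source row, then write them all in a single batch call.
List<Put> puts = new ArrayList<Put>();
for (Result result : inputTable.getScanner(scan)) {
    Put p = new Put(result.getRow());
    NavigableMap<byte[], NavigableMap<byte[], byte[]>> map = result.getNoVersionMap();
    for (Map.Entry<byte[], NavigableMap<byte[], byte[]>> family : map.entrySet()) {
        for (Map.Entry<byte[], byte[]> column : family.getValue().entrySet()) {
            p.add(family.getKey(), column.getKey(), column.getValue());
        }
    }
    puts.add(p);
}
outputTable.put(puts); // HTable.put(List<Put>) writes the whole batch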
If instead you want a more scalable version, take a look at how to use MapReduce to read from input HDFS files and write to output HBase tables here: HBase Map/Reduce
I am aware that Bigtable supports append and increment operations using ReadModifyWriteRow requests, but I'm wondering if there is support, or an alternative way, to use more generic mapping functions where the value from the cell can be accessed and modified within some sort of closure. For instance, bitwise ANDing a long value in a cell:
Function<Long, Long> modifyFunc = f -> f & 10L;
ReadModifyWriteRow
.create("tableName", "rowKey")
.apply("family", "qualifier", modifyFunc);
Doing a mapping like this is not supported by Bigtable, but here is an option you could try. It will only work with single-cluster instances, because of the consistency it relies on.
You could add a column that keeps track of a row version (in addition to the cell versions Bigtable already keeps), then read the data and version, modify the value in memory, and do a checkAndMutate conditioned on that version together with the new value. Something like this:
Row row = dataClient.readRow(tableId, rowkey);
List<RowCell> cells = row.getCells();
// Get the value and timestamp/version from the cell you are targeting.
RowCell cell = cells.get(...);
long version = cell.getTimestamp();
ByteString value = cell.getValue();
// Do your mapping to the new value.
ByteString newValue = ...;
// Write the new value with a fresh timestamp (Bigtable timestamps are in microseconds).
// In this scheme you would typically also bump the version column in the same mutation.
long newTimestamp = System.currentTimeMillis() * 1000;
Mutation mutation =
    Mutation.create().setCell(COLUMN_FAMILY_NAME, COLUMN_NAME, newTimestamp, newValue);
// Filter on the column that tracks the version to validate that nothing changed in between.
// This assumes the version column stores the version as a UTF-8 string.
Filter filter =
    FILTERS
        .chain()
        .filter(FILTERS.family().exactMatch(COLUMN_FAMILY_NAME))
        .filter(FILTERS.qualifier().exactMatch(VERSION_COLUMN))
        .filter(FILTERS.value().exactMatch(ByteString.copyFromUtf8(Long.toString(version))));
ConditionalRowMutation conditionalRowMutation =
    ConditionalRowMutation.create(tableId, rowkey).condition(filter).then(mutation);
boolean success = dataClient.checkAndMutateRow(conditionalRowMutation);
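Since checkAndMutateRow returns false when the row changed between the read and the write, you would typically retry the whole read/modify/mutate sequence. A minimal sketch, assuming the code above is wrapped in a hypothetical helper tryMapCell() that returns the boolean from checkAndMutateRow:
// Optimistic-concurrency retry loop: re-read, re-apply the mapping, and retry
// the conditional mutation until it wins or the retry budget is exhausted.
int maxAttempts = 5; // arbitrary retry budget
boolean applied = false;
for (int attempt = 0; attempt < maxAttempts && !applied; attempt++) {
    applied = tryMapCell(dataClient, tableId, rowkey, modifyFunc);
}
if (!applied) {
    throw new IllegalStateException("Cell was modified concurrently on every attempt");
}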
I want to do a batch insert to Postgres using jOOQ:
List<MyTableRecord> records = new ArrayList<>();
for (Dto dto : dtos) {
Field<Long> sequenceId = SEQUENCE.nextval();
Long id = using(ctx).select(sequenceId).fetchOne(sequenceId);
records.add(mapToRecord(dto, id));
}
using(ctx).batchInsert(records).execute();
The problem is that I am fetching the next sequence value with a separate query for each row.
For a simple insert I can use the Field directly in the statement, like this:
create.insertInto(ID, VALUE)
.values(SEQUENCE.nextval(), val("William"))
.execute();
How can I do so with batch insert?
Pre-fetch all the sequence values
You could pre-fetch all the sequence values you need using this:
List<Long> ids = using(ctx)
.select(sequenceId)
.from(generateSeries(1, dtos.size()))
.fetch(sequenceId);
for (int i = 0; i < dtos.size(); i++)
records.add(mapToRecord(dtos.get(i), ids.get(i)));
using(ctx).batchInsert(records).execute();
This seems like a useful feature to have out of the box, in an RDBMS agnostic way via using(ctx).nextvals(SEQUENCE, dtos.size()). We'll consider this for a future jOOQ version: https://github.com/jOOQ/jOOQ/issues/10658
Don't use records
An alternative is to batch actual INSERT statements instead of Record.insert() calls via batchInsert(). That way, you can put the SEQUENCE.nextval() expression in the statement. See: https://www.jooq.org/doc/latest/manual/sql-execution/batch-execution/
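A minimal sketch of that approach (MY_TABLE, MY_TABLE.ID, MY_TABLE.VALUE and dto.getValue() are placeholders for your generated table, its fields, and your DTO accessor):
// Build one INSERT per DTO, letting the database pull the id from the sequence,
// then execute them all as a single JDBC batch.
List<Query> queries = new ArrayList<>();
for (Dto dto : dtos) {
    queries.add(
        using(ctx)
            .insertInto(MY_TABLE, MY_TABLE.ID, MY_TABLE.VALUE)
            .values(SEQUENCE.nextval(), val(dto.getValue())));
}
using(ctx).batch(queries).execute();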
I have a csv file that consists of data in this format:
id, name, surname, morecolumns
5, John, Lok, more
2, John2, Lok2, more
1, John3, Lok3, more
etc..
I want to sort my csv file using the id as key and store the sorted results in another file.
Here is what I've done so far in order to create JavaPairs of (id, rest_of_line):
SparkConf conf = new SparkConf().setAppName.....;
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> file = sc.textFile("inputfile.csv");
// extract the header
String header = file.first();
JavaRDD<String> lines = file.filter(s -> !s.equals(header));
// create JavaPairs
JavaPairRDD<Integer, String> pairRdd = lines.mapToPair(
new PairFunction<String, Integer, String>() {
public Tuple2<Integer, String> call(final String line) {
String str = line.split(",", 2)[0];
String str2 = line.split(",", 2)[1];
int id = Integer.parseInt(str);
return new Tuple2(id, str2);
}
});
// sort and save the output
pairRdd.sortByKey(true, 1);
pairRdd.coalesce(1).saveAsTextFile("sorted.csv");
This works in cases where I have small files. However, when I am using bigger files, the output is not sorted properly. I think this happens because the sorting takes place on different nodes, so the merge of the results from all the nodes doesn't give the expected output.
So, the question is: how can I sort my CSV file using the id as key and store the sorted results in another file?
The method coalesce is probably the one to blame, as it apparently does not contractually guarantee the ordering of the resulting RDD (see Which operations preserve RDD order?). So if you avoid that coalesce, the resulting output files will be ordered.
Since you want a single CSV file, you could fetch the results from whatever file system you're using, taking care to preserve their actual order, and merge them. For example, if you're using HDFS (as stated by @PinoSan), this can be done with the command hdfs dfs -getmerge <hdfs-output-dir> <local-file.csv>.
As pointed out by @mauriciojost, you should not do the coalesce.
Instead, a better way is pairRdd.sortByKey(true, pairRdd.getNumPartitions()).saveAsTextFile(path), so that as much work as possible is carried out on the partitions that hold the data.
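Putting that together with the merge step, a rough sketch (the output path is a placeholder, and the map back to a plain line is just one way of dropping the tuple formatting):
// Sort across all existing partitions (no coalesce); the saved part files
// are then globally ordered: part-00000 < part-00001 < ...
JavaRDD<String> sortedLines = pairRdd
    .sortByKey(true, pairRdd.getNumPartitions())
    .map(t -> t._1() + "," + t._2()); // back to the "id,rest_of_line" form
sortedLines.saveAsTextFile("hdfs:///user/me/sorted-output"); // placeholder output path
// Then merge the ordered part files outside Spark:
//   hdfs dfs -getmerge /user/me/sorted-output sorted.csv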
I am very new to Spark, and I have a query that brings data from two Oracle tables. The tables have to be joined by a field, which works fine with the code below. However, I need to apply filters as in an Oracle WHERE clause, for example, to bring back employees whose age is between 25 and 50. I also have to apply GroupBy and sort the final results with OrderBy. The thing is that the only operation performed correctly is the retrieval of all data from the tables and the join between them. The filters are simply not applied and I have no idea why. Can you please help me out with this? I am sure I am missing something, because I get no compile errors. The data is loaded fine, but the WHERE clauses seem to have no effect on the data, although there are employees with age between 25 and 50. Many thanks!
public static JavaRDD<Row> getResultsFromQuery(String connectionUrl) {
JavaSparkContext sparkContext = new JavaSparkContext(new SparkConf()
.setAppName("SparkJdbcDs").setMaster("local"));
SQLContext sqlContext = new SQLContext(sparkContext);
Map<String, String> options = new HashMap<>();
options.put("driver", "oracle.jdbc.OracleDriver");
options.put("url", connectionUrl);
options.put("dbtable", "EMPLOYEE");
DataFrameReader dataFrameReader = sqlContext.read().format("jdbc")
.options(options);
DataFrame dataFrameFirstTable = dataFrameReader.load();
options.put("dbtable", "DEPARTMENT");
dataFrameReader = sqlContext.read().format("jdbc").options(options);
DataFrame dataFrameSecondTable = dataFrameReader.load();
//JOIN. IT WORKS JUST FINE!!!
DataFrame resultingDataFrame = dataFrameFirstTable.join(dataFrameSecondTable,
"DEPARTMENTID");
//FILTERS. THEY DO NOT THROW ERROR, BUT ARE NOT APPLIED. RESULTS ARE ALWAYS THE SAME, WITHOUT FILTERS
resultingDataFrame.where(resultingDataFrame.col("AGE").geq(25));
resultingDataFrame.where(resultingDataFrame.col("AGE").leq(50));
JavaRDD<Row> resultFromQuery = resultingDataFrame.toJavaRDD();
//HERE I CONFIRM THAT THE NUMBER OF ROWS GOTTEN IS ALWAYS THE SAME, SO THE FILTERS DO NOT WORK.
System.out.println("Number of rows "+resultFromQuery.count());
return resultFromQuery;
}
where returns a new dataframe and does NOT alter the existing one, so you need to store the output:
DataFrame greaterThan25 = resultingDataFrame.where(resultingDataFrame.col("AGE").geq(25));
DataFrame lessThanGreaterThan = greaterThan25.where(resultingDataFrame.col("AGE").leq(50));
JavaRDD<Row> resultFromQuery = lessThanGreaterThan.toJavaRDD();
Or you can just chain it:
DataFrame resultingDataFrame = dataFrameFirstTable
    .join(dataFrameSecondTable, "DEPARTMENTID")
    .where(dataFrameFirstTable.col("AGE").geq(25))
    .where(dataFrameFirstTable.col("AGE").leq(50));
Note that using == in the comparison, like this:
people.select("person_id", "first_name").filter(people("person_id") == 2).show
won't work; you'll get the following error:
Error: overloaded method value filter with alternatives:
  (conditionExpr: String)org.apache.spark.sql.DataFrame
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
cannot be applied to (Boolean)
It seems that when combining select with filter on a Spark DataFrame, we can't pass a Boolean; as the error shows, the condition must be a String expression or a Column.
The two queries below each select a single row from a Spark DataFrame, using the two different clauses, where and filter:
people.select("person_id", "first_name").filter(people("person_id") === 2).show
people.select("person_id", "first_name").where(people("person_id") === 2).show
Use either of the above queries to select a single row from a Spark DataFrame.
My first CSV file looks like this, with the header included (the header appears only at the top, not after every entry):
NAME,SURNAME,AGE
Fred,Krueger,Unknown
.... n records
My second file might look like this:
NAME,MIDDLENAME,SURNAME,AGE
Jason,Noname,Scarry,16
.... n records with this header template
The merged file should look like this:
NAME,SURNAME,AGE,MIDDLENAME
Fred,Krueger,Unknown,
Jason,Scarry,16,Noname
....
Basically, if the headers don't match, all new header titles (columns) should be appended after the original header, and each row's values written according to that column order.
Update
The CSV files above were made smaller so I can illustrate what I want to achieve; in reality the CSV files are generated one step before this merge and can have up to 100 columns.
How can I do this?
I'd create a model for the 'bigger' format (a simple class with four fields and a collection for instances of this class) and implement two parsers, one for the first format, one for the second. Create records for all rows of both CSV files and implement a writer to output the CSV in the correct format. In brief:
public void convert(File output, File... input) {
    List<Record> records = new ArrayList<Record>();
    for (File file : input) {
        // isThreeColumnFormat() stands for however you detect the layout;
        // the two parsers correspond to the two known header formats.
        if (file.isThreeColumnFormat()) {
            records.addAll(ThreeColumnFormatParser.parse(file));
        } else {
            records.addAll(FourColumnFormatParser.parse(file));
        }
    }
    CsvWriter.write(output, records);
}
From your comment I see that you have a lot of different CSV formats with some common columns.
You could define the model for any row in the various csv files like this:
public class Record {
    Object id; // some sort of unique identifier
    Map<String, String> values = new HashMap<String, String>(); // all key/values of a single row

    public Record(Object id) { this.id = id; }

    public void put(String key, String value) {
        values.put(key, value);
    }

    public String get(String key) {
        return values.get(key);
    }
}
For parsing any file you would first read the header and add its column headers to a global keystore (it will be needed later when writing the output), then create records for all rows, like this:
//...
List<Record> records = new ArrayList<Record>();
for (File file : getAllFiles()) {
    List<String> keys = getColumnsHeaders(file);
    KeyStore.addAll(keys); // the store is a Set of all column headers seen so far
    int row = 0;
    for (String line : file.getLines()) {
        String[] values = line.split(DELIMITER);
        Record record = new Record(file.getName() + row++); // file name + row number as an example id
        for (int i = 0; i < values.length; i++) {
            record.put(keys.get(i), values[i]);
        }
        records.add(record);
    }
}
// ...
// ...
Now the keystore has all the column header names that were used, and we can iterate over the collection of all records, get each record's value for every key (null if the record's file didn't use that key), assemble the CSV lines, and write everything to a new file.
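A minimal sketch of that output step, reusing records and the keystore from the snippet above (KeyStore.getAll() and writeLine() are hypothetical helpers for exposing the collected column names and appending a line to the output file; CSV quoting is omitted):
// One header line with every column name seen, then one line per record,
// writing an empty field when a record has no value for a column.
List<String> columns = new ArrayList<String>(KeyStore.getAll());
writeLine(String.join(",", columns));
for (Record record : records) {
    StringBuilder line = new StringBuilder();
    for (int i = 0; i < columns.size(); i++) {
        if (i > 0) line.append(',');
        String value = record.get(columns.get(i));
        if (value != null) line.append(value);
    }
    writeLine(line.toString());
}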
Read in the header of the first file and create a list of the column names. Now read the header of the second file and add any column names that don't exist already in the list to the end of the list. Now you have your columns in the order that you want and you can write this to the new file first.
Next I would parse each file and for each row I would create a Map of column name to value. Once the row is parsed you could then iterate over the new list of column names and pull the values from the map and write them immediately to the new file. If the value is null don't print anything (just a comma, if required).
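A rough sketch of that approach, assuming plain comma-separated input without quoted fields (readHeader() and readDataLines() are hypothetical helpers returning a file's header columns and its remaining lines; exception handling is omitted):
// Merge the two headers, original columns first, then rewrite every row in that order.
List<String> columns = new ArrayList<>(readHeader(firstFile));
for (String name : readHeader(secondFile)) {
    if (!columns.contains(name)) {
        columns.add(name); // new columns are appended after the original header
    }
}
try (PrintWriter out = new PrintWriter(outputFile)) {
    out.println(String.join(",", columns));
    for (File file : Arrays.asList(firstFile, secondFile)) {
        List<String> header = readHeader(file);
        for (String line : readDataLines(file)) {
            String[] values = line.split(",", -1);
            Map<String, String> row = new HashMap<>();
            for (int i = 0; i < header.size() && i < values.length; i++) {
                row.put(header.get(i), values[i].trim());
            }
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < columns.size(); i++) {
                if (i > 0) sb.append(',');
                String value = row.get(columns.get(i));
                if (value != null) sb.append(value);
            }
            out.println(sb.toString());
        }
    }
}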
There might be more efficient solutions available, but I think this meets the requirements you set out.
Try this:
http://ondra.zizka.cz/stranky/programovani/ruzne/querying-transforming-csv-using-sql.texy
crunch input.csv output.csv "SELECT AVG(duration) AS durAvg FROM (SELECT * FROM indata ORDER BY duration LIMIT 2 OFFSET 6)"