Read Data from HBase - java

I'm new to HBase. What's the best way to retrieve results from a table, row by row? I would like to read all the data in the table. My table has two column families, say col1 and col2.

From the HBase shell, you can use the scan command to list the data in a table, or get to retrieve a single record. Reference here

I think this is what you need, covering both the HBase shell and the Java API: http://cook.coredump.me/post/19672191046/hbase-client-example
However, you should understand that the HBase shell 'scan' is very slow (it is not cached) and is intended only for debugging.
Another useful piece of information for you is here: http://hbase.apache.org/book/perf.reading.html
That chapter is specifically about reading from HBase, but it is somewhat harder to follow because it assumes some familiarity and contains more advanced advice. I'd recommend reading that guide from the beginning.

Use the Scan API of HBase; there you can specify a start row and an end row and retrieve data from the table.
Here is an example:
http://eternaltechnology.blogspot.in/2013/05/hbase-scanner-example-scanning.html
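For reference, a minimal sketch of such a scan with the Java client API, using the question's column families and a placeholder table name (the older HTable-based client is assumed, to match the other snippets here):
Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "mytable");

Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("col1"));
scan.addFamily(Bytes.toBytes("col2"));
// Optionally restrict the range:
// scan.setStartRow(Bytes.toBytes("startRowKey"));
// scan.setStopRow(Bytes.toBytes("endRowKey"));

ResultScanner scanner = table.getScanner(scan);
try {
    for (Result result : scanner) {
        for (Cell cell : result.rawCells()) {
            System.out.println(Bytes.toString(CellUtil.cloneRow(cell)) + " "
                    + Bytes.toString(CellUtil.cloneFamily(cell)) + ":"
                    + Bytes.toString(CellUtil.cloneQualifier(cell)) + " = "
                    + Bytes.toString(CellUtil.cloneValue(cell)));
        }
    }
} finally {
    scanner.close();
    table.close();
}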

I was looking for something like this!
Map function:
public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
    // value holds the whole row; getValue(family, qualifier) returns the raw bytes of that cell
    String x1 = new String(value.getValue(Bytes.toBytes("ColumnFamily"), Bytes.toBytes("X1")));
    String x2 = new String(value.getValue(Bytes.toBytes("ColumnFamily"), Bytes.toBytes("X2")));
}
Driver file:
Configuration config2 = new Configuration();
Job job2 = new Job(config2, "kmeans2");
//Configuration for job2
job2.setJarByClass(Converge.class);
job2.setMapperClass(Converge.Map.class);
job2.setReducerClass(Converge.Reduce.class);
job2.setInputFormatClass(TableInputFormat.class);
job2.setOutputFormatClass(NullOutputFormat.class);
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(Text.class);
job2.getConfiguration().set(TableInputFormat.INPUT_TABLE, "tablename");
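As a side note, a common alternative for wiring the table and a Scan into the job is TableMapReduceUtil, which sets the input format and the scan configuration for you (a rough sketch reusing job2 and the mapper above):
Scan scan = new Scan();
scan.setCaching(500);        // fewer RPC round trips for a full-table MapReduce scan
scan.setCacheBlocks(false);  // don't fill the block cache from a one-off scan
TableMapReduceUtil.initTableMapperJob(
        "tablename",          // input table
        scan,                 // Scan instance with any families/filters set
        Converge.Map.class,   // mapper class from the driver above
        Text.class,           // mapper output key
        Text.class,           // mapper output value
        job2);                // replaces setInputFormatClass and the INPUT_TABLE property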

Related

Apache Beam Streaming unable to write to BigQuery column-based partition

I'm currently building a streaming pipeline using the Java SDK and trying to write to a BigQuery partitioned table using BigQueryIO write/writeTableRows. I explored a couple of patterns, but none of them succeed; a few of them are below.
Using SerializableFunction to determine TableDestination
.withSchema(TableSchemaFactory.buildLineageSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED) or CREATE_NEVER
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
and then calling this function inside the .to() method:
@Override
public TableDestination apply(ValueInSingleWindow<TableRow> input) {
    TimePartitioning timePartitioning = new TimePartitioning();
    timePartitioning.setField("processingdate");
    String dest = String.format("%s.%s.%s", project, dataset, table);
    return new TableDestination(dest, null, timePartitioning);
}
I also tried to format the partition column obtained from the input and append it to the table string with the $ decorator, like below:
@Override
public TableDestination apply(ValueInSingleWindow<TableRow> input) {
    input.get("processingDate")
    // ...convert to string MMddYYYY format
    TimePartitioning timePartitioning = new TimePartitioning();
    timePartitioning.setField("processingdate");
    String dest = String.format("%s.%s.%s$%s", project, dataset, table, convertedDate);
    return new TableDestination(dest, null, timePartitioning);
}
However, none of them succeed; they fail with one of the following errors:
invalid timestamp
timestamp field value out of range
You can only stream to partitions within 0 days in the past and 0 days in the future relative to the current date.
The destination table's partition is not supported for streaming. You can only stream to meta-table of date partitioned tables.
Streaming to metadata partition of column based partitioning table is disallowed.
I can't seem to get the right combination. Has anyone encountered the same issue before? Can anyone point me in the right direction or give me some pointers? What I want to achieve is to load the streaming data based on the date column defined, not on processing time.
Thank you!
I expect most of these issues will be solved if you drop the partition decorator from dest. In most cases the BigQuery APIs for loading data will be able to figure out the right partition based on the messages themselves.
So try changing your definition of dest to:
String dest = String.format("%s.%s.%s", project, dataset, table);
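For completeness, a minimal sketch of the write with the decorator dropped; this is essentially the question's own snippet (project, dataset, table and TableSchemaFactory are its placeholders), just without the $ suffix:
BigQueryIO.writeTableRows()
    .to(new SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>() {
        @Override
        public TableDestination apply(ValueInSingleWindow<TableRow> input) {
            TimePartitioning timePartitioning = new TimePartitioning();
            timePartitioning.setField("processingdate");
            // No "$<date>" decorator: BigQuery routes each row to its partition by the column value.
            String dest = String.format("%s.%s.%s", project, dataset, table);
            return new TableDestination(dest, null, timePartitioning);
        }
    })
    .withSchema(TableSchemaFactory.buildLineageSchema())
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND);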

Spark read() works but sql() throws Database not found

I'm using Spark 2.1 to read data from Cassandra in Java.
I tried the code posted in https://stackoverflow.com/a/39890996/1151472 (with SparkSession) and it worked. However when I replaced spark.read() method with spark.sql() one, the following exception is thrown:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: `wiki`.`treated_article`; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation `wiki`.`treated_article`
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
I'm using the same Spark configuration for both the read and sql methods.
read() code:
Dataset dataset =
    spark.read().format("org.apache.spark.sql.cassandra")
        .options(new HashMap<String, String>() {
            {
                put("keyspace", "wiki");
                put("table", "treated_article");
            }
        }).load();
sql() code:
spark.sql("SELECT * FROM WIKI.TREATED_ARTICLE");
Spark SQL uses a catalogue to look up database and table references. When you use a table identifier that isn't in the catalogue, it will throw errors like the one you posted. The read command doesn't require a catalogue, since you are required to specify all of the relevant information in the invocation.
You can add entries to the catalogue either by
Registering DataSets as Views
First create your DataSet
spark.read().format("org.apache.spark.sql.cassandra")
    .options(new HashMap<String, String>() {
        {
            put("keyspace", "wiki");
            put("table", "treated_article");
        }
    }).load();
Then use one of the catalogue registry functions
void createGlobalTempView(String viewName)
Creates a global temporary view using the given name.
void createOrReplaceTempView(String viewName)
Creates a local temporary view using the given name.
void createTempView(String viewName)
Creates a local temporary view using the given name
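For example, in Java you can register the Cassandra table as a temporary view and then query it by name (a sketch based on the read() snippet above):
Dataset<Row> dataset =
    spark.read().format("org.apache.spark.sql.cassandra")
        .option("keyspace", "wiki")
        .option("table", "treated_article")
        .load();

// Register the DataSet in the catalogue under a name of your choice...
dataset.createOrReplaceTempView("treated_article");

// ...after which plain SQL resolves against that name.
spark.sql("SELECT * FROM treated_article").show();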
OR Using a SQL Create Statement
CREATE TEMPORARY VIEW words
USING org.apache.spark.sql.cassandra
OPTIONS (
table "words",
keyspace "test",
cluster "Test Cluster",
pushdown "true"
)
Once added to the catalogue by either of these methods you can reference the table in all sql calls issued by that context.
Example
CREATE TEMPORARY VIEW words
USING org.apache.spark.sql.cassandra
OPTIONS (
table "words",
keyspace "test"
);
SELECT * FROM words;
// Hello 1
// World 2
The Datastax (My employer) Enterprise software automatically registers all Cassandra tables by placing entries in the Hive Metastore used by Spark as a Catalogue. This makes all tables accessible without manual registration.
This method allows for select statements to be used without an accompanying CREATE VIEW
I cannot think of a way to make that work off the top of my head. The problem is that Spark doesn't know which format to try, and the place where this would be specified is taken by the keyspace. The closest documentation for something like this that I can find is here in the DataFrames section of the Cassandra connector documentation. You can try to specify a USING statement, but I don't think that will work inside of a SELECT. So your best bet beyond that is to create a PR to handle this case, or stick with the read DSL.

produce hfiles for multiple tables to bulk load in a single map reduce

I am using MapReduce and HFileOutputFormat to produce HFiles and bulk load them directly into the HBase table.
Now, while reading the input files, I want to produce HFiles for two tables and bulk load both outputs in a single MapReduce job.
I searched the web and saw some links about MultiHFileOutputFormat, but couldn't find a real solution to that.
Do you think that it is possible?
My way is:
Use HFileOutputFormat as before; when the job is completed, doBulkLoad and write into table1.
Keep a List of Puts in the mapper, and a MAX_PUTS value as a global constant.
When puts.size() > MAX_PUTS, do:
String tableName = conf.get("hbase.table.name.dic", table2);
HTable table = new HTable(conf, tableName);
table.setAutoFlushTo(false);
table.setWriteBufferSize(1024 * 1024 * 64);
table.put(puts);
table.close();
puts.clear();
Note: you must have a cleanup function to write the remaining puts; a sketch follows below.
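A rough sketch of what that cleanup step could look like (puts and the hbase.table.name.dic property come from the snippet above; the "table2" default name is an assumption):
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    // Flush whatever is still buffered for the second table when the mapper finishes.
    if (!puts.isEmpty()) {
        Configuration conf = context.getConfiguration();
        String tableName = conf.get("hbase.table.name.dic", "table2");
        HTable table = new HTable(conf, tableName);
        try {
            table.setAutoFlushTo(false);
            table.setWriteBufferSize(1024 * 1024 * 64);
            table.put(puts);
            table.flushCommits();
        } finally {
            table.close();
        }
        puts.clear();
    }
}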

CassandraTemplate.execute Batch statements help needed in Spring

I am having a difficult time finding a sample program that uses the execute of a batch statement as an argument for org.springframework.data.cassandra.core.CassandraTemplate.
Basically I am trying to do multiple insert as a batch.
CqlTemplate cqltemplate = new CqlTemplate(session);
cqltemplate.execute(Batch arg0);
How does it all come together? Also, batch has issues dealing with inserting multiple records into an arbitrary table (not linked to an entity class). My project requires a method to do multiple inserts for a given table and a hashmap of keys and values (row data) which does not have an equivalent POJO class. Any suggestions on how to achieve this?
I was able to work it out. Thank you for the guidance. Sorry for posting it late:
Insert insert1 = QueryBuilder.insert...
Batch batch = QueryBuilder.batch(insert1);
Insert insert2 = QueryBuilder.insert...
batch.add(insert2);
CassandraOperations cassandraOperations = new CassandraTemplate(session);
WriteOptions options = new WriteOptions();
options.setTtl(60);
options.setConsistencyLevel(ConsistencyLevel.ONE);
options.setRetryPolicy(RetryPolicy.DOWNGRADING_CONSISTENCY);
cassandraOperations.execute(batch.toString(), options);
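For the "arbitrary table plus a HashMap of columns and values" part of the question, the same QueryBuilder approach works without any POJO; a sketch with assumed keyspace, table and column names:
// Build an Insert for an arbitrary table from a Map of column name -> value (no mapped entity needed).
Map<String, Object> rowData = new HashMap<String, Object>();
rowData.put("id", UUID.randomUUID());
rowData.put("name", "example");

Insert insert = QueryBuilder.insertInto("my_keyspace", "my_table");
for (Map.Entry<String, Object> entry : rowData.entrySet()) {
    insert.value(entry.getKey(), entry.getValue());
}

Batch batch = QueryBuilder.batch(insert); // add further Insert statements with batch.add(...)
cassandraOperations.execute(batch.toString(), options);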
According to the reference, you need to create an object of the class com.datastax.driver.core.querybuilder.Batch.
You can create it with the com.datastax.driver.core.querybuilder.QueryBuilder batch method. The CqlTemplate should not be created in your service code; it should be injected via the configuration:
CqlTemplate cqlTemplate = new CqlTemplate();
yourServiceBean.setCqlTemplate(cqlTemplate);
And in your service/dao it would be something like:
Batch batch = QueryBuilder.batch(...);
cqlTemplate.execute(batch);
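A minimal sketch of that configuration-side wiring, assuming spring-data-cassandra 1.x class names and a Cassandra Session bean defined elsewhere:
@Configuration
public class CassandraConfig {

    // Expose CqlTemplate as a bean so services get it injected instead of creating it themselves.
    @Bean
    public CqlTemplate cqlTemplate(Session session) {
        return new CqlTemplate(session);
    }
}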

How can I bulk update my mongo data to add a fixed level?

I recently changed a POJO from having all of its properties individually typed to keeping them in a free-form JSONObject field called content.
The problem is that all old documents map to the old POJO version, so they are stored like this:
{"_id":"ObjectId(value)","field1":"value1","field2":"value2"}
Can I update all documents with a single mongo command that wraps everything except the id, so the result would be something like this:
{"_id":"ObjectId(value)","content":{"field1":"value1","field2":"value2"}}
?
Or should I write a simple program that does it one by one (i.e. iterating over all documents and manually adding the new content level)?
Unfortunately, there are no MongoDB commands that will allow you to restructure a document in this way. You'll need to write a program to fetch all of your documents one by one, update the structure, and then send the updated structure back to MongoDB.
Often the best way to do this is to write the modified documents to a new collection, and then drop the old collection when you're done.
I solved it by creating a .js file to execute via the mongo shell.
mongo myDb fixresults.js
The file is as follows:
for (var c = db.results.find(); c.hasNext(); ) {
    var full = c.next();
    var anon = db.results.findOne({"_id": full._id}, {"_id": 0});
    var n = {"_id": full._id, "content": anon};
    db.results.temp.insert(n);
}
This will insert the transformed value into the .temp collection, which you can rename later to replace the original.
