Where clause in Phoenix integration with Spark - java

I am trying to read some data from Phoenix into Spark using its Spark integration (phoenix-spark):
String connectionString="jdbc:phoenix:auper01-01-20-01-0.prod.vroc.com.au,auper01-02-10-01-0.prod.vroc.com.au,auper01-02-10-02-0.prod.vroc.com.au:2181:/hbase-unsecure";
Map<String, String> options2 = new HashMap<String, String>();
options2.put("driver", "org.apache.phoenix.jdbc.PhoenixDriver");
//options2.put("dbtable", url);
options2.put("table", "VROC_SENSORDATA_3");
options2.put("zkUrl", connectionString);
DataFrame phoenixFrame2 = this.hc.read().format("org.apache.phoenix.spark")
.options(options2)
.load();
System.out.println("The phoenix table is:");
phoenixFrame2.printSchema();
phoenixFrame2.show(20, false);
But I need to do a SELECT with a WHERE clause. I also tried the dbtable option, which is used for a JDBC connection in Spark, but it doesn't seem to have any effect.
Based on the documentation
"In contrast, the phoenix-spark integration is able to leverage the underlying splits provided by Phoenix in order to retrieve and save data across multiple workers. All that’s required is a database URL and a table name. Optional SELECT columns can be given, as well as pushdown predicates for efficient filtering."
But there seems to be no way to push the filtering down to Phoenix: reading the whole table into a Spark DataFrame and then filtering it would be really inefficient, yet I can't find a way to apply a WHERE clause. Does anyone know how to apply a WHERE clause in the code above?
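For reference, given the pushdown claim in the documentation quoted above, I would expect something along these lines to work (SENSOR_ID and EVENT_TIME are made-up column names, purely for illustration):
DataFrame filtered = this.hc.read().format("org.apache.phoenix.spark")
        .options(options2)
        .load()
        // filter() conditions should be pushed down to Phoenix as a WHERE clause
        .filter("SENSOR_ID = 42 AND EVENT_TIME > '2016-01-01'");
filtered.show(20, false);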

Related

Group by multiple fields using mongodb aggregate builders in java application

I am fetching data from MongoDB and performing some operations using the aggregation builders in my Java application.
I was able to group by single field using the below piece of code.
Bson group = group("$city", sum("totalPop", "$pop"));
Bson project = project(fields(excludeId(), include("totalPop"), computed("city", "$_id")));
List<Document> results = zips.aggregate(Arrays.asList(group, project)).into(new ArrayList<>());
Now I need to group by multiple fields, say city and location.
Can someone help with this?
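One common way to do this (a sketch, assuming your documents in the same zips collection also have a "loc" field) is to group on a compound _id document and then pull the keys back out in the projection:
import static com.mongodb.client.model.Accumulators.sum;
import static com.mongodb.client.model.Aggregates.group;
import static com.mongodb.client.model.Aggregates.project;
import static com.mongodb.client.model.Projections.computed;
import static com.mongodb.client.model.Projections.excludeId;
import static com.mongodb.client.model.Projections.fields;
import static com.mongodb.client.model.Projections.include;
import org.bson.Document;
import org.bson.conversions.Bson;

// Group on both fields by using a compound _id document as the group key.
Bson group = group(new Document("city", "$city").append("loc", "$loc"),
        sum("totalPop", "$pop"));
// Re-expose the grouped keys from _id in the projection.
Bson project = project(fields(excludeId(),
        include("totalPop"),
        computed("city", "$_id.city"),
        computed("loc", "$_id.loc")));
List<Document> results = zips.aggregate(Arrays.asList(group, project)).into(new ArrayList<>());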

Spark and Cassandra: requirement failed: Columns not found in class com.datastax.spark.connector.japi.CassandraRow: [mycolumn...]

I have a CassandraRow object that contains the values of a row. I read it from one table, and I want to write that same object to another table. But then I get this error:
requirement failed: Columns not found in class com.datastax.spark.connector.japi.CassandraRow: [myColumn1, myColumns2, ...]
I tried to pass my own mapping by creating a Map and passing it to the function. This is my code:
CassandraRow row = fetch();
Map<String, String> mapping = Map.of("myColumn1", "myColumn1", "myColumns2", "myColumns2"....);
JavaSparkContext ctx = new JavaSparkContext(conf);
JavaRDD<CassandraRow> insightRDD = ctx.parallelize(List.of(row));
CassandraJavaUtil.javaFunctions(insightRDD).writerBuilder("mykeyspace", "mytable",
CassandraJavaUtil.mapToRow(CassandraRow.class, mapping)).saveToCassandra(); //I also tried without mapping
Any help is appreciated. I have tried the POJO approach and it works, but I don't want to be restricted to creating POJOs; I want a generic approach that would work with any table and any row.
I could not find a way to generalize my solution using Apache Spark, so I used the DataStax Java Driver for Apache Cassandra and wrote CQL queries instead. That was generic enough for me.
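For illustration, a rough sketch of that driver-based approach (using the 4.x driver API; the keyspace, table names, and WHERE clause are placeholders), copying one row generically by walking its column definitions:
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ColumnDefinition;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.Row;
import java.util.ArrayList;
import java.util.List;

try (CqlSession session = CqlSession.builder().build()) {
    // Fetch the source row (placeholder keyspace/table/primary key).
    Row source = session.execute("SELECT * FROM mykeyspace.mytable WHERE id = 1").one();

    // Build the column list and bind markers from the row's own metadata.
    StringBuilder cols = new StringBuilder();
    StringBuilder marks = new StringBuilder();
    List<Object> values = new ArrayList<>();
    int i = 0;
    for (ColumnDefinition def : source.getColumnDefinitions()) {
        if (cols.length() > 0) {
            cols.append(", ");
            marks.append(", ");
        }
        cols.append(def.getName().asCql(true));
        marks.append("?");
        values.add(source.getObject(i++));
    }

    // Insert the same values into the target table (placeholder name).
    PreparedStatement ps = session.prepare(
            "INSERT INTO mykeyspace.othertable (" + cols + ") VALUES (" + marks + ")");
    session.execute(ps.bind(values.toArray()));
}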

Access a database using Apache Spark SQL

Hi, I'm new to Apache Spark with Java.
Is this the correct way to do it?
The code works, but it is very slow performance-wise, and I don't know what the best approach is for accessing the data in each loop iteration.
Dataset<Row> javaRDD = sparkSession.read().jdbc(dataBase_url, "sample", properties);
javaRDD.toDF().registerTempTable("sample");
Dataset<Row> Users = sparkSession.sql("SELECT DISTINCT FROM_USER FROM sample ");
List<Row> members = Users.collectAsList();
for (Row row : members) {
    Dataset<Row> userConversation = sparkSession.sql("SELECT DESCRIPTION FROM sample WHERE FROM_USER ='" + row.getDecimal(0) + "'");
    userConversation.show();
}
Try creating a set with all users and then executing a single query like
sparkSession.sql("SELECT DESCRIPTION FROM sample WHERE FROM_USER IN (" + usersSet + ")");
This way you execute only one query, so you pay the database-connection overhead only once.
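A rough sketch of that idea, building the IN list from the distinct users (this assumes FROM_USER is numeric, as the getDecimal(0) call above suggests):
Dataset<Row> users = sparkSession.sql("SELECT DISTINCT FROM_USER FROM sample");
// Build a comma-separated IN list from the collected users.
String inList = users.collectAsList().stream()
        .map(r -> r.getDecimal(0).toPlainString())
        .collect(java.util.stream.Collectors.joining(", "));
// One query instead of one query per user.
Dataset<Row> conversations = sparkSession.sql(
        "SELECT FROM_USER, DESCRIPTION FROM sample WHERE FROM_USER IN (" + inList + ")");
conversations.show();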
Another approach, if you run Spark over HDFS and this is a one-time query, is to use a tool like Sqoop to load the SQL table into Hadoop and use the data natively in Spark.

Spark read() works but sql() throws Database not found

I'm using Spark 2.1 to read data from Cassandra in Java.
I tried the code posted in https://stackoverflow.com/a/39890996/1151472 (with SparkSession) and it worked. However, when I replaced the spark.read() method with the spark.sql() one, the following exception was thrown:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: `wiki`.`treated_article`; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation `wiki`.`treated_article`
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
I'm using the same Spark configuration for both the read and sql methods.
read() code:
Dataset dataset = spark.read().format("org.apache.spark.sql.cassandra")
        .options(new HashMap<String, String>() {
            {
                put("keyspace", "wiki");
                put("table", "treated_article");
            }
        }).load();
sql() code:
spark.sql("SELECT * FROM WIKI.TREATED_ARTICLE");
Spark SQL uses a catalogue to look up database and table references. When you reference a table identifier that isn't in the catalogue, it will throw errors like the one you posted. The read command doesn't require the catalogue, since you specify all of the relevant information in the invocation.
You can add entries to the catalogue either by
Registering DataSets as Views
First create your DataSet
Dataset<Row> dataset = spark.read().format("org.apache.spark.sql.cassandra")
        .options(new HashMap<String, String>() {
            {
                put("keyspace", "wiki");
                put("table", "treated_article");
            }
        }).load();
Then use one of the catalogue registration functions:
void createGlobalTempView(String viewName)
Creates a global temporary view using the given name.
void createOrReplaceTempView(String viewName)
Creates a local temporary view using the given name.
void createTempView(String viewName)
Creates a local temporary view using the given name.
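For example, with the dataset created above (the view name is arbitrary):
// Register the DataFrame under a name the SQL catalogue can resolve.
dataset.createOrReplaceTempView("treated_article");

// Now the sql() call can find it.
spark.sql("SELECT * FROM treated_article").show();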
OR Using a SQL Create Statement
CREATE TEMPORARY VIEW words
USING org.apache.spark.sql.cassandra
OPTIONS (
table "words",
keyspace "test",
cluster "Test Cluster",
pushdown "true"
)
Once added to the catalogue by either of these methods you can reference the table in all sql calls issued by that context.
Example
CREATE TEMPORARY VIEW words
USING org.apache.spark.sql.cassandra
OPTIONS (
table "words",
keyspace "test"
);
SELECT * FROM words;
// Hello 1
// World 2
The DataStax (my employer) Enterprise software automatically registers all Cassandra tables by placing entries in the Hive metastore that Spark uses as a catalogue. This makes all tables accessible without manual registration.
This method allows select statements to be used without an accompanying CREATE VIEW.
I cannot think of a way to make that work off the top of my head. The problem is that Spark doesn't know which format to try, and the place where this would be specified is already taken by the keyspace. The closest documentation I can find for something like this is in the DataFrames section of the Cassandra connector documentation. You can try to specify a USING clause, but I don't think that will work inside a SELECT. So your best bet beyond that is to create a PR to handle this case, or stick with the read DSL.

Produce HFiles for multiple tables to bulk load in a single MapReduce

I am using MapReduce and HFileOutputFormat to produce HFiles and bulk load them directly into an HBase table.
Now, while reading the input files, I want to produce HFiles for two tables and bulk load both outputs in a single MapReduce job.
I searched the web and saw some links about MultiHFileOutputFormat, but couldn't find a real solution.
Do you think that it is possible?
My approach is:
Keep using HFileOutputFormat for table1; when the job completes, call doBulkLoad to write into table1.
In the mapper, keep a List of puts and a global MAX_PUTS value.
Whenever puts.size() > MAX_PUTS, do:
String tableName = conf.get("hbase.table.name.dic", table2);
HTable table = new HTable(conf, tableName);
table.setAutoFlushTo(false);
table.setWriteBufferSize(1024*1024*64);
table.put(puts);
table.close();
puts.clear();
Note: you must also have a cleanup() function to write the remaining puts.
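A minimal sketch of that cleanup, inside the same mapper (assuming the buffer is kept in a field named puts, and using the same HTable API as above):
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    if (!puts.isEmpty()) {
        // Flush whatever is left in the buffer when the mapper finishes.
        Configuration conf = context.getConfiguration();
        String tableName = conf.get("hbase.table.name.dic", "table2");
        HTable table = new HTable(conf, tableName);
        table.setAutoFlushTo(false);
        table.setWriteBufferSize(1024 * 1024 * 64);
        table.put(puts);
        table.flushCommits();
        table.close();
        puts.clear();
    }
}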
