Spark read() works but sql() throws Database not found - java

I'm using Spark 2.1 to read data from Cassandra in Java.
I tried the code posted in https://stackoverflow.com/a/39890996/1151472 (with SparkSession) and it worked. However, when I replaced the spark.read() method with spark.sql(), the following exception is thrown:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: `wiki`.`treated_article`; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation `wiki`.`treated_article`
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
I'm using the same Spark configuration for both the read() and sql() methods.
read() code:
Dataset dataset =
    spark.read().format("org.apache.spark.sql.cassandra")
        .options(new HashMap<String, String>() {
            {
                put("keyspace", "wiki");
                put("table", "treated_article");
            }
        }).load();
sql() code:
spark.sql("SELECT * FROM WIKI.TREATED_ARTICLE");

Spark SQL uses a catalogue to look up database and table references. When you use a table identifier that isn't in the catalogue, it will throw errors like the one you posted. The read command doesn't require a catalogue, since you are required to specify all of the relevant information in the invocation.
You can add entries to the catalogue either by
Registering DataSets as Views
First create your DataSet
Dataset<Row> dataset = spark.read().format("org.apache.spark.sql.cassandra")
    .options(new HashMap<String, String>() {
        {
            put("keyspace", "wiki");
            put("table", "treated_article");
        }
    }).load();
Then use one of the catalogue registry functions
void createGlobalTempView(String viewName)
Creates a global temporary view using the given name.
void createOrReplaceTempView(String viewName)
Creates a local temporary view using the given name.
void createTempView(String viewName)
Creates a local temporary view using the given name
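For example, a minimal sketch, assuming the Dataset created above was assigned to the variable named dataset:
// Register the Dataset under a name the catalogue knows about...
dataset.createOrReplaceTempView("treated_article");

// ...after which sql() can resolve the identifier:
spark.sql("SELECT * FROM treated_article").show();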
OR Using a SQL Create Statement
CREATE TEMPORARY VIEW words
USING org.apache.spark.sql.cassandra
OPTIONS (
table "words",
keyspace "test",
cluster "Test Cluster",
pushdown "true"
)
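The same registration can also be issued from Java through the SparkSession; a sketch reusing the keyspace and table names from the question:
// Register the Cassandra table as a temporary view via SQL DDL, then query it.
spark.sql("CREATE TEMPORARY VIEW treated_article "
    + "USING org.apache.spark.sql.cassandra "
    + "OPTIONS (table \"treated_article\", keyspace \"wiki\")");

spark.sql("SELECT * FROM treated_article").show();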
Once added to the catalogue by either of these methods you can reference the table in all sql calls issued by that context.
Example
CREATE TEMPORARY VIEW words
USING org.apache.spark.sql.cassandra
OPTIONS (
table "words",
keyspace "test"
);
SELECT * FROM words;
// Hello 1
// World 2
The DataStax (my employer) Enterprise software automatically registers all Cassandra tables by placing entries in the Hive metastore used by Spark as a catalogue. This makes all tables accessible without manual registration.
This method allows SELECT statements to be used without an accompanying CREATE VIEW.

I cannot think of a way to make that work off the top of my head. The problem lies in that Spark doesn't know which format to try, and the place where that would be specified is taken by the keyspace. The closest documentation for something like this that I can find is here, in the DataFrames section of the Cassandra connector documentation. You can try to specify a USING statement, but I don't think that will work inside a SELECT. So your best bet beyond that is to create a PR to handle this case, or stick with the read DSL.

Related

How to change dataflow job graph during runtime with arguments?

I am using Dataflow to read data from a JDBC table and load the results into a BigQuery table. There is one parameter, "flag", that I want to pass at runtime; if the flag is set to True, the results should also be loaded into an additional table in BigQuery.
To summarise:
If the flag is set False - Read table A from JDBC, write to table A in BigQuery
If the flag is set True - Read table A from JDBC, write to table A as well as B in BigQuery.
Please refer to the sample code of my pipeline:
public static void main(String[] args) {
    MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
    Pipeline pipeline = Pipeline.create(options);

    ValueProvider<String> gcsFlag = options.getGcsFlag();

    PCollection<TableRow> inputData = pipeline.apply("Reading JDBC Table",
        JdbcIO.<TableRow>read()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration
                .create(options.getDriverClassName(), options.getJdbcUrl())
                .withUsername(options.getUsername()).withPassword(options.getPassword()))
            .withQuery(options.getSqlQuery())
            .withCoder(TableRowJsonCoder.of())
            .withRowMapper(new CustomRowMapper()));

    inputData.apply(
        "Write to BigQuery Table 1",
        BigQueryIO.writeTableRows()
            .withoutValidation()
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
            .withCustomGcsTempLocation(options.getBigQueryLoadingTemporaryDirectory())
            .to(options.getOutputTable()));

    if (gcsFlag.get().equals("TRUE")) {
        inputData.apply(
            "Write to BigQuery Table 2",
            BigQueryIO.writeTableRows()
                .withoutValidation()
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
                .withCustomGcsTempLocation(options.getBigQueryLoadingTemporaryDirectory())
                .to(options.getOutputTable2()));
    }

    pipeline.run();
}
The challenge I am facing is that I have to provide the ValueProvider when compiling and creating the Dataflow template. The job graph is constructed at template-creation time only, so I am not able to reuse the same template for the other case.
Is there a way to pass the ValueProvider<String> flag at runtime so that the job graph is constructed at runtime? That way I could reuse the same template for both cases. Similarly, I want to provide the sqlQuery (options.getSqlQuery()) at runtime, so that I can use the same template for all the tables I want to read from the source.
Any help is appreciated.
Once you create the DAG, it can't change at runtime.
But you still have a chance to solve your problem: try the Beam Partition pattern (a sketch follows below).
https://beam.apache.org/documentation/transforms/java/elementwise/partition/
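A minimal sketch of that idea (not from the original answer; it reuses the gcsFlag ValueProvider and the BigQueryIO write from the question, and assumes the flag value can be read on the workers at runtime):
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollectionList;

// Route every row into branch 1 when the runtime flag is TRUE, otherwise into branch 0.
// Because gcsFlag.get() is only called inside the partition function (executed on the
// workers), the decision happens at runtime, so the same template serves both cases.
PCollectionList<TableRow> branches = inputData.apply("Partition by flag",
    Partition.of(2, new Partition.PartitionFn<TableRow>() {
        @Override
        public int partitionFor(TableRow row, int numPartitions) {
            return "TRUE".equals(gcsFlag.get()) ? 1 : 0;
        }
    }));

// Branch 1 is empty when the flag is FALSE, so this second write is effectively a no-op then.
branches.get(1).apply("Write to BigQuery Table 2",
    BigQueryIO.writeTableRows()
        .withoutValidation()
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
        .withCustomGcsTempLocation(options.getBigQueryLoadingTemporaryDirectory())
        .to(options.getOutputTable2()));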

Faster way of updating database table using Hibernate (Java 8 reduction?)

I am working on a monitoring tool developed in Spring Boot using Hibernate as ORM.
I need to compare each row in my table (already persisted rows of sent messages) and see whether a MailId (unique) has received feedback (status: OPENED, BOUNCED, DELIVERED, ...) or not.
I get the feedback by reading CSV files from a network folder. Parsing and reading the files goes very fast, but updating my database is very slow. My algorithm is not very efficient because I loop through a list that can contain hundreds of thousands of objects and query my table for each one.
This is the method that makes the update in my table by updating the "target" object (a row in the database table):
@Override
public void updateTargetObjectFoo() throws CSVProcessingException, FileNotFoundException {
    // Here I call the performProcessing method, which reads the files in a folder, parses
    // them into Java objects, and maps them into a feedBackList of type Foo
    List<Foo> feedBackList = performProcessing(env.getProperty("foo_in"), EXPECTED_HEADER_FIELDS_STATUS, Foo.class, ".LETTERS.STATUS.");
    for (Foo foo : feedBackList) {
        // findByKey does a simple SELECT in MySQL where MailId = foo.getMailId()
        Foo persistedFoo = fooDao.findByKey(foo.getMailId());
        if (persistedFoo != null) {
            persistedFoo.setStatus(foo.getStatus());
            persistedFoo.setDnsCode(foo.getDnsCode());
            persistedFoo.setReturnDate(foo.getReturnDate());
            persistedFoo.setReturnTime(foo.getReturnTime());
            // The save here does a MySQL UPDATE on the table
            fooDao.saveAccount(foo);
        }
    }
}
What if I did this selection/comparison and update on the Java side, and then wrote the whole updated list back to the database?
Would that be faster?
Thanks to all for your help.
Hibernate is not particularly well-suited for batch processing.
You may be better off using Spring's JdbcTemplate to do jdbc batch processing.
However, if you must do this via Hibernate, this may help: https://docs.jboss.org/hibernate/orm/5.2/userguide/html_single/chapters/batch/Batching.html
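If you go the JdbcTemplate route, a minimal sketch could look like this (the table and column names, and the types of the return date/time fields, are assumptions, not taken from the question; jdbcTemplate is assumed to be a configured Spring bean):
// Push all feedback rows to the database in JDBC batches instead of one
// SELECT + UPDATE round trip per Foo. Table/column names are hypothetical.
String sql = "UPDATE foo SET status = ?, dns_code = ?, return_date = ?, return_time = ? "
           + "WHERE mail_id = ?";

jdbcTemplate.batchUpdate(sql, feedBackList, 1000, (ps, foo) -> {
    ps.setString(1, foo.getStatus());
    ps.setString(2, foo.getDnsCode());
    ps.setObject(3, foo.getReturnDate());   // assumed to map to a SQL DATE column
    ps.setObject(4, foo.getReturnTime());   // assumed to map to a SQL TIME column
    ps.setString(5, foo.getMailId());
});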

Cassandra Java request

I am trying to execute a simple request in Java against a Cassandra database, but I don't understand why my example doesn't work:
public class Test {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoints("127.0.0.1")
                .build();
        Session session = cluster.connect("test");

        ResultSet results = session.execute("SELECT * FROM datatest");
        for (Row row : results) {
            System.out.println(row.getString("name"));
        }
    }
}
When I execute my code I get this error:
[main] INFO com.datastax.driver.core.policies.DCAwareRoundRobinPolicy - Using data-center name 'datacenter1' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor)
[main] INFO com.datastax.driver.core.Cluster - New Cassandra host /127.0.0.1:9042 added
Exception in thread "main" java.lang.IllegalArgumentException: name is not a column defined in this metadata
at com.datastax.driver.core.ColumnDefinitions.getAllIdx(ColumnDefinitions.java:273)
at com.datastax.driver.core.ColumnDefinitions.getFirstIdx(ColumnDefinitions.java:279)
at com.datastax.driver.core.ArrayBackedRow.getIndexOf(ArrayBackedRow.java:69)
at com.datastax.driver.core.AbstractGettableData.getString(AbstractGettableData.java:137)
at fakemillions.Test.main(Test.java:23)
This is my database :
My database screenshot
It sounds like you haven't defined a column with the name "name" in the table datatest. I haven't used the DataStax driver before, but as the documentation says, the IllegalArgumentException is thrown if
name is not part of the ResultSet this row is part of, i.e. if !this.columns().names().contains(name)
It'd be better if you provided how you created the datatest table.
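As a quick check (a hypothetical diagnostic, using the driver's standard ColumnDefinitions API), you can print the column names the driver actually sees for that table:
// Print every column name and type returned by the query, to see
// what the row actually contains before calling row.getString("name").
ResultSet results = session.execute("SELECT * FROM datatest");
for (ColumnDefinitions.Definition def : results.getColumnDefinitions()) {
    System.out.println(def.getName() + " : " + def.getType());
}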
When you create a table using Thrift that contains dynamic columns, and therefore give no explicit column definitions, CQL by default assigns the following names: key, column1, value. In your case, the column1 column will contain the value "name", the value column will contain "TiTi", etc.
Think about it like this: the data is really stored as key-value pairs. CQL can't infer any column names from these key-value pairs, since there may be billions of different keys (column names) in the same row.
Here is a good guide to migrate thrift tables to cql: http://www.datastax.com/dev/blog/thrift-to-cql3
Try this, it worked in my case:
Cluster.Builder builder = Cluster.builder();
Cluster cluster = builder
        .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy("us-east")))
        .addContactPoints("127.0.0.1")
        .build();

Querying the appropriate database schema

This is a follow-on question to my earlier question about specifying multiple schemata in java using jooq to interact with H2.
My test H2 DB currently has 2 schemata, PUBLIC and INFORMATION_SCHEMA. PUBLIC is specified as the default schema by H2. When running a query that should extract information from e.g. INFORMATION_SCHEMA.TABLES, the query fails with a "table unknown" SQL error. I am only able to execute such queries by calling factory.use(INFORMATION_SCHEMA). There are no build errors etc., and Eclipse properly autocompletes e.g. TABLES.TABLE_NAME.
If I don't do this, jOOQ doesn't seem to prepend the appropriate schema, even though I create the correct Factory object for the schema, e.g.
InformationSchemaFactory info = new InformationSchemaFactory(conn);
I read about mapping but am a bit confused as to which schema I would use as the input/output.
By default, the InformationSchemaFactory assumes that the supplied connection is actually connected to the INFORMATION_SCHEMA. That's why schema names are not rendered in SQL. Example:
// This query...
new InformationSchemaFactory(conn).selectFrom(INFORMATION_SCHEMA.TABLES).fetch();
// ... renders this SQL (with the asterisk expanded):
SELECT * FROM "TABLES";
The above behaviour should be documented in your generated InformationSchemaFactory Javadoc. In order to prepend "TABLES" with "INFORMATION_SCHEMA", you have several options.
Use a regular factory instead, which is not tied to any schema:
// This query...
new Factory(H2, conn).selectFrom(INFORMATION_SCHEMA.TABLES).fetch();
// ... renders this SQL:
SELECT * FROM "INFORMATION_SCHEMA"."TABLES";
Use another schema's factory, such as the generated PublicFactory:
// This query...
new PublicFactory(conn).selectFrom(INFORMATION_SCHEMA.TABLES).fetch();
// ... renders this SQL:
SELECT * FROM "INFORMATION_SCHEMA"."TABLES";
Use Settings and an appropriate schema mapping to force the schema name to be rendered.
The first option is probably the easiest one.
This blog post here will give you some insight about how to log executed queries to your preferred logger output: http://blog.jooq.org/2011/10/20/debug-logging-sql-with-jooq/

Mysql Copy Database From Sql Statement

I am attempting to create a test database (based on my production DB) at runtime. Rather than having to maintain an exact duplicate test DB, I'd like to copy the entire data structure of my production DB at runtime and then, when I close the test database, drop it entirely.
I assume I will be using statements such as:
CREATE DATABASE test  -- to create the test db
CREATE TABLE test.sampleTable LIKE production.sampleTable  -- to create each table
And when I am finished with the test db, calling a close method will run something like:
DROP DATABASE test  -- delete the database and all its tables
But how do I go about automatically finding all the tables within the production database without having to write them out manually? The idea is that I can modify my production DB without having to worry about keeping the test DB's structure identical.
Would a stored procedure be necessary in this case? Some sample code on how to achieve something like this would be appreciated.
If the database driver you are using supports it, you can use DatabaseMetaData#getTables to get the list of tables for a schema. You can get access to DatabaseMetaData from Connection#getMetaData.
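A minimal sketch of that approach (assuming an open java.sql.Connection named conn; with MySQL Connector/J the database maps to the JDBC catalog):
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;

// List every base table in the "production" database via JDBC metadata.
DatabaseMetaData meta = conn.getMetaData();
try (ResultSet tables = meta.getTables("production", null, "%", new String[] {"TABLE"})) {
    while (tables.next()) {
        String tableName = tables.getString("TABLE_NAME");
        // e.g. issue "CREATE TABLE test." + tableName + " LIKE production." + tableName here
        System.out.println(tableName);
    }
}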
In your scripting language, you call "SHOW TABLES" on the database you want to copy. Reading that result set a row at a time, your program puts the name of the table into a variable (let's call it $tablename) and can generate the sql: "CREATE TABLE test.$tablename LIKE production.$tablename". Iterate through the result set and you're done.
(You won't get foreign key constraints that way, but maybe you don't need those. If you do, you can run "SHOW CREATE TABLE $tablename" and parse the results to pick out the constraints.)
I don't have a code snippet for java, but here is one for perl that you could treat as pseudo-code:
$ref = $dbh->selectall_arrayref("SHOW TABLES");
unless (defined($ref)) {
    print "Nothing found\n";
} else {
    foreach my $row_ref (@{$ref}) {
        push(@tables, $row_ref->[0]);
    }
}
The foreach statement iterates over the result set in an array reference returned by the database interface library. The push statement puts the first element of the current row of the result set into an array variable @tables. You'd be using the database library appropriate for your language of choice.
I would use mysqldump: http://dev.mysql.com/doc/refman/5.1/en/mysqldump.html
It will produce a file containing all the SQL commands needed to replicate the prod database (pass --no-data if you only want the table structure).
The solution was as follows:
private static final String SQL_CREATE_TEST_DB = "CREATE DATABASE test";
private static final String SQL_PROD_TABLES = "SHOW TABLES IN production";

JdbcTemplate jdbcTemplate = new JdbcTemplate(dataSource);
jdbcTemplate.execute(SQL_CREATE_TEST_DB);

SqlRowSet result = jdbcTemplate.queryForRowSet(SQL_PROD_TABLES);
while (result.next()) {
    // Retrieves the table name from column 1
    String tableName = result.getString(result.getMetaData().getColumnName(1));
    // Create a new table in test based on the production structure
    jdbcTemplate.execute("CREATE TABLE test." + tableName + " LIKE production." + tableName);
}
This is using Spring to simplify the database connection etc, but the real magic is in the SQL statements. As mentioned by D Mac, this will not copy foreign key constraints, but that can be achieved by running another SQL statement and parsing the results.
