How to check equality of spark columns after renaming

I am trying to write some tests for a Java Spark SQL application. One operation I need to test renames a column, and I ran into some difficulty comparing the actual value of the renamed column with my expected value. After some experimentation, I was able to write the following two tests to demonstrate the problem:
First, as a sanity check, I tried this (df is a Spark SQL DataFrame, generated by reading some sample data from a JSON file I'm testing against):
@Test
public void testColumnEquality() throws Exception {
Column val1 = df.col("col2");
Column val2 = df.col("col2");
Assert.assertEquals(val1, val2);
}
Which passes, as one would expect. Then I tried this:
@Test
public void testRenameColumnEquality() throws Exception {
Column val1 = df.col("col2").as("col2");
Column val2 = df.col("col2").as("col2");
Assert.assertEquals(val1, val2);
}
which fails with the error java.lang.AssertionError: expected:<col2 AS col2#4L> but was:<col2 AS col2#5L>
Digging around in the Scala code (full disclosure: I know very little Scala), it looks like this has to do with the NamedExpression unique id.
Is there any way to sensibly check that these two columns represent the same operations with the same alias?
(I'm working in Spark 1.6, and would ideally like a solution for that version line, but if this is fixed in 2.0 that would also be good information.)
Thank you.

I wrote a blog post about how to resolve this:
The trick is: check whether the Expression has the Alias trait:
`column.expr() instanceof Alias`
If it does, unpack the child expression and the name using the Extractor pattern:
Alias alias = (Alias) column.expr();
Option<Tuple2<Expression, String>> aliasTuple = Alias$.MODULE$.unapply(alias);

I did some digging and it looks like the information about the child of a Column with an alias is lost in the process of instantiating the new Column. Maybe there is a state to query somewhere, but I didn't find it.
So it's not an answer, but hopefully it is useful or of interest to somebody.
More info
The definition of the as method on a Column object refers to the name function (see Column.scala), which just calls the Alias case class. The Alias (with its child) is not exposed. It is passed directly to the Column class's withExpr function, which instantiates a new Column based on the Alias named expression.
So you either compare the results of toString on the columns directly (losing the information about where the column comes from, i.e. which DataFrame), or you actually parse the string printed by the explain(true) method, but that doesn't seem sensible to me.
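As a rough sketch of the toString comparison mentioned above: the helper below strips the expression-id suffix (the #4L and #5L parts in the error message from the question) before comparing. The class and method names are hypothetical, and the suffix format is assumed only from that error message, so a column name that itself contains # followed by digits would be mangled. It works on plain strings and does not touch any Spark API.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spark: compares Column.toString() output
// while ignoring the per-instance expression ids.
final class ColumnStrings {
    // Matches the id suffix seen in the question's error message, e.g. "#4L".
    private static final Pattern EXPR_ID = Pattern.compile("#\\d+L?");

    private ColumnStrings() {}

    /** Removes expression ids, so "col2 AS col2#4L" becomes "col2 AS col2". */
    static String stripExprIds(String columnToString) {
        return EXPR_ID.matcher(columnToString).replaceAll("");
    }

    /** True if both strings are equal once expression ids are removed. */
    static boolean sameIgnoringExprIds(String left, String right) {
        return stripExprIds(left).equals(stripExprIds(right));
    }

    public static void main(String[] args) {
        // Strings copied from the AssertionError in the question.
        System.out.println(sameIgnoringExprIds("col2 AS col2#4L", "col2 AS col2#5L")); // true
        // A different child expression is still detected.
        System.out.println(sameIgnoringExprIds("col2 AS col2#4L", "col3 AS col2#5L")); // false
    }
}
```

In a test this could be used as Assert.assertEquals(ColumnStrings.stripExprIds(val1.toString()), ColumnStrings.stripExprIds(val2.toString())), with the caveat already noted: the information about which DataFrame a column belongs to is lost.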

Related

How to use DSL.coalesce with lists of fields?

Using Jooq, I am trying to fetch from a table by id first, if no matches found, then fetch by handle again.
And I want all fields of the returned rows, not just one.
Field<?> firstMatch = DSL.select(Tables.MY_TABLE.fields())
.from(Tables.MY_TABLE)
.where(Tables.MY_TABLE.ID.eq(id))
.asField(); // This is wrong, because it supports only one field, but above we selected Tables.MY_TABLE.fields(), which is plural.
Field<?> secondMatch = DSL.select(Tables.MY_TABLE.fields())
.from(Tables.MY_TABLE)
.where(Tables.MY_TABLE.HANDLE.eq(handle))
.asField(); // Same as above.
dslContext.select(DSL.coalesce(firstMatch, secondMatch))
.fetchInto(MyClass.class);
Due to the mistake mentioned above in the code, the following error occurs:
Can only use single-column ResultProviderQuery as a field
I am wondering how to make firstMatch and secondMatch two lists of fields, instead of two fields?
I tried
Field<?>[] secondMatch = DSL.select(Tables.MY_TABLE.fields())
.from(Tables.MY_TABLE)
.where(Tables.MY_TABLE.HANDLE.eq(handle))
.fields();
but the following error occurred in the line containing DSL.coalesce
Type interface org.jooq.Field is not supported in dialect DEFAULT
Thanks in advance!

This sounds much more like something you'd do with a simple OR?
dslContext.selectFrom(MY_TABLE)
.where(MY_TABLE.ID.eq(id))
// The ne(id) part might not be required...
.or(MY_TABLE.ID.ne(id).and(MY_TABLE.HANDLE.eq(handle)))
.fetchInto(MyClass.class);
If the two result sets should be completely exclusive, then you can do this:
dslContext.selectFrom(MY_TABLE)
.where(MY_TABLE.ID.eq(id))
.or(MY_TABLE.HANDLE.eq(handle).and(notExists(
selectFrom(MY_TABLE).where(MY_TABLE.ID.eq(id))
)))
.fetchInto(MyClass.class);
If on your database product, a query using OR doesn't perform well, you can write an equivalent query with UNION ALL, which might perform better.
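To make the intent of the exclusive variant above easier to follow, here is a small in-memory sketch of the same predicate in plain Java. It does not use jOOQ or a database; the Row class and the method name are hypothetical stand-ins for MY_TABLE and the query. A row is returned if its id matches, or if its handle matches and no row with that id exists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// In-memory illustration only; Row is a hypothetical stand-in for a MY_TABLE row.
final class FallbackLookup {
    static final class Row {
        final long id;
        final String handle;

        Row(long id, String handle) {
            this.id = id;
            this.handle = handle;
        }
    }

    /** Rows matching the id; only if there are none, rows matching the handle. */
    static List<Row> byIdElseHandle(List<Row> table, long id, String handle) {
        // Plays the role of the EXISTS subquery.
        boolean idExists = false;
        for (Row r : table) {
            if (r.id == id) {
                idExists = true;
            }
        }
        List<Row> result = new ArrayList<>();
        for (Row r : table) {
            // Same shape as: ID = id OR (HANDLE = handle AND NOT EXISTS (...))
            if (r.id == id || (handle.equals(r.handle) && !idExists)) {
                result.add(r);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> table = Arrays.asList(new Row(1, "alice"), new Row(2, "bob"));
        System.out.println(byIdElseHandle(table, 1, "bob").size());    // 1: id matched, handle ignored
        System.out.println(byIdElseHandle(table, 9, "bob").get(0).id); // 2: fell back to the handle
    }
}
```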

Less repetition in jOOQ query

Any idea on how I could define the following jOOQ query with less repetition?
I am using jOOQ 3.11.4.
db.insertInto(ACCOUNT,
ACCOUNT.ACCOUNT_ID,
ACCOUNT.EMAIL,
ACCOUNT.FIRST_NAME,
ACCOUNT.LAST_NAME,
ACCOUNT.IS_ADMIN,
ACCOUNT.PASSWORD)
.values(account.accountId,
account.email,
account.firstName,
account.lastName,
account.isAdmin,
account.password)
.onConflict(ACCOUNT.ACCOUNT_ID)
.doUpdate()
.set(ACCOUNT.EMAIL, account.email)
.set(ACCOUNT.FIRST_NAME, account.firstName)
.set(ACCOUNT.LAST_NAME, account.lastName)
.set(ACCOUNT.IS_ADMIN, account.isAdmin)
.set(ACCOUNT.PASSWORD, account.password)
.returning(
ACCOUNT.ACCOUNT_ID,
ACCOUNT.EMAIL,
ACCOUNT.FIRST_NAME,
ACCOUNT.LAST_NAME,
ACCOUNT.IS_ADMIN,
ACCOUNT.PASSWORD
)
.fetchOne()
(It turns out my question is mostly code, and StackOverflow does not let me post it as is, without adding more details, which I do not think is necessary for my question, but nevertheless, they want me to post some more text, which I am doing right now by typing this message, and I hope you did not have to read to the end.)

Since you're passing all the columns to the insert statement, you might write this instead:
// Create an AccountRecord that contains your POJO data
Record rec = db.newRecord(ACCOUNT);
rec.from(account);
// Don't pass the columns to the insert statement explicitly
db.insertInto(ACCOUNT)
// But pass the record to the set method. It will use all the changed values
.set(rec)
// Use the MySQL syntax, which can be emulated on PostgreSQL using ON CONFLICT
.onDuplicateKeyUpdate()
// But pass the record to the set method again
.set(rec)
// Don't specify any columns to the returning clause. It will take all the ACCOUNT columns
.returning()
.fetchOne();

JOOQ create Table and insert values DSLContext

I'm trying to generate a database using jOOQ.
I create a table with this code:
CreateTableAsStep<Record> table = create.createTable("TestTable");
CreateTableColumnStep step = table.column("testColumn", SQLDataType.INTEGER);
step.execute();
This works fine, but when it comes to inserting data, I run into a problem.
The documentation includes the following example:
create.insertInto(AUTHOR)
.set(AUTHOR.ID, 100)
.set(AUTHOR.FIRST_NAME, "Hermann")
.set(AUTHOR.LAST_NAME, "Hesse")
.newRecord()
.set(AUTHOR.ID, 101)
.set(AUTHOR.FIRST_NAME, "Alfred")
.set(AUTHOR.LAST_NAME, "Döblin")
.execute();
Here AUTHOR is not a simple String; insertInto expects an org.jooq.Table<R extends Record>.
I thought there might be a return value when creating the table, but I did not find one. Google did not help, as Table is not the best word to search for ;-)
Question: how can I get an instance of a Table when I only have its name as a String?

You can always create Table references via DSL.table(String) or DSL.table(Name). For example:
// Assuming this:
import static org.jooq.impl.DSL.*;
create.insertInto(table(name("TestTable")))
.set(field(name("testColumn")), 1)
.execute();
Notice also my usage of DSL.field(Name).
Plain SQL vs. Name references
It's worth reading up on the difference between creating dynamic table / field objects at runtime either with plain SQL strings (as in DSL.table(String)) or with name references (as in DSL.table(Name)). Essentially:
Plain SQL strings are case-insensitive and subject to SQL injection
Name references are case-sensitive by default
In your case, as you probably created case sensitive table/column names, you should prefer the latter. More info can be found here:
http://www.jooq.org/doc/latest/manual/sql-building/plain-sql
http://www.jooq.org/doc/latest/manual/sql-building/names
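To illustrate why a quoted name behaves differently from a plain SQL string, here is a minimal sketch of standard SQL identifier quoting in plain Java. This is not jOOQ's implementation, the class and method names are hypothetical, and real dialects differ (MySQL, for example, uses backticks by default). It only shows the general idea: the whole input becomes one quoted identifier, so its case is preserved and embedded quotes cannot terminate it early.

```java
// Hypothetical illustration of standard SQL identifier quoting; not jOOQ code.
final class IdentifierQuoting {
    private IdentifierQuoting() {}

    /** Wraps an identifier in double quotes, doubling any embedded double quote. */
    static String quote(String identifier) {
        return "\"" + identifier.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // Case is preserved because the name is quoted.
        System.out.println(quote("TestTable"));             // "TestTable"
        // An injection attempt stays inside a single identifier.
        System.out.println(quote("t\"; DROP TABLE x; --")); // "t""; DROP TABLE x; --"
    }
}
```

A plain SQL string, by contrast, is inserted into the statement as written, which is why it is subject to SQL injection and to the database's own case folding rules.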

Spring Batch Paging with sortKeys and parameter values

I have a Spring Batch project running in Spring Boot that is working perfectly fine. For my reader I'm using JdbcPagingItemReader with a MySqlPagingQueryProvider.
@Bean
public ItemReader<Person> reader(DataSource dataSource) {
MySqlPagingQueryProvider provider = new MySqlPagingQueryProvider()
provider.setSelectClause(ScoringConstants.SCORING_SELECT_STATEMENT)
provider.setFromClause(ScoringConstants.SCORING_FROM_CLAUSE)
provider.setSortKeys("p.id": Order.ASCENDING)
JdbcPagingItemReader<Person> reader = new JdbcPagingItemReader<Person>()
reader.setRowMapper(new PersonRowMapper())
reader.setDataSource(dataSource)
reader.setQueryProvider(provider)
//Setting these caused the exception
reader.setParameterValues(
startDate: new Date() - 31,
endDate: new Date()
)
reader.afterPropertiesSet()
return reader
}
However, when I modified my query with some named parameters to replace previously hard coded date values and set these parameter values on the reader as shown above, I get the following exception on the second page read (the first page works fine because the _id parameter hasn't been made use of by the paging query provider):
org.springframework.dao.InvalidDataAccessApiUsageException: No value supplied for the SQL parameter '_id': No value registered for key '_id'
at org.springframework.jdbc.core.namedparam.NamedParameterUtils.buildValueArray(NamedParameterUtils.java:336)
at org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate.getPreparedStatementCreator(NamedParameterJdbcTemplate.java:374)
at org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate.query(NamedParameterJdbcTemplate.java:192)
at org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate.query(NamedParameterJdbcTemplate.java:199)
at org.springframework.batch.item.database.JdbcPagingItemReader.doReadPage(JdbcPagingItemReader.java:218)
at org.springframework.batch.item.database.AbstractPagingItemReader.doRead(AbstractPagingItemReader.java:108)
Here is an example of the SQL, which has no WHERE clause by default. One does get created automatically when the second page is read:
select *, (select id from family f where date_created between :startDate and :endDate and f.creator_id = p.id) from person p
On the second page, the sql is modified to the following, however it seems that the named parameter for _id didn't get supplied:
select *, (select id from family f where date_created between :startDate and :endDate and f.creator_id = p.id) from person p WHERE id > :_id
I'm wondering if I simply can't use the MySqlPagingQueryProvider sort keys together with additional named parameters set in JdbcPagingItemReader. If not, what is the best alternative to solving this problem? I need to be able to supply parameters to the query and also page it (vs. using the cursor). Thank you!

I solved this problem with some intense debugging. It turns out that MySqlPagingQueryProvider uses a method getSortKeysWithoutAliases() when it builds the SQL query for the first page and for subsequent pages. It therefore appends and (p.id > :_id) instead of and (p.id > :_p.id). Later on, when the sort values for the second page are created and stored in JdbcPagingItemReader's startAfterValues field, the reader uses the original "p.id" String and eventually puts the pair ("_p.id", 10) into the named parameter map. When the reader then tries to fill in _id in the query, no value is found, because the value was stored under the key that still contains the alias.
Long story short, I had to remove the alias reference when defining my sort keys. This line:
provider.setSortKeys("p.id": Order.ASCENDING)
had to change to the following for everything to work together:
provider.setSortKeys("id": Order.ASCENDING)
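The mismatch described in this answer can be modelled in a few lines of plain Java. This is not Spring Batch code: the class and all method names are hypothetical, and the behaviour is assumed only from the description above (the query placeholder is built from the sort key without its alias, while the parameter map key is built from the sort key as configured).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the behaviour described in the answer; not Spring Batch code.
final class SortKeyMismatch {
    private SortKeyMismatch() {}

    /** Drops a table alias prefix: "p.id" becomes "id". */
    static String withoutAlias(String sortKey) {
        int dot = sortKey.indexOf('.');
        return dot < 0 ? sortKey : sortKey.substring(dot + 1);
    }

    /** Placeholder name used in the generated paging query, e.g. "_id". */
    static String placeholderName(String sortKey) {
        return "_" + withoutAlias(sortKey);
    }

    /** Key under which the last seen sort value is stored, e.g. "_p.id". */
    static String parameterKey(String sortKey) {
        return "_" + sortKey;
    }

    /** True if the query placeholder can be resolved from the parameter map. */
    static boolean resolves(String sortKey) {
        Map<String, Object> params = new HashMap<>();
        params.put(parameterKey(sortKey), 10);
        return params.containsKey(placeholderName(sortKey));
    }

    public static void main(String[] args) {
        System.out.println(resolves("p.id")); // false: query asks for "_id", map holds "_p.id"
        System.out.println(resolves("id"));   // true: both sides use "_id"
    }
}
```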

I had the same issue and got another possible solution.
My table T has a primary key field INTERNAL_ID.
The query in JdbcPagingItemReader was like this:
SELECT INTERNAL_ID, ... FROM T WHERE ... ORDER BY INTERNAL_ID ASC
So, the key is: under some conditions the query returned no results, which raised the error above (No value supplied for...).
The solution is:
Check in a Spring Batch decider element whether there are rows.
If there are, continue with the chunk: reader-processor-writer.
If there are not, go to another step.
Please note that there are two different scenarios:
At the beginning, there are rows. You get them by paging and, finally, there are no more rows. This causes no problem, and the decider trick is not required.
At the beginning, there are no rows. Then the error is raised, and the decider solves it.
Hope this helps.

What is the best way to import an XML string into a SQL Server table

I am working with a third-party product called JPOS. Its XMLPackager gives me a string that contains a record in XML format, such as:
<MACHINE><B000>STRING_VALUE</B000><B002>STRING_VALUE</B002><B003>STRING_VALUE</B003><B004>STRING_VALUE</B004><B007>STRING_VALUE</B007><B011>STRING_VALUE</B011><B012>STRING_VALUE</B012><B013>STRING_VALUE</B013><B015>STRING_VALUE</B015><B018>STRING_VALUE</B018><B028>STRING_VALUE</B028><B032>STRING_VALUE</B032><B035>STRING_VALUE</B035><B037>STRING_VALUE</B037><B039>STRING_VALUE</B039><B041>STRING_VALUE</B041><B043>STRING_VALUE</B043><B048>STRING_VALUE</B048><B049>STRING_VALUE</B049><B058>STRING_VALUE</B058><B061>STRING_VALUE</B061><B063>STRING_VALUE</B063><B127>STRING_VALUE</B127></MACHINE>
I have a SQL Server table that contains a column for each of the listed elements. Not that it matters but I could potentially have thru defined with specific STRING_VALUEs. I'm not sure what the best way to go about this in Java is. My understanding is that SQL Server can take an XML string (not a document) and do an insert. Is it best to parse each value and then put the values into a list from which each column is populated? This is the first time I've worked with XML, so I am looking for some help or direction.
Thanks.

Sorry, one of my colleagues was able to help and provide a quick answer. I'll try it from my Java code and it looks like it should work great. Thanks anyway.
Here is the SP that she created whereby I can pass in my XML string and bit value:
CREATE PROCEDURE [dbo].[sbssp_InsertArchivedMessages]
(
@doc varchar(max),
@fromTo bit
)
AS
BEGIN
DECLARE @idoc int, @lastId int
-- Parse the XML string and get a handle to the in-memory document
EXEC sp_xml_preparedocument @idoc OUTPUT, @doc
INSERT INTO [dbo].[tblArchivedMessages]
SELECT * FROM OPENXML(@idoc, '/MACHINE', 2) WITH [dbo].[tblArchivedMessages]
-- Release the memory held by the parsed document
EXEC sp_xml_removedocument @idoc
SET @lastId = (SELECT IDENT_CURRENT('tblArchivedMessages'))
UPDATE [dbo].[tblArchivedMessages]
SET FromToMach = @fromTo
WHERE ID = @lastId
END
GO
Regards.
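For the alternative raised in the question, parsing each value on the Java side, a minimal sketch using only the JDK's built-in DOM parser could look like the following. The class and method names are hypothetical. It assumes the flat structure shown in the question (one root element whose children each hold a single text value) and returns the values keyed by element name, in document order.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: turns a flat XML record into element-name/value pairs.
final class MachineXml {
    private MachineXml() {}

    static Map<String, String> toColumns(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPE declarations, since the string comes from outside.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // LinkedHashMap keeps the elements in document order.
            Map<String, String> columns = new LinkedHashMap<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    columns.put(child.getNodeName(), child.getTextContent());
                }
            }
            return columns;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML record", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<MACHINE><B000>a</B000><B002>b</B002></MACHINE>";
        System.out.println(toColumns(xml)); // {B000=a, B002=b}
    }
}
```

The resulting map could then be bound to a parameterized INSERT. Passing the whole string to a stored procedure, as in the answer above, keeps the Java side simpler because the column mapping stays in the database.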
