How do I parse a base query (à la Google Data) in Java? - java

I have a system where I query a REST / Atom server for documents. The queries are inspired by GData and look like:
http://server/base/feeds/documents?bq=[type in {'news'}]
I have to parse the "bq" parameter to know which type of documents will be returned without actually doing the query. So for example,
bq=[type = 'news'] -> return ["news"]
bq=[type in {'news'}] -> return ["news"]
bq=[type in {'news', 'article'}] -> return ["news", "article"]
bq=[type = 'news']|[type = 'article'] -> return ["news", "article"]
bq=[type = 'news']|[title = 'My Title'] -> return ["news"]
Basically, the query language is a list of predicates that can be combined with OR ("|") or AND (no separator). Each predicate is a constraint on a field. The constraint can be =, <, >, <=, >=, in, etc. There can be spaces anywhere it makes sense.
I'm a bit lost between regexes, StringTokenizer, StreamTokenizer, etc., and I am stuck with Java 1.4, so no Parser...
Can anyone point me in the right direction?
Thanks!

The right way would be to use a parser generator like ANTLR, JFlex or JavaCC.
A quick and dirty way would be:
String[] disjunctedPredicateGroups = query.split("\\|");            // split on OR
List<String[]> normalizedPredicates = new ArrayList<String[]>();
for (String conjunction : disjunctedPredicateGroups) {
    normalizedPredicates.add(conjunction.split("\\[|\\]"));         // split each group into its predicates
}
// process each predicate
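If only the document types are needed, the "process each predicate" step can be as small as a regex pass. A minimal sketch, assuming only the type field with the = and in operators matters, and using java.util.regex which is available in 1.4 (the class and method names are mine):
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BqTypeExtractor {
    // matches [type = 'news'] as well as [type in {'news', 'article'}]
    private static final Pattern TYPE_PREDICATE =
            Pattern.compile("\\[\\s*type\\s*(?:=\\s*'([^']+)'|in\\s*\\{([^}]+)\\})\\s*\\]");

    public static List extractTypes(String bq) {
        List types = new ArrayList(); // raw List of String, since 1.4 has no generics
        Matcher m = TYPE_PREDICATE.matcher(bq);
        while (m.find()) {
            if (m.group(1) != null) { // [type = 'news']
                types.add(m.group(1));
            } else {                  // [type in {'news', 'article'}]
                String[] values = m.group(2).split(",");
                for (int i = 0; i < values.length; i++) {
                    types.add(values[i].trim().replaceAll("'", ""));
                }
            }
        }
        return types;
    }
}
With the examples from the question, extractTypes("[type in {'news', 'article'}]") returns ["news", "article"], and predicates on other fields such as title are simply ignored.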

Related

How to use ExternalCatalog.listPartitions() with Java

I'm new to Java. I want to drop partitions in a Hive table. I want to use SparkSession.ExternalCatalog().listPartitions and SparkSession.ExternalCatalog().dropPartitions.
I saw these methods in Scala: How to truncate data and drop all partitions from a Hive table using Spark
But I can't understand how to call them from Java. It's part of an ETL process and I want to understand how to deal with it in Java.
My code fails because I don't understand how to work with the datatypes and convert them for Java: what types of objects the API expects and what data it returns.
Example of my code:
ExternalCatalog ec = SparkSessionFactory.getSparkSession.sharedState().externalCatalog();
ec.listPartitions("need_schema", "need_table");
And it fails with:
method listPartitions in class org.apache.spark.sql.catalog.ExternalCatalog cannot be applied to given types.
I can't get past it because there is little information about the API (https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-ExternalCatalog.html#listPartitions), my Java knowledge is limited, and all the examples I find are written in Scala.
Finally I need to convert this code that works on scala to java:
def dropPartitions(spark: SparkSession, shema: String, table: String, need_date: String): Unit = {
  val cat = spark.sharedState.externalCatalog
  val partit = cat.listPartitions(shema, table).map(_.spec).map(t => t.get("partition_field")).flatten
  val filteredPartitions = partit.filter(_ < need_date).map(x => Map("partition_field" -> x))
  cat.dropPartitions(
    shema
    , table
    , filteredPartitions
    , ignoreIfNotExists = true
    , purge = false
    , retainData = false
  )
}
Please, if you know how to deal with this, can you help with these things:
some example code in Java for writing my own container to manipulate the data from externalCatalog
what data structures this API uses, and some theoretical source which can help me understand how to use them from Java
what does this line of Scala mean: cat.listPartitions(shema,table).map(_.spec).map(t => t.get("partition_field")).flatten?
Thanks
UPDATING
Thank you very much for your feedback #jpg. I'll try. I have a big ETL task whose goal is to write data into a dynamically partitioned table once a week. The business rule for this datamart is (sysdate - 90 days), and because of that I want to drop ranges of partitions (by day) in the target table in a public access schema. I have read that the right way to drop partitions is through externalCatalog. I have to use Java because of the historical tradition of this project, and I am trying to understand how to do this most efficiently. Some externalCatalog methods I can already print to the terminal through System.out.println():
externalCatalog.tableExists(), externalCatalog.listTables() and externalCatalog.getTable(). But I don't understand how to deal with externalCatalog.listPartitions.
UPDATING ONE MORE TIME
Hello everyone. I have made one step forward in my task:
I can now print the list of partitions to the terminal:
ExternalCatalog ec = SparkSessionFactory.getSparkSession.sharedState().externalCatalog();
ec.listPartitions("schema", "table", Option.empty()); // work! null or miss parameter fail program
Seq<CatalogTablePartition> ctp = ec.listPartitions("schema", "table", Option.empty());
List<CatalogTablePartition> catalogTablePartitions = JavaConverters.seqAsJavaListConverter(ctp).asJava();
for (CatalogTablePartition catalogTablePartition : catalogTablePartitions) {
    System.out.println(catalogTablePartition.toLinkedHashMap().get("Partition Values")); // returns the partition value, e.g. "Some([validation_date=2021-07-01])"
}
But there is another problem.
I can't pass the values to ec.dropPartitions as a Java List: its third parameter wants a Seq<Map<String, String>>. I also can't filter the partitions this way; ideally I want to keep only the partition values less than a date parameter and then drop those.
If anyone knows how to write a map method with this API that returns the same thing as in my Scala example, please help me.
I solved it by myself. Maybe it'll help someone.
public static void partitionDeleteLessDate(String db_name, String table_name, String date_less_delete) {
    ExternalCatalog ec = SparkSessionFactory.getSparkSession.sharedState().externalCatalog();
    Seq<CatalogTablePartition> ctp = ec.listPartitions(db_name, table_name, Option.empty());
    List<CatalogTablePartition> catalogTablePartitions = JavaConverters.seqAsJavaListConverter(ctp).asJava();
    // the partition spec of every partition, each one a Scala Map of partition column name to value
    List<Map<String, String>> allPartList = catalogTablePartitions.stream()
            .map(s -> s.spec().seq())
            .collect(Collectors.toList());
    // the partition values themselves, sorted
    List<String> datePartDel = allPartList.stream()
            .map(x -> x.get("partition_name").get())
            .sorted()
            .collect(Collectors.toList());
    String lessThisDateDelete = date_less_delete;
    DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    LocalDate date = LocalDate.parse(lessThisDateDelete, formatter);
    // keep only the partition dates strictly before the cut-off date
    List<String> filteredDates = datePartDel.stream()
            .map(s -> LocalDate.parse(s, formatter))
            .filter(d -> d.isBefore(date))
            .map(s -> s.toString())
            .collect(Collectors.toList());
    for (String seeDate : filteredDates) {
        // pick the matching partition spec and convert it back to a Scala Seq for dropPartitions
        List<Map<String, String>> elem = allPartList.stream()
                .filter(x -> x.get("partition_name").get().equals(seeDate))
                .collect(Collectors.toList());
        Seq<Map<String, String>> seqElem = JavaConverters.asScalaIteratorConverter(elem.iterator()).asScala().toSeq();
        ec.dropPartitions(
                db_name
                , table_name
                , seqElem
                , true   // ignoreIfNotExists
                , false  // purge
                , false  // retainData
        );
    }
}
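For completeness, a sketch of the imports this method relies on; the exact packages are an assumption based on Spark 2.x with Scala 2.11/2.12, so adjust them to your versions:
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.sql.catalyst.catalog.CatalogTablePartition;
import org.apache.spark.sql.catalyst.catalog.ExternalCatalog;

import scala.Option;
import scala.collection.JavaConverters;
import scala.collection.Map;
import scala.collection.Seq;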

Hibernate-search search by list of numbers

I am working on a Hibernate Search Java application with an entity which has a numeric field indexed:
@Field
@NumericField
private Long orgId;
I want to get the list of entities which match a list of Long values for this property. I used simpleQueryString because it allows OR logic with the | character for several target values. I have something like this:
queryBuilder.simpleQueryString().onField("orgId").matching("1|3|8").createQuery()
After running my application I get:
The specified query '+(orgId:1 orgId:3 orgId:8)' contains a string based sub query which targets the numeric encoded field(s) 'orgId'. Check your query or try limiting the targeted entities.
So, can somebody tell me what is wrong with this code? Is there another way to do what I need?
=================================
UPDATE 1:
yrodiere's answer solves the issue, but I have another doubt: I also want to validate whether entities match other fields. I know I can use BooleanJunction, but then I need to mix "must" and "should" usages, right? i.e.:
BooleanJunction<?> bool = queryBuilder.bool();
for (Integer orgId: orgIds) {
bool.should( queryBuilder.keyword().onField("orgId").matching(orgId).createQuery() );
}
bool.must(queryBuilder.keyword().onField("name").matching("anyName").createQuery() );
Then I am validating that the entities must match a "name" and must also match one of the given orgIds, am I right?
As the error message says:
The specified query [...] contains a string based sub query which targets the numeric encoded field(s) 'orgId'.
simpleQueryString can only be used to target text fields. Numeric fields are not supported.
If your string was generated programmatically, and you have a list of integers, this is what you'll need to do:
List<Integer> orgIds = Arrays.asList(1, 3, 8);
BooleanJunction<?> bool = queryBuilder.bool();
for (Integer orgId: orgIds) {
bool.should( queryBuilder.keyword().onField("orgId").matching(orgId).createQuery() );
}
org.apache.lucene.search.Query query = bool.createQuery();
query will match documents whose orgId field contains 1, 3 OR 8.
See https://docs.jboss.org/hibernate/search/5.11/reference/en-US/html_single/#_combining_queries
EDIT: If you need additional clauses, I'd recommend not mixing must and should in the same boolean junction, but nesting boolean junctions instead.
For example:
BooleanJunction<?> boolForOrgIds = queryBuilder.bool();
for (Integer orgId: orgIds) {
boolForOrgIds.should(queryBuilder.keyword().onField("orgId").matching(orgId).createQuery());
}
BooleanJunction<?> boolForWholeQuery = queryBuilder.bool();
boolForWholeQuery.must(boolForOrgIds.createQuery());
boolForWholeQuery.must(queryBuilder.keyword().onField("name").matching("anyName").createQuery());
// and add as many "must" as you need
org.apache.lucene.search.Query query = boolForWholeQuery.createQuery();
Technically you can mix 'must' and 'should', but the effect won't be what you expect: 'should' clauses will become optional and will only raise the score of documents when they match. So, not what you need here.
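In case it helps, a minimal sketch of actually executing the resulting Lucene query through Hibernate Search 5's JPA integration; the entity class MyEntity and the already-open EntityManager em are assumptions here:
import org.hibernate.search.jpa.FullTextEntityManager;
import org.hibernate.search.jpa.Search;
import org.hibernate.search.query.dsl.QueryBuilder;

FullTextEntityManager ftem = Search.getFullTextEntityManager(em);
QueryBuilder queryBuilder = ftem.getSearchFactory()
        .buildQueryBuilder().forEntity(MyEntity.class).get();

// ... build the boolean junctions and "query" as shown above ...

List<MyEntity> results = ftem.createFullTextQuery(query, MyEntity.class).getResultList();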

How can I find Documents with Duplicate Array Elements?

Here is my Document:
{
"_id":"5b1ff7c53e3ac841302cfbc2",
"idProf":"5b1ff7c53e3ac841302cfbbf",
"pacientes":["5b20d2c83e3ac841302cfbdb","5b20d25f3e3ac841302cfbd0"]
}
I want to know how to find a duplicate entry in the array using MongoCollection in Java.
This is what I'm trying:
BasicDBObject query = new BasicDBObject("idProf", idProf);
query.append("$in", new BasicDBObject().append("pacientes", idJugador.toString()));
collection.find(query)
We can try to solve this in your Java-application code.
private final MongoCollection<Document> collection;

public boolean hasDuplicatePacientes(String idProf) {
    Document d = collection.find(eq("idProf", idProf)).first();
    List<String> pacientes = (List<String>) d.get("pacientes");
    int original = pacientes.size();
    if (original == 0) {
        return false;
    }
    // a Set keeps only unique entries, so a size difference means there were duplicates
    Set<String> unique = new HashSet<>(pacientes);
    return original != unique.size();
}
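For context, a sketch of how that collection field might be initialized with the modern driver; the connection string, database and collection names are assumptions:
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;

MongoClient client = MongoClients.create("mongodb://localhost:27017");
MongoCollection<Document> collection = client.getDatabase("mydb").getCollection("profesores");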
Or if you're searching for a way to do this fully on db-side, I believe it's also possible with something like Neil Lunn provided.
The best approach really is to compare the length of the array to the length of an array which would have all duplicates removed. A "Set" does not have duplicate entries, so what you need to do is convert an array into a "Set" and test against the original.
Modern MongoDB $expr
Modern MongoDB releases have $expr which can be used with aggregation expressions in a regular query. Here the expressions we would use are $setDifference and $size along with $ne for the boolean comparison:
Document query = new Document(
    "$expr", new Document(
        "$ne", Arrays.asList(
            new Document("$size", "$pacientes"),
            new Document("$size",
                new Document("$setDifference", Arrays.asList("$pacientes", Collections.emptyList()))
            )
        )
    )
);
MongoCursor<Document> cursor = collection.find(query).iterator();
Which serializes as:
{
  "$expr": {
    "$ne": [
      { "$size": "$pacientes" },
      { "$size": { "$setDifference": [ "$pacientes", [] ] } }
    ]
  }
}
Here it is actually the $setDifference which does the comparison, returning only the unique elements. The $size returns the length, both of the original document array content and of the newly reduced "set". And of course where these are "not equal" (the $ne) the condition is true, meaning that a duplicate was found in the document.
The $expr operates on that boolean true/false result to decide whether the document is considered a match for the condition or not.
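For completeness, a small sketch of consuming the cursor from the $expr query above; nothing here is specific to this query, it is just the standard MongoCursor pattern:
try (MongoCursor<Document> cursor = collection.find(query).iterator()) {
    while (cursor.hasNext()) {
        Document doc = cursor.next();
        System.out.println(doc.toJson()); // each document returned contains duplicate "pacientes" entries
    }
}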
Earlier Version $where clause
Basically $where is a JavaScript expression that evaluates on the server
String whereClause = "this.pacientes.length != Object.keys(this.pacientes.reduce((o,e) => Object.assign(o, { [e.valueOf()]: null}), {})).length";
Document query = new Document("$where": whereClause);
MongoCursor<Document> cursor = collection.find(query).iterator();
You do need JavaScript evaluation to not be explicitly disabled on the server (it is enabled by default), and it's not as efficient as using $expr and the native aggregation operators. But JavaScript expressions can be evaluated in the same way using $where, and the argument in the Java code is basically sent as a string.
In the expression the .length is a property of all JavaScript arrays, so you have the original document content and the comparison to the "set". The Array.reduce() uses each array element as a "key" in a resulting object, from which the Object.keys() will then return those "keys" as a new array.
Since JavaScript objects work like a Map, only unique keys are allowed and this is a way to get that "set" result. And of course the same != comparison will return true when the removal of duplicate entries resulted in a change of length.
In either case, $expr and $where are computed conditions which cannot use an index present on the collection. As such it is generally recommended to add criteria which use regular equality or range based query expressions, which can indeed use an index, alongside these expressions. Such additional criteria in the predicate would greatly improve query performance where an index is in place.
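For example, anchoring the computed condition to the idProf field first; the field name is taken from the question and this is just a sketch:
Document query = new Document("idProf", idProf) // regular equality, can use an index
    .append("$expr", new Document(
        "$ne", Arrays.asList(
            new Document("$size", "$pacientes"),
            new Document("$size",
                new Document("$setDifference", Arrays.asList("$pacientes", Collections.emptyList()))
            )
        )
    ));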

Union query with multiple selects post java 8

Here is a query that I want to try out in MySQL
SELECT A.x
FROM A
WHERE A.y = 'P'
UNION
SELECT A.x
FROM A
WHERE A.y = 'Q'
The above is a cut-down, much simpler version of the original query that I am trying. In my original query, each SELECT statement involves multiple tables with INNER JOINs.
If the number of possible values in the 'y' column of table 'A' that I need to query on is 'n', then my query involves doing 'n-1' unions over 'n' SELECT statements.
I know that jOOQ can union multiple SELECT statements. But is there a good way to do this in a post-Java 8 style, maybe using Stream.collect()?
This is what I have, but I wonder whether I could do better:
String firstValueToQuery = valuesToQuery.get(0);
Select<Record5<UUID, UUID, String, Integer, String>> selectQuery = getSelectQueryForValue(firstValueToQuery);
valuesToQuery.stream()
.skip(1)
.forEach(valueToQuery -> selectQuery.unionAll(getSelectQueryForValue(valueToQuery)));
selectQuery.fetchStream();
Here is how I implement getSelectQueryForValue
private Select<Record5<UUID, UUID, String, Integer, String>> getSelectQueryForValue(String valueToQuery) {
return jooq.select(
A.P,
A.Q,
A.R,
A.S,
A.T)
.from(A)
.where(A.Y.eq(valueToQuery));
}
PS: I understand that I could use the 'IN' clause instead, like below
SELECT A.x
FROM A
WHERE A.y IN ('P','Q',...)
But with my current data distribution in the database, MySQL chooses a sub-optimal query plan for that. Hence the UNION, so that the database implicitly prefers a faster query plan by making use of the right index.
The idiomatic approach here would be as follows (using JDK 9 API):
try (Stream<Record5<UUID, UUID, String, Integer, String>> stream = valuesToQuery
.stream()
.map(this::getSelectQueryForValue)
.reduce(Select::union)
.stream() // JDK 9 method
.flatMap(Select::fetchStream)) {
...
}
It uses the useful Optional.stream() method, which was added in JDK 9. In JDK 8, you could do this instead:
valuesToQuery
.stream()
.map(this::getSelectQueryForValue)
.reduce(Select::union)
.ifPresent(s -> {
try (Stream<Record5<UUID, UUID, String, Integer, String>> stream =
s.fetchStream()) {
...
}
});
I blogged about this in more detail here.

Parse HQL to AST Structure and convert AST back to HQL

I have an HQL query:
query = select item.itemNumber from items item where item.stock>0 and item.price<100.00
I would like to parse this query and convert it into a tree structure:
AST queryTree = parse(query);
Then I would like to iterate through the nodes, change some values, and convert the tree back to a string representation:
Iterator<ASTNode> it = queryTree.nodeIterator();
while(it.hasNext())
{
ASTNode node = it.next();
System.out.println( node.text() + "->" + node.value() );
}
query = queryTree.toString();
It would be nice if the parse method threw exceptions in case the HQL grammar is violated, but it's not necessary.
Does anyone have an idea how this can be accomplished? Are there any API methods offered by Hibernate to accomplish that task?
Thanks,
You could take a look at the new experimental parser that is being worked on here: https://github.com/hibernate/hibernate-hql-parser
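If read access to the tree is enough, Hibernate's existing ANTLR2-based parser can also be driven directly. A rough sketch, assuming Hibernate ORM 4.x/5.x where the class lives in org.hibernate.hql.internal.ast; note this is internal API, it may change between versions, and it does not provide a way to render the AST back to an HQL string:
import antlr.collections.AST;
import org.hibernate.hql.internal.ast.HqlParser;

public class HqlAstDemo {
    public static void main(String[] args) throws Exception {
        String query = "select item.itemNumber from items item where item.stock > 0 and item.price < 100.00";

        HqlParser parser = HqlParser.getInstance(query);
        parser.statement();            // runs the grammar's top-level rule over the query
        AST ast = parser.getAST();

        walk(ast, 0);
    }

    // depth-first walk over the ANTLR2 AST; each node exposes its text and token type
    private static void walk(AST node, int depth) {
        for (AST n = node; n != null; n = n.getNextSibling()) {
            StringBuilder indent = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                indent.append("  ");
            }
            System.out.println(indent + n.getText() + " (type " + n.getType() + ")");
            walk(n.getFirstChild(), depth + 1);
        }
    }
}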
