How do I run a Spark SQL Aggregator cumulatively? - java

I am currently working on a project with spark datasets (in Java) where I have to create a new column derived from an accumulator run over all the previous rows.
I have been implementing this using a custom UserDefinedAggregateFunction over a Window from unboundedPreceding to currentRow.
This goes something like this:
df.withColumn("newColumn", customAccumulator
.apply(columnInputSeq)
.over(customWindowSpec));
However, I would really prefer to use a typed Dataset for type safety and generally cleaner code, i.e. perform the same operation with an org.apache.spark.sql.expressions.Aggregator over a Dataset<CustomType>. The problem is that I have looked through all the documentation and can't work out how to make it behave in the same way as above: I can only get a final aggregate over the whole column rather than a cumulative state at each row.
Is what I am trying to do possible and if so, how?
Example added for clarity:
Initial table:
+-------+------+------+
| Index | Col1 | Col2 |
+-------+------+------+
| 1     | abc  | def  |
| 2     | ghi  | jkl  |
| 3     | mno  | pqr  |
| 4     | stu  | vwx  |
+-------+------+------+
Then with example aggregation operation:
First reverse the accumulator, then prepend Col1 and append Col2. Return this value and also store it as the new accumulator.
+-------+------+------+--------------------------+
| Index | Col1 | Col2 | Accumulator              |
+-------+------+------+--------------------------+
| 1     | abc  | def  | abcdef                   |
| 2     | ghi  | jkl  | ghifedcbajkl             |
| 3     | mno  | pqr  | mnolkjabcdefihgpqr       |
| 4     | stu  | vwx  | sturqpghifedcbajklonmvwx |
+-------+------+------+--------------------------+
Using a UserDefinedAggregateFunction I have been able to produce this but with an Aggregator I can only get the last row.
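For reference, the example operation can be sketched in plain Java, without Spark, as a sequential fold that records the state after every row. The class and method names here (CumulativeAccumulator, step, runningValues) are made up for illustration and are not part of any Spark API. This only shows the per-row behaviour being asked for, not how to get it out of an Aggregator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, plain Java only: models the cumulative column from the example.
class CumulativeAccumulator {

    // One step: reverse the previous accumulator, prepend col1, append col2.
    static String step(String acc, String col1, String col2) {
        return col1 + new StringBuilder(acc).reverse() + col2;
    }

    // Applies step() to the rows in order and keeps the state after each row,
    // which is what the window from unboundedPreceding to currentRow produces.
    static List<String> runningValues(List<String[]> rows) {
        List<String> out = new ArrayList<>();
        String acc = "";
        for (String[] row : rows) {
            acc = step(acc, row[0], row[1]);
            out.add(acc);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"abc", "def"},
            new String[] {"ghi", "jkl"},
            new String[] {"mno", "pqr"},
            new String[] {"stu", "vwx"});
        // Prints one accumulator value per input row.
        runningValues(rows).forEach(System.out::println);
    }
}
```

Note that the result depends on the row order, so any Spark version of this needs an explicitly ordered window.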

You don't.
My source for this is a friend who has been working on an identical problem and has now concluded that it's impossible.

Related

How to join two frames' rows in H2O?

I am implementing my own algorithm in H2O's Java source code (under package h2o-algos).
How can I join two frames' rows (i.e. vectors) in H2O given H2O Java methods?
For instance, given two Frame A and B
Frame A:
| Id | Name |
| -------- | -------------- |
| 123 | John |
| 456 | Bob |
Frame B:
| Id | Name |
| -------- | -------------- |
| 789 | Alice |
I want the resultant Frame C to be:
| Id | Name |
| -------- | -------------- |
| 123 | John |
| 456 | Bob |
| 789 | Alice |
Is there a faster way to do this than making new vectors and then creating a new frame from them? I have read the documentation and found that the Frame::append() method creates new columns rather than joining rows.
This functionality is called "row binding". It is not exposed as an API method, but it is available as a Rapids expression (Rapids is a simple Scheme-like language). You can follow this example to row-bind two H2O Frames: https://github.com/h2oai/h2o-3/blob/master/h2o-core/src/test/java/water/rapids/ast/prims/mungers/AstRBindTest.java#L40 In a nutshell, if you have two frames with keys A and B, you would run water.rapids.Rapids.exec("rbind A B").getFrame()

Set any field once for all the tests in Fitnesse table

I want to set one field in the fitnesse table, only once for all the tests. For example I want to set Operator as + for all the tests in the table.
Below is the regular table.
!|CalculatorFixture |
|Value1|Operator|Value2|calculate?|
|3.0 |+ |5.0 |8.0 |
|2.0 |* |3.5 |7.0 |
I want something like:
!| CalculatorFixture |
|Operator |
|+ |
|Value1|Value2|calculate?|
|3.0 |5.0 |8.0 |
|6.0 |3.0 |9.0 |
|5.0 |2.0 |7.0 |
Any idea how I can achieve this in the fixture or in the FitNesse table?
FYI, I am using Slim: !define TEST_SYSTEM {slim}
You can set a Java static field in a previous table fixture and then access it in the CalculatorFixture.
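A minimal sketch of the static-field idea, assuming Slim decision tables that call setOperator, setValue1, setValue2 and calculate by column name. SharedOperator is a hypothetical helper fixture, and the body of CalculatorFixture is a guess at what the real fixture does. In a real FitNesse project each fixture would be a public class in its own file; they are kept package-private here so the sketch compiles as one file.

```java
// Hypothetical helper fixture: a table with a single "operator" input column
// calls setOperator once and stores the value for the tables that follow.
class SharedOperator {
    static String operator = "+";

    public void setOperator(String op) {
        operator = op;
    }
}

// Assumed shape of the calculator fixture, reading the shared operator
// instead of taking an Operator column in every row.
class CalculatorFixture {
    private double value1;
    private double value2;

    public void setValue1(double value1) {
        this.value1 = value1;
    }

    public void setValue2(double value2) {
        this.value2 = value2;
    }

    public double calculate() {
        switch (SharedOperator.operator) {
            case "+": return value1 + value2;
            case "-": return value1 - value2;
            case "*": return value1 * value2;
            case "/": return value1 / value2;
            default:
                throw new IllegalArgumentException(
                    "Unsupported operator: " + SharedOperator.operator);
        }
    }
}
```

The static field stays set for the rest of the test run, so a later page that needs a different operator has to set it again.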
You can also pass 'constructor parameters' to scenarios by using having or given as the first cell after the scenario name (this example is from FitNesse's own tests):
|scenario | myDivision _ _ _|numerator, denominator, quotient|
|setNumerator | #numerator |
|setDenominator | #denominator|
|check | quotient| #quotient |
| myDivision | having |numerator| 12|
| denominator|quotient|
| 3 |4.0 |
| 6 |2.0 |
| 4 |3.0 |

Create dynamic classes / objects to be included in a list

I have an xlsx file that has some tabs with different data. I want to be able to save each row of a tab in a list. The first thing that comes to mind is a list of lists, but I was wondering if there is another way. I'd like to save that information in an object, with all its benefits, but can't think of a way to generate/create such diverse objects on the fly. The data in the xlsx is diverse and ideally the program is agnostic of any content.
So instead of, e.g., creating a list for each row, then putting that list in another list for each tab and each tab in another list, I'd like to store the information that each row represents in a single object and just have a list of different objects.
A small graphic to visualize the problem :
+--------------------------------------------------------------------+
|LIST |
| |
| +------------------+ +------------------+ +-----------------+ |
| | Class1 | | Class2 | | Class3 | |
| |------------------| |------------------| |-----------------| |
| | var1 | | var1 | | var5 | |
| | var2 | | var2 | | var6 | |
|... | var3 | | | | var7 |...|
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| +------------------+ +------------------+ +-----------------+ |
| |
+--------------------------------------------------------------------+
How about a generic class Row which will contain all the information in a row from your file. Then you simply create a list of Rows. Methods for the Row can allow you to get each column.
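A rough sketch of such a Row class, assuming each cell is stored under its column name. The method names (put, get, columns) are only suggestions, and reading the xlsx file itself is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Generic row: one object per spreadsheet row, whatever columns the tab has.
class Row {
    // LinkedHashMap keeps the columns in the order they were added.
    private final Map<String, Object> cells = new LinkedHashMap<>();

    // Stores a cell value under its column name; returns this for chaining.
    Row put(String column, Object value) {
        cells.put(column, value);
        return this;
    }

    // Returns the raw cell value, or null if the row has no such column.
    Object get(String column) {
        return cells.get(column);
    }

    // Typed access; throws ClassCastException if the cell has another type.
    <T> T get(String column, Class<T> type) {
        return type.cast(cells.get(column));
    }

    // Column names present in this row, in insertion order.
    Set<String> columns() {
        return cells.keySet();
    }
}
```

A tab then becomes a List<Row>, and rows from different tabs can live in the same list because the columns are data rather than fields.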
Without knowing more about the data, you will not be able to write classes to encapsulate it. You could "dynamically" create classes to create new source code. But then the question is, how would you use the new classes?
Well since you want to avoid a "list of lists" kind of solution there would be another way.
This might not be very efficient or fast but I don't have any experience with it, so maybe it isn't too bad. Here's the idea:
For each Row:
Use javassist to create as many fields as needed dynamically that contain each cell's information. Then create an instance of this class and store it in your list of rows. You could also add a field with information about this particular row (e.g. how many fields there are or their names or types or whatever you might need).
The number of fields or methods could also be determined using Reflection.
To get started with javassist there's a tutorial here.
Besides that I don't think there's much to do that does not involve some sort of List<List<SomeType>>

How do I selectively update columns in a table when using LOAD DATA INFILE?

I am trying to load data from a text file into a MySQL table, by calling MySQL's LOAD DATA INFILE from a Java process. This file can contain some data for the current date and also for previous days. The table can also contain data for previous dates. The problem is that some of the columns in the file for previous dates might have changed. But I don't want to update all of these columns but only want the latest values for some of the columns.
Example,
Table
+----+-------------+------+------+------+
| id | report_date | val1 | val2 | val3 |
+----+-------------+------+------+------+
|  1 | 2012-12-01  |   10 |    1 |    1 |
|  2 | 2012-12-02  |   20 |    2 |    2 |
|  3 | 2012-12-03  |   30 |    3 |    3 |
+----+-------------+------+------+------+
Data in Input file:
1|2012-12-01|10|1|1
2|2012-12-02|40|4|4
3|2012-12-03|40|4|4
4|2012-12-04|40|4|4
5|2012-12-05|50|5|5
Table after the load should look like
mysql> select * from load_infile_tests;
+----+-------------+------+------+------+
| id | report_date | val1 | val2 | val3 |
+----+-------------+------+------+------+
|  1 | 2012-12-01  |   10 |    1 |    1 |
|  2 | 2012-12-02  |   40 |    4 |    2 |
|  3 | 2012-12-03  |   40 |    4 |    3 |
|  4 | 2012-12-04  |   40 |    4 |    4 |
|  5 | 2012-12-05  |   50 |    5 |    5 |
+----+-------------+------+------+------+
5 rows in set (0.00 sec)
Note that the val3 values of existing rows are not updated. I also need to do this for large files, some of which can be 300 MB or more, so it needs to be a scalable solution.
Thanks,
Anirudha
LOAD DATA INFILE with the REPLACE option would be convenient, but in that case matching records are deleted and added again, so the old val3 values would be lost.
Instead, load the data into a temporary table, then update your table from the temporary table using INSERT ... SELECT and UPDATE statements, or an INSERT ... ON DUPLICATE KEY UPDATE statement.
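A sketch of the temporary-table approach over JDBC, assuming id is the primary key of load_infile_tests and the file is '|'-delimited as in the question. The class name StagedLoader and the staging table name load_infile_staging are made up, and the file path is assumed to come from a trusted source since it is concatenated into the statement.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical loader: stage the file, then merge only val1 and val2.
class StagedLoader {

    // Staging table with the same columns as the target (name is an assumption).
    static final String CREATE_STAGING_SQL =
        "CREATE TEMPORARY TABLE load_infile_staging LIKE load_infile_tests";

    // New ids are inserted whole; for existing ids only val1 and val2 change,
    // so val3 keeps its old value. VALUES(col) is deprecated in newer MySQL 8.0
    // releases but matches the MySQL versions current when this was asked.
    static final String MERGE_SQL =
        "INSERT INTO load_infile_tests (id, report_date, val1, val2, val3) "
      + "SELECT id, report_date, val1, val2, val3 FROM load_infile_staging "
      + "ON DUPLICATE KEY UPDATE val1 = VALUES(val1), val2 = VALUES(val2)";

    // Builds the LOAD DATA statement, escaping backslashes and single quotes.
    static String loadSql(String path) {
        String escaped = path.replace("\\", "\\\\").replace("'", "\\'");
        return "LOAD DATA INFILE '" + escaped + "' INTO TABLE load_infile_staging "
             + "FIELDS TERMINATED BY '|'";
    }

    // Runs the three steps on one connection, because a temporary table is
    // only visible to the session that created it.
    static void load(Connection conn, String path) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute(CREATE_STAGING_SQL);
            st.execute(loadSql(path));
            st.executeUpdate(MERGE_SQL);
            st.execute("DROP TEMPORARY TABLE load_infile_staging");
        }
    }
}
```

The bulk load still goes through LOAD DATA INFILE, so large files are read by the server in one pass and the merge is a single set-based statement.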

How can I get only the first level of children in Hibernate

If I have this tree structure in a table,
-A
|
*---B
| |
| *---C
| |
| *---D
| |
| *---E
| |
| *---F
| |
| *---G
*---H
*---I
|
*---J
assuming the list() method is called and it returns a collection of B and H.
In this scenario I would like Hibernate to fetch C, D, I and J in a single query. (lazy=false does not work because I don't need E, F and G, just the FIRST LEVEL.)
thanks a lot
I don't think there is any way to achieve this. Let others know if you find something
