I have a Spark Dataset loaded in memory and persisted to parquet files. There is a UI application where a user may define the value to be populated in a particular Column of the Dataset. It could be a formula where the value depends on the values of different Columns in the same Dataset Row.
Initially I thought about a brute-force solution and wanted to iterate through the List and update a certain Column value, but it could be highly inefficient.
List<Row> listOfRows = dataframe.collectAsList();
for (Row oneRow : listOfRows) {
    // Process every single Row
}
Then I tried to use the Dataset.withColumn(..) API:
for (String cn : cvtCols) {
    if (cn.equalsIgnoreCase(columnName)) {
        dataframe = dataframe.withColumn(cn, <some value here>);
    }
}
However, that updated the whole Dataset at once, and I don't see how to inject a formula here (in my case it's JavaScript) where there is a potential dependency on the other Column values in the same row.
The first solution is very costly in resources. By calling the collectAsList() action, you are asking Spark to return all the elements of the dataframe as a list to the driver program. This can cause an OutOfMemoryError.
Also, any operation done after that action will be executed only on the driver program, without using the Spark executors.
In your case, you need to use withColumn() without the for loop. Then, to inject the formula that depends on other columns, you can replace the <some value here> with an expression that uses org.apache.spark.sql.functions.col. Refer to this link for more details: https://sparkbyexamples.com/spark/spark-dataframe-withcolumn/
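For example, a minimal sketch (the column names price, quantity, and total are hypothetical, only to illustrate a row-wise formula):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.expr;

// Compute the new column from other columns of the same row
dataframe = dataframe.withColumn("total", col("price").multiply(col("quantity")));

// Or express the formula as a SQL string, which may be easier to build from user-defined input
dataframe = dataframe.withColumn("total", expr("price * quantity"));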
I want to iterate a dataframe by partitions, and for each partition iterate over all of its rows and create a deleteList containing an HBase Delete object for each row.
I'm using Spark and HBase with Java and I've created a Row object with the following code:
df.foreachPartition((ForeachPartitionFunction<Row>) iterator -> {
    while (iterator.hasNext()) {
        Row row = RowFactory.create(iterator.next());
        deleteList.add(new Delete(Bytes.toBytes(String.valueOf(row))));
    }
});
But it won't work because I cannot access the row's value correctly. Note that df has one column named "hbase_key".
It's hard to tell from your post which class exactly Row is, but I suspect it is org.apache.spark.sql.Row?
If that's the case, try the methods like getString(i) or similar, where i is the index of the column in the row you are trying to access.
Again, depending on how you are configuring your HBase access, I suspect that in your case index 0 would be the value of the row key of the physical HBase table, and the subsequent indices will be the respective column values returned with your row. But again, that would depend on how exactly you arrived at this point in your code.
Your Row object should have methods to access other data types as well, such as getInt(i), etc.
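For your snippet, that would look roughly like this (a sketch assuming the single "hbase_key" column holds the HBase row key as a String; note that iterator.next() already returns a Spark SQL Row, so wrapping it with RowFactory.create is not needed):

df.foreachPartition((ForeachPartitionFunction<Row>) iterator -> {
    List<Delete> deleteList = new ArrayList<>();
    while (iterator.hasNext()) {
        Row row = iterator.next();              // already an org.apache.spark.sql.Row
        String hbaseKey = row.getString(0);     // or row.getAs("hbase_key")
        deleteList.add(new Delete(Bytes.toBytes(hbaseKey)));
    }
    // ... issue the deletes for this partition here ...
});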
As a newbie, I want to sum the values of a column pv from a database table evm in my model and store the result in a variable. I have tried the SQL code SELECT SUM(pv) FROM evm; but that doesn't seem to work. I would be grateful for any help on how to pull this off.
You can always write a native query and use the result set to populate the fields of a POJO/DTO. Once you have the list of POJOs/DTOs built from the result set, perform your sum on the field by iterating over the list.
You can just use the SQL you have suggested. (The database in an AnyLogic model is a standard HSQLDB database, which supports this SQL syntax.)
The simplest way to execute it is to use AnyLogic's built-in functions for such queries (as would be produced by the Insert Database Query wizard), so
mySumVariable = selectFirstValue("SELECT SUM(pv) FROM evm;");
You didn't say what errors you had; obviously the table and column have to exist (and the column you're summing needs to be numeric, though NULLs are OK), as does the variable you're assigning the sum to.
If you wanted to do this in a way which more easily fits one of the standard query 'forms' suggested by the wizard (i.e., not having to know particular SQL syntax), you could just adapt the "Iterate over returned rows and do something" code to explicitly sum the columns; e.g., using the Query DSL format this time:
List<Tuple> rows = selectFrom(evm).list();
for (Tuple row : rows) {
    mySumVariable += row.get(evm.pv);
}
I am creating an agent-based model in AnyLogic 8.7. At one point I want to use a query to get a List of values from a database table (rs_table) with a condition; here is the Java code that AnyLogic writes at the designated place:
(int) selectFrom(rs_table)
    .where(rs_table.tr.eq(1))
    .list(rs_table.impact)
but I do not know how to store those values and how to reach them one by one. I would be grateful if you could help me out. Thanks.
I would use a collection. Add a collection element from the "Agent" palette. The collection should have the following properties:
Collection Class: LinkedList
Element Class: Int
Use the following code:
collection.addAll(
    selectFrom(rs_table)
        .where(rs_table.tr.eq(1))
        .list(rs_table.impact)
);
Now, you can access the value from the collection as follows:
collection.get(i);
The "Iterate over returned rows and do something" option of the Insert Database Query wizard is precisely designed for this. It produces query code that loops through the returned list and prints each column's value to the console (via a traceln call); you just replace the code within the loop with what you actually want to do for each returned row (where the template code shows you how to get the value of each column in the row).
The wizard (if you use the QueryDSL form) will produce code like below:
List<Tuple> rows = selectFrom(rs_table)
    .where(rs_table.tr.eq(1))
    .list();
for (Tuple row : rows) {
    traceln(
        row.get(rs_table.tr) + "\t" +
        row.get(rs_table.impact)
    );
}
(with extra row.get lines for any other table columns beyond the tr and impact ones).
(In Java terms, the query's list function returns a List of Tuple objects as the code shows.)
I am using JDBI to iterate through a result set via streams. Currently mapToMap is causing problems when there are columns with the same name in the result. What I need is just the values, without the column names.
Is there a way to map the results to an Object list/array? The docs do not have an example for this. I would like to have something like
query.mapTo(List<Object>.class).useStream(s -> { . . .})
First of all: what kind of use case would allow you to not care at all about the column names but only the values? I am genuinely curious.
If it does make sense, it is trivial to implement a RowMapper<List<Object>> in your case, which runs through all the columns by index and puts the results of rs.getObject(i) into a list.
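For example, a minimal sketch (assuming JDBI 3; the class name is made up):

import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import org.jdbi.v3.core.mapper.RowMapper;
import org.jdbi.v3.core.statement.StatementContext;

// Maps each row to the list of its column values, ignoring column names entirely.
public class RowToListMapper implements RowMapper<List<Object>> {
    @Override
    public List<Object> map(ResultSet rs, StatementContext ctx) throws SQLException {
        int columnCount = rs.getMetaData().getColumnCount();
        List<Object> values = new ArrayList<>(columnCount);
        for (int i = 1; i <= columnCount; i++) { // JDBC column indices are 1-based
            values.add(rs.getObject(i));
        }
        return values;
    }
}

which you could then use along the lines of query.map(new RowToListMapper()).useStream(s -> { ... }).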
Using com.netflix.astyanax, I add entries for a given row as follows:
final ColumnListMutation<String> columnList = m.withRow(columnFamily, key);
columnList.putEmptyColumn(columnName);
Later I retrieve all my columns with:
final OperationResult<ColumnList<String>> operationResult = keyspace
.prepareQuery(columnFamily).getKey(key).execute();
operationResult.getResult().getColumnNames();
The following correctly returns all the columns I have added, but the columns are not ordered according to when they were entered in the database. Since each column has a timestamp associated with it, there ought to be a way to do exactly this, but I don't see it. Is there?
Note: If there isn't, I can always change the code above to:
columnList.putColumn(ip,new Date());
and then retrieve the column values, order them accordingly, but that seems cumbersome, inefficient, and silly since each column already has a timestamp.
I know from PlayOrm that if you do column slices, it returns those in order. In fact, PlayOrm uses that to enable S-SQL in partitions and basically batches the column slicing, which comes back in order or reverse order depending on how it is requested. You may want to do a column slice from 0 to MAXLONG.
I am not sure about getting the row though. I haven't tried that.
Oh, and PlayOrm is just a mapping layer on top of Astyanax, though not really relational and more NoSQL'ish really, as demonstrated by its patterns page:
http://buffalosw.com/wiki/Patterns-Page/
Cassandra will never order your columns in "insertion order".
Columns are always ordered lowest first. It also depends on how Cassandra interprets your column names. You can define the interpretation with the comparator you set when defining your column family.
From what you gave, it looks like you use String timestamp values. If you simply serialized your timestamps as, e.g., "123141" and "231", be aware that with a UTF8Type comparator "231" > "123141".
A better approach: use time-based UUIDs as column names, as many examples for time-series data in Cassandra propose. Then you can use the UUIDType comparator.
CREATE COLUMN FAMILY timeseries_data
WITH comparator = UUIDType
AND key_validation_class=UTF8Type;
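On the Astyanax side, inserting with time-based UUID column names might then look roughly like this (a sketch: uuidColumnFamily is a hypothetical ColumnFamily<String, UUID> declared with a UUID serializer for its column names, and TimeUUIDUtils is Astyanax's helper for time-based UUIDs):

import java.util.UUID;
import com.netflix.astyanax.util.TimeUUIDUtils;

// The column name is a time-based UUID, so the UUIDType comparator orders columns chronologically
UUID timeUuid = TimeUUIDUtils.getUniqueTimeUUIDinMillis();
final ColumnListMutation<UUID> columnList = m.withRow(uuidColumnFamily, key);
columnList.putEmptyColumn(timeUuid);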