I want to iterate over a dataframe partition by partition and, for each partition, iterate over all of its rows and build a deleteList containing an HBase Delete object for each row.
I'm using Spark and HBase with Java and I've created a Row object with the following code:
df.foreachPartition((ForeachPartitionFunction<Row>) iterator -> {
    while (iterator.hasNext()) {
        Row row = RowFactory.create(iterator.next());
        deleteList.add(new Delete(Bytes.toBytes(String.valueOf(row))));
    }
});
But it doesn't work, because I cannot access the row's value correctly. The df has a single column named "hbase_key".
It's hard to tell from your post which class exactly is Row, but I suspect it is org.apache.spark.sql.Row?
If that's the case, try the methods like getString(i) or similar, where i is the index of the column in the row you are trying to access.
Again, depending on how you are configuring your Hbase access, I suspect that in your case the 0 index would be the value of the row-key of the physical HBase table, and the subsequent indices will be the respective column values that are returned with your row. But again, that would depend on how exactly you arrived at this point in your code.
Your Row object should have methods to access other data types as well, such as getInt(i), etc.
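For example, since your df has a single column named "hbase_key", that value should be at index 0. A minimal sketch of how the loop could look (assuming the deleteList is created inside the partition function, because the lambda body runs on the executors, and noting that iterator.next() already yields a Row, so no RowFactory wrapping is needed):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Row;

df.foreachPartition((ForeachPartitionFunction<Row>) iterator -> {
    List<Delete> deleteList = new ArrayList<>();
    while (iterator.hasNext()) {
        Row row = iterator.next();              // already a Row, no RowFactory needed
        String hbaseKey = row.getString(0);     // "hbase_key" is the only column
        deleteList.add(new Delete(Bytes.toBytes(hbaseKey)));
    }
    // ... execute the deletes against the HBase table here ...
});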
I have a Spark Dataset loaded in memory and persisted to parquet files. There is a UI application where a user may define the value to be populated in a particular column of the Dataset. It could be a formula, where the value will depend on the values in different columns of the same Dataset row.
Initially I thought about a brute-force solution and wanted to iterate through the list and update certain column values, but that could be highly inefficient.
List<Row> listOfRows = dataframe.collectAsList();
for (Row oneRow : listOfRows) {
    // Process every single Row
}
Then I tried to use the Dataset.withColumn(..) API:
for (String cn : cvtCols) {
    if (cn.equalsIgnoreCase(columnName)) {
        dataframe = dataframe.withColumn(cn, <some value here>);
    }
}
However, that updated the whole dataset at once, and I don't see how to inject a formula here (in my case it's JavaScript) where there is a potential dependency on the values of other columns in the same row.
The first solution is very costly in resources. By calling the collectAsList() action, you are asking Spark to return all the elements of the dataframe as a list to the driver program, which can cause an OutOfMemoryError.
Also, any operation done after that action runs only on the driver program, without using the Spark executors.
In your case, you need to use withColumn() without the for loop. Then, to inject a formula that depends on other columns, you can replace the <some value here> with an expression that uses org.apache.spark.sql.functions.col. Refer to this link for more details: https://sparkbyexamples.com/spark/spark-dataframe-withcolumn/
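For instance, a minimal sketch (the column names "price" and "quantity" and the formula itself are made up for illustration; the real expression would encode the user-defined formula):

import static org.apache.spark.sql.functions.col;

// Hypothetical formula depending on other columns of the same row;
// Spark evaluates it per row, in parallel on the executors.
dataframe = dataframe.withColumn(
    columnName,
    col("price").multiply(col("quantity"))
);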
I am creating an agent-based model in AnyLogic 8.7. At one point I want to use a query to get a list of values from a database table (rs_table) with a condition. Here is the Java code that AnyLogic writes at the designated place:
(int) selectFrom(rs_table).where(rs_table.tr.eq(1)).list(rs_table.impact)
but I do not know how to store those values or how to reach them one by one. I would be grateful if you could help me out. Thanks.
I would use a collection. Add a collection element from the "Agent" palette. The collection should have the following properties:
Collection Class: LinkedList
Element Class: Integer
Use the following code:
collection.addAll(
    selectFrom(rs_table).where(rs_table.tr.eq(1)).list(rs_table.impact)
);
Now, you can access the value from the collection as follows:
collection.get(i);
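Or walk over everything that came back (a small sketch; traceln simply prints to the model console):

// Iterate over every stored impact value, in the order they were added.
for (int impact : collection) {
    traceln("impact = " + impact);
}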
The "Iterate over returned rows and do something" option of the Insert Database Query wizard is precisely designed for this. It produces query code that loops through the returned list and prints each column's value to the console (via a traceln call); you just replace the code within the loop with what you actually want to do for each returned row (where the template code shows you how to get the value of each column in the row).
The wizard (if you use the QueryDSL form) will produce code like below:
List<Tuple> rows = selectFrom(rs_table)
.where(rs_table.tr.eq(1))
.list();
for (Tuple row : rows) {
traceln(
row.get( rs_table.tr ) + "\t" +
row.get( rs_table.impact )
);
}
(with extra row.get lines for any other table columns beyond the tr and impact ones).
(In Java terms, the query's list function returns a List of Tuple objects as the code shows.)
I am building a model that allows users to configure a table in MongoDB and increment values based on row and column name. Below is my model:
public class Matrix {
    String id;
    List<String> columns;
    List<String> rows;
    long[][] values;
}
When a new object is saved, I populate the Matrix with all 0's,
object.setValues(new long[object.getRows().size()][object.getColumns().size()]);
My use case is: I need to increment the corresponding number when a word for a row and a column is encountered. Labels for columns and rows are stored in the corresponding lists. So, I am trying to do something like this:
matrix.update(<get the index of both row and column value and update values[][] accordingly>)
However, I can't seem to find/form a query that will do both (i.e. return the index and update the value). Another approach would be to get the document by id and increment it in Java, but that would need two db calls, roughly as sketched below.
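For illustration, that two-call fallback with the plain MongoDB Java driver would look roughly like this (a sketch; coll, id, rowLabel, and colLabel are illustrative names, and it assumes a recent driver where Document.getList is available):

import org.bson.Document;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;

// Call 1: fetch the document to recover the label lists.
Document doc = coll.find(Filters.eq("_id", id)).first();
int r = doc.getList("rows", String.class).indexOf(rowLabel);
int c = doc.getList("columns", String.class).indexOf(colLabel);
// Call 2: $inc the nested cell values[r][c], addressed via dot notation.
coll.updateOne(
    Filters.eq("_id", id),
    Updates.inc("values." + r + "." + c, 1L)
);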
Is there any alternate way to do this? Should I change my model?
Sorry if this is a newbie question.
If I have a table in my database called Settings, with mostly columns of type long, and I'm returning the last row in the table to a variable called results with this statement:
List results = em.createQuery("SELECT s FROM Settings s ORDER BY s.idsettings DESC")
        .setMaxResults(1)
        .getResultList();
It gives me an array of type Vector with each index holding a Settings array. How do I get access to the data in the Settings array? http://i.imgur.com/G8AxKKU.png
I need to get the column data and store them as longs in my java program so I can work with them.
It gives me an array of type Vector with each index holding a Settings array
No, as shown in your debugger, it returns a Vector that only contains one element (the one you want).
Just retrieve that object and use getters to read its columns.
Settings settings = (Settings) results.get(0);
long batLow = settings.getBatLow();
long batUp = settings.getBatUp();
...
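As a side note, a TypedQuery would avoid the cast (a sketch, assuming the same entity and column names; getSingleResult throws NoResultException if the table is empty):

// Typed variant: passing the entity class to createQuery returns a
// TypedQuery<Settings>, so no cast is needed.
Settings settings = em.createQuery(
        "SELECT s FROM Settings s ORDER BY s.idsettings DESC", Settings.class)
    .setMaxResults(1)
    .getSingleResult();
long batLow = settings.getBatLow();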
Using com.netflix.astyanax, I add entries for a given row as follows:
final ColumnListMutation<String> columnList = m.withRow(columnFamily, key);
columnList.putEmptyColumn(columnName);
Later I retrieve all my columns with:
final OperationResult<ColumnList<String>> operationResult = keyspace
.prepareQuery(columnFamily).getKey(key).execute();
operationResult.getResult().getColumnNames();
The following correctly returns all the columns I have added, but the columns are not ordered according to when they were entered in the database. Since each column has a timestamp associated with it, there ought to be a way to do exactly this, but I don't see it. Is there?
Note: If there isn't, I can always change the code above to:
columnList.putColumn(ip, new Date());
and then retrieve the column values and order them accordingly, but that seems cumbersome, inefficient, and silly, since each column already has a timestamp.
I know from PlayOrm that if you do column slices, it returns those in order. In fact, PlayOrm uses that to enable S-SQL in partitions, and basically batches the column slicing, which comes back in order or reverse order depending on how it is requested. You may want to do a column slice from 0 to MAXLONG.
I am not sure about getting the row though. I haven't tried that.
Oh, and PlayOrm is just a mapping layer on top of Astyanax, though not really relational and more noSQL'ish really, as demonstrated by its patterns page:
http://buffalosw.com/wiki/Patterns-Page/
Cassandra will never order your columns in "insertion order".
Columns are always ordered lowest first. It also depends on how cassandra interprets your column names. You can define the interpretation with the comparator you set when defining your column family.
From what you gave, it looks like you use String timestamp values. If you simply serialized your timestamps as, e.g., "123141" and "231", be aware that with a UTF8Type comparator "231" > "123141".
Better approach: use time-based UUIDs as column names, as many examples for time-series data in Cassandra propose. Then you can use the UUIDType comparator.
CREATE COLUMN FAMILY timeseries_data
WITH comparator = UUIDType
AND key_validation_class=UTF8Type;
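On the Astyanax side, the writes would then use a time-based UUID column name instead of a String (a sketch; it assumes a column family object declared with UUID column names, here called uuidColumnFamily, and Astyanax's TimeUUIDUtils helper, so double-check against your serializer setup):

import java.util.UUID;
import com.netflix.astyanax.util.TimeUUIDUtils;

// Column names are now time-based UUIDs, so the UUIDType comparator
// hands them back in creation order.
final ColumnListMutation<UUID> columnList = m.withRow(uuidColumnFamily, key);
columnList.putEmptyColumn(TimeUUIDUtils.getUniqueTimeUUIDinMicros());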