Sort Order in HBase with Pig/Pig Latin in Java

I created an HBase table in the shell and added some data. According to http://hbase.apache.org/book/dm.sort.html, datasets are sorted first by the row key and then by the column. So I tried something in the HBase shell:
hbase(main):013:0> put 'mytable', 'key1', 'cf:c', 'val'
0 row(s) in 0.0110 seconds
hbase(main):011:0> put 'mytable', 'key1', 'cf:d', 'val'
0 row(s) in 0.0060 seconds
hbase(main):012:0> put 'mytable', 'key1', 'cf:a', 'val'
0 row(s) in 0.0060 seconds
hbase(main):014:0> get 'mytable', 'key1'
COLUMN CELL
cf:a timestamp=1376468325426, value=val
cf:c timestamp=1376468328318, value=val
cf:d timestamp=1376468321642, value=val
3 row(s) in 0.0570 seconds
Everything looks fine. I got the expected order a -> c -> d.
Now I tried the same with Apache Pig in Java:
pigServer.registerQuery("mytable_data = load 'hbase://mytable' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf', '-loadKey true') as (rowkey:chararray, columncontent:map[]);");
printAlias("mytable_data"); // own function, which iterates over the keys
I got this result:
(key1,[c#val,d#val,a#val])
So now the order is c -> d -> a. That seems a little odd to me; shouldn't it be the same as in HBase? Getting the right order is important for me because I transform the map into a bag afterwards and then join it with other tables. If both inputs are sorted, I could use a merge join without sorting the two datasets. So does anyone know how to get the columns as a sorted map (or bag)?

You're fundamentally misunderstanding something: the HBaseStorage backend loads each row as a single Tuple. You've told Pig to load the column family cf as a map[], which is exactly what Pig is doing. A Pig map under the hood is just a java.util.HashMap, which has no defined iteration order.
There is currently no built-in way in Pig to convert the map to a bag, but that is a trivial UDF to write. Barring the null checks and other boilerplate, the body is something like:
public DataBag exec(Tuple input) throws IOException {
    // bagFactory / tupleFactory come from BagFactory.getInstance() and TupleFactory.getInstance()
    DataBag resultBag = bagFactory.newDefaultBag();
    // the map loaded by HBaseStorage is the first (and only) field of the input tuple
    Map<String, Object> map = (Map<String, Object>) input.get(0);
    for (Map.Entry<String, Object> entry : map.entrySet()) {
        Tuple t = tupleFactory.newTuple();
        t.append(entry.getKey());
        t.append(entry.getValue().toString());
        resultBag.add(t);
    }
    return resultBag;
}
With that UDF you can generate a bag{(k:chararray, v:chararray)}, use FLATTEN to get tuples of (k:chararray, v:chararray), and ORDER those by k.
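For illustration, a minimal sketch of that pipeline in the same pigServer style as above; the UDF name my.udfs.MapToBag and the jar path are made up for this example and not part of the original answer:
pigServer.registerJar("my-udfs.jar"); // jar containing the hypothetical MapToBag UDF sketched above
pigServer.registerQuery("mytable_data = load 'hbase://mytable' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf', '-loadKey true') as (rowkey:chararray, columncontent:map[]);");
pigServer.registerQuery("as_bag = foreach mytable_data generate rowkey, FLATTEN(my.udfs.MapToBag(columncontent)) as (k:chararray, v:chararray);");
pigServer.registerQuery("ordered_data = order as_bag by rowkey, k;");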
As for whether there is a way to get the data sorted: generally, no. If the number of fields in the column family is not constant, or the fields are not always the same or defined, your only options are
transforming the map to a bag of tuples and sorting them,
or writing a custom LoadFunc which takes a table and a column family and emits one tuple per KeyValue pair scanned. HBase will ensure the ordering and give you the data in the sorted order you see in the shell, but note that the order is only guaranteed upon loading; any further transformation you apply will destroy it.

Related

How to update Column values in a Spark Dataset

I have a Spark Dataset loaded in memory and persisted to parquet files. There is a UI application where a user may define the value to be populated in a particular column of the Dataset. It could be a formula where the value depends on the values of other columns in the same Dataset row.
Initially I thought about a brute-force solution and wanted to iterate through the list and update a certain column value, but that could be highly inefficient.
List<Row> listOfRows = dataframe.collectAsList();
for (Row oneRow : listOfRows) {
    // Process every single Row
}
Then I tried to use the Dataset.withColumn(..) API:
for (String cn : cvtCols) {
    if (cn.equalsIgnoreCase(columnName)) {
        dataframe = dataframe.withColumn(cn, <some value here>);
    }
}
However, that updates the whole Dataset at once, and I don't see how to inject a formula here (in my case it's JavaScript) where there is a potential dependency on the values of other columns in the same row.
The first solution is very costly in resources. By calling the collectAsList() action, you are asking Spark to return all the elements of the dataframe as a list to the driver program. This can cause an OutOfMemoryError.
Also, any operation done after that action runs only on the driver program, without using the Spark executors.
In your case, you need to use withColumn() without the for loop. Then, to inject the formula that depends on other columns, you can replace the <some value here> with an expression that uses org.apache.spark.sql.functions.col. Refer to this link for more details: https://sparkbyexamples.com/spark/spark-dataframe-withcolumn/
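For example (a minimal sketch; the column names salary, bonus and total are invented for illustration and not from the original question):
import static org.apache.spark.sql.functions.col;

// compute the target column from other columns of the same row,
// instead of looping over collected rows on the driver
dataframe = dataframe.withColumn("total", col("salary").plus(col("bonus")));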

How to store listed values from a database to variables in Anylogic 8.7.1?

I am creating an agent-based model in AnyLogic 8.7. At one point I want to use a query to get a list of values from a database table (rs_table) with a condition. Here is the Java code that AnyLogic writes in the designated place:
(int) selectFrom(rs_table) .where(rs_table.tr.eq(1)) .list(rs_table.impact)
but I do not know how to store those values and how to access them one by one. I would be grateful if you could help me out. Thanks.
I would use a collection. Add a collection element from the "Agent" palette. The collection should have the following properties:
Collection Class: LinkedList
Element Class: Int
Use the following code:
collection.addAll(
    selectFrom(rs_table)
        .where(rs_table.tr.eq(1))
        .list(rs_table.impact)
);
Now, you can access the value from the collection as follows:
collection.get(i);
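For example, to walk through all returned values (a minimal sketch; traceln just prints to the AnyLogic model console):
for (int i = 0; i < collection.size(); i++) {
    traceln("impact value " + i + ": " + collection.get(i));
}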
The "Iterate over returned rows and do something" option of the Insert Database Query wizard is precisely designed for this. It produces query code that loops through the returned list and prints each column's value to the console (via a traceln call); you just replace the code within the loop with what you actually want to do for each returned row (where the template code shows you how to get the value of each column in the row).
The wizard (if you use the QueryDSL form) will produce code like below:
List<Tuple> rows = selectFrom(rs_table)
    .where(rs_table.tr.eq(1))
    .list();
for (Tuple row : rows) {
    traceln(
        row.get(rs_table.tr) + "\t" +
        row.get(rs_table.impact)
    );
}
(with extra row.get lines for any other table columns beyond the tr and impact ones).
(In Java terms, the query's list function returns a List of Tuple objects as the code shows.)

How to write a query which will work like the key of a hash map

In MSSQL-2012.
I have a table which can have duplicate rows. There are two columns (col1, col2) which we can combine and use as a key in a hash map, so we can retrieve all the latest rows in Java. E.g.: Map
So if I have 100 rows in the table, a SQL query returns all 100 rows, while using the map I end up with, say, the 20 latest rows (just an example).
My question is: is it possible to write a query that replaces the map part, so that instead of using a map in Java we can retrieve similar data directly from the query?
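For context, the Java-side de-duplication described above might look roughly like this (a sketch; MyRow, getCol1(), getCol2() and getUpdatedAt() are made-up names, and "latest" is assumed to mean the greatest timestamp):
Map<String, MyRow> latestByKey = new HashMap<>();
for (MyRow row : allRows) {
    String key = row.getCol1() + "|" + row.getCol2(); // combined key, as described
    MyRow existing = latestByKey.get(key);
    if (existing == null || row.getUpdatedAt().after(existing.getUpdatedAt())) {
        latestByKey.put(key, row); // keep only the latest row per key
    }
}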

Merge CSV files with dynamic headers in Java

I have two or more .csv files which have the following data:
//CSV#1
Actor.id, Actor.DisplayName, Published, Target.id, Target.ObjectType
1, Test, 2014-04-03, 2, page
//CSV#2
Actor.id, Actor.DisplayName, Published, Object.id
2, Testing, 2014-04-04, 3
Desired Output file:
//CSV#Output
Actor.id, Actor.DisplayName, Published, Target.id, Target.ObjectType, Object.id
1, Test, 2014-04-03, 2, page,
2, Testing, 2014-04-04, , , 3
In case some of you wonder: the "." in the header is just additional information in the .csv file and shouldn't be treated as a separator (the "." results from converting a JSON file to CSV, preserving the nesting level of the JSON data).
My problem is that so far I have not found any solution which accepts different column counts.
Is there a clean way to achieve this? I don't have code yet, but I thought the following would work:
Read two or more files and add each row to a HashMap<Integer, String> // Integer = lineNumber, String = data, so that each file gets its own HashMap
Iterate through all indices and add the data to a new HashMap.
Why I think this idea is not so good:
If the header and the row data of file 1 differ from file 2 (etc.), the order won't be kept right.
I think this is what would result if I did the suggested thing:
//CSV#Suggested
Actor.id, Actor.DisplayName, Published, Target.id, Target.ObjectType, Object.id
1, Test, 2014-04-03, 2, page //wrong, because one "," is missing
2, Testing, 2014-04-04, 3 // wrong, because the 3 does not belong to Target.id. Furthermore the empty values won't be considered.
Is there a handy way I can merge the data of two or more files without(!) knowing how many elements the header contains?
This isn't the only answer, but hopefully it can point you in a good direction. Merging is hard; you're going to have to give it some rules, and you need to decide what those rules are. Usually you can break it down to a handful of criteria and go from there.
I wrote a "database" to deal with situations like this a while back:
https://github.com/danielbchapman/groups
It is basically just a Map<Integer, Map<Integer, Map<String, String>>>, which isn't all that complicated. What I'd recommend is that you read each row into a structure similar to:
(Set One) -> Map<Column, Data>
(Set Two) -> Map<Column, Data>
A Bidi map (as suggested in the comments) will make your lookups faster but carries some pitfalls if you have duplicate values.
Once you have these structures, your lookup can be as simple as:
public List<Row> process(Data one, Data two) // pseudo code
{
    List<Row> result = new ArrayList<>();
    for (Row row : one)
    {
        Id id = row.getId();
        Row additional = two.lookup(id);
        if (additional != null)
            merge(row, additional);
        result.add(row);
    }
    return result;
}

public void merge(Row a, Row b)
{
    // Your logic here... either mutating or returning a copy.
}
Nowhere in this solution am I worried about the columns as this is just acting on the raw data-types. You can easily remap all the column names either by storing them each time you do a lookup or by recreating them at output.
The reason I linked my project is that I'm pretty sure I have a few methods in there (such as outputting column names, etc.) that might save you considerable time / point you in the right direction.
I do a lot of TSV processing in my line of work and maps are my best friends.

What data structure could I use for counting occurrences of a country code?

I need some kind of data structure, but I don't know yet which would be most suitable.
Here is what I'm working with: I'm processing a bunch of rows of data, and each row has its own country code.
As a result, I want to obtain how many times each country code repeats throughout the process.
You might try a HashMap. With a HashMap, you can use the country code as the key, and the count of how many times each appears as the value stored in that key. If you encounter a particular country code for the first time, insert it into the map with an initial value of 1; otherwise, increment the existing value.
HashMap<String, Integer> myMap = new HashMap<String, Integer>();
for (... record : records) {
    String countryCode = record.getCountryCode();
    int curVal;
    if (myMap.containsKey(countryCode)) {
        curVal = myMap.get(countryCode);
        myMap.put(countryCode, curVal + 1);
    } else {
        myMap.put(countryCode, 1);
    }
}
// myMap now contains the count of each country code, which
// can be used for whatever purpose needed.
I would use a HashMap with the country code as the key and the count as the value. Build the map from your collection and increment the count if it is already in the map.
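On Java 8+ this can be written compactly with Map.merge (a sketch; Record and getCountryCode() are stand-ins for whatever row type you iterate over):
Map<String, Integer> counts = new HashMap<>();
for (Record record : records) {
    // inserts 1 on the first occurrence, otherwise adds 1 to the existing count
    counts.merge(record.getCountryCode(), 1, Integer::sum);
}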
Create a Map using country code String as the key and the current count as the value.
You realize, of course, that you can get such a thing directly out of a SQL query:
select country_code, count(country_code)
from your_table
group by country_code
order by country_code
You'll get a ResultSet with country code and count pairs. That's easy to load into a Map.
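For example, with plain JDBC (a sketch; the Connection is assumed to exist already, and your_table is the placeholder name from the query above):
Map<String, Integer> counts = new HashMap<>();
try (Statement st = connection.createStatement();
     ResultSet rs = st.executeQuery(
         "select country_code, count(country_code) as cnt " +
         "from your_table group by country_code order by country_code")) {
    while (rs.next()) {
        counts.put(rs.getString("country_code"), rs.getInt("cnt"));
    }
}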
To complete the answers with something other than HashMap:
If your country code list is, or can easily be turned into, a not-very-sparse numeric sequence, try an int[] or long[].
If your country code range is sparse but doesn't have many elements, create a CountryCode enum and use an EnumMap to store the counts:
Example:
Map<CountryCode, Long> countryCodeAppearances =
new EnumMap<CountryCode,Long>(CountryCode.class);
Lightweight data structures will perform better and impose less memory / garbage-collection overhead. So an array should be the fastest. EnumMap is kind of a hidden gem that, under the right circumstances, may also give you a performance boost.
Guava offers the AtomicLongMap
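A minimal sketch of using it for this counting task (assumes Guava is on the classpath; Record and getCountryCode() are the same stand-ins as above):
import com.google.common.util.concurrent.AtomicLongMap;

AtomicLongMap<String> counts = AtomicLongMap.create();
for (Record record : records) {
    counts.incrementAndGet(record.getCountryCode()); // thread-safe increment per country code
}
long deCount = counts.get("DE"); // read back a single count, e.g. for Germany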
