How to set ORC BytesColumnVector value to NULL? - java

I'm writing an ORC file using Groovy.
One of the columns is a String. The ORC column type is:
.addField("Name", TypeDescription.createString())
The column vector is:
BytesColumnVector vName = (BytesColumnVector) batch.cols[1]
The values to be assigned to vName may include NULLs, but I can't get ORC to write a null value into its data.
Attempting to assign a null value through set(), setValue() or setRef() throws a null pointer error, either at the point of assignment, or when the batch row is written deeper within ORC.
The closest I can get is this:
byte[] b = new byte[0]
vName.setRef (i,b,0,0)
but this puts an empty string into the data file, as shown in the following dump snippet (see the second column, 'Name'):
{"ProductID":355,"Name":"","MakeFlag":false,"StandardCost":0,"Weight":null,"ModifiedDate":"2014-02-08 10:01:36.827"}
Any thoughts on how to set a null string?
EDIT: With the answer to this question, I was able to complete some code to write the contents of a database table to ORC. It may be useful to people searching for ORC-related examples.
https://www.linkedin.com/pulse/orc-adls-polybase-ron-dunn/

An empty string is what I use. I don't think there's another way to do it.
Just make sure you mark the column as containing nulls.
Your code would ideally look like this:
BytesColumnVector vName = (BytesColumnVector) batch.cols[1];
byte[] EMPTY_BYTES = "".getBytes(StandardCharsets.UTF_8);
vName.setRef(i, EMPTY_BYTES, 0, 0);
vName.isNull[i] = true;
vName.noNulls = false;

Related

Apache poi getCell() returns wrong value

I'm trying to read a cell value like '150symbols150symbols150symbols150symbols150symbols150symbols150symbols150symbols150symbols150symbols150symbols150symbols150symbols150symbols150symbol'
In some cases I get the correct ('150...') value, and in others I receive '78.0'. At first I thought the cell type in my .xls was wrong, but after some checking I found that the cells all have the same type. Calling getCellType() also returns 1, which is CELL_TYPE_STRING.
In the end it works something like this:
String value1 = getCellValue(row.getCell(0)); // 150... correct value
String value2 = getCellValue(row.getCell(1)); // 150... correct value
String value3 = getCellValue(row.getCell(2)); // 78.0 incorrect value
private String getCellValue(Cell cell) {
switch (cell.getCellType()) {
case Cell.CELL_TYPE_STRING: //similar for other cell types
//getting cell value based on its type
}
}
I'm looking for advice and tips because I'm running out of ideas. Maybe I'm missing something?
This is how my Excel sheet looks: (screenshot not included)
P.S. There are many '150...' values in the sheet just for testing.
You are using getCell(int var1) to get the content of a cell, but that method returns a Cell object, not the content itself. You should use the methods of the Cell class to get the values in Excel cells. There are methods like
getDateCellValue()
getNumericCellValue()
getStringCellValue()
to get the cell content depending on the CellType. Using CellType, you can decide how to get the cell content.
UPDATE:
If the problem still exists, check the data type of the cell in Excel,
and make the necessary changes in the code to get the correct value.
Small tip: Also you can add a single quote ' at the beginning of the cell content. Then Apache POI will get the cell content as a String. The ' will not become a part of the cell content.

Can not modify value in JavaRDD

I have a question about how to update JavaRDD values.
I have a JavaRDD<CostedEventMessage> whose message objects contain information about which partition of a Kafka topic they should be written to.
I'm trying to change the partitionId field of such objects using the following code:
rddToKafka = rddToKafka.map(event -> repartitionEvent(event, numPartitions));
where the repartitionEvent logic is:
costedEventMessage.setPartitionId(1);
return costedEventMessage;
But the modification does not happen.
Could you please advise why, and how to correctly modify values in a JavaRDD?
Spark is lazy, so from the code you pasted above it's not clear whether you actually performed any action on the JavaRDD (like collect or foreach), or how you came to the conclusion that the data was not changed.
For example, if you assumed that by running the following code:
List<CostedEventMessage> messagesLst = ...;
JavaRDD<CostedEventMessage> rddToKafka = javaSparkContext.parallelize(messagesLst);
rddToKafka = rddToKafka.map(event -> repartitionEvent(event, numPartitions));
each element in messagesLst would have its partition set to 1, that assumption is wrong.
It would hold true if you added, for example:
messagesLst = rddToKafka.collect();
For more details refer to documentation
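The laziness described above can be observed without a cluster. The following is only an analogy: it uses java.util.stream from the JDK, not Spark, and the Event class and repartition helper are hypothetical stand-ins for CostedEventMessage and repartitionEvent. As with an RDD, map only records the transformation; the function runs when a terminal operation such as collect is called, and the transformed elements are found in the collected result.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class LazyMapDemo {
    /** Hypothetical, immutable stand-in for CostedEventMessage. */
    static final class Event {
        final String id;
        final int partitionId;

        Event(String id, int partitionId) {
            this.id = id;
            this.partitionId = partitionId;
        }
    }

    /** Counts how often the mapping function has actually run. */
    static final AtomicInteger calls = new AtomicInteger();

    /** Hypothetical stand-in for repartitionEvent: returns a copy with partition 1. */
    static Event repartition(Event e) {
        calls.incrementAndGet();
        return new Event(e.id, 1);
    }

    public static void main(String[] args) {
        List<Event> source = Arrays.asList(new Event("a", 0), new Event("b", 0));

        // Declaring the transformation does not run it.
        Stream<Event> pipeline = source.stream().map(LazyMapDemo::repartition);
        System.out.println("calls after map: " + calls.get());

        // The terminal operation triggers the work; the result holds the new values.
        List<Event> result = pipeline.collect(Collectors.toList());
        System.out.println("calls after collect: " + calls.get());
        System.out.println("result partition: " + result.get(0).partitionId);
        System.out.println("source partition: " + source.get(0).partitionId);
    }
}
```

In Spark the same reasoning applies to actions such as collect: inspect what the action returns rather than the list the RDD was created from.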

How can we convert a string to a working variable which fetch values in JAVA?

I have a case where I need to store the location of each key in a JSON document, so that for each key the program automatically looks up the location and returns the value from the JSON.
Here, the location of the key 'vehicle_id' inside the JSON array 'cars' is assigned to a variable like this:
String location="jresp.getJSONArray('cars').getJSONObject(0).getString('vehicle_id')"
How do I make it as a variable in JAVA such that this location fetches the value of vehicle_id for me from JSON? I need it like:
String value=jresp.getJSONArray('cars').getJSONObject(0).getString('vehicle_id');
so that it gives me a value. I've searched the net but couldn't find an answer anywhere. Please help me!
Instead of having the complete statement as a variable like -
String location="jresp.getJSONArray('cars').getJSONObject(0).getString('vehicle_id')"
You could have 3 different variables like -
String jsonArray = "cars";//You might need to do some string processing to get these
int objSeq = 0;
String key = "vehicle_id";
Then you can definitely use it in your Java statement -
String value=jresp.getJSONArray(jsonArray).getJSONObject(objSeq).getString(key);
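If many such locations have to be stored, the three variables can be bundled into one small object. The sketch below is an assumption-laden illustration, not part of the original answer: the KeyLocation class and the compact "cars[0].vehicle_id" notation are made up, and plain Map/List values stand in for the JSON objects so that the example runs without a JSON library. With the library used in the question, the body of resolve would be the statement shown above.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for "array name, element index, key" (not a library class). */
final class KeyLocation {
    // Assumed notation: arrayName[index].key, for example cars[0].vehicle_id
    private static final Pattern FORMAT = Pattern.compile("(\\w+)\\[(\\d+)\\]\\.(\\w+)");

    final String arrayName;
    final int index;
    final String key;

    KeyLocation(String arrayName, int index, String key) {
        this.arrayName = arrayName;
        this.index = index;
        this.key = key;
    }

    /** Parses the assumed notation into its three parts. */
    static KeyLocation parse(String text) {
        Matcher m = FORMAT.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported location: " + text);
        }
        return new KeyLocation(m.group(1), Integer.parseInt(m.group(2)), m.group(3));
    }

    /** Looks the value up in a document modelled as nested Map/List (stand-in for JSON). */
    String resolve(Map<String, List<Map<String, String>>> doc) {
        return doc.get(arrayName).get(index).get(key);
    }

    public static void main(String[] args) {
        Map<String, List<Map<String, String>>> doc = Collections.singletonMap(
                "cars",
                Collections.singletonList(Collections.singletonMap("vehicle_id", "V-42")));
        System.out.println(KeyLocation.parse("cars[0].vehicle_id").resolve(doc));
    }
}
```

Storing the parts as data keeps the lookup type-safe; executing an arbitrary Java statement held in a String would require scripting or reflection.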

How to sort a csv file?

I want to create a file in this format:
device1,t1,t2,t3,t4,t5
device2,t1,t2,t3,t4,t5
device3,t6,t7,t8,t9,t10
device4,t6,t7,t8,t9,t10
Here, t1, t2, ..., tn are time stamps.
Each value tn is generated by one execution of the JAR file, which also generates the device name.
I am able to generate a format like this using the JAR file now:
For example:
Current format in csv file:
device1,t1,device2,t2,device2,t3,device1,t4,device2,t5,device2,t6,device1,t7,device2,t8
I want this in this format in csv file:
device1-t1,t4,t7
device2-t2,t3,t5,t6,t8
So here, I have to put the time stamp belonging to specific devices on the right-hand side.
Please let me know how I can sort it in Java.
I will answer based on my understanding of your question.
You can create a HashMap that uses the device name as its key.
For the values, use a sorted collection (for example, a TreeSet).
Add each timestamp to the sorted collection stored under the corresponding device name key.
Because the collection is sorted, the timestamps stay in order as you add them.
Your HashMap will look like this:
key : value (collection)
device1 : t1, t4, t7
device2 : t2, t5, t8 (further timestamps are added to this collection)
Then write the contents of this HashMap to the CSV file.
That covers the Java side.
If you want the CSV file itself to be re-sorted whenever a new timestamp is added for a device, I don't think you can do that from Java; you would need some logic that works on the CSV file once all your data has been written to it.
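A minimal sketch of the map-plus-sorted-collection idea, assuming the flat input alternates device name and timestamp as in the question's example. The class and method names (DeviceTimestamps, group, toCsv) are made up for illustration, and a TreeMap is used instead of a HashMap only so that the devices are also written in a stable order. Note that TreeSet orders strings lexicographically, so "t10" would sort before "t2"; real timestamps would need a sortable text format or a numeric type.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

final class DeviceTimestamps {
    /** Groups a flat "device,timestamp,device,timestamp,..." line by device name. */
    static Map<String, SortedSet<String>> group(String flatLine) {
        String[] fields = flatLine.split(",");
        Map<String, SortedSet<String>> byDevice = new TreeMap<>();
        // Fields come in pairs: even index = device name, odd index = timestamp.
        for (int i = 0; i + 1 < fields.length; i += 2) {
            byDevice.computeIfAbsent(fields[i].trim(), k -> new TreeSet<>())
                    .add(fields[i + 1].trim());
        }
        return byDevice;
    }

    /** Renders one CSV line per device: device,timestamp,timestamp,... */
    static String toCsv(Map<String, SortedSet<String>> byDevice) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, SortedSet<String>> entry : byDevice.entrySet()) {
            out.append(entry.getKey())
               .append(',')
               .append(String.join(",", entry.getValue()))
               .append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String flat = "device1,t1,device2,t2,device2,t3,device1,t4,"
                + "device2,t5,device2,t6,device1,t7,device2,t8";
        System.out.print(toCsv(group(flat)));
    }
}
```

For the sample line from the question, this prints one row for device1 (t1, t4, t7) and one for device2 (t2, t3, t5, t6, t8).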
This is the solution:
I got output as:
Entire map:{Device1=[[t8], t9], Device2=[[[[[t2], t3], t5], t7], t10]}
// Assumed declarations (not shown in the original snippet):
Map<String, String> tree = new TreeMap<>(); // device name -> accumulated timestamps
List<String> values = new ArrayList<>();    // scratch list reused for every line

BufferedReader reader = new BufferedReader(new FileReader("results.csv"));
String eachline;
while ((eachline = reader.readLine()) != null) {
    String[] fields = eachline.split(",");
    if (Integer.parseInt(fields[2]) == 0) { // only rows whose third field is 0
        if (tree.get(fields[0]) != null) { // null if this key is not present yet
            values.add(tree.get(fields[0])); // start from the value stored so far for this device
        }
        values.add(fields[1]); // add the next timestamp to the previous value
        tree.put(fields[0], values.toString()); // write back to the map under the device name
        values.clear();
    }
}
reader.close();
System.out.println("Entire map:" + tree);

hbase: querying for specific value with dynamically created qualifier

Hi,
HBase allows a column family to have different qualifiers in different rows. In my case, a column family has the following specification:
abc[cnt] # where cnt is an integer that can be any positive integer
What I want to achieve is to get all the data from a different column family, but only if the value of the described qualifier (in a different column family) matches.
To narrow the Scan down, I add only the two families I need for the query, but that is as far as I have got for now.
I already achieved the same behaviour with a SingleColumnValueFilter, but there the qualifier was known in advance. Here the qualifier can be abc1, abc2, and so on; there would be too many options, and thus too many SingleColumnValueFilters.
Then I tried using the ValueFilter, but this filter returns only the columns that match the value, and thus the wrong column family.
Can you think of any way to achieve my goal: querying for a value within a dynamically created qualifier in one column family and returning the contents of that column family and another one (as specified when creating the Scan)? Preferably with only a single query.
Thanks in advance for any input.
UPDATE: (for clarification as discussed in the comments)
In a more graphical way, a row may have the following:
colfam1:aaa
colfam1:aab
colfam1:aac
colfam2:abc1
colfam2:abc2
I want to get all of the family colfam1 if any column of colfam2 has, for example, the value x, bearing in mind that colfam2:abc[cnt] is created dynamically, with cnt being any positive integer.
I see two approaches for this: client-side filtering or server-side filtering.
Client-side filtering is more straightforward. The Scan adds only the two families "colfam1" and "colfam2". Then, for each Result you get from scanner.next(), you must filter according to the qualifiers in "colfam2".
byte[] queryValue = Bytes.toBytes("x");
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("colfam1"));
scan.addFamily(Bytes.toBytes("colfam2"));
ResultScanner scanner = myTable.getScanner(scan);
Result res;
while ((res = scanner.next()) != null) {
    // Check whether any qualifier in colfam2 holds the queried value
    NavigableMap<byte[], byte[]> colfam2 = res.getFamilyMap(Bytes.toBytes("colfam2"));
    boolean foundQueryValue = false;
    SearchForQueryValue: while (!colfam2.isEmpty()) {
        Entry<byte[], byte[]> cell = colfam2.pollFirstEntry();
        if (Bytes.equals(cell.getValue(), queryValue)) {
            foundQueryValue = true;
            break SearchForQueryValue;
        }
    }
    if (foundQueryValue) {
        // Rebuild a Result that contains only the colfam1 cells of this row
        NavigableMap<byte[], byte[]> colfam1 = res.getFamilyMap(Bytes.toBytes("colfam1"));
        LinkedList<KeyValue> listKV = new LinkedList<KeyValue>();
        while (!colfam1.isEmpty()) {
            Entry<byte[], byte[]> cell = colfam1.pollFirstEntry();
            listKV.add(new KeyValue(res.getRow(), Bytes.toBytes("colfam1"), cell.getKey(), cell.getValue()));
        }
        Result filteredResult = new Result(listKV);
        // Use filteredResult here
    }
}
scanner.close();
(This code was not tested)
Finally, filteredResult is what you want. This approach is not elegant and might also cause performance issues if you have a lot of data in those families: if "colfam1" holds a lot of data, you don't want to transfer it to the client only to discard it when value "x" is not in any qualifier of "colfam2".
Server-side filtering requires you to implement your own Filter class; I believe you cannot use the provided filter types to do this. Implementing your own Filter takes some work, and you also need to compile it as a .jar and make it available to all RegionServers. But it helps you avoid sending loads of "colfam1" data in vain.
It is too much work for me to show you how to custom implement a Filter, so I recommend reading a good book (HBase: The Definitive Guide for example). However, the Filter code will look pretty much like the client-side filtering I showed you, so that's half of the work done.
