Write data in Apache Parquet format - java

I have a scheduler that gets our cluster metrics and writes the data to an HDFS file using an older version of the Cloudera API. We recently updated our JARs, and the original code now fails with an exception:
java.lang.ClassCastException: org.apache.hadoop.io.ArrayWritable cannot be cast to org.apache.hadoop.hive.serde2.io.ParquetHiveRecord
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:116)
at parquet.hadoop.ParquetWriter.write(ParquetWriter.java:324)
I need help using the ParquetHiveRecord class to write the data (which are POJOs) in Parquet format.
Code sample below:
Writable[] values = new Writable[20];
... // populate values with all values
ArrayWritable value = new ArrayWritable(Writable.class, values);
writer.write(value); // <-- Getting exception here
Details of "writer" (of type ParquetWriter):
MessageType schema = MessageTypeParser.parseMessageType(SCHEMA); // SCHEMA is a string with our schema definition
ParquetWriter<ArrayWritable> writer = new ParquetWriter<ArrayWritable>(fileName, new DataWritableWriteSupport() {
    @Override
    public WriteContext init(Configuration conf) {
        if (conf.get(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA) == null)
            conf.set(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA, schema.toString());
        return super.init(conf); // init() must return a WriteContext; delegate to the parent after setting the schema
    }
});
Also, we were using CDH and CM 5.5.1 before; now we are using 5.8.3.
Thanks!

I think you need to use DataWritableWriter rather than ParquetWriter. The class cast exception indicates that the write support class expects an instance of ParquetHiveRecord instead of ArrayWritable. DataWritableWriter likely breaks down the individual records in the ArrayWritable into individual messages in the form of ParquetHiveRecord and sends each one to the write support.
Parquet is sort of mind bending at times. :)

Looking at the code of the DataWritableWriteSupport class:
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java
you can see that it already uses DataWritableWriter internally, so you do not need to create an instance of DataWritableWriter yourself. The idea of a write support class is that you can write different formats to Parquet.
What you do need is to wrap your writables in a ParquetHiveRecord.
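For example, a minimal sketch of that wrapping (not tested against your schema: the field names "hostname" and "cpuPercent" and their types are made up for illustration, and the writer is the same DataWritableWriteSupport-backed writer as above, just typed to ParquetHiveRecord):
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.hive.serde2.io.ParquetHiveRecord;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// Build a struct ObjectInspector that mirrors the Parquet/Hive schema (field order matters).
List<String> fieldNames = Arrays.asList("hostname", "cpuPercent");
List<ObjectInspector> fieldInspectors = Arrays.asList(
        PrimitiveObjectInspectorFactory.javaStringObjectInspector,
        PrimitiveObjectInspectorFactory.javaDoubleObjectInspector);
StructObjectInspector inspector =
        ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldInspectors);

// With the standard struct inspector, one row is simply a List of Java values in field order.
List<Object> row = Arrays.asList("node-01", 73.5);

// Declare the writer as ParquetWriter<ParquetHiveRecord> instead of ParquetWriter<ArrayWritable>.
writer.write(new ParquetHiveRecord(row, inspector));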

Related

How to manually read data from Flink's checkpoint file and keep it in Java memory

We need to read data from our checkpoints manually for different reasons (let's say we need to change our state object/class structure, so we want to read, restore, and copy the data to a new type of object).
While we are reading, everything is good, but when we want to keep/store it in memory after deploying to the Flink cluster, we get an empty list/map. In the logs we see that we are reading and adding all our data properly to the list/map, but as soon as our method completes its work we lose the data; the list/map is empty :(
val env = ExecutionEnvironment.getExecutionEnvironment();
val savepoint = Savepoint.load(env, checkpointSavepointLocation, new HashMapStateBackend());
private List<KeyedAssetTagWithConfig> keyedAssetsTagWithConfigs = new ArrayList<>();
val keyedStateReaderFunction = new KeyedStateReaderFunctionImpl();
savepoint.readKeyedState("my-uuid", keyedStateReaderFunction)
.setParallelism(1)
.output(new MyLocalCollectionOutputFormat<>(keyedAssetsTagWithConfigs));
env.execute("MyJobName");
private static class KeyedStateReaderFunctionImpl extends KeyedStateReaderFunction<String, KeyedAssetTagWithConfig> {
private MapState<String, KeyedAssetTagWithConfig> liveTagsValues;
private Map<String, KeyedAssetTagWithConfig> keyToValues = new ConcurrentHashMap<>();
@Override
public void open(final Configuration parameters) throws Exception {
liveTagsValues = getRuntimeContext().getMapState(ExpressionsProcessor.liveTagsValuesStateDescriptor);
}
@Override
public void readKey(final String key, final Context ctx, final Collector<KeyedAssetTagWithConfig> out) throws Exception {
liveTagsValues.iterator().forEachRemaining(entry -> {
keyToValues.put(entry.getKey(), entry.getValue());
log.info("key {} -> {} val", entry.getKey(), entry.getValue());
out.collect(entry.getValue());
});
}
public Map<String, KeyedAssetTagWithConfig> getKeyToValues() {
return keyToValues;
}
}
As soon as this code executes, I expect to have all the values inside the map returned by keyedStateReaderFunction.getKeyToValues(). But it returns an empty map. However, I see in the logs that we are reading all of them properly. The data is even empty inside the keyedAssetsTagWithConfigs list that we read the output into.
If anyone has any idea it would be very helpful, because I'm lost; I have never had the experience of putting data into a map and then losing it :) When I serialize my map or list and write it to a text file and then deserialize it from there (using Jackson), I see that my data exists, but this is not a solution, just a kind of workaround.
Thanks in advance
The code you show creates and submits a Flink job to be executed in its own environment orchestrated by the Flink framework: https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/flink-architecture/#flink-application-execution
The job runs independently of the code that builds and submits it, so when you call keyedStateReaderFunction.getKeyToValues(), you are calling the method on the object that was used to build the job, not on the actual object that ran in the Flink execution environment.
Your workaround seems like a valid option to me. You can then submit the file with your savepoint contents to your new job to recreate its state as you'd like.
You have an instance of KeyedStateReaderFunctionImpl in the Flink client which gets serialized and sent to each task manager. Each task manager then deserializes a copy of that KeyedStateReaderFunctionImpl and calls its open and readKey methods, and gradually builds up a private Map containing its share of the data extracted from the savepoint/checkpoint.
Meanwhile the original KeyedStateReaderFunctionImpl back in the Flink client has never had its open or readKey methods called, and doesn't hold any data.
In your case the parallelism is one, so there is only one task manager, but in general you will need to collect the output from each task manager and assemble the complete results from these pieces. These results are not available in the Flink client process because the work hasn't been done there.
I found a solution: start the job in attached mode and collect the results in the main thread.
val env = ExecutionEnvironment.getExecutionEnvironment();
val configuration = env.getConfiguration();
configuration
.setBoolean(DeploymentOptions.ATTACHED, true);
...
val myresults = dataSource.collect();
Hope this helps somebody else, because I wasted a couple of days trying to find a solution.
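For reference, a fuller sketch of that attached/collect approach, assuming the DataSet-based State Processor API used above; KeyedStateReaderFunctionImpl, KeyedAssetTagWithConfig, and checkpointSavepointLocation come from the question, and getKey() on the POJO is an assumption to adapt:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.getConfiguration().setBoolean(DeploymentOptions.ATTACHED, true);

ExistingSavepoint savepoint =
        Savepoint.load(env, checkpointSavepointLocation, new HashMapStateBackend());

DataSet<KeyedAssetTagWithConfig> tags =
        savepoint.readKeyedState("my-uuid", new KeyedStateReaderFunctionImpl());

// collect() triggers the job and ships the results back into this (client) process,
// so there is no need for a local output format or a separate env.execute() call.
List<KeyedAssetTagWithConfig> allTags = tags.collect();

Map<String, KeyedAssetTagWithConfig> keyToValues = new HashMap<>();
for (KeyedAssetTagWithConfig tag : allTags) {
    keyToValues.put(tag.getKey(), tag); // getKey() is assumed; use whatever identifies your records
}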

Using Thrift for offline serialisation?

I have been using the Apache Thrift protocol for tablet-server and inter-language integration, and everything has been fine for a few years.
The integration is between languages (C#/C++/PC Java/Dalvik Java), and Thrift is probably one of the simplest and safest options. So I want to pack and repack sophisticated data structures (which have changed over the years) with the Thrift library. Let's say, in Thrift terms, a kind of OfflineTransport or OfflineProtocol.
Scenario:
I want to build a backup solution, for example to process data in offline mode during an internet provider failure: serialize it, store it, and try to process it in a few ways. For example, send the serialized data by normal email over a poor backup connection, etc.
Question is: where in Thrift philosophy is best extension point for me?
I understand that only part of an online protocol can be backed up offline, i.e. real-time return of values is not possible; that is OK.
Look for a serializer. There are miscellaneous implementations, but they all share the same common concept of using a buffer or a file/stream as the transport medium:
Writing data in C#
E.g. we plan to store the bits into a byte[] buffer. So one could write:
var trans = new TMemoryBuffer();
var prot = new TCompactProtocol( trans);
var instance = GetMeSomeDataInstanceToSerialize();
instance.Write(prot);
Now we can get a hold of the data:
var data = trans.GetBuffer();
Reading data in C#
Reading works similarly, except that you need to know from somewhere which root instance to construct:
var trans = new TMemoryBuffer( serializedBytes);
var prot = new TCompactProtocol( trans);
var instance = new MyCoolClass();
instance.Read(prot);
Additional Tweaks
One solution to the chicken-egg problem during load could be to use a union as an extra serialization container:
union GenericFileDataContainer {
1 : MyCoolClass coolclass;
2 : FooBar foobar
// more to come later
}
By always using this container as the root instance during serialization, it is easy to add more classes without breaking compatibility, and there is no need to know up front what exactly is in a file - you just read it and check which element is set in the union.
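On the Java side this looks much the same; here is a sketch using libthrift's TSerializer/TDeserializer with the union as the root object. The getter/setter/isSet names below depend on the code the Thrift compiler generates from the union above, so treat them as assumptions:
import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TCompactProtocol;

static byte[] pack(MyCoolClass coolClass) throws TException {
    // Always serialize the union as the root object, whatever it currently carries.
    GenericFileDataContainer container = new GenericFileDataContainer();
    container.setCoolclass(coolClass);
    return new TSerializer(new TCompactProtocol.Factory()).serialize(container);
}

static void unpack(byte[] bytes) throws TException {
    // Deserialize into an empty union, then check which field is actually set.
    GenericFileDataContainer restored = new GenericFileDataContainer();
    new TDeserializer(new TCompactProtocol.Factory()).deserialize(restored, bytes);
    if (restored.isSetCoolclass()) {
        MyCoolClass coolClass = restored.getCoolclass();
        // handle MyCoolClass ...
    } else if (restored.isSetFoobar()) {
        // handle FooBar ...
    }
}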
There is an RPC framework named "Thrifty" that uses the standard Thrift protocol. It has the same effect as using the Thrift IDL to define the service; that is, Thrifty can be compatible with code that uses the Thrift IDL, which is very helpful for cross-platform work. It also has a ThriftSerializer class:
[ThriftStruct]
public class LogEntry
{
[ThriftConstructor]
public LogEntry([ThriftField(1)]String category, [ThriftField(2)]String message)
{
this.Category = category;
this.Message = message;
}
[ThriftField(1)]
public String Category { get; }
[ThriftField(2)]
public String Message { get; }
}
var logEntry = new LogEntry("category", "message");
ThriftSerializer serializer = new ThriftSerializer(ThriftSerializer.SerializeProtocol.Binary);
byte[] bytes = serializer.Serialize<LogEntry>(logEntry); // serialize an instance, not the serializer itself
LogEntry restored = serializer.Deserialize<LogEntry>(bytes);
You can try it: https://github.com/endink/Thrifty

Two questions about Flink deserializing

I'm very new to Flink and cluster computing. I spent all day trying to correctly parse a simple stream from Kafka in Flink with no results: it's a bit frustrating...
I have a stream of JSON-LD messages in Kafka, identified by a string key. I simply would like to retrieve them in Flink and then separate messages with different keys.
1)
Initially I considered sending the messages as Strings instead of JSON-LD. I thought it would be easier...
I tried every deserializer but none works. The simple deserializer obviously works, but it completely ignores keys.
I believed I had to use this (Flink apparently has just two deserializers which support keys):
DataStream<Object> stream = env
.addSource(new FlinkKafkaConsumer010<>("topicTest", new TypeInformationKeyValueSerializationSchema(String.class, String.class, env.getConfig()), properties))
.rebalance();
stream.print();
But I obtain:
06/12/2017 02:09:12 Source: Custom Source(4/4) switched to FAILED
java.io.EOFException
at org.apache.flink.runtime.util.DataInputDeserializer.readUnsignedByte(DataInputDeserializer.java:306)
How can I receive stream messages without losing the keys?
2)
My Kafka producer is implemented in JavaScript; since Flink supports JSON deserialization, I thought of sending JSON objects directly to Kafka.
I'm not sure that works correctly with JSON-LD, but I've used:
json.parse(jsonld_message)
to serialize the message as JSON. Then I sent it with the usual string key.
But in Flink this code doesn't work:
DataStream<ObjectNode> stream = env
.addSource(new FlinkKafkaConsumer010<>("topicTest", new JSONKeyValueDeserializationSchema(false), properties))
.rebalance();
stream.print();
raising a
JsonParserException.
I think the first approach is simpler and I prefer it, because it allows me to consider one problem at a time (first: receive the data; second: convert the string back to JSON-LD with an external library, I guess).
SOLVED:
Finally I decided to implement a custom deserializer implementing the KeyedDeserializationSchema interface.
In order to use Flink's TypeInformationKeyValueSerializationSchema to read data from Kafka, the data must have been written in a compatible way. Assuming that your key and value are of type String, they must be written in a way that Flink's StringSerializer understands.
Consequently, you have to make sure that your Kafka producer writes the data in a compatible way. Otherwise Flink won't be able to read it.
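For illustration, here is a sketch of what "compatible" means on the producer side. This only helps if the producer is itself a Flink/Java job (the question's producer is JavaScript), and the broker address and sample record are placeholders:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092"); // placeholder

TypeInformationKeyValueSerializationSchema<String, String> schema =
        new TypeInformationKeyValueSerializationSchema<>(String.class, String.class, env.getConfig());

DataStream<Tuple2<String, String>> records = env.fromElements(
        Tuple2.of("key-1", "{\"@context\":\"http://schema.org\"}"));

// The same schema is then used on the consumer side, so keys and values round-trip through
// Flink's own serializers instead of raw UTF-8 bytes.
records.addSink(new FlinkKafkaProducer010<>("topicTest", schema, properties));
env.execute("write-keyed-records");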
I faced a similar issue. Ideally, TypeInformationKeyValueSerializationSchema with String types for keys and values should have been able to read my Kafka records, which have both keys and values as Strings, but it was not able to and failed with an EOF exception, as pointed out in the post above. So this issue is easily reproducible and needs to be fixed. Please let me know if I can be of any help in that process. In the meantime I implemented a custom deserializer using KafkaDeserializationSchema. Here is the code, since there is little documentation about reading keys/values and additional things:
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.kafka.clients.consumer.ConsumerRecord;
public class CustomKafkaSerializer implements KafkaDeserializationSchema<Tuple2<String,String>> {
@Override
public boolean isEndOfStream(Tuple2<String,String> stringStringPair) {
return false;
}
@Override
public Tuple2<String,String> deserialize(ConsumerRecord<byte[], byte[]> consumerRecord) throws Exception {
String key = new String(consumerRecord.key());
String value = new String(consumerRecord.value());
return new Tuple2<>(key,value);
}
@Override
public TypeInformation<Tuple2<String,String>> getProducedType() {
return TypeInformation.of(new TypeHint<Tuple2<String, String>>(){});
}
}
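A hedged usage sketch for the schema above (broker address, group id, and topic are placeholders; FlinkKafkaConsumer here is the universal connector, which accepts a KafkaDeserializationSchema directly):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
properties.setProperty("group.id", "test-group");              // placeholder

DataStream<Tuple2<String, String>> stream = env
        .addSource(new FlinkKafkaConsumer<>("topicTest", new CustomKafkaSerializer(), properties))
        .rebalance();

stream.print(); // each record is (key, value); split or key by f0 as needed
env.execute("read-keyed-records");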

How can I get the path of my Neo4j <config storeDirectory=""> in a Batch Inserter method?

I'm using Neo4j 2.2.8 and Spring Data in a web application. I'm using xml to configure my database, like:
<neo4j:config storeDirectory="S:\Neo4j\mybase" />
But I'm trying to use a Batch Inserter to add more than 1 million nodes sourced from a .txt file. After reading the file and building the List of objects, my batch code looks something like this:
public void batchInserter(List<Objects> objects) {
BatchInserter inserter = null;
try {
inserter = BatchInserters.inserter("S:\\Neo4j\\mybase");
Label movimentosLabel = DynamicLabel.label("Movimentos");
inserter.createDeferredSchemaIndex(movimentosLabel).on("documento").create();
for (Objects objs : objects) {
Map<String, Object> properties = new HashMap<>();
properties.put("documento", objs.getDocumento());
long movimento = inserter.createNode(properties, movimentosLabel);
DynamicRelationshipType relacionamento = DynamicRelationshipType.withName("CONTA_MOVIMENTO");
inserter.createRelationship(movimento, objs.getConta().getId(), relacionamento, null);
}
} finally {
if (inserter != null) {
inserter.shutdown();
}
}
}
Is it possible to get the path of the database configured in my XML inside the "inserter"? Because with the above configuration Neo4j gives me an error about multiple connections. Can I set a property to solve this multiple-connections error? Has anyone had this problem and have any idea how to solve it? Ideas are welcome.
Thanks to everyone!
Your question has several pieces to it:
Error About Multiple Connections
If you're using spring-data with a local database tied to a particular directory or file, be aware that you can't have two neo4j processes opening the same DB at the same time. This means that if you've decided to use BatchInserter against the same file/directory, this cannot happen at all while the JVM that's using the spring-data DB is running. There won't be a way I know of to get around that problem. One option would be to not use the batch inserter against the file, but to use the REST API to do inserting.
get the path of my database configured in my xml
Sure, there's a way to do that; you'd have to consult the relevant documentation. I can't give you the code for that because it depends on which config file you're talking about and how it's structured, but in essence there should be a way to inject the right thing into your code here and read the property from the XML file out of that injected object.
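As a sketch of what that could look like, assuming you externalize the store path to a properties file that both the XML and the Java code read (the property name neo4j.storeDir and the placeholder setup are made up for illustration):
// In the Spring XML, reference the same externalized property, e.g.:
//   <context:property-placeholder location="classpath:neo4j.properties"/>
//   <neo4j:config storeDirectory="${neo4j.storeDir}"/>
@Value("${neo4j.storeDir}")
private String storeDir;

public void batchInserter(List<Objects> objects) {
    BatchInserter inserter = null;
    try {
        inserter = BatchInserters.inserter(storeDir); // same directory as the XML config
        // ... same inserter logic as in the question ...
    } finally {
        if (inserter != null) {
            inserter.shutdown();
        }
    }
}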
But that won't help you given your "Multiple connections" issue mentioned above.
Broadly, I think your solution is either:
Don't run your spring app and your batch inserter at the same time.
Run your spring app, but do insertion via the REST API or other method, so there isn't a multiple connection issue to begin with.

GWT - impossible to find working dir with Eclipse

I need to show the working directory on my panel.
I use String value = System.getProperty("user.dir"). Afterwards I put this string on a label, but I receive this message in the console:
The method getProperty(String, String) in the type System is not applicable for the arguments (String).
I use Eclipse.
Issue
I am guessing you have not gone through GWT 101 - you cannot blindly use Java code on the client side.
Explanation
You can find the list of Java classes and methods supported by GWT here:
https://developers.google.com/web-toolkit/doc/latest/RefJreEmulation
For System, only the following are supported:
err, out,
System(),
arraycopy(Object, int, Object, int, int),
currentTimeMillis(),
gc(),
identityHashCode(Object),
setErr(PrintStream),
setOut(PrintStream)
Solution
In your case, execute System.getProperty("user.dir") in your server-side code and access it using RPC or any other server-side GWT communication technique.
System.getProperty("key") is not supported,
but System.getProperty("key", "default") IS supported, though it will only return the default value as there is not system properties per se.
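So, in client code, something like the following compiles but only ever yields the supplied default:
String workingDir = System.getProperty("user.dir", "unknown"); // always "unknown" in the compiled client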
If you need the working directory during the GWT compile, you need to use a custom linker or generator, grab the system property at build time, and emit it as a public resource file.
With linkers, you export an external file that GWT can download to get the compile-time data you want. With generators, you just inject the string you want into the compiled source.
Here's a slideshow on linkers that is actually very interesting.
http://dl.google.com/googleio/2010/gwt-gwt-linkers.pdf
If you don't want to use a linker and an extra HTTP request, you can use a generator instead, which is likely much easier (and faster):
interface BuildData {
String workingDirectory();
}
BuildData data = GWT.create(BuildData.class);
data.workingDirectory();
Then, you need to make a generator:
public class BuildDataGenerator extends IncrementalGenerator {
@Override
public RebindResult generateIncrementally(TreeLogger logger,
GeneratorContext context, String typeName){
//generator boilerplate
PrintWriter printWriter = context.tryCreate(logger, "com.foo", "BuildDataImpl");
if (printWriter == null){
logger.log(Type.TRACE, "Already generated");
return new RebindResult(RebindMode.USE_PARTIAL_CACHED,"com.foo.BuildDataImpl");
}
SourceFileComposerFactory composer =
new SourceFileComposerFactory("com.foo", "BuildDataImpl");
//must implement interface we are generating to avoid class cast exception
composer.addImplementedInterface("com.foo.BuildData");
SourceWriter sw = composer.createSourceWriter(printWriter);
//write the generated class; the class definition is done for you
sw.println("public String workingDirectory(){");
sw.println("return \""+System.getProperty("user.dir")+"\";");
sw.println("}");
return new RebindResult(RebindMode.USE_ALL_NEW_WITH_NO_CACHING
,"com.foo.BuildDataImpl");
}
}
Finally, you need to tell GWT to use your generator on your interface:
<generate-with class="dev.com.foo.BuildDataGenerator">
<when-type-assignable class="com.foo.BuildData" />
</generate-with>
