I'm currently using Apache Flink for my master's thesis, and I have to partition a dataset multiple times over the course of an iteration.
I would like the same data to end up on the same nodes each time, but I don't know how to achieve that. As far as I can tell, Flink will always redistribute the data arbitrarily across the nodes.
Is there a way to call, e.g., partitionByHash(...) multiple times and have the data land on the same node each time?
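For concreteness, here is a rough sketch of the pattern I mean (the data and the map function are just placeholders); partitionByHash(0) is applied again in every superstep of a bulk iteration:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.operators.IterativeDataSet;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class RepeatedPartitioningSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // placeholder data: (key, value) pairs
            DataSet<Tuple2<Long, Double>> data = env.fromElements(
                    new Tuple2<>(1L, 0.5), new Tuple2<>(2L, 1.5), new Tuple2<>(1L, 2.5));

            IterativeDataSet<Tuple2<Long, Double>> loop = data.iterate(10);

            DataSet<Tuple2<Long, Double>> step = loop
                    .partitionByHash(0)   // this repartitioning happens in every superstep
                    .map(new MapFunction<Tuple2<Long, Double>, Tuple2<Long, Double>>() {
                        @Override
                        public Tuple2<Long, Double> map(Tuple2<Long, Double> value) {
                            return new Tuple2<>(value.f0, value.f1 * 0.9); // placeholder work
                        }
                    });

            loop.closeWith(step).print();
        }
    }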
Thanks!
I have a use case where I need to move records from Hive to Kafka. I couldn't find a way to directly add a Kafka sink to a Flink DataSet.
Hence I used a workaround: I call the map transformation on the Flink DataSet and, inside the map function, call kafkaProducer.send() for the given record.
The problem I am facing is that I have no way to execute kafkaProducer.flush() on every worker node, so the number of records written to Kafka always ends up slightly lower than the number of records in the dataset.
Is there an elegant way to handle this? Is there any way I can add a Kafka sink to a DataSet in Flink, or a way to call kafkaProducer.flush() as a finalizer?
You could simply create a sink that uses a KafkaProducer under the hood and writes the data to Kafka.
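In the DataSet API, one way to do that is a custom OutputFormat that owns one KafkaProducer per parallel instance and flushes it in close(). A minimal sketch (the broker address, topic name, and String records are placeholders):

    import java.util.Properties;

    import org.apache.flink.api.common.io.OutputFormat;
    import org.apache.flink.configuration.Configuration;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class KafkaOutputFormat implements OutputFormat<String> {

        private final String brokers;
        private final String topic;
        private transient KafkaProducer<String, String> producer;

        public KafkaOutputFormat(String brokers, String topic) {
            this.brokers = brokers;
            this.topic = topic;
        }

        @Override
        public void configure(Configuration parameters) {
            // nothing to configure up front
        }

        @Override
        public void open(int taskNumber, int numTasks) {
            // one producer per parallel instance
            Properties props = new Properties();
            props.put("bootstrap.servers", brokers);
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            producer = new KafkaProducer<>(props);
        }

        @Override
        public void writeRecord(String record) {
            producer.send(new ProducerRecord<>(topic, record));
        }

        @Override
        public void close() {
            // runs once per parallel task when it finishes, so every worker
            // flushes its own buffered records before the job ends
            producer.flush();
            producer.close();
        }
    }

You would then attach it with dataSet.output(new KafkaOutputFormat("broker:9092", "my-topic")). Because close() is called on every parallel instance at the end of the job, the flush() problem from the map workaround goes away.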
The question says it all. How can I do one of the following things:
How can I limit the number of concurrent tasks running for one processor cluster-wide?
Is there any unique and short ID for the node I am running on? I could append this ID to the name of the database table to load into (see details below) and have an exclusive table per connection.
I have a NiFi cluster and a self-written, specialized processor that loads large amounts of data into a database via JDBC (up to 20 million rows per second). It uses some database-vendor-specific tuning tricks to be really fast in my particular case. One of these tricks needs an exclusive, empty table to load into for each connection.
At the moment, my processor opens one connection per node in the NiFi cluster (it takes a connection from the DBCPConnectionPool). With about 90-100 nodes in the cluster, I'd get 90-100 connections, all of them bulk loading data at the same time.
I'm using NiFi 1.3.0.0.
Any help or comment is highly appreciated. Sorry for not showing any code; it's about 700 lines and wouldn't really help with the question. But I plan to put it on Git as part of the open-source project Kylo.
A common way of breaking up tasks in NiFi is to split the flow file into multiple files on the primary node. Then other nodes would pull one of the flow files and process it.
In your case, each file would contain a range of values to pull from the table. Let's say you had a hundred rows and wanted only 3 nodes to pull data. So you'd create 3 flow files each having separate attribute values:
start-row-id=1, end-row-id=33
start-row-id=34, end-row-id=66
start-row-id=67, end-row-id=100
Then a node would pick up a flow file from a remote process group or a queue (such as JMS or SQS). Since there are only 3 flow files, no more than 3 nodes would be loading data over a connection at the same time.
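A rough sketch of the consuming side, assuming a custom processor and the attribute names above (bulkLoadRange stands in for your vendor-specific JDBC loading code):

    import java.util.Collections;
    import java.util.Set;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;

    public class RangeLoadProcessor extends AbstractProcessor {

        static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success")
                .description("Rows in the given range were loaded")
                .build();

        @Override
        public Set<Relationship> getRelationships() {
            return Collections.singleton(REL_SUCCESS);
        }

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
            FlowFile flowFile = session.get();
            if (flowFile == null) {
                return; // no range queued on this node right now
            }
            // The range attributes were set when the flow files were created on the primary node.
            final long startRow = Long.parseLong(flowFile.getAttribute("start-row-id"));
            final long endRow = Long.parseLong(flowFile.getAttribute("end-row-id"));

            // Placeholder for the vendor-specific bulk load: one connection from the pool,
            // loading only the rows in [startRow, endRow]. Only as many flow files exist as
            // desired loaders, so only that many connections load concurrently.
            bulkLoadRange(startRow, endRow);

            session.transfer(flowFile, REL_SUCCESS);
        }

        private void bulkLoadRange(long startRow, long endRow) {
            // vendor-specific JDBC bulk load goes here
        }
    }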
I'm currently working with Spark Streaming; the data volume will be huge, and I have the following scenario.
Every 2 minutes, the streamed data is processed. During some transformations, I need to validate records against data that may only arrive in the next batch, i.e., 2 minutes later. In such cases, I need to hold those particular records in memory (or in a memory-plus-disk combination) so they can be compared against the next batch.
Neither accumulators nor broadcast variables help in my case. What would be the best approach here?
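To make the scenario more concrete, here is a rough sketch of what I have been considering with updateStateByKey (Spark 2.x Java API; the socket source, the local master, and the "keep the latest value per key" logic are just placeholders), though I am not sure it is the right approach at this volume:

    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.Optional;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;

    public class CrossBatchStateSketch {
        public static void main(String[] args) throws Exception {
            // local master only for trying the sketch out
            SparkConf conf = new SparkConf().setAppName("cross-batch-state").setMaster("local[2]");
            // 2-minute micro-batches, as described above
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.minutes(2));
            // updateStateByKey requires checkpointing; the path is a placeholder
            jssc.checkpoint("/tmp/spark-checkpoint");

            // placeholder source: lines of "key,value"
            JavaPairDStream<String, String> records = jssc
                    .socketTextStream("localhost", 9999)
                    .mapToPair(line -> {
                        String[] parts = line.split(",", 2);
                        return new Tuple2<>(parts[0], parts[1]);
                    });

            // Keep the latest value per key across micro-batches so the next batch
            // can compare its records against what arrived previously. A real job
            // would also expire old keys by returning Optional.empty().
            JavaPairDStream<String, String> lastSeen = records.updateStateByKey(
                    (List<String> newValues, Optional<String> previous) -> {
                        if (newValues.isEmpty()) {
                            return previous; // nothing new: carry the old value forward
                        }
                        // here the new values could be validated against previous.orElse(null)
                        return Optional.of(newValues.get(newValues.size() - 1));
                    });

            lastSeen.print();
            jssc.start();
            jssc.awaitTermination();
        }
    }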
Hi, I need to read multiple tables from my databases and join them. Once the tables are joined, I would like to push the result to Elasticsearch.
The tables are joined by an external process, since the data can come from multiple sources. This is not an issue; in fact, I have 3 separate processes reading 3 separate tables at an average of 30,000 records per second. The records are joined into a multimap, from which a single JsonDocument is produced for each key.
Then a separate process reads the denormalized JsonDocuments and bulk-indexes them into Elasticsearch at an average of 3,000 documents per second.
I'm having trouble finding a way to split up the work. I'm pretty sure my Elasticsearch cluster can handle more than 3,000 documents per second, and I was thinking of somehow splitting the multimap that holds the joined JSON docs.
Anyway, I'm building a custom application for this, so I was wondering: are there any tools that can be put together to do all of this? Some form of ETL, or stream processing, or something else?
While streaming makes records more readily available than bulk processing, and reduces the overhead of large-object management in the Java container, you can take a hit on latency. Usually in this kind of scenario you have to find an optimal bulk size. For that I follow these steps:
1) Build a streaming bulk insert: stream, but still send more than 1 record (or, in your case, build more than 1 JSON document) at a time.
2) Experiment with several bulk sizes, for example 10, 100, 1,000, 10,000, and plot them in a quick graph. Run a sufficient number of records to see whether performance degrades over time: it can be that a bulk size of 10 is extremely fast per record, but that there is an incremental insert overhead (as is the case with primary-key maintenance in SQL Server, for example). If you run the same total number of records for every test, the results should be representative of your performance.
3) Interpolate in your graph and maybe try out 3 values between the best values from step 2.
Then use the final result as your optimal streaming bulk-insert size.
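As an illustration of step 1, here is a minimal sketch of a streaming bulk insert with a tunable bulk size (plain Java 11 HTTP against Elasticsearch's typeless _bulk endpoint; the host, index name, and error handling are simplified placeholders):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.ArrayList;
    import java.util.List;

    public class TunableBulkIndexer {

        private final HttpClient http = HttpClient.newHttpClient();
        private final String bulkUrl;
        private final int bulkSize;
        private final List<String> buffer = new ArrayList<>();

        public TunableBulkIndexer(String elasticsearchUrl, String index, int bulkSize) {
            this.bulkUrl = elasticsearchUrl + "/" + index + "/_bulk";
            this.bulkSize = bulkSize; // the knob to sweep in step 2
        }

        // Called for every denormalized JSON document as it streams out of the join.
        public void add(String jsonDocument) throws Exception {
            buffer.add(jsonDocument);
            if (buffer.size() >= bulkSize) {
                flush();
            }
        }

        // Sends one _bulk request: an action line followed by the document, per entry.
        public void flush() throws Exception {
            if (buffer.isEmpty()) {
                return;
            }
            StringBuilder body = new StringBuilder();
            for (String doc : buffer) {
                body.append("{\"index\":{}}\n").append(doc).append('\n');
            }
            HttpRequest request = HttpRequest.newBuilder(URI.create(bulkUrl))
                    .header("Content-Type", "application/x-ndjson")
                    .POST(HttpRequest.BodyPublishers.ofString(body.toString()))
                    .build();
            HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() >= 300) {
                throw new IllegalStateException("Bulk request failed: " + response.body());
            }
            buffer.clear();
        }
    }

The consumer of the multimap would call add(json) per document and flush() once at the end; bulkSize is the value you sweep in step 2, and several of these indexers can later run in parallel.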
Once you have this value, you can add one more step:
Run multiple processes in parallel. This fills the gaps in your process a bit. Watch the throughput and maybe adjust your bulk sizes one more time.
This approach once helped me speed up a multi-TB import process from 2 days to about 12 hours, so it can work out quite well.
So I'm getting started with Elasticsearch, and I created a few nodes on my machine using:
elasticsearch -Des.node.name=Node-2
Now, as far as I understand, a node is another machine/server in a cluster; correct me if I'm wrong.
1. In order to add nodes to a cluster, do these machines need to be on the same network? Can I have a node in the US and another node in the EU as part of the same cluster, or do they need to be in the same building, on the same network?
2. What is the idea behind nodes? To split the data across multiple machines/nodes and also to split the computing power for certain queries?
By default, Elasticsearch looks for nodes running with the same cluster name on the same network. If you want to configure things differently, take a look at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-discovery-zen.html
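For example, with the zen discovery settings of that era (matching the -Des.node.name style above), a minimal elasticsearch.yml per node might look like this; the cluster name and host names are placeholders:

    cluster.name: my-cluster
    node.name: Node-2
    # disable multicast and point each node at an explicit list of peers
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["host-1.example.com", "host-2.example.com"]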
The idea is to split the data across multiple machines in case it doesn't fit on one machine, AND to prevent data loss in case a node fails (by default each shard has one replica copy on another node), AND to split the query computation power. (Elasticsearch automatically splits your query into sub-queries for the separate nodes and aggregates the results.)
Hope this answers your questions :)