I need to query a database and join the query results with a Kafka data stream in Flink. Currently this is done by storing the query results in a file and then using Flink's readFile functionality to create a DataStream of the query results. What would be a better approach that bypasses the intermediate step of writing to a file and creates a DataStream directly from the query results?
My current understanding is that I would need to write a custom SourceFunction as suggested here. Is this the right and only way or are there any alternatives?
Are there any good resources for writing custom SourceFunctions, or should I just look at existing implementations for reference and customise them for my needs?
One straightforward solution would be to use a lookup join, perhaps with caching enabled.
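To make the idea concrete, here is a minimal sketch of what a cached lookup join does, written in plain Python rather than against Flink's API. Everything in it is a stand-in chosen for illustration: the in-memory SQLite table `customers`, the event fields, and the `lookup_customer` helper are assumptions, not part of the original setup.

```python
import sqlite3
from functools import lru_cache

# Stand-in for the external database (hypothetical table and columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "alice"), (2, "bob")])

db_hits = 0  # counts real database round trips, to make the cache visible


@lru_cache(maxsize=1024)
def lookup_customer(customer_id):
    """Fetch one row by key; repeated keys are answered from the cache."""
    global db_hits
    db_hits += 1
    row = conn.execute(
        "SELECT name FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    return row[0] if row else None


def enrich(events):
    """Join each stream event with its database row as the event arrives."""
    for event in events:
        yield {**event, "customer_name": lookup_customer(event["customer_id"])}


# Stand-in for the Kafka stream (hypothetical event shape).
events = [
    {"customer_id": 1, "amount": 10},
    {"customer_id": 2, "amount": 5},
    {"customer_id": 1, "amount": 7},
]
enriched = list(enrich(events))
```

The database is queried per event key instead of being loaded up front, and the cache keeps hot keys from causing repeated queries. The trade-off is staleness: a cached row can lag behind the table until it is evicted.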
Other possible solutions include Kafka Connect, or using something like Debezium to mirror the database table into Flink. Here's an example: https://github.com/ververica/flink-sql-CDC.
Is CSV the only option to speed up my bulk relationship creation?
I have read many articles on the internet, and they all talk about CSV. CSV will definitely give me a performance boost (can you estimate how big?), but I'm not sure I can store my data in CSV format. Are there any other options? How much would I gain from using the Neo4j 3 Bolt protocol?
My program
I'm using Neo4j 2.1.7. I am trying to create about 50000 relationships at once. I execute the queries in batches of 10000, and it takes about 120-140 seconds to insert all 50000.
My query looks like:
MATCH (n),(m)
WHERE id(n)=5948 and id(m)=8114
CREATE (n)-[r:MY_REL {
ID:"4611686018427387904",
TYPE: "MY_REL_1",
PROPERTY_1:"some_data_1",
PROPERTY_2:"some_data_2",
.........................
PROPERTY_14:"some_data_14"
}]->(m)
RETURN id(n),id(m),r
As it is written in the documentation:
Cypher supports querying with parameters. This means developers don’t
have to resort to string building to create a query. In addition to
that, it also makes caching of execution plans much easier for Cypher.
So you need to pack your data as parameters and pass them along with the Cypher query:
UNWIND {rows} as row
MATCH (n),(m)
WHERE id(n)=row.nid and id(m)=row.mid
CREATE (n)-[r:MY_REL {
ID:row.relId,
TYPE:row.relType,
PROPERTY_1:row.someData_1,
PROPERTY_2:row.someData_2,
.........................
PROPERTY_14:row.someData_14
}]->(m)
RETURN id(n),id(m),r
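As a rough sketch of the client side, the `rows` parameter could be assembled as below. The input record shape (`start_id`, `end_id`, `rel_id`, `rel_type`, `props`) and both helper names are assumptions made for illustration. Only the first two of the fourteen properties are mapped, and the rest would follow the same pattern.

```python
def build_rows(relationships):
    """Turn raw relationship records into the maps that UNWIND iterates over."""
    return [
        {
            "nid": rel["start_id"],          # matched against id(n)
            "mid": rel["end_id"],            # matched against id(m)
            "relId": rel["rel_id"],
            "relType": rel["rel_type"],
            "someData_1": rel["props"][0],
            "someData_2": rel["props"][1],
        }
        for rel in relationships
    ]


def batches(rows, size):
    """Yield one parameter map per request, each holding at most `size` rows."""
    for start in range(0, len(rows), size):
        yield {"rows": rows[start:start + size]}


# Hypothetical input: 25 relationships, sent in batches of 10.
relationships = [
    {
        "start_id": i,
        "end_id": i + 1,
        "rel_id": str(i),
        "rel_type": "MY_REL_1",
        "props": ["some_data_1", "some_data_2"],
    }
    for i in range(25)
]
params = list(batches(build_rows(relationships), 10))
```

Each element of `params` would be sent together with the same, unchanged query text, which is what lets the server reuse the cached execution plan instead of parsing a new string for every relationship.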
I am writing a Spark Streaming job that consumes data from Kafka and writes it to an RDBMS. I am currently stuck because I do not know the most efficient way to store this streaming data in the RDBMS.
On searching, I found a few methods -
Using DataFrame
Using JdbcRDD
Creating a connection and a PreparedStatement inside foreachPartition() of the RDD and using PreparedStatement.addBatch() and executeBatch()
I can not figure out which one would be the most efficient method of achieving my goal.
Same is the case with storing & retrieving data from HBase.
Can anyone help me with this?
I am studying the Java EE Batch API (JSR-352) in order to test the feasibility of replacing our current ETL tool with our own solution built on this technology.
My goal is to build a job in which I:
get some (dummy) data from a datasource in step1,
some other data from other data-source in step2 and
merge them in step3.
I would like to process each item and not write to a file, but send it to the next step. And also store the information for further use. I could do that using batchlets and jobContext.setTransientUserData().
I think I am not getting the concepts right: as far as I understand, JSR-352 is meant for this kind of ETL task, but it has two types of steps: chunks and batchlets. Chunks are "3-phase steps" in which one reads, processes and writes the data. Batchlets are tasks that are not performed on each item of the data, but once (such as calculating totals or sending an email).
My problem is that my solution is not correct if I consider the definition of batchlets.
How could one implement this kind of job using the Java EE Batch API?
I think you would be better off using chunks rather than batchlets to implement ETL jobs. Typical chunk processing against a datasource looks something like the following:
ItemReader#open(): open a cursor (create the Connection, Statement and ResultSet) and save them as instance variables of the ItemReader.
ItemReader#readItem(): create and return an object that contains the data of one row, using the ResultSet.
ItemReader#close(): close the JDBC resources.
ItemProcessor#processItem(): do the calculation, then create and return an object that contains the result.
ItemWriter#writeItems(): save the calculated data to the database: open a Connection and Statement, invoke executeUpdate(), and close them.
In your situation, I think you have to choose one dataset to treat as the primary one and open a cursor for it in ItemReader#open(). Then fetch the matching data from the other source in ItemProcessor#processItem() for each item.
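The read/process/write cycle described above can be sketched without a batch runtime. The following is a plain-Python illustration, not the JSR-352 API: the class and method names only mirror ItemReader, ItemProcessor and ItemWriter, the three in-memory SQLite databases and their tables are invented for the example, and `run_chunk_step` is a hypothetical stand-in for what the container does with a configured item count.

```python
import sqlite3


def make_db(script):
    """Create an in-memory database to stand in for one data source."""
    db = sqlite3.connect(":memory:")
    db.executescript(script)
    return db


primary = make_db(
    "CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount INTEGER);"
    "INSERT INTO orders VALUES (1, 1, 10), (2, 2, 5), (3, 1, 7);"
)
secondary = make_db(
    "CREATE TABLE customers (id INTEGER, name TEXT);"
    "INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');"
)
target = make_db(
    "CREATE TABLE report (order_id INTEGER, customer_name TEXT, amount INTEGER);"
)


class OrderReader:
    def open(self):
        # Open the cursor once and keep it as instance state.
        self.cursor = primary.execute(
            "SELECT id, customer_id, amount FROM orders ORDER BY id"
        )

    def read_item(self):
        return self.cursor.fetchone()  # None signals the end of the input

    def close(self):
        self.cursor.close()


class MergeProcessor:
    def process_item(self, order):
        # Fetch the matching row from the other data source for each item.
        order_id, customer_id, amount = order
        row = secondary.execute(
            "SELECT name FROM customers WHERE id = ?", (customer_id,)
        ).fetchone()
        return (order_id, row[0] if row else None, amount)


class ReportWriter:
    def write_items(self, items):
        # One write and one commit per chunk.
        target.executemany("INSERT INTO report VALUES (?, ?, ?)", items)
        target.commit()


def run_chunk_step(reader, processor, writer, item_count=2):
    """Read and process items one by one, writing every `item_count` items."""
    reader.open()
    try:
        chunk = []
        while (item := reader.read_item()) is not None:
            chunk.append(processor.process_item(item))
            if len(chunk) == item_count:
                writer.write_items(chunk)
                chunk = []
        if chunk:  # flush the final, partially filled chunk
            writer.write_items(chunk)
    finally:
        reader.close()


run_chunk_step(OrderReader(), MergeProcessor(), ReportWriter())
merged = target.execute("SELECT * FROM report ORDER BY order_id").fetchall()
```

Here the merge happens inside the processor, so no intermediate file or transient user data is needed to pass items between steps.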
I also recommend reading these useful examples of chunk processing:
http://www.radcortez.com/java-ee-7-batch-processing-and-world-of-warcraft-part-1/
http://www.radcortez.com/java-ee-7-batch-processing-and-world-of-warcraft-part-2/
My blog entries about JBatch and chunk processing:
http://www.nailedtothex.org/roller/kyle/category/JBatch
I am creating reports on MongoDB using Java, and I need to use map-reduce to create them. I have 3 replicas in production. For report queries I do not want to send requests to the primary MongoDB instance; I want to query only a secondary replica. However, map-reduce will create a temporary collection.
1) Is there any problem if I set the read preference to secondary for reports that use map-reduce?
2) Will it create the temporary collection on the secondary replica?
3) Is there any other way to use a secondary replica for reporting, since I do not want to create traffic on the primary database?
4) Will I get the correct results, given that I have a huge amount of data?
Probably the easiest way to do this is to connect to the secondary directly, instead of connecting to the replica set with ReadPreference.SECONDARY_ONLY. In that case, it will definitely create the temporary collection on the secondary, and you should get the correct results (of course!).
I would also advise you to look at the Aggregation Framework, though, as it's a lot faster and often easier to use and debug than Map Reduce jobs. It's not as powerful, but I have yet to find a situation where I couldn't use the Aggregation Framework for my aggregation and reporting needs.
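To illustrate the Aggregation Framework suggestion, a report is expressed as a list of pipeline stages. The pipeline below is only a sketch: the field names `status`, `customer_id` and `amount` are hypothetical, and it is written as Python data (the same stage documents can be built with any driver, including the Java one).

```python
# Hypothetical report: total amount and order count per customer,
# largest totals first. All field names are assumptions.
pipeline = [
    # Keep only the documents the report cares about.
    {"$match": {"status": "completed"}},
    # Group by customer and compute the aggregates.
    {
        "$group": {
            "_id": "$customer_id",
            "total": {"$sum": "$amount"},
            "orders": {"$sum": 1},
        }
    },
    # Order the report rows.
    {"$sort": {"total": -1}},
]

# With PyMongo this list would be passed as collection.aggregate(pipeline).
stage_names = [next(iter(stage)) for stage in pipeline]
```

A pipeline like this returns its results through a cursor and has no stage that writes a collection, which sidesteps the temporary-collection concern when reading from a secondary.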
Hi friends
I am building a web crawler, and I would like to know a few things about it:
1) Can I use MapReduce to fetch data from the net?
2) Can I save the fetched data to HBase?
3) Can I write an app in PHP to fetch the data from HBase? If yes, can you give me a code snippet? How can I add, view and delete data in HBase using PHP?
For your questions, yes, it can all be done. How you approach it depends on what exactly you want to achieve.
1) Your main control would need to partition the task. You would likely maintain some kind of list of addresses to crawl, possibly running sequential MapReduce tasks that each read the list in, split it between mappers that do the crawling, and write directly to HBase or to another intermediary. The mappers would probably also output the newly discovered URLs to crawl next, which in turn would be filtered down to unique values in the reduce phase, with the reducer outputting the list of things to crawl next. You'd need to maintain a list of recently crawled pages and filter those out too, but that's not specific to MR/HBase.
2) You can use TableOutputFormat to send the outputs to HBase. You can also just make HBase connections with HTable and write directly in your mapper.
3) As TheDeveloper said, yes, with Thrift. His link is good.
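The crawl loop described in point 1 can be sketched in a few lines. This is a single-process Python simulation of the map, shuffle and reduce phases, not Hadoop code. The `WEB` dictionary is a fake link graph that stands in for real HTTP fetches, and all function names are hypothetical.

```python
from collections import defaultdict


def map_fetch(url, fetch):
    """Mapper: crawl one URL and emit (link, 1) for every outgoing link."""
    for link in fetch(url):
        yield link, 1


def reduce_unique(link, counts, crawled):
    """Reducer: emit each discovered link once, skipping crawled ones."""
    if link not in crawled:
        yield link


def crawl_round(frontier, crawled, fetch):
    """Run one MapReduce-style pass; return (next frontier, crawled set)."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for url in frontier:
        for link, one in map_fetch(url, fetch):
            grouped[link].append(one)
    crawled = crawled | set(frontier)
    next_frontier = sorted(
        link
        for key, counts in grouped.items()
        for link in reduce_unique(key, counts, crawled)
    )
    return next_frontier, crawled


# Fake link graph used instead of real HTTP requests.
WEB = {"a": ["b", "c"], "b": ["c", "a"], "c": ["d"], "d": []}


def fetch(url):
    return WEB.get(url, [])


frontier, crawled = ["a"], set()
rounds = []
while frontier:
    rounds.append(frontier)
    frontier, crawled = crawl_round(frontier, crawled, fetch)
```

Each pass reads the current list, crawls it, and emits a deduplicated list for the next pass. In a real job, the mapper would also be the place to write the fetched page content to HBase.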
For question 3, you can interact with HBase from PHP, but you need to do it via the Thrift interface. See this blog post for more info. Hope this helps.
This can also be done easily via REST using Stargate.