I am writing a Google Dataflow Pipeline and as one of the Sources I require a MySQL resultset via a query. A couple of questions then:
What would be the proper way to extract data from MySQL as a step in my pipeline? Can this simply be done in-line using JDBC?
In the case that I indeed do need to implement "User-Defined Data Format" wrapping MySQL as a source, does anyone know if an implementation already exists and I do not need to reinvent the wheel? (don't get me wrong I would enjoy writing it, but I would imagine this would be quite a common scenario to use MySQL as a source)
Thanks all!
At this time, Cloud Dataflow does not provide a MySQL input source.
The preferred way to implement support for this is to implement a user-defined input source that can handle MySQL queries.
An alternative would be to execute the query in the main program, stage the results of the query to a temporary location in GCS, process the results using Dataflow, and then remove the temporary files.
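A minimal sketch of the staging step in plain Java, assuming a MySQL JDBC driver is on the classpath and that the resulting file is uploaded to GCS as a separate step. The class name QueryStager and its methods are hypothetical and not part of Dataflow. NULL columns are written as empty fields, which may or may not suit your data.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: dumps a query result to a CSV file that can be staged for the pipeline.
class QueryStager {

    // Quotes a field only when it contains a comma, a double quote, or a line break.
    // Embedded double quotes are doubled. NULL becomes an empty field (an assumption).
    static String csvField(String value) {
        if (value == null) {
            return "";
        }
        boolean needsQuotes = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        String escaped = value.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Joins already-stringified column values into one CSV line (no trailing newline).
    static String csvLine(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(csvField(fields.get(i)));
        }
        return sb.toString();
    }

    // Runs the query on the given connection and writes one CSV line per row.
    // Returns the number of rows written.
    static long stage(Connection conn, String query, Path out)
            throws SQLException, IOException {
        long rows = 0;
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(query);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            int columnCount = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                List<String> fields = new ArrayList<>(columnCount);
                for (int c = 1; c <= columnCount; c++) {
                    fields.add(rs.getString(c)); // JDBC columns are 1-based
                }
                writer.write(csvLine(fields));
                writer.write('\n');
                rows++;
            }
        }
        return rows;
    }
}
```

The main program would call stage(...) with a connection from DriverManager.getConnection, copy the file to the temporary GCS location, run the pipeline against it, and delete it afterwards.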
Hope this helps
A JDBC connector has just been added to Apache Beam (incubating). See JdbcIO.
Could you please clarify the need for GroupByKey in the above example? Since the previous ParDo (ReadQueryResults) returns rows key'd on primary key, wouldn't the GroupByKey essentially create a group for each row of the result set? The subsequent ParDo (Regroup) would have parallelized the processing per row even without the GroupByKey, right?
I want to get some filtered data from one Oracle db and refresh tables in another Oracle db, and this refresh needs to be done frequently. What are the best possible ways to do it?
Please suggest the optimal way: using db links, using Oracle scheduler jobs, or writing Java code?
There are numerous ways to do this, but the most straightforward is to use materialized views with queries that involve dblinks, which you can schedule refreshes for by using dbms_scheduler. There are a lot of docs online to help you. Here's one:
Working with Materialized Views
I don't know Java so I can't comment on it.
As far as the database is concerned, one option is to create a database link between the two databases and a materialized view in one of them which fetches data over the database link from the other database.
You can schedule the refresh; there are various options. Read the documentation to pick the right one for your situation. Have a quick look at Tim Hall's materialized views article; if you find it interesting, search the Oracle documentation (for the version you use) for more info.
Create a database link between the source and target databases and follow any of these native tool options.
Create a materialized view using a query that points to the source database.
Write a procedure on the target site that uses select queries to read data from the source site and update/insert the target tables accordingly. Then schedule those procedures using scheduler jobs.
Use Oracle GoldenGate, provided the tables you have chosen have a primary key or unique key.
Write your own Java or Python code that works in a publish/subscribe fashion to publish the data to the target site.
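If the refresh is driven from Java rather than from a stored procedure, one possible shape is a single MERGE statement that reads the filtered rows over the database link and is executed on the target through JDBC. The sketch below assumes an Oracle JDBC driver on the classpath; the class DbLinkRefresh and the table, column and link names in the usage note are hypothetical. It concatenates identifiers without validation, so it should only be fed trusted configuration values, and it does not remove target rows that have disappeared from the source.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: refreshes a target table from a source table reached over a db link.
class DbLinkRefresh {

    // Builds an Oracle MERGE that upserts the filtered source rows into the target.
    // key is the join column; cols are the non-key columns to copy.
    static String mergeSql(String target, String source, String dbLink,
                           String key, List<String> cols, String filter) {
        String colList = String.join(", ", cols);
        String updates = cols.stream()
                .map(c -> "t." + c + " = s." + c)
                .collect(Collectors.joining(", "));
        String values = cols.stream()
                .map(c -> "s." + c)
                .collect(Collectors.joining(", "));
        return "MERGE INTO " + target + " t"
                + " USING (SELECT " + key + ", " + colList
                + " FROM " + source + "@" + dbLink
                + " WHERE " + filter + ") s"
                + " ON (t." + key + " = s." + key + ")"
                + " WHEN MATCHED THEN UPDATE SET " + updates
                + " WHEN NOT MATCHED THEN INSERT (" + key + ", " + colList + ")"
                + " VALUES (s." + key + ", " + values + ")";
    }

    // Executes the statement on a connection to the target database
    // and returns the number of rows merged.
    static int refresh(Connection targetConn, String sql) throws SQLException {
        try (Statement stmt = targetConn.createStatement()) {
            return stmt.executeUpdate(sql);
        }
    }
}
```

For example, mergeSql("orders_copy", "orders", "src_link", "id", List.of("status", "amount"), "status = 'OPEN'") produces a statement that selects from orders@src_link. How often refresh(...) runs is left to whatever scheduler you already use.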
Can you change the file metadata on a cloud database using Apache Beam? From what I understand, Beam is used to set up dataflow pipelines for Google Dataflow. But is it possible to use Beam to change the metadata if you have the necessary changes in a CSV file without setting up and running an entire new pipeline? If it is possible, how do you do it?
You could write a Cloud Dataflow job to handle this, but I would not. A simple GCE instance would make it easier to develop and run the job. An even better choice might be a UDF (see below).
There are some guidelines for when Cloud Dataflow is appropriate:
Your data is not tabular and you can not use SQL to do the analysis.
Large portions of the job are parallel -- in other words, you can process different subsets of the data on different machines.
Your logic involves custom functions, iterations, etc...
The distribution of the work varies across your data subsets.
Since your task involves modifying a database (I am assuming a SQL database), it would be much easier and faster to write a UDF to process and modify the database.
First, Apache Beam does not currently support schema updates. A feature request has been open for some time, but there is no news.
Another option is to alter your current Apache Beam pipeline to migrate your table to another table with the corrected schema. Unfortunately, this does not scale if you have a lot of data, or if you need to change the table schema frequently (renaming columns, renaming tables, changing data types, etc.).
What I propose instead is to issue SQL statements to update your table schema. You can write a bash script using this guide that executes an ALTER TABLE statement.
I am new to Hadoop and have been given the task of migrating structured data to HDFS using Java code. I know the same can be accomplished with Sqoop, but that is not my task.
Can someone please explain a possible way to do this.
I did attempt to do it. What I did was copy data from psql server using jdbc driver and then store it in a csv format in HDFS. Is this the right way to go about this?
I have read that Hadoop has its own datatypes for storing structured data. Can you please explain as to how that happens.
Thank you.
The state of the art is to use Sqoop (pull ETL) as a regular batch process to fetch the data from the RDBMS. However, this approach is resource-intensive for the RDBMS (Sqoop often runs multiple threads with multiple JDBC connections), takes a long time (you often run a sequential fetch on the RDBMS), and can lead to data corruption (the live RDBMS keeps being updated while the long-running Sqoop process is always behind).
An alternative paradigm (push ETL) exists and is maturing. The idea is to build change data capture streams that listen to the RDBMS. An example project is Debezium. You can then build a real-time ETL that synchronizes the RDBMS with the data warehouse on Hadoop or elsewhere.
Sqoop is a simple tool that does the following:
1) Connects to the RDBMS (PostgreSQL), gets the metadata of the table, and creates a POJO (a Java class) for the table.
2) Uses that Java class to import and export through a MapReduce program.
If you need to write plain Java code (where you have to control the parallelism yourself for performance), do the following:
1) Create a Java class which connects to the RDBMS using JDBC.
2) Create a Java class which accepts an input String (taken from the result set) and writes it to a file on HDFS.
Another way of doing this:
Create a MapReduce job that uses DBInputFormat, pass it the number of input splits, and use TextOutputFormat with an output directory on HDFS.
Please visit https://bigdatatamer.blogspot.com/ for any hadoop and spark related question.
Thanks
Sainagaraju Vaduka
You are better off using Sqoop, because you may end up doing exactly what Sqoop does if you go down the path of building it yourself.
Either way, conceptually, you will need a custom mapper with a custom input format that is able to read partitioned data from the source. In this case, you need a table column on which the data can be partitioned in order to exploit parallelism. A partitioned source table would be ideal.
DBInputFormat doesn't optimise the calls on the source database. The complete dataset is sliced into the configured number of splits by the InputFormat.
Each mapper executes the same query and loads only the portion of the data corresponding to its split. This means every mapper issues the same query, along with a sort of the dataset, so that it can pick out its portion of the data.
This class doesn't seem to take advantage of a partitioned source table. You can extend it to handle partitioned tables more efficiently.
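To illustrate partitioning on a column, here is a minimal sketch that divides a numeric key range into disjoint WHERE predicates, one per mapper, so that each mapper can issue a narrower query instead of the same full one. It is similar in spirit to Sqoop's --split-by option but is not Sqoop's or DBInputFormat's actual code. The class SplitPlanner is hypothetical, and the sketch assumes the minimum and maximum of the column have already been fetched (for example with SELECT MIN(id), MAX(id)) and that the range does not overflow a long.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: plans disjoint key ranges for parallel reads from one table.
class SplitPlanner {

    // Returns up to numSplits predicates that together cover [min, max] on the
    // given numeric column with no gaps and no overlap. Fewer predicates are
    // returned when the range holds fewer values than numSplits.
    static List<String> rangePredicates(String column, long min, long max, int numSplits) {
        if (numSplits < 1 || max < min) {
            throw new IllegalArgumentException("need numSplits >= 1 and max >= min");
        }
        List<String> predicates = new ArrayList<>();
        long total = max - min + 1;      // number of key values in the range
        long base = total / numSplits;   // minimum size of each split
        long extra = total % numSplits;  // the first 'extra' splits get one more value
        long lower = min;
        for (int i = 0; i < numSplits; i++) {
            long size = base + (i < extra ? 1 : 0);
            if (size == 0) {
                break; // more splits requested than values available
            }
            long upper = lower + size - 1;
            predicates.add(column + " >= " + lower + " AND " + column + " <= " + upper);
            lower = upper + 1;
        }
        return predicates;
    }
}
```

For instance, rangePredicates("id", 1, 10, 3) yields "id >= 1 AND id <= 4", "id >= 5 AND id <= 7" and "id >= 8 AND id <= 10". Even ranges only balance the work when the key values are spread fairly uniformly, and the reads are only cheap if the column is indexed or the table is partitioned on it.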
To begin with, Hadoop has structured file formats such as Avro, ORC and Parquet.
If your data doesn't need to be stored in a columnar format (used primarily for OLAP use cases where only a few columns out of a large set are selected), go with Avro.
The way you are trying to do it is not a good one, because you are going to waste a lot of time developing and testing the code. Instead, use Sqoop to import the data from any RDBMS into Hive. The first tool that should come to mind is Sqoop (SQL to Hadoop).
New to Oracle here, but I have now read about the various bulk insert options for Oracle. In essence, true bulk loading is done using the Direct Path loading mechanism via SQL*Loader. There are also APPEND hint options that use serial or parallel Direct Path loading. But each of these has the following limitations -
SQL*Loader works off of a Control File, which contains the path of the data file. In my case, there is no file.
APPEND hint option for INSERT can only use the syntax - insert into select from. In my case, the source data is not in any table.
The source of my data is actually a Spark dataframe. I am looking for options to push this data in chunks to Oracle tables, but using the Direct Path loading option. For example, in Postgres, the PGConnection interface provides getCopyAPI.copyIn functionality, and you can create a huge serialized blob that can be sent over as one big chunk using the COPY tableName FROM STDIN yourBlob command. I am unable to find any similar Java API for Oracle that works on in-memory records and is able to push data directly (without any insert statements).
Any ideas on how to achieve this? Anyone done this before?
In general, how do folks using Oracle and Spark push data to Oracle from a dataframe in an optimized way?
Thanks in advance!
Is there any way for us to query the db to suggest index creation/index deletion that would improve the performance of the db system?
We understand that a DBA can manually view the trace files to create/drop indices, but can I write a Java program that queries the db engine to suggest the same automatically?
Or are there some open source tools that I can check out to perform the same automatically?
Thx.
Well, there's no standard JDBC way to do this. There may be specific driver implementations for specific DBs that would allow you to EXPLAIN your query (trace the use of indexes), etc. But there's no one-size-fits-all answer here.
In general I would lean toward saying no.
Index performance depends on the queries you run against the indexes, so no.
With MySQL specifically, you can flag slow queries, as well as queries not using indexes.
Ultimately, the database will do its best (within what you've created) to optimize your query.
http://dev.mysql.com/doc/refman/5.0/en/slow-query-log.html
This data would be logged to a file, and not necessarily available via an API.