Data refresh from one Oracle DB to another Oracle DB - Java

I want to fetch some filtered data from one Oracle database and use it to refresh tables in another Oracle database, and this refresh needs to happen frequently. What are the best possible ways to do it?
Please suggest the optimal approach: database links, Oracle scheduler jobs, or Java code.

There are numerous ways to do this, but the most straightforward is to use materialized views whose defining queries go over database links, and to schedule their refreshes with DBMS_SCHEDULER. There are a lot of docs online to help you. Here's one:
Working with Materialized Views
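
To make that concrete, here is a minimal Java/JDBC sketch of the idea, run against the target database. It assumes a database link named src_link already exists, and the connection details, view name, table and filter are all placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class MaterializedViewSetup {
        public static void main(String[] args) throws Exception {
            // Connect to the TARGET database; URL and credentials are placeholders.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//target-host:1521/TARGETPDB", "app_user", "app_password");
                 Statement stmt = conn.createStatement()) {

                // Materialized view whose defining query reads the filtered rows
                // over the database link (assumed to be called SRC_LINK).
                stmt.execute(
                    "CREATE MATERIALIZED VIEW emp_filtered_mv " +
                    "BUILD IMMEDIATE REFRESH COMPLETE ON DEMAND AS " +
                    "SELECT * FROM employees@src_link WHERE department_id = 10");

                // DBMS_SCHEDULER job that refreshes the view every 15 minutes.
                stmt.execute(
                    "BEGIN " +
                    "  DBMS_SCHEDULER.CREATE_JOB( " +
                    "    job_name        => 'REFRESH_EMP_FILTERED_MV', " +
                    "    job_type        => 'PLSQL_BLOCK', " +
                    "    job_action      => 'BEGIN DBMS_MVIEW.REFRESH(''EMP_FILTERED_MV''); END;', " +
                    "    repeat_interval => 'FREQ=MINUTELY; INTERVAL=15', " +
                    "    enabled         => TRUE); " +
                    "END;");
            }
        }
    }

Once the job is created, the refresh runs entirely inside the target database; the Java program is only needed once for the setup (you could just as well run the same DDL from SQL*Plus).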

I don't know Java, so I can't comment on that part.
As far as the database is concerned, one option is to create a database link between these two databases and a materialized view in one of them which fetches data over the database link from the other database.
You can schedule the refresh; there are various options. Read the documentation to pick the right one for your situation. Have a quick look at Tim Hall's materialized views article; if you find it interesting, search the Oracle documentation (for the version you use) for more info.

Create a database link between the source and target databases and then follow any of these native tool options:
Create a materialized view using a query that points to the source database.
Write a procedure on the target site that uses SELECT queries to read data from the source site and insert/update the target tables accordingly, then schedule those procedures using scheduler jobs.
Use Oracle GoldenGate, provided the tables you choose have a primary key or unique key.
Write your own Java or Python code that works in a publish/subscribe fashion to push the data to the target site (see the sketch after this list).
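
As a rough illustration of the last option above, here is a hedged Java sketch of a pull-style refresher: it periodically reads the filtered rows from the source database over JDBC and MERGEs them into the target. The connection details, table and column names are placeholders, and error handling is kept minimal:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class PeriodicRefreshJob {

        private static final String SOURCE_URL = "jdbc:oracle:thin:@//source-host:1521/SRCPDB";
        private static final String TARGET_URL = "jdbc:oracle:thin:@//target-host:1521/TGTPDB";

        public static void main(String[] args) {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            // Run the refresh immediately and then every 15 minutes.
            scheduler.scheduleAtFixedRate(PeriodicRefreshJob::refresh, 0, 15, TimeUnit.MINUTES);
        }

        private static void refresh() {
            try (Connection src = DriverManager.getConnection(SOURCE_URL, "src_user", "src_pwd");
                 Connection tgt = DriverManager.getConnection(TARGET_URL, "tgt_user", "tgt_pwd");
                 PreparedStatement read = src.prepareStatement(
                         "SELECT id, name, salary FROM employees WHERE department_id = 10");
                 PreparedStatement upsert = tgt.prepareStatement(
                         "MERGE INTO employees_copy t " +
                         "USING (SELECT ? AS id, ? AS name, ? AS salary FROM dual) s " +
                         "ON (t.id = s.id) " +
                         "WHEN MATCHED THEN UPDATE SET t.name = s.name, t.salary = s.salary " +
                         "WHEN NOT MATCHED THEN INSERT (id, name, salary) " +
                         "VALUES (s.id, s.name, s.salary)")) {

                tgt.setAutoCommit(false);
                try (ResultSet rs = read.executeQuery()) {
                    while (rs.next()) {
                        upsert.setLong(1, rs.getLong("id"));
                        upsert.setString(2, rs.getString("name"));
                        upsert.setBigDecimal(3, rs.getBigDecimal("salary"));
                        upsert.addBatch();            // batch the MERGEs to cut round trips
                    }
                }
                upsert.executeBatch();
                tgt.commit();
            } catch (Exception e) {
                e.printStackTrace();                  // in real code: log and retry
            }
        }
    }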

Related

Change file metadata using Apache Beam on a cloud database?

Can you change the file metadata on a cloud database using Apache Beam? From what I understand, Beam is used to set up dataflow pipelines for Google Dataflow. But is it possible to use Beam to change the metadata if you have the necessary changes in a CSV file without setting up and running an entire new pipeline? If it is possible, how do you do it?
You could code Cloud Dataflow to handle this but I would not. A simple GCE instance would be easier to develop and run the job. An even better choice might be UDF (see below).
There are some guidelines for when Cloud Dataflow is appropriate:
Your data is not tabular and you cannot use SQL to do the analysis.
Large portions of the job are parallel -- in other words, you can process different subsets of the data on different machines.
Your logic involves custom functions, iterations, etc...
The distribution of the work varies across your data subsets.
Since your task involves modifying a database (I am assuming a SQL database), it would be much easier and faster to write a UDF to process and modify the database.
First, Apache Beam does not currently support schema updates. There has been a feature request for some time, but no news.
Another option is to alter your current Apache Beam pipeline to migrate your table to another table with the corrected schema. This, unfortunately, does not scale if you have a lot of data, and it gets worse if you need to change the table schema frequently (renaming columns, renaming the table, changing data types, etc.).
What I propose instead is to issue SQL statements to update your table schema. You can write a bash script, following this guide, that executes the ALTER TABLE statements.
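
If you would rather drive the same ALTER TABLE approach from Java instead of a bash script, a minimal JDBC sketch could look like the following; the JDBC URL, driver, credentials, table and column names are all placeholders, and the exact ALTER TABLE syntax depends on your database:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class SchemaUpdate {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details; swap in the driver/URL for your database.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://db-host:3306/mydb", "user", "password");
                 Statement stmt = conn.createStatement()) {
                // Apply the schema/metadata change directly instead of rebuilding the pipeline.
                stmt.executeUpdate("ALTER TABLE my_table ADD COLUMN processed_at TIMESTAMP NULL");
            }
        }
    }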

Is there a way to insert data to Oracle using Direct Path loading for in-memory records?

New to Oracle here, but I have now read about the various bulk insert options for Oracle. In essence, true bulk loading is done using the Direct Path loading mechanism via SQL*Loader. There are also APPEND hint options that use serial or parallel Direct Path loading. But each of these has the following limitations:
SQL*Loader works off of a Control File, which contains the path of the data file. In my case, there is no file.
The APPEND hint for INSERT can only be used with the INSERT INTO ... SELECT ... FROM syntax. In my case, the source data is not in any table.
The source of my data is actually a Spark dataframe. I am looking for options to push this data in chunks to Oracle tables, but using the Direct Path loading option. For example, in Postgres, the PGConnection interface provides getCopyAPI().copyIn functionality, and you can create a huge serialized blob that can be sent over as one big chunk using a COPY tableName FROM STDIN yourBlob command. I am unable to find a similar Java API for Oracle that works on in-memory records and is able to push data directly (without any INSERT statements).
Any ideas on how to achieve this? Anyone done this before?
In general, how do folks using Oracle and Spark push data to Oracle from a dataframe in an optimized way?
Thanks in advance!
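
For what it's worth, the route most people take from a Spark dataframe to Oracle is Spark's built-in JDBC writer, which issues batched conventional INSERTs per partition rather than a direct-path load. A rough Java sketch, with the connection details, table name and tuning values as placeholders:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class SparkToOracle {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("spark-to-oracle").getOrCreate();

            // Placeholder: in a real job this dataframe already exists.
            Dataset<Row> df = spark.read().parquet("/path/to/source");

            df.write()
              .format("jdbc")
              .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB")
              .option("dbtable", "TARGET_TABLE")
              .option("user", "app_user")
              .option("password", "app_password")
              .option("driver", "oracle.jdbc.OracleDriver")
              // Tuning knobs for the conventional INSERT path: rows per JDBC batch
              // and the number of parallel writers (one connection per partition).
              .option("batchsize", "10000")
              .option("numPartitions", "8")
              .mode(SaveMode.Append)
              .save();
        }
    }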

Google Cloud Dataflow User-Defined MySQL Source

I am writing a Google Dataflow Pipeline and as one of the Sources I require a MySQL resultset via a query. A couple of questions then:
What would be the proper way to extract data from MySQL as a step in my pipeline? Can this simply be done in-line using JDBC?
In case I do indeed need to implement a "User-Defined Data Format" wrapping MySQL as a source, does anyone know if an implementation already exists, so I do not need to reinvent the wheel? (Don't get me wrong, I would enjoy writing it, but I would imagine using MySQL as a source is quite a common scenario.)
Thanks all!
At this time, Cloud Dataflow does not provide a MySQL input source.
The preferred way to implement support for this is to implement a user-defined input source that can handle MySQL queries.
An alternative would be to execute the query in the main program, stage the results of the query to a temporary location in GCS, process the results using Dataflow, and then remove the temporary files.
Hope this helps
A JDBC connector has just been added to Apache Beam (incubating). See JdbcIO.
Could you please clarify the need for GroupByKey in the above example? Since the previous ParDo (ReadQueryResults) returns rows keyed on the primary key, wouldn't the GroupByKey essentially create a group for each row of the result set? The subsequent ParDo (Regroup) would have parallelized the processing per row even without the GroupByKey, right?
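
To make the JdbcIO suggestion above a bit more concrete, here is a rough sketch of reading key/value rows from MySQL with it; the driver class, connection details, query and column names are placeholders, and the exact API may differ slightly between Beam versions:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    public class MySqlSourcePipeline {
        public static void main(String[] args) {
            Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

            // Read (id, name) pairs from MySQL via a plain SQL query.
            PCollection<KV<String, String>> rows = pipeline.apply(
                JdbcIO.<KV<String, String>>read()
                    .withDataSourceConfiguration(
                        JdbcIO.DataSourceConfiguration.create(
                                "com.mysql.cj.jdbc.Driver",
                                "jdbc:mysql://db-host:3306/mydb")
                            .withUsername("user")
                            .withPassword("password"))
                    .withQuery("SELECT id, name FROM customers")
                    .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()))
                    .withRowMapper((JdbcIO.RowMapper<KV<String, String>>) resultSet ->
                        KV.of(resultSet.getString("id"), resultSet.getString("name"))));

            pipeline.run().waitUntilFinish();
        }
    }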

Are there any samples for appengine Java report generation?

We are using AppEngine and the datastore for our application where we have a moderately large table of information containing a list with entries.
I would like to summarize the list of entries in a report specifying how many times each one appears e.g. normally in SQL I would just use a select distinct for a column, then loop over every entry and just use select count(x) where value = valueOfEntry.
While the count portion is easily done, the distinct problem is "a problem". The only solution I could find remotely close to this is MapReduce and most of the samples are based on Python. There is this blog entry which is very helpful but somewhat outdated since it predated the reduce portion. Then there is the video here and a few more resources I was able to find.
However, it's really hard for me to understand how to build the summary table if I can't write to a separate entity and I don't have a reduce stage.
This seems like something trivial and simple to accomplish, but it requires jumping through so many hoops. Is there no sample or existing reporting engine I can just plug into AppEngine without all the friction?
I saw BigQuery, but it seems like a huge hassle to move the data out of app engine and into that store. I tried downloading the data as CSV but ran into many issues with that as well. It doesn't seem like a practical solution in the long run either.
There is a document explaining some of the concepts of MapReduce for Java. Although it is incomplete, it shares most of the architecture with the Python version. In that document, there's also a pointer to a complete Java sample MapReduce app that reads from the datastore.
For writing the results, you specify an Output class. To write the results to a new datastore entity you would need to create your own Output class. But you could also use the blobstore (see BlobFileOutput.java).
Another alternative is that whenever you write one of your entities, you also write/update another entry in an EntityDistinct data model.
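
As a sketch of that EntityDistinct idea using the low-level Datastore API (the kind name, property name and helper method are made up here; for high write rates you would shard the counter to avoid contention on a single entity group):

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.EntityNotFoundException;
    import com.google.appengine.api.datastore.Key;
    import com.google.appengine.api.datastore.KeyFactory;
    import com.google.appengine.api.datastore.Transaction;

    public class EntityDistinctCounter {

        /** Call this whenever you write one of your entities, passing the column value. */
        public static void recordValue(String value) {
            DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
            Key key = KeyFactory.createKey("EntityDistinct", value);
            Transaction txn = datastore.beginTransaction();
            try {
                Entity counter;
                long count;
                try {
                    counter = datastore.get(txn, key);
                    count = (Long) counter.getProperty("count");
                } catch (EntityNotFoundException e) {
                    counter = new Entity(key);   // first time this value is seen
                    count = 0L;
                }
                counter.setProperty("count", count + 1);
                datastore.put(txn, counter);
                txn.commit();
            } finally {
                if (txn.isActive()) {
                    txn.rollback();
                }
            }
        }
    }

The report then becomes a simple query over the EntityDistinct kind instead of a distinct/count over the main table.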
If you plan on performing complex reports and you can anticipate all your needs now, I would suggest looking again at BigQuery. BigQuery is really powerful and works perfectly on very massive datasets. You can inspect http://code.google.com/p/log2bq/, which is a Python project that loads logs into BigQuery using MapReduce. Or you could also have a cron job that every once in a while fetches all new entities and moves them into BigQuery.
Related to the friction, remember that this is a NoSQL database, and as such it has some advantages, but some things are inherently different from SQL. Remember you can always use Google Cloud SQL, given that your dataset is of limited size, but you would lose the replication and fault-tolerant capabilities.
I think this could help you: http://jjmpsj.blogspot.ro/2008/05/appengine-output-tricks-reporting.html?m=1

Date versioning and restoring

We are developing a SaaS-based application. One of the requirements is to record every change in the database tables, i.e. create date/time-based versions of the data. The client should be able to revert back to any version of the data.
I have almost 30 tables in the database, and the data insertion frequency is about 80,000 records added/updated per day through bulk import. However, clients can also use the GUI to insert data through forms (in addition to bulk import).
Before creating any strategy to implement this requirement, I would love to have your comments/suggestions on how to implement it.
On a side note, I have reviewed this blog post and found it a very good starting point, but I am still unsure how to restore past data.
A database snapshot is a promising solution, but as I said earlier, this is a SaaS-based application and we store multiple clients' data in a single database, so a snapshot would restore data for the other clients as well.
Please suggest any strategy/plan on how to execute this requirement.
If you plan on using JPA/Hibernate to fetch your data, you can give Envers a shot.
Envers is a JBoss open-source project for maintaining versions of database entities. You can mark certain columns or the entire entity with the @Audited annotation to start tracking audit history. It typically stores all the audit data in a table with an _AUD suffix. It also provides an API to query historical data.
For details, please go through http://www.jboss.org/envers
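
As a small sketch of what that looks like (the entity and field names are made up, and whether you import javax.persistence or jakarta.persistence depends on your Hibernate version):

    import javax.persistence.Entity;
    import javax.persistence.EntityManager;
    import javax.persistence.Id;
    import org.hibernate.envers.AuditReader;
    import org.hibernate.envers.AuditReaderFactory;
    import org.hibernate.envers.Audited;

    // @Audited makes Envers record every insert/update/delete of this entity
    // in a companion audit table (CUSTOMER_AUD with the default suffix).
    @Entity
    @Audited
    public class Customer {

        @Id
        private Long id;

        private String name;
        private String email;

        // getters/setters omitted for brevity

        /** Load the state of a customer as it was at a given revision number. */
        public static Customer loadRevision(EntityManager em, Long customerId, Number revision) {
            AuditReader reader = AuditReaderFactory.get(em);
            return reader.find(Customer.class, customerId, revision);
        }
    }

Because the audit rows carry each record's history with a revision/timestamp, restoring a single client's data becomes a matter of querying and reapplying their audited rows at the desired revision, which may address the multi-tenant concern better than a database-level snapshot.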
