Can you change the file metadata on a cloud database using Apache Beam? From what I understand, Beam is used to set up dataflow pipelines for Google Dataflow. But is it possible to use Beam to change the metadata if you have the necessary changes in a CSV file without setting up and running an entire new pipeline? If it is possible, how do you do it?
You could code this in Cloud Dataflow, but I would not. A simple GCE instance would be an easier way to develop and run the job. An even better choice might be a UDF (see below).
There are some guidelines for when Cloud Dataflow is appropriate:
Your data is not tabular and you cannot use SQL to do the analysis.
Large portions of the job are parallel -- in other words, you can process different subsets of the data on different machines.
Your logic involves custom functions, iterations, etc.
The distribution of the work varies across your data subsets.
Since your task involves modifying a database (I am assuming a SQL database), it would be much easier and faster to write a UDF to process and modify the database.
First, Apache Beam does not currently support schema updates. There has been a feature request open for some time, but no news so far.
Another option is to alter your current Apache Beam pipeline so that it migrates your table to a new table with the corrected schema. Unfortunately, this does not scale if you have a lot of data, especially if you need to change the table schema frequently (renaming columns, renaming the table, changing data types, etc.).
What I propose instead is to issue SQL queries to update your table schema. You can write a bash script, following this guide, that executes an ALTER TABLE statement.
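If you prefer Java over bash and your cloud database is reachable over JDBC (for example a Cloud SQL for MySQL instance), the same idea can be sketched as below. This is a minimal, hypothetical sketch: the connection details, the layout of changes.csv (one table,column,newType triple per line), and the MODIFY COLUMN syntax are all assumptions for illustration, and the MySQL JDBC driver must be on the classpath.

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SchemaUpdater {
    public static void main(String[] args) throws Exception {
        // changes.csv is assumed to hold one change per line: table,column,newType
        String jdbcUrl = "jdbc:mysql://<HOST>:3306/<DATABASE>"; // placeholder connection string
        try (Connection conn = DriverManager.getConnection(jdbcUrl, "<USER>", "<PASSWORD>");
             Statement stmt = conn.createStatement();
             BufferedReader reader = Files.newBufferedReader(Paths.get("changes.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",");
                // Issue one ALTER TABLE per CSV row; adjust the DDL to your engine's dialect.
                String ddl = String.format("ALTER TABLE %s MODIFY COLUMN %s %s",
                        parts[0].trim(), parts[1].trim(), parts[2].trim());
                stmt.executeUpdate(ddl);
            }
        }
    }
}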
Related
I am new to Hadoop and have been given the task of migrating structured data to HDFS using Java code. I know the same can be accomplished by Sqoop, but that is not my task.
Can someone please explain a possible way to do this.
I did attempt to do it. What I did was copy data from psql server using jdbc driver and then store it in a csv format in HDFS. Is this the right way to go about this?
I have read that Hadoop has its own datatypes for storing structured data. Can you please explain how that works?
Thank you.
The state of the art is to use Sqoop (pull ETL) as a regular batch process to fetch data from the RDBMS. However, this approach is resource-consuming for the RDBMS (Sqoop often runs multiple threads with multiple JDBC connections), takes a long time (you often run sequential fetches against the RDBMS), and can lead to data corruption (the live RDBMS keeps being updated while this long Sqoop process is always lagging behind).
An alternative paradigm (push ETL) exists and is maturing. The idea is to build change data capture streams that listen to the RDBMS; an example project is Debezium. You can then build a real-time ETL that synchronizes the RDBMS and the data warehouse on Hadoop or elsewhere.
Sqoop is a simple tool which performs the following:
1) Connects to the RDBMS (PostgreSQL), reads the table metadata, and creates a POJO (a Java class) for the table.
2) Uses that Java class to import and export data through a MapReduce program.
If you need to write plain Java code (where you need to control the parallelism yourself for performance), do the following:
1) Create a Java class which connects to the RDBMS using JDBC.
2) Create a Java class which accepts an input String (taken from the ResultSet) and writes it into a file on HDFS, as sketched below.
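A minimal sketch of those two steps combined is shown here; the connection details, the query, and the HDFS path are placeholders, and it assumes the Hadoop client configuration and a JDBC driver are on the classpath.

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class JdbcToHdfs {
    public static void main(String[] args) throws Exception {
        // 1) Connect to the RDBMS over JDBC (connection details are placeholders).
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://<HOST>:5432/<DB>", "<USER>", "<PASSWORD>");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name, amount FROM source_table")) {

            // 2) Open a file on HDFS and write each row as a CSV line.
            Configuration hadoopConf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(hadoopConf);
            try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                    fs.create(new Path("/data/source_table/part-00000.csv"), true)))) {
                while (rs.next()) {
                    out.write(rs.getInt("id") + "," + rs.getString("name") + "," + rs.getBigDecimal("amount"));
                    out.newLine();
                }
            }
        }
    }
}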
Another way of doing this:
Create a MapReduce job using DBInputFormat, configure the number of input splits, and use TextOutputFormat to write to an output directory on HDFS; a sketch follows.
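Here is a hedged sketch of that MapReduce approach using the standard DBInputFormat and TextOutputFormat classes; the table, columns, and connection settings are illustrative only, and the JDBC driver jar must be available to the job.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class DbToHdfsJob {

    // Record type representing one row of the (hypothetical) "source_table".
    public static class RowRecord implements Writable, DBWritable {
        int id;
        String name;

        public void readFields(ResultSet rs) throws SQLException { id = rs.getInt("id"); name = rs.getString("name"); }
        public void write(PreparedStatement ps) throws SQLException { ps.setInt(1, id); ps.setString(2, name); }
        public void readFields(DataInput in) throws IOException { id = in.readInt(); name = in.readUTF(); }
        public void write(DataOutput out) throws IOException { out.writeInt(id); out.writeUTF(name); }
    }

    // Emits each row as a CSV line; TextOutputFormat writes it to HDFS.
    public static class ExportMapper extends Mapper<LongWritable, RowRecord, NullWritable, Text> {
        protected void map(LongWritable key, RowRecord row, Context ctx) throws IOException, InterruptedException {
            ctx.write(NullWritable.get(), new Text(row.id + "," + row.name));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // JDBC connection details are placeholders.
        DBConfiguration.configureDB(conf, "org.postgresql.Driver",
                "jdbc:postgresql://<HOST>:5432/<DB>", "<USER>", "<PASSWORD>");
        conf.setInt("mapreduce.job.maps", 4); // hint used by DBInputFormat to decide the number of splits

        Job job = Job.getInstance(conf, "db-to-hdfs");
        job.setJarByClass(DbToHdfsJob.class);
        job.setMapperClass(ExportMapper.class);
        job.setNumReduceTasks(0); // map-only export
        job.setInputFormatClass(DBInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        // Table, optional WHERE condition, ORDER BY column (used to slice splits), and the columns to read.
        DBInputFormat.setInput(job, RowRecord.class, "source_table", null, "id", "id", "name");
        FileOutputFormat.setOutputPath(job, new Path("/data/source_table_export"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}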
Please visit https://bigdatatamer.blogspot.com/ for any Hadoop and Spark related questions.
You are better off using Sqoop, because you may end up doing exactly what Sqoop already does if you go down the path of building it yourself.
Either way, conceptually you will need a custom mapper with a custom input format that is able to read partitioned data from the source. In this case, a table column on which the data can be partitioned is required to exploit parallelism. A partitioned source table would be ideal.
DBInputFormat doesn't optimize the calls on the source database. The complete dataset is sliced into a configured number of splits by the InputFormat.
Each mapper executes the same query and loads only the portion of the data corresponding to its split. This results in every mapper issuing the same query, along with a sort of the dataset, just so it can pick out its own portion of the data.
This class doesn't seem to take advantage of a partitioned source table. You can extend it to handle partitioned tables more efficiently.
Hadoop has structured file formats such as Avro, ORC, and Parquet to begin with.
If your data doesn't need to be stored in a columnar format (used primarily for OLAP use cases where only a few columns out of a large set need to be selected), go with Avro.
The way you are trying to do it is not a good one, because you are going to waste a lot of time developing and testing the code. Instead, use Sqoop to import the data from any RDBMS into Hive. The first tool that should come to mind is Sqoop (SQL to Hadoop).
Is there a way to create a table in Amazon Athena directly from a Parquet file based on its Avro schema? The schema is encoded into the file, so it seems silly that I need to write the DDL myself.
I saw this and also another duplicate question, but they relate directly to Hive and won't work for Athena.
Ideally I am looking for a way to do it programmatically without the need to define it at the console.
This is now more-or-less possible using AWS Glue. Glue can crawl a bunch of different data sources, including Parquet files on S3. Discovered tables are added to the Glue data catalog and queryable from Athena. Depending on your needs, you could schedule a Glue crawler to run periodically, or you could define and run a crawler using the Glue API.
If you have many separate chunks of data that share a schema, you can also use a partitioned table to reduce the overhead of making new loads available to Athena. For example, I have some daily dumps that load into tables partitioned by date. As long as the schema doesn't change, all you then need to do is run MSCK REPAIR TABLE.
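To do this programmatically rather than from the console, one option (a sketch under stated assumptions, not a definitive recipe) is to drive Glue and Athena from the AWS SDK for Java; the crawler name, database, table, and S3 locations below are placeholders you would replace with your own, and the crawler itself is assumed to have been defined already.

import com.amazonaws.services.athena.AmazonAthena;
import com.amazonaws.services.athena.AmazonAthenaClientBuilder;
import com.amazonaws.services.athena.model.QueryExecutionContext;
import com.amazonaws.services.athena.model.ResultConfiguration;
import com.amazonaws.services.athena.model.StartQueryExecutionRequest;
import com.amazonaws.services.glue.AWSGlue;
import com.amazonaws.services.glue.AWSGlueClientBuilder;
import com.amazonaws.services.glue.model.StartCrawlerRequest;

public class CatalogRefresher {
    public static void main(String[] args) {
        // Kick off an existing Glue crawler that points at the Parquet files on S3.
        AWSGlue glue = AWSGlueClientBuilder.defaultClient();
        glue.startCrawler(new StartCrawlerRequest().withName("my-parquet-crawler"));

        // After new partitions land, make them visible to Athena.
        AmazonAthena athena = AmazonAthenaClientBuilder.defaultClient();
        athena.startQueryExecution(new StartQueryExecutionRequest()
                .withQueryString("MSCK REPAIR TABLE my_table")
                .withQueryExecutionContext(new QueryExecutionContext().withDatabase("my_database"))
                .withResultConfiguration(new ResultConfiguration()
                        .withOutputLocation("s3://my-bucket/athena-query-results/")));
    }
}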
It doesn't seem to be possible with Athena, as avro.schema.url is not a supported property:
table property 'avro.schema.url' is not supported. (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException...)
You can use avro.schema.literal (you would have to copy the Avro JSON schema into the query), but I still experienced problems querying the data afterwards.
Strange errors like:
SYNTAX_ERROR: line 1:8: SELECT * not allowed in queries without FROM clause
I have a third-party system that generates a large amount of data each day (CSV files that are stored on an FTP server). There are three types of files being generated:
every 15 minutes (2 files); these are pretty small (~2 MB)
every day at 5 PM (~200-300 MB)
every midnight (this CSV file is about 1 GB)
Overall, the size of the 4 CSVs is about 1.5 GB, but we should take into account that some of the files are generated every 15 minutes. The data also needs to be aggregated (not a hard process, but it will definitely take time). I need fast responses.
I am thinking about how to store this data and about the overall implementation.
We have a Java stack. The database is MS SQL Standard. From my measurements, MS SQL Standard shared with other applications won't handle such a load. What comes to mind:
Upgrade to MS SQL Enterprise on a separate server.
Use PostgreSQL on a separate server. Right now I'm working on a PoC for this approach.
What would you recommend here? Probably there are better alternatives.
Edit #1
Those large files are new data for each day.
Okay. After spending some time on this problem (reading, consulting, experimenting, and doing several PoCs), I came up with the following solution.
Tl;dr
Database: PostgreSQL, as it handles CSV well and is free and open source.
Tool: Apache Spark is a good fit for this type of task, with good performance.
DB
Regarding the database, it is an important thing to decide: what to pick and how it will cope in the future with this amount of data. It should definitely be a separate server instance, in order not to generate additional load on the main database instance and not to block other applications.
NoSQL
I thought about using Cassandra here, but that solution would be too complex right now. Cassandra does not support ad-hoc queries; its data storage layer is basically a key-value store. That means you must "model" your data around the queries you need, rather than around the structure of the data itself.
RDBMS
I didn't want to overengineer here, so this is where I settled.
MS SQL Server
It is a way to go, but the big downside here is pricing: it is pretty expensive. The Enterprise edition costs a lot of money given our hardware. Regarding pricing, you could read this policy document.
Another drawback was the support for CSV files, which will be the main data source for us. MS SQL Server can neither import nor export CSV properly:
MS SQL Server silently truncating a text field.
MS SQL Server's text encoding handling going wrong.
MS SQL Server throwing an error message because it doesn't understand quoting or escaping.
More on that comparison can be found in the article PostgreSQL vs. MS SQL Server.
PostgreSQL
This database is a mature product and well battle-tested. I have heard a lot of positive feedback about it from others (of course, there are some tradeoffs too). It has a more classic SQL syntax, good CSV support, and moreover it is open source.
It is worth mentioning that SSMS is way better than pgAdmin. SSMS has autocomplete and multiple result sets (when you run several queries you get all the results at once, whereas in pgAdmin you only get the last one).
Anyway, right now I'm using DataGrip from JetBrains.
Processing Tool
I've looked at Spring Batch and Apache Spark. Spring Batch is a bit too low-level to use for this task, and Apache Spark provides the ability to scale more easily if that is needed in the future. Anyway, Spring Batch could also do this work.
Regarding an Apache Spark example, the code can be found in the learning-spark project.
My choice is Apache Spark for now.
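For what it's worth, a minimal sketch of the Spark side (Spark 2.x Java API) could look like the following; the CSV path, the aggregation, the target table, and the connection details are placeholders rather than the actual production job.

import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class CsvToPostgres {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-to-postgres")
                .master("local[*]") // or a real cluster master
                .getOrCreate();

        // Read the daily CSV dump (path and schema inference settings are illustrative).
        Dataset<Row> csv = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("/path/to/daily/*.csv");

        // A placeholder aggregation step; replace with the real business logic.
        Dataset<Row> aggregated = csv.groupBy("some_key").count();

        // Append the result into PostgreSQL over JDBC (connection details are placeholders).
        Properties props = new Properties();
        props.setProperty("user", "<USER>");
        props.setProperty("password", "<PASSWORD>");
        props.setProperty("driver", "org.postgresql.Driver");
        aggregated.write()
                .mode(SaveMode.Append)
                .jdbc("jdbc:postgresql://<HOST>:5432/<DB>", "daily_aggregates", props);

        spark.stop();
    }
}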
You might consider looking into the Apache Spark project. After validating and curating the data, maybe use Presto to run queries.
You could use uniVocity-parsers to process the CSV files as fast as possible, as this library comes with the fastest CSV parser around. I'm the author of this library, and it is open source and free (Apache v2 license).
Now, for loading the data into a database, you could try the univocity framework (commercial). We use it to load massive amounts of data into databases such as SQL Server and PostgreSQL very quickly - from 25K to 200K rows/second, depending on the database and its configuration.
Here's a simple example of how the code to migrate from your CSV files would look:
public static void main(String ... args){
    //Configure CSV input directory
    CsvDataStoreConfiguration csv = new CsvDataStoreConfiguration("csv");
    csv.addEntitiesFromDirectory(new File("/path/to/csv/dir/"), "ISO-8859-1");

    //should grab column names from CSV files
    csv.getDefaultEntityConfiguration().setHeaderExtractionEnabled(true);

    javax.sql.DataSource dataSource = connectToDatabaseAndGetDataSource(); //specific to your environment

    //Configures the target database
    JdbcDataStoreConfiguration database = new JdbcDataStoreConfiguration("database", dataSource);

    //Use only for postgres - their JDBC driver requires us to convert the input Strings from the CSV to the correct column types.
    database.getDefaultEntityConfiguration().setParameterConversionEnabled(true);

    DataIntegrationEngine engine = Univocity.getEngine(new EngineConfiguration(csv, database));

    //Creates a mapping between data stores "csv" and "database"
    DataStoreMapping mapping = engine.map(csv, database);

    // if names of CSV files and their columns match database tables and their columns
    // we can detect the mappings from one to the other automatically
    mapping.autodetectMappings();

    //loads the database.
    engine.executeCycle();
}
To improve performance, the framework allows you to manage the database schema and perform operations such as dropping constraints and indexes, loading the data, and recreating them. Data and schema transformations are also very well supported if you need them.
Hope this helps.
Pentaho Data Integration (or a similar ETL tool) can handle importing the data into a SQL database and can do aggregation on the fly. PDI has a community edition and can be run stand-alone or via a Java API.
My understanding of Apache Hive is that it's a SQL-like tooling layer for querying Hadoop clusters. My understanding of Apache Pig is that it's a procedural language for querying Hadoop clusters. So, if my understanding is correct, Hive and Pig seem like two different ways of solving the same problem.
My problem, however, is that I don't understand the problem they are both solving in the first place!
Say we have a DB (relational, NoSQL, doesn't matter) that feeds data into HDFS so that a particular MapReduce job can be run against that input data:
I'm confused as to which system Hive/Pig are querying! Are they querying the database? Are they querying the raw input data stored in the DataNodes on HDFS? Are they running little ad hoc, on-the-fly MR jobs and reporting their results/outputs?
What is the relationship between these query tools, the MR job input data stored on HDFS, and the MR job itself?
Apache Pig and Apache Hive load data from HDFS, unless you run them locally, in which case they load it locally. How do they get the data from a DB? They do not. You need another framework, such as Sqoop, to export the data from your traditional DB into HDFS.
Once you have the data in your HDFS, you can start working with Pig and Hive. They never query a DB. In Apache Pig, for example, you could load your data using a Pig loader:
A = LOAD 'path/in/your/HDFS' USING PigStorage('\t');
As for Hive, you need to create a table and then load the data into the table:
LOAD DATA INPATH 'path/in/your/HDFS/your.csv' INTO TABLE t1;
Again, the data must be in the HDFS.
As to how they work, it depends. Traditionally they have always worked with the MapReduce execution engine. Both Hive and Pig parse the statements you write in Pig Latin or HiveQL and translate them into an execution plan consisting of a certain number of MapReduce jobs, depending on the plan. However, they can now also translate to Tez, a newer execution engine which is perhaps still too new to work reliably.
Why the need for Pig or Hive? Well, you don't really need these frameworks. Everything they can do, you can do as well by writing your own MapReduce or Tez jobs. However, writing, for instance, a JOIN operation in MapReduce might take hundreds or thousands of lines of code (really), while it is a single line of code in Pig or Hive.
I don't think you can query any data with Hive/Pig without actually adding it to them first. So first you need to add the data. This data can come from anywhere; you just give the path from which it should be picked up, or add it to them directly. Once the data is in place, the queries fetch data only from those tables.
Underneath, they use MapReduce as the tool to do the processing. If you just have ad hoc data lying somewhere and need analysis, you can go directly to MapReduce and define your own logic. Hive is mostly on the SQL front, so you get querying features similar to SQL, and at the backend MapReduce does the job. Hope this info helps.
I don't agree that Pig and Hive solve the same problem. Hive is for querying data stored on HDFS as external or internal tables; Pig is for managing a data flow stored on HDFS as a directed acyclic graph. These are their main goals, and we don't care about other uses here. I want to draw the difference between:
Querying data (the main purpose of Hive), which means getting answers to questions about your data, for example: how many distinct users visited my website per month this year?
Managing a data flow (the main purpose of Pig), which means taking your data from an initial state to a different final state through transformations, for example: data in location A is filtered by criteria c, joined with data in location B, and stored in location C.
Smeeb, Pig and Hive do the same thing, I mean processing data that comes in files or whatever format.
Here, if you want to process data present in an RDBMS, first get that data into HDFS with the help of Sqoop (SQL + Hadoop).
Hive uses HQL, which is like SQL, for processing; Pig uses a kind of data flow with the help of Pig Latin.
Hive stores all input data in table format, so the first thing to do before loading data into Hive is to create a Hive table; that table structure (metadata) is stored in an RDBMS (e.g. MySQL). Then load the data with LOAD DATA INPATH 'path/in/your/HDFS/your.csv' INTO TABLE t1;
I am writing a Google Dataflow pipeline, and as one of the sources I require a MySQL result set via a query. A couple of questions then:
What would be the proper way to extract data from MySQL as a step in my pipeline? Can this simply be done in-line using JDBC?
In case I do indeed need to implement a "User-Defined Data Format" wrapping MySQL as a source, does anyone know if an implementation already exists, so that I do not need to reinvent the wheel? (Don't get me wrong, I would enjoy writing it, but I would imagine using MySQL as a source is quite a common scenario.)
Thanks all!
At this time, Cloud Dataflow does not provide a MySQL input source.
The preferred way to implement support for this is to implement a user-defined input source that can handle MySQL queries.
An alternative would be to execute the query in the main program, stage the results of the query to a temporary location in GCS, process the results using Dataflow, and then remove the temporary files.
Hope this helps
A JDBC connector has just been added to Apache Beam (incubating). See JdbcIO.
Could you please clarify the need for the GroupByKey in the above example? Since the previous ParDo (ReadQueryResults) returns rows keyed on the primary key, wouldn't the GroupByKey essentially create a group for each row of the result set? The subsequent ParDo (Regroup) would have parallelized the processing per row even without the GroupByKey, right?
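Following up on the JdbcIO answer above, here is a minimal sketch of what reading a MySQL result set with JdbcIO could look like in the Beam Java SDK; the connection settings, query, table, and column names are placeholders, and the MySQL JDBC driver is assumed to be on the classpath.

import java.sql.ResultSet;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class MySqlReadPipeline {
    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Read (id, name) rows from MySQL; connection details and query are placeholders.
        PCollection<KV<Integer, String>> rows = pipeline.apply(
                JdbcIO.<KV<Integer, String>>read()
                        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                                "com.mysql.jdbc.Driver", "jdbc:mysql://<HOST>:3306/<DB>")
                                .withUsername("<USER>")
                                .withPassword("<PASSWORD>"))
                        .withQuery("SELECT id, name FROM my_table")
                        .withCoder(KvCoder.of(VarIntCoder.of(), StringUtf8Coder.of()))
                        .withRowMapper(new JdbcIO.RowMapper<KV<Integer, String>>() {
                            @Override
                            public KV<Integer, String> mapRow(ResultSet resultSet) throws Exception {
                                return KV.of(resultSet.getInt("id"), resultSet.getString("name"));
                            }
                        }));

        // ...further transforms on `rows` go here...
        pipeline.run();
    }
}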