BQ load Avro files with different schemas, field additions only - java

Context:
We have a Dataflow job that writes Avro files to GCS with a schema that changes weekly (field additions only). This means that under a single GCS prefix we have a collection of Avro files with different schemas, most likely two schemas at any given time. For more details, please see the Context section in this post.
The problem:
According to this SO post, when loading Avro files with multiple schemas into BigQuery, BigQuery uses the schema from the file with the greatest lexicographic order. However, this is not the behavior I observed; the behavior I see is inconsistent.
In my first try, the newer schema was picked up and the new fields were present. However, the BQ load itself took much longer than it should have: 7 minutes to load 368,594 records.
In my second try, the files with the greater lexicographic order use the new schema, and I can open those Avro files and see the new fields in the header. But when I BQ load all of these files into a table, the added fields are missing. If I load the file with the greatest order individually, the table does get the new fields.
We have a custom file naming policy, which is:
"chunk-$windowStart-$windowEnd-shardIndex-of-shardNum-UUID.avro"
Question:
Since BQ does automatic schema detection for Avro files, what exactly is the rule for resolving old vs. new schemas, especially when only field additions happen?
Why did the BQ load take so long in my first attempt? Did it load with the old schema, discover the new schema halfway through, and then redo all the work?
Any suggestions on how I can debug this?

Google Cloud Support here!
Schema auto-detection is an inference process that BigQuery carries out based on a small sample of the input. This means the inferred schema may vary depending on the sample analysed, which may account for the inconsistent behaviour you have experienced. For more info, check out this doc.
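For debugging, one way to take the sampling out of the picture is to run the load with an explicit schema update option through the BigQuery Java client rather than relying on auto-detection alone. A minimal sketch, assuming an existing my_dataset.my_table table and a hypothetical GCS wildcard URI:

import com.google.cloud.bigquery.*;

import java.util.List;

public class AvroLoad {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Hypothetical destination table and GCS prefix; replace with your own.
        TableId tableId = TableId.of("my_dataset", "my_table");
        String sourceUri = "gs://my-bucket/my-prefix/chunk-*.avro";

        LoadJobConfiguration config = LoadJobConfiguration.newBuilder(tableId, sourceUri)
                .setFormatOptions(FormatOptions.avro())
                // Explicitly allow the new Avro fields to be added to the existing table schema,
                // instead of depending on which file the schema inference happens to pick.
                .setSchemaUpdateOptions(List.of(JobInfo.SchemaUpdateOption.ALLOW_FIELD_ADDITION))
                .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND)
                .build();

        Job job = bigquery.create(JobInfo.of(config)).waitFor();
        System.out.println("Load finished with status: " + job.getStatus());
    }
}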
To answer this question I need more information, so I would encourage you to open a ticket with Google Cloud Support so that we can better assist you.
See answer 2.
I hope that helps.

Related

Athena create table from parquet schema

Is there a way to create a table in Amazon Athena directly from a Parquet file based on its Avro schema? The schema is encoded into the file, so it seems silly that I need to write the DDL myself.
I saw this and also another duplicate, but they relate directly to Hive and won't work for Athena.
Ideally I am looking for a way to do it programmatically without the need to define it at the console.
This is now more-or-less possible using AWS Glue. Glue can crawl a number of different data sources, including Parquet files on S3. Discovered tables are added to the Glue Data Catalog and are queryable from Athena. Depending on your needs, you could schedule a Glue crawler to run periodically, or you could define and run a crawler using the Glue API.
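For the programmatic route, here is a minimal sketch using the AWS SDK for Java v2 Glue client; the crawler name, IAM role, catalog database, and S3 path are placeholders:

import software.amazon.awssdk.services.glue.GlueClient;
import software.amazon.awssdk.services.glue.model.CrawlerTargets;
import software.amazon.awssdk.services.glue.model.CreateCrawlerRequest;
import software.amazon.awssdk.services.glue.model.S3Target;
import software.amazon.awssdk.services.glue.model.StartCrawlerRequest;

public class ParquetCrawler {
    public static void main(String[] args) {
        try (GlueClient glue = GlueClient.create()) {
            // Define a crawler over the S3 prefix holding the Parquet files.
            glue.createCrawler(CreateCrawlerRequest.builder()
                    .name("parquet-crawler")                          // placeholder name
                    .role("arn:aws:iam::123456789012:role/GlueRole")  // placeholder IAM role
                    .databaseName("analytics")                        // target Glue catalog database
                    .targets(CrawlerTargets.builder()
                            .s3Targets(S3Target.builder()
                                    .path("s3://my-bucket/parquet/")
                                    .build())
                            .build())
                    .build());

            // Run it once; the discovered table becomes queryable from Athena.
            glue.startCrawler(StartCrawlerRequest.builder().name("parquet-crawler").build());
        }
    }
}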
If you have many separate hunks of data that share a schema, you can also use a partitioned table to reduce the overhead of making new loads available to Athena. For example, I have some daily dumps that load into tables partitioned by date. As long as the schema doesn't change, all you then need to do is MSCK REPAIR TABLE.
It doesn't seem to be possible with Athena as avro.schema.url is not a supported property.
table property 'avro.schema.url' is not supported. (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException...)
You can use avro.schema.literal (you would have to copy the Avro JSON schema into the query), but I still experienced problems querying the data afterwards.
Strange errors like:
SYNTAX_ERROR: line 1:8: SELECT * not allowed in queries without FROM clause

What is the best way to process large CSV files?

I have a third-party system that generates a large amount of data each day (CSV files stored on FTP). There are 3 types of files being generated:
every 15 minutes (2 files). These files are pretty small (~2 MB)
every day at 5 PM (~200-300 MB)
every midnight (this CSV file is about 1 GB)
Overall, the daily size of the four CSVs is about 1.5 GB, and we should take into account that some of the files are generated every 15 minutes. The data also needs to be aggregated (not a hard process, but it will definitely take time), and I need fast responses.
I am thinking about how to store this data and about the overall implementation.
We have a Java stack. The database is MS SQL Standard. From my measurements, MS SQL Standard shared with other applications won't handle such a load. What comes to mind:
An upgrade to MS SQL Enterprise on a separate server.
PostgreSQL on a separate server. Right now I'm working on a PoC for this approach.
What would you recommend here? Probably there are better alternatives.
Edit #1
Those large files are new data for each day.
Okay. After spending some time with this problem (reading, consulting, experimenting, and doing several PoCs), I came up with the following solution.
Tl;dr
Database: PostgreSQL as it is good for CSV, free and open source.
Tool: Apache Spark is a good fit for this type of task, and it offers good performance.
DB
Regarding the database, this is an important decision: what to pick and how it will cope in future with this amount of data. It should definitely be a separate server instance, so that it does not generate additional load on the main database instance or block other applications.
NoSQL
I thought about using Cassandra here, but that solution would be too complex right now. Cassandra does not support ad-hoc queries; its storage layer is basically a key-value system, which means you must "model" your data around the queries you need rather than around the structure of the data itself.
RDBMS
I didn't want to over-engineer here, so this is where I settled.
MS SQL Server
It is a way to go, but the big downside is pricing: the Enterprise edition costs a lot of money given our hardware. Regarding pricing, you can read this policy document.
Another drawback was CSV support, which matters because CSV will be our main data source. MS SQL Server's CSV import and export are unreliable:
MS SQL Server silently truncating a text field.
MS SQL Server's text encoding handling going wrong.
MS SQL Server throwing an error message because it doesn't understand quoting or escaping.
More on that comparison can be found in the article PostgreSQL vs. MS SQL Server.
PostgreSQL
This database is a mature, battle-tested product. I have heard a lot of positive feedback about it from others (of course, there are some trade-offs too). It has more classic SQL syntax and good CSV support, and it is open source.
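As an illustration of the CSV support from a Java stack, here is a minimal sketch that bulk-loads a CSV file through the PostgreSQL JDBC driver's CopyManager; the connection string, table, and file path are placeholders:

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

import java.io.FileReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.DriverManager;

public class CsvBulkLoad {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             Reader csv = new FileReader("/data/daily-dump.csv")) {

            // COPY streams the whole file to the server, which is much faster than row-by-row INSERTs.
            CopyManager copyManager = conn.unwrap(PGConnection.class).getCopyAPI();
            long rows = copyManager.copyIn("COPY daily_data FROM STDIN WITH (FORMAT csv, HEADER)", csv);
            System.out.println("Loaded rows: " + rows);
        }
    }
}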
It is worth mentioning that SSMS is way better than pgAdmin: SSMS has autocomplete and shows multiple result sets (when you run several queries you get all of their results, while in pgAdmin you only get the last one).
Anyway, right now I'm using DataGrip from JetBrains.
Processing Tool
I've looked at Spring Batch and Apache Spark. Spring Batch is a bit too low-level for this task, and Apache Spark scales more easily if that is needed in future. That said, Spring Batch could also do the job.
For an Apache Spark example, the code can be found in the learning-spark project.
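As an illustration, a minimal sketch of reading one of the CSV dumps and aggregating it with the Spark Java API; the file path, column names, and the aggregation itself are made-up placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

public class CsvAggregation {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-aggregation")
                .master("local[*]")   // or a cluster master URL
                .getOrCreate();

        // Read a daily dump with a header row and let Spark infer the column types.
        Dataset<Row> daily = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("/data/daily-dump.csv");

        // Placeholder aggregation: total amount per customer.
        Dataset<Row> totals = daily.groupBy(col("customer_id"))
                .agg(sum(col("amount")).alias("total_amount"));

        // Write the aggregate back out (this could just as well be a JDBC write to PostgreSQL).
        totals.write().mode("overwrite").parquet("/data/aggregates/daily");

        spark.stop();
    }
}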
My choice is Apache Spark for now.
You might consider looking into the Apache Spark project. After validating and curating the data maybe use Presto to run queries.
You could use uniVocity-parsers to process the CSV as fast as possible, as this library comes with the fastest CSV parser around. I'm the author of this library, and it is open-source and free (Apache v2 license).
Now, for loading the data into a database, you could try the univocity framework (commercial). We use it to load massive amounts of data into databases such as SQL Server and PostgreSQL very quickly, from 25K to 200K rows/second, depending on the database and its configuration.
Here's a simple example of what the code to migrate your CSV data would look like:
public static void main(String ... args) {
    // Configure the CSV input directory
    CsvDataStoreConfiguration csv = new CsvDataStoreConfiguration("csv");
    csv.addEntitiesFromDirectory(new File("/path/to/csv/dir/"), "ISO-8859-1");

    // Grab column names from the CSV headers
    csv.getDefaultEntityConfiguration().setHeaderExtractionEnabled(true);

    javax.sql.DataSource dataSource = connectToDatabaseAndGetDataSource(); // specific to your environment

    // Configure the target database
    JdbcDataStoreConfiguration database = new JdbcDataStoreConfiguration("database", dataSource);

    // Use only for PostgreSQL - its JDBC driver requires us to convert the input Strings
    // from the CSV to the correct column types.
    database.getDefaultEntityConfiguration().setParameterConversionEnabled(true);

    DataIntegrationEngine engine = Univocity.getEngine(new EngineConfiguration(csv, database));

    // Create a mapping between data stores "csv" and "database"
    DataStoreMapping mapping = engine.map(csv, database);

    // If the names of the CSV files and their columns match the database tables and their columns,
    // we can detect the mappings from one to the other automatically
    mapping.autodetectMappings();

    // Load the database.
    engine.executeCycle();
}
To improve performance, the framework lets you manage the database schema and perform operations such as dropping constraints and indexes, loading the data, and recreating them. Data and schema transformations are also very well supported if you need them.
Hope this helps.
Pentaho Data Integration (or a similar ETL tool) can handle importing the data into a SQL database and can do aggregation on the fly. PDI has a community edition and can be run stand-alone or via a Java API.
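For the Java API route, a hedged sketch of running a transformation designed in Spoon from code via the Kettle classes (the .ktr path is a placeholder, and exact class names may differ between PDI versions):

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunPdiTransformation {
    public static void main(String[] args) throws Exception {
        // Initialise the Kettle/PDI engine once per JVM.
        KettleEnvironment.init();

        // Load a transformation built in Spoon (e.g. CSV input -> aggregation -> table output).
        TransMeta transMeta = new TransMeta("/etl/load-daily-dump.ktr");
        Trans trans = new Trans(transMeta);

        trans.execute(null);          // no command-line parameters
        trans.waitUntilFinished();

        if (trans.getErrors() > 0) {
            throw new IllegalStateException("Transformation finished with errors");
        }
    }
}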

Saving Data from a JavaFX-Application without Database

Unfortunately I couldn't find anything specific to this topic / to my problem. Here we go:
I'm building a JavaFX business application for a friend of mine. Unfortunately I have no way to connect to a database. I want the application to load its saved state from a file. The application contains a list of clients, and the clients have some specific properties. I don't want to hard-code this into a .prop or .txt file, because I'm sure there's a better way of doing this, isn't there?
Thanks in advance, appreciate it!
There are lots of choices for persisting data to local storage. The exact choice depends on your needs, and you have not described enough detail for a specific recommendation.
Here is a list of possibilities, roughly in increasing order of complexity of your data.
Text file
If you have small amounts of simple data, save to a text file. You can store each piece in a separate file, or combine them into a single file. Recent versions of Java have new classes that make this easier than ever. See the Oracle Tutorial.
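A minimal sketch using java.nio.file (Java 11+); the file name and the stored value are placeholders:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TextFileStore {
    public static void main(String[] args) throws IOException {
        Path file = Path.of("app-state.txt");   // placeholder location

        // Write a small piece of state as plain text.
        Files.writeString(file, "lastOpenedClient=42\n");

        // Read it back on the next start of the application.
        String state = Files.readString(file);
        System.out.println(state);
    }
}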
Comma-separated & Tab-delimited
For sets of structured data, write to text files in comma-separated values (CSV) or tab-delimited values. For example a list of people with rows for each person, and columns for name, phone number, and email address.
While reading/writing such files is easy enough to program yourself, I suggest using an established library to eliminate the drudgery, avoid bugs, and save yourself some time. There are a few such libraries written in Java.
My favorite is the Apache Commons CSV project. This library makes easy work of the chore of reading/writing such files. Despite the name, this library supports tab-delimited as well as comma-separated formats. I've written a few Answers here on Stack Overflow showing how to use this library, as you can see here, here, and here.
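A minimal sketch of writing and reading a client list with Apache Commons CSV (1.9+); the column names are made up for the example:

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;

import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;

public class ClientCsv {
    public static void main(String[] args) throws IOException {
        Path file = Path.of("clients.csv");

        // Write each client as one row; the header row is emitted automatically.
        CSVFormat writeFormat = CSVFormat.DEFAULT.builder()
                .setHeader("name", "phone", "email")
                .build();
        try (Writer out = Files.newBufferedWriter(file);
             CSVPrinter printer = new CSVPrinter(out, writeFormat)) {
            printer.printRecord("Alice", "555-0100", "alice@example.com");
            printer.printRecord("Bob", "555-0101", "bob@example.com");
        }

        // Read the rows back, mapping columns by header name.
        CSVFormat readFormat = CSVFormat.DEFAULT.builder()
                .setHeader("name", "phone", "email")
                .setSkipHeaderRecord(true)
                .build();
        try (Reader in = Files.newBufferedReader(file)) {
            for (CSVRecord record : readFormat.parse(in)) {
                System.out.println(record.get("name") + " -> " + record.get("email"));
            }
        }
    }
}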
By the way, plain old ASCII explicitly defines a few character positions for delimiting data files, with four levels of grouping (file, group, record/row, and field separators). Unicode, of course, inherits these from ASCII as code points. I am puzzled why they have remained so obscure and so infrequently used; they seem much more logical to me than commas and tabs, which may well exist inside the data payload.
Serialization
You can write out the data values stored within an object. This is called serialization. Java has a serialization facility built-in, but be sure to study up on the details.
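A minimal sketch of the built-in facility, assuming a simple serializable Client type of your own (a record here for brevity; note that JavaFX property objects are not serializable, so store plain values):

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ClientStore {
    // Placeholder model class; every field must itself be serializable.
    record Client(String name, String email) implements Serializable {}

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        Path file = Path.of("clients.ser");
        List<Client> clients = new ArrayList<>(List.of(new Client("Alice", "alice@example.com")));

        // Save the whole list in one call.
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(clients);
        }

        // Load it back on the next run.
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            @SuppressWarnings("unchecked")
            List<Client> restored = (List<Client>) in.readObject();
            System.out.println(restored);
        }
    }
}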
To more simply write out an object’s values and later read them back in to reconstitute an object, I have enjoyed using the Simple XML Serialization project. This works well for relatively simple needs, and is aimed at the situation where you want the structure of a class to drive the process of determining what to write.
Java has other XML-binding facilities, both built-in and third-party. These are much more powerful and flexible. They are especially good when you want to define and verify the XML structure in a rigid fashion, such as defining an XML DTD or XML Schema against which to validate the data, and perhaps even generating the Java class in which to represent the data.
Embedded database
For more complicated data, use an embedded relational database.
The SQLite database is bundled with many platforms. It is a C-based library, not pure Java. As the name indicates, SQLite is indeed quite “lite“, lacking rigid data types and many other common database features. SQLite is meant more as an alternative to writing text files than as a competitor to more serious databases. It is a great product if your needs fit the sweet spot of its capabilities.
My first choice for an embedded database would be the H2 Database Engine. It is built in pure Java, can run inside your app or separately as a server (your choice), and has sophisticated relational database features. It has been around for years, is frequently updated, and is well worn. The principal author has much experience in the field.
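A minimal sketch of using H2 in embedded mode over plain JDBC; the database path and table are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class EmbeddedH2Example {
    public static void main(String[] args) throws SQLException {
        // The database is stored as a file next to the application (./data/clients.mv.db).
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./data/clients", "sa", "")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS client (id INT PRIMARY KEY, name VARCHAR(255), email VARCHAR(255))");
            }
            try (PreparedStatement ps = conn.prepareStatement("INSERT INTO client (id, name, email) VALUES (?, ?, ?)")) {
                ps.setInt(1, 1);
                ps.setString(2, "Alice");
                ps.setString(3, "alice@example.com");
                ps.executeUpdate();
            }
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT name, email FROM client")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + " -> " + rs.getString("email"));
                }
            }
        }
    }
}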

Are there any samples for appengine Java report generation?

We are using AppEngine and the datastore for our application where we have a moderately large table of information containing a list with entries.
I would like to summarize the list of entries in a report specifying how many times each one appears. For example, in SQL I would normally use SELECT DISTINCT on a column, then loop over every entry and use SELECT COUNT(x) WHERE value = valueOfEntry.
While the count portion is easily done, the distinct problem is "a problem". The only solution I could find remotely close to this is MapReduce and most of the samples are based on Python. There is this blog entry which is very helpful but somewhat outdated since it predated the reduce portion. Then there is the video here and a few more resources I was able to find.
However, it's really hard for me to understand how to build the summary table if I can't write to a separate entity and I don't have a reduce stage.
This seems like something trivial and simple to accomplish, yet it requires so many hoops. Is there no sample or existing reporting engine I can just plug into App Engine without all the friction?
I saw BigQuery, but it seems like a huge hassle to move the data out of app engine and into that store. I tried downloading the data as CSV but ran into many issues with that as well. It doesn't seem like a practical solution in the long run either.
There is a document explaining some of the concepts of MapReduce for Java. Although it is incomplete, it shares most of the architecture with the Python version. That document also points to a complete Java sample MapReduce app that reads from the datastore.
For writing the results, you specify an Output class. To write the results to a new datastore entity you would need to create your own Output Class. But you could also use the blobstore (see BlobFileOutput.java).
Another alternative is that whenever you write one of your entities, you also write/update an entry in an EntityDistinct data model.
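A hedged sketch of that alternative using the App Engine low-level datastore API; the EntityDistinct kind and the property being counted are made up for the example:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

public class DistinctCounter {
    private static final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

    // Call this alongside every write of the main entity. In production this read-modify-write
    // should run inside a transaction (and be sharded if the write rate per value is high).
    public static void incrementCount(String distinctValue) {
        // One EntityDistinct entity per distinct value, keyed by the value itself.
        Key key = KeyFactory.createKey("EntityDistinct", distinctValue);
        Entity counter;
        try {
            counter = datastore.get(key);
            counter.setProperty("count", (Long) counter.getProperty("count") + 1);
        } catch (EntityNotFoundException e) {
            counter = new Entity(key);
            counter.setProperty("count", 1L);
        }
        datastore.put(counter);
    }
}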
If you plan on performing complex reports and can anticipate your needs now, I would suggest you look again at BigQuery. BigQuery is really powerful and works well on very large datasets. You can inspect http://code.google.com/p/log2bq/, a Python project that loads logs into BigQuery using MapReduce. Alternatively, you could have a cron job that periodically fetches all new entities and moves them into BigQuery.
Regarding the friction, remember that this is a NoSQL database; it has some advantages, but some things are inherently different from SQL. You can always use Google Cloud SQL, given that your dataset is of limited size, but you would lose the replication and fault-tolerance capabilities.
I think this could help you: http://jjmpsj.blogspot.ro/2008/05/appengine-output-tricks-reporting.html?m=1

How to handle multiple schemas containing the same data

I'm working on a system which predicts soccer matches at work. I have several pre-existing databases which each contain broadly the same data, although some vendors provide more data than others. I have a core set of fields that my application uses and which all vendors provide:
homeTeamId, awayTeamId, fullTimeHomeGoals, fullTimeAwayGoals, homeShotsOnTarget, awayShotsOnTarget, etc...
Because these databases come from different sources, the field names vary. Also, some of this data is subjective (the definition of a shot on target varies), which means I need to know which vendor a match came from. There is also overlap, because several vendors will have data for a particular match.
At the moment we use one source of data at a time, but in future we will use two or more vendors at once, chosen by the competition each vendor covers (selecting by competition removes the issue of duplicate matches).
My solution was to use XML to store a mapping of the field names, e.g.:
<Schemas>
  <Schema>
    <SchemaName>VendorA</SchemaName>
    <TableName>VendorA_MatchResults</TableName>
    <FullTimeHomeGoals>homeFullTimeScore</FullTimeHomeGoals>
    Etc...
  </Schema>
</Schemas>
Then, whenever I need a field for a SQL query, I look at the vendor the user has specified in the job configuration XML and look up the fields relevant to that vendor. When we come to use results from two vendors, I was planning to use a view and treat it as a new vendor in the XML.
This must be a reasonably common problem but I couldn't find anything online discussing how to tackle it. My gut instinct says the DB should be able to handle this internally instead, perhaps with a view?
I'd be grateful for any advice or ideas.
For background, I'm using MySql and Java to develop this application.
I think what you are doing is good.
You should create a class that stores this configuration for each schema, and keep the configs in a Map. Make sure the entries and their configurations come from a config file, just as you have done.
I would recommend Spring here; it makes your life easier. Each time you want to add a new vendor, you just edit the config file and restart your app.
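A minimal sketch of such a mapping class; the vendor names, field names, and the hard-coded setup are placeholders (in practice the map would be populated from the Schemas XML, e.g. via Spring or JAXB):

import java.util.HashMap;
import java.util.Map;

public class VendorSchema {
    private final String tableName;
    // Canonical field name (e.g. "FullTimeHomeGoals") -> vendor-specific column name.
    private final Map<String, String> fieldMappings;

    public VendorSchema(String tableName, Map<String, String> fieldMappings) {
        this.tableName = tableName;
        this.fieldMappings = Map.copyOf(fieldMappings);
    }

    public String getTableName() {
        return tableName;
    }

    public String columnFor(String canonicalField) {
        String column = fieldMappings.get(canonicalField);
        if (column == null) {
            throw new IllegalArgumentException("Vendor does not provide field: " + canonicalField);
        }
        return column;
    }

    public static void main(String[] args) {
        // Normally loaded from the job configuration; hard-coded here for illustration.
        Map<String, VendorSchema> schemas = new HashMap<>();
        schemas.put("VendorA", new VendorSchema("VendorA_MatchResults",
                Map.of("FullTimeHomeGoals", "homeFullTimeScore")));

        VendorSchema vendorA = schemas.get("VendorA");
        String sql = "SELECT " + vendorA.columnFor("FullTimeHomeGoals")
                + " FROM " + vendorA.getTableName();
        System.out.println(sql);
    }
}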
