I'm new to Druid and am working on a project where data is collected on a monthly or weekly basis and then used for analysis. Currently I store all the collected data in Postgres with a timestamp on each row. I've now decided to move to a time-series database (Druid). I've gone through the Druid docs and learned how to load data through the Druid console (basically, I exported data from Postgres to CSV and loaded that through the console) and through commands. Now, if I want to load and query data using Java, how can I do that?
I'm not finding many resources on this, especially on how to load data (in the form of CSV) into Druid using Java, so it would be very helpful if someone could guide me through this.
You can use the Druid client APIs,
or use druidry.
Examples are here.
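If no client library is used, Druid can also be queried over plain HTTP. The following is a minimal sketch, assuming Java 11+ and a Druid router or broker listening on localhost:8888 (the address and the datasource name my_datasource are assumptions, not something stated in the question). It posts a SQL string to the /druid/v2/sql endpoint. Ingestion from Java can follow the same pattern by posting a JSON ingestion spec to the task endpoint, which is not shown here.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class DruidSqlExample {

    // Assumption: a Druid router or broker is reachable at this address.
    static final String DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql";

    // Wraps a SQL string in the JSON body used by the SQL endpoint: {"query":"..."}.
    // Only the most common JSON escapes are handled in this sketch.
    static String buildBody(String sql) {
        StringBuilder escaped = new StringBuilder();
        for (char c : sql.toCharArray()) {
            switch (c) {
                case '"':  escaped.append("\\\""); break;
                case '\\': escaped.append("\\\\"); break;
                case '\n': escaped.append("\\n"); break;
                case '\r': escaped.append("\\r"); break;
                case '\t': escaped.append("\\t"); break;
                default:   escaped.append(c);
            }
        }
        return "{\"query\":\"" + escaped + "\"}";
    }

    // Builds the POST request without sending it.
    static HttpRequest buildRequest(String sql) {
        return HttpRequest.newBuilder(URI.create(DRUID_SQL_URL))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(buildBody(sql)))
                .build();
    }

    // Sends the query and returns the raw response body (JSON by default).
    static String runQuery(String sql) throws Exception {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(sql), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // "my_datasource" is a hypothetical datasource name.
        System.out.println(runQuery("SELECT COUNT(*) AS row_count FROM \"my_datasource\""));
    }
}
```

A real application would parse the JSON response with a JSON library rather than printing it.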
Related
Can you change the file metadata on a cloud database using Apache Beam? From what I understand, Beam is used to set up dataflow pipelines for Google Dataflow. But is it possible to use Beam to change the metadata if you have the necessary changes in a CSV file without setting up and running an entire new pipeline? If it is possible, how do you do it?
You could code Cloud Dataflow to handle this but I would not. A simple GCE instance would be easier to develop and run the job. An even better choice might be UDF (see below).
There are some guidelines for when Cloud Dataflow is appropriate:
Your data is not tabular and you can not use SQL to do the analysis.
Large portions of the job are parallel -- in other words, you can process different subsets of the data on different machines.
Your logic involves custom functions, iterations, etc...
The distribution of the work varies across your data subsets.
Since your task involves modifying a database, I am assuming a SQL database, it would be much easier and faster to write a UDF to process and modify the database.
First, Apache Beam does not currently support schema updates. There has been a feature request for some time, but no news.
Another option is to alter your current Dataflow job, written as an Apache Beam pipeline, to migrate your table to another table with the corrected schema. Unfortunately, this does not scale if you have a lot of data, or if you need to change the table schema frequently (renaming columns, renaming tables, changing data types, etc.).
What I propose instead is to issue SQL queries to update your table schema. You can write a bash script using this guide that executes an ALTER TABLE statement.
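As a rough illustration of the same idea in Java rather than bash, the sketch below turns CSV lines of the form old_name,new_name into ALTER TABLE ... RENAME COLUMN statements. The CSV layout, the table name orders, and the column names are all hypothetical, and the exact ALTER TABLE syntax depends on the database you are targeting, so treat the generated SQL as a starting point.

```java
import java.util.ArrayList;
import java.util.List;

class SchemaChangeScript {

    // Accept only simple identifiers, because they are concatenated into SQL text.
    static String checkIdentifier(String name) {
        if (!name.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Not a simple identifier: " + name);
        }
        return name;
    }

    // Turns CSV lines "old_name,new_name" into one RENAME COLUMN statement each.
    static List<String> renameStatements(String table, List<String> csvLines) {
        List<String> statements = new ArrayList<>();
        for (String line : csvLines) {
            if (line.isBlank()) {
                continue; // skip empty lines
            }
            String[] parts = line.split(",");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected old,new but got: " + line);
            }
            statements.add("ALTER TABLE " + checkIdentifier(table)
                    + " RENAME COLUMN " + checkIdentifier(parts[0].trim())
                    + " TO " + checkIdentifier(parts[1].trim()) + ";");
        }
        return statements;
    }

    public static void main(String[] args) {
        // Hypothetical input; in practice the lines would come from the CSV file.
        List<String> csv = List.of("cust_nm,customer_name", "ord_dt,order_date");
        renameStatements("orders", csv).forEach(System.out::println);
    }
}
```

The generated statements can then be run through whatever SQL client or JDBC connection you already use.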
New to Oracle here, but I have now read about the various bulk insert options for Oracle. In essence, true bulk loading is done using the Direct Path loading mechanism via SQL*Loader. There are also APPEND hint options that use serial or parallel Direct Path loading. But each of these has the following limitations:
SQL*Loader works off of a Control File, which contains the path of the data file. In my case, there is no file.
APPEND hint option for INSERT can only use the syntax - insert into select from. In my case, the source data is not in any table.
The source of my data is actually a Spark dataframe. I am looking for options to push this data in chunks to Oracle tables using the Direct Path loading option. For example, in Postgres, the PGConnection interface provides getCopyAPI.copyIn functionality, and you can create a huge serialized blob that can be sent over as one big chunk using the COPY tableName FROM STDIN yourBlob command. I am unable to find a similar Java API for Oracle that works on in-memory records and is able to push data directly (without any insert statements).
Any ideas on how to achieve this? Anyone done this before?
In general, how do folks using Oracle and Spark push data to Oracle from a dataframe in an optimized way?
Thanks in advance!
I need some guidance on how to retrieve data from a database.
The database is called Drug Combination DataBase. So far I'm just using a small text file that contains a small portion of the data, but the complete database is available as a 14 MB SQL file. Can I load this efficiently at run time in my Java application so I can look up a few entries? I've never used an SQL file to retrieve data in Java before, so I don't know what the best strategy is.
By the way, I'm creating an application that reads large graphs and another xml database so memory usage is fairly high.
The way to connect a Java program to a database is through JDBC. The file needs to be read in and loaded into a database such as MySQL or PostgreSQL before it can be queried. Check out this link for a good tutorial:
JDBC tutorial
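The approach described above could be sketched roughly as follows. This is only an illustration under several assumptions: the Connection comes from whatever JDBC driver and database you choose, the dump contains plain SQL statements that each end with a semicolon at the end of a line, and the table drug_combination with columns drug and partner_drug is hypothetical. The statement splitter is deliberately naive and will not cope with semicolons at the end of a line inside string literals or with vendor-specific dump directives.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

class SqlDumpLoader {

    // Naive splitter: a statement ends at a line whose last character is ';'.
    // Blank lines and lines starting with "--" are skipped.
    static List<String> readStatements(Reader source) {
        List<String> statements = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty() || trimmed.startsWith("--")) {
                    continue;
                }
                current.append(trimmed);
                if (trimmed.endsWith(";")) {
                    current.setLength(current.length() - 1); // drop the ';'
                    statements.add(current.toString());
                    current.setLength(0);
                } else {
                    current.append(' ');
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String rest = current.toString().trim();
        if (!rest.isEmpty()) {
            statements.add(rest); // last statement without a trailing ';'
        }
        return statements;
    }

    // Executes every statement of the dump once, e.g. at first start-up.
    static void load(Connection connection, Reader dump) throws SQLException {
        try (Statement statement = connection.createStatement()) {
            for (String sql : readStatements(dump)) {
                statement.execute(sql);
            }
        }
    }

    // Looks up a few entries; table and column names are hypothetical.
    static List<String> findPartners(Connection connection, String drug) throws SQLException {
        String sql = "SELECT partner_drug FROM drug_combination WHERE drug = ?";
        List<String> partners = new ArrayList<>();
        try (PreparedStatement query = connection.prepareStatement(sql)) {
            query.setString(1, drug);
            try (ResultSet rows = query.executeQuery()) {
                while (rows.next()) {
                    partners.add(rows.getString(1));
                }
            }
        }
        return partners;
    }

    public static void main(String[] args) {
        // Demonstrates only the splitter, so no database is needed to run this.
        String dump = "CREATE TABLE t (\n  id INT\n);\nINSERT INTO t VALUES (1);\n";
        readStatements(new StringReader(dump)).forEach(System.out::println);
    }
}
```

Once the data sits in a database, only the rows you query are pulled into the Java heap, which may help given the memory pressure mentioned in the question.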
I have a third party system that generates a large amount of data each day (those are CSV files that are stored on FTP). There are 3 types of files that are being generated:
every 15 minutes (2 files). These files are pretty small (~2 MB)
every day at 5 PM (~200-300 MB)
every midnight (this CSV file is about 1 GB)
Overall, the 4 CSVs total about 1.5 GB, but keep in mind that some of the files are generated every 15 minutes. This data also needs to be aggregated (not a hard process, but it will definitely take time). I need fast responses.
I am thinking about how to store this data and about the overall implementation.
We have a Java stack. The database is MS SQL Standard. From my measurements, MS SQL Standard, shared with other applications, won't handle such a load. What comes to mind:
An upgrade to MS SQL Enterprise on a separate server.
Using PostgreSQL on a separate server. Right now I'm working on a PoC for this approach.
What would you recommend here? There are probably better alternatives.
Edit #1
Those large files contain new data for each day.
Okay. After spending some time on this problem (reading, consulting, experimenting, and doing several PoCs), I came up with the following solution.
Tl;dr
Database: PostgreSQL as it is good for CSV, free and open source.
Tool: Apache Spark is a good fit for such type of tasks. Good performance.
DB
Choosing the database is an important decision: what to pick and how it will cope with this amount of data in the future. It should definitely be a separate server instance, so as not to put additional load on the main database instance or block other applications.
NoSQL
I thought about the usage of Cassandra here, but this solution would be too complex right now. Cassandra does not have ad-hoc queries. Cassandra data storage layer is basically a key-value storage system. It means that you must "model" your data around the queries you need, rather than around the structure of the data itself.
RDBMS
I didn't want to overengineer here, so I settled on an RDBMS.
MS SQL Server
It is an option, but the big downside is pricing. Given our hardware, the Enterprise edition costs a lot of money. For details on pricing, see this policy document.
Another drawback was CSV support, since CSV will be our main data source. MS SQL Server cannot reliably import or export CSV. Reported problems include:
MS SQL Server silently truncating a text field.
MS SQL Server's text encoding handling going wrong.
MS SQL Server throwing an error message because it doesn't understand quoting or escaping.
More on that comparison could be found in the article PostgreSQL vs. MS SQL Server.
PostgreSQL
This database is a mature, battle-tested product. I have heard a lot of positive feedback on it from others (of course, there are some tradeoffs too). It has a more classic SQL syntax and good CSV support, and it is open source.
It is worth mentioning that SSMS is far better than PGAdmin. SSMS has an autocomplete feature and supports multiple result sets (when you run several queries, you get all the results at once, whereas PGAdmin shows only the last one).
Anyway, right now I'm using DataGrip from JetBrains.
Processing Tool
I've looked at Spring Batch and Apache Spark. Spring Batch is a bit too low-level for this task, and Apache Spark makes it easier to scale if that is needed in the future. That said, Spring Batch could also do this work.
For an Apache Spark example, the code can be found in the learning-spark project.
My choice is Apache Spark for now.
You might consider looking into the Apache Spark project. After validating and curating the data maybe use Presto to run queries.
You could use uniVocity-parsers to process the CSV as fast as possible, as this library comes with the fastest CSV parser around. I'm the author of this library, and it is open source and free (Apache V2 License).
Now, for loading the data into a database, you could try the univocity framework (commercial). We use it to load massive amounts of data into databases such as SQL Server and PostgreSQL very quickly: from 25K to 200K rows/second, depending on the database and its config.
Here's a simple example of what the code to migrate from your CSV would look like:
public static void main(String ... args){
    //Configure CSV input directory
    CsvDataStoreConfiguration csv = new CsvDataStoreConfiguration("csv");
    csv.addEntitiesFromDirectory(new File("/path/to/csv/dir/"), "ISO-8859-1");
    //should grab column names from CSV files
    csv.getDefaultEntityConfiguration().setHeaderExtractionEnabled(true);
    javax.sql.DataSource dataSource = connectToDatabaseAndGetDataSource(); //specific to your environment
    //Configures the target database
    JdbcDataStoreConfiguration database = new JdbcDataStoreConfiguration("database", dataSource);
    //Use only for postgres - their JDBC driver requires us to convert the input Strings from the CSV to the correct column types.
    database.getDefaultEntityConfiguration().setParameterConversionEnabled(true);
    DataIntegrationEngine engine = Univocity.getEngine(new EngineConfiguration(csv, database));
    //Creates a mapping between data stores "csv" and "database"
    DataStoreMapping mapping = engine.map(csv, database);
    // if names of CSV files and their columns match database tables and their columns
    // we can detect the mappings from one to the other automatically
    mapping.autodetectMappings();
    //loads the database.
    engine.executeCycle();
}
To improve performance, the framework allows you to manage the database schema and perform operations such as dropping constraints and indexes, loading the data, and then recreating them. Data and schema transformations are also very well supported if you need them.
Hope this helps.
Pentaho Data Integration (or a similar ETL tool) can handle importing the data into a SQL database and can do aggregation on the fly. PDI has a community edition and can be run stand-alone or via a Java API.
I have developed a backend API in Java on Google App Engine. The API returns real-time stats by running queries on BigQuery.
I have also developed a front end in AngularJS that communicates with the backend API and allows users to log in and view aggregated stats.
I would like to let my users export data as CSV from BigQuery (directly through my front-end application).
I'm not sure what the best way to achieve that is. Your help is highly appreciated.
I could see you doing this one of two ways, depending on your needs:
Run an export job to Google Cloud Storage in CSV format, then download the exported CSV from GCS.
Read data from the tabledata.list API, converting the f/v row format into CSV on your server and creating a downloadable CSV file.
I'd probably recommend the first option. Export jobs are likely to scale better, since they are generally more performant for large tables than repeated calls to tabledata.list. It also avoids the need to write custom code to convert your data to CSV.
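For a sense of what the custom conversion code in the second option involves, here is a minimal sketch of the CSV-writing half. It assumes the rows returned by tabledata.list have already been flattened from their nested f/v structure into plain lists of values (that flattening step is not shown), and the class and method names are hypothetical. Quoting follows the usual RFC 4180 conventions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class CsvExport {

    // Quotes a field when it contains a comma, a double quote, or a line break;
    // embedded double quotes are doubled. Null becomes an empty field.
    static String escape(Object value) {
        if (value == null) {
            return "";
        }
        String text = value.toString();
        if (text.contains(",") || text.contains("\"") || text.contains("\n") || text.contains("\r")) {
            return "\"" + text.replace("\"", "\"\"") + "\"";
        }
        return text;
    }

    // Joins one row of values into a single CSV line (without the line break).
    static String toLine(List<?> row) {
        return row.stream().map(CsvExport::escape).collect(Collectors.joining(","));
    }

    // Builds the whole CSV document: header first, then one line per row.
    static String toCsv(List<String> header, List<List<Object>> rows) {
        StringBuilder csv = new StringBuilder();
        csv.append(toLine(header)).append("\r\n");
        for (List<Object> row : rows) {
            csv.append(toLine(row)).append("\r\n");
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        // Hypothetical data standing in for flattened query results.
        List<Object> first = Arrays.asList("page,view", 42);
        List<Object> second = Arrays.asList("click", null);
        System.out.print(toCsv(List.of("event", "count"), List.of(first, second)));
    }
}
```

For large tables, building the whole document in memory like this is exactly the part that does not scale well, which is one more reason to prefer the export job.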
Exporting to GCS is currently our scalable solution; you could contact the team to request a special quota. Direct CSV output from tabledata.list is about to be deprecated. An alternative is the bq CLI's head command with --format=csv, but that does not scale.