Using hadoop for data analytics - java

I have a question regarding the implementation of Hadoop in one of my projects. Basically, the requirement is that we receive a bunch of logs on a daily basis containing information about videos (when a video was played, when it stopped, which user played it, etc.).
What we have to do is analyze these files and return stats data in response to an HTTP request.
Example request: http://somesite/requestData?startDate=someDate&endDate=anotherDate. Basically this request asks for the count of all videos played within a date range.
My question is can we use hadoop to solve this?
I have read in various articles that Hadoop is not real time. So, to approach this scenario, should I use Hadoop in conjunction with MySQL?
What I have thought of doing is to write a Map/Reduce job that stores the count for each video for each day in MySQL. The Hadoop job can be scheduled to run once a day. The MySQL data can then be used to serve the requests in real time.
Is this approach correct? Is Hive useful here in any way? Please provide some guidance on this.

Yes, your approach is correct: you can create the per-day data with an MR job or Hive and store it in MySQL for serving in real time.
However, newer versions of Hive, when configured with Tez, can provide decent query performance. You could try storing your per-day data in Hive and serving it directly from there. If the query is a simple SELECT, it should be fast enough.
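For illustration, here is a minimal sketch of the kind of daily job described above, written against the plain Hadoop MapReduce API. The log layout (tab-separated date, video id and event type) and the class names are assumptions for the example; only the counting pattern matters.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DailyVideoCounts {

    // Emits (date \t videoId, 1) for every "played" event in the log.
    // The field positions below are assumptions about the log format.
    public static class PlayMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            // fields[0] = date, fields[1] = videoId, fields[2] = event type (assumed)
            if (fields.length >= 3 && "PLAY".equals(fields[2])) {
                outKey.set(fields[0] + "\t" + fields[1]);
                context.write(outKey, ONE);
            }
        }
    }

    // Sums the play counts per (date, videoId).
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "daily-video-counts");
        job.setJarByClass(DailyVideoCounts.class);
        job.setMapperClass(PlayMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // raw logs for the day
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // counts to load into MySQL
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The output (one count per date and video) can then be loaded into MySQL, for example with LOAD DATA INFILE or a simple JDBC batch insert.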

Deciding to use Hadoop is an investment, as you'll need clusters and development/operational effort.
For a Hadoop solution to make sense, your data must be big. Big, as in terabytes of data, coming in real fast, possibly without proper catalog information. If you can store/process your data in your current environment, run your analysis there.
Assuming your aim is not educational, I strongly recommend that you reconsider your choice of Hadoop. Unless you have really big data, it will only cost you more effort.
On the other hand, if you really need a distributed solution, I think your approach of daily runs is correct, except that there are better alternatives to writing a Map/Reduce job, such as Hive, Pig or Spark.

Related

What is the best way to process large CSV files?

I have a third party system that generates a large amount of data each day (those are CSV files that are stored on FTP). There are 3 types of files that are being generated:
every 15 minutes (2 files). These files are pretty small (~ 2 Mb)
everyday at 5 PM (~ 200 - 300 Mb)
every midnight (this CSV file is about 1 Gb)
Overall, the size of the 4 CSVs is about 1.5 GB. But we should take into account that some of the files are generated every 15 minutes. The data should also be aggregated (not a hard process, but it will definitely take time). I need fast responses.
I am thinking about how to store this data and about the overall implementation.
We have a Java stack. The database is MS SQL Standard. From my measurements, MS SQL Standard alongside the other applications won't handle such a load. What comes to my mind:
An upgrade to MS SQL Enterprise on a separate server.
Usage of PostgreSQL on a separate server. Right now I'm working on a PoC for this approach.
What would you recommend here? Probably there are better alternatives.
Edit #1
Those large files are new data for each day.
Okay. After spending some time with this problem (which included reading, consulting, experimenting and doing several PoCs), I came up with the following solution.
Tl;dr
Database: PostgreSQL, as it is good with CSV, free and open source.
Tool: Apache Spark is a good fit for this type of task, with good performance.
DB
The database is an important decision: what to pick and how it will work in the future with this amount of data. It should definitely be a separate server instance, in order not to generate additional load on the main database instance and not to block other applications.
NoSQL
I thought about using Cassandra here, but that solution would be too complex right now. Cassandra does not have ad-hoc queries. Cassandra's data storage layer is basically a key-value storage system, which means you must model your data around the queries you need, rather than around the structure of the data itself.
RDBMS
I didn't want to overengineer here, so I settled on an RDBMS.
MS SQL Server
It is a way to go, but the big downside is pricing: it is pretty expensive. The Enterprise edition costs a lot of money, taking our hardware into account. Regarding pricing, you could read this policy document.
Another drawback was the support for CSV files, which will be our main data source. MS SQL Server can neither import nor export CSV properly:
MS SQL Server silently truncating a text field.
MS SQL Server's text encoding handling going wrong.
MS SQL Server throwing an error message because it doesn't understand quoting or escaping.
More on that comparison could be found in the article PostgreSQL vs. MS SQL Server.
PostgreSQL
This database is a mature product and well battle-tested. I have heard a lot of positive feedback about it from others (of course, there are some trade-offs too). It has more classic SQL syntax and good CSV support; moreover, it is open source.
It is worth mentioning that SSMS is way better than pgAdmin: SSMS has autocomplete and shows multiple result sets (when you run several queries you get all the results at once, while in pgAdmin you only get the last one).
Anyway, right now I'm using DataGrip from JetBrains.
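Since good CSV support was one of the deciding factors, here is a minimal sketch of bulk-loading a CSV into PostgreSQL from Java using the JDBC driver's CopyManager. The connection details, table name, column list and file path are placeholders, not part of my actual setup.

import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

public class CsvToPostgres {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; requires the PostgreSQL JDBC driver on the classpath.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/analytics", "user", "password")) {
            // If the connection comes from a pool, you may need to unwrap it first.
            CopyManager copy = new CopyManager((BaseConnection) conn);
            try (FileReader reader = new FileReader("/data/ftp/daily_export.csv")) {
                // COPY ... FROM STDIN streams the file through the driver,
                // which is much faster than row-by-row INSERTs.
                long rows = copy.copyIn(
                        "COPY staging_daily (col_a, col_b, col_c) FROM STDIN WITH (FORMAT csv, HEADER true)",
                        reader);
                System.out.println("Loaded " + rows + " rows");
            }
        }
    }
}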
Processing Tool
I've looked at Spring Batch and Apache Spark. Spring Batch is a bit too low-level for this task, and Apache Spark scales more easily if that is needed in the future. That said, Spring Batch could also do the job.
Regarding Apache Spark example, the code could be found in learning-spark project.
My choice is Apache Spark for now.
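For reference, here is a minimal sketch of what such a Spark job can look like with the Java API: reading the CSVs, aggregating, and appending the result to PostgreSQL over JDBC. The paths, column names and credentials are placeholders; the real schema and aggregation logic are of course project-specific.

import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

public class CsvAggregationJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-aggregation")
                .master("local[*]") // or a cluster master URL
                .getOrCreate();

        // Column names and paths are placeholders for the real CSV layout.
        Dataset<Row> raw = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("/data/ftp/incoming/*.csv");

        Dataset<Row> aggregated = raw
                .groupBy(col("business_key"))
                .agg(sum(col("amount")).alias("total_amount"));

        Properties props = new Properties();
        props.put("user", "user");
        props.put("password", "password");

        // Append the aggregated results into PostgreSQL via JDBC
        // (requires the PostgreSQL JDBC driver on the classpath).
        aggregated.write()
                .mode("append")
                .jdbc("jdbc:postgresql://localhost:5432/analytics", "daily_aggregates", props);

        spark.stop();
    }
}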
You might consider looking into the Apache Spark project. After validating and curating the data maybe use Presto to run queries.
You could use uniVocity-parsers to process the CSV as fast as possible, as this library comes with the fastest CSV parser around. I'm the author of this library and it is open-source and free (Apache V2 license).
Now for loading the data into a database, you could try the univocity framework (commercial). We use it to load massive amounts of data into databases such as SQL server and PostgreSQL very quickly - from 25K to 200K rows/second, depending on the database and its config.
Here's a simple example of how the code to migrate your CSV data might look:
public static void main(String ... args) {
    // Configure the CSV input directory
    CsvDataStoreConfiguration csv = new CsvDataStoreConfiguration("csv");
    csv.addEntitiesFromDirectory(new File("/path/to/csv/dir/"), "ISO-8859-1");

    // Grab column names from the CSV files
    csv.getDefaultEntityConfiguration().setHeaderExtractionEnabled(true);

    javax.sql.DataSource dataSource = connectToDatabaseAndGetDataSource(); // specific to your environment

    // Configure the target database
    JdbcDataStoreConfiguration database = new JdbcDataStoreConfiguration("database", dataSource);

    // Use only for PostgreSQL - their JDBC driver requires us to convert
    // the input Strings from the CSV to the correct column types.
    database.getDefaultEntityConfiguration().setParameterConversionEnabled(true);

    DataIntegrationEngine engine = Univocity.getEngine(new EngineConfiguration(csv, database));

    // Create a mapping between the "csv" and "database" data stores
    DataStoreMapping mapping = engine.map(csv, database);

    // If the names of the CSV files and their columns match the database tables
    // and their columns, we can detect the mappings automatically
    mapping.autodetectMappings();

    // Load the database
    engine.executeCycle();
}
To improve performance, the framework lets you manage the database schema and perform operations such as dropping constraints and indexes, loading the data, and then recreating them. Data and schema transformations are also very well supported if you need them.
Hope this helps.
Pentaho Data Integration (or a similar ETL tool) can handle importing the data into a SQL database and can do aggregation on the fly. PDI has a community edition and can be run stand-alone or via a Java API.

Does MapReduce need to be used with HDFS?

I want to get better performance for data processing using Hadoop MapReduce. So, do I need to use it along with the Hadoop DFS? Or can MapReduce be used with other types of distributed data? Show me the way, please.
Hadoop is a framework which includes the MapReduce programming model for computation and HDFS for storage.
HDFS stands for Hadoop Distributed File System, which is inspired by the Google File System. The overall Hadoop project is based on the research papers published by Google:
research.google.com/archive/mapreduce-osdi04.pdf
http://research.google.com/archive/mapreduce.html
Using the MapReduce programming model, data is computed in parallel on different nodes across the cluster, which decreases the processing time.
You need to use HDFS or HBase to store your data in the cluster to get high performance. If you choose a normal file system, there will not be much difference. Once the data goes into the distributed system, it is automatically divided into blocks and replicated (3 times by default) to provide fault tolerance. None of this is possible with a normal file system.
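As an illustration of how the client code stays the same while HDFS handles blocks and replication under the hood, here is a small sketch using the Hadoop FileSystem API; the namenode address and path are placeholders.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationExample {
    public static void main(String[] args) throws Exception {
        // The namenode address is a placeholder for your cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path path = new Path("/data/example.txt");
        // Write a file with a replication factor of 3 (the HDFS default).
        try (FSDataOutputStream out = fs.create(path, (short) 3)) {
            out.writeUTF("hello hdfs");
        }

        // HDFS splits the file into blocks and replicates each block;
        // the client just sees a single path.
        System.out.println("Replication: " + fs.getFileStatus(path).getReplication());
    }
}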
Hope this helps!
First, your idea is wrong: the performance of Hadoop MapReduce is not directly related to the performance of HDFS. It is considered slow because of its architecture:
It processes data with Java. Each mapper and reducer is a separate JVM instance which needs to be started, and that takes some time.
It puts intermediate data on the HDDs many times. At a minimum, mappers write their results to disk (one), reducers read and merge them, writing the result set to disk (two), and the reducer results are written back to your filesystem, usually HDFS (three). You can find more details on the process here: http://0x0fff.com/hadoop-mapreduce-comprehensive-description/.
Second, Hadoop is an open framework and it supports many different filesystems. You can read data from FTP, S3, a local filesystem (an NFS share, for instance), MapR-FS, IBM GPFS, GlusterFS by Red Hat, etc. So you are free to choose the one you like. The main idea for MapReduce is to specify an InputFormat and an OutputFormat that can work with your filesystem.
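To make that concrete, here is a minimal sketch of a job driver whose input and output live outside HDFS, selected purely by the path scheme. The bucket and mount paths are placeholders, and no mapper or reducer is set, so Hadoop's identity Mapper and Reducer simply copy the records through; the point is only the filesystem wiring.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class NonHdfsJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "non-hdfs-example");
        job.setJarByClass(NonHdfsJob.class);

        // With no mapper/reducer set, the identity Mapper and Reducer are used,
        // so the keys/values are the (offset, line) pairs of TextInputFormat.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // The path scheme selects the filesystem implementation:
        // file:// for a local/NFS mount, s3a:// for S3, hdfs:// for HDFS, etc.
        // Bucket and directory names are placeholders; s3a:// needs the
        // hadoop-aws module and credentials configured.
        FileInputFormat.addInputPath(job, new Path("s3a://my-bucket/input/"));
        FileOutputFormat.setOutputPath(job, new Path("file:///mnt/nfs/output/"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}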
Spark is currently considered a faster replacement for Hadoop MapReduce, as it keeps much of the computation in memory. But whether to use it really depends on your case.

Small-scale in-memory graph Database in Java

I'm planning to write a Java application which relies on a small graph (around 3000 nodes) to represent its structure. The data should be loaded from a custom file at startup to create an in-memory graph database. I've looked into Neo4j but saw that you can't make it run directly in-memory. Googling around a bit, I found that Google JIMFS (Java in-memory file system) may suit my needs.
Does anyone have experience with getting Neo4j to work on a JIMFS FileSystem?
Are there better-suited alternatives which work in Java (possibly in-memory out of the box, like HSQLDB) for small-scale graphs and still provide a declarative query language like Cypher?
Note that performance is not so much of an issue to me, it's more of a playground to gather some experience with graph databases, but I don't want the application to create a Database file system on disk.
Note that performance is not so much of an issue to me,
In that case you can go for the ImpermanentGraphDatabase of Neo4j, which is created like this:
graphDb = new TestGraphDatabaseFactory().newImpermanentDatabase();
It doesn't create any files on filesystem.
Source:
http://neo4j.com/docs/stable/tutorials-java-unit-testing.html
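For completeness, here is a minimal sketch of how this looks in practice, assuming Neo4j 3.x with the test artifact (which provides TestGraphDatabaseFactory) on the classpath. The labels, properties and Cypher statements are just placeholders for your own data file.

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Result;
import org.neo4j.graphdb.Transaction;
import org.neo4j.test.TestGraphDatabaseFactory;

public class InMemoryGraphExample {
    public static void main(String[] args) {
        // Requires the Neo4j test artifact on the classpath; nothing is written to disk.
        GraphDatabaseService graphDb = new TestGraphDatabaseFactory().newImpermanentDatabase();

        try (Transaction tx = graphDb.beginTx()) {
            // Load your custom file here and create nodes/relationships from it.
            graphDb.execute("CREATE (a:Video {title: 'intro'})-[:NEXT]->(b:Video {title: 'part2'})");
            tx.success();
        }

        try (Transaction tx = graphDb.beginTx()) {
            Result result = graphDb.execute("MATCH (v:Video) RETURN v.title AS title");
            while (result.hasNext()) {
                System.out.println(result.next().get("title"));
            }
            tx.success();
        }

        graphDb.shutdown();
    }
}

Note that the test artifact is intended for unit tests, so depending on it from production code is a trade-off you should make consciously.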
I don't know why you wouldn't want the application to create a database file system on disk, but I can easily tell that there are many options. I used Neo4j and in most cases found its query methodology clear and its visualizer very useful, which, in my limited experience, makes it my number one choice. However, considering your requirements, you might find this interesting:
https://bitbucket.org/lambdazen/bitsy/wiki/Home

Does Mahout work in real time or does it pre-process the data based on the algorithm rules?

I am trying to build a recommendation engine, and for that I am thinking of using Apache Mahout, but I am unable to make out whether Mahout processes the data in real time or whether it pre-processes the data when the server is idle and stores the results somewhere in a database.
Also, does anyone have any idea what approach sites like Amazon and Netflix follow?
Either/or, but not both. There are parts inside, from an older project, that are essentially real time at moderate scale. There are also Hadoop-based implementations, which are all offline. The two are not related.
I am a primary creator of these parts, and if you want a system that does both together, I suggest you look at my current project Myrrix (http://myrrix.com)

Are there any samples for appengine Java report generation?

We are using AppEngine and the datastore for our application where we have a moderately large table of information containing a list with entries.
I would like to summarize the list of entries in a report specifying how many times each one appears; e.g., normally in SQL I would just use a select distinct on a column, then loop over every entry and use select count(x) where value = valueOfEntry.
While the count portion is easily done, the distinct problem is "a problem". The only solution I could find remotely close to this is MapReduce and most of the samples are based on Python. There is this blog entry which is very helpful but somewhat outdated since it predated the reduce portion. Then there is the video here and a few more resources I was able to find.
However, it's really hard for me to understand how to build the summary table if I can't write to a separate entity and I don't have a reduce stage.
This seems like something trivial and simple to accomplish, but it requires jumping through so many hoops. Is there no sample or existing reporting engine I can just plug into AppEngine without all the friction?
I saw BigQuery, but it seems like a huge hassle to move the data out of app engine and into that store. I tried downloading the data as CSV but ran into many issues with that as well. It doesn't seem like a practical solution in the long run either.
There is a document explaining some of the concepts of MapReduce for Java. Although it is incomplete, it shares most of the architecture with the Python version. In that document there is also a pointer to a complete Java sample MapReduce app that reads from the datastore.
For writing the results, you specify an Output class. To write the results to a new datastore entity you would need to create your own Output class, but you could also use the blobstore (see BlobFileOutput.java).
Another alternative is that whenever you write one of your entities, you also write/update another entry in an EntityDistinct data model.
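As a rough sketch of that alternative, using the App Engine low-level datastore API: every time you save one of your entities, you also bump a counter entity keyed by the distinct value. The kind name "EntityDistinct" and the property name "count" are only illustrative.

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Transaction;

public class DistinctCounter {

    // Increments the counter entity for the given value inside a transaction.
    // The kind "EntityDistinct" and property "count" follow the naming above;
    // adapt them to your data model.
    public static void incrementCount(String value) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Transaction txn = ds.beginTransaction();
        try {
            Key key = KeyFactory.createKey("EntityDistinct", value);
            Entity counter;
            try {
                counter = ds.get(txn, key);
            } catch (EntityNotFoundException e) {
                counter = new Entity("EntityDistinct", value);
                counter.setProperty("count", 0L);
            }
            long count = (Long) counter.getProperty("count");
            counter.setProperty("count", count + 1);
            ds.put(txn, counter);
            txn.commit();
        } finally {
            if (txn.isActive()) {
                txn.rollback();
            }
        }
    }
}

Keep in mind that a single counter entity updated very frequently can become a write-contention hotspot, in which case the usual sharded-counter pattern applies.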
If you plan on performing complex reports and you can anticipate all your needs now, I would suggest you look at BigQuery again. BigQuery is really powerful and works perfectly on very massive datasets. You can inspect http://code.google.com/p/log2bq/, which is a Python project that loads logs into BigQuery using MapReduce. Or you could also have a cron job that, every once in a while, fetches all new entities and moves them into BigQuery.
Regarding the friction, remember that this is a NoSQL database, and as such it has some advantages, but some things are inherently different from SQL. Remember you can always use Google Cloud SQL, given that your dataset is of limited size, but you would lose the replication and fault-tolerance capabilities.
I think this could help you: http://jjmpsj.blogspot.ro/2008/05/appengine-output-tricks-reporting.html?m=1
