We are using App Engine and the datastore for our application, where we have a moderately large table of information containing a list of entries.
I would like to summarize the list of entries in a report specifying how many times each one appears. In SQL I would normally use SELECT DISTINCT on the column, then loop over every distinct value and run SELECT COUNT(x) WHERE value = valueOfEntry (or, equivalently, a single GROUP BY with COUNT(*)).
While the count portion is easily done, the distinct portion is "a problem". The only solution I could find that comes remotely close is MapReduce, and most of the samples are based on Python. There is this blog entry, which is very helpful but somewhat outdated since it predates the reduce portion. Then there is the video here and a few more resources I was able to find.
However, it's really hard for me to understand how to build the summary table if I can't write to a separate entity and I don't have a reduce stage.
This seems like something trivial to accomplish, but it requires jumping through so many hoops. Is there no sample or existing reporting engine I can just plug into App Engine without all the friction?
I saw BigQuery, but it seems like a huge hassle to move the data out of App Engine and into that store. I tried downloading the data as CSV, but I ran into many issues with that as well. It doesn't seem like a practical solution in the long run either.
There is a document explaining some of the concepts of MapReduce for Java. Although it is incomplete, it shares most of the architecture with the Python version. That document also points to a complete Java sample MapReduce app that reads from the datastore.
For writing the results, you specify an Output class. To write the results to a new datastore entity you would need to create your own Output class, but you could also use the blobstore (see BlobFileOutput.java).
Another alternative is that whenever you write one of your entities, you also write/update a counter entry in an EntityDistinct data model.
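A minimal sketch of that counter-on-write approach, using the low-level datastore API; the EntityDistinct kind comes from above, while the increment helper and the "count" property name are illustrative, and under heavy write load you would want to shard these counters:

import com.google.appengine.api.datastore.*;

public class DistinctCounter {
    private static final DatastoreService datastore =
            DatastoreServiceFactory.getDatastoreService();

    /** Transactionally bump the counter entity keyed by the distinct value. */
    public static void increment(String value) {
        Key key = KeyFactory.createKey("EntityDistinct", value);
        Transaction txn = datastore.beginTransaction();
        try {
            Entity counter;
            try {
                counter = datastore.get(txn, key);
            } catch (EntityNotFoundException e) {
                counter = new Entity(key);          // first time we see this value
                counter.setProperty("count", 0L);
            }
            counter.setProperty("count", (Long) counter.getProperty("count") + 1);
            datastore.put(txn, counter);
            txn.commit();
        } finally {
            if (txn.isActive()) txn.rollback();
        }
    }
}

The summary report then becomes a plain query over the EntityDistinct kind, with no MapReduce run needed.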
If you plan on performing complex reports and you can anticipate all your needs now, I would suggest you take another look at BigQuery. BigQuery is really powerful and works perfectly on very massive datasets. You can inspect http://code.google.com/p/log2bq/, a Python project that loads logs into BigQuery using MapReduce. Alternatively, you could have a cron job that every once in a while fetches all new entities and moves them into BigQuery.
Regarding the friction, remember that this is a NoSQL database, and as such it has some advantages, but some things are inherently different from SQL. You can always use Google Cloud SQL, given that your dataset is of limited size, but you would lose the replication and fault-tolerance capabilities.
I think this could help you: http://jjmpsj.blogspot.ro/2008/05/appengine-output-tricks-reporting.html?m=1
I'm trying to create something similar to the HTML form below using the Play Framework with Java:
https://www.w3schools.com/howto/tryit.asp?filename=tryhow_js_filter_list
But instead of hard-coding the selectable values to be searched (like "Adele" in the example I gave above), I plan on querying a large dataset with thousands or tens of thousands of records and allowing users to search that dataset. Please answer my two questions below:
Is this possible to do in the Play Framework?
Would doing this be bad practice? Would it be better to have users enter a string and only query when they hit the 'Search' button?
This is 100% possible with Play! The bottleneck will be your dataset: is it a file, an SQL database, a NoSQL database, a search engine, or a web service query? This is where the problem usually lies.
Debatable. Implement it any way you like; if you experience bottlenecks, tweak it. You could implement a 200ms timeout before firing the ajax request. A very nice tool is select2, which allows you to configure this easily.
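As a rough illustration, a minimal Play (Java) controller that the ajax filter could call; the SearchController name, the in-memory list standing in for your real dataset query, and the route are all hypothetical:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import play.libs.Json;
import play.mvc.Controller;
import play.mvc.Result;

public class SearchController extends Controller {

    // Hypothetical in-memory list standing in for the real dataset query.
    private static final List<String> NAMES =
            Arrays.asList("Adele", "Agnes", "Billy", "Bob", "Calvin");

    /** Handles e.g. GET /search?q=..., returning matches as JSON for the ajax filter.
     *  Route (conf/routes): GET /search controllers.SearchController.search(q: String) */
    public Result search(String q) {
        String needle = q == null ? "" : q.toLowerCase();
        List<String> matches = NAMES.stream()
                .filter(name -> name.toLowerCase().contains(needle))
                .collect(Collectors.toList());
        return ok(Json.toJson(matches));
    }
}

With a debounce (or select2's built-in delay) on the client, each keystroke burst results in a single call to this action.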
I have a question regarding the implementation of Hadoop in one of my projects. Basically, the requirement is that we receive a bunch of logs on a daily basis containing information about videos (when a video was played, when it stopped, which user played it, etc.).
What we have to do is analyze these files and return stats data in response to an HTTP request.
Example request: http://somesite/requestData?startDate=someDate&endDate=anotherDate. Basically this request asks for count of all videos played between a date Range.
My question is: can we use Hadoop to solve this?
I have read in various articles that Hadoop is not real time. So to approach this scenario, should I use Hadoop in conjunction with MySQL?
What I have thought of doing is to write a MapReduce job and store the count for each video for each day in MySQL. The Hadoop job can be scheduled to run once a day. The MySQL data can then be used to serve the requests in real time.
Is this approach correct? Is Hive useful for this in any way? Please provide some guidance on this.
Yes, your approach is correct: you can create the per-day data with an MR job or Hive and store it in MySQL for serving in real time.
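As a rough sketch of such a daily job (the tab-separated log layout, field positions, and "PLAY" event tag are assumptions about your format, not something from the question):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Emits ((date, videoId), 1) for every PLAY event in the logs. */
public class PlayCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t");   // timestamp, videoId, event
        if (fields.length >= 3 && "PLAY".equals(fields[2])) {
            String date = fields[0].substring(0, 10);     // yyyy-MM-dd prefix
            ctx.write(new Text(date + "\t" + fields[1]), ONE);
        }
    }
}

/** Sums the counts; the output is then bulk-loaded into MySQL for serving. */
class PlayCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        ctx.write(key, new IntWritable(sum));
    }
}

Your HTTP endpoint then only needs a cheap MySQL range query over the per-day counts, which is easily real time.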
However, newer versions of Hive, when configured with Tez, can provide decent query performance. You could try storing your per-day data in Hive and serving it directly from there. If the query is a simple select, it should be fast enough.
Deciding to use Hadoop is an investment, as you'll need clusters and development/operational effort.
For a Hadoop solution to make sense, your data must be big. Big, as in terabytes of data, coming in real fast, possibly without proper catalog information. If you can store/process your data in your current environment, run your analysis there.
Assuming your aim is not educational, I strongly recommend you reconsider your choice of Hadoop. Unless you have really big data, it'll only cost you extra effort.
On the other hand, if you really need a distributed solution, I think your approach of daily runs is correct, except that there are better alternatives to writing a raw MapReduce job, such as Hive, Pig, or Spark.
I'm planning to write a Java application which relies on a small graph (around 3000 nodes) to represent its structure. The data should be loaded from a custom file at startup to create an in-memory graph database. I've looked into Neo4j but saw that you can't make it run purely in-memory. Googling around a bit, I found that Google JIMFS (Java in-memory file system) may suit my needs.
Does anyone have experience with getting Neo4j to work on a JIMFS FileSystem?
Are there better-suited alternatives which work in Java (possibly in-memory out of the box, like HSQLDB) for small-scale graphs and still provide a declarative query language like Cypher?
Note that performance is not so much of an issue to me, it's more of a playground to gather some experience with graph databases, but I don't want the application to create a Database file system on disk.
Note that performance is not so much of an issue to me,
In that case you can go for Neo4j's ImpermanentGraphDatabase, which is created like this:
graphDb = new TestGraphDatabaseFactory().newImpermanentDatabase();
It doesn't create any files on filesystem.
Source:
http://neo4j.com/docs/stable/tutorials-java-unit-testing.html
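A minimal self-contained sketch, assuming Neo4j 2.2+ embedded with the neo4j-kernel test artifact on the classpath; the label and Cypher statements are just illustrative:

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;
import org.neo4j.test.TestGraphDatabaseFactory;

public class InMemoryGraphExample {
    public static void main(String[] args) {
        // Purely in-memory; nothing is written to disk.
        GraphDatabaseService graphDb = new TestGraphDatabaseFactory().newImpermanentDatabase();
        try (Transaction tx = graphDb.beginTx()) {
            graphDb.execute("CREATE (n:Node {name: 'example'})");
            graphDb.execute("MATCH (n:Node) RETURN n.name")
                   .forEachRemaining(row -> System.out.println(row));
            tx.success();
        }
        graphDb.shutdown();
    }
}

You still get the full Cypher query language, so your startup code can load the custom file and create the 3000 nodes the same way it would against a disk-backed store.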
I don't know why you wouldn't want the application to create a database on disk, but I can tell you that there are many options. I used Neo4j, and in most cases I found its query methodology clear and its visualizer very useful, which, within my limited knowledge, makes it my number one choice. However, considering your requirements, you might find this interesting:
https://bitbucket.org/lambdazen/bitsy/wiki/Home
I am trying to build a recommendation engine, and for that I am thinking of using Apache Mahout, but I am unable to make out whether Mahout processes the data in real time or pre-processes the data when the server is idle and stores the results somewhere in a database.
Also, does anyone have any idea what approach sites like Amazon and Netflix follow?
Either/or, but not both. There are parts inside, from an older project, that are essentially real time for moderate scale. There are also Hadoop-based implementations, which are all offline. The two are not related.
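The essentially-real-time parts referred to are Mahout's Taste recommenders. A minimal user-based sketch, assuming a hypothetical ratings.csv of userID,itemID,preference rows; the neighborhood size and similarity choice are illustrative:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class TasteExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv rows look like: userID,itemID,preference
        DataModel model = new FileDataModel(new File("ratings.csv"));
        PearsonCorrelationSimilarity similarity = new PearsonCorrelationSimilarity(model);
        NearestNUserNeighborhood neighborhood =
                new NearestNUserNeighborhood(10, similarity, model);
        GenericUserBasedRecommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Answers come straight from the in-memory model: no offline pre-computation.
        List<RecommendedItem> recs = recommender.recommend(1L, 5);  // top 5 for user 1
        for (RecommendedItem rec : recs) System.out.println(rec);
    }
}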
I am a primary creator of these parts, and if you want a system that does both together, I suggest you look at my current project Myrrix (http://myrrix.com).
I have a simple task that I feel there has to be an app out there for (or one that would be easy to build, or an open-source version to extend).
I need to run a MySQL query repeatedly and look for changes in the results between runs (the data is coming in in real time).
I have built several of these queries, and throughout the day I find myself jumping between tabs in my MySQL client running them and trying to see what has changed. This becomes difficult, as there are hundreds of rows of data and you can't easily remember the previous values.
Ideally I could have a simple app (or web app) that stores the query and refreshes it over and over again. As new data comes into the table, it could compare against the old results and change the color to red or green (or something).
I would need sorting and simple filtering (possibly with string replacements into the query based on the inputs).
We run Ubuntu at work and I have tried doing this via terminal scripts (we use Ruby), but I feel a more visual output would give me better results.
Googling around, I see several paid apps, but there has to be something out there that does this.
I don't mind coding one up, but I don't like to re-invent the wheel if I don't have to.
Many thanks!
For simple things like this you are not reinventing the wheel so much as making your own sandwich -- some things don't make much sense to buy. Just build the simplest web page possible (e.g. a table with the table names you are interested in and maybe a timestamp for the last time each was checked). Have some JavaScript run your query and color the cells based on the change you are looking for, repeating this operation as needed. I could give you more specific info if you can tell me how the data changes... more entries into a table? Updates to existing data?
I often use JDBC servlets via Tomcat for this. Here's an excellent tutorial and a very simple example.
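Along those lines, a minimal sketch of such a servlet that a polling page could hit; the JDBC URL, credentials, table, and columns are placeholders, and the JSON escaping is naive:

import java.io.IOException;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/** Re-runs a fixed query and emits the rows as JSON, so a page can poll
 *  this endpoint and diff/colorize the results client-side. */
@WebServlet("/watch")
public class QueryWatchServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        resp.setContentType("application/json");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/mydb", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, status FROM orders")) {
            PrintWriter out = resp.getWriter();
            out.print("[");
            boolean first = true;
            while (rs.next()) {
                if (!first) out.print(",");
                // Naive JSON formatting, fine for a quick internal tool.
                out.printf("{\"id\":%d,\"status\":\"%s\"}",
                        rs.getLong("id"), rs.getString("status"));
                first = false;
            }
            out.print("]");
        } catch (SQLException e) {
            resp.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR, e.getMessage());
        }
    }
}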
I've done something similar in the past using Excel. Just build a connected spreadsheet, make your queries, and the results will be output to Excel; then you can format them the way you like. Very flexible, and if you need some kind of logic beyond the query itself, there are always Excel's built-in functions and VBA.
Here is a useful link to help you. It is very simple:
http://port25.technet.com/archive/2007/04/10/connecting-office-applications-to-mysql-and-postgresql-via-odbc.aspx