So far I have not been able to find a really concise working example for the GAE MapReduce library.
Most of what I've found are plain snippets, if not just ideas. The example that Google provides is not easy to understand, consisting of pipelined jobs etc.
I have data in the Datastore, each entity of which I need to check with MapReduce to see whether a given word exists in it; if it does, I want to set the boolean wordFound field on that same Datastore entity to true.
How can I easily do this with GAE MapReduce library?
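A rough map-only sketch of what this could look like, assuming the older experimental appengine-mapreduce library and its AppEngineMapper class (class and method names may differ in the version you use); the content property and the searched word are placeholders for illustration:

    // Sketch only: assumes the older appengine-mapreduce AppEngineMapper API.
    // "content" and the searched word are hypothetical.
    import org.apache.hadoop.io.NullWritable;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.Key;
    import com.google.appengine.tools.mapreduce.AppEngineMapper;
    import com.google.appengine.tools.mapreduce.DatastoreMutationPool;

    public class WordFoundMapper extends AppEngineMapper<Key, Entity, NullWritable, NullWritable> {

      private static final String WORD = "foo";  // the word to look for (placeholder)

      @Override
      public void map(Key key, Entity entity, Context context) {
        Object text = entity.getProperty("content");
        if (text != null && text.toString().contains(WORD)) {
          entity.setProperty("wordFound", true);
          // Batch the datastore writes instead of doing one put() per entity.
          DatastoreMutationPool pool = this.getAppEngineContext(context).getMutationPool();
          pool.put(entity);
        }
      }
    }

In that older library the mapper is typically wired to the datastore through a mapreduce.xml descriptor that names the datastore input format and the entity kind, and the job is launched from the bundled /mapreduce status page.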
My current search engine involves two desktop applications based on Lucene (Java). One is dedicated to indexing internal documents, the other to searching.
Now I have been asked to offer the search engine as a web page. So my first thought was to use Solr, and I read the manual (https://lucene.apache.org/solr/guide/7_4/overview-of-searching-in-solr.html), but then I realized that during the indexing phase we have special processing for PDFs. For example, we detect whether a PDF originates from a scanned document, or we limit the number of pages that will be OCRed in scanned PDFs, since only the first pages are valuable for search. For now everything works via calls to the Lucene API in classes with lots of ifs!
So my question is: should I use SolrJ to customize the indexing to our needs, should I keep the current indexing part and only use Solr(j) for searching, or should I override some Solr classes to meet our needs and avoid reinventing the wheel? For the latter (overriding Solr classes), how should I go about it?
Thank you very much in advance for your advice.
While this is rather opinion-based, I'll offer my opinion. All your suggested solutions would work, but the best one is to write the indexing code as a separate process, external to Solr (i.e. re-use your existing code that pushes data to a Lucene index directly today).
Take the tool you have today, and instead of writing data to a Lucene index, use SolrJ and submit the document to Solr instead. That will abstract away the Lucene part of the code you're using today, but will still allow you to process PDFs in your custom way. Keeping the code outside of Solr will also make it far easier to update Solr in the future, or switch to a newer version of the PDF library you're using for parsing without having to coordinate and integrate it into Solr.
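As a very rough illustration of that approach (a sketch only: the core URL and field names are placeholders, and the extracted text is assumed to come from your existing PDF/OCR pipeline), the SolrJ side can be as small as this:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrPusher {
      public static void main(String[] args) throws Exception {
        // Placeholder URL for a core/collection on your Solr server.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/documents").build();

        // Text produced by your existing PDF detection / OCR code, not by Solr.
        String extractedText = "...";

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-42");          // field names must match your schema
        doc.addField("title", "Some PDF");
        doc.addField("content", extractedText);

        solr.add(doc);     // send the document to Solr
        solr.commit();     // make it searchable (or rely on autoCommit)
        solr.close();
      }
    }

Everything upstream of building the SolrInputDocument stays exactly as it is today.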
It'll also allow you to run the indexing code completely separate from Solr, and if you decide to drop Solr for another HTTP interfaced technology in the future (for example Elasticsearch which is also based on Lucene), you can rip out the small-ish part that pushes content to Solr and push it to Elasticsearch instead.
Running multiple indexing processes in parallel is also easier when as much as possible of the indexing code lives outside of Solr, since Solr will only be concerned with the actual text - and doesn't have to spend time processing and parsing PDFs when it should just be responding to user queries (and your updates) instead.
I have a question regarding the implementation of Hadoop in one of my projects. Basically the requirement is that we receive a bunch of logs on a daily basis containing information about videos (when a video was played, when it was stopped, which user played it, etc.).
What we have to do is analyze these files and return stats data in response to an HTTP request.
Example request: http://somesite/requestData?startDate=someDate&endDate=anotherDate. Basically this request asks for the count of all videos played within a date range.
My question is: can we use Hadoop to solve this?
I have read in various articles that Hadoop is not real time. So to handle this scenario, should I use Hadoop in conjunction with MySQL?
What I have thought of doing is to write a Map/Reduce job and store the count for each video for each day in MySQL. The Hadoop job can be scheduled to run, say, once a day. The MySQL data can then be used to serve the requests in real time.
Is this approach correct? Is Hive useful here in any way? Please provide some guidance on this.
Yes, your approach is correct - you can create the per-day data with an MR job or Hive and store it in MySQL for serving in real time.
However, newer versions of Hive, when configured with Tez, can provide decent query performance. You could try storing your per-day data in Hive and serving it directly from there. If the query is a simple select, it should be fast enough.
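To make the daily job concrete, here is a minimal sketch of the counting step in plain MapReduce. The log format (tab-separated date, event, videoId, userId) is invented for illustration, and loading the output into MySQL (e.g. with Sqoop or a small import script) is left out:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DailyPlayCount {

      // Assumes tab-separated log lines like: 2013-04-01<TAB>PLAY<TAB>video123<TAB>user7
      public static class PlayMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split("\t");
          if (fields.length >= 3 && "PLAY".equals(fields[1])) {
            outKey.set(fields[0] + "\t" + fields[2]);   // key = date + videoId
            context.write(outKey, ONE);
          }
        }
      }

      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));   // this output is what gets loaded into MySQL
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "daily play count");
        job.setJarByClass(DailyPlayCount.class);
        job.setMapperClass(PlayMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The same aggregation is a one-line GROUP BY in Hive or Pig, which is why they are suggested below as alternatives to hand-written jobs.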
Deciding to use Hadoop is an investment, as you'll need clusters and development/operational effort.
For a Hadoop solution to make sense, your data must be big. Big, as in terabytes of data, coming in real fast, possibly without proper catalog information. If you can store/process your data in your current environment, run your analysis there.
Assuming your aim is not educational, I strongly recommend you to reconsider your choice of Hadoop. Unless you have real big data, it'll only cost you more effort.
On the other hand, if you really need a distributed solution, I think your approach of daily runs is correct, except that there are better alternatives to writing a Map/Reduce job by hand, such as Hive, Pig or Spark.
I was wondering whether it is possible to define a hierarchical MapReduce job.
In other words, I would like to have a MapReduce job that, in the mapper phase, calls a different MapReduce job. Is it possible? Do you have any recommendations on how to do it?
I want to do it in order to have an additional level of parallelism/distribution in my program.
Thanks,
Arik.
The Hadoop: The Definitive Guide book contains a lot of recipes related to MapReduce job chaining, including sample code and detailed explanations. Look especially at the chapter called something like 'Advanced API Usage'.
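The simplest form of chaining is just running one Job after another from a single driver, with the output path of the first job feeding the second. A minimal sketch using only the stock Hadoop API (the identity Mapper/Reducer classes stand in for whatever your real steps would be):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ChainedDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);   // handoff between the two jobs
        Path output = new Path(args[2]);

        // First job: holds the "inner" computation you wanted to trigger from a mapper.
        Job first = Job.getInstance(conf, "first pass");
        first.setJarByClass(ChainedDriver.class);
        first.setMapperClass(Mapper.class);      // identity; substitute your own mapper
        first.setReducerClass(Reducer.class);    // identity; substitute your own reducer
        first.setOutputKeyClass(LongWritable.class);
        first.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);
        if (!first.waitForCompletion(true)) {
          System.exit(1);                        // stop the chain if the first job fails
        }

        // Second job consumes the output of the first.
        Job second = Job.getInstance(conf, "second pass");
        second.setJarByClass(ChainedDriver.class);
        second.setMapperClass(Mapper.class);     // identity; substitute your own mapper
        second.setReducerClass(Reducer.class);   // identity; substitute your own reducer
        second.setOutputKeyClass(LongWritable.class);
        second.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, output);
        System.exit(second.waitForCompletion(true) ? 0 : 1);
      }
    }

For anything more complicated than a straight line of jobs, the Cascading and Oozie suggestions below are a better fit.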
I personally succeeded in replacing a complex MapReduce job with several HBase tables used as sources via a handmade TableInputFormat extension. The result was an input format that combines the source data with minimal reduction, so the job was transformed into a single mapper step. So I recommend you look in this direction too.
You should try Cascading. It allows you to define pretty complex jobs with multiple steps.
I guess you need the Oozie tool. Oozie helps in defining workflows using an XML file.
I am trying to build a recommendation engine, and for that I am thinking of using Apache Mahout, but I am unable to make out whether Mahout processes the data in real time, or whether it pre-processes the data when the server is idle and stores the results somewhere in a database.
Also, does anyone have any idea what approach sites like Amazon and Netflix follow?
Either/or, but not both. There are parts inside, from an older project, that are essentially real time at moderate scale. There are also Hadoop-based implementations, which are all offline. The two are not related.
I am a primary creator of these parts, and if you want a system that does both together, I suggest you look at my current project Myrrix (http://myrrix.com).
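For illustration, the real-time style of use presumably refers to the Taste recommender APIs bundled in Mahout, which compute recommendations on demand from an in-memory data model. A minimal sketch, assuming a hypothetical prefs.csv file of userID,itemID,preference lines:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RealtimeRecommenderSketch {
      public static void main(String[] args) throws Exception {
        // prefs.csv: lines of "userID,itemID,preference" (hypothetical file)
        DataModel model = new FileDataModel(new File("prefs.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Recommendations are computed on demand, i.e. at request time.
        List<RecommendedItem> items = recommender.recommend(1L, 5);
        for (RecommendedItem item : items) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }

The Hadoop-based code, by contrast, runs as batch jobs and writes its results out for you to serve from a store of your choice.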
We are using App Engine and the Datastore for our application, where we have a moderately large table of information containing a list of entries.
I would like to summarize the list of entries in a report specifying how many times each one appears, e.g. normally in SQL I would just use a SELECT DISTINCT on a column, then loop over every entry and just use SELECT COUNT(x) WHERE value = valueOfEntry.
While the count portion is easily done, the distinct part is "a problem". The only solution I could find remotely close to this is MapReduce, and most of the samples are based on Python. There is this blog entry which is very helpful but somewhat outdated, since it predated the reduce portion. Then there is the video here and a few more resources I was able to find.
However, it's really hard for me to understand how to build the summary table if I can't write to a separate entity and I don't have a reduce stage.
This seems like something trivial and simple to accomplish, but it requires so many hoops. Is there no sample or existing reporting engine I can just plug into App Engine without all the friction?
I saw BigQuery, but it seems like a huge hassle to move the data out of App Engine and into that store. I tried downloading the data as CSV but ran into many issues with that as well. It doesn't seem like a practical solution in the long run either.
There is a document explaining some of the concepts of MapReduce for Java. Although it is incomplete, it shares most of the architecture with the Python version. In that document, there's also a pointer to a complete Java sample MapReduce app that reads from the datastore.
For writing the results, you specify an Output class. To write the results to a new datastore entity you would need to create your own Output Class. But you could also use the blobstore (see BlobFileOutput.java).
Another alternative is that whenever you write one of your entities, you also write/update another entry in an EntityDistinct data model, roughly as sketched below.
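A rough sketch of that idea using the low-level Datastore API. The kind and property names (EntityDistinct, count) are made up, and under a heavy write rate you would want sharded counters rather than a single contended counter entity per value:

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.EntityNotFoundException;
    import com.google.appengine.api.datastore.Key;
    import com.google.appengine.api.datastore.KeyFactory;
    import com.google.appengine.api.datastore.Transaction;

    public class DistinctCounter {

      // Call this each time you write one of your entities, passing the value
      // whose occurrences you want to count in the report.
      public static void incrementCount(String value) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Key key = KeyFactory.createKey("EntityDistinct", value);  // one counter entity per distinct value
        Transaction txn = ds.beginTransaction();
        try {
          Entity counter;
          long count;
          try {
            counter = ds.get(txn, key);
            count = (Long) counter.getProperty("count");
          } catch (EntityNotFoundException e) {
            counter = new Entity(key);      // first occurrence of this value
            count = 0L;
          }
          counter.setProperty("count", count + 1);
          ds.put(txn, counter);
          txn.commit();
        } finally {
          if (txn.isActive()) {
            txn.rollback();
          }
        }
      }
    }

The report then becomes a simple query over the EntityDistinct kind, with no MapReduce pass at all.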
If you plan on performing complex reports and you can anticipate all your needs now, I would suggest you look again at BigQuery. BigQuery is really powerful and works perfectly on very massive datasets. You can inspect http://code.google.com/p/log2bq/, which is a Python project that loads logs into BigQuery using MapReduce. Or you could also have a cron job that every once in a while fetches all new entities and moves them into BigQuery.
Related to the friction, remember that this is a NoSQL database, and as such it has some advantages but some things are inherently different from SQL. Remember you can always use Google Cloud SQL, given that your dataset is of limited size, but you would lose the replication and fault-tolerance capabilities.
I think this could help you: http://jjmpsj.blogspot.ro/2008/05/appengine-output-tricks-reporting.html?m=1