I've got an existing Grails/MongoDB application that I am adding some automated tests to. I want those tests to be executed against a specific set of data in a Mongo collection. I want the tests to be able to mangle the data (with predictable results, if I'm lucky) and then be able to quickly drop and recreate/reload the database so that I can run the test again.
Since I'm going to base this seed test data on real data from our production system, I'd like to be able to perhaps load the data from a JSON/BSON format that I could retrieve from a query in the Mongo shell or something similar.
Basically I don't want to have to write a hundred lines of code like the following:
new Record(name: 'John Doe', age: '25', favoriteColor: 'blue').save()
Except with 30 properties each, all the while ensuring that constraints are met and that the data is realistic. That's why I want to use production data.
I also don't want to have to resort to spawning execs that run mongorestore to load and reload real data, since that would require additional software to be running on the tester's machine.
Is there a better way? Perhaps somehow unmarshalling raw JSON into something that I can then execute with the Grails MongoDB GORM or GMongo or a direct call to the Java MongoDB driver?
Do you need to store your test data in a transportable file, or will you always have access to a mongodb instance on which it could live? Say, for example, that you have a test mongodb server and that you can rely on having access to it whenever your tests are run.
In that case, the simplest solution is to keep the test data in a collection which you'd clone before each test run. Tests would then be free to mangle the cloned collection as much as they want without any actual data loss.
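For example, here is a minimal sketch of that clone-before-each-test idea, assuming the plain MongoDB Java driver and made-up database/collection names (the same calls are available through GMongo or GORM's low-level API):

import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.MongoClient;

public class TestDataCloner {

    // Drop the working collection and refill it from the pristine seed collection.
    public static void resetCollection(DB db, String seedName, String workingName) {
        DBCollection seed = db.getCollection(seedName);
        DBCollection working = db.getCollection(workingName);
        working.drop();
        DBCursor cursor = seed.find();
        while (cursor.hasNext()) {
            working.insert(cursor.next());
        }
        cursor.close();
    }

    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient("localhost", 27017);
        resetCollection(mongo.getDB("myapp-test"), "records_seed", "records");
        mongo.close();
    }
}

Tests then work against the "records" collection and can mangle it freely; the seed collection is never touched.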
If you need to have your test data live in a file (because, for example, you want to store it in your code repository), then you need to find a format that's easy to serialize to / deserialize from BSON. JSON seems like an obvious choice, especially since, as @drorb said above, mongodb already has tools to do that for you.
You'd then just need to write one script to dump the content of an existing collection in JSON files, and another to load a set of JSON files and store them in a collection - probably not more than a few lines each.
I'd suggest storing each object in a separate JSON file rather than having one large file with all the test data. As much as I like JSON, it doesn't lend itself well to streaming, and you'd have to hold the whole collection in memory before you could start dumping it into mongodb. If your test data is big enough, that could start causing memory problems.
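As a rough sketch of the loading side under those assumptions (legacy Java driver, one document per .json file, hypothetical directory and collection names):

import java.io.File;
import java.nio.file.Files;

import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;
import com.mongodb.util.JSON;

public class SeedDataLoader {

    // Parse every .json file in the directory and insert each one as a document.
    public static void load(DBCollection collection, File dir) throws Exception {
        for (File file : dir.listFiles()) {
            if (!file.getName().endsWith(".json")) {
                continue;
            }
            String json = new String(Files.readAllBytes(file.toPath()), "UTF-8");
            collection.insert((DBObject) JSON.parse(json));
        }
    }

    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient("localhost", 27017);
        load(mongo.getDB("myapp-test").getCollection("records"),
                new File("src/test/resources/seed-data"));
        mongo.close();
    }
}

The dump script is the mirror image: iterate the collection and write JSON.serialize(doc) for each document to its own file.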
You can use the com.mongodb.util.JSON class to convert JSON data directly to a DBObject.
Take a look at this example which demonstrates how to do it using the Java driver.
This MongoDB blog post shows how to do it using GORM and the Groovy driver.
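For instance, a minimal sketch of the com.mongodb.util.JSON approach with the plain Java driver (database and collection names are made up):

import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;
import com.mongodb.util.JSON;

public class JsonInsertExample {
    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient("localhost", 27017);
        DBCollection records = mongo.getDB("myapp-test").getCollection("records");

        // Any JSON produced by the mongo shell or a dump script can be parsed this way.
        String json = "{ \"name\": \"John Doe\", \"age\": 25, \"favoriteColor\": \"blue\" }";
        DBObject doc = (DBObject) JSON.parse(json);
        records.insert(doc);

        mongo.close();
    }
}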
I am new to Hadoop and have been given the task of migrating structured data to HDFS using Java code. I know the same can be accomplished with Sqoop, but that is not my task.
Can someone please explain a possible way to do this?
I did attempt it: I copied data from the PostgreSQL server using the JDBC driver and then stored it in CSV format in HDFS. Is this the right way to go about it?
I have read that Hadoop has its own datatypes for storing structured data. Can you please explain how that works?
Thank you.
The state of the art is to use Sqoop (pull ETL) as a regular batch process to fetch the data from the RDBMS. However, this approach is resource-consuming for the RDBMS (Sqoop often runs multiple threads with multiple JDBC connections), takes a long time (fetches against the RDBMS are often sequential), and can lead to inconsistencies (the live RDBMS keeps being updated while the long-running Sqoop job lags behind).
An alternative paradigm (push ETL) exists and is maturing. The idea is to build change data capture streams that listen to the RDBMS; an example project is Debezium. You can then build a realtime ETL pipeline that synchronizes the RDBMS with the data warehouse on Hadoop or elsewhere.
Sqoop is a simple tool which performs the following:
1) Connects to the RDBMS (PostgreSQL), reads the table's metadata, and generates a POJO (a Java class) for the table.
2) Uses that Java class to import and export data through a MapReduce program.
If you need to write plain Java code (where you control the parallelism yourself for performance), do the following:
1) Create a Java class which connects to the RDBMS using JDBC.
2) Create a Java class which accepts an input String (taken from the ResultSet) and writes it to a file on HDFS - see the sketch below.
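A rough, hedged sketch of those two steps combined into one small program (the JDBC URL, credentials, table, columns, and HDFS path are all made up for illustration):

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PsqlToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Read rows over JDBC and stream them into a CSV file on HDFS.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://dbhost:5432/mydb", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name, age FROM employees");
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                     fs.create(new Path("/data/employees.csv"))))) {
            while (rs.next()) {
                out.write(rs.getLong("id") + "," + rs.getString("name") + "," + rs.getInt("age"));
                out.newLine();
            }
        }
    }
}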
Another way of doing this:
Create a MapReduce job that uses DBInputFormat as the input format (which slices the table into the configured number of input splits) and TextOutputFormat pointed at an output directory on HDFS.
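Here is a hedged sketch of that map-only job using the new MapReduce API; the table, columns, paths, and split count are placeholders, not a drop-in implementation:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class EmployeeImport {

    // One row of the (hypothetical) employees table.
    public static class EmployeeRecord implements Writable, DBWritable {
        long id;
        String name;

        public void readFields(ResultSet rs) throws SQLException {
            id = rs.getLong("id");
            name = rs.getString("name");
        }
        public void write(PreparedStatement ps) throws SQLException {
            ps.setLong(1, id);
            ps.setString(2, name);
        }
        public void readFields(DataInput in) throws IOException {
            id = in.readLong();
            name = in.readUTF();
        }
        public void write(DataOutput out) throws IOException {
            out.writeLong(id);
            out.writeUTF(name);
        }
    }

    public static class EmployeeMapper
            extends Mapper<LongWritable, EmployeeRecord, NullWritable, Text> {
        protected void map(LongWritable key, EmployeeRecord value, Context context)
                throws IOException, InterruptedException {
            context.write(NullWritable.get(), new Text(value.id + "," + value.name));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.job.maps", 4);   // split-count hint used by DBInputFormat
        DBConfiguration.configureDB(conf, "org.postgresql.Driver",
                "jdbc:postgresql://dbhost:5432/mydb", "user", "password");

        Job job = Job.getInstance(conf, "employee-import");
        job.setJarByClass(EmployeeImport.class);
        job.setMapperClass(EmployeeMapper.class);
        job.setNumReduceTasks(0);               // map-only import
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        // Each split reads a slice of the table, ordered by id so the slices are stable.
        DBInputFormat.setInput(job, EmployeeRecord.class,
                "employees", null /* conditions */, "id" /* orderBy */, "id", "name");
        FileOutputFormat.setOutputPath(job, new Path("/data/employees"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}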
Please visit https://bigdatatamer.blogspot.com/ for any Hadoop- and Spark-related questions.
Thanks
Sainagaraju Vaduka
You are better off using Sqoop, because if you go down the path of building it yourself you may end up re-implementing exactly what Sqoop already does.
Either way, conceptually you will need a custom mapper with a custom input format that is able to read partitioned data from the source. In this case, a table column on which the data can be partitioned is required to exploit parallelism; a partitioned source table would be ideal.
DBInputFormat doesn't optimise the calls to the source database: the complete dataset is sliced into the configured number of splits by the InputFormat.
Each mapper executes the same query and loads only the portion of the data corresponding to its split. So every mapper issues the same query, along with a sort of the dataset, just to pick out its own slice of the data.
This class doesn't seem to take advantage of a partitioned source table. You can extend it to handle partitioned tables more efficiently.
Hadoop has structured file formats such as Avro, ORC, and Parquet to begin with.
If your data doesn't need to be stored in a columnar format (used primarily for OLAP use cases where only a few columns out of a large set need to be selected), go with Avro.
The way you are trying to do it is not a good one, because you will waste a lot of time developing and testing the code. Instead, use Sqoop to import the data from any RDBMS into Hive. The first tool that should come to mind is Sqoop (SQL-to-Hadoop).
I'm looking to make a web app that makes use of two datasets, given in CSV format, each 10 MB in size. I've chosen to build a Java dynamic web app with JSP that users can use to search and sort through the data provided in the CSVs.
From what I understand, the user/client sends a request to the server, and the server calls the Java classes in the backend, which hold the different sorting methods and the data from the CSVs that can be manipulated.
This data sitting in the backend is where I'm running into confusion. I know it's possible to load the data into a database on the server that I could then call upon.
If I use a class that reads the CSVs and loads the data into arrays, would this reading work be done every time someone accesses the website (causing latency), or would the data already be loaded into arrays on the server?
Depending on the scope you use, it can be loaded into the application context and therefore only once (say, in a singleton class loaded at application startup).
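For instance, here is a rough sketch of that one-time load with a ServletContextListener (the CSV path, attribute name, and naive parsing are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.annotation.WebListener;

// Parses the CSV once at startup and exposes the rows application-wide,
// so individual requests never have to re-read the file.
@WebListener
public class CsvDataLoader implements ServletContextListener {

    public static final String ATTRIBUTE = "csvRows";

    public void contextInitialized(ServletContextEvent event) {
        List<String[]> rows = new ArrayList<String[]>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                event.getServletContext().getResourceAsStream("/WEB-INF/data.csv")))) {
            String line;
            while ((line = in.readLine()) != null) {
                rows.add(line.split(","));   // naive split; a CSV library would handle quoting
            }
        } catch (Exception e) {
            throw new RuntimeException("Could not load CSV data", e);
        }
        event.getServletContext().setAttribute(ATTRIBUTE, Collections.unmodifiableList(rows));
    }

    public void contextDestroyed(ServletContextEvent event) {
        // nothing to clean up
    }
}

A servlet or JSP can then fetch the pre-loaded rows with getServletContext().getAttribute(CsvDataLoader.ATTRIBUTE).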
But I wouldn't recommend this approach; I would recommend a properly designed database into which you can load your CSV data. That way you have the database engine to help you organize your data, which gives you scalability and maintainability (although a proper design of your classes, say a DAO pattern, could give you much the same).
Organized data in a database also gives you more flexibility to search through your data using ready-made SQL functions.
In order to make my case here are some advantages of a Database system over a file system:
No redundant data – Redundancy removed by data normalization
Data Consistency and Integrity – data normalization takes care of it too
Secure – Each user has a different set of access rights
Privacy – Limited access
Easy access to data
Easy recovery
Flexible
Concurrency – The database engine allows concurrent reads of the data and even concurrent writes
I'm not listing the disadvantages since I'm making my case :)
You can read from the CSV file to build your arrays, then add the arrays to session scope. The CSV file will only be read by the servlet that processes it; subsequent usage will retrieve the arrays from the session.
Suppose I have an Employee class and another class, Company, that has more than one Employee. I want to save the Employee objects locally, meaning that every time I run my application I can retrieve these objects.
I would suggest using some kind of embedded solution. There are a number of options available such as H2 or Neo4j.
For a comprehensive list check out Wikipedia.
Although these solutions are technically databases, they don't run on their own server; they run inside your current Java process.
To get started with H2 the following steps are required:
Add the h2*.jar to the classpath (H2 does not have any dependencies)
Use the JDBC driver class: org.h2.Driver
The database URL jdbc:h2:~/test opens the database test in your user home directory
A new database is automatically created
After that, you can simply use JDBC against the new database; everything is stored on disk.
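Putting those steps together, a minimal sketch (the table layout is made up for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class EmployeeStore {
    public static void main(String[] args) throws Exception {
        Class.forName("org.h2.Driver");
        // Opens (and creates on first use) the "test" database in the user's home directory.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:~/test", "sa", "")) {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE TABLE IF NOT EXISTS employee(id IDENTITY, name VARCHAR(100))");
            }
            try (PreparedStatement insert =
                    conn.prepareStatement("INSERT INTO employee(name) VALUES (?)")) {
                insert.setString(1, "John Doe");
                insert.executeUpdate();
            }
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT id, name FROM employee")) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + " " + rs.getString("name"));
                }
            }
        }
    }
}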
I would not recommend using Java serialization of objects to a file (Serializable). This is not a maintainable solution, and whenever your code changes (for the classes that were serialized) you have to work out a plan for how to migrate your data.
Another solution is to dump your object graph to a file as JSON. There are multiple libraries that can help you with serialization and deserialization between Java and JSON. One example is the excellent Jackson library, which easily and without fuss converts objects to and from JSON.
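For example, a small sketch with Jackson's ObjectMapper, using stripped-down stand-ins for the Employee and Company classes from the question:

import java.io.File;
import java.util.Arrays;
import java.util.List;

import com.fasterxml.jackson.databind.ObjectMapper;

public class CompanyJsonStore {

    // Simplified versions of the classes in the question; the real ones would have more fields.
    public static class Employee {
        public String name;
        public Employee() {}
        public Employee(String name) { this.name = name; }
    }

    public static class Company {
        public String name;
        public List<Employee> employees;
        public Company() {}
        public Company(String name, List<Employee> employees) {
            this.name = name;
            this.employees = employees;
        }
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        File file = new File("company.json");

        Company company = new Company("Acme", Arrays.asList(new Employee("John Doe")));
        mapper.writeValue(file, company);                       // save after changes / on shutdown

        Company loaded = mapper.readValue(file, Company.class); // restore on the next start
        System.out.println(loaded.employees.get(0).name);
    }
}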
You can use a Microsoft Excel file to save the attributes of the object and retrieve them later; there is an API for Excel, Apache POI: http://poi.apache.org/download.html
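If you go that route, a minimal sketch with Apache POI might look like this (the sheet layout and file name are just examples):

import java.io.FileOutputStream;

import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class EmployeeExcelStore {
    public static void main(String[] args) throws Exception {
        // One row per Employee, one column per attribute.
        Workbook workbook = new XSSFWorkbook();
        Sheet sheet = workbook.createSheet("Employees");

        Row header = sheet.createRow(0);
        header.createCell(0).setCellValue("name");
        header.createCell(1).setCellValue("age");

        Row row = sheet.createRow(1);
        row.createCell(0).setCellValue("John Doe");
        row.createCell(1).setCellValue(25);

        try (FileOutputStream out = new FileOutputStream("employees.xlsx")) {
            workbook.write(out);
        }
    }
}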
We are using AppEngine and the datastore for our application where we have a moderately large table of information containing a list with entries.
I would like to summarize the list of entries in a report specifying how many times each one appears, e.g. normally in SQL I would just use a SELECT DISTINCT on the column, then loop over every entry and use SELECT COUNT(x) WHERE value = valueOfEntry.
While the count portion is easily done, the distinct problem is "a problem". The only solution I could find remotely close to this is MapReduce and most of the samples are based on Python. There is this blog entry which is very helpful but somewhat outdated since it predated the reduce portion. Then there is the video here and a few more resources I was able to find.
However, it's really hard for me to understand how to build the summary table if I can't write to a separate entity and I don't have a reduce stage.
This seems like something trivial and simple to accomplish, but it requires jumping through so many hoops. Is there no sample or existing reporting engine I can just plug into AppEngine without all the friction?
I saw BigQuery, but it seems like a huge hassle to move the data out of app engine and into that store. I tried downloading the data as CSV but ran into many issues with that as well. It doesn't seem like a practical solution in the long run either.
There is a document explaining some of the concepts of MapReduce for Java. Although it is incomplete, it shares most of its architecture with the Python version. That document also points to a complete sample Java MapReduce app that reads from the datastore.
For writing the results, you specify an Output class. To write the results to a new datastore entity you would need to create your own Output Class. But you could also use the blobstore (see BlobFileOutput.java).
Another alternative is that whenever you write one of your entities, you also write/update a corresponding entry in an EntityDistinct data model.
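A rough sketch of that write-time bookkeeping with the low-level datastore API (the EntityDistinct kind and property names are made up; note that a single counter entity limits write throughput, so a high write rate would call for sharded counters instead):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Transaction;

public class DistinctCounter {
    private static final DatastoreService datastore =
            DatastoreServiceFactory.getDatastoreService();

    // Call this alongside every write of your main entity to keep per-value counts current.
    public static void recordValue(String value) {
        Transaction txn = datastore.beginTransaction();
        try {
            Key key = KeyFactory.createKey("EntityDistinct", value);
            Entity counter;
            try {
                counter = datastore.get(txn, key);
            } catch (EntityNotFoundException e) {
                counter = new Entity(key);
                counter.setProperty("count", 0L);
            }
            long count = (Long) counter.getProperty("count");
            counter.setProperty("count", count + 1);
            datastore.put(txn, counter);
            txn.commit();
        } finally {
            if (txn.isActive()) {
                txn.rollback();
            }
        }
    }
}

The report then becomes a simple query over the EntityDistinct kind, with no MapReduce pass at all.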
If you plan on performing complex reports and you can anticipate all your needs now, I would suggest you look again at BigQuery. BigQuery is really powerful and works perfectly on very massive datasets. You can inspect http://code.google.com/p/log2bq/, a Python project that loads logs into BigQuery using mapreduce. Alternatively, you could have a cron job that every once in a while fetches all new entities and moves them into BigQuery.
Related to the friction, remember that this is a NoSQL database, and as such it has some advantages, but some things are inherently different from SQL. Remember that you can always use Google Cloud SQL, given that your dataset is of limited size, but you would lose the replication and fault-tolerance capabilities.
I think this could help you: http://jjmpsj.blogspot.ro/2008/05/appengine-output-tricks-reporting.html?m=1
If I have systems that are based on realtime data, how can I ensure that all the information that is current is redundantly stored in a file? So that when the program starts again, it uses this information to initialize itself back to where it was when it closed.
I know of XStream and HSQLDB, but wasn't sure if these were the best options for data that needs to be a literal carbon copy.
It really all depends on what type of app data you're storing. If you need to recreate Java objects exactly as they were (i.e. same variables and state), you can serialize the objects you need. There are many serialization mechanisms, for example XStream, as you mentioned. If you're storing objects directly, using one of those mechanisms would work.
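For example, a minimal sketch of snapshotting state with plain Java serialization (the AppState class and its fields are hypothetical):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SnapshotStore {

    // Hypothetical application state; everything it references must also be Serializable.
    public static class AppState implements Serializable {
        private static final long serialVersionUID = 1L;
        public String lastSymbol;
        public double lastPrice;
    }

    public static void save(AppState state, String path) throws Exception {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(state);
        }
    }

    public static AppState load(String path) throws Exception {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
            return (AppState) in.readObject();
        }
    }
}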
But, a lot of times, you want to store the state of your application, which doesn't necessarily correspond directly to serializing objects directly. If that's the case, you can write out only the relevant data you need. The type of storage you use depends on your needs. If you have a large amount of data, consider a database. A smaller amount might work better in a flat file.
One other thing is that storing data redundantly in a single file doesn't seem too useful. If the file gets corrupted, you'll lose both copies, so if redundancy is a concern, store it in different places (i.e. a primary and backup database).
There's no one right way to do it, but hopefully these ideas get you started.
Creating a literal copy (i.e. a snapshot) of a large body of in-memory data is expensive. Repeating the process each time you get an update to the in-memory data is probably prohibitively expensive. You need to re-think your application architecture.
One approach is to commit your realtime data to a database as it comes in, and then display the data from the database, for coherency.
A second approach is to commit to a database and maintain a parallel in-memory data structure which you display from. You also need to implement code to rebuild the in-memory data structure from the database on application restart. This is more code, and there is more opportunity for glitches where the user sees different stuff after a restart due to some bug.
A third approach is to work entirely from an in-memory data structure and deal with data persistence as follows (a rough sketch follows these steps):
periodically, you suspend processing updates and take a snapshot of the entire in-memory data structure using xstream, java serialization or whatever.
every update needs to be reliably logged (with a timestamp) to a file or files in a form that can be replayed.
when the application restarts, you reload from the last snapshot and then replay all updates that arrived since the snapshot.
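Here is a rough sketch of the journaling part of this approach; class names, file names, and the key/value update shape are all illustrative, and the snapshot itself could use Java serialization or xstream as mentioned above:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

// Every update is appended to a log file before it is applied in memory;
// on restart the log is replayed on top of the last snapshot.
public class JournaledStore {
    private final Map<String, String> data = new HashMap<String, String>();
    private final PrintWriter journal;

    public JournaledStore(String journalPath) throws Exception {
        // Append mode so a restart keeps the existing journal; auto-flush for durability.
        this.journal = new PrintWriter(new FileWriter(journalPath, true), true);
    }

    // Log the update first, then apply it in memory.
    public void update(String key, String value) {
        journal.println(System.currentTimeMillis() + "\t" + key + "\t" + value);
        data.put(key, value);
    }

    // Replay a journal written by update() to rebuild in-memory state after a restart.
    public void replay(String journalPath) throws Exception {
        try (BufferedReader in = new BufferedReader(new FileReader(journalPath))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");
                data.put(parts[1], parts[2]);
            }
        }
    }
}

After each successful snapshot you would truncate or rotate the journal so the replay stays short.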
The last approach has the problem that there is only one up-to-date stable copy of the data. If that is lost due to a hard disk or OS failure, then you are toast. In the other approaches, this issue can be addressed using a hot standby database, implemented using the RDBMS's off-the-shelf support for such things.