Neo4J POJO approach - java

I have a quick question about Neo4j: is it possible to migrate from MySQL to Neo4j? Based on what I have read, it seems that you can, but so far all the tutorials are aimed at web services. I was wondering if there is a plain-Java (POJO) way to do this. Currently I have over 300k records that are being exported to CSV, and I plan to load them into Neo4j using Spring. Can I just read them with JDBC and create new nodes in Neo4j? Thanks!

It's possible to migrate a MySQL database to Neo4j, but the right approach depends on how you want to do it and what results you expect.
You can use CSV export/import. It's simple to use, but has some limitations. For a one-time operation it should be good enough.
A second option is to write your own script or program that transforms the data from the RDBMS into a graph. This can be more powerful, since you can easily clean and transform the data along the way. You can also use Spring Data Neo4j to create persisted entities.
Another option is the GraphAware Neo4j Importer. It's a "framework" for importing data from an RDBMS into Neo4j with a lot of powerful features, but the learning curve is steep.
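As a rough illustration of the second option, here is a minimal plain-Java sketch that groups rows (for example, rows read through JDBC) into batches, so that each batch could be passed as a single parameter to one Cypher statement instead of creating nodes one query at a time. The class name, the `Record` label, and the batch size are assumptions for illustration. The call that actually sends a batch to Neo4j is left out because it depends on the driver or Spring setup you use.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RowBatcher {
    // Hypothetical statement: the Record label is an assumption, and the
    // $rows parameter syntax applies to Neo4j 3.x and later.
    static final String CREATE_NODES =
            "UNWIND $rows AS row CREATE (n:Record) SET n = row";

    /** Splits rows into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> batches(List<T> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rows.size());
            result.add(new ArrayList<>(rows.subList(start, end)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for rows read from MySQL; each map holds column name -> value.
        List<Map<String, Object>> rows = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            Map<String, Object> row = new HashMap<>();
            row.put("id", i);
            row.put("name", "row-" + i);
            rows.add(row);
        }
        for (List<Map<String, Object>> batch : batches(rows, 2)) {
            // In a real migration, the batch would be sent as the $rows parameter here.
            System.out.println(CREATE_NODES + " <- " + batch.size() + " rows");
        }
    }
}
```

With 300k records, a batch size in the low thousands is a common starting point, but the right value depends on your data and memory settings.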

Related

What is the best way to process large CSV files?

I have a third-party system that generates a large amount of data each day (CSV files stored on an FTP server). Three types of files are generated:
every 15 minutes (2 files); these are pretty small (~2 MB)
every day at 5 PM (~200-300 MB)
every midnight (this CSV file is about 1 GB)
Overall, the 4 CSVs total about 1.5 GB. But we should take into account that some of the files are generated every 15 minutes. The data also has to be aggregated (not a hard process, but it will definitely take time). I need fast responses.
I am thinking about how to store this data and about the overall implementation.
We have a Java stack. The database is MS SQL Standard. From my measurements, MS SQL Standard, which is shared with other applications, won't handle such a load. What comes to my mind:
An upgrade to MS SQL Enterprise on a separate server.
Using PostgreSQL on a separate server. Right now I'm working on a PoC for this approach.
What would you recommend here? There are probably better alternatives.
Edit #1
Those large files contain new data for each day.
Okay. After spending some time on this problem (reading, consulting, experimenting, and doing several PoCs), I came up with the following solution.
Tl;dr
Database: PostgreSQL as it is good for CSV, free and open source.
Tool: Apache Spark is a good fit for such type of tasks. Good performance.
DB
The database is an important decision: what to pick and how it will cope with this amount of data in the future. It should definitely be a separate server instance, so that it does not put additional load on the main database instance or block other applications.
NoSQL
I thought about using Cassandra here, but that solution would be too complex right now. Cassandra has very limited support for ad-hoc queries: its storage layer is essentially a partitioned key-value store, which means you must model your data around the queries you need rather than around the structure of the data itself.
RDBMS
I didn't want to overengineer here, so I settled on an RDBMS.
MS SQL Server
It is a way to go, but the big downside is pricing. The Enterprise edition costs a lot of money given our hardware. Regarding pricing, you could read this policy document.
Another drawback is CSV support, since CSV will be our main data source. MS SQL Server handles CSV import and export poorly. Typical problems are:
MS SQL Server silently truncating a text field.
MS SQL Server's text encoding handling going wrong.
MS SQL Server throwing an error message because it doesn't understand quoting or escaping.
More on that comparison can be found in the article PostgreSQL vs. MS SQL Server.
PostgreSQL
This database is a mature, battle-tested product. I have heard a lot of positive feedback on it from others (of course, there are some tradeoffs too). It has a more classic SQL syntax and good CSV support, and it is open source.
It is worth mentioning that SSMS is much better than pgAdmin. SSMS has autocomplete and shows multiple result sets (when you run several queries you get all the results at once, whereas pgAdmin shows only the last one).
Anyway, right now I'm using DataGrip from JetBrains.
Processing Tool
I've looked at Spring Batch and Apache Spark. Spring Batch is a bit too low-level for this task, and Apache Spark will be easier to scale if that is needed in the future. That said, Spring Batch could do this work too.
For an Apache Spark example, the code can be found in the learning-spark project.
My choice is Apache Spark for now.
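Independent of the tool chosen, the aggregation step itself can be streamed so that a 1 GB file never has to fit in memory. The following is a minimal plain-Java sketch, not Spark code: it assumes a header row, comma-separated fields without quoting, and hypothetical column positions, and it sums one numeric column per key.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

public class CsvAggregator {
    /**
     * Sums the numeric column valueCol per distinct value of keyCol,
     * reading the input one line at a time. The first line is treated
     * as a header and skipped; quoted fields are not supported.
     */
    static Map<String, Double> sumByKey(Reader source, int keyCol, int valueCol) {
        Map<String, Double> totals = new TreeMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line = reader.readLine(); // header, ignored
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // tolerate blank lines
                }
                String[] fields = line.split(",", -1);
                double value = Double.parseDouble(fields[valueCol]);
                totals.merge(fields[keyCol], value, Double::sum);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Inline sample data; a real run would pass a FileReader instead.
        String csv = "sensor,value\na,1.5\nb,2.0\na,0.5\n";
        System.out.println(sumByKey(new StringReader(csv), 0, 1)); // prints {a=2.0, b=2.0}
    }
}
```

Only the running totals are kept in memory, so memory use grows with the number of distinct keys rather than with the file size.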
You might consider looking into the Apache Spark project. After validating and curating the data maybe use Presto to run queries.
You could use uniVocity-parsers to process the CSV as fast as possible, as this library comes with the fastest CSV parser around. I'm the author of this library, and it is open source and free (Apache V2 License).
Now for loading the data into a database, you could try the univocity framework (commercial). We use it to load massive amounts of data into databases such as SQL server and PostgreSQL very quickly - from 25K to 200K rows/second, depending on the database and its config.
Here's a simple example of what the code to migrate from your CSV files would look like:
    public static void main(String... args) {
        // Configure the CSV input directory
        CsvDataStoreConfiguration csv = new CsvDataStoreConfiguration("csv");
        csv.addEntitiesFromDirectory(new File("/path/to/csv/dir/"), "ISO-8859-1");

        // Should grab column names from the CSV files
        csv.getDefaultEntityConfiguration().setHeaderExtractionEnabled(true);

        javax.sql.DataSource dataSource = connectToDatabaseAndGetDataSource(); // specific to your environment

        // Configure the target database
        JdbcDataStoreConfiguration database = new JdbcDataStoreConfiguration("database", dataSource);

        // Use only for Postgres: its JDBC driver requires us to convert the
        // input Strings from the CSV to the correct column types.
        database.getDefaultEntityConfiguration().setParameterConversionEnabled(true);

        DataIntegrationEngine engine = Univocity.getEngine(new EngineConfiguration(csv, database));

        // Create a mapping between data stores "csv" and "database"
        DataStoreMapping mapping = engine.map(csv, database);

        // If the names of the CSV files and their columns match the database tables
        // and their columns, we can detect the mappings from one to the other automatically
        mapping.autodetectMappings();

        // Load the database
        engine.executeCycle();
    }
To improve performance, the framework lets you manage the database schema and perform operations such as dropping constraints and indexes, loading the data, and then recreating them. Data and schema transformations are also very well supported if you need them.
Hope this helps.
Pentaho Data Integration (or a similar ETL tool) can handle importing the data into a SQL database and can do aggregation on the fly. PDI has a community edition and can be run stand-alone or via a Java API.

Small-scale in-memory graph Database in Java

I'm planning to write a Java application which relies on a small graph (around 3000 nodes) to represent its structure. The data should be loaded from a custom file at startup to create an in-memory graph database. I've looked into Neo4j but saw that you can't run it directly in memory. Googling around a bit, I found that Google JIMFS (a Java in-memory file system) may suit my needs.
Does anyone have experience with getting Neo4j to work on a JIMFS FileSystem?
Are there better-suited alternatives which work in Java (possibly in-memory out of the box, like HSQLDB) for small-scale graphs and still provide a declarative query language like Cypher?
Note that performance is not so much of an issue to me, it's more of a playground to gather some experience with graph databases, but I don't want the application to create a Database file system on disk.
"Note that performance is not so much of an issue to me"
In that case you can use Neo4j's ImpermanentGraphDatabase, which is created like this:
graphDb = new TestGraphDatabaseFactory().newImpermanentDatabase();
It doesn't create any files on the filesystem.
Source:
http://neo4j.com/docs/stable/tutorials-java-unit-testing.html
I don't know why you wouldn't want the application to create a database on disk, but there are many options. I have used Neo4j and in most cases found its query approach clear and its visualizer very useful, so, within my limited knowledge, it is my number one choice. However, considering your requirements, you might find this interesting:
https://bitbucket.org/lambdazen/bitsy/wiki/Home

implementing search using java technology(java web)

I want to implement search functionality in a web application that I am building with Java. I need to search the database based on the user's query and display the results. How can I go about doing this (please note that I am using Java)? Thanks.
You can use a product like http://lucene.apache.org/core/ or http://lucene.apache.org/solr/ for this instead of writing this on your own.
Lucene is a high-performance search engine for documents.
SOLR is built on top of Lucene and provides additional features (such as hit highlighting, faceted search, database integration, and rich document (Word, PDF, ...) search).
Lucene will analyze your text data and build up an index. When performing a search, you run a Lucene query against this index.
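To make the idea of "build an index, then query the index" concrete, here is a toy inverted index in plain Java. It is not Lucene's API and leaves out everything that makes Lucene useful in practice (analyzers, scoring, on-disk storage); the class and method names are made up for this sketch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TinyInvertedIndex {
    // Maps each lower-cased token to the ids of the documents containing it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Tokenizes the text on non-word characters and records docId for each token. */
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of documents that contain every term of the query (AND semantics). */
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String token : query.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            Set<Integer> docs = postings.getOrDefault(token, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs); // copy, so the index itself is never modified
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Neo4j is a graph database");
        index.add(2, "Lucene is a search library");
        index.add(3, "Solr is built on Lucene");
        System.out.println(index.search("lucene search")); // prints [2]
        System.out.println(index.search("Lucene"));        // prints [2, 3]
    }
}
```

The lookup touches only the postings of the query terms, which is why searching an index stays fast even when scanning every row of the database would not.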
Assuming you mean free text searching of the data in the database...
For free text searching, Lucene and/or SOLR are very good solutions. These work by creating a separate index of the data in your database. It is up to you either to pull the data from the database and index it using Lucene/SOLR, or to arrange for the code that writes to the database to also update the Lucene/SOLR index. Given what you have said, it sounds like this is being retrofitted to an existing database, so pulling the data and indexing it may be the best solution. In this case SOLR is probably a better fit, as it is a packaged solution.
Another option would be Hibernate Search. Again, this would be a solution to use if you are starting out. It would be more difficult to add after the fact.
Also bear in mind some databases support free text searching in addition to normal relational queries and could be worth a look. SQL Server certainly has text search capabilities and I would imagine other databases have some sort of support. I am not too sure how you access these but I would expect to be able to do it using SQL via JDBC. It is likely to be database specific though.
If you just mean normal SQL searching then there are a whole load of Java EE technologies, plain JDBC, Spring templates, ORM technologies (JPA, JDO, Hibernate etc). The list goes on and it would be difficult to suggest any particular approach without a lot more info.

easy object persistence strategy - hibernate?

I'm doing a Java software project at my university that is mainly about storing data sets (management of software tests).
The first thing I thought of was a simple SQL DB; however, the necessary DB schema is not available for now (let's say the project is stupid, but there's no choice).
Is a persistence framework like Hibernate able to store data internally (for example in XML) and convert this XML into decent SQL later?
My intention is to use the additional abstraction layer of a framework like Hibernate to save work, because it might have conversion functions. I know that Hibernate can generate class files from SQL, but I'm not sure whether it needs a DB at every point during development. Using an XML schema for now and converting it into SQL later may be an idea :)
You can persist XML with Hibernate into a relational DB, but you cannot use XML directly as a storage engine. Why not simply store your data in a relational DB from the start? You can create a schema yourself and adapt it to the actual one when you receive it.
I would recommend using a lightweight DB such as HSQLDB instead.

Integrating Pentaho/Talend/etc. with an OR Mapper

We have a Java application with its own OR mapper. Within this system we have something comparable to Hibernate's interceptors (we call them triggers): they perform specific actions just before data is saved to the database, after it is deleted, and so on. The underlying database is MySQL.
Now we would like to use tools such as Pentaho Data Integration or Talend to convert data and load it into our system. It's no problem to do that directly at the SQL level, but by doing so we lose the built-in power of our triggers.
Is there a way to somehow integrate any of the Data Integration solutions into our existing application? It would be great if there was a way to write into instances of our classes instead of writing into the database directly.
Any hints welcome :-)
I'd prefer Talend, which is a Java code generator tool. (You can see my blog post at http://www.robertomarchetto.com/www/talend_studio_vs_kettle_pentao_pdi_comparison)
You could use a tJavaRow so you can write Java code for each processed row. In tJavaRow you can call Hibernate code, for example using a custom class defined in a new routine.
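A minimal sketch of what such a custom class might look like, assuming your OR mapper's triggers can be expressed as callbacks: each converted row is turned into an entity and saved through a writer that runs the before-save triggers first, rather than being inserted with SQL. `TriggeredWriter`, `Customer`, and the trimming trigger are hypothetical names for illustration, not part of Talend or of any existing mapper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class TriggeredWriter<T> {
    // Callbacks that run, in registration order, before an entity is handed to the sink.
    private final List<Consumer<T>> beforeSave = new ArrayList<>();
    // Stand-in for the OR mapper's real save operation.
    private final Consumer<T> sink;

    TriggeredWriter(Consumer<T> sink) {
        this.sink = sink;
    }

    /** Registers a before-save trigger and returns this writer for chaining. */
    TriggeredWriter<T> onBeforeSave(Consumer<T> trigger) {
        beforeSave.add(trigger);
        return this;
    }

    /** Runs all before-save triggers on the entity, then passes it to the sink. */
    void save(T entity) {
        for (Consumer<T> trigger : beforeSave) {
            trigger.accept(entity);
        }
        sink.accept(entity);
    }

    /** Hypothetical entity used only for this example. */
    static final class Customer {
        String name;

        Customer(String name) {
            this.name = name;
        }
    }

    public static void main(String[] args) {
        List<String> saved = new ArrayList<>();
        TriggeredWriter<Customer> writer =
                new TriggeredWriter<Customer>(c -> saved.add(c.name))
                        .onBeforeSave(c -> c.name = c.name.trim());
        // In an ETL job, this call would be made once per processed row.
        writer.save(new Customer("  Alice "));
        System.out.println(saved); // prints [Alice]
    }
}
```

The point of the design is that the ETL tool only ever calls `save`, so every imported row goes through the same trigger logic as data created inside the application.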
Two ways with Pentaho Data Integration that I can think of straight off:
Simply create a plugin which adds/deletes data. You could copy the existing Salesforce insert/update plugins as a good start: rip out all the Salesforce code and replace it with yours.
Perhaps harder, but maybe more satisfying: write a JDBC driver which uses your code!
