I need to extract a subgraph (a subset of nodes and edges) based on user-defined conditions such as attribute values and labels.
This is already feasible using either a query language such as Cypher or Gremlin, or simply by coding it with Java methods.
However, since I'm dealing with large graphs, I want to keep the extracted subgraph around for further querying, and even iterate the extraction-and-querying process.
I've seen these discussions: Extract subgraph in neo4j, Extracting subgraph from neo4j database. However, I couldn't figure out the answer for my case.
I was thinking of some alternatives:
Building a new index each time I need to extract a subgraph
Using a cache to store the nodes/edges that might be useful for arithmetic computations such as averages, etc.
Creating a new instance of embedded Neo4j; however, this is really costly!
Another point: is getById cheaper than an index lookup? I know this depends on the case: large graphs, small indexes, etc.
You could just create a new embedded Neo4j Java database to hold your results and query it further. No need to boot up another server, IMHO.
Also, getById is generally cheaper than an index lookup, since you avoid the index round trip. Index lookups are great for more complex lookups like text matching, etc.
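A minimal sketch of that idea, assuming the Neo4j 2.x embedded Java API (the store path and property names below are made up, not anything from your setup):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

// separate embedded store that only holds the extracted subgraph
GraphDatabaseService subgraphDb = new GraphDatabaseFactory()
        .newEmbeddedDatabase("data/extracted-subgraph");

try (Transaction tx = subgraphDb.beginTx()) {
    // copy each node that matched the extraction condition
    Node copy = subgraphDb.createNode();
    copy.setProperty("sourceId", 12345L); // pointer back to the node in the full graph
    tx.success();
}

// later, a direct lookup by internal id avoids the index round trip
try (Transaction tx = subgraphDb.beginTx()) {
    Node n = subgraphDb.getNodeById(0);
    // ... further querying on the extracted subgraph ...
    tx.success();
}

subgraphDb.shutdown();

You can then repeat the extraction against the smaller store if you need to iterate the process.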
Background
I'm using a NoSQL database that supports graphs for the first time. It is a huge medical application handling thousands of patients. It is a greenfield project, and we as a team are struggling with our persistence layer. We don't know how relationships should be represented, or whether we should use triples to handle queries involving huge amounts of data. We are using the Java API.
Data structure
Imagine that there are 3 types of JSON documents in our MarkLogic database: Patient, Event, File Evidence.
There are thousands of patients in the application
One patient can have multiple events associated with them (admitted, discharged, transferred, prescribed medications, added note, changed internal status, etc.)
Each event can have multiple files attached to it as evidence
Assume there are hundreds of thousands of patients, events and files.
Question
Is it possible to query patients with events and files at once? Is using semantics (possible triples: 'patient has event', 'event has file') recommended in our case?
Our approach
We are trying to use triples to provide relationships between our documents, add them to one graph, and use a combination query to fetch IRIs first and then, in a second call, fetch the documents by IRI. We have tried the self-paced trainings and explored https://github.com/marklogic/marklogic-samplestack, but with no luck. Help from someone who has done this in the past and is willing to share their experience would be great.
In your situation, keep in mind that you can also store the triples in each of the documents themselves (with the inferred subject being the document itself). In your example, you could then combine cts:triple-range-query with a standard cts:search.
Example:
If I had events and embedded a triple such as [this event] -> ownedByPatient -> [iri/for/patients#12345]
Then I could query:
search for events filtered by fragments where the cts:triple-range-query states that the events are owned by patient 12345
This approach is a combination of semantics and MarkLogic search - using triples to link the appropriate types.
As for different types of documents, triples do not care what they are pointing at - an IRI of a person, event, etc. It's just about how you model your data and the ontology used to describe the relationships. So you can also approach this as managed triples (not embedded) and treat it all as a graph database pointing at your content (like the approach you are describing).
Once you get further along, you may also decide to enforce restrictions on the types of relationships using RDF rules.
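As a rough sketch of the embedded-triple idea, assuming the MarkLogic Java Client API 4.x and made-up IRIs, URIs and credentials (the triple index has to be enabled for embedded triples to be queryable):

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.document.JSONDocumentManager;
import com.marklogic.client.io.Format;
import com.marklogic.client.io.StringHandle;

DatabaseClient client = DatabaseClientFactory.newClient(
        "localhost", 8000, new DatabaseClientFactory.DigestAuthContext("user", "password"));

// an event document that embeds the triple "this event ownedByPatient patient 12345"
String event = "{"
    + "\"type\": \"admitted\","
    + "\"triple\": {"
    + "  \"subject\": \"http://example.org/events/987\","
    + "  \"predicate\": \"http://example.org/ontology/ownedByPatient\","
    + "  \"object\": \"http://example.org/patients/12345\""
    + "}}";

JSONDocumentManager docMgr = client.newJSONDocumentManager();
docMgr.write("/events/987.json", new StringHandle(event).withFormat(Format.JSON));

Once the triple lives inside the event document, a search for events can be constrained by that relationship without a separate IRI-lookup round trip.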
You've given us very little information to work with to answer such broad questions. Nevertheless, I'll do my best with what you gave.
One option is to organize the data however is most intuitive to you, and use server-side JavaScript (SJS) to combine the documents at query time into whatever you need for a particular query. That SJS could take the form of a resource extension or a search response transform. A resource extension has the advantage that it can run multiple queries across different document types and piece them together to form an answer. A search response transform, on the other hand, is given the results of only one query, but it can run additional queries as needed to bring in more data. Since you only have hundreds of thousands of records, you may not need to stress too much about raw speed.
If you plan to scale to millions of documents and want raw speed, you could keep everything you want to query about one patient in the patient record. That would allow you to find a patient by full-text search through all their records plus field-match on patient-specific data.
That assumes the only search results you ever want are patients. If you want something else, you'll need to let us know what other search results you might want.
When you say "attachment" I think of binary documents with scanned images, no metadata, and no full-text to search. Those would obviously be stored as separate binary documents. If they have metadata or full-text, you'll have to decide whether any of that should be in the big patient record for fast queries or in separate documents. All "attachment" documents that are separate JSON files could have a field that points to the patient by id.
I'd avoid triples at first. As David Ennis pointed out, you can combine triples and search, but it's a bit of a ninja move. One big JSON document per patient is much easier for most developers to understand.
If the result set is large, then having the entire result set in memory (in a server cache, e.g. Hazelcast) will not be feasible; with large result sets you cannot afford to keep them in memory. In that case, you have to fetch a chunk of data at a time (query-based paging). The downside of query-based paging is that there will be multiple calls to the database for multiple page requests.
Can anyone suggest how to implement a hybrid approach?
I haven't put any sample code here since I think the question is more about the logic than about specific code. Still, if you need sample code I can add it.
Thanks in advance.
The most effective solution is to use the primary key as a paging criterion. This lets us rely on first-class constructs like a BETWEEN range query, which is simple for the RDBMS to optimize, and the primary key of the queried entity will most likely be indexed already.
Retrieving data using a range query on the primary key is a two-step process: first retrieve the collection of primary keys, then generate the intervals that identify a proper subset of the data, followed by the actual queries against the data.
This approach is almost as fast as the brute-force version, while the memory consumption is about one tenth. By selecting the appropriate page size for this implementation, you can alter the ratio between execution time and memory consumption. This version is also stateless: it does not keep references to resources like the ScrollableResults version does, nor does it strain the database like the version using setFirstResult/setMaxResults.
Effective pagination using Hibernate
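A minimal sketch of this two-step approach in JPA/Hibernate, assuming an already-open EntityManager em and a hypothetical Item entity with a Long id:

import java.util.List;
import javax.persistence.TypedQuery;

// step 1: fetch only the primary keys, already in the desired order
List<Long> ids = em.createQuery("select i.id from Item i order by i.id", Long.class)
        .getResultList();

// step 2: derive page intervals from the keys and run a BETWEEN range query per page
int pageSize = 1000;
for (int start = 0; start < ids.size(); start += pageSize) {
    int end = Math.min(start + pageSize, ids.size()) - 1;
    TypedQuery<Item> query = em.createQuery(
            "select i from Item i where i.id between :low and :high order by i.id",
            Item.class);
    query.setParameter("low", ids.get(start));
    query.setParameter("high", ids.get(end));
    List<Item> page = query.getResultList();
    // process the page, then let it go out of scope so it can be garbage collected
}

Only the key list stays in memory for the whole run; each page of full entities is loaded and released independently.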
I'm doing college work where I have to search by keywords. My entity is called Position and I'm using MySQL. The fields that I need to search are:
- date
- positionCode
- title
- location
- status
- company
- tecnoArea
I need to search for the same word in all of these fields. To this end, I used the Criteria API to create a dynamic query. It is the same word for several fields, and it should return the maximum possible results. Do you have any advice on how to optimize the search on the database? Should I do several queries?
EDIT
I will use an OR constraint.
If you need to find the keyword at any position within the data, you will need to use LIKE with wildcards, e.g. title LIKE '%manager%'. Since date and positionCode (presumably a numeric type) are not likely to contain the keyword, I would omit searching those columns, for a very small performance gain. Your query is going to need a serial read, which means that all rows in the table will be brought into main memory to evaluate and retrieve the result set of your query. Given that a serial read is going to happen anyway, I do not think there is much you can do to optimize the query when searching multiple columns. I am not familiar with the Criteria API for creating dynamic queries, but in other systems dynamic queries are non-optimal: they must be parsed and evaluated every time they are run, and most query optimizers cannot make use of statistics for cost-based optimization to improve performance like they can with explicitly defined SQL.
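A hedged sketch of building that single OR constraint with the JPA Criteria API over only the textual columns (the method wrapper is just for illustration; it assumes the Position entity and field names listed in the question):

import java.util.ArrayList;
import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.criteria.CriteriaBuilder;
import javax.persistence.criteria.CriteriaQuery;
import javax.persistence.criteria.Predicate;
import javax.persistence.criteria.Root;

public List<Position> searchByKeyword(EntityManager em, String keyword) {
    CriteriaBuilder cb = em.getCriteriaBuilder();
    CriteriaQuery<Position> cq = cb.createQuery(Position.class);
    Root<Position> position = cq.from(Position.class);

    String pattern = "%" + keyword + "%";
    List<Predicate> predicates = new ArrayList<>();
    // only the textual columns; date and positionCode are unlikely to contain the keyword
    for (String field : new String[] {"title", "location", "status", "company", "tecnoArea"}) {
        predicates.add(cb.like(position.<String>get(field), pattern));
    }
    cq.select(position).where(cb.or(predicates.toArray(new Predicate[0])));
    return em.createQuery(cq).getResultList();
}

One query with an OR over the text columns keeps it to a single serial read instead of several separate queries.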
Not sure what your database is.
If it is Oracle, you can use Oracle text.
The below link might be useful :
http://swiss-army-development.blogspot.com/2012/02/keyword-search-via-oracle-text.html
I've been looking around a lot to see how to use MongoDB in combination with Solr, and some questions here have partial responses, but nothing really concrete (more like theories). In my application, I will have lots and lots of documents stored in MongoDB (maybe up to a few hundred million), and I want to implement full-text searches on some properties of those documents, so I guess Solr is the best way to do this.
What I want to know is how I should configure/execute everything so that it performs well. Right now, here's what I do (and I know it's not optimal):
1- When inserting an object in MongoDB, I then add it to Solr
SolrServer server = getServer();
SolrInputDocument document = new SolrInputDocument();
document.addField("id", documentId);
...
server.add(document);
server.commit();
2- When updating a property of the object, since Solr cannot update just one field, I first retrieve the object from MongoDB, then update the Solr index with all of the object's properties plus the new ones, and do something like:
StreamingUpdateSolrServer update = new StreamingUpdateSolrServer(url, 1, 0);
SolrInputDocument document = new SolrInputDocument();
document.addField("id", documentId);
...
update.add(document);
update.commit();
3- When querying, I first query Solr, and then, when retrieving the list of documents (SolrDocumentList), I go through each document and:
get the id of the document
get the object from MongoDB having the same id to be able to retrieve the properties from there
4- When deleting - well, I haven't done that part yet, and I'm not really sure how to do it in Java
So does anybody have suggestions on how to do this more efficiently for each of the scenarios described here? For example, how can I do it in a way that won't take an hour to rebuild the index when there are a lot of documents in Solr and documents are added one at a time? My requirement here is that users may want to add one document at a time, many times, and I'd like them to be able to retrieve it right after.
Your approach is actually good. Some popular frameworks like Compass perform what you describe at a lower level in order to automatically mirror, in the index, changes that have been performed via the ORM framework (see http://www.compass-project.org/overview.html).
In addition to what you describe, I would also regularly re-index all the data that lives in MongoDB in order to ensure Solr and Mongo stay in sync. This probably does not take as long as you might think, depending on the number of documents, the number of fields, the number of tokens per field, and the performance of the analyzers: I often build an index of 5 to 8 million documents (around 20 fields, but the text fields are short) in less than 15 minutes with complex analyzers; just ensure your RAM buffer is not too small and do not commit/optimize until all documents have been added.
Regarding performance, a commit is costly and an optimize is very costly. Depending on what matters the most to you, you could change the value of mergeFactor in solrconfig.xml (high values improve write performance whereas low values improve read performance; 10 is a good value to start with).
You seem to be afraid of the index build time. However, since Lucene index storage is segment-based, the write throughput should not depend too much on the size of the index (http://lucene.apache.org/java/2_3_2/fileformats.html). However, the warm-up time will increase, so you should ensure that:
there are typical (especially for sorts, in order to load the field caches) but not too complex queries in the firstSearcher and newSearcher parameters in your solrconfig.xml config file,
useColdSearcher is set to false in order to have good search performance, or to true if you want changes to the index to be taken into account faster, at the price of a slower search.
Moreover, if it is acceptable for the data to become searchable only X milliseconds after it has been written to MongoDB, you could use the commitWithin feature of the UpdateHandler. This way Solr will have to commit less often.
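For example, newer SolrJ versions provide an add overload that takes a commitWithin time in milliseconds, so you can drop the explicit commit per document (the 10-second value below is just an illustration):

// ask Solr to make the document searchable within at most 10 seconds
server.add(document, 10000);
// no explicit server.commit() for every single document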
For more information about Solr performance factors, see
http://wiki.apache.org/solr/SolrPerformanceFactors
To delete documents, you can either delete by document ID (as defined in schema.xml) or by query:
http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/SolrServer.html
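For instance, reusing the server instance and document id from your snippets, the delete step could look like this (a sketch, not tested against your setup):

// remove the Solr document whose unique key matches the MongoDB id
server.deleteById(documentId);
// or remove every document matching a query instead:
// server.deleteByQuery("id:" + documentId);
server.commit();
// then delete the corresponding document from MongoDB as usual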
You can also wait for more documents and index them only every X minutes. (Of course, this highly depends on your application and requirements.)
If your documents are small and you don't need all the data (which is stored in MongoDB), you can put only the fields you need in the Solr document by storing them but not indexing them:
<field name="nameofyourfield" type="stringOrAnyTypeYouUse" indexed="false" stored="true"/>
I work on an application that is deployed on the web. Part of the app is search functionality where the results are presented in a sorted list. The application targets users in several countries using different locales (= sorting rules). I need to find a solution for sorting correctly for all users.
I currently sort with ORDER BY in my SQL query, so the sorting is done according to the locale (LC_COLLATE) set for the database. These rules are incorrect for users with a locale different from the one set for the database.
Also, to further complicate the issue, I use pagination in the application, so when I query the database I ask for rows 1 - 15, 16 - 30, etc. depending on the page I need. However, since the sorting is wrong, each page contains entries that are incorrectly sorted. In a worst case scenario, the entire result set for a given page could be out of order, depending on the locale/sorting rules of the current user.
If I were to sort in (server side) code, I need to retrieve all rows from the database and then sort. This results in a tremendous performance hit given the amount of data. Thus I would like to avoid this.
Does anyone have a strategy (or even technical solution) for attacking this problem that will result in correctly sorted lists without having to take the performance hit of loading all data?
Tech details: The database is PostgreSQL 8.3, the application an EJB3 app using EJB QL for data query, running on JBoss 4.5.
Are you willing to develop a small Postgres custom function module in C? (Probably only a few days for an experienced C coder.)
strxfrm() is the C function that transforms a language-dependent text string, based on the current LC_COLLATE setting (more or less the current language), into a string that yields the proper collation order for that language when sorted as a binary byte sequence (e.g. with strcmp()).
If you implement this for Postgres, say it takes a string and a collation order, then you will be able to order by strxfrm(textfield, collation_order). I think you can then even create multiple functional indexes on your text column (say one per language) using that function to store the results of the strxfrm() so that the optimizer will use the index.
Alternatively, you could join the Postgres developers in implementing this in mainstream Postgres. Here are the wiki pages about these issues: Collation, ICU (which is also used by Java, as far as I know).
Alternatively, as a less sophisticated solution if data input is only through Java, you could compute these strxfrm() values in Java (Java will probably have a different name for this concept) when you add the data to the database, and then let Postgres index and order by these precomputed values.
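The Java counterpart of strxfrm() is java.text.Collator and its CollationKey. A hedged sketch of precomputing a locale-specific sort key at write time (the locale and the idea of an extra indexed bytea column are just examples, not part of your schema):

import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

// build a collator for the user's locale, e.g. Swedish
Collator collator = Collator.getInstance(new Locale("sv", "SE"));
collator.setStrength(Collator.TERTIARY);

// compute a binary sort key that collates correctly when compared byte-wise
CollationKey key = collator.getCollationKey("Örjan");
byte[] sortKey = key.toByteArray();

// store sortKey in an extra (indexed) bytea column next to the text column,
// then ORDER BY that column in SQL instead of the raw text

You would need one such column (or one table of keys) per supported locale, since the key is only meaningful for the collation it was computed with.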
How tied are you to PostgreSQL? The documentation isn't promising:
The nature of some locale categories is that their value has to be fixed for the lifetime of a database cluster. That is, once initdb has run, you cannot change them anymore. LC_COLLATE and LC_CTYPE are those categories. They affect the sort order of indexes, so they must be kept fixed, or indexes on text columns will become corrupt. PostgreSQL enforces this by recording the values of LC_COLLATE and LC_CTYPE that are seen by initdb. The server automatically adopts those two values when it is started.
(Collation rules define how text is sorted.)
Google turns up a patch under discussion:
PostgreSQL currently only supports one collation at a time, as fixed by the LC_COLLATE variable at the time the database cluster is initialised.
I'm not sure I'd want to manage this outside the database, though I'd be interested in reading about how it can be done. (Anyone wanting a good technical overview of the issues should check out Sorting Your Linguistic Data inside the Oracle Database on the Oracle globalization site.)
I don't know of any way to switch the collation used by the database's ORDER BY on a per-query basis. Therefore, one has to consider other solutions.
If the number of results is really big (hundreds of thousands?), I have no solution, except showing only the number of results and asking the user to make a more precise request. Otherwise, doing it server-side could work, depending on the precise conditions...
In particular, using a cache could improve things tremendously. The first (unlimited) request to the database would not be much slower than a query limited in its number of results, and the subsequent requests would be much faster. Paging and re-sorting often lead to several requests, so the cache would work well (even with a duration of a few minutes).
I use EhCache as a technical solution.
Sorting and paging go together: sort first, then page.
The raw results could be kept in the cache.
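A rough sketch with the EhCache 2.x API: the cache name, key scheme and the runQueryAndSortForLocale helper are made up, and the cache itself is assumed to be configured in ehcache.xml.

import java.util.List;
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

CacheManager cacheManager = CacheManager.getInstance();
Cache cache = cacheManager.getCache("searchResults");   // defined in ehcache.xml

String queryString = "name like 'A%'";                   // placeholder search criteria
String userLocale = "sv_SE";                             // placeholder user locale
int firstRow = 0, pageSize = 15;

String cacheKey = queryString + "|" + userLocale;        // one entry per query + locale
Element cached = cache.get(cacheKey);
List<Object[]> sortedRows;
if (cached != null) {
    // unchecked cast: this cache only ever holds sorted row lists
    sortedRows = (List<Object[]>) cached.getObjectValue();
} else {
    // run the unlimited query once and sort it with the user's collation rules (your code)
    sortedRows = runQueryAndSortForLocale(queryString, userLocale);
    cache.put(new Element(cacheKey, sortedRows));
}

// paging is then just a slice of the cached, already-sorted result
List<Object[]> page = sortedRows.subList(firstRow, Math.min(firstRow + pageSize, sortedRows.size()));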
To reduce the performance hit, some hints:
you can run the query once to get the result set size, and warn the user if there are too many results (either ask them to confirm a slow query, or to add some selection fields)
only request the columns you need and drop all the other columns (usually some data is not shown immediately for all results but is displayed on mouse-over, for example; this data can be requested lazily, only as needed, thereby reducing the columns requested for all results)
if you have computed values, cache whichever is smaller: the database columns or the computed values
if you have repeated values across multiple results, you can request that data/those columns separately (so you retrieve them from the database once and cache them only once) and retrieve only a key (typically an id) in the main request.
You might want to check out this package: http://www.fi.muni.cz/~adelton/l10n/postgresql-nls-string/. It hasn't been updated in a long time and may not work anymore, but it seems like a reasonable starting point if you want to build a function that can do this for you.
This module is broken for Postgres 8.4.3. I fixed it - you can download the fixed version from http://www.itreport.eu/__cw_files/.01/.17/.ee7844ba6716aa36b19abbd582a31701/nls_string.c and you'll have to compile and install it by hand (as described in the related README and INSTALL from the original module), but sorting still works incorrectly anyway. I tried it on FreeBSD 8.0 with LC_COLLATE set to cs_CZ.UTF-8.