I have the following graph structure:
Vertices -- Company, Campaign, Project, Lead
{"Name":["CompanyV"],"sid":["SidFromSQL"]}
{"name":["Campaign3V"],"status":["paused"]}
{"name":["Campaign1"],"startDate":["Jan 1, 2019 5:30:00 AM"]}
{"name":["Campaign2V"],"status":["active"]}
{"name":["Lead11V"]}
{"name":["Lead2V"]}
{"name":["Project1V"],"Name":[""],"sid":["SidFromSQL"]}
{"name":["Project2V"],"Name":[""],"sid":["SidFromSQL"]}
{"name":["Lead3V"]}
{"name":["Campaign1V"],"status":["active"]}
Edges:
{"inVertex":{"id":"58b6e79f-6809-6fc4-9f0a-c8a26337a729","label":"Campaign"},"outVertex":{"id":"1cb6e79d-1ca7-4d3c-e71d-c05c13abac15","label":"Project"},"id":"a6b6e79f-6809-87bf-a535-eec9101e683c","label":"hasCampaign"}
{"inVertex":{"id":"c4b6e7ae-64d3-b8b9-ce7b-c319e7ed70ca","label":"Lead"},"outVertex":{"id":"1cb6e79d-1ca7-4d3c-e71d-c05c13abac15","label":"Project"},"id":"6cb6e7ae-64d4-6fb0-9314-411eb72d9a28","label":"hasLead"}
{"inVertex":{"id":"a2b6e79d-1ca7-db4e-f19f-2ef1df514ade","label":"Project"},"outVertex":{"id":"64b6e79c-d58b-37ad-6c3e-63a783e6df97","label":"Company"},"id":"4eb6e7a8-451f-4365-1d79-d5d118b5ff56","label":"hasProject"}
{"inVertex":{"id":"c4b6e7ae-64d3-b8b9-ce7b-c319e7ed70ca","label":"Lead"},"outVertex":{"id":"58b6e79f-6809-6fc4-9f0a-c8a26337a729","label":"Campaign"},"id":"96b6e7ae-64d4-b918-353d-fccc13cbd9bb","label":"hasLead"}
{"inVertex":{"id":"94b6e79f-69b9-ccfe-d9e6-a41c4be59979","label":"Campaign"},"outVertex":{"id":"a2b6e79d-1ca7-db4e-f19f-2ef1df514ade","label":"Project"},"id":"34b6e79f-69b9-bd15-9331-0551c464f222","label":"hasCampaign"}
{"inVertex":{"id":"36b6e7b2-3d78-3229-9ebd-05c2c5f5927b","label":"Lead"},"outVertex":{"id":"1cb6e79d-1ca7-4d3c-e71d-c05c13abac15","label":"Project"},"id":"c2b6e7b2-3d78-16d0-ee95-46b66108236e","label":"hasLead"}
{"inVertex":{"id":"36b6e7b2-3d78-3229-9ebd-05c2c5f5927b","label":"Lead"},"outVertex":{"id":"58b6e79f-6809-6fc4-9f0a-c8a26337a729","label":"Campaign"},"id":"04b6e7b2-3d79-3d95-855e-a206d38b8603","label":"hasLead"}
{"inVertex":{"id":"1cb6e79d-1ca7-4d3c-e71d-c05c13abac15","label":"Project"},"outVertex":{"id":"64b6e79c-d58b-37ad-6c3e-63a783e6df97","label":"Company"},"id":"ccb6e7a8-449b-d92c-1330-4f6288ab0852","label":"hasProject"}
{"inVertex":{"id":"7eb6e7dc-94f9-ca83-df4c-87284897151f","label":"Lead"},"outVertex":{"id":"1cb6e79d-1ca7-4d3c-e71d-c05c13abac15","label":"Project"},"id":"d2b6e7dc-94fb-8219-30a2-d304ccaed75d","label":"hasLead"}
{"inVertex":{"id":"d0b6e79f-692a-8c03-112e-9388e54b1f9d","label":"Campaign"},"outVertex":{"id":"1cb6e79d-1ca7-4d3c-e71d-c05c13abac15","label":"Project"},"id":"3cb6e79f-692b-4cad-48ac-1e4991e75b60","label":"hasCampaign"}
I am running the below query to fetch the Leads associated with a Project and a specific campaign.
GraphTraversal t = g.V("1cb6e79d-1ca7-4d3c-e71d-c05c13abac15").out("hasLead")
    .where(__.in("hasLead").has("Campaign","name","Campaign1V"));
This returns the information about the Leads in the output. I wanted to know if there is a way I can get the specific Campaign information, as well as the ID information, in the output response (using a single traversal statement) so that it can be utilised by the UI component to render HTML.
You just need to transform your result into the form that you want. In this case you can use something like project():
g.V("1cb6e79d-1ca7-4d3c-e71d-c05c13abac15").out("hasLead").
where(__.in("hasLead").has("Campaign","name","Campaign1V")).
project('lead','campaign').
by().
by(__.in("hasLead").has("Campaign","name","Campaign1V").fold())
You might want to include something in that first by() modulator to further transform the vertex to just the properties you want, and you might want to do the same for the second as well. Note that the fold() is only necessary if a lead can have more than one matching campaign.
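For example, a minimal sketch that uses valueMap(true) in both by() modulators to pull back the element's id and label alongside its properties (newer TinkerPop versions offer elementMap() for the same purpose):

g.V("1cb6e79d-1ca7-4d3c-e71d-c05c13abac15").out("hasLead").
  where(__.in("hasLead").has("Campaign","name","Campaign1V")).
  project("lead","campaign").
    by(__.valueMap(true)).
    by(__.in("hasLead").has("Campaign","name","Campaign1V").valueMap(true).fold())

That should give the UI the Lead and Campaign properties plus their ids in a single response.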
So, the above works nicely and is easy to follow, but it does traverse the same "hasLead" edges twice. You can avoid that, but it adds a bit of indirection that hurts readability, so you have to decide whether you can live with it:
g.V("1cb6e79d-1ca7-4d3c-e71d-c05c13abac15").out("hasLead").
project('lead','campaign').
by().
by(__.in("hasLead").has("Campaign","name","Campaign1V").fold()).
filter(select('campaign').unfold())
Now you project all of the "leads" but filter away any that have an empty list for the "campaign".
Related
I have a database and am trying to find a safe way to read this data into an XML file. Reading the vertices is not the problem, but with the edges the app has recently been running into a HeapSpace exception.
g.V().has(Timestamp)
.order()
.by(Timestamp, Order.asc)
.range(startIndex, endIndex)
.outE()
Data to be extracted: e.g. label, id, properties.
I use the timestamp of the vertices as a reference for sorting, to avoid capturing duplicates.
But because some vertices account for the largest share of all outgoing edges, the program runs out of heap. My question is: can I use the alphanumeric ID (UUID) of the edges to sort them safely, or is there another or better way to reach the goal?
Something like this:
g.E().order().by(T.id, Order.asc).range(startIndex, endIndex)
Ordering the edges by T.id would be a fine choice; however, the problem is not related to the property chosen, it is related to the sheer number of edges being exported. Neptune, as with other databases, has to retrieve all the edges in order to then order them, and retrieving all these edges is why you are running out of heap. To solve this problem you can either increase the instance size to get additional memory for the query, or export the data differently. If you take a look at this utility https://github.com/awslabs/amazon-neptune-tools/tree/master/neptune-export you can see the current recommended best practice. This utility exports the entire graph as CSV files. In your use case you may be able to either use this utility and then transform the CSV into an XML document, or modify its code to export XML directly.
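If you take the transform route, here is a minimal sketch of the CSV-to-XML step using the JDK's StAX writer. The edges.csv/edges.xml file names and the column layout (a header row such as ~id,~label,~from,~to) are assumptions to check against your actual export, and a real implementation should use a proper CSV parser instead of a naive split:

import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;

public class CsvToXml {
    public static void main(String[] args) throws Exception {
        try (BufferedReader in = new BufferedReader(new FileReader("edges.csv"));
             FileWriter out = new FileWriter("edges.xml")) {
            XMLStreamWriter xml = XMLOutputFactory.newFactory().createXMLStreamWriter(out);
            xml.writeStartDocument();
            xml.writeStartElement("edges");
            String[] header = in.readLine().split(",");   // first row names the columns
            String line;
            while ((line = in.readLine()) != null) {
                String[] cells = line.split(",");          // naive split, no quoting rules
                xml.writeStartElement("edge");
                for (int i = 0; i < header.length && i < cells.length; i++) {
                    xml.writeAttribute(header[i].replace("~", ""), cells[i]);
                }
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
        }
    }
}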
In the project currently under development, we are integrating the Realm Database into the customer's app to improve responsiveness while working on a huge data set of ~20,000 records. For on-screen presentation we are using Realm's Android RecyclerView adapter. The majority of the use cases are read operations, followed by advanced search and/or filtering of the records.
Where the shoe pinches is that on some of our views, only a subset of the records of a given type is supposed to be displayed, selected by the back-end. Using the information passed by the API, we perform the initial filtering and set up the view.
Now, using the aforementioned technologies, is there a readable and maintainable way to store either this pre-filtered subset or the query fetching it for further reference, so that the initial state of the view can always be restored once the search view and/or filters are cleared? Or is storing the API response and re-applying the conditions given through it the only way to do it? Applying any new conditions to the query seems to alter it for good, and the same goes for applying new queries to the results. Shouldn't there be a way to create a fresh result set based on an old one without disturbing the latter?
Edit: Our app being 'bilingual', both Java- and Kotlin-based solutions are welcomed, should they differ.
As we came to realise earlier this week, and just as @EpicPandaForce mentioned in the comments, while a RealmQuery object cannot be "snapshotted" by assigning it to a spare variable before extending it, the same is not true for RealmResults objects. And so:
RealmResults<Obj> widerResults = realmInstance.where(Obj.class).in("id", idArray).findAll();
RealmResults<Obj> narrowerResults = widerResults.where().equalTo("flag", true).findAll();
This will provide two independent result sets. The wider one can be used as per the use case I highlighted: as a starting point for further subqueries. Changes to the objects found in the sets themselves will still be reflected in both sets.
Providing an answer for all the lost souls out there, should they get stuck like we did.
I'm very new to Gremlin. I have been going through the documentation but continue to struggle to find an answer to my problem. I'm assuming the answer is easy, but I have unfortunately become a little confused by all the different API options (e.g. subgraphs, side effects) and would like a little help/clarity from the expert group if possible.
Basically (as an example) I have a graph that looks like the below, where I first need to select 'A' and then traverse down the children of 'A' only, to find whether there is a vertex that matches 'A3' or 'A4'.
Selecting the first vertex is of course easy; I simply do something like:
g.V().has("name", "A")
However, I'm not sure how I can now isolate my second vertex search to the children of 'A' only. As I mentioned before, I have stumbled upon subgraphs but have not been able to fully grasp how I can leverage this capability, or whether I should for my purpose.
I'm using TinkerPop3 and Java 8.
Any help will be greatly appreciated!
When you start your traversal with: g.V().has('name','A') you get the "A" vertex. Any additional steps that you add after that are restricted to that one vertex. Therefore g.V().has('name','A').out() can only ever give you the "A1" vertex and related children.
To traverse through all the children of "A", you need the repeat() step:
g.V().has('name','A').
repeat(out()).
until(has('name',within('A3','A4')))
So, basically find "A", then traverse through children until you run into "A3" or "A4".
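For instance, here is a self-contained sketch against an in-memory TinkerGraph; the "node" vertex label and "child" edge label are assumptions, since the original graph definition wasn't supplied:

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;
import java.util.List;
import static org.apache.tinkerpop.gremlin.process.traversal.P.within;
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.has;
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.out;

public class RepeatUntilExample {
    public static void main(String[] args) {
        GraphTraversalSource g = TinkerGraph.open().traversal();
        // build a small chain A -> A1 -> A2 -> A3
        Vertex a  = g.addV("node").property("name", "A").next();
        Vertex a1 = g.addV("node").property("name", "A1").next();
        Vertex a2 = g.addV("node").property("name", "A2").next();
        Vertex a3 = g.addV("node").property("name", "A3").next();
        a.addEdge("child", a1);
        a1.addEdge("child", a2);
        a2.addEdge("child", a3);

        // walk down from "A" until "A3" or "A4" turns up
        List<Object> found = g.V().has("name", "A").
                repeat(out()).
                until(has("name", within("A3", "A4"))).
                values("name").toList();
        System.out.println(found); // prints [A3]
    }
}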
In the future, please consider supplying a Gremlin script that can be pasted into the console to construct your example graph - here's an example. An example graph in that form is quite helpful.
I am very new to Lotus Notes. Recently my teammates were facing a problem regarding duplicates in Lotus Notes, as shown below in CASE A and CASE B.
So we bought an app named scanEZ (link about scanEZ). Using this tool we can remove either the first occurrence or the second occurrence. In Case A and Case B the second items are considered redundant because they have no children, so we can remove all of the second items as given below, thus removing the duplicates.
But in Case C the order is reversed: the child item comes first and the parent item second, so I am unable to use the scanEZ app.
Is there any other better way, software, or script to accomplish my task? As I am new to this field I have no idea. Kindly help me.
Thanks in advance.
Probably the easiest way to approach this would be to force the view to always display documents with children first. That way the tool you have purchased will behave consistently for you. You would do this by adding a hidden sorted column to the right of the column that you have circled. The formula in this column would be @DocChildren, and the sort options for the column would be set to 'Descending'. (Note that if you are uncomfortable making changes in this view, you can make a copy of it, make your changes in the copy, and run ScanEZ against the copy as well. You can also do all of this in a local replica of the database, and only replicate it back to the server when you are satisfied that you have the right results.)
The other way would be to write your own code in LotusScript or Java, using the Notes classes. There are many different ways you could write that code.
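As a rough illustration of the Java route, here is a sketch of an agent that, for each duplicate key, keeps the copy that has children (responses) and deletes the childless copy. The "Subject" key field is an assumption, it skips the recycle() housekeeping a production Notes agent should do, and it does not address the case where both copies have children:

import lotus.domino.AgentBase;
import lotus.domino.Database;
import lotus.domino.Document;
import lotus.domino.DocumentCollection;
import lotus.domino.Session;
import java.util.HashMap;
import java.util.Map;

public class RemoveDuplicates extends AgentBase {
    public void NotesMain() {
        try {
            Session session = getSession();
            Database db = session.getAgentContext().getCurrentDatabase();
            DocumentCollection all = db.getAllDocuments();
            Map<String, Document> kept = new HashMap<String, Document>();
            Document doc = all.getFirstDocument();
            while (doc != null) {
                Document next = all.getNextDocument(doc);       // fetch before any removal
                String key = doc.getItemValueString("Subject"); // hypothetical duplicate key
                Document seen = kept.get(key);
                if (seen == null) {
                    kept.put(key, doc);
                } else if (doc.getResponses().getCount() > 0) {
                    seen.remove(true);                          // earlier copy had no children
                    kept.put(key, doc);
                } else {
                    doc.remove(true);                           // this copy is the childless one
                }
                doc = next;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}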
I agree with Richard's answer. If you want more details on how to go through the document collection, you could isolate the documents into a view that shows only the duplicates. Then write an agent that looks at the UNID of the document, the date modified, and other such data elements to ensure that you are keeping the last-updated document. I would add a field to the document, e.g. FLAG='keep', and then delete the documents that don't have your flag with a second agent. If you take this approach you can often reuse the same agents in other databases.
Since you are new to Notes, keep in mind that Notes is a document database. There are several different kinds of conflicts, like save conflicts or replication conflicts. You also need to look at the database settings for how duplicates are handled. I would read up on these topics just so you can explain them to your co-workers/project manager.
Eventually, in your heavily trafficked databases, you might be able to automate this process once you track down the source of the duplicates.
These are clearly not duplicates.
The definition of a duplicate is that the copies are identical, so it does not matter which one is kept and which one is removed. To you, the fact that one has children makes it more important, which means they are not pure duplicates.
What you have not stated is what you want to do if multiple documents with similar dates/subjects have children (a case D if you will).
To me this appears as three separate problems.
The first problem is to sort out the cases where more than one document in a set has children.
Then sort out the cases where only one document in a set has children.
Then sort out the cases where none of the documents in a set has children.
The approach in each case will be different. The article from Ytira only really covers the last of these cases.
I am looking for an Apache Lucene web crawler, written in Java if possible, or in any other language. The crawler must use Lucene and create valid Lucene index and document files, which is why Nutch, for example, is ruled out...
Does anybody know whether such a web crawler exists and, if the answer is yes, where I can find it?
Thanks...
What you're asking for is two components:
Web crawler
Lucene-based automated indexer
First, a word of encouragement: been there, done that. I'll tackle both of the components individually from the point of view of making your own, since I don't believe you could use Lucene to do what you've requested without really understanding what's going on underneath.
Web crawler
So you have a web site/directory you want to "crawl" through to collect specific resources. Assuming it's a common web server that lists directory contents, making a web crawler is easy: just point it to the root of the directory and define rules for collecting the actual files, such as "ends with .txt". Very simple stuff, really.
The actual implementation could be something like this: use HttpClient to get the actual web pages/directory listings, then parse them in whatever way you find most efficient, such as using XPath to select all the links from the fetched document, or just parsing it with regex using Java's readily available Pattern and Matcher classes. If you decide to go the XPath route, consider using JDOM for DOM handling and Jaxen for the actual XPath.
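As a rough sketch of that fetch-and-parse step, here is the regex variant using the JDK's built-in HTTP client (java.net.http, JDK 11+) in place of a third-party HttpClient; the ".txt" rule and the href pattern are illustrative assumptions:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkCollector {
    // collects href values ending in .txt from a directory listing page
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+\\.txt)\"");

    public static List<String> collectTxtLinks(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(response.body());
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}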
Once you get the actual resources you want, such as a bunch of text files, you need to identify the type of data so you know what to index and what you can safely ignore. For simplicity's sake I'm assuming these are plaintext files with no fields or anything, and I won't go deeper into that; but if you have multiple fields to store, I suggest you make your crawler produce 1..n specialized beans with accessors and mutators (bonus points: make the bean immutable, don't allow accessors to mutate the internal state of the bean, and create a copy constructor for the bean) to be used in the other component.
In terms of API calls, you should have something like HttpCrawler#getDocuments(String url) which returns a List<YourBean> to use in conjunction with the actual indexer.
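In other words, the contract between the two components could be as small as this; the TextDocument and HttpCrawler names are just placeholders for your own:

import java.util.List;

// immutable bean holding one crawled resource
public final class TextDocument {
    private final String url;
    private final String content;

    public TextDocument(String url, String content) {
        this.url = url;
        this.content = content;
    }

    // copy constructor, as suggested above
    public TextDocument(TextDocument other) {
        this(other.url, other.content);
    }

    public String getUrl()     { return url; }
    public String getContent() { return content; }
}

interface HttpCrawler {
    List<TextDocument> getDocuments(String url);
}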
Lucene-based automated indexer
Beyond the obvious stuff with Lucene, such as setting up a directory and understanding its threading model (only one write operation is allowed at any time; multiple reads can exist even while the index is being updated), you of course want to feed your beans to the index. The five minute tutorial I already linked to basically does exactly that; look into the example addDoc(..) method and just replace the String with YourBean.
Note that Lucene's IndexWriter has some cleanup methods which are handy to execute in a controlled manner. For example, calling IndexWriter#commit() only after a bunch of documents have been added to the index is good for performance, and calling IndexWriter#optimize() to make sure the index isn't getting hugely bloated over time is a good idea too. Always remember to close the index as well, to avoid unnecessary LockObtainFailedExceptions being thrown; as with all IO in Java, such operations should of course be done in the finally block.
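Putting those pieces together, a sketch of the indexer against a recent Lucene API might look like the following (the exact classes have moved around between Lucene versions, and note that optimize() was replaced by forceMerge() from Lucene 4 onwards). TextDocument is the bean from the crawler sketch above:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;
import java.util.List;

public class BeanIndexer {
    public static void index(List<TextDocument> docs, String indexPath) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // try-with-resources closes the writer, releasing the write lock
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get(indexPath)), config)) {
            for (TextDocument d : docs) {
                Document doc = new Document();
                doc.add(new TextField("url", d.getUrl(), Field.Store.YES));
                doc.add(new TextField("content", d.getContent(), Field.Store.YES));
                writer.addDocument(doc);
            }
            writer.commit(); // commit once after the whole batch, as noted above
        }
    }
}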
Caveats
You need to remember to expire your Lucene index's contents every now and then too; otherwise you'll never remove anything, the index will get bloated, and it will eventually die because of its own internal complexity.
Because of the threading model you most likely need to create a separate read/write abstraction layer for the index itself to ensure that only one instance can write to the index at any given time.
Since the source data acquisition is done over HTTP, you need to consider the validation of data and possible error situations, such as the server not being available, to avoid any kind of malformed indexing and client hangups.
You need to know what you want to search from the index in order to decide what you are going to put into it. Note that indexing by date must be done so that you split the date into, say, year, month, day, hour, minute and second fields, instead of a millisecond value, because when doing range queries on a Lucene index, [0 TO 5] actually gets transformed into +0 +1 +2 +3 +4 +5, which means the range query dies out very quickly because there is a maximum number of query sub-parts.
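A sketch of that date-splitting idea (on modern Lucene you would more likely use LongPoint/IntPoint range fields, which make this workaround unnecessary):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import java.time.LocalDateTime;

public class DateFields {
    // store a timestamp as coarse-grained, individually queryable fields
    public static void addDateFields(Document doc, LocalDateTime ts) {
        doc.add(new StringField("year",  String.valueOf(ts.getYear()), Field.Store.NO));
        doc.add(new StringField("month", String.format("%02d", ts.getMonthValue()), Field.Store.NO));
        doc.add(new StringField("day",   String.format("%02d", ts.getDayOfMonth()), Field.Store.NO));
        doc.add(new StringField("hour",  String.format("%02d", ts.getHour()), Field.Store.NO));
    }
}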
With this information I do believe you could make your own special Lucene indexer in less than a day, three if you want to test it rigorously.
Take a look at the Solr search server and Nutch (a crawler); both are related to the Lucene project.