BaseX: Inserting nodes performance problems - java

I am experiencing some performance problems when inserting XML nodes into existing nodes in a BaseX database.
Use case
I have one big XML file (about 2 GB) from which I created a BaseX database. The XML looks like this (simplified); it has about 350,000 <record>s:
<collection>
  <record>
    <id>ABC007</id>
    <title>The title of the record</title>
    <author>Joe Lastname</author>
    ... [other information]
  </record>
  <record>
    <id>ABC555</id>
    <relation_id>ABC007</relation_id>
    <title>Another title</title>
    <author>Sue Lastname</author>
    ... [other information]
  </record>
  ... [many other <record>s]
</collection>
The <record>s are related to each other. The <relation_id> in one record points to an <id> in another record (see example above).
What I am doing in BaseX is inserting information from one related record into the other and vice versa. So the result looks like this:
<collection>
  <record>
    <id>ABC007</id>
    <title>The title of the record</title>
    <author>Joe Lastname</author>
    ... [other information]
    <related_record> <!-- Insert this information -->
      <title>Another title</title>
      <author>Sue Lastname</author>
    </related_record>
  </record>
  <record>
    <id>ABC555</id>
    <relation_id>ABC007</relation_id>
    <title>Another title</title>
    <author>Sue Lastname</author>
    ... [other information]
    <related_record> <!-- Insert this information -->
      <title>The title of the record</title>
      <author>Joe Lastname</author>
    </related_record>
  </record>
  ... [many other <record>s that should be enriched with other records' data]
</collection>
I am doing that with the following Java code:
// Setting some options and variables
Context context = new Context();
new Set(MainOptions.AUTOFLUSH, false).execute(context);
new Set(MainOptions.AUTOOPTIMIZE, false).execute(context);
new Set(MainOptions.UPDINDEX, true).execute(context);
// Opening the database
new Open("database_name").execute(context);
// Get all records with <relation_id> tags. These are the "child" records
// and they contain the "parent" record ID.
String queryParentIdsInChild =
    "for $childRecord in doc('xmlfile.xml')//record[relation_id] "
  + "return db:node-id($childRecord)";
// Iterate over the child records and get the parent record ID
QueryProcessor parentIdsInChildProc = new QueryProcessor(queryParentIdsInChild, context);
Iter iter = parentIdsInChildProc.iter();
for (Item childRecord; (childRecord = iter.next()) != null;) {
    // Create a pointer to the child record in BaseX for convenience
    String childNodeId = childRecord.toString();
    String childNode = "db:open-id('database_name', " + childNodeId + ")";
    // Get some details from the child record. They should be added to the parent record.
    String queryChildDetails =
        "let $title := data(" + childNode + "/title)"
      + " let $author := data(" + childNode + "/author)"
      + " return"
      + " <related_record>"
      + "   <title>{$title}</title>"
      + "   <author>{$author}</author>"
      + " </related_record>";
    String childDetails = new XQuery(queryChildDetails).execute(context);
    // Create a pointer to the parent record in BaseX for convenience
    String parentNode = ...; // similar procedure to getting the child node, so that code is skipped here
    // PERFORMANCE ISSUE HERE!!!
    // Insert the child record details into the parent node
    String parentUpdate = "insert node " + childDetails + " into " + parentNode;
    new XQuery(parentUpdate).execute(context);
}
// Close the query processor once the iterator has been fully consumed
parentIdsInChildProc.close();
... flushing and optimizing code here
Problem
The problem is that I experience massive performance problems when inserting the new nodes into a <record>. In a smaller test database with about 10,000 <record>s, the inserts are executed quite fast, in about 7 seconds. When I run the same code on my production database with about 350,000 <record>s, a single insert operation takes several seconds, some even minutes! And there would be thousands of these inserts, so it definitely takes too long.
Questions
I'm very new to BaseX and I am for sure not the most experienced Java programmer. Maybe I'm just overlooking something or making some stupid mistake, so I'm asking if someone has a hint for me. What could be the problem? Is it the Java code? Or is the BaseX database with 350,000 <record>s just too big for insert operations? If yes: is there a workaround? Or is BaseX (or XML databases in general) not the right tool for this use case?
Further Information
I am using BaseX 9.0.2 in stand-alone mode on Ubuntu 18.04. I have run an "Optimize All" before executing the above-mentioned code.

I think I didn't run the optimize correctly. After I optimized again, the insert commands ran very fast: now about 10,000 inserts execute in under a second. It may also have helped that I deactivated UPDINDEX and AUTOOPTIMIZE.
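For reference, a minimal sketch of that setup with the BaseX 9.x Java API (a sketch under the assumptions above, not the exact code that was run):
// Assumed sketch: full optimize first, with UPDINDEX and AUTOOPTIMIZE deactivated,
// then the insert loop from above runs against the freshly optimized database.
Context context = new Context();
new Set(MainOptions.UPDINDEX, false).execute(context);
new Set(MainOptions.AUTOOPTIMIZE, false).execute(context);
new Open("database_name").execute(context);
new OptimizeAll().execute(context); // rebuilds the database and all index structures
// ... run the insert loop from above ...
new Flush().execute(context);
context.close();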

Related

Merge/combine BaseX databases with upserts in memory constrained environment

I have two databases in BaseX, source_db and target_db, and would like to merge them by matching on the id attribute of each element and upserting the element with a replace or an insert depending on whether the element was found in the target_db. source_db has about 100,000 elements, and target_db has about 1,000,000 elements.
<!-- source_db contents -->
<root>
  <element id="1"/>
  <element id="2"/>
</root>
<!-- target_db contents -->
<root>
  <element id="1"/>
</root>
My query to merge the two databases looks like this:
for $e in db:open("source_db")/root/element
return (
  if (exists(db:open("target_db")/root/element[@id = data($e/@id)]))
  then replace node db:open("target_db")/root/element[@id = data($e/@id)] with $e
  else insert node $e into db:open("target_db")/root
)
When running the query, however, I keep getting memory-constraint errors: using a POST request to BaseX's REST interface I get "Out of Main Memory", and using the BaseX Java client I get java.io.IOException: GC overhead limit exceeded.
Ideally I would like to process just one element from source_db at a time to avoid memory issues, but it seems my query isn't doing this. I've tried using the db:copynode false pragma, but it did not make a difference. Is there any way to accomplish this?
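Not from the original thread, but a minimal sketch of that one-element-at-a-time idea with the BaseX Java API (database layout as above, error handling omitted): collect the ids on the client first, then issue one small updating query per id, so no single query has to materialize all elements at once.
// Collect all ids from source_db on the client.
Context context = new Context();
String ids = new XQuery(
    "string-join(db:open('source_db')/root/element/@id, ' ')").execute(context);
// One small upsert per element: replace if the id exists in target_db, else insert.
for (String id : ids.split(" ")) {
    new XQuery(
        "let $e := db:open('source_db')/root/element[@id = '" + id + "'] " +
        "let $t := db:open('target_db')/root/element[@id = '" + id + "'] " +
        "return if ($t) then replace node $t with $e " +
        "else insert node $e into db:open('target_db')/root").execute(context);
}
context.close();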

How to read attributes out of multiple nested documents in MongoDB Java?

I need some help with a project I am planning to do. At this stage I am trying to learn how to use NoSQL databases in Java.
I've got a few nested documents looking like this:
[Image: MongoDB nesting structure]
As you can see in the image, my inner attributes are "model" and "construction".
Now I need to iterate through all the documents in my collection, whose key names are unknown because they are generated at runtime when a user enters some information.
At the end I need to list them in a TreeView, keeping the structure they already have in the database.
What I've tried is getting key sets from the documents, but I cannot get past the second layer of the structure. I am able to print the whole object in JSON format, but I cannot access specific attributes like "model" or "construction".
MongoCollection<Document> collection = mongoDatabase.getCollection("test");
MongoCursor<Document> cursor = collection.find().iterator();
while (cursor.hasNext()) {
    Document document = cursor.next();
    for (String key : document.keySet()) {
        Document vehicles = (Document) document.get(key);
        //System.out.println(key);
        //System.out.println(document.get(key));
    }
}
Document cars = (Document) vehicle.get("cars");
Document types = (Document) cars.get("coupes");
Document brands = (Document) types.get("Ford");
Document model = (Document) brands.get("Mustang GT");
Here I tried to get some properties by hardcoding the key names of the documents, but I can't seem to get any value either: it keeps telling me that it could not read from vehicle because it is null.
Most tutorials and forum posts somehow do not work for me. I don't know if they use another version of the MongoDB driver; mine is mongodb-driver 3.12.7, if that helps in any way.
I have been trying to get this working for days now and it is driving me crazy.
I hope there is anyone out there who is able to help me with this problem.
Here is a way you can try, using the Document class's methods: use the Document#getEmbedded method to navigate the path of an embedded (sub-)document.
try (MongoCursor<Document> cursor = collection.find().iterator()) {
    while (cursor.hasNext()) {
        // Get a document
        Document doc = cursor.next();
        // Get the sub-document with the known key path "vehicles.cars.coupes"
        Document coupes = doc.getEmbedded(
                Arrays.asList("vehicles", "cars", "coupes"),
                Document.class);
        // For each of the sub-documents within the "coupes", get the
        // dynamic keys and their values.
        for (Map.Entry<String, Object> coupe : coupes.entrySet()) {
            System.out.println(coupe.getKey()); // e.g., Mercedes
            // The dynamic sub-document for the dynamic key (e.g., Mercedes):
            // {"S-Class": {"model": "S-Class", "construction": "2011"}}
            Document coupeSubDoc = (Document) coupe.getValue();
            // Get the coupeSubDoc's keys and values
            coupeSubDoc.keySet().forEach(k -> {
                System.out.println("\t" + k); // e.g., S-Class
                System.out.println("\t\t" + "model" + " : " +
                        coupeSubDoc.getEmbedded(Arrays.asList(k, "model"), String.class));
                System.out.println("\t\t" + "construction" + " : " +
                        coupeSubDoc.getEmbedded(Arrays.asList(k, "construction"), String.class));
            });
        }
    }
}
The above code prints to the console as:
Mercedes
	S-Class
		model : S-Class
		construction : 2011
Ford
	Mustang
		model : Mustang GT
		construction : 2015
I think it's not the complete answer to his question.
Here he says:
"Now I need to iterate through all the documents in my collection, whose key names are unknown, because they are generated at runtime, when a user enters some information."
Your answer @prasad_ just refers to his case with vehicles, cars and so on. He needs a way to handle unknown key/value pairs, I guess. For example, in this case he only knows the keys vehicle, cars, coupe, Mercedes/Ford and their subkeys. If another user inserts some new key/value pairs into the collection, he will have problems, because he can't navigate through the new document without having a look into the database.
I'm also interested in the solution, because I have never nested my key/value pairs and can't see the advantage of it. Am I wrong, or does it make the programming more difficult?
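A possible direction for the fully-unknown-keys case (just a sketch, not from the accepted answer): walk each document recursively, so no key path has to be known in advance.
// Recursively print every key of a document, whatever its nesting depth;
// call printKeys(doc, 0) for each document obtained from the cursor.
static void printKeys(Document doc, int depth) {
    for (Map.Entry<String, Object> entry : doc.entrySet()) {
        for (int i = 0; i < depth; i++) System.out.print("\t"); // indent by depth
        System.out.println(entry.getKey());
        if (entry.getValue() instanceof Document) {
            printKeys((Document) entry.getValue(), depth + 1); // recurse into sub-document
        }
    }
}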

Get historic prices by ISIN from Yahoo Finance

I have the following problem:
I have around 1,000 unique ISIN numbers of stock-exchange-listed companies.
I need the historic prices of these companies, starting with the earliest listing until today, on a daily basis.
However, as far as my research goes, Yahoo can only provide prices for stock ticker symbols, which I do not have.
Is there a way to automatically get the historic prices from Yahoo via their API, for example for ISIN AT0000609664, which is the company Porr?
I appreciate your replies!
The Answer:
To get the Yahoo ticker symbol from an ISIN, take a look at the yahoo.finance.isin table; here is an example query:
http://query.yahooapis.com:80/v1/public/yql?q=select * from yahoo.finance.isin where symbol in ("DE000A1EWWW0")&env=store://datatables.org/alltableswithkeys
This returns the ticker ADS.DE inside an XML:
<query yahoo:count="1" yahoo:created="2015-09-21T12:18:01Z" yahoo:lang="en-US">
  <results>
    <stock symbol="DE000A1EWWW0">
      <Isin>ADS.DE</Isin>
    </stock>
  </results>
</query>
<!-- total: 223 -->
<!-- pprd1-node600-lh3.manhattan.bf1.yahoo.com -->
I am afraid your example ISIN won't work, but that's an error on Yahoo's side (see Yahoo Symbol Lookup; type your ISINs in there to check whether the ticker exists on Yahoo).
The Implementation:
Sorry, I am not proficient in Java or R anymore, but this C# code should be almost similar enough to copy/paste:
public string GetYahooSymbol(string isin)
{
    string query = GetQuery(isin);
    XDocument result = GetHttpResult(query);
    XElement stock = result.Root.Element("results").Element("stock");
    return stock.Element("Isin").Value;
}
where GetQuery(string isin) returns the URI for the query to Yahoo (see my example URI) and GetHttpResult(string uri) fetches the XML from the web. Then you just have to extract the contents of the Isin node and you're done.
I assume you have already implemented the actual data fetch using ticker symbols.
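For what it's worth, a rough Java equivalent of the C# sketch above (untested; the YQL endpoint and response layout are taken from the example earlier in this answer):
import java.net.URL;
import java.net.URLEncoder;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

public static String getYahooSymbol(String isin) throws Exception {
    String yql = "select * from yahoo.finance.isin where symbol in (\"" + isin + "\")";
    String query = "http://query.yahooapis.com/v1/public/yql?q="
        + URLEncoder.encode(yql, "UTF-8")
        + "&env=" + URLEncoder.encode("store://datatables.org/alltableswithkeys", "UTF-8");
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    org.w3c.dom.Document result = builder.parse(new URL(query).openStream());
    // The ticker sits in the <Isin> element of the <stock> node (see the XML above).
    return result.getElementsByTagName("Isin").item(0).getTextContent();
}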
Also see this question for the inverse problem (symbol -> isin). But for the record:
Query to fetch historical data for a symbol
http://query.yahooapis.com:80/v1/public/yql?q=select * from yahoo.finance.historicaldata where symbol in ("ADS.DE") and startDate = "2015-06-14" and endDate = "2015-09-22"&env=store://datatables.org/alltableswithkeys
where you may pass arbitrary dates and an arbitrary list of ticker symbols. It's up to you to build the query in your code and to pull the results from the XML you get back. The response will be along the lines of
<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="71" yahoo:created="2015-09-22T20:00:39Z" yahoo:lang="en-US">
  <results>
    <quote Symbol="ADS.DE">
      <Date>2015-09-21</Date>
      <Open>69.94</Open>
      <High>71.21</High>
      <Low>69.65</Low>
      <Close>70.79</Close>
      <Volume>973600</Volume>
      <Adj_Close>70.79</Adj_Close>
    </quote>
    <quote Symbol="ADS.DE">
      <Date>2015-09-18</Date>
      <Open>70.00</Open>
      <High>71.43</High>
      <Low>69.62</Low>
      <Close>70.17</Close>
      <Volume>3300200</Volume>
      <Adj_Close>70.17</Adj_Close>
    </quote>
    ......
  </results>
</query>
<!-- total: 621 -->
<!-- pprd1-node591-lh3.manhattan.bf1.yahoo.com -->
This should get you far enough to write your own code. Note that it is also possible to get the data in .csv format by appending &e=.csv to the query, but I don't know much about that or whether it will work for the queries above, so see here for reference.
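Purely as an illustration (not from the original answer), building that historicaldata URI in Java could look like this; the endpoint and parameters are copied from the example query above:
// Hypothetical helper: builds the YQL URI for the historicaldata query shown above.
public static String buildHistoricalQuery(String symbol, String startDate, String endDate)
        throws java.io.UnsupportedEncodingException {
    String yql = "select * from yahoo.finance.historicaldata where symbol in (\"" + symbol
        + "\") and startDate = \"" + startDate + "\" and endDate = \"" + endDate + "\"";
    return "http://query.yahooapis.com/v1/public/yql?q=" + java.net.URLEncoder.encode(yql, "UTF-8")
        + "&env=" + java.net.URLEncoder.encode("store://datatables.org/alltableswithkeys", "UTF-8");
}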
I found a web service which provides historic data based on a date range. Please have a look:
http://splice.xignite.com/services/Xignite/XigniteHistorical/GetHistoricalQuotesRange.aspx

Parsing an XML, exponential increase in time

I have a parser which parses an XML, collects the required fields, and constructs an object out of them.
Suppose the XML is like below:
<xml>
  <p1>
    ...
    ...
  </p1>
  <p2>
    ...
  </p2>
  ...
  ...
</xml>
My Java code parses it and the code looks like below:
for each product // p1, p2, etc.
    print start time
    parse that node, which returns an object
    print end time
    add the object to the list
The sample code is below
products = (NodeList) xPath.evaluate("/xml/product", pxml, XPathConstants.NODESET);
for (int i = 0; i < products.getLength(); i++)
{
    System.out.println("parsing product ::" + i + ":" + (System.currentTimeMillis() - time));
    BookDataInfo parsedProduct = ParseProduct(products.item(i));
    System.out.println("parsing product finished ::" + i + ":" + (System.currentTimeMillis() - time));
    if (parsedProduct.getParsingSucceeded())
    {
        parsedProducts.add(parsedProduct);
    }
}
I have printed the times before parsing a node and after it; the time increases steeply with the number of products: the 1st product takes about 100 ms, whereas the 300th takes about 2,000 ms.
In each case the same part of the code is executed for parsing.
Does anyone have an idea why this happens?
I can't post the code of what ParseProduct is doing, but I have found out where most of the time is consumed:
private NodeList getNodelist(Node xml, String name)
{
    long time = System.currentTimeMillis();
    System.out.println("Nodelist start::" + (System.currentTimeMillis() - time));
    NodeList nodes = (NodeList) xPath.evaluate(name, xml, XPathConstants.NODESET);
    System.out.println("Nodelist end::" + (System.currentTimeMillis() - time));
    return nodes;
}
Similarly, getting a node value is done with a statement like
Node node = (Node) xPath.evaluate(name, xml, XPathConstants.NODE);
Here xPath is a static object of type XPath.
When the above function is called multiple times for a product, the later calls take much longer: at the start a call took 2-3 ms, but later (say, around product 300) each call took 55-60 ms.
Maybe I am missing something here?
Thanks!
Check out the difference between DOM and SAX parsing: DOM lets you query the XML file, but it has to load the entire document into memory for that. If you just want to create objects, you are better off using a SAX parser.
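Not part of the original answer, but a minimal SAX sketch of that suggestion (element and file names are assumptions; needs javax.xml.parsers.*, java.io.File and org.xml.sax.helpers.DefaultHandler):
// Stream the file with SAX: no DOM is built, each <product> is handled
// as soon as its closing tag arrives.
SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
parser.parse(new File("products.xml"), new DefaultHandler() {
    private final StringBuilder text = new StringBuilder();
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("product".equals(qName)) {
            // build the product object from the collected content here
            text.setLength(0);
        }
    }
});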
The problem is solved.
The main issue was the one mentioned in the link below:
XPath.evaluate performance slows down (absurdly) over multiple calls
Following the steps mentioned there drastically reduced the time consumed.
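For reference, the shape of that fix as a sketch (based on the linked question): don't reuse one static XPath instance across thousands of evaluations; create a fresh one per call (or call xpath.reset()) so internal state cannot pile up.
// Sketch: a fresh XPath per call instead of a shared static instance
// (javax.xml.xpath.*, org.w3c.dom.*).
private NodeList getNodelist(Node xml, String name) throws XPathExpressionException {
    XPath localXPath = XPathFactory.newInstance().newXPath();
    return (NodeList) localXPath.evaluate(name, xml, XPathConstants.NODESET);
}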

Adding entities to solr using solrj and schema.xml

I would like to add entities to documents like you can do with the data-config.
At the moment I'm indexing every page of my documents as a single document.
Now:
<solrDoc>
  <id>1</id>
  <docname>test.pdf</docname>
  <pagenumber>1</pagenumber>
  <pagecontent>blablabla</pagecontent>
</solrDoc>
<solrDoc>
  <id>2</id>
  <docname>test.pdf</docname>
  <pagenumber>2</pagenumber>
  <pagecontent>blablabla</pagecontent>
</solrDoc>
As you can see, the data related to the document is stored x times (once per page). I would like to get documents like this:
<doc>
  <id>1</id>
  <docname>test.pdf</docname>
  <pageEntries> <!-- multivalued field -->
    <pageEntry><pagenumber>1</pagenumber><pagecontent>blablabla</pagecontent></pageEntry>
    <pageEntry><pagenumber>2</pagenumber><pagecontent>blablabla</pagecontent></pageEntry>
  </pageEntries>
</doc>
I don't know how to make something like pageEntry. I saw that Solr can import entities from databases, but I'm wondering how I can do the same (or something similar)?
I'm using Solr 3.6.1. The page extraction is done by myself using PDFBox.
Java code:
SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.setField("id", 1);
solrDoc.setField("filename", "test");
for (int p : pages) {
    solrDoc.addField("page", p);
}
for (String pc : pagecont) {
    solrDoc.addField("pagecont", pc);
}
If the extraction is performed by you, you can combine all the pages and feed them as a single Solr document, with pagenumber & pagecontent being multivalued fields.
You can use the same id for all the pages (with the id not being a primary field in the schema definition) and use Grouping (Field Collapsing) to group the results for the documents.
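A minimal SolrJ sketch of that suggestion (assumptions: Solr 3.6-era API, pagenumber/pagecontent declared as multivalued fields in schema.xml, pageContents holding the extracted text per page):
// One Solr document per PDF; page numbers and contents go into
// parallel multivalued fields instead of one document per page.
SolrInputDocument doc = new SolrInputDocument();
doc.setField("id", "test.pdf");
doc.setField("docname", "test.pdf");
for (int p = 0; p < pageContents.size(); p++) {
    doc.addField("pagenumber", p + 1);                // multivalued
    doc.addField("pagecontent", pageContents.get(p)); // multivalued
}
server.add(doc);  // server is an org.apache.solr.client.solrj.SolrServer
server.commit();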
