Merge/combine BaseX databases with upserts in memory constrained environment - java

I have two databases in BaseX, source_db and target_db, and would like to merge them by matching on the id attribute of each element and upserting the element with a replace or an insert depending on whether the element was found in the target_db. source_db has about 100,000 elements, and target_db has about 1,000,000 elements.
<!-- source_db contents -->
<root>
  <element id="1"/>
  <element id="2"/>
</root>
<!-- target_db contents -->
<root>
  <element id="1"/>
</root>
My query to merge the two databases looks like this:
for $e in db:open("source_db")/root/element
return (
  if (exists(db:open("target_db")/root/element[@id = data($e/@id)]))
  then replace node db:open("target_db")/root/element[@id = data($e/@id)] with $e
  else insert node $e into db:open("target_db")/root
)
When running the query, however, I keep getting memory errors: using a POST request to BaseX's REST interface I get "Out of Main Memory", and using the BaseX Java client I get java.io.IOException: GC overhead limit exceeded.
Ideally I would like to process just one element from source_db at a time to avoid memory issues, but my query doesn't seem to do this. I've tried the db:copynode false pragma, but it made no difference. Is there any way to accomplish this?
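One way to bound memory from the client side is to run many small updating queries instead of one big one, e.g. by windowing over the source elements by position. This is an illustrative sketch, not an official BaseX recipe: whether position()-windowed batches actually lower peak memory depends on how BaseX evaluates the query, but running many small updating queries keeps each transaction's pending updates small. The id-matching logic is the same as in the query above; the batch size is an arbitrary choice.

```java
public class BatchedUpsert {
    // Builds one updating XQuery that upserts only the source elements at
    // positions [start, start + size). Running it repeatedly keeps each
    // transaction's set of pending updates small.
    static String buildBatchQuery(int start, int size) {
        return "for $e in db:open(\"source_db\")/root/element"
             + "[position() >= " + start + " and position() < " + (start + size) + "] "
             + "return ( "
             + "if (exists(db:open(\"target_db\")/root/element[@id = data($e/@id)])) "
             + "then replace node db:open(\"target_db\")/root/element[@id = data($e/@id)] with $e "
             + "else insert node $e into db:open(\"target_db\")/root )";
    }

    public static void main(String[] args) {
        int total = 100000, batch = 1000;
        for (int start = 1; start <= total; start += batch) {
            String query = buildBatchQuery(start, batch);
            // send `query` to BaseX here, e.g. via the REST interface or
            // new XQuery(query).execute(context) with the Java client
        }
    }
}
```

Each batch is its own transaction, so a crash mid-run leaves the target partially merged; since the upsert is idempotent, re-running the whole loop is safe.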

BaseX: Inserting nodes performance problems

I am experiencing some performance problems when inserting XML nodes to existing nodes in a BaseX database.
Use case
I have one big XML file (about 2 GB) from which I created a BaseX database. The XML looks like this (simplified). It has about 350,000 <record>s:
<collection>
<record>
<id>ABC007</id>
<title>The title of the record</title>
<author>Joe Lastname</author>
... [other information]
</record>
<record>
<id>ABC555</id>
<relation_id>ABC007</relation_id>
<title>Another title</title>
<author>Sue Lastname</author>
... [other information]
</record>
... [many other <record>s]
</collection>
The <record>s are related to each other. The <relation_id> in one record points to an <id> in another record (see example above).
What I am doing in BaseX is inserting information from one related record to the other one and vice versa. So, the result looks like this:
<collection>
<record>
<id>ABC007</id>
<title>The title of the record</title>
<author>Joe Lastname</author>
... [other information]
<related_record> <!-- Insert this information -->
<title>Another title</title>
<author>Sue Lastname</author>
</related_record>
</record>
<record>
<id>ABC555</id>
<relation_id>ABC007</relation_id>
<title>Another title</title>
<author>Sue Lastname</author>
... [other information]
<related_record> <!-- Insert this information -->
<title>The title of the record</title>
<author>Joe Lastname</author>
</related_record>
</record>
... [many other <record>s that should be enriched with other records data]
</collection>
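The enrichment itself is just an id lookup in both directions. It can be sketched in plain Java with a hypothetical Rec type (the BaseX-specific part is only the reading and writing of nodes):

```java
import java.util.*;

public class RecordEnrichment {
    record Rec(String id, String relationId, String title, String author) {}

    // For every child record with a relation_id, resolve the related record
    // via a hash lookup and collect the title each record should receive.
    static Map<String, String> enrich(List<Rec> records) {
        // Index records by id so each relation resolves in O(1).
        Map<String, Rec> byId = new HashMap<>();
        for (Rec r : records) byId.put(r.id(), r);

        Map<String, String> relatedTitleOf = new HashMap<>();
        for (Rec child : records) {
            if (child.relationId() == null) continue;
            Rec parent = byId.get(child.relationId());
            if (parent == null) continue;
            relatedTitleOf.put(parent.id(), child.title());  // parent gets child's details
            relatedTitleOf.put(child.id(), parent.title());  // child gets parent's details
        }
        return relatedTitleOf;
    }

    public static void main(String[] args) {
        List<Rec> records = List.of(
            new Rec("ABC007", null, "The title of the record", "Joe Lastname"),
            new Rec("ABC555", "ABC007", "Another title", "Sue Lastname"));
        System.out.println(enrich(records).get("ABC007")); // Another title
    }
}
```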
I am doing that with the following Java code:
// Setting some options and variables
Context context = new Context();
new Set(MainOptions.AUTOFLUSH, false).execute(context);
new Set(MainOptions.AUTOOPTIMIZE, false).execute(context);
new Set(MainOptions.UPDINDEX, true).execute(context);
// Opening the database
new Open("database_name").execute(context);
// Get all records with <relation_id> tags. These are the "child" records and they contain the "parent" record ID.
String queryParentIdsInChild = "for $childRecord in doc('xmlfile.xml')//record[relation_id] "
    + "return db:node-id($childRecord)";
// Iterate over the child records and get the parent record ID
QueryProcessor parentIdsInChildProc = new QueryProcessor(queryParentIdsInChild, context);
Iter iter = parentIdsInChildProc.iter();
for (Item childRecord; (childRecord = iter.next()) != null;) {
    // Create a pointer to the child record in BaseX for convenience
    String childNodeId = childRecord.toString();
    String childNode = "db:open-id('database_name', " + childNodeId + ")";
    // Get some details from the child record. They should be added to the parent record.
    String queryChildDetails = "let $title := data(" + childNode + "/title)"
        + " let $author := data(" + childNode + "/author)"
        + " return "
        + "<related_record>"
        + "  <title>{$title}</title>"
        + "  <author>{$author}</author>"
        + "</related_record>";
    String childDetails = new XQuery(queryChildDetails).execute(context);
    // Create a pointer to the parent record in BaseX for convenience
    parentNode = (... similar procedure to getting the child node, so that code is skipped here)
    // PERFORMANCE ISSUE HERE!!!
    // Insert the child record details into the parent node
    String parentUpdate = "insert node " + childDetails + " into " + parentNode;
    new XQuery(parentUpdate).execute(context);
}
// Close the query processor only after the iteration has finished
parentIdsInChildProc.close();
... flushing and optimizing code here
Problem
The problem is that I experience massive performance problems when inserting the new nodes into a <record>. In a smaller test database with about 10,000 <record>s, the inserts execute quite fast, in about 7 seconds. When I run the same code on my production database with about 350,000 <record>s, a single insert operation takes several seconds, some even minutes. And there would be thousands of these inserts, so it definitely takes too long.
Questions
I'm very new to BaseX and I'm certainly not the most experienced Java programmer. Maybe I'm just overlooking something or making a stupid mistake, so I'm asking if someone has a hint for me. What could be the problem? Is it the Java code? Or is a BaseX database with 350,000 <record>s just too big for insert operations? If so, is there a workaround? Or is BaseX (or are XML databases in general) not the right tool for this use case?
Further Information
I am using BaseX 9.0.2 in stand-alone mode on Ubuntu 18.04. I ran an "Optimize All" before running the above-mentioned code.
I think I didn't run the optimize correctly. After I optimized again, the insert commands ran very fast: about 10,000 inserts now execute in under a second. It may also have helped that I deactivated UPDINDEX and AUTOOPTIMIZE.

+ sign being dropped from xml when validation occurs

This is a follow-up to a previous question I asked here: WebResponse posting a null string.
While the answer works for that question, a new problem appeared. When parsing the below XML
<?xml version="1.0" encoding="UTF-8"?>
<hml xmlns="http://schemas.nmdp.org/spec/hml/1.0.1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://schemas.nmdp.org/spec/hml/1.0.1 http://schemas.nmdp.org/spec/hml/1.0.1/hml-1.0.1.xsd"
version="1.0.1" >
<!--
MIRING Element 1.1 requires the inclusion of an hmlid.
hmlid can be reported in the form of an ISO Object Identifier (OID)
"root" represents a unique publically registered organization
"extension" is a unique document id managed by the reporting organization.
-->
<hmlid root="2.34.48.32" extension="HML.3245662"/>
<!--
MIRING Element 1.2 requires the inclusion of a reporting-center.
reporting-center identifies the organization sending the HML message.
"reporting-center-id" is a unique identifier of the sender.
"reporting-center-context" reports the context/naming authority of the identifier.
-->
<reporting-center reporting-center-id="567"/>
<sample id="4555-6677-8">
<typing gene-family="HLA" date="2015-01-13">
<!--
MIRING Element 3 requires the inclusion of Genotyping information.
The Genotype should include all pertinent Loci, as well as a Genotype in a standard format.
GLStrings can be included either as plain text, or as a reference to a publicly
available service, such as GL Service (gl.nmdp.org)
-->
<allele-assignment date="2015-07-28" allele-db="IMGT/HLA" allele-version="3.17.0">
<haploid locus="HLA-A" method="DNA" type="02:20:01"/>
<glstring>
HLA-A*02:20:01
</glstring>
</allele-assignment>
<typing-method>
<!--
MIRING Element 6 requires platform documentation. This could be a peer-reviewed publication,
or an identifier of a procedure on a publicly available resource, such as NCBI GTR
-->
<sbt-ngs locus="HLA-A"
test-id="HLA-A.Test.1234"
test-id-source="AcmeGenLabs">
<raw-reads uri="rawreads/read1.fastq.gz"
availability="public"
format="fastq"
paired="1"
pooled="1"
adapter-trimmed="1"
quality-trimmed="0"/>
</sbt-ngs>
</typing-method>
<consensus-sequence date="2015-01-13">
<!--
MIRING Element 2 requires the inclusion of Reference Context.
The location and identifiers of the reference sequence should be specified.
start and end attributes are 0-based, and refer to positions on the reference sequence.
-->
<reference-database availability="public" curated="true">
<reference-sequence
name="HLA-A reference"
id="Ref111"
start="945000"
end="946000"
accession="GL000123.4"
uri="http://AcmeGenReference/RefDB/GL000123.4"/>
</reference-database>
<!--
MIRING Element 4 requires the inclusion of a consensus sequence.
The start and end positions are 0-based, and refer to positions on the reference sequence (reference-sequence-id)
Multiple consensus-sequence-block elements can be included sequentially.
-->
<consensus-sequence-block reference-sequence-id="Ref111"
start="945532"
end="945832"
strand="+"
phase-set="1"
expected-copy-number="1"
continuity="true"
description="HLA-A Consensus Sequence 4.5.67">
<!--
A sequence can be reported as plain text, or as a pointer to an external reference,
or as variants from a reference sequence.
-->
<sequence>
CCCAGTTCTCACTCCCATTGGGTGTCGGGTTTCCAGAGAAGCCAATCAGTGTCGTCGCGGTCGCTGTTCTAAAGCCCGCACGCACCCACCGGGACTCAGATTCTCCCCAGACGCCGAGGATGGCCGTCATGGCGCCCCGAACCCTCCTCCTGCTACTCTCGGGGGCCCTGGCCCTGACCCAGACCTGGGCGGGTGAGTGCGGGGTCGGGAGGGAAACCGCCTCTGCGGGGAGAAGCAAGGGGCCCTCCTGGCGGGGGCGCAGGACCGGGGGAGCCGCGCCGGGACGAGGGTCGGGCAGGT
</sequence>
<!--
MIRING Element 5 requires the inclusion of any relevant sequence polymorphisms.
These represent variants from the reference sequence.
start and end attributes are 0-based, and refer to positions on the reference sequence.
You can see this variant at positions 10 - 15 on the sequence. (945542 - 945532 = 10)
-->
<variant id="0"
reference-bases="GTCATG"
alternate-bases="ACTCCC"
start="945542"
end="945548"
filter="pass"
quality-score="95">
<!--
The functional effects of variants can be reported using variant-effect.
They should use Sequence Ontology (SO) variant effect terms.
-->
<variant-effect term="missense_variant"/>
</variant>
</consensus-sequence-block>
</consensus-sequence>
</typing>
</sample>
<!--
Multiple samples can be included in a single message.
Each sample should have it's own reference-database(s) even if they are identical to other samples' references.
-->
<sample id="4555-6677-9">
<typing gene-family="HLA" date="2015-01-13">
<allele-assignment date="2015-07-28" allele-db="IMGT/HLA" allele-version="3.17.0">
<haploid locus="HLA-A" method="DNA" type="02:20:01"/>
<glstring>
HLA-A*02:01:01:01
</glstring>
</allele-assignment>
<typing-method>
<sbt-ngs locus="HLA-A"
test-id="HLA-A.Test.1234"
test-id-source="AcmeGenLabs">
<raw-reads uri="rawreads/read2.fastq.gz"
availability="public"
format="fastq"
paired="1"
pooled="1"
adapter-trimmed="1"
quality-trimmed="0"/>
</sbt-ngs>
</typing-method>
<consensus-sequence date="2015-01-13">
<reference-database availability="public" curated="true">
<reference-sequence
name="HLA-A reference"
id="Ref112"
start="945000"
end="946000"
accession="GL000123.4"
uri="http://AcmeGenReference/RefDB/GL000123.4"/>
</reference-database>
<consensus-sequence-block
reference-sequence-id="Ref112"
start="945532"
end="945832"
strand="+"
phase-set="1"
expected-copy-number="1"
continuity="true"
description="HLA-A Consensus Sequence 4.5.89">
<sequence>
CCCAGTTCTCGTCATGATTGGGTGTCGGGTTTCCAGAGAAGCCAATCAGTGTCGTCGCGGTCGCTGTTCTAAAGCCCGCACGCACCCACCGGGACTCAGATTCTCCCCAGACGCCGAGGATGGCCGTCATGGCGCCCCGAACCCTCCTCCTGCTACTCTCGGGGGCCCTGGCCCTGACCCAGACCTGGGCGGGTGAGTGCGGGGTCGGGAGGGAAACCGCCTCTGCGGGGAGAAGCAAGGGGCCCTCCTGGCGGGGGCGCAGGACCGGGGGAGCCGCGCCGGGACGAGGGTCGGGCAGGT
</sequence>
</consensus-sequence-block>
</consensus-sequence>
</typing>
</sample>
</hml>
This is the sample given for the validator, so I know it is valid. However, when I pass it through my RESTful POST code:
@POST
@Path("/Validate")
@Produces("application/xml")
public String validate(@FormParam("xml") String xml)
{
    System.out.println(xml);
    try {
        Client client = Client.create();
        WebResource webResource = client.resource("http://miring.b12x.org/validator/ValidateMiring/");
        // POST method
        ClientResponse response = webResource.accept("application/xml").post(ClientResponse.class, "xml=" + xml);
        // check response status code
        if (response.getStatus() != 200) {
            throw new RuntimeException("Failed : HTTP error code : " + response.getStatus());
        }
        // display response
        String output = response.getEntity(String.class);
        System.out.println("Output from Server .... ");
        System.out.println(output + "\n");
        return output;
    } catch (Exception e) {
        e.printStackTrace();
    }
    return "Oops";
}
Everything passes through perfectly fine except for strand="+", which for some reason drops the + and produces the error message The value '' of attribute 'strand' on element 'consensus-sequence-block' is not valid with respect to its...
I tried all of strand's enumerated values (+, -, -1, 1), and all of them work except +.
Using the web UI (miring.b12x.org) it works perfectly.
Is there something about parsing with SAX that could cause a + to be dropped, or any reason a certain enumeration value would be dropped?
Thank you
EDIT: Here is the output received:
Output from Server ....
<?xml version="1.0" encoding="UTF-8"?>
<miring-report xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
timestamp="07/19/2016 15:07:31"
xsi:noNamespaceSchemaLocation="http://schemas.nmdp.org/spec/miringreport/1.0/miringreport.xsd">
<hml-compliant>reject</hml-compliant>
<miring-compliant>reject</miring-compliant>
<hmlid extension="HML.3245662" root="2.34.48.32"/>
<samples compliant-sample-count="4"
noncompliant-sample-count="0"
sample-count="2">
<sample hml-compliant="true" id="4555-6677-8" miring-compliant="true"/>
<sample hml-compliant="true" id="4555-6677-9" miring-compliant="true"/>
</samples>
<fatal-validation-errors>
<miring-result miring-rule-id="reject" severity="fatal">
<description>[cvc-attribute.3:, The, value, ', ', of, attribute, 'strand', on, element, 'consensus-sequence-block', is, not, valid, with, respect, to, its, type,, 'null'.]</description>
<solution>Verify that your HML file is well formed, and conforms to http://schemas.nmdp.org/spec/hml/1.0.1/hml-1.0.1.xsd</solution>
</miring-result>
<miring-result miring-rule-id="reject" severity="fatal">
<description>[cvc-attribute.3:, The, value, ', ', of, attribute, 'strand', on, element, 'consensus-sequence-block', is, not, valid, with, respect, to, its, type,, 'null'.]</description>
<solution>Verify that your HML file is well formed, and conforms to http://schemas.nmdp.org/spec/hml/1.0.1/hml-1.0.1.xsd</solution>
</miring-result>
<miring-result miring-rule-id="reject" severity="fatal">
<description>[cvc-enumeration-valid:, Value, ', ', is, not, facet-valid, with, respect, to, enumeration, '[-1,, 1,, +,, -]'., It, must, be, a, value, from, the, enumeration.]</description>
<solution>Verify that your HML file is well formed, and conforms to http://schemas.nmdp.org/spec/hml/1.0.1/hml-1.0.1.xsd</solution>
</miring-result>
<miring-result miring-rule-id="reject" severity="fatal">
<description>[cvc-enumeration-valid:, Value, ', ', is, not, facet-valid, with, respect, to, enumeration, '[-1,, 1,, +,, -]'., It, must, be, a, value, from, the, enumeration.]</description>
<solution>Verify that your HML file is well formed, and conforms to http://schemas.nmdp.org/spec/hml/1.0.1/hml-1.0.1.xsd</solution>
</miring-result>
</fatal-validation-errors>
<validation-warnings>
<miring-result miring-rule-id="1.2.b" severity="warning">
<description>The node reporting-center is missing a reporting-center-context attribute.</description>
<solution>Please add a reporting-center-context attribute to the reporting-center node. You can use reporting-center-context to specify the naming authority of the reporting center identifier. Reporting-center-context is not explicitly required.</solution>
<xpath>/hml[1]/reporting-center[1]</xpath>
</miring-result>
</validation-warnings>
</miring-report>
You don’t set the type of your WebResource, and I don’t know what the default Content-Type of the request is, but I suspect it is application/x-www-form-urlencoded, which means + is being treated as a space. If that is the case, changing "xml="+xml to "xml=" + URLEncoder.encode(xml, "UTF-8") may address the problem.
The application/x-www-form-urlencoded format is the default format for HTML form submissions, as described in the HTML 4.01 specification. The documentation for the URLEncoder class also describes this format.
In that format, a + character represents a space, so the strand attribute contains a single space. Except, the Attribute-Value Normalization section of the XML 1.0 specification states:
If the attribute type is not CDATA, then the XML processor MUST further process the normalized attribute value by discarding any leading and trailing space (#x20) characters …
So, that single space is then normalized into the empty string (when all leading and trailing space is removed). The empty string, strand='', does not conform to the XML schema you are referencing, http://schemas.nmdp.org/spec/hml/1.0.1/hml-1.0.1.xsd .
URLEncoder.encode escapes all “reserved” characters, including +, as percent-escapes, and then escapes spaces as +. The server expects this format (almost certainly because a Content-Type: application/x-www-form-urlencoded header is present in the HTTP request), and decodes the + and percent-escapes back to the original XML.
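The round trip can be demonstrated with the JDK alone. The sketch below uses a shortened stand-in for the HML payload; decoding the raw string simulates what the server does to an unencoded form body:

```java
import java.net.URLDecoder;
import java.net.URLEncoder;

public class FormEncodingDemo {
    public static void main(String[] args) throws Exception {
        String xmlFragment = "<consensus-sequence-block strand=\"+\"/>";

        // Without encoding, the raw '+' travels as-is in the form body and the
        // server-side form decode turns it into a space. XML attribute-value
        // normalization then strips that space, leaving strand=''.
        String decodedRaw = URLDecoder.decode(xmlFragment, "UTF-8");
        System.out.println(decodedRaw); // the '+' has become a space

        // Encoding first preserves the '+': it is escaped as %2B (and real
        // spaces become '+'), so the server-side decode restores the original.
        String encoded = URLEncoder.encode(xmlFragment, "UTF-8");
        String roundTripped = URLDecoder.decode(encoded, "UTF-8");
        System.out.println(roundTripped.equals(xmlFragment)); // true
    }
}
```

So in the posting code above, "xml=" + URLEncoder.encode(xml, "UTF-8") is the one-line fix.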

Get historic prices by ISIN from yahoo finance

I have the following problem:
I have around 1,000 unique ISIN numbers of stock-exchange-listed companies.
I need the historic prices of these companies, from the earliest listing until today, on a daily basis.
However, as far as my research goes, Yahoo can only provide prices for stock ticker symbols, which I do not have.
Is there a way to get, for example for ISIN AT0000609664 (the company Porr), the historic prices from Yahoo automatically via their API?
I appreciate your replies!
The Answer:
To get the Yahoo ticker symbol from an ISIN, take a look at the yahoo.finance.isin table, here is an example query:
http://query.yahooapis.com:80/v1/public/yql?q=select * from yahoo.finance.isin where symbol in ("DE000A1EWWW0")&env=store://datatables.org/alltableswithkeys
This returns the ticker ADS.DE inside an XML:
<query yahoo:count="1" yahoo:created="2015-09-21T12:18:01Z" yahoo:lang="en-US">
<results>
<stock symbol="DE000A1EWWW0">
<Isin>ADS.DE</Isin>
</stock>
</results>
</query>
<!-- total: 223 -->
<!-- pprd1-node600-lh3.manhattan.bf1.yahoo.com -->
I am afraid your example ISIN won't work, but that's an error on Yahoo's side (see Yahoo Symbol Lookup; type your ISINs in there to check whether the ticker exists on Yahoo).
The Implementation:
Sorry, I am not proficient in Java or R anymore, but this C# code should be almost similar enough to copy/paste:
public String GetYahooSymbol(string isin)
{
    string query = GetQuery(isin);
    XDocument result = GetHttpResult(query);
    XElement stock = result.Root.Element("results").Element("stock");
    return stock.Element("Isin").Value.ToString();
}
where GetQuery(string isin) returns the URI for the query to yahoo (see my example URI) and GetHttpResult(string URI) fetches the XML from the web. Then you have to extract the contents of the Isin node and you're done.
I assume you have already implemented the actual data fetch using ticker symbols.
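Since the question asks about Java, here is a rough Java translation of the C# sketch, using only the JDK's built-in DOM API. The helper for fetching the XML over HTTP is omitted; the class name and the simplified sample response are mine, not from the Yahoo docs:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class YahooSymbolLookup {
    // Extracts the ticker from the <Isin> node of a yahoo.finance.isin response.
    public static String getYahooSymbol(String responseXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(responseXml.getBytes(StandardCharsets.UTF_8)));
        Element results = (Element) doc.getDocumentElement()
                .getElementsByTagName("results").item(0);
        Element stock = (Element) results.getElementsByTagName("stock").item(0);
        return stock.getElementsByTagName("Isin").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Simplified version of the example response above.
        String xml = "<query><results><stock symbol=\"DE000A1EWWW0\">"
                   + "<Isin>ADS.DE</Isin></stock></results></query>";
        System.out.println(getYahooSymbol(xml)); // ADS.DE
    }
}
```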
Also see this question for the inverse problem (symbol -> isin). But for the record:
Query to fetch historical data for a symbol
http://query.yahooapis.com:80/v1/public/yql?q=select * from yahoo.finance.historicaldata where symbol in ("ADS.DE") and startDate = "2015-06-14" and endDate = "2015-09-22"&env=store://datatables.org/alltableswithkeys
where you may pass arbitrary dates and an arbitrary list of ticker symbols. It's up to you to build the query in your code and to pull the results from the XML you get back. The response will be along the lines of
<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="71" yahoo:created="2015-09-22T20:00:39Z" yahoo:lang="en-US">
<results>
<quote Symbol="ADS.DE">
<Date>2015-09-21</Date>
<Open>69.94</Open>
<High>71.21</High>
<Low>69.65</Low>
<Close>70.79</Close>
<Volume>973600</Volume>
<Adj_Close>70.79</Adj_Close>
</quote>
<quote Symbol="ADS.DE">
<Date>2015-09-18</Date>
<Open>70.00</Open>
<High>71.43</High>
<Low>69.62</Low>
<Close>70.17</Close>
<Volume>3300200</Volume>
<Adj_Close>70.17</Adj_Close>
</quote>
......
</results>
</query>
<!-- total: 621 -->
<!-- pprd1-node591-lh3.manhattan.bf1.yahoo.com -->
This should get you far enough to write your own code. Note that it is possible to get the data in .csv format by adding &e=.csv to the end of the query, but I don't know much about that or whether it will work for the queries above, so see here for reference.
I found a web service that provides historic data for a date range. Please have a look:
http://splice.xignite.com/services/Xignite/XigniteHistorical/GetHistoricalQuotesRange.aspx

Why do my AppEngine queries throw DatastoreNeedIndexExceptions although the required indices are 'serving'?

I am running a multi-tenant java high-replication web application on Google AppEngine. The application successfully uses multi-property indices (configured within the datastore-indexes.xml file). Well, at least up until now...
Since today, there is at least one namespace that throws a DatastoreNeedIndexException when executing a query. The curious thing is that the same query works in other namespaces.
Here is the index configuration from the datastore-indexes.xml and the index status from the admin panel:
<?xml version="1.0" encoding="utf-8"?>
<datastore-indexes autoGenerate="false">
<datastore-index kind="Index_Asc_Asc_Asc_Asc" ancestor="false" source="manual">
<property name="components" direction="asc"/>
<property name="component_0" direction="asc"/>
<property name="component_1" direction="asc"/>
<property name="component_2" direction="asc"/>
<property name="component_3" direction="asc"/>
</datastore-index>
</datastore-indexes>
The corresponding query looks like this:
SELECT __key__ FROM Index_Asc_Asc_Asc_Asc WHERE components = '12340987hja' AND component_0 = 'asdfeawsefad' AND component_1 = '4FnlnSYiJuo25mNU' AND component_3 = 'cvxyvsdfsa' AND component_2 >= 0
When I execute this query within my application or the admin panel datastore view App Engine throws a DatastoreNeedIndexException with the following recommendation. Again, the same query works in other namespaces:
The suggested index for this query is:
<datastore-index kind="Index_Asc_Asc_Asc_Asc" ancestor="false">
<property name="component_0" direction="asc" />
<property name="component_1" direction="asc" />
<property name="component_3" direction="asc" />
<property name="components" direction="asc" />
<property name="component_2" direction="asc" />
</datastore-index>
Investigations:
I have tried to set autoGenerate="true", but I do get the same error and no new indexes have been added.
I have tried to execute the query in newly created namespaces: No problems.
The error does not occur in the development server.
Is there something I am missing? Has anyone else experienced the same problem? Why is the same query working in other namespaces but not in that one?
Thanks a lot!
Tim is right. To help clarify the point, you need to understand how Datastore works.
Basically, all Datastore reads need to be sequential in the index you are looking at; in other words, they need to be in adjacent rows. This is how Datastore gains speed and how it can be sharded across multiple machines. (There are some exceptions for equality matching, but just accept that smart people figured that out for us for now.)
So looking at a set of data with a num column, an alpha column and an id column like the following:
id      Num  Alpha
------------------
1       1    A
2       1    Z
3       4    A
...     ...  ...   <-- lots of data
100004  2    Z
100005  1    C
So when Datastore encounters a query like yours, it looks at the precomputed index and finds the starting point of the matches. It then reads until the rows no longer match the query. It never does a join like you are used to in SQL. The closest thing is a zipper merge, which only applies to equality operators. ROWS MUST BE ADJACENT IN THE INDEX!
So index num asc, alpha asc looks like:
id      Num  Alpha
------------------
...     ...  ...   <-- negative numbers
1       1    A
100005  1    C
2       1    Z
100004  2    Z
3       4    A
...     ...  ...   <-- lots of data (assume all other num values were above 5)
and index alpha asc, num asc looks like:
id      Alpha  Num
------------------
1       A      1
3       A      4
...     ...    ...   <-- lots of data
100005  C      1
...     ...    ...   <-- lots of data
2       Z      1
100004  Z      2
...     ...    ...   <-- lots of data
This allows Datastore to zip through your data and return an answer very fast. It can then use the id to look up the rest of that row's data.
If, for example, you tried to look at all of the num=1 rows and wanted the alphas in sorted order, Datastore would have to read all of the num=1 rows into memory (which could be hundreds of millions of rows) and then sort them by alpha. With the precomputed index it is all much faster, which allows far more read throughput. It's probably overkill for your application, but the idea is that your app can scale to huge sizes this way.
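That scan over a precomputed index can be illustrated with a toy in-memory model in Java (nothing Datastore-specific; the Row type and sample values are made up to mirror the tables above):

```java
import java.util.*;

public class IndexScanDemo {
    record Row(int id, int num, String alpha) {}

    // A query like "num = 1 ORDER BY alpha" over the (num asc, alpha asc)
    // index: seek to the first match, read adjacent rows, and stop at the
    // first non-match. No sorting happens at query time.
    static List<Integer> scanNumEquals(List<Row> sortedIndex, int target) {
        List<Integer> ids = new ArrayList<>();
        for (Row r : sortedIndex) {
            if (r.num() < target) continue; // still before the starting point
            if (r.num() > target) break;    // rows no longer match: stop reading
            ids.add(r.id());                // adjacent matches, already ordered by alpha
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Row> index = new ArrayList<>(List.of(
            new Row(1, 1, "A"), new Row(2, 1, "Z"), new Row(3, 4, "A"),
            new Row(100004, 2, "Z"), new Row(100005, 1, "C")));
        // Precomputed index: rows sorted by (num asc, alpha asc).
        index.sort(Comparator.comparingInt(Row::num).thenComparing(Row::alpha));
        System.out.println(scanNumEquals(index, 1)); // [1, 100005, 2]
    }
}
```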
Hope that made sense.

Adding entities to solr using solrj and schema.xml

I would like to add entities to documents, like you can do with the data-config. At the moment I'm indexing every page of my documents as a separate document.
Now:
<solrDoc>
  <id>1</id>
  <docname>test.pdf</docname>
  <pagenumber>1</pagenumber>
  <pagecontent>blablabla</pagecontent>
</solrDoc>
<solrDoc>
  <id>2</id>
  <docname>test.pdf</docname>
  <pagenumber>2</pagenumber>
  <pagecontent>blablabla</pagecontent>
</solrDoc>
As you can see, the data related to the document is stored once per page. I would like to get documents like this instead:
<doc>
  <id>1</id>
  <docname>test.pdf</docname>
  <pageEntries> <!-- multivalued field -->
    <pageEntry><pagenumber>1</pagenumber><pagecontent>blablabla</pagecontent></pageEntry>
    <pageEntry><pagenumber>2</pagenumber><pagecontent>blablabla</pagecontent></pageEntry>
  </pageEntries>
</doc>
I don't know how to make something like pageEntry. I saw that Solr can import entities from databases, but I'm wondering how I can do the same (or something similar).
I'm using Solr 3.6.1. I do the page extraction myself using PDFBox.
Java code:
SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.setField("id", 1);
solrDoc.setField("filename", "test");
for (int p : pages) {
    solrDoc.addField("page", p);
}
for (String pc : pagecont) {
    solrDoc.addField("pagecont", pc);
}
If the extraction is performed by you, you can combine all the pages and feed them as a single Solr document, with pagenumber and pagecontent being multivalued fields.
You can use the same id for all the pages (with the id not being a primary field in the schema definition) and use Grouping (Field Collapsing) to group the results by document.
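The "combine all pages into one document" step can be sketched in plain Java before any SolrJ call. The Page type is a hypothetical stand-in for whatever PDFBox extraction produces; each resulting group would become one SolrInputDocument with multivalued page fields:

```java
import java.util.*;

public class PageGrouping {
    record Page(String docname, int number, String content) {}

    // Collapse per-page records into one entry per document, mirroring a
    // single Solr document whose page fields are multivalued.
    static Map<String, List<Page>> groupByDocument(List<Page> pages) {
        Map<String, List<Page>> docs = new LinkedHashMap<>();
        for (Page p : pages) {
            docs.computeIfAbsent(p.docname(), k -> new ArrayList<>()).add(p);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<Page> pages = List.of(
            new Page("test.pdf", 1, "blablabla"),
            new Page("test.pdf", 2, "blablabla"));
        Map<String, List<Page>> docs = groupByDocument(pages);
        // One document, two multivalued page entries, instead of two documents.
        System.out.println(docs.get("test.pdf").size()); // 2
    }
}
```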
