If I create documents like this:
{
    Document document = new Document();
    document.add(new TextField("id", "10384-10735", Field.Store.YES));
    submitDocument(document);
}
{
    Document document = new Document();
    document.add(new TextField("id", "10735", Field.Store.YES));
    submitDocument(document);
}
for (int i = 20000; i < 80000; i += 123) {
    Document otherDoc1 = new Document();
    otherDoc1.add(new TextField("id", String.valueOf(i), Field.Store.YES));
    submitDocument(otherDoc1);

    Document otherDoc2 = new Document();
    otherDoc2.add(new TextField("id", i + "-" + (i + 67), Field.Store.YES));
    submitDocument(otherDoc2);
}
meaning:
one with an id of 10384-10735
one with an id of 10735 (which is the last part of the previous document's ID)
and 976 other documents with pretty much any ID
and then write them using:
final IndexWriterConfig luceneWriterConfig = new IndexWriterConfig(new StandardAnalyzer());
luceneWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
final IndexWriter luceneDocumentWriter = new IndexWriter(luceneDirectory, luceneWriterConfig);

for (Map.Entry<String, Document> indexDocument : indexDocuments.entrySet()) {
    final Term term = new Term(Index.UNIQUE_LUCENE_DOCUMENT_ID, indexDocument.getKey());
    indexDocument.getValue().add(new TextField(Index.UNIQUE_LUCENE_DOCUMENT_ID, indexDocument.getKey(), Field.Store.YES));
    luceneDocumentWriter.updateDocument(term, indexDocument.getValue());
}
luceneDocumentWriter.close();
Now that the index is written, I want to perform a query, searching for the document with the ID 10384-10735.
I will be doing this in two ways, using the TermQuery and a QueryParser with the StandardAnalyzer:
System.out.println("term query: " + index.findDocuments(new TermQuery(new Term("id", "10384-10735"))));
final QueryParser parser = new QueryParser(Index.UNIQUE_LUCENE_DOCUMENT_ID, new StandardAnalyzer());
System.out.println("query parser: " + index.findDocuments(parser.parse("id:\"10384 10735\"")));
In both cases, I would expect the document to appear. This is the result if I run the queries however:
term query: []
query parser: []
which seems odd. I experimented a bit further and found that if I either reduce the number of documents OR remove the entry 10735, the query parser query successfully finds the document:
term query: []
query parser: [Document<stored,indexed,tokenized<id:10384-10735> stored,indexed,tokenized<uldid:10384-10735>>]
meaning this works:
{
    Document document = new Document();
    document.add(new TextField("id", "10384-10735", Field.Store.YES));
    submitDocument(document);
}
for (int i = 20000; i < 80000; i += 123) {
    Document otherDoc1 = new Document();
    otherDoc1.add(new TextField("id", String.valueOf(i), Field.Store.YES));
    submitDocument(otherDoc1);

    Document otherDoc2 = new Document();
    otherDoc2.add(new TextField("id", i + "-" + (i + 67), Field.Store.YES));
    submitDocument(otherDoc2);
}
and this works (490 documents):
{
    Document document = new Document();
    document.add(new TextField("id", "10384-10735", Field.Store.YES));
    submitDocument(document);
}
{
    Document document = new Document();
    document.add(new TextField("id", "10735", Field.Store.YES));
    submitDocument(document);
}
for (int i = 20000; i < 50000; i += 123) {
    Document otherDoc1 = new Document();
    otherDoc1.add(new TextField("id", String.valueOf(i), Field.Store.YES));
    submitDocument(otherDoc1);

    Document otherDoc2 = new Document();
    otherDoc2.add(new TextField("id", i + "-" + (i + 67), Field.Store.YES));
    submitDocument(otherDoc2);
}
Does somebody know what causes this? I really need the index to consistently find the documents. I'm fine with using the QueryParser rather than the TermQuery.
I am using lucene-core and lucene-queryparser 9.3.0.
Thank you for your help in advance.
Edit 1: This is the code in findDocuments():
final TopDocs topDocs = getIndexSearcher().search(query, Integer.MAX_VALUE);
final List<Document> documents = new ArrayList<>((int) topDocs.totalHits.value);
for (int i = 0; i < topDocs.totalHits.value; i++) {
    documents.add(getIndexSearcher().doc(topDocs.scoreDocs[i].doc));
}
return documents;
Edit 2: here is a working example: https://pastebin.com/Ft0r8pN5
for some reason, the issue with too many documents does not happen in this one, which I will look into. I have still left it in for the example. This is my output:
[similar id: true, many documents: true]
Indexing [3092] documents
term query: []
query parser: []
[similar id: true, many documents: false]
Indexing [654] documents
term query: []
query parser: []
[similar id: false, many documents: true]
Indexing [3091] documents
term query: []
query parser: [Document<stored,indexed,tokenized<id:10384-10735> stored,indexed,tokenized<uldid:10384-10735>>]
[similar id: false, many documents: false]
Indexing [653] documents
term query: []
query parser: [Document<stored,indexed,tokenized<id:10384-10735> stored,indexed,tokenized<uldid:10384-10735>>]
As you can see, if the document with the ID 10735 is added to the documents, the document cannot be found anymore.
Summary
The problem is caused by a combination of (a) the order in which your documents are processed; and (b) the fact that updateDocument first deletes and then inserts data in the index.
When you use writer.updateDocument(term, document), Lucene performs an atomic delete-then-add:
Updates a document by first deleting the document(s) containing term and then adding the new document.
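In other words, a conceptual sketch of what a single call does (ignoring the atomicity guarantee):

// conceptually, updateDocument(term, document) behaves like:
writer.deleteDocuments(term); // deletes EVERY document whose field contains the term
writer.addDocument(document); // then adds the new document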
In your case, the order in which documents are processed is based on how they are retrieved from your Java Map - and that is based on how the entries are hashed by the map.
As you note in your answer, you already have a way to avoid this by using your Java object hashes as the updateDocument terms. (As long as you don't get any hash collisions.)
This answer attempts to explain the "why" behind the results you are seeing.
Basic Demonstration
This is a highly simplified version of your code.
Consider the following two Lucene documents:
final Document documentA = new Document();
documentA.add(new TextField(FIELD_NAME, "10735", Field.Store.YES));
final Term termA = new Term(FIELD_NAME, "10735");
writer.updateDocument(termA, documentA);
final Document documentB = new Document();
documentB.add(new TextField(FIELD_NAME, "10384-10735", Field.Store.YES));
final Term termB = new Term(FIELD_NAME, "10384-10735");
writer.updateDocument(termB, documentB);
documentA then documentB:
Lucene has nothing to delete when documentA is added. After the doc is added, the index contains the following:
field id
  term 10735
    doc 0
      freq 1
      pos 0
So, we have only one token 10735.
For documentB, there are no documents in the index containing the term 10384-10735 - and therefore nothing is deleted prior to documentB being added to the index.
We end up with the following final indexed data:
field id
  term 10384
    doc 1
      freq 1
      pos 0
  term 10735
    doc 0
      freq 1
      pos 0
    doc 1
      freq 1
      pos 1
When we search for 10384, we get one hit, as expected.
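For reference, a minimal sketch of the verification search used here (it assumes the index was built into a Directory named dir, as in the code further below):

// open a searcher on the index and look for the first part of the ID
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
TopDocs topDocs = searcher.search(new TermQuery(new Term(FIELD_NAME, "10384")), 10);
System.out.println(topDocs.totalHits); // one hit with this indexing order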
documentB then documentA:
If we swap the order in which the 2 documents are processed, we see the following after documentB is indexed:
field id
  term 10384
    doc 0
      freq 1
      pos 0
  term 10735
    doc 0
      freq 1
      pos 1
When documentA is indexed, Lucene finds that doc 0 (above) in the index does contain the term 10735 used by documentA. Therefore all of the doc 0 entries are deleted from the index, before documentA is added.
We end up with the following indexed data (basically, a new doc 0, after the original doc 0 was deleted):
field id
  term 10735
    doc 0
      freq 1
      pos 0
Now when we search for 10384, we get zero hits - not what we expected.
More Complicated Demonstration
Things are made more complicated in your scenario in the question by your use of a Java Map to collect the documents to be indexed. This causes the order in which your Lucene documents are indexed to be different from the order in which they are created, due to hashing performed by the map.
Here is another simplified version of your code, but this time it uses a map:
public class MyIndexBuilder {

    private static final String INDEX_PATH = "index";
    private static final String FIELD_NAME = "id";
    private static final Map<String, Document> indexDocuments = new HashMap<>();

    public static void buildIndex() throws IOException, FileNotFoundException, ParseException {
        final Directory dir = FSDirectory.open(Paths.get(INDEX_PATH));
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        iwc.setOpenMode(OpenMode.CREATE);
        //iwc.setCodec(new SimpleTextCodec());
        try (IndexWriter writer = new IndexWriter(dir, iwc)) {
            String suffix = "10429";

            Document document1 = new Document();
            document1.add(new TextField("id", "10001-" + suffix, Field.Store.YES));
            indexDocuments.put("10001-" + suffix, document1);

            Document document2 = new Document();
            document2.add(new TextField("id", suffix, Field.Store.YES));
            indexDocuments.put(suffix, document2);

            int max = 10193; // OK
            //int max = 10192; // not OK
            for (int i = 10003; i <= max; i += 1) {
                Document otherDoc1 = new Document();
                otherDoc1.add(new TextField(FIELD_NAME, String.valueOf(i), Field.Store.YES));
                indexDocuments.put(String.valueOf(i), otherDoc1);
            }

            System.out.println("Total docs: " + indexDocuments.size());
            for (Map.Entry<String, Document> indexDocument : indexDocuments.entrySet()) {
                if (indexDocument.getKey().contains(suffix)) {
                    // show the order in which document1 and document2 are indexed:
                    System.out.println(indexDocument.getKey());
                }
                final Term term = new Term(FIELD_NAME, indexDocument.getKey());
                writer.updateDocument(term, indexDocument.getValue());
            }
        }
    }
}
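(Aside: the commented-out SimpleTextCodec line above is one way to produce the human-readable per-term index listings shown earlier; the codec lives in the separate lucene-codecs module and is meant for debugging, not production use.)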
In addition to the two documents we are interested in, I add 191 additional (completely unrelated) documents to the index.
When I process the map, I see the following output:
Total docs: 193
10429
10001-10429
So, document2 is indexed before document1 - and our search for 10001 finds one hit.
But if I process fewer of these "extra" documents (190 instead of 191):
int max = 10192; // not OK
...then I get this output:
Total docs: 192
10001-10429
10429
You can see that the order in which document1 and document2 are processed has been flipped - and now that same search for 10001 finds zero hits.
A seemingly unrelated change (processing one fewer document) has caused the retrieval order from the map to change, causing the indexed data to be different.
(I was incorrect in one of my comments in the question, when I noted that the indexed data was apparently identical. It is not the same. I missed that when I was first looking at the indexed data.)
Recommendation
Consider adding a new field to your Lucene documents, for storing each document's unique identifier.
You could call it doc_id and it would be created as a StringField, not as a TextField.
This would ensure that the contents of this field are never processed by the Standard Analyzer and are stored in the index as a single (presumably unique) token. A StringField is indexed but not tokenized.
You can then use this field when building your term to use in the updateDocument() method. And you can use the existing id field for searches.
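A minimal sketch of that recommendation (the doc_id field name is illustrative, not from your code):

final Document document = new Document();
// analyzed field, used for searching:
document.add(new TextField("id", "10384-10735", Field.Store.YES));
// untokenized identifier field, indexed as one exact token:
document.add(new StringField("doc_id", "10384-10735", Field.Store.YES));
// the update term targets the untokenized field, so it can only ever
// match this document's own identifier:
writer.updateDocument(new Term("doc_id", "10384-10735"), document);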
At first glance, a possible solution for this would be:
The updateDocument() method with a term passed as the first parameter is currently used to build the index. When I either pass null as the term or use the addDocument() method instead, the query successfully returns the correct values. So the problem must have something to do with the Term.
luceneDocumentWriter.addDocument(indexDocument.getFields());
// or
luceneDocumentWriter.updateDocument(null, indexDocument);
Playing around a bit further: the field name of the term that the document is stored under cannot be used as a field name inside the document again; otherwise, the document becomes unsearchable:
final Term term = new Term("uldid", indexDocument.get("id"));
// would work, different key from term...
indexDocument.add(new TextField("uldid2", indexDocument.get("id"), Field.Store.YES));
// would not work...
indexDocument.add(new TextField("uldid", indexDocument.get("id"), Field.Store.YES));
// ...when adding to index using term
luceneDocumentWriter.updateDocument(term, indexDocument);
Another way to circumvent this is to use a different value in the term and the identically named document field (uldid in this case), one that is also different from the ID being searched for in the index:
final Term term = new Term("uldid", indexDocument.get("id").hashCode() + "");
// or
indexDocument.add(new TextField("uldid", indexDocument.get("id").hashCode() + "", Field.Store.YES));
This seems rather odd. I don't really have a final solution or a reason why it behaves this way, but I will be using the second option from now on, using the hash of whatever key I want to store the document under as the Term.
I have the following XML string:
<aa>
<bb>
<cc>
<cmd>
<efg sid="C1D7B70D7AF705731B0" mid="C1D7D7AF705731B0" stid="-1" dopt="3">
<pqr>
<dru fo="1" fps="1" nku="WBECDD6CC37656E6C9" tt="1"/>
<dpo drpr="67" dpi="16"/>
<dres >
<dre dreid="BB:8D679D3511D3E4981000E787EC6DE8A4:1:1:0:2:1" fa="1" dpt= "1" o="0"/>
</dres>
</pqr>
</efg>
</cmd>
</cc>
</bb>
</aa>
I need to add a "login" attribute inside the <efg> tag, so the new XML would be:
<aa>
<bb>
<cc>
<cmd>
<efg sid="C1D7B70D7AF705731B0" login="sdf34234dfs" mid="C1D7D7AF705731B0" stid="-1" dopt="3">
<pqr>
<dru fo="1" fps="1" nku="WBECDD6CC37656E6C9" tt="1"/>
<dpo drpr="67" dpi="16"/>
<dres >
<dre dreid="BB:8D679D3511D3E4981000E787EC6DE8A4:1:1:0:2:1" fa="1" dpt= "1" o="0"/>
</dres>
</pqr>
</efg>
</cmd>
</cc>
</bb>
</aa>
The conditions are:
I can only use the built-in Java API (Java 8), a SAX parser, or xmlbuilder.
The add condition is based on the parent tag: I need to check for <cmd> and then add the login attribute to its child, because it is not guaranteed that the <efg> tag will always be there with the same name; it could have any name.
I have tried the DOM parser with the following code.
String xml = "xmlString";
//Use method to convert XML string content to XML Document object
Document doc = convertStringToXML( xml );
doc.getDocumentElement().normalize();
Node m = doc.getElementsByTagName("cmd").item(0).getFirstChild();
Attr login = doc.createAttribute("login");
login.setValue("123567");
m.appendChild(login);
However, I am getting following error in my last line of code.
Exception in thread "main" org.w3c.dom.DOMException: HIERARCHY_REQUEST_ERR: An attempt was made to insert a node where it is not permitted.
Can anyone please suggest how to add the new login attribute based on my condition no. 2?
You can find the first element child of <cmd> and set the attribute on it directly:
NodeList nodeList = doc.getElementsByTagName("cmd");
//Check that a <cmd> tag is present and that it has child nodes
if (nodeList.getLength() > 0 && nodeList.item(0).hasChildNodes()) {
    //Get the first *element* child of the <cmd> tag; getFirstChild() alone
    //may return a whitespace text node instead of the element
    Node child = nodeList.item(0).getFirstChild();
    while (child != null && child.getNodeType() != Node.ELEMENT_NODE) {
        child = child.getNextSibling();
    }
    if (child != null) {
        //set the login attribute with its respective value
        ((Element) child).setAttribute("login", "xyz");
        //Convert back into an xml string from the Document
        xml = XMLHelpers.TransformDOMDocumentToString(doc);
    }
}
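Note that XMLHelpers.TransformDOMDocumentToString is not part of the JDK; a minimal equivalent using the built-in javax.xml.transform API could look like this:

import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

static String transformDOMDocumentToString(Document doc) throws Exception {
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    StringWriter writer = new StringWriter();
    transformer.transform(new DOMSource(doc), new StreamResult(writer));
    return writer.toString();
}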
Essentially, I'm creating an XML document from a file (a database), and then I'm comparing another parsed XML file (with updated information) to the original database, then writing the new information into the database.
I'm using Java's org.w3c.dom.
After lots of struggling, I decided to just create a new Document object and write to it from the oldDocument and newDocument ones I'm comparing the elements in.
The XML doc is in the following format:
<Log>
<File name="something.c">
<Warning file="something.c" line="101" column="23"/>
<Warning file="something.c" line="505" column="71" />
</File>
</Log>
as an example.
How would I go about adding a new "warning" Element to the "File" without getting the pesky "org.w3c.dom.DOMException: WRONG_DOCUMENT_ERR: A node is used in a different document than the one that created it." exception?
Cutting it down, I have something similar to:
public static Document update(Element databaseRoot, Element newRoot){
    Document doc = db.newDocument(); // DocumentBuilder defined previously
    Element baseRoot = doc.createElement("Log");

    //for each file i have:
    Element newFileRoot = doc.createElement("File");

    //some for loop that parses through each 'file' and looks at the warnings
    //when i come to a new warning to add to the Document:
    NodeList newWarnings = newFileToCompare.getChildNodes(); //newFileToCompare comes from the newRoot element
    for (int m = 0; m < newWarnings.getLength(); m++){
        if(newWarnings.item(m).getNodeType() == Node.ELEMENT_NODE){
            Element newWarning = (Element)newWarnings.item(m);
            Element newWarningRoot = (Element)newWarning.cloneNode(false);
            newFileRoot.appendChild(doc.importNode(newWarningRoot,true)); // this is what crashes
        }
    }

    // for new files i have this which works:
    newFileRoot = (Element)newFiles.item(i).cloneNode(true);
    baseRoot.appendChild(doc.importNode(newFileRoot,true));
    doc.appendChild(baseRoot);
    return doc;
}
Any ideas? I'm beating my head against the wall. First time doing this.
Going through with a debugger, I checked whether the document owners were correct. Using node.getOwnerDocument(), I realized that newFileRoot was connected to the wrong document from when I created it earlier, so I changed
Element newFileRoot = (Element)pastFileToFind.cloneNode(false);
to
Element newFileRoot = (Element)doc.importNode(pastFileToFind.cloneNode(false),true);
since later on, when I was trying to add the newWarningRoot to newFileRoot, they had different owner documents (newWarningRoot was correct, but newFileRoot was connected to the wrong document).
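As a general pattern (docA, docB, and targetInDocB are placeholder names, not from the code above), any node created by one Document must be imported before it can be attached inside another:

// a node owned by docA cannot be appended into docB directly
Node foreign = docA.getDocumentElement().getFirstChild();
// importNode makes a copy owned by docB (deep = true copies children too)
Node imported = docB.importNode(foreign, true);
targetInDocB.appendChild(imported); // safe: both share the same owner document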
I am trying to include the correct characters in an XML document text node:
Element request = doc.createElement("requestnode");
request.appendChild(doc.createTextNode(xml));
rootElement.appendChild(request);
The xml string is a segment of a large xml file which I have read in:
Document doc = docBuilder.newDocument();
Element rootElement = doc.createElement("rootnode");
doc.appendChild(rootElement);
<firstname>John</firstname>
<dateOfBirth>28091999</dateOfBirth>
<surname>Doe</surname>
The problem is that passing this into createTextNode escapes some of the characters, so the serialized output becomes:
&lt;firstname&gt;John&lt;/firstname&gt;
&lt;dateOfBirth&gt;28091999&lt;/dateOfBirth&gt;
&lt;surname&gt;Doe&lt;/surname&gt;
Is there any way I can keep the correct characters (<, >) in the text node? I have read about using importNode, but this is not well-formed XML, only a segment of a file.
Any help would be greatly appreciated.
EDIT: I need the xml string (which is not fully formed XML, only a segment of an external XML file) to be in the "requestnode" as I am building XML to be imported into SOAP UI.
You can't pass the element tag and text together to the createTextNode() method; you only pass the text. You then need to append this text node to an element.
If the source is another XML document, you must extract the text node from an element and insert it into the other. If you grab a Node (element and text) and try to insert it as a text node in the other, the markup gets escaped; that is why you are seeing all the escape characters.
On the other hand, you can import this Node into the other XML document (if the structure allows it) and it should be just fine.
In your context, I assume "request" is some sort of Node. The child of a Node could be another element, text, etc. You have to be very specific.
You can do something like:
Element name = doc.createElement("name");
Element dob = doc.createElement("dateOfBirth");
Element surname = doc.createElement("surname");
name.appendChild( doc.createTextNode("John") );
dob.appendChild( doc.createTextNode("28091999") );
surname.appendChild( doc.createTextNode("Doe") );
Then you can add these elements to a parent node:
node.appendChild(name);
node.appendChild(dob);
node.appendChild(surname);
UPDATE: As an alternative, you can parse your XML string as a byte stream into a new document and import the resulting nodes. Something like this (untested code, but close):
String xmlString = "<firstname>John</firstname><dateOfBirth>28091999</dateOfBirth><surname>Doe</surname>";
DocumentBuilderFactory fac = javax.xml.parsers.DocumentBuilderFactory.newInstance();
DocumentBuilder builder = fac.newDocumentBuilder();
// the segment has no single root element, so wrap it before parsing
Document newDoc = builder.parse(new ByteArrayInputStream(("<wrapper>" + xmlString + "</wrapper>").getBytes()));
Element newElem = doc.createElement("whatever");
// attach under the existing root rather than adding a second document root
rootElement.appendChild(newElem);
// import each child of the wrapper into the target document
NodeList children = newDoc.getDocumentElement().getChildNodes();
for (int i = 0; i < children.getLength(); i++) {
    newElem.appendChild(doc.importNode(children.item(i), true));
}
Something like that should do the trick.
I am really unsure how I can get the information I need to place into a database; the code below just prints the whole file.
File input = new File("shipMove.txt");
Document doc = Jsoup.parse(input, null);
System.out.println(doc.toString());
My HTML is here from line 61, and I need to get the items under the column headings, but also grab the MMSI number, which is not under a column heading but in the href tag. I haven't used Jsoup other than to get the HTML from the web page, and I can only really find tutorials using PHP, which I'd rather not use.
To get that information, the best way is to use Jsoup's selector API. Using selectors, your code will look something like this (pseudocode!):
File input = new File("shipMove.txt");
Document doc = Jsoup.parse(input, null);
Elements matches = doc.select("<your selector here>");
for( Element element : matches )
{
    // do something with found elements
}
There's good documentation available here: Use selector-syntax to find elements. If you get stuck nevertheless, please describe your problem.
Here are some hints for the selectors you can use:
// Select the table with class 'shipinfo'
Elements tables = doc.select("table.shipinfo");

// Iterate over all tables found (since it's only one, you can use first() instead)
for( Element element : tables )
{
    // Select all 'td' tags of that table
    Elements tdTags = element.select("td");

    // Iterate over all 'td' tags found
    for( Element td : tdTags )
    {
        // Print its text if not empty
        final String text = td.text();
        if( text.isEmpty() == false )
        {
            System.out.println(td.text());
        }
    }
}
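The MMSI from the question lives in a link's href rather than in a cell's text. Assuming the link sits inside that same table (the selector below is a guess based on the description), you could read the attribute like this:

// Select all anchor tags inside the ship info table and read their href values
Elements links = doc.select("table.shipinfo a[href]");
for( Element link : links )
{
    // the href value contains the MMSI; extract it with your own parsing logic
    String href = link.attr("href");
    System.out.println(href);
}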