Accessing custom Lucene attribute from DirectoryReader - java

I added a custom attribute to my Lucene pipeline as described here (in the "Adding a custom Attribute" section).
Now that I have built my index (by adding all the documents via IndexWriter), I want to access this attribute when reading the index directory. How do I do this?
What I'm doing now is the following:
DirectoryReader reader = DirectoryReader.open(index);
TermsEnum iterator = null;
for (int i = 0; i < reader.maxDoc(); i++) {
    Terms terms = reader.getTermVector(i, "content");
    iterator = terms.iterator(iterator);
    AttributeSource attributes = iterator.attributes();
    SentenceAttribute sentence = attributes.addAttribute(SentenceAttribute.class);
    while (true) {
        BytesRef term = iterator.next();
        if (term == null) {
            break;
        }
        System.out.println(term.utf8ToString());
        System.out.println(sentence.getStringSentenceId());
    }
}
It doesn't seem to work: I get the same sentenceId all the time.
I use Lucene 4.9.1.

Finally, I solved it. To do it, I used a PayloadAttribute to store the data I needed.
To store payloads in the index, first set the storeTermVectorPayloads property of the field's FieldType, along with the other term vector settings:
fieldType.setStoreTermVectors(true);
fieldType.setStoreTermVectorOffsets(true);
fieldType.setStoreTermVectorPositions(true);
fieldType.setStoreTermVectorPayloads(true);
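For context, a minimal sketch of how the configured FieldType might then be used when adding documents (the field name "content", the text variable and the IndexWriter writer are assumptions, not part of the original code):

// fieldType is typically created as new FieldType(TextField.TYPE_STORED) before the setters above
Document doc = new Document();
doc.add(new Field("content", text, fieldType));
writer.addDocument(doc);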
Then, for each token during the analysis phase, set the payload attribute:
private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
// in incrementToken()
payloadAtt.setPayload(new BytesRef(String.valueOf(myAttr)));
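For illustration, a hedged sketch of a complete TokenFilter that sets such a payload in incrementToken() (the filter name and the way the sentence id is computed are assumptions, not my original code):

public final class PayloadSettingFilter extends TokenFilter {
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

    protected PayloadSettingFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        int myAttr = currentSentenceId(); // placeholder for your own per-token logic
        payloadAtt.setPayload(new BytesRef(String.valueOf(myAttr)));
        return true;
    }

    private int currentSentenceId() {
        return 0; // placeholder
    }
}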
Then build the index. Finally, it's possible to read the payload back like this:
DocsAndPositionsEnum payloads = null;
TermsEnum iterator = null;
BytesRef ref;
Terms termVector = reader.getTermVector(docId, "field");
iterator = termVector.iterator(iterator);
while ((ref = iterator.next()) != null) {
    payloads = iterator.docsAndPositions(null, payloads, DocsAndPositionsEnum.FLAG_PAYLOADS);
    while (payloads.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        int freq = payloads.freq();
        for (int i = 0; i < freq; i++) {
            payloads.nextPosition();
            BytesRef payload = payloads.getPayload();
            // do something with the payload
        }
    }
}
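Since the payload was written as a UTF-8 string, it can be converted back to the original value, for example:

String sentenceId = payload.utf8ToString();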

Related

Get number of entries in properties file apache commons

I'm creating a list of IP addresses to ping, to which a user can add entries. The list is then saved to a properties file in the form site.name1 = ... site.name2 = ...
Currently I have a for loop with a fixed bound. Is there a way to get the number of entries in the properties file so I can use that as the loop bound rather than waiting for an exception?
PropertiesConfiguration config = configs.properties(new File("IPs.properties"));
// initially check how many values there are - set to max increments for loop
for (int i = 0; i < 3; i++) { // todo fix
    siteName = config.getString("site.name" + i);
    siteAddress = config.getString("site.address" + i);
    SiteList.add(i, siteName);
    IPList.add(i, siteAddress);
}
I've looked through the documentation and other questions but they seem to be unrelated.
It looks to me, based on the documentation, that you should be able to use PropertiesConfiguration#getLayout#getKeys to get a Set of all keys as Strings.
I had to modify the code a bit to use apache-commons-configuration-1.10
PropertiesConfiguration config = new PropertiesConfiguration("ips.properties");
PropertiesConfigurationLayout layout = config.getLayout();
String siteName = null;
String siteAddress = null;
for (String key : layout.getKeys()) {
    String value = config.getString(key);
    if (value == null) {
        throw new IllegalStateException(String.format("No value found for key: %s", key));
    }
    if (key.equals("site.name")) {
        siteName = value;
    } else if (key.equals("site.address")) {
        siteAddress = value;
    } else {
        throw new IllegalStateException(String.format("Unsupported key: %s", key));
    }
}
System.out.println(String.format("name=%s, address=%s", siteName, siteAddress));
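If the goal is simply a bound for the original loop, a hedged alternative is to count the matching keys yourself, reusing the config object from above (the "site.name" prefix comes from the question):

// count keys such as site.name1, site.name2, ... to size the original for loop
int count = 0;
for (Iterator<String> keys = config.getKeys(); keys.hasNext(); ) {
    if (keys.next().startsWith("site.name")) {
        count++;
    }
}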

Exporting SqlRowSet to a file

I want to export the results of an SQL query, fired through JDBC, to a file; and then import that result, at some point later.
I'm currently doing it by querying the database through a NamedParameterJdbcTemplate of Spring which returns a SqlRowSet that I can iterate. In each iteration, I extract desired fields and dump the result into a CSV file, using PrintWriter.
final SqlRowSet rs = namedJdbcTemplate.queryForRowSet(sql, params);
while (rs.next()) {
    // extract the desired fields and write one CSV line via PrintWriter
}
The problem is that when I read the file back, all the values are Strings, and I need to convert them to their proper types, e.g. Integer, String, Date, etc.
while (line != null) {
    String[] csvLine = line.split(";");
    Object[] params = new Object[14];
    params[0] = csvLine[0];
    params[1] = csvLine[1];
    params[2] = Integer.parseInt(csvLine[2]);
    params[3] = csvLine[3];
    params[4] = csvLine[4];
    params[5] = Integer.parseInt(csvLine[5]);
    params[6] = Integer.parseInt(csvLine[6]);
    params[7] = Long.parseLong(csvLine[7]);
    params[8] = formatter.parse(csvLine[8]);
    params[9] = Integer.parseInt(csvLine[9]);
    params[10] = Double.parseDouble(csvLine[10]);
    params[11] = Double.parseDouble(csvLine[11]);
    params[12] = Double.parseDouble(csvLine[12]);
    params[13] = Double.parseDouble(csvLine[13]);
    batchParams.add(params);
    line = reader.readLine();
}
Is there a better way to export this SqlRowSet to a file in order to facilitate the import process later on; some way to store the schema for an easier insertion into the DB?
If parsing is your concern, one way of handling this is:
Create a parser factory interface, say ParserFactory
Create a parser interface, say MyParser
Have MyParser implemented using the Factory Method pattern, i.e. IntegerParser implements MyParser, etc.
Have your factory class names as a header row in your CSV
This way your calling code would look like:
String[] headerRow = reader.readLine().split(";"); // Get the 1st row here
Map<String, MyParser> parsers = new HashMap<>();
for (int i = 0; i < headerRow.length; i++) {
    if (!parsers.containsKey(headerRow[i]))
        parsers.put(headerRow[i], ParserFactory.get(headerRow[i]));
}
line = reader.readLine();
while (line != null) { // From 2nd row onwards
    String[] row = line.split(";");
    Object[] params = new Object[row.length];
    for (int i = 0; i < row.length; i++) {
        params[i] = parsers.get(headerRow[i]).parse(row[i]);
    }
    batchParams.add(params);
    line = reader.readLine();
}
You might like to extract the creation of the parser map into its own method. Or let your ParserFactory take the header row as a parameter and return the respective parsers as a result. Something like:
String[] headerRow = reader.readLine().split(";"); // Get the 1st row here
Map<String, MyParser> parsers = ParserFactory.getParsers(headerRow);
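For completeness, a hedged sketch of what the MyParser and ParserFactory types referred to above might look like (the parser names in the switch and the fallback behaviour are assumptions):

public interface MyParser {
    Object parse(String raw);
}

public final class ParserFactory {
    public static MyParser get(String header) {
        switch (header) {
            case "IntegerParser": return new MyParser() { public Object parse(String raw) { return Integer.parseInt(raw); } };
            case "LongParser":    return new MyParser() { public Object parse(String raw) { return Long.parseLong(raw); } };
            case "DoubleParser":  return new MyParser() { public Object parse(String raw) { return Double.parseDouble(raw); } };
            default:              return new MyParser() { public Object parse(String raw) { return raw; } }; // keep as String
        }
    }

    public static Map<String, MyParser> getParsers(String[] headerRow) {
        Map<String, MyParser> parsers = new HashMap<>();
        for (String header : headerRow) {
            parsers.put(header, get(header));
        }
        return parsers;
    }
}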

OR some AND rules in OWL API?

I don't seem to be able to figure out how to OR (ObjectUnionOf?) a set of AND (ObjectIntersectionOf) rules. What my code produces, when the OWL file is opened in Protégé, is rules like (has_difi_min some double[<= "184.84"^^double]) and (has_mean_ndvi some double[<= "0.3428"^^double]), etc., with lines separating the "rulesets" as shown below in the screenshot.
My OWLAPI code:
/* write rules */
// OWLObjectIntersectionOf intersection = null;
OWLClassExpression firstRuleSet = null;
OWLClass owlCls = null;
OWLObjectUnionOf union = null;
Iterator it = rules.map.entrySet().iterator();
Set<OWLClassExpression> unionSet = new HashSet<OWLClassExpression>();
while (it.hasNext()) {
    Map.Entry pair = (Map.Entry) it.next();
    String currCls = (String) pair.getKey();
    owlCls = factory.getOWLClass(IRI.create("#" + currCls));
    ArrayList<owlRuleSet> currRuleset = (ArrayList<owlRuleSet>) pair.getValue();
    for (int i = 0; i < currRuleset.size(); i++) {
        firstRuleSet = factory.getOWLObjectIntersectionOf(currRuleset.get(i).getRuleList(currCls));
        union = factory.getOWLObjectUnionOf(firstRuleSet);
        manager.addAxiom(ontology, factory.getOWLEquivalentClassesAxiom(owlCls, union));
    }
}
manager.saveOntology(ontology);
This is what it looks like:
I want the lines to be ORs.
edit: Thanks Ignazio!
My OWLAPI code now looks like this:
/* write rules */
OWLClassExpression firstRuleSet = null;
OWLClass owlCls = null;
OWLObjectUnionOf totalUnion = null;
Iterator it = rules.map.entrySet().iterator();
Set<OWLClassExpression> unionSet = new HashSet<OWLClassExpression>();
while (it.hasNext()) {
    Map.Entry pair = (Map.Entry) it.next();
    String currCls = (String) pair.getKey();
    owlCls = factory.getOWLClass(IRI.create("#" + currCls));
    ArrayList<owlRuleSet> currRuleset = (ArrayList<owlRuleSet>) pair.getValue();
    for (int i = 0; i < currRuleset.size(); i++) {
        firstRuleSet = factory.getOWLObjectIntersectionOf(currRuleset.get(i).getRuleList(currCls));
        unionSet.add(firstRuleSet);
    }
    totalUnion = factory.getOWLObjectUnionOf(unionSet);
    unionSet.clear();
    manager.addAxiom(ontology, factory.getOWLEquivalentClassesAxiom(owlCls, totalUnion));
}
manager.saveOntology(ontology);
You are creating unionSet but not using it. Instead of adding an axiom to the ontology, add firstRuleSet to unionSet, then create an equivalent class axiom outside the main loop, just before saving the ontology.
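In isolation, the intended pattern might look like this hedged sketch (factory, manager, ontology, owlCls and the two rulesets are placeholders along the lines of the code above):

// each ruleset becomes one intersection (AND); all intersections go into a single union (OR)
Set<OWLClassExpression> unionSet = new HashSet<OWLClassExpression>();
unionSet.add(factory.getOWLObjectIntersectionOf(ruleSetA)); // AND over ruleset A
unionSet.add(factory.getOWLObjectIntersectionOf(ruleSetB)); // AND over ruleset B
OWLObjectUnionOf union = factory.getOWLObjectUnionOf(unionSet); // OR over the ANDs
manager.addAxiom(ontology, factory.getOWLEquivalentClassesAxiom(owlCls, union));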

BIRT: How to remove a dataset parameter programmatically

I want to modify an existing *.rptdesign file and save it under a new name.
The existing file contains a Data Set with a template SQL select statement and several DS parameters.
I'd like to use an actual SQL select statement which uses only part of the DS parameters.
However, the following code results in the exception:
Exception in thread "main" java.lang.RuntimeException: The structure is floating, and its handle is invalid!
at org.eclipse.birt.report.model.api.StructureHandle.getStringProperty(StructureHandle.java:207)
at org.eclipse.birt.report.model.api.DataSetParameterHandle.getName(DataSetParameterHandle.java:143)
at org.eclipse.birt.report.model.api.DataSetHandle$DataSetParametersPropertyHandle.removeParamBindingsFor(DataSetHandle.java:851)
at org.eclipse.birt.report.model.api.DataSetHandle$DataSetParametersPropertyHandle.removeItems(DataSetHandle.java:694)
--
OdaDataSetHandle dsMaster = (OdaDataSetHandle) report.findDataSet("Master");
...
// find out which DS parameters are actually used
HashSet<String> bindVarsUsed = new HashSet<String>();
...
ArrayList<OdaDataSetParameterHandle> toRemove = new ArrayList<OdaDataSetParameterHandle>();
for (Iterator iter = dsMaster.parametersIterator(); iter.hasNext(); ) {
    OdaDataSetParameterHandle dsPara = (OdaDataSetParameterHandle) iter.next();
    String name = dsPara.getName();
    if (name.startsWith("param_")) {
        String bindVarName = name.substring(6);
        if (!bindVarsUsed.contains(bindVarName)) {
            toRemove.add(dsPara);
        }
    }
}
PropertyHandle paramsHandle = dsMaster.getPropertyHandle(OdaDataSetHandle.PARAMETERS_PROP);
paramsHandle.removeItems(toRemove);
What is wrong here?
Has anyone used the DE API to remove parameters from an existing Data Set?
I had a similar issue. I resolved it by calling removeItem multiple times, and I also had to re-evaluate parametersIterator every time.
protected void updateDataSetParameters(OdaDataSetHandle dataSetHandle) throws SemanticException {
    int countMatches = StringUtils.countMatches(dataSetHandle.getQueryText(), "?");
    int paramIndex = 0;
    do {
        paramIndex = 0;
        PropertyHandle odaDataSetParameterProp = dataSetHandle.getPropertyHandle(OdaDataSetHandle.PARAMETERS_PROP);
        Iterator parametersIterator = dataSetHandle.parametersIterator();
        while (parametersIterator.hasNext()) {
            Object next = parametersIterator.next();
            paramIndex++;
            if (paramIndex > countMatches) {
                odaDataSetParameterProp.removeItem(next);
                break;
            }
        }
        if (paramIndex < countMatches) {
            paramIndex++;
            OdaDataSetParameter dataSetParameter = createDataSetParameter(paramIndex);
            odaDataSetParameterProp.addItem(dataSetParameter);
        }
    } while (countMatches != paramIndex);
}

private OdaDataSetParameter createDataSetParameter(int paramIndex) {
    OdaDataSetParameter dataSetParameter = StructureFactory.createOdaDataSetParameter();
    dataSetParameter.setName("param_" + paramIndex);
    dataSetParameter.setDataType(DesignChoiceConstants.PARAM_TYPE_INTEGER);
    dataSetParameter.setNativeDataType(1);
    dataSetParameter.setPosition(paramIndex);
    dataSetParameter.setIsInput(true);
    dataSetParameter.setIsOutput(false);
    dataSetParameter.setExpressionProperty("defaultValue", new Expression("<evaluation script>", ExpressionType.JAVASCRIPT));
    return dataSetParameter;
}
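A hedged usage sketch, assuming report is the ReportDesignHandle for the opened *.rptdesign file (the data set name comes from the question; the new file name is a placeholder):

OdaDataSetHandle dsMaster = (OdaDataSetHandle) report.findDataSet("Master");
updateDataSetParameters(dsMaster);
report.saveAs("modified.rptdesign"); // saveAs throws IOException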

Generate Term-Document matrix using Lucene 4.4

I'm trying to create Term-Document matrix for a small corpus to further experiment with LSI. However, I couldn't find a way to do it with Lucene 4.4.
I know how to get TermVector for each document as following:
//create boolean query to search for a specific document (not shown)
TopDocs hits = searcher.search(query, 1);
Terms termVector = reader.getTermVector(hits.scoreDocs[0].doc, "contents");
System.out.println(termVector.size()); //just testing
I thought I could just union all the term vectors together as columns of a matrix. However, the term vectors of different documents have different sizes, and I don't know how to pad them with zeros, so this approach does not work.
Hence, can someone show me how to create a Term-Document matrix with Lucene 4.4? (If possible, please show sample code.)
If Lucene does not support this functionality, what other way would you recommend?
Many thanks,
I found the solution to my problem here. A very detailed example is given by Mr. Sujit, although the code is written for an older version of Lucene, so many things had to be changed. I'll update the details when I finish my code.
Here is my solution that works on Lucene 4.4
public class BuildTermDocumentMatrix {

    private IndexReader reader;
    private IndexSearcher searcher;
    private File corpus;
    private Map<String, Integer> termIdMap;

    public BuildTermDocumentMatrix(File index, File corpus) throws IOException {
        reader = DirectoryReader.open(FSDirectory.open(index));
        searcher = new IndexSearcher(reader);
        this.corpus = corpus;
        termIdMap = computeTermIdMap(reader);
    }

    /**
     * Map each term to a fixed integer so that we can build the document matrix later.
     * It's used to assign a term to a specific row in the Term-Document matrix.
     */
    private Map<String, Integer> computeTermIdMap(IndexReader reader) throws IOException {
        Map<String, Integer> termIdMap = new HashMap<String, Integer>();
        int id = 0;
        Fields fields = MultiFields.getFields(reader);
        Terms terms = fields.terms("contents");
        TermsEnum itr = terms.iterator(null);
        BytesRef term = null;
        while ((term = itr.next()) != null) {
            String termText = term.utf8ToString();
            if (termIdMap.containsKey(termText))
                continue;
            //System.out.println(termText);
            termIdMap.put(termText, id++);
        }
        return termIdMap;
    }

    /**
     * Build the term-document matrix for the given directory.
     */
    public RealMatrix buildTermDocumentMatrix() throws IOException {
        // iterate through the directory to work with each doc
        int col = 0;
        int numDocs = countDocs(corpus); // get the number of documents here (helper not shown)
        int numTerms = termIdMap.size(); // total number of terms
        RealMatrix tdMatrix = new Array2DRowRealMatrix(numTerms, numDocs);
        for (File f : corpus.listFiles()) {
            if (!f.isHidden() && f.canRead()) {
                // I build the term-document matrix for a subset of the corpus, so
                // I need to look up each document by path name.
                // If you build it for the whole corpus, just iterate through all documents.
                String path = f.getPath();
                BooleanQuery pathQuery = new BooleanQuery();
                pathQuery.add(new TermQuery(new Term("path", path)), BooleanClause.Occur.SHOULD);
                TopDocs hits = searcher.search(pathQuery, 1);
                // get the term vector
                Terms termVector = reader.getTermVector(hits.scoreDocs[0].doc, "contents");
                TermsEnum itr = termVector.iterator(null);
                BytesRef term = null;
                // compute term weights
                while ((term = itr.next()) != null) {
                    String termText = term.utf8ToString();
                    int row = termIdMap.get(termText);
                    long termFreq = itr.totalTermFreq();
                    long docCount = itr.docFreq();
                    double weight = computeTfIdfWeight(termFreq, docCount, numDocs); // helper not shown
                    tdMatrix.setEntry(row, col, weight);
                }
                col++;
            }
        }
        return tdMatrix;
    }
}
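A hedged usage sketch of the class above (the paths are placeholders; countDocs and computeTfIdfWeight are helper methods not shown):

BuildTermDocumentMatrix builder = new BuildTermDocumentMatrix(new File("/path/to/index"), new File("/path/to/corpus"));
RealMatrix tdMatrix = builder.buildTermDocumentMatrix();
System.out.println(tdMatrix.getRowDimension() + " terms x " + tdMatrix.getColumnDimension() + " documents");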
One can also refer to this code. In the latest Lucene versions it is quite easy.
Example 15
public void testSparseFreqDoubleArrayConversion() throws Exception {
    Terms fieldTerms = MultiFields.getTerms(index, "text");
    if (fieldTerms != null && fieldTerms.size() != -1) {
        IndexSearcher indexSearcher = new IndexSearcher(index);
        for (ScoreDoc scoreDoc : indexSearcher.search(new MatchAllDocsQuery(), Integer.MAX_VALUE).scoreDocs) {
            Terms docTerms = index.getTermVector(scoreDoc.doc, "text");
            Double[] vector = DocToDoubleVectorUtils.toSparseLocalFreqDoubleArray(docTerms, fieldTerms);
            assertNotNull(vector);
            assertTrue(vector.length > 0);
        }
    }
}
