Overview
I want to implement a Lucene Indexer/Searcher that uses the new Payload feature, which allows adding meta information to text. In my specific case, I add weights (which can be understood as % probabilities, between 0 and 100) to conceptual tags in order to use them to override the standard Lucene TF-IDF weighting. I am puzzled by the behaviour of this, and I believe there is something wrong with the Similarity class that I overrode, but I cannot figure it out.
Example
When I run a search query (e.g. "concept:red"), I find that every payload is always the first number that was passed through MyPayloadSimilarity (in the code example, this is 1.0), rather than 1.0, 50.0 and 100.0. As a result, all documents get the same payload and the same score. However, the results should feature picture #1, with a payload of 100.0, followed by picture #2, followed by picture #3, with very diverse scores. I can't get my head around it.
Here are the results of the run:
Query: concept:red
===> docid: 0 payload: 1.0
===> docid: 1 payload: 1.0
===> docid: 2 payload: 1.0
Number of results:3
-> docid: 3.jpg score: 0.2518424
-> docid: 2.jpg score: 0.2518424
-> docid: 1.jpg score: 0.2518424
What is wrong? Did I misunderstand something about Payloads?
Code
Below I share my code as a self-contained example to make it as easy as possible for you to run it, should you want to.
public class PayloadShowcase {
public static void main(String s[]) {
PayloadShowcase p = new PayloadShowcase();
p.run();
}
public void run () {
// Step 1: indexing
MyPayloadIndexer indexer = new MyPayloadIndexer();
indexer.index();
// Step 2: searching
MyPayloadSearcher searcher = new MyPayloadSearcher();
searcher.search("red");
}
public class MyPayloadAnalyzer extends Analyzer {
private PayloadEncoder encoder;
MyPayloadAnalyzer(PayloadEncoder encoder) {
this.encoder = encoder;
}
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
Tokenizer source = new WhitespaceTokenizer(reader);
TokenStream filter = new LowerCaseFilter(source);
filter = new DelimitedPayloadTokenFilter(filter, '|', encoder);
return new TokenStreamComponents(source, filter);
}
}
public class MyPayloadIndexer {
public MyPayloadIndexer() {}
public void index() {
try {
Directory dir = FSDirectory.open(new File("D:/data/indices/sandbox"));
Analyzer analyzer = new MyPayloadAnalyzer(new FloatEncoder());
IndexWriterConfig iwconfig = new IndexWriterConfig(Version.LUCENE_4_10_1, analyzer);
iwconfig.setSimilarity(new MyPayloadSimilarity());
iwconfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
// load mappings and classifiers
HashMap<String, String> mappings = this.loadDataMappings();
HashMap<String, HashMap> cMaps = this.loadData();
IndexWriter writer = new IndexWriter(dir, iwconfig);
indexDocuments(writer, mappings, cMaps);
writer.close();
} catch (IOException e) {
System.out.println("Exception while indexing: " + e.getMessage());
}
}
private void indexDocuments(IndexWriter writer, HashMap<String, String> fileMappings, HashMap<String, HashMap> concepts) throws IOException {
Set fileSet = fileMappings.keySet();
Iterator<String> iterator = fileSet.iterator();
while (iterator.hasNext()){
// unique file information
String fileID = iterator.next();
String filePath = fileMappings.get(fileID);
// create a new, empty document
Document doc = new Document();
// path of the indexed file
Field pathField = new StringField("path", filePath, Field.Store.YES);
doc.add(pathField);
// lookup all concept probabilities for this fileID
Iterator<String> conceptIterator = concepts.keySet().iterator();
while (conceptIterator.hasNext()){
String conceptName = conceptIterator.next();
HashMap conceptMap = concepts.get(conceptName);
doc.add(new TextField("concept", ("" + conceptName + "|").trim() + (conceptMap.get(fileID) + "").trim(), Field.Store.YES));
}
writer.addDocument(doc);
}
}
public HashMap<String, String> loadDataMappings(){
HashMap<String, String> h = new HashMap<>();
h.put("1", "1.jpg");
h.put("2", "2.jpg");
h.put("3", "3.jpg");
return h;
}
public HashMap<String, HashMap> loadData(){
HashMap<String, HashMap> h = new HashMap<>();
HashMap<String, String> green = new HashMap<>();
green.put("1", "50.0");
green.put("2", "1.0");
green.put("3", "100.0");
HashMap<String, String> red = new HashMap<>();
red.put("1", "100.0");
red.put("2", "50.0");
red.put("3", "1.0");
HashMap<String, String> blue = new HashMap<>();
blue.put("1", "1.0");
blue.put("2", "50.0");
blue.put("3", "100.0");
h.put("green", green);
h.put("red", red);
h.put("blue", blue);
return h;
}
}
class MyPayloadSimilarity extends DefaultSimilarity {
@Override
public float scorePayload(int docID, int start, int end, BytesRef payload) {
float pload = 1.0f;
if (payload != null) {
pload = PayloadHelper.decodeFloat(payload.bytes);
}
System.out.println("===> docid: " + docID + " payload: " + pload);
return pload;
}
}
public class MyPayloadSearcher {
public MyPayloadSearcher() {}
public void search(String queryString) {
try {
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("D:/data/indices/sandbox")));
IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(new MyPayloadSimilarity());
PayloadTermQuery query = new PayloadTermQuery(new Term("concept", queryString),
new AveragePayloadFunction());
System.out.println("Query: " + query.toString());
TopDocs topDocs = searcher.search(query, 999);
ScoreDoc[] hits = topDocs.scoreDocs;
System.out.println("Number of results:" + hits.length);
// output
for (int i = 0; i < hits.length; i++) {
Document doc = searcher.doc(hits[i].doc);
System.out.println("-> docid: " + doc.get("path") + " score: " + hits[i].score);
}
reader.close();
} catch (Exception e) {
System.out.println("Exception while searching: " + e.getMessage());
}
}
}
}
In MyPayloadSimilarity, the PayloadHelper.decodeFloat call is incorrect. In this case, it's also necessary to pass the payload.offset parameter, like this:
pload = PayloadHelper.decodeFloat(payload.bytes, payload.offset);
I hope it helps.
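For completeness, here is a sketch of the corrected scorePayload method with the offset applied (everything else in the class stays as in the question):
class MyPayloadSimilarity extends DefaultSimilarity {
    @Override
    public float scorePayload(int docID, int start, int end, BytesRef payload) {
        float pload = 1.0f;
        if (payload != null) {
            // decode starting at payload.offset instead of always at the start of the byte array
            pload = PayloadHelper.decodeFloat(payload.bytes, payload.offset);
        }
        System.out.println("===> docid: " + docID + " payload: " + pload);
        return pload;
    }
}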
Related
I have 107 documents in my index, and I created a method to return all of them with pagination. In my case the first page contains 20 documents, and I should logically get 6 pages: the first 5 pages contain 20 documents each and the 6th page contains only 7. The problem is that the method always returns 1 page, not 6.
@Override
@Transactional(readOnly = true)
public Page<Convention> findAll(Pageable pageable) throws UnknownHostException {
String[] parts = pageable.getSort().toString().split(":");
SortOrder sortOrder;
if ("DESC".equalsIgnoreCase(parts[1].trim())) {
sortOrder = SortOrder.DESC;
} else {
sortOrder = SortOrder.ASC;
}
SearchResponse searchResponse = elasticsearchConfiguration.getTransportClient()
.prepareSearch("convention")
.setTypes("convention")
.setQuery(QueryBuilders.matchAllQuery())
.addSort(SortBuilders.fieldSort(parts[0])
.order(sortOrder))
.setSize(pageable.getPageSize())
.setFrom(pageable.getPageNumber() * pageable.getPageSize())
.setSearchType(SearchType.QUERY_THEN_FETCH)
.get();
return searchResults(searchResponse);
}
private Page<Convention> searchResults(SearchResponse searchResponse) {
List<Convention> conventions = new ArrayList<>();
for (SearchHit hit : searchResponse.getHits()) {
if (searchResponse.getHits().getHits().length <= 0) {
return null;
}
String sourceAsString = hit.getSourceAsString();
if (sourceAsString != null) {
ObjectMapper mapper = new ObjectMapper();
Convention convention = null;
try {
convention = mapper.readValue(sourceAsString, Convention.class);
} catch (IOException e) {
LOGGER.error("Error", e);
}
conventions.add(convention);
}
}
return new PageImpl<>(conventions);
}
http://localhost:8081/api/conventions?page=0&size=20&sort=shortname,DESC
When I execute this API, I get TotalElements=20, Number=0, TotalPages=1, and Size=0:
@GetMapping("/conventions")
public ResponseEntity<List<Convention>> getAllConventions(final Pageable pageable) throws UnknownHostException {
final Page<Convention> page = conventionService.findAll(pageable);
System.out.println("-------------- 1:" + page.getTotalElements()); // 20
System.out.println("-------------- 2:" + page.getNumber()); // 0
System.out.println("-------------- 3:" + page.getTotalPages()); // 1
System.out.println("-------------- 4:" + page.getSize()); // 0
HttpHeaders headers = new HttpHeaders();
headers.add("X-Total-Count", Long.toString(page.getTotalElements()));
return new ResponseEntity<>(page.getContent(), headers, HttpStatus.OK);
}
This issue is addressed and fixed in the current stable version of spring-data-elasticsearch, 3.0.7.
See https://jira.spring.io/browse/DATAES-402
I think it comes from this line: return new PageImpl<>(conventions);
You should probably pass along the total number of hits from the response, because you override the query's paging when you build the page yourself.
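A minimal sketch of that idea, assuming the three-argument PageImpl(content, pageable, total) constructor and a transport-client version where getHits().getTotalHits() returns a long; the hit-to-Convention mapping loop stays exactly as in the original method:
private Page<Convention> searchResults(SearchResponse searchResponse, Pageable pageable) {
    List<Convention> conventions = new ArrayList<>();
    // ... map each SearchHit to a Convention exactly as in the original method ...
    long totalHits = searchResponse.getHits().getTotalHits(); // total matches, not just this page
    return new PageImpl<>(conventions, pageable, totalHits);
}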
I'm trying out this code from Microsoft; however, I wanted to combine the 2 features they provide. One analyzes an image and one detects celebrities. I'm having a hard time figuring out how to return 2 values from one function.
Here is the process method...
private String process() throws VisionServiceException, IOException {
Gson gson = new Gson();
String model = "celebrities";
ByteArrayOutputStream output = new ByteArrayOutputStream();
bitmapPicture.compress(Bitmap.CompressFormat.JPEG, 100, output);
ByteArrayInputStream inputStream = new ByteArrayInputStream(output.toByteArray());
AnalysisResult v = this.client.describe(inputStream, 1);
AnalysisInDomainResult m = this.client.analyzeImageInDomain(inputStream,model);
String result = gson.toJson(v);
String result2 = gson.toJson(m);
Log.d("result", result);
return result, result2;
}
And combine the 2 results with this method...
@Override
protected void onPostExecute(String data) {
super.onPostExecute(data);
mEditText.setText("");
if (e != null) {
mEditText.setText("Error: " + e.getMessage());
this.e = null;
} else {
Gson gson = new Gson();
AnalysisResult result = gson.fromJson(data, AnalysisResult.class);
//for detecting the famous ones (celebrities)...
AnalysisInDomainResult result2 = gson.fromJson(data, AnalysisInDomainResult.class);
//decode the returned result
JsonArray detectedCelebs = result2.result.get("celebrities").getAsJsonArray();
if(result2.result != null){
mEditText.append("Celebrities detected: "+detectedCelebs.size()+"\n");
for(JsonElement celebElement: detectedCelebs) {
JsonObject celeb = celebElement.getAsJsonObject();
mEditText.append("Name: "+celeb.get("name").getAsString() +", score" +
celeb.get("confidence").getAsString() +"\n");
}
}else {
for (Caption caption: result.description.captions) {
mEditText.append("Your seeing " + caption.text + ", confidence: " + caption.confidence + "\n");
}
mEditText.append("\n");
}
/* for (String tag: result.description.tags) {
mEditText.append("Tag: " + tag + "\n");
}
mEditText.append("\n");
mEditText.append("\n--- Raw Data ---\n\n");
mEditText.append(data);*/
mEditText.setSelection(0);
}
}
Thanks in advance!
You can use a Pair. Its two parameters are objects, so you can put anything in them:
final Pair<String, String> pair = Pair.create("1", "2");
String a = pair.first;
String b = pair.second;
Simply use Bundle
Bundle bundle = new Bundle();
bundle.putString("key_one", "your_first_value");
bundle.putString("key_two", "your_second_value");
return bundle;
You can add multiple values of different types to a Bundle. In this case, your method's return type should be Bundle.
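Reading the values back at the call site would then look roughly like this (assuming process() is changed to return the Bundle):
Bundle results = process();
String first = results.getString("key_one");   // "your_first_value"
String second = results.getString("key_two");  // "your_second_value"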
Or use AbstractMap#SimpleEntry (or, as of Java 9, Map#entry).
You can also always return two values via Arrays.asList(one, two).
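As a self-contained illustration of the SimpleEntry approach (the method and values here are placeholders, not the poster's actual code):
import java.util.AbstractMap;
import java.util.Map;

public class TwoValuesExample {
    // Hypothetical stand-in for process(): wraps two result strings in a single return value.
    static Map.Entry<String, String> process(String result, String result2) {
        return new AbstractMap.SimpleEntry<>(result, result2);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> results = process("describe-json", "celebrities-json");
        System.out.println("First: " + results.getKey());    // the describe() result
        System.out.println("Second: " + results.getValue()); // the analyzeImageInDomain() result
    }
}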
Hello Elasticsearch Friends.
I have a problem with my settings and mappings in the Elasticsearch Java API. I configured my index and set the mapping and settings. My index name is "orange11", type "profile". I want Elasticsearch to return results after I type just 2 or 3 letters into my search input field. I've read about analyzers, mappings and all this stuff, so I tried it out.
This is my IndexService code:
@Service
public class IndexService {
private Node node;
private Client client;
@Autowired
public IndexService(Node node) throws Exception {
this.node = node;
client = this.node.client();
ImmutableSettings.Builder indexSettings = ImmutableSettings.settingsBuilder();
indexSettings.put("orange11.analysis.filter.autocomplete_filter.type", "edge_ngram");
indexSettings.put("orange11.analysis.filter.autocomplete_filter.min.gram", 1);
indexSettings.put("orange11.analysis.filter.autocomplete_filter.max_gram", 20);
indexSettings.put("orange11.analysis.analyzer.autocomplete.type", "custom");
indexSettings.put("orange11.analysis.analyzer.tokenizer", "standard");
indexSettings.put("orange11.analysis.analyzer.filter", new String[]{"lowercase", "autocomplete_filter"});
IndicesExistsResponse res = client.admin().indices().prepareExists("orange11").execute().actionGet();
if (res.isExists()) {
DeleteIndexRequestBuilder delIdx = client.admin().indices().prepareDelete("orange11");
delIdx.execute().actionGet();
}
CreateIndexRequestBuilder createIndexRequestBuilder = client.admin().indices().prepareCreate("orange11").setSettings(indexSettings);
// MAPPING GOES HERE
XContentBuilder mappingBuilder = jsonBuilder().startObject().startObject("profile").startObject("properties")
.startObject("name").field("type", "string").field("analyzer", "autocomplete").endObject()
.endObject()
.endObject();
System.out.println(mappingBuilder.string());
createIndexRequestBuilder.addMapping("profile ", mappingBuilder);
createIndexRequestBuilder.execute().actionGet();
List<Accounts> accountsList = transformJsonFileToJavaObject();
//Get Data from jsonMap() function into a ListMap.
//List<Map<String, Object>> dataFromJson = jsonToMap();
createIndex(accountsList);
}
public List<Accounts> transformJsonFileToJavaObject() throws IOException {
ObjectMapper mapper = new ObjectMapper();
List<Accounts> list = mapper.readValue(new File("/Users/lucaarchidiacono/IdeaProjects/moap2/MP3_MoapSampleBuild/data/index/testAccount.json"), TypeFactory.defaultInstance().constructCollectionType(List.class, Accounts.class));
return list;
}
public void createIndex(List<Accounts> accountsList) {
for (int i = 0; i < accountsList.size(); ++i) {
Map<String, Object> accountMap = new HashMap<String, Object>();
accountMap.put("id", accountsList.get(i).getId());
accountMap.put("isActive", accountsList.get(i).isActive());
accountMap.put("balance", accountsList.get(i).getBalance());
accountMap.put("age", accountsList.get(i).getAge());
accountMap.put("eyeColor", accountsList.get(i).getEyeColor());
accountMap.put("name", accountsList.get(i).getName());
accountMap.put("gender", accountsList.get(i).getGender());
accountMap.put("company", accountsList.get(i).getCompany());
accountMap.put("email", accountsList.get(i).getEmail());
accountMap.put("phone", accountsList.get(i).getPhone());
accountMap.put("address", accountsList.get(i).getAddress());
accountMap.put("about", accountsList.get(i).getAbout());
accountMap.put("greeting", accountsList.get(i).getGreeting());
accountMap.put("favoriteFruit", accountsList.get(i).getFavoriteFruit());
accountMap.put("url", accountsList.get(i).getUrl());
//Request an Index for indexObject. Set the index specification such as indexName, indexType and ID.
IndexRequestBuilder indexRequest = client.prepareIndex("orange11", "profile", Integer.toString(i)).setSource(accountMap);
//Execute the indexRequest and get the result in indexResponse.
IndexResponse indexResponse = indexRequest.execute().actionGet();
if (indexResponse != null && indexResponse.isCreated()) {
//Print out result of indexResponse
System.out.println("Index has been created !");
System.out.println("------------------------------");
System.out.println("Index name: " + indexResponse.getIndex());
System.out.println("Type name: " + indexResponse.getType());
System.out.println("ID: " + indexResponse.getId());
System.out.println("Version: " + indexResponse.getVersion());
System.out.println("------------------------------");
} else {
System.err.println("Index creation failed.");
}
}
}
}
Every time I run this code I get this exception:
Caused by: org.elasticsearch.index.mapper.MapperParsingException: Root type mapping not empty after parsing! Remaining fields: [profile : {properties={name={analyzer=autocomplete, type=string}}}]
at org.elasticsearch.index.mapper.DocumentMapperParser.parse(DocumentMapperParser.java:278)
at org.elasticsearch.index.mapper.DocumentMapperParser.parseCompressed(DocumentMapperParser.java:192)
at org.elasticsearch.index.mapper.MapperService.parse(MapperService.java:449)
at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:307)
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$2.execute(MetaDataCreateIndexService.java:391)
I don't know how to continue, because I don't see anything wrong in my indexSettings. Sorry for my bad English.
Change min.gram to min_gram.
(See indexSettings.put("orange11.analysis.filter.autocomplete_filter.min.gram", 1);)
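In other words, the corrected filter settings would read (only the min_gram key changes from the original):
indexSettings.put("orange11.analysis.filter.autocomplete_filter.type", "edge_ngram");
indexSettings.put("orange11.analysis.filter.autocomplete_filter.min_gram", 1);
indexSettings.put("orange11.analysis.filter.autocomplete_filter.max_gram", 20);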
I'm using Apache OpenNLP and I'd like to extract the keyphrases of a given text. I'm already gathering entities, but I would like to have keyphrases.
The problem I have is that I can't use TF-IDF, because I don't have models for that and I only have a single text (not multiple documents).
Here is some code (a prototype, not so clean):
public List<KeywordsModel> extractKeywords(String text, NLPProvider pipeline) {
SentenceDetectorME sentenceDetector = new SentenceDetectorME(pipeline.getSentencedetecto("en"));
TokenizerME tokenizer = new TokenizerME(pipeline.getTokenizer("en"));
POSTaggerME posTagger = new POSTaggerME(pipeline.getPosmodel("en"));
ChunkerME chunker = new ChunkerME(pipeline.getChunker("en"));
ArrayList<String> stopwords = pipeline.getStopwords("en");
Span[] sentSpans = sentenceDetector.sentPosDetect(text);
Map<String, Float> results = new LinkedHashMap<>();
SortedMap<String, Float> sortedData = new TreeMap(new MapSort.FloatValueComparer(results));
float sentenceCounter = sentSpans.length;
float prominenceVal = 0;
int sentences = sentSpans.length;
for (Span sentSpan : sentSpans) {
prominenceVal = sentenceCounter / sentences;
sentenceCounter--;
String sentence = sentSpan.getCoveredText(text).toString();
int start = sentSpan.getStart();
Span[] tokSpans = tokenizer.tokenizePos(sentence);
String[] tokens = new String[tokSpans.length];
for (int i = 0; i < tokens.length; i++) {
tokens[i] = tokSpans[i].getCoveredText(sentence).toString();
}
String[] tags = posTagger.tag(tokens);
Span[] chunks = chunker.chunkAsSpans(tokens, tags);
for (Span chunk : chunks) {
if ("NP".equals(chunk.getType())) {
int npstart = start + tokSpans[chunk.getStart()].getStart();
int npend = start + tokSpans[chunk.getEnd() - 1].getEnd();
String potentialKey = text.substring(npstart, npend);
if (!results.containsKey(potentialKey)) {
boolean hasStopWord = false;
String[] pKeys = potentialKey.split("\\s+");
if (pKeys.length < 3) {
for (String pKey : pKeys) {
for (String stopword : stopwords) {
if (pKey.toLowerCase().matches(stopword)) {
hasStopWord = true;
break;
}
}
if (hasStopWord == true) {
break;
}
}
}else{
hasStopWord=true;
}
if (hasStopWord == false) {
int count = StringUtils.countMatches(text, potentialKey);
results.put(potentialKey, (float) (Math.log(count) / 100) + (float)(prominenceVal/5));
}
}
}
}
}
sortedData.putAll(results);
System.out.println(sortedData);
return null;
}
What it basically does is return the nouns and sort them by prominence value (where do they appear in the text?) and counts.
But honestly, this doesn't work so well.
I also tried it with a Lucene analyzer, but the results were not good either.
So, how can I achieve what I want to do? I already know of KEA/Maui-indexer etc. (but I'm afraid I can't use them because of the GPL).
Also interesting: which other algorithms can I use instead of TF-IDF?
Example:
This text: http://techcrunch.com/2015/09/04/etsys-pulling-the-plug-on-grand-st-at-the-end-of-this-month/
Good output in my opinion: Etsy, Grand St., solar chargers, maker marketplace, tech hardware
Finally, I found something:
https://github.com/srijiths/jtopia
It uses the POS taggers from OpenNLP/Stanford NLP and has an Apache 2.0 license. I haven't measured precision and recall yet, but it delivers great results in my opinion.
Here is my code:
Configuration.setTaggerType("openNLP");
Configuration.setSingleStrength(6);
Configuration.setNoLimitStrength(5);
// if tagger type is "openNLP" then give the openNLP POS tagger path
//Configuration.setModelFileLocation("model/openNLP/en-pos-maxent.bin");
// if tagger type is "default" then give the default POS lexicon file
//Configuration.setModelFileLocation("model/default/english-lexicon.txt");
// if tagger type is "stanford "
Configuration.setModelFileLocation("Dont need that here");
Configuration.setPipeline(pipeline);
TermsExtractor termExtractor = new TermsExtractor();
TermDocument topiaDoc = new TermDocument();
topiaDoc = termExtractor.extractTerms(text);
//logger.info("Extracted terms : " + topiaDoc.getExtractedTerms());
Map<String, ArrayList<Integer>> finalFilteredTerms = topiaDoc.getFinalFilteredTerms();
List<KeywordsModel> keywords = new ArrayList<>();
for (Map.Entry<String, ArrayList<Integer>> e : finalFilteredTerms.entrySet()) {
KeywordsModel keyword = new KeywordsModel();
keyword.setLabel(e.getKey());
keywords.add(keyword);
}
I modified the Configuration file a bit so that the POSModel is loaded from the pipeline instance.
I have this XMLParser:
public class XMLParser {
private URL url;
public XMLParser(String url){
try {
this.url = new URL(url);
} catch (MalformedURLException e) {
e.printStackTrace();
}
}
public LinkedList<HashMap<String, String>> parse() {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
LinkedList<HashMap<String, String>> entries = new LinkedList<HashMap<String,String>>();
HashMap<String, String> entry;
try {
DocumentBuilder builder = factory.newDocumentBuilder();
Document dom = builder.parse(this.url.openConnection().getInputStream());
Element root = dom.getDocumentElement();
NodeList items = root.getElementsByTagName("item");
for (int i=0; i<items.getLength(); i++) {
entry = new HashMap<String, String>();
Node item = items.item(i);
NodeList properties = item.getChildNodes();
for (int j=0; j<properties.getLength(); j++) {
Node property = properties.item(j);
String name = property.getNodeName();
if (name.equalsIgnoreCase("title")) {
entry.put(CheckNewPost.DATA_TITLE, property.getFirstChild().getNodeValue());
} else if (name.equalsIgnoreCase("link")) {
entry.put(CheckNewPost.DATA_LINK, property.getFirstChild().getNodeValue());
}
}
entries.add(entry);
}
} catch (Exception e) {
throw new RuntimeException(e);
}
return entries;
}
}
The problem is that I don't know how to obtain the last value added to the LinkedList and show it in a Toast. I tried this:
XMLParser par = new XMLParser(feedUrl);
HashMap<String, String> entry = par.parse().getLast();
String[] text = new String[] {entry.get(DATA_TITLE), entry.get(DATA_LINK)};
Toast t = Toast.makeText(getApplicationContext(), text[0], Toast.LENGTH_LONG);
t.show();
Evidently I don't have sufficient experience and I know almost nothing about Map, HashMap or the like… The XMLParser is not mine.
For iterating through all the values in your HashMap, you can use an EntrySet:
for (Map.Entry<String, String> entry : hashMap.entrySet()) {
Log.i(TAG, "Key: " + entry.getKey() + ", Value: " + entry.getValue());
}
As for obtaining only the last value: note that a plain HashMap has no defined order and its keys here are Strings, so something like hashMap.get(hashMap.size() - 1) will not work. To get the last element, use an ordered structure; in your case the LinkedList already gives it to you via par.parse().getLast().
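If you do want a "last" entry from the map itself, here is a sketch using an insertion-ordered LinkedHashMap (the keys and values are made up for illustration):
import java.util.LinkedHashMap;
import java.util.Map;

public class LastEntryExample {
    public static void main(String[] args) {
        // LinkedHashMap preserves insertion order, unlike HashMap.
        Map<String, String> map = new LinkedHashMap<>();
        map.put("title", "First post");
        map.put("link", "http://example.com/first");

        Map.Entry<String, String> last = null;
        for (Map.Entry<String, String> entry : map.entrySet()) {
            last = entry; // after the loop, this holds the most recently inserted entry
        }
        System.out.println(last.getKey() + " -> " + last.getValue());
    }
}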
So the entry is a HashMap, which means you need to get the name/value pairs out of the Map. In a map there is a name, also known as a key (or hashed index), and for each key there exists exactly one value:
Map<String, String> entries = new HashMap<String, String>();
entries.put("Key1", "Value for Key 1");
entries.put("Key 2", "Key 2 value!");
entries.put("Key1", "Not the original value for sure!"); // replaced
and later you can get it as:
String val1 = entries.get("Key1"); // "Not the original value for sure!"
String val2 = entries.get("Key 2"); //"Key 2 value!"
and to loop through the Map you would do:
for (String key : entries.keySet()) {
String value = entries.get(key);
}
Hope that helps. You can also google "Java Map example" for more.