Elasticsearch inner hits response - Java

This is my query function:
public List<feed> search(String id) throws IOException {
    Query nestedQuery = NestedQuery.of(nq -> nq
            .path("comment")
            .innerHits(InnerHits.of(ih -> ih))
            .query(MatchQuery.of(mq -> mq.field("comment.c_text").query(id))._toQuery()))
        ._toQuery();
    Query termQueryTitle = TermQuery.of(tq -> tq.field("title").value(id))._toQuery();
    Query termQueryBody = TermQuery.of(tq -> tq.field("body").value(id))._toQuery();
    Query boolQuery = BoolQuery.of(bq -> bq.should(nestedQuery, termQueryBody, termQueryTitle))._toQuery();

    SearchRequest searchRequest = SearchRequest.of(s -> s.index(indexName).query(boolQuery));
    var response = elasticsearchClient.search(searchRequest, feed.class);

    for (var hit : response.hits().hits()) {
        System.out.println("this is inner hit response: " + hit.innerHits().get("comment").hits().hits());
    }

    List<Hit<feed>> hits = response.hits().hits();
    List<feed> feeds = new ArrayList<>();
    feed f = null;
    for (Hit object : hits) {
        f = (feed) object.source();
        feeds.add(f);
    }
    return feeds;
}
I have added this code:

for (var hit : response.hits().hits()) {
    System.out.println("this is inner hit response: " + hit.innerHits().get("comment").hits().hits());
}

If it finds 2 records, it gives me the references of the 2 records but doesn't show me the actual records. For example, its output is as follows when it finds 2 records in the inner hits:

this is inner hit response [co.elastic.clients.elasticsearch.core.search.Hit@75679b1a]
this is inner hit response [co.elastic.clients.elasticsearch.core.search.Hit@1916d9c6]

Can anyone help me print out the actual records?

This works properly for me in the console:
for (var hit : response.hits().hits()) {
    var innerHits = hit.innerHits().get("comment").hits().hits();
    for (var innerHit : innerHits) {
        JsonData source = innerHit.source();
        String jsonDataString = source.toString();
        System.out.println("Matched comments" + jsonDataString);
    }
}

I created a class Comment with the property "c_text" and converted each inner-hit source to it before adding it to a list of comments:
var comments = new ArrayList<Comment>();
for (var hit : response.hits().hits()) {
    comments.addAll(hit.innerHits().get("comment").hits().hits().stream()
            .map(h -> h.source().to(Comment.class))
            .collect(Collectors.toList()));
}
System.out.println(comments);
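
For reference, a minimal sketch of what such a Comment class might look like, assuming the nested field is named c_text as in the query above and that the client's JSON mapper can bind it (this is an illustration, not the original poster's class):

// Hypothetical POJO for the nested "comment" objects; the field name mirrors "comment.c_text".
public class Comment {
    private String c_text;

    public String getC_text() { return c_text; }
    public void setC_text(String c_text) { this.c_text = c_text; }

    @Override
    public String toString() {
        return "Comment{c_text='" + c_text + "'}";
    }
}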

Related

Web scraping using multithreading

I wrote some code to look up movie names on IMDB, but if, for instance, I search for "Harry Potter", I will find more than one movie. I would like to use multithreading, but I don't have much knowledge in this area.
I am using the strategy design pattern to search across multiple websites, and, for instance, inside one of the methods I have this code:
for (Element element : elements) {
    String searchedUrl = element.select("a").attr("href");
    String movieName = element.select("h2").text();
    if (movieName.matches(patternMatcher)) {
        Result result = new Result();
        result.setName(movieName);
        result.setLink(searchedUrl);
        result.setTitleProp(super.imdbConnection(movieName));
        System.out.println(movieName + " " + searchedUrl);
        resultList.add(result);
    }
}
which, for each element (the movie name), creates a new connection to IMDB to look up ratings and other details, on the super.imdbConnection(movieName) line.
The problem is that I would like to make all the connections at the same time, because with 5-6 movies found the process takes much longer than expected.
I am not asking for code, I want some ideas. I thought about creating an inner class that implements Runnable and using it, but I don't see much point in that.
How can I rewrite that loop to use multithreading?
I am using Jsoup for parsing; Element and Elements are from that library.
The simplest way is parallelStream():
List<Result> resultList = elements.parallelStream()
        .map(element -> {
            String searchedUrl = element.select("a").attr("href");
            String movieName = element.select("h2").text();
            if (movieName.matches(patternMatcher)) {
                Result result = new Result();
                result.setName(movieName);
                result.setLink(searchedUrl);
                result.setTitleProp(super.imdbConnection(movieName));
                System.out.println(movieName + " " + searchedUrl);
                return result;
            } else {
                return null;
            }
        })
        .filter(Objects::nonNull)
        .collect(Collectors.toList());
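
Note that parallelStream() runs on the common ForkJoinPool, which is sized for CPU-bound work, while these tasks are network-bound. If you want more concurrency, one option is to run the same pipeline inside a dedicated pool; a rough sketch, where the pool size of 8 is an arbitrary assumption and mapToResult is a hypothetical helper containing the lambda body shown above:

// Run the same parallel pipeline in a dedicated pool so the I/O-bound IMDB calls
// are not limited by the size of the common ForkJoinPool (8 threads is arbitrary).
ForkJoinPool ioPool = new ForkJoinPool(8);
List<Result> resultList;
try {
    resultList = ioPool.submit(() -> elements.parallelStream()
                    .map(this::mapToResult)        // hypothetical helper wrapping the mapping logic above
                    .filter(Objects::nonNull)
                    .collect(Collectors.toList()))
            .get();
} catch (InterruptedException | ExecutionException e) {
    throw new RuntimeException(e);
} finally {
    ioPool.shutdown();
}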
If you don't like parallelStream() and want to use threads, you can do this:
List<Element> elements = new ArrayList<>();

//create a function which returns an implementation of `Callable`
//input: Element
//output: Callable<Result>
//(a lambda is used so that super.imdbConnection(...) still resolves against the enclosing class)
Function<Element, Callable<Result>> scrapFunction = element -> () -> {
    String searchedUrl = element.select("a").attr("href");
    String movieName = element.select("h2").text();
    if (movieName.matches(patternMatcher)) {
        Result result = new Result();
        result.setName(movieName);
        result.setLink(searchedUrl);
        result.setTitleProp(super.imdbConnection(movieName));
        System.out.println(movieName + " " + searchedUrl);
        return result;
    } else {
        return null;
    }
};

//create a fixed pool of threads
ExecutorService executor = Executors.newFixedThreadPool(elements.size());

//submit a Callable<Result> for every Element
//by using scrapFunction.apply(...)
List<Future<Result>> futures = elements.stream()
        .map(e -> executor.submit(scrapFunction.apply(e)))
        .collect(Collectors.toList());

//collect all results from Callable<Result>
List<Result> resultList = futures.stream()
        .map(f -> {
            try {
                return f.get();
            } catch (Exception ignored) {
                return null;
            }
        })
        .filter(Objects::nonNull)
        .collect(Collectors.toList());
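
One detail worth adding (not in the original snippet): once all the futures have been drained, shut the pool down, otherwise its non-daemon worker threads will keep the JVM alive.

// Release the worker threads after all results have been collected.
executor.shutdown();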

Elasticsearch: Highlighted field not always returned

Using the Java API, I need to be able to retrieve the field/highlighted field associated with the query, so I'm adding the _all field (or else *) to the query and the highlighted field to the response.
It works most of the time, but not always. Here is a snippet:
final BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
Arrays.asList(query.split(" "))
        .stream()
        .map(QueryParser::escape)
        .map(x -> String.format("*%s*", x))
        .forEach(x -> {
            boolQueryBuilder.should(
                    QueryBuilders.queryStringQuery(x)
                            .field("_all")
                            .allowLeadingWildcard(true));
        });

SearchResponse response = client
        .prepareSearch()
        .setSize(10)
        .addHighlightedField("*")
        .setHighlighterRequireFieldMatch(false)
        .setQuery(boolQueryBuilder)
        .setHighlighterFragmentSize(40)
        .setHighlighterNumOfFragments(40)
        .execute()
        .actionGet();
Any idea why the field, as well as the highlighted field, is not always accessible in the response, given that it is technically always queried?
Not sure, but I think you might be looking for this:
String aQueryWithPartialSearch = null;
final BoolQueryBuilder aBoolQueryBuilder = new BoolQueryBuilder();
// Enabling partial search
if (query.contains(" ")) {
    List<String> aTokenList = Arrays.asList(query.split(" "));
    aQueryWithPartialSearch = String.join(" ",
            aTokenList.stream().map(p -> "*" + p + "*").collect(Collectors.toList()));
} else {
    aQueryWithPartialSearch = "*" + query + "*";
}
aBoolQueryBuilder.should(QueryBuilders.queryStringQuery(aQueryWithPartialSearch));
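
Presumably you would then execute this with the same highlighter settings as in your original request; a sketch reusing only the pre-5.x request-builder calls already shown in the question (client is your existing Client instance):

// Execute the partial-match query and request highlights on all fields,
// keeping the highlighter settings from the original snippet.
SearchResponse response = client
        .prepareSearch()
        .setSize(10)
        .setQuery(aBoolQueryBuilder)
        .addHighlightedField("*")
        .setHighlighterRequireFieldMatch(false)
        .setHighlighterFragmentSize(40)
        .setHighlighterNumOfFragments(40)
        .execute()
        .actionGet();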

Elasticsearch Search Single Letter Java API - 1.7

Hello Elasticsearch friends.
I have a problem with my settings and mappings in the Elasticsearch Java API. I configured my index and set the mapping and settings. My index name is "orange11", type "profile". I want Elasticsearch to give me results after I type just 2 or 3 letters into my search input field. I've read about analyzers, mappings and all this stuff, and so I tried it out.
This is my IndexService code:
@Service
public class IndexService {

    private Node node;
    private Client client;

    @Autowired
    public IndexService(Node node) throws Exception {
        this.node = node;
        client = this.node.client();

        ImmutableSettings.Builder indexSettings = ImmutableSettings.settingsBuilder();
        indexSettings.put("orange11.analysis.filter.autocomplete_filter.type", "edge_ngram");
        indexSettings.put("orange11.analysis.filter.autocomplete_filter.min.gram", 1);
        indexSettings.put("orange11.analysis.filter.autocomplete_filter.max_gram", 20);
        indexSettings.put("orange11.analysis.analyzer.autocomplete.type", "custom");
        indexSettings.put("orange11.analysis.analyzer.tokenizer", "standard");
        indexSettings.put("orange11.analysis.analyzer.filter", new String[]{"lowercase", "autocomplete_filter"});

        IndicesExistsResponse res = client.admin().indices().prepareExists("orange11").execute().actionGet();
        if (res.isExists()) {
            DeleteIndexRequestBuilder delIdx = client.admin().indices().prepareDelete("orange11");
            delIdx.execute().actionGet();
        }

        CreateIndexRequestBuilder createIndexRequestBuilder = client.admin().indices().prepareCreate("orange11").setSettings(indexSettings);

        // MAPPING GOES HERE
        XContentBuilder mappingBuilder = jsonBuilder().startObject().startObject("profile").startObject("properties")
                .startObject("name").field("type", "string").field("analyzer", "autocomplete").endObject()
                .endObject()
                .endObject();
        System.out.println(mappingBuilder.string());
        createIndexRequestBuilder.addMapping("profile ", mappingBuilder);
        createIndexRequestBuilder.execute().actionGet();

        List<Accounts> accountsList = transformJsonFileToJavaObject();
        //Get Data from jsonMap() function into a ListMap.
        //List<Map<String, Object>> dataFromJson = jsonToMap();
        createIndex(accountsList);
    }

    public List<Accounts> transformJsonFileToJavaObject() throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        List<Accounts> list = mapper.readValue(new File("/Users/lucaarchidiacono/IdeaProjects/moap2/MP3_MoapSampleBuild/data/index/testAccount.json"), TypeFactory.defaultInstance().constructCollectionType(List.class, Accounts.class));
        return list;
    }

    public void createIndex(List<Accounts> accountsList) {
        for (int i = 0; i < accountsList.size(); ++i) {
            Map<String, Object> accountMap = new HashMap<String, Object>();
            accountMap.put("id", accountsList.get(i).getId());
            accountMap.put("isActive", accountsList.get(i).isActive());
            accountMap.put("balance", accountsList.get(i).getBalance());
            accountMap.put("age", accountsList.get(i).getAge());
            accountMap.put("eyeColor", accountsList.get(i).getEyeColor());
            accountMap.put("name", accountsList.get(i).getName());
            accountMap.put("gender", accountsList.get(i).getGender());
            accountMap.put("company", accountsList.get(i).getCompany());
            accountMap.put("email", accountsList.get(i).getEmail());
            accountMap.put("phone", accountsList.get(i).getPhone());
            accountMap.put("address", accountsList.get(i).getAddress());
            accountMap.put("about", accountsList.get(i).getAbout());
            accountMap.put("greeting", accountsList.get(i).getGreeting());
            accountMap.put("favoriteFruit", accountsList.get(i).getFavoriteFruit());
            accountMap.put("url", accountsList.get(i).getUrl());

            //Request an Index for indexObject. Set the index specification such as indexName, indexType and ID.
            IndexRequestBuilder indexRequest = client.prepareIndex("orange11", "profile", Integer.toString(i)).setSource(accountMap);
            //Execute the indexRequest and get the result in indexResponse.
            IndexResponse indexResponse = indexRequest.execute().actionGet();
            if (indexResponse != null && indexResponse.isCreated()) {
                //Print out result of indexResponse
                System.out.println("Index has been created !");
                System.out.println("------------------------------");
                System.out.println("Index name: " + indexResponse.getIndex());
                System.out.println("Type name: " + indexResponse.getType());
                System.out.println("ID: " + indexResponse.getId());
                System.out.println("Version: " + indexResponse.getVersion());
                System.out.println("------------------------------");
            } else {
                System.err.println("Index creation failed.");
            }
        }
    }
}
Every time I run this code I get this exception:
Caused by: org.elasticsearch.index.mapper.MapperParsingException: Root type mapping not empty after parsing! Remaining fields: [profile : {properties={name={analyzer=autocomplete, type=string}}}]
at org.elasticsearch.index.mapper.DocumentMapperParser.parse(DocumentMapperParser.java:278)
at org.elasticsearch.index.mapper.DocumentMapperParser.parseCompressed(DocumentMapperParser.java:192)
at org.elasticsearch.index.mapper.MapperService.parse(MapperService.java:449)
at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:307)
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$2.execute(MetaDataCreateIndexService.java:391)
I don't know how to continue, because I don't see any '.' missing in my indexSettings. Sorry for my bad English.
Change min.gram to min_gram.
(See indexSettings.put("orange11.analysis.filter.autocomplete_filter.min.gram", 1);)
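
In other words, the setting key should use an underscore, matching the max_gram line right below it; the corrected pair of lines from the constructor would read:

// min_gram (with an underscore), like max_gram on the next line
indexSettings.put("orange11.analysis.filter.autocomplete_filter.min_gram", 1);
indexSettings.put("orange11.analysis.filter.autocomplete_filter.max_gram", 20);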

Java ExecutorService Runnable doesn't update value

I'm using Java to download the HTML contents of websites whose URLs are stored in a database. I'd like to put their HTML into the database, too.
I'm using Jsoup for this purpose:
public String downloadHTML(String byLink) {
    String htmlInPage = "";
    try {
        Document doc = Jsoup.connect(byLink).get();
        htmlInPage = doc.html();
    } catch (org.jsoup.UnsupportedMimeTypeException e) {
        // process this and some other exceptions
    }
    return htmlInPage;
}
I'd like to download websites concurrently and use this function:
public void downloadURL(int websiteId, String url,
                        String categoryName, ExecutorService executorService) {
    executorService.submit((Runnable) () -> {
        String htmlInPage = downloadHTML(url);
        System.out.println("Category: " + categoryName + " " + websiteId + " " + url);
        String insertQuery =
                "INSERT INTO html_data (website_id, html_contents) VALUES (?,?)";
        dbUtils.query(insertQuery, websiteId, htmlInPage);
    });
}
dbUtils is my class based on Apache Commons DbUtils. Details are here: http://pastebin.com/iAKXchbQ
And I'm using everything mentioned above in the following way (the List<Object[]> details are explained on Pastebin, too):
public static void main(String[] args) {
    DbUtils dbUtils = new DbUtils("host", "db", "driver", "user", "pass");
    List<String> categoriesList =
            Arrays.asList("weapons", "planes", "cooking", "manga");
    String sql = "SELECT lw.id, lw.website_url, category_name " +
            "FROM list_of_websites AS lw JOIN list_of_categories AS lc " +
            "ON lw.category_id = lc.id " +
            "where category_name = ? ";
    ExecutorService executorService = Executors.newFixedThreadPool(10);
    for (String category : categoriesList) {
        List<Object[]> sitesInCategory = dbUtils.select(sql, category);
        for (Object[] entry : sitesInCategory) {
            int websiteId = (int) entry[0];
            String url = (String) entry[1];
            String categoryName = (String) entry[2];
            downloadURL(websiteId, url, categoryName, executorService);
        }
    }
    executorService.shutdown();
}
I'm not sure if this solution is correct, but it works. Now I want to modify the code to save the HTML not of all websites in my database, but only a fixed amount per category.
For example, download and save the HTML of 50 websites from the "weapons" category, 50 from "planes", etc. I don't think SQL alone is enough for this purpose: selecting 50 sites per category doesn't mean we save them all, because of possibly incorrect syntax and connection problems.
I've tried to create a separate class implementing Runnable with the fields counter and maxWebsitesPerCategory, but these variables aren't updated. Another idea was to create a field Map<String, Integer> sitesInCategory instead of the counter, put each category in it as a key and increment its value until it reaches maxWebsitesPerCategory, but that didn't work either. Please help me!
P.S.: I'll also be grateful for any recommendations about my implementation of concurrent downloading (I haven't worked with concurrency in Java before and this is my first attempt).
How about this?
for (String category : categoriesList) {
    dbUtils.select(sql, category).stream()
            .limit(50)
            .forEach(entry -> {
                int websiteId = (int) entry[0];
                String url = (String) entry[1];
                String categoryName = (String) entry[2];
                downloadURL(websiteId, url, categoryName, executorService);
            });
}
sitesInCategory has been replaced with a stream of at most 50 elements, then your code is run on each entry.
EDIT
In regard to the comments: I've gone ahead and restructured a bit; you can modify/implement the content of the methods I've suggested.
public void werk(Queue<Object[]> q, ExecutorService executorService) {
    executorService.submit(() -> {
        try {
            Object[] o = q.remove();
            try {
                String html = downloadHTML(o); // this takes one of your object arrays and returns the text of an html page
                insertIntoDB(html);            // this is the code in the latter half of your downloadURL method
            } catch (/* narrow exception type indicating download failure */ Exception e) {
                werk(q, executorService);
            }
        } catch (NoSuchElementException e) { }
    });
}
^^^ This method does most of the work.
for (String category : categoriesList) {
    Queue<Object[]> q = new ConcurrentLinkedQueue<>(dbUtils.select(sql, category));
    IntStream.range(0, 50).forEach(i -> werk(q, executorService));
}
^^^ This is the for loop in your main method.
Now each category tries to download 50 pages; when downloading a page fails, it moves on and tries another page from that category. This way, you will either download 50 pages or have attempted to download every page in the category.
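
For completeness, a rough sketch of the two placeholder methods referenced above. The Object[] layout (id, url, category name) mirrors the question's query, but these signatures are assumptions to adapt; unlike the question's downloadHTML, this variant lets failures propagate so the retry inside werk can kick in:

// Hypothetical helper: takes one row (id, url, category name) and returns the page HTML.
// Failures are allowed to propagate so werk() retries with another queued page.
private String downloadHTML(Object[] row) throws IOException {
    String url = (String) row[1];
    return Jsoup.connect(url).get().html();
}

// Hypothetical helper: the database insert from the latter half of downloadURL.
private void insertIntoDB(String html) {
    // e.g. dbUtils.query("INSERT INTO html_data (website_id, html_contents) VALUES (?,?)", websiteId, html);
    // adapted to whatever identifier you still have in scope.
}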

CloudSearch deleteByQuery

The official Solr Java API has a deleteByQuery operation where we can delete documents that satisfy a query. The AWS CloudSearch SDK doesn't seem to have matching functionality. Am I just not seeing the deleteByQuery equivalent, or is this something we'll need to roll our own?
Something like this:
SearchRequest searchRequest = new SearchRequest();
searchRequest.setQuery(queryString);
searchRequest.setReturn("id,version");
SearchResult searchResult = awsCloudSearch.search(searchRequest);

JSONArray docs = new JSONArray();
for (Hit hit : searchResult.getHits().getHit()) {
    JSONObject doc = new JSONObject();
    doc.put("id", hit.getId());
    // is version necessary?
    doc.put("version", hit.getFields().get("version").get(0));
    doc.put("type", "delete");
    docs.put(doc);
}

UploadDocumentsRequest uploadDocumentsRequest = new UploadDocumentsRequest();
StringInputStream documents = new StringInputStream(docs.toString());
uploadDocumentsRequest.setDocuments(documents);
UploadDocumentsResult uploadResult = awsCloudSearch.uploadDocuments(uploadDocumentsRequest);
Is this reasonable? Is there an easier way?
You're correct that CloudSearch doesn't have an equivalent to deleteByQuery. Your approach looks like the next best thing.
And no, version is not necessary -- it was removed with the CloudSearch 01-01-2013 API (aka v2).
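
So the loop from the question can be trimmed to send only the document id and the operation type, something like:

// Each delete instruction needs only the id and the "delete" type;
// the version field was dropped in the 01-01-2013 CloudSearch API.
JSONArray docs = new JSONArray();
for (Hit hit : searchResult.getHits().getHit()) {
    JSONObject doc = new JSONObject();
    doc.put("id", hit.getId());
    doc.put("type", "delete");
    docs.put(doc);
}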
CloudSearch doesn't provide delete-by-query; it supports delete in a slightly different way, i.e. you build a JSON object containing only the id of the document to be deleted, with the operation specified as delete. These JSON objects can be batched together, but the batch size has to be less than 5 MB.
The following class (PHP) supports this functionality; you just pass its deleteDocs method the array of ids to be deleted:
class AWS_CS
{
    protected $client;

    function connect($domain)
    {
        try {
            $csClient = CloudSearchClient::factory(array(
                'key'    => 'YOUR_KEY',
                'secret' => 'YOUR_SECRET',
                'region' => 'us-east-1'
            ));
            $this->client = $csClient->getDomainClient(
                $domain,
                array(
                    'credentials' => $csClient->getCredentials(),
                    'scheme'      => 'HTTPS'
                )
            );
        }
        catch (Exception $ex) {
            echo "Exception: ";
            echo $ex->getMessage();
        }
        //$this->client->addSubscriber(LogPlugin::getDebugPlugin());
    }

    function search($queryStr, $domain) {
        $this->connect($domain);
        $result = $this->client->search(array(
            'query'       => $queryStr,
            'queryParser' => 'lucene',
            'size'        => 100,
            'return'      => '_score,_all_fields'
        ))->toArray();
        return json_encode($result['hits']);
        //$hitCount = $result->getPath('hits/found');
        //echo "Number of Hits: {$hitCount}\n";
    }

    function deleteDocs($idArray, $operation = 'delete') {
        $batch = array();
        foreach ($idArray as $id) {
            //dumpArray($song);
            $batch[] = array(
                'type' => $operation,
                'id'   => $id);
        }
        $batch = array_filter($batch);
        $jsonObj = json_encode($batch, JSON_HEX_TAG | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_HEX_AMP);
        print_r($this->client->uploadDocuments(array(
            'documents'   => $jsonObj,
            'contentType' => 'application/json'
        )));
        return $result['status'] == 'success' ? mb_strlen($jsonObj) : 0;
    }
}
Modified for C#: deleting uploaded documents in CloudSearch
public void DeleteUploadedDocuments(string location)
{
    SearchRequest searchRequest = new SearchRequest
    {
        Query = "resourcename:'filepath'",
        QueryParser = QueryParser.Lucene,
        Size = 10000
    };
    searchClient = new AmazonCloudSearchDomainClient(
        ConfigurationManager.AppSettings["awsAccessKeyId"],
        ConfigurationManager.AppSettings["awsSecretAccessKey"],
        new AmazonCloudSearchDomainConfig { ServiceURL = ConfigurationManager.AppSettings["CloudSearchEndPoint"] });
    SearchResponse searchResponse = searchClient.Search(searchRequest);

    JArray docs = new JArray();
    foreach (Hit hit in searchResponse.Hits.Hit)
    {
        JObject doc = new JObject();
        doc.Add("id", hit.Id);
        doc.Add("type", "delete");
        docs.Add(doc);
    }
    UpdateIndexDocument<JArray>(docs, ConfigurationManager.AppSettings["CloudSearchEndPoint"]);
}

public void UpdateIndexDocument<T>(T document, string DocumentUrl)
{
    AmazonCloudSearchDomainConfig config = new AmazonCloudSearchDomainConfig { ServiceURL = DocumentUrl };
    AmazonCloudSearchDomainClient searchClient = new AmazonCloudSearchDomainClient(
        ConfigurationManager.AppSettings["awsAccessKeyId"],
        ConfigurationManager.AppSettings["awsSecretAccessKey"],
        config);
    using (Stream stream = GenerateStreamFromString(JsonConvert.SerializeObject(document)))
    {
        UploadDocumentsRequest upload = new UploadDocumentsRequest()
        {
            ContentType = "application/json",
            Documents = stream
        };
        searchClient.UploadDocuments(upload);
    }
}
