Getting the number of documents in a Solr Cloud collection - java

I'm working on a custom Solr search component which takes into account the number of documents in the collection. Currently the number of documents is hard coded in my Solr configuration file, and that's bad because the number of documents is dynamic. Is it possible to get the number of documents (in the whole collection, not in a single core) from the response builder? So far I have found a way to get the cloud descriptor (rb.req.getCore().getCoreDescriptor().getCloudDescriptor()), but in contrast to my expectations I did not see a getNumDocs() method in there.

I used the following code to get the number of documents in my SolrCloud collection.
HttpSolrServer httpSolrServer = new HttpSolrServer("http://localhost:8983/solr/collectionname/");
// Match all documents; numFound is the total for the whole collection, not a single core
QueryResponse queryResponse = httpSolrServer.query(new SolrQuery("*:*"), METHOD.POST);
SolrDocumentList solrDocumentList = queryResponse.getResults();
long numDocs = solrDocumentList.getNumFound();
long start = solrDocumentList.getStart();
Hope this helps!
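If only the count is needed, the same query can be sent with rows=0 so that no documents are transferred. The sketch below avoids the SolrJ dependency and uses plain HTTP instead; the class and method names (SolrDocCount, countUrl, parseNumFound, fetchNumFound) are made up for illustration, and the regex-based parsing assumes the usual JSON response shape with a "numFound" field. It runs from an external client, not from inside a search component. With SolrJ, the equivalent would be calling setRows(0) on the SolrQuery.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SolrDocCount {
    // Matches "numFound":123 in a Solr JSON response body.
    private static final Pattern NUM_FOUND =
            Pattern.compile("\"numFound\"\\s*:\\s*(\\d+)");

    // Builds a match-all query URL that asks for zero rows (count only).
    static String countUrl(String collectionBaseUrl) {
        String base = collectionBaseUrl.endsWith("/")
                ? collectionBaseUrl.substring(0, collectionBaseUrl.length() - 1)
                : collectionBaseUrl;
        return base + "/select?q=*:*&rows=0&wt=json";
    }

    // Extracts the numFound value from a JSON response body.
    static long parseNumFound(String json) {
        Matcher m = NUM_FOUND.matcher(json);
        if (!m.find()) {
            throw new IllegalArgumentException("no numFound in response");
        }
        return Long.parseLong(m.group(1));
    }

    // Sends the request and returns the count (needs a running Solr instance).
    static long fetchNumFound(String collectionBaseUrl) throws IOException {
        try (InputStream in = new URL(countUrl(collectionBaseUrl)).openStream();
             Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return parseNumFound(s.hasNext() ? s.next() : "");
        }
    }
}
```

Calling SolrDocCount.fetchNumFound("http://localhost:8983/solr/collectionname/") would then return the collection-wide count, assuming the collection is reachable at that URL.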

Related

MarkLogic Search return document collections

Is there a way to return the collections of a document if you are using the search API?
I could not find an option in the Query Options Reference for that use case.
Right now I would have to build my own wrapper around the search API and find the collections of the search results myself:
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search" at "/MarkLogic/appservices/search/search.xqy";
for $result in search:search("query")/search:result
return xdmp:node-collections(doc($result/@uri))
Edit: This should also be available with the MarkLogic Java Client API.
In case you are using the MarkLogic REST API, you can use the category parameter on /v1/search to pull up metadata instead of content. If you would like to blend it into the search results, it is best to use a REST transform on /v1/search via the transform parameter. See also:
https://docs.marklogic.com/REST/GET/v1/search
HTH!
To get only document metadata such as collections and not the document content, write and install a server-side transform that calls xdmp:node-collections() on the document and constructs a replacement document. See:
http://docs.marklogic.com/guide/java/transforms
Then call the QueryDefinition.setResponseTransform() method to specify the server-side transform:
http://docs.marklogic.com/javadoc/client/com/marklogic/client/query/QueryDefinition.html#setResponseTransform-com.marklogic.client.document.ServerTransform-
before passing the query definition to the DocumentManager.search() method:
http://docs.marklogic.com/javadoc/client/com/marklogic/client/document/DocumentManager.html#search-com.marklogic.client.query.QueryDefinition-long-
As a footnote, the DocumentManager.search() method can retrieve both the document metadata and content in a single request without a server-side transform by calling DocumentManager.setMetadataCategories() before searching. See:
http://docs.marklogic.com/javadoc/client/com/marklogic/client/document/DocumentManager.html#setMetadataCategories-java.util.Set-
Hoping that helps,

How to read data from Elasticsearch to Spark?

I'm trying to read data from Elasticsearch into Apache Spark using Python.
Below is the code, copied from the official documentation.
$ ./bin/pyspark --driver-class-path=/path/to/elasticsearch-hadoop.jar
conf = {"es.resource" : "index/type"}
rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat", "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
rdd.first()
The code above reads data from the corresponding index, but it reads the whole index.
Can you tell me how to use a query to limit the read scope?
Also, I did not find much documentation on this. For example, it seems the conf dict controls the read scope, but the ES documentation only says it is a Hadoop configuration and nothing more. I looked through the Hadoop configuration and did not find any keys or values related to ES. Do you know of any better articles about this?
You can add an es.query setting to your configuration like this:
conf["es.query"] = "?q=me*"
The elasticsearch-hadoop configuration documentation describes this setting in more detail.

Recursively scan documents for indexing in a folder in SolrJ

I understand that in SimplePostTool (post.jar), there is this command to automatically detect content types in a folder, and recursively scan it for documents for indexing into a collection:
bin/post -c gettingstarted afolder/
This has been useful for mass indexing all the files in a folder. Now that I'm moving to production, I plan to use SolrJ for indexing, since it allows more control, such as robustness checks and retries for documents that fail to index.
However, I can't seem to find a way to do the same in SolrJ. Is it possible to do this in SolrJ? I'm using Solr 5.3.0.
Thank you.
Regards,
Edwin
If you're looking to submit content to an extracting request handler (for indexing PDFs and similar rich documents), you can use the ContentStreamUpdateRequest class as shown at Uploading data with SolrJ:
SolrClient server = new HttpSolrClient("http://localhost:8983/solr/my_collection");
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
// In SolrJ 4.x and later, addFile also takes the content type
req.addFile(new File("my-file.pdf"), "application/pdf");
server.request(req);
To iterate through a directory structure recursively in Java, see Best way to iterate through a directory in Java.
If you're planning to index plain content (and not use the request handler), you can do that by creating the documents in SolrJ itself and then submitting the documents to the server - there's no need to write them to a temporary file in between.
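The recursive scan itself needs nothing beyond the standard library. A minimal sketch, assuming Java 8 or later: Files.walk visits the whole tree, and each regular file found could then be passed to a ContentStreamUpdateRequest as in the snippet above. The class and method names (RecursiveFileLister, listRegularFiles) are made up for illustration, and the Solr submission step is only marked with a comment.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class RecursiveFileLister {
    // Returns every regular file under root, at any depth, in sorted order.
    static List<Path> listRegularFiles(Path root) throws IOException {
        // Files.walk holds open directory handles, so close the stream when done.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path file : listRegularFiles(root)) {
            // Placeholder: this is where each file would be added to a
            // ContentStreamUpdateRequest and sent to Solr, with retry handling.
            System.out.println(file);
        }
    }
}
```

Collecting the paths first, instead of indexing inside the walk, keeps the directory traversal separate from the network calls, which makes retrying failed files simpler.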

Order By doesn't work on Google Spreadsheet API

I am trying to sort a Google Spreadsheet with the Java API but unfortunately it doesn't seem to work. The code I am using is really simple as shown in the API reference.
URL listFeedUrl = new URI(worksheet.getListFeedUrl().toString() + "?orderby=columnname").toURL();
However, this does not work. The feed returned is not sorted at all. Am I missing something? FYI the column I am trying to sort contains email addresses.
EDIT: I just realized that the problem only happens with the old version of Google Spreadsheet.
This may be the cause: the query is performed on the spreadsheet XML, and the XML tags are lowercase. For example, my column is titled "Nombre" in the spreadsheet, but the corresponding XML tag is <gsx:nombre>, so [?orderby=Nombre] does not work. Use [?orderby=nombre], with a lowercase "n", instead.
The correct query for this is:
URL listFeedUrl = new URI(worksheet.getListFeedUrl().toString() + "?orderby=nombre").toURL();
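To avoid the casing mistake, the column header can be normalized before it is appended to the feed URL. This is only a sketch of the lowercase rule described above; the helper name orderByQuery is hypothetical, and headers containing spaces or punctuation are not handled here.

```java
import java.util.Locale;

class OrderByParam {
    // Builds the orderby query string from a column header as shown in the
    // spreadsheet, lowercasing it to match the lowercase XML tag names.
    static String orderByQuery(String columnHeader) {
        return "?orderby=" + columnHeader.trim().toLowerCase(Locale.ROOT);
    }
}
```

For the example above, OrderByParam.orderByQuery("Nombre") produces the same "?orderby=nombre" suffix used in the corrected URL.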

Document feed limit

Is there a limit to the number of entries returned in a DocumentListFeed? I'm getting 100 results, and some of the collections in my account are missing.
How can I make sure I get all of the collections in my account?
DocsService service = new DocsService(APP_NAME);
service.setHeader("Authorization", "Bearer " + accessToken);
URL feedUrl = new URL("https://docs.google.com/feeds/default/private/full/-/folder?v=3&showfolders=true&showroot=true");
DocumentListFeed feed = service.getFeed(feedUrl, DocumentListFeed.class);
List<DocumentListEntry> entries = feed.getEntries();
The size of entries is 100.
A single request to the Documents List feed returns 100 entries by default, but you can change that value by setting the max-results query parameter.
Regardless, in order to retrieve all documents and files you should always take into account sending multiple requests, one per page, as explained in the documentation:
https://developers.google.com/google-apps/documents-list/#getting_all_pages_of_documents_and_files
Please also note that it is now recommended to switch to the newer Google Drive API, which interacts with the same resources and has complete documentation and sample code in multiple languages, including Java:
https://developers.google.com/drive/
You can call
feed.getNextLink().getHref()
to get a URL that you can use to request another feed. This can be repeated until the link is null, at which point all the entries have been fetched.
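The follow-the-next-link loop can be sketched independently of the Google client library. In the code below, Page is a hypothetical stand-in for one fetched feed (its entries plus the next-page URL, or null on the last page), and fetchPage stands in for a call such as service.getFeed(...); neither is part of the actual API. Note that with the real feed the next link itself may be null on the last page, so it should be checked before calling getHref().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class FeedPager {
    // Hypothetical stand-in for one page of a feed.
    static final class Page {
        final List<String> entries;
        final String nextUrl; // null when this is the last page

        Page(List<String> entries, String nextUrl) {
            this.entries = entries;
            this.nextUrl = nextUrl;
        }
    }

    // Fetches pages starting at firstUrl and follows next links until none remain.
    static List<String> fetchAll(String firstUrl, Function<String, Page> fetchPage) {
        List<String> all = new ArrayList<>();
        String url = firstUrl;
        while (url != null) {
            Page page = fetchPage.apply(url);
            all.addAll(page.entries);
            url = page.nextUrl;
        }
        return all;
    }
}
```

Accumulating entries this way returns every collection regardless of the page size, so it works with or without a max-results setting.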
