How to read data from Elasticsearch to Spark? - java

I'm trying to read data from Elasticsearch into Apache Spark using Python.
Below is the code, copied from the official documentation.
$ ./bin/pyspark --driver-class-path=/path/to/elasticsearch-hadoop.jar
conf = {"es.resource" : "index/type"}
rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat", "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
rdd.first()
The above reads data from the corresponding index, but it reads the whole index.
Can you tell me how to use a query to limit the read scope?
Also, I did not find much documentation on this. For example, the conf dict seems to control the read scope, but the Elasticsearch documentation only says it is a Hadoop configuration and nothing more. I went through the Hadoop configuration and could not find a corresponding key or value for Elasticsearch. Do you know of any better articles about this?

You can add an es.query setting to your configuration. Since conf in your example is a plain Python dict, add the key directly:
conf = {"es.resource" : "index/type", "es.query" : "?q=me*"}
Here's more detailed documentation on how to use it.
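The question is tagged java, so here is a rough Java equivalent using elasticsearch-hadoop. This is only a sketch: the index/type and the query are placeholders, and Text/LinkedMapWritable are the key/value types elasticsearch-hadoop normally emits.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.hadoop.mr.EsInputFormat;
import org.elasticsearch.hadoop.mr.LinkedMapWritable;

JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("es-read"));
Configuration conf = new Configuration();
conf.set("es.resource", "index/type"); // placeholder index/type
conf.set("es.query", "?q=me*");        // limits what gets read
JavaPairRDD<Text, LinkedMapWritable> esRDD = jsc.newAPIHadoopRDD(
        conf, EsInputFormat.class, Text.class, LinkedMapWritable.class);
System.out.println(esRDD.first());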

Related

How to access file name within a DoFn in an unbounded pipeline

I'm looking for a way to access the name of the file being processed during the data transformation within a DoFn.
My pipeline is as shown below:
Pipeline p = Pipeline.create(options);
p.apply(FileIO.match()
        .filepattern(options.getInput())
        .continuously(Duration.standardSeconds(5), Watch.Growth.<String>never()))
 .apply(FileIO.readMatches()
        .withCompression(Compression.GZIP))
 .apply(XmlIO.<MyString>readFiles()
        .withRootElement("root")
        .withRecordElement("record")
        .withRecordClass(MyString.class)) //<-- This only returns the contents of the file
 .apply(ParDo.of(new ProcessRecord())) //<-- I need to access file name here
 .apply(ParDo.of(new FormatRecord()))
 .apply(Window.<String>into(FixedWindows.of(Duration.standardSeconds(5))))
 .apply(new CustomWrite(options));
Each file that is processed is an XML document. While processing the content, I also need access to the name of the file being processed, so I can include it in the transformed record.
Is there a way to achieve this?
This post has a similar question, but since I'm trying to use XmlIO I haven't found a way to access the file metadata.
Below is the approach I found online, but I'm not sure whether there is a way to use it in the pipeline described above.
p.apply(FileIO.match()
        .filepattern(options.getInput())
        .continuously(Duration.standardSeconds(5), Watch.Growth.<String>never())) //File Metadata
 .apply(FileIO.readMatches()
        .withCompression(Compression.GZIP)) //Readable Files
 .apply(MapElements
        .into(TypeDescriptors.kvs(TypeDescriptors.strings(), new TypeDescriptor<ReadableFile>() {}))
        .via((ReadableFile file) -> {
            return KV.of(file.getMetadata().resourceId().getFilename(), file);
        })
 );
Any suggestions are highly appreciated.
Thank you for your time reviewing this.
EDIT:
I took Alexey's advice and implemented a custom XmlIO. It would be nice if we could just extend the class we need and override the appropriate method. However, in this specific case one of the methods I needed to override is protected within the SDK, so I couldn't easily override just what I needed and instead ended up copying a whole bunch of files. While this works for now, I hope there will be a more straightforward way to access the file metadata in these IO implementations in the future.
I don't think it's possible to do this "out of the box" with the current implementation of XmlIO, since it returns a PCollection<T> where T is the type of your XML record and, if I'm not mistaken, there is no way to add a file name there. However, you can still try to "reimplement" ReadFiles and XmlSource so that they return both the parsed payload and the input file metadata.
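If reworking XmlIO is too heavy, a lighter (though hand-rolled) alternative is to drop XmlIO.readFiles() and parse each ReadableFile yourself in a DoFn, where the metadata is still available. A minimal sketch, assuming MyString is a JAXB-annotated class (as XmlIO requires) and, for simplicity, that each file contains a single record; the class name and output shape are mine:
import java.io.ByteArrayInputStream;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class ParseXmlWithFilename extends DoFn<FileIO.ReadableFile, KV<String, MyString>> {
    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        FileIO.ReadableFile file = c.element();
        String filename = file.getMetadata().resourceId().getFilename();
        // readFullyAsBytes() honors the compression chosen in readMatches(); fine for modest file sizes
        byte[] contents = file.readFullyAsBytes();
        // In real code, build the JAXBContext once in @Setup instead of per element
        Unmarshaller unmarshaller = JAXBContext.newInstance(MyString.class).createUnmarshaller();
        MyString record = (MyString) unmarshaller.unmarshal(new ByteArrayInputStream(contents));
        c.output(KV.of(filename, record));
    }
}
The downstream ProcessRecord DoFn would then receive KV<String, MyString> elements and can read the file name from the key.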

Recursively scan documents for indexing in a folder in SolrJ

I understand that in SimplePostTool (post.jar), there is this command to automatically detect content types in a folder, and recursively scan it for documents for indexing into a collection:
bin/post -c gettingstarted afolder/
This has been useful for me to do mass indexing of all the files that are in the folder. Now I'm moving to production and plan to use SolrJ to do the indexing, as it can do more things like robustness checks and retries for indexing that fails.
However, I can't seem to find a way to do the same in SolrJ. Is it possible to do this in SolrJ? I'm using Solr 5.3.0.
Thank you.
Regards,
Edwin
If you're looking to submit content to an extracting request handler (for indexing PDFs and similar rich documents), you can use ContentStreamUpdateRequest as shown at Uploading data with SolrJ:
SolrClient server = new HttpSolrClient("http://localhost:8983/solr/my_collection");
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.addFile(new File("my-file.pdf"));
server.request(req);
To iterate through a directory structure recursively in Java, see Best way to iterate through a directory in Java.
If you're planning to index plain content (and not use the request handler), you can do that by creating the documents in SolrJ itself and then submitting the documents to the server - there's no need to write them to a temporary file in between.
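Putting the two together, a minimal sketch might look like the following. The collection URL, the directory and the literal.id parameter are assumptions, and you would hang your own retry/robustness logic off the catch block:
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

SolrClient server = new HttpSolrClient("http://localhost:8983/solr/gettingstarted");
try (Stream<Path> paths = Files.walk(Paths.get("afolder"))) {
    paths.filter(Files::isRegularFile).forEach(path -> {
        try {
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            // Tika behind the extract handler will detect the real content type
            req.addFile(path.toFile(), "application/octet-stream");
            req.setParam("literal.id", path.toString()); // each document needs a unique id
            server.request(req);
        } catch (Exception e) {
            e.printStackTrace(); // retry or log failed files here
        }
    });
}
server.commit();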

How can I get json documents from couchbase in java?

I have a Couchbase database that is shared between multiple applications, storing documents as JSON. I cannot seem to get the data into my Java app, since the client appears to depend on native Java binary serialization.
This code:
CouchbaseClient client = new CouchbaseClient(hosts,"bucket","");
System.out.println((String)client.get("someKey"));
results in
net.spy.memcached.transcoders.SerializingTranscoder: Failed to decompress data
java.util.zip.ZipException: Not in GZIP format
since it is trying to deserialize by default. I notice that I can provide my own transcoder, but I really only want the raw string data so I can parse the JSON myself with Gson or whatever. None of the available transcoders seem to give me this.
The Couchbase docs have an example of setting JSON, but none for reading it. How are people reading JSON into Java?
First off, this problem will go away soon: the Couchbase "2.0 SDKs" implement a common set of flags between each other, so this kind of problem won't come up. Michael's blogs are a good read if you want to see what's happening here. The reason for the problem in the first place is that in the 1.x series Couchbase was trying to stay compatible with existing application code and memcached, and in the memcached world the clients were all written by different people at different times.
Based on the exception, I believe you're probably trying to read an item stored by .NET. I have a sample transcoder you can use for this from a few weeks ago.
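For reference, a transcoder that just hands back the stored bytes as a string can be quite small. This is only a sketch (the class name is mine, and it ignores the stored flags and any compression), but it shows the shape of the spymemcached Transcoder interface:
import java.nio.charset.StandardCharsets;
import net.spy.memcached.CachedData;
import net.spy.memcached.transcoders.Transcoder;

// Sketch: returns whatever bytes are stored as a UTF-8 string, ignoring flags and compression.
public class RawJsonTranscoder implements Transcoder<String> {
    public boolean asyncDecode(CachedData d) { return false; }
    public String decode(CachedData d) {
        return new String(d.getData(), StandardCharsets.UTF_8);
    }
    public CachedData encode(String o) {
        return new CachedData(0, o.getBytes(StandardCharsets.UTF_8), getMaxSize());
    }
    public int getMaxSize() { return CachedData.MAX_SIZE; }
}
You could then read with client.get("someKey", new RawJsonTranscoder()) and feed the result to Gson.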
Make sure you are using the latest Couchbase Java client:
<dependencies>
  <dependency>
    <groupId>com.couchbase.client</groupId>
    <artifactId>couchbase-client</artifactId>
    <version>1.4.4</version>
  </dependency>
</dependencies>
see: Couchbase Java Client Library 1.4
I have a service that uses the Couchbase client running just fine. Here is how I create the client:
CouchbaseConnectionFactoryBuilder cfb = new CouchbaseConnectionFactoryBuilder();
cfb.setOpTimeout(10000);
cfb.setOpQueueMaxBlockTime(5000);
CouchbaseClient client = new CouchbaseClient(cfb.buildCouchbaseConnection(baseURIs, bucketName, ""));
And here is an example of how I get a raw string and convert it to a POJO:
MyPOJO get(CouchbaseClient client, String key)
{
    com.google.gson.Gson gson = new com.google.gson.Gson();
    String jsonValue = (String) client.get(key);
    return gson.fromJson(jsonValue, MyPOJO.class);
}
Also, update your question with a sample JSON doc that is causing this issue. Perhaps it has something to do with the format of the document itself.

Getting the number of documents in a Solr Cloud collection

I'm working on a custom Solr search component which takes into account the number of documents in the collection. Currently the number of documents is hard coded in my Solr configuration file, and that's bad because the number of documents is dynamic. Is it possible to get the number of documents (in the whole collection, not in a single core) from the response builder? So far I have found a way to get the cloud descriptor (rb.req.getCore().getCoreDescriptor().getCloudDescriptor()), but in contrast to my expectations I did not see a getNumDocs() method in there.
I used the following code to get the number of documents in my Solr Cloud collection:
HttpSolrServer httpSolrServer = new HttpSolrServer("http://localhost:8983/solr/collectionname/");
QueryResponse queryResponse = httpSolrServer.query(new SolrQuery("*:*"), METHOD.POST);
SolrDocumentList solrDocumentList = queryResponse.getResults();
solrDocumentList.getNumFound();
solrDocumentList.getStart();
Hope this helps you!
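As a small refinement (a sketch, assuming you only need the count): set rows to 0 so Solr returns just the header and numFound rather than actual documents:
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest.METHOD;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

HttpSolrServer httpSolrServer = new HttpSolrServer("http://localhost:8983/solr/collectionname/");
SolrQuery countQuery = new SolrQuery("*:*");
countQuery.setRows(0); // no documents in the response, only the total count
QueryResponse queryResponse = httpSolrServer.query(countQuery, METHOD.POST);
long numDocs = queryResponse.getResults().getNumFound();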

Nutch Seed URLs

Is it possible to get URLs into Nutch directly from a database or a service, etc.? I'm not interested in approaches where the data is taken from the database or service and written to seed.txt first.
No, this cannot be done directly with the default Nutch codebase. You need to modify Injector.java to achieve that.
EDIT:
Try using DBInputFormat: an InputFormat that reads input data from an SQL table. You need to modify the Inject code here (line 3 in the snippet below):
JobConf sortJob = new NutchJob(getConf());
sortJob.setJobName("inject " + urlDir);
FileInputFormat.addInputPath(sortJob, urlDir);
sortJob.setMapperClass(InjectMapper.class);
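For illustration, here is a rough sketch of what swapping in DBInputFormat could look like. The JDBC settings, the seed_urls table, its url column and the UrlRecord class are all assumptions, and InjectMapper would also have to be adapted to read the new value type:
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;

JobConf sortJob = new NutchJob(getConf());
sortJob.setJobName("inject " + urlDir);
// Replace FileInputFormat.addInputPath(...) with a database-backed input (old mapred API, matching Injector)
sortJob.setInputFormat(DBInputFormat.class);
DBConfiguration.configureDB(sortJob, "com.mysql.jdbc.Driver",
        "jdbc:mysql://localhost/nutch", "user", "password"); // hypothetical JDBC settings
DBInputFormat.setInput(sortJob, UrlRecord.class, "seed_urls",
        null /* conditions */, "url" /* orderBy */, "url");  // hypothetical table/column; UrlRecord implements DBWritable
sortJob.setMapperClass(InjectMapper.class); // would need to accept UrlRecord values instead of text lines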
