How to save fetched HTML content to a database in Apache Nutch? (Java)

I'm using Apache Nutch 1.8. I want to save the crawled HTML content to a PostgreSQL database. To do this, I modified the FetcherThread.java class as below:
case ProtocolStatus.SUCCESS:        // got a page
  pstatus = output(fit.url, fit.datum, content, status,
      CrawlDatum.STATUS_FETCH_SUCCESS, fit.outlinkDepth);
  updateStatus(content.getContent().length);
  /* Added my code here */
But I want to use the plug-in system instead of modifying the FetcherThread class directly. Which extension points do I need to use?

You could write a custom plugin and implement an extension of org.apache.nutch.indexer.IndexWriter to send the documents to Postgres as part of the indexing step. You'll need to index the raw content, which requires NUTCH-2032; this landed in Nutch 1.11, so you will need to upgrade your version of Nutch.
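For illustration, a minimal sketch of such a plugin, assuming a hypothetical pages(url, html) table and hard-coded connection details (in practice these would come from nutch-site.xml); the IndexWriter method signatures vary slightly across Nutch 1.x releases, so match them against the interface in your version:

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;
import org.apache.nutch.indexer.IndexWriter;
import org.apache.nutch.indexer.NutchDocument;

public class PostgresIndexWriter implements IndexWriter {
  private Configuration conf;
  private Connection conn;
  private PreparedStatement insert;

  @Override
  public void open(JobConf job, String name) throws IOException {
    try {
      // Hypothetical connection details and table.
      conn = DriverManager.getConnection(
          "jdbc:postgresql://localhost:5432/nutch", "nutch", "secret");
      insert = conn.prepareStatement("INSERT INTO pages (url, html) VALUES (?, ?)");
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  public void write(NutchDocument doc) throws IOException {
    try {
      insert.setString(1, String.valueOf(doc.getFieldValue("url")));
      // The raw HTML is only present in the "content" field once
      // NUTCH-2032's raw-content indexing is enabled.
      insert.setString(2, String.valueOf(doc.getFieldValue("content")));
      insert.executeUpdate();
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  public void update(NutchDocument doc) throws IOException {
    write(doc); // naive: treat updates as inserts in this sketch
  }

  @Override
  public void delete(String key) throws IOException {
    // no-op in this sketch
  }

  @Override
  public void commit() throws IOException {
    // relying on JDBC autocommit in this sketch
  }

  @Override
  public void close() throws IOException {
    try {
      conn.close();
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }

  @Override
  public String describe() {
    return "PostgresIndexWriter: writes indexed documents to a Postgres table";
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}

You would also need the usual plugin.xml descriptor and an entry in the plugin.includes property for Nutch to pick the writer up.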
Alternatively, you could write a custom MapReduce job which would take a segment as input, read the content, and send it to your DB in the reduce step.
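A hedged sketch of that second route (map-only for brevity, so the DB write happens in the mapper rather than a reducer; the connection details and table are hypothetical, and it assumes the segment's content directory is readable as SequenceFiles of <Text, Content>, as Nutch's own SegmentReader reads it):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.protocol.Content;

public class SegmentToPostgres extends Configured implements Tool {

  public static class ContentMapper
      extends Mapper<Text, Content, NullWritable, NullWritable> {
    private Connection conn;
    private PreparedStatement insert;

    @Override
    protected void setup(Context context) throws IOException {
      try {
        // Hypothetical connection details and table.
        conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/nutch", "nutch", "secret");
        insert = conn.prepareStatement("INSERT INTO pages (url, html) VALUES (?, ?)");
      } catch (SQLException e) {
        throw new IOException(e);
      }
    }

    @Override
    protected void map(Text url, Content content, Context context) throws IOException {
      try {
        insert.setString(1, url.toString());
        insert.setString(2, new String(content.getContent(), StandardCharsets.UTF_8));
        insert.executeUpdate();
      } catch (SQLException e) {
        throw new IOException(e);
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      try {
        conn.close();
      } catch (SQLException e) {
        throw new IOException(e);
      }
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "segment-to-postgres");
    job.setJarByClass(SegmentToPostgres.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    // The fetched bytes live under <segment>/content.
    SequenceFileInputFormat.addInputPath(job, new Path(args[0], Content.DIR_NAME));
    job.setMapperClass(ContentMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new SegmentToPostgres(), args));
  }
}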

Related

Access Pervasive/Btrieve DB (DDF + DAT files) from Java

I have a folder with *.DDF and *.DAT files that form a Pervasive/Btrieve database. I am able to open and see the content of the database with DDF Periscope (ddf-periscope.com).
I can export data from each table individually using DDF Periscope, and I would like to do the same thing using Java: access the data in the DB and export it to a CSV file, POJOs, or any form in which I can manipulate the data.
Is this possible?
You can use either JDBC or the JCL interfaces to access the data. You do still need the Pervasive engine, but you can use Java. Here is a simple sample for the JDBC driver.
I don't have a JCL sample, but there should be one in the Pervasive/Actian Java Class Library SDK.
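For illustration, a minimal sketch of the JDBC route that dumps one table to CSV; the driver class name, URL format, port, and table name are assumptions based on Pervasive PSQL conventions, so verify them against your driver's documentation:

import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class PervasiveToCsv {
  public static void main(String[] args) throws Exception {
    // Driver class and URL format are assumptions -- check your driver docs.
    Class.forName("com.pervasive.jdbc.v2.Driver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:pervasive://localhost:1583/DEMODATA");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT * FROM SomeTable");
         PrintWriter out = new PrintWriter("SomeTable.csv", "UTF-8")) {
      ResultSetMetaData meta = rs.getMetaData();
      int cols = meta.getColumnCount();
      // Header row from the table's column names.
      for (int i = 1; i <= cols; i++) {
        out.print((i > 1 ? "," : "") + meta.getColumnName(i));
      }
      out.println();
      // One CSV line per row.
      while (rs.next()) {
        for (int i = 1; i <= cols; i++) {
          out.print((i > 1 ? "," : "") + rs.getString(i));
        }
        out.println();
      }
    }
  }
}

Note the sketch does no CSV quoting or escaping; use a CSV library for real data.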

Swagger UI blank PDF download

I'm having a problem with Swagger UI when trying to download a PDF file. Everything works fine outside Swagger UI (using curl or Postman there is no problem), but when I try to download via Swagger UI I get a blank PDF.
I'm using Springfox 2.5.0 (a microservice JHipster application), and the response from my Java Spring method is an HttpEntity<byte[]>.
Edit:
I found a similar problem: Swagger UI Download PDF, but it does not have any answers.
You likely have one of the following issues with your setup:
1) The #produces on your server (and therefore in the swagger definition) may not be correct. Please make sure you have produces: application/pdf in your operation.
2) Your operation that returns the PDF may have no schema associated with it. For swagger-ui to render a proper download, you need to have a schema. The correct schema would be:
schema:
  type: string
  format: byte
3) Your server must be returning the correct Content-Type. Please make sure it's application/pdf in the headers.
You might want to try the petstore sample against your server, as that is the latest build of swagger-ui; the one bundled with Springfox may be a bit behind.
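For comparison, a minimal Spring MVC sketch that covers points 1 and 3; the mapping path and the file source are hypothetical:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ReportResource {

  // produces drives the swagger definition; the Content-Type header is
  // what the browser actually sees.
  @GetMapping(value = "/api/report", produces = MediaType.APPLICATION_PDF_VALUE)
  public HttpEntity<byte[]> downloadReport() throws IOException {
    byte[] pdf = Files.readAllBytes(Paths.get("report.pdf")); // hypothetical source
    HttpHeaders headers = new HttpHeaders();
    headers.setContentType(MediaType.APPLICATION_PDF);
    headers.setContentDispositionFormData("attachment", "report.pdf");
    return new HttpEntity<>(pdf, headers);
  }
}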

Pushing AEM content into Solr 6

I am trying to push AEM page content to a remote Solr server. Is there a way we can do it from AEM directly, or do we have to write a service for it? If I need a service, which API should I use? I was able to create the Solr schema using a solrindex node under oak:index.
I had a similar requirement. When I synced AEM with a remote Solr, a separate document was created for each AEM node, so I ended up creating a custom service to bulk-load all content pages to Solr. I used AEM's query API to extract page content and get the id, title, description, and path. For the description field I did a tree traversal to extract property values and built a space-delimited description text field. I then used SolrJ to add the documents to Solr.
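To illustrate the SolrJ step, a minimal sketch against a hypothetical Solr 6 core; the URL, core name, and field names are assumptions, and in AEM this would run inside an OSGi service fed by the query API rather than a main method:

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AemPageLoader {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient solr =
             new HttpSolrClient.Builder("http://localhost:8983/solr/aem-pages").build()) {
      // One document per content page, not per JCR node.
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "/content/site/en/home"); // page path as the unique key
      doc.addField("title", "Home");
      doc.addField("description", "space-delimited text gathered from the page's properties");
      solr.add(doc);
      solr.commit();
    }
  }
}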
Adding reference links to what Opkar has shared:
Link: http://www.aemsolrsearch.com/#/
Git: https://github.com/headwirecom/aem-solr-search
Video/Demo: http://www.aemsolrsearch.com/#/demo
AEM 6.2 Documentation: https://docs.adobe.com/docs/en/aem/6-2/deploy/platform/queries-and-indexing.html#Configuring AEM with an embedded SOLR server
Adobe AEM Community post: http://help-forums.adobe.com/content/adobeforums/en/experience-manager-forum/adobe-experience-manager.topic.html/forum__ir8q-is_there_a_detailed.html
I hope this would be helpful.

Save GXT Tree to XML file in Java

How do I save GXT Tree content (com.extjs.gxt.ui.client.widget.tree.Tree) to an XML file in Java?
Quid pro quo. A less detailed answer would be as follows.
You would anyway be using a List of Lists via a Store or DataProvider. Send them to the server by serializing them over RPC, or as JSON via RequestBuilder. Use an XML library to convert them to an XML file on the fly, and in the onSuccess callback of the RPC/RequestBuilder call provide a URL to download the XML file if required.
Note: this can be done any number of ways depending on the details of your requirement and your expertise!
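A client-side sketch of the serialization step (GXT 2 / GWT): it walks the tree and builds an XML string with XMLParser from GWT's XML module, which you could then send to the server to be written to a file. The accessors getRootItem()/getItemCount()/getItem()/getText() and the element names are assumptions; adjust them to your GXT version:

import com.extjs.gxt.ui.client.widget.tree.Tree;
import com.extjs.gxt.ui.client.widget.tree.TreeItem;
import com.google.gwt.xml.client.Document;
import com.google.gwt.xml.client.Element;
import com.google.gwt.xml.client.XMLParser;

public class TreeXmlSerializer {

  // Build an XML document mirroring the tree structure and return it
  // as a string for the server round-trip.
  public static String treeToXml(Tree tree) {
    Document doc = XMLParser.createDocument();
    Element root = doc.createElement("tree");
    doc.appendChild(root);
    appendChildren(doc, root, tree.getRootItem());
    return doc.toString();
  }

  private static void appendChildren(Document doc, Element parent, TreeItem item) {
    for (int i = 0; i < item.getItemCount(); i++) {
      TreeItem child = item.getItem(i);
      Element node = doc.createElement("node");
      node.setAttribute("text", child.getText());
      parent.appendChild(node);
      appendChildren(doc, node, child); // recurse into subtrees
    }
  }
}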

How to index data in a single Solr app using Tomcat

I am working on a single Solr app. I downloaded the Solr example code from the net, which works fine when running on the Jetty server. The data to be indexed lives in C:\apache-solr-1.4.0\example\exampledocs and the indexes are stored in C:\apache-solr-1.4.0\example\solr\data; using Jetty, the indexes are created with the command java -jar post.jar *.xml. Now I want to know how I can achieve this using Tomcat. Do I need to change the configuration to change the path for index storage and for XML file storage? And how will the data be indexed so that I am able to search it?
If I understand your question correctly, you'll want to use the -Durl flag when running post.jar, e.g.:
java -jar -Durl=http://localhost:8080/solr/update post.jar solr.xml monitor.xml
In solrconfig.xml you can set the path that holds the index:
<dataDir>${solr.data.dir:}</dataDir>
I think you just have to read more of the Solr documentation and click through what you have in the package.
There is a Tomcat deployment doc in the Solr wiki:
http://wiki.apache.org/solr/SolrTomcat
And the WAR file is in the dist folder of what you've downloaded.
How to search it? There is no simple answer. I suggest you read more on the Solr wiki: find out what a handler is, what the difference is between the dismax handler and the standard handler, and how schema.xml defines the database.
