How to save Web-Harvest data to a database - java

I am scraping data with the Web-Harvest tool and I am getting the required data, i.e. the name and price of each product.
Here is my config file:
<include path="functions.xml"/>
<!-- collects all tables for individual products -->
<var-def name="products">
<call name="download-multipage-list">
<call-param name="pageUrl">http://www.amazon.de/s/ref=nb_sb_noss?__mk_de_DE=AMAZON&amp;url=search-alias%3Daps&amp;field-keywords=AT300-103%20TEGRA%203%201GB</call-param>
<call-param name="nextXPath">//a[@class="pagnNext"]/@href</call-param>
<call-param name="itemXPath">//div[@class="fstRow prod"]</call-param>
<call-param name="maxloops">10</call-param>
</call>
</var-def>
<!-- iterates over all collected products and extract desired data -->
<file action="write" path="reports/catalog.xml" charset="UTF-8">
<![CDATA[ <catalog> ]]>
<loop item="item" index="i">
<list><var name="products"/></list>
<body>
<xquery>
<xq-param name="item" type="node()"><var name="item"/></xq-param>
<xq-expression><![CDATA[
declare variable $item as node() external;
let $name := data($item//*[@class='lrg bold'])
let $price := data($item//*[@class='bld lrg red'])
return
<product>
<name>{normalize-space($name)}</name>
<price>{normalize-space($price)}</price>
</product>
]]></xq-expression>
</xquery>
</body>
</loop>
<![CDATA[ </catalog> ]]>
</file>
Now I am trying to move the name and price information into a MySQL table that has two columns, name and price. I learned that I have to use the database tag, but I cannot find information on how to use it.
Could you please help me configure it in my config file?
Thanks in advance.
Sahiti

Please go through web-harvest.sourceforge.net/manual.php#database and implement the database tag as described there.

Related

How to pass custom parameters to a Solr DIH query

I have a scenario where I need to pass custom parameters to the Solr data import query.
Example: select * from customer where last_updated_date >= last_updated_indexed_date
The last_updated_indexed_date comes from another table that holds details about the core.
How can I pass last_updated_indexed_date into the DIH query?
The data-config can be set up along these lines:
<dataConfig>
<dataSource name="ds-db" driver="oracle.jdbc.driver.OracleDriver"
url="jdbc:oracle:thin:@127.0.0.1:1521:test" user="dev" password="dev" />
<dataSource name="ds-file" type="BinFileDataSource" />
<document name="documents">
<entity name="book" dataSource="ds-db"
query="select distinct
book.id as id,
book.title,
book.author,
book.publisher
from Books book
where book.book_added_date >= to_date(${dataimporter.request.lastIndexDate}, 'DD/MM/YYYY HH24:MI:SS')"
transformer="DateFormatTransformer">
<field column="id" name="id" />
<field column="title" name="title" />
<field column="author" name="author" />
<field column="publisher" name="publisher" />
<entity name="content" query="select description from content
where content_id='${book.Id}'">
<field column="description" name="description" />
</entity>
</entity>
</document>
</dataConfig>
Note how '${book.Id}' is taken from the outer entity and passed into the nested query. You will need something similar for last_updated_indexed_date in your data-config.xml. If that value is not available in the tables you import from, you can pass it on the data import URL as a request parameter such as lastIndexDate, which the query reads as ${dataimporter.request.lastIndexDate}.
The data import URL will look like this:
http://localhost:8080/solr/admin/select/?qt=/dataimport&command=full-import&clean=false&commit=true&lastIndexDate='08/05/2011 20:16:11'
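The example URL carries a space and quotes inside lastIndexDate, so a client that builds it in code would normally URL-encode that value. Below is a minimal Java sketch under that assumption. The base URL is copied from the example above, and the class and method names (DihUrlBuilder, fullImportUrl) are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class DihUrlBuilder {

    // Base URL copied from the example above; adjust it to your Solr setup.
    static final String BASE = "http://localhost:8080/solr/admin/select/";

    // Builds a full-import URL whose lastIndexDate value is URL-encoded.
    static String fullImportUrl(String lastIndexDate) {
        try {
            // The single quotes are part of the value, because the query puts
            // ${dataimporter.request.lastIndexDate} into to_date(...) as-is.
            String value = URLEncoder.encode("'" + lastIndexDate + "'", "UTF-8");
            return BASE + "?qt=/dataimport&command=full-import"
                    + "&clean=false&commit=true&lastIndexDate=" + value;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fullImportUrl("08/05/2011 20:16:11"));
    }
}
```

Because the parameter is substituted straight into the SQL text, it is worth accepting only values that match the expected date format before building the URL.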

Test if a column value is numeric in Scriptella

Hello, I am using Scriptella to copy data from an Oracle database into a PostgreSQL database. The copy works, but I have one problem: I would like to copy a column that is numeric, but the source table may contain a code that is not really numeric, so I would like to test whether it is numeric. Any help is appreciated.
Here is what I did:
<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE etl SYSTEM "http://scriptella.javaforge.com/dtd/etl.dtd">
<etl>
<description>
test script Pour table article
</description>
<connection id="in" driver="oracle"
url="jdbc:oracle:thin:@localhost:1521:XE" user="test" password="test" />
<connection id="out" driver="postgresql"
url="jdbc:postgresql://localhost:5432/testMonoprix2" user="postgres"
password="maher" />
<query connection-id="in">
SELECT CODE from test.TMP_FOURNISSEUR;
<script connection-id="out" if =" code is numeric" >
INSERT INTO public.suppliers
(code) values
(?CODE);
</script>
</query>
</etl>
A version that tests the value with a Janino query:
<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE etl SYSTEM "http://scriptella.javaforge.com/dtd/etl.dtd">
<etl>
<description>
test script Pour table article
</description>
<connection id="in" driver="oracle"
url="jdbc:oracle:thin:@localhost:1521:XE" user="IPTECH" password="IPTECH" />
<connection id="out" driver="postgresql"
url="jdbc:postgresql://localhost:5432/gemodb" user="postgres"
password="maher" />
<connection id="janino" driver="janino" />
<connection id="log" driver="text" />
<query connection-id="in">
SELECT CODEARTICLE,STRUCTURE, DES,TYPEMARK,TYP,IMPLOC,MARQUE,GAMME,TAR
FROM IPTECH.TMP_ARTICLE ;
<query connection-id="janino">
import org.apache.commons.lang3.StringUtils;
// Flag the row as numeric only if CODEARTICLE is non-null and contains digits only.
Boolean result = Boolean.FALSE;
Object obj = get("CODEARTICLE");
if (obj != null) {
result = Boolean.valueOf(StringUtils.isNumeric(obj.toString()));
}
// Expose the flag to the nested script, then continue with the current row.
set("result", result);
next();
<script connection-id="out" if="result">
INSERT INTO public.articles
(id,
is_enabled,type_marketing,type_tarif,description,gamme,import_local,marque,reference,struct,family_id)
values
(cast(?CODEARTICLE as bigint)
,'TRUE',?TYPEMARK,?TAR,?DES,?GAMME,?IMPLOC,?MARQUE,?CODEARTICLE,?STRUCTURE,cast(?{STRUCTURE.substring(0,
2)} as bigint));
</script>
</query>
</query>
</etl>
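The Janino query above depends on commons-lang3 being on the classpath. The same check can be sketched in plain Java. NumericCheck and isDigitsOnly are hypothetical names, and the rule assumed here (non-null, non-empty, ASCII digits only, so no sign, spaces, or decimal point) is stricter than a general "is a number" test.

```java
class NumericCheck {

    // Returns true only for a non-null value whose text is one or more ASCII digits.
    static boolean isDigitsOnly(Object value) {
        if (value == null) {
            return false;
        }
        String s = value.toString();
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isDigitsOnly("12345")); // true
        System.out.println(isDigitsOnly("12A45")); // false
        System.out.println(isDigitsOnly(null));    // false
    }
}
```

A digits-only value can still be too long for a bigint column, so very long codes may need a length check as well.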

Stored procedure to retrieve the query result as XML

Is it possible to retrieve a stored procedure's result in XML format? I am using Java to call the stored procedure and Microsoft SQL Server Management Studio to test my stored procedures. Could someone provide sample code?
I found something like this:
SELECT
CustomerID AS '@CustomerID',
CustName AS '@Name',
(SELECT ProductName AS '@productname'
FROM dbo.Products p
WHERE p.CustomerID = c.CustomerID
FOR XML PATH('Product'), TYPE) AS 'Products',
(SELECT HobbyName AS '@hobbyname'
FROM dbo.Hobbies h
WHERE h.CustomerID = c.CustomerID
FOR XML PATH('Hobby'), TYPE) AS 'Hobbies'
FROM
dbo.Customers c
FOR XML PATH('Customer'), ROOT('Customers')
It gives the following output:
<Customers>
<Customer CustomerID="1" Name="Fred">
<Products>
<Product productname="Table" />
<Product productname="Wardrobe" />
<Product productname="Chair" />
</Products>
<Hobbies>
<Hobby hobbyname="Golf" />
<Hobby hobbyname="Swimming" />
</Hobbies>
</Customer>
<Customer CustomerID="2" Name="Sue">
<Products>
<Product productname="CD Player" />
<Product productname="Picture frame" />
</Products>
<Hobbies>
<Hobby hobbyname="Dancing" />
<Hobby hobbyname="Gardening" />
<Hobby hobbyname="Reading" />
</Hobbies>
</Customer>
</Customers>
Is this correct?
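On the Java side, a query like the one above could be wrapped in a stored procedure and read over JDBC. The sketch below is an outline under stated assumptions and has not been run against a real server: the class name XmlProcedureCall and the procedure name used in the usage note are hypothetical, the procedure is assumed to take no parameters, and the open connection is expected to come from the caller. It concatenates the first column of every returned row, in case the driver splits a long FOR XML result across several rows.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;

class XmlProcedureCall {

    // Builds the JDBC escape syntax for calling a procedure without parameters.
    static String callSyntax(String procedureName) {
        return "{call " + procedureName + "}";
    }

    // Appends the first column of every row, so a result split across rows is reassembled.
    static String readXml(ResultSet rs) throws SQLException {
        StringBuilder xml = new StringBuilder();
        while (rs.next()) {
            String part = rs.getString(1);
            if (part != null) {
                xml.append(part);
            }
        }
        return xml.toString();
    }

    // Calls the procedure on an open connection and returns its XML output as text.
    static String fetchXml(Connection conn, String procedureName) throws SQLException {
        try (CallableStatement cs = conn.prepareCall(callSyntax(procedureName));
             ResultSet rs = cs.executeQuery()) {
            return readXml(rs);
        }
    }
}
```

With a procedure named, say, dbo.GetCustomersXml, the call would be XmlProcedureCall.fetchXml(conn, "dbo.GetCustomersXml"), and the returned string can then be handed to any XML parser.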

How to enable advanced search by d:date properties in Alfresco?

I have a custom content model I created for Alfresco that has a type with a d:date property. I am able to build the repository and share projects with seemingly no errors. However, I am unable to search by properties whose data type is d:date or d:int. I resolved the d:int problem by changing the data type to d:text and adding a regex constraint, but I'm not sure that would be prudent for the d:date property.
Is there some additional configuration that I need to supply or create in order to search by properties that are not d:text?
Here is a snippet showing the type declaration:
<types>
<!-- Enterprise-wide generic document type -->
<type name="gl:x">
<title>Document</title>
<parent>cm:content</parent>
<properties>
<property name="gl:period">
<type>d:text</type>
</property>
<property name="gl:year">
<type>d:text</type>
<constraints>
<constraint ref="gl:documentYears" />
</constraints>
</property>
<property name="gl:docType">
<type>d:text</type>
<constraints>
<constraint ref="gl:documentTypeList" />
</constraints>
</property>
<property name="gl:date">
<type>d:date</type>
</property>
</properties>
</type>
</types>
The share search forms and properties forms seem to be rendering correctly, so I don't think that there is any problem within those.
The advanced search page accepts two types of parameters.
One is simply the "keywords" field. This performs a full text search, i.e. it looks for the provided keywords in ANY text property. There is no need to configure the full text search for custom types (e.g. your gl:x) - it automatically picks up any text property in any model in the system.
The other is the group of single parameters: name, title, description, mime-type, modified-date, modifier. These properties can be of any type. A d:date property would be perfectly acceptable here, as the modified-date parameter testifies.
But here custom properties are not picked up automatically. They need to be configured explicitly.
Notice that in the upper part of the advanced search page is a drop-down called "Look for" with two options: content and folders. The best approach would be to add an option for your content type gl:x and to configure a search form for it.
You can find the definition of the two standard search forms in tomcat/webapps/share/WEB-INF/classes/alfresco/share-form-config.xml. The file is rather long so here are the two sections to look for:
<config evaluator="model-type" condition="cm:content">
<forms>
<!-- Default Create Content form -->
<form>
</form>
<!-- Document Library Create Google Doc form -->
<form id="doclib-create-googledoc">
</form>
<!-- Search form -->
<form id="search">
</form>
</forms>
</config>
<!-- cm:folder type (creating nodes) -->
<config evaluator="model-type" condition="cm:folder">
<forms>
<!-- Document Library Common form -->
<form id="doclib-common">
</form>
<!-- Search form -->
<form id="search">
</form>
</forms>
</config>
I've skipped the details, but what is important is that "cm:content" and "cm:folder" each defines a <form id="search"> with the desired search properties/parameters.
As an experiment you could modify share-form-config.xml directly and add your own definition:
<config evaluator="model-type" condition="gl:x">
<forms>
<!-- Search form -->
<form id="search">
<field-visibility>
<show id="gl:date" />
</field-visibility>
<appearance>
<field id="gl:date">
<control template="/org/alfresco/components/form/controls/daterange.ftl" />
</field>
</appearance>
</form>
</forms>
</config>
Also you have to add the new search form to the AdvancedSearch configuration found in tomcat/webapps/share/WEB-INF/classes/alfresco/share-config.xml:
<config evaluator="string-compare" condition="AdvancedSearch">
<advanced-search>
<forms>
<form labelId="search.form.label.cm_content" descriptionId="search.form.desc.cm_content">cm:content</form>
<form labelId="search.form.label.cm_folder" descriptionId="search.form.desc.cm_folder">cm:folder</form>
<form labelId="search.form.label.gl_x" descriptionId="search.form.desc.gl_x">gl:x</form>
</forms>
</advanced-search>
</config>
Remember to restart Alfresco after every change.
When you're satisfied with the results, it would be better to move your custom definitions to a separate share-config-custom.xml in your project (share-config.xml and share-form-config.xml should never be modified directly).
For more details: https://wiki.alfresco.com/wiki/Share_Advanced_Search

Trying to Extract URLs from a Website Using Web Harvest

I'm trying to extract the URLs of a website that doesn't have a sitemap. I'm using the Web Harvest tool.
I have no knowledge of Java or coding. Could someone please help me use this tool?
I want it to run on a specific website (e.g. example.com) and extract every single URL from that website.
Example.com is not a very good example, as it has only one link! :)
Here's my code with some annotations:
<?xml version="1.0" encoding="UTF-8"?>
<config>
<!-- 1: provide inputs -->
<script><![CDATA[
url="http://stackoverflow.com/questions/17635763/trying-to-extract-urls-from-a-website-using-web-harvest";
output_path = "C:/webharvest/";
file_name = "urllist.txt";
output_file = output_path + file_name;
]]></script>
<!-- 5 : save the resulting list in a variable -->
<var-def name="urls">
<!-- 4 : select only links (outputs a list variable) -->
<xpath expression='//a/@href'>
<!-- 3 : convert it to XML, for querying -->
<html-to-xml>
<!-- 2 : load the page -->
<http url="${url}"/>
</html-to-xml>
</xpath>
</var-def>
<!-- 7: write to output file -->
<file action="write" path="${output_file}">
<!-- 6 : convert the list variable into a string with each link on a new line -->
<text delimiter="${sys.cr}${sys.lf}">
<var name="urls" />
</text>
</file>
</config>
You should go through the Web-Harvest user manual at http://web-harvest.sourceforge.net/manual.php, which has a number of examples.
