I'm trying to extract the URLs of a website that doesn't have a sitemap. I'm using the Web Harvest tool.
I have no idea about Java or coding. Could someone please help me out with using this tool?
I want it to run on a specific website (e.g. example.com) and extract every single URL from that website.
Example.com is not a very good example, as it has only one link! :)
Here's my code with some annotations:
<?xml version="1.0" encoding="UTF-8"?>
<config>
<!-- 1: provide inputs -->
<script><![CDATA[
url="http://stackoverflow.com/questions/17635763/trying-to-extract-urls-from-a-website-using-web-harvest";
output_path = "C:/webharvest/";
file_name = "urllist.txt";
output_file = output_path + file_name;
]]></script>
<!-- 5 : save the resulting list in a variable -->
<var-def name="urls">
<!-- 4 : select only links (outputs a list variable) -->
<xpath expression='//a/@href'>
<!-- 3 : convert it to XML, for querying -->
<html-to-xml>
<!-- 2 : load the page -->
<http url="${url}"/>
</html-to-xml>
</xpath>
</var-def>
<!-- 7: write to output file -->
<file action="write" path="${output_file}">
<!-- 6 : convert the list variable into a string with each link on a new line -->
<text delimiter="${sys.cr}${sys.lf}">
<var name="urls" />
</text>
</file>
</config>
You should go through the Web-Harvest user manual at http://web-harvest.sourceforge.net/manual.php, which contains a number of examples.
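For readers who want to see what the //a/@href selection does outside Web-Harvest, here is a minimal sketch in plain Java using the XPath support that ships with the JDK. The class name LinkExtractor and the sample markup are made up for illustration, and the input is assumed to be well-formed XML already (the job html-to-xml does in the config). Like the config, it only collects the links of a single page; covering a whole site would mean repeating the step for every link found.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Web-Harvest.
public class LinkExtractor {

    // Returns the value of every href attribute found on an <a> element,
    // in document order. The input must be well-formed XML.
    static List<String> extractLinks(String xhtml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//a/@href",
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODESET);
            List<String> links = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                links.add(nodes.item(i).getNodeValue());
            }
            return links;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page; the third anchor has no href and is skipped.
        String page = "<html><body>"
                + "<a href=\"/questions\">Questions</a>"
                + "<a href=\"/tags\">Tags</a>"
                + "<a name=\"top\">no href</a>"
                + "</body></html>";
        extractLinks(page).forEach(System.out::println);
    }
}
```

Note that the attribute axis is written with @ (//a/@href); an expression such as //a/#href is not valid XPath.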
I am trying to run an App Engine for Java project by following the steps described at https://cloud.google.com/java/getting-started/using-forms?authuser=3
To run the app on my local machine I used the command:
mvn -Plocal clean jetty:run-exploded -DprojectID=[YOUR-PROJECT-ID]
But I am getting the following exception:
java.lang.IllegalStateException: Invalid storage type. Check if bookshelf.storageType property is set.
at com.example.getstarted.basicactions.ListBookServlet.init(ListBookServlet.java:62)
at javax.servlet.GenericServlet.init(GenericServlet.java:244)
at org.eclipse.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:637)
at org.eclipse.jetty.servlet.ServletHolder.initialize(ServletHolder.java:421)
at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:744)
I tried the same thing in the GCP Shell, but I got the same exception.
What could be going wrong here?
snippet of web.xml
<!-- [START config] -->
<context-param>
<param-name>bookshelf.storageType</param-name>
<param-value>${bookshelf.storageType}</param-value>
</context-param>
snippet of pom.xml
<properties>
<!-- [START config] -->
<projectID>myProjectID</projectID> <!-- set w/ -DprojectID=myProjectID on command line -->
<bookshelf.storageType>datastore</bookshelf.storageType> <!-- datastore or cloudsql -->
<sql.dbName>bookshelf</sql.dbName> <!-- A reasonable default -->
<!-- Instance Connection Name - project:region:dbName -->
<!-- -Dsql.instanceName=localhost to use a local MySQL server -->
<sql.instanceName>${projectID}:us-central1:${sql.dbName}</sql.instanceName>
<sql.userName>root</sql.userName> <!-- A reasonable default -->
<sql.password>myRootPassword1234</sql.password> <!-- -Dsql.password=myRootPassword1234 -->
<!-- [END config] -->
Please clarify.
Thanks.
This issue was due to an error in the GitHub repo (https://github.com/GoogleCloudPlatform/getting-started-java).
It is fixed now. If you run into this, update to the latest version of the repo.
I have an XML file that starts like this:
<?xml version="1.0" encoding="UTF-8"?>
<interface name="AccountAPING" owner="BDP" version="1.0.0" date="now()" namespace="com.betfair.account.api"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<description>Account API-NG</description>
...
Afterward there are various blocks, such as:
<operation name="getDeveloperAppKeys" since="1.0.0">
<description>
Get all application keys owned by the given developer/vendor
</description>
<parameters>
<request/>
<simpleResponse type="list(DeveloperApp)">
<description>
A list of application keys owned by the given developer/vendor
</description>
</simpleResponse>
<exceptions>
<exception type="AccountAPINGException">
<description>Generic exception that is thrown if this operation fails for any reason.</description>
</exception>
</exceptions>
</parameters>
</operation>
........
<simpleType name="Status" type="string">
<validValues>
<value name="SUCCESS">
<description>Sucess status</description>
</value>
</validValues>
</simpleType>
........
<dataType name="TimeRange">
<description>TimeRange</description>
<parameter name="from" type="dateTime" mandatory="false">
<description>from, format: ISO 8601)</description>
</parameter>
<parameter name="to" type="dateTime" mandatory="false">
<description>to, format: ISO 8601</description>
</parameter>
</dataType>
How can I generate Java code from this using Maven? I tried using "maven-jaxb2-plugin", but it can't parse this structure.
Please note:
This is an XML file, not an XSD
I'm using NetBeans
First of all, you need the schema (XSD) that describes your XML sample. Without that schema you cannot use JAXB. You don't have a schema for the sample you have shown: xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" only declares the XML Schema instance namespace; it is not the schema for your XML.
You can use free online tools to generate a schema from the XML, but you shouldn't rely on these tools without reviewing the generated schema.
To generate Java code from a schema file, use XJC. Open a command prompt in the folder that contains your XSD file and type:
$ xjc nameOfSchemaFile.xsd
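xjc is included with the JDK up to Java 10; it was removed in JDK 11, so on newer versions it has to be obtained from a separate JAXB distribution.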
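If generated classes are not strictly required, the file can also be read without any schema. The following is only a sketch of that alternative, not what XJC produces: it uses the DOM parser bundled with the JDK to list the name attribute of every operation element. The class name InterfaceReader is made up, and the sample string is a shortened version of the XML from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: reads the interface description with plain DOM, no XSD needed.
public class InterfaceReader {

    // Returns the "name" attribute of every <operation> element, in document order.
    static List<String> operationNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList operations = doc.getElementsByTagName("operation");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < operations.getLength(); i++) {
                names.add(((Element) operations.item(i)).getAttribute("name"));
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse interface XML", e);
        }
    }

    public static void main(String[] args) {
        // Shortened sample based on the file shown in the question.
        String xml = "<interface name=\"AccountAPING\" version=\"1.0.0\">"
                + "<operation name=\"getDeveloperAppKeys\" since=\"1.0.0\"/>"
                + "<simpleType name=\"Status\" type=\"string\"/>"
                + "</interface>";
        operationNames(xml).forEach(System.out::println);
    }
}
```

This gives access to the data but no typed classes; for real JAXB bindings a reviewed XSD is still needed.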
I have a custom content model I created for Alfresco that has a type with a d:date property. I am able to build the repository and share projects with seemingly no errors. However, I am unable to search by properties that use the data type d:date or d:int. I resolved the d:int problem by changing the data type to d:text and adding a regex constraint, but I'm not sure that would be prudent for the d:date property.
Is there some additional configuration that I need to supply or create in order to search by properties that are not d:text?
Here is a snippet showing the type declaration:
<types>
<!-- Enterprise-wide generic document type -->
<type name="gl:x">
<title>Document</title>
<parent>cm:content</parent>
<properties>
<property name="gl:period">
<type>d:text</type>
</property>
<property name="gl:year">
<type>d:text</type>
<constraints>
<constraint ref="gl:documentYears" />
</constraints>
</property>
<property name="gl:docType">
<type>d:text</type>
<constraints>
<constraint ref="gl:documentTypeList" />
</constraints>
</property>
<property name="gl:date">
<type>d:date</type>
</property>
</properties>
</type>
</types>
The share search forms and properties forms seem to be rendering correctly, so I don't think that there is any problem within those.
The advanced search page accepts two types of parameters.
One is simply the "keywords" field. This performs a full text search, i.e. it looks for the provided keywords in ANY text property. There is no need to configure the full text search for custom types (e.g. your gl:x) - it automatically picks up any text property in any model in the system.
The other is the group of single parameters: name, title, description, mime-type, modified-date, modifier. These properties can be of any type. A d:date property would be perfectly acceptable here, as the modified-date parameter testifies.
But here custom properties are not picked up automatically. They need to be configured explicitly.
Notice that in the upper part of the advanced search page is a drop-down called "Look for" with two options: content and folders. The best approach would be to add an option for your content type gl:x and to configure a search form for it.
You can find the definition of the two standard search forms in tomcat/webapps/share/WEB-INF/classes/alfresco/share-form-config.xml. The file is rather long so here are the two sections to look for:
<config evaluator="model-type" condition="cm:content">
<forms>
<!-- Default Create Content form -->
<form>
</form>
<!-- Document Library Create Google Doc form -->
<form id="doclib-create-googledoc">
</form>
<!-- Search form -->
<form id="search">
</form>
</forms>
</config>
<!-- cm:folder type (creating nodes) -->
<config evaluator="model-type" condition="cm:folder">
<forms>
<!-- Document Library Common form -->
<form id="doclib-common">
</form>
<!-- Search form -->
<form id="search">
</form>
</forms>
</config>
I've skipped the details, but what is important is that "cm:content" and "cm:folder" each defines a <form id="search"> with the desired search properties/parameters.
As an experiment you could modify share-form-config.xml directly and add your own definition:
<config evaluator="model-type" condition="gl:x">
<forms>
<!-- Search form -->
<form id="search">
<field-visibility>
<show id="gl:date" />
</field-visibility>
<appearance>
<field id="gl:date">
<control template="/org/alfresco/components/form/controls/daterange.ftl" />
</field>
</appearance>
</form>
</forms>
</config>
Also you have to add the new search form to the AdvancedSearch configuration found in tomcat/webapps/share/WEB-INF/classes/alfresco/share-config.xml:
<config evaluator="string-compare" condition="AdvancedSearch">
<advanced-search>
<forms>
<form labelId="search.form.label.cm_content" descriptionId="search.form.desc.cm_content">cm:content</form>
<form labelId="search.form.label.cm_folder" descriptionId="search.form.desc.cm_folder">cm:folder</form>
<form labelId="search.form.label.gl_x" descriptionId="search.form.desc.gl_x">gl:x</form>
</forms>
</advanced-search>
</config>
Remember to restart Alfresco after every change.
When you're satisfied with the results, it would be better to move your custom definitions to a separate share-config-custom.xml in your project (share-config.xml and share-form-config.xml should never be modified directly).
For more details: https://wiki.alfresco.com/wiki/Share_Advanced_Search
I used the com.vaadin.tapio.googlemaps.GoogleMap component to connect to Google Maps from Vaadin.
I tried the code below (Vaadin 7.0.2):
public class StoresMainView extends VerticalLayout implements View {
@Override
public void enter(ViewChangeEvent event) {
setSizeFull();
GoogleMap googleMap = new GoogleMap(new LatLon(-27.47101, 153.02429), 10.0, "");
googleMap.setSizeFull();
googleMap.setImmediate(true);
googleMap.setMinZoom(4.0);
addComponent(googleMap);
}
}
But it gives the error below when running. I added the dependency to my pom.
Widgetset does not contain implementation for com.vaadin.tapio.googlemaps.GoogleMap. Check its component connector's @Connect mapping, widgetsets GWT module description file and re-compile your widgetset. In case you have downloaded a vaadin add-on package, you might want to refer to add-on instructions.
In my web.xml I have defined the widgetset as below:
<init-param>
<param-name>widgetset</param-name>
<param-value>com.client.DashboardWidgetSet</param-value>
</init-param>
And my DashboardWidgetSet is as below:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE module PUBLIC "-//Google Inc.//DTD Google Web Toolkit 1.7.0//EN" "http://google-web-toolkit.googlecode.com/svn/tags/1.7.0/distro-source/core/src/gwt-module.dtd">
<module>
<inherits name="com.vaadin.DefaultWidgetSet" />
<inherits name="org.vaadin.cssinject.Cssinject_addonWidgetset" />
<!-- -->
<set-configuration-property name="devModeRedirectEnabled"
value="true" />
<!-- Uncomment the following to compile the widgetset for one browser only.
This can reduce the GWT compilation time significantly when debugging. The
line should be commented out before deployment to production environments.
Multiple browsers can be specified for GWT 1.7 as a comma separated list.
The supported user agents at the moment of writing were: ie6,ie8,gecko,gecko1_8,safari,opera
The value gecko1_8 is used for Firefox 3 and later and safari is used for
webkit based browsers including Google Chrome. -->
<!-- <set-property name="user.agent" value="safari"/> -->
<!-- WidgetSetOptimizer -->
<inherits name="org.vaadin.easyuploads.Widgetset" />
<inherits name="com.vaadin.tapio.googlemaps.WidgetSet" />
</module>
Any help is really appreciated.
You need to make sure that the widgetset init-param in your web.xml points to the right widgetset. The default one does not contain any information about the Google Map component's widgets.
I am scraping data using the Web-Harvest tool, and I am getting the required data, i.e. the name and price of the product.
Here is my config file:
<include path="functions.xml"/>
<!-- collects all tables for individual products -->
<var-def name="products">
<call name="download-multipage-list">
<call-param name="pageUrl">http://www.amazon.de/s/ref=nb_sb_noss?__mk_de_DE=AMAZON&amp;url=search-alias%3Daps&amp;field-keywords=AT300-103%20TEGRA%203%201GB</call-param>
<call-param name="nextXPath">//a[@class="pagnNext"]/@href</call-param>
<call-param name="itemXPath">//div[@class="fstRow prod"]</call-param>
<call-param name="maxloops">10</call-param>
</call>
</var-def>
<!-- iterates over all collected products and extract desired data -->
<file action="write" path="reports/catalog.xml" charset="UTF-8">
<![CDATA[ <catalog> ]]>
<loop item="item" index="i">
<list><var name="products"/></list>
<body>
<xquery>
<xq-param name="item" type="node()"><var name="item"/></xq-param>
<xq-expression><![CDATA[
declare variable $item as node() external;
let $name := data($item//*[@class='lrg bold'])
let $price := data($item//*[@class='bld lrg red'])
return
<product>
<name>{normalize-space($name)}</name>
<price>{normalize-space($price)}</price>
</product>
]]></xq-expression>
</xquery>
</body>
</loop>
<![CDATA[ </catalog> ]]>
</file>
Now I am trying to move this name and price information into a MySQL database table that contains two columns, name and price. I learned that the database tag has to be used for this, but I can't find information on how to use it.
Could you please help me configure that in my config file?
Thanks in advance.
Sahiti
Please go through web-harvest.sourceforge.net/manual.php#database and try to implement it as described there.