Is it possible to crawl ajax-based web sites using Heritrix-3.2.0?
If you intend to make a "copy" of an ajax website, clearly no.
If you want to grab some data by analysing the content of the website, you can customize the crawler with an Extractor that determines which URLs to follow. On most websites you can easily guess the URLs that are interesting for your case without having to interpret the JavaScript. The ajax callbacks are then crawled and handed to the Processor chain. By default, this stores the ajax callback responses in the archive files.
Making your own Extractor looks like this:
import java.io.IOException;
import org.archive.io.ReplayCharSequence;
import org.archive.modules.CrawlURI;
import org.archive.modules.extractor.ContentExtractor;
import org.archive.modules.extractor.Hop;
import org.archive.modules.extractor.LinkContext;
public class MyExtractor extends ContentExtractor {
    @Override
    protected boolean shouldExtract(CrawlURI uri) {
        // examine every fetched document
        return true;
    }
    @Override
    protected boolean innerExtract(CrawlURI curi) {
        try {
            ReplayCharSequence cs = curi.getRecorder().getContentReplayCharSequence();
            // ... analyse the page content cs as a CharSequence ...
            // decide you want to crawl some page with url [uri]
            // (uri is a placeholder for the String you derived from cs) :
            addOutlink( curi, uri, LinkContext.NAVLINK_MISC, Hop.NAVLINK );
        } catch (IOException e) {
            // the recorded content could not be read
            return false;
        }
        return true;
    }
}
Compile, put the jar file in the heritrix/lib directory, and insert a bean referring to MyExtractor in the fetchProcessors chain: basically, duplicate the extractorHtml line in the crawl job cxml file.
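The "analyse the page content" step can often be a plain regular expression over the page text. Here is a minimal sketch using only the JDK, assuming (purely for illustration) that the ajax endpoints show up in the page as quoted paths under /api/. The class name AjaxUrlGuesser and the /api/ convention are made up, not part of Heritrix; each returned string is what you would pass to addOutlink.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AjaxUrlGuesser {
    // Hypothetical convention: ajax endpoints appear as quoted paths starting with /api/.
    private static final Pattern AJAX_PATH =
            Pattern.compile("[\"'](/api/[^\"'\\s]+)[\"']");

    /** Returns the distinct candidate paths found in the content, in order of appearance. */
    public static List<String> findCandidateUrls(CharSequence content) {
        List<String> urls = new ArrayList<>();
        Matcher m = AJAX_PATH.matcher(content);
        while (m.find()) {
            String path = m.group(1);
            if (!urls.contains(path)) {
                urls.add(path);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String page = "<script>$.get('/api/items?page=2', render); "
                + "fetch(\"/api/user/42\");</script>";
        for (String url : findCandidateUrls(page)) {
            System.out.println(url);
        }
    }
}
```

Since ReplayCharSequence is a CharSequence, the cs variable from innerExtract could be handed to findCandidateUrls directly.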
I am writing a custom Java annotator for our UIMA pipeline in Watson Explorer Content Analytics.
There are two places (that I know of) where I can try to get the URL or filename of the document that is currently being processed.
Initialize
public class CustomAnnotator extends JCasAnnotator_ImplBase {
@Override
public void initialize(UimaContext aContext)
throws ResourceInitializationException {
super.initialize(aContext);
.... HERE MAYBE ? ....
Or
Process
@Override
public void process(JCas jcas) throws AnalysisEngineProcessException {
try {
.... HERE ....
I have tried several options:
via the context in the initialize method (running the pipeline on the server, I could get the PearID, for example),
via the Sofa in the process method (e.g. jcas.getSofa().getSofaURI())
I also found SourceDocumentInformation, but this is an example, and although the method getUri() seems promising, I depend on IBM to implement the setUri(String) method...
So far I have not been successful. I hope I have overlooked something...
I asked the same question on IBM dwAnswers.
In short, you can access multiple views when the pipeline runs in the Watson Explorer Content Analytics server. For metadata we need to inspect the _InitialView and not the rlw-view, which is the one that holds all annotations created by the custom pipeline you create in Content Analytics Studio.
More details can be found here; also look at the responses!
https://www.ibm.com/developerworks/community/blogs/ibmandgoogle/entry/Exporting_annotations_from_Watson_Explorer_Content_Analytics?lang=en
I'm new to server-side development, and I'm trying to learn how to use Play for a REST API with Java and Restangular. I created a Java project in IntelliJ. I want the GET request to return a plain HTML page and not a .scala.html template.
How do I change this function so that it returns app/views/index.html instead of app/views/index.scala.html?
Also, if someone knows a good website to learn from, it would help a lot.
The function in Java:
package controllers;
import play.*;
import play.mvc.*;
import views.html.*;
public class Application extends Controller {
public static Result index() {
return ok(index.render("Your new application is ready."));
}
}
The routes file:
# Routes
# This file defines all application routes (Higher priority routes first)
# ~~~~
# Home page
GET / controllers.Application.index
# Map static resources from the /public folder to the /assets URL path
GET /assets/*file controllers.Assets.at(path="/public", file)
Play uses Twirl as its template engine. I am not sure whether you can easily swap it for another template engine, but I think you can. Anyway, as I see it, you are looking for a way to just output a simple HTML file. You can do this with the default Assets controller.
This line in your routes file
GET /assets/*file controllers.Assets.at(path="/public", file)
handles any file in the public directory. So if you add a simple html/hello.html file to the public directory, Play will serve it at the URL /assets/html/hello.html. You can change the URL as you like in the routes file.
I found a similar question with a good answer and a lot of examples.
You can find it here.
The easiest way is to map the GET request for / directly to the target file, like this:
# Home page
GET / controllers.Assets.at(path="/public",file="index.html")
I want to give dynamic input to a JavaFX application from a JSP page. I am not able to find any suitable way.
Dynamic in the sense that I want to give input to the JavaFX application based on user input in a JSP page. I am embedding the JavaFX application in that same JSP page.
Any help is welcome.
I want to give input to the JavaFX application through the JSP page while it is running.
See the JavaFX deployment topic: Accessing a JavaFX Application from a Web Page.
The JavaScript => JavaFX interface in JavaFX is the same as that used for a traditional Java applet - it makes use of a technology known as LiveConnect. Further documentation on using LiveConnect is in the LiveConnect documentation topic: Calling from JavaScript to Java.
The JavaFX documentation provides the following sample code:
Java Code
package testapp;
public class MapApp extends Application {
public static int ZOOM_STREET = 10;
public static class City {
public City(String name) {...}
...
}
public int currentZipCode;
public void navigateTo(City location, int zoomLevel) {...}
....
}
JavaScript Code
function navigateTo(cityName) {
//Assumes that the Ant task uses "myMapApp" as id for this application
var mapApp = document.getElementById("myMapApp");
if (mapApp != null) {
//City is nested class. Therefore classname uses $ char
var city = new mapApp.Packages.testapp.MapApp$City(cityName);
mapApp.navigateTo(city, mapApp.Packages.testapp.MapApp.ZOOM_STREET);
return mapApp.currentZipCode;
}
return "unknown";
}
window.alert("Area zip: " + navigateTo("San Francisco"));
Note the important comment in the JavaScript code "Assumes that the Ant task uses "myMapApp" as id for this application". The id referred to is the placeholderid parameter of the fx:deploy task.
Now, because you are using a JSP, presumably the HTML page containing the application is dynamically generated by the JSP processor. So what you may want to do is use the fx:template task to generate modified JSP source which invokes the dtjava deployment script to embed your target JavaFX application.
I'm not sure, but try: HostServices.getWebContext
I'm trying to call a JavaScript function from my Vaadin Portlet.
Let's say I have an HTML file which is located in my project:
homepage.html
<html>
...
<script type="text/javascript">
...
function foo(msg)
{
alert(msg);
}
...
</script>
...
</html>
The page is embedded in my Portlet via the Vaadin Embedded browser.
How do I call the function foo from my Java application?
Do I need to import/read the homepage.html file and just call it, or is there something else I have to do?
First you need to get the script body;
then you can use javax.script.ScriptEngineManager (from the javax.script package) to solve your problem.
Pseudo code:
import javax.script.*;
ScriptEngine engine =
new ScriptEngineManager().getEngineByName("javascript");
String script = getScript(path_to_html);
engine.eval(script);
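The getScript call in the pseudo code is left undefined. One way to fill it in, as a rough sketch using only the JDK, is to pull the script bodies out of the HTML text with a regular expression. The class name ScriptBodyExtractor is hypothetical, and this version takes the HTML as a String rather than a path. Two caveats: the engine runs on the server, so browser-only functions such as alert are not defined there, and newer JDKs no longer bundle a JavaScript engine, so getEngineByName may return null.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

public class ScriptBodyExtractor {
    // Matches the body of each <script ...>...</script> element, across lines.
    private static final Pattern SCRIPT_TAG = Pattern.compile(
            "<script[^>]*>(.*?)</script>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Concatenates the bodies of all script elements found in the HTML text. */
    public static String getScript(String html) {
        StringBuilder body = new StringBuilder();
        Matcher m = SCRIPT_TAG.matcher(html);
        while (m.find()) {
            body.append(m.group(1)).append('\n');
        }
        return body.toString();
    }

    public static void main(String[] args) throws ScriptException {
        String html = "<html><script type=\"text/javascript\">"
                + "function foo(msg) { return 'got ' + msg; }"
                + "</script></html>";
        String script = getScript(html);

        ScriptEngine engine = new ScriptEngineManager().getEngineByName("javascript");
        if (engine == null) {
            // No JavaScript engine is registered on this JVM.
            System.out.println("No JavaScript engine available; extracted script:");
            System.out.println(script);
            return;
        }
        engine.eval(script);                          // defines foo inside the engine
        System.out.println(engine.eval("foo('hello')"));
    }
}
```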
The simplest way to include an external JavaScript file in a Vaadin application is to override the Application#writeAjaxPageHtmlVaadinScripts method.
To call a JavaScript function from the Vaadin server-side code, call Window#executeJavaScript.
@Override
protected void writeAjaxPageHtmlVaadinScripts(Window window,
String themeName, Application application, BufferedWriter page,
String appUrl, String themeUri, String appId,
HttpServletRequest request) throws ServletException, IOException {
page.write("<script type=\"text/javascript\">\n");
page.write("//<![CDATA[\n");
page.write("document.write(\"<script language='javascript' src='" + appUrl + "/VAADIN/scripts/example.js'><\\/script>\");\n");
page.write("//]]>\n</script>\n");
super.writeAjaxPageHtmlVaadinScripts(window, themeName, application,
page, appUrl, themeUri, appId, request);
}
NB: I have never used Vaadin as a Portlet, but a quick look suggests that this should work OK.
However, this approach is rather rudimentary, and only suitable for a quick hack/proof-of-concept: if you want to do anything more sophisticated, then developing your own Vaadin widget is the correct approach. It gives you the power of GWT and JSNI, and gives you a much finer grain of control. See The Book Of Vaadin for more details.
Refer to the following links; they describe APIs for doing what you want:
http://www.ibm.com/developerworks/java/library/j-5things9/index.html
http://metoojava.wordpress.com/2010/06/20/execute-javascript-from-java/
My cell phone provider offers a limited number of free text messages on their website. I frequently use the service although I hate constantly having a tab open in my browser.
Does anyone know, or can anyone point me in the right direction on, how I could create a jar file/command line utility that fills out the appropriate forms on the site? I've always wanted to code up a project like this in Java, just in case anyone asks why I'm not using something else.
Kind Regards,
Lar
Try with Webdriver from Google or Selenium.
Sounds like you need a framework designed for doing functional testing. These act as browsers and can navigate web sites for testing and automation. You don't need the testing functionality, but it would still serve your needs.
Try HtmlUnit, or LiFT, which is a higher-level abstraction built on HtmlUnit.
Use Watij with the Eclipse IDE. When you're done, compile as an .exe or run with a batch file.
Here is some sample code I wrote for filling in fields for a Google search, which can be adjusted for the web form you want to control:
package goog;
import junit.framework.TestCase;
import watij.runtime.ie.IE;
import static watij.finders.SymbolFactory.*;
public class GTestCases extends TestCase {
private static watij.runtime.ie.IE activeIE_m;
public static IE attachToIE(String url) throws Exception {
if (activeIE_m==null)
{
activeIE_m = new IE();
activeIE_m.start(url);
} else {
activeIE_m.goTo(url);
}
activeIE_m.bringToFront();
return (activeIE_m);
}
public static String getActiveUrl () throws Exception {
String currUrl = activeIE_m.url().toString();
return currUrl;
}
public void testGoogleLogin() throws Exception {
IE ie = attachToIE("http://google.com");
if ( ie.containsText("/Sign in/") ) {
ie.div(id,"guser").link(0).click();
if ( ie.containsText("Sign in with your") ||
ie.containsText("Sign in to iGoogle with your")) {
ie.textField(name,"Email").set("test@gmail.com");
ie.textField(name,"Passwd").set("test");
if ( ie.checkbox(name,"PersistentCookie").checked() ){
ie.checkbox(name,"PersistentCookie").click();
}
ie.button(name,"signIn").click();
}
}
System.out.println("Login finished.");
}
public void testGoogleSearch() throws Exception {
//IE ie = attachToIE( getActiveUrl() );
IE ie = attachToIE( "http://www.google.com/advanced_search?hl=en" );
ie.div(id,"opt-handle").click();
ie.textField(name,"as_q").set("Watij");
ie.selectList(name,"lr").select("English");
ie.button(value,"Advanced Search").click();
System.out.println("Search finished.");
}
public void testGoogleResult() throws Exception {
IE ie = attachToIE( getActiveUrl() );
ie.link(href,"http://groups.google.com/group/watij").click();
System.out.println("Followed link.");
}
}
It depends on how they are sending the form information.
If they are using a simple GET request, all you need to do is fill in the appropriate url parameters.
Otherwise you will need to post the form information to the target page.
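In both cases the field values have to be URL-encoded: for a GET they go after the ? in the URL, for a POST the same string is sent as the request body with content type application/x-www-form-urlencoded. A minimal sketch using only the JDK; the field names to and message and the example URL are made up, the real ones come from inspecting the provider's form.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormEncoder {
    /** Encodes the fields as name=value pairs joined by '&', in insertion order. */
    public static String encode(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        return sb.toString();
    }

    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required to be supported by every JVM.
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // Hypothetical field names; read the real ones from the form's HTML.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("to", "0871234567");
        fields.put("message", "Hi there & bye");

        String encoded = encode(fields);
        // For a GET form, append the encoded string to the form's action URL.
        System.out.println("http://example.com/send?" + encoded);
        // For a POST form, send the encoded string as the request body instead.
        System.out.println(encoded);
    }
}
```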
You could use Watij, which provides a Java/COM interface onto Internet Explorer. Then write a small amount of Java code to navigate the form, insert values and submit.
Alternatively, if it's simple, then check out HttpClient, which is a simple Java HTTP client API.
Whatever you do, watch out that you don't contravene your terms of service (easy during testing - perhaps you should work against a mock interface initially?)
WebTest is yet another webapp testing framework that may be easier to use than the alternatives cited by others.
Check out the Apache Commons Net package. With it you can send a POST request to a page. This is quite low level but may do what you want (if not, you might check out the functional testing suites, but they are probably not as easy to dig into).
As jjnguy says, you'll need to dissect the form to find out all the parameters.
With them you can form your own request using Apache's HTTP Client and fire it off.