I went to the GitHub repository for docx4j and downloaded VariableReplace. When I copied this file into NetBeans, I got an error on line 86 (Cannot find symbol).
Here is the code:
/*
* Copyright 2007-2008, Plutext Pty Ltd.
*
* This file is part of docx4j.
docx4j is licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
import java.util.HashMap;
import org.docx4j.XmlUtils;
import org.docx4j.jaxb.Context;
import org.docx4j.openpackaging.io.SaveToZipFile;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;
import org.docx4j.wml.Document;
/**
* There are at least 3 approaches for replacing variables in
* a docx.
*
* 1. as shows in this example
* 2. using Merge Fields (see org.docx4j.model.fields.merge.MailMerger)
* 3. binding content controls to an XML Part (via XPath)
*
* Approach 3 is the recommended one when using docx4j. See the
* ContentControl* examples, Getting Started, and the subforum.
*
* Approach 1, as shown in this example, works in simple cases
* only. It won't work if your KEY is split across separate
* runs in your docx (which often happens), or if you want
* to insert images, or multiple rows in a table.
*
* You're encouraged to investigate binding content controls
* to an XML part. There is org.docx4j.model.datastorage.migration.FromVariableReplacement
* to automatically convert your templates to this better
* approach.
*
* OK, enough preaching. If you want to use VariableReplace,
* your variables should be appear like so: ${key1}, ${key2}
*
* And if you are having problems with your runs being split,
* VariablePrepare can clean them up.
*
*/
public class VariableReplace {
public static void main(String[] args) throws Exception {
// Exclude context init from timing
org.docx4j.wml.ObjectFactory foo = Context.getWmlObjectFactory();
// Input docx has variables in it: ${colour}, ${icecream}
String inputfilepath = System.getProperty("user.dir") + "/sample-docs/word/unmarshallFromTemplateExample176.docx";
boolean save = false;
String outputfilepath = System.getProperty("user.dir")
+ "/OUT_VariableReplace.docx";
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage
.load(new java.io.File(inputfilepath));
MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();
HashMap<String, String> mappings = new HashMap<String, String>();
mappings.put("colour", "green");
mappings.put("icecream", "chocolate");
long start = System.currentTimeMillis();
// Approach 1 (from 3.0.0; faster if you haven't yet caused unmarshalling to occur):
documentPart.variableReplace(mappings);
/* // Approach 2 (original)
// unmarshallFromTemplate requires string input
String xml = XmlUtils.marshaltoString(documentPart.getJaxbElement(), true);
// Do it...
Object obj = XmlUtils.unmarshallFromTemplate(xml, mappings);
// Inject result into docx
documentPart.setJaxbElement((Document) obj);
*/
long end = System.currentTimeMillis();
long total = end - start;
System.out.println("Time: " + total);
// Save it
if (save) {
SaveToZipFile saver = new SaveToZipFile(wordMLPackage);
saver.save(outputfilepath);
} else {
System.out.println(XmlUtils.marshaltoString(documentPart.getJaxbElement(), true,
true));
}
}
}
Does anyone know why I am getting this error?
Thanks Hrach
EDIT:
I do not get any immediate errors, but when I run the code below, I get these errors:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/log4j/Logger
at org.docx4j.openpackaging.Base.<clinit>(Base.java:42)
at VariableReplace.main(VariableReplace.java:61)
Caused by: java.lang.ClassNotFoundException: org.apache.log4j.Logger
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 2 more
Java Result: 1
I just figured out the problem :). I was using the old samples.
Here is the new code:
/*
* Copyright 2007-2008, Plutext Pty Ltd.
*
* This file is part of docx4j.
docx4j is licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package org.docx4j.samples;
import java.util.HashMap;
import org.docx4j.XmlUtils;
import org.docx4j.openpackaging.io.SaveToZipFile;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;
import org.docx4j.wml.Document;
/**
* There are at least 3 approaches for replacing variables in
* a docx.
*
* 1. as shows in this example
* 2. using Merge Fields (see org.docx4j.model.fields.merge.MailMerger)
* 3. binding content controls to an XML Part (via XPath)
*
* Approach 3 is the recommended one when using docx4j. See the
* ContentControl* examples, Getting Started, and the subforum.
*
* Approach 1, as shown in this example, works in simple cases
* only. It won't work if your KEY is split across separate
* runs in your docx (which often happens), or if you want
* to insert images, or multiple rows in a table.
*
* You're encouraged to investigate binding content controls
* to an XML part.
*
*/
public class VariableReplace {
public static void main(String[] args) throws Exception {
String inputfilepath = System.getProperty("user.dir") + "/sample-docs/word/unmarshallFromTemplateExample.docx";
boolean save = false;
String outputfilepath = System.getProperty("user.dir")
+ "/OUT_VariableReplace.docx";
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage
.load(new java.io.File(inputfilepath));
MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();
// unmarshallFromTemplate requires string input
String xml = XmlUtils.marshaltoString(documentPart.getJaxbElement(), true);
HashMap<String, String> mappings = new HashMap<String, String>();
mappings.put("colour", "green");
mappings.put("icecream", "chocolate");
// Do it...
Object obj = XmlUtils.unmarshallFromTemplate(xml, mappings);
// Inject result into docx
documentPart.setJaxbElement((Document) obj);
// Save it
if (save) {
SaveToZipFile saver = new SaveToZipFile(wordMLPackage);
saver.save(outputfilepath);
} else {
System.out.println(XmlUtils.marshaltoString(documentPart.getJaxbElement(), true,
true));
}
}
}
I'm having a difficult time understanding the concept of .withFilenamePolicy of TextIO.write(). The requirements for supplying a FilenamePolicy seem incredibly complex for something as simple as specifying a GCS bucket to write streamed files to.
At a high level, I have JSON messages being streamed to a PubSub topic, and I'd like to write those raw messages to files in GCS for permanent storage (I'll also be doing other processing on the messages). I initially started with this Pipeline, thinking it would be pretty simple:
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);
p.apply("Read From PubSub", PubsubIO.readStrings().fromTopic(topic))
.apply("Write to GCS", TextIO.write().to(gcs_bucket));
p.run();
}
I got an error about needing windowed writes, which I applied, and then another about needing a FilenamePolicy. This is where things get hairy.
I went to the Beam docs and checked out FilenamePolicy. It looks like I would need to extend this class, which in turn requires extending other abstract classes to make this work. Unfortunately the Apache documentation is a bit scant and I can't find any examples for Dataflow 2.0 doing this, except for the WordCount example, which even then implements these details in a helper class.
So I could probably make this work just by copying much of the WordCount example, but I'm trying to better understand the details of this. A few questions I have:
1) Is there any roadmap item to abstract away a lot of this complexity? It seems like I should be able to supply a GCS bucket like I would in a non-windowed write, and then just supply a few basic options like the timing and file naming rule. I know writing streaming windowed data to files is more complex than just opening a file pointer (or object storage equivalent).
2) It looks like to make this work, I need to create a WindowedContext object, which requires supplying a BoundedWindow abstract class, a PaneInfo object, and then some shard info. The information available for these is pretty bare and I'm having a hard time knowing what is actually needed for all of these, especially given my simple use case. Are there any good examples available that implement these? In addition, it also looks like I need to set the number of shards as part of TextIO.write, but then also supply the number of shards as part of the FilenamePolicy?
Thanks for any help in understanding the details behind this; I'm hoping to learn a few things!
Edit 7/20/17
So I finally got this pipeline to run by extending FilenamePolicy. My challenge was needing to define the window of the streaming data from PubSub. Here is a pretty close representation of the code:
public class ReadData {
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);
p.apply("Read From PubSub", PubsubIO.readStrings().fromTopic(topic))
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
.apply("Write to GCS", TextIO.write().to("gcs_bucket")
.withWindowedWrites()
.withFilenamePolicy(new TestPolicy())
.withNumShards(10));
p.run();
}
}
class TestPolicy extends FileBasedSink.FilenamePolicy {
@Override
public ResourceId windowedFilename(
ResourceId outputDirectory, WindowedContext context, String extension) {
IntervalWindow window = (IntervalWindow) context.getWindow();
String filename = String.format(
"%s-%s-%s-%s-of-%s.json",
"test",
window.start().toString(),
window.end().toString(),
context.getShardNumber(),
context.getNumShards() // total shard count for the "of" part of the name
);
return outputDirectory.resolve(filename, ResolveOptions.StandardResolveOptions.RESOLVE_FILE);
}
@Override
public ResourceId unwindowedFilename(
ResourceId outputDirectory, Context context, String extension) {
throw new UnsupportedOperationException("Unsupported.");
}
}
Below is an example for Beam 2.0 that writes the raw messages from PubSub out into windowed files on GCS. The pipeline is fairly configurable: you can specify the window duration via a parameter, and a sub-directory policy if you want logical subsections of your data for ease of reprocessing or archiving. Note that this has an additional dependency on Apache Commons Lang 3.
PubSubToGcs
/**
* This pipeline ingests incoming data from a Cloud Pub/Sub topic and
* outputs the raw data into windowed files at the specified output
* directory.
*/
public class PubsubToGcs {
/**
* Options supported by the pipeline.
*
* <p>Inherits standard configuration options.</p>
*/
public static interface Options extends DataflowPipelineOptions, StreamingOptions {
@Description("The Cloud Pub/Sub topic to read from.")
@Required
ValueProvider<String> getTopic();
void setTopic(ValueProvider<String> value);
@Description("The directory to output files to. Must end with a slash.")
@Required
ValueProvider<String> getOutputDirectory();
void setOutputDirectory(ValueProvider<String> value);
@Description("The filename prefix of the files to write to.")
@Default.String("output")
@Required
ValueProvider<String> getOutputFilenamePrefix();
void setOutputFilenamePrefix(ValueProvider<String> value);
@Description("The shard template of the output file. Specified as repeating sequences "
+ "of the letters 'S' or 'N' (example: SSS-NNN). These are replaced with the "
+ "shard number, or number of shards respectively")
@Default.String("")
ValueProvider<String> getShardTemplate();
void setShardTemplate(ValueProvider<String> value);
@Description("The suffix of the files to write.")
@Default.String("")
ValueProvider<String> getOutputFilenameSuffix();
void setOutputFilenameSuffix(ValueProvider<String> value);
@Description("The sub-directory policy which files will use when output per window.")
@Default.Enum("NONE")
SubDirectoryPolicy getSubDirectoryPolicy();
void setSubDirectoryPolicy(SubDirectoryPolicy value);
@Description("The window duration in which data will be written. Defaults to 5m. "
+ "Allowed formats are: "
+ "Ns (for seconds, example: 5s), "
+ "Nm (for minutes, example: 12m), "
+ "Nh (for hours, example: 2h).")
@Default.String("5m")
String getWindowDuration();
void setWindowDuration(String value);
@Description("The maximum number of output shards produced when writing.")
@Default.Integer(10)
Integer getNumShards();
void setNumShards(Integer value);
}
/**
* Main entry point for executing the pipeline.
* @param args The command-line arguments to the pipeline.
*/
public static void main(String[] args) {
Options options = PipelineOptionsFactory
.fromArgs(args)
.withValidation()
.as(Options.class);
run(options);
}
/**
* Runs the pipeline with the supplied options.
*
* @param options The execution parameters to the pipeline.
* @return The result of the pipeline execution.
*/
public static PipelineResult run(Options options) {
// Create the pipeline
Pipeline pipeline = Pipeline.create(options);
/**
* Steps:
* 1) Read string messages from PubSub
* 2) Window the messages into minute intervals specified by the executor.
* 3) Output the windowed files to GCS
*/
pipeline
.apply("Read PubSub Events",
PubsubIO
.readStrings()
.fromTopic(options.getTopic()))
.apply(options.getWindowDuration() + " Window",
Window
.into(FixedWindows.of(parseDuration(options.getWindowDuration()))))
.apply("Write File(s)",
TextIO
.write()
.withWindowedWrites()
.withNumShards(options.getNumShards())
.to(options.getOutputDirectory())
.withFilenamePolicy(
new WindowedFilenamePolicy(
options.getOutputFilenamePrefix(),
options.getShardTemplate(),
options.getOutputFilenameSuffix())
.withSubDirectoryPolicy(options.getSubDirectoryPolicy())));
// Execute the pipeline and return the result.
PipelineResult result = pipeline.run();
return result;
}
/**
* Parses a duration from a period formatted string. Values
* are accepted in the following formats:
* <p>
* Ns - Seconds. Example: 5s<br>
* Nm - Minutes. Example: 13m<br>
* Nh - Hours. Example: 2h
*
* <pre>
* parseDuration(null) = NullPointerException()
* parseDuration("") = Duration.standardSeconds(0)
* parseDuration("2s") = Duration.standardSeconds(2)
* parseDuration("5m") = Duration.standardMinutes(5)
* parseDuration("3h") = Duration.standardHours(3)
* </pre>
*
* @param value The period value to parse.
* @return The {@link Duration} parsed from the supplied period string.
*/
private static Duration parseDuration(String value) {
Preconditions.checkNotNull(value, "The specified duration must be a non-null value!");
PeriodParser parser = new PeriodFormatterBuilder()
.appendSeconds().appendSuffix("s")
.appendMinutes().appendSuffix("m")
.appendHours().appendSuffix("h")
.toParser();
MutablePeriod period = new MutablePeriod();
parser.parseInto(period, value, 0, Locale.getDefault());
Duration duration = period.toDurationFrom(new DateTime(0));
return duration;
}
}
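For readers who want to try the Ns / Nm / Nh window-duration format without Joda-Time, here is a minimal sketch using only java.time and a regular expression. It is not the code from the answer above: the class name WindowDurationSketch is hypothetical, it accepts exactly one number followed by one unit, and unlike the Joda-based parser described in the Javadoc it rejects an empty string instead of returning a zero duration.

```java
import java.time.Duration;
import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for parseDuration, built on java.time instead of Joda-Time. */
final class WindowDurationSketch {
    // One or more digits followed by a single unit letter: s, m or h.
    private static final Pattern FORMAT = Pattern.compile("(\\d+)([smh])");

    private WindowDurationSketch() {}

    /** Parses strings such as "5s", "12m" or "2h" into a Duration. */
    static Duration parse(String value) {
        Objects.requireNonNull(value, "The specified duration must be a non-null value!");
        Matcher m = FORMAT.matcher(value.trim());
        if (!m.matches()) {
            // Assumption: anything outside the three documented formats is an error.
            throw new IllegalArgumentException("Expected Ns, Nm or Nh but got: " + value);
        }
        long amount = Long.parseLong(m.group(1));
        switch (m.group(2)) {
            case "s": return Duration.ofSeconds(amount);
            case "m": return Duration.ofMinutes(amount);
            default:  return Duration.ofHours(amount); // only "h" remains after the regex match
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("5m")); // prints PT5M
    }
}
```

Failing fast on malformed input is a deliberate choice here, since a silently zero-length window is hard to diagnose once the pipeline is running.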
WindowedFilenamePolicy
/**
* The {@link WindowedFilenamePolicy} class will output files
* to the specified location with a format of output-yyyyMMdd'T'HHmmssZ-001-of-100.txt.
*/
@SuppressWarnings("serial")
public class WindowedFilenamePolicy extends FilenamePolicy {
/**
* Possible sub-directory creation modes.
*/
public static enum SubDirectoryPolicy {
NONE("."),
PER_HOUR("yyyy-MM-dd/HH"),
PER_DAY("yyyy-MM-dd");
private final String subDirectoryPattern;
private SubDirectoryPolicy(String subDirectoryPattern) {
this.subDirectoryPattern = subDirectoryPattern;
}
public String getSubDirectoryPattern() {
return subDirectoryPattern;
}
public String format(Instant instant) {
DateTimeFormatter formatter = DateTimeFormat.forPattern(subDirectoryPattern);
return formatter.print(instant);
}
}
/**
* The formatter used to format the window timestamp for outputting to the filename.
*/
private static final DateTimeFormatter formatter = ISODateTimeFormat
.basicDateTimeNoMillis()
.withZone(DateTimeZone.getDefault());
/**
* The filename prefix.
*/
private final ValueProvider<String> prefix;
/**
* The filename suffix.
*/
private final ValueProvider<String> suffix;
/**
* The shard template used during file formatting.
*/
private final ValueProvider<String> shardTemplate;
/**
* The policy which dictates when or if sub-directories are created
* for the windowed file output.
*/
private ValueProvider<SubDirectoryPolicy> subDirectoryPolicy = StaticValueProvider.of(SubDirectoryPolicy.NONE);
/**
* Constructs a new {@link WindowedFilenamePolicy} with the
* supplied prefix used for output files.
*
* @param prefix The prefix to append to all files output by the policy.
* @param shardTemplate The template used to create uniquely named sharded files.
* @param suffix The suffix to append to all files output by the policy.
*/
public WindowedFilenamePolicy(String prefix, String shardTemplate, String suffix) {
this(StaticValueProvider.of(prefix),
StaticValueProvider.of(shardTemplate),
StaticValueProvider.of(suffix));
}
/**
* Constructs a new {@link WindowedFilenamePolicy} with the
* supplied prefix used for output files.
*
* @param prefix The prefix to append to all files output by the policy.
* @param shardTemplate The template used to create uniquely named sharded files.
* @param suffix The suffix to append to all files output by the policy.
*/
public WindowedFilenamePolicy(
ValueProvider<String> prefix,
ValueProvider<String> shardTemplate,
ValueProvider<String> suffix) {
this.prefix = prefix;
this.shardTemplate = shardTemplate;
this.suffix = suffix;
}
/**
* The subdirectory policy will create sub-directories on the
* filesystem based on the window which has fired.
*
* @param policy The subdirectory policy to apply.
* @return The filename policy instance.
*/
public WindowedFilenamePolicy withSubDirectoryPolicy(SubDirectoryPolicy policy) {
return withSubDirectoryPolicy(StaticValueProvider.of(policy));
}
/**
* The subdirectory policy will create sub-directories on the
* filesystem based on the window which has fired.
*
* @param policy The subdirectory policy to apply.
* @return The filename policy instance.
*/
public WindowedFilenamePolicy withSubDirectoryPolicy(ValueProvider<SubDirectoryPolicy> policy) {
this.subDirectoryPolicy = policy;
return this;
}
/**
* The windowed filename method will construct filenames per window in the
* format of output-yyyyMMdd'T'HHmmss-001-of-100.txt.
*/
@Override
public ResourceId windowedFilename(ResourceId outputDirectory, WindowedContext c, String extension) {
Instant windowInstant = c.getWindow().maxTimestamp();
String datetimeStr = formatter.print(windowInstant.toDateTime());
// Remove the prefix when it is null so we don't append the literal 'null'
// to the start of the filename
String filenamePrefix = prefix.get() == null ? datetimeStr : prefix.get() + "-" + datetimeStr;
String filename = DefaultFilenamePolicy.constructName(
filenamePrefix,
shardTemplate.get(),
StringUtils.defaultIfBlank(suffix.get(), extension), // Ignore the extension in favor of the suffix.
c.getShardNumber(),
c.getNumShards());
String subDirectory = subDirectoryPolicy.get().format(windowInstant);
return outputDirectory
.resolve(subDirectory, StandardResolveOptions.RESOLVE_DIRECTORY)
.resolve(filename, StandardResolveOptions.RESOLVE_FILE);
}
/**
* Unwindowed writes are unsupported by this filename policy so an {@link UnsupportedOperationException}
* will be thrown if invoked.
*/
@Override
public ResourceId unwindowedFilename(ResourceId outputDirectory, Context c, String extension) {
throw new UnsupportedOperationException("There is no windowed filename policy for unwindowed file"
+ " output. Please use the WindowedFilenamePolicy with windowed writes or switch filename policies.");
}
}
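The shard template option described above ("repeating sequences of the letters 'S' or 'N'") can be illustrated with a small plain-Java sketch. This is not Beam's implementation and does not call DefaultFilenamePolicy.constructName; the class ShardTemplateSketch and its expand method are hypothetical names, and the sketch assumes each run of S becomes the zero-padded shard number and each run of N the zero-padded shard count, padded to the length of the run.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of S/N shard-template expansion; not Beam's own code. */
final class ShardTemplateSketch {
    // A run of one or more 'S' characters, or a run of one or more 'N' characters.
    private static final Pattern RUN = Pattern.compile("S+|N+");

    private ShardTemplateSketch() {}

    /** Replaces each S-run with shardNumber and each N-run with numShards, zero-padded to the run length. */
    static String expand(String template, int shardNumber, int numShards) {
        Matcher m = RUN.matcher(template);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(template, last, m.start()); // copy literal text before the run
            int value = m.group().charAt(0) == 'S' ? shardNumber : numShards;
            out.append(String.format("%0" + m.group().length() + "d", value));
            last = m.end();
        }
        out.append(template, last, template.length()); // copy trailing literal text
        return out.toString();
    }

    public static void main(String[] args) {
        // prints output-001-of-100.txt
        System.out.println("output" + expand("-SSS-of-NNN", 1, 100) + ".txt");
    }
}
```

Note that any literal uppercase S or N in the template is treated as a placeholder in this sketch, so literal text should avoid those letters.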
In current Beam, the DefaultFilenamePolicy supports windowed writes, so there's no need to write a custom FilenamePolicy. You can control the output filename by putting W and P placeholders (for the window and pane respectively) in the filename template. This exists at HEAD of the Beam repository, and will also be in the upcoming Beam 2.1 release (which is being released as we speak).
I am using Pentaho Report Designer to generate reports. My data source is a Mondrian OLAP cube. Now I want to integrate my reports with Java. I downloaded the Pentaho Reporting SDK, modified the existing sample Java program, and provided the path of my .prpt file. But I am getting the following error:
org.pentaho.reporting.libraries.resourceloader.ResourceCreationException: Unable to parse the document: ResourceKey{schema=org.pentaho.reporting.libraries.docbundle.bundleloader.ZipResourceBundleLoader, identifier=content.xml, factoryParameters={org.pentaho.reporting.libraries.resourceloader.FactoryParameterKey{name=repository}=org.pentaho.reporting.libraries.repository.zipreader.ZipReadRepository@ef028b, org.pentaho.reporting.libraries.resourceloader.FactoryParameterKey{name=repository-loader}=org.pentaho.reporting.libraries.docbundle.bundleloader.ZipResourceBundleLoader@19007a5}, parent=ResourceKey{schema=org.pentaho.reporting.libraries.resourceloader.loader.URLResourceLoader, identifier=file:/C:/Users/devang/workspace/samples/eclipse-bin/org/pentaho/reporting/engine/classic/samples/anor_admin.prpt, factoryParameters={}, parent=null}}
at org.pentaho.reporting.libraries.xmlns.parser.AbstractXmlResourceFactory.create(AbstractXmlResourceFactory.java:249)
at org.pentaho.reporting.libraries.resourceloader.DefaultResourceManagerBackend.create(DefaultResourceManagerBackend.java:272)
at org.pentaho.reporting.libraries.resourceloader.ResourceManager.create(ResourceManager.java:411)
at org.pentaho.reporting.libraries.resourceloader.ResourceManager.create(ResourceManager.java:370)
at org.pentaho.reporting.libraries.resourceloader.ResourceManager.createDirectly(ResourceManager.java:207)
at org.pentaho.reporting.engine.classic.samples.Sample1.getReportDefinition(Sample1.java:70)
at org.pentaho.reporting.engine.classic.samples.AbstractReportGenerator.generateReport(AbstractReportGenerator.java:160)
at org.pentaho.reporting.engine.classic.samples.AbstractReportGenerator.generateReport(AbstractReportGenerator.java:128)
at org.pentaho.reporting.engine.classic.samples.Sample1.main(Sample1.java:132)
ParentException:
org.pentaho.reporting.libraries.xmlns.parser.ParseException: Failure while loading data: datadefinition.xml [Location: Line=5 Column=11]
at org.pentaho.reporting.libraries.xmlns.parser.AbstractXmlReadHandler.performExternalParsing(AbstractXmlReadHandler.java:337)
at org.pentaho.reporting.engine.classic.core.modules.parser.bundle.content.ContentRootElementHandler.parseDataDefinition(ContentRootElementHandler.java:290)
at org.pentaho.reporting.engine.classic.core.modules.parser.bundle.content.ContentRootElementHandler.parseLocalFiles(ContentRootElementHandler.java:242)
at org.pentaho.reporting.engine.classic.core.modules.parser.bundle.content.ContentRootElementHandler.doneParsing(ContentRootElementHandler.java:236)
at org.pentaho.reporting.libraries.xmlns.parser.AbstractXmlReadHandler.endElement(AbstractXmlReadHandler.java:163)
at org.pentaho.reporting.libraries.xmlns.parser.RootXmlReadHandler.endElement(RootXmlReadHandler.java:586)
at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.pentaho.reporting.libraries.xmlns.parser.AbstractXmlResourceFactory.create(AbstractXmlResourceFactory.java:236)
at org.pentaho.reporting.libraries.resourceloader.DefaultResourceManagerBackend.create(DefaultResourceManagerBackend.java:272)
at org.pentaho.reporting.libraries.resourceloader.ResourceManager.create(ResourceManager.java:411)
at org.pentaho.reporting.libraries.resourceloader.ResourceManager.create(ResourceManager.java:370)
at org.pentaho.reporting.libraries.resourceloader.ResourceManager.createDirectly(ResourceManager.java:207)
at org.pentaho.reporting.engine.classic.samples.Sample1.getReportDefinition(Sample1.java:70)
at org.pentaho.reporting.engine.classic.samples.AbstractReportGenerator.generateReport(AbstractReportGenerator.java:160)
at org.pentaho.reporting.engine.classic.samples.AbstractReportGenerator.generateReport(AbstractReportGenerator.java:128)
at org.pentaho.reporting.engine.classic.samples.Sample1.main(Sample1.java:132)
My program is:
/*
* This program is free software; you can redistribute it and/or modify it under the
* terms of the GNU Lesser General Public License, version 2.1 as published by the Free Software
* Foundation.
*
* You should have received a copy of the GNU Lesser General Public License along with this
* program; if not, you can obtain a copy at http://www.gnu.org/licenses/old-licenses/lgpl-2.1.html
* or from the Free Software Foundation, Inc.,
* 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
*
* This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
* without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
* See the GNU Lesser General Public License for more details.
*
* Copyright (c) 2009 Pentaho Corporation.. All rights reserved.
*/
package org.pentaho.reporting.engine.classic.samples;
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.Map;
import java.util.HashMap;
import org.pentaho.reporting.engine.classic.core.DataFactory;
import org.pentaho.reporting.engine.classic.core.MasterReport;
import org.pentaho.reporting.engine.classic.core.ReportProcessingException;
import org.pentaho.reporting.libraries.resourceloader.Resource;
import org.pentaho.reporting.libraries.resourceloader.ResourceException;
import org.pentaho.reporting.libraries.resourceloader.ResourceManager;
/**
* Generates a report in the following scenario:
* <ol>
* <li>The report definition file is a .prpt file which will be loaded and parsed
* <li>The data factory is a simple JDBC data factory using HSQLDB
* <li>There are no runtime report parameters used
* </ol>
*/
public class Sample1 extends AbstractReportGenerator
{
/**
* Default constructor for this sample report generator
*/
public Sample1()
{
}
/**
* Returns the report definition which will be used to generate the report. In this case, the report will be
* loaded and parsed from a file contained in this package.
*
* @return the loaded and parsed report definition to be used in report generation.
*/
public MasterReport getReportDefinition()
{
try
{
// Using the classloader, get the URL to the reportDefinition file
final ClassLoader classloader = this.getClass().getClassLoader();
final URL reportDefinitionURL = classloader.getResource("org/pentaho/reporting/engine/classic/samples/anor_admin.prpt");
// Parse the report file
final ResourceManager resourceManager = new ResourceManager();
resourceManager.registerDefaults();
final Resource directly = resourceManager.createDirectly(reportDefinitionURL, MasterReport.class);
return (MasterReport) directly.getResource();
}
catch (ResourceException e)
{
e.printStackTrace();
}
return null;
}
/**
* Returns the data factory which will be used to generate the data used during report generation. In this example,
* we will return null since the data factory has been defined in the report definition.
*
* @return the data factory used with the report generator
*/
public DataFactory getDataFactory()
{
return null;
}
/**
* Returns the set of runtime report parameters. This sample report uses the following three parameters:
* <ul>
* <li><b>Report Title</b> - The title text on the top of the report</li>
* <li><b>Customer Names</b> - an array of customer names to show in the report</li>
* <li><b>Col Headers BG Color</b> - the background color for the column headers</li>
* </ul>
*
* @return <code>null</code> indicating the report generator does not use any report parameters
*/
public Map<String, Object> getReportParameters()
{
final Map<String, Object> parameters = new HashMap<String, Object>();
parameters.put("stday", 28);
parameters.put("styear", 2012);
parameters.put("stmonth", 10);
parameters.put("eday", 10);
parameters.put("eyear", 2012);
parameters.put("emonth", 10);
return parameters;
}
/**
* Simple command line application that will generate a PDF version of the report. In this report,
* the report definition has already been created with the Pentaho Report Designer application and
* it located in the same package as this class. The data query is located in that report definition
* as well, and there are a few report-modifying parameters that will be passed to the engine at runtime.
* <p/>
* The output of this report will be a PDF file located in the current directory and will be named
* <code>SimpleReportGeneratorExample.pdf</code>.
*
* @param args none
* @throws IOException indicates an error writing to the filesystem
* @throws ReportProcessingException indicates an error generating the report
*/
public static void main(String[] args) throws IOException, ReportProcessingException
{
// Create an output filename
final File outputFilename = new File(Sample1.class.getSimpleName() + ".pdf");
// Generate the report
new Sample1().generateReport(AbstractReportGenerator.OutputType.PDF, outputFilename);
// Output the location of the file
System.err.println("Generated the report [" + outputFilename.getAbsolutePath() + "]");
}
}
So my general question is "Is it possible to have an Accumulo BatchScanner only pull back the first result per Range I give it?"
Now some details about my use case as there may be a better way to approach this anyway. I have data that represent messages from different systems. There can be different types of messages. My users want to be able to ask the system questions, such as "give me the most recent message of a certain type as of a certain time for all these systems".
My table layout looks like this
rowid: system_name, family: message_type, qualifier: masked_timestamp, value: message_text
The idea is that the user gives me a list of systems they care about, the type of message, and a certain timestamp. I used masked timestamp so that the table sorts most recent first. That way when I scan for a timestamp, the first result is the most recent prior to that time. I am using a BatchScanner because I have multiple systems I am searching for per query. Can I make the BatchScanner only fetch the first result for each Range? I can't specify a specific key because the most recent may not match the datetime given by the user.
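To make the masked-timestamp idea concrete, here is a minimal plain-Java sketch. It does not use the Accumulo API; the class MaskedTimestamp and its methods are hypothetical, and the masking scheme (Long.MAX_VALUE minus the epoch milliseconds, zero-padded to 19 digits) is an assumption about one common way to do it, not necessarily the exact encoding used in the table above. A TreeSet stands in for the sorted column qualifiers within one row and family.

```java
import java.util.TreeSet;

/** Hypothetical helper: encodes timestamps so that newer values sort first lexicographically. */
final class MaskedTimestamp {
    private MaskedTimestamp() {}

    /** Returns (Long.MAX_VALUE - epochMillis), zero-padded to 19 digits so string order matches numeric order. */
    static String mask(long epochMillis) {
        if (epochMillis < 0) {
            throw new IllegalArgumentException("epochMillis must be non-negative");
        }
        return String.format("%019d", Long.MAX_VALUE - epochMillis);
    }

    /** Inverse of mask(). */
    static long unmask(String masked) {
        return Long.MAX_VALUE - Long.parseLong(masked);
    }

    public static void main(String[] args) {
        // The sorted set plays the role of the sorted qualifiers in one row/family.
        TreeSet<String> qualifiers = new TreeSet<>();
        for (long t : new long[] {1000L, 3000L, 2000L}) {
            qualifiers.add(mask(t));
        }
        // The first entry is the newest message overall: prints 3000
        System.out.println(unmask(qualifiers.first()));
        // Seeking to mask(2500) lands on the most recent message at or before 2500: prints 2000
        System.out.println(unmask(qualifiers.ceiling(mask(2500L))));
    }
}
```

Because every masked value has the same length, seeking to the masked query time and taking the first entry at or after it gives the latest message as of that time, which is the behaviour the range scan relies on.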
Currently, I am using the BatchScanner and ignoring all but the first result per Key. It works right now, but it seems like a waste to pull back all the data for a specific system/type over the network when I only care about the first result per system/type.
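To make the masked-timestamp idea concrete, here is a minimal sketch in plain Java with no Accumulo involved. It assumes the mask is Long.MAX_VALUE minus the epoch milliseconds, zero-padded to 19 digits; the question does not say how the mask is actually computed, so the class and method names below are hypothetical. A TreeMap stands in for the sorted qualifiers of one system/type.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch of a "masked" (reversed) timestamp qualifier; all names are hypothetical. */
class MaskedTimestamp {

    /** Encodes non-negative epoch millis so that later times sort lexicographically first. */
    static String mask(long epochMillis) {
        // Long.MAX_VALUE has 19 decimal digits; zero-padding makes string order match numeric order.
        return String.format("%019d", Long.MAX_VALUE - epochMillis);
    }

    /** Decodes a masked qualifier back to epoch millis. */
    static long unmask(String masked) {
        return Long.MAX_VALUE - Long.parseLong(masked);
    }

    /** Returns the most recent entry at or before asOfMillis, or null if there is none. */
    static Map.Entry<String, String> mostRecentAsOf(TreeMap<String, String> byMaskedTs, long asOfMillis) {
        // The smallest key >= mask(asOf) belongs to the newest message that is not later than asOf.
        return byMaskedTs.ceilingEntry(mask(asOfMillis));
    }

    public static void main(String[] args) {
        TreeMap<String, String> messages = new TreeMap<>();
        messages.put(mask(1000L), "oldest");
        messages.put(mask(2000L), "middle");
        messages.put(mask(3000L), "newest");
        // Asking "as of 2500" should yield the message written at 2000.
        System.out.println(mostRecentAsOf(messages, 2500L).getValue());
    }
}
```

With this encoding, a scan that starts at the masked form of the requested time only needs its first result, which is why fetching a single entry per Range would be enough.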
EDIT
My attempt using the FirstEntryInRowIterator
@Test
public void testFirstEntryIterator() throws Exception
{
Connector connector = new MockInstance("inst").getConnector("user", new PasswordToken("password"));
connector.tableOperations().create("testing");
BatchWriter writer = writer(connector, "testing");
writer.addMutation(mutation("row", "fam", "qual1", "val1"));
writer.addMutation(mutation("row", "fam", "qual2", "val2"));
writer.addMutation(mutation("row", "fam", "qual3", "val3"));
writer.close();
Scanner scanner = connector.createScanner("testing", new Authorizations());
scanner.addScanIterator(new IteratorSetting(50, FirstEntryInRowIterator.class));
Key begin = new Key("row", "fam", "qual2");
scanner.setRange(new Range(begin, begin.followingKey(PartialKey.ROW_COLFAM_COLQUAL)));
int numResults = 0;
for (Map.Entry<Key, Value> entry : scanner)
{
Assert.assertEquals("qual2", entry.getKey().getColumnQualifier().toString());
numResults++;
}
Assert.assertEquals(1, numResults);
}
My goal is that the returned entry will be the ("row", "fam", "qual2", "val2") but I get 0 results. It almost seems like the Iterator is being applied before the Range maybe? I haven't dug into this yet.
This sounds like a good use case for using one of Accumulo's SortedKeyValueIterators, specifically the FirstEntryInRowIterator (contained in the accumulo-core artifact).
Create an IteratorSetting with the FirstEntryInRowIterator and add it to your BatchScanner. This returns the first Key/Value for each system_name row and then stops, which avoids the overhead of your client receiving and discarding all the other results.
A quick modification of the FirstEntryInRowIterator might get you what you want:
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.accumulo.core.iterators;
import java.io.IOException;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.data.ByteSequence;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.PartialKey;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;
public class FirstEntryInRangeIterator extends SkippingIterator implements OptionDescriber {
// options
static final String NUM_SCANS_STRING_NAME = "scansBeforeSeek";
// iterator predecessor seek options to pass through
private Range latestRange;
private Collection<ByteSequence> latestColumnFamilies;
private boolean latestInclusive;
// private fields
private Text lastRowFound;
private int numscans;
/**
* convenience method to set the option to optimize the frequency of scans vs. seeks
*/
public static void setNumScansBeforeSeek(IteratorSetting cfg, int num) {
cfg.addOption(NUM_SCANS_STRING_NAME, Integer.toString(num));
}
// this must be public for OptionsDescriber
public FirstEntryInRangeIterator() {
super();
}
public FirstEntryInRangeIterator(FirstEntryInRangeIterator other, IteratorEnvironment env) {
super();
setSource(other.getSource().deepCopy(env));
}
@Override
public SortedKeyValueIterator<Key,Value> deepCopy(IteratorEnvironment env) {
return new FirstEntryInRangeIterator(this, env);
}
@Override
public void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException {
super.init(source, options, env);
String o = options.get(NUM_SCANS_STRING_NAME);
numscans = o == null ? 10 : Integer.parseInt(o);
}
// this is only ever called immediately after getting "next" entry
@Override
protected void consume() throws IOException {
if (finished == true || lastRowFound == null)
return;
int count = 0;
while (getSource().hasTop() && lastRowFound.equals(getSource().getTopKey().getRow())) {
// try to efficiently jump to the next matching key
if (count < numscans) {
++count;
getSource().next(); // scan
} else {
// too many scans, just seek
count = 0;
// determine where to seek to, but don't go beyond the user-specified range
Key nextKey = getSource().getTopKey().followingKey(PartialKey.ROW);
if (!latestRange.afterEndKey(nextKey))
getSource().seek(new Range(nextKey, true, latestRange.getEndKey(), latestRange.isEndKeyInclusive()), latestColumnFamilies, latestInclusive);
else {
finished = true;
break;
}
}
}
lastRowFound = getSource().hasTop() ? getSource().getTopKey().getRow(lastRowFound) : null;
}
private boolean finished = true;
@Override
public boolean hasTop() {
return !finished && getSource().hasTop();
}
@Override
public void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException {
// save parameters for future internal seeks
latestRange = range;
latestColumnFamilies = columnFamilies;
latestInclusive = inclusive;
lastRowFound = null;
super.seek(range, columnFamilies, inclusive);
finished = false;
if (getSource().hasTop()) {
lastRowFound = getSource().getTopKey().getRow();
if (range.beforeStartKey(getSource().getTopKey()))
consume();
}
}
@Override
public IteratorOptions describeOptions() {
String name = "firstEntry";
String desc = "Only allows iteration over the first entry per range";
HashMap<String,String> namedOptions = new HashMap<String,String>();
namedOptions.put(NUM_SCANS_STRING_NAME, "Number of scans to try before seeking [10]");
return new IteratorOptions(name, desc, namedOptions, null);
}
@Override
public boolean validateOptions(Map<String,String> options) {
try {
String o = options.get(NUM_SCANS_STRING_NAME);
if (o != null)
Integer.parseInt(o);
} catch (Exception e) {
throw new IllegalArgumentException("bad integer " + NUM_SCANS_STRING_NAME + ":" + options.get(NUM_SCANS_STRING_NAME), e);
}
return true;
}
}
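The behaviour the question asks for, namely "return only the first entry at or after each range's start key", can be modelled outside Accumulo with a sorted map. The following is only a simulation built on java.util.TreeMap, not Accumulo code: KeyRange and firstPerRange are hypothetical names, keys are flattened to plain strings, and the range is treated as half-open [start, end).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** TreeMap-based model of "first entry per range"; this is not the Accumulo API. */
class FirstEntryPerRange {

    /** A half-open key range [start, end), standing in for an Accumulo Range. */
    static final class KeyRange {
        final String start;
        final String end;

        KeyRange(String start, String end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Returns at most one entry per range: the first key >= start that is still < end. */
    static List<Map.Entry<String, String>> firstPerRange(NavigableMap<String, String> table,
                                                         List<KeyRange> ranges) {
        List<Map.Entry<String, String>> hits = new ArrayList<>();
        for (KeyRange range : ranges) {
            Map.Entry<String, String> first = table.ceilingEntry(range.start);
            if (first != null && first.getKey().compareTo(range.end) < 0) {
                hits.add(first);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        table.put("row fam qual1", "val1");
        table.put("row fam qual2", "val2");
        table.put("row fam qual3", "val3");
        // Mirrors the unit test in the question: start at qual2 and expect exactly that entry.
        List<KeyRange> ranges = Arrays.asList(new KeyRange("row fam qual2", "row fam qual3"));
        for (Map.Entry<String, String> hit : firstPerRange(table, ranges)) {
            System.out.println(hit.getKey() + " -> " + hit.getValue());
        }
    }
}
```

The point of the model is that the lookup starts from the range's own start key rather than from the beginning of the row, which is the result the unit test in the question expects.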
Can someone please guide me in setting up an LDAP connection with GlassFish v3.1.2 using JNDI? I googled this topic but only found people setting up LDAP in GlassFish to authenticate users. I need instead to fetch user data to display on my JSF forms and to drive autocomplete when creating new entries on those forms.
I am a bit confused. Is an LDAP connection in GlassFish only used for authentication and setting up the realm?
OK, I found something while googling for ways to query, but my very limited knowledge is still hindering my progress.
So here is the code I found on http://www.myjeeva.com/2012/05/querying-active-directory-using-java/
Active Directory
/**
* The MIT License
*
* Copyright (c) 2010-2012 www.myjeeva.com
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
*
*/
package com.LdapSearchDaoBean;
import java.util.Properties;
import java.util.logging.Logger;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;
/**
* Query Active Directory using Java
*
* @filename ActiveDirectory.java
* @author Jeevanandam Madanagopal
* @copyright © 2010-2012 www.myjeeva.com
*/
public class ActiveDirectory {
// Logger
private static final Logger LOG = Logger.getLogger(ActiveDirectory.class.getName());
//required private variables
private Properties properties;
private DirContext dirContext;
private SearchControls searchCtls;
private String[] returnAttributes = { "sAMAccountName", "givenName", "cn", "mail" };
private String domainBase;
private String baseFilter = "(&((&(objectCategory=Person)(objectClass=User)))";
/**
* constructor with parameter for initializing a LDAP context
*
* @param username a {@link java.lang.String} object - username to establish a LDAP connection
* @param password a {@link java.lang.String} object - password to establish a LDAP connection
* @param domainController a {@link java.lang.String} object - domain controller name for LDAP connection
*/
public ActiveDirectory(String username, String password, String domainController) {
properties = new Properties();
properties.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
properties.put(Context.PROVIDER_URL, "LDAP://" + domainController);
properties.put(Context.SECURITY_PRINCIPAL, username + "@" + domainController);
properties.put(Context.SECURITY_CREDENTIALS, password);
//initializing active directory LDAP connection
try {
dirContext = new InitialDirContext(properties);
} catch (NamingException e) {
LOG.severe(e.getMessage());
}
//default domain base for search
domainBase = getDomainBase(domainController);
//initializing search controls
searchCtls = new SearchControls();
searchCtls.setSearchScope(SearchControls.SUBTREE_SCOPE);
searchCtls.setReturningAttributes(returnAttributes);
}
/**
* search the Active directory by username/email id for given search base
*
* @param searchValue a {@link java.lang.String} object - search value used for AD search for eg. username or email
* @param searchBy a {@link java.lang.String} object - scope of search by username or by email id
* @param searchBase a {@link java.lang.String} object - search base value for scope tree for eg. DC=myjeeva,DC=com
* @return search result a {@link javax.naming.NamingEnumeration} object - active directory search result
* @throws NamingException
*/
public NamingEnumeration<SearchResult> searchUser(String searchValue, String searchBy, String searchBase) throws NamingException {
String filter = getFilter(searchValue, searchBy);
String base = (null == searchBase) ? domainBase : getDomainBase(searchBase); // for eg.: "DC=myjeeva,DC=com";
return this.dirContext.search(base, filter, this.searchCtls);
}
/**
* closes the LDAP connection with Domain controller
*/
public void closeLdapConnection(){
try {
if(dirContext != null)
dirContext.close();
}
catch (NamingException e) {
LOG.severe(e.getMessage());
}
}
/**
* active directory filter string value
*
* @param searchValue a {@link java.lang.String} object - search value of username/email id for active directory
* @param searchBy a {@link java.lang.String} object - scope of search by username or email id
* @return a {@link java.lang.String} object - filter string
*/
private String getFilter(String searchValue, String searchBy) {
String filter = this.baseFilter;
if(searchBy.equals("email")) {
filter += "(mail=" + searchValue + "))";
} else if(searchBy.equals("username")) {
filter += "(samaccountname=" + searchValue + "))";
}
return filter;
}
/**
* creating a domain base value from domain controller name
*
* @param base a {@link java.lang.String} object - name of the domain controller
* @return a {@link java.lang.String} object - base name for eg. DC=myjeeva,DC=com
*/
private static String getDomainBase(String base) {
char[] namePair = base.toUpperCase().toCharArray();
String dn = "DC=";
for (int i = 0; i < namePair.length; i++) {
if (namePair[i] == '.') {
dn += ",DC=" + namePair[++i];
} else {
dn += namePair[i];
}
}
return dn;
}
}
Sample Usage Code
/**
* The MIT License
*
* Copyright (c) 2010-2012 www.myjeeva.com
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
*
*/
package com.LdapSearchDaoBean;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.Attributes;
import javax.naming.directory.SearchResult;
/**
* Sample program how to use ActiveDirectory class in Java
*
* @filename SampleUsageActiveDirectory.java
* @author Jeevanandam Madanagopal
* @copyright © 2010-2012 www.myjeeva.com
*/
public class SampleUsageActiveDirectory {
/**
* @param args
* @throws NamingException
*/
public static void main(String[] args) throws NamingException, IOException {
System.out.println("\n\nQuerying Active Directory Using Java");
System.out.println("------------------------------------");
String domain = "";
String username = "";
String password = "";
String choice = "";
String searchTerm = "";
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
System.out.println("Provide username & password for connecting AD");
System.out.println("Enter Domain:");
domain = br.readLine();
System.out.println("Enter username:");
username = br.readLine();
System.out.println("Enter password:");
password = br.readLine();
System.out.println("Search by username or email:");
choice = br.readLine();
System.out.println("Enter search term:");
searchTerm = br.readLine();
//Creating instance of ActiveDirectory
ActiveDirectory activeDirectory = new ActiveDirectory(username, password, domain);
//Searching
NamingEnumeration<SearchResult> result = activeDirectory.searchUser(searchTerm, choice, null);
if(result.hasMore()) {
SearchResult rs= (SearchResult)result.next();
Attributes attrs = rs.getAttributes();
String temp = attrs.get("samaccountname").toString();
System.out.println("Username : " + temp.substring(temp.indexOf(":")+1));
temp = attrs.get("givenname").toString();
System.out.println("Name : " + temp.substring(temp.indexOf(":")+1));
temp = attrs.get("mail").toString();
System.out.println("Email ID : " + temp.substring(temp.indexOf(":")+1));
temp = attrs.get("cn").toString();
System.out.println("Display Name : " + temp.substring(temp.indexOf(":")+1) + "\n\n");
} else {
System.out.println("No search result found!");
}
//Closing LDAP Connection
activeDirectory.closeLdapConnection();
}
}
I tried to use the above code with following input in console:
Querying Active Directory Using Java
------------------------------------
Provide username & password for connecting AD
Enter Domain:
DC=de,DC=*****,DC=com
Enter username:
************** ( i've hidden username)
Enter password:
************* (i've hidden password)
Search by username or email:
username
Enter search term:
user1
And I get following errors
Apr 12, 2013 10:35:17 AM com.LdapSearchDaoBean.ActiveDirectory <init>
SEVERE: DC=de,DC=*****,DC=com:389
Exception in thread "main" java.lang.NullPointerException
at com.LdapSearchDaoBean.ActiveDirectory.searchUser(ActiveDirectory.java:101)
at com.LdapSearchDaoBean.SampleUsageActiveDirectory.main(SampleUsageActiveDirectory.java:75)
It would be really great if someone could help me out, maybe with a short explanation of how to do this and how I can use it for autocomplete in JSF 2.0 forms. I'm quite lost on this topic. Thanks in advance.
I have the same problem and have not been able to resolve it, but maybe I can help you with yours.
When the application asks for the domain, it wants the IP address or host name of your Active Directory server, for example "10.10.200.1:389" or "my.activedirectoryurl:389".
Besides this, the code does not work properly, because null is passed on line 75 of SampleUsageActiveDirectory, and this will always cause the NullPointerException:
NamingEnumeration<SearchResult> result = activeDirectory.searchUser(searchTerm, choice, null);
The error occurs because you entered the AD distinguished-name values (DC=...) where the program expects a host. For the hostname, just use the real AD server name, such as ad.myserver.com, or its IP address. Then it should work.
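Judging from the log in the question, the SEVERE line comes from the constructor, where the failed InitialDirContext call is only logged. That leaves dirContext null, so the later dirContext.search(...) call in searchUser is what throws the NullPointerException. The constructor builds the provider URL from the host it is given and derives the search base by splitting that host on dots, so it needs a host name rather than a DN. Below is a minimal sketch of that derivation in plain Java. LdapHostNames and its methods are hypothetical helpers, not part of the myjeeva class, and the port stripping is an addition of mine.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helpers showing how a host name maps to an LDAP URL and a base DN. */
class LdapHostNames {

    /** Strips an optional ":port" suffix from a host such as "ad.myserver.com:389". */
    static String stripPort(String hostAndPort) {
        int colon = hostAndPort.lastIndexOf(':');
        return colon < 0 ? hostAndPort : hostAndPort.substring(0, colon);
    }

    /** Builds the provider URL the same way the constructor does, from host and optional port. */
    static String toProviderUrl(String hostAndPort) {
        return "ldap://" + hostAndPort;
    }

    /** Turns "ad.myserver.com" into "DC=ad,DC=myserver,DC=com", one DC component per label. */
    static String toBaseDn(String hostAndPort) {
        List<String> components = new ArrayList<>();
        for (String label : stripPort(hostAndPort).split("\\.")) {
            components.add("DC=" + label);
        }
        return String.join(",", components);
    }

    public static void main(String[] args) {
        String host = "ad.myserver.com:389";
        System.out.println(toProviderUrl(host));
        System.out.println(toBaseDn(host));
    }
}
```

A base DN derived like this is only right when the host name happens to match the directory's domain components. For an IP address, or a server whose name differs from the domain, the search base would have to be supplied explicitly.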
I am trying to run the following code for BasicCrawlController in java but I get some error:
/**
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package edu.uci.ics.crawler4j.examples.basic;
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
/**
* @author Yasser Ganjisaffar <lastname at gmail dot com>
*/
public class MyWebCrawler {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.out.println("Needed parameters: ");
System.out.println("\t rootFolder (it will contain intermediate crawl data)");
System.out.println("\t numberOfCrawlers (number of concurrent threads)");
return;
}
/*
* crawlStorageFolder is a folder where intermediate crawl data is
* stored.
*/
String crawlStorageFolder = args[0];
/*
* numberOfCrawlers shows the number of concurrent threads that should
* be initiated for crawling.
*/
int numberOfCrawlers = Integer.parseInt(args[1]);
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder(crawlStorageFolder);
/*
* Be polite: Make sure that we don't send more than 1 request per
* second (1000 milliseconds between requests).
*/
config.setPolitenessDelay(1000);
/*
* You can set the maximum crawl depth here. The default value is -1 for
* unlimited depth
*/
config.setMaxDepthOfCrawling(2);
/*
* You can set the maximum number of pages to crawl. The default value
* is -1 for unlimited number of pages
*/
config.setMaxPagesToFetch(1000);
/*
* Do you need to set a proxy? If so, you can use:
* config.setProxyHost("proxyserver.example.com");
* config.setProxyPort(8080);
*
* If your proxy also needs authentication:
* config.setProxyUsername(username); config.setProxyPassword(password);
*/
/*
* This config parameter can be used to set your crawl to be resumable
* (meaning that you can resume the crawl from a previously
* interrupted/crashed crawl). Note: if you enable resuming feature and
* want to start a fresh crawl, you need to delete the contents of
* rootFolder manually.
*/
config.setResumableCrawling(false);
/*
* Instantiate the controller for this crawl.
*/
PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
/*
* For each crawl, you need to add some seed urls. These are the first
* URLs that are fetched and then the crawler starts following links
* which are found in these pages
*/
controller.addSeed("http://www.ics.uci.edu/");
controller.addSeed("http://www.ics.uci.edu/~lopes/");
controller.addSeed("http://www.ics.uci.edu/~welling/");
/*
* Start the crawl. This is a blocking operation, meaning that your code
* will reach the line after this only when crawling is finished.
*/
controller.start(BasicCrawler.class, numberOfCrawlers);
}
}
the error is:
log4j:WARN No appenders could be found for logger (org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager).
log4j:WARN Please initialize the log4j system properly.
Exception in thread "main" java.lang.RuntimeException: Uncompilable source code - Erroneous tree type: <any>
at mywebcrawler.MyWebCrawler.main(MyWebCrawler.java:107)
What is the problem with the code? It is copied verbatim from the crawler4j website!
You are missing the log4j properties file.
What is BasicCrawler? Is that your own class? How did you define it? Is it a generic class, and if so, did you forget to specify the generic type?
BasicCrawler is a class that I copied from crawler4j documentation:
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.List;
import java.util.regex.Pattern;
public class BasicCrawler extends WebCrawler {
private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g" + "|png|tiff?|mid|mp2|mp3|mp4"
+ "|wav|avi|mov|mpeg|ram|m4v|pdf" + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");
/**
* You should implement this function to specify whether the given url
* should be crawled or not (based on your crawling logic).
*/
@Override
public boolean shouldVisit(WebURL url) {
String href = url.getURL().toLowerCase();
return !FILTERS.matcher(href).matches() && href.startsWith("http://www.aut.ac.ir/");
}
/**
* This function is called when a page is fetched and ready to be processed
* by your program.
*/
@Override
public void visit(Page page) {
int docid = page.getWebURL().getDocid();
String url = page.getWebURL().getURL();
String domain = page.getWebURL().getDomain();
String path = page.getWebURL().getPath();
String subDomain = page.getWebURL().getSubDomain();
String parentUrl = page.getWebURL().getParentUrl();
System.out.println("Docid: " + docid);
System.out.println("URL: " + url);
System.out.println("Domain: '" + domain + "'");
System.out.println("Sub-domain: '" + subDomain + "'");
System.out.println("Path: '" + path + "'");
System.out.println("Parent page: " + parentUrl);
if (page.getParseData() instanceof HtmlParseData) {
HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String text = htmlParseData.getText();
String html = htmlParseData.getHtml();
List<WebURL> links = htmlParseData.getOutgoingUrls();
System.out.println("Text length: " + text.length());
System.out.println("Html length: " + html.length());
System.out.println("Number of outgoing links: " + links.size());
}
System.out.println("=============");
}
}
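The filtering rule in shouldVisit can be tried without crawler4j on the classpath. The sketch below copies the extension regex from BasicCrawler and applies it to plain strings. UrlFilterSketch and the requiredPrefix parameter are hypothetical; in the real class the prefix is hard-coded and the argument is a WebURL.

```java
import java.util.regex.Pattern;

/** Stand-alone version of the BasicCrawler filter logic, operating on plain strings. */
class UrlFilterSketch {

    // Same extension list as BasicCrawler: URLs ending in one of these are skipped.
    private static final Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    /** Returns true when the URL has no filtered extension and starts with the required prefix. */
    static boolean shouldVisit(String url, String requiredPrefix) {
        String href = url.toLowerCase();
        return !FILTERS.matcher(href).matches() && href.startsWith(requiredPrefix);
    }

    public static void main(String[] args) {
        String prefix = "http://www.aut.ac.ir/";
        System.out.println(shouldVisit("http://www.aut.ac.ir/index.html", prefix));
        System.out.println(shouldVisit("http://www.aut.ac.ir/logo.PNG", prefix));
        System.out.println(shouldVisit("http://www.ics.uci.edu/~lopes/", prefix));
    }
}
```

Note that with the www.aut.ac.ir prefix used in BasicCrawler, URLs under www.ics.uci.edu (the hosts used as seeds in the controller) do not pass this check.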