JSVGCanvas.getSVGDocument() is returning null? - java

I seem to have a problem working with Batik SVG for manipulating SVG documents in Java. I can display the SVG just fine on the JSVGCanvas, but when I try to get the canvas's SVGDocument using getSVGDocument(), it returns null. Why is that, and how can I get the actual document?
jSVGCanvas1.setURI(new File("circle.svg").toURI().toString());
jSVGCanvas1.setDocumentState(JSVGCanvas.ALWAYS_DYNAMIC);
SVGDocument doc = jSVGCanvas1.getSVGDocument();
if (doc == null) System.out.println("null");
The last line tests whether doc is null, and it always prints "null". Please help!

You'll need to wait for the document to load, which happens asynchronously. Something like this...
jSVGCanvas1.addSVGDocumentLoaderListener(new SVGDocumentLoaderAdapter() {
    @Override
    public void documentLoadingCompleted(SVGDocumentLoaderEvent e) {
        SVGDocument doc = jSVGCanvas1.getSVGDocument();
        if (doc == null) System.out.println("null");
    }
});
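If you also need to modify the document once it is on screen, note that rendering happens in yet another asynchronous phase after loading. A small sketch using Batik's GVT tree renderer listener (same idea as above, just a later event in the pipeline):
import org.apache.batik.swing.gvt.GVTTreeRendererAdapter;
import org.apache.batik.swing.gvt.GVTTreeRendererEvent;

// Fires after the GVT tree has been rendered; by this point
// getSVGDocument() is guaranteed to return the loaded document.
jSVGCanvas1.addGVTTreeRendererListener(new GVTTreeRendererAdapter() {
    @Override
    public void gvtRenderingCompleted(GVTTreeRendererEvent e) {
        SVGDocument doc = jSVGCanvas1.getSVGDocument();
        System.out.println("Loaded root element: " + doc.getDocumentElement().getTagName());
    }
});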

Related

Parse data from webpage to android app using Jsoup

My Android app has a part where I need to parse data from wikipedia.com and use it in the application. When I go to https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data I can see the COVID-19 cases. I want to retrieve the numbers from the table.
I am using Jsoup. I am able to get the HTML data by using https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Template:COVID-19_pandemic_data . Can you guide me on how to extract the India cases and deaths from the HTML file? The HTML doc is huge and there are no attributes on the tr elements. There's not much information about this on the internet. What I have tried so far...
private void getWebsite() {
    new Thread(new Runnable() {
        @Override
        public void run() {
            final StringBuilder builder = new StringBuilder();
            String web_link = "https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Template:COVID-19_pandemic_data";
            try {
                // Fetch and parse the page, then collect every table row
                Document doc = Jsoup.connect(web_link).get();
                String title = doc.title();
                Elements links = doc.select("tr");
                builder.append(title).append("\n");
                for (Element link : links) {
                    builder.append(link);
                }
            } catch (IOException e) {
                builder.append("Error : ").append(e.getMessage()).append("\n");
            }
            // Post the result back to the UI thread
            runOnUiThread(new Runnable() {
                @Override
                public void run() {
                    textView.setText(builder.toString());
                }
            });
        }
    }).start();
}
The problem is related to the format of the data (XML). When you navigate down the XML elements, you find that what's displayed in the document when viewed via your browser is:
<someTag>...</someTag>
But what's actually present is the XML-encoded version of the string:
&lt;someTag&gt;...&lt;/someTag&gt;
JSoup won't work well with this, and I'd imagine you'll need further processing to convert the output back into real XML to get it working. You can test this yourself by viewing the result of:
doc.getElementsByTag("text")
And you'd need to replace all &lt; and &gt; tokens with < and >, respectively.
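If you do go down that route, Jsoup's own Parser.unescapeEntities() can do the decoding before a second parse. A minimal sketch, assuming the escaped markup lives in the <text> element of the API response:
// First parse: the API response itself
Document apiDoc = Jsoup.connect(web_link).get();
// Pull the entity-escaped page markup out of the <text> element
String inner = apiDoc.getElementsByTag("text").text();
// Decode any remaining &lt;/&gt; entities, then parse again as HTML
inner = org.jsoup.parser.Parser.unescapeEntities(inner, false);
Document innerDoc = Jsoup.parse(inner);
Elements rows = innerDoc.select("tr");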
Here's what I tried, plus some minor edits, after failing to pull tbody/thead/th. I then started trying to pull from the top-level tag, starting with api and moving deeper into the DOM.
final StringBuilder builder = new StringBuilder();
String url = "https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Template:COVID-19_pandemic_data";
try {
    Document doc = Jsoup.connect(url).get();
    // The <parse> element's title attribute was about as deep as I got
    String title = doc.getElementsByTag("parse").attr("title");
    builder.append(title);
} catch (IOException e) {
    builder.append("Error : ").append(e.getMessage());
}
Also worth mentioning: there are some really good examples in the documentation here: https://jsoup.org/cookbook/extracting-data/dom-navigation
And finally, for what it's worth, I'd change the URL to https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data to make life easier with Jsoup, so you can just pull the relevant bits of data from HTML rather than XML.
In my view, if you have the choice, HtmlUnit would be a better tool for this, since you can simply specify an XPath for the HTML element you want to extract without chaining multiple method calls to get there; the more concise format means there's less room for errors to hide.
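To illustrate that HTML-first approach, here is a rough Jsoup sketch; the "wikitable" class and the cell order (cases, then deaths) are assumptions about the template's markup, so verify them in your browser first:
Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data").get();
// Walk the stats table row by row and pick out the India entry
for (Element row : doc.select("table.wikitable tr")) {
    if (row.text().contains("India")) {
        Elements cells = row.select("td");
        if (cells.size() >= 2) {
            System.out.println("Cases: " + cells.get(0).text());
            System.out.println("Deaths: " + cells.get(1).text());
        }
        break;
    }
}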

Using Groovy to overwrite a FlowFile in NiFi

I'm trying to do something fairly simple: read an I-9 PDF form from an incoming FlowFile, parse the first and last name out of it into a JSON, then output the JSON to the outgoing FlowFile.
I found no official documentation on how to do this, but someone has written up several cookbooks on scripting in NiFi in several languages here. It seems pretty straightforward, and I'm pretty sure I'm doing what is written there, but I'm not even sure the PDF is being read at all. It simply passes the PDF out to REL_SUCCESS unmodified every time.
Link to sample PDF
import java.awt.Rectangle
import java.nio.charset.StandardCharsets
import org.apache.nifi.processor.io.StreamCallback
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.pdmodel.PDPage
import org.apache.pdfbox.util.PDFTextStripperByArea
import com.google.gson.Gson

def flowFile = session.get()
flowFile = session.write(flowFile, { inputStream, outputStream ->
    try {
        // Load FlowFile contents into a PDF document
        PDDocument document = PDDocument.load(inputStream)
        // Get the first page
        List<PDPage> allPages = document.getDocumentCatalog().getAllPages()
        PDPage page = allPages.get(0)
        // Define the areas to search and add them as search regions
        PDFTextStripperByArea stripper = new PDFTextStripperByArea()
        Rectangle lname = new Rectangle(25, 226, 240, 15)
        stripper.addRegion("lname", lname)
        Rectangle fname = new Rectangle(276, 226, 240, 15)
        stripper.addRegion("fname", fname)
        // Load the results into a map
        def boxMap = [:]
        stripper.setSortByPosition(true)
        stripper.extractRegions(page)
        def regions = stripper.getRegions()
        for (String region : regions) {
            String box = stripper.getTextForRegion(region)
            boxMap.put(region, box)
        }
        Gson gson = new Gson()
        // Remove random noise from the output
        def json = gson.toJson(boxMap, LinkedHashMap.class)
        json = json.replace('\\n', '')
        json = json.replace('\\r', '')
        json = json.replace(',"', ',\n"')
        // Overwrite FlowFile contents with the JSON
        outputStream.write(json.getBytes(StandardCharsets.UTF_8))
    } catch (Exception e) {
        System.out.println(e.getMessage())
        session.transfer(flowFile, REL_FAILURE)
    }
} as StreamCallback)
session.transfer(flowFile, REL_SUCCESS)
EDIT:
I was able to confirm that the flowFile object is being read properly by subbing in a txt file. So the problem seems to be that the inputStream is never handed off to the PDDocument, or something is happening when it is. I edited the code to try reading it into a File object first, but that resulted in an error:
FlowFileHandlingException: null is not known in this session
EDIT Edit:
Solved by moving my try/catch. I don't entirely understand why that works; my code above has been edited and now works properly.
session.get() can return null, so definitely add a line after it: if (!flowFile) return. Also, put the try/catch outside the session.write(); that way you can put the session.transfer(flowFile, REL_SUCCESS) after the session.write() (inside the try), and the catch can transfer to failure.
Also, I can't tell from the code how the PDFTextStripperByArea gets the info from the incoming document. It looks like all the document handling is inside the try, so it wouldn't be available to the PDFTextStripper (and isn't passed in).
None of these things explains why you're getting the original flow file on the success relationship, but maybe there's something I'm not seeing that would be magically fixed by the changes above :)
Also, if you use log.info() or log.error() rather than System.out.println(), you will see the output in the NiFi logs. For errors it will also post a bulletin to the processor, and you can see the message by hovering over the top-right corner of the processor (a red square appears there when a bulletin is present).
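Putting those suggestions together, the script skeleton would look roughly like this (just a sketch of the shape described above, with the PDF work elided):
def flowFile = session.get()
if (!flowFile) return  // session.get() can return null

try {
    flowFile = session.write(flowFile, { inputStream, outputStream ->
        // ... load the PDF from inputStream, write the JSON to outputStream ...
    } as StreamCallback)
    // Only reached if the write succeeded
    session.transfer(flowFile, REL_SUCCESS)
} catch (Exception e) {
    log.error('Failed to extract PDF fields', e)
    session.transfer(flowFile, REL_FAILURE)
}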

Jsoup is not Correctly Working

Hello guys, I have a problem with Jsoup: it is not working, and I have no idea how to figure it out. Here is the code:
private void getWebsite() {
    Document doc = null;
    try {
        doc = Jsoup.connect("http://www.jean-clermont-schule.de/seite/90384/vertretungsplan.html").get();
        Elements newsHeadlines = doc.select("content");
    } catch (IOException e) {
        e.printStackTrace();
    }
}
And here is a picture:
The code you provided is valid and compiles, so the error is likely outside of what you've shown us. Looking at your picture, I'd guess you've imported the wrong Document class. Check your imports.
I am unable to add a comment above.
I think the line Elements newsHeadlines = doc.select("content"); is wrong, because content isn't a tag at this link.
You must provide a tag name (attribute and value being optional) when using .select("").
You may try: Elements newsHeadlines = doc.select("div[id=content]");
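On the wrong-import guess above: select() and friends come from Jsoup's own classes, and the usual mistake is letting the IDE pick org.w3c.dom.Document instead. The imports should look like this:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;   // not org.w3c.dom.Document
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;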

ghost4j class cast exception during joining two PostScripts

I am trying to join two PostScript files to one with ghost4j 0.5.0 as follows:
final PSDocument[] psDocuments = new PSDocument[2];
psDocuments[0] = new PSDocument();
psDocuments[0].load("1.ps");
psDocuments[1] = new PSDocument();
psDocuments[1].load("2.ps");
psDocuments[0].append(psDocuments[1]);
psDocuments[0].write("3.ps");
During this simplified process I got the following exception message for the above "append" line:
org.ghost4j.document.DocumentException: java.lang.ClassCastException:
org.apache.xmlgraphics.ps.dsc.events.UnparsedDSCComment cannot be cast to
org.apache.xmlgraphics.ps.dsc.events.DSCCommentPage
So far I have not managed to find out what the problem is here - maybe some kind of problem within one of the PostScript files?
Help would be appreciated.
EDIT:
I tested with the Ghostscript command-line tool:
gswin32.exe -dQUIET -dBATCH -dNOPAUSE -sDEVICE=pswrite -sOutputFile="test.ps" --filename "1.ps" "2.ps"
which results in a document where 1.ps and 2.ps are merged into one(!) page (i.e. overlaid).
When removing --filename, the resulting document is a PostScript file with two pages, as expected.
The exception occurs because one of the two documents does not follow the Adobe Document Structuring Conventions (DSC), which is mandatory if you want to use the Document append method.
Use the SafeAppenderModifier instead. There is an example here: http://www.ghost4j.org/highlevelapisamples.html (Append a PDF document to a PostScript document)
I think something is wrong in the document or in the XMLGraphics library, as it seems it cannot parse part of it.
Here you can see the code in ghost4j that I think is failing (link):
DSCParser parser = new DSCParser(bais);
Object tP = parser.nextDSCComment(DSCConstants.PAGES);
while (tP instanceof DSCAtend)
    tP = parser.nextDSCComment(DSCConstants.PAGES);
DSCCommentPages pages = (DSCCommentPages) tP;
And here you can see why XMLGraphics may be responsible (link):
private DSCComment parseDSCComment(String name, String value) {
    DSCComment parsed = DSCCommentFactory.createDSCCommentFor(name);
    if (parsed != null) {
        try {
            parsed.parseValue(value);
            return parsed;
        } catch (Exception e) {
            // ignore and fall back to unparsed DSC comment
        }
    }
    UnparsedDSCComment unparsed = new UnparsedDSCComment(name);
    unparsed.parseValue(value);
    return unparsed;
}
It seems parsed.parseValue(value) threw an exception; it was hidden in the catch, and the method returned an unparsed version that ghost4j didn't expect.

Getting info from a webpage in java

Sorry if it's kind of a big question, but I'm just looking for someone to tell me in what direction to learn more, since I have no clue; I have very basic knowledge of HTML and Java.
Someone in my family has to copy every product from a supplier into his own webshop.
The problem is he needs to put in all the articles one by one by hand, and I'm looking for a way to replace that manual work with a program.
I already have a bit going for the price calculation; all I need now is the product info.
http://pastebin.com/WVCy55Dj
From line 1009 to around 1030: I need 3 separate strings of the three spans with the class "CatalogusListDetailTest".
From line 987 to around 1000: I need a way to get all the images; they're on the website at www.flamingo.be/Images/Products/Large/"productID"(our first string).jpg.
Sometimes there's a _A or _B, as you can see in this example, so I'm looking for a way to check whether those variants exist and get those images as well.
If I could get this far I'd be very thankful! I'll figure the rest out myself. Sorry for the long post; I wanted to give as much info as possible.
You can look at the HTML parser library Jsoup; doc reference: http://jsoup.org/cookbook/
EDIT: Code to get the product code:
Elements classElements = document.getElementsByClass("CatalogusListDetailTextTitel");
for (Element classElement : classElements) {
    if (classElement.text().contains("Productcode :")) {
        System.out.println(classElement.parent().ownText());
    }
}
Instead of document you may have to use an element to get a consistent result; the above code will print all the product codes.
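For the three detail strings themselves, the same approach works with the class from the question (assuming "CatalogusListDetailTest" matches the page source exactly):
// Collect the text of the three detail spans into separate strings
Elements spans = document.getElementsByClass("CatalogusListDetailTest");
List<String> details = new ArrayList<>();
for (Element span : spans) {
    details.add(span.text());
}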
You can use JTidy for what you need.
Code Example:
public void downloadSinglePage(String pageLink, String targetDir) throws XPathExpressionException, IOException {
    URL url = new URL(pageLink);
    BufferedInputStream page = new BufferedInputStream(url.openStream());
    // Tidy the HTML into a DOM we can run XPath against
    Tidy tidy = new Tidy();
    tidy.setQuiet(true);
    tidy.setShowWarnings(false);
    Document response = tidy.parseDOM(page, null);
    XPathFactory factory = XPathFactory.newInstance();
    XPath xPath = factory.newXPath();
    NodeList nodes = (NodeList) xPath.evaluate(IMAGE_PATTERN, response, XPathConstants.NODESET);
    String imageURL = nodes.item(0).getNodeValue();
    // The third argument names the saved file; pick whatever suits you
    saveImageNIO(imageURL, targetDir, "image");
}
where
IMAGE_PATTERN = "//a/img/@src";
but the pattern depends on how the image is nested in the page's HTML code.
Method for saving the image using NIO:
public void saveImageNIO(String imageURL, String targetDir, String imageName) throws IOException {
    URL url = new URL(imageURL);
    ReadableByteChannel rbc = Channels.newChannel(url.openStream());
    FileOutputStream fos = new FileOutputStream(targetDir + "/" + imageName + ".jpg");
    // Copy up to 16 MB from the channel straight into the file
    fos.getChannel().transferFrom(rbc, 0, 1 << 24);
    fos.close();
}
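And for the _A/_B image variants from the question, one option is a HEAD request per candidate URL, keeping only those that answer 200. A sketch using plain java.net.HttpURLConnection (the URL pattern is the one given in the question):
// Probe product image variants; returns only the URLs that exist
public List<String> findImageVariants(String productId) throws IOException {
    String base = "http://www.flamingo.be/Images/Products/Large/" + productId;
    List<String> found = new ArrayList<>();
    for (String suffix : new String[] {"", "_A", "_B"}) {
        String candidate = base + suffix + ".jpg";
        HttpURLConnection conn = (HttpURLConnection) new URL(candidate).openConnection();
        conn.setRequestMethod("HEAD");  // headers only, don't download the image
        if (conn.getResponseCode() == HttpURLConnection.HTTP_OK) {
            found.add(candidate);
        }
        conn.disconnect();
    }
    return found;
}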
