I'm trying to do something fairly simple: read an I-9 PDF form from an incoming FlowFile, parse the first and last name out of it into JSON, then write the JSON to the outgoing FlowFile.
I found no official documentation on how to do this, but someone has written up several cookbooks on scripting in various languages in NiFi here. It seems pretty straightforward, and I'm fairly sure I'm doing what is written there, but I'm not even sure the PDF is being read at all. The script simply passes the PDF through unmodified to REL_SUCCESS every time.
Link to sample PDF
import java.awt.Rectangle
import java.nio.charset.StandardCharsets

import org.apache.nifi.processor.io.StreamCallback

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.pdmodel.PDPage
import org.apache.pdfbox.util.PDFTextStripperByArea

import com.google.gson.Gson
def flowFile = session.get()
flowFile = session.write(flowFile, { inputStream, outputStream ->
    try {
        // Load the FlowFile contents
        PDDocument document = PDDocument.load(inputStream)

        // Get the first page
        List<PDPage> allPages = document.getDocumentCatalog().getAllPages()
        PDPage page = allPages.get(0)

        // Define the areas to search and add them as search regions
        PDFTextStripperByArea stripper = new PDFTextStripperByArea()
        Rectangle lname = new Rectangle(25, 226, 240, 15)
        stripper.addRegion("lname", lname)
        Rectangle fname = new Rectangle(276, 226, 240, 15)
        stripper.addRegion("fname", fname)

        // Load the results into a map
        def boxMap = [:]
        stripper.setSortByPosition(true)
        stripper.extractRegions(page)
        def regions = stripper.getRegions()
        for (String region : regions) {
            String box = stripper.getTextForRegion(region)
            boxMap.put(region, box)
        }

        Gson gson = new Gson()
        // Remove random noise from the output
        def json = gson.toJson(boxMap, LinkedHashMap.class)
        json = json.replace('\\n', '')
        json = json.replace('\\r', '')
        json = json.replace(',"', ',\n"')

        // Overwrite the FlowFile contents with the JSON
        outputStream.write(json.getBytes(StandardCharsets.UTF_8))
    } catch (Exception e) {
        System.out.println(e.getMessage())
        session.transfer(flowFile, REL_FAILURE)
    }
} as StreamCallback)
session.transfer(flowFile, REL_SUCCESS)
EDIT:
I was able to confirm that the flowFile object is being read properly by subbing in a txt file. So the problem seems to be that the inputStream is never handed off to the PDDocument, or something goes wrong when it is. I edited the code to try reading it into a File object first, but that resulted in an error:
FlowFileHandlingException: null is not known in this session
EDIT Edit:
Solved by moving my try/catch. I don't quite understand why that works, but the code above has been edited and now works properly.
session.get() can return null, so definitely add a line after it: if (!flowFile) return. Also put the try/catch outside the session.write; that way you can put the session.transfer(flowFile, REL_SUCCESS) after the session.write (inside the try), and the catch can transfer to failure.
Also, I can't tell from the code how the PDFTextStripperByArea gets the info from the incoming document. It looks like all the document stuff is inside the try, so it wouldn't be available to the PDFTextStripper (and isn't passed in).
None of these things explain why you're getting the original flow file on the success relationship, but maybe there's something I'm not seeing that would be magically fixed by the changes above :)
Also, if you use log.info() or log.error() rather than System.out.println, you will see the output in the NiFi logs. For errors it will also post a bulletin to the processor, and you can see the message if you hover over the top-right corner of the processor (a red square appears there when a bulletin is present).
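For example, a minimal sketch of that restructuring (Groovy for ExecuteScript; session, log and the REL_* relationships are the standard script bindings, and the PDF-parsing body from your script is elided here):

import org.apache.nifi.processor.io.StreamCallback

def flowFile = session.get()
if (!flowFile) return

try {
    flowFile = session.write(flowFile, { inputStream, outputStream ->
        // ... parse the PDF from inputStream and write the JSON to outputStream ...
    } as StreamCallback)
    session.transfer(flowFile, REL_SUCCESS)
} catch (Exception e) {
    log.error('Failed to extract fields from the PDF', e)
    session.transfer(flowFile, REL_FAILURE)
}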
I'm a newbie in Java, and I need to extract data from NSF files into DXL XML files. I tried it with Python, but the OLE wrapper lacks some major functionality. I have two different formats of DXL files: with rawitemdata and with richtext.
When I switched to the Java Lotus Notes API, I started getting errors in rawitemdata fields. I call doc.convertToMIME(2); to render the mail body as HTML so that every type of document is handled the same way. As a result all fields are stored as rawitemdata, which is great and exactly what I want. But some rawitemdata fields come out with broken encoding.
It looks like this:
<rawitemdata type="19">
AgAPAAAAAgAmAj4AhQEAAAAAAAANCi0tMF9fPUNDQkIwRTFFREZBRjk4Mjg4ZjllOGE5M2RmOTM4NjkwOTE4Y0NDQkIwRTFFREZBRjk4
MjgNCkNvbnRlbnQtVHJhbnNmZXItRW5jb2Rpbmc6IGJpbmFyeQ0KQ29udGVudC10eXBlOiBhcHBsaWNhdGlvbi9vY3RldC1zdHJlYW07
IA0KCW5hbWU9Ij0/S09JOC1SP0I/OFBMdjlPL3I3K3d1Wkc5amVBPT0/PSINCkNvbnRlbnQtRGlzcG9zaXRpb246IGF0dGFjaG1lbnQ7
IGZpbGVuYW1lPSI9P0tPSTgtUj84UEx2OU8vcjcrd3VaRzlqZUE9PT89Ig0KQ29udGVudC1JRDogPDNfXz1DQ0JCMEUxRURGQUY5ODI4
OGY5ZThhOTNkZjkzODY5MDkxQGxvY2FsPg0KDQoFzwXQBc4F0gXOBcoFzgXLLmRvY3g=
</rawitemdata>
After decoding from base64 I get this:
\x02\x00\x0f\x00\x00\x00\x02\x00&\x02>\x00\x85\x01\x00\x00\x00\x00\x00\x00\r\n-
-0__=CCBB0E1EDFAF98288f9e8a93df938690918cCCBB0E1EDFAF9828\r\nContent-Transfer-Encoding:
binary\r\nContent-type: application/octet-stream; \r\n\tname="=?KOI8-R?B?8PLv9O/r7+wuZG9jeA==?
="\r\nContent-Disposition: attachment; filename="=?KOI8-R?8PLv9O/r7+wuZG9jeA==?="\r\nContent-ID:
<3__=CCBB0E1EDFAF98288f9e8a93df93869091#local>\r\n\r\n\x05\xcf\x05\xd0\x05\xce\x05\xd2\x05\xce\x05\xca\x
05\xce\x05\xcb.docx
This is easy to explain:
The first 20 bytes are a header: \x02\x00\x0f\x00\x00\x00\x02\x00&\x02>\x00\x85\x01\x00\x00\x00\x00\x00\x00
To unpack it I use struct.unpack("<hhhh hhLL", data[:20]) (a Java equivalent is sketched after this list).
It returns a tuple like this: (2, 15, 0, 2, 550, 62, 389, 0), where the 5th element is the length of the body. (I changed the body, though, and was too lazy to recalculate the header for the new body.)
Then comes the body with RFC 822 headers.
After base64-decoding, and then decoding the KOI8-R encoded-word in the Content-Disposition header, we can see the attachment file name ПРОТОКОЛ.docx, which is the correct content.
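On the Java side, the same header can be read with a little-endian ByteBuffer; a minimal sketch mirroring the struct.unpack("<hhhh hhLL", ...) call above (the bytes are the 20 header bytes from the dump; which field means what is only what I described above):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class RawItemDataHeader {
    public static void main(String[] args) {
        // The 20 header bytes from the decoded rawitemdata shown above
        byte[] header = {
                0x02, 0x00, 0x0f, 0x00, 0x00, 0x00, 0x02, 0x00,
                0x26, 0x02, 0x3e, 0x00, (byte) 0x85, 0x01, 0x00, 0x00,
                0x00, 0x00, 0x00, 0x00
        };
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);

        // "<hhhh hh" -> six 16-bit values: 2, 15, 0, 2, 550, 62
        short[] shorts = new short[6];
        for (int i = 0; i < shorts.length; i++) {
            shorts[i] = buf.getShort();
        }
        // "LL" -> two 32-bit values: 389, 0 (read as long to keep them unsigned)
        long first = buf.getInt() & 0xFFFFFFFFL;
        long second = buf.getInt() & 0xFFFFFFFFL;

        System.out.println(Arrays.toString(shorts) + " " + first + " " + second);
    }
}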
If I try to decode the last part of the body, which contains the bytes \x05\xcf\x05\xd0\x05\xce\x05\xd2\x05\xce\x05\xca\x05\xce\x05\xcb.docx, I get an exception. This part can't be decoded correctly.
It looks like cp1251 under Unicode: each character byte seems to carry an extra \x05 prefix, because the text ПРОТОКОЛ.docx in the cp1251 code page is exactly the byte sequence '\xcf\xd0\xce\xd2\xce\xca\xce\xcb.docx'.
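Based on that observation, a rough sketch of a workaround decode in Java: drop the \x05 prefix byte in front of each Cyrillic byte and decode what is left as windows-1251. The prefix-byte interpretation is only my guess from the dump (it looks like it could be Lotus's internal LMBCS encoding, where a group byte selects the code page, but I have not verified that):

import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class PrefixedCp1251Decoder {

    // Decodes byte sequences like 05 CF 05 D0 ... by dropping each 0x05 prefix
    // byte and treating the remaining bytes as windows-1251 text.
    static String decode(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < data.length; i++) {
            if (data[i] == 0x05 && i + 1 < data.length) {
                out.write(data[i + 1]); // keep the byte after the prefix
                i++;
            } else {
                out.write(data[i]);     // plain ASCII bytes like ".docx"
            }
        }
        return new String(out.toByteArray(), Charset.forName("windows-1251"));
    }

    public static void main(String[] args) {
        byte[] tail = {
                0x05, (byte) 0xcf, 0x05, (byte) 0xd0, 0x05, (byte) 0xce, 0x05, (byte) 0xd2,
                0x05, (byte) 0xce, 0x05, (byte) 0xca, 0x05, (byte) 0xce, 0x05, (byte) 0xcb,
                '.', 'd', 'o', 'c', 'x'
        };
        System.out.println(decode(tail)); // expected: ПРОТОКОЛ.docx
    }
}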
My Java code is quite simple:
import java.io.FileWriter;

import lotus.domino.*;

public class ExportDXL {
    public static void main(String[] args) throws Exception {
        String server = "";
        String dbPath = "C:\\share\\test.nsf";

        NotesThread.sinitThread();
        Session session = NotesFactory.createSession((String) null, (String) null, "test");
        Database db = session.getDatabase(server, dbPath);
        System.out.println("Db Title: " + db.getTitle());

        DxlExporter exporter = session.createDxlExporter();
        exporter.setConvertNotesBitmapsToGIF(true);

        View view = db.getView("$ALL");
        Document doc = view.getFirstDocument();
        while (doc != null) {
            doc.convertToMIME(2); // or 1 to get plain text
            String xmldoc = doc.generateXML();
            String id = doc.getUniversalID();

            FileWriter fw = new FileWriter("C:\\lotus_test\\" + id + ".xml");
            fw.write(xmldoc);
            fw.flush();
            fw.close();

            doc = view.getNextDocument(doc);
        }
    }
}
How can I get the correct encoding in this case? Or how can I set the code page for the Lotus Notes API?
I have the following issue: like everybody else, it seems, I want to replace some items with others in a Word doc.
The issue with the issue is that the doc contains headers and footers which are part of the POIFSFileSystem (I know this because reading the FS and writing the doc back, without any changes, loses this information, whereas reading the FS and writing it back as a new file doesn't).
Currently I do this :
POIFSFileSystem pfs = new POIFSFileSystem(fis);
HWPFDocument document = new HWPFDocument(pfs);
Range r1 = document.getRange();
…
document.write();
ByteArrayOutputStream bos = new ByteArrayOutputStream(50000);
pfs.writeFilesystem(bos);
pfs.close();
However this fails, with this error:
Opened read-only or via an InputStream, a Writeable File is required
If I don't rewrite the document, it works fine, but my changes are lost.
The other way around: if I only save the document, not the filesystem, I lose the header/footer.
Now the problem is, how can I update the document while "saving as" the entire filesystem, or is there a way to force the document to contain everything from the file system?
The HWPF stuff is still in the scratchpad module because the binary DOC file format is the most horrible of all the horrible formats. So it will probably never be really complete and will be buggy in many cases.
But in your particular case, your observations are not reproducible. Using Apache POI 4.0.1, the HWPFDocument contains the header story (which also contains the footer stories) after being created from the *.doc file. So the following works for me:
Source:
Code:
import java.io.FileInputStream;
import java.io.FileOutputStream;

import org.apache.poi.hwpf.*;
import org.apache.poi.hwpf.usermodel.*;

public class ReadAndWriteDOCWithHeaderFooter {
    public static void main(String[] args) throws Exception {
        HWPFDocument document = new HWPFDocument(new FileInputStream("TemplateDOCWithHeaderFooter.doc"));

        Range bodyRange = document.getRange();
        System.out.println(bodyRange);
        for (int p = 0; p < bodyRange.numParagraphs(); p++) {
            System.out.println(bodyRange.getParagraph(p).text());
            if (bodyRange.getParagraph(p).text().contains("<<NAME>>"))
                bodyRange.getParagraph(p).replaceText("<<NAME>>", "Axel Richter");
            if (bodyRange.getParagraph(p).text().contains("<<DATE>>"))
                bodyRange.getParagraph(p).replaceText("<<DATE>>", "12/21/1964");
            if (bodyRange.getParagraph(p).text().contains("<<AMOUNT>>"))
                bodyRange.getParagraph(p).replaceText("<<AMOUNT>>", "1,234.56");
            System.out.println(bodyRange.getParagraph(p).text());
        }

        System.out.println("==============================================================================");

        Range overallRange = document.getOverallRange();
        System.out.println(overallRange);
        for (int p = 0; p < overallRange.numParagraphs(); p++) {
            System.out.println(overallRange.getParagraph(p).text()); // contains everything, including header and footer
        }

        FileOutputStream out = new FileOutputStream("ResultDOCWithHeaderFooter.doc");
        document.write(out);
        out.close();
        document.close();
    }
}
Result:
So please check it again and tell us exactly what is not working for you. Since we need to reproduce it, please provide a minimal, complete, and verifiable example, as I have done with my code.
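For completeness: the "Opened read-only or via an InputStream, a Writeable File is required" error in your original code comes from the no-argument document.write(), which attempts an in-place write. If you really want to stay at the POIFSFileSystem level and rewrite the same file, a sketch of that variant (assuming a current POI version, roughly 3.17 or later, where POIFSFileSystem can be opened read-write from a File; I have not re-tested this exact variant):

import java.io.File;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class InPlaceDOCUpdate {
    public static void main(String[] args) throws Exception {
        // Open the file itself read-write (readOnly = false), not an InputStream,
        // so that an in-place write is allowed.
        POIFSFileSystem pfs = new POIFSFileSystem(new File("TemplateDOCWithHeaderFooter.doc"), false);
        HWPFDocument document = new HWPFDocument(pfs);

        Range range = document.getRange();
        range.replaceText("<<NAME>>", "Axel Richter");

        document.write();   // writes back into the opened file, keeping the rest of the filesystem
        document.close();   // also closes the underlying POIFSFileSystem
    }
}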
I have defined an Avro schema and generated some classes with avro-tools for those schemas. Now I want to serialize the data to disk. I found some answers about Scala for this, but not for Java. The class Article is generated with avro-tools from a schema defined by me.
Here's a simplified version of the code of how I try to do it:
JavaPairRDD<String, String> filesRDD = context.wholeTextFiles(inputDataPath);

JavaRDD<Article> processingFiles = filesRDD.map(fileNameContent -> {
    // The name of the file
    String fileName = fileNameContent._1();
    // The content of the file
    String fileContent = fileNameContent._2();

    // An object from my Avro schema
    Article a = new Article(fileContent);

    Processing processing = new Processing();
    // .... some processing of the content here ... //
    processing.serializeArticleToDisk(avroFileName);

    return a;
});
where serializeArticleToDisk(avroFileName) is defined as follows:
public void serializeArticleToDisk(String filename) throws IOException {
    // Serialize the article to disk
    DatumWriter<Article> articleDatumWriter = new SpecificDatumWriter<Article>(Article.class);
    DataFileWriter<Article> dataFileWriter = new DataFileWriter<Article>(articleDatumWriter);
    dataFileWriter.create(this.article.getSchema(), new File(filename));
    dataFileWriter.append(this.article);
    dataFileWriter.close();
}
where Article is the class generated from my Avro schema.
Now, the mapper throws me the error:
java.io.FileNotFoundException: hdfs:/...path.../avroFileName.avro (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
at org.apache.avro.file.SyncableFileOutputStream.<init>(SyncableFileOutputStream.java:60)
at org.apache.avro.file.DataFileWriter.create(DataFileWriter.java:129)
at org.apache.avro.file.DataFileWriter.create(DataFileWriter.java:129)
at sentences.ProcessXML.serializeArticleToDisk(ProcessXML.java:207)
. . . rest of the stacktrace ...
although the file path is correct.
I use a collect() method afterwards, so everything else within the map function works fine (except for the serialization part).
I am quite new to Spark, so I am not sure whether this might actually be something trivial. I suspect that I need to use some writing functions and not do the writing in the mapper (not sure if this is true, though). Any ideas how to tackle this?
EDIT:
The last line of the error stack trace I showed actually points to this part:
dataFileWriter.create(this.article.getSchema(), new File(filename));
This is the part that throws the actual error. I am assuming the dataFileWriter needs to be replaced with something else. Any ideas?
This solution is not using data-frames and is not throwing any errors:
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.io.NullWritable;
import org.apache.avro.mapred.AvroKey;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;
. . . . .

// Serializing to AVRO
JavaPairRDD<AvroKey<Article>, NullWritable> javaPairRDD = processingFiles.mapToPair(r -> {
    return new Tuple2<AvroKey<Article>, NullWritable>(new AvroKey<Article>(r), NullWritable.get());
});

Job job = AvroUtils.getJobOutputKeyAvroSchema(Article.getClassSchema());
javaPairRDD.saveAsNewAPIHadoopFile(outputDataPath, AvroKey.class, NullWritable.class,
        AvroKeyOutputFormat.class, job.getConfiguration());
where AvroUtils.getJobOutputKeyAvroSchema is:
public static Job getJobOutputKeyAvroSchema(Schema avroSchema) {
    Job job;
    try {
        job = new Job();
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    AvroJob.setOutputKeySchema(job, avroSchema);
    return job;
}
Similar things for Spark + Avro can be found here -> https://github.com/CeON/spark-utils.
It seems that you are using Spark in the wrong way.
map is a transformation. Just calling map doesn't trigger computation of the RDD; you have to call an action like foreach() or collect().
Also note that the lambda supplied to map is serialized on the driver and shipped to some node in the cluster, where it runs inside a task. There, new File(filename) refers to the executor's local file system; a java.io.File can never point at an hdfs:// path, which is why you get the FileNotFoundException.
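If the write really does have to happen inside the task, the stream has to go through Hadoop's FileSystem API instead of java.io.File. A rough sketch of that variant (not the recommended approach; it assumes the Hadoop configuration is reachable from the executors, and each task must use its own target path to avoid collisions):

import java.io.IOException;
import java.io.OutputStream;

import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAvroWriter {
    // Writes a single Article record to an Avro container file on HDFS.
    public static void writeArticle(Article article, String hdfsPath) throws IOException {
        Path path = new Path(hdfsPath);
        FileSystem fs = path.getFileSystem(new Configuration());
        try (OutputStream out = fs.create(path)) {
            DatumWriter<Article> datumWriter = new SpecificDatumWriter<>(Article.class);
            DataFileWriter<Article> fileWriter = new DataFileWriter<>(datumWriter);
            fileWriter.create(article.getSchema(), out); // create(Schema, OutputStream) writes through the HDFS stream
            fileWriter.append(article);
            fileWriter.close();
        }
    }
}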
ADDED
Try using Spark SQL and spark-avro to save a Spark DataFrame in Avro format:
// Load a text file and convert each line to a JavaBean.
JavaRDD<Person> people = sc.textFile("/examples/people.txt")
        .map(Person::parse);

// Apply a schema to an RDD
DataFrame peopleDF = sqlContext.createDataFrame(people, Person.class);
peopleDF.write()
        .format("com.databricks.spark.avro")
        .save("/output");
How can I get the latest tweet from HTML content, either through regex or without any external libraries? I am happy to use external libraries, I would just prefer not to; I mainly want to know how it would be possible. I have written the HTML download part in Java, and if anyone wants I will post it here.
So I'll give a bit of pseudo code, so that I'm not only targeting Java developers. This is how my program looks so far:
1.) Load site("www.twitter.com/user123")
2.) Get initial string and write it to variable -> buffer
3.) Loop start
4.)     Append string -> buffer
5.)     If there is no more -> break
6.) Print buffer
Obviously the variable buffer will now hold the raw HTML content. How can I sift through it to get the tweet? I have found a way, but it is too inconsistent. What I managed was to find the string which held the tweets and extract the content surrounded by that markup. However, there were too many variations in this section; what I mean is that some content inside of it changes, like the font size. I could write multiple if statements, but is there a neater solution?
Let me just start off by saying that jsoup is an amazing lightweight HTML parsing library. You can use things like CSS selectors and whatnot. If you ever decide to use a library, jsoup will make your life a lot easier.
You can just query for the element with the class TweetTextSize, then get the text content. This will give you all text, hashtags, and links (the downside being that pictures are also given as links).
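For example, a minimal sketch of that query with jsoup (the TweetTextSize class name is just the one mentioned above and may change whenever Twitter changes its markup):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LatestTweet {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the profile page
        Document doc = Jsoup.connect("https://twitter.com/user123")
                .userAgent("Mozilla/5.0")
                .get();
        // The first paragraph carrying the tweet-text class is the latest tweet
        Element tweet = doc.select("p.TweetTextSize").first();
        if (tweet != null) {
            System.out.println(tweet.text());
        }
    }
}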
Otherwise, you'll need to manually traverse the DOM. For example, use regex to find the beginning of the first TweetTextSize, and then just keep all text which is not between a < and a >.
Unfortunately, this second solution is volatile and may not work in the future, and you'll end up with a big glob of code which is overly complex and hard to debug.
Simple answer if you want a regex and not a sophisticated third-party library:
<p[^>]+js-tweet-text[^>]*>(.*)</p>
Try the above on the "view-source" of https://twitter.com/a
Thanks.
EDIT:
Source Code:
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TweetSucker {
    public static void main(String[] args) throws Exception {
        URLConnection urlConnection = new URL("https://twitter.com/a").openConnection();
        InputStream inputStream = urlConnection.getInputStream();
        String encoding = urlConnection.getContentEncoding();

        // Read the whole response into memory
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int len = 0;
        while ((len = inputStream.read(buffer)) != -1) {
            byteArrayOutputStream.write(buffer, 0, len);
        }

        String htmlContent = null;
        if (encoding != null) {
            htmlContent = new String(byteArrayOutputStream.toByteArray(), encoding);
        } else {
            htmlContent = new String(byteArrayOutputStream.toByteArray());
        }

        // Extract the tweet paragraphs
        Pattern TWEET_PATTERN = Pattern.compile("(<p[^>]+js-tweet-text[^>]*>(.*)</p>)", Pattern.CASE_INSENSITIVE);
        Matcher matcher = TWEET_PATTERN.matcher(htmlContent);
        while (matcher.find()) {
            System.out.println("Tweet Found: " + matcher.group(2));
        }
    }
}
I know that you don't want any libraries but if you want something really quick this is working code in C#:
using (IE browser = new IE())
{
    browser.GoTo("https://twitter.com/user");
    List tweets = browser.List(Find.ById("stream-items-id"));
    if (tweets != null)
    {
        foreach (var tweet in tweets.ListItems)
        {
            var tweetText = tweet.Paras.FirstOrDefault();
            if (tweetText != null)
            {
                MessageBox.Show(tweetText.Text);
            }
        }
    }
}
This program uses a library called WatiN. If you use Visual Studio, go to the Tools menu, select "NuGet Package Manager", then "Manage NuGet Packages for Solution", then "Browse", and type "WatiN" in the search box. Once you find the library, hit "Install". After it is installed, you just add a reference in your code and a using statement:
using WatiN.Core;
You can just copy and paste the code I wrote above into a button handler and it'll work; you only need to change the twitter.com/XXXXXX user name to list that user's tweets. Modify the code accordingly to meet your needs.
I am trying to join two PostScript files into one with ghost4j 0.5.0, as follows:
final PSDocument[] psDocuments = new PSDocument[2];
psDocuments[0] = new PSDocument();
psDocuments[0].load("1.ps");
psDocuments[1] = new PSDocument();
psDocuments[1].load("2.ps");
psDocuments[0].append(psDocuments[1]);
psDocuments[0].write("3.ps");
During this simplified process I got the following exception message for the above "append" line:
org.ghost4j.document.DocumentException: java.lang.ClassCastException:
org.apache.xmlgraphics.ps.dsc.events.UnparsedDSCComment cannot be cast to
org.apache.xmlgraphics.ps.dsc.events.DSCCommentPage
So far I have not been able to find out what the problem is here - maybe some kind of problem within one of the PostScript files?
So help would be appreciated.
EDIT:
I tested with the Ghostscript command-line tool:
gswin32.exe -dQUIET -dBATCH -dNOPAUSE -sDEVICE=pswrite -sOutputFile="test.ps" --filename "1.ps" "2.ps"
which results in a document where 1.ps and 2.ps are merged into one(!) page (i.e. overlay).
When I remove --filename, the resulting document is a PostScript file with two pages, as expected.
The exception occurs because one of the two documents does not follow the Adobe Document Structuring Conventions (DSC), which is mandatory if you want to use the Document append method.
Use the SafeAppenderModifier instead. There is an example here: http://www.ghost4j.org/highlevelapisamples.html (Append a PDF document to a PostScript document)
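A rough sketch of what that looks like, adapted to two PostScript files (based on the linked sample from memory, so the exact setter and method names should be checked against that page):

import java.io.File;
import java.io.FileOutputStream;

import org.ghost4j.document.PSDocument;
import org.ghost4j.modifier.SafeAppenderModifier;

public class AppendPS {
    public static void main(String[] args) throws Exception {
        PSDocument first = new PSDocument();
        first.load(new File("1.ps"));

        PSDocument second = new PSDocument();
        second.load(new File("2.ps"));

        SafeAppenderModifier modifier = new SafeAppenderModifier();
        // Assumed setter name, as in the ghost4j high-level API sample
        modifier.setAppendDocument(second);

        try (FileOutputStream out = new FileOutputStream("3.ps")) {
            modifier.modify(first, out);
        }
    }
}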
I think something is wrong in the document or in the XML Graphics library, as it seems it cannot parse a part of it.
Here you can see the code in ghost4j that I think is failing (link):
DSCParser parser = new DSCParser(bais);
Object tP = parser.nextDSCComment(DSCConstants.PAGES);
while (tP instanceof DSCAtend)
tP = parser.nextDSCComment(DSCConstants.PAGES);
DSCCommentPages pages = (DSCCommentPages) tP;
And here you can see why XML Graphics may be responsible (link):
private DSCComment parseDSCComment(String name, String value) {
    DSCComment parsed = DSCCommentFactory.createDSCCommentFor(name);
    if (parsed != null) {
        try {
            parsed.parseValue(value);
            return parsed;
        } catch (Exception e) {
            //ignore and fall back to unparsed DSC comment
        }
    }
    UnparsedDSCComment unparsed = new UnparsedDSCComment(name);
    unparsed.parseValue(value);
    return unparsed;
}
It seems parsed.parseValue(value) threw an exception; it was swallowed by the catch block, and an unparsed version was returned, which ghost4j didn't expect.
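To narrow down which of the two input files triggers this, you can run the same lookup ghost4j does and see what actually comes back for the %%Pages comment; a small sketch based on the snippet above (only the classes already shown there are used):

import java.io.FileInputStream;

import org.apache.xmlgraphics.ps.DSCConstants;
import org.apache.xmlgraphics.ps.dsc.DSCParser;

public class CheckPagesComment {
    public static void main(String[] args) throws Exception {
        for (String file : new String[] {"1.ps", "2.ps"}) {
            try (FileInputStream in = new FileInputStream(file)) {
                DSCParser parser = new DSCParser(in);
                Object pages = parser.nextDSCComment(DSCConstants.PAGES);
                // A well-formed file should yield a DSCCommentPages (possibly after a DSCAtend);
                // an UnparsedDSCComment here points to a malformed %%Pages: line.
                System.out.println(file + " -> "
                        + (pages == null ? "no %%Pages comment" : pages.getClass().getSimpleName()));
            }
        }
    }
}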