Iterate over repeated specific HL7 segment in Pentaho Data Integration KETTLE

I want to get a specific HL7 segment by its name.
I am using the PipeParser class, but I still don't know how to get each segment by structure name (MSH, PID, OBX, ...).
Sometimes I have repeated segments such as DG1, PV1, or OBX (see the lines below). How can I get the data fields from each segment row in Pentaho Kettle? Do I need Java code in Kettle? If there is such a solution, please help.
OBX|1|TX|PTH_SITE1^Site A|1|left||||||F|||||||
OBX|2|TX|PTH_SPEC1^Specimen A||C-FNA^Fine Needle Aspiration||||||F|||||||
or
DG1|1|I10C|G30.0|Alzheimer's disease with early onset|20160406|W|||||||||
DG1|2|I10C|E87.70|Fluid overload, unspecified|20160406|W|||||||||

You should use an HL7 parser for accurate parsing of any HL7-formatted message.
With the help of HAPI you can either parse your messages from a stream:
// Open an InputStream to read from the file
File file = new File("hl7_messages.txt");
InputStream is = new FileInputStream(file);
// It's generally a good idea to buffer file IO
is = new BufferedInputStream(is);
// The following class is a HAPI utility that will iterate over
// the messages which appear over an InputStream
Hl7InputStreamMessageIterator iter = new Hl7InputStreamMessageIterator(is);
while (iter.hasNext()) {
Message next = iter.next();
// Do something with the message
}
Or you can read each message as a String:
File file = new File("hl7_messages.txt");
// Buffer the file IO, as in the first example
InputStream is = new BufferedInputStream(new FileInputStream(file));
Hl7InputStreamMessageStringIterator iter2 = new Hl7InputStreamMessageStringIterator(is);
while (iter2.hasNext()) {
String next = iter2.next();
// Do something with the message
}
Hope this points you in the right direction.
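To illustrate what "iterating over repeated segments" amounts to, here is a minimal sketch in plain Java that groups the segments of one message by their ID and then walks the OBX and DG1 rows from the question. It is not HAPI and not a substitute for a real parser: it assumes the default delimiters (| and ^), ignores escape sequences and the ~ repetition separator, and does not handle the special field numbering of MSH. The class and method names (Hl7SegmentSketch, groupSegments) are made up for this example. Something along these lines could presumably be adapted for a Kettle "User Defined Java Class" step, but that wiring is not shown here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Hl7SegmentSketch {

    /** Groups the segments of one pipe-delimited message by segment ID, keeping message order. */
    public static Map<String, List<String[]>> groupSegments(String message) {
        Map<String, List<String[]>> bySegmentId = new LinkedHashMap<>();
        // HL7 v2 separates segments with \r; files on disk often use \n or \r\n instead.
        for (String segment : message.split("\\r\\n|\\r|\\n")) {
            if (segment.trim().isEmpty()) {
                continue;
            }
            // A limit of -1 keeps trailing empty fields such as "|||||".
            String[] fields = segment.split("\\|", -1);
            bySegmentId.computeIfAbsent(fields[0], id -> new ArrayList<>()).add(fields);
        }
        return bySegmentId;
    }

    public static void main(String[] args) {
        String message =
              "OBX|1|TX|PTH_SITE1^Site A|1|left||||||F|||||||\r"
            + "OBX|2|TX|PTH_SPEC1^Specimen A||C-FNA^Fine Needle Aspiration||||||F|||||||\r"
            + "DG1|1|I10C|G30.0|Alzheimer's disease with early onset|20160406|W|||||||||\r"
            + "DG1|2|I10C|E87.70|Fluid overload, unspecified|20160406|W|||||||||\r";

        Map<String, List<String[]>> segments = groupSegments(message);

        // One iteration per repeated OBX row.
        for (String[] obx : segments.get("OBX")) {
            // For non-MSH segments, index n holds field n (OBX-3 is obx[3], OBX-5 is obx[5]).
            String[] identifier = obx[3].split("\\^", -1);
            System.out.println("OBX-1=" + obx[1] + " OBX-3.1=" + identifier[0] + " OBX-5=" + obx[5]);
        }
        // One iteration per repeated DG1 row.
        for (String[] dg1 : segments.get("DG1")) {
            System.out.println("DG1-3=" + dg1[3] + " DG1-4=" + dg1[4]);
        }
    }
}
```

For anything beyond a quick extraction (escaped delimiters, repeating fields, nested groups), a real parser such as HAPI remains the safer choice.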

Related

How to read files with an offset from Hadoop using Java

Problem: I want to read a section of a file from HDFS and return it, such as lines 101-120 from a file of 1000 lines.
I don't want to use seek because I have read that it is expensive.
I have log files which I am using PIG to process down into meaningful sets of data. I've been writing an API to return the data for consumption and display by a front end. Those processed data sets can be large enough that I don't want to read the entire file out of Hadoop in one slurp to save wire time and bandwidth. (Let's say 5 - 10MB)
Currently I am using a BufferedReader to return small summary files which is working fine
ArrayList lines = new ArrayList();
...
for (FileStatus item: items) {
// ignoring files like _SUCCESS
if(item.getPath().getName().startsWith("_")) {
continue;
}
in = fs.open(item.getPath());
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String line;
line = br.readLine();
while (line != null) {
line = line.replaceAll("(\\r|\\n)", "");
lines.add(line.split("\t"));
line = br.readLine();
}
}
I've poked around the interwebs quite a bit as well as Stack but haven't found exactly what I need.
Perhaps this is completely the wrong way to go about doing it and I need a completely separate set of code and different functions to manage this. Open to any suggestions.
Thanks!
As an added note, based on research from the discussions below:
How does Hadoop process records split across block boundaries?
Hadoop FileSplit Reading

I think seek is the best option for reading from very large files. It did not cause me any problems, and the data I was reading was in the range of 2 to 3 GB. I have not encountered any issues so far, although we did use file splitting to handle the large data set. Below is code you can use for reading and testing.
public class HDFSClientTesting {
/**
* @param args
*/
public static void main(String[] args) {
// TODO Auto-generated method stub
try{
//System.loadLibrary("libhadoop.so");
Configuration conf = new Configuration();
// Add the resource before obtaining the FileSystem so its settings are picked up
conf.addResource(new Path("core-site.xml"));
FileSystem fs = FileSystem.get(conf);
String Filename = "/dir/00000027";
long ByteOffset = 3185041;
SequenceFile.Reader rdr = new SequenceFile.Reader(fs, new Path(Filename), conf);
Text key = new Text();
Text value = new Text();
rdr.seek(ByteOffset);
rdr.next(key,value);
//Plain text
JSONObject jso = new JSONObject(value.toString());
String content = jso.getString("body");
System.out.println("\n\n\n" + content + "\n\n\n");
File file =new File("test.gz");
file.createNewFile();
}
catch (Exception e ){
throw new RuntimeException(e);
}
finally{
}
}
}
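For the plain-text case in the question (returning lines 101-120), a minimal sketch without seek could look like the following. It works on any InputStream; on HDFS that would presumably be the stream returned by fs.open(path) as in the question's code, while the demo below uses an in-memory stream so it runs on its own. The names LineRangeSketch and readLineRange are made up for this example. Note that the skipped lines are still read and discarded, so the cost grows with the starting line number; a byte offset plus seek avoids that if you can record offsets.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

public class LineRangeSketch {

    /** Returns lines firstLine..lastLine (1-based, inclusive); nothing past lastLine is collected. */
    public static List<String> readLineRange(InputStream in, long firstLine, long lastLine) throws IOException {
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return br.lines()
                     .skip(firstLine - 1)               // discard the lines before the range
                     .limit(lastLine - firstLine + 1)   // stop once the range is complete
                     .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a 1000-line input in memory as a stand-in for the HDFS file.
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= 1000; i++) {
            sb.append("line ").append(i).append('\n');
        }
        InputStream in = new ByteArrayInputStream(sb.toString().getBytes(StandardCharsets.UTF_8));

        List<String> page = readLineRange(in, 101, 120);
        // prints: 20 lines, first=line 101, last=line 120
        System.out.println(page.size() + " lines, first=" + page.get(0) + ", last=" + page.get(page.size() - 1));
    }
}
```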

How to chain multiple different InputStreams into one InputStream

I'm wondering if there is any idiomatic way to chain multiple InputStreams into one continuous InputStream in Java (or Scala).
What I need it for is to parse flat files that I load over the network from an FTP-Server. What I want to do is to take file[1..N], open up streams and then combine them into one stream. So when file1 comes to an end, I want to start reading from file2 and so on, until I reach the end of fileN.
I need to read these files in a specific order. The data comes from a legacy system that produces files in batches, so data in one file depends on data in another, but I would like to handle them as one continuous stream to simplify my domain logic interface.
I searched around and found PipedInputStream, but I'm not positive that is what I need. An example would be helpful.

It's right there in JDK! Quoting JavaDoc of SequenceInputStream:
A SequenceInputStream represents the logical concatenation of other input streams. It starts out with an ordered collection of input streams and reads from the first one until end of file is reached, whereupon it reads from the second one, and so on, until end of file is reached on the last of the contained input streams.
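You want to concatenate an arbitrary number of InputStreams, while the two-argument SequenceInputStream constructor accepts only two (there is also a constructor that takes an Enumeration of streams). But since SequenceInputStream is also an InputStream, you can apply it recursively (nest them):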
new SequenceInputStream(
new SequenceInputStream(
new SequenceInputStream(file1, file2),
file3
),
file4
);
...you get the idea.
See also
How do you merge two input streams in Java? (dup?)

This is done using SequenceInputStream, which is straightforward in Java, as Tomasz Nurkiewicz's answer shows. I had to do this repeatedly in a project recently, so I added some Scala-y goodness via the "pimp my library" pattern.
object StreamUtils {
implicit def toRichInputStream(str: InputStream) = new RichInputStream(str)
class RichInputStream(str: InputStream) {
// a bunch of other handy Stream functionality, deleted
def ++(str2: InputStream): InputStream = new SequenceInputStream(str, str2)
}
}
With that, I can do stream sequencing as follows
val mergedStream = stream1++stream2++stream3
or even
val streamList = //some arbitrary-length list of streams, non-empty
val mergedStream = streamList.reduceLeft(_++_)

Another solution: first create a list of input streams, then create the SequenceInputStream from it:
// Files.list does not guarantee any ordering, so sort explicitly if order matters
List<InputStream> iss = Files.list(Paths.get("/your/path"))
.filter(Files::isRegularFile)
.sorted()
.map(f -> {
try {
return new FileInputStream(f.toFile());
} catch (IOException e) {
throw new UncheckedIOException(e);
}
}).collect(Collectors.toList());
InputStream merged = new SequenceInputStream(Collections.enumeration(iss));
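To see the Enumeration-based constructor in isolation, here is a small self-contained sketch that chains in-memory streams instead of files, so it runs without an FTP server or a directory on disk. The class name ConcatStreamsSketch and the helpers concat, readAll, and stream are made up for this example; only SequenceInputStream and Collections.enumeration come from the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ConcatStreamsSketch {

    /** Chains the streams in list order; each one is read to its end before the next starts. */
    public static InputStream concat(List<InputStream> streams) {
        return new SequenceInputStream(Collections.enumeration(streams));
    }

    /** Drains a stream into a UTF-8 String; for demonstration only. */
    public static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Stand-in for opening file[1..N]. */
    private static InputStream stream(String content) {
        return new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        List<InputStream> parts = Arrays.asList(stream("file1;"), stream("file2;"), stream("file3;"));
        // Closing the merged stream also closes the underlying streams.
        try (InputStream merged = concat(parts)) {
            System.out.println(readAll(merged)); // prints file1;file2;file3;
        }
    }
}
```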

Here is a solution using Vector. This example is Android-specific (it reads from the AssetManager), but the Vector approach works in any Java code.
AssetManager am = getAssets();
Vector<InputStream> v = new Vector<>(Constant.PAGES);
for (int i = 0; i < Constant.PAGES; i++) {
String fileName = "file" + i + ".txt";
InputStream is = am.open(fileName);
v.add(is);
}
Enumeration<InputStream> e = v.elements();
SequenceInputStream sis = new SequenceInputStream(e);
InputStreamReader isr = new InputStreamReader(sis);
Scanner scanner = new Scanner(isr); // or use bufferedReader

Here's a simple Scala version that concatenates an Iterator[InputStream]:
import java.io.{InputStream, SequenceInputStream}
import scala.collection.JavaConverters._
def concatInputStreams(streams: Iterator[InputStream]): InputStream =
new SequenceInputStream(streams.asJavaEnumeration)

Zip file as InputStream then separating each file inside it then converting it to image. In Java

I am getting a zip file as an InputStream. I then separate each file inside it and pass each file's byte array to PDFDocumentReader, which internally uses Apache PDFBox 1.6.0, to convert it to an image.
However, when I pass the byte array to the PDFDocumentReader I get the following exception:
SEVERE: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@44c2beb9
java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@44c2beb9
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:439)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:530)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:862)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:829)
at org.dopdf.document.read.pdf.PDFDocumentReader.init(PDFDocumentReader.java:98)
To fetch each file from the zip I use the following code -
ZipInputStream zis = new ZipInputStream(aZipFile); // aZipFile is byte array
ZipEntry entry;
ArrayList<String> nameOfIgnoredFiles = new ArrayList<String>();
byte data[] = null;
while ((entry = zis.getNextEntry()) != null) {
if (entry.getName().endsWith(".pdf")) {
int dataSize = (int)entry.getSize();
data = new byte[dataSize];
zis.read(data);
// i use data and pass it to the pdf box.
} else {
nameOfIgnoredFiles.add(entry.getName());
}
}
The data byte array fetched above is then passed on as shown below:
PDFDocumentReader document = new PDFDocumentReader(data); // here i get the error
What am I doing wrong? Can you suggest a solution? I guess the fetching of the data byte array is an issue. How to do it the best way?

You are assuming that zis.read(data) fills the buffer. Check the API documentation. It isn't guaranteed to do that. You are also assuming that the size fits into an int, and that the item itself fits into memory. None of these assumptions is valid.
Surely you can pass the entry's InputStream to a pdfbox API?
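If the library really does need a byte array, a minimal sketch of reading each entry completely looks like this. It loops until read returns -1 instead of trusting a single read call or entry.getSize() (which can be -1 for streamed zips). The names ZipEntriesSketch and readPdfEntries are made up for this example, and it still assumes each PDF fits in memory, which the answer above rightly points out is not guaranteed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipEntriesSketch {

    /** Reads every .pdf entry fully into memory, keyed by entry name; other entries are skipped. */
    public static Map<String, byte[]> readPdfEntries(InputStream zipped) throws IOException {
        Map<String, byte[]> result = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(zipped)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory() || !entry.getName().endsWith(".pdf")) {
                    continue; // getNextEntry() closes the skipped entry for us
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                // read() may return fewer bytes than requested; -1 marks the end of this entry
                while ((n = zis.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                result.put(entry.getName(), out.toByteArray());
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Build a small zip in memory so the example is self-contained.
        ByteArrayOutputStream zipBytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(zipBytes)) {
            zos.putNextEntry(new ZipEntry("a.pdf"));
            zos.write(new byte[100_000]); // placeholder bytes, not a real PDF
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("notes.txt"));
            zos.write("ignored".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        Map<String, byte[]> pdfs = readPdfEntries(new ByteArrayInputStream(zipBytes.toByteArray()));
        System.out.println(pdfs.keySet() + " " + pdfs.get("a.pdf").length); // prints [a.pdf] 100000
    }
}
```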

how to append data in docx file using docx4j

Please tell me how to append data to a docx file using Java and docx4j.
I am using a template in docx format in which some fields are filled in by Java at run time.
My problem is that a new file is created for every group of data, and I want to append each new file to a single file instead. Appending with Java streams, as in the code below, does not achieve this.
String outputfilepath = "e:\\Practice/DOC/output/generatedLatterOUTPUT.docx";
String outputfilepath1 = "e:\\Practice/DOC/output/generatedLatterOUTPUT1.docx";
WordprocessingMLPackage wordMLPackage;
public void templetsubtitution(String name, String age, String gender, Document document)
throws Exception {
// input file name
String inputfilepath = "e:\\Practice/DOC/profile.docx";
// out put file name
// id of Xml file
String itemId1 = "{A5D3A327-5613-4B97-98A9-FF42A2BA0F74}".toLowerCase();
String itemId2 = "{A5D3A327-5613-4B97-98A9-FF42A2BA0F74}".toLowerCase();
String itemId3 = "{A5D3A327-5613-4B97-98A9-FF42A2BA0F74}".toLowerCase();
// Load the Package
if (inputfilepath.endsWith(".xml")) {
JAXBContext jc = Context.jcXmlPackage;
Unmarshaller u = jc.createUnmarshaller();
u.setEventHandler(new org.docx4j.jaxb.JaxbValidationEventHandler());
org.docx4j.xmlPackage.Package wmlPackageEl = (org.docx4j.xmlPackage.Package) ((JAXBElement) u
.unmarshal(new javax.xml.transform.stream.StreamSource(
new FileInputStream(inputfilepath)))).getValue();
org.docx4j.convert.in.FlatOpcXmlImporter xmlPackage = new org.docx4j.convert.in.FlatOpcXmlImporter(
wmlPackageEl);
wordMLPackage = (WordprocessingMLPackage) xmlPackage.get();
} else {
wordMLPackage = WordprocessingMLPackage
.load(new File(inputfilepath));
}
CustomXmlDataStoragePart customXmlDataStoragePart = wordMLPackage
.getCustomXmlDataStorageParts().get(itemId1);
// Get the contents
CustomXmlDataStorage customXmlDataStorage = customXmlDataStoragePart
.getData();
// Change its contents
((CustomXmlDataStorageImpl) customXmlDataStorage).setNodeValueAtXPath(
"/ns0:orderForm[1]/ns0:record[1]/ns0:name[1]", name,
"xmlns:ns0='EasyForm'");
customXmlDataStoragePart = wordMLPackage.getCustomXmlDataStorageParts()
.get(itemId2);
// Get the contents
customXmlDataStorage = customXmlDataStoragePart.getData();
// Change its contents
((CustomXmlDataStorageImpl) customXmlDataStorage).setNodeValueAtXPath(
"/ns0:orderForm[1]/ns0:record[1]/ns0:age[1]", age,
"xmlns:ns0='EasyForm'");
customXmlDataStoragePart = wordMLPackage.getCustomXmlDataStorageParts()
.get(itemId3);
// Get the contents
customXmlDataStorage = customXmlDataStoragePart.getData();
// Change its contents
((CustomXmlDataStorageImpl) customXmlDataStorage).setNodeValueAtXPath(
"/ns0:orderForm[1]/ns0:record[1]/ns0:gender[1]", gender,
"xmlns:ns0='EasyForm'");
// Apply the bindings
BindingHandler.applyBindings(wordMLPackage.getMainDocumentPart());
File f = new File(outputfilepath);
wordMLPackage.save(f);
FileInputStream fis = new FileInputStream(f);
ByteArrayOutputStream bos = new ByteArrayOutputStream();
byte[] buf = new byte[1024];
try {
for (int readNum; (readNum = fis.read(buf)) != -1;) {
bos.write(buf, 0, readNum);
}
// System.out.println( buf.length);
} catch (IOException ex) {
}
byte[] bytes = bos.toByteArray();
FileOutputStream file = new FileOutputStream(outputfilepath1, true);
DataOutputStream out = new DataOutputStream(file);
out.write(bytes);
out.flush();
out.close();
System.out.println("..done");
}
public static void main(String[] args) {
utility u = new utility();
// the method takes three Strings; its Document parameter is never used in the body
u.templetsubtitution("aditya", "24", "mohan", null);
}
thanks in advance

If I understand you correctly, you're essentially talking about merging documents. There are two very simple approaches that you can use, and their effectiveness really depends on the structure and onward use of your data:
PhilippeAuriach describes one approach in his answer, which entails appending all components within one MainDocumentPart instance to another. In terms of the final docx file, this means the content that appears in document.xml. It won't take into account headers and footers (for example), but that may be fine for you.
You can insert multiple documents into a single docx file by inserting them as AltChunk elements (see the docx4j documentation). This will bring everything from one Word file into another, headers and all. The downside is that your final document won't be a proper flowing Word file until you open it and save it in MS Word itself (the imported components remain as standalone files within the docx bundle). This will cause you issues if you want to generate 'merged' files and then do something with them, like render PDFs: the merged content will simply be ignored.
The more complete (and complex) approach is to perform a "deep merge". This updates and maintains all references held within a document. Imported content becomes part of the main "flow" of the document (i.e. it is not stored as separate references), so the end result is a properly-merged file which can be rendered to PDF or whatever.
The downside to this is you need a good knowledge of docx structure and the API, and you will be writing a fair amount of code (I would recommend buying a license to Plutext's MergeDocx instead).

I had to deal with something similar, and here is what I did (probably not the most efficient, but it works):
create a finalDoc by loading the template and emptying it (so you have the styles in this doc)
for each data row, create a new doc by loading the template, then replace your fields with your values
use the function below to append the doc filled with the data to the finalDoc:
public static void append(WordprocessingMLPackage docDest, WordprocessingMLPackage docSource) {
List<Object> objects = docSource.getMainDocumentPart().getContent();
for(Object o : objects){
docDest.getMainDocumentPart().getContent().add(o);
}
}
Hope this helps.

Inserting text into an existing file via Java

I would like to create a simple program (in Java) which edits text files - particularly one which performs inserting arbitrary pieces of text at random positions in a text file. This feature is part of a larger program I am currently writing.
Reading the description of java.io.RandomAccessFile, it appears that any write operations performed in the middle of a file would actually overwrite the existing content. This is a side effect which I would like to avoid (if possible).
Is there a simple way to achieve this?
Thanks in advance.

Okay, this question is pretty old, but FileChannels have existed since Java 1.4, and I don't know why they aren't mentioned anywhere when dealing with the problem of replacing or inserting content in files. FileChannels are fast; use them.
Here's an example (ignoring exceptions and some other stuff):
public void insert(String filename, long offset, byte[] content) throws IOException {
RandomAccessFile r = new RandomAccessFile(new File(filename), "rw");
RandomAccessFile rtemp = new RandomAccessFile(new File(filename + "~"), "rw");
long fileSize = r.length();
FileChannel sourceChannel = r.getChannel();
FileChannel targetChannel = rtemp.getChannel();
sourceChannel.transferTo(offset, (fileSize - offset), targetChannel);
sourceChannel.truncate(offset);
r.seek(offset);
r.write(content);
long newOffset = r.getFilePointer();
targetChannel.position(0L);
sourceChannel.transferFrom(targetChannel, newOffset, (fileSize - offset));
sourceChannel.close();
targetChannel.close();
}

Well, no, I don't believe there is a way to avoid overwriting existing content with a single, standard Java IO API call.
If the files are not too large, just read the entire file into an ArrayList (an entry per line) and either rewrite entries or insert new entries for new lines.
Then overwrite the existing file with new content, or move the existing file to a backup and write a new file.
Depending on how sophisticated the edits need to be, your data structure may need to change.
Another method would be to read characters from the existing file while writing to the edited file and edit the stream as it is read.
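As a rough sketch of the "read everything, edit in memory, write back" approach for files that are not too large: the example below inserts one line at a given index and replaces the original via a temporary file in the same directory. The names InsertLineSketch and insertLine are made up for this example, and it assumes UTF-8 text. Line endings are rewritten with the platform's line separator, which may or may not matter for your files.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InsertLineSketch {

    /** Inserts text as a new line at the 0-based index, then swaps the result in for the original. */
    public static void insertLine(Path file, int index, String text) throws IOException {
        // Whole file in memory: one list entry per line.
        List<String> lines = new ArrayList<>(Files.readAllLines(file, StandardCharsets.UTF_8));
        lines.add(index, text);

        // Write to a temp file next to the original, then replace the original with it,
        // so a failure while writing does not leave a half-written original behind.
        Path temp = Files.createTempFile(file.toAbsolutePath().getParent(), "insert", ".tmp");
        Files.write(temp, lines, StandardCharsets.UTF_8);
        Files.move(temp, file, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("insert-line-demo", ".txt");
        try {
            Files.write(file, Arrays.asList("first", "third"), StandardCharsets.UTF_8);
            insertLine(file, 1, "second");
            System.out.println(Files.readAllLines(file, StandardCharsets.UTF_8)); // prints [first, second, third]
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```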

If Java has a way to memory map files, then what you can do is extend the file to its new length, map the file, memmove all the bytes down to the end to make a hole and write the new data into the hole.
This works in C. Never tried it in Java.
Another way I just thought of does the same thing with random file access:
Seek to the end - 1 MB
Read 1 MB
Write that to original position + gap size.
Repeat for each previous 1 MB working toward the beginning of the file.
Stop when you reach the desired gap position.
Use a larger buffer size for faster performance.
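The random-access steps above can be sketched in Java roughly as follows: grow the file by the size of the insertion, shift the tail toward the end chunk by chunk starting from the back, then write the new bytes into the hole. The class name ShiftInsertSketch and the bufferSize parameter are made up for this example; the demo uses a deliberately tiny buffer so that several chunks are moved.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class ShiftInsertSketch {

    /** Inserts content at offset by shifting everything after offset toward the end of the file. */
    public static void insert(File file, long offset, byte[] content, int bufferSize) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            long oldLength = raf.length();
            if (offset < 0 || offset > oldLength) {
                throw new IllegalArgumentException("offset out of range: " + offset);
            }
            long gap = content.length;
            raf.setLength(oldLength + gap); // extend the file to its new length

            byte[] buffer = new byte[bufferSize];
            long end = oldLength; // exclusive end of the part of the tail not yet moved
            while (end > offset) {
                int chunk = (int) Math.min(buffer.length, end - offset);
                long start = end - chunk;
                raf.seek(start);
                raf.readFully(buffer, 0, chunk); // read the chunk before anything overwrites it
                raf.seek(start + gap);
                raf.write(buffer, 0, chunk);     // write it gap bytes further along
                end = start;                     // continue with the previous chunk
            }

            raf.seek(offset);
            raf.write(content); // fill the hole with the new data
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("shift-insert-demo", ".txt");
        try {
            Files.write(file.toPath(), "HelloWorld".getBytes(StandardCharsets.US_ASCII));
            insert(file, 5, ", ".getBytes(StandardCharsets.US_ASCII), 3);
            // prints Hello, World
            System.out.println(new String(Files.readAllBytes(file.toPath()), StandardCharsets.US_ASCII));
        } finally {
            file.delete();
        }
    }
}
```

Working from the back is what makes this safe in place: every chunk is written to a region whose original bytes have already been moved. Unlike the temp-file variants, a crash midway leaves the file in an inconsistent state, so this trades safety for not needing extra disk space.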

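You can use the following code (OUtil is not a JDK class; it appears to be the answerer's own helper for closing readers and writers):
BufferedReader reader = null;
BufferedWriter writer = null;
ArrayList<String> list = new ArrayList<>();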
try {
reader = new BufferedReader(new FileReader(fileName));
String tmp;
while ((tmp = reader.readLine()) != null)
list.add(tmp);
OUtil.closeReader(reader);
list.add(0, "Start Text");
list.add("End Text");
writer = new BufferedWriter(new FileWriter(fileName));
for (int i = 0; i < list.size(); i++)
writer.write(list.get(i) + "\r\n");
} catch (Exception e) {
e.printStackTrace();
} finally {
OUtil.closeReader(reader);
OUtil.closeWriter(writer);
}

I don't know of a handy way to do it directly, other than to:
read the beginning of the file and write it to the target,
write your new text to the target,
read the rest of the file and write it to the target.
About the target: if the files aren't too big, you can construct the new contents in memory and then overwrite the old contents of the file. Otherwise, you can write the result to a temporary file.
This would probably be easiest to do with streams; RandomAccessFile doesn't seem to be meant for inserting in the middle (AFAIK). Check the tutorial if you need to.

I believe the only way to insert text into an existing text file is to read the original file and write the content in a temporary file with the new text inserted. Then erase the original file and rename the temporary file to the original name.
This example is focused on inserting a single line into an existing file, but it may still be of use to you.

If it is a text file, read the existing file into a StringBuffer, append the new content to the same StringBuffer, and then write the StringBuffer back to the file. The file then contains both the existing and the new text.

As the edit queue for @xor_eq's answer is full, here is a new answer with a more documented and slightly improved version of his:
public static void insert(String filename, long offset, byte[] content) throws IOException {
File temp = Files.createTempFile("insertTempFile", ".temp").toFile(); // Create a temporary file to save content to
try (RandomAccessFile r = new RandomAccessFile(new File(filename), "rw"); // Open file for read & write
RandomAccessFile rtemp = new RandomAccessFile(temp, "rw"); // Open temporary file for read & write
FileChannel sourceChannel = r.getChannel(); // Channel of file
FileChannel targetChannel = rtemp.getChannel()) { // Channel of temporary file
long fileSize = r.length();
sourceChannel.transferTo(offset, (fileSize - offset), targetChannel); // Copy content after insert index to
// temporary file
sourceChannel.truncate(offset); // Remove content past insert index from file
r.seek(offset); // Goto back of file (now insert index)
r.write(content); // Write new content
long newOffset = r.getFilePointer(); // The current offset
targetChannel.position(0L); // Goto start of temporary file
sourceChannel.transferFrom(targetChannel, newOffset, (fileSize - offset)); // Copy all content of temporary
// to end of file
}
Files.delete(temp.toPath()); // Delete the temporary file as not needed anymore
}
