How to read a Delimited Text file in Java?

We have the following SEQ file from SFTP:
TSID ,D4 ; TEST ID # (PRIMARY)
TSNAME,A15 ; TEST NAME COMMON (ALTERNATE)
TSRNAM ,A15 ; PORT NAME
TSRELO ,A5 ; TEST REPEAT LOW VALUE
TSREHI ,A5 ; TEST REPEAT HIGH VALUE
TSSSRQ ,D2 ; SAMPLE SIZE REQ.
TSCTYP ,D2 ; CONTAINER TYPE
TSSUOM,A6 ; SAMPLE UNIT OF MEAS
TSINDX ,D4 ; WIDE REPORTING INDEX (ALTERNATE)
TSWKLF ,D2 ; WORKLIST FORMAT
TSMCCD,A8 ; MEDICARE CODE + MODIFIER 1 (ALTERNATE)
TSTADY ,D3 ; RESULT TURN-AROUND TIME IN DAYS
TSENOR ,A1 ; TEST HAS EXPANDED NORMALS Y/N
TSSRPT ,A1 ; ELIGIBLE FOR STATE NOTIFICATION REPORT Y/N
TSPLAB ,D2 ; SENDOUT LAB
The content of the file is simple text like:
0001MONTH COMPOSITE 12319909110940 MONTH COMPOSITE
0002MONTHLY CAPD 12319909120944 MONTHLY CAPD
0003CAPD MONTHLY LS 123199100110021004100510081010101210151016101811620944105917931794 CAPD MONTHLY LS
0004CCPD MONTHLY LS 12319910011002100410051007100810101012101510161018116209400942105917931794 CCPD MONTHLY LS
0005HD MONTHLY LS 1231991001100210041005100710081010101210151016101809400942105917931794 HD MONTHLY LS
Is there any internal Java package (or third-party Java library) available to read a delimited file (.SEQ) in such a way that each value is assigned to a POJO directly using some sort of converter?
For example:
public class Ra {
    @SomethingLength(0, 4)
    private String tsId;
    @SomethingLength(4, 15)
    private String tsName;
}
(Note: we are using Apache Camel here, but I think Camel may be more complicated than a simple library?)

You can use camel-bindy with fixed-length records (https://camel.apache.org/components/latest/bindy-dataformat.html#_4_fixedlengthrecord).
So your class will look like:
@FixedLengthRecord(length = 15, paddingChar = ' ')
public class Fastbox {
    @DataField(pos = 1, length = 4, align = "L")
    private String tsId;
    @DataField(pos = 2, length = 11, align = "L")
    private String tsName;
}
and with unmarshal() you can convert the file to Java objects.
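For example, a minimal route sketch (the endpoint URIs are placeholders; assumes the camel-bindy dependency is on the classpath):
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.dataformat.bindy.fixed.BindyFixedLengthDataFormat;

public class SeqFileRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Bind each line of the SEQ file to a Fastbox instance.
        BindyFixedLengthDataFormat bindy = new BindyFixedLengthDataFormat(Fastbox.class);
        from("file:input?fileName=tests.seq")       // placeholder file endpoint
            .split(body().tokenize("\n"))
            .unmarshal(bindy)
            .to("log:parsedRecord");                // the body is now a Fastbox POJO
    }
}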
More details are in the link above.
Hope it will help!

After much introspection, I will use
http://fixedformat4j.ancientprogramming.com/usage/index.html
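A rough sketch of how the mapping could look with it, based on the usage page above (offsets are 1-based; the field boundaries assume the TSID/TSNAME layout from the file):
import com.ancientprogramming.fixedformat4j.annotation.Field;
import com.ancientprogramming.fixedformat4j.annotation.Record;
import com.ancientprogramming.fixedformat4j.format.FixedFormatManager;
import com.ancientprogramming.fixedformat4j.format.impl.FixedFormatManagerImpl;

@Record
public class TestRecord {

    private String tsId;
    private String tsName;

    @Field(offset = 1, length = 4)     // columns 1-4: TSID
    public String getTsId() { return tsId; }
    public void setTsId(String tsId) { this.tsId = tsId; }

    @Field(offset = 5, length = 15)    // columns 5-19: TSNAME
    public String getTsName() { return tsName; }
    public void setTsName(String tsName) { this.tsName = tsName; }

    public static void main(String[] args) {
        FixedFormatManager manager = new FixedFormatManagerImpl();
        TestRecord r = manager.load(TestRecord.class, "0001MONTH COMPOSITE");
        System.out.println(r.getTsId() + " / " + r.getTsName());
    }
}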

Related

Tf Idf over a set of documents to find words relevance

I have 2 books in txt format (6000+ lines). Using Python, I would like to associate each word with its relevance (using the tf-idf algorithm) and order the words in descending order.
I tried this code:
# -*- coding: utf-8 -*-
from __future__ import division, unicode_literals
import math
from textblob import TextBlob as tb

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

document1 = tb("" "FULL BOOK1 TEST" "")
document2 = tb("" "FULL BOOK2 TEST" "")
bloblist = [document1, document2]
for i, blob in enumerate(bloblist):
    with open("result.txt", 'w') as textfile:
        print("Top words in document {}".format(i + 1))
        scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
        sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        for word, score in sorted_words:
            textfile.write("Word: {}, TF-IDF: {}".format(word, round(score, 5)) + "\n")
which I found here https://stevenloria.com/tf-idf/ and modified slightly, but it takes a lot of time and after some minutes it crashes saying TypeError: coercing to Unicode: need string or buffer, float found.
Why?
I also tried to call this Java program from Python: https://github.com/mccurdyc/tf-idf/. The program works, but the output is incorrect: there are a lot of words that should have a high relevance level but are instead categorized with 0 relevance.
Is there a way to fix that Python code?
Or can you suggest another tf-idf implementation that correctly does what I want?

Loading saved Tensorflow model in Java

I have developed a Tensorflow model with Python in Linux based on the tutorial here: http://cv-tricks.com/tensorflow-tutorial/training-convolutional-neural-network-for-image-classification/. I trained and saved the model using tf.train.Saver. I am able to deploy the model in the Linux environment and perform prediction successfully. Now I need to be able to load this saved model in Java on Windows. Through extensive research online I have read that it does not work with tf.train.Saver and that I have to change my code to use "Serving" to be able to load a saved TF model in Java. Therefore, I followed the tutorial here: https://github.com/tensorflow/serving/blob/master/tensorflow_serving/example/mnist_saved_model.py and changed my code. However, I have an error with tf.FixedLenFeature where it is asking me to use FixedLenSequenceFeature. Here is the complete error message:
"ValueError: First dimension of shape for feature x unknown. Consider using FixedLenSequenceFeature."
which is happening here:
feature_configs = {'x': tf.FixedLenFeature(shape=[None, img_size,img_size,num_channels], dtype=tf.float32),}
I am not sure this is the right path to take since I have a batch of images of size [batchsize*128*128*3] and should not be using the sequence feature! It would be great if someone could clear this up for me and answer these questions:
1- Do I have to change my code from "tf.train.Saver" to "serving" to be able to load the saved model and deploy it in Java?
2- If the answer to the above question is yes, how can I feed the data correctly and solve the aforementioned error?
3- Is there any example of how to deploy a model that was saved using "serving"?
Here is my training code that throws the error:
import dataset
import tensorflow as tf
import time
from datetime import timedelta
import math
import random
import numpy as np
import os
#Adding Seed so that random initialization is consistent
from numpy.random import seed
seed(1)
from tensorflow import set_random_seed
set_random_seed(2)
batch_size = 32
#Prepare input data
classes = ['class1','class2','class3']
num_classes = len(classes)
# 20% of the data will automatically be used for validation
validation_size = 0.2
img_size = 128
num_channels = 3
train_path='/home/user1/Downloads/Expression/Augmented/Data/Train'
# We shall load all the training and validation images and labels into memory using openCV and use that during training
data = dataset.read_train_sets(train_path, img_size, classes, validation_size=validation_size)
print("Complete reading input data. Will Now print a snippet of it")
print("Number of files in Training-set:\t\t{}".format(len(data.train.labels)))
print("Number of files in Validation-set:\t{}".format(len(data.valid.labels)))
session = tf.Session()
serialized_tf_example = tf.placeholder(tf.string, name='tf_example')
feature_configs = {'x': tf.FixedLenFeature(shape=[None, img_size,img_size,num_channels], dtype=tf.float32),}
tf_example = tf.parse_example(serialized_tf_example, feature_configs)
x = tf.identity(tf_example['x'], name='x') # use tf.identity() to assign name
# x = tf.placeholder(tf.float32, shape=[None, img_size,img_size,num_channels], name='x')
## labels
y_true = tf.placeholder(tf.float32, shape=[None, num_classes], name='y_true')
y_true_cls = tf.argmax(y_true, dimension=1)
##Network graph params
filter_size_conv1 = 3
num_filters_conv1 = 32
filter_size_conv2 = 3
num_filters_conv2 = 32
filter_size_conv3 = 3
num_filters_conv3 = 64
fc_layer_size = 128
def create_weights(shape):
    return tf.Variable(tf.truncated_normal(shape, stddev=0.05))

def create_biases(size):
    return tf.Variable(tf.constant(0.05, shape=[size]))
def create_convolutional_layer(input,
                               num_input_channels,
                               conv_filter_size,
                               num_filters):
    ## We shall define the weights that will be trained using create_weights function.
    weights = create_weights(shape=[conv_filter_size, conv_filter_size, num_input_channels, num_filters])
    ## We create biases using the create_biases function. These are also trained.
    biases = create_biases(num_filters)
    ## Creating the convolutional layer
    layer = tf.nn.conv2d(input=input,
                         filter=weights,
                         strides=[1, 1, 1, 1],
                         padding='SAME')
    layer += biases
    ## We shall be using max-pooling.
    layer = tf.nn.max_pool(value=layer,
                           ksize=[1, 2, 2, 1],
                           strides=[1, 2, 2, 1],
                           padding='SAME')
    ## Output of pooling is fed to Relu which is the activation function for us.
    layer = tf.nn.relu(layer)
    return layer
def create_flatten_layer(layer):
    # We know that the shape of the layer will be [batch_size img_size img_size num_channels]
    # But let's get it from the previous layer.
    layer_shape = layer.get_shape()
    ## Number of features will be img_height * img_width * num_channels. But we shall calculate it in place of hard-coding it.
    num_features = layer_shape[1:4].num_elements()
    ## Now, we Flatten the layer so we shall have to reshape to num_features
    layer = tf.reshape(layer, [-1, num_features])
    return layer
def create_fc_layer(input,
                    num_inputs,
                    num_outputs,
                    use_relu=True):
    # Let's define trainable weights and biases.
    weights = create_weights(shape=[num_inputs, num_outputs])
    biases = create_biases(num_outputs)
    # Fully connected layer takes input x and produces wx+b. Since these are matrices, we use the matmul function in Tensorflow
    layer = tf.matmul(input, weights) + biases
    if use_relu:
        layer = tf.nn.relu(layer)
    return layer
layer_conv1 = create_convolutional_layer(input=x,
num_input_channels=num_channels,
conv_filter_size=filter_size_conv1,
num_filters=num_filters_conv1)
layer_conv2 = create_convolutional_layer(input=layer_conv1,
num_input_channels=num_filters_conv1,
conv_filter_size=filter_size_conv2,
num_filters=num_filters_conv2)
layer_conv3= create_convolutional_layer(input=layer_conv2,
num_input_channels=num_filters_conv2,
conv_filter_size=filter_size_conv3,
num_filters=num_filters_conv3)
layer_flat = create_flatten_layer(layer_conv3)
layer_fc1 = create_fc_layer(input=layer_flat,
num_inputs=layer_flat.get_shape()[1:4].num_elements(),
num_outputs=fc_layer_size,
use_relu=True)
layer_fc2 = create_fc_layer(input=layer_fc1,
num_inputs=fc_layer_size,
num_outputs=num_classes,
use_relu=False)
y_pred = tf.nn.softmax(layer_fc2,name='y_pred')
y_pred_cls = tf.argmax(y_pred, dimension=1)
values, indices = tf.nn.top_k(y_pred, 3)
table = tf.contrib.lookup.index_to_string_table_from_tensor(
tf.constant([str(i) for i in xrange(3)]))
prediction_classes = table.lookup(tf.to_int64(indices))
session.run(tf.global_variables_initializer())
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=layer_fc2,
labels=y_true)
cost = tf.reduce_mean(cross_entropy)
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(cost)
correct_prediction = tf.equal(y_pred_cls, y_true_cls)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
session.run(tf.global_variables_initializer())
def show_progress(epoch, feed_dict_train, feed_dict_validate, val_loss):
    acc = session.run(accuracy, feed_dict=feed_dict_train)
    val_acc = session.run(accuracy, feed_dict=feed_dict_validate)
    msg = "Training Epoch {0} --- Training Accuracy: {1:>6.1%}, Validation Accuracy: {2:>6.1%}, Validation Loss: {3:.3f}"
    print(msg.format(epoch + 1, acc, val_acc, val_loss))
total_iterations = 0
# saver = tf.train.Saver()
def train(num_iteration):
    global total_iterations
    for i in range(total_iterations,
                   total_iterations + num_iteration):
        x_batch, y_true_batch, _, cls_batch = data.train.next_batch(batch_size)
        x_valid_batch, y_valid_batch, _, valid_cls_batch = data.valid.next_batch(batch_size)
        feed_dict_tr = {x: x_batch,
                        y_true: y_true_batch}
        feed_dict_val = {x: x_valid_batch,
                         y_true: y_valid_batch}
        session.run(optimizer, feed_dict=feed_dict_tr)
        if i % int(data.train.num_examples/batch_size) == 0:
            print(i)
            val_loss = session.run(cost, feed_dict=feed_dict_val)
            epoch = int(i / int(data.train.num_examples/batch_size))
            show_progress(epoch, feed_dict_tr, feed_dict_val, val_loss)
            print("Saving the model Now!")
            # saver.save(session, save_path_full, global_step=i)
    total_iterations += num_iteration
train(num_iteration=10000)#3000
# Export model
# WARNING(break-tutorial-inline-code): The following code snippet is
# in-lined in tutorials, please update tutorial documents accordingly
# whenever code changes.
export_path_base = './SavedModel/'
export_path = os.path.join(
tf.compat.as_bytes(export_path_base),
tf.compat.as_bytes(str(1)))
print 'Exporting trained model to', export_path
builder = tf.saved_model.builder.SavedModelBuilder(export_path)
# Build the signature_def_map.
classification_inputs = tf.saved_model.utils.build_tensor_info(
serialized_tf_example)
classification_outputs_classes = tf.saved_model.utils.build_tensor_info(
prediction_classes)
classification_outputs_scores = tf.saved_model.utils.build_tensor_info(values)
classification_signature = (
tf.saved_model.signature_def_utils.build_signature_def(
inputs={
tf.saved_model.signature_constants.CLASSIFY_INPUTS:
classification_inputs
},
outputs={
tf.saved_model.signature_constants.CLASSIFY_OUTPUT_CLASSES:
classification_outputs_classes,
tf.saved_model.signature_constants.CLASSIFY_OUTPUT_SCORES:
classification_outputs_scores
},
method_name=tf.saved_model.signature_constants.CLASSIFY_METHOD_NAME))
tensor_info_x = tf.saved_model.utils.build_tensor_info(x)
tensor_info_y = tf.saved_model.utils.build_tensor_info(y_pred)
prediction_signature = (
tf.saved_model.signature_def_utils.build_signature_def(
inputs={'images': tensor_info_x},
outputs={'scores': tensor_info_y},
method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME))
legacy_init_op = tf.group(tf.tables_initializer(), name='legacy_init_op')
builder.add_meta_graph_and_variables(
sess, [tf.saved_model.tag_constants.SERVING],
signature_def_map={
'predict_images':
prediction_signature,
tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
classification_signature,
},
legacy_init_op=legacy_init_op)
builder.save()
print 'Done exporting!'
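Loading a SavedModel export like this from Java is typically done with org.tensorflow.SavedModelBundle; here is a minimal sketch (the tensor names come from the code above, everything else is assumed):
import org.tensorflow.SavedModelBundle;
import org.tensorflow.Tensor;

public class LoadSavedModel {
    public static void main(String[] args) {
        // "serve" corresponds to tag_constants.SERVING used in the export code above.
        try (SavedModelBundle model = SavedModelBundle.load("./SavedModel/1", "serve")) {
            float[][][][] batch = new float[1][128][128][3]; // one dummy 128x128x3 image
            try (Tensor<?> input = Tensor.create(batch)) {
                Tensor<?> scores = model.session().runner()
                        .feed("x", input)   // tensor created by tf.identity(..., name='x')
                        .fetch("y_pred")    // softmax output named 'y_pred'
                        .run()
                        .get(0);
                float[][] probabilities = new float[1][3];
                scores.copyTo(probabilities);
                System.out.println(java.util.Arrays.toString(probabilities[0]));
                scores.close();
            }
        }
    }
}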

Only print lines from bottom up if there is data

How would you go about solving the following logic problem?
I have a PDF file with these cells:
addressLine1
addressLine2
addressLine3
addressLine4
addressLine5
cityStateZip
All of them have getters.
Sometimes, all fields have data and sometimes they don't.
To make it pretty, I want them grouped together, i.e.:
1261 Graeber St (address4)
Bldg 2313 Rm 24 (address5)
Pensacola FL 32508 (cityStateZip)
You need to account for some of these address lines being blank, e.g. if addressLine1 is the only one with data:
1261 Graeber St (address5)
Pensacola FL 32508 (cityStateZip)
Here, since address2, address3, and address4 are blank, we moved address1 into PDF cell address5.
My code right now prints:
1261 Graeber St (address1)
(address2)
(address3)
(address4)
(address5)
Pensacola FL 32508 (cityStateZip)
And here is the code:
FdfInput.SetValue("addressLine1", getAddressLine1() );
FdfInput.SetValue("addressLine2", getAddressLine2() );
FdfInput.SetValue("addressLine3", getAddressLine3() );
FdfInput.SetValue("addressLine4", getAddressLine4() );
FdfInput.SetValue("addressLine5", getAddressLine5() );
FdfInput.SetValue("addressLine6", getCityStateZip() );
The picture on the left is how it looks right now; I want it to look like the picture on the right.
Is this a good candidate for LinkedList.insertLast()?
This:
if (!getAddressLine1().isEmpty())
    FdfInput.SetValue("addressLine1", getAddressLine1());
if (!getAddressLine2().isEmpty())
    FdfInput.SetValue("addressLine2", getAddressLine2());
if (!getAddressLine3().isEmpty())
    FdfInput.SetValue("addressLine3", getAddressLine3());
if (!getAddressLine4().isEmpty())
    FdfInput.SetValue("addressLine4", getAddressLine4());
if (!getAddressLine5().isEmpty())
    FdfInput.SetValue("addressLine5", getAddressLine5());
if (!getCityStateZip().isEmpty())
    FdfInput.SetValue("cityStateZip", getCityStateZip());
In other words, if there is data to add to the line, do so; otherwise, skip it entirely. For example, let's say all of the fields are empty besides address3, address5, and cityStateZip.
// The output will not look like this, with blank rows left where the empty fields were:
(blank)
(blank)
addressLine3
(blank)
addressLine5
cityStateZip
Instead, it will look like:
addressLine3
addressLine5
cityStateZip
I solved it by storing the strings in an ArrayList and decrementing the counter on the field name:
List<String> addrLines = new ArrayList<String>();
if (!getCityStateZip().isEmpty())
    addrLines.add(getCityStateZip());
if (!getAddressLine5().isEmpty())
    addrLines.add(getAddressLine5());
if (!getAddressLine4().isEmpty())
    addrLines.add(getAddressLine4());
if (!getAddressLine3().isEmpty())
    addrLines.add(getAddressLine3());
if (!getAddressLine2().isEmpty())
    addrLines.add(getAddressLine2());
if (!getAddressLine1().isEmpty())
    addrLines.add(getAddressLine1());
for (int i = addrLines.size(); i > 0; --i)
{
    int line = addrLines.size() - i;
    String field = String.format("addressLine%d", 6 - line);
    FdfInput.SetValue(field, addrLines.get(line));
}

Batch file renaming – inserting text from a list (in Python or Java)

I'm finishing a business card production flow (excel > xml > indesign > single page pdfs) and I would like to insert the employees' names in the filenames.
What I have now:
BusinessCard_01_Blue.pdf
BusinessCard_02_Blue.pdf
BusinessCard_03_Blue.pdf (they are gonna go up to the hundreds)
What I need (I can manipulate the name list with regex easily):
BusinessCard_01_CarlosJorgeSantos_Blue.pdf
BusinessCard_02_TaniaMartins_Blue.pdf
BusinessCard_03_MarciaLima_Blue.pdf
I'm a Java and Python toddler. I've read the related questions, tried this in Automator (Mac) and Name Mangler, but couldn't get it to work.
Thanks in advance,
Gus
Assuming you have a map where you can look up the right name, you could do something like this in Java:
List<File> originalFiles = ...
for (File f : originalFiles) {
    f.renameTo(new File(getNameFor(f)));
}
And define getNameFor to be something like:
public String getNameFor(File f) {
    Map<String, String> namesMap = ...
    return namesMap.get(f.getName());
}
In the map you'll have the associations:
BusinessCard_01_Blue.pdf => BusinessCard_01_CarlosJorgeSantos_Blue.pdf
Does it make sense?
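A fuller sketch of the same idea (the directory path and name mapping below are just placeholders):
import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class RenameCards {
    public static void main(String[] args) {
        // Placeholder mapping: original file name -> new file name.
        Map<String, String> namesMap = new HashMap<String, String>();
        namesMap.put("BusinessCard_01_Blue.pdf", "BusinessCard_01_CarlosJorgeSantos_Blue.pdf");
        namesMap.put("BusinessCard_02_Blue.pdf", "BusinessCard_02_TaniaMartins_Blue.pdf");

        File dir = new File("cards");               // placeholder directory
        for (File f : dir.listFiles()) {
            String newName = namesMap.get(f.getName());
            if (newName != null) {
                f.renameTo(new File(dir, newName)); // keep the renamed file in the same folder
            }
        }
    }
}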
In Python (tested):
#!/usr/bin/python
import sys, os, shutil, re

try:
    pdfpath = sys.argv[1]
except IndexError:
    pdfpath = os.curdir

employees = {1: 'Bob', 2: 'Joe', 3: 'Sara'}  # emp_id: 'name'
files = [f for f in os.listdir(pdfpath) if re.match("BusinessCard_[0-9]+_Blue.pdf", f)]
idnumbers = [int(re.search("[0-9]+", f).group(0)) for f in files]
filenamemap = zip(files, [employees[i] for i in idnumbers])
newfiles = [re.sub('Blue.pdf', e + '_Blue.pdf', f) for f, e in filenamemap]
for old, new in zip(files, newfiles):
    shutil.move(os.path.join(pdfpath, old), os.path.join(pdfpath, new))
EDIT: This now alters only those files that have not yet been altered.
Let me know if you want something that will build the employees dictionary automatically.
If you have a list of names in the same order the files were produced, in Python it would go something like this untested fragment:
#!/usr/bin/python
import os

f = open('list.txt', 'r')
for n, name in enumerate(f):
    original_name = 'BusinessCard_%02d_Blue.pdf' % (n + 1)
    new_name = 'BusinessCard_%02d_%s_Blue.pdf' % (
        n + 1, ''.join(name.title().split()))
    if os.path.isfile(original_name):
        print "Renaming %s to %s" % (original_name, new_name),
        os.rename(original_name, new_name)
        print "OK!"
    else:
        print "File %s not found." % original_name
Python:
Assuming you have implemented the naming logic already:
for f in os.listdir(<directory>):
    try:
        os.rename(f, new_name(f))
    except OSError:
        pass  # fail
You will, of course, need to write a function new_name which takes the string "BusinessCard_01_Blue.pdf" and returns the string "BusinessCard_01_CarlosJorgeSantos_Blue.pdf".

Parse a task list

A file contains the following:
HPWAMain.exe 3876 Console 1 8,112 K
hpqwmiex.exe 3900 Services 0 6,256 K
WmiPrvSE.exe 3924 Services 0 8,576 K
jusched.exe 3960 Console 1 5,128 K
DivXUpdate.exe 3044 Console 1 16,160 K
WiFiMsg.exe 3984 Console 1 6,404 K
HpqToaster.exe 2236 Console 1 7,188 K
wmpnscfg.exe 3784 Console 1 6,536 K
wmpnetwk.exe 3732 Services 0 11,196 K
skypePM.exe 2040 Console 1 25,960 K
I want to get the process ID of skypePM.exe. How is this possible in Java?
Any help is appreciated.
Algorithm
Open the file.
In a loop, read a line of text.
If the line of text starts with skypePM.exe then extract the number.
Repeat looping until all lines have been read from the file.
Close the file.
Implementation
import java.io.*;

public class T {
    public static void main(String args[]) throws Exception {
        BufferedReader br = new BufferedReader(
            new InputStreamReader(
                new FileInputStream("tasklist.txt")));
        String line;
        while ((line = br.readLine()) != null) {
            if (line.startsWith("skypePM.exe")) {
                line = line.substring("skypePM.exe".length());
                int taskId = Integer.parseInt((line.trim().split(" "))[0]);
                System.out.println("Task Id: " + taskId);
            }
        }
        br.close();
    }
}
Alternate Implementation
If you have Cygwin and related tools installed, you could use:
cat tasklist.txt | grep skypePM.exe | awk '{ print $2; }'
To find the process ID of the application SkypePM:
Open the file.
Read the lines one by one.
Find the line which has SkypePM.exe at the beginning.
In that line, parse out the number that follows the process name, skipping the spaces.
That number is the process ID.
It is all string operations.
Remember that the format of the file must not change after you write the code.
If you really want to parse the output, you may need a different strategy. If your output file really is the result of a tasklist execution, then it should have some column headers at the top of it like:
Image Name PID Session Name Session# Mem Usage
========================= ======== ================ =========== ============
I would use these, in particular the set of equal signs with spaces, to break any subsequent strings using a fixed-width column strategy. This way, you could have more flexibility in parsing the output if needed (i.e. maybe someone is looking for java.exe or wjava.exe). Do keep in mind the last column may not be padded with spaces all the way to the end.
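As a sketch of that fixed-width idea (assuming the file really does contain the tasklist header and separator rows shown above):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class TasklistColumns {
    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new FileReader("tasklist.txt"))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        // Find the separator row made of '=' runs; each run marks one column's width.
        String sep = null;
        for (String l : lines) {
            if (l.startsWith("=")) { sep = l; break; }
        }
        if (sep == null) return; // not a standard tasklist dump with headers
        List<int[]> cols = new ArrayList<>();
        int start = -1;
        for (int i = 0; i <= sep.length(); i++) {
            boolean eq = i < sep.length() && sep.charAt(i) == '=';
            if (eq && start < 0) start = i;
            if (!eq && start >= 0) { cols.add(new int[]{start, i}); start = -1; }
        }
        // Slice every data row on those boundaries; column 0 is the image name, column 1 the PID.
        for (String row : lines) {
            if (row.startsWith("=") || row.length() <= cols.get(1)[0]) continue;
            String image = row.substring(cols.get(0)[0], Math.min(cols.get(0)[1], row.length())).trim();
            String pid = row.substring(cols.get(1)[0], Math.min(cols.get(1)[1], row.length())).trim();
            if (image.equalsIgnoreCase("skypePM.exe")) {
                System.out.println("Task Id: " + Integer.parseInt(pid));
            }
        }
    }
}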
I will say, in the strictest sense, the existing answers should work for just getting the PID.
Implementing this in Java is not the best way; a shell or other scripting language may help you a lot. Anyway, JAWK is an implementation of awk in Java; I think it may help you.
