Univocity - writing out quote escape character using RowWriterProcessor - java

I am using univocity to parse files to beans, perform some magic on the objects, and afterwards write the beans back to a file. The code snippet that reproduces my problem:
public class Sample {
public static void main(String[] args) throws IOException {
BeanListProcessor<Person> rowProcessor = new BeanListProcessor<Person>(Person.class);
CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.setProcessor(rowProcessor);
parserSettings.getFormat().setDelimiter(',');
parserSettings.getFormat().setQuote('"');
parserSettings.getFormat().setQuoteEscape('/');
CsvParser parser = new CsvParser(parserSettings);
parser.parse(new FileReader("src/main/resources/person.csv"));
List<Person> beans = rowProcessor.getBeans();
Writer outputWriter = new FileWriter("src/main/resources/personOut.csv", true);
CsvWriterSettings settings = new CsvWriterSettings();
settings.getFormat().setDelimiter(',');
settings.getFormat().setQuote('"');
settings.getFormat().setQuoteEscape('/');
settings.setRowWriterProcessor(new BeanWriterProcessor<Person>(Person.class));
CsvWriter writer = new CsvWriter(outputWriter, settings);
for (Person person : beans) {
writer.processRecord(person);
}
writer.close();
}
}
Bean class (getters and setters omitted):
public class Person {
@Parsed(index = 0)
private int id;
@Parsed(index = 1)
private String firstName;
@Parsed(index = 2)
private String lastName;
}
And the file content (input):
1,"Alex ,/,awesome/,",chan
2,"Peter ,boring",pitt
The file is not under my control, as it is provided by an external party. After the application performs its operations on the objects (not included in the code snippet), I need to return the file to the external party with exactly the same settings.
Desired file output content:
1,"Alex ,/,awesome/,",chan
2,"Peter ,boring",pitt
However I am getting:
1,"Alex ,,awesome,",chan
2,"Peter ,boring",pitt
Is there a way to include the actual used quote escape character when writing the beans back to a file?
EDIT
I have tried using settings.setQuoteAllFields(true); on the writer and all combinations of
parserSettings.setKeepQuotes(true);
parserSettings.setKeepEscapeSequences(true);
on the CsvParserSettings, but none seem to give me the result I am looking for.

The solution for this is divided in two parts:
First, please make sure you use version 2.2.3, which has just been released with an adjustment to properly capture the quote escape character when it is not itself escaped, which is your case:
Here the / is not escaped:
"Alex ,/,awesome/,"
Versions prior to 2.2.3 would expect this:
"Alex ,//,awesome//,"
Second, when writing this back to the output, you won't want the writer to escape the quote escape character itself, i.e., given the string "Alex ,/,awesome/,"
you want to get:
"Alex ,/,awesome/,"
Instead of what it will do by default:
"Alex ,//,awesome//,"
On the CsvWriterSettings, do this:
settings.getFormat().setCharToEscapeQuoteEscaping('\0');
And it will not try to escape the quote escape character.
Hope it helps
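To make the two outputs above concrete, here is a toy model of writing a single quoted field. It is not univocity code and makes no claim about how the library is implemented; the class QuoteEscapeDemo, the method writeQuoted and its parameters are hypothetical names used only for illustration. It assumes that passing '\0' as the "escape of the escape" means "do not escape the quote escape character".

```java
class QuoteEscapeDemo {
    // '\0' is used here to mean "escaping of the quote escape character is disabled"
    static final char DISABLED = '\0';

    // Toy model of writing one quoted CSV field (an illustration, not library code).
    static String writeQuoted(String value, char quote, char quoteEscape, char escapeOfEscape) {
        StringBuilder out = new StringBuilder().append(quote);
        for (char c : value.toCharArray()) {
            if (c == quote) {
                out.append(quoteEscape);        // a quote inside the value is escaped
            } else if (c == quoteEscape && escapeOfEscape != DISABLED) {
                out.append(escapeOfEscape);     // the escape character itself is escaped
            }
            out.append(c);
        }
        return out.append(quote).toString();
    }

    public static void main(String[] args) {
        String value = "Alex ,/,awesome/,";
        // Escape character is itself escaped: every '/' is doubled
        System.out.println(writeQuoted(value, '"', '/', '/'));
        // Escaping of the escape character disabled: '/' is written unchanged
        System.out.println(writeQuoted(value, '"', '/', DISABLED));
    }
}
```

The first call corresponds to the doubled "//" output described above, the second to the output the question asks for.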

Related

Line breaks in field treated as end of line while parsing csv file

In a CSV file, I have a record that renders like this:
,"SKYY SPA MARTINI
2 oz. SKYY Vodka
Fresh cucumber
Fresh mint
Splash of simple syrup
Muddle cucumber & mint with syrup.
Add SKYY Vodka and shake with ice.
Strain into a chilled martini glass.
Garnish with a fresh mint sprig and cucumber slice.",
with each line ending in an LF (line feed) character.
I thought that this would be treated as a single string and that the line feeds wouldn't be treated as new lines, but this isn't the case, and it is breaking my script. Is there a way to have the reader treat line breaks as record separators only when they are not enclosed in quotes? I'm currently using the code below and couldn't find a setting on the tokenizer that would allow this.
// instantiate description line mapper
DelimitedLineTokenizer lineTokenizer = new DelimitedLineTokenizer();
DefaultLineMapper<LCBOProduct> lineMapper = new DefaultLineMapper<>();
lineMapper.setLineTokenizer(lineTokenizer);
lineMapper.setFieldSetMapper(fieldSetMapper);
// set description line mapper
reader.setLineMapper(lineMapper);
return reader;
Inspired by this CSV regex post, I have written a quick-and-dirty method for doing this:
public static void main(String[] args) {
String line = "\"BEEP\",\"BOOP\",\"TWO SHOTS\rOF VODKA\"\r\"BOOP\",\"BEEP\",\"LEMON\rWEDGES\"";
String quote = "\"";
String splitter = "\r";
String delimiter = ",";
parse(line, delimiter, quote, splitter);
}
public static void parse(String data, String delimiter, String quote, String splitter) {
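// split on the splitter only when it is followed by an even number of quote characters
String regex = splitter + "(?=(?:[^" + quote + "]*" + quote + "[^" + quote + "]*" + quote + ")*[^" + quote + "]*$)";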
String[] lines = data.split(regex, -1);
List<String[]> records = new ArrayList<String[]>();
for(String line : lines) {
records.add(line.split(delimiter, -1));
}
for(String[] line : records) {
for(String record : line) {
System.out.println("RECORD: " + record); //do whatever
}
}
}
Of course, considering the large size of some CSV files, you will need to chug along with a StringBuilder and likely use myStringBuilder.toString().split(regex, -1); for the parse method.
This is likely not the Spring way of doing things. But as Jim Garrison commented, this is an edge case that I'm not sure if Spring has ways of solving.
A more complex regex may be required if the records start using other nasty characters (commas, quotes, etc.). I don't know what the source of these records could be, but some sanitizing may be in order before splitting the file.
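As an alternative to the lookahead regex above, the same idea can be sketched as a single pass over the characters that tracks whether the current position is inside quotes. This is only a sketch under stated assumptions: the quote character is a double quote, a literal quote inside a field is written as two quotes, and the class and method names (QuotedRecordSplitter, splitRecords) are hypothetical. It is not a Spring Batch API.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedRecordSplitter {
    // Splits raw CSV text into records; line breaks inside double quotes are kept as data.
    static List<String> splitRecords(String data) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == '"') {
                // a doubled quote ("") toggles twice, so the state is unchanged afterwards
                inQuotes = !inQuotes;
                current.append(c);
            } else if ((c == '\n' || c == '\r') && !inQuotes) {
                // treat "\r\n" as one record separator
                if (c == '\r' && i + 1 < data.length() && data.charAt(i + 1) == '\n') {
                    i++;
                }
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "\"BEEP\",\"BOOP\",\"TWO SHOTS\nOF VODKA\"\n\"BOOP\",\"BEEP\",\"LEMON\nWEDGES\"";
        for (String record : splitRecords(data)) {
            // show embedded line feeds as \n so each record prints on one line
            System.out.println("RECORD: " + record.replace("\n", "\\n"));
        }
    }
}
```

Unlike the regex, this does not rescan the rest of the input for every candidate line break, which matters for large files.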

Best way to populate a user defined object using the values of string array

I am reading two different CSV files and populating data into two different objects. I split each line of a CSV file with a regex (the regex is different for the two files) and populate the object from the resulting array, as shown below:
public static <T> List<T> readCsv(String filePath, String type) {
List<T> list = new ArrayList<T>();
try {
File file = new File(filePath);
FileInputStream fileInputStream = new FileInputStream(file);
InputStreamReader inputStreamReader = new InputStreamReader(fileInputStream);
BufferedReader bufferedReader = new BufferedReader(inputStreamReader);
list = bufferedReader.lines().skip(1).map(line -> {
T obj = null;
String[] data = null;
if (type.equalsIgnoreCase("Student")) {
data = line.split(",");
ABC abc = new ABC();
abc.setName(data[0]);
abc.setRollNo(data[1]);
abc.setMobileNo(data[2]);
obj = (T)abc;
} else if (type.equalsIgnoreCase("Employee")) {
data = line.split("\\|");
XYZ xyz = new XYZ();
xyz.setName(data[0]); // the Name column holds text such as "Test1", so it is not parsed as an int
xyz.setCity(data[1]);
xyz.setEmployer(data[2]);
xyz.setDesignation(data[3]);
obj = (T)xyz;
}
return obj;
}).collect(Collectors.toList());
} catch (Exception e) {
e.printStackTrace(); // report the failure instead of swallowing it silently
}
return list;
}
csv files are as below:
i. csv file to populate ABC object:
Name,rollNo,mobileNo
Test1,1000,8888888888
Test2,1001,9999999990
ii. csv file to populate XYZ object
Name|City|Employer|Designation
Test1|City1|Emp1|SSE
Test2|City2|Emp2|
The issue is that data can be missing for any of the above columns, as shown in the second CSV file. In that case, I get an ArrayIndexOutOfBoundsException.
Can anyone tell me the best way to populate the object from the data in the string array?
Thanks in advance.
In addition to the other mistakes you made, which were pointed out to you in the comments, your actual problem is that line.split("\\|") is equivalent to line.split("\\|", 0), which discards trailing empty strings. Call line.split("\\|", -1) instead and it will work.
The problem appears to be that one or more of the last values on any given CSV line may be empty. In that case, you run into the fact that String.split(String) suppresses trailing empty strings.
Supposing that you can rely on all the fields in fact being present, even if empty, you can simply use the two-arg form of split():
data = line.split(",", -1);
You can find details in that method's API docs.
If you cannot be confident that the fields will be present at all, then you can force them to be by adding delimiters to the end of the input string:
data = (line + ",,").split(",", -1);
Since you only use the first few values, any extra trailing values introduced by the extra delimiters would be ignored.
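The two suggestions above can be combined into one small helper. This is a minimal sketch, assuming the number of expected columns is known in advance; the class name SplitLimitDemo and the method splitFixed are hypothetical. It uses the two-argument split with a limit of -1 and pads any missing trailing columns with empty strings.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Splits a delimited line and always returns exactly `columns` entries.
    // Missing trailing columns become empty strings; extra columns are dropped.
    static String[] splitFixed(String line, String delimiterRegex, int columns) {
        String[] parts = line.split(delimiterRegex, -1); // -1 keeps trailing empty strings
        String[] result = new String[columns];
        Arrays.fill(result, "");
        System.arraycopy(parts, 0, result, 0, Math.min(parts.length, columns));
        return result;
    }

    public static void main(String[] args) {
        String line = "Test2|City2|Emp2|";
        // default split drops the trailing empty field
        System.out.println(line.split("\\|").length);      // prints 3
        // limit -1 keeps it
        System.out.println(line.split("\\|", -1).length);  // prints 4
        // helper also copes with rows that are missing delimiters entirely
        System.out.println(Arrays.toString(splitFixed("Test2|City2", "\\|", 4)));
    }
}
```

With a helper like this, data[3] is always a valid index for the four-column file, so the designation can be read without a bounds check.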

Xstream tag creation problems with toXML(Object, Writer)

I am trying to create a simple XML document in string format, but Eclipse keeps giving me a "cannot convert from void to String" error for this code (I followed this relatively recent Stack Overflow post about how to add a tag for an XML version). Here is my example code:
final XStream xstream = new XStream();
final Person myPerson = new Person();
Writer writer = new StringWriter();
try {
writer.write("<?xml version=\"1.0\"?>");
} catch (final IOException e) {
e.printStackTrace();
System.exit(0); //just for testing sakes
}
final String xmlResponse = xstream.toXML(myPerson, writer); // the line that has the error in Eclipse
For the obvious answer of "Maybe you aren't actually making an object for myPerson," the code I have for Person is:
public class Person{
public final String name;
Person(){
name = "EmptyName";
}
}
I have searched the documentation and found nothing about this error that would help. However, the XStream source contains the method void toxml(Xstream, Object, Writer), which could potentially be the issue. Escaping the quotation marks in the string could also be the cause, but I do not have a way around that, and removing the quotation marks and escapes doesn't fix the issue.
Any help in solving this would be appreciated. Xstream version 1.4.10 in case that matters.
If you look at the documentation for toXML(Object, Writer), it returns void, meaning it has no return value, which is why you see that error in Eclipse.
Remove the assignment to String xmlResponse and simply write:
xstream.toXML( myPerson, writer );
Then you can assign the XML to xmlResponse using:
final String xmlResponse = ( (StringWriter)writer ).getBuffer().toString();
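The pattern in this answer does not depend on XStream: any method that returns void and writes into a Writer can be captured through a StringWriter. The sketch below uses a hypothetical stand-in method, writeXml, in place of the library call so that it runs without dependencies; the XML it produces is invented for the demonstration and is not what XStream would emit.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class WriterCaptureDemo {
    // Hypothetical stand-in for a void serializer such as toXML(Object, Writer).
    static void writeXml(String name, Writer out) throws IOException {
        out.write("<person><name>" + name + "</name></person>");
    }

    static String toXmlString(String name) {
        StringWriter writer = new StringWriter();
        writer.write("<?xml version=\"1.0\"?>"); // the declaration goes in first
        try {
            writeXml(name, writer);               // returns nothing; output lands in the writer
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return writer.toString();                 // read back everything written so far
    }

    public static void main(String[] args) {
        System.out.println(toXmlString("EmptyName"));
    }
}
```

Declaring the variable as StringWriter rather than Writer avoids the cast, and StringWriter.toString() returns the same text as getBuffer().toString().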

How to change some values in a .JSON file and then write it back while keeping the JSON formatting ? (Java)

The JSON example file consists of:
{
"1st_key": "value1",
"2nd_key": "value2",
"object_keys": {
"obj_1st": "value1",
"obj_2nd": "value2",
"obj_3rd": "value3",
}
}
I read the JSON file into a String with this StringBuilder method, in order to add the newlines into the string itself. So the String looks exactly like the JSON file above.
public String getJsonContent(String fileName) {
StringBuilder result = new StringBuilder("");
File file = new File(fileName);
try (Scanner scanner = new Scanner(file)) {
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
result.append(line).append("\n");
}
scanner.close();
} catch (IOException e) {
e.printStackTrace();
}
return result.toString();
}
Then I translate the JSON file into an Object using MongoDB API (with DBObject, BasicDBObject and util.JSON) and I call out the Object section I need to change, which is 'object_keys':
File jsonFile = new File("C:\\example.json");
String jsonString = getJsonContent(jsonFile.getAbsolutePath());
DBObject jsonObject = (DBObject)JSON.parse(jsonString);
BasicDBObject objectKeys = (BasicDBObject) jsonObject.get("object_keys");
Now I can write new values into the object using the put method like this:
objectKeys.put("obj_1st","NEW_VALUE1");
objectKeys.put("obj_2nd","NEW_VALUE2");
objectKeys.put("obj_3rd","NEW_VALUE3");
Note: the following part is not needed; see my answer below.
After I have changed the object, I need to write it back into the json file, so I need to translate the Object into a String. There are two methods to do this, either one works.
String newJSON = jsonObject.toString();
or
String newJSON = JSON.serialize(jsonObject);
Then I write the content back into the file using PrintWriter
PrintWriter writer = new PrintWriter("C:\\example.json");
writer.print(newJSON);
writer.close();
The problem I am facing now is that the String is written as a single line with no formatting whatsoever. Somewhere along the way it lost all the newlines. So it basically looks like this:
{"1st_key": "value1","2nd_key": "value2","object_keys": { "obj_1st": "NEW_VALUE1","obj_2nd": "NEW_VALUE2","obj_3rd": "NEW_VALUE3", }}
I need to write the JSON file back in the same format as shown in the beginning, keeping all the tabulation, spaces etc.
Is this possible somehow ?
Writing output formatted the way you describe is usually called pretty-printing (or beautifying) JSON. A quick Google search for that term found what I believe solves your problem.
Solution
You're going to have to use a json parser of some sort. I personally prefer org.json and would recommend it if you are manipulating the json data, but you may also like json-io which is really good for json serialization with no external dependencies.
With json-io, it's as simple as
String formattedJson = JsonWriter.formatJson(jsonObject.toString());
With org.json, you simply pass an int to the toString method.
Thanks Saraiva, I found a surprisingly simple solution by Googling around with the words 'pretty printing JSON' and used the Google GSON library. I downloaded the .jar and added it to my project in Eclipse.
These are the new imports I needed:
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
Since I already had the JSON Object (jsonObject) readily available from my previous code, I only needed to add two new lines:
Gson gson = new GsonBuilder().setPrettyPrinting().create();
String newJSON = gson.toJson(jsonObject);
Now when I use writer.print(newJSON); it will write the JSON in the right format, beautifully formatted and indented.
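For readers who want to see what a pretty printer does to the single-line string, here is a dependency-free sketch that re-indents compact JSON text character by character. It is an illustration only, not a replacement for Gson, org.json or json-io: it assumes the input is already valid JSON, it does not validate anything, and empty objects or arrays come out spread over two lines. The names JsonIndenter and indent are hypothetical.

```java
class JsonIndenter {
    // Re-indents compact JSON text; `width` is the number of spaces per nesting level.
    static String indent(String json, int width) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        boolean inString = false;
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                out.append(c);
                if (c == '\\' && i + 1 < json.length()) {
                    out.append(json.charAt(++i)); // copy the escaped character verbatim
                } else if (c == '"') {
                    inString = false;             // closing quote of the string
                }
            } else if (c == '"') {
                inString = true;
                out.append(c);
            } else if (c == '{' || c == '[') {
                out.append(c);
                newline(out, ++depth, width);
            } else if (c == '}' || c == ']') {
                newline(out, --depth, width);
                out.append(c);
            } else if (c == ',') {
                out.append(c);
                newline(out, depth, width);
            } else if (c == ':') {
                out.append(": ");
            } else if (!Character.isWhitespace(c)) {
                out.append(c);                    // numbers, true, false, null
            }
        }
        return out.toString();
    }

    private static void newline(StringBuilder out, int depth, int width) {
        out.append('\n');
        for (int i = 0; i < depth * width; i++) {
            out.append(' ');
        }
    }

    public static void main(String[] args) {
        String compact = "{\"1st_key\":\"value1\",\"object_keys\":{\"obj_1st\":\"NEW_VALUE1\"}}";
        System.out.println(indent(compact, 4));
    }
}
```

Characters inside string values, including commas and braces, are copied untouched, which is the part a naive search-and-replace would get wrong.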

Writing our own models in openNLP

If I run a command like this on the command line
./opennlp TokenNameFinder en-ner-person.bin "input.txt" "output.txt"
I get person names printed in output.txt, but I want to write my own models so that I can extract my own entities.
E.g.
what is the risk value on icm2500.
Delivery of prd_234 will be arrived late.
Watson is handling router_34.
If I pass in these lines, it should parse them and extract the product entities: icm2500, prd_234, router_34, etc. These are all products (we can save this information in a file and use it as a kind of lookup for the models or for OpenNLP).
Can anyone please tell me how to do this?
You'll need to train your own model by annotating some sentences in the opennlp format. For the example sentences you posted the format would look like this:
what is the risk value on <START:product> icm2500 <END>.
Delivery of <START:product> prd_234 <END> will be arrived late.
Watson is handling <START:product> router_34 <END>.
Make sure each sentence ends in a newline, and if a sentence itself contains newlines, escape them somehow.
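Since the question mentions keeping the known product names in a lookup file, the training file could be generated rather than annotated by hand. The following is a minimal sketch under that assumption; the class ProductAnnotator and the method annotate are hypothetical helpers, not part of OpenNLP, and the tag layout simply copies the three example lines above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProductAnnotator {
    // Wraps every whole-word occurrence of a known product name in <START:product> ... <END>.
    static String annotate(String sentence, List<String> products) {
        String result = sentence;
        for (String product : products) {
            // quote the name so characters such as '.' are matched literally
            Pattern pattern = Pattern.compile("\\b" + Pattern.quote(product) + "\\b");
            String tagged = "<START:product> " + product + " <END>";
            result = pattern.matcher(result).replaceAll(Matcher.quoteReplacement(tagged));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> products = Arrays.asList("icm2500", "prd_234", "router_34");
        System.out.println(annotate("what is the risk value on icm2500.", products));
        System.out.println(annotate("Delivery of prd_234 will be arrived late.", products));
        System.out.println(annotate("Watson is handling router_34.", products));
    }
}
```

A product literally named "product" would collide with the tag text, so a real lookup list may need an extra check.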
Once you make a file like this out of your data, then you can use the Java API to train the model like this
public static void main(String[] args) throws IOException {
Charset charset = Charset.forName("UTF-8");
ObjectStream<String> lineStream =
new PlainTextByLineStream(new FileInputStream("your file in the above format"), charset);
ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
TokenNameFinderModel model;
try {
model = NameFinderME.train("en", "person", sampleStream, TrainingParameters.defaultParams(),
null, Collections.<String, Object>emptyMap());
}
finally {
sampleStream.close();
}
// hypothetical output path: point this at wherever the trained model should be saved
File modelFile = new File("product-model.bin");
OutputStream modelOut = null;
try {
modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
model.serialize(modelOut);
} finally {
if (modelOut != null)
modelOut.close();
}
}
Now you can use the model with the name finder.
Because you may have a definitive, and possibly short, list of product names, you might consider a simple regex approach.
Here are the OpenNLP docs that cover the NameFinder a bit:
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training.tool
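The simple regex approach mentioned above could look like the sketch below. The pattern is an assumption inferred only from the three sample names (lowercase letters, an optional underscore, then digits); real product codes may need a different pattern or the lookup list instead. ProductExtractor and extract are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProductExtractor {
    // Assumed shape of a product code: letters, optional underscore, digits (e.g. prd_234)
    private static final Pattern PRODUCT = Pattern.compile("\\b[a-z]+_?\\d+\\b");

    // Returns every substring of the text that looks like a product code, in order.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = PRODUCT.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("what is the risk value on icm2500."));
        System.out.println(extract("Delivery of prd_234 will be arrived late."));
        System.out.println(extract("Watson is handling router_34."));
    }
}
```

This needs no training data, but it only finds names that follow the assumed pattern, whereas a trained model can use the surrounding words as context.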
