Escape inch symbol while splitting from comma - java

I have a CSV splitter that uses the following regex to split a string on commas.
String[] splitData = splitCSV.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
It works so far for strings like 123, "foo", "bar", "no, split, here", but when it encounters an inch sign (") as in the following, it cannot do the splitting.
"123, 1.0" xyz"
I need it to split into 123 and 1.0" xyz
Hope someone can provide a solution for this. Thank you.

A couple of points here:
You should be using an existing CSV processing library, not creating your own with a regex. There are many available for Java, see this question as a starting point. This is a solved problem; there's no reason to reinvent it.
The scenario you mention would be invalid* data. A quote should be escaped within a string, usually by using two quotes together. Having one unescaped quote makes the file invalid; and furthermore there is usually no reliable way to tell what the file "should" be once you have these sorts of errors. What to do about it:
If the file is within your control, correct it. Use a standard escape format for quotes within a string.
If the file is not within your control, you should handle errors separately rather than including this in your core processing. Either preprocess the file looking for errors, or use error handling available in a CSV library to do something with the lines that come back as having an incorrect format. If the errors are limited to a predictable issue that you know ahead of time, you might be able to correct them. But in most cases errors like this lead you to have to reject the lines.
*Technically there is no CSV standard, so anything goes. But this would be a data error in any reasonable format. And in the real world this almost always occurs because someone didn't think the file format through, not because they intentionally planned it this way.

What you have here is an unusual dialect of CSV.
Although there is no formalised standard for CSV, there are broadly two approaches to quotes:
Quotes are not special. That is: 7" single, 12" album is two items: 7" single and 12" album. In this dialect, items containing , are problematic.
Quotes are special. That is: "you, me","me you" is two items: you, me and me you. In this dialect, you can put quotes around an entry in order to have a , within an item. However, it makes items containing " problematic, as you have found.
The typical answer to the " problem in the second approach, is to escape quotes. So the item 7" single would appear in the CSV as "7\" single". This of course means that \ becomes a problem, but that's easily solved the same way. AC\DC 7" single appears in the CSV as "AC\\DC 7\" single".
If you can adopt one of these conventional approaches, then do so. Then you can either use an existing CSV library, or roll your own. Although a regex can consume these formats, my opinion is that it's not the clearest way to write code to consume CSV: I've found that a more explicit state machine (e.g. a switch (state) statement) is nice and clear.
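As a rough illustration of the state-machine idea, here is a minimal sketch for the backslash-escaped dialect described above. The class name EscapedCsvLineParser and its three states are made up for this example. It assumes one record per line, does not trim whitespace around items, and silently accepts an unterminated quote, so treat it as a starting point rather than a finished parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a single line of backslash-escaped CSV.
final class EscapedCsvLineParser {
    private enum State { UNQUOTED, QUOTED, ESCAPED }

    static List<String> parse(String line) {
        List<String> items = new ArrayList<>();
        StringBuilder item = new StringBuilder();
        State state = State.UNQUOTED;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            switch (state) {
                case UNQUOTED:
                    if (c == ',') {             // delimiter: finish the current item
                        items.add(item.toString());
                        item.setLength(0);
                    } else if (c == '"') {      // opening quote: metacharacter, not data
                        state = State.QUOTED;
                    } else {
                        item.append(c);
                    }
                    break;
                case QUOTED:
                    if (c == '\\') {            // next character is taken literally
                        state = State.ESCAPED;
                    } else if (c == '"') {      // closing quote
                        state = State.UNQUOTED;
                    } else {
                        item.append(c);         // commas are ordinary data in here
                    }
                    break;
                case ESCAPED:
                    item.append(c);
                    state = State.QUOTED;
                    break;
            }
        }
        items.add(item.toString());             // last item has no trailing comma
        return items;
    }
}
```

With this sketch, the line "AC\\DC 7\" single","you, me",plain yields three items: AC\DC 7" single, then you, me, then plain.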
If you can't change your input format, the puzzle you have to solve is, when you encounter a ", is it a metacharacter (part of a pair of quotes surrounding an item) or is it a real character that's part of the item?
As owner of the format, it's up to you to decide what the rule is. Perhaps a " should only be considered a metacharacter if it's next to a ,. But even that causes problems if you allow a mixture of quoted and unquoted items:
"A Town Called Malice", The Jam, 7", £6.99
So, you must come up with your own rules, that work in your domain, and write explicit code to handle that situation. One approach is to pre-process the input into canonical CSV so that it's again suitable for a conventional CSV parser.

Related

If data contains commas, how to store it in csv?

If the task is to create a csv file out of some data where commas may be present, is there a way to do it without later confusing which comma is a delimiter and which comma is part of a value?
Obviously, we can use a different delimiter, replace all occurrences, or replace the original comma with something else, but for the purpose of this question let's say that modifying the original data is not an option and a comma is the only delimiter allowed.
How would you approach something like this? Would it be easier to create the xls instead? Can you recommend any java libraries that handle this well?
A true CSV reader should be able to handle this; the values should be in quotes, e.g.:
one,two,"a, b, c",four
...per item #6 in Section 2 of the RFC (RFC 4180).
While there's no single CSV standard, the usual convention is to surround entries containing commas in double quotes (i.e. ").
Preempting the next question: what to do if your data contains a double quote? In this case each double quote is usually replaced with a pair of double quotes.
While I hate to cite wikipedia as a source, they do have a pretty good roundup of basic rules and examples for CSV formatting.
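To make the quoting convention concrete, here is a minimal sketch of a writer that wraps a value in double quotes when it contains a comma, a double quote, or a line break, and doubles any embedded double quotes. CsvFieldWriter, quote and toLine are names invented for this example; a real CSV library will cover more cases.

```java
// Hypothetical helper showing the "wrap in quotes, double embedded quotes" convention.
final class CsvFieldWriter {
    // Quotes a single value only when it needs it.
    static String quote(String value) {
        boolean needsQuotes = value.indexOf(',') >= 0
                || value.indexOf('"') >= 0
                || value.indexOf('\n') >= 0
                || value.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return value;
        }
        // Each embedded double quote becomes a pair of double quotes.
        return '"' + value.replace("\"", "\"\"") + '"';
    }

    // Joins the values into one CSV line using a comma as the only delimiter.
    static String toLine(String... values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(quote(values[i]));
        }
        return sb.toString();
    }
}
```

Calling CsvFieldWriter.toLine("one", "two", "a, b, c", "four") produces the example line shown above, with only the third value quoted.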
I would either use a different delimiter or use a library like Apache POI.
I think the best way is to use Apache POI: http://poi.apache.org/
You can easily create XLS documents without much hassle.
However, if you really need CSV and not XLS, you can surround the value with quotes. This should also solve the problem.
Usually, you work with , as separator and ' as quote. So your values would look like:
foo, 'bar, baz', iik, aje
the task is to create a csv file
Actually an impossible task, since there is no such thing as "a CSV" file. Different Microsoft products have used different (subtly different, I grant) formats and named them all "CSV". As most spreadsheets can read delimiter-separated value (DSV) files, you might be better off writing one of those.

In Java (Pig) Regex, how could I do the following?

I have data coming in a txt file delimited by pipes. The unfortunate thing is 2 fields can have multiple values. To separate these multiples, the sender used pipes again, but put quotes around it. My regex worked for months until a certain rare situation...
Regex currently:
([^\|]*)\|"?([^"]*)"?\|([^\|]*)\|"?([^"]*)"?
And it worked for the following situation which happens most of the time:
abc|"part1|part2"|abc|"tool1|tool2"
But this case is where the ([^"]*) jumps ahead and takes all from the blank to the end of the quotes:
abc||abc|"tool1|tool2"
So I realize I must account for when there is a pipe next instead of a quote.
Just not sure how.............
P.S. For those PIG people that might be looking at this, I removed a backslash from each escape, to make it look more like Java, but in PIG you need 2, fyi.
In your expression you need to specify that the part between |s can be either quoted or not quoted. You can do it as follows:
(("[^"]*")|((?!")[^|]*))
Now you can repeat this part several times with |s in between, as you need.
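Here is one way the repeated sub-expression might look in plain Java. This is a sketch only: it assumes exactly four fields per line, as in the sample data, uses one capturing group per field instead of the nested groups above, and the names PipeFieldMatcher, FIELD and fields are invented for the example. In Pig the backslashes would need doubling again, as noted in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: matches a four-field, pipe-delimited line with optional quoted fields.
final class PipeFieldMatcher {
    // One field: either a quoted run, or an unquoted run that does not start with a quote.
    private static final String FIELD = "(\"[^\"]*\"|(?!\")[^|]*)";

    private static final Pattern LINE =
            Pattern.compile(FIELD + "\\|" + FIELD + "\\|" + FIELD + "\\|" + FIELD);

    static String[] fields(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected line: " + line);
        }
        String[] out = new String[4];
        for (int i = 0; i < 4; i++) {
            out[i] = stripQuotes(m.group(i + 1));
        }
        return out;
    }

    // Removes one pair of surrounding quotes, if present.
    private static String stripQuotes(String s) {
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            return s.substring(1, s.length() - 1);
        }
        return s;
    }
}
```

For the problem line abc||abc|"tool1|tool2" this gives abc, an empty string, abc and tool1|tool2, so the empty second field no longer swallows the rest of the line.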

Need some ideas on how to accomplish this in Java (parsing strings)

Sorry I couldn't think of a better title, but thanks for reading!
My ultimate goal is to read a .java file, parse it, and pull out every identifier. Then store them all in a list. Two preconditions are there are no comments in the file, and all identifiers are composed of letters only.
Right now I can read the file, parse it by spaces, and store everything in a list. If anything in the list is a java reserved word, it is removed. Also, I remove any loose symbols that are not attached to anything (brackets and arithmetic symbols).
Now I am left with a bunch of weird strings, but at least they have no spaces in them. I know I am going to have to re-parse everything with a . delimiter in order to pull out identifiers like System.out.print, but what about strings like this example:
Logger.getLogger(MyHash.class.getName()).log(Level.SEVERE,
After re-parsing by . I will be left with more crazy strings like:
getLogger(MyHash
getName())
log(Level
SEVERE,
How am I going to be able to pull out all the identifiers while leaving out all the trash? Just keep re-parsing by every symbol that could exist in java code? That seems rather lame and time consuming. I am not even sure if it would work completely. So, can you suggest a better way of doing this?
There are several solutions that you can use, other than hacking your own parser:
Use an existing parser, such as this one.
Use BCEL to read bytecode, which includes all fields and variables.
Hack into the compiler or run-time, using annotation processing or mirrors - I'm not sure you can find all identifiers this way, but fields and parameters for sure.
I wouldn't separate the entire file at once according to whitespace. Instead, I would scan the file letter-by-letter, saving every character in a buffer until I'm sure an identifier has been reached.
In pseudo-code:
clean buffer
for each letter l in file:
    if l is '
        toggle "character mode"
    if l is "
        toggle "string mode"
    if l is a letter AND "character mode" is off AND "string mode" is off
        add l to end of buffer
    else
        if buffer is NOT a keyword or a literal
            add buffer to list of identifiers
        clean buffer
Notice some lines here hide further complexity - for example, to check if the buffer is a literal you need to check for all of true, false, and null.
In addition, there are more bugs in the pseudo-code - it will also identify things like the e and L parts of literals (e in floating-point literals, L in long literals). I suggest adding additional "modes" to take care of them, but it's a bit tricky.
Also there are a few more things if you want to make sure it's accurate - for example you have to make sure you work with unicode. I would strongly recommend investigating the lexical structure of the language, so you won't miss anything.
EDIT:
This solution can easily be extended to deal with identifiers with numbers, as well as with comments.
Small bug above - you need to handle \" differently than ", same with \' and '.
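For what it's worth, the pseudo-code above could be sketched in Java roughly as follows, including the backslash handling from the edit. It relies on the question's preconditions (no comments, identifiers made of letters only), so it still picks up the e and L of numeric literals. The class name IdentifierScanner and the SKIP set are invented for this example, and the word list (reserved words up to Java 8 plus the three literals) is an assumption you may need to adjust.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: collects letter-only identifiers outside string and char literals.
final class IdentifierScanner {
    // Reserved words (as of Java 8) plus the literals true, false and null.
    private static final Set<String> SKIP = new HashSet<>(Arrays.asList(
            "abstract", "assert", "boolean", "break", "byte", "case", "catch", "char",
            "class", "const", "continue", "default", "do", "double", "else", "enum",
            "extends", "final", "finally", "float", "for", "goto", "if", "implements",
            "import", "instanceof", "int", "interface", "long", "native", "new",
            "package", "private", "protected", "public", "return", "short", "static",
            "strictfp", "super", "switch", "synchronized", "this", "throw", "throws",
            "transient", "try", "void", "volatile", "while", "true", "false", "null"));

    static List<String> scan(String source) {
        List<String> identifiers = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        boolean inChar = false;
        boolean inString = false;
        boolean escaped = false;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (escaped) {                      // character after a backslash: skip it
                escaped = false;
                continue;
            }
            if ((inChar || inString) && c == '\\') {
                escaped = true;                 // so \" and \' do not toggle a mode
                continue;
            }
            if (c == '\'' && !inString) {
                inChar = !inChar;               // toggle "character mode"
            } else if (c == '"' && !inChar) {
                inString = !inString;           // toggle "string mode"
            }
            if (Character.isLetter(c) && !inChar && !inString) {
                buffer.append(c);
            } else {
                flush(buffer, identifiers);
            }
        }
        flush(buffer, identifiers);             // the file may end in an identifier
        return identifiers;
    }

    // Records the buffered word unless it is a keyword or literal, then clears the buffer.
    private static void flush(StringBuilder buffer, List<String> identifiers) {
        if (buffer.length() > 0) {
            String word = buffer.toString();
            if (!SKIP.contains(word)) {
                identifiers.add(word);
            }
            buffer.setLength(0);
        }
    }
}
```

Duplicates are kept in the order they appear; use a LinkedHashSet instead of the list if each identifier should be recorded once.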
Wow, ok. Parsing is hard -- really hard -- to do right. Rolling your own java parser is going to be incredibly difficult to do right. You'll find there are a lot of edge cases you're just not prepared for. To really do it right, and handle all the edge cases, you'll need to write a real parser. A real parser is composed of a number of things:
A lexical analyzer to break the input up into logical chunks
A grammar to determine how to interpret the aforementioned chunks
The actual "parser" which is generated from the grammar using a tool like ANTLR
A symbol table to store identifiers in
An abstract syntax tree to represent the code you've parsed
Once you have all that, you can have a real parser. Of course you could skip the abstract syntax tree, but you need pretty much everything else. That leaves you with writing about 1/3 of a compiler. If you truly want to complete this project yourself, you should see if you can find an example for ANTLR which contains a preexisting java grammar definition. That'll get you most of the way there, and then you'll need to use ANTLR to fill in your symbol table.
Alternately, you could go with the clever solutions suggested by Little Bobby Tables (awesome name, btw Bobby).

Parsing of data structure in a plain text file

How would you parse in Java a structure, similar to this
\\Header (name)\\\
1JohnRide 2MarySwanson
1 password1
2 password2
\\\1 block of data name\\\
1.ABCD
2.FEGH
3.ZEY
\\\2-nd block of data name\\\
1. 123232aDDF dkfjd ksksd
2. dfdfsf dkfjd
....
etc
Suppose, it comes from a text buffer (plain file).
Each line of text is "\n"-delimited. Space is used between the words.
The structure is more or less defined. There may sometimes be ambiguity, though: the number of fields in each line of information may differ, a block of data may sometimes be missing, and the number of lines in each block may vary as well.
The question is how to do it most effectively?
First solution that comes to my head is to use regular expressions.
But are there other solutions? Problem-oriented? Maybe some java library already written?
Check out UTAH: https://github.com/sonalake/utah-parser
It's a tool that's pretty good at parsing this kind of semi structured text
As no one recommended any library, my suggestion would be : use REGEX.
From what you have posted it looks like the data is delimited by whitespace. One idea is to use a Scanner or a StringTokenizer to get one token at a time. You can then check the first char of a token to see if it is a digit (in which case the part of the token after the digit(s) will be the data, if there is any).
This sounds like a homework problem so I'm going to try to answer it in such a way to help guide you (not give the final solution).
First, you need to consider each object of data you're reading. Is it a number then a text field? A number then 3 text fields? Variable numbers and text fields?
After that you need to determine what you're going to use to delimit each field and each object. For example, in many files you'll see something like a semi-colon between the fields and a new line for the end of the object. From what you said it sounds like yours is different.
If an object can go across multiple lines you'll need to bear that in mind (don't stop partway through an object).
Hopefully that helps. If you research this and you're still having problems post the code you've got so far and some sample data and I'll help you to solve your problems (I'll teach you to fish....not give you fish :-) ).
If the fields are fixed length, you could use a DataInputStream to read your file. Or, since your format is line-based, you could use a BufferedReader to read lines and write yourself a state machine which knows what kind of line to expect next, given what it's already seen. Once you have each line as a string, then you just need to split the data appropriately.
E.g., the password can be gotten from your password line like this:
final int pos = line.indexOf(' ');
String passwd = line.substring(pos+1, line.length());
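A small sketch of the BufferedReader-plus-state-machine idea might look like this. It assumes that any line both starting and ending with a backslash is a block header, that everything up to the next header belongs to that block, and that lines before the first header can be ignored; none of that is confirmed by the question. BlockFileParser and parse are names made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups the lines of the file under the header that precedes them.
final class BlockFileParser {
    static Map<String, List<String>> parse(BufferedReader reader) throws IOException {
        Map<String, List<String>> blocks = new LinkedHashMap<>();
        List<String> current = null;            // state: the block being filled, if any
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;                       // skip blank lines
            }
            if (trimmed.startsWith("\\") && trimmed.endsWith("\\")) {
                // Header line: strip the backslashes and start a new block.
                String name = trimmed.replace("\\", "").trim();
                current = new ArrayList<>();
                blocks.put(name, current);
            } else if (current != null) {
                current.add(trimmed);           // data line for the current block
            }
        }
        return blocks;
    }
}
```

Each data line can then be split further, for example with the indexOf/substring approach shown above for the password lines.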

Should I use java.text.MessageFormat for localised messages without placeholders?

We are localising the user-interface text for a web application that runs on Java 5, and have a dilemma about how we output messages that are defined in properties files - the kind used by java.util.Properties.
Some messages include a placeholder that will be filled using java.text.MessageFormat. For example:
search.summary = Your search for {0} found {1} items.
MessageFormat is annoying, because a single quote is a special character, despite being common in English text. You have to type two for a literal single quote:
warning.item = This item''s {0} is not valid.
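To show what goes wrong when the quote is not doubled, here is a small sketch comparing the two patterns. The class name MessageFormatQuotes and the argument "price" are just for illustration.

```java
import java.text.MessageFormat;

// Hypothetical demo of how MessageFormat treats single quotes in a pattern.
final class MessageFormatQuotes {
    static String format(String pattern, Object... args) {
        return MessageFormat.format(pattern, args);
    }

    public static void main(String[] args) {
        // Doubled quote: one literal apostrophe, and the placeholder is filled.
        System.out.println(format("This item''s {0} is not valid.", "price"));
        // prints: This item's price is not valid.

        // Single quote: it starts a quoted section, so the apostrophe disappears
        // and {0} is output literally instead of being replaced.
        System.out.println(format("This item's {0} is not valid.", "price"));
        // prints: This items {0} is not valid.
    }
}
```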
However, three-quarters of the application's 1000 or so messages do not include a placeholder. This means that we can output them directly, avoiding MessageFormat, and leave the single quotes alone:
help.url = The web page's URL
Question: should we use MessageFormat for all messages, for consistent syntax, or avoid MessageFormat where we can, so most messages do not need escaping?
There are clearly pros and cons either way.
Note that the API documentation for MessageFormat acknowledges the problem and suggests a non-solution:
The rules for using quotes within message format patterns unfortunately have shown to be somewhat confusing. In particular, it isn't always obvious to localizers whether single quotes need to be doubled or not. Make sure to inform localizers about the rules, and tell them (for example, by using comments in resource bundle source files) which strings will be processed by MessageFormat.
Just write your own implementation of MessageFormat without this annoying feature. You may look at the code of SLF4J Logger.
They have their own version of a message formatter, which can be used as follows:
logger.debug("Temperature set to {}. Old temperature was {}.", t, oldT);
Empty placeholders could be used with default ordering and numbered for some localization cases where different languages do permutations of words or parts of sentences.
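A home-grown formatter along those lines could be as small as the sketch below. This is not SLF4J's implementation, just an illustration of the idea: it fills each {} in order, treats single quotes as ordinary text, and supports neither numbered placeholders nor escaping of a literal {}. The name BraceFormatter is invented for the example.

```java
// Hypothetical minimal formatter: replaces each "{}" with the next argument, in order.
final class BraceFormatter {
    static String format(String pattern, Object... args) {
        StringBuilder out = new StringBuilder();
        int from = 0;
        int argIndex = 0;
        while (argIndex < args.length) {
            int at = pattern.indexOf("{}", from);
            if (at < 0) {
                break;                          // more arguments than placeholders
            }
            out.append(pattern, from, at).append(args[argIndex++]);
            from = at + 2;                      // continue after the "{}"
        }
        out.append(pattern, from, pattern.length());
        return out.toString();
    }
}
```

With this, BraceFormatter.format("This item's {} is not valid.", "price") returns the text with the apostrophe intact, so messages with and without placeholders can share one syntax.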
In the end we decided to side-step the single quote problem by always using ‘curly’ quotes:
warning.item = This item\u2019s {0} is not valid.
Use the ` character instead of ' for quoting. We use it all the time without problems.
Use MessageFormat only when you need it; otherwise it only bloats the code and adds no value.
In my opinion, consistency is important for this sort of thing. Properties files and MessageFormat already have lots of limitations. If you find these troubling you could "compile" your properties files to generate properly-formed ones. But I'd say go with using MessageFormat everywhere. This way, as you maintain the code you don't need to worry about which strings are formatted and which aren't. It becomes simpler to deal with, since you can hand off message processing to a library and not worry about the details at a high level.
Another alternative... When loading the properties file, just wrap the InputStream in a FilterInputStream that doubles up every single quote.
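A sketch of such a filter is below. It assumes the file is read with Properties.load(InputStream), that no message in the file already doubles its quotes (those would end up with four), and that every loaded message is then passed through MessageFormat. QuoteDoublingInputStream is a name made up for this example.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical filter: emits every single quote (') twice.
final class QuoteDoublingInputStream extends FilterInputStream {
    private boolean pendingQuote = false;       // true when the second quote is still owed

    QuoteDoublingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        if (pendingQuote) {
            pendingQuote = false;
            return '\'';
        }
        int b = super.read();
        if (b == '\'') {
            pendingQuote = true;                // repeat it on the next call
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        while (n < len) {
            int b = read();                     // byte-by-byte keeps the logic in one place
            if (b == -1) {
                break;
            }
            buf[off + n++] = (byte) b;
        }
        return n == 0 ? -1 : n;
    }

    @Override
    public boolean markSupported() {
        return false;                           // mark/reset would lose the pending quote
    }
}
```

Usage would be along the lines of properties.load(new QuoteDoublingInputStream(rawStream)), after which help.url = The web page's URL is stored with a doubled quote and formats back to a single apostrophe.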
