As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
I have a CSV file that has some quoting issues:
"Albanese Confectionery","157137","ALBANESE BULK ASST. MINI WILD FRUIT WORMS 2" 4/5LB",9,90,0,0,0,.53,"21",50137,"3441851137","5 lb",1,4,4,$6.7,$6.7,$26.8
SuperCSV is choking on these fruit worms (pun intended). I know that the 2" should probably be 2"", but it's not. LibreOffice actually parses this correctly (which surprises me). I was thinking of just writing my own little parser but other rows have commas inside the string:
"Albanese Confectionery","157230","ALBANESE BULK JET FIGHTERS,ASSORTED 4/5 B",9,90,0,0,0,.53,"21",50230,"3441851230","5 lb",1,4,4,$6.7,$6.7,$26.8
Does anyone know of a Java library that will handle crazy stuff like this? Or should I try all the available ones? Or am I better off hacking this out myself?
The right solution is to find the person who generated the data and beat them over the head with a keyboard until they fix the problem on their end.
Once you've exhausted that route, you could try some of the other CSV parsers on the market; I've used OpenCSV with success in the past.
Even if OpenCSV won't solve the problem out of the box, the code is fairly easy to read and available under an Apache license, so it might be possible to modify the algorithm to work with your wonky data, and probably easier than starting from scratch.
Surprising even myself here, but I think I would hack it myself. I mean, you only need to read the lines and generate the tokens by splitting on quotes/commas, whichever you want. That way you can adjust the logic the way it suits you. It's not very hard. The file seems broken enough that working around it with an existing solution would be more work.
One point though: if LibreOffice already parses it correctly, couldn't you just re-save the file from there, generating a more reasonable file? However, if you think LibreOffice might be guessing, just write the tokenizer yourself.
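As a rough illustration of the "hack it yourself" route, here is a minimal sketch of a lenient line parser. The class name LenientCsvLine and the rule it applies (a quote only closes a field when it is followed by a comma or the end of the line) are my own assumptions, not something any of the libraries mentioned here do, and it only handles one line at a time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits one CSV line, tolerating unescaped quotes inside quoted fields.
public class LenientCsvLine {

    public static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            // Treat the end of the line like a delimiter when looking one character ahead.
            char next = (i + 1 == line.length()) ? ',' : line.charAt(i + 1);
            if (inQuotes) {
                if (c == '"' && next == '"') {
                    cur.append('"');   // properly escaped quote ("")
                    i++;
                } else if (c == '"' && next == ',') {
                    inQuotes = false;  // closing quote: followed by a delimiter or end of line
                } else {
                    cur.append(c);     // ordinary character, embedded comma, or stray quote
                }
            } else if (c == '"' && cur.length() == 0) {
                inQuotes = true;       // opening quote at the start of a field
            } else if (c == ',') {
                fields.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());
        return fields;
    }
}
```

This copes with both sample rows from the question (the stray 2" and the comma inside the quoted description), but it is still guessing: a stray quote that happens to sit right before a comma would end the field early.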
+1 for the 'choking on fruit worms' pun - I nearly choked on my coffee reading that :)
If you really can't get that CSV fixed, then you could just supply your own Tokenizer (Super CSV is very flexible like that!).
You'd normally write your own readColumns() implementation, but it's quicker to extend the default Tokenizer and override the readLine() method to intercept the String (and fix the unescaped quotes) before it's tokenized.
I've made an assumption here that any quotes not next to a delimiter or at the start/end of the line should be escaped. It's far from perfect, but it works for your sample input. You can implement this however you like - it was too early in the morning for me to use a regex :)
This way you don't have to modify Super CSV at all (it just plugs in), so you get all of the other features like cell processors and bean mapping as well.
package org.supercsv;

import java.io.IOException;
import java.io.Reader;

import org.supercsv.io.Tokenizer;
import org.supercsv.prefs.CsvPreference;

public class FruitWormTokenizer extends Tokenizer {

    public FruitWormTokenizer(Reader reader, CsvPreference preferences) {
        super(reader, preferences);
    }

    @Override
    protected String readLine() throws IOException {
        final String line = super.readLine();
        if (line == null) {
            return null;
        }

        final char quote = (char) getPreferences().getQuoteChar();
        final char delimiter = (char) getPreferences().getDelimiterChar();

        // escape all quotes not next to a delimiter (or start/end of line)
        final StringBuilder b = new StringBuilder(line);
        for (int i = b.length() - 1; i >= 0; i--) {
            if (quote == b.charAt(i)) {
                final boolean validCharBefore = i - 1 < 0
                        || b.charAt(i - 1) == delimiter;
                final boolean validCharAfter = i + 1 == b.length()
                        || b.charAt(i + 1) == delimiter;
                if (!(validCharBefore || validCharAfter)) {
                    // escape that quote!
                    b.insert(i, quote);
                }
            }
        }
        return b.toString();
    }
}
You can just supply this Tokenizer to the constructor of your CsvReader.
Closed 9 years ago.
Alright so I have to write a LaTeXParser in java, I'm going to be taking in a file much like this one below and reading it for validity and errors. Now I am not looking for help really or code but more of a conceptual understanding, how to attack the problem. I am going to be using Stacks to store the blocks and make sure everything is sorted properly. So my question to you is, how to handle it?
For example, Should I begin by getting all the "\begin{_}" and putting them in a stack and then pop them with their corresponding "\end{}"? I was wondering using a String based case switch system that, when particular strings were found, would perform the actions necessary based on that string, on my stack.
Or maybe 2 Stacks that cancel each other out: all the \begins in one and the \ends in another, and once their {__} match up, I start popping them out and whatnot.
So yeah, just wondering what the bright minds of SOF had to say about how I should be thinking about this problem and how to deal with it. Thanks for your input!
\documentclass{article}
\usepackage{amsmath, amssymb, amsthm}
\begin{document}
{\Large \begin{center} Homework Problems \end{center}}\begin{itemize}\item\end{itemize}
\begin{enumerate}
\item Prove: For all sets $A$ and $B$, $(A - B) \cup
(A \cap B) = A$.
\begin{proof}
\begin{align}
& (A - B) \cup (A \cap B) && \\
& = (A \cap B^c) \cup (A \cap B) && \text{by
Alternate Definition of Set Difference} \\
& = A \cap (B^c \cup B) && \text{by Distributive Law} \\
& = A \cap (B \cup B^c) && \text{by Commutative Law} \\
& = A \cap U && \text{by Union with the Complement Law} \\
& = A && \text{by Intersection with $U$ Law}
\end{align}
\end{proof}
\item If $n = 4k + 3$, does 8 divide $n^2 - 1$?
\begin{proof}
Let $n = 4k + 3$ for some integer $k$. Then
\begin{align}
n^2 - 1 & = (4k + 3)^2 - 1 \\
& = 16k^2 + 24k + 9 - 1 \\
& = 16k^2 + 24k + 8 \\
& = 8(2k^2 + 3k + 1) \text{,}
\end{align}
which is certainly divisible by 8.
\end{proof}
\end{enumerate}
\end{document}
EDIT: Lol I think everyone is overthinking this wayyyyyy too much, I am not looking for anything that recognizes and compiles code, or actually performs the actions of the LATEX language via this file. I simply want to be able to write up a text file, like the one above, have my program open it, read it, and say "hey! this would work because every block that begins also ends!" Or "hey theres an error on line 10!" Nothing more, nothing less. Just a simple validator/error checker that uses Stacks to contain the blocks and then pops them when the end is found and so on. Again I AM NOT LOOKING FOR CODE OR HANDOUTS! All I would like is some good ideas and methods for attacking this problem, maybe some pseudo code structuring at best!
For example...I was thinking of having this all contained in 1 class, in my main, and making a Stack that would hold all of the Strings in the file that were coded like such " \begin{_} " then when I found the corresponding " \end{} " just popping it out and check it off a list or something. If every beginning block is popped by the end of my run through the file, I have a valid .txt file.
Trying to roll your own parser is a big task. There are a number of Parser Generators that take some of the busy work out of the task. ANTLR is a popular one for java.
One of the first things you're going to need to do is find out what kind of language LaTeX is. More complicated languages like C++ can't be parsed with the same kinds of parsers that you can use for a more regular language like Forth.
The following Jules Bean post leads me to think that latex is harder to parse than most programming languages.
I'm pretty sure it's not an LALR language. It's context dependent and is capable of modifying its own syntax. I think it is probably technically impossible to parse without actually executing the macros. I.e. you need a TeX state machine to parse it in full generality.
'well-behaved' LaTeX is probably LALR, though.
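For the much narrower check the question describes (every \begin{...} has a matching \end{...}), a single stack is enough. The following is only a sketch under my own assumptions: the class name BeginEndChecker is made up, environments are found with a simple regex, and comments, verbatim blocks and macro-defined environments are ignored entirely.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical validator: checks only that \begin{x} and \end{x} are properly nested.
public class BeginEndChecker {

    // Matches \begin{name} or \end{name}; group 1 is the keyword, group 2 the name.
    private static final Pattern TAG = Pattern.compile("\\\\(begin|end)\\{([^}]*)\\}");

    /** Returns null if everything is balanced, otherwise a message naming the offending line. */
    public static String check(String text) {
        Deque<String[]> open = new ArrayDeque<>(); // each entry: {environment name, line number}
        String[] lines = text.split("\n", -1);
        for (int n = 0; n < lines.length; n++) {
            Matcher m = TAG.matcher(lines[n]);
            while (m.find()) {
                String name = m.group(2);
                if (m.group(1).equals("begin")) {
                    open.push(new String[] {name, String.valueOf(n + 1)});
                } else if (open.isEmpty()) {
                    return "line " + (n + 1) + ": \\end{" + name + "} without a \\begin";
                } else {
                    String[] top = open.pop();
                    if (!top[0].equals(name)) {
                        return "line " + (n + 1) + ": expected \\end{" + top[0]
                                + "} but found \\end{" + name + "}";
                    }
                }
            }
        }
        if (!open.isEmpty()) {
            String[] top = open.peek();
            return "line " + top[1] + ": \\begin{" + top[0] + "} is never closed";
        }
        return null;
    }
}
```

Pushing on \begin and popping on \end in document order is what makes one stack sufficient; two separate stacks would lose the relative order of opens and closes.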
Closed 10 years ago.
I need a suggestion for the right approach to applying conditions in Java.
I have 100 conditions based on which I have to change value of a String variable that would be displayed to the user.
an example condition: a<5 && (b>0 && c>8) && d>9 || x!=4
More conditions are there but variables are the same more or less.
I am doing this right now:
if(condition1)
else if(condition2)
else if(condition3)
...
A switch case alternative would obviously be there nested within if-else's i.e.
if(condition1)
switch(x)
{
case y:
blah-blah
}
else if(condition2)
switch(x)
{
case y:
blah-blah
}
else if(condition3)
...
But I am looking for a more elegant solution, such as using an interface with polymorphic support. What could I do to avoid all these lines of code, or what would be the right approach?
---Edit---
I actually require this on an Android device, but it's more of a Java construct here.
This is a small snapshot of the conditions I have. More will be added if a few pass/fail, which would obviously require more if-else's, with or without nesting. In that case, would the processing become slow?
For now I store the messages in a separate class as static String variables, so when a condition is true I pick the matching static variable from that class and display it. Is that a reasonable way to store the resulting messages?
Depending on the number of conditional inputs, you might be able to use a look-up table, or even a HashMap, by encoding all inputs or even some relatively simple complex conditions in a single value:
int key = 0;
key |= a?(1):0;
key |= b?(1<<1):0;
key |= (c.size() > 1)?(1<<2):0;
...
String result = table[key]; // Or result = map.get(key);
This paradigm has the added advantage of constant time (O(1)) complexity, which may be important in some occasions. Depending on the complexity of the conditions, you might even have fewer branches in the code-path on average, as opposed to full-blown if-then-else spaghetti code, which might lead to performance improvements.
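To make the encoding concrete, here is a minimal sketch. The three comparisons are borrowed from the example condition in the question; the class name ConditionTable and the messages are invented purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example: each comparison contributes one bit to the lookup key.
public class ConditionTable {

    private static final Map<Integer, String> MESSAGES = new HashMap<>();
    static {
        MESSAGES.put(0b011, "a is small and b is positive"); // bits 0 and 1 set
        MESSAGES.put(0b100, "only c is large");              // bit 2 set
    }

    public static int key(int a, int b, int c) {
        int key = 0;
        key |= (a < 5) ? 1 : 0;        // bit 0
        key |= (b > 0) ? (1 << 1) : 0; // bit 1
        key |= (c > 8) ? (1 << 2) : 0; // bit 2
        return key;
    }

    public static String message(int a, int b, int c) {
        String result = MESSAGES.get(key(a, b, c));
        return result != null ? result : "default message";
    }
}
```

With n comparisons there are 2^n possible keys, so a plain array works well for a handful of inputs while a map is the better fit when only a few combinations carry a message.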
We might be able to help you more if you added more context to your question. Where are the condition inputs coming from? What are they like?
And the more important question: What is the actual problem that you are trying to solve?
There are a lot of possibilities to this. Without knowing much about your domain, I would create something like (you can think of better names :P)
public interface UserFriendlyMessageBuilder {
boolean meetCondition(FooObjectWithArguments args);
String transform(String rawMessage);
}
In this way, you can create a Set of UserFriendlyMessageBuilder and just iterate through them for the first that meets the condition to transform your raw message.
public class MessageProcessor {
private final Set<UserFriendlyMessageBuilder> messageBuilders;
public MessageProcessor(Set<UserFriendlyMessageBuilder> messageBuilders) {
this.messageBuilders = messageBuilders;
}
public String get(FooObjectWithArguments args, String rawMsg) {
for (UserFriendlyMessageBuilder msgBuilder : messageBuilders) {
if (msgBuilder.meetCondition(args)) {
return msgBuilder.transform(rawMsg);
}
}
return rawMsg;
}
}
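One detail worth noting: "the first that meets the condition" only means something if the collection has a defined iteration order, which a plain HashSet does not guarantee. Below is a self-contained variation on the same idea that keeps the rules in a List. The class OrderedRules and its method names are hypothetical, and it uses Java 8 lambdas, which may not suit every Android setup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical rule list: rules are evaluated in the order they were added.
public class OrderedRules<T> {

    private static final class Rule<T> {
        final Predicate<T> condition;
        final String message;

        Rule(Predicate<T> condition, String message) {
            this.condition = condition;
            this.message = message;
        }
    }

    private final List<Rule<T>> rules = new ArrayList<>();

    /** Registers a rule; returns this so calls can be chained. */
    public OrderedRules<T> when(Predicate<T> condition, String message) {
        rules.add(new Rule<>(condition, message));
        return this;
    }

    /** Returns the message of the first rule whose condition holds, or the fallback. */
    public String firstMatch(T input, String fallback) {
        for (Rule<T> rule : rules) {
            if (rule.condition.test(input)) {
                return rule.message;
            }
        }
        return fallback;
    }
}
```

A usage sketch, with the inputs packed into an int[] only to keep the example short: new OrderedRules<int[]>().when(v -> v[0] < 5 && v[1] > 0, "first").when(v -> v[2] != 4, "second").firstMatch(new int[] {1, 1, 4}, "none") picks "first".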
It seems to me that you have given very little importance to designing the product in modules,
which is the main point of using an OOP language.
E.g., if you have 100 conditions and you are able to make 4 modules, then theoretically you need only 26 conditions to choose anything.
This is an additional possibility that may be worth considering.
Take each comparison, and calculate its truth, then look the resulting boolean[] up in a truth table. There is a lot of existing work on simplifying truth tables that you could apply. I have a truth table simplification applet I wrote many years ago. You may find its source code useful.
The cost of this is doing all the comparisons, or at least the ones that are needed to evaluate the expression using the simplified truth table. The advantage is an organized system for managing a complicated combination of conditions.
Even if you do not use a truth table directly in the code, consider writing and simplifying one as a way of organizing your code.
Closed 10 years ago.
Apart from readability, are there any differences in performance or compile time when a single-line loop / conditional statement is written with and without brackets?
For example, are there any differences between following:
if (a > 10)
a = 0;
and
if (a > 10)
{
a = 0;
}
?
Of course there is no difference in performance. But there is a difference in the possibility of introducing errors:
if (a>10)
a=0;
If somebody extends code and writes later,
if (a>10)
a=0;
printf ("a was reset\n");
This will always be printed because of the missing braces. Some people request that you always use braces to avoid this kind of error.
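The same pitfall can be checked in Java with a small sketch. BraceDemo and its two methods are hypothetical names; the misleading indentation in the first method is deliberate.

```java
// Hypothetical demo of the missing-braces pitfall described above.
public class BraceDemo {

    /** Without braces only the first statement belongs to the if. */
    public static boolean loggedWithoutBraces(int a) {
        boolean logged = false;
        if (a > 10)
            a = 0;
            logged = true; // indentation is misleading: this runs on every call
        return logged;
    }

    /** With braces both statements are guarded by the condition. */
    public static boolean loggedWithBraces(int a) {
        boolean logged = false;
        if (a > 10) {
            a = 0;
            logged = true;
        }
        return logged;
    }
}
```

Calling both methods with a value that is not greater than 10 shows the difference: the unbraced version still reports that it logged.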
Contrary to several answers, there is a finite but negligible performance difference at compile time. There is zero difference of any kind at runtime.
No, there is no difference, the compiler will strip out non-meaningful braces, line-breaks etc.
The compile time will be marginally different, but so marginally that you have already lost far more time reading this answer than you will get back in compile speed. As compute power increases, this cost goes down yet further, but the cost of reducing readability does not.
In short, do what is readable, it makes no useful difference in any other sense.
A machine code does not contain such braces. After compilation, there is no more {}. Use the most readable form.
Well, there is of course no difference between them as such at runtime.
But you should certainly use the 2nd way for the sake of the maintainability of your code.
The reason is this: suppose that in future you need to add some more lines to your if-else block. If your old code uses the first way, you would have to add the braces before adding any new code, which you don't need to do in the 2nd case.
So, it is far easier to add code to the 2nd way in future, than to the 1st one.
Also, if you are using the first way, you are prone to typing errors, such as a semicolon after your if, like this:
if (a > 0);
System.out.println("Hello");
So, you can see that your Hello will always get printed. Errors like these are much easier to avoid if you have curly braces attached to your if.
It depends on the rest of the coding guidelines. I don't see any
problem dropping the braces if the opening brace is always on a line
by itself. If the opening brace is at the end of the if line,
however, I find it too easy to overlook when adding to the contents. So
I'd go for either:
if ( a > 10 ) {
a = 0;
}
regardless of the number of lines, or:
if ( a > 10 )
{
// several statements...
}
with:
if ( a > 10 )
a = 0;
when there is just one statement. The important thing, however, is that
all of the code be consistent. If you're working on an existing code
base which uses several different styles, I'd always use braces in new
code, since you can't count on the code style to ensure that if they
were there, they'd be in a highly visible location.
Closed 11 years ago.
In the near future a rule might be enforced on us under which we cannot have any hard-coded numbers in our Java source code. All hard-coded numbers must be declared as final variables.
Even though this sounds great in theory it is really hard/tedious to implement in code, especially legacy code. Should it really be considered "best practice" to declare numbers in following code snippets as final variables?
//creating excel
cellnum = 0;
//Declaring variables.
Object[] result = new Object[2];
//adding dash to ssn
return ssn.substring(1, 3)+"-"+ssn.substring(3, 5)+"-"+ssn.substring(5, 9);
Above are just some of the examples I could think of, but in these (and others) where would you as a developer say enough is enough?
I wanted to make this question a community wiki but couldn't see how...?
Definitely no. Literal constants have their places, especially low constants such as 0, 1, 2, ...
I don't think anyone would think
double[] pair = new double[PAIR_COUNT];
makes more sense than
double[] pair = new double[2];
I'd say use final variables if
...it increases readability,
...the value may change (and is used in multiple places), or
...it serves as documentation
A related side note: As always with coding standards / conventions: very few (if any) rules should be followed strictly.
Replacing numbers by constants makes sense if the number carries a meaning that is not inherently obvious by looking at its value alone.
For instance,
productType = 221; // BAD: the number needs to be looked up somewhere to understand its meaning
productType = PRODUCT_TYPE_CONSUMABLE; // GOOD: the constant is self-describing
On the other hand,
int initialCount = 0; // GOOD: in this context zero really means zero
int initialCount = ZERO; // BAD: the number value is clear, and there's no need to add a self-referencing constant name if there's no other meaning
Generally speaking, if a literal has a special meaning, it should be given a unique name rather than assuming things. I'm not sure why it is "practically" hard/tedious to do the same.
Object[] result = new Object[2]; => seems like a good candidate for using a Pair class
cellnum = 0; => cellnum = FIRST_COLUMN; esp since you might end up using an API which treats 1 as the starting index or maybe you want to process an excel in which columns start from 2.
return ssn.substring(1, 3)+"-"+ssn.substring(3, 5)+"-"+ssn.substring(5, 9) => If you have code like this littered throughout your codebase, you have bigger problems. If this code exists in a single location and is shielded by a sane API, I don't really see a problem here.
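As a sketch of what "a single location shielded by a sane API" could look like: the class SsnFormatter is hypothetical, and it assumes the input is exactly nine digits grouped 3-2-4, which means it starts at index 0 rather than at the index 1 used in the question's snippet.

```java
// Hypothetical helper: the only place in the codebase that knows the SSN layout.
public final class SsnFormatter {

    private SsnFormatter() {
        // static utility, not meant to be instantiated
    }

    /** Formats nine digits as AAA-GG-SSSS; rejects anything else. */
    public static String withDashes(String ssn) {
        if (ssn == null || !ssn.matches("\\d{9}")) {
            throw new IllegalArgumentException("expected exactly 9 digits");
        }
        return ssn.substring(0, 3) + "-" + ssn.substring(3, 5) + "-" + ssn.substring(5, 9);
    }
}
```

With the layout confined to one method, the literals 3, 5 and 9 read fine next to the comment, and naming them would add little.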
I've seen folks consider 0 and 1 accepted exceptions.
The idea is that you want to document why you have two Objects as above for example.
I agree with you about the dashes in SSN. The comment describes it better than 4 named constants.
In general, I like the idea of no magic numbers, but as with every rule, there are pragmatics involved. Legacy code brings its own issues. It's a lot of work without a lot of productivity in terms of changed behavior to bring old code up to date this way. I would consider doing it in an evolutionary fashion: when you have to edit an old file, bring it up to date.
It really depends on the context, doesn't it? If there are numbers in the code that do not indicate why they exist, then naming them makes the code more readable. If you see the number 3.14 in code, is it PI? Is there any way to tell, or is that just a coincidence? Naming it PI will clear up the mystery.
In your example, why is cellnum = 0? Why not 10? Or 20? That should be named something, say INITIAL_CELL or MAX_CELL, especially if this same number, meaning the same thing, appears again in the code.
Depends if it needs to be changed. Or for that matter, it can be changed.
If you only need 2 objects (say, for a pair like aioobe mentioned) then that isn't a magic number, it's the correct number. If it's for a variable tuple that, at this moment, is 2, then you probably should abstract it out into a constant.
Closed 10 years ago.
I want to write a program for a school java project to parse some CSV I do not know. I do know the datatype of each column - although I do not know the delimiter.
The problem I do not even marginally know how to fix is to parse Date or even DateTime Columns. They can be in one of many formats.
I found many libraries but have no clue which is the best for my needs:
http://opencsv.sourceforge.net/
http://www.csvreader.com/java_csv.php
http://supercsv.sourceforge.net/
http://flatpack.sourceforge.net/
The problem is I am a total Java beginner. I am afraid none of those libraries can do what I need, or I can't convince them to do it.
I bet there are a lot of people here who have code sample that could get me started in no time for what I need:
automatically split in Columns (delimiter unknown, Columntypes are known)
cast to Columntype (should cope with $, %, etc.)
convert dates to Java Date or Calendar Objects
It would be nice to get as many code samples as possible by email.
Thanks a lot!
AS
You also have the Apache Commons CSV library, maybe it does what you need. See the guide. Updated to Release 1.1 in 2014-11.
Also, for a foolproof version, I think you'll need to code it yourself. With SimpleDateFormat you can define the formats you expect and try each one in turn; if a value doesn't match any of the formats you anticipated, it isn't a Date.
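A minimal sketch of that "try each format" idea follows. The class name MultiFormatDateParser and the four patterns are assumptions chosen for illustration; you would substitute the formats you actually expect, most specific first.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper: tries a fixed list of patterns until one consumes the whole cell.
public class MultiFormatDateParser {

    // Assumed patterns; order matters, most specific first.
    private static final String[] PATTERNS = {
        "yyyy-MM-dd HH:mm:ss", "yyyy-MM-dd", "dd.MM.yyyy", "MM/dd/yyyy"
    };

    /** Returns the parsed Date, or null if no pattern matches the entire text. */
    public static Date parse(String text) {
        String s = text.trim();
        for (String pattern : PATTERNS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
            format.setLenient(false); // reject impossible dates such as month 13
            ParsePosition pos = new ParsePosition(0);
            Date date = format.parse(s, pos);
            if (date != null && pos.getIndex() == s.length()) {
                return date;
            }
        }
        return null;
    }
}
```

Ambiguous inputs such as 03/04/2012 cannot be resolved by trial alone, so whichever of MM/dd and dd/MM comes first in the list wins. That part remains a guess.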
There is a serious problem with using
String[] strArr=line.split(",");
in order to parse CSV files, and that is because there can be commas within the data values, and in that case you must quote them, and ignore commas between quotes.
There is a very very simple way to parse this:
/**
* returns a row of values as a list
* returns null if you are past the end of the input stream
*/
public static List<String> parseLine(Reader r) throws Exception {
int ch = r.read();
while (ch == '\r') {
//ignore carriage return chars wherever, particularly just before end of file
ch = r.read();
}
if (ch<0) {
return null;
}
Vector<String> store = new Vector<String>();
StringBuffer curVal = new StringBuffer();
boolean inquotes = false;
boolean started = false;
while (ch>=0) {
if (inquotes) {
started=true;
if (ch == '\"') {
inquotes = false;
}
else {
curVal.append((char)ch);
}
}
else {
if (ch == '\"') {
inquotes = true;
if (started) {
// if this is the second quote in a value, add a quote
// this is for the double quote in the middle of a value
curVal.append('\"');
}
}
else if (ch == ',') {
store.add(curVal.toString());
curVal = new StringBuffer();
started = false;
}
else if (ch == '\r') {
//ignore CR characters
}
else if (ch == '\n') {
//end of a line, break out
break;
}
else {
curVal.append((char)ch);
}
}
ch = r.read();
}
store.add(curVal.toString());
return store;
}
There are many advantages to this approach. Note that each character is touched EXACTLY once. There is no reading ahead, pushing back in the buffer, etc. No searching ahead to the end of the line, and then copying the line before parsing. This parser works purely from the stream, and creates each string value once. It works on header lines, and data lines, you just deal with the returned list appropriate to that. You give it a reader, so the underlying stream has been converted to characters using any encoding you choose. The stream can come from any source: a file, a HTTP post, an HTTP get, and you parse the stream directly. This is a static method, so there is no object to create and configure, and when this returns, there is no memory being held.
You can find a full discussion of this code, and why this approach is preferred in my blog post on the subject: The Only Class You Need for CSV Files.
My approach would not be to start by writing your own API. Life's too short, and there are more pressing problems to solve. In this situation, I typically:
Find a library that appears to do what I want. If one doesn't exist, then implement it.
If a library does exist, but I'm not sure it'll be suitable for my needs, write a thin adapter API around it, so I can control how it's called. The adapter API expresses the API I need, and it maps those calls to the underlying API.
If the library doesn't turn out to be suitable, I can swap another one in underneath the adapter API (whether it's another open source one or something I write myself) with a minimum of effort, without affecting the callers.
Start with something someone has already written. Odds are, it'll do what you want. You can always write your own later, if necessary. OpenCSV is as good a starting point as any.
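To illustrate the thin adapter idea, here is a sketch. The names CsvAdapterSketch, RowSource and NaiveSplit are all hypothetical, and the naive implementation is only a stand-in for whichever library you end up wrapping, since it does not handle quoted fields.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical adapter: callers depend on RowSource, never on a specific CSV library.
public class CsvAdapterSketch {

    /** The small API the rest of the program is written against. */
    public interface RowSource {
        List<List<String>> readRows(Reader in);
    }

    /** Stand-in implementation; a library-backed class could be swapped in later. */
    public static class NaiveSplit implements RowSource {
        @Override
        public List<List<String>> readRows(Reader in) {
            List<List<String>> rows = new ArrayList<>();
            try {
                BufferedReader reader = new BufferedReader(in);
                for (String line; (line = reader.readLine()) != null; ) {
                    // -1 keeps trailing empty columns; quoted commas are NOT handled here
                    rows.add(Arrays.asList(line.split(",", -1)));
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return rows;
        }
    }
}
```

Swapping libraries then means writing one more RowSource implementation, while the calling code stays untouched.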
I had to use a CSV parser about 5 years ago. It seems there are at least two CSV standards: http://en.wikipedia.org/wiki/Comma-separated_values and what Microsoft does in Excel.
I found this library, which eats both: http://ostermiller.org/utils/CSV.html, but AFAIK it has no way of inferring what data type the columns are.
You might want to have a look at this specification for CSV. Bear in mind that there is no official recognized specification.
If you do not know the delimiter, it will not be possible to do this, so you have to find out somehow. If you can inspect the file manually, you should quickly be able to see what it is and hard-code it in your program. If the delimiter can vary, your only hope is to deduce it from the formatting of the known data. When Excel imports CSV files it lets the user choose the delimiter, and this is a solution you could use as well.
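Since the asker knows the column types, and therefore the column count, one way to deduce the delimiter is to test a few candidates against some sample lines. This is a sketch under my own assumptions: DelimiterSniffer, the candidate list and the simple quote handling are all illustrative.

```java
import java.util.List;

// Hypothetical helper: picks the candidate that yields the expected column count on every line.
public class DelimiterSniffer {

    private static final char[] CANDIDATES = {',', ';', '\t', '|'};

    public static char guess(List<String> sampleLines, int expectedColumns) {
        for (char candidate : CANDIDATES) {
            boolean fitsAll = !sampleLines.isEmpty();
            for (String line : sampleLines) {
                if (countOutsideQuotes(line, candidate) != expectedColumns - 1) {
                    fitsAll = false;
                    break;
                }
            }
            if (fitsAll) {
                return candidate;
            }
        }
        throw new IllegalArgumentException("no candidate delimiter fits the sample");
    }

    // Counts occurrences of the delimiter that are not inside double quotes.
    private static int countOutsideQuotes(String line, char delimiter) {
        int count = 0;
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == delimiter && !inQuotes) {
                count++;
            }
        }
        return count;
    }
}
```

If more than one candidate fits, the first in the list wins, so letting the user confirm the choice, as Excel does, is still a sensible fallback.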
I agree with @Brian Clapper. I have used SuperCSV as a parser though I've had mixed results. I enjoy the versatility of it, but there are some situations within my own csv files for which I have not been able to reconcile "yet". I have faith in this product and would recommend it overall--I'm just missing something simple, no doubt, that I'm doing in my own implementation.
SuperCSV can parse the columns into various formats, do edits on the columns, etc. It's worth taking a look-see. It has examples as well, and easy to follow.
The one/only limitation I'm having is catching an 'empty' column and parsing it into an Integer or maybe a blank, etc. I'm getting null-pointer errors, but javadocs suggest each cellProcessor checks for nulls first. So, I'm blaming myself first, for now. :-)
Anyway, take a look at SuperCSV. http://supercsv.sourceforge.net/
At a minimum you are going to need to know the column delimiter.
Basically you will need to read the file line by line.
Then you will need to split each line by the delimiter, say a comma (CSV stands for comma-separated values), with
String[] strArr=line.split(",");
This will turn it into an array of strings which you can then manipulate, for example with
String name=strArr[0];
int yearOfBirth = Integer.parseInt(strArr[1]);
int monthOfBirth = Integer.parseInt(strArr[2]);
int dayOfBirth = Integer.parseInt(strArr[3]);
// GregorianCalendar months are zero-based (January is 0), hence the -1
GregorianCalendar dob=new GregorianCalendar(yearOfBirth, monthOfBirth - 1, dayOfBirth);
Student student=new Student(name, dob); //lets pretend you are creating instances of Student
You will need to do this for every line so wrap this code into a while loop. (If you don't know the delimiter just open the file in a text editor.)
I would recommend that you start by pulling your task apart into its component parts.
Read string data from a CSV
Convert string data to appropriate format
Once you do that, it should be fairly trivial to use one of the libraries you link to (which most certainly will handle task #1). Then iterate through the returned values, and cast/convert each String value to the value you want.
If the question is how to convert strings to different objects, it's going to depend on what format you are starting with, and what format you want to wind up with.
DateFormat.parse(), for example, will parse dates from strings. See SimpleDateFormat for quickly constructing a DateFormat for a certain string representation.
Integer.parseInt() will parse integers from strings.
Currency, you'll have to decide how you want to capture it. If you want to just capture as a float, then Float.parseFloat() will do the trick (just use String.replace() to remove all $ and commas before you parse it). Or you can parse into a BigDecimal (so you don't have rounding problems). There may be a better class for currency handling (I don't do much of that, so am not familiar with that area of the JDK).
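A small sketch of the BigDecimal route: the class CellConverters and its method names are made up, and it assumes US-style formatting, with $ as the currency sign, commas as thousands separators and a dot as the decimal point.

```java
import java.math.BigDecimal;

// Hypothetical converters for money and percent cells, assuming US-style number formatting.
public class CellConverters {

    /** Strips the dollar sign and thousands separators, then parses exactly. */
    public static BigDecimal parseMoney(String cell) {
        String cleaned = cell.trim().replace("$", "").replace(",", "");
        return new BigDecimal(cleaned);
    }

    /** Converts a percent cell to a fraction by shifting the decimal point two places. */
    public static BigDecimal parsePercent(String cell) {
        String cleaned = cell.trim().replace("%", "");
        return new BigDecimal(cleaned).movePointLeft(2);
    }
}
```

Both methods throw NumberFormatException on anything that is not a plain decimal after cleaning, which is a handy signal that a column is not what you assumed.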
Writing your own parser is fun, but you should probably have a look at
Open CSV. It provides numerous ways of accessing the CSV and can also generate CSV, and it handles escapes properly. As mentioned in another post, there is also a CSV-parsing lib in the Apache Commons, but that one isn't released yet.