How to prevent word splitting on build output? - java

I am attempting to use substrings in order to prevent word splitting on the system output.
I have seen responses that split it by each word, but I want to do it only when necessary.
Original code:
System.out.println("This is a very extra long quote that doesn't break properly.");
Build output:
This is a very extr
a long quote that d
oesn't break proper
ly.
Desired build output:
This is a very
extra long
quote that
doesn't
break properly.
Of course, I am not trying to make it that narrow. I just want a word to move to a new line whenever it would otherwise be split.
Thank you anyone that helps! All responses are appreciated!

You can use the wrap method of WordUtils (from the Apache Commons library).
The method wraps a string at whitespace according to a given wrap length, so line breaks fall between words rather than inside them.
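For readers who cannot add the Apache Commons dependency, here is a minimal JDK-only sketch of the same idea. It is a hand-rolled greedy wrap, not the WordUtils implementation; the class name WordWrapSketch and the width of 19 (the line length seen in the question's output) are assumptions made for illustration.

```java
class WordWrapSketch {
    // Greedy wrap: start a new line whenever the next word would not fit.
    // Assumption: a word longer than width stays unbroken on its own line.
    static String wrap(String text, int width) {
        StringBuilder out = new StringBuilder();
        int lineLength = 0;
        for (String word : text.trim().split("\\s+")) {
            if (lineLength > 0 && lineLength + 1 + word.length() > width) {
                out.append('\n');
                lineLength = 0;
            }
            if (lineLength > 0) {
                out.append(' ');
                lineLength++;
            }
            out.append(word);
            lineLength += word.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String quote = "This is a very extra long quote that doesn't break properly.";
        System.out.println(wrap(quote, 19));
    }
}
```

With the library on the classpath, WordUtils.wrap(quote, 19) should serve the same purpose in a single call.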

Related

Java Regex to detect end of sentence BUT ignore (num)(period) e.g. 15

Trying to find a good regex for sentence end detection in java.
The main issue is that a number followed by a period is detected as a sentence end (see demo link). In my case, I'd prefer that it not be recognized as a sentence end, even though in some cases it might be one. What I see more commonly in documents are section headers that look like:
12. the end of the world 13. world didnt end 14. nope it did
In my case it's splitting up a lot of simple header listings into sentences which I don't want.
Additional issue with the solution posted here:
The proposed solution is:
[!?.]+(?=$|\s)
See demo: http://regex101.com/r/lS5tT3/15
The issue is if there is a chapter heading such as 15. then it sees it incorrectly as a sentence end.
Try this text in the demo and you will see the issue in the first sentence:
This is the f!!rst *15.* the best sentence! Is this the second one? The third 32.5 sentence is here... And the fourth one!!
If there are any regex whizzes who can help add the rule that a period followed by a space is not a sentence end when it is preceded by a number, that would be quite helpful.
This regex works with some of the abbreviations and correctly recognizes the sentence end markers. Unfortunately, for Java's String.split I would need the inverse of this function...
([!?]+(?=$|\s))|((?<![\d])(?<!etc)(?<!Mr)(?<!mr)(?<!i.e)(?<!Dr)(?<!dr)(?<!Mrs)(?<!mrs)(?<![ A-Z])(?<!Ms)(?<!ms)(?<!Phd)(?<!u\.s)(?<!U\.S)(?<!\.)[.]{1}(?=$|\s))|
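One possible way to get a pattern usable with split is to match the whitespace between sentences instead of the end marker itself, using lookbehinds. The sketch below is an assumption-laden illustration, not a tested solution from the thread: the class name SentenceSplitSketch is hypothetical, abbreviations such as "Mr." are not handled, and the sample text is the one from the question with the emphasis asterisks around 15. removed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SentenceSplitSketch {
    // Split on whitespace that follows '!', '?' or '.',
    // unless that whitespace follows a digit and a period (e.g. "15. ").
    private static final Pattern BOUNDARY =
            Pattern.compile("(?<=[!?.])(?<!\\d\\.)\\s+");

    static List<String> sentences(String text) {
        return Arrays.asList(BOUNDARY.split(text.trim()));
    }

    public static void main(String[] args) {
        String text = "This is the f!!rst 15. the best sentence! Is this the second one? "
                + "The third 32.5 sentence is here... And the fourth one!!";
        for (String sentence : sentences(text)) {
            System.out.println(sentence);
        }
    }
}
```

The trade-off is the one the question already accepts: a sentence that genuinely ends in a number ("He turned 15. Then ...") will not be split.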

Java regex to match quoted numbers

I need to clean up a JSON including incorrectly quoted numbers via a short Java (not JS!) Regex snippet. Example for what I have:
[{"series":"a","x":"1","y":"111.71"},{"series":"a","x":"2","y":"120.25"}]
Example for what I would need to get:
[{"series":"a","x":1,"y":111.71},{"series":"a","x":2,"y":120.25}]
So I only need to match and eliminate quote characters if they are preceded or followed by [0-9], but how to avoid replacing part of the number is beyond my lowly regex skills.
Any help greatly appreciated!
EDIT (2nd round):
Thanks for the fast feedback! I'm not too worried about false positives since I can control the contents of the descriptors, and I'll make sure they're text-only. Spaces can be avoided as well; only negative numbers might occur - good one! Separators are always commas (",") for the JSON, and the decimals of the double values (of which there can be any number) are always separated by dots ("."). I cannot fix the JSON source unfortunately, and I definitely want to clean this up in Java.
Trying out the suggestions now and reporting back. I'll also toy around with this: http://www.regular-expressions.info/lookaround.html#lookbehind
How about replaceAll("\"(-?\\d+([.]\\d+)?)\"","$1");
This works for your specific example, but would not work if other numbers have a different format (see my comment):
String s = "[{\"series\":\"a\",\"x\":\"1\",\"y\":\"111.71\"},{\"series\":\"a\",\"x\":\"2\",\"y\":\"120.25\"}]";
String clean = s.replaceAll("\"(\\d+\\.?\\d*)\"", "$1");
System.out.println(clean);
outputs:
[{"series":"a","x":1,"y":111.71},{"series":"a","x":2,"y":120.25}]
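Since the edit to the question mentions that negative numbers might occur, here is a small sketch that wraps the pattern with the optional minus sign in a method and runs it on a negative value. The class name UnquoteNumbersSketch is hypothetical, and the sketch assumes, as the question states, that no text descriptor consists only of digits (such a value would be unquoted too).

```java
class UnquoteNumbersSketch {
    // Removes the quotes around integers and decimals, with an optional leading minus.
    static String unquote(String json) {
        return json.replaceAll("\"(-?\\d+(?:\\.\\d+)?)\"", "$1");
    }

    public static void main(String[] args) {
        String s = "[{\"series\":\"a\",\"x\":\"-1\",\"y\":\"111.71\"}]";
        System.out.println(unquote(s));
    }
}
```

For input that is less controlled, parsing the JSON with a proper JSON library and converting the values there would be more robust than a regex.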

Suggested ways of reading a text file with inconsistent formatting

I'm trying to read a text file of numbers as a double array, and after trying various methods (usually resulting in an input format exception) I have come to the conclusion that the text file I am trying to read is inconsistent in its delimiting.
The majority of the text format is in the form "0.000,0.000" so I have been using a Scanner and the useDelimiter(",") to read in each value.
It turns out though (this is a big file of numbers) that some of the formatting is in the form "0.000 0.000" (at the end of a line I presume) which of course produces an input format exception.
This is an open question really. I'm a pretty basic Java programmer, so I would just like to see if there are any suggestions for doing this. Is Scanner the right class for this?
Thank you for your time!
Read file as text line-by-line. Then split line into parts:
String[] parts = line.split("[ ,]");
Now iterate over the parts and call Double.parseDouble() for each part.
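A minimal sketch of this split-then-parse approach follows. The class and method names are hypothetical, and the pattern is widened to "[ ,]+" (an assumption beyond the answer above) so that a comma followed by a space does not produce an empty part.

```java
import java.util.Arrays;

class LineSplitSketch {
    // Parses one line whose values are separated by commas and/or spaces.
    // Assumption: the line does not start with a comma; an empty part would
    // make Double.parseDouble throw NumberFormatException.
    static double[] parseLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new double[0];
        }
        String[] parts = trimmed.split("[ ,]+");
        double[] values = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Double.parseDouble(parts[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseLine("0.000,0.125 2.5")));
    }
}
```

In a real program, each line would come from something like BufferedReader.readLine() and the results would be collected into one array or list.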
Scanner allows any Java Regex Pattern to function as a delimiter. You should be able to use any number of delimiters by doing the following:
scanner.useDelimiter("[,\\s]"); // Will match commas and whitespace
I'd like to comment this in instead of making it a separate answer, but my reputation is too low. Apologies, Alex.
You mentioned having two different delimited characters used in different instances, not a combination of the two as a single delimiter.
You can use the vertical bar as logical OR in a regular expression. Note that this applies outside a character class; inside [...] the bar is just a literal character.
scanner.useDelimiter(",|\\s"); // Will match commas or whitespace as appropriate
Or, line by line:
String[] parts = line.split(",|\\s");
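As a rough end-to-end illustration of the Scanner approach, the sketch below reads every value from a mixed-delimiter input. The class name and sample data are made up, and two choices are assumptions beyond the answers above: the delimiter is "[,\\s]+" so that runs such as "\r\n" count as one separator, and the locale is fixed to Locale.US so that "." is always the decimal separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

class ScannerDelimiterSketch {
    // Reads all doubles separated by commas and/or whitespace.
    static List<Double> readAll(String input) {
        List<Double> values = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            scanner.useLocale(Locale.US);
            scanner.useDelimiter("[,\\s]+");
            while (scanner.hasNextDouble()) {
                values.add(scanner.nextDouble());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(readAll("0.5,1.25\n2.0 3.75\n"));
    }
}
```

To read from a file, the Scanner could be constructed from a File or Path instead of a String; the loop stops at the first token that is not a double.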

Java Splitting a String

I have this string
G234101,Non-Essential,ATPases,Respiration chain complexes,"Auxotrophies, carbon and",PS00017,2,IONIC HOMEOSTASIS,mitochondria.
that I have been trying to split in Java. The file is comma-delimited, but some of the strings have commas within them and I don't want them to get split up. Currently in the above example
"Auxotrophies, carbon and"
is getting split into two strings.
Any suggestions on how best to split this up by commas? Not all of the strings have the " ", for example the following string:
G234103,Essential,Protein Kinases,?,Cell cycle defects,PS00479,2,CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION,cytoplasm.
http://opencsv.sourceforge.net/
But if you really do need to reinvent the wheel (homework), you need to use a more complicated regular expression than just "what,ever".split(","). It's not simple though. And you might be better off creating your own custom Lexer. http://en.wikipedia.org/wiki/Lexical_analysis
This isn't too hard in your case. As you process your text character by character you just need to keep track of opening and closing quotes to decide when to ignore commas and when to act on them.
Also see StreamTokenizer for a built-in configurable Lexer - you should be able to use this to meet your requirements.
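The character-by-character idea described above can be sketched as follows. This is only an illustration with a hypothetical class name; it assumes quotes are never escaped or doubled inside a field, and it drops the quote characters from the result.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareSplitSketch {
    // Splits on commas, ignoring commas that appear between double quotes.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;          // toggle state, drop the quote itself
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString()); // field boundary
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());         // last field
        return fields;
    }

    public static void main(String[] args) {
        String line = "G234101,Non-Essential,ATPases,Respiration chain complexes,"
                + "\"Auxotrophies, carbon and\",PS00017,2,IONIC HOMEOSTASIS,mitochondria.";
        System.out.println(split(line));
    }
}
```

For real CSV files with escaped quotes or embedded newlines, a dedicated CSV library remains the safer choice.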
I would think that this would be a multi-step process. First, find all the commas inside quotes in your original string and replace each with something like {comma}. You can do this with some regex. Then split the new string on the comma symbol (,). Finally, go through your list and replace each {comma} with the comma symbol (,).

java string tokenizer

What if I have a file that I am using a string tokenizer on to get the values between commas? It's a CSV file. Here is sample input:
test,first,second,,fourth,fifth
So how can I catch that empty field between the commas? Right now it's just pretending nothing is there. It doesn't even see that there is a place with nothing in it.
Using String#split() would be recommended over StringTokenizer.
String[] s = "test,first,second,,fourth,fifth".split(",");
System.out.println(Arrays.asList(s));
System.out.println(s.length);
// output:
// [test, first, second, , fourth, fifth]
// 6
Also, if you have much more involved CSV parsing in your code, if possible, try using an existing library like JavaCSV.
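One caveat worth illustrating: String#split(",") keeps empty fields in the middle of a line, as shown above, but it discards trailing empty strings. Passing a negative limit keeps them. The class name and sample input below are made up for the illustration.

```java
import java.util.Arrays;

class SplitLimitSketch {
    // A negative limit tells split to keep trailing empty fields.
    static String[] fields(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        String line = "test,first,,";
        System.out.println(line.split(",").length);        // trailing empties dropped
        System.out.println(fields(line).length);           // trailing empties kept
        System.out.println(Arrays.asList(fields(line)));
    }
}
```

This matters for CSV rows where the last columns are optional and may be left blank.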
I am not sure if I am understanding your question correctly. I would use well-known packages like opencsv.
The split technique works great, so long as none of your elements have a comma inside it. You can use existing libraries. I've also had good results using regexp for CSV processing.
