Retrieve Line Numbers from Diff Patch Match - java

I am working on a project that compares two versions of a large text file (5000+ lines each). The newer version may contain new content and may lack content that was removed. The goal is to detect changes between versions early, because a team relies on the information in that text.
To solve the problem, I use the diff-match-patch library, which allows me to identify removed and new content. In the first step I search for changes.
// Kept as a field so that showAddedElements() can read the result
private LinkedList<Diff> diffs;
public void compareStrings(String oldText, String newText){
DiffMatchPatch dmp = new DiffMatchPatch();
diffs = dmp.diffMain(oldText, newText, false);
}
Then I filter the list by the keywords INSERT/DELETE to get only the new/removed content.
public String showAddedElements(){
String insertions = "";
for(Diff elem: diffs){
if(elem.operation == Operation.INSERT){
insertions = insertions + elem.text + System.lineSeparator();
}
}
return insertions;
}
However, when I output the contents, I sometimes get only fragments such as (o, contr, ler), because only single characters were removed/added. Instead, I would like to output the whole sentence in which a change occurred.
Is there a way to also retrieve the line number from the DiffMatchPatch where the changes occurred?

I have found a solution by using another library for the line extraction. DiffUtils (the DiffUtils class by Dmitry Naumenko) helped me achieve the desired goal.
/**
* Converts a String to a list of lines by dividing the string at linebreaks.
* @param text The text to be converted to a line list
*/
private List<String> fileToLines(String text) {
List<String> lines = new LinkedList<String>();
Scanner scanner = new Scanner(text);
while (scanner.hasNextLine()) { // hasNext() would skip trailing blank lines
String line = scanner.nextLine();
lines.add(line);
}
scanner.close();
return lines;
}
/**
* Starts a line-by-line comparison between two strings. The results are stored
* in an internal list for further processing.
*
* @param firstText The first string to be compared
* @param secondText The second string to be compared
*/
public void startLineByLineComparison(String firstText, String secondText){
List<String> firstString = fileToLines(firstText);
List<String> secondString = fileToLines(secondText);
changes = DiffUtils.diff(firstString, secondString).getDeltas();
}
After the comparison, the changes can be extracted from the list with the following code, where elem.getType() gives the type of difference:
/**
* Returns a String filled with all removed content including line position
* @return String with removed content
*/
public String returnRemovedContent(){
String deletions = "";
for(Delta elem: changes){
if(elem.getType() == TYPE.DELETE){
deletions = deletions + appendLines(elem.getOriginal()) + System.lineSeparator();
}
}
return deletions;
}
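If you would rather stay with diff-match-patch, one option is to track the character offset of each diff in the old text and turn that offset into a line number by counting line breaks. Below is a minimal sketch using only the JDK. The Op and Edit types are hypothetical stand-ins for the library's Operation and Diff classes, and the sketch assumes lines end with '\n' and that the edits arrive in document order (EQUAL and DELETE texts together make up the old text).

```java
import java.util.ArrayList;
import java.util.List;

public class DiffLineNumbers {
    // Hypothetical stand-ins for the library's Operation and Diff types.
    public enum Op { EQUAL, INSERT, DELETE }

    public static final class Edit {
        final Op op;
        final String text;
        public Edit(Op op, String text) { this.op = op; this.text = text; }
    }

    /** Returns the 1-based line number of the character at the given offset. */
    public static int lineNumberAt(String text, int offset) {
        int line = 1;
        for (int i = 0; i < offset && i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                line++;
            }
        }
        return line;
    }

    /** Collects the old-text line number at which each DELETE edit starts. */
    public static List<Integer> deletedLineNumbers(String oldText, List<Edit> edits) {
        List<Integer> lines = new ArrayList<>();
        int offset = 0; // current position in oldText
        for (Edit edit : edits) {
            if (edit.op == Op.DELETE) {
                lines.add(lineNumberAt(oldText, offset));
            }
            // EQUAL and DELETE text is part of the old version; INSERT text is not.
            if (edit.op != Op.INSERT) {
                offset += edit.text.length();
            }
        }
        return lines;
    }
}
```

With the line number in hand, the full line can be looked up in the old text, which avoids printing isolated fragments. For insertions the same idea applies with an offset into the new text (advance on EQUAL and INSERT instead).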

Related

I need to parse integers after a specific character from list of strings

I've got a problem here. I need to extract all the numbers from each string in a list of strings.
Let's say one of the strings in the list is "Jhon [B] - 14, 15, 16"
and the format of the strings is constant: every string has at most 7 numbers, and the numbers are separated by ",". I want to get every number after the "-". I am really confused; I tried everything I know, but I am not getting even close.
public static List<String> readInput() {
final Scanner scan = new Scanner(System.in);
final List<String> items = new ArrayList<>();
while (scan.hasNextLine()) {
items.add(scan.nextLine());
}
return items;
}
public static void main(String[] args) {
final List<String> stats= readInput();
}
}
You could...
Just manually parse the String using things like String#indexOf and String#split (and String#trim)
String text = "Jhon [B] - 14, 15, 16";
int indexOfDash = text.indexOf("-");
if (indexOfDash < 0 || indexOfDash + 1 >= text.length()) {
return;
}
String trailingText = text.substring(indexOfDash + 1).trim();
String[] parts = trailingText.split(",");
// There's probably a really sweet and awesome
// way to use Streams, but the point is to try
// and keep it simple 😜
List<Integer> values = new ArrayList<>(parts.length);
for (int index = 0; index < parts.length; index++) {
values.add(Integer.parseInt(parts[index].trim()));
}
System.out.println(values);
which prints
[14, 15, 16]
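For completeness, the Streams variant hinted at in the comment could look roughly like this. It is only a sketch: the class and method names are made up for illustration, and it assumes the first "-" in the string is the separator (so a hyphenated name would break it).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class DashNumbers {
    /** Parses the comma-separated integers that follow the first '-' in the text. */
    public static List<Integer> numbersAfterDash(String text) {
        int dash = text.indexOf('-');
        if (dash < 0) {
            return List.of(); // no dash, so there is nothing to parse
        }
        return Arrays.stream(text.substring(dash + 1).split(","))
                .map(String::trim)
                .filter(part -> !part.isEmpty())
                .map(Integer::valueOf)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(numbersAfterDash("Jhon [B] - 14, 15, 16")); // prints [14, 15, 16]
    }
}
```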
You could...
Make use of a custom delimiter for Scanner for example...
String text = "Jhon [B] - 14, 15, 16";
Scanner parser = new Scanner(text);
parser.useDelimiter(" - ");
if (!parser.hasNext()) {
// This is an error
return;
}
// We know that the string has leading text before the "-"
parser.next();
if (!parser.hasNext()) {
// This is an error
return;
}
String trailingText = parser.next();
parser = new Scanner(trailingText);
parser.useDelimiter(", ");
List<Integer> values = new ArrayList<>(8);
while (parser.hasNextInt()) {
values.add(parser.nextInt());
}
System.out.println(values);
which prints...
[14, 15, 16]
Or you could use a method that extracts all signed or unsigned whole or floating-point numbers from a string. The method below makes use of the String#replaceAll() method:
/**
* This method will extract all signed or unsigned Whole or floating point
* numbers from a supplied String. The numbers extracted are placed into a
* String[] array in the order of occurrence and returned.<br><br>
*
* It doesn't matter if the numbers within the supplied String have leading
* or trailing non-numerical (alpha) characters attached to them.<br><br>
*
* A Locale can also be optionally supplied so to use whatever decimal symbol
* that is desired otherwise, the decimal symbol for the system's current
* default locale is used.
*
* @param inputString (String) The supplied string to extract all the numbers
* from.<br>
*
* @param desiredLocale (Optional - Locale varArgs) If a locale is desired for a
* specific decimal symbol then that locale can be optionally
* supplied here. Only one Locale argument is expected and used
* if supplied.<br>
*
* @return (String[] Array) A String[] array is returned with each element of
* that array containing a number extracted from the supplied
* Input String in the order of occurrence.
*/
public static String[] getNumbersFromString(String inputString, java.util.Locale... desiredLocale) {
// Get the decimal symbol for the current system's locale.
char decimalSeparator = new java.text.DecimalFormatSymbols().getDecimalSeparator();
/* Is there a supplied Locale? If so, set the decimal
separator to that for the supplied locale */
if (desiredLocale != null && desiredLocale.length > 0) {
decimalSeparator = new java.text.DecimalFormatSymbols(desiredLocale[0]).getDecimalSeparator();
}
/* The first replaceAll() removes every dash (-) that is followed by
whitespace, together with the whitespace around it. The second
replaceAll() removes all periods from the input string except those
that are part of a floating point number. The third replaceAll()
removes everything else except the actual numbers. */
return inputString.replaceAll("\\s*\\-\\s{1,}","")
.replaceAll("\\.(?![\\d](\\.[\\d])?)", "")
.replaceAll("[^-?\\d+" + decimalSeparator + "\\d+]", " ")
.trim().split("\\s+");
}

Scanner.findAll() and Matcher.results() work differently for same input text and pattern

I noticed something odd while splitting a properties string with a regex, and I can't find the root cause.
I have a string that contains key=value pairs, like a properties file.
I have a regex that splits the string into keys and values at the first = of each line. The value can itself contain =.
I tried two different ways to do this in Java.
using Scanner.findAll() method
This does not behave as expected. It should extract and print all keys matching the pattern, but it behaves strangely for the following key-value pair:
SectionError.ErrorMessage=errorlevel=Warning {HelpMessage:This is very important message This is very important .....}
The key that should be extracted is SectionError.ErrorMessage=, but errorlevel= is also treated as a key.
Interestingly, if I remove a single character from the properties string, it behaves correctly and extracts only the SectionError.ErrorMessage= key.
using Matcher.results() method
This works fine, whatever we put in the properties string.
Sample code I tried :
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import static java.util.regex.Pattern.MULTILINE;
public class MessageSplitTest {
static final Pattern pattern = Pattern.compile("^[a-zA-Z0-9._]+=", MULTILINE);
public static void main(String[] args) {
final String properties =
"SectionOne.KeyOne=first value\n" + // removing one char from here would make the scanner method print expected keys
"SectionOne.KeyTwo=second value\n" +
"SectionTwo.UUIDOne=379d827d-cf54-4a41-a3f7-1ca71568a0fa\n" +
"SectionTwo.UUIDTwo=384eef1f-b579-4913-a40c-2ba22c96edf0\n" +
"SectionTwo.UUIDThree=c10f1bb7-d984-422f-81ef-254023e32e5c\n" +
"SectionTwo.KeyFive=hello-world-sample\n" +
"SectionThree.KeyOne=first value\n" +
"SectionThree.KeyTwo=second value additional text just to increase the length of the text in this value still not enough adding more strings here n there\n" +
"SectionError.ErrorMessage=errorlevel=Warning {HelpMessage:This is very important message This is very important message This is very important messageThis is very important message This is very important message This is very important message This is very important message This is very important message This is very important message This is very important message This is very important messageThis is very important message This is very important message This is very important message This is very important message This is very important message}\n" +
"SectionFour.KeyOne=sixth value\n" +
"SectionLast.KeyOne=Country";
printKeyValuesFromPropertiesUsingScanner(properties);
System.out.println();
printKeyValuesFromPropertiesUsingMatcher(properties);
}
private static void printKeyValuesFromPropertiesUsingScanner(String properties) {
System.out.println("===Using Scanner===");
try (Scanner scanner = new Scanner(properties)) {
scanner
.findAll(pattern)
.map(MatchResult::group)
.forEach(System.out::println);
}
}
private static void printKeyValuesFromPropertiesUsingMatcher(String properties) {
System.out.println("===Using Matcher===");
pattern.matcher(properties).results()
.map(MatchResult::group)
.forEach(System.out::println);
}
}
Output printed:
===Using Scanner===
SectionOne.KeyOne=
SectionOne.KeyTwo=
SectionTwo.UUIDOne=
SectionTwo.UUIDTwo=
SectionTwo.UUIDThree=
SectionTwo.KeyFive=
SectionThree.KeyOne=
SectionThree.KeyTwo=
SectionError.ErrorMessage=
errorlevel=
SectionFour.KeyOne=
SectionLast.KeyOne=
===Using Matcher===
SectionOne.KeyOne=
SectionOne.KeyTwo=
SectionTwo.UUIDOne=
SectionTwo.UUIDTwo=
SectionTwo.UUIDThree=
SectionTwo.KeyFive=
SectionThree.KeyOne=
SectionThree.KeyTwo=
SectionError.ErrorMessage=
SectionFour.KeyOne=
SectionLast.KeyOne=
What could be the root cause of this? Does Scanner's findAll work differently from Matcher?
Please let me know if any more info is required.
Scanner's documentation mentions the word "buffer" a lot. This suggests that Scanner does not know about the entire string it is reading from, and only holds a small part of it in a buffer at any time. This makes sense, because Scanners are designed to read from streams as well; reading everything from a stream might take a long time (or forever!) and use a lot of memory.
In the source code of Scanner, there is indeed a CharBuffer:
// Internal buffer used to hold input
private CharBuffer buf;
Because of the length and contents of your string, the Scanner has decided to load everything up to...
SectionError.ErrorMessage=errorlevel=Warning {HelpMessage:This is very...
^
somewhere here
(It could be anywhere in the word "errorlevel")
...into the buffer. Then, after that half of the string is read, the other half of the string starts like this:
errorlevel=Warning {HelpMessage:This is very...
errorlevel= is now the start of the string, causing the pattern to match.
Related Bug?
Matcher doesn't use a buffer. It stores the whole string against which it is matching in the field:
/**
* The original string being matched.
*/
CharSequence text;
So this behaviour is not observed in Matcher.
Sweeper’s answer got it right: this is an issue of the Scanner’s buffer not containing the entire string. We can simplify the example to trigger the issue specifically:
static final Pattern pattern = Pattern.compile("^ABC.", Pattern.MULTILINE);
public static void main(String[] args) {
String testString = "\nABC1\nXYZ ABC2\nABC3ABC4\nABC4";
String properties = "X".repeat(1024 - testString.indexOf("ABC4")) + testString;
String s1 = usingScanner(properties);
System.out.println("Using Scanner: "+s1);
String m = usingMatcher(properties);
System.out.println("Using Matcher: "+m);
if(!s1.equals(m)) System.out.println("mismatch");
if(s1.equals(usingScannerNoStream(properties)))
System.out.println("Not a stream issue");
}
private static String usingScanner(String source) {
return new Scanner(source)
.findAll(pattern)
.map(MatchResult::group)
.collect(Collectors.joining(" + "));
}
private static String usingScannerNoStream(String source) {
Scanner s = new Scanner(source);
StringJoiner sj = new StringJoiner(" + ");
for(;;) {
String match = s.findWithinHorizon(pattern, 0);
if(match == null) return sj.toString();
sj.add(match);
}
}
private static String usingMatcher(String source) {
return pattern.matcher(source).results()
.map(MatchResult::group)
.collect(Collectors.joining(" + "));
}
which prints:
Using Scanner: ABC1 + ABC3 + ABC4 + ABC4
Using Matcher: ABC1 + ABC3 + ABC4
mismatch
Not a stream issue
This example prepends as many X characters as needed to align the beginning of the false-positive match with the buffer’s size. The Scanner’s initial buffer size is 1024, though it may get enlarged when needed.
Since findAll ignores the scanner’s delimiters, just like findWithinHorizon, this code also shows that looping manually with findWithinHorizon exhibits the same behavior. In other words, this is not an issue of the Stream API.
Since Scanner will enlarge the buffer when needed, we can work around the issue with a match operation that forces the entire contents to be read into the buffer before the intended match operation is performed, e.g.
private static String usingScanner(String source) {
Scanner s = new Scanner(source);
s.useDelimiter("(?s).*").hasNext();
return s
.findAll(pattern)
.map(MatchResult::group)
.collect(Collectors.joining(" + "));
}
This specific hasNext() with a delimiter that consumes the entire string forces the complete buffering of the string without advancing the position. The subsequent findAll() operation ignores both the delimiter and the result of the hasNext() check, but no longer suffers from the issue because the buffer is completely filled.
Of course, this destroys the advantage of Scanner when parsing an actual stream.
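If the input really is a stream, another option that keeps memory use bounded is to read it line by line and match each line on its own. The sketch below assumes that a key always sits at the start of a line and that values never span several lines; the class and method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineKeyReader {
    // Same key pattern as in the question, but without MULTILINE:
    // each line is matched separately, so ^ can only match at the line start.
    private static final Pattern KEY = Pattern.compile("^[a-zA-Z0-9._]+=");

    /** Returns the key (including the trailing '=') of every line that has one. */
    public static List<String> readKeys(Reader source) {
        List<String> keys = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher matcher = KEY.matcher(line);
                if (matcher.find()) {
                    keys.add(matcher.group());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return keys;
    }

    public static void main(String[] args) {
        String props = "SectionOne.KeyOne=first value\n"
                + "SectionError.ErrorMessage=errorlevel=Warning\n";
        System.out.println(readKeys(new StringReader(props)));
        // prints [SectionOne.KeyOne=, SectionError.ErrorMessage=]
    }
}
```

Because every line is handed to the Matcher as a complete string, where the reader's internal buffer happens to end no longer influences what ^ matches.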

Split long lines and Indent and output as so

I have code to remove duplicate words from a string. Let's say I have:
This is serious serious work. I apply the code and get: This is serious work
This is the code:
return Arrays.stream(input.split(" ")).distinct().collect(Collectors.joining(" "));
Now I want to add a new constraint: if the string/line is longer than 78 characters, break and indent it where it makes sense, so that no line runs longer than 78 characters. Example:
This one is a very long line that runs off the right side because it is longer than 78 characters long
It should then be
This one is a very long line that runs off the right side because it is longer
than 78 characters long
I can't find a solution to this. It was brought to my attention that my question might be a duplicate, but I can't find my answer there. I need to be able to indent.
You could create a StringBuilder from the String and then insert a newline and tab at the last word break within the first 78 characters. You can find that word break by taking the substring of the first 78 characters and finding the index of its last space:
StringBuilder sb = new StringBuilder(Arrays.stream(input.split(" ")).distinct().collect(Collectors.joining(" ")));
if(sb.length() > 78) {
int lastWordBreak = sb.substring(0, 78).lastIndexOf(" ");
sb.insert(lastWordBreak , "\n\t");
}
return sb.toString();
Output:
This one is a very long line that runs off the right side because it longer
than 78 characters
Also, your Stream does not do what you want. It does remove duplicate words, but it removes all of them, not just consecutive ones. So for the String:
This is a great sentence. It is a great example.
It would remove the duplicate is, great and a, and return
This is a great sentence. It example.
To only remove consecutive duplicate words you can look at the following solution:
Removing consecutive duplicates words out of text using Regex and displaying the new text
Alternatively, you could write your own method that splits the text into words and compares each word to the one before it to remove consecutive duplicates.
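A hand-rolled version of that idea might look like the sketch below. The class and method names are only illustrative, and words are compared exactly, so "serious serious." with trailing punctuation would not count as a duplicate.

```java
import java.util.ArrayList;
import java.util.List;

public class ConsecutiveDuplicates {
    /** Drops a word only when it equals the word directly before it. */
    public static String removeConsecutiveDuplicates(String input) {
        List<String> kept = new ArrayList<>();
        for (String word : input.trim().split(" +")) {
            if (kept.isEmpty() || !kept.get(kept.size() - 1).equals(word)) {
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        System.out.println(removeConsecutiveDuplicates("This is serious serious work."));
        // prints This is serious work.
    }
}
```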
Instead of using
Collectors.joining(" ")
it is possible to write a custom collector that adds new lines and indentation at proper places.
Let's introduce a LineWrapper class, which contains indent and limit fields:
public class LineWrapper {
private final int limit;
private final String indent;
The default constructor sets the fields to reasonable default values.
Note how the indent starts with a new line character.
public LineWrapper() {
limit = 78;
indent = "\n ";
}
A custom constructor allows the client to specify limit and indent:
public LineWrapper(int limit, String indent) {
if (limit <= 0) {
throw new IllegalArgumentException("limit");
}
if (indent == null || !indent.matches("\\n *")) {
throw new IllegalArgumentException("indent");
}
this.limit = limit;
this.indent = indent;
}
Following is a regex used to split the input around one or more spaces. This makes sure that the split will not produce empty Strings:
private static final String SPACES = " +";
The apply method splits the input and collects the words into lines of the specified maximum length, indents the lines and removes duplicate consecutive words. Note how duplicates are not removed using the Stream.distinct method, since it also removes duplicates that are not consecutive.
public String apply(String input) {
return Arrays.stream(input.split(SPACES)).collect(toWrappedString());
}
The toWrappedString method returns a collector that accumulates the words in a new ArrayList, and uses the following methods:
addIfDistinct: to add the words to the ArrayList
combine: to merge two array lists
wrap: to split and indent the lines
Collector<String, ArrayList<String>, String> toWrappedString() {
return Collector.of(ArrayList::new,
this::addIfDistinct,
this::combine,
this::wrap);
}
The addIfDistinct adds the word to the accumulator ArrayList if it is different than the previous word.
void addIfDistinct(ArrayList<String> accumulator, String word) {
if (!accumulator.isEmpty()) {
String lastWord = accumulator.get(accumulator.size() - 1);
if (!lastWord.equals(word)) {
accumulator.add(word);
}
} else {
accumulator.add(word);
}
}
The combine method adds all words from the second ArrayList to the first one. It also makes sure that the first word of the second ArrayList does not duplicate the last word of the first ArrayList.
ArrayList<String> combine(ArrayList<String> words,
ArrayList<String> moreWords) {
List<String> other = moreWords;
if (!words.isEmpty() && !other.isEmpty()) {
String lastWord = words.get(words.size() - 1);
if (lastWord.equals(other.get(0))) {
other = other.subList(1, other.size());
}
}
words.addAll(other);
return words;
}
Finally, the wrap method appends all words to a StringBuilder, inserting the indent when the line length limit is reached:
String wrap(ArrayList<String> words) {
StringBuilder result = new StringBuilder();
if (!words.isEmpty()) {
String firstWord = words.get(0);
result.append(firstWord);
int lineLength = firstWord.length();
for (String word : words.subList(1, words.size())) {
//add 1 to the word length,
//to account for the space character
int len = word.length() + 1;
if (lineLength + len <= limit) {
result.append(' ');
result.append(word);
lineLength += len;
} else {
result.append(indent);
result.append(word);
//subtract 1 from the indent length,
//because the new line does not count
lineLength = indent.length() - 1 + word.length();
}
}
}
return result.toString();
}

Finding the strings in a TreeSet that start with a given prefix

I'm trying to find the strings in a TreeSet<String> that start with a given prefix. I found a previous question asking for the same thing — Searching for a record in a TreeSet on the fly — but the answer given there doesn't work for me, because it assumes that the strings don't include Character.MAX_VALUE, and mine can.
(The answer there is to use treeSet.subSet(prefix, prefix + Character.MAX_VALUE), which gives all strings between prefix (inclusive) and prefix + Character.MAX_VALUE (exclusive), which comes out to all strings that start with prefix except those that start with prefix + Character.MAX_VALUE. But in my case I need to find all strings that start with prefix, including those that start with prefix + Character.MAX_VALUE.)
How can I do this?
To start with, I suggest re-examining your requirements. Character.MAX_VALUE is U+FFFF, which is not a valid Unicode character and never will be; so I can't think of a good reason why you would need to support it.
But if there's a good reason for that requirement, then — you need to "increment" your prefix to compute the least string that's greater than all strings starting with your prefix. For example, given "city", you need "citz". You can do that as follows:
/**
* @param prefix
* @return The least string that's greater than all strings starting with
* prefix, if one exists. Otherwise, returns Optional.empty().
* (Specifically, returns Optional.empty() if the prefix is the
* empty string, or is just a sequence of Character.MAX_VALUE-s.)
*/
private static Optional<String> incrementPrefix(final String prefix) {
final StringBuilder sb = new StringBuilder(prefix);
// remove any trailing occurrences of Character.MAX_VALUE:
while (sb.length() > 0 && sb.charAt(sb.length() - 1) == Character.MAX_VALUE) {
sb.setLength(sb.length() - 1);
}
// if the prefix is empty, then there's no upper bound:
if (sb.length() == 0) {
return Optional.empty();
}
// otherwise, increment the last character and return the result:
sb.setCharAt(sb.length() - 1, (char) (sb.charAt(sb.length() - 1) + 1));
return Optional.of(sb.toString());
}
To use it, you need to use subSet when the above method returns a string, and tailSet when it returns nothing:
/**
* @param allElements - a SortedSet of strings. This set must use the
* natural string ordering; otherwise this method
* may not behave as intended.
* @param prefix
* @return The subset of allElements containing the strings that start
* with prefix.
*/
private static SortedSet<String> getElementsWithPrefix(
final SortedSet<String> allElements, final String prefix) {
final Optional<String> endpoint = incrementPrefix(prefix);
if (endpoint.isPresent()) {
return allElements.subSet(prefix, endpoint.get());
} else {
return allElements.tailSet(prefix);
}
}
See it in action at: http://ideone.com/YvO4b3.
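In case the link goes stale, here is a compact, self-contained run of the same idea. It is a condensed sketch rather than the exact code above: withPrefix is an illustrative name, and the set is assumed to use natural string ordering.

```java
import java.util.SortedSet;
import java.util.TreeSet;

public class PrefixSearchDemo {
    /** Returns a view of all elements that start with prefix. */
    public static SortedSet<String> withPrefix(SortedSet<String> set, String prefix) {
        // Skip trailing Character.MAX_VALUE chars; they cannot be incremented.
        int end = prefix.length();
        while (end > 0 && prefix.charAt(end - 1) == Character.MAX_VALUE) {
            end--;
        }
        if (end == 0) {
            return set.tailSet(prefix); // no upper bound exists
        }
        // Increment the last remaining char to get the exclusive upper bound.
        String upper = prefix.substring(0, end - 1) + (char) (prefix.charAt(end - 1) + 1);
        return set.subSet(prefix, upper);
    }

    public static void main(String[] args) {
        SortedSet<String> cities = new TreeSet<>();
        cities.add("cit");
        cities.add("city");
        cities.add("city hall");
        cities.add("city" + Character.MAX_VALUE + "x");
        cities.add("citz");
        System.out.println(withPrefix(cities, "city").size()); // prints 3
    }
}
```

The three matches are "city", "city hall" and the entry containing Character.MAX_VALUE; "cit" and "citz" fall outside the range.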
If anybody is looking for a shorter version of ruakh's answer:
The first element is set.ceiling(prefix). For the last one, increment the prefix and use set.lower(next_prefix), so that an element equal to the incremented prefix is not included. Note that, unlike ruakh's version, this one does not handle an empty prefix or a prefix ending in Character.MAX_VALUE.
public NavigableSet<String> subSetWithPrefix(NavigableSet<String> set, String prefix) {
String first = set.ceiling(prefix);
char[] chars = prefix.toCharArray();
if(chars.length>0)
chars[chars.length-1] = (char) (chars[chars.length-1]+1);
String last = set.lower(new String(chars));
if(first==null || last==null || last.compareTo(first)<0)
return new TreeSet<>();
return set.subSet(first, true, last, true);
}

How to compare two .csv files?

/**
* 5 points
*
* Return the price of the stock at the given date as a double. Each line
* of the file contains 3 comma separated
* values "date,price,volume" in the format "2016-03-23,106.129997,25703500"
* where the date is YYYY-MM-DD, the price
* is given in USD and the volume is the number of shares traded throughout
* the day.
*
* Note: You don't have to interpret dates for this assignment and you can use
* the String's .equals method to
* compare dates whenever date comparisons are needed.
*
* @param stockFileName The filename containing the prices for a stock for each
* day in 2016
* @param date The date to lookup given in YYYY-MM-DD format
* @return The price of the stock represented in stockFileName on the given date
*/
public static double getPrice(String stockFileName, String date) {
try {
BufferedReader br = new BufferedReader(new FileReader(stockFileName));
String line = "";
String unparsedFile = "";
Double Price;
while ((line = br.readLine()) != null) {
String[] Ans = unparsedFile.split(",");
for (String item : Ans){
if(Ans[1].equals(date)){
double aDouble = Double.parseDouble(Ans[2]);
return aDouble;
}
}
}
br.close();
} catch (IOException ex) {
System.out.println("Error");
}
return Ans;
}
As far as I can tell, the code I have now compares only one column of the .csv file to the date parameter. How do I make the code look at each individual line, compare [1] of that line to the date parameter, and return [2] of that line?
I think that your code is close to being (functionally) correct. The mistake that you have made is that Java arrays are indexed from zero, not from one. So Ans[1] is actually giving you the second element of the array ... not the first one.
Solution: obvious ... assuming that you understand what I just wrote above!
Once you have fixed the bug(s), you should fix the style issues:
Always start the names of variables with a lower case letter.
Use 4 spaces as your indentation level
One space after if.
One space between ) and {
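For reference, here is one way the method could look once the indexing and style points are applied. It is a sketch, not the assignment's official solution: it also splits the line that was just read (the original splits an empty string), and returning Double.NaN when the date is missing or the file cannot be read is an assumption, since the assignment text does not say what to return in that case. The priceFromLines helper is an illustrative name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class StockPrices {
    /** Scans "date,price,volume" lines and returns the price for the given date. */
    public static double priceFromLines(List<String> lines, String date) {
        for (String line : lines) {
            String[] fields = line.split(",");
            // fields[0] is the date, fields[1] the price, fields[2] the volume
            if (fields.length >= 2 && fields[0].equals(date)) {
                return Double.parseDouble(fields[1]);
            }
        }
        return Double.NaN; // assumption: NaN signals "date not found"
    }

    public static double getPrice(String stockFileName, String date) {
        try {
            return priceFromLines(Files.readAllLines(Paths.get(stockFileName)), date);
        } catch (IOException ex) {
            System.out.println("Error");
            return Double.NaN; // assumption: same signal when the file cannot be read
        }
    }
}
```

Keeping the line scanning in its own method makes it easy to try out with a few hard-coded lines before pointing it at a real file.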
