I'm having trouble with a program I'm working on to create shingle pairs from each sentence in a text file. Right now my code reads in a .txt file in Java and outputs each sentence in order. I want to store each sentence separately then take each sentence and create 2-character shingles of them, which would be stored in an array. An example of this would be taking the sentence “The quick brown fox” and turning it into {th, he, e , q, qu, ui, ic, ck, k , b, br, ro, ow, wn, n , f, fo, ox} so that all of the spaces in between the words would be accounted for. My goal is to simply take each sentence and create an array for each of them that holds the shingle pairs like in the example above. My problem is that I'm not sure how to go about this. I can’t seem to figure out how to take the sentences and store them separately, and I’m not sure how to create shingle pairs. I'm still very new to Java, and any help is very much appreciated. Here is my code so far:
//Takes .txt file as command-line input parameter
File file = new File(args[0]);
Scanner scanner = new Scanner(new FileInputStream(file));
int i=0;
//Reads in and outputs each line from the file
while (scanner.hasNextLine()) {
System.out.print(++i + " : " + scanner.nextLine() + "\n");
}
Just take pairs of characters from [0,1] to [last-1,last]
String[] result = new String[sentence.length() - 1];
for (int i = 0; i < sentence.length() - 1; i++) // - 1, not - 2, so the last pair is included
{
result[i] = sentence.substring(i, i + 2);
}
If you need to, you can remove the spaces from the pairs with trim() after this loop.
To split into sentences you can use pattern matching. Just define what a valid sentence is for your task. Here I assume a sentence always ends with a period, question mark or exclamation mark, and the next sentence starts after one or more whitespace characters.
final Pattern sentencePattern = Pattern.compile("[\\.\\?!]+\\s+");
sentencePattern.splitAsStream(text).forEach(
System.out::println //your code here
);
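Putting the two pieces together, a minimal sketch could look like the following. The class and method names (ShingleSketch, shingles, shinglesPerSentence) are made up for illustration, and lower-casing the sentence is an assumption taken from the example in the question, where "The" becomes "th".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ShingleSketch {
    // Same sentence boundary as above: ., ? or ! followed by whitespace.
    private static final Pattern SENTENCE_END = Pattern.compile("[.?!]+\\s+");

    // Returns every pair of adjacent characters, spaces included.
    // Lower-casing is an assumption based on the question's example.
    static List<String> shingles(String sentence) {
        String s = sentence.toLowerCase();
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            result.add(s.substring(i, i + 2));
        }
        return result;
    }

    // Splits the text into sentences and builds one shingle list per sentence.
    static List<List<String>> shinglesPerSentence(String text) {
        List<List<String>> all = new ArrayList<>();
        for (String sentence : SENTENCE_END.split(text.trim())) {
            if (!sentence.isEmpty()) {
                all.add(shingles(sentence));
            }
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(shinglesPerSentence("The quick brown fox. It ran!"));
    }
}
```

A List is used instead of an array so the size does not have to be computed up front; call toArray(new String[0]) on a result if you really need an array. Note that punctuation at the very end of the text is not followed by whitespace, so it stays part of the last sentence.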
This is for AOC day 2. The input is something along the lines of
"6-7 z: dqzzzjbzz
13-16 j: jjjvjmjjkjjjjjjj
5-6 m: mmbmmlvmbmmgmmf
2-4 k: pkkl
16-17 k: kkkkkkkkkkkkkkkqf
10-16 s: mqpscpsszscsssrs
..."
It's formatted like 'min-max letter: password' and separated by line. I'm supposed to find how many passwords meet the minimum and maximum requirements. I put all that prompt into a string variable and used Pattern.quote("\n") to separate the lines into a string array. This worked fine. Then, I replaced all the characters except for the numbers, as well as the '-', by making a pattern Pattern.compile("[^0-9]|-"); and running that for every index in the array, using .trim() to cut off the whitespace at the start and end of each string. This is all working fine, I'm getting the desired output like 6 7 and 13 16.
However, now I want to try and split this string into two. This is my code:
HashMap<Integer,Integer> numbers = new HashMap<Integer,Integer>();
for(int i = 0; i < inputArray.length; i++){
String [] xArray = x[i].split(Pattern.quote(" "));
int z = Integer.valueOf(xArray[0]);
int y = Integer.valueOf(xArray[1]);
System.out.println(z);
System.out.println(y);
numbers.put(z, y);
}
System.out.println(numbers);
So, first I make a HashMap which will store <min, max> values. Then, the for loop (which runs 1000 times) splits every entry such as 6 7 and 13 16 into two at the " ". The System.out.println(z); and System.out.println(y); are working as intended.
6
7
13
16
...
This output goes on to give me 2000 integers, each on its own line. That's exactly what I want. However, the System.out.println(numbers); is outputting:
{1=3, 2=10, 3=4, 4=7, 5=6, 6=9, 7=12, 8=11, 9=10, 10=18, 11=16, 12=13, 13=18, 14=16, 15=18, 16=18, 17=18, 18=19, 19=20}
I have no idea where to even start with debugging this. I made a test file with an array that is formatted like "even, odd" integers all the way up to 100. Using this exact same code (I did change the variable names), I'm getting a better output. It's not exactly desired since it starts at 350=351 and then goes to like 11=15 and continues in a non-chronological order but at least it contains all the 100 keys and values.
Also, completely unrelated question but is my formatting of the for loop fine? The extra space at the beginning and the end of the code?
Edit: I want my expected output to be something like {6=7, 13=16, 5=6, 2=4, 16=17...}. Basically, the hashmap would have the minimum and maximum as the key and value and it'd be in chronological order.
The problem with your code is that you're trying to put in a nail with a saw. A hashmap is not the right tool to achieve what you want, since
Keys are unique. If you try to input the same key multiple times, the first input will be overwritten
The order of items in a HashMap is undefined.
A hashmap expresses a key-value-relationship, which does not exist in this context
A better data structure to save your passwords' ranges would probably just be an ArrayList<IntegerPair>, where you would have to define IntegerPair yourself, since the Java standard library doesn't ship a general-purpose pair type.
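As a rough sketch of that suggestion: IntegerPair below is a hypothetical class written for this example, not something from the JDK, and the input is assumed to already be in the cleaned-up "6 7" form described in the question.

```java
import java.util.ArrayList;
import java.util.List;

class RangeListSketch {
    // Hypothetical pair type; it is not part of the JDK.
    static final class IntegerPair {
        final int min;
        final int max;

        IntegerPair(int min, int max) {
            this.min = min;
            this.max = max;
        }

        @Override
        public String toString() {
            return min + "=" + max;
        }
    }

    // Parses strings such as "6 7" into pairs, keeping input order and duplicates.
    static List<IntegerPair> parse(String[] lines) {
        List<IntegerPair> pairs = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            pairs.add(new IntegerPair(Integer.parseInt(parts[0]), Integer.parseInt(parts[1])));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // prints [6=7, 13=16, 5=6, 6=9]
        System.out.println(parse(new String[] {"6 7", "13 16", "5 6", "6 9"}));
    }
}
```

Unlike the HashMap, the list keeps both 6=7 and 6=9, and it keeps them in the order they were read.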
I think you are complicating the task unnecessarily. I would proceed as follows:
split the input using the line separator
for each line remove : and split using the spaces to get an array with length 3
build from the array in step two
3.1. the min/max char count from array[0]
3.2 character classes for the letter and its negation
3.3 remove from the password all letters that do not correspond to the given one and check if the length of the password is in range.
Something like:
public static void main(String[] args){
String input = "6-7 z: dqzzzjbzz\n" +
"13-16 j: jjjvjmjjkjjjjjjj\n" +
"5-6 m: mmbmmlvmbmmgmmf\n" +
"2-4 k: pkkl\n" +
"16-17 k: kkkkkkkkkkkkkkkqf\n" +
"10-16 s: mqpscpsszscsssrs\n";
int count = 0;
for(String line : input.split("\n")){
String[] temp = line.replace(":", "").split(" "); //[6-7, z, dqzzzjbzz]
String minMax = "{" + (temp[0].replace('-', ',')) + "}"; //{6,7}
String letter = "[" + temp[1] + "]"; //[z]
String letterNegate = "[^" + temp[1] + "]"; //[^z]
if(temp[2].replaceAll(letterNegate, "").matches(letter + minMax)){
count++;
}
}
System.out.println(count + " passwords are valid");
}
Scanner scan = new Scanner(System.in);
String s = scan.nextLine();
Queue q=new LinkedList();
for(int i=0;i<s.length();i++){
int x=(int)s.charAt(i);
if(x<65 || (x>90 && x<97) || x>122) {
q.add(s.charAt(i));
}
}
System.out.println(q.peek());
String redex="";
while(!q.isEmpty()) {
redex+=q.remove();
}
String[] x=s.split(redex,-1);
for(String y:x) {
if(y!=null)
System.out.println(y);
}
scan.close();
I am trying to print the string "my name is NLP and I, so, works:fine;"yes"." without tokens such as {[]}+-_)*&%$ but it just prints out all the String as it is, and I don't understand the problem?
This is 3 answers in one:
For your initial problem
For a solution without regex
For a correct use of Scanner (this is up to you).
First
When you build a regex from arbitrary characters that you happen to have at hand, you should quote it:
String[] x=s.split(Pattern.quote(redex),-1);
That would be the usual problem, but the second problem is that you are building a regex character class while omitting the [] that delimit it, so it cannot work as is:
String[] x=s.split("[" + Pattern.quote(redex) + "]",-1);
This one may work, but may fail if Pattern.quote doesn't quote - and a - ends up between two characters, forming a range such as $-!.
That would mean: any character in the range from $ to !. It may fail if the range is invalid, and my example may well be invalid ($ may come after !).
Finally, you may use:
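// Assumes q is declared as Queue<Character>; Pattern.quote expects a String.
String redex = q.stream()
.map(c -> Pattern.quote(String.valueOf(c)))
.collect(Collectors.joining("|"));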
This regexp should match the unwanted character.
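For illustration, here is a self-contained sketch of that last approach. The class and method names are invented for this example, a Set replaces the Queue so each delimiter is quoted only once, and "unwanted" is assumed to mean anything that is neither a letter nor a digit.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class DelimiterRegexSketch {
    // Collects every character that is neither a letter nor a digit and joins
    // the quoted characters with | so that each one is matched literally.
    // If s contains no such character the result is "", which split() would
    // treat as "split between every character", so check for that case.
    static String delimiterRegex(String s) {
        Set<Character> delimiters = new LinkedHashSet<>();
        for (char c : s.toCharArray()) {
            if (!Character.isLetterOrDigit(c)) {
                delimiters.add(c);
            }
        }
        return delimiters.stream()
                .map(c -> Pattern.quote(String.valueOf(c)))
                .collect(Collectors.joining("|"));
    }

    public static void main(String[] args) {
        String s = "works:fine;\"yes\".";
        for (String token : s.split(delimiterRegex(s))) {
            if (!token.isEmpty()) { // adjacent delimiters produce empty tokens
                System.out.println(token);
            }
        }
    }
}
```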
Second:
For the rest, the other answer points out another problem: you are not using the Character.isXXX methods to check for valid characters.
Firstly, be wary that some of these methods take code points rather than char. For example, isAlphabetic takes a code point. A code point is simply the numeric value of a Unicode character, and some Unicode characters need two char values (a surrogate pair) to be represented.
Secondly, I think your problem lies in the fact you are not using the right tool to split your words.
In pseudo code, this should be:
List<String> words = new ArrayList<>();
int offset = 0;
for (int i = 0, n = line.length(); i < n; ++i) {
// if the character fail to match, then we switched from word to non word
if (!Character.isLetterOrDigit(line.charAt(i))) {
if (offset != i) {
words.add(line.substring(offset, i));
}
offset = i + 1; // next char
}
}
if (offset != line.length()) {
words.add(line.substring(offset));
}
This would:
- Find transition from word to non word and change offset (where we started)
- Add word to the list
- Add the last token as ending word.
Last
Alternatively, you may also play with Scanner class since it allows you to input a custom delimiter for its hasNext(): https://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html
I quote the class javadoc:
The scanner can also use delimiters other than whitespace. This
example reads several items in from a string:
String input = "1 fish 2 fish red fish blue fish";
Scanner s = new Scanner(input).useDelimiter("\\s*fish\\s*");
System.out.println(s.nextInt());
System.out.println(s.nextInt());
System.out.println(s.next());
System.out.println(s.next());
s.close();
As you guessed, you may pass on any delimiter and then use hasNext() and next() to get only valid words.
For example, using [^a-zA-Z0-9] would split on each non alpha/digit transition.
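A small sketch of that idea, with an invented class name and a + added to the delimiter so that a run of punctuation counts as a single separator (that quantifier is my assumption, not part of the javadoc example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerDelimiterSketch {
    // Returns the runs of letters and digits found in the line.
    static List<String> words(String line) {
        List<String> words = new ArrayList<>();
        // One or more non-alphanumeric characters act as a single delimiter.
        try (Scanner scanner = new Scanner(line).useDelimiter("[^a-zA-Z0-9]+")) {
            while (scanner.hasNext()) {
                words.add(scanner.next());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [my, name, is, NLP, and, I, so, works, fine, yes]
        System.out.println(words("my name is NLP and I, so, works:fine;\"yes\"."));
    }
}
```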
As noted in the comment, the condition x<65 will catch all sorts of special characters you're not interested in. Using Character's built-in methods will help you write this condition in a clearer, bug-free way:
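char x = s.charAt(i);
// q collects the delimiters, so keep only characters that are neither letters nor whitespace
if (!Character.isLetter(x) && !Character.isWhitespace(x)) {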
q.add(x);
}
I am working on a project which involves making a "worker" in Java which receives instructions from an input string. The input string should normally contain four numbers first, followed by a number with a letter right after it, the letter being N, S, W, or E. The first two numbers determine the size of the area this worker can walk in. The next two numbers are the starting point for the worker. The number with the letter determines what direction the worker walks in and how many paces. The problem I am having is that I don't understand how to get the first four digits out of the string and separate them into what they each should be.
import java.util.*;
import java.io.*;
public class Worker {
private int height;
public void readInstructions(String inputFileName, String outputFileName) throws InvalidWorkerInstructionException{
try{
Scanner in = new Scanner(inputFileName);
PrintWriter wrt;
wrt = new PrintWriter(outputFileName);
if(inputFileName.startsWith("i")){
System.out.println("Input file not found.");
//wrt.println("Input file not found.");
}
while(in.hasNext()){
String s = in.nextLine();
if(Integer.parseInt(s)<= 9){
}
}
}catch(InvalidWorkerInstructionException e){
}catch(FileNotFoundException e){
}
While I would love to ask for a straight up answer, this is a project so I would prefer nobody gives me a fixed code. Please if you can give me advice for what I am doing wrong and where I should be going to solve the problem.
Ok, I realized one other thing because I tried the advice given. I am receiving a string that gives me the name of an input txt file. Inside that file are the numbers and directions. How can I access this text file? Also, how do I determine whether it can be opened?
Okay, so you already know how to read the file using a Scanner. All you need to do next is split the String and extract the first four inputs out of it.
Here is the code snippet:
String s = in.nextLine();
int i = 0, digits[] = new int[4];
for(String inp : s.split(" ")) {
if(i == 4) break;
digits[i++] = Integer.parseInt(inp);
}
Note: I'm assuming that the inputs in your file are space separated. If not, you can replace the space in split() with the correct delimiter.
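Regarding the follow-up about opening the file: new Scanner(inputFileName) scans the characters of the file name itself, not the file. A possible sketch, assuming the four numbers are separated by whitespace, is shown below; readHeader, the class name and the fallback file name are invented for illustration.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

class InstructionHeaderSketch {
    // Reads the first four whitespace-separated integers from the named file.
    // Throws FileNotFoundException if the file is missing or cannot be opened.
    static int[] readHeader(String inputFileName) throws FileNotFoundException {
        // Wrapping the name in a File makes the Scanner read the file's contents.
        try (Scanner in = new Scanner(new File(inputFileName))) {
            int[] header = new int[4];
            for (int i = 0; i < header.length; i++) {
                header[i] = in.nextInt();
            }
            return header;
        }
    }

    public static void main(String[] args) {
        // "input.txt" is a placeholder used when no argument is given.
        String name = args.length > 0 ? args[0] : "input.txt";
        try {
            int[] h = readHeader(name);
            System.out.println("area " + h[0] + " x " + h[1] + ", start (" + h[2] + ", " + h[3] + ")");
        } catch (FileNotFoundException e) {
            System.out.println("Input file not found.");
        }
    }
}
```

Catching FileNotFoundException is what tells you the file could not be opened. nextInt() throws an unchecked exception if a token is not a number or the file ends early, which is probably where your own InvalidWorkerInstructionException would come in.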
If the input format is fixed, then you can use the substring method to get the different parts of the string. Refer to the documentation for more detail:
https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#substring(int,%20int)
Example code:
String s = "12345E";
s.substring(0, 2); /* 12 */
s.substring(2, 4); /* 34 */
s.substring(4, 5); /* 5 */
s.substring(5, 6); /* E */
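Fixed offsets only work while every part has a known width. For the trailing instruction (a number of paces followed by one direction letter), a sketch that measures from the end of the token instead could look like this. The names are invented, and it assumes the token is at least two characters long and ends in exactly one letter.

```java
class InstructionTokenSketch {
    // Number of paces: everything except the final character.
    static int paces(String token) {
        return Integer.parseInt(token.substring(0, token.length() - 1));
    }

    // Direction letter: the final character (N, S, W or E).
    static char direction(String token) {
        return token.charAt(token.length() - 1);
    }

    public static void main(String[] args) {
        String token = "12E";
        // prints 12 paces towards E
        System.out.println(paces(token) + " paces towards " + direction(token));
    }
}
```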
You can use the method .getChars() to accomplish this. Here is what the javadoc says about this method:
public void getChars(int srcBegin,
int srcEnd,
char[] dst,
int dstBegin)
Copies characters from this string into the destination character array.
The first character to be copied is at index srcBegin; the last character to be copied is at index srcEnd-1 (thus the total number of characters to be copied is srcEnd-srcBegin). The characters are copied into the subarray of dst starting at index dstBegin and ending at index:
dstbegin + (srcEnd-srcBegin) - 1
Parameters:
srcBegin - index of the first character in the string to copy.
srcEnd - index after the last character in the string to copy.
dst - the destination array.
dstBegin - the start offset in the destination array.
Throws:
IndexOutOfBoundsException - If any of the following is true:
srcBegin is negative.
srcBegin is greater than srcEnd
srcEnd is greater than the length of this string
dstBegin is negative
dstBegin+(srcEnd-srcBegin) is larger than dst.lengt....
Here is what you could do...
You read in the string - grab its length (You want to make sure that it has all the chars you need)
Read in to a separate array discarding any extraneous chars that are not needed for this functionality..
You can make your own pseudo code to work out the problem once the string is split into an array. Very easy to work with since you know what each location of the array is supposed to do.
This is not a hard problem to solve at all..
Good luck on your project.
So I am writing a Scrabble word suggestion program that I decided to do because I wanted to learn sets (don't worry, I at least got that part) and referencing info/data not created within the program. I'm pretty new to Java (and programming in general), but I was wondering how to pull words from a word list .FIC file in order to check them against words generated from the letters inputted.
To clarify, I have written a program which takes a series of letters and returns a set of every possible word created from those letters. for example:
input:
abc
would give a set containing the "words":
a, ab, ac, abc, acb, b, ba, bc, bac, bca, c, ca, cb, cab, cba
What I am asking, really, is how to check those to find the ones contained in the .FIC file.
The file is the "official crosswords" file from the Moby project word list and I am still (very) shaky on parsing and other file-handling methods. I am continuing to research, so I don't have any prototype code for that.
Sorry if the question isn't entirely clear.
edit: here is the method that makes the "words" to make it easier to understand the idea. The part I don't understand is specifically how to pull a word(as a string) from the .FIC file.
private static Set<String> Words(String s)
{
Set<String> tempwords = new TreeSet<String>();
if (s.length() == 1)
{ // base case, last letter
tempwords.add(s);
// System.out.println(s); uncomment when debugging
}
else
{
//set up to add each letter in s
for (int i = 0; i < s.length(); i++)
{ //cut the i letter out of the string
String remaining = s.substring(0, i) + s.substring(i+1);
//recursion to add all combinations of letters onto the current letter/"word"
for (String permutation : Words(remaining))
{
// System.out.println(s.substring(i, i+1) + permutation); uncomment when debugging
//add the full length words
tempwords.add(s.substring(i, i+1) + permutation);
// System.out.println(permutation); uncomment when debugging
//add the not-full-length words
tempwords.add(permutation);
}
}
}
// System.out.println(tempwords); uncomment when debugging
return tempwords;
}
I don't know if it is the best solution, but I figured it out (hobbs, the line thing helped a lot, thank you). I found that this works:
public static void main(String[] args) throws FileNotFoundException
{
    while(true)
    {
        words.clear();
        String letters = enterLetters();
        words.addAll(Words(letters));
        // Scanner.reset() only resets the delimiter, locale and radix; it does not
        // rewind the file, so a fresh Scanner is opened for every pass over the list.
        try (Scanner s = new Scanner(new FileReader("C:/Users/Sean/workspace/Imbored/bin/113809of.fic")))
        {
            while(s.hasNextLine()) {
                String line = s.nextLine();
                String finalword = checkWords(line, words);
                if (finalword != null) finalwordset.add(finalword);
            }
        }
        System.out.println(finalwordset);
        System.out.println();
        System.out.println("_________________________________________________________________________");
    }
}
A few things:
The checkWords method checks if the current word from the file is in the generated list of "words"
The enterLetters method takes user-inputted letters and returns them in a string
The Words method returns a set of strings of all of the possible combinations of the characters in the given string, with each character used up to as many times as it appears in the string and no repeated "words" in the returned set.
finalwordset and words are ArrayLists of strings defined as static fields (I would put them in the main method, but I'm lazy and it doesn't matter for this case)
I am very sure there is a better/more efficient way to do this, but this at least works.
Finally: I decided to answer rather than delete because I didn't see this answered anywhere else, so if it is feel free to delete the question or link to the other answer or whatever, at this point it is to help other people.
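One possibly more efficient variant, sketched under the assumption that the .FIC file holds one word per line: load the word list into a HashSet once, then keep only the generated "words" that the set contains. The names below are invented for illustration; in real use dictionaryLines could come from something like Files.readAllLines(Paths.get(path)), provided the file's encoding matches.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class WordFilterSketch {
    // Keeps only the candidates that also appear in the dictionary.
    // Dictionary lines are trimmed and lower-cased before comparison.
    static Set<String> realWords(Collection<String> candidates, Collection<String> dictionaryLines) {
        Set<String> dictionary = new HashSet<>();
        for (String line : dictionaryLines) {
            dictionary.add(line.trim().toLowerCase());
        }
        Set<String> result = new TreeSet<>(candidates); // sorted, no duplicates
        result.retainAll(dictionary);                   // set intersection
        return result;
    }

    public static void main(String[] args) {
        // A tiny in-memory stand-in for the real word list file.
        Collection<String> dictionary = Arrays.asList("cab", "ab", "dog");
        Collection<String> candidates = Arrays.asList("a", "ab", "ac", "abc", "cab", "cba");
        // prints [ab, cab]
        System.out.println(realWords(candidates, dictionary));
    }
}
```

Because HashSet lookups take roughly constant time, the word list only has to be read once, no matter how many sets of letters are entered afterwards.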
I'm wondering how I could grab every nth line from a String, say every 100th, with the lines in the String being separated by '\n'.
This is probably a simple thing to do but I really can't think of how to do it, so does anybody have a solution?
Thanks much,
Alex.
UPDATE:
Sorry I didn't explain my question very well.
Basically, imagine there's a 350 line file. I want to grab the start and end of each 100 line chunk. Pretending each line is 10 characters long, I'd finish with 2 separate arrays (containing start and end indexes) like this:
(Lines 0-100) 0-1000
(Lines 100-200) 1000-2000
(Lines 200-300) 2000-3000
(Lines 300-350) 3000-3500
So then if I wanted to mess around with say the second set of 100 lines (100-200) I have the regions for them.
You can split the string into an array using split() and then just get the indexes you want, like so:
String[] strings = myString.split("\n");
int nth = 100;
for(int i = nth; i < strings.length; i += nth) {
System.out.println(strings[i]);
}
String newLine = System.getProperty("line.separator");
String lines[] = text.split(newLine);
Where text is string with your whole text.
Now to get nth line, do e.g.:
System.out.println(lines[nth - 1]); // Minus one, because arrays in Java are zero-indexed
One approach is to create a StringReader from the string, wrap it in a BufferedReader and use that to read lines. Alternatively, you could just split on \n to get the lines, of course...
String[] allLines = text.split("\n");
List<String> selectedLines = new ArrayList<String>();
for (int i = 0; i < allLines.length; i += 100)
{
selectedLines.add(allLines[i]);
}
This is simpler code than using a BufferedReader, but it does mean having the complete split string in memory (as well as the original, at least temporarily, of course). It's also less flexible in terms of being adapted to reading lines from other sources such as a file. But if it's all you need, it's pretty straightforward :)
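For comparison, the BufferedReader variant mentioned at the start might look roughly like this; the class and method names are invented for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class EveryNthLineSketch {
    // Returns lines 0, n, 2n, ... of the text, reading one line at a time.
    static List<String> everyNth(String text, int n) throws IOException {
        List<String> selected = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            int lineNumber = 0;
            while ((line = reader.readLine()) != null) {
                if (lineNumber % n == 0) {
                    selected.add(line);
                }
                lineNumber++;
            }
        }
        return selected;
    }

    public static void main(String[] args) throws IOException {
        // prints [a, c, e]
        System.out.println(everyNth("a\nb\nc\nd\ne", 2));
    }
}
```

Swapping the StringReader for a FileReader is all it takes to run the same loop over a file.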
EDIT: If the start indexes are needed too, it becomes slightly more complicated... but not too bad. You probably want to encapsulate the "start and line" in a single class, but for the sake of brevity:
String[] allLines = text.split("\n");
List<String> selectedLines = new ArrayList<String>();
List<Integer> selectedIndexes = new ArrayList<Integer>();
int index = 0;
for (int i = 0; i < allLines.length; i++)
{
if (i % 100 == 0)
{
selectedLines.add(allLines[i]);
selectedIndexes.add(index);
}
index += allLines[i].length() + 1; // Add 1 for the trailing "\n"
}
Of course given the start index and the line, you can get the end index just by adding the line length :)
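If what you are after is the start and end offset of each 100-line chunk, as in the update to the question, one way to sketch it is to walk the string with indexOf instead of splitting it. The names are invented, and the end offsets here are exclusive and include the "\n" that terminates the chunk's last line, which matches the 0-1000, 1000-2000 layout in the question.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkRangesSketch {
    // Returns {start, end} character offsets (end exclusive) for each block
    // of linesPerChunk lines; the final block may contain fewer lines.
    static List<int[]> chunkRanges(String text, int linesPerChunk) {
        List<int[]> ranges = new ArrayList<>();
        int chunkStart = 0;   // offset where the current chunk begins
        int linesInChunk = 0; // lines consumed so far in the current chunk
        int pos = 0;
        while (pos < text.length()) {
            int newline = text.indexOf('\n', pos);
            // Move just past this line (or to the end if it is unterminated).
            pos = (newline == -1) ? text.length() : newline + 1;
            linesInChunk++;
            if (linesInChunk == linesPerChunk) {
                ranges.add(new int[] {chunkStart, pos});
                chunkStart = pos;
                linesInChunk = 0;
            }
        }
        if (linesInChunk > 0) {
            ranges.add(new int[] {chunkStart, pos}); // leftover partial chunk
        }
        return ranges;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 7; i++) {
            sb.append("abc\n"); // 7 lines of 4 characters each
        }
        for (int[] r : chunkRanges(sb.toString(), 3)) {
            System.out.println(r[0] + "-" + r[1]); // prints 0-12, 12-24, 24-28
        }
    }
}
```

text.substring(start, end) then gives you the chunk itself without keeping a split copy of every line in memory.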