Find position of a term in ClueWeb09 corpora - java

I want to read a term from the ClueWeb09 corpus and find its position, to check whether that term is an entity or not, based on the dataset that was created here. They claim they calculate the position as follows:
"The zero (0) location used for calculating the annotation offsets is the beginning of the HTTP headers. This is the first byte after the WARC document header."
I calculate the length of each term by calling term.getBytes().length and sum all the lengths to find the position of the entity. Unfortunately, my position is about 400 bytes less than the reported position. I calculate the position by reading each WARC file with the following code.
ArrayList<Integer> pos = new ArrayList<Integer>();
int position = -1;
try {
    FileReader fileReader = new FileReader("05");
    BufferedReader bufferedReader = new BufferedReader(fileReader);
    while (true) {
        String line = bufferedReader.readLine();
        if (line == null) {
            break;
        }
        int index = line.indexOf(word);
        if (index == -1) {
            position = position + line.getBytes().length;
        } else {
            int poss = position + index;
            pos.add(poss);
            position = position + line.getBytes().length;
        }
    }
    bufferedReader.close();
} catch (Exception e) {
    e.printStackTrace();
}
Could you please tell me what the problem(s) might be?

To find the position of the ClueWeb dataset terms in bytes, I used this code, which implements the annotation version of this dataset here. The lemur.cw.ann.DetectEncoding class returns the annotations that do not match the ClueWeb dataset based on their positions. I changed that class and the lemur.cw.ann.Matcher class to calculate the position of the ClueWeb terms based on byte offset.
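For what it's worth, two things in the question's code will make byte offsets drift: BufferedReader.readLine() strips the line terminator, so every line read silently drops one byte (\n) or two (\r\n) from the running total, and the no-argument String.getBytes() uses the platform default charset, which may not match the record's actual encoding. A minimal sketch that counts raw bytes instead; UTF-8 is an assumption here, since ClueWeb09 records can use other encodings:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ByteOffsetScan {

    // Finds the byte offsets at which `word` starts, counting every byte in
    // the file -- including the \n or \r\n terminators that readLine() strips.
    static List<Integer> findOffsets(String file, String word) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(file));      // whole record in memory
        byte[] needle = word.getBytes(StandardCharsets.UTF_8);  // assumed charset
        List<Integer> offsets = new ArrayList<>();
        outer:
        for (int i = 0; i + needle.length <= data.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) continue outer;
            }
            offsets.add(i);                                     // 0-based byte offset
        }
        return offsets;
    }
}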

Related

Java: Most efficient way to loop through CSV and sum values of one column for each unique value in another Column

I have a CSV file with 500,000 rows of data and 22 columns. This data represents all commercial flights in the USA for one year. I am being tasked with finding the tail number of the plane that flew the most miles in the data set. Column 5 contains the airplane's tail number for each flight. Column 22 contains the total distance traveled.
Please see my extractQ3 method below. First, I created a HashMap for the whole CSV using the createHashMap() method. Then, I ran a for loop to identify every unique tail number in the dataset and stored them in an array called tailNumbers. Then for each unique tail number, I looped through the entire HashMap to calculate the total miles of distance for that tail number.
The code runs fine on smaller datasets, but once the size increased to 500,000 rows the code becomes horribly inefficient and takes an eternity to run. Can anyone provide me with a faster way to do this?
public class FlightData {

    HashMap<String, String[]> dataMap;

    public static void main(String[] args) {
        FlightData map1 = new FlightData();
        map1.dataMap = map1.createHashMap();
        String answer = map1.extractQ3(map1);
    }

    public String extractQ3(FlightData map1) {
        ArrayList<String> tailNumbers = new ArrayList<String>();
        ArrayList<Integer> tailMiles = new ArrayList<Integer>();
        // Filling the Array with all tail numbers.
        // (Call contains() on the list itself; Arrays.asList(tailNumbers) would
        // wrap the whole list inside another list and never match, so duplicates
        // would slip in.)
        for (String[] value : map1.dataMap.values()) {
            if (!tailNumbers.contains(value[4])) {
                tailNumbers.add(value[4]);
            }
        }
        for (int i = 0; i < tailNumbers.size(); i++) {
            String tempName = tailNumbers.get(i);
            int miles = 0;
            for (String[] value : map1.dataMap.values()) {
                if (value[4].contentEquals(tempName) && value[19].contentEquals("0")) {
                    miles = miles + Integer.parseInt(value[21]);
                }
            }
            tailMiles.add(miles);
        }
        Integer maxVal = Collections.max(tailMiles);
        Integer maxIdx = tailMiles.indexOf(maxVal);
        String maxPlane = tailNumbers.get(maxIdx);
        return maxPlane;
    }

    public HashMap<String, String[]> createHashMap() {
        File flightFile = new File("flights_small.csv");
        HashMap<String, String[]> flightsMap = new HashMap<String, String[]>();
        try {
            Scanner s = new Scanner(flightFile);
            while (s.hasNextLine()) {
                String info = s.nextLine();
                String[] piecesOfInfo = info.split(",");
                String flightKey = piecesOfInfo[4] + "_" + piecesOfInfo[2] + "_" + piecesOfInfo[11]; // Setting the Key
                String[] values = Arrays.copyOfRange(piecesOfInfo, 0, piecesOfInfo.length);
                flightsMap.put(flightKey, values);
            }
            s.close();
        } catch (FileNotFoundException e) {
            System.out.println("Cannot open: " + flightFile);
        }
        return flightsMap;
    }
}
The answer depends on what you mean by "most efficient", "horribly inefficient" and "takes an eternity". These are subjective terms. The answer may also depend on specific technical factors (speed vs. memory consumption; the number of unique flight keys compared to the number of overall records; etc.).
I would recommend applying some basic streamlining to your code, to start with. See if that gets you a better (acceptable) result. If you need more, then you can consider more advanced improvements.
Whatever you do, take some timings to understand the broad impacts of any changes you make.
Focus on going from "horrible" to "acceptable" - and then worry about more advanced tuning after that (if you still need it).
Consider using a BufferedReader instead of a Scanner. See here. Although the scanner may be just fine for your needs (i.e. if it's not a bottleneck).
Consider using logic within your scanner loop to capture tail numbers and accumulated mileage in one pass of the data. The following is deliberately basic, for clarity and simplicity:
// The string is a tail number.
// The integer holds the accumulated miles flown for that tail number:
Map<String, Integer> planeMileages = new HashMap<>();

if (planeMileages.containsKey(tailNumber)) {
    // add miles to existing total:
    int accumulatedMileage = planeMileages.get(tailNumber) + flightMileage;
    planeMileages.put(tailNumber, accumulatedMileage);
} else {
    // capture new tail number:
    planeMileages.put(tailNumber, flightMileage);
}
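As a side note, since Java 8 the same accumulate-or-insert step can be written in one line with Map.merge, reusing the planeMileages, tailNumber and flightMileage variables from the snippet above:

// Adds flightMileage to the stored total, or inserts it for a new tail number:
planeMileages.merge(tailNumber, flightMileage, Integer::sum);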
After that, once you have completed the scanner loop, you can iterate over your planeMileages to find the largest mileage:
String maxMilesTailNumber = null;
int maxMiles = 0;
for (Map.Entry<String, Integer> entry : planeMileages.entrySet()) {
    int planeMiles = entry.getValue();
    if (planeMiles > maxMiles) {
        maxMilesTailNumber = entry.getKey();
        maxMiles = planeMiles;
    }
}
WARNING - This approach is just for illustration. It will only capture one tail number. There could be multiple planes with the same maximum mileage. You would have to adjust your logic to capture multiple "winners".
The above approach removes the need for several of your existing data structures, and related processing.
If you still face problems, put in some timers to see which specific areas of your code are slowest - and then you will have more specific tuning opportunities you can focus on.
I suggest you use the Java 8 Stream API, so that you can take advantage of parallel streams.
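For illustration, here is one way that could look. This is only a sketch: it assumes the same file name and column layout as the question's code (tail number at index 4, the index-19 "0" filter, distance at index 21) and no header row.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FlightDataStreams {
    public static void main(String[] args) throws IOException {
        // Accumulate distance per tail number in a single (parallel) pass.
        try (Stream<String> lines = Files.lines(Paths.get("flights_small.csv"))) {
            Map<String, Integer> planeMileages = lines
                    .parallel()
                    .map(line -> line.split(","))
                    .filter(cols -> cols[19].equals("0"))   // same value[19] == "0" filter as the question
                    .collect(Collectors.groupingBy(
                            cols -> cols[4],                 // tail number
                            Collectors.summingInt(cols -> Integer.parseInt(cols[21]))));

            // Pick the entry with the largest accumulated mileage.
            planeMileages.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .ifPresent(e -> System.out.println(e.getKey() + ": " + e.getValue()));
        }
    }
}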

Creating and Parsing Through Very Large Arrays in Java

I have a CSV file of nearly 2 million rows with 3 columns (item, rating, user). I am able to transfer the data into a 2D String array or list. However, my issue arises when I am trying to parse through the arrays to create CSV files from them, because the application stops and I do not know how long I am expected to wait for the program to finish running.
Basically, my end goal is to be able to parse through large CSV file, create a matrix in which each distinct item represents a row and each distinct user represents a column with the rating being at the intersection of the user and item. With this matrix, I then create a cosine similarity matrix with the rows and columns represented by items with their cosine similarity being at the intersection of the two distinct items.
I already know how to create CSV files, but my issue falls within the large loop structures when creating other arrays for the purposes of comparison.
Is there a better way to be able to process and calculate large amounts of data so that my application doesn't freeze?
My current program does the following:
1. Take the large CSV file
2. Parse through the large CSV file
3. Create a 2D array resembling the original CSV file
4. Create a list of distinct items (each distinct item being represented by an index number)
5. Create a list of distinct users (each distinct user being represented by an index number)
6. Create a 2D array with row indexes representing items and column indexes representing users, so that array[row][column] = rating
7. Calculate the cosine similarity of two matrices
8. Create a 2D array with both row and column indexes representing items, so that array[row][column] = cosine similarity
I noticed that my program freezes when it reaches steps 4 and 5. If I remove steps 4 and 5, it will still freeze at step 6. I have attached that portion of my code:
FileInputStream stream = null;
Scanner scanner = null;
try {
    stream = new FileInputStream(fileName);
    scanner = new Scanner(stream, "UTF-8");
    while (scanner.hasNextLine()) {
        String line = scanner.nextLine();
        if (!line.equals("")) {
            String[] elems = line.split(",");
            if (itemList.isEmpty()) {
                itemList.add(elems[0]);
            } else {
                if (!itemList.contains(elems[0]))
                    itemList.add(elems[0]);
            }
            if (nameList.isEmpty()) {
                nameList.add(elems[2]);
            } else {
                if (!nameList.contains(elems[2]))
                    nameList.add(elems[2]);
            }
            for (int i = 0; i < elems.length; i++) {
                if (i == 1) {
                    if (elems[1].equals("")) {
                        list.add("0");
                    } else {
                        list.add(elems[1]);
                    }
                } else {
                    list.add(elems[i]);
                }
            }
        }
    }
    if (scanner.ioException() != null) {
        throw scanner.ioException();
    }
} catch (IOException e) {
    System.out.println(e);
} finally {
    try {
        if (stream != null) {
            stream.close();
        }
    } catch (IOException e) {
        System.out.println(e);
    }
    if (scanner != null) {
        scanner.close();
    }
}
You can try setting -Xms and -Xmx. If you're using default values, it's possible you just need more memory allocated to the JVM.
In addition to that, you could modify your code so it doesn't treat everything as String. For the score column (which is presumably numeric), you should be able to parse that as a numeric value and store that instead of the string representation. Why? Strings use a lot more memory than numeric values. Even an empty string uses 40 bytes, whereas a single numeric value can be as little as one byte.
If a single byte could work (numeric range is -128 to 127), then you could replace ~80MB memory usage with ~2MB. Even using int (4 bytes) would be a huge improvement over String. If there are any other numeric (or boolean) values present in the data, you could make further reductions.
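As a rough sketch of that idea, assuming the question's item,rating,user column order, that ratings fit in a byte, and a hypothetical ratings.csv file name:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CompactRatings {
    public static void main(String[] args) throws IOException {
        List<String> items = new ArrayList<>();
        List<String> users = new ArrayList<>();
        byte[] ratings = new byte[1024];   // one byte per rating instead of a String
        int count = 0;

        for (String line : Files.readAllLines(Paths.get("ratings.csv"))) {
            if (line.isEmpty()) continue;
            String[] elems = line.split(",");
            items.add(elems[0]);
            users.add(elems[2]);
            if (count == ratings.length) {
                ratings = Arrays.copyOf(ratings, ratings.length * 2);   // grow as needed
            }
            // Parse the rating once, as a number, instead of keeping the String:
            ratings[count++] = elems[1].isEmpty() ? (byte) 0 : Byte.parseByte(elems[1]);
        }
        System.out.println(count + " ratings stored in " + ratings.length + " bytes");
    }
}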

Java read text file with maze and get all possible paths [closed]

EDIT: I have tried to store the lines character by character into a 2D array.
The problem is to get all possible paths of a maze from 0 to 1 inside of a text file, where the asterisks are the walls or obstacles.
The maze looks like this:
8,8
********
*0     *
*      *
*  **  *
*  **  *
*      *
*     1*
********
I'm not sure if it's achievable to put it into a two-dimensional String array and then do recursion or dynamic programming afterwards.
Note that the only movements allowed are right and down. Also, the 0 start could be somewhere in the 2nd, 3rd, or any later column, and the same goes for the 1 destination.
Any tips or suggestions will be appreciated, thank you in advance!
Yep, this is fairly easy to do:
Read the first line of the text file and parse out the dimensions.
Create an array of length n.
For every (blank) item in the array:
Create a new length-n array as the data.
Parse the next line of the text file as individual characters into the array.
After this, you'll have your n x n data structure to complete your game with.
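A minimal sketch of those steps; the maze.txt file name is an assumption, and it assumes the header's two numbers are the row and column counts:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class MazeLoader {
    public static char[][] load(String fileName) throws IOException {
        try (BufferedReader br = new BufferedReader(new FileReader(fileName))) {
            // First line holds the dimensions, e.g. "8,8".
            String[] dims = br.readLine().split(",");
            int rows = Integer.parseInt(dims[0].trim());
            int cols = Integer.parseInt(dims[1].trim());

            char[][] maze = new char[rows][cols];
            for (int r = 0; r < rows; r++) {
                String line = br.readLine();
                for (int c = 0; c < cols && c < line.length(); c++) {
                    maze[r][c] = line.charAt(c);  // '*' = wall, ' ' = open, '0'/'1' = endpoints
                }
            }
            return maze;
        }
    }
}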
Using a Map to store this file seems like a good idea. I don't think reading the file character by character would be an issue:
BufferedReader br = new BufferedReader(new FileReader(file));
String line = br.readLine();
You have specified the grid dimensions, say (n x n).
A simple way I can visualize is generating unique keys for every coordinate.
Something like a parser method to store keys in the Map:
public String parseCoordinate(int x, int y) {
    // note: without a delimiter, (1, 23) and (12, 3) would produce the same key
    return x + "" + y;
}
Map<String, Boolean> gridMap = new HashMap<>();
So when you read the file by characters, you could put the parsed coordinates as keys in the map:
gridMap.put(parseCoordinate(lineCount, characterCount), line.charAt(characterCount) == '*');
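Putting those pieces together, a sketch of the whole read loop might look like this. The maze.txt file name is an assumption, and the key format adds a delimiter to avoid the collision noted above:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class MazeGridMap {
    static String parseCoordinate(int x, int y) {
        return x + "_" + y;   // delimiter avoids key collisions
    }

    public static void main(String[] args) throws IOException {
        Map<String, Boolean> gridMap = new HashMap<>();
        try (BufferedReader br = new BufferedReader(new FileReader("maze.txt"))) {
            br.readLine();   // skip the "8,8" dimension header
            String line;
            int lineCount = 0;
            while ((line = br.readLine()) != null) {
                for (int characterCount = 0; characterCount < line.length(); characterCount++) {
                    // true = wall ('*'), false = walkable
                    gridMap.put(parseCoordinate(lineCount, characterCount),
                                line.charAt(characterCount) == '*');
                }
                lineCount++;
            }
        }
        System.out.println(gridMap.size() + " cells loaded");
    }
}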
I'm assuming the only problem you are facing is deciding how to read the file correctly for processing, before applying the algorithm to determine the number of unique paths in the given maze.
private static int[][] getMatrixFromFile(File f) throws IOException {
    // Read the input file as a list of String lines
    List<String> lines = Files.lines(f.toPath())
            .collect(Collectors.toList());
    // Get the dimensions of the maze from the first line, e.g. "8,8"
    String[] dimensions = lines.get(0).split(",");
    // Initialize a sub-matrix of just the maze interior, ignoring the boundary walls
    int[][] mat = new int[Integer.parseInt(dimensions[0]) - 2][Integer.parseInt(dimensions[1]) - 2];
    // For each line of the maze excluding the boundaries, encode '*' as 0 and
    // anything else as 1 (note the column index must advance with each character)
    for (int i = 2; i < lines.size() - 1; i++) {
        String currLine = lines.get(i);
        for (int j = 1; j < currLine.length() - 1; j++) {
            mat[i - 2][j - 1] = (currLine.charAt(j) == '*') ? 0 : 1;
        }
    }
    return mat;
}
With this in place you can now focus on the algorithm for actually traversing the matrix to determine the number of unique paths from top-left to bottom-right.
Having said that, once you have the above matrix you are not limited to traversing just top-left to bottom-right; rather, any arbitrary point in your maze can serve as a start or end point.
If you require help with figuring out the number of unique paths, I can edit to include that bit, but dynamic programming should help in getting the same.
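For completeness, a standard dynamic-programming sketch under the question's right/down movement rule, using the 0/1 encoding produced by the method above. It assumes the start and goal sit at the matrix corners, which will not hold if 0 and 1 appear elsewhere:

// Counts right/down paths through the 0/1 matrix produced above,
// where 1 = open cell and 0 = wall.
static long countPaths(int[][] mat) {
    int rows = mat.length, cols = mat[0].length;
    long[][] dp = new long[rows][cols];
    dp[0][0] = mat[0][0];   // assumed start corner
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
            if (mat[r][c] == 0 || (r == 0 && c == 0)) {
                continue;   // walls contribute nothing; the start is already seeded
            }
            long fromTop = (r > 0) ? dp[r - 1][c] : 0;
            long fromLeft = (c > 0) ? dp[r][c - 1] : 0;
            dp[r][c] = fromTop + fromLeft;
        }
    }
    return dp[rows - 1][cols - 1];   // assumed goal corner
}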
private char[][] maze;

private void read() {
    final InputStream inputStream = YourClass.class.getResourceAsStream(INPUT_PATH);
    final BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
    try {
        final String header = reader.readLine();
        final String[] tokens = header.split(",");
        if (tokens.length < 2) {
            throw new RuntimeException("Invalid header"); // Use a dedicated exception
        }
        final int dimX = Integer.parseInt(tokens[0]);
        final int dimY = Integer.parseInt(tokens[1]);
        maze = new char[dimX][dimY];
        for (int i = 0; i < dimX; i++) {   // one maze row per line (dimX rows)
            final String line = reader.readLine();
            maze[i] = line.toCharArray();
        }
    } catch (final IOException e) {
        // handle exception
    } finally {
        try {
            reader.close();
        } catch (IOException e) {
            // handle exception
        }
    }
}
Now, some assumptions: I assumed the first line contains the declaration of the maze size, so it will be used to initialize the two dimensional array. The other assumption is that you can make use of a char array, but that's pretty easy to change if you want.
From here you can start working on your path finding algorithm.
By the way, this thing you're trying to implement reminds me a lot of this challenge in the Advent of Code challenge series. There are a lot of people discussing their solutions to the challenge; just have a look on Reddit, for instance, and you'll find plenty of tips on how to go on with your little experiment.
Have fun!

how to iterate over a text file to perform different tasks (including creating an unknown number of objects) depending on which line I'm reading

Hello, I'm a low-level comp sci student who's really struggling with, and unfamiliar with, file I/O.
I'm attempting to read in a text file using BufferedReader. I understand how to use a while loop to continue scanning until the end of the file is reached, but how can I instruct my reader to read just one line and do something until the end of that one line is reached, then read the next line and do something until that line ends, and so on?
Basically, my input text file is going to repeat every three lines. The text file represents nodes in a weighted directed graph.
The input text file would supposedly look like the following:
Each node is represented by two lines of text. For example, on the very top line, the first 'S' is the name of the node, the second 'S' indicates that it's a start node, and the third token, 'n', indicates that it's a regular node, not a goal node, which would be indicated by a 'g'.
On the second line are the two nodes connected to 'S', the first being 'B' with a weighted distance of 1, and the second being 'E' with a weighted distance of 2.
The third line is supposed to be blank, and the pattern is repeated.
S S n
B 1 E 2
B N n
C 2 F 3
C N n
D 2 GA 4
D N n
GA 1
E N n
B 1 F 3 H 6
F N n
I 3 GA 3 C 1
GA N g
H N n
I 2 GB 2 F 1
I N n
GA 2 GB 2
GB N g
My code is as follows:
public void actionPerformed(ActionEvent e)
{
    if (e.getSource() == openButton)
    {
        returnVal = fileChooser.showOpenDialog(null);
        if (returnVal == JFileChooser.APPROVE_OPTION)
        {
            selected_file = fileChooser.getSelectedFile();
            String file_name = fileChooser.getSelectedFile().getName();
            file_name = file_name.substring(0, file_name.indexOf('.'));
            try
            {
                BufferedWriter buff_writer = null;
                File newFile = new File("." + file_name + "_sorted.txt");
                boolean verify_creation = newFile.createNewFile();
                //if (verify_creation)
                //    System.out.println("file created successfully");
                //else
                //    System.out.println("file already present in specified location");
                file_reader1 = new BufferedReader(new FileReader(selected_file));
                file_reader2 = new BufferedReader(new FileReader(selected_file));
                FileWriter file_writer = new FileWriter(newFile.getAbsoluteFile());
                buff_writer = new BufferedWriter(file_writer);
                // find the number of nodes in the file
                while ((currentLine = file_reader1.readLine()) != null)
                {
                    k++;
                }
                nodeArray = new Node[k];
                while ((currentLine = file_reader2.readLine()) != null)
                {
                    String[] var = currentLine.split(" ");
                    nodeArray[x] = new Node(var[0]);
                    // note: compare against String literals -- equals('S') with a
                    // char argument would always be false
                    if (var[1].equals("S") || var[1].equals("s"))
                        nodeArray[x].setType(NodeType.START);
                    else if (var[2].equals("g") || var[2].equals("G"))
                        nodeArray[x].setType(NodeType.GOAL);
                    else
                        nodeArray[x].setType(NodeType.NORMAL);
                    x++;
                }
                buff_writer.close();
                file_writer.close();
            }
            catch (Exception e1)
            {
                e1.printStackTrace();
            }
        }
    }
}
My node class is as follows:
import java.util.*;

enum NodeType
{
    START, GOAL, NORMAL;
}

public class Node
{
    private String name;
    private NodeType typeOfNode;
    private final Map<Node, Integer> neighbors = new HashMap<>();

    public Node(String name)
    {
        this.name = name;
    }

    public void setType(NodeType type)
    {
        typeOfNode = type;
    }

    public void addAdjacentNode(Node node, int distance)
    {
        neighbors.put(node, distance);
    }

    public String toString()
    {
        String output = "";
        output += "node name: " + name + ",\n";
        return output;
    }
}
My other major problem is how to handle the second line in the repeating three-line sequence. The second line gives all the adjacent nodes and their weighted distances from the node described on the first line. The problem is, I don't know how many adjacent nodes will exist for any given node. Technically there could be none, or maybe a large number.
A kind programmer here suggested that I use a hash map to record adjacent nodes, but I'm not sure how to structure a line of code to account for an indeterminate number of such adjacencies.
Note: this question is in reference to this earlier question I asked: how to create an adjacency matrix, using an input text file, to represent a directed weighted graph [java]?
If anyone could point me in the right direction I would be eternally grateful.
As far as the input problem goes, your while loop is treating every line it reads identically. You will have to add a variable to keep track of which line in the 3-line sequence you're dealing with.
As for the adjacent nodes, use an ArrayList, which is a dynamically sized array.
You need one for each node, storing information about that node's adjacent nodes.
So you will need an array containing (k divided by 3) ArrayLists. A sketch of the whole read loop follows below.
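Here is a sketch of that idea. It reads the file in three-line blocks, tracking the position within the block with a counter, and wires up the adjacencies in a second pass so every neighbour already exists as an object. It reuses the question's Node and NodeType classes (whose neighbors map already handles an indeterminate number of adjacencies), assumes the strict node-line/edge-line/blank-line pattern the question describes, and the graph.txt file name is an assumption:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GraphLoader {
    public static void main(String[] args) throws IOException {
        Map<String, Node> nodesByName = new HashMap<>();
        List<String[]> edgeLines = new ArrayList<>();

        try (BufferedReader reader = new BufferedReader(new FileReader("graph.txt"))) {
            String line;
            int lineInBlock = 0;   // 0 = node line, 1 = edge line, 2 = blank separator
            String currentNode = null;
            while ((line = reader.readLine()) != null) {
                if (lineInBlock == 0) {
                    String[] parts = line.split(" ");
                    Node node = new Node(parts[0]);
                    if (parts[1].equals("S") || parts[1].equals("s"))
                        node.setType(NodeType.START);
                    else if (parts[2].equals("g") || parts[2].equals("G"))
                        node.setType(NodeType.GOAL);
                    else
                        node.setType(NodeType.NORMAL);
                    nodesByName.put(parts[0], node);
                    currentNode = parts[0];
                } else if (lineInBlock == 1) {
                    // Remember the edge line; its neighbours may not exist as Nodes yet.
                    edgeLines.add(new String[] { currentNode, line });
                }
                lineInBlock = (lineInBlock + 1) % 3;   // the blank line resets the cycle
            }
        }

        // Second pass: every node exists now, so the name/distance pairs can be wired up,
        // however many (or few) of them a line contains.
        for (String[] entry : edgeLines) {
            Node from = nodesByName.get(entry[0]);
            String[] tokens = entry[1].split(" ");
            for (int i = 0; i + 1 < tokens.length; i += 2) {
                from.addAdjacentNode(nodesByName.get(tokens[i]),
                                     Integer.parseInt(tokens[i + 1]));
            }
        }
    }
}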

Java Buffered Reader Text File Parsing

I am really struggling with parsing a text file. I have a text file which is in the following format
ID
Float Float
Float Float
.... // variable number of floats
END
ID
Float Float
Float Float
....
END
etc. However, the ID can be one of two values: 0, which means it is a new field, or -1, which means it is related to the last new field. The number of times a related field can repeat itself is unlimited, which is where the problem is occurring.
I have a method in a library which takes an ArrayList of the new floats, then an ArrayList of ArrayLists of the related floats.
When I try to code the logic for this I just keep getting deeper and deeper nested while loops.
I would really appreciate any suggestions as to how I should go about this. Thanks in advance.
Here is the code I have so far.
BufferedReader br = new BufferedReader(new FileReader(buildingsFile));
String[] line = br.readLine().trim().split(" ");
boolean original = true;
while (true)
{
    if (line[0].equals("END"))
        break;
    startCoord = new Coordinate(Double.parseDouble(line[0]), Double.parseDouble(line[1]));
    while (true)
    {
        line = br.readLine().trim().split(" ");
        if (!line[0].equals("END") && original == true)
            polypoints.add(new Coordinate(Double.parseDouble(line[0]), Double.parseDouble(line[1])));
        else if (!line[0].equals("END") && original == false)
            cutout.add(new Coordinate(Double.parseDouble(line[0]), Double.parseDouble(line[1])));
        else if (line[0].equals("END") && original == false)
        {
            cutouts.add(cutout);
            cutout.clear();
        }
        else if (line[0].equals("-99999"))
            original = false;
        else if (line[0].equals("0"))
            break;
    }
    buildingDB.addBuilding(mapName, startCoord, polypoints, cutouts);
}
New Code
int i = 0;
BufferedReader br = new BufferedReader(new FileReader(buildingsFile));
String[] line;
while (true)
{
    line = br.readLine().trim().split(" ");
    if (line[0].equals("END"))
        break;
    polygons.add(new Polygon(line));
    while (true)
    {
        line = br.readLine().trim().split(" ");
        if (line[0].equals("END"))
            break;
        polygons.get(i).addCoord(new Coordinate(Double.parseDouble(line[0]), Double.parseDouble(line[1])));
    }
    i++;
}
System.out.println(polygons.size());

for (i = 0; i < polygons.size(); i++)
{
    if (polygons.get(i).isNew)
    {
        Building newBuilding = new Building();
        newBuilding.startCoord = new Coordinate(polygons.get(i).x, polygons.get(i).y);
        // consume the cutout polygons that follow this one; the earlier
        // version's while loop never advanced i and so never terminated
        while (i + 1 < polygons.size() && !polygons.get(i + 1).isNew)
        {
            newBuilding.cutouts.add(polygons.get(i + 1).coords);
            i++;
        }
        buildings.add(newBuilding);
    }
}

for (i = 0; i < buildings.size(); i++)
{
    System.out.println(i);
    buildingDB.addBuilding(mapName, buildings.get(i).startCoord, buildings.get(i).polypoint, buildings.get(i).cutouts);
}
Maybe you should use a map for the new floats and related floats. If I got your question right, it should help. Example:
HashMap<String, Double> hm = new HashMap<>();
hm.put("Rohit", 3434.34);
I assume that a "field" means an ID and a variable number of coordinates (pairs of floats), which, judging from your code, in fact represents a polygon.
I would first load all the polygons, each into a separate Polygon object:
class Polygon {
    boolean isNew;
    List<Coordinate> coordinates;
}
and store the polygons in another list. Then in a 2nd pass go through all the polygons to group them according to their IDs into something like
class Building {
    Polygon polygon;
    List<Polygon> cutouts;
}
I think this would be fairly simple to code.
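A sketch of that second, grouping pass; the class shapes follow the answer above, everything else is illustrative:

import java.util.ArrayList;
import java.util.List;

public class PolygonGrouper {
    // Groups a flat list of polygons into buildings: each "new" polygon starts
    // a Building, and the related polygons that follow it become its cutouts.
    static List<Building> group(List<Polygon> polygons) {
        List<Building> buildings = new ArrayList<>();
        Building current = null;
        for (Polygon p : polygons) {
            if (p.isNew) {
                current = new Building();
                current.polygon = p;
                current.cutouts = new ArrayList<>();
                buildings.add(current);
            } else if (current != null) {
                current.cutouts.add(p);
            }
        }
        return buildings;
    }
}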
OTOH if you have a huge amount of data in the file, and/or you prefer processing the read data little by little, you could simply read a polygon and all its associated cutouts, until you find the next polygon (ID of 0), at which point you could simply pass the stuff read so far to the building DB and start reading the next polygon.
You can try using ANTLR here. The grammar defines the format of the text you are expecting, and then you can wrap the contents in a Java object. The * and + wildcards resolve the complexity of the while and for loops. It's very simple and easy to use; you don't have to construct an AST, you can take the parsed content from Java objects directly. The only overhead is that you have to add the ANTLR jar to your path.
