I am scraping data from a website and storing it in a CSV file. At first, every line in the CSV file ended with a trailing comma. I managed to fix that, but now the comma appears at the very start of every line, which creates an extra column. Here is my code:
for (Iterator<Element> it = tdElements.iterator(); it.hasNext();) {
if (it.hasNext()) {
sb.append(" \n ");
}
for (Iterator<Element> it2 = trElement2.iterator(); it.hasNext();) {
Element tdElement = it.next();
final String content = tdElement.text();
if (it2.hasNext()) {
sb.append(" , ");
sb.append(formatData(content));
}
if (!it2.hasNext()) {
String content1 = content.replaceAll(",$", " ");
sb.append(formatData(content1));
break;
} //to remove last placed Commas.
}
System.out.println(sb.toString());
sb.flush();
sb.close();
Result which I want e.g: a,b,c,d,e
Result which I am getting e.g: ,a,b,c,d,e
If you're developing in Java 8, I suggest that you use StringJoiner. With this new class, you don't have to build the string yourself. You can find an example to create a CSV with StringJoiner here.
I hope it helps.
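A minimal sketch of the StringJoiner approach, assuming the cell texts have already been extracted into lists of strings. The class and method names (CsvRows, joinRow, joinRows) are made up for illustration, and no CSV quoting or escaping is done here:

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

final class CsvRows {
    // Joins the cells of one row with commas.
    // StringJoiner only puts the separator between elements,
    // so there is no leading or trailing comma to clean up.
    static String joinRow(List<String> cells) {
        StringJoiner joiner = new StringJoiner(",");
        for (String cell : cells) {
            joiner.add(cell);
        }
        return joiner.toString();
    }

    // Joins several rows with newlines, one CSV line per row.
    static String joinRows(List<List<String>> rows) {
        StringJoiner lines = new StringJoiner("\n");
        for (List<String> row : rows) {
            lines.add(joinRow(row));
        }
        return lines.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinRows(Arrays.asList(
                Arrays.asList("a", "b", "c", "d", "e"),
                Arrays.asList("f", "g", "h", "i", "j"))));
    }
}
```

In the scraping loop, each td text would be passed to joiner.add(...) instead of appending commas by hand.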
StringBuffer sb = new StringBuffer(" ");
for (Iterator<Element> it = tdElements.iterator(); it.hasNext();) {
if (it.hasNext()) {
sb.deleteCharAt(sb.length() - 1);
sb.append(" \n ");
}
for (Iterator<Element> it2 = trElement2.iterator(); it.hasNext();) {
Element tdElement = it.next();
final String content = tdElement.text();
if (it2.hasNext()) {
sb.append(formatData(content));
sb.append(",");
}
if (!it2.hasNext()) {
String content1 = content.replaceAll(",$", " ");
sb.append(formatData(content1));
break;
} //to remove last placed Commas.
}
System.out.println(sb.toString());
sb.flush();
sb.close();
}
I'm removing the last character, which in your case is a comma, at the point where the code moves on to a new line. Try replacing your code with mine,
and make sure to instantiate the StringBuffer with a single space passed as the string.
I have two files each having the same format with approximately 100,000 lines. For each line in file one I am extracting the second component or column and if I find a match in the second column of second file, I extract their third components and combine them, store or output it.
My implementation works, but the program runs extremely slowly: it takes more than an hour to iterate over the files, compare them, and output all the results.
I am reading the data of both files into ArrayLists, then iterating over those lists and doing the comparison. Below is my code. Is there a performance problem in it, or is this just normal for such an operation?
Note: I was using String.split(), but I understand from another post that StringTokenizer is faster.
public ArrayList<String> match(String file1, String file2) throws IOException{
ArrayList<String> finalOut = new ArrayList<>();
try {
ArrayList<String> data = readGenreDataIntoMemory(file1);
ArrayList<String> data1 = readGenreDataIntoMemory(file2);
StringTokenizer st = null;
for(String line : data){
HashSet<String> genres = new HashSet<>();
boolean sameMovie = false;
String movie2 = "";
st = new StringTokenizer(line, "|");
//String line[] = fline.split("\\|");
String ratingInfo = st.nextToken();
String movie1 = st.nextToken();
String genreInfo = st.nextToken();
if(!genreInfo.equals("null")){
for(String s : genreInfo.split(",")){
genres.add(s);
}
}
StringTokenizer st1 = null;
for(String line1 : data1){
st1 = new StringTokenizer(line1, "|");
st1.nextToken();
movie2 = st1.nextToken();
String genreInfo2= st1.nextToken();
//If the movie name are similar then they should have the same genre
//Update their genres to be the same
if(!genreInfo2.equals("null") && movie1.equals(movie2)){
for(String s : genreInfo2.split(",")){
genres.add(s);
}
sameMovie = true;
break;
}
}
if(sameMovie){
finalOut.add(ratingInfo+""+movie1+""+genres.toString()+"\n");
}else if(sameMovie == false){
finalOut.add(line);
}
}
} catch (FileNotFoundException e) {
e.printStackTrace();
}
return finalOut;
}
I would use the Streams API
String file1 = "files1.txt";
String file2 = "files2.txt";
// get all the lines by movie name for each file.
Map<String, List<String[]>> map = Stream.of(Files.lines(Paths.get(file1)),
Files.lines(Paths.get(file2)))
.flatMap(p -> p)
.parallel()
.map(s -> s.split("[|]", 3))
.collect(Collectors.groupingByConcurrent(sa -> sa[1], Collectors.toList()));
// merge all the genres for each movie.
map.forEach((movie, lines) -> {
Set<String> genres = lines.stream()
.flatMap(l -> Stream.of(l[2].split(",")))
.collect(Collectors.toSet());
System.out.println("movie: " + movie + " genres: " + genres);
});
This has the advantage of being O(n) instead of O(n^2) and it's multi-threaded.
Do a hash join.
Right now you are doing a nested loop join, which is O(n^2); a hash join is amortized O(n).
Put the contents of each file in a hash map, keyed by the field you want to match on (the second field).
Map<String,String> map1 = new HashMap<>();
// build the map from file1
Then do the hash join
for(String key1 : map1.keySet()){
if(map2.containsKey(key1)){
// do your thing you found the match
}
}
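A fuller sketch of the hash join outlined above, assuming each line has the form rating|movie|genres as in the question. The class name GenreHashJoin, the method signature, and the "|"-joined output format are illustrative assumptions, not the asker's code. Only the second file is indexed, and the first file is probed once per line:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class GenreHashJoin {
    static List<String> match(List<String> data, List<String> data1) {
        // Build phase: index the second file by movie name (second field).
        Map<String, String> genresByMovie = new HashMap<>();
        for (String line : data1) {
            String[] parts = line.split("\\|", 3);
            if (parts.length == 3 && !parts[2].equals("null")) {
                // Keep the first usable entry, like the break in the nested loop.
                genresByMovie.putIfAbsent(parts[1], parts[2]);
            }
        }

        // Probe phase: one constant-time lookup per line of the first file.
        List<String> out = new ArrayList<>();
        for (String line : data) {
            String[] parts = line.split("\\|", 3);
            String other = parts.length == 3 ? genresByMovie.get(parts[1]) : null;
            if (other == null) {
                out.add(line); // no match: keep the line unchanged
                continue;
            }
            Set<String> genres = new LinkedHashSet<>();
            if (!parts[2].equals("null")) {
                genres.addAll(Arrays.asList(parts[2].split(",")));
            }
            genres.addAll(Arrays.asList(other.split(",")));
            // Assumed output format: same three fields, genres merged.
            out.add(parts[0] + "|" + parts[1] + "|" + String.join(",", genres));
        }
        return out;
    }
}
```

With roughly 100,000 lines per file this does about 200,000 map operations instead of up to 10^10 string comparisons.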
I have a class called CD with the following private variables:
private String artist = "";
private String year = "";
private String albumName = "";
private ArrayList<String> songs = new ArrayList<String>();
This class is used to store input data that is in this format:
Led Zeppelin
1979 In Through the Outdoor
-In the Evening
-South Bound Saurez
-Fool in the Rain
-Hot Dog
-Carouselambra
-All My Love
-I'm Gonna Crawl
I have a CDParser class that parses the file sample.db line by line and stores the data in a CD object. After creating the object with CD newCD = new CD() and parsing the file, it has the following structure:
artist = "Led Zeppelin"
year = "1979"
albumName = "In Through the Outdoor"
songs = {"-In the Evening", "-South Bound Saurez", "-Fool in the Rain", "-Hot Dog"}
Now, for this project, sample.db contains many albums and looks like the following:
Led Zeppelin
1979 In Through the Outdoor
-In the Evening
-South Bound Saurez
-Fool in the Rain
-Hot Dog
-Carouselambra
-All My Love
-I'm Gonna Crawl
Led Zeppelin
1969 II
-Whole Lotta Love
-What Is and What Should Never Be
-The Lemon Song
-Thank You
-Heartbreaker
-Living Loving Maid (She's Just a Woman)
-Ramble On
-Moby Dick
-Bring It on Home
Bob Dylan
1966 Blonde on Blonde
-Rainy Day Women #12 & 35
-Pledging My Time
-Visions of Johanna
-One of Us Must Know (Sooner or Later)
-I Want You
-Stuck Inside of Mobile with the Memphis Blues Again
-Leopard-Skin Pill-Box Hat
-Just Like a Woman
-Most Likely You Go Your Way (And I'll Go Mine)
-Temporary Like Achilles
-Absolutely Sweet Marie
-4th Time Around
-Obviously 5 Believers
-Sad Eyed Lady of the Lowlands
I have so far been able to parse all three different albums and save them into my CD object, but ran into a roadblock where I'm simply saving all three albums into the same newCD object.
My question is: is there a way to programmatically create CD objects named newCD1, newCD2, newCD3, and so on, as I parse sample.db?
What this means is, as I parse this particular file:
newCD1 would be the album In Through the Outdoor (and its respective private vars)
newCD2 would be the album II (and its respective private vars)
newCD3 would be the album Blonde on Blonde, and so on
Is this a smart way to do it? Or could you suggest a better way?
EDIT:
Attached is my parser code. ourDB is an ArrayList containing every line of sample.db:
CD newCD = new CD();
int line = 0;
for(String string : this.ourDB) {
if(line == ARTIST) {
newCD.setArtist(string);
System.out.println(string);
line++;
} else if(line == YEAR_AND_ALBUM_NAME){
String[] elements = string.split(" ");
String[] albumNameArr = Arrays.copyOfRange(elements, 1, elements.length);
String year = elements[0];
String albumName = join(albumNameArr, " ");
newCD.setYear(year);
newCD.setAlbumName(albumName);
System.out.println(year);
System.out.println(albumName);
line++;
} else if(line >= SONGS && !string.equals("")) {
newCD.setSong(string);
System.out.println(string);
line++;
} else if(string.isEmpty()){
line = 0;
}
}
You have a single CD object, so you keep overwriting it. Instead, you could hold a collection of CDs, e.g.:
List<CD> cds = new ArrayList<>();
CD newCD = new CD();
int line = 0;
for(String string : this.ourDB) {
if(line == ARTIST) {
newCD.setArtist(string);
System.out.println(string);
line++;
} else if(line == YEAR_AND_ALBUM_NAME){
String[] elements = string.split(" ");
String[] albumNameArr = Arrays.copyOfRange(elements, 1, elements.length);
String year = elements[0];
String albumName = join(albumNameArr, " ");
newCD.setYear(year);
newCD.setAlbumName(albumName);
System.out.println(year);
System.out.println(albumName);
line++;
} else if(line >= SONGS && !string.equals("")) {
newCD.setSong(string);
System.out.println(string);
line++;
} else if(string.isEmpty()){
// We're starting a new CD!
// Add the one we have so far to the list, and start afresh
cds.add(newCD);
newCD = new CD();
line = 0;
}
}
// Take care of the case the file doesn't end with a newline:
if (line != 0) {
cds.add(newCD);
}
The problem is that you're reusing the same CD object reference for all the values parsed from the file.
Make sure to create and store a new CD instance every time you start parsing the content of a new album.
You may do the following:
List<CD> cdList = new ArrayList<>();
for (<some way to handle you're reading a new album entry from your file>) {
CD cd = new CD();
//method below parses the data in the db per album entry
//an album entry may contain several lines
parseData(cd, this.ourDB);
cdList.add(cd);
}
System.out.println(cdList);
Your current way to parse the file works but is not as readable as it should be. I would recommend using two loops:
List<CD> cdList = new ArrayList<>();
Iterator<String> yourDBIterator = this.ourDB.iterator();
//it will force to enter the first time
while (yourDBIterator.hasNext()) {
//do the parsing here...
CD cd = new CD();
//method below parses the data in the db per album entry
//an album entry may contain several lines
parseData(cd, yourDBIterator);
cdList.add(cd);
}
//...
public void parseData(CD cd, Iterator<String> it) {
String string = it.next();
int line = ARTIST;
// An empty line (or the end of the input) ends the current album.
while (!"".equals(string)) {
if (line == ARTIST) {
cd.setArtist(string);
System.out.println(string);
line++;
} else if(line == YEAR_AND_ALBUM_NAME){
String[] elements = string.split(" ");
String[] albumNameArr = Arrays.copyOfRange(elements, 1, elements.length);
String year = elements[0];
String albumName = join(albumNameArr, " ");
cd.setYear(year);
cd.setAlbumName(albumName);
System.out.println(year);
System.out.println(albumName);
line++;
} else if(line >= SONGS && !string.equals("")) {
cd.setSong(string);
System.out.println(string);
line++;
}
if (it.hasNext()) {
string = it.next();
} else {
string = "";
}
}
}
Then, your code
I suggest using the Builder design pattern to construct the CD object. If you always read the lines in the same order, it will not be complicated to implement and use. Good tutorial: http://www.javacodegeeks.com/2013/01/the-builder-pattern-in-practice.html
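A minimal sketch of what a builder for the CD class could look like. The field names come from the question; the Builder class, its method names, and the getters are assumptions made up for this example, not the asker's actual class:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class CD {
    private final String artist;
    private final String year;
    private final String albumName;
    private final List<String> songs;

    private CD(Builder b) {
        this.artist = b.artist;
        this.year = b.year;
        this.albumName = b.albumName;
        // Copy the list so later builder changes cannot affect this CD.
        this.songs = Collections.unmodifiableList(new ArrayList<>(b.songs));
    }

    String getArtist() { return artist; }
    String getYear() { return year; }
    String getAlbumName() { return albumName; }
    List<String> getSongs() { return songs; }

    // Hypothetical builder: collects values line by line, then creates the CD.
    static final class Builder {
        private String artist = "";
        private String year = "";
        private String albumName = "";
        private final List<String> songs = new ArrayList<>();

        Builder artist(String value) { this.artist = value; return this; }
        Builder year(String value) { this.year = value; return this; }
        Builder albumName(String value) { this.albumName = value; return this; }
        Builder song(String value) { this.songs.add(value); return this; }

        CD build() { return new CD(this); }
    }
}
```

The parser would create a fresh Builder at the start of each album, call build() when it reaches the empty separator line, and add the result to a List<CD>.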
Suppose a CSV file contains:
1,112,,ASIF
The following code drops the empty value between two consecutive commas.
(The code shown is more than is strictly required.)
String p1=null, p2=null;
while ((lineData = Buffreadr.readLine()) != null)
{
row = new Vector(); int i=0;
StringTokenizer st = new StringTokenizer(lineData, ",");
while(st.hasMoreTokens())
{
row.addElement(st.nextElement());
if (row.get(i).toString().startsWith("\"")==true)
{
while(row.get(i).toString().endsWith("\"")==false)
{
p1= row.get(i).toString();
p2= st.nextElement().toString();
row.set(i,p1+", "+p2);
}
String CellValue= row.get(i).toString();
CellValue= CellValue.substring(1, CellValue.length() - 1);
row.set(i,CellValue);
//System.out.println(" Final Cell Value : "+row.get(i).toString());
}
eror=row.get(i).toString();
try
{
eror=eror.replace('\'',' ');
eror=eror.replace('[' , ' ');
eror=eror.replace(']' , ' ');
//System.out.println("Error "+ eror);
row.remove(i);
row.insertElementAt(eror, i);
}
catch (Exception e)
{
System.out.println("Error exception "+ eror);
}
//}
i++;
}
How can I read two consecutive commas in a .csv file as a separate (empty) value in Java?
Here is an example of doing this by splitting to String array. Changed lines are marked as comments.
// Start of your code.
row = new Vector(); int i=0;
String[] st = lineData.split(","); // Changed
for (String s : st) { // Changed
row.addElement(s); // Changed
if (row.get(i).toString().startsWith("\"") == true) {
while (row.get(i).toString().endsWith("\"") == false) {
p1 = row.get(i).toString();
p2 = s.toString(); // Changed
row.set(i, p1 + ", " + p2);
}
...// Rest of Code here
}
The StringTokenizer skips empty tokens. This is its documented behaviour. From the Javadoc:
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
Just use String.split(",") and you are done.
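To illustrate on the sample line, here is a small sketch (the class and method names are made up for the example). Passing -1 as the limit is optional; it additionally keeps empty cells at the end of a line, which the plain split(",") drops:

```java
import java.util.Arrays;

final class SplitDemo {
    // Splits a CSV line on commas, keeping empty cells,
    // including trailing ones (because of the -1 limit).
    static String[] cells(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        // The empty cell between the two commas is preserved.
        System.out.println(Arrays.toString(cells("1,112,,ASIF"))); // prints [1, 112, , ASIF]
    }
}
```

Note that this does not handle quoted cells that contain commas.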
Just read the whole line into a string then do string.split(",").
The resulting array should have exactly what you are looking for...
If you need to check for "escaped" commas then you will need some regex for the query instead of a simple ",".
while ((lineData = Buffreadr.readLine()) != null) {
String[] row = lineData.split(",");
// Now process the array however you like, each cell in the csv is one entry in the array
}
I have used the following code to extract text from .odt files:
public class OpenOfficeParser {
StringBuffer TextBuffer;
public OpenOfficeParser() {}
//Process text elements recursively
public void processElement(Object o) {
if (o instanceof Element) {
Element e = (Element) o;
String elementName = e.getQualifiedName();
if (elementName.startsWith("text")) {
if (elementName.equals("text:tab")) // add tab for text:tab
TextBuffer.append("\\t");
else if (elementName.equals("text:s")) // add space for text:s
TextBuffer.append(" ");
else {
List children = e.getContent();
Iterator iterator = children.iterator();
while (iterator.hasNext()) {
Object child = iterator.next();
//If Child is a Text Node, then append the text
if (child instanceof Text) {
Text t = (Text) child;
TextBuffer.append(t.getValue());
}
else
processElement(child); // Recursively process the child element
}
}
if (elementName.equals("text:p"))
TextBuffer.append("\\n");
}
else {
List non_text_list = e.getContent();
Iterator it = non_text_list.iterator();
while (it.hasNext()) {
Object non_text_child = it.next();
processElement(non_text_child);
}
}
}
}
public String getText(String fileName) throws Exception {
TextBuffer = new StringBuffer();
//Unzip the openOffice Document
ZipFile zipFile = new ZipFile(fileName);
Enumeration entries = zipFile.entries();
ZipEntry entry;
while(entries.hasMoreElements()) {
entry = (ZipEntry) entries.nextElement();
if (entry.getName().equals("content.xml")) {
TextBuffer = new StringBuffer();
SAXBuilder sax = new SAXBuilder();
Document doc = sax.build(zipFile.getInputStream(entry));
Element rootElement = doc.getRootElement();
processElement(rootElement);
break;
}
}
System.out.println("The text extracted from the OpenOffice document = " + TextBuffer.toString());
return TextBuffer.toString();
}
}
Now my problem occurs when using the string returned by the getText() method.
I ran the program and extracted some text from a .odt, here is a piece of extracted text:
(no hi virtual x oy)\n\n house cat \n open it \n\n trying to....
So I tried this
System.out.println( TextBuffer.toString().split("\\n"));
the output I received was:
substring: [Ljava.lang.String;@505bb829
I also tried this:
System.out.println( TextBuffer.toString().trim() );
but no changes in the printed string.
Why this behaviour?
What can I do to parse that string correctly?
And if I wanted to store in array[i] each substring that ends with "\n\n", how could I do that?
edit:
Sorry I made a mistake with the example because I forgot that split() returns an array.
The problem is that it returns an array with one line so what I'm asking is why doing this:
System.out.println(Arrays.toString(TextBuffer.toString().split("\\n")));
has no effect on the string I wrote in the example.
Also this:
System.out.println( TextBuffer.toString().trim() );
has no effects on the original string, it just prints the original string.
I want to explain why I want to use split(): I want to parse that string and put each substring that ends with "\n" into an array of lines. Here is an example:
My original string:
(no hi virtual x oy)\n\n house cat \n open it \n\n trying to....
after parsing I would print each line of an array and the output should be:
line 1: (no hi virtual x oy)
line 2: house cat
line 3: open it
line 4: trying to
and so on.....
If I understood your question correctly I would do something like this
String str = "(no hi virtual x oy)\n\n house cat \n open it \n\n trying to....";
List<String> al = new ArrayList<String>(Arrays.asList(str.toString()
.split("\\n")));
al.removeAll(Arrays.asList("", null)); // remove empty or null string
for (int i = 0; i< al.size(); i++) {
System.out.println("Line " + i + " : " + al.get(i).trim());
}
Output
Line 0 : (no hi virtual x oy)
Line 1 : house cat
Line 2 : open it
Line 3 : trying to....
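One thing worth checking, as an assumption about the question's code rather than something confirmed there: processElement appends "\\n", which in Java source is a backslash followed by the letter n, not a line break. If the buffer really holds those two characters, the regex "\\n" (which matches a real newline) finds nothing to split on, and trim() has nothing to remove. A sketch that splits on either form (the class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

final class LiteralNewlineSplit {
    // Splits on a real line break or on the two-character
    // sequence backslash + n, then drops blank pieces.
    static List<String> lines(String text) {
        List<String> result = new ArrayList<>();
        // Regex alternatives: \\n = literal backslash then n, \n = newline.
        for (String part : text.split("\\\\n|\\n")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Here "\\n" is a literal backslash followed by n, as in the question's buffer.
        String text = "(no hi virtual x oy)\\n\\n house cat \\n open it \\n\\n trying to....";
        List<String> lines = lines(text);
        for (int i = 0; i < lines.size(); i++) {
            System.out.println("Line " + i + " : " + lines.get(i));
        }
    }
}
```

Alternatively, appending "\n" and "\t" (single backslash) in processElement would put real line breaks and tabs in the buffer, and the plain split("\\n") would then work.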
I got a result set "rs" from a database. rs contains only one column. I want to access a single row by its index rather than all rows. Right now, I know that I can iterate like this:
while(rs.next()){
rs.getString("employee_name")
}
But, it does not let me select the row.
What I actually want is to take a row, add a comma after it, and then add the next row, with no comma after the last element. So I need to iterate up to the second-to-last (n-1th) row, adding commas as I go, and then append the last row to my string without one.
Try the following:
Set<String> set = new LinkedHashSet<String>();
while(rs.next()) {
set.add(rs.getString("employee_name"));
}
Iterator<String> it = set.iterator();
StringBuilder sb = new StringBuilder();
while(it.hasNext()) {
sb.append(it.next());
if (it.hasNext()) {
sb.append(",");
}
}
String result = sb.toString();
String temp = "";
StringBuilder sb = new StringBuilder();
while(rs.next()){
sb.append(rs.getString("employee_name"));
sb.append(",");
}
// Drop the trailing comma, but only if at least one row was read.
if (sb.length() > 0) {
sb.setLength(sb.length() - 1);
}
temp = sb.toString();
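Another option, sketched under the assumption that Java 8 or later is available, is to collect the column values into a list first and let String.join place the commas. The class and method names (EmployeeNames, readColumn, toCsv) are made up for this example:

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

final class EmployeeNames {
    // Reads one column from every row into a list, in row order.
    static List<String> readColumn(ResultSet rs, String column) throws SQLException {
        List<String> values = new ArrayList<>();
        while (rs.next()) {
            values.add(rs.getString(column));
        }
        return values;
    }

    // String.join puts the separator only between elements,
    // so no trailing comma has to be removed afterwards.
    static String toCsv(List<String> values) {
        return String.join(",", values);
    }
}
```

Usage would look like String result = EmployeeNames.toCsv(EmployeeNames.readColumn(rs, "employee_name")); and an empty result set simply yields an empty string.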