writing a file from Set(Crawler,Jsoup,Java)

I recently wrote a small crawler that searches a page for links and writes them to a file. I added a collection (HashSet) to the code to avoid the same links ...
but for some reason the code does not work, and I see a lot of duplicates in the file.
Could you help me fix the bug?
Here is the code of the crawler:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Crawler {
    public static void main(String[] args) {
        Set<String> setUrlBase = new HashSet<String>();
        Document doc;
        String BaseUrlTxtT = "C://Search/urlw.txt";
        try {
            doc = Jsoup.connect("http://stackoverflow.com/").get();
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                String UrlLinkHref = link.attr("href");
                if (UrlLinkHref.indexOf("http://") == 0) {
                    setUrlBase.add(UrlLinkHref);
                    for (String strUrlHash : setUrlBase) {
                        writeToBase(BaseUrlTxtT, strUrlHash + "\n");
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    private static void writeToBase(String fileName, String text) {
        File file = new File(fileName);
        try {
            if (!file.exists()) {
                file.createNewFile();
            }
            FileWriter wr = new FileWriter(file.getAbsoluteFile(), true);
            try {
                wr.write(text + "\n");
            } finally {
                wr.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

It seems that in this code
    String UrlLinkHref = link.attr("href");
    if (UrlLinkHref.indexOf("http://") == 0) {
        setUrlBase.add(UrlLinkHref);
        for (String strUrlHash : setUrlBase) {
            writeToBase(BaseUrlTxtT, strUrlHash + "\n");
        }
    }
you are adding the link to the set and then appending the entire content of the set to the file. I say appending rather than rewriting because you are using
    FileWriter wr = new FileWriter(file.getAbsoluteFile(), true);
where true means "don't erase my file, but add the new content after the existing content", so your file probably looks like
link1 //part written when added first link
link1 //part written when added second link
link2
link1 //part written when added third link
link2
link3
(actually, since HashSet is unordered, the elements in each part could be written as link2, link3, link1 instead of link1, link2, link3, but you should get the idea)
To avoid this, just use code like
    String UrlLinkHref = link.attr("href");
    if (UrlLinkHref.indexOf("http://") == 0) {
        if (setUrlBase.add(UrlLinkHref)) { // add returns true if the link wasn't in the set yet,
            writeToBase(BaseUrlTxtT, UrlLinkHref + "\n"); // so only then do we write it to the file
        }
    }
which can be shortened even more using the short-circuit and (&&)
    String UrlLinkHref = link.attr("href");
    if (UrlLinkHref.startsWith("http://") && setUrlBase.add(UrlLinkHref)) {
        writeToBase(BaseUrlTxtT, UrlLinkHref + "\n"); // written only when the link is new
    }
A few more improvements
To check whether a string starts with "http://", don't use "cryptic" code like UrlLinkHref.indexOf("http://") == 0 but simply UrlLinkHref.startsWith("http://")
Don't hardcode a line separator like \n, because different operating systems can use different line separators (Windows, for instance, uses \r\n). Instead get it from System.lineSeparator(). Or even better, wrap your writer in a PrintWriter, which has a println method (just like System.out.println()) that adds the OS-specific line separator for you.
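As a rough sketch of the suggestions above (write a link only when Set.add returns true, use startsWith, and let PrintWriter.println choose the line separator), something like the following could work. The class name UniqueLinkWriter, the method writeUnique and the sample links are made up for illustration and are not part of the original crawler; the links come from a plain list here instead of from Jsoup.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class UniqueLinkWriter {
    // Writes every distinct "http://" link exactly once, one per line.
    // Hypothetical helper: in the crawler, hrefs would come from link.attr("href").
    static void writeUnique(List<String> hrefs, PrintWriter out) {
        Set<String> seen = new HashSet<>();
        for (String href : hrefs) {
            // add() returns false for a link that is already in the set,
            // so duplicates never reach println().
            if (href.startsWith("http://") && seen.add(href)) {
                out.println(href); // println appends the platform line separator
            }
        }
        out.flush();
    }

    public static void main(String[] args) {
        StringWriter buffer = new StringWriter();
        writeUnique(
                Arrays.asList("http://a.example", "/relative", "http://b.example", "http://a.example"),
                new PrintWriter(buffer));
        System.out.print(buffer);
    }
}
```

To write to a real file, the same method could be handed a PrintWriter wrapped around a FileWriter; opening the writer once for the whole crawl also avoids reopening the file for every link.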

Related

Concatenation of Strings in new line in JAva

I'm trying to concatenate strings in new lines when a condition is met. This is my input:
Concept
soft top cove
tonneau cove
interior persennin
Concept
Innen
Innenraum
Platz im Inneren
All I want to do is concatenate all the strings that follow each "Concept" line, to get the following output:
lemma, surface
soft top cove, tonneau cove|interior persennin
Innen, Innenraum|Platz im Inneren
When a line equals "Concept", I want to start a new output line: write the string from the next line followed by a comma, then the strings from the following lines delimited by "|", e.g. soft top cove, tonneau cove|interior persenning
This is my code so far. Any suggestions are welcome!
Thank you:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
public class Converter {
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        BufferedReader inputcsv = null;
        List <String> zeilencsv = new ArrayList<String>();
        try {
            inputcsv = Files.newBufferedReader(Paths.get("ErsteDatei.csv"));
            String content;
            while ((content = inputcsv.readLine()) != null) {
                zeilencsv.add(content);
                System.out.println(content);
            }
            File outputcsv = new File("TwoColumnsResult.csv");
            //creates new file
            outputcsv.createNewFile();
            FileWriter csvFilewriter = new FileWriter(outputcsv);
            //arraylist loop
            int counter_a=0;
            int counter = 1;
            for (String zeile:zeilencsv){
                String concept = "Concept";
                //check string value =concept?
                if(zeile.toString().equals(concept)){
                    zeile="lemma,surface";
                    for(String zeile2:zeilencsv){
                        //here I don't know how to say give me the next line, write it as a word , put comma and than concatenate with a |
                    }
                }
                else {
                    counter++;
                }
                csvFilewriter.write(zeile+"\n");
                counter++;
            }
            //write
            csvFilewriter.flush();
            //closes the file
            csvFilewriter.close();
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}

So, if I understood correctly, after finding "Concept" you want to save the next line as text, append a comma, then append the following line, then a | and the line after that?
Well, the code you're using is a bit unusual, but here is what I would do:
    zeile="lemma,surface";
    // A counter to decide where to put "," and "|"
    // (named count so it does not clash with the existing counter variable)
    int count = 0;
    // A string variable to build up the line
    String line = "";
    for(String zeile2:zeilencsv){
        if (count == 0){ // First iteration: start the line
            line = zeile2;
        }
        if (count == 1){ // Second iteration: add the comma
            line = line + "," + zeile2;
        }
        if (count >= 2){ // Third iteration onward: add the pipe
            line = line + "|" + zeile2;
        }
        count++;
    }
I hope that this is what you're looking for. If you would like to optimize the code, feel free to send me a message and we could work together.
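For what it's worth, here is a self-contained sketch of one way the grouping could be done so that it produces the output shown in the question. It is not the code from the answer above: the class name ConceptJoiner and the methods toRows and flush are invented for this example, and it assumes the header "lemma, surface" should be written once and that every group starts with a line equal to "Concept".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ConceptJoiner {
    // Turns the lines after each "Concept" marker into one row:
    // first line, then ", ", then the remaining lines joined with "|".
    static List<String> toRows(List<String> lines) {
        List<String> rows = new ArrayList<>();
        rows.add("lemma, surface"); // assumption: header is written once
        List<String> group = null;
        for (String line : lines) {
            if (line.equals("Concept")) {
                flush(group, rows);       // finish the previous group, if any
                group = new ArrayList<>();
            } else if (group != null) {
                group.add(line);          // lines before the first marker are ignored
            }
        }
        flush(group, rows);               // finish the last group
        return rows;
    }

    private static void flush(List<String> group, List<String> rows) {
        if (group == null || group.isEmpty()) {
            return;
        }
        String row = group.get(0);
        if (group.size() > 1) {
            row += ", " + String.join("|", group.subList(1, group.size()));
        }
        rows.add(row);
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "Concept", "soft top cove", "tonneau cove", "interior persennin",
                "Concept", "Innen", "Innenraum", "Platz im Inneren");
        toRows(input).forEach(System.out::println);
    }
}
```

In the Converter class, the list returned by toRows could then be written out line by line instead of writing each input line directly.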

File I/O Practice.

I'm trying to read names from a file called boysNames.txt , which contains 1000 boys' names. For example, the lines in the file look like this:
Devan
Chris
Tom
The goal is just to read in each name, but I couldn't find a method in the java.util package that lets me grab just the name from the file boysName.txt .
For example, I just want to grab Devan, then Chris, then Tom,
NOT "1. Devan" and "2. Chris".
The problem is that nextLine() grabs the whole line. I don't want the "1." part.
So I just want Devan, Chris, Tom to be read and stored in a variable of type String. Does anyone know how to do that? I've tried hasNext(), but that didn't work.
Here is the code so you can get a visual:
import java.io.PrintWriter;
import java.io.FileOutputStream;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.io.FileInputStream;
public class PracticeWithTxtFiles {
    public static void main(String[] args){
        PrintWriter outputStream = null;
        Scanner inputStream = null;
        try{
            outputStream = new PrintWriter(new FileOutputStream("boys.txt")); //opens up the file boys.txt
            inputStream = new Scanner(new FileInputStream("boyNames.txt"));//opens up the file boyNames.txt
        }catch(FileNotFoundException e){
            System.out.println("Problem opening/creating files");
            System.exit(0);
        }
        String names = null;
        int ssnumbers= 0;
        while(inputStream.hasNextLine())//problem is right here need it to just
                                        // grab String representation of String, not an int
        {
            names = inputStream.nextLine();
            ssnumbers++;
            outputStream.println(names + " " + ssnumbers);
        }
        inputStream.close();
        outputStream.close();
    }
}

In case you are not aware of it, have a look at the replace method in the String API:
String - Library
Just do a replace; it's as simple as that.
names = names.replace(".","").replaceAll("[0-9]+","").trim();
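Note that the one-liner above deletes every dot and every digit anywhere in the line. If only the leading number should go, a narrower pattern may be safer. The following is a minimal sketch under the assumption that each line looks like "1. Devan"; the class NamePrefixStripper and the method stripNumberPrefix are names invented for this example.

```java
class NamePrefixStripper {
    // Removes a leading "<digits>." prefix (plus surrounding whitespace) and
    // leaves the rest of the line untouched.
    static String stripNumberPrefix(String line) {
        return line.replaceFirst("^\\s*\\d+\\.\\s*", "").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripNumberPrefix("1. Devan"));
        System.out.println(stripNumberPrefix("2. Chris"));
    }
}
```

Inside the while loop of the question, this would be applied to the value returned by inputStream.nextLine() before it is written out.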

How do I print out just certain elements from a text file that has xml tags to a new text file?

I need help with something that sounds easy but has given me some trouble.
I have a text file (record.txt) that has a root element 'PatientRecord' and sub tags in it ('first name', 'age', blood type, address etc...) that repeat over and over but with different values since it's a record for each person. I'm only interested in printing out the values in between the tags to a new text file for each person but only for the elements I want. For example with the tags I mentioned above I only want the name and age but not the rest of the info for that patient. How do I print out just those values separated by commas and then go to the next patient?
Here is the code I have so far
package patient.records;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
public class ProcessRecords {
private static final String FILE = "C:\\Users\\Desktop\\records.txt";
private static final String RECORD_START_TAG = "<PatientRecord>";
private static final String RECORD_END_TAG = "</PatientRecord>";
private static final String newFileName = "C:\\Users\\Desktop\\DataFolder\\";
public static void main(String[] args) throws Exception {
String scan;
FileReader file = new FileReader(FILE);
BufferedReader br = new BufferedReader(file);
Writer writer = null;
while ((scan = br.readLine()) != null)
{
if (scan.contains(RECORD_START_TAG)) {
//This is the logic I am missing that will only grab the element values
//between the tags inside of the file
writer = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream(newFileName + "Record Data" + ".txt"), "utf-8"));
}
else if (scan.contains(RECORD_END_TAG)) {
writer.close();
writer=null;
}
else {
// only write if writer is not null
if (writer!=null) {
writer.write(scan);
}
}
}
br.close();
}
} //This is the end of my code
The text file (record.txt) I am reading in looks like this:
<PatientRecord> <---first patient record--->
<---XML Schema goes here--->
<Info>
<age>66</age>
<first_name>john</first_name>
<last_name>smith</last_name>
<mailing_address>200 main street</mailing_address>
<blood_type>AB</blood_type>
<phone_number>000-000-0000</phone_number>
</PatientRecord>
<PatientRecord> <---second patient record--->
<---XML Schema goes here--->
<Info>
<age>27</age>
<first_name>micheal</first_name>
<last_name>thompson</last_name>
<mailing_address>123 baker street</mailing_address>
<blood_type>O</blood_type>
<phone_number>111-222-3333</phone_number>
</PatientRecord>
So in theory if I ONLY wanted to print out the values from the tags first name, mailing address, and blood type from this text file for all patients it should look like this:
john, 200 main street, AB
//this line is blank
micheal, 123 baker street, O
Thanks for any and all help. If you feel like my code should be modified then I'm all for it. Thank you.

My first gut feeling is to wrap the entire text content in some outer tag and process the text as XML, something like...
<Patients>
<PatientRecord> <---first patient record--->
<Info>
<age>66</age>
<first_name>john</first_name>
<last_name>smith</last_name>
<mailing_address>200 main street</mailing_address>
<blood_type>AB</blood_type>
<phone_number>000-000-0000</phone_number>
</PatientRecord>
...
</Patients>
But there are two problems with this...
One, <---first patient record---> isn't a valid XML comment or text, and two, there is no closing </Info> tag...[sigh]
So, my next thought was to read in each <PatientRecord> individually, as text, and then process that as XML...
Here come the problems...we need to remove anything surrounded by <--- ... ---> including the little arrows...This relies on a lot of assumptions about the file, but hopefully we can ignore that...
The next problem is that we need to insert a closing </Info> tag...
After that, it's all really easy...
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.SAXException;
public class Test {
private static final String RECORD_START_TAG = "<PatientRecord>";
private static final String RECORD_END_TAG = "</PatientRecord>";
public static void main(String[] args) {
File records = new File("Records.txt");
try (BufferedReader br = new BufferedReader(new FileReader(records))) {
StringBuilder record = null;
String text = null;
while ((text = br.readLine()) != null) {
if (text.contains("<---") && text.contains("--->")) {
String start = text.substring(0, text.indexOf("<---"));
String end = text.substring(text.indexOf("--->") + 4);
text = start + end;
}
if (text.trim().length() > 0) {
if (text.startsWith(RECORD_START_TAG)) {
record = new StringBuilder(128);
record.append(text);
} else if (text.startsWith(RECORD_END_TAG)) {
record.append("</Info>");
record.append(text);
try (ByteArrayInputStream bais = new ByteArrayInputStream(record.toString().getBytes())) {
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(bais);
XPath xPath = XPathFactory.newInstance().newXPath();
XPathExpression exp = xPath.compile("PatientRecord/Info/first_name");
Node firstName = (Node) exp.evaluate(doc, XPathConstants.NODE);
exp = xPath.compile("PatientRecord/Info/mailing_address");
Node address = (Node) exp.evaluate(doc, XPathConstants.NODE);
exp = xPath.compile("PatientRecord/Info/blood_type");
Node bloodType = (Node) exp.evaluate(doc, XPathConstants.NODE);
System.out.println(
firstName.getTextContent() + ", "
+ address.getTextContent() + ", "
+ bloodType.getTextContent());
} catch (ParserConfigurationException | XPathExpressionException | SAXException ex) {
ex.printStackTrace();
}
} else {
record.append(text);
}
}
}
} catch (IOException exp) {
exp.printStackTrace();
}
}
}
Which prints out...
john, 200 main street, AB
micheal, 123 baker street, O
The long and short of it is: go back to the person who gave you this file, slap them, then tell them to put it into a valid XML format...

Use a DOM parser and parse the text file. You can see one example in this link
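As a hedged illustration of the DOM approach, the sketch below parses a single record with the standard javax.xml.parsers API and pulls out three elements by tag name. It assumes the record has already been made well-formed (no <--- ... ---> markers and a closing </Info> tag), which the file in the question is not; the class DomRecordSketch and its methods are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DomRecordSketch {
    // Parses one well-formed record and returns
    // "first_name, mailing_address, blood_type".
    static String summarize(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return text(doc, "first_name") + ", "
                    + text(doc, "mailing_address") + ", "
                    + text(doc, "blood_type");
        } catch (Exception e) {
            // Parser, SAX and missing-element failures are all reported the same way here.
            throw new IllegalArgumentException("Could not read record", e);
        }
    }

    // Returns the text of the first element with the given tag name.
    private static String text(Document doc, String tag) {
        return doc.getElementsByTagName(tag).item(0).getTextContent();
    }

    public static void main(String[] args) {
        String record = "<PatientRecord><Info>"
                + "<age>66</age><first_name>john</first_name>"
                + "<mailing_address>200 main street</mailing_address>"
                + "<blood_type>AB</blood_type>"
                + "</Info></PatientRecord>";
        System.out.println(summarize(record));
    }
}
```

Looking elements up by tag name keeps the sketch short; XPath expressions, as in the longer answer above, are more precise when the same tag name can appear at several levels.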

Java Regex to remove all words after a key till end key

Can anyone out there please help me?
I have a file containing several pieces of important information, but it also contains irrelevant information. The irrelevant information is enclosed in curly
brackets, for example:
Function blah blah 1+2 {unwanted information} something+2
What I wish to do is remove the unwanted information and display the output like this:
Function blah blah 1+2 something+2
Can someone please give me the regex code for this?
I have partial code for this:
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.BufferedReader;
public class SimpleReader{
    public static void main( String a[] )
    {
        String source = readFile("source.java");
    }
    static String readFile(String fileName) {
        File file = new File(fileName);
        char[] buffer = null;
        try {
            BufferedReader bufferedReader = new BufferedReader( new FileReader(file));
            buffer = new char[(int)file.length()];
            int i = 0;
            int c = bufferedReader.read();
            while (c != -1) {
                buffer[i++] = (char)c;
                c = bufferedReader.read();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return new String(buffer);
    }
}
Thanks in advance.

newstr = str.replaceAll("\\{[^}]*\\}", ""); // the braces must be escaped, otherwise Java throws a PatternSyntaxException
Modified the answer from this question: How to remove entire substring from '<' to '>' in Java
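A small runnable sketch of the same idea follows. It additionally swallows the whitespace in front of the opening brace so that no double space is left behind, which matches the output shown in the question. The class BraceStripper and the method stripBraced are made-up names, and nested braces are assumed not to occur.

```java
class BraceStripper {
    // Removes every {...} block together with the whitespace just before it.
    // Assumption: blocks are not nested.
    static String stripBraced(String s) {
        return s.replaceAll("\\s*\\{[^}]*\\}", "");
    }

    public static void main(String[] args) {
        String line = "Function blah blah 1+2 {unwanted information} something+2";
        System.out.println(stripBraced(line));
    }
}
```

The result of readFile("source.java") from the question could be passed straight to stripBraced, since [^}] also matches line breaks.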

iText PdfTextExtractor Missing Ligatures in Resulting Text

I am attempting to take a pdf file and grab the text from it.
I found iText and have been using it with decent success. The one remaining problem is ligatures.
At first I noticed that I was simply missing characters. After doing some searches I came across this:
http://support.itextpdf.com/node/25
Once I knew that it was ligatures I was missing, I began to search for ways to solve the problem and haven't been able to come up with a solution yet.
Here is my code:
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.FilteredTextRenderListener;
import java.io.File;
import java.io.OutputStreamWriter;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.BufferedWriter;
import java.io.IOException;
import java.util.Formatter;
import java.lang.StringBuilder;
public class ReadPdf {
private static String INPUTFILE = "F:/Users/jmack/Webwork/Redglue_PDF/live/ADP/APR/ADP_41.pdf";
public static void writeTextFile(String fileName, String s) {
// s = s.replaceAll("\u0063\u006B", "just a test");
s = s.replaceAll("\uFB00", "ff");
s = s.replaceAll("\uFB01", "fi");
s = s.replaceAll("\uFB02", "fl");
s = s.replaceAll("\uFB03", "ffi");
s = s.replaceAll("\uFB04", "ffl");
s = s.replaceAll("\uFB05", "ft");
s = s.replaceAll("\uFB06", "st");
s = s.replaceAll("\u0132", "IJ");
s = s.replaceAll("\u0133", "ij");
FileWriter output = null;
try {
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(fileName), "UTF-8"));
writer.write(s);
writer.close();
} catch (IOException e) {
e.printStackTrace();
} finally {
if (output != null) {
try {
output.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
public static void main(String[] args) {
try {
PdfReader reader = new PdfReader(INPUTFILE);
int n = reader.getNumberOfPages();
String str = PdfTextExtractor.getTextFromPage(reader, 1, new SimpleTextExtractionStrategy());
writeTextFile("F:/Users/jmack/Webwork/Redglue_PDF/live/itext/read_test.txt", str);
}
catch (Exception e) {
System.out.println(e);
}
}
}
In the PDF referenced above one line reads:
part of its design difference is a roofline
But when I run the Java class above the text output contains:
part of its design diference is a roofine
Notice that difference became diference and roofline became roofine.
It is interesting to note that when I copy and paste from the PDF to stackoverflow's textfield, it also looks like the second sentence with the two ligatures "ff" and "fl" reduced to simply "f"s.
I am hoping that someone here can help me figure out how to catch the ligatures and perhaps replace them with the characters they represent, as in the ligature "fl" being replaced with an actual "f" and an "l".
I ran some tests on the output from the PDFTextExtractor and attempted to replace the ligature unicode characters with the actual characters, but discovered that the unicode characters for those ligatures do not exist in the value it returns.
It seems that it must be something in iText itself that is not reading those ligatures correctly. I am hopeful that someone knows how to work around that.
Thank you for any help you can give!
TLDR: Converting PDF to text with iText, had missing characters, discovered they were ligatures, now I need to capture those ligatures, not sure how to go about doing that.
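For reference, when a string does contain the Unicode ligature code points, the chain of replaceAll calls in writeTextFile can be expressed with the JDK's java.text.Normalizer using compatibility normalization (NFKC). This is only a sketch of that replacement step: as described above, the extracted text in this case does not contain those code points at all, so it would not recover the missing characters. The class LigatureExpander is a hypothetical name, and NFKC also rewrites other compatibility characters (for example full-width forms), which may or may not be wanted.

```java
import java.text.Normalizer;

class LigatureExpander {
    // Expands compatibility characters such as U+FB00 (ff) and U+FB02 (fl)
    // into their plain-letter equivalents.
    static String expand(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // The escapes below are the ff and fl ligature code points.
        System.out.println(expand("di\uFB00erence is a roo\uFB02ine"));
    }
}
```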
