Web scraper not creating CSV file - java

I have created a web scraper that fetches share-rate market data from the stock exchange website, www.psx.com.pk. That site has a Market Summary hyperlink, and I need to scrape the data behind that link. My program is as follows.
package com.market_summary;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.Iterator;
import java.util.Locale;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class ComMarket_summary {
boolean writeCSVToConsole = true;
boolean writeCSVToFile = true;
boolean sortTheList = true;
boolean writeToConsole;
boolean writeToFile;
public static Document doc = null;
public static Elements tbodyElements = null;
public static Elements elements = null;
public static Elements tdElements = null;
public static Elements trElement2 = null;
public static String Dcomma = ",";
public static String line = "";
public static ArrayList<Elements> sampleList = new ArrayList<Elements>();
public static void createConnection() throws IOException {
System.setProperty("http.proxyHost", "191.1.1.202");
System.setProperty("http.proxyPort", "8080");
String tempUrl = "http://www.psx.com.pk/index.php";
doc = Jsoup.connect(tempUrl).get();
System.out.println("Successfully Connected");
}
public static void parsingHTML() throws Exception {
File fold = new File("C:\\market_smry.csv");
fold.delete();
File fnew = new File("C:\\market_smry.csv");
for (Element table : doc.getElementsByTag("table")) {
for (Element trElement : table.getElementsByTag("tr")) {
trElement2 = trElement.getElementsByTag("td");
tdElements = trElement.getElementsByTag("td");
FileWriter sb = new FileWriter(fnew, true);
if (trElement.hasClass("marketData")) {
for (Iterator<Element> it = tdElements.iterator(); it.hasNext();) {
if (it.hasNext()) {
sb.append("\r\n");
}
for (Iterator<Element> it2 = trElement2.iterator(); it.hasNext();) {
Element tdElement2 = it.next();
final String content = tdElement2.text();
if (it2.hasNext()) {
sb.append(formatData(content));
sb.append(" | ");
}
}
System.out.println(sb.toString());
sb.flush();
sb.close();
}
}
System.out.println(sampleList.add(tdElements));
}
}
}
private static final SimpleDateFormat FORMATTER_MMM_d_yyyy = new SimpleDateFormat("MMM d, yyyy", Locale.US);
private static final SimpleDateFormat FORMATTER_dd_MMM_yyyy = new SimpleDateFormat("dd-MMM-yyyy", Locale.US); // lowercase yyyy: calendar year (YYYY is the week year)
public static String formatData(String text) {
String tmp = null;
try {
Date d = FORMATTER_MMM_d_yyyy.parse(text);
tmp = FORMATTER_dd_MMM_yyyy.format(d);
} catch (ParseException pe) {
tmp = text;
}
return tmp;
}
public static void main(String[] args) throws IOException, Exception {
createConnection();
parsingHTML();
}
}
Now, the problem is that when I execute this program it should create a .csv file, but no file is created. While debugging I found that the program never enters the loop, and I don't understand why. When I run the same program against another website with a slightly different page structure, it works fine.
As far as I understand, the data sits inside a #document node, which is a virtual element, and that is why the program can't read it; the other website has no such node. Kindly help me read the data inside the #document element.

Long Story Short
Change your temp url to http://www.psx.com.pk/phps/index1.php
Explanation
There is no table in the document at http://www.psx.com.pk/index.php.
Instead, it shows its content through a frameset with two frames.
One is a dummy frame with the URL http://www.psx.com.pk/phps/blank.php.
The other is the real page that shows the actual data; its URL is
http://www.psx.com.pk/phps/index1.php

Related

How can I do web scraping in this case?

I am trying to scrape text from https://in-the-sky.org/data/object.php?id=A216&day=17&month=6&year=2022
so I wrote code like this:
import java.util.Iterator;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Main {
public static void main(String args[]) {
int num = 216;
int day = 17;
int month = 6;
int year = 2022;
String url ="https://in-the-sky.org/data/object.php?id=A"+Integer.toString(num)+"&day="+Integer.toString(day)+"&month="+Integer.toString(month)+"&year="+Integer.toString(year);
System.out.println(url);
Document doc = null;
try {
doc = Jsoup.connect(url).get();
} catch (Exception e) {
// TODO: handle exception
e.printStackTrace();
}
System.out.println("=======================================================");
Elements element = doc.select("div.col-md-6 col-md-pull-6");
String output = element.select("p").text();
System.out.println(output);
System.out.println("=======================================================");
}
}
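but it doesn't work well. I would like someone to help me, please.
The selector "div.col-md-6 col-md-pull-6" does not match what you expect: the space is a descendant combinator, so jsoup looks for an element named col-md-pull-6 inside div.col-md-6. To require both classes on the same div, chain them without a space: doc.select("div.col-md-6.col-md-pull-6"). I believe that you can also use Elements element = doc.select("div.col-md-6 > p"); to get your desired output.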

Creating an ArrayList in an ArrayList in Java

This is my first post, so sorry if I mess something up or if I am not clear enough. I have been looking through online forums for several hours and spent more time trying to figure it out for myself.
I am reading information from a file and I need a loop that creates an ArrayList every time it goes through.
static ArrayList<String> fileToArrayList(String infoFromFile)
{
ArrayList<String> smallerArray = new ArrayList<String>();
//This ArrayList needs to be different every time so that I can add them
//all to the same ArrayList
if (infoFromFile != null)
{
String[] splitData = infoFromFile.split(":");
for (int i = 0; i < splitData.length; i++)
{
if (!(splitData[i] == null) || !(splitData[i].length() == 0))
{
smallerArray.add(splitData[i].trim());
}
}
}
The reason I need to do this is that I am creating an app for a school project that reads questions from a delimited text file. I have a loop earlier that reads one line at a time from the text. I will insert that string into this program.
How do I make the ArrayList smallerArray a separate ArrayList every time it goes through this method?
I need this so I can have an ArrayList of these ArrayLists.
Here is sample code for what you intend to do:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;
public class SimpleFileReader {
private static final String DELEMETER = ":";
private String filename = null;
public SimpleFileReader() {
super();
}
public SimpleFileReader(String filename) {
super();
setFilename(filename);
}
public String getFilename() {
return filename;
}
public void setFilename(String filename) {
this.filename = filename;
}
public List<List<String>> getRowSet() throws IOException {
List<List<String>> rows = new ArrayList<>();
try (Stream<String> stream = Files.lines(Paths.get(filename))) {
stream.forEach(row -> rows.add(Arrays.asList(row.split(DELEMETER))));
}
return rows;
}
}
And here is the JUnit test for the above code:
import static org.junit.jupiter.api.Assertions.fail;
import java.io.IOException;
import java.util.List;
import org.junit.jupiter.api.Test;
public class SimpleFileReaderTest {
public SimpleFileReaderTest() {
super();
}
@Test
public void testFileReader() {
try {
SimpleFileReader reader = new SimpleFileReader("c:/temp/sample-input.txt");
List<List<String>> rows = reader.getRowSet();
int expectedValue = 3; // number of actual lines in the sample file
int actualValue = rows.size(); // number of rows in the list
if (actualValue != expectedValue) {
fail(String.format("Expected value for the row count is %d, whereas obtained value is %d", expectedValue, actualValue));
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
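For the method in the question specifically, a minimal sketch may help: because `new ArrayList` runs inside the method, every call already produces a separate list, so the caller only needs to collect the returned lists into an outer list. The class name `QuestionRows`, the renamed method `fileToList`, and the sample lines are assumptions made for illustration; the sketch also returns the list and skips empty fields, which the posted snippet appears to intend.

```java
import java.util.ArrayList;
import java.util.List;

class QuestionRows {
    // Splits one ':'-delimited line into trimmed, non-empty fields.
    // A brand-new list is allocated on every call, so results never share state.
    static List<String> fileToList(String infoFromFile) {
        List<String> fields = new ArrayList<>();
        if (infoFromFile != null) {
            for (String part : infoFromFile.split(":")) {
                String trimmed = part.trim();
                if (!trimmed.isEmpty()) {
                    fields.add(trimmed);
                }
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical lines standing in for what the earlier read loop produces.
        String[] lines = {
            "What is 2+2? : 3 : 4 : 5",
            "Capital of France? : Paris : Rome"
        };
        List<List<String>> allRows = new ArrayList<>();
        for (String line : lines) {
            allRows.add(fileToList(line)); // each element is its own list
        }
        System.out.println(allRows);
    }
}
```

Note that the condition in the question joins its two checks with `||`; both checks have to hold, which is why the sketch uses a single non-empty test after trimming.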

Date format gets disturbed when creating .CSV file in Java

I am creating a web scraper that stores the scraped data in a .CSV file.
My program runs fine, but the website I retrieve data from shows its dates in (Month Day, Year) format. When I save the data to the .CSV file, the year is treated as a separate column, which shifts all the remaining data. I actually want to store the date in (DD-MON-YYYY) format and keep the Validity Date in one column. I am posting my code below. Kindly help me out. Thanks!
P.S.: I am sorry for not stating the format I want in the original post.
package com.mufapscraping;
//import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
//import java.util.Collections;
import java.util.Iterator;
//import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class ComMufapScraping {
boolean writeCSVToConsole = true;
boolean writeCSVToFile = true;
//String destinationCSVFile = "C:\\convertedCSV.csv";
boolean sortTheList = true;
boolean writeToConsole;
boolean writeToFile;
public static Document doc = null;
public static Elements tbodyElements = null;
public static Elements elements = null;
public static Elements tdElements = null;
public static Elements trElement2 = null;
public static String Dcomma = ", 2";
public static ArrayList<Elements> sampleList = new ArrayList<Elements>();
public static void createConnection() throws IOException {
System.setProperty("http.proxyHost", "191.1.1.123");
System.setProperty("http.proxyPort", "8080");
String tempUrl = "http://www.mufap.com.pk/nav_returns_performance.php?tab=01";
doc = Jsoup.connect(tempUrl).get();
}
public static void parsingHTML() throws Exception {
for (int i = 1; i <= 1; i++) {
tbodyElements = doc.getElementsByTag("tbody");
//Element table = doc.getElementById("dataTable");
if (tbodyElements.isEmpty()) {
throw new Exception("Table is not found");
}
elements = tbodyElements.get(0).getElementsByTag("tr");
for (Element trElement : elements) {
trElement2 = trElement.getElementsByTag("tr");
tdElements = trElement.getElementsByTag("td");
FileWriter sb = new FileWriter("C:\\convertedCSV2.csv", true);
for (Iterator<Element> it = tdElements.iterator(); it.hasNext();) {
if (it.hasNext()) {
sb.append(" \n ");
}
for (Iterator<Element> it2 = trElement2.iterator(); it.hasNext();) {
Element tdElement = it.next();
sb.append(tdElement.text());
if (it2.hasNext()) {
sb.append(" , ");
}
}
System.out.println(sb.toString());
sb.flush();
sb.close();
}
System.out.println(sampleList.add(tdElements));
/* for (Elements elements2 : zakazky) {
System.out.println(elements2);
}*/
}
}
}
public static void main(String[] args) throws IOException, Exception {
createConnection();
parsingHTML();
}
}
Instead of appending the element text directly to the FileWriter, format it first and then append it.
So, replace the following line:
sb.append(tdElement.text());
with
sb.append(formatData(tdElement.text()));
private static final SimpleDateFormat FORMATTER_MMM_d_yyyy = new SimpleDateFormat("MMM d, yyyy", Locale.US);
private static final SimpleDateFormat FORMATTER_dd_MMM_yyyy = new SimpleDateFormat("dd-MMM-yyyy", Locale.US); // lowercase yyyy: calendar year (YYYY is the week year and can be off around New Year)
public static String formatData(String text) {
String tmp = null;
try {
Date d = FORMATTER_MMM_d_yyyy.parse(text);
tmp = FORMATTER_dd_MMM_yyyy.format(d);
} catch (ParseException pe) {
tmp = text;
}
return tmp;
}
SAMPLE
public static void main(String[] args) {
String[] fields = new String[] { //
"ABL Cash Fund", //
"AA(f)", //
"Apr 18, 2016", //
"10.4729" //
};
for (String field : fields) {
System.out.format("%s\n%s\n\n", field, formatData(field));
}
}
OUTPUT
ABL Cash Fund
ABL Cash Fund
AA(f)
AA(f)
Apr 18, 2016
18-Apr-2016
10.4729
10.4729
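As a possible alternative, the same conversion can be sketched with java.time, whose formatters are immutable and safe to share between threads (SimpleDateFormat is not). This swaps in a different API from the one used in the answer; the class name `DateCells` is an assumption, and unlike `SimpleDateFormat.parse`, `LocalDate.parse` requires the whole cell text to match the pattern.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

class DateCells {
    // Input pattern as shown on the site, e.g. "Apr 18, 2016".
    private static final DateTimeFormatter IN =
            DateTimeFormatter.ofPattern("MMM d, yyyy", Locale.US);
    // Output pattern without a comma, e.g. "18-Apr-2016".
    private static final DateTimeFormatter OUT =
            DateTimeFormatter.ofPattern("dd-MMM-yyyy", Locale.US);

    // Returns the reformatted date, or the text unchanged when it is not a date.
    static String formatData(String text) {
        try {
            return LocalDate.parse(text, IN).format(OUT);
        } catch (DateTimeParseException e) {
            return text;
        }
    }

    public static void main(String[] args) {
        String[] fields = { "ABL Cash Fund", "AA(f)", "Apr 18, 2016", "10.4729" };
        for (String field : fields) {
            System.out.println(field + " -> " + formatData(field));
        }
    }
}
```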
Instead of calling getElementsByTag many times, you can use a CSS selector, which is much easier and gives you the same output in a few lines of code:
public static void main(String[] args) throws IOException {
String tempUrl = "http://www.mufap.com.pk/nav_returns_performance.php?tab=01";
Document doc = Jsoup.connect(tempUrl).get();
Elements trElements = doc.select("#dataTable tbody tr");
// try-with-resources flushes and closes the writer, so the rows actually reach the file
try (FileWriter sb = new FileWriter("C:\\convertedCSV2.csv", true)) {
for (Element tr : trElements) {
Elements tdElements = tr.select("td");
for (Element td : tdElements) {
sb.append(td.text());
sb.append(";");
}
sb.append("\n");
}
}
}
This could be achieved by simply surrounding your data with double quotes, so month day, year would become "month day, year". Here's modified code that does the job for you:
package com.mufapscraping;
//import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
//import java.util.Collections;
import java.util.Iterator;
//import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class ComMufapScraping {
boolean writeCSVToConsole = true;
boolean writeCSVToFile = true;
//String destinationCSVFile = "C:\\convertedCSV.csv";
boolean sortTheList = true;
boolean writeToConsole;
boolean writeToFile;
public static Document doc = null;
public static Elements tbodyElements = null;
public static Elements elements = null;
public static Elements tdElements = null;
public static Elements trElement2 = null;
public static String Dcomma = ", 2";
public static ArrayList<Elements> sampleList = new ArrayList<Elements>();
public static void createConnection() throws IOException {
System.setProperty("http.proxyHost", "191.1.1.123");
System.setProperty("http.proxyPort", "8080");
String tempUrl = "http://www.mufap.com.pk/nav_returns_performance.php?tab=01";
doc = Jsoup.connect(tempUrl).get();
}
public static void parsingHTML() throws Exception {
for (int i = 1; i <= 1; i++) {
tbodyElements = doc.getElementsByTag("tbody");
//Element table = doc.getElementById("dataTable");
if (tbodyElements.isEmpty()) {
throw new Exception("Table is not found");
}
elements = tbodyElements.get(0).getElementsByTag("tr");
for (Element trElement : elements) {
trElement2 = trElement.getElementsByTag("tr");
tdElements = trElement.getElementsByTag("td");
FileWriter sb = new FileWriter("C:\\convertedCSV2.csv", true);
for (Iterator<Element> it = tdElements.iterator(); it.hasNext();) {
if (it.hasNext()) {
sb.append(" \n ");
}
for (Iterator<Element> it2 = trElement2.iterator(); it.hasNext();) {
Element tdElement = it.next();
sb.append('\"'); // surround your data
sb.append(tdElement.text());
sb.append('\"'); // with double quotes
if (it2.hasNext()) {
sb.append(" , ");
}
}
System.out.println(sb.toString());
sb.flush();
sb.close();
}
System.out.println(sampleList.add(tdElements));
/* for (Elements elements2 : zakazky) {
System.out.println(elements2);
}*/
}
}
}
public static void main(String[] args) throws IOException, Exception {
createConnection();
parsingHTML();
}
}
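The quoting idea can also be isolated into a small helper, sketched here with only the standard library. The class and method names (`CsvQuoting`, `quote`, `toCsvLine`) are assumptions for illustration. The sketch quotes a field only when it contains a comma, a double quote, or a line break, and doubles any embedded quotes, which is the convention most CSV readers follow.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class CsvQuoting {
    // Wraps a field in double quotes when it contains a character that would
    // otherwise be read as CSV structure; embedded quotes are doubled.
    static String quote(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Joins the fields with plain commas (no padding spaces around them).
    static String toCsvLine(List<String> fields) {
        return fields.stream().map(CsvQuoting::quote).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        List<String> row = Arrays.asList("ABL Cash Fund", "AA(f)", "Apr 18, 2016", "10.4729");
        System.out.println(toCsvLine(row));
    }
}
```

The separator here is a bare comma rather than " , ": some CSV readers only recognise a quoted field when the quote is the first character after the separator, so padding spaces may defeat the quoting.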
Then you do want to split it. OK, then modify the first line (the header row) by adding a "Validity Year" column:
Element tdElement = it.next();
final String content = tdElement.text();
sb.append(content);
if (it2.hasNext()) {
sb.append(" , ");
}
if (content.equals("Validity Date")) {
sb.append("Validity Year,");
}
you probably want to break after the for? or you'll overwrite the file elements.size()-1 times...
FileWriter sb = new FileWriter("C:\\convertedCSV2.csv", true);
for (Iterator<Element> it = tdElements.iterator(); it.hasNext();) { ... }
break;

jsoup java html parsing

I'm a new French user on Stack Overflow and I have a problem.
I use the Jsoup HTML parser to parse an HTML page. That part works, but I can't parse several URLs at the same time.
This is my code.
The first class, for parsing a web page:
package test2;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public final class Utils {
public static Map<String, String> parse(String url){
Map<String, String> out = new HashMap<String, String>();
try
{
Document doc = Jsoup.connect(url).get();
doc.select("img").remove();
Elements denomination = doc.select(".AmmDenomination");
Elements composition = doc.select(".AmmComposition");
Elements corptexte = doc.select(".AmmCorpTexte");
for(int i = 0; i < denomination.size(); i++)
{
out.put("denomination" + i, denomination.get(i).text());
}
for(int i = 0; i < composition.size(); i++)
{
out.put("composition" + i, composition.get(i).text());
}
for(int i = 0; i < corptexte.size(); i++)
{
out.put("corptexte" + i, corptexte.get(i).text());
System.out.println(corptexte.get(i));
}
} catch(IOException e){
e.printStackTrace();
}
return out;
}//Fin Methode parse
public static void excelizer(int fileId, Map<String, String> values){
try
{
FileOutputStream out = new FileOutputStream("C:/Documents and Settings/c.bon/git/clinsearch/drugs/src/main/resources/META-INF/test/fichier2.xls" );
Workbook wb = new HSSFWorkbook();
Sheet mySheet = wb.createSheet();
Row row1 = mySheet.createRow(0);
Row row2 = mySheet.createRow(1);
String entete[] = {"CIS", "Denomination", "Composition", "Form pharma", "Indication therapeutiques", "Posologie", "Contre indication", "Mise en garde",
"Interraction", "Effet indesirable", "Surdosage", "Pharmacodinamie", "Liste excipients", "Incompatibilité", "Duree conservation",
"Conservation", "Emballage", "Utilisation Manipulation", "TitulaireAMM"};
for (int i = 0; i < entete.length; i++)
{
row1.createCell(i).setCellValue(entete[i]);
}
Set<String> set = values.keySet();
int rowIndexDenom = 1;
int rowIndexCompo = 1;
for(String key : set)
{
if(key.contains("denomination"))
{
mySheet.createRow(1).createCell(1).setCellValue(values.get(key));
rowIndexDenom++;
}
else if(key.contains("composition"))
{
row2.createCell(2).setCellValue(values.get(key));
rowIndexDenom++;
}
}
wb.write(out);
out.close();
}
catch(Exception e)
{
e.printStackTrace();
}
}
}
second class
package test2;
public final class Task extends Thread {
private static int fileId = 0;
private int id;
private String url;
public Task(String url)
{
this.url = url;
id = fileId;
fileId++;
}
@Override
public void run()
{
Utils.excelizer(id, Utils.parse(url));
}
}
the main class (entry point)
package test2;
import java.util.ArrayList;
public class Main {
public static void main(String[] args)
{
ArrayList<String> urls = new ArrayList<String>();
urls.add("http://base-donnees-publique.medicaments.gouv.fr/affichageDoc.php?specid=61266250&typedoc=R");
urls.add("http://base-donnees-publique.medicaments.gouv.fr/affichageDoc.php?specid=66207341&typedoc=R");
for(String url : urls)
{
new Task(url).run();
}
}
}
When the data is copied to my Excel file, the second URL doesn't work.
Can you help me solve my problem, please?
Thanks
I think it's because your main() exits before your second thread has a chance to do its job. You should wait for all spawned threads to complete using Thread.join(). Or better yet, create an ExecutorService and use awaitTermination(...) to block until all URLs are parsed.
EDIT See some examples here http://www.javacodegeeks.com/2013/01/java-thread-pool-example-using-executors-and-threadpoolexecutor.html
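A minimal sketch of the ExecutorService approach follows, using only the standard library. Two details of the posted code are worth checking alongside it: Main calls `run()` directly, which executes on the calling thread (`start()` is what launches a new thread), and `excelizer` receives `fileId` but always opens the same fichier2.xls path, so the second result may simply replace the first. The class `UrlBatch`, the helper `outputNameFor`, and the per-task file name are hypothetical; the task body is a placeholder for `Utils.excelizer(id, Utils.parse(url))`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class UrlBatch {
    // Hypothetical naming scheme: one output file per task id.
    static String outputNameFor(int id) {
        return "fichier" + id + ".xls";
    }

    // Runs one task per URL on a small pool and blocks until all have finished.
    static Map<String, String> processAll(List<String> urls) {
        Map<String, String> outputs = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (int i = 0; i < urls.size(); i++) {
            final int id = i;
            final String url = urls.get(i);
            pool.execute(() -> {
                // Placeholder for the real work, e.g. Utils.excelizer(id, Utils.parse(url)).
                outputs.put(url, outputNameFor(id));
            });
        }
        pool.shutdown(); // accept no new tasks; already submitted ones still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow(); // give up on tasks that take too long
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
        return outputs;
    }

    public static void main(String[] args) {
        // Placeholder URLs; the real list comes from Main in the question.
        System.out.println(processAll(Arrays.asList("http://example.com/a", "http://example.com/b")));
    }
}
```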

Saving Vectors In Java

How would I go about saving a String Vector to a file every time it is edited?
So let's say I have usernames in a vector, after I add or delete a username I'd like it to save that vector so if the program is closed, it will show the most recent elements.
This should help you get started.
As JB Nizet said, you should use an ArrayList.
I also went ahead and used the Java 7 try-with-resources statement (AutoCloseable), which ensures you close file handles appropriately.
Of course, you will need to validate your input, and you will want to take care about what you persist. I suspect that you will soon want to consider a better storage strategy; however, this will get you started.
In addition, since this is acting like a collection, you should add hashCode and equals. For brevity's sake, I did not add those.
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
public class PersistedCollection {
private static final String NEWLINE_SEPARATOR = System.getProperty("line.separator");
private final List<String> values;
private final File file;
public PersistedCollection(File file) {
this.values = new ArrayList<>();
this.file = file;
}
public void add(String value) {
// You should validate this value. Remove carriage returns, make sure it meets your value specifications.
values.add(value);
persist();
}
public void remove(String value) {
values.remove(value);
persist();
}
private void persist() {
// Using Java 7 autocloseable to ensure that the output stream is closed, even in exceptional circumstances.
try (OutputStream outputStream = new BufferedOutputStream(new FileOutputStream(this.file), 8192); Writer writer = new PrintWriter(outputStream)) {
for (String value : values) {
writer.append(value);
writer.append(NEWLINE_SEPARATOR);
}
} catch (IOException e) {
e.printStackTrace();
}
}
@Override
public String toString() {
StringBuilder builder = new StringBuilder();
builder.append("PersistedCollection [values=");
builder.append(values);
builder.append(", file=");
builder.append(file);
builder.append("]");
return builder.toString();
}
public static void main(String[] arguments) {
PersistedCollection persistedCollection = new PersistedCollection(new File("/tmp/test.txt"));
persistedCollection.add("jazeee");
persistedCollection.add("temporary user");
persistedCollection.add("user402442");
persistedCollection.add("JB Nizet");
persistedCollection.remove("temporary user");
System.out.println(persistedCollection);
}
}
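The class above writes the file on every change but never reads it back, so the saved names are not restored on the next start. A minimal sketch of the missing load step, using java.nio.file, might look like this. The class `UsernameStore` and its method names are assumptions for illustration, and `roundTrip` exists only to demonstrate the save/load pair against a temporary file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UsernameStore {
    // Writes one name per line, replacing any previous file content.
    static void save(Path file, List<String> names) {
        try {
            Files.write(file, names, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the names back; a missing file is treated as an empty list.
    static List<String> load(Path file) {
        try {
            if (!Files.exists(file)) {
                return new ArrayList<>();
            }
            return new ArrayList<>(Files.readAllLines(file, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demonstration helper: saves to a temporary file and loads it again.
    static List<String> roundTrip(List<String> names) {
        try {
            Path tmp = Files.createTempFile("usernames", ".txt");
            try {
                save(tmp, names);
                return load(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(Arrays.asList("jazeee", "user402442", "JB Nizet")));
    }
}
```

Calling a method like `load` once at startup and a method like `save` after each add or remove would give the behaviour the question asks for, under the assumption that names never contain line breaks.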
Another solution would be to create a class where you add all the methods required to read from a file of usernames (one username per line). Then you can refer to this class from anywhere (as the modifier is public) and call the methods such that you will add or remove usernames from that file.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.io.File;
public class Test {
private static BufferedWriter bw;
private static ArrayList<String> vector=new ArrayList<String>();
private static String everything;
//add an username
public static void add(String x){
vector.add(x);
}
//remove an username
public static void remove(String x){
vector.remove(x);
}
//update the file with the new vector of usernames
public static void updateToFile() throws IOException{
File username = new File("/home/path/to/the/file");
FileWriter fw = new FileWriter(username.getAbsoluteFile());
bw= new BufferedWriter(fw);
for (String x:vector){
bw.write(x.toString());
bw.write("\n");
}
bw.close();
}
//you call this method to initialise your vector of usernames
//this implies that you already have a file of usernames
//one username per line
public static void setUsername() throws IOException{
vector=new ArrayList<String>();
BufferedReader br = new BufferedReader(new FileReader("/home/path/to/the/file"));
try {
StringBuilder sb = new StringBuilder();
String line = br.readLine();
while (line != null) {
sb.append(line);
sb.append(System.lineSeparator());
line = br.readLine();
}
everything = sb.toString();
} finally {
br.close();
}
String lines[] = everything.split("\\r?\\n");
for (String x:lines){
vector.add(x);
}
}
//print your usernames in the console
public static void printUsers(){
for (String User:vector){
System.out.println(User);
}
}
}
Then it gets as easy as this:
import java.io.IOException;
public class MainTest {
public static void main(String[] args) throws IOException{
Test.setUsername();
Test.printUsers();
Test.add("username5");
Test.remove("username2");
System.out.println("// add username5; remove username2");
Test.printUsers();
System.out.println("// file has been updated with the new state");
Test.updateToFile();
System.out.println("// verifying update");
Test.setUsername();
Test.printUsers();
}
}
The output:
(these first 4 users are what I have in the file)
username1
username2
username3
username4
// add username5; remove username2
username1
username3
username4
username5
// file has been updated with the new state
// verifying update
username1
username3
username4
username5
