Extracting URLs with elem.absUrl - java

I have a program; all I need it to do is extract URLs from a text file and save them into another text file. The code calls ExtractHTML2.getURL2(url, input);, which simply extracts the HTML code for a given link (that part works correctly, so there is no need to include its code here).
EDIT: The code parses a number of pages; for each page, it saves the page's HTML code in a text file, then parses that text file to extract 10 links.
Now, the following code is supposed to parse the extracted HTML code and extract the URLs. This does not work for me. It does not extract anything.
CODE EDITED:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.*;

public class ExtractLinks2 {
    public static void getLinks2(String url, int pages) throws IOException {
        Document doc;
        Element link;
        String elementLink = null;
        int linkId = 1; // the id of the href tag inside the HTML code
        // The file that contains the extracted HTML code for the web page
        File input = new File("extracted.txt");
        // To write the extracted links
        FileWriter fstream = new FileWriter("links.txt");
        BufferedWriter out = new BufferedWriter(fstream);
        // Loop to traverse the pages
        for (int z = 1; z <= pages; z++) {
            // Get the HTML code for that page and save it in input (extracted.txt)
            ExtractHTML2.getURL2(url, input);
            // Using the parse function from the Jsoup library
            doc = Jsoup.parse(input, "UTF-8");
            // Loop 10 times to extract 10 links per page
            for (int e = 1; e <= 10; e++) {
                link = doc.getElementById("link-" + linkId); // the href tag id
                System.out.println("This is link no." + linkId);
                elementLink = link.absUrl("href");
                // Write the extracted link to the text file
                out.write(elementLink);
                out.write(","); // add a comma
                linkId++;
            }
            linkId = 1; // reset the linkId for the next page
        }
        out.close();
    } // end the getLinks2 method
} // end the ExtractLinks2 class
As I said, my program does not extract the URLs. I have doubts about my syntax for Jsoup.parse. Referring to http://jsoup.org/cookbook/input/load-document-from-file, there is an optional third argument that I left out, as I think it is not needed in my case: I need to extract from a text file, not an HTML page.
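For reference, here is a minimal sketch of the cookbook's three-argument form. Note that, per the Jsoup docs, the third argument is the base URI that absUrl("href") uses to resolve relative links, so it can matter even when parsing from a file; the URL below is only a placeholder, not the real site:

import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Sketch of the three-argument parse from the cookbook.
// "http://example.com/" is a placeholder base URI; absUrl("href")
// resolves relative hrefs against this value.
File input = new File("extracted.txt");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");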
My program is able to extract the href tag's text if I type eURL = elem.text();, but I don't need the text, I need the URL itself. E.g., if I have the following:
<a id="link-1" class="yschttl spt" href="/r/_ylt=A7x9QXi_UOlPrmgAYKpLBQx.;
_ylu=X3oDMTBzcG12Mm9lBHNlYwNzcgRwb3MDMTEEY29sbwNpcmQEdnRpZAM-/SIG=1329l4otf/
EXP=1340719423/**http%3a//www.which.co.uk/technology/computing/guides/how-to-buy
-the-best-laptop/" data-bk="5040.1">How to <b>buy</b> the best <b>laptop</b>
- <b>Laptop</b> <wbr />reviews - Computing ...</a>
I only need "www.which.co.uk", or even better "which.co.uk" if there is a way to do that.
Why does the above program not extract URLs, and how can I correct the problem?

The problem was in this line:
link = doc.getElementById("link-"+linkId);
It should be:
link = doc.getElementById("link-" + Integer.toString(linkId));
Since linkId is an integer and getElementById takes a String as its parameter, I had to convert the id to a String first, so the input for getElementById takes the form link-1, link-2, etc.
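As for reducing the extracted href to just "which.co.uk": one hedged approach, sketched against the sample markup in the question (the "**" marker and the percent encoding are assumptions based on that sample, and the checked exceptions are left to the caller):

import java.net.URI;
import java.net.URLDecoder;

// Sketch: pull the real target out of the redirect-style href shown in the
// question and reduce it to its host name. Checked exceptions
// (UnsupportedEncodingException, URISyntaxException) are omitted here.
String href = link.attr("href");
int marker = href.indexOf("**");
if (marker != -1) {
    String decoded = URLDecoder.decode(href.substring(marker + 2), "UTF-8");
    String host = new URI(decoded).getHost();   // "www.which.co.uk"
    if (host != null && host.startsWith("www.")) {
        host = host.substring(4);               // "which.co.uk"
    }
    System.out.println(host);
}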

Related

Parse data from webpage to android app using Jsoup

My Android app has a part where I need to parse data from wikipedia.com and use it in the application. When I go to https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data I can see the COVID-19 cases. I want to retrieve the numbers from the table.
I am using Jsoup. I am able to get the HTML data by using https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Template:COVID-19_pandemic_data. Can you guide me on how to extract the India cases and deaths from the HTML file? The HTML doc is huge and there is no attr for tr. There's not much information about this on the internet. What I have tried so far:
private void getWebsite() {
    new Thread(new Runnable() {
        @Override
        public void run() {
            final StringBuilder builder = new StringBuilder();
            String web_link = "https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Template:COVID-19_pandemic_data";
            try {
                Document doc = Jsoup.connect(web_link).get();
                String title = doc.title();
                Elements links = doc.select("tr");
                builder.append(title).append("\n");
                for (Element link : links) {
                    builder.append(link);
                }
            } catch (IOException e) {
                builder.append("Error : ").append(e.getMessage()).append("\n");
            }
            runOnUiThread(new Runnable() {
                @Override
                public void run() {
                    textView.setText(builder.toString());
                }
            });
        }
    }).start();
}
The problem is related to the format of the data (XML). When you navigate down the XML elements, you find that what's displayed in the document, when viewed via your browser, is:
<someTag>...</someTag>
But what's actually present is the XML-encoded version of the string:
&lt;someTag&gt;...&lt;/someTag&gt;
Jsoup won't work well with this, and you'll need further processing to convert the output to more XML to get it working, I'd imagine. You can test this yourself by viewing the result of:
doc.getElementsByTag("text")
And you'd need to replace all &lt; and &gt; tokens with < and > respectively.
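One hedged way to do that further processing, as a sketch not tested against the live API: take the payload of the <text> element and re-parse it as a second document. Jsoup's Parser.unescapeEntities can decode any entities that survive the first parse:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

// Sketch: grab the payload of the <text> tag, make sure the entities are
// decoded, and re-parse the result as a normal HTML document.
String payload = doc.getElementsByTag("text").text();
String markup = Parser.unescapeEntities(payload, false); // no-op if already decoded
Document inner = Jsoup.parse(markup);
System.out.println(inner.select("tr").size()); // the rows are now reachable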
Here's what I tried, plus some minor edits after failing to pull tbody/thead/th. I then started trying to pull from the top-level tag, starting with api, and moving deeper into the DOM.
final StringBuilder builder = new StringBuilder();
String url = "https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Template:COVID-19_pandemic_data";
try {
    Document doc = Jsoup.connect(url).get();
    String title = doc.getElementsByTag("parse").attr("title");
    builder.append(title);
} catch (IOException e) {
    builder.append("Error : ").append(e.getMessage());
}
Also worth mentioning: there are some really good examples in the documentation here: https://jsoup.org/cookbook/extracting-data/dom-navigation
And finally, for what it's worth, I'd change the URL used to: https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data to make life easier for use with JSoup so you can just pull the relevant bits of data from HTML rather than XML.
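A sketch of that HTML-first approach; the wikitable class and the row layout are assumptions about the page's current markup, so the selector may need adjusting:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Sketch: fetch the rendered template page and scan its rows for India.
// "table.wikitable tr" is an assumed selector, not confirmed against the page.
Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data").get();
for (Element row : doc.select("table.wikitable tr")) {
    if (row.text().contains("India")) {
        System.out.println(row.text()); // country, cases, deaths, ...
        break;
    }
}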
In my view, if you have the choice, HtmlUnit would be a better tool for this since you can simply specify an XPath for the HTML element you want to extract without having to use multiple method calls to get what you want... the more concise format means there's less room for errors to hide.
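For comparison, a rough HtmlUnit sketch of that XPath style; the XPath itself is an assumption about the page's markup, and asText() is the older method name (newer HtmlUnit releases call it asNormalizedText()):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTableRow;

try (WebClient client = new WebClient()) {
    client.getOptions().setJavaScriptEnabled(false); // a static fetch is enough here
    HtmlPage page = client.getPage("https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data");
    // Assumed XPath: the first wikitable row that mentions India.
    HtmlTableRow row = page.getFirstByXPath("//table[contains(@class,'wikitable')]//tr[contains(., 'India')]");
    if (row != null) {
        System.out.println(row.asText());
    }
}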

How to load local html file in java swing?

I have a tree list which will open a specific HTML file when I click on a node. I've tried loading my HTML into a JEditorPane, but I can't seem to get it to work.
Here's my code from the main file:
private void treeItemMouseClicked(java.awt.event.MouseEvent evt) {
    DefaultMutableTreeNode selectedNode = (DefaultMutableTreeNode) treeItem.getSelectionPath().getLastPathComponent();
    String checkLeaf = selectedNode.getUserObject().toString();
    // Compare strings with equals(), not ==
    if (checkLeaf.equals("Java Turtorial 1")) {
        String htmlURL = "/htmlFILE/javaTurtorial1.html";
        new displayHTML(htmlURL).setVisible(true);
    }
}
Where I want to display it:
public displayHTML(String htmlURL) {
    initComponents();
    try {
        // Display the html file
        editorHTML.setPage(htmlURL);
    } catch (IOException ex) {
        Logger.getLogger(displayHTML.class.getName()).log(Level.SEVERE, null, ex);
    }
}
One simple way to render HTML with JEditorPane is to use its setText method:
JEditorPane editorPane = new JEditorPane();
editorPane.setContentType("text/html");
editorPane.setText("<html><body><h1>I'm an html to render</h1></body></html>");
Note that only certain (relatively simple) HTML pages can be rendered with JEditorPane; if you need something more complicated, you'll have to use third-party components.
Based on OP's comment, I'm adding an update to the answer:
Update
Since the HTML files that you're trying to load are inside the JAR, you should read the file into a string variable and use the aforementioned setText method.
Note that you shouldn't use java.io.File, because it is used to identify resources on the filesystem, and you're trying to access something inside the artifact.
Reading the resource can be done with the following construction:
InputStream is = getClass().getResourceAsStream("/htmls/myhtml.html");
// The stream can then be read in a variety of ways, depending on your Java
// version and the third-party libraries available; the simplest way IMO for Java 9+:
String htmlString = new String(is.readAllBytes(), StandardCharsets.UTF_8);
Read here about the many different ways to read an InputStream into a String.
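Putting the two pieces together, a minimal sketch of the constructor from the question, assuming the resource path /htmlFILE/javaTurtorial1.html from the question and Java 9+ for readAllBytes:

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public displayHTML(String resourcePath) {
    initComponents();
    // Sketch: load an HTML resource bundled in the JAR and render it
    try (InputStream is = getClass().getResourceAsStream(resourcePath)) {
        String html = new String(is.readAllBytes(), StandardCharsets.UTF_8);
        editorHTML.setContentType("text/html");
        editorHTML.setText(html);
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}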

Get line of PDF file after a specific line

I use Apache PDFBox to parse text from a PDF file. I'm trying to get the line after a specific line.
PDDocument document = PDDocument.load(new File("my.pdf"));
if (!document.isEncrypted()) {
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(document);
    System.out.println("Text from pdf:" + text);
} else {
    log.info("File is encrypted!");
}
document.close();
Sample:
Sentence 1, nth line of file
Needed line
Sentence 3, n+2th line of file
I tried to get all the lines from the file into an array, but that is unstable, because I'm unable to filter for specific text. That is a problem with the second solution too, which is why I am looking for a PDFBox-based solution.
Solution 1:
String[] lines = myString.split(System.getProperty("line.separator"));
Solution 2:
String neededline = (String) FileUtils.readLines(file).get("n+2th")
In fact, the source code for the PDFTextStripper class uses the exact same line ending as you, so your first attempt is as close to correct as possible using PDFBox.
You see, the PDFTextStripper getText method calls the writeText method which just writes to an output buffer line by line with the writeString method in the exact same way as you have already tried. The result returned from this method is the buffer.toString().
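(Incidentally, PDFTextStripper has a setLineSeparator method, so you can pin down the delimiter you later split on instead of relying on the platform default; a small sketch:)

PDFTextStripper stripper = new PDFTextStripper();
stripper.setLineSeparator("\n"); // make the split delimiter explicit
String[] lines = stripper.getText(document).split("\n");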
Therefore, given a well formatted PDF, it would seem the question you are really asking is how to filter an array for specific text. Here are some ideas:
First, capture the lines in an array, as you said:
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class Main {
    static String[] lines;

    public static void main(String[] args) throws Exception {
        PDDocument document = PDDocument.load(new File("my2.pdf"));
        PDFTextStripper stripper = new PDFTextStripper();
        String text = stripper.getText(document);
        lines = text.split(System.getProperty("line.separator"));
        document.close();
    }
}
Here's a method to get a complete String by any line number index, easy:
// returns a full String line by number n
static String getLine(int n) {
    return lines[n];
}
Here's a linear search method that finds a string match and returns the first line number where found.
// searches all lines for the first line index containing `filter`
static int getLineNumberWithFilter(String filter) {
    int n = 0;
    for (String line : lines) {
        if (line.indexOf(filter) != -1) {
            return n;
        }
        n++;
    }
    return -1;
}
With the above, it is possible to get a complete line by its number:
System.out.println(getLine(8)); // line 8 for example
Or, the entire String line that contains your matched search:
System.out.println(lines[getLineNumberWithFilter("Cat dog mouse")]);
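And, tying it back to the original question, the two helpers combine to give the line after a specific line (using the sample text from the question):

int n = getLineNumberWithFilter("Sentence 1");
if (n != -1 && n + 1 < lines.length) {
    System.out.println(getLine(n + 1)); // prints "Needed line"
}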
This all seems pretty straightforward and works only under the assumption that lines can be split into arrays by the line separator. If the solution is not as simple as the ideas above, I believe the source of your problem may not be your implementation with PDFBox but rather the PDF source you are trying to text-mine.
Here's a link to a tutorial that also does what you are trying to do:
https://www.tutorialkart.com/pdfbox/extract-text-line-by-line-from-pdf/
Again, same approach...

How to save a jsoup document as text file

I am trying to save all of the readable words on a web page into one text document while ignoring HTML markup.
Using Jsoup to parse all of the words on a webpage, my only guess for how to separate the real words from the code is through elements.
Is it possible to convert multiple elements of the Jsoup document into a text file?
i.e.:
Elements titles = doc.select("title");
Elements paragraphs = doc.select("p");
Elements links = doc.select("a[href]");
Elements smallText = doc.select("a");
Currently saving the parse as a document with:
Document doc = Jsoup.connect("https:// (enter a url)").get();
Here's a simple way:
Document doc = Jsoup.connect("https:// (enter a url)").get();
BufferedWriter writer = null;
try {
    writer = new BufferedWriter(new FileWriter("d://test.txt"));
    writer.write(doc.toString());
} catch (IOException e) {
    e.printStackTrace(); // don't swallow the exception silently
}
Adding answer because I am unable to comment above.
Replace writer.write(doc.toString()); by writer.write(doc.select("html").text()); in the above code.
It will give you the text on the page.
Instead of "html" in doc.select("**html**").text() other tags can be used to extract text enclosed in those tags.
Edit: you can also use writer.write(doc.body().text());
After writing the text with writer.write(doc.text());, add writer.close(); on the very next line; this will fix the problem.
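For completeness, a minimal end-to-end sketch combining the answers above (the URL and output path are placeholders); try-with-resources closes the writer automatically:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SavePageText {
    public static void main(String[] args) {
        try (BufferedWriter writer = new BufferedWriter(new FileWriter("test.txt"))) {
            Document doc = Jsoup.connect("https://example.com/").get();
            writer.write(doc.body().text()); // readable text only, no markup
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}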

Jsoup not seeing some text on website

Currently I am making a program (in Java) that grabs all the streamers on Twitch (a video game streaming site) from a given URL, e.g. the Hearthstone directory used below, and lists them in a text file using Jsoup.
However, no matter what I try, it seems like I can't get the streamers' names. After a while I discovered that the page source for some reason does not contain the streamers' names, which I think could be the problem?
Here is my code currently.
public static void main(String[] args) throws IOException {
    int i = 0;
    PrintWriter streamerwriter = new PrintWriter("streamer.txt", "UTF-8");
    Document doc = Jsoup.connect("https://www.twitch.tv/directory/game/Hearthstone%3A%20Heroes%20of%20Warcraft").get();
    Elements streamers = doc.getElementsByClass("js-profile-link");
    for (Element streamer : streamers) {
        i++;
        System.out.println(i + "." + streamer.text());
        streamerwriter.println(i + "." + streamer.text());
    }
    streamerwriter.close();
}
Any help would be greatly appreciated.
You don't need to parse the webpage, because Twitch has an API to select streamers:
https://streams.twitch.tv/kraken/streams?limit=20&offset=0&game=Hearthstone%3A+Heroes+of+Warcraft&broadcaster_language=&on_site=1
So you should parse the JSON data instead.
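A hedged sketch of that approach, using Jsoup only as the HTTP client and org.json for parsing; the response shape ({"streams":[{"channel":{"display_name":...}}]}) is an assumption about this older Kraken API, so verify it against a real response first:

import org.json.JSONArray;
import org.json.JSONObject;
import org.jsoup.Jsoup;

// Sketch: fetch the raw JSON (ignoreContentType lets Jsoup accept a
// non-HTML body) and print each streamer's display name.
String json = Jsoup.connect("https://streams.twitch.tv/kraken/streams?limit=20&offset=0&game=Hearthstone%3A+Heroes+of+Warcraft&broadcaster_language=&on_site=1")
        .ignoreContentType(true).execute().body();
JSONArray streams = new JSONObject(json).getJSONArray("streams");
for (int i = 0; i < streams.length(); i++) {
    JSONObject channel = streams.getJSONObject(i).getJSONObject("channel");
    System.out.println((i + 1) + "." + channel.getString("display_name"));
}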
If you want to know why you don't see the streamers in Jsoup: it's because of lazy loading. The part of the page you want to parse is loaded lazily, so you have to find that lazy request and parse its URL with Jsoup instead, which is the Twitch API URL I found and wrote up above.
Please check this question out:
how to use Jsoup in site that has lazyload scrollLoader.js
