I am trying to use Jsoup in order to extract text from Wikipedia articles.
My idea is to simply extract every headline, and their respective text paragraphs.
I'm having some trouble understanding how to take only the specific text of each section. Here's what I have:
public static void main(String[] args) {
    String url = "http://en.wikipedia.org/wiki/Albert_Einstein";
    Document doc;
    try {
        doc = Jsoup.connect(url).get();
        Elements titles = doc.select(".mw-headline");
        // redirect System.out to a file
        PrintStream out = new PrintStream(new FileOutputStream("output.txt"));
        System.setOut(out);
        for (Element h3 : titles) {
            String title = h3.text();
            String titleID = h3.id();
            Elements paragraphs = doc.select("p#" + titleID);
            //Element nextEle = h3.nextElementSibling();
            System.out.println(title);
            System.out.println("----------------------------------------");
            System.out.println(titleID);
            System.out.print("\n");
            System.out.println(paragraphs.text());
            System.out.print("\n");
        }
    } catch (IOException e) {
        System.out.println("something went wrong");
        e.printStackTrace();
    }
}
With this I can extract every headline, but I can't figure out how to get the text of each section so I can print it accordingly. I was thinking maybe I could use the headline's ID, but no dice.
Thank you for any help!
Depending on the tag structure of the page (if any), that could be complicated. A better alternative is to iterate over all the elements, detecting headlines as you go. Every time you detect a new headline (or reach the end of the elements), the previous section ends: all the elements passed up to that point belong to the previous headline (or to the "header" of the article if there is no previous headline).
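For example, a minimal sketch of that approach with Jsoup. The div.mw-parser-output container and the placement of .mw-headline spans inside the heading tags are assumptions based on current Wikipedia markup, so verify them against the live page:

// Sketch: walk the article body's children, starting a new section at each headline.
// "div.mw-parser-output" is an assumption about Wikipedia's current layout.
Element content = doc.selectFirst("div.mw-parser-output");
String currentTitle = "(intro)";
StringBuilder sectionText = new StringBuilder();
for (Element child : content.children()) {
    Element headline = child.selectFirst(".mw-headline");
    if (headline != null) {
        // A new headline: flush the section collected so far.
        System.out.println(currentTitle);
        System.out.println(sectionText);
        currentTitle = headline.text();
        sectionText.setLength(0);
    } else if (child.is("p")) {
        // A paragraph belongs to the section currently being collected.
        sectionText.append(child.text()).append('\n');
    }
}
System.out.println(currentTitle); // flush the last section
System.out.println(sectionText);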
Hello, I have a question about XML and Java. I have a weird XML file with no attributes and only elements. I'm trying to zero in on a specific element stack, and then iterate over all of the similar element stacks.
<InstrumentData>
<Action>Entire Plot</Action>
<AppStamp>Vectorworks</AppStamp>
<VWVersion>2502</VWVersion>
<VWBuild>523565</VWBuild>
<AutoRot2D>false</AutoRot2D>
<UID_1505_1_1_0_0> <!-- This is the part I care about: there are about 1000+ of these and they all vary slightly after the "UID_" -->
<Action>Update</Action>
<TimeStamp>20200427192323</TimeStamp>
<AppStamp>Vectorworks</AppStamp>
<UID>1505.1.1.0.0</UID>
</UID_1505_1_1_0_0>
I am using dom4j as the XML parser and I don't have any issues spitting out all of the data; I just want to zero in on the XML path.
This is the code so far:
public class Unmarshal {

    public Unmarshal() {
        File file = new File("/Users/michaelaboah/Desktop/LIHN 1.11.18 v2020.xml");
        SAXReader reader = new SAXReader();
        try {
            Document doc = reader.read(file);
            Element ele = doc.getRootElement();
            Iterator<Element> it = ele.elementIterator();
            Iterator<Node> nodeIt = ele.nodeIterator();
            while (it.hasNext()) {
                Element test2 = (Element) it.next();
                List<Element> eleList = ele.elements();
                for (Element elementsIt : eleList) {
                    // This spits out everything under the InstrumentData branch;
                    // all of that data is very large
                    System.out.println(elementsIt.selectSingleNode("/SLData/InstrumentData").getStringValue());
                    // This spits out everything under the UID branch
                    System.out.println(elementsIt.selectSingleNode("/SLData/InstrumentData/UID_1505_1_1_0_0").getStringValue());
                }
            }
        } catch (DocumentException e) {
            e.printStackTrace();
        }
    }
}
Also, I know there are some unused data types and variables; there was a lot of testing.
I think your answer is:
elementsIt.selectSingleNode("/SLData/InstrumentData/*[starts-with(local-name(), 'UID_')]").getStringValue()
I used this post to find this XPath, and it works with the few XML lines you gave.
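If you need to iterate over every UID_* stack rather than grab a single one, selectNodes accepts the same predicate. A minimal sketch, assuming dom4j 2.x and the same /SLData/InstrumentData structure as your snippet:

// Select every element whose name starts with "UID_" and read its children.
List<Node> uidNodes = doc.selectNodes("/SLData/InstrumentData/*[starts-with(local-name(), 'UID_')]");
for (Node n : uidNodes) {
    Element uid = (Element) n;
    System.out.println(uid.getName() + " -> UID=" + uid.elementText("UID")
            + ", Action=" + uid.elementText("Action"));
}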
My Android app has a part where I need to parse data from wikipedia.com and use it in the application. When I go to https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data I can see the COVID-19 cases. I want to retrieve the numbers from the table.
I am using Jsoup. I am able to get the HTML data by using https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Template:COVID-19_pandemic_data. Can you guide me on how to extract the India cases and deaths from the HTML file? The HTML doc is huge and there are no attributes on the tr elements, and there's not much information about this on the internet. What I have tried so far:
private void getWebsite() {
    new Thread(new Runnable() {
        @Override
        public void run() {
            final StringBuilder builder = new StringBuilder();
            String web_link = "https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Template:COVID-19_pandemic_data";
            try {
                Document doc = Jsoup.connect(web_link).get();
                String title = doc.title();
                Elements links = doc.select("tr");
                builder.append(title).append("\n");
                for (Element link : links) {
                    builder.append(link);
                }
            } catch (IOException e) {
                builder.append("Error : ").append(e.getMessage()).append("\n");
            }
            runOnUiThread(new Runnable() {
                @Override
                public void run() {
                    textView.setText(builder.toString());
                }
            });
        }
    }).start();
}
The problem is related to the format of the data (XML). When you navigate down the XML elements, you find that what's displayed in the document, when viewed via your browser, is:
<someTag>...</someTag>
But what's actually present is the XML-encoded version of the string:
&lt;someTag&gt;...&lt;/someTag&gt;
Jsoup won't work well with this, and you'll need further processing to convert the output into more XML to get it working, I'd imagine. You can test this yourself by viewing the result of:
doc.getElementsByTag("text")
And you'd need to replace all &lt; and &gt; tokens with < and > respectively.
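As a sketch of that further processing: the text() accessor already decodes the entities the parser saw, and Jsoup's Parser.unescapeEntities can clean up any that survive, so no hand-rolled String.replace is needed. Here doc is the Document fetched from the api.php URL in the question:

// The <text> element wraps the rendered page as an entity-encoded string;
// .text() returns it with the entities decoded by the parser.
String innerHtml = doc.getElementsByTag("text").text();
// Defensive extra pass for any stray &lt;/&gt; entities that survived.
innerHtml = Parser.unescapeEntities(innerHtml, false);
// Re-parse the decoded markup as a normal HTML document.
Document inner = Jsoup.parse(innerHtml);
Elements rows = inner.select("tr"); // the table rows are now selectable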
Here's what I tried, plus some minor edits after failing to pull tbody/thead/th. I then started trying to pull from the top-level tag, starting with api and moving deeper into the DOM.
final StringBuilder builder = new StringBuilder();
String url = "https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Template:COVID-19_pandemic_data";
try {
    Document doc = Jsoup.connect(url).get();
    String title = doc.getElementsByTag("parse").attr("title");
Also worth mentioning: there are some really good examples in the documentation here: https://jsoup.org/cookbook/extracting-data/dom-navigation
And finally, for what it's worth, I'd change the URL to https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data to make life easier with Jsoup, so you can pull the relevant bits of data straight from HTML rather than XML.
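For instance, a sketch against the HTML page. The wikitable class, the link text, and the cell order are all assumptions; inspect the live markup, which changes over time:

Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data").get();
// Find the first row whose link text mentions "India" and read its cells.
for (Element row : doc.select("table.wikitable tr")) {
    if (!row.select("a:containsOwn(India)").isEmpty()) {
        Elements cells = row.select("td");
        if (cells.size() >= 2) {
            System.out.println("Cases: " + cells.get(0).text());   // assumed column order
            System.out.println("Deaths: " + cells.get(1).text());  // assumed column order
        }
        break;
    }
}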
In my view, if you have the choice, HtmlUnit would be a better tool for this, since you can simply specify an XPath for the HTML element you want to extract without chaining multiple method calls to get what you want; the more concise format leaves less room for errors to hide.
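For illustration, a minimal HtmlUnit sketch; the XPath predicate here is an assumption to adapt to the element you actually need:

try (WebClient webClient = new WebClient()) {
    HtmlPage page = webClient.getPage("https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data");
    // One XPath expression replaces the chain of select calls.
    HtmlTableRow row = page.getFirstByXPath("//tr[.//a[text()='India']]");
    if (row != null) {
        System.out.println(row.asText());
    }
}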
I'm trying to understand how to use HtmlUnit and Jsoup together and have been successful in understanding the basics. However, when I try to store text from a specific webpage into a string, it only returns a single line rather than the whole text.
I know the code I've written works, because when I print out p.text() it returns the whole text stored within the website.
private static String getText() {
    try {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        String url = page1.getUrl().toString();
        Document doc = Jsoup.connect(url).get();
        Elements paragraphs = doc.select("div[class=govspeak] p");
        for (Element p : paragraphs) {
            System.out.println(p.text());
        }
    } catch (Exception e) {
        e.printStackTrace();
        Logger.getLogger(HTMLParser.class.getName()).log(Level.SEVERE, null, e);
    }
    return null;
}
When I introduce a string to store the text from p.text(), it only returns a single line rather than the whole text.
private static String getText() {
    String text = "";
    try {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        String url = page1.getUrl().toString();
        Document doc = Jsoup.connect(url).get();
        Elements paragraphs = doc.select("div[class=govspeak] p");
        for (Element p : paragraphs) {
            text = p.text();
        }
    } catch (Exception e) {
        e.printStackTrace();
        Logger.getLogger(HTMLParser.class.getName()).log(Level.SEVERE, null, e);
    }
    return text;
}
Ultimately, all I want to do is store the whole text into a string. Any help would be greatly appreciated, thanks in advance.
Document doc = Jsoup.connect(url).get();
String text = doc.text();
That's basically it. Since Jsoup already takes care of stripping all the HTML tags from the text, doc.text() gives you the content of the whole page cleaned of markup.
for (Element p : paragraphs)
text+=p.text(); // Append the text.
In your code, you are overwriting the value of the text variable on every iteration; that's why only the last line is returned by the function.
I think it is a strange idea to use the HtmlUnit result as a starting point for Jsoup. There are various drawbacks to your approach (e.g. think about cookies), and of course HtmlUnit has already parsed the HTML, so you would be doing the work twice.
I hope this code fulfills your requirements without Jsoup.
private static String getText() throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    StringBuilder text = new StringBuilder();
    try (WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        DomNodeList<DomNode> paragraphs = page1.querySelectorAll("div[class=govspeak] p");
        for (DomNode p : paragraphs) {
            text.append(p.asText());
        }
    }
    return text.toString();
}
I am trying to save all of the readable words on a web page into one text document while ignoring HTML markup.
Using Jsoup to parse all of the words on a webpage, my only guess for how to separate the real words from the code is through elements.
Is it possible to convert multiple elements of the Jsoup document into a text file?
i.e.:
Elements titles = doc.select("title");
Elements paragraphs = doc.select("p");
Elements links = doc.select("a[href]");
Elements smallText = doc.select("a");
Currently saving the parse as a document with:
Document doc = Jsoup.connect("https:// (enter a url)").get();
Here's a simple way:
Document doc = Jsoup.connect("https:// (enter a url)").get();
BufferedWriter writer = null;
try {
    writer = new BufferedWriter(new FileWriter("d://test.txt"));
    writer.write(doc.toString());
} catch (IOException e) {
}
Adding answer because I am unable to comment above.
Replace writer.write(doc.toString()); with writer.write(doc.select("html").text()); in the above code.
It will give you the text on the page.
Instead of "html" in doc.select("html").text(), other tags can be used to extract the text enclosed in those tags.
Edit: you can also use writer.write(doc.body().text());
After writing the text with writer.write(doc.text());, the very next line needs to be writer.close(); that will fix the problem, because the buffered output is never flushed otherwise.
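A safer variant is try-with-resources, which flushes and closes the writer automatically even if an exception is thrown:

// The writer is closed (and therefore flushed) automatically on exit.
try (BufferedWriter writer = new BufferedWriter(new FileWriter("d://test.txt"))) {
    writer.write(doc.text());
} catch (IOException e) {
    e.printStackTrace();
}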
My program goes to my university results page, finds all the links and saves them to a file. Then I read the file and copy only the lines which contain the required links and save those to another file. Then I parse that file again to extract the required data.
public class net {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://jntuconnect.net/results_archive/").get();
        Elements links = doc.select("a");
        File f1 = new File("flink.txt");
        File f2 = new File("rlink.txt");

        // write extracted links to f1 file
        FileUtils.writeLines(f1, links);

        // store each link from f1 file in a string list
        List<String> linklist = FileUtils.readLines(f1);

        // second string list to store only the required link elements
        List<String> rlinklist = new ArrayList<String>();

        // loop which finds required links and stores them in rlinklist
        for (String elem : linklist) {
            if (elem.contains("B.Tech") && (elem.contains("R07") || elem.contains("R09"))) {
                rlinklist.add(elem);
            }
        }

        // store required links in f2 file
        FileUtils.writeLines(f2, rlinklist);

        // parse links from f2 file
        Document rdoc = Jsoup.parse(f2, null);
        Elements rlinks = rdoc.select("a");

        // for storing hrefs and link text
        List<String> rhref = new ArrayList<String>();
        List<String> rtext = new ArrayList<String>();
        for (Element rlink : rlinks) {
            rhref.add(rlink.attr("href"));
            rtext.add(rlink.text());
        }
    } // end main
}
I don't want to create files to do this. Is there a better way to get the hrefs and link text of only specific URLs without creating files?
The code uses Apache Commons FileUtils and Jsoup.
Here's how you can get rid of the first file write/read:
Elements links = doc.select("a");
List<String> linklist = new ArrayList<String>();
for (Element elt : links) {
linklist.add(elt.toString());
}
The second round trip, if I understand the code, is intended to extract the links that meet a certain test. You can just do that in memory using the same technique.
I see you're relying on Jsoup.parse to extract the href and link text from the selected links. You can do that in memory by writing the selected nodes to a StringBuffer, converting it to a String with its toString() method, and then using one of the Jsoup.parse overloads that takes a String instead of a File argument.
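Putting that together, a sketch of the whole flow with no intermediate files; the filter condition is copied from the question, and a StringBuilder stands in for the StringBuffer mentioned above:

Document doc = Jsoup.connect("http://jntuconnect.net/results_archive/").get();

// Filter the links in memory instead of writing them to flink.txt.
StringBuilder selected = new StringBuilder();
for (Element link : doc.select("a")) {
    String html = link.toString();
    if (html.contains("B.Tech") && (html.contains("R07") || html.contains("R09"))) {
        selected.append(html).append('\n');
    }
}

// Re-parse the selected markup from the String instead of from rlink.txt.
Document rdoc = Jsoup.parse(selected.toString());
List<String> rhref = new ArrayList<>();
List<String> rtext = new ArrayList<>();
for (Element rlink : rdoc.select("a")) {
    rhref.add(rlink.attr("href"));
    rtext.add(rlink.text());
}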