My program goes to my university's results page, finds all the links, and saves them to a file. Then I read the file, copy only the lines that contain the required links, and save them to another file. Then I parse that file again to extract the required data.
import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.io.FileUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class net {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://jntuconnect.net/results_archive/").get();
        Elements links = doc.select("a");
        File f1 = new File("flink.txt");
        File f2 = new File("rlink.txt");
        // write extracted links to f1 file
        FileUtils.writeLines(f1, links);
        // store each link from f1 file in string list
        List<String> linklist = FileUtils.readLines(f1);
        // second string list to store only required link elements
        List<String> rlinklist = new ArrayList<String>();
        // loop which finds required links and stores in rlinklist
        for (String elem : linklist) {
            if (elem.contains("B.Tech") && (elem.contains("R07") || elem.contains("R09"))) {
                rlinklist.add(elem);
            }
        }
        // store required links in f2 file
        FileUtils.writeLines(f2, rlinklist);
        // parse links from f2 file
        Document rdoc = Jsoup.parse(f2, null);
        Elements rlinks = rdoc.select("a");
        // for storing hrefs and link text
        List<String> rhref = new ArrayList<String>();
        List<String> rtext = new ArrayList<String>();
        for (Element rlink : rlinks) {
            rhref.add(rlink.attr("href"));
            rtext.add(rlink.text());
        }
    } // end main
}
I don't want to create files to do this. Is there a better way to get hrefs and link texts of only specific urls without creating files?
The code uses Apache Commons IO (FileUtils) and jsoup.
Here's how you can get rid of the first file write/read:
Elements links = doc.select("a");
List<String> linklist = new ArrayList<String>();
for (Element elt : links) {
    linklist.add(elt.toString());
}
The second round trip, if I understand the code, is intended to extract the links that meet a certain test. You can just do that in memory using the same technique.
I see you're relying on Jsoup.parse to extract the href and link text from the selected links. You can do that in memory by writing the selected nodes to a StringBuffer, converting it to a String by calling its toString() method, and then using one of the Jsoup.parse methods that takes a String instead of a File argument.
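As a possible simplification of the steps above, the test can be applied to each Element directly, so no String round trip or second parse is needed. This is only a sketch: the class name InMemoryLinkFilter, the method requiredLinks, and the sample markup are made up for illustration, and it assumes that checking each element's outer HTML is an acceptable stand-in for checking the lines read back from the file.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class InMemoryLinkFilter {

    /** Returns {href, text} pairs for anchors whose HTML passes the B.Tech / R07 / R09 test. */
    static List<String[]> requiredLinks(Document doc) {
        List<String[]> result = new ArrayList<>();
        for (Element link : doc.select("a")) {
            // Same test as the question's loop, applied to the element's HTML
            // instead of to a line read back from a file.
            String html = link.outerHtml();
            if (html.contains("B.Tech") && (html.contains("R07") || html.contains("R09"))) {
                result.add(new String[] { link.attr("href"), link.text() });
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup standing in for the results page.
        String sample = "<a href='/r07.html'>B.Tech R07 Results</a>"
                + "<a href='/mba.html'>MBA R09 Results</a>"
                + "<a href='/r09.html'>B.Tech R09 Results</a>";
        for (String[] pair : requiredLinks(Jsoup.parse(sample))) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

For the real page, the Document returned by Jsoup.connect(...).get() in the question could be passed to requiredLinks in place of the parsed sample string.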
I'm trying to read .docx files with styling information using Apache POI, which I have done by looping through each XWPFParagraph and working with all the XWPFRun objects inside the paragraphs. Now I want to get the contents of each page. Is there a way to get the contents of each page, or is it possible to know which page a paragraph is on?
This is a function that takes the absolute path of a docx file and returns an array of strings:
FileInputStream fis = new FileInputStream(absolutePath);
XWPFDocument document = new XWPFDocument(fis);
List<IBodyElement> bodyElements = document.getBodyElements();
List<String> textList = new ArrayList<>();
/* I want to add some kind of outer loop here for each page
   and at the end of that loop I want to add a "<hr/>" tag in the textList
*/
for (IBodyElement bodyElement : bodyElements) { // Looping through paragraphs
    if (bodyElement.getElementType() == BodyElementType.PARAGRAPH) {
        XWPFParagraph paragraph = (XWPFParagraph) bodyElement;
        String textToAdd = parseParagraph(paragraph); // custom function to handle paragraphs
        textList.add(textToAdd);
    }
}
document.close();
return textList.toArray(new String[0]);
As you can see my goal here is to add a <hr/> tag after each page. So, if somehow I can get the page number of a paragraph or loop through pages, I will be able to do that.
Please kindly mention if you know about any other approach that may help.
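To get the page count from an XWPFDocument (for your outer loop), you can read it from the document's extended properties. Note that this is a stored value written by the application that last saved the file, so it may be missing or out of date: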
XWPFDocument docx = new XWPFDocument(POIXMLDocument.openPackage(YOUR_FILE_PATH));
int numOfPages = docx.getProperties().getExtendedProperties().getUnderlyingProperties().getPages();
For your paragraph text,
for (XWPFParagraph p : document.getParagraphs()) {
    System.out.println(p.getParagraphText()); // YOUR PARAGRAPH TEXT
}
I have a PPT containing placeholders in the format {VALUES}. I would like to replace placeholder text such as {SAMPLE} with SARAN, and so on, using Java. Earlier I tried Apache POI without success, but I found that docx4j has a method, variableReplace(), that replaces text. I tried one sample, but I don't see any text being replaced.
My code:
public static void main(String[] args) throws IOException, Docx4JException, Pptx4jException, JAXBException
{
    String inputfilepath = "C:\\Work\\SampleTemplate.pptx";
    PresentationMLPackage pptPackage
            = PresentationMLPackage.load(new File(inputfilepath));
    ThemePart themeSlidePart = (ThemePart)
            pptPackage.getParts().getParts().get(new PartName("/ppt/theme/theme1.xml"));
    Theme themeOfSlides = (Theme)themeSlidePart.getJaxbElement();
    SlidePart slide = pptPackage.getMainPresentationPart().getSlide(0);
    HashMap h = new HashMap<String, String>();
    h.put("SlideTitleName", "SARANYA");
    slide.variableReplace(h);
    String outputfilepath = "C:\\Work\\24Jan2018_CheckOut\\dhl\\PPT-TRAILS\\SampleTemplate.pptx";
    PresentationMLPackage pptPackagetoApply
            = PresentationMLPackage.load(new File(outputfilepath));
    ThemePart themeSlidePartToApply
            = (ThemePart) pptPackagetoApply.getParts().getParts()
                    .get(new PartName("/ppt/theme/theme1.xml"));
    themeSlidePartToApply.setJaxbElement(themeOfSlides);
    SaveToZipFile saver = new SaveToZipFile(pptPackagetoApply);
    saver.save(outputfilepath);
}
But the text is still not being replaced. I tried a new PPTX file with only the text "${SlideTitleName}" inside it, and it still doesn't work.
I would also like to add many rows to my PPTX template using a for loop. I tried the following:
Java:
if(majorAchievements.size()>0){
    System.out.println("majorAchievements Size in Servlet is : "+majorAchievements.size() + " and value of get(0) is : "+ majorAchievements.get(0) );
    for(int i=0;i<majorAchievements.size();i++){
        achievements=(String) majorAchievements.get(i);
    }
}
mappings.put("MajorAchievementText", achievements);
Inside my PPT file, I have added two rows with the same ${MajorAchievementText} placeholder.
The list above returns 2 rows, but the placeholder inside the PPT is not replaced.
Please help me replace the text by iterating over the list.
With Regards,
Saranya C
I'm new to Java and I have a link, "https://moz.com/blog-sitemap.xml", that contains URLs. I want to get them and save them in a string vector/array.
I tried this first to see how I could get the links:
URL robotFile = new URL("https://moz.com/blog-sitemap.xml");
//read robot.txt line by line
Scanner robotScanner = new Scanner(robotFile.openStream());
while (robotScanner.hasNextLine()) {
    System.out.println(robotScanner.nextLine());
}
this is the sample output
My question is: is there a simpler way to get these links than looping over each line and checking whether it contains "https" so that I can extract the link from it?
You can use Jsoup to do this more easily:
List<String> urlList = new ArrayList<>();
Document doc = Jsoup.connect("https://moz.com/blog-sitemap.xml").get();
Elements urls = doc.getElementsByTag("loc");
for (Element url : urls) {
    urlList.add(url.text());
}
I am trying to use Jsoup in order to extract text from Wikipedia articles.
My idea is to simply extract every headline, and their respective text paragraphs.
I am having some trouble understanding how I can take only the specific text of each section, here's what I have:
public static void main(String[] args) {
    String url = "http://en.wikipedia.org/wiki/Albert_Einstein";
    Document doc;
    try {
        doc = Jsoup.connect(url).get();
        doc = Jsoup.parse(doc.toString());
        Elements titles = doc.select(".mw-headline");
        PrintStream out = new PrintStream(new FileOutputStream("output.txt"));
        System.setOut(out);
        for (Element h3 : doc.select(".mw-headline")) {
            String title = h3.text();
            String titleID = h3.id();
            Elements paragraphs = doc.select("p#" + titleID);
            //Element nextEle=h3.nextElementSibling();
            System.out.println(title);
            System.out.println("----------------------------------------");
            System.out.println(titleID);
            System.out.print("\n");
            System.out.println(paragraphs.text());
            System.out.print("\n");
        }
    } catch (IOException e) {
        System.out.println("Request failed");
        e.printStackTrace();
    }
}
With this I can extract every headline, but I can't get how I would get the text from each section to print it accordingly. I was thinking maybe with the headline's ID, but no dice.
Thank you for any help!
Depending on the tag structure of the page (if any), that could be complicated. A better alternative could be to iterate over all the elements in document order, detecting headlines. Every time you detect a new headline (or reach the end of the elements), the previous section has ended: all elements collected since the last headline belong to that previous headline (or to the "header" of the article if there is no previous headline).
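A minimal sketch of that iteration with jsoup is shown below. It assumes that headlines are h2/h3 elements and that section text lives in p elements, which may not match Wikipedia's actual markup exactly (heading text there can also include extras such as edit links). The class name SectionGrouper, the method groupByHeadline, and the "(intro)" key are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SectionGrouper {

    /** Walks headings and paragraphs in document order, grouping paragraph text under the latest heading. */
    static Map<String, List<String>> groupByHeadline(Document doc) {
        Map<String, List<String>> sections = new LinkedHashMap<>();
        String current = "(intro)"; // bucket for text that appears before the first headline
        sections.put(current, new ArrayList<>());
        // select() returns matches in document order, so headings and paragraphs stay interleaved.
        for (Element el : doc.select("h2, h3, p")) {
            if (el.tagName().equals("p")) {
                sections.get(current).add(el.text());
            } else {
                // A new headline closes the previous section and opens a new one.
                current = el.text();
                sections.putIfAbsent(current, new ArrayList<>());
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup standing in for an article page.
        String html = "<p>Lead paragraph</p><h2>Life</h2><p>Born.</p><p>Studied.</p>"
                + "<h3>Work</h3><p>Physics.</p>";
        groupByHeadline(Jsoup.parse(html)).forEach((title, paragraphs) -> {
            System.out.println(title);
            System.out.println("----------------------------------------");
            paragraphs.forEach(System.out::println);
            System.out.println();
        });
    }
}
```

Sections with the same headline text are merged into one entry here; a list of (title, paragraphs) pairs would avoid that if duplicate titles matter.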
I'm developing Java code to get data from a website and store it in a file. I want to store the result of an XPath query in a file. Is there any way to save the output of the XPath? Please forgive any mistakes; this is my first question.
public class TestScrapping {
    public static void main(String[] args) throws MalformedURLException, IOException, XPatherException {
        // URL to be fetched in the below url u can replace s=cantabil with company of ur choice
        String url_fetch = "http://www.yahoo.com";
        //create tagnode object to traverse XML using xpath
        TagNode node;
        String info = null;
        //XPath of the data to be fetched.....use firefox's firepath addon or use firebug to fetch the required XPath.
        //the below XPath will display the title of the company u have queried for
        String name_xpath = "//div[1]/div[2]/div[2]/div[1]/div/div/div/div/table/tbody/tr[1]/td[2]/text()";
        // declarations related to the api
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = new CleanerProperties();
        props.setAllowHtmlInsideAttributes(true);
        props.setAllowMultiWordAttributes(true);
        props.setRecognizeUnicodeChars(true);
        props.setOmitComments(true);
        //creating url object
        URL url = new URL(url_fetch);
        URLConnection conn = url.openConnection(); //opening connection
        node = cleaner.clean(new InputStreamReader(conn.getInputStream()));//reading input stream
        //storing the nodes belonging to the given xpath
        Object[] info_nodes = node.evaluateXPath(name_xpath);
        // String li= node.getAttributeByName(name_xpath);
        //checking if something returned or not....if XPath invalid info_nodes.length=0
        if (info_nodes.length > 0) {
            //info_nodes[0] will return string buffer
            StringBuffer str = new StringBuffer();
            {
                for(int i=0;i<info_nodes.length;i++)
                    System.out.println(info_nodes[i]);
            }
            /*str.append(info_nodes[0]);
            System.out.println(str);
            */
        }
    }
}
You can simply print the nodes as strings, to the console or to a file.
An example in Perl:
my $all = $XML_OBJ->find('/'); # selecting all nodes from root
foreach my $node ($all->get_nodelist()) {
    print XML::XPath::XMLParser::as_string($node);
}
Note: this output may not be nicely formatted or indented XML.
The output of an XPath in Java is a nodeset, so yes, once you have a nodeset you can do anything you want with it, save it to a file, process it some more.
Saving it to a file involves the same steps in Java as saving anything else to a file; there is no difference between that and any other data. Select the nodeset, iterate through it, get the parts you want from it, and write them to some kind of file stream.
However, if you mean is there a Nodeset.SaveToFile(), then no.
I would recommend taking the NodeSet, which is a collection of Nodes, iterating over it, and adding each node to a newly created DOM Document object.
After this, you can use the TransformerFactory to get a Transformer object, and to use its transform method. You should transform from a DOMSource to a StreamResult object which can be created based on FileOutputStream.
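The steps above can be sketched with the JDK's own XML APIs. Note the assumptions: this uses javax.xml.xpath on a well-formed XML string rather than HtmlCleaner's evaluateXPath from the question, the XPath is expected to select element or text nodes (attribute nodes cannot be appended as children), and the class name XPathToFile, the method saveMatches, the wrapper element "results", and the output file name are all arbitrary choices for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class XPathToFile {

    /** Evaluates xpath against xml, copies the matches under a new root element, and writes them to target. */
    static int saveMatches(String xml, String xpath, File target) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document source = builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        NodeList matches = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, source, XPathConstants.NODESET);

        // New document with a wrapper root; "results" is an arbitrary name chosen for this sketch.
        Document out = builder.newDocument();
        Element root = out.createElement("results");
        out.appendChild(root);
        for (int i = 0; i < matches.getLength(); i++) {
            // importNode makes a deep copy owned by the new document; the source is left untouched.
            root.appendChild(out.importNode(matches.item(i), true));
        }

        // Serialize the new document: DOMSource in, StreamResult over a FileOutputStream out.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        try (OutputStream stream = new FileOutputStream(target)) {
            transformer.transform(new DOMSource(out), new StreamResult(stream));
        }
        return matches.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample input standing in for the cleaned page.
        String xml = "<table><tr><td>Name</td><td>Cantabil</td></tr></table>";
        File target = new File("xpath-output.xml");
        int count = saveMatches(xml, "//td", target);
        System.out.println("Wrote " + count + " nodes to " + target.getPath());
    }
}
```

Copying with importNode rather than moving the original nodes keeps the source document intact, which matters if it is queried again afterwards.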