Java replace content in a link - java

I need to read the HTML of a webpage, find the links and images, and then rename them. This is what I have done so far:
reader = new BufferedReader(new InputStreamReader(socket.getInputStream(), 'UTF-8'));
String line;
while ((line = reader.readLine()) != null) {
regex = "<a[^>]*href=(\"([^\"]*)\"|\'([^\']*)\'|([^\\s>]*))[^>]*>(.*?)</a>";
final Pattern pa = Pattern.compile(regex, Pattern.DOTALL);
final Matcher ma = pa.matcher(s);
if(ma.find()){
string newlink=path+"1-2.html";
//replace the link in href with newlink, how can i do this?
}
html.append(line).append("/r/n");
}
How can I do the commented part?

Using regex for parsing HTML can be difficult and unreliable. It's better to use XPath and DOM manipulation for things like that.

Alternatives were mentioned; nevertheless, here is how to do it with a regex:
Matcher supports a "replace all" that builds its result in a StringBuffer.
Part of the matched text has to be re-added in the replacement, so every piece you want to keep must be captured in a group: ma.group(1), ma.group(2), and so on.
DOTALL lets . match newline characters; it is not needed here because readLine strips the line ending.
There can be more than one link per line, so loop with while (ma.find()) instead of a single if.
Your example code called matcher(s) instead of matcher(line).
So the code uses Matcher.appendReplacement and appendTail:
StringBuffer html = new StringBuffer();
reader = new BufferedReader(new InputStreamReader(socket.getInputStream(), "UTF-8"));
String line;
regex = "(<a[^>]*href=)(\"([^\"]*)\"|\'([^\']*)\'|([^\\s>]*))[^>]*>(.*?)(</a>)";
final Pattern pa = Pattern.compile(regex);
while ((line = reader.readLine()) != null) {
    final Matcher ma = pa.matcher(line);
    while (ma.find()) {
        String newlink = path + "1-2.html";
        ma.appendReplacement(html, ma.group(1) /* a href */ + ...);
    }
    ma.appendTail(html); // copies the rest of the line after the last match
    html.append("\r\n"); // the line itself is already in html, so add only the line ending
}
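To make the elided replacement concrete, here is a minimal sketch of the same appendReplacement/appendTail approach. The class name LinkRewriter, the method rewriteLinks, and the simplified three-group regex are assumptions for illustration, not the code from the answer above; the sketch also assumes every link should receive the same new target.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class LinkRewriter {
    // Group 1: the tag up to and including href=
    // Group 2: the original href value (double-quoted, single-quoted, or bare)
    // Group 3: the remaining attributes, the link text, and the closing </a>
    private static final Pattern LINK = Pattern.compile(
            "(<a[^>]*href=)(\"[^\"]*\"|'[^']*'|[^\\s>]*)([^>]*>.*?</a>)");

    /** Returns the line with every href value replaced by newLink. */
    static String rewriteLinks(String line, String newLink) {
        Matcher ma = LINK.matcher(line);
        StringBuffer out = new StringBuffer();
        while (ma.find()) {
            String replacement = ma.group(1) + "\"" + newLink + "\"" + ma.group(3);
            // quoteReplacement stops $ and \ in the text from being read as group references
            ma.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        ma.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "<p><a href=\"old.html\" class=\"x\">one</a> and <a href='b.html'>two</a></p>";
        System.out.println(rewriteLinks(line, "path/1-2.html"));
    }
}
```

Capturing the rest of the tag in group 3 keeps attributes such as class intact; a regex that matches them without capturing them would drop them from the output.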

Related

java Jsoup question how can i split by word?

I want to get the HTML content without tags, with one word per line, like this:
word
word
word
So I tried the following.
public class PreProcessing {
public static void main(String[] args) throws Exception {
PrintWriter out = new PrintWriter("filename.txt");
URL url = new URL("https://en.wikipedia.org/wiki/Distributed_computing");
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine = "";
String input = "";
while ((inputLine = in.readLine()) != null)
{
input += inputLine;
// System.out.println(inputLine);
}
//create Jsoup document from HTML
Document jsoupDoc = Jsoup.parse(input);
//set pretty print to false, so \n is not removed
jsoupDoc.outputSettings(new OutputSettings().prettyPrint(false));
//select all <br> tags and append \n after that
// jsoupDoc.select("br").after("\\n");
//select all <p> tags and prepend \n before that
// jsoupDoc.select("p").before("\\n");
//get the HTML from the document, and retaining original new lines
String str = jsoupDoc.html().replaceAll(" ", "\n");
// str.replaceAll("\t", "");
String strWithNewLines = Jsoup.clean(str, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
strWithNewLines.replaceAll("\t", "\n");
strWithNewLines.replaceAll("\"", "");
strWithNewLines.replaceAll(".", "");
System.out.println(strWithNewLines);
out.print(strWithNewLines);
}
}
This is my code. I fetch the en.wikipedia.org Distributed_computing page, read it with a BufferedReader, and parse it with Jsoup. I want to replace every " " with "\n" so that each word ends up on its own line (word\n word\n word\n).
The result is:
Distributed
computing
-
Wikipedia Distributed
computing From
Wikipedia,
the
free
encyclopedia Jump
to
navigation Jump
to
search "Distributed
application"
redirects
here.
For
trustless
applications,
see
But I want the result to look like this:
Distributed
computing
-
Wikipedia
Distributed
computing
From
Wikipedia
the
free
encyclopedia
Jump
to
navigation
Jump
to
search
Distributed
application
redirects
here
For
trustless
applications
see
I tried this:
strWithNewLines.replaceAll("\"", "");
strWithNewLines.replaceAll(".", "");
But it did not work. Why not? I searched online but couldn't find a solution.
Try this for the last few lines. This will bring you nearer to your desired result:
String strWithNewLines = Jsoup.clean ...;
String result = strWithNewLines.replaceAll("\t", "\n")
.replaceAll("\"", "");
//.replaceAll(".", "");
System.out.println(result);
The problem in your code is that String is immutable: String.replaceAll does not change the original String but returns a new one in which the substitution has been made. You never use that result.
There is also a problem with .replaceAll(".", ""). It strips out practically everything, because in a regex . matches any character except a line terminator, and each match is replaced by an empty string. To remove only literal dots, escape the dot: .replaceAll("\\.", "").
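Both points can be seen in a small sketch. The class ReplaceDemo and the method clean are hypothetical names, and removing commas is an assumption based on the desired output shown in the question.

```java
// Hypothetical demo class, not part of the original answer.
class ReplaceDemo {
    /** Turns tabs into newlines and removes double quotes, commas, and literal dots. */
    static String clean(String text) {
        return text
                .replaceAll("\t", "\n")   // regex replace: tab -> newline
                .replace("\"", "")        // literal replace, no regex involved
                .replace(",", "")         // assumption: commas should go as well
                .replaceAll("\\.", "");   // escaped dot matches only a real '.'
    }

    public static void main(String[] args) {
        String s = "\"Distributed\napplication\"\nredirects\nhere.";
        s.replaceAll("\\.", "");               // result discarded, so s is unchanged
        System.out.println(s.endsWith("."));   // still ends with the dot
        System.out.println(clean(s));          // each step feeds the next one
    }
}
```

Chaining the calls, or assigning each result back to a variable, is what makes the replacements take effect.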

Regex pattern to find Integers in every line of the string

I have a pattern here which finds the integers after a comma.
The problem is that the response spans several lines, and the pattern only finds matches in one of them. How do I fix this? I want it to find the pattern in every line.
All help is appreciated:
url = new URL("https://test.com");
con = url.openConnection();
is = con.getInputStream();
br = new BufferedReader(new InputStreamReader(is));
while ((line = br.readLine()) != null) {
String responseData = line;
System.out.println(responseData);
}
pattern = "(?<=,)\\d+";
pr = Pattern.compile(pattern);
match = pr.matcher(responseData); // String responseData
System.out.println();
while (match.find()) {
System.out.println("Found: " + match.group());
}
Here is the response returned as a string:
test.test.test.test.test-test,0,0,0
test.test.test.test.test-test,2,0,0
test.test.test.test.test-test,0,0,3
Here is the printout:
Found: 0
Found: 0
Found: 0
The problem is with building your String, you're assigning only the last line from the BufferedReader:
responseData = line;
If you print responseData before you try to match, you'll see it's only one line, and not what you expected.
Since you're printing the buffer's content using a System.out.println, you do see the whole result, but what's getting saved to responseData is actually the last line.
You should use a StringBuilder to build the whole string:
StringBuilder str = new StringBuilder();
while ((line = br.readLine()) != null) {
str.append(line).append('\n'); // keep a separator so digits from adjacent lines cannot merge
}
responseData = str.toString();
// now responseData contains the whole String, as you expected
Tip: Use the debugger. It will help you understand your code better and find bugs much faster.
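You can also pass the Pattern.MULTILINE option when compiling your regex. Note that this flag only changes how ^ and $ match; the pattern here uses neither, so responseData still has to contain every line: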
pattern = "(?<=,)\\d+";
pr = Pattern.compile(pattern, Pattern.MULTILINE);
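An alternative to collecting the whole response first is to run the matcher once per line while reading. The following is a minimal sketch under that assumption; the class CommaIntegers and the method find are hypothetical names, and a StringReader stands in for the network stream. It uses BufferedReader.lines() rather than a readLine loop.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answers.
class CommaIntegers {
    private static final Pattern AFTER_COMMA = Pattern.compile("(?<=,)\\d+");

    /** Collects every integer that directly follows a comma, across all lines. */
    static List<String> find(BufferedReader br) {
        List<String> found = new ArrayList<>();
        br.lines().forEach(line -> {
            Matcher m = AFTER_COMMA.matcher(line); // one matcher per line
            while (m.find()) {
                found.add(m.group());
            }
        });
        return found;
    }

    public static void main(String[] args) {
        String response = "test.test-test,0,0,0\ntest.test-test,2,0,0\ntest.test-test,0,0,3\n";
        System.out.println(find(new BufferedReader(new StringReader(response))));
    }
}
```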

Java String Matching in a Sorted File and grouping similar data

I have a sorted file and need to do the following pattern match. I read a row and compare it with the row just after it; if they match, I append the string I used for matching to that row, after a comma, and move on to the next row. I am new to Java and overwhelmed by the options, from OpenCSV to BufferedReader. I intend to iterate through the file until it reaches the end. The rows may contain blanks and a date in quotes. The file size will be around 100 MB.
My file has data like
ABCD
ABCD123
ABCD456, 123
XYZ
XYZ890
XYZ123, 890
and output is expected as
ABCD, ABCD
ABCD123, ABCD
ABCD456, 123, ABCD
XYZ, XYZ
XYZ890, XYZ
XYZ123, 890, XYZ
I'm not sure what the best method is. Can you please help me?
To open a file, you can use File and FileReader classes:
File csvFile = new File("file.csv");
FileReader fileReader = null;
try {
fileReader = new FileReader(csvFile);
} catch (FileNotFoundException e) {
e.printStackTrace();
}
You can get a line of the file using Scanner:
Scanner reader = new Scanner(fileReader);
while(reader.hasNextLine()){
String line = reader.nextLine();
parseLine(line);
}
Next you want to parse this line. To do that, use a regular expression through the Pattern and Matcher classes:
private void parseLine(String line) {
Matcher matcher = Pattern.compile("(ABCD)").matcher(line);
if(matcher.find()){
System.out.println("find: " + matcher.group());
}
}
To find the next match in the same row, call matcher.find() again. If another match is found, it returns true and you can get the result with matcher.group().
Read the file line by line and use a regex with String.replaceAll() to rewrite each line as needed:
^([A-Z]+)([0-9]*)(, [0-9]+)?$
Replacement : $1$2$3, $1
Read more about Java Pattern
Sample code:
String regex = "^([A-Z]+)([0-9]*)(, [0-9]+)?$";
String replacement = "$1$2$3, $1";
String newLine = line.replaceAll(regex,replacement);
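As a quick check, the regex and replacement above can be applied to the sample rows from the question. This is only a sketch; the class name RowTagger and the method tag are hypothetical.

```java
// Hypothetical wrapper around the regex and replacement shown above.
class RowTagger {
    private static final String REGEX = "^([A-Z]+)([0-9]*)(, [0-9]+)?$";
    private static final String REPLACEMENT = "$1$2$3, $1";

    /** Appends the leading letter prefix to the row; non-matching rows are returned unchanged. */
    static String tag(String line) {
        return line.replaceAll(REGEX, REPLACEMENT);
    }

    public static void main(String[] args) {
        String[] rows = {"ABCD", "ABCD123", "ABCD456, 123", "XYZ", "XYZ890", "XYZ123, 890"};
        for (String row : rows) {
            System.out.println(tag(row));
        }
    }
}
```

When the optional third group does not take part in the match, Java substitutes nothing for $3, which is why a row such as ABCD becomes ABCD, ABCD.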
For better performance, read 100 or more lines at a time into a buffer and call String#replaceAll() once per batch instead of once per line.
Sample code:
String regex = "([A-Z]+)([0-9]*)(, [0-9]+)?(\r?\n|$)";
String replacement = "$1$2$3, $1$4";
StringBuilder builder = new StringBuilder();
int counter = 0;
String line = null;
try (BufferedReader reader = new BufferedReader(new FileReader("abc.csv"))) {
while ((line = reader.readLine()) != null) {
builder.append(line).append(System.lineSeparator());
if (++counter % 100 == 0) { // flush after every 100 lines
String newLine = builder.toString().replaceAll(regex, replacement);
System.out.print(newLine);
builder.setLength(0); // reset the buffer
}
}
}
if (builder.length() > 0) {
String newLine = builder.toString().replaceAll(regex, replacement);
System.out.print(newLine);
}
Read more about Java 7 - The try-with-resources Statement

What's the correct regex to use for parsing data from a webpage source between tags?

Hi, I'm having some trouble parsing data from a web source between two "tags".
Here's a sample of the web source and the code I'm using to try to parse it.
<div class="ProfileTweet-contents">
<p class="ProfileTweet-text js-tweet-text u-dir"
dir="ltr">Come join us now! </span><span class="invisible">http://</span><span class="js-display-url">www.google.com</span><span class="invisible">/</span><span class="tco-ellipsis"><span class="invisible"> </span></span> <a href="http://t.co/jIw2344dDZz" class="twitter-timeline-link u-isHiddenVisually" data-pre-embedded="true" dir="ltr" >pic.twitter.com/jIwtc23juZz</a></p>
Code
while ((line = in.readLine()) != null) {
Pattern pattern = Pattern.compile("dir=.?!<a href=");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
tweets[0] = matcher.group();
System.out.println(matcher.group());
}
}
The item of data I'm trying to fetch is the following
dir="ltr">Come join us now! <a href=
For some reason it's not fetching the data in between dir= and <a href
Another working example which is parsing the web source just fine
URL addr = new URL(url);
URLConnection con = addr.openConnection();
ArrayList<String> data = new ArrayList<String>();
BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
Pattern p = Pattern.compile("<span itemprop=.*?</span>");
Pattern p2 = Pattern.compile(">.*?<");
Matcher m = p.matcher(inputLine);
Matcher m2;
while (m.find()) {
m2 = p2.matcher(m.group());
while (m2.find()) {
data.add(m2.group().replaceAll("<", "").replaceAll(">", "").replaceAll("&", "").replaceAll("#", "").replaceAll(";", "").replaceAll("3",""));
}
}
}
in.close();
addr = null;
con = null;
Edit: Sorry, I have just realised that I was using a different regex from the one in my other code example.
(dir=).*?(<a href=)
works fine.
You're probably looking for a pattern such as:
(dir=\".+\">.+<a\\shref=).+rel
Your original pattern doesn't work because it leaves out several characters that appear in the text, such as ", and because it misuses .? and !: .? matches at most one character and ! is matched literally, so nothing between dir= and <a href= is captured.
Here is a working example of the pattern above:
http://ideone.com/wbH9O6
The short version of the answer: use an XML parser. If the HTML is mangled, use an HTML parser that will try to make sense of the madness. Read this post as a bonus:
RegEx match open tags except XHTML self-contained tags
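For the narrow case in the question, the difference between the two patterns can be checked with a short sketch. The input line is a simplified stand-in for the sample markup, and the class BetweenMarkers and the method extract are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answers.
class BetweenMarkers {
    /** Returns the first span from "dir=" through the next "<a href=", or null if there is none. */
    static String extract(String line) {
        // .*? is lazy: it stops at the first "<a href=" that follows "dir="
        Matcher m = Pattern.compile("(dir=).*?(<a href=)").matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Simplified stand-in for the markup in the question
        String line = "<p dir=\"ltr\">Come join us now! <a href=\"http://t.co/x\">pic</a></p>";
        System.out.println(extract(line));
        // The original pattern allows at most one character between "dir=" and a literal "!"
        System.out.println(Pattern.compile("dir=.?!<a href=").matcher(line).find());
    }
}
```

On the full markup from the question, the match would also include the span tags that sit between the text and the link, which is one more reason to prefer a parser.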

Extracting specific urls from a text file using java

I have a text document containing a bunch of URLs of the form /courses/......./.../..
From among these URLs, I only want to extract those of the form /courses/.../lecture-notes, meaning the URLs that begin with /courses and end with /lecture-notes.
Would anyone know of a good way to do this with regular expressions or just by string matching?
Here's one alternative:
Scanner s = new Scanner(new FileReader("filename.txt"));
String str;
while (null != (str = s.findWithinHorizon("/courses/\\S*/lecture-notes", 0)))
System.out.println(str);
Given a filename.txt with the content
Here /courses/lorem/lecture-notes and
here /courses/ipsum/dolor/lecture-notes perhaps.
the above snippet prints
/courses/lorem/lecture-notes
/courses/ipsum/dolor/lecture-notes
The following returns only the middle part (i.e., it excludes /courses/ and /lecture-notes):
Pattern p = Pattern.compile("/courses/(.*)/lecture-notes");
Matcher m = p.matcher(yourString);
if (m.find()) {
    return m.group(1); // group(1) is the text captured by the first pair of parentheses in the regex
}
Assuming that you have one URL per line, you could use:
BufferedReader br = new BufferedReader(new FileReader("urls.txt"));
String urlLine;
while ((urlLine = br.readLine()) != null) {
if (urlLine.matches("/courses/.*/lecture-notes")) {
// use url
}
}
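If the URLs are embedded in running text rather than one per line, a Pattern with find() collects them all. This is a minimal sketch that reuses the regex from the Scanner answer; the class LectureNoteUrls and the method extract are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answers.
class LectureNoteUrls {
    // \S* keeps a match inside one whitespace-delimited token
    private static final Pattern LECTURE_NOTES = Pattern.compile("/courses/\\S*/lecture-notes");

    /** Returns every /courses/.../lecture-notes URL found in the text, in order. */
    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = LECTURE_NOTES.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "Here /courses/lorem/lecture-notes and\n"
                + "here /courses/ipsum/dolor/lecture-notes perhaps.";
        for (String url : extract(text)) {
            System.out.println(url);
        }
    }
}
```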
