Bug #3602

HTMLParser.getHttpLinks broken in SVN (Working in Stable)

Added by pspzockerscene 10 months ago. Updated 10 months ago.

Status: Closed
Start date: 07/23/2011
Priority: Normal
Due date:
Assignee: jiaz
% Done: 100%
Category: General
Target version: -
Revision: 0000
Resolution:

Description

Out of heap space

History

Updated by mccartney 10 months ago

I tried to look into the issue. My impression is that it's unsafe to assume that each '<' sign is an opening tag in the HTML source. There might be inlined JavaScript containing '<' comparisons, e.g.:

ar X=U.length;for(var Y=0;Y<X;Y++){U[Y]()}}function K(X){if(J){X()}
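To illustrate the point, here is a minimal Java sketch. It is not JDownloader code: the class name, the method names and the sample page are all made up for this example. It counts '<' characters in a page with and without the contents of script elements. The regex-based stripping is only a rough stand-in for what a real HTML tokenizer does and would itself be fooled by edge cases such as "</script>" inside a JavaScript string.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names exist in JDownloader.
class LessThanSketch {
    // Rough pattern for a whole <script>...</script> element (an assumption, not a full HTML parser).
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Counts every '<' character, the way a naive tag scanner would. */
    static int countLessThan(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '<') {
                n++;
            }
        }
        return n;
    }

    /** Removes script elements so that '<' comparisons inside them are not counted. */
    static String stripScripts(String html) {
        return SCRIPT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<a href=\"http://example.com/\">x</a>"
                + "<script>for(var Y=0;Y<X;Y++){U[Y]()}</script>";
        System.out.println("naive count:      " + countLessThan(page));
        System.out.println("without scripts:  " + countLessThan(stripScripts(page)));
    }
}
```

On the sample page the naive count includes the '<' from the Y<X comparison, so it is higher than the number of actual tag delimiters.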

Moreover, I am afraid JD should consider handling improper HTML markup too (such as an unescaped '<' somewhere in the page content), even if no JavaScript is involved. That would have to be tested: how do real-world browsers behave?

Anyway, I believe the problem here is that the code that collects the data between tags uses manual 'indexOf' searches, which might not be correct. I might be totally wrong here, but the following code (HTMLParser:75-79 @ r14663):

                final int pos = data.indexOf('>');
                if (pos >= 0 && data.length() >= pos + 1) {
                    final int posb = data.indexOf('<');
                    if (posb > 0) {

was meant to look for '<' characters occurring before '>'. So I believe one way to go would be to add an additional check (&& posb < pos).
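A small Java sketch of what such a guard would change. This is an assumption-laden illustration, not the actual HTMLParser code: only the two indexOf calls and the conditions come from the excerpt above, while the class name, the method names and the boolean return values are hypothetical, since the body of the innermost if is not shown in the excerpt.

```java
// Hypothetical illustration; the real method around HTMLParser:75-79 is not reproduced here.
class IndexGuardSketch {
    /** Check as in the excerpt: only requires some '<' at an index greater than 0. */
    static boolean uncheckedMatch(String data) {
        final int pos = data.indexOf('>');
        if (pos >= 0 && data.length() >= pos + 1) {
            final int posb = data.indexOf('<');
            return posb > 0;
        }
        return false;
    }

    /** Guarded variant: additionally requires the '<' to come before the '>'. */
    static boolean guardedMatch(String data) {
        final int pos = data.indexOf('>');
        if (pos >= 0 && data.length() >= pos + 1) {
            final int posb = data.indexOf('<');
            return posb > 0 && posb < pos;
        }
        return false;
    }

    public static void main(String[] args) {
        String afterClose = "a href=x>text<b"; // '<' appears only after '>'
        String beforeClose = "a <b>";          // '<' appears before '>'
        System.out.println(uncheckedMatch(afterClose) + " " + guardedMatch(afterClose));
        System.out.println(uncheckedMatch(beforeClose) + " " + guardedMatch(beforeClose));
    }
}
```

The two variants differ only for input where the first '<' comes after the first '>', which is the case the extra condition is meant to exclude. Whether that is the right behaviour depends on what the surrounding code does with the match.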

I didn't feel confident enough about the purpose and meaning of the code there, so I didn't make any changes.

Updated by jiaz 10 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

Applied in changeset r14680.
