Bug #3602
HTMLParser.getHttpLinks broken in SVN (Working in Stable)
| Status: | Closed | Start date: | 07/23/2011 |
|---|---|---|---|
| Priority: | Normal | Due date: | |
| Assignee: | | % Done: | 100% |
| Category: | General | | |
| Target version: | - | | |
| Revision: | 0000 | Resolution: | |
Description
Out of heap space
History
Updated by mccartney 10 months ago
I tried to look into the issue. My impression is that it's unsafe to assume that each '<' sign is an opening tag in the HTML source. There might be inline JavaScript with < comparisons in it, e.g.
ar X=U.length;for(var Y=0;Y<X;Y++){U[Y]()}}function K(X){if(J){X()}
Moreover, I am afraid JD should consider handling malformed HTML markup too (such as an unescaped '<' somewhere in the page content), even if no JavaScript is involved. That would have to be tested: how do real-world browsers behave?
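To make the inline JavaScript case concrete, here is a minimal sketch of one possible workaround: dropping `<script>` blocks before scanning for tags. This is not the HTMLParser code; the class name `ScriptStripSketch` and the method `stripScripts` are hypothetical, and the regular expression is only an approximation of how script blocks are delimited.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not part of JD's HTMLParser.
final class ScriptStripSketch {
    // Matches an opening <script ...> tag, its body, and the closing </script>.
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the input with every matched script block removed. */
    static String stripScripts(final String html) {
        return SCRIPT_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(final String[] args) {
        final String html = "<a href=\"http://example.com/\">x</a>"
                + "<script>for(var Y=0;Y<X;Y++){U[Y]()}</script>"
                + "<b>y</b>";
        // The '<' inside the loop condition disappears together with the script.
        System.out.println(stripScripts(html));
        // prints: <a href="http://example.com/">x</a><b>y</b>
    }
}
```

Note that this would also throw away any URLs that appear inside the script body, which may not be acceptable for a link grabber, and it does nothing for an unescaped '<' in ordinary page content.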
Anyway, I believe the problem here is that the code collecting the data between tags uses manual 'indexOf' searches, which might not be correct. I might be totally wrong here, but the following code (HTMLParser:75-79 @ r14663):
final int pos = data.indexOf('>');
if (pos >= 0 && data.length() >= pos + 1) {
final int posb = data.indexOf('<');
if (posb > 0) {
was meant to look for '<' characters occurring before '>'. So I believe one way to go would be to add an extra check (&& posb < pos).
I didn't feel confident enough about the purpose and meaning of that code, though, so I didn't make any changes.
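For illustration, the suggested check could look roughly like the sketch below, assuming the intended condition is that the first '<' lies before the first '>' (posb < pos). The class `IndexCheckSketch` and the method `hasOpenBeforeClose` are hypothetical names; only the two `indexOf` calls and the surrounding conditions are taken from the quoted snippet, and what HTMLParser does inside that branch is not shown here.

```java
// Hypothetical sketch of the extra bounds check, not the actual HTMLParser code.
final class IndexCheckSketch {
    /**
     * Returns true if a '<' occurs after the first character
     * and before the first '>' in the given data.
     */
    static boolean hasOpenBeforeClose(final String data) {
        final int pos = data.indexOf('>');
        if (pos >= 0 && data.length() >= pos + 1) {
            final int posb = data.indexOf('<');
            // "posb < pos" is the suggested addition: without it, a '<'
            // located after the first '>' would also pass the check.
            return posb > 0 && posb < pos;
        }
        return false;
    }

    public static void main(final String[] args) {
        System.out.println(hasOpenBeforeClose("a<b>c"));   // '<' at 1, '>' at 3
        System.out.println(hasOpenBeforeClose("abc>d<e")); // '<' at 5, '>' at 3
    }
}
```

With the extra condition, the first call returns true and the second returns false; without it, both would pass the posb > 0 test.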