This isn’t really important or anything; I mostly wanted to see my thoughts written down and put this out there for feedback. But I guess I should start by giving you some background: version 0.4.3 is to be released shortly, which brings me closer to v0.5, whose major feature will be support for DOM/JS and, by extension, AJAX.
The way this usually gets implemented in scanners is one of two ways: either use WebDriver to crawl the website via a real browser, or use a simple regular crawler and pass each page to a browser to be parsed, with its JS and DOM events executed, while somehow recording and analyzing any HTTP requests the browser makes in order to capture resources only visible via AJAX.
As you can imagine, this causes a lot of latency, and I’ve always said that this isn’t acceptable as a solution. Instead, I planned to write my own lightweight DOM in Ruby and use TheRubyRacer as a binding to Chrome’s V8 JS interpreter. That approach would bypass the latency of WebDriver and the browser’s rendering engine, and even though it would take some time to become stable, it would at least be efficient. And if you’re thinking “this will take an insane amount of effort” then you’d be absolutely right, but here’s the thing: 2 versions (or so) ago, Arachni’s crawler reached a stage where it was fast enough that the biggest bottleneck became parsing the URLs it was finding, so slipping the latency of a real browser into the mix wasn’t acceptable, nor even possible. Besides killing the crawler’s performance, it would also kill the WebDriver/browser system, because the crawler would be feeding it hundreds of pages per second. Pausing to let you imagine GMail or Facebook being loaded 100 times a second via your browser: not a pretty sight, is it?
That’s the background, and the reason I was leaning towards my own approach.
Still, I like optimizing things as much as possible and, sadly, even my lightweight approach would have a big negative impact on the crawler’s performance. I initially figured that, it being an optional feature with massive benefits, the trade-off would be worth it; but still, it’s just not right.
Then I thought: where is it written that this has to happen in the crawler? It would be much better to have a regular dumb (but really fast) crawler and run the JS and DOM events only when a fresh instance of the page is fetched to be audited; if anything new is found, push it to the audit queue (and also update the system’s sitemap for posterity’s sake). The reasoning being that the audit is inherently a high-latency operation, so it would mask the latency of the DOM/JS work.
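To make that a bit more concrete, here’s a minimal sketch of the “dumb crawler, smart audit” idea. All the names here (`analyze_dom_js`, the queues) are made up for illustration and are not Arachni’s real API; the analysis step is a stub standing in for actual DOM/JS execution:

```ruby
require 'set'

audit_queue = ["/", "/login"]          # pages the fast, JS-oblivious crawler found
sitemap     = Set.new(audit_queue)     # global sitemap, updated as we go

# Hypothetical stand-in for real DOM/JS execution: returns any resources
# that only become visible once the page's JS has run.
def analyze_dom_js(page)
  page == "/" ? ["/ajax/feed"] : []
end

until audit_queue.empty?
  page = audit_queue.shift
  # DOM/JS analysis happens here, at audit time, not in the crawler.
  analyze_dom_js(page).each do |new_resource|
    next if sitemap.include?(new_resource)
    sitemap << new_resource        # keep the sitemap current for posterity's sake
    audit_queue << new_resource    # and schedule the newly revealed resource
  end
  # ... the regular (non-DOM) audit of `page` would happen here ...
end
```

The crawler never pays the DOM/JS cost; the audit does, and since auditing is slow anyway, the extra work hides in its shadow.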
About half an hour ago I got a better idea (as is usually the case) while I was in the bathroom. I’ll help you push away any disturbing mental images by clarifying that I was just showering, which brings on its own disturbing images, so let’s just gloss over that and get back on topic… I did say “Eureka” though. The idea was that the latency of DOM/JS can be masked completely by performing a regular page audit (as it is right now) and running the DOM/JS analysis in a different thread, while the HTTP responses of the regular audit are being received. If any new elements are found as a result of the DOM/JS analysis, they can be queued to be audited on the fly (alongside the ones already running) and no one would be any the wiser.
And here comes the good part: a single page audit takes at least a couple of seconds, and that is plenty of time to use WebDriver with a real browser to analyze a page. This takes away the problem of using a real browser for DOM/JS/AJAX, since it turns out it can be done in such a way that its latency is completely hidden, and the WebDriver process’ workload would be minimal, as it would only need to analyze a page every few seconds (or even minutes).
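A toy version of that latency-masking trick, using nothing but Ruby threads and a shared queue; the element names and timings are illustrative, not Arachni internals, and the background thread stands in for the WebDriver/browser analysis:

```ruby
require 'thread'

element_queue = Queue.new
%w[form#search link#home cookie#session].each { |e| element_queue << e }

# Background "browser" thread: while the regular audit is busy waiting on
# HTTP responses, this finishes its page analysis and queues its findings.
browser = Thread.new do
  sleep 0.05                             # pretend the browser is rendering/executing JS
  element_queue << "form#ajax-comment"   # element only visible after JS ran
end

audited = []
# Main audit loop: drain the queue, but don't declare the page done while
# the browser thread might still discover new elements.
until element_queue.empty? && !browser.alive?
  if element_queue.empty?
    sleep 0.01                       # browser still working; wait a beat
  else
    audited << element_queue.pop
    sleep 0.02                       # stand-in for an HTTP round-trip
  end
end
browser.join
audited << element_queue.pop until element_queue.empty?  # catch any late arrivals
```

The JS-only element gets audited in the same pass as the statically discovered ones; from the outside, the browser’s latency never shows.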
So… yeah… there it is. At this point I think I’ll go with the latter approach, as it’d provide a full browser stack with minimal downsides and would allow the feature to be implemented over a weekend or so. Selenium with PhantomJS seems like the way to go for this, but it’s too early to say.
And that’s kind of a bittersweet realization, as writing a user agent with DOM/JS/AJAX support in Ruby would have been really cool; kinda like what women feel (I imagine) after getting a false-positive pregnancy test: terrified, for sure, but slowly warming up to the idea, and then, once you realize it was a false positive, relieved but strangely disappointed. You know… if you were to multiply my situation by 10 to the gazillionth power.