Preliminary look at the new high-performance distributed crawler

Update: Updated benchmark data.

Let’s start with a recap

Arachni has a few high-performance distribution features that enable it to be configured in a Grid and utilize multiple nodes to speed up individual scans.

As far as I know, this hadn't been done before, probably because it wasn't worth it: SaaS providers only care about scanning a lot of sites within a reasonable time-frame, and commercial scanner vendors would have to be crazy to completely rework their profitable products in order to implement this.

However, this wasn't a problem for me, as my only motivation is plain curiosity and the code-base was already experimental to begin with, so I figured: what the hell.

Plus, with such easy access to "cloud" services nowadays, anyone can grab a couple of servers and set them up in a Grid, so why not let users take advantage of that?

This is the way high-performance scanning currently works:

  • You start 2 or more Dispatchers on machines with distinct bandwidth lines (at least to the target webapp).
  • These Dispatchers converge and set up a Grid.
  • You ask for an Instance from one of the Dispatchers and tell it to operate in High-Performance Grid mode — this one will be the “Master” Instance.
  • The Master asks its own Dispatcher for a list of Dispatchers with distinct bandwidth lines and proceeds to ask for one Instance from each of them (these will be the “Slaves”).
  • The Master performs the crawl, calculates what needs to be audited and, once the crawl finishes, distributes the audit workload amongst itself and its Slaves.
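The last step, splitting the audit workload, can be sketched in a few lines of Ruby. This is just a round-robin illustration of the idea; Arachni's actual distribution logic is more elaborate and the instance names here are made up:

```ruby
# Once the crawl is done, the Master knows every page that needs auditing
# and deals them out between itself and its Slaves, round-robin style.
pages     = %w[/ /login /search /profile /logout]
instances = %w[master slave-1 slave-2]

workload = Hash.new { |h, k| h[k] = [] }
pages.each_with_index do |page, i|
  workload[instances[i % instances.size]] << page
end

workload.each { |instance, assigned| puts "#{instance}: #{assigned.join(' ')}" }
```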

In essence, we're aiming for a high-level line-aggregation setup here; that's why only Dispatchers with distinct lines are considered.

But something was missing

Once I got that sorted, I was quite pleased with the results (not to mention burnt out), so I left it at that; but you can see that I could have added a couple more features to make the system more…"complete".

These features being a distributed crawler and allowing any 2 or more Instances to gang up, instead of being so rigid. But, like I said, I was too tired, so I left that for later.

Combining multiple Instances without Dispatchers

One thing I noticed, though, is that even when pairing up a couple of Dispatchers locally (and thus performing an audit using multiple Instances) there were noticeable performance increases.

After I noticed this, I put it on my TODO list to allow any 2 or more Instances to be configured in a Master-Slave setup.

I mean, everything was already in place: IPC was there, in the form of the existing RPC protocol, and all the required functionality was already exposed over it. It only took a few minor adjustments to make this work.
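To get a feel for what "drive another Instance over RPC" looks like, here is a rough, self-contained analogy using Ruby's stdlib DRb. Arachni uses its own RPC protocol, not DRb, and the `AuditService`/`audit` names are hypothetical stand-ins:

```ruby
require 'drb/drb'

# A stand-in for the functionality an Instance exposes over RPC.
class AuditService
  attr_reader :audited

  def initialize
    @audited = []
  end

  # A Master hands a Slave its share of the workload via a call like this.
  def audit(pages)
    @audited.concat(pages)
    pages.size
  end
end

front  = AuditService.new
server = DRb.start_service('druby://localhost:0', front) # ephemeral port

# Elsewhere -- possibly in another process -- a Master grabs a handle
# to the Slave and drives it with plain method calls.
slave = DRbObject.new_with_uri(server.uri)
slave.audit(%w[/login /search])

DRb.stop_service
```

The point being that once the remote object is exposed, distributing work is just method calls; there is no new protocol to design.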

Distributed crawling

Some time after that, I started tackling a distributed crawler prototype to see if I could figure out a way to make something like that work. Distributing the audit was tricky but made sense: you knew beforehand what needed to be audited. When crawling, however, you need to deal with things on the fly; you've no idea how the site is laid out, which is why you're crawling it in the first place.

Not to mention that, for performance reasons, the crawler couldn't include a single decision/deduplication point, nor any other sort of operation that would require one Instance to wait on another before continuing.

Things needed to be very, very fluid: no checking, no waiting, no blocking, no duplication, all while maintaining fair and predictable workload distribution. That pissed me off at the beginning, because I didn't know whether it was even possible.
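One classic way to get deduplication without any coordination is to have every Instance apply the same deterministic rule to each URL it discovers, so they all independently agree on who owns what. This sketch is an assumption about the general technique, not a description of Arachni's actual algorithm:

```ruby
require 'digest'

# Each Instance runs this same function locally: hash the URL and take it
# modulo the number of Instances. No lookups, no waiting, no blocking --
# every Instance reaches the same answer on its own.
# (Hypothetical policy; not Arachni's real distribution scheme.)
def responsible_instance(url, instance_count)
  Digest::MD5.hexdigest(url).to_i(16) % instance_count
end

urls = %w[http://example.com/ http://example.com/a http://example.com/b]
urls.each do |url|
  puts "#{url} -> instance #{responsible_instance(url, 3)}"
end
```

Because the rule is a pure function of the URL, a page discovered by two Instances at once still ends up audited exactly once.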

Luckily, I figured out a way to make it work and the prototype performed surprisingly well.

The (preliminary) end result

Today I implemented both of these features and here are the results (the fun starts at line 50):

You can’t actually run the above code as the necessary changes haven’t been pushed to the repo yet (and the API could change at any time) but that’s the sort of thing you can expect to be able to do with v0.4.2.

Time for some numbers

Blah blah blah yada yada yada who cares? Let’s see some numbers!

As you can see from the code sample, the site I chose for the benchmarks is http://www.ruby-doc.org (excluding URLs that contain the string "downloads"), as it provides plenty of breadth and depth and its server is quite responsive. Not to mention that it serves large doc pages containing a boatload of HTML, which also makes for a decent stress-test of the parser; it kept my CPU steadily at 99-100%.

The total number of pages was 26,935 and the total HTTP request max concurrency was set to 20.

1 Process (HTTP request max concurrency: 20): 00:35:37

2 Processes (HTTP request max concurrency: 10 each): 00:22:06

3 Processes (HTTP request max concurrency: 7 each): 00:16:57

4 Processes (HTTP request max concurrency: 5 each): 00:11:26

5 Processes (HTTP request max concurrency: 4 each): 00:11:00

These were done with my crummy 5741Kbps line.
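For context, a quick back-of-the-envelope on the timings above, computing each run's speedup relative to the single-process baseline:

```ruby
# Convert an HH:MM:SS timing into seconds.
def seconds(hms)
  h, m, s = hms.split(':').map(&:to_i)
  h * 3600 + m * 60 + s
end

baseline = seconds('00:35:37')
timings  = {
  1 => '00:35:37', 2 => '00:22:06', 3 => '00:16:57',
  4 => '00:11:26', 5 => '00:11:00'
}

timings.each do |procs, time|
  printf("%d process(es): %s (%.2fx)\n", procs, time, baseline.to_f / seconds(time))
end
```

The returns taper off between 4 and 5 processes, which fits the observation below that the runs were CPU-bound as much as network-bound.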

Same machine, same overall max concurrency; the only difference was the number of processes used. This goes to show that, no matter how careful you are with your design, spreading your workload across proper OS processes makes a big difference.

And Arachni doesn't even use threads, so it isn't really affected by Ruby's GIL, which makes the results even more curious. Although, in this case, the HTML responses were quite sizable and, as I previously mentioned, had the processes consuming 100% of their cores, so it may have been a matter of load-balancing raw processing rather than network I/O.

Bottom line is, it was certainly worth it. :)

About Tasos Laskos

CEO of Sarosys LLC, founder and lead developer of Arachni.
