Update: Updated benchmark data.
Let’s start with a recap
Arachni has a few high-performance distribution features enabling it to be configured in a Grid and utilize multiple nodes to perform faster individual scans.
As far as I know, this hadn’t been done before, probably because it wasn’t worth it as SaaS providers only care about scanning a lot of sites under a reasonable time-frame and commercial scanner vendors would have to be crazy to completely change their profitable products in order to implement this.
However, this wasn’t a problem for me as my only motivation is plain curiosity and the code-base was already experimental to begin with so I figured what the hell.
Plus, with such easy access to “cloud” services nowadays, anyone can grab a couple of servers and set them up in a Grid and why not let users take advantage of that fact?
This is the way high-performance scanning currently works:
- You start 2 or more Dispatchers on machines with distinct bandwidth lines (at least to the target webapp).
- These Dispatchers converge and setup a grid.
- You ask for an Instance from one of the Dispatchers and tell it to operate in High-Performance Grid mode — this one will be the “Master” Instance.
- The Master asks its own Dispatcher for a list of Dispatchers with distinct bandwidth lines and proceeds to ask for one Instance from each of them (these will be the “Slaves”).
- The Master performs the crawl, calculates what needs to be audited and once the crawl finishes the audit workload is distributed amongst it and its slaves.
In essence, we’re aiming at a high-level line aggregation situation here, that’s why only Dispatchers with distinct lines are considered.
But something was missing
Once I got that sorted, I was quite pleased with the results (not to mention burnt out) so I left it at that, but you can see that I could have added a couple more features to make the system more…”complete”.
These features being a distributed crawler and to allow any 2 or more Instances to gang-up, instead of being so rigid. But, like I said, I was too tired so I left that for later.
Combining multiple Instances without Dispatchers
One thing I noticed though is that even when pairing up a couple of Dispatchers locally (and thus performing an audit using multiple Instances) there were noticeable increases in performance.
After I noticed this, I put it on my TODO list to allow any 2 or more Instances to be configured in a Master-Slaves setup.
I mean, everything was already in place; IPC was there, in the form of the existing RPC protocol, and all required functionality was already exposed over it. It only required a few minor adjustments to make this work.
Distributed crawling
Some time after that, I started tackling a distributed crawler prototype to see if I can figure out a way to make something like that work. Distributing the audit was tricky but made sense, you knew before-hand what needed to be audited, but when crawling you need to deal with things on the fly. You’ve no idea how the site is being laid out, that’s why you’re crawling it in the first place.
Not to mention the fact that the crawler had to not include a single decision/deduplication point nor any other sort of operation that would require one Instance to request another one for an operation before continuing — for performance reasons.
Things needed to be very, very fluid. No checking, no waiting, no blocking, no duplication while maintaining fair and predictable workload distribution — and that pissed me off at the beginning because I didn’t know if it was even possible.
Luckily, I figured out a way to make it work and and the prototype performed surprisingly well.
The (preliminary) end result
Today I implemented both these features and here are the results (the fun starts at line 50):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 | require 'arachni/rpc/server/instance' require 'arachni/rpc/server/instance' require 'arachni/rpc/client' # Auth token for all instances. @token = 'secret' # Global address to bind to. Options.rpc_address = 'localhost' # Starts an instance in its own process and returns a hash with its connection info. def spawn_instance_on_port( port ) (@pids ||= []) << ::EM.fork_reactor { $stdout.reopen( '/dev/null', 'w' ) $stderr.reopen( '/dev/null', 'w' ) Options.rpc_port = port RPC::Server::Instance.new( Options.instance, @token ) } sleep 1 Process.detach( @pids.last ) { 'url' => "#{Options.rpc_address}:#{port}", 'token' => @token } end # Kills all processes/instances. def killall return if !@pids || @pids.empty? puts print 'Killing all processes... ' @pids.each do |pid| tries = 0 begin Process.kill( 'KILL', pid ) rescue tries += 1 retry if tries <= 4 end end puts 'Done!' @pids.clear exit end # Setup interrupt handling. %w(QUIT INT).each { |signal| trap( signal ) { killall } } begin print 'Initialising the master instance... ' # Get an instance to act as a master. master_info = spawn_instance_on_port( 1111 ) master = RPC::Client::Instance.new( Options.instance, master_info['url'], master_info['token'] ) # Beware, the master's options will be applied as they are to each of the # instances. master.opts.url = 'http://www.ruby-doc.org' # This particular site has links to packages so skip them. master.opts.exclude = 'downloads' # We don't load any modules nor plugins nor enable auditing of any element # type, we're only interested in crawling. # Since we plan on using 4 instances, we set the maximum HTTP request # concurrency of each to 5 (to have a total of 20). master.opts.http_req_limit = 5 puts ' Done!' print 'Preparing the slaves... ' # Enslaving other instances automatically makes us the master of the group. master.framework.enslave spawn_instance_on_port( 2222 ) master.framework.enslave spawn_instance_on_port( 3333 ) master.framework.enslave spawn_instance_on_port( 4444 ) puts ' Done!' puts print 'Running' # Start the show! master.framework.run # Block until the crawl finishes. while master.framework.busy? print '.' sleep 5 end puts ' Done!' puts print 'Grabbing the audit results, please wait... ' auditstore = master.framework.auditstore puts 'Done!' puts puts 'Sitemap' puts '~' * 25 auditstore.sitemap.each { |url| puts "---- #{url}" } puts puts "Total: #{auditstore.sitemap.size}" puts puts "Total runtime: #{auditstore.delta_time}" ensure killall end |
You can’t actually run the above code as the necessary changes haven’t been pushed to the repo yet (and the API could change at any time) but that’s the sort of thing you can expect to be able to do with v0.4.2.
Time for some numbers
Blah blah blah yada yada yada who cares? Let’s see some numbers!
As you can see from the code sample, the site I chose for the benchmarks is http://www.ruby-doc.org (excluding URLs that contain the string “downloads”) as it provides plenty of breadth and depth and their server is quite responsive, not to mention that they serve large doc pages so they contain a boatload of HTML which also serves as a decent stress-test for the parser — kept my CPU steadily at 99-100%.
The total amount of pages was 26,935 and the total HTTP Request max concurrency was set to 20.
1 Process (HTTP request max concurrency: 20): 00:35:37
2 Processes (HTTP request max concurrency: 10 each): 00:22:06
3 Processes (HTTP request max concurrency: 7 each): 00:16:57
4 Processes (HTTP request max concurrency: 5 each): 00:11:26
5 Processes (HTTP request max concurrency: 4 each): 00:11:00
These were done with my crummy 5741Kbps line.
Same machine, same overall max concurrency, the only difference was the amount of processes used — which goes to show that no-matter how careful you are with your design, having your workload spread across proper OS processes makes a big difference.
And Arachni doesn’t even use threads, so it doesn’t really get affected by Ruby’s GIL, which makes the results even more curious. Although, in this case the HTML responses were quite sizable and, like I previously mentioned, had the processes consume 100% of their cores so it may have been a matter of load-balancing raw processing operations rather than network I/O.
Bottom line is, it was certainly worth it. :)
