It’s been a while since I posted here, been quite busy with my new job at Rapid7 and working on Arachni but I wanted to share a little piece of code with you.

Like with the last post, I was working on something for Arachni and since it took me a few hours of googling before I gave up and rolled my own I figured I’d post the solution here and save someone else some time.

What I was working on is the first step of URL rewrite support, and the first issue was allowing the user to specify how paths should be interpreted. That was easy: have the user specify regular expressions and, to make things a bit nicer/cleaner/neater, have them specify named groups for each parameter embedded in the path.

Consider a path containing the book category and its ID, then some random crap that is to be ignored, then the chapter ID, then some other ID and then a value identical to the book ID.

So an appropriate regexp to extract the data we need would be:

```ruby
/
    \/(?<category>\w+)            # matches category type
    \/                            # path separator
    (?<book-id>\d+)               # matches book ID numbers
    \/                            # path separator
    .*                            # irrelevant
    \/                            # path separator
    chapter-(?<chapter-id>\d+)    # matches chapter ID numbers
    \/                            # path separator
    stuff(?<stuff-id>\d+)         # matches stuff ID numbers
/x
```

It would be nice if we could get the matches as a hash with the group name as the key and the matched data as the value like so:

```ruby
{
    "category"   => ["book"],
    "book-id"    => ["12"],
    "chapter-id" => ["3"],
    "stuff-id"   => ["4"]
}
```

(The values are arrays in case the regexp matches more than one value. In the case of extracting URL rewrite data the values should be unique and singular, but it’s nice to have a smarter algo to cover future needs.)

We can do the above with:

```ruby
class String

    def scan_in_groups( regexp )
        raise ArgumentError, 'Regexp does not contain any names.' if regexp.names.empty?
        # ...
    end

end
```
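For illustration, here is a minimal sketch of one way such a `#scan_in_groups` can be implemented. This is my own reconstruction, not necessarily the exact original code:

```ruby
class String
    # Returns a Hash of named-group => array-of-matches pairs.
    # A sketch, not necessarily identical to the original implementation.
    def scan_in_groups( regexp )
        raise ArgumentError, 'Regexp does not contain any names.' if regexp.names.empty?

        matches = {}
        scan( regexp ) do
            # String#scan sets the current MatchData for each iteration.
            m = Regexp.last_match
            regexp.names.each do |name|
                next if !m[name] || m[name].empty?
                (matches[name] ||= []) << m[name]
            end
        end
        matches
    end
end
```

For example, `'foo bar'.scan_in_groups( /(?<word>[a-z]+)/ )` yields `{ "word" => ["foo", "bar"] }`.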

It’d also be nice if we could substitute the matched named groups using a hash similar to the one returned by #scan_in_groups — this would make fuzzing the inputs in the path easier. I was sure that String#sub would provide a way of doing this (since it supports grouped regular expressions) but, unfortunately, it does not; it’s not a big deal though, since it only takes a few lines of code.

```ruby
class String

    def scan_in_groups( regexp )
        raise ArgumentError, 'Regexp does not contain any names.' if regexp.names.empty?
        # ...
    end

    # ...

end
```
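Likewise, a hedged sketch of what such a `#sub_in_groups` might look like (the method body and details here are my own illustration, not the original code):

```ruby
class String
    # Substitutes the substring captured by each named group with the
    # corresponding value from the given Hash -- an illustrative sketch.
    def sub_in_groups( regexp, updates )
        raise ArgumentError, 'Regexp does not contain any names.' if regexp.names.empty?

        m = match( regexp )
        return dup if !m

        str = dup
        # Substitute right-to-left so that earlier offsets stay valid.
        updates.keys.select { |name| m[name] }.
            sort_by { |name| -m.begin( name ) }.each do |name|
            str[m.begin( name )...m.end( name )] = updates[name]
        end
        str
    end
end
```

For example, `'/book/12/'.sub_in_groups( /\/(?<category>\w+)\/(?<id>\d+)/, 'category' => 'movie', 'id' => '99' )` returns `'/movie/99/'`.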

Those of you who’ve been following Arachni’s Twitter account or blog will know where this post is coming from.

I recently found that my URL normalization methods were sucking up loads of CPU time and that caching them (with a simple Ruby Hash) cut the time of a 1000-page crawl almost in half; that was a great day, as you can imagine, but for the time being I left it at that and continued working on the specs — which I’ve now gotten sick of. After I finished adding specs for all of Arachni’s core classes I went back to add a proper cache; after all, I couldn’t just let these methods populate their cache hashes indefinitely.

Never having been in that situation before, I took a stroll over to Wikipedia to get me started. I soon discovered that what I needed was a Least Recently Used cache, so I did a bit of googling to see if some kind soul had already written such a cache in Ruby. Luckily, I did find a few implementations that seemed to do the job and, with those as a base, I got cracking.

Sadly, the overhead of having to maintain ages for all entries after each access operation completely negated the performance improvements of caching — i.e. the crawl time was the same either way. I spent hours trying to write or find a better LRU algorithm, but no dice.

So back to that Wikipedia page I went, trying to find some other algorithm that could suit my needs. Then my eye fell on Random Replacement, which sounded nice. Since it randomly removes cache entries when it needs to make space for new ones there’s pretty much no overhead, nothing to keep track of — plus the description has a nice side note (“For its simplicity, it has been used in ARM processors”) so I figured why not.

Fortunately, it performed beautifully: fast, simple and with no overhead.
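To show why there is virtually nothing to keep track of, here is a bare-bones sketch of a Random Replacement cache (a toy illustration, not Arachni’s actual class):

```ruby
# Minimal Random Replacement cache sketch: reads and writes are plain
# Hash operations; the only extra work is a random eviction when full.
class RandomReplacementCache
    def initialize( max_size )
        @max_size = max_size
        @cache    = {}
    end

    def []( key )
        @cache[key]
    end

    def []=( key, value )
        # When full, evict a randomly chosen entry -- no ages or access
        # counters to maintain on every hit, hence virtually no overhead.
        if @cache.size >= @max_size && !@cache.include?( key )
            @cache.delete( @cache.keys.sample )
        end
        @cache[key] = value
    end

    def size
        @cache.size
    end
end
```

Compare that to an LRU cache, which has to bump an entry’s age on every single read as well as every write.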

But, as it usually goes, that wasn’t enough (for me)… I figured that I should try to push it a bit further and update the RR algo to be cost-inclusive. That is, each entry would belong to a cost class (low, medium or high), with the cache giving precedence to entries of the lowest classes when it needs to make room for new ones.

Unfortunately though, the operations in question weren’t costly enough to justify the gains one would enjoy from introducing that tiny cost-checking overhead into the cache; soooo… no dice.

And this is why I’m writing this post: if someone out there finds (him|her)self in a similar situation and needs a cache implementation (or 3), you can just use mine and skip all the hassle.

I may have wasted a few hours but I ended up with 3 cache implementations, and at least one of them works, so overall it wasn’t too bad.

When I started development on the Arachni high-performance grid my focus was on the audit part, i.e. find a way to distribute the audit of batches of individual elements across multiple nodes and avoid duplication of effort amongst them. It was a bit tricky to get right but it turned out to be quite do-able and worthwhile.

However, the crawl was done the old-fashioned way: the master instance would crawl the targeted website and, once completed, it would analyze all the pages it found and spread the workload. I always intended to try my hand at something similar for the crawling process but it wasn’t a high priority. As you can see from my last post though, I did sort of figure it out, although I hadn’t had a chance to implement it until now.

This is tricky to do because there’s no way of knowing the workload beforehand; it is basically a freaking labyrinth, and precious information (new paths) can be hidden behind walls and walls of crap.

On the other hand, since when running Arachni in HPG mode you already have a few nodes up and running in the first place, why not utilize them a bit more — even if it turns out to be only slightly faster than a single crawler.

With that in mind, yesterday I started implementing that sort of crawler, and here it is. It exists solely as a toy, a fun experiment, not a stable system. I may, in the future, put some more effort into it, but my main reason for doing this is to explore the idea and eventually port it over to Arachni.

If you find this interesting, want to help out in researching or have any sort of feedback or just want to get in touch don’t hesitate to do so.

This one had been bugging me since I first started work on the HPG. The gain you get from distributed computing is directly related to how efficient the workload distribution is — which makes sense. The crawling process though doesn’t consist of a workload per se but rather looks for the workload. Also, the difficulty of the crawl doesn’t lie in parsing or following the paths but actually finding the paths, this is because new paths are hidden behind old ones and as you progress new paths become sparser and sparser. So the more work you do the less productive you become — these are grim prospects.

So here’s our problem, how the hell do you distribute something when you don’t know:

where it is

what it is

how big it is

Truthfully, in a single-node setup this can be done quite easily and there are lots of ways to go about it. A basic crawling algorithm is actually one of the simplest around; it only has one rule: follow every path but only once — when there are no more new paths, you’re done.

In the most basic of implementations, all you need is a look-up table to keep track of the paths you’ve already followed so that you don’t go over them again. Or you can amend that model by going multi-threaded: put the new paths in a queue, have workers pop paths from the queue, follow them and report the new paths they find back to their coordinator for de-duplication, then put everything that passes filtering back into the queue — and so the story goes…
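The single-machine model above can be sketched in a few lines; the block passed to `crawl` stands in for the hypothetical fetch-and-parse step:

```ruby
require 'set'

# A sketch of the basic algorithm: follow every path, but only once.
# The given block fetches a path and returns the paths it links to.
def crawl( start_path, &extract_paths )
    visited = Set.new       # the look-up table
    queue   = [start_path]  # the work queue

    while (path = queue.shift)
        next if visited.include?( path )   # already followed, skip it
        visited << path
        queue.concat( extract_paths.call( path ) )
    end

    visited.to_a
end
```

With a canned sitemap, `crawl('/') { |p| sitemap[p] }` visits each path exactly once, cycles and all.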

When you do this on a single machine these approaches are good enough (and when you use async requests, the first/simplest approach becomes more efficient than the latter one as well). Thing is, these work well because you have the benefit of a multi-Gbps-bandwidth, close-to-zero-latency pipe: your Front-Side Bus (or whatever computers have nowadays, haven’t kept up with h/w design).

So you can check if something is in the lookup table in something close to 0s, actually in Ruby the look-up time of a Set with 1,000,000 URLs in it is 9.735e-06 (0.000009735) seconds on my machine. Which is effectively no time at all, you spend 0.000009735 seconds waiting for a decision before following each URL — ooooh scary.
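That figure is easy to reproduce, roughly, with a quick benchmark (exact numbers will obviously differ per machine):

```ruby
require 'set'
require 'benchmark'

# Build a Set of 1,000,000 URLs and time a single membership check.
urls = Set.new( (1..1_000_000).map { |i| "http://test.com/path/#{i}" } )

time = Benchmark.realtime { urls.include?( 'http://test.com/path/500000' ) }
puts time   # on the order of a few microseconds
```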

However, when you need to do this over the network these response times take a dive — off a cliff. You see, when distributing work you play with gain/cost ratios; if the ratio is good you go ahead, if not you go back to the drawing board.

Such a naive implementation will send you straight back to the drawing board (multi-colored markers and all) because:

Let’s say every worker communicates via an RPC protocol and the master worker maintains the look-up table and the work Queue. Assuming that an RPC call costs about the same as an HTTP request and that 25% of the paths contained in most pages are identical (nav menu, CSS, JS, images, links to new blog posts etc.), each follow operation will cost, per work unit:

1 RPC call to pop a path from the Queue, plus

25% of the links found by following the path × 1 RPC call — for paths that are common and have already been visited from the get-go

75% of the links found by following the path × (1 RPC call + 1 HTTP request) — for new paths which aren’t in the lookup table and must be visited
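Assuming, say, 20 links per page, the arithmetic works out to something like this (illustrative numbers only):

```ruby
links_per_page = 20.0

# The "actual work": HTTP requests for the new paths.
http_work = 0.75 * links_per_page

# The coordination overhead: pop + already-visited checks + new-path reports.
rpc_calls = 1 + 0.25 * links_per_page + 0.75 * links_per_page

# If an RPC call costs about the same as an HTTP request, the
# coordination overhead exceeds the actual work:
overhead_ratio = rpc_calls / http_work
puts overhead_ratio   # => 1.4
```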

So you’ll be spending most of your time waiting for permission rather than doing actual work and suddenly the cost of doing the work doubles. There has to be a better way…

Annoyingly, it took me a couple of days to figure this out, and it turned out all I needed was a relaxing shower; the answer came to me on its own — gooood answer *pat* *pat*.

In all honesty, I had bits and pieces of the answer from the beginning and I knew that the final algorithm would have to be a composite of models — a piece of producer-consumer there, a bit of policy-enforcer here, sprinkled with some delegation across the board — but the problem was putting them in the right order to form a unified model that would:

Avoid any sort of blocking (no look-ups or waiting for decisions)

Automate load balancing

Prevent crawling redundant URLs (more than one worker following the same path)

Tricky stuff right?

And here’s what I came up with:

1. The master scopes out the place (follows 10 paths or so) and deduces the webapp structure — it will, most certainly, be incomplete but it doesn’t matter, as we just want some seeds to get us going.

2. The master creates a per-directory policy and assigns dirs to workers AND sends that policy to them as well.

3. Workers perform the crawl as usual but also implement that policy for URLs that don’t match their own policy rules, i.e. send URLs that are out of their scope to the appropriate peer and let him handle it — the peer will ignore it if he has already visited it or put it in his queue.

4. If no policy matches a URL then it is sent back to the master; the master creates a new policy (or policies), stores the work in a Queue and then sends an announcement to the workers (“There’s some work up for grabs!”).

5. Busy workers ignore it; idling workers try to pull it and the work is assigned first-come/first-served along with the updated policy.

6. Go to 3.

If at any point a worker becomes idle he sends the paths he has discovered back to the master for storage/further processing/whatever and tries to pull some new work.

Also, the master will be a worker too — why waste a node, right?
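The routing decision each worker makes boils down to something like this toy sketch, where `Policy`, the prefixes and the return symbols are all illustrative rather than Arachni’s actual API:

```ruby
# Each policy maps a URL prefix (directories, as in the example above)
# to the worker responsible for it.
Policy = Struct.new( :prefix, :worker )

def route( url, policies, myself )
    policy = policies.find { |p| url.start_with?( p.prefix ) }

    if !policy
        :send_to_master      # no rule matches, the master makes a new policy
    elsif policy.worker == myself
        :crawl_locally       # our own scope, goes into the local queue
    else
        :forward_to_peer     # delegate to the responsible peer and forget about it
    end
end
```

Because the decision is purely local, a worker never blocks waiting for a central look-up.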

Let’s go back to our list of requirements and see how we did:

Avoid any sort of blocking (no look-ups or waiting for decisions) — ✓ Instead of waiting for permission to do something, we delegate to the appropriate authority and forget about it

Automate load balancing — ✓ Workers pull work when they are good and ready

Prevent crawling redundant URLs (no more than one worker following the same path) — ✓ A local look-up table and item #1 take care of that quite nicely

In addition, the policy can use any sort of criteria (directories were just an example), which means that we can achieve very granular distribution if we are a bit clever with it.

I think that this serves as a decent starting point, there’s still the issue of how to efficiently group the new URLs that are fed back to the master (because they don’t match any of the initial policy rules) but I’ll have to see this working under real world conditions in order to get a better feel for it first. I’ve got some stuff in mind, we’ll see…

Please do comment if you have a suggestion or have spotted a fault somewhere.

Yes, it’s true… As of now the code in the experimental branch has been converted to use the Apache License Version 2.0.

If you’re interested in why this happened, here’s the deal: there are currently a few companies that use Arachni internally and a few others that actually provide SaaS security services using Arachni’s distributed features. Thing is though, a lot of companies can’t touch GPL code (not that I blame them), which is good for neither them nor Arachni, as neither of us gets what we want. It makes sense, surely, but it was about a month ago, while reading the comments on a Slashdot article, that it really clicked. Lots of people were agreeing on the same subject: the money-men don’t like the GPL, which kind of sucked for the project.

At this point I started researching alternative licenses and started asking around a bit.

As fate would have it — although more due to the increasing userbase, I guess — a few people told me that the GPL was a deal-breaker for them, and I even had one guy tell me that he couldn’t include Arachni in his book because of it (I’ll spare him his blushes and not say his name).

Now that the project is gaining some momentum these technicalities become more and more important.

So after a bit of research I settled on the Apache License 2.0, mainly because of its trademark and patent grants (because who the hell wants to deal with that bureaucratic crap?) and the requirement to redistribute a visible copy of the original work’s NOTICE file (if it includes one), which is nice since hard work must be properly credited. You know, you can use my work for free (and I hope you do) but mention that “this product contains some code from that Arachni thingy written by this bloke with the funny name”.

So that’s the reason, I’m hoping that a more permissive license will increase adoption and make everybody’s life easier.

A couple of days ago I proudly released v0.4 and, as luck would have it, I later had to swallow some of that pride due to a couple of intermittent bugs that I hadn’t spotted. Well, worry no more as I’m writing this post to announce a rush hotfix version of Arachni, v0.4.0.2.

If you installed the previous version via “gem install” or have downloaded the previous Cygwin package then all you need to do is issue: gem install arachni

Ruby’s XMLRPC has been ditched (as initially discussed in these two [1, 2] posts) in favor of Arachni-RPC. Arachni-RPC is lightweight, simple and fast which makes it ideal for large Grid deployments and makes it easy for 3rd parties to interoperate with Arachni’s servers.

Notice: If you were using the old XMLRPC interface please update your code to use the new RPC API.

I’ve been talking about this one so much that I’ve actually grown a bit sick of it — joking aside though this is one of Arachni’s most important features. It allows you to connect multiple nodes into a Grid and use them to perform lightning-fast scans.

This is due to the way Arachni distributes the workload, which is finely grained down to individual page elements to ensure fair and optimal distribution; because workload distribution is so fluid it effectively becomes a sort of bandwidth and CPU aggregation.

To put this in simple(-istic) terms: If you have 2 Amazon instances and you need to scan one site, by utilising the HPG you’ll be able to cut the scan time down to approximately half of what it would take by using a single node (plus the initial crawl time).

And if you have a huge site you can use 50 nodes and so the story goes…

This feature was an imaginary, almost unattainable, milestone back when I added the initial client/server implementation and I didn’t really think that I’d ever be able to make it happen. Luckily, I was wrong and I’m proud to present you with the first Open Source High Performance Grid web application security scanner! (By the way, does anyone know of a commercial scanner that can do this?)

Notice: With the WebUI’s updated AutoDeploy add-on you’ll be able to go into World domination mode by performing point-and-click Grid deployments!

Another notice: Use responsibly, don’t DDoS people.

Yet another notice: It’s still considered experimental so let me know if you come across a bug. 😉

The WebUI now contains a few context-sensitive help dialogs to help out the newcomers and it has been updated to use the Thin webserver to send responses asynchronously in order to increase performance and feel “snappier”. It also supports HTTP basic auth just in case you want some simple password protection and has been updated to provide access to the brand new HPG goodies.

Spider improvements

There was a bug with redirections that prevented the spider from achieving optimal coverage which has now been resolved. More than that, the scope of the crawl can now be either extended or restricted by supplying newline-separated lists of URLs which should help you import 3rd party sitemaps.

Plugins

The plugin API has been extended in order to allow plugins to let the framework know if they can be distributed across HPG Instances and, if so, how to merge their results for the final report.

Another big (although invisible to the end-user) change is the conversion of all meta-modules to full-fledged plugins to simplify management and Grid distribution.

And these new plugins have been added:

ReScan — It uses the AFR report of a previous scan to extract the sitemap in order to avoid a redundant crawl.

BeepNotify — Beeps when the scan finishes.

LibNotify — Uses the libnotify library to send notifications for each discovered issue and a summary at the end of the scan.

EmailNotify — Sends a notification (and optionally a report) over SMTP at the end of the scan.

Manual verification — Flags issues that require manual verification as untrusted in order to improve the signal-to-noise ratio.

Resolver — Resolves vulnerable hostnames to IP addresses.

Modules

I’ve got both good and bad news for this… In an attempt to clean up and optimise pattern matching in v0.3, I inadvertently broke some aspects of it, which crippled the XSS (xss), SQL injection (sqli) and Path Traversal (path_traversal) modules — I sincerely apologise, mea culpa.

The good news is that I’ve made things right, cleaned up the API and the existing modules and improved their accuracy.

Reports

The HTML report has waved goodbye to Highcharts due to licensing reasons and now uses jqPlot for all its charting and graphing needs. I’ve also removed the “report false-positive” button, since part of that process required RSA encryption which, for some reason, caused segfaults on Mac OS X. Good news is that the HTML reports will be significantly smaller in size from now on.

Moreover, the following new report formats have been added:

JSON — Exports the audit results as a JSON serialized Hash.

Marshal — Exports the audit results as a Marshal serialized Hash.

YAML — Exports the audit results as a YAML serialized Hash.

Cygwin package for Windows

About time indeed: Windows users can now enjoy Arachni’s features — albeit via a preconfigured Cygwin environment. The important point is that you no longer have to go through the hassle of installing Arachni under MinGW or Cygwin yourselves, or use a VM and what have you… Simply download and run the self-extracting archive, double-click the “Cygwin” batch file and lo and behold: you’ve got a bash shell ready to execute Arachni’s scripts.

Unfortunately, there’s a performance penalty involved when running Arachni in Cygwin but until I port it to run natively on Windows it’ll have to do.

Before I forget, the Wiki has been cleaned up and brought up to date so if you need to go through the documentation that should be your first stop.

I’ve caught a bug it seems and because I can’t just sit on my ass all day I figured why not play around with my latest toy.

I’ve updated the code to make the process more streamlined and allow for fuzzing (or at least altering) some possible input vectors. Things will be very very simple for now as I’m merely trying to demo that with appropriate effort invested this can become a viable solution — eventually.

To the point, I’ll showcase a DOM XSS vulnerability that will take place purely on the client-side. Unfortunately, there are a lot of DOM interfaces/vectors that I haven’t yet implemented so let’s stick to one I have — navigator.userAgent.

```ruby
require_relative 'init'

html = <<EOHTML
<html>
    <head>
        <title>My title!</title>
    </head>
    <body>
        <div>
            <script type="text/javascript">
                document.write(navigator.userAgent);
            </script>
        </div>
    </body>
</html>
EOHTML

#
# The second param sets 'dont_eval_js' to true.
#
# We want to do that ourselves later on, after we've prepared the vectors
# (navigator.userAgent in this case).
#
window = DOM::Window.new( html, true )

#
# We'll inject this fictional tag and look for it in the DOM structure later on.
#
# If found, then we have an XSS vuln.
#
seed_tag = 'myinjectedtag'

# our XSS vector
window.navigator.userAgent = "<#{seed_tag}>blah blah blah</#{seed_tag}>"

# this will show the HTML as is (Ref. #1)
# puts window.document.to_html
# puts '-' * 80

# execute the JS
window.instance_eval { exec_js! }

# this one will show the updated HTML, i.e. including our tag (Ref. #2)
# puts window.document.to_html

# look for the tag in the DOM structure
if window.document.getElementsByTagName( seed_tag )[0]
    puts 'Vulnerable to XSS!'
end
```

Ref. #1 This is what the code looks like at this point:

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"><html><head><meta http-equiv="Content-Type" content="text/html; charset=US-ASCII"><title>My title!</title></head><body><div><script type="text/javascript">document.write(navigator.userAgent);</script></div></body></html>
```

Pretty much what we passed…

Ref. #2 And now that the JS has been executed:

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"><html><head><meta http-equiv="Content-Type" content="text/html; charset=US-ASCII"><title>My title!</title></head><body><div><script type="text/javascript">document.write(navigator.userAgent);</script></div></body><myinjectedtag>blah blah blah</myinjectedtag></html>
```

Yes I know, document.write() should have written inside the parent of the script tag, the div in this case. I can’t be bothered to take care of this right now, this is just a prototype…a prototype of a prototype of the prototype actually.

End result: since we can see that the injection was clearly successful, the inevitable message shall appear: Vulnerable to XSS!

Conclusion Yeah it’s really basic and simplistic and not a big deal but it’s fun to see things working — barely but still, heh.

One of the things everyone is taking for granted nowadays for every browser and website is decent support for AJAX. Naturally, scanner devs have been trying to find a decent way to automatically audit that side of the fence or at least provide decent coverage for JS-heavy webapps.

Thing is though… this is a bitch to get right. And never mind getting it right, it’s hard enough getting the damn thing to work to begin with.

There are 3 things that need to be integrated in order to achieve that sort of functionality:

DOM

The static parts of the DOM can easily be built on top of tested and proven XML parsers like libxml. The DOM has tricky parts too though, which are where the money is, like timers and events.

It’s alright though, a little bit of smart thread scheduling and clean collections of callbacks will sustain you at the beginning.
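For instance, DOM timers can be faked with exactly that: a collection of callbacks plus a scheduler that fires whatever has come due. A toy sketch, not any particular engine’s design:

```ruby
# setTimeout-style timers as a plain collection of callbacks, drained
# by whichever thread calls #tick.
class TimerPool
    Timer = Struct.new( :due_at, :callback )

    def initialize
        @timers = []
        @mutex  = Mutex.new
    end

    # setTimeout-style registration; delay is in milliseconds.
    def set_timeout( delay_ms, &block )
        @mutex.synchronize do
            @timers << Timer.new( Time.now + delay_ms / 1000.0, block )
        end
    end

    # Fire every timer that has come due and return how many fired.
    def tick
        due = @mutex.synchronize do
            ready, pending = @timers.partition { |t| t.due_at <= Time.now }
            @timers = pending
            ready
        end
        due.each { |t| t.callback.call }
        due.size
    end
end
```

A scheduler thread calling `tick` in a loop is enough to get basic `setTimeout` semantics going.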

JS

JS integration difficulty depends on your language of choice. Thankfully, more and more bindings for several different JS engines are being released so you can have your pick. The interfaces can be a bit dodgy though at times which can defeat the whole purpose.

AJAX

AJAX functionality will have to fall on your shoulders but if you managed to get the first 2 parts working this won’t be much of a challenge. You simply write an AJAX API in your language of choice and make that interface available to the JS code.

Where I’m going with this…

It’s certainly possible to get these working together (without being a multi-million dollar corporation even) but it’ll be a lot of work. Which is the reason there’s no open source scanner that supports AJAX or even basic JS scripting.

Arachni is no exception, not to mention that due to the young age of the system there had been far more important and basic things to be worked out first.

Luckily, a lot of things have changed in very little time. The project still has a few bugs but it has been stable enough for a few businesses to build some of their infrastructure on it. And v0.4 is pretty much ready, which takes care of another big feature I could not wait to implement — the High Performance Grid.

Next stop: AJAX

Some time ago I was contacted by a CompSci student who had chosen to add AJAX support to Arachni as his final-year project, nice guy and seems motivated so I’m pretty sure that this is gonna happen.

I’ve managed to make this work with some simple stuff; jQuery loads without errors and kind of works (I haven’t had much time to test it). I haven’t had time to implement timers and events yet but they’re coming… I don’t want to finish this thing on my own though; the other guy will need to work on it for his project.

Point is…AJAX support is coming to Arachni; it will of course take time but it’s going to happen.

As promised, part 5. Not that anyone’s reading this crap, but once I’m done with the series I’ll be able to gather the articles into a nice developer’s guide, so I might as well keep going.

As always, keep your installation up to date with the experimental branch before continuing. These articles have forced me to see Arachni from a completely different perspective, and so I keep improving the API to make it more developer-friendly.

This time we’ll focus on auditing individual elements and also work on a per page scope.

Let me paint you a picture:

You have a Rails (or some other such framework) web application. You need to audit it in a consistent manner. You wish you could simply add security tests to your existing test suite next to your units/functional/integration/etc. tests. Your webapp framework already keeps track of pretty much all inputs (and if not you can override helper methods like link_to and form_for to keep track of them). The only thing missing is a system to which you can feed that data and audit those inputs.

You see where I’m going with this, right?

The Arachni framework can easily handle this in a number of ways, some of which I’ll demonstrate here.
