Dark Net Markets (DNM) are online markets typically hosted as Tor hidden services providing escrow services between buyers & sellers transacting in Bitcoin or other cryptocoins, usually for drugs or other illegal/regulated goods; the most famous DNM was Silk Road 1, which pioneered the business model in 2011. From 2013-2015, I scraped/mirrored on a weekly or daily basis all existing English-language DNMs as part of my research into their usage, lifetimes/characteristics, & legal riskiness; these scrapes covered vendor pages, feedback, images, etc. In addition, I made or obtained copies of as many other datasets & documents related to the DNMs as I could. This uniquely comprehensive collection is now publicly released as a 50GB (~1.6TB uncompressed) collection covering 89 DNMs & 37+ related forums, representing <4,438 mirrors, and is available for any research. This page documents the download, contents, interpretation, and technical methods behind the scrapes.

I have been involved in DNMs since June 2011 when Adrian Chen published his famous Gawker article proving that Silk Road 1 was, contrary to my assumption when it was announced in January/February 2011, not a scam and had become a functional drug market, a new kind dubbed dark net markets (DNM); fascinated, I signed up, made my first order, and began documenting how to use SR1 and then a few months later, began documenting the first known SR1-linked arrests. Monitoring DNMs was easy because SR1 was overwhelmingly dominant and BlackMarket Reloaded was a distant second-place market, with a few irrelevancies like Deepbay or Sheep and then the flashy Atlantis.

This idyllic period ended with the raid on SR1 in October 2013, which ushered in a new age of chaos in which centralized markets battled for dominance, the would-be successor Silk Road 2 was crippled by arrests and turned into a ghost-ship carrying scammers, and the multisig breakthrough went begging. The tumult made it clear to me that no market or forum could be counted on to last as long as SR1, and research into the DNM communities and markets, or even simply the memory of their history, was threatened by bitrot: already in November 2013 I was seeing pervasive myths spread throughout the media - that SR1 had $1 billion in sales, that you could buy child pornography or hitmen services on it, that there were multiple Dread Pirate Roberts - and other dangerous beliefs in the community (that use of PGP was paranoia & unnecessary, markets could be trusted not to exit-scam, that FE was not a recipe for disaster, that SR2 was not infiltrated despite the staff arrests & even media coverage of a SR1 mole, that guns & poison sellers were not extraordinarily risky to purchase from, that buyers were never arrested).

And so, starting with the SR1 forums, which had not been taken down by the raid (to help the mole? I wondered at the time), I began scraping all the new markets, doing so weekly and sometimes daily starting in December 2013. These are the results.

the DNMs could not exist without volunteers and nonprofits paying for the bandwidth used by the Tor network; these scrapes collectively represent terabytes of consumed bandwidth. If you would like to help keep Tor servers running, you can donate to Torservers.net or the Tor Project itself.

collating and creating these scrapes has absorbed an enormous amount of my time & energy due to the need to solve CAPTCHAs, launch crawls on a daily or weekly basis, debug subtle glitches, work around site defenses, periodically archive scrapes to free up disk space, provide hosting for some publicly released scrapes, etc. (my arbtt time-logs suggest >200 hours since 2013); I subsist primarily on donations, and I too accept Bitcoin: 1GWERNi49LgEb5LpvxxGFSuVYo2K3BDRdo

There are ~89 markets, >37 forums and ~5 other sites, representing <4,438 mirrors of >43,596,420 files in ~49.4GB of 163 compressed files, unpacking to >1548GB; the largest single archive decompresses to <250GB. (It can be burned to 3 25GB BDs or 2 50GB BDs; if the former, it may be worth generating additional FEC.)

These archives are xz-compressed tarballs (optimized with the sort-key trick); typically each subfolder is a single date-stamped (YYYY-MM-DD) crawl using wget, with the default directory/file layout. The majority of the content is HTML, CSS, and images (typically photos of item listings); images are space-intensive & omitted from many crawls, but I feel that images are useful to allow browsing the markets as they were and may be highly valuable in their own right as research material, so I tried to collect images where applicable. (Child porn is not a concern as all DNMs & DNM forums ban that content.) Archives sourced from other people follow their own particular conventions. Mac & Windows users may be able to uncompress using their built-in OS archiver, 7zip, Stuffit, or WinRAR; the PAR2 error-checking can be done using par2, QuickPar, Par Buddy, MultiPar or others depending on one’s OS.

If you don’t want to uncompress all of a particular archive, as they can be very large, you can try extracting specific files using archiver-specific options; for example, a SR2F command targeting a particular old forum thread:
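A minimal sketch using GNU tar's --wildcards member matching (the tarball name and the thread's internal path/topic number here are assumptions and will vary by crawl):

```bash
# Extract only files whose paths match a particular SMF thread URL pattern,
# without unpacking the rest of the (much larger) archive:
tar xJf silkroad2-forums.tar.xz --wildcards '*index.php?topic=123*'
```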

Scrapes can be difficult to analyze. They are large, complicated, redundant, and highly error-prone. They cannot be taken at face-value.

No matter how much work one puts into it, one will never get an exact snapshot of a market at a particular instant: listings will go up or down as one crawls, vendors will be banned and their entire profile & listings & all feedback vanish instantly, Tor connection errors will cause a nontrivial % of page requests to fail, the site itself will go down (Agora especially), and Internet connections are imperfect. Scrapes can get bogged down in a backwater of irrelevant pages, spend all their time downloading a morass of on-demand generated pages, or have the user login expire or be banned by site administrators, etc. If a page is present in a scrape, then it probably existed at some point; but if a page is not present, then it may never have existed, or it may have existed but failed to download for any of a myriad of reasons. At best, a scrape is a lower bound on how much was there.

So any analysis must take seriously the incompleteness of each crawl and the fact that there is, and always will be, a lot of missing data, and do things like focus on what can be inferred from random sampling or explicitly model incompleteness using the markets' own per-category listing counts. (For example, if your download of a market contains 1.3k items but the categories' claimed listings sum to 13k items, your download is probably highly incomplete & biased towards certain categories as well.) There are many subtle biases: for example, there will be upward biases in markets' average review ratings because sellers who turn out to be scammers will disappear from the market scrapes when they are banned, and few of their customers will go back and revise their ratings; similarly, if scammers are concentrated in particular categories, then using a single snapshot will lead to biased results as the scammers have already been removed, while uncontroversial sellers last a lot longer (which might lead to, say, e-book sellers seeming to have many more sales than expected).
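As a toy numerical illustration of that category-count check (all the numbers & category names here are hypothetical):

```r
# Compare item pages actually present in a crawl against the listing counts
# claimed by the market's own category pages:
downloaded.items    <- 1300
claimed.by.category <- c(Cannabis=6000, Stimulants=4000, Ebooks=3000)
completeness        <- downloaded.items / sum(claimed.by.category)
round(completeness, 2)
# 0.1 -- the crawl probably captured only ~10% of listings, and per-category
# ratios would indicate which categories are over- or under-represented
```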

The contents cannot be taken at face-value either. Some vendors engage in review-stuffing using shills. Metadata like categories can be wrong, manipulated, or misleading (a category labeled 'Musical instruments' may contain listings for prescription drugs such as beta blockers, or modafinil & Adderall may be listed under both a 'Prescription drugs' and a 'Stimulants' category). Many things said on forums are lies or bluffing or scams. Market operators may deliberately deceive users (Ross Ulbricht claiming to have sold SR1, the SR2 team engaging in psyops) or conceal information (the hacks of SR1; the second SR2 hack) or attack their users (Sheep Marketplace and Pandora). Different markets have different characteristics: the commission rate on Pandora was unilaterally raised after it was hacked (causing sales volume to fall); SR2 was a notorious scammer haven due to inactive or overwhelmed staff and the lack of a working escrow mechanism; etc. There is no substitute here for domain knowledge.

Knowing this, analyses should have some strategy to deal with missingness. There are a couple of possible tacks:

attempt to exploit ground truths to explicitly model and cope with varying degrees of missingness; there are a number of ground-truths available in the form of leaked seller data (screenshots & data), databases (leaked, hacked), official statements (eg the FBI’s quoted numbers about Silk Road 1’s total sales, number of accounts, number of transactions, etc)

assume missing-at-random and use analyses insensitive to that, focusing on things like ratios

work with the data as is, writing results such that the biases and lower-bounds are explicit & emphasized

The September SR1 crawl is processed data stored in SPSS .sav data files. There are various libraries available for reading this format (eg in R, the foreign package: library(foreign); sellers <- read.spss("Sellers -- 2013-09-15.sav", to.data.frame=TRUE)).

A crawl of AlphaBay over 26-28 January 2017, with data extraction (using a Python script), provided by Michael McKenna & Sigi Goode. They also tried to crawl AB's historical inactive listings in addition to the usual live/active listings, reaching many of them.

DNStats is a service which periodically pings hidden services and records the response & latency, generating graphs of uptime and allowing users to see how long a market has been down and if an error is likely to be transient. The owner has provided me with three SQL exports of the ping database up to 25 March 2017; this database could be useful for comparing downtime across markets, examining the effect of DoS attacks, or regressing downtime against things like the Bitcoin exchange rate (presumably if the markets still drive more than a trivial amount of the Bitcoin economy, downtime of the largest markets or market deaths should predict falls in the exchange rate).

For example, to graph an average of site uptime per day and highlight as an exogenous event Operation Onymous, the R code would go like this:
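A minimal sketch, assuming the SQL dump has been imported into a data frame pings with a timestamp column time and a 0/1 response column up (the actual column names in the DNStats export may differ):

```r
library(ggplot2)

pings$Date <- as.Date(pings$time)
# mean fraction of pings answered per day, averaged over all monitored sites:
daily <- aggregate(up ~ Date, data=pings, FUN=mean)

onymous <- as.Date("2014-11-06") # Operation Onymous raids, early November 2014
ggplot(daily, aes(x=Date, y=up)) +
    geom_line() +
    geom_vline(xintercept=as.numeric(onymous), linetype="dashed", colour="red") +
    ylab("Mean uptime (fraction of pings answered)")
```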

Grams (subreddit) is a service primarily specializing in searching market listings; they can pull listings from API exports provided by markets (Evolution, Cloud9, Middle Earth, Bungee54, Outlaw), or they may use their own custom crawls (the rest). They have generously given me near-daily CSV exports of the current state of listings in their search engine, ranging from 2014-06-09 to 2015-07-12 for the first archive and 2015-07-14 to 2016-04-17 for the second. Grams coverage:

first archive:

1776

Abraxas

ADM

Agora

Alpaca

AlphaBay

BlackBank

Bungee54

Cloud9

Evolution

Haven

Middle Earth

NK

Outlaw

Oxygen

Pandora

Silkkitie

Silk Road 2

TOM

TPM

second archive:

Abraxas

Agora

AlphaBay

Dream Market

Hansa

Middle Earth

Nucleus

Oasis

Oxygen

RealDeal

Silkkitie

Tochka

Valhalla

The Grams archive has three virtues:

while it doesn’t have any raw data, the CSVs are easy to work with. For example, to read in all the Grams SR2 crawls, then count & graph total listings by day in R:
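A sketch, assuming each daily export lives in a date-named directory containing a per-market CSV named something like SilkRoad2.csv (the actual layout & file naming in the archive may differ):

```r
library(ggplot2)

files <- Sys.glob("*/SilkRoad2.csv") # one CSV per dated export directory
sr2 <- data.frame(Date  = as.Date(basename(dirname(files))),
                  Count = sapply(files, function(f) nrow(read.csv(f))))
qplot(Date, Count, data=sr2, geom="line") +
    ggtitle("Grams exports: SR2 listing counts by day")
```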

Other included datasets which are in structured formats that may be easier to deal with for prototyping: the Aldridge & Décary-Hétu 2013 SR1 crawl; the SR1 sales spreadsheet (original is a PDF but I’ve created a usable CSV of it); the BMR feedback dumps are in SQL, as is DNStats and Christin et al 2013’s public data (but note the last is so heavily redacted & anonymized as to support few analyses); and Daryl Lau’s SR2 work may be in a structured format.

the Grams crawls were conducted independently of the other crawls in this collection, so they can be used to cross-check each other

the market data sourced from the APIs can be considered close to 100% complete & accurate, which is rare

The main drawbacks are:

the largest markets can be split across multiple CSVs (eg EVO.csv & EVO2.csv), complicating reading the data in somewhat

the export each time is of the current listings, which means that different days can repeat the same identical crawl data if there was not a successful crawl by Grams in between

exports are not available for every day, and some gaps are large. The 2015-01-09 to 2015-02-21 gap is due to a broken Grams export during this period before I noticed the problem and requested it be fixed; other gaps may be due to transient errors with the cron job.

Diabolus/Crypto Market were two markets run by the same team, apparently off the same server. Crypto Market had an information leak where any attempt to log in as an existing user revealed the status bar of that Diabolus account, listing their current number of orders, number of PMs, and Bitcoin balance, hence giving access to ground-truth estimates of market turnover and revenue. Using my Diabolus crawls to source a list of vendors, I set up a script to automatically download the leaks daily until the hole was finally closed.

Upon launch, the market Simply Bear made the amateur mistake of failing to disable the default Apache /server-status page, which shows information about the server such as what HTML pages are being browsed and the connecting IPs. Since it was a Tor hidden service, most IPs were localhost connections from the Tor daemon, but I noticed the administrator was logging in from a local IP (the 192.168.1.x range) and, curious whether I could de-anonymize him, I set up a script to poll /server-status every minute or so, increasing the interval as time passed. After two or three days, no naked IPs had appeared, so I killed the script.

TheRealDeal was reported on Reddit in late June 2015 to have an information leak where any logged-in user could browse around a sixth of the order-details pages (which were in a predictable incrementing whole-number format) of all users without any additional authentication, yielding the Bitcoin amount, listing, and all Bitcoin multisig addresses for that order. TRD denied that this was any kind of problem, so I collected order information for about a week.

As part of my interest in the stimulant modafinil, I have been collecting, by hand, monthly scrapes of all modafinil/armodafinil/adrafinil listings across the DNMs; the modafinil archive contains the saved files in MHT or MAFF format from 2013-05-28 to 2015-07-03, sampled from a range of markets.

A crowdfunding site for child pornography, Pedofunding, was launched in November 2014. It seemed like possibly the birth of a new DNM business model so I set up a logged-out scrape to archive its beginnings (sans any images), collecting 20 scrapes from 2014-11-13 to 2014-12-02, after which it shut down, apparently having found no traction. (A followup in 2015 tried to use some sort of Dash/Darkcoin mining model; it’s unclear why they don’t simply use Darkleaks.)

This archive of the Silk Road 1 forums is composed of 3 parts, all created during October 2013 after Silk Road 1 was shut down but before the Silk Road 1 forums went offline some months later:

StExo’s archive, released anonymously

This excludes the Vendor Roundtable (VRT) subforum, and is believed to have been censored in various respects such as removing many of StExo’s own posts.

Moustache’s archived pages

Unknown source, may be based on StExo archives

consolidated wget spider

After the SR1 bust and StExo's archiving, I began mirroring the SR1F with wget, logged in as a vendor with access to the Vendor Roundtable; unfortunately, due to my inexperience with the Simple Machines forum software, I did not realize that a wget crawl could revoke my own access to subforums by following the access-revocation link, and failed to blacklist that URL. Hence the VRT was incompletely archived. I combined my various archives into a single version.

Simultaneously, qwertyoruiop was archiving the SR1F with a regular user account and a custom Node.js script. I combined his spider with my version to produce a final version with reasonable coverage of the forums (perhaps three-quarters of what was left after everyone began deleting & censoring their past posts).

In 2015, a pseudonym claiming to be a SR2 programmer offered for sale, using the Darkleaks protocol, what he claimed was the username/password dump and SR2 source code. The Darkleaks protocol requires providing encrypted data and then the revelation of a random fraction of it. This archive is all the encrypted data, decryption keys, and revealed usernames I was able to collate. (The auction did not seem to go well as the revealed data was not a compelling proof, and it’s unclear whether he was the genuine article.)

Integrity of the archive can be verified using PAR2 (par2verify ecc.par2). Up to 10% of file damage/loss can be repaired using the supplied PAR2 FEC files and par2repair; see the man page for details.
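For example, with the command-line par2 tools (par2cmdline):

```bash
par2verify ecc.par2   # check the archive files against the parity data
par2repair ecc.par2   # attempt reconstruction if up to ~10% of the data is damaged or missing
```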

when a new market opens, I learn of it typically from Reddit or The Hub, and browse to it in Firefox configured to proxy through 127.0.0.1:8123 (Polipo)

create a new account

The username/password are not particularly important but using a password manager to create & store strong passwords for throwaway accounts has the advantage of making it easier to authenticate any hacks or database dumps later. (Given the poor security record of many markets, it should go without saying that you should not use your own username or any password which is used anywhere else.)

I locate various action URLs: login, logout, report vendor, settings, place order, send message, and add the URL prefixes (sometimes they need to be regexps) into /etc/privoxy/user.action; Privoxy, a filtering proxy running on 127.0.0.1:8118, will then block any attempt to download URLs which match those prefixes/regexps

A good blacklist is critical to avoid logging oneself out and immediately ending the crawl, but it's also important to avoid triggering any on-site actions which might cause your account to be banned or prompt the operators to put in anti-crawl measures you may have a hard time working around. A blacklist is also invaluable for avoiding downloading superfluous pages like the same category page sorted 15 different ways; Tor is high latency and you cannot afford to waste a request on redundant or meaningless pages, of which there can be many. Simple Machines Forums are particularly dangerous in this regard, requiring at least 39 URLs to be blacklisted for an efficient crawl, and implementing many actions as simple HTTP links that a crawler will follow (for example, if you have managed to get access to a private subforum on an SMF, you will delete your access to it if you simply turn a crawler like wget or HTTrack loose, which I learned the hard way).
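For illustration, a hypothetical /etc/privoxy/user.action fragment in this vein; the onion hostname & paths are placeholders, not any real market's URLs:

```
{ +block{Destructive or session-ending market actions} }
.examplemarket.onion/logout
.examplemarket.onion/report
.examplemarket.onion/settings
.examplemarket.onion/order/place
.examplemarket.onion/message/send
```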

where possible, configure the site to simplify crawling: request as many listings as possible on each page, hide clutter, disable any options which might get in the way, etc.

Forums often default to showing 20 posts on a page, but options might let you show 100; if you set it to display as much as possible (maximum number of posts per page, subforums listed, etc), the crawls will be faster, save disk space, and be more reliable because the crawl is less likely to suffer from downtime. So it is a good idea to go into the SMF forum settings and customize it for your account.

the fgrep invocation minimizes the size of the local cookies.txt and helps prevent accidental release of a full cookies.txt while packing up archives and sharing them with other people
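A sketch of the sort of filtering meant here (the browser cookie-export filename and onion hostname are placeholders):

```bash
# Keep only the target market's cookies out of the browser's full cookie export:
fgrep 'examplemarket.onion' cookies-firefox-export.txt > cookies.txt
```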

wget (an illustrative full invocation combining the options discussed below is sketched after this list):

we direct it to download only through Privoxy in order to benefit from the blacklist. Warning: wget has a blacklist option but it does not work, because it is implemented in a bizarre fashion where it downloads the blacklisted URL (!) and then deletes it; this is a known >12-year-old bug in wget. For other crawlers, this behavior should be double-checked so you don’t wind up inadvertently logging yourself out of a market and downloading gigabytes of worthless front pages.

we throw in a number of options to encourage wget to ignore connection failures and retry; hidden servers are slow and unreliable

we load the cookies file with the authentication for the market, and in particular, we need --keep-session-cookies to keep around all cookies a market might give us, particularly the ones which change on each page load.

--max-redirect=1 helps deal with a nasty market behavior where when one’s cookie has expired, they then quietly redirect, without errors or warnings, all subsequent page requests to a login page. Of course, the login page should also be in the blacklist as well, but this is extra insurance and can save one round-trip’s worth of time, which will add up. (This isn’t always a cure, since a market may serve a requested page without any redirects or error codes but the content will be a transcluded login page; this apparently happened with some of my crawls such as Black Bank Market. There’s not much that can be done about this except some sort of post-download regexp check or a similar post-processing step.)

some markets seem to snoop on the Referer header of an HTTP request, which specifies where you came from; setting it to the market's own front page seems to help

the user-agent, as mentioned, should exactly match however one logged in, as some markets record that and block accesses if the user-agent does not match exactly. Putting the current user-agent into a centralized text file helps avoid scripts getting out of date and specifying an old user-agent

logging of requests and particularly errors is important; --server-response prints out headers, and --append-output stores them to a log file. Most crawlers do not keep an error log around, but this is necessary to allow investigation of incompleteness and to observe where errors in a crawl started (perhaps you missed blacklisting a page); for example, "Evaluating drug trafficking on the Tor Network: Silk Road 2, the sequel" (Dolliver 2015) failed to log errors in its few HTTrack crawls of SR2, and so wound up with a grossly incomplete crawl which led to nonsense conclusions like 1-2% of SR2's sales being drugs. (I speculate the HTTrack crawl was stuck in the ebooks section, which was always clogged with spam, and then SR2 went down for an hour or two, leading to HTTrack's default behavior of quickly erroring out and finishing the crawl; but the lack of logging means we may never know what went wrong.)
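Putting those options together, an illustrative reconstruction of the kind of invocation used (the onion URL, filenames, and retry settings here are assumptions, not the literal commands from my scripts):

```bash
http_proxy="http://127.0.0.1:8118" \
wget --mirror --no-parent \
     --tries=10 --retry-connrefused --waitretry=30 --timeout=60 \
     --load-cookies=cookies.txt --keep-session-cookies \
     --max-redirect=1 \
     --referer='http://examplemarket.onion/' \
     --user-agent="$(cat ~/user-agent.txt)" \
     --server-response --append-output=wget.log \
     'http://examplemarket.onion/'
```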

once the wget crawl is done, we name the mirror after whatever day it terminated on, store the log inside the mirror, clean up the probably-now-expired cookies, and perhaps check for any unusual problems.

This method will permit somewhere around 18 simultaneous crawls of different DNMs or forums before you begin to risk Privoxy throwing errors about too many connections. A Privoxy bug may also lead to huge logs being stored on each request. Between these two issues, I've found it helpful to have a daily cron job running rm -rf /var/log/privoxy/*; /etc/init.d/privoxy restart so as to keep the logfile mess under control and periodically start a fresh Privoxy.
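For example, as a hypothetical crontab entry (exact log paths & init commands depend on one's distribution):

```bash
@daily rm -rf /var/log/privoxy/*; /etc/init.d/privoxy restart
```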

Crawls can be quickly checked by comparing the downloaded sizes to past downloads; markets typically do not grow or shrink more than 10% in a week, and forums’ downloaded size should monotonically increase. (Incidentally, that implies that it’s more important to archive markets than forums.) If the crawls are no longer working, one can check for problems:

is your user-agent no longer in sync?

does the crawl error out at a specific page?

do the headers shown by wget match the headers you see in a regular browser using Live HTTP Headers?

has the target URL been renamed?

do the URLs in the blacklist match the URLs of the site, and did you log in at the right URL? (for example, a blacklist of www.abraxas…onion is different from abraxas…onion; and if you logged in at an onion address with the www. prefix, the cookie may be invalid on the prefix-free onion)

did the server simply go down for a few hours while crawling? Then you can simply restart and merge the crawls.

has your account been banned? If the signup process is particularly easy, it may be simplest to just register a fresh account each time.

Despite all this, not all markets can be crawled or present other difficulties:

Blue Sky Market did something with HTTP headers which defeated all my attempts to crawl it; it rejected all my wget attempts at the first request, before anything even downloaded, but I was never able to figure out exactly how the wget HTTP headers differed in any respect from the (working) Firefox requests

Mr Nice Guy 2 breaks the HTTP standard by returning all pages gzip-encoded, whether or not the client says it can accept gzip-encoded HTML; as it happens, wget cannot read gzip-encoded HTML and parse the page for additional URLs to download, and so mirroring breaks

AlphaBay, during the DoS attacks of mid-2015, began doing something odd with its HTTP responses, which makes Polipo error out; one must browse AlphaBay after switching to Privoxy; Poseidon also did something similar for a time

Middle Earth rate-limits crawls per session, limiting how much can be downloaded without investing a lot of time or money in a CAPTCHA-breaking service

Abraxas leads to peculiarly high RAM usage by wget, which can lead to the OOM killer ending the crawl prematurely

In retrospect, had I known I was going to be scraping so many sites for 3 years, I probably would have worked on writing a custom crawler. A custom crawler could have simplified the blacklist part and allowed some other desirable features (in descending order of importance):

CAPTCHA library: if CAPTCHAs could be solved automatically, then each crawl could be scheduled and run on its own.

The downside is that one would need to occasionally manually check in to make sure that none of the possible problems mentioned previously have happened, since one wouldn't be getting the immediate feedback of noticing a manual crawl finishing suspiciously quickly (eg a big site like SR2 or Evolution or Agora should take a single-threaded normal crawl at least a day and easily several days if images are downloaded as well; if a crawl finishes in a few hours, something went wrong).

supporting parallel crawls using multiple accounts on a site

optimized tree traversal: ideally one would download all category pages on a market first, to maximize information gain from initial crawls & allow estimates of completeness, and then either randomly sample items or prioritize items which are new/changed compared to previous crawls; this would be better than generic crawlers’ defaults of depth or breadth-first

removing initial hops in connecting to the hidden service, speeding it up and reducing latency (this does not seem to be a config option in the Tor daemon, but I'm told something like it is done in Tor2web)

post-download checks: a market may not visibly error out but start returning login pages or warnings. If these could be detected, the custom crawler could log back in (particularly with CAPTCHA-solving) or at least alert the user to the problem so they can decide whether to log back in, create a new account, slow down crawling, split over multiple accounts, etc

National Drug and Alcohol Research Centre (NDARC) in Sydney, Australia: Australian-vendor-focused crawls; non-release may be due to concerns over Australian police interest in them as documentation of sales volumes to use against the many arrested Australian sellers

Something that might be useful for those seeking to upload large datasets or derivatives to the Internet Archive (IA): there is a mostly-undocumented ~25GB size limit on its torrents. Past that, the background processes will no longer update the torrent to cover the additional files, and one will be handed valid but incomplete torrents. Without IA support staff intervention to remove the limit, the full set of files will then only be downloadable over HTTP, not through the torrent.