ArchiveTeam Warrior

What is the Archive Team Warrior?

The Archive Team Warrior is a virtual archiving appliance. You can run it to help with the Archive Team archiving efforts. It will download sites and upload them to our archive, and it's really easy to do!

The warrior is a virtual machine, so there is no risk to your computer. The warrior will only use your bandwidth and some of your disk space. It will get tasks from and report progress to the Tracker.

Basic usage

The warrior runs on Windows, OS X and Linux using a virtual machine. You'll need one of:

We prefer connections from many public IP addresses if possible. (For example, if your apartment building uses a single IP address, we don't want your apartment banned.)

Why am I seeing a message that no item was received?

It means that there is no work available. This happens for several reasons:

The project has just finished and someone is inspecting the work done. If a problem is discovered, items may be re-queued and more work will become available.

In a rare case, you have been banned by a tracker administrator because you were requesting too much work, you were tampering with the scripts, a malfunction has occurred, or your internet connection is "unclean".

Why am I seeing a message about rate limiting?

Keep in mind that although downloading the internet for digital preservation and fun are the primary goals of all Archive Team activities, serious stress on the target's server may occur. The rate limit is imposed by a tracker administrator and should not be subverted.

(In other words, we don't want to DDoS the servers.)

Why am I seeing a message about code being out of date?

The warrior will update its code every hour. If you are impatient, please restart the warrior and it will download the latest code and resume work.

Help! The warrior is eating all my bandwidth!

You can limit the warrior's bandwidth quite easily in VirtualBox, as long as you are running a relatively recent version. The option is not available in the GUI, however, so it has to be set from the command line.
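As a rough sketch of what that looks like, the Python snippet below only builds and prints the two VBoxManage commands involved: one creating a network bandwidth group and one attaching it to the VM's first network adapter. The VM name archiveteam-warrior, the group name warrior-limit, and the 3m limit (3 megabits per second in VBoxManage's unit suffixes) are assumptions for illustration. Check the name of your own VM and the VBoxManage documentation for your version before running anything, and expect to apply the modifyvm step while the VM is powered off.

```python
import shlex


def bandwidth_limit_commands(vm_name, group="warrior-limit", limit="3m", nic=1):
    """Build (but do not run) the VBoxManage calls that cap a VM's network adapter.

    vm_name, group and limit are illustrative values, not official defaults.
    """
    return [
        # Create a bandwidth group of type "network" with the given limit.
        ["VBoxManage", "bandwidthctl", vm_name, "add", group,
         "--type", "network", "--limit", limit],
        # Attach the group to network adapter number `nic` of the VM.
        ["VBoxManage", "modifyvm", vm_name, f"--nicbandwidthgroup{nic}", group],
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run by hand.
    for cmd in bandwidth_limit_commands("archiveteam-warrior"):
        print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run on a machine without VirtualBox installed.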

I turned my warrior VM appliance off. Will those tasks be lost?

If you've killed your warrior VM instances, then the work your warrior did has been lost. However, the tasks will be returned to the pool after a period of time. If you want, you can alert the admins via IRC of what's happened, and they can clear the claims your username may have made. This isn't very important on most projects, though.

I closed my browser or tab with the warrior's web interface. Will those tasks be lost?

No, the web browser interface just provides, well, a user interface to the warrior. As long as the VM is not stopped, it will continue normally.

If you pause/suspend the warrior instance, most projects will allow resuming of work in progress when you unsuspend the warrior instance.

If you decide to use the suspend feature in VirtualBox, please note that if you keep the VM suspended for too long (more than a few hours), the admins will assume that the item is lost and it will be re-queued. Using the suspend feature so that you can reboot your computer is perfectly fine.

I told the warrior to shut down from the interface but nothing has changed! What gives?

The warrior will attempt to finish the currently running tasks before shutting down. If you need to shut down right away, go ahead. Your progress will be lost, but the jobs will eventually cycle out to another user.

How much disk space will the warrior use?

Short answer: it depends on the project.

Long answer: because each project defines an item differently, the warrior may be downloading a small file or a whole subsection of a website. The virtual machine is configured by default to use 60GB as an absolute maximum. Any unused virtual machine disk space is not used on the host computer. You may, however, run the virtual machine on less than 60GB if you like to live dangerously. We're downloading the internet after all!

The secondary disk is using up space even though it's not running a project.

Virtual machine disk images do not behave like a regular file. There are several ways to reclaim space:

Delete the second disk and put back an empty disk. The warrior should reformat the second disk.

Delete the entire warrior appliance and re-import it.

Use the zerofree program and then clone the disk image. Reattach the cloned disk image.

I can't connect to localhost.

The appliance includes a configuration that sets up port forwarding to the guest machine on port 8001 so you can access the interface through your web browser. If this does not work, double-check the virtual machine's network settings.
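One quick way to tell whether the forwarding is in place is to test whether anything accepts TCP connections on the forwarded port. The following is a minimal Python sketch, assuming the default host address 127.0.0.1 and port 8001 mentioned above; the function name port_is_open is made up for this example.

```python
import socket


def port_is_open(host="127.0.0.1", port=8001, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if port_is_open():
        print("Something is listening on 127.0.0.1:8001; try http://localhost:8001/")
    else:
        print("Nothing is answering on 127.0.0.1:8001; check the VM's port forwarding rule")
```

A successful connection only shows that the port is reachable, not that the warrior itself is healthy, so treat it as a first diagnostic step.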

The warrior can't connect to the internet.

The virtual machine may have picked up the address of a local DNS cache on your computer, which the virtual machine cannot reach.

I'm looking at the text scrolling by and I notice some errors. rsync is not working.

Uh-oh! Something is not right. Notify us immediately in the appropriate IRC channel.

The item I'm working on is downloading thousands of URLs and it's taking hours.

See the above question and reboot the warrior as appropriate.

I'm looking at the leaderboard. What's that icon beside the username?

That's just the warrior logo. It means that the person is using the warrior. Those without the icon are running the scripts manually.

What's that guy doing in the logo?

The place is on fire! But don't worry, he safely escaped with the rescued data in his arms.

I want to log in to the virtual machine. How do I do this?

Unless you know what you are doing, you should not need to do this. But if you want to, the username is root and the password is archiveteam. Then, you can execute sudo -u warrior -i to log in as the warrior user.

Press ALT+F3 to switch to virtual console number 3. Use ALT+Left or ALT+Right to switch between virtual consoles. There are 6 virtual consoles in total. Consoles 1 and 2 are reserved for the warrior.

Can I run multiple virtual machines at the same time?

Yes, but you'll need to adjust the networking settings.

On the machine, open up Settings → Network → Adapter 1 → Port Forwarding. You need to adjust the Host Port so that each virtual machine uses a different one. For example, ensure your table looks like TCP | 127.0.0.1 | 8123 | | 8001. In this example, you can then visit http://localhost:8123/, because port 8123 on your computer is mapped to port 8001, which the warrior uses.
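The same mapping can be expressed as VBoxManage NAT port-forwarding rules, which take the form name,protocol,host IP,host port,guest IP,guest port. The Python sketch below assigns each VM its own host port and prints the matching commands without running them. The VM names, the rule name warrior-web, and the starting host port are assumptions for illustration. An imported appliance probably already carries a forwarding rule, so editing the Host Port in the GUI as described above is likely the simpler route.

```python
import shlex


def natpf_rule(host_port, guest_port=8001, name="warrior-web", host_ip="127.0.0.1"):
    """Format a rule as name,protocol,host IP,host port,guest IP,guest port.

    The guest IP field is left empty, as in the table example above.
    """
    return f"{name},tcp,{host_ip},{host_port},,{guest_port}"


def forwarding_commands(vm_names, first_host_port=8001):
    """Give every VM a distinct host port, all forwarding to guest port 8001."""
    commands = []
    for offset, vm in enumerate(vm_names):
        rule = natpf_rule(first_host_port + offset)
        commands.append(["VBoxManage", "modifyvm", vm, "--natpf1", rule])
    return commands


if __name__ == "__main__":
    # Hypothetical VM names; substitute the names shown in your VirtualBox manager.
    for cmd in forwarding_commands(["warrior-a", "warrior-b"]):
        print(shlex.join(cmd))
```

With the two hypothetical VMs above, the first interface would be reachable on host port 8001 and the second on 8002.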

The warrior seems to have too much overhead. I can't run a VM in a VPS!

You don't need to run a virtual machine. If you are managing a VPS, it's likely you are comfortable with some Linux stuff. Projects can be run manually. Consult the project wiki page or the source code repository readme file.

Why a virtual machine in the first place?

The virtual machine is a quick, safe, and easy way for newcomers to help us out. It offers many features:

If you have suggestions for improving this system, please talk to us as described below.

I'm running the scripts manually in a VPS but it says the code is out of date a while later.

It happens when a bug in the scripts is discovered. Bugs are unavoidable, especially when the server is out of our control.

Try the --auto-update option available in Seesaw version 0.8. However, please be aware that you are now executing code automatically. Be sure to run the scripts in a separate user account for safety.

I just imported the OVA image and the warrior is stuck on "Preparing the data partition"

This issue has cropped up before and we do not know what causes it. We recommend deleting the warrior image and importing the OVA again. Testing shows that such a reimport works in the majority of cases.

Why is the default project not working? / Why is a manual project not in the Warrior yet?

Sorry. Sometimes the administrators are too busy...

Why are there no projects?

If there are no projects showing, you can help us write one. No projects does not mean there is nothing left to archive!

The instructions to run the software/scripts are awful and they are difficult to set up.

Well, excuuuuse me, princess!

We're not a professional support team, so help us help you help us all. See below for how to file bug reports, make suggestions, or contribute code.

Where can I file a bug, suggestion, or a feature request?

If the issue is related to the warrior's web interface or the library that the grab scripts use, see seesaw-kit issues. Other issues should be filed in their own repositories.

I'd like to help write code. Where can I find more info?

Check out the Dev documentation for details on the infrastructure and the source code layout.