Help:Toolforge

Toolforge is a hosting environment for developers working on services that provide value to the Wikimedia movement. The platform allows developers to easily do ad hoc analytics, administer bots, run webservices, and generally create tools to help Wikimedia project editors, technical contributors, and other volunteers in their work. The environment also includes access to a variety of data services. This infrastructure is supported by a dedicated group of Wikimedia Foundation staff and volunteers.

Getting Started

See the unified Help:Getting Started documentation for account creation instructions and details on joining the Toolforge project.

What is Toolforge

Rationale

Toolforge was developed in response to the need to support external tools and their developers and maintainers. The system is designed to make it easy for maintainers to share responsibility for their tools and bots, which helps ensure that no useful tool gets ‘orphaned’ when one person needs a break. The system is designed to be reliable, scalable and simple to use, so that developers can hit the ground running and start coding.

Features

In addition to a well-supported hosting environment, Toolforge provides:

support for Web services, continuous bots, and scheduled tasks

access to replicated production databases

easily shared management of tool accounts, where tools and bots are stored

a grid engine for dispatching jobs

support for mosh, SSH, SFTP without complicated proxy setup

version control via Gerrit and Git

support for Redis

support for Elasticsearch

Shared storage

You will have access to some of the shared storage (for instance, the /shared/mediawiki/ checkout). See Help:Shared storage for details.

Architecture and terminology

Toolforge has four components: the bastion hosts, the grid, the web cluster, and the databases.

Bastion hosts

You log in to Toolforge through a bastion host. As of May 2015, Toolforge has two bastion hosts:

tools-login.wmflabs.org

user login to access tools interactively, also named login.tools.wmflabs.org

dev.tools.wmflabs.org

functionally identical, please use this for heavy processing such as compiles

The grid

The Toolforge grid, implemented with Open Grid Engine (the open-source fork of Sun Grid Engine), permits users to submit jobs either from a login account on a bastion host or from a web service. Submitted jobs are added to a work queue, and the system finds a host to execute them. Jobs can be scheduled synchronously or asynchronously, continuously, or simply executed once. If a continuous job fails, the grid will automatically restart the job so that it keeps going. For more information about the grid, please see § Submitting, managing and scheduling jobs on the grid.

The web cluster

The Toolforge web cluster is fronted by a web proxy, which supports SSL and is open to the Internet. The proxy distributes web requests among the web servers in this cluster; any server in this web cluster can serve any of the hosted web tools because Toolforge uses a shared storage system. For more information, please see § Web server.

Each tool has its own lighttpd Web server, with full configuration options. FCGI scripts are supported with configuration options, and WSGI is supported using flup.server.fcgi. See § Web server for more information.

The databases

Toolforge supports two sets of databases: the wiki replicas and user-created databases, which are used by individual tools. The wiki replicas follow the same setup as production wiki databases, and the information that can be accessed from them is the same as that which normal registered users (i.e.: not +sysop or other types of advanced permissions) can access on-wiki or via the API. Note that some data has been removed from the replicas for privacy reasons. User-created databases can be created by either a user or a tool on the wiki replica servers or on a local ‘tools’ project database.

No "instances"

Developers working in Toolforge do not have to create or set up virtual machines (i.e., Cloud VPS "instances"), because the Toolforge project admins create and manage them. The term may appear in documentation on Wikitech; otherwise, don’t worry about it.

Rules of use

As part of Wikimedia Cloud Services, Toolforge is subject to the general Labs Terms of use, and is governed by the following additional rules:

All code run in the Tools project must be of benefit to the Wikimedia movement.

Using resources for any other reason is considered abuse and may result in a loss of access. This ban includes, but is not limited to, all mining for cryptographic currencies. This class of activity also falls under the Prohibited Uses section of the TOU.

All code in the Tools project must be published under an OSI approved open source license

The absence of a license means that default copyright laws apply. Without a clear license you are implicitly claiming copyright without providing an explanation of the rights you are willing to grant to others who wish to use or modify your software. This means that you retain all rights to your source code and that nobody else may reproduce, distribute, or create derivative works from your work until standard copyright lapses. In the United States today that means until 70 years after your death. This is counter to the general principles of the Wikimedia movement.

Do not use your personal account for noninteractive use

Any process intended to keep running while you are not actively interacting with it (e.g., through a detached screen session, as a background process, or through cron) must be run through a tool account, and not your personal account.

Do not run noninteractive processes on the bastion servers

Likewise, any process meant to execute without direct interaction should be submitted to the grid (e.g. via jsub or webservice) and not run directly on the login hosts. It is permissible to run lightweight processes (such as submitting a job, or rotating logs), but the job grid or Kubernetes should be used for anything that runs for more than a few seconds or consumes large amounts of resources. Processes running on the bastion servers are subject to termination without notice.

Do not run wikis or user-contributed content sites with open registration

Spambots are very good at finding and flooding wikis, forums and other forms of user-contributed content sites to hammer with their crud. Tools that allow end-users to post content should limit posting to registered users that have been validated in some generally reliable manner (either by human verification, by checking against the user being a project member, or using OAuth).

Do not provide direct access to Cloud Services resources to unauthenticated users

For instance, do not allow web clients to issue shell commands or arbitrary SQL queries against the databases. Cloud Services resources are shared and limited, and it must be possible to attribute usage to specific LDAP users who are bound to the terms of use. Tools vetted by Toolforge admins that include substantial anti-abuse and attribution information, such as PAWS and Quarry, are allowed.

Individual wiki policies (these differ!)

When developing on Toolforge, please adhere to the bot policies of the wikis your bot interacts with. Each wiki has its own guidelines and procedures for obtaining approval. The English Wikipedia, for example, requires that a bot be approved by the Bot Approvals Group before it is deployed, and that the bot account be marked with a 'bot' flag. See Wikipedia Bot policy for more information on the English Wikipedia.

cloud@lists.wikimedia.org

A list for announcements and discussion related to Wikimedia Cloud Services products. (archives)

cloud-announce@lists.wikimedia.org

The announce-only version. If you run any Toolforge tools you should subscribe to this at a minimum, as changes that may impact your project are communicated here. (archives)

Toolforge is a joint Foundation and volunteer run project, and we welcome contributions to the infrastructure. The current maintainers are:

Using Toolforge and managing your files

Toolforge can be accessed in a variety of ways – from its public IP to a GUI client. Please see Help:Access for general information about accessing Cloud VPS projects.

The tools list

The Toolforge tools list page is publicly available and contains a list of all currently-hosted Tool accounts along with their maintainers. Tool accounts that have an associated web page appear as links. Users with access to the 'tools' project can create new tool accounts here, and add or remove maintainers to and from existing tool accounts.

Updating files

After you can ssh successfully, you can transfer files via sftp and scp. Note that the transferred files will be owned by you. You will likely wish to transfer ownership to your tool account. To do this, become the tool account and run the take command on the transferred files.

The take command will change the ownership of the file(s) and directories recursively to the calling user (in this case, the tool account).

Handling permissions

If you're getting permission errors, note that you can also transfer files the other way around: copy the files as your tool account to /data/project/<projectname>.

Another, probably easier, way is to make the tool's directory group-writable. For example, if your shell account's name is alice and your tool's name is alicetools, you could make the alicetools directory group-writable and then copy files into it while logged in as alice.

Using git

The best option is to create a Git repository to which project participants commit files. To access the files, become the tool account, check that repository out in your tool's directory, and thereafter run a regular git pull whenever you want to deploy new files.

Other graphical file managers (e.g., Gnome/KDE)

Installing MediaWiki core

MediaWiki installations attract spammers faster than anything else, and the load caused makes Tools administrators grumpy. Do lock down your installation immediately after setup so that uninvited users cannot publish information. You should also re-read the Terms of use regarding the rules on wikis.

You want to install MediaWiki core and make your installation visible on the web.

One-time steps per tool

First, you have to do some preparatory steps which you need only once per tool.

If your local bin directory is not in your $PATH (use echo $PATH to find out), then create or alter the file ~/.profile and add the lines:

# set PATH so it includes user's private bin if it exists
if [ -d "$HOME/bin" ] ; then
    PATH="$HOME/bin:$PATH"
fi

Finish your session as <YOURTOOL> and start a new one, or:

. ~/.profile

Now you are done with the one-time preparations.

For each instance of core

The following steps are needed for each new installation of MediaWiki. We assume that you want to access MediaWiki via the web in a directory named MW — you are free to use another name. If not already done:

Tool Accounts

What is a Tool account?

A Tool account is a shared Unix account. This account acts as the "user" associated with a Tool on Toolforge. Although each tool account has a user ID, tool accounts are not personal accounts (like a Cloud VPS account); rather, they are service accounts that consist of a user and group ID and are intended to run the actual tool or bot. Anyone who is a member of the Toolforge project can create a Tool account.

Members of the Tool account's Unix group include:

the tool account creator

the tool account itself

(optionally, but encouraged!) additional tool maintainers

Maintainers may have more than one tool account, and tool accounts may have more than one maintainer. Every member of the group has the authorization to sudo to the tool account. By default, only members of the group have access to the tool account's code and data.

In addition to the user/group pair, each tool account includes:

A home directory on shared storage: /data/project/<TOOL NAME>

The ability to run a web service which is visible at https://tools.wmflabs.org/<TOOL NAME>/

Database access credentials: $HOME/replica.my.cnf, which provide access to the production database replicas as well as to project-local databases.

Creating a new Tool account

Click the "Create new tool" link at the bottom of the "Your tools" sidebar.

Follow the instructions in the tool account creation form.

The new tool will need a unique name. The name will become part of the URL for the final webservice, so choose wisely!

Do not prefix your tool name with "tools." because the system does so automatically where appropriate, and there is a known issue that will cause the account to be created improperly if you do.

Note: If you have only recently been added to the 'tools' project, you may get an error about not being a member. Simply log out and back in to toolsadmin to fix this.

The tool account will be created and you will be granted access to it within a minute or two. If you were already logged in to a Toolforge bastion through SSH, you will have to log off then back in before you can access the new tool account.

Joining an existing Tool account

All tool accounts hosted in Toolforge are listed on the Tools list. If you would like to be added to an existing account, you must contact the maintainer(s) directly.

If you would like to add (or remove) maintainers to a tool account that you manage, you may do so with the 'manage maintainers' link found beneath the tool name on the Toolforge home page.

Using a Tool account

A simple way for maintainers to switch to the tool account is with the become command, for example become <TOOL NAME>.

Troubleshooting

If become cannot find a newly created tool, the tool's home directory and files may not have been created yet. This can take a few minutes. Wait a few minutes, and try again.

$ become <TOOL NAME>
You are not a member of the group tools.<TOOL NAME>. Any existing member of the tool's group can add you to that.

An active ssh session to login.tools.wmflabs.org will not automatically be updated with new permissions when you are added as a maintainer of a tool. If you are already logged in via ssh when you create the new tool, log out and then log in again to activate your new permissions.

Deleting a Tool account

You can't delete a tool account yourself, though you can delete the content of your directories and make an existing web tool inaccessible by shutting down the web service (webservice stop). If you really want a tool account to be deleted, please file a task in Phabricator requesting that the tool be deleted eventually.

Customizing a Tool account

Once you have created a tool account, there are a few things that you can customize to make the tool more easily understood and used by other users. These include:

Adding a tool account description (the description will appear on the Tools home page beside the tool name)

Creating a web page for your tool (it will be linked from the Tools home page automatically)

Creating a Tool web page

To create a web page for your tool account, simply place an index.html file in the tool account's ~/public_html/ directory. The page can be a simple description of the tool or bot with basic information on how to set it up or shut it down, or it can contain an interface for the web service. To see examples of existing tool web pages, click any of the linked tool names on the Tools list.

Note that some files, such as PHP files, will give a 500 error unless the owner of the file is the tool account.

You will also need to start a webservice for your tool.

1. Log into the Tool environment and become your tool account:

maintainer@tools-login:~$ become toolname

2. Start the web service:

tools.toolname@tools-login:~$ webservice start

Make the Tool translatable

If your tool is used from the web, and assuming you think it's worth something at all, you want to make it translatable. You can and should use the Intuition framework (PHP only), which lets you use translatewiki.net and delivers the localisation to you.

Configuring Tools

Tool and bot code should be stored in your tool account, where it can be managed by multiple users and accessed by all execution hosts. Specific information about configuring web services and bots, along with information about licensing, package installation, and shared code storage, is available in the § Developing on Toolforge section.

Note that bots and tools should be run via the grid, which finds a suitable host with sufficient resources to run each. Simple, one-off jobs can be submitted to the grid easily with the jsub command. Continuous jobs, such as bots, can be submitted with jstart.

Setting up code review and version control

Although it's possible to just stick your code in the directory and mess with it manually every time you want to change something, your future self and your future collaborators will thank you if you instead use source control, a.k.a. version control and a code review tool. Wikimedia Cloud VPS makes it pretty easy to use Git for source control and Gerrit for code review, but you also have other options.

Git/New repositories/Requests -- a list of existing requests, as well as a place to make new ones. You can see the status of your request as well.

For more information about using Git and Gerrit in general, please see Git/Gerrit.

Setting up a local Git repository

It is fairly simple to set up a local Git repository to keep versioned backups of your code. However, if your tool directory is deleted for some reason, your local repository will be deleted as well. You may wish to request a Gerrit/Git repository to safely store your backups and/or to share your code more easily. Other backup/versioning solutions are also available. See User:Magnus Manske/Migrating from toolserver § GIT for some ideas.

Enabling simple public HTTP access to local Git repository

If you've set up a local Git repository like the above in your tool directory, you can easily set up public read access to the repository through HTTP. This will allow you to, for instance, clone the Git repository to your own home computer without using an intermediary service such as GitHub.

Database access

Tool and Tools users are granted access to replicas of the production databases. Private user data has been redacted from these replicas (some rows are elided and/or some columns are made NULL depending on the table). For most practical purposes this is identical to the production databases and sharded into clusters in much the same way.

Database credentials are generated on account creation and placed in a replica.my.cnf file in the home directory of both a Tool and a Tools user account. This file cannot be modified or removed by users.

Symlinking the access file can be practical:

ln -s $HOME/replica.my.cnf $HOME/.my.cnf

To connect to the English Wikipedia replica, specify the alias of the hosting cluster (enwiki.analytics.db.svc.eqiad.wmflabs) and the alias of the database replica (enwiki_p).
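As a rough sketch of how those two aliases fit together, the following Python helper builds the argument list for the mysql command-line client. The naming pattern (<wiki>.analytics.db.svc.eqiad.wmflabs and <wiki>_p) is generalized from the single enwiki example and is an assumption for other wikis. replica_target and mysql_argv are hypothetical helper names, not part of Toolforge.

```python
import os

# Suffix taken from the enwiki example in the text.
CLUSTER_SUFFIX = ".analytics.db.svc.eqiad.wmflabs"


def replica_target(wiki):
    """Return (host alias, database alias) for a wiki such as 'enwiki'.

    Assumption: other wikis follow the same pattern as the enwiki example.
    """
    return wiki + CLUSTER_SUFFIX, wiki + "_p"


def mysql_argv(wiki, defaults_file=None):
    """Build the argv for the mysql client, reading credentials from a file."""
    if defaults_file is None:
        defaults_file = os.path.join(os.path.expanduser("~"), "replica.my.cnf")
    host, database = replica_target(wiki)
    # --defaults-file has to be the first option passed to the mysql client.
    return ["mysql", "--defaults-file=" + defaults_file, "-h", host, database]


if __name__ == "__main__":
    # Print the command line instead of running it.
    print(" ".join(mysql_argv("enwiki")))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; on a bastion host the resulting line could be pasted into the shell.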

Code samples for common languages

In most programming languages, it will be sufficient to tell MySQL to use the database credentials found in $HOME/.my.cnf assuming that you have created a symlink from $HOME/.my.cnf to $HOME/replica.my.cnf.
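For drivers that cannot read an option file directly, the credentials can be parsed by hand. The sketch below uses Python's standard configparser module and assumes that replica.my.cnf contains a [client] section with user and password keys, possibly quoted. That layout is an assumption, so check your own file. parse_credentials and read_replica_credentials are hypothetical helper names.

```python
import configparser
import os


def parse_credentials(text):
    """Return (user, password) from the text of a MySQL option file.

    Assumes a [client] section with 'user' and 'password' keys.
    """
    # Raw parser: a '%' in a password is not treated as interpolation.
    parser = configparser.RawConfigParser(allow_no_value=True)
    parser.read_string(text)
    # Values may be wrapped in quotes; strip them if present.
    user = parser.get("client", "user").strip("'\"")
    password = parser.get("client", "password").strip("'\"")
    return user, password


def read_replica_credentials(path=None):
    """Read the credentials from replica.my.cnf in the home directory."""
    if path is None:
        path = os.path.join(os.path.expanduser("~"), "replica.my.cnf")
    with open(path) as handle:
        return parse_credentials(handle.read())
```

The returned pair can then be handed to whichever database driver the tool uses, which keeps the password out of the tool's source code.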

Extra flags are required for oursql to force binary mode since VARCHAR fields on sql-s2 are mislabeled with latin-1. Manual decoding is required even after upgrading since the fields will be VARBINARY instead.
Note: oursql is only installed on solaris, see jira:TS-760 and jira:TS-1452 for more information.

See also

Submitting, managing and scheduling jobs on the grid

Every non-trivial task performed in Toolforge should be dispatched by the grid engine, which ensures that the job is run in a suitable place with sufficient resources.
The basic principle of running jobs is fairly straightforward:

You submit a job to a work queue from a submission server (e.g., tools-login) or web server

The grid engine master finds a suitable execution host to run the job on, and starts it there once resources are available

As it runs, your job will send output and errors to files until the job completes or is aborted.

Jobs can be scheduled synchronously or asynchronously, continuously, or simply executed once. If a continuous job fails, the grid will automatically restart the job so that it keeps going.

To schedule jobs to run on specific days or at specific times of day, you can use cron to submit the jobs to the grid.

Scheduling a command more often than every five minutes (e.g. * * * * * command) is highly discouraged, even if the command is "only" jsub. In these cases, you very probably want to use 'jstart' instead. The grid engine ensures that jobs submitted with 'jstart' are automatically restarted if they exit.

Email

Mail to users

Mail sent to user@tools.wmflabs.org (where user is a shell account) will be forwarded to the email address that user has set in their Wikitech preferences, if it has been verified (the same as the 'Email this user' function on wikitech).

Any existing .forward in the user's home will be ignored.

Mail to a Tool

Mail can also be sent "to a tool" with:

toolname.anything@tools.wmflabs.org

Where "anything" is an arbitrary alphanumeric string. Mail will be forwarded to the first of:

The email(s) listed in the tool's ~/.forward.anything, if present;

The email(s) listed in the tool's ~/.forward, if present; or

The wikitech email of the tool's individual maintainers.

Additionally, tools.toolname@tools.wmflabs.org is an alias pointing to toolname.maintainers@tools.wmflabs.org, which is mostly useful for automated email generated from within Cloud VPS.

~/.forward and ~/.forward.anything need to be readable by the user Debian-exim; to achieve that, you probably need to chmod o+r ~/.forward*.
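The lookup order described above can be modelled in a few lines of Python. This is only an illustration of the documented order, not the mail server's actual implementation. forwarding_source is a hypothetical name, and the exists parameter is there so the logic can be exercised without touching real files.

```python
import os
import posixpath


def forwarding_source(tool_home, suffix, exists=os.path.isfile):
    """Return what decides delivery for mail sent to toolname.<suffix>@...

    Order, as documented: ~/.forward.<suffix>, then ~/.forward, and
    finally the Wikitech addresses of the tool's maintainers.
    """
    specific = posixpath.join(tool_home, ".forward." + suffix)
    general = posixpath.join(tool_home, ".forward")
    for candidate in (specific, general):
        if exists(candidate):
            return candidate
    # No .forward file applies, so mail goes to the maintainers.
    return "maintainers"
```

For a tool whose home directory is /data/project/demo (a made-up tool name), calling forwarding_source("/data/project/demo", "bugs") reports which of the three sources would apply to demo.bugs@tools.wmflabs.org.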

Processing email programmatically

In addition to mail forwarding, tools can have incoming mail sent to an arbitrary program by setting one of its .forwards (as above) to:

|jmail /path/to/program

In that case, program will be invoked as a job on the grid and will have the email presented to it as its standard input. If program fails to run, or exits with a non-zero status, then the email will bounce with the standard error included in the bounce message.

Please be aware that mail processing on the grid is limited in memory and in runtime (30s CPU time, 60s wall clock) so you should not do heavy processing in your script. If you need more than this, then have the initial script simply queue the email for later processing from another component.

This should use the 'mailq' SGE queue on the grid.
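A minimal sketch of such a "queue now, process later" script follows, assuming Python 3 is available to the job. The queue directory (~/mail-queue) and the file naming scheme are hypothetical choices, as are the function names.

```python
import email
import os
import sys
import time
from email import policy


def queue_message(raw_bytes, queue_dir):
    """Store one incoming message on disk for later processing; return its path."""
    message = email.message_from_bytes(raw_bytes, policy=policy.default)
    os.makedirs(queue_dir, exist_ok=True)
    # Hypothetical naming scheme: timestamp plus process id.
    name = "%d-%d.eml" % (time.time_ns(), os.getpid())
    path = os.path.join(queue_dir, name)
    with open(path, "wb") as handle:
        handle.write(raw_bytes)
    # Short note on stderr; on failure, stderr is what the bounce would carry.
    print("queued: %s" % message.get("Subject", "(no subject)"), file=sys.stderr)
    return path


def main():
    """Entry point: the email arrives on standard input."""
    queue_message(sys.stdin.buffer.read(), os.path.expanduser("~/mail-queue"))
    return 0  # zero exit status, so the mail is accepted rather than bounced
```

A script referenced from a .forward file would end with sys.exit(main()). Doing nothing more than writing the message to disk keeps the script well inside the CPU and wall-clock limits, and a separate job can work through the queue directory at its own pace.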

Mail from Tools

When sending mail from a job, the usual command line method of piping the message body to /usr/bin/mail may not work correctly because /usr/bin/mail attempts to deliver the message to the local MSA in a background process which will be killed if it is still running when the job exits.

If piping to a subprocess to send mail is needed, the message including headers may be piped to /usr/sbin/exim -odf -i.
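The following Python sketch shows one way to build a message with headers and pipe it to that command. It assumes that the recipient is passed as a trailing argument, as with other sendmail-compatible commands. The helper names are hypothetical, and the run parameter exists only so the call can be tested without a mail server.

```python
import subprocess
from email.message import EmailMessage

# Command and flags as named in the text.
EXIM = ["/usr/sbin/exim", "-odf", "-i"]


def build_message(sender, recipient, subject, body):
    """Assemble a message, including headers, ready to be piped to exim."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(body)
    return message


def send_message(message, recipient, run=subprocess.run):
    """Pipe the message to exim and wait until the command returns.

    Assumption: the recipient address is given as a trailing argument.
    """
    argv = EXIM + [recipient]
    # check=True raises CalledProcessError if exim exits with an error.
    return run(argv, input=message.as_bytes(), check=True)
```

Because subprocess.run waits for the command to finish, the job does not exit while delivery is still being handed over, which is the problem described for /usr/bin/mail.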

Web server

Historically all webservices ran on the Grid. All webservices are now encouraged to run on the Kubernetes platform if possible.

Web Service Introduction

Every tool can have a dedicated web server running on either the job grid or Kubernetes. The default 'lighttpd' webservice type runs a lighttpd web server configured to serve static files and PHP scripts from the tool's $HOME/public_html directory.

You can start a tool's web server with the webservice command:

$ become my_cool_tool
$ webservice start

You can also use the webservice command to stop, restart, and check the status of the webserver. Use webservice --help to get a full list of arguments.

Keep passwords and other credentials (OAuth secrets, etc) separated from the main application code so that they are not exposed publicly in your version control system of choice.

Create a page in the Tool: namespace documenting the basics of what your tool does and how to start and stop it.

Find co-maintainers for your tools who can help out at least with starting/stopping jobs when needed.

Make many small tools that each do one specific task rather than a catch-all tool that does many different tasks.

The full documentation page provides tips and instructions for developing code in Toolforge, including specific language support.

Redis

Redis is a key-value store similar to memcache, but with more features. It can be easily used to do publish/subscribe between processes, and also maintain persistent queues. Stored values can be different data structures, such as hash tables, lists, queues, etc. Stored data persists across service restarts. For more information, please see the Wikipedia article on Redis.

A Redis instance that can be used by all tools is available on tools-redis, on the standard port 6379. It has been allocated a maximum of 12G of memory, which should be enough for most usage. You can set limits for how long your data stays in Redis; otherwise it will be evicted when memory limits are exceeded. See the Redis documentation for a list of available commands.

Libraries for interacting with Redis from PHP (phpredis) and Python (redis-py) have been installed on all the web servers and exec nodes. For an example of a bot using Redis, see gerrit-to-redis.

Security

Redis has no access control mechanism, so other users can accidentally/intentionally overwrite and access the keys you set. Even if you are not worried about security, it is highly probable that multiple tools will try to use the same key (such as lastupdated, etc). To prevent this, it is highly recommended that you prefix all your keys with an application-specific, lengthy, randomly generated secret key.

You can very simply generate a good enough prefix by running the following command:

openssl rand -base64 32

PLEASE PREFIX YOUR KEYS! We have also disabled the redis commands that let users 'list' keys.
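One way to make prefixing hard to forget is to wrap the client, as in this Python sketch. PrefixedStore and make_prefix are hypothetical names. The wrapper accepts any object with get and set methods, for example a redis-py client created with redis.Redis(host="tools-redis", port=6379), and the ":" separator is an arbitrary choice.

```python
import secrets


def make_prefix():
    """Generate a long random prefix, comparable to `openssl rand -base64 32`."""
    return secrets.token_urlsafe(32)


class PrefixedStore:
    """Wrap a Redis-like client so every key carries the tool's secret prefix."""

    def __init__(self, client, prefix):
        self.client = client
        self.prefix = prefix

    def _key(self, key):
        # Join the prefix and the short key name with a separator.
        return "%s:%s" % (self.prefix, key)

    def get(self, key):
        return self.client.get(self._key(key))

    def set(self, key, value):
        return self.client.set(self._key(key), value)
```

Generate the prefix once and keep it with the tool's other credentials, outside version control. Calling make_prefix() on every start would make previously stored keys unreachable.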

Can I use memcache?

Elasticsearch

Elasticsearch is a full text search system built on Apache Lucene. It can be used to index and search data stored as JSON documents. It is the technology used to power Wikimedia's CirrusSearch system.

An Elasticsearch cluster that can be used by all tools is available on tools-elastic-0[123], on the non-standard port 80. This Elasticsearch cluster is a shared resource and all documents indexed in it can be read by anonymous users from within Toolforge. Write access, which is needed to create new indexes and to store or update documents, requires a username and password.
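As an illustration of an anonymous read, the sketch below builds a full-text search request with Python's standard library. It assumes tools-elastic-01 as the host (one of the three listed above) and uses a made-up index and field name. build_search_request is a hypothetical helper, and the request is only constructed, not sent.

```python
import json
import urllib.request

ES_HOST = "tools-elastic-01"  # one of tools-elastic-0[123]
ES_PORT = 80                  # the non-standard port mentioned in the text


def build_search_request(index, field, text):
    """Build (but do not send) a match query against one index."""
    url = "http://%s:%d/%s/_search" % (ES_HOST, ES_PORT, index)
    body = json.dumps({"query": {"match": {field: text}}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

From inside Toolforge, passing the returned object to urllib.request.urlopen would perform the search. No credentials are attached because reads are anonymous, whereas write requests would additionally need the username and password.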

CatGraph (aka Graphserv/Graphcore)

CatGraph is a custom graph database that provides tool developers fast access to the Wikipedia category structure. For more information, please see the documentation.

Troubleshooting

If you run into problems, please see the § Contact section. Specifically, please feel free to come into #wikimedia-cloud and look for Coren (Marc-Andre Pelletier) or petan (Petr Bena). The cloud mailing list is another good place to ask for help, especially if the people in chat are not responding.

Backups

What gets backed up?

The basic rule is: there is a lot of redundancy, but no user-accessible backups. Toolforge users should make certain that they use source control to preserve their code, and make regular backups of irreplaceable data. With luck, some files may be recoverable by Cloud Services administrators in a manual process. But this requires human intervention and will likely not rescue the file that was created five minutes ago and deleted two minutes ago. If necessary, ask on IRC or file a Phabricator task.