The first thing pywikibot needs to know is which MediaWiki website to target. There are many official sites, such as en.wikipedia.org, commons.wikimedia.org, en.wiktionary.org, en.wikiquote.org, en.wikinews.org, en.wikisource.org, etc., and each has versions in different languages, such as ml.wikipedia.org, ml.wiktionary.org, etc.

A MediaWiki website has two important parts: the code and the family. The pywikibot API supports a large number of official families and codes, and you can also add a local instance or a personal deployment of MediaWiki.

The family tells pywikibot which type of MediaWiki site to use, so that it can read and write data specific to that family. Examples of families are: wikipedia, wiktionary, wikisource, etc.

The code tells pywikibot which variant of the family to use. Common examples of codes are: en, es, ml, etc. The available codes depend on the family, though. For example, the "commons" family has only the "commons" code.
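As a rough illustration, the (code, family) pair usually maps onto the site's hostname. The sketch below is a simplification and is NOT pywikibot's actual resolution logic, which lives in per-family configuration files and handles many special cases; it just shows the relationship between code, family, and hostname:

```python
# Simplified illustration of how (code, family) relate to hostnames.
# Not pywikibot's real logic -- its family files handle many special cases.

def site_hostname(code, family):
    """Guess the hostname for a (code, family) pair, e.g. ('en', 'wikipedia')."""
    if family == "commons":
        # The commons family has only one code, and lives under wikimedia.org.
        return "commons.wikimedia.org"
    return f"{code}.{family}.org"

print(site_hostname("en", "wikipedia"))   # en.wikipedia.org
print(site_hostname("ml", "wiktionary"))  # ml.wiktionary.org

# With pywikibot itself, you construct a site object from the same pair:
# import pywikibot
# enwiki = pywikibot.Site("en", "wikipedia")
```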

In the PAWS interface, the user is set by default to the account that was used to log in to PAWS. In a local script, however, we would need to modify the user-config.py file to add the username and password. We will see this later.
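For reference, a minimal user-config.py for a local setup might look like the fragment below. The account name is a placeholder, and in practice credentials are usually kept in a separate password file rather than in user-config.py itself:

```python
# user-config.py -- minimal example (account name is a placeholder)
family = 'wikipedia'    # default family
mylang = 'test'         # default code (test.wikipedia.org)
usernames['wikipedia']['test'] = 'MyBotAccount'

# Credentials are typically stored in a separate file, e.g.:
# password_file = "user-password.py"
```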

We tell pywikibot to log in with the login() function, then check which user is logged in:

testwiki.login()
print('Logged in user is:', testwiki.user())

WARNING: API error mwoauth-invalid-authorization-invalid-user: The authorization headers in your request are for a user that does not exist here

You can get a lot of other information about the page by using various helper functions provided by pywikibot:

print("Check if page exists:", demo_page.exists())
print("Title of the page:", demo_page.title())
print("Contributors of the page:", demo_page.contributors())
print("Last edit made on page:", demo_page.editTime())
print("Full URL to page:", demo_page.full_url())

In general, use the test Wikipedia website for writing data, and make sure your changes are in your user space (pages starting with User:<Your user name>), as these pages are meant for personal use like testing these scripts :)

sandbox.text = """== About Me ==
Hello! My name is '''{name}'''.
I am from {hometown} and am learning how to use pywikibot!
This page has been written using the pywikibot API.""".format(name=..., hometown=...)  # fill in your own name and hometown
sandbox.save()

Let's open up the webpage and see if our changes have been added there.

Using Jupyter and IPython, we can even embed the webpage into the notebook:

Once you can get content and save new content, you will often want to get a list of categories or templates from a MediaWiki instance.

A category is a special namespace (similar to the user space) which holds categories that are used to classify pages. For example, the "Python (programming language)" page on Wikipedia has the categories "Category:Class-based programming languages", "Category:Cross-platform free software", "Category:Dynamically typed programming languages" and so on.

To add a category to a page, a link to the category must be added to the MediaWiki page. Hence, something like [[Category:<name of category>]] should be added, according to the wiki markup.
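The markup rule above is simple enough to apply directly. Here is a small hand-rolled sketch (pywikibot's textlib module provides more robust helpers for this; the example text is made up):

```python
def add_category(text, category_name):
    """Append a [[Category:...]] link to wiki text if it is not already there."""
    link = f"[[Category:{category_name}]]"
    if link in text:
        return text  # already categorized; nothing to do
    # Category links conventionally go at the bottom of the page.
    return text.rstrip() + "\n" + link + "\n"

page_text = "Python is a programming language."
page_text = add_category(page_text, "Dynamically typed programming languages")
print(page_text)
```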

A template is a snippet of text which can be included in multiple other pages (something like a #include or an import). The wiki markup to add a template is {{<template name>}}, and it can also take arguments, for example {{<template name>|arg1|arg2}}.

python = pywikibot.Page(enwiki, 'Python_(programming_language)')
python

Let's get a list of all categories added to the page:

list(python.categories())

The textlib functions help modify the text content of a page for specific needs, such as adding or removing categories. Hence, it has its own parsers which read through the text and pull out all the category links they find based on the wiki markup.
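To give a rough idea of what such a parser does, the snippet below pulls category links out of raw wikitext with a regular expression. This is a simplified stand-in, not textlib's actual implementation, which also handles namespace aliases, sort keys, and localisation:

```python
import re

def extract_categories(text):
    """Return category names found in wikitext (simplified parser)."""
    # Matches [[Category:Name]] and [[Category:Name|sort key]]
    pattern = re.compile(r"\[\[\s*Category\s*:\s*([^\]\|]+)", re.IGNORECASE)
    return [m.strip() for m in pattern.findall(text)]

sample = """Some article text.
[[Category:Class-based programming languages]]
[[Category:Cross-platform free software|Python]]"""
print(extract_categories(sample))
# ['Class-based programming languages', 'Cross-platform free software']
```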

There are many cases where it is useful to create a "page generator", which helps iterate over multiple pages that share a common property. For example, suppose you want to find all pages of Wikimedia projects:
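A page generator is just a Python iterator over pages. pywikibot ships ready-made ones in its pagegenerators module; the sketch below shows the underlying pattern with plain Python, followed by a commented-out example of what a real pywikibot call can look like (the category name there is made up for illustration):

```python
# The page-generator pattern: lazily yield items that share a property.
def pages_with_prefix(titles, prefix):
    """Yield titles that start with the given prefix (a toy 'page generator')."""
    for title in titles:
        if title.startswith(prefix):
            yield title

titles = ["User:Alice/sandbox", "Python (programming language)", "User:Alice/notes"]
print(list(pages_with_prefix(titles, "User:Alice")))
# ['User:Alice/sandbox', 'User:Alice/notes']

# With pywikibot, ready-made generators do the same job against a live wiki:
# from pywikibot import pagegenerators
# cat = pywikibot.Category(enwiki, "Category:Wikimedia projects")  # illustrative name
# for page in pagegenerators.CategorizedPageGenerator(cat):
#     print(page.title())
```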

Exercise 1 - Write a script to remove trailing whitespace from a given page

In many mediawiki pages, we see that editors leave trailing whitespace at the bottom of the page. While this does not matter when the page is rendered for viewing, it adds unnecessary length to the article when downloading the text and raw wikicode.

Write a script to remove the trailing whitespace and keep only one newline at the end of the page. (Test this on a test wiki!)
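One possible approach (try the exercise yourself first!) is to do the cleanup as a pure text transformation, then assign the result back to the page's text and save. The page title below is a placeholder:

```python
def strip_trailing_whitespace(text):
    """Remove trailing whitespace and keep exactly one newline at the end."""
    return text.rstrip() + "\n"

print(repr(strip_trailing_whitespace("Some article text.\n\n\n   ")))
# 'Some article text.\n'

# Applying it to a real page (placeholder title; test this on a test wiki!):
# page = pywikibot.Page(testwiki, "User:<Your user name>/sandbox")
# page.text = strip_trailing_whitespace(page.text)
# page.save("Remove trailing whitespace")
```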

Exercise 2 - Write a script to find the number of devices using the Android operating system

Find the number of pages in the category for devices that use the Android operating system.

PAWS provides a way to run pywikibot and related commands through Jupyter notebooks. It comes with the various requirements needed for pywikibot scripts already installed, so it is an easy way to get started. However, since it is a single server on the internet, it gets crowded and slow if everyone starts using it at once. In such cases, it may be easier to run these scripts locally on your own desktop or laptop.

Pywikibot is currently still a release candidate, so rather than installing rc5 from pip, we will get the latest source code from the master branch using git. To do this, run the following command in your terminal or command prompt:
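The clone command is likely something like the following. The repository URL is our assumption (Wikimedia's Gerrit hosts the canonical repository); use the URL given in the official pywikibot installation docs:

```shell
# Clone the pywikibot source into a folder named pywikibot-core.
# --recursive also fetches the i18n submodule used for translations.
git clone --recursive https://gerrit.wikimedia.org/r/pywikibot/core.git pywikibot-core
```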

You will find that a folder named pywikibot-core has been created in the current working directory. If you wish to move the folder, simply move it to another directory, or use the cd command to change directory before running the above git command.

Once the git repository has been downloaded, cd into the directory and run:

/home/user/git_repos/pywikibot-core/$ pip install .

This installs the pywikibot repository into your Python installation. The . (dot) is required, as it tells pip to find the Python package in the current directory. Pywikibot also has many optional dependencies which are used to run specific scripts and unit tests. To install all of these (to avoid errors later), run: