Locate Orphan Media

Sometimes you upload a great many files to your DokuWiki installation, or you are doing some maintenance and trying to save space. Either way, you will reach the point where you ask yourself: “Which media files are actually being used in my wiki installation?”

There are ways to clean up media automatically: clean_media_directory shows a Perl snippet that gets rid of unlinked files. Here, however, we are only interested in generating a list of unlinked files for later use.

The following is simply a combination of Unix shell utilities put together during the night. It can easily be improved upon and integrated, e.g., into a cron job.

Explanation: We first find every file in the media directory that is not a (d)irectory. This gets all the media files. We remove the first two characters, which are ./, to obtain links relative to the base media directory, and then we transform all slashes into colons to adapt the links to DokuWiki link syntax. The result is stored in /tmp/mediafiles.txt.
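The steps just described can be sketched as a small shell function. This is only a minimal sketch, not necessarily the original snippet: the function name list_media_files is hypothetical, the media directory path in the usage comment is a placeholder, and prefixing each entry with a colon is an assumption made so that the entries match the :path:to:media_file.gif format used for the final list. Sorting is also an addition, so that the list can be compared reliably later.

```shell
#!/bin/sh
# Hypothetical helper: print every media file below the given media
# directory as a DokuWiki media ID, one per line.
list_media_files() {
    media_dir=$1
    (
        # Work from inside the media directory so that find prints
        # paths relative to it (each one starting with "./").
        cd "$media_dir" || exit 1
        find . ! -type d |      # everything that is not a directory
            cut -c3- |          # drop the leading "./"
            tr '/' ':' |        # slashes become namespace separators
            sed 's/^/:/' |      # assumption: add a leading colon
            LC_ALL=C sort       # stable order for later comparison
    )
}

# Usage (the path is a placeholder for your own installation):
#   list_media_files /path/to/dokuwiki/data/media > /tmp/mediafiles.txt
```

File names containing newlines are not handled by this sketch.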

Now check all text files in the pages directory and list all text patterns of the form {{:mediafile[...]}} (the leading colon is there to exclude external links). This snippet creates a file in /tmp listing all the direct invocations of media files.

This creates /tmp/mediareferences.txt, a text file containing all the media file invocations, stripped of their markup and any custom title. It requires that the media references begin with a colon (or a period), as if they were absolute links, but it should work for most media references in a wiki that were uploaded via the Upload Manager or linked to via the Link Wizard.

Explanation: We find and retrieve a list of all the pages in the wiki installation. For each page we must find any instances of media invocations. These are defined as text patterns of the form {{[.]:path:to:media_file.gif|Some text}}, where the period is optional (and indicates the link is relative to the current directory). Extensions are assumed to be three characters long (e.g., “gif”, “zip”). The leading {{ and any text starting at | or } are removed from the pattern, eventually leaving only the media link. Finally, all those patterns are stored in a file.
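A minimal sketch of the extraction described above could look like the following. The function name list_media_references and the pages path in the usage comment are hypothetical, and the regular expression is only one possible reading of the pattern given in the explanation. It relies on grep's -o and -h options, which GNU and BSD grep provide but which are not part of POSIX. Sorting and removing duplicates is an addition for easier comparison.

```shell
#!/bin/sh
# Hypothetical helper: print every media reference of the form
# {{[.]:path:to:media_file.ext found in the *.txt pages below the
# given directory, without the leading "{{", one per line.
list_media_references() {
    pages_dir=$1
    # The pattern requires "{{", an optional period, a colon, then any
    # characters other than "|" or "}", ending in a dot followed by a
    # three-character extension. Sizes such as "?200", titles and the
    # closing "}}" therefore fall outside the match.
    find "$pages_dir" -type f -name '*.txt' -exec \
        grep -ohE '[{][{][.]?:[^|}]*[.][A-Za-z0-9]{3}' {} + |
        sed 's/^{{//' |         # strip the opening braces
        LC_ALL=C sort -u        # one line per distinct reference
}

# Usage (the path is a placeholder for your own installation):
#   list_media_references /path/to/dokuwiki/data/pages > /tmp/mediareferences.txt
```

As in the explanation, references that do not start with a colon or a period, and extensions that are not three characters long, are missed or truncated by this pattern.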

A/N: Note that the snippet above will not necessarily catch all media links. In particular, it will fail to catch some relative links; this will be improved upon.

Now the only thing remaining is to find all files listed in /tmp/mediafiles.txt that do not appear in /tmp/mediareferences.txt:
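One way to do this comparison, as a minimal sketch, is with comm, which needs both inputs sorted the same way. The function name list_orphaned_media is hypothetical, and using comm here is an assumption rather than necessarily the original approach. mktemp is not part of POSIX but is available on most systems.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of the first file (media files)
# that do not appear in the second file (media references).
list_orphaned_media() {
    files_sorted=$(mktemp) || return 1
    refs_sorted=$(mktemp) || return 1
    # comm expects sorted input, so sort both lists the same way first.
    LC_ALL=C sort -u "$1" > "$files_sorted"
    LC_ALL=C sort -u "$2" > "$refs_sorted"
    # -2 hides lines unique to the references, -3 hides common lines,
    # leaving only the media files that are never referenced.
    LC_ALL=C comm -23 "$files_sorted" "$refs_sorted"
    rm -f "$files_sorted" "$refs_sorted"
}

# Usage:
#   list_orphaned_media /tmp/mediafiles.txt /tmp/mediareferences.txt > orphanedmedia.txt
```

Relative references (those beginning with a period) are compared literally, so a file that is only referenced through a relative link will still be reported as orphaned.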

Voilà. orphanedmedia.txt contains the wikipaths of all the media files that are never invoked, in DokuWiki link format (:path:to:media_file.gif).

Considerations

This is not 100% safe (see above), but it should locate most orphan files if media references are always inserted through the media manager. Also note that I'm no Bash master; I just combined some tools until it worked.

Both scripts could be further adapted to ensure they catch all relative media links. I'll be studying how to do that.