Supplying historical Portuguese web contents

Contents that were published in the past and are no longer online must be gathered from external sources before they can be archived in our system.

If you have web contents of interest to the Portuguese community and want to contribute to their preservation, please contact us.

We consider all contents published on sites under the .PT domain to be part of the Portuguese web, and they must be preserved. Contents hosted under other domains will also be accepted if they are of interest to the Portuguese community.

Should I supply only old contents?

We are interested in receiving contents that are no longer available online, regardless of their publication date.

The web is extremely dynamic and the lifetime of most contents is very short.

Thus, many contents are lost because they become unavailable before we can gather them, even though we perform periodic crawls of the Portuguese web.

Backups of Portuguese web sites are a good example of contents that may be provided.

How can I supply contents?

Arquivo.pt stores the archived contents using the ARC format. Ideally, contents should be supplied using this format.

However, we understand that most people do not keep their files in this format. Therefore, we accept Portuguese web contents kept in any format.

Later, the Portuguese web archive team will convert them to the ARC format so that they can be integrated into our system.
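
For contributors curious about what the conversion produces, the following is a minimal sketch of a single ARC record, assuming the version 1 URL-record layout published by the Internet Archive (a one-line header with URL, IP address, 14-digit date, media type and length, followed by the content bytes). It is not the tool used by the Portuguese web archive team. The function name arc_record and the placeholder IP address 0.0.0.0 are assumptions made for illustration, and the version block that opens a complete ARC file is omitted.

```python
from datetime import datetime


def arc_record(url, ip, archive_date, mime, payload):
    """Build one ARC v1 URL record as bytes (illustrative sketch only).

    The header fields are separated by single spaces and the length field
    counts the bytes of the payload that follows the header line.
    """
    header = "{} {} {} {} {}\n".format(
        url,
        ip,
        archive_date.strftime("%Y%m%d%H%M%S"),  # 14-digit date, e.g. 20080102030405
        mime,
        len(payload),
    )
    # A newline separates this record from the next one in the file.
    return header.encode("utf-8") + payload + b"\n"


if __name__ == "__main__":
    record = arc_record(
        "http://example.pt/",        # original URL (hypothetical site)
        "0.0.0.0",                   # placeholder: the original IP is unknown
        datetime(2008, 1, 2, 3, 4, 5),
        "text/html",
        b"<html></html>",
    )
    print(record.decode("utf-8"))
```

The header fields show why the original URL, date and media type of each content matter: they are needed to fill in every record.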

To facilitate this task, please supply as much metadata as possible along with the contents, especially:

the web site address(es). If there are several web sites, please group the contents belonging to each one of them in a separate directory;

the content addresses (URLs). If you are providing a local copy of a site, please keep the original file names. If you are supplying contents that you gathered from the web, please provide their original URLs;

the content dates. Supply the date when each content was published or saved. If you do not know the exact dates, please supply approximate dates;

the content media type (MIME). Please keep the original file name extensions of the contents (e.g. .gif, .html, .jpg). If possible, provide the full HTTP header for each content. Providing the media type is particularly important for dynamically generated contents whose addresses do not include a file name extension.
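
As an illustration of the metadata listed above, the following minimal sketch walks a local copy of one site and writes a CSV file with the URL, an approximate date and the media type of each file. It is only a suggestion, not a required submission format. The names build_manifest and write_manifest, the column names and the example address http://example.pt/ are assumptions. The file modification time is used as an approximate date, and the media type is guessed from the file name extension.

```python
import csv
import mimetypes
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import quote


def build_manifest(site_dir, site_url):
    """Return one metadata row (dict) per file found under site_dir."""
    root = Path(site_dir)
    base = site_url.rstrip("/") + "/"
    rows = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        relative = path.relative_to(root).as_posix()
        # Guess the media type from the original file name extension.
        mime, _ = mimetypes.guess_type(path.name)
        # The modification time is only an approximation of the content date.
        saved = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        rows.append({
            "url": base + quote(relative),
            "date": saved.strftime("%Y-%m-%d"),
            "mime": mime or "unknown",
            "file": relative,
        })
    return rows


def write_manifest(rows, out_path):
    """Write the rows to a CSV file with a header line."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["url", "date", "mime", "file"])
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical directory holding the backup of a single site.
    write_manifest(build_manifest("backup/example.pt", "http://example.pt/"),
                   "manifest.csv")
```

Files reported with the media type "unknown" are the ones for which the original HTTP header or a manually supplied media type is most useful.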