Summary:
  • Definition of ‘web preservation’.
  • To ensure that everyone in the Institution agrees on what should be preserved and how, a web preservation programme should be developed.
  • All resources must be managed in order to preserve them.
  • There are issues to bear in mind which are specific to web resources, Web 2.0 resources and content management systems.

For the purposes of this Guide, we define web preservation as ‘the capture, management and preservation of websites and web resources’. Web preservation must be a start-to-finish activity, and it should encompass the entire lifecycle of the web resource.

Another definition to consider in this context is JISC’s definition of ‘digital preservation’: ‘the set of processes and activities that ensures long-term, sustained storage of, access to and interpretation of, digital information’.

Institutional views of preservation requirements may vary, so it is important for an Institution to agree on a web preservation programme which defines the web resources that will be preserved. When considering this, bear in mind that:

  • Resources must be managed in order to preserve them.
  • Preservation will not apply to all web resources: a selective approach is recommended.
  • Preserving every version of every resource is not always necessary.
  • Permanent preservation (as defined by the OAIS model) is not the only viable option. Short-term protection of a resource from loss or damage is an acceptable form of preservation.
  • Preservation actions do not have to result in a perfect solution.

A number of specific preservation issues apply to web resources. In addition, Web 2.0 and content management systems present unique issues.

Web resource preservation issues

1 Frequency of change: Web resources change to a greater or lesser extent every day, and periodically change dramatically because of events such as re-branding, the implementation of a content management system, or changes to content providers.

2 Quantity and range of resources: The quantity and range of resources potentially needing preservation are so large that it is vital to know what resources there are, where they are, and what to do about them.

3 Continuity: Because of the ease with which websites and pages can be edited, the possible impact on users expecting continuity in web resources can be overlooked. For example, a page may stay the same but no longer be available from the same URL, or the URL may stay the same while its content changes. So the issues are: persistence of resources at a given URL; and persistence of resources within a domain.

Ideally it should be possible to support versioning across a whole site, so that old versions of a page link to their associated contemporary versions, but this represents a large overhead.
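
The two continuity cases described above (same content at a new URL, and changed content at the same URL) can be detected by comparing two captures of a site. The following is a minimal sketch, assuming each capture is already available as a mapping from URL path to page content; the function names and the sample pages are hypothetical and not taken from any particular harvesting tool.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Return a SHA-256 digest used to test whether two pages are identical."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def compare_snapshots(old: dict[str, str], new: dict[str, str]) -> dict[str, str]:
    """Classify every URL in the old capture against the new capture."""
    # Index the new capture by content digest so moved pages can be found.
    new_by_digest: dict[str, list[str]] = {}
    for url, content in new.items():
        new_by_digest.setdefault(fingerprint(content), []).append(url)

    report = {}
    for url, content in old.items():
        digest = fingerprint(content)
        if url in new:
            same = fingerprint(new[url]) == digest
            report[url] = "unchanged" if same else "content changed"
        elif digest in new_by_digest:
            report[url] = "moved to " + new_by_digest[digest][0]
        else:
            report[url] = "missing"
    return report


# Hypothetical captures of a small site, taken at two different times.
old_capture = {
    "/about": "About the Institution",
    "/courses": "Courses for 2008",
    "/news/launch": "Project launch",
    "/contact": "Contact us",
}
new_capture = {
    "/about": "About the Institution",
    "/courses": "Courses for 2009",
    "/archive/launch": "Project launch",
}

report = compare_snapshots(old_capture, new_capture)
for url, status in report.items():
    print(url, "->", status)
```

An exact digest match only recognises byte-identical pages; a page that has both moved and been edited would be reported as missing, so a report of this kind is a starting point for review rather than a complete answer.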

4 Integrity of web resources: Websites and pages need to be protected from careless or wrongful amendment, deletion, or removal, whether by malevolent hackers/crackers, or well-intentioned Institutional staff.

5 Ownership: There may be issues of ownership resulting from web resources being managed by many different departments or members of staff, or from sub-sites being temporary or ad hoc (for example, a project site).

6 Databases and deep websites: Databases present particular issues because preserving an underlying database may not preserve the user's experience on the web. Also, database-driven websites are not always easy to capture by remote harvesting.

7 Streaming and multimedia: The quantity and quality of data, and the range of formats, can cause issues when dealing with multimedia. In addition, these resources can be hosted elsewhere and therefore the same set of issues applies as for Web 2.0 applications (see below).

8 Personalised websites: Some websites offer users customisable features. This raises the issue of whether every possible combination of every user's custom view should be preserved.

9 Appraisal and selection: Appraising and selecting which web resources should be preserved raises many questions which are dealt with in Chapter 5.

10 Providing access: Once web resources have been preserved, consider how access to them will be provided and how to deal with issues of IPR and ownership.

11 Resources for preservation: Both personnel and technical resource issues have to be considered. Preservation work can be an overhead on day-to-day web and records management activities, so assigning people to preservation work needs to be balanced against routine web and records management.

In technical terms, it is necessary to estimate how much storage space will be required to store the old web resources and where this will be located.
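
A rough storage estimate of this kind can be worked through in a few lines. The sketch below assumes that every capture is stored in full (no compression or de-duplication) and that the site grows by a fixed proportion each year; the function name and all of the figures are illustrative assumptions, not recommendations.

```python
def estimate_storage_gb(snapshot_gb: float, snapshots_per_year: int,
                        years: int, annual_growth: float = 0.0) -> float:
    """Estimate total storage for full site captures kept over several years.

    snapshot_gb        -- size of one full capture today, in gigabytes
    snapshots_per_year -- how many captures are taken each year
    years              -- how many years of captures are retained
    annual_growth      -- yearly growth of the site, e.g. 0.5 for 50%
    """
    total = 0.0
    size = snapshot_gb
    for _ in range(years):
        total += size * snapshots_per_year
        size *= 1 + annual_growth  # the site is assumed to grow once a year
    return total


# Hypothetical site: 2 GB per capture, captured quarterly, kept for 3 years.
flat = estimate_storage_gb(2.0, 4, 3)
growing = estimate_storage_gb(2.0, 4, 3, annual_growth=0.5)
print(f"No growth: {flat} GB; 50% yearly growth: {growing} GB")
```

Where captures are compressed, or where only changed files are stored, the real requirement is likely to be lower than this figure.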

Web 2.0 preservation issues

The two most important issues with Web 2.0 software and applications are ownership and retention.

1 Ownership and responsibility: Often individuals create and manage their own Web 2.0 resources such as external (personal) accounts for Flickr, Slideshare or WordPress.com. So it is possible for academics to conduct a significant amount of Institutional business outside any known Institution network. In these cases, the Institution either does not know this activity is taking place, or ownership of the resources is not recognised officially. In such a scenario, it is likely the resources are at risk.

2 Retention of master copies: Third party sites such as Slideshare or YouTube are excellent for dissemination, but they cannot be relied on to preserve materials permanently. So, if a resource is created on one of these third party sites and it requires retention or preservation, arrangements must be made within the Institution for the master copy.

Content management system preservation issues

With digital preservation in mind, the features of particular value which content management systems (CMSs) may offer are:

  • Version control: when changes are made to items in the CMS, the previous version is kept.
  • Change logging: when changes are made to items in the CMS, the system records who made the change and when.
  • Rollback/reversion: the facility to restore the website, or a part of it, to a previous state.
  • Creating a snapshot of the website at a particular point in time.

Many CMSs offer one or more of these features, but the extent to which they can easily be used to reinstate older versions of a website, or to find what changes happened when, varies dramatically. Version control information is easy to create and store, but less easy to put to practical use. Discussions with web managers suggest that these features are rarely tested very vigorously.

The particular preservation issues of CMSs are:

  • Page names and numbers.
  • Rollback function is limited.
  • Lifespan of system.
  • Compatibility between systems.

Page names and numbers: Some CMSs may present problems to a remote harvesting engine, or crawler, as pages that are identified with numerical tags instead of page names, for example, may not be recognised, and hence may be missed by the remote harvester. This is especially true if the CMS generates pages dynamically. The severity of this behaviour may also depend on how the site was built in the first place.
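
One way to check for this problem is to list the pages that are identified only by a number and compare them with what a harvest actually captured. The sketch below is illustrative only: the URLs are invented, the helper names are hypothetical, and real crawlers differ in how they treat numeric identifiers and query strings.

```python
from urllib.parse import parse_qs, urlparse


def is_id_only_url(url: str) -> bool:
    """Return True when a page is identified by a number rather than a name."""
    parsed = urlparse(url)
    # Final path segment, e.g. "57" in /node/57 or "history" in /about/history.
    last_segment = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    # All query-string values, e.g. ["1234"] for ?id=1234.
    query_values = [v for values in parse_qs(parsed.query).values() for v in values]
    numeric_query = bool(query_values) and all(v.isdigit() for v in query_values)
    return last_segment.isdigit() or numeric_query


def pages_to_check(site_urls: list[str], harvested_urls: set[str]) -> list[str]:
    """List numerically identified pages that are absent from the harvest."""
    return [u for u in site_urls if is_id_only_url(u) and u not in harvested_urls]


# Hypothetical list of pages known to the CMS, and a hypothetical harvest result.
site_urls = [
    "https://www.example.org/about/history",
    "https://www.example.org/page.php?id=1234",
    "https://www.example.org/node/57",
]
harvested_urls = {
    "https://www.example.org/about/history",
    "https://www.example.org/node/57",
}

missed = pages_to_check(site_urls, harvested_urls)
print(missed)
```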

Rollback function is limited: A rollback may not be the same as restoring a full snapshot, as it will tend to focus on a particular page or content element but not its entire context. Web pages usually relate to many other objects, for example embedded images and stylesheets, so the rollback cannot be used to view the content of the whole page as seen by the user. The content is held in the database as layers of time-stamped pages and a script is required to retrieve it. It is therefore not clear to what extent the rollback functions and version control tools produce useful, tangible outputs that could be captured, managed or preserved.
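
To illustrate what "a script is required to retrieve it" can mean in practice, the sketch below stores time-stamped page versions in a small in-memory SQLite database and retrieves the version that was current at a given moment. The table name, columns and sample rows are assumptions made for the example; a real CMS will have its own, usually more complex, schema.

```python
import sqlite3

# Hypothetical version store: one row per saved version of a page.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_versions (page_id INTEGER, saved_at TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO page_versions VALUES (?, ?, ?)",
    [
        (1, "2008-01-10T09:00:00", "<h1>Welcome</h1>"),
        (1, "2008-06-02T14:30:00", "<h1>Welcome to the new site</h1>"),
        (2, "2008-03-15T11:00:00", "<h1>Courses</h1>"),
    ],
)


def page_as_of(connection, page_id, moment):
    """Return the page body that was current at `moment`, or None.

    Timestamps are ISO 8601 strings, so plain text comparison orders
    them chronologically.
    """
    row = connection.execute(
        "SELECT body FROM page_versions "
        "WHERE page_id = ? AND saved_at <= ? "
        "ORDER BY saved_at DESC LIMIT 1",
        (page_id, moment),
    ).fetchone()
    return row[0] if row else None


print(page_as_of(conn, 1, "2008-03-01T00:00:00"))
print(page_as_of(conn, 1, "2008-12-31T00:00:00"))
```

The query returns only the stored text of one page. The images, stylesheets and linked pages that made up the user's view at that time are not recovered, which is the limitation described above.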

Compatibility between systems: A CMS may not be supported indefinitely, so the question arises whether a new version will be compatible with the old one. Also, the Institution may decide to change the CMS and, as CMS internal management of content, data and metadata tends to be application-specific, moving large quantities of interlinked website content between CMS packages is likely to be a manual and intensive process.

Backing up is not enough: A CMS is a database full of content, but simply backing up the database will not constitute preservation of the content. The backup action would capture a change history of the website for as long as it was kept in that CMS; it would not constitute a usable collection of page snapshots, or an archived website.

Metadata: The change history metadata would be extremely useful for records management and preservation purposes, but access to that metadata is not guaranteed: it would need to be exportable in a form that could be preserved.

Action
  • Define web preservation in the context of your Institution.
  • Consider to what extent the issues raised by Web 2.0 and content management systems affect your Institution.