Collecting web pages is the main way of carrying out the legal deposit of online publications. It is done with crawler robots that visit the previously selected URLs and save everything linked from them, at the frequency, depth and size configured for each one. The result of these collections is a set of web archives.
Today it is impossible to archive the web completely, so the National Library of Spain has opted for a mixed model that combines massive and selective collections:
1. Massive collections gather as many domains as possible at a shallow navigation depth and are tied to the .es domain. They are carried out once a year.
2. Selective collections complement the massive collections: they collect, more deeply and more frequently, a smaller sample of websites selected for their relevance to history, society and culture. They are carried out several times a year in collaboration with the conservation centers of the autonomous communities and other specialised institutions. Selective collections can be of three types:
2.1. Thematic: each department of the National Library and each autonomous community maintains its own thematic collections with the online resources it deems necessary to preserve as part of the legal deposit. For example: Music and Audiovisuals, Andalusian Electronic Magazines, Institutions of the Valencian Community, etc.
2.2. Event: devoted to events of special relevance.
2.3. Emergency: for websites at risk of disappearing.
**Downloadable file fields:**
* Website title
* Seed: the URL we provide as the starting point for collection. It can be the home page of a site, a section of a site, or a document in another format contained in a web page.
* Additional URLs: extra URLs we can add to improve the coverage or quality of the crawl (e.g. the sitemap, an important section, etc.).
* Status: “Active” if we want to collect the website, or “Inactive” if we want to stop collecting it, for example because the website has ceased to exist.
* Frequency: how often we want to collect the website. Frequencies can be Daily, Monthly, Fortnightly, and One-time (if the website is to be collected only once).
* Depth: how deeply we want to collect the website, that is, how far the robot will descend by following the links contained in the URL we give it as a seed. The depth can be:
Home: collects only the URL given as the seed.
Home and 1 level: collects the URL given as the seed plus one level of depth.
Home and 2 levels: collects the URL given as the seed plus two levels of depth.
Domain: collects all URLs containing the proposed domain. For example, from the seed www.bne.es, it collects all URLs containing “bne.es”.
Host: collects all URLs containing the proposed host. For example, from the seed www.bne.es, it collects all URLs containing “www.bne.es”.
Path: collects only the URLs under the path we give it, without going back up to URLs in parent directories.
* Size:
Small: for websites of up to 10,000 URLs.
Medium: for websites of up to 50,000 URLs.
Large: for websites of up to 100,000 URLs.
* Keywords: they describe more precisely the content of the resource to be collected and allow subcollections to be created within a collection. Between 1 and 5 keywords are assigned per record, separated by “/”.
* Material:
The materials of each collection allow us to distinguish the different subcollections of the autonomous communities.
An abbreviated UDC code (Universal Decimal Classification, CDU in Spanish) and its label are assigned.
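As an illustration of how some of the fields above could be interpreted when processing the downloadable file, here is a minimal Python sketch. It is not the crawler's actual logic. The function names, the lowercase option names ("domain", "host", "path") and the rule that the domain is the seed host without a leading "www." are assumptions made for the example, and seeds are assumed to include a scheme such as https://.

```python
from urllib.parse import urlsplit

# Size category -> maximum number of URLs, as listed in the Size field.
SIZE_LIMITS = {"Small": 10_000, "Medium": 50_000, "Large": 100_000}


def parse_keywords(field: str) -> list[str]:
    """Split a "/"-separated keyword field and enforce the 1 to 5 limit."""
    keywords = [k.strip() for k in field.split("/") if k.strip()]
    if not 1 <= len(keywords) <= 5:
        raise ValueError(f"expected 1 to 5 keywords, got {len(keywords)}")
    return keywords


def in_scope(seed: str, url: str, depth: str) -> bool:
    """Check whether `url` falls inside the scope that `seed` defines.

    Only the domain, host and path-based depth options are modelled;
    the level-based options depend on link distance, which a single
    URL comparison cannot capture.
    """
    s, u = urlsplit(seed), urlsplit(url)
    seed_host = s.hostname or ""
    url_host = u.hostname or ""
    if depth == "host":
        return url_host == seed_host
    if depth == "domain":
        # Assumption: the domain is the seed host minus a leading "www.".
        domain = seed_host.removeprefix("www.")
        return url_host == domain or url_host.endswith("." + domain)
    if depth == "path":
        # Directory of the seed path; URLs above it are out of scope.
        base = s.path if s.path.endswith("/") else s.path.rsplit("/", 1)[0] + "/"
        return url_host == seed_host and u.path.startswith(base)
    raise ValueError(f"unsupported depth option: {depth}")
```

With these assumptions, a URL on a subdomain such as https://datos.bne.es/ would be in scope for the seed https://www.bne.es under the domain option but not under the host option.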
Contact: [email protected]
How to cite the data set:
Title of the data set. [Data set]. Version of DDMMYYYY. Datos.gob.es. Dataset URL
E.g. Archive of the Spanish Web: Autonomous Community of Aragon. [Data set]. January 2019 version. Datos.gob.es. https://datos.gob.es/es/catalogo/ea0019768-archivo-de-la-web-espanola-comunidad-autonoma-de-aragon
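For anyone who cites several of these data sets, the pattern above can be assembled programmatically. The following is only a sketch: the function name and its parameters are hypothetical, and the publisher name is passed in explicitly rather than assumed.

```python
def format_citation(title: str, version: str, publisher: str, url: str) -> str:
    """Join the citation elements in the order given in the pattern above."""
    return f"{title}. [Data set]. {version}. {publisher}. {url}"


citation = format_citation(
    title="Archive of the Spanish Web: Autonomous Community of Aragon",
    version="January 2019 version",
    publisher="Datos.gob.es",
    url="https://datos.gob.es/es/catalogo/ea0019768-archivo-de-la-web-espanola-comunidad-autonoma-de-aragon",
)
print(citation)
```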