CK • Washington. The research arm of Congress sits in the Library of Congress, which trusts the Wayback Machine at archive.org. To avoid disruptions, it points its spiders at web servers only after approaching the contact person responsible for them. Without a legal notice page (Impressum), which is unknown in the USA, that is probably not always easy. Unlike many spiders, the Library of Congress ignores robots.txt restrictions, as it admits itself:
An email notification with further information has been sent separately to a contact at your organization identified by our team. Rather than send to webmaster@ or info@ addresses and risk bounced or filtered messages, we identified contact information for site owners, managers, directors, etc. to ensure successful delivery.
… The Library of Congress has contracted with the Internet Archive to collect content from Web sites at regular intervals as specified in the notification sent to your Web site. … The Internet Archive uses the Heritrix crawler to collect Web sites on behalf of the Library of Congress. For more information on Heritrix see http://crawler.archive.org/index.html
We bypass Robots.txt (http://www.robotstxt.org/wc/robots.html) in order to get a complete representation in our archive. We crawl to the fullest scope to ensure our archives will represent your site accurately.
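To make concrete what bypassing robots.txt means, here is a minimal Python sketch using the standard library's `urllib.robotparser`. It does not show how Heritrix or the Internet Archive work. The robots.txt content, the URLs, the user-agent string `example-crawler`, and the helper `allowed_urls` are all hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical URLs on a placeholder domain.
URLS = [
    "https://example.org/index.html",
    "https://example.org/private/report.html",
]


def allowed_urls(urls, user_agent, robots_txt, honor_robots=True):
    """Return the URLs a crawler would fetch under the given policy."""
    if not honor_robots:
        # A crawler that bypasses robots.txt fetches everything in scope.
        return list(urls)
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # A conventional crawler skips every URL that robots.txt disallows.
    return [url for url in urls if parser.can_fetch(user_agent, url)]


polite = allowed_urls(URLS, "example-crawler", ROBOTS_TXT)
archival = allowed_urls(URLS, "example-crawler", ROBOTS_TXT, honor_robots=False)

print(polite)
print(archival)
```

A crawler that honors robots.txt would skip the `/private/` page in this sketch, while one that bypasses it collects both pages, which is the "complete representation" the notice refers to.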