#archival

dredmorbius@diaspora.glasswings.com

Internet Archive's Wayback Machine APIs

Say, for the sake of argument, that you've^†^ spent a few weeks trying to ensure that a set of URLs were archived at the Internet Archive's Wayback Machine, and you're aware that not all of those URLs were in fact archived, but you^†^ aren't sure just which ones were or were not.

The question might occur to you^†^, "Is there some way of testing whether or not a particular URL has or has not been successfully archived? Preferably in an automated manner?"

And the answer to that question would be YES!!! Yes there is!!!

What you^†^ are looking for are the Internet Archive Wayback Machine APIs, and specifically:

Wayback Availability JSON API

Quoting from the Archive:

This simple API for Wayback is a test to see if a given url is archived and currently accessible in the Wayback Machine. This API is useful for providing a 404 or other error handler which checks Wayback to see if it has an archived copy ready to display.

The API can be used as follows:

http://archive.org/wayback/available?url=example.com

which might return:

{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200"
        }
    }
}

if the url is available. When available, the url is the link to the archived snapshot in the Wayback Machine. At this time, archived_snapshots just returns a single closest snapshot, but additional snapshots may be added in the future.

If the url is not available (not archived or currently not accessible), the response will be:

{"archived_snapshots":{}}

https://archive.org/help/wayback_api.php
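
For the automated version, here is a minimal Python sketch using only the standard library. The function names (`build_query`, `closest_snapshot`, `is_archived`) are my own, not anything the Archive provides, and the parsing assumes the response has exactly the shape quoted above.

```python
import json
import urllib.parse
import urllib.request

# Endpoint as documented at https://archive.org/help/wayback_api.php
API = "https://archive.org/wayback/available"


def build_query(url):
    """Return the availability-API request URL for one target URL."""
    return API + "?" + urllib.parse.urlencode({"url": url})


def closest_snapshot(response_text):
    """Return the snapshot URL from a response body, or None if there is none."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def is_archived(url, timeout=30):
    """Query the live API (needs network access); True if a snapshot exists."""
    with urllib.request.urlopen(build_query(url), timeout=timeout) as resp:
        return closest_snapshot(resp.read().decode("utf-8")) is not None
```

Keeping the parsing separate from the network call means the two sample responses quoted above can be fed straight to `closest_snapshot` without touching the Archive at all.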

It's also possible to query for a specific timestamp, though not AFAICT for saves within a date range.
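
A rough workaround for the date-range gap, sketched in Python: ask for the snapshot closest to a chosen date via the documented `timestamp` parameter, then check client-side whether the returned `timestamp` falls inside the window you care about. The helper names are hypothetical, and this only ever inspects the single closest snapshot, so treat it as a spot check, not a range query.

```python
import urllib.parse
from datetime import datetime

API = "https://archive.org/wayback/available"
STAMP = "%Y%m%d%H%M%S"  # the 14-digit form seen in the "timestamp" field


def build_timestamp_query(url, when):
    """Request URL asking for the snapshot closest to the datetime `when`."""
    params = {"url": url, "timestamp": when.strftime(STAMP)}
    return API + "?" + urllib.parse.urlencode(params)


def in_range(snapshot_timestamp, start, end):
    """True if a returned "timestamp" string falls within [start, end]."""
    taken = datetime.strptime(snapshot_timestamp, STAMP)
    return start <= taken <= end
```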

You^†^ are now running that check on a set of 1300 or so URLs you'd^†^ hoped to have saved in the past two months or so.
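
Roughly what such a batch run might look like in Python. This is a sketch, not the script actually used: `sort_urls` and `fetch_availability` are made-up names, and the one-second pause between requests is my own guess at politeness, not a documented rate limit.

```python
import json
import time
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available"


def fetch_availability(url):
    """Fetch the raw JSON response for one URL (needs network access)."""
    query = API + "?" + urllib.parse.urlencode({"url": url})
    with urllib.request.urlopen(query, timeout=30) as resp:
        return resp.read().decode("utf-8")


def sort_urls(urls, fetch=fetch_availability, delay=1.0):
    """Split urls into (archived, missing, errored) lists."""
    archived, missing, errored = [], [], []
    for url in urls:
        try:
            snapshots = json.loads(fetch(url)).get("archived_snapshots", {})
        except (OSError, ValueError):
            # Network failures and unparseable replies are kept apart from
            # "not archived" so they can be retried later.
            errored.append(url)
        else:
            (archived if snapshots.get("closest") else missing).append(url)
        time.sleep(delay)  # assumed courtesy pause between requests
    return archived, missing, errored
```

Feeding it one URL per line from a text file and writing the `missing` list back out gives a to-do list for resubmission.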


See Also: Data Migration Tips and Questions.


Notes:

†: And by "you" I of course mean "me".


#DataMigration #InternetArchive #APIs #Joindiaspora #JoindiasporaCom #Pluspora #Archival

dredmorbius@diaspora.glasswings.com

Observation on archival sites: Archive.Today vs. Internet Archive

Some of my followers may have noted I've been archiving a number of older posts from my previous account of late....

In doing this, I've noticed a few things about Archive.Today (a/k/a Archive.Is) and the Internet Archive's Wayback Machine.

It turns out that Archive.Today is really convenient to invoke with DDG set as my default search engine: I simply highlight the navigation bar for a page, prepend the "!ais" bang search to the head of the URL (followed by a space), and hit return.

Archive.Today helpfully offers links for other potential archive sites, including the Internet Archive, so I don't have to independently call up that URL.

Archive.Today responds very quickly. There's a practically instant response that the page is or is not archived, and if not, the "save" form also pops up nearly instantly.

By contrast, the Internet Archive takes a few seconds to respond whether or not the page is archived, and a few further seconds when requesting a page be saved.

(Both sites have a two-stage submission. The Internet Archive does have a submission URL which should work in one fell swoop, though it occasionally breaks and error-detection is ... difficult.)
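
A sketch of the one-step route in Python, assuming the submission URL in question is the Save Page Now form `https://web.archive.org/save/<url>` (the function names are mine). It reports only the HTTP status of the request. A 200 here does not prove the capture actually landed, which is the error-detection problem, so a later check against the Wayback availability API is still needed.

```python
import urllib.error
import urllib.request

# Assumption: the Save Page Now prefix is the submission URL meant above.
SAVE_PREFIX = "https://web.archive.org/save/"


def build_save_url(url):
    """Return the one-step submission URL for a page."""
    return SAVE_PREFIX + url


def submit(url, timeout=120):
    """Request a capture (needs network access).

    Returns the HTTP status code, or None on a network-level failure.
    """
    try:
        with urllib.request.urlopen(build_save_url(url), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except OSError:
        return None  # timeout, DNS failure, connection reset, etc.
```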

Archive.Today's processing queue ranges from 0 to 10k or so slots.

The Internet Archive is currently reporting ~10 hours to process archival requests.

AT does include comments on Diaspora* posts. IA does not.

My manual workflow has evolved to:

  • Pull up page, reload in Diaspora* (otherwise cookies may not be current, forcing a log-out / log-in cycle, also annoying).
  • Mark the post "tagged" to indicate it's been archived. I typically also "like" it to set a sharper visual indicator.
  • Prepend '!ais ' to the navigation bar and hit <enter>.
  • Open "Search in Internet Archive" in a new tab, then select that tab to get IA working on finding the post.
  • Switch back to the Archive.Today tab and select save, then confirm. At that point the request is processing.
  • Switch back to the Internet Archive tab, wait for the page to fully load, request archive, wait for that page to load, confirm, and wait for the request to return.
  • Even after this stage, the IA request may still fail. Detecting this is ... difficult.

I may also save content from the original (JoindiasporaCom) address, though mostly I'm working through Glasswings. I have run an automated submission of all my posts from the take-out JSON archive, and will run that another time or so before final shutdown. That will at least preserve post content online, but not the comment threads :-(

Hopefully this information may be useful to others.

#Archival #WebArchival #ArchiveIs #ArchiveToday #InternetArchive #WaybackMachine