#archival
Internet Archive's Wayback Machine APIs
Say, for the sake of argument, that you've^†^ spent a few weeks trying to ensure that a set of URLs were archived at the Internet Archive's Wayback Machine, and you're aware that not all of those URLs were in fact archived, but you^†^ aren't sure just which ones were or were not.
The question might occur to you^†^, "Is there some way of testing whether or not a particular URL has or has not been successfully archived? Preferably in an automated manner?"
And the answer to that question would be YES!!! Yes there is!!!
What you^†^ are looking for are the Internet Archive Wayback Machine APIs, and specifically:
Wayback Availability JSON API
This simple API for Wayback is a test to see if a given url is archived and currently accessible in the Wayback Machine. This API is useful for providing a 404 or other error handler which checks Wayback to see if it has an archived copy ready to display.
Quoting from the Archive:
The API can be used as follows:
http://archive.org/wayback/available?url=example.com
which might return:
{
  "archived_snapshots": {
    "closest": {
      "available": true,
      "url": "http://web.archive.org/web/20130919044612/http://example.com/",
      "timestamp": "20130919044612",
      "status": "200"
    }
  }
}
if the url is available. When available, the url is the link to the archived snapshot in the Wayback Machine. At this time, archived_snapshots just returns a single closest snapshot, but additional snapshots may be added in the future.
If the url is not available (not archived or currently not accessible), the response will be:
{"archived_snapshots":{}}
https://archive.org/help/wayback_api.php
It's also possible to query for a specific timestamp, though not AFAICT for saves within a date range.
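Here is a minimal sketch of a single check in Python, using only the standard library. The function names are my own invention, and the `https://` form of the endpoint is an assumption (the Archive's example uses `http://`). The optional `timestamp` parameter follows the Archive's documented YYYYMMDDhhmmss format.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint quoted from the Archive's help page; the https scheme is an assumption.
API = "https://archive.org/wayback/available"


def availability_url(url, timestamp=None):
    """Build the query URL, optionally asking for the snapshot nearest a timestamp."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp  # YYYYMMDDhhmmss, or a shorter prefix
    return API + "?" + urlencode(params)


def closest_snapshot(response):
    """Return the 'closest' snapshot dict from a parsed response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest
    return None


def check(url, timestamp=None):
    """Query the live API (needs network access) and return the snapshot or None."""
    with urlopen(availability_url(url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `check("example.com")` should return a dict like the "closest" entry quoted above when a snapshot exists, and `None` when the response is `{"archived_snapshots":{}}`.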
You^†^ are now running that check on a set of 1300 or so URLs you'd^†^ hoped to have saved in the past two months or so.
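A rough sketch of what such a batch run might look like, sorting a list of URLs into "archived" and "missing". The `lookup` argument is a hypothetical callable that takes a URL and returns the parsed JSON response; it is passed in so the sorting logic can be tried offline with canned data. The `delay` between requests is my own politeness assumption, not something the Archive's help page specifies.

```python
import time


def partition_by_archived(urls, lookup, delay=0.0):
    """Split urls into (archived, missing) based on availability responses.

    lookup: hypothetical callable, url -> parsed availability JSON (a dict).
    delay:  seconds to sleep between lookups (an assumed courtesy pause).
    """
    archived, missing = [], []
    for url in urls:
        response = lookup(url)
        closest = response.get("archived_snapshots", {}).get("closest")
        if closest and closest.get("available"):
            # Keep the snapshot link alongside the original URL.
            archived.append((url, closest["url"]))
        else:
            missing.append(url)
        if delay:
            time.sleep(delay)
    return archived, missing


# Offline demonstration with canned responses instead of live requests.
canned = {
    "example.com": {"archived_snapshots": {"closest": {
        "available": True,
        "url": "http://web.archive.org/web/20130919044612/http://example.com/",
        "timestamp": "20130919044612",
        "status": "200"}}},
    "example.invalid": {"archived_snapshots": {}},
}
archived, missing = partition_by_archived(canned, canned.get)
```

The `missing` list is the one worth resubmitting; the `archived` pairs can be kept as a record of where each snapshot lives.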
See Also: Data Migration Tips and Questions.
Notes:
†: And by "you" I of course mean "me".
#DataMigration #InternetArchive #APIs #Joindiaspora #JoindiasporaCom #Pluspora #Archival
Observation on archival sites: Archive.Today vs. Internet Archive
Some of my followers may have noted I've been archiving a number of older posts from my previous account of late....
In doing this, I've noticed a few things about Archive.Today (a/k/a Archive.Is) and the Internet Archive's Wayback Machine.
It turns out that Archive.Today is really convenient to invoke with DDG set as my default search engine: I simply highlight the navigation bar for a page, prepend the "!ais" bang search to the head of the URL (followed by a space), and hit return.
Archive.Today helpfully offers links for other potential archive sites, including the Internet Archive, so I don't have to independently call up that URL.
Archive.Today responds very quickly. There's a practically instant response that the page is or is not archived, and if not, the "save" form also pops up nearly instantly.
By contrast, the Internet Archive takes a few seconds to respond whether or not the page is archived, and a few further seconds when requesting a page be saved.
(Both sites have a two-stage submission. The Internet Archive does have a submission URL which should work in one fell swoop, though it occasionally breaks and error-detection is ... difficult.)
Archive.Today's processing queue ranges from 0 to 10k or so slots.
The Internet Archive is currently reporting ~10 hours to process archival requests.
AT does include comments on Diaspora* posts. IA does not.
My manual workflow has evolved to:
- Pull up page, reload in Diaspora* (otherwise cookies may not be current, forcing a log-out / log-in cycle, also annoying).
- Mark the post "tagged" to indicate it's been archived. I typically also "like" it to set a sharper visual indicator.
- Prepend '!ais ' to the navigation bar and hit <enter>.
- Open "Search in Internet Archive" in a new tab, then select that tab to get IA working on finding the post.
- Switch back to the Archive.Today tab and select save, then confirm. At that point the request is processing.
- Switch back to the Internet Archive tab, wait for the page to fully load, request archive, wait for that page to load, confirm, and wait for the request to return.
- Even after this stage, the IA request may still fail. Detecting this is ... difficult.
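One possible way to catch a failed Internet Archive request after the fact is to query the Wayback Availability JSON API later and see whether the snapshot it reports is at least as recent as the submission. The sketch below is only that, a sketch: the function names are hypothetical, it assumes Wayback timestamps are in UTC, and it assumes the lookup returned the most recent snapshot (or was made with a timestamp near the submission time). Given the processing delays noted above, the check would need to run well after the request.

```python
from datetime import datetime, timezone


def snapshot_time(closest):
    """Parse a Wayback YYYYMMDDhhmmss timestamp (assumed to be UTC)."""
    parsed = datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def saved_since(response, submitted_at):
    """True if the response reports an available snapshot taken at or after submitted_at.

    response:     parsed JSON from the availability API (a dict).
    submitted_at: timezone-aware datetime of the save request.
    """
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return False
    return snapshot_time(closest) >= submitted_at
```

A `False` result does not prove the save failed, since the request may simply still be queued, but a page that stays `False` after a day or so is a reasonable candidate for resubmission.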
I may also save content from the original (JoindiasporaCom) address, though mostly I'm working through Glasswings. I have run an automated submission of all my posts from the take-out JSON archive, and will run that another time or so before final shutdown. That will at least preserve post content online, but not the comment threads :-(
Hopefully this information may be useful to others.
#Archival #WebArchival #ArchiveIs #ArchiveToday #InternetArchive #WaybackMachine