#archival

psychmesu@diaspora.glasswings.com

https://hachyderm.io/@molly0xfff/113635397560220096 molly0xfff@hachyderm.io - Incredible essay about the importance and challenges of digital archival by Maxwell Neely-Cohen, as well as the various imperfect strategies to achieve “century-scale” digital archives.

https://lil.law.harvard.edu/century-scale-storage/

"We picked a century scale because most physical objects can survive 100 years in good care. It is attainable, and yet we selected it because the design of mainstream digital storage mediums are nowhere close to even considering this mark."

1/

#archival

dredmorbius@diaspora.glasswings.com

Internet Archive's Wayback Machine APIs

Say, for the sake of argument, that you've^†^ spent a few weeks trying to ensure that a set of URLs were archived at the Internet Archive's Wayback Machine, and you're aware that not all of those URLs were in fact archived, but you^†^ aren't sure just which ones were or were not.

The question might occur to you^†^, "Is there some way of testing whether or not a particular URL has or has not been successfully archived? Preferably in an automated manner?"

And the answer to that question would be YES!!! Yes there is!!!

What you^†^ are looking for are the Internet Archive Wayback Machine APIs, and specifically:

Wayback Availability JSON API

This simple API for Wayback is a test to see if a given url is archived and currently accessible in the Wayback Machine. This API is useful for providing a 404 or other error handler which checks Wayback to see if it has an archived copy ready to display.

Quoting from the Archive:


The API can be used as follows:

http://archive.org/wayback/available?url=example.com

which might return:

{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200"
        }
    }
}

if the url is available. When available, the url is the link to the archived snapshot in the Wayback Machine. At this time, archived_snapshots just returns a single closest snapshot, but additional snapshots may be added in the future.

If the url is not available (not archived or currently not accessible), the response will be:

{"archived_snapshots":{}}

https://archive.org/help/wayback_api.php
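
A minimal sketch of reading that response in Python, assuming the JSON shape quoted above. The function name `closest_snapshot` is made up for illustration and is not part of the API; it returns the snapshot URL, or `None` when `archived_snapshots` comes back empty.

```python
import json


def closest_snapshot(body):
    """Return the closest snapshot URL from an availability response, or None."""
    data = json.loads(body)
    closest = data.get("archived_snapshots", {}).get("closest")
    # An empty "archived_snapshots" object means not archived / not accessible.
    if not closest or not closest.get("available"):
        return None
    return closest.get("url")
```

Feeding it the two example responses quoted above gives the `web.archive.org` link for the first and `None` for the second.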

It's also possible to query for a specific timestamp, though not AFAICT for saves within a date range.
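
A rough sketch of the timestamp query, plus one possible workaround for the date-range gap. It assumes the documented `timestamp` parameter (1 to 14 digits, YYYYMMDDhhmmss) and that the returned snapshot `timestamp` is the full 14-digit form, as in the example response. The helper names `availability_query` and `in_range` are hypothetical. The idea: ask for the snapshot closest to some point in the window, then check locally whether what came back actually falls inside it.

```python
from urllib.parse import urlencode

API = "http://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the availability query URL, optionally pinned to a timestamp."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urlencode(params)


def in_range(snapshot_timestamp, start, end):
    """True if a 14-digit snapshot timestamp lies within [start, end].

    All three arguments are assumed to be 14-digit YYYYMMDDhhmmss strings,
    so plain string comparison matches chronological order.
    """
    return start <= snapshot_timestamp <= end
```

This only tests the single closest snapshot, so a "no" here means the nearest capture is outside the window, not that the API was searched across the whole range.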

You^†^ are now running that check on a set of 1300 or so URLs you'd^†^ hoped to have saved in the past two months or so.
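
One way such a batch check might look, as a sketch only. The names `fetch` and `check_urls` are invented here, and the one-second pause between requests is a guess at politeness rather than a documented limit. The fetcher is passed in as a parameter so the loop can be tried against canned responses before pointing it at the real service.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

API = "http://archive.org/wayback/available"


def fetch(query_url):
    """Fetch a query URL over the network and return the body as text."""
    with urlopen(query_url, timeout=30) as resp:
        return resp.read().decode("utf-8")


def check_urls(urls, fetch=fetch, pause=1.0):
    """Split urls into (archived, missing) using the availability API.

    archived maps each original URL to its closest snapshot URL;
    missing lists the URLs for which no snapshot was reported.
    """
    archived, missing = {}, []
    for url in urls:
        body = fetch(API + "?" + urlencode({"url": url}))
        closest = json.loads(body).get("archived_snapshots", {}).get("closest")
        if closest and closest.get("available"):
            archived[url] = closest["url"]
        else:
            missing.append(url)
        time.sleep(pause)  # assumed delay; adjust to taste
    return archived, missing
```

The `missing` list is then the set of URLs worth resubmitting.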


See Also: Data Migration Tips and Questions.


Notes:

†: And by "you" I of course mean "me".


#DataMigration #InternetArchive #APIs #Joindiaspora #JoindiasporaCom #Pluspora #Archival

dredmorbius@diaspora.glasswings.com

Observation on archival sites: Archive.Today vs. Internet Archive

Some of my followers may have noted I've been archiving a number of older posts from my previous account of late....

In doing this, I've noticed a few things about Archive.Today (a/k/a Archive.Is) and the Internet Archive's Wayback Machine.

It turns out that Archive.Today is really convenient to invoke with DuckDuckGo (DDG) set as my default search engine: I simply select the URL in the navigation bar, prepend the "!ais" bang search (followed by a space) to the head of the URL, and hit return.

Archive.Today helpfully offers links for other potential archive sites, including the Internet Archive, so I don't have to independently call up that URL.

Archive.Today responds very quickly. There's a practically instant response that the page is or is not archived, and if not, the "save" form also pops up nearly instantly.

By contrast, the Internet Archive takes a few seconds to respond whether or not the page is archived, and a few further seconds when requesting a page be saved.

(Both sites have a two-stage submission. The Internet Archive does have a submission URL which should work in one fell swoop, though it occasionally breaks and error-detection is ... difficult.)

Archive.Today's processing queue ranges from 0 to 10k or so slots.

The Internet Archive is currently reporting ~10 hours to process archival requests.

Archive.Today (AT) does include comments on Diaspora* posts. The Internet Archive (IA) does not.

My manual workflow has evolved to:

  • Pull up page, reload in Diaspora* (otherwise cookies may not be current, forcing a log-out / log-in cycle, also annoying).
  • Mark the post "tagged" to indicate it's been archived. I typically also "like" it to set a sharper visual indicator.
  • Prepend '!ais ' to the navigation bar and hit <enter>.
  • Open "Search in Internet Archive" in a new tab, then select that tab to get IA working on finding the post.
  • Switch back to the Archive.Today tab and select save, then confirm. At that point the request is processing.
  • Switch back to the Internet Archive tab, wait for the page to fully load, request archive, wait for that page to load, confirm, and wait for the request to return.
  • Even after this stage, the IA request may still fail. Detecting this is ... difficult.

I may also save content from the original (JoindiasporaCom) address, though mostly I'm working through Glasswings. I have run an automated submission of all my posts from the take-out JSON archive, and will run that another time or so before final shutdown. That will at least preserve post content online, but not the comments threads :-(
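
A sketch of the URL-building half of an automated submission, under two assumptions: that the submission URL takes the form of a `https://web.archive.org/save/` prefix followed by the page address, and that the post URLs have already been pulled out of the take-out JSON (its layout isn't described here, so that step is left out). The name `save_request_urls` and the `pod.example` addresses are placeholders.

```python
SAVE_PREFIX = "https://web.archive.org/save/"  # assumed form of the submission URL


def save_request_urls(post_urls):
    """Yield one 'save' URL per non-empty, de-duplicated post URL."""
    seen = set()
    for url in post_urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            yield SAVE_PREFIX + url


for request_url in save_request_urls([
    "https://pod.example/posts/1",
    "https://pod.example/posts/2",
]):
    print(request_url)
```

Since a save request can fail quietly, a submission run like this is best followed by an availability check on the same list.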

Hopefully this information may be useful to others.

#Archival #WebArchival #ArchiveIs #ArchiveToday #InternetArchive #WaybackMachine