Diaspora* Data Migration and Archival Lessons Learned
(So far)
This is a summary of my discoveries and learning over the past two months or so concerning Diaspora* data archives and references as well as JSON and tools for manipulating it, specifically jq.
It is a condensation of conversation mostly at my earlier Data Migration Tips & Questions (2022-1-10) thread, though also scattered elsewhere. I strongly recommend you review that thread and address general questions there.
Discussion here should focus on the specific information provided, any additions or corrections, and questions on how to access/use specific tools. E.g., how to get #jq running on Microsoft Windows, which I don't have specific experience with.
## Archival Philosophy
I'm neither a maximalist nor minimalist when it comes to content archival. What I believe is that people should be offered the tools and choices they need to achieve their desired goal. Where preservation is preferred and causes minimal harm, it's often desirable. Not everything needs to be preserved, but neither is it necessary to burn down every library one encounters as one journeys through life.
In particular, I'm seeking to preserve access for myself and others to previous conversations and discussions, and to content that's been shared and linked elsewhere. Several of my own posts have been submissions to Hacker News and other sites, for example, and archival at, say, the Internet Archive or Archive Today will preserve at least some access.
This viewpoint seems not to be shared by key members of the Diaspora* dev team and some pod administrators. As such, I'll note that their own actions and views reduce choice and agency amongst members of the Diaspora* community. The attitude is particularly incongruous given Diaspora*'s innate reliance on federation and content propagation according to the original specified intent of the content's authors and creators. This is hardly the first time Diaspora* devs have put their own concerns far above those of members of the Diaspora* community.
Information here is provided for those who seek to preserve content from their own profiles on Diaspora* servers likely to go offline, in the interest of maximising options and achieving desired goals. If this isn't your concern or goal, you may safely ignore what follows.
## Prerequisites
The discussion here largely addresses working with a downloaded copy of Diaspora* profile data in JSON format.
It presumes you have jq installed on your system, and have a Bash or equivalent command-line / scripting environment. Most modern computers can offer jq though you will have to install it: natively on Linux, any of the BSDs, MacOS (via Homebrew), Windows (via Cygwin or WSL), and Android (via Termux). iOS is the only mass-market exception, and even there you might get lucky using iSH.
Create your archive by visiting your Pod's /user/edit page and requesting EXPORT DATA at the bottom of that page.
If you have issues doing so, please contact your Pod admin or other support contact(s). Known problems for some Joindiaspora members in creating archives are being worked on.
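Before going further, it can help to confirm that jq is available and that the downloaded archive parses as JSON. The following is a minimal sketch, assuming a gzip-compressed extract. The function name check_extract and the filename are placeholders; .user.posts is the same path used by the archival one-liner later in this post.

```shell
# Hypothetical helper: sanity-check a gzip-compressed Diaspora* extract.
check_extract() {
  archive="$1"
  # jq must be installed and on the PATH.
  command -v jq >/dev/null 2>&1 || { echo "jq not found" >&2; return 1; }
  # Decompress and count the posts; jq prints nothing if the JSON is invalid.
  count=$(gzip -dc "$archive" | jq '.user.posts | length')
  [ -n "$count" ] || { echo "could not parse ${archive}" >&2; return 1; }
  echo "${archive}: ${count} posts"
}

# Usage: check_extract DIASPORA_EXTRACT.json.gz
```

If the count looks far too low, the export may be incomplete and is worth requesting again before the pod goes offline.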
## Diaspora* post URLs can be reconstructed from the post GUID
The Diaspora* data extract does not include a canonical URL, but you can create one easily:
Post URL = <protocol><host_name>/posts/<GUID>
So for the GUID 64cc4c1076e5013a7342005056264835, we can tack on:
- protocol: https://
- host_name: pluspora.com (substitute your intended Pod's hostname here)
- the string literal /posts/
to arrive at:
https://pluspora.com/posts/64cc4c1076e5013a7342005056264835
... which is the URL for a post by @Rhysy (rhysy@pluspora.com) in which I'd initially written the comment this post is based on, at that post's Pluspora Pod origin.
Given that Pluspora is slated to go offline a few weeks from now, Future Readers may wish to refer to an archived copy here:
https://archive.ph/Y8mar
Once you have the URL, you can start doing interesting things with it.
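The construction above can be sketched as a small shell function. This is only an illustration: post_url is a made-up name, and https is assumed as the protocol.

```shell
# Hypothetical helper: build a post URL from a pod hostname and a post GUID.
post_url() {
  host="$1"
  guid="$2"
  printf 'https://%s/posts/%s\n' "$host" "$guid"
}

# Prints https://pluspora.com/posts/64cc4c1076e5013a7342005056264835
post_url pluspora.com 64cc4c1076e5013a7342005056264835
```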
## Links based on other Pod URLs can be created
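Using our previous example, links for the post on, e.g., diasp.org, diaspora.glasswings.com, diasp.eu, etc., can be generated by substituting for host_name:
- https://diasp.org/posts/64cc4c1076e5013a7342005056264835
- https://diaspora.glasswings.com/posts/64cc4c1076e5013a7342005056264835
- https://diasp.eu/posts/64cc4c1076e5013a7342005056264835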
Simply having a URL on a pod does not ensure that the content will be propagated. A member of that pod must subscribe to the post first. In many cases this occurs through followers, though occasionally it does not.
You can trigger federation by specifically mentioning a user at that instance and having them request the page.
I'm not sure when specifically federation occurs: when the notification is generated, when the notification is viewed, or when the post itself is viewed. I've experienced such unfederated posts (404s) often as I've updated, federated, and archived my own earlier content from Joindiaspora to Glasswings. If federation occurs at some time after initial publication and comments, the post URL and content should resolve, but comments made prior to that federation will not propagate.
(Pinging a profile you control on another pod is of course an excellent way to federate posts to that pod.)
Once a post is federated to a set of hosts it will be reachable at those hosts. If it has not yet been federated, you'll receive a "404" page, usually stating "These are not the kittens you're looking for. Move along." on Diaspora* instances.
(I'm not aware of other ways to trigger federation; if anyone knows of methods, please advise in comments.)
Note that comments shown on a post will vary by Pod, when and how it was federated, and any blocks or networking issues between other Pods from which comments have been made. Not all instances necessarily show the same content; inconsistencies do occur.
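The 404 behaviour described above suggests a rough way to see which pods already carry a post. The following is a minimal sketch, assuming curl is installed and that a pod answers with HTTP 200 for a federated public post. The names check_pods, classify_status and the pods list are illustrative and not part of any Diaspora* tooling. A private post may also come back as something other than 200, so treat the result as a hint only.

```shell
# Example hostnames taken from the text above; substitute your own.
pods="diasp.org diaspora.glasswings.com diasp.eu"

# Hypothetical helper: turn an HTTP status code into a rough verdict.
classify_status() {
  case "$1" in
    200) echo "federated" ;;
    404) echo "not federated (or not public)" ;;
    *)   echo "unclear (HTTP $1)" ;;
  esac
}

# Hypothetical helper: request the post on each pod and report the verdict.
check_pods() {
  guid="$1"
  for pod in $pods; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "https://${pod}/posts/${guid}")
    printf '%s\t%s\n' "$pod" "$(classify_status "$code")"
    sleep 2   # be gentle with pod servers
  done
}

# Usage: check_pods 64cc4c1076e5013a7342005056264835
```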
## Links to archival tools can be created by prepending their URLs to the appropriate link
Those will either show existing archives if they exist or provide links to submit the post if they do not.
Note that the Internet Archive does not include comments, though Archive.Today does, see: https://archive.is/almMw vs. https://web.archive.org/web/20220224213824/https://pluspora.com/posts/64cc4c1076e5013a7342005056264835
To include later comments, additional archival requests will have to be submitted.
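As a sketch of the prepending step, the function below prints lookup links for one post URL. The two prefixes are the ones that appear in the index output further down this post; the name archive_links is made up for illustration.

```shell
# Hypothetical helper: print archive lookup links for a post URL.
archive_links() {
  url="$1"
  printf 'Wayback Machine: https://web.archive.org/*/%s\n' "$url"
  printf 'Archive.Today:   https://archive.is/%s\n' "$url"
}

# Usage: archive_links https://pluspora.com/posts/64cc4c1076e5013a7342005056264835
```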
## My Archive-Index script does all of the above
See My current jq project: create a Diaspora post-abstracter.
https://diaspora.glasswings.com/posts/ed03bc1063a0013a2ccc448a5b29e257
That still has a few rough edges, but works to create an archive index which can be edited down to size. There's a fair bit of "scaffolding" in the direct output.
Note that the OLD and NEW hosts in the script specify Joindiaspora and Glasswings specifically. You'll want to adapt these to YOUR OWN old and new Pod hostnames.
The script produces output which (after editing out superfluous elements) looks like this in raw form:
## 2012
### May
**Hey everyone, I'm #NewHere. I'm interested in #debian and #linux, among other things. Thanks for the invite, Atanas Entchev!**
> Yet another G+ refuge. ...
<https://diaspora.glasswings.com/posts/cc046b1e71fb043d>
[Original](https://joindiaspora.com/posts/cc046b1e71fb043d) :: [Wayback Machine](https://web.archive.org/*/https://joindiaspora.com/posts/cc046b1e71fb043d) :: [Archive.Today](https://archive.is/https://joindiaspora.com/posts/cc046b1e71fb043d)
(2012-05-17 20:33)
----
**Does anyone have the #opscodechef wiki book as an ePub? Only available formats are online/web, or PDF (which sucks). I'm becoming a rapid fan of the #epub format having found a good reader for Android and others for Debian/Ubuntu.**
> Related: strategies for syncing libraries across Android and desktop/laptop devices. ...
<https://diaspora.glasswings.com/posts/e76c078ba0544ad9>
[Original](https://joindiaspora.com/posts/e76c078ba0544ad9) :: [Wayback Machine](https://web.archive.org/*/https://joindiaspora.com/posts/e76c078ba0544ad9) :: [Archive.Today](https://archive.is/https://joindiaspora.com/posts/e76c078ba0544ad9)
(2012-05-17 21:29)
----
Which renders as:
2012
May
Hey everyone, I'm #NewHere. I'm interested in #debian and #linux, among other things. Thanks for the invite, Atanas Entchev!
Yet another G+ refuge. ...
https://diaspora.glasswings.com/posts/cc046b1e71fb043d
Original :: Wayback Machine :: Archive.Today
(2012-05-17 20:33)
Does anyone have the #opscodechef wiki book as an ePub? Only available formats are online/web, or PDF (which sucks). I'm becoming a rapid fan of the #epub format having found a good reader for Android and others for Debian/Ubuntu.
Related: strategies for syncing libraries across Android and desktop/laptop devices. ...
https://diaspora.glasswings.com/posts/e76c078ba0544ad9
Original :: Wayback Machine :: Archive.Today
(2012-05-17 21:29)
I've been posting those in fragments by year as private posts to myself to facilitate both federation and archival of the content. They go up in chunks because Diaspora* has a 2^16 (65,536) byte per-post size limit. It's a slow slog, but I've only one more year (2021) to manually process at this point, with post counts numbering up to 535 per year.
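Splitting an index file into pieces that fit under the size limit can be automated. The following is a rough sketch, not what I actually used: split_by_bytes is a hypothetical helper that breaks a file on line boundaries so that each chunk stays at or below a byte budget. A single line longer than the budget still ends up in a chunk of its own, so check the results.

```shell
# Hypothetical helper: split a file into numbered chunks of at most MAX bytes,
# breaking only between lines.
# Arguments: input file, output prefix, max bytes per chunk (default 60000).
split_by_bytes() {
  LC_ALL=C awk -v max="${3:-60000}" -v prefix="$2" '
    BEGIN { n = 1; size = 0 }
    {
      len = length($0) + 1                 # +1 for the newline
      if (size > 0 && size + len > max) {  # chunk is full: start the next one
        close(out); n++; size = 0
      }
      out = sprintf("%s%03d.md", prefix, n)
      print > out
      size += len
    }' "$1"
}

# Usage: split_by_bytes index-2021.md index-2021-part- 60000
```

The default of 60000 leaves some headroom below 65,536 bytes for anything added by hand before posting.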
## The Internet Archive Wayback Machine (at Archive.org) accepts scripted archival requests
If you submit a URL in the form of https://web.archive.org/save/<URL>, the Wayback Machine will attempt to archive that URL.
This can be scripted for an unattended backup request if you can generate the set of URLs you want to save.
Using our previous example, the URL would be:
https://web.archive.org/save/https://pluspora.com/posts/64cc4c1076e5013a7342005056264835
Clicking that link will generate an archive request.
(IA limit how frequently such a request will be processed.)
Joindiaspora podmins discourage this practice. Among the more reasonable concerns raised is system load.
I suggest that if you do automate archival requests, as I have done, you set a rate-limit or sleep timer on your script. A request every few seconds should be viable. As a Bash "one-liner" reading from the file DIASPORA_EXTRACT.json.gz (change to match your own archive file), which logs progress to the timestamped file run-log with a YYYYMMDD-hms format, e.g., run-log.20220224-222158:
# Build one URL per post from the extract, submit each via archive-url, and log the output.
time zcat DIASPORA_EXTRACT.json.gz |
jq -r '.user.posts[] | "https://joindiaspora.com/posts/\(.entity_data.guid)"' |
xargs -P4 -n1 -t -r ~/bin/archive-url |
tee "run-log.$(date +%Y%m%d-%H%M%S)"
archive-url is a Bash shell script:
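#!/bin/bash
# archive-url: submit a single URL to the Wayback Machine's save endpoint.
url="${1:?usage: archive-url URL}"
echo "Archiving ${url} ... "
# Fetch the save page as text and keep only the "Saving page now" status line.
lynx -dump -nolist -width=1024 "https://web.archive.org/save/${url}" |
sed -ne '/[Ss]aving page now/,/^$/{/./s/^[ ]*//p;}' |
grep 'Saving page now'
# Pause after each request to limit load.
sleep 4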
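Note that each run of the script waits 4 seconds after its request (sleep 4), which limits a single worker to a maximum of 900 requests per hour. Because the one-liner passes -P4 to xargs, up to four workers run in parallel, so the overall ceiling is closer to 3,600 requests per hour; use -P1 if you want to stay at 900. There is NO error detection and you should confirm that posts you think you archived actually are archived. (We can discuss methods for this in comments; I'm still working on how to achieve this.)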
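One possible way to check results, offered as an untested-in-anger sketch and not a finished method: the Internet Archive has an availability endpoint at https://archive.org/wayback/available?url=<URL> which answers with JSON describing the closest snapshot, if any. The helper names has_snapshot and check_archived are made up, and the sketch assumes curl and jq are installed.

```shell
# Hypothetical helper: succeed if an availability response (read from stdin)
# reports an available snapshot.
has_snapshot() {
  jq -e '.archived_snapshots.closest.available == true' >/dev/null
}

# Hypothetical helper: query the availability endpoint for one URL.
check_archived() {
  url="$1"
  if curl -s "https://archive.org/wayback/available?url=${url}" | has_snapshot; then
    echo "archived: ${url}"
  else
    echo "MISSING:  ${url}"
  fi
  sleep 4   # same pause as the archive-url script
}

# Usage: check_archived https://pluspora.com/posts/64cc4c1076e5013a7342005056264835
```

A snapshot existing does not guarantee it captured the post itself and not an error page, so spot-checking a few by hand is still worthwhile.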
The script could be improved to only process public posts, something I need to look into. Submitting private posts won't result in their archival, but it's additional time and load.
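A possible way to do that filtering, as a sketch only: it assumes each post's entity_data carries a boolean public field, which I have not verified across all post types. The name public_post_urls is made up, and the hostname is the same one used in the one-liner above.

```shell
# Hypothetical helper: print joindiaspora.com URLs for public posts only.
# Reads the decompressed extract on stdin.
public_post_urls() {
  jq -r '.user.posts[]
         | select(.entity_data.public == true)
         | "https://joindiaspora.com/posts/\(.entity_data.guid)"'
}

# Usage: zcat DIASPORA_EXTRACT.json.gz | public_post_urls
```

Its output could be piped to xargs in place of the jq step in the one-liner.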
There is no automated submission mechanism for Archive.Today of which I'm aware.
Appending .json to the end of a Diaspora* URL provides the raw JSON data for that post:
https://joindiaspora.com/posts/64cc4c1076e5013a7342005056264835.json
That can be further manipulated with tools, e.g., to extract original post or comment Markdown text, or other information. Using jq is useful for this as described in other posts under the #jq hashtag generally.
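As a small illustration of that kind of extraction, the sketch below pulls the post body out of the JSON. It assumes the post's Markdown lives in a top-level text field, which matches what I've seen but may not hold everywhere; post_text is a made-up name.

```shell
# Hypothetical helper: print the Markdown text of a post from its .json form.
# Reads the post's JSON on stdin; the .text field name is an assumption.
post_text() {
  jq -r '.text // empty'
}

# Usage: curl -s "https://joindiaspora.com/posts/<GUID>.json" | post_text
```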
## As always: This is my best understanding
There are likely errors and omissions. Much of the behaviour and structure described is inferred. Corrections and additions are welcomed.
#DiasporaMigration #Migration #Diaspora #Help #Tips #JoindiasporaCom #jq #json #DataArchves #Archives