Degeneracy at LiveJournal
Dec. 19th, 2022 05:11 pm
Of course, LiveJournal is nowadays Russian-owned, there's a war in Ukraine, Western access to Russian Internet may be cut off at any time, etc. Seemingly unrelated, LiveJournal is once again not on speaking terms with DreamWidth, and it looks to be permanent this time. Cross-posting has not worked for all of 2022. I just did the three cross-posts manually, except the "Interactive Brokers" thing was already copied, so my ole LJ is now up to date!
Why did I do this? Because I am still using the Charm Python app (November 2005 update) to back up my Journal. Can anyfur recommend a more modern solution that works directly with DreamWidth? Or I could get my lazy ass in gear and look around for myself. But instead I cross-posted the newer entries and then continued with my existing backup solution. If you're "stuck" on LiveJournal, you are now permitted to see my posts on Randall Munroe's art, erroneous mathematical beliefs, and the degeneracy of Ontario's province-wide police alerts being sent to my email inbox. It's just wrong!
no subject
Date: 2022-12-20 04:21 am (UTC)
1. Obtain a copy of your Dreamwidth login cookies in Netscape format, most simply by using a cookies.txt browser extension.
2. Install grab-site. I found the nixpkgs installation method to be the easiest to get working.
3. Make a .txt file describing, in Python regex format, the webpages that you do *not* want to download. Here are mine:
(I once ran into a Flash-related infinite rabbit hole trying to scrape a Livejournal. Might well not be relevant anymore, but I haven't tried removing it.)
4. Run grab-site. Recommended configuration:
grab-site https://pyesetz.dreamwidth.org/ --import-ignores=/home/{{Linux-username}}/{{ignore-file}}.txt --no-offsite-links --concurrency=1 --delay=1000 --warc-max-size=100737418240 --wpull-args=--load-cookies=/home/{{Linux-username}}/Downloads/cookies.txt
(Since Dreamwidth is fairly small-time, it seems polite to rate-limit oneself to one page per second, hence the concurrency and delay flags. The warc-max-size is very big to override the sometimes-too-small default maximum: if it hits the maximum it'll split the file into multiple pieces, and then it's a needless annoyance to merge them back together again. If you also want to download every page you linked to, remove the no-offsite-links flag. (Embedded images should work fine either way.))
Optional: run gs-server and open your browser to localhost:29000 to watch it work.
5. If all went well, you should now have a folder containing, among other things, a copy of your blog in WARC format (the file format the Wayback Machine uses to store its data). Some options:
5a. Use the ReplayWeb.Page AppImage to read the WARC file directly.
5b. Use warc2zim to convert to a ZIM file, and read with Kiwix. (For reasons I have not managed to figure out, warc2zim output files think that Dreamwidth homepages are infinite loops and won't load them properly, but the rest seems to work. Tell the url flag to use https://pyesetz.dreamwidth.org/archive instead of https://pyesetz.dreamwidth.org/ as the ZIM file's start page.)
5c. Use warcat to unpack into a collection of HTML pages, for ingestion into Recoll and other software that can't handle fancypants WARC or ZIM formats.
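Before starting a long crawl, the files from steps 1 and 3 can be sanity-checked with a few lines of Python. This is only a rough sketch, not part of grab-site: the function names are made up for illustration, and it assumes each line of the ignore file is one Python regex that is matched anywhere in the URL (search semantics), which is worth verifying against grab-site itself. It does not supply any ignore patterns of its own; it only checks whatever you have written.

```python
import re
from http.cookiejar import MozillaCookieJar


def dreamwidth_cookie_names(cookies_path):
    """Return the names of the dreamwidth.org cookies in a Netscape-format file."""
    jar = MozillaCookieJar()
    # load() raises LoadError if the file is not in Netscape cookies.txt format.
    # Session and expired cookies are kept so the whole file is inspected.
    jar.load(cookies_path, ignore_discard=True, ignore_expires=True)
    names = []
    for cookie in jar:
        host = cookie.domain.lstrip(".")
        if host == "dreamwidth.org" or host.endswith(".dreamwidth.org"):
            names.append(cookie.name)
    return sorted(names)


def load_ignore_patterns(ignore_path):
    """Compile every non-blank line of the ignore file; a bad regex raises re.error."""
    with open(ignore_path, encoding="utf-8") as handle:
        return [re.compile(line.strip()) for line in handle if line.strip()]


def is_ignored(url, patterns):
    """True if any pattern matches somewhere in the URL (assumed search semantics)."""
    return any(pattern.search(url) for pattern in patterns)
```

If `dreamwidth_cookie_names` returns an empty list, the export probably happened while logged out. Running `is_ignored` over a handful of URLs you do and do not want is a cheap way to catch an over-eager pattern before it costs you a crawl.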
---
I don't know which features of your current backup solution you value and as such cannot promise this method shares 100% of them, but I *can* tell you that this is what I do for my own blog.
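For anyone who would rather script step 4 than retype it, here is a minimal Python sketch that assembles the same command as an argument list. It uses only the flags quoted in the comment above; the function name and its parameters are hypothetical, and the `{{...}}` placeholders are left exactly as written, so they must be replaced with real paths before running anything.

```python
import shlex


def grab_site_command(start_url, ignore_file, cookies_file,
                      delay_ms=1000, warc_max_size=100737418240):
    """Build the grab-site argument list from the flags quoted in step 4."""
    return [
        "grab-site", start_url,
        f"--import-ignores={ignore_file}",
        "--no-offsite-links",                  # drop this entry to also fetch linked pages
        "--concurrency=1",                     # one connection at a time
        f"--delay={delay_ms}",                 # 1000 gives roughly one page per second
        f"--warc-max-size={warc_max_size}",    # large, to avoid splitting the WARC
        f"--wpull-args=--load-cookies={cookies_file}",
    ]


command = grab_site_command(
    "https://pyesetz.dreamwidth.org/",
    "/home/{{Linux-username}}/{{ignore-file}}.txt",
    "/home/{{Linux-username}}/Downloads/cookies.txt",
)
# Print a shell-quoted version for review before running it.
print(shlex.join(command))
```

Once the paths are real, the list can be handed to `subprocess.run(command, check=True)`. Passing a list rather than one long string sidesteps shell-quoting problems with odd characters in file names.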
no subject
Date: 2022-12-26 07:51 am (UTC)
I can't keep up with the balkanization of social media. Twitter always sucked at maintaining friends. Faceplant is all about the algorithm, where they won't even show me posts about relatives' funerals. Not doing the Mastodon or wherever the next destination is.
no subject
Date: 2023-01-04 05:02 pm (UTC)