Overhaul the documentation to remove all references to the ArchiveTeam instance

This makes the docs more relevant to anyone wanting to run their own instance of ArchiveBot. It also removes an ugly hack on the dashboard that was specific to the ArchiveTeam instance.
ArchiveBot has a central "control node" server. This document explains how to manage it, hopefully without breaking anything.
This control node server does many things. It runs the actual bot that sits in an IRC channel and listens to commands about which websites to archive. It runs the Redis server that keeps track of all the pipelines and their data. It runs the web-based ArchiveBot dashboard and pipeline dashboard. It runs the Twitter bot that sends information about what's being archived. It has access to log files and debug information.
It also handles many manual administrative tasks that need doing from time to time, such as cleaning out (or "reaping") information about old pipelines that have gone offline, or old web crawl jobs that were aborted or died or disappeared.
Basic Information
=================
The control node server is usually administered over SSH. Pipelines also connect over SSH, possibly with a separate account (e.g. ``pipeline``).
How to add new ArchiveBot pipelines
===================================
Pipelines run on their own servers. Each pipeline can handle several web crawls at a time, depending on the server's configuration and its available hard drive space and memory. More information and installation instructions are at GitHub:
When a new pipeline is set up and ready to go, the last step is to manually add the server's SSH key to the control node. The new pipeline's operator should e-mail or private message someone with access to the control node server, who then needs to open ``~/.ssh/authorized_keys`` for the relevant account in the text editor of their choice and add the new pipeline server's SSH key to the bottom of the list. If the new pipeline is set up correctly, it should show up on the web-based pipeline dashboard shortly afterwards and start being assigned web crawl jobs from the queue.
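
As a concrete sketch, adding a key amounts to appending one line to that file. The key and hostname below are made-up placeholders, and the commands assume the account's default OpenSSH layout:

```shell
# Make sure the .ssh directory exists with strict permissions;
# sshd may ignore the file otherwise.
mkdir -p ~/.ssh
chmod 700 ~/.ssh

# Append the new pipeline server's public key (placeholder value --
# paste the key the operator actually sent).
echo 'ssh-ed25519 AAAAexamplekeyAAAA pipeline@new-server' >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```

After saving, the operator can verify connectivity with a plain ``ssh`` login before starting the pipeline.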
All about tmux
==============
The control node server has many different processes running constantly. To help keep these processes running even when people log in or out, and to keep things somewhat well-organized, the server is set up with a program called ``tmux`` to run multiple "windows" and "panes" of information.
When you log into the control node server, you should type ``tmux attach`` to view all the panes and easily move between them.
Here are some common tmux commands that can be helpful:

* Control-B N – move to the next window
* Control-B C – create a new window
* Control-B W – select a window (shows all windows and their panes)
* Control-B [0-9] – go to a specific window number (numbered 0 through 9)
* Control-B arrow – move between panes within a window
* Control-B S – select an entirely different tmux session (although there should usually be just one)
Each pane has a process running in it, and related processes' panes are usually grouped in one window.
CouchDB and Redis
+++++++++++++++++
CouchDB and Redis might be running inside tmux or as a system service, depending on how the server was set up. Either way, they can generally be ignored and left alone.

Dashboard
+++++++++

This window runs the dashboard components: the Ruby server (static files, job and pipeline list, etc.), the Python WebSocket server (real-time log delivery), and the Ruby server killer (``killer.py``).
The Ruby server pane logs warnings and errors occurring in the Ruby code but is generally relatively quiet. The Python WebSocket server logs stats (number of connected users, queue size, CPU and memory usage) every minute. The Ruby server has an unknown bug, apparently a small memory leak, which eventually renders it unresponsive. ivan's dashboard killer regularly polls it and prints a dot on each success (the dashboard was alive and responded). If the dashboard does not respond, probably because of that leak, the killer kills the server. The Ruby server is run in a ``while :; do ...; done`` loop so that it restarts immediately when this happens.
IRC bot
+++++++
This pane runs the actual ArchiveBot, which is an IRC bot that listens for commands about what websites to archive.
Usually, there's not much that an administrator will need to do for this. If the bot loses its IRC connection, it will try to reconnect on its own. This should usually work fine, but during a netsplit (a disconnect between IRC server nodes), it might reconnect to an undesired server, in which case the bot might need to be "kicked" (restarted and reconnected to the IRC server).
If you need to kick it, hit ``^C`` in this pane to kill the non-responding bot. Then rerun the bot (press the up arrow key to recall the last command), adjusting the command first if needed.
plumbing
++++++++
Plumbing is responsible for much of the data flow of log lines within the control node.
The ``plumbing/updates-listener`` listens for job updates coming into Redis from the pipelines. This produces job IDs, which are sent to ``plumbing/log-firehose``, which pulls new log lines from Redis (using the job IDs read from stdin) and pushes them to a ZeroMQ socket. This ZeroMQ socket is used by the dashboard and the two further plumbing tools below.
The ``plumbing/analyzer`` looks at new log lines and classifies them as HTTP 1xx, 2xx, etc., or as network errors.
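
The bucketing itself is simple. Here is a minimal illustrative sketch in Python (not the actual analyzer code; the function name and signature are invented for illustration):

```python
def classify(status_code):
    """Bucket a log line the way the analyzer categorizes results.

    status_code: HTTP status as an int, or None when the request
    failed at the network level (DNS failure, timeout, ...).
    """
    if status_code is None:
        return 'network error'
    # 200 -> '2xx', 404 -> '4xx', and so on.
    return '%dxx' % (status_code // 100)
```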
The ``plumbing/trimmer`` is an artefact of the current log flow design. It removes old log lines, i.e. ones that have been processed by the firehose sender and the analyzer, from Redis to prevent out-of-memory errors.
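
Conceptually, a line can only be dropped once every consumer is past it. A plain-Python sketch of that idea (the real trimmer operates on Redis structures; these names are invented for illustration):

```python
def trim(log_lines, cursors):
    """Drop log lines that every consumer has already processed.

    log_lines: log lines for one job, oldest first.
    cursors: consumer name -> number of lines it has processed,
             e.g. {'firehose': 2, 'analyzer': 1}.
    Returns (remaining_lines, number_removed).
    """
    # The slowest consumer determines how much can be deleted.
    done = min(cursors.values()) if cursors else 0
    return log_lines[done:], done
```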
cogs
++++
cogs is responsible for keeping the user agents and browser aliases in CouchDB updated and for tweeting about things getting archived. It also prints very verbose warnings about jobs that haven't sent updates (a heartbeat) to the control node for a long time, recommending that they be 'reaped'. These warnings may or may not be accurate. For reaping jobs (or pipelines), see below.
Job reaping
+++++++++++
Jobs need to be reaped manually when they no longer exist but the pipeline did not inform the control node about this. Examples include pipeline crashes (say, a freeze or a power outage). Note that individual job crashes (e.g. due to wpull bugs) do not need to be handled on the control node; as long as the pipeline process still runs, it will treat the job as finishing once the wpull process has been killed by the pipeline operator.
If you need to reap a dead ArchiveBot job -- in this case, one with the hypothetical job id 'abcdefghiabcdefghi' -- here's what to do:
If there is no Ruby console for reaping yet:
```bash
cd ArchiveBot/bot
bundle exec ruby console.rb
```
Retrieve the job:
```ruby
j = Job.from_ident('abcdefghiabcdefghi', $redis)
```
At this point, you should get a response message starting with ``<struct Job...>``. That means the job id does exist somewhere in Redis, which is good. Then you should run:
```ruby
j.fail
```
This will kill that one job, but note that the magic Redis word in the command here is 'fail', not 'kill'. This deletes the job state from Redis (after a few seconds).
It is possible to reap multiple jobs at once, by matching their job ids with regexes and such. Such exercises are best left to experts.

You can also clean out “nil” jobs with redis-cli in the admin console; that command sends a delete command for each such job id to the Redis server.
The Twitter bot is publicly viewable at https://twitter.com/ArchiveBot/ .
Pipeline reaping
++++++++++++++++
Pipeline data is stored inside Redis. You can get a list of all the pipelines Redis knows about from the dashboard or with this command:
```bash
redis-cli keys 'pipeline:*'
```
That will list all currently assigned pipeline keys -- but some of those pipelines may be dead.
To peek at the data within any given pipeline -- in this case, a pipeline that was assigned the id 4f618cfcd81f44583a93b8bdb50470a1 -- use the command:
```bash
redis-cli type pipeline:4f618cfcd81f44583a93b8bdb50470a1
```
To find out which pipelines are dead, check the web-based pipeline monitor and copy the unique key for a dead pipeline.
That removes the dead pipeline from the set of active pipelines. Then do:
```bash
redis-cli del pipeline:4f618cfcd81f44583a93b8bdb50470a1
```
***NOTE: be very careful with this; make sure you do not have the word "pipelines" in this command!***

Re-sync the IRC !status command to actual Redis data
====================================================

The ArchiveBot ``!status`` command that is available in the #archivebot IRC channel on EFnet is supposed to be an accurate counter of how many jobs are currently running, aborted, completed, or pending. But sometimes it gets un-synchronized from the actual Redis values, especially if a pipeline dies. Here's how to automatically sync the information again, from Redis to IRC: