Commit cddce92

Merge pull request #548 from JustAnotherArchivist/generalise-docs
Overhaul the documentation to remove all references of the ArchiveTeam instance
2 parents f2743e2 + ca550b9 commit cddce92

4 files changed

Lines changed: 58 additions & 122 deletions

INSTALL.pipeline

Lines changed: 2 additions & 4 deletions
@@ -67,7 +67,7 @@ actually has.
6767
As user archivebot, in the FIRST tmux session:
6868

6969
autossh -C -L 127.0.0.1:16379:127.0.0.1:6379 \
70-
YOUR-USERNAME-GOES-HERE@archivebot.at.ninjawedding.org -N
70+
YOUR-USERNAME-GOES-HERE@CONTROL-NODE-GOES-HERE -N
7171

7272

7373
As user archivebot, in the SECOND tmux session:
@@ -129,9 +129,7 @@ If you start multiple pipelines, you can safely point them to the
129129
same FINISHED_WARCS_DIR and run just one uploader.
130130

131131
Check out the ArchiveBot dashboard to make sure everything is
132-
working like it ought to:
133-
134-
http://dashboard.at.ninjawedding.org/
132+
working like it ought to.
135133

136134

137135
** STEP 5: MISCELLANEOUS **

dashboard/dashboard.html

Lines changed: 0 additions & 13 deletions
@@ -10,19 +10,6 @@
1010
<link rel="alternate" type="application/atom+xml" title="Atom Feed" href="/feed/archivebot.atom">
1111
<link rel="icon" type="image/png" href="/assets/favicon.png">
1212
<title>ArchiveBot dashboard</title>
13-
<script>
14-
(function() {
15-
// Framebust out if necessary
16-
if(self != top) {
17-
var target = self.location.href;
18-
// Ugly archivebot.com-specific hack
19-
target = target.replace(
20-
"http://arshboard.at.ninjawedding.org:4567/",
21-
"http://dashboard.at.ninjawedding.org/");
22-
top.location = target;
23-
}
24-
})();
25-
</script>
2613
</head>
2714
<body>
2815
<style>

doc/admin.rst

Lines changed: 55 additions & 101 deletions
@@ -2,9 +2,9 @@
22
ArchiveBot Administration
33
=========================
44

5-
ArchiveBot has a central "control node" server, currently run by Archive Team member David Yip (yipdw) at ``archivebot.at.ninjawedding.org``. This document explains how to manage it, hopefully without breaking anything.
5+
ArchiveBot has a central "control node" server. This document explains how to manage it, hopefully without breaking anything.
66

7-
This control node server does many things. It runs the actual bot that sits in the EFnet IRC channel #archivebot and listens to Archive Team members' commands about which websites to archive. It runs the Redis server that keeps track of all the pipelines and their data. It runs the web-based ArchiveBot dashboard and pipeline dashboard. It runs the Twitter bot that sends information about what's being archived. It has access to log files and debug information.
7+
This control node server does many things. It runs the actual bot that sits in an IRC channel and listens to commands about which websites to archive. It runs the Redis server that keeps track of all the pipelines and their data. It runs the web-based ArchiveBot dashboard and pipeline dashboard. It runs the Twitter bot that sends information about what's being archived. It has access to log files and debug information.
88

99
It also handles many manual administrative tasks that need doing from time to time, such as cleaning out (or "reaping") information about old pipelines that have gone offline, or old web crawl jobs that were aborted or died or disappeared.
1010

@@ -14,134 +14,105 @@ Another common administrative task on this server is manually adding new pipelin
1414
Basic Information
1515
=================
1616

17-
The control node server is reachable by SSH at ``archivebot.at.ninjawedding.org``.
18-
19-
Archive Team members can SSH into this server with two possible usernames:
20-
21-
* ``archivebot@archivebot.at.ninjawedding.org`` - for performing more delicate administrative tasks
22-
* ``pipeline@archivebot.at.ninjawedding.org`` - for adding/editing SSH keys for new pipeline servers
23-
24-
Neither of these accounts has sudo access.
25-
26-
Long-time Archive Team volunteers used to be assigned individual user accounts on this machine, but starting in mid-2017 all new pipelines are now added to the server via the shared ``pipeline@`` account instead, with a shared ``authorized_keys`` file, to keep things simpler.
27-
28-
This control node server is the same server that also runs the web-based ArchiveBot dashboard:
29-
http://dashboard.at.ninjawedding.org/
30-
31-
And it also runs the web-based ArchiveBot pipeline dashboard:
32-
http://dashboard.at.ninjawedding.org/pipelines
17+
The control node server is usually administered over SSH. Pipelines also connect over SSH, possibly with a separate account (e.g. ``pipeline``).
3318

3419

3520
How to add new ArchiveBot pipelines
3621
===================================
3722

38-
Archive Team volunteers set up and run pipelines on their own servers. Each of these can handle several web crawls at a time, depending on their servers' individual configuration and their available hard drive space and memory. More information and installation instructions are at GitHub:
23+
Pipelines run on their own servers. Each of these can handle several web crawls at a time, depending on their servers' individual configuration and their available hard drive space and memory. More information and installation instructions are at GitHub:
3924
https://github.com/ArchiveTeam/ArchiveBot/blob/master/INSTALL.pipeline
4025

41-
When a new pipeline is set up and all ready to go, the last step is that the server's SSH key still needs to be manually added to the control node. The new pipeline's operator should e-mail or private message one of the Archive Team members who already has SSH access to the control node server, such as David Yip (yipdw), Brooke Schreier Ganz (Asparagirl) or Just Another Archivist (JAA), who may be hanging out in #archiveteam on EFnet. One of them should SSH into the ``pipeline@archivebot.at.ninjawedding.org`` account, and do:
42-
43-
```bash
44-
cd /home/pipeline/.ssh
45-
```
46-
47-
Then they should open the file ``authorized_keys`` with the text editor of their choice, and add the new pipeline server's SSH key to the bottom of the list, save, and quit. If the new pipeline is set up correctly, it should then show up on the web-based pipeline dashboard shortly after that, and should start being assigned web crawl jobs from the queue.
26+
When a new pipeline is set up and ready to go, the last step is to manually add the server's SSH key to the control node. The new pipeline's operator should e-mail or private message one of the members with access to the control node server, who then needs to open ``~/.ssh/authorized_keys`` for the relevant account with the text editor of their choice and add the new pipeline server's public SSH key to the bottom of the list. If the new pipeline is set up correctly, it should show up on the web-based pipeline dashboard shortly after that and start being assigned web crawl jobs from the queue.
4827

4928

5029
All about tmux
5130
==============
5231

53-
The control node server has many different processes running constantly. To help keep these processes running even when people log in or out, and to keep things somewhat well-organized, the server is set up with a program called ``tmux`` to run multiple "panes" of information.
32+
The control node server has many different processes running constantly. To help keep these processes running even when people log in or out, and to keep things somewhat well-organized, the server is set up with a program called ``tmux`` to run multiple "windows" and "panes" of information.
5433

5534
When you log into the control node server, you should type ``tmux attach`` to view all the panes and easily move between them.
5635

5736
Here are some common tmux commands that can be helpful:
5837

59-
* Control-B N - moves to the next pane
60-
* Control-B C - create a new pane
61-
* Control-B W – select a pane/window (shows all running panes)
62-
* Control-B [0-9] – go to a specific pane number (numbered 0 through 9)
38+
* Control-B N - moves to the next window
39+
* Control-B C - create a new window
40+
* Control-B W – select a window (shows all running panes)
41+
* Control-B [0-9] – go to a specific window number (0 through 9)
42+
* Control-B arrow – move between panes within a window
6343
* Control-B S – select an entirely different tmux session (although there should usually be just one)
6444

65-
Each pane has a process running in it, sometimes more than one process, for handling a different administrative task.
66-
67-
68-
tmux pane 0: spiped (secure pipe daemon)
69-
++++++++++++++++++++++++++++++++++++++++
70-
71-
This pane runs ``spiped`` for Redis, which is used by some but not all pipelines. ``spiped`` is secure pipe daemon, and it forwards packets from one port to another port. The preferred connection is ssh tunneling.
72-
73-
Administrators probably won't need to do much in this pane, but it's useful to keep an eye on things.
74-
75-
76-
tmux pane 1: pipeline manager
77-
+++++++++++++++++++++++++++++
45+
Each pane has a process running in it, and related processes' panes are usually grouped in one window.
7846

79-
This pane runs the pipeline manager, which is ``plumbing/updates-listener``. This listens for updates coming into Redis from all of the many pipelines. It then sends these updates to a ZeroMQ socket, which is what used by the web-based ArchiveBot dashboard (and possibly a few other things?); the dashboard is listening on publicly accessible port 31337.
8047

81-
(This port is *not* where the ArchiveBot Twitter bot gets its data; that's a different daemon.)
48+
CouchDB and Redis
49+
+++++++++++++++++
8250

83-
Logs from this pipeline manager are stored in ``plumbing/log-firehose``. Someday this log firehose could be replaced with Redis pubsub.
51+
CouchDB and Redis might be running in tmux or as system services, depending on how the control node was set up. Either way, they can generally be ignored and left alone.
8452

8553

86-
tmux pane 2: pipeline log analyzer and log trimmer
87-
++++++++++++++++++++++++++++++++++++++++++++++++++
54+
Dashboard
55+
+++++++++
8856

89-
This pane manages the pipeline log analyzer and the pipeline log trimmer.
57+
This window runs the dashboard components: the Ruby server (static files, job and pipeline list, etc.), the Python WebSocket server (real-time log delivery), and the Ruby server killer (``killer.py``).
9058

91-
The log analyzer looks at updates coming off the firehose and classifies them as HTTP 1xx, 2xx, etc, or network error.
59+
The Ruby server pane logs warnings and errors occurring in the Ruby code but is generally quiet. The Python WebSocket server logs stats (number of connected users, queue size, CPU and memory usage) every minute. The Ruby server has an unknown bug, possibly a small memory leak, which can render it unresponsive. ivan's dashboard killer polls it regularly and prints a dot each time the dashboard is alive and responds. If the dashboard does not respond, the killer kills the Ruby server, which is run in a ``while :; do ...; done`` loop so that it restarts immediately when this happens.
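The poll-and-kill behaviour described above can be sketched roughly as follows. This is not the code of ``killer.py``; the method names ``dashboard_alive?`` and ``watchdog_round``, the ``'K'`` marker, the timeout and the plain-HTTP probe are all assumptions made for illustration.

```ruby
require 'net/http'
require 'uri'

# Hypothetical probe: true if the dashboard answers with a 2xx response
# within the timeout. Assumes a plain-HTTP URL.
def dashboard_alive?(url, timeout: 5)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port,
                  open_timeout: timeout, read_timeout: timeout) do |http|
    http.get(uri.request_uri).is_a?(Net::HTTPSuccess)
  end
rescue StandardError
  # Connection refused, timeouts, DNS failures etc. all count as "dead".
  false
end

# One polling round. Returns '.' when the probe succeeds; otherwise runs
# the kill action and returns 'K'. Both callables are injected so the
# logic can be exercised without a real server.
def watchdog_round(probe, kill)
  if probe.call
    '.'
  else
    kill.call
    'K'
  end
end
```

In a real loop, the probe would wrap ``dashboard_alive?`` with the dashboard's URL and the kill action would signal the Ruby server process (for instance with ``Process.kill``); the surrounding ``while`` loop on the shell side would then restart it.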
9260

93-
The log trimmer is an artifact of how ArchiveBot stores logs, could probably be removed someday. It gets rid of old logs from Redis to prevent out-of-memory errors.
61+
IRC bot
62+
+++++++
9463

64+
This pane runs the actual ArchiveBot, which is an IRC bot that listens for commands about what websites to archive.
9565

96-
tmux pane 3: web-based dashboard
97-
++++++++++++++++++++++++++++++++
66+
Usually, there's not much that an administrator will need to do for this. If the bot loses its IRC connection, it will try to reconnect on its own. This should usually work fine, but during a netsplit (a disconnect between IRC server nodes), it might reconnect to an undesired server, in which case the bot might need to be "kicked" (restarted and reconnected to the IRC server).
9867

99-
This pane runs the web-based ArchiveBot dashboard, which is publicly viewable at:
100-
http://dashboard.at.ninjawedding.org/
68+
If you need to kick it, hit ``^C`` in this pane to kill the non-responding bot. Then rerun the bot (by hitting the ``Up arrow key`` to show the last command), possibly after adjusting the command if needed.
10169

102-
This tmux pane is split into two parts on the screen, top and bottom. The top pane shows the throughput of the dashboard web socket, which is the rate of data flowing from the log firehose to the dashboard.
10370

104-
The web-based dashboard has a small unknown memory leak, so the bottom pane runs and monitors ivan's “dashboard killer” daemon. It constantly polls the dashboard to see if it's alive, and it prints a dot if it was a success (dashboard was alive and responded). If the dashboard does not respond, probably because of that small memory leak, then this daemon kills it and automatically re-spawns it.
71+
plumbing
72+
++++++++
10573

74+
Plumbing is responsible for much of the data flow of log lines within the control node.
10675

107-
tmux pane 4: IRC bot
108-
++++++++++++++++++++
76+
The ``plumbing/updates-listener`` listens for job updates coming into Redis from the pipelines. This produces job IDs, which are sent to ``plumbing/log-firehose``, which pulls new log lines from Redis (using the job IDs read from stdin) and pushes them to a ZeroMQ socket. This ZeroMQ socket is used by the dashboard and the two further plumbing tools below.
10977

110-
This pane runs the actual ArchiveBot, which is an IRC bot that sits in the channel #archivebot on EFnet and listens for Archive Team volunteers feeding it commands about what websites to archive.
78+
The ``plumbing/analyzer`` looks at new log lines and classifies them as HTTP 1xx, 2xx, etc, or network error.
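As a rough illustration of this classification (not the actual ``plumbing/analyzer`` code), a minimal sketch might look like the following. The method names and the convention that a network error is represented by ``nil`` are assumptions made for the example.

```ruby
# Hypothetical helper, not taken from plumbing/analyzer.
# Assumes a response code is an Integer and nil means a network error.
def classify_response(code)
  return :network_error if code.nil?

  case code
  when 100..199 then :http_1xx
  when 200..299 then :http_2xx
  when 300..399 then :http_3xx
  when 400..499 then :http_4xx
  when 500..599 then :http_5xx
  else :unknown
  end
end

# Tally a batch of response codes into per-class counters.
def tally_responses(codes)
  codes.each_with_object(Hash.new(0)) do |code, counts|
    counts[classify_response(code)] += 1
  end
end
```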
11179

112-
Usually, there's not much that an administrator will need to do for this. If the bot gets kicked off EFnet, it will try to reconnect on its own. However, EFnet sometimes has the tendency to netsplit (disconnect from some IRC nodes in a disorganized manner). If that happens, the bot might try to rejoin a server that's been split, in which case the bot might need to be "kicked" (restarted and reconnected to the IRC server).
80+
The ``plumbing/trimmer`` is an artefact of the current log flow design. It removes old log lines, i.e. ones that have been processed by the firehose sender and the analyzer, from Redis to prevent out-of-memory errors.
11381

114-
If you need to kick it, hit ``^C`` in this pane to kill the non-responding bot. Then hit the ``Up arrow key`` to show the last command that had been typed into bash, which is usually the one that invokes the bot. You can then adjust that command if you need to (such as possibly changing the server), and then hit enter to re-run that command and reconnect the bot to EFnet.
11582

83+
cogs
84+
++++
11685

117-
tmux pane 5: redis-cli console
118-
++++++++++++++++++++++++++++++
86+
cogs is responsible for keeping the user agents and browser aliases in CouchDB updated and for tweeting about things getting archived. It also prints very verbose warnings about jobs that haven't sent updates (a heartbeat) to the control node for a long time, recommending that they be 'reaped'. These warnings may or may not be accurate. For reaping jobs (or pipelines), see below.
11987

120-
This is the console for running redis-cli commands. It might get closed down, because it's rarely used.
12188

89+
Job reaping
90+
+++++++++++
12291

123-
tmux pane 6: job reaper and Twitter bot
124-
+++++++++++++++++++++++++++++++++++++++
92+
Jobs need to be reaped manually when they no longer exist but the pipeline did not inform the control node about this. Examples include pipeline crashes (say, a freeze or a power outage). Note that individual job crashes (e.g. due to wpull bugs) do not need to be handled on the control node; as long as the pipeline process still runs, it will treat the job as finished once the wpull process has been killed by the pipeline operator.
12593

126-
This is the job reaper, used by administrators to manually get rid of "zombie" web crawl jobs that are dead or quit but which are still showing up for some reason on the web-based dashboard, cluttering it up.
94+
If you need to reap a dead ArchiveBot job -- in this case, one with the hypothetical job id 'abcdefghiabcdefghi' -- here's what to do:
12795

128-
Every job has a heartbeat associated with it, which Redis monitors. This pane will let you know if certain jobs' heartbeats have not been seen for a long time, which would indicate that the jobs are zombies.
129-
130-
If you need to reap a dead ArchiveBot job -- in this case, one with the hypothetical job id 'abcdefghiabcdefghi' -- here's what to do in this pane:
96+
If there is no Ruby console for reaping yet:
13197

13298
```bash
133-
cd ~/ArchiveBot/bot/
99+
cd ArchiveBot/bot
134100
bundle exec ruby console.rb
101+
```
102+
103+
Retrieve the job:
104+
105+
```ruby
135106
j = Job.from_ident('abcdefghiabcdefghi', $redis)
136107
```
137108

138-
At this point, you should get a response message starting with ``<struct Job...>``. That means the job id does exist somewhere in Redis, which is good. Then you should run:
109+
At this point, you should get a response message starting with ``<struct Job...>``. That means the job id does exist somewhere in Redis, which is good. Then you should run:
139110

140-
```bash
111+
```ruby
141112
j.fail
142113
```
143114

144-
This will kill that one job, but note that the magic Redis word in the command here is 'fail', not 'kill'. This deletes the job state from Redis.
115+
This will kill that one job, but note that the method to call here is ``fail``, not ``kill``. This deletes the job state from Redis (after a few seconds).
145116

146117
It is possible to reap multiple jobs at once, by mapping their job id's with regex and such. Such exercises are best left to experts.
147118

@@ -153,52 +124,36 @@ You can also clean out “nil” jobs with redis-cli in the admin console with t
153124

154125
That command would send the delete command about each id to the Redis server.
155126

156-
This tmux pane 6 *also* runs the ArchiveBot Twitter bot connector. You shouldn't need to do anything with that most of the time, but it ever dies, go to pane 6 and press up and enter to re-run command, which is:
157-
158-
```bash
159-
bundle exec ruby start.rb -t twitter_archivebot.json
160-
```
161127

162-
The Twitter bot is publicly viewable at https://twitter.com/ArchiveBot/ .
128+
Pipeline reaping
129+
++++++++++++++++
163130

164-
165-
tmux pane 7: couchdb
166-
++++++++++++++++++++
167-
168-
This pane inserts couchdb documents. You can probably ignore this, and should leave it as-is.
169-
170-
171-
tmux pane 8: the pipeline reaper
172-
++++++++++++++++++++++++++++++++
173-
174-
This is the pane where you can reap old dead pipelines from the pipeline monitor. You can view the web-based pipeline monitor page here: http://dashboard.at.ninjawedding.org/pipelines
175-
176-
Pipeline data is stored inside Redis. You can get a list of all the pipelines Redis knows about with this command:
131+
Pipeline data is stored inside Redis. You can get a list of all the pipelines Redis knows about from the dashboard or with this command:
177132

178133
```bash
179-
~/redis-2.8.6/src/redis-cli keys pipeline:*
134+
redis-cli keys 'pipeline:*'
180135
```
181136

182137
That will list all currently assigned pipeline keys -- but some of those pipelines may be dead.
183138

184139
To peek at the data within any given pipeline -- in this case, a pipeline that was assigned the id 4f618cfcd81f44583a93b8bdb50470a1 -- use the command:
185140

186141
```bash
187-
~/redis-2.8.6/src/redis-cli type pipeline:4f618cfcd81f44583a93b8bdb50470a1
142+
redis-cli type pipeline:4f618cfcd81f44583a93b8bdb50470a1
188143
```
189144

190145
To find out which pipelines are dead, check the web-based pipeline monitor and copy the unique key for a dead pipeline.
191146

192147
To reap the dead pipeline (two parts):
193148

194149
```bash
195-
~/redis-2.8.6/src/redis-cli srem pipelines pipeline:4f618cfcd81f44583a93b8bdb50470a1
150+
redis-cli srem pipelines pipeline:4f618cfcd81f44583a93b8bdb50470a1
196151
```
197152

198153
That removes the dead pipeline from the set of active pipelines. Then do:
199154

200155
```bash
201-
~/redis-2.8.6/src/redis-cli del pipeline:4f618cfcd81f44583a93b8bdb50470a1
156+
redis-cli del pipeline:4f618cfcd81f44583a93b8bdb50470a1
202157
```
203158
***NOTE: be very careful with this; make sure you do not have the word "pipelines" in this command!***
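The two-part reaping procedure, including the warning about the ``pipelines`` key, could be wrapped in a small guard like the sketch below. The method name ``reap_pipeline`` is hypothetical, ``redis`` is assumed to be a redis-rb style client that responds to ``srem`` and ``del``, and the assumption that pipeline ids are hexadecimal is based only on the example id shown above.

```ruby
# Hypothetical helper; not part of ArchiveBot.
def reap_pipeline(redis, pipeline_key)
  # Refuse anything that is not an individual pipeline key, in particular
  # the 'pipelines' set itself. The hex-id pattern is an assumption.
  unless pipeline_key.match?(/\Apipeline:[0-9a-f]+\z/)
    raise ArgumentError, "not a pipeline key: #{pipeline_key.inspect}"
  end

  redis.srem('pipelines', pipeline_key) # part 1: remove it from the set
  redis.del(pipeline_key)               # part 2: delete the pipeline's data
end
```

The guard exists only to make it impossible to pass ``pipelines`` to ``del`` by accident; the two Redis calls are the same ones shown in the commands above.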
204159

@@ -211,9 +166,8 @@ Re-sync the IRC !status command to actual Redis data
211166
The ArchiveBot ``!status`` command that is available in the #archivebot IRC channel on EFnet is supposed to be an accurate counter of how many jobs are currently running, aborted, completed, or pending. But sometimes it gets un-synchronized from the actual Redis values, especially if a pipeline dies. Here's how to automatically sync the information again, from Redis to IRC:
212167

213168
```bash
214-
cd /ArchiveBot/bot
169+
cd ArchiveBot/bot
215170
bundle exec ruby console.rb
216171
in_working = $redis.lrange('working', 0, -1); 1
217172
in_working.each { |ident| $redis.lrem('working', 0, ident) if Job.from_ident(ident, $redis).nil? }
218173
```
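The same clean-up can be written as a small method, which makes it easier to see which identifiers were removed. This is only a sketch: ``prune_working_list`` is a hypothetical name, ``redis`` is assumed to respond to ``lrange`` and ``lrem``, and ``job_exists`` is any callable that returns false for identifiers whose job record is gone (in the console, that would be the case where ``Job.from_ident(ident, $redis)`` is ``nil``).

```ruby
# Hypothetical helper; not part of ArchiveBot.
# Removes identifiers from the 'working' list when their job no longer
# exists, and returns the identifiers that were removed.
def prune_working_list(redis, job_exists)
  stale = redis.lrange('working', 0, -1).reject { |ident| job_exists.call(ident) }
  stale.each { |ident| redis.lrem('working', 0, ident) }
  stale
end
```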
219-
