Overhaul the documentation to remove all references to the ArchiveTeam instance

This makes the docs more relevant to anyone wanting to run their own instance of ArchiveBot. It also removes an ugly hack on the dashboard that was specific to the ArchiveTeam instance.
ArchiveBot has a central "control node" server. This document explains how to manage it, hopefully without breaking anything.
This control node server does many things. It runs the actual bot that sits in an IRC channel and listens to commands about which websites to archive. It runs the Redis server that keeps track of all the pipelines and their data. It runs the web-based ArchiveBot dashboard and pipeline dashboard. It runs the Twitter bot that sends information about what's being archived. It has access to log files and debug information.
It also handles many manual administrative tasks that need doing from time to time, such as cleaning out (or "reaping") information about old pipelines that have gone offline, or old web crawl jobs that were aborted or died or disappeared.
Basic Information
=================
The control node server is usually administered over SSH. Pipelines also connect over SSH, possibly with a separate account (e.g. ``pipeline``).
How to add new ArchiveBot pipelines
===================================
Pipelines run on their own servers. Each pipeline can handle several web crawls at a time, depending on the server's configuration and its available hard drive space and memory. More information and installation instructions are at GitHub:
When a new pipeline is set up and ready to go, the last step is to manually add the server's SSH key to the control node. The new pipeline's operator should e-mail or private message someone with access to the control node server, who then needs to open ``~/.ssh/authorized_keys`` for the relevant account in the text editor of their choice and add the new pipeline server's SSH key to the bottom of the list. If the new pipeline is set up correctly, it should show up on the web-based pipeline dashboard shortly afterwards and start being assigned web crawl jobs from the queue.
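
As a concrete sketch, adding a key amounts to appending one line to that file. The key and hostname below are made-up placeholders, and the commands assume the account's default OpenSSH layout:

```shell
# Make sure the .ssh directory exists with strict permissions;
# sshd may ignore the file otherwise.
mkdir -p ~/.ssh
chmod 700 ~/.ssh

# Append the new pipeline server's public key (placeholder value --
# paste the key the operator actually sent).
echo 'ssh-ed25519 AAAAexamplekeyAAAA pipeline@new-server' >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```

After saving, the operator can verify connectivity with a plain ``ssh`` login before starting the pipeline.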
All about tmux
==============
The control node server has many different processes running constantly. To help keep these processes running even when people log in or out, and to keep things somewhat well-organized, the server is set up with a program called ``tmux`` to run multiple "windows" and "panes" of information.
When you log into the control node server, you should type ``tmux attach`` to view all the panes and easily move between them.
Here are some common tmux commands that can be helpful:

* Control-B N – move to the next window
* Control-B C – create a new window
* Control-B W – select a window (shows all windows and their panes)
* Control-B [0-9] – go to a specific window number (numbered 0 through 9)
* Control-B arrow – move between panes within a window
* Control-B S – select an entirely different tmux session (although there should usually be just one)
Each pane has a process running in it, and related processes' panes are usually grouped in one window.
CouchDB and Redis
+++++++++++++++++
CouchDB and Redis might be running inside tmux or as a system service, depending on how the server was set up. Either way, they can generally be ignored and left alone.

Dashboard
+++++++++

This window runs the dashboard components: the Ruby server (static files, job and pipeline list, etc.), the Python WebSocket server (real-time log delivery), and the Ruby server killer (``killer.py``).
The Ruby server pane logs warnings and errors occurring in the Ruby code but is generally relatively quiet. The Python WebSocket server logs stats (number of connected users, queue size, CPU and memory usage) every minute. The Ruby server has an unknown bug, apparently a small memory leak, which eventually renders it unresponsive. ivan's dashboard killer regularly polls it and prints a dot on each success (the dashboard was alive and responded). If the dashboard does not respond, probably because of that leak, the killer kills the server. The Ruby server is run in a ``while :; do ...; done`` loop so that it restarts immediately when this happens.
IRC bot
+++++++
This pane runs the actual ArchiveBot, which is an IRC bot that listens for commands about what websites to archive.
Usually, there's not much that an administrator will need to do for this. If the bot loses its IRC connection, it will try to reconnect on its own. This should usually work fine, but during a netsplit (a disconnect between IRC server nodes), it might reconnect to an undesired server, in which case the bot might need to be "kicked" (restarted and reconnected to the IRC server).
If you need to kick it, hit ``^C`` in this pane to kill the non-responding bot. Then rerun the bot (press the up arrow key to recall the last command), adjusting the command first if needed.
plumbing
++++++++
Plumbing is responsible for much of the data flow of log lines within the control node.
The ``plumbing/updates-listener`` listens for job updates coming into Redis from the pipelines. This produces job IDs, which are sent to ``plumbing/log-firehose``, which pulls new log lines from Redis (using the job IDs read from stdin) and pushes them to a ZeroMQ socket. This ZeroMQ socket is used by the dashboard and the two further plumbing tools below.
The ``plumbing/analyzer`` looks at new log lines and classifies them as HTTP 1xx, 2xx, etc., or as network errors.
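
The bucketing itself is simple. Here is a minimal illustrative sketch in Python (not the actual analyzer code; the function name and signature are invented for illustration):

```python
def classify(status_code):
    """Bucket a log line the way the analyzer categorizes results.

    status_code: HTTP status as an int, or None when the request
    failed at the network level (DNS failure, timeout, ...).
    """
    if status_code is None:
        return 'network error'
    # 200 -> '2xx', 404 -> '4xx', and so on.
    return '%dxx' % (status_code // 100)
```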
The ``plumbing/trimmer`` is an artefact of the current log flow design. It removes old log lines, i.e. ones that have been processed by the firehose sender and the analyzer, from Redis to prevent out-of-memory errors.
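
Conceptually, a line can only be dropped once every consumer is past it. A plain-Python sketch of that idea (the real trimmer operates on Redis structures; these names are invented for illustration):

```python
def trim(log_lines, cursors):
    """Drop log lines that every consumer has already processed.

    log_lines: log lines for one job, oldest first.
    cursors: consumer name -> number of lines it has processed,
             e.g. {'firehose': 2, 'analyzer': 1}.
    Returns (remaining_lines, number_removed).
    """
    # The slowest consumer determines how much can be deleted.
    done = min(cursors.values()) if cursors else 0
    return log_lines[done:], done
```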
cogs
++++
cogs is responsible for keeping the user agents and browser aliases in CouchDB updated and for tweeting about things getting archived. It also prints very verbose warnings about jobs that haven't sent updates (a heartbeat) to the control node for a long time, recommending that they be 'reaped'. These warnings may or may not be accurate. For reaping jobs (or pipelines), see below.
Job reaping
+++++++++++
Jobs need to be reaped manually when they no longer exist but the pipeline did not inform the control node about this. Examples include pipeline crashes (say, a freeze or a power outage). Note that individual job crashes (e.g. due to wpull bugs) do not need to be handled on the control node; as long as the pipeline process still runs, it will treat the job as finishing once the wpull process has been killed by the pipeline operator.
If you need to reap a dead ArchiveBot job -- in this case, one with the hypothetical job id 'abcdefghiabcdefghi' -- here's what to do:
If there is no Ruby console for reaping yet:
```bash
cd ArchiveBot/bot
bundle exec ruby console.rb
```
Retrieve the job:
```ruby
j = Job.from_ident('abcdefghiabcdefghi', $redis)
```
At this point, you should get a response message starting with ``<struct Job...>``. That means the job id does exist somewhere in Redis, which is good. Then you should run:
```ruby
j.fail
```
This will kill that one job, but note that the magic Redis word in the command here is 'fail', not 'kill'. This deletes the job state from Redis (after a few seconds).
It is possible to reap multiple jobs at once, by matching their job ids with regexes and such. Such exercises are best left to experts.

You can also clean out “nil” jobs with redis-cli in the admin console; that command sends a delete command for each such job id to the Redis server.
The Twitter bot is publicly viewable at https://twitter.com/ArchiveBot/ .
Pipeline reaping
++++++++++++++++
Pipeline data is stored inside Redis. You can get a list of all the pipelines Redis knows about from the dashboard or with this command:
```bash
redis-cli keys 'pipeline:*'
```
That will list all currently assigned pipeline keys -- but some of those pipelines may be dead.
To peek at the data within any given pipeline -- in this case, a pipeline that was assigned the id 4f618cfcd81f44583a93b8bdb50470a1 -- use the command:
```bash
redis-cli type pipeline:4f618cfcd81f44583a93b8bdb50470a1
```
To find out which pipelines are dead, check the web-based pipeline monitor and copy the unique key for a dead pipeline.
That removes the dead pipeline from the set of active pipelines. Then do:
```bash
redis-cli del pipeline:4f618cfcd81f44583a93b8bdb50470a1
```
***NOTE: be very careful with this; make sure you do not have the word "pipelines" in this command!***

Re-sync the IRC !status command to actual Redis data
====================================================

The ArchiveBot ``!status`` command that is available in the #archivebot IRC channel on EFnet is supposed to be an accurate counter of how many jobs are currently running, aborted, completed, or pending. But sometimes it gets un-synchronized from the actual Redis values, especially if a pipeline dies. Here's how to automatically sync the information again, from Redis to IRC: