@@ -4,35 +4,253 @@ ExtractCode
44- license: Apache-2.0
55- copyright: copyright (c) nexB. Inc. and others
66- homepage_url: https://github.com/nexB/extractcode
7- - keywords: archive, extraction, libarchive, 7zip, scancode-toolkit
7+ - keywords: archive, extraction, libarchive, 7zip, scancode-toolkit, extractcode
88
9+ Supports Windows, Linux and macOS on 64-bit processors and Python 3.6 to 3.9.
910
10- ExtractCode is a universal archive extractor. It uses behind the scenes
11- the Python standard library, a custom ctypes binding to libarchive and
12- the 7zip command line to extract a large number of common and
13- less common archives and compressed files. It tries to extract things
14- in the same way on all OSes, including auto-renaming files that would
15- not have valid names on certain filesystems or when there are multiple
16- copies of the same path in a given archive.
17- The extraction is driven from a "voting" system that considers the
18- file extension(s) and name, the file type and mime type (using a ctypes
19- binding to libmagic) to select the most appropriate extractor or
20- uncompressor function. It can handle multi-level archives such as tar.gz.
2111
12+ **ExtractCode is a (mostly) universal archive extractor.**
2213
14+ Install with::
15+
16+ pip install extractcode[full]
17+
18+
19+ Why another extractor?
20+ ----------------------
21+
22+ **it will extract!**
23+
24+ ExtractCode will extract things where other extractors may fail.
25+
26+ - Say you want to extract the tarball of the Linux kernel source code on Windows.
27+ It contains paths that differ only by case and therefore will not extract
28+ correctly on Windows: some files may be munged or the extraction may fail.
29+
30+ - Or a tarball (on any OS) may contain the exact same path multiple times. In
31+ these cases the paths showing up earlier in the archive may be "hidden" and
32+ overwritten by the same path showing up later in the archive, giving the
33+ impression that there is only one file.
34+
35+ - Or an archive may be damaged a little but most files can still be extracted.
36+
37+ - Or the extracted files have such permissions that you cannot read them, or
38+ they are not owned by you.
39+
40+ - Or the archive may contain weird paths including relative paths that may be
41+ problematic to extract.
42+
43+ - Or the archive may contain special file types (character/device files) that
44+ may be problematic to extract.
45+
46+ - Or an archive may be a virtual disk or filesystem image(s) that would
47+ typically need to be mounted to be accessed, and may require root access
48+ and guesswork to find out which partition and filesystem are at play and
49+ which driver to use.
50+
51+ In all these cases, ExtractCode will extract and try hard to do the right thing
52+ to obtain the actual archived content when other tools may fail.
53+
54+ It can also recursively extract any type of (nested) archives-in-archives.
55+
56+ As a downside, the extracted content may not be exactly what you would expect
57+ if you wanted to use the contained files: for instance ... but this is perfectly
58+ OK for file content analysis such as software composition or forensic analysis.
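
To make the duplicate-path and case-collision problems listed above concrete, here is a minimal sketch using only the Python standard library. It is not how ExtractCode detects or renames such paths; the helper names ``build_tar`` and ``find_problem_paths`` are hypothetical and exist only for this illustration.

```python
import io
import tarfile
from collections import Counter


def build_tar(names):
    """Build an in-memory tar with one small file per name (duplicates allowed)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for index, name in enumerate(names):
            data = f"content {index}".encode()
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


def find_problem_paths(fileobj):
    """Return (exact duplicate paths, groups of paths differing only by case)."""
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        names = tar.getnames()
    # A tar may list the very same path more than once.
    exact = sorted(name for name, count in Counter(names).items() if count > 1)
    # Distinct paths that collide on a case-insensitive filesystem.
    by_lower = {}
    for name in set(names):
        by_lower.setdefault(name.lower(), set()).add(name)
    collisions = sorted(sorted(group) for group in by_lower.values() if len(group) > 1)
    return exact, collisions


archive = build_tar(["src/Makefile", "src/makefile", "README", "README"])
duplicates, case_collisions = find_problem_paths(archive)
print(duplicates)
print(case_collisions)
```

A naive extraction of this archive would leave a single ``README`` on any OS and a single makefile on a case-insensitive filesystem, which is the kind of silent loss that auto-renaming is meant to avoid.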
59+
60+ Behind the scenes, ExtractCode uses multiple tools such as:
61+
62+ - the Python standard library,
63+ - a custom ctypes binding to libarchive,
64+ - the 7zip command line tool, and
65+ - optionally libguestfs on Linux.
66+
67+ With these, it is possible to extract a large number of common and less common
68+ archives and compressed file types. ExtractCode tries to extract things in the
69+ same way on all supported OSes, including auto-renaming files that would have
70+ invalid, non-extractible names on certain filesystems or when there are multiple
71+ copies of the same path in a given archive (which is possible in a tar).
72+
73+ The extraction is driven from a "voting" system that considers the file
74+ extension(s) and name, the filetype and mimetype (using a ctypes binding to
75+ libmagic) to select the most appropriate extractor or decompressor function.
76+ It can handle multi-level archives such as tar.gz and can extract recursively
77+ any nested archives.
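
The "voting" idea described above can be sketched roughly as follows. This is an illustrative simplification and not ExtractCode's actual implementation: the ``HANDLERS`` table, the handler names and ``select_handler`` are all hypothetical, and the file type and mime type strings are passed in by hand rather than obtained from libmagic.

```python
from collections import Counter

# Hypothetical handler table: each handler lists the hints that vote for it.
HANDLERS = {
    "extract_tar": {
        "extensions": (".tar",),
        "filetypes": ("tar archive",),
        "mimetypes": ("application/x-tar",),
    },
    "uncompress_gzip": {
        "extensions": (".gz", ".tgz"),
        "filetypes": ("gzip compressed",),
        "mimetypes": ("application/gzip", "application/x-gzip"),
    },
    "extract_zip": {
        "extensions": (".zip",),
        "filetypes": ("zip archive",),
        "mimetypes": ("application/zip",),
    },
}


def select_handler(filename, filetype, mimetype):
    """Return the name of the handler with the most votes, or None if no hint matches."""
    votes = Counter()
    lowered_name = filename.lower()
    lowered_type = filetype.lower()
    for name, hints in HANDLERS.items():
        if lowered_name.endswith(hints["extensions"]):
            votes[name] += 1
        if any(hint in lowered_type for hint in hints["filetypes"]):
            votes[name] += 1
        if mimetype in hints["mimetypes"]:
            votes[name] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


print(select_handler("linux.tar.gz", "gzip compressed data", "application/gzip"))
print(select_handler("linux.tar", "POSIX tar archive (GNU)", "application/x-tar"))
print(select_handler("data.zip", "gzip compressed data", "application/gzip"))
```

In this sketch a multi-level archive such as ``linux.tar.gz`` is first matched to the gzip handler, and the resulting ``linux.tar`` would then be matched to the tar handler on a second pass. The last call shows why several hints are combined: a misnamed file is still routed by its detected type.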
2378
2479Visit https://aboutcode.org and https://github.com/nexB/ for support and download.
2580
81+
82+ We run CI tests on:
83+
84+ - Azure pipelines https://dev.azure.com/nexB/extractcode/_build
85+
86+
87+ Installation
88+ ------------
89+
90+ To install this package with its full capability (where the binaries for
91+ 7zip and libarchive are installed), use the `full` extra option::
92+
93+ pip install extractcode[full]
94+
95+ If you want to use the versions of the binaries (possibly) provided by your
96+ operating system, use the `minimal` option::
97+
98+ pip install extractcode
99+
100+ In this case, you will need a working and compatible libarchive and 7zip,
101+ installed and configured in one of these ways so that ExtractCode can
102+ find them:
103+
104+ - **an extractcode-libarchive and extractcode-7z plugin**: See the standard ones at
105+ https://github.com/nexB/scancode-plugins/tree/main/builtins
106+ These can either bundle a libarchive library and a 7z executable or expose
107+ system-installed ones.
108+ They do so by providing plugin entry points as ``scancode_location_provider``
109+ for ``extractcode_libarchive`` that should point to a ``LocationProviderPlugin``
110+ subclass with a ``get_locations()`` method that must return a mapping with
111+ this key:
112+
113+ - 'extractcode.libarchive.dll': the absolute path to a **libarchive** shared object/DLL
114+
115+ See for example:
116+
117+ - https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_libarchive-linux/setup.py#L40
118+ - https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_libarchive-linux/src/extractcode_libarchive/__init__.py#L17
119+
120+ And in the same way, the ``scancode_location_provider`` for ``extractcode_7zip``
121+ should point to a ``LocationProviderPlugin`` subclass with a ``get_locations()``
122+ method that must return a mapping with this key:
123+
124+ - 'extractcode.sevenzip.exe': the absolute path to a **7zip** executable
125+
126+ See for example:
127+
128+ - https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_7z-linux/setup.py#L40
129+ - https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_7z-linux/src/extractcode_7z/__init__.py#L18
130+
131+ - use **environment variables** to point to installed binaries:
132+
133+ - EXTRACTCODE_LIBARCHIVE_PATH: the absolute path to a libarchive DLL
134+ - EXTRACTCODE_7Z_PATH: the absolute path to a 7zip executable
135+
136+
137+ - **a system-installed libarchive and 7zip executable** available in the system **PATH**.
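
As a rough sketch of the plugin option above, a location provider could look like the following. Everything here is an assumption made for illustration: the ``LocationProviderPlugin`` class below is a local stand-in so that the example runs on its own (a real plugin would subclass the one used by the linked builtin plugins), and the class name, module name and library path are hypothetical.

```python
from os import path


class LocationProviderPlugin:
    """Local stand-in for the real base class, defined only to keep this sketch runnable."""

    def get_locations(self):
        raise NotImplementedError


class SystemLibarchivePaths(LocationProviderPlugin):
    """Hypothetical provider exposing a system-installed libarchive."""

    # Assumed install location; adjust for the actual system.
    LIBARCHIVE = "/usr/lib/x86_64-linux-gnu/libarchive.so.13"

    def get_locations(self):
        # The mapping key is the one documented above; the value is an absolute path.
        return {"extractcode.libarchive.dll": path.abspath(self.LIBARCHIVE)}


# Hypothetical entry point declaration, in the shape passed to setup(entry_points=...).
# "my_libarchive_provider" is an assumed module name.
ENTRY_POINTS = {
    "scancode_location_provider": [
        "extractcode_libarchive = my_libarchive_provider:SystemLibarchivePaths",
    ],
}

locations = SystemLibarchivePaths().get_locations()
print(sorted(locations))
```

A 7zip provider would follow the same shape, registered as ``extractcode_7zip`` and returning the 'extractcode.sevenzip.exe' key instead.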
138+
139+
140+ The supported versions of the binary tools are:
141+
142+ - libarchive 3.5.x
143+ - 7zip 16.5.x
144+
145+
146+ Development
147+ -----------
148+
26149To set up the development environment::
27150
28- source configure
151+ source configure --dev
152+
29153
30154To run unit tests::
31155
32156 pytest -vvs -n 2
33157
158+
34159To clean up development environment::
35160
36161 ./configure --clean
37162
38163
164+ To run the command line tool in the activated environment::
165+
166+ ./extractcode -h
167+
168+
169+ Configuration with environment variables
170+ ----------------------------------------
171+
172+ ExtractCode will use these environment variables if set:
173+
174+ - EXTRACTCODE_LIBARCHIVE_PATH : the path to the ``libarchive.so`` libarchive
175+ shared library used to support some of the archive formats. If not provided,
176+ ExtractCode will look for a plugin-provided libarchive library path. See
177+ https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.
178+ If no plugin contributes libarchive, then a final attempt is made to look for
179+ it in the PATH using standard DLL loading techniques.
180+
181+ - EXTRACTCODE_7Z_PATH : the path to the ``7z`` 7zip executable used to support
182+ some of the archive formats. If not provided, ExtractCode will look for a
183+ plugin-provided 7z executable path. See
184+ https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.
185+ If no plugin contributes 7z, then a final attempt is made to look for
186+ it in the PATH.
187+
188+ - EXTRACTCODE_GUESTFISH_PATH : the path to the ``guestfish`` tool from
189+ libguestfs to use to extract VM images. If not provided, ExtractCode will look
190+ in the PATH for an installed ``guestfish`` executable instead.
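
The lookup order described above (environment variable first, then a plugin-provided location, then the PATH) can be sketched like this for the executables. It is a simplified illustration under stated assumptions, not ExtractCode's own code: ``resolve_tool`` and its parameters are hypothetical, and the final step for libarchive would rely on shared library loading rather than ``shutil.which``.

```python
import os
import shutil


def resolve_tool(env_var, plugin_locations, plugin_key, executable_name, environ=None):
    """Return (location, source) using the order: environment, plugin, PATH."""
    environ = os.environ if environ is None else environ

    # 1. An explicitly configured environment variable wins.
    from_env = environ.get(env_var)
    if from_env:
        return from_env, "environment"

    # 2. Otherwise use a plugin-provided location, if any plugin contributed one.
    from_plugin = plugin_locations.get(plugin_key)
    if from_plugin:
        return from_plugin, "plugin"

    # 3. As a last attempt, search the PATH for the executable.
    from_path = shutil.which(executable_name)
    if from_path:
        return from_path, "PATH"

    return None, "not found"


print(resolve_tool(
    "EXTRACTCODE_7Z_PATH",
    {"extractcode.sevenzip.exe": "/opt/plugins/7z"},
    "extractcode.sevenzip.exe",
    "7z",
    environ={"EXTRACTCODE_7Z_PATH": "/opt/7z/7z"},
))
```

For ``guestfish``, which has no plugin step in the description above, the same function can be called with an empty mapping so that only the environment variable and the PATH are consulted.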
191+
192+
193+
194+ Adding support for VM images extraction
195+ ---------------------------------------
196+
197+ Adding support for VM images requires the manual installation of the
198+ libguestfs-tools system package. This is supported only on Linux.
199+ On Debian and Ubuntu you can use this command::
200+
201+ sudo apt-get install libguestfs-tools
202+
203+
204+ On Ubuntu only, an additional manual step is required as the kernel executable
205+ file cannot be read by users as required by libguestfs.
206+
207+ Run this command as a temporary and immediate fix::
208+
209+ sudo chmod 0644 /boot/vmlinuz-*
210+ for k in /boot/vmlinuz-*
211+ do sudo dpkg-statoverride --add --update root root 0644 "$k"
212+ done
213+
214+ You likely want both this temporary fix and a more permanent fix; otherwise each
215+ kernel update will revert to the default permissions and ExtractCode will stop
216+ working for VM images extraction.
217+
218+ Therefore follow these instructions:
219+
220+ 1. Using sudo, create the file /etc/kernel/postinst.d/statoverride with this
221+ content, devised by Kees Cook (@kees) in
222+ https://bugs.launchpad.net/ubuntu/+source/linux/+bug/759725/comments/3 ::
223+
224+ #!/bin/sh
225+ version="$1"
226+ # passing the kernel version is required
227+ [ -z "${version}" ] && exit 0
228+ dpkg-statoverride --update --add root root 0644 /boot/vmlinuz-${version}
229+
230+ 2. Set executable permissions::
231+
232+ sudo chmod +x /etc/kernel/postinst.d/statoverride
233+
234+ See also these links for a complete discussion:
235+
236+ - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/759725
237+ - https://bugzilla.redhat.com/show_bug.cgi?id=1670790
238+ - https://bugs.launchpad.net/ubuntu/+source/libguestfs/+bug/1813662/comments/24
239+
240+
241+ Alternatives
242+ ------------
243+
244+ These other tools are related and were considered before creating ExtractCode:
245+
246+ These tools provide built-in, original extraction capabilities:
247+
248+ - https://libarchive.org/ (integrated in ExtractCode) (BSD license)
249+ - https://www.7-zip.org/ (integrated in ExtractCode) (LGPL license)
250+ - https://theunarchiver.com/command-line (maintenance status unknown) (LGPL license)
251+
252+ These tools are command line tools wrapping other extraction tools and are
253+ similar to ExtractCode but with different goals:
254+
255+ - https://github.com/wummel/patool (wrapper on many CLI tools) (GPL license)
256+ - https://github.com/dtrx-py/dtrx (wrapper on a few CLI tools) (recently revived) (GPL license)