@@ -6,40 +6,89 @@ ExtractCode
66- homepage_url: https://github.com/nexB/extractcode
77- keywords: archive, extraction, libarchive, 7zip, scancode-toolkit, extractcode
88
9+ Supports Windows, Linux and macOS on 64-bit processors and Python 3.6 to 3.9.
910
10- ExtractCode is a universal archive extractor. It uses behind the scenes
11- multiple tools such as:
11+
12+ **ExtractCode is a (mostly) universal archive extractor.**
13+
14+ Install with::
15+
16+ pip install extractcode[full]
17+
18+
19+ Why another extractor?
20+ ----------------------
21+
22+ **It will extract!**
23+
24+ ExtractCode will extract things where other extractors may fail.
25+
26+ - Say you want to extract the tarball of the Linux kernel source code on Windows.
27+ It contains paths that are the same when ignoring the case and therefore will
28+ not extract OK on Windows: some files may be munged or the extraction may fail.
29+
30+ - Or a tarball (on any OS) may contain multiple times the exact same path. In
31+ these cases the paths showing up earlier in the archive may be "hidden" and
32+ overwritten by the same path showing up later in the archive giving the
33+ impression that there is only one file.
34+
35+ - Or an archive may be damaged a little but most files can still be extracted.
36+
37+ - Or the extracted files have permissions such that you cannot read them, or
38+ they are not owned by you.
39+
40+ - Or the archive may contain weird paths including relative paths that may be
41+ problematic to extract.
42+
43+ - Or the archive may contain special file types (character/device files) that
44+ may be problematic to extract.
45+
46+ - Or an archive may be a virtual disk or some file system(s) images that would
47+ typically need to be mounted to be accessed, and may require root access
48+ and guesswork to find out which partition and filesystem are at play and
49+ which driver to use.
50+
51+ In all these cases, ExtractCode will extract and try hard to do the right thing to
52+ obtain the actual archived content when other tools may fail.
53+
54+ It can also recursively extract any type of (nested) archives-in-archives.
55+
56+ As a downside, the extracted content may not be exactly what would be expected
57+ to use the contained files: for instance ... but this is perfectly OK for
58+ file content analysis for software composition or forensic analysis.
59+
60+ Behind the scenes, ExtractCode uses multiple tools such as:
1261
1362- the Python standard library,
1463- a custom ctypes binding to libarchive,
15- - the 7zip command line, and
64+ - the 7zip command line tool, and
1665- optionally libguestfs on Linux.
1766
18- With these it is possible to extract a large number of common and
19-
20- less common archives and compressed files. ExtractCode tries to extract things
21- in the same way on all OSes, including auto-renaming files that would not have
22- valid names on certain filesystems or when there are multiple copies of the same
23- path in a given archive (which is possible in a tar).
67+ With these, it is possible to extract a large number of common and less common
68+ archives and compressed file types. ExtractCode tries to extract things in the
69+ same way on all supported OSes, including auto-renaming files that would have
70+ invalid, non-extractible names on certain filesystems or when there are multiple
71+ copies of the same path in a given archive (which is possible in a tar).
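
The duplicate-path case can be reproduced with the Python standard library alone. The following is a minimal sketch, not ExtractCode's own code: ``build_tar_with_duplicate_paths`` and ``unique_names`` are hypothetical helpers, and the ``_1``, ``_2`` suffix scheme is an assumption chosen for illustration. Names are compared case-insensitively so that paths differing only by case are also kept apart.

```python
import io
import tarfile


def build_tar_with_duplicate_paths():
    """Return the bytes of a tar archive that stores the same path twice."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for content in (b"first version\n", b"second version\n"):
            info = tarfile.TarInfo(name="docs/readme.txt")
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def unique_names(names):
    """Add a numeric suffix to every repeated path (hypothetical scheme).

    Paths are compared in lowercase so that "Makefile" and "makefile"
    do not collide on a case-insensitive filesystem.
    """
    seen = {}
    renamed = []
    for name in names:
        key = name.lower()
        count = seen.get(key, 0)
        seen[key] = count + 1
        renamed.append(name if count == 0 else f"{name}_{count}")
    return renamed


archive = build_tar_with_duplicate_paths()
with tarfile.open(fileobj=io.BytesIO(archive)) as tar:
    stored = tar.getnames()

print(stored)                # both entries carry the same path
print(unique_names(stored))  # the second entry gets a distinct name
```

A naive extraction of this archive leaves a single ``docs/readme.txt`` holding the second content. The sketch does not guard against a suffixed name clashing with a path that already exists in the archive; a real implementation would need to.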
2472
25- The extraction is driven from a "voting" system that considers the
26- file extension(s) and name, the filetype and mimetype (using a ctypes
27- binding to libmagic) to select the most appropriate extractor or
28- decompressor function. It can handle multi-level archives such as tar.gz and
29- can extract recursively nested archives.
73+ The extraction is driven from a "voting" system that considers the file
74+ extension(s) and name, the filetype and mimetype (using a ctypes binding to
75+ libmagic) to select the most appropriate extractor or decompressor function.
76+ It can handle multi-level archives such as tar.gz and can extract recursively
77+ any nested archives.
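
To make the "voting" idea concrete, here is a small sketch under stated assumptions. It is not ExtractCode's implementation: the lookup tables, the handler names and the ``collect_votes`` and ``pick_extractor`` functions are all hypothetical, and the mimetype and filetype strings are passed in by hand instead of being detected with libmagic.

```python
from collections import Counter

# Hypothetical evidence tables: each maps one kind of clue to a handler name.
EXTENSION_VOTES = {".tar": "tar", ".gz": "gzip", ".zip": "zip"}
MIME_VOTES = {
    "application/x-tar": "tar",
    "application/gzip": "gzip",
    "application/zip": "zip",
}
FILETYPE_KEYWORDS = {
    "tar archive": "tar",
    "gzip compressed": "gzip",
    "zip archive": "zip",
}


def collect_votes(filename, mimetype, filetype):
    """Return one vote per piece of evidence that matches a known handler."""
    votes = []
    lowered_name = filename.lower()
    for extension, handler in EXTENSION_VOTES.items():
        if lowered_name.endswith(extension):
            votes.append(handler)
    if mimetype in MIME_VOTES:
        votes.append(MIME_VOTES[mimetype])
    lowered_type = filetype.lower()
    for keyword, handler in FILETYPE_KEYWORDS.items():
        if keyword in lowered_type:
            votes.append(handler)
    return votes


def pick_extractor(votes):
    """Return the handler with the most votes, or None without any vote."""
    tally = Counter(votes)
    if not tally:
        return None
    return tally.most_common(1)[0][0]


outer = pick_extractor(
    collect_votes("code.tar.gz", "application/gzip", "gzip compressed data"))
inner = pick_extractor(
    collect_votes("code.tar", "application/x-tar", "POSIX tar archive"))
print(outer, inner)
```

In this sketch a ``tar.gz`` file is handled in two passes: the outer layer is voted to be gzip, and the decompressed payload is then voted to be a tar.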
3078
3179Visit https://aboutcode.org and https://github.com/nexB/ for support and download.
3280
81+
3382We run CI tests on:
3483
3584 - Azure pipelines https://dev.azure.com/nexB/extractcode/_build
3685
37- We run CI tests on:
3886
39- - Azure pipelines https://dev.azure.com/nexB/extractcode/_build
87+ Installation
88+ ------------
4089
4190To install this package with its full capability (where the binaries for
42- 7zip and libarchive are installed), use the `full ` option::
91+ 7zip and libarchive are installed), use the `full` extra option::
4392
4493 pip install extractcode[full]
4594
@@ -48,45 +97,47 @@ system, use the `minimal` option::
4897
4998 pip install extractcode
5099
51- In this case, you will need to provide a working libarchive and 7zip
52- available in one of these ways:
100+ In this case, you will need to provide a working and compatible libarchive and
101+ 7zip installed and configured in one of these ways such that ExtractCode can
102+ find them:
53103
54- - **a typecode-libarchive and typecode-7z plugin **: See the standard ones at
104+ - **a typecode-libarchive and typecode-7z plugin**: See the standard ones at
55105 https://github.com/nexB/scancode-plugins/tree/main/builtins
56106 These can either bundle a libarchive library, a 7z executable or expose a
57107 system-installed libraries.
58108 It does so by providing plugin entry points as ``scancode_location_provider ``
59109 for ``extractcode_libarchive `` that should point to a ``LocationProviderPlugin ``
60- subclass with a ``get_locations() `` method that must return a mapping with this key:
110+ subclass with a ``get_locations()`` method that must return a mapping with
111+ this key:
61112
62- - 'extractcode.libarchive.dll': the absolute path to a libarchive DLL
113+ - 'extractcode.libarchive.dll': the absolute path to a **libarchive** shared object/DLL
63114
64115 See for example:
65116
66117 - https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_libarchive-linux/setup.py#L40
67118 - https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_libarchive-linux/src/extractcode_libarchive/__init__.py#L17
68119
69- And the ``scancode_location_provider `` for ``extractcode_7zip `` should point
70- to a ``LocationProviderPlugin `` subclass with a ``get_locations() `` method that must
71- return a mapping with this key:
120+ And in the same way, the ``scancode_location_provider`` for ``extractcode_7zip``
121+ should point to a ``LocationProviderPlugin`` subclass with a ``get_locations()``
122+ method that must return a mapping with this key:
72123
73- - 'extractcode.sevenzip.exe': the absolute path to a 7zip executable
124+ - 'extractcode.sevenzip.exe': the absolute path to a **7zip** executable
74125
75126 See for example:
76127
77128 - https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_7z-linux/setup.py#L40
78129 - https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_7z-linux/src/extractcode_7z/__init__.py#L18
79130
80- - **environment variables **:
131+ - use **environment variables** to point to installed binaries:
81132
82133 - EXTRACTCODE_LIBARCHIVE_PATH: the absolute path to a libarchive DLL
83134 - EXTRACTCODE_7Z_PATH: the absolute path to a 7zip executable
84135
85136
86- - **a system-installed libarchive and 7zip executable in the system PATH **:
137+ - **a system-installed libarchive and 7zip executable** available in the system **PATH**.
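
As an illustration of the plugin option, the shape of a location provider could look like the sketch below. This is an assumption-laden outline, not the code of the linked plugins: the ``LocationProviderPlugin`` class defined here is a local stand-in so that the example runs on its own (a real plugin would subclass the class provided by the plugin framework), ``SystemLibarchivePaths`` is a hypothetical name, and the library path is only a typical Debian/Ubuntu location. The mapping key is the one documented above.

```python
from os.path import abspath, isabs


class LocationProviderPlugin:
    """Local stand-in for the real plugin base class (assumption)."""

    def get_locations(self):
        raise NotImplementedError


class SystemLibarchivePaths(LocationProviderPlugin):
    """Hypothetical provider that exposes a system-installed libarchive."""

    # Assumption: a common location on 64-bit Debian/Ubuntu systems.
    library_path = "/usr/lib/x86_64-linux-gnu/libarchive.so.13"

    def get_locations(self):
        # The key is the one ExtractCode expects for libarchive.
        return {"extractcode.libarchive.dll": abspath(self.library_path)}


locations = SystemLibarchivePaths().get_locations()
print(locations)
```

A provider for 7zip would follow the same pattern and return the ``extractcode.sevenzip.exe`` key instead.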
87138
88139
89- The supported versions are:
140+ The supported versions of the binary tools are:
90141
91142- libarchive 3.5.x
92143- 7zip 16.5.x
@@ -95,10 +146,9 @@ The supported versions are:
95146Development
96147-----------
97148
98-
99149To set up the development environment::
100150
101- source configure
151+ source configure --dev
102152
103153
104154To run unit tests::
@@ -116,18 +166,43 @@ To run the command line tool in the activated environment::
116166 ./extractcode -h
117167
118168
169+ Configuration with environment variables
170+ ----------------------------------------
171+
172+ ExtractCode will use these environment variables if set:
173+
174+ - EXTRACTCODE_LIBARCHIVE_PATH: the path to the ``libarchive.so`` libarchive
175+ shared library used to support some of the archive formats. If not provided,
176+ ExtractCode will look for a plugin-provided libarchive library path. See
177+ https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.
178+ If no plugin contributes libarchive, then a final attempt is made to look for
179+ it in the PATH using standard DLL loading techniques.
180+
181+ - EXTRACTCODE_7Z_PATH: the path to the ``7z`` 7zip executable used to support
182+ some of the archive formats. If not provided, ExtractCode will look for a
183+ plugin-provided 7z executable path. See
184+ https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.
185+ If no plugin contributes 7z, then a final attempt is made to look for
186+ it in the PATH.
187+
188+ - EXTRACTCODE_GUESTFISH_PATH: the path to the ``guestfish`` tool from
189+ libguestfs to use to extract VM images. If not provided, ExtractCode will look
190+ in the PATH for an installed ``guestfish`` executable instead.
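
The lookup order described for ``EXTRACTCODE_7Z_PATH`` (environment variable first, then a plugin-provided location, then the PATH) can be sketched as follows. ``resolve_7z`` is a hypothetical helper written for this illustration and not a function of ExtractCode; the plugin mapping is passed in as a plain dictionary, and the PATH lookup assumes the executable is named ``7z``.

```python
import os
import shutil


def resolve_7z(plugin_locations=None, environ=None):
    """Return the 7z path using the documented precedence, or None.

    plugin_locations: mapping as returned by a plugin get_locations().
    environ: mapping of environment variables (defaults to os.environ).
    """
    environ = os.environ if environ is None else environ

    # 1. An explicit environment variable wins.
    from_env = environ.get("EXTRACTCODE_7Z_PATH")
    if from_env:
        return from_env

    # 2. Otherwise use a location contributed by a plugin, if any.
    from_plugin = (plugin_locations or {}).get("extractcode.sevenzip.exe")
    if from_plugin:
        return from_plugin

    # 3. Final attempt: search the PATH.
    return shutil.which("7z")


plugin = {"extractcode.sevenzip.exe": "/opt/plugin/bin/7z"}
print(resolve_7z(plugin, {"EXTRACTCODE_7Z_PATH": "/opt/custom/7z"}))
print(resolve_7z(plugin, {}))
```

Passing ``environ`` explicitly keeps the function easy to test without touching the real process environment.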
191+
192+
193+
119194Adding support for VM images extraction
120195---------------------------------------
121196
122- Adding support for VM images requires the manual installation of libguestfs
123- tools system package. This is suport on Linux only. On Debian and Ubuntu you can
124- use this::
197+ Adding support for VM images requires the manual installation of the
198+ libguestfs-tools system package. This is supported only on Linux.
199+ On Debian and Ubuntu you can use this command::
125200
126201 sudo apt-get install libguestfs-tools
127202
128203
129204On Ubuntu only, an additional manual step is required as the kernel executable
130- file cannot be read as required by libguestfish.
205+ file cannot be read by users as required by libguestfs.
131206
132207Run this command as a temporary and immediate fix::
133208
@@ -136,10 +211,9 @@ Run this command as a temporary and immediate fix::
136211 do sudo dpkg-statoverride --add --update root root 0644 /boot/vmlinuz-$k
137212 done
138213
139-
140- But you likely want both this temporary fix and a permanent fix; otherwise each
141- kernel update will revert to the default permissions and extractcode will stop
142- working for VM images extraction.
214+ You likely want both this temporary fix and a more permanent fix; otherwise each
215+ kernel update will revert to the default permissions and ExtractCode will stop
216+ working for VM images extraction.
143217
144218Therefore follow these instructions:
145219
@@ -164,26 +238,19 @@ See also these links for a complete discussion:
164238 - https://bugs.launchpad.net/ubuntu/+source/libguestfs/+bug/1813662/comments/24
165239
166240
167- Configuration with environment variables
168- ----------------------------------------
241+ Alternatives
242+ ------------
169243
170- ExtractCode will use these environment variables if set :
244+ These other tools are related and were considered before creating ExtractCode:
171245
172- - EXTRACTCODE_GUESTFISH_PATH : the path to the ``guestfish `` tool from
173- libguestfs to use to extract VM images. If not provided, ExtractCode will look
174- in the PATH for an installed ``guestfish `` executable instead.
246+ These tools provide built-in, original extraction capabilities:
175247
176- - EXTRACTCODE_LIBARCHIVE_PATH : the path to the ``libarchive.so `` libarchive
177- shared library used to support some of the archive formats. If not provided,
178- ExtractCode will look for a plugin-provided libarchive library path. See
179- https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.
180- If no plugin contributes libarchive, then a final attempt is made to look for
181- it in the PATH using standard DLL loading techniques.
248+ - https://libarchive.org/ (integrated in ExtractCode) (BSD license)
249+ - https://www.7-zip.org/ (integrated in ExtractCode) (LGPL license)
250+ - https://theunarchiver.com/command-line (maintenance status unknown) (LGPL license)
182251
183- - EXTRACTCODE_7Z_PATH : the path to the ``7z `` 7zip executable used to support
184- some of the archive formats. If not provided, ExtractCode will look for a
185- plugin-provided 7z executable path. See
186- https://github.com/nexB/scancode-plugins/tree/main/builtins for such plugins.
187- If no plugin contributes 7z, then a final attempt is made to look for
188- it in the PATH.
189-
252+ These tools are command line tools wrapping other extraction tools and are
253+ similar to ExtractCode but with different goals:
254+
255+ - https://github.com/wummel/patool (wrapper on many CLI tools) (GPL license)
256+ - https://github.com/dtrx-py/dtrx (wrapper on a few CLI tools) (recently revived) (GPL license)