Commit 73f65a4

Merge pull request #325 from daniellienert/feature/index-assets-2
FEATURE: Implement asset indexing using attachment-ingest plugin
2 parents d9fe3be + 525084e commit 73f65a4

3 files changed

Lines changed: 132 additions & 7 deletions

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
+<?php
+declare(strict_types=1);
+
+namespace Flowpack\ElasticSearch\ContentRepositoryAdaptor\AssetExtraction;
+
+/*
+ * This file is part of the Flowpack.ElasticSearch.ContentRepositoryAdaptor package.
+ *
+ * (c) Contributors of the Neos Project - www.neos.io
+ *
+ * This package is Open Source Software. For the full copyright and license
+ * information, please view the LICENSE file which was distributed with this
+ * source code.
+ */
+
+use Neos\Flow\Annotations as Flow;
+use Neos\ContentRepository\Search\AssetExtraction\AssetExtractorInterface;
+use Neos\ContentRepository\Search\Dto\AssetContent;
+use Flowpack\ElasticSearch\ContentRepositoryAdaptor\ElasticSearchClient;
+use Neos\Flow\Log\Utility\LogEnvironment;
+use Neos\Media\Domain\Model\AssetInterface;
+use Neos\Utility\Arrays;
+use Psr\Log\LoggerInterface;
+
+/**
+ * @Flow\Scope("singleton")
+ */
+class IngestAttachmentAssetExtractor implements AssetExtractorInterface
+{
+    /**
+     * @Flow\Inject
+     * @var ElasticSearchClient
+     */
+    protected $elasticsearchClient;
+
+    /**
+     * @Flow\Inject
+     * @var LoggerInterface
+     */
+    protected $logger;
+
+    /**
+     * Takes an asset and extracts content and meta data.
+     *
+     * @param AssetInterface $asset
+     * @return AssetContent
+     * @throws \Flowpack\ElasticSearch\Transfer\Exception
+     * @throws \Flowpack\ElasticSearch\Transfer\Exception\ApiException
+     * @throws \Neos\Flow\Http\Exception
+     */
+    public function extract(AssetInterface $asset): AssetContent
+    {
+        $request = [
+            'pipeline' => [
+                'description' => 'Attachment Extraction',
+                'processors' => [
+                    [
+                        'attachment' => [
+                            'field' => 'neos_asset',
+                            'indexed_chars' => 100000,
+                            'ignore_missing' => true,
+                        ]
+                    ]
+                ]
+            ],
+            'docs' => [
+                [
+                    '_source' => [
+                        'neos_asset' => $this->getAssetContent($asset)
+                    ]
+                ]
+            ]
+        ];
+
+        $result = $this->elasticsearchClient->request('POST', '_ingest/pipeline/_simulate', [], json_encode($request))->getTreatedContent();
+        $extractedAsset = Arrays::getValueByPath($result, 'docs.0.doc._source.attachment');
+
+        $this->logger->debug(sprintf('Extracted asset %s of type %s. Extracted %s characters of content', $asset->getResource()->getFilename(), $extractedAsset['content_type'], $extractedAsset['content_length']), LogEnvironment::fromMethodName(__METHOD__));
+
+        return new AssetContent(
+            $extractedAsset['content'] ?? '',
+            $extractedAsset['title'] ?? '',
+            $extractedAsset['name'] ?? '',
+            $extractedAsset['author'] ?? '',
+            $extractedAsset['keywords'] ?? '',
+            $extractedAsset['date'] ?? '',
+            $extractedAsset['content_type'] ?? '',
+            $extractedAsset['content_length'] ?? '',
+            $extractedAsset['language'] ?? ''
+        );
+    }
+
+    /**
+     * @param AssetInterface $asset
+     * @return null|string
+     */
+    protected function getAssetContent(AssetInterface $asset): ?string
+    {
+        $stream = $asset->getResource()->getStream();
+        stream_filter_append($stream, 'convert.base64-encode');
+        $result = stream_get_contents($stream);
+        return $result !== false ? $result : null;
+    }
+}
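
The request body built in `extract()` and the base64 stream encoding in `getAssetContent()` can be tried outside of Flow. The following is a minimal sketch, not the package's code: the function names `base64EncodeStream` and `buildSimulateRequest` are hypothetical, and an in-memory stream is assumed in place of `$asset->getResource()->getStream()`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for getAssetContent(): reads a stream through
 * PHP's built-in convert.base64-encode filter.
 *
 * @param resource $stream
 */
function base64EncodeStream($stream): ?string
{
    // The filter encodes the data as it is read from the stream.
    stream_filter_append($stream, 'convert.base64-encode', STREAM_FILTER_READ);
    $result = stream_get_contents($stream);
    return $result !== false ? $result : null;
}

/**
 * Builds the body for POST _ingest/pipeline/_simulate, mirroring the
 * array used in extract().
 */
function buildSimulateRequest(?string $base64Content): array
{
    return [
        'pipeline' => [
            'description' => 'Attachment Extraction',
            'processors' => [
                ['attachment' => [
                    'field' => 'neos_asset',
                    'indexed_chars' => 100000,
                    'ignore_missing' => true,
                ]],
            ],
        ],
        'docs' => [
            ['_source' => ['neos_asset' => $base64Content]],
        ],
    ];
}

// Assumption: an in-memory stream stands in for the asset's resource stream.
$stream = fopen('php://memory', 'r+');
fwrite($stream, 'Hello attachment');
rewind($stream);

$body = json_encode(buildSimulateRequest(base64EncodeStream($stream)));
fclose($stream);
```

Because the pipeline is only simulated, nothing is stored in an index; the extracted fields come back in the response and the pipeline definition travels with each request.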

Configuration/Objects.yaml

Lines changed: 3 additions & 0 deletions
@@ -4,6 +4,9 @@ Neos\ContentRepository\Search\Search\QueryBuilderInterface:
 Neos\ContentRepository\Search\Indexer\NodeIndexerInterface:
   className: 'Flowpack\ElasticSearch\ContentRepositoryAdaptor\Indexer\NodeIndexer'
 
+Neos\ContentRepository\Search\AssetExtraction\AssetExtractorInterface:
+  className: 'Flowpack\ElasticSearch\ContentRepositoryAdaptor\AssetExtraction\IngestAttachmentAssetExtractor'
+
 Flowpack\ElasticSearch\ContentRepositoryAdaptor\Driver\QueryInterface:
   scope: prototype
   factoryObjectName: 'Flowpack\ElasticSearch\ContentRepositoryAdaptor\Factory\QueryFactory'

README.md

Lines changed: 25 additions & 7 deletions
@@ -358,7 +358,7 @@ For more information on Elasticsearch's Date Formats,
 
 ### Working with Assets / Attachments
 
-If you want to index attachments, you need to install the [Elasticsearch Attachment Plugin](https://github.com/elastic/elasticsearch-mapper-attachments).
+If you want to index attachments, you need to install the [Elasticsearch Ingest-Attachment Plugin](https://www.elastic.co/guide/en/elasticsearch/plugins/master/ingest-attachment.html).
 Then, you can add the following to your `Settings.yaml`:
 
 ```yaml
@@ -368,15 +368,33 @@ Neos:
       defaultConfigurationPerType:
         'Neos\Media\Domain\Model\Asset':
           elasticSearchMapping:
-            type: attachment
-          indexing: ${Indexing.indexAsset(value)}
+            type: text
+          indexing: ${Indexing.extractAssetContent(value)}
+```
 
-        'array<Neos\Media\Domain\Model\Asset>':
-          elasticSearchMapping:
-            type: attachment
-          indexing: ${Indexing.indexAsset(value)}
+or add the attachment's content to a fulltext field in your NodeType configuration:
+
+```yaml
+  properties:
+    file:
+      type: 'Neos\Media\Domain\Model\Asset'
+      ui:
+      search:
+        fulltextExtractor: ${Indexing.extractInto('text', Indexing.extractAssetContent(value))}
 ```
 
+By default, `Indexing.extractAssetContent(value)` returns the asset content. Pass a field name as the second parameter to return asset metadata instead. The field parameter accepts one of the following: `content, title, name, author, keywords, date, content_type, content_length, language`.
+
+With that, you can, for example, add the keywords of a file to a higher-boosted field:
+
+```yaml
+  properties:
+    file:
+      type: 'Neos\Media\Domain\Model\Asset'
+      ui:
+      search:
+        fulltextExtractor: ${Indexing.extractInto('h2', Indexing.extractAssetContent(value, 'keywords'))}
+```
 
 
 # Query Data
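
The response handling in `extract()` and the field selection described in the README change can be sketched in plain PHP. This is an illustration under stated assumptions, not the Eel helper's actual implementation: `valueByPath` is a simplified stand-in for `Arrays::getValueByPath`, `pickAttachmentField` is a hypothetical name, and the `$result` array is made-up sample data shaped like a simulate response (`docs` → `doc` → `_source`).

```php
<?php
declare(strict_types=1);

/** Simplified stand-in for Neos\Utility\Arrays::getValueByPath() with a dotted path. */
function valueByPath(array $data, string $path)
{
    foreach (explode('.', $path) as $key) {
        if (!is_array($data) || !array_key_exists($key, $data)) {
            return null;
        }
        $data = $data[$key];
    }
    return $data;
}

/** Hypothetical field selection, using the field list from the README. */
function pickAttachmentField(?array $attachment, string $field = 'content'): string
{
    $allowed = ['content', 'title', 'name', 'author', 'keywords', 'date',
        'content_type', 'content_length', 'language'];
    if (!in_array($field, $allowed, true)) {
        throw new InvalidArgumentException(sprintf('Unknown field "%s"', $field));
    }
    // Missing fields fall back to an empty string, as in extract().
    return (string)($attachment[$field] ?? '');
}

// Illustrative sample only, not a recorded Elasticsearch response.
$result = [
    'docs' => [
        ['doc' => ['_source' => ['attachment' => [
            'content' => 'Hello attachment',
            'content_type' => 'text/plain',
            'content_length' => 16,
            'keywords' => 'neos, search',
        ]]]],
    ],
];

$attachment = valueByPath($result, 'docs.0.doc._source.attachment');
$content = pickAttachmentField($attachment);
$keywords = pickAttachmentField($attachment, 'keywords');
$title = pickAttachmentField($attachment, 'title');
```

Falling back to an empty string for absent fields matches the `?? ''` defaults passed to `AssetContent` in the extractor, so a file without, say, a title still yields an indexable value.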
