Commit e06a7fd

Automated cloud run deployment with Docker (#121)
* memory management
* trigger with docker
* lint
* lint
* unique tag
* readme
1 parent 8a8afaa commit e06a7fd

38 files changed

Lines changed: 1236 additions & 3425 deletions

Makefile

Lines changed: 0 additions & 6 deletions
```diff
@@ -11,9 +11,3 @@ tf_plan:
 
 tf_apply:
 	terraform -chdir=infra/tf init && terraform -chdir=infra/tf apply -auto-approve
-
-bigquery_export_deploy:
-	cd infra/bigquery-export && npm run build
-
-#bigquery_export_spark_deploy:
-#	cd infra/bigquery_export_spark && gcloud builds submit --region=global --tag us-docker.pkg.dev/httparchive/bigquery-spark-procedures/firestore_export:latest
```
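The removed deploy targets are superseded by the automated, Docker-based Cloud Run deployment this commit introduces ("trigger with docker", "unique tag"). A hedged sketch of the general shape of such a deploy script; the image path, service name, and region below are illustrative placeholders, not values taken from this repository's CI configuration:

```shell
#!/usr/bin/env sh
set -eu

# Illustrative values only -- the real image path/service/region live in the CI config.
IMAGE="us-docker.pkg.dev/httparchive/example-repo/dataform-service"

# A unique tag per commit, as the commit message suggests: prefer the CI-provided
# SHA, fall back to the local git HEAD, then to a literal "local".
SHA="${GITHUB_SHA:-$(git rev-parse --short HEAD 2>/dev/null || echo local)}"
TAG="${IMAGE}:${SHA}"

echo "would build and deploy ${TAG}"
# gcloud builds submit --tag "${TAG}"
# gcloud run deploy dataform-service --image "${TAG}" --region us-central1
```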

README.md

Lines changed: 74 additions & 40 deletions
````diff
@@ -6,38 +6,35 @@ This repository handles the HTTP Archive data pipeline, which takes the results
 
 The pipelines are run in Dataform service in Google Cloud Platform (GCP) and are kicked off automatically on crawl completion and other events. The code in the `main` branch is used on each triggered pipeline run.
 
-### Crawl results
+### HTTP Archive Crawl
 
 Tag: `crawl_complete`
 
-- httparchive.crawl.pages
-- httparchive.crawl.parsed_css
-- httparchive.crawl.requests
+- Crawl dataset `httparchive.crawl.*`
 
-### Core Web Vitals Technology Report
+Consumers:
 
-Tag: `crux_ready`
+- public dataset and [BQ Sharing Listing](https://console.cloud.google.com/bigquery/analytics-hub/discovery/projects/httparchive/locations/us/dataExchanges/httparchive/listings/crawl)
 
-- httparchive.core_web_vitals.technologies
+- Blink Features Report `httparchive.blink_features.usage`
 
-Consumers:
+Consumers:
 
-- [HTTP Archive Tech Report](https://httparchive.org/reports/techreport/landing)
+- [chromestatus.com](https://chromestatus.com/metrics/feature/timeline/popularity/2089)
 
-### Blink Features Report
+### HTTP Archive Technology Report
 
-Tag: `crawl_complete`
+Tag: `crux_ready`
 
-- httparchive.blink_features.features
-- httparchive.blink_features.usage
+- `httparchive.reports.cwv_tech_*` and `httparchive.reports.tech_*`
 
-Consumers:
+Consumers:
 
-- chromestatus.com - [example](https://chromestatus.com/metrics/feature/timeline/popularity/2089)
+- [HTTP Archive Tech Report](https://httparchive.org/reports/techreport/landing)
 
 ## Schedules
 
-1. [crawl-complete](https://console.cloud.google.com/cloudpubsub/subscription/detail/dataformTrigger?authuser=7&project=httparchive) PubSub subscription
+1. [crawl-complete](https://console.cloud.google.com/cloudpubsub/subscription/detail/dataform-service-crawl-complete?authuser=2&project=httparchive) PubSub subscription
 
    Tags: ["crawl_complete"]
 
@@ -49,30 +46,66 @@ Consumers:
 
 In order to unify the workflow triggering mechanism, we use [a Cloud Run function](./infra/README.md) that can be invoked in a number of ways (e.g. listen to PubSub messages), do intermediate checks and trigger the particular Dataform workflow execution configuration.
 
-## Contributing
-
-### Dataform development
-
-1. [Create new dev workspace](https://cloud.google.com/dataform/docs/quickstart-dev-environments) in Dataform.
-2. Make adjustments to the dataform configuration files and manually run a workflow to verify.
-3. Push all your changes to a dev branch & open a PR with the link to the BigQuery artifacts generated in the test workflow.
-
-#### Workspace hints
-
-1. In `workflow_settings.yaml` set `environment: dev` to process sampled data.
-2. For development and testing, you can modify variables in `includes/constants.js`, but note that these are programmatically generated.
-
-## Repository Structure
-
-- `definitions/` - Contains the core Dataform SQL definitions and declarations
-  - `output/` - Contains the main pipeline transformation logic
-  - `declarations/` - Contains referenced tables/views declarations and other resources definitions
-- `includes/` - Contains shared JavaScript utilities and constants
-- `infra/` - Infrastructure code and deployment configurations
-  - `dataform-trigger/` - Cloud Run function for workflow automation
-  - `tf/` - Terraform configurations
-  - `bigquery-export/` - BigQuery export configurations
-- `docs/` - Additional documentation
+## Cloud resources overview
+
+```mermaid
+graph TB;
+    subgraph Cloud Run
+        dataform-service[dataform-service service]
+        bigquery-export[bigquery-export job]
+    end
+
+    subgraph PubSub
+        crawl-complete[crawl-complete topic]
+        dataform-service-crawl-complete[dataform-service-crawl-complete subscription]
+        crawl-complete --> dataform-service-crawl-complete
+    end
+
+    dataform-service-crawl-complete --> dataform-service
+
+    subgraph Cloud_Scheduler
+        bq-poller-crux-ready[bq-poller-crux-ready Poller Scheduler Job]
+        bq-poller-crux-ready --> dataform-service
+    end
+
+    subgraph Dataform
+        dataform[Dataform Repository]
+        dataform_release_config[dataform Release Configuration]
+        dataform_workflow[dataform Workflow Execution]
+    end
+
+    dataform-service --> dataform[Dataform Repository]
+    dataform --> dataform_release_config
+    dataform_release_config --> dataform_workflow
+
+    subgraph BigQuery
+        bq_jobs[BigQuery jobs]
+        bq_datasets[BigQuery table updates]
+        bq_jobs --> bq_datasets
+    end
+
+    dataform_workflow --> bq_jobs
+
+    bq_jobs --> bigquery-export
+
+    subgraph Monitoring
+        cloud_run_logs[Cloud Run logs]
+        dataform_logs[Dataform logs]
+        bq_logs[BigQuery logs]
+        alerting_policies[Alerting Policies]
+        slack_notifications[Slack notifications]
+
+        cloud_run_logs --> alerting_policies
+        dataform_logs --> alerting_policies
+        bq_logs --> alerting_policies
+        alerting_policies --> slack_notifications
+    end
+
+    dataform-service --> cloud_run_logs
+    dataform_workflow --> dataform_logs
+    bq_jobs --> bq_logs
+    bigquery-export --> cloud_run_logs
+```
 
 ## Development Setup
 
@@ -86,6 +119,7 @@ In order to unify the workflow triggering mechanism, we use [a Cloud Run functio
 
 - `npm run format` - Format code using Standard.js, fix Markdown issues, and format Terraform files
 - `npm run lint` - Run linting checks on JavaScript, Markdown files, and compile Dataform configs
+- `make tf_apply` - Apply Terraform configurations
 
 ## Code Quality
 
````
dataform.md

Lines changed: 53 additions & 0 deletions
```diff
@@ -0,0 +1,53 @@
+# Dataform
+
+Runs the batch processing workflows. There are two Dataform repositories for [development](https://console.cloud.google.com/bigquery/dataform/locations/us-central1/repositories/crawl-data-test/details/workspaces?authuser=7&project=httparchive) and [production](https://console.cloud.google.com/bigquery/dataform/locations/us-central1/repositories/crawl-data/details/workspaces?authuser=7&project=httparchive).
+
+The test repository is used [for development and testing purposes](https://cloud.google.com/dataform/docs/workspaces) and is not connected to the rest of the pipeline infra.
+
+The pipeline can be [run manually](https://cloud.google.com/dataform/docs/code-lifecycle) from the Dataform UI.
+
+[Configuration](./tf/dataform.tf)
+
+## Dataform Development Workspace
+
+1. [Create a new dev workspace](https://cloud.google.com/dataform/docs/quickstart-dev-environments) in the test Dataform repository.
+2. Make adjustments to the Dataform configuration files and manually run a workflow to verify.
+3. Push all your changes to a dev branch & open a PR with the link to the BigQuery artifacts generated in the test workflow.
+
+*Some useful hints:*
+
+1. In the workflow settings vars, set `dev_name: dev` to process sampled data in a dev workspace.
+2. Change the `current_month` variable to a month in the past. This may be helpful for testing pipelines based on `chrome-ux-report` data.
+3. The `definitions/extra/test_env.sqlx` script helps to set up the tables required to run pipelines in a dev workspace. It's disabled by default.
+
+## Workspace hints
+
+1. In `workflow_settings.yaml` set `environment: dev` to process sampled data.
+2. For development and testing, you can modify variables in `includes/constants.js`, but note that these are programmatically generated.
+
+## Repository Structure
+
+- `definitions/` - Contains the core Dataform SQL definitions and declarations
+  - `output/` - Contains the main pipeline transformation logic
+  - `declarations/` - Contains referenced tables/views declarations and other resource definitions
+- `includes/` - Contains shared JavaScript utilities and constants
+- `infra/` - Infrastructure code and deployment configurations
+  - `bigquery-export/` - BigQuery export service
+  - `dataform-service/` - Cloud Run function for Dataform workflow automation
+  - `tf/` - Terraform configurations
+- `docs/` - Additional documentation
+
+## GitHub to Dataform connection
+
+A GitHub PAT is saved to a [Secret Manager secret](https://console.cloud.google.com/security/secret-manager/secret/GitHub_max-ostapenko_dataform_PAT/versions?authuser=7&project=httparchive).
+
+- repository: HTTPArchive/dataform
+- permissions:
+  - Commit statuses: read
+  - Contents: read, write
+
+## Monitoring
+
+- [Production Dataform workflow execution logs](https://console.cloud.google.com/bigquery/dataform/locations/us-central1/repositories/crawl-data/details/workflows?authuser=7&project=httparchive)
+
+- [Dataform Workflow Invocation Failed](https://console.cloud.google.com/monitoring/alerting/policies/16526940745374967367?authuser=7&project=httparchive) policy
```
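The workspace hints above reference variables set in `workflow_settings.yaml`. A hypothetical sketch of how such a dev override might look; only the `environment`, `dev_name`, and `current_month` keys are mentioned in this document, and the remaining values are illustrative, not taken from the repository:

```yaml
# Hypothetical workflow_settings.yaml sketch for a dev workspace.
# Only environment/dev_name/current_month come from the hints in this doc;
# project/location values are placeholders.
defaultProject: httparchive
defaultLocation: US
vars:
  environment: dev             # process sampled data (per "Workspace hints")
  dev_name: dev                # per "Some useful hints", set in workflow settings vars
  current_month: "2024-06-01"  # a past month, useful for chrome-ux-report-based pipelines
```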

docs/infrastructure.md

Lines changed: 0 additions & 132 deletions
This file was deleted.
