Commit ab584a6
docs
1 parent 8a294e8 commit ab584a6

reports.md

# HTTP Archive Dynamic Reports

This document describes the HTTP Archive dynamic reports system, which automatically generates standardized reports from HTTP Archive crawl data.

## Overview

The dynamic reports system generates Dataform operations that:

1. Calculate metrics from HTTP Archive crawl data
2. Store results in BigQuery tables partitioned by date and clustered by metric/lens/client
3. Export data to Cloud Storage as JSON files for consumption by external systems

## Architecture

### Core Components

- **`includes/reports.js`** - Defines metrics and lenses
- **`definitions/output/reports/reports_dynamic.js`** - Generates Dataform operations dynamically
- **`includes/constants.js`** - Provides shared constants and the `DataformTemplateBuilder`

## Supported Features

### SQL Types

The system supports two types of SQL queries:

#### 1. Histogram

- **Purpose**: Distribution analysis with binned data
- **Output**: Contains `bin`, `volume`, `pdf`, `cdf` columns
- **Use case**: Page weight distributions, performance metric distributions
- **Export path**: `reports/{date_folder}/{metric_id}_test.json`

#### 2. Timeseries

- **Purpose**: Trend analysis over time
- **Output**: Contains percentile data (p10, p25, p50, p75, p90) with timestamps
- **Use case**: Performance trends, adoption over time
- **Export path**: `reports/{metric_id}_test.json`

### Lenses (Data Filters)

Lenses allow filtering data by different criteria:

- **`all`** - No filter, all pages
- **`top1k`** - Top 1,000 ranked sites
- **`top10k`** - Top 10,000 ranked sites
- **`top100k`** - Top 100,000 ranked sites
- **`top1m`** - Top 1,000,000 ranked sites
- **`drupal`** - Sites using Drupal
- **`magento`** - Sites using Magento
- **`wordpress`** - Sites using WordPress

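As a rough illustration of how a lens name could map to a SQL fragment, the sketch below uses a hypothetical `LENSES` object and `applyLens` helper; neither is taken from `includes/reports.js`. Only the `top1k` filter (`AND rank <= 1000`) appears in this document. The other rank filters are assumed to follow the same pattern, and the technology lenses are left out because their SQL is not shown here.

```javascript
// Hypothetical lens map: lens name -> SQL condition appended to a WHERE clause.
// Only the top1k value comes from this document; the rest are assumptions.
const LENSES = {
  all: '',
  top1k: 'AND rank <= 1000',
  top10k: 'AND rank <= 10000',
  top100k: 'AND rank <= 100000',
  top1m: 'AND rank <= 1000000'
}

// Append a lens condition to an existing WHERE clause (illustrative helper).
function applyLens (baseWhere, lensName) {
  if (!Object.prototype.hasOwnProperty.call(LENSES, lensName)) {
    throw new Error(`Unknown lens: ${lensName}`)
  }
  const lensSql = LENSES[lensName]
  return lensSql ? `${baseWhere} ${lensSql}` : baseWhere
}

console.log(applyLens("WHERE date = '2025-07-01'", 'top1k'))
```

The `all` lens maps to an empty string so that the query is left unfiltered.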
### Date Range Processing

- Configurable start and end dates
- Processes data month by month using `constants.fnPastMonth()`
- Supports retrospective report generation

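The month-by-month walk can be sketched as follows. The signature of `constants.fnPastMonth()` is not shown in this document, so `pastMonth` below is a stand-in written under the assumption that it takes a `YYYY-MM-DD` string and returns the first day of the previous month; `monthsInRange` is likewise a hypothetical helper.

```javascript
// Stand-in for constants.fnPastMonth(); the real signature is an assumption.
// Takes 'YYYY-MM-DD' and returns the first day of the previous month.
function pastMonth (dateStr) {
  const [year, month] = dateStr.split('-').map(Number)
  // Date.UTC months are zero-based, so (month - 2) is the previous month.
  const d = new Date(Date.UTC(year, month - 2, 1))
  return d.toISOString().slice(0, 10)
}

// Walk backwards from endDate to startDate, one month per step.
// ISO date strings compare correctly as plain strings.
function monthsInRange (startDate, endDate) {
  const dates = []
  for (let date = endDate; date >= startDate; date = pastMonth(date)) {
    dates.push(date)
  }
  return dates
}

console.log(monthsInRange('2025-01-01', '2025-04-01'))
```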
## How to Add a New Dynamic Report

### Step 1: Define Your Metric

Add your metric to the `_metrics` object in `includes/reports.js`:

```javascript
const config = {
  _metrics: {
    // Existing metrics...

    myNewMetric: {
      SQL: [
        {
          type: 'histogram', // or 'timeseries'
          query: DataformTemplateBuilder.create((ctx, params) => `
            WITH pages AS (
              SELECT
                date,
                client,
                -- Your binning logic for histogram
                CAST(FLOOR(your_metric_value / bin_size) * bin_size AS INT64) AS bin
              FROM ${ctx.ref('crawl', 'pages')}
              WHERE
                date = '${params.date}'
                ${params.devRankFilter}
                ${params.lens.sql}
                AND is_root_page
                AND your_metric_value > 0
            )

            -- Your aggregation logic here
            SELECT
              *,
              SUM(pdf) OVER (PARTITION BY client ORDER BY bin) AS cdf
            FROM (
              -- Calculate probability density function
              SELECT
                *,
                volume / SUM(volume) OVER (PARTITION BY client) AS pdf
              FROM (
                SELECT
                  *,
                  COUNT(0) AS volume
                FROM pages
                WHERE bin IS NOT NULL
                GROUP BY date, client, bin
              )
            )
            ORDER BY bin, client
          `)
        }
      ]
    }
  }
}
```

118+
### Step 2: Test Your Metric
119+
120+
The metric will be automatically included in the next run of `reports_dynamic.js`. The system will generate operations for all combinations of:
121+
122+
- Your new metric
123+
- All available lenses
124+
- All SQL types you defined
125+
- The configured date range
126+
127+
### Step 3: Verify Output
128+
129+
Check that the generated operations:
130+
131+
1. Create the expected BigQuery tables
132+
2. Populate data correctly
133+
3. Export to Cloud Storage in the expected format
134+
135+
## Metric SQL Requirements
136+
137+
### Template Parameters
138+
139+
Your SQL template receives these parameters:
140+
141+
```javascript
142+
{
143+
date: '2025-07-01', // Current processing date
144+
devRankFilter: 'AND rank <= 10000', // Development filter
145+
lens: {
146+
name: 'top1k', // Lens name
147+
sql: 'AND rank <= 1000' // Lens SQL filter
148+
},
149+
metric: { id: 'myMetric', ... }, // Metric configuration
150+
sql: { type: 'histogram', ... } // SQL type configuration
151+
}
152+
```
153+
154+
### Required Columns
155+
156+
#### For Histogram Type
157+
158+
- `date` - Processing date
159+
- `client` - 'desktop' or 'mobile'
160+
- `bin` - Numeric bin value
161+
- `volume` - Count of pages in this bin
162+
- `pdf` - Probability density function value
163+
- `cdf` - Cumulative distribution function value
164+
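The relationship between `volume`, `pdf`, and `cdf` can be checked with a small worked example. The `addPdfAndCdf` helper below is hypothetical and only mirrors the window functions used in the SQL templates for a single client: `pdf` is each bin's share of the total volume, and `cdf` is the running sum of `pdf` ordered by `bin`.

```javascript
// Given per-bin page counts for one client, derive pdf and cdf the way the
// SQL templates do: pdf = volume / total, cdf = running sum of pdf by bin.
function addPdfAndCdf (rows) {
  const total = rows.reduce((sum, row) => sum + row.volume, 0)
  let running = 0
  return [...rows]
    .sort((a, b) => a.bin - b.bin)
    .map(row => {
      const pdf = row.volume / total
      running += pdf
      return { ...row, pdf, cdf: running }
    })
}

// Made-up volumes for illustration only.
const histogram = addPdfAndCdf([
  { bin: 100, volume: 30 },
  { bin: 0, volume: 50 },
  { bin: 200, volume: 20 }
])
console.log(histogram)
```

The `cdf` of the last bin should come out at (approximately) 1, which is a quick sanity check for any new histogram query.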
#### For Timeseries Type

- `date` - Processing date
- `client` - 'desktop' or 'mobile'
- `timestamp` - Unix timestamp in milliseconds
- `p10`, `p25`, `p50`, `p75`, `p90` - Percentile values

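The `timestamp` column is midnight UTC of the processing date, expressed in milliseconds. The sketch below reproduces the `UNIX_DATE(date) * 1000 * 60 * 60 * 24` arithmetic from the timeseries example in plain JavaScript; the helper names are hypothetical.

```javascript
// Milliseconds per day, matching the 1000 * 60 * 60 * 24 factor in the SQL.
const MS_PER_DAY = 1000 * 60 * 60 * 24

// Convert a 'YYYY-MM-DD' date to Unix milliseconds at midnight UTC.
function toTimestampMs (dateStr) {
  return Date.parse(`${dateStr}T00:00:00Z`)
}

// Days since 1970-01-01, the quantity BigQuery's UNIX_DATE() returns.
function unixDate (dateStr) {
  return toTimestampMs(dateStr) / MS_PER_DAY
}

console.log(unixDate('2025-07-01'), toTimestampMs('2025-07-01'))
```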
### Best Practices

1. **Filter root pages**: Always include `AND is_root_page` unless you specifically need all pages
2. **Handle null values**: Use appropriate null checks and filtering
3. **Use consistent binning**: For histograms, use logical bin sizes (e.g., 100KB increments for page weight)
4. **Optimize performance**: Use appropriate WHERE clauses and avoid expensive operations
5. **Test with dev filters**: Your queries should work with the development rank filter

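For the binning advice in particular, it can help to try the arithmetic outside SQL first. The sketch below mirrors the `FLOOR(value / bin_size) * bin_size` expression from the templates; `toBin` and `pageWeightBin` are hypothetical helpers, and the 100 KB bin size is the one suggested above.

```javascript
// Mirror of FLOOR(value / binSize) * binSize from the SQL templates:
// every value is mapped to the lower edge of its bin.
function toBin (value, binSize) {
  return Math.floor(value / binSize) * binSize
}

// Page weight in bytes grouped into 100 KB bins.
function pageWeightBin (bytes) {
  return toBin(bytes / 1024, 100)
}

console.log(pageWeightBin(250 * 1024)) // 250 KB lands in the 200 KB bin
```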
## Lenses

Each lens's SQL is a valid BigQuery `WHERE` clause condition that can be appended to the main query.

## Processing Details

### Operation Generation

For each combination of date, metric, SQL type, and lens, the system:

1. **Creates a unique operation name**: `{metricId}_{sqlType}_{date}_{lensName}`
2. **Generates BigQuery SQL** that:
   - Deletes existing data for the date/metric/lens combination
   - Inserts new calculated data
   - Exports results to Cloud Storage
3. **Tags operations** with the `crawl_complete` tag so that they are triggered on crawl completion.

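The enumeration of combinations can be sketched as nested loops. This is a minimal illustration, not the code in `reports_dynamic.js`: `operationName` and `listOperations` are hypothetical helpers, the inputs are made up, and the exact date format used inside real operation names is an assumption.

```javascript
// Build an operation name following the pattern described above.
function operationName (metricId, sqlType, date, lensName) {
  return `${metricId}_${sqlType}_${date}_${lensName}`
}

// Enumerate every date x metric x SQL type x lens combination.
function listOperations (dates, metrics, lenses) {
  const names = []
  for (const date of dates) {
    for (const [metricId, metric] of Object.entries(metrics)) {
      for (const sql of metric.SQL) {
        for (const lensName of Object.keys(lenses)) {
          names.push(operationName(metricId, sql.type, date, lensName))
        }
      }
    }
  }
  return names
}

// Illustrative inputs: 2 dates, 1 metric with 2 SQL types, 2 lenses.
const operations = listOperations(
  ['2025-06-01', '2025-07-01'],
  { jsBytes: { SQL: [{ type: 'histogram' }, { type: 'timeseries' }] } },
  { all: '', top1k: 'AND rank <= 1000' }
)
console.log(operations.length, operations[0])
```

The number of operations grows multiplicatively, so a long date range combined with many lenses and metrics can produce a large number of operations.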
### Table Structure

Reports are stored in BigQuery tables with this structure:

- **Partitioned by**: `date`
- **Clustered by**: `metric`, `lens`, `client`
- **Dataset**: `reports`
- **Naming**: `{metricId}_{sqlType}` (e.g., `bytesTotal_histogram`)

### Export Process

1. Data is calculated and stored in BigQuery
2. A `run_export_job` function exports filtered data to Cloud Storage
3. Export paths follow the pattern:
   - Histogram: `reports/[{lens}/]{date_underscore}/{metric_id}.json`
   - Timeseries: `reports/[{lens}/]{metric_id}.json`

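A path builder for these patterns might look like the sketch below. It rests on two assumptions that this document does not confirm: that the optional `{lens}` folder is omitted for the `all` lens, and that `{date_underscore}` means the date with hyphens replaced by underscores. The `exportPath` helper and its `suffix` parameter are hypothetical; the suffix is left configurable because both `.json` and `_test.json` appear in this document.

```javascript
// Build an export path following the patterns listed above.
// Assumptions: no lens folder for the 'all' lens, and date_underscore
// is the 'YYYY-MM-DD' date with '-' replaced by '_'.
function exportPath ({ sqlType, metricId, date, lensName, suffix = '.json' }) {
  const parts = ['reports']
  if (lensName && lensName !== 'all') {
    parts.push(lensName)
  }
  if (sqlType === 'histogram') {
    // Only histograms are exported per date; timeseries get a single file.
    parts.push(date.replace(/-/g, '_'))
  }
  parts.push(`${metricId}${suffix}`)
  return parts.join('/')
}

console.log(exportPath({
  sqlType: 'histogram',
  metricId: 'jsBytes',
  date: '2025-07-01',
  lensName: 'top1k'
}))
```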
### Development vs Production

- **Development**: Uses `TABLESAMPLE` and rank filters for faster processing
- **Production**: Processes full datasets
- **Environment detection**: Determined automatically from `dataform.projectConfig.vars.environment`

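One way such a switch could be written is sketched below. The `'dev'` value, the `devSettings` helper, and the 1 percent sampling rate are all assumptions made for illustration; only the `AND rank <= 10000` filter is taken from the template parameters shown earlier in this document.

```javascript
// Assumed convention: vars.environment === 'dev' marks a development run.
function devSettings (vars) {
  const isDev = vars.environment === 'dev'
  return {
    // BigQuery TABLESAMPLE clause; the 1 percent figure is illustrative only.
    tableSample: isDev ? 'TABLESAMPLE SYSTEM (1 PERCENT)' : '',
    // Same value as the devRankFilter shown under Template Parameters.
    devRankFilter: isDev ? 'AND rank <= 10000' : ''
  }
}

console.log(devSettings({ environment: 'dev' }))
console.log(devSettings({ environment: 'prod' }))
```

Returning empty strings in production means the same SQL template can be used in both environments without branching inside the query text.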
## Configuration

### Date Range

Modify the `DATE_RANGE` object in `reports_dynamic.js`:

```javascript
const DATE_RANGE = {
  startDate: '2025-01-01', // Start processing from this date
  endDate: '2025-07-01' // Process up to this date
}
```

### Export Configuration

Modify the `EXPORT_CONFIG` object:

```javascript
const EXPORT_CONFIG = {
  bucket: 'your-storage-bucket',
  storagePath: 'reports/',
  dataset: 'reports',
  testSuffix: '.json'
}
```

## Troubleshooting

### Debugging

1. **Check operation logs** in Dataform for SQL errors
2. **Verify table creation** in BigQuery console
3. **Check export logs** in Cloud Run for export errors
4. **Verify Cloud Storage paths** for exported files
5. **Test SQL templates** individually before adding to the dynamic system
6. **Use development environment** with smaller datasets for testing

## Examples

### Adding a JavaScript Bundle Size Metric

```javascript
jsBytes: {
  SQL: [
    {
      type: 'histogram',
      query: DataformTemplateBuilder.create((ctx, params) => `
        WITH pages AS (
          SELECT
            date,
            client,
            CAST(FLOOR(FLOAT64(summary.bytesJS) / 1024 / 50) * 50 AS INT64) AS bin
          FROM ${ctx.ref('crawl', 'pages')}
          WHERE
            date = '${params.date}'
            ${params.devRankFilter}
            ${params.lens.sql}
            AND is_root_page
            AND INT64(summary.bytesJS) > 0
        )

        SELECT
          *,
          SUM(pdf) OVER (PARTITION BY client ORDER BY bin) AS cdf
        FROM (
          SELECT
            *,
            volume / SUM(volume) OVER (PARTITION BY client) AS pdf
          FROM (
            SELECT
              *,
              COUNT(0) AS volume
            FROM pages
            WHERE bin IS NOT NULL
            GROUP BY date, client, bin
          )
        )
        ORDER BY bin, client
      `)
    },
    {
      type: 'timeseries',
      query: DataformTemplateBuilder.create((ctx, params) => `
        WITH pages AS (
          SELECT
            date,
            client,
            FLOAT64(summary.bytesJS) AS bytesJS
          FROM ${ctx.ref('crawl', 'pages')}
          WHERE
            date = '${params.date}'
            ${params.devRankFilter}
            ${params.lens.sql}
            AND is_root_page
            AND INT64(summary.bytesJS) > 0
        )

        SELECT
          date,
          client,
          UNIX_DATE(date) * 1000 * 60 * 60 * 24 AS timestamp,
          ROUND(APPROX_QUANTILES(bytesJS, 1001)[OFFSET(101)] / 1024, 2) AS p10,
          ROUND(APPROX_QUANTILES(bytesJS, 1001)[OFFSET(251)] / 1024, 2) AS p25,
          ROUND(APPROX_QUANTILES(bytesJS, 1001)[OFFSET(501)] / 1024, 2) AS p50,
          ROUND(APPROX_QUANTILES(bytesJS, 1001)[OFFSET(751)] / 1024, 2) AS p75,
          ROUND(APPROX_QUANTILES(bytesJS, 1001)[OFFSET(901)] / 1024, 2) AS p90
        FROM pages
        GROUP BY date, client, timestamp
        ORDER BY date, client
      `)
    }
  ]
}
```

This would automatically generate reports for JavaScript bundle sizes across all lenses and the configured date range.