在BigQuery內做GA4的頁面瀏覽大表
前言
這是第一個我真正用在視覺化的表!還沒整理跟優化,甚至還沒寫說明==
也好,有任何看不懂得直接留言。貼心提醒按著Shift滑動滾輪可以橫向滑動。
正文
演變歷程 - 第一版
先照之前的作品,把要的事件(頁面瀏覽)跟要的參數(Title, event_timestamp等)篩選抓取出來。馬上遇到第一個問題:campaign雖然是session-level, 卻只有在第一次的page_view出現。所以我先用了另一張表整理出每個session對應的campaign, 再用COALESCE讓每個本來有campaign的session也有了欄位(而非原本的null)。
-- 這是第一版真正可以利用的大表!
WITH page_view_log AS (
SELECT
DISTINCT
event_date,
event_name,
user_pseudo_id,
event_timestamp,
MAX (CASE WHEN params_new.key = 'page_location' THEN params_new.value.string_value ELSE NULL END) OVER (PARTITION BY event_timestamp, user_pseudo_id) AS `page_location`,
MAX (
CASE WHEN params_new.key = 'page_title' THEN params_new.value.string_value ELSE NULL END
) OVER (
PARTITION BY event_timestamp, user_pseudo_id
) AS `page_title`,
MAX (
CASE WHEN params_new.key = 'campaign' THEN params_new.value.string_value ELSE NULL END
) OVER (
PARTITION BY event_timestamp, user_pseudo_id
) AS `campaign`,
-- 這種session level的資料怎麼吐到每個事件內?
MAX (
CASE WHEN params_new.key = 'ga_session_id' THEN params_new.value.int_value ELSE NULL END
) OVER (
PARTITION BY event_timestamp, user_pseudo_id
) AS `ga_session_id`,
-- 這種session level的資料怎麼吐到每個事件內?
-- traffic_source.source,
FROM
`ga4-bq-for-rice.analytics_123456789.events_*` AS original
LEFT JOIN UNNEST (event_params) AS params_new
WHERE
REGEXP_CONTAINS(traffic_source.source, 'some_source') -- 只要EDM
AND REGEXP_CONTAINS(event_name, 'page_view') -- 只要瀏覽頁面
-- ORDER BY
-- ga_session_id -不重要
-- WHERE REGEXP_CONTAINS(page_location, 'example\\-domain\\.com\\/product\\/.*') -- 指定產品頁
-- LIMIT
-- 30 -- testing
)
,session_campaign AS (
SELECT
ga_session_id,
MAX(campaign) AS campaign
FROM
(
SELECT -- 這裡應能簡化成只拿session_id 跟 campaign
DISTINCT MAX (
CASE WHEN params_new.key = 'campaign' THEN params_new.value.string_value ELSE NULL END
) OVER (
PARTITION BY event_timestamp, user_pseudo_id
) AS `campaign`, -- 這種session level的資料怎麼吐到每個事件內?
MAX (
CASE WHEN params_new.key = 'ga_session_id' THEN params_new.value.int_value ELSE NULL END
) OVER (
PARTITION BY event_timestamp, user_pseudo_id
) AS `ga_session_id`, -- 這種session level的資料怎麼吐到每個事件內?
FROM
`ga4-bq-for-rice.analytics_123456789.events_*` AS original
LEFT JOIN UNNEST (event_params) AS params_new
WHERE
REGEXP_CONTAINS(event_name, 'page_view') -- 只要瀏覽
-- AND REGEXP_CONTAINS(traffic_source.source, 'some_source') -- 只要EDM頁面
-- ORDER BY
-- ga_session_id -- 不重要
-- LIMIT
-- 30 -- testing
)
GROUP BY
ga_session_id
)
SELECT
pvl.* EXCEPT (campaign),
COALESCE(pvl.campaign, sc.campaign) AS campaign,
REGEXP_EXTRACT(page_location, r'.*://[^/]+(/[^?#]*)') AS path, -- The [^?#]* part of the regular expression matches any characters that are not the start of a query string (?) or a fragment identifier (#).
REGEXP_EXTRACT_ALL(page_location, r'.*example\-domain\.com.*\/([^?#]*)') AS last_component -- 移除所有query跟文字錨點,配對批量上傳至CMS的資料使用
FROM
page_view_log AS pvl
JOIN session_campaign AS sc ON pvl.ga_session_id = sc.ga_session_id
ORDER BY pvl.ga_session_id
成品如上圖~其實我只截了部分欄位,主要是記錄我在這邊卡很久,因為我要用REGEXP_EXTRACT這個function,但是又要用到or判斷,要如何在REGEXP_EXTRACT中使用多個括號,就卡了我一天。最後問了ChatGPT才有解...
第二版的扣來了~
-- 這是第一版真正可以利用的大表!
-- 全管道(這應該比較好用,但在bq要撈比較久,到Looker Studio資料處裡也要時間
-- 既然with可以用with的資料,那就可以優化很多是
WITH original AS (
SELECT
*
FROM
`ga4-bq-for-rice.analytics_123456789.events_2023031*`
WHERE
REGEXP_CONTAINS(event_name, 'page_view')
),
page_view_log AS (
SELECT
DISTINCT event_date,
event_name,
user_pseudo_id,
event_timestamp,
MAX (
CASE WHEN params_new.key = 'page_location' THEN params_new.value.string_value ELSE NULL END
) OVER (
PARTITION BY event_timestamp, user_pseudo_id
) AS `page_location`,
MAX (
CASE WHEN params_new.key = 'page_title' THEN params_new.value.string_value ELSE NULL END
) OVER (
PARTITION BY event_timestamp, user_pseudo_id
) AS `page_title`,
MAX (
CASE WHEN params_new.key = 'campaign' THEN params_new.value.string_value ELSE NULL END
) OVER (
PARTITION BY event_timestamp, user_pseudo_id
) AS `campaign`,
MAX (
CASE WHEN params_new.key = 'ga_session_id' THEN params_new.value.int_value ELSE NULL END
) OVER (
PARTITION BY event_timestamp, user_pseudo_id
) AS `ga_session_id`,
traffic_source.source,
traffic_source.medium,
geo.*
FROM
original
LEFT JOIN UNNEST (event_params) AS params_new
-- WHERE
-- REGEXP_CONTAINS(traffic_source.source, 'some_source') -- 只要EDM
),
session_campaign AS (
SELECT
ga_session_id,
MAX(campaign) AS campaign,
MAX(rice_id) AS rice_id
FROM
(
SELECT
-- 這裡應能簡化成只拿session_id 跟 campaign
DISTINCT MAX (
CASE WHEN params_new.key = 'campaign' THEN params_new.value.string_value ELSE NULL END
) OVER (
PARTITION BY event_timestamp, user_pseudo_id
) AS `campaign`,
MAX (
CASE WHEN params_new.key = 'ga_session_id' THEN params_new.value.int_value ELSE NULL END
) OVER (
PARTITION BY event_timestamp, user_pseudo_id
) AS `ga_session_id`,
MAX (
CASE WHEN properties_new.key = 'rice_id' THEN properties_new.value.int_value ELSE NULL END
) OVER (PARTITION BY user_pseudo_id) AS `rice_id`
FROM
original
LEFT JOIN UNNEST (event_params) AS params_new,
UNNEST (user_properties) AS properties_new
-- WHERE
-- REGEXP_CONTAINS(traffic_source.source, 'some_source') -- 只要EDM頁面
)
GROUP BY
ga_session_id
)
SELECT
pvl.*
EXCEPT
(campaign),
COALESCE(pvl.campaign, sc.campaign) AS campaign,
sc.rice_id,
REGEXP_EXTRACT(
page_location, r'.*://[^/]+(/[^?#]*)'
) AS path, -- The [^?#]* part of the regular expression matches any characters that are not the start of a query string (?) or a fragment identifier (#).
REGEXP_EXTRACT_ALL(
page_location, r'.*example\-domain\.com.*\/([^?#]*)'
) AS last_path_component -- 移除所有query跟文字錨點,配對批量上傳至CMS的資料使用,其實表產出再取代就好,這樣搜尋的query(?search=)就有解
FROM
page_view_log AS pvl
JOIN session_campaign AS sc ON pvl.ga_session_id = sc.ga_session_id
ORDER BY
pvl.user_pseudo_id,
pvl.ga_session_id
成品局部截圖 |
主要更動
- 發現裡面WITH可以吃前一個WITH的東西,故優化了WITH結構。
- 將-user-level的參數丟進來(就是那個不小心被我碼整行的欄位...算了懶得改)
接下來
- 針對帶query的url做優化,目前這樣所有query都被篩掉,會導致看不到搜尋結果頁搜了甚麼(也不是不行,可以到事件的表看)
- 速度看能不能優化
- Code看能不能更整齊,這點好像有點難,因為UNNEST, MAX, JOIN等等很多function都是合併使用。
- 這只是針對page_view的表,還要做其他event的表~
- 本來好奇user-level的值(xxxid)能不能帶到所有一樣user_pseudo_id,會有這問題是因為我不熟MAX, OVER,跟PARTITION BY 的應用,知道這個之後這也是可以優化的部分(能否MAX裡面的CASE WHEN 只寫一次就好),如果需求清楚,在一開始原資料就DISTINCT也無妨
Reference
- https://codebeautify.org/sqlformatter#
- https://dotblogs.com.tw/SteveLiu/2019/05/21/174116
- https://stackoverflow.com/questions/5375634/can-i-use-multiple-with
- https://stackoverflow.com/questions/34056485/select-all-columns-except-some-in-google-bigquery
- https://ithelp.ithome.com.tw/articles/10190257
- https://stackoverflow.com/questions/59577857/how-to-filter-out-nulls-in-google-bigquery
創建時間:2023.03.21
留言
張貼留言