在BigQuery內做GA4的頁面瀏覽大表

前言

這是第一個我真正用在視覺化的表!還沒整理跟優化,甚至還沒寫說明==
也好,有任何看不懂得直接留言。貼心提醒按著Shift滑動滾輪可以橫向滑動。

正文

演變歷程 - 第一版

先照之前的作品,把要的事件(頁面瀏覽)跟要的參數(Title, event_timestamp等)篩選抓取出來。馬上遇到第一個問題:campaign雖然是session-level, 卻只有在第一次的page_view出現。所以我先用了另一張表整理出每個session對應的campaign, 再用COALESCE讓每個本來有campaign的session也有了欄位(而非原本的null)。

隨著code越來越長,我用到WITH來簡化code,但效能應該沒差太多。
-- 這是第一版真正可以利用的大表!
WITH page_view_log AS (
  SELECT 
    DISTINCT
    event_date,
    event_name,
    user_pseudo_id,
    event_timestamp,
    MAX (CASE WHEN params_new.key = 'page_location' THEN params_new.value.string_value ELSE NULL END) OVER (PARTITION BY event_timestamp, user_pseudo_id) AS `page_location`,
    MAX (
      CASE WHEN params_new.key = 'page_title' THEN params_new.value.string_value ELSE NULL END
    ) OVER (
      PARTITION BY event_timestamp, user_pseudo_id
    ) AS `page_title`, 
    MAX (
      CASE WHEN params_new.key = 'campaign' THEN params_new.value.string_value ELSE NULL END
    ) OVER (
      PARTITION BY event_timestamp, user_pseudo_id
    ) AS `campaign`, 
    -- 這種session level的資料怎麼吐到每個事件內?
    MAX (
      CASE WHEN params_new.key = 'ga_session_id' THEN params_new.value.int_value ELSE NULL END
    ) OVER (
      PARTITION BY event_timestamp, user_pseudo_id
    ) AS `ga_session_id`, 
    -- 這種session level的資料怎麼吐到每個事件內?
    -- traffic_source.source,
  FROM 
    `ga4-bq-for-rice.analytics_123456789.events_*` AS original 
    LEFT JOIN UNNEST (event_params) AS params_new 
  WHERE 
    REGEXP_CONTAINS(traffic_source.source, 'some_source') -- 只要EDM
    AND REGEXP_CONTAINS(event_name, 'page_view') -- 只要瀏覽頁面
  -- ORDER BY 
  --   ga_session_id  -不重要
  -- WHERE REGEXP_CONTAINS(page_location, 'example\\-domain\\.com\\/product\\/.*')  -- 指定產品頁
  -- LIMIT 
  --   30 -- testing
    )

,session_campaign AS (
  SELECT 
    ga_session_id, 
    MAX(campaign) AS campaign 
  FROM 
    (
      SELECT -- 這裡應能簡化成只拿session_id 跟 campaign
        DISTINCT MAX (
          CASE WHEN params_new.key = 'campaign' THEN params_new.value.string_value ELSE NULL END
        ) OVER (
          PARTITION BY event_timestamp, user_pseudo_id
        ) AS `campaign`, -- 這種session level的資料怎麼吐到每個事件內?
        MAX (
          CASE WHEN params_new.key = 'ga_session_id' THEN params_new.value.int_value ELSE NULL END
        ) OVER (
          PARTITION BY event_timestamp, user_pseudo_id
        ) AS `ga_session_id`, -- 這種session level的資料怎麼吐到每個事件內?
      FROM 
        `ga4-bq-for-rice.analytics_123456789.events_*` AS original 
        LEFT JOIN UNNEST (event_params) AS params_new 
      WHERE 
        REGEXP_CONTAINS(event_name, 'page_view') -- 只要瀏覽
        -- AND REGEXP_CONTAINS(traffic_source.source, 'some_source') -- 只要EDM頁面
      -- ORDER BY 
      --   ga_session_id -- 不重要
      -- LIMIT 
      --   30 -- testing
        ) 
  GROUP BY 
    ga_session_id
) 


SELECT 
  pvl.* EXCEPT (campaign),
  COALESCE(pvl.campaign, sc.campaign) AS campaign,
  REGEXP_EXTRACT(page_location, r'.*://[^/]+(/[^?#]*)') AS path,  -- The [^?#]* part of the regular expression matches any characters that are not the start of a query string (?) or a fragment identifier (#).
  REGEXP_EXTRACT_ALL(page_location, r'.*example\-domain\.com.*\/([^?#]*)')  AS last_component  -- 移除所有query跟文字錨點,配對批量上傳至CMS的資料使用
FROM 
  page_view_log AS pvl
  JOIN session_campaign AS sc ON pvl.ga_session_id = sc.ga_session_id
ORDER BY pvl.ga_session_id

成品如上圖~其實我只截了部分欄位,主要是記錄我在這邊卡很久,因為我要用REGEXP_EXTRACT這個function,但是又要用到or判斷,要如何在REGEXP_EXTRACT中使用多個括號,就卡了我一天。最後問了ChatGPT才有解...

第二版的扣來了~

-- 這是第一版真正可以利用的大表!
-- 全管道(這應該比較好用,但在bq要撈比較久,到Looker Studio資料處裡也要時間
-- 既然with可以用with的資料,那就可以優化很多是
WITH original AS (
  SELECT 
    * 
  FROM 
    `ga4-bq-for-rice.analytics_123456789.events_2023031*` 
  WHERE 
    REGEXP_CONTAINS(event_name, 'page_view')
), 
page_view_log AS (
  SELECT 
    DISTINCT event_date, 
    event_name, 
    user_pseudo_id, 
    event_timestamp, 
    MAX (
      CASE WHEN params_new.key = 'page_location' THEN params_new.value.string_value ELSE NULL END
    ) OVER (
      PARTITION BY event_timestamp, user_pseudo_id
    ) AS `page_location`, 
    MAX (
      CASE WHEN params_new.key = 'page_title' THEN params_new.value.string_value ELSE NULL END
    ) OVER (
      PARTITION BY event_timestamp, user_pseudo_id
    ) AS `page_title`, 
    MAX (
      CASE WHEN params_new.key = 'campaign' THEN params_new.value.string_value ELSE NULL END
    ) OVER (
      PARTITION BY event_timestamp, user_pseudo_id
    ) AS `campaign`, 
    MAX (
      CASE WHEN params_new.key = 'ga_session_id' THEN params_new.value.int_value ELSE NULL END
    ) OVER (
      PARTITION BY event_timestamp, user_pseudo_id
    ) AS `ga_session_id`, 
    traffic_source.source, 
    traffic_source.medium, 
    geo.* 
  FROM 
    original 
    LEFT JOIN UNNEST (event_params) AS params_new 
  -- WHERE 
  --   REGEXP_CONTAINS(traffic_source.source, 'some_source') -- 只要EDM
    ), 
session_campaign AS (
  SELECT 
    ga_session_id, 
    MAX(campaign) AS campaign, 
    MAX(rice_id) AS rice_id 
  FROM 
    (
      SELECT 
        -- 這裡應能簡化成只拿session_id 跟 campaign
        DISTINCT MAX (
          CASE WHEN params_new.key = 'campaign' THEN params_new.value.string_value ELSE NULL END
        ) OVER (
          PARTITION BY event_timestamp, user_pseudo_id
        ) AS `campaign`, 
        MAX (
          CASE WHEN params_new.key = 'ga_session_id' THEN params_new.value.int_value ELSE NULL END
        ) OVER (
          PARTITION BY event_timestamp, user_pseudo_id
        ) AS `ga_session_id`, 
        MAX (
          CASE WHEN properties_new.key = 'rice_id' THEN properties_new.value.int_value ELSE NULL END
        ) OVER (PARTITION BY user_pseudo_id) AS `rice_id` 
      FROM 
        original 
        LEFT JOIN UNNEST (event_params) AS params_new, 
        UNNEST (user_properties) AS properties_new 
      -- WHERE 
      --   REGEXP_CONTAINS(traffic_source.source, 'some_source') -- 只要EDM頁面
        ) 
  GROUP BY 
    ga_session_id
) 
SELECT 
  pvl.* 
EXCEPT 
  (campaign), 
  COALESCE(pvl.campaign, sc.campaign) AS campaign, 
  sc.rice_id, 
  REGEXP_EXTRACT(
    page_location, r'.*://[^/]+(/[^?#]*)'
  ) AS path,   -- The [^?#]* part of the regular expression matches any characters that are not the start of a query string (?) or a fragment identifier (#).
  REGEXP_EXTRACT_ALL(
    page_location, r'.*example\-domain\.com.*\/([^?#]*)'
  ) AS last_path_component -- 移除所有query跟文字錨點,配對批量上傳至CMS的資料使用,其實表產出再取代就好,這樣搜尋的query(?search=)就有解
FROM 
  page_view_log AS pvl 
  JOIN session_campaign AS sc ON pvl.ga_session_id = sc.ga_session_id 
ORDER BY 
  pvl.user_pseudo_id, 
  pvl.ga_session_id


成品局部截圖


主要更動
  • 發現裡面WITH可以吃前一個WITH的東西,故優化了WITH結構。
  • 將-user-level的參數丟進來(就是那個不小心被我碼整行的欄位...算了懶得改)

接下來

  • 針對帶query的url做優化,目前這樣所有query都被篩掉,會導致看不到搜尋結果頁搜了甚麼(也不是不行,可以到事件的表看)
  • 速度看能不能優化
  • Code看能不能更整齊,這點好像有點難,因為UNNEST, MAX, JOIN等等很多function都是合併使用。
  • 這只是針對page_view的表,還要做其他event的表~
  • 本來好奇user-level的值(xxxid)能不能帶到所有一樣user_pseudo_id,會有這問題是因為我不熟MAX, OVER,跟PARTITION BY 的應用,知道這個之後這也是可以優化的部分(能否MAX裡面的CASE WHEN 只寫一次就好),如果需求清楚,在一開始原資料就DISTINCT也無妨



Reference

  • https://codebeautify.org/sqlformatter#
  • https://dotblogs.com.tw/SteveLiu/2019/05/21/174116
  • https://stackoverflow.com/questions/5375634/can-i-use-multiple-with
  • https://stackoverflow.com/questions/34056485/select-all-columns-except-some-in-google-bigquery
  • https://ithelp.ithome.com.tw/articles/10190257
  • https://stackoverflow.com/questions/59577857/how-to-filter-out-nulls-in-google-bigquery
創建時間:2023.03.21

留言

這個網誌中的熱門文章

如何在BigQuery內呈現類似舊版GA的事件表格

學習紀錄 - 1

面試經驗 - 2