Exporting Matomo data to Looker Studio through GCP [EN]

Like many analytics tools, Matomo is a challenge for non-analysts. Our clients' teams, who are not data analysts, need insights to make decisions. We aim to create simple dashboards that let them monitor their activity without having to deep-dive into an analytics tool.
In this article, we walk through and share the code to export data from Matomo Cloud or Matomo On-Premise and build a dashboard in Looker Studio.
There are a few requirements:
  • think upfront about the data you need in your dashboard and the custom reports you will have to create to get it
  • create a GCP environment. It is not free, but almost: implementing and running this setup for our clients costs about a baguette and a croissant per month (9.09€ over the past 90 days for the use case described here)

Data prep in Matomo

Create the custom reports you need and note the IDs
Get the Site IDs
Get your authentication token:
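Before moving to GCP, it is worth checking that the token and site IDs are valid. Below is a minimal sketch, assuming your instance lives at MYSITE.matomo.cloud (replace with your own URL) and using a placeholder token; it lists the site IDs the token can read through the standard Reporting API.
python
import requests

MATOMO_URL = "https://MYSITE.matomo.cloud/index.php"  # replace MYSITE with your instance
TOKEN = "YOUR_AUTH_TOKEN"  # placeholder: the token created above

# Quick sanity check: list the site IDs this token has (at least) view access to.
resp = requests.get(MATOMO_URL, params={
    "module": "API",
    "method": "SitesManager.getSitesIdWithAtLeastViewAccess",
    "token_auth": TOKEN,
    "format": "JSON",
})
print(resp.json())  # e.g. [1, 2, 3]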

Data Extract

In GCP, create a new Cloud Function.

Relevant configuration details: deploy the function with an HTTP trigger and a Python runtime, set cloud_handler as the entry point, and make the Matomo token available through the ENV_TOK runtime environment variable (or fetch it from Secret Manager).


main.py

python
import json
import os

import pandas as pd
import pyarrow  # required by load_table_from_dataframe
import requests
from datetime import datetime, date, timedelta
from dateutil import parser
from google.cloud import bigquery
from google.cloud import secretmanager  # optional: use it if you store the token in Secret Manager

# ADJUST NAMES HERE
project_id = "PROJECT_ID"
headers = {'Content-Type': 'application/json'}
# The Matomo token is read from the ENV_TOK runtime environment variable
TOK = os.environ.get("ENV_TOK", "Specified environment variable is not set.")


def extract(input_date):
    """Call the Matomo Reporting API for one day and return the parsed JSON."""
    params = (('token_auth', TOK),
              ('apiAction', 'getCustomDimension'),
              ('language', 'en'),
              ('apiModule', 'CustomDimensions'),
              ('date', input_date),
              ('idDimension', '1'),  # adjust to your custom report / dimension ID
              ('filter_limit', -1),
              ('format', 'JSON'))
    # Replace MYSITE by the one in your URL, and idSite by your site ID
    r = requests.get('https://MYSITE.matomo.cloud/?module=API&method=API.getProcessedReport&idSite=1&period=day&flat=1',
                     params=params, headers=headers, verify=False)
    print(r)
    return json.loads(r.text)


def execute_bq_job(df):
    """Append the dataframe to the destination BigQuery table."""
    print("start_bq")
    cl = bigquery.Client()
    # ADJUST NAMES HERE
    PROJECT_ID = "PROJECT_NAME"
    DATASET_ID = "DATASET_ID"
    table_name = "TABLE_NAME"
    try:
        dataset = bigquery.Dataset(f"{PROJECT_ID}.{DATASET_ID}")
        dataset.location = "EU"
        dataset = cl.create_dataset(dataset)  # API request
    except Exception:
        print('dataset not created (it probably already exists)')
        dataset = cl.dataset(DATASET_ID)
    job_config = bigquery.LoadJobConfig()
    job_config.autodetect = True
    job_config.write_disposition = "WRITE_APPEND"
    job_config.create_disposition = "CREATE_IF_NEEDED"
    load_job = cl.load_table_from_dataframe(
        df, dataset.table(table_name), job_config=job_config
    )  # API request
    print("Starting job {}".format(load_job.job_id))
    load_job.result()  # Waits for table load to complete.
    print("Job finished.")


def clean_columns(df):
    """Strip characters that BigQuery does not accept in column names."""
    print("clean_start")
    for char in ('(', ')', '[', ']'):
        df.columns = df.columns.str.replace(char, '', regex=False)
    print("end clean")
    return df.columns.str.replace(' ', '_', regex=False)


def cloud_handler(request):
    # By default, if no date parameter is provided, the function runs on yesterday's data
    """
    :param request: HTTP request, optionally carrying start_date / end_date (YYYY-MM-DD)
    :return: success/error message
    """
    request_json = request.get_json(silent=True)
    request_args = request.args
    default_date = datetime.now() - timedelta(days=1)

    if request_json and 'start_date' in request_json:
        start_date = request_json['start_date']
    elif request_args and 'start_date' in request_args:
        start_date = request_args['start_date']
    else:
        start_date = default_date.strftime("%Y-%m-%d")

    if request_json and 'end_date' in request_json:
        end_date = request_json['end_date']
    elif request_args and 'end_date' in request_args:
        end_date = request_args['end_date']
    else:
        end_date = default_date.strftime("%Y-%m-%d")

    # command: zip -r cloud_function.zip applibs/bigquery_util.py applibs/google_cred.json main.py requirements.txt
    try:
        print(f"Start processing at: {datetime.now().isoformat()}")
        input_date = datetime.strptime(start_date, "%Y-%m-%d")
        end_date = datetime.strptime(end_date, "%Y-%m-%d")
        delta = timedelta(days=1)
        print("end_date : ", end_date)

        while input_date <= end_date:
            print("input_date : ", input_date)
            # Extract one day of data from Matomo
            request_data = extract(input_date.strftime("%Y-%m-%d"))
            data = request_data['reportData']
            df = pd.DataFrame(data=data)
            report_date = parser.parse(request_data['prettyDate']).strftime("%Y-%m-%d")
            processed_datetime = str(datetime.now())
            print(input_date, report_date, processed_datetime)

            # Store the data in BigQuery
            if data:
                df.columns = clean_columns(df)
                df["date"] = datetime.strptime(report_date, "%Y-%m-%d")
                df["_processed_datetime"] = processed_datetime
                execute_bq_job(df)

            print(f"End processing at: {datetime.now().isoformat()}")
            input_date += delta

        return {"message": "Data saved successfully."}, 200
    except Exception as error_info:
        print(error_info)
        return {"message": str(error_info)}, 400
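Once deployed, the function processes yesterday's data by default, but it also accepts start_date and end_date parameters (as query parameters or in a JSON body), which is convenient for backfilling history. A minimal sketch of such a call, assuming a placeholder trigger URL:
python
import requests

# Placeholder: use the trigger URL shown on your Cloud Function's page.
FUNCTION_URL = "https://REGION-PROJECT_ID.cloudfunctions.net/matomo_extract"

# Backfill January 2024: the handler loops day by day between the two dates.
# Add an "Authorization: Bearer <identity token>" header if the function
# does not allow unauthenticated invocations.
resp = requests.get(
    FUNCTION_URL,
    params={"start_date": "2024-01-01", "end_date": "2024-01-31"},
    timeout=600,  # a long backfill can take a while
)
print(resp.status_code, resp.json())  # expects {"message": "Data saved successfully."}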

requirements.txt

text
google-api-core==2.4.0
google-api-python-client==2.36.0
google-auth==2.3.3
google-auth-httplib2==0.1.0
google-cloud-bigquery==2.32.0
google-cloud-core==2.2.2
google-cloud-firestore==2.3.4
google-cloud-storage==2.1.0
google-resumable-media==2.1.0
googleapis-common-protos==1.56.4
google-cloud-secret-manager
python-dateutil==2.8.2
pandas
pyarrow

Data Processing

The raw data is appended to the BigQuery table as-is on every run, so the same day can be loaded several times and the label column still mixes the content name with the page URL. The following query cleans this up:
sql
select
  _processed_datetime,
  date(date) as date,
  split(label, ' - ')[safe_offset(0)] as content_name,
  split(label, ' - ')[safe_offset(1)] as page_url,
  nb_visits,
  nb_hits,
  bounce_rate,
  avg_time_on_dimension,
  exit_rate,
  avg_time_generation
from `un-nexans.src_matomo.matomo_contentName_group`
qualify row_number() over (partition by date, label order by _processed_datetime desc) = 1
This query splits the label into a content name and a page URL, and keeps only the most recent row for each date and label, so re-importing the same day does not create duplicates.
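The modelization step further down reads this cleaned data from views (the v_deduped_matomo_* names). One way to persist the query above is to save it as a view, either from the BigQuery console or programmatically; here is a sketch with the BigQuery client, reusing the names from the article:
python
from google.cloud import bigquery

client = bigquery.Client()

# Wrap the deduplication/splitting query in a view so the modelization
# layer can reference it (table and view names follow the article's example).
view_sql = """
create or replace view `un-nexans.stg_matomo.v_deduped_matomo_contentName_group` as
select
  _processed_datetime,
  date(date) as date,
  split(label, ' - ')[safe_offset(0)] as content_name,
  split(label, ' - ')[safe_offset(1)] as page_url,
  nb_visits,
  nb_hits,
  bounce_rate,
  avg_time_on_dimension,
  exit_rate,
  avg_time_generation
from `un-nexans.src_matomo.matomo_contentName_group`
qualify row_number() over (partition by date, label order by _processed_datetime desc) = 1
"""

client.query(view_sql).result()  # waits for the DDL statement to finish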

Schedule your import

Use Cloud Scheduler to call the Cloud Function's HTTP trigger once a day and automate the imports (a scripted alternative is sketched below).
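The job can be created directly in the Cloud Scheduler console: an HTTP job pointing at the function's trigger URL with a cron schedule. If you prefer to script it, here is a sketch using the google-cloud-scheduler client library; the project, region, schedule and URL are placeholders to adapt.
python
from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()
parent = "projects/PROJECT_ID/locations/europe-west1"  # placeholder project and region

job = scheduler_v1.Job(
    name=f"{parent}/jobs/matomo-daily-import",
    description="Daily Matomo -> BigQuery import",
    schedule="0 6 * * *",  # every day at 06:00
    time_zone="Europe/Paris",
    http_target=scheduler_v1.HttpTarget(
        uri="https://REGION-PROJECT_ID.cloudfunctions.net/matomo_extract",  # placeholder trigger URL
        http_method=scheduler_v1.HttpMethod.GET,
    ),
)

client.create_job(parent=parent, job=job)
# With no date parameters, each run loads yesterday's data.
# If the function does not allow unauthenticated calls, add an oidc_token
# (with a service account email) to the HttpTarget.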

Data Modeling

To aggregate the data, join the tables. The metrics should be consistent across tables so that you can take them from a single source. In the following example, we use the page URL and the date as join keys.
sql
with prep as (
  select * from `un-nexans.stg_matomo.v_deduped_matomo_content1_group`
  qualify row_number() over (partition by date, page_url order by nb_hits) = 1
),
prep2 as (
  select * from `un-nexans.stg_matomo.v_deduped_matomo_content2_group`
  qualify row_number() over (partition by date, page_url order by nb_hits) = 1
),
prep3 as (
  select * from `un-nexans.stg_matomo.v_deduped_matomo_content3_group`
  qualify row_number() over (partition by date, page_url order by nb_hits) = 1
),
prep4 as (
  select * from `un-nexans.stg_matomo.v_deduped_matomo_contentName_group`
  qualify row_number() over (partition by date, page_url order by nb_hits) = 1
)
select distinct
  prep.date,
  prep.page_url,
  prep.content_path_1,
  prep2.content_path_2,
  prep3.content_path_3,
  prep4.content_name,
  prep.nb_visits,
  prep.nb_hits,
  prep.bounce_rate,
  prep.avg_time_on_dimension,
  prep.exit_rate,
  prep.avg_time_generation
from prep
full join prep2 using (date, page_url)
full join prep3 using (date, page_url)
full join prep4 using (date, page_url)

Data orchestration

Create a query that builds a partitioned table, and schedule it to run every day or every hour (for instance as a BigQuery scheduled query, or as sketched after the statement below) so that Looker Studio always reads the most recent data.
sql
create or replace table `un-nexans.mart_nexans.fct_aggregated_matomo_nexans_events`
partition by date_trunc(date, month) as
select * from `un-nexans.mart_nexans.v_aggregated_matomo_nexans_events`
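If you prefer to drive the refresh from code rather than a scheduled query, here is a sketch with the BigQuery client, reusing the table names from the article:
python
from google.cloud import bigquery

client = bigquery.Client()

# Rebuild the partitioned reporting table from the aggregated view.
refresh_sql = """
create or replace table `un-nexans.mart_nexans.fct_aggregated_matomo_nexans_events`
partition by date_trunc(date, month) as
select * from `un-nexans.mart_nexans.v_aggregated_matomo_nexans_events`
"""

client.query(refresh_sql).result()  # waits for the table to be rebuilt
print("Reporting table refreshed")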

Data Visualization in Looker Studio

With the same Google credentials, go to Looker Studio and add a new data source to your dashboard. With the BigQuery connector, you can access all your tables: just add the partitioned table and build your dashboard!