Clickstream analytics in Azure Data Explorer
Clickstream analysis is the process of collecting, analyzing, and reporting aggregated data about user’s journey on a website. User’s interactions with websites are collected, with applications like Adobe analytics and Tealium. Bringing clickstream data into a centralized place and combining it with other data sources for rich analytics is often required. Typically, clickstream data can reach around 2-3TB/day in size. Using Relational databases to analyze such data might not be suitable or cost effective.Azure Data Explorer is a fast, fully managed data analytics service for real-time analysis on large volumes of Telemetry, Logs, Time Series data streaming from applications, websites, IoT devices, and more.
Proposed Architecture Following an end-to-end proposed architecture how you can use Azure Data Explorer to bring Clickstream historical and incremental data into Azure Data Explorer and analyze, aggregate and visualize the data.
Source System
Adobe Analytics dataset usually contains about 1000+ fields. Adobe Analytics provides Data feed functionality which can be used to extract the data to Azure Data Lake Store. Data can be extracted with different configurations, such as sending analytics daily or hourly, in single or multiple files with .zip or .gz format.
Tealium datasets are extracted in JSON structure with 100’s of nested fields. The data can be sent to Azure Event hub in near real-time.
Bringing Data into ADX
Creating the table structure and mapping for 1000+ column is a cumbersome task; it can be made easy with 1 Click Ingestion.
There are multiple ways to load the data.
– Historical Load using Lightingest. LightIngest is a command-line utility for ad-hoc data ingestion into Azure Data Explorer. Simplest and efficient way to load all the historical data, with just one command. Although adobe files are in TSV format but due to special characters in the files, you should use the format TSVE in ADX. Similarly, for Tealium historical load, use JSON format.
– For Tealium real time streaming connect Eventhub to ADX table.
– Incremental load with ADF is useful for loading because you can easily control the loading flow and manage all your ETL or ELT pipelines from a single tool.
Querying & Analytical workload
The Analytical work requires more compute but is usually required at working hours. Using a follower cluster will make it easy to pause/resume and optimize for read workload. This will be useful for charge back to different groups and will provide workload isolation. Cluster pause/resume can be done using Logic app or Azure Automation.
Estimated Cluster Size
The cluster size will depend on the daily data ingestion and how many days needs to be retained in hot cache. You can estimate your cluster size based on your requirements.