Events tracking at GetYourGuide
Bora Kaplan, Data Engineer, GetYourGuide
Thiago Rigo, Engineering Manager, GetYourGuide
#datalift use case in production
Presented live at #datalift No 6 on 26 November 2021
Bora Kaplan, Data Engineer:
"To be able to provide people incredible experiences we have to learn about them. And to do that, we cannot just rely on our services' databases because it's not going to track every single action a user takes. That is why it is paramount to build specific pipelines and services for the purpose of users' events tracking."
Fateme Kamali, Data Scientist:
"They demonstrated the evolution of the event tracking pipeline by providing snapshots of the architecture in 2016, 2018, and 2021. Tracking user events is essential to enable data analytics and machine learning."
GetYourGuide is Europe's largest marketplace for travel experiences, with over 45 million tickets sold for tours and activities in 150+ countries. For users to book and enjoy incredible experiences, we have to know what incredible experiences mean for them. We learn about our users through a dedicated pipeline and services that track user-behavior events on the platform.
Bora Kaplan, Data Engineer, and Thiago Rigo, Engineering Manager, enable data analytics and machine learning at GetYourGuide. They outline their journey in user event tracking, that is, knowing the actions users take on both the web and app versions of the GetYourGuide platform: from humble beginnings with schemaless JSON, through strongly typed tracking with Thrift, and back to JSON with OpenAPI and AsyncAPI. They describe each approach's pain points and benefits, show the tech stack and architecture, and finally share what they envision for the future.
Watch the recording on YouTube: #datalift: Events Tracking at GetYourGuide
2016: Logs parsing and Ping API
Starting with a straightforward approach, our infrastructure team set up streaming of all the webserver logs to our data lake. The server logs were TSV files, and by parsing them, we already had enough data to capture user behavior on the website.
For the mobile app, we used an API called Ping that received event data and wrote it as JSON to the data lake. An example of what we call an event: opening the app triggers an event called “app open.” This approach required little infrastructure to maintain, and it captured richer data than the service databases alone.
However, this setup had some shortcomings. Server log data isn't rich enough, and its format can change. The app data was also not of high quality: the Ping service simply received an event and wrote it directly to the data lake without any checking or verification, so the data wouldn't always conform to a schema. Lastly, the Ping service was no longer maintained, which prevented us from extending it to add extra data, like geolocation.
2018: Focus on events and Thrift
Leaving behind the Ping API, we built a v2 architecture focused on the events only. We introduced three components:
Thrift Schemas: Apache Thrift is a powerful open-source software framework that allows defining data types and service interfaces in a simple definition file.
Collector: An API we created, replacing Ping. It receives event data from the app and validates it against the Thrift definition, checking the data types. Only valid data flows downstream.
Analytics Quickline: Reads the Thrift binary data from Kafka and writes it out as Parquet, a more efficient data lake format than TSV or JSON.
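To make the Collector's role concrete, here is a minimal sketch of type-checking an incoming event against a declared schema. This is illustrative Python, not the production service: the real Collector validates against Apache Thrift definitions, and the field names below are invented.

```python
# Hypothetical schema for an "app open" event: field name -> expected type.
# The production pipeline expresses this as a Thrift struct instead.
APP_OPEN_SCHEMA = {
    "event_name": str,
    "timestamp_ms": int,
    "user_id": str,
}

def validate_event(event: dict, schema: dict) -> list:
    """Return a list of type errors; an empty list means the event is valid."""
    errors = []
    for field, expected_type in schema.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(event[field]).__name__}")
    return errors

good = {"event_name": "app_open", "timestamp_ms": 1637922600000, "user_id": "u-42"}
bad = {"event_name": "app_open", "timestamp_ms": "not-a-number"}

print(validate_event(good, APP_OPEN_SCHEMA))  # []
print(validate_event(bad, APP_OPEN_SCHEMA))   # two errors: bad type, missing field
```

Only events that produce an empty error list would be forwarded to Kafka; everything else is rejected at the edge, which is what keeps the downstream Parquet data clean.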
This new pipeline benefits from the strong schema definition, which ensures data quality through strict type definitions. And since we own the API, we can do basic data enrichment, for example with geolocation, which is essential for us. It is also a more stable source than logs.
Some shortcomings: There was an explosion of event types. We went from 30 to 250 types, so our team became a bottleneck controlling the schema definitions. We observed low ownership among the producer teams regarding the quality of their data, since our engineering team was usually the first to be contacted when something went wrong. Moreover, due to how we used Thrift, all the properties were marked as optional; properties were often simply omitted from the payload, and we did not have a good way to monitor this.
2021: Real-time enrichment and decentralized schemas
We took all we learned throughout the years and developed a new approach using OpenAPI and AsyncAPI, which lets each team own its schema and enables real-time enrichment. We do that by keeping the Collector and Quickline from the previous architecture and adding two new components:
Schemapi Registry: Contains all the OpenAPI & AsyncAPI definitions
Enrichment Pipeline: Multiple streaming applications
With this approach, we have better data quality and more metadata available. Using OpenAPI and AsyncAPI from the get-go, we already have richer property validation: syntactic and semantic validation. Through the Analytics Enrichment Pipeline, we provide real-time enrichment of the data. And we are using JSON instead of Thrift because it is a company standard, besides being human-readable.
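The difference between the two validation layers can be sketched in a few lines. This is a hedged illustration, not the production implementation (which uses OpenAPI and AsyncAPI schema definitions); the field names and rules are invented.

```python
import re

def syntactic_check(event: dict) -> bool:
    """Syntactic validation: does the payload have the right shape and types?"""
    return (isinstance(event.get("event_name"), str)
            and isinstance(event.get("country_code"), str)
            and isinstance(event.get("timestamp_ms"), int))

def semantic_check(event: dict) -> bool:
    """Semantic validation: do the values actually make sense?"""
    return (bool(re.fullmatch(r"[A-Z]{2}", event["country_code"]))  # ISO-style code
            and event["timestamp_ms"] > 0)

event = {"event_name": "app_open", "country_code": "DE", "timestamp_ms": 1637922600000}
print(syntactic_check(event) and semantic_check(event))  # True
```

A payload like `{"country_code": "Germany"}` would pass a pure type check but fail semantic validation, which is why having both layers catches more bad data than Thrift's type-only validation did.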
Because everything is decentralized, teams are more independent: each can create its own schema, publish it, and set up alerts, which leads to better monitoring and control.
The shortcomings are different: this approach is more complex since we added more components and more code to maintain. Each team needs to publish and validate their schemas, and we need to introduce new tools around that.
Engineering is an ongoing project, so we want to ensure a smooth migration phase and provide documentation for producers and consumers.
We will reduce the complexity and grey areas of some tools; for example, OpenAPI and AsyncAPI are great tools, but they can sometimes feel like a black box.
We will also provide tooling for data discoverability: we collect many events for people to use and create value, so we want to make sure they know these data exist and how to use them correctly.
And after the migration phase is complete, we need to ensure data quality and automate anomaly detection.
Analytics and ML use cases
These events fuel many different use cases at GetYourGuide. One relevant metric that can be calculated from user events is the click-through rate of each tour. It lets us understand which tours perform well for each user profile, which supports the recommendation algorithm.
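As a rough illustration, the click-through rate of a tour is its clicks divided by its impressions, both counted from tracking events. The event names and fields below are invented for the sketch.

```python
from collections import Counter

# A tiny batch of hypothetical tracking events.
events = [
    {"tour_id": "t1", "event_name": "tour_impression"},
    {"tour_id": "t1", "event_name": "tour_impression"},
    {"tour_id": "t1", "event_name": "tour_click"},
    {"tour_id": "t2", "event_name": "tour_impression"},
]

impressions = Counter(e["tour_id"] for e in events if e["event_name"] == "tour_impression")
clicks = Counter(e["tour_id"] for e in events if e["event_name"] == "tour_click")

# CTR per tour = clicks / impressions.
ctr = {tour: clicks[tour] / n for tour, n in impressions.items()}
print(ctr)  # {'t1': 0.5, 't2': 0.0}
```

In practice this aggregation would run over the Parquet event data in the lake and be segmented by user profile before feeding the recommender.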
Another big internal use case is marketing attribution. Every time a user arrives at GetYourGuide, the attribution tracking event is triggered. It contains important marketing campaign information such as the referral channel.
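Referral-channel information of this kind is commonly carried in the landing URL's query parameters. The sketch below shows the general idea using the widespread UTM convention; it is an assumption for illustration, not GetYourGuide's actual attribution payload.

```python
from urllib.parse import urlparse, parse_qs

def attribution_from_url(url: str) -> dict:
    """Extract hypothetical attribution fields from a landing URL."""
    params = parse_qs(urlparse(url).query)
    return {
        "channel": params.get("utm_source", ["direct"])[0],
        "campaign": params.get("utm_campaign", [None])[0],
    }

print(attribution_from_url(
    "https://www.getyourguide.com/?utm_source=newsletter&utm_campaign=summer"))
# {'channel': 'newsletter', 'campaign': 'summer'}
```

An attribution event enriched with fields like these lets marketing credit each booking to the channel and campaign that brought the user in.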
And based on the events, we have built an experimentation platform. It is an in-house tool for A/B experiments to understand which variations are working better.
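At its core, such an experimentation readout groups tracked events by experiment variation and compares conversion rates. The sketch below shows that core with invented event fields; a real platform adds randomization, exposure logging, and statistical significance testing on top.

```python
from collections import defaultdict

# Hypothetical events: each user saw one variation; "checkout" marks a conversion.
events = [
    {"user_id": "u1", "variation": "A", "event_name": "checkout"},
    {"user_id": "u2", "variation": "A", "event_name": "page_view"},
    {"user_id": "u3", "variation": "B", "event_name": "checkout"},
    {"user_id": "u4", "variation": "B", "event_name": "checkout"},
]

exposed = defaultdict(set)    # users who saw each variation
converted = defaultdict(set)  # users who converted in each variation
for e in events:
    exposed[e["variation"]].add(e["user_id"])
    if e["event_name"] == "checkout":
        converted[e["variation"]].add(e["user_id"])

rates = {v: len(converted[v]) / len(users) for v, users in exposed.items()}
print(rates)  # {'A': 0.5, 'B': 1.0}
```

Counting distinct users rather than raw events avoids double-counting someone who converts twice, which is one of the subtleties such a platform has to get right.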
From September 2020 to June 2021, we hosted Season 1 entirely online. The #datalift No 1 to No 5 events had 40+ AI Guild members showing best practices from 12 industries to over 5.2k registered attendees.
We are now in the middle of Season 2, which runs from November 2021 to July 2022 and is a hybrid experience: Online + 3-day Summit.
Invitation to #datalift summit
#datalift Summit is in Berlin from 22 to 24 June 2022.
The confirmed speakers and partners are listed on the main page: www.thedatalift.eu
Get your Early Bird ticket with a discount until 31 January!