• ABOUT US
    • History
    • Our Team
    • Advertising
    • Subscription
  • CONTACT US
Georgia Today
No Result
View All Result
  • News
  • Politics
  • Business & Economy
  • Social & Society
  • Sports
  • Culture
  • News
  • Politics
  • Business & Economy
  • Social & Society
  • Sports
  • Culture
No Result
View All Result
Georgia Today
No Result
View All Result

New Identity Opportunities with Aerospike

Lineate Spotlight: Thoughts on the Future of the Ad-tech Industry

by Georgia Today
February 29, 2024
in Business & Economy, Newspaper
Reading Time: 7 mins read
Luca Polare Opens 2nd Branch in Baku

With the phase out of third-party cookies coming in the middle of Q4 2024, the urgency around identity solutions has reached a fever pitch among companies in the advertising ecosystem. Among ad tech companies, publishers, and advertisers, the user ID matching process is key to producing high-quality data for ad requests, reporting, and audience targeting. The value of building a graph of users, households, and devices is going to become more important than ever in maximizing value and generating the most relevant advertising. But capturing and linking all this data at scale has always been a significant technical challenge. The obvious choices of graph databases tended not to scale to the levels needed in ad tech, but the graph querying capabilities still need to be addressed. In 2022, we wrote a paper showing how we were able implement a scalable solution for identity graphs using HBase and Spark. Since that time, promising new technologies have emerged that offer the ability to broaden the capabilities of in-house identity graphs. Our early experiments with Aerospike Graph on AWS have been particularly promising.

Mapping an Identity Graph
Maximizing value in digital advertising depends on getting the right message to the right people at the right time. In the ad tech world, much of this goal is accomplished with the ability to map first-party data, such as information known by a publisher about a particular reader, with anonymized behavioral or demographic information known to DSPs, SSPs, and other enrichers as well as third-party data providers. Matching these disparate data streams together gives ad tech companies the ability to produce compelling user profiles, and these profiles give publishers and advertisers the ability to provide a more relevant experience and achieve more effective results. In practice, every company in the chain is building an identity graph that maps users to relevant data, devices to users, users to households, and so forth.


On a technical level, this process means that companies need to match IDs from many sources such as publishers, DSPs, SSPs, and other data enrichers and merge them with their own data and data they may acquire from other sources. Ad tech companies need to not only capture user and device data but also make it queryable and actionable for things like determining buyer intent, optimizing yield, building segments, and facilitating richer reporting and attribution. This need makes it crucial to combine all collected ID pairs into a useful identity graph. An identity graph helps advertisers get a full picture of a user’s journey through a marketing funnel across multiple browsers, apps, and devices. For example, if a user looks for a product on his or her desktop computer, the advertiser is able to target that same user on his or her mobile phone, through email campaigns, or in social networks. So, this set of user IDs makes up an identity graph that stores the relationship between a user’s personal devices, browsers, IP addresses, or household devices.
As a result, ad tech companies and publishers need to store and join many different data sources. When dealing with these data sources, the main challenge is that tracking signals may come from the same user multiple times per minute, with thousands to millions of users coming each second, producing at minimum terabytes of data each day. Without a carefully designed architecture, the data volume becomes so large that it is not feasible to run a query against even one month’s worth of data, let alone make decisions at scale.

Understanding the Technical Challenge
A user ID tracking system receives only pairs of IDs, not an entire set of IDs at once. For example, when a user logs into a website, the tracking system may need to map a site identifier to an email hash. Another may be a request to map a partner’s user ID to a company’s ID. Each of these pairs is separate, with each party or application offering only its own narrow identification of a user.


A tracking system’s main task is to build out the user/device graph, starting with a single pair of IDs and creating more complex relationships from there. The system evolves in real time, continually building out and enhancing the graph as well as answering queries on the fly. At the same time, the entire graph must be usable at any time for merging into reports and interpreting large batches. This means it is a key requirement to not only store and query data but also to handle hundreds of thousands of insertions per second, almost 24/7, all while other applications are using the system uninterrupted (doing full-scan data jobs or single key-value or range-key queries).
Our typical approach to handling ultrafast data queries at such scale has been to use databases with highly optimized key-value query capabilities, such as Aerospike or Redis. We have had a lot of success in particular using Aerospike in ad tech because it has demonstrated exceptional speed and recoverability as well as the ability to handle large data volumes with a relatively small cluster size. But the graph queries across the identity graph don’t lend themselves to such an architecture. At the same time, dedicated graph databases tend to work primarily on single node data volumes or on graphs that can reasonably be effectively sharded at design time, neither of which is practical in this use case.

Solving in the Real World
To take a concrete example, we had a client generating approximately a terabyte of tracking data each day. The cookie-based tracker alone produced more than ten thousand files per hour to analyze, and one of the partners supplied about 400 thousand files per hour. This data needed to be deduplicated, merged into a queryable graph, and used on a consistent basis. We built several proofs of concept using graph databases, key-value databases, and columnar databases and ran them through a series of tests to simulate ad tech workloads.


The fact that the dataset is a large graph suggests that graph databases would be an ideal choice for managing this task. For example, the built-in graph traversal features seem like an easy way to store IDs and their relationships and to quickly do multi-hop queries. For example, such functionality makes reaching an email hash through a browser cookie, or getting a mobile advertisement ID through an email hash, relatively simple to implement. The challenges of scale were clear, but the native graph query functionality was too promising to ignore.
However, we found the convenience of graph databases was outweighed by the challenges of scaling to the data sizes we needed. We experimented with Amazon Neptune, NebulaGraph, TigerGraph, and others, and we found that different graph solutions could store large amounts of data and make quick multi-hop ID queries. However, in ad tech, there is often a need to implement full table scans—for example, when a partner needs to join a company’s user or device data with its own and output a user intersection between the two systems. These wide queries proved problematic with all the graph databases we tried. In the end, we came to the conclusion that it would be easier to overlay certain graph query functionality on top of the database with better horizontal scalability than to attempt to incorporate horizontal scaling into a database with stronger graph capabilities.

Building our own graph
In the end, we elected to roll our own graph capabilities on top of Apache HBase deployed on Amazon EMR. By building custom indexes that store a predefined graph path in each record, we have been able to re-create the most important aspects of graph functionality needed for the ad tech use case. Not every graph capability we wanted was available to us, notably multi-hop lookups in arbitrary directions, such as from email hash to cookie and back. But the key functionality, such as from querying device ID to cookie to email hash, works well in a production capacity. Apache Spark jobs on top of this allowed us to run the full table scans in parallel.


Moreover, the Apache HBase solution scaled well. Our testing approach involved providing clean and prepared-for-insertion data of approximately 500 thousand ID pairs per second, each with different data retention requirements. (Different types of IDs have different lifecycles, so some IDs in the database needed to be stored for just a few weeks, but other types of IDs may need to be stored for multiple months.) More than 50 percent of the data consisted of duplicates that needed to be cleaned within a few minutes of insertion. The solution demonstrated response times in milliseconds even while full table scans were running. We have since incorporated similar architecture into production, and we have used this design approach in other high-scale projects, both inside and out of ad tech.

Seeing a New Possibility
Although we are proud of this solution and it has proven successful in the real world, it comes with inherent limitations. In particular, although we were able to maintain the very convenient graph abstraction over our datasets, we ultimately needed to hard-code precomputed query paths into Apache HBase indexes. This means that we need to know every type of graph query that will be done at scale at the time the indexes are built, and it means that the indexing burden grows exponentially with the number of query types. This is a standard limitation of solutions built on top of data pre-aggregation.
In the past year, Aerospike has released Aerospike Graph, which provides a convenient and familiar Gremlin Query Language interface on top of Aerospike core components. It leverages Apache TinkerPop to build graph computing capabilities on top of Aerospike’s engine. Aerospike Graph has the potential to dramatically simplify the custom indexing we have been implementing, tackling a majority of the use cases at higher speed and supporting the ability for ad hoc graph queries, which our existing implementations currently lack.
We have had a very good experience using Aerospike as a key-value store and have seen exceptional performance in dozens of deployments (as well as our own research). It seems highly promising to perform multi-hop queries over such a low-latency back end. We have run Apache Spark on top of Aerospike, though it is not clear if this part of the system is suitable for the new Aerospike product. It is research we are currently undertaking.

Conclusion
Like so much in ad tech, there is no single silver bullet for maintaining an identity graph, but it can be done cost-effectively using the right technologies for each function. By writing our own indexes on top of columnar data stores like Apache HBase, we have been able to create production-
ready identity graphs that handle millions of incoming identity pairs per second while simultaneously allowing batch processing of datasets.
The challenge of such homegrown solutions is that they require significant customization based on the specific kinds of queries needed. We are constantly refining the solution to optimize for various needs and have implemented similar ones on technologies like parquet files on Amazon S3. One area in particular we continue to think about is how to make the graph query capability more flexible without introducing too much complexity. The recent release from Aerospike shows promise in our early tests, and we will continue to explore this database further.

About Lineate:

Lineate is a leading implementer of cloud solutions, helping organizations scale their technology to manage vast data volumes and user bases. Our expertise in distributed systems along with our unique culture focused on knowledge breadth and creativity sets us apart in a competitive industry. With a global presence spanning seven countries, our client base is primarily in the United States and Western Europe.

Tags: Aerospikeidentity graphsLineate
ShareShareTweet

Related Posts

McDonald’s opens 27th location in Tbilisi’s Didube district
Business & Economy

McDonald’s opens 27th location in Tbilisi’s Didube district

June 6, 2025
Firefighters extinguish cars hit by a Russian military strike, in Sumy, Ukraine June 3. Source: State Emergency Service of Ukraine/REUTERS
Newspaper

Ukraine Latest: Zelensky Calls for Sanctions after ‘Savage’ Russian Attack on Sumy Kills Four

June 5, 2025
Argentina House, functioning in Georgia, and the House of Georgia, operating in Argentina, headed respectively by Nikoloz Makharadze-Patarklishvili and Diego Roberto Manavella. Source: FB
Business & Economy

The Mutual Houses in Argentina and Sakartvelo

June 5, 2025

Recommended

Putin, Xi, and allied leaders mark Russia’s Victory Day at Moscow parade

Putin, Xi, and allied leaders mark Russia’s Victory Day at Moscow parade

4 weeks ago
Experience Seamless Connectivity with Silknet eSIM in Georgia

Experience Seamless Connectivity with Silknet eSIM in Georgia

12 months ago
Champion Karateka Luka Khvedeliani on the Benefits of Georgian Karate for Georgia’s Youth

Georgia to Celebrate First Europe Day with European Union Candidate Status

1 year ago
Georgian Foreign Minister Holds Farewell Meeting with French Ambassador to Georgia

Georgian Foreign Minister Holds Farewell Meeting with French Ambassador to Georgia

3 years ago
Natia Mezvrishvili on Dealing with 2 Political Giants

Natia Mezvrishvili on Dealing with 2 Political Giants

3 years ago
Giorgi Gakharia: We were Told We Were Capable of Nothing – It’s All a Lie and Ukraine is a Great Example of This

Giorgi Gakharia: We were Told We Were Capable of Nothing – It’s All a Lie and Ukraine is a Great Example of This

3 years ago
GT Interview with Giorgi Badridze

GT Interview with Giorgi Badridze

3 years ago
Russo-Ukrainian War and Georgia – Analysis from security expert Kakha Kemoklidze

Russo-Ukrainian War and Georgia – Analysis from security expert Kakha Kemoklidze

3 years ago

Navigation

  • News
  • Politics
  • Business & Economy
  • Social & Society
  • Sports
  • Culture
  • International
  • Where.ge
  • Newspaper
  • Magazine
  • GEO
  • OP-ED
  • About Us
    • History
    • Our Team
    • Advertising
    • Subscription
  • Contact

Highlights

MFA shuts down NATO & EU info centre amid restructuring

Rasa Juknevičienė: Report urges fair elections, sanctions on Georgian regime, and release of political prisoners

European Parliament’s Foreign Affairs Committee passes draft report on Georgia with amendments

Zurabishvili: New wave of repression shows government’s weakness

Georgian Dream sues TV channels over critical language

EU Parliament panel holds urgent session on Georgia

Trending

Experience Seamless Connectivity with Silknet eSIM in Georgia
Business & Economy

Experience Seamless Connectivity with Silknet eSIM in Georgia

by Georgia Today
June 26, 2024

Why Silknet's eSIM could be your top choice in Georgia  Since its introduction, eSIM technology has become...

Photo by the author

Virtuosity and Versatility: Marc-André Hamelin Opens Tbilisi Piano Festival 2024

May 30, 2024
  • Where.ge
  • Newspaper
  • GEO
  • Magazine
  • Old Website

2000-2024 © Georgia Today

No Result
View All Result
  • News
  • Politics
  • Business & Economy
  • Social & Society
  • Sports
  • Culture
  • International
  • Where.ge
  • Newspaper
  • Magazine
  • GEO
  • OP-ED
  • About Us
    • History
    • Our Team
    • Advertising
    • Subscription
  • Contact

2000-2024 © Georgia Today