One of the major points that companies must consider these days is how to store, sort, and manage the user data they receive. Especially since the introduction of data privacy regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), companies must take care to ensure they are managing, storing, and deleting user data in accordance with the applicable regulatory standards.
In this blog post, I am going to provide an overview of a recent project in which I used Amazon Kinesis Data Streams to perform client data management. As part of the project, I was responsible for setting up an automatic system that monitored data requests from the client’s users and processed any requests for account deletion.
To understand the process we used for this project, it’s important to first understand the basic terminology associated with Amazon Kinesis Data Streams. Here are a few definitions:
- Amazon Kinesis Data Streams: Used to collect and process large streams of data records in real time. The Amazon Kinesis Data Streams process is further discussed below.
- Stream: A sequence of data that is uploaded to Amazon Kinesis Data Streams and then broken down into shards, which are further broken down into data records.
- Data Record: A unit of stored data which is an immutable sequence of bytes. Each data record is composed of a sequence number, a partition key, and a data blob.
- Shard: A uniquely identified sequence of data records in a Kinesis Data Stream. A Kinesis Data Stream is composed of one or more shards, which in turn are composed of multiple data records.
- Partition Key: A key used to group data by shard within a Kinesis Data Stream. It is associated with each data record to determine which shard a given data record belongs to.
- Sequence Number: A number, assigned by Kinesis Data Streams when a record is written, that identifies the record within its shard. The sequence number works in tandem with the partition key, because the sequence number alone does not specify which shard the data belongs to.
- Retention Period: The length of time that data records are accessible after they are added to the Kinesis Data Stream. The default retention period is 24 hours after a record is added, though the stream's owner can extend the retention period as required.
- Producer: Any application that puts records into Amazon Kinesis Data Streams. For example, a producer can run on a computer, a laptop, or a web server.
- Consumer: Also known as an Amazon Kinesis Data Streams application, it is any application that gets records from Kinesis Data Streams and processes them.
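To make the definitions above a little more concrete, here is a minimal Python sketch of how the three parts of a data record fit together. The `DataRecord` class and the sample values are purely illustrative assumptions; this is not a type from the AWS SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a stored record is immutable
class DataRecord:
    """Illustrative model of a Kinesis data record (not an AWS SDK type)."""
    sequence_number: str  # assigned by Kinesis when the record is stored
    partition_key: str    # chosen by the producer; decides the shard
    data: bytes           # opaque payload, the "data blob"


# Hypothetical sample values for illustration only.
record = DataRecord(
    sequence_number="1",  # placeholder; real sequence numbers are much longer
    partition_key="user-42",
    data=b'{"event": "delete_user"}',
)
```

Marking the class as frozen mirrors the definition above: once a record is in the stream, its contents do not change.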
Amazon Kinesis Data Streams Architecture
Amazon Kinesis Data Streams are used to collect and process large streams of data records in real time. The benefit of Kinesis Data Streams is that the application can start consuming the data almost immediately (typically less than 1 second) after it is added.
The typical process for Amazon Kinesis Data Streams is:
- A producer uploads data into the Amazon Kinesis Data Stream. When the producer uploads the data, it must specify the stream name, a partition key, and the data blob. Kinesis assigns the sequence number once the record is stored.
Each Kinesis Data Stream has a set number of shards, each of which provides a fixed unit of capacity. Therefore, the data capacity of your stream is determined by the number of shards that you have.
This is especially important for Amazon Kinesis Data Streams because Amazon charges you per shard. The more shards you have, the more capacity you have, but the more Amazon is going to charge you.
If your stream size changes, Amazon allows you to increase or decrease the number of shards allotted to the stream. This process of increasing and decreasing shards to change the data rate is called “resharding.”
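As a rough illustration of how shard count relates to capacity, the sketch below estimates how many shards a stream would need for a given write load. The default limits (about 1 MB or 1,000 records written per second, per shard) are the figures AWS documents for provisioned streams as far as I know; treat them as assumptions and check the current quotas before relying on them. The function name `estimate_shards` is my own.

```python
import math


def estimate_shards(records_per_sec: float, avg_record_bytes: float,
                    max_records_per_shard: int = 1000,
                    max_bytes_per_shard: int = 1_000_000) -> int:
    """Estimate the shards needed to absorb a write load.

    The default per-shard limits are assumptions; verify them against
    the current AWS quotas.
    """
    # Shards needed to stay under the records-per-second limit.
    by_count = math.ceil(records_per_sec / max_records_per_shard)
    # Shards needed to stay under the bytes-per-second limit.
    by_bytes = math.ceil(records_per_sec * avg_record_bytes / max_bytes_per_shard)
    # A stream always has at least one shard.
    return max(1, by_count, by_bytes)


small_records = estimate_shards(records_per_sec=2500, avg_record_bytes=200)
large_records = estimate_shards(records_per_sec=500, avg_record_bytes=5000)
```

With these assumed limits, the first workload is bound by record count and the second by record size, which is why both dimensions need to be checked before resharding.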
- The uploaded data stream is broken down into shards, which are further broken down into data records. For example, when using Amazon Kinesis Data Streams for video streaming, the stream you are uploading might contain a full video, while the shard is only a small portion of the video, and the data record is only a few frames.
As a rule of thumb, you should use many more distinct partition keys than you have shards. Kinesis hashes each record's partition key (using MD5) to decide which shard the record lands in, so a small number of keys can leave some shards overloaded while others sit idle. The more partition keys you have, the more evenly the data records will be distributed across the shards in a stream, which makes the whole process more efficient.
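The effect of the number of partition keys can be sketched in a few lines of Python. Kinesis maps the MD5 hash of a partition key onto a 128-bit hash key space that is divided among the shards. The sketch assumes the shards split that space into equal, contiguous ranges, which may not hold after resharding, and the function names are my own.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard a key maps to.

    Assumes the shards divide the hash key space into equal ranges.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    # Scale the hash into [0, shard_count) using integer arithmetic.
    return hash_value * shard_count // HASH_SPACE


def distribution(keys, shard_count: int) -> Counter:
    """Count how many keys land on each shard."""
    return Counter(shard_for_key(key, shard_count) for key in keys)


# Two keys can occupy at most two of the four shards.
few_keys = distribution(["tenant-a", "tenant-b"], shard_count=4)
# Ten thousand distinct keys spread across all four shards.
many_keys = distribution([f"user-{i}" for i in range(10_000)], shard_count=4)
```

Comparing `few_keys` with `many_keys` shows the point of the rule of thumb: with only two keys, at least two of the four shards receive no data at all.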
- Consumers read and process the information stored in the data records. The consumer, also known as the Amazon Kinesis Data Streams application, can consume and process data using the Kinesis Client Library. The consumer identifies, collects, and processes all of the necessary shards and data records within the stream.
When a consumer is built, it can also be set up to run on multiple Amazon EC2 instances under an Auto Scaling group. The Auto Scaling group can monitor the instances and automatically scale the number of instances as the load changes over time. This means that you can have multiple consumers all working simultaneously to process the data that has been uploaded by the producer.
The Benefits of Amazon Kinesis Data Streams
As I mentioned above, Amazon Kinesis Data Streams are ideal for rapid and continuous data intake and aggregation. The type of data used with Amazon Kinesis Data Streams can include:
- IT infrastructure log data
- Application logs
- Social media data
- Market data
- Web clickstream data
Because the data is taken in and processed in real time, the processing applied to each record in Amazon Kinesis Data Streams is typically fairly lightweight.
Amazon Kinesis Data Streams are also ideal for situations where there are multiple tech stacks that all must be analyzed. For example, in our project, we had tech stacks that used a wide range of software products and programming languages. Instead of setting up consumers for each individual type of tech stack, we were able to process all of the data through one Amazon Kinesis Data Streams consumer process.
Amazon Kinesis Data Streams are typically used in situations such as:
- Accelerated log and data feed intake and processing
- Real-time metrics and reporting
- Real-time data analytics
- Complex stream processing
Though Amazon Kinesis Data Streams can be used in a wide range of situations, in my opinion, it works best for situations where data production and/or processing can be parallelized. A typical example is the map and reduce paradigm.
A Practical Example: The GDPR Consumer
The goal of my project was to find a way to delete user data from my client’s database for users who requested to be removed from the system. I was asked to use Amazon Kinesis Data Streams, since our tech stack was not compatible with the client’s. Therefore, I implemented a consumer solution leveraging the client’s Amazon Kinesis Data Stream.
Our system was consumer-based and was focused on three main actors:
- The Shard Data: The stream that is uploaded by the producer.
- The Shard Manager: The portion of the consumer that collects the data and stores it in a common hash map.
- The Shard Worker: The portion of the consumer that reads the data from the hash map populated by the Shard Manager.
I began by focusing on the shard data that was input by the producer (in this case, a dashboard the client had built). The client’s stream was not specific to us; it included data requests for all of their products. I needed to filter and collect any event in which a user sent a request to delete their account from our system. Therefore, I set up a system in which the consumer’s shard manager searched for any event that was tagged as “GDPR,” “Delete User,” and the client’s name.
The matching events were then read by the consumer’s shard worker. If the shard worker identified a “delete user” request, the event was sent via a secure API call to the backend of the client’s database with a request to anonymize that user’s data.
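The filtering step can be sketched roughly as follows. This is not the code from the project: the JSON payload layout, the `tags` and `user_id` field names, the "ExampleClient" tag, and the shape of the anonymize request are all hypothetical, and the sketch only builds the request rather than calling a real API.

```python
import json
from typing import Dict, Iterable, List

# Tags a record must carry; "ExampleClient" stands in for the client's name.
REQUIRED_TAGS = {"GDPR", "Delete User", "ExampleClient"}


def is_delete_request(data: bytes) -> bool:
    """Return True if the payload is a tagged delete-account event."""
    try:
        event = json.loads(data)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return False
    if not isinstance(event, dict) or "user_id" not in event:
        return False
    tags = event.get("tags", [])
    if not isinstance(tags, list):
        return False
    # Every required tag must be present on the event.
    return REQUIRED_TAGS <= {tag for tag in tags if isinstance(tag, str)}


def build_anonymize_request(data: bytes) -> Dict[str, str]:
    """Build the (hypothetical) request body sent to the backend."""
    event = json.loads(data)
    return {"action": "anonymize", "user_id": event["user_id"]}


def process(records: Iterable[bytes]) -> List[Dict[str, str]]:
    """Keep only delete requests and turn each into an anonymize request."""
    return [build_anonymize_request(r) for r in records if is_delete_request(r)]


records = [
    json.dumps({"tags": ["GDPR", "Delete User", "ExampleClient"],
                "user_id": "u-1"}).encode(),
    json.dumps({"tags": ["GDPR"], "user_id": "u-2"}).encode(),  # missing tags
    b"not json",                                                # ignored
]
requests = process(records)
```

Only the first record carries all three tags, so it is the only one that produces an anonymize request. Everything else on the shared stream is skipped.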
Throughout the project, we only encountered two main obstacles:
- Our client’s application did not run on Amazon EC2 instances. Amazon EC2 instances make it easier to integrate with Amazon Kinesis Data Streams and allow you to run instance-by-instance consumers with auto scaling. Because we could not use Amazon EC2 instances, I set up a consumer that had a defined instance that ran continuously.
While this was not necessarily a challenging obstacle to overcome, using a defined instance so that we could work within the client’s existing system did make the setup process and the consumer slightly less efficient.
- We didn’t know how many shards composed each stream because the client’s producer did not specify the number when the streams were uploaded. To overcome this obstacle, we had our shard manager repeatedly read the streams to find and collect all of the applicable shards. We also did not know when new data would be added, so we needed to keep track of the user requests we had already processed.
Once our parameters were met (the shard manager had gone through the whole stream), we were then able to use the shard worker to assess the shards. After a set amount of time, the shard manager would go through the stream again to detect new data. Once again, I was able to overcome this obstacle; however, it was at the expense of some efficiency.
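The repeated-scan approach can be sketched with a small in-memory simulation. This is an illustration under simplifying assumptions, not the project's code: the stream is a plain dictionary of shard IDs, sequence numbers are small integers, each record is reduced to a user ID, and the class and method names are my own.

```python
from typing import Dict, List, Set, Tuple

Record = Tuple[int, str]  # (sequence number, user id), simplified


class ShardManager:
    """Scans every shard it can see and collects unseen records."""

    def __init__(self) -> None:
        self.pending: Dict[str, List[Record]] = {}  # common hash map
        self.checkpoints: Dict[str, int] = {}       # last sequence seen per shard

    def scan(self, stream: Dict[str, List[Record]]) -> int:
        """Do one full pass; return how many new records were collected."""
        collected = 0
        for shard_id, shard_records in stream.items():  # picks up new shards too
            last_seen = self.checkpoints.get(shard_id, -1)
            fresh = [r for r in shard_records if r[0] > last_seen]
            if fresh:
                self.pending.setdefault(shard_id, []).extend(fresh)
                self.checkpoints[shard_id] = max(seq for seq, _ in fresh)
                collected += len(fresh)
        return collected


class ShardWorker:
    """Drains the hash map and skips requests that were already handled."""

    def __init__(self) -> None:
        self.processed_users: Set[str] = set()

    def drain(self, manager: ShardManager) -> List[str]:
        handled = []
        for shard_id in list(manager.pending):
            for _, user_id in manager.pending.pop(shard_id):
                if user_id not in self.processed_users:
                    self.processed_users.add(user_id)
                    handled.append(user_id)
        return handled


stream = {"shard-0": [(1, "alice")], "shard-1": [(1, "bob")]}
manager, worker = ShardManager(), ShardWorker()
manager.scan(stream)
first_pass = worker.drain(manager)

stream["shard-0"].append((2, "carol"))  # new data arrives
stream["shard-2"] = [(1, "alice")]      # new shard with a repeated request
manager.scan(stream)
second_pass = worker.drain(manager)
```

In a long-running consumer, the scan and drain calls would sit in a loop with a pause between passes. The per-shard checkpoints avoid re-reading old records, and the set of processed users prevents the same deletion request from being sent twice.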
Overall, we were able to create a consumer in Amazon Kinesis Data Streams that successfully isolated and addressed our client’s user “delete account” requests. Now that the system is set up, we are able to periodically monitor the process to ensure it is still running smoothly.