Ep. 85 | Amazon Lake Formation Overview & Exam Prep | Analytics | SAA-C03 | AWS Solutions Architect Associate

Chris 0:00
Hey there, future cloud gurus. If you're a mid-level cloud engineer, data is like the heart and soul of any company these days, and AWS, well, that's where so much of that data lives. So today, we're going to do a deep dive into a service that can help you really tame that data beast. We're talking about Amazon Lake Formation. Think of this deep dive as your cheat sheet, not just for understanding how Lake Formation works, but also how to crush those AWS exams. You know, what's

Kelly 0:28
really cool about Lake Formation? It tackles a problem that's been around forever: the headache of building and managing data lakes. Traditionally, this has been such a long and complex process, requiring specialized skills and a ton of patience. But Lake Formation swoops in to make things way easier, offering a one-stop shop for ingesting, cleaning, securing and governing your data. It's pretty awesome. Okay,

Chris 0:50
so it's like bringing order to the wild west of data. Can you give our listener an idea of where we'd actually see Lake Formation being used like real world examples? Oh,

Kelly 0:59
absolutely. Imagine a healthcare organization, they've got patient data scattered all over the place, right? EMRs, lab results, insurance claims, it's a mess. Lake Formation can centralize all of that into a secure data lake, and then researchers can actually query that data to look for breakthroughs in disease treatment. Pretty cool, huh?

Chris 1:17
That's amazing. It sounds like Lake Formation could really change the game in a bunch of industries. So even if you're listening and you're not a data scientist, understanding Lake Formation is becoming like a must have skill for cloud engineers,

Kelly 1:29
exactly as cloud engineers, you're the architects of this whole cloud infrastructure, and knowing how a service like Lake Formation fits into that, how it makes it possible for different teams to securely access and work with data. That's super powerful.

Chris 1:43
Okay, so we agree Lake Formation is a big deal. Now let's get into the nitty gritty. What are the core features that make this service so unique? One of

Kelly 1:51
the coolest features is how easy it makes data ingestion. With Lake Formation, you can pull data from all sorts of sources, databases, S3 buckets, you name it, and it uses pre built connectors, so no more wrestling with custom scripts just to get the data in. That's a huge time saver. Less

Chris 2:07
scripting, more problem solving. I like the sound of that. So once the data is in, how does Lake Formation help make sense of it all? Think

Kelly 2:15
of it this way. Lake Formation has its own built-in librarian. It creates something called a data catalog. It's a central place for all the metadata, basically a map to your data lake. This makes your data discoverable and searchable for different teams, so no one's wasting time trying to find the right data set. It's all right there. That's a great

Chris 2:33
way to explain it. A librarian for the data lake. So we've got easy data ingestion and a searchable catalog. But what about security? We're talking about sensitive information here.

Kelly 2:43
You're absolutely right. Security is critical, especially in the cloud. Lake Formation has this really robust set of security features and access controls. You can define exactly who can access what data, and you can get really specific with it, controlling access down to specific rows and columns using IAM, or AWS Identity and Access Management. Wow.
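
To make the column-level idea concrete, here's a minimal Python sketch of what a Lake Formation grant request can look like via boto3's grant_permissions API. The role ARN, database, table, and column names are hypothetical placeholders, not a prescription.

```python
# Hedged sketch: building a column-level Lake Formation grant request.
# The role ARN, database, table, and column names are made-up examples.

def build_column_grant(role_arn, database, table, columns):
    """Request body for lakeformation.grant_permissions that limits
    SELECT access to specific columns of a single table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,
            }
        },
        "Permissions": ["SELECT"],
    }

request = build_column_grant(
    "arn:aws:iam::123456789012:role/AnalystRole",
    "patients_db",
    "lab_results",
    ["patient_id", "test_code", "result_value"],
)
# With credentials configured, this would be sent as:
#   boto3.client("lakeformation").grant_permissions(**request)
```

Anyone not covered by a grant like this simply doesn't see those columns when querying through Athena or other integrated services.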

Chris 3:03
So it's not just locking down the whole data lake. You can get super granular with those controls based on a user's role and how sensitive the data is. That's really impressive. Now, what happens when you need to clean up the data a bit before you can analyze it?

Kelly 3:17
That's where data transformation comes in. Think of it like prepping ingredients before you cook. Lake Formation works really well with AWS Glue, a powerful ETL service. You can use Glue to clean, transform and prepare your data for analysis, basically making sure it's ready for prime time.

Chris 3:33
Okay, so Lake Formation preps the data for analysis. This almost sounds too good to be true. Are there any downsides, any limitations our listeners should know about?

Kelly 3:41
You're right to ask about that. Like with any tech, there are trade-offs. Cost is definitely a factor. Even though Lake Formation simplifies a lot, it's important to be mindful of costs, especially if you're working with massive data sets. That

Chris 3:54
makes sense. It's always smart to weigh the pros and cons. Anything else to keep in mind?

Kelly 3:59
Yeah, there's one more thing. Even though Lake Formation makes things simpler, there's still a bit of a learning curve. It takes some time and effort to really understand all the ins and outs of the service and use it effectively. But hey, that's what we're here for, right? Absolutely, that's

Chris 4:12
what the deep dive is all about. Okay, so we've talked about the features, the benefits and even some of the limitations of Lake Formation. Now let's zoom out a bit. How does it fit into the bigger picture of the AWS ecosystem?

Kelly 4:25
Lake Formation is designed to play well with others. It integrates with other AWS services like S3, Glue, Athena, Redshift, the list goes on. It's not a standalone thing. It's part of a whole data analytics world.

Chris 4:37
It's like a central hub connecting all these different data services in AWS, creating this powerful network for managing and analyzing data.

Kelly 4:45
You got it. It's all about these services working together to build a super robust data analytics platform. Okay,

Chris 4:53
so we've laid the groundwork. Now let's get to the part I know our listeners are really interested in: how to conquer those AWS exams. What kind of Lake Formation questions might pop up?

Kelly 5:03
Let's start with a scenario about secure data sharing. Imagine a company that wants to share data from their data lake with an external partner, but they only want to share certain data sets. How could they use Lake Formation to do this securely?

Chris 5:18
That's a great question. I can see how this would really require understanding Lake Formation's fine-grained access control. It's not just a simple, all-or-nothing situation. You're

Kelly 5:28
absolutely right. They'd need to use Lake Formation's IAM integration, create roles and policies specifically for this partner, and then grant access only to the specific databases or tables they're allowed to see. And don't forget, Lake Formation keeps audit trails. You can see who accessed what data and when; it's all tracked,
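
As a rough sketch of the IAM side of partner sharing, here's how a cross-account trust policy for a partner-facing role might be assembled in Python. The account ID, external ID, and role name are all invented for illustration.

```python
import json

# Hypothetical partner account ID; the external ID acts as a shared
# secret so only the intended partner can assume the role.
PARTNER_ACCOUNT = "210987654321"

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{PARTNER_ACCOUNT}:root"},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"sts:ExternalId": "partner-data-share"}
            },
        }
    ],
}

# With credentials in place, the role would be created via:
#   iam.create_role(RoleName="PartnerReadRole",
#                   AssumeRolePolicyDocument=json.dumps(trust_policy))
```

Lake Formation grants would then be attached to that role, scoping the partner to just the approved databases or tables.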

Chris 5:43
so it's not just about giving access, but also making sure there's accountability. Those audit trails are crucial for security and compliance. Let's say a company is dealing with a ton of data coming in from different sources into their data lake. How can they use Lake Formation to make this process more efficient, avoid any bottlenecks. This is

Kelly 6:03
a common challenge when you're working with a lot of data. Lake Formation has a few tools to help with this. First, those pre-built connectors for different data sources really streamline the data flow. They also need to make sure they're using efficient data formats and compression. And then, to handle those peak data loads, they can schedule ingestion tasks during off-peak hours; that helps prevent bottlenecks. So it's
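
The off-peak scheduling piece can be pictured with a Glue scheduled trigger, which accepts a cron expression. A minimal sketch, with the job and trigger names made up:

```python
# Sketch of a scheduled Glue trigger that runs ingestion at 02:00 UTC,
# outside business hours. Names are hypothetical placeholders.
trigger = {
    "Name": "nightly-ingest",
    "Type": "SCHEDULED",
    # AWS cron format: minute hour day-of-month month day-of-week year
    "Schedule": "cron(0 2 * * ? *)",
    "Actions": [{"JobName": "ingest-orders-job"}],
    "StartOnCreation": True,
}
# With credentials in place:
#   boto3.client("glue").create_trigger(**trigger)
```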

Chris 6:26
a multi-pronged approach: optimized connectors, efficient formats and smart scheduling. It seems like cost optimization is always a big concern for companies using cloud services. What can they do with Lake Formation to keep those cloud bills under control? It all

Kelly 6:39
comes down to lifecycle management. Lake Formation works with S3 Lifecycle policies to define how data is stored and how it's moved between storage classes over time. So data that's accessed a lot can be kept in fast, readily available storage, while data that's rarely touched can be archived to a cheaper storage tier, like Amazon S3 Glacier. So it's about
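
A minimal sketch of that lifecycle idea as an S3 Lifecycle configuration. The bucket layout (a `raw/` prefix for cold data) and the transition days are assumptions for illustration:

```python
# Sketch of an S3 Lifecycle configuration: anything under raw/ moves to
# cheaper storage tiers as it ages, while hot data stays in Standard.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-cold-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 180, "StorageClass": "GLACIER"},
            ],
        }
    ]
}
# With credentials in place:
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="example-data-lake",
#       LifecycleConfiguration=lifecycle_config)
```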

Chris 7:00
finding that balance between easy data access and keeping storage costs down

Kelly 7:05
exactly. And Lake Formation gives you visibility into how your data is being stored and used so you can make informed decisions about those life cycle policies and your overall cost optimization strategies. Data

Chris 7:15
visibility is key. You can't optimize what you can't see. Okay, let's give our listeners one more exam-style question. Let's say a company needs to make absolutely sure that no object in their S3 bucket can be overwritten or deleted by anyone for an entire year. They already have versioning enabled on the bucket. How would they achieve this level of protection?

Kelly 7:34
That sounds like a job for S3 object lock, a feature that's specifically designed to prevent accidental or malicious deletion or modification of objects. Object lock has two modes, governance mode and compliance mode,

Chris 7:47
okay, two modes. What's the difference? That sounds like something that could be important for the exam?

Kelly 7:52
You're right. Understanding these modes is key. Governance mode is more flexible. It lets users with special permissions modify or delete objects even when a retention period is in place. Compliance mode is like the Fort Knox of data protection. Once you set a retention period in compliance mode, no one, not even the root user, can change or delete the object until that retention period is over.
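
Sketching the one-year compliance-mode scenario in Python: the bucket and key are hypothetical, and note that Object Lock has to be enabled when the bucket is created.

```python
from datetime import datetime, timedelta, timezone

# One-year retention in COMPLIANCE mode: nobody, not even the root
# user, can delete or overwrite this object version until the date passes.
retain_until = datetime.now(timezone.utc) + timedelta(days=365)

retention_request = {
    "Bucket": "example-locked-bucket",        # hypothetical bucket
    "Key": "records/ledger-2024.parquet",     # hypothetical key
    "Retention": {
        "Mode": "COMPLIANCE",
        "RetainUntilDate": retain_until,
    },
}
# With credentials in place:
#   s3.put_object_retention(**retention_request)
```

Swapping `"COMPLIANCE"` for `"GOVERNANCE"` would give the more flexible mode, where specially permissioned users can still override the lock.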

Chris 8:15
So if they want absolute, unbreakable protection for a year, they'd have to go with compliance mode. That makes sense. But what if they only want to prevent accidental deletion, but still want to be able to overwrite with newer versions of the object? How would they handle that?

Kelly 8:29
That's where legal holds come in. Think of a legal hold as an indefinite do not delete flag on an object. It works separately from a retention period. It stays in place until it's manually removed.

Chris 8:39
So they could combine versioning with a legal hold on those objects. That way they prevent deletion, but can still update with newer versions Exactly.

Kelly 8:48
It's a good balance between data protection and flexibility. Okay, I think

Chris 8:52
we've given our listeners a lot to think about. In this first part of our deep dive into Lake Formation, we've covered the core features, the benefits, talked about some limitations, and even tackled some exam style scenarios to get those brains working.

Kelly 9:04
We've really dug into how Lake Formation makes setting up a data lake easier, how it enhances security and governance, and how seamlessly it integrates with the broader AWS ecosystem.

Chris 9:14
Stay tuned for part two, where we'll dive into even more real world scenarios and explore how Lake Formation can help you solve those everyday data challenges.

Kelly 9:23
Welcome back, data wranglers ready to go even deeper into Lake Formation, and you know how it can help you ace those AWS exams?

Chris 9:30
For sure, I bet our listeners are ready to put all that Lake Formation knowledge to the test. What kind of scenarios should they be ready for?

Kelly 9:37
Okay, let's say there's a company that wants to set up a secure and efficient workflow for analyzing sensitive data. This data is in their S3 data lake, and they have a team of data analysts who need to be able to query this data using Amazon Athena, but they also need to make sure that only authorized people can access it and that the data itself is protected from unauthorized modification. How can they do all of that with Lake Formation? That

Chris 10:01
sounds like a classic challenge, trying to balance data accessibility with security. It's like walking a tightrope, exactly

Kelly 10:07
you nailed it, and Lake Formation gives you the tools to walk that tightrope without, you know, falling off. First things first, they need to define the data in their S3 data lake as a data source in Lake Formation. This creates a link between that data in S3 and the Lake Formation data catalog. This way the data is properly managed and easy to find. So
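
Registering an S3 location with Lake Formation boils down to a single API call. A sketch with a made-up bucket ARN:

```python
# Sketch: registering an S3 path as a Lake Formation data location.
# The ARN is a placeholder; UseServiceLinkedRole lets Lake Formation
# manage the credentials it uses to read that location.
register_request = {
    "ResourceArn": "arn:aws:s3:::example-data-lake/curated",
    "UseServiceLinkedRole": True,
}
# With credentials in place:
#   boto3.client("lakeformation").register_resource(**register_request)
```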

Chris 10:27
it's like registering your data with Lake Formation's central directory, making sure it's on the map, right? What's next? Okay,

Kelly 10:33
next up, they've got to grant those data analysts the right permissions in Lake Formation, so we're talking about their IAM roles or users. These permissions should give them access to the specific databases and tables they need to query in Athena. And here's where it gets really cool. Lake Formation lets you define very granular permissions. You can control access down to the column level. Wow, that's

Chris 10:58
pretty awesome. It's not just giving them access to the whole data lake. You can get really specific with the rules based on their roles and what information they need. What about protecting the data from being modified?

Kelly 11:07
To stop any unauthorized modifications, they can use S3 Object Lock to make the data immutable. Remember those Object Lock modes we talked about earlier? They can choose to use Object Lock in either governance mode, which lets authorized users make changes, or compliance mode, which stops any modifications, even by the root user. It's locked down. So it's

Chris 11:28
about choosing the right level of protection. Makes sense. So they've defined their data, and they've got access control and immutability set up. Anything else to think about for a secure data analysis workflow?

Kelly 11:38
Oh yeah. They should also definitely enable encryption for the data in their S3 data lake. It's a great extra security layer. This makes sure that even if someone manages to get unauthorized access to those S3 objects, the data itself is still protected. It's like extra armor for your sensitive info.
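
That extra armor can be configured as a bucket default. A sketch using SSE-KMS, with a hypothetical key alias:

```python
# Sketch: default SSE-KMS encryption for a data-lake bucket, so every
# object is encrypted at rest without callers opting in per request.
encryption_config = {
    "Rules": [
        {
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "alias/data-lake-key",  # hypothetical alias
            },
            "BucketKeyEnabled": True,  # reduces per-object KMS request costs
        }
    ]
}
# With credentials in place:
#   s3.put_bucket_encryption(
#       Bucket="example-data-lake",
#       ServerSideEncryptionConfiguration=encryption_config)
```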

Chris 11:56
Great point. So they're using Lake Formation to define data sources and control access with specific permissions, and they've got encryption and immutability with Object Lock. Sounds like a really solid and secure setup for data analysis with Athena. Now let's switch gears a bit and talk about data transformation. Imagine a company has a massive data set stored in a relational database. They want to move this data into their data lake, but in a format that's optimized for analytics. How would they approach this using Lake Formation? This

Kelly 12:25
is where Lake Formation and AWS Glue team up to create a powerful data transformation pipeline. Glue is AWS's fully managed ETL service, and it works perfectly with Lake Formation. Okay, so

Chris 12:35
how do these two services actually work together? First, they define

Kelly 12:38
that relational database as a data source in Lake Formation. Then they'd use AWS Glue to create an ETL job. This job extracts the data from the database, transforms it into a format like Apache Parquet, which is perfect for analytics, and then loads this transformed data into their S3 data lake.

Chris 12:56
So Glue does the heavy lifting of transforming the data, making it ready for analysis. I've heard Parquet mentioned before, especially with big data. What makes it so special? Parquet

Kelly 13:05
is a columnar storage format. This means it stores data by columns instead of rows, and that's a game changer for analytics, because it lets you query only the columns you actually need. This makes queries way faster and uses less storage compared to, say, CSV or JSON files.
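
Here's a toy illustration in plain Python, with no Parquet library involved, of why column orientation helps: an aggregate over one field only has to touch that field's values.

```python
# Toy comparison of row vs. columnar layout. The records are invented
# sample data, just to show what each layout has to read per query.
rows = [
    {"user": "a", "age": 34, "spend": 120.0},
    {"user": "b", "age": 28, "spend": 75.5},
    {"user": "c", "age": 45, "spend": 210.0},
]

# Row layout: averaging one field still scans every full record.
avg_row = sum(r["spend"] for r in rows) / len(rows)

# Columnar layout: the same query reads only the one column it needs.
columns = {k: [r[k] for r in rows] for k in rows[0]}
avg_col = sum(columns["spend"]) / len(columns["spend"])

assert avg_row == avg_col  # same answer, far less data touched per column
```

Real columnar formats like Parquet add compression and column statistics on top of this layout, which is where the big query speedups come from.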

Chris 13:22
Ah, so for someone who's used to those traditional formats, switching to Parquet is like trading in a bicycle for a sports car when you're querying huge data sets. And Lake Formation helps by providing that central metadata repository for all this transformed data, right? You got it. Analysts

Kelly 13:36
can easily find and understand the data even after it's been transformed. It's like having a data dictionary to keep everything organized and easy to search. Okay,

Chris 13:46
ready for another exam scenario? Let's say a company needs to keep a close eye on their Lake Formation environment, you know, for security and compliance. They want to be alerted if there were any unauthorized access attempts or if something changes in the configuration that goes against their security policies. What would you recommend?

Kelly 14:03
That sounds critical for any company that's working with sensitive data. You need to have eyes everywhere.

Chris 14:09
Absolutely. In this case, I'd definitely recommend integrating Lake Formation with AWS CloudTrail and Amazon CloudWatch. CloudTrail logs API activities, so if they enable it for their Lake Formation account, they can track literally every action that happens within the service. So

Kelly 14:23
So CloudTrail's like a security camera recording every move in Lake Formation. That's super valuable for audits and investigations. It's essential for security and compliance, no doubt. And to take it a step further, they can connect CloudTrail with CloudWatch to create alarms. These alarms trigger notifications when specific events happen. It's like an early warning system. So
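
To sketch that early-warning system: a metric filter on the CloudTrail log group counts AccessDenied events, and an alarm fires when the count is nonzero. The log group, namespace, and SNS topic names here are hypothetical.

```python
# Sketch: turning CloudTrail "AccessDenied" events into a CloudWatch
# alarm. Log group, filter, and topic names are made-up examples.

metric_filter = {
    "logGroupName": "cloudtrail-logs",
    "filterName": "access-denied-count",
    "filterPattern": '{ $.errorCode = "AccessDenied" }',
    "metricTransformations": [{
        "metricName": "AccessDeniedCount",
        "metricNamespace": "DataLake/Security",
        "metricValue": "1",
    }],
}

alarm = {
    "AlarmName": "data-lake-access-denied",
    "MetricName": "AccessDeniedCount",
    "Namespace": "DataLake/Security",
    "Statistic": "Sum",
    "Period": 300,                 # evaluate in 5-minute windows
    "EvaluationPeriods": 1,
    "Threshold": 0,
    "ComparisonOperator": "GreaterThanThreshold",  # any denial triggers it
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:security-alerts"],
}
# With credentials in place:
#   logs.put_metric_filter(**metric_filter)
#   cloudwatch.put_metric_alarm(**alarm)
```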

Chris 14:44
for example, they could set up an alarm to notify them if someone tries to access data from a weird IP address, or if someone tries to change a permission in a way that breaks their security rules. That kind of real-time monitoring is incredibly powerful, you know, catching those issues before they become a problem. Absolutely,

Kelly 15:01
this proactive approach helps them spot and respond to potential security threats and compliance violations right away. And remember, they can beef up their security even more by using Lake Formation with other AWS security services like Amazon GuardDuty and AWS Security Hub. It's all about layers of protection. So it's

Chris 15:19
about building a multi-layered security approach, using different services to protect data from all angles. Speaking of angles, let's talk about cost optimization. Let's say a company is worried about storage costs for their data lake. The lake has data that's accessed all the time, but also data that's rarely used. How can they use Lake Formation to save money on storage? This

Kelly 15:41
is where those S3 Lifecycle policies we talked about come in. They're super helpful. Lifecycle policies let you set rules for how data is stored and how it moves between different storage classes over time. You base these rules on how the data is used.

Chris 15:55
So if there's data that's not accessed often, they can set up a lifecycle rule to automatically move it to a cheaper storage class, like Amazon S3 Intelligent-Tiering or S3 Glacier. That way they're not paying a premium for storage they're not even using.

Kelly 16:07
Right, and they can even set up rules to delete data after a certain amount of time. It's like decluttering your digital storage space, you know? Makes things more efficient. It's

Chris 16:16
like taking those boxes you never use and moving them to a cheaper self-storage unit. You free up space and save money. How do they know which data to move and when? That sounds complicated. That's

Kelly 16:27
where Lake Formation's centralized view of data storage and usage comes in. It gives you all the info you need to make smart decisions about your lifecycle policies and other cost-saving strategies.

Chris 16:37
Data visibility is so important; you can't optimize what you can't see. Okay, last exam scenario for part two. Let's say a company wants to make absolutely sure that all data that goes into their data lake follows a specific schema. This enforces data quality and consistency from the get-go. How can they do this with Lake Formation? Enforcing

Kelly 16:57
a schema during data ingestion is a smart move for data quality. Lake Formation has something called schema-on-read, which lets you define a schema when you query the data. But in this case, they need to make sure the schema is followed up front. For that, they can use the AWS Glue Data Catalog Schema Registry and AWS Glue jobs for validation. Okay,

Chris 17:16
so it's a two-step process: define the schema and then use Glue to enforce it. Can you break those steps down a bit more?

Kelly 17:21
Sure. They would start by defining the schema they want in the AWS Glue Data Catalog Schema Registry. This creates a central definition of how the data should be structured. It's like a blueprint. Next, they'd use AWS Glue to create ETL jobs that check the incoming data against this schema during ingestion. Glue has built-in transformations for data validation, like checking data types, making sure constraints are met and handling null values. It's like a quality-control checkpoint. So
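
A minimal, framework-free sketch of that quality-control checkpoint, the kind of check a Glue job could apply during ingestion. The schema and field names are invented examples.

```python
# Sketch of schema-on-write validation in plain Python. In practice a
# Glue ETL job would apply checks like this; the schema is hypothetical.
SCHEMA = {"patient_id": int, "test_code": str, "result_value": float}

def validate(record, schema=SCHEMA):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, expected in schema.items():
        if field not in record or record[field] is None:
            problems.append(f"missing: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"bad type: {field}")
    return problems

good = {"patient_id": 1, "test_code": "A1C", "result_value": 5.4}
bad = {"patient_id": "one", "test_code": "A1C"}  # wrong type, missing field
```

Records that fail could be routed to a quarantine prefix in S3 for review instead of landing in the curated zone, so bad data never reaches analysts.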

Chris 17:48
it's like a bouncer at the door of the data lake, making sure the data is dressed appropriately, right? By enforcing a schema during ingestion, they're proactively maintaining data quality and consistency, which is super important for reliable analysis and reporting down the line. Exactly,

Kelly 18:03
it's much easier to deal with data quality problems up front than trying to fix a messy data lake later. Well,

Chris 18:09
said. I think we've covered a ton in this second part of our Lake Formation deep dive. For

Kelly 18:14
sure, we've talked about how Lake Formation helps secure sensitive data, streamlines transformation, makes monitoring for security and compliance easier, optimizes storage costs and enforces data quality through schema validation. It's a lot.

Chris 18:28
Our listeners should be feeling pretty good about tackling Lake Formation questions on those AWS exams and using what they've learned in real world situations. But get ready, because we've got one more part to go. Stay

Kelly 18:39
tuned for part three, where we'll dive into some advanced Lake Formation concepts that will take your understanding to the next level. Welcome

Chris 18:47
back for the final part of our deep dive into Amazon Lake Formation. I'm really excited to explore some of the more advanced concepts, the stuff that can really take your data lake game to the next level.

Kelly 18:59
We've covered a lot, but yeah, there's always more to learn. In this last part, we'll dig into things like data lineage, data versioning, and even how Lake Formation works with machine learning. Data lineage

Chris 19:10
sounds interesting. I've heard the term before, but I'm not sure I totally get it like, what does it actually mean in the context of a data lake?

Kelly 19:17
Okay, think of data lineage as like a detective case file for your data. It's all about tracking where your data comes from, how it moves through the data lake, what changes are made to it, and where it ends up. It's like having a complete history of your data's life cycle.

Chris 19:31
So it's not just knowing where the data is now, but understanding the whole journey, where it came from, how it's transformed and where it's stored now. That sounds super valuable for data quality and compliance,

Kelly 19:42
for sure, and Lake Formation has built-in tools for data lineage tracking. It's a game changer for data governance and debugging, you know, figuring out what went wrong. Let's say a data analyst finds something weird in a report that's based on data from the lake. With data lineage, they can trace that data back to the source, see all the transformations that happened along the way, and find any potential errors or inconsistencies. That's
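
The breadcrumb-trail idea can be sketched as a tiny lineage graph in plain Python, where each dataset records what it was derived from. The dataset names are invented.

```python
# Toy lineage graph: each dataset lists the datasets it was derived
# from, so any output can be traced back to its original sources.
lineage = {
    "report":         ["cleaned_claims"],
    "cleaned_claims": ["raw_claims"],
    "raw_claims":     [],  # an original source has no parents
}

def trace_to_sources(dataset, graph):
    """Follow lineage edges back to the original source datasets."""
    parents = graph.get(dataset, [])
    if not parents:
        return {dataset}
    sources = set()
    for parent in parents:
        sources |= trace_to_sources(parent, graph)
    return sources
```

Calling `trace_to_sources("report", lineage)` walks the chain back through `cleaned_claims` to `raw_claims`, which is exactly the "detective case file" an analyst needs when a report looks wrong.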

Chris 20:07
a great example. It's like following a trail of breadcrumbs back to the root of the problem. And it seems like data lineage would be really important for compliance too.

Kelly 20:15
Oh, absolutely. You can use data lineage to show exactly how sensitive data is handled within the data lake. It's like a detailed audit trail that can satisfy even the toughest regulations. So it's

Chris 20:25
a win, win, good for data quality and for compliance. Sounds like data lineage is a must have for anyone who's serious about building a solid, trustworthy data lake.

Kelly 20:34
Couldn't agree more. Now let's move on to another important concept: data versioning in Lake Formation. Just like with software, keeping track of different versions of your data is crucial. It helps maintain data integrity, allows for rollbacks, and lets you see how things have changed over time. Makes

Chris 20:54
sense. But how does data versioning actually work in a data lake? It seems like that could get pretty complicated. It's

Kelly 21:00
actually pretty straightforward. With Lake Formation, you can use S3's built-in versioning features to manage different versions of the data in your lake. Every time data is changed, a new version is created and the old versions are kept, so you have a history of the data. So it's
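
A toy model in plain Python of what versioning buys you, simulating the bucket as a dict: every put keeps the old version, and rolling back is just reading an earlier one. The key and contents are invented.

```python
# Toy simulation of S3-style versioning: each key maps to a list of
# versions, newest last, so history is never lost on overwrite.
bucket = {}

def put(key, body):
    """Store a new version without discarding earlier ones."""
    bucket.setdefault(key, []).append(body)

def get(key, version=-1):
    """Latest version by default; pass an index to read history."""
    return bucket[key][version]

put("report.csv", "v1: draft numbers")
put("report.csv", "v2: corrected numbers")

latest = get("report.csv")        # the current version
rollback = get("report.csv", 0)   # the "time machine" view
```

Real S3 versioning works per object with opaque version IDs rather than list indexes, but the safety-net property is the same: an overwrite never destroys the previous version.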

Chris 21:15
like a time machine for your data. You can go back and see how the data looked at different points in time. What are the advantages of doing things this way? Well,

Kelly 21:22
versioning gives you a safety net. If there's a mistake during data processing, or if you need to go back to an earlier version, you can easily roll back, no problem. That definitely

Chris 21:32
takes some of the pressure off. Knowing you can always go back is really reassuring. Anything else?

Kelly 21:38
Yeah, versioning also lets you track changes over time. You can see how data has evolved and who made those changes. This is really useful for audits and compliance, because it gives you a clear record of every modification. It's all documented. It

Chris 21:53
seems like data versioning is just like a fundamental best practice for any data lake. It gives you peace of mind, knowing you can recover from mistakes and see the full history of your data. You

Kelly 22:03
got it? Okay, let's shift gears a bit and talk about how Lake Formation works with machine learning. More and more companies are using machine learning to get insights from their data. So how does Lake Formation support that? That's

Chris 22:14
a great question. Machine learning is everywhere these days, so it's important to understand how a data lake platform like Lake Formation can help with those workflows.

Kelly 22:23
The good news is, Lake Formation works seamlessly with AWS machine learning services like Amazon SageMaker. You can build, train, and deploy models directly from your data lake. It's like a direct connection from your data to those powerful machine learning algorithms. Can

Chris 22:38
you give us a specific example? Sure,

Kelly 22:41
imagine a company that wants to build a model to predict which customers are likely to leave, you know, churn. They have all this customer data in their data lake, stuff like demographics, purchase history and engagement metrics. First, they'd use Lake Formation to define the relevant data sources and set up permissions for their data scientists and ML engineers. So

Chris 23:00
just like we talked about before, secure and controlled data access is the first step. You

Kelly 23:05
know it, security is always number one. Then they'd use Amazon SageMaker to build and train their model. SageMaker can access the data directly in the data lake because of Lake Formation's integration. That's

Chris 23:16
really handy. No need to move data around manually. Saves a lot of time and potential errors, right?

Kelly 23:21
It streamlines the whole process. Once the model is trained, they can deploy it using SageMaker and use it to make predictions in real time. And this is where data lineage is so useful. Lake Formation tracks lineage for machine learning workflows, so you can see exactly which data was used to train a model and what transformations were applied. So you

Chris 23:41
can trace a model all the way back to the original data and understand how it was created. That seems super important for transparency and explainability, especially as these models get more complex and have a bigger impact, it's

Kelly 23:54
all about building trust in the models and making sure they're based on reliable data. Well, I think we've covered an incredible amount of ground in this deep dive into Amazon Lake Formation.

Chris 24:05
We've gone from the basics of setting up a data lake to advanced concepts like data lineage and machine learning. It's been quite a journey. I'm

Kelly 24:12
sure our listeners are ready to tackle Lake Formation questions on their AWS exams and put all this knowledge to work in real world data lake projects,

Chris 24:21
Remember, the data world is constantly changing. Keep learning, keep exploring and keep pushing the boundaries of what's possible with data in the cloud. Until next time, happy data diving.
