Ep. 82 | Amazon Athena Overview & Exam Prep | Analytics | SAA-C03 | AWS Solutions Architect Associate

Chris 0:00
All right, let's kick things off today with a deep dive into Amazon Athena, one that's especially relevant for cloud engineers. Picture this: you're a cloud engineer, and you've got tons of data just sitting there in S3, right? But you don't want to deal with setting up servers. Yeah, no one wants to do that. That's where Athena comes in.

Kelly 0:22
Exactly. Athena really shines because it lets you query that data directly in S3.

Chris 0:27
Yeah, so no need for, like, a traditional data warehouse. Exactly.

Kelly 0:30
It's like a magic wand, you know? It unlocks insights from your data directly

Chris 0:35
in S3, yeah,

Kelly 0:36
all using good old SQL.

Chris 0:38
So we keep hearing this term "serverless," but what does that really mean for us cloud engineers?

Kelly 0:43
Yeah, it's a bit of a buzzword. It is, yeah. So basically, serverless means you can just focus on analyzing the data, right, and not managing the infrastructure. Makes sense. Athena handles all the hard stuff for you, like provisioning servers, scaling resources and all that. You just get to focus on the data itself, yeah, and you only pay for the queries you run, that's awesome, yeah, which can save you a lot of money compared to, like, running your own data warehouse.

Chris 1:06
So no more provisioning EC2 instances, right? No configuring Hadoop clusters, all that. Yeah, exactly. It simplifies things a lot. So let's talk about some real-world examples. Okay, how are cloud engineers actually using Athena? Yeah,

Kelly 1:19
so let's say you have, like, terabytes of log files sitting in S3. Been there. Yeah, and you need to figure out what's causing that, like, weird app error you've been seeing. Yeah. You can use Athena to just query those logs directly, wow, using SQL, and find those, like, patterns and anomalies without having to, like, move the data or transform it.
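To make that concrete: this is a sketch of the kind of ad hoc troubleshooting query being described, assuming the logs are already registered as a hypothetical table called `app_logs` with `status_code` and `request_time` columns (your own schema will differ):

```sql
-- Hypothetical table and columns; counts server-side errors per status code.
SELECT status_code,
       COUNT(*)          AS occurrences,
       MIN(request_time) AS first_seen,
       MAX(request_time) AS last_seen
FROM app_logs
WHERE status_code >= 500
GROUP BY status_code
ORDER BY occurrences DESC;
```

Nothing moves out of S3 here; Athena reads the log objects in place and returns the aggregate.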

Chris 1:42
So it's a huge time saver, huge Okay, so we've got troubleshooting. What else can we use Athena for?

Kelly 1:47
Well, let's say you're working on a website and you want to understand user behavior. Sure. You could use Athena to query clickstream data, makes sense, that's also stored in S3. You can find out, like, which pages are popular, where users are dropping off, like, why are they leaving the site. So that can help us improve the user experience? Exactly, yeah, and lead to better business outcomes.
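For illustration, the "which pages are popular" question might look like this, assuming a hypothetical `clickstream` table with `page` and `session_id` columns:

```sql
-- Hypothetical table and columns; ranks pages by views and unique sessions.
SELECT page,
       COUNT(*)                   AS page_views,
       COUNT(DISTINCT session_id) AS sessions
FROM clickstream
GROUP BY page
ORDER BY page_views DESC
LIMIT 20;
```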

Chris 2:07
It's really sounding like Athena is a pretty versatile tool. Oh yeah, for sure. So it's not just for technical folks either, right?

Kelly 2:13
Anyone who needs to analyze data can benefit from it.

Chris 2:17
So we've established that Athena is serverless. It uses SQL, yep, it has a wide range of applications. Now let's dig a little deeper. Okay, what are some of the features and benefits that make it such a powerhouse for data analysis?

Kelly 2:32
Yeah, so one of the coolest things about Athena is that it's really flexible with data formats. Okay, how so? It doesn't matter if you have CSV, JSON, Parquet, Avro, or even, like, compressed files. Athena

Chris 2:45
can handle it. Athena can handle it all. So you don't have to convert your data. You

Kelly 2:48
don't have to convert it before querying. That's great. Yeah, that's a huge relief for cloud engineers who are always dealing with, like, different data formats, right? Totally. It just makes things easier.
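The way that flexibility works in practice is that the format is declared per table, via the SerDe in the table definition. A minimal sketch for a CSV layout, with a hypothetical bucket and hypothetical columns:

```sql
-- Bucket name and columns are hypothetical.
CREATE EXTERNAL TABLE IF NOT EXISTS sales_csv (
    order_id   STRING,
    amount     DOUBLE,
    order_date DATE
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = ',')
LOCATION 's3://example-analytics-bucket/sales/'
TBLPROPERTIES ('skip.header.line.count' = '1');
```

Swapping the SerDe (JSON, Parquet, Avro) is all it takes to point Athena at a different format; the data itself never has to be converted first.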

Chris 2:58
But every service has its limitations. Of course, what are some things we need to be aware of with Athena?

Kelly 3:04
So while Athena is great for ad hoc queries and batch analysis, it's not the best choice for, like, real-time transactional workloads. So

Chris 3:12
if you need, like, super fast responses, if you need

Kelly 3:14
ultra-low latency, you might want to consider other options, like DynamoDB or Aurora, right?

Chris 3:21
So for those use cases where speed is critical, Athena might not be the best fit, right?

Kelly 3:25
And while Athena can handle huge data sets, yeah, we talked about that, query performance can vary a bit depending on the size and format of your data. So we need to choose our data formats carefully? Especially for large data sets, using efficient formats like Parquet can make a big difference. Okay, so Parquet for performance, right? Got it. It's all about choosing the right tool for the job, right? And Athena really excels at serverless querying of S3 data, for sure. And it fits in nicely with other AWS services. Like what? You can use AWS Glue for data cataloging, cleaning, and transformation. Okay, and then Athena can query that prepared data directly in S3.
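One common way to get data into Parquet without leaving Athena at all is a CTAS (CREATE TABLE AS SELECT) statement. A sketch, with hypothetical source table, target bucket, and column names:

```sql
-- Source table, target bucket, and columns are hypothetical.
-- The partition column (order_date) must come last in the SELECT list.
CREATE TABLE sales_parquet
WITH (
    format            = 'PARQUET',
    external_location = 's3://example-analytics-bucket/sales-parquet/',
    partitioned_by    = ARRAY['order_date']
) AS
SELECT order_id, amount, order_date
FROM sales_csv;
```

One statement reads the old format and writes Parquet back to S3, ready for cheaper, faster queries.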

Chris 4:06
So it's a powerful end-to-end solution.

Kelly 4:07
It is. All right, well,

Chris 4:08
let's shift gears a little bit, okay, and put our Athena knowledge to the test. Sounds good. I know a lot of our listeners are studying for those cloud engineer certification exams, right? So let's look at some example questions, let's do it, that might pop up on those exams. I'm

Kelly 4:20
ready, fire away. Okay, first question: you need to query a large data set stored in S3, okay, but you need to do it in a cost-effective way, and you don't want to manage any servers. Got it. Which AWS service comes to mind? This

Chris 4:36
sounds familiar! Yeah, we've been talking about it a lot, right? Serverless, cost-effective, querying data in S3. I'm gonna say Amazon Athena. Bingo,

Kelly 4:44
you got it. Awesome. Athena is definitely the right choice here. It's serverless, so you don't have to worry about managing servers. It can query that data directly in S3, right? No complex data pipelines needed, and you only pay for what you use, which makes it really cost-effective, especially compared to running your own data warehouse.

Chris 5:05
Makes sense. Okay, let's get a little more technical. Imagine you have a team and they need to analyze web server log files, okay. These log files are stored in S3, yeah, they're massive, and they're in a compressed format. Sure. How can we query these logs efficiently using Athena? This

Kelly 5:24
is a great question, and it really shows how powerful Athena is. Okay. So first of all, Athena can query compressed data formats directly. Okay, like what kind of formats are we talking about? Like gzip or bzip2. Got it. You don't have to uncompress those files before querying. That's really convenient. Yeah, it saves you a ton of time and storage.
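Athena detects common compression codecs like gzip from the file extension, so the table definition for compressed logs looks no different from an uncompressed one. A sketch, assuming tab-delimited, gzipped access logs in a hypothetical bucket:

```sql
-- Bucket and columns are hypothetical; the .gz objects under this
-- location are decompressed transparently at query time.
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
    request_time STRING,
    client_ip    STRING,
    method       STRING,
    uri          STRING,
    status_code  INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://example-log-bucket/access-logs/';
```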

Chris 5:45
So we can query those compressed logs right in S3 right in S3 but I'm guessing there's more to it than that. There is, especially when we're talking about efficiency with large data sets. Exactly.

Kelly 5:54
So while Athena can handle compressed files, the efficiency of your queries really depends on the data format itself, okay,

Chris 6:02
so some formats are better for analysis than others, exactly.

Kelly 6:05
In this case, I'd really recommend using a columnar format like Parquet. Okay, Parquet? What's Parquet? Well, Parquet is designed for efficient data storage and retrieval, especially when you're dealing with large data sets. Okay. Unlike row-based formats, Parquet stores data in columns. Okay, so how does that help? Well, that means Athena can read only the columns it needs for that specific query. That makes sense. So it doesn't have to scan through all the data. It can be very targeted, exactly, and that leads to faster query performance and lower costs.
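That column pruning is also why selecting only what you need matters so much on Parquet: Athena bills by bytes scanned, so of the two queries below, the second would typically scan far less data than the first (assuming a hypothetical Parquet-backed table `sales_parquet`):

```sql
-- Reads every column in the matching rows:
SELECT * FROM sales_parquet
WHERE order_date = DATE '2024-01-01';

-- Reads only the two columns it touches, so far fewer bytes are billed:
SELECT order_id, amount
FROM sales_parquet
WHERE order_date = DATE '2024-01-01';
```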

Chris 6:36
So it's not just about compressing the data, it's about how the data is organized. You got it? Okay, so parquet plus Athena, that sounds like a good combo. It's

Kelly 6:45
a winning combination. All

Chris 6:46
right, let's try another scenario. What if a marketing team wants to run ad hoc SQL queries on data in S3, okay, but this data is updated frequently? What considerations should we keep in mind when using Athena for this kind of use case? This

Kelly 7:02
is where things get interesting. Athena seems like the perfect fit for ad hoc queries, right? It's serverless, it supports SQL, it can query that data right in S3. Yeah, it checks all the boxes. But here's the thing: remember how Athena's pricing works? It's pay-per-query. Exactly, you pay for each query you run. So if the data set is updated frequently, and that leads to more queries,

Chris 7:23
the cost could add up exactly. So how do we manage those costs, especially when the data is changing so often?

Kelly 7:29
That's where a data catalog, like the AWS Glue Data Catalog, comes in. Okay, Glue Data Catalog, I've heard of that. Yeah, think of it like a central metadata repository for your data lake. Got it. So instead of Athena scanning all of S3 for the data it needs, it can just check the Glue

Chris 7:45
data catalog. Ah, so the Glue Data Catalog tells Athena where the data is. Exactly.

Kelly 7:50
It's like a road map for Athena. So the queries are more efficient? More efficient. That reduces costs, and it's really valuable for those frequently updated data sets, because the Glue Data Catalog keeps track of all those schema changes.

Chris 8:02
So our queries are always consistent. Always consistent. That's really important when the data is constantly evolving. It is. One last question before we wrap up this exam prep section. Shoot. Let's say you're working on a project and you need to query data from multiple S3 buckets, okay, and these buckets are in different AWS accounts. All right, how can you make this easier using Athena? That's

Kelly 8:26
a good question. Multi-account scenarios can be a little tricky. They can, but Athena has some pretty elegant solutions for this. Okay, I'm listening. The key is to use cross-account access within Athena. So

Chris 8:39
we don't have to copy or move data between accounts. You don't have to move data around. That

Kelly 8:43
can get messy and expensive. You can just set up permissions, okay, permissions in the other accounts, exactly, the accounts that own the data. So they give our Athena account permission to access their buckets? Exactly. And you have a lot of control over these permissions, so we

Chris 8:58
can restrict access to specific buckets, folders, or even individual objects.

Kelly 9:02
You got it, so you can make sure your data stays secure and compliant. That's good to know. And once you have those permissions set up, you can create something called external tables in your Athena account. External tables, what are those? Think of them like virtual views of data that lives outside of your Athena account. Interesting. They let Athena query the data directly in those external S3 buckets without having to import it or copy it. So it simplifies things a lot? It does. It makes it much easier to query data from multiple accounts.
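As a sketch: assuming the other account's bucket policy already grants your account `s3:GetObject` and `s3:ListBucket`, the external table in your account is just a normal table definition whose LOCATION points at the other account's bucket (all names below are hypothetical):

```sql
-- This bucket lives in a different AWS account; access comes from that
-- account's bucket policy, not from anything in this DDL.
CREATE EXTERNAL TABLE IF NOT EXISTS partner_orders (
    order_id STRING,
    amount   DOUBLE
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://partner-account-bucket/orders/';
```

Once this exists, `partner_orders` can be joined against your own tables as if the data were local, with no copying between accounts.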

Chris 9:31
So it's like all the data is in your account, right, but without all the hassle of data duplication. I like it. Athena is really like the Swiss Army knife of serverless querying. I think that's a pretty good analogy. It's flexible, cost-effective, and it plays well with other AWS services. It does. But before we get too carried away with all the amazing things Athena can do, yeah, we need to talk about data

Kelly 9:52
quality. Of course, data quality is crucial.

Chris 9:55
How do we make sure we're getting the right answers when we're using Athena?

Kelly 9:59
Right. It's not just about getting answers, it's about getting the right answers. Exactly.

Chris 10:02
So how do we do that? Well,

Kelly 10:03
it involves a few things, using the right tools, having good processes, and a bit

Chris 10:09
of data hygiene. Okay, so what are some practical steps we can take? First and

Kelly 10:13
foremost, you need to validate your data, so

Chris 10:17
don't just trust that the data in S3 is accurate

Kelly 10:19
Exactly. Don't just assume it's good. It's a good idea to use Athena to run queries that check for data anomalies. Like what kind of anomalies? Things like null values, unexpected data types, or values that are just way outside of the normal range.
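A single aggregate query can cover several of those checks at once. A sketch against a hypothetical sales table (`count_if` is a built-in Athena aggregate, and the thresholds here are made up for illustration):

```sql
-- Hypothetical table and columns; each output column flags one anomaly type.
SELECT COUNT(*)                   AS total_rows,
       COUNT(*) - COUNT(order_id) AS null_order_ids,
       COUNT_IF(amount < 0)       AS negative_amounts,
       COUNT_IF(amount > 1000000) AS suspiciously_large_amounts
FROM sales_csv;
```

Any nonzero value in the anomaly columns is a cue to dig into the underlying rows before trusting downstream analysis.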

Chris 10:33
So we're using Athena to check the data that we're going to be using with Athena, right? It's like a self check. So we're using Athena to kind of audit its own food supply. Yeah,

Kelly 10:41
exactly. That's pretty clever. And you can even automate these checks. Okay, how would you do that? You can build them right into your data pipelines, or you can schedule them to run regularly. So we're constantly monitoring the data? Exactly. That way, you're not just reacting to problems after they happen. Being proactive

Chris 10:57
instead of reactive, right? Exactly. I like that. So are there any other techniques we can use to make sure our data is accurate? Well,

Kelly 11:04
there's also data profiling, okay, what's that? It's all about analyzing the characteristics of your data,

Chris 11:09
okay, so we're looking at the structure, the content and the patterns in the data exactly,

Kelly 11:13
and you can use tools to generate statistics like frequency distributions, okay, data ranges or even identify patterns and anomalies, so

Chris 11:22
we can see if there are any inconsistencies or outliers, right?

Kelly 11:25
And that can help you spot potential data quality issues.
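In Athena itself, a quick profile of a numeric column can also be a single query. A sketch over the same kind of hypothetical sales table (`approx_percentile` is a built-in Athena function):

```sql
-- Hypothetical table; summarizes the shape of the amount column.
SELECT COUNT(*)                       AS row_count,
       COUNT(DISTINCT order_id)       AS distinct_orders,
       MIN(amount)                    AS min_amount,
       MAX(amount)                    AS max_amount,
       AVG(amount)                    AS avg_amount,
       APPROX_PERCENTILE(amount, 0.5) AS median_amount
FROM sales_csv;
```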

Chris 11:29
it's like getting a health report for your data. Yeah, that's

Kelly 11:31
a great way to put it. It tells you where the problems are exactly and once you know what the common errors or inconsistencies are, you can fix them. You can create data cleaning rules to improve the quality of your data so

Chris 11:44
it's a continuous process of improvement. It is. Data quality isn't just about the tools and tech, though, is it? Right, it's about people too. So how do we get everyone on board with data quality? It's all about communication

Kelly 11:55
and making sure everyone understands what the data means, how it should be collected, and what the valid values are. Exactly. You need to have clear data definitions and standards across the team,

Chris 12:05
so we're all speaking the same language when it comes to data, right? And data governance is important too,

Kelly 12:10
right? Oh, absolutely. You need to have processes for data ownership, validation, and change management. So who's responsible for the data? Yep. And how are changes tracked and approved? What

Chris 12:21
happens if there's a data quality issue? Exactly. It's about making sure everyone is accountable for the data. Data

Kelly 12:27
is valuable. It is we need to treat it with respect. Well, I

Chris 12:30
think we've covered a lot today, yeah, from the basics of Amazon Athena to some really practical tips for data quality and exam prep. It's been a great deep dive into Athena, it has. And for our listeners who are studying for those cloud certifications, right, hopefully you feel a lot more confident about tackling those Athena questions. Absolutely,

Kelly 12:50
practice makes perfect. It does. So

Chris 12:52
keep experimenting with Athena, see what it can do and keep pushing the boundaries of serverless querying. I love that. Keep querying, keep learning, yeah, and keep exploring the world of data.

Kelly 13:04
It's a fascinating world. It

Chris 13:05
is. Thanks for joining us on this deep dive into Amazon Athena. It's

Kelly 13:08
been a pleasure. Until next time, happy

Chris 13:10
querying, everyone!

Kelly 13:11
Happy querying.
