Ep. 80 | Amazon EMR Overview & Exam Prep | Analytics | SAA-C03 | AWS Solutions Architect Associate

Chris 0:00
All right, cloud engineers, get ready, because today we are going deep on AWS EMR. Yeah, Elastic MapReduce, if you're not familiar. You probably already use a ton of AWS services. You do. EMR is one you're going to want to know really well. Oh, yeah, especially if you're thinking about the AWS Solutions Architect Associate exam. For sure, that's a big one. Yeah, think of it kind of like your weapon, yeah, your secret weapon for all that data

Kelly 0:28
you've got to deal with. That's right. Massive data sets, huge data.

Chris 0:32
We're talking terabytes, petabytes, stuff that spreadsheets just can't handle, right? So EMR lets you just spin up these clusters on AWS, okay? And they're loaded with frameworks like Spark, Hadoop, Presto, all those big data names. Yeah, and it's all managed for you.

Kelly 0:49
Okay, so fewer infrastructure headaches. That's the idea. More time to focus on the data, to be a data wizard.

Unknown Speaker 0:55
there you go.

Kelly 0:56
So why is EMR so important? Yeah, why is it such a big deal, yeah,

Chris 1:01
why should our listeners care about it? For cloud engineers especially, yeah, building and managing applications in the cloud. Okay, So

Kelly 1:07
picture this, right? You have this web application with millions of users generating gigabytes of log data every hour. Every hour. That's a lot of data. It is.

Chris 1:19
So EMR can go through all that, okay, identify patterns, pinpoint anomalies, and help you optimize your app. Wow. Or let's say you're building a machine learning model to predict customer churn. Okay, churn. EMR can work through those massive data sets to train the model without breaking a sweat.

Kelly 1:42
Okay, so this isn't just like, this isn't just like some little niche service tucked away, not at all like, Oh, this is just for data scientists. No, it's for anyone, like, dealing with big data. This is, this is a core tool for any cloud engineer, if

Chris 1:55
you need to extract insights from all that data, from

Kelly 1:59
everything. EMR is your tool, yeah. Okay. And you know what else? It's not just about the processing power, it's about how well it fits in with all the other AWS services. Okay, so think of S3 as your data lake, right? Yeah, storing all that raw data. It's a lake of data. It is. EMR can access and process that data directly, straight from the lake. No need to move it around. Nice. It's beautiful. And then you've got IAM for controlling who can access what, access control, yeah, KMS for encryption, okay, CloudWatch for monitoring. So it all works seamlessly,

Chris 2:35
seamlessly. I like it. It's beautiful. Okay, so EMR is not an island, right? It's a powerful tool, it is, that fits right into the ecosystem cloud engineers work in. Exactly. Okay, I'm starting to see the big picture here. Good. Let's talk about some of the features that make EMR so powerful. Yeah, what are the key features? Well, we

Kelly 2:59
touched on it: managed clusters. Okay, huge win. Imagine setting up a Hadoop cluster yourself. Oh, God. All the configuration, the networking, the storage. Yeah, no, thanks. EMR takes care of all of that. You just specify the size, the types of instances you need, and the software you want, and EMR does it. Wow. You can even use Spot Instances to save some money. Oh, yeah.
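A minimal sketch of what "specify the size, the instance types, and the software" could look like as a cluster request. The dictionary follows the shape of the parameters accepted by the boto3 EMR `run_job_flow` call, but the helper `build_cluster_request`, the cluster name, the release label, and the instance types are illustrative assumptions rather than anything stated in the episode. Nothing is sent to AWS here.

```python
def build_cluster_request(name, core_count, task_count, instance_type="m5.xlarge"):
    """Build a dict shaped like boto3's EMR run_job_flow parameters.

    Hypothetical helper: every name and size below is illustrative.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release label; use a current one
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "primary", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
                 "InstanceType": instance_type, "InstanceCount": core_count},
                # Task nodes do not store HDFS data, so they are a common place for Spot.
                {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
                 "InstanceType": instance_type, "InstanceCount": task_count},
            ],
            # Terminate the cluster once its steps finish (short-lived job pattern).
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request("log-analysis", core_count=2, task_count=4)
spot_groups = [g["Name"] for g in request["Instances"]["InstanceGroups"]
               if g["Market"] == "SPOT"]
print(spot_groups)
```

With credentials configured, a request like this could be passed along as `boto3.client("emr").run_job_flow(**request)`.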

Chris 3:22
Spot Instances are great. They are, especially for those short-lived jobs. Yeah. So we've got ease of use covered. We do. What about security, though?

Kelly 3:33
Security? Yes, we're dealing

Chris 3:34
with sensitive data, right? How does EMR keep everything safe? Security

Kelly 3:39
is built in, okay, from the beginning, from the ground up. Okay, so remember IAM? Yeah, it's not just for S3 buckets. You use it to manage permissions within your EMR environment too. Okay, so if you want to restrict who can submit jobs, access certain data, or modify cluster configurations, IAM is what you use.

Chris 4:03
All right, so it's like the security guard of EMR. It is. It is a security guard. Okay, and then on top of that,

Kelly 4:11
we got KMS for encryption, right? So your data stays protected, even

Chris 4:15
at rest, even at rest. Okay, so you're covered. We've got the ease of use, yeah, we've got the security. The technology is there too. What else is there to love about EMR? There's got to be a catch, right? Well, everything has its limitations. Every technology

Kelly 4:30
has its limitations, that's true. So EMR is great for batch processing, yeah, large scale data analysis, right? But it's not really for real time, low latency applications. Okay, so then, so if you need to process streaming data with like, millisecond response times, yeah, you might want to look at something like Kinesis. Okay,

Chris 4:50
so know your use case, exactly, and pick the right tool for the job. Right tool for the job. Speaking of knowing your use case, yes, let's talk about the AWS Solutions Architect Associate exam. Okay, what are they really going to ask you on this exam?

Kelly 5:06
Okay, so the exam is about applying your knowledge, yeah, not just spitting back facts, right? They want to see that you can troubleshoot, okay, optimize,

Chris 5:15
and secure EMR in real-world scenarios. Exactly. Okay, so let's put on our exam hats here. All right, let's do it. It's a question,

Kelly 5:24
yeah, give me a question they might ask us. What might they throw at us? Yeah, what's the first curveball? Okay, imagine this, okay. You're processing sensitive financial data. Oh, on EMR, okay. The first question that should pop into your head is, how do I protect this security? Security, big one. What's the game plan?

Chris 5:45
What do we do? All right, so think back to those layers of security we talked about. Okay, the first one: lock down access with IAM, right?

Kelly 5:53
IAM, our security guard. Security guard, exactly. Make sure only the right people, only authorized users and applications, can get in. What's next? All right, next, encryption. We have to encrypt that data, okay, both when it's moving and when it's at rest. In transit and at rest, that's right. Remember KMS? Yeah, we want to use KMS to encrypt the data in S3, okay, and then configure EMR to use those KMS keys, right? So the

Chris 6:24
data is automatically encrypted when it's

Kelly 6:26
written to S3, and decrypted when it's read. Yeah.
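As a rough illustration of "configure EMR to use those KMS keys", the sketch below builds a security configuration document that turns on at-rest encryption for S3 data with SSE-KMS. The JSON layout follows the EMR security configuration format as I understand it; the helper name and the key ARN are placeholders, and in-transit encryption is left off because it needs certificate settings that are not shown here.

```python
import json


def build_security_configuration(kms_key_arn):
    """Return an EMR-style security configuration enabling SSE-KMS for S3 data.

    Hypothetical helper; kms_key_arn is whatever KMS key you control.
    """
    return {
        "EncryptionConfiguration": {
            "EnableAtRestEncryption": True,
            # In-transit encryption would also need certificate settings
            # (an InTransitEncryptionConfiguration block), omitted in this sketch.
            "EnableInTransitEncryption": False,
            "AtRestEncryptionConfiguration": {
                "S3EncryptionConfiguration": {
                    "EncryptionMode": "SSE-KMS",
                    "AwsKmsKey": kms_key_arn,
                }
            },
        }
    }


# Placeholder ARN for illustration only.
example_arn = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"
config = build_security_configuration(example_arn)
config_json = json.dumps(config, indent=2)
print(config_json)
```

The resulting JSON string is the kind of document that boto3's `create_security_configuration(Name=..., SecurityConfiguration=config_json)` accepts; the cluster then references that configuration by name.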

Chris 6:31
Got it. So we have IAM controlling access, right, and KMS encrypting the data, locking it down. Anything else we've got to worry about when it comes to financial data?

Kelly 6:43
With financial data, you've got to think about compliance. Oh, yeah, of course. Regulation. So with PCI DSS or other relevant regulations, you might need to implement additional security measures or auditing capabilities to make sure you're compliant.

Chris 6:59
So it's not just the technical side, yeah, it's also the compliance. It's the big picture, the big picture, the landscape. All right, that's a pretty solid answer to that first question.

Kelly 7:09
I think so anything else? What else might they ask? They

Chris 7:11
might ask us, yeah, what's another curveball?

Kelly 7:14
Okay, let's say a company needs to analyze data from a bunch of different sources. Okay, different sources, yeah. So they've got data in S3, okay, some in DynamoDB, okay, and even some in an on-premises database. Oh, wow. The question is, how do you bring all that together and

Chris 7:30
analyze it with EMR, using EMR? Yes. Okay, so that's a Data Integration challenge. It is, where do we even begin with that?

Kelly 7:37
Okay, well, this is where EMR's flexibility is really useful. Okay, I like flexibility. For S3, it's easy. EMR can access and process data in S3 directly, right from the data lake. No problem. Okay, but for DynamoDB, yeah, we need another tool. Okay, we need

Chris 7:55
Hive. Hive, okay, remind me what Hive is all about. Okay,

Kelly 7:59
so Hive is like a SQL interface, cool, okay, that sits on top of Hadoop. It lets you query and process data in lots of different formats, okay, including DynamoDB tables. So

Chris 8:10
for our DynamoDB data, yes, we would use Hive to query it. You got it. And then feed that into EMR.
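To make the Hive-over-DynamoDB idea a little more concrete, here is a small sketch that renders the kind of HiveQL statement used on EMR to expose a DynamoDB table as an external Hive table. The storage handler class and the two table properties are the ones used by the EMR DynamoDB connector; the helper function, table names, and columns are made up for illustration.

```python
def dynamodb_external_table_ddl(hive_table, dynamodb_table, columns):
    """Render HiveQL that maps a DynamoDB table to an external Hive table.

    columns: list of (hive_column, hive_type, dynamodb_attribute) tuples.
    Hypothetical helper; it only builds a string and runs nothing.
    """
    column_defs = ", ".join(f"{name} {hive_type}" for name, hive_type, _ in columns)
    mapping = ",".join(f"{name}:{attr}" for name, _, attr in columns)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({column_defs})\n"
        "STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'\n"
        "TBLPROPERTIES (\n"
        f'  "dynamodb.table.name" = "{dynamodb_table}",\n'
        f'  "dynamodb.column.mapping" = "{mapping}"\n'
        ");"
    )


# Illustrative table and attribute names.
ddl = dynamodb_external_table_ddl(
    "orders_ddb",
    "Orders",
    [("order_id", "string", "OrderId"), ("total", "double", "Total")],
)
print(ddl)
```

Once a table like this exists in Hive on the cluster, ordinary SQL queries against it read from DynamoDB, and the results can be joined with data that lives in S3.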

Kelly 8:18
Feed it right in, okay, for processing. What

Chris 8:20
about the on premises database, though?

Kelly 8:24
Yes, the tricky one. How do we get that data into the mix? So you have options here. Okay, I like options. And the best one depends on the specifics, yeah, on a few factors. Okay. So first, you could use AWS DataSync. DataSync. It's a service designed for replicating data from on premises to AWS, including S3. So we

Chris 8:44
use DataSync. You could, to move that data from the on-prem database to S3. Yep, and then EMR can grab it. Easy peasy. There you go. Okay, but what if you need more real-time access to that on-premises data?

Kelly 8:59
Okay, so in that case, you might want a secure connection, okay, between your on prem network, yeah, and your AWS VPC, your VPC, yeah. So you could use AWS Direct Connect or a VPN, okay, so then we'd have a direct secure pipeline. You got it to the database, to the on prem database, and then EMR can just query it. EMR can query it directly without moving the data to S3 first. Very nice. There you go.

Chris 9:22
Okay, so lots of different options for different situations. And remember,

Kelly 9:26
the exam is about showing that you understand,

Chris 9:31
right, the trade-offs involved, because there's no one perfect solution. Exactly. Okay, so we tackled security. We did. We tackled data integration. We did. What else might come up on the exam? What else might they ask us? Yeah, what other EMR topics, really? Oh, I

Kelly 9:46
know. Let's talk about performance optimization.

Chris 9:48
Performance. Yeah, love it. Who doesn't love performance? Let's say a company's

Kelly 9:53
using EMR for some really heavy data processing, okay, and they're seeing performance bottlenecks. Oh. Oh, so the question is, how do we design the cluster to get rid of those bottlenecks for optimal performance? Okay,

Chris 10:06
so we're talking about squeezing every ounce of performance, every ounce out of this EMR cluster we are. Where do we start? Okay, first, yeah,

Kelly 10:15
you've got to choose the right instance types for your nodes, right? Because EMR offers a variety of different instance types, all sorts of instances optimized for different things. Okay, some are great for memory-intensive tasks, some for compute-intensive ones. You need to analyze the workload, right, and pick the right instance types. So

Chris 10:35
no one-size-fits-all approach. Exactly. You've got to tailor it. You have to tailor it. What else should we think about?

Kelly 10:40
Okay, next, think about how you're going to partition your data. Partition the data, yeah. Instead of having one node trying to process a massive data set, you can split the data up and distribute it

Chris 10:54
across multiple nodes. Yeah. So we're talking about parallel processing here. We are. Dividing and conquering. Dividing

Kelly 11:00
and conquering the data to speed things up. I like it. It's a good strategy.

Chris 11:05
So this is where choosing the right data format comes into play. Ah, yes, the data format, right? Because it has a big impact. Some formats are better for certain things than others. Exactly.

Kelly 11:16
So columnar formats like Parquet can really improve query performance, okay, compared to traditional row-based formats.
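A small sketch of the partitioning idea: grouping log records under Hive-style `year=/month=/day=` prefixes, so that separate workers can each take one partition and a query for a single day only has to touch that day's files. The record layout and helper names are assumptions for illustration, and writing the actual Parquet files is not shown; on a real cluster this is typically done with something like Spark's `df.write.partitionBy(...).parquet(...)`.

```python
from collections import defaultdict
from datetime import datetime


def partition_prefix(timestamp):
    """Map an ISO-8601 timestamp to a Hive-style partition prefix."""
    dt = datetime.fromisoformat(timestamp)
    return f"year={dt.year:04d}/month={dt.month:02d}/day={dt.day:02d}"


def group_by_partition(records):
    """Group records by the partition prefix derived from their timestamp."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_prefix(record["timestamp"])].append(record)
    return dict(partitions)


# Made-up sample records.
records = [
    {"timestamp": "2024-01-15T10:00:00", "status": 200},
    {"timestamp": "2024-01-15T11:30:00", "status": 500},
    {"timestamp": "2024-01-16T09:15:00", "status": 200},
]
partitions = group_by_partition(records)
for prefix, rows in sorted(partitions.items()):
    print(prefix, len(rows))
```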

Chris 11:25
So we've got instance types. We do. Data partitioning, yes. Data formats, uh huh. What else? What about storage? Oh, storage optimization,

Kelly 11:33
yes. Remember, EMR uses S3. It does. And S3 has all those different storage classes for different needs. So if you access data frequently, okay, you might choose S3 Standard. But if you don't access it very often, yeah, you could use S3 Infrequent Access or Glacier to save money.
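The storage class choice can be sketched as a simple rule of thumb. The return values are the S3 API storage class names for the three classes mentioned; the cutoff of four reads per month and the helper itself are assumptions for illustration, not AWS guidance.

```python
FREQUENT_READS_PER_MONTH = 4  # assumed cutoff, purely illustrative


def choose_storage_class(reads_per_month, can_wait_for_retrieval=False):
    """Pick an S3 storage class name from a rough access pattern.

    Hypothetical helper: real decisions also weigh object size,
    minimum storage durations, and retrieval fees.
    """
    if reads_per_month >= FREQUENT_READS_PER_MONTH:
        return "STANDARD"
    # Archived objects generally must be restored before a cluster can read
    # them, so only archive data nobody needs quickly.
    if reads_per_month == 0 and can_wait_for_retrieval:
        return "GLACIER"
    return "STANDARD_IA"


for reads, can_wait in [(30, False), (1, False), (0, True)]:
    print(reads, can_wait, choose_storage_class(reads, can_wait))
```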

Chris 11:55
So it's not just about the compute. It's not. It's also about the storage, how you access it, how you optimize it. Exactly. I think we've covered a lot of ground here. We have. But I know there's still more. Oh, on EMR, much more. And the exam, so much more. We're going to pick up right where we left off in the next part of this deep dive. Sounds good. Stay tuned. Okay, so

Kelly 12:17
let's keep going with these exam style questions about EMR, you know, the ones that really test you, yeah, yeah. Let's see what else they can come up with. Okay, so imagine you have to set up an EMR cluster for a company that's analyzing tons of sensor data, sensor data, okay, from IoT devices, IoT, all right, but the data needs to be processed in real time, real time, yeah, to trigger alerts and automate responses. Oh, okay, so

Chris 12:42
we're talking about real-time data streaming and low-latency requirements. So the question is, is EMR the right tool? If

Kelly 12:49
not, what are the alternatives? Yeah, we

Chris 12:51
talked about this earlier. We did. EMR is great for batch processing, right, but not really for real-time work. Not really. So if you're on the exam and you say, yeah, EMR is perfect for this, you're going to get

Kelly 13:01
some points off. Yeah, that's not gonna fly, no? Because knowing the limitations of a service, right is just as important as knowing its strengths. Yeah,

Chris 13:09
knowing when not to use something is very important. So what is the right answer here? What

Kelly 13:17
is the right answer? What

Chris 13:18
should we use for real time data processing?

Kelly 13:20
Okay, for real time, you want a service that's designed for it, okay, so something like Kinesis Data Streams, yeah, or maybe Apache Kafka on Amazon MSK. Okay, those can handle those high-volume, low-latency data streams, okay? And they integrate well with other AWS services, right?

Chris 13:41
So you can trigger alerts and automate actions. Exactly. It's all about understanding why you're choosing that service. Yeah, makes sense. Okay, what other curveballs might they throw? Let's

Kelly 13:51
shift gears, okay, to cost optimization. Cost optimization. Yeah, everyone loves to save money. Everyone. So let's say a company is running an EMR cluster for machine learning. Okay, they like the performance. Okay, good performance. But those monthly bills are a bit of a shock. Yeah,

Chris 14:09
sticker shock.

Kelly 14:10
So the question is, how

Chris 14:11
do we make it cheaper without sacrificing performance? Without sacrificing performance, right? So balancing performance and cost, the classic balancing act, yeah, trying to find that sweet spot. Sweet

Kelly 14:21
Spot. Where do we start? Okay, well, first, yeah, we need to know how the workload behaves. Okay, are those machine learning jobs running 24/7, okay, or are they only running sometimes? So,

Chris 14:34
like, Are there periods of low utilization exactly, if

Kelly 14:37
the cluster isn't always busy? Yeah, that's where we can save some money. Okay,

Chris 14:41
so we're looking for those idle times, yes, where we might have too many resources over provisioned.

Kelly 14:46
Yes. And what do we do about that? EMR has auto scaling. Oh, auto scaling. You can set it up to automatically scale down the cluster when

Chris 14:55
it's not being used, yeah, during those quiet times, and then scale it back up when it's needed, when the demand goes up. So we're not paying for resources we're not using. Exactly, that's the beauty of auto scaling. Auto scaling is a great friend. It is a good friend when it comes to cost optimization. It's always there to help. What about the instances themselves?

Kelly 15:13
The instances, yes,

Chris 15:15
are there ways to choose more cost-effective instance types? There are. Without sacrificing performance, we have our friend, the Spot Instance. Oh, Spot Instances, yeah, they can be a game changer. Yeah. Just to remind everyone, Spot Instances are basically spare EC2 capacity. They are. That AWS is selling at a discount, at a big discount, but they can be interrupted. That's the catch. If someone else needs them, if the demand goes up, right? So you can't use them for everything, right? But for some machine learning tasks, especially if they can handle interruptions, they can save a lot of money. A lot of money. So we've got auto scaling. We do. We've got Spot Instances. Yes. Anything else we can do to optimize costs? Well, always make sure you're using

Kelly 15:59
the right-size instances, right? Don't use a giant instance if you don't need it. So basically, EMR has a ton of different instance types, right?

Chris 16:08
Different CPUs, memory, storage. You've got to pick the one that fits. Not too big, not too small. Just right. Just right, the

Kelly 16:16
Goldilocks zone, the Goldilocks zone of instances.
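A back-of-the-envelope sketch of why scaling down idle capacity and moving task nodes to Spot can change the bill so much. Every number here is an assumption chosen for round arithmetic (the hourly rate, the Spot discount, the busy hours), and the per-instance EMR service charge that is billed on top of EC2 is ignored.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours


def cluster_cost(nodes, hourly_rate, hours):
    """Cost of running `nodes` identical instances for `hours` at `hourly_rate`."""
    return nodes * hourly_rate * hours


# All figures below are assumed, not real AWS prices.
on_demand_rate = 0.20   # USD per instance-hour
spot_discount = 0.70    # Spot assumed to be 70% cheaper than On-Demand
spot_rate = on_demand_rate * (1 - spot_discount)
busy_hours = 240        # hours per month the ML jobs actually run

# Baseline: 10 On-Demand nodes running around the clock.
always_on = cluster_cost(10, on_demand_rate, HOURS_PER_MONTH)

# Optimized: 2 On-Demand core nodes stay up; 8 Spot task nodes run only when busy.
optimized = (
    cluster_cost(2, on_demand_rate, HOURS_PER_MONTH)
    + cluster_cost(8, spot_rate, busy_hours)
)

savings_pct = 100 * (1 - optimized / always_on)
print(f"always-on: ${always_on:.2f}, optimized: ${optimized:.2f}, "
      f"savings: {savings_pct:.0f}%")
```

The point of the sketch is the shape of the calculation rather than the exact figures: most of the saving comes from not paying for task nodes during idle hours, with the Spot discount applied to whatever remains.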

Chris 16:20
Okay, so we talked about cost optimization. We did. What other challenges might we face on the exam? Hmm,

Kelly 16:27
let's talk about troubleshooting. Troubleshooting. Everyone's favorite.

Chris 16:31
Oh yeah. Love troubleshooting. So

Kelly 16:32
imagine EMR jobs are suddenly failing. Oh no, error messages, jobs stalling,

Chris 16:39
Losing time and money. How do we even start to figure this out? Okay,

Kelly 16:42
first things first, yeah, check

Chris 16:44
the logs. The logs, of course,

Kelly 16:46
the logs are your friends, right?

Chris 16:47
They always tell the story. Start

Kelly 16:49
with the EMR cluster logs. Okay, they often have clues about what's wrong. Okay, so

Chris 16:54
we're looking for error messages. Yeah, error messages, warnings, anything that stands out,

Kelly 16:58
anything unusual,

Chris 16:59
Okay, what else should we look at? Then you

Kelly 17:01
want to check the application logs. Okay, for the specific jobs that are failing, right?

Chris 17:05
So the EMR cluster logs are for overall health, yeah, the big picture. And then the application logs are for specific job issues, for the details, yes. Okay, what if the logs don't tell us anything? Hmm, if

Kelly 17:17
the logs aren't helpful, yeah, it's time to think about resource contention. Resource contention. So if other applications are running on the cluster, yeah, they might be using up all the resources, like

Chris 17:29
CPU, memory, disk I/O. Exactly, and that can make your EMR jobs slow down. So we need to check those utilization metrics, yes,

Kelly 17:37
see if anything is maxed out. CloudWatch can help us with that, right? CloudWatch is our friend here. CloudWatch is always our friend. Gives you all those metrics, CPU,

Chris 17:45
memory, network, disk I/O, all that good stuff, all the

Kelly 17:51
juicy details. And

Chris 17:52
if we're still stuck, if you're

Kelly 17:53
still scratching your head, we can always call AWS support. You can always phone a friend,

Chris 17:58
right? Sometimes you need a little extra help. Sometimes you do and they have access to more information. They have those deeper tools. Okay, so we've got a good troubleshooting strategy, I think so, check the logs, yes, review the configurations, investigate resource contention, right? And if all else fails,

Kelly 18:16
call AWS support. Call

Chris 18:18
AWS support. Here

Unknown Speaker 18:19
you go. Okay, I

Chris 18:21
think we've covered a lot of ground. We have. Let's shift gears in the last part of this deep dive, okay, and explore some more thought provoking questions. Sounds good? Let's wrap things up. All right, let's wrap up our EMR deep dive with a few more of those brain ticklers. You know, those questions that might pop up on the exam or, even better, in your day to day cloud work? Yeah,

Kelly 18:41
I love that practical stuff exactly.

Chris 18:44
So let's say you've got this company that wants to use EMR, okay to process a massive data set, okay, but they also need to make sure that data is protected, even if something goes wrong, like a hardware failure. Oh,

Kelly 18:56
that's a good one. This is about data durability and resilience. You know, big topics super important, especially with big data on EMR. So first thing to remember, EMR uses S3 for storage, right? And S3 is designed for like, 11 nines of durability, 11 nines. 11 nines.

Chris 19:16
That's like saying it's a lot. You're more likely to win the lottery twice than lose data in S3 pretty

Kelly 19:21
much, wow. Okay, so just by using S3 you're already in good shape. Okay, good start, but we can't stop there, right? We need to think about the data while it's moving between the EMR cluster and S3

Chris 19:31
right? Data in transit's just as vulnerable as data at rest, exactly.

Kelly 19:35
So how do we protect it during that transfer? Yeah, good question. Encryption, that's the key. Encryption, always a good answer. EMR supports encryption in transit using SSL/TLS, okay? So while the data is moving between EMR and S3, it's encrypted. It's protected.

Chris 19:51
Okay, so we've got durability in S3, yes, and encryption in transit, uh huh. What about the cluster itself, though? The cluster, yeah. What if one of those servers just decides to quit?

Kelly 20:01
Okay, so EMR clusters, they're built to be fault tolerant. Fault tolerant. Each cluster has multiple nodes, okay? And if one fails, yeah, EMR just spins up a new one. Oh, okay, it's automatic. So it just keeps running. Data Processing keeps going like nothing happened, exactly. It's like a self healing system. Very cool. Can

Chris 20:21
we do even more for data protection? Oh, yeah,

Kelly 20:24
of course. You can enable data replication in EMR. Data replication, okay. This replicates your data across multiple Availability Zones in a region. So that gives you even more redundancy. So even if a whole

Chris 20:39
availability zone goes down like a power outage.

Kelly 20:41
Yeah, a big one. Your data is still safe in another availability zone.

Chris 20:46
So we've got S3 durability, yes, encryption in transit, fault tolerant clusters, right? And data replication across availability zones, all

Kelly 20:57
working together. That's pretty impressive, pretty resilient system. It is very nice. Okay, last question,

Chris 21:02
let's say a company's running this long running EMR cluster, okay, but they notice that it's not really being used much during off peak hours. Oh, yeah, that happens a lot, yeah. How can they save some money in that scenario?

Kelly 21:15
Okay, so this is a chance to show off those cost optimization skills. All right, let's hear it. First, you'd want to talk about auto scaling. Oh, auto scaling again, our good friend, auto scaling. You can set it up to scale down the cluster when it's not busy during those off peak hours, and then scale back up when the demand increases. So

Chris 21:32
it's like a smart thermostat for your EMR cluster.
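The thermostat idea maps fairly directly onto a minimum and a maximum cluster size. The sketch below builds a policy dict in the shape that boto3's EMR `put_managed_scaling_policy` call takes for its `ManagedScalingPolicy` argument; the helper name and the specific limits are assumptions for illustration, and nothing is sent to AWS.

```python
def build_managed_scaling_policy(min_units, max_units, max_on_demand_units=None):
    """Build an EMR managed scaling policy dict with basic sanity checks.

    Hypothetical helper. Units are instance counts ("Instances" unit type).
    """
    if min_units < 1 or min_units > max_units:
        raise ValueError("need 1 <= min_units <= max_units")
    limits = {
        "UnitType": "Instances",
        "MinimumCapacityUnits": min_units,   # floor during off-peak hours
        "MaximumCapacityUnits": max_units,   # ceiling when demand spikes
    }
    if max_on_demand_units is not None:
        # Capacity above this cap is requested from the Spot market.
        limits["MaximumOnDemandCapacityUnits"] = max_on_demand_units
    return {"ComputeLimits": limits}


policy = build_managed_scaling_policy(min_units=2, max_units=10, max_on_demand_units=4)
print(policy)
```

With a running cluster, a policy like this could be attached via `boto3.client("emr").put_managed_scaling_policy(ClusterId=..., ManagedScalingPolicy=policy)`, where the cluster ID is whatever your own cluster reports.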

Kelly 21:35
I like that analogy. It adjusts based on the usage. Perfect. Okay, what else? And don't forget about Spot Instances. Spot

Chris 21:42
Instances, yeah, yeah, those could save a

Kelly 21:44
lot of money. They can if you can handle some interruption, right? So if AWS

Chris 21:49
has some spare capacity, we can grab it at a discount, exactly. Why not? And if this is a long running cluster, yeah, with predictable usage, predictable usage, you could even think about reserved instances.

Kelly 22:00
Ah, reserved instances. Yeah, get that discount for the core nodes. For the core nodes. Okay, so many options. It's

Chris 22:06
the right pricing model for you, for your needs. Exactly. Okay, I think we've really covered a lot here. We have. It's been a good deep dive. Any final words of wisdom for our listeners before they take on the AWS Solutions Architect Associate exam? Just remember,

Kelly 22:19
EMR is a powerful service. You know, don't be afraid to experiment with it. See how it can solve your problems. Yeah, play around with it. Get to know it. And when it comes to the exam, focus on understanding the core concepts, you know, how to apply them to solve problems. Don't just

Chris 22:35
memorize facts. Exactly. Well said. Thanks for joining us on this deep dive into EMR. It's been fun. We hope you learned something new, maybe even had some of those aha moments along the way. The best kind of moments. As always, keep diving deeper into the cloud, keep learning, and best of luck on your AWS journey.

Kelly 22:52
See you next time.
