Amazon EMR (previously Amazon Elastic MapReduce) is a leading managed cluster platform on AWS, simplifying the execution of big data frameworks like Apache Hadoop and Apache Spark for processing and analyzing vast datasets. With a market share of 11.93% in the big data infrastructure market, Amazon EMR competes with 12 other tools in this category. As of 2024, over 4,857 companies globally rely on Amazon EMR, with key industries including Cloud Services, Data Analytics, and Digital Transformation.[1]
In this guide, you'll explore cost-saving models and expert-driven strategies to optimize Amazon EMR expenses effectively.
The table below summarizes the deployment scenarios, their descriptions, categories, and pricing details for Amazon EMR configurations.[2]
Suppose you submit a Spark job to EMR Serverless, configured with a minimum of 25 workers and a maximum of 75 workers, each with 4 vCPUs and 30 GB of memory and no additional ephemeral storage. If the job runs for 30 minutes on 25 workers (100 vCPUs) and is automatically scaled up by 50 more workers (200 additional vCPUs) for 15 minutes, you are billed only for the vCPU-hours and GB-hours the job actually consumes.
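As a rough worked example, assuming the us-east-1 EMR Serverless rates of about $0.052624 per vCPU-hour and $0.0057785 per GB-hour (check the EMR Serverless pricing page for current figures in your Region), the charge breaks down roughly as:
vCPU-hours: (100 vCPUs × 0.5 h) + (200 vCPUs × 0.25 h) = 100 vCPU-hours
GB-hours: (750 GB × 0.5 h) + (1,500 GB × 0.25 h) = 750 GB-hours
vCPU charge: 100 × $0.052624 ≈ $5.26
Memory charge: 750 × $0.0057785 ≈ $4.33
Estimated total for the run: about $9.60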
Below are several effective strategies aimed at reducing EMR costs and optimizing performance without experiencing downtime.[4]
To reduce costs and optimize performance on Amazon EMR, enabling Amazon EMR Managed Scaling is key. This feature automatically adjusts cluster size based on workload metrics, scaling out during peak usage and scaling in during idle periods. Since mid-December 2022, these capabilities, including enhanced scaling algorithms, have been enabled by default on newer EMR releases, improving cluster utilization by up to 15% and reducing costs by up to 19%. EMR Managed Scaling requires minimal setup and manages resources automatically, making it an ideal way to optimize your EMR infrastructure without manual intervention.[3]
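If your cluster was created without managed scaling, you can attach a policy with the AWS CLI; the cluster ID and capacity limits below are placeholders to adapt to your workload:
# Keep the cluster between 2 and 10 instances, capping On-Demand capacity at 5 and core nodes at 3
aws emr put-managed-scaling-policy \
  --cluster-id j-XXXXXXXXXXXXX \
  --managed-scaling-policy ComputeLimits='{MinimumCapacityUnits=2,MaximumCapacityUnits=10,MaximumOnDemandCapacityUnits=5,MaximumCoreCapacityUnits=3,UnitType=Instances}'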
The image illustrates how EMR Managed Scaling optimizes resource utilization in EMR clusters. By dynamically adjusting cluster capacity (orange line) to match fluctuating job demand (blue line), it significantly reduces wasted resources (gray areas) and improves cost-efficiency.
Thomson Reuters reduced costs by running each Apache Spark job on an ephemeral Amazon EMR cluster that shuts down once the job completes, and by scaling core nodes dynamically to match workload needs. This approach streamlined their workflows, reduced cluster runtime by 48%, and improved resource utilization. - John Engelhart, Associate Architect, Thomson Reuters[5]
To optimize handling of Amazon S3 objects, combine many small files into fewer large ones, store data in a columnar format such as Parquet, and compress objects to reduce storage and scan costs.
Here's a simple example using AWS CLI:
# Concatenate all objects in a bucket into a single larger object
# (simple illustration; assumes object keys contain no spaces)
aws s3 ls s3://your-bucket/ --recursive |
awk '$3 > 0 {print $4}' |
xargs -I {} aws s3 cp s3://your-bucket/{} - |
aws s3 cp - s3://your-bucket/larger-files/combined-file.txt
Assuming you have data on Amazon S3 in JSON format, and you want to convert it to Parquet using Apache Spark on Amazon EMR:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
.appName("Convert JSON to Parquet")
.getOrCreate()
// Load JSON data from S3
val jsonDF = spark.read.json("s3://your-bucket/json-data/*.json")
// Write data in Parquet format back to S3
jsonDF.write.parquet("s3://your-bucket/parquet-data/")
To compress data on Amazon S3, you can use AWS CLI commands or tools like Apache Hive or Apache Pig on Amazon EMR to process and compress files:
# Concatenate all objects in a bucket and gzip the result into a single compressed object
aws s3 ls s3://your-bucket/ --recursive |
awk '$3 > 0 {print $4}' |
xargs -I {} aws s3 cp s3://your-bucket/{} - |
gzip |
aws s3 cp - s3://your-bucket/compressed-data/combined-file.gz
These practices ensure efficient data processing on Amazon S3, enhancing performance and reducing EMR costs.
To optimize Amazon EMR clusters, select EC2 instance families based on workload needs and cluster size. For balanced requirements, consider m6g.xlarge or m7g.xlarge. Compute-intensive tasks benefit from c7g instances, while memory-intensive applications like Spark perform well on r7g instances. For the master node, m7g is suitable for smaller clusters, while larger ones or critical services may require 8xlarge or higher. Additionally, leveraging Amazon EC2 Graviton3 instances starting from EMR Versions 6.4.0 and later can yield up to 40% cost savings and 25% improved performance compared to previous-generation instances.
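As a minimal sketch of launching a cluster on Graviton-based instances with the AWS CLI (the release label, key pair name, and instance count here are placeholders to adapt to your environment):
# Launch a small Spark cluster on Graviton (m7g) instances
aws emr create-cluster \
  --name "graviton-spark-cluster" \
  --release-label emr-7.0.0 \
  --applications Name=Spark \
  --instance-type m7g.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair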
Explore more on choosing Graviton instances for optimizing EMR costs and performance through AWS's blog posts on the topic; these resources will help you make informed decisions when selecting Graviton instances for your Amazon EMR deployments.
“With Amazon EMR on EKS and the ARM-based AWS Graviton 2 instances, we improved the overall performance of our big data operations by 30% and reduced cost by 20%.” - Li Rui, Vice President of Technology, Mobiuspace[6]
To optimize Amazon EMR costs and ensure high availability, it's crucial to manage subnet allocation effectively. An Amazon EMR cluster with multiple primary nodes can only reside in a single Availability Zone or subnet. Amazon EMR cannot replace a failed primary node if the subnet is fully utilized or oversubscribed during a failover. To avoid this scenario and potential downtime, dedicate an entire subnet to each Amazon EMR cluster. This approach ensures there are enough private IP addresses available and prevents resource contention.
Example: Dedicated Subnet Allocation:
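As an illustrative sketch (the VPC ID, CIDR block, and Availability Zone below are placeholders), you could reserve a /24 subnet for a single cluster, which leaves 251 usable private IPs after AWS reserves five addresses per subnet:
# Create a /24 subnet dedicated to one EMR cluster
aws ec2 create-subnet \
  --vpc-id vpc-0123456789abcdef0 \
  --cidr-block 10.0.10.0/24 \
  --availability-zone us-east-1a
# Then pass the returned subnet ID when launching the cluster, e.g.
# aws emr create-cluster --ec2-attributes SubnetId=subnet-0123456789abcdef0 (other flags as in the earlier example)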
By dedicating subnets, you reduce the risk of resource shortages and enhance the reliability and efficiency of your EMR clusters. This practice not only minimizes downtime but also optimizes costs by maintaining smooth and continuous operation.
Setting up an instance fleet for Amazon EMR starts with choosing instance types that match your workload, such as m5.xlarge for general-purpose tasks or r5.xlarge for memory-heavy jobs. You set minimum and maximum capacity so the cluster can absorb busy periods while shedding cost during quieter ones, and you can use an allocation strategy that favors the cheapest available Spot Instances, which are interruptible but heavily discounted. EMR can also add or remove instances automatically as the workload changes, which helps keep costs down. For example, a fleet might use one m5.xlarge for the primary node, at least two On-Demand r5.xlarge instances for steady core work, and up to six r5.xlarge Spot Instances for extra task capacity. Monitor performance and cost with tools such as Amazon CloudWatch and AWS Cost Explorer, adjusting the fleet as needed to keep the setup both affordable and efficient.
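A minimal sketch of that layout as an instance-fleets configuration is shown below; the instance types, target capacities, timeout, allocation strategy, and subnet ID are illustrative placeholders, and the allocation strategy shown requires an EMR release that supports it:
# fleets.json: one On-Demand primary, two On-Demand core nodes, up to six Spot task nodes
cat > fleets.json <<'EOF'
[
  {
    "Name": "Primary",
    "InstanceFleetType": "MASTER",
    "TargetOnDemandCapacity": 1,
    "InstanceTypeConfigs": [{ "InstanceType": "m5.xlarge" }]
  },
  {
    "Name": "Core",
    "InstanceFleetType": "CORE",
    "TargetOnDemandCapacity": 2,
    "InstanceTypeConfigs": [{ "InstanceType": "r5.xlarge" }]
  },
  {
    "Name": "Task",
    "InstanceFleetType": "TASK",
    "TargetSpotCapacity": 6,
    "InstanceTypeConfigs": [{ "InstanceType": "r5.xlarge" }],
    "LaunchSpecifications": {
      "SpotSpecification": {
        "TimeoutDurationMinutes": 10,
        "TimeoutAction": "SWITCH_TO_ON_DEMAND",
        "AllocationStrategy": "price-capacity-optimized"
      }
    }
  }
]
EOF

# Launch the cluster using the fleet definition above
aws emr create-cluster \
  --name "mixed-fleet-cluster" \
  --release-label emr-7.0.0 \
  --applications Name=Spark \
  --instance-fleets file://fleets.json \
  --use-default-roles \
  --ec2-attributes SubnetId=subnet-0123456789abcdef0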
Mixing On-Demand and Spot Instances in Amazon EMR environments reduces costs: On-Demand Instances handle stable, critical workloads that require continuous availability, while Spot Instances run non-critical and transient workloads at a significant discount to On-Demand rates. Automated management through EMR instance fleets or EMR Managed Scaling adjusts instance usage as Spot pricing and capacity fluctuate, keeping resource allocation efficient. Combined with careful monitoring and fault-tolerant application design, this approach maximizes cost efficiency without compromising performance or reliability.
Current configuration:
On-Demand: $0.768 per hour
Spot (assumed): $0.72 per hour
Total current hourly cost: $1.488
Optimized configuration:
On-Demand: $0.768 per hour
Adjusted Spot usage (6 instances): $0.54 per hour
Total optimized hourly cost: $1.308
Hourly saving: $1.488 - $1.308 = $0.18
Monthly savings: approximately $130.56
By optimizing their EMR cluster with a mix of On-Demand and Spot instances, Data Analytics Inc. can save about $130.56 per month, ensuring cost efficiency without compromising performance.
In conclusion, Amazon EMR stands out as a powerful tool for organizations handling large-scale data processing needs. By implementing the strategies outlined in this guide—such as leveraging EMR Managed Scaling, optimizing S3 object handling, selecting appropriate EC2 instance types like Graviton2, and ensuring efficient cluster management—businesses can not only enhance performance but also achieve substantial cost savings. These practices underscore EMR's capability to streamline operations, improve resource utilization, and support agile data analytics at scale within the AWS ecosystem.
1. Amazon EMR - Market Share, Competitor Insights in Big Data Infrastructure
2. Big Data Processing and Data Analytics – Amazon EMR Pricing
4. Cost Optimizations | AWS Open Data Analytics
5. Optimizing Fast Access to Big Data Using Amazon EMR at Thomson Reuters | Case Study | AWS
6. Mobiuspace Delivers up to 40% Improved Price-Performance Using Amazon EMR on EKS
1. How is Amazon's Elastic MapReduce different from a traditional database?
Amazon EMR differs by focusing on distributed processing of big data using frameworks like Apache Spark and Hadoop, rather than storing and querying structured data like traditional databases (e.g., MySQL, PostgreSQL). EMR is designed for parallel processing of large datasets across multiple nodes, providing scalability and fault tolerance for big data workloads.
2. What are the benefits of using AWS EMR instead of using a local cluster?
Using AWS EMR offers benefits such as elastic scalability to adjust cluster sizes based on workload demands, cost-effectiveness by leveraging pay-as-you-go pricing, and integration with AWS services for enhanced data processing capabilities and management. This contrasts with local clusters that often lack scalability, require upfront hardware investments, and are limited in cloud-native integrations and automation.
3. Is AWS EMR serverless?
Amazon EMR does offer a serverless option known as EMR Serverless. This allows you to run big data applications without having to manage clusters. With EMR Serverless, you can run Spark and Hive workloads, and the service automatically provisions and scales the necessary resources to handle your workloads, charging you only for the resources used. This eliminates the need for manual cluster setup, management, and scaling, making it a fully serverless solution for processing big data.
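As a minimal sketch of that workflow with the AWS CLI (the application name, IAM role ARN, and script location are placeholders):
# Create a Spark application on EMR Serverless (one-time setup)
aws emr-serverless create-application \
  --name my-spark-app \
  --type SPARK \
  --release-label emr-7.0.0
# Submit a job run; resources are provisioned and scaled automatically
aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn arn:aws:iam::111122223333:role/EMRServerlessJobRole \
  --job-driver '{"sparkSubmit": {"entryPoint": "s3://your-bucket/scripts/job.py"}}'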
4. What is the main advantage of using AWS EMR elastic MapReduce for big data processing?
The primary advantage of using AWS EMR is its ability to process large volumes of data efficiently and cost-effectively. It leverages cloud infrastructure for scalable compute and storage resources, integrates with various data sources and analytics tools, and supports a wide range of big data processing frameworks (e.g., Spark, Hadoop, Hive) for analytics, ETL (Extract, Transform, Load), and machine learning workloads.
5. What is a valid use case for Amazon EMR?
A valid use case for Amazon EMR is running large-scale data analytics and processing tasks, such as analyzing server logs, performing ETL (Extract, Transform, Load) operations, or training machine learning models on large datasets. EMR's scalability and integration with AWS services make it ideal for efficiently handling and analyzing massive amounts of data.
Strategic use of SCPs saves more cloud cost than one can imagine. Astuto does that for you!