Amazon Athena is essential for businesses handling extensive datasets on Amazon S3. It scales automatically and can process petabytes of data, enabling efficient, dependable analysis. With 5,303 customers leveraging Amazon Athena, a market share of 19.92%, and a global rank of 2 in its category, its significance in the industry is evident.[1]
Optimizing Athena costs is pivotal for sustaining profitability and operational efficiency. In this article, we will delve into the Amazon Athena pricing structure and explore strategies to reduce Athena costs effectively.
Prominent companies with over 10,000 employees that use Amazon Athena for their data analytics needs include Tech Mahindra, Mphasis, PricewaterhouseCoopers, Cardinal Health, and PostNL.
Amazon Athena empowers organizations across many domains to extract valuable insights from data stored in Amazon S3. It facilitates querying and analysis of big data, supports machine learning workflows, aids AI initiatives, and enhances data analytics with ad hoc querying and exploratory analysis. In software development, Athena aids in log analysis and performance monitoring. It also supports complex SQL queries for diverse analytical tasks and fosters innovation and competitiveness in digital transformation.[1]
Amazon Athena's pricing for data analysis has a few components: SQL queries are billed by the amount of data scanned, with an option for dedicated (provisioned) capacity. Apache Spark execution is billed separately, and standard Amazon S3 storage and request rates, as well as federated query charges, apply in addition.[2]
Consider a scenario where you have a table stored on Amazon S3 as an uncompressed text file, with a total size of 3 TB. Let's break down the costs and optimizations involved in running a query to retrieve data from a single column of this table:
a) Query Cost without Compression: Text files are row-oriented, so retrieving even a single column forces Athena to scan the entire 3 TB file. At $5 per TB scanned, the query costs 3 TB × $5/TB = $15.
b) Columnar Format Optimization: Converting the table to a columnar format such as Parquet lets Athena read only the column the query needs. Assuming a 3:1 compression ratio (shrinking the data to 1 TB) and four equally sized columns, the query scans roughly 0.25 TB and costs about 0.25 TB × $5/TB = $1.25.
This breakdown demonstrates the significant cost savings achieved through columnar format optimization: in this scenario, over 90% off the per-query cost.
Given below are some strategies that can help you optimize performance and save costs without compromising efficiency.[3]
Partitioning your data in Amazon Athena can significantly reduce costs and improve query performance. By dividing your table into parts based on column values like date, country, or region, you can limit the amount of data scanned for each query. This reduces both query runtime and cost, as Athena charges based on the amount of data scanned from Amazon S3.

Partitioning acts like adding virtual columns to your table. You define these partitions when you create the table. For example, if you partition a dataset by year, you only scan data from specific years when querying, which reduces the data volume and costs.
Here's an example of how a dataset might be partitioned by year in an S3 bucket:
$ aws s3 ls s3://athena-examples/flight/parquet/
PRE year=1987/
PRE year=1988/
PRE year=1989/
PRE year=1990/
PRE year=1991/
PRE year=1992/
PRE year=1993/
a) Creating a Partitioned Table:
When creating a table for this dataset, you include a PARTITIONED BY clause:
CREATE EXTERNAL TABLE flights (
  dest STRING,
  origin STRING,
  ...
)
PARTITIONED BY (year STRING)
STORED AS PARQUET
LOCATION 's3://athena-examples/flight/parquet/';
After the table is created, you need to add each partition using an AWS Glue crawler, the ALTER TABLE ADD PARTITION command, or by running MSCK REPAIR TABLE.
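For example, you could register the year=1991 partition from the listing above explicitly, or let Athena discover every Hive-style year=... prefix in one pass; a minimal sketch using the example table:

ALTER TABLE flights ADD PARTITION (year = '1991')
LOCATION 's3://athena-examples/flight/parquet/year=1991/';

-- Or scan the table's LOCATION and add all missing partitions at once
MSCK REPAIR TABLE flights;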
b) Querying Partitioned Data:
When you query a partitioned table, specify the partition key in the WHERE clause to limit the data scanned.
SELECT dest, origin FROM flights WHERE year = '1991';
This query only scans data from s3://athena-examples/flight/parquet/year=1991/, thus reducing costs.
AWS's benchmarks on a 74 GB dataset partitioned by the l_shipdate column show that queries against the partitioned table scan far less data, and therefore run faster and cost less, than the same queries against a non-partitioned copy.[3]
To reduce Athena costs, compress your data to decrease its size, which reduces the amount of data scanned from Amazon S3 and lowers query costs. Athena supports various compression formats such as gzip, Snappy, and zstd. Note that many common compression formats are not splittable, so only one processing node can read a single compressed file; to keep queries parallel, balance the number and size of your compressed files. Columnar formats like Parquet and ORC are splittable and are recommended for better performance and lower Athena costs.
Example Code Snippet:
The code creates an external table for uncompressed CSV data in S3 and another for compressed Parquet data with Snappy compression. It then copies data from the uncompressed table to the compressed table, reducing future Athena query costs by minimizing data scanned.
-- Step 1: Create an external table over the existing uncompressed CSV files
CREATE EXTERNAL TABLE IF NOT EXISTS your_database.uncompressed_table (
  column1 STRING,
  column2 INT,
  column3 DOUBLE
  -- Define other columns as necessary
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://your-bucket/path/to/csv-files/';

-- Step 2: Create an external table for compressed data in Parquet format
CREATE EXTERNAL TABLE IF NOT EXISTS your_database.compressed_table (
  column1 STRING,
  column2 INT,
  column3 DOUBLE
  -- Define other columns as necessary
)
STORED AS PARQUET
LOCATION 's3://your-bucket/path/to/parquet-files/'
TBLPROPERTIES ('parquet.compression'='SNAPPY');

-- Step 3: Copy data from the uncompressed table into the compressed table
INSERT INTO your_database.compressed_table
SELECT * FROM your_database.uncompressed_table;
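Alternatively, a single CTAS (CREATE TABLE AS SELECT) statement can create and populate the compressed table in one step. A minimal sketch reusing the hypothetical names above; note that the external_location must point to an empty S3 prefix:

CREATE TABLE your_database.compressed_table_ctas
WITH (
  format = 'PARQUET',
  write_compression = 'SNAPPY',
  external_location = 's3://your-bucket/path/to/parquet-ctas/'
) AS
SELECT * FROM your_database.uncompressed_table;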
Bucketing your data in Amazon Athena helps reduce costs by minimizing the amount of data that a query reads. Bucketing distributes records into separate files based on the value of a specific column, ensuring that all records with the same value are stored together. This is particularly useful for columns with high cardinality, like user IDs or device IDs, and for queries that frequently search for specific values in these columns.

When you bucket data, you organize it into a set number of buckets based on the values in one column. This organization allows Athena to scan only the relevant bucket instead of the entire dataset when executing queries, reducing the data scanned and, therefore, the cost.
Consider a 2.29 GB customer table bucketed into 32 buckets on the c_custkey column. A query that looks up a single customer key against the unbucketed table scans the full 2.29 GB and costs about $0.01145; against the bucketed table it reads only the relevant bucket, costing about $0.0003645. That is a 97% reduction in cost, with correspondingly less data scanned.
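In Athena, bucketing is specified when the table is created, typically via a CTAS statement. A minimal sketch, assuming the source table is named customer and using a placeholder output location:

CREATE TABLE customer_bucketed
WITH (
  format = 'PARQUET',
  external_location = 's3://your-bucket/customer-bucketed/',
  bucketed_by = ARRAY['c_custkey'],
  bucket_count = 32
) AS
SELECT * FROM customer;

-- A point lookup now reads only the bucket that can contain this (hypothetical) key
SELECT * FROM customer_bucketed WHERE c_custkey = 12677856;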
Optimizing partition processing using partition projection in Amazon Athena can significantly reduce costs by speeding up query execution for highly partitioned tables. Instead of retrieving partition metadata from AWS Glue or another metastore, partition projection calculates partition values and locations in memory, which is faster and reduces query runtime. This approach eliminates the need for manual partition metadata updates, simplifies partition management, and is particularly beneficial for tables with high-cardinality columns or finely granular date ranges. By enabling partition projection, queries become more efficient and cost-effective, ultimately lowering Athena processing costs.
To enable partition projection, add the projection properties to the table's TBLPROPERTIES when you create it. Here's an example of how to set it up (the S3 location is a placeholder):
CREATE EXTERNAL TABLE my_table (
  id STRING,
  value STRING
)
PARTITIONED BY (year STRING, month STRING)
STORED AS PARQUET
LOCATION 's3://my-bucket/my-table/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2010,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12'
);
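With projection enabled, Athena computes the matching partitions in memory at query time, so a query like the following needs no partition metadata in the Glue catalog:

SELECT id, value
FROM my_table
WHERE year = '2024' AND month = '6';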
Query Result Reuse in Amazon Athena helps reduce costs by avoiding unnecessary data scans and query reruns. By reusing the results of previous queries when the source data hasn't changed, Query Result Reuse minimizes data scan costs and query latency. This feature is particularly useful for frequently run queries on datasets that update infrequently, such as daily or less often. It enables users to specify a maximum age for result reuse, ensuring efficient query execution and cost savings. With Query Result Reuse, Athena automatically checks for existing results matching the query criteria, transparently rerunning the query only when necessary. This capability benefits not only individual users but also entire workgroups, improving productivity and lowering overall costs for shared datasets and dashboards.[4]
Here's an example of how to enable Query Result Reuse when running a query with the AWS SDK for Python (Boto3):
import boto3

client = boto3.client('athena')

# Ask Athena to reuse cached results up to 60 minutes old instead of
# rescanning S3. Query Result Reuse requires Athena engine version 3.
response = client.start_query_execution(
    WorkGroup='my_work_group',
    QueryString='SELECT * FROM my_table LIMIT 10',
    ResultReuseConfiguration={
        'ResultReuseByAgeConfiguration': {
            'Enabled': True,
            'MaxAgeInMinutes': 60
        }
    }
)
Optimizing query performance in Athena directly correlates with reducing costs. When queries run more efficiently, they consume fewer computational resources, resulting in lower costs incurred by the user. Here's how each of the mentioned best practices contributes to cost reduction:
a) Optimize ORDER BY: Athena's engine performs the final sort on a single worker, so ordering large result sets is expensive. Pair ORDER BY with LIMIT wherever possible to reduce the data that must be sorted and transmitted.
b) Optimize joins: Specify the larger table on the left side of the join and the smaller table on the right; the engine distributes the smaller table to the workers, reducing memory pressure and execution time.
c) Optimize GROUP BY: When grouping by multiple columns, list them from highest to lowest cardinality and aggregate only the columns you need, so less data is shuffled during aggregation.
d) Use approximate functions: Functions like approx_distinct() trade a small, bounded error (roughly 2.3% standard error) for much faster execution than exact counterparts such as COUNT(DISTINCT ...), delivering quicker results at lower cost.
e) Only include the columns that you need: Avoid SELECT *; selecting only the necessary columns reduces the volume of data processed, and with columnar formats the data scanned, leading to faster queries and lower costs.
Here's an example SQL query that brings these practices together (assuming sales_transactions is the larger table, placed on the left of the join):
SELECT product_category,
       SUM(sales_amount) AS total_sales,
       approx_distinct(sales_transactions.customer_id) AS unique_customers
FROM sales_transactions
INNER JOIN customers ON sales_transactions.customer_id = customers.customer_id
WHERE transaction_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY product_category
ORDER BY total_sales DESC
LIMIT 10;
The query filters transactions by date, joins the larger transactions table to the smaller customers table, groups by product category, computes total sales plus an approximate distinct customer count, sorts by total sales, and limits output to the top 10 categories so the ORDER BY works with a LIMIT.
In summary, Amazon Athena is a vital tool for businesses managing large datasets on Amazon S3, offering scalability and efficient data analysis. Optimizing Athena costs is crucial for profitability and efficiency. Strategies like data partitioning, compression, bucketing, partition projection, query tuning, and Query Result Reuse can effectively reduce costs and improve query performance, enhancing productivity and decision-making processes.
References
1. https://6sense.com/tech/interactive-content-marketing/amazon-athena-market-share#free-plan-signup
2. https://aws.amazon.com/athena/pricing/
3. https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
FAQs
1. How much does Athena cost per terabyte?
Athena costs $5 per terabyte scanned. This pricing model emphasizes efficient data scanning and querying practices to manage costs effectively.
2. What is the optimal file size for Athena?
The optimal file size for Athena is generally between 1 and 2 GB. This size strikes a balance between query performance and cost efficiency.
3. Is Amazon Athena fast?
Amazon Athena is designed for fast query execution, typically delivering results within seconds.
4. What is AWS Athena good for?
AWS Athena is particularly useful for querying large datasets stored in Amazon S3 using standard SQL. It's ideal for ad-hoc queries, log analysis, and extracting insights from structured and unstructured data.
5. How to save on Athena costs?
To optimize Athena costs, consider practices such as data partitioning, using compression formats like Parquet or ORC, defining optimal file sizes, and leveraging result set caching to minimize redundant queries. These strategies help reduce the amount of data scanned and improve query efficiency, thereby lowering overall costs.