Amazon Practice Questions, Discussions & Exam Topics by our Authors

A company stores a 100 MB dataset in an Amazon S3 bucket as an Apache Parquet file. A data engineer needs to profile the data before performing data preparation steps on the data. Wh...

Key requirement

You need to profile a 100 MB Parquet dataset in Amazon S3 with minimum operational overhead, before doing data preparation. Data profiling = understanding schema, nulls, distributions, stats, duplicates, etc., without building a heavy processing pipeline.

---

✅ Correct Option: A) AWS Glue DataBrew profiling job

Why this is correct

AWS Glue DataBrew is specifically designed for:

- No-code / low-code data profiling
- Directly reading from Amazon S3
- Native support for Parquet format
- Automatically generating:
  - column statistics
  - data type inference
  - null/unique counts
  - distribution insights
- Minimal setup (no clusters, no SQL engines, no ingestion pipelines)

Key reasoning factors

- Most operationally efficient → fully managed, no infrastructure
- Direct S3 integration → no movement of data required
- Built-in profiling feature → purpose-built for this requirement
- Works well for small-to-medium datasets (like 100 MB)

When to use DataBrew profiling

- Quick exploration of datasets in S3
- Data quality checks before ETL
- Non-engineering / analyst-driven profiling

---

❌ Why other options are wrong

B) Amazon Managed Service for Apache Flink

Why rejected:

- Designed for real-time streaming analytics, not batch profiling
- Requires setting up a Flink application
- Overkill for a static 100 MB Parquet file
- Dashboard is for monitoring streaming jobs, not data profilin...
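As a rough illustration of how little setup the DataBrew profiling job in option A needs, the sketch below assembles the request for the DataBrew `CreateProfileJob` API. The dataset name, job name, role ARN, and bucket are hypothetical placeholders, and the dataset is assumed to be registered in DataBrew already.

```python
def build_profile_job_request(dataset_name, job_name, role_arn, output_bucket, output_prefix):
    """Assemble keyword arguments for DataBrew's CreateProfileJob API."""
    return {
        "Name": job_name,
        "DatasetName": dataset_name,  # DataBrew dataset that points at the Parquet file in S3
        "RoleArn": role_arn,          # role DataBrew assumes to read the data and write the profile
        "OutputLocation": {"Bucket": output_bucket, "Key": output_prefix},
    }


# All values below are hypothetical placeholders.
request = build_profile_job_request(
    dataset_name="vendor-parquet-dataset",
    job_name="vendor-parquet-profile",
    role_arn="arn:aws:iam::123456789012:role/ExampleDataBrewRole",
    output_bucket="example-profile-results",
    output_prefix="profiles/",
)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("databrew").create_profile_job(**request)
```

The request is built separately from the API call so the sketch runs without AWS credentials.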

Author: Sofia · Last updated Jun 25, 2026

A company uses an Amazon Redshift cluster to manage data, including vendor sales data. The company wants to store a copy of the vendor data in an Amazon S3 bucket. A data engineer sets up an AWS Glue job to upload the data to the S3 bucket on a schedule. The data engineer set up a network conn...

The key requirement is to store a copy of data from Amazon Redshift into Amazon S3. The AWS-native and most standard pattern for exporting data from Amazon Redshift to Amazon S3 is the UNLOAD operation, which requires the Redshift cluster to have an IAM role with permissions to write to the target S3 bucket.

Why Option A is correct

- Amazon Redshift normally accesses S3 through an IAM role attached to the cluster rather than through long-lived user credentials.
- The IAM role grants permissions like `s3:PutObject`, `s3:ListBucket`, etc., enabling Redshift to write data during UNLOAD operations.
- Even though a Glue job is mentioned, the question emphasizes Redshift-to-S3 data movement, and the missing foundational requirement is proper IAM role association.
- Without this role, Redshift cannot securely export data to S3, even if networking is configured.

Why other options are incorrect

B) Add the S3 bucket to AWS Glue Data Catalog and use Redshift Spectrum

- Redshift Spectrum is used to query data in S3, not to export or copy data into S3.
- The Glue Data Catalog is a metadata layer, not a data transf...
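To make the UNLOAD pattern concrete, here is a minimal sketch that builds an UNLOAD statement referencing an IAM role. The table name, S3 prefix, and role ARN are hypothetical, and the Parquet output format is an assumption made for the example.

```python
def build_unload_statement(select_sql, s3_prefix, iam_role_arn):
    """Build a Redshift UNLOAD statement that writes query results to S3."""
    # UNLOAD takes the query as a string literal, so single quotes inside it are doubled.
    escaped = select_sql.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


# Hypothetical table, bucket, and role names.
statement = build_unload_statement(
    select_sql="SELECT * FROM vendor_sales WHERE sale_date >= '2024-01-01'",
    s3_prefix="s3://example-vendor-bucket/vendor_sales/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleUnloadRole",
)
print(statement)
```

The role named in `IAM_ROLE` must be associated with the cluster and allowed to write to the target prefix, which is the dependency the explanation above describes.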

Author: Sofia · Last updated Jun 25, 2026

A data engineer needs a fully automated solution to check for new data in multiple databases and process data that the solution finds. The solution must run every hour. The solution must be compatible with Amazon RDS, Amazon DynamoDB, and Amazon OpenSearch Service. The solution must be able to process up to 10 MB of data at one time. The solution must be...

Key requirements breakdown

We need a solution that:

- Runs every hour → scheduling needed (EventBridge / Airflow / etc.)
- Checks multiple data sources: Amazon RDS, Amazon DynamoDB, Amazon OpenSearch Service
- Processes small payloads (≤ 10 MB) → lightweight compute preferred (Lambda is sufficient)
- Has low cost + low operational overhead → avoid always-on clusters like EMR or heavy orchestration unless necessary
- Has robust error handling
- Must support event-driven, automated workflow

---

Option A: EventBridge → Step Functions → Lambda + orchestration states

Why it looks good:

- EventBridge provides hourly scheduling
- Step Functions provides strong workflow orchestration + retries + error handling per state
- Lambda is suitable for small-scale processing (10 MB is fine)
- Works with RDS, DynamoDB, OpenSearch via SDK calls

Why it is NOT optimal:

- Step Functions adds extra cost and operational complexity
- Overkill for a simple “check + process up to 10 MB” pipeline
- Using Step Functions just to orchestrate a single Lambda check + processing is unnecessary abstraction

When this is used:

- Complex multi-step workflows (ETL pipelines, branching logic, human approval steps, long-running processes)

---

Option B: EventBridge → Lambda → SQS → Lambda (processing)

Why this is the BEST fit:

- EventBridge triggers hourly → simple and serverless
- First Lambda checks multiple sources (RDS, DynamoDB, OpenSearch)
- If data exists, it pushes messages to SQS
- Second Lambda processes messages asynchronously

Key advantages:

- Highly cost-efficient (pay-per-use)
- Fully serverless (no infrastructure management)
- SQS provides:
  - Built-in retry mechanism
  - Dead-letter queue (DLQ) support → strong error handling
- Scales aut...
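The retry and DLQ behavior described for option B can be sketched from the consumer side. Below is a minimal, hypothetical processing Lambda handler for an SQS event source. It assumes the event source mapping has `ReportBatchItemFailures` enabled, so only the messages listed in `batchItemFailures` are retried and eventually moved to a dead-letter queue. `process_record` and the payload shape are illustrative, not part of the question.

```python
import json


def process_record(body):
    """Hypothetical business logic: reject payloads that lack an 'id' field."""
    if "id" not in body:
        raise ValueError("missing id")
    return body["id"]


def handler(event, context=None):
    """Process an SQS batch and report only the failed messages back to Lambda."""
    failures = []
    for record in event.get("Records", []):
        try:
            process_record(json.loads(record["body"]))
        except Exception:
            # Listing the message ID makes SQS redeliver this message only;
            # after maxReceiveCount attempts it can move to the DLQ.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


# Local usage example with a fake two-message batch.
sample_event = {
    "Records": [
        {"messageId": "m1", "body": json.dumps({"id": 42})},
        {"messageId": "m2", "body": json.dumps({"unexpected": True})},
    ]
}
result = handler(sample_event)
```

Returning per-message failures rather than raising keeps one bad record from forcing a retry of the whole batch.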

Author: ThunderBear · Last updated Jun 25, 2026

A data engineer is writing a query to join two tables in Amazon Athena. The data engineer needs to choose the correct join order for the tables to optimize...

In Amazon Athena (which runs on the Trino/Presto engine), join performance is heavily influenced by how the engine builds and probes hash tables during execution, but the query optimizer can also reorder joins when statistics are available. The question, however, is explicitly asking about choosing the correct join order while writing the query, so we focus on deterministic, developer-controlled optimization rather than relying on optimizer behavior.

---

Option A: Smaller table on the left, larger on the right ❌

This is generally not optimal for hash join performance. The engine builds its in-memory hash table from the right-side table, so placing the smaller table on the left makes the larger table the build side. The query still returns correct results, but it uses more memory and is more likely to spill or fail.

When it might be used:

- When query readability or business logic dictates left-to-right ordering
- In LEFT OUTER JOINs where left-side preservation is required

---

Option B: Larger table on the left, smaller table on the right ✅ (Correct)

This is the best manual optimization choice for Athena-style distributed query execution.

In a typical hash join:

- The right-side table is used as the build side
- The left-side table is used as the probe side

Placing the smaller table on the right allows:

- Faster hash table construction
- Lower memory consumption
- Reduced shuffle and spill risk

This is especially important in Athena where queries run in a distributed, serverless environment and memory efficiency directly ...
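The build/probe distinction can be illustrated with a toy in-memory hash join. This is a simplified single-process sketch, not Athena's distributed implementation, and the table and column names are made up. It shows why the side that gets hashed should be the smaller one: the hash table size follows the build side, while the probe side is only streamed.

```python
from collections import defaultdict


def hash_join(probe_rows, build_rows, key):
    """Join two row lists on `key`, hashing the build side and streaming the probe side."""
    # Build phase: the whole build side is held in memory.
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)

    # Probe phase: each probe row is looked up once and never stored.
    joined = []
    for row in probe_rows:
        for match in table.get(row[key], []):
            joined.append({**row, **match})
    return joined, len(table)


# Hypothetical data: "orders" is the larger table, "vendors" the smaller one.
orders = [
    {"order_id": 1, "vendor_id": 1},
    {"order_id": 2, "vendor_id": 2},
    {"order_id": 3, "vendor_id": 1},
    {"order_id": 4, "vendor_id": 3},
    {"order_id": 5, "vendor_id": 2},
]
vendors = [
    {"vendor_id": 1, "vendor_name": "Acme"},
    {"vendor_id": 2, "vendor_name": "Globex"},
]

# Larger table on the left (probe), smaller table on the right (build).
joined, build_size = hash_join(probe_rows=orders, build_rows=vendors, key="vendor_id")
```

Swapping the arguments gives the same joined rows for this inner join, but the hash table would then hold one entry per distinct key of the larger table.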

Author: Harper · Last updated Jun 25, 2026

A retail company needs to implement a solution to capture data updates from multiple Amazon Aurora MySQL databases. The company needs to make the updates available for analytics in near real time. The solution must be serverless and require ...

The correct choice is D.

Why option D is correct (key exam reasoning)

Option D uses Aurora zero-ETL integrations with Amazon Redshift Serverless, which is specifically designed for this exact requirement: near real-time analytics from Amazon Aurora MySQL into Amazon Redshift Serverless with minimal operational overhead.

Key factors:

- True serverless integration → no infrastructure to manage (no replication servers, no streaming clusters)
- Near real-time replication → changes from Aurora are automatically and continuously replicated
- No ETL pipeline to build or maintain
- No schema conversion tooling required
- Deep native AWS integration → lowest latency + lowest ops burden

This is exactly what “least operational overhead + near real-time analytics” points to in AWS exam logic.

---

Why the other options are incorrect

❌ A) AWS DMS + schema conversion + Redshift Serverless

Uses AWS Database Migration Service

Why it is wrong:

- Requires schema conversion management (typically via AWS SCT) → extra operational overhead
- DMS tasks must be manually configured and monitored
- Not truly “serverless end-to-end”
- More suitable for database migration projects, not continuous analytics pipelines

When A is used:

- One-time or phased database migrations (e.g., Oracle/MySQL → AWS)
- Heterogeneous schema conversions

--- ...

Author: Liam · Last updated Jun 25, 2026

A healthcare company stores patient records in an on-premises MySQL database. The company creates an application to access the MySQL database. The company must enforce security protocols to protect the patient records. The company currently rotates database credentials every 30 days to minimize the risk of unauthorized access. The company wants a solution that does re...

The key requirements are:

- Protect MySQL credentials for an on-premises database
- Enable automatic rotation
- Avoid application code changes for each rotation
- Minimize operational overhead

✅ Correct Answer: C) Use AWS Secrets Manager to automatically rotate credentials. Allow the application to retrieve the credentials by using API calls.

---

Why Option C is correct

AWS Secrets Manager is designed specifically for this use case.

Key factors:

- Native credential storage + encryption (no custom encryption logic needed)
- Automatic rotation support (can be configured with Lambda-based rotation for MySQL, including on-prem)
- Application decoupling from static credentials
  - The app retrieves the current valid secret at runtime via API
  - No need to modify code every time credentials rotate
- Least operational overhead
  - AWS handles rotation workflow, versioning, and secure storage

When this option is used:

- Database credentials rotation (RDS or on-prem)
- API keys, passwords, secrets
- Centralized secrets management with automated rotation

---

Why other options are incorrect

❌ A) IAM role + temporary credentials

AWS Identity and Access Management temporary credentials work well for AWS-native services (like RDS/Aurora IAM auth). But:

❌ MySQL on-...
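The "retrieve the current valid secret at runtime" point can be sketched as follows. The function calls the Secrets Manager `get_secret_value` operation on whatever client it is given. The secret name, the JSON layout of the secret, and the `FakeSecretsClient` stand-in (used so the example runs without AWS access) are all assumptions for illustration.

```python
import json


def load_db_credentials(secrets_client, secret_id):
    """Fetch the current credentials at connection time instead of hard-coding them."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


class FakeSecretsClient:
    """Offline stand-in for boto3.client('secretsmanager'); hypothetical test helper."""

    def __init__(self, secret):
        self.secret = secret

    def get_secret_value(self, SecretId):
        return {"SecretString": json.dumps(self.secret)}


# Hypothetical secret contents and name.
client = FakeSecretsClient({"username": "app_user", "password": "old-password"})
before_rotation = load_db_credentials(client, "example/onprem-mysql")

# Simulate a rotation: the stored value changes, the application code does not.
client.secret["password"] = "new-password"
after_rotation = load_db_credentials(client, "example/onprem-mysql")
```

In a real application the client would be `boto3.client("secretsmanager")`. Because the lookup happens at runtime, each rotation is picked up without a code change or redeploy.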

Author: Ahmed · Last updated Jun 25, 2026

A company has an Amazon S3 based data lake. The data lake contains datasets that belong to multiple departments. The data lake ingests millions of customer records each day. A data engineer needs to design an access and storage solution that allows departments to access only the subset of the company's dataset that each department requ...

This is a classic data lake fine-grained access control question focused on least privilege + least operational overhead across multiple departments.

Let’s evaluate each option based on key AWS design factors:

---

Option A: IAM policies and IAM roles with S3 prefixes

This approach uses direct S3 access control via IAM + bucket prefixes.

Why it is NOT optimal:

- You must manually design and maintain:
  - IAM roles per department
  - IAM policies per dataset prefix/path
- As the number of datasets and departments grows, policy management becomes:
  - Complex
  - Error-prone
  - Hard to audit
- No centralized data catalog or governance layer
- No easy way to manage table/column-level permissions

When it can be used:

- Small-scale data lakes
- Simple folder-based access control needs
- Few departments and static datasets

---

Option B: Amazon Redshift + Spectrum as entry point

This uses Redshift as a query layer over S3.

Why it is NOT optimal:

- Adds unnecessary infrastructure (Redshift cluster)
- Still requires IAM role management for S3 access
- Spectrum improves querying but:
  - Does not solve centralized access governance for multiple departments
- Operational overhead includes:
  - Cluster management (unless serverless, still not primary governance tool)
  - Schema management duplication between S3 and Redshift

When it can be used:

- Data warehouse-centric architectures
- When most users query via SQL analytics in Redshift
- When you already have Redshift as a core platform

---

Option C: AWS Lake Formation with LF-Tags (Correct Answer)

Why this is the BEST choice:

This is the native AWS solution for least-privilege data lake governance.

Key advantages:

C...

Author: Layla · Last updated Jun 25, 2026

A company needs to store and analyze a large amount of IoT sensor data. The company needs to retain the data indefinitely. The company analyzes the data in an Amazon Redshift cl...

For this scenario, the key requirements are:

- Very large volume IoT sensor data
- Indefinite retention (long-term storage)
- Cost-effective analytics using Amazon Redshift
- Likely high ingestion rate + mostly analytical querying rather than full data loading into warehouse

Key AWS design principle

For massive, long-term datasets, the most cost-effective architecture is:

> Store data cheaply in Amazon S3 + query only what you need using Amazon Redshift Spectrum (external tables)
> Avoid loading everything into Redshift storage unless necessary.

---

Option Analysis

A) S3 in JSON + auto-copy into Redshift

- Data is stored in S3 but then copied into Redshift cluster storage

❌ Expensive for large IoT datasets because:

- Redshift storage is costly for indefinite retention
- Requires resizing/maintenance of cluster
- JSON format also inefficient (larger size, slower parsing)

When this is used:

- When data is frequently queried with very low latency requirements inside Redshift
- When dataset size is moderate and needs full ingestion into warehouse

---

B) S3 in Parquet + Redshift Spectrum (CORRECT)

- Data remains in Amazon S3 (cheap, virtually unlimited storage)
- Stored in Apache Parquet (columnar, compressed, optimized for analytics)
- Queried using Redshift Spectrum (external tables)

✔ Key advantages:

- Lowest storage cost (S3 + no Redshift storage expansion)
- Parquet reduc...

Author: Kai99 · Last updated Jun 25, 2026

A retail company wants to implement real-time analytics for an ecommerce platform. The company needs to collect clickstream data from the company's website and mobile apps. The company needs to store the data for a...

Key requirement

- Collect clickstream data from website + mobile apps
- Support real-time or near-real-time analytics
- Store data for analytics
- Least ongoing maintenance

The core idea: choose a fully managed ingestion + storage pipeline, avoiding infrastructure management like EC2, brokers, or database clusters.

---

Option A: Kinesis Data Firehose → provisioned Amazon Redshift cluster

Why it’s NOT ideal (least maintenance angle):

- Amazon Data Firehose is fully managed (good)
- But Amazon Redshift provisioned cluster requires ongoing maintenance
  - node sizing and scaling
  - performance tuning (vacuum/analyze)
  - patching and cluster management
- This adds operational overhead

When it is used:

- When you need direct loading into a data warehouse (Redshift) for fast SQL analytics
- When you want near real-time ELT into Redshift specifically

---

Option B: EC2 agents + CLI batch upload to S3

Why it’s rejected:

- Requires EC2 instances to be managed
  - patching, scaling, monitoring, failures
- Batch uploads are not real-time
- CLI-based ingestion is manual and error-prone

When it is used:

- Legacy systems or custom collectors where streaming services are not available
- Simple batch ETL pipelines (low scale, non-real-time)

---

Option C: Amazon Kinesis Data Streams → Amazon Data Firehose → Amazon S3

Why this is the BEST answer:

Both serv...

Author: BlazingPhoenix22 · Last updated Jun 25, 2026

A company creates a new non-production application that runs on an Amazon EC2 instance. The application needs to communicate with an Amazon RDS database instance using Java Database Connectivity (JDBC). The EC2 instances and...

The correct solution is C) Update the database security group to allow connections from the EC2 instances.

Key reasoning factors

For Amazon RDS and Amazon EC2 communication using JDBC, the primary control mechanism is network-level access via security groups, because both services are in the same VPC/subnet.

RDS does not require IAM roles or special parameters for basic database connectivity. Instead, it relies on:

- VPC networking
- Security group inbound rules (port-based access, e.g., 3306 for MySQL, 5432 for PostgreSQL)

Since the EC2 instance must initiate a TCP connection to RDS, the RDS security group must explicitly allow inbound traffic from:

- The EC2 instance’s security group (preferred), or
- Its IP range (less secure and less flexible)

This is the standard AWS best practice for EC2 → RDS communication.

---

Why other options are incorrect

A) Modify the IAM role assigned to the database instance

- Incorrect because IAM roles do not control network connectivity between EC2 and RDS.
- IAM roles for RDS are used for features like S3 access (e.g., backups, exports), not JDBC connections.
- Even with IAM authentication enabled, security groups are still required.

---

B) Mo...
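The inbound rule described above (database security group allowing traffic from the application's security group) can be sketched as the request for the EC2 `AuthorizeSecurityGroupIngress` API. The security group IDs are hypothetical, and port 3306 assumes a MySQL engine, which the question excerpt does not state.

```python
def build_ingress_rule(db_security_group_id, app_security_group_id, port):
    """Assemble arguments that let the app security group reach the database port."""
    return {
        "GroupId": db_security_group_id,  # the rule is added to the database's security group
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": port,
                "ToPort": port,
                # Referencing the EC2 security group instead of an IP range keeps
                # the rule valid when instances are replaced or change addresses.
                "UserIdGroupPairs": [{"GroupId": app_security_group_id}],
            }
        ],
    }


# Hypothetical security group IDs; 3306 assumes MySQL.
rule = build_ingress_rule("sg-0db0000000000000a", "sg-0app000000000000b", 3306)

# With boto3 installed and credentials configured, the rule could be applied as:
#   import boto3
#   boto3.client("ec2").authorize_security_group_ingress(**rule)
```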

Author: William · Last updated Jun 25, 2026

A company uses AWS Step Functions to orchestrate a data pipeline. The company has configured the Step Functions logs to push to Amazon CloudWatch Logs when the log level is FATAL. The company has enabled logs for all AWS services in the pipeline. A state named "preprocessing" invokes an AWS Lambda function named "preprocessing." The Lambda function preprocesses data before proceeding to the next state. ...

Correct answer reasoning We are trying to locate error details when the Lambda function `preprocessing` fails during a Step Functions execution. Key facts from the question: Step Functions orchestrates the workflow A Lambda function (`preprocessing`) is invoked in a state We need error details when preprocessing fails Step Functions logs are enabled and sent to CloudWatch Logs Log level is set to FATAL (so only failure-level workflow events are captured in Step Functions logs) --- Evaluate each option A) Step Functions TaskFailed event in `/aws/vendedlogs/states` log group ✅ This is the correct place. Step Functions emits execution-level events like: `TaskStarted` `TaskSucceeded` `TaskFailed` Since the log level is set to FATAL, only failure-related workflow events are logged. If Lambda fails, Step Functions records it as a TaskFailed event. This log will include: error name cause (often includes Lambda error message) execution context 👉 This is the best place to see orchestration-level failure details. When this option is used: Debugging workflow failures in Step Functions Identifying which state failed and why (including Lambda invocation failures) High-level error tracing across services --- B) CloudTrail SendTaskFailure event in CloudTrail/logs/preprocessing ❌ CloudTrail logs API activity, not runtime execution errors inside Lambda. `SendTaskFailure` is related to callback pattern workflows, not standard Lambda invocation in Step Functions. Also, CloudTrail does not create a log grou...

Author: Ahmed97 · Last updated Jun 25, 2026

An application uses an AWS Lambda function that is configured with managed runtimes. The Lambda function successfully writes logs to the default Amazon CloudWatch Logs log group. A data engineer wants to modify the logging behavior to show only ERROR level...

Let’s carefully analyze this AWS exam-style question.

---

Requirement Recap

- Lambda function with managed runtimes.
- Logs currently go to default CloudWatch Logs group.
- Need to filter logs by severity:
  - Application logs → only ERROR.
  - System logs → only WARN.

This requires control over log filtering and routing, not just formatting or permissions.

---

Option Analysis

A) Add additional permissions to the Lambda execution role

- Permissions only control whether logs can be written.
- They do not affect log filtering or severity levels.
- Rejected: Irrelevant to the requirement.

---

B) Set the log level to ERROR in the Lambda function code

- This affects application logs only.
- You can configure your logging library to emit only ERROR-level logs.
- Limitation: Does not affect system logs (runtime logs generated by Lambda service).
- Rejected: Partial solution, not complete.

---

C) Configure the Lambda function t...

Author: Noah Williams · Last updated Jun 25, 2026

A data engineer needs to validate the quality of files that are uploaded to an Amazon S3 bucket every day. The files are in CSV and JSON formats and schema variations exist. The data engineer needs a repeatable process to monitor data quality metrics such as null values, format inconsistencies, and outliers. The process must provide reusable...

The correct option is C.

Why C is correct: AWS Glue DataBrew profiling + reusable recipes

AWS Glue DataBrew is designed specifically for no-code data profiling and data quality management, which directly matches the requirements:

Key factors in the scenario:

- Daily validation of S3 data (CSV + JSON) → DataBrew supports direct ingestion from Amazon S3.
- Schema variations → DataBrew can handle flexible schema datasets without custom parsing logic.
- Data quality metrics (nulls, format issues, outliers) → Built-in profiling jobs automatically compute these metrics.
- Rule-based, reusable validation → DataBrew allows creation of reusable data quality rules and recipes.
- Minimal manual effort / no custom code → Fully managed, visual interface with scheduling support.
- Scalable across datasets → Same recipe and profiling job can be applied across multiple datasets consistently.

This makes it ideal for repeatable, enterprise-scale data quality monitoring with low operational overhead.

---

Why other options are incorrect

A) AWS Glue Studio with PySpark ETL logic

- Requires custom PySpark code to implement validation logic.
- You would need to manually code checks for nulls, for...

Author: Sofia2021 · Last updated Jun 25, 2026

A company needs a solution to process streaming data by using Apache Spark in a Kubernetes environment. The solution must support event-driven scaling and optimize resource utilization. The company needs to integrate the solution with existing Kubernetes infrastructure deployed on Amazon ...

The question asks for a solution that:

- Runs Apache Spark streaming on Kubernetes (Amazon EKS)
- Supports event-driven scaling
- Optimizes resource utilization
- Has least operational overhead

Key evaluation factors

1. Managed vs self-managed Spark → managed reduces ops overhead
2. Event-driven scaling support → KEDA is purpose-built for event-based autoscaling
3. Kubernetes-native integration → EKS-native Spark execution model
4. Scaling correctness for streaming workloads → reacts to events (queues/streams), not just CPU metrics

---

Option Analysis

A) Self-managed Spark on EKS + KEDA

- Uses self-managed Apache Spark on Amazon Elastic Kubernetes Service
- Adds Kubernetes Event-Driven Autoscaling (KEDA) for scaling

Why it’s not optimal:

- Self-managed Spark = high operational burden (job scheduling, tuning, upgrades, fault handling)
- Even though KEDA enables event-driven scaling, you still manage Spark internals
- More DevOps overhead than managed alternatives

When used: Useful when you need full control over Spark internals or custom Spark builds, but accept higher maintenance cost.

---

B) Amazon EMR on EKS + KEDA (Correct Answer)

- Uses Amazon EMR on EKS
- Integrates with Kubernetes Event-Driven Autoscaling (KEDA)

Why this is best:

- EMR on EKS provides fully managed Spark runtime
- Offloads cluster and job lifecycle management to AWS
- KEDA enables true event-driven auto...

Author: Oscar · Last updated Jun 25, 2026

A legal company is building a data pipeline to power an application that will handle peak traffic during business hours. The application will provide information about relevant laws and available lawyers. The legal document database will be updated one time each day. The application must display up-to-date lawyer availability from a calendar database and provide complex full-text search of legal documents. The company wants to use AWS Glue for extract, transfor...

Key requirements breakdown

- ETL tool required: AWS Glue for batch/ETL processing
- Legal documents: updated once per day → batch processing is sufficient
- Lawyer availability: must be updated within 5 minutes → near real-time/event-driven updates required
- Search: complex full-text search over legal documents
- Goal: least operational overhead

---

Option Analysis

A) Step Functions + Glue + S3 + RDS for search

Step Functions can orchestrate Glue jobs, but that’s not the main issue.

Problem: Amazon RDS for full-text search

- RDS is a relational database.
- Full-text search at scale (legal documents) becomes complex and operationally heavy.
- Requires manual indexing, tuning, and scaling.

❌ Not suitable for complex search workloads.

When RDS is appropriate:

- Structured transactional data (OLTP)
- Simple queries, not large-scale search engines

---

B) Step Functions + Glue + S3 + Amazon OpenSearch Service

- S3 acts as a data lake for processed documents.
- Amazon OpenSearch Service provides:
  - Built-in full-text search
  - Relevance scoring
  - Filtering and aggregations
  - Near real-time indexing
- Step Functions + Glue supports:
  - Scheduled daily ETL for legal documents
  - Event-driven updates for lawyer availability (via triggers)

✔ Meets all requirements:

- Daily batch ingestion (legal docs)
- Near real-time updates (availability within 5 minutes)
- Complex search requirement handled natively

When OpenSearch is appropriate:

- Full-text search systems (legal, logs, product catalogs)
- Analytics + search hyb...

Author: StarryEagle42 · Last updated Jun 25, 2026

A data engineer needs to deploy a serverless data pipeline. In the pipeline, CSV files are uploaded to an Amazon S3 bucket, which invokes an AWS Lambda function. The Lambda function transforms the CSV files to JSON format and stores the results in a second S3 bucket. The data engineer has created an AWS Serverless Application Model (AWS SAM) template that includes t...

Correct approach in AWS SAM context

We need a fully serverless, repeatable deployment using AWS SAM, meaning:

- Infrastructure (S3 buckets + Lambda + event trigger) should be defined in the SAM template
- Deployment should be automated, not manually configured in the console
- We should use SAM-supported deployment workflow (build → package → deploy or guided deploy)

---

Key requirement breakdown

- S3 bucket uploads trigger Lambda → requires S3 event notification defined in IaC (SAM/CloudFormation)
- Must use AWS SAM deployment flow
- Avoid manual console configuration (not IaC-compliant)
- Must properly handle Lambda packaging and CloudFormation deployment

---

Option Analysis

✅ Option A

> Add first S3 bucket and S3 event source in SAM template → `sam build` → `sam deploy --guided`

Why this works

- Uses Infrastructure as Code (IaC) fully in SAM template (S3 + Lambda + event source)
- `sam build` prepares deployment artifacts correctly
- `sam deploy --guided` handles:
  - S3 packaging bucket creation
  - CloudFormation stack deployment
  - IAM permissions setup
- Fully automated and recommended SAM workflow

Key strengths

- End-to-end SAM-native deployment
- No manual steps
- Event-driven architecture correctly defined in template

When this is used

- First-time deployments
- When you want interactive configuration (guided deploy)
- CI/CD pipelines with SAM CLI abstraction

---

❌ Option B

> `sam deploy --s3-bucket` + manually configure S3 trigger in console

Why it is wrong

- Event trigger is configured manually → ...

Author: Oscar · Last updated Jun 25, 2026

A company stores objects in an Amazon S3 bucket. The company crawls the objects so that Amazon Athena can query the data. A data engineer manually moved all objects from the partition with a path prefix of status=01 to the prefix status=02. The status=01 partition location is now empty. However, the status=01 partition location...

In this scenario, the data in Amazon S3 has been manually reorganized, but the AWS Glue Data Catalog still contains outdated partition metadata for `status=01`. Since the S3 prefix is now empty, the issue is stale partition entries in the catalog, not missing partitions.

---

Key idea

- S3 is the source of truth for data storage
- Glue Data Catalog stores metadata (including partitions)
- Athena queries rely on catalog metadata, so stale partitions must be explicitly removed

---

Option Analysis

A) MSCK REPAIR TABLE

- Purpose: Scans S3 and adds missing partitions to the Glue Data Catalog
- Works when: New partitions exist in S3 but are not registered in Glue
- Why it’s wrong here:
  - The problem is the opposite: partition exists in catalog but NOT in S3
  - MSCK does not remove stale partitions

---

B) ALTER TABLE DROP PARTITION ✅

- Purpose: Explicitly removes partition metadata f...
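As an illustration of option B, the sketch below builds the Athena DDL that drops the stale `status=01` partition from the catalog. The table name `orders` is a hypothetical placeholder because the question excerpt does not name the table. Dropping a partition removes catalog metadata only.

```python
def build_drop_partition_ddl(table, partition):
    """Build an Athena ALTER TABLE ... DROP PARTITION statement for one partition."""
    spec = ", ".join(f"{column} = '{value}'" for column, value in partition.items())
    # IF EXISTS keeps the statement from failing if the partition is already gone.
    return f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({spec})"


# "orders" is a hypothetical table name.
ddl = build_drop_partition_ddl("orders", {"status": "01"})
print(ddl)
```

The resulting string could be submitted through the Athena console or any client that runs Athena DDL.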

Author: Ming · Last updated Jun 25, 2026

A company uses Amazon Redshift for its data warehouse. A data engineer must query a table named orders.complete_orders_history, which contains 100 columns. The query must return all columns except columns named compa...

Author: Lucas Carter · Last updated Jun 25, 2026

A manufacturing company uses AWS Glue jobs to process IoT sensor data to generate predictive maintenance models. A data engineer needs to implement automated data quality checks to identify temperature readings that are outside the expected range of -50°C to 150°C. The data quality checks must also identify records that are missing timestamp values. Th...

The correct answer is B.

Why B is correct

Option B uses AWS Glue Data Quality rules (built on the open-source Deequ library) to define and automatically enforce data validation checks directly within Glue jobs.

This solution fits all requirements:

- Minimal coding: Rules are declarative (rule sets), not custom code.
- Automated checks: Runs as part of Glue jobs or pipelines.
- Missing timestamp detection: Can use completeness rules (e.g., an `IsComplete` rule that flags null values).
- Temperature validation: Can define range constraints (e.g., -50°C to 150°C).
- Designed for Glue workflows: Native integration with AWS Glue ETL jobs.

ML-based anomaly detection mentioned in the option is not strictly required for this use case, but it does not invalidate the solution since Glue Data Quality already covers the deterministic checks needed.

---

Why other options are incorrect

A) AWS Glue DataBrew

AWS Glue DataBrew is a low-code data preparation tool and can perform profiling and validation rules like completeness and numeric ranges.

Why it is rejected:

- It is primarily designed for interactive data preparation, not automated validation inside Glue ETL pipelines.
- Less suitable for production-grade Glue job-based workflows for continuous IoT ingestion.
- Not the most direct integration with Glue job-based data quality enforcement.

When it would be used:

- Exploratory data cl...
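To show what the two deterministic checks amount to, here is a plain-Python sketch of the logic. It does not use AWS Glue Data Quality. In Glue the same intent would be expressed declaratively with rule types such as `IsComplete` and `ColumnValues`. The field names `timestamp` and `temperature_c` are assumptions for the example.

```python
def evaluate_reading(record, low=-50.0, high=150.0):
    """Return the list of data quality issues found in one sensor record."""
    issues = []
    # Completeness check: the timestamp must be present and non-empty.
    if record.get("timestamp") in (None, ""):
        issues.append("missing_timestamp")
    # Range check: the temperature must fall inside [low, high] degrees Celsius.
    temperature = record.get("temperature_c")
    if temperature is None or not (low <= temperature <= high):
        issues.append("temperature_out_of_range")
    return issues


# Hypothetical sample records.
readings = [
    {"timestamp": "2024-06-01T12:00:00Z", "temperature_c": 72.5},
    {"timestamp": None, "temperature_c": 20.0},
    {"timestamp": "2024-06-01T12:05:00Z", "temperature_c": 180.0},
]
report = [evaluate_reading(r) for r in readings]
```

Writing checks like these by hand is the custom code that a declarative rule set avoids, which is the "minimal coding" argument made for option B.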

Author: Jack · Last updated Jun 25, 2026

Two data engineering teams use separate AWS accounts. Both teams request access to the same datashare in an Amazon Redshift cluster that is in a third AWS account. The datashare is named salesshare. A data engineer must use the Amazon Redshift SQL interface to gr...

In Amazon Redshift datasharing, a producer cluster (in one AWS account) can share a datashare with consumer accounts. The correct permission syntax depends on how the consumer is identified:

- ACCOUNT → used when sharing across AWS accounts (most common for cross-account sharing).
- NAMESPACE → used when sharing to a specific Redshift cluster or Redshift Serverless namespace (often same-account or direct namespace targeting, not typical “two AWS accounts” exam framing).
- Syntax must also use correct keywords: ACCOUNT / NAMESPACE (singular), not plural forms.

---

Key requirement in the question

- Two separate AWS accounts must be granted access
- Same datashare: `salesshare`
- Must use Amazon Redshift SQL interface
- Must grant access to both accounts

---

Option Analysis

❌ A) `GRANT USAGE ON DATASHARE salesshare TO ACCOUNTS ...`

- Incorrect syntax: ACCOUNTS (plural) is invalid
- Redshift only supports ACCOUNT, not ACCOUNTS
- Even if conceptually correct, this would fail at execution

When it would be used?

- Not valid in any Redshift datasharing scenario

---

❌ B) `GRANT USAGE ON DATASHARE salesshare TO NAMESPACES ...`

- Incorrect syntax: NAMESPACES (plural) is invalid
- Namespace is valid only in singular form
- Also, namespace is typically used for Redshift Serv...
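A minimal sketch of the singular `ACCOUNT` form discussed above: the helper emits one `GRANT USAGE ON DATASHARE` statement per consumer account. The account IDs are hypothetical placeholders. Issuing separate statements is one valid way to grant both accounts and is not necessarily the exact wording of the correct option.

```python
def build_datashare_grants(datashare, account_ids):
    """Build one GRANT USAGE statement per consumer AWS account."""
    return [
        f"GRANT USAGE ON DATASHARE {datashare} TO ACCOUNT '{account_id}';"
        for account_id in account_ids
    ]


# Hypothetical 12-digit account IDs for the two consumer teams.
grants = build_datashare_grants("salesshare", ["111122223333", "444455556666"])
for statement in grants:
    print(statement)
```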

Author: Ethan · Last updated Jun 25, 2026

A company stores Apache Parquet files in an Amazon S3 data lake. The data lake receives thousands of files from multiple sources every hour. The files range in size from 50 KB to 100 KB. The company is evaluating the implementation of Apache Iceberg tables for the data lake. The company is using AWS Glue Data Catalog as part of the evaluation. The company needs a solution to optimize query perform...

The correct answer is C. Why C is correct C) Configure Iceberg table properties to enable automatic compaction based on thresholds for file size and the number of files. Apache Iceberg is designed to handle large-scale datasets with frequent small file ingestion. In this scenario, the core problem is small file explosion (50–100 KB files arriving continuously), which degrades query performance because query engines must open and process a very large number of files. Iceberg solves this through native table maintenance operations, especially: File compaction (rewrite_data_files / bin packing) Triggered based on file count thresholds or size thresholds Managed at the table level, independent of ingestion frequency This ensures: Stable query performance over time Reduced metadata overhead Efficient scan planning in Athena/Glue/Trino/Spark When this option is used: Continuous ingestion with many small files Long-term lakehouse optimization using Iceberg-managed maintenance Environments where automated table-level optimization is preferred over external ETL scheduling --- Why the other options are incorrect A) Use an AWS Glue job to compact files into 512 MB + run crawler Compaction is conceptually correct, but: Uses external ETL orchestration instead of Iceberg-native features 512 MB is arbitrary and not Iceberg-aware tuning Glue crawler does not improve performanc...

Author: Olivia · Last updated Jun 25, 2026

A company needs to optimize storage costs for an Amazon S3 bucket. The S3 bucket receives 10 million objects every day. The objects range in size from 2 KB to 5 MB. The objects need to be immediately accessible for the first 60 days. Users access objects infrequently from 61 to 180 days. The objects must be accessibl...

Requirements breakdown (key factors) 0–60 days: must be immediately accessible → S3 Standard 61–180 days: infrequent access → S3 Standard-IA (lower cost, still millisecond access) 181–365 days: must be retrievable within 1 hour → S3 Glacier Flexible Retrieval (supports expedited retrieval) After 365 days: automatic deletion Also important: 10 million objects/day → fully automated, scalable lifecycle management is required Avoid operational overhead (no manual batch jobs or Lambda per-object handling) --- Option analysis ✅ B) S3 Lifecycle policy (Correct) Transitions: S3 Standard → S3 Standard-IA after 60 days ✔️ (matches infrequent access phase) S3 Standard-IA → S3 Glacier Flexible Retrieval after 180 days ✔️ (meets “within 1 hour” retrieval requirement via expedited retrieval, which typically completes in 1–5 minutes) Expiration after 365 days ✔️ Fully managed by S3 Lifecycle → no custom compute required Scales automatically to millions of objects per day ✔️ Why this is ideal: Native S3 feature designed for time-based cost optimization Minimal operational overhead Deterministic transitions at scale --- ❌ A) S3 Intelligent-Tiering with Archive Access Intelligent-Tiering is best for unknown or unpredictable access patterns Problem: r...
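
The transitions in option B can be sketched as a lifecycle configuration in the shape that the S3 `PutBucketLifecycleConfiguration` API accepts. The rule ID is a made-up name and the empty prefix (apply to every object) is an assumption; `GLACIER` is the storage-class value for S3 Glacier Flexible Retrieval.

```python
import json

# Sketch of a single rule covering the whole bucket (empty prefix is an assumption).
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "tiered-retention",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "Transitions": [
                # Days 61-180: infrequent access.
                {"Days": 60, "StorageClass": "STANDARD_IA"},
                # Days 181-365: archive tier with expedited retrieval available.
                {"Days": 180, "StorageClass": "GLACIER"},
            ],
            # Delete objects one year after creation.
            "Expiration": {"Days": 365},
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```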

Author: Emily · Last updated Jun 25, 2026

A ride-sharing company stores records for all rides in an Amazon DynamoDB table. The table includes the following columns and types of values: The table currently contains billions of items. The table is partitioned by RideID and uses TripStartTime as the sort key. The company wants to use the data to build a personal interface to give drivers the ability to view the rides that each ...

The requirement is to allow drivers to view their completed rides filtered by RideStatus, without scanning a billions-item DynamoDB table. This is fundamentally an access-pattern problem, so the solution must support efficient Query operations, not scans. Key reasoning In Amazon DynamoDB, efficient access depends on having the correct partition key for the query pattern. Since the requirement is “view rides for each driver,” the primary access pattern becomes: Query by DriverID Optionally filter or sort by RideStatus To support this at scale, a Global Secondary Index (GSI) is required because the base table is partitioned by RideID, which does not support querying by DriverID. --- Option analysis A) Create a local secondary index (LSI) on DriverID Incorrect. LSIs must use the same partition key as the base table (RideID). You cannot use DriverID as the LSI partition key. Therefore, this does not support querying by driver efficiently. --- B) Create a GSI with RiderID as partition key and RideStatus as ...
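
To make the access pattern concrete, here is a minimal sketch of a GSI definition and the matching Query parameters. The index name, the table name `Rides`, the string attribute types, and the choice of RideStatus as the sort key are all assumptions made for illustration; the source only establishes that DriverID must be the partition key of the index.

```python
# Hypothetical GSI: DriverID as partition key, RideStatus as sort key (assumed).
gsi_definition = {
    "IndexName": "DriverID-RideStatus-index",
    "KeySchema": [
        {"AttributeName": "DriverID", "KeyType": "HASH"},
        {"AttributeName": "RideStatus", "KeyType": "RANGE"},
    ],
    "Projection": {"ProjectionType": "ALL"},
}


def build_driver_query(driver_id: str, ride_status: str) -> dict:
    """Build Query parameters that read one driver's rides from the GSI."""
    return {
        "TableName": "Rides",  # hypothetical table name
        "IndexName": gsi_definition["IndexName"],
        # Both conditions are key conditions, so no table scan is needed.
        "KeyConditionExpression": "DriverID = :driver AND RideStatus = :status",
        "ExpressionAttributeValues": {
            ":driver": {"S": driver_id},
            ":status": {"S": ride_status},
        },
    }


query = build_driver_query("driver-42", "COMPLETED")
print(query["KeyConditionExpression"])
```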

Author: Ethan Smith · Last updated Jun 25, 2026

A company needs to use Amazon Athena to analyze data that is in an Amazon S3 bucket. A data engineer needs to configure AWS Glue table partitions for year, month, and day. The data engineer needs to create the partitions ev...

The correct answer is: Selected option: C) Set partition projection properties for the AWS Glue table. --- Why Option C is correct (Partition Projection) Amazon Athena supports partition projection, which allows Athena to dynamically compute partitions at query time instead of storing them in the AWS Glue Data Catalog. In this scenario: Data is organized by year/month/day Partitions need to be “available daily” Schema may change frequently Partition projection is ideal because: You do NOT need to manually create partitions You do NOT need crawlers or Lambda jobs Athena automatically “knows” the partition structure based on configuration It significantly improves performance and reduces Glue metadata overhead It is specifically designed for time-based partition patterns So instead of creating partitions every day, you configure rules like: year range month format day format Athena then generates the partition logic at query time. --- Why other options are incorrect A) AWS Glue DataBrew AWS Glue DataBrew is used for: Data cleaning Transformation (visual ETL) ❌ Not used for: Managing Glue Data Catalog partitions Automating S3 partition registration So it cannot solve partition management for Athena. --- ...
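
The "year range, month format, day format" rules can be sketched as Athena partition projection table properties. The year range and the S3 location template below are placeholder values chosen for this example; the property keys follow Athena's `projection.*` naming.

```python
# Table properties for integer-typed year/month/day partitions.
partition_projection_properties = {
    "projection.enabled": "true",
    "projection.year.type": "integer",
    "projection.year.range": "2020,2030",  # placeholder range
    "projection.month.type": "integer",
    "projection.month.range": "1,12",
    "projection.month.digits": "2",        # zero-pad to 01..12
    "projection.day.type": "integer",
    "projection.day.range": "1,31",
    "projection.day.digits": "2",          # zero-pad to 01..31
    # Placeholder bucket and prefix; ${...} are Athena template variables.
    "storage.location.template": "s3://amzn-s3-demo-bucket/data/${year}/${month}/${day}/",
}


def render_tblproperties(properties: dict) -> str:
    """Render the properties as a TBLPROPERTIES clause for ALTER TABLE."""
    pairs = ",\n  ".join(f"'{key}' = '{value}'" for key, value in properties.items())
    return f"TBLPROPERTIES (\n  {pairs}\n)"


print(render_tblproperties(partition_projection_properties))
```

With these properties set, Athena computes the partition locations at query time, so no daily partition registration is required.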

Author: Andrew · Last updated Jun 25, 2026

A company runs an Apache Spark application every night in an Amazon EMR cluster. The company uses Amazon EC2 instances to supply compute capacity for the EMR cluster. The company deployed the Spark application in cluster mode. An error occurs in the Spark application. A log for the error is stored in the applicatio...

In Amazon EMR, a Spark application running in cluster mode on YARN places the driver inside a YARN container (Application Master) on one of the cluster nodes. Any `stderr` from the driver is captured in the container logs and EMR step logs, which are commonly archived to Amazon S3 if logging is enabled. Option analysis A) YARN ResourceManager logs via live cluster web UI The ResourceManager UI provides job-level and scheduling information, not detailed container stdout/stderr logs. It shows application status, queues, and tracking links, but not the driver’s error logs. Rejected because RM logs are not where container-level stderr is stored. --- B) Persistent application UI → first YARN container log in Spark UI In cluster mode, the driver runs in the first container (ApplicationMaster), and Spark UI can sometimes link to logs. However, Spark UI primarily shows metrics, stages, and executor logs, and log access is indirect or depends on YARN log aggregation being configured. This is not the most reliable or primary location in EMR exam conte...

Author: Amelia · Last updated Jun 25, 2026

A company needs to build a data pipeline to process a 1-TB file from an Amazon S3 bucket. The pipeline needs to create three DataFrames based on business logic. The pipeline must save all three DataFrames to a second S3 bucket in parallel. The company needs to set the pipeline to be the target of an Amazon EventBridge r...

The correct answer is C) Configure an AWS Glue workflow to run three AWS Glue jobs in parallel to process the file. Why Option C is correct (key reasoning) This requirement is best solved using serverless ETL with built-in orchestration and parallel job execution: Lowest maintenance overhead: AWS Glue is fully managed (no servers, no cluster management like EMR). Handles large-scale data (1 TB): Glue is designed for big data processing using Apache Spark under the hood. Parallel processing supported natively: A Glue Workflow (or job triggers) can run multiple Glue jobs concurrently, each building one DataFrame and writing to S3. Event-driven integration: EventBridge can directly trigger a Glue workflow when a new object lands in S3. Minimal operational complexity: No need to manage additional orchestration layers or infrastructure. Why other options are incorrect A) EMR Spark Streaming application Requires provisioning and managing EMR clusters → high operational overhead. Overkill for a file-triggered ETL pipeline. Better suited for continuous streaming or complex Spark workloads requiring full cluster control. When it would be used: When you need deep Spark customization, long-running jobs, or integ...

Author: Sam · Last updated Jun 25, 2026

A company needs a solution to store and query product data that has variable attributes. The solution must support unpredictable and high-volume queries with single-digit millisecond latency, even during sudden traffic spikes. The solution must retrieve items by a primary identifier named Product ID. The so...

Let’s evaluate this against the key requirements: Key requirements breakdown Variable / flexible attributes → schema should not be rigid Unpredictable + high-volume traffic spikes → must auto-scale smoothly Single-digit millisecond latency → extremely low read latency required Primary access pattern: Product ID lookup Secondary queries: Category + Brand filtering AWS exam bias: managed scaling + indexing support matters heavily --- ✅ Option A: Amazon DynamoDB with on-demand capacity + Global Secondary Indexes (GSIs) (SELECTED) Why this works Single-digit millisecond latency → DynamoDB is designed for consistent low-latency key-value access Massive scale + traffic spikes → on-demand capacity automatically handles unpredictable workloads Primary key access (Product ID) → direct partition key lookup (fastest path in DynamoDB) Secondary queries (Category, Brand) → handled efficiently using GSIs Flexible schema → supports variable attributes (no fixed schema required) Fully managed, no infrastructure tuning required When to use this Use DynamoDB when: You need high-scale OLTP workloads Access patterns are known (primary + secondary indexes) You need predictable millisecond performance at any scale Data is semi-structured or evolving --- ❌ Option B: Amazon Aurora (Multi-AZ + read replicas) Why it is rejected Even with read replicas, latency is typically higher than single-digit ms at scale Relational schema is not ideal for variable/unpredictable attributes Indexing helps but does not match DynamoDB’s key-value performance u...

Author: Daniel · Last updated Jun 25, 2026

A gaming company uses AWS Glue to perform read and write operations on Apache Iceberg tables for real-time streaming data. The data in the Iceberg tables is in Apache Parquet format. The company is experiencin...

The key issue is slow query performance on Apache Iceberg tables stored in Parquet format. In Iceberg, performance problems in AWS Glue environments are most commonly caused by small files, lack of up-to-date statistics, and inefficient scan planning, especially in streaming ingestion scenarios. We evaluate each option based on how Iceberg + AWS Glue Data Catalog actually optimize read performance. --- A) Use AWS Glue Data Catalog to generate column-level statistics for the Iceberg tables on a schedule. ✔️ (Correct) Column-level statistics (min/max values, null counts, etc.) help the Iceberg query engine and AWS services like Athena or Spark: Perform predicate pushdown more effectively Skip unnecessary data files during scans Improve query planning accuracy Why it improves performance: If stats are missing or outdated, the engine may scan far more Parquet files than necessary. Scheduled stats generation ensures query planners make better decisions for filtering streaming data. When used: Large Iceberg datasets High-cardinality queries with filters (e.g., timestamps, user IDs) Streaming ingestion with frequent updates --- B) Use AWS Glue Data Catalog to automatically compact the Iceberg tables. ✔️ (Correct) Compaction merges small Parquet files into larger optimized files. Reduces file listing overhead Improves scan efficiency Reduces metadata load in Iceberg snapshot planning Why it improves performance: Streaming workloads often generate many small files, which is one of the biggest causes of slow Iceberg queries. When used:...

Author: FrostFalcon88 · Last updated Jun 25, 2026

A company must retain specific data for 1 year. A data engineer observes that one of the company's Amazon S3 buckets contains millions of objects that are older than 3 years. Versioning is enabled on the bucket. To reduce costs, the data engineer implements an S3 Lifecycle rule to expire objects after 365 days. The new S3 Lifecycle rule...

Key concept: Why object count increased The bucket has S3 Versioning enabled, which fundamentally changes how deletions work: A normal “delete” does not remove data Instead, S3 creates a delete marker Older versions of the object remain stored So when your lifecycle rule expires objects after 365 days, it only: Removes current objects logically (via delete markers) But does NOT automatically remove old versions ➡️ Result: object count increases instead of decreasing --- Correct approach To truly reduce storage and permanently delete old data, you must explicitly handle noncurrent (previous) versions. --- ✅ Correct Answer: D) Add an additional S3 Lifecycle rule to delete the current and expired versions of objects that are older than 365 days. Why D is correct This option correctly addresses both lifecycle dimensions in a versioned bucket: 1. Current object expiration Ensures active versions are expired after 365 days 2. Noncurrent version expiration Permanently deletes older versions (the real cause of storage growth) 3. Delete marker cleanup (implicit in lifecycle behavior) Prevents accumulation of unnecessary m...
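
A lifecycle rule that handles both current and noncurrent versions might look like the sketch below, written in the shape of the S3 lifecycle API. The rule ID, the empty prefix, and the noncurrent retention of 1 day are assumptions for illustration; `NoncurrentDays` counts from the moment a version becomes noncurrent, not from its creation date.

```python
def build_versioned_expiration_rule(retention_days: int, noncurrent_days: int) -> dict:
    """Build one lifecycle rule for a versioning-enabled bucket."""
    return {
        "ID": "expire-current-and-noncurrent-versions",  # hypothetical name
        "Status": "Enabled",
        "Filter": {"Prefix": ""},
        # Current versions: S3 adds a delete marker after retention_days.
        "Expiration": {"Days": retention_days},
        # Noncurrent versions: permanently deleted noncurrent_days after
        # they stop being the current version.
        "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
    }


# Hypothetical values: keep current data one year, purge old versions quickly.
rule = build_versioned_expiration_rule(retention_days=365, noncurrent_days=1)
print(rule)
```

Without the `NoncurrentVersionExpiration` element, the rule only adds delete markers, which is why the object count grew in the scenario.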

Author: Ava · Last updated Jun 25, 2026

A data engineer needs to deploy a complex pipeline. The stages of the pipeline must be able to run a script. The data engineer must use only fully managed and serverless ...

Key requirement breakdown Pipeline is complex with multiple stages Each stage must be able to run scripts (custom code) Must use fully managed + serverless services only Must support orchestration + scheduling --- Option A) AWS Glue Jobs and Workflows Why it fits AWS Glue is a fully managed, serverless ETL service Glue Jobs can run custom scripts (Python/Scala) Glue Workflows orchestrate multiple jobs with dependencies Native scheduling support (via triggers or EventBridge) Key strengths Serverless compute (no infrastructure management) Built for data pipeline stages Supports multi-step orchestration Handles retries, dependencies, and monitoring When to use ETL pipelines with multiple stages Spark-based or script-based transformations Serverless data processing workflows --- Option B) Amazon MWAA (Amazon Managed Workflows for Apache Airflow) Why it fits Fully managed Airflow = serverless orchestration layer Can orchestrate complex multi-stage pipelines Supports running scripts via operators (Python, Bash, Glue, EMR, etc.) But limitation in exam context MWAA is orchestration-only, not execution engine You still need external compute (Glue, Lambda, EMR, etc.) to run scripts More operational overhead compared to Glue-only solution When to use Very complex DAGs across multiple AWS services Enterprise-level workflow orchestration Cross-system pipelines (S3, Redshift, EMR, APIs, etc.) --- Option C) Amazon EC2 + Amazon EventBridge Why it is rejected EC2 i...

Author: Ishaan · Last updated Jun 25, 2026

A company needs to implement a data mesh architecture in which domains for trading, risk, and compliance teams each own their data. The teams need to share specific views with one another. The teams have over 1,000 tables across 50 databases in AWS Glue Data Catalog. All three teams use Amazon Athena to perform on-demand analysis. The teams use Amazon Redshift to generate complex reports. The compliance team must audit all data access. Access to personally identifiable information (PII) data must...

✅ Correct Answer: C C) Use AWS Lake Formation to set up cross-domain access to tables. Set up fine-grained access controls. --- Why Option C is correct This scenario is a classic data mesh on AWS with strict governance requirements. The key service designed for exactly this use case is AWS Lake Formation. Key reasons: 1. Native data mesh support Each domain (trading, risk, compliance) owns its datasets. Lake Formation enables decentralized data ownership with centralized governance, which is the core of data mesh. 2. Fine-grained access control (critical requirement) Supports column-level, row-level, and tag-based access control (LF-Tags). This is essential for restricting PII data access. 3. Cross-domain data sharing Allows secure sharing of datasets across domains without copying data. Works seamlessly with both: Amazon Athena Amazon Redshift Spectrum 4. Auditability Integrates with AWS CloudTrail to log all Lake Formation data access actions. This satisfies the compliance requirement for auditing. 5. Scalable for 1000+ tables / 50 databases Centralized catalog governance over AWS Glue Data Catalog scales well. --- Why other options are incorrect ❌ Option A (Athena views + Redshift integration) Views in Athena are not a governance mechanism. Does not provide centr...

Author: Aria · Last updated Jun 25, 2026

A retail company stores point-of-sale transaction data in an Amazon RDS for MySQL database. The company maintains historical sales analytics in Amazon Redshift. The company needs to create daily reports that combine the current day's transactions with historical sales patterns for trend analysis. The company requires a solution that...

The correct answer is A) Configure AWS Database Migration Service (AWS DMS) to continuously replicate data from RDS for MySQL to Amazon Redshift. Use Redshift queries to create consolidated reports. --- Why Option A is correct This scenario requires: Near real-time data from Amazon RDS for MySQL Integration with historical data in Amazon Redshift Low operational overhead Minimized data transfer cost Support for analytics workloads AWS Database Migration Service with Change Data Capture (CDC) is designed exactly for this: Key reasons: Continuous replication (CDC): Streams only changed data from RDS to Redshift, enabling near real-time sync. Low operational overhead: Fully managed service; no need to build or maintain pipelines. Efficient data transfer: Only incremental changes are moved, reducing network cost. Direct analytics readiness: Data lands in Amazon Redshift where it can be joined with historical datasets. When Option A is used: Near real-time analytics from transactional databases Cross-system replication into data warehouses Minimal maintenance data pipelines --- Why other options are incorrect ❌ B) Redshift federated queries to RDS Uses Redshift to query RDS directly without copying data. Problem 1: Performance bottleneck — RDS (MySQL) is an OLTP system and not optimized for heavy analytical queries. Problem 2: Latency — Cross-database joins are slower and not ideal for daily reporting at scale. Problem 3: Operational risk — Analytical queries can impact production RDS workloads. Best used when: Occasional lookups of small datasets Lightweight enrichment queries, not large-scale reporting --- ...

Author: Ming88 · Last updated Jun 25, 2026

A company uses a data stream in Amazon Kinesis Data Streams to collect transactional data from multiple sources. The company uses an AWS Glue extract, transform, and load (ETL) pipeline to look for outliers in the data from the stream. When the workflow detects an outlier, it sends a notification to an Amazon Simple Notification Service (Amazon SNS) topic. The SNS topic initiates a second workflow to retrieve logs for the outliers and stores the logs in an Amazon S3 bucket. The company experiences delays in the notifications to the SNS topic during periods when the data stream is processing a high volume of data. When the comp...

The key issue in this scenario is not the SNS notification mechanism itself, but the backpressure and resource exhaustion in the AWS Glue Spark environment during high-throughput periods. The CloudWatch metric `glue.driver.BlockManager.disk.diskSpaceUsed_MB` indicates that the Spark driver/executors are spilling data to disk, which typically happens when: Executor memory is insufficient Shuffle operations or transformations are too large for allocated resources The job is under-provisioned during traffic spikes This leads to slower ETL processing, which in turn delays detection of outliers and subsequent SNS notifications. --- Option Analysis A) Increase the number of data processing units (DPUs) in AWS Glue ETL jobs This increases compute and memory resources manually. It directly reduces disk spillover and improves throughput. However, it is static scaling. Requires manual tuning as load changes. ✔ Works, but not optimal for fluctuating workloads ❌ More operational effort (manual adjustments needed) --- B) Use Amazon EMR instead of AWS Glue EMR provides fine-grained control over Spark tuning. However, it introduces: Cluster management overhead Capacity planning Patching and scaling complexity ❌ Highest operational effort ❌ Not aligned with “least operational effort” requirement --- C) Use AWS Step Functions to orchestrate a parallel workflow state St...

Author: Leo · Last updated Jun 25, 2026

A company has a data pipeline that processes transaction data in real time. The company needs a notification system that alerts different teams based on the type of processing error without any delay. For security-related errors, the system must immediately notify the security team. For data validation errors, the system must notify the data quality ...

Key requirement here is real-time, event-driven routing of different error types to different teams with minimal operational overhead. That strongly points toward a fully managed event routing service without custom compute or polling. Option B (Correct): Amazon EventBridge → SNS per error type Amazon EventBridge with rules is the best fit because it is designed for real-time event filtering and routing. You define event patterns (e.g., `security_error`, `validation_error`, `system_error`) Each rule directly routes to a dedicated Amazon Simple Notification Service topic Each team subscribes to its own SNS topic (email, SMS, Lambda, etc.) No custom code, no polling, no message transformation layer required Why it works best (key factors): True event-driven architecture (push-based, near real-time) Built-in filtering and routing Fully managed → lowest operational overhead Highly scalable and decoupled When to use this: When multiple consumers need different subsets of the same event stream When routing logic is based on event attributes When you want minimal infrastructure management --- Option A: SNS + Lambda routing (Rejected) This uses a single SNS topic and a Lambda function to inspect and forward messages. Adds custom compute layer (Lambda) → increases operational overhead You must maintain routing logic in code Extra hop before delivery → slightly higher ...
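
The rule-per-error-type idea can be sketched with a small local simulation. This is not the EventBridge API: the matcher below only handles exact-value lists nested in objects, and the `errorType` field and topic names are hypothetical.

```python
# Each rule pairs an EventBridge-style pattern with a hypothetical SNS topic.
ROUTING_RULES = [
    {"pattern": {"detail": {"errorType": ["security_error"]}}, "topic": "security-team-topic"},
    {"pattern": {"detail": {"errorType": ["validation_error"]}}, "topic": "data-quality-team-topic"},
    {"pattern": {"detail": {"errorType": ["system_error"]}}, "topic": "operations-team-topic"},
]


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: nested dicts recurse, lists mean 'any of these values'."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


def route(event: dict) -> list[str]:
    """Return the topics of every rule whose pattern matches the event."""
    return [rule["topic"] for rule in ROUTING_RULES if matches(rule["pattern"], event)]


print(route({"source": "pipeline", "detail": {"errorType": "security_error"}}))
```

In the managed service the same filtering happens inside EventBridge, so no routing code has to be deployed or maintained.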

Author: Mia · Last updated Jun 25, 2026

A data engineer is designing a log table for an application that requires continuous ingestion. The application must provide dependable API-based access to specific records from other applications. The application must handle more than 4,000 concurr...

The correct choice is D. The key requirements in this scenario are very high concurrent write (4,000/sec) and read (6,500/sec) workloads, continuous ingestion, and API-based access to individual records from other applications. This clearly points to a high-throughput, low-latency key-value/NoSQL system rather than a data warehouse or data lake query engine. --- Why D is correct: Amazon DynamoDB Amazon DynamoDB is designed specifically for: Massive horizontal scalability (handles thousands to millions of requests per second) Low-latency reads and writes (single-digit millisecond performance) High concurrency workloads without connection bottlenecks Native API-based access (PutItem, GetItem, Query via AWS SDK/REST) Automatic scaling with provisioned or on-demand capacity In this case, provisioning capacity to meet 4,000+ writes/sec and 6,500+ reads/sec is straightforward and aligns exactly with DynamoDB’s design. 👉 This is the standard AWS choice for: Logging/event ingestion systems Session stores Real-time APIs IoT telemetry ingestion High-throughput transactional workloads --- Why the other options are incorrect A) Amazon Redshift + KEY distribution + Data API Amazon Redshift is optimized for OLAP (analytics), not high-frequency OLTP writes. ❌ High write concurrency (4,000/sec) is inefficient due to columnar storage overhea...
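
As a worked example of the provisioning arithmetic: one write capacity unit covers one write per second for an item up to 1 KB, and one read capacity unit covers one strongly consistent read per second for an item up to 4 KB (or two eventually consistent reads). The 1 KB item size below is an assumption, since the question does not state record sizes.

```python
import math


def required_wcu(writes_per_second: int, item_size_bytes: int) -> int:
    """Write capacity units: each write consumes ceil(size / 1 KB) units."""
    return writes_per_second * math.ceil(item_size_bytes / 1024)


def required_rcu(reads_per_second: int, item_size_bytes: int,
                 strongly_consistent: bool = True) -> int:
    """Read capacity units: each read consumes ceil(size / 4 KB) units,
    halved for eventually consistent reads."""
    total = reads_per_second * math.ceil(item_size_bytes / 4096)
    return total if strongly_consistent else math.ceil(total / 2)


ITEM_SIZE = 1024  # assumed 1 KB log record

print(required_wcu(4000, ITEM_SIZE))                             # 4000
print(required_rcu(6500, ITEM_SIZE))                             # 6500
print(required_rcu(6500, ITEM_SIZE, strongly_consistent=False))  # 3250
```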

Author: Elizabeth · Last updated Jun 25, 2026

A company uses AWS Glue ETL pipelines to process data. The company uses Amazon Athena to analyze data in an Amazon S3 bucket. To better understand shipping timelines, the company decides to collect and store shipping and delivery dates in addition to order data. The company adds a data quality check to ensure that shipping date is greater than order date and that delivery date is gre...

Correct Answer: C) Use AWS Glue Data Quality to create a custom rule that uses the three date columns. --- Why Option C is correct (MOST cost-effective) AWS Glue Data Quality is purpose-built for exactly this kind of requirement: enforcing business rules during ETL processing in AWS Glue jobs. Key reasons: Native integration with AWS Glue ETL pipelines → no extra services or orchestration needed. Rule-based validation allows direct expression of business logic: `shipping_date > order_date` `delivery_date > shipping_date` Efficient routing of records: Valid records continue processing Invalid records can be automatically written to a separate S3 bucket Cost-effective because: No additional query engines (Athena) or separate ETL tools (DataBrew) Runs inline within existing Glue job execution Scales with Glue jobs without additional operational overhead Best-fit scenario: Use Glue Data Quality when: You need automated validation inside ETL pipelines You want to enforce data rules before loading or transforming data You need consistent, repeatable quality checks at scale --- Why other options are incorrect A) AWS Glue DataBrew DATEDIFF function Designed for visual, interactive data preparation, not production ETL enforcement Re...
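
The logic of the custom rule, and the split into valid and invalid records, can be illustrated in plain Python. This is not the Glue Data Quality API, only the row-level check it would enforce; the sample records are invented and all three dates are assumed to be non-null.

```python
from datetime import date


def split_by_date_order(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate records that satisfy order < shipping < delivery from those that do not."""
    valid, invalid = [], []
    for record in records:
        in_order = record["order_date"] < record["shipping_date"] < record["delivery_date"]
        (valid if in_order else invalid).append(record)
    return valid, invalid


# Invented sample data: the second record ships before it is ordered.
sample_records = [
    {"id": 1, "order_date": date(2026, 1, 1), "shipping_date": date(2026, 1, 2),
     "delivery_date": date(2026, 1, 5)},
    {"id": 2, "order_date": date(2026, 1, 3), "shipping_date": date(2026, 1, 2),
     "delivery_date": date(2026, 1, 6)},
]

valid_records, invalid_records = split_by_date_order(sample_records)
print([r["id"] for r in valid_records], [r["id"] for r in invalid_records])
```

In the Glue job, the valid set continues through the pipeline and the invalid set is the one written to the separate S3 location.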

Author: CrimsonViperX · Last updated Jun 25, 2026

A company is developing machine learning (ML) models. A data engineer needs to apply data quality rules to training data. The company stores the training data in an Amazon S3 bucket. ...

The requirement is to apply data quality rules to training data stored in Amazon S3 with the least operational overhead. The key idea is to prefer fully managed, purpose-built data quality services over custom code, orchestration, or infrastructure management. --- Option A: AWS Lambda + Amazon CloudWatch This approach uses custom code inside Lambda to validate data and raise exceptions, then monitors failures via CloudWatch alarms. Why it is not ideal: Requires manual coding of all data quality rules No built-in data profiling or rule management framework Error handling is indirect (exception-based monitoring) Becomes hard to scale as rules grow When it is used: Simple validations (e.g., file format checks, lightweight record validation) Event-driven triggers with minimal processing logic --- Option B: AWS Glue DataBrew + Amazon EventBridge (BEST) This solution uses DataBrew rulesets for data quality validation and profile jobs for automated checks, triggered by S3 events through EventBridge. Why this is the best choice: Fully managed, no-code / low-code data quality service Built-in data profiling + rule-based validation No infrastructure provisioning (serverless execution) EventBridge enables fully automated ingestion-triggered execution Very low operational overhead When it is used: Data quality validation on datasets in S3 ...

Author: Isabella1 · Last updated Jun 25, 2026

A data engineer is configuring an AWS Glue Apache Spark extract, transform, and load (ETL) job. The job contains a sort-merge join of two large and equally sized DataFrames. The job is failing with ...

This question is about a Spark shuffle disk failure in AWS Glue during a sort-merge join. Key observation A sort-merge join of two large, equally sized DataFrames triggers: Full shuffle of both datasets across executors Sort phase on each partition Heavy spill to local disk (ephemeral storage) Intermediate files can exceed worker disk capacity The error “No space left on device” is a classic symptom of shuffle spill exhaustion on Glue worker local storage. --- Option Analysis A) Use the AWS Glue Spark shuffle manager ✅ (Correct) AWS Glue supports an optimized shuffle mechanism (depending on Glue version, often backed by S3-based shuffle or optimized shuffle service). Why this works Offloads shuffle blocks from local disk → scalable storage (like S3-backed shuffle / managed shuffle service) Eliminates dependency on ephemeral EBS/NVMe disk capacity Directly addresses the root cause: shuffle spill disk exhaustion during sort-merge join When to use Large joins requiring heavy shuffle Disk spill errors in Glue ETL jobs Memory/disk pressure during aggregation, sort, join stages --- B) Deploy an Amazon EBS volume for the job ❌ (Incorrect) Why it is wrong AWS Glue is a fully managed service You cannot attach or control EBS volumes for Glue worker nodes Even if storage were increased, Spark shuffle is not designed to rely on external EBS for spill When EBS wou...
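
A minimal sketch of the job arguments that turn on the S3-backed shuffle in AWS Glue is shown below. The bucket name is a placeholder, and which parameters apply depends on the Glue version in use, so treat this as an assumption to verify against the Glue documentation for that version.

```python
def build_shuffle_job_arguments(shuffle_bucket: str) -> dict:
    """Build Glue job arguments that write Spark shuffle files to Amazon S3."""
    return {
        # Store shuffle files in S3 instead of the worker's local disk.
        "--write-shuffle-files-to-s3": "true",
        # Bucket that receives the shuffle data (placeholder name below).
        "--conf": f"spark.shuffle.glue.s3ShuffleBucket=s3://{shuffle_bucket}/",
    }


shuffle_arguments = build_shuffle_job_arguments("amzn-s3-demo-shuffle-bucket")
for key, value in shuffle_arguments.items():
    print(key, value)
```

These would be supplied as default arguments on the job definition or as run-time arguments when the job starts.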

Author: Mia · Last updated Jun 25, 2026

A data engineer is using an Apache Iceberg framework to build a data lake that contains 100 TB of data. The data engineer wants to run AWS Glue Apache Spark jobs that use the Iceberg f...

To run Apache Iceberg on AWS Glue Spark jobs, the key is enabling Glue’s Iceberg integration and ensuring the runtime has access to Iceberg libraries (only when not using a Glue version that already includes them). --- ✅ Correct options C) Set Iceberg as a value for the `--datalake-formats` job parameter This is the core required step in AWS Glue. Glue uses the `--datalake-formats` parameter to enable managed table formats like Iceberg Setting it to `iceberg` tells Glue to: Load Iceberg-compatible Spark extensions Enable Glue Data Catalog integration with Iceberg tables This is mandatory in all supported Glue versions (4.0+ especially) 👉 Why it’s correct: Without this, Glue will not initialize Iceberg runtime support. --- B) Specify the path to a specific version of Iceberg using the `--extra-jars` parameter and set datalake-formats This is used when Iceberg is not already bundled in the Glue runtime. `--extra-jars` allows you to: Attach Iceberg libraries manually Ensure compatibility with a specific Iceberg version (important in custom or older setups) Still requires `--datalake-formats iceberg` to activate Iceberg mode 👉 Why it’s correct: In real-world AWS exam context, this applies when: Using older Glue versions Or when you need a specific Iceberg version not included by default --- ❌ Why othe...
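
For option C, the job parameters might be assembled as in the sketch below. The catalog name `glue_catalog` and the warehouse path are placeholders; the Spark settings are the commonly documented ones for using Iceberg with the Glue Data Catalog, chained into the single `--conf` job parameter.

```python
def build_iceberg_job_arguments(warehouse_path: str, catalog: str = "glue_catalog") -> dict:
    """Build Glue job arguments that enable Iceberg with the Glue Data Catalog."""
    spark_confs = [
        "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        f"spark.sql.catalog.{catalog}=org.apache.iceberg.spark.SparkCatalog",
        f"spark.sql.catalog.{catalog}.warehouse={warehouse_path}",
        f"spark.sql.catalog.{catalog}.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog",
        f"spark.sql.catalog.{catalog}.io-impl=org.apache.iceberg.aws.s3.S3FileIO",
    ]
    return {
        # Tells Glue to load its bundled Iceberg libraries.
        "--datalake-formats": "iceberg",
        # Glue accepts one --conf key, so further settings are chained inside its value.
        "--conf": " --conf ".join(spark_confs),
    }


# Placeholder warehouse location.
iceberg_arguments = build_iceberg_job_arguments("s3://amzn-s3-demo-bucket/warehouse/")
print(iceberg_arguments["--datalake-formats"])
print(iceberg_arguments["--conf"])
```

For option B, the same `--datalake-formats` value stays in place and an `--extra-jars` entry pointing at the chosen Iceberg JAR would be added alongside it.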

Author: Lucas · Last updated Jun 25, 2026

A company generates yearly financial statements for customers and stores the statements in an Amazon S3 bucket. Customers rarely access the documents after 1 week. The company must retain the statements for 7 years. The statements must remain readi...

The requirements are: Retain data for 7 years Access is rare after 1 week Data must remain readily accessible (low-latency retrieval) Must be most cost-effective Key decision factors 1. Storage cost over long retention (7 years) → archive classes preferred 2. Access latency requirement → “readily accessible” rules out slow retrieval tiers 3. Operational simplicity → avoid custom automation when S3 Lifecycle can handle it 4. Access pattern → rarely accessed after 1 week → lifecycle transition ideal --- Option Analysis A) S3 Glacier Deep Archive after 7 days Pros: Cheapest storage class for long-term retention Cons: Retrieval takes hours (standard retrievals typically complete within 12 hours, bulk retrievals within 48 hours) ❌ Fails requirement: “readily accessible” Use case: Legal/compliance archival where access is extremely rare (audit records, backups never expected to be restored quickly) --- B) S3 Intelligent-Tiering for all objects Pros: Automatically moves data between tiers based on access patterns Cons: Ongoing monitoring and automation fees per object Not as cost-effective for a known pattern (always cold after 1 week) ❌ Overkill because access pattern is predictable Use case: Unknown or changing access patterns (analytics data, user-generated content with unpredictab...
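The "lifecycle transition" idea in the decision factors can be sketched as an S3 Lifecycle configuration. The sketch below is an assumption-laden illustration: the target storage class is left as a parameter, and `GLACIER_IR` (S3 Glacier Instant Retrieval) appears only as an example of an archive class with millisecond retrieval, not as the confirmed answer. The dict follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument.

```python
def lifecycle_configuration(transition_days: int, storage_class: str) -> dict:
    """Build a lifecycle configuration that moves all objects after N days."""
    if transition_days < 0:
        raise ValueError("transition_days must be non-negative")
    return {
        "Rules": [
            {
                "ID": f"transition-after-{transition_days}-days",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix applies the rule to every object
                "Transitions": [
                    {"Days": transition_days, "StorageClass": storage_class}
                ],
            }
        ]
    }


# Example values only: 7 days matches the scenario, the storage class is an assumption.
lifecycle = lifecycle_configuration(7, "GLACIER_IR")
```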

Author: CrimsonViperX · Last updated Jun 25, 2026

A data engineer at a company is optimizing extract, transform, and load (ETL) workflows. The current architecture uses Amazon EMR and Apache Spark for large-scale transformations and AWS Glue for other ETL tasks. The workflows load processed data into an Amazon S3 based data lake. The company wants to move to a fully managed serverless solution that can orchestrate multiple ETL jobs and automate execution. The new solution mus...

Correct Answer: A) Migrate all ETL jobs to AWS Glue. Use AWS Glue workflows to orchestrate the pipeline. --- Why Option A is correct This option best satisfies all requirements together: 1. Fully managed + serverless requirement AWS Glue is a fully managed, serverless ETL service. No infrastructure provisioning or cluster management is needed. Scales automatically based on workload. 2. Spark requirement is satisfied AWS Glue uses Apache Spark under the hood for ETL jobs (Glue ETL jobs run on Spark). So the requirement to “continue to use Spark” is fully met. 3. Orchestration + automation AWS Glue Workflows allow: Chaining multiple ETL jobs Adding triggers (time-based or event-based) Managing dependencies between jobs This enables end-to-end pipeline automation with minimal manual intervention. 4. Minimal operational overhead No separate orchestration service required. No cluster management (unlike EMR or Airflow setups). --- Why other options are incorrect ❌ B) Step Functions + EventBridge + Glue + EMR While Step Functions + EventBridge is a strong orchestration combo, this option is not fully serverless in the required sense because: It still uses Amazon EMR, which requires cluster management (even if semi-managed). Complexity is higher due to multiple...

Author: FlamePhoenix2025 · Last updated Jun 25, 2026

A company runs a multi-tenant Amazon EMR cluster on Amazon EC2 instances. Multiple teams perform interactive query analyses and data transformations on the data in the EMR cluster. The teams can access the cluster only through EMR Studio workspaces and EMR steps. The teams need to use EMR steps to run Apache Spark jobs to fetch data from an Amazon DynamoDB table. The DynamoDB table contains confidential data that must be accessibl...

The correct answer is A) Set up runtime roles for EMR steps. Why A is correct (key reasoning) In a multi-tenant Amazon EMR environment where multiple teams submit jobs through EMR Studio and EMR steps, the main requirement is per-job (per-team) isolation of AWS permissions. Runtime roles for EMR steps allow each submitted step (for example, a Spark job) to assume a specific IAM role at execution time. This enables: Assigning a dedicated IAM role per team Granting that role access to only the required DynamoDB table Ensuring other teams’ steps do not inherit or reuse those permissions Achieving fine-grained, job-level security isolation So, only the designated team’s Spark job can access the confidential DynamoDB data. --- Why the other options are incorrect B) Set up AWS Lake Formation permissions Lake Formation is designed for data lakes on Amazon S3, not DynamoDB. It controls table/column-level access in S3-based analytics. It cannot enforce access control for DynamoDB tables used directly by Spark jobs. 👉 Use case: Fine-grained access control for S3 data lake analytics (Glue, Athena, EMR on S3). --- ...
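To make the runtime-role idea concrete, the sketch below builds two pieces: an IAM policy document that limits a team's role to one DynamoDB table, and the request shape for submitting a step under that role. All ARNs, the cluster ID, and the script path are hypothetical; the request dict mirrors the `JobFlowId`, `ExecutionRoleArn`, and `Steps` arguments of boto3's EMR `add_job_flow_steps`, which is an assumption about how the step would be submitted.

```python
def dynamodb_read_policy(table_arn: str) -> dict:
    """IAM policy document granting read access to a single DynamoDB table."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "dynamodb:DescribeTable",
                    "dynamodb:GetItem",
                    "dynamodb:Query",
                    "dynamodb:Scan",
                ],
                "Resource": table_arn,
            }
        ],
    }


def step_request(cluster_id: str, runtime_role_arn: str, script_uri: str) -> dict:
    """Request body for running one Spark step under a team-specific runtime role."""
    return {
        "JobFlowId": cluster_id,
        "ExecutionRoleArn": runtime_role_arn,  # the role the step assumes at run time
        "Steps": [
            {
                "Name": "fetch-confidential-data",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", script_uri],
                },
            }
        ],
    }


# Hypothetical identifiers used only for illustration.
policy = dynamodb_read_policy("arn:aws:dynamodb:us-east-1:111122223333:table/Confidential")
request = step_request(
    "j-EXAMPLE12345",
    "arn:aws:iam::111122223333:role/TeamARuntimeRole",
    "s3://example-bucket/jobs/fetch.py",
)
```

Steps submitted by other teams would carry a different `ExecutionRoleArn`, so they would not inherit this table access.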

Author: MysticJaguar44 · Last updated Jun 25, 2026

A company processes 500 GB of audience and advertising data daily, storing CSV files in Amazon S3 with schemas registered in AWS Glue Data Catalog. They need to convert these files to Apache Parquet format and store them in an S3 bucket. The solution requires a long-running workflow with 15 GiB memory capacity to process the data concurrently, followed by a cor...

Key requirements from the scenario Processes 500 GB/day of CSV data stored in Amazon S3 Schema managed in AWS Glue Data Catalog Convert CSV → Apache Parquet and store back in S3 Workflow is long-running Needs ~15 GiB memory capacity for concurrent processing First two processes run in parallel Third (correlation) process starts only after both finish Goal: least operational overhead --- ✅ Correct Option Analysis: C C) Use AWS Glue workflows to run the first two processes in parallel. Ensure that the third process starts after the first two processes have finished. This is the best fit because: AWS Glue is designed specifically for schema-aware ETL from S3 using Glue Data Catalog Supports native orchestration via Glue Workflows Allows parallel job execution + dependency chaining Uses serverless Spark, so no cluster management Can scale worker types (e.g., G.2X) to meet ~15 GiB+ memory requirements Minimal operational overhead (no servers, no orchestration service needed) Why it fits best: Native integration: S3 + Glue Catalog + ETL = built-in pattern Supports DAG-like workflows (parallel + sequential dependencies) Fully managed → lowest operational burden --- ❌ Why other options are rejected A) Amazon MWAA + AWS Glue Amazon Managed Workflows for Apache Airflow (MWAA) adds significant operational overhead You must manage: Airflow DAGs Environment scaling Scheduler health Over...
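The "two in parallel, then a third" dependency can be sketched as two Glue triggers inside one workflow. The job and workflow names are hypothetical, and each dict is shaped like the keyword arguments of boto3's Glue `create_trigger`; this is one possible layout, not necessarily how the original solution was built.

```python
def workflow_triggers(workflow: str, parallel_jobs: list, final_job: str) -> list:
    """Return trigger definitions: start jobs in parallel, then run a final job."""
    start_trigger = {
        "Name": f"{workflow}-start",
        "WorkflowName": workflow,
        "Type": "ON_DEMAND",
        # Every job listed here starts at the same time.
        "Actions": [{"JobName": job} for job in parallel_jobs],
    }
    join_trigger = {
        "Name": f"{workflow}-after-parallel",
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        "Predicate": {
            # AND means the final job waits until all listed jobs have succeeded.
            "Logical": "AND",
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": job, "State": "SUCCEEDED"}
                for job in parallel_jobs
            ],
        },
        "Actions": [{"JobName": final_job}],
    }
    return [start_trigger, join_trigger]


# Hypothetical job names used only for illustration.
triggers = workflow_triggers(
    "csv-to-parquet", ["convert-audience", "convert-advertising"], "correlate"
)
```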

Author: Michael · Last updated Jun 25, 2026

A company needs to aggregate and filter a large amount of streaming data in real-time with low latency. The company needs to store the data in Amazon S3 for analysis. Which s...

Key requirement breakdown The question emphasizes: Large streaming data Real-time aggregation and filtering Low latency Storage in Amazon S3 Most operationally efficient solution The critical keyword is aggregation in real time, which typically implies stateful stream processing (windowing, joins, continuous aggregations) rather than simple record-level transformations. --- Option Analysis A) Kinesis Data Streams + AWS Lambda + S3 What it does well: Kinesis Data Streams provides durable real-time ingestion. Lambda can process records and push to S3. Why it’s not ideal: Lambda is stateless and record/batch oriented, not designed for complex streaming aggregation (like sliding/tumbling windows). Requires custom code + scaling + shard management awareness, increasing operational overhead. Harder to maintain for high-throughput real-time analytics. When it is used: Simple filtering/transformation per event Lightweight enrichment or routing logic --- B) Amazon Data Firehose (no Lambda) → S3 What it does well: Fully managed ingestion and delivery to S3. Very operationally efficient (no infrastructure management). Why it’s not ideal: Firehose supports only basic transformations (format conversion, simple record transformation via Lambda optional). It does not support real-time aggregation or stateful processing. Cannot handle window-based analytics. When it is used: Straightforward streaming ingestion into S3/Redshift/OpenS...
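To show what "stateful, window-based aggregation" means in practice, here is a small pure-Python sketch of a tumbling-window sum with a filter. It is a conceptual illustration only, with made-up event data, and does not represent the API of any AWS streaming service.

```python
from collections import defaultdict


def tumbling_window_sums(events, window_seconds, min_value=0.0):
    """Sum values per (window start, key), dropping values below min_value.

    events is an iterable of (timestamp_seconds, key, value) tuples.
    """
    totals = defaultdict(float)
    for timestamp, key, value in events:
        if value < min_value:
            continue  # filtering step
        # Each event belongs to exactly one fixed-size, non-overlapping window.
        window_start = int(timestamp // window_seconds) * window_seconds
        totals[(window_start, key)] += value  # state carried across events
    return dict(totals)


events = [(0, "a", 1.0), (30, "a", 2.0), (61, "a", 4.0), (65, "b", -1.0), (70, "b", 3.0)]
window_totals = tumbling_window_sums(events, window_seconds=60)
```

The running `totals` dictionary is the state that a stateless, per-record function would have to store somewhere else, which is why record-level processing is a poor fit for this requirement.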

Author: SolarFalcon11 · Last updated Jun 25, 2026

A data engineer is building a solution to detect sensitive information that is stored in a data lake across multiple Amazon S3 buckets. The solution must detect personally identifiable information (PII) that is in a proprietar...

The correct answer is B) Use Amazon Macie with managed data identifiers (the option text says “Amazon Made,” which is a typo and refers to Amazon Macie). Why Option B is correct (least operational overhead) Amazon Macie is a fully managed data security and data privacy service that automatically discovers, classifies, and protects sensitive data stored in Amazon S3. It is the lowest operational overhead solution because: It is serverless and fully managed (no infrastructure, no pipelines to maintain) It can scan multiple S3 buckets continuously or on demand It uses managed data identifiers for built-in PII detection It supports custom data identifiers, which is critical for detecting PII in proprietary data formats It integrates natively with S3, making it ideal for data lake-wide scanning This makes it the best fit for enterprise-scale PII detection with minimal engineering effort. --- Why the other options are incorrect A) AWS Glue Detect PII transform AWS Glue provides a Detect PII transform, but: It is part of an ETL pipeline, meaning you must build, schedule, and maintain Glue jobs Requires data processing workflows for each dataset Higher operational overhead compared to Macie Better suited for transforming or cleansing data during ETL, n...

Author: Ishaan · Last updated Jun 25, 2026

A company has a data processing pipeline that runs multiple SQL queries in sequence against an Amazon Redshift cluster. The company merges with a second company. The original company modifies a query that aggregates sales revenue data to join sales tables from both companies. The sales table for the first company is named Table S1. The sales table for the second company is named Table S2. Table S1 contains 10 billion records. Table S2 co...

The performance issue is almost certainly caused by data movement during the join after combining two very large fact tables in Amazon Redshift. With 10 billion rows (S1) and 900 million rows (S2), an inefficient distribution style or poor join key selection will force Redshift to redistribute massive amounts of data across nodes, which severely degrades query performance. Let’s evaluate each option using key Redshift performance principles: data distribution, join optimization, and skew avoidance. --- ✅ Correct Answer: B and E --- ✅ B) Use the KEY distribution style for both sales tables. Select a high cardinality column to use for the join. Why this works: KEY distribution ensures rows with the same join key are stored on the same compute node. Using a high cardinality join column (e.g., transaction_id, order_id, unique customer-order mapping) ensures: Even distribution of data across nodes Minimal data skew Reduced or eliminated data shuffling during joins Why this is critical here: The query joins S1 and S2 on sales-related attributes If both tables share a well-chosen join key and are distributed on it, Redshift can perform a local join on each node, which is the fastest possible execution path. When this is used: Large fact-to-fact joins Frequent joins on a consistent key across large datasets --- ✅ E) Use Amazon Redshift Advisor to review and select optimizations to implement. Why this works: Amazon Redshift Advisor analyzes: Data distribution skew Missing sort keys Inefficient distribution styles Query patterns and join performance It can recommend: Switching t...
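The effect of distributing both tables on the join key can be illustrated with a toy simulation. This sketch assumes a simple CRC32 hash as a stand-in for Redshift's internal distribution function, and the table contents are invented; it only demonstrates why matching rows end up on the same slice.

```python
import zlib
from collections import defaultdict


def distribute_by_key(rows, key, slices):
    """Assign each row to a slice based on a hash of its distribution key."""
    placement = defaultdict(list)
    for row in rows:
        slice_id = zlib.crc32(str(row[key]).encode()) % slices
        placement[slice_id].append(row)
    return placement


def colocated_join(left, right, key, slices):
    """Join two tables slice by slice, never comparing rows across slices."""
    left_slices = distribute_by_key(left, key, slices)
    right_slices = distribute_by_key(right, key, slices)
    joined = []
    for slice_id in range(slices):
        index = defaultdict(list)
        for row in right_slices.get(slice_id, []):
            index[row[key]].append(row)
        for row in left_slices.get(slice_id, []):
            for match in index.get(row[key], []):
                joined.append({**row, **match})
    return joined


# Invented sample data standing in for Table S1 and Table S2.
s1 = [{"order_id": i, "s1_amount": i * 10} for i in range(20)]
s2 = [{"order_id": i, "s2_amount": i} for i in range(0, 20, 2)]
joined_rows = colocated_join(s1, s2, "order_id", slices=4)
```

Because equal keys hash to the same slice, the per-slice join finds every match without moving rows between slices, which mirrors the local join described for option B.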

Author: Aria · Last updated Jun 25, 2026

A company needs to implement a workflow to process transactions. Each transaction goes through multiple levels of validation. Each validation level depends on the preceding validation level. The workflow must either process or reject each transaction within 24 hours. The workfl...

Requirements breakdown (key exam signals) Multi-step workflow with dependencies → needs orchestration (state management) Must complete within 24 hours → long-running workflow support required Each transaction processed or rejected within 24 hours Least operational cost → prefer fully managed, minimal infrastructure, minimal orchestration overhead --- Option A: Standard AWS Step Functions + Wait for Callback ✅ Why it works: AWS Step Functions (Standard workflows) support long-running executions (up to 1 year) → easily satisfies the 24-hour requirement. “Wait for Callback” pattern allows each validation step to pause execution efficiently until external validation completes. Built-in state management handles dependencies between validation levels cleanly. Fully managed → low operational overhead compared to DIY orchestration. Cost perspective: Standard Step Functions charge per state transition, but: No servers to manage No custom retry/state logic Efficient for moderate-volume transactional workflows Best use case: Business workflows with: Long waits (human/external systems) Strict step dependencies Auditability requirements --- Option B: Express Step Functions + Wait for Callback ❌ Why it fails: Express workflows have a maximum execution duration of 5 minutes, so they cannot cover a 24-hour window. They also do not support the “Wait for Callback” (`.waitForTaskToken`) integration pattern. Even though Express is cheaper per execution, it does not meet the duration requireme...
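The chained "Wait for Callback" steps in option A can be sketched as an Amazon States Language definition built in Python. The level names and the Lambda function ARN are hypothetical, and using a single Lambda function for every level is a simplifying assumption; the definition would be supplied when creating a Standard state machine.

```python
import json


def validation_state_machine(levels: list, function_arn: str) -> dict:
    """Build a definition where each validation level waits for a callback."""
    states = {}
    for position, level in enumerate(levels):
        state = {
            "Type": "Task",
            # The .waitForTaskToken suffix pauses the state until a callback arrives.
            "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
            "Parameters": {
                "FunctionName": function_arn,
                "Payload": {
                    "level": level,
                    "transaction.$": "$",
                    "taskToken.$": "$$.Task.Token",
                },
            },
        }
        if position + 1 < len(levels):
            state["Next"] = levels[position + 1]  # each level depends on the previous one
        else:
            state["End"] = True
        states[level] = state
    return {
        "StartAt": levels[0],
        "TimeoutSeconds": 24 * 60 * 60,  # fail the execution if it exceeds 24 hours
        "States": states,
    }


# Hypothetical level names and function ARN used only for illustration.
definition = validation_state_machine(
    ["level1", "level2", "level3"],
    "arn:aws:lambda:us-east-1:111122223333:function:validate",
)
definition_json = json.dumps(definition)
```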

Author: Maya · Last updated Jun 25, 2026

A company stores information about its subscribers in an Amazon S3 bucket. The company runs an analysis every time a subscriber ends their subscription. The company uses AWS Lambda functions to respond to events from the S3 bucket by performing analyses. The Lambda functions clean data from the S3 bucket and initiate an AWS Glue workflow. The Lambda functions have 128 MB of memory and 512 MB of ephemeral storage. The Lambda functions have a timeout of 15 seconds. All three functions successfully finish running. ...

The key issue here is Lambda CPU saturation (~100%) causing slow execution, which directly indicates the function is resource-constrained (especially memory/CPU allocation) rather than being blocked, retrying, or timing out. In AWS Lambda, CPU is allocated proportionally to memory. So increasing memory is the primary way to increase CPU performance. --- ✅ Correct Option: A) Increase the memory of the Lambda functions to 512 MB Why this is correct In AWS Lambda, the CPU power available to a function scales with its configured memory. The function is already completing successfully, but CPU is pegged at 100%, meaning it is CPU-bound. Increasing memory from 128 MB → 512 MB will: Increase available CPU power Reduce execution time Lower overall pipeline latency (important since this triggers an AWS Glue workflow) This directly improves performance without changing architecture. When this option is used: CPU-bound Lambda workloads (data processing, compression, transformation) High execution time with stable success rate Heavy computation inside Lambda (like S3 data cleaning) --- ❌ Option B) Increase the number of retries (Maximum Retry Attempts) Why this is incorrect Retries only affect failure handling, not performance. The problem states all functions complete successfully. Increasing retries wou...
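The memory-to-CPU relationship can be put into rough numbers. The sketch below assumes a simple linear model anchored on the documented point that a function has the equivalent of one vCPU at 1,769 MB; real speedups depend on the workload, so the figures are estimates rather than guarantees.

```python
ONE_VCPU_MEMORY_MB = 1769  # memory setting at which a function gets about one vCPU


def approx_vcpu_share(memory_mb: int) -> float:
    """Estimate the fraction of a vCPU available at a given memory setting."""
    return memory_mb / ONE_VCPU_MEMORY_MB


before = approx_vcpu_share(128)   # roughly 7% of a vCPU
after = approx_vcpu_share(512)    # roughly 29% of a vCPU
cpu_ratio = after / before        # about 4x more CPU under the linear assumption
```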

Author: Kai99 · Last updated Jun 25, 2026

A company uses an organization in AWS Organizations to manage multiple AWS accounts. The company uses an enhanced fan-out data stream in Amazon Kinesis Data Streams to receive streaming data from multiple producers. The company runs the data stream in an account named Account A. The company wants to use an AWS Lambda function in an account named Account B to process the data from the data stream. The company creates ...

This scenario is about cross-account access to an Amazon Kinesis Data Streams stream from an AWS Lambda function in another account, using an enhanced fan-out consumer. Key idea In AWS, permissions to read a Kinesis data stream from another account are granted using a resource-based policy on the stream, not through SCPs or a resource-based policy on the Lambda function. --- Option Analysis ❌ A) Use an SCP in Account A A Service Control Policy (SCP): Only sets permission boundaries in AWS Organizations It does NOT grant permissions Cannot be used to explicitly allow cross-account stream access 👉 SCPs are used for guardrails (deny/limit), not access grants. --- ❌ C) Use an SCP in Account B Same issue as A, plus incorrect direction: Account B is the consumer account (Lambda side) SCP still cannot grant access Even if attached to Account B, it still won’t enable Kinesis read access 👉 SCPs are irrelevant for granting resource-level permissions. --- ❌ D) Add a resource-based policy to the Lambda function This is incorrect because: The Lambda function is not the resource being accessed The access is from Lambda → Kinesis stream Permissions must be attached to the Kinesis stream, n...
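A resource-based policy on the stream might look like the sketch below. The account IDs, stream ARN, and role name are hypothetical, and the action list is an assumption based on what a Lambda event source mapping typically needs to read a stream; an enhanced fan-out consumer would likely need a similar policy that allows `kinesis:SubscribeToShard` on the consumer itself.

```python
import json


def stream_read_policy(stream_arn: str, consumer_role_arn: str) -> dict:
    """Policy for Account A's stream that lets a role in Account B read from it."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCrossAccountRead",
                "Effect": "Allow",
                # Execution role of the Lambda function in Account B (hypothetical).
                "Principal": {"AWS": consumer_role_arn},
                "Action": [
                    "kinesis:DescribeStreamSummary",
                    "kinesis:GetRecords",
                    "kinesis:GetShardIterator",
                    "kinesis:ListShards",
                ],
                "Resource": stream_arn,
            }
        ],
    }


# Hypothetical ARNs: 111111111111 stands for Account A, 222222222222 for Account B.
stream_policy = stream_read_policy(
    "arn:aws:kinesis:us-east-1:111111111111:stream/example-stream",
    "arn:aws:iam::222222222222:role/example-lambda-role",
)
stream_policy_json = json.dumps(stream_policy)
```

The Lambda execution role in Account B still needs matching identity-based permissions; the stream policy only covers the resource side of the cross-account grant.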

Author: Lina Zhang · Last updated Jun 25, 2026

A company has a data pipeline that uses an Amazon RDS instance, AWS Glue jobs, and an Amazon S3 bucket. The RDS instance and AWS Glue jobs run in a private subnet of a VPC and in the same security group. A user made a change to the security group that prevents the AWS Glue jobs from connecting to the RDS instance. After the change, the security group contains a single r...

Key idea (what broke and what is needed) AWS Glue jobs must reach Amazon RDS over the database port (commonly 3306 for MySQL/Aurora, 5432 for PostgreSQL, etc.). After the change, the security group only allows SSH inbound from a specific IP, so all application traffic between Glue and RDS is blocked. In AWS security groups: They are stateful (return traffic is automatically allowed). Best practice is to allow access using security group references (SG-to-SG), not IPs. For private subnet resources like Glue and RDS, communication should use private networking + SG rules, not SSH rules. --- Option analysis ❌ B) UDP + RDS private IP as source Wrong protocol: RDS does not use UDP for database connections. Wrong design: Using an instance’s private IP is brittle (can change with replacement/failover). SG rules typically prefer security group references, not fixed instance IPs. Result: Will not restore connectivity. --- ❌ C) TCP + DNS name of RDS as source Security groups do not accept DNS names as sources. Only supports: IP ranges, prefix lists, or security groups. DNS is resolved dynamically and is not valid in SG rules. Result: Invalid configuration. --- ...
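The security-group-reference approach described above can be sketched as a self-referencing inbound rule. The group ID is hypothetical and the all-TCP port range is an assumption (a narrower range limited to the database port is also possible); the dict follows the shape of the arguments to boto3's EC2 `authorize_security_group_ingress`.

```python
def self_referencing_ingress(group_id: str, from_port: int = 0, to_port: int = 65535) -> dict:
    """Inbound rule that lets members of a security group reach each other over TCP."""
    if not 0 <= from_port <= to_port <= 65535:
        raise ValueError("invalid port range")
    return {
        "GroupId": group_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": from_port,
                "ToPort": to_port,
                # The source is the security group itself, not an IP address or DNS name.
                "UserIdGroupPairs": [{"GroupId": group_id}],
            }
        ],
    }


# Hypothetical security group ID shared by the Glue jobs and the RDS instance.
ingress_rule = self_referencing_ingress("sg-0123456789abcdef0")
```

Because security groups are stateful, this single inbound rule is enough for return traffic as well.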

Author: Suresh · Last updated Jun 25, 2026

A data engineer is writing a query to join two tables in Amazon Athena. The data engineer needs to choose the correct join order for the tables to optimize...

A retail company needs to implement a solution to capture data updates from multiple Amazon Aurora MySQL databases. The company needs to make the updates available for analytics in near real time. The solution must be serverless and require ...

A company needs to store and analyze a large amount of IoT sensor data. The company needs to retain the data indefinitely. The company analyzes the data in an Amazon Redshift cl...

A retail company wants to implement real-time analytics for an ecommerce platform. The company needs to collect clickstream data from the company's website and mobile apps. The company needs to store the data for a...

A company creates a new non-production application that runs on an Amazon EC2 instance. The application needs to communicate with an Amazon RDS database instance using Java Database Connectivity (JDBC). The EC2 instances and...

An application uses an AWS Lambda function that is configured with managed runtimes. The Lambda function successfully writes logs to the default Amazon CloudWatch Logs log group. A data engineer wants to modify the logging behavior to show only ERROR level...

A company needs a solution to process streaming data by using Apache Spark in a Kubernetes environment. The solution must support event-driven scaling and optimize resource utilization. The company needs to integrate the solution with existing Kubernetes infrastructure deployed on Amazon ...

A company uses Amazon Redshift for its data warehouse. A data engineer must query a table named orders.complete_orders_history, which contains 100 columns. The query must return all columns except columns named compa...

Two data engineering teams use separate AWS accounts. Both teams request access to the same datashare in an Amazon Redshift cluster that is in a third AWS account. The datashare is named salesshare. A data engineer must use the Amazon Redshift SQL interface to gr...

A company needs to use Amazon Athena to analyze data that is in an Amazon S3 bucket. A data engineer needs to configure AWS Glue table partitions for year, month, and day. The data engineer needs to create the partitions ev...

A gaming company uses AWS Glue to perform read and write operations on Apache Iceberg tables for real-time streaming data. The data in the Iceberg tables is in Apache Parquet format. The company is experiencin...

A data engineer needs to deploy a complex pipeline. The stages of the pipeline must be able to run a script. The data engineer must use only fully managed and serverless ...

A data engineer is designing a log table for an application that requires continuous ingestion. The application must provide dependable API-based access to specific records from other applications. The application must handle more than 4,000 concurr...

A company is developing machine learning (ML) models. A data engineer needs to apply data quality rules to training data. The company stores the training data in an Amazon S3 bucket. ...
