An office supplies company uses Amazon EMR with Apache Spark for its data transformation workloads. Due to supplier system issues currently outside its control, duplicates are appearing in the data feeds. What would be the simplest and most efficient method to remove those duplicates?

Match each real-world scenario to its corresponding statistical distribution:

A) The probability of exactly x pet owners selected at random being men.
B) The probability of catching x fish in h hours, given an average hourly catch of y.
C) The distribution of adult male heights in Germany.


A product owner is working with the development team to rapidly prototype a new image application. They’ll use AWS built-in algorithms to test feasibility. Which of the following would be valid options for image applications?

Which of the following hyperparameters are required (must be set by the user) for SageMaker’s built-in XGBoost algorithm? (Assume classification.)


Which of the following built-in SageMaker algorithms can be used for dimensionality reduction?

The Product Marketing team at a high street footwear brand want to add a new feature to their app that allows users to upload an image and have their shoes replaced in that image with the latest offering. Based on user permissions, the uploaded photo may later be shared on social media under the hashtag #upgrademytrainers.

Which of the following services would help build this use case?


A data scientist is examining a subset of data on an AWS notebook instance. The data has been loaded into a Pandas DataFrame and the correlation command df.corr() has been run.

Which of the following can be determined from the resulting correlation table below?


Which of these SageMaker built-in algorithms support the SGD, Adam and RMSProp optimisers?

You’ll likely need to be sure of precision, accuracy, recall and perhaps F1 score for the exam. You should understand the confusion matrix and the related calculations for both binary and multi-class classification.

Watch out for the matrix being drawn with either predictions or actuals on the left/top, as this can be confusing if not spotted.
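
As a quick refresher, these metrics can be computed directly from the cells of a confusion matrix. The sketch below is illustrative only: the function names and all of the counts are made up for this example, and the multi-class helper assumes rows are actuals and columns are predictions, which is exactly the orientation you need to check before calculating anything.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from the four binary cells."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def per_class_metrics(matrix):
    """Per-class precision and recall for a square multi-class matrix.

    Assumption: rows are actual classes, columns are predicted classes.
    Transpose the matrix first if it is drawn the other way round.
    """
    n = len(matrix)
    results = []
    for k in range(n):
        tp = matrix[k][k]
        actual_total = sum(matrix[k])                          # row sum
        predicted_total = sum(matrix[i][k] for i in range(n))  # column sum
        results.append({
            "precision": tp / predicted_total if predicted_total else 0.0,
            "recall": tp / actual_total if actual_total else 0.0,
        })
    return results


# Hypothetical counts, not taken from any exam question.
binary = binary_metrics(tp=40, fp=10, fn=5, tn=45)
multi = per_class_metrics([[8, 1, 1],
                           [2, 6, 2],
                           [0, 3, 7]])
```

Passing the binary cells by name rather than by position is a deliberate choice here, since it avoids mixing up false positives and false negatives when the matrix is drawn in an unfamiliar orientation.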

A data scientist is creating a virus detection model using global pandemic data. She is evaluating the latest binary classification results. Given the following product requirements, which of these models would fulfil the criteria at the lowest cost, based on the confusion matrices given?

a) The test must support claims of “at least 90% accuracy”.
b) At least 90% of virus positives must be found.
c) The cost of a false negative is to be considered 4 times that of a false positive.
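
One way to work through a question like this is to filter the candidates on the two thresholds first and then compare weighted error costs. The sketch below assumes that requirement (b) means recall of at least 0.90 and that a false positive costs one arbitrary unit; the model names and every count are hypothetical and do not correspond to the matrices in the question.

```python
def evaluate(tp, fp, fn, tn, fn_weight=4, fp_weight=1):
    """Check the accuracy/recall thresholds and compute a relative cost."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)  # share of actual positives that were found
    cost = fn_weight * fn + fp_weight * fp  # relative units, not currency
    return {
        "accuracy": accuracy,
        "recall": recall,
        "cost": cost,
        "meets_criteria": accuracy >= 0.90 and recall >= 0.90,
    }


# Hypothetical confusion-matrix counts for three made-up candidates.
candidates = {
    "model_1": dict(tp=90, fp=6, fn=4, tn=100),
    "model_2": dict(tp=95, fp=12, fn=5, tn=88),
    "model_3": dict(tp=80, fp=2, fn=20, tn=98),
}

results = {name: evaluate(**cells) for name, cells in candidates.items()}

# Keep only models that pass both thresholds, then pick the cheapest.
qualifying = {name: r for name, r in results.items() if r["meets_criteria"]}
best = min(qualifying, key=lambda name: qualifying[name]["cost"])
```

Note that the model with the fewest false positives is not necessarily the cheapest once false negatives are weighted four times as heavily, and a low-cost model is irrelevant if it fails either threshold.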

A data scientist wishes to use SageMaker notebook instances to orchestrate AWS services whilst developing and deploying new models. In particular, she wishes to control an Amazon EMR Spark instance.

What actions are needed?

