QUESTION
A product owner is working with a data science team who use Spark on AWS EMR for big data projects and have significant expertise with Spark. They have indicated a preference to work with Spark + MLLib to address the real-time predictions use case you’ve proposed.
Which of the following would you recommend as the most scalable and efficient way to proceed?
The team has significant expertise with Spark. This, together with the depth of Spark’s support for feature transformations and its high performance via in-memory caching, suggests that Spark is a better candidate for ETL than Glue.
However, although MLLib provides some useful algorithms, there are a number of reasons for not using Spark for Machine Learning.
Those are:
Decoupling ETL and Machine Learning allows them to be scaled independently.
SageMaker can train on huge data sets, so there is no need to create huge EMR clusters.
ETL and training have differing CPU/GPU memory requirements.
SageMaker supports real-time inference, which is difficult with Spark ML models.
Deep Learning libraries like TensorFlow or Apache MXNet are not available in SparkML.
Hence, “Use Spark for ETL, SageMaker to train & deploy models”.
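As a rough illustration of that split, here is a minimal Python sketch: a Spark job on EMR writes cleaned training data to S3, and SageMaker then trains and hosts the model separately. It is only one possible arrangement, not a reference implementation. The function names, paths, column names, role ARN, instance types, XGBoost version and hyperparameters are all hypothetical choices made for the example, and the built-in XGBoost algorithm is used purely for illustration.

```python
# Sketch only: every path, column name, role ARN and instance type passed in
# here is a hypothetical placeholder, not something from a real project.

def to_csv(values):
    """Join values into one headerless CSV line (the text/csv payload format)."""
    return ",".join(str(v) for v in values)


def run_spark_etl(input_path, output_path, label_col, feature_cols):
    """Runs on the EMR cluster: clean raw data and write headerless CSV to S3."""
    from pyspark.sql import SparkSession  # lazy import: only needed on the cluster

    spark = SparkSession.builder.appName("etl-for-sagemaker").getOrCreate()
    df = spark.read.parquet(input_path).dropna(subset=[label_col, *feature_cols])
    # Built-in XGBoost expects the label in the first column and no header row.
    (df.select(label_col, *feature_cols)
       .write.mode("overwrite")
       .csv(output_path, header=False))


def train_and_deploy(train_s3_uri, model_s3_uri, role_arn):
    """Runs anywhere with AWS credentials: train on SageMaker, then host an endpoint."""
    import sagemaker
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput

    session = sagemaker.Session()
    image = sagemaker.image_uris.retrieve(
        "xgboost", session.boto_region_name, version="1.7-1"
    )
    estimator = Estimator(
        image_uri=image,
        role=role_arn,
        instance_count=1,
        instance_type="ml.m5.xlarge",  # sized for training, independent of EMR
        output_path=model_s3_uri,
        sagemaker_session=session,
    )
    estimator.set_hyperparameters(objective="binary:logistic", num_round=100)
    estimator.fit({"train": TrainingInput(train_s3_uri, content_type="text/csv")})
    # Hosting is sized separately again, for real-time inference traffic.
    return estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")


def predict(endpoint_name, features):
    """Call the hosted endpoint in real time with one feature row."""
    import boto3

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=to_csv(features),
    )
    return response["Body"].read().decode("utf-8")
```

The point of the sketch is that the EMR cluster, the training instance and the hosting instance are three separate resources, so each can be sized and scaled for its own workload.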
This AWS Online Tech Talk really helps clarify the relationship between EMR Spark usage and SageMaker.
VIDEO