Python - `pyspark mllib` versus `pyspark ml` packages


What is the difference between the pyspark mllib and pyspark ml packages?

https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html

https://spark.apache.org/docs/latest/api/python/pyspark.ml.html

pyspark ml appears to target algorithms at the DataFrame level, whereas pyspark mllib does not.

One difference I found is that pyspark ml implements pyspark.ml.tuning.CrossValidator, while pyspark mllib does not.

My understanding was that mllib is the library to use when implementing algorithms on the Apache Spark framework, but there appears to be a split?

There does not appear to be any interoperability between the two frameworks without transforming types, as each uses a different package structure.

From my experience, pyspark.mllib classes can only be used with pyspark.RDD's, whereas (as you mention) pyspark.ml classes can only be used with pyspark.sql.DataFrame's. There is mention of this in the documentation for pyspark.ml; the first entry in the pyspark.ml package states:

DataFrame-based machine learning APIs to let users quickly assemble and configure practical machine learning pipelines.

Now, I am reminded of an article I read a while back regarding the three APIs available in Spark 2.0, their relative benefits and drawbacks, and their comparative performance: A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets. I was in the midst of doing performance testing on new client servers and was interested in whether there would ever be a scenario in which it would be worth developing an RDD-based approach as opposed to a DataFrame-based approach (my approach of choice), but I digress.

The gist was that there are situations in which each is highly suited and others in which it might not be. One example I remember is that if the data is already structured, DataFrames confer some performance benefits over RDDs, and this apparently becomes drastic as the complexity of your operations increases. Another observation was that Datasets and DataFrames consume far less memory when cached than RDDs. In summation, the author concluded that for low-level operations RDDs are great, but for high-level operations, viewing, and tying in with other APIs, DataFrames and Datasets are superior.

So, to come back full circle to your question, I believe the answer is a resounding pyspark.ml, as the classes in this package are designed to utilize pyspark.sql.DataFrames. I would imagine that the performance difference for complex algorithms implemented in each of these packages would be significant if you were to test them against the same data structured as a DataFrame vs. an RDD. Furthermore, viewing the data and developing compelling visuals would be both more intuitive and better performing.

