pyspark.SparkContext.runJob

SparkContext.runJob(rdd: pyspark.rdd.RDD[T], partitionFunc: Callable[[Iterable[T]], Iterable[U]], partitions: Optional[Sequence[int]] = None, allowLocal: bool = False) → List[U]

Executes the given partitionFunc on each partition in the specified set and returns the combined results as a single list of elements.

If ‘partitions’ is not specified, the job runs over all partitions of rdd.

Examples

>>> myRDD = sc.parallelize(range(6), 3)
>>> sc.runJob(myRDD, lambda part: [x * x for x in part])
[0, 1, 4, 9, 16, 25]
>>> myRDD = sc.parallelize(range(6), 3)
>>> sc.runJob(myRDD, lambda part: [x * x for x in part], [0, 2], True)
[0, 1, 16, 25]
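
The behaviour shown in the examples above can be approximated with a small pure-Python model. This is only an illustrative sketch, not Spark's implementation: `split_evenly` and `run_job_model` are hypothetical names, the even contiguous split is an assumption that happens to match the layout of `sc.parallelize(range(6), 3)` used above, and no cluster or SparkContext is involved.

```python
from itertools import chain
from typing import Callable, Iterable, List, Optional, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def split_evenly(data: Sequence[T], num_slices: int) -> List[List[T]]:
    """Hypothetical helper: cut data into num_slices contiguous chunks."""
    n = len(data)
    return [
        list(data[i * n // num_slices:(i + 1) * n // num_slices])
        for i in range(num_slices)
    ]


def run_job_model(
    parts: Sequence[Sequence[T]],
    partition_func: Callable[[Iterable[T]], Iterable[U]],
    partitions: Optional[Sequence[int]] = None,
) -> List[U]:
    """Hypothetical model: apply partition_func per partition, then flatten."""
    if partitions is None:
        # No selection given: run over every partition.
        partitions = range(len(parts))
    # Each selected partition is handed to partition_func as an iterator;
    # the per-partition results are concatenated in the order requested.
    return list(
        chain.from_iterable(partition_func(iter(parts[i])) for i in partitions)
    )


parts = split_evenly(range(6), 3)  # [[0, 1], [2, 3], [4, 5]]

squares_all = run_job_model(parts, lambda part: [x * x for x in part])
squares_some = run_job_model(parts, lambda part: [x * x for x in part], [0, 2])
# partition_func may return any number of elements, e.g. one sum per partition.
sums = run_job_model(parts, lambda part: [sum(part)])

print(squares_all)   # [0, 1, 4, 9, 16, 25]
print(squares_some)  # [0, 1, 16, 25]
print(sums)          # [1, 5, 9]
```

Because partitionFunc receives a whole partition as an iterable and returns an iterable, the length of the result need not match the number of input elements, as the per-partition sum illustrates.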