RDD.sample(withReplacement, fraction, seed=None)
Return a sampled subset of this RDD.
New in version 0.7.0.
Parameters
withReplacement : bool
whether elements can be sampled multiple times (replaced when sampled out)
fraction : float
expected size of the sample as a fraction of this RDD’s size.
Without replacement: probability that each element is chosen; fraction must be in [0, 1].
With replacement: expected number of times each element is chosen; fraction must be >= 0.
seed : int, optional
seed for the random number generator
Returns
RDD
a new RDD containing a sampled subset of elements
See also
RDD.takeSample()
RDD.sampleByKey()
pyspark.sql.DataFrame.sample()
Notes
This is not guaranteed to return exactly the specified fraction of the total count of this RDD.
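The variation in sample size can be illustrated without a cluster. The sketch below is plain Python, not Spark's implementation: `bernoulli_sample` is a hypothetical helper that keeps each element independently with probability `fraction`, which matches the behavior documented for sampling without replacement. Under that assumption the sample size fluctuates around `fraction * n` from seed to seed.

```python
import random


def bernoulli_sample(items, fraction, seed=None):
    """Keep each item independently with probability `fraction`.

    Hypothetical helper for illustration only; it mimics the documented
    semantics of sampling without replacement, not Spark's internals.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1] without replacement")
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]


# Sample sizes for several seeds: they cluster around 0.1 * 1000 = 100
# but are generally not exactly 100.
sizes = [len(bernoulli_sample(range(1000), 0.1, seed=s)) for s in range(5)]
print(sizes)
```

When an exact sample size is required, RDD.takeSample() (listed under See also) takes a target number of elements instead of a fraction and returns them as a list.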
Examples
>>> rdd = sc.parallelize(range(100), 4)
>>> 6 <= rdd.sample(False, 0.1, 81).count() <= 14
True
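With replacement, `fraction` is the expected number of times each element is chosen, so it may exceed 1 and an element may appear more than once. The following is a minimal plain-Python sketch of that semantics, assuming each element's copy count is drawn from a Poisson distribution with mean `fraction`. That distribution is an assumption made for illustration, the names `poisson_draw` and `sample_with_replacement` are hypothetical, and none of this is Spark's code.

```python
import math
import random


def poisson_draw(rng, mean):
    """Draw from a Poisson distribution using Knuth's multiplication method.

    Suitable for small means only, since exp(-mean) underflows for large ones.
    """
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_with_replacement(items, fraction, seed=None):
    """Emit each item k times, where k ~ Poisson(fraction). Illustration only."""
    if fraction < 0:
        raise ValueError("fraction must be >= 0 with replacement")
    rng = random.Random(seed)
    out = []
    for x in items:
        out.extend([x] * poisson_draw(rng, fraction))
    return out


sample = sample_with_replacement(range(1000), 2.0, seed=81)
print(len(sample))                     # close to 2.0 * 1000 on average
print(len(set(sample)) < len(sample))  # duplicates are expected
```

A `fraction` of 2.0 here yields roughly twice as many output elements as inputs, which would be impossible without replacement, where `fraction` is a probability.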