object --+
         |
rdd.RDD --+
           |
          SchemaRDD
An RDD of Row objects that has an associated schema.
The underlying JVM object is a SchemaRDD, not a PythonRDD, so we can utilize the relational query API exposed by Spark SQL.

For normal pyspark.rdd.RDD operations (map, count, etc.) the SchemaRDD is not operated on directly, as its underlying implementation is an RDD composed of Java objects. Instead it is converted to a PythonRDD in the JVM, on which Python operations can be done.
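For illustration, a minimal sketch of the two paths (assuming the sqlCtx, sc, and three-row rdd used in the examples below; the variable names are only illustrative):

>>> srdd = sqlCtx.inferSchema(rdd)            # backed by a JVM SchemaRDD
>>> srdd.registerAsTable("test")
>>> srdd2 = sqlCtx.sql("select * from test")  # relational query, evaluated in the JVM
>>> n = srdd.map(lambda row: row).count()     # plain RDD operation, via a PythonRDD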
Instance Methods

Inherited from rdd.RDD, object.
Properties

Inherited from rdd.RDD.
Method Details
__init__(...)

x.__init__(...) initializes x; see help(type(x)) for signature
saveAsParquetFile(path)

Save the contents as a Parquet file, preserving the schema. Files that are written out using this method can be read back in as a SchemaRDD using the SQLContext.parquetFile method.

>>> import tempfile, shutil
>>> parquetFile = tempfile.mkdtemp()
>>> shutil.rmtree(parquetFile)
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.saveAsParquetFile(parquetFile)
>>> srdd2 = sqlCtx.parquetFile(parquetFile)
>>> srdd2.collect() == srdd.collect()
True
registerAsTable(name)

Registers this RDD as a temporary table using the given name. The lifetime of this temporary table is tied to the SQLContext that was used to create this SchemaRDD.

>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.registerAsTable("test")
>>> srdd2 = sqlCtx.sql("select * from test")
>>> srdd.collect() == srdd2.collect()
True
insertInto(tableName, overwrite=False)

Inserts the contents of this SchemaRDD into the specified table, optionally overwriting any existing data.
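There is no example above for insertInto; the following is a hedged sketch, assuming a Parquet-backed table (one kind of table that supports insertion), created as in the saveAsParquetFile example with parquetFile being a fresh temporary path, and "pfile" being an illustrative table name:

>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.saveAsParquetFile(parquetFile)
>>> sqlCtx.parquetFile(parquetFile).registerAsTable("pfile")
>>> srdd.insertInto("pfile")                  # append this SchemaRDD's rows
>>> srdd.insertInto("pfile", overwrite=True)  # or replace the existing data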
count()

Return the number of elements in this RDD. Unlike the base RDD implementation of count, this implementation leverages the query optimizer to compute the count on the SchemaRDD, which supports features such as filter pushdown.

>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.count()
3L
>>> srdd.count() == srdd.map(lambda x: x).count()
True
cache()

Persist this RDD with the default storage level (MEMORY_ONLY_SER).
persist(storageLevel)

Set this RDD's storage level to persist its values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet.
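As a sketch, a non-default level can be chosen via pyspark's StorageLevel (persist returns the RDD itself, so the assignment below is optional):

>>> from pyspark import StorageLevel
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd = srdd.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk rather than recompute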
unpersist()

Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
checkpoint()

Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext.setCheckpointDir() and all references to its parent RDDs will be removed.
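A sketch of the checkpointing flow, assuming the same sc, sqlCtx, and three-row rdd as the other examples (a checkpoint directory must be set first, and the save is triggered by the first action):

>>> import tempfile
>>> sc.setCheckpointDir(tempfile.mkdtemp())
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.checkpoint()       # must be called before any job runs on this RDD
>>> srdd.count()            # first action materializes the checkpoint
3L
>>> srdd.isCheckpointed()
True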
isCheckpointed()

Return whether this RDD has been checkpointed or not.
getCheckpointFile()

Gets the name of the file to which this RDD was checkpointed.
coalesce(numPartitions, shuffle=False)

Return a new RDD that is reduced into `numPartitions` partitions.

>>> sc.parallelize([1, 2, 3, 4, 5], 3).glom().collect()
[[1], [2, 3], [4, 5]]
>>> sc.parallelize([1, 2, 3, 4, 5], 3).coalesce(1).glom().collect()
[[1, 2, 3, 4, 5]]
distinct()

Return a new RDD containing the distinct elements in this RDD.

>>> sorted(sc.parallelize([1, 1, 2, 3]).distinct().collect())
[1, 2, 3]
intersection(other)

Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did. Note that this method performs a shuffle internally.

>>> rdd1 = sc.parallelize([1, 10, 2, 3, 4, 5])
>>> rdd2 = sc.parallelize([1, 6, 2, 3, 7, 8])
>>> rdd1.intersection(rdd2).collect()
[1, 2, 3]
repartition(numPartitions)

Return a new RDD that has exactly numPartitions partitions. Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using `coalesce`, which can avoid performing a shuffle.

>>> rdd = sc.parallelize([1,2,3,4,5,6,7], 4)
>>> sorted(rdd.glom().collect())
[[1], [2, 3], [4, 5], [6, 7]]
>>> len(rdd.repartition(2).glom().collect())
2
>>> len(rdd.repartition(10).glom().collect())
10
subtract(other, numPartitions=None)

Return each value in self that is not contained in other.

>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtract(y).collect())
[('a', 1), ('b', 4), ('b', 5)]