Spark SQL: Two Ways to Convert an RDD to a DataFrame
Spark SQL supports two ways to convert an RDD to a DataFrame: inferring the schema via reflection, and specifying the schema programmatically. Reflection-based inference leads to more concise code and works well when the schema is already known while writing the application; the programmatic approach is needed when the columns and their types are only known at runtime.
Inferring the Schema Using Reflection
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import Row

spark = SparkSession \
    .builder \
    .appName("reflect rdd to dataFrame") \
    .getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize([
    "Michael, 29",
    "Andy, 30",
    "Justin, 19",
])

parts = lines.map(lambda l: l.split(","))
# Reflection infers the schema from the Row objects' field names and value types
people = parts.map(lambda p: Row(name=p[0], age=int(p[1].strip())))

schemaPeople = spark.createDataFrame(people)

# Must be registered as a temporary view before it can be queried below
schemaPeople.createOrReplaceTempView("people")

# spark.sql returns a DataFrame
teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

teenagers.show()
# +------+
# |  name|
# +------+
# |Justin|
# +------+

# Convert the DataFrame back to an RDD
teenNames = teenagers.rdd.map(lambda p: "Name: " + p.name).collect()
for name in teenNames:
    print(name)
# Name: Justin
```
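Because the schema is inferred by reflection over the `Row` objects, the columns get real Spark types rather than plain strings. A minimal check (not part of the original post, and assuming the code above has already run) is to print the inferred schema:

```python
# Assumes the SparkSession and schemaPeople from the example above.
# Reflection should infer `age` as a long and `name` as a string.
schemaPeople.printSchema()
# root
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)
# (The field order may differ by Spark/Python version: Row fields created from
#  keyword arguments were sorted alphabetically in older releases.)
```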
Specifying the Schema Programmatically
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType

spark = SparkSession \
    .builder \
    .appName("coding_rdd") \
    .getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize([
    "Michael, 29",
    "Andy, 30",
    "Justin, 19",
])

# Each record becomes a tuple of strings
people = lines.map(lambda l: tuple(l.split(",")))

# Define the schema: every field is a nullable string here
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
schemaPeople = spark.createDataFrame(people, schema)

# Must be registered as a temporary view before it can be queried below
schemaPeople.createOrReplaceTempView("people")
results = spark.sql("SELECT * FROM people")
results.show()
# +-------+---+
# |   name|age|
# +-------+---+
# |Michael| 29|
# |   Andy| 30|
# | Justin| 19|
# +-------+---+
```
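Note that with `schemaString = "name age"` every column, including `age`, ends up as a string. If typed columns are wanted, one option is to declare the types in the `StructType` and convert the RDD records to match. The sketch below is an illustrative variation, not part of the original post; it reuses `lines`, `spark`, and the imports from the example above.

```python
# Hypothetical variation: a typed schema instead of all-string columns.
from pyspark.sql.types import IntegerType

typed_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

# The RDD records must now match the declared types (int for age)
typed_people = lines.map(lambda l: l.split(",")) \
                    .map(lambda p: (p[0], int(p[1].strip())))

typed_df = spark.createDataFrame(typed_people, typed_schema)
typed_df.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- age: integer (nullable = true)
```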