Spark SQL: Two Ways to Convert an RDD to a DataFrame

  1. Infer the schema using reflection
  2. Specify the schema programmatically

Spark supports two ways to convert an RDD to a DataFrame: reflection-based schema inference, which gives concise code when the column names and types are known up front, and programmatic schema specification, which constructs the schema at runtime when the columns are not known until then.

Inferring the Schema Using Reflection

from pyspark.sql import SparkSession, Row

spark = SparkSession \
    .builder \
    .appName("reflect rdd to dataFrame") \
    .getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize([
    "Michael, 29",
    "Andy, 30",
    "Justin, 19",
])

parts = lines.map(lambda l: l.split(","))
# Each Row carries field names and typed values, from which Spark infers the schema
people = parts.map(lambda p: Row(name=p[0], age=int(p[1].strip())))

schemaPeople = spark.createDataFrame(people)

# The DataFrame must be registered as a temporary view before it can be queried with SQL
schemaPeople.createOrReplaceTempView("people")

# spark.sql returns a DataFrame
teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

teenagers.show()
# +------+
# |  name|
# +------+
# |Justin|
# +------+

# Convert the DataFrame back to an RDD
teenNames = teenagers.rdd.map(lambda p: "Name: " + p.name).collect()
for name in teenNames:
    print(name)
# Name: Justin
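Because the schema was inferred from the Row objects, Spark assigns real column types (a Python int becomes a long) rather than treating everything as a string. A quick way to verify this is printSchema(); a minimal sketch (note that in Spark 2.x, Row sorts keyword arguments alphabetically, so the field order printed here may differ in other versions):

schemaPeople.printSchema()
# root
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)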

Specifying the Schema Programmatically

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType

spark = SparkSession \
    .builder \
    .appName("coding_rdd") \
    .getOrCreate()

sc = spark.sparkContext

lines = sc.parallelize([
    "Michael, 29",
    "Andy, 30",
    "Justin, 19",
])

people = lines.map(lambda l: tuple(l.split(",")))

# Define the schema: one StructField per space-separated field name
schemaString = "name age"

fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
schemaPeople = spark.createDataFrame(people, schema)

# The DataFrame must be registered as a temporary view before it can be queried with SQL
schemaPeople.createOrReplaceTempView("people")
results = spark.sql("SELECT * FROM people")
results.show()
# +-------+---+
# |   name|age|
# +-------+---+
# |Michael| 29|
# |   Andy| 30|
# | Justin| 19|
# +-------+---+
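Because every field above is declared as StringType, the age column here is a string, not a number. A programmatic schema can mix field types as long as the RDD rows are converted to match. A minimal sketch (the typedSchema / typedRows / typedDF names are illustrative, not from the original post):

from pyspark.sql.types import IntegerType

# Declare name as a string and age as an integer
typedSchema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

# Strip whitespace and cast age to int so the rows match the declared types
typedRows = lines.map(lambda l: l.split(",")).map(lambda p: (p[0], int(p[1].strip())))
typedDF = spark.createDataFrame(typedRows, typedSchema)
typedDF.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- age: integer (nullable = true)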
