Spark SQL: Two Ways to Convert an RDD to a DataFrame
Spark supports two ways to convert an RDD to a DataFrame: inferring the schema via reflection, and programmatically specifying the schema.
Inferring the Schema Using Reflection
from pyspark.sql import SparkSession
from pyspark.sql.types import Row

spark = SparkSession \
    .builder \
    .appName("reflect rdd to dataFrame") \
    .getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize([
    "Michael, 29",
    "Andy, 30",
    "Justin, 19",
])

parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1].strip())))

schemaPeople = spark.createDataFrame(people)

# Must be registered as a temporary view before it can be queried below
schemaPeople.createOrReplaceTempView("people")

# Returns a DataFrame
teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

teenagers.show()
# +------+
# |  name|
# +------+
# |Justin|
# +------+

# Convert the DataFrame back to an RDD
teenNames = teenagers.rdd.map(lambda p: "Name: " + p.name).collect()
for name in teenNames:
    print(name)
# Name: Justin
Programmatically Specifying the Schema
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType

spark = SparkSession \
    .builder \
    .appName("coding_rdd") \
    .getOrCreate()

sc = spark.sparkContext

lines = sc.parallelize([
    "Michael, 29",
    "Andy, 30",
    "Justin, 19",
])

people = lines.map(lambda l: tuple(l.split(",")))

# Define the schema: every field is a nullable string
schemaString = "name age"

fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
schemaPeople = spark.createDataFrame(people, schema)

# Must be registered as a temporary view before it can be queried below
schemaPeople.createOrReplaceTempView("people")
results = spark.sql("SELECT * FROM people")
results.show()
# +-------+---+
# |   name|age|
# +-------+---+
# |Michael| 29|
# |   Andy| 30|
# | Justin| 19|
# +-------+---+
 
   Article title: Spark SQL: Two Ways to Convert an RDD to a DataFrame
   Word count: 304
   Author: Waterandair
   Published: 2018-06-19, 11:20:47
   Last updated: 2019-12-28, 14:03:59
   Original link: https://waterandair.github.io/2018-06-19-spark-sql-rdd.html

       Copyright notice: Licensed under "Attribution-NonCommercial-ShareAlike 4.0". Please retain the original link and author attribution when reposting.