Spark DataFrames: registerTempTable vs not
I just started with DataFrames yesterday and am liking it so far.
I don't understand one thing though... (referring to the example under "Programmatically Specifying the Schema" here: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema)
In this example the DataFrame is registered as a table (I am guessing to provide access to SQL queries..?), but the exact same information that is being accessed can also be obtained with peopleDataFrame.select("name").
So the question is.. when would you want to register a DataFrame as a table instead of just using the given DataFrame functions? And is one option more efficient than the other?
The reason to use the registerTempTable( tableName ) method for a DataFrame is so that, in addition to being able to use the Spark-provided methods of a DataFrame, you can also issue SQL queries via the sqlContext.sql( sqlQuery ) method, which use that DataFrame as an SQL table. The tableName parameter specifies the table name to use for that DataFrame in the SQL queries.
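    import org.apache.spark.SparkContext
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.hive.HiveContext

    val sc: SparkContext = ...
    val hc = new HiveContext( sc )
    // Create or load the DataFrame by whatever means you have available
    val customerDataFrame = myCodeToCreateOrLoadDataFrame()
    // Expose the DataFrame to SQL queries under the table name "cust"
    customerDataFrame.registerTempTable( "cust" )
    val query = """SELECT custId, sum( purchaseAmount ) FROM cust GROUP BY custId"""
    val salesPerCustomer: DataFrame = hc.sql( query )
    salesPerCustomer.show()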
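Whether to use SQL or DataFrame methods like select and groupBy is largely a matter of preference. My understanding is that the SQL queries get translated into Spark execution plans, so neither route should be inherently more efficient than the other.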
In my case, I found that certain kinds of aggregation and windowing queries that I needed, like computing a running balance per customer, were available in the Hive SQL query language, and I suspect they would have been difficult to do in Spark otherwise.
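As a rough illustration of that kind of windowing query, the sketch below computes a running balance per customer through a HiveContext. It is not the query I actually used: the purchaseDate column, the sample rows, and the object and helper names are hypothetical, and it assumes Spark 1.4 or later with the spark-hive module on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext

object RunningBalance {
  // Hypothetical sample purchases: (custId, purchaseDate, purchaseAmount)
  def samplePurchases(hc: HiveContext): DataFrame = {
    import hc.implicits._
    hc.sparkContext.parallelize(Seq(
      (1, "2015-01-01", 10.0),
      (1, "2015-01-03", 5.0),
      (2, "2015-01-02", 7.5)
    )).toDF("custId", "purchaseDate", "purchaseAmount")
  }

  // Running total of purchaseAmount per customer, ordered by purchaseDate
  def runningBalance(hc: HiveContext, purchases: DataFrame): DataFrame = {
    purchases.registerTempTable("cust")
    hc.sql(
      """SELECT custId, purchaseDate, purchaseAmount,
        |       sum(purchaseAmount) OVER (
        |         PARTITION BY custId
        |         ORDER BY purchaseDate
        |         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        |       ) AS runningBalance
        |FROM cust""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("running-balance").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val hc = new HiveContext(sc)

    runningBalance(hc, samplePurchases(hc)).show()

    sc.stop()
  }
}
```

The OVER clause is what makes this a window query: each row keeps its own purchase and gains the sum of that customer's purchases up to and including that date.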
If you want to use SQL, then you will most likely want to create a HiveContext instead of a regular SQLContext. The Hive query language supports a broader range of SQL than is available via a plain SQLContext.