Spark DataFrames: registerTempTable vs not


I started using DataFrames yesterday, and I'm liking them so far.

I don't understand one thing, though (referring to the example under "Programmatically Specifying the Schema" here: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema).

In that example the DataFrame is registered as a table (I'm guessing to provide access to SQL queries?), but the exact same information that is being accessed can also be obtained with peopleDataFrame.select("name").

So my question is: when would you want to register a DataFrame as a table instead of just using the given DataFrame functions? And is one option more efficient than the other?

The reason to use the registerTempTable( tableName ) method on a DataFrame is that, in addition to being able to use the Spark-provided methods of a DataFrame, you can also issue SQL queries via the sqlContext.sql( sqlQuery ) method that use that DataFrame as an SQL table. The tableName parameter specifies the table name to use for that DataFrame in the SQL queries.

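    import org.apache.spark.SparkContext
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.hive.HiveContext

    val sc: SparkContext = ...
    val hc = new HiveContext( sc )
    val customerDataFrame = myCodeToCreateOrLoadDataFrame()
    // Make the DataFrame visible to SQL queries under the name "cust"
    customerDataFrame.registerTempTable( "cust" )
    val query = """SELECT custId, sum( purchaseAmount ) FROM cust GROUP BY custId"""
    val salesPerCustomer: DataFrame = hc.sql( query )
    salesPerCustomer.show()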

Whether to use SQL or DataFrame methods like select and groupBy is probably largely a matter of preference. My understanding is that the SQL queries get translated into Spark execution plans.
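To see the two styles side by side, here is a minimal sketch that runs the same aggregation through SQL and through DataFrame methods and prints both physical plans with explain(). It assumes the Spark 1.x API used in this answer (HiveContext, registerTempTable) and a local master; the sample rows, the "total" alias, and the object name are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.hive.HiveContext

object SqlVsDataFrameMethods {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sql-vs-dataframe").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val hc = new HiveContext(sc)
    import hc.implicits._

    // Hypothetical sample data standing in for the customer DataFrame.
    val customerDataFrame = sc.parallelize(Seq(
      ("c1", 10.0), ("c1", 5.0), ("c2", 7.5)
    )).toDF("custId", "purchaseAmount")
    customerDataFrame.registerTempTable("cust")

    // Route 1: SQL against the registered temp table.
    val viaSql = hc.sql(
      "SELECT custId, sum(purchaseAmount) AS total FROM cust GROUP BY custId")

    // Route 2: the same aggregation with DataFrame methods.
    val viaMethods = customerDataFrame
      .groupBy("custId")
      .agg(sum(customerDataFrame("purchaseAmount")).as("total"))

    // Print the physical plan Spark chose for each route.
    viaSql.explain()
    viaMethods.explain()

    viaSql.show()
    viaMethods.show()

    sc.stop()
  }
}
```

Comparing the two explain() outputs is a quick way to check, for your own query and Spark version, whether one form ends up with a different plan than the other.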

In my case, I found that certain kinds of aggregation and windowing queries that I needed, such as computing a running balance per customer, were available in the Hive SQL query language, and I suspect they would have been difficult to do with Spark's own methods.
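As an illustration of that kind of query, here is a rough sketch of a running balance per customer written as a Hive-style window query. It assumes Spark 1.4 or later (where HiveContext accepts window functions) and a local master; the "tx" table, its columns, and the sample rows are hypothetical and not taken from the original code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object RunningBalance {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("running-balance").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val hc = new HiveContext(sc)
    import hc.implicits._

    // Hypothetical transactions: (custId, txSeq, amount).
    val tx = sc.parallelize(Seq(
      ("c1", 1, 100.0), ("c1", 2, -30.0), ("c1", 3, 45.0),
      ("c2", 1, 20.0), ("c2", 2, 5.0)
    )).toDF("custId", "txSeq", "amount")
    tx.registerTempTable("tx")

    // Sum every amount from the customer's first transaction up to the current row.
    val query =
      """SELECT custId, txSeq, amount,
        |       sum(amount) OVER (PARTITION BY custId ORDER BY txSeq
        |                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
        |         AS runningBalance
        |FROM tx""".stripMargin

    hc.sql(query).show()

    sc.stop()
  }
}
```

The PARTITION BY clause restarts the sum for each customer, and the ORDER BY plus the explicit ROWS frame is what turns a plain sum into a running total.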

If you want to use SQL, you will most likely want to create a HiveContext instead of a regular SQLContext. The Hive query language supports a broader range of SQL than is available via a plain SQLContext.

