scala - From DataFrame to RDD[LabeledPoint]


I am trying to implement a document classifier using Apache Spark MLlib, and I am having problems representing the data. My code is the following:

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    import org.apache.spark.ml.feature.Tokenizer
    import org.apache.spark.ml.feature.HashingTF
    import org.apache.spark.ml.feature.IDF
    import org.apache.spark.mllib.regression.LabeledPoint

    val sql = new SQLContext(sc)

    // Load raw data from a TSV file
    val raw = sc.textFile("data.tsv").map(_.split("\t").toSeq)

    // Convert the RDD to a DataFrame
    val schema = StructType(List(StructField("class", StringType), StructField("content", StringType)))
    val dataframe = sql.createDataFrame(raw.map(row => Row(row(0), row(1))), schema)

    // Tokenize
    val tokenizer = new Tokenizer().setInputCol("content").setOutputCol("tokens")
    val tokenized = tokenizer.transform(dataframe)

    // TF-IDF
    val htf = new HashingTF().setInputCol("tokens").setOutputCol("rawFeatures").setNumFeatures(500)
    val tf = htf.transform(tokenized)
    tf.cache()
    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
    val idfModel = idf.fit(tf)
    val tfidf = idfModel.transform(tf)

    // Create labeled points
    // This is the line that fails: row.get(4) is typed Any, but LabeledPoint expects a Vector
    val labeled = tfidf.map(row => LabeledPoint(row.getDouble(0), row.get(4)))

I need to use DataFrames to generate the tokens and create the TF-IDF features. The problem appears when I try to convert this DataFrame to an RDD[LabeledPoint]. When I map over the DataFrame rows, the get method of Row returns type Any, not the type defined in the DataFrame schema (Vector). Therefore, I cannot construct the RDD I need to train the ML model.

What is the best option to get an RDD[LabeledPoint] after calculating the TF-IDF?

Casting the object worked for me.

Try:

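    // Vector must be the Spark type, not Scala's built-in collection Vector
    // (mllib.linalg is the package used by the ml feature transformers in Spark 1.x)
    import org.apache.spark.mllib.linalg.Vector

    // Create labeled points; row(4) is the "features" column, typed Any, so cast it to Vector.
    // Note: the question's schema declares "class" as StringType, so getDouble(0)
    // only succeeds if that column actually holds Double values.
    val labeled = tfidf.map(row => LabeledPoint(row.getDouble(0), row(4).asInstanceOf[Vector]))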
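If you would rather not rely on column positions, the same cast can be written as a pattern match on Row. This is only a sketch under a few assumptions: toLabeledPoints is a made-up helper name (it is not part of Spark), the columns are called "class" and "features" as in the question, the class labels are numeric strings such as "0" or "1.0", and the feature vectors are org.apache.spark.mllib.linalg.Vector, which is what the ml transformers produce in Spark 1.x.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical helper, not a Spark API.
// Selects the two needed columns by name, then pattern-matches each Row so the
// expected types (String label, Vector features) are stated explicitly.
def toLabeledPoints(df: DataFrame): RDD[LabeledPoint] =
  df.select("class", "features").rdd.map {
    // Assumes the label is a numeric string; toDouble throws otherwise.
    case Row(label: String, features: Vector) =>
      LabeledPoint(label.toDouble, features)
  }
```

With the question's code, toLabeledPoints(tfidf) should give the RDD[LabeledPoint] needed for training. A row that does not have the expected shape fails with a MatchError, which tends to be easier to diagnose than a ClassCastException from asInstanceOf.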
