Scala - From DataFrame to RDD[LabeledPoint]
I am trying to implement a document classifier using Apache Spark MLlib, and I am having some problems representing the data. My code is the following:
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.feature.IDF
import org.apache.spark.mllib.regression.LabeledPoint

val sql = new SQLContext(sc)

// Load the raw data from a TSV file
val raw = sc.textFile("data.tsv").map(_.split("\t").toSeq)

// Convert the RDD to a DataFrame
val schema = StructType(List(StructField("class", StringType), StructField("content", StringType)))
val dataframe = sql.createDataFrame(raw.map(row => Row(row(0), row(1))), schema)

// Tokenize
val tokenizer = new Tokenizer().setInputCol("content").setOutputCol("tokens")
val tokenized = tokenizer.transform(dataframe)

// TF-IDF
val htf = new HashingTF().setInputCol("tokens").setOutputCol("rawFeatures").setNumFeatures(500)
val tf = htf.transform(tokenized)
tf.cache()
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(tf)
val tfidf = idfModel.transform(tf)

// Create labeled points -- this is where the problem appears
val labeled = tfidf.map(row => LabeledPoint(row.getDouble(0), row.get(4)))
I need to use DataFrames to generate the tokens and create the TF-IDF features. The problem appears when I try to convert the DataFrame to an RDD[LabeledPoint]: when I map over the DataFrame rows, the get method of Row returns Any, not the type defined in the DataFrame schema (Vector). Therefore, I cannot construct the RDD I need to train the ML model.
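A minimal illustration of the mismatch (assuming the mllib Vector import, since that is what LabeledPoint expects):

import org.apache.spark.mllib.linalg.Vector

// Fails to compile: Row.get returns Any, not the schema type
val vec: Vector = tfidf.first().get(4)
// error: type mismatch; found: Any, required: org.apache.spark.mllib.linalg.Vector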
What is the best option to get an RDD[LabeledPoint] after calculating the TF-IDF?
Casting the object worked for me.

Try:
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint

// Create labeled points by casting the feature column to a Vector
val labeled = tfidf.map(row => LabeledPoint(row.getDouble(0), row(4).asInstanceOf[Vector]))
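If your Spark version supports looking up row fields by name (Row.getAs has accepted a field name since Spark 1.4), the same cast can be written by column name instead of by position. This is only a sketch under two assumptions: the tfidf DataFrame from the question is in scope, and the "class" column holds numeric labels as strings (the schema declares it as StringType, so getDouble(0) would fail at runtime even after the Vector cast compiles):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint

val labeled = tfidf.map { row =>
  // Parse the StringType label and cast the feature column by name
  LabeledPoint(row.getAs[String]("class").toDouble, row.getAs[Vector]("features"))
}

Note that getAs[Vector] performs the same asInstanceOf cast underneath, so it is equivalent at runtime; the advantage is only that the column names stay visible at the call site instead of positional indices.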