python - How to vectorize a data frame with several text columns in scikit-learn without losing track of the origin columns


I have several pandas data series, and I want to train this data to map to an output, df["output"].

So far I have merged the series into one, separating each with commas.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("sourcedata.csv")
sample = df["cata"] + "," + df["catb"] + "," + df["catc"]

def my_tokenizer(s):
    return s.split(",")

vect = CountVectorizer(analyzer='word', tokenizer=my_tokenizer,
                       ngram_range=(1, 3), min_df=1)

train = vect.fit_transform(sample.values)

lf = LogisticRegression()
lfit = lf.fit(train, df["output"])
pred = lambda x: lfit.predict_proba(vect.transform([x]))

The problem is that this bag-of-words approach doesn't consider:
- the word order within each category ("orange banana" is different from "banana orange")
- that the same text has different significance depending on the category ("us" in one category may mean country of origin, in another destination)
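A quick sketch of the first point: with a plain CountVectorizer (unigram counts), the two orderings produce identical feature vectors, so a downstream model cannot tell them apart.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words counts tokens but discards their order, so these two
# strings end up with identical count vectors.
vect = CountVectorizer()
X = vect.fit_transform(["orange banana", "banana orange"])
print((X[0] != X[1]).nnz)  # 0 non-zero differences: the rows are identical
```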

For example, an entire string would be:

pred("us, chiquita banana, china")

category a: country of origin
category b: company & type of fruit (order matters)
category c: destination

The way I'm doing it ignores this type of ordering, and it also generates spaces in the feature names for some reason (which messes things up even more):

In [1242]: vect.get_feature_names()[0:10]
Out[1242]:
[u'',
 u' ',
 u'  ',
 u'   ',
 u'    ',
 u'     ',
 u'   us',
 u'   ca',
 u'   uk']
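The stray spaces most likely come from the comma-only tokenizer: splitting on "," keeps the space after each comma attached to the next token, and the (1, 3) n-grams then join those tokens with yet more spaces. A small sketch of the effect:

```python
# Splitting on "," alone leaves the space after each comma attached to
# the following token; those spaces then leak into the feature names.
s = "us, chiquita banana, china"
print(s.split(","))                       # ['us', ' chiquita banana', ' china']

# Stripping each piece inside the tokenizer avoids this:
print([t.strip() for t in s.split(",")])  # ['us', 'chiquita banana', 'china']
```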

Any suggestions welcome!! Thanks a lot

OK, first let's prepare the data set by selecting the relevant columns and removing leading and trailing spaces using strip:

sample = df[['cata', 'catb', 'catc']]
sample = sample.apply(lambda col: col.str.strip())

From here you have a couple of options for how to vectorize the training set. If you have a smallish number of levels across all of your features (say, fewer than 1000 in total), you can treat them as categorical variables and set train = pd.get_dummies(sample) to convert them into binary indicator variables. Afterwards your data will look like this:

cata_us  cata_ca ... catb_chiquita_banana  catb_morningstar_tomato ... catc_china ...
1        0           1                     0                           1          ...
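As a runnable sketch (the toy values are taken from the question's example), pd.get_dummies prefixes each indicator with its source column by default:

```python
import pandas as pd

# Toy stand-in for the cleaned sample; column names and values follow
# the question's example.
sample = pd.DataFrame({
    "cata": ["us", "ca"],
    "catb": ["chiquita banana", "morningstar tomato"],
    "catc": ["china", "uk"],
})

train = pd.get_dummies(sample)
# Column names come out as e.g. 'cata_us', 'catb_chiquita banana',
# so every indicator is traceable back to its origin column.
print(list(train.columns))
```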

Notice that the variable names start with the origin column, which makes sure the model knows which column they came from. Also, since you're using the exact strings, word order in the second column is preserved.

If you have too many levels for this to work, or you want to consider individual words in catb as well as bigrams, you can apply the CountVectorizer separately to each column and use hstack to concatenate the resulting output matrices:

import scipy.sparse as sp

vect = CountVectorizer(ngram_range=(1, 3))
train = sp.hstack(sample.apply(lambda col: vect.fit_transform(col)))
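One caveat with re-using a single vectorizer in the apply above: each fit_transform call refits it, so after the loop only the last column's vocabulary remains in vect. A sketch (with toy data assumed from the question) that keeps one fitted vectorizer per column instead, so each vocabulary stays inspectable:

```python
import pandas as pd
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the cleaned sample.
sample = pd.DataFrame({
    "cata": ["us", "ca"],
    "catb": ["chiquita banana", "morningstar tomato"],
    "catc": ["china", "uk"],
})

# One vectorizer per column: each column keeps its own vocabulary and
# can be applied to new data separately at prediction time.
vects = {col: CountVectorizer(ngram_range=(1, 3)) for col in sample.columns}
mats = [vects[col].fit_transform(sample[col]) for col in sample.columns]
train = sp.hstack(mats).tocsr()
print(train.shape)  # rows x total features across the three columns
```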
