python - How to vectorize a data frame with several text columns in scikit-learn without losing track of the origin columns
I have several pandas data series, and want to train this data to map to an output, df["output"].
So far I have merged the series into one, with each value separated by commas.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("sourcedata.csv")
sample = df["cata"] + "," + df["catb"] + "," + df["catc"]

def my_tokenizer(s):
    return s.split(",")

vect = CountVectorizer(analyzer='word', tokenizer=my_tokenizer,
                       ngram_range=(1, 3), min_df=1)
train = vect.fit_transform(sample.values)

lf = LogisticRegression()
lfit = lf.fit(train, df["output"])
pred = lambda x: lfit.predict_proba(vect.transform([x]))
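For reference, here is a self-contained, runnable version of the same pipeline with a toy DataFrame standing in for sourcedata.csv (the rows and labels below are invented for illustration):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in for sourcedata.csv (values and labels are made up)
df = pd.DataFrame({
    "cata":   ["us", "ca", "us", "mx"],
    "catb":   ["chiquita banana", "morningstar tomato", "chiquita banana", "dole pineapple"],
    "catc":   ["china", "uk", "uk", "china"],
    "output": [1, 0, 1, 0],
})

sample = df["cata"] + "," + df["catb"] + "," + df["catc"]

def my_tokenizer(s):
    # each comma-separated field becomes one token
    return s.split(",")

vect = CountVectorizer(analyzer="word", tokenizer=my_tokenizer,
                       ngram_range=(1, 3), min_df=1)
train = vect.fit_transform(sample.values)

lf = LogisticRegression()
lfit = lf.fit(train, df["output"])
pred = lambda x: lfit.predict_proba(vect.transform([x]))

probs = pred("us,chiquita banana,china")
print(probs)  # shape (1, 2): probability of each output class
```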
The problem is that this bag-of-words approach doesn't consider:
- the unique ordering within each category ("orange banana" is different from "banana orange")
- that the same text has different significance depending on the category ("us" in one category means country of origin, in another the destination)
For example, an entire string might be:
pred("us, chiquita banana, china")
category a: country of origin
category b: company & type of fruit (order matters)
category c: destination
The way I am doing it ignores this type of ordering, and also generates spaces in the feature names for some reason (which messes things up even more):
In [1242]: vect.get_feature_names()[0:10]
Out[1242]: [u'', u' ', u' ', u' ', u' ', u' ', u' us', u' ca', u' uk']
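The spaces come from the tokenizer itself: splitting "us, chiquita banana, china" on "," yields tokens with leading spaces (" chiquita banana"), and n-grams are then joined with spaces on top of that. A minimal fix, sketched below, is to strip each token (and drop empties) inside the tokenizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

def my_tokenizer(s):
    # strip whitespace around each field and drop empty tokens
    return [t.strip() for t in s.split(",") if t.strip()]

vect = CountVectorizer(analyzer="word", tokenizer=my_tokenizer,
                       ngram_range=(1, 3))
vect.fit(["us, chiquita banana, china"])
print(sorted(vect.vocabulary_))  # no leading/trailing spaces, no empty features
```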
Any suggestions welcome! Thanks a lot.
OK, first let's prepare the data set by selecting the relevant columns and removing leading and trailing spaces using strip:

sample = df[['cata', 'catb', 'catc']]
sample = sample.apply(lambda col: col.str.strip())
From here you have a couple of options for how to vectorize the training set. If you have a smallish number of levels across all of your features (say, fewer than 1000 in total), you can treat them as categorical variables and set

train = pd.get_dummies(sample)

to convert them into binary indicator variables. After this the data will look like:
cata_us  cata_ca ...  catb_chiquita_banana  catb_morningstar_tomato ...  catc_china ...
      1        0 ...                     1                        0 ...           1 ...
Notice that the variable names start with the origin column, which makes sure the model knows where each one comes from. Also, since you're using the exact strings, word order in the second column is preserved.
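A small runnable sketch of the get_dummies route (the toy rows are assumptions, reusing the question's column names):

```python
import pandas as pd

sample = pd.DataFrame({
    "cata": ["us", "ca"],
    "catb": ["chiquita banana", "morningstar tomato"],
    "catc": ["china", "uk"],
})

# One binary indicator column per (column, level) pair; the column name
# is prefixed with the origin column, e.g. "cata_us", "catb_chiquita banana"
train = pd.get_dummies(sample)
print(train.columns.tolist())
```

Note that pandas joins prefix and level with "_" but keeps any spaces inside the level itself, so the full string (word order included) survives as a single feature name.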
If you have too many levels for that to work, or you want to consider individual words in catb as well as bigrams, you can apply CountVectorizer separately to each column and use hstack to concatenate the resulting output matrices:
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(ngram_range=(1, 3))
train = sp.hstack(sample.apply(lambda col: vect.fit_transform(col)))
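One caveat with the one-liner above: it refits the same vectorizer on each column, so only the last column's vocabulary survives, and you can't transform new rows later. If you need predictions, keep one fitted CountVectorizer per column; a sketch under that assumption (toy data invented, column names from the question):

```python
import pandas as pd
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

sample = pd.DataFrame({
    "cata": ["us", "ca"],
    "catb": ["chiquita banana", "morningstar tomato"],
    "catc": ["china", "uk"],
})

# One vectorizer per column so each column keeps its own vocabulary
vects = {col: CountVectorizer(ngram_range=(1, 3)) for col in sample.columns}
train = sp.hstack([vects[col].fit_transform(sample[col])
                   for col in sample.columns])

def transform_row(row):
    # row: dict mapping column name -> string; reuses the fitted vocabularies
    return sp.hstack([vects[col].transform([row[col]])
                      for col in sample.columns])

new = transform_row({"cata": "us", "catb": "chiquita banana", "catc": "china"})
print(train.shape, new.shape)  # same number of columns in both
```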