python - How is Levenshtein Distance calculated on Simplified Chinese characters?
I have 2 queries:
query1: 你好世界
query2: 你好
When I run this code using the Python library Levenshtein:
    from Levenshtein import distance, hamming, median
    lev_edit_dist = distance(query1, query2)
    print lev_edit_dist
I get an output of 12. My question is: how is the value 12 derived?
Because in terms of stroke difference, there are more than 12.
According to the documentation, it supports Unicode:
"It supports both normal and Unicode strings, but can't mix them; all arguments to a function (method) have to be of the same type (or its subclasses)."
You need to make sure the Chinese characters are in Unicode though:
    In [1]: from Levenshtein import distance, hamming, median
    In [2]: query1 = '你好世界'
    In [3]: query2 = '你好'
    In [4]: print distance(query1,query2)
    6
    In [5]: print distance(query1.decode('utf8'),query2.decode('utf8'))
    2
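To see why the byte-string result is 6 and the Unicode result is 2, here is a minimal sketch in Python 3 that does not use the Levenshtein library at all. `edit_distance` is a hypothetical helper written for illustration (a plain dynamic-programming version, not the library's implementation). It compares whatever elements the sequences contain: code points for `str`, single bytes for `bytes`. Each of these Chinese characters takes 3 bytes in UTF-8, so deleting 世界 costs 2 edits on characters but 6 edits on bytes.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (str or bytes).

    Hypothetical helper for illustration only; not the library's code.
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


query1 = "你好世界"
query2 = "你好"

# Compared as Unicode strings: two characters are deleted.
print(edit_distance(query1, query2))  # 2

# Compared as UTF-8 bytes: each deleted character is 3 bytes.
print(edit_distance(query1.encode("utf-8"), query2.encode("utf-8")))  # 6
```

In Python 3 a `str` literal is already Unicode, so the `.decode('utf8')` step from the Python 2 session above is only needed when the input arrives as `bytes`.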