python - How is Levenshtein Distance calculated on Simplified Chinese characters?


I have 2 queries:

    query1:你好世界
    query2:你好

When I run this code using the Python library Levenshtein:

    from Levenshtein import distance, hamming, median

    lev_edit_dist = distance(query1, query2)
    print lev_edit_dist

I get an output of 12. The question is: how is the value 12 derived?

Because in terms of stroke differences, there are more than 12.

According to its documentation, it supports Unicode:

It supports both normal and Unicode strings, but can't mix them, all arguments to a function (method) have to be of the same type (or its subclasses).

You need to make sure the Chinese characters are in Unicode though:

    In [1]: from Levenshtein import distance, hamming, median

    In [2]: query1 = '你好世界'

    In [3]: query2 = '你好'

    In [4]: print distance(query1,query2)
    6

    In [5]: print distance(query1.decode('utf8'),query2.decode('utf8'))
    2
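The gap between the two results comes from what counts as one "symbol". On a byte string the comparison runs byte by byte, and each of these Chinese characters takes 3 bytes in UTF-8, so removing two characters costs 6 edits. On a Unicode string each character is one symbol, so the same change costs 2. Below is a minimal sketch of that effect. It assumes Python 3 (where `str` is already Unicode and `encode` yields bytes), and `levenshtein` here is a hand-written illustrative function, not the library's implementation.

```python
def levenshtein(a, b):
    """Textbook dynamic-programming edit distance over any two sequences.

    Illustrative helper only; the Levenshtein library has its own
    implementation.
    """
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


query1 = "你好世界"
query2 = "你好"

# Compared as Unicode text: one symbol per character.
chars_distance = levenshtein(query1, query2)

# Compared as UTF-8 bytes: three symbols per character.
bytes_distance = levenshtein(query1.encode("utf-8"), query2.encode("utf-8"))

print(chars_distance)  # 2
print(bytes_distance)  # 6
```

Either way the distance counts edits to encoded symbols, never strokes, so it says nothing about how visually different two characters are.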
