python - Checking the language of extracted trends from Twitter


I am extracting the top hashtags from Twitter using the tweepy module in Python. There is one major problem I face: I wish to check whether a tag is in English or not. Tags that are not in English should be removed.

Example:

tags=['askorange','charlestonshooting','replytoasong','uberlive','otecmatkasyn'] 

The result should not contain otecmatkasyn.

What you need to use is a language detector API. One is offered by Google, but it is not free. Another option is the Language Detection API.

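After you choose the API that suits you best, you need to parse the text so that it reads as a sentence. For example, the tag 'askorange' must be split to read 'ask orange'. You can iterate over each character of the string, check whether it is uppercase, and insert a space there. Note that this only works when the tag keeps its original capitalization (for example 'AskOrange'); an all-lowercase tag has no uppercase boundaries to split on.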

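new_tags = []
for tag in tags:
    new_word = tag
    uppercases = 0  # spaces inserted so far, in case the tag has several uppercase letters
    for i in range(1, len(tag)):  # start at 1 so no space is added before the first letter
        if tag[i].isupper():
            # shift the index by the number of spaces already inserted
            new_word = new_word[:i + uppercases] + ' ' + new_word[i + uppercases:]
            uppercases += 1
    new_tags.append(new_word)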
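The uppercase split described above can also be sketched with a regular expression. This is only an alternative illustration, not part of the original answer; the function name split_on_uppercase is made up for this example, and it assumes the tags still carry their original capitalization.

```python
import re

def split_on_uppercase(tag):
    """Insert a space before every uppercase letter except the first character.

    Hypothetical helper for illustration; it relies on the tag being
    camel-cased (e.g. 'AskOrange'), which is an assumption.
    """
    # (?<!^) skips the start of the string, (?=[A-Z]) matches the empty
    # position just before an uppercase ASCII letter.
    return re.sub(r'(?<!^)(?=[A-Z])', ' ', tag)

tags = ['AskOrange', 'CharlestonShooting', 'askorange']
new_tags = [split_on_uppercase(tag) for tag in tags]
print(new_tags)
```

As with the loop, an all-lowercase tag such as 'askorange' comes back unchanged, so lowercase tags would need a different word-splitting approach.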

Finally, send the list of new_tags to the API to detect the language.
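The filtering step might look roughly like the sketch below. The detector is passed in as a plain function because the request format depends on which API you pick. Here toy_detect_language and its tiny word list are stand-ins invented for this example, not a real service; in practice you would replace it with a function that calls your chosen API and returns a language code.

```python
def keep_english(phrases, detect_language):
    """Return only the phrases whose detected language code is 'en'.

    detect_language is any callable taking a string and returning a
    language code; how it talks to a real API is left open (assumption).
    """
    return [phrase for phrase in phrases if detect_language(phrase) == 'en']

# Stand-in detector for demonstration only: a real detector would call
# the language detection API instead of consulting this toy word list.
TOY_ENGLISH_WORDS = {'ask', 'orange', 'uber', 'live'}

def toy_detect_language(phrase):
    words = phrase.lower().split()
    if words and all(word in TOY_ENGLISH_WORDS for word in words):
        return 'en'
    return 'unknown'

new_tags = ['Ask Orange', 'Uber Live', 'Otec Matka Syn']
english_tags = keep_english(new_tags, toy_detect_language)
print(english_tags)
```

Keeping the detector as a parameter means the filtering logic stays the same if you later switch from one API to another.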

