encoding - Remove all characters which cannot be decoded in Python


I am trying to parse an HTML file with a Python script using the xml.etree.ElementTree module. According to the header, the charset should be UTF-8, but there is a strange character in the file, so the parser can't parse it. I opened the file in Notepad++ to see the character, which is displayed as FS. I tried opening the file with several encodings but couldn't find the correct one.

As I have many files to parse, I would like to know how to remove all the bytes which can't be decoded. Is there a solution?

"I would like to know how to remove all the bytes which can't be decoded. Is there a solution?"

This is simple:

    with open('filename', 'r', encoding='utf8', errors='ignore') as f:
        ...

The errors='ignore' argument tells Python to drop any bytes it cannot decode. It can also be passed to bytes.decode() and to most other places which take an encoding argument.
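For illustration, here is a small sketch of the same option applied to bytes.decode(). The sample bytes are made up for this example and are not taken from the OP's file.

```python
# Hypothetical sample: valid UTF-8 text containing one stray byte (0xFF),
# which can never appear in well-formed UTF-8.
raw = b'caf\xc3\xa9 \xff ok'

# errors='ignore' silently drops the undecodable byte.
cleaned = raw.decode('utf-8', errors='ignore')

# errors='replace' keeps a U+FFFD marker where the bad byte was,
# which can be handy when you want to see how much was lost.
marked = raw.decode('utf-8', errors='replace')
```

Note that 'ignore' discards data without any trace, so 'replace' may be the safer choice while you are still investigating the files.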

Since this decodes bytes into Unicode, it may not be suitable for an XML parser that wants to consume bytes. In that case, you should write the data back to disk (e.g. using shutil.copyfileobj()) and then re-open it in 'rb' mode.
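The round trip described above might look roughly like the following sketch. The helper name parse_ignoring_bad_bytes and the use of a temporary file are assumptions made for this example, and the input is assumed to be well-formed XML apart from the bad bytes.

```python
import os
import shutil
import tempfile
import xml.etree.ElementTree as ET


def parse_ignoring_bad_bytes(src_path):
    """Hypothetical helper: drop undecodable bytes, then parse the cleaned copy."""
    fd, tmp_path = tempfile.mkstemp(suffix='.xml')
    os.close(fd)
    try:
        # Decode with errors='ignore' and write the result back out as UTF-8.
        # newline='' keeps the original line endings untouched.
        with open(src_path, 'r', encoding='utf8', errors='ignore', newline='') as src, \
                open(tmp_path, 'w', encoding='utf8', newline='') as dst:
            shutil.copyfileobj(src, dst)
        # Re-open in binary mode so the parser consumes bytes.
        with open(tmp_path, 'rb') as cleaned:
            return ET.parse(cleaned).getroot()
    finally:
        os.remove(tmp_path)
```

If the files are small, decoding into a string and calling ET.fromstring() on it would avoid the temporary file; the version above follows the write-then-reopen approach from the answer.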

In Python 2, these arguments to the built-in open() don't exist, but you can use io.open() instead. Alternatively, you can decode your 8-bit strings into Unicode strings after reading them, but this is more error-prone in my opinion.
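Both Python 2 options can be sketched as follows. The function names are made up for this example; io.open() is also available on Python 3, where it is the same function as the built-in open(), so the code runs on either version.

```python
import io


def read_text_with_io_open(path):
    """Hypothetical helper: decode while reading, dropping bad bytes."""
    with io.open(path, 'r', encoding='utf8', errors='ignore') as f:
        return f.read()


def read_text_by_decoding(path):
    """Hypothetical helper: read raw bytes first, then decode them by hand."""
    with io.open(path, 'rb') as f:
        return f.read().decode('utf8', 'ignore')
```

The second variant is where mistakes tend to creep in, because it is easy to forget the decode step somewhere and end up mixing byte strings with Unicode strings.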


But it turns out that the OP doesn't have invalid UTF-8. The OP has valid UTF-8 which happens to include control characters. Control characters are mildly annoying to filter out, since you have to run the text through a function like this, meaning you can't just use copyfileobj():

    import unicodedata

    def strip_control_chars(data: str) -> str:
        # Keep every character whose general category is not 'Cc' (control)
        return ''.join(c for c in data if unicodedata.category(c) != 'Cc')

Cc is the Unicode category "Other, control character", as described on the Unicode website. To include a slightly broader array of "bad characters," we could strip the entire "Other" category (which mostly contains useless stuff anyway):

    def strip_control_chars(data: str) -> str:
        # Every "Other" subcategory (Cc, Cf, Cs, Co, Cn) starts with 'C'
        return ''.join(c for c in data if not unicodedata.category(c).startswith('C'))
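To get a feel for what these filters remove, it can help to look up the category of a few characters. The selection below is made up for illustration; U+001C is the FS (file separator) character mentioned in the question.

```python
import unicodedata

# Hypothetical sample characters, chosen only to illustrate the categories.
samples = {
    'file separator (FS)': '\x1c',
    'line feed': '\n',
    'zero width space': '\u200b',
    'private use': '\ue000',
    'letter a': 'a',
}

# Map each description to the two-letter general category of its character.
categories = {name: unicodedata.category(ch) for name, ch in samples.items()}

for name, cat in categories.items():
    print(name, cat)
```

The first two are reported as Cc, so both versions of the filter remove them. The zero width space (Cf) and the private use character (Co) are only removed by the broader startswith('C') check, and the letter (Ll) is kept by both.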

This will also filter out line breaks, since they are control characters too, so it's probably a good idea to process the file a line at a time and add the line breaks back in at the end.
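A line-at-a-time version could be sketched like this. The clean_file name and its two path parameters are assumptions for this example, and strip_control_chars is repeated so that the block is self-contained.

```python
import unicodedata


def strip_control_chars(data: str) -> str:
    # Drop every character in the 'Cc' (control) category.
    return ''.join(c for c in data if unicodedata.category(c) != 'Cc')


def clean_file(src_path: str, dst_path: str) -> None:
    """Hypothetical helper: copy src to dst without control characters."""
    with open(src_path, 'r', encoding='utf8') as src, \
            open(dst_path, 'w', encoding='utf8') as dst:
        for line in src:
            # Take the line break off first so the filter cannot eat it,
            # then put a single '\n' back after cleaning.
            dst.write(strip_control_chars(line.rstrip('\n')) + '\n')
```

Two side effects are worth knowing about: tabs are control characters as well, so they are removed, and the output always ends with a line break even if the input did not.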

In principle, we could create a codec for doing this incrementally, and then we could use copyfileobj(), but that's like using a sledgehammer to swat a fly.

