encoding - Remove all characters which cannot be decoded in Python
I am trying to parse an HTML file with a Python script using the xml.etree.ElementTree module. The charset should be UTF-8 according to the header, but there is a strange character in the file, so the parser can't parse it. I opened the file in Notepad++ to see the character. I tried opening it with several encodings but couldn't find the correct one.
As I have many files to parse, I would like to know how to remove all bytes which can't be decoded. Is there a solution?
This is simple:
with open('filename', 'r', encoding='utf8', errors='ignore') as f: ...
The errors='ignore' argument tells Python to drop any bytes it cannot decode. It can also be passed to bytes.decode(), and to most other places which take an encoding argument.
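To illustrate the bytes.decode() variant, here is a minimal sketch. The sample byte string is made up for the example: it is valid UTF-8 apart from one stray 0xff byte.

```python
# Hypothetical sample data: valid UTF-8 with one stray byte (0xff) in the middle.
raw = b"caf\xc3\xa9 \xff menu"

# Strict decoding (the default) raises on the invalid byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors='ignore' silently drops the bytes that cannot be decoded.
cleaned = raw.decode("utf-8", errors="ignore")

# errors='replace' keeps a U+FFFD marker instead, which shows where data was lost.
marked = raw.decode("utf-8", errors="replace")

print(strict_ok, repr(cleaned), repr(marked))
```

Note that 'ignore' loses data without telling you, so it may be worth trying 'replace' on a few files first to see how much is being thrown away.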
Since this decodes the bytes into Unicode, it may not be suitable for an XML parser that wants to consume bytes. In that case, you should write the data back to disk (e.g. using shutil.copyfileobj()) and then re-open it in 'rb' mode.
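A rough sketch of that round trip might look like the following. The function name parse_ignoring_bad_bytes and the use of a temporary file are my own choices for the example, and it assumes the input is otherwise well-formed XML that declares UTF-8 (or no encoding).

```python
import os
import shutil
import tempfile
import xml.etree.ElementTree as ET


def parse_ignoring_bad_bytes(path):
    """Parse an XML file after dropping any bytes that are not valid UTF-8.

    Hypothetical helper, not part of any library.
    """
    # Write a cleaned copy to a temporary file, re-encoded as UTF-8.
    with tempfile.NamedTemporaryFile(
        "w", encoding="utf8", suffix=".xml", delete=False
    ) as dst:
        cleaned_path = dst.name
        with open(path, "r", encoding="utf8", errors="ignore") as src:
            shutil.copyfileobj(src, dst)
    try:
        # Re-open the cleaned copy in binary mode so the parser sees bytes.
        with open(cleaned_path, "rb") as f:
            return ET.parse(f)
    finally:
        os.remove(cleaned_path)
```

Calling parse_ignoring_bad_bytes('filename') then returns an ElementTree as usual. The original file is left untouched, which seems safer when processing many files in a batch.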
In Python 2, these arguments to the built-in open() don't exist, but you can use io.open() instead. Alternatively, you can decode your 8-bit strings into Unicode strings after reading them, but this is more error-prone in my opinion.
But it turns out the OP doesn't have invalid UTF-8. The OP has valid UTF-8 which happens to include control characters. Control characters are mildly annoying to filter out since you have to run them through a function like this, meaning you can't just use copyfileobj():
import unicodedata

def strip_control_chars(data: str) -> str:
    return ''.join(c for c in data if unicodedata.category(c) != 'Cc')
Cc is the Unicode category for "Other, control character", as described on the Unicode website. To include a broader array of "bad characters," we can strip the entire "Other" category (which mostly contains useless stuff anyway):
def strip_control_chars(data: str) -> str:
    return ''.join(c for c in data if not unicodedata.category(c).startswith('C'))
This will also filter out line breaks, so it's a good idea to process the file a line at a time and add the line breaks back in at the end.
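The line-at-a-time approach could be sketched roughly as below. clean_lines is a hypothetical helper name, and strip_control_chars is repeated (in its Cc-only form) so the example runs on its own. Tabs are in the Cc category too, so they are removed along with the other control characters.

```python
import io
import unicodedata


def strip_control_chars(data: str) -> str:
    # Drop every character in the "Cc" (control) category, including '\n' and '\t'.
    return ''.join(c for c in data if unicodedata.category(c) != 'Cc')


def clean_lines(src, dst):
    """Copy text from src to dst, stripping control characters but keeping line breaks.

    Hypothetical helper: src and dst are text-mode file objects.
    """
    for line in src:
        # Remember whether this line ended with a newline before stripping it.
        had_newline = line.endswith('\n')
        dst.write(strip_control_chars(line))
        if had_newline:
            dst.write('\n')


# Small in-memory demonstration with made-up data.
src = io.StringIO("a\x00b\nc\x07d\nlast")
dst = io.StringIO()
clean_lines(src, dst)
print(repr(dst.getvalue()))
```

With real files, src and dst would be opened with open(..., encoding='utf8') in 'r' and 'w' mode respectively. The cleaned file can then be re-opened in 'rb' mode for the parser.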
In principle, we could create a codec for doing this incrementally, and then we could use copyfileobj(), but that's like using a sledgehammer to swat a fly.