Hi,
We are much inspired by this great work and are in the process of cleaning our data. However, if we understand correctly, the remove_non_printing_characters normalization step is not used for the final cleaning. Do you have any thoughts on why it should not be used?
In data_tooling/ac_dc/normalization.py (line 5 in e28064e), you have this:
non_printing_characters_re = re.compile(
f"[{''.join(map(chr, list(range(0,32)) + list(range(127,160))))}]"
)
We modified this to keep newlines (\n) and tabs (\t), and to also remove soft hyphens, non-breaking spaces, and zero-width spaces:
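import re

# 160 = non-breaking space, 173 = soft hyphen, 8203 = zero-width space
additional_chars_to_remove = [160, 173, 8203]
# Skip 9 (\t) and 10 (\n) so tabs and newlines are kept
non_printing_characters_re = re.compile(
f"[{''.join(map(chr, list(range(0,9)) + list(range(11, 32)) + list(range(127,160)) + additional_chars_to_remove))}]"
)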
There could of course be more characters that one may want to remove.
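To make the effect concrete, here is a minimal sketch of how the modified pattern could be applied. The helper name remove_non_printing_characters_keep_whitespace and the sample string are made up for illustration and are not part of the repository:

```python
import re

# Code points removed in addition to the control ranges:
# 160 = non-breaking space, 173 = soft hyphen, 8203 = zero-width space
additional_chars_to_remove = [160, 173, 8203]

# Tab (9) and newline (10) are deliberately left out of the removed ranges.
code_points = (
    list(range(0, 9))
    + list(range(11, 32))
    + list(range(127, 160))
    + additional_chars_to_remove
)
non_printing_characters_re = re.compile(f"[{''.join(map(chr, code_points))}]")


def remove_non_printing_characters_keep_whitespace(text: str) -> str:
    """Hypothetical helper: delete every character matched by the pattern."""
    return non_printing_characters_re.sub("", text)


# Sample containing a soft hyphen, a zero-width space, a tab, a newline,
# and a BEL control character (illustrative input only).
sample = "soft\u00adhyphen\u200b\tkept\nline\x07"
cleaned = remove_non_printing_characters_keep_whitespace(sample)
print(repr(cleaned))  # 'softhyphen\tkept\nline'
```

One thing to be aware of: because matched characters are deleted rather than replaced, a non-breaking space between two words joins them ("a\u00a0b" becomes "ab"), and "\r\n" line endings become "\n". Replacing non-breaking spaces with a regular space instead might be preferable for some data.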
To be clear, I am writing this here for two reasons:
- To get your feedback. Do you think this is a good idea to use for the final data cleaning?
- If so, this could be incorporated into this repository to help other people that might be thinking about this.
Thanks for your amazing contributions!