Reason for not applying remove_non_printing_characters normalization #416

Hi,

We are very much inspired by this great work and are in the process of cleaning our data. However, if we understand correctly, the `remove_non_printing_characters` normalization step is not used for the final cleaning. Is there a reason why it should not be used?

https://github.com/bigscience-workshop/data_tooling/blob/e28064ec7fb38af5143cafc896e9423a8b12392d/ac_dc/normalization.py#L5 

There you have this:
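```
import re
# Matches the C0 controls (0-31, including tab and newline), DEL (127), and the C1 controls (128-159).
non_printing_characters_re = re.compile(
    f"[{''.join(map(chr, list(range(0,32)) + list(range(127,160))))}]"
)
```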

We modified this to keep tabs (`\t`) and newlines (`\n`), and to additionally remove non-breaking spaces, soft hyphens, and zero-width spaces:

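```
import re

# Skip 9 (tab) and 10 (newline); also drop NBSP (160), soft hyphen (173), and zero-width space (8203).
additional_chars_to_remove = [160, 173, 8203]
non_printing_characters_re = re.compile(
    f"[{''.join(map(chr, list(range(0,9)) + list(range(11, 32)) + list(range(127,160)) + additional_chars_to_remove))}]"
)
```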

There could of course be more characters that one may want to remove. 
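
For anyone who wants to try the modified pattern, here is a minimal sketch of how it could be wrapped in a function and checked. The names `remove_non_printing`, `CODE_POINTS`, and `NON_PRINTING_RE` are hypothetical and chosen only for this illustration; they are not taken from the repository.

```python
import re

# Code points as in the modified pattern above: C0 controls except tab (9)
# and newline (10), DEL and the C1 controls (127-159), plus three extras.
ADDITIONAL_CHARS_TO_REMOVE = [160, 173, 8203]  # NBSP, soft hyphen, zero-width space
CODE_POINTS = (
    list(range(0, 9))
    + list(range(11, 32))
    + list(range(127, 160))
    + ADDITIONAL_CHARS_TO_REMOVE
)
NON_PRINTING_RE = re.compile(f"[{''.join(map(chr, CODE_POINTS))}]")


def remove_non_printing(text: str) -> str:
    """Delete every character matched by NON_PRINTING_RE (hypothetical helper)."""
    return NON_PRINTING_RE.sub("", text)


sample = "a\u200bb\tc\nd\x00e\xa0f\xadg"
print(repr(remove_non_printing(sample)))  # 'ab\tc\ndefg'
```

Two things to keep in mind with this version: carriage returns (13) are still removed, so `"\r\n"` becomes `"\n"`, and a non-breaking space is deleted rather than replaced by a regular space, so the words on either side of it are joined.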

To be clear, I am writing this here for two reasons:
1. To get your feedback. Do you think this is a good idea to use for the final data cleaning?
2. If so, this could be incorporated into this repository to help other people that might be thinking about this.

Thanks for your amazing contributions!
