Clean your Text (part one)
Before diving into the cleaning and filtering process, you’ll need to set up your environment. Start by installing three packages: datasets, emoji, and fasttext.
!pip install datasets emoji fasttext
With the necessary libraries in place, it’s time to load your dataset and begin cleaning. First, we’ll use the datasets library to load the training split of the Algerian Darija dataset, which we can then process and refine.
from datasets import load_dataset
ds = load_dataset("ayoubkirouane/Algerian-Darija" , split='train')
Remove Emojis
Next up is handling emojis. Emojis can be a source of noise, especially when they are not relevant to the text analysis. To clean them up, we’ll define a function that removes emojis and then apply it to the dataset.
import emoji
# Function to remove emojis
def remove_emojis(example):
    example['Text'] = emoji.replace_emoji(example['Text'], replace='')
    return example
# Apply the function to the 'Text' column
ds = ds.map(remove_emojis)
We use the emoji library’s replace_emoji function to strip emojis from the Text field. The remove_emojis function updates a single entry, and ds.map(remove_emojis) applies it to every example in your dataset. This step makes the text more uniform and ready for further processing.
Remove Emails / Phone Numbers / special characters / English words / non-Arabic words
Now let’s move on to more comprehensive text cleaning. This step tackles several common issues in your dataset: URLs, email addresses, phone numbers, special characters, English words, and other non-Arabic text.
import re
def clean(example):
    text = example['Text']
    text = re.sub(r'https?://\S+', '', text) # Remove URLs
    # Remove email addresses and phone numbers BEFORE stripping punctuation and
    # non-Arabic words; otherwise '@', '.' and the digits are already gone
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '', text) # Remove email addresses
    text = re.sub(r'\b\d{10,15}\b', '', text) # Remove phone numbers (10 to 15 digits)
    text = re.sub(r'[^\w\s]', '', text) # Remove special characters (anything that is not a word character or whitespace)
    text = re.sub(r'\b[A-Za-z]+\b', '', text) # Remove English words
    text = re.sub(r'\b[^\u0600-\u06FF\s]+\b', '', text) # Remove remaining non-Arabic words
    text = re.sub(r'\s+', ' ', text).strip() # Collapse runs of whitespace (including newlines) into a single space and strip leading/trailing spaces
    example['Text'] = text
    return example
ds = ds.map(clean)
The clean function uses regular expressions to remove URLs, email addresses, phone numbers, special characters, English words, other non-Arabic words, and excess whitespace.
Applying it to your dataset leaves the text in a more consistent format, paving the way for more effective processing and model training.
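The order of the substitutions matters: if punctuation is stripped first, the @ and . that the email pattern relies on are already gone by the time it runs. A minimal, self-contained check of this, where the sample string and the names EMAIL and SPECIAL are made up for illustration:

```python
import re

# Illustrative pattern names; the expressions mirror the ones used in clean
EMAIL = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
SPECIAL = r'[^\w\s]'

sample = "راسلني على test.user@example.com اليوم"  # made-up sample text

# Punctuation first: '@' and '.' are deleted, so the email pattern finds nothing
punct_first = re.sub(EMAIL, '', re.sub(SPECIAL, '', sample))

# Email first: the whole address is removed before punctuation is stripped
email_first = re.sub(SPECIAL, '', re.sub(EMAIL, '', sample))

print(punct_first)   # still contains the fused token "testuserexamplecom"
print(email_first)   # the address is gone
```

In the first ordering the leftover token is only removed later because it happens to look like an English word; an address containing digits or underscores would rely on the non-Arabic pattern instead.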
Remove examples that are too short
To ensure that your dataset contains only meaningful, sufficiently long text entries, you can apply a filter based on line length. This step removes entries that are too short to carry useful content.
import heapq
def paragraph_length_filter(x):
    lines = x['Text'].split('\n')
    if (
        len(lines) < 1
        or min(heapq.nlargest(3, [len(line) for line in lines])) < 3
    ):
        return False
    return True
dataset = ds.filter(
    paragraph_length_filter,
    load_from_cache_file=False
)
paragraph_length_filter: this function splits each text entry into lines and looks at the lengths of its three longest lines (or of all its lines, if there are fewer than three). If any of them is shorter than 3 characters, the function returns False, indicating that the entry should be excluded. Note that clean has already collapsed newlines into spaces, so at this point each entry is a single line and the filter in effect drops entries shorter than 3 characters.
By applying this filter you will keep only those text entries that are more likely to be useful for training, based on their length and content quality.
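To see how the length check behaves, here is a small sketch that runs the same logic on plain dictionaries instead of a Dataset. The filter is restated compactly so the snippet runs on its own, and the sample entries are invented for illustration:

```python
import heapq

def paragraph_length_filter(x):
    # Same rule as above: the (up to) three longest lines must each
    # be at least 3 characters long
    lines = x['Text'].split('\n')
    longest = heapq.nlargest(3, [len(line) for line in lines])
    return min(longest) >= 3

# Hypothetical entries, not taken from the real dataset
samples = [
    {'Text': 'اه'},                 # a single 2-character line
    {'Text': 'واش راك خويا'},       # a single 12-character line
    {'Text': 'واش راك خويا\nاه'},   # the second line is only 2 characters
]
results = [paragraph_length_filter(s) for s in samples]
print(results)  # [False, True, False]
```

The third entry is rejected even though its first line is long, because the minimum is taken over all of the longest lines, not just the longest one.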
Remove repeated text within training examples
To further refine your dataset, you can filter out entries with excessive repetition, that is, text in which too many paragraphs or characters are duplicated. This keeps the content diverse and useful for training.
find_duplicates: This function identifies duplicate paragraphs by checking each paragraph against a set of paragraphs already seen. It counts how many paragraphs are duplicates and the total number of characters in those duplicates.
import re
def find_duplicates(paragraphs):
    """
    Use this function to find the number of repetitions
    in the paragraphs.
    """
    unique_x = set()
    duplicate_chars = 0
    duplicate_elements = 0
    for element in paragraphs:
        if element in unique_x:
            duplicate_chars += len(element)
            duplicate_elements += 1
        else:
            unique_x.add(element)
    return duplicate_elements, duplicate_chars
paragraph_repetition_filter: This function splits the text into paragraphs on two or more consecutive newlines, then uses find_duplicates to count the duplicate paragraphs and the number of duplicated characters. If more than 30% of the paragraphs are duplicates, or duplicated characters make up more than 20% of the text, the function returns False, indicating that the entry should be filtered out. It only has an effect on entries that still contain paragraph breaks.
def paragraph_repetition_filter(x):
    """
    Returns False iff a page has too many repetitions.
    """
    text = x['Text']
    paragraphs = re.compile(r"\n{2,}").split(text.strip()) # Split by paragraphs (2 or more newlines)
    paragraphs_duplicates, char_duplicates = find_duplicates(paragraphs) # Find number of duplicates in paragraphs
    if paragraphs_duplicates / len(paragraphs) > 0.3:
        return False
    if char_duplicates / len(text) > 0.2:
        return False
    return True
# Apply the filter
dataset = dataset.filter(
    paragraph_repetition_filter,
    load_from_cache_file=False
)
This filtering step helps to ensure that your dataset is composed of diverse and meaningful text samples, which is essential for training robust and effective models.
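As a worked example of the two ratios, the sketch below computes them for two invented texts. The helper repetition_stats is a hypothetical name introduced here, and find_duplicates is restated so the snippet runs on its own:

```python
import re

def find_duplicates(paragraphs):
    # Count repeated paragraphs and the characters they contain
    seen = set()
    dup_elements = 0
    dup_chars = 0
    for p in paragraphs:
        if p in seen:
            dup_elements += 1
            dup_chars += len(p)
        else:
            seen.add(p)
    return dup_elements, dup_chars

def repetition_stats(text):
    # Hypothetical helper: returns (duplicate paragraph ratio, duplicate character ratio)
    paragraphs = re.split(r"\n{2,}", text.strip())
    dup_paragraphs, dup_chars = find_duplicates(paragraphs)
    return dup_paragraphs / len(paragraphs), dup_chars / len(text)

repeated = "aaaa\n\nbbbb\n\naaaa\n\naaaa"  # 4 paragraphs, 2 of them repeats
varied = "aaaa\n\nbbbb\n\ncccc\n\ndddd"    # 4 distinct paragraphs

print(repetition_stats(repeated))  # 2/4 paragraphs and 8/22 characters duplicated
print(repetition_stats(varied))    # (0.0, 0.0)
```

The first text exceeds both thresholds (0.5 > 0.3 and roughly 0.36 > 0.2), so the filter would drop it; the second passes.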
Remove Duplication
To further clean your dataset, you can remove duplicate entries with a deduplication function. This ensures that each entry in your dataset is unique, which is important for training high-quality models.
def deduplication(ds):
    unique_text = set()
    def dedup_func(x):
        """Use this function to remove duplicate entries"""
        if x['Text'] in unique_text:
            return False
        else:
            unique_text.add(x['Text'])
            return True
    ds = ds.filter(dedup_func, load_from_cache_file=False, num_proc=1)
    return ds
dataset = deduplication(dataset)
dedup_func: This inner function checks whether a text entry is already in the unique_text set. If it is, the function returns False, indicating that the entry should be filtered out. If it is not, the function adds the entry to the unique_text set and returns True, allowing it to remain in the dataset.
This de-duplication step helps ensure that your dataset contains only unique entries, reducing redundancy and improving the quality of your data.
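The set of seen texts lives inside a single Python process, which is presumably why the filter is run with num_proc=1: separate worker processes would each keep their own set and miss duplicates that land in different workers. The set also holds every distinct string in memory. One possible variation, which is not part of the original pipeline, is to store a fixed-size hash of each text instead. A minimal sketch on plain dictionaries, where dedup_records is a hypothetical helper name:

```python
import hashlib

def dedup_records(records):
    # Keep the first occurrence of each text, remembering only its SHA-256 digest
    seen = set()
    kept = []
    for rec in records:
        digest = hashlib.sha256(rec['Text'].encode('utf-8')).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(rec)
    return kept

# Invented records for illustration
records = [{'Text': 'a'}, {'Text': 'b'}, {'Text': 'a'}]
unique_records = dedup_records(records)
print(unique_records)  # [{'Text': 'a'}, {'Text': 'b'}]
```

Like the original function, this only catches exact duplicates; entries that differ by a single character are treated as distinct.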
Quality filter (Language Filtering)
First, download the pre-trained fastText language identification model. This model predicts the language of each text entry.
!wget -q https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
Use the downloaded model to filter your dataset, retaining only those entries predicted to be in the target language (Arabic in this case).
import fasttext
def language_filter(ds):
    # Load the language detection model
    model = fasttext.load_model('lid.176.bin')
    def is_arabic(x):
        # Predict the language of the text and its probability
        # (predict() handles one line at a time, so newlines are replaced)
        labels, scores = model.predict(x['Text'].replace("\n", " "))
        language = labels[0].replace("__label__", "")
        # scores is an array with one probability per returned label
        return bool(scores[0] > 0.4 and language == "ar") # change the code here if building a model in another language
    ds = ds.filter(is_arabic, load_from_cache_file=False, num_proc=1)
    return ds
dataset = language_filter(dataset)
Filtering function is_arabic: this function takes each text entry, predicts its language, and checks the probability of that prediction. The language code ar is used for Arabic; adjust this code if you're working with a different language.
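The decision rule itself can be tried without downloading the model. fastText's predict returns a pair: a tuple of labels such as __label__ar and an array of matching probabilities. The sketch below applies the same rule to hand-written pairs of that shape; keep_prediction is a hypothetical helper, and the scores are invented rather than real model output:

```python
def keep_prediction(labels, scores, target="ar", threshold=0.4):
    # labels look like ('__label__ar',); scores hold one probability per label
    if not labels:
        return False
    language = labels[0].replace("__label__", "")
    return language == target and float(scores[0]) > threshold

# Invented predictions for illustration
print(keep_prediction(("__label__ar",), [0.93]))  # True: Arabic, confident
print(keep_prediction(("__label__fr",), [0.99]))  # False: wrong language
print(keep_prediction(("__label__ar",), [0.25]))  # False: below the threshold
```

Raising the threshold keeps fewer, more confidently classified entries; lowering it keeps more text at the cost of letting in more mixed-language entries.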