Clean your Text (part one)


Before diving into the cleaning and filtering process, you’ll need to set up your environment with the necessary libraries. Start by installing a few key packages: datasets, emoji, and fasttext.

!pip install datasets emoji fasttext

With the necessary libraries in place, it’s time to load your dataset and begin the cleaning process. First, we’ll use the datasets library to load the Algerian Darija dataset. This gives us access to the training split, which we can then process and refine.


from datasets import load_dataset

ds = load_dataset("ayoubkirouane/Algerian-Darija", split='train')
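A quick look at the loaded object confirms it has the shape you expect. This is a minimal sanity check, assuming the split exposes the Text column we clean throughout the rest of this post:

print(ds)             # number of rows and column names
print(ds[0]['Text'])  # one sample entry from the Text column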

Remove Emojis

Next up is handling emojis in your dataset. Emojis can be a source of noise, especially when they’re not relevant to the text analysis. To clean up the dataset, we’ll define a function that removes emojis and then apply it to every example.

import emoji

# Function to remove emojis
def remove_emojis(example):
    example['Text'] = emoji.replace_emoji(example['Text'], replace='')
    return example

# Apply the function to the 'Text' column
ds = ds.map(remove_emojis)

We use the emoji library’s replace_emoji function to strip emojis from the Text field. The remove_emojis function updates each entry, and ds.map(remove_emojis) applies it to every example in the dataset. This step helps clean up your text, making it more uniform and ready for further processing.
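To see what replace_emoji does on its own, here is a minimal standalone check; the sample string below is made up for illustration:

import emoji

sample = "صباح الخير 😀🔥"  # hypothetical sample containing emojis
print(emoji.replace_emoji(sample, replace=''))  # prints the text with the emojis stripped out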


Remove Emails / Phone Numbers / special characters / English words / non-Arabic words

Now let’s move on to more comprehensive text cleaning. This step tackles several common issues in the dataset, such as URLs, special characters, English words, non-Arabic text, email addresses, and phone numbers.

import re
def clean(example):
    example['Text'] = re.sub(r'http[s]?://\S+', '', example['Text'])  # Remove URLs
    example['Text'] = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '', example['Text'])  # Remove email addresses (before stripping punctuation, otherwise '@' and '.' would already be gone)
    example['Text'] = re.sub(r'\b\d{10,15}\b', '', example['Text'])  # Remove phone numbers (10 to 15 digits)
    example['Text'] = re.sub(r'[^\w\s]', '', example['Text'])  # Remove special characters (anything that is not a word character or whitespace)
    example['Text'] = re.sub(r'\b[A-Za-z]+\b', '', example['Text'])  # Remove English words
    example['Text'] = re.sub(r'\b[^\u0600-\u06FF\s]+\b', '', example['Text'])  # Remove non-Arabic words (anything that is not Arabic)
    example['Text'] = re.sub(r'\n+', ' ', example['Text'])  # Replace newlines with a space
    example['Text'] = re.sub(r'\s+', ' ', example['Text']).strip()  # Collapse repeated whitespace and strip leading/trailing spaces

    return example

ds = ds.map(clean)

The clean function uses regular expressions to remove URLs, email addresses, phone numbers, special characters, English words, non-Arabic words, and excessive whitespace, ensuring the text is cleaner and more uniform for analysis.

Applying this function to your dataset will help ensure that the text is in a more consistent and analyzable format, paving the way for more effective processing and model training.
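As a quick sanity check, you can run clean on a single hand-crafted example. The text below is made up, and the exact output depends on the regex order shown above:

sample = {'Text': "Contact me at user@example.com or visit https://example.com!\nكيفاش راك؟"}
print(clean(sample)['Text'])  # roughly "كيفاش راك" — only the Arabic words survive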


Remove examples that are too short

To ensure that your dataset contains only meaningful and sufficiently long text entries, you can apply a filter based on line length. This step is crucial for removing entries that are too short or have too little content.

import heapq

def paragraph_length_filter(x):
    lines = x['Text'].split('\n')
    if (
        len(lines) < 1
        or min(heapq.nlargest(3, [len(line) for line in lines])) < 3
    ):
        return False
    return True

dataset = ds.filter(
    paragraph_length_filter,
    load_from_cache_file=False
)

paragraph_length_filter: this function evaluates each text entry by splitting it into lines. It then takes the (up to) three longest lines and requires each of them to be at least 3 characters long. If that condition is not met, the function returns False, indicating that the entry should be excluded.

By applying this filter you will keep only those text entries that are more likely to be useful for training, based on their length and content quality.
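Here is what the filter does on a couple of made-up entries:

# Illustrative check on hand-made entries
print(paragraph_length_filter({'Text': 'صح'}))                    # False: the longest line has fewer than 3 characters
print(paragraph_length_filter({'Text': 'هذا نص طويل بما يكفي'}))  # True: long enough to keep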


Remove repeated text within training examples

To further refine your dataset by removing entries with excessive repetition, you can implement a filter that identifies and excludes text with too many duplicated paragraphs or characters. This keeps the content diverse and useful for training.

find_duplicates: This function identifies duplicate paragraphs by comparing each paragraph against a set of unique paragraphs seen so far. It counts how many paragraphs are duplicates and the total number of characters in those duplicates.

import re

def find_duplicates(paragraphs):
    """
    Use this function to find the number of repetitions
    in the paragraphs.
    """
    unique_x = set()
    duplicate_chars = 0
    duplicate_elements = 0
    for element in paragraphs:
        if element in unique_x:
            duplicate_chars += len(element)
            duplicate_elements += 1
        else:
            unique_x.add(element)
    return duplicate_elements, duplicate_chars
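A quick illustration of what find_duplicates returns on a made-up list of paragraphs:

paras = ["مرحبا", "كيفاش", "مرحبا", "مرحبا"]
print(find_duplicates(paras))  # (2, 10): two repeated paragraphs, 10 duplicated characters in total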

paragraph_repetition_filter: This function splits the text into paragraphs on two or more consecutive newlines, then uses find_duplicates to count the duplicate paragraphs and the amount of duplicated text. If the proportion of duplicate paragraphs exceeds 30% or the proportion of duplicate characters exceeds 20%, the function returns False, indicating that the entry should be filtered out.

def paragraph_repetition_filter(x):
    """
    Returns False iff a page has too many repetitions.
    """
    text = x['Text']
    paragraphs = re.compile(r"\n{2,}").split(text.strip())                # Split by paragraphs (2 or more newlines)
    paragraphs_duplicates, char_duplicates = find_duplicates(paragraphs)  # Find number of duplicates in paragraphs
    if paragraphs_duplicates / len(paragraphs) > 0.3:
        return False
    if char_duplicates / len(text) > 0.2:
        return False
    return True

# Apply The filter
dataset = dataset.filter(
    paragraph_repetition_filter,
    load_from_cache_file=False
)

This filtering step helps to ensure that your dataset is composed of diverse and meaningful text samples, which is essential for training robust and effective models.
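For intuition, here is the filter applied to a made-up entry whose paragraphs are mostly repeated:

# Illustrative check: 2 of the 3 paragraphs are duplicates, so the entry is dropped
repetitive = {'Text': "نفس الفقرة\n\nنفس الفقرة\n\nنفس الفقرة"}
print(paragraph_repetition_filter(repetitive))  # False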


Remove Duplication

To further clean your dataset by removing duplicate entries, you can use a deduplication function. This ensures that each entry in the dataset is unique, which is important for training high-quality models.

def deduplication(ds):
    def dedup_func(x):
        """Use this function to remove duplicate entries"""
        if x['Text'] in unique_text:
            return False
        else:
            unique_text.add(x['Text'])
            return True

    unique_text = set()

    ds = ds.filter(dedup_func, load_from_cache_file=False, num_proc=1)
    return ds

dataset = deduplication(dataset)

dedup_func: This inner function checks whether a text entry is already in the unique_text set. If it is, the function returns False, indicating that the entry should be filtered out. If it is not, the function adds the entry to unique_text and returns True, keeping it in the dataset.

This de-duplication step helps ensure that your dataset contains only unique entries, reducing redundancy and improving the quality of your data.
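You can verify the behaviour on a tiny made-up dataset built with Dataset.from_dict:

from datasets import Dataset

tiny = Dataset.from_dict({'Text': ["مرحبا", "كيفاش", "مرحبا"]})
print(len(deduplication(tiny)))  # 2: the repeated entry is dropped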


Quality filter (Language Filtering)

First, you will need to download the pre-trained FastText language identification model. This model predicts the language of each text entry.

!wget -q https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

Use the downloaded model to filter your dataset, retaining only the entries predicted to be in the target language (Arabic in this case).

from fasttext.FastText import _FastText

def language_filter(ds):
    # load language detection model
    model = _FastText('lid.176.bin')

    def is_arabic(x):
        # Predict language of the text and probability
        language, score = model.predict(x['Text'].replace("\n", ""))

        language = language[0].split("__")[2]  # '__label__ar' -> 'ar'
        return score[0] > 0.4 and language == "ar"  # change the language code here if building a model in another language

    ds = ds.filter(is_arabic, load_from_cache_file=False, num_proc=1)
    return ds

dataset = language_filter(dataset)

Filtering function is_arabic: this function takes each text entry, predicts its language, and checks the prediction probability. The language code ar corresponds to Arabic; adjust it if you're working with a different target language.
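If you want to see what the raw prediction looks like before the label is stripped, you can call the model directly. The sample sentence is made up, and the exact confidence score will vary:

model = _FastText('lid.176.bin')
labels, scores = model.predict("واش راك اليوم")
print(labels, scores)  # e.g. ('__label__ar',) with a confidence array — the score depends on the input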