Text Extraction and Structuring Using Large Language Models
Data extraction and structuring are crucial tasks in processing large volumes of text, enabling efficient organization and retrieval of information. Leveraging large language models (LLMs) for these tasks can significantly improve accuracy and automation. This guide explores how LLMs, particularly the Phi-3 model, can be used to streamline data extraction and structuring.
Why Use LLMs for These Tasks?
LLMs are well suited to data extraction and structuring because they can understand and process large volumes of text with high accuracy. They can automatically identify and categorize information, recognize patterns, and maintain context across lengthy documents. This reduces manual effort and increases efficiency, making LLMs a powerful tool for complex text-processing tasks.
Why Choose the Phi-3 Model?
The Phi-3 model, especially the microsoft/Phi-3-mini-128k-instruct variant, is an excellent choice for text filtering, data extraction, and structuring due to its distinctive features:
Size and Efficiency: The Phi-3 model offers a compact yet powerful solution, balancing performance and resource usage effectively. This makes it well suited for filtering tasks without imposing excessive computational demands.
Large Context Window (128k Tokens): A notable advantage of Phi-3 is its large context window, which enables the model to handle up to 128k tokens in a single pass. This is crucial for processing lengthy documents and maintaining coherence, thereby enhancing the accuracy of data extraction and structuring.
Contextual Understanding: Phi-3 excels in contextual comprehension, thanks to its advanced architecture and training. This deep understanding is essential for nuanced text filtering and extraction tasks, such as detecting sensitive information and categorizing content with complex criteria.
Code Example
Log in to Hugging Face
This command logs you into your Hugging Face account from the command line. It uses the --token flag to provide your authentication token, which grants access to download and use models from the Hugging Face Hub. Replace the placeholder with your actual token.
```python
! huggingface-cli login --token XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
```
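If you prefer to authenticate from Python rather than the shell, the huggingface_hub library provides an equivalent login helper. A minimal sketch (the token value is a placeholder):

```python
from huggingface_hub import login

# Programmatic equivalent of the CLI command above;
# replace the placeholder with your actual token
login(token="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")
```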
Importing Libraries
We import the necessary libraries: transformers for the model pipeline, fitz (PyMuPDF) for PDF parsing, pandas for assembling the results, and a few standard utilities.
```python
from transformers import pipeline
from datasets import load_dataset
from tqdm import tqdm
import pandas as pd
import re
import json
import fitz
import os
```
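Note that fitz is provided by the PyMuPDF package. If any of these libraries are missing from your environment, they can be installed first (a typical setup; versions are not pinned here):

```python
! pip install -q transformers datasets pymupdf pandas tqdm
```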
Set Up the Text-Generation Pipeline
To use the pipeline function with the microsoft/Phi-3-mini-128k-instruct model, the code first specifies the task as text-generation, which tells the pipeline that it will be used to generate text from the input it receives. It then loads the microsoft/Phi-3-mini-128k-instruct model, which is designed to handle large context windows and is well suited to generating detailed, coherent text.
```python
pipe = pipeline("text-generation",
                model="microsoft/Phi-3-mini-128k-instruct",
                trust_remote_code=True)
```
The trust_remote_code=True argument allows the custom code that ships with the model repository to be downloaded and executed, which is necessary for the model to load correctly.
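Once the pipeline is created, a quick smoke test confirms that the model loads and responds. A minimal sketch (the prompt here is arbitrary):

```python
# Sanity check: send a trivial chat message through the pipeline.
# With chat input the pipeline returns the whole conversation, so the
# assistant's reply is the last message in 'generated_text'.
test_messages = [{"role": "user", "content": "Reply with the single word: ready"}]
print(pipe(test_messages, max_new_tokens=10)[0]['generated_text'][-1]['content'])
```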
Define the System and User Prompts
The system_prompt sets up the model to extract titles and text from PDFs exactly as they appear, while the user_prompt directs the model to process the parsed text without modifying it and to return the results in a clear JSON format with titles and their associated text.
````python
system_prompt = """
You are a text extraction model specialized in extracting titles and their corresponding text from parsed PDF content. Ensure that the extracted content is preserved exactly as it appears in the PDF.
"""

user_prompt = """
Please process the provided parsed PDF text and extract titles and their corresponding text as they appear in the document. Do not rephrase or modify the content. Return the results in JSON format, where each entry includes the title and the associated text.
**Format:**
```json
[
  {
    "title": "Exact Title from PDF",
    "text": "Exact text associated with the title from PDF"
  },
  {
    "title": "Another Title",
    "text": "Text associated with another title"
  }
]
```
"""
````
Define the Generation Function
In the text2json function, the process begins by creating a list of messages in which the system_prompt and user_prompt define the model's role and task. The parsed PDF text is then appended as a labeled user message to direct the model.
```python
def text2json(text):
    """
    Take a string as input and return a JSON-formatted text structure
    with titles and their corresponding text.
    """
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
        {"role": "user", "content": f"MY PDF : {text}"}
    ]
    generation_args = {
        "max_new_tokens": 4096,
        "return_full_text": True,
        "temperature": 0.0,
        "do_sample": False  # greedy decoding for deterministic output
    }
    output = pipe(messages, **generation_args)
    # With return_full_text=True the pipeline returns the whole conversation;
    # the assistant's reply is the last message in the list.
    return output[0]['generated_text'][-1]['content']
```
After setting up the message list, the function specifies the generation parameters, such as a maximum of 4096 new tokens and a temperature of 0.0 with sampling disabled for deterministic output. It then runs the pipeline with these messages and parameters. Because return_full_text=True returns the entire conversation, the function returns the content of the last message, which is the model's JSON answer.
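As a quick usage example, text2json can be tried on a short sample string before running it over full documents (the sample text below is made up for illustration):

```python
# A tiny stand-in for parsed PDF text: two titles, each followed by a body
sample = ("Introduction\nThis report describes the data pipeline.\n"
          "Methods\nWe parsed each file and extracted the text.")
print(text2json(sample))
# Expected: a JSON array pairing each title with its associated text
```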
Define the PDF Parser Function
The parse_pdf function starts by initializing an empty string to hold the combined text from the PDF. It then opens the specified PDF file and iterates through each page, extracting text with fitz and appending it to the combined_text string.
```python
def parse_pdf(file_path):
    """
    Parse the specified PDF file and extract its text into a single string.
    Args:
        file_path (str): The path to the PDF file.
    Returns:
        str: The text extracted from the PDF.
    """
    combined_text = ""
    try:
        # Open the PDF file; the context manager closes it even if
        # an error occurs partway through
        with fitz.open(file_path) as pdf_document:
            # Extract text from each page
            for page_num in range(len(pdf_document)):
                page = pdf_document.load_page(page_num)
                combined_text += page.get_text()
    except Exception as e:
        print(f"Error processing {file_path}: {e}")
    return combined_text
```
After processing all pages, the document is closed automatically and the function returns the combined text as a single string. If an error occurs along the way, it prints an error message and returns whatever text was collected up to that point.
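For example, parse_pdf can be tried on a single file first (the path below is a hypothetical placeholder):

```python
text = parse_pdf("/kaggle/working/data/sample.pdf")  # hypothetical file path
print(text[:500])  # preview the first 500 characters of extracted text
```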
Define the JSON Extraction Function
The extract_json_from_string function begins by initializing two lists: one for storing successfully extracted JSON objects and another for capturing errors. It uses a regular expression to search for JSON-like structures (the array requested in the prompt, or bare objects) within the provided text. If no matches are found, an error is recorded. For each candidate string identified, the function attempts to parse it into a Python object using json.loads.
```python
def extract_json_from_string(text):
    """
    Extract JSON objects from a string.
    Args:
        text (str): The string to search for JSON objects.
    Returns:
        tuple: Two lists: successfully extracted JSON objects, and errors.
    """
    objs = []
    errors = []
    # Use regex to find JSON-like structures in the text. The prompt asks
    # the model for an array, so match [...] first, then bare {...} objects.
    matches = re.findall(r'\[.*\]|\{.*\}', text, re.DOTALL)
    if not matches:
        errors.append("No JSON patterns found")
    for index, json_string in enumerate(matches):
        try:
            # Parse the candidate string into a Python object
            json_obj = json.loads(json_string)
            objs.append(json_obj)
        except json.JSONDecodeError as e:
            errors.append((index, json_string, f"JSONDecodeError: {e}"))
        except Exception as e:
            errors.append((index, json_string, f"Error: {e}"))
    return objs, errors
```
Successfully parsed objects are added to the objs list, while any parsing errors are recorded along with the index and the problematic string. Finally, the function returns both the list of JSON objects and the list of errors encountered during extraction.
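A small self-contained example shows both return values (the input string is made up to mimic a model reply that wraps the JSON in prose):

```python
model_output = 'Here is the result:\n[{"title": "Introduction", "text": "Some text."}]'
objs, errors = extract_json_from_string(model_output)
print(objs)    # [[{'title': 'Introduction', 'text': 'Some text.'}]]
print(errors)  # []
```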
Define the Main Function
The llm_extractor function starts by listing the PDF files in the specified folder path, filtering for files with a .pdf extension. It then initializes an empty list to store results. For each PDF file, the function parses the text with parse_pdf, processes it with text2json to extract JSON data, and uses extract_json_from_string to parse the JSON, with tqdm providing a progress bar over the files.
```python
def llm_extractor(folder_path):
    """
    Extract JSON objects from the PDF files in the specified folder.
    Args:
        folder_path (str): The path to the folder containing the PDF files.
    Returns:
        DataFrame: A pandas DataFrame containing the extracted JSON objects.
    """
    pdf_files = [os.path.join(folder_path, file)
                 for file in os.listdir(folder_path)
                 if file.lower().endswith('.pdf')]
    obj_list = []
    for pdf_file in tqdm(pdf_files):
        pdf_text = parse_pdf(pdf_file)
        text_json = text2json(pdf_text)
        objs, _ = extract_json_from_string(text_json)
        obj_list.append({
            "pdf_file_name": pdf_file,
            "text": pdf_text,
            # Guard against files where no JSON could be extracted
            "json": objs[0] if objs else None
        })
    final_df = pd.DataFrame(obj_list)
    return final_df
```
The results, including the PDF file name, the extracted text, and the first extracted JSON object (or None when nothing could be parsed), are appended to obj_list. Finally, the function converts this list into a DataFrame and returns it.
Test the Function
To test the llm_extractor function, point it at a folder of PDF files:

```python
final_df = llm_extractor(folder_path="/kaggle/working/data")
```
You can save the result to a CSV file as follows:

```python
final_df.to_csv("output.csv")
```
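One caveat: to_csv serializes the nested json column as a plain string, which is awkward to parse back later. If you plan to reuse the structured output programmatically, exporting to JSON preserves the nesting (a minimal alternative):

```python
# Keeps the nested json column intact instead of stringifying it
final_df.to_json("output.json", orient="records")
```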