Introduction
In today’s digital age, social media platforms like LinkedIn are treasure troves of data. Analyzing this data can help us understand trends, engagement, and the overall effectiveness of posts. In this guide, we will explore how to leverage two powerful tools—DSPy and Pandas—to analyze LinkedIn posts and extract valuable insights. Our goal is to provide a step-by-step approach that is easy to follow and understand, even for beginners.
What is Pandas?
Pandas is a widely-used data manipulation library in Python, essential for data analysis. It provides powerful data structures like DataFrames, which allow you to organize and manipulate data in a tabular format (think of it like a spreadsheet). With Pandas, you can perform operations such as filtering, grouping, and aggregating data.
Key Features of Pandas
- DataFrame Structure: A DataFrame is a two-dimensional labeled data structure that can hold data of different types (like integers, floats, and strings).
- Data Manipulation: Pandas makes it easy to clean and preprocess data, making it ready for analysis.
- Integration with Other Libraries: It works well with other Python libraries, such as Matplotlib for visualization and NumPy for numerical operations.
For a foundational understanding of Pandas, check out Danielle B.’s Python Pandas Tutorial.
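To make those features concrete, here is a minimal, self-contained sketch (using made-up post data, not the LinkedIn dataset from this guide) of the filtering, grouping, and aggregation operations mentioned above:
import pandas as pd

# A tiny, made-up table of posts (illustration only, not the real dataset)
df = pd.DataFrame({
    'author': ['A', 'A', 'B', 'C'],
    'likes': [10, 25, 5, 40],
    'post_type': ['text', 'image', 'text', 'video'],
})

# Filtering: keep posts with more than 10 likes
popular = df[df['likes'] > 10]
print(popular)

# Grouping and aggregating: average likes per post type
print(df.groupby('post_type')['likes'].mean())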
What is DSPy?
DSPy is a framework for programming language models (LMs) rather than prompting them. Unlike traditional methods that rely heavily on hand-written prompts, DSPy enables users to structure data and model interactions more effectively, making it particularly useful for analyzing large datasets.
Key Features of DSPy
- Prompt Programming: DSPy is a framework designed to compile (and iteratively optimize) effective prompts to achieve the desired output from a query.
- High Reproducibility of Responses: When used with proper signatures and optimizers, DSPy can provide highly reliable and reproducible answers to your questions with zero hallucinations. We have tested DSPy over the last 21 days through various experiments 😎 with Mistral-Nemo as the LLM of choice, and it has either provided the correct answer or remained silent.
- Model Interactions: Unlike most ChatGPT clones and AI tools that use OpenAI or other models in the backend, DSPy offers the same methods for using local or online API-based LLMs to perform tasks. You can even use GPT-4o-mini as a manager or judge, local LLMs like Phi-3 as readers, and Mistral as writers. This allows you to create a complex system of LLMs and tasks, which in the field of Generative AI we refer to as a Generative Feedback Loop (GFL).
- Custom Dataset Loading: DSPy makes it easy to load and manipulate your own datasets or stream datasets from a remote or localhost server.
To get started with DSPy, visit the DSPy documentation, which includes detailed information on loading custom datasets.
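As a quick illustration (a minimal sketch, not part of this guide's pipeline), DSPy represents individual records as dspy.Example objects, where .with_inputs() marks which fields the model receives as input:
import dspy

# Hypothetical records for illustration only
rows = [
    {'post_text': 'Excited to announce our new AI course! hashtag #AI', 'post_type': 'announcement'},
    {'post_text': 'Five lessons from my first year as a data engineer', 'post_type': 'story'},
]

# Wrap each record as a DSPy Example; 'post_text' is the model input field
examples = [
    dspy.Example(post_text=r['post_text'], post_type=r['post_type']).with_inputs('post_text')
    for r in rows
]
print(examples[0].post_text)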
Systematic Optimization
Choose from a range of optimizers to enhance your program. Whether generating refined instructions or fine-tuning weights, DSPy’s optimizers are engineered to maximize efficiency and effectiveness.
Modular Approach
With DSPy, you can build your system using predefined modules, replacing intricate prompting techniques with straightforward and effective solutions.
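For example (a minimal sketch, assuming an LM has already been configured with dspy.settings.configure), two of DSPy's predefined modules can be declared from a simple signature string:
import dspy

# Predict is a single-step module; ChainOfThought adds intermediate reasoning
summarize = dspy.Predict('post_text -> summary')
classify = dspy.ChainOfThought('post_text -> post_type')

# Once an LM is configured via dspy.settings.configure(lm=...), calling a module
# returns a Prediction object exposing the declared output fields, e.g.:
# result = classify(post_text='Excited to announce our new AI course! hashtag #AI')
# print(result.post_type)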
Cross-LM Compatibility
Whether you’re working with powerhouse models like GPT-3.5 or GPT-4, or local models such as T5-base or Llama2-13b, DSPy seamlessly integrates and enhances their performance within your system.
Citations:
[1] https://dspy-docs.vercel.app
Getting started with LinkedIn post data
There are both free and paid web scraping tools available online. You can use any of them for educational purposes, as long as you do not collect personal data. Although we will release the dataset, for security reasons we have to refrain from revealing our sources.
The dataset we will be using is this Dataset.
Don’t try to open the dataset in Excel or Google Sheets; it might break! Open it in a text editor or in Microsoft Data Wrangler.
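Once your Python environment is set up (next section), you can also take a safe, programmatic peek at the file with Pandas instead of a spreadsheet; the path below is a placeholder for wherever you save the dataset:
import pandas as pd

# Load only the first few rows to inspect the structure (adjust the path to your own folder)
preview = pd.read_csv('data/linkedin_posts_cleaned_o.csv', nrows=5)
print(preview.columns.tolist())
print(preview.head())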
Loading the data
To get started, follow these steps:
- Download the Dataset: Download the dataset from the link provided above.
- Set Up a Python Virtual Environment:
  - Open your terminal or command prompt.
  - Navigate to the directory or folder where you want to set up the virtual environment.
  - Create a virtual environment by running the following command:
  python -m venv myenv
  - Activate the virtual environment (source myenv/bin/activate on macOS/Linux, or myenv\Scripts\activate on Windows).
- Create a Subfolder for the Data:
  - Inside your main directory, create a subfolder to hold the data. You can do this with the following command:
  mkdir data
- Create a Jupyter Notebook:
  - Install Jupyter Notebook if you haven’t already:
  pip install jupyter
  - Start Jupyter Notebook by running:
  jupyter notebook
  - In the Jupyter interface, create a new notebook in your desired directory.
- Follow Along: Use the notebook to analyze the dataset and perform your analysis.
By following these steps, you’ll be set up and ready to work with your dataset!
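The examples below also use Pandas, the emoji package, and DSPy. Assuming a standard PyPI setup, they can be installed inside the activated virtual environment with:
pip install pandas emoji dspy-ai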
Checking the text length of the posts
To gain some basic insights from the data we have, we will start by checking the length of the posts.
import pandas as pd
import os

def add_post_text_length(input_csv_path):
    # Read the CSV file into a DataFrame
    df = pd.read_csv(input_csv_path)

    # Check if 'Post Text' column exists
    if 'Post Text' not in df.columns:
        raise ValueError("The 'Post Text' column is missing from the input CSV file.")

    # Create a new column 'Post Text_len' with the length of 'Post Text'
    df['Post Text_len'] = df['Post Text'].apply(len)

    # Define the output CSV file path
    output_csv_path = os.path.join(os.path.dirname(input_csv_path), 'linkedin_posts_cleaned_An1.csv')

    # Write the modified DataFrame to a new CSV file
    df.to_csv(output_csv_path, index=False)

    print(f"New CSV file with post text lengths has been created at: {output_csv_path}")

# Example usage
input_csv = 'Your/directory/to/code/LinkedIn/pure _data/linkedin_posts_cleaned_o.csv'  # Replace with your actual CSV file path
add_post_text_length(input_csv)
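Since Pandas integrates with Matplotlib (as mentioned earlier), an optional sketch like the following can show how post lengths are distributed; it assumes the output file written by the step above:
import pandas as pd
import matplotlib.pyplot as plt

# Load the file produced by add_post_text_length() (adjust the path to your own directory)
df = pd.read_csv('Your/directory/to/code/LinkedIn/pure _data/linkedin_posts_cleaned_An1.csv')

# Plot a histogram of post lengths in characters
df['Post Text_len'].plot(kind='hist', bins=50, title='Distribution of post text length')
plt.xlabel('Characters per post')
plt.show()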
Emoji classification
Social media is a fun space, and LinkedIn is no exception—emojis are a clear indication of that. Let’s explore how many people are using emojis and the frequency of their usage.
import pandas as pd
import emoji
# Load your dataset
df = pd.read_csv('Your/directory/to/code/LinkedIn/pure _data/linkedin_posts_cleaned_An1.csv') ### change them
# Create a new column to check for emojis
df['has_emoji'] = df['Post Text'].apply(lambda x: 'yes' if any(char in emoji.EMOJI_DATA for char in x) else 'no')
# Optionally, save the updated dataset
df.to_csv('Your/directory/to/code/LinkedIn/pure _data/linkedin_posts_cleaned_An2.csv', index=False) ### change them
The code above will perform a binary classification of posts, distinguishing between those that contain emojis and those that do not.
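As a quick sanity check (assuming the DataFrame from the step above is still in memory), you can count how many posts fall into each class:
# How many posts contain at least one emoji?
print(df['has_emoji'].value_counts())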
Quantitative classification of emojis
We will analyze the data on emojis, concentrating on their usage by examining different emoji types and their frequency of use.
import pandas as pd
import emoji
from collections import Counter

# Load the dataset
df = pd.read_csv('Your/directory/to/code/LinkedIn/pure _data/linkedin_posts_cleaned_An2.csv') ### change them

# Function to analyze emojis in the post text
def analyze_emojis(post_text):
    # Extract emojis from the text
    emojis_in_text = [char for char in post_text if char in emoji.EMOJI_DATA]

    # Count total number of emojis
    num_emojis = len(emojis_in_text)

    # Count frequency of each emoji
    emoji_counts = Counter(emojis_in_text)

    # Prepare lists of emojis and their frequencies
    emoji_list = list(emoji_counts.keys()) if emojis_in_text else ['N/A']
    frequency_list = list(emoji_counts.values()) if emojis_in_text else [0]

    return num_emojis, emoji_list, frequency_list

# Apply the function to the 'Post Text' column and assign results to new columns
df[['Num_emoji', 'Emoji_list', 'Emoji_frequency']] = df['Post Text'].apply(
    lambda x: pd.Series(analyze_emojis(x))
)

# Optionally, save the updated dataset
df.to_csv('Your/directory/to/code/LinkedIn/pure _data/linkedin_posts_cleaned_An3.csv', index=False) ### change them

# Display the updated DataFrame
print(df[['Serial Number', 'Post Text', 'Num_emoji', 'Emoji_list', 'Emoji_frequency']].head())
Hashtag classification
Hashtags are an important feature of online posts, as they provide valuable context about the content. Analyzing the hashtags in this dataset will help us conduct more effective Exploratory Data Analysis (EDA) in the upcoming steps.
We will do both a binary classification of whether a post uses hashtags and an extraction of the hashtags that have been used.
import pandas as pd
import re

# Load the dataset
df = pd.read_csv('Your/directory/to/code/DSPyW/LinkedIn/pure _data/linkedin_posts_cleaned_An3.csv')

# Function to check for hashtags and list them
def analyze_hashtags(post_text):
    # Find all hashtags in the post text using regex
    hashtags = re.findall(r'hashtag\s+#\s*(\w+)', post_text)

    # Check if any hashtags were found
    has_hashtags = 'yes' if hashtags else 'no'

    # Return the has_hashtags flag and the list of hashtags
    return has_hashtags, hashtags if hashtags else ['N/A']

# Apply the function to the 'Post Text' column and assign results to new columns
df[['Has_Hashtags', 'Hashtag_List']] = df['Post Text'].apply(
    lambda x: pd.Series(analyze_hashtags(x))
)

# Optionally, save the updated dataset
df.to_csv('Your/directory/to/code/DSPyW/LinkedIn/pure _data/linkedin_posts_cleaned_An4.csv', index=False)

# Display the updated DataFrame
print(df[['Serial Number', 'Post Text', 'Has_Hashtags', 'Hashtag_List']].head())
Prepare the dataset for DSPy
DSPy works best with datasets structured as a list of dictionaries. We will convert our dataset into a list of dictionaries and learn to split it into training and testing sets for future experiments, coming soon on AI&U.
import pandas as pd
import dspy
from dspy.datasets.dataset import Dataset

class CSVDataset(Dataset):
    def __init__(self, file_path, train_size=5, dev_size=50, test_size=0, train_seed=1, eval_seed=2023) -> None:
        super().__init__()

        # Define the inputs
        self.file_path = file_path
        self.train_size = train_size
        self.dev_size = dev_size
        self.test_size = test_size
        self.train_seed = train_seed
        # Just to have a default seed for future testing
        self.eval_seed = eval_seed

        # Load the CSV file into a DataFrame
        df = pd.read_csv(file_path)

        # Shuffle the DataFrame for randomness
        df = df.sample(frac=1, random_state=train_seed).reset_index(drop=True)

        # Split the DataFrame into train, dev, and test sets
        self._train = df.iloc[:train_size].to_dict(orient='records')  # Training data
        self._dev = df.iloc[train_size:train_size + dev_size].to_dict(orient='records')  # Development data
        self._test = df.iloc[train_size + dev_size:train_size + dev_size + test_size].to_dict(orient='records')  # Testing data (if any)
# Example usage
# filepath
filepath='Your/directory/to/code/DSPyW/LinkedIn/pure _data/linkedin_posts_cleaned_An4.csv' # change it
# Create an instance of the CSVDataset
dataset = CSVDataset(file_path=filepath,train_size=200, dev_size=200, test_size=1100, train_seed=64, eval_seed=2023)
# Accessing the datasets
train_data = dataset._train
dev_data = dataset._dev
test_data = dataset._test
# Print the number of samples in each dataset
print(f"Number of training samples: {len(train_data)}, \n\n--- sample: {train_data[0]['Post Text'][:300]}") ### showing post text till 30 characters
print(f"Number of development samples: {len(dev_data)}")
print(f"Number of testing samples: {len(test_data)}")
Setting up LLMs for inference
We are using **mistral-nemo:latest** as a strong local LLM for inference, as it can run on most gaming laptops and has performed reliably in our experiments over the last few weeks.
Mistral NeMo is a state-of-the-art language model developed through a collaboration between Mistral AI and NVIDIA. It features 12 billion parameters and is designed to excel in various tasks such as reasoning, world knowledge application, and coding accuracy. Here are some key aspects of Mistral NeMo:
Key Features
- Large Context Window: Mistral NeMo can handle a context length of up to 128,000 tokens, allowing it to process long-form content and complex documents effectively [1], [2].
- Performance: This model is noted for its advanced reasoning capabilities and exceptional accuracy in coding tasks, outperforming other models of similar size, such as Gemma 2 and Llama 3, in various benchmarks [2], [3].
- Multilingual Support: Mistral NeMo supports a wide range of languages, including English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi, making it versatile for global applications [2], [3].
- Tokenizer: It utilizes a new tokenizer called Tekken, which is more efficient at compressing natural language text and source code than previous models. This tokenizer enhances performance across multiple languages [2], [3].
- Integration and Adaptability: Mistral NeMo is built on a standard architecture that allows it to be easily integrated into existing systems as a drop-in replacement for earlier models like Mistral 7B [1], [2].
- Fine-tuning and Alignment: The model has undergone advanced fine-tuning to enhance its ability to follow instructions and engage in multi-turn conversations, making it suitable for interactive applications [2], [3].
Mistral NeMo is released under the Apache 2.0 license, promoting its adoption for both research and enterprise use.
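The code below assumes an Ollama server is running locally, since http://localhost:11434 is Ollama's default OpenAI-compatible endpoint. If you have not already downloaded the model, you can pull it first with:
ollama pull mistral-nemo
If you use a different backend or hosted API, adjust api_base, api_key, and model accordingly.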
import dspy
# Define the language model
olm = dspy.OpenAI(api_base="http://localhost:11434/v1/", api_key="ollama", model="mistral-nemo:latest", stop='\n\n', model_type='chat')
dspy.settings.configure(lm=olm)
Using DSPy Signatures to Contextualize and Classify LinkedIn Posts
We are using hashtags and emojis as guides to classify the posts made on LinkedIn.
Hashtags, being strings of text, are known to act as good hints, but we also want to check whether emojis are equally powerful features for finding context.
The final dataset will contain these classifications and contexts.
In future experiments we will explore ways to achieve high accuracy in predicting the context and classification.
import dspy

# Define the signature for the model
class PostContext(dspy.Signature):
    """Summarize the LinkedIn post context in 15 words and classify it into the type of post."""
    post_text = dspy.InputField(desc="Can be a social media post about a topic; ignore all occurrences of \n, \n\n, \n\n\n")
    emoji_hint = dspy.InputField(desc="is a list of emojis that can be in the post_text")
    hashtag_hint = dspy.InputField(desc="is a list of hashtags like 'hashtag\s+#\s*(\w+)' that gives a hint on the main topic")
    context = dspy.OutputField(desc="Generate a 10 word faithful summary that describes the context of the post_text using the hashtag_hint and emoji_hint")
    classify = dspy.OutputField(desc="Classify the subject of the post_text using the context as hint, ONLY GIVE 20 Word CLASSIFICATION, DON'T give Summary")
# Select only the desired keys for DSPy
selected_keys = ['Post Text','Post Text_len','has_emoji','Num_emoji','Emoji_list','Emoji_frequency','Has_Hashtags', 'Hashtag_List']
# Prepare trainset and devset for DSPy
trainset = [{key: item[key] for key in selected_keys if key in item} for item in train_data]
devset = [{key: item[key] for key in selected_keys if key in item} for item in dev_data]
testset=[{key: item[key] for key in selected_keys if key in item} for item in test_data]
# Print lengths of the prepared datasets
#print(f"Length of trainset: {len(trainset)}")
#print(f"Length of devset: {len(devset)}")
# Define the language model
olm = dspy.OpenAI(api_base="http://localhost:11434/v1/", api_key="ollama", model="mistral-nemo:latest", stop='\n\n', model_type='chat')
dspy.settings.configure(lm=olm)
# Initialize the ChainOfThoughtWithHint model
predict_context = dspy.ChainOfThoughtWithHint(PostContext)

# Example prediction for one post in the dev set
if devset:
    example_post = devset[5]
    prediction = predict_context(
        post_text=example_post['Post Text'],
        emoji_hint=example_post['Emoji_list'],
        hashtag_hint=example_post['Hashtag_List']
    )
    print(f"Predicted Context for the example post:\n{prediction.context}\n\n the type of post can be classified as:\n\n {prediction.classify} \n\n---- And the post is:\n {example_post['Post Text'][:300]} \n\n...... ")
    #print(example_post['Post Text_len'])
Now we will move on to creating the context and classification for the whole dataset.
First, make a subset of the data that has hashtags and emojis, which can be used for faithful classification, and test whether the model is working.
# Define the language model
olm = dspy.OpenAI(api_base="http://localhost:11434/v1/", api_key="ollama", model="mistral-nemo:latest", stop='\n\n', model_type='chat')
dspy.settings.configure(lm=olm)

# Initialize the ChainOfThoughtWithHint model
predict_context_with_hint = dspy.ChainOfThoughtWithHint(PostContext)

for i in range(len(trainset)):
    if trainset[i]["Post Text_len"] < 1700 and trainset[i]["Has_Hashtags"] == "yes":
        ideal_post = trainset[i]
        prediction = predict_context_with_hint(
            post_text=ideal_post['Post Text'],
            emoji_hint=ideal_post['Emoji_list'],
            hashtag_hint=ideal_post['Hashtag_List']
        )
        print(f"The predicted Context is:\n\n {prediction.context}\n\n And the type of post is:\n\n {prediction.classify}\n\n-----")
    else:
        continue
Writing the subset to a new version of the input CSV file with context and classification
Now that we have classified and contextualized the posts, we can store the data in a new CSV file.
import pandas as pd
import dspy
import os
# Define the language Model
olm = dspy.OpenAI(api_base="http://localhost:11434/v1/", api_key="ollama", model="mistral-nemo:latest", stop='\n\n', model_type='chat')
dspy.settings.configure(lm=olm)
# Initialize the ChainOfThoughtWithHint model
predict_context_with_hint = dspy.ChainOfThoughtWithHint(PostContext)
def process_csv(input_csv_path):
    # Read the CSV file into a DataFrame
    df = pd.read_csv(input_csv_path)

    # Check if necessary columns exist
    if 'Post Text' not in df.columns or 'Post Text_len' not in df.columns or 'Has_Hashtags' not in df.columns:
        raise ValueError("The input CSV must contain 'Post Text', 'Post Text_len', and 'Has_Hashtags' columns.")

    # Create new columns for predictions
    df['Predicted_Context'] = None
    df['Predicted_Post_Type'] = None

    # Iterate over the DataFrame rows
    for index, row in df.iterrows():
        if row["Post Text_len"] < 1600 and row["Has_Hashtags"] == "yes":
            prediction = predict_context_with_hint(
                post_text=row['Post Text'],
                emoji_hint=row['Emoji_list'],
                hashtag_hint=row['Hashtag_List']
            )
            df.at[index, 'Predicted_Context'] = prediction.context
            df.at[index, 'Predicted_Post_Type'] = prediction.classify

    # Define the output CSV file path
    output_csv_path = os.path.join(os.path.dirname(input_csv_path), 'LinkedIn_data_final_output.csv')

    # Write the modified DataFrame to a new CSV file
    df.to_csv(output_csv_path, index=False)

    print(f"New CSV file with predictions has been created at: {output_csv_path}")

# Example usage
input_csv = 'Your/directory/to/code/DSPyW/LinkedIn/pure _data/linkedin_posts_cleaned_An4.csv'  # Replace with your actual CSV file path
process_csv(input_csv)
Conclusion
Combining DSPy with Pandas provides a robust framework for extracting insights from LinkedIn posts. By following the outlined steps, you can effectively analyze data, visualize trends, and derive meaningful conclusions. This guide serves as a foundational entry point for those interested in leveraging data science tools to enhance their understanding of social media dynamics.
By utilizing the resources and coding examples provided, you can gain valuable insights from your LinkedIn posts and apply these techniques to other datasets for broader applications in data analysis. Start experimenting with your own LinkedIn data today and discover the insights waiting to be uncovered!
This guide is designed to be engaging and informative, ensuring that readers, regardless of their experience level, can follow along and gain valuable insights from their LinkedIn posts. Happy analyzing!
References
- Danielle B.’s Post – Python pandas tutorial – LinkedIn 🐼💻 Excited to share some insights into using pandas for data analysis in Py…
- Unlocking the Power of Data Science with DSPy: Your Gateway to AI … Our YouTube channel, “DSPy: Data Science and AI Mastery,” is your ultimate …
- Creating a Custom Dataset – DSPy To create a list of Example objects, we can simply load data from the source and…
- Models Don’t Matter: Building Compound AI Systems with DSPy and … To get started, we’ll install the DSPy library, set up the DBRX fo…
- A Step-by-Step Guide to Data Analysis with Pandas and NumPy In this blog post, we will walk through a step-by-step guide on h…
- DSPy: The framework for programming—not prompting—foundation … DSPy is a framework for algorithmically optimizing LM prom…
- An Exploratory Tour of DSPy: A Framework for Programing … – Medium An Exploratory Tour of DSPy: A Framework for Programing Language M…
- Inside DSPy: The New Language Model Programming Framework … The DSPy compiler methodically traces the program’…
- Leann Chen on LinkedIn: #rag #knowledgegraphs #dspy #diffbot We designed a custom DSPy pipeline integrating with knowledge graphs. The …
- What’s the best way to use Pandas in Program of Thought #1004 I want to build an agent to answer questions using…
Let’s take this conversation further—join us on LinkedIn here.
Want more in-depth analysis? Head over to AI&U today.