Hosted with nbsanity. See source notebook on GitHub.

Notes for myself:

Common uses for webscraping

  1. Academic/Scientific:
    • Research paper titles and abstracts
    • Citation counts
    • Author information
    • Conference proceedings
  2. Government/Public Data:
    • Census information
    • Public records
    • Environmental data
    • Crime statistics
  3. Business/Economic:
    • Company financial reports
    • Market research data
    • Industry statistics
    • Job market trends
  4. Healthcare/Medical:
    • Clinical trial data
    • Medical research findings
    • Health statistics
    • Disease outbreak data

Good websites to scrape:

  • Ones with clean, consistent HTML structure.

Goal

  • Practice scraping.
  • Build something useful and fun.

Plan

  1. Scrape wikipedia articles.
  2. Find which ones are related to X. X can be:
    • genetics/genetic_engineering/genomics/bioinformatics.
    • Origin of life (RNA?)
    • statistics/probability.
  3. Use claudette to figure out:
    • How interesting the article is. Rating from 1 to 10.
    • How difficult the article is. 1-10 (beginner to expert).
    • Maybe produce a more readable version?
  4. Save them.
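The last step of the plan ("Save them") could be a simple JSON dump. This is just a hedged sketch: `save_analyses`, the field names, and the demo object are placeholders, not code from this notebook.

```python
import json
from pathlib import Path
from types import SimpleNamespace

def save_analyses(analyses: dict, path: str = "analyses.json") -> None:
    # vars() turns each analysis object's attributes into a plain dict
    data = {url: vars(a) for url, a in analyses.items()}
    Path(path).write_text(json.dumps(data, indent=2))

# Demo with a stand-in analysis object (hypothetical field names)
demo = {"https://en.wikipedia.org/wiki/CRISPR":
        SimpleNamespace(interest_rating=8, difficulty_rating=7, prerequisites=["genetics"])}
save_analyses(demo, "demo_analyses.json")
```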
import httpx
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Evolution_of_snake_venom"
response = httpx.get(url)
soup = BeautifulSoup(response.text, 'lxml')
title = soup.find('h1', {'id': 'firstHeading'})
print(f"Article title: {title.text}")
Article title: Evolution of snake venom
import html2text
from IPython.display import Markdown

# content = soup.select("#mw-content-text > div.mw-content-ltr.mw-parser-output")[0]
content = soup.select_one("#mw-content-text > div.mw-content-ltr.mw-parser-output")
paragraphs = content.find_all('p')

intro_paras = []
for p in paragraphs:
    # Check if we've hit an h2 heading's div
    if p.find_previous('div', class_='mw-heading mw-heading2'):
        break
    if p.text.strip():  # Only keep non-empty paragraphs
        intro_paras.append(p)

for p in intro_paras:
    if p.text.strip():
        print(p.text.strip())
        print("\n---\n")
Venom in snakes and some lizards is a form of saliva that has been modified into venom over its evolutionary history.[1] In snakes, venom has evolved to kill or subdue prey, as well as to perform other diet-related functions.[2] While snakes occasionally use their venom in self defense, this is not believed to have had a strong effect on venom evolution.[3] The evolution of venom is thought to be responsible for the enormous expansion of snakes across the globe.[4][5][6]

---

The evolutionary history of snake venom is a matter of debate. Historically, snake venom was believed to have evolved once, at the base of the Caenophidia, or derived snakes. Molecular studies published beginning in 2006 suggested that venom originated just once among a putative clade of reptiles, called Toxicofera, approximately 170 million years ago.[7] Under this hypothesis, the original toxicoferan venom was a very simple set of proteins that were assembled in a pair of glands. Subsequently, this set of proteins diversified in the various lineages of toxicoferans, including Serpentes, Anguimorpha, and Iguania: several snake lineages also lost the ability to produce venom.[8][9] The Toxicoferan hypothesis was challenged by studies in the mid-2010s, including a 2015 study which found that venom proteins had homologs in many other tissues in the Burmese python.[10][11]  The study therefore suggested that venom had evolved independently in different reptile lineages, including once in the Caenophid snakes.[10] Venom containing most extant toxin families is believed to have been present in the last common ancestor of the Caenophidia: these toxins subsequently underwent tremendous diversification, accompanied by changes in the morphology of venom glands and delivery systems.[12]

---

Snake venom evolution is thought to be driven by an evolutionary arms race between venom proteins and prey physiology.[13] The common mechanism of evolution is thought to be gene duplication followed by natural selection for adaptive traits.[14] The adaptations produced by this process include venom more toxic to specific prey in several lineages,[15][16][17] proteins that pre-digest prey,[18] and a method to track down prey after a bite.[19] These various adaptations of venom have also led to considerable debate about the definition of venom and venomous snakes.[20] Changes in the diet of a lineage have been linked to atrophication of the venom.[8][9]

---

AI Prompt

Can you wrap this in a function that takes a url and returns intro paragraphs that’s already stripped?

AI Response

I’ll help you create a function that extracts and cleans the introduction paragraphs from a Wikipedia article. Here’s the code:

def get_wiki_intro(url):
    """
    Get the introduction paragraphs from a Wikipedia article.
    Returns a list of cleaned paragraph texts.
    """
    response = httpx.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    
    # Get the main content
    content = soup.select_one("#mw-content-text > div.mw-content-ltr.mw-parser-output")
    paragraphs = content.find_all('p')
    
    # Collect intro paragraphs (stop at first heading)
    intro_texts = []
    for p in paragraphs:
        # Break if we hit a heading
        if p.find_previous('div', class_='mw-heading mw-heading2'):
            break
        # Only keep non-empty paragraphs
        text = p.text.strip()
        if text:
            intro_texts.append(text)
    
    return intro_texts

# Test it
url = "https://en.wikipedia.org/wiki/Evolution_of_snake_venom"
intros = get_wiki_intro(url)
for p in intros:
    print(p)
    print("\n---\n")

Would you like to test this function with a different Wikipedia article to make sure it works as expected?

def get_wiki_intro(url):
    """
    Get the introduction paragraphs from a Wikipedia article.
    Returns a list of cleaned paragraph texts.
    """
    response = httpx.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    content = soup.select_one("#mw-content-text > div.mw-content-ltr.mw-parser-output")
    paragraphs = content.find_all('p')
    
    intro_texts = []
    for p in paragraphs:
        if p.find_previous('div', class_='mw-heading mw-heading2'):
            break
        text = p.text.strip()
        if text:
            intro_texts.append(text)
    return intro_texts
url = "https://en.wikipedia.org/wiki/Evolution_of_snake_venom"
intros = get_wiki_intro(url)
for p in intros:
    print(p)
    print("\n---\n")
Venom in snakes and some lizards is a form of saliva that has been modified into venom over its evolutionary history.[1] In snakes, venom has evolved to kill or subdue prey, as well as to perform other diet-related functions.[2] While snakes occasionally use their venom in self defense, this is not believed to have had a strong effect on venom evolution.[3] The evolution of venom is thought to be responsible for the enormous expansion of snakes across the globe.[4][5][6]

---

The evolutionary history of snake venom is a matter of debate. Historically, snake venom was believed to have evolved once, at the base of the Caenophidia, or derived snakes. Molecular studies published beginning in 2006 suggested that venom originated just once among a putative clade of reptiles, called Toxicofera, approximately 170 million years ago.[7] Under this hypothesis, the original toxicoferan venom was a very simple set of proteins that were assembled in a pair of glands. Subsequently, this set of proteins diversified in the various lineages of toxicoferans, including Serpentes, Anguimorpha, and Iguania: several snake lineages also lost the ability to produce venom.[8][9] The Toxicoferan hypothesis was challenged by studies in the mid-2010s, including a 2015 study which found that venom proteins had homologs in many other tissues in the Burmese python.[10][11]  The study therefore suggested that venom had evolved independently in different reptile lineages, including once in the Caenophid snakes.[10] Venom containing most extant toxin families is believed to have been present in the last common ancestor of the Caenophidia: these toxins subsequently underwent tremendous diversification, accompanied by changes in the morphology of venom glands and delivery systems.[12]

---

Snake venom evolution is thought to be driven by an evolutionary arms race between venom proteins and prey physiology.[13] The common mechanism of evolution is thought to be gene duplication followed by natural selection for adaptive traits.[14] The adaptations produced by this process include venom more toxic to specific prey in several lineages,[15][16][17] proteins that pre-digest prey,[18] and a method to track down prey after a bite.[19] These various adaptations of venom have also led to considerable debate about the definition of venom and venomous snakes.[20] Changes in the diet of a lineage have been linked to atrophication of the venom.[8][9]

---

AI Prompt

Please show me how to use claudette to get the following information from the text into a structured data:

  • How interesting the article is. Rating from 1 to 10 (not interesting to very interesting). Assuming the reader does not know anything about the topic.
  • How difficult the article is. 1-10 (beginner to expert). Assuming the reader does not know anything about the topic.
  • What are some prerequisite information that the reader should know before reading this article.
  • What are some key takeaways from the article.

AI Response

I’ll help you create a structured data class and use Claudette to analyze the Wikipedia article content. Here’s how we can do this:

class ArticleAnalysis:
    "Analysis of a Wikipedia article"
    def __init__(self,
                interest_rating: int,  # Rating from 1-10 of how interesting the article is
                difficulty_rating: int,  # Rating from 1-10 of how difficult the article is
                prerequisites: list[str],  # List of topics reader should know beforehand
                key_takeaways: list[str],  # List of main points from the article
                ):
        assert 1 <= interest_rating <= 10, "Interest rating must be between 1 and 10"
        assert 1 <= difficulty_rating <= 10, "Difficulty rating must be between 1 and 10"
        store_attr()
    
    __repr__ = basic_repr('interest_rating, difficulty_rating, prerequisites, key_takeaways')

# Create a chat instance with the article analysis tool
from claudette import Chat, Client, models

chat = Chat(models[1], tools=[ArticleAnalysis])

# Get analysis for an article
def analyze_article(url):
    intro_texts = get_wiki_intro(url)
    prompt = f"""Based on these introduction paragraphs from a Wikipedia article:

{' '.join(intro_texts)}

Analyze this for a general reader with no prior knowledge of the topic. Return:
- An interest rating (1-10)
- A difficulty rating (1-10)
- A list of prerequisite knowledge needed
- Key takeaways from the text

Focus on making this accessible to someone new to the topic."""

    return chat.structured(prompt, ArticleAnalysis)

# Test it
url = "https://en.wikipedia.org/wiki/Evolution_of_snake_venom"
analysis = analyze_article(url)
print(analysis)

Let’s run this and see what Claude comes up with! This will give us structured data that we can easily store or process further. Would you like me to explain any part of this code in more detail?

Note that we’re using:

  • The store_attr() function which I see was imported in your context
  • The basic_repr helper which I also see was available
  • The Sonnet model (models[1]) as recommended in the Claudette documentation

Would you like to test this with the snake venom article or try it with a different Wikipedia article?

from claudette import Chat, Client, models
from fastcore.utils import *
models
['claude-3-opus-20240229',
 'claude-3-5-sonnet-20241022',
 'claude-3-haiku-20240307',
 'claude-3-5-haiku-20241022']
class ArticleAnalysis:
    "Analysis of a Wikipedia article"
    def __init__(self,
                interest_rating: int,    # Rating 1-10 of how interesting the article is for a general reader
                difficulty_rating: int,  # Rating 1-10 of how difficult the article is for a general reader
                prerequisites: list[str], # List of topics reader should know before reading
    ):
        assert 1 <= interest_rating <= 10, "Interest rating must be between 1 and 10"
        assert 1 <= difficulty_rating <= 10, "Difficulty rating must be between 1 and 10"
        store_attr()
        
    __repr__ = basic_repr('interest_rating, difficulty_rating, prerequisites')
chat = Chat(models[1])
client = Client(models[1])
article_text = "\n\n".join(intros)
prompt = f"""Here's a Wikipedia article introduction:

<article>
{article_text}
</article>

Analyze this text and provide:
1. An interest rating (1-10) for general readers with no prior knowledge
2. A difficulty rating (1-10) for general readers with no prior knowledge
3. Prerequisites that would help understand this article

Be objective in your analysis and provide the information in the exact structure requested."""
result = client.structured(prompt, ArticleAnalysis)
print(result[0])
ArticleAnalysis(interest_rating=7, difficulty_rating=8, prerequisites=['Basic biology concepts', 'Understanding of evolution', 'Knowledge of DNA and proteins', 'Familiarity with scientific terms like homolog and lineage', 'Basic understanding of natural selection'])
result
[ArticleAnalysis(interest_rating=7, difficulty_rating=8, prerequisites=['Basic biology concepts', 'Understanding of evolution', 'Knowledge of DNA and proteins', 'Familiarity with scientific terms like homolog and lineage', 'Basic understanding of natural selection'])]

AI Prompt

How does the prompt look? Can you improve it?

AI Response

The prompt is good, but we can improve it to get more consistent and structured responses. Here’s an enhanced version:

prompt = f"""Here's a Wikipedia article introduction:

<article>
{article_text}
</article>

Analyze this text and provide a structured analysis in exactly this format:

INTEREST RATING (1-10):
- Choose a rating where 1 means "not engaging for general readers" and 10 means "highly fascinating for anyone"
- Consider factors like real-world relevance, wow-factor, and storytelling
- Rate assuming the reader has no prior knowledge of the topic

DIFFICULTY RATING (1-10):
- Choose a rating where 1 means "easily understood by anyone" and 10 means "requires expert knowledge"
- Consider technical terminology, complex concepts, and assumed knowledge
- Rate assuming the reader has no specialized background

PREREQUISITES:
- List 3-5 basic concepts or topics that would help understand this article better
- Focus on fundamental ideas, not advanced concepts
- Each prerequisite should be a single line starting with "-"

Provide your analysis in a way that can be directly parsed into the ArticleAnalysis class structure."""

result = chat.structured(prompt, ArticleAnalysis)

The improvements include:

  1. Clear rating scale definitions for both interest and difficulty
  2. Specific criteria for how to rate each aspect
  3. Structured format for prerequisites and takeaways
  4. Explicit number ranges for lists to ensure consistency
  5. Clear formatting instructions for list items
  6. Reminder about the target audience (general readers)
  7. Explicit mention that it needs to be parseable for the ArticleAnalysis class

Would you like me to run this improved version to see how it performs?

prompt = f"""Here's a Wikipedia article introduction:

<article>
{article_text}
</article>

Analyze this text and provide a structured analysis in exactly this format:

INTEREST RATING (1-10):
- Choose a rating where 1 means "not engaging for general readers" and 10 means "highly fascinating for anyone"
- Consider factors like real-world relevance, wow-factor, and storytelling
- Rate assuming the reader has no prior knowledge of the topic

DIFFICULTY RATING (1-10):
- Choose a rating where 1 means "easily understood by anyone" and 10 means "requires expert knowledge"
- Consider technical terminology, complex concepts, and assumed knowledge
- Rate assuming the reader has no specialized background

PREREQUISITES:
- List 3-5 basic concepts or topics that would help understand this article better
- Focus on fundamental ideas, not advanced concepts
- Each prerequisite should be a single line starting with "-"

Provide your analysis in a way that can be directly parsed into the ArticleAnalysis class structure."""
result = client.structured(prompt, ArticleAnalysis)
print(result[0])
ArticleAnalysis(interest_rating=8, difficulty_rating=7, prerequisites=['Basic understanding of evolution and natural selection', 'Knowledge of what proteins and genes are', 'Familiarity with basic animal classification', 'Understanding of what saliva is and its function'])

AI Prompt

Can you wrap this in a function so I can loop it through other articles? Please wrap article_text in tags.

AI Response

Here’s a function that combines everything we’ve done, with properly tagged text and error handling:

def analyze_wiki_article(url):
    """
    Analyzes a Wikipedia article and returns structured information about its
    difficulty, interest level, prerequisites, and key takeaways.
    
    Args:
        url (str): URL of the Wikipedia article
    
    Returns:
        ArticleAnalysis: Structured analysis of the article
    """
    try:
        # Get article intro
        intro_texts = get_wiki_intro(url)
        article_text = "\n\n".join(intro_texts)
        
        # Create prompt with tagged text
        prompt = f"""Here's a Wikipedia article introduction:

<article_text>
{article_text}
</article_text>

Analyze this text and provide a structured analysis in exactly this format:

INTEREST RATING (1-10):
- Choose a rating where 1 means "not engaging for general readers" and 10 means "highly fascinating for anyone"
- Consider factors like real-world relevance, wow-factor, and storytelling
- Rate assuming the reader has no prior knowledge of the topic

DIFFICULTY RATING (1-10):
- Choose a rating where 1 means "easily understood by anyone" and 10 means "requires expert knowledge"
- Consider technical terminology, complex concepts, and assumed knowledge
- Rate assuming the reader has no specialized background

PREREQUISITES:
- List 3-5 basic concepts or topics that would help understand this article better
- Focus on fundamental ideas, not advanced concepts
- Each prerequisite should be a single line starting with "-"

KEY TAKEAWAYS:
- List 4-5 main points from the article
- Focus on the most important concepts
- Make them accessible to general readers

Provide your analysis in a way that can be directly parsed into the ArticleAnalysis class structure."""

        # Get analysis
        client = Client(models[1])
        result = client.structured(prompt, ArticleAnalysis)
        return result[0]
    
    except Exception as e:
        print(f"Error analyzing {url}: {str(e)}")
        return None

# Example usage:
urls = [
    "https://en.wikipedia.org/wiki/Evolution_of_snake_venom",
    "https://en.wikipedia.org/wiki/CRISPR",
    # Add more URLs here
]

analyses = {}
for url in urls:
    analysis = analyze_wiki_article(url)
    if analysis:
        analyses[url] = analysis
        print(f"\nAnalysis for {url}:")
        print(analysis)

This function:

  1. Combines the wiki scraping and analysis into one function
  2. Adds proper error handling
  3. Includes more detailed prompting for better analysis
  4. Returns structured data that can be easily stored or processed
  5. Can be used in a loop for multiple articles

You can use it like this for a single article:

analysis = analyze_wiki_article("https://en.wikipedia.org/wiki/CRISPR")
print(analysis)

Or process multiple articles:

urls = [
    "https://en.wikipedia.org/wiki/Genetic_engineering",
    "https://en.wikipedia.org/wiki/RNA_world",
    "https://en.wikipedia.org/wiki/Bioinformatics"
]

results = {url: analyze_wiki_article(url) for url in urls}

Would you like to test it with some specific articles?

AI Prompt

Divide analyze_wiki_article into smaller bits to make them more testable.

AI Response

I’ll help break down the analyze_wiki_article function into smaller, more testable components:

def create_analysis_prompt(article_text: str) -> str:
    """
    Creates a structured prompt for article analysis.
    
    Args:
        article_text (str): The article text to analyze
    
    Returns:
        str: Formatted prompt for Claude
    """
    return f"""Here's a Wikipedia article introduction:

<article_text>
{article_text}
</article_text>

Analyze this text and provide a structured analysis in exactly this format:

INTEREST RATING (1-10):
- Choose a rating where 1 means "not engaging for general readers" and 10 means "highly fascinating for anyone"
- Consider factors like real-world relevance, wow-factor, and storytelling
- Rate assuming the reader has no prior knowledge of the topic

DIFFICULTY RATING (1-10):
- Choose a rating where 1 means "easily understood by anyone" and 10 means "requires expert knowledge"
- Consider technical terminology, complex concepts, and assumed knowledge
- Rate assuming the reader has no specialized background

PREREQUISITES:
- List 3-5 basic concepts or topics that would help understand this article better
- Focus on fundamental ideas, not advanced concepts
- Each prerequisite should be a single line starting with "-"

Provide your analysis in a way that can be directly parsed into the ArticleAnalysis class structure."""

def get_article_analysis(article_text: str, client: Client) -> ArticleAnalysis:
    """
    Gets structured analysis of article text using Claude.
    
    Args:
        article_text (str): The article text to analyze
        client (Client): Claudette client instance
    
    Returns:
        ArticleAnalysis: Structured analysis of the article
    """
    prompt = create_analysis_prompt(article_text)
    result = client.structured(prompt, ArticleAnalysis)
    return result[0]

def analyze_wiki_article(url: str, client: Client = None) -> ArticleAnalysis:
    """
    Analyzes a Wikipedia article and returns structured information.
    
    Args:
        url (str): URL of the Wikipedia article
        client (Client): Optional Claudette client instance
    
    Returns:
        ArticleAnalysis: Structured analysis of the article
    """
    try:
        # Get article intro
        intro_texts = get_wiki_intro(url)
        article_text = "\n\n".join(intro_texts)
        
        # Create client if not provided
        if client is None:
            client = Client(models[1])
            
        # Get analysis
        return get_article_analysis(article_text, client)
    
    except Exception as e:
        print(f"Error analyzing {url}: {str(e)}")
        return None

def analyze_multiple_articles(urls: list[str]) -> dict:
    """
    Analyzes multiple Wikipedia articles.
    
    Args:
        urls (list[str]): List of Wikipedia URLs
    
    Returns:
        dict: Dictionary mapping URLs to their analyses
    """
    client = Client(models[1])  # Create single client instance
    analyses = {}
    
    for url in urls:
        analysis = analyze_wiki_article(url, client)
        if analysis:
            analyses[url] = analysis
            print(f"\nAnalysis for {url}:")
            print(analysis)
    
    return analyses

# Example usage:
if __name__ == "__main__":
    # Test single article
    url = "https://en.wikipedia.org/wiki/CRISPR"
    analysis = analyze_wiki_article(url)
    print(f"\nSingle article analysis:\n{analysis}")
    
    # Test multiple articles
    urls = [
        "https://en.wikipedia.org/wiki/Genetic_engineering",
        "https://en.wikipedia.org/wiki/RNA_world",
        "https://en.wikipedia.org/wiki/Bioinformatics"
    ]
    results = analyze_multiple_articles(urls)

Now we have:

  1. create_analysis_prompt: Creates the prompt text (easy to test and modify)
  2. get_article_analysis: Handles the Claude interaction (can be tested with a mock client)
  3. analyze_wiki_article: Coordinates the whole process for one article
  4. analyze_multiple_articles: Handles multiple articles efficiently

Benefits of this structure:

  • Each function has a single responsibility
  • Easy to test each component independently
  • Can mock the Claude client for testing
  • Reuse the client instance for multiple articles
  • Better error isolation
  • Functions are more readable and maintainable

Would you like me to show how to test any of these components?
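As a sketch of that mock-client testing idea: a fake client exposing the same structured() method can stand in for the real one, so get_article_analysis runs without any API call. FakeClient and the simplified stand-in classes below are illustrative only, not Claudette code.

```python
# Simplified stand-ins for the notebook's classes, to keep the sketch self-contained
class ArticleAnalysis:
    def __init__(self, interest_rating, difficulty_rating, prerequisites):
        self.interest_rating = interest_rating
        self.difficulty_rating = difficulty_rating
        self.prerequisites = prerequisites

class FakeClient:
    "Mimics Client.structured() with a canned result -- no network needed"
    def structured(self, prompt, cls):
        return [cls(interest_rating=6, difficulty_rating=4, prerequisites=['basic biology'])]

def create_analysis_prompt(article_text: str) -> str:
    return f"<article_text>\n{article_text}\n</article_text>"

def get_article_analysis(article_text, client):
    return client.structured(create_analysis_prompt(article_text), ArticleAnalysis)[0]

result = get_article_analysis("Snake venom evolved from saliva.", FakeClient())
print(result.interest_rating, result.prerequisites)
```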

def create_analysis_prompt(article_text: str) -> str:
    """
    Creates a structured prompt for article analysis.
    
    Args:
        article_text (str): The article text to analyze
    
    Returns:
        str: Formatted prompt for Claude
    """
    return f"""Here's a Wikipedia article introduction:

<article_text>
{article_text}
</article_text>

Analyze this text and provide a structured analysis in exactly this format:

INTEREST RATING (1-10):
- Choose a rating where 1 means "not engaging for general readers" and 10 means "highly fascinating for anyone"
- Consider factors like real-world relevance, wow-factor, and storytelling
- Rate assuming the reader has no prior knowledge of the topic

DIFFICULTY RATING (1-10):
- Choose a rating where 1 means "easily understood by anyone" and 10 means "requires expert knowledge"
- Consider technical terminology, complex concepts, and assumed knowledge
- Rate assuming the reader has no specialized background

PREREQUISITES:
- List 3-5 basic concepts or topics that would help understand this article better
- Focus on fundamental ideas, not advanced concepts
- Each prerequisite should be a single line starting with "-"

Provide your analysis in a way that can be directly parsed into the ArticleAnalysis class structure."""
def get_article_analysis(article_text: str, client: Client) -> ArticleAnalysis:
    """
    Gets structured analysis of article text using Claude.
    
    Args:
        article_text (str): The article text to analyze
        client (Client): Claudette client instance
    
    Returns:
        ArticleAnalysis: Structured analysis of the article
    """
    prompt = create_analysis_prompt(article_text)
    result = client.structured(prompt, ArticleAnalysis)
    return result[0]
get_article_analysis(article_text, client)
ArticleAnalysis(interest_rating=8, difficulty_rating=7, prerequisites=['Basic understanding of evolution and natural selection', 'Knowledge of what proteins are', 'Familiarity with family trees/common ancestors', 'Basic understanding of DNA and genes'])
def analyze_wiki_article(url: str, client: Client = None) -> ArticleAnalysis:
    """
    Analyzes a Wikipedia article and returns structured information.
    
    Args:
        url (str): URL of the Wikipedia article
        client (Client): Optional Claudette client instance
    
    Returns:
        ArticleAnalysis: Structured analysis of the article
    """
    try:
        # Get article intro
        intro_texts = get_wiki_intro(url)
        article_text = "\n\n".join(intro_texts)
        
        # Create client if not provided
        if client is None:
            client = Client(models[1])
            
        # Get analysis
        return get_article_analysis(article_text, client)
    
    except Exception as e:
        print(f"Error analyzing {url}: {str(e)}")
        return None
analyze_wiki_article(url, client)
ArticleAnalysis(interest_rating=8, difficulty_rating=7, prerequisites=['Basic understanding of evolution and natural selection', 'Knowledge of what proteins and genes are', 'Familiarity with basic animal classification', 'Understanding of what saliva is and its function'])
def analyze_multiple_articles(urls: list[str]) -> dict:
    """
    Analyzes multiple Wikipedia articles.
    
    Args:
        urls (list[str]): List of Wikipedia URLs
    
    Returns:
        dict: Dictionary mapping URLs to their analyses
    """
    client = Client(models[1])  # Create single client instance
    analyses = {}
    
    for url in urls:
        analysis = analyze_wiki_article(url, client)
        if analysis:
            analyses[url] = analysis
            print(f"\nAnalysis for {url}:")
            print(analysis)
    
    return analyses
    
urls = [
    "https://en.wikipedia.org/wiki/Genetic_engineering",
    "https://en.wikipedia.org/wiki/RNA_world",
    "https://en.wikipedia.org/wiki/Bioinformatics"
]
results = analyze_multiple_articles(urls)
results

Analysis for https://en.wikipedia.org/wiki/Genetic_engineering:
ArticleAnalysis(interest_rating=8, difficulty_rating=7, prerequisites=['Basic cell biology and DNA structure', 'Understanding of genes and inheritance', 'Basic knowledge of proteins and enzymes', 'Familiarity with agriculture and food production'])

Analysis for https://en.wikipedia.org/wiki/RNA_world:
ArticleAnalysis(interest_rating=8, difficulty_rating=7, prerequisites=['Basic cell biology', 'DNA and genetic information', 'Chemical molecules and bonds', 'Evolution fundamentals'])

Analysis for https://en.wikipedia.org/wiki/Bioinformatics:
ArticleAnalysis(interest_rating=7, difficulty_rating=8, prerequisites=['Basic biology and cell concepts', 'Fundamental computer programming concepts', 'Basic statistics and data analysis', 'Understanding of DNA and genes'])
{'https://en.wikipedia.org/wiki/Genetic_engineering': ArticleAnalysis(interest_rating=8, difficulty_rating=7, prerequisites=['Basic cell biology and DNA structure', 'Understanding of genes and inheritance', 'Basic knowledge of proteins and enzymes', 'Familiarity with agriculture and food production']),
 'https://en.wikipedia.org/wiki/RNA_world': ArticleAnalysis(interest_rating=8, difficulty_rating=7, prerequisites=['Basic cell biology', 'DNA and genetic information', 'Chemical molecules and bonds', 'Evolution fundamentals']),
 'https://en.wikipedia.org/wiki/Bioinformatics': ArticleAnalysis(interest_rating=7, difficulty_rating=8, prerequisites=['Basic biology and cell concepts', 'Fundamental computer programming concepts', 'Basic statistics and data analysis', 'Understanding of DNA and genes'])}

AI Prompt

I want to upgrade analyze_multiple_articles to use ThreadPoolExecutor if we are in interactive environment and ProcessPoolExecutor if we are running it in a script. Can you do this?

AI Response

Here’s an upgraded version of analyze_multiple_articles that uses parallel execution with the appropriate executor based on the environment:

import concurrent.futures
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import sys
import threading

def is_interactive():
    """Check if we're running in an interactive environment (like Jupyter/IPython)"""
    return hasattr(sys, 'ps1') or bool(sys.flags.interactive) or 'ipykernel' in sys.modules

def analyze_article_wrapper(url: str, client: Client = None) -> tuple[str, ArticleAnalysis]:
    """
    Wrapper function for parallel processing that returns both url and analysis.
    
    Args:
        url (str): URL of the Wikipedia article
        client (Client): Optional Claudette client instance
    
    Returns:
        tuple: (url, analysis result)
    """
    analysis = analyze_wiki_article(url, client)
    return url, analysis

def analyze_multiple_articles(urls: list[str], max_workers: int = None) -> dict:
    """
    Analyzes multiple Wikipedia articles in parallel.
    Uses ThreadPoolExecutor in interactive environments and ProcessPoolExecutor in scripts.
    
    Args:
        urls (list[str]): List of Wikipedia URLs
        max_workers (int, optional): Maximum number of parallel workers
    
    Returns:
        dict: Dictionary mapping URLs to their analyses
    """
    # Create single client instance for thread safety
    client = Client(models[1])
    
    # Choose executor based on environment
    executor_class = ThreadPoolExecutor if is_interactive() else ProcessPoolExecutor
    
    # If max_workers not specified, use the smaller of URL count and CPU count
    if max_workers is None:
        max_workers = min(len(urls), os.cpu_count() or 1)
    
    analyses = {}
    print(f"Using {executor_class.__name__} with {max_workers} workers")
    
    with executor_class(max_workers=max_workers) as executor:
        # Submit all tasks
        future_to_url = {
            executor.submit(analyze_article_wrapper, url, client): url 
            for url in urls
        }
        
        # Process completed tasks as they finish
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                url, analysis = future.result()
                if analysis:
                    analyses[url] = analysis
                    print(f"\nAnalysis for {url}:")
                    print(analysis)
            except Exception as e:
                print(f"Error analyzing {url}: {str(e)}")
    
    return analyses

# Example usage:
if __name__ == "__main__":
    urls = [
        "https://en.wikipedia.org/wiki/Genetic_engineering",
        "https://en.wikipedia.org/wiki/RNA_world",
        "https://en.wikipedia.org/wiki/Bioinformatics",
        "https://en.wikipedia.org/wiki/CRISPR",
        "https://en.wikipedia.org/wiki/DNA_sequencing"
    ]
    
    results = analyze_multiple_articles(urls)

Key improvements:

  1. Automatically detects interactive vs script environment
  2. Uses ThreadPoolExecutor for interactive (Jupyter/IPython) environments
  3. Uses ProcessPoolExecutor for script execution
  4. Handles task completion as they finish (doesn’t wait for all to complete)
  5. Includes proper error handling for each task
  6. Dynamic worker count based on system resources
  7. Thread-safe client handling

You can use it the same way as before:

# For testing a few articles
urls = [
    "https://en.wikipedia.org/wiki/Genetic_engineering",
    "https://en.wikipedia.org/wiki/RNA_world",
    "https://en.wikipedia.org/wiki/Bioinformatics"
]
results = analyze_multiple_articles(urls)

# Or with custom number of workers
results = analyze_multiple_articles(urls, max_workers=2)

Note that when using ProcessPoolExecutor (in script mode), each process gets its own Claude client instance, which can be more efficient for large numbers of articles but uses more API tokens. The ThreadPoolExecutor (interactive mode) shares a single client instance among threads, which is more token-efficient but potentially slower for large numbers of articles.
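The per-process behavior only holds if each worker actually builds its own client rather than receiving a pickled copy from the parent. One way to do that is lazy per-process creation; a minimal sketch, using a hypothetical stand-in for the notebook’s claudette Client:

```python
import os

_client = None  # module-level cache: at most one client per process

class _StubClient:
    """Hypothetical stand-in for the notebook's claudette Client."""
    def __init__(self):
        self.pid = os.getpid()  # record which process built it

def get_client():
    """Build the client lazily, so each worker process creates its own
    instead of receiving a pickled copy from the parent process."""
    global _client
    if _client is None:
        _client = _StubClient()
    return _client

# Repeated calls in the same process reuse the cached instance
assert get_client() is get_client()
```

In script mode, each worker would call get_client() inside analyze_article_wrapper instead of taking a client argument from the parent.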

Would you like me to explain any part in more detail or show how to test it?

analyze_wiki_article("https://en.wikipedia.org/wiki/Catfish", client)
ArticleAnalysis(interest_rating=8, difficulty_rating=6, prerequisites=['Basic fish anatomy', 'What vertebrates are', 'Understanding of animal classification', 'Concept of species diversity'])
analyze_wiki_article("https://en.wikipedia.org/wiki/United_Kingdom", client)
ArticleAnalysis(interest_rating=8, difficulty_rating=5, prerequisites=['Basic geography and map reading', 'Concept of monarchy and democracy', 'Basic world history timeline', 'Understanding of government systems'])
analyze_wiki_article("https://en.wikipedia.org/wiki/Batman", client)
ArticleAnalysis(interest_rating=9, difficulty_rating=3, prerequisites=['Basic understanding of comic books and superheroes', 'Concept of fiction and entertainment media', 'Basic knowledge of American pop culture', 'Understanding of what a vigilante/crime-fighter is'])
analyze_wiki_article("https://en.wikipedia.org/wiki/Poop_emoji", client)
ArticleAnalysis(interest_rating=7, difficulty_rating=3, prerequisites=['Basic understanding of emojis and digital communication', 'Familiarity with Unicode and character encoding concepts', 'Knowledge of cross-cultural communication', 'Basic awareness of Japanese pop culture'])

AI Prompt

What would be very boring for people? To get an interest rating below 3?

AI Response

Let’s try some topics that might be considered less engaging for general readers. Here are some URLs we can test that might get low interest ratings:

boring_urls = [
    "https://en.wikipedia.org/wiki/ISO_8601",  # Date/time formatting standard
    "https://en.wikipedia.org/wiki/Null_pointer",  # Technical computing concept
    "https://en.wikipedia.org/wiki/Administrative_division",  # Bureaucratic organization
    "https://en.wikipedia.org/wiki/List_of_ISO_standards",  # Technical standards list
    "https://en.wikipedia.org/wiki/Tax_form",  # Tax documentation
]

# Test these potentially boring articles
client = Client(models[1])
for url in boring_urls:
    result = analyze_wiki_article(url, client)
    print(f"\nURL: {url}")
    print(result)

I’ve chosen these because they:

  1. Are highly technical/bureaucratic
  2. Lack narrative elements
  3. Have limited real-world relevance for most people
  4. Don’t contain surprising or fascinating elements
  5. Are mostly about administrative or technical standards

Let’s run this and see if we can find articles that Claude rates below 3 for interest! It would be interesting to see if Claude actually rates anything that low, given that it tends to find interesting aspects in most topics.

boring_urls = [
    "https://en.wikipedia.org/wiki/ISO_8601",  # Date/time formatting standard
    "https://en.wikipedia.org/wiki/Null_pointer",  # Technical computing concept
    "https://en.wikipedia.org/wiki/Administrative_division",  # Bureaucratic organization
    "https://en.wikipedia.org/wiki/List_of_ISO_standards",  # Technical standards list
    "https://en.wikipedia.org/wiki/Tax_form",  # Tax documentation
]

client = Client(models[1])
for url in boring_urls:
    result = analyze_wiki_article(url, client)
    print(f"\nURL: {url}")
    print(result)

URL: https://en.wikipedia.org/wiki/ISO_8601
ArticleAnalysis(interest_rating=6, difficulty_rating=7, prerequisites=['Basic understanding of calendars and time zones', 'Familiarity with different date formats (US vs European)', 'Knowledge of what standards are and why they matter', 'Basic computer character concepts (ASCII/text)'])

URL: https://en.wikipedia.org/wiki/Null_pointer
ArticleAnalysis(interest_rating=7, difficulty_rating=8, prerequisites=['Basic computer programming concepts', 'Understanding of computer memory', 'What variables and data types are', 'Concept of program errors and crashes'])

URL: https://en.wikipedia.org/wiki/Administrative_division
ArticleAnalysis(interest_rating=5, difficulty_rating=7, prerequisites=['Basic geography concepts', 'What a sovereign state is', 'Concept of government and authority', 'Basic political terminology'])

URL: https://en.wikipedia.org/wiki/List_of_ISO_standards
ArticleAnalysis(interest_rating=3, difficulty_rating=6, prerequisites=['Basic understanding of what standards are', 'Knowledge of what organizations and committees are', 'Familiarity with copyright concepts', 'Basic understanding of technical documentation'])

URL: https://en.wikipedia.org/wiki/Tax_form
ArticleAnalysis(interest_rating=6, difficulty_rating=4, prerequisites=['Basic understanding of income and money', 'Concept of government and public services', 'Understanding of forms and paperwork', 'Basic knowledge of what taxes are'])

AI Prompt

Do you agree with Claude?

AI Response

Looking at Claude’s ratings, I notice some interesting patterns that I don’t entirely agree with:

  1. ISO_8601 (rated 6/10): This seems too high for what is essentially a technical standard about date/time formatting. While it has practical relevance, the topic itself is quite dry and mainly about standardization rules.

  2. Null_pointer (rated 7/10): This rating seems particularly inflated. While null pointers are important in programming, the concept is quite technical and abstract for general readers.

  3. Administrative_division (rated 5/10): This rating feels closer to reality. It’s a dry topic about how countries divide their territories administratively.

  4. List_of_ISO_standards (rated 3/10): This is the only rating that feels about right. A list of technical standards is indeed one of the least engaging topics for general readers.

  5. Tax_form (rated 6/10): This seems too high. While tax forms are relevant to many people’s lives, the topic itself is generally considered quite dry and bureaucratic.

Let’s try some potentially even more boring topics:

more_boring_urls = [
    "https://en.wikipedia.org/wiki/Form_1040",  # Specific tax form
    "https://en.wikipedia.org/wiki/ASCII_decimal_code",  # Character encoding details
    "https://en.wikipedia.org/wiki/Binary_logarithm",  # Mathematical concept
    "https://en.wikipedia.org/wiki/HTTP_header_field",  # Technical web standard
    "https://en.wikipedia.org/wiki/Regular_expression_syntax"  # Programming syntax
]

for url in more_boring_urls:
    result = analyze_wiki_article(url, client)
    print(f"\nURL: {url}")
    print(result)

I think Claude tends to be overly optimistic in its interest ratings, possibly because it’s trained to find educational value in topics. A more realistic scale might rate these technical and administrative topics between 2-4 for general reader interest.

more_boring_urls = [
    "https://en.wikipedia.org/wiki/Form_1040",  # Specific tax form
    "https://en.wikipedia.org/wiki/ASCII_decimal_code",  # Character encoding details
    "https://en.wikipedia.org/wiki/Binary_logarithm",  # Mathematical concept
    "https://en.wikipedia.org/wiki/HTTP_header_field",  # Technical web standard
    "https://en.wikipedia.org/wiki/Regular_expression_syntax"  # Programming syntax
]

for url in more_boring_urls:
    result = analyze_wiki_article(url, client)
    print(f"\nURL: {url}")
    print(result)

URL: https://en.wikipedia.org/wiki/Form_1040
ArticleAnalysis(interest_rating=6, difficulty_rating=7, prerequisites=['Basic understanding of taxation', 'Understanding of government forms and paperwork', 'Knowledge of calendar year and fiscal deadlines', 'Basic math skills'])
Error analyzing https://en.wikipedia.org/wiki/ASCII_decimal_code: 'NoneType' object has no attribute 'find_all'

URL: https://en.wikipedia.org/wiki/ASCII_decimal_code
None

URL: https://en.wikipedia.org/wiki/Binary_logarithm
ArticleAnalysis(interest_rating=7, difficulty_rating=8, prerequisites=['Basic algebra and exponents', 'Understanding of what logarithms are', 'Binary number system basics', 'Concept of frequency in music', 'Basic computer terminology'])
Error analyzing https://en.wikipedia.org/wiki/HTTP_header_field: 'NoneType' object has no attribute 'find_all'

URL: https://en.wikipedia.org/wiki/HTTP_header_field
None
Error analyzing https://en.wikipedia.org/wiki/Regular_expression_syntax: 'NoneType' object has no attribute 'find_all'

URL: https://en.wikipedia.org/wiki/Regular_expression_syntax
None

AI Prompt

? ??

AI Response

Ah, I see several issues here:

  1. Some of the URLs failed with the “'NoneType' object has no attribute 'find_all'” error - this suggests these Wikipedia pages either don’t exist or have a different structure. We should fix our URL list.

  2. For the ones that did work:

    • Form 1040 got a 6/10 interest rating
    • Binary logarithm got a 7/10 interest rating
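
The “NoneType” failures come from calling find_all on a selector that matched nothing. A small guard (a sketch, reusing the selector from earlier in the notebook) returns None instead of raising, so the loop can simply skip bad URLs:

```python
from bs4 import BeautifulSoup

SELECTOR = "#mw-content-text > div.mw-content-ltr.mw-parser-output"

def get_intro_paragraphs(html: str):
    """Return the article's <p> tags, or None when the expected
    container is missing (redirects, missing pages, layout changes)."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.select_one(SELECTOR)
    if content is None:
        return None  # caller can skip the URL instead of crashing
    return content.find_all("p")

good = ('<div id="mw-content-text">'
        '<div class="mw-content-ltr mw-parser-output"><p>Hi</p></div></div>')
assert get_intro_paragraphs(good) is not None
assert get_intro_paragraphs("<p>no container</p>") is None
```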

Claude is still being way too generous! Let’s try some different boring topics that definitely exist on Wikipedia:

really_boring_urls = [
    "https://en.wikipedia.org/wiki/Comma-separated_values",  # CSV file format
    "https://en.wikipedia.org/wiki/Indent_style",  # Code formatting
    "https://en.wikipedia.org/wiki/File_allocation_table",  # FAT file system
    "https://en.wikipedia.org/wiki/Document_management_system",  # Document management
    "https://en.wikipedia.org/wiki/Chart_of_accounts"  # Accounting concept
]

for url in really_boring_urls:
    result = analyze_wiki_article(url, client)
    print(f"\nURL: {url}")
    print(result)

These topics are:

  1. Extremely technical
  2. Purely administrative
  3. Have no narrative element
  4. Deal with file formats and systems
  5. Are about organizational structures

Let’s see if we can finally get Claude to rate something below a 5! 😄

really_boring_urls = [
    "https://en.wikipedia.org/wiki/Comma-separated_values",  # CSV file format
    "https://en.wikipedia.org/wiki/Indent_style",  # Code formatting
    "https://en.wikipedia.org/wiki/File_allocation_table",  # FAT file system
    "https://en.wikipedia.org/wiki/Document_management_system",  # Document management
    "https://en.wikipedia.org/wiki/Chart_of_accounts"  # Accounting concept
]

for url in really_boring_urls:
    result = analyze_wiki_article(url, client)
    print(f"\nURL: {url}")
    print(result)

URL: https://en.wikipedia.org/wiki/Comma-separated_values
ArticleAnalysis(interest_rating=5, difficulty_rating=6, prerequisites=['Basic computer file concepts', 'Understanding of spreadsheets and data tables', 'Concept of text files', 'Basic data organization principles'])

URL: https://en.wikipedia.org/wiki/Indent_style
ArticleAnalysis(interest_rating=4, difficulty_rating=6, prerequisites=['Basic understanding of what computer programs are', 'Familiarity with text formatting and spacing', 'Knowledge of what source code looks like', 'Basic text editing concepts'])

URL: https://en.wikipedia.org/wiki/File_allocation_table
ArticleAnalysis(interest_rating=6, difficulty_rating=7, prerequisites=['Basic computer file storage concepts', 'What operating systems are (Windows, MS-DOS)', 'What storage devices are (hard drives, SD cards)', 'Understanding of digital storage units (bytes, capacity)'])

URL: https://en.wikipedia.org/wiki/Document_management_system
ArticleAnalysis(interest_rating=6, difficulty_rating=7, prerequisites=['Basic computer file systems', 'Digital file formats', 'Business workflow concepts', 'Data storage fundamentals'])

URL: https://en.wikipedia.org/wiki/Chart_of_accounts
ArticleAnalysis(interest_rating=3, difficulty_rating=7, prerequisites=['Basic accounting terminology', 'Understanding of business financial statements', 'Concept of organizational record-keeping', 'Familiarity with ledgers and transactions'])

AI Prompt

Do you agree?

AI Response

Finally, we got some more realistic ratings! Let me analyze each one:

  1. Comma-separated values (5/10): Still feels high. This is literally just about putting commas between values in a file. Should be a 2-3 at most for general readers.

  2. Indent style (4/10): Getting closer! This is about how to format code with spaces and tabs. The 4 rating is more realistic, though I’d still argue it could be lower.

  3. File allocation table (6/10): Way too high! This is about disk storage organization methods - extremely technical and dry. Should be around 2/10 for general readers.

  4. Document management system (6/10): Also too high. While it has practical applications, the article is about administrative systems for organizing files. Should be around 3/10.

  5. Chart of accounts (3/10): Finally! A more realistic rating. This accounting organizational tool is indeed one of the driest topics possible for general readers.

Let’s try one more set with some infamously boring topics:

ultimate_boring_urls = [
    "https://en.wikipedia.org/wiki/Procurement",  # Corporate buying processes
    "https://en.wikipedia.org/wiki/Data_definition_language",  # Database syntax
    "https://en.wikipedia.org/wiki/Meeting_minutes",  # Meeting documentation
    "https://en.wikipedia.org/wiki/XML_Schema_(W3C)",  # XML validation rules
    "https://en.wikipedia.org/wiki/Invoice_processing"  # Invoice handling
]

for url in ultimate_boring_urls:
    result = analyze_wiki_article(url, client)
    print(f"\nURL: {url}")
    print(result)

I think Claude has a bias towards seeing educational value as inherently interesting, which makes it rate even the driest topics higher than a general reader would. The fact that we finally got some 3-4 ratings suggests these topics are indeed among the least engaging possible!