FIX Claude 3.7 Sonnet throws Expected thinking

2025-03-11 00:32:27 +00:00 · 2025-03-11 00:32:27 +00:00 · 108244d091
parent 376fe18b83
commit 108244d091
8 changed files with 465 additions and 2 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -16,3 +16,4 @@ appmap.log
 *.swp
 /vsc/node_modules
 /vsc/dist
 cline_docs/
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -13,6 +13,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Added model parameters for think tag support
 - Added comprehensive testing for think tag functionality
 - Added `--show-thoughts` flag to show thoughts of thinking models
 - Added `--disable-thinking` flag to disable thinking mode for Claude 3.7 Sonnet
 ### Fixed
 - Fixed unretryable API error (400) when using Claude 3.7 Sonnet with thinking mode enabled for extended periods
 - Improved message formatting for Claude 3.7 Sonnet to ensure thinking blocks are properly included
 ### Changed
 - Updated langchain/langgraph deps
--- a/docs/docs/configuration/thinking-models.md
+++ b/docs/docs/configuration/thinking-models.md
@@ -105,6 +105,33 @@ RA.Aid configures the model to use its native thinking mode, and then processes
 If you run RA.Aid without the `--show-thoughts` flag, the thinking content is still extracted from the model responses, but it won't be displayed in the console. This gives you a cleaner output focused only on the model's final responses.
 ## Disabling Thinking Mode
 In some cases, you might want to disable the thinking mode feature for Claude 3.7 models, particularly if you're experiencing API errors or if you prefer the model to operate without the thinking capability.
 ### Using the disable_thinking Configuration Option
 RA.Aid provides a `disable_thinking` configuration option, exposed on the command line as the `--disable-thinking` flag, that turns off thinking mode for Claude 3.7 models:
 ```bash
 ra-aid -m "Debug the database connection issue" --provider anthropic --model claude-3-7-sonnet-20250219 --disable-thinking
 ```
 When this option is enabled:
 - The thinking mode will not be activated for Claude 3.7 models
 - The model will operate in standard mode without the structured thinking blocks
 - This can help avoid certain API errors that might occur with thinking mode enabled
 ### When to Disable Thinking Mode
 Consider disabling thinking mode in the following scenarios:
 1. **API Errors**: If you encounter unretryable API errors (400) related to thinking blocks
 2. **Long-Running Sessions**: For sessions that run longer than 10 minutes
 3. **Performance Concerns**: If you need faster responses and don't require the thinking content
 4. **Compatibility Issues**: If you're using tools or workflows that aren't fully compatible with thinking mode
 ## Troubleshooting and Best Practices
 ### Common Issues
@@ -117,6 +144,20 @@ If you're not seeing thinking content despite using the `--show-thoughts` flag:
 - Verify that the model is properly configured in your environment
 - Check that the model is actually including thinking content in its responses (not all prompts will generate thinking)
 #### API errors with Claude 3.7 Sonnet
 If you encounter errors like:
 ```
 Unretryable API error: Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': 'messages.1.content.0.type: Expected thinking or redacted_thinking, but found text...'}}
 ```
 This error occurs because, with thinking mode enabled, the API expects assistant messages to start with a `thinking` or `redacted_thinking` block. You can:
 - Use the `--disable-thinking` flag to turn off thinking mode
 - Upgrade to the latest version of RA.Aid, which includes fixes for these errors
 - For long-running sessions, consider restarting the assistant periodically
 #### Excessive or irrelevant thinking
 If the thinking content is too verbose or irrelevant:
@@ -137,4 +178,3 @@ For the most effective use of thinking models:
 4. **Compare thinking with output**: Use the thinking content to evaluate the quality of the model's reasoning and identify potential flaws in its approach.
 5. **Provide clear instructions**: When the model's thinking seems off-track, provide clearer instructions in your next prompt to guide its reasoning process.
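The 400 error quoted in the troubleshooting section reports that the first content block of an assistant message was `text` where `thinking` or `redacted_thinking` was expected. A minimal sketch of that check over plain dictionaries follows. The helper name `starts_with_thinking` and the sample payloads are hypothetical and are not part of RA.Aid or any SDK.

```python
from typing import Any

# Block types the error message names as acceptable in first position.
THINKING_TYPES = {"thinking", "redacted_thinking"}


def starts_with_thinking(content: Any) -> bool:
    """Return True if a structured content list begins with a thinking block."""
    if not isinstance(content, list) or not content:
        return False
    first = content[0]
    return isinstance(first, dict) and first.get("type") in THINKING_TYPES


# Hypothetical payloads for illustration only.
ok_content = [
    {"type": "thinking", "thinking": "Let me think..."},
    {"type": "text", "text": "Hi there"},
]
bad_content = [{"type": "text", "text": "Hi there"}]

print(starts_with_thinking(ok_content))
print(starts_with_thinking(bad_content))
```

A check like this could be run over a message history before sending it, to spot the message index that the API error refers to.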
--- a/ra_aid/agent_utils.py
+++ b/ra_aid/agent_utils.py
@@ -526,6 +526,64 @@ def _handle_fallback_response(
        msg_list.extend(msg_list_response)
 def _ensure_thinking_block(messages: list[BaseMessage], config: Dict[str, Any]) -> list[BaseMessage]:
    """
    Ensure that messages sent to Claude 3.7 with thinking enabled have a thinking block at the start.
    When thinking is enabled for Claude 3.7, the API requires that any assistant message
    starts with a thinking block. This function checks if the model is Claude 3.7 with
    thinking enabled, and if so, ensures that assistant messages have a thinking block.
    Args:
        messages: List of messages to check and potentially modify
        config: Configuration dictionary
    Returns:
        Modified list of messages with thinking blocks added if needed
    """
    # Check if we're using Claude 3.7 with thinking enabled
    provider = config.get("provider", "")
    model_name = config.get("model", "")
    # Skip if thinking is explicitly disabled
    if config.get("disable_thinking", False):
        return messages
    # Only apply to Claude 3.7 models
    if not (provider.lower() == "anthropic" and "claude-3-7" in model_name.lower()):
        return messages
    # Get model configuration to check if thinking is supported
    model_config = models_params.get(provider, {}).get(model_name, {})
    if not model_config.get("supports_thinking", False):
        return messages
    # Shallow-copy the list. The message objects are shared with the caller,
    # so the content.insert() below also changes the original messages.
    modified_messages = messages.copy()
    # Check each message
    for i, message in enumerate(modified_messages):
        # Only check assistant messages
        if hasattr(message, "type") and message.type == "ai":
            # If content is a list (structured format)
            if isinstance(message.content, list):
                # Check if the first item is a thinking block
                if not (len(message.content) > 0 and 
                        isinstance(message.content[0], dict) and 
                        message.content[0].get("type") == "thinking"):
                    # Add a redacted_thinking block at the start
                    message.content.insert(0, {"type": "redacted_thinking"})
                    logger.debug("Added redacted_thinking block to assistant message")
            # If content is a string, we can't modify it properly
            # This shouldn't happen with Claude 3.7, but log it if it does
            elif isinstance(message.content, str):
                logger.warning(
                    "Found string content in assistant message with Claude 3.7 thinking enabled. "
                    "This may cause API errors if the message doesn't start with a thinking block."
                )
    return modified_messages
 def _run_agent_stream(agent: RAgents, msg_list: list[BaseMessage]):
    """
    Streams agent output while handling completion and interruption.
@@ -551,6 +609,9 @@ def _run_agent_stream(agent: RAgents, msg_list: list[BaseMessage]):
        if "callbacks" not in stream_config:
            stream_config["callbacks"] = []
        stream_config["callbacks"].append(cb)
        # Ensure messages have thinking blocks if needed
        msg_list = _ensure_thinking_block(msg_list, config)
    while True:
        for chunk in agent.stream({"messages": msg_list}, stream_config):
@@ -636,6 +697,23 @@ def run_agent_with_retry(
                    return f"Agent has crashed: {crash_message}"
                try:
                    # Ensure messages have thinking blocks if needed before each run
                    config = get_config_repository().get_all()
                    if is_anthropic_claude(config):
                        provider = config.get("provider", "")
                        model_name = config.get("model", "")
                        # Only apply to Claude 3.7 models with thinking enabled
                        if (provider.lower() == "anthropic" and 
                            "claude-3-7" in model_name.lower() and 
                            not config.get("disable_thinking", False)):
                            # Get model configuration to check if thinking is supported
                            model_config = models_params.get(provider, {}).get(model_name, {})
                            if model_config.get("supports_thinking", False):
                                logger.debug("Ensuring thinking blocks for Claude 3.7 before agent run")
                                msg_list = _ensure_thinking_block(msg_list, config)
                    _run_agent_stream(agent, msg_list)
                    if fallback_handler:
                        fallback_handler.reset_fallback_handler()
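`_ensure_thinking_block` calls `messages.copy()` and then inserts into `message.content`. Because `list.copy()` is shallow, the copied list still points at the caller's message objects. The sketch below illustrates that behavior with a hypothetical `Msg` stand-in class (not a LangChain type) and shows `copy.deepcopy` as one possible way to isolate the originals. It is an illustration under those assumptions, not the project's implementation.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Msg:
    """Hypothetical stand-in for a message with structured content."""
    type: str
    content: list = field(default_factory=list)


def prepend_block(messages: list[Msg], deep: bool) -> list[Msg]:
    """Prepend a placeholder block to each "ai" message in a copy of the list."""
    copied = copy.deepcopy(messages) if deep else messages.copy()
    for msg in copied:
        if msg.type == "ai":
            msg.content.insert(0, {"type": "redacted_thinking"})
    return copied


original = [Msg("ai", [{"type": "text", "text": "Hi"}])]
shallow_result = prepend_block(original, deep=False)
# The shallow copy shares Msg objects, so the caller's message changed too.
shallow_mutated_original = original[0].content[0]["type"] == "redacted_thinking"

fresh = [Msg("ai", [{"type": "text", "text": "Hi"}])]
deep_result = prepend_block(fresh, deep=True)
# The deep copy leaves the caller's message untouched.
deep_kept_original = fresh[0].content[0]["type"] == "text"
```

Whether in-place mutation is acceptable depends on the caller. Here the function runs both in `run_agent_with_retry` and again in `_run_agent_stream`, and the "first block is already a thinking block" check only recognizes `thinking`, so sharing objects between calls is worth keeping in mind.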
--- a/ra_aid/llm.py
+++ b/ra_aid/llm.py
@@ -259,7 +259,7 @@ def create_llm_client(
    else:
        temp_kwargs = {}
-    if supports_thinking:
+    if supports_thinking and not config.get("disable_thinking", False):
        temp_kwargs = {"thinking": {"type": "enabled", "budget_tokens": 12000}}
    if provider == "deepseek":
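The changed condition in `create_llm_client` can be read as a small pure function. The sketch below assumes a hypothetical helper named `thinking_kwargs`. Only the `thinking` payload and the `disable_thinking` key come from the diff, and the surrounding temperature handling is left out.

```python
from typing import Any, Dict


def thinking_kwargs(supports_thinking: bool, config: Dict[str, Any]) -> Dict[str, Any]:
    """Return the extra client kwargs for thinking mode, or an empty dict."""
    if supports_thinking and not config.get("disable_thinking", False):
        return {"thinking": {"type": "enabled", "budget_tokens": 12000}}
    return {}


enabled = thinking_kwargs(True, {})
disabled = thinking_kwargs(True, {"disable_thinking": True})
unsupported = thinking_kwargs(False, {})
```

With this shape, a missing `disable_thinking` key behaves the same as `False`, which matches the default used in the diff.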
--- a/tests/ra_aid/test_disable_thinking.py
+++ b/tests/ra_aid/test_disable_thinking.py
@@ -0,0 +1,103 @@
 import pytest
 from unittest.mock import MagicMock, patch
 from ra_aid.llm import create_llm_client
 class TestDisableThinking:
    """Test suite for the disable_thinking configuration option."""
    @pytest.mark.parametrize(
        "test_id, config, model_config, expected_thinking_param, description",
        [
            # Test case 1: Claude 3.7 model with thinking enabled (default)
            (
                "claude_thinking_enabled",
                {
                    "provider": "anthropic",
                    "model": "claude-3-7-sonnet-20250219",
                },
                {"supports_thinking": True},
                {"thinking": {"type": "enabled", "budget_tokens": 12000}},
                "Claude 3.7 should have thinking enabled by default",
            ),
            # Test case 2: Claude 3.7 model with thinking explicitly disabled
            (
                "claude_thinking_disabled",
                {
                    "provider": "anthropic",
                    "model": "claude-3-7-sonnet-20250219",
                    "disable_thinking": True,
                },
                {"supports_thinking": True},
                {},
                "Claude 3.7 with disable_thinking=True should not have thinking param",
            ),
            # Test case 3: Non-thinking model should not have thinking param
            (
                "non_thinking_model",
                {
                    "provider": "anthropic",
                    "model": "claude-3-5-sonnet-20240620",
                },
                {"supports_thinking": False},
                {},
                "Non-thinking model should not have thinking param",
            ),
            # Test case 4: Non-Claude model should not have thinking param
            (
                "non_claude_model",
                {
                    "provider": "openai",
                    "model": "gpt-4",
                },
                {},
                {},
                "Non-Claude model should not have thinking param",
            ),
        ],
    )
    def test_disable_thinking_option(
        self, test_id, config, model_config, expected_thinking_param, description
    ):
        """Test that the disable_thinking option correctly controls thinking mode."""
        # Mock the necessary dependencies
        with patch("ra_aid.llm.ChatAnthropic") as mock_anthropic:
            with patch("ra_aid.llm.ChatOpenAI") as mock_openai:
                with patch("ra_aid.llm.models_params") as mock_models_params:
                    # Set up the mock to return the specified model_config
                    mock_models_params.get.return_value = {config["model"]: model_config}
                    # Set up the mock for get_provider_config
                    with patch("ra_aid.llm.get_provider_config") as mock_get_provider_config:
                        # Include disable_thinking in the provider config if it's in the test config
                        provider_config = {
                            "api_key": "test-key",
                            "base_url": None,
                        }
                        if "disable_thinking" in config:
                            provider_config["disable_thinking"] = config["disable_thinking"]
                        mock_get_provider_config.return_value = provider_config
                        # Call the function without passing disable_thinking directly
                        create_llm_client(
                            config["provider"],
                            config["model"],
                            temperature=None,
                            is_expert=False
                        )
                        # Check if the correct parameters were passed
                        if config["provider"] == "anthropic":
                            # Get the kwargs passed to ChatAnthropic
                            _, kwargs = mock_anthropic.call_args
                            # Check if thinking param was included or not
                            if expected_thinking_param:
                                assert "thinking" in kwargs, f"Test {test_id} failed: {description}"
                                assert kwargs["thinking"] == expected_thinking_param["thinking"], \
                                    f"Test {test_id} failed: Thinking param doesn't match expected value"
                            else:
                                assert "thinking" not in kwargs, \
                                    f"Test {test_id} failed: Thinking param should not be present"
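The assertions in this test rely on `Mock.call_args`, which records the arguments of the most recent call and unpacks as an `(args, kwargs)` pair. A minimal sketch follows, with a hypothetical `fake_client` mock standing in for the patched `ChatAnthropic`:

```python
from unittest.mock import MagicMock

# Hypothetical stand-in for the patched client class.
fake_client = MagicMock()
fake_client(
    model="claude-3-7-sonnet-20250219",
    thinking={"type": "enabled", "budget_tokens": 12000},
)

# call_args unpacks into the positional-args tuple and the kwargs dict.
args, kwargs = fake_client.call_args
has_thinking = "thinking" in kwargs
```

If the mock was never called, `call_args` is `None` and the unpacking raises a `TypeError`, so a test using this pattern only works when the client is actually constructed.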
--- a/tests/ra_aid/test_ensure_thinking_block.py
+++ b/tests/ra_aid/test_ensure_thinking_block.py
@@ -0,0 +1,145 @@
 import pytest
 from unittest.mock import MagicMock, patch
 from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
 from ra_aid.agent_utils import _ensure_thinking_block
 class TestEnsureThinkingBlock:
    """Test suite for the _ensure_thinking_block function."""
    @pytest.mark.parametrize(
        "test_id, messages, config, expected_changes, description",
        [
            # Test case 1: Non-Claude 3.7 model should not modify messages
            (
                "non_claude_model",
                [
                    HumanMessage(content="Hello"),
                    AIMessage(content=[{"type": "text", "text": "Hi there"}]),
                ],
                {"provider": "openai", "model": "gpt-4"},
                False,
                "Non-Claude 3.7 model should not modify messages",
            ),
            # Test case 2: Claude 3.7 model with thinking disabled should not modify messages
            (
                "claude_thinking_disabled",
                [
                    HumanMessage(content="Hello"),
                    AIMessage(content=[{"type": "text", "text": "Hi there"}]),
                ],
                {
                    "provider": "anthropic",
                    "model": "claude-3-7-sonnet-20250219",
                    "disable_thinking": True,
                },
                False,
                "Claude 3.7 with thinking disabled should not modify messages",
            ),
            # Test case 3: Claude 3.7 model with thinking enabled but no AI messages should not modify messages
            (
                "claude_no_ai_messages",
                [
                    HumanMessage(content="Hello"),
                    SystemMessage(content="System message"),
                ],
                {"provider": "anthropic", "model": "claude-3-7-sonnet-20250219"},
                False,
                "Claude 3.7 with no AI messages should not modify messages",
            ),
            # Test case 4: Claude 3.7 model with thinking enabled and AI message with text content should log warning
            (
                "claude_ai_text_content",
                [
                    HumanMessage(content="Hello"),
                    AIMessage(content="Text response"),
                ],
                {"provider": "anthropic", "model": "claude-3-7-sonnet-20250219"},
                False,
                "Claude 3.7 with AI message with text content should log warning",
            ),
            # Test case 5: Claude 3.7 model with thinking enabled and AI message with list content but no thinking block
            (
                "claude_ai_no_thinking_block",
                [
                    HumanMessage(content="Hello"),
                    AIMessage(content=[{"type": "text", "text": "Hi there"}]),
                ],
                {"provider": "anthropic", "model": "claude-3-7-sonnet-20250219"},
                True,
                "Claude 3.7 with AI message without thinking block should add one",
            ),
            # Test case 6: Claude 3.7 model with thinking enabled and AI message with list content with thinking block
            (
                "claude_ai_with_thinking_block",
                [
                    HumanMessage(content="Hello"),
                    AIMessage(
                        content=[
                            {"type": "thinking", "thinking": "Let me think..."},
                            {"type": "text", "text": "Hi there"},
                        ]
                    ),
                ],
                {"provider": "anthropic", "model": "claude-3-7-sonnet-20250219"},
                False,
                "Claude 3.7 with AI message with thinking block should not modify it",
            ),
            # Test case 7: Claude 3.7 model with thinking enabled and multiple AI messages
            (
                "claude_multiple_ai_messages",
                [
                    HumanMessage(content="Hello"),
                    AIMessage(content=[{"type": "text", "text": "First response"}]),
                    HumanMessage(content="Follow-up"),
                    AIMessage(content=[{"type": "text", "text": "Second response"}]),
                ],
                {"provider": "anthropic", "model": "claude-3-7-sonnet-20250219"},
                True,
                "Claude 3.7 with multiple AI messages should add thinking blocks to all",
            ),
        ],
    )
    def test_ensure_thinking_block(
        self, test_id, messages, config, expected_changes, description
    ):
        """Test the _ensure_thinking_block function with various inputs."""
        # Mock the logger
        with patch("ra_aid.agent_utils.logger") as mock_logger:
            # Mock the models_params dictionary
            with patch("ra_aid.agent_utils.models_params") as mock_models_params:
                # Set up the mock to return supports_thinking=True for Claude 3.7 models
                if "claude-3-7" in config.get("model", ""):
                    mock_models_params.get.return_value = {
                        config["model"]: {"supports_thinking": True}
                    }
                else:
                    mock_models_params.get.return_value = {}
                # Make a copy of the original messages for comparison
                original_messages = [
                    AIMessage(content=msg.content) if isinstance(msg, AIMessage) else msg
                    for msg in messages
                ]
                # Call the function
                result = _ensure_thinking_block(messages, config)
                # Check if the result is different from the input when expected
                if expected_changes:
                    assert result != original_messages, f"Test {test_id} failed: {description}"
                    # Check that AI messages have thinking blocks
                    for msg in result:
                        if hasattr(msg, "type") and msg.type == "ai":
                            if isinstance(msg.content, list) and len(msg.content) > 0:
                                assert msg.content[0].get("type") in ("thinking", "redacted_thinking"), \
                                    f"Test {test_id} failed: AI message should have thinking block"
                else:
                    # Check that the messages were not modified
                    assert result == messages, f"Test {test_id} failed: {description}"
                # Check for warning logs for text content
                if "claude_ai_text_content" == test_id:
                    mock_logger.warning.assert_called_once()
--- a/tests/ra_aid/test_thinking_integration.py
+++ b/tests/ra_aid/test_thinking_integration.py
@@ -0,0 +1,91 @@
 import pytest
 from unittest.mock import MagicMock, patch
 from langchain_core.messages import AIMessage, HumanMessage
 from ra_aid.agent_utils import run_agent_with_retry
 class TestThinkingIntegration:
    """Test suite for the integration of thinking block functionality in run_agent_with_retry."""
    @pytest.mark.parametrize(
        "test_id, config, model_name, should_ensure_thinking, description",
        [
            # Test case 1: Claude 3.7 model with thinking enabled should ensure thinking blocks
            (
                "claude_thinking_enabled",
                {
                    "provider": "anthropic",
                    "model": "claude-3-7-sonnet-20250219",
                },
                "claude-3-7-sonnet-20250219",
                True,
                "Claude 3.7 should ensure thinking blocks",
            ),
            # Test case 2: Claude 3.7 model with thinking disabled should not ensure thinking blocks
            (
                "claude_thinking_disabled",
                {
                    "provider": "anthropic",
                    "model": "claude-3-7-sonnet-20250219",
                    "disable_thinking": True,
                },
                "claude-3-7-sonnet-20250219",
                False,
                "Claude 3.7 with thinking disabled should not ensure thinking blocks",
            ),
            # Test case 3: Non-Claude 3.7 model should not ensure thinking blocks
            (
                "non_claude_model",
                {
                    "provider": "openai",
                    "model": "gpt-4",
                },
                "gpt-4",
                False,
                "Non-Claude model should not ensure thinking blocks",
            ),
        ],
    )
    def test_run_agent_with_retry_thinking_integration(
        self, test_id, config, model_name, should_ensure_thinking, description
    ):
        """Test that run_agent_with_retry correctly integrates thinking block functionality."""
        # Mock the necessary dependencies
        with patch("ra_aid.agent_utils.get_config_repository") as mock_get_config:
            with patch("ra_aid.agent_utils._ensure_thinking_block") as mock_ensure_thinking:
                with patch("ra_aid.agent_utils._run_agent_stream") as mock_run_agent_stream:
                    with patch("ra_aid.agent_utils._setup_interrupt_handling") as mock_setup:
                        with patch("ra_aid.agent_utils._restore_interrupt_handling"):
                            with patch("ra_aid.agent_utils.agent_context"):
                                with patch("ra_aid.agent_utils.InterruptibleSection"):
                                    with patch("ra_aid.agent_utils.is_anthropic_claude") as mock_is_anthropic:
                                        with patch("ra_aid.agent_utils.models_params") as mock_models_params:
                                            # Set up the mocks
                                            mock_get_config.return_value.get_all.return_value = config
                                            mock_get_config.return_value.get.return_value = False
                                            mock_setup.return_value = None
                                            mock_run_agent_stream.return_value = True
                                            # Set up is_anthropic_claude to return True for Claude models
                                            mock_is_anthropic.return_value = "claude" in model_name.lower()
                                            # Set up models_params to return supports_thinking=True for Claude 3.7 models
                                            if "claude-3-7" in model_name:
                                                mock_models_params.get.return_value = {
                                                    model_name: {"supports_thinking": True}
                                                }
                                            else:
                                                mock_models_params.get.return_value = {}
                                            # Create a mock agent
                                            mock_agent = MagicMock()
                                            # Call the function
                                            run_agent_with_retry(mock_agent, "Test prompt")
                                            # Check if _ensure_thinking_block was called correctly
                                            if should_ensure_thinking:
                                                mock_ensure_thinking.assert_called()
                                            else:
                                                mock_ensure_thinking.assert_not_called()
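The nine nested `with patch(...)` blocks in this test could also be written flat. One option is `contextlib.ExitStack`, sketched below. The patch targets here are standard-library functions chosen only so the example runs on its own. They are not the targets used in the test.

```python
import os
from contextlib import ExitStack
from unittest.mock import patch

with ExitStack() as stack:
    # enter_context returns what the patcher's __enter__ returns: the mock.
    mock_getcwd = stack.enter_context(patch("os.getcwd"))
    mock_getpid = stack.enter_context(patch("os.getpid"))
    mock_getcwd.return_value = "/tmp/example"
    mock_getpid.return_value = 1234
    patched_cwd = os.getcwd()
    patched_pid = os.getpid()

# All patches are undone when the stack exits, so this is the real PID again.
real_pid = os.getpid()
```

This keeps every patch at one indentation level while preserving the same setup and teardown order as the nested form.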