Abstract: Recent progress in agentic and multimodal AI, including ReAct, HuggingGPT, and MM-ReAct, shows that large language models can coordinate vision tools through planner-executor loops.