Source URL: https://algorithmwatch.org/en/llms_state_elections/
Source: AlgorithmWatch
Title: Large language models continue to be unreliable concerning elections
Feedly Summary: Large language models continue to be unreliable for election information. Our research was able to substantially improve the reliability of safeguards in the Microsoft Copilot chatbot against election misinformation in German. However, barriers to data access greatly restricted our investigations into other chatbots.
AI Summary and Description: Yes
**Summary:**
This report investigates the performance of large language models (LLMs), namely Microsoft Copilot, Google Gemini, and OpenAI’s GPT-3.5 and GPT-4o, in the run-up to German state elections, focusing on the accuracy of their responses to election-related queries. It highlights the challenges in data access and the need for stronger model safeguards, especially regarding the dissemination of accurate information about candidates.
**Detailed Description:**
The report presents an in-depth analysis of the interactions with and outputs of four AI chatbots in the context of regional elections in Germany, reflecting the growing role of AI in political discourse and information delivery. Here are the key points:
– **Scope of Investigation:**
– Focused on four models: Microsoft Copilot, Google Gemini, GPT-3.5, and GPT-4o, during elections in Thuringia, Saxony, and Brandenburg.
– Collected 598,717 responses using 817 prompts across 14 topics.
– **Findings on Output Accuracy:**
– Microsoft Copilot improved markedly: its safeguards for blocking queries it should not answer went from triggering roughly 35% of the time to roughly 80%.
– Accuracy varied widely: only 5-10% of Copilot’s responses were factually inaccurate, whereas Google Gemini’s inaccuracy rate reached 60%.
– GPT-3.5 produced inaccuracies in ~30% of responses and GPT-4o in ~14%; a recurring problem was outdated information that was not adequately flagged as such.
– **Challenges in Analysis:**
– Limited data access and the inability to collect chatbot responses automatically hindered deeper comparative analysis across models.
– Models also behaved differently depending on the query channel (chatbot interface vs. API), raising concerns about how reliably API-based findings reflect real-world interactions (a minimal API-query sketch follows the recommendations below).
– **Recommendations Based on Findings:**
– **Improved Accessibility:** Research access to LLMs should extend beyond APIs to the user-facing chat interfaces, so that investigations reflect realistic usage and collected data can be validated.
– **Focused Safeguards:** Safeguards should not only target party-related queries but also include specific measures for candidate-related inquiries to prevent misinformation.
– **Accountability from Developers:** Tech companies must take responsibility for inaccuracies and provide mechanisms for users to access reliable information.
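To make the chatbot-vs-API distinction concrete, the following is a minimal sketch of the kind of API-based response collection described above, assuming the OpenAI Python SDK and its chat-completions endpoint. The prompts, model names, and output file are illustrative placeholders rather than the study’s actual materials, and chatbot front ends typically offer no equivalent programmatic access, which is precisely the gap the report criticizes.

```python
# Hypothetical sketch of API-based response collection, loosely modeled on the
# audit setup described above. Prompts, topics, and file names are placeholders.
import csv
import time

from openai import OpenAI  # assumes the official OpenAI Python SDK (v1.x)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative election-related prompts (the study used German-language prompts).
PROMPTS = [
    "Who are the leading candidates in the Thuringia state election?",
    "When and how can I vote by mail in Saxony?",
    "Summarize the platforms of the main parties running in Brandenburg.",
]

MODELS = ["gpt-3.5-turbo", "gpt-4o"]


def collect(models, prompts, out_path="responses.csv"):
    """Query each model with each prompt once and log the raw answers to CSV."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["model", "prompt", "response"])
        for model in models:
            for prompt in prompts:
                completion = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                )
                writer.writerow(
                    [model, prompt, completion.choices[0].message.content]
                )
                time.sleep(1)  # crude rate limiting between requests


if __name__ == "__main__":
    collect(MODELS, PROMPTS)
```

In the study, this kind of loop was scaled to 817 prompts and several hundred thousand responses; the point of the sketch is only that such automation is straightforward against an API but not against a locked-down chat interface.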
**Practical Implications:**
For security and compliance professionals, the findings have significant implications:
– **Model Deployment:** Understanding the limitations and behaviors of LLMs can inform better governance and compliance strategies when deploying AI systems for public engagement.
– **Public Trust:** Accurate information delivery by AI influences public sentiment; hence, mechanisms to ensure reliability and transparency in AI outputs should be prioritized.
– **Policy Development:** Insights from such investigations can aid policymakers in developing regulations that ensure responsible use of AI in critical areas such as elections and public discourse.
These insights emphasize the importance of robust security measures in deploying AI systems to preserve the integrity of information, especially in vital democratic processes.