Nathan Labenz synthesizes recent research in mechanistic interpretability and AI safety, covering how top players in the space like Anthropic and OpenAI are approaching these problems, as well as jailbreaks like the Calvin and Hobbes one you may have seen online. Nathan's aim is to give listeners the equivalent of a high school AP course understanding in 90 minutes. If you're looking for an ERP platform, check out our sponsor, NetSuite: http://netsuite.com/cognitive
Questions or topics you want us to review for future episodes? Email TCR@turpentine.co
SPONSORS: NetSuite | Omneky
NetSuite has 25 years of providing financial software for all your business needs. More than 36,000 businesses have already upgraded to NetSuite by Oracle, gaining visibility and control over their financials, inventory, HR, eCommerce, and more. If you're looking for an ERP platform ✅ head to NetSuite: http://netsuite.com/cognitive and download your own customized KPI checklist.
Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with the click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off.
LINKS:
Scouting Report Part 1 - Fundamentals : https://www.youtube.com/watch?v=0hvtiVQ_LqQ
Scouting Report Part 2 - Impact, Fallout, and Outlook: https://www.youtube.com/watch?v=QJi0UJ_DV3E
Universal Jailbreaks with Zico Kolter, Andy Zou, Asher Trockman: https://www.youtube.com/watch?v=BwltbhR0JgU&feature=youtu.be
X/SOCIAL:
@labenz (Nathan)
@eriktorenberg (Erik)
@CogRev_Podcast
TIMESTAMPS:
(00:00:00) Episode Preview
(00:02:26) AI Engineer Survey
(00:03:53) P(Doom)
(00:07:52) Representation engineering
(00:09:20) Using contrasting prompts to understand model’s inner representations
(00:15:16) Sponsors: NetSuite | Omneky
(00:22:00) Controlling AI systems and detecting jailbreaks
(00:28:53) LLM performance and refusal rates varying by language
(00:33:13) Towards monosemanticity: decomposing language models with dictionary learning
(00:54:12) Implications of the aforementioned paper