State of AI

An overview of everything I regard as interesting in the space of AI as of February 2024.

Plan & Execute Agents

Traditional approach: Reasoning and Acting (ReAct) agents

Benchmarking Agent Tool Use

Article comparing how different models handle different tool-use challenges; GPT-4 comes out on top.

Winning the Future of AI

AI in 2030

Ten years ago: big data and ML (ad optimizations, search, feed ranking). Five years ago: ML Ops (data labeling, churn prediction, recommendations, image classification, sentiment analysis). One year ago: LLMs and foundational models completely transformed our expectations of AI. 2024: agents and co-pilots are taking center stage.

The challenges going forward

Some things don’t change:

The AI Stack

Foundational Models (SageMaker?): Models are expected to either transform a document or perform a task. Transformation usually means generating vector embeddings for ingestion into vector databases, or feature generation for task completion. Tasks include text and image generation, labeling, summarization, transcription, etc. Models not only need to be hosted but also constantly retrained and improved.

Model Training and Deployment (Ray): Ray is an open-source project used as the foundation of AI infrastructure across the tech industry, including for training some of the world’s largest models such as GPT-4.

Vector Database (Postgres): Knowledge is at the core of AI, and semantic retrieval is the key to making relevant information available to models in real time. Vector databases have matured and specialized to search through billions of embeddings in milliseconds while remaining cost effective (a minimal ingestion/query sketch follows this list).

AI Application Hosting (Vercel): AI apps are more dynamic by nature, making traditional CDN caching less effective; they rely on server rendering and streaming to meet users’ speed and personalization demands. As the AI landscape evolves rapidly, iterating quickly on new models, prompts, experiments, and techniques is essential, and application traffic can grow exponentially, making serverless platforms particularly attractive. In terms of security, AI apps have become a common target of bot attacks that attempt to steal or deny LLM resources or scrape proprietary data.

LLM Developer Toolkits (LangChain, LlamaIndex): Building LLM applications requires taking all of the above components and putting them together into the “cognitive architecture” of your system. Important parts of a toolkit include the ability to flexibly create custom chains, a variety of integrations, and first-class streaming support (an important UX consideration for LLM applications).

LLM Ops (LangSmith, Zeno): Hosting raises issues around the reliability of the application. Common needs include testing and evaluating different prompts or models, tracing and debugging individual calls to figure out what in your system is going wrong, and monitoring feedback over time.
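To make the vector-database layer concrete, here is a minimal sketch of ingestion and semantic retrieval with Postgres and the pgvector extension. The database name, the documents table, the 3-dimensional fake embeddings, and the embed stand-in are all illustrative assumptions, not details from the article.

import hashlib
import psycopg2

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model; in practice the dimension matches
    # your model (e.g. 1536 for common API embeddings).
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:3]]

def to_pgvector(vec: list[float]) -> str:
    # pgvector accepts the textual form '[x1,x2,...]'.
    return "[" + ",".join(str(x) for x in vec) + "]"

conn = psycopg2.connect("dbname=ai_app user=postgres")
cur = conn.cursor()

# One-time setup: enable the extension and create a table for the embeddings.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute(
    "CREATE TABLE IF NOT EXISTS documents ("
    " id bigserial PRIMARY KEY, content text, embedding vector(3))"
)

# Ingestion: transform the document into an embedding and store it.
doc = "LangGraph adds cycles to LLM applications."
cur.execute(
    "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)",
    (doc, to_pgvector(embed(doc))),
)

# Retrieval: nearest-neighbour search by cosine distance (the <=> operator).
query = "How do I add loops to an LLM app?"
cur.execute(
    "SELECT content FROM documents ORDER BY embedding <=> %s::vector LIMIT 5",
    (to_pgvector(embed(query)),),
)
print(cur.fetchall())
conn.commit()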

LangGraph

Motivation: agents are basically “running an LLM in a for-loop.” Previous chains are directed acyclic graphs (DAGs); adding cycles makes it possible to use them for reasoning tasks.

Use case: when a RAG result isn’t useful, it’s ideal if the LLM can reason that the results returned from the retriever are poor, issue a second (more refined) query to the retriever, and use those results instead.

Why LangGraph? When you put agents into production, you often need more control; for example, you may want to force the agent to always call a particular tool first. We internally refer to these more controlled flows as “state machines”.

Functionality: At its core, LangGraph exposes a pretty narrow interface on top of LangChain. StateGraph is a class that represents the graph. You initialize it by passing in a state definition, which represents a central state object that is updated over time. The attributes of this state can be updated in two ways: by overriding the existing value (the default) or by adding to it (attributes annotated with operator.add, as in the example below).

Example:

from langgraph.graph import StateGraph
from typing import TypedDict, List, Annotated
import operator

class State(TypedDict):
    input: str
    all_actions: Annotated[List[str], operator.add]

graph = StateGraph(State)

Nodes: After creating a StateGraph, you then add nodes with graph.add_node(name, value) syntax. The name parameter should be a string that we will use to refer to the node when adding edges. The value parameter should be either a function or LCEL runnable that will be called. This function/LCEL should accept a dictionary in the same form as the State object as input, and output a dictionary with keys of the State object to update.

graph.add_node("model", model)
graph.add_node("tools", tool_executor)
from langgraph.graph import END

There is also a special END node that is used to represent the end of the graph. It is important that your cycles be able to end eventually!

Edges: After adding nodes, you can then add edges to create the graph. There are a few types of edges: a starting (entry-point) edge, normal edges, and conditional edges, as shown below.

graph.set_entry_point("model")
graph.add_edge("tools", "model")
graph.add_conditional_edges(
    "model",
    should_continue,
    {
        "end": END,
        "continue": "tools"
    }
)
app = graph.compile()

For an agent executor, the tracked state is richer; for example:

from typing import TypedDict, Annotated, List, Union
from langchain_core.agents import AgentAction, AgentFinish
from langchain_core.messages import BaseMessage
import operator

class AgentState(TypedDict):
    input: str
    chat_history: list[BaseMessage]
    agent_outcome: Union[AgentAction, AgentFinish, None]
    intermediate_steps: Annotated[list[tuple[AgentAction, str]], operator.add]

Reflection Agents

Reflection is a prompting strategy used to improve the quality and success rate of agents and similar AI systems. It involves prompting an LLM to reflect on and critique its past actions, sometimes incorporating additional external information such as tool observations.

Reflexion, by Shinn et al., is an architecture designed to learn through verbal feedback and self-reflection. Within Reflexion, the actor agent explicitly critiques each response and grounds its criticism in external data. It is forced to generate citations and explicitly enumerate superfluous and missing aspects of the generated response. This makes the content of the reflections more constructive and better steers the generator in responding to the feedback.

Language Agent Tree Search (LATS), by Zhou et al., is a general LLM agent search algorithm that combines reflection/evaluation and search (specifically Monte Carlo tree search) to achieve better overall task performance compared to similar techniques like ReAct, Reflexion, or even Tree of Thoughts. It adopts a standard reinforcement learning (RL) task framing, replacing the RL agents, value functions, and optimizer with calls to an LLM. LATS can be used for one-time tasks that require reflection, such as building software products. The search has four main steps:
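To make the basic reflection pattern concrete, here is a schematic sketch of a generate-critique-revise loop. This is not the Reflexion or LATS implementation; the llm callable is a hypothetical placeholder for any model client.

def llm(prompt: str) -> str:
    # Hypothetical placeholder: plug in any chat or completion client here.
    raise NotImplementedError("wire up your model client")

def reflect_and_revise(task: str, max_rounds: int = 2) -> str:
    # Initial attempt at the task.
    draft = llm(f"Complete the task:\n{task}")
    for _ in range(max_rounds):
        # Ask the model to critique its own answer, enumerating missing and
        # superfluous points and citing any evidence it relies on.
        critique = llm(
            "Critique the answer below. List missing and superfluous points, "
            f"and cite any evidence you rely on.\n\nTask: {task}\n\nAnswer:\n{draft}"
        )
        # Revise the answer using the critique.
        draft = llm(
            f"Revise the answer using this critique.\n\nTask: {task}\n\n"
            f"Answer:\n{draft}\n\nCritique:\n{critique}"
        )
    return draft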

JSON Based Agents with Ollama & LangChain

Notebook based on an existing LangChain implementation of a JSON-based agent, using the Mixtral 8x7b LLM. I used Mixtral 8x7b as a movie agent to interact with Neo4j, a native graph database, through a semantic layer. Quite a high-level article. The main idea is that you can use Ollama with JSON outputs, but you need to account for small-talk turns that do not require JSON output by routing those parts to a separate tool of the agent (a minimal sketch follows the TODO below).

TODO Serve Ollama local and interact with it: ollama run llava, ollama run codellama:70b, ollama run llama2-uncensored:7b, ollama run sqlcoder:15b, ollama run llama2:13b.
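As a concrete companion to the JSON-agent idea above, here is a minimal sketch of requesting JSON-only output from a locally served Ollama model over its REST API. It assumes ollama serve is running locally with a Mixtral model pulled; the prompt and the expected JSON shape are illustrative.

import json
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mixtral",
        "prompt": (
            "Recommend one movie similar to Inception. "
            'Reply only with JSON of the form {"title": ..., "reason": ...}.'
        ),
        "format": "json",   # constrains the model to emit valid JSON
        "stream": False,
    },
    timeout=120,
)
answer = json.loads(resp.json()["response"])
print(answer["title"], "-", answer["reason"])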

OpenGPTs

OpenGPTs runs on MessageGraph, a particular type of Graph we introduced in LangGraph. This graph is special in that each node takes in a list of messages and returns messages to append to the list of messages.
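A minimal sketch of that pattern, assuming the MessageGraph class exported by langgraph; the echo node here stands in for a real LLM call.

from langgraph.graph import MessageGraph, END
from langchain_core.messages import AIMessage, HumanMessage

def chatbot(messages):
    # Each node receives the full message list and returns messages to append.
    last = messages[-1].content
    return AIMessage(content=f"You said: {last}")

graph = MessageGraph()
graph.add_node("chatbot", chatbot)
graph.set_entry_point("chatbot")
graph.add_edge("chatbot", END)
app = graph.compile()

print(app.invoke([HumanMessage(content="hello")]))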

Architectures:

Persistence: handled through checkpoint objects. The checkpointer saves the current state of the graph after each node is called.
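A sketch of persistence, reusing the echo graph from the sketch above but compiling it with a checkpointer. This uses the current langgraph MemorySaver API, which may differ from the checkpoint objects OpenGPTs used at the time; the thread id is illustrative.

from langgraph.graph import MessageGraph, END
from langgraph.checkpoint.memory import MemorySaver
from langchain_core.messages import AIMessage, HumanMessage

graph = MessageGraph()
graph.add_node("chatbot", lambda msgs: AIMessage(content=f"echo: {msgs[-1].content}"))
graph.set_entry_point("chatbot")
graph.add_edge("chatbot", END)

# Compile with a checkpointer; state is saved after each node runs.
app = graph.compile(checkpointer=MemorySaver())

# The thread_id keys the saved state, so a later call with the same id resumes it.
config = {"configurable": {"thread_id": "user-42"}}
app.invoke([HumanMessage(content="hello")], config=config)
app.invoke([HumanMessage(content="hello again")], config=config)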

Configuration: mark fields as configurable and then pass in configuration options during run time.
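A sketch of that pattern using LangChain's configurable fields; the field id and the temperature override are illustrative, and an OpenAI API key is assumed for the chat model.

from langchain_core.runnables import ConfigurableField
from langchain_openai import ChatOpenAI

model = ChatOpenAI(temperature=0).configurable_fields(
    temperature=ConfigurableField(
        id="llm_temperature",
        name="LLM temperature",
        description="Sampling temperature for the chat model",
    )
)

# Default configuration.
model.invoke("Pick a random number")
# Override the configurable field at run time for this call only.
model.with_config(configurable={"llm_temperature": 0.9}).invoke("Pick a random number")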

New Tool: Robocorp’s Action Server. Robocorp’s action server is an easy way to define - and run - arbitrary Python functions as tools.

Connery: OpenSource Plugin Infrastructure for OpenGPTs and LLM Apps

Great article about making AI production-safe! They offer their own system that seems like a no-code tool to integrate certain features. I think this provides inspo on how to make our tools even better. However, we’d like to stick to code.

Essential integration and personalization features:

AI safety and control

Risks like misinterpreted commands can be mitigated with:

Infrastructure for integrations

Multi-Agent Workflows

Benefits:

Types:

Example of using CrewAI with LangChain and LangGraph to automate the process of checking emails and creating drafts. CrewAI orchestrates autonomous AI agents, enabling them to collaborate and execute complex tasks efficiently.
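A hedged sketch of what such a crew might look like with the crewai package; the roles, goals, and task texts are illustrative and not taken from the example, and a configured LLM backend is assumed.

from crewai import Agent, Task, Crew

triager = Agent(
    role="Email triager",
    goal="Identify which new emails need a reply",
    backstory="You monitor an inbox and prioritize what matters.",
)
writer = Agent(
    role="Draft writer",
    goal="Write polite, concise reply drafts",
    backstory="You turn triage notes into ready-to-send drafts.",
)

triage_task = Task(
    description="Review today's unread emails and list the ones needing replies.",
    expected_output="A bullet list of emails that need a reply and why.",
    agent=triager,
)
draft_task = Task(
    description="Write a reply draft for each email flagged by the triager.",
    expected_output="One draft per flagged email.",
    agent=writer,
)

# The crew runs the tasks with the assigned agents and returns the final output.
crew = Crew(agents=[triager, writer], tasks=[triage_task, draft_task])
print(crew.kickoff())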

CrewAI Unleashed

Concepts:

TODO Play with CrewAI Examples

Adding Long Term Memory to OpenGPTs

Solution:

Long-term memory is a very underexplored topic. Part of that is probably because it is either (1) very general, trending towards AGI, or (2) so application specific that it is tough to talk about generically. In this case, it becomes important to think critically about:

What is the state that is tracked? How is the state updated? How is the state used?

Prompt Guide AI

Interesting site with lots of knowledge and related info on AI

Microsoft’s LASER

Original article

Rakuten

Offering three different agents: AI Analyst (market intelligence research assistant), AI Agent (self-serve customer support), and AI Librarian (answers client questions). No word on the actual impact, such as adoption rates and improvements.

Dataherald

Builds a text-to-SQL engine based on agents. They offer a free API and work with golden SQL queries. They have two bots: one for ingestion, which queries the target SQL databases, and one for query generation, like my tool.

Microsoft AI Risk Package

The Python Risk Identification Tool for generative AI (PyRIT) is an open-access automation framework that empowers security professionals and ML engineers to red-team foundation models and their applications. How-to guide.

Threads

Seb’s Thoughts

Conversational Speech AI

Seb’s opinion: Extremely fast and impressive. I tried the therapist and it gave me some generic but authentic answers. I liked the quick speed of responses.

Superfast TPMs

Production RAG with PG Vector and OSS Model

What’s an Index? A data structure designed to enable querying by an LLM. In LlamaIndex terms, this data structure is composed of Document objects.

What’s a Vector Store Index? The Vector Store Index is the predominant index type encountered when working with Large Language Models (LLMs). Within LlamaIndex, the VectorStoreIndex data type processes your Documents by dividing them into Nodes and generating vector embeddings for the text within each node, preparing them for querying by an LLM.
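A minimal sketch of building and querying a VectorStoreIndex; it assumes LlamaIndex >= 0.10 import paths, a ./data folder of documents, and default (OpenAI-backed) embedding and LLM settings.

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load raw files into Document objects.
documents = SimpleDirectoryReader("data").load_data()

# Documents are split into Nodes and embedded when the index is built.
index = VectorStoreIndex.from_documents(documents)

# Query the index; retrieved nodes are passed to the LLM to synthesize an answer.
query_engine = index.as_query_engine()
print(query_engine.query("What does the report say about Q4 revenue?"))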

DeepEval

Problems of LLMs:

In a typical AI application, evaluation falls into two main areas:

Response Evaluation: Does the response match the retrieved context? Does it also match the query? Does it match the reference answer or guidelines?
Retrieval Evaluation: Are the retrieved sources relevant to the query?

Steps to Create an Evaluation Process

DeepEval for Evaluation

Use DeepEval for a few reasons: it is open source, it works with your existing Python pytest suite, it offers a wide variety of evaluation metrics, the source code is readable, and it works with open-source models.

It also offers these metrics: General Evaluation (G-Eval: define any metric in free text), Summarization, Faithfulness, Answer Relevancy, Contextual Relevancy, Contextual Precision, Contextual Recall, Ragas, Hallucination, Toxicity, and Bias.

Example testing for hallucination: instantiate the HallucinationMetric, giving it a score threshold that the test must pass. Under the hood, this particular metric downloads Vectara's hallucination model from Hugging Face.
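A minimal pytest-style sketch of that hallucination check with DeepEval; the input, output, and context strings are made up for illustration.

from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

def test_no_hallucination():
    test_case = LLMTestCase(
        input="Where is the Eiffel Tower?",
        actual_output="The Eiffel Tower is in Berlin.",
        context=["The Eiffel Tower is a landmark in Paris, France."],
    )
    # The metric scores the output against the context and fails the test if the
    # hallucination score exceeds the threshold.
    metric = HallucinationMetric(threshold=0.5)
    assert_test(test_case, [metric])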

No one wants to pay to run their test suite. Open-source evaluation is a really powerful tool, and I expect to see a lot more adoption of it in the coming months.

Technical User Intro to LLM

What is a Large Language Model?

At its core, a Large Language Model (LLM) comprises two essential components: the parameters file (the weights) and the code that runs those parameters.

llama-2-70b utilized around 10TB of web-crawled text and required significant computational resources:

This training stage effectively compresses the internet text into a ‘lossy’ zip file format, where the 10TB of data is reduced to a 140GB parameter file, achieving a compression ratio of about 100X. Unlike lossless compression (like a typical zip file), this process is lossy, meaning some original information is not retained.

Neural Network Operation

At its heart, a Neural Network in a Large Language Model aims to predict the next word in a given sequence of words. The parameters are intricately woven throughout the network, underpinning its predictive capabilities. There’s a strong mathematical relationship between prediction and compression, explaining why training an LLM can be likened to compressing the internet.
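A tiny illustration of next-word prediction, using GPT-2 via the transformers library purely as a lightweight stand-in for the much larger models discussed here.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The cat sat on the", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Probabilities for the next token, given everything seen so far.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx)):>10}  {prob:.3f}")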

Comparison of LLMs: Models are ranked using an Elo rating system, akin to chess ratings. This score is derived from how models perform in comparison to each other on specific tasks.
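For reference, the standard Elo update that such rankings are built on, with an illustrative K-factor and ratings:

def expected_score(r_a: float, r_b: float) -> float:
    # Probability that A beats B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float, k: float = 32) -> tuple[float, float]:
    # score_a is 1.0 for a win, 0.5 for a tie, 0.0 for a loss.
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

# Model A (rated 1200) beats model B (rated 1250) in a pairwise comparison:
print(update(1200, 1250, score_a=1.0))  # A gains ~18 points, B loses ~18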

The Future of Large Language Models and Generative AI

Scaling Laws: The effectiveness of LLMs at next-word prediction depends on two variables: the number of parameters (N) and the training text volume (D). Current trends suggest essentially limitless scaling potential in these dimensions, meaning LLMs will continue to improve as companies spend more time and money training increasingly large models, and that we are nowhere near “topping out” in terms of LLM quality.
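A small numeric sketch of such a scaling law, using the approximate Chinchilla fit from Hoffmann et al. (2022) purely to make the shape concrete; the constants are not from this article.

def loss(n_params: float, n_tokens: float) -> float:
    # Loss falls smoothly as parameters (N) and training tokens (D) grow.
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Bigger models trained on more text predict the next word better (lower loss).
print(loss(7e9, 2e12))    # ~7B params, ~2T tokens
print(loss(70e9, 2e12))   # ~70B params, ~2T tokens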

Gemma: Parameter-Efficient Fine-Tuning (PEFT)