22 July 2023

When I used analytics to resolve a dispute

Long back, I used to travel by office cab, and the person who got off the cab last in the evening acted as the "route monitor", deciding the best route for the cab to take. Over time, he observed areas where traffic frequently caused delays and suggested a variation in the route. Since the new route required a female employee to get off the cab earlier and walk along a long stretch of lonely road, she protested, and the matter went to HR and Admin as a women's safety issue. Meetings were held, but they reached a stalemate because nobody could prove anything.

Observing that there was a lack of data, I stepped in and asked for two weeks to objectively analyze the route timings. The first week, we took the new route, and the second week we took the old route. This is the rough picture of the data I gathered:

Route analytics.

At first, I noted the timings only for a few prominent points where colleagues were dropped off (grey circles). As I noticed a fluctuation in time, I added more landmarks. The results were surprising.

When the cab reached the point shown by the blue square, we had the option of taking route 1 (the new route suggested) or route 2 (the old route). Both routes would eventually lead to the point shown by the yellow circle. The route monitor's assumption was that the time delay between the blue and yellow points was what caused everyone else to reach home late. However, the data revealed an entirely different picture. These were the points I noted:

  • Route 1 avoided initial delay: The blue and yellow points are consistently closer for route 1. So there was indeed some time saved by avoiding route 2's traffic signal.
  • Route 1's initial time saving was futile: On the Thursday of Route 1, we reached the final point (my stop) at 7:50, despite there being a very short time delay between the blue and yellow points.
  • Route 2's initial time loss was insignificant: Even though the Thursday of Route 2 lost some time between the blue and yellow points, I still reached home early, a little after 7:30.
  • First point mattered: The very first grey circle was the first landmark we reached after leaving office and crossing a frequently jammed stretch of road. The slight fluctuation at this first point appeared to match the large fluctuations of the other points on the same horizontal line. This indicated that on certain days more people in other companies left their offices earlier, causing jams at various parts of the city, which caused delays at various points.
  • Rain irrelevant: The rain didn't cause any significant delay.
  • Travel after sunset: One of the complaints raised was that the lady would sometimes have to walk on that stretch of road after sunset, which made her uneasy. The diagram above captures the various times of the year sunset happens, and the arrows show a few other colleagues who also had to travel a bit in the dark after getting off the cab to reach home.
  • Insufficient data: The patterns showed great fluctuation, which led me to conclude that we needed more data to draw a firm conclusion. However, the existing data was sufficient for a preliminary one.

Resolution

We concluded that we could continue using the old route. However, the initial arguments between both parties and their rallying of other colleagues for support created a gloomy mood in what was earlier a happy group. During the two weeks, I noticed people attempting to influence the outcome of the dispute. Still, in the end, the data showed everyone that it was better to verify facts than make assumptions, and it led to a healthy resolution. The incident also showed that no matter how capable a leader is (like the route monitor), there will come situations when the team goes against the leader. Such situations need to be handled carefully and early, before they snowball into larger problems with long-term effects on morale.

When people are willing to go into details, the facts often provide enough reason to not launch into unnecessary turmoil. Data analytics can even provide surprising results for companies. For example, one company assumed that people would purchase raincoats or flashlights during hurricanes. However, the data showed them that people were purchasing strawberry pop-tarts!


12 July 2023

Generative AI with Large Language Models: Part 3


Continued from Part 2, this blog post covers more of what I learnt from the Coursera course on Generative AI.

LICENSE: Images shown in this blog post and the following parts of the blog are screenshots taken from the Generative AI course. You may not use or distribute these or the text content for commercial purposes. The course was created by DeepLearning.ai and is licensed under a Creative Commons license.

Here are some of the main points: 

Reinforcement Learning from Human Feedback (RLHF)

RLHF fine-tunes the LLM using human feedback, to ensure that the model is better aligned with human preferences. It can also be used to personalize LLMs to individual users.
Fine tuning with human feedback

LLMs trained on vast data can produce toxic language in their completions, reply in combative and aggressive voices, and provide detailed information about dangerous topics. The human values of helpfulness, honesty, and harmlessness are sometimes collectively called HHH, and are a set of principles that guide developers in the responsible use of AI. Additional fine-tuning with human feedback helps better align models.


In RLHF, human labelers score a dataset of completions generated by the original model, based on alignment criteria like helpfulness, harmlessness, and honesty. This dataset is used to train a reward model that scores the model's completions during the RLHF process, since ongoing human evaluation would be too time consuming. In the context of language modeling, the sequence of actions and states is called a rollout, as opposed to the term "playout" used in classic reinforcement learning.

A prompt dataset is used: prompts are fed to the LLM to produce completions, and humans rank the completions in order of how helpful they are. The same sets of completions are ranked by multiple humans to establish consensus and minimise the effect of labelers with poor judgement. The labeled datasets are then used to build a reward model that will eventually be capable of doing this evaluation instead of the humans. Human labelers are selected from a diverse global population.

An example of the instructions given to human labelers:

Rank the responses according to which one provides the best answer to the input prompt. What is the best answer? Make a decision based on 1. the correctness of the answer, and 2. the informativeness of the response. For 1, you are allowed to search the web. Overall, use your best judgement to rank answers based on being the most useful response, which we define as one which is at least somewhat correct, and minimally informative about what the prompt is asking for. If two responses provide the same correctness and informativeness by your judgement, and there is no clear winner, you may rank them the same, but please only use this sparingly. If the answer for a given response is nonsensical, irrelevant, highly ungrammatical/confusing, or does not clearly respond to the given prompt, label it with "F" (for fail), rather than its rank. Long answers are not always the best. Answers which provide succinct, coherent responses may be better than longer ones, if they are at least as correct and informative.

After completions are ranked, completion pairs are created. For each pair, you assign a reward of 1 to the preferred response and a reward of 0 to the less preferred response. Then you reorder the pairs so that the preferred option comes first. This is an important step because the reward model expects the preferred completion first. The human responses are now in the correct format for training the reward model. Note that while thumbs-up/thumbs-down feedback is often easier to gather than ranking feedback, ranked feedback gives you more prompt-completion data to train your reward model.
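
To make this concrete, here is a rough sketch (my own illustration, not from the course) of turning one human ranking over a prompt's completions into preferred-first pairs with rewards [1, 0]. The helper name and data layout are assumptions.

    from itertools import combinations

    def ranked_to_pairs(completions, ranks):
        # ranks: 1 = best. For every pair of completions, put the preferred one
        # first ("chosen") and attach rewards [1, 0] in that order.
        pairs = []
        for (c_i, r_i), (c_j, r_j) in combinations(zip(completions, ranks), 2):
            chosen, rejected = (c_i, c_j) if r_i < r_j else (c_j, c_i)
            pairs.append({"chosen": chosen, "rejected": rejected, "rewards": [1, 0]})
        return pairs

    # Example: completion "B" was ranked best, "A" second, "C" last.
    ranked_to_pairs(["A", "B", "C"], ranks=[2, 1, 3])
    # -> pairs with chosen/rejected: (B, A), (A, C), (B, C)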

Ranked and paired completions

When the reward model is used independently of humans, we feed it the prompt-completion pairs and try to minimize the loss, which is the negative log sigmoid of the difference between the reward values the model assigns to the preferred and rejected completions. Once trained, the reward model can be used as a binary classifier which provides a set of logits across the positive and negative classes. Logits are the un-normalized model outputs before applying any activation function; applying softmax to the logits gives a probability value.
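
As a minimal PyTorch sketch (not from the course) of the pairwise loss described above: minimizing the negative log sigmoid of the reward difference pushes the reward for the preferred completion above the reward for the rejected one.

    import torch
    import torch.nn.functional as F

    def pairwise_reward_loss(r_preferred: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
        # Bradley-Terry style loss used for reward models:
        # loss = -log(sigmoid(r_preferred - r_rejected)), averaged over the batch.
        return -F.logsigmoid(r_preferred - r_rejected).mean()

    # Example with dummy reward values for a batch of two pairs:
    loss = pairwise_reward_loss(torch.tensor([0.8, 0.1]), torch.tensor([0.2, 0.4]))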

Fine tuning the LLM with RL and the Reward model

Start with a model that already has good performance on your task of interest. First, you pass a prompt from your prompt dataset, in this case "a dog is...", to the instruct LLM, which generates a completion, in this case "a furry animal". Next, you send this completion and the original prompt to the reward model as a prompt-completion pair. The reward model evaluates the pair based on the human feedback it was trained on and returns a reward value. A higher value, such as 0.24, represents a more aligned response; a less aligned response would receive a lower value, such as -0.53. You then pass this reward value for the prompt-completion pair to the reinforcement learning algorithm, which updates the weights of the LLM and moves it towards generating more aligned, higher-reward responses. Let's call this intermediate version of the model the RL-updated LLM. This series of steps forms a single iteration of the RLHF process, and the iterations continue for a given number of epochs. If the process is working well, you will see the reward improve after each iteration as the model produces text that is increasingly aligned with human preferences. You continue this iterative process until your model is aligned based on some evaluation criterion, for example reaching a threshold value for the helpfulness you defined. You can also define a maximum number of steps, for example 20,000, as the stopping criterion.
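
Here is a very rough sketch of a single RLHF iteration, just to fix the flow of data in mind. Everything here (generate, the reward model call, ppo_update) is a placeholder, not a real API.

    def rlhf_iteration(policy_llm, reward_model, prompts, ppo_update):
        # One iteration of the loop described above, with hypothetical helpers.
        experiences = []
        for prompt in prompts:
            completion = policy_llm.generate(prompt)        # e.g. "a dog is..." -> "a furry animal"
            reward = reward_model(prompt, completion)       # scalar, e.g. 0.24 or -0.53
            experiences.append((prompt, completion, reward))
        ppo_update(policy_llm, experiences)                 # nudge weights toward higher reward
        return policy_llm                                   # the RL-updated LLM for the next iteration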

A popular choice for the RL algorithm is PPO (Proximal Policy Optimization).

Proximal Policy Optimization

PPO optimizes a policy (the LLM) to be more aligned with human preferences. "Proximal" means that the updated LLM is close to the previous version of the LLM. You begin with the Instruct LLM. The PPO then goes through two phases. In Phase 1, it creates completions. In Phase 2, it does a model update.

Phase 1: The expected reward of a completion is estimated through a separate head of the LLM called the "value function". Assume you generate many reward values using the reward model for the various completions of a prompt. The value function estimates the expected total future reward for a state "s" (based on the current sequence of tokens). This gives a baseline against which to evaluate the quality of completions with respect to the alignment criteria.

Trying to minimize value loss. V_theta(s) is the value function
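
Since the slide itself isn't reproduced here, one common way to write the value loss (following the standard PPO formulation, which is what the course's slide corresponds to as far as I can tell) is a squared error between the value head's estimate and the observed total future reward:

    L^{VF} = \frac{1}{2} \, \hat{\mathbb{E}}_t \left[ \left( V_\theta(s_t) - R_t \right)^2 \right]

where R_t is the known total future reward from state s_t, obtained from the rollouts and the reward model.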

Phase 2: We make a small change to the model and evaluate the impact of that change on your alignment goal. Model weight updates are guided by the prompt-completion losses and rewards. PPO ensures that model updates stay within a small "trust region"; such small (proximal) updates move the model toward higher rewards. The PPO policy objective is to find a policy whose expected reward is high (more aligned with human preferences).

Policy loss

Pi_theta is the model's probability distribution over tokens. The denominator is the probability of the next token under the initial LLM; the numerator is the probability under the updated LLM. A_hat_t is the advantage term, which estimates how much better or worse the current action is compared to all possible actions at that state. We look at the expected future rewards of a completion following the new token and estimate how advantageous this completion is compared to the rest. Maximising the advantage term gives a better aligned LLM. Consider the case where the advantage is positive for the suggested token: a positive advantage means the suggested token is better than average, so increasing the probability of that token is a good strategy that leads to higher rewards, and this is exactly what maximizing the expression does. If the suggested token is worse than average, the advantage is negative, and maximizing the expression demotes the token, which is again the correct strategy. The portion of the equation within clip's parentheses is the "trust region". It defines a region in proximity to the current LLM where our estimates have small errors, because the advantage estimates are only valid when the old and new policies are close to each other.
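
For reference, since the policy-loss slide isn't reproduced here, the standard PPO clipped surrogate objective that the description above corresponds to is:

    r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

    L^{POLICY} = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\big)\,\hat{A}_t \right) \right]

Here pi_theta_old is the frozen previous policy, epsilon sets the width of the trust region, and A_hat_t is the advantage estimate described above.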

Entropy loss

The entropy loss helps ensure creativity. The temperature setting influences creativity at inference time, while entropy influences it during training.
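
Again, since the slide isn't shown here, the entropy of the token distribution and the combined PPO objective are commonly written as follows (this follows the original PPO paper's sign convention, where the objective is maximized; the exact signs and coefficients on the slide may differ):

    L^{ENT} = -\sum_{a} \pi_\theta(a \mid s_t) \, \log \pi_\theta(a \mid s_t)

    L^{PPO} = L^{POLICY} - c_1 \, L^{VF} + c_2 \, L^{ENT}

where c_1 and c_2 are weighting hyperparameters.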

The PPO objective updates the model weights through backpropagation over several steps. Once the model weights are updated, PPO starts a new cycle: for the next iteration, the LLM is replaced with the updated LLM, and a new PPO cycle begins. After many iterations, you arrive at the human-aligned LLM. Q-learning is an alternative technique for fine-tuning LLMs through RL, and Direct Preference Optimization is a simpler alternative to RLHF.

Reward Hacking

Reward hacking is when the agent learns to cheat the system by favoring actions that maximize the reward received, even if those actions don't align well with the original objective. In LLMs, reward hacking can manifest as the addition of words or phrases to completions that result in high scores for the metric being aligned, but reduce the overall quality of the language. This can happen because PPO keeps updating the LLM's weights in whatever direction increases the reward. To mitigate this, a copy of the LLM is frozen and used as a reference. We calculate the difference in completions between the reference LLM and the RL-updated LLM using KL divergence, to check how much the updated model has diverged from the reference. KL divergence is calculated for each generated token across the whole vocabulary of the LLM, which can easily be tens or hundreds of thousands of tokens. However, using a softmax function, you can reduce the number of probabilities considered to much less than the full vocabulary size.
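
To make the KL penalty concrete, here is a small PyTorch sketch (my own illustration, not the course's implementation) of subtracting a KL term, computed against the frozen reference model, from the reward-model score. The function name and the beta coefficient are assumptions.

    import torch
    import torch.nn.functional as F

    def kl_penalized_reward(reward, policy_logits, ref_logits, beta=0.1):
        # policy_logits / ref_logits: [seq_len, vocab_size] for one completion.
        # KL(policy || reference) per generated token, summed over the sequence.
        policy_logp = F.log_softmax(policy_logits, dim=-1)
        ref_logp = F.log_softmax(ref_logits, dim=-1)
        kl_per_token = (policy_logp.exp() * (policy_logp - ref_logp)).sum(dim=-1)
        # Penalize completions that drift too far from the frozen reference LLM.
        return reward - beta * kl_per_token.sum()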

Using PEFT to reduce memory footprint

Using PEFT, you can use the same LLM as a reference and only update the weights of a PEFT adapter, not the full weights of the LLM. This means that you can reuse the same underlying LLM for both the reference model and the PPO model, which you update with the trained adapter parameters.

Evaluating the human aligned LLM

Once you have completed your RLHF alignment of the model, you will want to assess its performance. You can use the summarization dataset to quantify the reduction in toxicity. The number you use here is the toxicity score: the probability of the negative class (in this case, a toxic or hateful response), averaged across the completions. If RLHF has successfully reduced the toxicity of your LLM, this score should go down. First, you create a baseline toxicity score for the original instruct LLM by evaluating its completions on the summarization dataset with a reward model that can assess toxic language. Then you evaluate your newly human-aligned model on the same dataset and compare the scores. In this example, the toxicity score has indeed decreased after RLHF, indicating a less toxic, better aligned model.
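
A minimal sketch of that comparison, assuming a hypothetical score_toxicity(text) function that returns the probability of the toxic class (for example, from a hate-speech classifier used as the reward model):

    def mean_toxicity(generate, prompts, score_toxicity):
        # Average toxicity of the model's completions over an evaluation set.
        scores = [score_toxicity(generate(p)) for p in prompts]
        return sum(scores) / len(scores)

    # baseline = mean_toxicity(instruct_llm.generate, eval_prompts, score_toxicity)
    # aligned  = mean_toxicity(rlhf_llm.generate, eval_prompts, score_toxicity)
    # A successful RLHF run should show aligned < baseline.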

Scaling human effort to evaluate prompts

The labeled dataset used to train the reward model requires large teams of labelers, each evaluating many prompts. One idea to overcome this limitation is to scale through model self-supervision; Constitutional AI is one approach to scaled supervision. Constitutional AI is a method for training models using a set of rules and principles that govern the model's behavior. Together with a set of sample prompts, these form the constitution. You then train the model to self-critique and revise its responses to comply with those principles.

In the first stage, you carry out supervised learning. You start by prompting the model in ways that try to get it to generate harmful responses; this process is called red teaming. You then ask the model to critique its own harmful responses according to the constitutional principles and revise them to comply with those rules. Once done, you fine-tune the model using the pairs of red team prompts and the revised constitutional responses. You build up a dataset of many such examples to create a fine-tuned LLM that has learned how to generate constitutional responses. The second part of the process performs reinforcement learning. This stage is similar to RLHF, except that instead of human feedback, we now use feedback generated by a model. This is sometimes referred to as reinforcement learning from AI feedback, or RLAIF. Here you use the fine-tuned model from the previous step to generate a set of responses to your prompt. You then ask the model which of the responses is preferred according to the constitutional principles. The result is a model-generated preference dataset that you can use to train a reward model. With this reward model, you can now fine-tune your model further using a reinforcement learning algorithm like PPO.
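
A rough sketch of the critique-and-revise stage, just to show the shape of the loop. The constitution text, prompt wording, and llm.generate helper are all placeholders of my own, not the method's actual prompts.

    CONSTITUTION = [
        "Choose the response that is least harmful and most honest.",
    ]

    def constitutional_revision(llm, red_team_prompt):
        response = llm.generate(red_team_prompt)            # possibly harmful response
        critique = llm.generate(
            "Critique this response against these principles:\n"
            f"{CONSTITUTION}\n\nResponse: {response}"
        )
        revised = llm.generate(
            "Rewrite the response so it complies with the principles.\n"
            f"Critique: {critique}\nOriginal response: {response}"
        )
        # The (red team prompt, revised response) pair goes into the fine-tuning dataset.
        return red_team_prompt, revised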

Model optimizations for deployment

To integrate your model into applications, you need to ask several questions: How will your LLM function in deployment? How fast do you need your model to generate completions? What compute budget do you have? Are you willing to trade off model performance for improved inference speed or lower storage? Do you intend for your model to interact with external data or other applications? What will the intended application or API interface through which your model is consumed look like?

One of the primary ways to improve application performance is to reduce the size of the LLM, which allows the model to load more quickly and reduces inference latency. The challenge, however, is to reduce the size of the model while still maintaining model performance.

Distillation


Quantization

The quantization technique introduced in Part 1 was Quantization Aware Training (QAT). After a model is trained, you can perform post training quantization (PTQ) to optimize it for deployment. PTQ transforms a model's weights to a lower precision representation, such as 16-bit floating point or 8-bit integer. To reduce the model size and memory footprint, as well as the compute resources needed for model serving, quantization can be applied to just the model weights or to both weights and activation layers. In general, quantization approaches that include the activations can have a higher impact on model performance. Quantization also requires an extra calibration step to statistically capture the dynamic range of the original parameter values. As with other methods, there are tradeoffs because sometimes quantization results in a small percentage reduction in model evaluation metrics. However, that reduction can often be worth the cost savings and performance gains.
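
To illustrate the basic idea of post-training quantization, here is a small NumPy sketch (my own, not from the course) of symmetric 8-bit weight quantization with a scale factor, plus the calibration-style statistic (the maximum absolute value) used to set the dynamic range.

    import numpy as np

    def quantize_int8(w: np.ndarray):
        # Symmetric PTQ: map float weights into int8 values in [-127, 127].
        scale = np.abs(w).max() / 127.0          # dynamic range captured from the weights
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        # Approximate reconstruction used at inference time.
        return q.astype(np.float32) * scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, scale = quantize_int8(w)
    max_error = np.abs(w - dequantize(q, scale)).max()   # small, but non-zero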

Pruning

At a high level, the goal is to reduce model size for inference by eliminating weights that are not contributing much to overall model performance. These are the weights with values very close to or equal to zero. Note that some pruning methods require full retraining of the model, while others fall into the category of parameter efficient fine tuning, such as LoRA. There are also methods that focus on post-training pruning. In theory, this reduces the size of the model and improves performance. In practice, however, there may not be much impact on the size and performance if only a small percentage of the model weights are close to zero.
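
As a simple illustration (not a production pruning method), magnitude pruning zeroes out the smallest-magnitude weights of a tensor:

    import torch

    def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
        # Zero out the `sparsity` fraction of weights with the smallest magnitudes.
        k = int(weight.numel() * sparsity)
        if k == 0:
            return weight
        threshold = weight.abs().flatten().kthvalue(k).values
        mask = weight.abs() > threshold
        return weight * mask

    pruned = magnitude_prune(torch.randn(8, 8), sparsity=0.5)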

Time and effort in the lifecycle


Using the LLM in applications

Models can provide incorrect answers in situations like these:

  • Current info: Answering "who is the current president", when the model was trained at a time somebody else was president.
  • Floating point math: Answering mathematical questions. LLMs aren't designed to do mathematics.
  • Hallucination: Answering "What is a Martian Dunetree" will make the model "hallucinate" and provide an answer about some tree on Mars.

These problems need to be overcome via an orchestration library that connects the LLM to external components. 


Retrieval Augmented Generation (RAG)

RAG is a framework for building LLM powered systems that make use of external data sources. At the heart of this implementation is a model component called the Retriever, which consists of a query encoder and an external data source. The encoder takes the user's input prompt and encodes it into a form that can be used to query the data source. In the paper, the external data is a vector store,
but it could instead be an SQL database, CSV files, or other data storage format. These two components are trained together to find documents within the external data that are most relevant to the input query. The Retriever returns the best single or group of documents from the data source and combines the new information with the original user query. The new expanded prompt is then passed to the language model, which generates a completion that makes use of the data. 

RAG

In addition to overcoming knowledge cutoffs, RAG also helps you avoid the problem of the model hallucinating when it doesn't know the answer. RAG architectures can be used to integrate multiple types of external information sources. You can augment large language models with access to local documents, including private wikis and expert systems. RAG can also enable access to the Internet to extract information posted on web pages, for example, Wikipedia. By encoding the user input prompt as a SQL query, RAG can also interact with databases. Another important data storage strategy is a Vector Store, which contains vector representations of text. This is a particularly useful data format for language models, since internally they work with vector representations of language to generate text. Vector stores enable a fast and efficient kind of relevance search based on similarity. 

Implementing RAG is more complicated than simply adding text into the LLM. You need to consider: 

  • The size of the context window: Most text sources are too long to fit into the limited context window of the model, which is still at most just a few thousand tokens. Instead, the external data sources are chopped up into many chunks, each of which will fit in the context window. Packages like LangChain can do this.
  • Data format: Data must be available in a format that allows for easy retrieval of the most relevant text. Recall that large language models don't work directly with text, but instead create vector representations of each token in an embedding space. These embedding vectors allow the LLM to identify semantically related words through measures such as cosine similarity.

RAG methods take the small chunks of external data and process them through the LLM to create embedding vectors for each. These new representations of the data can be stored in structures called vector stores, which allow for fast searching of datasets and efficient identification of semantically related text. Vector databases are a particular implementation of a vector store where each vector is also identified by a key. This can allow, for instance, the text generated by RAG to also include a citation for the document from which it was retrieved.
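
Here is a minimal retrieval sketch (my own, not from the course or any particular library). embed() is a placeholder for whatever query/document encoder you use, and the dictionary of keyed chunks stands in for a vector database.

    import numpy as np

    def retrieve(query, chunks, embed, top_k=2):
        # chunks: {key: text}. Rank chunks by cosine similarity to the query.
        q = embed(query)
        scored = []
        for key, text in chunks.items():
            v = embed(text)
            sim = np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))
            scored.append((sim, key, text))
        return sorted(scored, reverse=True)[:top_k]

    def augment_prompt(query, retrieved):
        # Expand the user prompt with the retrieved chunks; keys double as citations.
        context = "\n".join(f"[{key}] {text}" for _, key, text in retrieved)
        return f"Answer using the context below. Cite the [keys] you use.\n{context}\n\nQuestion: {query}"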

Keys (each Text) in a vector database in n-dimensional space

Requirements when interacting with external applications

LLMs can be used to trigger actions when given the ability to interact with APIs. LLMs can also connect to other programming resources, for example a Python interpreter that enables models to incorporate accurate calculations into their outputs. It's important to note that prompts and completions are at the very heart of these workflows. The actions that the app takes in response to user requests are determined by the LLM, which serves as the application's reasoning engine.
In order to trigger actions, the completions generated by the LLM must contain certain important information. First, the model needs to be able to generate a set of instructions so that the application knows what actions to take. These instructions need to be understandable and correspond to allowed actions. For example, in a chatbot that helps users complete an order, the important steps are checking the order ID, sending a request to generate a shipping label, verifying the user's email, and emailing the shipping label to the user. The completions need to be formatted in a way that the external application can understand; this could be a Python script or an SQL query.
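
One way to picture this (a sketch of my own, not a specific framework) is to have the LLM emit a JSON instruction naming one of the allowed actions, which the application validates and dispatches:

    import json

    ALLOWED_ACTIONS = {"check_order_id", "generate_shipping_label", "verify_email", "email_label"}

    def dispatch(completion_text, handlers):
        # The LLM's completion is expected to be JSON,
        # e.g. {"action": "check_order_id", "order_id": "12345"}.
        instruction = json.loads(completion_text)
        action = instruction.get("action")
        if action not in ALLOWED_ACTIONS:
            raise ValueError(f"Model requested an unknown action: {action}")
        return handlers[action](instruction)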

Program Aided Language Models (PAL)

The strategy behind PAL is to have the LLM generate completions where Chain of Thought (CoT) reasoning steps are accompanied by computer code. This code is then passed to an interpreter to carry out the calculations necessary to solve the problem. You specify the output format for the model by including examples for one-shot or few-shot inference in the prompt.


PAL example

In the PAL example above, note how the blue lines start with a hash to generate Python comments. The left side is the one-shot example given to the LLM, which shows it how to reason out a problem and generate Python variables that will do the calculation. On the right side, you can see how the LLM generates a completion that is essentially a script which can be fed to a Python interpreter to do the calculation accurately.
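
To show what that looks like end to end, here is a small illustration of my own: a PAL-style completion (reasoning as comments, arithmetic as Python) and the orchestrating code that executes it. The word problem and variable names are made up for the example.

    import textwrap

    # A completion the LLM might produce for: "Roger has 5 tennis balls. He buys
    # 2 cans of 3 tennis balls each. How many tennis balls does he have now?"
    completion = textwrap.dedent("""
        # Roger starts with 5 tennis balls.
        tennis_balls = 5
        # He buys 2 cans of 3 balls each.
        bought_balls = 2 * 3
        # The answer is the total number of balls.
        answer = tennis_balls + bought_balls
    """)

    namespace = {}
    exec(completion, namespace)       # the orchestrator runs the generated script
    print(namespace["answer"])        # -> 11, computed exactly by the interpreter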

This process can be automated with the Orchestration library. It decides what action to take based on the LLM output. It can also manage the flow of information and initiation of calls to the external apps. The LLM is the reasoning engine which creates the plan which the orchestrator will interpret and execute.

ReAct: Combining reasoning and action

ReAct is a prompting strategy that combines chain of thought reasoning with action planning.

The sequence of an example prompt

The example prompts start with:

Question: Problem that requires advanced reasoning and multiple steps to solve. Eg: "Which magazine was started first? National Geographic magazine or Popular Mechanics?". 

The example then includes a Thought, Action, Observation trio of strings.

Thought: A reasoning step that demonstrates to the model how to tackle the problem and identify an action to take. Eg: "I need to search for National Geographic magazine and Popular Mechanics and find which one was started first".

To decide which external app to invoke, the model has to identify which action to take from a pre-determined list.

Action: An external task that the model can carry out from an allowed set of actions. In this case, the authors created a Python script to interact with Wikipedia. The allowed actions were search[entity], lookup[string] and finish[answer]. Since the Thought prompt identified it needs to search for two magazines, the action would be search[National Geographic] and search[Popular Mechanics].

Observation: This is where the new information provided by the external search is brought into the context of the prompt for the model to interpret. Eg: "National Geographic is an American monthly magazine founded in 1888".

Cycles: The prompt then repeats the Thought-Action-Observation cycle as many times as necessary, to obtain the answer. The second cycle will find information about Popular Mechanics and the third cycle will have a Thought "National Geographic was started in 1888 < 1902, when Popular Mechanics was started". The next action would be finish[National Geographic] to indicate the end of the process.

Inference process: For inference, you start with the ReAct example prompt. Depending on the LLM you're working with, you may find that you need to include more than one example and carry out few-shot inference. Next, prepend the instructions at the beginning of the example, and then insert the question you want to answer at the end. The full prompt now includes all of these individual pieces and can be passed to the LLM for inference.
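
A rough sketch of how such a prompt could be assembled. The instruction wording and the example trajectory below are my own paraphrase of the structure described above, not the exact text from the ReAct paper.

    REACT_INSTRUCTIONS = (
        "Solve the question by interleaving Thought, Action, and Observation steps.\n"
        "Allowed actions: search[entity], lookup[string], finish[answer].\n\n"
    )

    REACT_EXAMPLE = (
        "Question: Which magazine was started first, National Geographic or Popular Mechanics?\n"
        "Thought: I need to search both magazines and compare their founding dates.\n"
        "Action: search[National Geographic]\n"
        "Observation: National Geographic is an American monthly magazine founded in 1888.\n"
        "Action: search[Popular Mechanics]\n"
        "Observation: Popular Mechanics is an American magazine founded in 1902.\n"
        "Thought: 1888 is earlier than 1902, so National Geographic was started first.\n"
        "Action: finish[National Geographic]\n\n"
    )

    def build_react_prompt(question: str) -> str:
        # Instructions prepended, one-shot example in the middle, new question at the end.
        return REACT_INSTRUCTIONS + REACT_EXAMPLE + f"Question: {question}\n"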


ReAct bridges the gap between reasoning and acting in LLMs, yielding remarkable results across language reasoning and decision making tasks, by interleaving reasoning traces and actions. ReAct outperforms imitation and reinforcement learning methods in interactive decision making, even with minimal context examples. It not only enhances performance but also improves interpretability, trustworthiness, and diagnosability by allowing humans to distinguish between internal knowledge and external information.

Image: The figure provides a comprehensive visual comparison of different prompting methods in two distinct domains. The first part of the figure (1a) presents a comparison of four prompting methods: Standard, Chain-of-thought (CoT, Reason Only), Act-only, and ReAct (Reason+Act) for solving a HotpotQA question. Each method's approach is demonstrated through task-solving trajectories generated by the model (Act, Thought) and the environment (Obs). The second part of the figure (1b) focuses on a comparison between Act-only and ReAct prompting methods to solve an AlfWorld game. In both domains, in-context examples are omitted from the prompt, highlighting the generated trajectories as a result of the model's actions and thoughts and the observations made in the environment. This visual representation enables a clear understanding of the differences and advantages offered by the ReAct paradigm compared to other prompting methods in diverse task-solving scenarios.

 

LangChain

The LangChain framework provides you with modular pieces that contain the components necessary to work with LLMs.


These components include:

  • Prompt templates: for many different use cases that you can use to format both input examples and model completions. 
  • Memory: that you can use to store interactions with an LLM. 
  • Pre-built tools: that enable you to carry out a wide variety of tasks, including calls to external datasets and various APIs. 

Connecting a selection of these individual components together results in a chain. The creators of LangChain have developed a set of predefined chains that have been optimized for different use cases, and you can use these off-the-shelf to quickly get your app up and running. Sometimes your application workflow could take multiple paths depending on the information the user provides. In that case, you can't use a pre-determined chain; instead you'll need the flexibility to decide which actions to take as the user moves through the workflow.

  • Agent: can be used to interpret the input from the user and determine which tool or tools to use to complete the task. Agents can be incorporated into the chain to plan and execute one or more actions.

LangChain currently includes agents for both PAL and ReAct, among others.
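
As a tiny illustration of the prompt-template and chain components, here is a sketch using the older (pre-0.1) LangChain interface. Exact imports and class names vary between LangChain versions, and the model wrapper lines are left commented out since they depend on your provider and API key, so treat this as illustrative only.

    from langchain.prompts import PromptTemplate
    from langchain.chains import LLMChain

    template = PromptTemplate(
        input_variables=["question"],
        template="Answer the following question concisely:\n{question}",
    )

    # from langchain.llms import OpenAI          # one possible LLM wrapper (needs an API key)
    # llm = OpenAI(temperature=0)
    # chain = LLMChain(llm=llm, prompt=template)
    # answer = chain.run(question="Which magazine was started first?")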

Additional info:

  • HotpotQA: A multi-step question answering benchmark that requires reasoning over two or more Wikipedia passages.
  • FEVER: A benchmark that uses Wikipedia passages to verify facts.

Overview of the infrastructure needed for a Generative AI application

 

Responsible AI

With the growth of AI comes the recognition that we must all use it responsibly. 

There are lots of challenges. Three major ones are:

  • Toxicity: You can start by curating the training data. You can also train guardrail models to detect and filter out unwanted content in the training data. It also matters how much human annotation is involved: you want to provide enough guidance to annotators, and have a diverse, well-trained group of them, so that they understand how to identify and mark problematic data.
  • Hallucinations: Educate users that this is a reality of the technology and add disclaimers so they know what to look out for. You can also augment large language models with independent, verified sources so that answers can be double-checked against the data you get back, and develop methods for attributing generated output to particular pieces of training data so the information can always be traced back to its source. Always define the intended use case versus the unintended use cases.
  • Intellectual property issues: This is likely to be addressed over time by a mixture of technology, policymakers, and other legal mechanisms. We also want a system of governance to make sure that every stakeholder does what they need to do to prevent problems in the near term. There is a new concept of machine unlearning, in which protected content or its effects on generative AI outputs are reduced or removed; this approach is still very primitive and at the research stage today. We can also use filtering or blocking approaches that compare generated content to protected content and training data, and suppress or replace it before presenting it to the user if it is too similar.

Defining use cases is very important: the more specific and narrow, the better. One example of using gen AI to test and evaluate the robustness of a system is face ID. Gen AI can create different versions of a face: if I'm testing a system that uses my face to unlock my phone, I want to test it with many versions of my face, with long hair, short hair, glasses, makeup, and no makeup, and gen AI can do this at scale. That is an example of using it to test robustness. We also want to assess the risk, because each use case has its own set of risks, some better and some worse. Evaluating performance is truly a function of both the data and the system: the same system, tested with different types of data, may perform very well or very poorly. We also want to iterate over the AI lifecycle; it's never one and done. Creating AI is a continuous, iterative cycle where responsibility is implemented at the concept stage as well as the deployment stage, with feedback monitored over time. And last but not least, we want to issue governance policies throughout the lifecycle and accountability measures for every stakeholder involved.

Conclusion 

As model capabilities increase, we'll also need more scalable techniques for human oversight, such as constitutional AI, discussed earlier in this post. Researchers continue to explore scaling laws for all steps of the project lifecycle, including techniques that better predict model performance so that resources are used efficiently, for example through simulations. And scale doesn't always mean bigger: research teams are working on model optimizations for small-device and edge deployments. For example, llama.cpp is a C++ implementation of the LLaMA model that uses 4-bit integer quantization to run on a laptop.

Researchers are looking into developing models that support longer prompts and contexts, for example to summarize entire books. Models will also increasingly support multi-modality across language, images, video, audio, etc. Researchers are also trying to learn more about LLM reasoning and are exploring LLMs that combine structured knowledge and symbolic methods. This research field of neurosymbolic AI explores the model's ability to learn from experience and to reason from what has been learned.

To conclude, you could even ask ChatGPT about what the future holds :-)