July 6th, 2024

Tokens are a big reason today's generative AI falls short

Generative AI models encounter limitations with tokenization, affecting performance in tasks like math and language processing. Researchers explore alternatives like MambaByte to address tokenization challenges, aiming to enhance model efficiency and capabilities.

Generative AI models are limited in part by tokenization, the step that breaks text into smaller pieces called tokens before the model processes it. Tokenization can introduce biases, and it is especially hard for languages that do not put spaces between words, such as Chinese and Thai, where it also affects model efficiency and cost. Tokenizers also treat case and digits differently, which affects performance on tasks like math and language processing. Researchers are exploring alternatives such as MambaByte, a byte-level model that avoids tokenization and shows promise in handling noisy input. Challenges remain, but new model architectures may offer a way around the limitations that tokenization imposes on generative AI.
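To make the contrast between token-level and byte-level input concrete, here is a minimal sketch. It assumes a naive whitespace splitter as a stand-in for a word-level tokenizer (real tokenizers such as BPE are more involved), and the function names are hypothetical. It only illustrates why text without spaces is awkward to split and what a byte-level model like MambaByte would see instead.

```python
def whitespace_pieces(text: str) -> list[str]:
    """Naive stand-in for a word-level tokenizer: split on whitespace."""
    return text.split()


def utf8_bytes(text: str) -> list[int]:
    """Byte-level view: every string becomes a sequence of integers in 0..255."""
    return list(text.encode("utf-8"))


english = "once upon a time"
chinese = "很久很久以前"  # "a long, long time ago", written without spaces

for sample in (english, chinese):
    pieces = whitespace_pieces(sample)
    raw = utf8_bytes(sample)
    print(repr(sample), len(pieces), "pieces,", len(raw), "bytes")

# Case changes the byte sequence, so "Hello" and "HELLO" are different inputs.
print(utf8_bytes("Hello") == utf8_bytes("HELLO"))
```

The English sample yields 4 whitespace pieces and 16 bytes, while the Chinese sample yields a single piece and 18 bytes, since each of its six characters takes three bytes in UTF-8. The byte view needs no vocabulary at all, at the cost of longer sequences.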

Related

Francois Chollet – LLMs won't lead to AGI – $1M Prize to find solution [video]

The video discusses limitations of large language models in AI, emphasizing genuine understanding and problem-solving skills. A prize incentivizes AI systems showcasing these abilities. Adaptability and knowledge acquisition are highlighted as crucial for true intelligence.

Researchers upend AI status quo by eliminating matrix multiplication in LLMs

Researchers innovate AI language models by eliminating matrix multiplication, enhancing efficiency. A MatMul-free method reduces power consumption, costs, and challenges the necessity of matrix multiplication in high-performing models.

AI Scaling Myths

The article challenges myths about scaling AI models, emphasizing limitations in data availability and cost. It discusses shifts towards smaller, efficient models and warns against overestimating scaling's role in advancing AGI.

Prompt Injections in the Wild. Exploiting LLM Agents – Hitcon 2023 [video]

The video explores vulnerabilities in machine learning models, particularly GPT, emphasizing the importance of understanding and addressing adversarial attacks. Effective prompt engineering is crucial for engaging with AI models to prevent security risks.

Txtai – A Strong Alternative to ChromaDB and LangChain for Vector Search and RAG

Generative AI's rise in business and challenges with Large Language Models are discussed. Retrieval Augmented Generation (RAG) tackles data generation issues. LangChain, LlamaIndex, and txtai are compared for search capabilities and efficiency. Txtai stands out for streamlined tasks and text extraction, despite a narrower focus.

9 comments
By @kibwen - 7 months
> A tokenizer might encode “once upon a time” as “once,” “upon,” “a,” “time,” for example, while encoding “once upon a ” (which has a trailing whitespace) as “once,” “upon,” “a,” ” .” Depending on how a model is prompted — with “once upon a” or “once upon a ,” — the results may be completely different, because the model doesn’t understand (as a person would) that the meaning is the same.

TechCrunch, I respect that you have a style guide vis a vis punctuation and quotation marks, but please understand when it's appropriate to break the rules. :P
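The trailing-whitespace behavior in the quoted passage can be reproduced with a toy tokenizer. This is a sketch only: the five-entry vocabulary and the greedy longest-match rule are assumptions chosen for illustration, not the tokenizer of any particular model. Following a convention common in GPT-style vocabularies, the space is attached to the start of the following word.

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
VOCAB = {"once", " upon", " a", " time", " "}


def greedy_tokenize(text: str, vocab: set[str] = VOCAB) -> list[str]:
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    max_len = max(len(entry) for entry in vocab)
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # No vocabulary entry matches: emit the single character as-is.
            tokens.append(text[i])
            i += 1
    return tokens


print(greedy_tokenize("once upon a time"))
print(greedy_tokenize("once upon a "))
print(greedy_tokenize("once upon a"))
```

With this vocabulary, the prompt ending in a space finishes with a standalone " " token, while the prompt without it finishes on " a". The model therefore sees two different sequences for what a person would read as the same text.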

By @Lerc - 7 months
While I think tokenization does cause limitations, it is not as clear cut as one might think.

GPT4 can answer questions given to it in Base64. I would imagine it suffers some degree of degradation in ability from the extra workload this causes but I haven't seen any measurements on this.

I have wondered about other architectures to help. What happens when a little subnet encodes the (16 or 32?) characters in the neighborhood of the token into an embedding that gets attached to the top level token embedding?

By @kevincox - 7 months
This seems like saying "Synonyms are a big reason that humans fall short."

Part of what makes AI interesting is that it can understand a huge number of differently phrased data. It seems like different token encodings would only be a very minor complexity compared to the variety of human language.

By @amrb - 7 months
An alternative approach to BPE tokenization: https://arxiv.org/abs/2406.19223
By @greenyies - 7 months
I find this article very weird.

It doesn't really explain anything besides talking about tokenization on random levels.

You need a certain amount of data to even understand that once upon a time might be a higher level concept.

By @ttul - 7 months
Tokenization is a statistical technique that greatly compresses the input while providing some semantic hints to the underlying model. Tokenization is not the big thing holding back generative models. There are so many other challenges being worked on and steadily overcome and progress has been insanely rapid.
By @PaulHoule - 7 months
One take on it is that chatbots ought to know something about the tokens that they take. For instance you should be able to ask it how it tokenizes a phrase, what the number is for tokens, etc. One possibility is to train it on synthetic documents that describe the tokenization system.
By @deepsquirrelnet - 7 months
I implemented a hierarchical model that pooled utf8 encoded sequences to word vectors and trained it with a decoder on text denoising.

I think the future is a small word encoder model that replaces the token embedding codebook.

And here’s the reason: you can still create a codebook after training and then use the encoder model only for OOV. I’m not sure there’s an excuse not to be doing this, but open to suggestions.
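The codebook-plus-fallback idea in this comment can be sketched as follows. Everything here is hypothetical: `toy_word_encoder` is a deterministic stand-in for a trained byte-level word encoder, and the class and function names are invented for illustration, not taken from the commenter's implementation. The point is only the lookup pattern: precompute vectors for frequent words, and run the encoder only for out-of-vocabulary (OOV) words.

```python
from typing import Callable

Vector = tuple[float, ...]


def toy_word_encoder(word: str, dim: int = 4) -> Vector:
    """Stand-in for a learned encoder: average UTF-8 byte values per position bucket."""
    sums = [0.0] * dim
    counts = [0] * dim
    for pos, byte in enumerate(word.encode("utf-8")):
        sums[pos % dim] += byte / 255.0
        counts[pos % dim] += 1
    return tuple(s / c if c else 0.0 for s, c in zip(sums, counts))


def build_codebook(words: list[str], encoder: Callable[[str], Vector]) -> dict[str, Vector]:
    """Run the encoder once per frequent word after training and cache the result."""
    return {word: encoder(word) for word in words}


class CachedWordEmbedder:
    """Look words up in the codebook first; call the encoder only for OOV words."""

    def __init__(self, codebook: dict[str, Vector], encoder: Callable[[str], Vector]):
        self.codebook = codebook
        self.encoder = encoder
        self.encoder_calls = 0  # counts how often the fallback path is taken

    def embed(self, word: str) -> Vector:
        if word in self.codebook:
            return self.codebook[word]
        self.encoder_calls += 1
        return self.encoder(word)


embedder = CachedWordEmbedder(
    build_codebook(["once", "upon", "a", "time"], toy_word_encoder),
    toy_word_encoder,
)
vectors = [embedder.embed(w) for w in "once upon a tokenizer".split()]
print(len(vectors), "vectors,", embedder.encoder_calls, "encoder call(s)")
```

In this run only "tokenizer" misses the codebook, so the encoder is called once. Common words cost a dictionary lookup, and rare or misspelled words still get a vector instead of an unknown-token placeholder.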

By @soloist11 - 7 months
This is like saying binary numbers are the reason generative AI falls short. Computers work with transistors which are either on or off so what are these people proposing as the next computational paradigm to fix the problems with binary generative AI?