Do Meta’s New AI Models Get Us Closer To An Open-Source Future?

Key Takeaways:

  • Meta’s Fundamental AI Research (FAIR) team has released five mixed-modal AI models, stimulating the open-source AI space by encouraging as many people as possible to use them, produce iterations, and ultimately help advance AI in a responsible way.
  • Meta Chameleon is different to other models in that it uses an “early fusion token-based” approach, with a single end-to-end model that both processes and generates tokens.
  • This approach creates opportunities for the fashion industry across the supply chain, robotics, augmented and virtual reality interfaces, and sophisticated multimedia search, generation, and analysis.

Last week, The Interline released our first AI Report, which has already been downloaded close to 1,000 times by some of the biggest names in fashion. One of the core themes of the report is how fast AI is moving at the application level (not just the underlying models), and the latest update from Meta points to a potential next step in this evolution.

This week, Meta’s Fundamental AI Research (FAIR) team has released five mixed-modal AI models that are available to the public, including image-to-text and text-to-music generation models, a multi-token prediction model, and a technique for detecting AI-generated speech.

In their announcement, the company wrote: “By publicly sharing this research, we hope to inspire iterations and ultimately help advance AI in a responsible way” and that they “believe that collaboration with the global AI community is more important than ever.”

One update that stands out in particular is Chameleon, which, according to Meta, is a family of models that can understand and generate both images and text. Meta’s Chameleon follows the unveiling of other natively multimodal AI models, such as OpenAI’s GPT-4o, which is being used to power ChatGPT’s new visual capabilities, and the upcoming versions of Google’s alphabet-soup naming scheme of Gemini models.

What is special about Chameleon? We’ll start with the tech. Meta points out that Chameleon can process both text and images concurrently – like human beings. Unlike most large language models (LLMs), which typically produce results in a single modality (converting text into images, for example), Chameleon can accept and generate a wide range of combinations of text and images. Meta says that Chameleon can create imaginative captions for images, or use a blend of text prompts and images to craft entirely new scenes.

Understanding things on a deeper level required us to read “Chameleon: Mixed-Modal Early-Fusion Foundation Models”, written by the FAIR team at Meta. What makes Chameleon different is its architecture: up until now, most multimodal models have used a “late fusion” or “decision fusion” approach, where the model processes and encodes each data type separately before attempting to combine them. While this method does work, it doesn’t allow the models to fully integrate and understand multiple modalities in a connected, end-to-end way from the outset, and this late-stage unification process can result in inefficiencies and limitations.
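
To make the distinction concrete, here is a minimal, purely illustrative sketch of what a late-fusion pipeline looks like, written in PyTorch. It is not Meta’s code, and every name in it (the toy LateFusionModel, the encoder sizes, the classification head) is a hypothetical stand-in: the point is simply that each modality is encoded by its own network, and the two feature streams only meet at the very end.

```python
# Hypothetical late-fusion sketch – not Meta's code. Each modality has its own
# encoder, and the features are only combined at the decision stage.
import torch
import torch.nn as nn

class LateFusionModel(nn.Module):
    def __init__(self, text_vocab=32_000, d_model=512, num_classes=10):
        super().__init__()
        # Modality-specific encoders, run separately from one another.
        self.text_encoder = nn.Sequential(
            nn.Embedding(text_vocab, d_model),
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
                num_layers=2,
            ),
        )
        self.image_encoder = nn.Sequential(  # toy CNN standing in for a vision backbone
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, d_model),
        )
        # Fusion only happens here, near the output.
        self.fusion_head = nn.Linear(2 * d_model, num_classes)

    def forward(self, text_tokens, image):
        text_feat = self.text_encoder(text_tokens).mean(dim=1)  # (batch, d_model)
        image_feat = self.image_encoder(image)                  # (batch, d_model)
        fused = torch.cat([text_feat, image_feat], dim=-1)      # late, decision-level fusion
        return self.fusion_head(fused)

model = LateFusionModel()
logits = model(torch.randint(0, 32_000, (1, 16)), torch.randn(1, 3, 64, 64))
print(logits.shape)  # torch.Size([1, 10])
```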

Chameleon instead uses an “early fusion” approach: blending all data streams into a unified vocabulary from the jump. It is engineered to work natively with a mixed vocabulary of discrete tokens that can represent anything, whether words, pixels, or other data types. According to Meta’s researchers, the most similar model to Chameleon is Google’s Gemini, which also uses this “early-fusion token-based” approach. But Gemini uses separate image decoders in the generation phase, whereas Chameleon is an end-to-end model that both processes and generates tokens.
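
By contrast, here is a rough sketch of the early-fusion idea, again purely illustrative and not Meta’s implementation: image patches are first quantised into discrete codes (for example by a VQ-style image tokenizer), those codes are offset into the same ID space as the text vocabulary, and a single transformer both reads and predicts tokens from that unified vocabulary. The EarlyFusionLM class, the vocabulary sizes, and the helper image_codes_to_tokens are all hypothetical.

```python
# Hypothetical early-fusion sketch – not Meta's code. Text tokens and quantised
# image codes share one vocabulary, and a single model handles the mixed sequence
# end to end (causal masking omitted for brevity).
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000                   # assumed size of the text vocabulary
IMAGE_CODEBOOK = 8_192                # assumed size of a VQ image codebook
VOCAB = TEXT_VOCAB + IMAGE_CODEBOOK   # one unified token space

def image_codes_to_tokens(codes: torch.Tensor) -> torch.Tensor:
    """Offset image codebook indices so they share the text token ID space."""
    return codes + TEXT_VOCAB

class EarlyFusionLM(nn.Module):
    """One transformer over the interleaved text-and-image token stream."""
    def __init__(self, d_model=512, n_layers=2, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, VOCAB)  # can emit text OR image tokens

    def forward(self, tokens):
        h = self.backbone(self.embed(tokens))
        return self.lm_head(h)  # next-token logits over the unified vocabulary

# A single interleaved sequence: a few text tokens followed by an image's tokens.
text_part = torch.randint(0, TEXT_VOCAB, (1, 8))
image_part = image_codes_to_tokens(torch.randint(0, IMAGE_CODEBOOK, (1, 16)))
sequence = torch.cat([text_part, image_part], dim=1)

logits = EarlyFusionLM()(sequence)
print(logits.shape)  # torch.Size([1, 24, 40192]) – one model, one vocabulary
```

The practical difference is that nothing downstream needs to know which tokens came from text and which from pixels – understanding and generation flow through the same weights, which is what makes the model end-to-end and token-based.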

Developing this kind of powerful AI model is filled with obstacles and intricacies when it comes to training and scaling. Meta leaned on techniques like a “two-stage learning process” and stratospheric data sets containing 4.4 trillion tokens of text, images, and interleaved sequences of both. And training consumed more than 5 million hours on Nvidia A100 80GB GPUs. Not easy, and not cheap.
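
For a rough sense of scale, the back-of-envelope sketch below turns that GPU-hour figure into wall-clock time and cost. The cluster size and the price per GPU-hour are assumptions of ours, not numbers Meta has reported.

```python
# Back-of-envelope only: the cluster size and price per GPU-hour below are
# assumptions, not figures from Meta.
GPU_HOURS = 5_000_000        # "more than 5 million hours" of A100 80GB time, as cited
CLUSTER_GPUS = 1_000         # assumption: a hypothetical 1,000-GPU cluster
PRICE_PER_GPU_HOUR = 2.0     # assumption: rough on-demand cloud rate, in USD

wall_clock_days = GPU_HOURS / CLUSTER_GPUS / 24
estimated_cost = GPU_HOURS * PRICE_PER_GPU_HOUR

print(f"~{wall_clock_days:.0f} days of wall-clock training on {CLUSTER_GPUS:,} GPUs")
print(f"~${estimated_cost / 1e6:.0f}M in compute at ${PRICE_PER_GPU_HOUR:.0f}/GPU-hour")
```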

But the result? Capabilities in captioning images and answering questions about visuals, as well as generating composite results with interwoven text and imagery sequences. And despite being inherently multimodal, according to Meta, their models rival the likes of Gemini and Llama on text-only tasks, such as reading comprehension and common sense reasoning.

Also interesting is Chameleon’s potential for fashion, and for the technology world at large. By leading this early fusion approach, Meta has opened up new opportunities for advanced AI systems that could lead to the development of multimodal assistants, question-answering systems, and analytical tools that can comprehend any combination of language, visuals, and video in a unified way – precisely the kind of blended data that characterises the product design and development process in fashion.

This kind of general intelligence and versatility could also be crucial for future applications: robotics that could be used in the fashion supply chain; immersive augmented and virtual reality interfaces for consumer-facing fashion experiences as well as for business teams; and sophisticated multimedia search, generation, and analysis. This end-to-end architecture excels by achieving a fundamental understanding of all modalities without sacrificing any single capability. On the face of it, the multi-modality seems interesting, but it is unclear to what extent this will outperform a combination of unimodal models. For example, Chameleon might be able to generate an image and a caption together, but two other models might each be able to do one of those tasks separately.

Meta’s FAIR team have said: “Chameleon represents a significant step towards realising the vision of unified foundation models capable of flexibly reasoning over and generating multimodal content.” This is their guiding principle: achieving artificial general intelligence (AGI) that masters all modalities. While human-level cognition is still far off, breakthroughs like this early fusion approach bring us closer to bridging the gap between narrow AI and the goal of advanced AGI.

On the other hand, a research paper that is doing the rounds in the machine learning community – with a provocative title you can read here – by researchers Michael Townsen Hicks, James Humphries and Joe Slater at the University of Glasgow, explores how LLMs are designed to give the appearance of truth rather than having any actual grasp of logic, truth, or understanding. This is a topic our AI Report 2024 already started to dissect, but it’s clear that a very vocal contingent of the tech community takes exception to how AI is currently being described and sold.

Fashion, compared to other fields like education and healthcare, can in some ways have a more open relationship with the truth. If fashion trends are a reflection of society’s collective consciousness, then maybe it is acceptable to have models design clothes, or caption images of them, by averaging over what the internet would most likely have said – even if that leads to a kind of homogenisation. And away from those fluffier use cases, there are areas where fashion has an urgent need to forge a more concrete link with the truth – not least in supply chain visibility, environmental impact, fair wages, and the whole extended scope of disclosure and due diligence.

In other fields, though, we should probably be more cautious. As Emily M. Bender, a sober voice in the conversation around AI, points out, we rely on our information ecosystem comprising consistent relationships of facts, whereas “[w]hen Google, Microsoft, and OpenAI try to insert so-called ‘AI’ systems (driven by LLMs) between information seekers and information providers, they are interrupting the ability to make and maintain those relationships” – as well as deeply affecting what it means to trade in any kind of openness and transparency.

While Meta is not without its faults, this initiative to provide an open alternative to models like GPT-4 or Gemini (whose weights remain closed to the public) is certainly along the lines of responsible AI. Developing large-scale models like this is costly and out of reach for all but a handful of technology companies, and as AI grows more sophisticated and influential, the risks posed by opaque, unaccountable systems could multiply. Censorship, bias, and other perils could become the norm. So Meta’s decision to offer these models freely for use and further development is a step toward democratising AI, working towards a future where the technology isn’t monopolised by big tech companies. That way, the ability to define the future is still wide open.