- Gemini is Google’s new multimodal AI model that can take in text, images, videos, and sound, and produce output in any of those formats.
- Google claims its top-tier Gemini Ultra outperforms human experts and OpenAI’s GPT-4 on the MMLU language understanding benchmark, making it a powerful generative AI model.
- Gemini is already being used in Google’s Bard chatbot and will be available for developers to try in Google AI Studio and Google Cloud Vertex AI.
We seem to be in the full swing of a second age in which any popular technology has to have artificial intelligence in it.
Nary a decade prior, bits of machine learning made their way into little tricks like identifying subjects in a camera’s vision or creating sentences that may or may not actually be useful. Now, as we approach a peak of generative AI (with more of them perhaps on the way), Google is upping the stakes with its new “multimodal” model called Gemini.
If you’re curious about what makes Gemini tick, why it’s so different from the likes of OpenAI’s ChatGPT, and how you might get to experience it at work, we’re here to give you the lay of the land.
What is Gemini?
Google debuted Gemini on Dec. 6, 2023, as its latest all-purpose “multimodal” generative AI model. It comes in three sizes: Ultra (which is being held back from wider commercial use for now), Pro, and Nano.
Up to this point, widely available large language models (LLMs) and other generative models worked by analyzing input in one media format and expanding upon the subject in a desired output format. For example, OpenAI’s Generative Pre-trained Transformer model or GPT deals in text-to-text exchanges while DALL-E translates text prompts into images. Each model would be tuned for one type of input and one type of output.
This is where all this talk of multimodality comes in: Gemini can take in text (including code), images, videos, and sound and, with some prompting, put out something new in any of those formats. In other words, one multimodal LLM can theoretically do the jobs of several dedicated single-purpose LLMs.
Google’s sizzle reel for Gemini gives you a good idea of just how polished interactions with a decently-equipped model are. Don’t let the video and its slick editing fool you, though, as none of these interactions happen as quickly as they appear to on screen. You can learn about the meticulous process Google went through to engineer its prompts in a Google for Developers blog post.
That said, you do get a sense of the level of detail and reasoning Gemini is able to carry into whatever it’s tasked with doing. I was personally most impressed with Gemini being able to see an untraced connect-the-dots picture and then correctly determine it to be of a crab (4:20). Gemini was also asked to create an emoji-based game where it would receive and judge answers based on where a user pointed on a map (2:05).
What can you do with Gemini?
You don’t typically come up to an LLM and ask it to write Shakespeare for you and it’s the same for Gemini. Instead, you’ll find it at work on a variety of surfaces. In this case, Google says it has been using Gemini to power its Search Generative Experience as well as the experimental NotebookLM app.
The company’s Bard chatbot is now running with Gemini Pro – available to use in more than 170 countries and regions, but only in US English – with a move up to Gemini Ultra sometime early next year. Android users can also experience some enhanced features with Gemini Nano, which is meant to be loaded directly onto devices. Pixel 8 Pro owners will get first crack, followed further down the line by those who use other devices on Android 14. And third-party app developers will be able to take Gemini for a spin in Google AI Studio and Google Cloud Vertex AI starting December 13.
How does Gemini compare to OpenAI’s GPT-4?
OpenAI beat Google to the punch with the launch of the nominally multimodal GPT-4 with GPT-4V (the ‘V’ is for vision) back in March 2023, updating it again with GPT-4 Turbo in November. GPT remains conservative in its approach as a text-focused transformer, but it does now accept images as input.
Benchmarks are far from the be-all and end-all when judging the performance of LLMs, but numbers in charts are what researchers kinda live for, so we’ll humor them for a little bit.
Google’s DeepMind research division claims in a technical report (PDF) that Gemini Ultra is the first model to outdo humans on the Massive Multitask Language Understanding (MMLU) benchmark with a score of 90.04 per cent versus the top human expert score of 89.8 per cent and GPT-4’s reported 86.4 per cent. Gemini Ultra also has GPT-4 beat on the Massive Multi-discipline Multimodal Understanding (MMMU) benchmark by a score of 59.4 per cent to 56.8 per cent.
That’s great and all, but with the Ultra size months away from public circulation, most people will be coming to grips with Gemini Pro. Its best showings stand at 79.13 per cent for MMLU (slightly better than Google’s own PaLM 2 and notably better than GPT-3.5) and 47.9 per cent for MMMU.
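To put those gaps in perspective, here is a minimal Python sketch that works out the margins in percentage points. It uses only the scores quoted above, which are Google’s reported figures rather than independent measurements; the `scores` table and the `margin` helper are names made up for this illustration.

```python
# Benchmark scores (per cent) as quoted in this article from Google's
# technical report. These are reported figures, not independent results.
scores = {
    "MMLU": {
        "Gemini Ultra": 90.04,
        "Human expert": 89.8,
        "GPT-4": 86.4,
        "Gemini Pro": 79.13,
    },
    "MMMU": {
        "Gemini Ultra": 59.4,
        "GPT-4": 56.8,
        "Gemini Pro": 47.9,
    },
}


def margin(benchmark: str, leader: str, other: str) -> float:
    """Return leader's lead over other in percentage points (2 decimals)."""
    table = scores[benchmark]
    return round(table[leader] - table[other], 2)


if __name__ == "__main__":
    comparisons = [
        ("MMLU", "Gemini Ultra", "Human expert"),
        ("MMLU", "Gemini Ultra", "GPT-4"),
        ("MMLU", "Gemini Ultra", "Gemini Pro"),
        ("MMMU", "Gemini Ultra", "GPT-4"),
        ("MMMU", "Gemini Ultra", "Gemini Pro"),
    ]
    for bench, leader, other in comparisons:
        print(f"{bench}: {leader} leads {other} by {margin(bench, leader, other)} points")
```

By these numbers, Ultra’s edge over the top human expert on MMLU is about a quarter of a point, while the gap between Ultra and the Pro model most people will actually use is roughly 11 points on both benchmarks.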
Try it yourself
Really, the best way to compare and contrast the usefulness of Gemini versus GPT-4 is to try each model out for yourself.
As we’ve said, Gemini is now in use with Google’s Bard chatbot. For GPT-4, you’ll be able to use that model for free via Bing Chat. While both services accept prompts with text and a single image, only Bing Chat is able to generate images as of right now, though it uses a DALL-E tool to do so. As amazing as that demo video was, Bard won’t be able to play Rock, Paper, Scissors with you today or in the near term. It’s still early days for Gemini.
Why is Google introducing Gemini now?
All this hubbub around Gemini comes shortly after Google launched the second version of the Pathways Language Model (PaLM) at the I/O conference in May. PaLM only went public the year before, and its own roots trace back through the development of the Language Model for Dialogue Applications (LaMDA) which Google announced at I/O 2021.
For the past several years, Mountain View has struggled to respond to the excitement around OpenAI, GPT, and the potential threats that AI-powered chat services presented to its core web search business. A chat service with deliberative precision and the capacity to handle an entire internet’s worth of information would let users get what they need with a single question on a single webpage, making it easier and quicker than a trip through the Google results. That is an especially mournful thought for Google when you consider all the eyeballs that wouldn’t be looking at those preferred listings at the top of the pile for which clients pay big bucks.
At the same time, trouble brewed at Google’s DeepMind and former Brain divisions. Dr. Timnit Gebru, one of a tiny class of Black women in the field of artificial intelligence research, claimed she was fired from the company for essentially refusing to back down from a paper she sought to publish about the environmental and societal risks posed by massive LLMs (via MIT Technology Review). In addition to controversies over research ethics, there have been underlying concerns about diverse representation – both in staff and in the data used to train AI models.
After OpenAI launched ChatGPT in late 2022, The New York Times reported from internal sources that Google was operating under a “code red.” Google then turned over large portions of its existing labor force, replacing people working on various sidecars and even in some of its major businesses like the Android operating system in order to double down on AI hires. Google co-founder Sergey Brin, who had left in December 2019, was even brought back into the fold to help with the effort (via Android Police).
All of this is to say that the development of generative AI remains relatively unstable at Google today when compared to the newfound stability at OpenAI – especially as its CEO, Sam Altman, has just countered a coup from the board of directors, cementing his power over the organization. Stay tuned.
Disclaimer: The content in this article is for educational and informational purposes only.