Tech companies are integrating more and more AI language models into their products, enabling users to accomplish a wide range of tasks, from booking trips to organizing calendars to taking notes during meetings.
However, the way these models work, which involves receiving user instructions and searching the internet for answers, introduces a lot of new risks. The integration of AI opens the door to potential misuse, like personal information leaks, or the facilitation of activities like phishing or scamming.
AI language models, such as ChatGPT, Bard, and Bing, drive chatbot functionality by generating human-like text. These models operate by taking instructions or “prompts” from users and predicting the most probable next word based on their training data. But here is the problem: “Prompt injections” can be used to hijack the model’s intended limitations and guidelines. And the way it works is really simple: individuals use prompts that instruct the language model to disregard its prior instructions and safety precautions.
There are numerous examples on Reddit where people have managed to manipulate AI models to endorse racism, propagate conspiracy theories, or provide suggestions for engaging in illegal activities like shoplifting or constructing explosives. One method of achieving this is by instructing the chatbot to “role-play” as a different AI model that can fulfill the user’s desired actions, even if it requires disregarding the safety measures originally put in place for it.
An even bigger problem is scamming and phishing. OpenAI made an announcement in late March, saying that it allows people to integrate ChatGPT into products that can browse the internet and engage with it. This new feature has already been utilized by startups to create virtual assistants capable of performing real-world actions, like making flight reservations or scheduling meetings for users. However, by letting ChatGPT roam free on the internet, it becomes significantly more susceptible to potential attacks, making it highly vulnerable from a security standpoint.
AI-powered virtual assistants, as they scrape content from the web, become susceptible to a form of attack known as “indirect prompt injection”. This type of attack involves a third party modifying a website by inserting hidden text with the intention of influencing the behavior of the AI system. Attackers may use social media or email to guide users to websites containing these hidden prompts. Then, the AI system can be manipulated to enable attackers to attempt to extract sensitive information, such as people’s credit card details.
Furthermore, malicious actors could exploit hidden prompt injections by sending emails containing such prompts. If the recipient happens to utilize an AI virtual assistant, the attacker could potentially manipulate the assistant into disclosing personal information extracted from the target’s emails. In some cases, the attacker might even utilize the compromised virtual assistant to send emails on their behalf to individuals on the target’s contact list.
Research conducted by a team of researchers from Google, Nvidia, and startup Robust Intelligence has revealed that AI language models are vulnerable to attacks even before they are deployed. Large AI models are trained on extensive datasets obtained from web scraping. Currently, tech companies are relying on trust, assuming that this data has not been tampered with maliciously.
The researchers discovered that it is indeed easy to contaminate the data sets used to train large AI models. They were able to purchase domains for just $60 and populate them with handpicked images, which were subsequently scraped into the training data sets. Furthermore, they successfully manipulated and added sentences within Wikipedia entries that made their way into the data sets of AI models.
Adding to the concern, the frequency of repetition within an AI model’s training data strengthens the associations formed. By deliberately injecting a substantial number of tainted examples into the data set, it becomes possible to alter the behavior and outputs of the model forever.
While tech companies are aware of these issues, there are currently no known solutions in place. The challenges posed by prompt injection and the manipulation of AI models remain unresolved, leaving a gap in effective defenses against such attacks.