Por: Karen De la Hoz

Recientemente participé en The Generative AI in the Newsroom Project, un proyecto promovido por el profesor e investigador Nick Diakopoulos para explorar los usos de la inteligencia artificial en la sala de redacción. El siguiente artículo, publicado originalmente aquí, resume mi exploración de ChatGPT para corregir redacción y ortografía:

La Silla Vacía, a well-known Colombian digital native media outlet focused on political coverage, has a section called En Vivo (Live). All the journalists in the newsroom work at least a 6-hour shift every 15 days to update this section. Its objective is to narrate, through short and concise text entries, the main news of the moment.

Reducing the number of writing, grammar and spelling mistakes in En Vivo, and in all sections of the site, is one of the objectives of journalists and editors. To facilitate this task, we began experimenting with OpenAI models (GPT-3.5 and GPT-4).

The medium-term goal is to generate a virtuous circle: journalists who are more aware of their mistakes, who edit themselves in real time and pass fewer errors on to their editors, who in turn spend fewer hours correcting simple errors and have more time to focus on tasks that are more relevant to the audience.

The results of our tests are bittersweet. The system identifies and corrects most errors. However, it sometimes indicates that it has corrected a sentence, but when we check the “corrected” sentence, it is exactly the same as the original. Additionally, since we are working with a system that we did not train with our writing style book, some of its suggestions, although grammatically correct, are not relevant to our site. Last but not least, creating prompts takes longer than we initially estimated.

Despite the above, I am confident that we can reach a point where the system allows us to review the correct application of our styling guide. Collaboration and shared documentation and experiences are key in this process. For the time being, we do not contemplate a scenario where we would publish a text corrected by ChatGPT without these corrections having been approved by a human journalist or editor.

Below I describe my experimentation process with ChatGPT.

Randomness, system prompts and user prompts

When I started this experiment I wanted to achieve two things: I wanted the system to make corrections to a text and I wanted the system to tell me in bullet point format what it had corrected and why. Getting a list of errors and suggestions seemed to me the most optimal and the fastest way to integrate ChatGPT suggestions into a text.

This was one of the first prompts I used in the ChatGPT interface (with URL “https://chat.openai.com/” ): Actúa como un editor de estilo. Identifica e indícame errores gramaticales como redundancias, errores de concordancia en género y número o errores en los usos de los signos de puntuación. También indícame palabras mal escritas o typos e identifica cualquier incoherencia en el estilo del texto. Al mostrarme los resultados indícame la frase original, el error y la nueva redacción que me propones.

(You are a style editor. Identify and point out grammatical errors such as redundancies, gender and number agreement errors, or errors in the use of punctuation marks. Also indicate misspelled words or typos and identify any inconsistencies in the style of the text. When you show me the results, please indicate the original sentence, the error and the new wording you propose).

I started to notice that, although giving the system the same instruction, the system did different things every time I interacted with it. At Nick Diakopoulos’ suggestion, I started testing on OpenAI Playground, a different interface from the previous one. Under Nick’s guidance I understood that there was a variable called temperature to which I could assign a value between 0 and 2, and that the closer that value was to 0, the more coherent and predictable the system responses would be. I decided to set the temperature parameter to 0 in this experiment.

In GPT-4 I also started to distinguish between system prompts and user prompts. The system prompt is the initial text given to the model to establish the context of the conversation. The user prompt is used to orient the model to the specific goal of the conversation.

These were some of the versions of system prompts that I tried out:

Actúa como un editor de estilo.
(You are a style editor.)
Actúa como un editor de estilo en un medio de comunicación. Eres un experto en gramática española y un editor en un medio de comunicación. (You are a style editor in a media outlet. You are an expert in Spanish grammar and an editor in a media outlet.)
Eres un experto en gramática española y un periodista y editor con amplia experiencia. Tienes habilidad para editar noticias, garantizar el uso correcto y preciso del lenguaje, la redacción y la ortografía.
(You are an expert in Spanish grammar and an experienced journalist and editor. You are skilled at editing news stories, ensuring correct and accurate use of language, writing and spelling.)
Eres un experto en gramática y un periodista y editor con amplia experiencia. Tienes excelente redacción y ortografía.
(You are a grammar expert and an experienced journalist and editor. You have excellent writing and spelling skills.)
[This is my favorite so far.]

To build the user prompts I reviewed the La Silla Vacía Styling book and identified a list of guidelines that I wanted to check using ChatGPT. What worked best for me was to create a prompt for each rule and, when I checked that the prompt worked, I tried to build larger prompts that integrated instructions that had worked separately. In most cases I used zero-shot prompts, those in which I give the system a description or an indication of what I expect it to be able to do, without introducing specific examples.

To test the prompts I prepared a set of five test texts and intentionally added the errors I wanted the system to correct. If the system did not correct the errors, I made adjustments to the prompt and tried again and again. In none of the cases did I tell the system what it was doing wrong, I just kept trying. When I succeeded, I ran a real time test with 15 articles. By real time test I mean that when an article was ready to be published I checked it with ChatGPT and took the suggestions that were relevant. This was one of the prompts I used on real time tests.

Corrige typos, redundancias y palabras repetidas. Corrige cualquier error en el uso de signos de puntuación. Nunca separes el sujeto y el predicado por una coma. Usa comas antes y después de la información adicional. Usa comas después de expresiones de enlace. Corrige cualquier error en la conjugación de tiempos verbales. (Corrects typos, redundancies and repeated words. Correct any errors in the use of punctuation marks. Never separate the subject and predicate by a comma. Use commas before and after additional information. Use commas after linking expressions. Correct any errors in the conjugation of verb tenses).

Since my goal was to have the corrections in list format, I used a second prompt to accomplish this. My first intuition was to create a single user prompt, but as I did not get good results, I decided to use separate prompts. This is how this second prompt evolved:

Indícame qué correcciones realizaste.
(Tell me what corrections you have made)
Indícame todos los cambios que realizaste en el texto.
(List all the changes you have made in the text.)
Lista, uno a uno, todos los cambios que realizaste en el texto
(List, one by one, all the changes you made in the text).
[This is my favorite so far]

Note: By the end of May, when I was checking the grammar for the Spanish version of this article, I noticed that the prompt above was not working as well as before. The system started to tell me “No specific text to correct was provided in the above request. Please provide text with errors so that I can make corrections and list the changes made”. I did a little adjustment to the prompt and it fix the problem:

Lista, uno a uno, todos los cambios que realizaste en el texto anterior.(List, one by one, all the changes you made in the previous text).

On the left, article with errors added; on the right, ChatGPT-4 corrections and suggestions.

What caught my eye

Useless styling corrections: Our En vivo section uses Colombian political jargon. Since ChatGPT-4 was not trained with the writing style of La Silla Vacía, in each test we obtained suggestions that, although grammatically correct, were not useful.

Non-corrections: in some cases, the system presented as corrections some sentences to which no modifications had been made, i.e. the version before and after the “correction” were exactly the same.

Handling of direct quotes: the system was making non-essential changes to direct quotes that were in quotation marks. I understand that I would need additional prompts so that the system could treat direct quotes differently. I did not do additional testing for this issue.

No hallucinations or additional information: in none of the 20 tests (those of introduced errors nor the real time ones) did the system add or omit information different from that provided.

English grammar in the corrections: Spanish grammar indicates that the period, comma and semicolon are always written after the closing quotation marks, in English grammar they are written before. Although this whole exercise was done in Spanish (texts and prompts), when listing the corrections from the system, in some cases, it uses English grammar rules. 🤔

Unexpected version changes: this experiment and Nick’s feedback made me realize the importance of being aware of version changes in ChatGPT. As users, we don’t have any control over the underlying system and this could change versions without us even realizing which could impact performance of prompts.

Conclusions

In summary, the tests allowed us to correct errors in the use of commas, errors in the use of capital letters, conjugation errors, typographical errors, eliminate repeated words and clarify some sentences. We also received style suggestions that were not relevant to us, and the model presented non-corrections to us. These results are based on tests with GPT-4 in chat mode in the OpenAI Playground; the parameters used were Temperature 0, Top 1, Frequency penalty 0 and Presence penalty 0. The test was run between April and mid-May 2023.

Finally, and although I think the tool is useful, I wonder if Playground is the best interface to use on a day-to-day basis in newsrooms for grammar and spell checking. Perhaps this would be better done by another system, something with a more user-friendly interface and the desired parameters preset. At the same time, I wonder how many different instructions I can give the system in the same prompt without affecting the quality of the result. I will continue experimenting and sharing my findings in my blog nochesdemedia.com

***Acknowledgements: Thanks to Nick for his patient support, and to María José Restrepo, journalist at La Silla Vacía, for her help in testing the prompts.