
Is localization easier with LLMs or do legacy systems still have the edge? Experts Weigh In

Written by Martina Bretous | Sep 10, 2024 5:22:21 AM

Welcome to ‘The AI Edge,’ a series exploring how professionals across industries are using AI at work.

Today, we’ve got a Q&A style interview with HubSpot’s Dierk Runne, a senior manager who leads the product localization team and oversees their AI efforts.

I wanted to find out if LLMs are making localization easier or if legacy systems (like neural machine translation) are still more reliable.

On y va! (Let's go!)

What do you do at HubSpot?

The localization team at HubSpot is basically an internal service provider within the company for any team, stakeholder, or department with a translation/localization need.

Not sure if we want to get into the difference between the two here?

Tell us, what’s the difference?

My go-to example is this: Translating a British car to an American road would mean simply moving it. Localizing it would mean putting the steering wheel on the other side because you drive on the other side of the road in the US.

Localization means more adaptation to the local market and more editorial freedom.

Got it! Now back to your role and team.

We work with marketing a lot as we have a fairly content-heavy marketing playbook at HubSpot. Another major use case is our knowledge base that’s available in multiple languages.

We have roughly 30 people on the team, including trained linguists and translators, though only for our five "core" languages (other than English). Together, they help us operate in all the languages in which we have a website.

Beyond that, we work with translation agencies and freelancers for our outsourcing needs, and our team takes care of the coordination, the budget management, and the communications management for these external resources.

I’m the manager for product localization and look after our software localization. I’m also the tech guy on the localization team, meaning when it comes to questions around how we use AI, you’ve come to the right person!

That’s always good to hear. Where does AI fit into the work you do, if at all?

In the current conversation, when people refer to AI, they mostly mean generative AI. With translation, the use of AI goes back a lot further.

In fact, some of the technology that current LLMs are built on, like the transformer architecture, was initially developed by Google for machine translation purposes. Google Translate, for example, has been running on neural, and later transformer-based, models since roughly 2016.

It's the same underlying technology; what changed with generative AI is mostly the scale.

Our team has been using machine translation for a long time. This is typically referred to as neural machine translation (NMT), meaning it's based on a neural network architecture, typically a transformer architecture.

We've been using that at scale since 2019 and the biggest use case that we have is HubSpot's knowledge base (KB). The knowledge base has over a thousand articles that get updated extremely frequently because, obviously, our product updates all the time and the articles have to be kept aligned with the changes in the software.

One of the most important criteria for the translated KB is that the time to translation is as short as possible. The idea is that if an English article gets updated, it shouldn't take days or even weeks until the translated versions are updated; the information needs to be there ASAP.

That’s sort of a very standard use case for machine translation. Could it be done with human effort? Sure, but we would need a lot more people for that to take care of the sheer volume and frequency of updates that we’re seeing there.

But for anything that is high visibility/impact, we will typically want to have a set of human eyes on the output from the machine translation. With the KB, the use case is somewhere in between – we tend to review roughly the top 20% of articles in terms of view count.

Then there's stuff like user-generated content, like comments and forums or reviews on marketplaces. Providing translations here is done for the convenience of the user, and typically what should go along with that is a disclaimer that says, “Hey, this was generated via machine translation, and it might not have been proofed by a human.”

That transparency piece is very important: letting people know about it upfront and not trying to hide it in any way.

We are also starting to use more AI translation in marketing content, though that content will always have a layer of human review.

We can generate a blog post using ChatGPT or any other LLM, but would you feel OK with just publishing it, sight unseen? Probably not. The same is true for machine translation – there are use cases where we are OK with publishing the raw output, and others where we're not.

How much are you relying on machine translation versus LLMs?

For the moment, the vast majority of it is still handled via traditional machine translation systems. Machine translation engines are specialized tools, whereas LLMs are generalist tools.

You can ask an LLM anything and you will get an answer. Whether or not it’s a good-quality answer is beside the point, but you can’t ask a machine translation engine to draw you a picture, right? That doesn't work.

Machine translation is a single-purpose tool, and because of that, these engines tend to still have the edge over the large language models when it comes to translation quality, and especially when it comes to translation speed.

That being said, LLMs are getting there! There have been some surprising advancements, and it is something that we are keeping a close eye on.

We are experimenting with LLMs for error detection in translation, sort of as a second layer, because LLMs have a better semantic "understanding".

Basically, you get an output from a machine translation engine and then you let an LLM determine if it’s a good translation. Is this accurate, is this fluent, is this using the right terminology?
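To make that second-layer idea concrete, here is a minimal sketch in Python of what such a check could look like. It assumes the OpenAI Python SDK; the prompt wording, the model choice, and the check_translation helper are illustrative assumptions, not a description of HubSpot's actual pipeline.

```python
# A minimal sketch of a second-layer LLM check on machine translation output.
# The prompt, model choice, and helper name are illustrative assumptions,
# not HubSpot's actual implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def check_translation(source: str, translation: str, target_lang: str) -> str:
    """Ask an LLM whether an MT output is accurate, fluent, and on-terminology."""
    prompt = (
        f"You are reviewing a machine translation into {target_lang}.\n"
        f"Source (English): {source}\n"
        f"Translation: {translation}\n"
        "Is the translation accurate, fluent, and using the right terminology? "
        "Answer YES or NO, then give a one-sentence reason."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# Example: flag a translation where the product name did not survive.
print(check_translation(
    "Create a ticket in Service Hub.",
    "Créez un ticket dans le centre de services.",
    "French",
))
```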

Tell us more about the experimentation that you’re running with LLM-based error detection. Is there a specific model that you’re using that you can share?

We have been working with an external service provider recently on running an experiment where we tested a whole bunch of different systems, including models from OpenAI and Anthropic. Though that was mostly a proof of concept.

Right now, we’re looking for ways to introduce these systems into our existing workflows because so far, everything has been kind of a dedicated project where we have to extract a bunch of content to put in front of the LLM. So it’s outside of the typical process.

The next big thing is going to be to figure out how we can insert these extra steps or extra checks into the process in a way that doesn't disrupt it, but that provides added value within the process.

Have you seen any early results in terms of which AI model was better?

In our testing, we got the best results with GPT-4 at the time – 4o wasn't released at that point yet. We tried two approaches. The first was a very simple “Is there an error in this translation: YES or NO?”

We got it to a bit above 80% accuracy across all of the language pairs that we have, which might not sound like much, but it is actually pretty good.

The second approach we tried was to get the LLM to classify the error. There, we saw the accuracy go way down in some languages.

In the multilingual space, these models often show – and the same is true for translation engines – vast differences in terms of output quality and capabilities between languages. This is because in some languages, there is just not a whole lot of training data available.
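For a sense of how accuracy numbers like these are typically computed, here is a small, purely illustrative sketch: the LLM's YES/NO verdicts are compared against human reviewers' labels and broken out per language pair. The sample data and field names are invented.

```python
# Illustrative sketch: scoring an LLM error-detector against human labels,
# per language pair. The sample data and field names are invented.
from collections import defaultdict

# Each item: language pair, the LLM's verdict, and the human ground truth.
judgments = [
    {"lang": "en-de", "llm_has_error": True,  "human_has_error": True},
    {"lang": "en-de", "llm_has_error": False, "human_has_error": True},
    {"lang": "en-fr", "llm_has_error": False, "human_has_error": False},
    {"lang": "en-ja", "llm_has_error": True,  "human_has_error": False},
]

correct = defaultdict(int)
total = defaultdict(int)
for item in judgments:
    total[item["lang"]] += 1
    if item["llm_has_error"] == item["human_has_error"]:
        correct[item["lang"]] += 1

for lang in sorted(total):
    print(f"{lang}: {correct[lang] / total[lang]:.0%} accuracy")
```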

I know your team is still thinking through how you want to integrate LLMs into your workflow, but I'm curious, what does that process look like in an ideal world?

My ideal scenario would be defining certain content elements and assets in the HubSpot ecosystem that we would watch via an LLM. The LLM would point us toward it like, “Hey, someone should really be looking at this.”

I think that’s a pretty cool use case. In our case specifically, we can also use this to rework our existing translation database. Everything that we’ve ever translated is stored in a database so that if we ever have to translate this content again, we can reuse our existing translations.

This is faster and cheaper, and it provides a lot more consistency than re-translating everything from scratch every single time. And this database, too, is a good candidate for using LLMs to look for possible improvements and errors.
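For readers unfamiliar with how such a translation database (often called a translation memory) is used, here is a simplified sketch: every source/target pair is stored, exact matches are reused outright, and close "fuzzy" matches are surfaced for review. The data, the threshold, and the lookup function are assumptions for illustration, not HubSpot's system.

```python
# Simplified translation-memory lookup: reuse exact matches, and surface
# close "fuzzy" matches for review. Data and threshold are illustrative.
from difflib import SequenceMatcher

translation_memory = {
    "Create a ticket in Service Hub.": "Créez un ticket dans Service Hub.",
    "Connect your inbox to HubSpot.": "Connectez votre boîte de réception à HubSpot.",
}

def lookup(source: str, fuzzy_threshold: float = 0.85):
    """Return an exact match, a fuzzy match above the threshold, or None."""
    if source in translation_memory:
        return ("exact", translation_memory[source])
    best_ratio, best_target = 0.0, None
    for stored_source, stored_target in translation_memory.items():
        ratio = SequenceMatcher(None, source, stored_source).ratio()
        if ratio > best_ratio:
            best_ratio, best_target = ratio, stored_target
    if best_target and best_ratio >= fuzzy_threshold:
        return ("fuzzy", best_target)  # reuse only after human or LLM review
    return None

print(lookup("Create a ticket in Service Hub."))      # exact reuse
print(lookup("Create a new ticket in Service Hub."))  # fuzzy candidate
```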

That’s very interesting. Any other ideal use cases?

So far, if we translate content, the French page will look pretty much exactly like the English page, except everything is obviously in French. But one thing that we haven't really done a lot of is repurposing and restructuring content.

LLMs are very good at summarization and paraphrasing – this is something we might want to experiment with going forward.

That could be a pretty cool way of taking more advantage of our English content, as we have a pretty massive content creation engine on the English speaking side, but not all of that is translated. And this is not just marketing, but also other areas like sales enablement.

So we have lots of content-creating teams on the English side, but in the other markets, there are typically a lot fewer people. There could be a nice middle ground here where we can more easily digest the amount of content on the English side and get that into other languages, but in a different format.
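As a rough sketch of that "repurpose, then localize" idea, the snippet below summarizes a long English asset with an LLM and then translates only the summary. The prompts, the model, and the function names are assumptions, not an existing HubSpot workflow.

```python
# Hedged sketch of "repurpose, then localize": summarize a long English
# asset, then translate only the summary. Prompts and model are assumptions.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

def repurpose_for_market(english_article: str, target_lang: str) -> str:
    summary = ask(
        "Summarize the following article as five short bullet points "
        f"for a sales enablement one-pager:\n\n{english_article}"
    )
    return ask(
        f"Translate these bullet points into {target_lang}, keeping "
        f"HubSpot product names in English:\n\n{summary}"
    )
```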

What would you say is the biggest roadblock?

From a technical implementation perspective, I don't think that there are any big roadblocks. We can basically flip a switch and use GPT-4o for translation purposes.

However, there would be some issues there. One major one is terminology, which is a big concern in any automated translation process, and typically an even bigger issue with generative AI.

If you take something like HubSpot’s "Service Hub", for example, you'll want that product name to stay exactly that in your translation. But if you just put Service Hub into Google Translate, you will probably get different translations in different contexts.

That is obviously a huge detriment, and it is not by any means a solved issue. However, there are a lot of machine translation providers who have glossary systems, and there are also very promising approaches when using LLMs.
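As a small illustration of the terminology problem, here is a sketch of a "do not translate" check that flags output where a protected product name did not survive the translation. The glossary contents and the function are invented for illustration; the glossary systems offered by MT providers are far more sophisticated.

```python
# Illustrative "do not translate" glossary check: flag translations where a
# protected term (e.g. a product name) did not survive intact.
DO_NOT_TRANSLATE = ["Service Hub", "Marketing Hub", "HubSpot"]

def missing_terms(source: str, translation: str) -> list[str]:
    """Return protected terms present in the source but absent from the translation."""
    return [
        term for term in DO_NOT_TRANSLATE
        if term in source and term not in translation
    ]

issues = missing_terms(
    "Open the Service Hub inbox.",
    "Ouvrez la boîte de réception du centre de services.",  # product name was translated
)
if issues:
    print(f"Terminology check failed, protected terms missing: {issues}")
```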

How would you recommend that small to medium-sized businesses leverage LLMs? Or would you recommend actually that they lean more on machine translation still?

For a small-to-medium business, the assumption would be that there's probably not a lot of shepherding of the system that can be done on the business's side.

In this case, I would probably actually recommend going with a traditional machine translation provider if the purpose is generating translated content.

It is a more focused system and it is easier to set up, whereas LLMs will likely need a lot more fine-tuning and iterating to get you to where you want to be.

It kind of goes back to LLMs being generalists, right? They will perform well across a variety of tasks, but if you want the peak performance for translation tasks, I would go with a machine translation provider.

And there are fantastic out-of-the-box solutions that will get you very far, and would be quite easy to implement as well.

When it comes to LLMs, is there a scenario in which they're the best option right now, despite being generalists?

If someone is just looking for a plug-and-play solution, like you turn it on and everything is handled, then I'd say an LLM is pretty much as good an option as an out-of-the-box machine translation provider.

Where I would give the lead to machine translation providers is typically in the amount of customization that you can do. Or, I should say, in how much effort it takes to achieve that customization. LLMs will likely take more work to get to the same point.

What would you say is the most overrated claim about LLMs right now?

There are so many – when it comes to localization in particular, there’s this notion that we will lose our jobs.

The translation and localization industry has been there since 2016, when the first neural machine translation engines were released.

We’ve gone through this entire discussion already, but it was confined to our industry, so it didn’t make the broad news as much as it does now with LLMs.

Localization isn’t much different than other areas, like software engineering, for example. An LLM doesn’t mean you can get rid of your engineers. Well, you can, but at your own risk. The same is true in the localization space.

Can you replace linguists, translators, reviewers with LLMs? Yeah, you can do that if you want, but you might not like the consequences of that.

Lastly, how do you see LLMs impacting the work that localization teams do?

Localization probably provides the perfect crystal ball there, just because we have gone through a cycle of this already where, initially, the hype was gigantic.

Everyone was talking about how translators were going to be out of a job. It’s now eight years later, and it hasn’t happened so far. In fact, we’re busier than ever.

It doesn't come down to the question, “Do we have human translators, or do we have machine translators?” What happens instead is that these systems allow you to increase your coverage, be that across languages or across content types. Scale is what these systems are really, really good at, and it's what they allow you to achieve, with or without human involvement.

I don't foresee a massive change in how things have been going in localization with the introduction of LLMs, just because we were already at that point.