TransUnion
12/18/2023
Podcast
In this episode of the TransUnion Fraudcast, we go deep with fraud analytics expert Andrew Chan to explore how building effective custom models not only mitigates potential fraud but also identifies more legitimate customers and transactions — letting them through with less friction.
Jason Lord:
Welcome to the TransUnion Fraudcast, your essential go-to for the absolute linkages between the day's emerging fraud and authentication topics, trends, tropes and travails delivered with all the straight talk and none of the false positives.
I'm your host, Jason Lord, VP of Global Fraud Solutions, and if you joined us for the previous episode of the Fraudcast (and I hope you did), we talked about what makes for useful identity data and the role that data plays in fraud-prevention efforts.
Today's conversation is in some ways a practical application of that previous discussion.
Now that you have the useful data, how do you build custom models using that data in a way that will not only help you mitigate potential fraud, but also identify more of the good customers and transactions, and let them through with less friction?
Here to help me discuss this is a longtime colleague of mine, Andrew Chan, head of product for TransUnion's fraud analytical products.
Like me, Andrew is a veteran of Neustar prior to its acquisition by TransUnion, and he has spent over a dozen years working in everything from fraud-prevention startups to multinational corporations across the financial services, ecommerce and fintech sectors.
Andrew, welcome to the Fraudcast.
So let's start with this: What exactly is data modeling?
How does it work?
How is it used by organizations?
Andrew Chan:
Well, predictive modeling specifically (later on we might get into the differences between predictive modeling and other types of modeling) is using historical data, like you mentioned in previous podcasts, to predict the likelihood of something happening in the future.
You know, businesses large and small should be using models, and AI in general, for internal purposes, as well as to streamline laborious tasks and to provide customer value in their external-facing products.
When I think about how it works or how organizations might use it, you know, there are really two distinct phases in creating and operationalizing a model.
The first is offline model development.
This is where you determine what you want to predict, and this is where you gather data to predict that target; the target is the thing that you're actually trying to predict.
So if we take an example like ecommerce, you might want to predict ecommerce transaction fraud.
So that’s your target, and then the way you build that model is you gather data about that transaction.
So it might be the item price, it might be the total price of the transaction, it might be the IP and the IP geolocation, where the IP is, and then also the shipping address…and then you might actually calculate a number that's the distance between the shipping address and the IP geolocation.
So all of that is data that you then use to predict that target.
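To make the feature-gathering step concrete, here is a minimal Python sketch of the kind of derived feature Andrew describes, the distance between the IP geolocation and the shipping address; the field names and the haversine helper are illustrative assumptions, not an actual TransUnion feature set.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical transaction record: prices, IP geolocation, geocoded shipping address.
txn = {
    "item_price": 129.99,
    "order_total": 164.50,
    "ip_lat": 41.88, "ip_lon": -87.63,       # IP geolocation
    "ship_lat": 34.05, "ship_lon": -118.24,  # geocoded shipping address
}

# Assembled feature set, including the derived IP-to-shipping-address distance.
features = {
    "item_price": txn["item_price"],
    "order_total": txn["order_total"],
    "ip_to_ship_km": haversine_km(txn["ip_lat"], txn["ip_lon"],
                                  txn["ship_lat"], txn["ship_lon"]),
}
print(features)
```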
Jason Lord:
And that's offline data modeling. Is that right?
Andrew Chan:
That's right.
Jason Lord:
And can you clarify what makes it offline in this case?
Andrew Chan:
It's basically done by people manually going to various data warehouses and data sources, getting this data and gathering it into one data set, versus online, or in a productionalized setting, where all that data is not manually gathered; it's automated by some system that brings all that data together to then make a prediction.
Jason Lord:
I see. So online is taking what would normally be offline and automating it, and that's what makes it online.
Is that right?
Andrew Chan:
Yeah, you could think of online as another word for productionalizing the model, you know, using the model in the wild to actually impact your business.
So there are a lot of steps in the offline development: there's data science work, taking that data, cleansing it, splitting the data into a training set and a test set.
But for the sake of this conversation, we don't have to go into that. We just assume that the offline model development process is done; you've got a model that with some certainty predicts the future, and then the second step is actually productionalizing that model.
So, Jason, you just talked about the online aspect of that model. That's the hard part.
I can dig into that a little bit more if you want.
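For readers who want to see the offline steps Andrew summarizes (gather labeled data, split it into training and test sets, fit a model, check how well it predicts the target), here is a minimal sketch using scikit-learn on synthetic data; the features, the label, and the choice of logistic regression are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the gathered feature set: price, order total, IP-to-shipping distance.
X = rng.normal(size=(5000, 3))
# Synthetic fraud target, loosely tied to the distance feature so there is signal to learn.
y = (X[:, 2] + rng.normal(scale=1.5, size=5000) > 2).astype(int)

# Split into a training set and a held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit the model offline, then measure how well it predicts the target on unseen data.
model = LogisticRegression().fit(X_train, y_train)
print("test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```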
Jason Lord:
Well, before you dig into it, you mentioned an ecommerce use case, and so that's really helpful, but talk about some other fraud models. Some other use cases.
What types of insights, either online or offline, do these models normally…how are they used by organizations?
Andrew Chan:
Well, if you think about it, I think in your previous podcast you talked about SIM swap, which is a very deterministic signal that basically says, if you're going to send a phone an OTP, whether that OTP is at risk of being intercepted by a fraudster.
So sort of delineating between high-risk phones and low-risk phones.
But if you think about that, that's a very raw signal. Even though it's deterministic, there are a lot of people who do SIM swaps, and they're perfectly fine people.
Jason Lord:
Now forgive me, because when I think raw signal, I'm thinking Iggy Pop, and that's probably not what you mean…so when you say raw signal, what does that mean in a data sense?
Andrew Chan:
Oh man, Iggy Pop. Who’s this podcast targeted for?
Jason Lord:
Anybody who listens, that’s who this is targeted for!
Andrew Chan:
Raw signal in data terms…I would say, in modeling terms, those would be inputs into a model.
Jason Lord:
And you said when data is very raw, what does that mean?
Andrew Chan:
Any data that's not converted into a modeling form... This is kind of a complex conversation, but in order for machines to ingest data, it has to be converted into a format that they understand.
Think of something like a time stamp or date-time stamp: you've got 60 minutes in an hour, 60 seconds in a minute, 24 hours in a day, up to 31 days in a month. That's really varied for a computer.
So you need to smooth out that data so that it's readable by a machine. Think almost like zero and one, and all the values between zero and one.
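One common way to "smooth out" a raw timestamp into something a model can read is to encode its cyclical parts with sine and cosine and scale the rest toward the zero-to-one range Andrew mentions; the sketch below is a generic illustration, not a specific TransUnion transformation.

```python
import math
from datetime import datetime

def encode_timestamp(ts: datetime) -> dict:
    """Turn a raw timestamp into model-friendly numeric features."""
    hour_angle = 2 * math.pi * (ts.hour * 3600 + ts.minute * 60 + ts.second) / 86400
    dow_angle = 2 * math.pi * ts.weekday() / 7
    return {
        # Cyclical encoding: 23:59 and 00:01 end up close together instead of far apart.
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        "dow_sin": math.sin(dow_angle),
        "dow_cos": math.cos(dow_angle),
        # Day of month scaled roughly into the zero-to-one range.
        "day_of_month_scaled": (ts.day - 1) / 30,
    }

print(encode_timestamp(datetime(2023, 12, 18, 23, 59, 0)))
```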
Jason Lord:
So when you're saying these data signals are raw, what you're saying is they might not be able to be ingested into a model the way that they currently are. Is that right?
Andrew Chan:
Correct, exactly.
But if we go back to the SIM swap example… So there's a lot of good SIM swaps.
So a model would be there to like, say, oh what is actually a risky SIM swap?
You know, what are some other data points that we can look at besides just a SIM swap?
Well, maybe you look at whether you're SIM swapping from a low-risk carrier to a high-risk carrier, maybe you look at the tenure of the phone, maybe you look at some previous behavior of that phone: does it get a lot of text messages? Does it get a lot of calls? Does it have a lot of action on it?
If it has a lot of action on it, then even if it was taken over, it would be caught pretty quickly.
So that's all those other data points that I mentioned would be part of a model to then disambiguate the high-risk SIM swaps versus the low-risk SIM swaps.
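As a toy illustration of how those extra data points might combine to separate high-risk from low-risk SIM swaps, here is a hand-weighted scoring sketch; the field names, weights, and thresholds are hypothetical, and a real model would learn them from labeled outcomes rather than hard-coding them.

```python
# Hypothetical phone record around the time of a SIM swap.
phone = {
    "carrier_from_risk": 0.2,  # assumed risk score of the old carrier
    "carrier_to_risk": 0.8,    # assumed risk score of the new carrier
    "tenure_days": 45,         # how long this number has been with its owner
    "texts_last_30d": 4,       # recent inbound activity on the line
    "calls_last_30d": 1,
}

def sim_swap_risk(p: dict) -> float:
    """Toy score: higher when moving to a riskier carrier with short tenure and little activity."""
    carrier_shift = max(0.0, p["carrier_to_risk"] - p["carrier_from_risk"])
    low_tenure = 1.0 if p["tenure_days"] < 90 else 0.0
    low_activity = 1.0 if (p["texts_last_30d"] + p["calls_last_30d"]) < 10 else 0.0
    # Illustrative weights only; a trained model would estimate these from data.
    return 0.5 * carrier_shift + 0.3 * low_tenure + 0.2 * low_activity

print(round(sim_swap_risk(phone), 2))
```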
Jason Lord:
OK, so we've been using the term “custom model” quite a bit.
Is there a difference between a custom model and a generic model, and if so, what is that difference and when would you use each version of that?
Andrew Chan:
Uh, yeah, it's really, you know, the scope of that question really depends on what the organization is doing with the model.
So if the organization is selling a model versus an organization that is simply building a model for their own use and maybe exposing it to their customers to predict something in their customer interactions, that's very different.
So if I was to elaborate on that: Organizations that are building models for their own use, those are automatically going to be custom models.
It's built upon their data, their target is their own behaviors, so that's automatically going to be a custom model.
In that case, the only question regarding a generic model is whether they want to use a vendor’s generic model as an input into their own custom model.
So going back to that data gathering: some of that raw data that goes into a model can be another model. It can be a model feature, and so that's the question you have.
The question one has to ask, if you're building a custom model, is: do I want to take that, let's just say, third-party vendor's model as an input into my model? Because the danger with that is lack of explainability. You don't know how that vendor's model was built.
You don't know what target it was built for. So the only validation you're doing is some data tests in the background to say whether that vendor's model is predictive or not, and then it's your choice whether you want to include it in your own custom model.
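A minimal sketch of the background test Andrew describes: train your model with and without the vendor's score as an input and compare predictive lift on held-out data. The synthetic data and vendor score below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic in-house features, a synthetic fraud label, and a synthetic vendor score
# that carries some independent signal about the same label.
X_own = rng.normal(size=(5000, 4))
y = (X_own[:, 0] + rng.normal(scale=1.0, size=5000) > 1).astype(int)
vendor_score = 0.6 * y + rng.normal(scale=0.5, size=5000)

X_with_vendor = np.column_stack([X_own, vendor_score])

def test_auc(X, y):
    """Fit on a training split, score on a held-out test split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
    m = LogisticRegression().fit(X_tr, y_tr)
    return roc_auc_score(y_te, m.predict_proba(X_te)[:, 1])

print("without vendor score:", round(test_auc(X_own, y), 3))
print("with vendor score:   ", round(test_auc(X_with_vendor, y), 3))
```

If the second number is meaningfully higher, the vendor's score adds lift; whether to accept the explainability trade-off is still the business's call.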
Jason Lord:
So that's really interesting…as an organization, I'm always building custom models for my own needs, and then I'm making a decision whether I want to incorporate a generic model that a third party has created, which may or may not have lift; I can test it to find out.
But that explainability part of it is really interesting.
I think about financial services organizations and the need to be able to explain why decisions are being made so that, for instance, if there's bias in the system, I know where it originates.
Is that always a danger when you're bringing in a third-party generic model?
Andrew Chan:
Well, you know, I would say bias is one thing, but explainability is really the core of it.
Because these, you know in my field, in fraud prevention, these actually have real-world impacts that people care about.
So take an account origination model that declined someone. Well, if I really care about creating an account at that financial institution, then I'm going to call in and say, well, I tried to originate this account and it declined me. Why did it decline me?
And if I'm a good customer, how can I avoid that so that I can actually get the account origination through and approved, right?
And so the financial institution has to answer that question.
And so that's where explainability comes in, and why the danger of having something that's not explainable, and therefore can't be explained to or resolved for a good customer, comes into play.
Jason Lord:
Speaking of not being able to explain, I imagine AI and ML — machine learning — have had a really dramatic effect on modeling.
I'd love to hear from your perspective what impact AI and ML have had and how you contend with that, especially in terms of explainability.
Andrew Chan:
Yeah, this goes back to when we first started: I delineated predictive modeling, and there is a delineation within AI of predictive AI versus generative AI.
And so when we have this conversation about AI, professionals like me who deal in data science, who talk to the data scientists who build models, have to make this delineation, because the latest phase of AI is basically mostly generative AI. Generative AI creates new content, like ChatGPT does. It takes a lot of information from articles, from digital libraries, and then when you ask it a question, it'll search all of that information and give you some new content based upon all the content it had consumed before.
So the difference between generative AI and predictive AI is predictive AI is like use-case specific.
You're using AI methods to still do a specific thing, in areas like preventing fraud or detecting fraud, or in the case of Apple and Amazon with Siri and…what's the Amazon equivalent?
Jason Lord:
I think it's Alexa, if I remember correctly.
Andrew Chan:
Yes, yes. I don't have either. Well, I have Siri by default, but I don't have Alexa. I took that out –– just a little too freaky.
Jason Lord:
And you're a fraud person saying this, so we should probably take that seriously!
Andrew Chan:
Yeah, exactly.
So that's, you know, two examples: generative AI, like ChatGPT, versus predictive AI, which is trying to solve a specific problem and predict the future.
Jason Lord:
What kind of mistakes do organizations make when they're considering these models, whether they be custom, whether they be generative…what mistakes have you seen organizations make?
Andrew Chan:
Oh, it really depends on a couple of factors.
So, the size of the organization, the vertical that the organization is in and where they are in the data-science journey relative to their industry.
So, for example, you never want to be in a position where you're a big player in an industry but you've had very little investment in data science compared to your peers.
Because what that means is that you're going to be the target, because your peers who invest in data science are going to be able to detect and avoid the fraud that's coming at them.
And then what do fraudsters do?
They go to the lowest, the easiest target, which is someone who hasn't invested in that.
That's one example. I can give you another example: maybe an organization is growing, they foresee the need for in-house data-science expertise, but they don't have it yet.
And so they need to fill a gap temporarily.
And so in this case, you know, I think the strategy would be to look for third-party solutions that can build custom models and deploy them on a decision and model-execution engine they have built.
And then you license those capabilities; at least in this case you get your feet wet, and it gives you guidance for how you want to, um, build your own in-house data science team and stack.
Jason Lord:
That kind of ties into my next question, because business leaders and business owners have a different set of priorities than data analysts, necessarily.
And I have to imagine that there might be, if not tension, at least sometimes a mismatch, in what those priorities look like within an organization.
So how would you recommend somebody from the data analytics side and somebody from the business side best work together?
Andrew Chan:
Yeah, I see this often where you have businesspeople who hear about data science, who hear about the coolness and the sexiness of AI and predicting the future.
But data science is not a panacea, and data scientists are not miracle workers, right?
There is a large part for the business to play in these relationships.
So if I were to name a couple of things off the top of my head: the skill sets of data scientists are really growing to the point where there are a lot of specialties within data science.
I just now pointed out a couple, predictive AI versus generative AI. So some data scientists might be really good at generative AI and have no idea what to do with predictive AI, or have never encountered it.
So in that case, you really want to think about the skill set. What you want to do, what you want to predict, what’s your target…and then the skill set necessary to actually fulfill that.
So that's like team building. But then the next one is, don't expect data scientists to have a solution like immediately. There's a lot of R&D that needs to be done.
There are a lot of different ways to build models these days, because there are so many different methods of dealing with data and different ways of modeling it, so there's a lot of trial and error. And you actually have to go through that trial and error to really see what's predictive and what's not, or what's better versus what's suboptimal.
So those are a couple of examples. Happy to give a couple more if you like.
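As a small illustration of the trial-and-error Andrew describes, one common pattern is to cross-validate a few candidate model types on the same data and keep whichever turns out more predictive; the data and candidates below are purely illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 5))
# Nonlinear synthetic target, so the candidates will genuinely differ in performance.
y = ((X[:, 0] * X[:, 1] > 0) & (X[:, 2] > 0)).astype(int)

candidates = {
    "logistic_regression": LogisticRegression(),
    "gradient_boosting": GradientBoostingClassifier(),
}

# Cross-validated AUC for each candidate; the "winner" is only meaningful for this toy data.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```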
Jason Lord:
So I heard: know who your data scientists are and what they're actually good at, as opposed to just getting data scientists, because that's like getting a teacher without finding out what subject they specialize in.
And then the second is be patient, because it takes a while to get the types of results that you want and there's going to be a lot of trial and error on the way.
Andrew Chan:
That's right.
And one last point to add, and this is where the business really plays a key role (I mentioned it a little bit, but I didn't really focus on it): thinking about what you want to predict and how to gather large amounts of that data.
So what I mean by that is: sure, you want to predict something. Say I want to predict all the horse races and who's going to win those races.
Well, I have to work with the data scientists to get them the data of not only who is the winner of the horse races, but maybe some ancillary data that I think would be predictive, right, maybe like horse injuries…maybe the past wins and losses of the horses, maybe who's the breeder, right?
So for all of these additional data points, you have the business context, or the business has the context, and knows about that data…the data scientists might not have that data. The data scientists might be completely new to horse racing and gambling and might not have any context.
And actually, I would say many of them don't have that context.
Jason Lord:
Well, and that's interesting too because, and I don't want to say you should overwhelm the data scientists, but should you make the assumption that giving them more data is better than giving them less data, because you don't know what data is going to be valuable?
Is that what you're saying?
Andrew Chan:
Definitely…go into this relationship with an expectation to brainstorm a lot.
Jason Lord:
And it's going to require data that might not seem intuitive and might not seem like the straight-line answer to what you're asking, but it might give context and correlation that provides some additional input or insight that you would not get if you did not provide that data.
Andrew Chan:
I would say that the business leader or the businessperson who's trying to solve this problem should have that straight-line thought of like how this data relates to what they want to predict.
The data scientist might not because the data scientist spent, you know, years getting their PhD in physics while you spent your time thinking about horse racing and building products around that and building businesses around that. So you would have that context, the data scientist wouldn’t.
Jason Lord:
That makes sense.
Well, you've already thrown out your Alexa…is there anything else that scares you when you think about data modeling, that keeps you up at night?
Andrew Chan:
So we're using these techniques, right, these new AI techniques. I wouldn't even say new AI techniques; we're using many of these techniques, combining generative AI and predictive AI techniques, and we've got lots of funding to do so from businesses that have funded these initiatives.
What we've seen is that fraudsters are organizing in much the same way, right? There are now fraud rings: people who work together to perpetrate fraud.
There's funding coming down from the top to a lot of the people who actually perpetrate the fraud, and so these organizations, these fraud organizations, have a decent amount of funding.
And so I expect that they are actually experimenting with these technologies also.
And there is a concept that ChatGPT, or OpenAI, or any of these companies are experimenting with called multimodeling…in other words, using different models together to do something. Think of almost like a combination of ChatGPT plus someone actually on video delivering that information.
So at that point you have two models. You have a deepfake model that looks like someone speaking, and then you have another model that is actually providing the content of what that person is saying.
So those are two models right there, almost like a ChatGPT model, and a deepfake model.
I can imagine that fraudsters are doing the same thing.
And so if you combine that with synthetic identity, they've created these identities that actually don't exist.
Now you start putting a face to them, you start putting a voice to them…it really gets scary in that respect, because, you know, we rely on a lot of different ways of validating an interaction.
Biometrics, for example: the voice, the way someone sounds, and data about a person. Well, all of a sudden, fraudsters can start filling in that information…those biometrics, the information about a person, using AI.
Jason Lord:
And what you're saying, Andrew, is not just a hypothetical; we're already seeing it. We're already seeing fraudsters in the call center using AI technologies to fake voices.
We're already seeing AI being used to fake documents.
And so that arms race you're describing, first of all, will keep you in business for a while, which is a great thing, but it also means, and this is the motif we keep hearing over and over again, that you can't be single-threaded. If you're relying primarily on biometrics, to your point, a fraudster will find a way around it.
Andrew Chan:
Well exactly, and that actually goes back to the question about what do organizations need to take into account when they get into this field of developing models and custom models?
One is, don't rest on your laurels. If you've got a solution that works for you, great, it works for you on day one, but will it work for you on day 30? Will it work for you on day 60?
You know, probably eventually it will not work for you, so keep improving what you're doing because just because it's good doesn't mean it's great –– and it can get better.
Jason Lord:
I think that's a great way to end it.
Thank you so much, Andrew. We appreciate you joining us today.
Thank you to the listeners for tuning in. We hope you join us for an upcoming Fraudcast episode.
In the meantime, stay smart and stay safe.
Your essential go-to for all the absolute linkages between the day’s emerging fraud and identity trends, tropes and travails — delivered with straight talk and none of the false positives. Hosted by Jason Lord, VP of Global Fraud Solutions.
For questions or to suggest an episode topic, please email TruValidate@transunion.com.
The information discussed in this podcast constitutes the opinion of TransUnion, and TransUnion shall have no liability for any actions taken based upon the content of this podcast.