WEBVTT 00:00.000 --> 00:18.000 Okay, so we are on time, so we can welcome Philippe Ombredanne, and please, if you still see some space next to each other, try to move to the middle so more people can come in. 00:18.000 --> 00:26.000 So, welcome. I'm going to talk today about a favorite topic, which is AI and LLMs and GenAI. 00:26.000 --> 00:32.000 And excuse me, speak up. 00:32.000 --> 00:34.000 Okay. 00:34.000 --> 00:36.000 Okay, now it's better. 00:36.000 --> 00:38.000 You hear me all right? 00:38.000 --> 00:41.000 In the back? It doesn't saturate, it's better like that. 00:41.000 --> 00:49.000 So, I'm going to talk about our favorite topic, which is AI and GenAI, that we all love to hate. 00:49.000 --> 00:57.000 Me in particular, actually. I'm not so much of a Luddite, but I have a "No AI" badge on the back of my laptop. 00:57.000 --> 01:01.000 And none of this was generated with AI, by the way. 01:01.000 --> 01:05.000 So, I'm Philippe, and I'm the lead of an open source project. 01:05.000 --> 01:08.000 It's actually now a small foundation called AboutCode. 01:08.000 --> 01:13.000 And everything we do is free and open source: code, data and standards. 01:13.000 --> 01:16.000 To help people use more open source. 01:16.000 --> 01:20.000 To figure out where the code comes from, what the licenses or the security issues are. 01:20.000 --> 01:23.000 And I'm sure we can all benefit from that. 01:23.000 --> 01:26.000 I dabble a bit in standards. 01:26.000 --> 01:30.000 I'm a co-founder of something called SPDX, for SBOMs. 01:30.000 --> 01:38.000 And I'm also a co-contributor to CycloneDX, because standards, you need more of them so you can pick which one to support. 01:38.000 --> 01:48.000 And I'm also behind a small standard called Package URL, which happens to be, since mid-December, an Ecma standard. 01:48.000 --> 01:55.000 And on the way to ISO. It's a very simple string to identify packages in SBOMs and SCA tools and vulnerability databases.
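A Package URL is, at heart, just a structured string of the form `pkg:type/namespace/name@version?qualifiers#subpath`. As a rough illustration of how little machinery it needs, here is a toy decomposer; it is a sketch for this transcript, not the official packageurl-python library, and it skips the spec's percent-encoding rules.

```python
# Toy Package URL (purl) splitter, for illustration only.
# General shape: pkg:type/namespace/name@version?qualifiers#subpath

def parse_purl(purl: str) -> dict:
    """Split a purl into its main components (no percent-decoding)."""
    assert purl.startswith("pkg:"), "a purl always starts with 'pkg:'"
    rest = purl[len("pkg:"):]

    # Peel the optional parts off the end, in spec order.
    rest, _, subpath = rest.partition("#")
    rest, _, qualifiers = rest.partition("?")
    rest, _, version = rest.partition("@")

    segments = rest.split("/")
    return {
        "type": segments[0],
        "namespace": "/".join(segments[1:-1]) or None,
        "name": segments[-1],
        "version": version or None,
        "qualifiers": qualifiers or None,
        "subpath": subpath or None,
    }

print(parse_purl("pkg:maven/org.apache.commons/commons-lang3@3.12.0"))
```

A purl like `pkg:npm/lodash@4.17.21` has no namespace, which is why the namespace falls back to `None`.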
01:55.000 --> 02:01.000 You will hear about it if you haven't yet. Who has ever heard about purl, or Package URL, in the room? 02:01.000 --> 02:04.000 Good, so it's pretty cool. 02:04.000 --> 02:12.000 You will have to deal with it whether you like it or not if you're doing application security at some level. 02:12.000 --> 02:19.000 So, we're building tools, data, standards; everything is free and open source. 02:19.000 --> 02:21.000 That's important. 02:21.000 --> 02:24.000 The other thing we have, of course: AI is eating the world. 02:24.000 --> 02:31.000 That's a serious issue, and how can we make sure we use AI responsibly? 02:31.000 --> 02:34.000 And there are two core issues that I will bring up. 02:34.000 --> 02:38.000 One is an issue of licenses. 02:38.000 --> 02:44.000 Because if you think about open source, open source is defined by licenses. 02:45.000 --> 02:48.000 Code that's generated by a bot, 02:48.000 --> 02:53.000 the jury is still out, but it is likely non-copyrightable. 02:53.000 --> 02:54.000 So, it's not open source anymore. 02:54.000 --> 03:02.000 There's also the related problem, which is that LLMs are massively trained on open source code. 03:02.000 --> 03:04.000 The code we write. 03:04.000 --> 03:10.000 And they can really regurgitate, spit back this code very efficiently, because they memorize it. 03:10.000 --> 03:14.000 Which brings not only copyright issues, but also security issues. 03:14.000 --> 03:17.000 You know: bugs in, garbage in, garbage out. 03:17.000 --> 03:31.000 So, you have a wonderful way to eventually propagate security bugs from the training code to the code that's being generated. 03:31.000 --> 03:36.000 So, who has not heard about AI and LLMs? 03:37.000 --> 03:41.000 So, somebody was living in a cave for the last three years. 03:41.000 --> 03:43.000 Wonderful. 03:43.000 --> 03:44.000 That's great. 03:44.000 --> 03:46.000 That's a rare thing. 03:46.000 --> 03:48.000 We need to put you on a pedestal.
03:48.000 --> 03:51.000 Because you will save humanity. 03:51.000 --> 03:52.000 Eventually, you're the last one of us. 03:52.000 --> 03:55.000 That's not so far-fetched. 03:55.000 --> 03:57.000 Okay. 03:57.000 --> 04:02.000 We're talking about the risks of AI-generated code. 04:03.000 --> 04:08.000 And also another topic, which is eventually AI being used as a feature in code. 04:08.000 --> 04:14.000 It's not only generated code, but also using an LLM, 04:14.000 --> 04:18.000 a chat feature or whatever, especially in the world of cybersecurity. 04:18.000 --> 04:23.000 It has really, really funky outcomes. 04:23.000 --> 04:30.000 And eventually, having the ability to identify when code is AI-generated is really important. 04:30.000 --> 04:34.000 Whatever I'm going to show you today, and I'll do a short demonstration in a minute, 04:34.000 --> 04:37.000 I'm not able to detect AI-generated code as such. 04:37.000 --> 04:42.000 There's no em dash in code to spot AI-generated code. 04:42.000 --> 04:47.000 What I can do instead is detect code 04:47.000 --> 04:53.000 which is strikingly similar to the open source code that was used to feed the beast. 04:53.000 --> 04:58.000 And that's what we're focusing on, at least part of what I'm talking about. 04:59.000 --> 05:04.000 Now, you can at least do funny stuff, and even useful stuff, with AI. 05:04.000 --> 05:06.000 One of them is to generate poems. 05:06.000 --> 05:16.000 This one is actually AI-generated, and it describes how you would do AI-generated code detection. 05:16.000 --> 05:20.000 I'll make sure I put the link to the slides on my talk page. 05:20.000 --> 05:23.000 I didn't do it yet, but you'll have the full deck downloadable, 05:23.000 --> 05:26.000 a LibreOffice file and a PDF. 05:26.000 --> 05:28.000 I'll make sure it's on FOSDEM. 05:28.000 --> 05:32.000 So, there's another problem, which is there's a lot of open-washing in the space. 05:32.000 --> 05:34.000 You know, OpenAI is not open.
05:34.000 --> 05:39.000 And most of "open" AI is not open source itself. 05:39.000 --> 05:46.000 There is also a lot of open source which is essential, 05:46.000 --> 05:50.000 without which any kind of AI would not be happening: 05:50.000 --> 05:53.000 TensorFlow, PyTorch, to name some of them. 05:53.000 --> 06:02.000 The good thing, at least for now, is that the code that's AI-generated very often doesn't compile yet. 06:02.000 --> 06:05.000 It's a good thing in the sense that it's an easy way to spot it. 06:05.000 --> 06:10.000 It doesn't compile, doesn't pass the tests: there's a good chance it's AI-generated. 06:10.000 --> 06:15.000 We have, as I said, not only a problem of open-washing, 06:16.000 --> 06:21.000 but also a problem which is a lot of innovation. 06:21.000 --> 06:25.000 And I don't like this kind of innovation, around licensing. 06:25.000 --> 06:31.000 We have a lot of new, what I call, open-washing licenses, 06:31.000 --> 06:34.000 which look like open source, but are not. 06:34.000 --> 06:39.000 And that's a threat, not so much to security, but to open source at large, 06:39.000 --> 06:45.000 because it's very easy to get fooled if you're not really savvy about spotting these. 06:45.000 --> 06:50.000 And sometimes they take an MIT, a BSD, or an Apache license, and insert a few things. 06:50.000 --> 06:55.000 We're seeing at least one or two new licenses a week 06:55.000 --> 06:57.000 which are funky, non-open-source licenses. 06:57.000 --> 07:02.000 They all come from AI projects or AI-related projects. 07:02.000 --> 07:09.000 Case in point, for instance: if you're dabbling with a source-available 07:09.000 --> 07:15.000 or downloadable model from Meta, Llama 4, 07:15.000 --> 07:18.000 good luck: if you're based in Europe, you cannot touch it. 07:18.000 --> 07:23.000 The license plainly says you cannot use this model if you're based in Europe. 07:23.000 --> 07:27.000 That's wonderful. What can go wrong with that?
07:27.000 --> 07:34.000 So, I think that we have to treat AI-generated code, and accept it sometimes as a wonderful 07:34.000 --> 07:38.000 productivity booster; sometimes it even compiles and runs. 07:38.000 --> 07:40.000 It's rare, but that can help. 07:40.000 --> 07:45.000 But again, we need to ensure we understand what we're dealing with. 07:45.000 --> 07:50.000 If you imagine taking all the code from the GNU project under the GPL, 07:50.000 --> 07:54.000 and creating a small language model out of that, 07:54.000 --> 08:01.000 I'm not a lawyer, but I cannot imagine a way where the output of generating code from that model 08:01.000 --> 08:03.000 would not also be under the GPL. 08:03.000 --> 08:08.000 It's just derived in a weird, funky way, with weights and math behind it, 08:08.000 --> 08:12.000 but still eventually directly derived from it. 08:12.000 --> 08:15.000 It's also interesting that, in some cases, 08:15.000 --> 08:20.000 and I think it's probably going too far, you have corporations which are 08:20.000 --> 08:24.000 prohibiting AI-generated code, or prohibiting AI use. 08:24.000 --> 08:30.000 You also have stupid corporations, and maybe some of you have been subjected to this abuse, 08:30.000 --> 08:33.000 where you have managers that ask you every other day: 08:33.000 --> 08:38.000 have you been using more AI? 08:38.000 --> 08:42.000 Which AI have you been using? Why are you not using more AI? 08:42.000 --> 08:46.000 Which I think is absolutely terrible as a metric for performance. 08:46.000 --> 08:51.000 How many of you are subjected to this kind of abuse in your corporation or business? 08:51.000 --> 08:55.000 Oh, man, more than the people that know purl, that's terrible. 08:55.000 --> 09:00.000 Oh, we're really living in a weird world. 09:00.000 --> 09:04.000 So, the AI-generated code research project.
09:04.000 --> 09:09.000 The test base is a small subset of the BigCode training data, 09:09.000 --> 09:15.000 where we've indexed the source code of about 260,000 open source projects. 09:16.000 --> 09:24.000 And what we did is ask ChatGPT to generate code similar to a Package URL. 09:24.000 --> 09:28.000 So, essentially: generate code similar to this package. 09:28.000 --> 09:31.000 It's probably not a prompt that you would use in real life, 09:31.000 --> 09:38.000 but it's a plausible prompt, and we saved each result in a code file. 09:38.000 --> 09:46.000 And then we used our tools, rebuilt for that, to scan these files, these 100 files, 09:46.000 --> 09:49.000 on the most common JavaScript projects, 09:49.000 --> 09:58.000 and basically ran the code matching between the index of 260,000 projects and these 100 generated files. 09:58.000 --> 10:02.000 What we found, and it's not scientific, is, 10:02.000 --> 10:06.000 in at least 20% of the cases, strikingly similar code. 10:06.000 --> 10:10.000 So, that's what the setup looks like. 10:10.000 --> 10:13.000 Our code matching, like everything, is open source. 10:13.000 --> 10:17.000 The whole point is you essentially collect checksums for your index, 10:17.000 --> 10:18.000 and you match them back. 10:18.000 --> 10:21.000 Except the checksums are not really checksums; that would be too brittle. 10:21.000 --> 10:23.000 We wouldn't find anything. 10:23.000 --> 10:25.000 In particular, when you generate code, 10:25.000 --> 10:30.000 depending on a parameter called the model temperature, 10:30.000 --> 10:33.000 you will have essentially the same control flow, 10:33.000 --> 10:36.000 but completely different names for the variables, the functions, and else. 10:36.000 --> 10:39.000 So, you need to adjust to these kinds of things. 10:39.000 --> 10:45.000 An initial approach was to say: let's use AI to detect 10:45.000 --> 10:51.000 if there are similarities with existing code in the AI-generated code.
10:51.000 --> 10:56.000 It worked, actually, okay, except it was extremely expensive. 10:56.000 --> 11:02.000 I just spent all my free token budget with all the AI companies there. 11:02.000 --> 11:09.000 The focus of this project is to say: let's find strikingly similar code fragments 11:09.000 --> 11:11.000 that may come from another project. 11:11.000 --> 11:15.000 The problem is it doesn't work well with traditional techniques. 11:15.000 --> 11:18.000 If you think about just an inverted index, 11:18.000 --> 11:22.000 there's too much content to match, too many details to dig in. 11:22.000 --> 11:25.000 You get too much noise very quickly. 11:25.000 --> 11:31.000 There's also an approach we didn't try, but we researched it extensively, 11:31.000 --> 11:35.000 which is used in a search engine called Bing, from Microsoft. 11:35.000 --> 11:40.000 It's a project called BitFunnel, which is an alternative to inverted indices, 11:40.000 --> 11:45.000 which is interesting, and probably something we want to consider in the future. 11:45.000 --> 11:48.000 And you have a few companies in the space 11:48.000 --> 11:52.000 which are doing what I call traditional code fragment matching. 11:52.000 --> 11:54.000 Commercial companies like Black Duck, 11:55.000 --> 11:59.000 semi-commercial companies like SCANOSS, 11:59.000 --> 12:03.000 proprietary companies like FOSSA or FossID. 12:03.000 --> 12:06.000 They all use the exact same algorithm, 12:06.000 --> 12:13.000 which was devised by a guy in Berkeley in the late 1990s, early 2000s. 12:13.000 --> 12:16.000 And that doesn't really work well, 12:16.000 --> 12:22.000 because it's not able to detect the wide variations we have when we use AI. 12:22.000 --> 12:27.000 You can prompt the same model twice with the same prompt: 12:27.000 --> 12:32.000 you will eventually get slightly different results, so that's the problem. 12:32.000 --> 12:38.000 So the approach we're going with: first, we break the code into chunks.
12:38.000 --> 12:42.000 That means literally we parse the code into tokens, 12:42.000 --> 12:46.000 and we detect boundaries using a content-defined chunking algorithm 12:46.000 --> 12:50.000 to have chunks which are roughly the same size. 12:50.000 --> 12:54.000 And then we compute what's called a fuzzy hash. 12:54.000 --> 12:58.000 Some of you may be aware of things like ssdeep, 12:58.000 --> 13:02.000 which is a tool to find approximately matching files, 13:02.000 --> 13:04.000 widely used in security. 13:04.000 --> 13:08.000 We're not using ssdeep, but the principles are similar, 13:08.000 --> 13:13.000 meaning you abstract the code fragment to a bit string, 13:13.000 --> 13:16.000 and you have a way to match approximately, evaluating the distance, 13:16.000 --> 13:20.000 the Hamming distance, to tell if two code fragments are the same. 13:20.000 --> 13:24.000 The interesting thing in our case is that the precision 13:24.000 --> 13:27.000 of the matching can be tuned at collection time, 13:27.000 --> 13:32.000 depending on how large you want the bit strings to be. 13:32.000 --> 13:37.000 The best way to understand how this works is this. 13:37.000 --> 13:39.000 Imagine two cat pictures, 13:39.000 --> 13:43.000 a brown one and a gray one, that have a slightly different tail, different eyes. 13:43.000 --> 13:50.000 If you resize the images down to, say, 32 pixels by 32 pixels, 13:50.000 --> 13:53.000 they look exactly the same. 13:53.000 --> 13:58.000 That's essentially what we're doing here with our approach, 13:58.000 --> 14:01.000 except it's not sizing down cat pictures, 14:01.000 --> 14:05.000 which are always, of course, a security favorite. 14:05.000 --> 14:11.000 So the initial project plan was to go through 14:11.000 --> 14:15.000 each of these steps; I won't go through all of that. 14:15.000 --> 14:18.000 But again, this alone didn't work well. 14:18.000 --> 14:21.000 We were not able to detect a lot of the code 14:21.000 --> 14:24.000 because of these variations in the literals and names.
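The "abstract a code fragment to a bit string, then compare by Hamming distance" idea described above can be illustrated with a simhash-style fuzzy fingerprint. This is a toy under my own naming, a sketch of the general fuzzy-hashing principle the speaker describes, not the actual AboutCode matching code:

```python
import hashlib
import re

def tokens(code: str) -> list[str]:
    """Crude lexer: split code into identifiers, numbers and symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def fingerprint(toks: list[str], bits: int = 64) -> int:
    """SimHash-style fuzzy hash: token streams that mostly overlap
    produce bit strings that are close in Hamming distance."""
    counts = [0] * bits
    for tok in toks:
        h = int.from_bytes(hashlib.sha1(tok.encode()).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, c in enumerate(counts) if c > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

original = fingerprint(tokens("function encode(input) { return btoa(input); }"))
renamed = fingerprint(tokens("function encode(data) { return btoa(data); }"))
print(hamming(original, renamed))  # usually small: likely a match
```

The `bits` parameter is the knob the talk mentions: a longer bit string makes the match more precise, a shorter one more tolerant, just like shrinking the cat pictures more or less aggressively.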
14:24.000 --> 14:26.000 That's one problem. 14:26.000 --> 14:29.000 And you know, good stuff dies hard. 14:29.000 --> 14:32.000 We found an algorithm devised by a guy 14:32.000 --> 14:36.000 who happens to be the author of a venerable version 14:36.000 --> 14:39.000 control system called CVS, 14:39.000 --> 14:43.000 which he created, which was based on RCS, 14:43.000 --> 14:45.000 which then led to Subversion, 14:45.000 --> 14:49.000 and eventually ended up, with a few segues, in Bazaar 14:49.000 --> 14:52.000 and other version control systems. 14:52.000 --> 14:55.000 And I think the guy, I don't know if he's dead or not, 14:55.000 --> 14:58.000 but he's retired. 14:58.000 --> 15:02.000 At a minimum he's retired; he was working at a Dutch university. 15:02.000 --> 15:06.000 And he wrote a piece of code to actually 15:06.000 --> 15:09.000 transform a stream of code tokens 15:09.000 --> 15:12.000 into something that's generic and makes sense. 15:12.000 --> 15:15.000 It's simple, obvious, it's time-tested, 15:15.000 --> 15:19.000 it's, what, almost 40 years old. 15:19.000 --> 15:23.000 And we call that code stemming, because that sounds cool. 15:23.000 --> 15:26.000 But the essence is: the same way you stem 15:26.000 --> 15:31.000 language when you index it for information retrieval, 15:31.000 --> 15:34.000 where two words which share the same stem 15:34.000 --> 15:37.000 will just be abstracted to that stem, 15:37.000 --> 15:41.000 here we're doing the same with code. 15:41.000 --> 15:45.000 We have another problem, which I won't dive into too much, 15:45.000 --> 15:48.000 which is how do you eventually distribute the fingerprints 15:48.000 --> 15:53.000 widely, so we don't create this massive centralized database, 15:53.000 --> 15:55.000 which is a lock-in mechanism.
15:55.000 --> 15:58.000 We have some simple federation code to help with that 15:58.000 --> 16:00.000 that we're progressively deploying, 16:00.000 --> 16:04.000 which means that eventually you have access to this 16:04.000 --> 16:08.000 to run on-prem, without having any of us in the picture. 16:08.000 --> 16:12.000 It's important to liberate the data. 16:12.000 --> 16:16.000 So, current status: we have something which works. 16:16.000 --> 16:19.000 It's based on boring technology; I love boring. 16:19.000 --> 16:22.000 It's proven, there's nothing funky. 16:22.000 --> 16:26.000 We use old code, and we resist when we have 16:26.000 --> 16:30.000 requests from newbies that say, hey, why don't you use these new tools, 16:30.000 --> 16:31.000 these new things? 16:31.000 --> 16:35.000 No, we use boring, working things. 16:35.000 --> 16:41.000 And we made sure we could also extract low-level libraries 16:41.000 --> 16:45.000 that could be reused for other purposes at the same time. 16:45.000 --> 16:52.000 So, before going there, I'm going to go and do the mandatory live demo, 16:52.000 --> 16:57.000 which for sure will not work, because we're live. 16:57.000 --> 17:00.000 And if I can find... 17:04.000 --> 17:07.000 Scheiße. 17:07.000 --> 17:10.000 I see there's a few Germans in the room, right? 17:10.000 --> 17:14.000 Normally, you're supposed to use French to swear. 17:14.000 --> 17:19.000 But here, I'm using German because that's less recognizable. 17:19.000 --> 17:21.000 Okay. 17:21.000 --> 17:24.000 I'm just going to go there, we'll find it in a minute; 17:24.000 --> 17:28.000 that's a test instance. 17:28.000 --> 17:31.000 And I'm going to search for a test project. 17:31.000 --> 17:34.000 I'm sure we're going to have one. 17:34.000 --> 17:37.000 So, this is the tool we use at the foundation, called ScanCode. 17:37.000 --> 17:41.000 You can just look for it, download it and run it.
17:41.000 --> 17:48.000 It can scan code for origin, like AI-generated code, 17:48.000 --> 17:50.000 but also a lot of things. 17:50.000 --> 17:53.000 Scan containers, and lots of other stuff. 17:53.000 --> 17:58.000 But here, to match code, there we go. 17:58.000 --> 18:03.000 And this looks like a good demo project. 18:03.000 --> 18:09.000 So, what we did here is scan some code that was AI-generated. 18:10.000 --> 18:13.000 If I recall correctly, the prompt was: 18:13.000 --> 18:17.000 generate some code similar to this JavaScript library that does 18:17.000 --> 18:20.000 Base64 encoding. 18:20.000 --> 18:23.000 We ran these three pipelines in sequence; 18:23.000 --> 18:25.000 what's interesting is to see the results here. 18:25.000 --> 18:28.000 There was one package that was detected, 18:28.000 --> 18:35.000 which is eventually the project that we asked to generate code for. 18:35.000 --> 18:38.000 So, it sounds a bit like a tautology, 18:38.000 --> 18:42.000 but it's an interesting thing: ask, in a prompt, an 18:42.000 --> 18:48.000 LLM to generate code about a certain package, and you'll get it out. 18:48.000 --> 18:53.000 And if we dive a bit into this, 18:53.000 --> 18:56.000 we see three matched resources, 18:56.000 --> 18:59.000 and if we dive into one of them, 18:59.000 --> 19:02.000 and see the code viewer here, 19:02.000 --> 19:05.000 we can see the matched fragments 19:05.000 --> 19:09.000 that are essentially the same as what's seen upstream. 19:09.000 --> 19:12.000 And you could dive into the details. 19:12.000 --> 19:15.000 If you look, you see some sections 19:15.000 --> 19:17.000 which are not highlighted. 19:17.000 --> 19:19.000 It's just a side effect of the algorithm. 19:19.000 --> 19:20.000 But you look at this code and you say: 19:20.000 --> 19:25.000 yes, there's no question that this has been obviously derived 19:25.000 --> 19:27.000 from this upstream project.
19:27.000 --> 19:30.000 And, except for a few non-matched regions, 19:30.000 --> 19:32.000 this is the same code; and again, 19:32.000 --> 19:35.000 this is literally, exactly, verbatim the code, 19:35.000 --> 19:38.000 with a few modifications. 19:38.000 --> 19:42.000 So, that's the proof that we're not bluffing here. 19:42.000 --> 19:45.000 So, you don't have to trust me. 19:45.000 --> 19:51.000 Next up, one of the key things also is being able to detect 19:51.000 --> 19:56.000 the case where you have AI that's used as a feature. 19:56.000 --> 19:59.000 So, that's what we're working on next. 19:59.000 --> 20:05.000 Together with that, we're helping people that build LLMs using code. 20:05.000 --> 20:09.000 In particular, there's a project at Hugging Face 20:09.000 --> 20:11.000 which they call BigCode, 20:11.000 --> 20:14.000 and they're building a dataset called The Stack. 20:14.000 --> 20:17.000 We're eventually helping them to run 20:17.000 --> 20:19.000 our tool ScanCode at scale, 20:19.000 --> 20:23.000 to ensure that the provenance and license of the code they index 20:23.000 --> 20:25.000 is actually known, 20:25.000 --> 20:29.000 which is a good thing, because at least you can trace, 20:29.000 --> 20:33.000 when you have potentially generated code from their models, 20:33.000 --> 20:35.000 where this came from, accurately. 20:35.000 --> 20:41.000 The next frontier is really to treat models 20:41.000 --> 20:44.000 as software components. 20:44.000 --> 20:49.000 Meaning, there are some companies that probably don't want 20:49.000 --> 20:51.000 to use Chinese models. 20:51.000 --> 20:53.000 It's the case in Germany, 20:53.000 --> 20:57.000 where I think it's been prohibited by the German government 20:57.000 --> 20:58.000 in some cases. 20:58.000 --> 21:00.000 I think DeepSeek has been prohibited. 21:00.000 --> 21:02.000 It's the case in some US corporations 21:02.000 --> 21:05.000 doing business with the US federal government.
21:05.000 --> 21:07.000 I don't care about the reason. 21:07.000 --> 21:10.000 I think frankly it's overblown and bullshit, 21:10.000 --> 21:13.000 but in any case it's interesting 21:13.000 --> 21:14.000 from a technical point of view to say: 21:14.000 --> 21:20.000 can we detect when a model is based on DeepSeek, 21:20.000 --> 21:22.000 or would be fine-tuned, for instance, on DeepSeek? 21:23.000 --> 21:28.000 It turns out, from the early things we've been looking at, 21:28.000 --> 21:34.000 that if you treat the sequence of weights in a model 21:34.000 --> 21:37.000 as a subject for fingerprinting, 21:37.000 --> 21:40.000 we can actually find striking similarities 21:40.000 --> 21:44.000 between a model and its fine-tuned versions. 21:44.000 --> 21:46.000 In many cases. 21:46.000 --> 21:48.000 When you have quantization, 21:48.000 --> 21:50.000 it's harder and it doesn't work. 21:50.000 --> 21:52.000 But when you don't quantize, 21:52.000 --> 21:56.000 you have essentially a small number of the weights 21:56.000 --> 22:00.000 which are being updated at each generation of fine-tuning, 22:00.000 --> 22:01.000 and the rest is truly the same, 22:01.000 --> 22:04.000 and we find these very efficiently. 22:04.000 --> 22:07.000 The last point is to detect 22:07.000 --> 22:12.000 when you have AI 22:12.000 --> 22:15.000 used through APIs and libraries. 22:15.000 --> 22:16.000 So it's easier: 22:16.000 --> 22:18.000 we already have the code to detect 22:18.000 --> 22:21.000 code similarities and to detect libraries. 22:21.000 --> 22:23.000 It's going to be just about tagging them. 22:23.000 --> 22:26.000 So you have LangChain, 22:26.000 --> 22:27.000 a well-known library in Python, 22:27.000 --> 22:29.000 for instance; you want to make sure that you know that you 22:29.000 --> 22:30.000 are using LangChain, 22:30.000 --> 22:32.000 which means it's likely you're using AI 22:32.000 --> 22:34.000 as a feature in your code.
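The weight-fingerprinting observation above (an unquantized fine-tune touches only a fraction of the weights; the rest stay bit-for-bit identical) can be made concrete with a toy simulation. This is my own construction for illustration, not AboutCode's actual pipeline:

```python
import random

def weight_overlap(base, other):
    """Fraction of weights that are bit-for-bit identical between two
    equally sized flat weight vectors; 0.0 if the shapes differ."""
    if len(base) != len(other):
        return 0.0
    same = sum(1 for x, y in zip(base, other) if x == y)
    return same / len(base)

random.seed(0)
base = [random.gauss(0, 1) for _ in range(10_000)]

# Simulate a light fine-tune: nudge 5% of the weights.
finetuned = list(base)
for i in random.sample(range(len(base)), len(base) // 20):
    finetuned[i] += 0.01

unrelated = [random.gauss(0, 1) for _ in range(10_000)]

print(weight_overlap(base, finetuned))  # high overlap: derived model
print(weight_overlap(base, unrelated))  # near zero: independent model
```

A high overlap is strong evidence of lineage, which is exactly why quantization breaks this simple form of the check: it rewrites every weight, so the bit-for-bit comparison goes to zero even for a true derivative.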
22:34.000 --> 22:36.000 And API imports: 22:36.000 --> 22:37.000 something that's, yes, 22:37.000 --> 22:41.000 as simple as just doing a grep on a URL. 22:41.000 --> 22:43.000 And that's it. 22:43.000 --> 22:45.000 This was funded in part by 22:45.000 --> 22:47.000 the EU program called NGI Search, 22:47.000 --> 22:49.000 so if you're based in the EU, it's paid 22:49.000 --> 22:51.000 in part thanks to your taxes, 22:51.000 --> 22:53.000 thank you very much. 22:53.000 --> 22:54.000 The code is yours, 22:54.000 --> 22:57.000 it's not ours, it's for you to use. 22:57.000 --> 23:01.000 And if you have questions, 23:01.000 --> 23:04.000 I'm taking some questions there. 23:04.000 --> 23:06.000 Go ahead. 23:06.000 --> 23:13.000 Thank you very much. 23:13.000 --> 23:16.000 Thank you very much. 23:16.000 --> 23:20.000 Yes, I have a question about 23:20.000 --> 23:22.000 transformation, somehow. 23:22.000 --> 23:24.000 Do you use the actual representation, 23:24.000 --> 23:26.000 the actual representation of the source code, 23:26.000 --> 23:28.000 or do you transform it into an intermediate representation, 23:28.000 --> 23:31.000 like some kind of abstract syntax tree, 23:31.000 --> 23:33.000 and then analyze that tree 23:33.000 --> 23:36.000 to find matches in the flow of the code, 23:36.000 --> 23:39.000 and not the exact words and constructions used? 23:39.000 --> 23:42.000 So the question is: 23:42.000 --> 23:46.000 do we use some kind of intermediate representation of the code 23:46.000 --> 23:48.000 when we're processing? 23:48.000 --> 23:51.000 So the answer is yes and no. 23:51.000 --> 23:54.000 We transform the code: 23:54.000 --> 23:55.000 we parse it 23:55.000 --> 23:58.000 with a library from GitHub called tree-sitter. 23:58.000 --> 24:01.000 And we basically have streams of tokens 24:01.000 --> 24:05.000 that we then generalize with this code stemming algorithm. 24:05.000 --> 24:09.000 So it's really more of a syntax-based approach.
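A toy version of that token generalization, the "code stemming" described earlier in the talk, might look like the following: identifiers and literals are abstracted away so that two fragments differing only in naming produce identical token streams. This is a sketch of the idea, not the actual algorithm used by the project:

```python
import re

# A few Python keywords kept verbatim; everything else is generalized.
KEYWORDS = {"def", "return", "if", "else", "for", "while", "class",
            "import", "from", "in", "not", "and", "or", "pass"}

def stem(code: str) -> list[str]:
    """'Code stemming': abstract identifiers and literals away so that
    fragments that differ only in naming stem to the same stream."""
    out = []
    for tok in re.findall(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S", code):
        if tok in KEYWORDS:
            out.append(tok)           # keywords carry structure: keep them
        elif re.match(r"[A-Za-z_]", tok):
            out.append("ID")          # any identifier
        elif tok[0].isdigit():
            out.append("NUM")         # any numeric literal
        else:
            out.append(tok)           # operators and punctuation
    return out

# Same control flow, different names: identical stemmed streams.
a = stem("def area(width, height): return width * height")
b = stem("def size(w, h): return w * h")
print(a == b)  # True
```

Fingerprinting the stemmed stream instead of the raw tokens is what makes the matching robust to an LLM renaming every variable while keeping the control flow.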
24:09.000 --> 24:11.000 We don't deal with abstract syntax trees, 24:11.000 --> 24:13.000 control flow and all that. 24:13.000 --> 24:15.000 It could eventually be done, 24:15.000 --> 24:16.000 and it works very well. 24:16.000 --> 24:18.000 The problem is, at scale, 24:18.000 --> 24:22.000 doing anything that deals with abstract syntax trees is expensive. 24:22.000 --> 24:24.000 There's a guy here working also on the 24:24.000 --> 24:25.000 AboutCode project 24:26.000 --> 24:31.000 who does incredibly sophisticated 24:31.000 --> 24:35.000 static analysis to find actually reachable 24:35.000 --> 24:36.000 vulnerable code. 24:36.000 --> 24:39.000 That's very expensive in terms of compute. 24:39.000 --> 24:42.000 Here we're really trying to have massive matching, 24:42.000 --> 24:45.000 and we need to be able to turn this around very fast. 24:45.000 --> 24:47.000 So we're making some compromises, 24:47.000 --> 24:49.000 but in practice it works pretty fine. 24:49.000 --> 24:51.000 One last question? 24:51.000 --> 24:54.000 Okay, well, thank you very much. 24:54.000 --> 24:57.000 Thank you.