Data structure for grouping strings in a collection when they share common substrings [closed]

I am looking for a data structure and an algorithm to manage a dynamic collection of strings, but grouping strings that have a substring in common. I try to describe it through an example. First, we get a string foo, and put it into collection S. Then we get a string sample, and put it into S too. Next, we get a string oo, obviously oo is a substring of foo, so now collection S contains three members: foo, sample, oo. And foo and oo is in the same group. The next string in S is food, which is in the same group as foo and oo. And so on. Finally we get a large collection in which members are sharing some substrings are grouped. I want to use this algorithm or these algorithms to process duplicate files, but there are some obvious roadblocks: Dynamic collection Unicode No specific pattern Simply, I want to find a data structure and algorithm that can group the members of the growing string collection. In other words, in my expectation, this string collection should be composed of trees, each tree contains a longest string and its substrings in the string collection. Any suggestions?

Feb 10, 2025 - 14:09

I am looking for a data structure and an algorithm to manage a dynamic collection of strings, but grouping strings that have a substring in common. I try to describe it through an example.

First, we get a string foo, and put it into collection S.

Then we get a string sample, and put it into S too.

Next, we get a string oo, obviously oo is a substring of foo, so now collection S contains three members: foo, sample, oo. And foo and oo is in the same group.

The next string in S is food, which is in the same group as foo and oo.

And so on.

Finally we get a large collection in which members are sharing some substrings are grouped.

I want to use this algorithm or these algorithms to process duplicate files, but there are some obvious roadblocks:

Dynamic collection
Unicode
No specific pattern

Simply, I want to find a data structure and algorithm that can group the members of the growing string collection. In other words, in my expectation, this string collection should be composed of trees, each tree contains a longest string and its substrings in the string collection.

Any suggestions?