The documentation comes from the Markdown files in the source code, so it is always up to date but available only in English. Enjoy!
This class contains a set of useful methods for string comparison, but the methods can also be used for any T[].
These methods use dynamic programming and need to create N×M arrays to do the comparison.
To avoid allocating many large objects in loops, the methods can reuse the arrays created by previous calls on the same instance.
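The reuse idea can be sketched as follows. This is a minimal, self-contained illustration and not the class's actual code: the type ReusableLevenshtein, its Allocations counter, and the growth policy are all assumptions made for the example. It uses the Levenshtein distance (described next) as the dynamic-programming computation.

```csharp
using System;

// Hypothetical helper (not the documented class): keeps its DP table between calls.
public class ReusableLevenshtein
{
    private int[,] table;                       // reused while it is large enough
    public int Allocations { get; private set; } // only here to make the reuse visible

    public int Distance(string strOld, string strNew)
    {
        int rows = strOld.Length + 1, cols = strNew.Length + 1;

        // Allocate only when the cached table cannot hold this comparison.
        if (table == null || table.GetLength(0) < rows || table.GetLength(1) < cols)
        {
            int oldRows = table == null ? 0 : table.GetLength(0);
            int oldCols = table == null ? 0 : table.GetLength(1);
            table = new int[Math.Max(rows, oldRows), Math.Max(cols, oldCols)];
            Allocations++;
        }

        int[,] d = table;
        for (int i = 0; i < rows; i++) d[i, 0] = i; // remove every element
        for (int j = 0; j < cols; j++) d[0, j] = j; // add every element

        for (int i = 1; i < rows; i++)
        {
            for (int j = 1; j < cols; j++)
            {
                int substitute = d[i - 1, j - 1] + (strOld[i - 1] == strNew[j - 1] ? 0 : 1);
                int remove = d[i - 1, j] + 1;
                int add = d[i, j - 1] + 1;
                d[i, j] = Math.Min(substitute, Math.Min(remove, add));
            }
        }
        return d[rows - 1, cols - 1];
    }
}
```

Calling Distance repeatedly on one instance only allocates again when a comparison needs a larger table than any previous one, which is the behaviour the paragraph above describes.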
An implementation of the Levenshtein string distance.
Returns the number of edits (insertions, substitutions and removals) necessary to convert one string into the other.
public int LevenshteinDistance(string strOld, string strNew,
    IEqualityComparer<char> comparer = null,
    Func<Choice<char>, int> weight = null)
public int LevenshteinDistance<T>(T[] strOld, T[] strNew,
    IEqualityComparer<T> comparer = null,
    Func<Choice<T>, int> weight = null)
public struct Choice<T>
{
    public readonly ChoiceType Type;
    public readonly T Removed;
    public readonly T Added;
    public bool HasRemoved { get { return Type != ChoiceType.Add; } }
    public bool HasAdded { get { return Type != ChoiceType.Remove; } }
}
public enum ChoiceType
{
    Equal,
    Substitute,
    Remove,
    Add,
}
Using the lambda weight, we can fine-tune the algorithm to, for example, make 'a' closer to 'á' or 'A' than to 'W'.
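The effect of such a weight can be illustrated with a small self-contained sketch. It does not use the documented Choice<T> delegate; instead it assumes a simplified substitution-cost callback, a cost scale where a full edit costs 2 and a "near miss" costs 1, and a tiny hand-written folding table. All of these names and values are hypothetical.

```csharp
using System;

public static class WeightedLevenshteinSketch
{
    // Assumed cost scale: a full edit costs 2, a near-miss substitution costs 1.
    public const int FullCost = 2;

    public static int Distance(string strOld, string strNew, Func<char, char, int> substitutionCost)
    {
        var d = new int[strOld.Length + 1, strNew.Length + 1];
        for (int i = 0; i <= strOld.Length; i++) d[i, 0] = i * FullCost;
        for (int j = 0; j <= strNew.Length; j++) d[0, j] = j * FullCost;

        for (int i = 1; i <= strOld.Length; i++)
        {
            for (int j = 1; j <= strNew.Length; j++)
            {
                char a = strOld[i - 1], b = strNew[j - 1];
                int substitute = d[i - 1, j - 1] + (a == b ? 0 : substitutionCost(a, b));
                int remove = d[i - 1, j] + FullCost;
                int add = d[i, j - 1] + FullCost;
                d[i, j] = Math.Min(substitute, Math.Min(remove, add));
            }
        }
        return d[strOld.Length, strNew.Length];
    }

    // Illustrative folding only: lower-cases and strips the accent from a few vowels.
    public static char Fold(char c)
    {
        char lower = char.ToLowerInvariant(c);
        switch (lower)
        {
            case '\u00E1': return 'a'; // á
            case '\u00E9': return 'e'; // é
            case '\u00ED': return 'i'; // í
            case '\u00F3': return 'o'; // ó
            case '\u00FA': return 'u'; // ú
            default: return lower;
        }
    }

    // Characters that differ only by case or accent are cheaper to substitute.
    public static int AccentAndCaseAware(char x, char y)
    {
        return Fold(x) == Fold(y) ? 1 : FullCost;
    }
}
```

With this callback, replacing 'a' by 'á' or 'A' costs 1 while replacing it by 'W' costs 2, so near-identical spellings rank as closer matches.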
There are also variants that reconstruct the exact choices that were made (useful for diff algorithms):
public List<Choice<char>> LevenshteinChoices(string strOld, string strNew,
    IEqualityComparer<char> comparer = null,
    Func<Choice<char>, int> weight = null)
public List<Choice<T>> LevenshteinChoices<T>(T[] strOld, T[] strNew,
    IEqualityComparer<T> comparer = null,
    Func<Choice<T>, int> weight = null)
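One common way to recover the choices is to fill the distance table and then walk it backwards from the bottom-right corner. The sketch below assumes that approach and unit costs; whether the library does exactly this is not stated here. The ChoiceKind enum mirrors the documented ChoiceType values, and the tuple shape stands in for Choice<char>; both are hypothetical simplifications.

```csharp
using System;
using System.Collections.Generic;

public static class LevenshteinChoicesSketch
{
    // Mirrors the documented ChoiceType values.
    public enum ChoiceKind { Equal, Substitute, Remove, Add }

    public static List<(ChoiceKind Kind, char? Removed, char? Added)> Choices(string strOld, string strNew)
    {
        int n = strOld.Length, m = strNew.Length;
        var d = new int[n + 1, m + 1];
        for (int i = 0; i <= n; i++) d[i, 0] = i;
        for (int j = 0; j <= m; j++) d[0, j] = j;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++)
                d[i, j] = Math.Min(
                    d[i - 1, j - 1] + (strOld[i - 1] == strNew[j - 1] ? 0 : 1),
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1));

        // Walk back from the end, picking the step that explains each cell's value.
        var result = new List<(ChoiceKind Kind, char? Removed, char? Added)>();
        int x = n, y = m;
        while (x > 0 || y > 0)
        {
            if (x > 0 && y > 0 && strOld[x - 1] == strNew[y - 1] && d[x, y] == d[x - 1, y - 1])
            {
                result.Add((ChoiceKind.Equal, strOld[x - 1], strNew[y - 1])); x--; y--;
            }
            else if (x > 0 && y > 0 && d[x, y] == d[x - 1, y - 1] + 1)
            {
                result.Add((ChoiceKind.Substitute, strOld[x - 1], strNew[y - 1])); x--; y--;
            }
            else if (x > 0 && d[x, y] == d[x - 1, y] + 1)
            {
                result.Add((ChoiceKind.Remove, strOld[x - 1], null)); x--;
            }
            else
            {
                result.Add((ChoiceKind.Add, null, strNew[y - 1])); y--;
            }
        }
        result.Reverse(); // the walk produced the choices last-to-first
        return result;
    }
}
```

For "cat" and "cart" this yields Equal c, Equal a, Add r, Equal t. The number of non-Equal entries always matches the unit-cost distance.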
Solves the longest common substring problem: finds the longest run of consecutive elements that appears in both strings.
public int LongestCommonSubstring(string str1, string str2)
public int LongestCommonSubstring(string str1, string str2, out int startPos1, out int startPos2)
public int LongestCommonSubstring<T>(T[] str1, T[] str2, out int startPos1, out int startPos2, IEqualityComparer<T> comparer = null)
public int LongestCommonSubstring<T>(Slice<T> str1, Slice<T> str2, out int startPos1, out int startPos2, IEqualityComparer<T> comparer = null)
There are overloads that also return, using out parameters, the starting position of that substring in both strings.
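As an illustration of what the out parameters carry, here is a minimal self-contained sketch of the classic dynamic-programming solution for strings. It is not the library's code. The class name is hypothetical, and returning position 0 when there is no common element is an assumption of this sketch.

```csharp
using System;

public static class LongestCommonSubstringSketch
{
    public static int Find(string str1, string str2, out int startPos1, out int startPos2)
    {
        startPos1 = 0;
        startPos2 = 0;
        int best = 0;

        // run[i, j] = length of the common run ending at str1[i - 1] and str2[j - 1]
        var run = new int[str1.Length + 1, str2.Length + 1];
        for (int i = 1; i <= str1.Length; i++)
        {
            for (int j = 1; j <= str2.Length; j++)
            {
                if (str1[i - 1] != str2[j - 1])
                    continue; // run stays 0: consecutive match is broken

                run[i, j] = run[i - 1, j - 1] + 1;
                if (run[i, j] > best)
                {
                    best = run[i, j];
                    startPos1 = i - best;
                    startPos2 = j - best;
                }
            }
        }
        return best;
    }
}
```

For "xabcdy" and "zzabcdq" the sketch returns 4, with startPos1 = 1 and startPos2 = 2, the offsets of "abcd" in each input.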
The most general overload takes array slices instead of arrays, to simplify diff algorithms.
public struct Slice<T> : IEnumerable<T>
{
    public readonly T[] Array;
    public readonly int Offset;
    public readonly int Length;
    public Slice(T[] array)
    public Slice(T[] array, int offset, int length)
    public T this[int index] { get; set; }
    public Slice<T> SubSlice(int relativeIndex, int length)
    public Slice<T> SubSliceStart(int relativeIndex)
    public Slice<T> SubSliceEnd(int relativeIndex)
}
Solves the longest common subsequence problem: finds the longest sequence of elements that appear in the same order, though not necessarily consecutively, in both strings.
public int LongestCommonSubsequence(string str1, string str2)
public int LongestCommonSubsequence<T>(T[] str1, T[] str2, IEqualityComparer<T> comparer = null)
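The difference from the substring variant is easiest to see in code. The following is a minimal sketch of the textbook dynamic-programming recurrence for strings, with a hypothetical class name. It is not taken from the library.

```csharp
using System;

public static class LongestCommonSubsequenceSketch
{
    public static int Length(string str1, string str2)
    {
        // len[i, j] = length of the longest common subsequence of the
        // first i elements of str1 and the first j elements of str2
        var len = new int[str1.Length + 1, str2.Length + 1];
        for (int i = 1; i <= str1.Length; i++)
        {
            for (int j = 1; j <= str2.Length; j++)
            {
                len[i, j] = str1[i - 1] == str2[j - 1]
                    ? len[i - 1, j - 1] + 1                     // extend the match
                    : Math.Max(len[i - 1, j], len[i, j - 1]);   // skip one element
            }
        }
        return len[str1.Length, str2.Length];
    }
}
```

For "axbxc" and "abc" the result is 3, because a, b and c appear in order even though they are not adjacent. The longest common substring of the same inputs has length 1.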
Implements a diff algorithm that compares two sequences by applying LongestCommonSubstring recursively. The result is similar to LevenshteinChoices, but while LevenshteinChoices is optimized for unintentional mistakes (misspellings), Diff is optimized for intentional changes.
//Uses
public List<DiffPair<T>> Diff<T>(T[] strOld, T[] strNew, IEqualityComparer<T> comparer = null)
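The recursive idea can be sketched like this: find the longest common run, keep it as unchanged, and diff whatever lies to its left and to its right. The sketch works on strings and returns tagged chunks instead of DiffPair<T>, whose members are not shown here. The class name, the '=', '-' and '+' tags, and the helper are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

public static class RecursiveDiffSketch
{
    // Each chunk is tagged '=' (unchanged), '-' (only in old) or '+' (only in new).
    public static List<(char Op, string Text)> Diff(string strOld, string strNew)
    {
        var result = new List<(char Op, string Text)>();
        DiffInto(strOld, strNew, result);
        return result;
    }

    private static void DiffInto(string o, string n, List<(char Op, string Text)> result)
    {
        int len = LongestCommonSubstring(o, n, out int po, out int pn);
        if (len == 0)
        {
            // Nothing in common: everything old was removed, everything new was added.
            if (o.Length > 0) result.Add(('-', o));
            if (n.Length > 0) result.Add(('+', n));
            return;
        }
        DiffInto(o.Substring(0, po), n.Substring(0, pn), result);       // left of the run
        result.Add(('=', o.Substring(po, len)));                        // the common run
        DiffInto(o.Substring(po + len), n.Substring(pn + len), result); // right of the run
    }

    private static int LongestCommonSubstring(string a, string b, out int posA, out int posB)
    {
        posA = 0; posB = 0;
        int best = 0;
        var run = new int[a.Length + 1, b.Length + 1];
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                if (a[i - 1] == b[j - 1])
                {
                    run[i, j] = run[i - 1, j - 1] + 1;
                    if (run[i, j] > best) { best = run[i, j]; posA = i - best; posB = j - best; }
                }
        return best;
    }
}
```

Comparing "the cat sat" with "the dog sat" produces the chunks = "the ", - "cat", + "dog", = " sat". Because whole runs are kept together, a replaced word shows up as one removal plus one addition rather than as character-by-character substitutions.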
Compares two strings by splitting them into words and applying Diff. A word is any sequence of letters and numbers.
//Uses
public List<DiffPair<string>> DiffWords(string strOld, string strNew)
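The splitting step might look like the sketch below, which treats each run of letters and digits as one token. How the library handles whitespace and punctuation is not stated here, so emitting every other character as its own token is an assumption of this sketch, as are the class and method names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordSplitSketch
{
    // A word is a run of letters/digits; any other character becomes its own token
    // (assumption), so that concatenating the tokens reproduces the input.
    private static readonly Regex Token = new Regex(@"[\p{L}\p{N}]+|[^\p{L}\p{N}]");

    public static List<string> Split(string text)
    {
        var tokens = new List<string>();
        foreach (Match m in Token.Matches(text))
            tokens.Add(m.Value);
        return tokens;
    }
}
```

For "Hello, world 42!" this gives the tokens "Hello", ",", " ", "world", " ", "42" and "!". The two token lists of the old and new strings can then be compared element by element with a sequence diff.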
Compares two long multi-line strings (like code files), using a combination of Diff and LevenshteinChoices at the line level, and DiffWords for each line.
//Uses
public List<DiffPair<string>> DiffWords(string strOld, string strNew)
© Signum Software. All Rights Reserved.