commit | a674b3095807a3679f154c5a4a6df738aeffdfdf | |
---|---|---|
author | Darkhan Kubigenov <darkhanu@gmail.com> | Sat Jan 21 23:12:39 2023 |
committer | Darkhan Kubigenov <darkhanu@gmail.com> | Mon Jan 23 23:37:06 2023 |
tree | cd36bfa06c14e7c7a8bd1bd2367f7cf66d4178d7 | |
parent | 74798f5aa5d5c9720e7ff8895fee08e19db05cc7 | |
Fix line diff by using runes without separators

[The suggested approach](https://github.com/google/diff-match-patch/wiki/Line-or-Word-Diffs#line-mode) for doing line-level diffing is the following set of steps:

1. `ti1, ti2, linesIdx = DiffLinesToChars(t1, t2)`
2. `diffs = DiffMain(ti1, ti2)`
3. `DiffCharsToLines(diffs, linesIdx)`

The original implementation in `google/diff-match-patch` stores the indices in `ti1` and `ti2` as Unicode codepoints joined by an empty string. The current implementation in this repo stores them as integers joined by a comma. While this makes `ti1` and `ti2` more readable, it introduces bugs when relying on it for line-level diffing with `DiffMain`.

The root cause of the issue is that an integer line index might span more than one character/rune, so `DiffMain` can assume that two different lines whose indices share a prefix match partially. For example, indices 123 and 129 will have the partial match `12`. In that example, the diff will show lines 3 and 9, which is not correct. A simple failing test case demonstrating this issue is available at `TestDiffPartialLineIndex`.

In this PR I am adjusting the algorithm to use the same approach as in [diff-match-patch](https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L508-L510) by storing each line index as a rune. While a rune in Go is a type alias for `int32`, not every `int32` value is a valid rune. During string to rune slice conversion, invalid runes are replaced with `utf8.RuneError`.

The integer to rune generation logic is based on the table in https://en.wikipedia.org/wiki/UTF-8#Encoding. Line indices up to 127 are the fastest to process, as each is represented as a single byte. Higher numbers are represented as 2-4 bytes. In addition to that, the range `U+D800 - U+DFFF` contains [invalid codepoints](https://en.wikipedia.org/wiki/UTF-8#Invalid_sequences_and_error_handling), so all codepoints higher than or equal to `0xD800` are incremented by `0xDFFF - 0xD800`. The maximum representable integer using this approach is 1'112'060. This improves on the JavaScript implementation, which currently [bails out](https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L503-L505) when files have more than 65535 lines.
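The prefix problem and the rune-based remedy can be illustrated with a small standalone sketch that uses only the Go standard library. It is not the code from this commit: `indexToRune`, `commaJoin`, `runeJoin` and `commonPrefixRunes` are hypothetical helpers written for this illustration. The sketch also makes one assumption of its own: it shifts indices by the full size of the surrogate block (`0x800`), so that index `0xD800` lands on `U+E000`. Its upper bound may therefore differ slightly from the figure quoted above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode/utf8"
)

const (
	surrogateMin  = 0xD800 // first code point of the surrogate block
	surrogateSize = 0x800  // assumption: skip all of U+D800..U+DFFF
)

// indexToRune maps a line index to a valid rune by skipping the
// surrogate block. Hypothetical helper, not go-diff's own function.
func indexToRune(i int) (rune, bool) {
	if i < 0 {
		return utf8.RuneError, false
	}
	if i >= surrogateMin {
		i += surrogateSize
	}
	if i > utf8.MaxRune {
		return utf8.RuneError, false
	}
	return rune(i), true
}

// commaJoin encodes indices as decimal integers joined by commas,
// which lets one index occupy several characters.
func commaJoin(idx []int) string {
	parts := make([]string, len(idx))
	for k, i := range idx {
		parts[k] = strconv.Itoa(i)
	}
	return strings.Join(parts, ",")
}

// runeJoin encodes every index as exactly one rune, with no separator.
func runeJoin(idx []int) (string, bool) {
	rs := make([]rune, len(idx))
	for k, i := range idx {
		r, ok := indexToRune(i)
		if !ok {
			return "", false
		}
		rs[k] = r
	}
	return string(rs), true
}

// commonPrefixRunes counts the leading runes shared by a and b.
func commonPrefixRunes(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	n := 0
	for n < len(ra) && n < len(rb) && ra[n] == rb[n] {
		n++
	}
	return n
}

func main() {
	a, b := []int{123}, []int{129}

	// Comma-joined integers: "123" and "129" share the prefix "12".
	fmt.Println(commonPrefixRunes(commaJoin(a), commaJoin(b))) // 2

	// One rune per index: the two encodings share nothing.
	sa, _ := runeJoin(a)
	sb, _ := runeJoin(b)
	fmt.Println(commonPrefixRunes(sa, sb)) // 0

	// A surrogate code point does not survive a string round trip.
	bad := rune(0xD800)
	fmt.Println([]rune(string(bad))[0] == utf8.RuneError) // true

	// The shifted mapping avoids the surrogate block.
	r, _ := indexToRune(0xD800)
	fmt.Printf("%U\n", r) // U+E000
}
```

Because every index occupies exactly one rune, a character-level diff over the encoded strings can only match whole line indices, never fragments of them. The sketch does not special-case `U+FFFD`, which is a valid code point but cannot be told apart from a decoding error.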
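go-diff offers algorithms to perform operations required for synchronizing plain text: comparing two texts and returning their differences, fuzzy matching of text, and patching text using the differences of two texts.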
go get -u github.com/sergi/go-diff/...
The following example compares two texts and writes out the differences to standard output.
package main

import (
	"fmt"

	"github.com/sergi/go-diff/diffmatchpatch"
)

const (
	text1 = "Lorem ipsum dolor."
	text2 = "Lorem dolor sit amet."
)

func main() {
	dmp := diffmatchpatch.New()

	// Compute a character-level diff; false disables the line-mode speedup.
	diffs := dmp.DiffMain(text1, text2, false)

	fmt.Println(dmp.DiffPrettyText(diffs))
}
If you found a bug or are missing a feature, please first make sure you have the latest version of go-diff. If the problem still persists, go through the open issues in the tracker. If you cannot find your request there, just open a new issue.
You want to contribute to go-diff? GREAT! If you are here because of a bug you want to fix or a feature you want to add, you can just read on. Otherwise we have a list of open issues in the tracker. Just choose something you think you can work on and discuss your plans in the issue by commenting on it.
Please make sure that every behavioral change is accompanied by test cases. Additionally, every contribution must pass the `lint` and `test` Makefile targets, which can be run using the following commands in the repository root directory.
make lint
make test
After your contribution passes these commands, create a PR and we will review your contribution.
go-diff is a Go language port of Neil Fraser's google-diff-match-patch code. His original code is available at http://code.google.com/p/google-diff-match-patch/.
The original Google Diff, Match and Patch Library is licensed under the Apache License 2.0. The full terms of that license are included here in the APACHE-LICENSE-2.0 file.
Diff, Match and Patch Library
Written by Neil Fraser Copyright (c) 2006 Google Inc. http://code.google.com/p/google-diff-match-patch/
This Go version of Diff, Match and Patch Library is licensed under the MIT License (a.k.a. the Expat License) which is included here in the LICENSE file.
Go version of Diff, Match and Patch Library
Copyright (c) 2012-2016 The go-diff authors. All rights reserved. https://github.com/sergi/go-diff
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.