string.{,r}split: make sep=None behave like Python (#35)

The cases for sep=None and explicit separators should use different algorithms, as Python does. Make this explicit, and add tests and documentation. The code is actually simpler for it.

This also fixes a test breakage introduced by d6768aa.
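For reference, the Python behavior that this change mirrors can be checked directly in CPython. The snippet below is only an illustration and is not part of the commit; the variable names are made up for the example, and the inputs are borrowed from the tests added in testdata/string.sky.

```python
# Explicit separator: every occurrence delimits a field, so empty strings are kept.
explicit = "--aa--bb--cc--".split("-")    # ['', '', 'aa', '', 'bb', '', 'cc', '', '']

# sep=None: runs of white space delimit fields, and empty strings never appear.
whitespace = "  aa  bb  cc  ".split()     # ['aa', 'bb', 'cc']

# With maxsplit=0 an explicit separator leaves the string untouched...
unchanged = "--aa--bb--cc--".split("-", 0)   # ['--aa--bb--cc--']

# ...whereas sep=None still strips leading (split) or trailing (rsplit) white space.
left = "  aa  bb  cc  ".split(None, 0)    # ['aa  bb  cc  ']
right = "  aa  bb  cc  ".rsplit(None, 0)  # ['  aa  bb  cc']

# Edge cases described in the updated spec text.
empty_with_sep = "".split(",")            # ['']
only_spaces = "   ".split()               # []
```

The `maxsplit=0` rows are the ones the old code got wrong for `sep=None`, because it returned the receiver unchanged in both branches.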

commit: 7b65d16c7a9abf7fdc27e64f929332878143fc69
author: alandonovan <adonovan@google.com> Wed Nov 01 20:35:47 2017
committer: GitHub <noreply@github.com> Wed Nov 01 20:35:47 2017
tree: 2009f68beac5d8b5f68cb589fb1137e706e84e47
parent: d6768aaa5873c1cf26709e3fd324a68b38ad690f
diff --git a/doc/spec.md b/doc/spec.md
index 02dba40..8454baa 100644
--- a/doc/spec.md
+++ b/doc/spec.md

@@ -247,7 +247,7 @@
 identifiers:
 
 ```text
-and            else           load 
+and            else           load
 break          for            not
 continue       if             or
 def            in             pass
@@ -3154,7 +3154,7 @@
 
 <a id='dict·clear'></a>
 ### dict·clear
- 
+
 `D.clear()` removes all the entries of dictionary D and returns `None`.
 It fails if the dictionary is frozen or if there are active iterators.
 
@@ -3169,7 +3169,7 @@
 
 <a id='dict·get'></a>
 ### dict·get
- 
+
 `D.get(key[, default])` returns the dictionary value corresponding to the given key.
 If the dictionary contains no such value, `get` returns `None`, or the
 value of the optional `default` parameter if present.
@@ -3185,7 +3185,7 @@
 
 <a id='dict·items'></a>
 ### dict·items
- 
+
 `D.items()` returns a new list of key/value pairs, one per element in
 dictionary D, in the same order as they would be returned by a `for` loop.
 
@@ -3196,7 +3196,7 @@
 
 <a id='dict·keys'></a>
 ### dict·keys
- 
+
 `D.keys()` returns a new list containing the keys of dictionary D, in the
 same order as they would be returned by a `for` loop.
 
@@ -3207,7 +3207,7 @@
 
 <a id='dict·pop'></a>
 ### dict·pop
- 
+
 `D.pop(key[, default])` returns the value corresponding to the specified
 key, and removes it from the dictionary.  If the dictionary contains no
 such value, and the optional `default` parameter is present, `pop`
@@ -3225,7 +3225,7 @@
 
 <a id='dict·popitem'></a>
 ### dict·popitem
- 
+
 `D.popitem()` returns the first key/value pair, removing it from the dictionary.
 
 `popitem` fails if the dictionary is empty, frozen, or has active iterators.
@@ -3239,7 +3239,7 @@
 
 <a id='dict·setdefault'></a>
 ### dict·setdefault
- 
+
 `D.setdefault(key[, default])` returns the dictionary value corresponding to the given key.
 If the dictionary contains no such value, `setdefault`, like `get`,
 returns `None` or the value of the optional `default` parameter if
@@ -3258,7 +3258,7 @@
 
 <a id='dict·update'></a>
 ### dict·update
- 
+
 `D.update([pairs][, name=value[, ...]])` makes a sequence of key/value
 insertions into dictionary D, then returns `None`.
 
@@ -3284,7 +3284,7 @@
 
 <a id='dict·values'></a>
 ### dict·values
- 
+
 `D.values()` returns a new list containing the dictionary's values, in the
 same order as they would be returned by a `for` loop over the
 dictionary.
@@ -3296,7 +3296,7 @@
 
 <a id='list·append'></a>
 ### list·append
- 
+
 `L.append(x)` appends `x` to the list L, and returns `None`.
 
 `append` fails if the list is frozen or has active iterators.
@@ -3311,7 +3311,7 @@
 
 <a id='list·clear'></a>
 ### list·clear
- 
+
 `L.clear()` removes all the elements of the list L and returns `None`.
 It fails if the list is frozen or if there are active iterators.
 
@@ -3323,7 +3323,7 @@
 
 <a id='list·extend'></a>
 ### list·extend
- 
+
 `L.extend(x)` appends the elements of `x`, which must be iterable, to
 the list L, and returns `None`.
 
@@ -3338,7 +3338,7 @@
 
 <a id='list·index'></a>
 ### list·index
- 
+
 `L.index(x[, start[, end]])` finds `x` within the list L and returns its index.
 
 The optional `start` and `end` parameters restrict the portion of
@@ -3359,7 +3359,7 @@
 
 <a id='list·insert'></a>
 ### list·insert
- 
+
 `L.insert(i, x)` inserts the value `x` in the list L at index `i`, moving
 higher-numbered elements along by one.  It returns `None`.
 
@@ -3378,7 +3378,7 @@
 
 <a id='list·pop'></a>
 ### list·pop
- 
+
 `L.pop([index])` removes and returns the last element of the list L, or,
 if the optional index is provided, at that index.
 
@@ -3394,7 +3394,7 @@
 
 <a id='list·remove'></a>
 ### list·remove
- 
+
 `L.remove(x)` removes the first occurrence of the value `x` from the list L, and returns `None`.
 
 `remove` fails if the list does not contain `x`, is frozen, or has active iterators.
@@ -3408,7 +3408,7 @@
 
 <a id='set·union'></a>
 ### set·union
- 
+
 `S.union(iterable)` returns a new set into which have been inserted
 all the elements of set S and all the elements of the argument, which
 must be iterable.
@@ -3423,7 +3423,7 @@
 
 <a id='string·bytes'></a>
 ### string·bytes
- 
+
 `S.bytes()` returns an iterable value containing the
 sequence of numeric bytes values in the string S.
 
@@ -3441,7 +3441,7 @@
 
 <a id='string·capitalize'></a>
 ### string·capitalize
- 
+
 `S.capitalize()` returns a copy of string S with all Unicode letters
 that begin words changed to their title case.
 
@@ -3451,7 +3451,7 @@
 
 <a id='string·codepoints'></a>
 ### string·codepoints
- 
+
 `S.codepoints()` returns an iterable value containing the
 sequence of integer Unicode code points encoded by the string S.
 Each invalid code within the string is treated as if it encodes the
@@ -3478,7 +3478,7 @@
 
 <a id='string·count'></a>
 ### string·count
- 
+
 `S.count(sub[, start[, end]])` returns the number of occurrences of
 `sub` within the string S, or, if the optional substring indices
 `start` and `end` are provided, within the designated substring of S.
@@ -3491,7 +3491,7 @@
 
 <a id='string·endswith'></a>
 ### string·endswith
- 
+
 `S.endswith(suffix)` reports whether the string S has the specified suffix.
 
 ```python
@@ -3500,7 +3500,7 @@
 
 <a id='string·find'></a>
 ### string·find
- 
+
 `S.find(sub[, start[, end]])` returns the index of the first
 occurrence of the substring `sub` within S.
 
@@ -3518,7 +3518,7 @@
 
 <a id='string·format'></a>
 ### string·format
- 
+
 `S.format(*args, **kwargs)` returns a version of the format string S
 in which bracketed portions `{...}` are replaced
 by arguments from `args` and `kwargs`.
@@ -3564,7 +3564,7 @@
 
 <a id='string·index'></a>
 ### string·index
- 
+
 `S.index(sub[, start[, end]])` returns the index of the first
 occurrence of the substring `sub` within S, like `S.find`, except
 that if the substring is not found, the operation fails.
@@ -3577,7 +3577,7 @@
 
 <a id='string·isalnum'></a>
 ### string·isalnum
- 
+
 `S.isalnum()` reports whether the string S is non-empty and consists only of
 Unicode letters and digits.
 
@@ -3588,7 +3588,7 @@
 
 <a id='string·isalpha'></a>
 ### string·isalpha
- 
+
 `S.isalpha()` reports whether the string S is non-empty and consists only of Unicode letters.
 
 ```python
@@ -3599,7 +3599,7 @@
 
 <a id='string·isdigit'></a>
 ### string·isdigit
- 
+
 `S.isdigit()` reports whether the string S is non-empty and consists only of Unicode digits.
 
 ```python
@@ -3610,7 +3610,7 @@
 
 <a id='string·islower'></a>
 ### string·islower
- 
+
 `S.islower()` reports whether the string S contains at least one cased Unicode
 letter, and all such letters are lowercase.
 
@@ -3622,7 +3622,7 @@
 
 <a id='string·isspace'></a>
 ### string·isspace
- 
+
 `S.isspace()` reports whether the string S is non-empty and consists only of Unicode spaces.
 
 ```python
@@ -3633,7 +3633,7 @@
 
 <a id='string·istitle'></a>
 ### string·istitle
- 
+
 `S.istitle()` reports whether the string S contains at least one cased Unicode
 letter, and all such letters that begin a word are in title case.
 
@@ -3646,7 +3646,7 @@
 
 <a id='string·isupper'></a>
 ### string·isupper
- 
+
 `S.isupper()` reports whether the string S contains at least one cased Unicode
 letter, and all such letters are uppercase.
 
@@ -3658,7 +3658,7 @@
 
 <a id='string·join'></a>
 ### string·join
- 
+
 `S.join(iterable)` returns the string formed by concatenating each
 element of its argument, with a copy of the string S between
 successive elements. The argument must be an iterable whose elements
@@ -3671,7 +3671,7 @@
 
 <a id='string·lower'></a>
 ### string·lower
- 
+
 `S.lower()` returns a copy of the string S with letters converted to lowercase.
 
 ```python
@@ -3680,7 +3680,7 @@
 
 <a id='string·lstrip'></a>
 ### string·lstrip
- 
+
 `S.lstrip()` returns a copy of the string S with leading whitespace removed.
 
 ```python
@@ -3689,7 +3689,7 @@
 
 <a id='string·partition'></a>
 ### string·partition
- 
+
 `S.partition(x)` splits string S into three parts and returns them as
 a tuple: the portion before the first occurrence of string `x`, `x` itself,
 and the portion following it.
@@ -3703,7 +3703,7 @@
 
 <a id='string·replace'></a>
 ### string·replace
- 
+
 `S.replace(old, new[, count])` returns a copy of string S with all
 occurrences of substring `old` replaced by `new`. If the optional
 argument `count`, which must be an `int`, is non-negative, it
@@ -3716,7 +3716,7 @@
 
 <a id='string·rfind'></a>
 ### string·rfind
- 
+
 `S.rfind(sub[, start[, end]])` returns the index of the substring `sub` within
 S, like `S.find`, except that `rfind` returns the index of the substring's
 _last_ occurrence.
@@ -3729,7 +3729,7 @@
 
 <a id='string·rindex'></a>
 ### string·rindex
- 
+
 `S.rindex(sub[, start[, end]])` returns the index of the substring `sub` within
 S, like `S.index`, except that `rindex` returns the index of the substring's
 _last_ occurrence.
@@ -3742,7 +3742,7 @@
 
 <a id='string·rpartition'></a>
 ### string·rpartition
- 
+
 `S.rpartition(x)` is like `partition`, but splits `S` at the last occurrence of `x`.
 
 ```python
@@ -3751,7 +3751,7 @@
 
 <a id='string·rsplit'></a>
 ### string·rsplit
- 
+
 `S.rsplit([sep[, maxsplit]])` splits a string into substrings like `S.split`,
 except that when a maximum number of splits is specified, `rsplit` chooses the
 rightmost splits.
@@ -3762,12 +3762,9 @@
 "one two  three".rsplit(None, 1)             # ["one two", "three"]
 ```
 
-TODO: `rsplit(None, maxsplit)` where `maxsplit > 0` (as in the last
-example above) is not yet implemented and currently returns an error.
-
 <a id='string·rstrip'></a>
 ### string·rstrip
- 
+
 `S.rstrip()` returns a copy of the string S with trailing whitespace removed.
 
 ```python
@@ -3776,14 +3773,24 @@
 
 <a id='string·split'></a>
 ### string·split
- 
+
 `S.split([sep [, maxsplit]])` returns the list of substrings of S,
-splitting at occurrences of `sep`.
-If `sep` is not specified or is `None`, `split` splits the string
-between space characters and discards empty substrings.
+splitting at occurrences of the delimiter string `sep`.
+
+Consecutive occurrences of `sep` are considered to delimit empty
+strings, so `'food'.split('o')` returns `['f', '', 'd']`.
+Splitting an empty string with a specified separator returns `['']`.
 If `sep` is the empty string, `split` fails.
 
-If `maxsplit` is given, it specifies the maximum number of splits.
+If `sep` is not specified or is `None`, `split` uses a different
+algorithm: it removes all leading spaces from S
+(or trailing spaces in the case of `rsplit`),
+then splits the string around each consecutive non-empty sequence of
+Unicode white space characters.
+
+If S consists only of white space, `split` returns the empty list.
+
+If `maxsplit` is given and non-negative, it specifies a maximum number of splits.
 
 ```python
 "one two  three".split()                    # ["one", "two", "three"]
@@ -3795,7 +3802,7 @@
 
 <a id='string·split_bytes'></a>
 ### string·split_bytes
- 
+
 `S.split_bytes()` returns an iterable value containing successive
 1-byte substrings of S.
 To materialize the entire sequence, apply `list(...)` to the result.
@@ -3812,7 +3819,7 @@
 
 <a id='string·split_codepoints'></a>
 ### string·split_codepoints
- 
+
 `S.split_codepoints()` returns an iterable value containing the sequence of
 substrings of S that each encode a single Unicode code point.
 Each invalid code within the string is treated as if it encodes the
@@ -3839,7 +3846,7 @@
 
 <a id='string·splitlines'></a>
 ### string·splitlines
- 
+
 `S.splitlines([keepends])` returns a list whose elements are the
 successive lines of S, that is, the strings formed by splitting S at
 line terminators (currently assumed to be a single newline, `\n`,
@@ -3857,7 +3864,7 @@
 
 <a id='string·startswith'></a>
 ### string·startswith
- 
+
 `S.startswith(prefix)` reports whether the string S has the specified prefix.
 
 ```python
@@ -3866,7 +3873,7 @@
 
 <a id='string·strip'></a>
 ### string·strip
- 
+
 `S.strip()` returns a copy of the string S with leading and trailing whitespace removed.
 
 ```python
@@ -3875,7 +3882,7 @@
 
 <a id='string·title'></a>
 ### string·title
- 
+
 `S.title()` returns a copy of the string S with letters converted to titlecase.
 
 Letters are converted to uppercase at the start of words, lowercase elsewhere.
@@ -3886,7 +3893,7 @@
 
 <a id='string·upper'></a>
 ### string·upper
- 
+
 `S.upper()` returns a copy of the string S with letters converted to uppercase.
 
 ```python

diff --git a/library.go b/library.go
index e28920f..7f1e7cb 100644
--- a/library.go
+++ b/library.go

@@ -1872,9 +1872,7 @@
 
 	if sep_ == nil || sep_ == None {
 		// special case: split on whitespace
-		if maxsplit == 0 {
-			res = append(res, recv)
-		} else if maxsplit < 0 {
+		if maxsplit < 0 {
 			res = strings.Fields(recv)
 		} else if fnname == "split" {
 			res = splitspace(recv, maxsplit)
@@ -1887,9 +1885,7 @@
 			return nil, fmt.Errorf("split: empty separator")
 		}
 		// usual case: split on non-empty separator
-		if maxsplit == 0 {
-			res = append(res, recv)
-		} else if maxsplit < 0 {
+		if maxsplit < 0 {
 			res = strings.Split(recv, sep)
 		} else if fnname == "split" {
 			res = strings.SplitN(recv, sep, maxsplit+1)
@@ -1912,7 +1908,7 @@
 	return NewList(list), nil
 }
 
-// Precondition: max > 0.
+// Precondition: max >= 0.
 func rsplitspace(s string, max int) []string {
 	res := make([]string, 0, max+1)
 	end := -1 // index of field end, or -1 in a region of spaces.
@@ -1943,7 +1939,7 @@
 	return res
 }
 
-// Precondition: max > 0.
+// Precondition: max >= 0.
 func splitspace(s string, max int) []string {
 	var res []string
 	start := -1 // index of field start, or -1 in a region of spaces

diff --git a/testdata/string.sky b/testdata/string.sky
index 7f0d55d..1df8ee9 100644
--- a/testdata/string.sky
+++ b/testdata/string.sky

@@ -186,9 +186,10 @@
 assert.eq("a.b.c.d".split(".", 2), ["a", "b", "c.d"])
 assert.eq("a.b.c.d".rsplit(".", 2), ["a.b", "c", "d"])
 
+# {,r}split on white space:
 assert.eq(" a bc\n  def \t  ghi".split(), ["a", "bc", "def", "ghi"])
 assert.eq(" a bc\n  def \t  ghi".split(None), ["a", "bc", "def", "ghi"])
-assert.eq(" a bc\n  def \t  ghi".split(None, 0), [" a bc\n  def \t  ghi"])
+assert.eq(" a bc\n  def \t  ghi".split(None, 0), ["a bc\n  def \t  ghi"])
 assert.eq(" a bc\n  def \t  ghi".rsplit(None, 0), [" a bc\n  def \t  ghi"])
 assert.eq(" a bc\n  def \t  ghi".split(None, 1), ["a", "bc\n  def \t  ghi"])
 assert.eq(" a bc\n  def \t  ghi".rsplit(None, 1), [" a bc\n  def", "ghi"])
@@ -201,10 +202,26 @@
 assert.eq(" a bc\n  def \t  ghi".rsplit(None, 5), ["a", "bc", "def", "ghi"])
 
 assert.eq(" a bc\n  def \t  ghi ".split(None, 0), ["a bc\n  def \t  ghi "])
-assert.eq(" a bc\n  def \t  ghi ".split(None, 1), ["a", "bc\n  def \t  ghi "])
 assert.eq(" a bc\n  def \t  ghi ".rsplit(None, 0), [" a bc\n  def \t  ghi"])
+assert.eq(" a bc\n  def \t  ghi ".split(None, 1), ["a", "bc\n  def \t  ghi "])
 assert.eq(" a bc\n  def \t  ghi ".rsplit(None, 1), [" a bc\n  def", "ghi"])
 
+# Observe the algorithmic difference when splitting on spaces versus other delimiters.
+assert.eq('--aa--bb--cc--'.split('-', 0), ['--aa--bb--cc--'])  # contrast this
+assert.eq('  aa  bb  cc  '.split(None, 0), ['aa  bb  cc  '])   #  with this
+assert.eq('--aa--bb--cc--'.rsplit('-', 0), ['--aa--bb--cc--']) # ditto this
+assert.eq('  aa  bb  cc  '.rsplit(None, 0), ['  aa  bb  cc'])  #  and this
+#
+assert.eq('--aa--bb--cc--'.split('-', 1), ['', '-aa--bb--cc--'])
+assert.eq('--aa--bb--cc--'.rsplit('-', 1), ['--aa--bb--cc-', ''])
+assert.eq('  aa  bb  cc  '.split(None, 1), ['aa', 'bb  cc  '])
+assert.eq('  aa  bb  cc  '.rsplit(None, 1), ['  aa  bb',  'cc'])
+#
+assert.eq('--aa--bb--cc--'.split('-', -1), ['', '', 'aa', '', 'bb', '', 'cc', '', ''])
+assert.eq('--aa--bb--cc--'.rsplit('-', -1), ['', '', 'aa', '', 'bb', '', 'cc', '', ''])
+assert.eq('  aa  bb  cc  '.split(None, -1), ['aa', 'bb', 'cc'])
+assert.eq('  aa  bb  cc  '.rsplit(None, -1), ['aa', 'bb', 'cc'])
+
 assert.eq("localhost:80".rsplit(":", 1)[-1], "80")
 
 # str.splitlines
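The `sep=None` algorithm documented in the spec change above (skip leading white space, then split around runs of white space, stopping after `maxsplit` splits) can be sketched in a few lines of Python. This is a hypothetical helper named `split_whitespace`, written only to illustrate the documented behavior; it is not a translation of the Go `splitspace` function in library.go, and it assumes `str.isspace` as the definition of white space.

```python
def split_whitespace(s, maxsplit=-1):
    """Illustrative sketch of split(None, maxsplit); a negative maxsplit means no limit."""
    res = []
    i, n = 0, len(s)
    while True:
        # Skip the run of white space before the next field.
        while i < n and s[i].isspace():
            i += 1
        if i == n:
            break  # nothing but white space remained
        if maxsplit >= 0 and len(res) == maxsplit:
            # Split budget exhausted: the remainder is kept verbatim,
            # including any trailing white space.
            res.append(s[i:])
            break
        # Scan to the end of the current field.
        j = i
        while j < n and not s[j].isspace():
            j += 1
        res.append(s[i:j])
        i = j
    return res


print(split_whitespace("  aa  bb  cc  ", 0))   # ['aa  bb  cc  ']
print(split_whitespace("  aa  bb  cc  ", 1))   # ['aa', 'bb  cc  ']
print(split_whitespace("  aa  bb  cc  "))      # ['aa', 'bb', 'cc']
```

Because leading white space is skipped before the split budget is checked, `maxsplit=0` needs no special case, which is consistent with the commit dropping the `maxsplit == 0` branches and relaxing the preconditions to `max >= 0`.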