Handling undefined behavior in inffast_chunk

A new clang flag (basic-aa-recphi) revealed that chunkcopy_safe could trigger
undefined behavior: its pointer parameters use the 'restrict' qualifier, yet
the 'from' and 'out' pointers can overlap during decompression.
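
For context, a minimal sketch of why the overlap matters. In C, 'restrict'
promises the compiler that the pointed-to bytes are not accessed through any
other pointer, so an overlapping copy through restrict-qualified pointers is
undefined. DEFLATE back-references with dist < len overlap by design: bytes
written early in the copy are read back later in the same copy. The helper
below is hypothetical (it is not part of zlib and is not the code in this
patch); it only illustrates a restrict-free, byte-wise copy whose result is
well defined when the ranges overlap.

```c
/* Hypothetical helper for illustration only, not zlib code.
 * Copies len bytes from (out - dist) to out, one byte at a time, and
 * returns the new end of the output. Neither pointer is restrict-qualified,
 * so dist < len (overlapping ranges) is well defined: bytes written earlier
 * in the loop are read back later, repeating the pattern.
 * Assumes dist >= 1 and that at least dist bytes precede out. */
static unsigned char* lapped_copy_bytewise(unsigned char* out,
                                           unsigned dist,
                                           unsigned len) {
  const unsigned char* from = out - dist;
  while (len-- > 0) {
    *out++ = *from++;
  }
  return out;
}
```

With an 8-byte buffer that starts with "ab", calling the helper on buffer + 2
with dist = 2 and len = 6 fills the buffer with "abababab".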

This patch addresses the issue and unblocks enabling the aforementioned
compiler flag.

Credit for the original investigation and the new unit test stressing the
failure scenario goes to Hans Wennborg.

Performance implications: initial numbers point to a slight improvement on
64-bit ARM big cores (2% to 3.6%) and on x86-64 (up to 7.5% on an Intel i7),
but a regression of 2.3% to 3% on 32-bit (big.LITTLE).

Bug: 1103818
Change-Id: I9b7d2c1e47caaf498cd539fd6b77c4b949cb0dac
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2309041
Reviewed-by: Hans Wennborg <hans@chromium.org>
Reviewed-by: Adenilson Cavalcanti <cavalcantii@chromium.org>
Commit-Queue: Adenilson Cavalcanti <cavalcantii@chromium.org>
Cr-Commit-Position: refs/heads/master@{#793239}
diff --git a/third_party/zlib/contrib/optimizations/chunkcopy.h b/third_party/zlib/contrib/optimizations/chunkcopy.h
index 38ba0ed..7cbcf82 100644
--- a/third_party/zlib/contrib/optimizations/chunkcopy.h
+++ b/third_party/zlib/contrib/optimizations/chunkcopy.h
@@ -406,6 +406,27 @@
   return chunkcopy_lapped_relaxed(out, dist, len);
 }
 
+/* TODO(cavalcanti): see crbug.com/1110083. */
+static inline unsigned char FAR* chunkcopy_safe_ugly(unsigned char FAR* out,
+                                                     unsigned dist,
+                                                     unsigned len,
+                                                     unsigned char FAR* limit) {
+#if defined(__GNUC__) && !defined(__clang__)
+  /* With GCC on ARM (tested with gcc 6.3 and 7.5), speed is the same
+     as using chunkcopy_safe, and this path avoids the
+     undefined behavior.
+  */
+  out = chunkcopy_core_safe(out, out - dist, len, limit);
+#elif defined(__clang__) && defined(ARMV8_OS_ANDROID) && !defined(__aarch64__)
+  /* Seems to perform better on 32bit (i.e. Android). */
+  out = chunkcopy_core_safe(out, out - dist, len, limit);
+#elif defined(__clang__)
+  /* Seems to perform better on 64bit. */
+  out = chunkcopy_lapped_safe(out, dist, len, limit);
+#endif
+  return out;
+}
+
 /*
  * The chunk-copy code above deals with writing the decoded DEFLATE data to
  * the output with SIMD methods to increase decode speed. Reading the input
diff --git a/third_party/zlib/contrib/optimizations/inffast_chunk.c b/third_party/zlib/contrib/optimizations/inffast_chunk.c
index 4099edf..4bacbc4 100644
--- a/third_party/zlib/contrib/optimizations/inffast_chunk.c
+++ b/third_party/zlib/contrib/optimizations/inffast_chunk.c
@@ -276,7 +276,7 @@
                            the main copy is near the end.
                           */
                         out = chunkunroll_relaxed(out, &dist, &len);
-                        out = chunkcopy_safe(out, out - dist, len, limit);
+                        out = chunkcopy_safe_ugly(out, dist, len, limit);
                     } else {
                         /* from points to window, so there is no risk of
                            overlapping pointers requiring memset-like behaviour
diff --git a/third_party/zlib/contrib/tests/utils_unittest.cc b/third_party/zlib/contrib/tests/utils_unittest.cc
index ae41f7b..45796f6 100644
--- a/third_party/zlib/contrib/tests/utils_unittest.cc
+++ b/third_party/zlib/contrib/tests/utils_unittest.cc
@@ -89,3 +89,58 @@
   ASSERT_EQ(result, Z_OK);
   EXPECT_EQ(input, decompressed);
 }
+
+TEST(ZlibTest, StreamingInflate) {
+  uint8_t comp_buf[4096], decomp_buf[4096];
+  z_stream comp_strm, decomp_strm;
+  int ret;
+
+  std::vector<uint8_t> src;
+  for (size_t i = 0; i < 1000; i++) {
+    for (size_t j = 0; j < 40; j++) {
+      src.push_back(j);
+    }
+  }
+
+  // Deflate src into comp_buf.
+  comp_strm.zalloc = Z_NULL;
+  comp_strm.zfree = Z_NULL;
+  comp_strm.opaque = Z_NULL;
+  ret = deflateInit(&comp_strm, Z_BEST_COMPRESSION);
+  ASSERT_EQ(ret, Z_OK);
+  comp_strm.next_out = comp_buf;
+  comp_strm.avail_out = sizeof(comp_buf);
+  comp_strm.next_in = src.data();
+  comp_strm.avail_in = src.size();
+  ret = deflate(&comp_strm, Z_FINISH);
+  ASSERT_EQ(ret, Z_STREAM_END);
+  size_t comp_sz = sizeof(comp_buf) - comp_strm.avail_out;
+
+  // Inflate comp_buf one 4096-byte buffer at a time.
+  decomp_strm.zalloc = Z_NULL;
+  decomp_strm.zfree = Z_NULL;
+  decomp_strm.opaque = Z_NULL;
+  ret = inflateInit(&decomp_strm);
+  ASSERT_EQ(ret, Z_OK);
+  decomp_strm.next_in = comp_buf;
+  decomp_strm.avail_in = comp_sz;
+
+  while (decomp_strm.avail_in > 0) {
+    decomp_strm.next_out = decomp_buf;
+    decomp_strm.avail_out = sizeof(decomp_buf);
+    ret = inflate(&decomp_strm, Z_FINISH);
+    ASSERT_TRUE(ret == Z_OK || ret == Z_STREAM_END || ret == Z_BUF_ERROR);
+
+    // Verify the output bytes.
+    size_t num_out = sizeof(decomp_buf) - decomp_strm.avail_out;
+    for (size_t i = 0; i < num_out; i++) {
+      EXPECT_EQ(decomp_buf[i], src[decomp_strm.total_out - num_out + i]);
+    }
+  }
+
+  // Cleanup memory (i.e. makes ASAN bot happy).
+  ret = deflateEnd(&comp_strm);
+  EXPECT_EQ(ret, Z_OK);
+  ret = inflateEnd(&decomp_strm);
+  EXPECT_EQ(ret, Z_OK);
+}