UPSTREAM: netfilter: x_tables: pack percpu counter allocations

Instead of allocating each xt_counter individually, allocate 4k chunks
and serve counter allocation requests from them.

This should speed up rule evaluation by increasing data locality. It
also speeds up ruleset loading because we make fewer calls to the percpu
allocator.

As Eric points out, we can't use PAGE_SIZE: page_allocator would fail on
arches with a 64k page size.
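
For illustration only, the packing idea can be sketched in userspace C.
This is not the code from this patch: struct counter, struct alloc_state,
counter_alloc() and BLOCK_SIZE are hypothetical names, calloc() stands in
for the percpu allocator, and freeing of blocks is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Fixed chunk size; deliberately not tied to the system page size. */
#define BLOCK_SIZE 4096

/* Hypothetical stand-in for a packet/byte counter pair (16 bytes). */
struct counter {
	uint64_t pcnt;
	uint64_t bcnt;
};

/* Tracks the chunk that counters are currently being carved from. */
struct alloc_state {
	char *mem;       /* current chunk, NULL when a new one is needed */
	size_t off;      /* bytes already handed out from the chunk */
	unsigned blocks; /* how many times the backing allocator was called */
};

/* Hand out the next counter slot, grabbing a new chunk only when needed. */
static struct counter *counter_alloc(struct alloc_state *s)
{
	struct counter *c;

	if (!s->mem) {
		s->mem = calloc(1, BLOCK_SIZE);
		if (!s->mem)
			return NULL;
		s->off = 0;
		s->blocks++;
	}

	assert(s->off + sizeof(*c) <= BLOCK_SIZE);
	c = (struct counter *)(s->mem + s->off);
	s->off += sizeof(*c);

	/* No room for another counter: the next call starts a new chunk. */
	if (s->off > BLOCK_SIZE - sizeof(*c))
		s->mem = NULL;

	return c;
}
```

In this sketch a 4096-byte chunk holds 256 counters of 16 bytes each, so
1000 allocation requests reach the backing allocator 4 times instead of
1000, and consecutive counters sit next to each other in memory.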

BUG=b:34936410
BUG=chromium:689152
TEST=Booted an image with this patch and verified iptables-restore
performance improvement.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit ae0ac0ed6fcf ("netfilter: x_tables: pack
percpu counter allocations"))
Signed-off-by: Amey Deshpande <ameyd@google.com>

Change-Id: I9cc42750879a69ce7426458463e6d35037e2e8f8
Previous-Reviewed-on: https://chromium-review.googlesource.com/437595
(cherry picked from commit e53474eb22f57d4a2af222c4996e22815f054012)
Reviewed-on: https://chromium-review.googlesource.com/447298
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Commit-Queue: Amey Deshpande <ameyd@google.com>
Tested-by: Amey Deshpande <ameyd@google.com>
5 files changed