<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="std" docName="draft-midtskogen-netvc-clpf-02"
     ipr="trust200902">
  <front>
    <title abbrev="Constrained Low Pass Filter">Constrained Low Pass Filter</title>

    <author fullname="Steinar Midtskogen" initials="S." surname="Midtskogen">
      <organization>Cisco</organization>

      <address>
        <postal>
          <street></street>

          <city>Lysaker</city>

          <country>Norway</country>
        </postal>

        <phone></phone>

        <email>stemidts@cisco.com</email>
      </address>
    </author>

    <author fullname="Arild Fuldseth" initials="A." surname="Fuldseth">
      <organization>Cisco</organization>

      <address>
        <postal>
          <street></street>

          <city>Lysaker</city>

          <country>Norway</country>
        </postal>

        <phone></phone>

        <email>arilfuld@cisco.com</email>
      </address>
    </author>

    <author fullname="Mo Zanaty" initials="M." surname="Zanaty">
      <organization>Cisco</organization>

      <address>
        <postal>
          <street></street>

          <city>RTP,NC</city>

          <country>USA</country>
        </postal>

        <phone></phone>

        <email>mzanaty@cisco.com</email>
      </address>
    </author>

    <date month="April" year="2016"/>

    <abstract>
      <t>This document describes a low complexity filtering technique which
  is being used as a low pass loop filter in the Thor video codec.</t>
    </abstract>
  </front>

  <middle>
    <section title="Introduction">
      <t>Modern video coding standards such as
        <xref target="I-D.fuldseth-netvc-thor">Thor</xref>
        include in-loop filters
  which correct artifacts introduced in the encoding process.  Thor
  includes a deblocking filter which corrects artifacts introduced by
  the block based nature of the encoding process, and a low pass
  filter correcting artifacts not corrected by the deblocking filter,
  in particular artifacts introduced by quantisation errors of
  transform coefficients and by the interpolation filter.  Since
  in-loop filters have to be applied in both the encoder and decoder,
  it is highly desirable that these filters have low computational
  complexity.</t>
    </section>

    <section title="Definitions">

      <section title="Requirements Language">
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
        "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
        document are to be interpreted as described in <xref
        target="RFC2119">RFC 2119</xref>.</t>
      </section>

      <section title="Terminology">
        <t>This document will refer to a pixel X and six of its neighbouring
  pixels A, B, C, D, E, F ordered in the following pattern.
</t><t>
<figure align="center" anchor="positions" title="Filter pixel positions">
<artwork align="center">
<![CDATA[
+---+---+---+---+---+
|   |   | A |   |   |
+---+---+---+---+---+
| B | C | X | D | E |
+---+---+---+---+---+
|   |   | F |   |   |
+---+---+---+---+---+
]]>
</artwork>
</figure>
</t>
<t>
  In Thor the frames are divided into filter blocks (FB) of 128x128,
  64x64 or 32x32 pixels.  The size is signalled for each frame to be
  filtered.  Also, each frame is divided into coding blocks (CB) which
  range from 8x8 to 128x128 independent of the FB size.  The filter
  described in this draft can be switched on or off for the entire
  frame or optionally on or off for each FB.  CBs that have been
  coded using the skip mode are not filtered, and if an FB only
  contains CBs that have been coded in skip mode, the FB will not be
  filtered and no signal will be transmitted for this FB.
</t>
<t>
  If the frame cannot fit a whole number of FBs, the FBs at the right
  and bottom edges are clipped to fit.  For instance, if the frame
  resolution is 1920x1080 and the FB size is 128x128, the size of the
  FBs at the bottom of the frame becomes 128x56.
</t>
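<t>
  As an illustration of the clipping rule, the size of an FB along one
  axis could be computed as in the following sketch.  The function name
  fb_dim and its parameters are hypothetical and are not taken from the
  Thor source code.
</t><t>
<figure align="center" title="Sketch: FB size at frame edges">
<artwork align="center">
<![CDATA[
```c
/* Hypothetical helper: size along one axis of the FB that starts at
   pixel offset pos, for a frame of frame_dim pixels along that axis
   and a nominal FB size of fb_size (32, 64 or 128). */
int fb_dim(int frame_dim, int fb_size, int pos)
{
  int remaining = frame_dim - pos;

  /* FBs at the right and bottom edges are clipped to fit. */
  return remaining < fb_size ? remaining : fb_size;
}
```
]]>
</artwork>
</figure>
</t>
<t>
  For a 1920x1080 frame and 128x128 FBs, the last row of FBs starts at
  line 1024, and fb_dim(1080, 128, 1024) gives 56, matching the 128x56
  blocks in the example above.
</t>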

      </section>
    </section>

    <section title="Filtering Process">
      <t>
        Given a pixel X and its neighbouring pixels described above we
        can define a general non-linear filter as:
      </t>
      <t>
        <figure align="center" anchor="eq1" title="Equation 1">
          <artwork align="center">
            <![CDATA[
X' = X + clip(a*clip(A-X,-s,s) + b*clip(B-X,-s,s) + c*clip(C-X,-s,s) +
              d*clip(D-X,-s,s) + e*clip(E-X,-s,s) + f*clip(F-X,-s,s),-g,g)
            ]]>
          </artwork>
        </figure>
      </t>
      <t>
        If a neighbour pixel is outside the image frame, it is given the
        same value as the closest pixel within the frame.  To avoid dependencies
        prohibiting parallel processing, all neighbour pixels must be
        the unfiltered pixels of the frame being filtered.
      </t>
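      <t>
        The border rule can be sketched as follows, assuming 8 bit
        samples stored row by row.  The function names, the parameter
        layout and the ordering of the output array are assumptions
        made for this illustration only.
      </t>
      <t>
        <figure align="center" title="Sketch: neighbour lookup at frame borders">
          <artwork align="center">
<![CDATA[
```c
/* Clamp v to the range [lo, hi]. */
static int clip(int v, int lo, int hi)
{
  return v < lo ? lo : v > hi ? hi : v;
}

/* Hypothetical helper: unfiltered pixel at (x, y); a position
   outside the frame maps to the closest pixel inside it. */
static int px(const unsigned char *src, int stride,
              int w, int h, int x, int y)
{
  return src[clip(y, 0, h - 1) * stride + clip(x, 0, w - 1)];
}

/* Gather n[0] = X and n[1..6] = A, B, C, D, E, F for the pixel at
   (x, y) of a w x h frame stored in src with the given stride. */
void clpf_neighbours(const unsigned char *src, int stride,
                     int w, int h, int x, int y, int n[7])
{
  n[0] = px(src, stride, w, h, x,     y);      /* X */
  n[1] = px(src, stride, w, h, x,     y - 1);  /* A */
  n[2] = px(src, stride, w, h, x - 2, y);      /* B */
  n[3] = px(src, stride, w, h, x - 1, y);      /* C */
  n[4] = px(src, stride, w, h, x + 1, y);      /* D */
  n[5] = px(src, stride, w, h, x + 2, y);      /* E */
  n[6] = px(src, stride, w, h, x,     y + 1);  /* F */
}
```
]]>
          </artwork>
        </figure>
      </t>
      <t>
        Because src is only read, writing the filtered pixels to a
        separate buffer keeps every neighbour unfiltered, so the pixels
        can be processed in any order or in parallel.
      </t>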
      <t>
        Experiments in Thor have shown that a good compromise between
        complexity and performance is a=f=1/4, b=e=1/16,
        c=d=3/16 and the filter strength s being 1, 2 or 4
        signalled at frame level.  Since these weights sum to 1, the
        magnitude of the weighted sum is bounded by s, which eliminates
        the need for the outer clipping to +/-g.  The weighted sum is
        rounded to the nearest integer.
      </t>
      <t>
        This gives us the equation:
      </t>
      <t>
        <figure align="center" anchor="eq2" title="Equation 2">
          <artwork align="center">
            <![CDATA[
X' = X + (4*clip(A-X,-s,s) + clip(B-X,-s,s) + 3*clip(C-X,-s,s) +
          3*clip(D-X,-s,s) + clip(E-X,-s,s) + 4*clip(F-X,-s,s)) / 16
            ]]>
          </artwork>
        </figure>
      </t>
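      <t>
        The statement that these weights make the outer clipping
        unnecessary can be checked by brute force.  The following
        sketch, whose function name is hypothetical, finds the largest
        magnitude that the weighted sum in Equation 2 can reach before
        the division by 16.
      </t>
      <t>
        <figure align="center" title="Sketch: bound on the weighted sum">
          <artwork align="center">
<![CDATA[
```c
/* Hypothetical check: largest magnitude of the weighted sum in
   Equation 2 before the division by 16, taken over all combinations
   of the six clipped differences in [-s, s]. */
int clpf_max_abs_sum(int s)
{
  int a, b, c, d, e, f, max = 0;

  for (a = -s; a <= s; a++)
    for (b = -s; b <= s; b++)
      for (c = -s; c <= s; c++)
        for (d = -s; d <= s; d++)
          for (e = -s; e <= s; e++)
            for (f = -s; f <= s; f++) {
              int sum = 4*a + b + 3*c + 3*d + e + 4*f;
              if (sum < 0)
                sum = -sum;
              if (sum > max)
                max = sum;
            }
  return max;
}
```
]]>
          </artwork>
        </figure>
      </t>
      <t>
        The result is 16*s for s = 1, 2 and 4, so after the division by
        16 the correction applied to X never exceeds s in magnitude.
      </t>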
      <t>
        It can be noted that a=c=d=f=1/4, b=e=0 and s=1 give a slightly
        simpler filter which is very similar to the one described in the
        first version of this draft.
      </t>
      <t>
        The filter leaves the encoder 13 different choices for a
        frame.  The filter can be disabled for the entire frame, or
        the frame can be filtered using one of the 12 distinct
        combinations of strength (1, 2 or 4), non-skip FB signal
        (enabled/disabled) and FB size (32x32, 64x64 or 128x128).
        Note that the FB size only matters when FB signalling is in
        use, so each strength contributes one configuration without FB
        signalling and three configurations with it.
      </t>
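      <t>
        The count of 13 can be reproduced by enumerating the
        configurations, as in the sketch below.  The structure and the
        function name are hypothetical and only serve to make the
        counting explicit.
      </t>
      <t>
        <figure align="center" title="Sketch: enumerating the frame level choices">
          <artwork align="center">
<![CDATA[
```c
/* Hypothetical description of one frame level configuration. */
struct clpf_config {
  int enabled;    /* 0: filter disabled for the whole frame */
  int strength;   /* 1, 2 or 4 */
  int fb_signal;  /* 1: on/off signalled for each non-skip FB */
  int fb_size;    /* 32, 64 or 128; 0 when fb_signal is 0 */
};

/* Fill out[] with the distinct configurations and return how many
   there are.  out must have room for at least 13 entries. */
int clpf_enumerate(struct clpf_config *out)
{
  static const int strengths[3] = { 1, 2, 4 };
  static const int sizes[3] = { 32, 64, 128 };
  int n = 0, i, j;

  out[n++] = (struct clpf_config){ 0, 0, 0, 0 };  /* disabled */
  for (i = 0; i < 3; i++) {
    /* Without FB signalling the FB size is irrelevant. */
    out[n++] = (struct clpf_config){ 1, strengths[i], 0, 0 };
    for (j = 0; j < 3; j++)
      out[n++] = (struct clpf_config){ 1, strengths[i], 1, sizes[j] };
  }
  return n;
}
```
]]>
          </artwork>
        </figure>
      </t>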
      <t>
        The decisions at both frame level and FB level may be based on
        rate-distortion optimisation (RDO), but an encoder running in
        a low-complexity mode, or possibly a low-delay mode, may
        instead assume that a fixed mode will be beneficial.  In
        general, using s=2, a QP dependent FB size and RDO only at the
        FB level gives good results.
      </t>

      <t>
        However, because of the low complexity of the filter, fully
        RDO based decisions are not costly.  The distortion of the 13
        configurations of the filter can easily be computed in a
        single pass by keeping track of the distortions of the three
        different strengths and the bit costs for different FB sizes.
      </t>
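      <t>
        A minimal sketch of the distortion part of such a single pass
        is given below.  It assumes that distortion is measured as the
        sum of squared errors against the original pixel, which this
        document does not specify, and that n[0] holds X and n[1..6]
        hold A to F.  All names are hypothetical, and the bit costs
        for the FB sizes are not shown.
      </t>
      <t>
        <figure align="center" title="Sketch: distortion for all strengths in one pass">
          <artwork align="center">
<![CDATA[
```c
/* Clamp v to the range [lo, hi]. */
static int clip(int v, int lo, int hi)
{
  return v < lo ? lo : v > hi ? hi : v;
}

/* Correction term of Equation 2 for strength s, with n[0] = X and
   n[1..6] = A, B, C, D, E, F. */
static int clpf_delta(const int n[7], int s)
{
  int X = n[0];
  int sum =
    4*clip(n[1] - X, -s, s) + clip(n[2] - X, -s, s) +
    3*clip(n[3] - X, -s, s) + 3*clip(n[4] - X, -s, s) +
    clip(n[5] - X, -s, s) + 4*clip(n[6] - X, -s, s);

  /* Divide by 16 rounding to nearest, halves away from zero,
     without shifting a negative value. */
  return sum < 0 ? -((8 - sum) >> 4) : (8 + sum) >> 4;
}

/* Add the squared errors for one pixel to sse[0..3]: sse[0] is for
   the unfiltered pixel, sse[1..3] for s = 1, 2 and 4 (assumed
   distortion measure: sum of squared errors against org). */
void clpf_accumulate_sse(const int n[7], int org, long sse[4])
{
  static const int strengths[3] = { 1, 2, 4 };
  int i, err = n[0] - org;

  sse[0] += err * err;
  for (i = 0; i < 3; i++) {
    err = n[0] + clpf_delta(n, strengths[i]) - org;
    sse[i + 1] += err * err;
  }
}
```
]]>
          </artwork>
        </figure>
      </t>
      <t>
        Keeping one such set of four sums per FB would let an encoder
        derive the distortion of every configuration afterwards by
        adding up the per-FB sums for the chosen strength.
      </t>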
      <t>
        The filter is applied after the deblocking filter.
      </t>
    </section>

    <section title="Further complexity considerations">
      <t>
  The filter has been designed to offer the best compromise between
  low complexity and performance.  The correction for a single pixel
  can be computed with simple operations, as illustrated by this C
  function, whose return value is added to X to give X':
</t><t>
<figure align="center" anchor="eq3" title="C code">
<artwork align="center">
<![CDATA[
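/* Clamp x to the range [lo, hi]. */
static int clip(int x, int lo, int hi)
{
  return x < lo ? lo : x > hi ? hi : x;
}

/* Returns the rounded correction; the filtered pixel is X + result. */
int clpf_sample(int X, int A, int B, int C, int D, int E, int F, int s)
{
  int delta =
    4*clip(A - X, -s, s) + clip(B - X, -s, s) + 3*clip(C - X, -s, s) +
    3*clip(D - X, -s, s) + clip(E - X, -s, s) + 4*clip(F - X, -s, s);
  return (8 + delta - (delta < 0)) >> 4;  // Assumes arithmetic shift
}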
]]>
</artwork>
</figure>
</t>
<t>
  Also, these operations are easily vectorised on architectures
  supporting SIMD instructions, such as x86/SSE4 and ARM/NEON.  The
  pixel difference requires 9 bits, but it can be computed by adding
  an 8 bit offset and using 8 bit saturating signed subtraction, since
  saturation does not change the result once the difference has been
  clipped to +/-s.  This means that 16 pixels per core can be filtered
  in parallel on these architectures.  Clipping at frame borders can
  be implemented using shuffle instructions.
</t>
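<t>
  The offset and saturating subtraction technique can be modelled in
  scalar C as sketched below.  The offset value of 128 and all function
  names are assumptions of this sketch; a real implementation would use
  the corresponding SIMD instructions on 16 pixels at a time.
</t><t>
<figure align="center" title="Sketch: clipped difference with 8 bit operations">
<artwork align="center">
<![CDATA[
```c
#include <stdint.h>

/* Scalar model of an 8 bit saturating signed subtraction. */
static int8_t ssub_s8(int8_t a, int8_t b)
{
  int d = a - b;
  return (int8_t)(d < -128 ? -128 : d > 127 ? 127 : d);
}

/* clip(a - x, -s, s) for 8 bit pixels a and x using only 8 bit
   values.  The offset of 128 is an assumption of this sketch. */
int clipped_diff_8bit(uint8_t a, uint8_t x, int s)
{
  /* Subtracting 128 maps 0..255 onto -128..127. */
  int8_t sa = (int8_t)(a - 128);
  int8_t sx = (int8_t)(x - 128);
  int d = ssub_s8(sa, sx);

  return d < -s ? -s : d > s ? s : d;
}

/* Returns 1 if the 8 bit version agrees with the 9 bit reference
   for all pixel pairs and for s = 1, 2 and 4, otherwise 0. */
int clipped_diff_check(void)
{
  static const int strengths[3] = { 1, 2, 4 };
  int a, x, i;

  for (i = 0; i < 3; i++)
    for (a = 0; a < 256; a++)
      for (x = 0; x < 256; x++) {
        int s = strengths[i];
        int ref = a - x;
        ref = ref < -s ? -s : ref > s ? s : ref;
        if (clipped_diff_8bit((uint8_t)a, (uint8_t)x, s) != ref)
          return 0;
      }
  return 1;
}
```
]]>
</artwork>
</figure>
</t>
<t>
  The check succeeds because a saturated difference still has a
  magnitude of at least 127, which is far beyond the largest
  strength, so the subsequent clipping to +/-s yields the same value.
</t>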
<t>
  A C implementation using x86/SSE4 intrinsics required 6.8
  instructions per pixel to filter a single 8x8 block.  The
  corresponding number for ARM/NEON (armv7) was 4.9.  The compiler was
  gcc 4.8.4 in both cases.
</t>

<t>
  Since the filter only needs to look up pixels in the line directly
  above and below the pixel to be filtered, the line buffer
  requirement in hardware implementations is very low.
</t>
</section>
<section title="Performance">
<t>
  The table below shows the filter's effect on the bandwidth for a
  selection of 10 second video sequences encoded in Thor with
  uni-prediction only.  The numbers have been computed using the
  Bjontegaard Delta Rate (BDR).  BDR-low and BDR-high indicate the
  effect at low and high bitrates, respectively, as described in
  <xref target="BDR">BDR</xref>.
  </t><t>
  The effect of the filter was tested in two low-delay encoder
  configurations: high complexity, in which the encoder
  strongly favours compression efficiency over CPU usage, and medium
  complexity, which is more suited for real-time applications.  The
  bandwidth reduction is somewhat less in the high complexity
  configuration.

</t><t>
<figure align="center" anchor="perf1" title="Compression Performance without Biprediction">
<artwork align="center">
<![CDATA[
+----------------+--------------------+--------------------+
|                | MEDIUM COMPLEXITY  |  HIGH COMPLEXITY   |
+----------------+------+------+------+------+------+------+
|                |      | BDR- | BDR- |      | BDR- | BDR- |
|Sequence        |  BDR | low  | high |  BDR | low  | high |
+----------------+------+------+------+------+------+------+
|Kimono          | -2.7%| -2.3%| -3.4%| -1.9%| -1.8%| -2.0%|
|BasketballDrive | -3.3%| -2.5%| -4.5%| -2.1%| -1.6%| -3.0%|
|BQTerrace       | -7.2%| -4.9%| -9.1%| -5.5%| -3.7%| -6.7%|
|FourPeople      | -5.7%| -3.9%| -8.6%| -4.0%| -2.8%| -6.0%|
|Johnny          | -5.9%| -4.0%| -9.0%| -4.7%| -4.0%| -5.8%|
|ChangeSeats     | -6.4%| -3.4%|-10.8%| -4.5%| -2.8%| -6.8%|
|HeadAndShoulder | -8.6%| -2.6%|-18.8%| -5.8%| -2.2%|-11.1%|
|TelePresence    | -5.9%| -3.1%|-10.7%| -4.0%| -2.0%| -7.0%|
+----------------+------+------+------+------+------+------+
|Average         | -5.7%| -3.3%| -9.4%| -4.0%| -2.6%| -6.0%|
+----------------+------+------+------+------+------+------+
]]>
</artwork>
</figure>
</t><t>
  While the filter objectively performs better at relatively high
  bitrates, the subjective effect seems better at relatively low
  bitrates, and overall the subjective effect seems better than what
  the objective numbers suggest.
</t><t>
  If biprediction is allowed, the bandwidth reduction is generally
  smaller, as the table below shows.  These results reflect low-delay
  biprediction without frame reordering.
</t><t>
<figure align="center" anchor="perf2" title="Compression Performance with Biprediction">
<artwork align="center">
<![CDATA[
+----------------+--------------------+--------------------+
|                | MEDIUM COMPLEXITY  |  HIGH COMPLEXITY   |
+----------------+------+------+------+------+------+------+
|                |      | BDR- | BDR- |      | BDR- | BDR- |
|Sequence        |  BDR | low  | high |  BDR | low  | high |
+----------------+------+------+------+------+------+------+
|Kimono          | -2.2%| -1.8%| -2.7%| -1.4%| -1.3%| -1.5%|
|BasketballDrive | -2.6%| -2.5%| -2.7%| -1.4%| -1.6%| -1.1%|
|BQTerrace       | -4.1%| -3.1%| -4.7%| -2.7%| -2.7%| -2.5%|
|FourPeople      | -4.0%| -2.9%| -5.3%| -2.7%| -1.9%| -3.4%|
|Johnny          | -3.5%| -2.7%| -4.6%| -2.2%| -1.6%| -3.1%|
|ChangeSeats     | -4.2%| -3.0%| -6.1%| -2.6%| -2.0%| -3.2%|
|HeadAndShoulder | -4.1%| -2.9%| -6.1%| -2.3%| -1.8%| -2.8%|
|TelePresence    | -2.8%| -1.9%| -4.3%| -1.6%| -1.2%| -2.1%|
+----------------+------+------+------+------+------+------+
|Average         | -3.4%| -2.6%| -4.6%| -2.1%| -1.9%| -2.5%|
+----------------+------+------+------+------+------+------+
]]>
</artwork>
</figure>

</t>
</section>

    <section anchor="IANA" title="IANA Considerations">
      <t>This document has no IANA considerations yet. TBD</t>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>This document has no security considerations yet. TBD</t>
    </section>

    <section anchor="Acknowledgements" title="Acknowledgements">
      <t>The authors would like to thank Gisle Bjontegaard for reviewing
        this document and the design, and for providing constructive
        feedback and direction.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>
      <?rfc include='reference.I-D.fuldseth-netvc-thor'?>
    </references>

    <references title="Informative References">
      <reference anchor="BDR">
         <front>
           <title>Calculation of average PSNR differences between RD-curves</title>
           <author initials="G." surname="Bjontegaard" fullname="Gisle Bjontegaard" />
           <date month="April" year="2001"/>
         </front>
        <seriesInfo name="ITU-T SG16 Q6 VCEG-M33" value="" />
      </reference>
    </references>
  </back>
</rfc>
