
Commit 91c0ba7

Merge remote-tracking branch 'upstream/master'
2 parents 09ea1e2 + 2c18ef2 commit 91c0ba7

13 files changed: 296 additions & 64 deletions


docs/source/collections-api.rst

Lines changed: 3 additions & 0 deletions
@@ -86,6 +86,9 @@ Dataframes
    Rolling.sum
    Rolling.var
 
+.. autosummary::
+   PeriodicDataFrame
+
 .. autosummary::
    Random

docs/source/conf.py

Lines changed: 2 additions & 2 deletions
@@ -48,7 +48,7 @@
 
 # General information about the project.
 project = 'Streamz'
-copyright = '2017, Matthew Rocklin'
+copyright = '2017-2020, Matthew Rocklin'
 author = 'Matthew Rocklin'
 
 # The version info for the project you're documenting, acts as replacement for
@@ -160,7 +160,7 @@
 # dir menu entry, description, category)
 texinfo_documents = [
     (master_doc, 'Streamz', 'Streamz Documentation',
-     author, 'Streamz', 'One line description of project.',
+     author, 'Streamz', 'Support for pipelines managing continuous streams of data.',
      'Miscellaneous'),
 ]

docs/source/core.rst

Lines changed: 32 additions & 8 deletions
@@ -15,7 +15,9 @@ Map, emit, and sink
    map
    sink
 
-You can create a basic pipeline by instantiating the ``Streamz`` object and then using methods like ``map``, ``accumulate``, and ``sink``.
+You can create a basic pipeline by instantiating the ``Streamz``
+object and then using methods like ``map``, ``accumulate``, and
+``sink``.
 
 .. code-block:: python
 
@@ -27,7 +29,10 @@ You can create a basic pipeline by instantiating the ``Streamz`` object and then
     source = Stream()
     source.map(increment).sink(print)
 
-The ``map`` and ``sink`` methods both take a function and apply that function to every element in the stream. The ``map`` method returns a new stream with the modified elements while ``sink`` is typically used at the end of a stream for final actions.
+The ``map`` and ``sink`` methods both take a function and apply that
+function to every element in the stream. The ``map`` method returns a
+new stream with the modified elements while ``sink`` is typically used
+at the end of a stream for final actions.
 
 To push data through our pipeline we call ``emit``
 
@@ -383,14 +388,33 @@ want to read further about :doc:`collections <collections>`
 Metadata
 --------
 
-Metadata can be emitted into the pipeline to accompany the data as a list of dictionaries. Most functions will pass the metadata to the downstream function without making any changes. However, functions that make the pipeline asynchronous require logic that dictates how and when the metadata will be passed downstream. Synchronous functions and asynchronous functions that have a 1:1 ratio of the number of values on the input to the number of values on the output will emit the metadata collection without any modification. However, functions that have multiple input streams or emit collections of data will emit the metadata associated with the emitted data as a collection.
+Metadata can be emitted into the pipeline to accompany the data as a
+list of dictionaries. Most functions will pass the metadata to the
+downstream function without making any changes. However, functions
+that make the pipeline asynchronous require logic that dictates how
+and when the metadata will be passed downstream. Synchronous functions
+and asynchronous functions that have a 1:1 ratio of the number of
+values on the input to the number of values on the output will emit
+the metadata collection without any modification. However, functions
+that have multiple input streams or emit collections of data will emit
+the metadata associated with the emitted data as a collection.
 
 
 Reference Counting and Checkpointing
 ------------------------------------
 
-Checkpointing is achieved in Streamz through the use of reference counting. With this method, a checkpoint can be saved when and only when data has progressed through all of the the pipeline without any issues. This prevents data loss and guarantees at-least-once semantics.
-
-Any node that caches or holds data after it returns increments the reference counter associated with the given data by one. When a node is no longer holding the data, it will release it by decrementing the counter by one. When the counter changes to zero, a callback associated with the data is triggered.
-
-References are passed in the metadata as a value of the `ref` keyword. Each metadata object contains only one reference counter object.
+Checkpointing is achieved in Streamz through the use of reference
+counting. With this method, a checkpoint can be saved when and only
+when data has progressed through all of the pipeline without any
+issues. This prevents data loss and guarantees at-least-once
+semantics.
+
+Any node that caches or holds data after it returns increments the
+reference counter associated with the given data by one. When a node
+is no longer holding the data, it will release it by decrementing the
+counter by one. When the counter changes to zero, a callback
+associated with the data is triggered.
+
+References are passed in the metadata as a value of the `ref`
+keyword. Each metadata object contains only one reference counter
+object.
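
To make the metadata and reference-counting behavior described in this hunk concrete, here is a minimal sketch. It assumes streamz 0.6's ``metadata=`` keyword on ``emit`` and a ``RefCounter`` class in ``streamz.core`` that accepts a completion callback; check the release for the exact names.

.. code-block:: python

    from streamz import Stream
    from streamz.core import RefCounter  # assumed import location

    source = Stream()
    source.map(lambda x: x + 1).sink(print)

    # The callback fires when the count returns to zero, i.e. once every
    # node holding the element has released it, which is a safe checkpoint.
    ref = RefCounter(cb=lambda: print("element fully processed"))

    # Metadata accompanies the data as a list of dictionaries; the
    # counter travels under the `ref` keyword described above.
    source.emit(1, metadata=[{'ref': ref}])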

docs/source/dask.rst

Lines changed: 12 additions & 9 deletions
@@ -36,7 +36,7 @@ Then start a local Dask cluster
    from dask.distributed import Client
    client = Client()
 
-This operates on a local processes or threads. If you have Bokeh installed
+This operates on local processes or threads. If you have Bokeh installed
 then this will also start a diagnostics web server at
 http://localhost:8787/status which you may want to open to get a real-time view
 of execution.
@@ -49,7 +49,7 @@ Sequential Execution
    map
    sink
 
-Before we build a parallel stream, lets build a sequential stream that maps a
+Before we build a parallel stream, let's build a sequential stream that maps a
 simple function across data, and then prints those results. We use the core
 ``Stream`` object.
@@ -69,7 +69,7 @@ simple function across data, and then prints those results. We use the core
    for i in range(10):
        source.emit(i)
 
-This should take ten seconds we call the ``inc`` function ten times
+This should take ten seconds because we call the ``inc`` function ten times
 sequentially.
 
 Parallel Execution
@@ -101,7 +101,7 @@ You may want to look at http://localhost:8787/status during execution to get a
 sense of the parallel execution.
 
 This should have run much more quickly depending on how many cores you have on
-your machine. We added a few extra nodes to our stream, lets look at what they
+your machine. We added a few extra nodes to our stream; let's look at what they
 did.
 
 - ``scatter``: Converted our Stream into a DaskStream. The elements that we
@@ -123,17 +123,20 @@ Gotchas
 +++++++
 
 
-An important gotcha with ``DaskStream`` is that it is a subclass ``Stream``, and so can be used as an input
-to any function expecting a ``Stream``. If there is no intervening ``.gather()``, then the downstream node will
-receive Dask futures instead of the data they represent::
+An important gotcha with ``DaskStream`` is that it is a subclass of
+``Stream``, and so can be used as an input to any function expecting a
+``Stream``. If there is no intervening ``.gather()``, then the
+downstream node will receive Dask futures instead of the data they
+represent::
 
     source = Stream()
     source2 = Stream()
     a = source.scatter().map(inc)
     b = source2.combine_latest(a)
 
-In this case, the combine operation will get real values from ``source2``, and Dask futures.
-Downstream nodes would be free to operate on the futures, but more likely, the line should be::
+In this case, the combine operation will get real values from
+``source2``, and Dask futures. Downstream nodes would be free to
+operate on the futures, but more likely, the line should be::
 
     b = source2.combine_latest(a.gather())
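
As a recap of the scatter/gather pattern this file describes, a minimal end-to-end sketch (only streamz operations named in this documentation are used; the ``buffer(8)`` width is an arbitrary choice):

.. code-block:: python

    from dask.distributed import Client
    from streamz import Stream

    client = Client()  # local Dask cluster

    def inc(x):
        return x + 1

    source = Stream()
    (source.scatter()   # ship elements to the cluster as futures
           .map(inc)    # runs remotely, in parallel
           .buffer(8)   # allow up to eight elements in flight
           .gather()    # pull concrete results back locally
           .sink(print))

    for i in range(10):
        source.emit(i)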

docs/source/dataframes.rst

Lines changed: 76 additions & 2 deletions
@@ -4,7 +4,7 @@ DataFrames
 When handling large volumes of streaming tabular data it is often more
 efficient to pass around larger Pandas dataframes with many rows each rather
 than pass around individual Python tuples or dicts. Handling and computing on
-data with Pandas can be much faster than operating on Python objects.
+data with Pandas can be much faster than operating on individual Python objects.
 
 So one could imagine building streaming dataframe pipelines using the ``.map``
 and ``.accumulate`` streaming operators with functions that consume and produce
@@ -178,5 +178,79 @@ and ``DaskStream`` objects.
 Not Yet Supported
 -----------------
 
-Streaming dataframes algorithms do not currently pay special attention to data
+Streaming dataframe algorithms do not currently pay special attention to data
 arriving out-of-order.
+
+
+PeriodicDataFrame
+-----------------
+
+As you have seen above, Streamz can handle arbitrarily complex pipelines,
+events, and topologies, but what if you simply want to run some Python
+function periodically and collect or plot the results?
+
+streamz provides a high-level convenience class for this purpose, called
+a PeriodicDataFrame. A PeriodicDataFrame uses Python's asyncio event loop
+(used as part of Tornado in Jupyter and other interactive frameworks) to
+call a user-provided function at a regular interval, collecting the results
+and making them available for later processing.
+
+In the simplest case, you can use a PeriodicDataFrame by first writing
+a callback function like:
+
+.. code-block:: python
+
+    import numpy as np
+
+    def random_datapoint(**kwargs):
+        return pd.DataFrame({'a': np.random.random(1)}, index=[pd.Timestamp.now()])
+
+You can then make a streaming dataframe to poll this function
+e.g. every 300 milliseconds:
+
+.. code-block:: python
+
+    df = PeriodicDataFrame(random_datapoint, interval='300ms')
+
+``df`` will now be a steady stream of whatever values are returned by
+the `datafn`, which can of course be any Python code as long as it
+returns a DataFrame.
+
+Here we returned only a single point, appropriate for streaming the
+results of system calls or other isolated actions, but any number of
+entries can be returned in a single batch. To
+facilitate collecting such batches, the callback is invoked with
+keyword arguments ``last`` (the time of the previous invocation) and
+``now`` (the time of the current invocation) as Pandas Timestamp
+objects. The callback can then generate or query for just the values
+in that time range.
+
+Arbitrary keyword arguments can be provided to the PeriodicDataFrame
+constructor, which will be passed into the callback so that its behavior
+can be parameterized.
+
+For instance, you can write a callback to return a suitable number of
+datapoints to keep a regularly updating stream, generated randomly
+as a batch since the last call:
+
+.. code-block:: python
+
+    def datablock(last, now, **kwargs):
+        freq = kwargs.get("freq", pd.Timedelta("50ms"))
+        index = pd.date_range(start=last + freq, end=now, freq=freq)
+        return pd.DataFrame({'x': np.random.random(len(index))}, index=index)
+
+    df = PeriodicDataFrame(datablock, interval='300ms')
+
+The callback will now be invoked every 300ms, each time generating
+datapoints at a rate of 1 every 50ms, returned as a batch. If you
+wished, you could override the 50ms value by passing
+`freq=pd.Timedelta("100ms")` to the PeriodicDataFrame constructor.
+
+Similar code could e.g. query an external database for the time range
+since the last update, returning all datapoints since then.
+
+Once you have a PeriodicDataFrame defined using such callbacks, you
+can then use all the rest of the functionality supported by streamz,
+including aggregations, rolling windows, etc., and streaming
+`visualization <plotting>`_.
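
A short sketch tying the pieces above together: forwarding a constructor kwarg to the callback and applying a windowed aggregation downstream (the ``.stream.sink`` hookup and the two-second window are assumptions based on the streamz collections API):

.. code-block:: python

    import numpy as np
    import pandas as pd
    from streamz.dataframe import PeriodicDataFrame

    def datablock(last, now, **kwargs):
        freq = kwargs.get("freq", pd.Timedelta("50ms"))
        index = pd.date_range(start=last + freq, end=now, freq=freq)
        return pd.DataFrame({'x': np.random.random(len(index))}, index=index)

    # `freq` is forwarded to the callback, halving the datapoint rate
    # to one every 100ms.
    df = PeriodicDataFrame(datablock, interval='300ms',
                           freq=pd.Timedelta("100ms"))

    # The result is an ordinary streaming dataframe, so windowed
    # aggregations apply; print a two-second windowed sum as it updates.
    df.window(value='2s').x.sum().stream.sink(print)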

docs/source/gpu-dataframes.rst

Lines changed: 16 additions & 12 deletions
@@ -1,13 +1,15 @@
-Streaming GPU DataFrames(cudf)
-------------------------------
+Streaming GPU DataFrames (cudf)
+-------------------------------
 
-The ``streamz.dataframe`` module provides DataFrame-like interface on streaming
-data as described in ``dataframes`` documentation. It provides support for dataframe
-like libraries such as pandas and cudf. This documentation is specific to streaming GPU
-dataframes(cudf).
+The ``streamz.dataframe`` module provides a DataFrame-like interface
+on streaming data as described in the ``dataframes`` documentation. It
+provides support for dataframe-like libraries such as pandas and
+cudf. This documentation is specific to streaming GPU dataframes using
+cudf.
 
-The example in the ``dataframes`` documentation is rewritten below using cudf dataframes
-just by replacing ``pandas`` module with ``cudf``:
+The example in the ``dataframes`` documentation is rewritten below
+using cudf dataframes just by replacing the ``pandas`` module with
+``cudf``:
 
 .. code-block:: python
 
@@ -23,19 +25,21 @@ just by replacing ``pandas`` module with ``cudf``:
 Supported Operations
 --------------------
 
-Streaming cudf dataframes support the following classes of operations
+Streaming cudf dataframes support the following classes of operations:
 
 - Elementwise operations like ``df.x + 1``
 - Filtering like ``df[df.name == 'Alice']``
 - Column addition like ``df['z'] = df.x + df.y``
 - Reductions like ``df.amount.mean()``
 - Windowed aggregations (fixed length) like ``df.window(n=100).amount.sum()``
 
-The following operations are not supported with cudf(as of version 0.8) yet
+The following operations are not yet supported with cudf (as of version 0.8):
+
 - Groupby-aggregations like ``df.groupby(df.name).amount.mean()``
 - Windowed aggregations (index valued) like ``df.window(value='2h').amount.sum()``
 - Windowed groupby aggregations like ``df.window(value='2h').groupby('name').amount.sum()``
 
 
-Window based Aggregations with cudf are supported just as explained in ``dataframes`` documentation.
-The support for groupby operations will be added in future.
+Window-based Aggregations with cudf are supported just as explained in
+the ``dataframes`` documentation. Support for groupby operations is
+expected to be added in the future.
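
For reference, a hedged sketch of that pandas-to-cudf substitution, mirroring the streaming-dataframe example from the ``dataframes`` page (``DataFrame(stream, example=...)`` is the documented streamz constructor; the ``.stream.sink(print)`` hookup and exact cudf semantics are assumptions):

.. code-block:: python

    import cudf
    from streamz import Stream
    from streamz.dataframe import DataFrame

    source = Stream()
    example = cudf.DataFrame({'name': [], 'amount': []})
    sdf = DataFrame(source, example=example)

    # Filtering and reductions work as with pandas-backed frames:
    sdf[sdf.name == 'Alice'].amount.mean().stream.sink(print)

    # Push a cudf batch through the pipeline:
    source.emit(cudf.DataFrame({'name': ['Alice', 'Bob'],
                                'amount': [100, 200]}))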

docs/source/index.rst

Lines changed: 2 additions & 0 deletions
@@ -111,8 +111,10 @@ data streaming systems like `Apache Flink <https://flink.apache.org/>`_,
 
    core.rst
    dataframes.rst
+   gpu-dataframes.rst
    dask.rst
    collections.rst
    api.rst
    collections-api.rst
    async.rst
+   plotting.rst

docs/source/plotting.rst

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
+Visualizing streamz
+===================
+
+A variety of tools are available to help you understand, debug, and
+visualize your streaming objects:
+
+- Most Streamz objects automatically display themselves in Jupyter
+  notebooks, periodically updating their visual representation as text
+  or tables by registering events with the Tornado IOLoop used by Jupyter
+- The network graph underlying a stream can be visualized using `dot` to
+  render a PNG using `Stream.visualize(filename)`
+- Streaming data can be visualized using the optional separate packages
+  hvPlot, HoloViews, and Panel (see below)
+
+
+hvplot.streamz
+--------------
+
+hvPlot is a separate plotting library providing Bokeh-based plots for
+Pandas dataframes and a variety of other object types, including
+streamz DataFrame and Series objects.
+
+See `hvplot.holoviz.org <https://hvplot.holoviz.org>`_ for
+instructions on how to install hvplot. Once it is installed, you can
+use the Pandas .plot() API to get a dynamically updating plot in
+Jupyter or in Bokeh/Panel Server:
+
+.. code-block:: python
+
+    import hvplot.streamz
+    from streamz.dataframe import Random
+
+    df = Random()
+    df.hvplot(backlog=100)
+
+See the `streaming section
+<https://hvplot.holoviz.org/user_guide/Streaming.html>`_ of the hvPlot
+user guide for more details, and the `dataframes.ipynb` example that
+comes with streamz for a simple runnable example.
+
+
+HoloViews
+---------
+
+hvPlot is built on HoloViews, and you can also use HoloViews directly
+if you want more control over events and how they are processed. See
+the `HoloViews user guide
+<http://holoviews.org/user_guide/Streaming_Data.html>`_ for more
+details.
+
+
+Panel
+-----
+
+Panel is a general purpose dashboard and app framework, supporting a
+wide variety of displayable objects as "Panes". Panel provides a
+`streamz Pane
+<https://panel.holoviz.org/reference/panes/Streamz.html>`_ for
+rendering arbitrary streamz objects, and streamz DataFrames are
+handled by the Panel `DataFrame Pane
+<https://panel.holoviz.org/reference/panes/DataFrame.html>`_.
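
A minimal sketch of the Panel route, assuming only that ``pn.pane.DataFrame`` accepts a streamz DataFrame as the linked reference describes (the ``height`` value is an arbitrary choice):

.. code-block:: python

    import panel as pn
    from streamz.dataframe import Random

    pn.extension()

    df = Random()
    # The pane re-renders in place as new batches arrive; display it
    # directly in Jupyter, or serve it with `panel serve`.
    pn.pane.DataFrame(df, height=300).servable()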

setup.py

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@
 
 
 setup(name='streamz',
-      version='0.5.6',
+      version='0.6.0',
       description='Streams',
       url='http://github.com/python-streamz/streamz/',
       maintainer='Matthew Rocklin',

streamz/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -8,4 +8,4 @@
 except ImportError:
     pass
 
-__version__ = '0.5.6'
+__version__ = '0.6.0'
