Interface DataBag<T>

  • All Superinterfaces:
    org.apache.jena.atlas.lib.Closeable, java.lang.Iterable<T>, org.apache.jena.atlas.lib.Sink<T>
    All Known Implementing Classes:
    AbstractDataBag, DefaultDataBag, DistinctDataBag, DistinctDataNet, SortedDataBag

    public interface DataBag<T>
    extends org.apache.jena.atlas.lib.Sink<T>, java.lang.Iterable<T>, org.apache.jena.atlas.lib.Closeable
    A collection of tuples that may or may not fit into memory. A DataBag proactively spills to disk when its size exceeds the threshold. When it spills, it takes whatever it has in memory, opens a spill file, and writes the contents out. This may happen multiple times. The bag tracks all of the files it has spilled to.

    DataBag provides an Iterator interface that allows callers to read through the contents. The iterators are aware of the data spilling, so they must be able to read from the spill files.
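
    The spill-and-iterate behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not Jena's implementation: the class SpillingBagSketch, its count-based threshold, the temp-file prefix, and the one-item-per-line text format are all assumptions made for the example, and items must not contain line breaks.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical illustration of spilling; not the Jena source. */
class SpillingBagSketch implements Iterable<String>, AutoCloseable {
    private final int threshold;                              // max items kept in memory
    private final List<String> memory = new ArrayList<>();
    private final List<Path> spillFiles = new ArrayList<>();  // every file spilled to
    private long size = 0;

    SpillingBagSketch(int threshold) { this.threshold = threshold; }

    void add(String item) {
        memory.add(item);
        size++;
        if (memory.size() > threshold) spill();               // may happen many times
    }

    /** Writes everything in memory to a new spill file, one item per line. */
    private void spill() {
        try {
            Path file = Files.createTempFile("bag-sketch", ".spill");
            Files.write(file, memory, StandardCharsets.UTF_8);
            spillFiles.add(file);
            memory.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    long size() { return size; }                              // memory plus disk
    int spillCount() { return spillFiles.size(); }

    /** Reads each spill file in turn, then the items still in memory. */
    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int fileIndex = 0;
            private BufferedReader reader = null;
            private final Iterator<String> inMemory = memory.iterator();
            private String pending = advance();

            private String advance() {
                try {
                    while (true) {
                        if (reader == null) {
                            if (fileIndex >= spillFiles.size()) break;
                            reader = Files.newBufferedReader(
                                    spillFiles.get(fileIndex++), StandardCharsets.UTF_8);
                        }
                        String line = reader.readLine();
                        if (line != null) return line;
                        reader.close();                       // file exhausted, try the next
                        reader = null;
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                return inMemory.hasNext() ? inMemory.next() : null;
            }

            @Override public boolean hasNext() { return pending != null; }

            @Override public String next() {
                if (pending == null) throw new NoSuchElementException();
                String result = pending;
                pending = advance();
                return result;
            }
        };
    }

    /** Deletes the spill files and releases the in-memory items. */
    @Override
    public void close() {
        for (Path p : spillFiles) {
            try {
                Files.deleteIfExists(p);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        spillFiles.clear();
        memory.clear();
    }
}
```

    With a threshold of 2, adding seven items spills twice and leaves one item in memory; iterating still yields all seven. The sketch happens to preserve insertion order, which a default bag does not promise.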

    The DataBag interface assumes that all data is written before any is read. That is, a DataBag cannot be used as a queue. If data is written after data is read, the results are undefined. This condition is not checked on each add or read, for reasons of speed. Caveat emptor.
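
    As an analogy only, the sketch below uses java.util.ArrayList rather than a DataBag. ArrayList iterators are fail-fast, so a write after reading has started is detected on the next read. A DataBag is documented above as making no such check, so the same mistake would go unreported. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

/** Hypothetical illustration of the write-then-read contract using ArrayList. */
class WriteThenReadSketch {

    /** Safe pattern: finish every add before the first read. */
    static int writeAllThenRead(List<String> items) {
        List<String> bag = new ArrayList<>();
        for (String s : items) bag.add(s);        // write phase
        int count = 0;
        for (String ignored : bag) count++;       // read phase, no further writes
        return count;
    }

    /** Unsafe pattern: a write after reading has begun. ArrayList notices; a DataBag would not. */
    static boolean writeAfterReadIsDetected() {
        List<String> bag = new ArrayList<>(Arrays.asList("a", "b"));
        Iterator<String> it = bag.iterator();
        it.next();                                // reading has started
        bag.add("c");                             // write after read
        try {
            it.next();                            // fail-fast check happens here
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }
}
```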

    DataBags come in several types: default, sorted, and distinct. The type must be chosen up front; there is no way to convert a bag on the fly. Default data bags do not guarantee any particular order of retrieval for the tuples and may contain duplicate tuples. Sorted data bags guarantee that tuples will be retrieved in order, where "in order" is defined either by the default comparator for Tuple or by the comparator provided by the caller when the bag was created. Sorted bags may contain duplicates. Distinct bags do not guarantee any particular order of retrieval, but do guarantee that they will not contain duplicate tuples.
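
    The three retrieval contracts can be compared side by side with plain in-memory collections. This is a sketch of the semantics only, with no spilling; BagTypesSketch and its methods are hypothetical names and are not part of Jena.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;

/** Hypothetical illustration of default, sorted, and distinct retrieval. */
class BagTypesSketch {

    /** Default: duplicates kept; callers must not rely on the order. */
    static <T> List<T> asDefault(List<T> input) {
        return new ArrayList<>(input);
    }

    /** Sorted: duplicates kept; retrieval follows the supplied comparator. */
    static <T> List<T> asSorted(List<T> input, Comparator<? super T> comparator) {
        List<T> out = new ArrayList<>(input);
        out.sort(comparator);
        return out;
    }

    /** Distinct: duplicates dropped; callers must not rely on the order. */
    static <T> List<T> asDistinct(List<T> input) {
        return new ArrayList<>(new LinkedHashSet<>(input));
    }
}
```

    For the input 3, 1, 2, 3, 1 the default view holds five elements, the sorted view yields 1, 1, 2, 3, 3 under natural ordering, and the distinct view holds three elements.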

    Inspired by Apache Pig

    See Also:
    DataBag from Apache Pig
    • Method Summary

      Modifier and Type    Method                                        Description
      void                 add(T t)                                      Add a tuple to the bag.
      default void         addAll(java.lang.Iterable<? extends T> it)    Add contents of an Iterable to the bag.
      default void         addAll(java.util.Iterator<? extends T> it)    Add contents of an Iterator to the bag.
      boolean              isDistinct()                                  Find out if the bag is distinct.
      boolean              isSorted()                                    Find out if the bag is sorted.
      long                 size()                                        Get the number of elements in the bag, both in memory and on disk.
      • Methods inherited from interface org.apache.jena.atlas.lib.Closeable

        close
      • Methods inherited from interface java.lang.Iterable

        forEach, iterator, spliterator
      • Methods inherited from interface org.apache.jena.atlas.lib.Sink

        flush, send
    • Method Detail

      • size

        long size()
        Get the number of elements in the bag, both in memory and on disk.
        Returns:
        number of elements in the bag
      • isSorted

        boolean isSorted()
        Find out if the bag is sorted.
        Returns:
        true if this is a sorted data bag, false otherwise.
      • isDistinct

        boolean isDistinct()
        Find out if the bag is distinct.
        Returns:
        true if the bag is a distinct bag, false otherwise.
      • add

        void add(T t)
        Add a tuple to the bag.
        Parameters:
        t - tuple to add.
      • addAll

        default void addAll(java.lang.Iterable<? extends T> it)
        Add contents of an Iterable to the bag.
        Parameters:
        it - iterable to add contents of.
      • addAll

        default void addAll(java.util.Iterator<? extends T> it)
        Add contents of an Iterator to the bag.
        Parameters:
        it - iterator to add contents of.
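
        Because both addAll overloads are default methods, an implementation only has to supply add. The sketch below shows one plausible way such defaults can be layered on add. MiniBag and ListBag are hypothetical stand-ins, and the method bodies are an assumption rather than a copy of the Jena source.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for the interface shape; not the Jena source. */
interface MiniBag<T> {
    void add(T t);

    /** Assumed default: delegate to the Iterator overload. */
    default void addAll(Iterable<? extends T> it) {
        addAll(it.iterator());
    }

    /** Assumed default: drain the iterator, adding one element at a time. */
    default void addAll(Iterator<? extends T> it) {
        while (it.hasNext()) {
            add(it.next());
        }
    }
}

/** Minimal in-memory implementation that only defines add. */
class ListBag<T> implements MiniBag<T> {
    final List<T> items = new ArrayList<>();

    @Override
    public void add(T t) {
        items.add(t);
    }
}
```

        A caller can then pass either a collection or an iterator, for example bag.addAll(list) or bag.addAll(list.iterator()), and both routes end up in add.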