Dinesh Joshi cassandra-analytics last 3 years


 14 Collaborator
Yifan Cai , Francisco Guerrero , Doug Rohrer , Mick Semb Wever , Josh McKenzie , Berenguer Blasi , James Berragan , jberragan , Yuriy Semchyshyn , Saranya Krishnakumar , Jyothsna Konisa , jkonisa , Bernardo Botella , Francisco Guerrero Hernandez

 2 Patch  21 Review
d2be42dbea6fb1d9d908792788d66113960d565b, 1633cd9c6c3d88d5c66825fab76a369266509f7e e51716ee724cf4950df67eba0393b3f798b7dc00, 555e8494d3ca27a7b35aebabb1f669eede20cc53, d75a6bae5abbf80810012a181644f240141014d5, a242b352c28947427a9bfc30295a487017439fd9, c73c76498b0c2b36705025de6b0b2a7bb38e758b, 87a729feb4660f57bacb2a4be73e1bb2d509578b, 8771581b255e5728a16aea84430506d6f156a589, deebdf97ad01f23550d7d3b42d98c7bf111e2f95, f24951ab6ea2b1e9af4013b030675c70d31adb90, 014db08a79f00ef0d94e6855779e398c9dc689c1, 82b3c0a79c9322142738a4ec2ff7d4d4c0be2370, 6f8f404535d4cff9272091f669f985ce11cee7d2, 69766bca399cc779e0f2f8e859e39f7e29a17b7a, 912fbb47fddc07afcf56f5de97e813593bfb890e, 02d9136cfa72c8990120eca0f4fe5f52587bceb5, ee1c83722bfb1155bef762cdfb2c86034857f2d0, cbae09ca71b9eb9a581b77c23844da21474b095a, bd0b41fb82134844a15fbb43126424d96706d08e, 9523a38b3f1b5bc4313e2949896ddc1fff58afbe, f0fae2deeee20df15ac1105af2163af2a7e7953d, b87b0edd310d1ef93c507bbbb1ae51e1b0b319c6

e51716ee724cf4950df67eba0393b3f798b7dc00 | Author: jberragan <jberragan@gmail.com>
 | 2024-12-06 16:09:12-08:00

    CASSANDRA-19962: CEP-44 Kafka integration for Cassandra CDC using Sidecar (#87)
    
    This is the initial commit for CEP-44 to introduce a standalone CDC module into the Analytics project. This module provides the foundation for CDC in the Apache Cassandra Sidecar.
    
    This module provides:
    - a standalone Cdc class as the entrypoint for initializing CDC.
    - pluggable interfaces for: listing and reading commit log segments for a token range, persisting and reading CDC state, providing the Cassandra table schema, optionally reading values from Cassandra.
    - read and deserialize commit log mutations.
    - reconcile and de-duplicate mutations across replicas.
    - serialize CDC state into a binary object for persistence.
    - a layer for converting Cassandra mutations into a standard consumable format.
    
    Patch by James Berragan, Jyothsna Konisa, Yifan Cai; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-19962
    
    Co-authored-by: James Berragan <jberragan@apple.com>
    Co-authored-by: Yifan Cai <ycai@apache.org>
    Co-authored-by: Jyothsna Konisa <jkonisa@apple.com>

555e8494d3ca27a7b35aebabb1f669eede20cc53 | Author: Yifan Cai <ycai@apache.org>
 | 2024-08-20 17:33:35-07:00

    CASSANDRA-19836: Fix NPE when writing UDT values (#74)
    
    When UDT field values are set to null, the bulk writer throws NPE
    
    Patch by Yifan Cai; Reviewed by Dinesh Joshi, Doug Rohrer for CASSANDRA-19836

d75a6bae5abbf80810012a181644f240141014d5 | Author: Yifan Cai <ycai@apache.org>
 | 2024-08-14 13:10:15-07:00

    CASSANDRA-19827: Add job_timeout_seconds writer option (#73)
    
    Option to specify the timeout in seconds for bulk write jobs. By default, it is disabled.
    When `JOB_TIMEOUT_SECONDS` is specified, a job exceeding the timeout is:
    - successful when the desired consistency level is met
    - a failure otherwise
    
    Patch by Yifan Cai; Reviewed by Dinesh Joshi, Doug Rohrer for CASSANDRA-19827
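
The timeout semantics described in this commit can be sketched roughly as below. This is a minimal illustration, not the project's implementation: the class `JobTimeoutSketch`, the enum `Outcome`, and the parameters `elapsedSeconds`, `timeoutSeconds`, and `consistencyLevelMet` are hypothetical names, and treating a negative timeout as "disabled" is an assumption made only for this example.

```java
// Hypothetical sketch; names are illustrative and not taken from the project.
final class JobTimeoutSketch {
    enum Outcome { RUNNING, SUCCEEDED, FAILED }

    /**
     * Decides what happens to a bulk write job at a given point in time.
     * A negative timeoutSeconds is assumed (for this sketch only) to mean "disabled".
     */
    static Outcome evaluate(long elapsedSeconds, long timeoutSeconds, boolean consistencyLevelMet) {
        boolean timeoutDisabled = timeoutSeconds < 0;
        if (timeoutDisabled || elapsedSeconds <= timeoutSeconds) {
            return Outcome.RUNNING; // the timeout has not fired; the job keeps going
        }
        // The job exceeded the timeout: it succeeds only if the
        // desired consistency level has already been met.
        return consistencyLevelMet ? Outcome.SUCCEEDED : Outcome.FAILED;
    }
}
```

For instance, `JobTimeoutSketch.evaluate(120, 60, true)` yields `SUCCEEDED`, while the same call with `consistencyLevelMet` set to `false` yields `FAILED`.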

a242b352c28947427a9bfc30295a487017439fd9 | Author: jberragan <jberragan@gmail.com>
 | 2024-07-12 14:57:38-07:00

    CASSANDRA-19748: Refactoring to introduce new cassandra-analytics-common module with minimal dependencies (#62)
    
    - Add new module cassandra-analytics-common with no dependencies on Spark or Cassandra and minimal standard dependencies (Guava, Jackson, Commons Lang, Kryo, etc.)
    - Move standalone classes to cassandra-analytics-common module.
    
    Some additional refactoring and clean up:
    - Rename SSTableInputStream -> BufferingInputStream
    - Rename SSTableSource -> CassandraFileSource
    - Introduce CassandraFile interface to be the implementing class for SSTable and CommitLog.
    - Generalize IStats to work across different CassandraFile types
    - Rename methods in StreamScanner to make the API clearer.
    - Move ComplexTypeBuffer, ListBuffer, MapBuffer, SetBuffer, UdtBuffer to standalone classes
    - Delete unused classes RangeTombstone, ReplicaSet and CollectionElement.
    - Remove commons lang as a dependency
    - Rename Rid to RowData
    
    Patch by James Berragan; Reviewed by Bernardo Botella, Dinesh Joshi, Francisco Guerrero, Yifan Cai, Yuriy Semchyshyn for CASSANDRA-19748

c73c76498b0c2b36705025de6b0b2a7bb38e758b | Author: Doug Rohrer <drohrer@apple.com>
 | 2023-11-20 10:54:46-05:00

    CASSANDRA-19048 - Audit table properties passed through Analytics CqlUtils
    
    The following properties have an effect on the files generated by the
    bulk writer, and therefore need to be retained when cleaning the table
    schema:
    
    bloom_filter_fp_chance
    cdc
    compression
    default_time_to_live
    min_index_interval
    max_index_interval
    
    Additionally, this commit adds tests to make sure all available TTL
    paths, including table default TTLs and constant/per-row options, work
    as designed.
    
    Patch by Doug Rohrer; Reviewed by Francisco Guerrero Hernandez, Yifan Cai,
    Dinesh Joshi for CASSANDRA-19048
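
One way to picture the property audit above is an allow-list filter over the table's properties. The sketch below is an assumption-laden illustration, not the code in `CqlUtils`: the class `RetainedTablePropertiesSketch` and the method `retain` are hypothetical, and dropping every property outside the list is assumed only for the sake of the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the project's CqlUtils implementation.
final class RetainedTablePropertiesSketch {
    // The six properties listed in the commit message.
    static final Set<String> RETAINED = Collections.unmodifiableSet(new HashSet<>(Arrays.asList(
            "bloom_filter_fp_chance", "cdc", "compression",
            "default_time_to_live", "min_index_interval", "max_index_interval")));

    /** Returns a copy of tableProperties containing only the retained keys, in input order. */
    static Map<String, String> retain(Map<String, String> tableProperties) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : tableProperties.entrySet()) {
            if (RETAINED.contains(entry.getKey())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }
}
```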

87a729feb4660f57bacb2a4be73e1bb2d509578b | Author: Saranya Krishnakumar <saranya_k@apple.com>
 | 2023-11-06 13:32:01-08:00

    CASSANDRA-19903: Get Sidecar port through CassandraContext
    
    Patch by Saranya Krishnakumar; Reviewed by Dinesh Joshi, Francisco Guerrero, Josh McKenzie for CASSANDRA-19903

8771581b255e5728a16aea84430506d6f156a589 | Author: Yuriy Semchyshyn <yuriy@semchyshyn.com>
 | 2023-10-06 17:54:34-05:00

    CASSANDRA-18916: Log start-up validation result to a single report
    
    Patch by Yuriy Semchyshyn; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18916

f24951ab6ea2b1e9af4013b030675c70d31adb90 | Author: Yuriy Semchyshyn <yuriy@semchyshyn.com>
 | 2023-08-14 14:09:12-05:00

    CASSANDRA-18810: Cassandra Analytics Start-Up Validation
    
    Patch by Yuriy Semchyshyn; Reviewed by Dinesh Joshi, Francisco Guerrero, Yifan Cai for CASSANDRA-18810

82b3c0a79c9322142738a4ec2ff7d4d4c0be2370 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2023-07-25 12:41:10-07:00

    CASSANDRA-18692 Fix bulk writes with Buffered RowBufferMode
    
    When setting Buffered RowBufferMode as part of the `WriterOption`s,
    `org.apache.cassandra.spark.bulkwriter.RecordWriter` ignores that configuration and instead
    uses the batch size to determine when to finalize an SSTable and start writing a new SSTable,
    if more rows are available.
    
    In this commit, we fix `org.apache.cassandra.spark.bulkwriter.RecordWriter#checkBatchSize`
    to take the configured `RowBufferMode` into account. For the `UNBUFFERED` RowBufferMode,
    we check the batch size of the SSTable during writes; for `BUFFERED`, that check has
    no effect.
    
    Co-authored-by: Doug Rohrer <doug@therohrers.org>
    
    Patch by Francisco Guerrero, Doug Rohrer; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18692
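
The behaviour described for `checkBatchSize` might look roughly like the sketch below. The real method's signature is not shown in this log, so `BatchSizeCheckSketch`, `shouldFinalizeSSTable`, and its parameters are hypothetical stand-ins; only the two mode names come from the commit message.

```java
// Hypothetical sketch; the real RecordWriter#checkBatchSize may differ.
final class BatchSizeCheckSketch {
    enum RowBufferMode { BUFFERED, UNBUFFERED }

    /**
     * Returns true when the current SSTable should be finalized because
     * the configured batch size has been reached.
     */
    static boolean shouldFinalizeSSTable(RowBufferMode mode, int rowsWritten, int batchSize) {
        if (mode == RowBufferMode.BUFFERED) {
            return false; // the batch-size check has no effect in BUFFERED mode
        }
        return rowsWritten >= batchSize; // UNBUFFERED: honour the batch size
    }
}
```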

014db08a79f00ef0d94e6855779e398c9dc689c1 | Author: James Berragan <jberragan@apple.com>
 | 2023-07-19 12:23:07-07:00

    CASSANDRA-18683: Add PartitionSizeTableProvider for reading the compressed and uncompressed sizes of all partitions in a table by utilizing the SSTable Index.db files
    
    Patch by James Berragan; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18683

6f8f404535d4cff9272091f669f985ce11cee7d2 | Author: Yuriy Semchyshyn <yuriy@semchyshyn.com>
 | 2023-07-18 18:58:00-05:00

    CASSANDRA-18684: Minor Refactoring to Improve Code Reusability
    
    patch by Yuriy Semchyshyn; reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18684

02d9136cfa72c8990120eca0f4fe5f52587bceb5 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2023-06-27 10:28:04-07:00

    CASSANDRA-18631: Add Release Audit Tool (RAT) plugin to Analytics
    
    This commit adds the Release Audit Tool (RAT) plugin to `build.gradle` which adds a new task
    `rat`. This new task makes sure that the license headers are valid and present in the source
    files during the `check` task.
    
    To run the RAT plugin, you can run:
    
    ```
    ./gradlew rat
    ```
    
    patch by Francisco Guerrero; reviewed by Dinesh Joshi, Michael Semb Wever for CASSANDRA-18631

69766bca399cc779e0f2f8e859e39f7e29a17b7a | Author: Francisco Guerrero <frankgh@apache.org>
 | 2023-06-27 10:03:56-07:00

    CASSANDRA-18662: Fix cassandra-analytics-core-example
    
    This commit fixes the `SampleCassandraJob` available under the `cassandra-analytics-core-example`
    subproject.
    
    Fix checkstyle issues
    
    Fix serialization issue in SidecarDataTransferApi
    
    The `sidecarClient` field in `SidecarDataTransferApi` is declared as transient,
    which causes NPEs on executors when they try to perform an SSTable
    upload.
    
    This commit completely avoids serializing the `dataTransferApi` field in the
    `CassandraBulkWriterContext`, and lazily initializes it during the `transfer()`
    method invocation. We guard the initialization to a single thread by making the
    `transfer()` method synchronized. The `SidecarDataTransferApi` can be recreated
    when needed using the already serialized `clusterInfo`, `jobInfo`, and `conf`
    fields.
    
    Fix setting ROW_BUFFER_MODE to BUFFERED
    
    patch by Francisco Guerrero; reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18662
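
The serialization fix above follows a common Java pattern: keep the non-serializable collaborator in a `transient` field and rebuild it on first use from state that does survive serialization. The sketch below illustrates that pattern only. `LazyTransientSketch`, its nested `Client`, and the `roundTrip` helper are hypothetical, and a single `conf` string stands in for the `clusterInfo`, `jobInfo`, and `conf` fields mentioned in the commit.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical sketch of lazy initialization of a transient field.
final class LazyTransientSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    /** Stand-in for a client that cannot be serialized. */
    static final class Client {
        private final String conf;
        Client(String conf) { this.conf = conf; }
        String upload(String sstable) { return conf + ":" + sstable; }
    }

    private final String conf;       // serialized along with the object
    private transient Client client; // skipped by serialization; null after deserialization

    LazyTransientSketch(String conf) { this.conf = conf; }

    /** synchronized so that only one thread ever creates the client. */
    synchronized String transfer(String sstable) {
        if (client == null) {
            client = new Client(conf); // rebuilt from serialized state on first use
        }
        return client.upload(sstable);
    }

    /** Serializes and deserializes an instance, as shipping it to an executor would. */
    static LazyTransientSketch roundTrip(LazyTransientSketch original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (LazyTransientSketch) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Without the null check in `transfer`, the deserialized copy would dereference a null `client`, which is the kind of NPE the commit describes.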

912fbb47fddc07afcf56f5de97e813593bfb890e | Author: Yuriy Semchyshyn <yuriy@semchyshyn.com>
 | 2023-06-26 14:40:10-05:00

    CASSANDRA-18633: Added caching of Node Settings to improve efficiency
    
    patch by Yuriy Semchyshyn; reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18633

9523a38b3f1b5bc4313e2949896ddc1fff58afbe | Author: jkonisa <jkonisa@apple.com>
 | 2023-06-15 13:31:01-07:00

    CASSANDRA-18605 Adding support for TTL & Timestamps for bulk writes
    
    This commit introduces a new feature in Spark Bulk Writer to support writes with
    constant/per_row based TTL & Timestamps.
    
    Patch by Jyothsna Konisa; Reviewed by Dinesh Joshi, Francisco Guerrero, Yifan Cai for CASSANDRA-18605

cbae09ca71b9eb9a581b77c23844da21474b095a | Author: Francisco Guerrero <frankgh@apache.org>
 | 2023-06-14 11:52:55-07:00

    CASSANDRA-18600 Add NOTICE.txt file
    
    The NOTICE.txt file is currently missing in the repository. This commit adds the file to
    comply with ASF's guidance.
    
    patch by Francisco Guerrero; reviewed by Dinesh Joshi, Michael Semb Wever, Berenguer Blasi for CASSANDRA-18600

deebdf97ad01f23550d7d3b42d98c7bf111e2f95 | Author: Doug Rohrer <drohrer@apple.com>
 | 2023-06-14 13:33:29-04:00

    CASSANDRA-18759: Use in-jvm dtest framework from Sidecar for testing
    
    This commit introduces the use of the in-jvm dtest framework for testing
    Analytics workloads. It can spin up a Cassandra cluster, including the
    necessary Sidecar process, to test writing to and reading from Cassandra
    using the analytics library.
    
    Additional changes made in this commit include
    
    * Use concurrent collections in MockBulkWriterContext (Fixes flaky test StreamSessionConsistencyTest)
    
        The StreamSessionConsistencyTest uses the MockBulkWriterContext, but that context wasn't originally used
        (before this test was added) in a multi-threaded environment. Because of this, it would occasionally
        throw ConcurrentModificationExceptions, which would cause the stream test to fail in a
        non-deterministic way. This commit adds the use of concurrent/synchronous collections to the
        MockBulkWriterContext to make sure it doesn't throw these spurious errors.
    
    * Make the StartupValidation system thread-safe by using ThreadLocals
      instead of static collections, and clearing them once validation is
      complete.
    
    Patch by Doug Rohrer; Reviewed by Dinesh Joshi, Francisco Guerrero, Yifan Cai for CASSANDRA-18759
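
The thread-safety change to start-up validation can be illustrated with a small sketch that swaps a shared static collection for a per-thread one and clears it once validation completes. This is not the project's `StartupValidation` code: `ThreadLocalValidationSketch`, `record`, and `completeAndClear` are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the project's StartupValidation implementation.
final class ThreadLocalValidationSketch {
    // Each thread gets its own list, so concurrent validations cannot interfere.
    private static final ThreadLocal<List<String>> RESULTS = ThreadLocal.withInitial(ArrayList::new);

    /** Records a validation result for the calling thread only. */
    static void record(String result) {
        RESULTS.get().add(result);
    }

    /** Returns the calling thread's results and clears its thread-local state. */
    static List<String> completeAndClear() {
        List<String> snapshot = new ArrayList<>(RESULTS.get());
        RESULTS.remove(); // avoid leaking state into later work on the same thread
        return snapshot;
    }
}
```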

bd0b41fb82134844a15fbb43126424d96706d08e | Author: Doug Rohrer <drohrer@apple.com>
 | 2023-06-14 13:33:29-04:00

    CASSANDRA-18599 Upgrade to JUnit 5
    
    patch by Doug Rohrer, Francisco Guerrero; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18599

f0fae2deeee20df15ac1105af2163af2a7e7953d | Author: Francisco Guerrero <frankgh@apache.org>
 | 2023-06-08 12:40:22-07:00

    CASSANDRA-18578 Add circleci configuration yaml for Cassandra Analytics
    
    This commit adds the CircleCI configuration yaml to test against all the existing
    profiles:
    
          - cassandra-analytics-core-spark2-2.11-jdk8
          - cassandra-analytics-core-spark2-2.12-jdk8
          - cassandra-analytics-core-spark3-2.12-jdk11
          - cassandra-analytics-core-spark3-2.13-jdk11
    
    Patch by Francisco Guerrero; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18578

ee1c83722bfb1155bef762cdfb2c86034857f2d0 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2023-06-07 12:40:50-07:00

    CASSANDRA-18574: Fix sample job documentation after Sidecar changes
    
    This commit fixes the README file with documentation to set up and run the Sample job provided in the repository.
    During Sidecar review, there was a suggestion to change the yaml property `uploads_staging_dir` to `staging_dir`.
    That change however was not reflected as part of the sample job README.md.
    
    patch by Francisco Guerrero; reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18574

d2be42dbea6fb1d9d908792788d66113960d565b | Author: Dinesh Joshi <djoshi@apache.org>
 | 2023-05-24 16:30:54-07:00

    Ninja fix .asf.yaml

b87b0edd310d1ef93c507bbbb1ae51e1b0b319c6 | Author: Francisco Guerrero <francisco.guerrero@apple.com>
 | 2023-05-23 13:56:48-07:00

    CASSANDRA-18545: Provide a SecretsProvider interface to abstract the secret provisioning
    
    This commit introduces the SecretsProvider interface that abstracts the secrets provisioning.
    This way different implementations of the SecretsProvider can be used to provide SSL secrets
    for the Analytics job. We provide an implementation, SslConfigSecretsProvider, which provides
    secrets based on the configuration for the job.
    
    Patch by Francisco Guerrero; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18545

1633cd9c6c3d88d5c66825fab76a369266509f7e | Author: Dinesh Joshi <djoshi@apache.org>
 | 2023-05-19 14:57:47-07:00

    CEP-28: Apache Cassandra Analytics
    
    This is the initial commit for the Apache Cassandra Analytics project,
    which supports bulk reads from and bulk writes to Apache Cassandra from
    Spark.
    
    Patch by James Berragan, Doug Rohrer; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-16222
    
    Co-authored-by: James Berragan <jberragan@apple.com>
    Co-authored-by: Doug Rohrer <drohrer@apple.com>
    Co-authored-by: Saranya Krishnakumar <saranya_k@apple.com>
    Co-authored-by: Francisco Guerrero <francisco.guerrero@apple.com>
    Co-authored-by: Yifan Cai <ycai@apache.org>
    Co-authored-by: Jyothsna Konisa <jkonisa@apple.com>
    Co-authored-by: Yuriy Semchyshyn <ysemchyshyn@apple.com>
    Co-authored-by: Dinesh Joshi <djoshi@apache.org>