14 Collaborator |
Yifan Cai , Francisco Guerrero , Doug Rohrer , Mick Semb Wever , Josh McKenzie , Berenguer Blasi , James Berragan , jberragan , Yuriy Semchyshyn , Saranya Krishnakumar , Jyothsna Konisa , jkonisa , Bernardo Botella , Francisco Guerrero Hernandez |
2 Patch |
21 Review |
d2be42dbea6fb1d9d908792788d66113960d565b,
1633cd9c6c3d88d5c66825fab76a369266509f7e |
e51716ee724cf4950df67eba0393b3f798b7dc00,
555e8494d3ca27a7b35aebabb1f669eede20cc53,
d75a6bae5abbf80810012a181644f240141014d5,
a242b352c28947427a9bfc30295a487017439fd9,
c73c76498b0c2b36705025de6b0b2a7bb38e758b,
87a729feb4660f57bacb2a4be73e1bb2d509578b,
8771581b255e5728a16aea84430506d6f156a589,
deebdf97ad01f23550d7d3b42d98c7bf111e2f95,
f24951ab6ea2b1e9af4013b030675c70d31adb90,
014db08a79f00ef0d94e6855779e398c9dc689c1,
82b3c0a79c9322142738a4ec2ff7d4d4c0be2370,
6f8f404535d4cff9272091f669f985ce11cee7d2,
69766bca399cc779e0f2f8e859e39f7e29a17b7a,
912fbb47fddc07afcf56f5de97e813593bfb890e,
02d9136cfa72c8990120eca0f4fe5f52587bceb5,
ee1c83722bfb1155bef762cdfb2c86034857f2d0,
cbae09ca71b9eb9a581b77c23844da21474b095a,
bd0b41fb82134844a15fbb43126424d96706d08e,
9523a38b3f1b5bc4313e2949896ddc1fff58afbe,
f0fae2deeee20df15ac1105af2163af2a7e7953d,
b87b0edd310d1ef93c507bbbb1ae51e1b0b319c6 |
e51716ee724cf4950df67eba0393b3f798b7dc00 | Author: jberragan <jberragan@gmail.com>
| 2024-12-06 16:09:12-08:00
CASSANDRA-19962: CEP-44 Kafka integration for Cassandra CDC using Sidecar (#87)
This is the initial commit for CEP-44 to introduce a standalone CDC module into the Analytics project. This module provides the foundation for CDC in the Apache Cassandra Sidecar.
This module provides:
- a standalone Cdc class as the entrypoint for initializing CDC.
- pluggable interfaces for: listing and reading commit log segments for a token range, persisting and reading CDC state, providing the Cassandra table schema, optionally reading values from Cassandra.
- read and deserialize commit log mutations.
- reconcile and de-duplicate mutations across replicas.
- serialize CDC state into a binary object for persistence.
- a layer for converting Cassandra mutations into a standard consumable format.
Patch by James Berragan, Jyothsna Konisa, Yifan Cai; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-19962
Co-authored-by: James Berragan <jberragan@apple.com>
Co-authored-by: Yifan Cai <ycai@apache.org>
Co-authored-by: Jyothsna Konisa <jkonisa@apple.com>
555e8494d3ca27a7b35aebabb1f669eede20cc53 | Author: Yifan Cai <ycai@apache.org>
| 2024-08-20 17:33:35-07:00
CASSANDRA-19836: Fix NPE when writing UDT values (#74)
When UDT field values are set to null, the bulk writer throws NPE
Patch by Yifan Cai; Reviewed by Dinesh Joshi, Doug Rohrer for CASSANDRA-19836
d75a6bae5abbf80810012a181644f240141014d5 | Author: Yifan Cai <ycai@apache.org>
| 2024-08-14 13:10:15-07:00
CASSANDRA-19827: Add job_timeout_seconds writer option (#73)
Option to specify the timeout in seconds for bulk write jobs. By default, it is disabled.
When `JOB_TIMEOUT_SECONDS` is specified, a job exceeding the timeout is:
- successful when the desired consistency level is met
- a failure otherwise
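The timeout rule above can be sketched as follows. This is a minimal illustration, not the bulk writer's code: `JobTimeoutSketch`, `JobOutcome`, and `outcomeAt` are hypothetical names, a negative value standing in for "disabled" is an assumption, and the consistency-level check is reduced to a boolean.

```java
import java.time.Duration;

// Hypothetical sketch; none of these names come from the Analytics codebase.
final class JobTimeoutSketch
{
    enum JobOutcome { RUNNING, SUCCEEDED, FAILED }

    // Assumption for this sketch: a negative timeout means the option is disabled.
    static JobOutcome outcomeAt(Duration elapsed, long jobTimeoutSeconds, boolean consistencyLevelMet)
    {
        boolean timeoutEnabled = jobTimeoutSeconds >= 0;
        if (!timeoutEnabled || elapsed.compareTo(Duration.ofSeconds(jobTimeoutSeconds)) <= 0)
        {
            // The timeout is disabled or has not been exceeded yet.
            return JobOutcome.RUNNING;
        }
        // Timeout exceeded: the outcome depends only on the consistency level.
        return consistencyLevelMet ? JobOutcome.SUCCEEDED : JobOutcome.FAILED;
    }
}
```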
Patch by Yifan Cai; Reviewed by Dinesh Joshi, Doug Rohrer for CASSANDRA-19827
a242b352c28947427a9bfc30295a487017439fd9 | Author: jberragan <jberragan@gmail.com>
| 2024-07-12 14:57:38-07:00
CASSANDRA-19748: Refactoring to introduce new cassandra-analytics-common module with minimal dependencies (#62)
- Add new module cassandra-analytics-common with no dependencies on Spark or Cassandra and minimal standard dependencies (Guava, Jackson, Commons Lang, Kryo, etc.)
- Move standalone classes to cassandra-analytics-common module.
Some additional refactoring and clean up:
- Rename SSTableInputStream -> BufferingInputStream
- Rename SSTableSource -> CassandraFileSource
- Introduce CassandraFile interface, to be implemented by SSTable and CommitLog.
- Generalize IStats to work across different CassandraFile types
- Rename methods in StreamScanner to make the API clearer.
- Move ComplexTypeBuffer, ListBuffer, MapBuffer, SetBuffer, UdtBuffer to standalone classes
- Delete unused classes RangeTombstone, ReplicaSet and CollectionElement.
- Remove commons lang as a dependency
- Rename Rid to RowData
Patch by James Berragan; Reviewed by Bernardo Botella, Dinesh Joshi, Francisco Guerrero, Yifan Cai, Yuriy Semchyshyn for CASSANDRA-19748
c73c76498b0c2b36705025de6b0b2a7bb38e758b | Author: Doug Rohrer <drohrer@apple.com>
| 2023-11-20 10:54:46-05:00
CASSANDRA-19048 - Audit table properties passed through Analytics CqlUtils
The following properties have an effect on the files generated by the
bulk writer, and therefore need to be retained when cleaning the table
schema:
bloom_filter_fp_chance
cdc
compression
default_time_to_live
min_index_interval
max_index_interval
Additionally, this commit adds tests to make sure all available TTL
paths, including table default TTLs and constant/per-row options, work
as designed.
Patch by Doug Rohrer; Reviewed by Francisco Guerrero Hernandez, Yifan Cai,
Dinesh Joshi for CASSANDRA-19048
87a729feb4660f57bacb2a4be73e1bb2d509578b | Author: Saranya Krishnakumar <saranya_k@apple.com>
| 2023-11-06 13:32:01-08:00
CASSANDRA-19903: Get Sidecar port through CassandraContext
Patch by Saranya Krishnakumar; Reviewed by Dinesh Joshi, Francisco Guerrero, Josh McKenzie for CASSANDRA-19903
f24951ab6ea2b1e9af4013b030675c70d31adb90 | Author: Yuriy Semchyshyn <yuriy@semchyshyn.com>
| 2023-08-14 14:09:12-05:00
CASSANDRA-18810: Cassandra Analytics Start-Up Validation
Patch by Yuriy Semchyshyn; Reviewed by Dinesh Joshi, Francisco Guerrero, Yifan Cai for CASSANDRA-18810
82b3c0a79c9322142738a4ec2ff7d4d4c0be2370 | Author: Francisco Guerrero <frankgh@apache.org>
| 2023-07-25 12:41:10-07:00
CASSANDRA-18692 Fix bulk writes with Buffered RowBufferMode
When setting Buffered RowBufferMode as part of the `WriterOption`s,
`org.apache.cassandra.spark.bulkwriter.RecordWriter` ignores that configuration and instead
uses the batch size to determine when to finalize an SSTable and start writing a new SSTable,
if more rows are available.
In this commit, we fix `org.apache.cassandra.spark.bulkwriter.RecordWriter#checkBatchSize`
to take the configured `RowBufferMode` into account. Specifically, in the
`UNBUFFERED` RowBufferMode we check the batchSize of the SSTable during writes, whereas in
the `BUFFERED` RowBufferMode that check has no effect.
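A minimal sketch of the mode-dependent check described above. The real `RecordWriter#checkBatchSize` signature is not shown in this log, so `RowBufferModeSketch`, `shouldFinalizeSSTable`, its parameters, and the `>=` comparison are assumptions made for illustration.

```java
// Hypothetical sketch; only the BUFFERED/UNBUFFERED mode names come from the commit message.
final class RowBufferModeSketch
{
    enum RowBufferMode { BUFFERED, UNBUFFERED }

    // Returns true when the current SSTable should be finalized and a new one started.
    static boolean shouldFinalizeSSTable(RowBufferMode mode, int rowsWritten, int batchSize)
    {
        // Only UNBUFFERED mode honours the batch size; in BUFFERED mode the check has no effect.
        return mode == RowBufferMode.UNBUFFERED && rowsWritten >= batchSize;
    }
}
```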
Co-authored-by: Doug Rohrer <doug@therohrers.org>
Patch by Francisco Guerrero, Doug Rohrer; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18692
014db08a79f00ef0d94e6855779e398c9dc689c1 | Author: James Berragan <jberragan@apple.com>
| 2023-07-19 12:23:07-07:00
CASSANDRA-18683: Add PartitionSizeTableProvider for reading the compressed and uncompressed sizes of all partitions in a table by utilizing the SSTable Index.db files
Patch by James Berragan; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18683
02d9136cfa72c8990120eca0f4fe5f52587bceb5 | Author: Francisco Guerrero <frankgh@apache.org>
| 2023-06-27 10:28:04-07:00
CASSANDRA-18631: Add Release Audit Tool (RAT) plugin to Analytics
This commit adds the Release Audit Tool (RAT) plugin to `build.gradle` which adds a new task
`rat`. This new task makes sure that the license headers are valid and present in the source
files during the `check` task.
To run the RAT plugin, you can run:
```
./gradlew rat
```
patch by Francisco Guerrero; reviewed by Dinesh Joshi, Michael Semb Wever for CASSANDRA-18631
69766bca399cc779e0f2f8e859e39f7e29a17b7a | Author: Francisco Guerrero <frankgh@apache.org>
| 2023-06-27 10:03:56-07:00
CASSANDRA-18662: Fix cassandra-analytics-core-example
This commit fixes the `SampleCassandraJob` available under the `cassandra-analytics-core-example`
subproject.
Fix checkstyle issues
Fix serialization issue in SidecarDataTransferApi
The `sidecarClient` field in `SidecarDataTransferApi` is declared as transient,
which causes NPEs on executors when they try to perform an SSTable
upload.
This commit avoids serializing the `dataTransferApi` field in the
`CassandraBulkWriterContext` altogether and lazily initializes it during the `transfer()`
method invocation. We guard the initialization to a single thread by making the
`transfer()` method synchronized. The `SidecarDataTransferApi` can be recreated
when needed using the already serialized `clusterInfo`, `jobInfo`, and `conf`
fields.
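The lazy-initialization pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: `LazyTransientSketch`, `TransferApi`, and `roundTrip` are hypothetical, and a single `conf` string stands in for the serialized `clusterInfo`, `jobInfo`, and `conf` fields.

```java
import java.io.*;

// Hypothetical sketch of a serializable context with a lazily rebuilt transient field.
final class LazyTransientSketch implements Serializable
{
    private static final long serialVersionUID = 1L;

    // Non-serializable helper, standing in for the data transfer API.
    static final class TransferApi
    {
        private final String conf;
        TransferApi(String conf) { this.conf = conf; }
        String upload(String ssTable) { return conf + ":" + ssTable; }
    }

    private final String conf;                   // serialized with the context
    private transient TransferApi transferApi;   // never serialized; null after deserialization

    LazyTransientSketch(String conf) { this.conf = conf; }

    // synchronized so that only one thread performs the initialization
    synchronized String transfer(String ssTable)
    {
        if (transferApi == null)
        {
            transferApi = new TransferApi(conf); // recreated from serialized state
        }
        return transferApi.upload(ssTable);
    }

    // Simulates shipping the context to an executor via Java serialization.
    static LazyTransientSketch roundTrip(LazyTransientSketch context)
    {
        try
        {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes))
            {
                out.writeObject(context);
            }
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray())))
            {
                return (LazyTransientSketch) in.readObject();
            }
        }
        catch (IOException | ClassNotFoundException e)
        {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling `transfer()` on the deserialized copy rebuilds the helper on first use instead of dereferencing a null transient field.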
Fix setting ROW_BUFFER_MODE to BUFFERED
patch by Francisco Guerrero; reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18662
9523a38b3f1b5bc4313e2949896ddc1fff58afbe | Author: jkonisa <jkonisa@apple.com>
| 2023-06-15 13:31:01-07:00
CASSANDRA-18605 Adding support for TTL & Timestamps for bulk writes
This commit introduces a new feature in Spark Bulk Writer to support writes with
constant/per_row based TTL & Timestamps.
Patch by Jyothsna Konisa; Reviewed by Dinesh Joshi, Francisco Guerrero, Yifan Cai for CASSANDRA-18605
cbae09ca71b9eb9a581b77c23844da21474b095a | Author: Francisco Guerrero <frankgh@apache.org>
| 2023-06-14 11:52:55-07:00
CASSANDRA-18600 Add NOTICE.txt file
The NOTICE.txt file is currently missing in the repository. This commit adds the file to
comply with ASF's guidance.
patch by Francisco Guerrero; reviewed by Dinesh Joshi, Michael Semb Wever, Berenguer Blasi for CASSANDRA-18600
deebdf97ad01f23550d7d3b42d98c7bf111e2f95 | Author: Doug Rohrer <drohrer@apple.com>
| 2023-06-14 13:33:29-04:00
CASSANDRA-18759: Use in-jvm dtest framework from Sidecar for testing
This commit introduces the use of the in-jvm dtest framework for testing
Analytics workloads. It can spin up a Cassandra cluster, including the
necessary Sidecar process, to test writing to and reading from Cassandra
using the analytics library.
Additional changes made in this commit include:
* Use concurrent collections in MockBulkWriterContext (Fixes flaky test StreamSessionConsistencyTest)
The StreamSessionConsistencyTest uses the MockBulkWriterContext, but it wasn't originally used
(before this test was added) in a multi-threaded environment. Because of this, it would occasionally
throw ConcurrentModificationExceptions, which would cause the stream test to fail in a
non-deterministic way. This commit adds the use of concurrent/synchronous collections to the
MockBulkWriterContext to make sure it doesn't throw these spurious errors.
* Make the StartupValidation system thread-safe by using ThreadLocals
instead of static collections, and clearing them once validation is
complete.
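A minimal sketch of the per-thread state idea from the last bullet, assuming validations can be represented as plain strings. `ThreadLocalValidationSketch`, `register`, and `runAndClear` are hypothetical names and do not reflect the actual StartupValidation API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the real StartupValidation classes are not shown in this log.
final class ThreadLocalValidationSketch
{
    // Each thread sees its own list, unlike a shared static collection.
    private static final ThreadLocal<List<String>> VALIDATIONS = ThreadLocal.withInitial(ArrayList::new);

    static void register(String validation)
    {
        VALIDATIONS.get().add(validation);
    }

    // Returns the validations registered on the calling thread, then clears them.
    static List<String> runAndClear()
    {
        try
        {
            return new ArrayList<>(VALIDATIONS.get());
        }
        finally
        {
            VALIDATIONS.remove(); // drop this thread's state once validation is complete
        }
    }
}
```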
Patch by Doug Rohrer; Reviewed by Dinesh Joshi, Francisco Guerrero, Yifan Cai for CASSANDRA-18759
f0fae2deeee20df15ac1105af2163af2a7e7953d | Author: Francisco Guerrero <frankgh@apache.org>
| 2023-06-08 12:40:22-07:00
CASSANDRA-18578 Add circleci configuration yaml for Cassandra Analytics
This commit adds the CircleCI configuration yaml to test against all the existing
profiles
- cassandra-analytics-core-spark2-2.11-jdk8
- cassandra-analytics-core-spark2-2.12-jdk8
- cassandra-analytics-core-spark3-2.12-jdk11
- cassandra-analytics-core-spark3-2.13-jdk11
Patch by Francisco Guerrero; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18578
ee1c83722bfb1155bef762cdfb2c86034857f2d0 | Author: Francisco Guerrero <frankgh@apache.org>
| 2023-06-07 12:40:50-07:00
CASSANDRA-18574: Fix sample job documentation after Sidecar changes
This commit fixes the README file with documentation to set up and run the Sample job provided in the repository.
During Sidecar review, there was a suggestion to change the yaml property `uploads_staging_dir` to `staging_dir`.
That change however was not reflected as part of the sample job README.md.
patch by Francisco Guerrero; reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18574
b87b0edd310d1ef93c507bbbb1ae51e1b0b319c6 | Author: Francisco Guerrero <francisco.guerrero@apple.com>
| 2023-05-23 13:56:48-07:00
CASSANDRA-18545: Provide a SecretsProvider interface to abstract the secret provisioning
This commit introduces the SecretsProvider interface that abstracts the secrets provisioning.
This way different implementations of the SecretsProvider can be used to provide SSL secrets
for the Analytics job. We provide an implementation, SslConfigSecretsProvider, which provides
secrets based on the configuration for the job.
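The abstraction described above might look roughly like the sketch below. The methods of the real SecretsProvider interface are not shown in this log, so `SecretsSourceSketch`, `ConfigBackedSecretsSketch`, and the option keys are hypothetical stand-ins that only illustrate the shape of a configuration-backed implementation.

```java
import java.util.Map;

// Hypothetical stand-in for a secrets abstraction; not the project's SecretsProvider API.
interface SecretsSourceSketch
{
    String keyStorePassword();
    String trustStorePassword();
}

// Configuration-backed implementation: secrets are read from the job options.
final class ConfigBackedSecretsSketch implements SecretsSourceSketch
{
    private final Map<String, String> jobConf;

    ConfigBackedSecretsSketch(Map<String, String> jobConf)
    {
        this.jobConf = jobConf;
    }

    @Override
    public String keyStorePassword()
    {
        return require("keystore_password"); // option key is an assumption
    }

    @Override
    public String trustStorePassword()
    {
        return require("truststore_password"); // option key is an assumption
    }

    private String require(String key)
    {
        String value = jobConf.get(key);
        if (value == null)
        {
            throw new IllegalArgumentException("Missing job option: " + key);
        }
        return value;
    }
}
```

Another implementation, for example one backed by a secrets service, could replace the configuration-backed one without changes to the calling code.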
Patch by Francisco Guerrero; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18545
1633cd9c6c3d88d5c66825fab76a369266509f7e | Author: Dinesh Joshi <djoshi@apache.org>
| 2023-05-19 14:57:47-07:00
CEP-28: Apache Cassandra Analytics
This is the initial commit for the Apache Cassandra Analytics project
where we support reading and writing bulk data from Apache Cassandra from
Spark.
Patch by James Berragan, Doug Rohrer; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-16222
Co-authored-by: James Berragan <jberragan@apple.com>
Co-authored-by: Doug Rohrer <drohrer@apple.com>
Co-authored-by: Saranya Krishnakumar <saranya_k@apple.com>
Co-authored-by: Francisco Guerrero <francisco.guerrero@apple.com>
Co-authored-by: Yifan Cai <ycai@apache.org>
Co-authored-by: Jyothsna Konisa <jkonisa@apple.com>
Co-authored-by: Yuriy Semchyshyn <ysemchyshyn@apple.com>
Co-authored-by: Dinesh Joshi <djoshi@apache.org>