13 Collaborator |
Francisco Guerrero , Dinesh Joshi , Doug Rohrer , Brandon Williams , James Berragan , jberragan , Yuriy Semchyshyn , Saranya Krishnakumar , Arjun Ashok , Jyothsna Konisa , jkonisa , Bernardo Botella , Francisco Guerrero Hernandez |
28 Patch |
54 Review |
96f43fe0c831cd1317c90ecbd7463737fa439aec,
e51716ee724cf4950df67eba0393b3f798b7dc00,
dd536f2e70118cd5d0c319f5be3e54e3d50eb288,
a3ca0897c2190c2c18992ca2b7e5255318ff3eba,
cfef011d9745b43226dc1936f6e1778d3379b53d,
6556d251bdddfbef3935da760bcda2b2387a4391,
4624a17098e055e0abf9a6025451d4352cb9c147,
ff9ac41b4695c1df59f5293f69e0d3a1ce0da9f4,
4fb1e7f47d640353cd57f7a3035c70099049b29c,
f123406e458c0112145f37dcd3f8c20ba47c949d,
8655ca54a5d0749fccb2ad6a06ec230e8b0de24e,
cfe293dadcf7a1d4491591cfd39fc410a8fa52ba,
555e8494d3ca27a7b35aebabb1f669eede20cc53,
d75a6bae5abbf80810012a181644f240141014d5,
dbbd211cd420eb185d0579f16f5d46abc7bafeb4,
e168011c40de2ca48d138514640838067e61feea,
798182a6fda562538c2f44e4f3f92a7cb68cd81c,
2a693d721182c1d354ff1323b1324fe06ac03f36,
466e7bf5e160d1667c12ea1de1b79ba27670aba4,
aea798dc7e517af520a403d4d86f3bc6bed65092,
a13532272051d4e4608f92d53bdd997103e8ea19,
c3e8803b3331bc7ef81797ac52a8417524f67edc,
e0ae9d7484e242f6af495aac2cb4d8dc121fba89,
b5ba5fad4df490d1b7d47889361db910589409b8,
bbfca46129992e83055ba9b0b4f836871eef0990,
680cc9395c55a88217f2de975f62ad588e8c95d5,
88faba42e5cb3f1384c92024a9c3608135d76218,
1633cd9c6c3d88d5c66825fab76a369266509f7e |
e51716ee724cf4950df67eba0393b3f798b7dc00,
67281b31010791fa7f0d02dd0f776862e15846d3,
3bcc5297bd3115ff5949a9295eed6a9ad03fd096,
9f0f475ba87df6c631029748bfafb4169bfa6465,
3023a204c8ef16f886bd3dc219f7534b7edbaf2a,
bac08796181979afef4cc518789a380edef500f0,
84d84fe36b0d6e250c3d221c28c40b6925e4c222,
458a3630f882ae2b2a9cee272cf85ca7ff42f5cd,
a242b352c28947427a9bfc30295a487017439fd9,
b31a451873eb411788a6d94c3cacf881f5c3cb86,
8dce35f1cb3c204be669548ee286055b12e67fe9,
45de9e08e7dc177f1f3273456b62af7ee0f5dbdb,
f4014c06d7668541010d59cc932970e9ebfc36f5,
86420f9d52991fb148b322031df55494669532d3,
690101840d4d8f9c656bb0ca114f6619af80e1cf,
47fdb6448b6956249790d5dc7bb76b699d35c079,
cbbf33d001b6f953be5654f00d7dfb54011a7619,
295358095db80ced4b8f54f603f7bd9833a8f175,
c5d6dfd1bc9b682d704d28f77807ba72317b1944,
c00c454d698e5a29caf58e61ed52ab48d08fd7fe,
d28442ae712c1597052493aa3d2353a2de2495c2,
164243e78f1557a34bc699ebc716b532781d6422,
6ce33604bbd9acbee092ab3c4f7f11c0d434f730,
cf6de14d5b96ea173d6a1b2dad9bb64d563df06c,
46c35d0ef2efb66512133a7913df9936b0a80dc8,
dc0e79b9c483562ec0920d69e886715eb329c426,
d61e44f78fa4ba5ec395e1e39c507d666fddefd1,
fc08d45b283e701aa6d558e99cd18318394b0de7,
047d13806078bafdc3954273b0e240dbbb976bd4,
8c20b452dd0728a6fad6d276a7be9fa1b9274495,
d949d8c2b9813c3e8429ece34c364a356bd7d6eb,
fa6df8e2c09ad3d27bfe8c0ce016c839094630f6,
e8fb77f4813b469d73d39c84acf1e1fe7a40702b,
e82fceaecfe5ea04ac3ddff92be5a6a41456333c,
550bdfa1c6082537e2cfb93449128a61dbe3a1fb,
0aaf5659028dd874c8d666c636f11eae63c429e6,
672d66a64a21e23c4d81c089b426360c2bb708b7,
c7c3bbca2c7cb415b39689e924fa2357c239f043,
457b36bcb3c8a865cca83ca6c402246798113ab4,
c73c76498b0c2b36705025de6b0b2a7bb38e758b,
8771581b255e5728a16aea84430506d6f156a589,
deebdf97ad01f23550d7d3b42d98c7bf111e2f95,
f24951ab6ea2b1e9af4013b030675c70d31adb90,
014db08a79f00ef0d94e6855779e398c9dc689c1,
82b3c0a79c9322142738a4ec2ff7d4d4c0be2370,
6f8f404535d4cff9272091f669f985ce11cee7d2,
69766bca399cc779e0f2f8e859e39f7e29a17b7a,
912fbb47fddc07afcf56f5de97e813593bfb890e,
ee1c83722bfb1155bef762cdfb2c86034857f2d0,
bd0b41fb82134844a15fbb43126424d96706d08e,
9523a38b3f1b5bc4313e2949896ddc1fff58afbe,
f0fae2deeee20df15ac1105af2163af2a7e7953d,
b87b0edd310d1ef93c507bbbb1ae51e1b0b319c6,
7764214d1fb44fb6139a622f403bb05610e8f7b1 |
e51716ee724cf4950df67eba0393b3f798b7dc00 | Author: jberragan <jberragan@gmail.com>
| 2024-12-06 16:09:12-08:00
CASSANDRA-19962: CEP-44 Kafka integration for Cassandra CDC using Sidecar (#87)
This is the initial commit for CEP-44 to introduce a standalone CDC module into the Analytics project. This module provides the foundation for CDC in the Apache Cassandra Sidecar.
This module provides:
- a standalone Cdc class as the entrypoint for initializing CDC.
- pluggable interfaces for: listing and reading commit log segments for a token range, persisting and reading CDC state, providing the Cassandra table schema, optionally reading values from Cassandra.
- read and deserialize commit log mutations.
- reconcile and de-duplicate mutations across replicas.
- serialize CDC state into a binary object for persistence.
- a layer for converting Cassandra mutations into a standard consumable format.
Patch by James Berragan, Jyothsna Konisa, Yifan Cai; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-19962
Co-authored-by: James Berragan <jberragan@apple.com>
Co-authored-by: Yifan Cai <ycai@apache.org>
Co-authored-by: Jyothsna Konisa <jkonisa@apple.com>
dd536f2e70118cd5d0c319f5be3e54e3d50eb288 | Author: Yifan Cai <ycai@apache.org>
| 2024-11-15 15:31:14-08:00
CASSANDRA-20066: Expose detailed bulk write failure message for better insight (#92)
Patch by Yifan Cai; Reviewed by Doug Rohrer, Francisco Guerrero for CASSANDRA-20066
a3ca0897c2190c2c18992ca2b7e5255318ff3eba | Author: Yifan Cai <ycai@apache.org>
| 2024-11-05 14:22:45-08:00
CASSANDRA-19994: Add dataTransferApi and TwoPhaseImportCoordinator for coordinated write (#91)
Patch by Yifan Cai; Reviewed by Doug Rohrer, Francisco Guerrero for CASSANDRA-19994
3bcc5297bd3115ff5949a9295eed6a9ad03fd096 | Author: jberragan <jberragan@gmail.com>
| 2024-10-17 10:02:25-07:00
CASSANDRA-19980: Remove SparkSQL dependency from CassandraBridge so that it can be used independent from Spark (#88)
Patch by James Berragan; Reviewed by Francisco Guerrero, Yifan Cai for CASSANDRA-19980
f123406e458c0112145f37dcd3f8c20ba47c949d | Author: Yifan Cai <ycai@apache.org>
| 2024-09-11 21:03:47-07:00
CASSANDRA-19909: Add writer options COORDINATED_WRITE_CONFIG to define coordinated write to multiple Cassandra clusters (#79)
The option specifies the configuration (in JSON) for coordinated write.
See org.apache.cassandra.spark.bulkwriter.coordinatedwrite.CoordinatedWriteConf.
When the option is present, SIDECAR_CONTACT_POINTS, SIDECAR_INSTANCES and LOCAL_DC are ignored if they are present.
Patch by Yifan Cai; Reviewed by Doug Rohrer, Francisco Guerrero for CASSANDRA-19909
9f0f475ba87df6c631029748bfafb4169bfa6465 | Author: Arjun Ashok <arjun_ashok@apple.com>
| 2024-09-05 13:24:01-07:00
CASSANDRA-19873: Removes checks for blocked instances from bulk-write path (#76)
Patch by Arjun Ashok; Reviewed by Yifan Cai, Francisco Guerrero for CASSANDRA-19873
cfe293dadcf7a1d4491591cfd39fc410a8fa52ba | Author: Yifan Cai <ycai@apache.org>
| 2024-08-30 11:08:00-07:00
CASSANDRA-19842: Consistency level check incorrectly passes when majority of the replica set is unavailable for write (#75)
Patch by Yifan Cai; Reviewed by Doug Rohrer, Francisco Guerrero for CASSANDRA-19842
555e8494d3ca27a7b35aebabb1f669eede20cc53 | Author: Yifan Cai <ycai@apache.org>
| 2024-08-20 17:33:35-07:00
CASSANDRA-19836: Fix NPE when writing UDT values (#74)
When UDT field values are set to null, the bulk writer throws an NPE.
Patch by Yifan Cai; Reviewed by Dinesh Joshi, Doug Rohrer for CASSANDRA-19836
d75a6bae5abbf80810012a181644f240141014d5 | Author: Yifan Cai <ycai@apache.org>
| 2024-08-14 13:10:15-07:00
CASSANDRA-19827: Add job_timeout_seconds writer option (#73)
Option to specify the timeout in seconds for bulk write jobs. By default, it is disabled.
When `JOB_TIMEOUT_SECONDS` is specified, a job exceeding the timeout is:
- successful when the desired consistency level is met
- a failure otherwise
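The timeout rule above can be sketched as a small decision function. This is only an illustration: the class, enum, and method names are hypothetical, and treating a non-positive timeout as "disabled" is an assumption rather than something the commit states.

```java
// Hypothetical sketch of the JOB_TIMEOUT_SECONDS rule; not the bulk writer's actual code.
final class JobTimeoutSketch
{
    enum Outcome { NOT_TIMED_OUT, SUCCEEDED, FAILED }

    static Outcome evaluate(long elapsedSeconds, long timeoutSeconds, boolean consistencyLevelMet)
    {
        // Assumption: a non-positive timeout means the option is disabled (the default).
        boolean timeoutEnabled = timeoutSeconds > 0;
        if (!timeoutEnabled || elapsedSeconds <= timeoutSeconds)
        {
            return Outcome.NOT_TIMED_OUT;
        }
        // The job exceeded the timeout: it succeeds only if the consistency level is already met.
        return consistencyLevelMet ? Outcome.SUCCEEDED : Outcome.FAILED;
    }

    public static void main(String[] args)
    {
        System.out.println(evaluate(61, 60, true));  // SUCCEEDED
        System.out.println(evaluate(61, 60, false)); // FAILED
    }
}
```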
Patch by Yifan Cai; Reviewed by Dinesh Joshi, Doug Rohrer for CASSANDRA-19827
e168011c40de2ca48d138514640838067e61feea | Author: Yifan Cai <ycai@apache.org>
| 2024-08-07 17:02:38-07:00
CASSANDRA-19806: Stream sstable eagerly when bulk writing to reclaim local disk space sooner (#69)
Patch by Yifan Cai; Reviewed by Francisco Guerrero for CASSANDRA-19806
3023a204c8ef16f886bd3dc219f7534b7edbaf2a | Author: jberragan <jberragan@gmail.com>
| 2024-08-03 07:51:52+01:00
CASSANDRA-19807: Improve the core bulk reader test system to match actual and expected rows by concatenating the partition keys with the serialized hex string instead of utf-8 string (#70)
Patch by James Berragan; Reviewed by Francisco Guerrero, Yifan Cai for CASSANDRA-19807
84d84fe36b0d6e250c3d221c28c40b6925e4c222 | Author: jberragan <jberragan@gmail.com>
| 2024-07-22 13:38:28-07:00
CASSANDRA-19791: Remove other uses of Apache Commons lang for hashcode, equality and random string generation (#67)
Patch by James Berragan; Reviewed by Francisco Guerrero, Yifan Cai for CASSANDRA-19791
458a3630f882ae2b2a9cee272cf85ca7ff42f5cd | Author: jberragan <jberragan@gmail.com>
| 2024-07-17 14:29:21-07:00
CASSANDRA-19778: Split out BufferingInputStream stats into separate i… (#66)
Split BufferingInputStream stats into separate interface so class level generics are not required for the Stats interface
Patch by James Berragan; Reviewed by Bernardo Botella, Francisco Guerrero, Yifan Cai for CASSANDRA-19778
798182a6fda562538c2f44e4f3f92a7cb68cd81c | Author: Yifan Cai <ycai@apache.org>
| 2024-07-16 21:44:18-07:00
CASSANDRA-19772: Deprecate option SIDECAR_INSTANCES and replace with SIDECAR_CONTACT_POINTS (#63)
This patch introduces a new option SIDECAR_CONTACT_POINTS for both bulk writer and reader. The option name better describes the purpose, which is to specify the initial contact points to discover the cluster topology. The existing option SIDECAR_INSTANCES serves the same purpose and is now deprecated.
In addition, it allows including the port value in the addresses when defining SIDECAR_CONTACT_POINTS.
Patch by Yifan Cai; Reviewed by Francisco Guerrero for CASSANDRA-19772
2a693d721182c1d354ff1323b1324fe06ac03f36 | Author: Yifan Cai <ycai@apache.org>
| 2024-07-16 15:20:52-07:00
CASSANDRA-19774: Bump Cassandra Sidecar version (#65)
Update Cassandra Sidecar commit sha: 55a9efee30555d3645680c6524043a6c9bc1194b
Patch by Yifan Cai; Reviewed by Francisco Guerrero for CASSANDRA-19774
a242b352c28947427a9bfc30295a487017439fd9 | Author: jberragan <jberragan@gmail.com>
| 2024-07-12 14:57:38-07:00
CASSANDRA-19748: Refactoring to introduce new cassandra-analytics-common module with minimal dependencies (#62)
- Add new module cassandra-analytics-common with no dependencies on Spark or Cassandra and minimal standard dependencies (Guava, Jackson, Commons Lang, Kryo, etc.)
- Move standalone classes to cassandra-analytics-common module.
Some additional refactoring and clean up:
- Rename SSTableInputStream -> BufferingInputStream
- Rename SSTableSource -> CassandraFileSource
- Introduce CassandraFile interface to be the implementing class for SSTable and CommitLog.
- Generalize IStats to work across different CassandraFile types
- Rename methods in StreamScanner to make the API clearer.
- Move ComplexTypeBuffer, ListBuffer, MapBuffer, SetBuffer, UdtBuffer to standalone classes
- Delete unused classes RangeTombstone, ReplicaSet and CollectionElement.
- Remove commons lang as a dependency
- Rename Rid to RowData
Patch by James Berragan; Reviewed by Bernardo Botella, Dinesh Joshi, Francisco Guerrero, Yifan Cai, Yuriy Semchyshyn for CASSANDRA-19748
b31a451873eb411788a6d94c3cacf881f5c3cb86 | Author: Francisco Guerrero <frankgh@apache.org>
| 2024-06-27 15:51:15-07:00
CASSANDRA-19727: Bulk writer fails validation stage when writing to a cluster using RandomPartitioner (#61)
Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19727
8dce35f1cb3c204be669548ee286055b12e67fe9 | Author: Francisco Guerrero <frankgh@apache.org>
| 2024-06-20 14:24:21-07:00
CASSANDRA-19716: Invalid mapping when timestamp is used as a partition key during bulk writes (#60)
Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19716
f4014c06d7668541010d59cc932970e9ebfc36f5 | Author: Francisco Guerrero <frankgh@apache.org>
| 2024-05-10 13:17:04-07:00
CASSANDRA-19626 Fix NullPointerException when reading static column with null values (#58)
Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19626
466e7bf5e160d1667c12ea1de1b79ba27670aba4 | Author: Yifan Cai <52585731+yifan-c@users.noreply.github.com>
| 2024-05-03 10:11:18-07:00
CASSANDRA-19616: Integrate with the latest sidecar client (#56)
The patch updates the analytics code to consume the latest sidecar client after CASSANDRASC-127
Patch by Yifan Cai; Reviewed by Francisco Guerrero for CASSANDRA-19616
aea798dc7e517af520a403d4d86f3bc6bed65092 | Author: Yifan Cai <52585731+yifan-c@users.noreply.github.com>
| 2024-04-22 15:46:08-07:00
CASSANDRA-19563: Support bulk write via S3 (#53)
This commit adds a configuration (writer) option to pick a transport other than the previously-implemented "direct upload to all sidecars" (now known as the "Direct" transport). The second transport, now being implemented, is the "S3_COMPAT" transport, which allows the job to upload the generated SSTables to an S3-compatible storage system, and then inform the Cassandra Sidecar that those files are available for download & commit.
Additionally, a plug-in system was added to allow communications between custom transport hooks and the job, so the custom hook can provide updated credentials and out-of-band status updates on S3-related issues.
Co-Authored-By: Yifan Cai <ycai@apache.org>
Co-Authored-By: Doug Rohrer <drohrer@apple.com>
Co-Authored-By: Francisco Guerrero <frankgh@apache.org>
Co-Authored-By: Saranya Krishnakumar <saranya_k@apple.com>
Patch by Yifan Cai, Doug Rohrer, Francisco Guerrero, Saranya Krishnakumar; Reviewed by Francisco Guerrero for CASSANDRA-19563
690101840d4d8f9c656bb0ca114f6619af80e1cf | Author: Francisco Guerrero <frankgh@apache.org>
| 2024-04-08 14:33:50-07:00
CASSANDRA-19526: Optionally enable TLS in the server and client for Analytics testing
All integration tests today run without TLS, which is generally fine because they run locally. However,
it is helpful to be able to start up the sidecar with TLS enabled in the integration test framework so
that third-party tests could connect via secure connections for testing purposes.
Co-authored-by: Doug Rohrer <drohrer@apple.com>
Co-authored-by: Francisco Guerrero <frankgh@apache.org>
Patch by Doug Rohrer, Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19526
cbbf33d001b6f953be5654f00d7dfb54011a7619 | Author: Francisco Guerrero <frankgh@apache.org>
| 2024-04-03 11:08:32-07:00
CASSANDRA-19519: Migrate remaining integration tests to the single dtest cluster per class model (#49)
Additionally, we remove the unused test framework code after migrating the tests.
Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19519
295358095db80ced4b8f54f603f7bd9833a8f175 | Author: Francisco Guerrero <frankgh@apache.org>
| 2024-04-02 16:26:40-07:00
CASSANDRA-19513: Refactor Cassandra bridge (#48)
This commit splits the bridge implementation from the shaded `cassandra-all` library. This separation
allows for better integration of different `cassandra-all` implementations. Additionally, it better
separates the actual bridge code from the Cassandra code.
Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19513
c00c454d698e5a29caf58e61ed52ab48d08fd7fe | Author: Francisco Guerrero <frankgh@apache.org>
| 2024-04-01 12:11:52-07:00
CASSANDRA-19507 Fix bulk reads of multiple tables that potentially have the same data file name (#47)
When reading multiple data frames from different tables using the bulk reader, the same data
file name may be retrieved from the same Sidecar instance. Because the `SSTable`s are cached in the `SSTableCache`,
the `org.apache.cassandra.spark.reader.SSTableReader` may use the incorrect `SSTable` if it was
cached with the same `#hashCode`.
In this patch, the equality takes into account the keyspace, table, and snapshot name.
Additionally, we implement the `hashCode` and `equals` method in `org.apache.cassandra.clients.SidecarInstanceImpl` to utilize the `SSTableCache` correctly. Once the methods are implemented, the issue originally described in JIRA is surfaced.
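A minimal sketch of the idea, assuming a cache keyed by an object whose equality covers keyspace, table, snapshot name, and data file name. `DataFileKey` and its fields are hypothetical and are not the project's `SSTable` or `SidecarInstanceImpl` classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical cache key; the real SSTable classes in Analytics differ.
final class DataFileKey
{
    private final String keyspace;
    private final String table;
    private final String snapshot;
    private final String fileName;

    DataFileKey(String keyspace, String table, String snapshot, String fileName)
    {
        this.keyspace = keyspace;
        this.table = table;
        this.snapshot = snapshot;
        this.fileName = fileName;
    }

    @Override
    public boolean equals(Object other)
    {
        if (this == other)
        {
            return true;
        }
        if (!(other instanceof DataFileKey))
        {
            return false;
        }
        DataFileKey that = (DataFileKey) other;
        // The file name alone is not enough: two tables can both have "nb-1-big-Data.db".
        return keyspace.equals(that.keyspace)
               && table.equals(that.table)
               && snapshot.equals(that.snapshot)
               && fileName.equals(that.fileName);
    }

    @Override
    public int hashCode()
    {
        return Objects.hash(keyspace, table, snapshot, fileName);
    }

    public static void main(String[] args)
    {
        Map<DataFileKey, String> cache = new HashMap<>();
        cache.put(new DataFileKey("ks", "t1", "snap", "nb-1-big-Data.db"), "SSTable of t1");
        cache.put(new DataFileKey("ks", "t2", "snap", "nb-1-big-Data.db"), "SSTable of t2");
        System.out.println(cache.size()); // 2: same file name, different tables
    }
}
```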
Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19507
d28442ae712c1597052493aa3d2353a2de2495c2 | Author: Francisco Guerrero <frankgh@apache.org>
| 2024-03-27 13:32:39-07:00
CASSANDRA-19500 Fix XXHash32Digest calculated digest value (#46)
This PR bumps the Sidecar version to the current latest HEAD of Sidecar. Bumping the
version surfaced an issue with the way we are producing digest strings for the XXHash32
implementation. The hash value is not masked and this causes the negative sign to be
forwarded producing the incorrect hash result.
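The effect of the missing mask can be illustrated with a short sketch. It assumes the digest string is a hexadecimal rendering of a 32-bit hash that gets widened to `long`; the class and method names are hypothetical and not taken from the `XXHash32Digest` implementation.

```java
// Hypothetical illustration of the sign issue; not the project's XXHash32Digest code.
final class DigestHexSketch
{
    // Widening a negative int to long sign-extends it, so the hex string gains leading f's.
    static String unmasked(int hash)
    {
        return Long.toHexString(hash);
    }

    // Masking with 0xFFFFFFFFL keeps only the low 32 bits, treating the hash as unsigned.
    static String masked(int hash)
    {
        return Long.toHexString(hash & 0xFFFFFFFFL);
    }

    public static void main(String[] args)
    {
        int hash = 0x80000001; // high bit set, so the int value is negative
        System.out.println(unmasked(hash)); // ffffffff80000001
        System.out.println(masked(hash));   // 80000001
    }
}
```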
Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19500
164243e78f1557a34bc699ebc716b532781d6422 | Author: Arjun Ashok <arjun_ashok@apple.com>
| 2024-03-22 16:22:44-07:00
CASSANDRA-19418 - Changes to report additional bulk analytics job stats for instrumentation (#41)
Patch by Arjun Ashok; Reviewed by Doug Rohrer, Yifan Cai, Francisco Guerrero for CASSANDRA-19418
6ce33604bbd9acbee092ab3c4f7f11c0d434f730 | Author: Saranya Krishnakumar <saranya_k@apple.com>
| 2024-03-06 14:32:22-08:00
CASSANDRA-19424 Check for expired certificate during start up validation (#43)
patch by Saranya Krishnakumar; reviewed by Francisco Guerrero, Yifan Cai for CASSANDRA-19424
a13532272051d4e4608f92d53bdd997103e8ea19 | Author: Yifan Cai <52585731+yifan-c@users.noreply.github.com>
| 2024-03-05 11:06:36-08:00
CASSANDRA-19452 Use constant reference time during bulk read process (#44)
patch by Yifan Cai; reviewed by Francisco Guerrero, James Berragan for CASSANDRA-19452
46c35d0ef2efb66512133a7913df9936b0a80dc8 | Author: Francisco Guerrero <frankgh@apache.org>
| 2024-02-19 20:50:16-08:00
CASSANDRA-19411: Bulk reader fails to produce a row when regular column values are null
Bulk Reader won't emit a row when the regular column values are all `null`. For example,
a schema `PK` = `a`, `b` ; `CK` = `c`, `d` ; and columns = `e`, `f`.
| a | b | c | d | e | f |
| --- | --- | --- | --- | ---- | ---- |
| pk1 | pk2 | ck1 | ck2 | null | null |
When queried from Analytics bulk reader, it won't produce a row.
This issue also occurs when the projected regular column values are all `null`, where
other non-projected columns might have some values.
Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19411
c3e8803b3331bc7ef81797ac52a8417524f67edc | Author: Yifan Cai <ycai@apache.org>
| 2024-02-13 09:52:57-08:00
CASSANDRA-19285 Fix flaky Host replacement tests and shrink tests
The flakiness is caused by inspecting a class whose classloader is already closed. The fix is to include those classes in the sharedClassLoader, so that the classLoader is not closed during the test.
patch by Yifan Cai; reviewed by Francisco Guerrero for CASSANDRA-19285
fc08d45b283e701aa6d558e99cd18318394b0de7 | Author: Francisco Guerrero <frankgh@apache.org>
| 2024-01-31 14:35:34-08:00
CASSANDRA-19351 No longer need to synchronize on Schema.instance after Cassandra 4.0.12
We no longer need to synchronize on the `Schema.instance` in Analytics after the release of Cassandra
4.0.12, that includes a synchronization fix in https://issues.apache.org/jira/browse/CASSANDRA-18317.
This commit cleans up TODOs pending on that code being released.
Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19351
dc0e79b9c483562ec0920d69e886715eb329c426 | Author: Francisco Guerrero <frankgh@apache.org>
| 2024-01-31 13:44:23-08:00
CASSANDRA-19369 Use XXHash32 for digest calculation of SSTables
This commit adds the ability to use the XXHash32 digest algorithm, which is newly supported in Cassandra Sidecar.
The commit allows for backwards compatibility to perform MD5 checksumming, but it now defaults to XXHash32.
A new Writer option is added:
```
.option(WriterOptions.DIGEST.name(), "XXHASH32") // or
.option(WriterOptions.DIGEST.name(), "MD5")
```
This option defaults to XXHash32, when not provided, but it can be configured to use the legacy MD5 algorithm.
Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19369
e0ae9d7484e242f6af495aac2cb4d8dc121fba89 | Author: Yifan Cai <ycai@apache.org>
| 2024-01-24 15:38:41-08:00
CASSANDRA-19334 Upgrade to Cassandra 4.0.12 and remove BufferMode and BatchSize options
In cassandra-all:4.0.12, improvements were made for the CQLSSTableWriter. The sorted writer can now produce size-capped SSTables. It replaces the need for the unsorted SSTable writer, which has to buffer and sort data on flushing. The dataset to write in the Spark application is already sorted. Avoiding the unsorted writer prevents wasting CPU time on sorting already sorted data. Since the sorted SSTable writer does not need to buffer data, its size estimation is more accurate than the unsorted one, meaning the produced SSTable files are closer to the expected size.
By removing the unsorted sstable writer, it no longer requires the RowBufferMode option.
By supporting size-capping in sorted writer, it no longer requires the BatchSize option.
Patch by Yifan Cai; reviewed by Francisco Guerrero for CASSANDRA-19334
d949d8c2b9813c3e8429ece34c364a356bd7d6eb | Author: Francisco Guerrero <frankgh@apache.org>
| 2024-01-22 09:00:52-08:00
CASSANDRA-19275 Fix flaky host replacement tests and shrink tests
This patch fixes flaky tests when a `BindException` occurs during cluster provisioning.
When a `BindException` is encountered, cluster provisioning is retried for up to
`MAX_CLUSTER_PROVISION_RETRIES`.
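A bounded retry of this kind could look roughly like the sketch below. The `Provisioner` interface, the method name, and the value `5` are assumptions for illustration, as is counting the first attempt against the limit; only `BindException` and the constant name come from the commit message.

```java
import java.net.BindException;

// Hypothetical sketch of retrying cluster provisioning; not the test framework's actual code.
final class ProvisionRetrySketch
{
    static final int MAX_CLUSTER_PROVISION_RETRIES = 5; // value assumed for illustration

    interface Provisioner<T>
    {
        T provision() throws BindException;
    }

    static <T> T provisionWithRetries(Provisioner<T> provisioner)
    {
        BindException last = null;
        for (int attempt = 1; attempt <= MAX_CLUSTER_PROVISION_RETRIES; attempt++)
        {
            try
            {
                return provisioner.provision();
            }
            catch (BindException e)
            {
                // A port was still in use; remember the failure and try again.
                last = e;
            }
        }
        throw new IllegalStateException("Cluster provisioning failed after "
                                        + MAX_CLUSTER_PROVISION_RETRIES + " attempts", last);
    }
}
```

Only `BindException` is retried here; any other failure propagates immediately, so unrelated provisioning errors are not masked.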
Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19275
fa6df8e2c09ad3d27bfe8c0ce016c839094630f6 | Author: Arjun Ashok <arjun_ashok@apple.com>
| 2024-01-16 12:24:02-08:00
CASSANDRA-19272 Add new writer option for blocklisted instances and corresponding integration tests
Patch by Arjun Ashok; Reviewed by Francisco Guerrero, Yifan Cai for CASSANDRA-19272
550bdfa1c6082537e2cfb93449128a61dbe3a1fb | Author: Francisco Guerrero <frankgh@apache.org>
| 2023-12-19 12:50:43-08:00
CASSANDRA-19251 Speed up integration tests
This commit introduces an opinionated way to run integration tests where a test class
reuses the same in-jvm dtest cluster, and it offers a certain ordering that helps tests
run faster.
The test setup does the following:
- Find the Cassandra version to run
- Provision a cluster for the test
- Initialize schemas required for tests
- Start the Sidecar service
The above approach guarantees that Sidecar is ready once the setup method completes,
which means we no longer need to spend time waiting for schema propagation. This
optimization also helps in reducing test time.
The drawback of this approach is that if a test needs the cluster to be in a particular
state, for example a node in the joining state while the bulk test executes, then that
cluster can only be used for tests in that state. Testing different states of the
cluster therefore requires a new test class.
Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19251
d61e44f78fa4ba5ec395e1e39c507d666fddefd1 | Author: Yuriy Semchyshyn <yuriy@semchyshyn.com>
| 2023-11-29 17:49:29-06:00
CASSANDRA-19377 Startup Validation Failures when Checking Sidecar Connectivity
patch by Yuriy Semchyshyn; reviewed by Francisco Guerrero, Yifan Cai for CASSANDRA-19377
c73c76498b0c2b36705025de6b0b2a7bb38e758b | Author: Doug Rohrer <drohrer@apple.com>
| 2023-11-20 10:54:46-05:00
CASSANDRA-19048 - Audit table properties passed through Analytics CqlUtils
The following properties have an effect on the files generated by the
bulk writer, and therefore need to be retained when cleaning the table
schema:
bloom_filter_fp_chance
cdc
compression
default_time_to_live
min_index_interval
max_index_interval
Additionally, this commit adds tests to make sure all available TTL
paths, including table default TTLs and constant/per-row options, work
as designed.
Patch by Doug Rohrer; Reviewed by Francisco Guerrero Hernandez, Yifan Cai,
Dinesh Joshi for CASSANDRA-19048
c7c3bbca2c7cb415b39689e924fa2357c239f043 | Author: Francisco Guerrero <frankgh@apache.org>
| 2023-11-14 16:28:14-08:00
CASSANDRA-19031: Fix bulk writing when using identifiers that need quotes
Cassandra treats all identifiers as lower case unless explicitly quoted by the users,
(i.e. keyspace names, table names, column names, etc). We can define a case-sensitive
identifier or we can use a reserved word as an identifier by quoting it during DDL
creation.
In the analytics library, bulk writing fails when we encounter these identifiers. In
this commit, we fix the issue by properly propagating the information about whether
identifiers need to be quoted by exposing a new dataframe option (`quote_identifiers`).
When set to `true`, it will _maybe_ quote the keyspace/table/column names and it will
properly be able to write data when using mixed-case or reserved words in the
identifiers.
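The "maybe quote" behavior can be sketched as follows. This is an assumption-laden illustration: the class and method names are hypothetical, the reserved-word set is a tiny subset of the CQL reserved keywords, and the library's actual quoting logic may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch of conditional identifier quoting; not the library's implementation.
final class IdentifierQuotingSketch
{
    // Identifiers that are safe without quotes: lower case letters, digits and underscores.
    private static final Pattern UNQUOTED_SAFE = Pattern.compile("[a-z][a-z0-9_]*");

    // Small illustrative subset of CQL reserved words.
    private static final Set<String> RESERVED = new HashSet<>(Arrays.asList("select", "from", "table"));

    static String maybeQuote(String identifier)
    {
        boolean safe = UNQUOTED_SAFE.matcher(identifier).matches() && !RESERVED.contains(identifier);
        // Embedded double quotes are escaped by doubling them.
        return safe ? identifier : '"' + identifier.replace("\"", "\"\"") + '"';
    }

    public static void main(String[] args)
    {
        System.out.println(maybeQuote("users"));   // users
        System.out.println(maybeQuote("MyTable")); // "MyTable"
        System.out.println(maybeQuote("table"));   // "table"
    }
}
```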
Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19031
457b36bcb3c8a865cca83ca6c402246798113ab4 | Author: Francisco Guerrero <frankgh@apache.org>
| 2023-11-13 16:16:36-08:00
CASSANDRA-19024 Fix bulk reading when using identifiers that need quotes
Cassandra treats all identifiers as lower case unless explicitly quoted by the users,
(i.e. keyspace names, table names, column names, etc). We can define a case-sensitive
identifier or we can use a reserved word as an identifier by quoting it during DDL
creation.
In the analytics library, bulk reads fail when we encounter these identifiers. In this
commit, we fix the issue by properly propagating information about whether identifiers
need to be quoted by exposing a new data frame option (`quote_identifiers`). When set to
`true`, it will maybe quote the keyspace/table and it will properly be able to read data
when these situations are encountered.
Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19024
0aaf5659028dd874c8d666c636f11eae63c429e6 | Author: Arjun Ashok <arjun_ashok@apple.com>
| 2023-10-09 07:53:40-07:00
CASSANDRA-18852 - Changes to make bulk writer resilient to cluster resize operations
Patch by Arjun Ashok, Saranya Krishnakumar; Reviewed by Yifan Cai, Francisco Guerrero, Doug Rohrer for CASSANDRA-18852
Co-authored-by: Arjun Ashok <arjun_ashok@apple.com>
Co-authored-by: Saranya Krishnakumar <saranya_k@apple.com>
f24951ab6ea2b1e9af4013b030675c70d31adb90 | Author: Yuriy Semchyshyn <yuriy@semchyshyn.com>
| 2023-08-14 14:09:12-05:00
CASSANDRA-18810: Cassandra Analytics Start-Up Validation
Patch by Yuriy Semchyshyn; Reviewed by Dinesh Joshi, Francisco Guerrero, Yifan Cai for CASSANDRA-18810
82b3c0a79c9322142738a4ec2ff7d4d4c0be2370 | Author: Francisco Guerrero <frankgh@apache.org>
| 2023-07-25 12:41:10-07:00
CASSANDRA-18692 Fix bulk writes with Buffered RowBufferMode
When setting Buffered RowBufferMode as part of the `WriterOption`s,
`org.apache.cassandra.spark.bulkwriter.RecordWriter` ignores that configuration and instead
uses the batch size to determine when to finalize an SSTable and start writing a new SSTable,
if more rows are available.
In this commit, we fix `org.apache.cassandra.spark.bulkwriter.RecordWriter#checkBatchSize`
to take into account the configured `RowBufferMode`. For the `UNBUFFERED` RowBufferMode, we
check the batchSize of the SSTable during writes; for `BUFFERED`, that check has no
effect.
Co-authored-by: Doug Rohrer <doug@therohrers.org>
Patch by Francisco Guerrero, Doug Rohrer; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18692
014db08a79f00ef0d94e6855779e398c9dc689c1 | Author: James Berragan <jberragan@apple.com>
| 2023-07-19 12:23:07-07:00
CASSANDRA-18683: Add PartitionSizeTableProvider for reading the compressed and uncompressed sizes of all partitions in a table by utilizing the SSTable Index.db files
Patch by James Berragan; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18683
69766bca399cc779e0f2f8e859e39f7e29a17b7a | Author: Francisco Guerrero <frankgh@apache.org>
| 2023-06-27 10:03:56-07:00
CASSANDRA-18662: Fix cassandra-analytics-core-example
This commit fixes the `SampleCassandraJob` available under the `cassandra-analytics-core-example`
subproject.
Fix checkstyle issues
Fix serialization issue in SidecarDataTransferApi
The `sidecarClient` field in `SidecarDataTransferApi` is declared as transient,
which causes NPEs on executors when they try to perform an SSTable
upload.
This commit completely avoids serializing the `dataTransferApi` field in the
`CassandraBulkWriterContext`, and lazily initializes it during the `transfer()`
method invocation. We guard the initialization to a single thread by making the
`transfer()` method synchronized. The `SidecarDataTransferApi` can be recreated
when needed using the already serialized `clusterInfo`, `jobInfo`, and `conf`
fields.
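The pattern described above, a transient field rebuilt lazily from serialized state inside a synchronized method, can be sketched as follows. All names here are hypothetical stand-ins; the real `CassandraBulkWriterContext` and `SidecarDataTransferApi` are more involved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-ins for the writer context and the transfer API.
final class WriterContextSketch implements Serializable
{
    private static final long serialVersionUID = 1L;

    // Stand-in for a client object that is not serializable.
    static final class TransferApi
    {
        final String target;

        TransferApi(String target)
        {
            this.target = target;
        }
    }

    private final String clusterInfo;              // serialized along with the context
    private transient TransferApi dataTransferApi; // skipped by serialization, null after deserialization

    WriterContextSketch(String clusterInfo)
    {
        this.clusterInfo = clusterInfo;
    }

    // synchronized so that only one thread performs the lazy initialization
    synchronized TransferApi transfer()
    {
        if (dataTransferApi == null)
        {
            dataTransferApi = new TransferApi(clusterInfo);
        }
        return dataTransferApi;
    }

    // Serializes and deserializes the context, similar to shipping it to an executor.
    static WriterContextSketch roundTrip(WriterContextSketch context)
    {
        try
        {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes))
            {
                out.writeObject(context);
            }
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray())))
            {
                return (WriterContextSketch) in.readObject();
            }
        }
        catch (IOException | ClassNotFoundException e)
        {
            throw new IllegalStateException(e);
        }
    }
}
```

After a round trip the transient field is `null`, and the first `transfer()` call recreates it from the serialized `clusterInfo` instead of failing with an NPE.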
Fix setting ROW_BUFFER_MODE to BUFFERED
patch by Francisco Guerrero; reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18662
9523a38b3f1b5bc4313e2949896ddc1fff58afbe | Author: jkonisa <jkonisa@apple.com>
| 2023-06-15 13:31:01-07:00
CASSANDRA-18605 Adding support for TTL & Timestamps for bulk writes
This commit introduces a new feature in Spark Bulk Writer to support writes with
constant/per_row based TTL & Timestamps.
Patch by Jyothsna Konisa; Reviewed by Dinesh Joshi, Francisco Guerrero, Yifan Cai for CASSANDRA-18605
deebdf97ad01f23550d7d3b42d98c7bf111e2f95 | Author: Doug Rohrer <drohrer@apple.com>
| 2023-06-14 13:33:29-04:00
CASSANDRA-18759: Use in-jvm dtest framework from Sidecar for testing
This commit introduces the use of the in-jvm dtest framework for testing
Analytics workloads. It can spin up a Cassandra cluster, including the
necessary Sidecar process, to test writing to and reading from Cassandra
using the analytics library.
Additional changes made in this commit include
* Use concurrent collections in MockBulkWriterContext (Fixes flaky test StreamSessionConsistencyTest)
The StreamSessionConsistency test uses the MockBulkWriter context, but it wasn't originally used
(before this test was added) in a multi-threaded environment. Because of this, it would occasionally
throw ConcurrentModificationExceptions, which would cause the stream test to fail in a
non-deterministic way. This commit adds the use of concurrent/synchronous collections to the
MockBulkWriterContext to make sure it doesn't throw these spurious errors.
* Make the StartupValidation system thread-safe by using ThreadLocals
instead of static collections, and clearing them once validation is
complete.
Patch by Doug Rohrer; Reviewed by Dinesh Joshi, Francisco Guerrero, Yifan Cai for CASSANDRA-18759
f0fae2deeee20df15ac1105af2163af2a7e7953d | Author: Francisco Guerrero <frankgh@apache.org>
| 2023-06-08 12:40:22-07:00
CASSANDRA-18578 Add circleci configuration yaml for Cassandra Analytics
This commit adds the CircleCI configuration yaml to test against all the existing
profiles
- cassandra-analytics-core-spark2-2.11-jdk8
- cassandra-analytics-core-spark2-2.12-jdk8
- cassandra-analytics-core-spark3-2.12-jdk11
- cassandra-analytics-core-spark3-2.13-jdk11
Patch by Francisco Guerrero; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18578
ee1c83722bfb1155bef762cdfb2c86034857f2d0 | Author: Francisco Guerrero <frankgh@apache.org>
| 2023-06-07 12:40:50-07:00
CASSANDRA-18574: Fix sample job documentation after Sidecar changes
This commit fixes the README file with documentation to setup and run the Sample job provided in the repository.
During Sidecar review, there was a suggestion to change the yaml property `uploads_staging_dir` to `staging_dir`.
That change however was not reflected as part of the sample job README.md.
patch by Francisco Guerrero; reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18574
7764214d1fb44fb6139a622f403bb05610e8f7b1 | Author: Francisco Guerrero <frankgh@apache.org>
| 2023-05-24 14:21:59-07:00
CASSANDRA-18548: Add the .asf.yaml file
This commit adds the .asf.yaml file to control notifications and github settings
for the Cassandra Analytics project.
Patch by Francisco Guerrero; Reviewed by Brandon Williams, Yifan Cai for CASSANDRA-18548
b87b0edd310d1ef93c507bbbb1ae51e1b0b319c6 | Author: Francisco Guerrero <francisco.guerrero@apple.com>
| 2023-05-23 13:56:48-07:00
CASSANDRA-18545: Provide a SecretsProvider interface to abstract the secret provisioning
This commit introduces the SecretsProvider interface that abstracts the secrets provisioning.
This way different implementations of the SecretsProvider can be used to provide SSL secrets
for the Analytics job. We provide an implementation, SslConfigSecretsProvider, which provides
secrets based on the configuration for the job.
Patch by Francisco Guerrero; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18545
1633cd9c6c3d88d5c66825fab76a369266509f7e | Author: Dinesh Joshi <djoshi@apache.org>
| 2023-05-19 14:57:47-07:00
CEP-28: Apache Cassandra Analytics
This is the initial commit for the Apache Cassandra Analytics project
where we support reading and writing bulk data from Apache Cassandra from
Spark.
Patch by James Berragan, Doug Rohrer; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-16222
Co-authored-by: James Berragan <jberragan@apple.com>
Co-authored-by: Doug Rohrer <drohrer@apple.com>
Co-authored-by: Saranya Krishnakumar <saranya_k@apple.com>
Co-authored-by: Francisco Guerrero <francisco.guerrero@apple.com>
Co-authored-by: Yifan Cai <ycai@apache.org>
Co-authored-by: Jyothsna Konisa <jkonisa@apple.com>
Co-authored-by: Yuriy Semchyshyn <ysemchyshyn@apple.com>
Co-authored-by: Dinesh Joshi <djoshi@apache.org>