Francisco Guerrero cassandra-analytics all time


 15 Collaborator
Yifan Cai, Dinesh Joshi, Doug Rohrer, Mick Semb Wever, Josh McKenzie, Brandon Williams, Berenguer Blasi, James Berragan, jberragan, Yuriy Semchyshyn, Saranya Krishnakumar, Arjun Ashok, Jyothsna Konisa, jkonisa, Bernardo Botella

 33 Patch  41 Review
67281b31010791fa7f0d02dd0f776862e15846d3, b31a451873eb411788a6d94c3cacf881f5c3cb86, 8dce35f1cb3c204be669548ee286055b12e67fe9, 16a0e5f3193c9764d897a98126a2c3c8b4c498d5, 45de9e08e7dc177f1f3273456b62af7ee0f5dbdb, f4014c06d7668541010d59cc932970e9ebfc36f5, 86420f9d52991fb148b322031df55494669532d3, aea798dc7e517af520a403d4d86f3bc6bed65092, 690101840d4d8f9c656bb0ca114f6619af80e1cf, 47fdb6448b6956249790d5dc7bb76b699d35c079, cbbf33d001b6f953be5654f00d7dfb54011a7619, 295358095db80ced4b8f54f603f7bd9833a8f175, d1d0dd70951c9997ca7f9eeb184da64a0eb8fed7, c00c454d698e5a29caf58e61ed52ab48d08fd7fe, d28442ae712c1597052493aa3d2353a2de2495c2, 46c35d0ef2efb66512133a7913df9936b0a80dc8, dc0e79b9c483562ec0920d69e886715eb329c426, fc08d45b283e701aa6d558e99cd18318394b0de7, d949d8c2b9813c3e8429ece34c364a356bd7d6eb, e82fceaecfe5ea04ac3ddff92be5a6a41456333c, 550bdfa1c6082537e2cfb93449128a61dbe3a1fb, c7c3bbca2c7cb415b39689e924fa2357c239f043, 457b36bcb3c8a865cca83ca6c402246798113ab4, 82b3c0a79c9322142738a4ec2ff7d4d4c0be2370, 69766bca399cc779e0f2f8e859e39f7e29a17b7a, 02d9136cfa72c8990120eca0f4fe5f52587bceb5, ee1c83722bfb1155bef762cdfb2c86034857f2d0, cbae09ca71b9eb9a581b77c23844da21474b095a, bd0b41fb82134844a15fbb43126424d96706d08e, f0fae2deeee20df15ac1105af2163af2a7e7953d, b87b0edd310d1ef93c507bbbb1ae51e1b0b319c6, 7764214d1fb44fb6139a622f403bb05610e8f7b1, 1633cd9c6c3d88d5c66825fab76a369266509f7e dd536f2e70118cd5d0c319f5be3e54e3d50eb288, a3ca0897c2190c2c18992ca2b7e5255318ff3eba, 3bcc5297bd3115ff5949a9295eed6a9ad03fd096, 4624a17098e055e0abf9a6025451d4352cb9c147, ff9ac41b4695c1df59f5293f69e0d3a1ce0da9f4, f123406e458c0112145f37dcd3f8c20ba47c949d, 8655ca54a5d0749fccb2ad6a06ec230e8b0de24e, 9f0f475ba87df6c631029748bfafb4169bfa6465, cfe293dadcf7a1d4491591cfd39fc410a8fa52ba, dbbd211cd420eb185d0579f16f5d46abc7bafeb4, e168011c40de2ca48d138514640838067e61feea, 3023a204c8ef16f886bd3dc219f7534b7edbaf2a, bac08796181979afef4cc518789a380edef500f0, 84d84fe36b0d6e250c3d221c28c40b6925e4c222, 
458a3630f882ae2b2a9cee272cf85ca7ff42f5cd, 798182a6fda562538c2f44e4f3f92a7cb68cd81c, 2a693d721182c1d354ff1323b1324fe06ac03f36, a242b352c28947427a9bfc30295a487017439fd9, 466e7bf5e160d1667c12ea1de1b79ba27670aba4, aea798dc7e517af520a403d4d86f3bc6bed65092, c5d6dfd1bc9b682d704d28f77807ba72317b1944, 164243e78f1557a34bc699ebc716b532781d6422, 6ce33604bbd9acbee092ab3c4f7f11c0d434f730, a13532272051d4e4608f92d53bdd997103e8ea19, cf6de14d5b96ea173d6a1b2dad9bb64d563df06c, c3e8803b3331bc7ef81797ac52a8417524f67edc, d61e44f78fa4ba5ec395e1e39c507d666fddefd1, e0ae9d7484e242f6af495aac2cb4d8dc121fba89, 047d13806078bafdc3954273b0e240dbbb976bd4, 8c20b452dd0728a6fad6d276a7be9fa1b9274495, b5ba5fad4df490d1b7d47889361db910589409b8, fa6df8e2c09ad3d27bfe8c0ce016c839094630f6, e8fb77f4813b469d73d39c84acf1e1fe7a40702b, 0aaf5659028dd874c8d666c636f11eae63c429e6, 672d66a64a21e23c4d81c089b426360c2bb708b7, bbfca46129992e83055ba9b0b4f836871eef0990, 680cc9395c55a88217f2de975f62ad588e8c95d5, 87a729feb4660f57bacb2a4be73e1bb2d509578b, deebdf97ad01f23550d7d3b42d98c7bf111e2f95, f24951ab6ea2b1e9af4013b030675c70d31adb90, 9523a38b3f1b5bc4313e2949896ddc1fff58afbe

dd536f2e70118cd5d0c319f5be3e54e3d50eb288 | Author: Yifan Cai <ycai@apache.org>
 | 2024-11-15 15:31:14-08:00

    CASSANDRA-20066: Expose detailed bulk write failure message for better insight (#92)
    
    Patch by Yifan Cai; Reviewed by Doug Rohrer, Francisco Guerrero for CASSANDRA-20066

67281b31010791fa7f0d02dd0f776862e15846d3 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2024-11-14 21:49:15-08:00

    CASSANDRA-20078: Add integration tests for multiple types during bulk writes (#94)
    
    
    Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-20078

a3ca0897c2190c2c18992ca2b7e5255318ff3eba | Author: Yifan Cai <ycai@apache.org>
 | 2024-11-05 14:22:45-08:00

    CASSANDRA-19994: Add dataTransferApi and TwoPhaseImportCoordinator for coordinated write (#91)
    
    Patch by Yifan Cai; Reviewed by Doug Rohrer, Francisco Guerrero for CASSANDRA-19994

3bcc5297bd3115ff5949a9295eed6a9ad03fd096 | Author: jberragan <jberragan@gmail.com>
 | 2024-10-17 10:02:25-07:00

    CASSANDRA-19980: Remove SparkSQL dependency from CassandraBridge so that it can be used independent from Spark (#88)
    
    Patch by James Berragan; Reviewed by Francisco Guerrero, Yifan Cai for CASSANDRA-19980

4624a17098e055e0abf9a6025451d4352cb9c147 | Author: Yifan Cai <ycai@apache.org>
 | 2024-09-24 23:50:35-07:00

    CASSANDRA-19933: Support aggregated consistency validation for multiple clusters (#86)
    
    Patch by Yifan Cai; Reviewed by Francisco Guerrero for CASSANDRA-19933

ff9ac41b4695c1df59f5293f69e0d3a1ce0da9f4 | Author: Yifan Cai <ycai@apache.org>
 | 2024-09-18 12:54:47-07:00

    CASSANDRA-19923: Add transport extension for coordinated write (#83)
    
    Patch by Yifan Cai; Reviewed by Francisco Guerrero for CASSANDRA-19923

f123406e458c0112145f37dcd3f8c20ba47c949d | Author: Yifan Cai <ycai@apache.org>
 | 2024-09-11 21:03:47-07:00

    CASSANDRA-19909: Add writer options COORDINATED_WRITE_CONFIG to define coordinated write to multiple Cassandra clusters (#79)
    
    The option specifies the configuration (in JSON) for coordinated write.
    See org.apache.cassandra.spark.bulkwriter.coordinatedwrite.CoordinatedWriteConf.
    When the option is present, SIDECAR_CONTACT_POINTS, SIDECAR_INSTANCES and LOCAL_DC are ignored if they are present.
    
    Patch by Yifan Cai; Reviewed by Doug Rohrer, Francisco Guerrero for CASSANDRA-19909

8655ca54a5d0749fccb2ad6a06ec230e8b0de24e | Author: Yifan Cai <ycai@apache.org>
 | 2024-09-06 21:08:40-07:00

    CASSANDRA-19901: Refactor TokenRangeMapping to use proper types instead of Strings (#78)
    
    Patch by Yifan Cai; Reviewed by Francisco Guerrero for CASSANDRA-19901

9f0f475ba87df6c631029748bfafb4169bfa6465 | Author: Arjun Ashok <arjun_ashok@apple.com>
 | 2024-09-05 13:24:01-07:00

    CASSANDRA-19873: Removes checks for blocked instances from bulk-write path (#76)
    
    
    Patch by Arjun Ashok; Reviewed by Yifan Cai, Francisco Guerrero for CASSANDRA-19873

cfe293dadcf7a1d4491591cfd39fc410a8fa52ba | Author: Yifan Cai <ycai@apache.org>
 | 2024-08-30 11:08:00-07:00

    CASSANDRA-19842: Consistency level check incorrectly passes when majority of the replica set is unavailable for write (#75)
    
    Patch by Yifan Cai; Reviewed by Doug Rohrer, Francisco Guerrero for CASSANDRA-19842

dbbd211cd420eb185d0579f16f5d46abc7bafeb4 | Author: Yifan Cai <ycai@apache.org>
 | 2024-08-09 15:47:21-07:00

    CASSANDRA-19821: prevent double closing sstable writer (#72)
    
    Patch by Yifan Cai; Reviewed by Francisco Guerrero for CASSANDRA-19821

e168011c40de2ca48d138514640838067e61feea | Author: Yifan Cai <ycai@apache.org>
 | 2024-08-07 17:02:38-07:00

    CASSANDRA-19806: Stream sstable eagerly when bulk writing to reclaim local disk space sooner (#69)
    
    Patch by Yifan Cai; Reviewed by Francisco Guerrero for CASSANDRA-19806

3023a204c8ef16f886bd3dc219f7534b7edbaf2a | Author: jberragan <jberragan@gmail.com>
 | 2024-08-03 07:51:52+01:00

    CASSANDRA-19807: Improve the core bulk reader test system to match actual and expected rows by concatenating the partition keys with the serialized hex string instead of utf-8 string (#70)
    
    Patch by James Berragan; Reviewed by Francisco Guerrero, Yifan Cai for CASSANDRA-19807

bac08796181979afef4cc518789a380edef500f0 | Author: jberragan <jberragan@gmail.com>
 | 2024-07-26 10:25:13-07:00

    CASSANDRA-19793 Split out CassandraTypes into separate module (#68)
    
    
    Patch by James Berragan; Reviewed by Yifan Cai, Francisco Guerrero for CASSANDRA-19793

84d84fe36b0d6e250c3d221c28c40b6925e4c222 | Author: jberragan <jberragan@gmail.com>
 | 2024-07-22 13:38:28-07:00

    CASSANDRA-19791: Remove other uses of Apache Commons lang for hashcode, equality and random string generation (#67)
    
    Patch by James Berragan; Reviewed by Francisco Guerrero, Yifan Cai for CASSANDRA-19791

458a3630f882ae2b2a9cee272cf85ca7ff42f5cd | Author: jberragan <jberragan@gmail.com>
 | 2024-07-17 14:29:21-07:00

    CASSANDRA-19778: Split out BufferingInputStream stats into separate i… (#66)
    
    Split BufferingInputStream stats into separate interface so class level generics are not required for the Stats interface
    
    Patch by James Berragan; Reviewed by Bernardo Botella, Francisco Guerrero, Yifan Cai for CASSANDRA-19778

798182a6fda562538c2f44e4f3f92a7cb68cd81c | Author: Yifan Cai <ycai@apache.org>
 | 2024-07-16 21:44:18-07:00

    CASSANDRA-19772: Deprecate option SIDECAR_INSTANCES and replace with SIDECAR_CONTACT_POINTS (#63)
    
    This patch introduces a new option SIDECAR_CONTACT_POINTS for both bulk writer and reader. The option name better describes the purpose, which is to specify the initial contact points to discover the cluster topology. The existing option SIDECAR_INSTANCES is used for the same purpose and is now deprecated.
    In addition, it allows including the port value in the addresses when defining SIDECAR_CONTACT_POINTS.
    
    Patch by Yifan Cai; Reviewed by Francisco Guerrero for CASSANDRA-19772
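
Illustrative sketch (not part of the commit): the message above says contact points may carry an optional port. The snippet below shows one way such a value could be parsed, assuming a comma-separated `host[:port]` list. The class name, the method, and the default port of 9043 are assumptions made for illustration, not the library's actual parser or format.

```java
import java.util.ArrayList;
import java.util.List;

final class ContactPointsSketch
{
    // Assumed default port, used only when an entry does not specify one
    static final int ASSUMED_DEFAULT_PORT = 9043;

    /**
     * Parses a comma-separated list such as "host1:9999, host2" into "host:port" strings.
     * IPv6 literals are deliberately not handled in this sketch.
     */
    static List<String> parse(String contactPoints)
    {
        List<String> result = new ArrayList<>();
        for (String entry : contactPoints.split(","))
        {
            String trimmed = entry.trim();
            if (trimmed.isEmpty())
            {
                continue; // tolerate stray commas
            }
            int colon = trimmed.lastIndexOf(':');
            String host = colon < 0 ? trimmed : trimmed.substring(0, colon);
            int port = colon < 0 ? ASSUMED_DEFAULT_PORT
                                 : Integer.parseInt(trimmed.substring(colon + 1));
            result.add(host + ":" + port);
        }
        return result;
    }
}
```

Entries without a port fall back to the assumed default, so existing host-only values would keep working under this scheme.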

2a693d721182c1d354ff1323b1324fe06ac03f36 | Author: Yifan Cai <ycai@apache.org>
 | 2024-07-16 15:20:52-07:00

    CASSANDRA-19774: Bump Cassandra Sidecar version (#65)
    
    Update Cassandra Sidecar commit sha: 55a9efee30555d3645680c6524043a6c9bc1194b
    
    Patch by Yifan Cai; Reviewed by Francisco Guerrero for CASSANDRA-19774

a242b352c28947427a9bfc30295a487017439fd9 | Author: jberragan <jberragan@gmail.com>
 | 2024-07-12 14:57:38-07:00

    CASSANDRA-19748: Refactoring to introduce new cassandra-analytics-common module with minimal dependencies (#62)
    
    - Add new module cassandra-analytics-common with no dependencies on Spark or Cassandra and minimal standard dependencies (Guava, Jackson, Commons Lang, Kryo, etc.)
    - Move standalone classes to cassandra-analytics-common module.
    
    Some additional refactoring and clean up:
    - Rename SSTableInputStream -> BufferingInputStream
    - Rename SSTableSource -> CassandraFileSource
    - Introduce CassandraFile interface to be the implementing class for SSTable and CommitLog.
    - Generalize IStats to work across different CassandraFile types
    - Rename methods in StreamScanner to make the API clearer.
    - Move ComplexTypeBuffer, ListBuffer, MapBuffer, SetBuffer, UdtBuffer to standalone classes
    - Delete unused classes RangeTombstone, ReplicaSet and CollectionElement.
    - Remove commons lang as a dependency
    - Rename Rid to RowData
    
    Patch by James Berragan; Reviewed by Bernardo Botella, Dinesh Joshi, Francisco Guerrero, Yifan Cai, Yuriy Semchyshyn for CASSANDRA-19748

b31a451873eb411788a6d94c3cacf881f5c3cb86 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2024-06-27 15:51:15-07:00

    CASSANDRA-19727: Bulk writer fails validation stage when writing to a cluster using RandomPartitioner (#61)
    
    
    Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19727

8dce35f1cb3c204be669548ee286055b12e67fe9 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2024-06-20 14:24:21-07:00

    CASSANDRA-19716: Invalid mapping when timestamp is used as a partition key during bulk writes (#60)
    
    Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19716

16a0e5f3193c9764d897a98126a2c3c8b4c498d5 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2024-05-13 17:10:54-07:00

    Ninja fix for CASSANDRA-19634
    
    Restores the original Spark configurations because the change is causing Java 8 failures when running in a local development environment

45de9e08e7dc177f1f3273456b62af7ee0f5dbdb | Author: Francisco Guerrero <frankgh@apache.org>
 | 2024-05-12 11:35:14-07:00

    CASSANDRA-19634: Improve test coverage for downed instances (#59)
    
    Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19634

f4014c06d7668541010d59cc932970e9ebfc36f5 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2024-05-10 13:17:04-07:00

    CASSANDRA-19626 Fix NullPointerException when reading static column with null values (#58)
    
    Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19626

466e7bf5e160d1667c12ea1de1b79ba27670aba4 | Author: Yifan Cai <52585731+yifan-c@users.noreply.github.com>
 | 2024-05-03 10:11:18-07:00

    CASSANDRA-19616: Integrate with the latest sidecar client (#56)
    
    The patch updates the analytics code to consume the latest sidecar client after CASSANDRASC-127
    
    Patch by Yifan Cai; Reviewed by Francisco Guerrero for CASSANDRA-19616

86420f9d52991fb148b322031df55494669532d3 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2024-04-23 12:51:53-07:00

    CASSANDRA-19582: Consume new Sidecar client API to stream SSTables (#54)
    
    
    Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19582

aea798dc7e517af520a403d4d86f3bc6bed65092 | Author: Yifan Cai <52585731+yifan-c@users.noreply.github.com>
 | 2024-04-22 15:46:08-07:00

    CASSANDRA-19563: Support bulk write via S3 (#53)
    
    This commit adds a configuration (writer) option to pick a transport other than the previously-implemented "direct upload to all sidecars" (now known as the "Direct" transport).  The second transport, now being implemented, is the "S3_COMPAT" transport, which allows the job to upload the generated SSTables to an S3-compatible storage system, and then inform the Cassandra Sidecar that those files are available for download & commit.
    
    Additionally, a plug-in system was added to allow communications between custom transport hooks and the job, so the custom hook can provide updated credentials and out-of-band status updates on S3-related issues.
    
    Co-Authored-By: Yifan Cai <ycai@apache.org>
    Co-Authored-By: Doug Rohrer <drohrer@apple.com>
    Co-Authored-By: Francisco Guerrero <frankgh@apache.org>
    Co-Authored-By: Saranya Krishnakumar <saranya_k@apple.com>
    
    Patch by Yifan Cai, Doug Rohrer, Francisco Guerrero, Saranya Krishnakumar; Reviewed by Francisco Guerrero for CASSANDRA-19563

690101840d4d8f9c656bb0ca114f6619af80e1cf | Author: Francisco Guerrero <frankgh@apache.org>
 | 2024-04-08 14:33:50-07:00

    CASSANDRA-19526: Optionally enable TLS in the server and client for Analytics testing
    
    All integration tests today run without TLS, which is generally fine because they run locally. However,
    it is helpful to be able to start up the sidecar with TLS enabled in the integration test framework so
    that third-party tests could connect via secure connections for testing purposes.
    
    Co-authored-by: Doug Rohrer <drohrer@apple.com>
    Co-authored-by: Francisco Guerrero <frankgh@apache.org>
    
    Patch by Doug Rohrer, Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19526

47fdb6448b6956249790d5dc7bb76b699d35c079 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2024-04-04 16:15:49-07:00

    CASSANDRA-19528: Use a classloader to isolate in-jvm dtest classes in… (#50)
    
    
    Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19528

cbbf33d001b6f953be5654f00d7dfb54011a7619 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2024-04-03 11:08:32-07:00

    CASSANDRA-19519: Migrate remaining integration tests to the single dtest cluster per class model (#49)
    
    Additionally, we remove the unused test framework code after migrating the tests
    
    Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19519

295358095db80ced4b8f54f603f7bd9833a8f175 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2024-04-02 16:26:40-07:00

    CASSANDRA-19513: Refactor Cassandra bridge (#48)
    
    This commit splits the bridge implementation from the shaded `cassandra-all` library. This separation
    allows for better integration of different `cassandra-all` implementations. Additionally, it better
    separates the actual bridge code from the Cassandra code.
    
    Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19513

d1d0dd70951c9997ca7f9eeb184da64a0eb8fed7 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2024-04-02 12:01:49-07:00

    Ninja fix for CASSANDRA-19340
    
    Revert "Make sure bridge exists"
    
    This reverts commit 98baab1b8f0d5d7eb93f8d13db3b0a7a985fb03a.
    
    We revert this commit because the commit message was lost during merge.
    We immediately add the same commit with the correct commit message, to
    avoid rewriting git history.

c00c454d698e5a29caf58e61ed52ab48d08fd7fe | Author: Francisco Guerrero <frankgh@apache.org>
 | 2024-04-01 12:11:52-07:00

    CASSANDRA-19507 Fix bulk reads of multiple tables that potentially have the same data file name (#47)
    
    When reading multiple data frames using bulk reader from different tables, it is possible to encounter a data
    file name being retrieved from the same Sidecar instance. Because the `SSTable`s are cached in the `SSTableCache`,
    it is possible that the `org.apache.cassandra.spark.reader.SSTableReader` uses the incorrect `SSTable` if it was
    cached with the same `#hashCode`.
    
    In this patch, the equality takes into account the keyspace, table, and snapshot name.
    
    Additionally, we implement the `hashCode` and `equals` method in `org.apache.cassandra.clients.SidecarInstanceImpl` to utilize the `SSTableCache` correctly. Once the methods are implemented, the issue originally described in JIRA is surfaced.
    
    Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19507
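
Illustrative sketch (not part of the commit): the message above describes making cache equality depend on the keyspace, table, and snapshot name in addition to the data file name. The class below is a hypothetical cache key showing that idea. Its name and fields are assumptions and do not mirror the real `SSTable` or `SSTableCache` types.

```java
import java.util.Objects;

final class SSTableKeySketch
{
    private final String keyspace;
    private final String table;
    private final String snapshot;
    private final String dataFileName;

    SSTableKeySketch(String keyspace, String table, String snapshot, String dataFileName)
    {
        this.keyspace = keyspace;
        this.table = table;
        this.snapshot = snapshot;
        this.dataFileName = dataFileName;
    }

    @Override
    public boolean equals(Object other)
    {
        if (this == other)
        {
            return true;
        }
        if (!(other instanceof SSTableKeySketch))
        {
            return false;
        }
        SSTableKeySketch that = (SSTableKeySketch) other;
        // The file name alone is not enough: two tables can hold a file with the same name
        return keyspace.equals(that.keyspace)
               && table.equals(that.table)
               && snapshot.equals(that.snapshot)
               && dataFileName.equals(that.dataFileName);
    }

    @Override
    public int hashCode()
    {
        // Must cover the same fields as equals so hash-based caches behave consistently
        return Objects.hash(keyspace, table, snapshot, dataFileName);
    }
}
```

With a key like this, two data files that share a name but belong to different tables occupy separate cache entries.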

d28442ae712c1597052493aa3d2353a2de2495c2 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2024-03-27 13:32:39-07:00

    CASSANDRA-19500 Fix XXHash32Digest calculated digest value (#46)
    
    This PR bumps the Sidecar version to the current latest HEAD of Sidecar. Bumping the
    version surfaced an issue with the way we are producing digest strings for the XXHash32
    implementation. The hash value is not masked and this causes the negative sign to be
    forwarded producing the incorrect hash result.
    
    Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19500
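
Illustrative sketch (not part of the commit): the message above attributes the wrong digest string to an unmasked hash value carrying its negative sign forward. The snippet below reproduces that class of bug with plain JDK calls, assuming the 32-bit hash is widened to a `long` before being rendered as hex. The class and method names are hypothetical, and the real `XXHash32Digest` code may differ.

```java
final class DigestHexSketch
{
    /** Widens the int without masking, so a negative hash is sign-extended to 64 bits. */
    static String unmaskedHex(int hash)
    {
        return Long.toHexString(hash);
    }

    /** Masks to the low 32 bits first, treating the hash as an unsigned value. */
    static String maskedHex(int hash)
    {
        return Long.toHexString(hash & 0xFFFFFFFFL);
    }
}
```

For a hash whose top bit is set, such as `0xCAFEBABE`, the unmasked variant yields `ffffffffcafebabe` while the masked variant yields `cafebabe`. Both variants agree when the top bit is clear.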

164243e78f1557a34bc699ebc716b532781d6422 | Author: Arjun Ashok <arjun_ashok@apple.com>
 | 2024-03-22 16:22:44-07:00

    CASSANDRA-19418  - Changes to report additional bulk analytics job stats for instrumentation (#41)
    
    Patch by Arjun Ashok; Reviewed by Doug Rohrer, Yifan Cai, Francisco Guerrero for CASSANDRA-19418

6ce33604bbd9acbee092ab3c4f7f11c0d434f730 | Author: Saranya Krishnakumar <saranya_k@apple.com>
 | 2024-03-06 14:32:22-08:00

    CASSANDRA-19424 Check for expired certificate during start up validation (#43)
    
    patch by Saranya Krishnakumar; reviewed by Francisco Guerrero, Yifan Cai for CASSANDRA-19424

a13532272051d4e4608f92d53bdd997103e8ea19 | Author: Yifan Cai <52585731+yifan-c@users.noreply.github.com>
 | 2024-03-05 11:06:36-08:00

    CASSANDRA-19452 Use constant reference time during bulk read process (#44)
    
    patch by Yifan Cai; reviewed by Francisco Guerrero, James Berragan for CASSANDRA-19452

c5d6dfd1bc9b682d704d28f77807ba72317b1944 | Author: Doug Rohrer <drohrer@apple.com>
 | 2024-02-27 22:03:04-05:00

    CASSANDRA-19340 - Support writing UDTs
    
    Patch by Doug Rohrer; Reviewed by Yifan Cai, Francisco Guerrero for CASSANDRA-19340

46c35d0ef2efb66512133a7913df9936b0a80dc8 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2024-02-19 20:50:16-08:00

    CASSANDRA-19411: Bulk reader fails to produce a row when regular column values are null
    
    Bulk Reader won't emit a row when the regular column values are all `null`. For example,
    a schema `PK` = `a`, `b` ; `CK` = `c`, `d` ; and columns = `e`, `f`.
    
    |  a  |  b  |  c  |  d  |  e   |  f   |
    | --- | --- | --- | --- | ---- | ---- |
    | pk1 | pk2 | ck1 | ck2 | null | null |
    
    When queried from Analytics bulk reader, it won't produce a row.
    
    This issue also occurs when the projected regular column values are all `null`, where
    other non-projected columns might have some values.
    
    Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19411

cf6de14d5b96ea173d6a1b2dad9bb64d563df06c | Author: Saranya Krishnakumar <saranya_k@apple.com>
 | 2024-02-19 11:27:35-08:00

    CASSANDRA-19442 Update access of ClearSnapshotStrategy
    
    Patch by Saranya Krishnakumar; Reviewed by Yifan Cai, Francisco Guerrero for CASSANDRA-19442

c3e8803b3331bc7ef81797ac52a8417524f67edc | Author: Yifan Cai <ycai@apache.org>
 | 2024-02-13 09:52:57-08:00

    CASSANDRA-19285 Fix flaky Host replacement tests and shrink tests
    
    The flakiness is caused by inspecting a class whose classloader is already closed. The fix is to include those classes in the sharedClassLoader, so that the classLoader is not closed during the test.
    
    patch by Yifan Cai; reviewed by Francisco Guerrero for CASSANDRA-19285

fc08d45b283e701aa6d558e99cd18318394b0de7 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2024-01-31 14:35:34-08:00

    CASSANDRA-19351 No longer need to synchronize on Schema.instance after Cassandra 4.0.12
    
    We no longer need to synchronize on the `Schema.instance` in Analytics after the release of Cassandra
    4.0.12, that includes a synchronization fix in https://issues.apache.org/jira/browse/CASSANDRA-18317.
    
    This commit cleans up TODOs pending on that code being released.
    
    Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19351

dc0e79b9c483562ec0920d69e886715eb329c426 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2024-01-31 13:44:23-08:00

    CASSANDRA-19369 Use XXHash32 for digest calculation of SSTables
    
    This commit adds the ability to use the XXHash32 digest algorithm, which is newly supported in Cassandra Sidecar.
    The commit allows for backwards compatibility to perform MD5 checksumming, but it now defaults to XXHash32.
    
    A new Writer option is added:
    
    ```
    .option(WriterOptions.DIGEST.name(), "XXHASH32") // or
    .option(WriterOptions.DIGEST.name(), "MD5")
    ```
    
    This option defaults to XXHash32, when not provided, but it can be configured to use the legacy MD5 algorithm.
    
    Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19369

047d13806078bafdc3954273b0e240dbbb976bd4 | Author: Arjun Ashok <arjun_ashok@apple.com>
 | 2024-01-26 10:37:42-08:00

    CASSANDRA-19331 Improve logging for bulk writes and on task failures
    
    Patch by Arjun Ashok; reviewed by Francisco Guerrero, Yifan Cai for CASSANDRA-19331

e0ae9d7484e242f6af495aac2cb4d8dc121fba89 | Author: Yifan Cai <ycai@apache.org>
 | 2024-01-24 15:38:41-08:00

    CASSANDRA-19334 Upgrade to Cassandra 4.0.12 and remove BufferMode and BatchSize options
    
    In cassandra-all:4.0.12, improvements were made for the CQLSSTableWriter. The sorted writer now can produce size-capped SSTables. It replaces the need for the unsorted sstable writer, which has to buffer and sort data on flushing. The dataset to write in the spark application is already sorted. By avoiding using the unsorted writer, it prevents wasting CPU time on sorting the sorted data. Since the sorted sstable writer does not need to buffer data, its size estimation is more accurate than the unsorted one, meaning the produced sstables files are closer to the expectation.
    
    By removing the unsorted sstable writer, it no longer requires the RowBufferMode option.
    By supporting size-capping in sorted writer, it no longer requires the BatchSize option.
    
    Patch by Yifan Cai; reviewed by Francisco Guerrero for CASSANDRA-19334

b5ba5fad4df490d1b7d47889361db910589409b8 | Author: Yifan Cai <ycai@apache.org>
 | 2024-01-24 14:47:36-08:00

    CASSANDRA-19325 fix range split and use open-closed range notation consistently
    
    Patch by Yifan Cai; reviewed by Arjun Ashok, Francisco Guerrero for CASSANDRA-19325

d949d8c2b9813c3e8429ece34c364a356bd7d6eb | Author: Francisco Guerrero <frankgh@apache.org>
 | 2024-01-22 09:00:52-08:00

    CASSANDRA-19275 Fix flaky host replacement tests and shrink tests
    
    This patch fixes flaky tests when a `BindException` occurs during cluster provisioning.
    When a `BindException` is encountered, cluster provisioning is retried up to
    `MAX_CLUSTER_PROVISION_RETRIES` times.
    
    Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19275
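
Illustrative sketch (not part of the commit): the message above describes retrying cluster provisioning when a `BindException` occurs. The snippet below shows a bounded retry loop of that kind. Only `MAX_CLUSTER_PROVISION_RETRIES` is named in the commit. Its value here, the `Provisioner` interface, and the choice to count attempts rather than retries are assumptions.

```java
import java.net.BindException;

final class ProvisionRetrySketch
{
    // Assumed value; the commit names the constant but not its value
    static final int MAX_CLUSTER_PROVISION_RETRIES = 5;

    /** Hypothetical stand-in for whatever provisions the test cluster. */
    interface Provisioner<T>
    {
        T provision() throws BindException;
    }

    static <T> T provisionWithRetries(Provisioner<T> provisioner)
    {
        BindException lastFailure = null;
        for (int attempt = 1; attempt <= MAX_CLUSTER_PROVISION_RETRIES; attempt++)
        {
            try
            {
                return provisioner.provision();
            }
            catch (BindException e)
            {
                // Only port-binding failures are retried; anything else propagates
                lastFailure = e;
            }
        }
        throw new IllegalStateException("Cluster provisioning failed after "
                                        + MAX_CLUSTER_PROVISION_RETRIES + " attempts", lastFailure);
    }
}
```

Catching only `BindException` keeps genuine test failures visible instead of hiding them behind retries.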

fa6df8e2c09ad3d27bfe8c0ce016c839094630f6 | Author: Arjun Ashok <arjun_ashok@apple.com>
 | 2024-01-16 12:24:02-08:00

    CASSANDRA-19272 Add new writer option for blocklisted instances and corresponding integration tests
    
    Patch by Arjun Ashok; Reviewed by Francisco Guerrero, Yifan Cai for CASSANDRA-19272

8c20b452dd0728a6fad6d276a7be9fa1b9274495 | Author: Saranya Krishnakumar <saranya_k@apple.com>
 | 2024-01-10 11:49:44-08:00

    CASSANDRA-19273: Allow setting TTL for snapshots created
    
    Patch by Saranya Krishnakumar; Reviewed by Yifan Cai, Francisco Guerrero for CASSANDRA-19273

e8fb77f4813b469d73d39c84acf1e1fe7a40702b | Author: Arjun Ashok <arjun_ashok@apple.com>
 | 2024-01-10 08:46:35-08:00

    CASSANDRA-19257 Fixes handling of blocked instances during CL validations
    
    Patch by Arjun Ashok; Reviewed by Yifan Cai, Francisco Guerrero for CASSANDRA-19257

e82fceaecfe5ea04ac3ddff92be5a6a41456333c | Author: Francisco Guerrero <frankgh@apache.org>
 | 2023-12-21 11:42:14-08:00

    CASSANDRA-19223: Column type mapping error for timestamp type during bulk writes
    
    Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19223

550bdfa1c6082537e2cfb93449128a61dbe3a1fb | Author: Francisco Guerrero <frankgh@apache.org>
 | 2023-12-19 12:50:43-08:00

    CASSANDRA-19251 Speed up integration tests
    
    This commit introduces an opinionated way to run integration tests where a test class
    reuses the same in-jvm dtest cluster, and it offers a certain ordering that helps tests
    run faster.
    
    The test setup does the following:
    - Find the Cassandra version to run
    - Provision a cluster for the test
    - Initialize schemas required for tests
    - Start the Sidecar service
    
    The above approach guarantees that Sidecar is ready once the setup method completes,
    which means we no longer need to spend time waiting for schema propagation. This
    optimization also helps in reducing test time.
    
    The drawback of this approach is that if we need the cluster to be in some state for
    testing (for example, a node needs to be in the joining state while executing the bulk
    test), then that cluster can only be used for tests in that state. This means that testing
    different states of the cluster requires a new test class.
    
    Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19251

bbfca46129992e83055ba9b0b4f836871eef0990 | Author: Yifan Cai <ycai@apache.org>
 | 2023-12-12 23:45:41+08:00

    CASSANDRA-19198: Flaky test in SSTableInputStreamTests
    
    Patch by Yifan Cai; Reviewed by Francisco Guerrero for CASSANDRA-19198

672d66a64a21e23c4d81c089b426360c2bb708b7 | Author: jkonisa <jkonisa@apple.com>
 | 2023-12-06 10:25:56-08:00

    CASSANDRA-19199 Remove write option VALIDATE_SSTABLES to enforce validation
    
    Patch by Jyothsna Konisa; Reviewed by Yifan Cai, Francisco Guerrero for CASSANDRA-19199

d61e44f78fa4ba5ec395e1e39c507d666fddefd1 | Author: Yuriy Semchyshyn <yuriy@semchyshyn.com>
 | 2023-11-29 17:49:29-06:00

    CASSANDRA-19377 Startup Validation Failures when Checking Sidecar Connectivity
    
    patch by Yuriy Semchyshyn; reviewed by Francisco Guerrero, Yifan Cai for CASSANDRA-19377

c7c3bbca2c7cb415b39689e924fa2357c239f043 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2023-11-14 16:28:14-08:00

    CASSANDRA-19031: Fix bulk writing when using identifiers that need quotes
    
    Cassandra treats all identifiers as lower case unless explicitly quoted by the users,
    (i.e. keyspace names, table names, column names, etc). We can define a case-sensitive
    identifier or we can use a reserved word as an identifier by quoting it during DDL
    creation.
    
    In the analytics library, bulk writing fails when we encounter these identifiers. In
    this commit, we fix the issue by properly propagating the information about whether
    identifiers need to be quoted by exposing a new dataframe option (`quote_identifiers`).
    When set to `true`, it will _maybe_ quote the keyspace/table/column names and it will
    properly be able to write data when using mixed-case or reserved words in the
    identifiers.
    
    Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19031
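
Illustrative sketch (not part of the commit): the message above says that with `quote_identifiers` enabled the library will "maybe" quote identifiers. The helper below shows one plausible reading of that rule: quote only when the identifier is not a plain lower-case name or when it is a reserved word. The class name, the regular expression, and the tiny reserved-word subset are assumptions, not the library's implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

final class IdentifierQuotingSketch
{
    // Names that are already lower-case and alphanumeric are safe without quotes
    private static final Pattern UNQUOTED = Pattern.compile("[a-z][a-z0-9_]*");

    // Small illustrative subset of CQL reserved words, not the full list
    private static final Set<String> RESERVED =
        new HashSet<>(Arrays.asList("select", "table", "from", "where"));

    static String maybeQuote(String identifier)
    {
        if (UNQUOTED.matcher(identifier).matches() && !RESERVED.contains(identifier))
        {
            return identifier;
        }
        // Embedded double quotes are escaped by doubling them
        return '"' + identifier.replace("\"", "\"\"") + '"';
    }
}
```

Under these assumptions, `users` is returned unchanged, while `UserEvents` and `table` come back wrapped in double quotes.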

457b36bcb3c8a865cca83ca6c402246798113ab4 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2023-11-13 16:16:36-08:00

    CASSANDRA-19024 Fix bulk reading when using identifiers that need quotes
    
    Cassandra treats all identifiers as lower case unless explicitly quoted by the users,
    (i.e. keyspace names, table names, column names, etc). We can define a case-sensitive
    identifier or we can use a reserved word as an identifier by quoting it during DDL
    creation.
    
    In the analytics library, bulk reads fail when we encounter these identifiers. In this
    commit, we fix the issue by properly propagating information about whether identifiers
    need to be quoted by exposing a new data frame option (`quote_identifiers`). When set to
    `true`, it will maybe quote the keyspace/table and it will properly be able to read data
    when these situations are encountered.
    
    Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19024

87a729feb4660f57bacb2a4be73e1bb2d509578b | Author: Saranya Krishnakumar <saranya_k@apple.com>
 | 2023-11-06 13:32:01-08:00

    CASSANDRA-19903: Get Sidecar port through CassandraContext
    
    Patch by Saranya Krishnakumar; Reviewed by Dinesh Joshi, Francisco Guerrero, Josh McKenzie for CASSANDRA-19903

680cc9395c55a88217f2de975f62ad588e8c95d5 | Author: Yifan Cai <ycai@apache.org>
 | 2023-10-31 16:41:16-07:00

    CASSANDRA-19148: Remove unused dead code
    
    Patch by Yifan Cai; Reviewed by Francisco Guerrero for CASSANDRA-19148

0aaf5659028dd874c8d666c636f11eae63c429e6 | Author: Arjun Ashok <arjun_ashok@apple.com>
 | 2023-10-09 07:53:40-07:00

    CASSANDRA-18852 - Changes to make bulk writer resilient to cluster resize operations
    
    Patch by Arjun Ashok, Saranya Krishnakumar; Reviewed by Yifan Cai, Francisco Guerrero, Doug Rohrer for CASSANDRA-18852
    
    Co-authored-by: Arjun Ashok <arjun_ashok@apple.com>
    Co-authored-by: Saranya Krishnakumar <saranya_k@apple.com>

f24951ab6ea2b1e9af4013b030675c70d31adb90 | Author: Yuriy Semchyshyn <yuriy@semchyshyn.com>
 | 2023-08-14 14:09:12-05:00

    CASSANDRA-18810: Cassandra Analytics Start-Up Validation
    
    Patch by Yuriy Semchyshyn; Reviewed by Dinesh Joshi, Francisco Guerrero, Yifan Cai for CASSANDRA-18810

82b3c0a79c9322142738a4ec2ff7d4d4c0be2370 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2023-07-25 12:41:10-07:00

    CASSANDRA-18692 Fix bulk writes with Buffered RowBufferMode
    
    When setting Buffered RowBufferMode as part of the `WriterOption`s,
    `org.apache.cassandra.spark.bulkwriter.RecordWriter` ignores that configuration and instead
    uses the batch size to determine when to finalize an SSTable and start writing a new SSTable,
    if more rows are available.
    
    In this commit, we fix `org.apache.cassandra.spark.bulkwriter.RecordWriter#checkBatchSize`
    to take into account the configured `RowBufferMode`. Specifically, for the `UNBUFFERED`
    RowBufferMode we check the batchSize of the SSTable during writes, and for `BUFFERED`
    that check has no effect.
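    
    The mode-aware check can be sketched roughly as follows. This is only an illustration of the
    idea and not the actual `RecordWriter` code: the `BatchSizeCheckSketch` class, the
    `shouldFinalizeSSTable` helper, its parameters, and the local `RowBufferMode` enum are
    hypothetical stand-ins.
    
    ```java
    // Hypothetical stand-in for the real option; only the two modes named above are modelled.
    enum RowBufferMode { BUFFERED, UNBUFFERED }

    final class BatchSizeCheckSketch
    {
        private BatchSizeCheckSketch()
        {
        }

        // Returns true when the row count alone should cause the current SSTable to be finalized.
        // Assumption: in BUFFERED mode something other than the row count (not modelled here)
        // decides when an SSTable is flushed, so the batch size check is skipped entirely.
        static boolean shouldFinalizeSSTable(RowBufferMode mode, int rowsWritten, int batchSize)
        {
            if (mode == RowBufferMode.BUFFERED)
            {
                return false;
            }
            return rowsWritten >= batchSize;
        }

        public static void main(String[] args)
        {
            System.out.println(shouldFinalizeSSTable(RowBufferMode.UNBUFFERED, 1000, 1000));
            System.out.println(shouldFinalizeSSTable(RowBufferMode.BUFFERED, 1000, 1000));
        }
    }
    ```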
    
    Co-authored-by: Doug Rohrer <doug@therohrers.org>
    
    Patch by Francisco Guerrero, Doug Rohrer; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18692

02d9136cfa72c8990120eca0f4fe5f52587bceb5 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2023-06-27 10:28:04-07:00

    CASSANDRA-18631: Add Release Audit Tool (RAT) plugin to Analytics
    
    This commit adds the Release Audit Tool (RAT) plugin to `build.gradle` which adds a new task
    `rat`. This new task makes sure that the license headers are valid and present in the source
    files during the `check` task.
    
    To run the RAT plugin, you can run:
    
    ```
    ./gradlew rat
    ```
    
    patch by Francisco Guerrero; reviewed by Dinesh Joshi, Michael Semb Wever for CASSANDRA-18631

69766bca399cc779e0f2f8e859e39f7e29a17b7a | Author: Francisco Guerrero <frankgh@apache.org>
 | 2023-06-27 10:03:56-07:00

    CASSANDRA-18662: Fix cassandra-analytics-core-example
    
    This commit fixes the `SampleCassandraJob` available under the `cassandra-analytics-core-example`
    subproject.
    
    Fix checkstyle issues
    
    Fix serialization issue in SidecarDataTransferApi
    
    The `sidecarClient` field in `SidecarDataTransferApi` is declared as transient,
    which causes NPEs on executors when they try to perform an SSTable
    upload.
    
    This commit completely avoids serializing the `dataTransferApi` field in the
    `CassandraBulkWriterContext`, and lazily initializes it during the `transfer()`
    method invocation. We guard the initialization to a single thread by making the
    `transfer()` method synchronized. The `SidecarDataTransferApi` can be recreated
    when needed using the already serialized `clusterInfo`, `jobInfo`, and `conf`
    fields.
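    
    The pattern described above (a transient field that is rebuilt lazily from serialized state
    behind a synchronized accessor) can be sketched like this. The classes below are hypothetical
    stand-ins and not the real `CassandraBulkWriterContext` or `SidecarDataTransferApi`; a single
    `clusterInfo` string stands in for the serialized `clusterInfo`, `jobInfo`, and `conf` fields.
    
    ```java
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    // Hypothetical stand-in for an API object that wraps a non-serializable client.
    final class TransferApiSketch
    {
        private final String clusterInfo;

        TransferApiSketch(String clusterInfo)
        {
            this.clusterInfo = clusterInfo;
        }

        String describe()
        {
            return "transfer api for " + clusterInfo;
        }
    }

    // Hypothetical stand-in for a writer context that is shipped to executors.
    final class WriterContextSketch implements Serializable
    {
        private static final long serialVersionUID = 1L;

        private final String clusterInfo;        // serialized with the context
        private transient TransferApiSketch api; // never serialized; null after deserialization

        WriterContextSketch(String clusterInfo)
        {
            this.clusterInfo = clusterInfo;
        }

        // synchronized so that only one thread performs the lazy initialization
        synchronized TransferApiSketch transfer()
        {
            if (api == null)
            {
                api = new TransferApiSketch(clusterInfo);
            }
            return api;
        }

        // Simulates shipping the context to another JVM via Java serialization.
        static WriterContextSketch roundTrip(WriterContextSketch context)
        {
            try
            {
                ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                try (ObjectOutputStream out = new ObjectOutputStream(bytes))
                {
                    out.writeObject(context);
                }
                try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray())))
                {
                    return (WriterContextSketch) in.readObject();
                }
            }
            catch (IOException | ClassNotFoundException e)
            {
                throw new IllegalStateException(e);
            }
        }

        public static void main(String[] args)
        {
            WriterContextSketch copy = roundTrip(new WriterContextSketch("cluster-a"));
            System.out.println(copy.transfer().describe());
        }
    }
    ```
    
    In this sketch the deserialized copy never sees a null API object, because the only way to
    reach it is through `transfer()`.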
    
    Fix setting ROW_BUFFER_MODE to BUFFERED
    
    patch by Francisco Guerrero; reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18662

9523a38b3f1b5bc4313e2949896ddc1fff58afbe | Author: jkonisa <jkonisa@apple.com>
 | 2023-06-15 13:31:01-07:00

    CASSANDRA-18605 Adding support for TTL & Timestamps for bulk writes
    
    This commit introduces a new feature in Spark Bulk Writer to support writes with
    constant/per_row based TTL & Timestamps.
    
    Patch by Jyothsna Konisa; Reviewed by Dinesh Joshi, Francisco Guerrero, Yifan Cai for CASSANDRA-18605

cbae09ca71b9eb9a581b77c23844da21474b095a | Author: Francisco Guerrero <frankgh@apache.org>
 | 2023-06-14 11:52:55-07:00

    CASSANDRA-18600 Add NOTICE.txt file
    
    The NOTICE.txt file is currently missing in the repository. This commit adds the file to
    comply with ASF's guidance.
    
    patch by Francisco Guerrero; reviewed by Dinesh Joshi, Michael Semb Wever, Berenguer Blasi for CASSANDRA-18600

bd0b41fb82134844a15fbb43126424d96706d08e | Author: Doug Rohrer <drohrer@apple.com>
 | 2023-06-14 13:33:29-04:00

    CASSANDRA-18599 Upgrade to JUnit 5
    
    patch by Doug Rohrer, Francisco Guerrero; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18599

deebdf97ad01f23550d7d3b42d98c7bf111e2f95 | Author: Doug Rohrer <drohrer@apple.com>
 | 2023-06-14 13:33:29-04:00

    CASSANDRA-18759: Use in-jvm dtest framework from Sidecar for testing
    
    This commit introduces the use of the in-jvm dtest framework for testing
    Analytics workloads. It can spin up a Cassandra cluster, including the
    necessary Sidecar process, to test writing to and reading from Cassandra
    using the analytics library.
    
    Additional changes made in this commit include
    
    * Use concurrent collections in MockBulkWriterContext (Fixes flaky test StreamSessionConsistencyTest)
    
        The StreamSessionConsistencyTest uses the MockBulkWriterContext, which wasn't originally used
        (before this test was added) in a multi-threaded environment. Because of this, it would occasionally
        throw ConcurrentModificationExceptions, which would cause the stream test to fail in a
        non-deterministic way. This commit adds the use of concurrent/synchronized collections to the
        MockBulkWriterContext to make sure it doesn't throw these spurious errors.
    
    * Make the StartupValidation system thread-safe by using ThreadLocals
      instead of static collections, and clearing them once validation is
      complete.
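    
    The ThreadLocal approach from the last bullet can be sketched as below. This is a simplified
    illustration under stated assumptions, not the real StartupValidation code: the class name,
    the method names, and the use of plain strings as validation entries are all hypothetical.
    
    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicReference;

    final class ThreadLocalValidationSketch
    {
        // Each thread gets its own list instead of sharing one static collection.
        private static final ThreadLocal<List<String>> CHECKS = ThreadLocal.withInitial(ArrayList::new);

        private ThreadLocalValidationSketch()
        {
        }

        static void register(String check)
        {
            CHECKS.get().add(check);
        }

        // Returns the checks registered by the calling thread and clears them afterwards.
        static List<String> performAndClear()
        {
            try
            {
                return new ArrayList<>(CHECKS.get());
            }
            finally
            {
                CHECKS.remove();
            }
        }

        // Registers and performs a check on a separate thread, returning what that thread saw.
        static List<String> validateOnNewThread(String check)
        {
            AtomicReference<List<String>> result = new AtomicReference<>();
            Thread worker = new Thread(() -> {
                register(check);
                result.set(performAndClear());
            });
            worker.start();
            try
            {
                worker.join();
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            return result.get();
        }

        public static void main(String[] args)
        {
            register("main-check");
            System.out.println(validateOnNewThread("worker-check"));
            System.out.println(performAndClear());
        }
    }
    ```
    
    Because every thread only reads and clears its own list, validations that run concurrently
    on different threads do not interfere with each other in this sketch.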
    
    Patch by Doug Rohrer; Reviewed by Dinesh Joshi, Francisco Guerrero, Yifan Cai for CASSANDRA-18759

f0fae2deeee20df15ac1105af2163af2a7e7953d | Author: Francisco Guerrero <frankgh@apache.org>
 | 2023-06-08 12:40:22-07:00

    CASSANDRA-18578 Add circleci configuration yaml for Cassandra Analytics
    
    This commit adds the CircleCI configuration yaml to test against all the existing
    profiles
    
          - cassandra-analytics-core-spark2-2.11-jdk8
          - cassandra-analytics-core-spark2-2.12-jdk8
          - cassandra-analytics-core-spark3-2.12-jdk11
          - cassandra-analytics-core-spark3-2.13-jdk11
    
    Patch by Francisco Guerrero; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18578

ee1c83722bfb1155bef762cdfb2c86034857f2d0 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2023-06-07 12:40:50-07:00

    CASSANDRA-18574: Fix sample job documentation after Sidecar changes
    
    This commit fixes the README file with documentation to set up and run the Sample job provided in the repository.
    During Sidecar review, there was a suggestion to change the yaml property `uploads_staging_dir` to `staging_dir`.
    That change however was not reflected as part of the sample job README.md.
    
    patch by Francisco Guerrero; reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18574

7764214d1fb44fb6139a622f403bb05610e8f7b1 | Author: Francisco Guerrero <frankgh@apache.org>
 | 2023-05-24 14:21:59-07:00

    CASSANDRA-18548: Add the .asf.yaml file
    
    This commit adds the .asf.yaml file to control notifications and github settings
    for the Cassandra Analytics project.
    
    Patch by Francisco Guerrero; Reviewed by Brandon Williams, Yifan Cai for CASSANDRA-18548

b87b0edd310d1ef93c507bbbb1ae51e1b0b319c6 | Author: Francisco Guerrero <francisco.guerrero@apple.com>
 | 2023-05-23 13:56:48-07:00

    CASSANDRA-18545: Provide a SecretsProvider interface to abstract the secret provisioning
    
    This commit introduces the SecretsProvider interface that abstracts the secrets provisioning.
    This way different implementations of the SecretsProvider can be used to provide SSL secrets
    for the Analytics job. We provide an implementation, SslConfigSecretsProvider, which provides
    secrets based on the configuration for the job.
    
    Patch by Francisco Guerrero; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-18545

1633cd9c6c3d88d5c66825fab76a369266509f7e | Author: Dinesh Joshi <djoshi@apache.org>
 | 2023-05-19 14:57:47-07:00

    CEP-28: Apache Cassandra Analytics
    
    This is the initial commit for the Apache Cassandra Analytics project,
    which supports bulk reading from and bulk writing to Apache Cassandra from
    Spark.
    
    Patch by James Berragan, Doug Rohrer; Reviewed by Dinesh Joshi, Yifan Cai for CASSANDRA-16222
    
    Co-authored-by: James Berragan <jberragan@apple.com>
    Co-authored-by: Doug Rohrer <drohrer@apple.com>
    Co-authored-by: Saranya Krishnakumar <saranya_k@apple.com>
    Co-authored-by: Francisco Guerrero <francisco.guerrero@apple.com>
    Co-authored-by: Yifan Cai <ycai@apache.org>
    Co-authored-by: Jyothsna Konisa <jkonisa@apple.com>
    Co-authored-by: Yuriy Semchyshyn <ysemchyshyn@apple.com>
    Co-authored-by: Dinesh Joshi <djoshi@apache.org>