If approved, the settlement would be the largest in the history of American copyright cases, according to a lawyer for the authors behind the lawsuit.

Anthropic, a major artificial intelligence company, has agreed to pay at least $1.5 billion to settle a copyright infringement lawsuit filed by a group of authors who alleged the company had illegally used pirated copies of their books to train large language models, according to court documents.

“If approved, this landmark settlement will be the largest publicly reported copyright recovery in history, larger than any other copyright class action settlement or any individual copyright case litigated to final judgment,” said Justin Nelson, a lawyer for the authors.

The lawsuit, filed in federal court in California last year, centered on roughly 500,000 published works. The proposed settlement amounts to a gross recovery of $3,000 per work, Nelson said in a memorandum to the judge in the case.
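The arithmetic lines up: $3,000 across roughly 500,000 works comes to $1.5 billion, the reported floor of the settlement. A quick back-of-the-envelope check (our own, not from the filing):

```python
# Back-of-the-envelope check of the reported settlement math.
works = 500_000     # roughly 500,000 published works at issue
per_work = 3_000    # gross recovery of $3,000 per work, per the memorandum
print(f"${works * per_work:,}")  # -> $1,500,000,000, the "at least $1.5 billion" floor
```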

  • frongt@lemmy.zip · 5 days ago

    It’s a start. Now do OpenAI, Facebook, and the rest of them. Hopefully it’s not reduced on appeal.

    • halcyoncmdr@lemmy.world · 5 days ago

      Hopefully it’s not reduced on appeal.

      It’s a settlement, not a judgment. Both sides agreed to this amount to end the lawsuit, which means no legal precedent is set either.

  • WalnutLum@lemmy.ml · 5 days ago

    Their projected 2025 revenue is supposedly $5 billion, so this is a decent chunk.

    Not enough, but a decent chunk.
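    For scale, simple arithmetic on the two figures in this comment (both reported numbers, not official):

    ```python
    # The settlement as a share of the projected 2025 revenue.
    settlement = 1.5e9   # at least $1.5 billion
    revenue = 5e9        # supposedly $5 billion projected for 2025
    print(f"{settlement / revenue:.0%}")  # -> 30%
    ```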

    • FlowVoid@lemmy.world · 5 days ago

      They were also forced to delete the pirated works from their training data, which means they might have problems training the next version of their LLM; if so, that revenue is going to dry up.

      • GissaMittJobb@lemmy.ml · 5 days ago

        They essentially still have the information in the weights, so I guess they won’t fret too much over not having it in the original training data.

        • FlowVoid@lemmy.world · 4 days ago

          They will need actual training data when they want to develop the next version of their LLM.

          • GissaMittJobb@lemmy.ml · 4 days ago

            I guess it depends on how important old data is when building upon new models, which I fully admit I don’t know the answer to. As I understand it though, new models are not trained fully from scratch, but instead are a continuation of the older model trained with new techniques/new data.

            To speculate, I guess not having the older data present in the new training stages might make the attributes of that data be less pronounced in the new output model.

            Maybe they could cheat the system by distilling that data out of the older models and putting it back into the training data, but I guess the risk of model collapse is not insignificant there (a toy sketch of both ideas follows this comment).

            Again, limited understanding here; take everything I speculate with a grain of salt.
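            A minimal toy sketch of the two ideas above: warm-starting a new model from the old weights, and distilling synthetic data out of the old model. The tiny model and every name here are hypothetical stand-ins, not anything Anthropic actually does:

            ```python
            import torch
            import torch.nn as nn

            # Toy stand-in for an LLM; a real one would be a large transformer.
            class TinyLM(nn.Module):
                def __init__(self, vocab=100, dim=32):
                    super().__init__()
                    self.embed = nn.Embedding(vocab, dim)
                    self.head = nn.Linear(dim, vocab)

                def forward(self, tokens):
                    return self.head(self.embed(tokens))

            old_model = TinyLM()  # imagine this was trained on the original (now-deleted) corpus

            # 1) Warm start: the new model inherits the old weights, not the old data.
            new_model = TinyLM()
            new_model.load_state_dict(old_model.state_dict())

            # 2) "Distillation": sample synthetic tokens from the old model and use
            #    them as training data for the new one. Training on your own outputs
            #    is where the model-collapse risk mentioned above comes in.
            prompt = torch.randint(0, 100, (1, 8))  # hypothetical prompt tokens
            with torch.no_grad():
                synthetic = torch.distributions.Categorical(logits=old_model(prompt)).sample()

            optimizer = torch.optim.AdamW(new_model.parameters(), lr=1e-4)
            loss = nn.functional.cross_entropy(
                new_model(prompt).flatten(0, 1),  # (batch*seq, vocab)
                synthetic.flatten(),              # (batch*seq,)
            )
            loss.backward()
            optimizer.step()
            print(f"one distillation step done, loss = {loss.item():.3f}")
            ```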

            • FlowVoid@lemmy.world · 4 days ago

              It’s true that a new model can be initialized from an older one, but it will never outperform the older one unless it is given actual training data (not necessarily the same training data used previously).

              Kind of like how you can learn ancient history from your grandmother, but you will never know more ancient history than your grandmother unless you do some independent reading.

              • GissaMittJobb@lemmy.ml · 4 days ago

                I think we’re in agreement with each other? The old model has the old training data, and then you train a new one on that model with new training data, right?

                • FlowVoid@lemmy.world · 4 days ago

                  No, the old model does not have the training data. It only has “model weights”. You can conceptualize those as the abstract rules the old model learned when it read the training data; by design, models are not supposed to memorize their training data.

                  To outperform the old model, the new model needs more than what the old model learned. It needs primary sources, i.e. the training data itself, which is going to be deleted.
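                  To make the weights-vs-data point concrete, a minimal sketch (assuming a PyTorch-style setup; no claim about Anthropic’s internals): a saved checkpoint is just a dictionary of parameter tensors, with no trace of the documents it was trained on.

                  ```python
                  import torch.nn as nn

                  # A saved "model" is only its learned parameters, never the corpus.
                  model = nn.Linear(4, 2)  # stand-in for a trained network
                  for name, tensor in model.state_dict().items():
                      print(name, tuple(tensor.shape))
                  # -> weight (2, 4)
                  #    bias (2,)
                  # Every entry is a floating-point tensor; there is no text in a
                  # checkpoint, which is why deleting the training data does not
                  # delete the model.
                  ```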

            • rumba@lemmy.zip · 4 days ago

              My guess is they don’t actually need the half a million closed books to train their models. It’s not the only thing they’re training on.

              Now that they’re making their billions, they could actually afford to pay for the useful subset of the content they need to train the models. I always felt the kitchen-sink approach everyone used, including every book imaginable, was over the top.

              I think it’ll be more interesting when they finally get around to making all the diffusion models pull out the IP. There really isn’t a good reason why Midjourney can draw Batman.

  • TooManyGames@lemmy.world · 5 days ago

    This is interesting, because a judge just let Meta off the hook for downloading copyrighted works from pirate sites, and even for using Meta’s vast infrastructure to try to hide it. It’s curious that Anthropic has to pay and Meta doesn’t.

    • pedroapero@lemmy.ml · 4 days ago

      That’s a settlement; they did not have to pay, but voluntarily did so before a verdict.

      • TooManyGames@lemmy.world · 4 days ago

        That clarifies it. So Anthropic kinda chickened out and considered it safer to make this a cost-of-doing-business thing.

  • ZoteTheMighty@lemmy.zip · 5 days ago

    Amazing, let’s not forget that not a single AI company is profitable, and it’s very possible that AI won’t get substantially better than this. They’re just shoveling billions of dollars into this thing as “the cost of doing business”, and they probably won’t see a real ROI from it.