I've noticed a rise in people sharing links to YouTube, Instagram, Twitter, TikTok, and reddit that include tracking parameters in the URL.
It might largely be harmless for now, but it's not good to let companies build a web of links between users of this site, and to link the usernames of users on this site to their off-site accounts, which may include sensitive info.
Site | URL part | Appearance in URL | Filtration technique |
---|---|---|---|
YouTube | Query | ?si=* | Remove query string |
Instagram | Query | ?igshid=* | Remove query string |
Twitter | Query | ?t= | Remove query string |
TikTok | Subdomain and path | (vm/vt).tiktok.com/(random_string) | Block |
Reddit | Path | /(sub_name)/s/(random_string) | Block |
This site should only allow canonical links to the content to limit the information exposed.
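A minimal sketch of how the table above could be applied to a submitted link, assuming Python's standard `urllib.parse`. The host lists and the `filter_link` name are illustrative assumptions, not anything Lemmy ships. One deliberate difference from the table: this drops only the named parameter instead of the whole query string, because a YouTube watch link carries the video ID in its `v` parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameter names come from the table; the host lists are assumptions.
TRACKING_PARAMS = {
    "youtube.com": {"si"},
    "youtu.be": {"si"},
    "instagram.com": {"igshid"},
    "twitter.com": {"t"},
}
BLOCKED_HOSTS = {"vm.tiktok.com", "vt.tiktok.com"}


def _bare_host(host):
    """Lowercase the host and drop a leading www. or m. prefix."""
    host = host.lower()
    for prefix in ("www.", "m."):
        if host.startswith(prefix):
            return host[len(prefix):]
    return host


def filter_link(url):
    """Return a cleaned URL, or None if the link should be rejected."""
    parts = urlsplit(url)
    host = _bare_host(parts.hostname or "")
    if host in BLOCKED_HOSTS:
        return None
    if host == "reddit.com":
        segments = [s for s in parts.path.split("/") if s]
        # /r/(sub_name)/s/(random_string) is an opaque share link.
        if len(segments) == 4 and segments[0] == "r" and segments[2] == "s":
            return None
    drop = TRACKING_PARAMS.get(host)
    if drop:
        kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                if k not in drop]
        return urlunsplit((parts.scheme, parts.netloc, parts.path,
                           urlencode(kept), parts.fragment))
    return url
```

A rejected link (`None`) would be the point to show the poster instructions for finding the canonical form.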
yup. tiktok keeps recommending me to add a user here as a friend because I clicked through from a tracking link on hexbear months ago now.
the word "hextok" enters my mind unbidden... who knows what forces we have unleashed
Yeah... As much as I wish it were not a problem for this site to solve, much like nitter/invidious/etc. links were better solved by a browser extension, it's such a dangerous practice to allow in a place that values opsec that I really think we should get to work on it. Maybe upstream Lemmy would accept it as well; we certainly aren't the only privacy-focused instance out there.
Another one I'd add:
Site | URL part | Appearance in URL | Filtration technique |
---|---|---|---|
StackExchange | Path | /<answer_id>/<referrer_id> | Remove final path element |

Yeah, maybe it's better to take it to dessalines instead of keeping it on hb
StackExchange
Good call especially since we know the FBI used data from them in one high-profile sting already lol
I am very much in favor of getting as many of these as convenient off Hexbear. I made a smaller thread about I think the twitter ones a long time ago and it didn't go anywhere at the time.
Don't forget the general purpose UTM ones:
utm_content=site-enterprise-button&utm_source=organic&utm_medium=website&utm_campaign=null
These are used across the net, various sites document what they are, like this one: https://mailchimp.com/resources/utm-links/
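Since the UTM parameters all share the `utm_` prefix, they can be dropped by prefix on any domain. A rough sketch using Python's standard `urllib.parse`; the `strip_utm` name and the `example.com` URL are made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_utm(url):
    """Remove every query parameter whose name starts with utm_."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if not k.lower().startswith("utm_")]
    # urlencode re-encodes the surviving parameters, so unusual escaping
    # in the original query may come out normalized.
    return urlunsplit(parts._replace(query=urlencode(kept)))


cleaned = strip_utm(
    "https://example.com/page?id=7&utm_content=site-enterprise-button"
    "&utm_source=organic&utm_medium=website&utm_campaign=null"
)
print(cleaned)  # https://example.com/page?id=7
```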
Firefox started to have "copy without site tracking" on right click as an option.
Doesn't always work, but at least it's something. There might be websites that do that too, but people here also forget to use archive links so idk how enforceable it is.
At least there's the bot comments that do a private front end for links to big sites sometimes, but yeah people should be more careful about helping to build shadow profiles that'll probably exist regardless.
Doesn't always work, but at least it's something
The ClearURLs extension has a very robust link copying tool, but I think if we're relying on the users to have initiative about link cleaning then we're only as private as the least compliant users on this site.
Agreed. This should be easy enough to implement, no?
EDIT: if we're scrubbing metadata from posted images we should absolutely be doing this.
Oh I know, I mean that the precedent of metadata scrubbing points toward url cleaning as well, imo.
Now that the thread quietened down, I did want to comment on image sharing as well. We already know that Facebook implements tracking in metadata, but there is a concern that they might resort to advanced steganography to link images shared on other sites to their origins. If you're familiar with unsee(.)cc, they implement this by just straight up plastering your IP over the image, but this could be taken further by encoding dots or some wave pattern. Combatting this is really difficult, and I don't expect us to be able to do much. Personally I've been applying a slight imperceptible distortion to images which I shared from somewhere I expect to get tracked on, but that's extremely overkill. Just wanted to share, since I doubt I'll get another outlet.
The ClearURLs extension is great for this, as it automatically removes the tracking bit from links to major sites. It doesn't detect everything, though, so it's still good to be wary.
There's a URL, say peepee.com. So far this is the routing portion of the URL that says how to find the web server, basically saying "ask .com how to find peepee", and that gives us the IP address of the server.
Everything that comes after that is information for the server itself. So to navigate to a resource, say poopoo, that lives on the server, they would navigate to peepee.com/poopoo.
But sometimes you want to navigate to that resource and also communicate some bit of information to the server, say a login token so the server knows who is accessing that resource. This is communicated via a URL parameter, and looks like ?userid=abcd1234, or in the full URL: peepee.com/poopoo?userid=abcd1234. So the user is still accessing the same resource, but has provided additional metadata to the server.
These parameters can be abused to identify who knows who and who communicates with who by attaching a tracking ID parameter to the URL. When you share a link, it includes that tracking parameter, and when anyone clicks on it, the server knows that the originator of the tracking ID (well, the first person to be assigned it) shared it with this other person. This can be combined with other collected info to build a map and social graph of actual people, e.g. we know dave is at this IP, and jane is at this other IP, and we put a tracking parameter in dave's URL and we saw jane use that same tracking parameter in her URL, so we know that dave shared this URL with jane.
So to answer your question, a canonical link is a link to a resource without the unneeded url parameters.
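To make the dave and jane scenario concrete, here is a toy sketch of what the server side could do with those parameters. Everything in it is hypothetical: the `issued_to` table, the `record_visit` function, and the reuse of the `userid=abcd1234` parameter from the example as the tracking ID.

```python
from urllib.parse import parse_qs, urlsplit

# Who each tracking ID was first issued to (hypothetical data).
issued_to = {"abcd1234": "dave"}


def record_visit(url, visitor, edges):
    """Add a (sharer, visitor) edge when someone arrives with another user's ID."""
    params = parse_qs(urlsplit(url).query)
    for tracking_id in params.get("userid", []):
        sharer = issued_to.get(tracking_id)
        if sharer and sharer != visitor:
            edges.add((sharer, visitor))


edges = set()
record_visit("https://peepee.com/poopoo?userid=abcd1234", "dave", edges)
record_visit("https://peepee.com/poopoo?userid=abcd1234", "jane", edges)
record_visit("https://peepee.com/poopoo", "jane", edges)  # canonical link
print(edges)  # {('dave', 'jane')}
```

The last call uses the link without the parameter, and it adds nothing to the graph, which is the whole point of sharing the canonical form.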
There are often many ways to represent a webpage link in a URL format. For example, a random reddit post has several forms of links, even without any tracking:
https://www.reddit.com/r/me_irl/comments/18xheeg/me_irl/
https://redd.it/18xheeg
Both go to the same reddit post. However, if I were to use the new reddit redesign, or reddit mobile to share this link, it would look something like https://www.reddit.com/r/me_irl/s/stxMlEtK5H (not a real link). If you press on that, it might go to the more expanded form https://www.reddit.com/r/me_irl/comments/18xheeg/me_irl?share_id=5168327 but it will have a share_id parameter. Both clicking the link with the /s/stxMlEtK5H and landing on the page that has ?share_id=5168327 will register on reddit's servers as some user following some other user's link, and of course they know who both these users are. They can then correlate it, and form a graph (a structure that represents a network) that links these users because they interacted by sharing this link, even though they might have shared it on a second medium like WhatsApp, or Hexbear, and never interacted directly on reddit itself.
Canonical links are just the most normal links to the content. Without ?share_id stuff, and without pointless random letters. When Google finds reddit pages to show on their end they only show the full form, which is https://www.reddit.com/r/me_irl/comments/18xheeg/me_irl/. This is the canonical link form for reddit.
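As a rough illustration, the reddit forms above can be told apart with two patterns. This is a sketch under assumptions: the regexes are inferred only from the example links in this comment, and `canonical_reddit` is a made-up name. It works offline, so it returns `None` for share links and for anything else it cannot rewrite without contacting reddit (including redd.it short links).

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Patterns inferred from the example links above (assumptions, not a spec).
SHARE_LINK = re.compile(r"^/r/[^/]+/s/[^/]+/?$")
POST_LINK = re.compile(r"^/r/[^/]+/comments/[0-9a-z]+(/[^/]*)?/?$")


def canonical_reddit(url):
    """Return the canonical post URL, or None if it can't be derived offline."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host != "reddit.com" and not host.endswith(".reddit.com"):
        return None
    if SHARE_LINK.match(parts.path):
        return None  # opaque share link; resolving it means a request to reddit
    if POST_LINK.match(parts.path):
        path = parts.path if parts.path.endswith("/") else parts.path + "/"
        # Drop the query string (share_id and friends) and any fragment.
        return urlunsplit(("https", "www.reddit.com", path, "", ""))
    return None
```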
In the meantime, Firefox desktop has a function where you can right-click a link and "Copy Link Without Site Tracking," but implementing this Lemmy-wide would be best
TikTok links can be scrubbed of their tracking by resolving them once: visiting the link in a web browser resolves the 9-character random alphanumeric string to a 19-character numeric-only video identifier plus separate tracking parameters, and you then strip the GET parameters that come out when you resolve it. See this post I made a while ago: https://hexbear.net/post/216322?scrollToComments=false
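The cleanup step after resolving can be sketched like this, for a poster to run locally on the URL their browser ended up at. The `/@user/video/<id>` path shape and the `tracking` parameter in the usage line are assumptions for illustration; only the 19-digit numeric ID comes from the description above.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed shape of a resolved video URL: /@<user>/video/<19 digits>
VIDEO_PATH = re.compile(r"^/@[^/]+/video/\d{19}/?$")


def clean_resolved_tiktok(url):
    """Strip all GET parameters from an already-resolved TikTok video URL."""
    parts = urlsplit(url)
    if not VIDEO_PATH.match(parts.path):
        raise ValueError("not a resolved TikTok video URL")
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(clean_resolved_tiktok(
    "https://www.tiktok.com/@someuser/video/1234567890123456789?tracking=abc"
))  # https://www.tiktok.com/@someuser/video/1234567890123456789
```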
Really good point, but in my opinion this should be left to the person doing the posting. If Hexbear implements this link resolution on the server, it could potentially be used to link the user to Hexbear itself. Again, very paranoid, but I think it's more pragmatic to just block. Alternatively, proxitok can be used to resolve the deobfuscated URL, thereby the user isn't linked back to Hexbear, but this is significantly more complicated and leaves Hexbear dependent on a third-party service.
yeah it's probably best if we prevent submission of such links with a pointer to instructions on how to deobfuscate the url.
For users submitting links in the meantime, on Android there's URL sanitizing apps that add "share providers", like "URLCheck" (fdroid, github), so if you're generating share links on Android you can send them to that app first or make it your default URL handler and let it sanitize the links on your clipboard.
Probably worth going to upstream Lemmy, though I guess ultimately federated links should be subjected to the same sanitization as links submitted here directly.
There's code out there that can be implemented, probably best as some updateable list of regex filters per domain that instances can maintain in between backend updates.
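One possible shape for such a list, sketched in Python: the rules live in JSON, so an instance could swap the file without touching the backend. The rule format (`host`, `drop_params`, `block`) is invented for this example and is not an existing Lemmy or ClearURLs format.

```python
import json
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical rule file contents; regexes match hostnames and parameter names.
RULES_JSON = r"""
[
  {"host": "(^|\\.)youtube\\.com$|^youtu\\.be$", "drop_params": ["^si$"]},
  {"host": "^(vm|vt)\\.tiktok\\.com$", "block": true},
  {"host": ".*", "drop_params": ["^utm_"]}
]
"""


def load_rules(text):
    """Compile the regexes once so they can be reused for every link."""
    return [
        {
            "host": re.compile(raw["host"]),
            "drop": [re.compile(p) for p in raw.get("drop_params", [])],
            "block": raw.get("block", False),
        }
        for raw in json.loads(text)
    ]


def apply_rules(url, rules):
    """Return the sanitized URL, or None if a matching rule blocks it."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    query = parse_qsl(parts.query, keep_blank_values=True)
    for rule in rules:
        if not rule["host"].search(host):
            continue
        if rule["block"]:
            return None
        query = [(k, v) for k, v in query
                 if not any(p.search(k) for p in rule["drop"])]
    return urlunsplit(parts._replace(query=urlencode(query)))


rules = load_rules(RULES_JSON)
```

Matching on parsed parameter names instead of running a regex over the raw URL avoids leaving a dangling `&` or `?` behind when a parameter is removed.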
This seems useful. I have a redirect extension on my computer but still do a lot of link sharing with friends on discord. Always hated how modern links are 98% tracking data. I've occasionally manually stripped out all the extra shit when I realized I posted a link that is a wall of text but having an app do that on the go sounds great.
Most of my friends don't really do tech and don't really care about this stuff but I try to avoid metadata spam or affiliate linking where I can.
I like that Firefox has recently added a "Copy Link Without Site Tracking" option in the context menu when you right-click a link.
I've always manually removed all that garbage from links anyways, but now it's even easier.
I have a question adjacent to this topic: Is it possible for someone (3rd party) to construct an elaborate tracking link that bounces through a server they themselves control -- neither the site the link was posted on nor the destination, that also calls on a cookie or javascript function or something similar in the browser of the person who clicked it, in order to see who clicked it and what their destination was?
I ask this because back in ~2021 there were a few communist twitter accounts that DM'ed their followers a strange / extremely long URL, apparently after their accounts were hacked. The link redirected to IG where some people said they were logged in to an account that had their real identity associated with it, and it caused a bit of a stir, but I never heard anything more about it. I have always wondered about this since it happened
The URL is meant to be a unique string to identify the location of a resource. It can have quite a bit of extra information encoded that only the server being called knows what to do with, so it's trivially possible to encode the URL of the resource a user wants to access into a completely different URL. The server at that location decodes the information and redirects the user to the location they are actually looking for.
This is why URL minifiers like tinyurl.com are considered harmful, but much more impactful, and less noticed, is Google's AMP project.
I hope I understood you correctly.
Edit: to expand on the threat scenario you posted, a 3rd party can create a URL that goes to a server they control. Encoded in that string can be identifiers to see where/who a user got the link from and where they should be redirected to. When a user clicks that URL that information plus the standard metadata of a browser request get transmitted to the server. The server then can serve a webpage that reads and/or places cookies, calls some JavaScript function to phone more information about the user home and then redirects the user to the location that was encoded in the URL the user originally clicked.
See https://www.amiunique.org/ for more information on browser fingerprinting.
This is more noticeable to the user who might see a blank page for a split second before their browser processes the redirection request. A less noticeable option would be to send a redirection command instead of a webpage, the attacker still gets the browser metadata of the initial request plus any identifiers in the URL and the user might not notice since the only change visible to them is in the address bar of their browser. But the attacker can't place cookies or read extra information of the browser.
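A toy sketch of the encoding step described above, with no real networking. The `tracker.example` endpoint and the `to` and `src` parameter names are invented for illustration; a real tracker would also log the browser metadata that arrives with the request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

TRACKER = "https://tracker.example/r"  # hypothetical attacker-controlled endpoint


def make_tracking_link(destination, source_id):
    """Wrap the real destination and an identifier inside the tracker's URL."""
    return TRACKER + "?" + urlencode({"to": destination, "src": source_id})


def handle_click(link):
    """Return what the tracker learns from the URL alone, and the redirect target."""
    params = parse_qs(urlsplit(link).query)
    logged = {"src": params["src"][0]}
    return logged, params["to"][0]


link = make_tracking_link("https://www.instagram.com/p/abc/", "dm-campaign-7")
logged, destination = handle_click(link)
```

In the less noticeable variant, `destination` is what the server would send back as the redirect target, and the user only sees the address bar change.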
What likely happened was that someone found and took advantage of an instagram exploit that allowed for cross site scripting. In other words, the instagram server allowed for a 3rd party server to steal cookies or something like that from the instagram session. It's very likely that whatever code was executed (or instagram fixing the exploit) just resulted in the users being redirected to their main account or whatever so it didn't look like anything out of the ordinary.