Hacker News | ciucanu's comments

Windows app down for me also.


Depending on their needs and options, people are using Spark, Hive, Kafka, or shell scripts to ingest data into Hadoop.

If you don't like the NiFi UI but you still want to make use of its advantages, you could try https://kylo.io/ , which runs on top of NiFi and has a much simpler UI.


That's another reason for putting your own router behind your ISP's box. Since I'm not an admin on their box, they can do a lot of shady things with it. Also, use a DNS server with external forwarders (Pi-hole is great for that).


Not only that, but the quality of commercial router security is appalling. See for example

https://www.securityevaluators.com/whitepaper/sohopelessly-b...

That paper needs wider exposure, though sadly it didn’t get much traction here when I submitted it.


This should not be underestimated. The ISP cannot be trusted and DNS poisoning is easily done through their box.


The problem is that the attack vectors are legion. If they don't get you by DNS, they'll get you through one of the thousands of other vectors.


I usually do text processing in Bash, Notepad++ and Excel. Each has its own pros and cons, that's why I usually combine them.

Here are the tools I use in Bash:

grep, tail, head, cat, cut, less, awk, sed, sort, uniq, wc, xargs, watch ...
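For instance, a typical pipeline chaining a few of these (the "log" file and its format are made up here for illustration):

```shell
# Build a small sample "log" so the pipeline below is runnable.
printf '1.2.3.4 GET /a\n5.6.7.8 GET /b\n1.2.3.4 GET /c\n' > sample.log

# Top client IPs: extract column 1, count duplicates, sort by count.
cut -d' ' -f1 sample.log | sort | uniq -c | sort -rn | head -5
# prints "2 1.2.3.4" above "1 5.6.7.8" (uniq -c pads the counts)
```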


As an aside, I once found out you can replace 'sort | uniq' entirely with an obscure awk command, as long as you don't require the output to be sorted. IIRC it performs twice as fast.

  awk '!x[$0]++' file.txt


The awk command prints the first occurrence of each line, in the order they appear in the file. I can imagine that sometimes that might even be better than sorted order.
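A quick side-by-side (with a throwaway input file made up here) showing the ordering difference versus sort -u:

```shell
# Duplicate lines, deliberately out of order.
printf 'b\na\nb\na\n' > input.txt

awk '!x[$0]++' input.txt   # first occurrences, input order: b, a
sort -u input.txt          # unique lines, sorted order:     a, b
```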


sort has a -u option on my linux:

  -u, --unique
      with -c, check for strict ordering; without -c, output only the first of an equal run


If you're already in Windows land, you should consider leveraging PowerShell instead of Bash. Pretty much all the same tooling is there, only with more descriptive names, tab completion on everything, typed objects passed around instead of text parsing, etc.


Ahem... and what is PowerShell Core? (I take exception to your 'if' condition.) As someone on Arch, I enjoy it a lot.


Bash with Notepad++ and Excel? Do you use Wine or WSL?


It's kind of mandatory to use Windows in certain envs.


I wouldn't be surprised to read this kind of news about some of the Eastern European countries in the future.


I wouldn't be surprised to read this kind of news about Western European countries in the future!


East 2nd, East 2nd, at least this time.


I originally downvoted you, but now I think I see where you were going with this.

It is possible in at least one instance.


ricardo.ch (similar to Craigslist) is doing the same in Switzerland. I think this gives users the impression that they have to be responsible for their actions on the website.


More that it weeds out all the scammers hiding behind fake addresses.


Zurich | Big Data Engineer, Data Scientist... | ONSITE - looking for new colleagues | Contact: https://www.linkedin.com/in/ciucanu


It looks like a faster version of HDFS since it's written in C++ (vs Java).

Another important aspect is that it is using SSD + SATA (I suppose), which could be a better option than standard SATA/SSD or an LVM cache using SATA + SSD.

Even if it's just a new thing, if it proves to be faster it may be implemented in the Hadoop ecosystem in the future. HDFS has a lot of features, being a mature piece of software, but it falls short on response time.


"It looks like a faster version of HDFS since it's written in C++ (vs Java)."

This is a non sequitur. The conclusion does not follow from the premise.


During non-GC periods, probably true. But having a realtime filesystem service that is prone to stop-the-world GC pauses is a showstopper for many applications.

Also, a C++ implementation is likelier to use far less memory than a Java implementation, assuming the skills of both programmers are roughly equal.


The underlying local filesystem on each node is not truly realtime, so a "realtime distributed file system" is already quite a stretch. Also, the JVM is perfectly fine with worst-case pause times below a few tens of ms (when using a properly tuned G1 or CMS GC), which is lower than the worst-case latency induced by network + I/O.

As for using less memory: you don't allocate buffers for file data on the JVM heap. You allocate them in native memory, exactly as you'd do in C++. Therefore it is possible to create a JVM-based file system that handles petabytes of data with as little as 100 MB of heap, used mostly for small temporary objects.

Also, the code here is using mutexes a lot to synchronize threads and lock out whole objects. Therefore I think these "realtime" claims are quite exaggerated.


You're using the academic version of realtime, not the one that anybody cares about. HDFS's biggest problem is, and has always been, that it's essentially impossible to tune it to give anything like reliable performance, mostly because the namenode is a single point of lag for the entire system. "Worst-case network and I/O latency" is a huge stretch. Network performance is predictably sub-ms if you're using a network designed for modern distributed computing (a real stretch, I know, since almost all HDFS installations are on old-school core-router-tree infrastructure). The I/O operations are incredibly unpredictable, but only for one client at a time. Having individual servers with 10-20 ms worst-case performance hiccups is nowhere near as bad for a system as all of your clients hiccuping for even 5 ms at the same time.


HDFS's biggest problem is its SPOF master-slave architecture, not the JVM or GC. With a truly distributed, shared-nothing system, Java GC would not be a problem, because servers can now run with no major GC for hours or days. So two servers or clients doing GC at the same time is very unlikely. And even if some of them do, the pauses from GC are much more predictable than the pauses from I/O, which on a loaded system can take seconds, not milliseconds.

Also if GC was such a huge problem, exchanges or HFT companies wouldn't use Java for their low latency stuff, and there definitely are companies which do.


> Also if GC was such a huge problem, exchanges or HFT companies wouldn't use Java for their low latency stuff, and there definitely are companies which do.

Can you name one?


LMAX, the New York Stock Exchange.


Wow, that's neat. Thanks for the pointer!


> As for using less memory - you don't allocate buffers for file data on the JVM heap.

I meant the code size and heap allocations for data structures, not file buffers.

And 100MB is huge compared to many C++ programs. And that's on top of the Java runtime!


Sure, and this C++ DFS's memory use is probably huge compared to many hand-crafted assembly or C programs from the 1980s. But who cares? 100 MB or even 1 GB is really tiny for today's server hardware. And the Java runtime itself is just a few MB. What takes most of the memory in many Java programs (e.g. IDEs) is code and libraries.


Size can lead to a tremendous difference in performance on modern CPUs, particularly if you can take advantage of L2/L3 instruction and data caches. It still matters, even on modern "big memory" systems where gigabytes of installed RAM are the norm.


Technically correct, but filesystems are mostly about I/O. For example, this Baidu filesystem copies blocks of data into userland memory and transfers them in RPC messages; any system using a proper zero-copy approach would easily beat it, even if coded in Python or JS. Baidu also seems to use threads, locks, and SEDA instead of the more efficient (but much harder to code) thread-per-core async architecture. Thread pools and lock-based synchronization are terrible for latency.

The fact that something is in C++ doesn't make it automatically efficient. And in particular, if we're talking about milliseconds rather than nanoseconds, in Java or C# you can do just about everything you can do in C++, performance-wise.


I think it would be a great exercise to shadow your Project Manager (if you have one) for a while. If you don't have one, maybe just try to virtually manage the project you're working on. You'll find that the PM job is not really what you expected it to be.

I'm a sysadmin and I've tried to simulate this kind of change, and I found there are a lot of bureaucratic tasks which I don't really like :). If you ask me, the proper career path has the "team lead" position as the first step, which brings you closer to a "* manager" feeling.

Good luck!!


I can feel that at work, but the PM skill is something I need in order to improve myself. Maybe I don't need that skill in my current job, but I have some extra projects outside of work; I hope I can manage those well, and then we may build a studio or something.

You're never ready until you just start. I just want to know what I'm missing and how to get it.


It should be safe enough, with a few exceptions, one of them being this exploit: https://news.ycombinator.com/item?id=12171547

This guy says that if a webpage "asks" for another page's credentials, the LastPass plugin will hand them over. Every character/keystroke in specific fields can be caught/logged; here's an example from... eBay: https://news.ycombinator.com/item?id=12000820

Anyway, this was already fixed and pushed to users, as the guy mentions in his post.

