All posts
-
[Hadoop/Hive] Installation/configuration: how to install Hadoop and Hive 📚 Database/Big Data 2022. 8. 2. 05:05
1. Download Hadoop files
2. Update necessary config files
3. Download Hive files
4. Update Hive config file
5. Install Hive metastore

/* Update the system and install Java */
sudo apt update
sudo apt install openjdk-8-jdk -y
java -version; javac -version

/* Install OpenSSH */
sudo apt install openssh-server openssh-client -y
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.s..
-
[AzureVM] How to use an Azure VM with PuTTY 📚 Database/Big Data 2022. 8. 2. 01:10
1. Creating an Azure Resource Group: Create an Azure account, go to Resource Groups, and click the Review + create button at the bottom. *Resource group - A container that holds related resources for an Azure solution. The resource group can include all the resources for the solution, or only those resources that you want to manage as a group. You decide how you want to allocate resources to resource group..
-
[DataPipeline] What is a data pipeline? Uncategorized 2022. 8. 2. 00:17
Data Pipeline: A data pipeline is a technique for transferring data from one system to another. It encompasses everything from acquiring data by various methods to storing raw data, cleaning, validating, and transforming it into a queryable format, displaying KPIs, and managing the whole process. The data may or may not be updated, and it may be handled in real-time (or streaming) rat..
-
[Flume] Collecting streaming data 📚 Database/Big Data 2022. 8. 1. 22:27
Flume is a distributed, reliable, and available service for rapidly gathering, aggregating, and transporting massive amounts of log data. Its architecture is simple and adaptable, based on streaming data flows. It has configurable reliability mechanisms as well as several failover and recovery mechanisms, making it resilient and fault tolerant. It employs a straightforward extensible da..
-
[Hive] Overview: distributed data warehouse 📚 Database/Big Data 2022. 8. 1. 22:24
Apache Hive is a fault-tolerant distributed data warehouse that allows for massive-scale analytics. - Hive is built on top of Apache Hadoop, an open-source platform for storing and processing large amounts of data. - As a result, Hive is tightly coupled to Hadoop and is designed to process petabytes of data quickly. - Using SQL, Hive allows users to read, write, and manage petabytes of data. ..
-
How to read deep learning research papers / Career planning Misc 2022. 7. 30. 20:39
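This post is based on the last lecture of Stanford CS230 by Andrew Ng: https://www.youtube.com/watch?v=733m6qBH-jI 1. How to read research papers effectively. Reading 5-20 papers is usually about the right number to get a reasonable feel for a field, and you should aim to read around 50 (and understand much of them). Rather than reading each one through to the end, one at a time: 1) Make a rough list drawn from a variety of resources, not only related academic papers but also Medium, GitHub, and so on. 2) Then read several at once, finishing the ones you understand well and dropping papers that turn out to be irrelevant to you. You can of course add new papers to the list along the way. 3) When reading a single paper, start with the paper's ..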
-
[concept] Batch processing vs parallelism 📚 Database/Big Data 2022. 7. 29. 22:38
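batch processing: sequential, and done all at once instead of one by one. Rather than running a task every time it comes up, the work is collected and run in one go, which reduces total latency by avoiding per-run overhead! parallelism: used mainly when throughput falls short of what the resources can handle; multiple processes or machines work in parallel so that large volumes can be processed. It is easy to confuse with batch processing, but parallel and sequential are completely opposite ways of processing!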
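The contrast above can be sketched in a few lines of Python. This is a minimal illustration and not code from the original post: the names `run_in_batches`, `run_in_parallel`, `square`, and `square_all` are hypothetical, and a thread pool stands in for the "multiple processes or machines" mentioned above (in CPython, CPU-bound work would normally use processes instead).

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Work applied to a single item."""
    return x * x


def square_all(batch):
    """Work applied to a whole batch in one call."""
    return [square(x) for x in batch]


def run_in_batches(items, batch_size, handler):
    """Batch processing: group items and call the handler once per group.

    Batches are handled sequentially, one after another.
    Returns the results and the number of handler calls made.
    """
    results = []
    calls = 0
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        results.extend(handler(batch))  # one call covers the whole batch
        calls += 1
    return results, calls


def run_in_parallel(items, worker, max_workers=4):
    """Parallelism: several workers handle different items at the same time.

    Executor.map returns results in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, items))


if __name__ == "__main__":
    data = list(range(10))
    batched, calls = run_in_batches(data, batch_size=5, handler=square_all)
    parallel = run_in_parallel(data, square)
    print(calls)                 # number of handler calls for the batched run
    print(batched == parallel)   # both strategies compute the same results
```

The two ideas are independent of each other: batching changes how often the work is triggered, while parallelism changes how many workers run at once, so a real system may well combine them.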
-